Waiting for 8.4 – Visibility maps

Yeah. This one patch alone is worth upgrading to 8.4.

On 3rd of December Heikki Linnakangas committed his patch. Commit message:

Introduce visibility map. The visibility map is a bitmap with one bit per
heap page, where a set bit indicates that all tuples on the page are
visible to all transactions, and the page therefore doesn't need
vacuuming. It is stored in a new relation fork.
 
Lazy vacuum uses the visibility map to skip pages that don't need
vacuuming. Vacuum is also responsible for setting the bits in the map.
In the future, this can hopefully be used to implement index-only-scans,
but we can't currently guarantee that the visibility map is always 100%
up-to-date.
 
In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
each heap page, also indicating that all tuples on the page are visible to
all transactions. It's important that this flag is kept up-to-date. It
is also used to skip visibility tests in sequential scans, which gives a
small performance gain on seqscans.

There was later one more patch which allowed autovacuum use it.

Now, 8.4 is already great. It has tons of new features. Some will say that CTE is the most important. For others it will be database-level locale.

But CTE will not be used by everyone. Same thing with database-level locales, and practically every other feature.

But vacuum – we all use vacuum. That doesn't mean other patches are not important – they all are. But this one will simply be welcomed by more users.

In short. What it does? To make things simple – it makes vacuum faster. Possibly – much faster.

Let's see.

First, I start pre-vilibility-maps-patched version of Pg, turn off autovacuum, and vacuum whole database (to avoid accidental run of autovacuum to prevent xid wraparound).

Now, let's create 4 test tables:

CREATE TABLE test_1 (i INT4);
CREATE TABLE test_2 (i INT4);
CREATE TABLE test_3 (i INT4);
CREATE TABLE test_4 (i INT4);

insert some rows to them:

INSERT INTO test_1 SELECT generate_series(1, 100000000);
INSERT INTO test_2 SELECT generate_series(1, 100000000);
INSERT INTO test_3 SELECT generate_series(1, 100000000);
INSERT INTO test_4 SELECT generate_series(1, 100000000);

(I know it's not much, but I'm doing the test on laptop, and I don't want to wait all night for results 🙂

Now, let's update the tables:

UPDATE test_2 SET i = i + 1 WHERE i < 10000000;
UPDATE test_3 SET i = i + 1 WHERE i < 50000000;
UPDATE test_4 SET i = i + 1 WHERE i < 90000000;

To make use of new functionality I have to vacuum several times – i.e. every vacuum uses information gathered on previous one. So I inserted:

VACUUM test_1;
VACUUM test_2;
VACUUM test_3;
VACUUM test_4;

After initial INSERT and after UPDATE. To show some more information, I added second UPDATE (exactly the same), and third vacuum block – after second update.

Results:

Visibility maps available Change
No Yes
Vacuum #1
post-insert
test_1 233.34 s 310.53 s +33.1 %
test_2 243.45 s 262.20 s +7.7 %
test_3 237.86 s 260.26 s +9.4 %
test_4 198.72 s 200.60 s +0.9 %
Vacuum #2
post-update #1
test_1 96.71 s 0.38 s -99.6 %
test_2 150.91 s 59.81 s -60.4 %
test_3 283.22 s 234.06 s -17.4 %
test_4 418.41 s 503.25 s +20.3 %
Vacuum #3
post-update #2
test_1 98.11 s 0.54 s -99.4 %
test_2 142.92 s 27.10 s -81.0 %
test_3 283.91 s 297.43 s +4.8 %
test_4 416.35 s 507.77 s +22.0 %

What does it mean? Generally – PostgreSQL with visibility maps is actually slower when vacuum has a lot of “new" rows – whether they have been inserted or updated. But if you have autovacuum configured correctly – it shouldn't happen. It should be much common to have vacuum see ~10% of rows changed (or even less), and in these situations (test_2) performance simply rocks.

Additionally – if you have (even large) tables that are basically constant – vacuuming them becomes practically instantaneous – really nice benefit.

I would love to see some performance improvements for it – for cases like first vacuum, but all together – it is great news!

Also – please note this sentence from commit log:

In the future, this can hopefully be used to implement index-only-scans, …

This is really promising. Other databases (including the other open source database engine) have covering indexes, but PostgreSQL still is missing it. Reasons are solid, but visibility maps bring us 1 step closer to situation when we also will get them 🙂

One thought on “Waiting for 8.4 – Visibility maps”

Comments are closed.