What does “Fix VACUUM’s tests to see whether it can update relfrozenxid” really mean?

In release notes to latest release you can find:

Fix VACUUM's tests to see whether it can update relfrozenxid (Andres Freund)
 
In some cases VACUUM (either manual or autovacuum) could incorrectly advance
a table's relfrozenxid value, allowing tuples to escape freezing, causing
those rows to become invisible once 2^31 transactions have elapsed. The
probability of data loss is fairly low since multiple incorrect advancements
would need to happen before actual loss occurs, but it's not zero. In 9.2.0
and later, the probability of loss is higher, and it's also possible to get
"could not access status of transaction" errors as a consequence of this
bug. Users upgrading from releases 9.0.4 or 8.4.8 or earlier are not
affected, but all later versions contain the bug.
 
The issue can be ameliorated by, after upgrading, vacuuming all tables in
all databases while having vacuum_freeze_table_age set to zero. This will
fix any latent corruption but will not be able to fix all pre-existing data
errors. However, an installation can be presumed safe after performing this
vacuuming if it has executed fewer than 2^31 update transactions in its
lifetime (check this with SELECT txid_current() < 2^31).

What does it really mean?

Before I can explain the bug, let me explain what xid is, and how does that matter.

Every (well, technically not every, but let's ignore it for now) transaction in PostgreSQL has it's own number. For example 123.

For reason that will be explained in a moment, these numbers start from 3.

And what about the numbers?

Every row, in every table, has (aside from other data) 2 xid values – xmin and xmax. Normally you can't see them:

$ SELECT * FROM test;
 a | b 
---+---
 1 | 3
 2 | 4
 3 | 5
 4 | 6
 5 | 7
(5 ROWS)

But if you know their names, you can force pg to show them:

$ SELECT xmin, xmax, * FROM test;
 xmin | xmax | a | b 
------+------+---+---
  877 |    0 | 1 | 3
  877 |    0 | 2 | 4
  877 |    0 | 3 | 5
  877 |    0 | 4 | 6
  877 |    0 | 5 | 7
(5 ROWS)

The values above mean that row was inserted in transaction 877 (so it will be visible to transactions 877, 878, 879 and all later). Xmax value of 0 means that the row has not been deleted.

If the row would be deleted, it's data in table would contain some non-zero value in xmax column. But, if you'd use transaction that was created before the row was deleted, you could still see it.

So – whether the row is visible (which basically means: whether it exists) is all depending on these two numbers.

Now. As you might noticed, when row is deleted (or updated, as update is, on file level, treated as delete + insert), it is actually left there, with some xmax value.

I hope that so far it's all clear.

Now, here comes VACUUM. When vacuum runs, and it sees deleted row (one that has xmax that is not 0), it (vacuum) checks if there are any transactions running that are older than the xmax (so they should still see the row). If there are, nothing is done as the row might still be needed.

But if there aren't any such transactions – the row is cleaned, and both xmin/xmax are set to NULL. Thanks to this the place that the row occupied can be reused for another row, inserted in the future.

It all is fun, and cool, until you'll consider one thing. XID is only 32 bit. It's unsigned, but it's 32bit. This means that we have ~ 4 billion possible xids – from 0 to 4,294,967,295. What would happen then?

Luckily Pg doesn't break when you hit 4 billion rows. It just wraps, and restarts xid numbering from lowest possible (3, as I mentioned earlier).

This means – OK. But if I had some row inserted in transaction “4" and not removed later. Xid wrapped, I'm now in transaction 3. What happens? If I'm just looking at the xids – I shouldn't see the row, as the insert is in the future?

Xid being in the future is pretty normal situation – it happens if you have long transaction running, and some other inserted row during your long one. Xid of inserted row will be “in the future" for your long running transaction.

Anyway. So, what can Pg do about it?

It's actually pretty simple. Every so often vacuum finds rows that are old (i.e. inserted long time ago, and the xid is older than all other transactions currently in the system). And then vacuum changes xid to special value – 2.

This means – this row is Frozen. Frozen means – it's older than anything else in the system, and (since it's xmax is 0) not deleted, so it's visible to all transactions. Even after 10000 wraps of xids – the row will still be visible.

To prevent the problem with “wrapped xid, and now inserted rows seem to be in the future" – all old rows in the system have to be frozen before half of xid range will pass from their creation.

To remind – full range of xid is 2^32, so half of it is 2^31, which is more or less 2 billion. This means that if you inserted row with xid = 100, you have 2 billion transactions to freeze it.

Freezing, as I said, is done by vacuum. It's one of the things vacuum does.

And as such, it should be relatively fast. Scanning entire table to find rows that can be frozen is not the best idea – as it would be slow. Luckily Pg has pretty cool way to speed things up.

Whenever it freezes rows, it remembers newest xid that was frozen – i.e. if it is sure that there are no more rows with xmin earlier than “x", it saves this “x" value in relfrozenxid in pg_class table.

So, the value of relfrozenxid is simple information: in the table there are no unfrozen rows from transactions earlier than relfrozenxid.

I hope you follow.

So, with everything running correctly, at any point in time, we can be sure that all rows in the table are either:

  • xmin = 2 (frozen)
  • xmin is between relfrozenxid and current xid

The “between" is a bit problematic since xid wraps, but assuming we have current xid = 100, relfrozenxid = 4,000,000,000, we know it's after wrap, and all rows should have xmin (and xmax in case of removed rows) >= 4,000,000,000 or <= 100. The bug that was found in pg broke this premise. It was possible for some rows to not be freezed, but relfrozenxid was still progressed.

And since, it did progress, no further vacuum would freeze this missed rows, since it wouldn't even know to look for them. After all – if you know that you should freeze rows with xid from 50 to 100, you don't look for rows with xid = 45.

What is the immediate problem? Well, none.

All visibility checks will still work fine, without problem. After all – 45 is still older than current transaction, so it will be visible.

But what happens when we'll go 2^31 (half of xid range) to the future? Suddenly the very old value becomes very distant in the future (generally at any given time, 2^31 xids before current are treated as past, and 2^31 xids ahead of current are treated as future).

This can mean couple of things. Let me show you, but to keep the example simple let's assume we have just 100 xids. 0, 1, 2 are still restricted, and after xid 100 we're going back to 3.

Now, let's assume we have some rows in a table with these values:

 xmin | xmax | row-DATA
 3    | 0    | ROW a
 3    | 20   | ROW b
 2    | 0    | ROW c
 2    | 15   | ROW d

When we are at transaction 30, Pg assumes that:

  • transactions with xid => 80 or < 30 are in the past
  • transactions with xid >31 and < 80 are in the future

So it will see “row a" and “row c" – since the other two have been deleted in xids 20 and 15.

Now, let's move forward to transaction 60. Now, Pg assumes that:

  • transactions with xid => 10 and < 60 are in the past
  • transactions with xid >61 or < 10 are in the future

And this means that we have situation that currently, we'll see only the “row c" (which is correctly frozen, so always visible).

“row a", which should be visible has xmin = 3, which is in the future. So it will become re-visible in 41 transactions.

Interesting thing is that if we'll continue forwards, and wrap, and go to transaction “7" – we'll see all four rows – since they will get deleted “in the future".

So, depending on what are the values of xmin/xmax, you might lose visibility of the normally visible row, or you might start to see rows that shouldn't be visible. That doesn't look good.

Fortunately, the bug is (afaik) not very problematic – the chances of being bitten are there, but just not all that good.

The thing is – how do you know if you're “bitten"? If you read, and understood, above, but do not fully grasp how pg works – you can say: “that's trivial. Let's just fetch all rows, and compare xmin/xmax with relfrozenxid and current xid, and you're done".

Well, it's not all that simple. The problem is that from SQL level you can't really select rows that are not visible to you. So while you might be lucky enough to find some rows that have xmin before relfrozenxid, if the row has been deleted, or the 2^31 transactions already passed – you will not see some of the “damaged" rows, because visibility rules will forbid you from seeing them.

Luckily there is PostgreSQL extension that gives you low-level access to table (and index) pages. It's called pageinspect. You can, using it, get information of all rows in any given page of relation. Like this:

$ SELECT * FROM heap_page_items(get_raw_page('test', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid  
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+--------
  1 |   8160 |        1 |     32 |      2 |      0 |        1 | (0,1)  |           2 |       2304 |     24 | [NULL] | [NULL]
  2 |   8128 |        1 |     32 |      2 |      0 |        1 | (0,2)  |           2 |       2304 |     24 | [NULL] | [NULL]
  3 |   8032 |        1 |     32 |    879 |      0 |        0 | (0,3)  |           2 |       2304 |     24 | [NULL] | [NULL]
  4 |   8096 |        1 |     32 |      2 |      0 |        1 | (0,4)  |           2 |       2304 |     24 | [NULL] | [NULL]
  5 |   8064 |        1 |     32 |      2 |    880 |        0 | (0,5)  |        8194 |        256 |     24 | [NULL] | [NULL]
(5 ROWS)

This shows, that there are 5 rows in the page. One of them was removed in transaction 880 (lp = 5). Four rows were frozen to xid = 2, and one of them was not frozen – either due to bug, or (in this case) simply because there is no point in freezing one row immediately after it got inserted. You might notice that the content of the row is nowhere to be found, but we can get to the rows using ctid. For example:

$ SELECT * FROM test WHERE ctid = '(0,3)';
 a | b 
---+---
 1 | 2
(1 ROW)

This will not show me row from ‘(0,5)':

$ SELECT * FROM test WHERE ctid = '(0,5)';
 a | b 
---+---
(0 ROWS)

And that's because xmin/xmax rules forbid me from seeing it.

Anyway – pageinspect lets me see all rows in every page of the table, so I can check xmin/xmax now.

Doing it, proved to be more complicated than I assumed. I wrote a function. And tested it. And it kept having errors. Finally, I think I have it done. Finally split the code to two functions:

(you can also download it here)

CREATE OR REPLACE FUNCTION xmin_xmax_status (
    IN   p_xmin        xid,
    IN   p_xmax        xid,
    IN   p_frozenxid   xid,
    IN   p_currentxid  xid,
    OUT  STATUS        INT4
) RETURNS INT4 AS $$
DECLARE
    v_xmin        int8  :=  p_xmin::text::int8;
    v_xmax        int8  :=  p_xmax::text::int8;
    v_frozenxid   int8  :=  p_frozenxid::text::int8;
    v_currentxid  int8  :=  p_currentxid::text::int8;
BEGIN
    -- status = 0 (binary: 00) - all ok
    -- status = 1 (binary: 01) - xmin bad
    -- status = 2 (binary: 10) - xmax bad
    -- status = 3 (binary: 11) - xmin and xmax bad
    STATUS := 0;
    IF v_xmin IS NULL AND v_xmax IS NULL THEN
        -- vacuumed, deleted, row ?
        RETURN;
    END IF;
    IF v_xmin IN (1,2,v_frozenxid,v_currentxid) THEN
        -- correct values
    ELSIF v_xmin = 0 THEN
        STATUS := STATUS | 1;
    ELSIF v_frozenxid <= v_currentxid THEN
        IF v_xmin BETWEEN v_frozenxid AND v_currentxid THEN
            -- correct value
        ELSE
            STATUS := STATUS | 1;
        END IF;
    ELSE
        -- xid wrapped between frozenxid AND currentxid
        IF v_xmin BETWEEN v_currentxid AND v_frozenxid THEN
            STATUS := STATUS | 1;
        ELSE
            -- correct value
        END IF;
    END IF;
    IF v_xmax IN (0,1,v_frozenxid,v_currentxid) THEN
        -- correct value
    ELSIF v_xmax = 2 THEN
        STATUS := STATUS | 2;
    ELSIF v_frozenxid <= v_currentxid THEN
        IF v_xmax BETWEEN v_frozenxid AND v_currentxid THEN
            -- correct value
        ELSE
            STATUS := STATUS | 2;
        END IF;
    ELSE
        -- xid wrapped between frozenxid AND currentxid
        IF v_xmax BETWEEN v_currentxid AND v_frozenxid THEN
            STATUS := STATUS | 2;
        ELSE
            -- correct value
        END IF;
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql;

That's the main logic function – it checks if given xmin/xmax values make sense when we consider (also given) relfrozenxid, and current xid number.

Then there is the looping function:

(you can also download it here)

CREATE OR REPLACE FUNCTION test_correct_relfrozenxid(
    IN   p_table        regclass,
    IN   p_results      TEXT   DEFAULT  NULL,
    OUT  TABLE_NAME     TEXT,
    OUT  scanned_pages  INT8,
    OUT  bad_pages      INT8,
    OUT  scanned_rows   INT8,
    OUT  bad_rows       INT8
) RETURNS record AS $$
DECLARE
    r_class        record;
    page_no        INT4;
    temprec        record;
    max_age        INT4;
    v_current_xid  xid;
BEGIN
 
    -- verify that results table exists, if we should use it
    IF p_results IS NOT NULL THEN
        BEGIN
            SELECT oid INTO temprec FROM pg_class WHERE oid = p_results::regclass;
        EXCEPTION WHEN undefined_table THEN
            EXECUTE 'CREATE TABLE ' || p_results || ' as
                SELECT ''pg_class''::regclass as table_name, 0::INT8 as page_no, ''0''::xid as relfrozenxid, ''0''::xid as current_xid, i.*, 0::INT4 as check_status, h.*
                FROM heap_page_items(get_raw_page(''pg_catalog.pg_class'', 0)) as i,
                page_header(get_raw_page(''pg_catalog.pg_class'', 0)) as h
                LIMIT 0';
        END;
    END IF;
 
    -- get base relation data
    SELECT n.nspname, c.relname, pg_relation_size(c.oid) / current_setting('block_size')::INT4 AS relpages, c.relfrozenxid INTO r_class
        FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
        WHERE c.relkind = 'r' AND c.oid = p_table;
    IF NOT FOUND THEN raise exception 'No such table: %', p_table; END IF;
 
    TABLE_NAME := format('%I.%I', r_class.nspname, r_class.relname);
 
    -- 5 million is just some cutoff value to accomodate rows that are currently being inserted/updated/deleted.
    -- this value was suggested to me BY Andres Freund, and since he originally found the bug, I tend to trust him
    v_current_xid := (txid_current() + 5000000)::TEXT::xid;
 
    scanned_pages := r_class.relpages;
    bad_pages := 0;
    scanned_rows := 0;
    bad_rows := 0;
 
    -- Iterate over all pages of relation...
    FOR page_no IN SELECT generate_series(0, r_class.relpages - 1) loop
 
        -- check how many *bad* rows are there in this page
 
        IF p_results IS NULL THEN
            -- we're not collecting bad rows
            SELECT COUNT(*) AS all_rows,
                SUM(
                    CASE WHEN xmin_xmax_status( t_xmin, t_xmax, r_class.relfrozenxid, v_current_xid ) > 0 THEN 1 ELSE 0 END
                ) AS bad_rows
                INTO temprec
                FROM heap_page_items(get_raw_page(TABLE_NAME, page_no));
        ELSE
            -- we are collecting bad rows, so we should get them, and some stats ...
            EXECUTE '
                with page as (
                    SELECT get_raw_page( $1, $2 ) as p
                ), all_rows_in_page as (
                    SELECT *, xmin_xmax_status( t_xmin, t_xmax, $3, $4 ) as check_status
                    FROM heap_page_items((SELECT p FROM page))
                ), insert_bad_rows as (
                    INSERT INTO ' || p_results || ' SELECT $1, $2, $3, $4, r.*, h.* FROM all_rows_in_page r, (SELECT page_header( p ) FROM page ) as h WHERE r.check_status > 0
                )
                SELECT
                    count(*) as all_rows,
                    sum( case when check_status > 0 THEN 1 ELSE 0 END) as bad_rows
                FROM all_rows_in_page
            ' INTO temprec USING TABLE_NAME, page_no, r_class.relfrozenxid, v_current_xid;
        END IF;
 
        -- update statistics based on data fetched from check query above
        scanned_rows := scanned_rows + temprec.all_rows;
 
        IF temprec.bad_rows > 0 THEN
            raise notice 'Found bad rows (%) in TABLE (%) page (%)', temprec.bad_rows, TABLE_NAME, page_no;
            bad_rows  := bad_rows + temprec.bad_rows;
            bad_pages := bad_pages + 1;
        END IF;
 
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

How do you use it?

For starters you can:

$ SELECT * FROM test_correct_relfrozenxid('pg_class' );
     TABLE_NAME      | scanned_pages | bad_pages | scanned_rows | bad_rows 
---------------------+---------------+-----------+--------------+----------
 pg_catalog.pg_class |           310 |         0 |        15674 |        0
(1 ROW)

This will show you some statistics. That it scanned 310 pages. On these pages there were 15674 rows (this includes also deleted rows!). Out of which there were 0 problems.

In other case you might get:

$ SELECT * FROM test_correct_relfrozenxid('xxxxxxxxxxxxxx' );
NOTICE:  Found bad ROWS (1) IN TABLE (public.xxxxxxxxxxxxxx) page (19)
NOTICE:  Found bad ROWS (1) IN TABLE (public.xxxxxxxxxxxxxx) page (42)
      TABLE_NAME       | scanned_pages | bad_pages | scanned_rows | bad_rows 
-----------------------+---------------+-----------+--------------+----------
 public.xxxxxxxxxxxxxx |            50 |         2 |         6986 |        2
(1 ROW)

In here, we see some problems.

Alternatively, I can supply second argument to the test_correct_relfrozenxid() function – it is name of a table to store bad row info:

$ \d xxx
Did NOT find any relation named "xxx".
 
$ SELECT * FROM test_correct_relfrozenxid('xxxxxxxxxxxxxx', 'xxx' );
NOTICE:  Found bad ROWS (1) IN TABLE (public.xxxxxxxxxxxxxx) page (19)
NOTICE:  Found bad ROWS (1) IN TABLE (public.xxxxxxxxxxxxxx) page (42)
      TABLE_NAME       | scanned_pages | bad_pages | scanned_rows | bad_rows 
-----------------------+---------------+-----------+--------------+----------
 public.xxxxxxxxxxxxxx |            50 |         2 |         6986 |        2
(1 ROW)
 
$ SELECT * FROM xxx;
-[ RECORD 1 ]+-------------------------------------------------------------------------
TABLE_NAME   | xxxxxxxxxxxxxx
page_no      | 19
relfrozenxid | 178398715
current_xid  | 233400263
lp           | 139
lp_off       | 3968
lp_flags     | 1
lp_len       | 77
t_xmin       | 2
t_xmax       | 10853873
t_field3     | 2
t_ctid       | (19,139)
t_infomask2  | 11
t_infomask   | 6529
t_hoff       | 32
t_bits       | 110011110010000000000000000000000000000000000000000000000000000000000000
t_oid        | [NULL]
check_status | 2
lsn          | (2CBE/EF092CA8,4,5,664,760,8192,8192,4,0)
tli          | [NULL]
flags        | [NULL]
LOWER        | [NULL]
UPPER        | [NULL]
special      | [NULL]
pagesize     | [NULL]
version      | [NULL]
prune_xid    | [NULL]
-[ RECORD 2 ]+-------------------------------------------------------------------------
TABLE_NAME   | xxxxxxxxxxxxxx
page_no      | 42
relfrozenxid | 178398715
current_xid  | 233400263
lp           | 33
lp_off       | 5504
lp_flags     | 1
lp_len       | 77
t_xmin       | 2
t_xmax       | 9255167
t_field3     | 2
t_ctid       | (42,33)
t_infomask2  | 11
t_infomask   | 6529
t_hoff       | 32
t_bits       | 110011110010000000000000000000000000000000000000000000000000000000000000
t_oid        | [NULL]
check_status | 2
lsn          | (2CBE/EF0983B0,4,1,412,3664,8192,8192,4,0)
tli          | [NULL]
flags        | [NULL]
LOWER        | [NULL]
UPPER        | [NULL]
special      | [NULL]
pagesize     | [NULL]
version      | [NULL]
prune_xid    | [NULL]

The “xxx' table can be reused – if you'll run another test_correct_relfrozenxid(), it will simply add new rows to the “xxx" table. And if that wouldn't be obvious – it doesn't have to be called “xxx" – I just picked it as simple to type.

Anyway. Data in xxx table show name of the table, which page of the table contains the bad row. What was the relfrozenxid of the table when the check was done, what was current xid at the the of the check (well, current xid + 5 million, but that's not all that important). Then there are row_data (up to t_oid column), and afterwards there are some page header information.

In our case above, we can see that the two rows were deleted at some point in time (xmax = 10,853,873 and 9,255,167), and given that we're now at xid 233,400,263 – we still have quite a long way ahead before the problem will hit us (which would happen at xid 2,156,738,815), because then the xmax would “switch" to being in future, and the row would again be visible, despite the fact that it should be removed.

And here lies another issue. What if the transactions that removed these rows were actually *rolled back*? This would mean that now we can see it (it's testable luckily), but in some time (2^31 transactions more or less) status of the transaction will get reset, and then after some time – it might become deleted.

Now. About the fix. With newer Pgs (the ones released recently), the bug with moving relfrozenxid too soon, is no more.

But the fact that relfrozenxid will not advance incorrectly anymore doesn't change the fact that you might already have bad data.

Release notes say that you should, after upgrade:

$ SET vacuum_freeze_table_age = zero;
$ vacuum;

This would force vacuum to check each end every row in the table, and freeze them correctly.

But. If you already passed 2^31 transactions since the problem happened – you might have seen rows that shouldn't be there. Or you could have “lost" some rows that should be there. This would mean that you could have violations of foreign keys, unique constraints or primary keys.

What to do with it?

My opinion is:

Get all the data about problems Before upgrade. Check, row by row (usually there is really not much of these rows), if they should be deleted, or not. Maybe someone will write function/extension to modify xmin/xmax values directly in pg pages, and then you could fix it manually. If not – you can try to use pg_dirtyread to get data from affected rows. Or just note ids, and after post-ugprade vacuum, delete the ones that should be deleted.

Finally, after upgrade + vacuum use this script:

(you can also download it here)

#!/usr/bin/env bash
 
IF ! psql -qAtX -c "select 1" &>/dev/NULL
THEN
    echo "Cannot connect to PostgreSQL database using current PG* settings:" >&2
    SET | grep ^PG >&2
    exit 1
fi
 
# Makes sorting much faster
export LC_ALL=C
 
timestamping_awk='{print strftime("%Y-%m-%d %H:%M:%S :"), $0; fflush();}'
EXEC > >( awk "$timestamping_awk" ) 2>&1
logging_process_pid="$!"
 
current_db="$( psql -qAtX -c "SELECT current_database()" )"
CURRENT_USER="$( psql -qAtX -c "SELECT CURRENT_USER" )"
 
tmp_dir="$( mktemp -d )"
trap 'rm -rf "$tmp_dir"; kill $logging_process_pid' EXIT
 
df_line="$( df -hP "$tmp_dir" | tail -n 1 )"
df_available="$( echo "$df_line" | awk '{print $4}' )"
df_mount="$( echo "$df_line" | awk '{print $NF}' )"
 
echo "You are about to check database [$current_db], using account [$current_user]"
echo "Temporary files will be stored on $df_mount filesystem (in $tmp_dir directory). There is $df_available space available."
echo
echo -n "Do you want to continue? (type: \"yes\" to continue): "
READ answer
IF [[ ! "$answer" == "yes" ]]
THEN
    echo "Exiting."
    exit
fi
 
all_problems=""
 
echo "Testing Unique keys"
psql -qAtX -c "COPY (select c.oid, pg_get_indexdef( c.oid ), pg_size_pretty(pg_relation_size(c.oid))  from pg_namespace n join pg_class c on c.relnamespace = n.oid join pg_index i on c.oid = i.indexrelid where c.relkind = 'i' and i.indisunique order by pg_relation_size(c.oid) desc) TO STDOUT" > "$tmp_dir/indexes.lst"
index_count="$( wc -l "$tmp_dir/indexes.lst" | awk '{print $1}' )"
 
seq_scan_preamble="
set enable_bitmapscan = false;
set enable_indexonlyscan = false;
set enable_indexscan = false;
set enable_seqscan = true;
"
 
FOR i IN $( seq 1 $index_count )
do
    idx_line="$( sed -ne "${i}p" "$tmp_dir/indexes.lst" )"
    idx_oid="$( echo "$idx_line" | cut -f1 )"
    idx_def="$( echo "$idx_line" | cut -f2 )"
    idx_size="$( echo "$idx_line" | cut -f3 )"
    echo "- Index $i/$index_count:"
    echo "  - def   : $idx_def"
    echo "  - size  : $idx_size"
    query="$( echo "$idx_def" | perl -ne '
    if ( /^.* ON (.*) USING [^ ]* \((.*)\) WHERE \((.*)\)\s*$/ ) {
        print "SELECT $2 FROM $1 WHERE $3\n";
    } elsif ( /^.* ON (.*) USING [^ ]* \((.*)\)\s*$/ ) {
        print "SELECT $2 FROM $1\n";
    }' )"
    IF [[ -z "$query" ]]
    THEN
        echo "Cannot build query for this index?! Something is wrong." >&2
        continue
    fi
    echo "  - query : $query"
    echo "$seq_scan_preamble COPY ($query) TO STDOUT;" | \
        psql -qAtX | \
        perl -ne 'print unless /(^|\t)\\N($|\t)/' | \
        sort -S1G | \
        uniq -dc > "$tmp_dir/duplicates"
 
    IF [[ -s "$tmp_dir/duplicates" ]]
    THEN
        echo "There are duplicates here:"
        cat "$tmp_dir/duplicates"
        all_problems="$all_problems- Index: $idx_def
"
    fi
    rm "$tmp_dir/duplicates"
done
 
echo "Testing Foreign keys"
echo "
COPY (
with con as (
    SELECT
        c.conname,
        c.conrelid::regclass as con_rel,
        c.conkey,
        c.confrelid::regclass as conf_rel,
        c.confkey,
        generate_subscripts(c.conkey, 1) as i
    FROM
        pg_constraint c
    WHERE
        c.contype = 'f'
)
SELECT
    c.con_rel,
    string_agg( quote_ident(a.attname), ', ' ORDER BY c.i ) as con_col,
    c.conf_rel,
    string_agg( quote_ident(fa.attname), ', ' ORDER BY c.i ) as conf_col
FROM
    con as c
    join pg_attribute a on c.con_rel = a.attrelid AND a.attnum = c.conkey[c.i]
    join pg_attribute fa on c.conf_rel = fa.attrelid AND fa.attnum = c.confkey[c.i]
WHERE
    pg_relation_size(c.con_rel) > 0
    and pg_relation_size(c.conf_rel) > 0
group BY
    c.conname,
    c.con_rel,
    c.conf_rel
ORDER BY pg_relation_size(c.conf_rel) + pg_relation_size(con_rel) desc
) TO STDOUT
" > "$tmp_dir/fkey-get-query"
 
psql -qAtX -f "$tmp_dir/fkey-get-query" | awk '!c[$0]++' > $tmp_dir/fkey-get-list
all_fkeys="$( cat $tmp_dir/fkey-get-list | wc -l )"
i=0
while IFS=$'\t' READ r_table r_columns p_table p_columns
do
    rm -f "$tmp_dir/table_r.gz"
    rm -f "$tmp_dir/table_p.gz"
 
    i=$(( i + 1 ))
    echo "Fkey #$i / $all_fkeys):"
    echo "- $r_table ($r_columns) -=> $p_table ($p_columns)"
 
    echo "BEGIN;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION READ ONLY;
    \o | sort -S1G -u | pigz -c - > $tmp_dir/table_r.gz
    COPY ( SELECT $r_columns FROM $r_table ) TO STDOUT;
    \o | sort -S1G -u | pigz -c - > $tmp_dir/table_p.gz
    COPY ( SELECT $p_columns FROM $p_table ) TO STDOUT;
    \o
    ROLLBACK;" | psql -qAtX
    bad_lines="$( comm -13 <( pigz -dc $tmp_dir/table_p.gz | perl -ne 'print unless /(^|\t)\\N($|\t)/' ) <( pigz -dc $tmp_dir/table_r.gz | perl -ne 'print unless /(^|\t)\\N($|\t)/' ) | wc -l )"
    IF (( $bad_lines == 0 ))
    THEN
        continue
    fi
    echo "Bad values in $r_table ($r_columns) - not existing in $p_table ($p_columns) : $bad_lines different values. Sample:"
    comm -13 <( pigz -dc $tmp_dir/table_p.gz ) <( pigz -dc $tmp_dir/table_r.gz ) | head -n 5 | sed 's/^/- /'
    all_problems="$all_problems- Fkey: $r_table ($r_columns) -=> $p_table ($p_columns)
"
    echo
done < <( cat $tmp_dir/fkey-get-list )
 
IF [[ -z "$all_problems" ]]
THEN
    echo "All OK."
ELSE
    echo "Problems found:"
    echo "$all_problems"
fi

To test if your database has unique/fkey violations.

That was quite a long one. All in all – I think that bugs happen, so that's kinda ok. But what bugs me is that the topic, which in itself is pretty complicated and confusing, was presented “as is" in the release docs.

I would be much happier if our beloved devs (no sarcasm in here, I am deeply grateful for all their work!) would provide such function (like the one I wrote), with “Pg Devs Blessing" as part of distribution. Perhaps even a tool to check for fkey/unique violations? I know that these shouldn't happen. But they do. Usually because of hardware issues, sure, but it would be really good to have a blessed way to test for these cases.

Anyway – That's about it. If you have any comments/questions – “comment below, let me know" 🙂

26 thoughts on “What does “Fix VACUUM’s tests to see whether it can update relfrozenxid” really mean?”

  1. You wrote, that xmax greater than 0 means that the row is deleted. Now I have noticed strange behavior in my database. No transaction is progress. I do simple select on one of my tables:

    SELECT xmin,xmax,cl_id FROM core.clients ORDER by cl_id

    and got:

        xmin    |    xmax    | cl_id 
    ------------+------------+-------
              2 |          0 |     1
     1395466567 | 1395466567 |     2
              2 | 1394063040 |     3
     1359574078 | 1359574078 |     4
    ...
    

    In this sample every row beside that with cl_id is deleted? I get it right?

    But how it is possible when these rows are accessable by transactions with xid > xmax? Am I affected by this bug?

  2. My only question here is whether the suggestion of:

    SELECT txid_current() < (2 ^ 31)

    has any kind of wrap-around behavior to be aware of.

  3. I ate “… beside cl_id = 1 is deleted”.

    This is postgresql 9.3.1 on freebsd and we are running for long time with high insert/update/delete rate.

    I’ve just inserted row into this table and got:

    1395417586 | 1395417587 | 336

    So it looks like it was marked already deleted.

  4. @ David Johnston

    The theory is that if your database shows current_txid() < 2^31, you don't have to worry about any wraparound / xid issues caused by the bug, so it should be safe to upgrade and run the vacuum with vacuum_freeze_table_age set to 0 and be ok with regards to this bug.

  5. @quaker:
    the transactions could have been rolled back. So now you can see this. This is not a problem. Some vacuum in the future will change xmax to 0

  6. Would running the test_correct_relfrozenxid on a promoted slave show the same results as running it on the master?

    We have >2TB of data, and it would be hard to run this on a master.

    Thanks!

  7. @Omar:
    it’s a good start. Please rmember that running the check on master, aside from some reads, does not cause problems. there are no important locks, and it generally just works.

  8. @Depesz: I understand, but the I/O load + displacing all the hot data from FS cache would probably impact it too much. 🙂

  9. FS cache is probably smart enough to ignore simple sequential read (which this more or less does). Anyway – up to you. You can check on promoted slave, but I wouldn’t trust the results in 100%.

  10. Hi,

    I run your utility on my production DB and even after upgrading to 9.1.11 and manually vacuuming as instructed in the release note, your utility still reports some bad row from time to time. Even more intriguing is that these bad rows are created in the past few days. For example, I ran your utility this morning and it reported several bad rows with xmax=3000. So this is possible if there is rollback, however, I dont understand why the xmax is 3000, as the current transaction ID is like 1883161595:

    MY_DB=# select txid_current();
    txid_current
    ————–
    1885152935
    (1 row)

    MY_DB=# select xmin,xmax,id,file_type from file_sent where id=225020;
    xmin | xmax | id | file_type
    ————+——+——–+———–
    1883161595 | 0 | 225020 | pdf
    (1 row)

    MY_DB=#

    Any idea?

    Thanks a lot,

    Pius

  11. Sorry, post the wrong result, it should be like this:

    MY_DB=# select xmin,xmax,id,file_type from scheduled_report_delivery where id=225020;
    xmin | xmax | id | file_type
    ————+——+——–+———–
    1883161595 | 3000 | 225020 | pdf
    (1 row)

  12. @Pius:
    assuming your txid_current is 1885152935, the 3000 in xmin does look strange.

    What does the output from test_correct_relfrozenxid(…) for this table look like?

  13. Hi,

    Actually it is the xmax which has the value 3000, I found these rows after running your utility test_correct_relfrozenxid. According to my knowledge, this xmax is only populated on ROLLBACK, so I don’t understand how a value of 3000 was put in this row and I am sure this row was inserted few days ago.

    Thanks for your help.

    Pius

  14. @Pius:
    1. you didn’t show me the output I asked
    2. xmax is not related to rollback. xmax is set when given record has been deleted (or updated).

  15. Hi Depesz,

    Here is the output of test_correct_relfrozenxid:

    MY_DB=# select * from test_correct_relfrozenxid(‘public.scenario’,’bad_result’);
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (161)
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (163)
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (165)
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (233)
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (244)
    NOTICE: Found bad rows (3) in TABLE (public.scenario) page (255)
    NOTICE: Found bad rows (1) in TABLE (public.scenario) page (283)
    table_name | scanned_pages | bad_pages | scanned_rows | bad_rows
    —————–+—————+———–+————–+———-
    public.scenario | 296 | 7 | 30526 | 9
    (1 row)

    and here is one row in the bad_result table:

    MY_DB=# select * from bad_result limit 1;
    -[ RECORD 1 ]+————————————————————————-
    table_name | scenario
    page_no | 161
    relfrozenxid | 1801398747
    current_xid | 1927675335
    lp | 68
    lp_off | 3552
    lp_flags | 1
    lp_len | 72
    t_xmin | 2
    t_xmax | 3387
    t_field3 | 0
    t_ctid | (161,68)
    t_infomask2 | 6
    t_infomask | 4491
    t_hoff | 32
    t_bits | 110011000000000000000000000000000000000011100110110110010110111101000000
    t_oid | 49716071
    check_status | 2
    lsn | (688/74EE0FA8,2,1,440,496,8192,8192,4,0)
    tli |
    flags |
    lower |
    upper |
    special |
    pagesize |
    version |
    prune_xid |

    Pius

  16. Hi depesz,

    This is response from Alvaro Herrera:

    ================================================
    It’s normal to have low numbers used in Xmax. This just means that they used a multixact instead of a plain transaction ID. You can check that by using the pageinspect extension and seeing the flag set in the t_infomask field.

    I’m not saying there is no bug here, only that you have provided no evidence that there is.
    =================================================

    Pius

  17. @Pius:
    You’re OK.
    The t_infomask value of 4491 contains bit for multixact (0x1000).

    So the xmax is multixact, and you’re OK.

  18. Hello,
    MY current database status after runnig
    SELECT datname,datfrozenxid, age(datfrozenxid) FROM pg_database

    ;datfrozenxid=3619646522 & and age(datfrozenxid)=1173874332.

    vacuum_freeze_min_age = 50000000
    vacuum_freeze_table_age = 150000000
    autovacuum_freeze_max_age = 200000000

    i have started vacuum freeze tables manually .
    but my datfrozenxid is continuously increasing..
    should i have to worry for it ?
    please suggest.

  19. @Hiren:
    no, you shouldn’t worry. age has to increase, and it’s perfectly OK. you don’t have to sun vac freeze manually either, autovacuum will handle it.

  20. Hello depesz
    Thanks a lot..
    But i don’t get why today my transaction
    datfrozenxid=4319646721

    Is there any way to decrease datfrozenxid because it goes beyond 400 cr.
    Will it affect database status in future ?
    as per my understanding datfrozenxid should be near to age(datfrozenxid) am i right ?
    Pelase advise .

  21. Hello depesz,
    Also i want to share few basic detail of my database.
    it is around 3 TB, autovacuum processes=3 and it is not getting time to auto vacuum all tables

  22. my db now at 2107837550:
    SELECT txid_current();

    pg_controldata now at:
    Latest checkpoint’s NextXID: 0/2107840216

    What is the maximum NextXID? Is it 2^31 or 2^32?

    Thanks in advance!

  23. Thanks! Wooh! You don’t know how much that helped. We’re just burning these XIDs too fast. This db instance been around for just 18 months. I’m not worried at all about the transaction wraparound since I have frequent manual and auto vacuum running most of the time. I can keep up with that. I’m more afraid of hitting the wraparound once I hit the maximum txid_current().

    docs [1] says:
    The internal transaction ID type (xid) is 32 bits wide and wraps around every 4 billion transactions.

    [1] https://www.postgresql.org/docs/current/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT

    Am I understanding XIDs correctly? Should I reload the entire db again soon?

    This is current output of my pg_controldata (pg 9.4):
    Latest checkpoint’s NextXID: 0/2111209443
    Latest checkpoint’s NextOID: 537213126
    Latest checkpoint’s NextMultiXactId: 729161
    Latest checkpoint’s NextMultiOffset: 1532079
    Latest checkpoint’s oldestXID: 1917540210
    Latest checkpoint’s oldestXID’s DB: 16384
    Latest checkpoint’s oldestActiveXID: 0
    Latest checkpoint’s oldestMultiXid: 1
    Latest checkpoint’s oldestMulti’s DB: 16384

    DB was not reloaded using pg_upgrade.

Comments are closed.