Waiting for 9.5 – Add pg_rewind, for re-synchronizing a master server after failback.

On 23rd of March, Heikki Linnakangas committed patch:

Add pg_rewind, for re-synchronizing a master server after failback.
 
Earlier versions of this tool were available (and still are) on github.
 
Thanks to Michael Paquier, Alvaro Herrera, Peter Eisentraut, Amit Kapila,
and Satoshi Nagayasu for review.

So, we have a situation, where we have master and slave, some kind of replication, and we have to failover. This is trivial – just promote slave to standalone, and we're good. But we don't have slave then. Normal procedure is to setup slave on old master from scratch, using full data dir sync. But with non-trivial datasizes, it can take long time.

pg_rewind should solve the problem by just working on xlogs.

Let's see if it will really work that way. For added difficulty I will try to make it so that old master still did *some* work after the failover – this work should be “rolled back" after it will be set to slave from newly elected master.

To test it we will need two PostgreSQL's, but first I'll start with just master:

=$ initdb --data-checksums -D /var/tmp/master
=$ perl -pi -e 's/^#port\s*=.*/port = 5435/' /var/tmp/master/postgresql.conf
=$ pg_ctl -D /var/tmp/master/ -l /var/tmp/master.log start

With this in place, I have master running, on port 5435. It's very under-configured, but I'm mostly fine with it, just need some small changes so:

=$ psql -p 5435 -d postgres
$ ALTER system SET archive_command = 'pigz -c %p > /var/tmp/archive/%f.gz';
$ ALTER system SET archive_mode = ON;
$ ALTER system SET wal_level = hot_standby;
$ ALTER system SET max_wal_senders = 10;
$ ALTER system SET min_wal_size = '32MB';
$ ALTER system SET max_wal_size = '32MB';
$ ALTER system SET hot_standby = ON;
$ ALTER system SET wal_log_hints = ON;
$ CREATE USER replication WITH replication;

I set the wal size so small, so that it will be simple to show how the situation works when all necessary wal files are no longer in pg_xlog/ directory.

With this in place, I just need to allow replication connections:

=$ echo "local replication replication trust" >> /var/tmp/master/pg_hba.conf

And of course, restart Pg to make all the setting come live:

=$ pg_ctl -D /var/tmp/master -m fast restart

Now, I need some test data to work on, and show that rewinding history works. So, I'll make a simple table:

$ CREATE TABLE test AS SELECT i, repeat('depesz'::text, 100) AS payload, now() AS last_updated FROM generate_series(1, 1000000) i;
$ CREATE INDEX what_updated ON test (last_updated);
$ ALTER TABLE test ADD PRIMARY KEY (i);

This generates this, very simple, table:

$ \d test
                 TABLE "public.test"
    COLUMN    |           TYPE           | Modifiers
--------------+--------------------------+-----------
 i            | INTEGER                  | NOT NULL
 payload      | text                     |
 last_updated | TIMESTAMP WITH TIME zone |
Indexes:
    "test_pkey" PRIMARY KEY, btree (i)
    "what_updated" btree (last_updated)

Where id is integer 1..1000000, payload is just long string to make the table and updates heavier, and last_updated is timestamptz that we will update.

To update the fields I will use:

$ UPDATE test
    SET last_updated = clock_timestamp(), payload = repeat(clock_timestamp()::text, 100)
    WHERE i = ( SELECT ( (random() * 100000 )::int8 * 2 + current_setting('port')::int8 ) % 1000000 + 1 );
UPDATE 1

The where clause is complicated, but I wanted to update semi-random row, but make it so that the same script ran on slave will update different rows. I will put slave on port 5436, so master will always update even rows (with i being divisible by 2), and slave will be updating odd rows.

I repeat such update 10 times in a transaction, and as soon as one transaction is done, I start another.

After a minute or so, I can see:

$ SELECT i, last_updated FROM test ORDER BY last_updated DESC LIMIT 10;
   i    |         last_updated
--------+-------------------------------
  45232 | 2015-04-04 13:43:50.272215+02
  57178 | 2015-04-04 13:43:50.272022+02
 136892 | 2015-04-04 13:43:50.271823+02
 116754 | 2015-04-04 13:43:50.271641+02
 177292 | 2015-04-04 13:43:50.271449+02
 131486 | 2015-04-04 13:43:50.271247+02
 140360 | 2015-04-04 13:43:50.270976+02
 105724 | 2015-04-04 13:43:50.270737+02
 130434 | 2015-04-04 13:43:50.270493+02
 127542 | 2015-04-04 13:43:50.270251+02
(10 ROWS)

Now, with this in place, I can setup slave (the updater process still goes on):

=$ pg_basebackup -D /var/tmp/slave -R -c fast -p 5435 -U replication

To have slave working, I will need to change port, and make sure it knows how to recover xlogs from archive, so:

=$ perl -pi -e 's/port = 5435/port = 5436/' /var/tmp/slave/postgresql.conf
=$ echo "restore_command = 'gzip -dc /var/tmp/archive/%f.gz > %p'" >> /var/tmp/slave/recovery.conf

With this done, I start slave:

=$ pg_ctl -D /var/tmp/slave/ -l /var/tmp/slave.log start

and wait for slave to catchup with replication. For this I used this query:

WITH x AS (SELECT pg_current_xlog_location() AS cx)
SELECT
    now()::timestamptz(0),
    pg_xlog_location_diff( cx, sent_location ) AS sent_lag,
    pg_xlog_location_diff( cx, write_location ) AS write_lag,
    pg_xlog_location_diff( cx, flush_location ) AS flush_lag,
    pg_xlog_location_diff( cx, replay_location ) AS replay_lag
FROM
    pg_stat_replication,
    x;

Repeating it every 2 seconds, I got:

=$ while TRUE; do psql -p 5435 -d postgres -f show_lag.sql ; sleep 2; done | awk 'NR==1 || /[0-9]$/'
          now           | sent_lag | write_lag | flush_lag | replay_lag
 2015-04-04 13:45:52+02 |        0 |    188032 |    188032 |     219808
 2015-04-04 13:45:54+02 |        0 |         0 |     82840 |      82840
 2015-04-04 13:45:56+02 |        0 |         0 |    123192 |     123192
 2015-04-04 13:45:58+02 |        0 |         0 |     86664 |      86664
 2015-04-04 13:46:00+02 |        0 |         0 |     96008 |      96008
 2015-04-04 13:46:02+02 |        0 |         0 |         0 |      65976
 2015-04-04 13:46:04+02 |        0 |         0 |         0 |      79056
 2015-04-04 13:46:06+02 |   101672 |    101672 |    101712 |     223456
 2015-04-04 13:46:08+02 |        0 |         0 |     76320 |      76320

I see that replication is working steadily (the sent/write lags happen every so often, but are immediately cleared). So we're good.

Now. for the actual pg_rewind test. I will promote slave to standalone/master:

=$ pg_ctl -D /var/tmp/slave/ promote

After it got promoted (the update script on old master is still running), we are starting to see differences:

=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 162358 | 2015-04-04 13:46:39.827506+02
  14854 | 2015-04-04 13:46:39.827326+02
  81126 | 2015-04-04 13:46:39.827058+02
(3 ROWS)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 ROWS)

Old master is still updating, but new master stopped updated when the promotion happened. So now, I'll stop the updating script, and rerun it on new master…

Now, we're getting newer values from 5436, but also please note that newest even rows are older than on old master:

=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 106572 | 2015-04-04 13:47:21.519875+02
  16160 | 2015-04-04 13:47:21.519372+02
  13140 | 2015-04-04 13:47:21.51886+02
(3 ROWS)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 18331 | 2015-04-04 13:48:02.355589+02
  8521 | 2015-04-04 13:48:02.355299+02
 18015 | 2015-04-04 13:48:02.354986+02
(3 ROWS)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 ROWS)

And now it's time to bring old master to become slave.

First, I need to stop old master:

=$ pg_ctl -D /var/tmp/master/ -m fast stop

Now I run pg_rewind:

=$ pg_rewind -D /var/tmp/master --source-server="port=5436 user=pgdba dbname=postgres"
The servers diverged at WAL POSITION 0/9BC9E268 ON timeline 1.
could NOT OPEN file "/var/tmp/master/pg_xlog/00000001000000000000009B": No such file OR directory
 
could NOT find previous WAL record at 0/9BC9E268
Failure, exiting

This is OK. I purposely made master keep as little xlogs as possible, so now I need to copy some back. Quick inspection shows:

=$ ls -l /var/tmp/master/pg_xlog/
total 32780
-rw------- 1 pgdba pgdba      307 Apr  4 13:44 000000010000000000000077.000090D8.backup
-rw------- 1 pgdba pgdba 16777216 Apr  4 13:51 0000000100000000000000B5
-rw------- 1 pgdba pgdba 16777216 Apr  4 13:51 0000000100000000000000B6
-rw------- 1 pgdba pgdba       42 Apr  4 13:49 00000002.history
drwx------ 2 pgdba pgdba     4096 Apr  4 13:51 archive_status/

So I'll just copy all xlogs from 9B to B5:

=$ cp /var/tmp/archive/00000001000000000000009[BCDEF].gz /var/tmp/master/pg_xlog/
=$ cp /var/tmp/archive/0000000100000000000000A?.gz /var/tmp/master/pg_xlog/
=$ cp /var/tmp/archive/0000000100000000000000B[01234].gz /var/tmp/master/pg_xlog/
=$ gzip -d /var/tmp/master/pg_xlog/*.gz

With this in place, I rerun pg_rewind:

=$ pg_rewind -D /var/tmp/master --source-server="port=5436 user=pgdba dbname=postgres"
The servers diverged at WAL position 0/9BC9E268 on timeline 1.
Rewinding from last common checkpoint at 0/9B8B20D0 on timeline 1
Done!

This took ~ 1 second. Now – after running pg_rewind, state of /var/tmp/master is like if I had run pg_basebackup from “slave" (current master), including bad port (5436), and no sensible recovery.conf. Let's fix that:

=$ perl -pi -e 's/5436/5435/' /var/tmp/master/postgresql.conf
=$ cat /var/tmp/master/recovery.done  | sed 's/5435/5436/' > /var/tmp/master/recovery.conf

After this was done, I should be able to start new slave:

=$ pg_ctl -D /var/tmp/master -l /var/tmp/master.log START

After couple of seconds I can see that old master, current slave, is replicating – first from wal archive:

=$ ps uxfww | grep -E '( [-]D |[0]00000)'
pgdba      603  0.0  0.0 180328 13416 pts/4    S    13:59   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/master
pgdba      609 15.3  0.8 180416 135984 ?       Ss   13:59   0:01  \_ postgres: startup process   waiting for 0000000200000000000000F9
pgdba      900  0.0  0.0   4440   656 ?        S    13:59   0:00  |   \_ sh -c gzip -dc /var/tmp/archive/0000000200000000000000F9.gz > pg_xlog/RECOVERYXLOG
pgdba      901  7.0  0.0   4728   616 ?        D    13:59   0:00  |       \_ gzip -dc /var/tmp/archive/0000000200000000000000F9.gz
pgdba    24111  0.0  0.0 180324 13468 pts/4    S    13:44   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/slave
pgdba    25106  0.0  0.0  35300   972 ?        Ss   13:46   0:00  \_ postgres: archiver process   last was 0000000200000001000000CB

Couple more minutes, and streaming kicks-in:

=$ ps uxfww | grep -E '( [-]D |[0]00000|[s]tream)'
pgdba      603  0.0  0.0 180328 13416 pts/4    S    13:59   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/master
pgdba      609  4.2  0.8 180416 136396 ?       Ss   13:59   0:08  \_ postgres: startup process   recovering 0000000200000001000000E9
pgdba     2000  1.2  0.0 191396  2208 ?        Ss   14:00   0:01  \_ postgres: wal receiver process   streaming 1/E9EFAF08
pgdba    24111  0.0  0.0 180324 13460 pts/4    S    13:44   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/slave
pgdba    25106  0.0  0.0  35300   972 ?        Ss   13:46   0:00  \_ postgres: archiver process   last was 0000000200000001000000E8
pgdba     2001  0.7  0.0 180984  2168 ?        Ss   14:00   0:00  \_ postgres: wal sender process replication [local] streaming 1/E9EFAF08

Of course, on both server, the “newest" rows will be odd – updated on slave:

=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
  42785 | 2015-04-04 14:09:14.065943+02
 190257 | 2015-04-04 14:09:14.065771+02
 145031 | 2015-04-04 14:09:14.065536+02
(3 ROWS)
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 192215 | 2015-04-04 14:09:14.520619+02
  64991 | 2015-04-04 14:09:14.52043+02
  69937 | 2015-04-04 14:09:14.520249+02
(3 ROWS)

but the interesting thing is – what will be the newest even rows now?

Before pg_rewind old master has newer values than new master. And now:

=$ psql -p 5435 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 ROWS)
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 ROWS)

Same thing. So, true to its name, pg_rewind removed changes from old master that didn't make it to new master. And that's just great. Thanks to all involved, pg_rewind will clearly be very useful.