March 11th, 2010 by depesz | Tags: , , , , , , | 24 comments »
Did it help? If yes - maybe you can help me?

There are several approaches on replication/failover – you might have heard of Slony, Londiste, pgPool and some other tools.

WAL Replication is different from all of them in one aspect – it doesn't let you query slave database (until 9.0, in which you actually can run read only queries on slave.

Since you can't run queries on slave, what is it good for? Well. It's good, and great in 1 very important aspect – all things that happen in database are replicated. Schema changes. Sequence modifications. Everything.

There is also drawback – you can't (as of now) replicate just one database. You replicate whole cluster (I don't like this word in this context – let's say: whole installation) of PostgreSQL. All databases that reside in given DATA directory.

So, the question is – how to set it up?

I will provide some very short howto on how to set it, but please read also these great pages in PostgreSQL manual: Continuous Archiving and Warm Standby.

One note before I will go deep into details – all paths in following text assume installation as described in one of my previous posts. It's not to promote it in any way, but to provide a stable “platform" that you can relate to. If you're using some other installation (self made, or packages) – you just need to make sure to install all necessary programs, and fix paths.

So, first of all we will need 2 installations of Pg, that have the same version (major is enough), and the same compilation options. Usually it's enough to check “integer datetimes" setting.

I have 2 machines: arch and centos, and will setup wal replication between them. Let's check the parameters:

pgdba@arch:~$ postgres -V
postgres (PostgreSQL) 8.4.2
 
pgdba@arch:~$ pg_config --configure  | grep -oE '[^[:space:]]*integer-datetimes[^[:space:]]*'
'--enable-integer-datetimes'

So, I have 8.4.2, and integer datetimes are enabled. What about centos (which will work as slave)?

pgdba@centos:~$ postgres -V
postgres (PostgreSQL) 8.4.2
 
pgdba@centos:~$ pg_config --configure  | grep -oE '[^[:space:]]*integer-datetimes[^[:space:]]*'
'--enable-integer-datetimes'

Same thing. So we're good.

Now, let's do some sanity check. First, we'll see what's in WAL segments directory:

pgdba@arch:~$ ls -lR ~/data/5840/pg_xlog/
/home/pgdba/data/5840/pg_xlog/:
total 16408
-rw------- 1 pgdba pgdba 16777216 Mar 10 01:33 000000010000000000000000
drwx------ 2 pgdba pgdba     4096 Feb 26 02:36 archive_status
 
/home/pgdba/data/5840/pg_xlog/archive_status:
total 0

Now, let's do some writes, to see segments being actually created:

pgdba=# create table text (q text);
CREATE TABLE
 
pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(1,10000) i;
INSERT 0 10000

This did create table with 10,000 rows, each row containing long string – which was of course T.O.A.S.T.ed.

Let's see the sizes:

pgdba=# select pg_relation_size('text'), pg_total_relation_size('text');
 pg_relation_size | pg_total_relation_size
------------------+------------------------
         13443072 |               13475840
(1 row)

Nothing really dramatic, but let's see pg_xlog:

pgdba@arch:~$ ls -lR ~/data/5840/pg_xlog/
/home/pgdba/data/5840/pg_xlog/:
total 32812
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:37 000000010000000000000000
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:37 000000010000000000000001
drwx------ 2 pgdba pgdba     4096 Feb 26 02:36 archive_status
 
/home/pgdba/data/5840/pg_xlog/archive_status:
total 0

OK. As we can see new segment has been created.

What's more – we can check where PostgreSQL currently writes to:

pgdba=# select pg_current_xlog_location()  ;
 pg_current_xlog_location
--------------------------
 0/130EE90
(1 row)

This means that it writes to 000000010000000000000001, and current offset in this xlog file is hex(30EE90). Let's add some more data to see how it will move on.

pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(10001, 50000) i;
INSERT 0 40000
 
pgdba=# select pg_relation_size('text'), pg_total_relation_size('text');
 pg_relation_size | pg_total_relation_size
------------------+------------------------
         68059136 |               68108288
(1 row)
 
pgdba=# select pg_current_xlog_location()  ;
 pg_current_xlog_location
--------------------------
 0/462732C
(1 row)

And pg_xlog content:

pgdba@arch:~$ ls -l ~/data/5840/pg_xlog/
total 82024
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:37 000000010000000000000000
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000001
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000002
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000003
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:43 000000010000000000000004
drwx------ 2 pgdba pgdba     4096 Feb 26 02:36 archive_status

So, currently it writes to 000000010000000000000004, and offset is hex(62732C).

Obvious question – since we're currently writing to …004, what about those files …000, …001, …002 and …003 ?

Well, these files will get “rotated" after checkpoint. Which happens automatically after some time, or I can force it to happen:

pgdba=# checkpoint;
CHECKPOINT
pgdba@arch:~$ ls -l ~/data/5840/pg_xlog/
total 82024
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:49 000000010000000000000004
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:37 000000010000000000000005
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000006
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000007
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000008
drwx------ 2 pgdba pgdba     4096 Feb 26 02:36 archive_status

So. We know that it works, so now let's move to setting actual replication.

For starters – we need a way to connect from master (arch) to centos (slave) – without providing password. One way (not really good, but simple – I'll talk about it later) is using ssh.

Let's make password-less ssh key:

pgdba@arch:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pgdba/.ssh/id_rsa):
Created directory '/home/pgdba/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/pgdba/.ssh/id_rsa.
Your public key has been saved in /home/pgdba/.ssh/id_rsa.pub.
The key fingerprint is:
3d:8f:5a:7b:d2:f4:c6:24:c7:d9:fd:87:18:64:0d:66 pgdba@arch
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|            E    |
|           o o   |
|         .  o .  |
|        S oo . o.|
|           ++ = o|
|          oo.O ..|
|         o..+ = o|
|        . .o .  .|
+-----------------+

I answered with pressing enter key for every question. Now in ~/.ssh I have:

pgdba@arch:~$ ls -l ~/.ssh/
total 8
-rw------- 1 pgdba pgdba 1675 Mar 11 12:52 id_rsa
-rw-r--r-- 1 pgdba pgdba  392 Mar 11 12:52 id_rsa.pub

Content of this id_rsa.pub file needs to be copied to ~/.ssh/authorized_keys file on centos. Any method will do. Afterwards:

pgdba@arch:~$ cat .ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAmZdsjF6msFq/0tvEBsCx2FNM2wc6IyhV+EUxoI3/yZj1cr8yqvsWL82hP37nqAWXtDYtm4Wx/vxIoJBx8exciRlFwU/P8DDKjqeNnNq3Ug0oAIDsJPZuNFPRT3ygKkuBd9JWaU++jN8PeBk3bLc8sxHQtaLLa9FnK9qHu5dqrn1rTtYkSiH6lcR+BL/5L3M58KOfj/NmCZ6di9lP2CsLEomlmemYxKhUsFw2zjrQuEiV1NwyMXIoN196GGF5W9k6pLN3YN4fY9mo7dv5M5se35ZD59NGueorBIuPMNgYiSECLLrt5tjNVoaf1MFpFLFu79CH4aKoASZDpkUiDih8fQ== pgdba@arch
 
pgdba@arch:~$ ssh centos
The authenticity of host '172.28.173.127 (172.28.173.127)' can't be established.
RSA key fingerprint is 2a:f2:1a:45:e2:32:04:87:e0:17:14:39:7f:7f:71:45.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.28.173.127' (RSA) to the list of known hosts.
 
pgdba@centos:~$ cat .ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAmZdsjF6msFq/0tvEBsCx2FNM2wc6IyhV+EUxoI3/yZj1cr8yqvsWL82hP37nqAWXtDYtm4Wx/vxIoJBx8exciRlFwU/P8DDKjqeNnNq3Ug0oAIDsJPZuNFPRT3ygKkuBd9JWaU++jN8PeBk3bLc8sxHQtaLLa9FnK9qHu5dqrn1rTtYkSiH6lcR+BL/5L3M58KOfj/NmCZ6di9lP2CsLEomlmemYxKhUsFw2zjrQuEiV1NwyMXIoN196GGF5W9k6pLN3YN4fY9mo7dv5M5se35ZD59NGueorBIuPMNgYiSECLLrt5tjNVoaf1MFpFLFu79CH4aKoASZDpkUiDih8fQ== pgdba@arch

The point is that I can now connect without providing any kind of password.

Now. On slave, we need to make directory which will server as temporary place for storing WAL segments.

pgdba@arch:~$ ssh centos mkdir /home/pgdba/wal-archive

And now we can modify postgresql.conf to actually send the files to slave. To do so, we need to edit postgresql.conf in /home/pgdba/data/5840/ and set following options:

  • archive_mode = on
  • archive_command = ‘rsync -q “%p" centos:/home/pgdba/wal-archive/%f'

Afterwards, all I need to do, is restart postgres on master:

pgdba@arch:~$ pg_ctl -D ~/data/5840/ restart
waiting for server to shut down.... done
server stopped
server starting

At the moment, there are no files in wal-archive directory:

pgdba@centos:~$ ls -l wal-archive/
total 0

So, let's make some, to test that sending files actually work. Again, let's insert some dummy data:

pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 0/46293A8
(1 row)
 
pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(1,10000) i;
INSERT 0 10000
 
pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 0/51BEC10
(1 row)
 
pgdba@arch:~$ ls -l data/5840/pg_xlog/
total 82024
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 000000010000000000000004
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 000000010000000000000005
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000006
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000007
-rw------- 1 pgdba pgdba 16777216 Mar 11 12:42 000000010000000000000008
drwx------ 2 pgdba pgdba     4096 Mar 11 13:02 archive_status

As we can see, master switched to writing to next segment, …004 hasn't been renamed (yet), but on slave:

pgdba@centos:~$ ls -l wal-archive/
total 16408
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 000000010000000000000004

It works.

Of course we can now check some more data, but it seems kind of pointless. The question now would be: what will happen if master cannot connect to slave?

Let's emulate network problems:

[root@centos ~]# iptables -I INPUT -s 172.28.173.124 -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable

And now ssh of course cannot work:

pgdba@arch:~$ ssh centos
ssh: connect to host 172.28.173.127 port 22: No route to host

So, how will PostgreSQL on master behave?

pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(1,50000) i;
INSERT 0 50000
 
pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 0/9112EC0
(1 row)

pg_xlog directory looks like this:

pgdba@arch:~$ ls -l data/5840/pg_xlog/
total 98428
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 000000010000000000000004
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:06 000000010000000000000005
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 000000010000000000000006
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 000000010000000000000007
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:08 000000010000000000000008
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:08 000000010000000000000009
drwx------ 2 pgdba pgdba     4096 Mar 11 13:08 archive_status

What's important – if I'll issue “CHECKPOINT" – these files will not be rotated:

pgdba@arch:~$ psql -c "checkpoint"
CHECKPOINT
 
pgdba@arch:~$ ls -l data/5840/pg_xlog/
total 98428
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:06 000000010000000000000005
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 000000010000000000000006
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 000000010000000000000007
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:08 000000010000000000000008
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:09 000000010000000000000009
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 00000001000000000000000A
drwx------ 2 pgdba pgdba     4096 Mar 11 13:09 archive_status

Only …004 was rotated, but it's OK – since this segment was already copied.

We can also check Pg logs:

pgdba@arch:~$ tail -n 10 data/5840/pg_log/postgresql-2010-03-11_130021.csv
2010-03-11 13:09:21.031 CET,,,1091,"",4b98dd71.443,1,"/opt/pgsql-8.4.2/bin/postgres",2010-03-11 13:09:21 CET,,0,LOG,00000,"connection received: host=[local]",,,,,,,,
2010-03-11 13:09:21.034 CET,"pgdba","pgdba",1091,"[local]",4b98dd71.443,2,"authentication",2010-03-11 13:09:21 CET,,0,LOG,00000,"connection authorized: user=pgdba database=pgdba",,,,,,,,
2010-03-11 13:09:21.035 CET,,,1027,,4b98db55.403,3,,2010-03-11 13:00:21 CET,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,
2010-03-11 13:09:21.323 CET,,,1027,,4b98db55.403,4,,2010-03-11 13:00:21 CET,,0,LOG,00000,"checkpoint complete: wrote 3502 buffers (97.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.032 s, sync=0.202 s, total=0.287 s",,,,,,,,
2010-03-11 13:09:21.336 CET,"pgdba","pgdba",1091,"[local]",4b98dd71.443,3,"CHECKPOINT",2010-03-11 13:09:21 CET,1/0,0,LOG,00000,"duration: 301.080 ms  statement: checkpoint",,,,,,,,
2010-03-11 13:09:21.336 CET,"pgdba","pgdba",1091,"[local]",4b98dd71.443,4,"idle",2010-03-11 13:09:21 CET,,0,LOG,00000,"disconnection: session time: 0:00:00.305 user=pgdba database=pgdba host=[local]",,,,,,,,
2010-03-11 13:09:24.353 CET,,,1081,,4b98dd36.439,5,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:09:25.367 CET,,,1081,,4b98dd36.439,6,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:09:29.385 CET,,,1081,,4b98dd36.439,7,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:09:29.385 CET,,,1081,,4b98dd36.439,8,,2010-03-11 13:08:22 CET,,0,WARNING,01000,"transaction log file ""000000010000000000000005"" could not be archived: too many failures",,,,,,,,

As we can see – PostgreSQL knows that it can't send the segments over to slave.

This has one potentially important drawback – if for whatever reason sending segments to slave will not work for long period – it can use up all of disk space on master, since old segment files will not be recycled.

While writing this text, PostgreSQL on its own retried sending segments:

pgdba@arch:~$ tail -n 10 data/5840/pg_log/postgresql-2010-03-11_130021.csv
2010-03-11 13:09:29.385 CET,,,1081,,4b98dd36.439,7,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:09:29.385 CET,,,1081,,4b98dd36.439,8,,2010-03-11 13:08:22 CET,,0,WARNING,01000,"transaction log file ""000000010000000000000005"" could not be archived: too many failures",,,,,,,,
2010-03-11 13:10:29.529 CET,,,1081,,4b98dd36.439,9,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:10:30.543 CET,,,1081,,4b98dd36.439,10,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:10:31.560 CET,,,1081,,4b98dd36.439,11,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:10:31.560 CET,,,1081,,4b98dd36.439,12,,2010-03-11 13:08:22 CET,,0,WARNING,01000,"transaction log file ""000000010000000000000005"" could not be archived: too many failures",,,,,,,,
2010-03-11 13:11:31.691 CET,,,1081,,4b98dd36.439,13,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:11:32.706 CET,,,1081,,4b98dd36.439,14,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:11:36.722 CET,,,1081,,4b98dd36.439,15,,2010-03-11 13:08:22 CET,,0,LOG,00000,"archive command failed with exit code 12","The failed archive command was: rsync -q ""pg_xlog/000000010000000000000005"" centos:/home/pgdba/wal-archive/000000010000000000000005",,,,,,,
2010-03-11 13:11:36.722 CET,,,1081,,4b98dd36.439,16,,2010-03-11 13:08:22 CET,,0,WARNING,01000,"transaction log file ""000000010000000000000005"" could not be archived: too many failures",,,,,,,,

So, let's remove the “network problem":

[root@centos ~]# iptables -D INPUT -s 172.28.173.124 -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable

now we'll force checkpoint to speed things up:

pgdba@arch:~$ psql -c "checkpoint"
CHECKPOINT
 
pgdba@arch:~$ ls -l data/5840/pg_xlog/
total 98428
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:25 000000010000000000000009
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 00000001000000000000000A
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 00000001000000000000000B
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:08 00000001000000000000000C
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:06 00000001000000000000000D
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:07 00000001000000000000000E
drwx------ 2 pgdba pgdba     4096 Mar 11 13:25 archive_status

Nice. And what about slave?

pgdba@arch:~$ ssh centos ls -l wal-archive
total 82040
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:02 000000010000000000000004
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:13 000000010000000000000005
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:13 000000010000000000000006
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:13 000000010000000000000007
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:13 000000010000000000000008

That's cool.

Now that we have the wal segment archiving working, we need to send initial state of PostgreSQL to slave.

There comes one important very important point: all paths are best left to be the same on master and slave. There is room for “play" in here, but we're just starting. And we need to keep it simple.

Before data transfer, let's make sure that PostgreSQL is not running on slave, and that destination directory is clean or doesn't exist:

pgdba@centos:~$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba     5092  1.0  0.0   4540  1452 pts/0    S    13:37   0:00 -bash
pgdba     5120  0.0  0.0   4228   840 pts/0    R+   13:37   0:00  \_ ps uxf

OK, Postgres is not working.

pgdba@centos:~$ rm -rf ~/data/5840

And now I'm sure that destination directory doesn't exist. But ~/data, should so, in case it doesn't exist in your case:

pgdba@centos:~$ mkdir -p ~/data

Now. We need to transfer data from master to slave.

To do so we will need to run 3 commands:

pgdba@arch:~$ psql -c "select pg_start_backup('whatever')"
 pg_start_backup
-----------------
 0/A000020
(1 row)
 
pgdba@arch:~$ rsync -a --exclude 'pg_log/*' --exclude 'pg_xlog/*' --exclude postmaster.pid ~/data/5840/ centos:data/5840/
 
pgdba@arch:~$ psql -c "select pg_stop_backup()"
 pg_stop_backup
----------------
 0/A000080
(1 row)

As you can see from psql outputs there have been no work on master database while I performed the copy, but be assured that it's safe, and you can easily work on the database the whole time. Everything will work as usual.

A word about excludes in rsync – pg_log/* – we don't need textual logs from master on slave – no point in having them. pg_xlog – we don't need it, and actually transferring it would most likely cause problems. postmaster.pid – well – postgres is not running on slave, so the postmaster.pid file shouldn't be there.

After we finished the procedure, we need to do one change in postgresql.conf on slave: change archive_command to /bin/true:

pgdba@centos:~$ grep archive_command data/5840/postgresql.conf
#archive_command = 'rsync -q "%p" centos:/home/pgdba/wal-archive/%f'
archive_command = '/bin/true'

As you can see I left previous command, just commented it out.

Next step is also (like everything) very simple. On slave, in data/5840 we need to create new file, named ‘recovery.conf' with this content:

restore_command = '/opt/pgsql-8.4.2/bin/pg_standby -t /home/pgdba/wal-archive/recovery.trigger /home/pgdba/wal-archive/ %f %p %r'

And now we can start slave:

pgdba@centos:~$ pg_ctl -D ~/data/5840/ start
server starting

We can see with ps that it's running recovery:

pgdba@centos:~$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba     5092  0.0  0.0   4540  1476 pts/0    S    13:37   0:00 -bash
pgdba     5290  0.0  0.0   4228   848 pts/0    R+   13:54   0:00  \_ ps uxf
pgdba     5284 21.5  0.1  41664  3424 pts/0    S    13:54   0:00 /opt/pgsql-8.4.2/bin/postgres -D /home/pgdba/data/5840
pgdba     5285  0.0  0.0   9840   776 ?        Ss   13:54   0:00  \_ postgres: logger process
pgdba     5286  0.0  0.0  41664  1020 ?        Ss   13:54   0:00  \_ postgres: startup process
pgdba     5287  0.0  0.0   3672   692 ?        S    13:54   0:00      \_ /opt/pgsql-8.4.2/bin/pg_standby -t /home/pgdba/wal-archive/recovery.trigger /home/pgdba/wal-archive/ 00000001.history pg_xlog/RECOVERYHISTORY 000000000000000000000000

In the logs, we can see:

pgdba@centos:~$ cat data/5840/pg_log/postgresql-2010-03-11_135422.csv 
2010-03-11 13:54:22.141 CET,,,5286,,4b98e7fe.14a6,1,,2010-03-11 13:54:22 CET,,0,LOG,00000,"database system was interrupted; last known up at 2010-03-11 13:41:25 CET",,,,,,,,
2010-03-11 13:54:22.141 CET,,,5286,,4b98e7fe.14a6,2,,2010-03-11 13:54:22 CET,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,
2010-03-11 13:54:22.155 CET,,,5286,,4b98e7fe.14a6,3,,2010-03-11 13:54:22 CET,,0,LOG,00000,"starting archive recovery",,,,,,,,
2010-03-11 13:54:22.157 CET,,,5286,,4b98e7fe.14a6,4,,2010-03-11 13:54:22 CET,,0,LOG,00000,"restore_command = '/opt/pgsql-8.4.2/bin/pg_standby -t /home/pgdba/wal-archive/recovery.trigger /home/pgdba/wal-archive/ %f %p %r'",,,,,,,,
2010-03-11 13:54:52.260 CET,,,5286,,4b98e7fe.14a6,5,,2010-03-11 13:54:22 CET,,0,LOG,00000,"restored log file ""00000001000000000000000A.00000020.backup"" from archive",,,,,,,,
2010-03-11 13:54:52.293 CET,,,5286,,4b98e7fe.14a6,6,,2010-03-11 13:54:22 CET,,0,LOG,00000,"restored log file ""00000001000000000000000A"" from archive",,,,,,,,
2010-03-11 13:54:52.296 CET,,,5286,,4b98e7fe.14a6,7,,2010-03-11 13:54:22 CET,,0,LOG,00000,"automatic recovery in progress",,,,,,,,
2010-03-11 13:54:52.380 CET,,,5286,,4b98e7fe.14a6,8,,2010-03-11 13:54:22 CET,,0,LOG,00000,"redo starts at 0/A000064, consistency will be reached at 0/A000080",,,,,,,,
2010-03-11 13:54:52.380 CET,,,5286,,4b98e7fe.14a6,9,,2010-03-11 13:54:22 CET,,0,LOG,00000,"consistent recovery state reached",,,,,,,,

Now. Just like we can use pg_current_xlog_location() on master to see current xlog location, on slave we can use pg_controldata program.:

pgdba@centos:~$ pg_controldata ~/data/5840/
pg_control version number:            843
Catalog version number:               200904091
Database system identifier:           5442359980668217335
Database cluster state:               in archive recovery
pg_control last modified:             Thu 11 Mar 2010 01:54:52 PM CET
Latest checkpoint location:           0/A000020
Prior checkpoint location:            0/A000020
Latest checkpoint's REDO location:    0/A000020
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/726
Latest checkpoint's NextOID:          16988
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Thu 11 Mar 2010 01:41:25 PM CET
Minimum recovery ending location:     0/A000080
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        2000
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by reference

There is a lot of information, but the two most important lines are these:

Latest checkpoint location:           0/A000020
Latest checkpoint's REDO location:    0/A000020

At the moment these are not really cool – same value. So let's play with them a bit. To do so, we will need to make some writes on master:

pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(1,50000) i;
INSERT 0 50000
 
pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location 
--------------------------
 0/EEABEC0
(1 row)
 
pgdba=# checkpoint;
CHECKPOINT
 
pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location 
--------------------------
 0/EEABF04
(1 row)
 
pgdba=# insert into text (q) select repeat('depesz'||i, 10000) from generate_series(1,50000) i;
INSERT 0 50000
 
pgdba=# select pg_current_xlog_location();
 pg_current_xlog_location 
--------------------------
 0/12D586D4
(1 row)
 
pgdba=# \! ssh centos pg_controldata ~/data/5840/
pg_control version number:            843
Catalog version number:               200904091
Database system identifier:           5442359980668217335
Database cluster state:               in archive recovery
pg_control last modified:             Thu 11 Mar 2010 02:00:24 PM CET
Latest checkpoint location:           0/B000020
Prior checkpoint location:            0/A000020
Latest checkpoint's REDO location:    0/B000020
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/726
Latest checkpoint's NextOID:          16988
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Thu 11 Mar 2010 01:46:25 PM CET
Minimum recovery ending location:     0/11FFFEC4
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        2000
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by reference

Now. Some more time passed, and not controldata shows:

pgdba@centos:~$ pg_controldata ~/data/5840/ | grep -E '^Latest checkpoint(.s REDO)? location:'
Latest checkpoint location:           0/EEABEC0
Latest checkpoint's REDO location:    0/EEABEC0

And now content of wal-archive is:

pgdba@centos:~$ ls -l wal-archive/
total 114864
-rw------- 1 pgdba pgdba      241 Mar 11 13:45 00000001000000000000000A.00000020.backup
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:59 00000001000000000000000B
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:59 00000001000000000000000C
-rw------- 1 pgdba pgdba 16777216 Mar 11 13:59 00000001000000000000000D
-rw------- 1 pgdba pgdba 16777216 Mar 11 14:00 00000001000000000000000E
-rw------- 1 pgdba pgdba 16777216 Mar 11 14:00 00000001000000000000000F
-rw------- 1 pgdba pgdba 16777216 Mar 11 14:01 000000010000000000000010
-rw------- 1 pgdba pgdba 16777216 Mar 11 14:01 000000010000000000000011

Please notice that old, unnecessary WAL segments have been removed. Generally – these are removed always when they are no longer needed – assuming we gave ‘%r' as one of arguments to pg_standby.

And that's practically it.

We have fully working wal replication, old segments are being removed, so as long as network will not fail, we should be safe.

Now. How to make slave work as master?

THat's where ‘-t /home/pgdba/wal-archive/recovery.trigger' option to pg_standby comes in handy.

Let's see:

pgdba@centos:~$ touch /home/pgdba/wal-archive/recovery.trigger

And in logs we can see:

2010-03-11 14:18:17.076 CET,,,5286,,4b98e7fe.14a6,18,,2010-03-11 13:54:22 CET,,0,LOG,58P01,"could not open file ""pg_xlog/000000010000000000000013"" (log file 0, segment 19): No such file or directory",,,,,,,,
2010-03-11 14:18:17.077 CET,,,5286,,4b98e7fe.14a6,19,,2010-03-11 13:54:22 CET,,0,LOG,00000,"redo done at 0/12D587C4",,,,,,,,
2010-03-11 14:18:17.077 CET,,,5286,,4b98e7fe.14a6,20,,2010-03-11 13:54:22 CET,,0,LOG,00000,"last completed transaction was at log time 2010-03-11 14:02:04.678272+01",,,,,,,,
2010-03-11 14:18:17.127 CET,,,5286,,4b98e7fe.14a6,21,,2010-03-11 13:54:22 CET,,0,LOG,00000,"restored log file ""000000010000000000000012"" from archive",,,,,,,,
2010-03-11 14:18:47.224 CET,,,5286,,4b98e7fe.14a6,22,,2010-03-11 13:54:22 CET,,0,LOG,00000,"selected new timeline ID: 2",,,,,,,,
2010-03-11 14:19:17.291 CET,,,5286,,4b98e7fe.14a6,23,,2010-03-11 13:54:22 CET,,0,LOG,00000,"archive recovery complete",,,,,,,,
2010-03-11 14:19:17.298 CET,,,5298,,4b98e81c.14b2,10,,2010-03-11 13:54:52 CET,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate wait",,,,,,,,
2010-03-11 14:19:17.452 CET,,,5298,,4b98e81c.14b2,11,,2010-03-11 13:54:52 CET,,0,LOG,00000,"checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.161 s",,,,,,,,
2010-03-11 14:19:17.497 CET,,,5588,,4b98edd5.15d4,1,,2010-03-11 14:19:17 CET,,0,LOG,00000,"autovacuum launcher started",,,,,,,,
2010-03-11 14:19:17.501 CET,,,5284,,4b98e7fd.14a4,1,,2010-03-11 13:54:21 CET,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,

and now I can:

pgdba@centos:~$ psql -c "select pg_current_xlog_location()"
 pg_current_xlog_location 
--------------------------
 0/13000064
(1 row)

That is – I can connect and work on the database.

Before, It would be good to remove “recovery.done" and “backup_label.old" files from /home/pgdba/data/5840 on slave.

So. To wrap it up to simple step-by-step procedure:

  1. ensure you can connect without password from master to slave
  2. enable archiving (archive_command and archive_mode) in postgresql.conf
  3. restart postgresql on master
  4. start backup procedure on master (pg_start_backup)
  5. copy whole datadir to slave, with certain excludes
  6. finish backup procedure on master (pg_stop_backup)
  7. create recovery.conf on slave
  8. start slave PostgreSQL

Now. The last thing that's left in here is what I mentioned earlier:

… One way (not really good, but simple – I'll talk about it later) is using ssh. …

Why is ssh not good? Couple of reasons:

  • Standard ssh setup lets potential attacker get shell access
  • SSH is slow
  • rsync over ssh has very little control over what files can be accessed

Let's discuss these points.

First – shell access. Of course you can remove this ability by assigning special command to key in authorized_keys. Sure. Have you ever done it? Have you even been aware that it's posible to make ssh access that can be used to run only 1, predefined, command? Well. Most of the people I know – don't know about it. And this means that even if you'll learn about it, some other poor admin/dba will have hard time figuring out why he can't login to slave server, while rsyncing over ssh works.

Second – SSH is slow. This (for me) is the most important part. Let me show you some example.

I created test file, which contains random data, and it's 1GB in size:

=> dd if=/dev/urandom of=test.file bs=1k count=1M

Now. Let's copy it, using rsync-over-ssh to localhost:

=> time rsync test.file localhost:rsync-over-ssh
 
real    1m37.026s
user    1m9.960s
sys     0m4.080s

and now, let's use rsyncd – tcp/ip server, which does it's own network communication:

=> time rsync test.file rsync://localhost/media/other/
 
real    0m14.543s
user    0m3.310s
sys     0m2.210s

In the first case – we got speed of ~ 11MB/s, and in the second – 73MB/s.

The problem is that the encryption that ssh uses isn't free. No encryption is free – there is always some overhead.

And while it might sound to you that 11MB/s is good enough – consider this: I've seen servers which generate 200 WAL segments per minute. Since WAL segment is 16MB – that means we created 53MB of WAL segments every second. Compression can help some, but it's also eating the processor.

Last point why rsync-over-ssh is not good – control over what can be accessed. Even if you'll make sure that only rsync can be run on the other end of ssh connection, it still has access to practically anywhere pgdba user has access to. Using (for example) rsyncd, I can easily allow access to only wal-archive directory, and nowhere else.

You might wonder – why do I force rsync so much? Why don't use ftp or nfs, or anything else? The reason is very simple. Rsync has one very interesting capability. When rsyncing file, it first creates file under different name, and only after transfer is done, it atomically renames the temp file to final name.

Benefit of this is pretty simple – we avoid potential race condition which could happen if recovery process would “see" the file that is just being transferred, and use it before all of its 16MB got to destination.

I hope this wraps the subject. In case you have any questions, or you will find errors in what I wrote – please let me know via comments.

  1. 24 comments

  2. Mar 11, 2010

    Try using the arcfour cipher with SSH for rsync, when speed is more important than security.

  3. Mar 11, 2010

    “1. ensure you can cannot without password from master to slave”
    !you can connect

    Thanks for great article!

  4. Mar 11, 2010

    @Bob:

    It helps, but not much:

    => time scp -c arcfour test.file localhost:output.file
    test.file 100% 1024MB 13.5MB/s 01:16

    real 1m17.063s
    user 1m2.510s
    sys 0m2.600s

    Generally *any* encryption will be slower than lack of encryption. And in my experience master and slave databases are usually so close, that adding encryption is pretty much pointless.

  5. Mar 11, 2010

    @DIMITRY:
    thanks, fixed.

  6. # pg_alchemist
    Mar 12, 2010

    Awesome Depesz. That was a thorough and painstakingly detailed tutorial. Thanks
    It will surely help many first timers like
    me. I also discovered so many new ideas you
    can do with sed, grep. I appreciate this and
    hope more creative juice keeps flowing.

    Cheers,
    pg_alchemist

  7. Mar 14, 2010

    This is great, thanks for the writeup.

    How would you stream out the WALs to multiple slaves? Some apps might benefit from having multiple hot standbys in order to distribute load.

    I can think of a couple of ways but I’m wondering your approach would be.

  8. Mar 14, 2010

    @Harold:
    not sure I understand – remember that currently You can’t have hot standby!

    If you’re talking about doing it in 9.0 – quite simply just rsync to all servers one-by-one, remembering which ones got the data, and return true if all of them received segment.

  9. # manish
    May 6, 2011

    hi,

    just want to know, from any slave or any PG 9.0 (in SR replication case), how can i check, i am slave of which Master PG Server….
    In Mysql, we use
    Show Slave status\G

    what is in case of Postgresql 9.0

    Regards
    Manish

  10. May 6, 2011

    If you have hot standby, which you have to have to be able to run queries, and you have streaming replication – you can simply:

    select pg_read_file( ‘recovery.conf’, 0, (pg_stat_file(‘recovery.conf’)).size);

    to get content of recovery.conf, which includes all the information you need.

  11. Mar 19, 2012

    Thanks a lot, this is the most elaborate explanation on stand by postgres database as far as I know.

    Thank you
    Ajay.

  12. Mar 19, 2012

    I have one question, I don’t see pg_standby in my bin folder. I am using PostgreSQL 8.2.4, how do I configure pg_standby?

  13. Mar 19, 2012

    @Ajay:
    pg_standby was added, as contrib module, in pg 8.3.

  14. Apr 24, 2012

    I am facing certain issue on my standby server here is the output from log file on standby.
    `/baflbackup/wal-archive/000000010000007200000080′ -> `pg_xlog/RECOVERYXLOG’
    LOG: restored log file “000000010000007200000080” from archive
    `/baflbackup/wal-archive/000000010000007200000081′ -> `pg_xlog/RECOVERYXLOG’
    LOG: restored log file “000000010000007200000081” from archive
    `/baflbackup/wal-archive/000000010000007200000082′ -> `pg_xlog/RECOVERYXLOG’
    LOG: restored log file “000000010000007200000082” from archive
    FATAL: could not restore file “000000010000007200000083” from archive: return code 15
    LOG: received smart shutdown request
    LOG: startup process (PID 19570) exited with exit code 1
    LOG: aborting startup due to startup process failure
    How should I deal with this?

    Thank you
    Ajay

  15. Apr 24, 2012

    @Ajay:
    what’s your restore command?

  16. Dec 11, 2012

    Thanks ,it worked for me .

    The total data from master is replicated to slave.

    After that if i create any database in master , that new data is not going to slave automatically .

    for automatically replication do i need to do any thing .

    Please help me .

  17. Dec 11, 2012

    i mean , how to do incremental replication

  18. Dec 11, 2012

    @Srinivas:
    if you did setup everything as in here, all data changes should be replicated.

  19. Dec 13, 2012

    Sorry ,I tried 3 times .But every time i failed at incremental replication .

    do i need to schedule any above commands for incremental replication . Please let me know where i am doing wrong .

  20. Dec 13, 2012

    @Srinivas:
    How can I tell what you’re doing wrong if I don’t know what you did?
    You clearly did not do something that I described, or didn’t check that it works, but it is *you* who has access to config files and logs, and not me.

  21. # Bob Lunney
    Jul 24, 2013

    Great tutorial, Depesz, very thorough. I have one question, though. Once the master has failed and the slave completes recovery and comes up, how do you get it to archive WAL so you can build another slave? Looks like this requires a restart, at least for 8.4, so your running in a degraded state until there is a window for restarting the newly promoted slave database. Any insights?

  22. Jul 24, 2013

    @Bob:
    well, you need to set proper archive_command on slave. If you had `archive_mode = on` the whole time, then restart will not be needed.

    Since slave generally doesn’t do archiving, I tend to leave:

    archive_mode = on
    archive_command = ‘/bin/true’

    on slaves, and after failover, i just change archive_command, reload, and I’m done.

  23. # Bob Lunney
    Jul 24, 2013

    Thanks for the quick response, Depesz. That is exactly the answer I was hoping for.

    And thanks for all you do to promote our favorite database and educate others. You are an invaluable asset to the PG community.

  24. # shan
    Aug 13, 2013

    How to do the switchover operation in postgresql8.4 warm standby?

  25. Aug 13, 2013

    @Shan:
    promote slave to standalone, and switch application to new database.

Sorry, comments for this post are disabled.