Waiting for 9.0 – Streaming replication

The BIG feature. The feature that made PostgreSQL leap from 8.4 to 9.0. The patch was written by Fujii Masao, and committed by Heikki Linnakangas on the 15th of January 2010:

Log Message:
-----------
Introduce Streaming Replication.
 
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
 
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceiver.
 
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
 
Fujii Masao, with additional hacking by me

What exactly does it mean?

Well, as you are perhaps aware, PostgreSQL offers replication based on shipping WAL segments – which, up to 8.4, could be used to run a Warm Standby, and which from 9.0 on can power an actual Hot Standby.

The problem with this kind of replication is pretty simple – it is based on files, which get rotated (and thus shipped to the standby) only after 16MB of data has been written.

In some scenarios that's not a problem – if you're constantly generating at least 16MB of writes per minute, your replication lag is basically always below one minute.

But what happens if you don't have that many writes? Of course you can force WAL rotation, but this is not really a good idea, and forcing it more often than once per minute is not recommended anyway.
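For completeness – a segment switch can be forced from SQL (pg_switch_xlog() exists in 8.4/9.0; it was renamed to pg_switch_wal() in PostgreSQL 10), and the archive_timeout setting, visible in the config diff below, does the same thing periodically:

```
-- run on the master to force an immediate WAL segment switch:
SELECT pg_switch_xlog();
```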

But now, thanks to Streaming Replication, lag should be nearly zero. Of course it's still asynchronous replication, but for most use cases – async is really good enough.

So, let's see how to set it up.

On my test machine, I have PostgreSQL binaries in /home/pgdba/work, and main datadir in /home/pgdba/data (listens on port 5850)

Now, let's set up streaming replication and a standby server, on the same machine, in /home/pgdba/data2, listening on port 5851.

First, I will need to stop main PostgreSQL:

/home/pgdba/work/bin/pg_ctl -D /home/pgdba/data stop

Now let's make some changes in the config. Before applying the changes, I copied postgresql.conf to postgresql.conf.before, so I can now show you a simple diff:

=> diff -u postgresql.conf.before postgresql.conf
--- postgresql.conf.before      2010-01-31 23:27:24.000000000 +0100
+++ postgresql.conf     2010-01-31 23:29:29.000000000 +0100
@@ -175,9 +175,9 @@
 
 # - Archiving -
 
-#archive_mode = off            # allows archiving to be done
+archive_mode = on              # allows archiving to be done
                                # (change requires restart)
-#archive_command = ''          # command to use to archive a logfile segment
+archive_command = 'cp "%p" /home/pgdba/wal_archive/"%f"'               # command to use to archive a logfile segment
 #archive_timeout = 0           # force a logfile segment switch after this
                                # number of seconds; 0 disables
 
@@ -190,7 +190,7 @@
 
 # - Replication -
 
-#max_wal_senders = 0           # max number of walsender processes
+max_wal_senders = 1            # max number of walsender processes
 #wal_sender_delay = 200ms      # 1-10000 milliseconds

As you can see it's not really complicated.

Now, I can start master PostgreSQL:

/home/pgdba/work/bin/pg_ctl -D /home/pgdba/data start

OK. Now, I generated some traffic in the database, to make it pre-generate some WAL segments. Not many – let's say 100.

My test script generates around 7 wal segments per minute.

Now, having this, let's set up the slave server:

=> /home/pgdba/work/bin/psql -U postgres -p 5850 -c "SELECT pg_start_backup('depesz')"
 pg_start_backup
-----------------
 0/74C645E8
(1 row)
=> rsync -a /home/pgdba/data/ /home/pgdba/data2/
=> /home/pgdba/work/bin/psql -U postgres -p 5850 -c "SELECT pg_stop_backup()"
 pg_stop_backup
----------------
 0/7A8FB9A8
(1 row)

Now, let's clear data2:

rm -f /home/pgdba/data2/postmaster.pid /home/pgdba/data2/pg_xlog/archive_status/* /home/pgdba/data2/pg_xlog/0*

And change the config. The diff between data and data2:

=> diff -u data/postgresql.conf data2/postgresql.conf
--- data/postgresql.conf        2010-01-31 23:34:11.000000000 +0100
+++ data2/postgresql.conf       2010-02-01 00:02:18.000000000 +0100
@@ -60,7 +60,7 @@
                                        # comma-separated list of addresses;
                                        # defaults to 'localhost', '*' = all
                                        # (change requires restart)
-#port = 5850                           # (change requires restart)
+port = 5851                            # (change requires restart)
 max_connections = 100                  # (change requires restart)
 # Note:  Increasing max_connections costs ~400 bytes of shared memory per
 # connection slot, plus lock space (see max_locks_per_transaction).
@@ -175,7 +175,7 @@
 
 # - Archiving -
 
-archive_mode = on              # allows archiving to be done
+archive_mode = off             # allows archiving to be done
                                # (change requires restart)
 archive_command = 'cp "%p" /home/pgdba/wal_archive/"%f"'               # command to use to archive a logfile segment
 #archive_timeout = 0           # force a logfile segment switch after this
@@ -190,7 +190,7 @@
 
 # - Replication -
 
-max_wal_senders = 1            # max number of walsender processes
+max_wal_senders = 0            # max number of walsender processes
 #wal_sender_delay = 200ms      # 1-10000 milliseconds

On slave I created recovery.conf:

=> cat /home/pgdba/data2/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5850 user=postgres'
trigger_file = '/home/pgdba/data2/finish.replication'
restore_command = 'cp -i /home/pgdba/wal_archive/%f "%p" </dev/null'

It is critical that you do not use pg_standby, or anything else that waits for the next WAL segment, as the restore_command here.

Also – as far as I understand – WAL segments should be delivered from master to slave directly by archive_command; copying them to some temp directory and then rsyncing from cron will probably not work.

Anyway, after some time the slave fetched all WAL segments from the archive directory, and switched to streaming.

You can see this because we now have two new, interesting processes:

=> ps uw --ppid $( head -n 1 /home/pgdba/data2/postmaster.pid )
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba    13384  0.0  0.0  28712  1056 ?        Ss   00:22   0:00 postgres: logger process
pgdba    13385  1.1  0.2  59916 27200 ?        Ss   00:22   0:03 postgres: startup process   recovering 0000000100000000000000EE
pgdba    13390  0.0  0.2  59812 26208 ?        Ss   00:22   0:00 postgres: writer process
pgdba    13397  0.0  0.0  30800  1052 ?        Ss   00:22   0:00 postgres: stats collector process
pgdba    13402  0.4  0.0  77016  7368 ?        Ss   00:22   0:01 postgres: wal receiver process   streaming 0/EE2B7FD8
 
=> ps uw --ppid $( head -n 1 /home/pgdba/data/postmaster.pid )
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba     6904  0.9  0.0  28716  1056 ?        Ss   Jan31   0:31 postgres: logger process
pgdba     6906  0.1  0.2  60092 27060 ?        Ss   Jan31   0:04 postgres: writer process
pgdba     6907  0.0  0.0  59652  1304 ?        Ss   Jan31   0:00 postgres: wal writer process
pgdba     6908  0.0  0.0  62624  3992 ?        Ss   Jan31   0:00 postgres: autovacuum launcher process
pgdba     6909  0.0  0.0  30912  1024 ?        Ss   Jan31   0:00 postgres: archiver process   LAST was 0000000100000000000000ED
pgdba     6910  0.0  0.1  47356 12896 ?        Ss   Jan31   0:00 postgres: stats collector process
pgdba     8713 85.6  0.6 118308 80628 ?        Rs   Jan31  31:35 postgres: depesz depesz [LOCAL] DROP TABLE
pgdba     8818  0.0  0.5 110428 66604 ?        Ss   Jan31   0:01 postgres: autovacuum worker process   depesz
pgdba     8887  0.0  0.6 134780 84884 ?        Ss   Jan31   0:01 postgres: autovacuum worker process   depesz
pgdba     8952  0.0  0.6 126812 83580 ?        Ss   Jan31   0:01 postgres: autovacuum worker process   depesz
pgdba    13403  1.4  0.0  69224  8580 ?        Ss   00:22   0:04 postgres: wal sender process postgres 127.0.0.1(52186) streaming 0/EE3D8090
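Besides looking at the processes, the two new functions mentioned in the commit message can be queried on the slave to watch replication progress. A hypothetical check, reusing the paths and ports from this setup (output omitted):

```
=$ /home/pgdba/work/bin/psql -U postgres -p 5851 -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location()"
```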

As you can see, the master is still archiving WAL segments – which should continue, but you should use some cronjob to remove the no-longer-needed ones. By no-longer-needed I mean: older than the "Latest checkpoint's REDO location" reported by pg_controldata run on the slave's data directory.
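Such a cleanup job could be sketched like this. Note that this is my own reconstruction, not from the original setup – the parsing of pg_controldata output and the segment-name arithmetic (TimeLineID, then the high 32 bits of the location, then the segment number, assuming the default 16MB segment size) are assumptions, so verify it before pointing it at a real archive:

```shell
#!/bin/sh
# Sketch of a WAL-archive cleanup cronjob. Assumes default 16MB segments.

# seg_name TLI LOCATION -- e.g. "seg_name 1 0/EE2B7FD8" prints the
# 24-character name of the WAL segment file containing that location.
seg_name() {
    tli=$1
    loc=$2
    hi=${loc%/*}
    lo=${loc#*/}
    printf '%08X%08X%08X' "$tli" "0x$hi" $(( 0x$lo / ( 16 * 1024 * 1024 ) ))
}

SLAVE_DATA=/home/pgdba/data2
ARCHIVE=/home/pgdba/wal_archive
PG_CONTROLDATA=/home/pgdba/work/bin/pg_controldata

if [ -x "$PG_CONTROLDATA" ]; then
    control=$( "$PG_CONTROLDATA" "$SLAVE_DATA" )
    redo=$( printf '%s\n' "$control" | awk -F': *' '/REDO location/ { print $2 }' )
    tli=$(  printf '%s\n' "$control" | awk -F': *' '/TimeLineID/    { print $2 }' )
    keep=$( seg_name "$tli" "$redo" )
    # drop archived segments sorting strictly before the one that holds
    # the latest checkpoint's REDO location
    for f in "$ARCHIVE"/0*; do
        if expr "$( basename "$f" )" \< "$keep" >/dev/null; then
            rm -f -- "$f"
        fi
    done
fi
```

(In 9.0 the contrib module pg_archivecleanup can do a similar job.)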

Now, let's see how it really works.

I stopped my load generator – thanks to this there will be basically no traffic on the database, as soon as the autovacuum workers finish 🙂

Afterwards, I did a simple test. I have a table:

# \d test
              Table "public.test"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 x      | timestamp with time zone | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (x)

First test:

=$ /home/pgdba/work/bin/psql -U depesz -p 5850 -c "insert into test (x) values (now()) returning x"
               x
-------------------------------
 2010-02-01 00:33:29.543639+01
(1 row)
 
INSERT 0 1
 
=$ /home/pgdba/work/bin/psql -U depesz -p 5851 -c "select * from test"
 x
---
(0 rows)

Of course I don't see it – streaming replication is fast, but not that fast 🙂

But adding even 1 second of delay worked perfectly:

=$ /home/pgdba/work/bin/psql -U depesz -p 5850 -c "insert into test (x) values (now()) returning x"
               x
-------------------------------
 2010-02-01 00:35:13.520617+01
(1 row)
 
INSERT 0 1
 
=$ sleep 1
 
=$ /home/pgdba/work/bin/psql -U depesz -p 5851 -c "select * from test order by x desc limit 1"
               x
-------------------------------
 2010-02-01 00:35:13.520617+01
(1 row)

(These commands were copy-pasted into the terminal, so there was no typing-induced delay.)

So, as we can see it works.

Is it ready? Definitely not. For starters, the docs are far from usable. If you'd like to play with it yourself, you can check the docs, but you should really check the wiki page.

Do I like it? YEAH! Even with the docs unfinished, setup didn't take long, and the effect is well worth some sweat 🙂

18 thoughts on “Waiting for 9.0 – Streaming replication”

  1. Nice article – hot standby is a much needed feature in Postgresql. One thing you might want to consider is using rsync in place of cp. rsync does an atomic copy, so there’s no chance of the other process picking up a partially written file. I use something like this:

    archive_command = 'rsync %p /var/lib/pgsql/archives/%f'

  2. This feature alone is why I plan on testing 9.0 as early as possible so that I can start using this wonderful feature. My thanks to everyone involved!

  3. Does anyone know how this exactly works? Does it still apply blocks (like a warm standby) or does it work on query level (like an Oracle logical standby database)?

  4. @Florian:
    It is still WAL replication. But instead of waiting for 16MB of data to be sent, data is sent more or less in real time.

  5. I’m so disappointed about the miss of a TRANSPARENT Table Partitioning in PG 9.0. 🙁
    Table partitioning is great for speed-tuning large databases.

  6. Nice demo on hot standby. Any strong reason why we are not using the pg_standby command here?

  7. Currently I use PostgreSql 8.2 together with slony-I for partial replication of a set of tables at the master to the database of the slave.

    Is it possible to use this type of replication built in PostgreSql 9 for partial replication too?

  8. @Ferdy:
    no. whole WAL is being sent to slave, which means all databases, and all tables from your pg instance.

  9. Hi, nice article. I tried hot standby and streaming replication. That's a nice feature, but it lacks one thing: dumping of data (pg_dump) from the standby server. Is there any option to perform pg_dump on the streaming server?

  10. How fast must the backup server be?
    Can we replicate from a host with a powerful disk subsystem to a slave with regular SATA disks, for example?

  11. Hi all,

    Anyone know how to promote the slave to master when the master server goes down or fails?

    Please share possibilities, it would be very helpful.
