February 1st, 2010 by depesz

The BIG feature. The feature that made PostgreSQL leap from 8.4 to 9.0. Patch was written by Fujii Masao, and committed by Heikki Linnakangas on 15th of January 2010:

Log Message:
-----------
Introduce Streaming Replication.
 
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
 
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceiver.
 
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
 
Fujii Masao, with additional hacking by me

What exactly does it mean?

Well. As you are perhaps aware, PostgreSQL offers replication based on WAL segments – which, up to 8.4, could be used to build a Warm Standby, and which 9.0 turns into an actual Hot Standby.

The problem with this replication is pretty simple – it is based on files, which get rotated (and thus shipped to the standby) only after 16MB of WAL data has been written.

In some scenarios that's not a problem – if you constantly generate at least 16MB of WAL per minute, your replication lag is basically always below one minute.

But what happens if you don't have that many writes? Of course you can force WAL rotation, but it is not really a good idea, and forcing it more often than once a minute is not recommended anyway.
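For the record, the knob for forcing rotation is archive_timeout – it trades lag for archiving mostly-empty 16MB files. Something like:

```
# postgresql.conf – force a segment switch (and thus archiving)
# at most every 60 seconds; lower values are discouraged
archive_timeout = 60
```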

But now, thanks to Streaming Replication, the lag should be nearly zero. Of course it's still asynchronous replication, but for most use cases – async is really good enough.

So, let's see how to set it up.

On my test machine, I have PostgreSQL binaries in /home/pgdba/work, and the main datadir in /home/pgdba/data (listening on port 5850).

Now, let's set up streaming replication with a standby server on the same machine, in /home/pgdba/data2, listening on port 5851.

First, I will need to stop main PostgreSQL:

/home/pgdba/work/bin/pg_ctl -D /home/pgdba/data stop

Now let's make some changes in the config. Before applying them, I copied postgresql.conf to postgresql.conf.before, so I can show you a simple diff:

=> diff -u postgresql.conf.before postgresql.conf
--- postgresql.conf.before 2010-01-31 23:27:24.000000000 +0100
+++ postgresql.conf 2010-01-31 23:29:29.000000000 +0100
@@ -175,9 +175,9 @@
 
# - Archiving -
 
-#archive_mode = off # allows archiving to be done
+archive_mode = on # allows archiving to be done
# (change requires restart)
-#archive_command = '' # command to use to archive a logfile segment
+archive_command = 'cp "%p" /home/pgdba/wal_archive/"%f"' # command to use to archive a logfile segment
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
 
@@ -190,7 +190,7 @@
 
# - Replication -
 
-#max_wal_senders = 0 # max number of walsender processes
+max_wal_senders = 1 # max number of walsender processes
#wal_sender_delay = 200ms # 1-10000 milliseconds

As you can see it's not really complicated.
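As a side note – the directory used by archive_command has to exist and be writable before the restart, or archiving will fail. A minimal sketch, simulating by hand the %p/%f substitution that PostgreSQL normally does (a temporary directory stands in here for /home/pgdba/wal_archive):

```shell
#!/bin/sh
set -e
# In the real setup this would be /home/pgdba/wal_archive – the path
# used in archive_command; a temporary directory stands in for it here.
ARCHIVE_DIR=$(mktemp -d)

# Simulate PostgreSQL's invocation of archive_command: %p expands to
# the path of the finished segment, %f to its file name.
p=$(mktemp)             # stand-in for a finished WAL segment
f=$(basename "$p")
cp "$p" "$ARCHIVE_DIR/$f"

# The copy must succeed and the file must land under its own name.
test -f "$ARCHIVE_DIR/$f" && echo "archive_command works"
```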

Now, I can start master PostgreSQL:

/home/pgdba/work/bin/pg_ctl -D /home/pgdba/data start

OK. Now I generated some traffic in the database, to make it pre-generate some WAL segments. Not many – let's say 100.

My test script generates around 7 wal segments per minute.

Now, having this, let's set up the slave server:

=> /home/pgdba/work/bin/psql -U postgres -p 5850 -c "SELECT pg_start_backup('depesz')"
pg_start_backup
-----------------
0/74C645E8
(1 row)
=> rsync -a /home/pgdba/data/ /home/pgdba/data2/
=> /home/pgdba/work/bin/psql -U postgres -p 5850 -c "SELECT pg_stop_backup()"
pg_stop_backup
----------------
0/7A8FB9A8
(1 row)

Now, let's clear data2:

rm -f /home/pgdba/data2/postmaster.pid /home/pgdba/data2/pg_xlog/archive_status/* /home/pgdba/data2/pg_xlog/0*

And change the config. Diff between data and data2:

=> diff -u data/postgresql.conf data2/postgresql.conf
--- data/postgresql.conf 2010-01-31 23:34:11.000000000 +0100
+++ data2/postgresql.conf 2010-02-01 00:02:18.000000000 +0100
@@ -60,7 +60,7 @@
# comma-separated list of addresses;
# defaults to 'localhost', '*' = all
# (change requires restart)
-#port = 5850 # (change requires restart)
+port = 5851 # (change requires restart)
max_connections = 100 # (change requires restart)
# Note: Increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction).
@@ -175,7 +175,7 @@
 
# - Archiving -
 
-archive_mode = on # allows archiving to be done
+archive_mode = off # allows archiving to be done
# (change requires restart)
archive_command = 'cp "%p" /home/pgdba/wal_archive/"%f"' # command to use to archive a logfile segment
#archive_timeout = 0 # force a logfile segment switch after this
@@ -190,7 +190,7 @@
 
# - Replication -
 
-max_wal_senders = 1 # max number of walsender processes
+max_wal_senders = 0 # max number of walsender processes
#wal_sender_delay = 200ms # 1-10000 milliseconds

On the slave I created recovery.conf:

=> cat /home/pgdba/data2/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5850 user=postgres'
trigger_file = '/home/pgdba/data2/finish.replication'
restore_command = 'cp -i /home/pgdba/wal_archive/%f "%p" </dev/null'

It is critical that you do not use pg_standby, or anything similar that waits for the next WAL segment, as restore_command – otherwise the standby would block on the archive forever instead of falling through to streaming.

Also – as far as I understand – WAL segments should be delivered from master to slave directly by archive_command; copying them to some temp directory and then rsyncing from cron will probably not work.

Anyway. After some time, the slave fetched all WAL segments from the archive directory and switched to streaming.

This can be seen because we now have two new, interesting processes:

=> ps uw --ppid $( head -n 1 /home/pgdba/data2/postmaster.pid )
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
pgdba 13384 0.0 0.0 28712 1056 ? Ss 00:22 0:00 postgres: logger process
pgdba 13385 1.1 0.2 59916 27200 ? Ss 00:22 0:03 postgres: startup process recovering 0000000100000000000000EE
pgdba 13390 0.0 0.2 59812 26208 ? Ss 00:22 0:00 postgres: writer process
pgdba 13397 0.0 0.0 30800 1052 ? Ss 00:22 0:00 postgres: stats collector process
pgdba 13402 0.4 0.0 77016 7368 ? Ss 00:22 0:01 postgres: wal receiver process streaming 0/EE2B7FD8
 
=> ps uw --ppid $( head -n 1 /home/pgdba/data/postmaster.pid )
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
pgdba 6904 0.9 0.0 28716 1056 ? Ss Jan31 0:31 postgres: logger process
pgdba 6906 0.1 0.2 60092 27060 ? Ss Jan31 0:04 postgres: writer process
pgdba 6907 0.0 0.0 59652 1304 ? Ss Jan31 0:00 postgres: wal writer process
pgdba 6908 0.0 0.0 62624 3992 ? Ss Jan31 0:00 postgres: autovacuum launcher process
pgdba 6909 0.0 0.0 30912 1024 ? Ss Jan31 0:00 postgres: archiver process last was 0000000100000000000000ED
pgdba 6910 0.0 0.1 47356 12896 ? Ss Jan31 0:00 postgres: stats collector process
pgdba 8713 85.6 0.6 118308 80628 ? Rs Jan31 31:35 postgres: depesz depesz [local] DROP TABLE
pgdba 8818 0.0 0.5 110428 66604 ? Ss Jan31 0:01 postgres: autovacuum worker process depesz
pgdba 8887 0.0 0.6 134780 84884 ? Ss Jan31 0:01 postgres: autovacuum worker process depesz
pgdba 8952 0.0 0.6 126812 83580 ? Ss Jan31 0:01 postgres: autovacuum worker process depesz
pgdba 13403 1.4 0.0 69224 8580 ? Ss 00:22 0:04 postgres: wal sender process postgres 127.0.0.1(52186) streaming 0/EE3D8090

As you can see, the master is still archiving WAL segments – which should continue, but you should use some cronjob to remove the no-longer-needed ones. By no-longer-needed I mean: older than “Latest checkpoint's REDO location” from pg_controldata SLAVE_DATA_DIR/.
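Figuring out which segments are “older than the REDO location” takes a little arithmetic: a segment file name is the timeline id (8 hex digits), then the high word of the LSN, then the low word divided by 16MB – each also 8 hex digits. A sketch of the conversion; lsn_to_segment is a made-up helper, not a PostgreSQL tool:

```shell
#!/bin/sh
# Convert an LSN as printed by pg_controldata (e.g. "0/EE2B7FD8") plus a
# timeline id into the name of the WAL segment containing that location.
lsn_to_segment() {
    lsn=$1 timeline=$2
    hi=${lsn%/*}        # part before the slash
    lo=${lsn#*/}        # part after the slash
    # Each segment holds 16MB = 0x1000000 bytes, so the third field
    # is the low word divided by the segment size.
    printf '%08X%08X%08X\n' "$timeline" "0x$hi" $(( 0x$lo / 0x1000000 ))
}

lsn_to_segment 0/EE2B7FD8 1   # -> 0000000100000000000000EE
```

Anything in the archive directory that sorts lower than this name can go. (If memory serves, contrib in 9.0 also gained pg_archivecleanup, which automates exactly this.)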

Now, let's see how it really works.

I stopped my load generator – thanks to this there will be basically no traffic on the database – as soon as the autovacuums finish :)

Afterwards, I did a simple test. I have this table:

# \d test
Table "public.test"
Column | Type | Modifiers
--------+--------------------------+-----------
x | timestamp with time zone | not null
Indexes:
"test_pkey" PRIMARY KEY, btree (x)

First test:

=$ /home/pgdba/work/bin/psql -U depesz -p 5850 -c "insert into test (x) values (now()) returning x"
x
-------------------------------
2010-02-01 00:33:29.543639+01
(1 row)
 
INSERT 0 1
 
=$ /home/pgdba/work/bin/psql -U depesz -p 5851 -c "select * from test"
x
---
(0 rows)

Of course I don't see it – streaming replication is fast, but not that fast :)

But adding even 1 second of delay worked perfectly:

=$ /home/pgdba/work/bin/psql -U depesz -p 5850 -c "insert into test (x) values (now()) returning x"
x
-------------------------------
2010-02-01 00:35:13.520617+01
(1 row)
 
INSERT 0 1
 
=$ sleep 1
 
=$ /home/pgdba/work/bin/psql -U depesz -p 5851 -c "select * from test order by x desc limit 1"
x
-------------------------------
2010-02-01 00:35:13.520617+01
(1 row)

(These commands were copy-pasted into the terminal, so there was no typing-induced delay.)

So, as we can see it works.
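The lag can also be put in numbers – compare pg_current_xlog_location() on the master with pg_last_xlog_replay_location() on the slave, and subtract. An LSN “H/L” is simply byte position H·2^32 + L of the WAL stream. A sketch of the arithmetic; lsn_to_bytes is a made-up helper, and the sample LSNs are the ones from the ps output above:

```shell
#!/bin/sh
# In a live setup the two LSNs would come from:
#   psql -p 5850 -c "SELECT pg_current_xlog_location()"       (master)
#   psql -p 5851 -c "SELECT pg_last_xlog_replay_location()"   (slave)
# An LSN "H/L" is the byte position H * 2^32 + L in the WAL stream,
# so lag in bytes is just the difference of two such values.
lsn_to_bytes() {
    echo $(( 0x${1%/*} * 4294967296 + 0x${1#*/} ))
}

master=0/EE3D8090
slave=0/EE2B7FD8
echo "lag: $(( $(lsn_to_bytes $master) - $(lsn_to_bytes $slave) )) bytes"
```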

Is it ready? Definitely not. For starters – the docs are far from usable. If you'd like to play with it yourself, you can read the docs, but you should really check the Wiki page.

Do I like it? YEAH! Even with unfinished docs, the setup did not take long, and the effect is well worth some sweat :)


  2. Feb 1, 2010

    Another fantastic post! Thanks for making this demo :)

  3. Feb 2, 2010

    Nice article – hot standby is a much needed feature in Postgresql. One thing you might want to consider is using rsync in place of cp. rsync does an atomic copy, so there’s no chance of the other process picking up a partially written file. I use something like this:

    archive_command = 'rsync %p /var/lib/pgsql/archives/%f'

  4. # David Thomas
    Feb 2, 2010

    Awesome news! Thanks for the write-up.

  5. # Matthew
    Feb 13, 2010

    This feature alone is why I plan on testing 9.0 as early as possible so that I can start using this wonderful feature. My thanks to everyone involved!

  6. # Florian Helmberger
    Feb 15, 2010

    Does anyone know how this exactly works? Does it still apply blocks (like a warm standby) or does it work on query level (like an Oracle logical standby database)?

  7. Feb 15, 2010

    @Florian:
    It is still wal replication. But instead of waiting for 16MB of data to be send, data are being sent more or less in real time.

  8. # Alexandre Arruda
    Mar 1, 2010

    I’m so disappointed about the miss of a TRANSPARENT Table Partitioning in PG 9.0. :(
    Table Partitioning is great to speed tunning large databases.

  9. # raghav
    May 21, 2010

    Nice demo on Hotstandby. Any strong reason why we not using the pg_standby command here.

  10. # Ferdy
    Aug 19, 2010

    Currently I use PostgreSql 8.2 together with slony-I for partial replication of a set of tables at the master to the database of the slave.

    Is it possible to use this type of replication built in PostgreSql 9 for partial replication too?

  11. Aug 19, 2010

    @Ferdy:
    no. whole WAL is being sent to slave, which means all databases, and all tables from your pg instance.

  12. # Muhammad Amer Siddique
    Sep 8, 2010

    Hi, Nice article.I tried hot standby and streaming replication. That’s nice feature but lacks one thing dumping of data (pg_dump) from standby server. Is there any option to perform pg_dump on streaming server.

  13. # shrikant
    Dec 12, 2011

    Can you let me know the contents of trigger file

  14. Dec 12, 2011

    @Shrikant:
    it’s just a *trigger* file – its content is not important.

  15. # edo
    Apr 6, 2012

    How fast must be backup server?
    Can we make backup from host with powerful disk subsystem to slave with regular SATA disk, for example?

  16. Apr 6, 2012

    @Edo: usually it will work. check for yourself, though, if you want definitive answer.
