March 18th, 2011 by depesz | Tags: , , , , , , , | 26 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

On 6th of March, Simon Riggs committed patch:

Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.
 
This version has very strict behaviour; more relaxed options
may be added at a later date.
 
Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.

Before I will go into details, let's first define what synchronous replication is, and what it's not.

It is:

  • solution that prevents loss of data in case of failure of master database server

and it is not:

  • solution for load balancing
  • solution for having multiple masters

What's important – using synchronous replication is slower than not using it. So you should be really willing to take the cost for benefits.

As for benefits.

Virtually all replication systems we have now are so called asynchronous. That is – when data is written to master, after some time, it will appear on slave(s). Usually the time is short, but it is not impossible to have asynchronous replication with lag time of hours.

This situation (replication lag) has one drawback. When data is committed on master, and immediately the master will be “killed" there is no warranty that the data will be on slave(s). Generally lag time of “x" means that we are ok with loosing data that was committed in this “x" time.

Of course replication system do guarantee consistency, but data loss can happen.

Synchronous replication is solution for those cases where data loss cannot happen.

To understand how it works you have to understand streaming replication.

With streaming replication, as soon as data is written to WAL segment, it is also sent to all attached slave systems.

Synchronous replication modifies this setup in such a way that given transaction can request that “when I commit, I want to be sure that wal data has been sent to slave, and it confirmed that it got it".

Please note that I mentioned “given transaction". Setting whether you will be using normal streaming replication (asynchronous) or its synchronous variant is set by transaction. Of course you can set your default to always do synchronous replication, but it can be modified on per-transaction basis.

Thanks to this you can have slower, but more failproof handling of critical data, and faster, but not as safe modifying of data that is not that important.

So. How to set it up.

To playtest, I will need 3 Pg instances. Master and 2 slaves. So:

First, need to prepare my playground, so:

sh> mkdir /var/tmp/depesz
sh> cd /var/tmp/depesz/
sh> mkdir -m 700 master
sh> initdb -D master/ -E UTF8
sh> echo "host replication all 127.0.0.1/32 trust" >> master/pg_hba.conf
sh> vim master/postgresql.conf

In postgresql.conf, I set following parameters:

  • listen_addresses => *
  • port => 4001
  • logging_collector => on
  • wal_level => hot_standby
  • synchronous_standby_names => *
  • max_wal_senders => 5
  • wal_keep_segments => 50
  • archive_mode => on
  • archive_command => ‘/bin/true'

Afterwards:

sh> pg_ctl -D master/ start

And I have my test master server ready. For replication, I'll use special user:

sh> psql -p 4001 -d postgres -c "create user replication with replication"

all done.

So, let's make the slaves:

sh> psql -p 4001 -d postgres -c "select pg_start_backup('x')"
sh> rsync -a master/ slave1/
sh> rsync -a master/ slave2/
sh> psql -p 4001 -d postgres -c "select pg_stop_backup()"
sh> rm slave{1,2}/postmaster.pid slave{1,2}/pg_xlog/0* slave{1,2}/pg_xlog/archive_status/* slave{1,2}/pg_log/*
sh> vim -o slave{1,2}/postgresql.conf

In those configs, I change port number to 4011 for slave1 and 4021 for slave2, and enable hot_standby (hot_standby = on).

Finally, in both slaves, I create file recovery.conf with the same content:

restore_command = '/bin/false'
standby_mode = 'on'
primary_conninfo = 'port=4001 host=127.0.0.1 user=replication'

Afterwards I start both slaves:

sh> pg_ctl -D slave1/ start
sh> pg_ctl -D slave2/ start

And, I can verify that it all works:

sh> paste -d, {master,slave{1,2}}/postmaster.pid | head -n 1 | xargs -I{} ps uwf --ppid {} --pid {}
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
depesz 15504 0.0 0.0 58864 6212 pts/3 S 12:58 0:00 /home/pgdba/work/bin/postgres -D slave2
depesz 15505 0.0 0.0 25096 688 ? Ss 12:58 0:00 \_ postgres: logger process
depesz 15506 0.0 0.0 58932 1412 ? Ss 12:58 0:00 \_ postgres: startup process recovering 00000001000000000000000A
depesz 15509 0.0 0.0 75916 4112 ? Ss 12:58 0:00 \_ postgres: wal receiver process streaming 0/A0000B0
depesz 15511 0.0 0.0 58864 1012 ? Ss 12:58 0:00 \_ postgres: writer process
depesz 15512 0.0 0.0 27192 896 ? Ss 12:58 0:00 \_ postgres: stats collector process
depesz 15490 0.0 0.0 58868 6208 pts/3 S 12:58 0:00 /home/pgdba/work/bin/postgres -D slave1
depesz 15491 0.0 0.0 25100 688 ? Ss 12:58 0:00 \_ postgres: logger process
depesz 15492 0.0 0.0 58936 1408 ? Ss 12:58 0:00 \_ postgres: startup process recovering 00000001000000000000000A
depesz 15495 0.0 0.0 75920 4108 ? Ss 12:58 0:00 \_ postgres: wal receiver process streaming 0/A0000B0
depesz 15497 0.0 0.0 58868 1008 ? Ss 12:58 0:00 \_ postgres: writer process
depesz 15498 0.0 0.0 27196 892 ? Ss 12:58 0:00 \_ postgres: stats collector process
depesz 14832 0.0 0.0 58840 6256 pts/6 S 12:51 0:00 /home/pgdba/work/bin/postgres -D master
depesz 14833 0.0 0.0 25104 688 ? Ss 12:51 0:00 \_ postgres: logger process
depesz 14835 0.0 0.0 58948 1688 ? Ss 12:51 0:00 \_ postgres: writer process
depesz 14836 0.0 0.0 58840 1008 ? Ss 12:51 0:00 \_ postgres: wal writer process
depesz 14837 0.0 0.0 59700 2112 ? Ss 12:51 0:00 \_ postgres: autovacuum launcher process
depesz 14838 0.0 0.0 27200 860 ? Ss 12:51 0:00 \_ postgres: archiver process last was 000000010000000000000009.00000020.backup
depesz 14839 0.0 0.0 27200 1020 ? Ss 12:51 0:00 \_ postgres: stats collector process
depesz 15496 0.0 0.0 59676 2468 ? Ss 12:58 0:00 \_ postgres: wal sender process replication 127.0.0.1(37690) streaming 0/A0000B0
depesz 15510 0.0 0.0 59676 2468 ? Ss 12:58 0:00 \_ postgres: wal sender process replication 127.0.0.1(37691) streaming 0/A0000B0

As we can see both slaves did connect to master as streaming replication slaves, and are up to date.

Nice. So, first test is to see if we can notice the “missing data on slave" problem. To do it, I'll simply insert now() to some test table to master, and immediately check slaves:

sh> psql -p 4001 -d postgres -c "create table test (x timestamptz primary key)"
sh> for i in {1..10}
sh> do
sh> echo "Test #$i:"
sh> psql -qAtX -p 4001 -d postgres -c "insert into test (x) values (now()) returning x"
sh> psql -qAtX -p 4011 -d postgres -c "select max(x) from test"
sh> psql -qAtX -p 4021 -d postgres -c "select max(x) from test"
sh> echo
sh> done

This test, does the insert-returning, and select max() from both slaves 10 times in a row, when ran, it showed me:

Test #1:
2011-03-18 13:06:26.399963+01
 
 
 
Test #2:
2011-03-18 13:06:26.420898+01
2011-03-18 13:06:26.399963+01
2011-03-18 13:06:26.399963+01
 
Test #3:
2011-03-18 13:06:26.434145+01
2011-03-18 13:06:26.420898+01
2011-03-18 13:06:26.420898+01
 
Test #4:
2011-03-18 13:06:26.452543+01
2011-03-18 13:06:26.434145+01
2011-03-18 13:06:26.434145+01
 
Test #5:
2011-03-18 13:06:26.470355+01
2011-03-18 13:06:26.452543+01
2011-03-18 13:06:26.452543+01
 
Test #6:
2011-03-18 13:06:26.488227+01
2011-03-18 13:06:26.470355+01
2011-03-18 13:06:26.470355+01
 
Test #7:
2011-03-18 13:06:26.506338+01
2011-03-18 13:06:26.488227+01
2011-03-18 13:06:26.488227+01
 
Test #8:
2011-03-18 13:06:26.524422+01
2011-03-18 13:06:26.506338+01
2011-03-18 13:06:26.506338+01
 
Test #9:
2011-03-18 13:06:26.542727+01
2011-03-18 13:06:26.524422+01
2011-03-18 13:06:26.524422+01
 
Test #10:
2011-03-18 13:06:26.56041+01
2011-03-18 13:06:26.542727+01
2011-03-18 13:06:26.542727+01

Interesting. At no times slaves showed the same value as master.

But now, let's repeat it, but first change the master query from

insert into test (x) values (now()) returning x

to

set synchronous_replication = on; insert into test (x) values (now()) returning x

Results:

Test #1:
2011-03-18 13:10:15.920473+01
2011-03-18 13:10:15.920473+01
2011-03-18 13:10:15.920473+01
 
Test #2:
2011-03-18 13:10:15.958886+01
2011-03-18 13:10:15.958886+01
2011-03-18 13:10:15.958886+01
 
Test #3:
2011-03-18 13:10:15.994531+01
2011-03-18 13:10:15.994531+01
2011-03-18 13:10:15.994531+01
 
Test #4:
2011-03-18 13:10:16.030824+01
2011-03-18 13:10:16.030824+01
2011-03-18 13:10:16.030824+01
 
Test #5:
2011-03-18 13:10:16.066797+01
2011-03-18 13:10:16.066797+01
2011-03-18 13:10:16.066797+01
 
Test #6:
2011-03-18 13:10:16.102643+01
2011-03-18 13:10:16.102643+01
2011-03-18 13:10:16.102643+01
 
Test #7:
2011-03-18 13:10:16.138506+01
2011-03-18 13:10:16.138506+01
2011-03-18 13:10:16.138506+01
 
Test #8:
2011-03-18 13:10:16.174385+01
2011-03-18 13:10:16.174385+01
2011-03-18 13:10:16.174385+01
 
Test #9:
2011-03-18 13:10:16.210709+01
2011-03-18 13:10:16.210709+01
2011-03-18 13:10:16.210709+01
 
Test #10:
2011-03-18 13:10:16.246883+01
2011-03-18 13:10:16.246883+01
2011-03-18 13:10:16.246883+01

Whoa, that's great! All transactions (and since I didn't use begin/commit – all commands) will wait with finish till data is replicated.

There is just one small issue. Speed.

Let's redo the test, ignoring output, but checking how long it will take.

First, without synchronous_replication. 3 runs (each inserting 1000 rows, each row with separate insert). All 3 runs the same time – 1.8s.

Now, the same test, but first, I'll do this:

sh> psql -d postgres -p 4001 -c "alter database postgres set synchronous_replication = on"

Thus effectively enforcing synchronous replication for all transactions (unless some transaction would explicitly disable it.

Again, 3 runs, each inserting 1000 rows. Times: 2.4s, 2.394s and 2.4s.

Please note that in my example, I have all 3 servers on the same server, so there is virtually no overhead of tcp/ip and network.

In certain cases it might be preferred to have more than 1 slave, but only 1 being synchronous. I.e. in my case – let's say that slave1 has to be synchronous, but slave2 – I don't care.

Thanks to synchronous_standby_names option, it's actually possible.

First, I need to make sure my slave servers will properly introduce themselves. I do it by changing primary_conninfo in their recovery.conf files to:

slave1/recovery.conf:primary_conninfo = 'port=4001 host=127.0.0.1 user=replication application_name=slave1'
slave2/recovery.conf:primary_conninfo = 'port=4001 host=127.0.0.1 user=replication application_name=slave2'

I.e. the change is addition of application_name parameter.

After slaves restart, I can verify on master, that the names are now properly seen by replication system:

sh> psql -p 4001 -d postgres -c "select procpid, application_name, state from pg_stat_replication"
procpid | application_name | state
---------+------------------+-----------
22196 | slave1 | STREAMING
22219 | slave2 | STREAMING
(2 rows)

Now, in masters postgresql.conf, I change synchronous_standby_names from ‘*' to ‘slave1′:

sh> psql -p 4001 -d postgres -c "show synchronous_standby_names"
synchronous_standby_names
---------------------------
slave1
(1 row)

So, let's repeat now the test with inserts and selects in synchronous mode:

sh> for i in {1..10}
sh> do
sh> echo "Test #$i:"
sh> psql -qAtX -p 4001 -d postgres -c "set synchronous_replication = on
sh> insert into test (x) values (now()) returning x"
sh> psql -qAtX -p 4021 -d postgres -c "select max(x) from test"
sh> psql -qAtX -p 4011 -d postgres -c "select max(x) from test"
sh> echo
sh> done

Results:

Test #1:
2011-03-18 14:04:30.973902+01
2011-03-18 14:04:30.973902+01
2011-03-18 14:04:30.973902+01
 
Test #2:
2011-03-18 14:04:31.013951+01
2011-03-18 14:04:31.013951+01
2011-03-18 14:04:31.013951+01
 
Test #3:
2011-03-18 14:04:31.06333+01
2011-03-18 14:04:31.06333+01
2011-03-18 14:04:31.06333+01
 
Test #4:
2011-03-18 14:04:31.099386+01
2011-03-18 14:04:31.099386+01
2011-03-18 14:04:31.099386+01
 
Test #5:
2011-03-18 14:04:31.134893+01
2011-03-18 14:04:31.134893+01
2011-03-18 14:04:31.134893+01
 
Test #6:
2011-03-18 14:04:31.171247+01
2011-03-18 14:04:31.171247+01
2011-03-18 14:04:31.171247+01
 
Test #7:
2011-03-18 14:04:31.207192+01
2011-03-18 14:04:31.171247+01
2011-03-18 14:04:31.207192+01
 
Test #8:
2011-03-18 14:04:31.236442+01
2011-03-18 14:04:31.236442+01
2011-03-18 14:04:31.236442+01
 
Test #9:
2011-03-18 14:04:31.27299+01
2011-03-18 14:04:31.27299+01
2011-03-18 14:04:31.27299+01
 
Test #10:
2011-03-18 14:04:31.309419+01
2011-03-18 14:04:31.309419+01
2011-03-18 14:04:31.309419+01

Please note that:

  • I query slave2 (-p 4021) first, and slave1 second (this is because I'm testing it all on single host, and the network lag is 0, so I can't show it nicely otherwise)
  • Results for test #7

Usually both slaves get the same data, but sometimes, the one that is not listed in synchronous_standby_names will be visibly lagged.

There is one important problem though.

If none of your listed synchronous_standby_names slave servers is available (which includes case where synchronous_standby_names is ‘*' and all slaves are unavailable) all your synchronous transactions will hang.

This is by design, as they have to wait for confirmation from slaves, and if the slaves are not available – they can't confirm.

All in all – it's a great feature, that many have been waiting for.

  1. 26 comments

  2. Mar 18, 2011

    Any particular reason you’re not using pg_basebackup to set up your slaves? Just wondering if we’re missing some critical feature you think is needed there ;)

  3. Mar 18, 2011

    @Magnus:
    I’m just more used to rsync. Besides – it’s just 2 directories on the same machine, using pg_basebackup would be overkill :)

  4. # gj
    Mar 18, 2011

    great write up. Thanks :)

  5. Mar 18, 2011

    But wait a minute!

    Synchronous replication does NOT guarantee that the change will be *visible* on the slave at the time the commit completes on the master. It merely guarantees that the WAL will have been transferred to the slave and written durably to disk. It’s unsafe to rely on the assumption that it will have also been replayed.

  6. Mar 18, 2011

    @Robert:
    Thanks for clarification. The distinction is pretty important.

  7. # Mark Lillywhite
    Mar 18, 2011

    Hi, I’m curious about your comment regarding latency being lower because there is no TCP traffic in your example. But in your example there will instead be contention for CPU and disk, so wouldn’t it be quite possible that having separate machines for the slave might actually result in better performance despite having to marshal everything via TCP? I mean one would presumably be running gigabit Ethernet between the machines… Latency is pretty low, no?

  8. Mar 18, 2011

    @Mark:
    usually, aside from test-environment, I don’t imagine anyone running master and slave on the same server, so there is no real comparison.

  9. # AJ
    Jun 7, 2011

    The commands on master are hanging:
    psql -p 4001 -d postgres -c “create user replication with replication”
    psql -p 4001 -d postgres -c “select pg_start_backup(‘x’)”

    When I kill them, I get the message:

    WARNING: canceling wait for synchronous replication due to user request
    DETAIL: The transaction has already committed locally, but may not have been replicated to the standby.

  10. Jun 7, 2011

    @AJ:
    And? Perhaps you should ask on some mailing list, or even report a bug?

  11. # AJ
    Jun 7, 2011

    @DEPESZ:
    Do you remember it hanging for you when you tried out the steps ?
    Maybe it is expected to hang as slave(s) is not up when these commands are executed.

  12. Jun 7, 2011

    No, it didn’t hang for me.

  13. # dohan
    Sep 19, 2011

    What are the changes need to be done if I setup the slave1 in another machine?

  14. Sep 19, 2011

    @Dohan:
    hmm .. I thought it’s obvious. You need to add different IP in pg_hba.conf for replication, and change ip in primary_conninfo on slave.

  15. # dohan
    Sep 20, 2011

    Ok tx I did it. Now I setup the cluster in three machines each having one pg instance. I am going to test the failover scenario. I used trigger_file setting to make a slave to accept read/writes when the master goes down. Currently only that setting not work because there is no slaves configured in the new master as old one. So it hangs at a INSERT. How can I deal with this? And how can I start old master safely as a slave without stopping current master?

  16. Sep 20, 2011

    @Dohan:
    When you switched slave to become master, it’s basically standalone pg. So you have to setup replication from scratch, just the other way around.

  17. # LOU
    Dec 7, 2011

    @# AJ
    the master can hang when the slave is in catchup mode after starting and cannot get the WAL file from the master pg_xlog or archive (00000001000000000000005C in this case)

    postgres 9537 0.0 0.0 459548 1764 ? Ss 15:28 0:00 postgres: startup process waiting for 00000001000000000000005C

    this results in an error entry in the current log in pg_log

  18. # ManG
    Jan 11, 2012

    Thanks for an awesome article on getting PG sync replication setup and running in minutes.

    Here are my questions.

    1. I don’t think synchronous_replication variable is any more valid. At least PG 9.1.2 doesn’t have it. Synchronous_commit is on by default.

    2. Can I have sync replication with standby “*” setup as above and master node run without any standby ?

    3. I want to configure timeout so that if standby sync takes more than few seconds, master wants to break the sync and continue on its own.

    Help me with the above.

    thanks,
    ManG

  19. Jan 11, 2012

    @Mang:
    1. yes. the synchronous_replication guc has been removed, and it’s logic is now in sunchronous_commit.

    2. why don’t you just try it?

    3. i don’t think it’s possible.

  20. # ManG
    Jan 13, 2012

    Hi,

    I have a PG 9.1.2 Master Slave with synchronous replication setup. They are all working fine as expected. I do have a case where I want to flip Master to non replication mode whenever its slave is not responding. I have set replication_timeout to 5s and whenever salve is not responding for for more than 5s, i see the master detecting it. But, the transactions on the master is stuck till the slave comes back. To get over it, I reloaded the config on master with synchronous_commit = local. Further transactions on the master are going thru fine with this local commits turned on.

    Here are my questions:

    1. Transaction which was stuck right when slave going away never went thru even after I reloaded master’s config with local commit on. I do see all new transactions on master are going thru fine, except the one which was stuck initially. How to get this stuck transaction complete or return with error.

    2. Whenever there is a problem with slave, I have to manually reload master’s config with local commit turned on to get master go forward. Is there any automated way to reload this config with local commit on on slave’s unresponsiveness ? tcp connection timeouts, replication timeouts all detect the failures, but i want to run some corrective action on these failure detection.

    your help is much appreciated.

    thanks.

  21. # ManG
    Jan 16, 2012

    Hi Depesz,

    your views on my question dated ‘Jan 13, 2012′ please ? Any help would be much appreciated.

    thanks,
    Manoj

  22. Jan 16, 2012

    @ManG: these were more of a “Support” questions. While it might look like, I don’t provide support for PostgreSQL on my blog. I might sometimes answer questions, but your questions are either impossible to answer without knowing a lot of details about your situation, or trivial to check on your own – so I don’t see much point in doing the check on my side, and providing you with ready answer – it’s a matter who puts effort into checking thing.

  23. # JWolf
    Apr 16, 2012

    Our application is written in Java and uses standard JDBC driver API ( behind the scenes postgres jdbc provider) to connect to postgres.

    I am wondering how to turn on and off synchronous_replication for some transactions only.
    This way I can have the replication enabled for some important tables and not for other tables.

    Do you happen to know how to set this when using jdbc drivers synchronous_replication

    In your example, I see that you are using psql to set the same. But wondering how to set this in the application that typically does not use psql.

  24. Apr 16, 2012

    @JWolf:

    Well, SET … is just an sql query. So you can run it the same way you run any other query. No idea how that works in JDBC, but I don’t see why it would be different than (for example) calling update or delete.

  25. # JWolf
    Apr 16, 2012

    makes sense. Thanks,
    Any plans on writing a blog on horizontal shards using postgres

  26. Apr 16, 2012

    @JWolf:
    hmnmm … like this one: http://www.depesz.com/2011/12/02/the-secret-ingredient-in-the-webscale-sauce/

  27. # JWolf
    Apr 17, 2012

    Thanks! nicely written. Your blogs are way cool!

Leave a comment