Major-version upgrading with minimal downtime

There is this idea that from 9.5 we will be able to upgrade pg clusters to newer, major, version, without downtime (well, almost), thanks to magic of logical replication.

Sounds very promising, so I figured – I'll test it. To some extent at least.

First, I need some 9.5 database. This is relatively simple, luckily. Got myself an ubuntu box with these packages:

# dpkg -l | grep postgresql
ii  pgdg-keyring                       2014.1                                 all          keyring for apt.postgresql.org
ii  postgresql-9.5                     9.5.4-1.pgdg14.04+1                    amd64        object-relational SQL database, version 9.5 server
ii  postgresql-9.5-dbg                 9.5.4-1.pgdg14.04+1                    amd64        debug symbols for postgresql-9.5
ii  postgresql-9.5-pg-collkey          0.5.1-1insops1                         amd64        ICU collation function wrapper for PostgreSQL 9.5
ii  postgresql-9.5-postgis-2.2         2.2.2+dfsg-4.pgdg14.04+1               amd64        Geographic objects support for PostgreSQL 9.5
ii  postgresql-9.5-postgis-2.2-scripts 2.2.2+dfsg-4.pgdg14.04+1               all          Geographic objects support for PostgreSQL 9.5 -- scripts
ii  postgresql-9.5-postgis-scripts     2.2.2+dfsg-4.pgdg14.04+1               all          transitional dummy package
ii  postgresql-client-9.5              9.5.4-1.pgdg14.04+1                    amd64        front-end programs for PostgreSQL 9.5
ii  postgresql-client-common           175.pgdg14.04+1                        all          manager for multiple PostgreSQL client versions
ii  postgresql-common                  175.pgdg14.04+1                        all          PostgreSQL database-cluster manager
ii  postgresql-contrib-9.5             9.5.4-1.pgdg14.04+1                    amd64        additional facilities for PostgreSQL
ii  postgresql-plperl-9.5              9.5.4-1.pgdg14.04+1                    amd64        PL/Perl procedural language for PostgreSQL 9.5

Since this is a test installation, I don't have anything there, so let's quickly add some test database and table and perhaps even some data 🙂

postgres@h3po4:~$ createuser test
postgres@h3po4:~$ createdb -O test test

To make sure that there is something happening in the database, I wrote couple of simple scripts:

-- create.sql
CREATE TABLE base_data (
    id serial PRIMARY KEY,
    VALUE INT4 NOT NULL DEFAULT 0
);
CREATE TABLE data_changes (
    id serial PRIMARY KEY,
    data_id INT8 NOT NULL REFERENCES base_data (id),
    changed_when timestamptz,
    change_from INT4,
    change_to INT4
);
INSERT INTO base_data (id, VALUE)
    SELECT i, 0 FROM generate_series(1,1000) i;
SELECT SETVAL('base_data_id_seq', 1001);
-- load-single.sql
WITH random_data AS (
    SELECT
        1 + FLOOR(random() * 1000) AS random_id,
        FLOOR(random() * 201 - 100) AS random_change
), data_update AS (
    UPDATE base_data AS bd
        SET
            VALUE = bd.value + r.random_change
        FROM
            random_data r
        WHERE
            bd.id = r.random_id
    returning bd.id, r.random_change, bd.value
), insert_log AS (
    INSERT INTO data_changes (data_id, changed_when, change_from, change_to)
    SELECT
        du.id, now(), du.value - du.random_change, du.value
    FROM data_update du
    returning id
)
DELETE FROM data_changes WHERE id < ( SELECT MIN(id - 10000) FROM insert_log );

What the first file does is pretty self-explanatory.

The other file (load-single.sql) is a bit more complex, so let me explain step by step what it does:

  • random_data cte – generates two random integers – id of row in base_data to change, and how much to change the value in base_data – randomized in range -100..100
  • data_update cte – updates the row in base_data using random values from random_data
  • insert_log cte – inserts row into data_changes log that specifies which row was update, from which value, to which value, and when.
  • delete at the end – removes old rows from data_changes – where old is defined as older than latest 10,000 rows

With this in place, I create one more file, but it's content will be boring – single line repeated 1000 times:

postgres@h3po4:~$ seq 1 1000 | sed 's/.*/\\i load-single.sql/' > load-multiple.sql

With this in place, I start 2 concurrent copies of:

postgres@h3po4:~$ while true; do echo "$( date ) : $( /usr/bin/time -f %E psql -U test -qAtX -f load-multiple.sql 2>&1 )"; done

This will generate some load on the database, and show me simple stats, like:

Tue Nov  8 15:18:01 CET 2016 : 0:03.63
Tue Nov  8 15:18:05 CET 2016 : 0:03.70
Tue Nov  8 15:18:09 CET 2016 : 0:03.57
Tue Nov  8 15:18:12 CET 2016 : 0:03.64

Now, let's setup pglogical. The first step, is obviously installing it.

Now, it's time to install the magic tool: pglogical. It can be downloaded by apt, so let's try. On both servers:

=$ echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ $( lsb_release -cs )-2ndquadrant main" | sudo tee /etc/apt/sources.list.d/pglogical.list
deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ trusty-2ndquadrant main
 
=$ wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | sudo apt-key add -
OK
 
=$ sudo apt-get update

Now, depending on which box I'm running it on, I do either:

root@h3po4:~# apt-get install postgresql-9.5-pglogical

or

root@krowka:~# apt-get install postgresql-9.6-pglogical

(to remind: h3po4 is 9.5 host, and krowka has pg 9.6.

Now, I need some setup on 9.5 instance:

postgres@h3po4:~$ psql -d test -c "select name, setting from pg_settings where name in ('wal_level', 'max_worker_processes', 'max_replication_slots', 'max_wal_senders', 'shared_preload_libraries');"
           name           |  setting  
--------------------------+-----------
 max_replication_slots    | 10
 max_wal_senders          | 10
 max_worker_processes     | 8
 shared_preload_libraries | pglogical
 wal_level                | logical
(5 ROWS)

on 9.6 I wouldn't need it all – max_replication_slots and max_wal_senders are not required, but since I can set them up already, there is no reason to skip it already:

postgres@krowka:~$ psql -c "select name, setting from pg_settings where name in ('wal_level', 'max_worker_processes', 'max_replication_slots', 'max_wal_senders', 'shared_preload_libraries');"
           name           |  setting
--------------------------+-----------
 max_replication_slots    | 10
 max_wal_senders          | 10
 max_worker_processes     | 8
 shared_preload_libraries | pglogical
 wal_level                | logical
(5 ROWS)

Now, I need to make sure all global objects (users and tablespaces generally) are copied over, so I could go with pg_dumpall -g on 9.5, transfer it to 9.6, and load there, but since I have (for now) just one user, and one database, I can simply:

postgres@krowka:~$ createuser test
postgres@krowka:~$ createdb -O test test

With this in place, I have both user and database created in both places, but tables, and data are only in 9.5.

Logical replication will require special user, and this user also has to be superuser. So let's make it on both servers:

postgres@h3po4:~$ createuser -s --replication logical_replication
postgres@krowka:~$ createuser -s --replication logical_replication

Since this is just test, and I don't really care about security, I'll just make it possible to log using logical_replication with trust:

local  replication  logical_replication                trust
host   replication  logical_replication  127.0.0.1/32  trust
host   replication  logical_replication  0.0.0.0/0     trust
local  test         logical_replication                trust
host   test         logical_replication  127.0.0.1/32  trust
host   test         logical_replication  0.0.0.0/0     trust

Both of my hosts are in 172.28.173.x network, but I'm lazy 🙂

To make next examples more clear, let me show the ips, as I will be using host ip numbers, and not hotnames:

  • h3po4 (9.5) – 172.28.173.18
  • krowka (9.6) – 172.28.173.3

And now, finally we get to play with pglogical itself:

postgres@h3po4:~$ psql -d test -c 'create extension pglogical'
CREATE EXTENSION
 
postgres@krowka:~$ psql -d test -c 'create extension pglogical'
CREATE EXTENSION
 
postgres@h3po4:~$ psql -d test -c "SELECT pglogical.create_node( node_name := 'provider', dsn := 'host=172.28.173.18 port=5432 dbname=test user=logical_replication' );"
 create_node 
-------------
  3171898924
(1 ROW)

Now, I need to add all tables to the replication. Since my test schema has only two tables, and the point is to replicate them, I can simply:

postgres@h3po4:~$ psql -d test -c "SELECT pglogical.replication_set_add_all_tables('default', '{public}'::text[]);"
 replication_set_add_all_tables 
--------------------------------
 t
(1 ROW)

and now, on slave:

postgres@krowka:~$ psql -d test -c "SELECT pglogical.create_node( node_name := 'subscriber', dsn := 'host=172.28.173.3 port=5432 dbname=test user=logical_replication' );"
 create_node
-------------
  2941155235
(1 ROW)
 
postgres@krowka:~$ psql -d test -c "SELECT pglogical.create_subscription( subscription_name := 'subscription', provider_dsn := 'host=172.28.173.18 port=5432 dbname=test user=logical_replication');"
 create_subscription
---------------------
          2875150205
(1 ROW)

aaand .. nothing.

Well, not really nothing. I see pglogical processes:

postgres@h3po4:~$ ps uxf  | grep [l]ogical
postgres  6079  0.0  0.0 300632  5640 ?        Ss   15:57   0:00  \_ postgres: bgworker: pglogical supervisor   
postgres  6134  0.0  0.0 301168  9300 ?        Ss   15:57   0:00  \_ postgres: bgworker: pglogical manager 16385   
postgres  6386  4.6  0.1 304840 15784 ?        Ss   15:57   0:19  \_ postgres: wal sender process logical_replication 172.28.173.3(51816) idle

and on 9.6 too:

postgres@krowka:~$ ps uxf | grep [l]ogical
postgres 20795  0.0  0.0 311200  5640 ?        Ss   15:57   0:00  \_ postgres: 9.6/main: bgworker: pglogical supervisor
postgres 20799  0.0  0.0 311720 11124 ?        Ss   15:57   0:00  \_ postgres: 9.6/main: bgworker: pglogical manager 16451
postgres 21014  0.0  0.0 318188 16332 ?        Ss   15:57   0:00  \_ postgres: 9.6/main: bgworker: pglogical apply 16451:2875150205

but the tables are not there on 9.6. And since there is no tables, there is no data.

Then I remembered a thing that is in docs

The event trigger facility can be used for describing rules which define replication sets for newly created tables.

This suggests that if I'll just do, what I did, it will not work, because DDL (creation of tables) is by default not replicated.

So, I dropped replication, dropped the test database on 9.6, recreated it, but then did in there, from test account:

CREATE TABLE base_data (
    id serial PRIMARY KEY,
    VALUE INT4 NOT NULL DEFAULT 0
);
CREATE TABLE data_changes (
    id serial PRIMARY KEY,
    data_id INT8 NOT NULL REFERENCES base_data (id),
    changed_when timestamptz,
    change_from INT4,
    change_to INT4
);

That is create table commands from create.sql.

Then, I repeated:

postgres@krowka:~$ psql -d test -c 'create extension pglogical'
postgres@krowka:~$ psql -d test -c "SELECT pglogical.create_node( node_name := 'subscriber', dsn := 'host=172.28.173.3 port=5432 dbname=test user=logical_replication' );"
postgres@krowka:~$ psql -d test -c "SELECT pglogical.create_subscription( subscription_name := 'subscription', provider_dsn := 'host=172.28.173.18 port=5432 dbname=test user=logical_replication');"

And almost immediately I saw the data:

postgres@krowka:~$ psql -U test -c 'select count(*) from base_data;'
 COUNT
-------
  1000
(1 ROW)

So, let's see if the data is replicated correctly. I did stop load generator scripts, and waited for replication lag to drop to 0:

postgres@h3po4:~$ psql -d test -c "select pg_current_xlog_location() - replay_location from pg_stat_replication where usename = 'logical_replication' and client_addr = '172.28.173.3'"
 ?COLUMN? 
----------
        0
(1 ROW)

Now, let's see some basic stats, and perhaps some simple data comparison:

postgres@h3po4:~$ printf "select count(*), sum(value) from base_data; select count(*), sum(data_id), sum(change_from), sum(change_to) from data_changes;" | psql -d test
 count |  sum  
-------+-------
  1000 | -6704
(1 row)
 
 count |   sum   |  sum   |  sum   
-------+---------+--------+--------
 10001 | 4999425 | -24505 | -31308
(1 row)

and on 9.6:

postgres@krowka:~$ printf "select count(*), sum(value) from base_data; select count(*), sum(data_id), sum(change_from), sum(change_to) from data_changes;" | psql -d test
 COUNT |  SUM
-------+-------
  1000 | -6704
(1 ROW)
 
 COUNT |   SUM    |  SUM   |  SUM
-------+----------+--------+--------
 20002 | 10022125 | -10975 | -17040
(1 ROW)

Whoa. That doesn't look good. Data in base_data looks OK, which I can also check with checksums:

postgres@h3po4:~$ psql -d test -qAtX -c "select * from base_data order by id" | md5sum -
958651d1b69a1aff7d7cd6b71622765d  -
 
postgres@krowka:~$ psql -d test -qAtX -c "select * from base_data order by id" | md5sum -
958651d1b69a1aff7d7cd6b71622765d  -

But why is data_changes wrong?

Quick sanity check:

postgres@h3po4:~$ psql -d test -c "select min(id), max(id) from data_changes"
   MIN   |   MAX   
---------+---------
 1397920 | 1407920
(1 ROW)
 
postgres@krowka:~$ psql -d test -c "select min(id), max(id) from data_changes"
   MIN   |   MAX
---------+---------
 1105628 | 1407920
(1 ROW)
 
postgres@krowka:~$ printf "select count(*), sum(data_id), sum(change_from), sum(change_to) from data_changes where id >= 1397920;" | psql -d test
 COUNT |   SUM   |  SUM   |  SUM
-------+---------+--------+--------
 10001 | 4999425 | -24505 | -31308
(1 ROW)

OK. So, it looks that last 10001 records are OK, but apparently older rows were not removed.

So, I checked pglogical data:

postgres@h3po4:~$ psql -d test -c "select * from pglogical.replication_set"
   set_id   | set_nodeid |      set_name       | replicate_insert | replicate_update | replicate_delete | replicate_truncate 
------------+------------+---------------------+------------------+------------------+------------------+--------------------
  495597167 | 3171898924 | DEFAULT             | t                | t                | t                | t
 1269193154 | 3171898924 | default_insert_only | t                | f                | f                | t
 3925901991 | 3171898924 | ddl_sql             | t                | f                | f                | f
(3 ROWS)

and on 9.6:

postgres@krowka:~$ psql -d test -x -c "select * from pglogical.subscription"
-[ RECORD 1 ]--------+--------------------------------------
sub_id               | 2875150205
sub_name             | subscription
sub_origin           | 3171898924
sub_target           | 2941155235
sub_origin_if        | 202885847
sub_target_if        | 2467722369
sub_enabled          | t
sub_slot_name        | pgl_test_provider_subscription
sub_replication_sets | {DEFAULT,default_insert_only,ddl_sql}
sub_forward_origins  | {ALL}

It kinda worried me that default_insert is included, since it doesn't replicate deletes (it shouldn't also replicate updates, so I'm not sure what's going on there).

So, let's clean, and redo:

postgres@krowka:~$ psql -d test -c "select pglogical.drop_subscription('subscription');"
 drop_subscription
-------------------
                 1
(1 ROW)
 
postgres@krowka:~$ psql -d test -c 'truncate base_data cascade'
NOTICE:  TRUNCATE cascades TO TABLE "data_changes"
TRUNCATE TABLE

and now, let's re-add the subscription, but this time make sure that only “default" is replicated:

postgres@krowka:~$ psql -d test -c "SELECT pglogical.create_subscription( subscription_name := 'subscription', provider_dsn := 'host=172.28.173.18 port=5432 dbname=test user=logical_replication', replication_sets := '{default}'::text[] );"
 create_subscription
---------------------
          2875150205
(1 row)

Start load-generators, wait couple of minutes, stop them, wait for replication lag to drop to zero, and recheck data:

postgres@h3po4:~$ printf "select count(*), sum(value) from base_data; select count(*), sum(data_id), sum(change_from), sum(change_to) from data_changes;" | psql -d test
 COUNT | SUM  
-------+------
  1000 | -313
(1 ROW)
 
 COUNT |   SUM   |  SUM   |  SUM   
-------+---------+--------+--------
 10001 | 4988452 | 170618 | 171594
(1 ROW)
 
postgres@krowka:~$ printf "select count(*), sum(value) from base_data; select count(*), sum(data_id), sum(change_from), sum(change_to) from data_changes;" | psql -d test
 COUNT | SUM
-------+------
  1000 | -313
(1 ROW)
 
 COUNT |   SUM   |  SUM   |  SUM
-------+---------+--------+--------
 10001 | 4988452 | 170618 | 171594
(1 ROW)

Better. Now. Next step – what about sequences? Sequence on base_data doesn't change, but on data_changes is used quite a lot, so:

postgres@h3po4:~$ psql -d test -c "select * from data_changes_id_seq"
    sequence_name    | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called 
---------------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
 data_changes_id_seq |    1471710 |           1 |            1 | 9223372036854775807 |         1 |           1 |      19 | f         | t
(1 row)
 
postgres@krowka:~$ psql -d test -c "select * from data_changes_id_seq"
    sequence_name    | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called
---------------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
 data_changes_id_seq |          1 |           1 |            1 | 9223372036854775807 |         1 |           1 |       0 | f         | f
(1 row)

OK. So it looks that sequences are not replicated, at least not by default.

Luckily docs mention interesting function: pglogical.replication_set_add_all_sequences(), so let's try to use it:

postgres@h3po4:~$ psql -d test -c "select pglogical.replication_set_add_all_sequences( set_name := 'default', schema_names := '{public}'::text[], synchronize_data := true )"
replication_set_add_all_sequences 
-----------------------------------
 t
(1 ROW)

The synchronize_data option, from my understanding, makes sure that sequences will be synchronized immediately – if it was set to false, it would happen at some random point in time, as explained in docs:

The state of sequences added to replication sets is replicated periodically and not in real-time.

After I added sequences to set:

postgres@krowka:~$ psql -d test -c "select * from data_changes_id_seq"
    sequence_name    | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called
---------------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
 data_changes_id_seq |    1471710 |           1 |            1 | 9223372036854775807 |         1 |           1 |       0 | f         | t
(1 ROW)

Given the periodical approach to synchronize sequences, if you want to switch to current subscriber, it might be good idea to run, after there are no app connections to old primary:

postgres@h3po4:~$ psql -d test -c "select pglogical.synchronize_sequence( seqoid ) from pglogical.sequence_state"
 synchronize_sequence 
----------------------
 t
 t
(2 ROWS)

So that pglogical will get current values of all subscribed sequences, and push them to all subscribers.

Based on my tests, it looks that pglogical replicates sequence state every 60-70 seconds. Not bad, but obviously, when switching over to replica you have to keep that in mind.

So, let's see how does replicating ddl work. First, let's try something simple:

postgres@h3po4:~$ psql -d test -c "create table test as select i from generate_series(1,10) i"
SELECT 10

After a while – new table is not replicated. But that was to be expected. Let's drop it, and try to use event triggers to replicate new table creation:

postgres@h3po4:~$ psql -d test -c "drop table test"

Then, I use this SQL script:

-- replicate-new-tables.sql
CREATE OR REPLACE FUNCTION pglogical_assign_repset()
RETURNS event_trigger AS $$
DECLARE obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        IF obj.object_type = 'table' THEN
            IF obj.schema_name = 'public' AND NOT obj.in_extension THEN
                PERFORM pglogical.replication_set_add_table('default', obj.objid);
            END IF;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
 
CREATE EVENT TRIGGER pglogical_assign_repset_trg
    ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE', 'CREATE TABLE AS')
    EXECUTE PROCEDURE pglogical_assign_repset();

and now, let's load it:

postgres@h3po4:~$ psql -d test -f replicate-new-tables.sql 
CREATE FUNCTION
CREATE EVENT TRIGGER

It's pretty cool that you can write your own logic in the trigger function, so that you could, for example, have separate replication_sets for various schema, and add new table to proper set.

Now, that this is done, let's see if adding the table will replicate it:

postgres@h3po4:~$ psql -d test -c "create table test as select i from generate_series(1,10) i"
ERROR:  TABLE test cannot be added TO replication SET DEFAULT
DETAIL:  TABLE does NOT have PRIMARY KEY AND given replication SET IS configured TO replicate UPDATEs AND/OR DELETEs
HINT:  ADD a PRIMARY KEY TO the TABLE
CONTEXT:  SQL statement "SELECT pglogical.replication_set_add_table('default', obj.objid)"
PL/pgSQL FUNCTION pglogical_assign_repset() line 8 at PERFORM

Oh, interesting. I can't add tables without primary key. Makes sense, I guess. So, let's try different way:

postgres@h3po4:~$ psql -d test -c "create table test ( id serial primary key, payload text )"
CREATE TABLE

This worked, but the table was not created on 9.6 subscriber. Let's see if if it will replicate data if I'll create it manually, and then insert some rows:

postgres@krowka:~$ psql -d test -c "create table test ( id serial primary key, payload text )"
CRETE TABLE
 
postgres@h3po4:~$ psql -d test -c "insert into test (payload) select i from generate_series(1,5) i"
INSERT 0 5
 
postgres@h3po4:~$ psql -d test -c "select * from test"
 id | payload 
----+---------
  1 | 1
  2 | 2
  3 | 3
  4 | 4
  5 | 5
(5 ROWS)
 
postgres@krowka:~$ psql -d test -c "select * from test"
 id | payload
----+---------
  1 | 1
  2 | 2
  3 | 3
  4 | 4
  5 | 5
(5 ROWS)

It worked! But manually adding all the tables to subscriber side can be problematic.

Docs say:

Automatic DDL replication is not supported. Managing DDL so that the provider and subscriber database(s) remain compatible is the responsibility of the user. pglogical provides the pglogical.replicate_ddl_command function to allow DDL to be run on the provider and subscriber at a consistent point.

So, let's try it. I'll make new table, add some data, and will try to replicate it to subscribers:

postgres@h3po4:~$ psql -d test -c "select pglogical.replicate_ddl_command('create table public.test2 ( id serial primary key, payload text )', '{default}'::text[])"
 replicate_ddl_command 
-----------------------
 t
(1 ROW)
 
postgres@h3po4:~$ psql -d test -c "insert into test2 (payload) select repeat('test', i) from generate_series(1,3) i"
INSERT 0 3

after a while, on 9.6 instance:

postgres@krowka:~$ psql -d test -c '\d test2'
                          TABLE "public.test2"
 COLUMN  |  TYPE   |                     Modifiers
---------+---------+----------------------------------------------------
 id      | INTEGER | NOT NULL DEFAULT NEXTVAL('test2_id_seq'::regclass)
 payload | text    |
Indexes:
    "test2_pkey" PRIMARY KEY, btree (id)
 
 
postgres@krowka:~$ psql -d test -c 'select * from test2'
 id |   payload
----+--------------
  1 | test
  2 | testtest
  3 | testtesttest
(3 ROWS)

While checking how the replication is working, I noticed something:

postgres@h3po4:~$ psql -d test -c "select seqoid::regclass from pglogical.sequence_state"
       seqoid        
---------------------
 base_data_id_seq
 data_changes_id_seq
(2 ROWS)

Only the first two sequences are added to replication_set, but we now have more of them:

postgres@h3po4:~$ psql -d test -c '\ds'
                 List OF relations
 Schema |        Name         |   TYPE   |  Owner   
--------+---------------------+----------+----------
 public | base_data_id_seq    | SEQUENCE | test
 public | data_changes_id_seq | SEQUENCE | test
 public | test2_id_seq        | SEQUENCE | postgres
 public | test_id_seq         | SEQUENCE | postgres
(4 ROWS)

This is simply because the trigger that I copy/paste from documentation handled only create table events.

Let's fix it. First, let's add the sequences that are already there:

postgres@h3po4:~$ psql -d test -c "select pglogical.replication_set_add_all_sequences( set_name := 'default', schema_names := '{public}'::text[], synchronize_data := true )"
 replication_set_add_all_sequences 
-----------------------------------
 t
(1 ROW)
 
postgres@h3po4:~$ psql -d test -c "select seqoid::regclass from pglogical.sequence_state"
       seqoid        
---------------------
 base_data_id_seq
 data_changes_id_seq
 test2_id_seq
 test_id_seq
(4 ROWS)

OK, Now, let's see about this trigger:

-- replicate-new-tables-and-sequences.sql
BEGIN;
 
DROP event TRIGGER IF EXISTS pglogical_assign_repset_trg;
 
CREATE OR REPLACE FUNCTION pglogical_assign_repset()
RETURNS event_trigger AS $$
DECLARE obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands() WHERE command_tag IN ('CREATE TABLE', 'CREATE TABLE AS', 'CREATE SEQUENCE')
    LOOP
        IF obj.schema_name = 'public' AND NOT obj.in_extension THEN
            IF obj.object_type = 'table' THEN
                PERFORM pglogical.replication_set_add_table('default', obj.objid);
            ELSIF obj.object_type = 'sequence' THEN
                PERFORM pglogical.replication_set_add_sequence('default', obj.objid);
            END IF;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
 
CREATE EVENT TRIGGER pglogical_assign_repset_trg
    ON ddl_command_end
    WHEN TAG IN ('CREATE TABLE', 'CREATE TABLE AS', 'CREATE SEQUENCE')
    EXECUTE PROCEDURE pglogical_assign_repset();
 
commit;

And of course, it needs to be loaded, on master (9.5):

postgres@h3po4:~$ psql -d test -f replicate-new-tables-and-sequences.sql
BEGIN
DROP EVENT TRIGGER
CREATE FUNCTION
CREATE EVENT TRIGGER
COMMIT

With this in place, when I'll add new table:

postgres@h3po4:~$ psql -d test -c "select pglogical.replicate_ddl_command('create table public.test3 ( id serial primary key, payload text )', '{default}'::text[]);"
CREATE TABLE

Both the table, and its sequence are immediately added to replication set:

postgres@h3po4:~$ psql -d test -c "select * from pglogical.replication_set_relation  where set_reloid::text ~ 'test3'"
  set_id   |  set_reloid  
-----------+--------------
 495597167 | test3_id_seq
 495597167 | test3
(2 ROWS)

What's left? Well, not much. In case we'd like to have this replication going for longer time, we would probably want to add handling of dropping of objects and truncation of tables.

But as I see it now – for upgrade, you generally need to:

  1. setup provider
  2. subscriber
  3. replicate data
  4. wait for replication lag to drop to 0
  5. stop application
  6. update sequences state
  7. remove pglogical setup from both nodes
  8. stop old master
  9. make changes so that app will connect to new master
  10. start app

Downtime is only from step #5 to step #10, but it shouldn't really take significant time.

The “remove pglogical setup" is pretty simple on its own – you need to stop replicating, then remove nodes, and remove extension itself:

postgres@krowka:~$ psql -d test -c "select pglogical.drop_subscription('subscription', true)"
postgres@krowka:~$ psql -d test -c "select pglogical.drop_node('subscriber', true)"
postgres@krowka:~$ psql -d test -c "drop extension pglogical cascade"

The last command will provide information about objects that were dropped by cascade, but these should be only some triggers that are internal to pglogical (though, on normal tables).

Of course, the whole process should be tested, extensively, with your schema and data. But it does look very promising.

15 thoughts on “Major-version upgrading with minimal downtime”

  1. Thank you A LOT for this post. Recently I’ve tried to configure pglogical but didn’t succeeded since unfortunately official documentation is very confusing and unclear whether some command should be executed on provider or subscriber, etc. I opened a corresponding issue https://github.com/2ndQuadrant/pglogical/issues/56 Sadly it doesn’t look like 2nd Quadrant guys care much about writing a good documentation.

  2. If you set instant replication of sequences and replication lag is 0, in theory you should be able to just switch the app to the new DB, then after that remove pglogical? Or am I missing something?

  3. @Kustodian:

    I wouldn’t really risk it. If your app is busy enough there is no such thing as immediate switch.

    Why not pause the database pool(s) in the bouncer for the ~ 10 seconds to do the switch?

  4. Hi,

    Thanks for the article, its really helpful.

    Our production have database->schema(more than 300)->table-> rows architecture, when I am implementing pglogical, do i need to crate database->schema->table structure in 9.6 environment? I am planning to upgrade from 9.4 to 9.6.

    Thanks
    -Ashish

  5. @Ashish:

    yes – you need to copy schema yourself. This can be done using pg_dump -s, and then loading generated sql to 9.6 with psql.

  6. hi when try to apply last step I can not success. but I put pg_hba entry with all case 🙁

    psql -U postgres -d test -c “SELECT pglogical.create_subscription( subscription_name := ‘subscription’, provider_dsn := ‘host=172.16.150.149 port=5432 dbname=test user=logical_replication password=postgres’);”
    ERROR: could not connect to the postgresql server: FATAL: no pg_hba.conf entry for host “172.16.150.151”, user “logical_replication”, database “test”, SSL off

    DETAIL: dsn was: host=172.16.150.151 port=5432 dbname=test user=logical_replication

  7. @mehmet:

    you need to add appropriate line in pg_hba.conf on master, and then reload config there.

    Most likely the line will look like:

    host replication logical_replication 172.16.150.151/32 trust
    

    Possibly also

    host test logical_replication 172.16.150.151/32 trust
    
  8. thank you it is solved this extension is very nice..

  9. hi I am failing to execute the subscription step .

    query :
    select pglogical.create_subscription(subscription_name:=’subscription’,provider_dsn:=’host=172.17.27.150 port=5432 dbname=db’);

    error :
    could fetch remote node info: ERROR: schema “pglogical” does not exist
    LINE 1: …, node_name, sysid, dbname, replication_sets FROM pglogical….

    but when i checked i have schema named pglogical.
    few of the changes i followed while installing :
    1)I did not create any separate user. I used postgres
    user only.
    2)I installed pglogical using rpm files.
    provider IP : 172.17.27.150
    subscriber IP : 172.17.27.151

  10. @Daksha:

    clealry pglogical schema is not there. why? no idea. you said that it is, but the error clearly contradicts you. so either postgresql suddently became magical, or you’re doing something wrong, but I can’t tell you what or how, since I don’t know *any* details.

    If you want help, then it’s usually much better to use some form of real-time talk – like irc channel, and not blog post comments section.

  11. Hi

    How do pglogical works on partition tables?
    I have parent table without pk column in it and it’s child tables have pk column on them.
    I have added an extra column to the parent table and created pk by updating the column with unique values now the challenge is how to replicate the data from provider to subscriber? as there is pk on different column there might be null values in child table’s new added column and pglogical will do “copy to select * from “. now, parent tables not allow null values and breaking the replication.

    Please suggest me if you have any other solution to add partition tables in pglogical

  12. Excellent article and trigger. Thank you.

    Having trouble with the last replicate_ddl_command, it ran successfully. I see the table and data in the provider. I see the table in replication_set_table. However i do not see the table in the subscriber.

  13. @Vik:

    if you need help I suggest that you contact some support channel for pglogical. I just wrote about one particular usecase, am not developer or (at the moment) even user of pglogical, so can’t really help.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.