February 27th, 2015 by depesz

One of my clients is upgrading some servers. The procedure we use took some time to get to its current state, and we found some potential problems along the way, so I decided to write more about it.

First, what we have, and what we want to have.

We have usually 3 servers:

  • master
  • slave
  • slave2

Both slaves use streaming replication.

In front of this there is a pgbouncer server, which has 2 pools configured:

  • appname
  • appname_ro

where appname is directed to master, and appname_ro to slave.
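A setup like this boils down to two entries in the [databases] section of pgbouncer.ini – a minimal sketch, where the hostnames (and the dbname) are made up for illustration:

```ini
[databases]
; appname goes to master, appname_ro to a streaming slave
appname    = host=master.internal port=5432 dbname=appname
appname_ro = host=slave.internal  port=5432 dbname=appname
```

Redirecting appname_ro later, during the upgrade, is then just a matter of changing the host here and reloading pgbouncer.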

Slave2 is there for ad hoc queries and, most importantly, running backups.

All these are instances in AWS.

Such a setup (master, slave, slave2, pgbouncer) is called a “cluster”, and there are multiple “clusters”.

Some time ago, there was a decision to upgrade from the Pg 9.1 we were using to 9.3 (I know that there is now 9.4, but we want to have as little deviation across clusters as possible, so we will upgrade to 9.3, and then, possibly, to 9.4).

Anyway – when we decide that a given cluster is to be upgraded, we follow these steps:

  • set up a new slave (slave3) on an identical instance and storage as master. Make it catch up to master. Then stop replication to slave3, and bring it up as standalone. Afterwards, run all the upgrade scripts on slave3, effectively upgrading it to Pg 9.3, timing each part. After timing is done – terminate the instance (i.e. drop the server)
  • if timings are acceptable (we can bring the service down for some time, but not for very long) – we can proceed. If not – other means of upgrade have to be taken, but these are not the topic of this blogpost
  • when the actual time to upgrade comes, we start by installing new binaries on the server. All servers are running Ubuntu Server, latest LTS. We add pgdg repos to apt sources, and install appropriate packages (whatever we have for 9.1, we install for 9.3)
  • we drop the 9.3/main cluster that is created by default, and create a new one, making sure its collate and ctype are the same as in the currently running 9.1 cluster – this can be checked by
    select datcollate, datctype from pg_database where datname = 'template0'

    and then passed to pg_createcluster as appropriate options.

  • We copy config files from /etc/postgresql/VERSION/main/ to /var/lib/postgresql/VERSION/main/ – pg_upgrade requires config files to be in the data directory, but Ubuntu has other ideas.
  • In pgbouncer, we pause pools for both appname and appname_ro “databases” – thanks to this, and the fact that all app access goes through pgBouncer, we are sure that nothing will reach the database while it's being upgraded
  • Next thing is – dropping problematic things. While doing tests, we find and note every view or function that has to be dropped because it breaks pg_upgrade. We have some of these due to wrapping system views in our own views. Not many, but there are some. Each has to be dropped
  • then we run the test upgrade. This is done using command:
    /usr/lib/postgresql/9.3/bin/pg_upgrade \
        --old-bindir=/usr/lib/postgresql/9.1/bin/ \
        --new-bindir=/usr/lib/postgresql/9.3/bin/ \
        --old-datadir=/var/lib/postgresql/9.1/main/ \
        --new-datadir=/var/lib/postgresql/9.3/main/ \
        --link \
        --jobs=25 \
        --old-port=5432 \
        --new-port=5433 \
        --check \
        --verbose

    As you can see, we make the upgrade highly parallel – this is because we quite often have thousands of tables to handle. This test is the last chance to bail out if anything would go wrong.

  • If it was OK, we run the actual upgrade – same command, but without the --check option. That's the only difference.
  • After the actual upgrade is done, we put in place prepared config files (in /etc/postgresql/9.3/main) that are optimized for the app – as at this point the configs are just Ubuntu defaults. We also change the port for the 9.1 instance (not to be run anymore anyway) to 5433, and on the 9.3 Pg – to 5432.
  • Then there is vacuum – this takes a long time. Unfortunately parallel vacuum isn't there, so we wrote a simple script (a one-liner, basically) which iterates over all databases, lists tables in order of their size (descending), and then runs vacuum on them in parallel. More or less like:
    # vacuum analyze all tables in all databases, biggest first, 20 in parallel
    for db in $( psql -qAtX -c "select datname from pg_database where datallowconn" )
    do
        export db
        psql -qAtX -d "$db" -c "select oid::regclass from pg_class where relkind = 'r' order by relpages desc" | \
            xargs -d'\n' -n1 -P20 -I{} vacuumdb -z -t {} $db
    done

    This is usually the longest part. If it succeeds, we're good to take live traffic. If not – well, it hasn't happened yet, but if it did, we'd have to either fix the problem, or fail over to slave + slave2 on 9.1, and then try to figure out what to do next

  • Once vacuum is done, we recreate dropped views and functions (using pre-made .sql files)
  • Old cluster gets deleted: pg_dropcluster 9.1 main
  • On pgbouncers we resume appname pool, and redirect appname_ro to master, after which appname_ro is resumed too.
  • Once this is done, and we verify that app works, and everything is OK, we upgrade slaves by dropping 9.1 main cluster, installing new pg binaries, and doing pg_basebackup from master.
  • When the slave is set up again, on 9.3, as a streaming slave, we change the pgbouncer config for the appname_ro pool to point to the slave, and reload the config
  • At this moment we have upgraded everything, the app works, and we can either upgrade the next “cluster” or get beer/sleep.
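The pgBouncer part of the steps above – pausing before the upgrade, resuming after – is done through pgBouncer's admin console. A rough sketch, assuming the console listens on port 6432 and a user named pgbouncer may connect (both are assumptions, not the actual config):

```shell
# pause both pools before the upgrade – clients block, nothing reaches Pg
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE appname"
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE appname_ro"

# ... pg_upgrade, config swap, vacuum happen here ...

# resume once the new cluster is confirmed working
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RESUME appname"
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RESUME appname_ro"
```

While paused, clients simply wait on their connections, which is why the app survives the upgrade window without errors, as long as the window stays short.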

While working on these upgrades we found a couple of things that need looking into. The first of them is dropping/recreating certain views/functions that cause problems.

Another is when it's a table that is causing problems. In such a case (we didn't have it at this location, but another of my clients did) – we simply decided to pg_dump this single table, then just drop it, and load it from the dump after the upgrade.
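For the record, that dump-and-reload workaround looks roughly like this – the table and database names below are made up for illustration:

```shell
# hypothetical example: dump the offending table, drop it, upgrade, restore
pg_dump -p 5432 -t bad_table -f /tmp/bad_table.dump appdb
psql -p 5432 -d appdb -c "drop table bad_table"

# ... run pg_upgrade as shown earlier ...

psql -p 5432 -d appdb -f /tmp/bad_table.dump
```

Obviously this only makes sense for tables small enough that the dump/restore fits in the downtime window.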

The last thing that actually caused some problems is that we were using central storage for the wal archive – on S3. The location doesn't matter, but the point is that we copied xlogs to a place like: //appname/cluster-x/xlogs/.

This was all great. But some clusters had many timeline changes – namely, we had some clusters that were already on two-digit timelines.

So, on 9.1 we had xlogs like: 0000001200000ABC0000007E. The thing is that the walarchive was kept “forever” – files were never removed from there.

And after upgrade wal location was the same, but timeline got reset.

So, 9.3 was writing to 0000000100000ABC0000007E, 0000000100000ABC0000007F and so on.
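The timeline is right there in the file name: a WAL segment name is 24 hex digits – timeline, log id, and segment number, 8 digits each. A quick sketch of splitting 9.1's segment name apart (pure shell, just for illustration):

```shell
# split a WAL segment name into its three 8-hex-digit fields:
# timeline, log id, segment number
wal=0000001200000ABC0000007E

timeline=$( echo "$wal" | cut -c1-8 )   # 00000012
logid=$(    echo "$wal" | cut -c9-16 )  # 00000ABC
segment=$(  echo "$wal" | cut -c17-24 ) # 0000007E

echo "timeline=$timeline logid=$logid segment=$segment"
```

So after the upgrade the timeline field dropped from 00000012 back to 00000001, while the log/segment part kept counting from where 9.1 left off.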

This wasn't a problem on its own, because in timeline 01, pg9.1 never got to this wal location.

But – walarchive also contains .history files – which are responsible for timeline change.

And when we started a newly set up slave on 9.3, it immediately tried to switch to a “newer” timeline – because there were newer .history files.

Long story short – you should either store your xlogs in a walarchive that is version-based (like: //appname/cluster-x/xlogs/9.3/), or just remove the whole walarchive (or just the .history files) before starting new slaves. We switched to a versioned walarchive, and the problem was gone.
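In postgresql.conf terms, the versioned walarchive just means the major version shows up in the archive_command target – a sketch, where walarchive-put stands in for whatever wrapper actually ships the file to S3 (the wrapper name is made up):

```
# postgresql.conf on the 9.3 master – note the version in the path
archive_command = 'walarchive-put %p //appname/cluster-x/xlogs/9.3/%f'
```

With this, 9.1 and 9.3 files (including .history files) can never clash, no matter how the timelines line up.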

And that's about it. Not really complex, but there are still some things that need to be considered. The biggest point of it all: test the upgrade beforehand, on an identical system, so you'll know what to expect. And you can always redo the test as many times as you need, so you have everything scripted and fully repeatable.


  # Dane
    Feb 28, 2015

    Though the developer in me loves your “Waiting for …” series, posts like this and the previous one – about what to do after you’ve got PostgreSQL installed – are really helpful to new DBAs, or part-time DBAs like myself. So thanks for sharing.

  # Misha
    Mar 4, 2015

    depesz, thanks again for your blog!

    another point:

    one thing that should be mentioned is that this roadmap requires us to handle all traffic on one master host (without any slaves) during the initial pg_basebackup for the slaves.

    it could be very important, and it depends on how end-user traffic is distributed through the day: do we have a big enough idle cpu/io window (maybe at night hours) to make a full data copy – that is the question. // thinking about logical replication….
