Speeding up dump/restore process

As some of you know, I've been working lately for OmniTI. While doing things for them (PostgreSQL related, of course :), I stumbled on a very interesting problem.

One of our clients is running PostgreSQL 8.2 and wants to upgrade to 8.3. This is generally trivial – pg_dump, pg_restore/psql, and you're done.

But this situation came with a twist – we had only a very limited time frame to do the migration. To be exact – we had 4 hours during which we could bring the website down.

So, we had to dump the database, transfer it to the new server, and load it. All within 4 hours. Simple? Sure. The database was ~ 200GB (after restore – around 130GB). How to do it?

First, let's list problems:

  1. Dump time
  2. Transfer time (copy dump to new server)
  3. Restore time

Now, all of this happens on really cool hardware (I'm not going to discuss it at length, but just assume that this hardware is really nice, with a lot of power).

So, first – pg_dump – a standard pg_dump took around 2.5 hours (using pg_dump -Fc -a).

Transfer – the generated dump file was around 30GB, and scp'ing it took around 1 hour 45 minutes.

Restore – the original restore process (if I recall correctly) took around 5 hours.

So, in total we had approximately 9 hours worth of work to do in 4 hours.

The easiest thing to remove from the equation was SCP. So the decision was made that the dump would end up on a DAS, which would be physically reconnected to the new server after the dump, reducing the transfer time to about 10 minutes, but requiring somebody to be at the hosting facility.

But then – what can we do about pg_restore and pg_dump? Well, we can load data in parallel.

We tried using standard parallelism of data load, and got the time down to about 3.5 hours. Nice. What can we do next?

It would be great to speed up pg_dump.

Now, there is a very simple way to make pg_dump faster – dump in parallel. But pg_dump cannot work in parallel by itself, for a very good reason – it is not possible to guarantee the consistency of a dump made by several processes: there is always the possibility that a write to one table will be caught by the dump while a write to another table will not, thus generating a useless dump.

But this was actually not a problem for us! We could shut down Postgres and do anything we wanted – we had this 4 hour window of opportunity, during which we could block all access to it.

So, the idea was that we would block external access to Postgres, and local access would be limited to one account – the one used by the dump processes. In this way we had a kind-of guarantee that everything would be fine.
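
For illustration only, a minimal sketch of what such a lockdown could look like in pg_hba.conf (the account name "dumper" is made up, and the config has to be reloaded after editing):

# pg_hba.conf for the migration window – hypothetical account name "dumper"
# allow local connections from the dump account only, reject everything else
local   all     dumper                          trust
host    all     all         0.0.0.0/0           reject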

The next step was actually pretty easy: make a list of tables (something along the lines of:

SELECT n.nspname, c.relname
FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind = 'r'
    AND n.nspname <> 'information_schema'
    AND n.nspname !~ '^pg_';

And then dump each table separately.

What came in really handy here was xargs and its -P option, which allowed me to write a simple shell script that dumps many tables in parallel, but never more than X tables at the same time.
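
A minimal sketch of how this could look – the helper script name, the tables.txt file and the worker count are all made up for the example:

#!/bin/bash
# dump_table.sh – hypothetical helper: dumps one table, given by name as $1
T="$1"
psql -qAt -c "COPY $T TO STDOUT" | gzip -c > "$T.dump"

# tables.txt holds one table name per line (output of the query above);
# xargs runs the helper for each table, at most 8 dumps at a time:
xargs -n 1 -P 8 ./dump_table.sh < tables.txt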

This was all great, but then we found out that while parallelization did indeed help, it helped less than we expected. The problem: while the majority of tables were relatively small, a couple (5-10) were much bigger, and the 3 largest tables were bigger than all the other tables together.

So. Next question: can we speedup dumping those tables? Apparently: yes.

The idea was: all of these tables had a primary key based on a serial. We could easily get the min and max values of the primary key column, split that range into half-a-million-id “partitions", and then dump each of them separately using:

psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN x AND y) TO STDOUT" | gzip -c - > TABLE.x.y.dump

(side note: actually dumping without gzip was a bit faster, but the difference was practically negligible).
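
For illustration, a rough sketch of how such a list of range dumps could be generated for one big table – the table name, the id column and the 500000 step are assumptions:

# generate non-overlapping, 500000-id wide COPY commands for one big table
TABLE=big_table    # made-up name
STEP=500000
MIN=$( psql -qAt -c "SELECT min(id) FROM $TABLE" )
MAX=$( psql -qAt -c "SELECT max(id) FROM $TABLE" )
for (( x = MIN; x <= MAX; x += STEP )); do
    y=$(( x + STEP - 1 ))
    echo "psql -qAt -c \"COPY ( SELECT * FROM $TABLE WHERE id BETWEEN $x AND $y ) TO STDOUT\" | gzip -c > $TABLE.$x.$y.dump"
done > commands.list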

So, we manually decided which tables would be dumped “in whole" and which “in partitions", then we created (by script of course, not by hand) a list of commands like:

psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN x AND y) TO STDOUT" | gzip -c - > TABLE.x.y.dump
psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN y AND z) TO STDOUT" | gzip -c - > TABLE.y.z.dump
psql -qAt -c "COPY ( SELECT * FROM TABLE2 ) TO STDOUT" | gzip -c - > TABLE2.whole.dump

(of course the ranges were not overlapping!)

This list contained around 1000 commands, each of which dumped either a small table in whole, or a part of a larger table.

These commands were fed to xargs, which ran them in parallel. The final effect was greater than anything I personally expected. The whole database dumped in 26 minutes – that's a nice speedup 🙂

Of course this dumping procedure is not for everybody – it requires that there be absolutely no access to the database (for the duration of the dump), and it generates quite a high load (depending on the number of parallel processes xargs spawns). But for our situation – it's simply great.

In the meantime I did a test, and it became obvious that the transfer speed was limited by SCP and its crypto/compression. When I switched to rsync (not rsync over ssh, but plain rsync), without compression, the transfer speed rose from 4.5MB/s to 80MB/s, reducing the time to transfer 30GB of data to 6.5 minutes – and making the DAS switching obsolete!
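
For reference, a hedged sketch of what such a plain-rsync transfer (rsync daemon, no ssh) could look like – the module name and paths are invented:

# /etc/rsyncd.conf on the receiving server (module name "dumps" is made up):
[dumps]
    path = /dir/for/dumps
    read only = no

# on the sending server – rsync protocol directly, no ssh, no compression:
rsync -av /dir/with/dumps/ rsync://user@newserver/dumps/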

Summary to this point:

  • dump process: down from 2.5h to 26 minutes.
  • dump transfer: down from 1.75h to 6 minutes.

Now that we had the data partitioned, we could load it in parallel without any problem.

But there is a question: how to make the data load as fast as possible? Should we create indexes before or after the load? Should we create triggers after the data is loaded, or create them before and just disable them for the duration of the load?

Well. To create a FOREIGN KEY, you have to have a UNIQUE INDEX on the referenced column. And while you can disable the FOREIGN KEY for the data load – you cannot disable the INDEX. And loading data into tables with indexes proved to be too slow (I tested it once, but stopped the test after 6 hours – there was no point in running it longer).

So, to make the data load as fast as we can, we split the procedure into several steps:

  • create database, schemata, tables, functions
  • load data
  • create indexes
  • create foreign keys
  • create triggers and restore privileges

The first and last steps are actually very fast, taking a couple of seconds each. The middle 3 steps are the tricky ones.

First, the data load – thanks to our dumping procedure, we could easily load the data the same way we dumped it – i.e. use many workers, each loading either a whole table (one of the small tables) or a part of a large table.
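
A minimal sketch of such a loading worker – the helper name and worker count are made up, and it assumes dump files named like table.dump or table.x.y.dump:

#!/bin/bash
# load_file.sh – hypothetical helper: loads one gzipped dump file into its table
F="$1"
T=$( basename "$F" | cut -d. -f1 )    # assumes names like table.dump / table.x.y.dump
zcat "$F" | psql -qAt -c "COPY $T FROM STDIN"

# run the loads, at most 8 at a time:
ls /dir/for/dumps/*.dump | xargs -n 1 -P 8 ./load_file.sh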

After the data is loaded, we have to create indexes.

There are actually 3 kinds of indexes that we are interested in:

  • primary key – added by ALTER TABLE ADD PRIMARY KEY
  • unique constraint – added by ALTER TABLE ADD UNIQUE
  • normal or unique indexes created by CREATE INDEX or CREATE UNIQUE INDEX

I tested it, and it became obvious that while you can run many CREATE INDEX commands on the same table at the same time without them blocking each other, you can't do that with the ALTER TABLE ADD … commands – as ALTER TABLE locks the table.

So, running all commands separately, each in its own parallel worker, would be (and was – I tested it) not really good, because the ALTER TABLEs would block tables that some CREATE INDEX workers would try to use, greatly reducing the parallelism of the process.

The solution became apparent:

Put all the ALTER TABLE commands for a given table in one file (one file per table), and each CREATE INDEX in its own file (one file per index).

Then run all the “ALTER TABLE" files in parallel – each of them touches a separate table, so they can work side by side. And as soon as this finishes – run all the CREATE INDEX files in parallel – they don't block each other, so you can (and it's actually good to) create multiple indexes on the same table at the same time.
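
A sketch of how the two phases could be driven – the file naming (alter.TABLE.sql, index.NNN.sql) and the worker count are invented:

# phase 1: ALTER TABLE ADD PRIMARY KEY / ADD UNIQUE – one file per table
ls alter.*.sql | xargs -n 1 -P 8 psql -q -f

# phase 2: CREATE [UNIQUE] INDEX – one file per index, many per table can run at once
ls index.*.sql | xargs -n 1 -P 8 psql -q -f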

So, we're left with the FOREIGN KEYs. Unfortunately, each “ALTER TABLE ADD FOREIGN KEY":

  • locks the table that is being altered
  • checks that all values in the fkey column exist in the referenced table.column

And this is slow. And very lock prone. I was not able to find a good way to handle this particular problem – I just ran all the “ADD FOREIGN KEY" commands with a lot of parallel workers, and waited. Some of them block, some don't; generally it's a bit faster than running them sequentially, but not by much.
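
In practice this was just the same xargs pattern yet again – for example (file naming invented):

# foreign keys: one ALTER TABLE ... ADD FOREIGN KEY per file, brute-forced in parallel
ls fkey.*.sql | xargs -n 1 -P 8 psql -q -f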

(Side note: I would love to get “ALTER TABLE … ADD FOREIGN KEY … SKIP CHECK" – to skip checking of pre-existing data, but it's unlikely to happen).

After all of these changes we brought the load time down to around 2 hours (with the data loading step taking approx. 70% of that time).

Together with the 6 minute rsync and the 26 minute dump, the whole procedure takes about 2.5 hours, leaving us with 1.5 hours for testing the migrated data, or for unexpected situations. Not bad 🙂

At the end of the post I would like to express my gratitude to the xargs authors. -P is absolutely amazing. And it makes my work waaay easier.

( The scripts that I used for this migration cannot (most likely) be distributed, but if you need to write something like this and run into problems – send me an email and I'll try to help, or find me on #postgresql on irc.freenode.net. )

32 thoughts on “Speeding up dump/restore process”

  1. Have you considered setting up some replication from the old host to the new one (a few days before the migration so it catches up)? Then it is a matter of migrating the master, a few minutes…

  2. @Mekk:
    It was considered. But the problem is that replication (at least the most common kind, like Slony) requires a primary key on every table (which, sadly, is not true in our case). But, what's more important – the (initial) replication takes longer and imposes some load on the master – which is a big “no-go" for us. It's not that it's impossible, but since we found a way to do the migration within a short time window, without any slowdown for users – it's clearly the better option.

  3. Not that there’s any problem with using rsync to transfer, but another option is nc:

    dest-machine $> nc -l $SOME_PORT > $DEST_FILE
    src-machine $> cat $SRC_FILE | nc $DEST_MACHINE $SOME_PORT

    You can also throw in gzip / gunzip if you feel like it.

  4. @eggyknap:

    Funny that you mention it, because our first non-scp transfer was actually done using nc.

    But then – using rsync allowed us to start sending the files (the dump contains many files, not one) before the dump was finished, using this one-liner:

    while true; do rsync -av /dir/with/dumps/ rsync://user@remote/mod/dir/for/dumps/; done

    As for compression – each individual dump was compressed on its own, so compressing them again on transfer time wouldn’t help.

  5. Nice one (and thanks for sharing numbers!) 🙂

    Another idea would be to investigate the option of enabling archiving on the primary and setting up a standby. After the standby catches up with the logs, you just fail over to it (after some checks) during the downtime window. It's a standard procedure in Oracle to set up a DataGuard standby or Oracle Streams Downstream Capture (you can do both online; in pgsql that would be a warm standby – e.g. I prefer walmgr) and just switchover/failover when needed. I've done experiments with Streams, and during full DB load you can upgrade the DB in under 80-90s from the app perspective (it does not depend on db-size).

  6. @Jakub: pg wal replication doesn't work between different versions. And we did a migration from 8.2 to 8.3.

  7. If the difference between gzip and uncompressed was not large, then it was probably a somewhat close trade-off between CPU and disk I/O. I use lzop as it results in less disk I/O and is somewhere between gzip and uncompressed in terms of CPU resources.

  8. Did you try parallel dump/restore using no intermediate files? Something like `pg_dump --host=old …|pg_restore --host=new …` In most cases disk I/O is the most limiting factor.
    (Except, I think, in your case scp was limited by its TCP network latency, and it could be tried in parallel too, if network security was a concern.)

  9. @wiktor:

    did you read the blog post you're commenting on? I wrote that we tried it: “We tried using standard parallelism of data load, and got the time down to about 3.5 hours." and it did link to a blog post about this feature of 8.4.

  10. @Konstantin:
    it would possibly be interesting, but as you can see the transfer time is only 5 minutes. Hardly something worth optimising.

    And scp was not limited by network latency – which I thought was obvious, when I wrote that we switched to rsync, and the speed we got from it.

  11. @Matthew:
    dumps compress quite nicely. As for using lzop – it's an interesting idea, thanks.

  12. @depesz:
    If you really have separate disks/RAIDs for the DB and for the dump on the old host, and separate disks for the DB, dump and transaction log on the new host, then you're almost right – there is little to optimize.
    If not, then while it dumps, it reads the DB and writes the dump – twice the work for the disks. While it restores, it reads the dump and writes to the DB – again twice the work. And even with lots of RAIDs, it doesn't do it all at the same time.
    A direct transfer from pg_dump to pg_restore removes that disk load and makes dump and restore simultaneous. That should be not (26 min dump + 6 min transfer + 2 hr restore), but about 2 hours total (dump+transfer+restore).

    As you said, “this hardware is really nice", so ssh encryption would not be a limitation. At the same time, ssh/scp connections generally cannot produce big throughput because ssh uses its own application-level acknowledgement of sent/received data, while a plain TCP connection can use a large TCP window – that is the usual reason for the difference between scp and non-ssh throughput. Not that I advocate for scp, but I suppose that's the explanation.

  13. I just love reading about difficult problems and great solutions like this. Great work, I bet it was a blast.

  14. Changed my backup strategy for my read-only archive tables to use a variant of this (generated a shell script that backgrounds the processes and then issues a “wait" for the pids) and got my backup down from a few hours to 10 mins! Thanks for the idea.

  15. The addition of COPY ( query ) is one of my favourite features added to 8.2.
    How would you parallelise the dump for 8.1?

  16. parallel pg_restore will sort out the dependencies between index commands etc. for you. So what I would do is a schema-only dump. Then run the first part of the restore, i.e. up to where it would start copying data if there were data in the dump. Then run your hand-cut parallel copy to get the base data in. Then run the rest of the schema restore with pg_restore --jobs=nn for however much parallelism you want.

  17. @Andrew:
    This is all nice and easy when you have parallel pg_restore, which was added in 8.4, afair.

  18. Thanks for the tips. It proved very helpful.

    I just need some more help with the xargs command. Can you please tell me how you fed the ‘psql -qAt -c “COPY ( SELECT * FROM TABLE WHERE id BETWEEN x AND y) TO STDOUT" | gzip -c - > TABLE.x.y.dump' commands to xargs?

  19. Never mind about the xargs. I got it.

    I noticed that it doesn't give me the expected results (now the expectations are raised with this post 🙂 ). Could you please tell me your system configuration, so that I can put the results at my end into perspective.

  20. Hi Depesz,

    Nice article.

    You said, “So, the idea was that we would block external access to Postgres, and local access would be limited to one account – the one used by the dump processes. In this way we had a kind-of guarantee that everything would be fine."

    If I have a website, does it mean the site has to be in “Under Maintenance” mode?

    Thanks!
    -L

  21. @Laurent: yes. Any writes to the database can cause the dump to lose some data, or even to not be restorable at all.

  22. Hi Depesz,

    Thank you for your reply.

    I have a question if you don’t mind.

    Currently, I am using postgres 9.10. And I have a nightly dump that takes 7-10 hours and produces a 30GB gz file, and we also have master-slave replication.

    I am planning to upgrade to 9.2, but am still looking for a better solution.

    I tried Slony with a small testing database and it works, but there is the problem that you mentioned – every table has to have a primary key.

    Do you have any pointers or link that I can dig deep and learn from it?

    Thanks!
    -L

  23. @Laurent:
    not sure what you mean by 9.10 – there is no such version. Perhaps you meant 9.1.0 (which means you should upgrade asap to 9.1.6), or 9.0.10 – at which point you’re ok.

    Well, I generally don’t use any other means of replication than PITR replication, and with this you can use filesystem-level backups which are fast, and much less resource-intensive.
