As some of you know, I've been working lately for OmniTI company. When doing things for them (PostgreSQL related of course :), I stumbled on very interesting problem.
One of our clients is working on PostgreSQL 8.2, and wants to upgrade to 8.3. This is generally trivial – pg_dump, pg_restore/psql, and you're done.
But, this situation was with a twist – we had only very limited time-frame to do the migration. To be exact – we had 4 hours that we can bring the website down for.
So, we had to dump database, transfer it to new server, and load. All within 4 hours. Simple? Sure. Database was ~ 200GB (after restore – around 130GB). How to do it?
First, let's list problems:
- Dump time
- Transfer time (copy dump to new server)
- Restore time
Now, all of this happens on really cool hardware (I'm not going to discuss it in length, but just assume that this hardware is really nice, with a lot of power).
So, first – pg_dump – standard pg_dump took around 2.5 hour (using pg_dump -Fc -a).
Transfer – generated dump file was around 30GB, and scp'ing it took around 1h 45 minutes.
Restore – original restore process (if I recall correctly) took around 5 hours.
So, in total we had approximately 9 hours worth of work to do in 4 hours.
The easiest thing that can be done is to remove SCP from the equation. So there was a decision, that dump will end up on DAS, which will be physically reconnected to new server after dump, thus reducing time to transfer to about 10 minutes, but requiring somebody to be in hsoting facility.
But then – what can we do about pg_restore and pg_dump? Well, we can load data in parallel.
We tried using standard parallelism of data load, and got the time down to about 3.5 hours. Nice. What can we do next?
It would be great to speedup pg_dump.
Now, there is a very simple way to make pg_dump faster – dump in parallel. But pg_dump cannot work in parallel on itself for very good reason – it is not possible to guarantee consistency of dump made by several processes – there is always possible situation when some write to database will be caught, by some other – to other table, will not, thus generating useless dump.
But this was actually not a problem for us! We could shut down Postgres, and do anything we want – we had this 4 hour window of opportunity, that we could block all accesses to it.
So, the idea was, that we will block external access to postgres, and local access will be only to 1 account – used by dump processes. In this was we have kind-of guarantee that everything will be fine.
Next step was actually pretty easy: make list of tables (something along the lines of:
SELECT n.nspname, c.relname FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE c.relkind = 'r' AND n.nspname <> 'information_schema' AND n.nspname !~ '^pg_';
And then dump each table separately.
What came in really handy in here is xargs and it's -P option, which allows me to write a simple shell script which will work in parallel to dump many tables at once, but not more than X tables at the same time.
This was all great, but then we found out, that while parallelization did indeed help, it helped less than we expected. Problem: while majority of tables were relatively small, couple (5-10) were much bigger, and 3 largest tables were bigger than all other tables together.
So. Next question: can we speedup dumping those tables? Apparently: yes.
Idea was: All these tables had primary key based on serial. We could easily get min and max value of the primary key column, and then split it into half-a-million-ids “partitions", then dump them separately using:
psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN x AND y) TO STDOUT" | gzip -c - > TABLE.x.y.dump
(side note: actually dumping without gzip was a bit faster, but the difference was practically negligible).
So, we manually decided which tables will be dumped “in whole", and which “in partitions", than we created ( by script of course, not hand-made ) sql script with commands like:
psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN x AND y) TO STDOUT" | gzip -c - > TABLE.x.y.dump psql -qAt -c "COPY ( SELECT * FROM TABLE WHERE id BETWEEN y AND z) TO STDOUT" | gzip -c - > TABLE.y.z.dump psql -qAt -c "COPY ( SELECT * FROM TABLE2 ) TO STDOUT" | gzip -c - > TABLE.whole.dump
(of course the ranges were not overlapping!)
This list contained around 1000 commands, each of them was either dumping small table in whole, or part of larger table.
This command was fed to xargs, which ran them all in parallel. Final effect was greater than whatever I personally expected. Whole database dumped in 26 minutes – that's nice speedup 🙂
Of course this dumping procedure is not for everybody – it requires absolutely no accesses to database (for the time of dump), and it generates quite high load (depending on number of parallel processes xargs is spawning). But for our situation – it's simply great.
In the mean time – I did a test, and it become obvious that speed of transfer was limited by SCP and it's crypto/compression parts. When I switched to rsync (not rsync over ssh, but plain rsync), without compression, speed of transfer rose from 4.5MB/s to 80MB/s, thus reducing whole time to transfer 30GB of data to 6.5 minutes – thus making DAS switching obsolete!
Summary to this point:
- dump process: down from 2.5h to 26 minutes.
- dump transfer: down from 1.75h to 6 minutes.
Now that we had data partitioned we could load it without any problem in parallel.
But there is a question. How to make the load of data as fast as possible? Should we create indexes before or after? Should we create triggers after data was loaded, or create it before, and just disable them for the time of loading.
Well. To create FOREIGN KEY, you have to have UNIQUE INDEX on referenced column. And while you can disable the FOREIGN KEY for data load – you cannot disable INDEX. And data load with INDEX on tables proved to be too slow (I tested it once, but stopped the test after 6 hours – there was no point in running it longer).
So, to make the data load as fast as we can, we split the procedure into several steps:
- create database, schemata, tables, functions
- load data
- create indexes
- create foreign keys
- create triggers and restore privileges
First and last step are actually very fast, timed in couple of seconds. Middle 3 steps are tricky.
First the data load – thanks to our dumping procedure, we could easily load the data the same way we dumped – i.e. use many workers, each loading either whole table (from the small tables), or part of large table.
After data is loaded we have to create indexes.
There are actually 3 kinds of indexes that we are interested in:
- primary key – added by ALTER TABLE ADD PRIMARY KEY
- unique constraint – added by ALTER TABLE ADD UNIQUE
- normal or unique indexes created by CREATE INDEX or CREATE UNIQUE INDEX
I tested, and it became obvious that while you can run many CREATE INDEX on the same table at the same time, and they don't block, – you can't do it with the ALTER TABLE ADD … things – as ALTER TABLE locks the table.
So, running all commands separately, each in it's own parallel worker would be (and was, I tested) not really good because ALTER TABLEs would block tables that some CREATE INDEX workers would try to use, thus greatly reducing parallelism of the process.
Solution become apparent:
Put all ALTER TABLE in one file per table, and all CREATE INDEX in one file per index.
Then, run in parallel all “ALTER TABLE" files – each of them uses separate table, so they can work in parallel. And as soon as this finishes – run all the CREATE INDEX files in parallel – they don't block, so you can (and it's actually good to do it) create multiple indexes on the same table at the same time.
So, we're left with FOREIGN KEY. Unfortunately each “ALTER TABLE ADD FOREIGN KEY" does:
- lock the table that is being altered
- check if all values in fkey column exist in referenced table.column
And this is slow. And very lock prone. I was not able to find a good way to handle this particular problem – I just run all “ADD FOREIGN KEY" with a lot of parallel workers, and wait. Some of them will lock, some not, generally it's a bit faster than running them sequentially, but not much.
(Side note: I would love to get “ALTER TABLE … ADD FOREIGN KEY … SKIP CHECK" – to skip checking of pre-existing data, but it's unlikely to happen).
After all of these changes we brought the load time to around 2 hours (with data loading procedure taking approx. 70% of the time).
Together with 6 minute rsync, and 26 minute dump, it makes the whole procedure take about 2.5 hours, leaving us with 1.5 hour for tests of migrated data, or unexpected situations. Not bad 🙂
At the end of the post I would like to express my gratitude to xargs authors. -P is absolutely amazing. And it makes my work waaay easier.
( Scripts that I used for this migration cannot be (most likely) distributed, but if you'd need to write something like this, and you'll have problems – send me an email, and I'll try to help, or find me on #postgresql on irc.freenode.net.