For various reasons, and in various cases, bloat happens. Theoretically autovacuum protects us all, but sometimes it doesn't. Sometimes someone disables it, or mis-configures, or bad planet alignment happens, and we end up in deep bloat.
What to do then? Vacuum? Vacuum Full? Cluster? No. pg_reorg!
Before I will go to describe what pg_reorg is, and how it works, let me briefly describe alternatives.
Normal vacuum doesn't (usually) shrink table files. It just moves the rows around in table, to make sure that pages are as full as possible. If it will happen that there are some 100% empty pages as the very end of table file – they will get removed, and table size decreased. But that's not very common.
But it's still slow. It also shares big drawback with vacuum full (pre 9.0).
Vacuum full before version 9.0 of Pg did the shuffling for the whole table. Which means that it did in fact shrink the table. But it took ages. And. Because of the tuple moving – it actually bloated the indexes. Since it had to deal with whole table, it also locked the table, which made it (in my opinion) worst possible option:
- unbearably slow
- bloats indexes
- locks the table
There was just one benefit – it needed relatively small extra space – basically just for the index bloat. Because all shuffling of rows was done within the table files themselves.
Because vacuum full was so bad, people started doing other things. Like – table rewrite. The simplest form is pg_dump of table, drop table, load table. But there are smarter ways, like:
CLUSTER table_name ON primary_key_index; ALTER TABLE table_name SET WITHOUT CLUSTER;
First command caused full table rewrite, but without reorganizing the rows (well, without packing them to get the best density – they were reorganized in the way that they are ordered by primary_key_index column(s) order), so it was way faster than VACUUM FULL. Second – removed cluster mark on table. This mark causes certain decisions in planner, and we don't want to keep the table clustered (i.e. ordered in index order), but just used side-effect of the cluster command.
CLUSTER (or any other rewrite-based) approach was so good, that there were voices to get rid of vacuum full at all. Most notably a blog post by Josh Berkus.
All those rewrite based approaches have 2 drawbacks:
- they require full, exclusive table lock (which is kind of OK, since alternative – VACUUM FULL – required it too
- they need more disk space.
The disk space is actually a problem. If you have table that now takes 100GB of disk space, and has 50% bloat – you need 50GB of free disk space to rewrite it. After rewrite, the old bloated version will be dropped. But during rewrite – it has to be there.
In PostgreSQL 9.0 vacuum full has been replaced by solution based on rewrite, so its speed has improved.
But the locking is still a problem. How can you remove bloat from table that has to be online?
The answer, for long time, was “you're out of luck". But now (well, since beginning of 2010 at least) we have new tool: pg_reorg.
What it does is both amazing, and pretty cool.
The general idea is: rewrite the table, and after rewrite is done, swap table files from old-bloated to new, small, fast and happy.
Of course the details are a bit more complex. First question that you should ask is, well, OK – but what happens with the rows that I add/delete/update during rewrite phase? Second should be: is it safe?
To answer to the first question: pg_reorg, before it starts, adds to your bloated table trigger which stores all changes in “log table", so that after rewrite it can be applied.
Full procedure looks like this:
- create log table for changes
- create triggers on old table to log changes
- create new table with copy of all data in old table
- create all indexes on new table
- apply all changes from log table to new table
- switch (in system catalogs) information about table files
- drop old table
If you know a bit about databases, you know that adding triggers is “ALTER TABLE" command, which requires exclusive lock. And also the switch has to be done with exclusive lock, because it deals with the internals of Pg.
Now, let me change the topic for a moment.
Let's say you have a pretty busy table. There are hundreds of connections dealing with it, some doing selects, some writes. Some transactions take fraction of second, some take minutes. When you'll issue “ALTER TABLE" on such table, it will try to get explicit lock. As soon as it will get it – the operation to add trigger is nearly instant. But the fun is when the getting this exclusive lock cannot happen immediately.
What happens is that the exclusive lock is marked as “waiting" till all other locks on the table (previously acquired) will not get released. All share locks, writes of rows, everything. But the pain lies in fact, that since there is exclusive lock waiting – all other locks also cannot be taken. Which basically kills the database, as every query touching the table tries to get its own light weight lock, but cannot due to “exclusive lock" waiting.
Now – pg_reorg does its work without killing database and concurrent queries. How? It's beautiful. It sets very short statement_timeout. Then tries to get the lock. If it will succeed – it finishes work that the lock is for – it's very fast, and non-problematic. And if it can't get the lock within the short statement_timeout? It sleeps a bit, and retries.
The beauty of this approach is that it can do virtually anything, and other connections will never have to wait long for lock because of its work!
Similar things happen in applying changes. It doesn't apply all changes in one transactions. It splits the work into small batches, and does them separately.
Now, you could ask – what happens with changes that happen between “apply all changes from log table to new table" and “switch (in system catalogs) information about table files", are they lost?
No. Pg_reorg applies changes till log will have no more changes to apply. Then it does it's magical “dance" to get exclusive lock on old table, then it checks and applies all new changes (no more changes will happen now, since the table is fully locked), and then switches – all in the same transaction.
And thanks to the fact that before acquiring exclusive lock, pg_reorg did apply all pending changes – there are usually no, or very few, changes to apply under exclusive lock.
Of course – since indexes are created on newly created table – they get refreshed too, and contain no bloat. All without locking.
All in all – amazing stuff.
The final question is: is it safe?
Well. The answer is: we have been running it on multiple databases, on multiple Pg versions, on multiple operating systems and architectures. It works. We never seen data loss.
There are some problems, though.
If you have dropped columns in your table, which can be checked with:
select * from pg_attribute where attrelid = 'table_name'::regclass and attisdropped;
There could be problems. Not data loss, but the table might require manual fixing afterwards.
If you're interested in the problem: all constraints operate on attribute numbers, and not names. So, if you have table:
create table test ( a int4, b int4 default 2, c bool );
Pg remembers that the default is for column #2, and not for column “b". Which means that if I'll drop column “a", and then do pg_reorg of the table – column #2 is now “c" and not “b", and Pg will complain about invalid value for bool column – 2.
Similar problem (afair) was with functions and/or views. Generally – use the CVS version. It has been committed with the patch from Denish in April, and that's all, so it's pretty stable.
All in all – it's a great tool, which does amazing job. And I already used the loop-timeout-lock approach in other situations, as it proved to be simple but very effective.