Bloat happens

For various reasons, and in various cases, bloat happens. Theoretically autovacuum protects us all, but sometimes it doesn't. Sometimes someone disables it, or misconfigures it, or bad planet alignment happens, and we end up in deep bloat.

What to do then? Vacuum? Vacuum Full? Cluster? No. pg_reorg!

Before I describe what pg_reorg is, and how it works, let me briefly describe the alternatives.

Normal vacuum doesn't (usually) shrink table files. It marks the space used by dead rows as reusable, so that new rows can go into existing pages instead of extending the file. If it happens that there are some 100% empty pages at the very end of the table file – they get truncated, and the table size decreases. But that's not very common.
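
You can watch this behavior with a couple of size checks – a minimal sketch, with some_table standing in for your table:

SELECT pg_size_pretty(pg_relation_size('some_table')); -- size before

VACUUM some_table;

SELECT pg_size_pretty(pg_relation_size('some_table')); -- size after: usually
-- unchanged, unless the trailing pages of the file were completely empty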

Based on this vacuum behavior, I once wrote a tool that removes bloat by doing a lot of updates and vacuums. Later on it was optimized by Nathan Thom.

But it's still slow. It also shares a big drawback with vacuum full (pre 9.0).

Vacuum full before version 9.0 of Pg did the shuffling for the whole table, which means that it did in fact shrink the table. But it took ages. And, because of the tuple moving, it actually bloated the indexes. Since it had to deal with the whole table, it also locked it, which made it (in my opinion) the worst possible option:

  • unbearably slow
  • bloats indexes
  • locks the table

There was just one benefit – it needed relatively little extra space, basically just for the index bloat, because all the shuffling of rows was done within the table files themselves.

Because vacuum full was so bad, people started doing other things. Like – table rewrite. The simplest form is pg_dump of the table, drop table, load table. But there are smarter ways, like:

CLUSTER TABLE_NAME USING primary_key_index;
ALTER TABLE TABLE_NAME SET WITHOUT CLUSTER;

The first command causes a full table rewrite – way faster than the pre-9.0 VACUUM FULL. The rows do get reorganized, but differently than with vacuum full: rather than being shuffled in place to get the best density, they are written out to a fresh file, ordered by the primary_key_index column(s). The second command removes the cluster mark from the table. This mark influences certain planner decisions, and we don't want to keep the table clustered (i.e. ordered in index order) – we just used the side effect of the CLUSTER command.

The CLUSTER (or any other rewrite-based) approach was so good that there were voices to get rid of vacuum full altogether. Most notably a blog post by Josh Berkus.

All those rewrite-based approaches have two drawbacks:

  • they require a full, exclusive table lock (which is kind of OK, since the alternative – VACUUM FULL – required it too)
  • they need more disk space.

The disk space is actually a problem. If you have a table that takes 100GB of disk space and has 50% bloat – you need 50GB of free disk space to rewrite it. After the rewrite, the old bloated version gets dropped. But during the rewrite – it has to be there.
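
To put a number on it: the free space a rewrite needs is roughly the live portion of the table. A sketch, where 0.5 is just the 50% bloat fraction from the example above (in practice you'd take it from a bloat-estimation query, e.g. from check_postgres):

SELECT pg_size_pretty(
    (pg_relation_size('some_table') * (1 - 0.5))::bigint -- 0.5 = bloat fraction
) AS space_needed_for_rewrite;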

In PostgreSQL 9.0, vacuum full was replaced by a rewrite-based implementation, so its speed has improved.

But the locking is still a problem. How can you remove bloat from a table that has to stay online?

The answer, for a long time, was “you're out of luck”. But now (well, since the beginning of 2010 at least) we have a new tool: pg_reorg.

What it does is both amazing, and pretty cool.

The general idea is: rewrite the table, and after the rewrite is done, swap the table files from the old, bloated one to the new, small, fast and happy one.

Of course the details are a bit more complex. The first question you should ask is: well, OK – but what happens with the rows that I add/delete/update during the rewrite phase? The second should be: is it safe?

To answer the first question: before it starts, pg_reorg adds a trigger to your bloated table which stores all changes in a “log table”, so that they can be applied after the rewrite.

The full procedure looks like this (a simplified sketch of the first two steps follows the list):

  • create log table for changes
  • create triggers on old table to log changes
  • create new table with copy of all data in old table
  • create all indexes on new table
  • apply all changes from log table to new table
  • switch (in system catalogs) information about table files
  • drop old table
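
To make the first two steps more concrete, here is a minimal sketch of the idea. All names are made up, and this is not pg_reorg's actual internal schema:

-- the bloated table we want to reorganize:
CREATE TABLE bloated (id int4 PRIMARY KEY, payload text);

-- log table: one row per change, in order:
CREATE TABLE bloated_log (
    seq    bigserial PRIMARY KEY, -- preserves the order of changes
    pk     int4,                  -- key of the affected row
    newrow bloated                -- full new row version; NULL means DELETE
);

CREATE FUNCTION bloated_log_changes() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO bloated_log (pk, newrow) VALUES (OLD.id, NULL);
    ELSE
        INSERT INTO bloated_log (pk, newrow) VALUES (NEW.id, NEW);
    END IF;
    RETURN NULL; -- AFTER trigger, so the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_changes AFTER INSERT OR UPDATE OR DELETE ON bloated
    FOR EACH ROW EXECUTE PROCEDURE bloated_log_changes();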

If you know a bit about databases, you know that adding a trigger requires an exclusive lock on the table. And the switch also has to be done under an exclusive lock, because it deals with the internals of Pg.

Now, let me change the topic for a moment.

Let's say you have a pretty busy table. There are hundreds of connections dealing with it, some doing selects, some writes. Some transactions take a fraction of a second, some take minutes. When you issue a lock-requiring command (like the trigger creation) on such a table, it will try to get an exclusive lock. As soon as it gets it – the operation to add the trigger is nearly instant. But the fun starts when this exclusive lock cannot be acquired immediately.

What happens is that the exclusive lock request is queued until all other, previously acquired locks on the table get released. All share locks, row writes, everything. But the pain lies in the fact that, since there is an exclusive lock waiting, all other lock requests also cannot be granted. Which basically kills the database, as every query touching the table tries to get its own (much weaker) lock, but cannot, because of the exclusive lock waiting ahead of it.
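
You can reproduce the effect with three psql sessions (busy_table is just a placeholder here):

-- session 1: a long transaction holding a weak (ACCESS SHARE) lock
BEGIN;
SELECT count(*) FROM busy_table; -- transaction intentionally left open

-- session 2: wants ACCESS EXCLUSIVE, queues behind session 1
ALTER TABLE busy_table ADD COLUMN junk int4; -- waits...

-- session 3: a plain read, which only needs ACCESS SHARE,
-- now queues behind session 2's pending exclusive lock
SELECT * FROM busy_table LIMIT 1; -- also waits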

Now – pg_reorg does its work without killing the database and concurrent queries. How? It's beautiful. It sets a very short statement_timeout, then tries to get the lock. If it succeeds – it finishes the work that the lock is for; this is very fast and non-problematic. And if it can't get the lock within the short statement_timeout? It sleeps a bit, and retries.
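
In SQL terms, the pattern looks more or less like this – a sketch only, not pg_reorg's actual source; the table name and timeout value are made up:

SET statement_timeout = '100ms'; -- give up quickly instead of queueing forever

BEGIN;
LOCK TABLE bloated IN ACCESS EXCLUSIVE MODE; -- cancelled after 100ms if blocked
-- ... do the quick, lock-requiring step here (e.g. create the trigger) ...
COMMIT;

-- on "canceling statement due to statement timeout": ROLLBACK,
-- sleep a bit (client side), and try the whole block again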

The beauty of this approach is that it can do virtually anything, and other connections will never have to wait long for a lock because of its work!

Similar things happen when applying changes. It doesn't apply all changes in one transaction. It splits the work into small batches, and does them separately.
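
A sketch of the batching idea, reusing the hypothetical bloated_log table from the earlier sketch (pg_reorg itself does this in C, not in plain SQL):

BEGIN;
DELETE FROM bloated_log
 WHERE seq IN (SELECT seq FROM bloated_log ORDER BY seq LIMIT 1000)
RETURNING seq, pk, newrow; -- the client applies these rows to the new
                           -- table, sorted by seq to preserve change order
COMMIT;

-- repeat until the DELETE returns no rows; since every small batch is its
-- own transaction, no single long transaction blocks anything for long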

Now, you could ask – what happens with the changes that occur between “apply all changes from log table to new table” and “switch (in system catalogs) information about table files”? Are they lost?

No. Pg_reorg applies changes until the log has no more changes to apply. Then it does its magical “dance” to get an exclusive lock on the old table, then it checks for and applies all new changes (no more changes can happen now, since the table is fully locked), and then switches – all in the same transaction.

And thanks to the fact that pg_reorg applied all pending changes before acquiring the exclusive lock – there are usually no, or very few, changes left to apply under the exclusive lock.

Of course – since the indexes are created anew on the new table – they get rebuilt too, and contain no bloat. All without locking.

All in all – amazing stuff.

The final question is: is it safe?

Well. The answer is: we have been running it on multiple databases, on multiple Pg versions, on multiple operating systems and architectures. It works. We have never seen data loss.

There are some problems, though.

If you have dropped columns in your table – which can be checked with:

SELECT * FROM pg_attribute WHERE attrelid = 'table_name'::regclass AND attisdropped;

– there could be problems. Not data loss, but the table might require manual fixing afterwards.

A fix for the problem has been proposed by my colleague Denish Patel and has been (in a modified version) committed to the repo, but it hasn't been released yet (not sure why).

If you're interested in the problem: all constraints (and defaults) operate on attribute numbers, not names. So, if you have a table:

CREATE TABLE test (
  a int4,
  b int4 DEFAULT 2,
  c bool
);

Pg remembers that the default is for column #2, and not for column “b”. Which means that if I drop column “a”, and then pg_reorg the table – column #2 is now “c” and not “b”, and Pg will complain about an invalid value for a bool column: 2.
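
A hypothetical walk-through of what happens in the catalog (output abbreviated, and may differ slightly between versions):

ALTER TABLE test DROP COLUMN a;

SELECT attnum, attname, attisdropped
  FROM pg_attribute
 WHERE attrelid = 'test'::regclass AND attnum > 0;
--  attnum | attname                      | attisdropped
--       1 | ........pg.dropped.1........ | t
--       2 | b                            | f
--       3 | c                            | f

-- The DEFAULT 2 stays attached to attnum 2. A rewrite that squeezes out the
-- dropped column renumbers "b" to 1 and "c" to 2 – so the old default now
-- targets the bool column, hence the complaint about the invalid value.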

A similar problem (afair) existed with functions and/or views. Generally – use the CVS version. The patch from Denish was committed in April, and nothing has changed since, so it's pretty stable.

All in all – it's a great tool, which does an amazing job. And I have already used the loop-timeout-lock approach in other situations, as it proved to be simple but very effective.

20 thoughts on “Bloat happens”

  1. “CLUSTER TABLE_NAME ON primary_key_index;… First command caused full table rewrite, but without reorganizing the rows, ”

    Is this based on an assumption of a serial PK?

  2. @Rob:
    it doesn’t have to be serial. It can be any PK.

    I added clarification in the text, because there is reorganizing in CLUSTER, but different than in vacuum full.

  3. Getting a full table lock on a very frequently updated table is actually something that I have found to be nearly impossible during work-hours. I usually use the command “LOCK TABLE x NOWAIT”.

    This “solution” doesn’t really help in this instance in other words…

  4. I still hope to get an old vacuum-full that runs step by step.
    If I can move a few of the last rows to the first free space, I would be happy.

  5. Hi Depesz. Thanks for the neat explanation of the features of this tool. Now I am planning to use the latest version of pg_reorg, 1.1.7, to un-bloat our production tables. However, I have only one concern.

    So while I am doing pg_reorg -t tableA, can I do DDLs on other tables, like dropping a tableB, creating a new tableC, adding a column to tableD, etc.?

    I see this in the pg_reorg documentation, but it doesn't specifically mention that the DDL ought to be avoided only on the tables being processed via pg_reorg. Here is the exact text from the doc:

    “You cannot do DDL commands except VACUUM and ANALYZE during pg_reorg. In many case pg_reorg
    would fail and rollback collectly, but there are some cases ending with data-corruption .”

    Kindly suggest.

  6. @pg_lizard:
    I don’t see any problem with concurrent ddl or vacuums on other tables.
    we don’t recall if we did it, but given how pg_reorg works, I don’t see how it could be a problem.

  7. That’s super cool. Now that makes me a little less worried. Thanks for the reply Depesz. I used the pg_reorg on our database and got rid of a decent amount of bloat.

  8. Hi depesz. There's one thing I wanna know: when is bloat too big?
    What are your numbers, from experience, for “warning” and “critical” bloat?

  9. @Han Zheng:
    I try to reorg tables as soon as they hit 20% of bloat (as reported by check_postgres.pl), or, if the table is small (<200MB) – 30-40%.

  10. Any tool that will do something similar where tables DO NOT have a PRIMARY KEY?

  11. @Raj:
    no idea. The question is – why on earth do you have tables, big enough for bloat to matter, that don't have a primary key?!

  12. Hello depesz, great explanation. One question: how is statement_timeout related to acquiring a lock? AFAIK, only lock_timeout is related to lock acquisition. Thanks.

  13. @soniharriz:
    Acquiring a lock happens in a query – a statement. So if the statement runs for longer than statement_timeout (either because it's doing something, or because it's waiting for a lock), it will get killed.

  14. Hello again depesz. We have 2 PG 9.1 servers replicated using Slony 2.0.7. Is it safe to run pg_repack 1.2.0 or 1.2.1 on the Slony master and slave?

  15. @soniharriz:
    to be honest – not sure. I don't use Slony, so I never tested it. But it should be OK, I think.
