Reduce bloat of table without long/exclusive locks

Some time ago, Joshua Tolley described how to reduce bloat in tables without locking (well, some locks are there, but they are very short and not really intrusive).

Side note: Joshua: big thanks, great idea.

Based on his idea and some of our research, I wrote a tool which does just this – reduces bloat in a table.

This tool is available for download from the OmniTI SVN repo.

How does it work?

First, I'll create a small table:

$ CREATE TABLE test_table AS
    SELECT
        i AS id,
        repeat( 'value : ' || i, CAST(random() * 500 + 500 AS int4)) AS textual,
        now() - '10 years'::INTERVAL * random() AS when_tsz,
        random() < 0.5 AS some_flag,
        random() * 1000000 AS some_float
    FROM generate_series( 1, 1000000) i;
SELECT 1000000

As you can see, it has 1M rows, and its size is:

$ SELECT pg_relation_size( 'test_table' ), pg_total_relation_size( 'test_table' );
 pg_relation_size | pg_total_relation_size
------------------+------------------------
        207290368 |              207298560
(1 row)

Since it was just created by a bulk insert, it has zero bloat.

So, let's give it 50% bloat:

$ DELETE FROM test_table WHERE 0 = id % 2;
DELETE 500000
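
As a side check (my addition, not part of the original walk-through): if you have the pgstattuple contrib module installed, you can verify the bloat directly – right after the DELETE, roughly half of the tuples should be reported as dead:

$ SELECT tuple_percent, dead_tuple_percent, free_percent
    FROM pgstattuple( 'test_table' );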

Now, for demonstration purposes, I will create two indexes:

$ CREATE INDEX whatever1 ON test_table (when_tsz);
CREATE INDEX
 
$ ALTER TABLE test_table ADD PRIMARY KEY (id);
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "test_table_pkey" for table "test_table"
ALTER TABLE

Please note that these indexes do not have any bloat – they were created after I deleted rows from the table.

Sizes of the indexes:

$ SELECT pg_relation_size( 'whatever1' ), pg_relation_size( 'test_table_pkey' );
 pg_relation_size | pg_relation_size
------------------+------------------
         11255808 |         11255808
(1 row)

Now I can run my tool.

It requires that I tell it how many pages to try to free.

This is important: every page freed means additional bloat in the indexes (each row moved off a freed page gets a new index entry pointing at its new location, and the old entries stay around until vacuum removes them), so it's safer (although slower) to start with a small number. Let's say 20.

So, I run the script:

2010-10-17 15:08:38 : ./compact_table Settings:
  - CLEAN_PAGES       : 20
  - EXTENDED_DEBUG    : 0
  - FINAL_VACUUM      : 1
  - INITIAL_VACUUM    : 1
  - PGDATABASE        :
  - PGHOST            :
  - PGPORT            :
  - PGUSER            :
  - PSQL              : psql
  - TABLE_NAME        : test_table
  - TABLE_SCHEMA      : public
  - VERBOSE           : 1
2010-10-17 15:08:38 : At most, we can have 315 tuples per page.
2010-10-17 15:08:38 : Updates will be done on column: textual
2010-10-17 15:08:38 : Entering main loop.
2010-10-17 15:08:38 : Initial vacuuming
2010-10-17 15:08:53 : Current table size: 25304 pages.
2010-10-17 15:08:53 : Working on page 25303 (1 of 20)
2010-10-17 15:08:54 : Working on page 25302 (2 of 20)
2010-10-17 15:08:56 : Working on page 25301 (3 of 20)
2010-10-17 15:08:58 : Working on page 25300 (4 of 20)
2010-10-17 15:08:59 : Working on page 25299 (5 of 20)
2010-10-17 15:09:01 : Working on page 25298 (6 of 20)
2010-10-17 15:09:03 : Working on page 25297 (7 of 20)
2010-10-17 15:09:05 : Working on page 25296 (8 of 20)
2010-10-17 15:09:06 : Working on page 25295 (9 of 20)
2010-10-17 15:09:08 : Working on page 25294 (10 of 20)
2010-10-17 15:09:10 : Working on page 25293 (11 of 20)
2010-10-17 15:09:12 : Working on page 25292 (12 of 20)
2010-10-17 15:09:14 : Working on page 25291 (13 of 20)
2010-10-17 15:09:16 : Working on page 25290 (14 of 20)
2010-10-17 15:09:18 : Working on page 25289 (15 of 20)
2010-10-17 15:09:20 : Working on page 25288 (16 of 20)
2010-10-17 15:09:22 : Working on page 25287 (17 of 20)
2010-10-17 15:09:23 : Working on page 25286 (18 of 20)
2010-10-17 15:09:25 : Working on page 25285 (19 of 20)
2010-10-17 15:09:27 : Working on page 25284 (20 of 20)
2010-10-17 15:09:29 : Final vacuuming
2010-10-17 15:09:31 : Final table size: 25304 pages.
SELECT 'Size of index (public.whatever1) before reindex:', pg_relation_size('public.whatever1');
CREATE INDEX CONCURRENTLY whatever1_new ON test_table USING btree (when_tsz);
DROP INDEX public.whatever1;
ALTER INDEX public.whatever1_new RENAME TO whatever1;
SELECT 'Size of index (public.whatever1) after reindex:', pg_relation_size('public.whatever1');
SELECT 'Size of index (public.test_table_pkey) before reindex:', pg_relation_size('public.test_table_pkey');
CREATE UNIQUE INDEX CONCURRENTLY test_table_pkey_new ON test_table USING btree (id);
DROP INDEX public.test_table_pkey;
ALTER INDEX public.test_table_pkey_new RENAME TO test_table_pkey;
SELECT 'Size of index (public.test_table_pkey) after reindex:', pg_relation_size('public.test_table_pkey');
2010-10-17 15:09:31 : All done.

That's a lot of output. Let's see what it showed.

First, it showed the settings it will use to connect to the database and to control the cleanup.

Then it calculated that with this PostgreSQL there can be at most 315 tuples per page. This is important because we need to use the "=" operator to get TID scans, and not (like in Joshua's blog post) the ">=" operator, which forces a sequential scan – something we don't want.
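
To illustrate (a minimal sketch, not necessarily the tool's exact query): knowing the maximum of 315 tuples per page, the script can enumerate every possible TID on a page using plain equality. The no-op update creates new row versions, which can land in the free space that vacuum marked earlier in the table:

$ UPDATE test_table
     SET textual = textual
   WHERE ctid IN ( '(25303,1)', '(25303,2)', '(25303,3)' ); -- and so on, up to '(25303,315)'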

Afterwards, it ran the initial vacuum. This is important, as vacuum has to mark the free space in the table before we can move records there. Usually you don't need it (that's why it's disabled by default), but in my case I had just created the table, so autovacuum hadn't yet had a chance to scan it.

Afterwards, it calculated the current table size (25304 pages) and worked on the last 20 pages.
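
As a sanity check, pages map directly to bytes: with the default 8 kB block size, 25304 pages * 8192 bytes = 207290368 bytes – exactly the pg_relation_size we saw at the start (DELETE alone never shrinks the file). You can do the conversion yourself:

$ SELECT pg_relation_size( 'test_table' ) / 8192;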

After finishing the last of the 20 pages, it ran the final vacuum. This vacuum is much more important, as this is when the relation truncation actually happens, so it's enabled by default.
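
Presumably this step boils down to a plain vacuum of the table (sketch):

$ VACUUM test_table;

As the documentation notes, VACUUM returns space to the operating system only when one or more pages at the end of the table become entirely free and an exclusive table lock can be easily obtained.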

After the vacuum, the script checks the current table size. As you can see, in this case the size of the relation did not change. I'm not sure why, but I guess it has something to do with the vacuums, because when I ran it the next time, I got this output (irrelevant parts skipped):

=$ ./compact_table -v -t test_table -k 20
...
2010-10-17 15:14:41 : Current table size: 25304 pages.
2010-10-17 15:14:41 : Working on page 25303 (1 of 20)
...
2010-10-17 15:15:10 : Working on page 25284 (20 of 20)
2010-10-17 15:15:12 : Final vacuuming
2010-10-17 15:15:12 : Final table size: 25284 pages.

This shows that the table size decreased. After this second run, all subsequent runs truncated the table without problems.

And finally, the compact_table script outputs a small SQL script (starting with SELECT 'Size of index…') which can be used to reindex all indexes on the compacted table, once you think you've compacted it enough.

There is a small problem with indexes, though. If a given index backs a constraint (a primary key or a unique constraint), it cannot be dropped. In such a case you would have to drop the constraint first, but afterwards it will not be possible to re-attach the constraint to the concurrently built index (there is work in progress for 9.1 to allow adding constraints based on existing indexes, but it's not done yet).
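
For example (illustrative only), on our test table the primary key could currently only be rebuilt the blocking way:

$ ALTER TABLE test_table DROP CONSTRAINT test_table_pkey; -- drops its index too
$ ALTER TABLE test_table ADD PRIMARY KEY (id);            -- rebuilds the index under an exclusive lock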

Generally, the best way to use the tool is to run it many times with a small -k (10? 20? 100? It depends on how big the table is), and only after freeing significant space do the reindexing.

You can also check the index sizes before compacting, watch them grow, and decide when to stop compacting and reindex.
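
A simple query for watching them (a helper of mine, not part of the tool):

$ SELECT indexrelname, pg_relation_size( indexrelid )
    FROM pg_stat_user_indexes
   WHERE relname = 'test_table';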

Have fun, and I hope you'll find it useful.

12 thoughts on “Reduce bloat of table without long/exclusive locks”

  1. Note that while the lock required here is short and unobtrusive, it’s still an exclusive one. The way the VACUUM code is written, if it can’t grab that exclusive lock in order to shrink the table, it just quietly moves on without reducing the table’s size. See the comments for lazy_truncate_heap in src/backend/commands/vacuumlazy.c for details.

    If there is activity against this table, you can do most of the shrinking work concurrently with that using your tool. But you will need to find or force an idle period before the shrinking VACUUM will execute, so that it succeeds when it tries to grab the exclusive lock required.

  2. @Dim:

    Thanks for the comment. I forgot to mention it.

    Well, reorg does something like this:

    1. creates new table
    2. adds triggers on source table to update new table with changes
    3. copies data from source table
    4. modifies catalog to point relfilenode of base table to new file.

    All in all, it seems pretty invasive, and it requires a lot of I/O (copying the whole table).

    compact_table is much leaner – all it does is a series of updates on the compacted table and, every now and then, vacuums.

    So, both approaches have areas where they are better.

  3. You need to do an ANALYZE after the new index is created and before you drop the old one. This will update the statistics and usually make queries start using the new index. If you don’t, you may find your DROP INDEX being blocked or, worse, queries choosing a wrong plan due to bad statistics.

  4. @Joshua Tolley:
    yeah – seqscans were a big bummer. Tbh, I don’t understand why there is no tidscan for >= / <= operators.

  5. Hi,

    I’m wondering when pages are actually released.
    As far as I knew, vacuum frees dead rows but no disk space. Or am I wrong? Or does relpa

    With UPDATE you migrate rows from one page to another to compact – that I understand. But…

    Cheers,
    Uwe

  6. Uwe,
    from the docs:
    The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained.
