Bloat removal by tuples moving – select * from depesz;

Looong time ago, I wrote a piece about removing bloat by moving rows away from the end of table, and vacuuming it.

This is/was very slow, and was optimized (to some extent) by Nathan Thom, but his blogpost vanished. Besides, later on we got great tool: pg_reorg (or, as it's currently named: pg_repack).

But recently I was in position where I couldn't pg_reorg. So I had to look for other way. And I found it 🙂

First, let me describe situation:

I had a 350GB table, which had many rows inserted, but never removed or updated. Then, at some point in time, client made a decision to remove all rows older than 4 months. These rows were removed, but this didn't decrease the table size. Why?

Because all free pages are at the beginning of the table “file" (well, it's several files, but let's assume it's single file).

Normal vacuum can shrink the table only if the trailing page(s) are empty – it just truncates the table file by how many empty pages are at the very end.

You can use VACUUM FULL, but this has drawbacks:

it locks the table for the whole duration
depending on version it might: bloat indexes and/or take ages to finish

The problem was that quick estimate told us that if we removed 100% of bloat, the table (with indexes) would use ~ 150GB, and we had only 50GB of free disk space, so using pg_reorg is not an option.

In case you don't understand why pg_reorg is not an option: it copies the table. So for some time you have both old, bloated table, and new, clean. Thus effectively using much more disk space.

When put in this situation I recalled the compacting tool I wrote almost 3 years ago, but it was just too slow.

The problem is that when doing UPDATE, PostgreSQL marks previous version of row as removed, and inserts new one. But it tries hard to put the new row in the same page (8kB) as previous version. So, to move all rows from the last page, I would have to disable autovacuum, and update them multiple times just so that I'd use *all* of available free space in last page. And then I'd have to repeat it all for 2nd to last page, and so on.

This is extremely time consuming.

But when I was thinking about it, I realized something: update is internally, more or less “delete + insert". And what if I did: delete, and then insert of the same row? Would that also put the new row in the same page? It's testing time.

First, we need some test table. That is simple:

$ create table test (id serial primary key, payload text);
CREATE TABLE

And now, let's insert some rows:

$ insert into test (payload) select repeat('depesz', 100) from generate_series(1,1000000);
INSERT 0 1000000

In case you wonder, what's the point of payload column – it's just to make rows wider, and use more pages.

After this insert, the table is 651MB, and index on id column is 21MB.

Not bad.

Now, let's remove first 80% of the table:

$ delete from test where id < 800000;
DELETE 799999

couple of vacuums later (just to be sure that all free pages got marked as free):

for i in {1..5}; do vacuumdb -t test; done

but the size is still the same:

$ \dt+ test
                   List of relations
 Schema | Name | Type  | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
 public | test | table | depesz | 651 MB |
(1 row)
 
$ select pg_table_size('test');
 pg_table_size
---------------
     682885120
(1 row)

So, first, let's verify that update tries to put the row in the same page. We know that first 80% of the table is reusable. Let's look at the end of the table:

$ select ctid, id from test order by id desc limit 10;
    ctid    |   id    
------------+---------
 (83333,4)  | 1000000
 (83333,3)  |  999999
 (83333,2)  |  999998
 (83333,1)  |  999997
 (83332,12) |  999996
 (83332,11) |  999995
 (83332,10) |  999994
 (83332,9)  |  999993
 (83332,8)  |  999992
 (83332,7)  |  999991
(10 rows)

In case you don't know – ctid is system, hidden, column, which shows where in the file this row is. It is always a pair of numbers. First number is page number, starting from 0. The second number is tuple (record) number within the page, and it's starting with 1.

Based on ctids with page 83332, I see that Pg fits 12 rows per page. So, in the last page (83333) there are still 8 free spaces.

So, let's update the last row:

$ update test set id = id where id = 1000000;
UPDATE 1

This update didn't do anything in terms of data, but it caused Pg to copy the row:

$ select ctid, id from test order by id desc limit 10;
    ctid    |   id    
------------+---------
 (83333,5)  | 1000000
 (83333,3)  |  999999
 (83333,2)  |  999998
 (83333,1)  |  999997
 (83332,12) |  999996
 (83332,11) |  999995
 (83332,10) |  999994
 (83332,9)  |  999993
 (83332,8)  |  999992
 (83332,7)  |  999991
(10 rows)

As you can see the (83333,4) is now missing. If I'd update it again:

$ update test set id = id where id = 1000000;
UPDATE 1
 
$ select ctid, id from test order by id desc limit 10;
    ctid    |   id    
------------+---------
 (83333,6)  | 1000000
 (83333,3)  |  999999
 (83333,2)  |  999998
 (83333,1)  |  999997
 (83332,12) |  999996
 (83332,11) |  999995
 (83332,10) |  999994
 (83332,9)  |  999993
 (83332,8)  |  999992
 (83332,7)  |  999991
(10 rows)

We can see the row moving forward. Let's vacuum the table, and redo the test:

$ vacuum test;
VACUUM
 
$ update test set id = id where id = 1000000;
UPDATE 1
 
$ select ctid, id from test order by id desc limit 10;
    ctid    |   id    
------------+---------
 (83333,5)  | 1000000
 (83333,3)  |  999999
 (83333,2)  |  999998
 (83333,1)  |  999997
 (83332,12) |  999996
 (83332,11) |  999995
 (83332,10) |  999994
 (83332,9)  |  999993
 (83332,8)  |  999992
 (83332,7)  |  999991
(10 rows)

Please note that instead of saving the row as (83333,7) (next row), Pg actually reused (83333,5). But it reused within the same page. So this doesn't help me with my task of cleaning last pages.

What about delete/insert then?

I know the data in the row, so it's simple:

$ delete from test where id = 1000000;
DELETE 1
 
$ insert into test (id, payload) values (1000000, repeat('depesz', 100) );
INSERT 0 1
 
$ select ctid, id from test order by id desc limit 5;
    ctid    |   id    
------------+---------
 (0,1)      | 1000000
 (83333,3)  |  999999
 (83333,2)  |  999998
 (83333,1)  |  999997
 (83332,12) |  999996
(5 rows)

YAY! The record got moved to the very beginning of the table. But the idea of getting the row, saving somewhere the data, deleting and inserting is not really nice.

Luckily I'm on new Pg (9.4devel, but it would have worked since 9.1), so I can:

$ with x as (
delete from test where id in (999997,999998,999999) returning *
)
insert into test
    select * from x;
INSERT 0 3
 
$ select ctid, id from test order by id desc limit 5;
    ctid    |   id    
------------+---------
 (0,1)      | 1000000
 (1,3)      |  999999
 (1,2)      |  999998
 (1,1)      |  999997
 (83332,12) |  999996
(5 rows)

Yeah. The page 83333 should now be free. So let's see:

$ vacuum test;
VACUUM
 
$ select pg_table_size('test');
 pg_table_size 
---------------
     682876928
(1 row)

previous size was 682885120 – so we got 8192 bytes decrease in size – one page to be exact.

The good things are: each row has to be updated (well, deleted and inserted) just once, and it will be relocated to first free space.

It's reasonably fast (I was able to reclaim ~ 15 GB in 6 hours on my bad-situation server).

Does it mean that it's a silver bullet? Well, obviously – no. If the table had triggers – it could get problematic (both on delete and on insert triggers would get called). Same happens with foreign keys pointing to this table. But if you don't have these (fkeys to this table and triggers on this table) – then it works just great.

The only thing you have to keep in mind is to update the rows starting from the last ones.

I did it by simply doing : select ctid, primary_key_column from table, storing output in flat file, and then sorting it (outside of database to avoid putting too much load on it). Afterwards, I had sorted list of primary key column values that I should use in my updates.

If you'd want to do it too, one bit of warning: do not run vacuums in the mean time. Current vacuum is doing just some part of all its work, and then exits – to avoid putting too much strain on database.

So, when I ran two sessions:

delete/insert, in batches of 1000 rows, next batch immediately after previous
vacuum of the table, new vacuum immediately after previous

All vacuum did was marking pages as free, but it didn't get to the point of work to actually truncate the table.

Only after I stopped the row moving, vacuum did catchup with marking rows as deleted, and started to truncate the table.

All in all, after 6 hours of row moving, I ran vacuums for another 2 hours (each vacuum took ~ 3 minutes) to get all the disk space reclaimed (all that was possible to reclaim, not all that there was, as vacuum can't reclaim free space from the middle of table).

Hope you'll fine it useful.

8 thoughts on “Bloat removal by tuples moving”

Filip says:

2013-06-21 at 16:05

Very nice. I always stress the importance of internals of UPDATEs, and MVCC in general, to my students. Now we have a proof that knowing what’s under the hood helps a lot.

Regarding technicals – can you unload and reload data using COPY, to make things even faster?
depesz says:

2013-06-21 at 16:07

Sure, you can. But I want to keep all data available to queries.
My approach locks only 1000 rows at a time, and there is no point in time when any other transaction can not see *all* of the data.
Anonymous says:

2013-06-21 at 20:33

Oh man, this would be great for a database I have, except that it doesn’t seem to work on my 9.1 install – a delete followed by an insert gets moved to the end of the table. Is it because I’m not moving rows from the last page? If so, that makes this trick useless on busy tables. 🙁
Sean Chittenden says:

2013-06-22 at 08:29

There’s another approach that may be less tedious and is done entirely inside of PostgreSQL: use table inheritance. If it’s of interest, I can put together a post on this, but the gist of the procedure is:

1) BEGIN;
2) Rename big_table to big_table_unvacuumed.
3) Create parent table named big_table
4) Add big_table_unvacuumed to big_table
5) Create a new table big_table_vacuumed and add to big_table
6) Add an INSERT trigger to big_table to make sure it retains zero rows (not strictly necessary, but this is how I like doing things). I would direct new rows to the vacuumed table
7) Commit;

Once the above is done, use a wCTE exactly like what was described to DELETE+INSERT from the _unvacuumed table to big_table_vacuumed. The advantage here is instead of manually having to specify rows in the DELETE’s IN() clause, you can use “WHERE id IN(SELECT * FROM big_table_unvacuumed ORDER BY id DESC LIMIT 1000)” and you automatically get the last 1K rows. And also like what you stated, use a second terminal window that has vacuumdb running in a loop against the unvacuumed table. Rinse repeat until there are 0 rows left in _unvacuumed.

And finally, once you’re done, UNINHERIT and swap big_table_vacuumed with big_table and nuke the residual. Not ideal, but it’s handy when you’re in a jam.
depesz says:

2013-06-22 at 11:26

@Anonymous: if it is movig rows to the end fo table, it would suggest that you didn’t vacuum the table recently.
greg says:

2013-06-23 at 17:12

One difference between update and insert+delete is what happens to concurrent updates in read committed mode. If a concurrent update comes across a row already updates by another transaction it will wait to see if that transaction commits and then update the new row (if it still matches the where clause). If it comes across a deleted row it just ignores it.

Read committed mode works the way people expect in the usual cases but replacing rows by deleting them and reinserting them is the kind of that requires predicate locks to get perfect.
Ivan says:

2013-06-24 at 08:40

Nice job, depesz!
By chance, I researched this model in the last weekend (22 – 23 june) and didn’t see this article.
I have a 4TB table, including 3.5TB of toast table.
In this case, vacuum full it’s a really bad idea.
Unfortunately, the table has some triggers and foreign key.
I also came to the conclusion that the update records is not appropriate, it is compounded by the fact that the table has fillfactor = 80.
My algorithm:
1. Select relpages from pg_class – it’s a last page in table file.
2. Insert into table clones of records on last pages with special mark.
3. Delete original records.
4. Remove special mark by update clones.
5. Vacuum analyze table – it will change relpages value and will truncate empty pages.

As a result, toast table does not change its size.
Do you have any idea how you can truncate toast table without manipulating pg_toast.pg_toast_X table? And how pg_toast records are associated with table records (I need to check data integrity)?
Sergey says:

2013-06-24 at 21:18

I just wonder how fast would it be in comparison to pgcompactor (https://code.google.com/p/pgtoolkit/)? It uses the same principle (updates) as your previous compacting tool, but more improved and optimized. I managed to compact 0.5TB database (on 6 SAS RAID10) with it in <18h.

Comments are closed.