just recently i saw a very interesting situation.
one table in customer database (with about 1.5 milion rows) was *very often* updated. using series of simple:
UPDATE TABLE SET FIELD = .., WHERE id = ...
updates always updated 1 record, search was using primary key, so it was quite fast.
what was strange was the fact that the table get about 20 times more updates then the next most updated table.
after careful checking i found out that a lot (about 60-70%) of the updates actually didn't change anything!
they were simply setting values that were already there.
so, i started to think about hwo to avoid this kind of situation.
and this is what i found out.
first, let's check how it really works. let's create a table, insert some rows, and update them with the same values. and let's check some basic data about it:
# CREATE TABLE test (id serial PRIMARY KEY, column1 INT4, column2 TEXT); NOTICE: CREATE TABLE will CREATE implicit SEQUENCE "test_id_seq" FOR serial COLUMN "test.id" NOTICE: CREATE TABLE / PRIMARY KEY will CREATE implicit INDEX "test_pkey" FOR TABLE "test" CREATE TABLE # INSERT INTO test (column1, column2) SELECT i, i::TEXT FROM generate_series(1,100000) i; INSERT 0 100000 # vacuum FULL analyze test; VACUUM # REINDEX TABLE TEST; REINDEX
ok, now i have a nice table with 100k records. how big is it?
# SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) FROM pg_class WHERE relname IN ('test', 'test_pkey'); relname | relkind | pg_size_pretty -----------+---------+---------------- test | r | 4328 kB test_pkey | i | 1768 kB (2 ROWS)
not, much. so, let's update the whole table (which means 100k rows) with the same values:
# UPDATE test SET column1 = column1, column2 = column2; UPDATE 100000 TIME: 3038.965 ms # SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) FROM pg_class WHERE relname IN ('test', 'test_pkey'); relname | relkind | pg_size_pretty -----------+---------+---------------- test | r | 8648 kB test_pkey | i | 5272 kB (2 ROWS)
so, table got twice as big. index – even more. time to update – over 3 seconds ( i redid the test couple of times and got results like: 3538.617, 3019.927, 2480.197, 1732.637, 2908.378).
now, basically – there are situations where one could say that it's perfectly sensible to make such a update – for example if we are interested in calling triggers to refresh some data.
but for me – it was rather bad. i didn't have any triggers. modifying the code was possible, but troublesome.
basically – i would like to make all those updates (thousands per hour) less intrusive.
so, there goes an idea – let's create a trigger that will simply skip update if updated row is exactly the same as previous.
some coding, and here we have:
CREATE OR REPLACE FUNCTION prevent_empty_updates() RETURNS TRIGGER AS $BODY$ DECLARE BEGIN IF ROW(OLD.*) IS DISTINCT FROM ROW(NEW.*) THEN RETURN NEW; END IF; RETURN NULL; END; $BODY$ LANGUAGE plpgsql; CREATE TRIGGER prevent_empty_updates BEFORE UPDATE ON test FOR EACH ROW EXECUTE PROCEDURE prevent_empty_updates();
now, how good it is?
let's test. i fixed my test sql, so now it: creates the table, inserts rows, creates trigger, vacuum, reindex, and times run of update. after all – drop the test table.
the best time without trigger was 1.7s. so, how well (or how bad) will it be now?
times after 10 runs: 992.598, 1055.256, 949.712, 1143.513, 901.878, 968.054, 999.859, 959.484, 1282.964 and 912.813.
so, it looks quite well. on average it was a bit over 1 second.
what's more important – in this way i also avoid table/index bloat – as the updates don't really occur.
so, is it cheap gain without any cost?
unfortunately no – trigger running has some cost. how much? let's check how long will the update take if i really will update the rows (with the trigger), this time, setting some different value.
another 10 rounds. times:
2897.916, 2923.946, 3328.016, 3291.242, 2886.756, 2901.304, 3395.658, 3278.864, 2881.064 and 3468.860 (average: 3125.3 ms!)
whoa. it looks almost too good to be true. it looks like almost no cost added? let's re-test updates, in exactly the same fashion, but without the trigger – just to be sure. times:
2550.144, 1708.779, 1836.645, 2371.454, 1715.331, 2185.992, 1894.649, 2116.357, 2612.019, 1880.154 (average: 2087.1ms).
ok, now the results are a bit saner.
so – it shows that putting this trigger yields about 50% penalty in performance.
but i think that in real-life it will not be as high. first of all – i used a very simple table. no additional triggers, no foreign keys, no nothing like this. and since this trigger effectively avoids all these things in case of “empty" updates (and don't forget about cheaper vacuums!), i guess that the real-life cost would be closer to 15-20%.
so, finishing the post – it would be best to avoid empty updates in client code – either by issuing updates only when necessary, or issuing them in a way that they will update only the rows that really need to be updated, like:
UPDATE TABLE SET FIELD = 12 WHERE id = 90 AND FIELD IS DISTINCT FROM 12;
but if you can't modify the code, and you have a similar problem with excessive number of not really useful updates – a simple trigger can save the day.