avoiding “empty” updates

just recently i saw a very interesting situation.

one table in a customer database (with about 1.5 million rows) was *very often* updated, using a series of simple statements like:

UPDATE some_table SET field = ... WHERE id = ...

each update modified exactly 1 record, found via the primary key, so it was quite fast.

what was strange was the fact that the table got about 20 times more updates than the next most-updated table.
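
if you want to check this on your own database, something along these lines against the standard pg_stat_user_tables view will list the most-updated tables – the view is standard, but the exact query is just a sketch:

-- per-table update counts since the last stats reset, busiest first
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 5;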

after careful checking i found out that a lot (about 60-70%) of the updates actually didn't change anything!

they were simply setting values that were already there.

so, i started to think about how to avoid this kind of situation.

and this is what i found out.

first, let's check how it really works. let's create a table, insert some rows, and update them with the same values. and let's check some basic data about it:

# CREATE TABLE test (id serial PRIMARY KEY, column1 INT4, column2 TEXT);
NOTICE:  CREATE TABLE will CREATE implicit SEQUENCE "test_id_seq" FOR serial COLUMN "test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will CREATE implicit INDEX "test_pkey" FOR TABLE "test"
CREATE TABLE
# INSERT INTO test (column1, column2) SELECT i, i::TEXT FROM generate_series(1,100000) i;
INSERT 0 100000
# vacuum FULL analyze test;
VACUUM
# REINDEX TABLE TEST;
REINDEX

ok, now i have a nice table with 100k records. how big is it?

# SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) FROM pg_class WHERE relname IN ('test', 'test_pkey');
  relname  | relkind | pg_size_pretty
-----------+---------+----------------
 test      | r       | 4328 kB
 test_pkey | i       | 1768 kB
(2 ROWS)

not much. so, let's update the whole table (which means 100k rows) with the same values:

# UPDATE test SET column1 = column1, column2 = column2;
UPDATE 100000
TIME: 3038.965 ms
# SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) FROM pg_class WHERE relname IN ('test', 'test_pkey');
  relname  | relkind | pg_size_pretty
-----------+---------+----------------
 test      | r       | 8648 kB
 test_pkey | i       | 5272 kB
(2 ROWS)

so, the table got twice as big. the index – even more. time to update – over 3 seconds (i redid the test a couple of times and got results like: 3538.617, 3019.927, 2480.197, 1732.637, 2908.378).
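
why did it grow? postgresql's mvcc never overwrites a row in place – every update, even one that changes nothing, writes a complete new row version and leaves the old one behind as a dead tuple. you can watch the dead rows pile up (a quick check using the standard stats view; the numbers are approximate, as the stats are gathered asynchronously):

-- live vs. dead row versions for the test table
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'test';

right after the no-op update this should show about 100k dead tuples – one for every row we “updated”.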

now, basically – there are situations where one could say that it's perfectly sensible to make such an update – for example if we are interested in firing triggers to refresh some data.

but for me – it was rather bad. i didn't have any triggers. modifying the code was possible, but troublesome.

basically – i would like to make all those updates (thousands per hour) less intrusive.

so, here's the idea – let's create a trigger that will simply skip the update if the new row is exactly the same as the old one.

some coding, and here we have:

CREATE OR REPLACE FUNCTION prevent_empty_updates() RETURNS TRIGGER AS $BODY$
BEGIN
    -- if anything actually changed, let the update proceed
    IF ROW(OLD.*) IS DISTINCT FROM ROW(NEW.*) THEN
        RETURN NEW;
    END IF;
    -- returning NULL from a BEFORE trigger skips the operation for this row
    RETURN NULL;
END;
$BODY$ LANGUAGE plpgsql;

CREATE TRIGGER prevent_empty_updates
    BEFORE UPDATE ON test
    FOR EACH ROW EXECUTE PROCEDURE prevent_empty_updates();
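
a quick way to see that it works: rows skipped by a BEFORE trigger returning NULL are not counted in the command tag, so rerunning the no-op update should now report zero rows:

# UPDATE test SET column1 = column1, column2 = column2;
UPDATE 0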

now, how good is it?

let's test. i fixed my test sql, so now it: creates the table, inserts the rows, creates the trigger, vacuums, reindexes, and times the run of the update. at the end – it drops the test table.
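
for reference, the whole thing looks more or less like this (a sketch reconstructed from the steps above; timing comes from psql's \timing):

CREATE TABLE test (id serial PRIMARY KEY, column1 INT4, column2 TEXT);
INSERT INTO test (column1, column2) SELECT i, i::TEXT FROM generate_series(1,100000) i;
CREATE TRIGGER prevent_empty_updates BEFORE UPDATE ON test
    FOR EACH ROW EXECUTE PROCEDURE prevent_empty_updates();
VACUUM FULL ANALYZE test;
REINDEX TABLE test;
UPDATE test SET column1 = column1, column2 = column2;  -- this is the timed part
DROP TABLE test;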

the best time without the trigger was 1.7s. so, how well (or how badly) will it do now?

times after 10 runs: 992.598, 1055.256, 949.712, 1143.513, 901.878, 968.054, 999.859, 959.484, 1282.964 and 912.813.

so, it looks quite good. on average it was a bit over 1 second.

what's more important – in this way i also avoid table/index bloat – as the updates don't really occur.
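
this is easy to verify – re-run the size query from before after a trigger-guarded no-op update, and test and test_pkey should show exactly the same sizes as before:

# SELECT relname, relkind, pg_size_pretty(pg_relation_size(oid)) FROM pg_class WHERE relname IN ('test', 'test_pkey');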

so, is it a cheap gain without any cost?

unfortunately no – running the trigger has some cost. how much? let's check how long the update takes when it really does change the rows (with the trigger in place), this time setting different values.

another 10 rounds. times:

2897.916, 2923.946, 3328.016, 3291.242, 2886.756, 2901.304, 3395.658, 3278.864, 2881.064 and 3468.860 (average: 3125.3 ms!)

whoa. that looks almost too good to be true – almost no cost added? let's re-test the updates, in exactly the same fashion, but without the trigger – just to be sure. times:

2550.144, 1708.779, 1836.645, 2371.454, 1715.331, 2185.992, 1894.649, 2116.357, 2612.019, 1880.154 (average: 2087.1ms).

ok, now the results are a bit saner.

so – it shows that this trigger adds roughly a 50% performance penalty (3125.3 ms with the trigger vs. 2087.1 ms without).

but i think that in real life it will not be as high. first of all – i used a very simple table: no additional triggers, no foreign keys, nothing like that. and since this trigger effectively avoids all of those in case of “empty” updates (and don't forget about cheaper vacuums!), i guess that the real-life cost would be closer to 15-20%.

so, finishing the post – it's best to avoid empty updates in client code – either by issuing updates only when necessary, or by issuing them in a way that touches only the rows that really need changing, like:

UPDATE some_table SET field = 12 WHERE id = 90 AND field IS DISTINCT FROM 12;
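
why IS DISTINCT FROM and not a plain <>? nulls. with <>, a null in the column makes the whole condition null, so a row holding null would never get updated – even though null -> 12 is a real change. a quick demonstration:

-- <> yields null when one side is null; IS DISTINCT FROM treats null as a value
SELECT NULL <> 12 AS with_operator, NULL IS DISTINCT FROM 12 AS with_is_distinct;
-- with_operator: null, with_is_distinct: true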

but if you can't modify the code, and you have a similar problem with excessive number of not really useful updates – a simple trigger can save the day.
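
side note: if your postgresql is new enough (8.4 or later), you don't even have to write the trigger function yourself – there is a built-in c trigger for exactly this purpose, suppress_redundant_updates_trigger():

CREATE TRIGGER z_prevent_empty_updates
    BEFORE UPDATE ON test
    FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();

the name starting with “z” is deliberate – triggers fire in alphabetical order, and you want this one to run last, after any other triggers had their chance to modify the row.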

9 thoughts on “avoiding “empty” updates”

  1. Another way to do this, using RULEs 🙂

    CREATE RULE no_unchanging_updates AS
    ON UPDATE
    TO test
    WHERE ROW(OLD.*) IS NOT DISTINCT FROM ROW(NEW.*)
    DO INSTEAD NOTHING;

  2. I agree that it could be done in the database. But in theory, comparing two records for every update could be costly, since PostgreSQL supports user-defined types, and out-of-line storage.

    If you update based on a simple integer key, but another value in the same record is large and held in a separate physical location, the comparison could require reading additional data.

    This could probably be optimized, but at least there are triggers if you really need them. I suspect that you could make the overhead of such a trigger negligible.

  3. @Radim Kolar:
    the fact that mysql does something doesn’t really prove anything. they also “do” commands that can work partially 🙂

    and on the more serious side – i can see some uses for making empty updates. side-effects of update triggers are the first that come to mind.
    plus – comparing (as Jeff mentioned) might be costly – remember that a single column might hold up to 1gb of data! i definitely don't want the database to waste time comparing this kind of data. and if i want it to – a trigger (or a rule, as David showed) is really simple.

  4. @David Fetter:
    true. in fact – i'm not a fan of rules, so i don't know them well, but in this situation a rule seems perfect, as it works before any kind of triggers and checks.

  5. The penalty you are seeing above is almost certainly due to your using plpgsql for the trigger, and to using the row constructor to do field by field comparison. Try my C version, which compares using memcmp on the whole record in one hit, and you will see far better results.

