what fields are usually changed when update’ing?

there was this situation, that we had a lot of tables and a lot of update activity. so, we thought about splitting the most updated tables to parts that are usually stable, and parts (columns) which change often.

but how to know what changes? unfortunately orm that was used issued updates like this:

UPDATE TABLE SET field1='..', field2='...', field3='...' WHERE id = 123;

basically it always updated all fields. (don't even start to comment that orms are by definition broken).

so, i had to find a nice way to find out what was really updated.

so, i decided to add a table, and each update would insert new row with information on what was updated. i dont want any updates in this “stats" table, as i want the trigger to have the smallest overhead possible – so i will do all calculations, grouping and so on manually in sql later.

to keep my toys away from real application i created new schema:

CREATE SCHEMA updates_tracking;

and a table to store information about updates:

CREATE TABLE updates_tracking.updates (
    updated TIMESTAMP DEFAULT now(),
    TABLE_NAME TEXT,
    modification TEXT
);

please note that there is no primary key, no indexes, no checks (like not null). this table has to have blazing speed when it comes to inserts.

now,let's create general update-tracking trigger. since it has to be general (i.e. not bound to any specific table) plpgsql will not work for me – i need something that has the ability to iterate over fields of “NEW." and “OLD.", and pl/pgsql can't do it. do you know which language i will use? no, not pl/brainfuck. pl/perl (i know, the syntax similarities are sometimes confusing 🙂

CREATE OR REPLACE FUNCTION updates_tracking.tracker()
    RETURNS TRIGGER
    LANGUAGE plperl
    AS $BODY$
IF ( ! $_SHARED{ 'updates_tracking' } ) {
    $_SHARED{ 'updates_tracking' } = spi_prepare(
        'INSERT INTO updates_tracking.updates (table_name, modification) VALUES ($1, $2)', 'TEXT', 'TEXT'
    );
}
my @changed_columns = sort grep {
    my $n = $_TD->{'new'}->{ $_ };
    my $o = $_TD->{'old'}->{ $_ };
    ( defined $n && NOT defined $o ) ||
    ( defined $o && NOT defined $n ) ||
    ( defined $o && defined $o && $o ne $n )
} KEYS %{ $_TD->{'new'} };
spi_exec_prepared( $_SHARED{ 'updates_tracking' }, $_TD->{TABLE_NAME}, JOIN(" ", @changed_columns));
RETURN;
$BODY$;

if you dont understand the code – don't worry. it simply works.

and now, that i have a table to store update info i can:

CREATE TRIGGER updates_tracking AFTER UPDATE ON some_table FOR EACH ROW EXECUTE PROCEDURE updates_tracking.tracker();

what's nice, to add the same thing for another table, i just have to repeat above “create trigger" with the slight change of table name.

and what it gives me?

# SELECT modification, COUNT(*) FROM updates_tracking.updates WHERE TABLE_NAME = 'some_table' GROUP BY modification ORDER BY COUNT DESC;
     modification      | COUNT
-----------------------+-------
 field1 field2         |  5734
 field1                |  1447
 field2                |   829
 field3                |   644
 field4                |   129
 field5 field1 field2  |    41
                       |    26
 field1 field3 field2  |     7
 field5 field6         |     4
 field7                |     4
 field1 field3         |     3
 field8 field9         |     2
 field10               |     1
 field11 field8 field9 |     1
 field5 field1         |     1
 field12               |     1
(16 ROWS)

this clearly shows that moving fields “field1", and “field2" to another table would decrease number of updates in this table by something like 90% 🙂

of course all these updates will go to the new table, but this new table will be much smaller (i mean row size, not number of rows), so generally disk load should decrease.

and this also clearly shows that there were 26 updates that didn't update anything (26 out of nearly 9000 is not that bad, i can live with it 🙂

6 thoughts on “what fields are usually changed when update’ing?”

  1. this is a nice logging mechanism.
    I take it that this sort of thing isn’t possible with PLpgSQL? If not, does PLperl have a port for those of us that are have to work with a windows server?

  2. @Richard Broersma Jr.:
    plpgsql cannot “iterate” over NEW or OLD records (like: for (keys $_TD->{new}))
    so in plpgsql you would need to make separate function for every table, or make one (very complex) using reading of pg_catalog, and a lot of ‘executes’.

    as for windows server – sorry, i’m not familiar with this.

  3. @depesz Also, for your new heavily updated table, you’d probably want to use a smaller fill factor than the default. If the same rows get updated frequently without ever “stabilizing” you might want to try as low as 50-60%. If you get lots of inserts which are then update, but later stabilize, 80-90% would probably be better to reduce bloat.

  4. @Antoni Jakubiak:
    orm is mapper between not really compatible concepts.

    but what really bugs me is that in all orm’s i have seen to day, it is very unnatural to add your own sql queries in case you need them. and i need them often as i can write them better than orms can.

    dont get me wrong – orm’s are great for some purposes (raporting!). but using them for general-purpose (like webbuilding) is trading speed-of-application for speed-of-writing. and i dont quite like it. especially since i know sql already, and dont know orms – so i have to spend some time to learn them.

  5. In PostgreSQL 8.1 there is no spi_prepare and spi_exec_prepared. So I’ve had to do some mirror changes to the code (remove spi_prepare, spi_exec_prepared -> spi_exec_query). Also $_TD->{table_name} is called $_TD->{relname}.

Comments are closed.