January 18th, 2008 by depesz | Tags: | 7 comments »
Did it help? If yes - maybe you can help me?
  • we use orm
  • yes, and?
  • and we have a problem with it…
  • you already said that

now, imagine a simple scenario:

  • objects table (let's name it “topics"), which contains column: posts_count
  • sub-objects table (posts), with foreign key to topics
  • triggers which update posts_count on insert/update/delete on posts

simple? yes? well. it was. now, enter disaster: orm.

this orm idea of “update" works like this:

TOPIC_OBJECT = orm.get_topic_object(id = 123);
TOPIC_OBJECT.title("whatever");
TOPIC_OBJECT.store_to_database();

one could say that it should yield this sql:

update topics set title = 'whatever' where id = 123;

this would be quite sane. but not in this case. this brilliant orm decides to do the update this way:

update topics set title = 'whatever', posts_count = 34, ... where id = 123

it sets values of all fields of given object.

where does it get those values? it's simple. from some kind of deep hidden, untouchable cache.

which means that even if our trigger on posts does the right job, first update of topic title will “reset" the posts_count to some previous state.

now. i should say: ditch the orm. or fix it.

but it's not an option. we're in commercial environment, time is limited, we will not rewrite application from scratch without orm, and touching orm itself is very scary. too scary for even the bravest souls among our programmers.

so what do we do? ditch the triggers? they do actually quite good job.

the idea is to somehow prevent changes of posts_count field if it's not from posts-trigger.

we'd like to do it in such a way that it will not require changes in table schema (to avoid potential problems with orm 🙂

as for now we have 2 ideas:

  1. create a specific user to run triggers (using “security definer"), and a trigger on “before update on topics" which will allow update of posts_count only if current_user is the right one.
  2. since we know a way how to distinguish between statements we can do the trick with abusing custom_variable_class in settings.

first, let's create a testbed for this:

CREATE TABLE topics (
    id          SERIAL,
    title       TEXT NOT NULL DEFAULT '',
    posts_count INT4 NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
);
CREATE TABLE posts (
    id       BIGSERIAL,
    topic_id INT4 NOT NULL DEFAULT 0 references topics (id),
    title    TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (id)
);

now, let's create trigger on posts:

CREATE OR REPLACE FUNCTION trg_posts_iud() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        RETURN OLD;
    ELSIF NEW.topic_id = OLD.topic_id THEN
        RETURN NEW;
    ELSIF NEW.topic_id < OLD.topic_id THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
    ELSE
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER trg_posts_iud after INSERT OR UPDATE OR DELETE ON posts FOR EACH ROW EXECUTE PROCEDURE trg_posts_iud();

does it work well?

let's insert and mangle some random data:

INSERT INTO topics (title) SELECT 'title #' || i FROM generate_series(1,100) i;
INSERT INTO posts (title, topic_id) SELECT 'post title #' || i, cast(random() * 99 + 1 as INT4) FROM generate_series(1,10000) i;
UPDATE posts SET topic_id = cast(random() * 99 + 1 as INT4) WHERE random() < .5;
DELETE FROM posts WHERE random() < 0.25;

and now verification if everything is fine:

# SELECT topic_id, count(*) FROM posts group BY topic_id except SELECT id, posts_count FROM topics;
 topic_id | count
----------+-------
(0 rows)
# SELECT id, posts_count FROM topics except SELECT topic_id, count(*) FROMposts group BY topic_id;
 id | posts_count
----+-------------
(0 rows)

looks fine.

problem shows if i'll do:

update topics set posts_count = posts_count + 1 where id = 51;

now, verification shows that the count is falsified:

# SELECT topic_id, count(*) FROM posts group BY topic_id except SELECT id, posts_count FROM topics;
 topic_id | count
----------+-------
       51 |    68
(1 row)
# SELECT id, posts_count FROM topics except SELECT topic_id, count(*) FROM posts group BY topic_id;
 id | posts_count
----+-------------
 51 |          69
(1 row)

ok. so let's fix the issue. first, let's create a trigger using another user:

CREATE user dbtrigger;
grant all on topics to dbtrigger;
grant all on posts to dbtrigger;
\c - dbtrigger
CREATE OR REPLACE FUNCTION trg_posts_iud() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        RETURN OLD;
    ELSIF NEW.topic_id = OLD.topic_id THEN
        RETURN NEW;
    ELSIF NEW.topic_id < OLD.topic_id THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
    ELSE
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql' SECURITY DEFINER;
CREATE TRIGGER trg_posts_iud after INSERT OR UPDATE OR DELETE ON posts FOR EACH ROW EXECUTE PROCEDURE trg_posts_iud();
\c - depesz

now we need a trigger on topics:

CREATE OR REPLACE FUNCTION trg_topics_u() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF CURRENT_USER <> 'dbtrigger' THEN
        NEW.posts_count := OLD.posts_count;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER trg_topics_u BEFORE UPDATE ON topics FOR EACH ROW EXECUTE PROCEDURE trg_topics_u();

effect?

# select * from topics where id = 51;
 id |   title   | posts_count
----+-----------+-------------
 51 | title #51 |          90
(1 row)
# update topics set title = 'xxx', posts_count = 100 where id = 51;
UPDATE 1
# select * from topics where id = 51;
 id | title | posts_count
----+-------+-------------
 51 | xxx   |          90
(1 row)

so, it clearly forbids me from changing posts_count. will the trigger on posts still work?

# insert into posts (topic_id, title) values (51, 'q');
INSERT 0 1
# select * from topics where id = 51;
 id | title | posts_count
----+-------+-------------
 51 | xxx   |          91
(1 row)

YES! works.

now, for the second approach, based on configuration variables.

it so happens, that for every database installation i add custom variable class – for future use:

# show custom_variable_classes ;
 custom_variable_classes
-------------------------
 depesz
(1 row)

this can be used in such a manner:

CREATE OR REPLACE FUNCTION trg_posts_iud() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    PERFORM set_config('depesz.topic_update_stamp', statement_timestamp()::TEXT, false);
    IF TG_OP = 'INSERT' THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        RETURN OLD;
    ELSIF NEW.topic_id = OLD.topic_id THEN
        RETURN NEW;
    ELSIF NEW.topic_id < OLD.topic_id THEN
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
    ELSE
        UPDATE topics SET posts_count = posts_count - 1 WHERE id = OLD.topic_id;
        UPDATE topics SET posts_count = posts_count + 1 WHERE id = NEW.topic_id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER trg_posts_iud after INSERT OR UPDATE OR DELETE ON posts FOR EACH ROW EXECUTE PROCEDURE trg_posts_iud();
CREATE OR REPLACE FUNCTION trg_topics_u() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF statement_timestamp()::TEXT IS distinct FROM current_setting('depesz.topic_update_stamp') THEN
        NEW.posts_count := OLD.posts_count;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER trg_topics_u BEFORE UPDATE ON topics FOR EACH ROW EXECUTE PROCEDURE trg_topics_u();

now, let's test it:

# select * from topics where id = 51;
 id |   title   | posts_count
----+-----------+-------------
 51 | title #51 |          71
(1 row)
# update topics set title = 'xxx', posts_count = 100 where id = 51;
UPDATE 1
# select * from topicswhere id = 51;
 id | title | posts_count
----+-------+-------------
 51 | xxx   |          71
(1 row)
# insert into posts (topic_id, title) values (51, 'q');
INSERT 0 1
# select * from topics where id = 51;
 id | title | posts_count
----+-------+-------------
 51 | xxx   |          72
(1 row)

now, both of these approaches have it's benefits.

user based approach is simpler to understand, and most probably (will test it in a moment) faster.

on the other hand – if you already have custom variables class – you might be tempted to go with config-based way, as it doesn't require new user. plus it's definitely “cooler", as it works in more obscure way. (just kidding).

so, how about speed?

for me, what's important is speed of inserts to posts table.

so i'll do 3 tests:

  1. no protection against bad updates
  2. protection with CURRENT_USER
  3. protection with statement_timestamp

each test will create tables, insert 100 topics and 100000 posts. time will be taken from insert of posts time.

pass 1:

  1. 51023.180 ms
  2. 53437.412 ms
  3. 55625.248 ms

pass 2:

  1. 49584.570 ms
  2. 53987.809 ms
  3. 56324.298 ms

pass 3:

  1. 52096.815 ms
  2. 53500.527 ms
  3. 57295.232

as expected – config based solution is slower than username based. 🙂

in final notes – this is by no means full list of possible solutions. for example – when using plperl triggers i could use %_SHARED to pass information and allow modification of posts_count based on this.

if schema modification was allowed i could add a field that would be used as a marker.

i could also abuse the fact that count of objects cannot be negative, and send updates from posts-trigger in form of (-1 * (posts_count + 1)), and then in topics trigger i could fix the count (or forbid update if new.posts_count is positive.

anyway – as for now, the way with separate user seems to be the most reasonable.

hope you found this article useful 🙂

  1. 7 comments

  2. # cezio
    Jan 18, 2008

    We had similar problem: two tables, which would update each other (A kept ‘summary’, but on change should update table B, which kept detailed information, BUT should update A on some changes).

    To keep them from circular update, we added flag column (default false) to each table, and row trigger, which would update other table only if flag was set.

  3. Jan 19, 2008

    I don’t know how it’s work, but it’s another Your Post, when You dope out my problem exactly in time, when it’s born in my work!

    Excellent.

  4. Jan 19, 2008

    The key “problem” you have with ORMs is that they use RDB only as a shared storage and nothing more. The relationships and logic is coded in objects at the application level. There is no need to reimplement it in RDB and if you do, they will conflict sooner or later.

    If you try to push “the dirty work” such as sequence tracking, relations etc. to the DB, and do only “the nice things” in app… Well… the DB master gets the bitchin’, right? 😉

  5. Jan 19, 2008

    @smoku:
    orms prevent tons of optimizations/speedup. i understand that they simplify writing *slow* code, but when you want to make the code go fast – with orms you loose.

    as for “bitchin'” – there are jobs to be done in db, and there are jobs to be done outside of db. the problem arises when soembody (out of laziness, or out of not having neccessary knowledge/experience) mixes the roles, and does db-tasks outside of db, and outside jobs in db.

  6. Jan 21, 2008

    Hm.. that’s problem in ORM implementation. Normally they should knew whose fields are changed and update only these with ‘dirty’ flag.
    Your library is.. dumb. 🙂

    Personally I use ORM to write code faster. But that isn’t fast code.. When I must create complex query or use complex result I use plain result sets.

    Please depesz tell me, did you used SQL Alchemy in this post?

  7. Jan 21, 2008

    @Łukasz Dywicki:
    no, it’s orm from django (old version, but i was told that new django works the same way).

    but *please* do not understand as django bashing. django *is* cool.

  8. # woeye
    Jan 23, 2008

    Hibernate (a popular Java ORM framework) has the concept of “generated properties”. By defining a property in a class as generated the Hibernate will not insert or update this property. Furthermore it will refetch the data from the database after an update/insert in order to keep this property up-to-date. There’s no such feature in Django’s ORM afaik?

Leave a comment