My take on trees in SQL

Quick note in polish: jeśli znasz moje poprzednie posty nt. drzew, to ten możesz sobie pewnie odpuścić. będzie zawierał jedynie opis implementacji zbliżony do tego co już jest dostępne.

OK, back to English (or at least my version of English).

Finding a good way to store trees in SQL was/is my long-term hobby. I tried ltree, basic adjacency list, Celko's nested sets way, and nothing really was able to make me feel satisfied.

Ltree is great, but PostgreSQL only (not that it's a big problem). Adjacency list is very simple in insert, update and delete operations, but forces me to use recursive queries in case of some not-so-standard queries. Nested sets are quite the contrary – great for selects, but I simply hate writing insert/update/delete to these trees.

Is there anything better? I think so.

My way of storing trees started with simple adjacency list. This is simple to write to, and I will try to make selects somewhat simpler. Doing things the other way around (starting with simple selects, and complicated writes, and then making writes simpler is next to impossible for me).

So, I start with this schema:

CREATE TABLE objects (
    id             SERIAL,
    codename       TEXT NOT NULL UNIQUE,
    printable_name TEXT,
    some_property  INT4,
    PRIMARY KEY (id)
);

It is simple table which store “objects", where each object has unique codename, some printable_name, and some_property.

Both printable_name and some_property are actually irrelevant, but I just wanted to show some data in it to make it easier for you to follow.

Now, we need to add tree information.

ALTER TABLE objects ADD COLUMN parent_id INT4 REFERENCES objects (id);

Our test table looks now like this:

                             Table "public.objects"
     Column     |  Type   |                      Modifiers
----------------+---------+------------------------------------------------------
 id             | integer | not null default nextval('objects_id_seq'::regclass)
 codename       | text    | not null
 printable_name | text    |
 some_property  | integer |
 parent_id      | integer |
Indexes:
    "objects_pkey" PRIMARY KEY, btree (id)
    "objects_codename_key" UNIQUE, btree (codename)
Foreign-key constraints:
    "objects_parent_id_fkey" FOREIGN KEY (parent_id) REFERENCES objects(id)
Referenced by:
  "objects_parent_id_fkey" IN objects FOREIGN KEY (parent_id) REFERENCES objects(id)

Nothing really fancy here.

Now. Let's add functionality 🙂

First is very simple, and it's actually foundation for further work: let's add ability to select (without any kind of loops or recursiveness) to select all parents and/or all children of given node.

To do it, I will need to add new table:

CREATE TABLE objects_tree (
    id        SERIAL PRIMARY KEY,
    parent_id INT4 NOT NULL REFERENCES objects (id) ON DELETE CASCADE,
    child_id  INT4 NOT NULL REFERENCES objects (id) ON DELETE CASCADE,
    depth     INT4 NOT NULL,
    UNIQUE (parent_id, child_id)
);

You might ask: what's depth? This is number of levels which separate 2 objects. You will see in examples (in a minute) how it works.

OK. I will use this table to store information about all parents of given node. Of course I will not require application to fill this table – this would defeat one of the main purposes of new structure: easy writes.

So, if I can't use application to fill this table, I have to use triggers. Luckily this is pretty simple. I will write it as 2 separate triggers to make it easier to understand.

First I will write the simpler one: for inserts:

CREATE OR REPLACE FUNCTION tree_objects_ai() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    INSERT INTO objects_tree (parent_id, child_id, depth) VALUES (NEW.id, NEW.id, 0);
    INSERT INTO objects_tree (parent_id, child_id, depth) SELECT x.parent_id, NEW.id, x.depth + 1 FROM objects_tree x WHERE x.child_id = NEW.parent_id;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_objects_ai AFTER INSERT ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_ai();

What does it do?

Let's iterate:

  • it inserts row, where both parent_id and child_id are set to the same value (id of newly inserted object), and depth is set to 0 (as both child and parent are on the same level) – why we need this record – I'll show you a bit later
  • and now, in second insert we copy all rows that our parent had as its parents, but we modify child_id in these rows to be id of currently inserted row, and increase depth

Does it sound complicated? Maybe a simple example will clear all doubt:

I insert 2 top-level nodes:

INSERT INTO objects (id, codename) VALUES (1, 'a');
INSERT INTO objects (id, codename) VALUES (2, 'b');

Content of object_tree is pretty simple, and quite obvious:

 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
(2 rows)

Now, I insert 2 records, directly “below" just-inserted objects:

INSERT INTO objects (id, codename, parent_id) VALUES (3, 'c', 1);
INSERT INTO objects (id, codename, parent_id) VALUES (4, 'd', 2);

Object_tree contains now:

 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
  3 |         3 |        3 |     0
  4 |         1 |        3 |     1
  5 |         4 |        4 |     0
  6 |         2 |        4 |     1
(6 rows)

Which is not really interesting, as exactly the same information is in objects table.

But what happens when I insert object which is child of not top-level object?

INSERT INTO objects (id, codename, parent_id) VALUES (5, 'e', 3);
SELECT * FROM objects_tree;
 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
  3 |         3 |        3 |     0
  4 |         1 |        3 |     1
  5 |         4 |        4 |     0
  6 |         2 |        4 |     1
  7 |         5 |        5 |     0
  8 |         3 |        5 |     1
  9 |         1 |        5 |     2
(9 ROWS)

Now it shows some useful information. For the first time we have available piece of information which is not directly accessible from objects table, row with information about (parent_id, child_id, depth) = (1, 5, 2).

As you see, the object_tree table grows pretty fast. Or it looks like it. This is true, every object in this table will have N rows in this table, where N is object level (starting from 1 for top-level objects). If this is a problem for you – I think nested sets might be better for you. For me – additional data doesn't really matter, as each object usually already has a lot of data, and some additional integers don't bother me.

Now for update trigger.

CREATE OR REPLACE FUNCTION tree_objects_au() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF NOT OLD.parent_id IS DISTINCT FROM NEW.parent_id THEN
        RETURN NEW;
    END IF;
    IF OLD.parent_id IS NOT NULL THEN
        DELETE FROM objects_tree WHERE id IN (
            SELECT r2.id FROM objects_tree r1 JOIN objects_tree r2 ON r1.child_id = r2.child_id
            WHERE r1.parent_id = NEW.id AND r2.depth > r1.depth
        );
    END IF;
    IF NEW.parent_id IS NOT NULL THEN
        INSERT INTO objects_tree (parent_id, child_id, depth)
            SELECT r1.parent_id, r2.child_id, r1.depth + r2.depth + 1
        FROM
            objects_tree r1,
            objects_tree r2
        WHERE
            r1.child_id = NEW.parent_id AND
            r2.parent_id = NEW.id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_objects_au AFTER UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_au();

Now, let's check 2 possible scenarios:

Moving some child object to become top-level object:

UPDATE objects SET parent_id = NULL WHERE id = 3;
SELECT * FROM objects;
 id | codename | printable_name | some_property | parent_id
----+----------+----------------+---------------+-----------
  1 | a        | [NULL]         |        [NULL] |    [NULL]
  2 | b        | [NULL]         |        [NULL] |    [NULL]
  3 | c        | [NULL]         |        [NULL] |    [NULL]
  4 | d        | [NULL]         |        [NULL] |         2
  5 | e        | [NULL]         |        [NULL] |         3
(5 ROWS)
SELECT * FROM objects_tree;
 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
  3 |         3 |        3 |     0
  5 |         4 |        4 |     0
  6 |         2 |        4 |     1
  7 |         5 |        5 |     0
  8 |         3 |        5 |     1
(7 ROWS)

This looks OK.

Now, let's move some top-level object to become child object:

UPDATE objects SET parent_id = 5 WHERE id = 2;
SELECT * FROM objects;
 id | codename | printable_name | some_property | parent_id
----+----------+----------------+---------------+-----------
  1 | a        | [NULL]         |        [NULL] |    [NULL]
  2 | b        | [NULL]         |        [NULL] |         5
  3 | c        | [NULL]         |        [NULL] |    [NULL]
  4 | d        | [NULL]         |        [NULL] |         2
  5 | e        | [NULL]         |        [NULL] |         3
(5 ROWS)
SELECT * FROM objects_tree ORDER BY depth;
 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
  3 |         3 |        3 |     0
  5 |         4 |        4 |     0
  7 |         5 |        5 |     0
  8 |         3 |        5 |     1
 10 |         5 |        2 |     1
  6 |         2 |        4 |     1
 11 |         3 |        2 |     2
 12 |         5 |        4 |     2
 13 |         3 |        4 |     3
(11 ROWS)

OK. And the last way to update: move some child object under new parent:

UPDATE objects SET parent_id = 1 WHERE id = 5;
SELECT * FROM objects;
 id | codename | printable_name | some_property | parent_id
----+----------+----------------+---------------+-----------
  1 | a        | [NULL]         |        [NULL] |    [NULL]
  2 | b        | [NULL]         |        [NULL] |         5
  3 | c        | [NULL]         |        [NULL] |    [NULL]
  4 | d        | [NULL]         |        [NULL] |         2
  5 | e        | [NULL]         |        [NULL] |         1
(5 ROWS)
SELECT * FROM objects_tree ORDER BY depth;
 id | parent_id | child_id | depth
----+-----------+----------+-------
  1 |         1 |        1 |     0
  2 |         2 |        2 |     0
  3 |         3 |        3 |     0
  5 |         4 |        4 |     0
  7 |         5 |        5 |     0
 10 |         5 |        2 |     1
 14 |         1 |        5 |     1
  6 |         2 |        4 |     1
 12 |         5 |        4 |     2
 15 |         1 |        2 |     2
 16 |         1 |        4 |     3
(11 ROWS)

Looks like working.

Now, I promised that I will tell you why we need record with depth 0 and why we need depth column.

Let's assume our objects are categories. And we have some products in these categories. Like this:

CREATE TABLE products (
    id          SERIAL PRIMARY KEY,
    category_id INT4 NOT NULL REFERENCES objects (id),
    ...
);

It is quite common to ask database for all products in given category and it's subcategories.

Now, I can simply:

SELECT
    p.*
FROM
    products p
    JOIN objects_tree c ON p.category_id = c.child_id
WHERE
    c.parent_id = <SOME_ID>;

If I hadn't add these “depth=0" rows, I would have to write it as:

SELECT
    p.*
FROM
    products p
    JOIN objects_tree c ON p.category_id = c.child_id
WHERE
    c.parent_id = <SOME_ID>
UNION ALL
SELECT
    *
FROM
    products
WHERE
    category_id = <SOME_ID>;

Ouch.

Second, absolutely unnecessary scan (indexable of course, but unnecessary anyway) of products table.

Plus – it is not as simple to write as the first query, and my goal was to make it as simple as possible.

And why do we need depth column?

Let's stay with this categories example. When user is in some category, we would like to show him path to this category. So he could easily move to some parent category.

Now, it's pretty simple:

SELECT
    o.*
FROM
    objects o
    JOIN objects_tree t ON o.id = t.parent_id
WHERE
    t.child_id = 4
ORDER BY
    t.depth DESC;

Try to do it without depth column 🙂 of course you could assume that lower id means category higher in hierarchy, but since one can move objects in tree freely, there is no guarantee that this will be always true.

Now.

As for moving objects in tree freely.

We should forbid moves that would create loops:

CREATE OR REPLACE FUNCTION tree_objects_bu() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF NEW.id <> OLD.id THEN
        RAISE EXCEPTION 'Changing ids is forbidden.';
    END IF;
    IF NOT OLD.parent_id IS DISTINCT FROM NEW.parent_id THEN
        RETURN NEW;
    END IF;
    IF NEW.parent_id IS NULL THEN
        RETURN NEW;
    END IF;
    PERFORM 1 FROM objects_tree WHERE ( parent_id, child_id ) = ( NEW.id, NEW.parent_id );
    IF FOUND THEN
        RAISE EXCEPTION 'Update blocked, because it would create loop in tree.';
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_objects_bu BEFORE UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_bu();

As you might noticed – I also added condition that would fail the update in case id got changed. Changing of ids is complicated task, and I think it shouldn't be done. If you don't like this – you can easily remove 3 lines from the function.

Now, when I try to create loop in tree:

UPDATE objects SET parent_id = 4 WHERE id = 1;
ERROR:  UPDATE blocked, because it would CREATE loop IN tree.

Great.

Now. Another task for our tree. Let's say we want to generate urls for categories based on their codenames and position in tree.

For this I will need another field in objects table. This field will never be null, but I can't make it “not null", as this would force application to put some data there.

So, lets:

ALTER TABLE objects ADD COLUMN tree_path TEXT;

Just to remind you, now the objects table looks like this:

                             TABLE "public.objects"
     COLUMN     |  TYPE   |                      Modifiers
----------------+---------+------------------------------------------------------
 id             | INTEGER | NOT NULL DEFAULT NEXTVAL('objects_id_seq'::regclass)
 codename       | text    | NOT NULL
 printable_name | text    |
 some_property  | INTEGER |
 parent_id      | INTEGER |
 tree_path      | text    |
Indexes:
    "objects_pkey" PRIMARY KEY, btree (id)
    "objects_codename_key" UNIQUE, btree (codename)
Foreign-KEY constraints:
    "objects_parent_id_fkey" FOREIGN KEY (parent_id) REFERENCES objects(id)
Referenced BY:
  "objects_parent_id_fkey" IN objects FOREIGN KEY (parent_id) REFERENCES objects(id)
  "objects_tree_child_id_fkey" IN objects_tree FOREIGN KEY (child_id) REFERENCES objects(id) ON DELETE CASCADE
  "objects_tree_parent_id_fkey" IN objects_tree FOREIGN KEY (parent_id) REFERENCES objects(id) ON DELETE CASCADE
Triggers:
    tree_objects_ai AFTER INSERT ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_ai()
    tree_objects_au AFTER UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_au()
    tree_objects_bu BEFORE UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_objects_bu()

And now we have to fill the information in this column.

2 simple (or not so simple) triggers:

CREATE OR REPLACE FUNCTION tree_path_objects_bi() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF NEW.parent_id IS NULL THEN
        NEW.tree_path := NEW.codename;
    ELSE
        SELECT tree_path || '/' || NEW.codename INTO NEW.tree_path FROM objects WHERE id = NEW.parent_id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_path_objects_bi BEFORE INSERT ON objects FOR EACH ROW EXECUTE PROCEDURE tree_path_objects_bi();
CREATE OR REPLACE FUNCTION tree_path_objects_bu() RETURNS TRIGGER AS
$BODY$
DECLARE
    replace_from TEXT := '^';
    replace_to   TEXT := '';
BEGIN
    IF NOT OLD.parent_id IS DISTINCT FROM NEW.parent_id THEN
        RETURN NEW;
    END IF;
    IF OLD.parent_id IS NOT NULL THEN
        SELECT '^' || tree_path || '/' INTO replace_from FROM objects WHERE id = OLD.parent_id;
    END IF;
    IF NEW.parent_id IS NOT NULL THEN
        SELECT tree_path || '/' INTO replace_to FROM objects WHERE id = NEW.parent_id;
    END IF;
    NEW.tree_path := regexp_replace( NEW.tree_path, replace_from, replace_to );
    UPDATE objects SET tree_path = regexp_replace(tree_path, replace_from, replace_to ) WHERE id IN (SELECT child_id FROM objects_tree WHERE parent_id = NEW.id AND depth > 0);
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_path_objects_bu BEFORE UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_path_objects_bu();

Now, let's delete all rows, and reinsert first 5 rows:

DELETE FROM objects;
INSERT INTO objects (id, codename, parent_id) VALUES (1, 'a', NULL), (2, 'b', NULL), (3, 'c', 1), (4, 'd', 2), (5, 'e', 3);
SELECT * FROM objects;
 id | codename | printable_name | some_property | parent_id | tree_path
----+----------+----------------+---------------+-----------+-----------
  1 | a        | [NULL]         |        [NULL] |    [NULL] | a
  2 | b        | [NULL]         |        [NULL] |    [NULL] | b
  3 | c        | [NULL]         |        [NULL] |         1 | a/c
  4 | d        | [NULL]         |        [NULL] |         2 | b/d
  5 | e        | [NULL]         |        [NULL] |         3 | a/c/e
(5 ROWS)

Now, lets do some changes:

UPDATE objects SET parent_id = NULL WHERE id = 3;
UPDATE objects SET parent_id = 5 WHERE id = 2;
UPDATE objects SET parent_id = 1 WHERE id = 5;
SELECT * FROM objects;
 id | codename | printable_name | some_property | parent_id | tree_path
----+----------+----------------+---------------+-----------+-----------
  1 | a        | [NULL]         |        [NULL] |    [NULL] | a
  3 | c        | [NULL]         |        [NULL] |    [NULL] | c
  2 | b        | [NULL]         |        [NULL] |         5 | a/e/b
  4 | d        | [NULL]         |        [NULL] |         2 | a/e/b/d
  5 | e        | [NULL]         |        [NULL] |         1 | a/e
(5 ROWS)

Looks cool. What's more – you can do something like this:

SELECT id, tree_path, codename, parent_id FROM objects ORDER BY tree_path;
 id | tree_path | codename | parent_id
----+-----------+----------+-----------
  1 | a         | a        |    [NULL]
  5 | a/e       | e        |         1
  2 | a/e/b     | b        |         5
  4 | a/e/b/d   | d        |         2
  3 | c         | c        |    [NULL]
(5 ROWS)

which perhaps doesn't look impressive, but let's add some more records:

INSERT INTO objects (id, codename, parent_id) VALUES (6, 'f', 3), (7, 'g', 3), (8, 'h', 6), (9, 'i', 7), (10, 'j', 8);
SELECT id, tree_path, codename, parent_id FROM objects ORDER BY tree_path;
 id | tree_path | codename | parent_id
----+-----------+----------+-----------
  1 | a         | a        |    [NULL]
  5 | a/e       | e        |         1
  2 | a/e/b     | b        |         5
  4 | a/e/b/d   | d        |         2
  3 | c         | c        |    [NULL]
  6 | c/f       | f        |         3
  8 | c/f/h     | h        |         6
 10 | c/f/h/j   | j        |         8
  7 | c/g       | g        |         3
  9 | c/g/i     | I        |         7
(10 ROWS)

Nice. But what if I'd like to have c/g before c/f ?

This will require input from user/application – saying what order should we have.

Now. We'd like to keep it as simple as possible, both in selects and in writes. To make it as simple we will make this field unique. We could make it unique in pair (parent_id, ordering), but then moving objects in tree would become difficult.

So, next modification of our table:

ALTER TABLE objects ADD ordering INT4 UNIQUE;

Now. This field really shouldn't be null – otherwise we could get 2 elements with null ordering, and we wouldn't be able to unambiguously order them.

So, let's:

UPDATE objects SET ordering = id;
ALTER TABLE objects ALTER COLUMN ordering SET NOT NULL;

Now. It's great that we have this field, but sorting with it will be at the very least tedious.

What I mean. While getting all children of given object, ordered, is simple:

SELECT * FROM objects WHERE parent_id = 3 ORDER BY ordering;

There is next-to-no way to get it properly sorted for whole tree at once.

To allow sorting of whole tree (or just some branch) we need to add new column, also trigger-filled, which will be used for sorting. This field will be globally unique (just like tree_path):

ALTER TABLE objects ADD COLUMN ordering_path TEXT UNIQUE;

This field will be filled by these triggers:

CREATE OR REPLACE FUNCTION tree_ordering_path_objects_bi() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF NEW.parent_id IS NULL THEN
        NEW.ordering_path := to_char(NEW.ordering, '000000000000');
    ELSE
        SELECT ordering_path || '/' || to_char(NEW.ordering, '000000000000') INTO NEW.ordering_path FROM objects WHERE id = NEW.parent_id;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_ordering_path_objects_bi BEFORE INSERT ON objects FOR EACH ROW EXECUTE PROCEDURE tree_ordering_path_objects_bi();
CREATE OR REPLACE FUNCTION tree_ordering_path_objects_bu() RETURNS TRIGGER AS
$BODY$
DECLARE
BEGIN
    IF OLD.ordering = NEW.ordering THEN
        RETURN NEW;
    END IF;
    IF NEW.parent_id IS NULL THEN
        NEW.ordering_path := to_char(NEW.ordering, '000000000000');
    ELSE
        SELECT ordering_path || '/' || to_char(NEW.ordering, '000000000000') INTO NEW.ordering_path FROM objects WHERE id = NEW.parent_id;
    END IF;
    UPDATE objects SET ordering_path = regexp_replace(ordering_path, '^' || OLD.ordering_path, NEW.ordering_path )
        WHERE id IN (SELECT child_id FROM objects_tree WHERE parent_id = NEW.id AND depth > 0);
    RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER tree_ordering_path_objects_bu BEFORE UPDATE ON objects FOR EACH ROW EXECUTE PROCEDURE tree_ordering_path_objects_bu();

Now, to show how it all works, I will delete all content from objects, add new data:

DELETE FROM objects;
INSERT INTO objects (id, codename, parent_id, ordering) VALUES
    (1, 'a', NULL, 100),
    (2, 'b', NULL, 200),
    (3, 'c', 1, 300),
    (4, 'd', 2, 400),
    (5, 'e', 3, 500),
    (6, 'f', 3, 600),
    (7, 'g', 3, 700),
    (8, 'h', 6, 800),
    (9, 'i', 7, 900),
    (10, 'j', 8, 1000);

And how it looks in table?

SELECT id, tree_path, ordering_path, ordering FROM objects ORDER BY ordering_path;
 id | tree_path |                             ordering_path                             | ordering
----+-----------+-----------------------------------------------------------------------+----------
  1 | a         |  000000000100                                                         |      100
  3 | a/c       |  000000000100/ 000000000300                                           |      300
  5 | a/c/e     |  000000000100/ 000000000300/ 000000000500                             |      500
  6 | a/c/f     |  000000000100/ 000000000300/ 000000000600                             |      600
  8 | a/c/f/h   |  000000000100/ 000000000300/ 000000000600/ 000000000800               |      800
 10 | a/c/f/h/j |  000000000100/ 000000000300/ 000000000600/ 000000000800/ 000000001000 |     1000
  7 | a/c/g     |  000000000100/ 000000000300/ 000000000700                             |      700
  9 | a/c/g/i   |  000000000100/ 000000000300/ 000000000700/ 000000000900               |      900
  2 | b         |  000000000200                                                         |      200
  4 | b/d       |  000000000200/ 000000000400                                           |      400
(10 ROWS)

So, now let's test some updates:

UPDATE objects SET ordering = 550 WHERE id = 7;
SELECT id, tree_path, ordering_path, ordering FROM objects ORDER BY ordering_path;
 id | tree_path |                             ordering_path                             | ordering
----+-----------+-----------------------------------------------------------------------+----------
  1 | a         |  000000000100                                                         |      100
  3 | a/c       |  000000000100/ 000000000300                                           |      300
  5 | a/c/e     |  000000000100/ 000000000300/ 000000000500                             |      500
  7 | a/c/g     |  000000000100/ 000000000300/ 000000000550                             |      550
  9 | a/c/g/i   |  000000000100/ 000000000300/ 000000000550/ 000000000900               |      900
  6 | a/c/f     |  000000000100/ 000000000300/ 000000000600                             |      600
  8 | a/c/f/h   |  000000000100/ 000000000300/ 000000000600/ 000000000800               |      800
 10 | a/c/f/h/j |  000000000100/ 000000000300/ 000000000600/ 000000000800/ 000000001000 |     1000
  2 | b         |  000000000200                                                         |      200
  4 | b/d       |  000000000200/ 000000000400                                           |      400
(10 ROWS)
UPDATE objects SET ordering = 50 WHERE id = 2;
SELECT id, tree_path, ordering_path, ordering FROM objects ORDER BY ordering_path;
 id | tree_path |                             ordering_path                             | ordering
----+-----------+-----------------------------------------------------------------------+----------
  2 | b         |  000000000050                                                         |       50
  4 | b/d       |  000000000050/ 000000000400                                           |      400
  1 | a         |  000000000100                                                         |      100
  3 | a/c       |  000000000100/ 000000000300                                           |      300
  5 | a/c/e     |  000000000100/ 000000000300/ 000000000500                             |      500
  7 | a/c/g     |  000000000100/ 000000000300/ 000000000550                             |      550
  9 | a/c/g/i   |  000000000100/ 000000000300/ 000000000550/ 000000000900               |      900
  6 | a/c/f     |  000000000100/ 000000000300/ 000000000600                             |      600
  8 | a/c/f/h   |  000000000100/ 000000000300/ 000000000600/ 000000000800               |      800
 10 | a/c/f/h/j |  000000000100/ 000000000300/ 000000000600/ 000000000800/ 000000001000 |     1000
(10 ROWS)
UPDATE objects SET ordering = 5 WHERE id = 1;
SELECT id, tree_path, ordering_path, ordering FROM objects ORDER BY ordering_path;
 id | tree_path |                             ordering_path                             | ordering
----+-----------+-----------------------------------------------------------------------+----------
  1 | a         |  000000000005                                                         |        5
  3 | a/c       |  000000000005/ 000000000300                                           |      300
  5 | a/c/e     |  000000000005/ 000000000300/ 000000000500                             |      500
  7 | a/c/g     |  000000000005/ 000000000300/ 000000000550                             |      550
  9 | a/c/g/i   |  000000000005/ 000000000300/ 000000000550/ 000000000900               |      900
  6 | a/c/f     |  000000000005/ 000000000300/ 000000000600                             |      600
  8 | a/c/f/h   |  000000000005/ 000000000300/ 000000000600/ 000000000800               |      800
 10 | a/c/f/h/j |  000000000005/ 000000000300/ 000000000600/ 000000000800/ 000000001000 |     1000
  2 | b         |  000000000050                                                         |       50
  4 | b/d       |  000000000050/ 000000000400                                           |      400
(10 ROWS)

OK. So now we have trees with easy way to sort elements in them.

That basically concludes this post, as a last thing.

At one of companies I worked for we had this problem of getting 2nd level element in tree, that would be parent of given object.

For example, let's assume you have geographical tree. Top-level elements are countries, 2nd level are states, 3rd level are cities, 4th level are districts, and 5th level are streets.

Now, somebody tells you – I have this street with id 123, and I want to know in which state (or city or whatever) it is.

With standard adjacency-list way – oops, welcome loops. With nested sets – it's simpler, but you will be looking at some subselects.

And here? It's pretty simple:

SELECT o.*
FROM objects o JOIN objects_tree t ON o.id = t.parent_id
WHERE t.child_id = 123
ORDER BY t.depth DESC
LIMIT 1 offset 1;

If I would add “tree_level" (also calculated on triggers of course) to objects_tree table, I could even do it without order/limit.

Now. At the end – you might be scared by amount of triggers in this code. Yes. There are some. You could easily get rid of some of them by simply bundling functionalities in single trigger.

28 thoughts on “My take on trees in SQL”

  1. Great post, depesz, as usual. Thanks so much for putting this together! Properly creating tree structures in the database has been an abiding interest of mine since we created category trees in [Bricolage](http://bricolage.cc/ “Bricolage CMS”) back in 2000. This seems like a *really* good solution, and easy to do, too.

    I do have a few questions and comments, though, if you’ll indulge me.

    * In the first trigger, you have this `INSERT` statement:

    INSERT INTO objects_tree (parent_id, child_id, depth)
    SELECT x.parent_id, NEW.id, x.depth + 1
    FROM objects_tree x WHERE x.child_id = NEW.parent_id

    Is there not a race condition in that

    x.depth + 1

    bit? What if another process updates x.depth before this one commits? That’s possible, no? If so, you’d need to lock those rows by putting a `FOR UPDATE` clause in that query. But maybe I’m missing something here?

    * I think you omitted an `ON DELETE` trigger, yes?

    * I really like the `IS DISTINCT FROM` clause. I never noticed it before! Personally, however, I’d find this easier to read:

    IF OLD.parent_id IS NOT DISTINCT FROM NEW.parent_id THEN

    Than this:

    IF NOT OLD.parent_id IS DISTINCT FROM NEW.parent_id THEN

    At least in English.

    * For the loop prevention, should you not also have a trigger to prevent loops on `INSERT`?

    * I didn’t really follow the whole “ordering” column bit. Maybe it’s just because I always think of a file system when looking at this problem, so having the “tree_path” column always suits that need. So the ordering stuff seems like an extra layer for which I’m not sure of the use case.

    * What would a “tree_level” column do in that last example? Maybe that’s for another post?

    Anyway, thanks again for the great post, and for tolerating my wheedling. 🙂

    —Theory

  2. @Theory:

    1. Race condition in INSERT. Technically – yes. But it would require nearly simultaneous addition to tree and tree modification (element moved).
    I could get rid of the issue by issuing SELECT … FOR UPDATE first, but I think the risk is not high enough to justify additional workload for locking.

    2. ON DELETE triggers are not necessary because I created FOREIGN KEYS with “ON DELETE CASCADE” option.

    3. The problem is IS NOT DISTINCT FROM is that it wasn’t available from beginning (if we count beginning as the time when IS DISTINCT FROM was available). So I got used to use “NOT” before condition.

    4. Loops on INSERT – I don’t think it is possible to add a loop with insert.

    5. Ordering stuff in file system case is absolutely obsolete. But it is of great use in E-commerce solutions. Without it we would be forced to use alphabetical sorts. But somebody might want to have “Health care products” before “Car care products” on his website 🙂

    6. If I had tree_level column I would be able to write the query like this:

    SELECT o.*
    FROM objects o JOIN objects_tree t ON o.id = t.parent_id
    WHERE t.child_id = 123 AND o.tree_level = 2;

    Benefit from this might be not obvious at the first sight, so if You don’t see it (the benefit) explanation: query with order/limit/offset cannot be easily joined with something else.

  3. This is my version of tree structure un functions for tree structure representation 🙂 I use these structure and functions for classifiers… (only 1 table for all simple classifiers). In this case Update and Delete are standart SQL query, there not need any triggers…
    ——————————

    CREATE TABLE “public”.”tree” (
    “id” SERIAL,
    “_id” INTEGER,
    “name” VARCHAR(100),
    CONSTRAINT “tree_pkey” PRIMARY KEY(“id”),
    CONSTRAINT “tree__id_fk” FOREIGN KEY (“_id”)
    REFERENCES “public”.”tree”(“id”)
    ON DELETE RESTRICT
    ON UPDATE CASCADE
    NOT DEFERRABLE
    ) WITHOUT OIDS;

    /* — sample data —
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (1, NULL, ‘a_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (2, NULL, ‘b_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (3, 1, ‘aa_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (4, 2, ‘bb_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (6, 3, ‘aaa2_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (5, 3, ‘aaa1_name’);
    INSERT INTO “public”.”tree” (“id”, “_id”, “name”) VALUES (7, 4, ‘bbb_name’);
    */

    CREATE TYPE “public”.”t_tree_level” AS (
    “id” INTEGER,
    “_id” INTEGER,
    “name” VARCHAR,
    “level” INTEGER
    );

    /* returns tree child rows + depth */
    CREATE OR REPLACE FUNCTION “public”.”get_tree” (root integer, depth integer) RETURNS SETOF “public”.”t_tree_level” AS
    $body$
    DECLARE
    tempRow1 t_tree_level%ROWTYPE;
    tempRow2 t_tree_level%ROWTYPE;
    BEGIN

    FOR tempRow1 IN
    SELECT id, _id, name, depth
    FROM tree
    WHERE _id = root
    ORDER BY name
    LOOP
    RETURN NEXT tempRow1;

    FOR tempRow2 IN
    SELECT id, _id, name, level
    FROM get_tree(tempRow1.id, depth+1)
    LOOP
    RETURN NEXT tempRow2;
    END LOOP;
    END LOOP;
    RETURN;

    END;
    $body$
    LANGUAGE ‘plpgsql’ VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;

    /* returns full tree */
    CREATE OR REPLACE FUNCTION “public”.”get_tree” () RETURNS SETOF “public”.”t_tree_level” AS
    $body$
    DECLARE
    tempRow t_tree_level%ROWTYPE;
    tempRow1 t_tree_level%ROWTYPE;
    BEGIN
    FOR tempRow IN
    SELECT *
    FROM tree
    WHERE _id is null
    LOOP
    tempRow.level=0;
    RETURN NEXT tempRow;

    FOR tempRow1 IN
    SELECT id, _id, name, level
    FROM get_tree(tempRow.id,tempRow.level+1)
    LOOP
    RETURN NEXT tempRow1;
    END LOOP;

    END LOOP;
    RETURN;
    END;
    $body$
    LANGUAGE ‘plpgsql’ VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;

    /* tree table with sample data */
    SELECT * FROM tree;
    id _id name
    1 a_name
    2 b_name
    3 1 aa_name
    4 2 bb_name
    5 3 aaa1_name
    6 3 aaa2_name
    7 4 bbb_name

    /* ordered tree with child_id (id) parent_id (_id), name and level (depth) */
    SELECT * FROM get_tree();
    id _id name level
    1 a_name 0
    3 1 aa_name 1
    5 3 aaa1_name 2
    6 3 aaa2_name 2
    2 b_name 0
    4 2 bb_name 1
    7 4 bbb_name 2

    /* select part of tree */
    SELECT * FROM get_tree(2,1); /* parent_id=2 AND level/depth starts with=1 */
    id _id name level
    4 2 bb_name 1
    7 4 bbb_name 2

  4. @xor:
    This is pretty standard adjacency list approach. It is simple, but I don’t like it because it requires loops to get data.

  5. Hello Hubert.
    This’s not the first time I read something really interesting in your web site.
    At least in my not short experience trees are much less often modified than traversed.
    Maybe one can focus more on “faster to execute” solutions than “easier to write” solutions. Yours seems to match both and makes it more interesting than others’.
    Thanks a lot for this anything but a small gem!

  6. depesz you rigth about loops (i’m also did’t like these loops), but sometimes, when row count in table is not large then this way is ok. Or sometimes when we know max_depth (and it is less than e.g. 5), then we can write static view with subqueries and result is similar. Your case is good for fast tree table reading 🙂 but when depth larger than 10 levels, then there is 10 times more rows… but if we look from performance – your solution is faster (in my mind)!
    Thanks! 🙂

  7. Pingback: DbRunas - My take on trees in SQL
  8. Great article. I was fighting with porting your solution to MySQL (I’ve inherited one project based on that database). In two words: big pain. With newest Mysql (5.0.51a):
    – i can’t raise exception – patch i pending, but time when it will be in mainline is uncertain (there are some dirty hacks, like assigning variable to integer to raise exception), i’ve to drop code that checked for infinite loops in trigger,
    – i can’t do simple UPDATE same_table SET col = value WHERE id IN (SELECT id FROM same_table WHERE …) – you can’t update table which is used in subselect, you must create temporary tables from select and then use it in update, temporary tables don’t disapear after trigger finishes/transaction commits, have to drop it by myself,
    – auto_increment value for column is assigned after BEFORE trigger, in BEFORE trigger is just 0, completely useless,
    – you can’t create before update triger on table same_table and then in it run update on the same table (even with WHERE securing your code from entering infinite loop), have to replace BEFORE UPDATE trigger with “stored procedure”, that worked.
    – regexp replace not exists in mysql, have to done dirty hacking with insert() function (not insert statement!),

    I’m glad that such databases as PostgreSQL exists, and I can use in other projects.

  9. Pingback: We’re Not Freak » Blog Archive » Adjacency list tree on MySQL
  10. There a small thing missing in one of the triggers:
    in tree_ordering_path_bu() there is:

    IF OLD.ordering = NEW.ordering THEN…
    and there should be
    IF OLD.ordering = NEW.ordering AND OLD.parent_id = NEW.parent_id THEN…

    cause when one is moving branches from one place to another, there may be no need to change the ordering, in that situation trigger was not relfecting the new ordering path that needed to be generated on element move without ordering change.

  11. This is something I’ve been needing to do for years with my company website. This is by far the best solution I’ve found — much easier to understand than nested sets. And you’ve even solved the user-defined ordering issue! Thanks for posting your solution in such detail — you’ve laid out everything I need to implement this.

  12. I’ve just uploaded to CPAN (the Comprehensive Perl Archive Network) SQL::Tree – a module and associated command-line tool (sqltree) to automatically generate the implementation you’ve described here for any given table and column names. While writing this I also ended up porting things to SQLite.

    The distribution can be downloaded (shortly) from here:

    http://search.cpan.org/~mlawren/SQL-Tree-0.01/

    or installed using the standard Perl cpan/cpanm tools.

    The name SQL::Tree is simply what made sense to me, as you didn’t define one for your method in this article. If you think of something better I’m open to change.

    Regards,
    Mark.

  13. Nice Job!

    One thing still is tricky, at least to me. Is it possible to find the ID of, say, a/b/c/d/e/f without a ‘tree_path’ column?

  14. @Flip:
    Sure, but it will make the query more complex.
    You can easily just use “array_agg” on nodes, ordered by depth, and compare it with given array which will contain path to node you’re looking for.

  15. Thanks for the quick answer.

    Of course the query would be a bit more complex, but using your solution at least doesn’t require heavy loops.

    Thanks again!

  16. just a quick question. What if you wished to delete a row that had children. The triggers would currently reject the delete as a foreign key constraint. Can you alter the triggers to delete all the children or update the children that had the parent deleted to a new parent?

    I have implemented this in a test environment but because I am not good at triggers I am struggling with a more intelligent delete.

    This is a great approach, well done.

    Thanks

    rob

  17. @Anonymous:
    well, it’s all possible. deleting all children or reattaching children – from trigger or from custom function.

    Personally, I think that the cleaner solution is to prevent deletes if there are children, and so force earlier move of children wherever you want.

  18. What’s your approach to importing denormalized table representing tree structure (eg. from spreadsheet) to the 2-table structure you proposed?
    Is it better to write some VBA macro or import whole table into Postgres and then process data?

    Thank!

  19. Do you know if this structure/technique has any heritage? If not, do you mind if I use it in talks with reference and attribution as the ‘Depesz Tree’?

  20. @Wes:
    I am not aware of any similar structure. We did develop it with colleagues at a company I worked several years ago, but we were not basing our work on any external materials.
    I do not object – it’s fine. I’m just not entirely sure if it’s worth being called specialized method – technically it’s adjacency list + some redundant info set with triggers.

  21. ‘Trigger Tree’ sounds better anyway. 😉 In any case it’s very good.

    Thanks!

Comments are closed.