December 9th, 2012 by depesz | Tags: , , , , , , , | 1 comment »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

[ CuTE kitten ]

In PostgreSQL 8.4 we got CTE – Common Table Expressions. Since then we have this great tool available, but apparently for some people it's still black magic. CuTE, but still magic. I'll try to make it a bit less magical, and more understandable.

The very basic way you can use CTE, is with query like:

1
2
3
4
5
6
7
8
9
10
11
with base_info as (
    select
        oid::regclass as table_name, pg_table_size(oid) as size
    from
        pg_class
    where
        relkind = 'r'
)
SELECT *
FROM base_info
order by size desc limit 10;

Result of the query looks like this:

  table_name   |  size
----------------+--------
 pg_proc        | 548864
 pg_rewrite     | 507904
 pg_depend      | 417792
 test           | 393216
 pg_attribute   | 376832
 pg_description | 286720
 pg_statistic   | 221184
 pg_operator    | 147456
 pg_class       |  98304
 pg_type        |  98304
(10 rows)

Let's analyze what's inside the query.

  • Line #1 – starts with word “WITH" – this is the tell-tale sign of CTEs. Every query that uses them has to start using WITH
  • Still line #1 – “base_info" – name of the CTE that will be defined after “as", and within parentheses. Any name is acceptable – basically it is just like table or view name
  • Lines 2-7 – definition of the CTE. This is a query, that can return any number of rows – you can think of it as inlined VIEW (though there are differences)
  • Line 8 – parentheses finishing current CTE. We could have more than one in given query, using syntax like: with a as (…), b as (…), c as (…), but for now single CTE is enough
  • Lines 9-11 – actual query, that is using the CTE just as it would any table/view – just providing its name in FROM clause.

Hope that's not scary.

Now you could ask: well, OK. But what is the point of using CTE when you can simply have VIEWs?

Answer has two elements. First – because it can be simpler. Instead of going the whole route of think about view, create it, grant privileges, and so on – you just embed the view in your query, and you're done.

But the other, much more important (in my opinion) thing is that CTEs, unlike VIEWs, are treated as separate entities, and behave like “optimization wall".

What wall? Let me show you simple example:

DROP TABLE IF exists test;
CREATE TABLE test ( id INT4 NOT NULL, some_val INT4 NOT NULL);
 
INSERT INTO test (id, some_val)
    SELECT i, case when random() < 0.5 THEN 1 ELSE 2 END FROM generate_series(1,1000) i;
INSERT INTO test (id, some_val)
    SELECT i, cast(2 + random() * 100 as INT4) FROM generate_series(1001, 10000) i;
 
ALTER TABLE test add PRIMARY KEY (id);
CREATE INDEX i1 on test (some_val);
analyze test;

This is relatively simple table, with 10000 rows. Each row has 2 columns: id – primary key, integer from 1 to 10000, and some_val, which is basically random.

What do I mean by “basically". Well, the values range from 1 to 102, but values 1 and 2 are much more common (~ 5 times as common), but 1 happens only in the first 1000 records.

Now, let's see simple query that is looking for some_val = 1:

explain analyze SELECT * FROM test WHERE some_val = 1;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=11.97..62.94 rows=478 width=8) (actual time=0.113..0.208 rows=478 loops=1)
   Recheck Cond: (some_val = 1)
   ->  Bitmap Index Scan on i1  (cost=0.00..11.85 rows=478 width=0) (actual time=0.089..0.089 rows=478 loops=1)
         Index Cond: (some_val = 1)
 Total runtime: 0.286 ms
(5 rows)

Works OK – it did use index on some_val column, fetched all rows (478 of them).

What would happen if I'd want just 5 newest rows with some_val = 1?

explain analyze SELECT * FROM test WHERE some_val = 1 ORDER BY id desc LIMIT 5;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.59 rows=5 width=8) (actual time=3.117..3.121 rows=5 loops=1)
   ->  Index Scan Backward using test_pkey on test  (cost=0.00..343.26 rows=478 width=8) (actual time=3.115..3.117 rows=5 loops=1)
         Filter: (some_val = 1)
         Rows Removed by Filter: 9001
 Total runtime: 3.163 ms
(5 rows)

Please note that planner did something rather stupid in here: instead of getting 478 rows that match where condition, it instead scanned almost whole table, discarded first found 9001 rows, and then, finally, found 5 newest rows with some_val = 1.

( side note: this example is one of the reasons I'm firm believer that we should have planner HINTs, despite all the opposition from some developers )

Why did it happen? Well, without going into too much detail – PostgreSQL assumed some things that are not true, and ran the query as if they were. So it was slower than it could be.

Let's see how I can make PostgreSQL behave if I'll use CTE optimization barrier to fight stupid decisions:

explain analyze
WITH i_know_what_im_doing AS (
    SELECT *
    FROM test
    WHERE some_val = 1
)
SELECT *
FROM i_know_what_im_doing
ORDER BY id DESC limit 5;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80.44..80.45 rows=5 width=8) (actual time=0.526..0.527 rows=5 loops=1)
   CTE i_know_what_im_doing
     ->  Bitmap Heap Scan on test  (cost=11.97..62.94 rows=478 width=8) (actual time=0.101..0.206 rows=478 loops=1)
           Recheck Cond: (some_val = 1)
           ->  Bitmap Index Scan on i1  (cost=0.00..11.85 rows=478 width=0) (actual time=0.080..0.080 rows=478 loops=1)
                 Index Cond: (some_val = 1)
   ->  Sort  (cost=17.50..18.69 rows=478 width=8) (actual time=0.523..0.523 rows=5 loops=1)
         Sort Key: i_know_what_im_doing.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on i_know_what_im_doing  (cost=0.00..9.56 rows=478 width=8) (actual time=0.106..0.360 rows=478 loops=1)
 Total runtime: 0.596 ms
(11 rows)

Now. Please note that PostgreSQL did use i1 index, to find matching rows. Afterwards, these rows were sorted using normal top-N heapsort. It's faster.

How would a view work in here?

create view im_lost as SELECT * FROM test WHERE some_val = 1;
 
explain analyze
select *
FROM im_lost
ORDER BY id desc limit 5;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.59 rows=5 width=8) (actual time=2.946..2.948 rows=5 loops=1)
   ->  Index Scan Backward using test_pkey on test  (cost=0.00..343.26 rows=478 width=8) (actual time=2.943..2.945 rows=5 loops=1)
         Filter: (some_val = 1)
         Rows Removed by Filter: 9001
 Total runtime: 2.997 ms
(5 rows)

As you can see – adding view made Pg use bad plan. That's because views are internally rewrite rules.

Another benefit is that CTE data is calculated once.

What does it mean?

Let's make a test function that takes long time to calculate something. In our case it will “calculate" random number, in range 0-1000:

create function hard_work() returns int4 as $$
begin
    perform pg_sleep(0.1);
    return floor( random() * 1000 );
end;
$$ language plpgsql;

Of course this function doesn't make any sense, it's just here to show you, in simple, numeric, way some facts.

Now, let's assume that we want to call it 10 times:

select hard_work() from generate_series(1,10);

But, let's assume we don't know what it returns, so we also want to get sum of the values it did return.

Let's start with a view approach:

create view lots_of_work as
    select hard_work() as h from generate_series(1,10);

Now, I can write simple query:

select *, (select sum(h) from lots_of_work) as sum
from lots_of_work;
  h  | sum
-----+------
 519 | 4111
 729 | 4111
 140 | 4111
 945 | 4111
  69 | 4111
 765 | 4111
 179 | 4111
 751 | 4111
 714 | 4111
 462 | 4111
(10 rows)
 
Time: 2007.982 ms

Before I will analyze it, let's see how it works in case of CTE:

WITH less_work AS (
    select hard_work() as h from generate_series(1,10)
)
SELECT
    h, (select sum(h) from less_work) as sum
FROM less_work;
  h  | sum
-----+------
 319 | 3510
  43 | 3510
  26 | 3510
 484 | 3510
 455 | 3510
  75 | 3510
 473 | 3510
  69 | 3510
 851 | 3510
 715 | 3510
(10 rows)
 
Time: 1005.168 ms

Values are of course different, but the point is different. First of all – please notice that the runtime of CTE version is half of VIEW version.

And also – please note that while sum in case of CTE is correct ( 319 + 43 + 26 + 484 + 455 + 75 + 473 + 69 + 851 + 715 = 3510 ), in case of VIEW – the sum is incorrect ( 519 + 729 + 140 + 945 + 69 + 765 + 179 + 751 + 714 + 462 = 5273 ).

Why is that so? Very simple – view rewrites the query, and CTE is “materialized". And this means that in VIEW version our slow function has to be called 20 times (10 times to return records, and 10 times to be used for summing of return values). In case of CTE – function is called 10 times, and the returned values are available to reuse for summing.

This can be seen in these explain analyze plans:

explain analyze
select *, (select sum(h) from lots_of_work) as sum
from lots_of_work;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on lots_of_work  (cost=272.51..542.51 rows=1000 width=4) (actual time=1104.304..2006.659 rows=10 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=272.50..272.51 rows=1 width=4) (actual time=1002.646..1002.646 rows=1 loops=1)
           ->  Function Scan on generate_series generate_series_1  (cost=0.00..260.00 rows=1000 width=0) (actual time=100.268..1002.633 rows=10 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=101.654..1004.003 rows=10 loops=1)
 Total runtime: 2006.778 ms
(6 rows)

You can see now that there are two separate calls to generate_series(). With CTE, just one:

explain analyze
WITH less_work AS (
    select hard_work() as h from generate_series(1,10)
)
SELECT
    h, (select sum(h) from less_work) as sum
FROM less_work;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on less_work  (cost=282.51..302.51 rows=1000 width=4) (actual time=1004.037..1004.041 rows=10 loops=1)
   CTE less_work
     ->  Function Scan on generate_series  (cost=0.00..260.00 rows=1000 width=0) (actual time=101.640..1003.990 rows=10 loops=1)
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=22.50..22.51 rows=1 width=4) (actual time=902.390..902.390 rows=1 loops=1)
           ->  CTE Scan on less_work less_work_1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.001..902.377 rows=10 loops=1)
 Total runtime: 1004.142 ms
(7 rows)

Does that mean that you should use CTE always and never views. Well, no.

Let's see how many rows there are for every some_val in our temp table. First, I'll do it by defining a view:

create view smart_view as
    select some_val, count(*)
    from test group by some_val;

and afterwards, I can use it in a query:

explain analyze
select * from smart_view where some_val = 3;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=4.94..51.49 rows=1 width=4) (actual time=0.288..0.288 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=4.94..51.04 rows=88 width=4) (actual time=0.132..0.271 rows=81 loops=1)
         Recheck Cond: (some_val = 3)
         ->  Bitmap Index Scan on i1  (cost=0.00..4.92 rows=88 width=0) (actual time=0.116..0.116 rows=81 loops=1)
               Index Cond: (some_val = 3)
 Total runtime: 0.396 ms
(6 rows)

Now, let's see how it would work with view definition as CTE:

explain analyze
WITH not_so_smart_cte AS (
    SELECT some_val, count(*)
    FROM test
    GROUP BY some_val
)
SELECT *
FROM not_so_smart_cte
WHERE some_val = 3;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 CTE Scan on not_so_smart_cte  (cost=196.02..198.31 rows=1 width=12) (actual time=6.165..6.232 rows=1 loops=1)
   Filter: (some_val = 3)
   Rows Removed by Filter: 101
   CTE not_so_smart_cte
     ->  HashAggregate  (cost=195.00..196.02 rows=102 width=4) (actual time=6.145..6.164 rows=102 loops=1)
           ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.014..1.827 rows=10000 loops=1)
 Total runtime: 6.334 ms
(7 rows)

The difference should be clearly visible – view approach was able to use the “some_val = 3″ condition to fetch less rows from table, and then count just them.

CTE approach is different – first it takes all records (rows=10000), then it aggregates it, and then, from the aggregate takes single row.

As you can see – there is no silver bullet. There are cases where CTEs are better, and there are cases when CTEs are worse. I like to use them, because it lets me write my query in a sort of “block-by-block" way. Like: first, I need some rows from here, so I'll build CTE that does it. Then, based on these rows I need something else, so I add another CTE, and so on.

All I have written up to now is about simple CTEs – just a “views in disguise". But CTEs have one big trick up their sleeve: recursive cuteness.

We all know that to understand recursion you have to first understand recursion. How can I explain it, then?

First, let's see some simple recursive CTE query:

with recursive i as (
    select 1 as x
    union all
    select x+1 from i where x < 10
)
select * from i;
 x
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

That's it. The key things to see/understand:

  • recursive CTE needs to be marked using “WITH RECURSIVE"
  • recursive CTE consists of two queries joined with UNION or UNION ALL. In my case it is “SELECT 1 as x" and “SELECT x+1 FROM i WHERE x < 10

You might ask: what if I have multiple CTEs in single query, and one of them, and not the first one, is recursive, while others are not? It's simple – as long as any of the CTEs are recursive, you have to use “WITH RECURSIVE".

Let's get a bit more complicated query, so we can see what's happening and how.

For my tests, I will be using copy of DMOZ category structure that I downloaded long time ago. You can get the data in sql-dump format from here so you can follow with my examples.

To have rather small, but interesting sample data, I chose these categories:

  id   | parent_id |                         label
-------+-----------+--------------------------------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
 92992 |     92991 | Arts_and_Sciences
 92993 |     92991 | Economics_and_Administrative_Sciences
 92994 |     92991 | Education
 92995 |     92991 | Engineering
 92997 |     92996 | Kandilli_Observatory_Earthquake_and_Research_Institute
 92999 |     92998 | Applied_Disciplines
 93000 |     92998 | Foreign_Languages
(15 rows)

When you'll create paths, based on id/parent_id relations, you'll get:

  id   | parent_id |                                             path
-------+-----------+----------------------------------------------------------------------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Turkey/Bilkent_University
 92989 |     92988 | Turkey/Bilkent_University/Departments
 92990 |     92987 | Turkey/Bogazici_University
 92991 |     92990 | Turkey/Bogazici_University/Faculties
 92992 |     92991 | Turkey/Bogazici_University/Faculties/Arts_and_Sciences
 92993 |     92991 | Turkey/Bogazici_University/Faculties/Economics_and_Administrative_Sciences
 92994 |     92991 | Turkey/Bogazici_University/Faculties/Education
 92995 |     92991 | Turkey/Bogazici_University/Faculties/Engineering
 92996 |     92990 | Turkey/Bogazici_University/Institutes
 92997 |     92996 | Turkey/Bogazici_University/Institutes/Kandilli_Observatory_Earthquake_and_Research_Institute
 92998 |     92990 | Turkey/Bogazici_University/Schools
 92999 |     92998 | Turkey/Bogazici_University/Schools/Applied_Disciplines
 93000 |     92998 | Turkey/Bogazici_University/Schools/Foreign_Languages
 93001 |     92987 | Turkey/Middle_East_Technical_University
(15 rows)

The exercise is to get all kids for id = 92987 using recursive CTE.

We start by getting first element – the one that will be start of our recursion. So, clearly it should be:

$ SELECT * FROM dmoz WHERE id = 92987;
  id   | parent_id | label
-------+-----------+--------
 92987 |     92960 | Turkey
(1 row)

So far, so good. Now, we need to add kids. First, think about just getting direct subcategories, which can be done obviously with:

$ SELECT * FROM dmoz WHERE parent_id = 92987;
  id   | parent_id |              label
-------+-----------+----------------------------------
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
(3 rows)

So far, so good. We could then continue, without using CTE, by doing something like:

$ SELECT * FROM dmoz WHERE parent_id IN (
    SELECT id FROM dmoz WHERE parent_id = 92987
);
  id   | parent_id |    label
-------+-----------+-------------
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
(4 rows)

But this gets tedious pretty soon. And, there is also another problem. Let's assume you write the above 3 queries as one, using UNION ALL:

$ SELECT * FROM dmoz WHERE id = 92987
UNION ALL
SELECT * FROM dmoz WHERE parent_id = 92987
UNION ALL
SELECT * FROM dmoz WHERE parent_id IN (
    SELECT id FROM dmoz WHERE parent_id = 92987
);
  id   | parent_id |              label
-------+-----------+----------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
(8 rows)

Of course, it works, but please note that we would have to add another query just to get 4 levels that we need in our case, but technically – we should be adding much more queries if we didn't know how deep the structure goes.

Recursive CTE is great solution, because You can rewrite potentially infinite query (UNION-ALL style, like above) to something way simpler:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
$ WITH RECURSIVE all_kids AS (
    SELECT * FROM dmoz WHERE id = 92987
    UNION ALL
    SELECT d.*
    FROM dmoz d
        JOIN all_kids ak on d.parent_id = ak.id
)
SELECT * FROM all_kids;
  id   | parent_id |                         label
-------+-----------+--------------------------------------------------------
 92987 |     92960 | Turkey
 92988 |     92987 | Bilkent_University
 92990 |     92987 | Bogazici_University
 93001 |     92987 | Middle_East_Technical_University
 92989 |     92988 | Departments
 92991 |     92990 | Faculties
 92996 |     92990 | Institutes
 92998 |     92990 | Schools
 92992 |     92991 | Arts_and_Sciences
 92993 |     92991 | Economics_and_Administrative_Sciences
 92994 |     92991 | Education
 92995 |     92991 | Engineering
 92997 |     92996 | Kandilli_Observatory_Earthquake_and_Research_Institute
 92999 |     92998 | Applied_Disciplines
 93000 |     92998 | Foreign_Languages
(15 rows)

That was fast. In writing at least :)

Let's see how it works:

  • Line #1 – marks that we will be doing some recursion (WITH RECURSIVE), and starts “all_kids" cte.
  • Line #2 – it's the initial row for recursion – the row that we know how to get by id
  • Line #3 – UNION ALL – all recursive queries, as I said – as two queries with UNION or UNION ALL in between
  • Lines #4-#5 – we will be returning all columns from table dmoz
  • Line #6 – rows to return should have parent_id that is listed in all_kids cte
  • Line #7 – ends the CTE
  • Line #8 – returns rows from all_kids

You might wonder – well, but on first loop – the 2nd query (after UNION ALL), will be equivalent to: select * from dmoz where parent_id = 92987, because that's the only row there. What happens on subsequent rows, though? Which rows are taken into consideration? Of course new ones (with ids 92989, 92991, 92996, 92998, but is 92987 (top level) still taken into consideration)? No. On each run only newly added rows are “visible" to recursive part. We can see it by adding simple window function:

$ WITH RECURSIVE all_kids AS (
    SELECT *, '{}'::int4[] as parents FROM dmoz WHERE id = 92987
    UNION ALL
    SELECT d.*, array_agg(ak.id) over ()
    FROM dmoz d
        JOIN all_kids ak on d.parent_id = ak.id
)
select
    id,
    parent_id,
    label,
    array( select distinct unnest( parents ) ) as parents
from
all_kids;
  id   | parent_id |                         label                          |       parents
-------+-----------+--------------------------------------------------------+---------------------
 92987 |     92960 | Turkey                                                 | {}
 92988 |     92987 | Bilkent_University                                     | {92987}
 92990 |     92987 | Bogazici_University                                    | {92987}
 93001 |     92987 | Middle_East_Technical_University                       | {92987}
 92989 |     92988 | Departments                                            | {92988,92990}
 92991 |     92990 | Faculties                                              | {92988,92990}
 92996 |     92990 | Institutes                                             | {92988,92990}
 92998 |     92990 | Schools                                                | {92988,92990}
 92992 |     92991 | Arts_and_Sciences                                      | {92998,92996,92991}
 92993 |     92991 | Economics_and_Administrative_Sciences                  | {92998,92996,92991}
 92994 |     92991 | Education                                              | {92998,92996,92991}
 92995 |     92991 | Engineering                                            | {92998,92996,92991}
 92997 |     92996 | Kandilli_Observatory_Earthquake_and_Research_Institute | {92998,92996,92991}
 92999 |     92998 | Applied_Disciplines                                    | {92998,92996,92991}
 93000 |     92998 | Foreign_Languages                                      | {92998,92996,92991}
(15 rows)

(The additional logic is adding array_agg() in CTE, and then array() call to make the list of parents distinct)

As you can see, each subsequent recurrence is seeing in all_kids just the rows that were added there by previous run.

Of course recursive CTE is not limited to working on trees. It can be used to do many things – solving travelling salesman problem, finding cheap and many others.

Finally, since 9.1 we have writable CTE. This is basically normal CTE, but where source of rows is not SELECT clause, but rather some write operation (INSERT, UPDATE, DELETE) with RETURNING clause.

This is rather simple, we can do stuff like:

$ with ids_to_delete as (
    select id from dmoz order by id desc limit 5
), delete_it as (
    delete from dmoz where id in (select id from ids_to_delete ) returning *
)
select string_agg(label, ', ') from delete_it;
                        string_agg
----------------------------------------------------------
 Vietnamese, Vinodam, Ukrainian, Vaarthaa_Patrikalu, Thai
(1 row)

In here, I used first normal CTE (ids_to_delete) to get some ids of rows to delete, then I used delete_it as writable cte – that is – it did remove the rows, and delete_it CTE contained deleted rows. And finally – I got labels of all deleted rows as single column.

I hope this will make this CuTE feature of PostgreSQL simpler to understand and use. As always – if anything is not clear, please let me know so I can fix it.

  1. One comment

  2. # J. Prevost
    Dec 9, 2012

    I’d just like to note that this “optimization wall” thing has frequently hit me from the opposite side of the equation. I often like to use CTEs to make my query clearer and more organized–but because of this effect, it can result in an unworkable query plan. (As in: going from less than a second to potentially multiple hours. Not just getting a little less efficient.)

    That kind of situation, where the structure of the query must be changed in order to influence the query plan, is exactly what I never want in my DBMS. So sign me up for “hints would be a very good thing to have, particularly if they don’t have to be in-line in the body of the query”, and a desire that the query planner also do a better job of making a larger variety of similar queries behave the same way.

Leave a comment