Explaining the unexplainable

One of the first things new DBA hears is “Use the EXPLAIN". And upon first try he/she is greeted with incomprehensible:

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=146.63..148.65 ROWS=808 width=138) (actual TIME=55.009..55.012 ROWS=71 loops=1)
   Sort KEY: n.nspname, p.proname, (pg_get_function_arguments(p.oid))
   Sort Method: quicksort  Memory: 43kB
   ->  Hash JOIN  (cost=1.14..107.61 ROWS=808 width=138) (actual TIME=42.495..54.854 ROWS=71 loops=1)
         Hash Cond: (p.pronamespace = n.oid)
         ->  Seq Scan ON pg_proc p  (cost=0.00..89.30 ROWS=808 width=78) (actual TIME=0.052..53.465 ROWS=2402 loops=1)
               FILTER: pg_function_is_visible(oid)
         ->  Hash  (cost=1.09..1.09 ROWS=4 width=68) (actual TIME=0.011..0.011 ROWS=4 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan ON pg_namespace n  (cost=0.00..1.09 ROWS=4 width=68) (actual TIME=0.005..0.007 ROWS=4 loops=1)
                     FILTER: ((nspname <> 'pg_catalog'::name) AND (nspname <> 'information_schema'::name))

What does it even mean?

Of course trying to understand the explain above, as a start point, is pretty futile. Let's start with something simpler. But even before that I want you to understand one very important thing:

PostgreSQL knows

That means that PostgreSQL keeps some meta-information (information about information). Rowcounts, number of distinct values, most common values, and so on. For large tables these are based on random sample, but in general, it (Pg) is pretty good at knowing stuff about our data.

So, now. Let's see simple query, and it's explain:

$ EXPLAIN SELECT * FROM test WHERE i = 1;
                      QUERY PLAN                      
------------------------------------------------------
 Seq Scan ON test  (cost=0.00..40.00 ROWS=12 width=4)
   FILTER: (i = 1)
(2 ROWS)

Query is pretty simple, and I take it that it doesn't require any special comments.

In explain – first line, and all lines that start with “->" are operations. Other lines as additional info for operation above.

In our case, we have just one operation: sequential scanning of table test.

There is additional information – about Filter.

Sequential scan means that PostgreSQL will “open" the table data, and read it all, potentially filtering (removing) rows, but generally ready to read and return whole table.

Why ready? I'll talk about it in a minute.

So, the Seq scan line, informs us that we are scanning table in sequential mode. And that the table is named “test" (though in here lies one of the biggest problems with explain, as it's not showing schema, which did bite my ass more than couple of times).

And what about the numbers afterwards?

Let me ask you a question: You have table like this:

                            TABLE "public.t"
   COLUMN    |  TYPE   |                   Modifiers                    
-------------+---------+------------------------------------------------
 id          | INTEGER | NOT NULL DEFAULT NEXTVAL('t_id_seq'::regclass)
 some_column | INTEGER | 
 something   | text    | 
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)
    "q" btree (some_column)

Given this table definition, and query:

SELECT * FROM t WHERE some_column = 123;

What do you think would be the best way to run the query? Sequentially scan the table, or use index?

If you say: use index of course, there is index on this column, so it will make it faster – I'll ask: what about case where the table has one row, and that it has some_column = 123?

To do seq scan, I just need to read one page (8192 bytes) from table, and I get the row. To use index, I have to read page from index, check it to find if we have row matching condition in the table and then read page from table.

In total – more than twice the work!

You could say – sure, but that's for very small tables, so the speed doesn't matter. OK. So let's imagine a table that has 10 billion rows, and all of them have some_column = 123. Index doesn't help at all, and in reality it makes the situation much, much worse.

Of course – if you have million rows, and just one has some_column = 123 – index scan will be clearly better.

So – it is impossible to tell whether given query will use index, or even if it should use index (I'm talking about general case) – you need to know more. And this leads us to simple thing: depending on situation one way of getting data will be better or worse then other.

PostgreSQL (up to a point) examines all possible plans. It knows how many rows you have, it knows how many rows will (likely) match the criteria, so it can make pretty smart decisions.

But how are the decisions made? That's exactly what the first set of numbers in explain shows. It's the cost.

Some people think that cost is estimate shown in seconds. It's not. It's unit is “fetching single page in sequential manner". It is about time and resource usage.

In postgresql.conf, you might have noticed these params:

seq_page_cost = 1.0                    # measured ON an arbitrary scale
random_page_cost = 4.0                 # same scale AS above
cpu_tuple_cost = 0.01                  # same scale AS above
cpu_index_tuple_cost = 0.005           # same scale AS above
cpu_operator_cost = 0.0025             # same scale AS above

So, we can even change how much it costs to read sequential page. These parameters dictate costs that PostgreSQL assumes it would take to run various methods of running the same query.

For example, let's make a simple 1000 row table, with some texts, and index:

CREATE TABLE test (id serial PRIMARY KEY, some_text text);
CREATE TABLE
 
INSERT INTO test (some_text) SELECT 'whatever' FROM generate_series(1,1000);
INSERT 0 1000

Now, we can see that running explain with condition on id will show:

EXPLAIN SELECT * FROM test WHERE id = 50;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 INDEX Scan USING test_pkey ON test  (cost=0.28..8.29 ROWS=1 width=36)
   INDEX Cond: (id = 50)
(2 ROWS)

And what if we'd tell pg that under no circumstances it can use index scan?

SET enable_indexscan = FALSE;
EXPLAIN SELECT * FROM test WHERE id = 50;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=4.28..8.30 ROWS=1 width=13)
   Recheck Cond: (id = 50)
   ->  Bitmap INDEX Scan ON test_pkey  (cost=0.00..4.28 ROWS=1 width=0)
         INDEX Cond: (id = 50)
(4 ROWS)

and let's disable this one too:

SET enable_bitmapscan = FALSE;
EXPLAIN SELECT * FROM test WHERE id = 50;
                      QUERY PLAN                      
------------------------------------------------------
 Seq Scan ON test  (cost=0.00..18.50 ROWS=1 width=13)
   FILTER: (id = 50)
(2 ROWS)

Before I will go any further, please note that I used SET command to disable certain ways Pg can process the query. Disabled them. There are more of them, and you can see them all using:

SELECT name, setting, short_desc || COALESCE(E'\n' || extra_desc, '')
FROM pg_settings WHERE name ~ '^enable_';

They can be used to test various ideas, and sometimes, especially beginner DBAs get the idea to use them to their advantage in production code. Please don't. These are for development/testing, not for production. Specifically because their scope is to great – you can't disable index scan of index “xxx". It's either enable, or disabled, nothing in between.

Anyway – back to operations, let's show them close to each other:

INDEX Scan USING test_pkey ON test  (cost=0.28..8.29 ROWS=1 width=36)
Bitmap Heap Scan ON test  (cost=4.28..8.30 ROWS=1 width=13)
Seq Scan ON test  (cost=0.00..18.50 ROWS=1 width=13)

By default, Pg used Index Scan. Why? It's simple – it's the cheapest in this case. Total cost of 8.29, while bitmap heap scan (whatever that would be) would be 8.30 and seq scan – 18.5.

OK, But cost shows two numbers: number..number. What is this about, and why am I talking about the second number only? If we'd get into consideration first number then seq scan is winner, as it has there 0 (zero), while index scan is 0.28, and bitmap heap scan – 4.28.

So, the range (number .. number) is because it shows cost for starting the operation row and cost for getting all rows (By all, I mean all returned by this operation, not all in table).

What is the starting cost? Well, for seq scan there is none – you just read page, and return rows. That's all. But for example, for sorting dataset – you have to read all the data, and actually sort it, before you can consider returning even first of the rows. Which can be nicely seen in this explain:

                            QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=22.88..23.61 ROWS=292 width=202)
   Sort KEY: relfilenode
   ->  Seq Scan ON pg_class  (cost=0.00..10.92 ROWS=292 width=202)
(3 ROWS)

Please note that startup cost for Sort is 22.88, while total cost is just 23.61. So returning rows from Sort is trivial (in terms of cost), but sorting them – not so much.

Next information in explain is “rows". This is estimate, how many rows does PostgreSQL think that this operation is capable of returning (it might return less, for example in case of LIMIT). This is also important for some operations – joins for example. As joining two tables that have 20 rows together can be done in many ways, and it doesn't really matter how, but when you join 1 million row table with 1 billion row table, the way you do the join is very important (I'm not talking about “inner join/left join/…" but rather about “hash join", “nested loop", “merge join" – if you don't understand, don't worry – will write about it later).

This number can be of course misestimated – for many reasons. Sometimes it doesn't matter, and sometimes it does. But we'll talk about misestimates later.

Final bit of information is width. This is PostgreSQL idea on how many bytes, on average, there are in single row returned from given operation. For example:

EXPLAIN SELECT * FROM pg_class;
                         QUERY PLAN                          
-------------------------------------------------------------
 Seq Scan ON pg_class  (cost=0.00..10.92 ROWS=292 width=202)
(1 ROW)
 
EXPLAIN SELECT relname, relkind FROM pg_class;
                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan ON pg_class  (cost=0.00..10.92 ROWS=292 width=65)
(1 ROW)

As you can see limiting number of fields modified width, and in turn, total amount of data that will need to be passed through execution of the query.

Next, it the single most important bit of information. Explains are trees. Upper node needs data from nodes below.

Let's consider this plan.

There are 5 operations there: sort, hash join, seq scan, hash, and seq scan. PostgreSQL executes the top one – sort. Which in turn executes the ones directly below (Hash join) and gets data from them. Hash join, to return data to sort, has to run seq scan (on pg_proc) and hash (#4). And the final hash, to be able to return data, has to run seq scan on pg_namespace.

It is critical to understand that some operations can return data immediately, and, what's even more important, gradually. For example – Seq Scan. And some others cannot. For example – in here we see that Hash (#4) has the same cost for startup as it's “suboperation" seq scan – for “all rows". This means, that for hashing operation to start (well, to be able to return even single row), it has to read in all the rows from suboperation(s).

The part about returning rows gradually becomes very important when you'll start writing functions. Let's consider such functions:

  1. CREATE OR REPLACE FUNCTION public.test()
  2.  RETURNS SETOF INTEGER
  3.  LANGUAGE plpgsql
  4. AS $function$
  5. DECLARE
  6.     i int4;
  7. BEGIN
  8.     FOR i IN 1..3 loop
  9.         RETURN NEXT i;
  10.         perform pg_sleep(1);
  11.     END loop;
  12.     RETURN;
  13. END;
  14. $function$;

If you don't understand it, don't worry. The functions returns 3 rows, each contains single integer – 1, 2 and 3. The important bit, though, is that it sleeps for 1 second after returning each row.

This means that if I'll do:

SELECT * FROM test();

I can expect to wait 3 seconds for results.

But how long will it take to return from this:

SELECT * FROM test() LIMIT 1;

Let's see:

\timing
Timing IS ON.
 
SELECT * FROM test() LIMIT 1;
 test 
------
    1
(1 ROW)
 
TIME: 3005.334 ms

The same 3 seconds. Why? Because PL/pgSQL (and most, if not all, other PL/* languages) cannot return partial results. It looks like it can – with “return next", but all these are stored in a buffer and returned together when function execution ends.

On the other hand – “normal" operations, usually, can return partial data. It can be seen with something trivial like seq scan, on non-trivial table:

CREATE TABLE t AS
    SELECT i AS id,
        repeat('depesz', 100)::text AS payload
    FROM generate_series(1,1000000) i;

With this table, we can see:

EXPLAIN analyze SELECT * FROM t;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan ON t  (cost=0.00..185834.82 ROWS=10250082 width=36) (actual TIME=0.015..232.380 ROWS=1000000 loops=1)
 Total runtime: 269.666 ms
(2 ROWS)
 
EXPLAIN analyze SELECT * FROM t LIMIT 1;
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.00..0.02 ROWS=1 width=36) (actual TIME=0.003..0.003 ROWS=1 loops=1)
   ->  Seq Scan ON t  (cost=0.00..185834.82 ROWS=10250082 width=36) (actual TIME=0.003..0.003 ROWS=1 loops=1)
 Total runtime: 0.016 ms
(3 ROWS)

(please look only at “Total runtime: .." for now)

As it can be seen – seq scan ended very fast – as soon as it satisfied Limit's appetite for exactly 1 row.

Please also note that in here, even the costs (which are not the best thing for comparing queries) show that top node (seq scan in first, and limit in second query) have very different values for returning all rows – 185834.82 vs. 0.02.

So, the first 4 numbers for any given operation (two for cost plus rows and width) are all estimates. They might be correct, but but they as well might not.

The other 4 numbers, which you get when you run “EXPLAIN ANALYZE query" or “EXPLAIN ( ANALYZE on ) query" show the reality.

Time is again a range, but this time is real. It is how much time PostgreSQL actually did spend working on given operation (on average, because it could have run the same operation multiple times). And just as with cost – time is a range. Startup time, and time to return all data. Let's check this plan:

$ EXPLAIN analyze SELECT * FROM t LIMIT 100;
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.00..9.33 ROWS=100 width=608) (actual TIME=0.008..0.152 ROWS=100 loops=1)
   ->  Seq Scan ON t  (cost=0.00..93333.86 ROWS=999986 width=608) (actual TIME=0.007..0.133 ROWS=100 loops=1)
 Total runtime: 0.181 ms
(3 ROWS)

As you can see – Limit has startup time of 0.008 (millisecond, that's the unit in here). This is because Seq Scan (which Limit called to get data) took 0.007ms to return first row, and then there was 0.001ms of processing within limit itself.

Afterwards (after returning 1st row), limit kept getting data from Seq Scan, until it got 100 rows. Then it terminated Seq scan (which happened 0.133ms after start of query), and it finished after another 0.019 ms.

Actual rows value, just as name suggests, shows how many rows (on average) this operation returned. And loops show how many times this operation was in total ran.

In what case would an operation be called more than once? For example in some cases of joins, or subqueries. It looks like this plan.

Please note that loops in 3rd operation is 2. This means that this Seq Scan was ran twice, returning, on average 1 row, and it took, on average, 0.160ms to finish. So the total time spent in this particular operation is 2 * 0.160ms = 0.32ms (that's what is in exclusive/inclusive columns on explain.depesz.com).

Very often poor performance of a query comes from the fact that it had to loop many times over something. As in here.

(of course it doesn't mean that it's Pg fault that it chose such plan – maybe there simply weren't other options, or other options were estimated as even more expensive).

In above example, please note that while actual time for operation 3 is just 0.003ms, this operation was run over 26000 times, resulting in total time spent in here of almost 79ms.

I think that wraps theoretical information required to read explains. You will probably still don't understand what the operations or other information mean, but at the very least – you will know what the numbers mean, and what's the difference between explain (shows costs in abstract units, which are based on random-sample estimates) and explain analyze (shows real life times, rowcounts and execution counts, in units that can be compared with different queries).

As always, I'm afraid that I skipped a lot of things that might be important but just escaped me, or (even worse) I assumed that these are “obvious". If you'll find anything missing, please let me know, I'll fill in ASAP.

But, before, let me just say, that I plan to extend this blogpost in 2-3 another posts that would cover more about:

  • what are various operations, how they work, and what to expect when you see them in explain output
  • what are statistics, how Pg gets them, how to view them, and how to get the best out of it

15 thoughts on “Explaining the unexplainable”

  1. Just wanted to help, thought you may want to correct:

    1. examctly -> exactly
    2. simple 100 row table -> simple 1000 row table
    3. and actually sort if -> and actually sort it

  2. Yay I understood all of this.

    I sort of expected an unit / decimal point mixup with 0.008 milliseconds and the timings after that, but the post is correct, that’s 8 microseconds. I do wish PostgreSQL would print the units (or at least default to seconds).

  3. @Tobu: once you get used to it, it just makes sense. Adding “ms” to every value would take too much space. And defaulting to second would make the times even worse. I see lots of queries that run in less than 1ms. and this would look absolutely awful: 0.000008

  4. >> So, we can even change how much it costs to read sequential page.
    Are there any reasonable recommendations whether these settings worth changing ?

  5. I wouldn’t change seq scan cost, as this is basically an unit. If you have your database on SSD, it might be good to lower random_page_cost significantly. As for the rest – test it. Play with it on your dataset and see what comes out of it.

  6. In above example, please note that while actual time for operation 3 is just 0.003ms, this operation was run over 26000 times, resulting in total time spent in here of almost 79ms.

    not 79ms but 79s

    You artical helps a lot

  7. Fully qualified names in explain output can be reached by verbose option of explain (EXPLAIN VERBOSE). I had the same problem. Thanks to advice from Pavel Stehule my problem has been resolved.

    Keep writing posts like this one 🙂 I learnt so much

  8. Hi, thanks for a nice write up.

    I’m puzzled by

    > And what if we’d tell pg that under no circumstances it can use index scan?

    How did we actually tell that? It is unclear from the query snippet.

  9. @Adam:

    oh, sorry. So you can do it by executing:

    set enable_indexscan = false;

    there is set of enable_* parameters which can be used to disable given type of scan/functionality.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.