Explaining the unexplainable – part 2

Last time I wrote about what explain output shows. Now I'd like to talk more about various types of “nodes" / operations that you might see in explain plans.

The most basic one is Seq Scan.

It looks like this:

EXPLAIN analyze SELECT * FROM pg_class;
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Seq Scan ON pg_class  (cost=0.00..10.92 ROWS=292 width=202) (actual TIME=0.009..0.049 ROWS=295 loops=1)
 Total runtime: 0.249 ms
(2 ROWS)

This is the simplest possible operation – PostgreSQL opens table file, and reads rows, one by one, returning them to user or to upper node in explain tree, for example to limit, as in:

EXPLAIN analyze SELECT * FROM pg_class LIMIT 2;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.00..0.07 ROWS=2 width=202) (actual TIME=0.014..0.014 ROWS=2 loops=1)
   ->  Seq Scan ON pg_class  (cost=0.00..10.92 ROWS=292 width=202) (actual TIME=0.009..0.009 ROWS=2 loops=1)
 Total runtime: 0.132 ms
(3 ROWS)

It is important to understand that the order of returned rows is not any specific. It's not “in order of insertion", or “last updated first" or anything like this. Concurrent selects, updates, deletes, vacuums can modify the order of rows at any time.

Seq Scan can filter rows – that is reject some from being returned. This happens for example when you'll add “WHERE" clause:

EXPLAIN analyze SELECT * FROM pg_class WHERE relname ~ 'a';
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Seq Scan ON pg_class  (cost=0.00..11.65 ROWS=227 width=202) (actual TIME=0.030..0.294 ROWS=229 loops=1)
   FILTER: (relname ~ 'a'::text)
   ROWS Removed BY FILTER: 66
 Total runtime: 0.379 ms
(4 ROWS)

As you can see now we have Filter: information. And because I'm on 9.2 or newer I got also “Rows removed by filter" line.

Next type of node is “Index Scan".

This type of scan seems to be very straight forward, and most people understand when it is used at least in one case:

EXPLAIN analyze SELECT * FROM pg_class WHERE oid = 1247;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 INDEX Scan USING pg_class_oid_index ON pg_class  (cost=0.15..8.17 ROWS=1 width=202) (actual TIME=0.007..0.007 ROWS=1 loops=1)
   INDEX Cond: (oid = 1247::oid)
 Total runtime: 0.077 ms
(3 ROWS)

That is – we have index that matches the condition, so PostgreSQL does:

  • opens the index
  • in the index if finds where (in table data) there might be rows that match given condition
  • opens table
  • fetches row(s) pointed to by index
  • if the rows can be returned – i.e. they are visible to current session – they are returned

You can of course ask: how can a row be invisible? It might happen for rows deleted that are still in the table (haven't been vacuumed). Or have been updated. Or were inserted, but after current transaction.

Index Scan is also used when you want some data ordered using order from index. As in here:

EXPLAIN analyze SELECT * FROM pg_class ORDER BY oid LIMIT 10;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.15..1.67 ROWS=10 width=206) (actual TIME=0.017..0.029 ROWS=10 loops=1)
   ->  INDEX Scan USING pg_class_oid_index ON pg_class  (cost=0.15..44.53 ROWS=292 width=206) (actual TIME=0.014..0.026 ROWS=10 loops=1)
 Total runtime: 0.145 ms
(3 ROWS)

There is no condition here, but we can add condition easily, like this:

EXPLAIN analyze SELECT * FROM pg_class WHERE oid > 1247 ORDER BY oid LIMIT 10;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.15..4.03 ROWS=10 width=206) (actual TIME=0.021..0.035 ROWS=10 loops=1)
   ->  INDEX Scan USING pg_class_oid_index ON pg_class  (cost=0.15..37.84 ROWS=97 width=206) (actual TIME=0.017..0.031 ROWS=10 loops=1)
         INDEX Cond: (oid > 1247::oid)
 Total runtime: 0.132 ms
(4 ROWS)

In these cases, Pg finds starting point in index (either first row that is > 1247, or simply smallest value in index, and then returns next rows/values until Limit will be satisfied.

There is a version of Index Scan, called “Index Scan Backward" – which does the same thing, but is used for scanning in descending order:

EXPLAIN analyze SELECT * FROM pg_class WHERE oid < 1247 ORDER BY oid DESC LIMIT 10;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.15..4.03 ROWS=10 width=206) (actual TIME=0.012..0.026 ROWS=10 loops=1)
   ->  INDEX Scan Backward USING pg_class_oid_index ON pg_class  (cost=0.15..37.84 ROWS=97 width=206) (actual TIME=0.009..0.022 ROWS=10 loops=1)
         INDEX Cond: (oid < 1247::oid)
 Total runtime: 0.119 ms
(4 ROWS)

This is the same kind of operation – open index, and for every row pointed to by index, fetch row from table, just it happens not “from small to big" but “from big to small".

Another similar operation is “Index Only Scan".

Let's create simple table:

CREATE TABLE test (id serial PRIMARY KEY, i int4);
CREATE TABLE
 
INSERT INTO test (i) SELECT random() * 1000000000 FROM generate_series(1,100000);
INSERT 0 100000
 
vacuum analyze test;
VACUUM

This gives me table like this one:

SELECT * FROM test LIMIT 10;
 id |     i     
----+-----------
  1 | 546119592
  2 | 253476978
  3 | 235791031
  4 | 654694043
  5 | 187647296
  6 | 709050245
  7 | 210316749
  8 | 348927354
  9 | 120463097
 10 |   5611946
(10 ROWS)

In here, I have index on id:

\d test
                         TABLE "public.test"
 COLUMN |  TYPE   |                     Modifiers                     
--------+---------+---------------------------------------------------
 id     | INTEGER | NOT NULL DEFAULT NEXTVAL('test_id_seq'::regclass)
 i      | INTEGER | 
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

So, if some conditions are met (more on it in a bit), I can get plan like this:

EXPLAIN analyze SELECT id FROM test ORDER BY id ASC LIMIT 10;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.29..0.55 ROWS=10 width=4) (actual TIME=0.039..0.042 ROWS=10 loops=1)
   ->  INDEX ONLY Scan USING test_pkey ON test  (cost=0.29..2604.29 ROWS=100000 width=4) (actual TIME=0.036..0.038 ROWS=10 loops=1)
         Heap Fetches: 0
 Total runtime: 0.092 ms
(4 ROWS)

Please note the “Only" word in “Index Only Scan".

This means that Pg realized that I select only data (columns) that are in the index. And it is possible that it doesn't need to check anything in the table files. So that it will return the data straight from index.

These scans were the big change in PostgreSQL 9.2, as they have the ability to work way faster than normal Index Scans, because they don't have to verify anything in table data.

The problem is that, in order to make it work, Index has to contain information that given rows are in pages, that didn't have any changes “recently". This means that in order to utilize Index Only Scans, you have to have your table well vacuumed. But with autovacuum running, it shouldn't be that big of a deal.

Final kind of table scans is so called Bitmap Index Scan. It looks like this:

EXPLAIN analyze SELECT * FROM test WHERE i < 100000;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=4.37..39.99 ROWS=10 width=8) (actual TIME=0.025..0.110 ROWS=13 loops=1)
   Recheck Cond: (i < 100000)
   ->  Bitmap INDEX Scan ON i1  (cost=0.00..4.37 ROWS=10 width=0) (actual TIME=0.013..0.013 ROWS=13 loops=1)
         INDEX Cond: (i < 100000)
 Total runtime: 0.154 ms
(5 ROWS)

(if you're reading and paying attention, you'll notice that it's using index that I didn't talk about creating earlier, it's simple: create index i1 on test (i);).

Bitmap Scans are always in (at least) two nodes. First (lower level) there is Bitmap Index Scan, and then there is Bitmap Heap Scan.

How does it work?

Let's assume your table has 100000 pages (that would be ~ 780MB). Bitmap Index Scan would create a bitmap where there would be one bit for every page in your table. So in this case, we'd get memory block of 100,000 bits ~ 12.5kB. All these bits would be set to 0. Then Bitmap Index Scan, would set some bits to 1, depending on which page in table might contain row that should be returned.

This part doesn't touch table data at all. Just index. After it will be done – that is all pages that might contain row that should be returned will be “marked", this bitmap is passed to upper node – Bitmap Heap Scan, which reads them in more sequential fashion.

What is the point of such operation? Well, Index Scans (normal) cause random IO – that is, pages from disk are loaded in random fashion. Which, at least on spinning disks, is slow.

Sequential scan is faster for getting single page, but on the other hand – you not always need all the pages.

Bitmap Index Scans joins the two cases when you need many rows from the table, but not all, and when the rows that you'll be returning are not in single block (which would be the case if I did “… where id < ..."). Bitmap scans have also one more interesting feature. That is - they can join two operations, two indexes, together. Like in here:

EXPLAIN analyze SELECT * FROM test WHERE i < 5000000 OR i > 950000000;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=107.36..630.60 ROWS=5323 width=8) (actual TIME=1.023..4.353 ROWS=5386 loops=1)
   Recheck Cond: ((i < 5000000) OR (i > 950000000))
   ->  BitmapOr  (cost=107.36..107.36 ROWS=5349 width=0) (actual TIME=0.922..0.922 ROWS=0 loops=1)
         ->  Bitmap INDEX Scan ON i1  (cost=0.00..12.25 ROWS=527 width=0) (actual TIME=0.120..0.120 ROWS=491 loops=1)
               INDEX Cond: (i < 5000000)
         ->  Bitmap INDEX Scan ON i1  (cost=0.00..92.46 ROWS=4822 width=0) (actual TIME=0.799..0.799 ROWS=4895 loops=1)
               INDEX Cond: (i > 950000000)
 Total runtime: 4.765 ms
(8 ROWS)

In here we see two Bitmap Index Scans (there can be more of them), which are then joined (not as SQL “JOIN"!) using BitmapOr.

As you remember – output of Bitmap Index Scan is a bitmap – that is memory block with some zeros and some ones. Having multiple such bitmaps means that you can easily do logical operations on it: Or, And or Not.

In here we see that two such bitmaps were joined together using Or operator, and resulting bitmap was passed to Bitmap Heap Scan which loaded appropriate rows from the table.

While in here both Index Scans use the same index, it's not always the case. For example, let's add quickly some more columns:

ALTER TABLE test ADD COLUMN j int4 DEFAULT random() * 1000000000;
ALTER TABLE
ALTER TABLE test ADD COLUMN h int4 DEFAULT random() * 1000000000;
ALTER TABLE
CREATE INDEX i2 ON test (j);
CREATE INDEX
CREATE INDEX i3 ON test (h);
CREATE INDEX

And now:

EXPLAIN analyze SELECT * FROM test WHERE j < 50000000 AND i < 50000000 AND h > 950000000;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=280.76..323.61 ROWS=12 width=16) (actual TIME=2.295..2.352 ROWS=11 loops=1)
   Recheck Cond: ((h > 950000000) AND (j < 50000000) AND (i < 50000000))
   ->  BitmapAnd  (cost=280.76..280.76 ROWS=12 width=0) (actual TIME=2.278..2.278 ROWS=0 loops=1)
         ->  Bitmap INDEX Scan ON i3  (cost=0.00..92.53 ROWS=4832 width=0) (actual TIME=0.546..0.546 ROWS=4938 loops=1)
               INDEX Cond: (h > 950000000)
         ->  Bitmap INDEX Scan ON i2  (cost=0.00..93.76 ROWS=4996 width=0) (actual TIME=0.783..0.783 ROWS=5021 loops=1)
               INDEX Cond: (j < 50000000)
         ->  Bitmap INDEX Scan ON i1  (cost=0.00..93.96 ROWS=5022 width=0) (actual TIME=0.798..0.798 ROWS=4998 loops=1)
               INDEX Cond: (i < 50000000)
 Total runtime: 2.428 ms
(10 ROWS)

Three Bitmap Index Scans, each using different index, bitmaps joined using “and" bit operation, and result fed to Bitmap Heap Scan.

In case you wonder – why is the BitmapAnd showing “Actual rows = 0" – it's simple. This node doesn't deal with rows at all (just bitmap of disk pages). So it can't return any rows.

Thats about it for now – these are your possible table scans – how you get data from disk. Next time I'll talk about joining multiple sources together, and other types of plans.

17 thoughts on “Explaining the unexplainable – part 2”

  1. I should have read these two posts 5 years ago and it would have made my life so much easier. Thank you and keep up the awesomeness.

  2. Very useful. Can’t help thinking something like these posts should be in the official docs somewhere. EXPLAIN is such a useful tool.

  3. It was a long time ago this blog became required reading for advanced PostgreSQL subjects.
    And the good stuff just keep piling up with regularity !

  4. Thanks for this helpful post. Understanding the output of EXPLAIN is hard, and you made it a bit easier.

  5. @kk: i think there will be more than one. I still have some operations to cover, and statistics will definitely need a separate blogpost.

  6. I had always wondered what the BitMap* functions were. Thank you for the informative post!

  7. Thank you so much for taking the time to explain such a complicated subject in such a simple way!

  8. Thank you so much for post. I have tested the Index only scan in my system. It is not working for me. I am using postgres 9.5 version

    postgres=# explain analyze select id from test;
    QUERY PLAN
    ————————————————————————————————————
    Seq Scan on test (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.033..26.492 rows=100000 loops=1)
    Planning time: 0.223 ms
    Execution time: 31.822 ms
    (3 rows)

    Time: 33.355 ms
    postgres=# vacuum full analyze test;
    VACUUM
    Time: 506.563 ms
    postgres=# explain analyze select id from test;
    QUERY PLAN
    ————————————————————————————————————
    Seq Scan on test (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.018..25.843 rows=100000 loops=1)
    Planning time: 0.993 ms
    Execution time: 31.164 ms
    (3 rows)

    Time: 33.195 ms
    postgres=# \d+ test
    Table “public.test”
    Column | Type | Modifiers | Storage | Stats target | Description
    ——–+————————+—————————————————+———-+————–+————-
    id | integer | not null default nextval(‘test_id_seq’::regclass) | plain | |
    name | character varying(200) | | extended | |
    Indexes:
    “test_pkey” PRIMARY KEY, btree (id)

  9. @Srinivas:

    index only scan will be used only in some cases, when exactly – it’s not entirely clear to me. But – getting 100k rows, is clearly not normal situation.

    try select id from test order by id asc limit 10;

  10. HiDEPESZ,
    Could u plz explain to me about the below Explain Analyze result.

    Sort (cost=717.34..717.59 rows=101 width=488) (actual time=7.761..7.774 rows=100 loops=1)
    Sort Key: t1.fivethous
    Sort Method: quicksort Memory: 77kB
    -> Hash Join (cost=230.47..713.98 rows=101 width=488) (actual time=0.711..7.427 rows=100 loops=1)
    Hash Cond: (t2.unique2 = t1.unique2)
    -> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000 width=244) (actual time=0.007..2.583 rows=10000 loops=1)
    -> Hash (cost=229.20..229.20 rows=101 width=244) (actual time=0.659..0.659 rows=100 loops=1)
    Buckets: 1024 Batches: 1 Memory Usage: 28kB
    -> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20 rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
    Recheck Cond: (unique1 Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049 rows=100 loops=1)
    Index Cond: (unique1 < 100)
    Planning time: 0.194 ms
    Execution time: 8.008 ms

  11. In Bitmap Index Scan, you explained that:
    “Then Bitmap Index Scan, would set some bits to 1, depending on which page in table might contain row that should be returned.”
    Could you tell me how does PostgreSQL know a page contains row or not?

  12. @Thanh:

    it opens the page, and checks if the row is there, and is visible to current transaction.

Comments are closed.