# Explaining the unexplainable – part 3

In previous post in the series I wrote about how to interpret single line in explain analyze output, it's structure, and later on described all basic data-getting operations (nodes in explain tree).

Today, we'll move towards more complicated operations.

# Function Scan

Example:

```\$ explain analyze select * from generate_Series(1,10) i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.012..0.013 rows=10 loops=1)
Total runtime: 0.034 ms
(2 rows)```

Generally it's so simple, that I shouldn't need describing it, but since I will use it in next examples, I decided to write a thing about it.

Function Scan, is very simple node – it runs a function that returns recordset – that is, it will not run function like “lower()", but a function that returns (at least potentially) multiple rows, or multiple columns. After the function will return rows, these are returned to whatever is above Function Scan in plan tree, or to client, if Function Scan is the top node.

The only additional logic it might have, is ability to filter returned rows, like in here:

```\$ explain analyze select * from generate_Series(1,10) i where i < 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i  (cost=0.00..12.50 rows=333 width=4) (actual time=0.012..0.014 rows=2 loops=1)
Filter: (i < 3)
Rows Removed by Filter: 8
Total runtime: 0.030 ms
(4 rows)```

# Sort

This seems to be easy to understand – sort gets given records and returns them sorted in some way.

Example:

```\$ explain analyze select * from pg_class order by relname;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Sort  (cost=22.88..23.61 rows=292 width=203) (actual time=0.230..0.253 rows=295 loops=1)
Sort Key: relname
Sort Method: quicksort  Memory: 103kB
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.048 rows=295 loops=1)
Total runtime: 0.326 ms
(5 rows)```

While it is simple, it has some cool logic inside. For starters – if memory used for sorting would be more than work_mem, it will switch to using disk based sorting:

```\$ explain analyze select random() as x from generate_series(1,14000) i order by x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=62.33..64.83 rows=1000 width=0) (actual time=16.713..18.090 rows=14000 loops=1)
Sort Key: (random())
Sort Method: quicksort  Memory: 998kB
->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=2.036..4.533 rows=14000 loops=1)
Total runtime: 18.942 ms
(5 rows)

\$ explain analyze select random() as x from generate_series(1,15000) i order by x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=62.33..64.83 rows=1000 width=0) (actual time=27.052..28.780 rows=15000 loops=1)
Sort Key: (random())
Sort Method: external merge  Disk: 264kB
->  Function Scan on generate_series i  (cost=0.00..12.50 rows=1000 width=0) (actual time=2.171..4.894 rows=15000 loops=1)
Total runtime: 29.767 ms
(5 rows)```

Please note the Sort Method: change above.

To handle such cases, Pg will use temporary files stored in \$PGDATA/base/pgsql_tmp/ directory. They will of course be removed as soon as they are not needed.

One additional feature is that Sort can change it's method of working if it's called by Limit operation, like here:

```\$ explain analyze select * from pg_class order by relfilenode limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Limit  (cost=15.77..15.78 rows=5 width=203) (actual time=0.119..0.120 rows=5 loops=1)
->  Sort  (cost=15.77..16.50 rows=292 width=203) (actual time=0.118..0.118 rows=5 loops=1)
Sort Key: relfilenode
Sort Method: top-N heapsort  Memory: 26kB
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.005..0.047 rows=295 loops=1)
Total runtime: 0.161 ms
(6 rows)```

Normally, to sort given dataset, you need to process it in whole. But Pg knows, that if you need only some small number of rows, it doesn't have to sort whole dataset, and it's good enough to get just the first values.

In Big O notation, general sort has complexity of O(m * log(m)), but Top-N has complexity of O(m * log(n)) – where m is number of rows in table, and n is number of returned rows. What's most important – this kind of sort also uses much less memory (after all, it doesn't have to construct whole dataset of sorted rows, just couple of rows), so it's less likely to use slow disk for temporary files.

# Limit

I used limit many times, because it's so simple, but let's describe it fully. Limit operation runs it's sub-operation, and returns just first N rows from what it returned. Usually it also stops sub-operation afterwards, but in some cases (pl/PgSQL functions for example), the sub-operation is already finished when it returned first row.

Simple example:

```\$ explain analyze select * from pg_class;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.047 rows=295 loops=1)
Total runtime: 0.096 ms
(2 rows)

\$ explain analyze select * from pg_class limit 2;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..0.07 rows=2 width=203) (actual time=0.009..0.010 rows=2 loops=1)
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.009 rows=2 loops=1)
Total runtime: 0.045 ms
(3 rows)```

As you can see using limit in the 2nd case caused underlying Seq Scan to finish it's work immediately after finding two rows.

# HashAggregate

This operation is used basically whenever you are using GROUP BY and some aggregates, like sum(), avg(), min(), max() or others.

Example:

```\$ explain analyze select relkind, count(*) from pg_Class group by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=12.38..12.42 rows=4 width=1) (actual time=0.223..0.224 rows=5 loops=1)
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=1) (actual time=0.008..0.053 rows=295 loops=1)
Total runtime: 0.273 ms
(3 rows)```

HashAggregate does something like this: for every row it gets, it finds GROUP BY “key" (in this case relkind). Then, in hash (associative array, dictionary), puts given row into bucket designated by given key.

After all rows have been processed, it scans the hash, and returns single row per each key value, when necessary – doing appropriate calculations (sum, min, avg, and so on).

It is important to understand that HashAggregate has to scan all rows before it can return even single row.

Now, if you understand it, you should see potential problem: well, what about case when there are millions of rows? The hash will be too big to fit in memory. And here, again, we'll be using work_mem. If generated hash is too big, it will “spill" to disk (again in the \$PGDATA/base/pgsql_tmp).

This means that if we have plan that has both HashAggregate and Sort – we can use up to 2 * work_mem. And such plan is simple to get:

```\$ explain analyze select relkind, count(*) from pg_Class group by relkind order by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Sort  (cost=12.46..12.47 rows=4 width=1) (actual time=0.260..0.261 rows=5 loops=1)
Sort Key: relkind
Sort Method: quicksort  Memory: 25kB
->  HashAggregate  (cost=12.38..12.42 rows=4 width=1) (actual time=0.221..0.222 rows=5 loops=1)
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=1) (actual time=0.006..0.044 rows=295 loops=1)
Total runtime: 0.312 ms
(6 rows)```

In reality – single query can use many times work_mem, as work_mem is a limit per operation. So, if your query uses 1000 of HashAggregates and Sorts (and other work_mem using operations) total memory usage can get pretty high.

# Hash Join / Hash

Since we were discussing HashAggregate, it seemed natural to move to Hash Join.

This operation, unlike all other that we previously discussed, has two sub operations. One of them is always “Hash", and the other is something else.

Hash Join is used, as name suggests to join two recordsets. For example like here:

```\$ explain analyze select * from pg_class c join pg_namespace n on c.relnamespace = n.oid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Join  (cost=1.14..16.07 rows=292 width=316) (actual time=0.036..0.343 rows=295 loops=1)
Hash Cond: (c.relnamespace = n.oid)
->  Seq Scan on pg_class c  (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.044 rows=295 loops=1)
->  Hash  (cost=1.06..1.06 rows=6 width=117) (actual time=0.012..0.012 rows=6 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 1kB
->  Seq Scan on pg_namespace n  (cost=0.00..1.06 rows=6 width=117) (actual time=0.004..0.005 rows=6 loops=1)
Total runtime: 0.462 ms
(7 rows)```

It works like this – first Hash Join calls “Hash", which in turns calls something else (Seq Scan on pg_namespace in our case). Then, Hash makes a memory (or disk, depending on size) hash/associative-array/dictionary with rows from the source, hashed using whatever is used to join the data (in our case, it's OID column in pg_namespace).

Of course – you can have many rows for given join key (well, not in this case, as I'm joining using primary key, but generally, it's perfectly possible to have multiple rows for single hash key.

So, using Perl notation, output of Hash is something like:

```{
'123' => [ { data for row with OID = 123 }, ],
'256' => [ { data for row with OID = 256 }, ],
...
}```

Then, Hash Join runs the second suboperation (Seq Scan on pg_class in our case), and for each row from it, it does:

1. check if join key (pg_class.relnamespace in our case) is in hash returned by Hash operation
2. if it is not – given row from suboperation is ignored (will not be returned)
3. if it exists – Hash Join fetches rows from hash, and based on row from one side, and all rows from hash, it generates output rows

It is important to note that both sides are run only once ( in our case, these both are seq scans), but first (the one called by Hash) has to return all rows, which have to be stored in hash, and the other is processed one row at a time, and some rows will get skipped if they don't exist in hash from the other side (hope the sentence is clear, there are many “hash"es there).

Of course, since both subscans can be any type of operation, these can do filter or index scan or whatever you can imagine.

Final note for Hash Join/Hash is that the Hash operation, just like Sort and HashAggregate – will use up to work_mem of memory.

# Nested Loop

Since we're at joins – we have to discuss Nested Loop. Example:

```\$ explain analyze select a.* from pg_class c join pg_attribute a on c.oid = a.attrelid where c.relname in ( 'pg_class', 'pg_namespace' );
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.28..52.32 rows=16 width=203) (actual time=0.057..0.134 rows=46 loops=1)
->  Seq Scan on pg_class c  (cost=0.00..11.65 rows=2 width=4) (actual time=0.043..0.080 rows=2 loops=1)
Filter: (relname = ANY ('{pg_class,pg_namespace}'::name[]))
Rows Removed by Filter: 291
->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..20.25 rows=8 width=203) (actual time=0.007..0.015 rows=23 loops=2)
Index Cond: (attrelid = c.oid)
Total runtime: 0.182 ms```

This is very interesting plan as it can run given operations multiple times.

Just as Hash Join, Nested Loop has two “children". First it runs “Seq Scan" (in our example, generally, first it runs the first node that is there, and then, for every row it returns (2 rows in our example), it runs 2nd operation (Index Scan on pg_attribute in our case).

You might notices that Index Scan has “loops=2" in it's actual run metainfo. This means that this operation has been run twice, and the other values (rows, time) are averages across all runs.

Let's check this plain from explain.depesz.com. Note that the actual times for the categories index scan are 0.002 to 0.003 ms. But total time on this node is 78.852ms, because this index scan has been ran over 26k times.

So, the processing looks like this:

1. Nested Loop runs one side of join, once. Let's name it “A".
2. For every row in “A", it runs second operation (let's name it “B")
3. if “B" didn't return any rows – data from “A" is ignored
4. if “B" did return rows, for every row it returned, new row is returned by Nested Loop, based on current row from A, and current row from B

# Merge Join

Another method of joining data is called Merge Join. This is used, if joined datasets are (or can be cheaply) sorted using join key.

I don't have nice example of this, so I will force it by using subselects that sort data before joining:

```\$ explain analyze select * from
( select oid, * from pg_class order by oid) as c
join
( select * from pg_attribute a order by attrelid) as a
on c.oid = a.attrelid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join  (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
Merge Cond: (pg_class.oid = a.attrelid)
->  Sort  (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
Sort Key: pg_class.oid
Sort Method: quicksort  Memory: 102kB
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
->  Materialize  (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
Total runtime: 4.009 ms
(9 rows)```

Merge Join, as other joins, runs two sub operations (Sort and Materialize in this case). Because both of these return data sorted and the sort order is the same as join operation, Pg can scan both returnsets from suboperations at the same time, and simply check whether ids match.

The procedure looks like this:

1. if join column on right side is the same as join column on left side:
• return new joined row, based on current rows on the right and left sides
• get next row from right side (or, if there are no more rows, on left side)
• go to step 1
2. if join column on right side is “smaller" than join column on left side:
• get next row from right side (if there are no more rows, finish processing)
• go to step 1
3. if join column on right side is “larger" than join column on left side:
• get next row from left side (if there are no more rows, finish processing)
• go to step 1

This is very cool way of joining datasets, but it works only for sorted sources. Based on current db of explain.depesz.com, there are:

• 44,721 plans which contain “Nested Loop" operation
• 34,305 plans with “Hash Join"
• only 8,889 that uses “Merge Join"

# Hash Join / Nested Loop / Merge Join modifiers

In all examples above I showed that Join operation returns row only when it gets rows from both sides of join.

But this is not always the case. We can have LEFT/RIGHT/FULL outer joins. And there are so called anti-joins.

In case of left/right joins, the operation names get changed to:

• Hash Left Join
• Hash Right Join
• Merge Left Join
• Merge Right Join
• Nested Loop Left Join

There is no Nested Loop Right Join, because Nested Loop always starts with left side as basis to looping. So join that uses RIGHT JOIN, that would use Nested Loop, will get internally transformed to LEFT JOIN so that Nested Loop can work.

In all those cases the logic is simple – we have two sides of join – left and right. And when side is mentioned in join, then join will return new row even if the other side doesn't have matching rows.

This all happens with queries like:

`select * from a left join b on ...`

(or right join).

All other information for Hash Join/Merge Join or Nested Loop are the same, it's just a slight change in logic on when to generate output row.

There is also a version called Full Join, with operation names:

• Hash Full Join
• Merge Full Join

In which case join generates new output row regardless of whether data on either side is missing (as long as the data is there for one side). This happens in case of:

`select * from a full join b ...`

Of course all processing is the same as previously.

There are also so called Anti Joins. Their operation names look like:

• Hash Anti Join
• Merge Anti Join
• Nested Loop Anti Join

In these cases Join emits row only if the right side doesn't find any row. This is useful when you're doing things like “WHERE not exists ()" or “left join … where right_table.column is null".

Like in here:

```\$ explain analyze select * from pg_class c where not exists (select * from pg_attribute a where a.attrelid = c.oid and a.attnum = 10);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join  (cost=62.27..78.66 rows=250 width=203) (actual time=0.145..0.448 rows=251 loops=1)
Hash Cond: (c.oid = a.attrelid)
->  Seq Scan on pg_class c  (cost=0.00..10.92 rows=292 width=207) (actual time=0.009..0.195 rows=293 loops=1)
->  Hash  (cost=61.75..61.75 rows=42 width=4) (actual time=0.123..0.123 rows=42 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 2kB
->  Index Only Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..61.75 rows=42 width=4) (actual time=0.021..0.109 rows=42 loops=1)
Index Cond: (attnum = 10)
Heap Fetches: 0
Total runtime: 0.521 ms
(9 rows)```

In here, Pg ran the right side (Index Scan on pg_attribute), hashed it, and then ran left side (Seq Scan on pg_class), returning only rows where there was no item in Hash for given pg_class.oid.

# Materialize

This operation showed earlier in example for Merge Join, but it is also usable in another cases.

psql has many internal commands. One of them is \dTS – which lists all system datatypes. Internally \dTS runs this query:

```SELECT n.nspname as "Schema",
pg_catalog.format_type(t.oid, NULL) AS "Name",
pg_catalog.obj_description(t.oid, 'pg_type') as "Description"
FROM pg_catalog.pg_type t
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
WHERE (t.typrelid = 0 OR (SELECT c.relkind = 'c' FROM pg_catalog.pg_class c WHERE c.oid = t.typrelid))
AND NOT EXISTS(SELECT 1 FROM pg_catalog.pg_type el WHERE el.oid = t.typelem AND el.typarray = t.oid)
AND pg_catalog.pg_type_is_visible(t.oid)
ORDER BY 1, 2;```

It's plan is:

```                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=2783.00..2783.16 rows=65 width=68) (actual time=3.883..3.888 rows=87 loops=1)
Sort Key: n.nspname, (format_type(t.oid, NULL::integer))
Sort Method: quicksort  Memory: 39kB
->  Nested Loop Left Join  (cost=16.32..2781.04 rows=65 width=68) (actual time=0.601..3.657 rows=87 loops=1)
Join Filter: (n.oid = t.typnamespace)
Rows Removed by Join Filter: 435
->  Hash Anti Join  (cost=16.32..2757.70 rows=65 width=8) (actual time=0.264..0.981 rows=87 loops=1)
Hash Cond: ((t.typelem = el.oid) AND (t.oid = el.typarray))
->  Seq Scan on pg_type t  (cost=0.00..2740.26 rows=81 width=12) (actual time=0.012..0.662 rows=157 loops=1)
Filter: (pg_type_is_visible(oid) AND ((typrelid = 0::oid) OR (SubPlan 1)))
Rows Removed by Filter: 185
SubPlan 1
->  Index Scan using pg_class_oid_index on pg_class c  (cost=0.15..8.17 rows=1 width=1) (actual time=0.002..0.002 rows=1 loops=98)
Index Cond: (oid = t.typrelid)
->  Hash  (cost=11.33..11.33 rows=333 width=8) (actual time=0.241..0.241 rows=342 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 14kB
->  Seq Scan on pg_type el  (cost=0.00..11.33 rows=333 width=8) (actual time=0.002..0.130 rows=342 loops=1)
->  Materialize  (cost=0.00..1.09 rows=6 width=68) (actual time=0.000..0.001 rows=6 loops=87)
->  Seq Scan on pg_namespace n  (cost=0.00..1.06 rows=6 width=68) (actual time=0.002..0.003 rows=6 loops=1)
Total runtime: 3.959 ms```

For easier viewing, I also uploaded this plan to explain.depesz.com.

Please note that operation #9 there is Materialize. Why is that so?

Materialize is called by Nested Loop Left Join – operation #2. We know that Nested Loop causes given operation to be ran multiple times, in this case – 87 times.

Right side of the join is Seq Scan on pg_namespace. So Pg, theoretically, should run Sequential Scan on pg_namespace 87 times. Given that single Seq Scan of this table takes 0.003ms, we could expect total time of ~ 0.25ms.

But Pg is smarter than that. It realized that it will be cheaper to scan the table just once, and build memory representation of all the rows in there. So, that next time, it will not have to scan the table, check visibility information, parse data pages. It will just get the data from memory.

Thanks to this total time of: reading the table once, preparing memory representation of the data and scanning this representation 87 times was 0.087ms.

You might then ask, OK, but why did the merge join earlier on use materialize – it was just doing one scan? Let's remind the plan:

```\$ explain analyze select * from
( select oid, * from pg_class order by oid) as c
join
( select * from pg_attribute a order by attrelid) as a
on c.oid = a.attrelid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join  (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
Merge Cond: (pg_class.oid = a.attrelid)
->  Sort  (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
Sort Key: pg_class.oid
Sort Method: quicksort  Memory: 102kB
->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
->  Materialize  (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
Total runtime: 4.009 ms
(9 rows)```

Yes. It was run just once. The problem is, though, that source of data for Merge Join has to match several criteria. Some are obvious (data has to be sorted) and some are not so obvious as are more technical (data has to be scrollable back and forth).

Because of this (these not so obvious criteria) sometimes Pg will have to Materialize the data coming from source (Index Scan in our case) so that it will have all the necessary features when using it.

Long story short – Materialize gets data from underlying operation and stores it in memory (or partially in memory) so that it can be used faster, or with additional features that underlying operation doesn't provide.

And that's it for today. I thought that I will be done, but there are still many operations that need to be described. So, we will have at least two more posts in the series (rest of the operations, and statistics info).

## 5 thoughts on “Explaining the unexplainable – part 3”

1. David Johnston says:

With respect to these numbers:

44,721 plans which contain “Nested Loop” operation
34,305 plans with “Hash Join”
only 8,889 that uses “Merge Join”

There is major selection bias going on here since most plans that are submitted to depesz are likely poorly performing. Since “Merge Join” is a fairly fast operation for joining it is likely that few of these would end up being explained.

2. Victor says:

Very informative series, thanks a lot!

As you speak of Anti joins, perhaps you should also mention Semi joins, the ones produced by “WHERE EXISTS()” or “… IN()” constructs?

Like the one produced by:
EXPLAIN SELECT * FROM pg_class WHERE relnamespace IN (
SELECT oid FROM pg_namespace WHERE nspname=’public’);

3. depesz says:

@Victor:

yes, I forgot about these. Interestingly – the query that you showed generates (in my pg 9.3) Hash join – https://explain.depesz.com/s/IqG .

But, to the point.

Semi Join is basically a reverse of Anti Join. When doing semi join of table “a” and table “b” using some comparison, emitted are only rows from “a” that there is row in b that matches the condition.

So, Semi Join, is similar to:

select a.* from a join b on (a.x=b.x);

but, in case b.x had duplicates, join above would duplicate rows from “a”. But semi join would not, as it only checks if a row is there on “b-side”, and if yes, it emits row from a.

There can be all three variants of semi joins (Nested Loop Semi Join, Hash Semi Join and Merge Semi Join).

4. Victor says:

Interesting, 9.2 gives: https://explain.depesz.com/s/mLx

Is this smth new in 9.3 maybe?

5. depesz says:

@Victor: it could be related to statistics. This is what the next (and final) part of the series will be about, but I have yet to write it.