Waiting for PostgreSQL 19 – Display Memoize planner estimates in EXPLAIN

On 29th of July 2025, David Rowley committed patch:

Display Memoize planner estimates in EXPLAIN
 
There've been a few complaints that it can be overly difficult to figure
out why the planner picked a Memoize plan.  To help address that, here we
adjust the EXPLAIN output to display the following additional details:
 
1) The estimated number of cache entries that can be stored at once
2) The estimated number of unique lookup keys that we expect to see
3) The number of lookups we expect
4) The estimated hit ratio
 
Technically #4 can be calculated using #1, #2 and #3, but it's not a
particularly obvious calculation, so we opt to display it explicitly.
The original patch by Lukas Fittl only displayed the hit ratio, but
there was a fear that might lead to more questions about how that was
calculated.  The idea with displaying all 4 is to be transparent which
may allow queries to be tuned more easily.  For example, if #2 isn't
correct then maybe extended statistics or a manual n_distinct estimate can
be used to help fix poor plan choices.
 
Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAP53Pky29GWAVVk3oBgKBDqhND0BRBN6yTPeguV_qSivFL5N_g%40mail.gmail.com

This post will be slightly longer, as I want to use it to explain what Memoize nodes are.

They first appeared in PostgreSQL 14, back in 2021.

Generally the idea is that if Postgres has to repeatedly scan some table (within a single query), it can choose to cache the results, if it thinks that the search keys will be repeated.

For example, let's assume we have a users table, and a sessions table related to it.

For each of the last 100 sessions we want to show user data, so we join (one way or the other) to users. The join can use different methods, but for now let's assume it's a Nested Loop, which means 100 scans of users. If, in those last 100 sessions, there were only two distinct users, then, without memoization, Pg would still have to scan the users table 100 times, each time looking up one of the two user ids that appear in those sessions.

Memoization solves this issue.

When Pg has information (from data statistics) suggesting that the same key will be looked up over and over, it adds a Memoize step, which makes the whole process more efficient.

Let's make a quick demonstration:

=$ create table users (
    id int8 generated always as identity primary key,
    payload float8
);
CREATE TABLE
 
=$ insert into users (payload)
    select random() from generate_series(1, 1000000) i;
INSERT 0 1000000
 
=$ create table sessions (
    id int8 generated always as identity primary key,
    user_id int8 not null references users (id),
    payload float8
);
CREATE TABLE
 
=$ insert into sessions (user_id, payload)
    select floor( 1 + random() * 1000000)::int4, random() from generate_series(1, 1000);
INSERT 0 1000
 
=$ vacuum analyze users, sessions;
VACUUM

This gives me a 1 million row users table, and a 1000 row sessions table that references some random users. Now, if I'd like to select all data about sessions and the connected users, I would use a query like:

=$ select * from sessions s join users u on s.user_id = u.id;

and its explain analyze would look like:

                                                          QUERY PLAN                                                          
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Nested Loop  (cost=0.42..7507.50 rows=1000 width=40) (actual time=0.042..19.157 rows=1000 loops=1)
   Buffers: shared hit=4007
   ->  Seq Scan on sessions s  (cost=0.00..17.00 rows=1000 width=24) (actual time=0.013..2.028 rows=1000 loops=1)
         Buffers: shared hit=7
   ->  Index Scan using users_pkey on users u  (cost=0.42..7.49 rows=1 width=16) (actual time=0.011..0.011 rows=1 loops=1000)
         Index Cond: (id = s.user_id)
         Buffers: shared hit=4000
 Planning:
   Buffers: shared hit=23
 Planning Time: 0.394 ms
 Execution Time: 21.099 ms
(11 rows)

Plan is also available on explain.depesz.com

Now, if I instead still had 1000 sessions, but all of them belonged to just 2 users:

=$ truncate sessions ;
TRUNCATE TABLE
 
=$ insert into sessions (user_id, payload)
    select case when random() < .5 then 50 else 100 end, random() from generate_series(1, 1000);
INSERT 0 1000
 
=$ vacuum analyze sessions ;
VACUUM
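As a side note – the planner's choice here will be driven by the n_distinct statistic that ANALYZE just gathered for sessions.user_id. If you want to peek at it yourself, pg_stats is the place to look (this is not part of the patch, just a quick sanity check; with the data above it should report a value of 2, or something very close to it):

=$ select tablename, attname, n_distinct
    from pg_stats
    where tablename = 'sessions' and attname = 'user_id';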

explain for the same query will show that suddenly we get a Memoize node:

=$ explain (analyze, buffers) select * from sessions s join users u on s.user_id = u.id;
                                                           QUERY PLAN                                                            
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Nested Loop  (cost=0.43..64.45 rows=1000 width=40) (actual time=0.027..4.715 rows=1000 loops=1)
   Buffers: shared hit=15
   ->  Seq Scan on sessions s  (cost=0.00..17.00 rows=1000 width=24) (actual time=0.007..0.916 rows=1000 loops=1)
         Buffers: shared hit=7
   ->  Memoize  (cost=0.43..7.50 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=1000)
         Cache Key: s.user_id
         Cache Mode: logical
         Hits: 998  Misses: 2  Evictions: 0  Overflows: 0  Memory Usage: 1kB
         Buffers: shared hit=8
         ->  Index Scan using users_pkey on users u  (cost=0.42..7.49 rows=1 width=16) (actual time=0.005..0.005 rows=1 loops=2)
               Index Cond: (id = s.user_id)
               Buffers: shared hit=8
 Planning:
   Buffers: shared hit=22
 Planning Time: 0.256 ms
 Execution Time: 5.628 ms
(16 rows)

plan on explain.depesz.com.

The Memoize information shows that it used 1kB of memory, cached based on user_id (which is pretty obvious), had two misses (the first lookup for each of the two user ids), and 998 hits (all the other lookups for those two user ids).
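By the way, if you ever want to compare this against the non-memoized plan, you can disable the node for your session – enable_memoize is a regular GUC that has been there since the node was added in PostgreSQL 14 (just remember to reset it afterwards):

=$ set enable_memoize = off;
=$ explain (analyze, buffers) select * from sessions s join users u on s.user_id = u.id;
=$ reset enable_memoize;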

All is great. But in some cases it might not be obvious why Pg even decided to use Memoization.

Here comes the change that I mentioned at the beginning of this post.

Thanks to the authors' work, the same explain, in the same situation (two user ids), now shows this:

                                                             QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Nested Loop  (cost=0.43..64.45 rows=1000 width=40) (actual time=0.017..0.277 rows=1000.00 loops=1)
   Buffers: shared hit=15
   ->  Seq Scan on sessions s  (cost=0.00..17.00 rows=1000 width=24) (actual time=0.006..0.036 rows=1000.00 loops=1)
         Buffers: shared hit=7
   ->  Memoize  (cost=0.43..7.50 rows=1 width=16) (actual time=0.000..0.000 rows=1.00 loops=1000)
         Cache Key: s.user_id
         Cache Mode: logical
         Estimates: capacity=2 distinct keys=2 lookups=1000 hit percent=99.80%
         Hits: 998  Misses: 2  Evictions: 0  Overflows: 0  Memory Usage: 1kB
         Buffers: shared hit=8
         ->  Index Scan using users_pkey on users u  (cost=0.42..7.49 rows=1 width=16) (actual time=0.003..0.003 rows=1.00 loops=2)
               Index Cond: (id = s.user_id)
               Index Searches: 2
               Buffers: shared hit=8
 Planning:
   Buffers: shared hit=39
 Planning Time: 0.219 ms
 Execution Time: 0.318 ms
(18 rows)

plan uploaded to explain.depesz.com.

The Estimates: line shows what PostgreSQL expected the usage of the index scan on users to be – specifically, that the cache can hold both entries at once (capacity=2), that there will be only two distinct keys and 1000 separate lookups, and that, given caching, it should get a 99.8% hit rate.
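As far as I can tell, in the simple case where everything fits in the cache (capacity >= distinct keys), the estimated hit ratio boils down to one expected miss per distinct key and hits for all remaining lookups, so it can be verified with trivial arithmetic:

=$ select round(100.0 * (1000 - 2) / 1000, 2) as expected_hit_percent;

which gives 99.80 – exactly the hit percent shown in the Estimates: line.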

Obviously, in my example the numbers are very simple and easy to check, but in real-life scenarios it might be much more difficult to calculate and verify them.
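And if, in such a real-life scenario, the distinct keys estimate turns out to be badly off, the commit message already hints at the fix: extended statistics, or a manual n_distinct override on the cache key column. Something along these lines (the value 2 is obviously just an example – use whatever matches your data):

=$ alter table sessions alter column user_id set (n_distinct = 2);
ALTER TABLE

=$ analyze sessions;
ANALYZE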

This is great. Thanks a lot to everyone involved.
