Explaining the unexplainable – part 6: buffers

Back in 2013 I wrote a series of 5 posts about how to read explain analyze output.

Figured that there is one big part missing – buffers info.

You don't see this part in normal explain analyze, you have to specifically enable it: explain (analyze on, buffers on).

Well, technically you don't need the analyze part, but then you'll be shown a plan like:

                                      QUERY PLAN
 Finalize Aggregate  (cost=40496.38..40496.39 ROWS=1 width=8)
   ->  Gather  (cost=40496.17..40496.38 ROWS=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=39496.17..39496.18 ROWS=1 width=8)
               ->  Parallel Seq Scan ON z  (cost=0.00..35225.33 ROWS=1708333 width=0)
   Buffers: shared hit=3
(7 ROWS)

Which shows buffers info only for planning. This is interesting information on its own: it means that to plan this query, the planner had to get 3 pages from storage, and luckily all 3 were already in cache (shared_buffers). Since (by default) each page is 8kB, we know that PostgreSQL would read 24kB of data just to plan this query.

But if I run the same query with analyze, more info will be shown:

$ EXPLAIN (analyze ON, buffers ON) SELECT COUNT(*) FROM z;
                                                    QUERY PLAN                                                     
 Aggregate  (cost=69392.00..69392.01 ROWS=1 width=8) (actual TIME=368.522..368.523 ROWS=1 loops=1)
   Buffers: shared hit=480 READ=17662
   I/O Timings: READ=24.848
   ->  Seq Scan ON z  (cost=0.00..59142.00 ROWS=4100000 width=0) (actual TIME=0.095..213.068 ROWS=4100000 loops=1)
         Buffers: shared hit=480 READ=17662
         I/O Timings: READ=24.848
   Buffers: shared hit=19
 Planning TIME: 0.448 ms
 Execution TIME: 368.742 ms
(10 ROWS)

Please note that we now see buffers info both for the Seq Scan, and for the top-level node: Aggregate.

There is also I/O Timings: information, which shows up thanks to the track_io_timing config parameter being set to on.

So, let's get down to details. First – is shared hit/read all that can be there, and what exactly is the meaning of the values?

Luckily, I have a BIG database of plans, so I can easily find lots of information. For example, in this plan we can see:

Buffers: shared hit=13,538,373 read=85,237 dirtied=204 written=6,263, local hit=10,047,224 read=773,475 dirtied=271,617 written=271,617, temp read=620,674 written=604,150

This is the maximum number of different fields that can appear in a buffers line:

  • shared:
    • hit
    • read
    • dirtied
    • written
  • local:
    • hit
    • read
    • dirtied
    • written
  • temp:
    • read
    • written

All values are in blocks, where a block is usually 8192 bytes (you can check your block size with: show block_size;).
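
Since all values are in blocks, converting them to bytes is simple multiplication. Here is a minimal sketch (the helper name is mine; it assumes the default 8 kB block size and truncates instead of rounding, so it can be off by one unit compared to pg_size_pretty):

```python
# Turn a block count from a Buffers: line into a human-readable size,
# roughly like pg_size_pretty(). Assumes the default 8192-byte block
# size -- verify yours with: show block_size;
BLOCK_SIZE = 8192

def blocks_to_pretty(blocks: int) -> str:
    size = blocks * BLOCK_SIZE
    # pg_size_pretty switches units once the value reaches 10240 of
    # the current unit; we mimic that, but with integer truncation.
    for unit in ("bytes", "kB", "MB", "GB"):
        if size < 10240:
            return f"{size} {unit}"
        size //= 1024
    return f"{size} TB"

print(blocks_to_pretty(17662))  # pages read by the earlier seq scan
print(blocks_to_pretty(3))     # the 3 pages read during planning
```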

So, for example, the function scan read:

  • 85,237 pages in shared
  • 773,475 pages in local
  • 620,674 pages in temp

This is equal to (assuming 8kB page size):

$ SELECT pg_size_pretty( ( 85237 + 773475 + 620674 )::int8 * 8192 );
 11 GB
(1 ROW)

So, in total it read ~ 11GB of data off disk.

But what is the exact meaning of hit, read, dirtied, written, shared, local, and temp? And why does temp have only read and written?

PostgreSQL has some part of memory set aside as cache. Its size is set with the shared_buffers GUC.

Whenever Pg has to read some data from disk, it first checks whether it's in shared_buffers. If it is there – the page can be returned from cache. If it's not – it is read from disk (and stored in shared_buffers for reuse). This leads to the first split in our parameters: the hit number tells how many pages were already in cache, and read tells us how many pages were not in cache and had to be read from disk.
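
Hit and read together also give you the cache hit ratio for a node. A quick sketch (my own helper, not a PostgreSQL function):

```python
def hit_ratio(hit: int, read: int) -> float:
    """Fraction of page requests that were satisfied from shared_buffers."""
    total = hit + read
    return hit / total if total else 1.0

# Numbers from the seq scan earlier: shared hit=480 read=17662,
# i.e. almost everything had to come from disk.
print(f"{hit_ratio(480, 17662):.1%}")
```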

Next thing we can see is written. This is rather simple – this many pages have been written to disk. It is possible that even a select will generate writes. For example, let's assume we have shared_buffers that is only 10 buffers in size, and two of these buffers cache pages that were modified in memory but not yet on disk – for example because the writing transaction hasn't committed yet. These pages are dirty, and if we want to reuse the memory they occupy, we first have to write them out.

This is what written generally is – how much data was written to disk, because we needed to free some space in shared_buffers for other data.

And finally – dirtied. Sometimes it just so happens that there are new pages in a table/index, for example due to insert or update operations, and if we select at just the right moment, PostgreSQL will do something called “updating hint bits". Finding out what hint bits actually are proved to be rather tricky. As far as I was able to figure out, this is just information on whether the transactions that inserted or deleted a given row are already fully in the past. This information is generally updated on vacuum, but sometimes it can happen at select time. If that happens (that is: the select updates hint bits), the page will be marked as dirty – to be written to disk whenever possible.

So, we know what hit/read/written/dirtied are. What are shared/local/temp? That's WAY simpler. Pages that belong to normal objects shared between many database connections (tables, indexes, materialized views) are counted as shared. Pages that belong to temporary objects – that is, objects that belong only to the current session – are counted as local. And temp is for temporary data access needed for sorting, hashing, or materializing. Since temp files don't go through the buffer cache at all, there is nothing to hit (or dirty) – hence only read and written.
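
The Buffers: line has a regular shape (group name followed by field=value pairs), so it can be parsed mechanically. A sketch of how one could do it (this is my own code, not Pg::Explain's actual parser):

```python
import re

def parse_buffers(line: str) -> dict:
    """Parse an EXPLAIN 'Buffers:' line into {group: {field: blocks}}."""
    result = {}
    line = line.split("Buffers:", 1)[1]
    # Each group starts with shared/local/temp, followed by field=value
    # pairs; values may contain thousands separators.
    groups = re.findall(r"(shared|local|temp)((?:\s+\w+=[\d,]+)+)", line, re.I)
    for group, fields in groups:
        result[group.lower()] = {
            f.lower(): int(v.replace(",", ""))
            for f, v in re.findall(r"(\w+)=([\d,]+)", fields)
        }
    return result

print(parse_buffers("Buffers: shared hit=480 read=17662"))
# {'shared': {'hit': 480, 'read': 17662}}
```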

With this information we can now read data from the previously mentioned explain node with a bit more understanding. A refresher of what we see in the explain:

->  FUNCTION Scan ON ...
      Buffers: shared hit=13538373 READ=85237 dirtied=204 written=6263, LOCAL hit=10047224 READ=773475 dirtied=271617 written=271617, temp READ=620674 written=604150

This means that while processing this function, PostgreSQL did:

  • Wanted to read 13,623,610 pages from normal db objects, found 13,538,373 of them in cache, and had to read 85,237 from disk.
  • Dirtied 204 pages in normal, shared, db objects (it could have inserted new rows, updated existing ones, deleted some, or even just set hint bits – we don't know, since it's a function).
  • Wrote, to make space in memory, 6,263 buffers that belonged to normal, shared, db objects.
  • Wanted to read 10,820,699 pages from temporary db objects, found 10,047,224 of them in cache, and had to read 773,475 from disk.
  • Dirtied 271,617 pages in temporary db objects.
  • Wrote, to make space in memory, 271,617 buffers that belonged to temporary db objects.
  • Read 620,674 blocks from temporary files, for operations that use temp files when work_mem is not enough.
  • Wrote 604,150 blocks to temporary files.

In total, this single function scan generated:

  • 85,237 + 773,475 + 620,674 = 1,479,386 blocks = 11.28GB reads
  • 6,263 + 271,617 + 604,150 = 882,030 blocks = 6.73GB writes
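
These totals are easy to double-check – it's just the per-group read/written counters summed up and multiplied by the block size:

```python
BLOCK = 8192  # default block size

# read/written counters from the function scan's Buffers: line above
reads = 85_237 + 773_475 + 620_674    # shared + local + temp read
writes = 6_263 + 271_617 + 604_150    # shared + local + temp written

print(reads, f"blocks = {reads * BLOCK / 2**30:.2f} GB read")
print(writes, f"blocks = {writes * BLOCK / 2**30:.2f} GB written")
```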

Next bit of information. Let's consider this plan:

$ EXPLAIN (analyze ON, buffers ON) SELECT COUNT(DISTINCT id) FROM x;
                                                    QUERY PLAN                                                     
 Aggregate  (cost=73413.00..73413.01 ROWS=1 width=8) (actual TIME=829.034..829.034 ROWS=1 loops=1)
   Buffers: shared hit=2144 READ=20019, temp READ=8586 written=8609
   I/O Timings: READ=27.700
   ->  Seq Scan ON x  (cost=0.00..63163.00 ROWS=4100000 width=4) (actual TIME=0.075..217.427 ROWS=4100000 loops=1)
         Buffers: shared hit=2144 READ=20019
         I/O Timings: READ=27.700
   Buffers: shared hit=6 dirtied=1
 Planning TIME: 0.172 ms
 Execution TIME: 829.078 ms
(10 ROWS)

We can see that the Seq Scan node had only shared hit and read, but the Aggregate node had more. The thing is that buffers info is cumulative: any node reports the total of what it used itself, plus all that its subnodes used.

In our case that means that the Aggregate node itself used only what is reported in the temp part, because the shared hit/read numbers are exactly the same as for its subnode: Seq Scan.
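
Since the counters are cumulative, getting a node's own usage means subtracting its children's numbers. A sketch (the function and the dict shape are mine, purely for illustration):

```python
def own_usage(node: dict, children: list) -> dict:
    """Subtract children's cumulative buffer counters from a node's totals."""
    own = dict(node)
    for child in children:
        for field, blocks in child.items():
            own[field] = own.get(field, 0) - blocks
    return own

# Counters from the Aggregate and its Seq Scan subnode above
aggregate = {"shared hit": 2144, "shared read": 20019,
             "temp read": 8586, "temp written": 8609}
seq_scan = {"shared hit": 2144, "shared read": 20019}

# All shared I/O came from the seq scan; the temp part is the
# aggregate's own work.
print(own_usage(aggregate, [seq_scan]))
```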

This leaves two last bits of information: what about loops=, and what about parallel queries.

First, normal loops:

$ EXPLAIN (analyze ON, buffers ON) SELECT i, (SELECT COUNT(*) FROM z WHERE id + 100 > i) FROM generate_series(1,3) i;
                                                      QUERY PLAN                                                      
 FUNCTION Scan ON generate_series i  (cost=0.00..6079.06 ROWS=3 width=12) (actual TIME=41.353..68.639 ROWS=3 loops=1)
   Buffers: shared hit=1329
   SubPlan 1
     ->  Aggregate  (cost=2026.33..2026.34 ROWS=1 width=8) (actual TIME=22.868..22.869 ROWS=1 loops=3)
           Buffers: shared hit=1329
           ->  Seq Scan ON z  (cost=0.00..1943.00 ROWS=33333 width=0) (actual TIME=0.007..15.753 ROWS=100000 loops=3)
                 FILTER: ((id + 100) > i.i)
                 Buffers: shared hit=1329
   Buffers: shared hit=2
 Planning TIME: 0.189 ms
 Execution TIME: 68.695 ms
(12 ROWS)

With this query I forced Pg to run 3 separate Seq Scans of table z. At the time of the test, z had 443 pages. The number shown, 1,329 shared hit pages, is exactly 3 * 443. Which shows that the number is a total across all loops.
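
Getting per-loop usage out of such a node is then just a division (with the numbers from the plan above):

```python
table_pages = 443        # size of z at test time
loops = 3                # from the Seq Scan's loops=3
total_shared_hit = 1329  # from the Seq Scan's Buffers: line

per_loop = total_shared_hit // loops
# Each iteration of the loop scanned the whole table.
print(per_loop, per_loop == table_pages)
```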

And parallel queries?

I made a larger version of the table (44,248 pages), and ran a test:

$ EXPLAIN (analyze ON, buffers ON) SELECT COUNT(*) FROM z;
                                                               QUERY PLAN                                                               
 Finalize Aggregate  (cost=97331.80..97331.81 ROWS=1 width=8) (actual TIME=269.856..271.856 ROWS=1 loops=1)
   Buffers: shared hit=16111 READ=28137
   I/O Timings: READ=36.838
   ->  Gather  (cost=97331.58..97331.79 ROWS=2 width=8) (actual TIME=269.773..271.850 ROWS=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=16111 READ=28137
         I/O Timings: READ=36.838
         ->  Partial Aggregate  (cost=96331.58..96331.59 ROWS=1 width=8) (actual TIME=268.094..268.095 ROWS=1 loops=3)
               Buffers: shared hit=16111 READ=28137
               I/O Timings: READ=36.838
               ->  Parallel Seq Scan ON z  (cost=0.00..85914.87 ROWS=4166687 width=0) (actual TIME=0.015..153.453 ROWS=3333333 loops=3)
                     Buffers: shared hit=16111 READ=28137
                     I/O Timings: READ=36.838
   Buffers: shared hit=5 READ=3
   I/O Timings: READ=0.011
 Planning TIME: 0.069 ms
 Execution TIME: 271.923 ms
(19 ROWS)

Now, in here we see that there were 3 concurrent workers (2 workers launched + main process). But buffers reported for Parallel Seq Scan are: hit=16,111 and read=28,137, total: 44,248 – exactly the full size of the table. So, just like with loops, the numbers are totals across all parallel workers.

This blogpost is a foundation I needed to write before I add buffers info parsing to Pg::Explain, and display it on explain.depesz.com. So now you know that soon(ish) there will be buffers info somewhere there.

I would like to thank all the people that asked me about it over the years – it took some time, but I'm finally getting to do it 🙂
