Per-worker, and global, IO bandwidth in explain plans

Jeremy Schneider suggested a change to how plans are displayed on explain.depesz.com – adding one more piece of information when a plan node includes I/O timing data. It took me a while to research, but it finally made its way in…

Let's consider this simple plan.

In its Parallel Seq Scan node we see:

  ->  Parallel Seq Scan on mvtestertres m  (cost=0.00..5795952.75 rows=1 width=249) (actual time=42869.176..53264.538 rows=0 loops=3)
        Buffers: shared hit=3375 read=5031650
        I/O Timings: read=144131.465
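
A quick side note: the "I/O Timings" line shows up only when track_io_timing is enabled and the plan was gathered with EXPLAIN (ANALYZE, BUFFERS). The actual query isn't shown in this post, so the SELECT below is just a hypothetical sketch of how one gets such output:

  SET track_io_timing = on;
  -- hypothetical query – the real one isn't shown here:
  EXPLAIN ( ANALYZE, BUFFERS ) SELECT count(*) FROM mvtestertres;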

And a bit earlier we can see that the Gather node spawned two workers.

What can we understand from it?

  • There were, in total, three parallel workers (two spawned by Gather, plus the main backend)
  • The wall-clock time that passed while reading the table was 53264.538ms – ~ 53 seconds
  • This scan read 5,031,650 pages, which, with the default page size of 8192 bytes, gives us almost 39GB of data (checked with a quick query after this list)
  • Reading all this data took 144131.465ms – ~ 2.5 minutes
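
The size math is easy to verify with a throwaway query:

  -- 5,031,650 pages × 8,192 bytes per page ≈ 39,310 MiB
  SELECT pg_size_pretty( 5031650::int8 * 8192 ) AS total_read,
         round( 5031650::int8 * 8192 / 1024.0 ^ 2 ) AS mib;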

Wait what? 2.5 minutes? But the whole node took only 53 seconds?

This is because buffer and timing information is summed across all workers.

In total, PostgreSQL spent ~ 2m40s processing this node (3 workers × ~ 53 seconds each), but since the work happened in 3 parallel workers, only 53 seconds of wall-clock time passed.

In total, Pg read 39GB of data, and in total, it took 2.5 minutes. But since it happened in parallel, each of the workers read only (on average) 13GB of data, and needed ~ 48 seconds to read it.
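
The per-worker averages are simply the totals divided by three:

  -- totals (39GB of data, 144131.465ms of I/O time) split across 3 workers
  SELECT round( 5031650::int8 * 8192 / 1024.0 ^ 3 / 3, 1 ) AS gb_per_worker,
         round( 144131.465 / 3 ) AS ms_per_worker;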

This leads to a couple of interesting bandwidth calculations:

  • we read 39GB of data in ~ 2.5 minutes of summed I/O time, which gives us a speed of ~ 272MB/s
  • each worker read 13GB in ~ 48 seconds, which also gives ~ 272MB/s
  • the disks, though, had to deliver 39GB of data in a total wall-clock time of 53 seconds. This means that the total bandwidth the disks had to provide was ~ 739MB/s (the query after this list redoes the math)
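
For reference, both figures can be recomputed directly. A sketch – MB here means MiB, and rounding can shift the result by ±1MB/s relative to what explain.depesz.com shows:

  -- total data / summed I/O time  → average per-worker speed
  -- total data / wall-clock time  → bandwidth the disks had to deliver
  SELECT round( 5031650::int8 * 8192 / 1024.0 ^ 2 / ( 144131.465 / 1000 ) ) AS mb_per_sec_avg_worker,
         round( 5031650::int8 * 8192 / 1024.0 ^ 2 / ( 53264.538 / 1000 ) ) AS mb_per_sec_wall_clock;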

Previously, before Jeremy pointed it out, explain.depesz.com was showing IO data like:

39 GB
in 144,131.465 ms
~ 273 MB/s

Which correctly shows the total size, the total time summed across all workers, and the average speed at which each worker read the data.

But now, thanks to Jeremy, we get:

39 GB
in 144,131.465 ms
~ 273 MB/s (avg-per-worker)
~ 739 MB/s (exclusive-time)

Which explains what the 273MB/s means, and also adds the total bandwidth that the disks had to provide.

Providing timing information in case of parallel execution is complicated. But thanks to Jeremy's work, we now have a more detailed view of the plans that provide timing info. Thanks a lot.
