Waiting for PostgreSQL 14 – Report progress of COPY commands

On 6th of January 2021, Tomas Vondra committed patch:

Report progress of COPY commands
 
This commit introduces a view pg_stat_progress_copy, reporting progress
of COPY commands.  This allows rough estimates how far a running COPY
progressed, with the caveat that the total number of bytes may not be
available in some cases (e.g. when the input comes from the client).
 
Author: Josef Šimánek
Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de Meent
Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com

Loading huge datasets takes time, and it can be frustrating, since you don't know what is your current situation – whether you're at 1% of 99%.

Now, we will get some insight.

Let's test it. The simplest test will be to export some data from DB – preferably one that I can generate at will, and not store in DB itself.

So let's start with this query:

$ SELECT i, now() - random() * '2 years'::INTERVAL, random()
    FROM generate_series(1,5) i;
  i  |           ?COLUMN?            |        random        
-----+-------------------------------+----------------------
   1 | 2020-10-26 02:22:33.570694+01 |   0.7886766259618767
   2 | 2020-08-09 11:51:18.114694+02 |   0.9346193980623099
   3 | 2020-10-27 11:56:38.313094+01 |   0.7523086089557793
   4 | 2019-07-11 02:27:29.749894+02 |   0.0374716722469941
   5 | 2020-05-25 00:51:41.039494+02 |    0.857011143252528
(5 ROWS)

This generated 5 rows with some values. The good thing is that I can scale it to virtually any number of rows without using any disk space.

Now, let's put it within COPY, sending output to /dev/null to avoid filling disk space, and see the progress.

To see progress, in separate psql session I ran:

$ SELECT * FROM pg_stat_progress_copy \watch 1

Which, in the beginning was showing:

                   Tue 12 Jan 2021 11:19:39 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)
 
                   Tue 12 Jan 2021 11:19:40 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)

No rows, since no COPY operation was in progress.

Now, with this in place, I ran larger COPY:

$ \copy (
    SELECT i, now() - random() * '2 years'::INTERVAL, random()
    FROM generate_series(1,10000000) i
) TO /dev/NULL;

While this was running (~ 7 seconds), my watcher psql showed:

                     Tue 12 Jan 2021 11:19:41 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |        33733834 |           0 |          601473
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:42 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       121581834 |           0 |         2142388
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:43 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       210759763 |           0 |         3699553
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:44 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       300090238 |           0 |         5259338
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:45 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       388599315 |           0 |         6804802
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:46 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       477635774 |           0 |         8359459
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:47 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       565792160 |           0 |         9898785
(1 ROW)
 
                   Tue 12 Jan 2021 11:19:48 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)

This is pretty cool. I saw both number of bytes that were send, and number of lines (rows).

Given that I knew that my output is 10 million rows, I could easily see where in the stream I am.

Great addition that adds yet another progress information.

Thanks a lot to all involved.

5 thoughts on “Waiting for PostgreSQL 14 – Report progress of COPY commands”

  1. The reporting has since changed the name of one column, and has had some columns added, so you might want to update this post with the newer results.

  2. @Matthias:
    yeah, I generally don’t do it. There have been many changes through the years, and keeping stuff updated is not realistic. I assume that if/when someone will read about it, and will want to use it, they will check what is the current state of view/data.

  3. That seems actually quite reasonable. Thanks for the great blog!

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.