Waiting for PostgreSQL 14 – Report progress of COPY commands

On 6th of January 2021, Tomas Vondra committed patch:

Report progress of COPY commands
 
This commit introduces a view pg_stat_progress_copy, reporting progress
of COPY commands.  This allows rough estimates how far a running COPY
progressed, with the caveat that the total number of bytes may not be
available in some cases (e.g. when the input comes from the client).
 
Author: Josef Šimánek
Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de Meent
Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com
Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com

Loading huge datasets takes time, and it can be frustrating, since you don't know what is your current situation – whether you're at 1% of 99%.

Now, we will get some insight.

Let's test it. The simplest test will be to export some data from DB – preferably one that I can generate at will, and not store in DB itself.

So let's start with this query:

$ SELECT i, now() - random() * '2 years'::INTERVAL, random()
    FROM generate_series(1,5) i;
  i  |           ?COLUMN?            |        random        
-----+-------------------------------+----------------------
   1 | 2020-10-26 02:22:33.570694+01 |   0.7886766259618767
   2 | 2020-08-09 11:51:18.114694+02 |   0.9346193980623099
   3 | 2020-10-27 11:56:38.313094+01 |   0.7523086089557793
   4 | 2019-07-11 02:27:29.749894+02 |   0.0374716722469941
   5 | 2020-05-25 00:51:41.039494+02 |    0.857011143252528
(5 ROWS)

This generated 5 rows with some values. The good thing is that I can scale it to virtually any number of rows without using any disk space.

Now, let's put it within COPY, sending output to /dev/null to avoid filling disk space, and see the progress.

To see progress, in separate psql session I ran:

$ SELECT * FROM pg_stat_progress_copy \watch 1

Which, in the beginning was showing:

                   Tue 12 Jan 2021 11:19:39 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)
 
                   Tue 12 Jan 2021 11:19:40 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)

No rows, since no COPY operation was in progress.

Now, with this in place, I ran larger COPY:

$ \copy (
    SELECT i, now() - random() * '2 years'::INTERVAL, random()
    FROM generate_series(1,10000000) i
) TO /dev/NULL;

While this was running (~ 7 seconds), my watcher psql showed:

                     Tue 12 Jan 2021 11:19:41 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |        33733834 |           0 |          601473
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:42 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       121581834 |           0 |         2142388
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:43 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       210759763 |           0 |         3699553
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:44 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       300090238 |           0 |         5259338
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:45 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       388599315 |           0 |         6804802
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:46 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       477635774 |           0 |         8359459
(1 ROW)
 
                     Tue 12 Jan 2021 11:19:47 AM CET (every 1s)
 
  pid  | datid  | datname | relid | bytes_processed | bytes_total | lines_processed 
-------+--------+---------+-------+-----------------+-------------+-----------------
 56595 | 223019 | depesz  |     0 |       565792160 |           0 |         9898785
(1 ROW)
 
                   Tue 12 Jan 2021 11:19:48 AM CET (every 1s)
 
 pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed 
-----+-------+---------+-------+-----------------+-------------+-----------------
(0 ROWS)

This is pretty cool. I saw both number of bytes that were send, and number of lines (rows).

Given that I knew that my output is 10 million rows, I could easily see where in the stream I am.

Great addition that adds yet another progress information.

Thanks a lot to all involved.

2 thoughts on “Waiting for PostgreSQL 14 – Report progress of COPY commands”

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.