Waiting for PostgreSQL 12 – Allow multi-inserts during COPY into a partitioned table

On 1st of August 2018, Peter Eisentraut committed patch:

Allow multi-inserts during COPY into a partitioned table
 
 
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables.  The reason for this appeared
to be that the tuple may not belong to the same partition as the
previous tuple did.  Not allowing multi-inserts here greatly slowed down
imports into partitioned tables.  These could take twice as long as a
copy to an equivalent non-partitioned table.  It seems wise to do
something about this, so this change allows the multi-inserts by
flushing the so-far inserted tuples to the partition when the next tuple
does not belong to the same partition, or when the buffer fills.  This
improves performance when the next tuple in the stream commonly belongs
to the same partition as the previous tuple.
 
In cases where the target partition changes on every tuple, using
multi-inserts slightly slows the performance.  To get around this we
track the average size of the batches that have been inserted and
adaptively enable or disable multi-inserts based on the size of the
batch.  Some testing was done and the regression only seems to exist
when the average size of the insert batch is close to 1, so let's just
enable multi-inserts when the average size is at least 1.3.  More
performance testing might reveal a better number for this, but since
the slowdown was only 1-2% it does not seem critical enough to spend too
much time calculating it.  In any case it may depend on other factors
rather than just the size of the batch.
 
Allowing multi-inserts for partitions required a bit of work around the
per-tuple memory contexts as we must flush the tuples when the next
tuple does not belong to the same partition.  In which case there is no
good time to reset the per-tuple context, as we've already built the new
tuple by this time.  In order to work around this we maintain two
per-tuple contexts and just switch between them every time the partition
changes and reset the old one.  This does mean that the first of each
batch of tuples is not allocated in the same memory context as the
others, but that does not matter since we only reset the context once
the previous batch has been inserted.
 
Author: David Rowley <david.rowley@2ndquadrant.com>

The description in the commit message is pretty verbose and clear, but let's see how well it actually behaves.
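As a side note, the adaptive part described in the commit message is easy to model. Here is a minimal Python sketch of the buffering/flushing behavior – this is an illustration, not the actual C code from copy.c; the buffer limit and the exact averaging formula are my assumptions:

```python
AVG_THRESHOLD = 1.3   # cutoff quoted in the commit message
BUFFER_LIMIT = 1000   # assumed buffer size; the real constant lives in copy.c

def copy_batches(rows, partition_of):
    """Buffer rows per partition; flush when the target partition changes
    or the buffer fills.  Track a moving average of flushed batch sizes --
    if it stays below AVG_THRESHOLD, multi-inserts would be disabled."""
    batches, buffer, current, avg = [], [], None, 1.0
    for row in rows:
        part = partition_of(row)
        if part != current or len(buffer) >= BUFFER_LIMIT:
            if buffer:
                batches.append(len(buffer))
                avg = avg * 0.75 + len(buffer) * 0.25  # assumed weighting
            buffer, current = [], part
        buffer.append(row)
    if buffer:
        batches.append(len(buffer))
        avg = avg * 0.75 + len(buffer) * 0.25
    return batches, avg
```

With sorted input this yields one large batch per partition and a high average; with input that switches partition on every row, every batch has size 1 and the average stays at 1 – exactly the regression case the 1.3 cutoff guards against.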

To test it, I will use plans from explain.depesz.com.

There are, in total, 537,056 rows in the full set. I will also test a subset that contains data only for a single partition – this subset has 8,804 rows.

Schema for my test will be:

=$ CREATE TABLE plans (
    id text NOT NULL,
    plan text NOT NULL,
    entered_on timestamp with time zone DEFAULT now() NOT NULL,
    is_public boolean DEFAULT true NOT NULL,
    is_anonymized boolean DEFAULT false NOT NULL,
    title text,
    delete_key text,
    is_deleted boolean DEFAULT false NOT NULL,
    added_by text,
    optimization_for text
) PARTITION BY RANGE (id);

and the partitions will be:

=$ CREATE TABLE plans_0 PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM (minvalue) TO ('1');
=$ CREATE TABLE plans_1 PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('1') TO ('2');
=$ CREATE TABLE plans_2 PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('2') TO ('3');
...
=$ CREATE TABLE plans_8 PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('8') TO ('9');
=$ CREATE TABLE plans_9 PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('9') TO ('a');
=$ CREATE TABLE plans_a PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('a') TO ('b');
...
=$ CREATE TABLE plans_y PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('y') TO ('z');
=$ CREATE TABLE plans_z PARTITION OF plans (PRIMARY KEY (id)) FOR VALUES FROM ('z') TO (maxvalue);
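The 36 partition definitions are mechanical, so they can be generated rather than typed. A short Python sketch that reproduces the statements above (one partition per possible first character of id, with the edges open-ended via minvalue/maxvalue):

```python
import string

# One partition per possible first character of id: '0'..'9' then 'a'..'z'.
chars = string.digits + string.ascii_lowercase
stmts = []
for i, c in enumerate(chars):
    lo = "minvalue" if i == 0 else "'%s'" % c
    hi = "maxvalue" if i == len(chars) - 1 else "'%s'" % chars[i + 1]
    stmts.append(
        "CREATE TABLE plans_%s PARTITION OF plans (PRIMARY KEY (id)) "
        "FOR VALUES FROM (%s) TO (%s);" % (c, lo, hi)
    )
print("\n".join(stmts))
```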

Afterwards, \d+ plans looks like:

                                                 Table "public.plans"
      Column      |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description 
------------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 id               | text                     |           | not null |         | extended |              | 
 plan             | text                     |           | not null |         | extended |              | 
 entered_on       | timestamp with time zone |           | not null | now()   | plain    |              | 
 is_public        | boolean                  |           | not null | true    | plain    |              | 
 is_anonymized    | boolean                  |           | not null | false   | plain    |              | 
 title            | text                     |           |          |         | extended |              | 
 delete_key       | text                     |           |          |         | extended |              | 
 is_deleted       | boolean                  |           | not null | false   | plain    |              | 
 added_by         | text                     |           |          |         | extended |              | 
 optimization_for | text                     |           |          |         | extended |              | 
Partition key: RANGE (id)
Partitions: plans_0 FOR VALUES FROM (MINVALUE) TO ('1'),
            plans_1 FOR VALUES FROM ('1') TO ('2'),
            plans_2 FOR VALUES FROM ('2') TO ('3'),
            plans_3 FOR VALUES FROM ('3') TO ('4'),
            plans_4 FOR VALUES FROM ('4') TO ('5'),
            plans_5 FOR VALUES FROM ('5') TO ('6'),
            plans_6 FOR VALUES FROM ('6') TO ('7'),
            plans_7 FOR VALUES FROM ('7') TO ('8'),
            plans_8 FOR VALUES FROM ('8') TO ('9'),
            plans_9 FOR VALUES FROM ('9') TO ('a'),
            plans_a FOR VALUES FROM ('a') TO ('b'),
            plans_b FOR VALUES FROM ('b') TO ('c'),
            plans_c FOR VALUES FROM ('c') TO ('d'),
            plans_d FOR VALUES FROM ('d') TO ('e'),
            plans_e FOR VALUES FROM ('e') TO ('f'),
            plans_f FOR VALUES FROM ('f') TO ('g'),
            plans_g FOR VALUES FROM ('g') TO ('h'),
            plans_h FOR VALUES FROM ('h') TO ('i'),
            plans_i FOR VALUES FROM ('i') TO ('j'),
            plans_j FOR VALUES FROM ('j') TO ('k'),
            plans_k FOR VALUES FROM ('k') TO ('l'),
            plans_l FOR VALUES FROM ('l') TO ('m'),
            plans_m FOR VALUES FROM ('m') TO ('n'),
            plans_n FOR VALUES FROM ('n') TO ('o'),
            plans_o FOR VALUES FROM ('o') TO ('p'),
            plans_p FOR VALUES FROM ('p') TO ('q'),
            plans_q FOR VALUES FROM ('q') TO ('r'),
            plans_r FOR VALUES FROM ('r') TO ('s'),
            plans_s FOR VALUES FROM ('s') TO ('t'),
            plans_t FOR VALUES FROM ('t') TO ('u'),
            plans_u FOR VALUES FROM ('u') TO ('v'),
            plans_v FOR VALUES FROM ('v') TO ('w'),
            plans_w FOR VALUES FROM ('w') TO ('x'),
            plans_x FOR VALUES FROM ('x') TO ('y'),
            plans_y FOR VALUES FROM ('y') TO ('z'),
            plans_z FOR VALUES FROM ('z') TO (MAXVALUE)

So, let's try loading test data.

For each test, I ran it 5 times and picked the best time (to avoid problems with concurrent activity).
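The best-of-five logic itself is trivial, but for completeness, a sketch – the actual loading command is not shown in this post, so load_once below is a stand-in for whatever truncates the table and runs the \copy:

```python
import time

def best_of_five(load_once):
    """Run the load five times and keep the fastest wall-clock time,
    to filter out noise from concurrent activity on the box."""
    times = []
    for _ in range(5):
        start = time.monotonic()
        load_once()
        times.append(time.monotonic() - start)
    return min(times)
```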

First, before the patch:

  • subset: 1.302 seconds
  • all plans in random order: 1 minute 15.997 seconds
  • all plans in alphabetical order: 1 minute 10.507 seconds

After the patch (and a few more commits, as I tested on current git HEAD):

  • subset: 1.361 seconds
  • all plans in random order: 1 minute 14.803 seconds
  • all plans in alphabetical order: 1 minute 9.985 seconds

As an easy-to-compare table:

            | subset  | all randomized | all ordered
 -----------+---------+----------------+------------
 before     | 1.302 s | 75.997 s       | 70.507 s
 after      | 1.361 s | 74.803 s       | 69.985 s
 difference | +4.5%   | -1.6%          | -0.7%
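The difference row is just the relative change between the best times; recomputing it from the numbers above is plain arithmetic:

```python
before = {"subset": 1.302, "all randomized": 75.997, "all ordered": 70.507}
after = {"subset": 1.361, "all randomized": 74.803, "all ordered": 69.985}
for name in before:
    pct = (after[name] - before[name]) / before[name] * 100
    print("%-15s %+.1f%%" % (name, pct))
```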

Well, to be honest, I'm underwhelmed by the numbers. I'm not sure if I did something wrong, or if the improvement really is that small.

In any case, thanks to David and Peter for their work – any work to improve performance is good 🙂
