Waiting for 9.5 – Add support for INSERT … ON CONFLICT DO NOTHING/UPDATE.

On 8th of May, Andres Freund committed patch:

Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
 
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
 
This feature is often referred to as upsert.
 
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
 
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
 
Bumps catversion as stored rules change.
 
Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.

This is a dream-come-true for many. Lack of upsert was troubling to many, caused many questions on irc, and posts on blogs, including mine, of course.

Now, it's all over. With the new syntax we can forget about all of this.

So, let's see how it works.

CREATE TABLE test (
    id serial PRIMARY KEY,
    some_key text NOT NULL UNIQUE,
    some_val int4 NOT NULL DEFAULT 0
);

So, let's insert some rows, and see what we can do.

First, sanity checks:

INSERT INTO test (some_key) VALUES ('a'), ('b');
SELECT * FROM test;
 id | some_key | some_val 
----+----------+----------
  1 | a        |        0
  2 | b        |        0
(2 ROWS)

With these two rows, we can do:

INSERT INTO test (some_key) VALUES ('a');
ERROR:  duplicate KEY VALUE violates UNIQUE CONSTRAINT "test_some_key_key"
DETAIL:  KEY (some_key)=(a) already EXISTS.

This, correctly, errored out. How about ignoring the error?

INSERT INTO test (some_key) VALUES ('a')
    ON CONFLICT DO NOTHING;
INSERT 0 0
SELECT * FROM test;
 id | some_key | some_val 
----+----------+----------
  1 | a        |        0
  2 | b        |        0
(2 ROWS)

That's new. How about making update?

INSERT INTO test (some_key) VALUES ('a')
    ON CONFLICT DO UPDATE SET some_val = some_val + 1;
ERROR:  ON CONFLICT DO UPDATE requires inference specification OR CONSTRAINT name
LINE 2:     ON CONFLICT DO UPDATE SET some_val = some_val + 1;
            ^
HINT:  FOR example, ON CONFLICT ON CONFLICT (<column>).

This is interesting – for “DO NOTHING" you can just type it as it, but when doing “DO UPDATE" you have to put some more work.

Why? Consider that we have 2 different unique indexes – on id and on some_key. What we update should be different depending on which key might get violated.

So, for DO UPDATE we need to specify which key violation is the one we care about, so let's do it:

INSERT INTO test (some_key) VALUES ('a')
    ON CONFLICT ON CONSTRAINT test_some_key_key DO UPDATE SET some_val = some_val + 1;
ERROR:  COLUMN reference "some_val" IS ambiguous
LINE 2: ...TRAINT test_some_key_key DO UPDATE SET some_val = some_val +...
                                                             ^

And another error. This time – it has problem with “some_val". Why? That's because UPDATE has access to both values in row that is being updated, and the ones that I just tried to insert. And some_val doesn't specify where the data comes from, or goes into.

So, how to specify which some_val, I want? For the table columns, prefix it with table name, and for the row that you tried to insert – use “excluded" pseudo-table.

So, my simple example becomes

INSERT INTO test (some_key) VALUES ('a')
    ON CONFLICT ON CONSTRAINT test_some_key_key DO UPDATE SET some_val = test.some_val + 1;
UPSERT 0 1
SELECT * FROM test;
 id | some_key | some_val 
----+----------+----------
  2 | b        |        0
  1 | a        |        1
(2 ROWS)

But if I wanted, for example, to use provided value, I could:

INSERT INTO test (some_key, some_val) VALUES ('b', 10)
    ON CONFLICT ON CONSTRAINT test_some_key_key DO UPDATE SET some_val = excluded.some_val;
UPSERT 0 1
SELECT * FROM test;
 id | some_key | some_val 
----+----------+----------
  1 | a        |        1
  2 | b        |       10
(2 ROWS)

From what I can tell, you can't have two “ON CONFLICT" clauses, which means you can't have different handling of pkey violation, and some other key violation, but maybe it will be added in future.

Now, how does that behave in case of multiple rows?

INSERT INTO test (some_key, some_val) VALUES ('a', 5), ('b', 10), ('c', 15)
    ON CONFLICT ON CONSTRAINT test_some_key_key DO UPDATE SET some_val = excluded.some_val
    returning *;
 id | some_key | some_val 
----+----------+----------
  1 | a        |        5
  2 | b        |       10
  9 | c        |       15
(3 ROWS)
UPSERT 0 3

And if I'd ignore duplicates?

INSERT INTO test (some_key, some_val) VALUES ('a', 5), ('b', 10), ('c', 15), ('d', 20)
    ON CONFLICT DO NOTHING
    returning *;
 id | some_key | some_val 
----+----------+----------
 13 | d        |       20
(1 ROW)

Nice.

So, finally, let's test performance.

I'll test 2 approaches:

  • insert on conflict do nothing
  • upsert done via function, which uses LOOP approach

Insert is trivial. Now for the function.

CREATE FUNCTION upsert_loop ( p_key text, p_val int4 ) RETURNS void AS $$
BEGIN
    LOOP
        UPDATE test SET some_val = p_val WHERE some_key = p_key;
        IF found THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO test(some_key,some_val) VALUES (p_key, p_val);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION
 
SELECT upsert_loop ('a', 123);
 upsert_loop 
-------------
 
(1 ROW)
SELECT upsert_loop ('z', 23);
 upsert_loop 
-------------
 
(1 ROW)
 
SELECT * FROM test;
 id | some_key | some_val 
----+----------+----------
  2 | b        |       10
  9 | c        |       15
 13 | d        |       20
  1 | a        |      123
 14 | z        |       23
(5 ROWS)

Works. So now, let's write simple test.

I will need couple of test files.

I need some (many) duplicates to happen, so let's make dataset that will be having them:

perl -e 'my @c = ("a".."z", "A".."Z", "0".."9"); printf "%s%s %d\n", $c[rand @c], $c[rand @c], rand 1000 for 1..100000' | split -l 20000 - input.

This, somewhat cryptic line, generated five files, named “input.aa" to “input.ae".

Each of these files contains 20000 lines, and the lines look like this:

uE 459
GD 347
HN 977
wE 3
OD 201
MD 926
MB 523
t1 903
FH 157
tk 104

There is 3844 different combinations in 1 column, so we will have lots of updates.

Now, I'll convert these files into 2 sets of sql files, one using insert … on conflict, and the other will use function:

for a in input.a?; do cat $a | sed "s/ /', /;s/^/select upsert_loop('/;s/\$/);/" > function-$a; done
for a in input.a?; do cat $a | sed "s/ /', /;s/^/insert into test (some_key, some_val) values ('/;s/\$/) on conflict do nothing;/" > conflict-$a; done

The files looks now like this:

=$ head function-INPUT.aa 
SELECT upsert_loop('uE', 459);
SELECT upsert_loop('GD', 347);
SELECT upsert_loop('HN', 977);
SELECT upsert_loop('wE', 3);
SELECT upsert_loop('OD', 201);
SELECT upsert_loop('MD', 926);
SELECT upsert_loop('MB', 523);
SELECT upsert_loop('t1', 903);
SELECT upsert_loop('FH', 157);
SELECT upsert_loop('tk', 104);
 
14:41:00 depesz@krowka ~
=$ head conflict-INPUT.aa
INSERT INTO test (some_key, some_val) VALUES ('uE', 459) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('GD', 347) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('HN', 977) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('wE', 3) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('OD', 201) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('MD', 926) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('MB', 523) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('t1', 903) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('FH', 157) ON conflict do nothing;
INSERT INTO test (some_key, some_val) VALUES ('tk', 104) ON conflict do nothing;

Now, with this in place, I will run them all, at the same time, and check how long it takes. From clean table.

So, I need s helper script, which does:

#!/usr/bin/env bash
prefix="$1"
psql -qAtX -c 'truncate test'
for a in $prefix-input.a?; do psql -qAtX -f $a >/dev/null & done
time wait

This will run truncate on the test table, then run all sql files (for given prefix) in parallel, and check how long it takes for them to finish.

Since single run is not really good thing, i will test each of the methods 3 times, and pick best time. Results:

  • ON CONFLICT – 16.780s
  • function – didn't finish. after 3 minutes a gave up.

Winner is clear. But, ON CONFLICT has also one benefit – it seamlessly handles working with multiple rows. So importing new data will be much simpler now. Thanks guys – that's great feature.

update

Andreas notice that I used key name in all “on conflict" clauses – where you can use “on conflict (col_a, col_b)". This form (with listed columns, and not constraint name) has the benefit that it will work if you'd change name of unique constraint.

9 thoughts on “Waiting for 9.5 – Add support for INSERT … ON CONFLICT DO NOTHING/UPDATE.”

  1. There should be a way of tersely saying “if and only if the new row matches an existing row in its entirety then do nothing”. The idea is that when we have a unique/primary key, we are dealing with set logic, and so any duplicate tuple insertion is logically a safe action and so should be treated as a no-op rather than an error.

  2. Great feature. But I think there are some differences in UPSERT/MERGE implementation in other databases and above feature in PostgreSQL.
    But this will really help in working with large data sets. Thanks team for it.

  3. Hi Darren,

    I’m not sure what precisely you mean. Do you want to avoid specifying the exact column(s)/constraint you’re conflicting on? Or do you really want to check that all columns are the same, even if they’re not in the constraint? The latter sounds rather strange to me, if you need that your constraints aren’t well defined. Checking all columns is a) not necessarily possible, there’s datatypes without equality operators defined, b) not necessarily meaningful, consider e.g. a ‘updated_at’ column. c) not easy to implement.

    Andres

  4. Generally it’s better to not name the constraint explicitly, as that can break e.g. if you create/drop a new index concurrently. You instead can specify the columns you’re conflicting with using ON CONFLICT (cola, colb, ..). That’ll match any index containing, in any order, exactly these columns.

    Even partial unique indexes can be matched that way, by adding a WHERE clause to the conflict part, like ON CONFLICT (username) WHERE NOT is_deleted;

  5. Hi Andres,

    What I’m looking for is an easy way to declare that attempting to insert a duplicate row is not an error and the duplicate would be silently ignored as redundant with the existing copy.

    One should be able to say, “if any key constraint fails, then do nothing only if the entire tuple matches the conflicting one, otherwise raise the key constraint error you were going to have raised”.

    insert into test … values … on duplicate of all|col list do nothing

    This can piggyback the new on conflict functionality. If col list is a superset of any key then treat as an on conflict do nothing of that key only if all the other columns named or in the tuple match, and otherwise fail with that constraint as usual.

  6. Hi!

    There is also this one solution to perform UPSERT without writing procedure. But this solution doesn’t use any unique indexes, so it is more complicated to use:

    WITH u AS (
    	UPDATE TABLE SET FIELD = 1 WHERE id = 2 RETURNING id 
    )
    INSERT INTO TABLE (id, FIELD) SELECT 2,1
    WHERE NOT EXISTS (SELECT 1 FROM u);

    I think it might be interesting to bench this query with new UPSERT feature.

  7. @Eugene:
    it has a race condition. run many of these in parallel, and you will see that some of them end up with unique violation.

    And it does use unique index – on id column – otherwise, that is, if there is no unique index on id column – you will end up with duplicates.

  8. > From what I can tell, you can’t have two “ON CONFLICT” clauses, which means you can’t have different handling of pkey violation, and some other key violation, but maybe it will be added in future.

    This is bothering me right now. Can anyone think of a workaround?

  9. Given:
    on conflict (uid, product, coalesce(kit, 0))

    When this clause is placed inside a function with the same parameter names it fails. Unfortulately PG 9.6 doesn’t allow to explicitly specify column names to avoid confusion:

    on conflict (tbl.uid, tbl.product, coalesce(tbl.kit, 0))

    Workaround is to rename parameters of a function but I’m not so happy about that.

Comments are closed.