finding missing pairs

let's assume we have a simple table:

     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (a, b)
    "q" UNIQUE, btree (b, a)

now, let's insert some rows into it (a cross join of two generate_series() calls, giving 100 × 100 = 10,000 rows):

# INSERT INTO test SELECT * FROM generate_series(1,100) AS a, generate_series(1,100) AS b;
INSERT 0 10000

remove rows with a = b:

# DELETE FROM test WHERE a = b;
DELETE 100

and prepare the test case by randomly removing some rows:

# DELETE FROM test WHERE random() < 0.002;
DELETE 17

the question is – find all rows (a,b) for which there is no row (a',b') such that a' = b and b' = a.

in other words – every row (a,b) should be paired. the row with a = 2 and b = 3 is paired by the row with a = 3 and b = 2.

how to find incomplete pairs?

the basic approach would be to do something like:

SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a))

looks reasonable?

no. unfortunately no.

the execution plan for this query consisted of 1 seq scan of the test table, plus a nested index scan for every row of it.
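
(you can inspect the plan yourself; the exact shape will depend on postgresql version and configuration:)

EXPLAIN SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a));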

does it look bad? with 10k rows, it is not really bad – after all, 1 seq scan over 10k rows, plus 10k index scans – it's doable.

but what if you have 70 million rows in the table?

in this situation it simply becomes 70 million index scans. a quick test of single index scans showed that (out of 10 tries) i got:

  • minimal time: 0.047ms
  • maximal time: 215ms
  • average time: around 0.9-1ms.

a full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), the index scans alone add up to nearly 1 hour. with an average time of 1ms, the total escalates to … almost 20 hours!
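
the math, for the record:

70,000,000 * 0.047ms ≈ 3,290s ≈ 55 minutes
70,000,000 * 1ms     = 70,000s ≈ 19.5 hours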

oops. now, i don't have that much time for the task 🙂

better options?

perhaps a left outer join approach?

SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL

this becomes 2 sequential scans joined with a hash join (with the small 10k-row table it becomes a merge join of 2 full-table index scans).
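
(again, explain on this query will show it; the plan shape depends on table size and settings:)

EXPLAIN SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL;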

better, but can it be done faster?

actually yes:

SELECT least(a,b), greatest(a,b), COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;

this query does only 1 table scan, with a hash aggregate. the time is unbeatable 🙂

how does it work?

basically i thought about finding a way to tell postgresql that the pairs (12,50) and (50,12) are “the same”, so i could use grouping.

there is a simple way to do it – using the least() and greatest() functions. least() always returns the smaller, and greatest() the bigger number.

so, for both (12,50) and (50,12), the output of (least(), greatest()) is (12,50)!
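
quick check:

SELECT least(12,50), greatest(12,50); -- returns 12, 50
SELECT least(50,12), greatest(50,12); -- returns 12, 50 as well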

now that i have a way to “group it”, i can use the standard “having” trick to filter only the groups that contain just 1 row – a complete pair forms a group of 2 rows, so a group of 1 row means its partner is missing.

effect – total time of execution – 5 minutes 🙂 for 70 million rows.

and how to add missing pairs?

first, let's copy the group info to a temp table:

CREATE TEMP TABLE depesz_temp AS
SELECT least(a,b) AS x, greatest(a,b) AS y, COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;

now – we need to find what row exactly is in each incomplete group, and insert its pair.

to make a long story short – the full insert looks like this:

INSERT INTO test (a,b)
    SELECT
        -- the group's single existing row is (x,y) or (y,x);
        -- whichever it is, output the swapped pair:
        (CASE WHEN t.a = q.x THEN q.y ELSE q.x END),
        (CASE WHEN t.b = q.x THEN q.y ELSE q.x END)
    FROM
        test t,
        depesz_temp q
    WHERE
        -- plain column comparisons, so the indexes on (a,b) and (b,a) can be used
        (t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y)

you might be tempted to reuse the least()/greatest() trick in the where condition of the above insert/select, but if you did – the indexes on (a,b) and (b,a) wouldn't be used, so it would become really slow 🙂
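
for illustration – the tempting (but slow) variant would look something like this; because the comparison is on expressions, not on the bare columns, the plain btree indexes can't be used for it:

SELECT t.a, t.b
FROM test t, depesz_temp q
WHERE (least(t.a, t.b), greatest(t.a, t.b)) = (q.x, q.y);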

6 thoughts on “finding missing pairs”

  1. insert into test (a, b)
    select min(b), min(a)
    from test
    group by least(a,b), greatest(a,b)
    having count(*) = 1

    but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have a “a<b” check constraint trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).

  2. Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.

    Anyway, that’s still a neat trick.

  3. @Andreas:
    it’s an interesting idea, but then when searching for rows where some given “id” is in any of the fields – it becomes complicated.

    @Rod Taylor:
    these are connections between nodes in some kind of graph (not directional).
    having both of the rows in the database simplifies searches (a lot).

  4. @depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated

    No, you would just use the view, which makes it look like both rows are stored.

  5. @Andreas:
    complicated not as in “for a human” but “for the computer”.
    it becomes *two* index scans instead of *one*.

  6. I like
    select a,b from test
    except
    select b,a from test

    Anyway: is your solution really faster? I think that such a group by must (for large tables) be building some kind of index for easy locating of the groups, which would be slower.
