finding missing pairs

let's assume we have a simple table:

     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (a, b)
    "q" UNIQUE, btree (b, a)

now, let's insert some rows into it (a cross join of two generate_series() calls, giving 100 × 100 = 10,000 rows):

# INSERT INTO test SELECT * FROM generate_series(1,100) AS a, generate_series(1,100) AS b;
INSERT 0 10000

remove rows with a = b:

# DELETE FROM test WHERE a = b;
DELETE 100

and prepare the test case by randomly removing some rows:

# DELETE FROM test WHERE random() < 0.002;
DELETE 17

the question is – find all rows (a,b) for which there is no row (a',b') such that a' = b and b' = a.

in other words – every row (a,b) should be paired. the row with a = 2 and b = 3 is paired by the row with a = 3 and b = 2.

how to find incomplete pairs?

the basic approach would be to do something like:

SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a))

looks reasonable?

no. unfortunately no.

the execution plan for this query consisted of 1 seq scan of the test table, plus a nested index scan for every row of it.
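
(you can inspect the plan yourself; the exact shape will depend on postgresql version and configuration:)

EXPLAIN SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a));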

does it look bad? with 10k rows, it is not really bad – after all, 1 seq scan over 10k rows, plus 10k index scans – it's doable.

but what if you have 70 million rows in the table?

in this situation it simply becomes 70 million index scans. a quick test of single index scans showed that (out of 10 tries) i got:

  • minimal time: 0.047ms
  • maximal time: 215ms
  • average time: around 0.9-1ms.

a full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), the index scans alone add up to nearly 1 hour. with an average time of 1ms, the total escalates to … almost 20 hours!
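
the math, for the record:

70,000,000 * 0.047ms ≈ 3,290s ≈ 55 minutes
70,000,000 * 1ms     = 70,000s ≈ 19.5 hours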

oops. now, i don't have that much time for the task 🙂

better options?

perhaps a left outer join approach?

SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL

this becomes 2 sequential scans joined with a hash join (with the small 10k-row table it becomes a merge join of 2 full-table index scans).
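
(again, explain on this query will show it; the plan shape depends on table size and settings:)

EXPLAIN SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL;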

better, but can it be done faster?

actually yes:

SELECT least(a,b), greatest(a,b), COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;

this query does only 1 table scan, with a hash aggregate. the time is unbeatable 🙂

how does it work?

basically i thought about finding a way to tell postgresql that the pairs (12,50) and (50,12) are “the same”, so i could use grouping.

there is a simple way to do it – using the least() and greatest() functions. least() always returns the smaller, and greatest() the bigger number.

so, for both (12,50) and (50,12), the output of (least(), greatest()) is (12,50)!
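
quick check:

SELECT least(12,50), greatest(12,50); -- returns 12, 50
SELECT least(50,12), greatest(50,12); -- returns 12, 50 as well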

now that i have a way to “group it”, i can use the standard “having” trick to filter only the groups that contain just 1 row – a complete pair forms a group of 2 rows, so a group of 1 row means its partner is missing.

effect – total time of execution – 5 minutes 🙂 for 70 million rows.

and how to add missing pairs?

first, let's copy the group info to a temp table:

CREATE TEMP TABLE depesz_temp AS
SELECT least(a,b) AS x, greatest(a,b) AS y, COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;

now – we need to find what row exactly is in each incomplete group, and insert its pair.

to make a long story short – the full insert looks like this:

INSERT INTO test (a,b)
    SELECT
        -- the group's single existing row is (x,y) or (y,x);
        -- whichever it is, output the swapped pair:
        (CASE WHEN t.a = q.x THEN q.y ELSE q.x END),
        (CASE WHEN t.b = q.x THEN q.y ELSE q.x END)
    FROM
        test t,
        depesz_temp q
    WHERE
        -- plain column comparisons, so the indexes on (a,b) and (b,a) can be used
        (t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y)

you might be tempted to reuse the least()/greatest() trick in the where condition of the above insert/select, but if you did – the indexes on (a,b) and (b,a) wouldn't be used, so it would become really slow 🙂
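
for illustration – the tempting (but slow) variant would look something like this; because the comparison is on expressions, not on the bare columns, the plain btree indexes can't be used for it:

SELECT t.a, t.b
FROM test t, depesz_temp q
WHERE (least(t.a, t.b), greatest(t.a, t.b)) = (q.x, q.y);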

6 thoughts on “finding missing pairs”

  1. insert into test (a, b)
    select min(b), min(a)
    from test
    group by least(a,b), greatest(a,b)
    having count(*) = 1

    but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have a “a<b” check constraint trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).

  2. Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.

    Anyway, that’s still a neat trick.

  3. @Andreas:
    it’s an interesting idea, but then when searching for rows where some given “id” is in any of the fields – it becomes complicated.

    @Rod Taylor:
    these are connections between nodes in some kind of graph (not directional).
    having both of the rows in the database simplifies searches (a lot).

  4. @depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated

    No, you would just use the view, which makes it look like both rows are stored.

  5. @Andreas:
    complicated not as in “for a human” but “for the computer”.
    it becomes *two* index scans instead of *one*.

  6. I like
    select a,b from test
    except
    select b,a from test

    Anyway: is your solution really faster? I think that such a group by must (for large tables) be building some kind of index for easy locating of the groups, which would be slower.
