# finding missing pairs

let's assume we have a simple table:

```
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (a, b)
    "q" UNIQUE, btree (b, a)
```

now, let's insert some rows to it:

```
# INSERT INTO test SELECT * FROM generate_series(1,100) AS a, generate_series(1,100) AS b;
INSERT 0 10000
```

remove rows with a = b:

```
# DELETE FROM test WHERE a = b;
DELETE 100
```

and prepare test-case by randomly removing some rows:

```
# DELETE FROM test WHERE random() < 0.002;
DELETE 17
```

the question is: find all rows (a,b) for which there is no row (a',b') such that a' = b and b' = a.

in other words – every row (a,b) should be paired. a row with a = 2 and b = 3 is paired by a row with a = 3 and b = 2.

how to find incomplete pairs?
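to make the pairing rule concrete, here's a tiny sketch (plain Python rather than SQL, with made-up sample data) that finds unpaired rows by checking, for each (a,b), whether its mirror (b,a) exists:

```python
# Sample rows; (3, 4) has no mirror (4, 3), everything else is paired.
rows = {(1, 2), (2, 1), (3, 4), (5, 6), (6, 5)}

# A row (a, b) is incomplete when the mirrored row (b, a) is missing.
unpaired = {(a, b) for (a, b) in rows if (b, a) not in rows}
print(unpaired)  # {(3, 4)}
```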

the basic approach would be to do something like:

`SELECT * FROM test t WHERE NOT EXISTS (SELECT * FROM test t2 WHERE (t2.a, t2.b) = (t.b, t.a))`

looks reasonable?

no. unfortunately no.

execution plan for this query consists of 1 seq scan of the test table, plus a nested index scan for every row of it.

does it look bad? with 10k rows, not really – after all, 1 seq scan over 10k rows plus 10k index scans is doable.

but what if you have 70 million rows in the table?

in this situation it simply becomes 70 million index scans. a quick test showed that (out of 10 tries) i got:

• minimal time: 0.047ms
• maximal time: 215ms
• average time: around 0.9-1ms.

full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), the index scans alone add up to nearly 1 hour. with an average time of 1ms, total time escalates to … almost 20 hours!
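the back-of-envelope arithmetic, as a quick sanity check (plain Python, nothing Postgres-specific):

```python
scans = 70_000_000

# convert per-scan milliseconds to total hours
best = scans * 0.047 / 1000 / 3600  # hours at 0.047 ms per index scan
avg = scans * 1.0 / 1000 / 3600     # hours at 1 ms per index scan

print(round(best, 1))  # 0.9  -> "nearly 1 hour"
print(round(avg, 1))   # 19.4 hours
```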

oops. now, i don't have that much time for the task 🙂

better options?

perhaps left outer join approach?

`SELECT t1.* FROM test t1 LEFT OUTER JOIN test t2 ON (t1.a, t1.b) = (t2.b, t2.a) WHERE t2.a IS NULL`

this becomes 2 sequential scans joined with a hash join (with the small 10k-row table it becomes a merge join of 2 index scans, which still read the full table).
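the outer-join variant can be rehearsed on toy data with SQLite via Python's sqlite3 module (a sketch with made-up sample rows; the row-value join condition is spelled out as two plain equalities, which older SQLite versions require):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (3, 4) is the only row without its mirror (4, 3)
con.executemany("INSERT INTO test VALUES (?, ?)",
                [(1, 2), (2, 1), (3, 4), (5, 6), (6, 5)])

rows = con.execute("""
    SELECT t1.a, t1.b
    FROM test t1
    LEFT OUTER JOIN test t2 ON t1.a = t2.b AND t1.b = t2.a
    WHERE t2.a IS NULL
""").fetchall()
print(rows)  # [(3, 4)]
```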

better, but can it be done faster?

actually yes:

`SELECT least(a,b), greatest(a,b), COUNT(*) FROM test GROUP BY least(a,b), greatest(a,b) HAVING COUNT(*) = 1;`

this query does only 1 table scan, with hash aggregate. time is unbeatable 🙂

how does it work?

basically i thought about finding a way to tell postgresql that pairs (12,50) and (50,12) are “the same”, so i could use grouping.

there is a simple way to do it – using the least() and greatest() functions. least() always returns the smaller number, and greatest() the bigger one.

so, for both (12,50) and (50,12), the output of (least(), greatest()) is (12,50)!

now that i have a way to “group it”, i can use the standard “having” trick to filter only the groups that contain just 1 row.
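the whole trick can be tried on toy data with SQLite via Python's sqlite3 module – a sketch, with the caveat that SQLite spells least()/greatest() as its two-argument min()/max():

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (4, 3) is missing, so the (3, 4) pair is incomplete
con.executemany("INSERT INTO test VALUES (?, ?)",
                [(1, 2), (2, 1), (3, 4), (5, 6), (6, 5)])

# SQLite's two-argument min()/max() stand in for least()/greatest()
groups = con.execute("""
    SELECT min(a, b), max(a, b), count(*)
    FROM test
    GROUP BY min(a, b), max(a, b)
    HAVING count(*) = 1
""").fetchall()
print(groups)  # [(3, 4, 1)]
```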

effect – total execution time of 5 minutes 🙂 for 70 million rows.

and how to add missing pairs?

first, let's copy the group info to temp table:

```
CREATE temp TABLE depesz_temp AS
    SELECT least(a,b) AS x, greatest(a,b) AS y, COUNT(*)
    FROM test
    GROUP BY least(a,b), greatest(a,b)
    HAVING COUNT(*) = 1;
```

now – we need to find which row exactly is there, and insert its pair.

to make the story short – full insert looks like this:

```
INSERT INTO test (a,b)
SELECT
    (CASE WHEN t.a = q.x THEN q.y ELSE q.x END),
    (CASE WHEN t.b = q.x THEN q.y ELSE q.x END)
FROM
    test t,
    depesz_temp q
WHERE
    (t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y);
```

you might be tempted to reuse the least/greatest trick in the where condition of the above insert/select, but if you did – the indexes on (a,b) and (b,a) wouldn't be used, so it would become really slow 🙂
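the full repair can be rehearsed end-to-end on toy data with SQLite via Python's sqlite3 module (a sketch: two-argument min()/max() stand in for least()/greatest(), and the row-value comparison is expanded into plain equalities):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (4, 3) is missing, so the (3, 4) pair is incomplete
con.executemany("INSERT INTO test VALUES (?, ?)",
                [(1, 2), (2, 1), (3, 4), (5, 6), (6, 5)])

# step 1: collect the groups that contain only one row
con.execute("""
    CREATE TEMP TABLE depesz_temp AS
    SELECT min(a, b) AS x, max(a, b) AS y, count(*) AS cnt
    FROM test
    GROUP BY min(a, b), max(a, b)
    HAVING count(*) = 1
""")

# step 2: insert the mirror of each lonely row
con.execute("""
    INSERT INTO test (a, b)
    SELECT
        CASE WHEN t.a = q.x THEN q.y ELSE q.x END,
        CASE WHEN t.b = q.x THEN q.y ELSE q.x END
    FROM test t, depesz_temp q
    WHERE (t.a = q.x AND t.b = q.y) OR (t.b = q.x AND t.a = q.y)
""")

# every row should now have its mirror
missing = con.execute("""
    SELECT count(*) FROM test t1
    WHERE NOT EXISTS (SELECT 1 FROM test t2
                      WHERE t2.a = t1.b AND t2.b = t1.a)
""").fetchone()[0]
print(missing)  # 0
```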

## 6 thoughts on “finding missing pairs”

1. Anders says:

insert into test (a, b)
select min(b), min(a)
from test
group by least(a,b), greatest(a,b)
having count(*) = 1

but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have an “a<b” check constraint trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).

2. Rod Taylor says:

Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.

Anyway, that’s still a neat trick.

3. depesz says:

@Anders:
it’s an interesting idea, but then when searching for rows where some given “id” is in any of the fields – it becomes complicated.

@Rod Taylor:
these are connections between nodes in some kind of graph (not directional).
having both of the rows in the database simplifies searches a lot.

4. Anders says:

@depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated

No, you would just use the view, which makes it look like both rows are stored.

5. depesz says:

@Anders:
complicated not as “for human” but “for computer”.
it becomes *two* index scans instead of *one*

6. Marcin Mańk says:

I like

select a,b from test
except
select b,a from test

Anyway: is your solution really faster? I think that, for large tables, such a group by must be building some kind of index to locate the groups easily, which would be slower.