October 12th, 2007 by depesz | Tags: | 7 comments »

let's assume we have a simple table:

     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (a, b)
    "q" UNIQUE, btree (b, a)

now, let's insert some rows to it:

# insert into test select * from generate_series(1,100) as a, generate_series(1,100) as b;
INSERT 0 10000

remove rows with a = b:

# delete from test where a = b;
DELETE 100

and prepare test-case by randomly removing some rows:

# delete from test where random() < 0.002;
DELETE 17

the question is - find all rows (a,b) for which there is no row (a',b') with a' = b and b' = a.

in other words - every row (a,b) should be paired. a row with a = 2 and b = 3 is paired by the row with a = 3 and b = 2.

how to find incomplete pairs?

the basic approach would be to do something like:

select * from test t where not exists (select * from test t2 where (t2.a, t2.b) = (t.b, t.a))

looks reasonable?

no. unfortunately no.

the execution plan for this query consists of 1 seq scan of the test table, plus an index scan for every row of it.

does it look bad? with 10k rows, it is not really bad - after all, 1 seq scan over 10k rows, plus 10k index scans - it's doable.

but what if you have 70 million rows in the table?

in this situation it simply turns into 70 million index scans. a quick test of single index scan times showed that (out of 10 tries) i got:

  • minimal time: 0.047ms
  • maximal time: 215ms
  • average time: around 0.9-1ms.

full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), the index scans alone add up to nearly 1 hour. with the average time of 1ms per scan, the total climbs to almost 20 hours!

oops. now, i don't have that much time for the task 🙂

better options?

perhaps left outer join approach?

select t1.* from test t1 left outer join test t2 on (t1.a, t1.b) = (t2.b, t2.a) where t2.a is null

this becomes 2 sequential scans joined with a hash join (with the small 10k-row table it becomes a merge join of 2 index scans - but over the full table).
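both formulations can be sanity-checked on a toy dataset. a minimal sketch using SQLite through Python's sqlite3 module (SQLite stands in for PostgreSQL here; the sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (1,2)/(2,1) is a complete pair; (5,6) is missing its mirror (6,5).
conn.executemany("INSERT INTO test (a, b) VALUES (?, ?)",
                 [(1, 2), (2, 1), (5, 6)])

# naive NOT EXISTS version
not_exists = conn.execute("""
    SELECT t.a, t.b FROM test t
    WHERE NOT EXISTS (SELECT 1 FROM test t2 WHERE t2.a = t.b AND t2.b = t.a)
""").fetchall()

# LEFT OUTER JOIN version: unmatched rows surface with NULLs on the right side
left_join = conn.execute("""
    SELECT t1.a, t1.b
    FROM test t1
    LEFT OUTER JOIN test t2 ON t1.a = t2.b AND t1.b = t2.a
    WHERE t2.a IS NULL
""").fetchall()

print(not_exists, left_join)  # both find the unpaired (5, 6)
```

both queries return the same rows; they differ only in the plans the database can choose for them.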

better, but can it be done faster?

actually yes:

select least(a,b), greatest(a,b), count(*) from test group by least(a,b), greatest(a,b) having count(*) = 1;

this query does only 1 table scan, with hash aggregate. time is unbeatable 🙂

how does it work?

basically i thought about finding a way to tell postgresql that the pairs (12,50) and (50,12) are "the same", so i could use grouping.

there is a simple way to do it - using the least() and greatest() functions. least() always returns the smaller, and greatest() the bigger of its arguments.

so, for both (12,50) and (50,12), the output of (least(a,b), greatest(a,b)) is (12,50)!

now that i have a way to "group it", i can use the standard "having" trick to filter only the groups that contain just 1 row.

the effect - total execution time of 5 minutes 🙂 for 70 million rows.
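the grouping trick can be sketched on a small dataset. note that SQLite (used here via Python's sqlite3 module as a stand-in for PostgreSQL) has no least()/greatest() - its two-argument scalar min()/max() do the same job, so the query structure carries over:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (1,2)/(2,1) and (3,4)/(4,3) are complete pairs; (5,6) lacks its mirror.
conn.executemany("INSERT INTO test (a, b) VALUES (?, ?)",
                 [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)])

# two-argument min()/max() in SQLite behave like least()/greatest():
# both (a,b) and (b,a) normalize to the same (smaller, bigger) group key,
# so complete pairs form groups of 2 and unpaired rows form groups of 1.
rows = conn.execute("""
    SELECT min(a, b), max(a, b), count(*)
    FROM test
    GROUP BY min(a, b), max(a, b)
    HAVING count(*) = 1
""").fetchall()
print(rows)  # -> [(5, 6, 1)]: only the unpaired row's group survives
```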

and how to add missing pairs?

first, let's copy the group info to temp table:

CREATE temp TABLE depesz_temp as
select least(a,b) as x, greatest(a,b) as y, count(*) from test group by least(a,b), greatest(a,b) having count(*) = 1

now - we need to find which row of each incomplete pair actually exists, and insert its counterpart.

to make the story short - full insert looks like this:

INSERT INTO test (a,b)
SELECT
(case when t.a = q.x THEN q.y ELSE q.x END),
(case when t.b = q.x THEN q.y ELSE q.x END)
FROM
test t,
depesz_temp q
WHERE
(t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y)

you might be tempted to reuse the least/greatest trick in the where condition of the above insert/select, but if you did - the indexes on (a,b) and (b,a) wouldn't be used, so it would become really slow 🙂
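the whole repair - group, stash the incomplete groups, insert the mirrors - can be sketched end-to-end (again with SQLite via Python's sqlite3 as a stand-in for PostgreSQL; the temp-table name follows the article, the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER NOT NULL, b INTEGER NOT NULL)")
# (1,2)/(2,1) is complete; (5,6) and (8,7) are each missing their mirror.
conn.executemany("INSERT INTO test (a, b) VALUES (?, ?)",
                 [(1, 2), (2, 1), (5, 6), (8, 7)])

# step 1: collect incomplete groups into a temp table (x <= y by construction;
# SQLite's two-argument min()/max() stand in for least()/greatest())
conn.execute("""
    CREATE TEMP TABLE depesz_temp AS
    SELECT min(a, b) AS x, max(a, b) AS y
    FROM test
    GROUP BY min(a, b), max(a, b)
    HAVING count(*) = 1
""")

# step 2: join back to the surviving row of each group and insert its mirror;
# the row-value comparison is spelled out with AND/OR for older SQLite versions
conn.execute("""
    INSERT INTO test (a, b)
    SELECT
        CASE WHEN t.a = q.x THEN q.y ELSE q.x END,
        CASE WHEN t.b = q.x THEN q.y ELSE q.x END
    FROM test t, depesz_temp q
    WHERE (t.a = q.x AND t.b = q.y) OR (t.b = q.x AND t.a = q.y)
""")

rows = sorted(conn.execute("SELECT a, b FROM test").fetchall())
print(rows)  # -> [(1, 2), (2, 1), (5, 6), (6, 5), (7, 8), (8, 7)]
```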

7 comments

  2. # Anders
    Oct 12, 2007

    insert into test (a, b)
    select min(b), min(a)
    from test
    group by least(a,b), greatest(a,b)
    having count(*) = 1

    but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have a “a<b” check constraint trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).

  4. # Rod Taylor
    Oct 12, 2007

    Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.

    Anyway, that’s still a neat trick.

  5. # depesz
    Oct 12, 2007

    @Anders:
    it’s an interesting idea, but then when searching for rows where some given “id” is in any of the fields – it becomes complicated.

    @Rod Taylor:
    these are connection between nodes on some kind of graph (not directional).
    having both of the rows in database simplifies (a lot) searches.

  6. # Anders
    Oct 13, 2007

    @depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated

    No, you would just use the view, which makes it look like both rows are stored.

  7. # depesz
    Oct 15, 2007

    @Anders:
    complicated not as “for human” but “for computer”.
    it becomes *two* index scans instead of *one*

  8. # Marcin Mańk
    Nov 2, 2007

    I like
    select a,b from test
    except
    select b,a from test

    Anyway: Is Your solution really faster? I think that such group by must (for large tables) be building some kind of index for easy locating of the groups, which would be slower.
