October 12th, 2007 by depesz

let's assume we have a simple table:

     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
Indexes:
    "test_pkey" PRIMARY KEY, btree (a, b)
    "q" UNIQUE, btree (b, a)

now, let's insert some rows to it:

# insert into test select * from generate_series(1,100) as a, generate_series(1,100) as b;
INSERT 0 10000

remove rows with a = b:

# delete from test where a = b;
DELETE 100

and prepare test-case by randomly removing some rows:

# delete from test where random() < 0.002;
DELETE 17

the question is – find all rows (a,b) for which there is no row (a',b') such that a' = b and b' = a.

in other words – every row (a,b) should be paired: the row with a = 2 and b = 3 is paired by the row with a = 3 and b = 2.

how to find incomplete pairs?

the basic approach would be to do something like:

select * from test t where not exists (select * from test t2 where (t2.a, t2.b) = (t.b, t.a))

looks reasonable?

no. unfortunately no.

the execution plan for this query consisted of 1 seq scan of the test table, plus a nested index scan for every row of it.
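
you can check the plan shape yourself:

explain select * from test t where not exists (select * from test t2 where (t2.a, t2.b) = (t.b, t.a));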

does it look bad? with 10k rows, it is not really bad – after all, 1 seq scan over 10k rows, plus 10k index scans – it's doable.

but what if you have 70 million rows in the table?

in this situation it becomes simply 70 million index scans. a quick test of single index scan times (out of 10 tries) showed:

  • minimal time: 0.047ms
  • maximal time: 215ms
  • average time: around 0.9-1ms.

full seq scan of the table took about 40 seconds. assuming the best case (0.047ms per index scan), 70 million index scans add up to nearly 1 hour (70,000,000 × 0.047ms ≈ 3,300 seconds). with the average time of 1ms, total time escalates to … almost 20 hours (70,000 seconds)!

oops. now, i don't have that much time for the task 🙂

better options?

perhaps a left outer join approach?

select t1.* from test t1 left outer join test t2 on (t1.a, t1.b) = (t2.b, t2.a) where t2.a is null

this becomes 2 sequential scans joined with a hash join (with the small 10k-row table it becomes a merge join of 2 index scans, each covering the full table).

better, but can it be done faster?

actually yes:

select least(a,b), greatest(a,b), count(*) from test group by least(a,b), greatest(a,b) having count(*) = 1;

this query does only 1 table scan, with hash aggregate. time is unbeatable 🙂

how does it work?

basically, i thought about finding a way to tell postgresql that the pairs (12,50) and (50,12) are “the same”, so i could use grouping.

there is a simple way to do it – using the least() and greatest() functions. least() always returns the smaller, and greatest() the bigger number.

so, for both (12,50) and (50,12), the output of (least(), greatest()) is (12,50)!
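
a quick check:

# select least(12,50), greatest(12,50);
 least | greatest
-------+----------
    12 |       50
(1 row)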

now that i have a way to “group” the pairs, i can use the standard “having” trick to filter only the groups that contain just 1 row.

effect – total time of execution for 70 million rows – 5 minutes 🙂

and how to add missing pairs?

first, let's copy the group info to a temp table:

CREATE TEMP TABLE depesz_temp AS
    SELECT least(a,b) AS x, greatest(a,b) AS y, count(*)
    FROM test
    GROUP BY least(a,b), greatest(a,b)
    HAVING count(*) = 1;

now – we need to find which row of each pair actually exists, and insert its missing counterpart.

to make a long story short – the full insert looks like this:

INSERT INTO test (a,b)
    SELECT
        -- the existing row matches its group either as (x,y) or as (y,x);
        -- in both cases the CASE expressions swap the values, producing
        -- the missing mirrored row:
        (CASE WHEN t.a = q.x THEN q.y ELSE q.x END),
        (CASE WHEN t.b = q.x THEN q.y ELSE q.x END)
    FROM
        test t,
        depesz_temp q
    WHERE
        (t.a, t.b) = (q.x, q.y) OR (t.b, t.a) = (q.x, q.y);

you might be tempted to reuse the least/greatest trick in the where condition of the above insert/select, but if you did – the indexes on (a,b) and (b,a) couldn't be used, so it would become really slow 🙂
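
for illustration, here is a sketch of the variant to avoid – with least()/greatest() applied to the table's columns, the plain btree indexes on (a,b) and (b,a) can no longer be used for the lookups:

-- slow variant (sketch) – expressions on t.a/t.b defeat the btree indexes
SELECT t.*
FROM test t, depesz_temp q
WHERE (least(t.a, t.b), greatest(t.a, t.b)) = (q.x, q.y);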

comments

  1. # Anders
    Oct 12, 2007

    insert into test (a, b)
    select min(b), min(a)
    from test
    group by least(a,b), greatest(a,b)
    having count(*) = 1

    but test really should be a view (select a, b from test2 union all select b, a from test2) and test2 should have a “a<b” check constraint trigger that would insert (b,a) if an attempt was made to insert (a,b) where not (a<b).

  2. # Rod Taylor
    Oct 12, 2007

    Makes one wonder why both entries are stored, instead of storing one of them and generating the second on the fly.

    Anyway, that’s still a neat trick.

  3. # depesz
    Oct 12, 2007

    @Anders:
    it's an interesting idea, but then searching for rows where some given “id” is in either of the fields becomes complicated.

    @Rod Taylor:
    these are connections between nodes of some kind of graph (not directional).
    having both rows in the database simplifies searches (a lot).

  4. # Anders
    Oct 13, 2007

    @depesz: Re: searching for rows where some given “id” is in any of the fields – it becomes complicated

    No, you would just use the view, which makes it look like both rows are stored.

  5. # depesz
    Oct 15, 2007

    @Anders:
    complicated not as in “for a human” but “for the computer”.
    it becomes *two* index scans instead of *one*.

  6. # Marcin Mańk
    Nov 2, 2007

    I like
    select a,b from test
    except
    select b,a from test

    Anyway: Is Your solution really faster? I think that such group by must (for large tables) be building some kind of index for easy locating of the groups, which would be slower.
