Waiting for 8.4 – hash based DISTINCT

Today Tom Lane committed patch which gives DISTINCT ability to use hash aggregate – just like GROUP BY.

Log message:

Improve SELECT DISTINCT TO consider hash aggregation, AS well AS sort/uniq,
AS methods FOR implementing the DISTINCT step.  This eliminates the former
performance gap BETWEEN DISTINCT AND GROUP BY, AND also makes it possible
TO do SELECT DISTINCT ON datatypes that ONLY support hashing NOT sorting.
 
SELECT DISTINCT ON IS still always implemented BY sorting; it would take
executor changes TO support hashing that, AND it's not clear it's worth
the trouble.
 
This IS a release-note-worthy incompatibility FROM previous PG versions,
since SELECT DISTINCT can no longer be counted ON TO deliver sorted output
WITHOUT explicitly saying ORDER BY.  (Anyone who can't cope with that
can consider turning off enable_hashagg.)
 
Several regression test queries needed to have ORDER BY added to preserve
stable output order.  I fixed the ones that manifested here, but there
might be some other cases that show up on other platforms.

What exactly it gives you?

It's pretty simple. Let's create simple test table:

# CREATE TABLE x (i int4);
CREATE TABLE
# INSERT INTO x (i) SELECT random() * 100000 FROM generate_series(1,150000);
INSERT 0 150000

Now, in version of PostgreSQL without this patch, I run (2 times to get better results with filled cache):

# EXPLAIN analyze SELECT DISTINCT i FROM x;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 UNIQUE  (cost=15789.25..16496.05 ROWS=200 width=4) (actual TIME=623.208..1177.344 ROWS=77774 loops=1)
   ->  Sort  (cost=15789.25..16142.65 ROWS=141360 width=4) (actual TIME=623.203..872.388 ROWS=150000 loops=1)
         Sort KEY: i
         Sort Method:  external MERGE  Disk: 2328kB
         ->  Seq Scan ON x  (cost=0.00..2002.60 ROWS=141360 width=4) (actual TIME=0.051..224.693 ROWS=150000 loops=1)
 Total runtime: 1280.360 ms
(6 ROWS)
 
# EXPLAIN analyze SELECT DISTINCT i FROM x;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 UNIQUE  (cost=15789.25..16496.05 ROWS=200 width=4) (actual TIME=555.254..1111.487 ROWS=77774 loops=1)
   ->  Sort  (cost=15789.25..16142.65 ROWS=141360 width=4) (actual TIME=555.251..805.267 ROWS=150000 loops=1)
         Sort KEY: i
         Sort Method:  external MERGE  Disk: 2328kB
         ->  Seq Scan ON x  (cost=0.00..2002.60 ROWS=141360 width=4) (actual TIME=0.020..199.417 ROWS=150000 loops=1)
 Total runtime: 1208.449 ms
(6 ROWS)

As you can see we get time in around 1.2s, and DISTINCT clearly works by sorting data and then doing “unique" over them.

But what happens after I apply this patch?

# EXPLAIN analyze SELECT DISTINCT i FROM x;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2356.00..2358.00 ROWS=200 width=4) (actual TIME=475.303..603.042 ROWS=77774 loops=1)
   ->  Seq Scan ON x  (cost=0.00..2002.60 ROWS=141360 width=4) (actual TIME=0.059..213.981 ROWS=150000 loops=1)
 Total runtime: 700.063 ms
(3 ROWS)
 
# EXPLAIN analyze SELECT DISTINCT i FROM x;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2356.00..2358.00 ROWS=200 width=4) (actual TIME=465.973..576.127 ROWS=77774 loops=1)
   ->  Seq Scan ON x  (cost=0.00..2002.60 ROWS=141360 width=4) (actual TIME=0.019..205.799 ROWS=150000 loops=1)
 Total runtime: 670.394 ms
(3 ROWS)

Pretty cool. The time was cut in half. Of course – it will never be very fast –
to make it very fast I should have made dictionary table (possibly with updates
on triggers), but it is definitely worthy addition to PostgreSQL code.

2 thoughts on “Waiting for 8.4 – hash based DISTINCT”

  1. Hi, what do you mean with dictionary table to make the distinct-query faster? Some kind of preprocessing?
    Another question I´ve got is, how good the hash-function is to avoid collisions, while inserting the groups to the buckets?

  2. @Frank: yes. triggers on insert/update/delete on base table which generate dictionary table.

    As for your other question – I have no idea. You can try to ask on pg-hackers list – it never bothered me.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.