Counting number of distinct elements

When I was working for one of customers we found some strange thing. We needed to found number of distinct sessions per day. Table layout was very simple:

        Table "public.some_test_table"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 logdate      | date                   |
 sessionid    | character varying(32)  |
 ...

Basically, every session can appear many times during a day. The most basic approach was:

SELECT logdate, COUNT(DISTINCT sessionid)
FROM some_test_table
GROUP BY logdate;

Simple, and to the point. How does it work?

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=104478.04..109433.49 rows=272 width=39) (actual time=3558.844..10145.117 rows=301 loops=1)
   ->  Sort  (cost=104478.04..106128.72 rows=660273 width=39) (actual time=3558.663..5168.047 rows=660341 loops=1)
         Sort Key: logdate
         ->  Seq Scan on some_test_table  (cost=0.00..20339.73 rows=660273 width=39) (actual time=0.009..1184.484 rows=660341 loops=1)
 Total runtime: 10147.912 ms
(5 rows)

Not bad, but then the smart guy at the customers office suggested another approach (it was actually their original approach).

His approach was first to generate list of unique pairs date/sessionid, and then count sessions. Hmm .. I guessed that it will be slower, but let's check it:

SELECT logdate, COUNT(1)
FROM (
    SELECT logdate, sessionid
    FROM some_test_table
    GROUP BY logdate, sessionid
) AS x
GROUP BY logdate;

After all – we have 2 aggregates here. So it has to be slower. Right? Hmm .. wrong:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=27084.87..27087.37 rows=200 width=4) (actual time=4240.413..4240.964 rows=301 loops=1)
   ->  HashAggregate  (cost=23641.10..25018.60 rows=137751 width=39) (actual time=2900.614..3586.063 rows=353711 loops=1)
         ->  Seq Scan on some_test_table  (cost=0.00..20339.73 rows=660273 width=39) (actual time=0.044..1260.961 rows=660341 loops=1)
 Total runtime: 4248.956 ms
(4 rows)

Ouch. This approach was 2.5 times faster than mine.

I tried also a solution with distinct in inner query:

SELECT logdate, COUNT(1)
FROM (
  SELECT DISTINCT logdate, sessionid
  FROM some_test_table
) AS v
GROUP BY logdate;

But it went even worse than the count(distinct) approach:

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=104478.04..111498.85 rows=200 width=4) (actual time=6581.273..11958.277 rows=301 loops=1)
   ->  Unique  (cost=104478.04..109430.09 rows=137751 width=39) (actual time=6581.215..11310.329 rows=353711 loops=1)
         ->  Sort  (cost=104478.04..106128.72 rows=660273 width=39) (actual time=6581.210..9476.508 rows=660341 loops=1)
               Sort Key: some_test_table.logdate, some_test_table.sessionid
               ->  Seq Scan on some_test_table  (cost=0.00..20339.73 rows=660273 width=39) (actual time=0.013..1265.267 rows=660341 loops=1)
 Total runtime: 11961.482 ms
(6 rows)

Lessons learned:

  • Don't guess. Check the performance when seeing multiple approaches.
  • “distinct" is even more evil than I always assumed

2 thoughts on “Counting number of distinct elements”

  1. PostgreSQL surprises me alot. Subselect would appear to me the least efficient as it generates unoptimized sequence (in natural order) for scanning. On MSSQL (and few other databases I know), creating index on logdate would give good result. On Oracle, creating indexes on both logdate and sessionid would be even better. Only AS/400 gave better results with subselect, but this is not a real database server (only SQL interface to a flat-file database).

  2. This is just another illustration of the superiority of HashAggregate over the other similar operations that postgresql can run. You can get a lot of performance gains right now if you can switch queries to make use of HashAggregate rather than sort/distinct, but keep in mind this optimization is similar to the old IN vs. EXISTS optimizations..ie. the behavior could change as alternative code paths are optimized.

Comments are closed.