Getting count of distinct elements, per group, in PostgreSQL.

So, couple of days ago, some guy, from Periscope company wrote a blogpost about getting number of distinct elements, per group, faster using subqueries.

This was then submitted to Hacker News and r/Programming on Reddit.

Then, the original authors submitted second blogpost comparing speed between four different DB engines. Which, in turn, was also commented on Reddit.

I found the numbers presented by Periscope (as their improvement) as not that great.

Unfortunately – their blog doesn't allow for comments, so I decided to test it, and write on my own blog, what I can find about it.

First, what I gathered from their blogposts:

  • 2 tables: dashboards (with columns: name, id and possibly something else) and time_on_site_logs (with columns: user_id, dashboard_id, and possibly something else)
  • dashboards has 1200 rows
  • time_on_site_logs has 14 million rows
  • there is 1700 distinct (dashboard_id, user_id) pairs in time_on_site_logs

Since, I have no idea about table width (which can be important), I will make some guesses about number of columns that should be there. There is also no information about data distribution in time_on_site_logs (as in: average number of dashboards per user, average number of rows per user, and so on). So I'll have to improvise.

Dashboards:

$ CREATE TABLE dashboards (
    id serial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    created timestamptz NOT NULL DEFAULT now(),
    visited int8 NOT NULL DEFAULT 0
);
 
$ INSERT INTO dashboards (name, created, visited)
WITH x AS (
    SELECT now() - '1 year'::INTERVAL * random() AS c
    FROM generate_series(1,1200)
    ORDER BY c
)
SELECT
    'Dashboard #' || ROW_NUMBER() OVER (ORDER BY c),
    c,
    random() * 10000
FROM x;

This gives me data like these:

SELECT * FROM dashboards ORDER BY id DESC LIMIT 20;
  id  |      name       |            created            | visited 
------+-----------------+-------------------------------+---------
 1200 | Dashboard #1200 | 2014-01-29 09:16:01.894632+01 |    3264
 1199 | Dashboard #1199 | 2014-01-29 05:27:19.501032+01 |    5889
 1198 | Dashboard #1198 | 2014-01-29 02:52:03.939432+01 |     595
 1197 | Dashboard #1197 | 2014-01-28 11:22:32.134632+01 |    8283
 1196 | Dashboard #1196 | 2014-01-28 07:43:48.406632+01 |    5713
 1195 | Dashboard #1195 | 2014-01-28 03:16:50.537832+01 |    4082
 1194 | Dashboard #1194 | 2014-01-27 23:22:24.877032+01 |    7836
 1193 | Dashboard #1193 | 2014-01-27 11:16:10.765032+01 |    9849
 1192 | Dashboard #1192 | 2014-01-27 10:34:12.032232+01 |    7207
 1191 | Dashboard #1191 | 2014-01-27 10:34:10.304232+01 |    2657
 1190 | Dashboard #1190 | 2014-01-27 03:49:03.565032+01 |    1996
 1189 | Dashboard #1189 | 2014-01-26 08:06:21.257832+01 |    9084
 1188 | Dashboard #1188 | 2014-01-26 02:46:08.921832+01 |    9549
 1187 | Dashboard #1187 | 2014-01-25 21:19:41.869032+01 |    7073
 1186 | Dashboard #1186 | 2014-01-25 15:07:59.264232+01 |    9089
 1185 | Dashboard #1185 | 2014-01-25 14:12:43.750632+01 |    3228
 1184 | Dashboard #1184 | 2014-01-25 10:03:10.198632+01 |    9176
 1183 | Dashboard #1183 | 2014-01-25 07:31:07.395432+01 |    5257
 1182 | Dashboard #1182 | 2014-01-25 01:29:44.710632+01 |     633
 1181 | Dashboard #1181 | 2014-01-24 21:51:53.773032+01 |    9663
(20 ROWS)

Now, for the time_on_site_logs table:

$ CREATE TABLE time_on_site_logs (
    id serial PRIMARY KEY,
    dashboard_id int4 NOT NULL REFERENCES dashboards (id),
    user_id int4 NOT NULL,
    session_started timestamptz NOT NULL,
    session_time INTERVAL NOT NULL
);

Schema is ready. Now for the data. Since I have no idea about data distribution or anything like this, let's just create 1700 random dashboard/user pairs, and repeat the data 14000000/1700 times, and call it a day:

INSERT INTO time_on_site_logs (dashboard_id, user_id, session_started, session_time)
WITH x AS (
    SELECT d, u
    FROM generate_series(1,1200) AS d, generate_series(1,10) AS u
    ORDER BY random() LIMIT 1700
)
SELECT
    d,
    u,
    now() - random() * '2 years'::INTERVAL,
    random() * '30 minutes'::INTERVAL
FROM
    x,
    generate_series(1,8236) AS q;

Stats afterwards:

WITH x AS (
    SELECT dashboard_id, user_id, COUNT(*) FROM time_on_site_logs
    GROUP BY dashboard_id, user_id
)
SELECT SUM(COUNT) AS all_rows,
    COUNT(*) AS distinct_pairs
FROM x;
 all_rows | distinct_pairs 
----------+----------------
 14001200 |           1700
(1 ROW)

Some of the comments (very ill-informed, in my opinion) on Reddit suggested that the speed of the queries (as shown in original blogposts) depends on indexes, and the fact that the query plan showed seq scans means that there are no indexes.

I don't believe this explanation, so let me add some additional indexes:

$ CREATE INDEX i1 ON time_on_site_logs  (dashboard_id);
CREATE INDEX
$ CREATE INDEX i2 ON time_on_site_logs  (user_id);
CREATE INDEX
$ CREATE INDEX i3 ON time_on_site_logs  (dashboard_id, user_id);
CREATE INDEX
$ CREATE INDEX i4 ON time_on_site_logs  (user_id, dashboard_id);
CREATE INDEX

Now tables look like this:

$ \d dashboards
                                  TABLE "public.dashboards"
 COLUMN  |           TYPE           |                        Modifiers                        
---------+--------------------------+---------------------------------------------------------
 id      | INTEGER                  | NOT NULL DEFAULT NEXTVAL('dashboards_id_seq'::regclass)
 name    | text                     | NOT NULL
 created | TIMESTAMP WITH TIME zone | NOT NULL DEFAULT now()
 visited | BIGINT                   | NOT NULL DEFAULT 0
Indexes:
    "dashboards_pkey" PRIMARY KEY, btree (id)
    "dashboards_name_key" UNIQUE CONSTRAINT, btree (name)
Referenced BY:
    TABLE "time_on_site_logs" CONSTRAINT "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY (dashboard_id) REFERENCES dashboards(id)
 
$ \d time_on_site_logs
                                      TABLE "public.time_on_site_logs"
     COLUMN      |           TYPE           |                           Modifiers                            
-----------------+--------------------------+----------------------------------------------------------------
 id              | INTEGER                  | NOT NULL DEFAULT NEXTVAL('time_on_site_logs_id_seq'::regclass)
 dashboard_id    | INTEGER                  | NOT NULL
 user_id         | INTEGER                  | NOT NULL
 session_started | TIMESTAMP WITH TIME zone | NOT NULL
 session_time    | INTERVAL                 | NOT NULL
Indexes:
    "time_on_site_logs_pkey" PRIMARY KEY, btree (id)
    "i1" btree (dashboard_id)
    "i2" btree (user_id)
    "i3" btree (dashboard_id, user_id)
    "i4" btree (user_id, dashboard_id)
Foreign-KEY constraints:
    "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY (dashboard_id) REFERENCES dashboards(id)

That should handle all indexing needs for now 🙂

Now – Periscope guys wrote that they tested it on Amazon EC2 instance. I'm cheap, so I'm testing it on my desktop. So the numbers will not be directly comparable. But the ratios should be.

So, let's test the queries.

First, the naive approach:

SELECT
  dashboards.name,
  COUNT(DISTINCT time_on_site_logs.user_id)
FROM time_on_site_logs
JOIN dashboards ON time_on_site_logs.dashboard_id = dashboards.id
GROUP BY name
ORDER BY COUNT DESC;

I ran it 3 times, and got best result: 492 seconds.

Their (Periscope's) second approach was:

SELECT
  dashboards.name,
  log_counts.ct
FROM dashboards
JOIN (
  SELECT
    dashboard_id,
    COUNT(DISTINCT user_id) AS ct
  FROM time_on_site_logs 
  GROUP BY dashboard_id
) AS log_counts 
ON log_counts.dashboard_id = dashboards.id
ORDER BY log_counts.ct DESC;

Again, best of three runs was 23.1 second.

Third query by Periscope:

SELECT
  dashboards.name,
  log_counts.ct
FROM dashboards 
JOIN (
  SELECT distinct_logs.dashboard_id, 
  COUNT(1) AS ct
  FROM (
    SELECT DISTINCT dashboard_id, user_id
    FROM time_on_site_logs
  ) AS distinct_logs
  GROUP BY distinct_logs.dashboard_id
) AS log_counts 
ON log_counts.dashboard_id = dashboards.id
ORDER BY log_counts.ct DESC

Yielded best time of slighly less than 4.6s.

So far, we have:

Query Periscope test depesz test
Naive 348s 492s
Aggregate, Then Join 10.6s 23.1s
First, Reduce The Data Set 7.13s 4.6s

This wraps me re-testing what Periscope wrote.

Couple of comments:

  • to the person that said (on Reddit) that PostgreSQL cannot use Hash for count(distinct …) – well, true. But rewriting the query so that it will use hashes is trivial, as shown above
  • to the people saying that “Yet again they say the query plans include table scans, which imply either no indexes or an awful index design" – well, look at this – I have indexes on everything, yet PostgreSQL doesn't use them in such queries. You can of course provide information how do you think indexes can help us with these queries. With sample schema, data and queries.

But – all things said – the numbers are, in my opinion, still too large.

I mean – if I was to get how many rows there are in time_on_site_logs for every (dashboard/user) – sure. It can take a while, as it has to scan all of the table. But just getting distinct count? Especially so few rows (946 rows of output)? That's got to be optimizable.

Luckily, it is:

I can define a simple function which gives me list of distinct (dashboard_id, user_id):

CREATE OR REPLACE FUNCTION get_list_of_unique_pairs( OUT dashboard_id INT4, OUT user_id INT4 ) RETURNS setof record AS $$
DECLARE
    v_dash ALIAS FOR dashboard_id;
    v_user ALIAS FOR user_id;
BEGIN
    SELECT l.dashboard_id, l.user_id INTO v_dash, v_user
        FROM time_on_site_logs l ORDER BY l.dashboard_id, l.user_id LIMIT 1;
    LOOP
        EXIT WHEN NOT FOUND;
        RETURN NEXT;
        SELECT l.dashboard_id, l.user_id INTO v_dash, v_user
            FROM time_on_site_logs l
            WHERE (l.dashboard_id, l.user_id) > (v_dash, v_user )
            ORDER BY l.dashboard_id, l.user_id LIMIT 1;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

With this function in place, I can just:

WITH x AS (
    SELECT dashboard_id, COUNT(*)
    FROM get_list_of_unique_pairs()
    GROUP BY dashboard_id
)
SELECT
    d.name,
    x.count
FROM
    x JOIN dashboards d ON x.dashboard_id = d.id
ORDER BY x.count DESC;

This query runs faster than 4.6s. It runs faster than 2.13s. It runs faster than 1s. It runs in 39 ms (0.039s).

Of course – you can say that using functions is cheating. Well, it can be done without functions, but the query is more complicated then:

WITH recursive distinct_pairs AS (
    (
        SELECT l AS rl FROM time_on_site_logs l ORDER BY l.dashboard_id, l.user_id LIMIT 1
    )
    UNION ALL
    SELECT (
            SELECT l
            FROM time_on_site_logs l
            WHERE
                (l.dashboard_id, l.user_id) > ((p.rl).dashboard_id, (p.rl).user_id)
            ORDER BY l.dashboard_id, l.user_id LIMIT 1
        )
    FROM distinct_pairs p
    WHERE (p.rl).id IS NOT NULL
), unpacked_counts AS (
    SELECT (rl).dashboard_id, COUNT(*)
    FROM distinct_pairs
    WHERE (rl).dashboard_id IS NOT NULL
    GROUP BY (rl).dashboard_id
)
SELECT
    d.name,
    uc.count
FROM
    dashboards d
    JOIN unpacked_counts uc ON d.id = uc.dashboard_id
ORDER BY
    uc.count DESC;

This query (which I didn't write myself, RhodiumToad helped me a lot), runs in just a bit below 23ms.

Final comments:

  • I would really expect more from a company that writes: “We make … tool that makes SQL data analysis really fast." or “building a much faster way to do SQL data analysis." – sorry, but it can be done faster. Way faster.
  • The queries that I wrote are based on the fact that there is little (comparatively) distinct (dashboard_id, user_id) values. If there were more (for example – 50% of row count of time_on_site_logs rows) – it wouldn't work nicely.
  • Doing this kind of analysis should be done using rollup tables, which gather information in some side tables, and then you just query these side tables to get results. Otherwise – it might be fast for 14M rows, but for 14B rows, it will be slow again
  • The whole process, as shown in Periscope posts can be, relatively simply, parallelized – even in PostgreSQL

and finally – that was fun.

15 thoughts on “Getting count of distinct elements, per group, in PostgreSQL.”

  1. Well, for me at least the claims about “making SQL data analysis really fast” were more about making it “faster” for non-database-experts to answer business questions about data and not about how many seconds it takes to get that answer. Also the post that started it all was more about showing how to do a simple query optimization and the authors them-self believed it should work similar in all databases… But that aside.

    The whole story got me interested in what make it difficult for the original, naive query to be automatically transformed into logically equivalent (it is right?) final query as shown here? Like are there any information missing for the optimizer or what? Could you recommend any good entry level introduction to the topic?

  2. @Cc: not really. It’s a matter of PostgreSQL not being able to do “skip scan”. Not sure about other databases, but that’s about it.

  3. Not sure we understand each other. Initial query here takes 492s final less than a second, I’m not asking why the initial query is slow (to which “skip scans” are the answer I presume) but what prevents database for transforming one query into another – that works fast even with current db engine. And as shown in the second blogpost by the periscope guys every db tested could use that kind of query rewriting…

    So, what should I google to understand what is so hard about it?

  4. “I don’t believe this explanation, so let me ass some additional indexes…”

    That typo is fantastic.

  5. @Cc:
    well, what prevents dbs? Well, the fact that noone wrote such heuristic. That’s about it.

    Plus – in the Pg – please note that I used procedural language – which is not a simple “rewrite” of query. And as for recursive CTE – well it’s relatively new thing. Long story short – every database has a limit on what it can do – based on what the programmers did put in (at least until we’ll have the A.I.). Simply – programmers never put in logic to transform such query into the other one.

    Which, I believe, is a good thing. In a lot of cases normal plan would be faster. The fact that my query is faster in here is because the dataset is seriously skewed.

  6. “to the person that said (on Reddit) that PostgreSQL cannot use Hash for count(distinct …) – well, true. But rewriting the query so that it will use hashes is trivial, as shown above”

    Agreed, my point was that PostgreSQL currently does not do much to optimize count(DISTINCT …) while other other databases seemingly do. As we can see PostgreSQL started performing well as soon as the query was rewritten.

  7. @CC asks: “what make it difficult for the original, naive query to be automatically transformed into logically equivalent (it is right?) final query as shown here?”

    There is nothing preventing a database from doing this. SQL Server and Oracle automatically do the optimization. See Periscope’s second blog post, linked to at the beginning of this page.

  8. @Peter: well, no, they don’t. They do some kind of optimization, but they don’t do skip scan. Otherwise you’d also get result in millisecond area, and not 2-4 seconds.

  9. By speaking of “skip scan”, do you mean this: http://wiki.postgresql.org/wiki/Loose_indexscan ?

    If yes, which name for the feature suits better here?

    For Oracle has a slightly different meaning of “skip scan”, it actually matches the second case in the above wiki page, but not the first one.

  10. @Victor: as for the names – no idea. I’m not really big into “naming” things. If you prefer to use “Loose indexscan” – fine by me. I use skip scan, because I encountered this name first.

  11. thanks for posting this. it really helped me take another look at how I approach queries. i was able to restructure a 70s select down to 7s. (i still need to cut it down a lot more, but this is a great start).

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.