# Getting top-N rows per group

`Hi, how do I get top 5 values from a column group by another column??`

From further discussion, I learned that:

`total rows in table is 2 million. It'll have unique words of less than 1 million.. (approx count)`

I didn't have time yesterday, but decided to write a solution, or two, to the problem.

First, let's think about solutions. Before 8.4, when there were no window functions or common table expressions, the only way was to use a subselect that counts the number of elements “before” a given one.

Something along the lines of:

```
SELECT
    u.username,
    u.some_ts,
    u.random_value
FROM
    test u
WHERE
    4 > (
        SELECT count(*)
        FROM test su
        WHERE su.username = u.username
            AND su.some_ts > u.some_ts
    )
ORDER BY u.username asc, u.some_ts desc;
```
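For the correlated subselect to be serviced by an Index Only Scan, a composite index has to exist. The plans below mention an index named “i”; its definition isn't shown for this table, but judging from the plans it would be something like:

```
-- reconstructed, not shown in the original post: composite index "i",
-- so counting newer rows per user can be done with an index-only scan
CREATE INDEX i ON test (username, some_ts);
```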

Problem with this approach is that it doesn't scale nicely. Explain shows:

```
                                                              QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=10000000835.91..10000000835.99 rows=33 width=20) (actual time=1.460..1.463 rows=44 loops=1)
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=10000000000.00..10000000835.08 rows=33 width=20) (actual time=0.058..1.289 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=8.32..8.33 rows=1 width=0) (actual time=0.011..0.012 rows=1 loops=100)
                 ->  Index Only Scan using i on test su  (cost=0.00..8.31 rows=3 width=0) (actual time=0.008..0.010 rows=5 loops=100)
                       Heap Fetches: 451
 Total runtime: 1.575 ms
```

Costs over 10000000000 are there because I forcibly disabled seq scans – in my example the table has only 100 rows, and without the disable Pg chose this plan:

```
                                                      QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=254.83..254.91 rows=33 width=20) (actual time=4.108..4.112 rows=44 loops=1)
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on test u  (cost=0.00..254.00 rows=33 width=20) (actual time=0.069..3.881 rows=44 loops=1)
         Filter: (4 > (SubPlan 1))
         Rows Removed by Filter: 56
         SubPlan 1
           ->  Aggregate  (cost=2.51..2.52 rows=1 width=0) (actual time=0.037..0.037 rows=1 loops=100)
                 ->  Seq Scan on test su  (cost=0.00..2.50 rows=3 width=0) (actual time=0.013..0.035 rows=5 loops=100)
                       Rows Removed by Filter: 95
 Total runtime: 4.215 ms
(12 rows)
```

Which is more or less the same, but instead of running the Index Only Scan 100 times, it runs a Seq Scan 100 times.

Anyway, the cost of running such a query is:

• a Seq Scan over the whole table
• for every row, an Index Only Scan (or a Seq Scan)
• finally, an ORDER BY over the result – in this case 44 rows (4 rows for each of 11 users)

That's a lot.

Can it be done faster? Yes. But how – it depends on your case.

There are two very different cases:

• You have lots of “groups” (users in my example) – so many that you're fetching a large portion of the table anyway
• You have relatively few groups, and you're just getting a small subset of the table

The situation the guy on IRC had is clearly the first one, so I will write something for it first.

Of course, I need some “playground”:

```
CREATE TABLE test (
    username TEXT,
    some_ts timestamptz,
    random_value INT4
);
```

Now some data in there:

```
INSERT INTO test (username, some_ts, random_value)
with users as (
    SELECT
        'user #' || i as username,
        cast( 1+ random() * random() * random() * 10 as INT4) as row_count
    FROM
        generate_series(1,1000000) i
)
SELECT
    u.username,
    now() - '1 year'::INTERVAL * random(),
    cast(random() * 100000000 as INT4)
FROM
    users u,
    generate_series(1,10) i(c)
WHERE
    i.c <= u.row_count;
```

And let's check whether the data matches his (the guy from IRC) description:

```
SELECT
    count(DISTINCT username),
    count(*)
FROM test;
  count  │  count
─────────┼─────────
 1000000 │ 2206611
(1 row)
```

Looks about right. I have slightly more rows, but it's not that big of a difference.

Last thing – do I even have any users with more than 5 rows?

```
WITH x as (
    SELECT
        count(*) as rows_per_user
    FROM
        test
    GROUP BY username
)
SELECT
    rows_per_user,
    count(*) as different_users
FROM x
GROUP BY rows_per_user
ORDER BY rows_per_user;
 rows_per_user │ different_users
───────────────┼─────────────────
             1 │          424296
             2 │          280339
             3 │          132580
             4 │           72984
             5 │           43071
             6 │           23940
             7 │           13186
             8 │            6406
             9 │            2564
            10 │             634
(10 rows)
```

Most of the users have 1, 2 or 3 rows, but we even have some users with 10 rows. All as planned.

For a sanity check, I can calculate how many rows I should be getting, which is:

    1 * 424296 +
    2 * 280339 +
    3 * 132580 +
    4 * 72984 +
    5 * 43071 +
    5 * 23940 +
    5 * 13186 +
    5 * 6406 +
    5 * 2564 +
    5 * 634

for a total of 2123655 rows. Just 82956 rows will not be included.
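The same arithmetic can be delegated to the database – a quick sanity-check query (my addition, not from the original post), capping each user's contribution at 5 rows:

```
-- expected size of the top-5-per-user result
WITH x as (
    SELECT count(*) as rows_per_user
    FROM test
    GROUP BY username
)
SELECT sum(least(rows_per_user, 5)) as expected_rows
FROM x;
```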

And now, with the playground ready, I can finally write the query:

```
explain analyze
WITH numbered as (
    SELECT
        *,
        row_number() over (partition BY username ORDER BY some_ts) as rowno
    FROM
        test
)
SELECT *
FROM numbered
WHERE rowno <= 5;
                                                              QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 CTE Scan on numbered  (cost=450693.32..500342.07 rows=735537 width=52) (actual time=17803.640..21914.244 rows=2123655 loops=1)
   Filter: (rowno <= 5)
   Rows Removed by Filter: 82956
   CTE numbered
     ->  WindowAgg  (cost=406561.10..450693.32 rows=2206611 width=24) (actual time=17803.629..20773.791 rows=2206611 loops=1)
           ->  Sort  (cost=406561.10..412077.63 rows=2206611 width=24) (actual time=17803.613..19666.841 rows=2206611 loops=1)
                 Sort Method: external merge  Disk: 81960kB
                 ->  Seq Scan on test  (cost=0.00..38292.11 rows=2206611 width=24) (actual time=0.007..208.733 rows=2206611 loops=1)
 Total runtime: 22044.126 ms
(10 rows)
```

This explain is uploaded to explain.depesz.com.

A couple of notes:

• There is only one table scan – a Seq Scan
• Most of the time is spent sorting the data – but this can be “optimized” by simply adding more work_mem. For example, after upping work_mem to '1GB', I got “Sort Method: quicksort Memory: 270696kB” and it was ~ 5 seconds faster
• While we should add an explicit ORDER BY at the end of the query, it's not needed now (in current 9.3devel), because the data comes out sorted correctly for the window function row_number()
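Relying on the current sort order is fragile – a later version is free to pick a different plan. A variant with the explicit ORDER BY spelled out (my addition for safety) would be:

```
WITH numbered as (
    SELECT
        *,
        row_number() over (partition BY username ORDER BY some_ts) as rowno
    FROM
        test
)
SELECT *
FROM numbered
WHERE rowno <= 5
ORDER BY username, rowno;
```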

How does it compare to the query I tested earlier on the small dataset?

Explain for this query is:

```
                                                                QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Sort  (cost=28078918.96..28080757.81 rows=735537 width=24) (actual time=43410.638..45760.144 rows=2123655 loops=1)
   Sort Method: external merge  Disk: 78864kB
   ->  Seq Scan on test u  (cost=0.00..27977076.63 rows=735537 width=24) (actual time=0.111..25967.798 rows=2123655 loops=1)
         Filter: (5 > (SubPlan 1))
         Rows Removed by Filter: 82956
         SubPlan 1
           ->  Aggregate  (cost=12.65..12.66 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=2206611)
                 ->  Index Only Scan using i on test su  (cost=0.00..12.65 rows=1 width=0) (actual time=0.010..0.010 rows=1 loops=2206611)
                       Heap Fetches: 2482901
 Total runtime: 45850.787 ms
(12 rows)
```

As you can see (mostly on the explain.depesz.com page), all those index scans (2.2 million of them) actually take more time than the Sort. And we still need the sort to order the output rows correctly.

This solves the first problem, described yesterday, but what about the case with a low number of groups?

Some time ago I wrote about getting a list of unique elements, in a fast way, using a function that implemented a scan technique PostgreSQL doesn't have natively – Skip Scan/Loose Indexscan.

Now, thanks to the existence of recursive CTEs, it should be possible to do it without functions.
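The distinct-values part on its own – a recursive CTE emulating a loose index scan – looks roughly like this (a sketch of the technique; the full query further down embeds it as the “skip” CTE):

```
-- walk distinct usernames one by one, each step using the index
-- to jump to the next value instead of scanning the whole table
WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test t ORDER BY t.username LIMIT 1 )
    UNION ALL
    SELECT (
        SELECT min( t2.username )
        FROM test t2
        WHERE t2.username > s.username
    )
    FROM skip s
    WHERE s.username IS NOT NULL
)
SELECT username FROM skip WHERE username IS NOT NULL;
```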

But first, I need to tidy up my playground, and get a different set of data. This time it will be easier to write:

```
DROP TABLE test;
DROP TABLE

CREATE TABLE test (
    username TEXT,
    some_ts timestamptz,
    random_value INT4
);
CREATE TABLE

INSERT INTO test (username, some_ts, random_value)
SELECT
    'user #' || cast(floor(random() * 10) as int4),
    now() - '1 year'::INTERVAL * random(),
    cast(random() * 100000000 as INT4)
FROM
    generate_series(1,2000000);
INSERT 0 2000000

CREATE INDEX i on test (username, some_ts);
CREATE INDEX

analyze test;
ANALYZE

SELECT
    username,
    count(*)
FROM test
GROUP BY username
ORDER BY username;
 username │ count
──────────┼────────
 user #0  │ 199871
 user #1  │ 199939
 user #2  │ 200388
 user #3  │ 199849
 user #4  │ 200329
 user #5  │ 199504
 user #6  │ 199903
 user #7  │ 200799
 user #8  │ 199487
 user #9  │ 199931
(10 rows)
```

I have 10 users, each with ~ 200,000 rows. I will be selecting 5 rows for every user, so I should be getting 50 rows in output.

Since this is now actually relatively important, let's assume I will be getting newest 5 rows (order by some_ts desc) per each user.

1. `WITH RECURSIVE skip AS (`
2. `    ( SELECT t.username FROM test as t ORDER BY t.username limit 1 )`
3. `    union all`
4. `    (`
5. `        SELECT`
6. `            (`
7. `                SELECT min( t2.username )`
8. `                FROM test t2`
9. `                WHERE t2.username > s.username`
10. `            )`
11. `        FROM skip s`
12. `        WHERE s.username IS NOT NULL`
13. `    )`
14. `),`
15. `with_data as (`
16. `    SELECT array(`
17. `        SELECT t`
18. `        FROM test t`
19. `        WHERE t.username = s.username`
20. `        ORDER BY t.some_ts desc LIMIT 5`
21. `    ) as rows`
22. `    FROM skip s`
23. `    WHERE s.username IS NOT NULL`
24. `)`
25. `SELECT (unnest( rows )).* FROM with_data`

Did you just think “my eyes, they hurt”?

Well, it might be ugly, but it's fast. Explain shows that it took just 2.3ms to get the results.

But how does it work?

• Lines 1-14 generate the list of unique usernames – just usernames. This is done using a recursive CTE:
  • Line 2 gets the first, smallest username
  • Lines 5-12 are called recursively, fetching the next username on each call

The only issue with it is that we will get 11 rows – the final row will contain NULL. But this can be filtered out later on.

• Lines 15-23 get the actual data for all rows:
  • Lines 16-21 get an array of rows (because we can't generate more rows than we got from the “skip” CTE). Each array contains the 5 newest rows for a given user. I don't have to SELECT username separately, because it's part of the row that is being compacted into the array.
  • Line 23 removes the additional NULL row that I mentioned above
• Line 25 generates the final resultset. It gets (from with_data) 10 rows, each containing an array. Each array has 5 elements, and each of these elements contains a row from the original test table. Now, we just need to:
  • Unnest the array – this generates 50 rows, each with a single value, being a row representation
  • Unpack the rows – by using the (row_variable).* syntax
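The unnest-then-unpack step can be seen in isolation; this mini-example (hypothetical, with an invented `rows_arr` alias) compacts three whole rows of test into an array, then expands them back into columns:

```
-- array of whole-row values, unnested and expanded with (row).* syntax
SELECT (unnest(rows_arr)).*
FROM (
    SELECT array( SELECT t FROM test t LIMIT 3 ) as rows_arr
) as sub;
```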

Final result:

```
 username │            some_ts            │ random_value
──────────┼───────────────────────────────┼──────────────
user #0  │ 2012-10-05 14:02:40.850461+02 │     38340423
user #0  │ 2012-10-05 13:59:36.991261+02 │     47539361
user #0  │ 2012-10-05 13:58:13.788061+02 │     90133298
user #0  │ 2012-10-05 13:57:43.461661+02 │     39151654
user #0  │ 2012-10-05 13:55:19.519261+02 │     23971279
user #1  │ 2012-10-05 14:09:48.184861+02 │     68225136
user #1  │ 2012-10-05 14:07:50.853661+02 │     23783242
user #1  │ 2012-10-05 14:02:39.036061+02 │     45015932
user #1  │ 2012-10-05 13:58:33.400861+02 │       856787
user #1  │ 2012-10-05 13:57:44.066461+02 │     97875281
user #2  │ 2012-10-05 14:06:49.941661+02 │     44353630
user #2  │ 2012-10-05 14:05:41.426461+02 │     56720972
user #2  │ 2012-10-05 14:02:29.100061+02 │     47825604
user #2  │ 2012-10-05 14:00:09.391261+02 │     75955113
user #2  │ 2012-10-05 13:59:13.663261+02 │     14216525
user #3  │ 2012-10-05 14:01:00.367261+02 │     31505609
user #3  │ 2012-10-05 13:59:47.445661+02 │     60049729
user #3  │ 2012-10-05 13:55:49.240861+02 │     89598870
user #3  │ 2012-10-05 13:53:57.266461+02 │     33846849
user #3  │ 2012-10-05 13:53:33.852061+02 │     82449747
user #4  │ 2012-10-05 14:06:02.853661+02 │     63597247
user #4  │ 2012-10-05 14:05:51.708061+02 │     54494125
user #4  │ 2012-10-05 14:02:59.599261+02 │     29407125
user #4  │ 2012-10-05 14:01:06.415261+02 │     17218964
user #4  │ 2012-10-05 14:00:17.080861+02 │     62982517
user #5  │ 2012-10-05 14:07:27.612061+02 │     32170514
user #5  │ 2012-10-05 14:01:53.589661+02 │      7295343
user #5  │ 2012-10-05 14:01:11.512861+02 │     27237756
user #5  │ 2012-10-05 14:00:58.639261+02 │     26827360
user #5  │ 2012-10-05 14:00:55.096861+02 │     41067140
user #6  │ 2012-10-05 14:09:33.324061+02 │     95999924
user #6  │ 2012-10-05 14:07:10.072861+02 │        72678
user #6  │ 2012-10-05 14:06:41.474461+02 │     37179557
user #6  │ 2012-10-05 13:54:30.444061+02 │     54959828
user #6  │ 2012-10-05 13:49:53.532061+02 │     90795817
user #7  │ 2012-10-05 14:08:57.900061+02 │     83603366
user #7  │ 2012-10-05 14:07:47.656861+02 │     16851893
user #7  │ 2012-10-05 14:04:29.109661+02 │     45867068
user #7  │ 2012-10-05 14:01:53.762461+02 │     89712017
user #7  │ 2012-10-05 13:56:15.074461+02 │     86761399
user #8  │ 2012-10-05 14:04:05.436061+02 │     46245277
user #8  │ 2012-10-05 14:01:53.330461+02 │     48946212
user #8  │ 2012-10-05 14:00:02.306461+02 │     77759370
user #8  │ 2012-10-05 13:52:22.831261+02 │     23588197
user #8  │ 2012-10-05 13:51:34.533661+02 │     88615369
user #9  │ 2012-10-05 14:05:37.192861+02 │     66530600
user #9  │ 2012-10-05 14:01:28.360861+02 │     68077186
user #9  │ 2012-10-05 14:00:46.024861+02 │     30766930
user #9  │ 2012-10-05 14:00:12.328861+02 │     34347000
user #9  │ 2012-10-05 13:59:38.892061+02 │     59314056
(50 rows)
```

There are some issues with it, though. If the returned rows were very wide, using arrays could become problematic.

In such a case you can/should use the solution by RhodiumToad (2nd example).

That would be it. I mean – you can still use the function approach, and in 9.3 you will be able to use LATERAL for such cases:

```
WITH RECURSIVE skip AS (
    ( SELECT t.username FROM test as t ORDER BY t.username limit 1 )
    union all
    ( SELECT (SELECT min( t2.username ) FROM test t2 WHERE t2.username > s.username ) FROM skip s WHERE s.username IS NOT NULL )
)
SELECT
    x.*
FROM
    skip s,
    lateral (
        SELECT t.*
        FROM test t
        WHERE t.username = s.username
        ORDER BY t.some_ts desc
        LIMIT 5
    ) as x
WHERE
    s.username IS NOT NULL;
```