Summary of the problem from mail is: we have a table, ~ 800 million rows, with, at least 2 columns:
- station – 170 distinct values
- channel – generally 1-3 channels per station
And then we want to run:
SELECT station, array_agg(DISTINCT (channel)) AS channels FROM DATA GROUP BY station;
Which, on Israel's (original poster) machine took ~ 5 minutes.
And this is with index on on data (station, channel).
Can we do better?
To have a way to test it I made myself a table:
=$ CREATE TABLE DATA ( ts timestamptz, station int4, channel int4, VALUE float8 );
and loaded 800 million rows that have following stats:
- ts – semi-randomly generated timestamp, from 2019-03-15 12:10:51+01 to 2021-09-26 15:29:36+02. Timestamps are always rounded up to a second, and there are ~ 10 records per second.
- station – value from 1 to 200
- channel – generally value from 1 to 150. There is a fixed list of channels per station – from 50 to 145, randomly.
- value – random float value with up to 2 decimals
=$ SELECT * FROM DATA LIMIT 10; ts │ station │ channel │ VALUE ────────────────────────┼─────────┼─────────┼─────────── 2019-03-15 12:10:52+01 │ 38 │ 26 │ 894387.96 2019-03-15 12:10:53+01 │ 42 │ 58 │ 337180.77 2019-03-15 12:10:53+01 │ 49 │ 16 │ 939004.82 2019-03-15 12:10:52+01 │ 108 │ 14 │ 343292.31 2019-03-15 12:10:53+01 │ 108 │ 36 │ 852999.79 2019-03-15 12:10:52+01 │ 138 │ 56 │ 375714.57 2019-03-15 12:10:52+01 │ 30 │ 29 │ 195615.94 2019-03-15 12:10:51+01 │ 97 │ 61 │ 945488.73 2019-03-15 12:10:53+01 │ 135 │ 33 │ 263851.91 2019-03-15 12:10:51+01 │ 36 │ 10 │ 626327.65 (10 ROWS)
After loading data, I made index, just like the original mail had:
=$ CREATE INDEX data_station_channel_idx ON DATA (station, channel);
After vacuum I ran the query two times getting the same time: 136 seconds.
Now, this is, obviously bad. The fastest approach would be to precache this data using triggers. Since my table is static I can simply:
=$ CREATE TABLE distinct_station_channel AS SELECT station,array_agg(DISTINCT(channel)) AS channels FROM DATA GROUP BY station; =$ ALTER TABLE distinct_station_channel ADD PRIMARY KEY (station);
Afterwards getting the data would be, of course, virtually instantaneous: less than 1ms.
With such table (well, with one more column) we could then use triggers on data column to keep it updated. I showed such triggers previously
But, let's assume that this is no-go. What else can we do?
We could do something called skip scan, or loose scan or loose index scan. To do it will need to use recursive queries.
First, we start with getting first station/channel combo. Sorted alphabetically:
=$ SELECT station, channel FROM DATA ORDER BY station, channel LIMIT 1; station | channel ---------+--------- 1 | 1 (1 ROW)
This query is very fast: ~ 1ms. Now, that we have first row, we can find next one. Since first pair is (1,1), then next one will have to be > (1,1):
=$ SELECT station, channel FROM DATA WHERE (station, channel) > (1, 1) ORDER BY station, channel LIMIT 1; station | channel ---------+--------- 1 | 5 (1 ROW)
Again, runtime is ~ 1ms.
Now, we can build recursive query that will give us all (station, channel) combinations:
=$ WITH recursive dis AS (
( SELECT station, channel FROM DATA ORDER BY station, channel LIMIT 1 )
SELECT d.station, d.channel
SELECT x.station, x.channel
FROM DATA x
WHERE (x.station, x.channel) > (p.station, p.channel)
ORDER BY x.station, x.channel LIMIT 1
SELECT station, array_agg(channel) FROM dis GROUP BY station;
This query returns data in ~ 157ms. Obviously longer than getting the data from precalculated table, but not too bad in comparison with original 2 minutes.
- subselect in line 2 is getting simply first combo of station, channel
- subselect in lines 9-12 simply gets next combo using values from previously calculated row
- final line, 16, simply groups channels into single array per station
Simple, and works without any data changes.
Are there any issues with it? Well, for starters, it will only work well if there are very few distinct values within data set. So, if you have 800 million rows, but 200 million distinct values – it will not help.
Also – you can't really use it with additional conditions based on range/inequality. So, I couldn't use it for checking what were distinct (station, channel) combinations within specific time frame. For this I'd need to use some other approach – for example triggers that group data in precomputed data for some time ranges – for example, daily, or hourly.
But, for the problem that was reported on mailing list – this solution will work great.