Summary of the problem from mail is: we have a table, ~ 800 million rows, with, at least 2 columns:
- station – 170 distinct values
- channel – generally 1-3 channels per station
And then we want to run:
SELECT station, array_agg(DISTINCT (channel)) AS channels FROM DATA GROUP BY station;
Which, on Israel's (original poster) machine took ~ 5 minutes.
And this is with index on on data (station, channel).
Can we do better?