I wrote about similar things couple of times, but recently found thread on pgsql-general mailing list that made me thing about it again.
Summary of the problem from mail is: we have a table, ~ 800 million rows, with, at least 2 columns:
- station – 170 distinct values
- channel – generally 1-3 channels per station
And then we want to run:
array_agg(DISTINCT (channel)) AS channels
Which, on Israel's (original poster) machine took ~ 5 minutes.
And this is with index on on data (station, channel).
Can we do better?
Continue reading Using recursive queries to get distinct elements from table
Today, on irc, someone asked interesting question.
Basically she ran a query like:
select a, b, c, d, e, f from table order by a
then, she processed the query to get, for each a array of unique values of b, c, d, e, and f, and then he inserted it back to database, to some other table.
It was a problem, because the table had many rows (millions I would assume), and the whole process was slow.
So, how to make it faster?
Continue reading Getting list of unique elements in table, per group
So, couple of days ago, some guy, from Periscope company wrote a blogpost about getting number of distinct elements, per group, faster using subqueries.
This was then submitted to Hacker News and r/Programming on Reddit.
Then, the original authors submitted second blogpost comparing speed between four different DB engines. Which, in turn, was also commented on Reddit.
I found the numbers presented by Periscope (as their improvement) as not that great.
Unfortunately – their blog doesn't allow for comments, so I decided to test it, and write on my own blog, what I can find about it.
Continue reading Getting count of distinct elements, per group, in PostgreSQL.
Let's assume you have some simple database with “articles" – each article can be in many “categories". And now you want to get list of all articles in given set of categories.
articles as a
join articles_in_categories as aic on a.id = aic.article_id
aic.category_id in (14,62,70,53,138)
Will return duplicated article data if given article is in more than one from listed categories. How to remove redundant rows?
Continue reading Getting unique elements
Robert Haas wrote, and Tom Lane committed (on 2nd of August) patch which adds interesting capability:
Add ALTER TABLE ... ALTER COLUMN ... SET STATISTICS DISTINCT
Log message is rather terse, so what does it exactly do?
Continue reading Waiting for 8.5 – hinting for number of distinct values
Every so often you need to get list of unique elements in some column. The standard way to do it is:
select distinct column from table;
select column from table group by column;
The only problem is that it's slow – as it has to seq scan whole table. Can it be done faster?
Continue reading Getting list of unique elements
Today Tom Lane committed patch which gives DISTINCT ability to use hash aggregate – just like GROUP BY.
Continue reading Waiting for 8.4 – hash based DISTINCT
When I was working for one of customers we found some strange thing. We needed to found number of distinct sessions per day. Table layout was very simple:
Continue reading Counting number of distinct elements