Getting unique elements

Let's assume you have some simple database with “articles" – each article can be in many “categories". And now you want to get list of all articles in given set of categories.

Standard approach:

SELECT
    a.*
FROM
    articles AS a
    JOIN articles_in_categories AS aic ON a.id = aic.article_id
WHERE
    aic.category_id IN (14,62,70,53,138)

Will return duplicated article data if given article is in more than one from listed categories. How to remove redundant rows?

Of course we can use distinct:

SELECT
    DISTINCT a.*
FROM
    articles AS a
    JOIN articles_in_categories AS aic ON a.id = aic.article_id
WHERE
    aic.category_id IN (14,62,70,53,138)

But it doesn't look really fast:

                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=366797.31..369717.01 rows=145985 width=886) (actual time=6178.264..6531.972 rows=141345 loops=1)
   ->  Sort  (cost=366797.31..367162.27 rows=145985 width=886) (actual time=6178.262..6277.667 rows=149536 loops=1)
         Sort Key: a.id, a.keywords, a.title, a.subtitle, a.lead, a.body, a.created
         Sort Method:  external sort  Disk: 132272kB
         ->  Merge Join  (cost=39071.52..183618.61 rows=145985 width=886) (actual time=434.034..1580.322 rows=149536 loops=1)
               Merge Cond: (a.id = aic.article_id)
               ->  Index Scan using articles_pkey on articles a  (cost=0.00..139495.17 rows=999842 width=886) (actual time=0.010..713.832 rows=1000000 loops=1)
               ->  Materialize  (cost=39069.10..39799.03 rows=145985 width=4) (actual time=434.013..544.324 rows=149536 loops=1)
                     ->  Sort  (cost=39069.10..39434.06 rows=145985 width=4) (actual time=434.011..488.313 rows=149536 loops=1)
                           Sort Key: aic.article_id
                           Sort Method:  external sort  Disk: 2048kB
                           ->  Bitmap Heap Scan on articles_in_categories aic  (cost=2789.14..24800.40 rows=145985 width=4) (actual time=31.969..291.847 rows=149536 loops=1)
                                 Recheck Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
                                 ->  Bitmap Index Scan on i_articles_in_categories_category_id  (cost=0.00..2752.65 rows=145985 width=0) (actual time=28.290..28.290 rows=149536 loops=1)
                                       Index Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
 Total runtime: 6566.497 ms

Of course – for 141k rows, it's not bad, but is there any way to get it any faster?

Before 8.4 you could get better results by using group by instead of distinct. Let's try:

SELECT
    a.id, a.keywords, a.title,
    a.subtitle, a.lead, a.body,
    a.created
FROM
    articles AS a
    JOIN articles_in_categories AS aic ON a.id = aic.article_id
WHERE
    aic.category_id IN (14,62,70,53,138)
GROUP BY
    a.id, a.keywords, a.title,
    a.subtitle, a.lead, a.body,
    a.created

The timing is virtually identical:

                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=366797.31..369717.01 rows=145985 width=886) (actual time=6128.916..6488.671 rows=141345 loops=1)
   ->  Sort  (cost=366797.31..367162.27 rows=145985 width=886) (actual time=6128.913..6227.215 rows=149536 loops=1)
         Sort Key: a.id, a.keywords, a.title, a.subtitle, a.lead, a.body, a.created
         Sort Method:  external sort  Disk: 132272kB
         ->  Merge Join  (cost=39071.52..183618.61 rows=145985 width=886) (actual time=430.126..1567.348 rows=149536 loops=1)
               Merge Cond: (a.id = aic.article_id)
               ->  Index Scan using articles_pkey on articles a  (cost=0.00..139495.17 rows=999842 width=886) (actual time=0.009..708.477 rows=1000000 loops=1)
               ->  Materialize  (cost=39069.10..39799.03 rows=145985 width=4) (actual time=430.106..539.320 rows=149536 loops=1)
                     ->  Sort  (cost=39069.10..39434.06 rows=145985 width=4) (actual time=430.104..484.071 rows=149536 loops=1)
                           Sort Key: aic.article_id
                           Sort Method:  external sort  Disk: 2048kB
                           ->  Bitmap Heap Scan on articles_in_categories aic  (cost=2789.14..24800.40 rows=145985 width=4) (actual time=32.247..288.985 rows=149536 loops=1)
                                 Recheck Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
                                 ->  Bitmap Index Scan on i_articles_in_categories_category_id  (cost=0.00..2752.65 rows=145985 width=0) (actual time=28.620..28.620 rows=149536 loops=1)
                                       Index Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
 Total runtime: 6524.135 ms
(16 rows)

There is relatively simple trick that we can use: for every single “id" value – all other fields will be also replicated. So instead of grouping by all columns, we can just group by id, and use some aggregate function to get value of other column. Like this:

SELECT
    a.id,
    MIN( a.keywords ), MIN( a.title ), MIN( a.subtitle ),
    MIN( a.lead ), MIN( a.body ), MIN( a.created )
FROM
    articles AS a
    JOIN articles_in_categories AS aic ON a.id = aic.article_id
WHERE
    aic.category_id IN (14,62,70,53,138)
GROUP BY
    a.id
;
                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=39071.52..189822.97 rows=145985 width=886) (actual time=425.216..5937.723 rows=141345 loops=1)
   ->  Merge Join  (cost=39071.52..183618.61 rows=145985 width=886) (actual time=425.195..1531.145 rows=149536 loops=1)
         Merge Cond: (a.id = aic.article_id)
         ->  Index Scan using articles_pkey on articles a  (cost=0.00..139495.17 rows=999842 width=886) (actual time=0.009..688.067 rows=1000000 loops=1)
         ->  Materialize  (cost=39069.10..39799.03 rows=145985 width=4) (actual time=425.176..531.384 rows=149536 loops=1)
               ->  Sort  (cost=39069.10..39434.06 rows=145985 width=4) (actual time=425.173..476.727 rows=149536 loops=1)
                     Sort Key: aic.article_id
                     Sort Method:  external sort  Disk: 2048kB
                     ->  Bitmap Heap Scan on articles_in_categories aic  (cost=2789.14..24800.40 rows=145985 width=4) (actual time=31.710..286.270 rows=149536 loops=1)
                           Recheck Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
                           ->  Bitmap Index Scan on i_articles_in_categories_category_id  (cost=0.00..2752.65 rows=145985 width=0) (actual time=28.084..28.084 rows=149536 loops=1)
                                 Index Cond: (category_id = ANY ('{14,62,70,53,138}'::integer[]))
 Total runtime: 5957.665 ms
(13 rows)

The difference is not big – around 9%, but it might be helpful in some cases. Also – please note that we got rid of sorting 132MB of data using on-disk sorting method, which is cool in all cases.

25 thoughts on “Getting unique elements”

  1. How would the performance of a subquery be compared to that?

    select * from articles where id in (select article_id from articles_in_categories where category_id in (14, 62, 70, 53, 138);

  2. @Nine:
    Heh, I totally forgot about “in ()”. Testing it now, but I since dropped my test data, so will have to recreate the data, and test also previous queries. will comment with results soon.

  3. OK. Ran it. It’s different set of data (due to random-generator), so I had to repeat also previous queries.

    Results:

    – DISTINCT – 6886ms average
    – group by all columns – 6867ms average
    – group by + aggregate – 6287ms average
    – where .. in () – 8825ms average.

  4. Try temporary table :

    create temp table x as
    select distinct article_id from articles_in_categories where category_id in (14, 62, 70, 53, 138);

    select *
    from articles
    where id in (select id from x)

  5. @pasman:
    ~2s on average. nice. The only problem is that it’s not a query. that’s 2 queries, which might be (in some cases) problematic. but aside from that – great ‘hack’.

    this leads to another approach:

    explain analyze select a.* from articles a join (select article_id from articles_in_categories where category_id in (107,134,33,40,42) group by article_id) as c on a.id = c.article_id;

    avg time – 1684ms.

  6. How does this compare?

    select
    *
    from
    articles
    where exists (select * from articles_in_categories where article_id = articles.id) and category_id in (14,62,70,53,138)

  7. @Ian:
    for starters – it doesn’t run as it articles doesn’t have category_id column.

    after fix – 8700ms.

  8. For completeness, there is also DISTINCT ON

    SELECT DISTINCT ON (a.id),
    a.id, a.keywords, a.title, a.subtitle, a.lead,
    a.body, a.created
    FROM …

  9. @Scott:
    sure, but it requires ordering of result set, so comparing is not really apples-to-apples.

  10. Instead of a temp table, maybe a CTE could help?
    (In theory, since I hadn’t actually tested this, I just typed this off-the-cuff…)

    with x as
    (
    select distinct
    article_id
    from articles_in_categories
    where category_id in (14, 62, 70, 53, 138)
    )
    select *
    from articles
    where id in (select article_id from x);

  11. What if you use a more simple (bogus) aggregate? Since you’re assuming all the values are the same, just return the second value each time instead of doing any work. A new LAST() aggregate:

    CREATE OR REPLACE FUNCTION _last( anyelement, anyelement ) RETURNS anyelement as $$
    SELECT $2;
    $$ LANGUAGE SQL;

    CREATE AGGREGATE last (
    BASETYPE = anyelement,
    SFUNC = _last,
    STYPE = anyelement
    );

    Seems like for simple cases like your “SELECT DISTINCT a.* FROM …”, where all the values are from a single table, and there are no aliases or functions, postgres could do an optimization to know that every row with the same id must have the same values for the other fields (since it’s UNIQUE), and then be able to use the id column’s index.

  12. When structuring queries like your example, I often use an explicit subquery in the FROM clause, where the subquery does all the filtering stuff and produces a small amount of data, and then that is joined in the main query with other tables to add all the related information; I’m inclined to think that approach is both very readable and very efficient. Example:

    select a.*
    from (
    select distinct article_id
    from articles_in_categories
    where category_id in (14,62,70,53,138)
    ) as a_filt
    join articles as a on a.id = a_filt.article_id

    So how does that perform for you?

  13. Around 2 seconds? Hmm, maybe it’s because I was using the DISTINCT instead of GROUP BY. What about using a GROUP BY in the CTE (yet again, I haven’t checked this since I don’t have your test data…). At least it’s a single query instead of two:

    with x as
    (
    select
    article_id
    from articles_in_categories
    where category_id in (14, 62, 70, 53, 138)
    group by article_id
    )
    select *
    from articles
    where id in (select article_id from x);

  14. @RobW:
    2121ms – and also please check comment above.

    As for dataset – this test database has 4.5GB of data, so I’m not sure if sharing it is simple/easy to do.

  15. @DEPESZ:
    Using nested SELECTs are not “hacks” as far as the relational model is concerned, where the column selection, joins, groups, where-filtering, etc are separate operations anyway; SQL is horribly deficient in considering nested SELECTs to be “separate” queries, whereas logically the whole thing is just a single value-expression query; at least the ability to do nested SELECTs largely brings back a lot of power and expressibility that would otherwise be lost to SQL, and I consider using nested SELECTs to be “proper” for SQL.

  16. @Darren:
    <sarcastic type=”slighly”>oh, really?</sarcastic>

    I was referring to use of temp table, not subselect. And yes, I know it’s not technically a hack.

  17. A shame on Postgres’ planner, it should be learnt to recognize the equivalence of

    SELECT a.* FROM articles a WHERE id IN (SELECT article_id FROM articles_in_categories WHERE category_id IN (107,134,33,40,42));

    SELECT a.* FROM articles a WHERE id IN (SELECT DISTINCT article_id FROM articles_in_categories WHERE category_id IN (107,134,33,40,42));

    SELECT a.* FROM articles a JOIN (SELECT article_id FROM articles_in_categories WHERE category_id IN (107,134,33,40,42) GROUP BY article_id) AS c ON a.id = c.article_id;

    SELECT a.* FROM articles a JOIN (SELECT DISTINCT article_id FROM articles_in_categories WHERE category_id IN (107,134,33,40,42)) AS c ON a.id = c.article_id;

    Only last two give equivalent plans in 9.0alpha4

  18. Try this one.

    Add new column in ‘articles’:
    ALTER TABLE articles ADD categories_id INTEGER[] NOT NULL DEFAULT ‘{}’::INTEGER[];

    Now, fill this field with identifiers from ‘articles_in_categories’.
    Simple trigger will help for regular operations.
    For one-shot fill something like:
    UPDATE articles
    SET categories_id = (SELECT array_agg(category_id) FROM articles_in_categories WHERE article_id = articles.id)

    Then create proper index:
    CREATE INDEX article_categories_idx ON articles USING gist (categories_id);

    The rest is quite simple, it just playing wih:

    SELECT categories_id
    FROM articles
    WHERE categories_id && ‘{14,62,70,53,138}’;

    Above will allow you to filter articles from given categories WITHOUT any join…

    With some clever aggregate function or (something less clever):

    SELECT sum(icount(categories_id & ‘{14}’)) AS k14,
    sum(icount(categories_id & ‘{62}’)) AS k62,

    FROM articles
    WHERE categories_id && ‘{14,62,70,53,138}’;

    Not tested but I bet you will drop below 0.3 sec.
    😉

    Postgres gave you contrib/intarray, why not use it ? 🙂

  19. My mistake 😉

    I tried gave the answer to the question:
    “How many articles there are in each of the given category ?”

    The solution to your problem is simply:
    SELECT *
    FROM articles
    WHERE categories_id && ‘{14,62,70,53,138}’;

    or

    SELECT count(id)
    FROM articles
    WHERE categories_id && ‘{14,62,70,53,138}’;

    the speed is simply proportional to the number of record matching WHERE clause..

    🙂

  20. @Arek:
    You do realize it has virtually nothing to do with this blogpost?

    I chose this schema just to show some practical ways of getting list of unique objects.

    Of course you can hack your way out of the problem with some intarray in here, but what will you do with situation that required 10 joins, and textual keys?

    Again – I see your solution is really cool, but not really related. I.e. it solves particular situation that I chose to show some problem, but not the problem itself.

  21. @Depesz

    Ok, I got your point …

    On the other hand, one of the ways to get the list of the unique objects is to model the database in the way you have it already unique 😉

    Yes, I know sometimes it is not possible 😉

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.