Getting list of unique elements

Every so often you need to get a list of unique elements from some column. The standard way to do it is:

SELECT DISTINCT column FROM table;

or

SELECT column FROM table GROUP BY column;

The only problem is that it's slow, as it has to seq-scan the whole table. Can it be done faster?

Usually, when somebody wants to speed up such a query, they are advised to add triggers which keep the list of values in a side table, a dictionary.

This is nice, but a bit cumbersome, and relatively hard to do correctly. Let's see how it would work.

Let's assume we have this table:

                           Table "public.test"
   Column   |  Type   |                     Modifiers
------------+---------+---------------------------------------------------
 id         | integer | not null default nextval('test_id_seq'::regclass)
 test_field | text    |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

Now, let's insert some data into it:

# INSERT INTO test (test_field)
    SELECT proname::text FROM pg_proc, generate_series(1,10) WHERE random() < 0.33;
INSERT 0 7480

Now, let's see how the data looks:

# SELECT COUNT(*), COUNT(DISTINCT test_field) FROM test;
 count | count
-------+-------
  7480 |  1821
(1 row)

Now, to get the list of all possible test_field values, I do:

# EXPLAIN ANALYZE SELECT DISTINCT test_field FROM test;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=135.50..153.71 rows=1821 width=12) (actual time=38.275..42.255 rows=1821 loops=1)
   ->  Seq Scan on test  (cost=0.00..116.80 rows=7480 width=12) (actual time=0.060..17.386 rows=7480 loops=1)
 Total runtime: 45.635 ms
(3 rows)

(I tested it several times; it generally runs in 44-47 ms.)

So, let's add our dictionary table, and triggers.

# \d test_dictionary
        Table "public.test_dictionary"
    Column     |  Type   |     Modifiers
---------------+---------+--------------------
 test_field    | text    | not null
 element_count | integer | not null default 1
Indexes:
    "test_dictionary_pkey" PRIMARY KEY, btree (test_field)

Now, we will need 3 triggers (on insert, update, and delete), but internally they will use 2 specialized operations:

  • add a new element to the dictionary, or increment its element_count
  • decrement element_count, or remove the element from the dictionary

Because of this, we will start with 2 simple functions:

CREATE OR REPLACE FUNCTION add_to_dictionary(use_value TEXT) RETURNS void AS $$
BEGIN
    LOOP
        -- First try to increment the counter of an existing dictionary row.
        UPDATE test_dictionary SET element_count = element_count + 1 WHERE test_field = use_value;
        IF FOUND THEN
            RETURN;
        END IF;
        -- The value is not there yet - try to insert it. If somebody else
        -- inserted it concurrently, we get unique_violation and retry the UPDATE.
        BEGIN
            INSERT INTO test_dictionary (test_field) VALUES (use_value);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

(The code is adapted from the UPDATE-or-INSERT example in the PostgreSQL manual.)
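
As a side note: on PostgreSQL 9.5 and newer (much newer than what this post was written against), the whole retry loop can be replaced with a single statement. A sketch, assuming the same test_dictionary table:

INSERT INTO test_dictionary (test_field)
    VALUES (use_value)
    ON CONFLICT (test_field)
    DO UPDATE SET element_count = test_dictionary.element_count + 1;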

CREATE OR REPLACE FUNCTION remove_from_dictionary(use_value TEXT) RETURNS void AS $$
DECLARE
    tempint int4;
BEGIN
    UPDATE test_dictionary SET element_count = element_count - 1 WHERE test_field = use_value RETURNING element_count INTO tempint;
    IF NOT FOUND THEN
        raise exception 'remove_from_dictionary() called with element that doesn''t exist in test_dictionary ?! [%]', use_value;
    END IF;
    IF tempint = 0 THEN
        DELETE FROM test_dictionary WHERE test_field = use_value;
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Now, let's test if the functions work:

First, the table is empty:

# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
(0 rows)

Now, let's add some values:

# SELECT add_to_dictionary('added');
 add_to_dictionary
-------------------
 
(1 row)
 
# SELECT add_to_dictionary('added');
 add_to_dictionary
-------------------
 
(1 row)
 
# SELECT add_to_dictionary('xxx');
 add_to_dictionary
-------------------
 
(1 row)

And let's see how it looks:

# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
 added      |             2
 xxx        |             1
(2 rows)

Looks ok.

Now, let's remove the elements:

# SELECT remove_from_dictionary('xxx');
 remove_from_dictionary
------------------------
 
(1 row)
 
# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
 added      |             2
(1 row)
 
# SELECT remove_from_dictionary('added');
 remove_from_dictionary
------------------------
 
(1 row)
 
# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
 added      |             1
(1 row)
 
# SELECT remove_from_dictionary('added');
 remove_from_dictionary
------------------------
 
(1 row)
 
# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
(0 rows)

Last test: check that it correctly fails when trying to remove an element that is not in the dictionary:

# SELECT remove_from_dictionary('added');
ERROR:  remove_from_dictionary() called with element that doesn't exist in test_dictionary ?! [added]

All works. Now for the triggers.

Insert and delete triggers are trivial:

CREATE OR REPLACE FUNCTION test_dictionary_i() RETURNS TRIGGER AS
$BODY$
BEGIN
    IF NEW.test_field IS NOT NULL THEN
        PERFORM add_to_dictionary(NEW.test_field);
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER test_dictionary_i AFTER INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_i();

CREATE OR REPLACE FUNCTION test_dictionary_d() RETURNS TRIGGER AS
$BODY$
BEGIN
    IF OLD.test_field IS NOT NULL THEN
        PERFORM remove_from_dictionary(OLD.test_field);
    END IF;
    RETURN OLD;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER test_dictionary_d AFTER DELETE ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_d();

Update is a bit more complex:

CREATE OR REPLACE FUNCTION test_dictionary_u() RETURNS TRIGGER AS
$BODY$
BEGIN
    IF NOT (OLD.test_field IS DISTINCT FROM NEW.test_field) THEN
        RETURN NEW;
    END IF;
    IF OLD.test_field IS NULL THEN
        -- it means that NEW.test_field IS NOT NULL
        PERFORM add_to_dictionary(NEW.test_field);
    ELSIF NEW.test_field IS NULL THEN
        -- it means that OLD.test_field IS NOT NULL
        PERFORM remove_from_dictionary(OLD.test_field);
    ELSIF OLD.test_field < NEW.test_field THEN
        -- both are not null; always touch dictionary rows in sorted order
        PERFORM remove_from_dictionary(OLD.test_field);
        PERFORM add_to_dictionary(NEW.test_field);
    ELSE
        PERFORM add_to_dictionary(NEW.test_field);
        PERFORM remove_from_dictionary(OLD.test_field);
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER test_dictionary_u AFTER UPDATE ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_u();

You might wonder why there are the last ELSIF and ELSE branches, with the order of the two function calls switched between them. The point is that all transactions should lock dictionary rows in the same (sorted) order; if you wonder about the details, please check this blogpost, especially the part where it explains the deadlock problem.
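
To make the deadlock scenario concrete, here is a hypothetical interleaving that could happen if both branches called the functions in one fixed order (this illustration is mine, not from the referenced post):

-- Session 1: UPDATE test SET test_field = 'b' WHERE test_field = 'a';
--   its trigger locks dictionary row 'a' (remove_from_dictionary('a'))
-- Session 2: UPDATE test SET test_field = 'a' WHERE test_field = 'b';
--   its trigger locks dictionary row 'b' (remove_from_dictionary('b'))
-- Session 1: add_to_dictionary('b')  -- blocks, waiting for row 'b'
-- Session 2: add_to_dictionary('a')  -- blocks, waiting for row 'a' => deadlock
-- Touching the rows in sorted order ('a' always before 'b') prevents this.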

And the final step – we have to prefill the dictionary table. Like this:

INSERT INTO test_dictionary (test_field, element_count)
    SELECT test_field, COUNT(*) FROM test WHERE test_field IS NOT NULL GROUP BY test_field;
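
One caveat (an assumption on my part, not something covered above): if the table sees concurrent writes, the triggers and the backfill should be created in a single transaction, while holding a lock that blocks writes; otherwise rows modified between trigger creation and the backfill can be counted twice or missed. A minimal sketch:

BEGIN;
-- blocks INSERT/UPDATE/DELETE on test, but still allows reads
LOCK TABLE test IN SHARE ROW EXCLUSIVE MODE;
-- ... CREATE TRIGGER statements from above go here ...
INSERT INTO test_dictionary (test_field, element_count)
    SELECT test_field, COUNT(*) FROM test WHERE test_field IS NOT NULL GROUP BY test_field;
COMMIT;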

And that's basically all.

First of all, let's check if it is faster:

# EXPLAIN ANALYZE SELECT test_field FROM test_dictionary;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on test_dictionary  (cost=0.00..24.53 rows=1353 width=32) (actual time=0.023..4.042 rows=1821 loops=1)
 Total runtime: 7.412 ms
(2 rows)

No surprise here.

Getting the list of elements now depends on the number of distinct elements, not on the number of all rows in our main (test) table.

The larger the source table, the bigger the relative speedup from using the dictionary.

Of course, we should check that the triggers work correctly. Doing this on 1800 distinct values is complicated, so let's clear the table:

# DELETE FROM test;
DELETE 7480

Interesting – after this delete, test_dictionary should be empty. Is it?

# SELECT COUNT(*) FROM test_dictionary;
 count
-------
     0
(1 row)

So, let's insert some rows, update some rows, and check if it's ok:

# INSERT INTO test (test_field) VALUES ('a'),('b'),('c'),('a'),('e');
INSERT 0 5
 
# UPDATE test SET test_field = 'x' WHERE id = (SELECT MIN(id) FROM test WHERE test_field = 'a');
UPDATE 1
 
# UPDATE test SET test_field = 'x' WHERE test_field = 'e';
UPDATE 1
 
# SELECT * FROM test;
  id  | test_field
------+------------
 7482 | b
 7483 | c
 7484 | a
 7481 | x
 7485 | x
(5 rows)
 
# SELECT * FROM test_dictionary;
 test_field | element_count
------------+---------------
 b          |             1
 c          |             1
 a          |             1
 x          |             2
(4 rows)

Looks good to me.

Now, the above solution is great: it's relatively simple, it keeps the dictionary updated, and it is (as far as I can tell) deadlock-free.

But sometimes you just need something a bit different.

Very recently I was in a situation where I needed to get the list of distinct values in some table.

Important facts:

  • the table had ~ 5 million rows
  • the number of distinct values was ~ 20
  • new values hardly ever occurred
  • there was an index on the column that I wanted to get distinct values from

Given that the list of values hardly ever changes, and that I need the list relatively rarely, I thought about getting it in a bit different way.

First, let's simulate my test table:

# CREATE TABLE test ( id serial PRIMARY KEY, test_field text );

(I dropped the previous test table first, so the CREATE actually worked 🙂)

Now, let's insert some data:

# INSERT INTO test (test_field) SELECT chr(i) FROM generate_series(65,84) i, generate_series(1,250000) x;

As you can see, we have 20 different values, each repeated 250000 times:

# SELECT test_field, COUNT(*) FROM test GROUP BY test_field;
 test_field | count
------------+--------
 L          | 250000
 J          | 250000
 I          | 250000
 D          | 250000
 F          | 250000
 K          | 250000
 H          | 250000
 N          | 250000
 R          | 250000
 B          | 250000
 M          | 250000
 G          | 250000
 E          | 250000
 C          | 250000
 S          | 250000
 P          | 250000
 T          | 250000
 O          | 250000
 A          | 250000
 Q          | 250000
(20 rows)

Now, getting the list of test_field values takes long, even when we have an index:

# CREATE INDEX q ON test (test_field);
CREATE INDEX
 
# ANALYZE test;
ANALYZE
 
# EXPLAIN ANALYZE SELECT DISTINCT test_field FROM test;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..153523.69 rows=20 width=2) (actual time=0.141..15163.846 rows=20 loops=1)
   ->  Index Scan using q on test  (cost=0.00..141023.69 rows=5000000 width=2) (actual time=0.135..8337.897 rows=5000000 loops=1)
 Total runtime: 15163.943 ms
(3 rows)
 
# EXPLAIN ANALYZE SELECT test_field FROM test GROUP BY test_field;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=84624.00..84624.20 rows=20 width=2) (actual time=14552.031..14552.061 rows=20 loops=1)
   ->  Seq Scan on test  (cost=0.00..72124.00 rows=5000000 width=2) (actual time=0.043..6734.036 rows=5000000 loops=1)
 Total runtime: 14552.155 ms
(3 rows)

14-15 seconds. Not really cool. But we can write a simple function:

CREATE OR REPLACE FUNCTION get_list_of_unique_elements() RETURNS setof TEXT AS $$
DECLARE
    previous TEXT;
    temptext TEXT;
BEGIN
    -- Get the smallest value; with an index on test_field this is fast.
    SELECT MIN(test_field) INTO temptext FROM test;
    -- An aggregate always returns a row, so FOUND would always be true here;
    -- we have to check for NULL instead.
    IF temptext IS NULL THEN
        RETURN;
    END IF;
    previous := temptext;
    RETURN NEXT previous;
    LOOP
        -- Jump to the smallest value bigger than the previous one.
        SELECT MIN(test_field) INTO temptext FROM test WHERE test_field > previous;
        IF temptext IS NULL THEN
            RETURN;
        END IF;
        previous := temptext;
        RETURN NEXT previous;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

And does it work?

# SELECT * FROM get_list_of_unique_elements();
 get_list_of_unique_elements
-----------------------------
 A
 B
 C
 D
 E
 F
 G
 H
 I
 J
 K
 L
 M
 N
 O
 P
 Q
 R
 S
 T
(20 rows)

And the timing?

# EXPLAIN ANALYZE SELECT * FROM get_list_of_unique_elements();
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on get_list_of_unique_elements  (cost=0.00..260.00 rows=1000 width=32) (actual time=4.611..4.650 rows=20 loops=1)
 Total runtime: 4.724 ms
(2 rows)
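
For the record: on PostgreSQL 8.4 and newer, the same "loose index scan" trick can be expressed without PL/pgSQL, as a recursive CTE. A sketch, assuming the same test table (I have not benchmarked this variant here):

WITH RECURSIVE unique_elements AS (
    -- start from the smallest value
    SELECT MIN(test_field) AS test_field FROM test
    UNION ALL
    -- then repeatedly jump to the next bigger value, using the index
    SELECT (SELECT MIN(test_field) FROM test WHERE test_field > u.test_field)
    FROM unique_elements u
    WHERE u.test_field IS NOT NULL
)
SELECT test_field FROM unique_elements WHERE test_field IS NOT NULL;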

Interesting facts:

  • This function will return elements in sorted order.
  • It should be faster than DISTINCT/GROUP BY only as long as you have a relatively low number of distinct values. My guess is that it's worth using when you have over 10 times as many rows as unique values.
  • It is possible to write the function so that it accepts table/column names as arguments (using PL/pgSQL EXECUTE), but it's non-trivial if the column is not of TEXT type; a rough sketch of such a dynamic variant follows below.
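
For illustration only, a minimal sketch of that dynamic variant. The function name get_unique_values is mine; it assumes PostgreSQL 9.1+ for format(), and it simply casts everything to text, which usually defeats the index (and changes the sort order) unless you have an expression index on (column::text):

CREATE OR REPLACE FUNCTION get_unique_values(p_table regclass, p_column TEXT) RETURNS setof TEXT AS $$
DECLARE
    previous TEXT;
BEGIN
    -- smallest value, cast to text so one function handles any column type
    EXECUTE format('SELECT MIN((%I)::text) FROM %s', p_column, p_table) INTO previous;
    WHILE previous IS NOT NULL LOOP
        RETURN NEXT previous;
        -- next distinct value; note: text ordering, not the native ordering
        EXECUTE format('SELECT MIN((%I)::text) FROM %s WHERE (%I)::text > $1', p_column, p_table, p_column)
            INTO previous USING previous;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Usage would be: SELECT * FROM get_unique_values('test', 'test_field');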

11 thoughts on “Getting list of unique elements”

  1. about the first example, doing a count like that at the default transaction isolation level is dangerous.
    In reality you would require all transactions on that table to be serializable, which is true for any concurrent math operations.

  2. Thanks for this hint.

    What I’m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:

    – Behaves strictly like a varchar in all aspects…
    – Automatically creates new values when needed – up to a limited number of new values (a few hundreds / thousands would probably be OK)
    – Takes much less space
    – Gives access to a quick function to count the distinct values

    For instance in the case of Credit Card transactions processing I would have the following:

    CREATE TABLE CreditCardTransaction (brand AUTOENUM, […], transactionOutcome AUTOENUM);

    INSERT INTO CreditCardTransaction VALUES ('VISA', …, 'OK');
    INSERT INTO CreditCardTransaction VALUES ('VISA', …, 'OK');
    INSERT INTO CreditCardTransaction VALUES ('AMEX', …, 'DENIED');

    -> this would automatically create 2 AutoEnums, one with ('VISA', 'AMEX') and the other one with ('OK', 'DENIED');

    The storage need for those 2 enums would probably be really small.

    I have many tables which would benefit from this. Especially tables which are used to log events, where the number of possible values is small but not necessarily known beforehand - typically error messages.

  3. Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can’t create a bitmap index on a table. The bitmap index is ideal for these high volume, low cardinality tables.

    But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up the position of the value in the array and append it to the array if not found. (I’m guessing this is pretty close to how the enum type works internally.)

    You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.

  4. Yeah, the approach of a table would work… but it's not backwards compatible with legacy code, it requires a code rewrite… and it's just cumbersome. And it makes the whole DB schema more complex for not much benefit.

  5. I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that. I think this is something that should be considered in the optimizer — TODO for 8.5?

  6. @alvherre:
    I would *love* to see it in optimizer, but I’m definitely not the right person to ask about being in TODO – my C skills are next to none.

  7. The function remove_from_dictionary() appears unsafe. After “tmpint” is set, and before the DELETE is executed, the item may be added by some concurrent process. You may be able to make it safe by adding a “WHERE element_count = 0” to the DELETE.

  8. @Jeff Davis:
    I’m not sure. Wouldn’t UPDATE obtain lock on the row? So the concurrent addition would have to wait for transaction end.

  9. If you have only a few distinct values, wouldn’t it be most efficient to query pg_stats for the histogram of the column?

    That requires no additional coding and should be quite accurate assuming statistics_target is high enough.

    Thomas

  10. @Thomas:
    great idea, with 2 small problems:
    1. statistics can be (and usually are) not up to date
    2. it would require the number of values to be *really* low, in terms of absolute numbers. The method I showed in the post works well for a number of values that is low relative to the number of rows in the table, i.e. it will work quite well for 10000 values in a 1 million row table.

  11. re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If “not very” is something like once a week, then the histogram can probably be used without problems. Autovacuum should take care of that.

    re 2) right. I was thinking about an absolute number (not more than 100)
