my thoughts on getting random row

this topic has been written about by many smart people – most recently by greg sabino mullane and josh berkus.

they show 4 different approaches:

  1. order by random()
  2. >= random() limit 1
  3. random column
  4. random aggregate

all these approaches have their benefits and drawbacks, but i'd like to show another one (polish readers saw the approach already in january 2007, but this time i will make the code more robust).

first, let's discuss existing methods:

order by random()

this is a very good way to get truly random records. there is no “preference” for specific rows. the major drawback is speed – or rather the lack of it.
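
for reference, this method in its entirety is just a one-liner (using the some_data test table that i describe later):

-- sort every row by a freshly generated random value and keep the first one;
-- the sort over the whole table is what makes it slow
SELECT * FROM some_data ORDER BY random() LIMIT 1;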

>= random() limit 1

this method is faster, but leads to a skewed distribution of rows when there are many “gaps” in the numbering – ids that directly follow a gap get picked far more often than the others.
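
the usual shape of this method (details vary between writeups) is roughly:

-- jump to a random point in the id range and take the first existing row at or
-- above it; ids that directly follow big gaps "absorb" the whole gap's probability
SELECT * FROM some_data
 WHERE id >= (SELECT (random() * max(id))::int FROM some_data)
 ORDER BY id
 LIMIT 1;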

random column

this is the solution that greg proposes, and it looks quite good – until we notice that it requires an extra column which has to be periodically updated across the whole table, just to keep the results random.
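
a rough sketch of the idea (not necessarily greg's exact code – column and index names are mine):

-- keep a pre-randomized, indexed column and grab the row nearest to a fresh
-- random draw; the column has to be re-randomized periodically, otherwise the
-- same rows keep winning for the same draws
ALTER TABLE some_data ADD COLUMN randomizer double precision;
UPDATE some_data SET randomizer = random();
CREATE INDEX some_data_randomizer_idx ON some_data (randomizer);

-- (can occasionally return no row, when the draw lands above the largest randomizer value)
SELECT * FROM some_data
 WHERE randomizer >= (SELECT random())
 ORDER BY randomizer
 LIMIT 1;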

random aggregate

this is a brilliant solution for some cases – e.g. when we want to get a random record for every “user”. but it still requires a sequential scan.
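
i won't reproduce josh's code here, but the general idea can be sketched as a custom aggregate that keeps each incoming value with probability 1/n (reservoir sampling with a reservoir of one) – the code below is my own illustration, not his implementation:

-- sketch only: state = (rows seen so far, currently kept value); every new
-- value replaces the kept one with probability 1/n, so all rows end up equally likely
CREATE TYPE random_agg_state AS (seen INT8, kept TEXT);

CREATE FUNCTION random_agg_trans(random_agg_state, TEXT) RETURNS random_agg_state AS $$
    SELECT ROW(
        ($1).seen + 1,
        CASE WHEN random() < 1.0 / (($1).seen + 1) THEN $2 ELSE ($1).kept END
    )::random_agg_state;
$$ LANGUAGE sql;

CREATE FUNCTION random_agg_final(random_agg_state) RETURNS TEXT AS $$
    SELECT ($1).kept;
$$ LANGUAGE sql;

CREATE AGGREGATE random_pick(TEXT) (
    SFUNC     = random_agg_trans,
    STYPE     = random_agg_state,
    FINALFUNC = random_agg_final,
    INITCOND  = '(0,)'
);

-- one random value per group, in a single sequential scan:
SELECT group_id, random_pick(value) FROM some_data GROUP BY group_id;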

so, is there a way to make it faster?

i created a test table:

# \d some_data
                          Table "public.some_data"
  Column  |  Type   |                       Modifiers
----------+---------+--------------------------------------------------------
 id       | integer | not null default nextval('some_data_id_seq'::regclass)
 group_id | integer |
 value    | text    |
Indexes:
    "some_data_pkey" PRIMARY KEY, btree (id)

it contains 10000 rows in 500 different group_ids:

# SELECT COUNT(*), COUNT(DISTINCT group_id), COUNT(DISTINCT value) FROM some_data;
 count | count | count
-------+-------+-------
 10000 |   500 | 10000
(1 row)
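
i'm not showing the exact data-generation step here, but a table like this can be produced with something along these lines (my reconstruction, not the original script):

-- 10,000 rows, 500 distinct group_ids, all values unique
CREATE TABLE some_data (
    id       serial PRIMARY KEY,
    group_id integer,
    value    text
);
INSERT INTO some_data (group_id, value)
    SELECT i % 500 + 1, 'value #' || i
      FROM generate_series(1, 10000) AS i;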

my approach is based on the fact that index scans are very fast.

basically – given a table that has a numerical primary key (like the id column in my example) i can find the lowest and highest id, and then search for some random record in between.

it is possible that the generated random id will not exist, so i will need to check whether it really exists, and if not – find another one.

in theory this can lead to an infinite loop. in practice – i can easily add a limit: after some number (10? 100?) of failed tries i simply raise an exception, or choose a random record using some other method.

so, let's write the basic code:

CREATE OR REPLACE FUNCTION get_random_id() RETURNS INT4 AS $BODY$
DECLARE
    id_range record;
    reply    INT4;
    try      INT4 := 0;
BEGIN
    -- find the lowest id and the width of the id range (min..max, inclusive)
    SELECT MIN(id), MAX(id) - MIN(id) + 1 AS range INTO id_range FROM some_data;
    WHILE ( try < 10 ) LOOP
        try := try + 1;
        -- pick a candidate id uniformly from [min, max]
        reply := FLOOR( random() * id_range.range ) + id_range.min;
        -- cheap index lookup to check whether the candidate actually exists
        PERFORM id FROM some_data WHERE id = reply;
        IF found THEN
            RETURN reply;
        END IF;
    END LOOP;
    RAISE EXCEPTION 'something strange happened - no record found in % tries', try;
END;
$BODY$ LANGUAGE plpgsql STABLE;

and how do i use it? it's simple:

# EXPLAIN ANALYZE SELECT * FROM some_data WHERE id = get_random_id();
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using some_data_pkey on some_data  (cost=0.25..8.52 rows=1 width=24) (actual time=0.202..0.205 rows=1 loops=1)
   Index Cond: (id = get_random_id())
 Total runtime: 0.241 ms
(3 rows)

for comparison:

# EXPLAIN ANALYZE SELECT * FROM some_data ORDER BY random() LIMIT 1;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=239.00..239.00 rows=1 width=24) (actual time=55.740..55.743 rows=1 loops=1)
   ->  Sort  (cost=239.00..264.00 rows=10000 width=24) (actual time=55.734..55.734 rows=1 loops=1)
         Sort Key: (random())
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Seq Scan on some_data  (cost=0.00..189.00 rows=10000 width=24) (actual time=0.022..28.413 rows=10000 loops=1)
 Total runtime: 55.807 ms
(6 rows)

order by random() limit 1 is actually quite fast here, thanks to 2 facts:

  • there are only 10,000 rows
  • i'm using postgresql 8.3 and it has this nice “top-N" sorting method

so, the code looks pretty simple, but can we make it more robust? perhaps a way to make the function work with multiple tables? or return more than one row-id?

ok, let's drop the function, and create 2 new ones. the first will be really simple:

CREATE OR REPLACE FUNCTION get_random_id(in_table_name TEXT) RETURNS INT4[] AS $BODY$
BEGIN
    RETURN get_random_id(in_table_name, 1);
END;
$BODY$ LANGUAGE plpgsql STABLE;

this is just a wrapper around our soon-to-exist function that supplies the “default” number of ids to be returned. please note though that it is now defined to return an array of int4!

now, the main function:

CREATE OR REPLACE FUNCTION get_random_id(in_table_name TEXT, want_elements INT4) RETURNS INT4[] AS $BODY$
DECLARE
    use_sql        TEXT;
    id_range       record;
    reply          INT4[];
    this_part      INT4[];
    try            INT4    := 0;
    i              INT4;
BEGIN
    -- get the lowest id and the width of the id range for the requested table
    use_sql := 'SELECT min(id), max(id) - min(id) + 1 as range FROM ' || quote_ident(in_table_name);
    EXECUTE use_sql INTO id_range;
    IF id_range.range < want_elements THEN
        RAISE EXCEPTION 'not enough elements in % table.', in_table_name;
    END IF;
    WHILE ( try < 10 ) LOOP
        try := try + 1;
        this_part := '{}'::INT4[];
        -- draw as many distinct candidate ids as we are still missing, skipping
        -- both duplicates within this batch and ids accepted in earlier batches
        WHILE ( COALESCE(array_upper( this_part, 1 ), 0) < want_elements - COALESCE(array_upper(reply, 1), 0)) LOOP
            i := FLOOR( random() * id_range.range) + id_range.min;
            IF NOT (i = ANY(this_part)) AND NOT (i = ANY(COALESCE(reply, '{}'::INT4[]))) THEN
                this_part := this_part || i;
            END IF;
        END LOOP;
        -- a single indexed query to find out which candidates really exist
        use_sql := 'SELECT id FROM ' || quote_ident(in_table_name) || ' WHERE id = ANY(' || quote_literal(this_part::TEXT) || '::INT4[])';
        FOR i IN EXECUTE use_sql LOOP
            reply := reply || i;
        END LOOP;
        IF (COALESCE(array_upper(reply, 1), 0) = want_elements) THEN
            RETURN reply;
        END IF;
    END LOOP;
    RAISE EXCEPTION 'something strange happened - requested records not found in % tries', try;
END;
$BODY$ LANGUAGE plpgsql STABLE;

it's longer. but how well does it work?

# EXPLAIN ANALYZE SELECT * FROM some_data WHERE id = ANY(get_random_id('some_data', 10));
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on some_data  (cost=38.84..69.70 rows=10 width=24) (actual time=2.576..2.606 rows=10 loops=1)
   Recheck Cond: (id = ANY (get_random_id('some_data'::text, 10)))
   ->  Bitmap Index Scan on some_data_pkey  (cost=0.00..38.84 rows=10 width=0) (actual time=2.566..2.566 rows=10 loops=1)
         Index Cond: (id = ANY (get_random_id('some_data'::text, 10)))
 Total runtime: 2.681 ms
(5 rows)

not bad 🙂

and thanks to our wrapper function i can also call it without the second argument:

# EXPLAIN ANALYZE SELECT * FROM some_data WHERE id = ANY(get_random_id('some_data'));
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on some_data  (cost=38.84..69.70 rows=10 width=24) (actual time=1.318..1.321 rows=1 loops=1)
   Recheck Cond: (id = ANY (get_random_id('some_data'::text)))
   ->  Bitmap Index Scan on some_data_pkey  (cost=0.00..38.84 rows=10 width=0) (actual time=1.311..1.311 rows=1 loops=1)
         Index Cond: (id = ANY (get_random_id('some_data'::text)))
 Total runtime: 1.375 ms
(5 rows)

now, it would be cool to add a way to “find out” the name of the table in a more automatic way, but i don't see a reasonably fault-proof way to do it (scanning pg_stat_activity and parsing the sql doesn't look sane to me).

nice add-ons for this could be limiting ids to some user-supplied range (as in: get_random_id('some_data', 10, 1000, 2000) – 10 random ids between 1000 and 2000), or adding custom 'where' clauses.
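
a quick sketch of how such a range-limited variant might look (an untested illustration, not part of the code above – the 4-argument signature just mirrors the call shown in the previous paragraph):

-- sketch only: same idea as get_random_id(table, n), but candidates are drawn
-- exclusively from the caller-supplied [in_min, in_max] id range
CREATE OR REPLACE FUNCTION get_random_id(in_table_name TEXT, want_elements INT4, in_min INT4, in_max INT4) RETURNS INT4[] AS $BODY$
DECLARE
    reply    INT4[] := '{}'::INT4[];
    found_id INT4;
    i        INT4;
    try      INT4   := 0;
BEGIN
    IF in_max - in_min + 1 < want_elements THEN
        RAISE EXCEPTION 'range %..% is too small for % elements', in_min, in_max, want_elements;
    END IF;
    WHILE ( COALESCE(array_upper(reply, 1), 0) < want_elements ) LOOP
        try := try + 1;
        IF try > 100 THEN
            RAISE EXCEPTION 'requested records not found in % tries', try - 1;
        END IF;
        -- draw a candidate id from the user-supplied range only
        i := FLOOR( random() * (in_max - in_min + 1) ) + in_min;
        CONTINUE WHEN i = ANY(reply);
        -- check that the candidate actually exists in the table
        EXECUTE 'SELECT id FROM ' || quote_ident(in_table_name) || ' WHERE id = ' || i::text INTO found_id;
        IF found_id IS NOT NULL THEN
            reply := reply || i;
        END IF;
    END LOOP;
    RETURN reply;
END;
$BODY$ LANGUAGE plpgsql STABLE;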

one thing though – what if you don't have a single-column primary key? or you have lots of deletes, so the gaps in numbering are enormous?

the solution is very simple – and similar to greg's – add an int4 column with “unique, not null”, and periodically update it (from some sequence) to remove the gaps. this is basically only needed if the gaps make up more than 75% of the possible id values. and once it's set up, you can run as many get_random_id() calls as you'd like.
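
roughly, the setup and the periodic renumbering could look like this (a sketch; column, sequence and constraint names are my own):

-- sketch: a separate, gap-free numbering column that the random-id lookup can
-- use instead of the primary key
ALTER TABLE some_data ADD COLUMN random_key INT4;
CREATE SEQUENCE some_data_random_key_seq;
UPDATE some_data SET random_key = nextval('some_data_random_key_seq');
ALTER TABLE some_data ALTER COLUMN random_key SET NOT NULL;
ALTER TABLE some_data ADD CONSTRAINT some_data_random_key_uniq UNIQUE (random_key);

-- periodic "defragmentation", run when gaps from deletes get too large; the
-- unique constraint is dropped first so transient duplicates during the
-- renumbering don't abort the update
ALTER TABLE some_data DROP CONSTRAINT some_data_random_key_uniq;
ALTER SEQUENCE some_data_random_key_seq RESTART WITH 1;
UPDATE some_data SET random_key = nextval('some_data_random_key_seq');
ALTER TABLE some_data ADD CONSTRAINT some_data_random_key_uniq UNIQUE (random_key);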

16 thoughts on “my thoughts on getting random row”

  1. SELECT id FROM ORDER BY random() LIMIT 1
    can be replaced with:

    SELECT ..
    WHERE id = ANY (ARRAY(SELECT (random()*max_id)::int FROM generate_series(1,20))) LIMIT 1

  2. Interesting. How about something like this

    select * from some_data
    where floor(random()*1001)%5 limit 1

  3. @Regina:
    i think i don't understand your suggestion – are you sure there shouldn't be “id” in it somewhere?

  4. Depesz,

    Actually there was a flaw in my thinking. What I was trying to do with the modulo operator was to create a random dice thrower that returns true an average of x times per set. I was trying to get away from reliance on id.
    It's too bad PostgreSQL doesn't have a row_number() function.

    The modulo operator behaves rather oddly though (like in the above I initially tried – it returned a boolean, but if I cast each argument to int it returns a number as expected). So I have no idea what it is doing in the above on further inspection.

    This is closer to what I was thinking.

    SELECT * from some_data
    where (random()*100000000)::int%1000 = 0
    order by random() limit 1

    where you would change the 1000 as a function of the size of your data set, so you might replace it with a (SELECT count(id) ..from..) subselect or something like that.

    The above seems much faster than just doing a simple
    SELECT * from some_data
    order by random() limit 1

    On the order of about 10 times faster on the 150,000-record set I tested,
    but since it's not relying on an indexed field, it's probably slower than using an id-based check.

  5. Pavel,

    That looks neat. One question I had: is that ARRAY() call you have necessary? I tried the below without the ARRAY call and it seemed to work just as well and just as fast, but it could be a version thing.

    SELECT ..
    WHERE id = ANY (SELECT (random()*max_id)::int FROM generate_series(1,20)) LIMIT 1

  6. @Regina:
    i think it's not necessary (why? i don't know), but it can be simplified to:
    where id in (select …) limit 1;

  7. @Regina

    There is difference in execution plans.

    WHERE id = ANY(SELECT -> MERGE JOIN (or any JOIN)

    WHERE id = ANY(ARRAY( -> it's only an index scan (it's faster and more effective), this is the trick

    Look at the execution plans.

  8. It'd be interesting to run each of these 10,000 times and then see just how “random” the results you got were.

  9. @Robert Treat:
    while testing i found that i posted a version of the function with debug code (if random() < .5), so i removed it.
    as for making those tests – i tested my function only, and the results i got were rather good. for 50 rows in the test table, and 10000 calls to get_random_id('test', 1) i got these values:
     id | selected_times
    ----+----------------
    1 | 192
    2 | 195
    3 | 193
    4 | 204
    5 | 207
    6 | 192
    7 | 191
    8 | 187
    9 | 222
    10 | 215
    11 | 192
    12 | 203
    13 | 209
    14 | 210
    15 | 199
    16 | 217
    17 | 217
    18 | 196
    19 | 202
    20 | 191
    21 | 179
    22 | 188
    23 | 185
    24 | 183
    25 | 181
    26 | 206
    27 | 205
    28 | 204
    29 | 197
    30 | 212
    31 | 231
    32 | 189
    33 | 196
    34 | 226
    35 | 216
    36 | 177
    37 | 194
    38 | 210
    39 | 208
    40 | 217
    41 | 192
    42 | 195
    43 | 173
    44 | 182
    45 | 207
    46 | 211
    47 | 216
    48 | 202
    49 | 183
    50 | 201
    (50 rows)

    which looks quite good.
    as for testing other methods – i'm not really sure which ones to choose, as some of them have quite obvious problems (not considering the lower limit, using a cast to int (which rounds) instead of flooring, and so on). i don't want to test examples which are bad, but fixing them just to test them doesn't look good either. perhaps the authors of these solutions can show us their randomness numbers?

  10. Just a thought: maybe instead of the loop and equality check you could use an inequality and a random (small) offset (to avoid selecting the boundaries of large gaps too often). So something like:

    SELECT … WHERE id > min_id + (random() * (max_id - max_offset - min_id))::int … LIMIT 1 OFFSET (random() * max_offset)::int

  11. @Csaba Nagy:
    i wrote that >/= are not acceptable:

    >= random() limit 1

    this method is faster, but leads to a skewed distribution of rows when there are many “gaps” in the numbering.

  12. Hi,

    I've come up with another solution that uses ORDER BY random() applied only to a small subset of the actual table. It appears to perform really well – in fact, about ten times faster than plain ORDER BY random() and about three times faster than get_random_id().

    http://isteve.bofh.cz/~isteve/tmp/ordertest.html

    Any comments whatsoever are accepted, and please excuse a link instead of a full post – the text is quite long.
