Generate short, random, textual IDs

Couple of people asked, on irc, about how to generate unique, short, ids for rows.

Since I wrote code for this for explain.depesz.com, I figured that instead of pointing to sources, and letting everyone to extract the functions, I will, instead, write a set of functions that what is needed in reusable way.

First thing that we need is a function that will generate random string. It will need one or two arguments:

  1. number of characters that the string should have
  2. optionally – set of characters to use for generation

The function itself is pretty simple:

$ CREATE FUNCTION get_random_string(
        IN string_length INTEGER,
        IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    ) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    output TEXT = '';
    i INT4;
    pos INT4;
BEGIN
    FOR i IN 1..string_length LOOP
        pos := 1 + CAST( random() * ( LENGTH(possible_chars) - 1) AS INT4 );
        output := output || substr(possible_chars, pos, 1);
    END LOOP;
    RETURN output;
END;
$$;
CREATE FUNCTION

Sample usage:

$ SELECT get_random_string(10);
 get_random_string 
-------------------
 BOaM1K6wbS
(1 ROW)
 
$ SELECT get_random_string(15);
 get_random_string 
-------------------
 fWejQRndBUyJ5Mh
(1 ROW)
 
$ SELECT get_random_string(5, 'abc');
 get_random_string 
-------------------
 babbc
(1 ROW)

Then we need a way to insert data into table. Using a function (like explain.depesz.com) is not always convenient, so perhaps we can make something like:

$ CREATE TABLE test (
    id text DEFAULT get_random_string(10) PRIMARY KEY,
    something text
);
CREATE TABLE

Obviously, it will work:

$ INSERT INTO test (something)
    SELECT 'sth #' || i
    FROM generate_series(1,10) i;
INSERT 0 10
 
$ SELECT * FROM test;
     id     | something 
------------+-----------
 QOGAsBmuXQ | sth #1
 yu24hNR6cx | sth #2
 BJBlvuKHlx | sth #3
 ZCMoMF18AY | sth #4
 Z9TbDByeHa | sth #5
 cSunFqiZ8U | sth #6
 XggtW39WBK | sth #7
 5kTYMfjLK1 | sth #8
 vwTqkihTIp | sth #9
 xoWei3hrZs | sth #10
(10 ROWS)

But there are couple of issues:

  • what will happen if the same string will get generated?
  • 10 characters is not exactly short

With 10 characters in id, chances of generating duplicate id are pretty slim, so let's cheat a bit:

$ ALTER TABLE test ALTER COLUMN id SET DEFAULT get_random_string(2, 'ab');
ALTER TABLE
 
$ INSERT INTO test (something)
    SELECT 'sth #' || i
    FROM generate_series(11,20) i;
ERROR:  duplicate KEY VALUE violates UNIQUE CONSTRAINT "test_pkey"
DETAIL:  KEY (id)=(aa) already EXISTS.

Obviously, it's not good. Inserts use default value to assume that it will work.

So, let's make another function, that will be used as default:

$ CREATE OR REPLACE FUNCTION generate_random_id(
    IN string_length INTEGER,
    IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    v_random_id text;
    v_temp text;
BEGIN
    LOOP
        v_random_id := get_random_string( string_length, possible_chars );
        SELECT id INTO v_temp FROM test WHERE id = v_random_id;
        exit WHEN NOT found;
    END LOOP;
    RETURN v_random_id;
END;
$$;
CREATE FUNCTION

Now, let's test it:

$ ALTER TABLE test ALTER COLUMN id SET DEFAULT generate_random_id(2, 'ab');
ALTER TABLE
 
$ INSERT INTO test (something) VALUES ('a');
INSERT 0 1
 
$ INSERT INTO test (something) VALUES ('b');
INSERT 0 1
 
$ INSERT INTO test (something) VALUES ('c');
INSERT 0 1
 
$ INSERT INTO test (something) VALUES ('d');
INSERT 0 1
 
$ INSERT INTO test (something) VALUES ('e');
Cancel request sent
ERROR:  canceling statement due TO USER request
CONTEXT:  SQL statement "select id             FROM test where id = v_random_id"
PL/pgSQL FUNCTION generate_random_id(INTEGER,text) line 8 at SQL statement

I stopped it, because it entered infinite loop. The problem is that once all possible combinations of possible_chars with given string_length are used, then it will simply spin in infinite loop.

Not so great. Luckily – it's trivial to fix. We'll just make the loop increase length of id if collision will be found:

$ CREATE OR REPLACE FUNCTION generate_random_id(
    IN string_length INTEGER,
    IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    v_random_id  text;
    v_temp       text;
    v_length     int4   :=  string_length;
BEGIN
    LOOP
        v_random_id := get_random_string( v_length, possible_chars );
        SELECT id INTO v_temp FROM test WHERE id = v_random_id;
        exit WHEN NOT found;
        v_length := v_length + 1;
    END LOOP;
    RETURN v_random_id;
END;
$$;
CREATE FUNCTION

So, now, even with default that generates strings with length “2", if I'll try to insert, I will get id with length of 3:

$ INSERT INTO test (something) VALUES ('e') returning *;
 id  | something 
-----+-----------
 bab | e
(1 ROW)
INSERT 0 1

This is pretty good, but there are still problems.

First problem: what to do if I want to generate ID for table that is not named test?

That should be solvable too, let's pass table name as parameter. Since we know that tables can be in various schema, let's pass schema and table names separately. And we'll need column name too:

$ CREATE OR REPLACE FUNCTION generate_random_id(
    IN table_schema   TEXT,
    IN TABLE_NAME     TEXT,
    IN column_name    TEXT,
    IN string_length  INTEGER,
    IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    v_random_id  text;
    v_temp       text;
    v_length     int4   :=  string_length;
    v_sql        text;
BEGIN
    v_sql := format( 'SELECT %I FROM %I.%I WHERE %I = $1', column_name, table_schema, TABLE_NAME, column_name );
    LOOP
        v_random_id := get_random_string( v_length, possible_chars );
        EXECUTE v_sql INTO v_temp USING v_random_id;
        exit WHEN v_temp IS NULL;
        v_length := v_length + 1;
    END LOOP;
    RETURN v_random_id;
END;
$$ STRICT;
CREATE FUNCTION

I added STRICT designation, to avoid processing when any argument is NULL.

So, now I can:

$ ALTER TABLE test ALTER COLUMN id SET DEFAULT generate_random_id('public', 'test', 'id', 2, 'ab');
ALTER TABLE
 
$ INSERT INTO test (something) VALUES ('f'),('g'),('h') returning *;
 id  | something 
-----+-----------
 baa | f
 aab | g
 aaa | h
(3 ROWS)

Nice. Works. But does it really?

Let's consider, what will happen in this (unlikely, but we want to be safe) scenario:

  • we have two concurrent sessions (s1 and s2)
  • both sessions issue insert which calls our random generator
  • generator, in both cases, generates exactly the same random id
  • both functions (generate_random_id) check in underlying table if the id is used – and in both of them it returns false – the id is unused
  • both functions return the same id to insert command
  • one of the inserts will succeed, but the other one will fail.

Of course, the problem might seem far fetched – chances are slim, to put it lightly, but why take a chance anyway? We can fix it.

To fix it we can use advisory locks. New version of the function:

$ CREATE OR REPLACE FUNCTION generate_random_id(
    IN table_schema   TEXT,
    IN TABLE_NAME     TEXT,
    IN column_name    TEXT,
    IN string_length  INTEGER,
    IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
    v_random_id   text;
    v_temp        text;
    v_length      int4   :=  string_length;
    v_sql         text;
    v_advisory_1  int4 := hashtext( format('%I:%I:%I', table_schema, TABLE_NAME, column_name) );
    v_advisory_2  int4;
    v_advisory_ok bool;
BEGIN
    v_sql := format( 'SELECT %I FROM %I.%I WHERE %I = $1', column_name, table_schema, TABLE_NAME, column_name );
    LOOP
        v_random_id := get_random_string( v_length, possible_chars );
        v_advisory_2 := hashtext( v_random_id );
        v_advisory_ok := pg_try_advisory_xact_lock( v_advisory_1, v_advisory_2 );
        IF v_advisory_ok THEN
            EXECUTE v_sql INTO v_temp USING v_random_id;
            exit WHEN v_temp IS NULL;
        END IF;
        v_length := v_length + 1;
    END LOOP;
    RETURN v_random_id;
END;
$$ STRICT;
CREATE FUNCTION

Quick sanity check:

$ INSERT INTO test (something)
    SELECT unnest('{i,j,k,l,m}'::text[]) returning *;
  id   | something 
-------+-----------
 baaa  | i
 abbbb | j
 abb   | k
 bbba  | l
 abbab | m
(5 ROWS)

Since getting advisory lock is cheaper than doing actual data check, I figured that getting the lock first is slightly better idea.

After some using of the function, you might find out that it always returns strings longer than the minimal requested – it could be because pool of possible combinations of possible_chars with given length is empty. In such case – you might want to change default expression to start with longer strings, to avoid one loop that will always fail.

Now, since I already wrote the code, there is no point in redoing it by others. The extension is available on GitHub and in PGXN. You can get it, install, use and/or modify. Have fun 🙂

One thought on “Generate short, random, textual IDs”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.