Couple of people asked, on irc, about how to generate unique, short, ids for rows.
Since I wrote code for this for explain.depesz.com, I figured that instead of pointing to sources, and letting everyone to extract the functions, I will, instead, write a set of functions that what is needed in reusable way.
First thing that we need is a function that will generate random string. It will need one or two arguments:
- number of characters that the string should have
- optionally – set of characters to use for generation
The function itself is pretty simple:
$ CREATE FUNCTION get_random_string( IN string_length INTEGER, IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ) RETURNS text LANGUAGE plpgsql AS $$ DECLARE output TEXT = ''; i INT4; pos INT4; BEGIN FOR i IN 1..string_length LOOP pos := 1 + CAST( random() * ( LENGTH(possible_chars) - 1) AS INT4 ); output := output || substr(possible_chars, pos, 1); END LOOP; RETURN output; END; $$; CREATE FUNCTION
Sample usage:
$ SELECT get_random_string(10); get_random_string ------------------- BOaM1K6wbS (1 ROW) $ SELECT get_random_string(15); get_random_string ------------------- fWejQRndBUyJ5Mh (1 ROW) $ SELECT get_random_string(5, 'abc'); get_random_string ------------------- babbc (1 ROW)
Then we need a way to insert data into table. Using a function (like explain.depesz.com) is not always convenient, so perhaps we can make something like:
$ CREATE TABLE test ( id text DEFAULT get_random_string(10) PRIMARY KEY, something text ); CREATE TABLE
Obviously, it will work:
$ INSERT INTO test (something) SELECT 'sth #' || i FROM generate_series(1,10) i; INSERT 0 10 $ SELECT * FROM test; id | something ------------+----------- QOGAsBmuXQ | sth #1 yu24hNR6cx | sth #2 BJBlvuKHlx | sth #3 ZCMoMF18AY | sth #4 Z9TbDByeHa | sth #5 cSunFqiZ8U | sth #6 XggtW39WBK | sth #7 5kTYMfjLK1 | sth #8 vwTqkihTIp | sth #9 xoWei3hrZs | sth #10 (10 ROWS)
But there are couple of issues:
- what will happen if the same string will get generated?
- 10 characters is not exactly short
With 10 characters in id, chances of generating duplicate id are pretty slim, so let's cheat a bit:
$ ALTER TABLE test ALTER COLUMN id SET DEFAULT get_random_string(2, 'ab'); ALTER TABLE $ INSERT INTO test (something) SELECT 'sth #' || i FROM generate_series(11,20) i; ERROR: duplicate KEY VALUE violates UNIQUE CONSTRAINT "test_pkey" DETAIL: KEY (id)=(aa) already EXISTS.
Obviously, it's not good. Inserts use default value to assume that it will work.
So, let's make another function, that will be used as default:
$ CREATE OR REPLACE FUNCTION generate_random_id( IN string_length INTEGER, IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ) RETURNS text LANGUAGE plpgsql AS $$ DECLARE v_random_id text; v_temp text; BEGIN LOOP v_random_id := get_random_string( string_length, possible_chars ); SELECT id INTO v_temp FROM test WHERE id = v_random_id; exit WHEN NOT found; END LOOP; RETURN v_random_id; END; $$; CREATE FUNCTION
Now, let's test it:
$ ALTER TABLE test ALTER COLUMN id SET DEFAULT generate_random_id(2, 'ab'); ALTER TABLE $ INSERT INTO test (something) VALUES ('a'); INSERT 0 1 $ INSERT INTO test (something) VALUES ('b'); INSERT 0 1 $ INSERT INTO test (something) VALUES ('c'); INSERT 0 1 $ INSERT INTO test (something) VALUES ('d'); INSERT 0 1 $ INSERT INTO test (something) VALUES ('e'); Cancel request sent ERROR: canceling statement due TO USER request CONTEXT: SQL statement "select id FROM test where id = v_random_id" PL/pgSQL FUNCTION generate_random_id(INTEGER,text) line 8 at SQL statement
I stopped it, because it entered infinite loop. The problem is that once all possible combinations of possible_chars with given string_length are used, then it will simply spin in infinite loop.
Not so great. Luckily – it's trivial to fix. We'll just make the loop increase length of id if collision will be found:
$ CREATE OR REPLACE FUNCTION generate_random_id( IN string_length INTEGER, IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ) RETURNS text LANGUAGE plpgsql AS $$ DECLARE v_random_id text; v_temp text; v_length int4 := string_length; BEGIN LOOP v_random_id := get_random_string( v_length, possible_chars ); SELECT id INTO v_temp FROM test WHERE id = v_random_id; exit WHEN NOT found; v_length := v_length + 1; END LOOP; RETURN v_random_id; END; $$; CREATE FUNCTION
So, now, even with default that generates strings with length “2", if I'll try to insert, I will get id with length of 3:
$ INSERT INTO test (something) VALUES ('e') returning *; id | something -----+----------- bab | e (1 ROW) INSERT 0 1
This is pretty good, but there are still problems.
First problem: what to do if I want to generate ID for table that is not named test?
That should be solvable too, let's pass table name as parameter. Since we know that tables can be in various schema, let's pass schema and table names separately. And we'll need column name too:
$ CREATE OR REPLACE FUNCTION generate_random_id( IN table_schema TEXT, IN TABLE_NAME TEXT, IN column_name TEXT, IN string_length INTEGER, IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ) RETURNS text LANGUAGE plpgsql AS $$ DECLARE v_random_id text; v_temp text; v_length int4 := string_length; v_sql text; BEGIN v_sql := format( 'SELECT %I FROM %I.%I WHERE %I = $1', column_name, table_schema, TABLE_NAME, column_name ); LOOP v_random_id := get_random_string( v_length, possible_chars ); EXECUTE v_sql INTO v_temp USING v_random_id; exit WHEN v_temp IS NULL; v_length := v_length + 1; END LOOP; RETURN v_random_id; END; $$ STRICT; CREATE FUNCTION
I added STRICT designation, to avoid processing when any argument is NULL.
So, now I can:
$ ALTER TABLE test ALTER COLUMN id SET DEFAULT generate_random_id('public', 'test', 'id', 2, 'ab'); ALTER TABLE $ INSERT INTO test (something) VALUES ('f'),('g'),('h') returning *; id | something -----+----------- baa | f aab | g aaa | h (3 ROWS)
Nice. Works. But does it really?
Let's consider, what will happen in this (unlikely, but we want to be safe) scenario:
- we have two concurrent sessions (s1 and s2)
- both sessions issue insert which calls our random generator
- generator, in both cases, generates exactly the same random id
- both functions (generate_random_id) check in underlying table if the id is used – and in both of them it returns false – the id is unused
- both functions return the same id to insert command
- one of the inserts will succeed, but the other one will fail.
Of course, the problem might seem far fetched – chances are slim, to put it lightly, but why take a chance anyway? We can fix it.
To fix it we can use advisory locks. New version of the function:
$ CREATE OR REPLACE FUNCTION generate_random_id( IN table_schema TEXT, IN TABLE_NAME TEXT, IN column_name TEXT, IN string_length INTEGER, IN possible_chars TEXT DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ) RETURNS text LANGUAGE plpgsql AS $$ DECLARE v_random_id text; v_temp text; v_length int4 := string_length; v_sql text; v_advisory_1 int4 := hashtext( format('%I:%I:%I', table_schema, TABLE_NAME, column_name) ); v_advisory_2 int4; v_advisory_ok bool; BEGIN v_sql := format( 'SELECT %I FROM %I.%I WHERE %I = $1', column_name, table_schema, TABLE_NAME, column_name ); LOOP v_random_id := get_random_string( v_length, possible_chars ); v_advisory_2 := hashtext( v_random_id ); v_advisory_ok := pg_try_advisory_xact_lock( v_advisory_1, v_advisory_2 ); IF v_advisory_ok THEN EXECUTE v_sql INTO v_temp USING v_random_id; exit WHEN v_temp IS NULL; END IF; v_length := v_length + 1; END LOOP; RETURN v_random_id; END; $$ STRICT; CREATE FUNCTION
Quick sanity check:
$ INSERT INTO test (something) SELECT unnest('{i,j,k,l,m}'::text[]) returning *; id | something -------+----------- baaa | i abbbb | j abb | k bbba | l abbab | m (5 ROWS)
Since getting advisory lock is cheaper than doing actual data check, I figured that getting the lock first is slightly better idea.
After some using of the function, you might find out that it always returns strings longer than the minimal requested – it could be because pool of possible combinations of possible_chars with given length is empty. In such case – you might want to change default expression to start with longer strings, to avoid one loop that will always fail.
Now, since I already wrote the code, there is no point in redoing it by others. The extension is available on GitHub and in PGXN. You can get it, install, use and/or modify. Have fun 🙂
Very nice, thank you for sharing.