There is this idea that normal forms in databases require you to use integer, auto-incrementing primary keys.
One of the points that proponents of surrogate keys (i.e. keys based on integers and sequences) raise is that comparing integers is faster than comparing texts. So,
SELECT * FROM users WHERE id = 123
is faster than
SELECT * FROM users WHERE username = 'depesz'
It definitely looks like the text search should be slower – after all, the integer takes 4 bytes, and the text (“depesz” in this case) takes 7: six characters plus a 1-byte header. Almost twice as bad.
So, since I haven't written anything on the blog for quite some time, I decided to take a closer look at this question.
Of course I need some test tables.
In the case of the int4-based primary key, the table is:
CREATE TABLE users_s (
    id          INT4        PRIMARY KEY,
    username    TEXT        NOT NULL UNIQUE,
    password    TEXT,
    is_admin    BOOL        NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ,
    blocked_by  INT4        REFERENCES users_s (id)
);
The version with a text-based primary key is similar:
CREATE TABLE users_n (
    username    TEXT        NOT NULL PRIMARY KEY,
    password    TEXT,
    is_admin    BOOL        NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ,
    blocked_by  TEXT        REFERENCES users_n (username)
);
To these tables I loaded the same dataset – 10 million rows – with the following properties (a sketch of one possible generation query follows the list):
- username – random string, 2 or more characters long (from: a-z, A-Z, 0-9)
- password – random string, 32 letters
- is_admin – false in ~99.99% of rows, true in ~0.01%
- created_tsz – random timestamp between 2000-01-01 00:00:00 and now, made so that (in the users_s table) it only rises (when ordering by id)
- blocked_tsz – null in 99.9% of rows; otherwise – a random timestamp between created_tsz and now
- blocked_by – null if blocked_tsz is null; otherwise – the id (or username) of a random user that has “is_admin” set to true
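The generation script itself isn't shown here, but a minimal SQL sketch of the idea (my approximation, not the original code; passwords here are 32 hex characters from md5(), created_tsz rises strictly linearly, and the blocked_* columns stay NULL) could look like this:

INSERT INTO users_s (id, username, password, is_admin, created_tsz)
SELECT
    i,
    -- random username, 4 to 22 characters long, even lengths only
    ( SELECT string_agg(substr(c.chars, 1 + floor(random() * 62)::int, 1), '')
        FROM generate_series(1, 4 + 2 * (i % 10)) ),
    md5(i::text || random()::text),            -- 32-character "password"
    random() < 0.0001,                         -- ~0.01% admins
    timestamptz '2000-01-01'
        + (now() - timestamptz '2000-01-01') * (i::float8 / 10000000)
FROM generate_series(1, 10000000) AS i,
     (VALUES ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')) AS c (chars)
ON CONFLICT (username) DO NOTHING;  -- short usernames will collide; a real run needs a top-up pass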
$ SELECT * FROM users_s WHERE id >= random() * 10000000 LIMIT 10;
   id    │  username  │             password             │ is_admin │      created_tsz       │ blocked_tsz │ blocked_by
─────────┼────────────┼──────────────────────────────────┼──────────┼────────────────────────┼─────────────┼────────────
 3223759 │ NmqLFYS1xa │ TNiwvhtqJGPYeLdbuSjpXWDMQCKsFgAa │ f        │ 2004-01-02 09:53:39+01 │ [NULL]      │ [NULL]
 3223760 │ g9LE46QtWU │ wFgkSRLMOHvdcTyWNxhYtlAVbmPQfUED │ f        │ 2004-01-02 09:53:53+01 │ [NULL]      │ [NULL]
 3223762 │ arjvTCObUi │ MswoWZpSYaALuVnHzCEIgtmvrBJxXQby │ f        │ 2004-01-02 09:54:35+01 │ [NULL]      │ [NULL]
 3223764 │ brJ4aMXng3 │ FzKLxrwWqJeVdkRIAjpaYCNUsPZihBcb │ f        │ 2004-01-02 09:55:36+01 │ [NULL]      │ [NULL]
 3223765 │ thZWJzMb7K │ rAckTDMwuozflENFZKUbhxPnaRisYCVJ │ f        │ 2004-01-02 09:57:38+01 │ [NULL]      │ [NULL]
 3223766 │ aykbcMFQ0e │ SejNzfrnmOhJyCXEivxautqkgDQbdTpL │ f        │ 2004-01-02 09:58:50+01 │ [NULL]      │ [NULL]
 3223767 │ WgR1yxm9Zu │ kAZVTmMxjqocCHgnOuseiQyNFtXGPKRh │ f        │ 2004-01-02 09:58:51+01 │ [NULL]      │ [NULL]
 3223768 │ HFOJsG9nKl │ zNYrptxwTPIRdQugEBicLebyZShaWHsG │ f        │ 2004-01-02 09:59:12+01 │ [NULL]      │ [NULL]
 3223770 │ fbXPnuUMgD │ BwDWRgpQasbZuAfzVqKonclCGOYJSkxL │ f        │ 2004-01-02 10:00:02+01 │ [NULL]      │ [NULL]
 3223776 │ yJ8A1V9rNL │ YwDSTceukhCXEiPoaVIMrzJgbtnvHQZp │ f        │ 2004-01-02 10:01:11+01 │ [NULL]      │ [NULL]
(10 rows)
The distribution of username lengths:
$ SELECT LENGTH(username), COUNT(*) FROM users_s GROUP BY LENGTH ORDER BY LENGTH;
 LENGTH │   COUNT
────────┼─────────
      4 │  999999
      6 │ 1000000
      8 │ 1000000
     10 │ 1000000
     12 │ 1000000
     14 │ 1000000
     16 │ 1000000
     18 │ 1000000
     20 │ 1000000
     22 │ 1000000
     24 │       1
(11 rows)
(Oops, looks like I had an “off by one” error in the data generation script. But that's not very important.)
Sizes of tables and indexes:
- int4-based primary key:
  - table data (with TOAST): 886 MB
  - primary key index (on “id”): 214 MB
  - username index: 318 MB
  - created_tsz index: 214 MB
  - total size: 1633 MB
- text-based primary key:
  - table data (with TOAST): 847 MB
  - primary key index (on “username”): 318 MB
  - created_tsz index: 214 MB
  - total size: 1379 MB
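For the record, such numbers can be obtained with the standard size functions; “users_s_pkey” below assumes the default primary key index name:

SELECT pg_size_pretty(pg_table_size('users_s'))          AS table_with_toast,
       pg_size_pretty(pg_relation_size('users_s_pkey'))  AS pkey_index,
       pg_size_pretty(pg_total_relation_size('users_s')) AS total_size;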
So, right from the start, it looks like the text-based primary key is better, because it decreases the amount of data that has to be cached. But that's not the point of this blogpost. Let's see some speeds.
I generated a list of 100,000 ids, and 10 lists of 100,000 usernames each – each list containing only usernames of a given length, from 4 to 22 characters.
Then I wrote a simple program, which did, in a loop:
- pick a random test (check by id, check by username of length 4, check by username …)
- fetch the next value to test (id or username)
- run ->execute() on the query, recording the time it took
- run ->fetchrow() on the query, recording the time it took
The query was always SELECT * FROM table WHERE field = ?.
This was done in a loop that queried all the values (100,000 ids, and 1 million usernames).
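Judging by the ->execute()/->fetchrow() calls, the program used a DBI-style driver; in plain SQL, the two query shapes correspond to prepared statements like these (statement names are mine):

PREPARE by_id (INT4) AS
    SELECT * FROM users_s WHERE id = $1;
PREPARE by_username (TEXT) AS
    SELECT * FROM users_n WHERE username = $1;

-- example lookups, using values from the sample data above:
EXECUTE by_id(3223759);
EXECUTE by_username('NmqLFYS1xa');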
Results are below in a table.
Most of the columns should be self-explanatory; just a note on “95%” – this is the time that 95% of the queries finished in. So if it's “2 seconds”, it means that 95% of the queries finished in 2 seconds or less, and only 5% took more than 2 seconds.
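For the record – if the per-query times were collected into a table (a hypothetical results(test, time_ms) here), such a percentile could be computed with an ordered-set aggregate (PostgreSQL 9.4+):

SELECT test,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY time_ms) AS pct95_ms
FROM results
GROUP BY test;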
All times are in milliseconds (0.001 of a second) – I skipped the units to make the table less cluttered (though it still is).
The table is huge, but the first thing I see is that we can easily skip the “fetch” part – it shows very similar timings. Actually, fetching a row with the integer primary key is a bit slower (~10%), but that's because the row, having one more column, is wider.
Also – I'm not really interested in all the other columns – 95% should be enough. Why not max? Well, there can always be spikes that cause general system performance degradation, and I don't want to punish some tests at random. The 95th percentile seems good enough to get sensible results.
So, I can simplify the table to:
Much nicer table (to read, at least).
Whoa. It looks like integer searching is ~3 times slower than text searching. This could be influenced by a simple fact – better caching of the text-based table – i.e. it was queried a million times, while the integer-based table was queried only 100,000 times. So the text-based table got an unfair advantage.
So let's redo the test. This time I will query the tables in order – 100,000 queries to the integer-based table first, and then all the other tests, also in order. Results:
Much better. And a lesson for me, for future, about how to do such tests 🙂
Anyway – what do these results tell us? Well, as expected, text-based searches are slower. But the difference is (in my opinion) negligible:
- ~2.7% performance decrease when switching to 8-character texts
- ~4.8% when switching to (huge) 22-character texts
An 8-character string uses 9 bytes (the characters plus a 1-byte short varlena header); a 22-character one uses 23 bytes. So we can see that the performance decrease is not linear in the number of bytes.
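This is easy to check – pg_column_size() reports the stored size of a value, which for short inline texts is the data plus a 1-byte varlena header (usernames here are pure ASCII, so characters equal data bytes):

SELECT DISTINCT
    length(username)         AS characters,
    pg_column_size(username) AS bytes
FROM users_n
WHERE length(username) IN (8, 22)
ORDER BY characters;

This should show 9 and 23 bytes for the two lengths.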
That, plus the fact that with text-based primary keys we can remove one column and one index, means that the benefits will most likely outweigh this small slowdown.
Of course – this doesn't mean: get rid of all integer primary keys. It just means: do not treat “performance degradation” as an argument against using natural, text-based keys.