There is this idea that normal forms in databases require you to use integer, auto-incrementing primary keys.
The idea has been discussed by many people; I will just point you to a series of three blog posts on the subject by Josh Berkus (parts 1, 2 and 3, and a reprise).
One of the points that proponents of surrogate keys (i.e. those based on integer and sequences) raise is that comparing integers is faster than comparing texts. So,
SELECT * FROM users WHERE id = 123
is faster than
SELECT * FROM users WHERE username = 'depesz'
The text lookup definitely looks like it should be slower: after all, an integer takes 4 bytes, and the text ("depesz" in this case) takes 7. Almost twice as bad.
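A minimal Python sketch of where the 7 comes from. Python is used here only for illustration, and the helper name `stored_text_size` is made up. It assumes a UTF-8 database encoding and PostgreSQL's 1-byte length header for short text values (up to 126 bytes of data), and it ignores alignment padding, compression and TOAST.

```python
INT4_SIZE = 4  # an int4 value always takes 4 bytes


def stored_text_size(value: str) -> int:
    """Approximate stored size, in bytes, of a PostgreSQL text value."""
    payload = len(value.encode("utf-8"))  # assumes a UTF-8 database encoding
    # Short values (up to 126 bytes of data) carry a 1-byte length header,
    # longer ones a 4-byte header.
    header = 1 if payload <= 126 else 4
    return payload + header


print(INT4_SIZE)                   # 4
print(stored_text_size("depesz"))  # 7: six bytes of data plus the header
```

The same helper can be used to check any other key value, since only the encoded byte length matters, not the number of characters.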
So, since I haven't written anything on the blog for quite some time, I decided to take a closer look at this question.
Of course I need some test tables.
In case of int4-primary-key the table is:
CREATE TABLE users_s (
    id          INT4 PRIMARY KEY,
    username    TEXT NOT NULL UNIQUE,
    password    TEXT,
    is_admin    BOOL NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ,
    blocked_by  INT4 REFERENCES users_s (id)
);
Version with text based primary key is similar:
CREATE TABLE users_n (
    username    TEXT NOT NULL PRIMARY KEY,
    password    TEXT,
    is_admin    BOOL NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ,
    blocked_by  TEXT REFERENCES users_n (username)
);
Into these tables I loaded the same dataset: 10 million rows, with the following properties:
- username – random string 2 or more characters long (from: a-z, A-Z, 0-9)
- password – random string, 32 letters
- is_admin – false in ~ 99.99% of rows, true in ~ 0.01%
- created_tsz – random timestamp between 2000-01-01 00:00:00 and now, made so that (in users_s table) it only rises (when ordering by id)
- blocked_tsz – null in 99.9% rows. In other cases – random timestamp between created_tsz and now
- blocked_by – null if blocked_tsz is null, in other case – id (or username) of random user that has "is_admin" set to true.
$ SELECT * FROM users_s WHERE id >= random() * 10000000 LIMIT 10;
   id    │  username  │             password             │ is_admin │      created_tsz       │ blocked_tsz │ blocked_by
─────────┼────────────┼──────────────────────────────────┼──────────┼────────────────────────┼─────────────┼────────────
 3223759 │ NmqLFYS1xa │ TNiwvhtqJGPYeLdbuSjpXWDMQCKsFgAa │ f        │ 2004-01-02 09:53:39+01 │ [NULL]      │     [NULL]
 3223760 │ g9LE46QtWU │ wFgkSRLMOHvdcTyWNxhYtlAVbmPQfUED │ f        │ 2004-01-02 09:53:53+01 │ [NULL]      │     [NULL]
 3223762 │ arjvTCObUi │ MswoWZpSYaALuVnHzCEIgtmvrBJxXQby │ f        │ 2004-01-02 09:54:35+01 │ [NULL]      │     [NULL]
 3223764 │ brJ4aMXng3 │ FzKLxrwWqJeVdkRIAjpaYCNUsPZihBcb │ f        │ 2004-01-02 09:55:36+01 │ [NULL]      │     [NULL]
 3223765 │ thZWJzMb7K │ rAckTDMwuozflENFZKUbhxPnaRisYCVJ │ f        │ 2004-01-02 09:57:38+01 │ [NULL]      │     [NULL]
 3223766 │ aykbcMFQ0e │ SejNzfrnmOhJyCXEivxautqkgDQbdTpL │ f        │ 2004-01-02 09:58:50+01 │ [NULL]      │     [NULL]
 3223767 │ WgR1yxm9Zu │ kAZVTmMxjqocCHgnOuseiQyNFtXGPKRh │ f        │ 2004-01-02 09:58:51+01 │ [NULL]      │     [NULL]
 3223768 │ HFOJsG9nKl │ zNYrptxwTPIRdQugEBicLebyZShaWHsG │ f        │ 2004-01-02 09:59:12+01 │ [NULL]      │     [NULL]
 3223770 │ fbXPnuUMgD │ BwDWRgpQasbZuAfzVqKonclCGOYJSkxL │ f        │ 2004-01-02 10:00:02+01 │ [NULL]      │     [NULL]
 3223776 │ yJ8A1V9rNL │ YwDSTceukhCXEiPoaVIMrzJgbtnvHQZp │ f        │ 2004-01-02 10:01:11+01 │ [NULL]      │     [NULL]
(10 ROWS)
Distribution of username lengths:
$ SELECT LENGTH(username), COUNT(*) FROM users_s GROUP BY LENGTH ORDER BY LENGTH;
 LENGTH │  COUNT
────────┼─────────
      4 │  999999
      6 │ 1000000
      8 │ 1000000
     10 │ 1000000
     12 │ 1000000
     14 │ 1000000
     16 │ 1000000
     18 │ 1000000
     20 │ 1000000
     22 │ 1000000
     24 │       1
(11 ROWS)
(Oops, looks like I had an "off by one" error in the data generation script. But that's not very important.)
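The dataset properties listed above can be sketched in Python roughly as follows. This is not the original generation script, which is not shown in the post. The function name `generate_users`, its parameters and the two-pass approach are all assumptions made for illustration, and username lengths are drawn from the even lengths 4 to 22 seen in the distribution above.

```python
import random
import string
from datetime import datetime, timedelta, timezone

ALNUM = string.ascii_letters + string.digits
EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def generate_users(n, admin_rate=0.0001, blocked_rate=0.001, seed=0, now=None):
    """Return n row dicts shaped like users_s (hypothetical generator)."""
    rng = random.Random(seed)
    now = now or datetime.now(timezone.utc)
    span = (now - EPOCH).total_seconds()
    # Sorting the random offsets makes created_tsz rise with id.
    offsets = sorted(rng.uniform(0, span) for _ in range(n))
    rows, seen = [], set()
    for user_id, offset in enumerate(offsets, start=1):
        username = None
        while username is None or username in seen:  # usernames must be unique
            length = rng.choice(range(4, 24, 2))     # 4, 6, ..., 22 characters
            username = "".join(rng.choices(ALNUM, k=length))
        seen.add(username)
        rows.append({
            "id": user_id,
            "username": username,
            "password": "".join(rng.choices(string.ascii_letters, k=32)),
            "is_admin": rng.random() < admin_rate,
            "created_tsz": EPOCH + timedelta(seconds=offset),
            "blocked_tsz": None,
            "blocked_by": None,
        })
    # Second pass: block a small fraction of users, always by some admin.
    admin_ids = [row["id"] for row in rows if row["is_admin"]]
    for row in rows:
        if admin_ids and rng.random() < blocked_rate:
            remaining = (now - row["created_tsz"]).total_seconds()
            row["blocked_tsz"] = row["created_tsz"] + timedelta(
                seconds=rng.uniform(0, remaining))
            row["blocked_by"] = rng.choice(admin_ids)
    return rows
```

For the text-keyed table, the same rows would be loaded without the id column, with blocked_by holding the admin's username instead of the id.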
Sizes of tables and indexes:
- int4 based primary key
  - table data (with toast): 886 MB
  - primary key index (on "id"): 214 MB
  - username index: 318 MB
  - created_tsz index: 214 MB
  - total size: 1633 MB
- text based primary key
  - table data (with toast): 847 MB
  - primary key index (on "username"): 318 MB
  - created_tsz index: 214 MB
  - total size: 1379 MB
So, from the start it looks like the text based primary key is better, because it decreases the size of the data that has to be cached. But that's not the point of this blogpost. Let's see some speeds.
I generated a list of 100000 ids, and 10 lists of 100000 usernames each. Each list contained only usernames of a given length, from 4 to 22 characters.
Then I wrote a simple program, which did, in a loop:
- pick random test (check by id, check by username of length 4, check by username …)
- fetch next value to test (id or username)
- run ->execute() on the query, getting time information
- run ->fetchrow() on the query, getting time information
The query was always select * from table where field = ?.
This was done in a loop that queried all values (100000 ids, and 1 million usernames).
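The loop described above can be sketched as follows. This is not the program used for the measurements: it is a minimal Python stand-in that uses an in-memory SQLite database so it can run anywhere, whereas the real test ran against PostgreSQL. The names `run_benchmark` and `demo`, and the tiny demo tables, are assumptions for illustration. Here `cursor.execute()` and `cursor.fetchone()` play the roles of `->execute()` and `->fetchrow()`.

```python
import random
import sqlite3
import time


def run_benchmark(conn, tests):
    """tests maps a test name to (sql, values); returns per-query timings in ms."""
    pending = {name: (sql, iter(values)) for name, (sql, values) in tests.items()}
    timings = {name: [] for name in tests}
    cur = conn.cursor()
    while pending:
        name = random.choice(list(pending))   # pick a random test
        sql, values = pending[name]
        try:
            value = next(values)              # fetch the next value to test
        except StopIteration:
            del pending[name]                 # this test has used all its values
            continue
        t0 = time.perf_counter()
        cur.execute(sql, (value,))
        t1 = time.perf_counter()
        cur.fetchone()
        t2 = time.perf_counter()
        timings[name].append(((t1 - t0) * 1000.0, (t2 - t1) * 1000.0))
    return timings


def demo():
    """Run the loop against two tiny stand-in tables (not the 10M-row dataset)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users_s (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE)")
    conn.execute("CREATE TABLE users_n (username TEXT NOT NULL PRIMARY KEY)")
    names = [f"user{i:04d}" for i in range(1, 1001)]
    conn.executemany("INSERT INTO users_s VALUES (?, ?)", enumerate(names, start=1))
    conn.executemany("INSERT INTO users_n VALUES (?)", [(name,) for name in names])
    tests = {
        "by id": ("SELECT * FROM users_s WHERE id = ?", list(range(1, 101))),
        "by username": ("SELECT * FROM users_n WHERE username = ?", names[:100]),
    }
    return run_benchmark(conn, tests)
```

Each entry in the returned lists is an (execute time, fetch time) pair, so the two phases can be summarized separately. How the work splits between the two calls depends on the driver and database, so the split seen with SQLite says nothing about PostgreSQL.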
Results are below in a table.
Most of the columns should be self-explanatory, just a note on "95%": this is the time within which 95% of queries finished. So if it's "2 seconds", it means that 95% of queries finished in 2 seconds or less, and only 5% took more than 2 seconds.
All times are in milliseconds (0.001 of a second). I skipped the units to make it less cluttered (though it still is).
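One way to compute such a "95%" figure is the nearest-rank percentile, sketched below in Python. The post does not say which percentile definition was actually used, so treat this as an assumption. The function name `percentile` and the sample timings are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value such that pct% of values are <= it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up timings in milliseconds: 19 fast queries and one spike.
timings_ms = [1.0] * 19 + [50.0]
print(percentile(timings_ms, 95))  # 1.0
print(max(timings_ms))             # 50.0
```

With these sample numbers, a single spike dominates the maximum but leaves the 95% value untouched.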
The table is huge, but the first thing I see is that we can easily skip the "fetch" part: it shows very similar timings. Actually, fetching a row with an integer primary key is a bit slower (~10% slower), but that's because the row with the additional column is wider.
Also, I'm not really interested in all the other columns; 95% should be enough. Why not max? Well, there can always be some spikes which cause general system performance degradation, and I don't want to punish some tests at random. 95% of cases seems to be good enough to get some sensible results.
So, I can simplify the table to:
Much nicer table (to read, at least).
Whoa. It looks like integer searching is ~ 3 times slower than text searching. This could be caused by a simple fact: better caching of the text-based table. It was queried a million times, and the integer-based table only 100,000 times, so the text-based table got an unfair advantage.
So let's redo the test. This time I will query them in order: 100,000 queries to the integer-based table, and then all the other tests, also in order. Results:
Much better. And a lesson for me, for future, about how to do such tests 🙂
Anyway – what do these results tell us? Well – as expected – text based searches are slower. But the difference is (in my opinion) negligible.
- ~ 2.7% of performance decrease when switching to 8 character texts
- ~ 4.8% when switching to (huge) 22 character texts
An 8-character string uses 9 bytes, and a 22-character one uses 23 bytes. So we can see that the performance decrease is not linear in the number of bytes.
That, plus the fact that with text based primary keys we can remove one column, and one index means that benefits will most likely outweigh this small slowdown.
Of course, this doesn't mean: get rid of all integer primary keys. It just means: do not treat "performance degradation" as an argument against using natural, text based keys.
26 thoughts on ““= 123” vs. “= ‘depesz'”. What is faster?”
I’ve often wondered about that. Thanks!
Of course, the downside of using a text column for a primary key is if the primary key ever changes, all the rows referencing that key need to be updated as well. Seems like that scenario could happen more if using a text column that references, say, a username.
Nice comparison, thanks. Might be interesting to see how this performs if a UUID is used for a primary key.
This is great info. It would be interesting to see if there is a significant difference joining tables on an integer key vs. a text key.
IMHO it’s important that index is used, reducing significantly the number of comparisons. On the other hand, if for example the (primary) key consists of multiple columns such that the underlying index can’t be used for some particular query, full scan may become necessary. Then I’d expect the total number of comparisons to grow and in turn higher difference in the query execution time.
Sure. There are always benefits and drawbacks. I just wanted to clear the case about performance of scans.
uuid length is 16 bytes. So I can only assume it would behave as text of 15 normal characters.
Same thing – join has to scan the table using some method. If it will use index scan – this is what you’ll get. If it will use sequential scan – you will get a bit faster response, since the table is smaller.
Not sure what you mean – of course index is used – I am fetching a row using primary key lookup.
If you are talking in general – it doesn’t matter – if the planner cannot use index on text column – it wouldn’t use one on integer too.
I too wonder about JOIN performance. That would be a neat test to write. Thanks!
The difference will be noticeably bigger if your index gets bigger than the amount of shared_buffers and then when it gets bigger than ram.
For one the int index is simply smaller, for another it will take quite a bit longer for the int index to get 4 levels deep than with strings.
what about UUID’s?
@Caleb: check earlier comments.
I think I would hope that UUID are faster than just a text search, but I’ll be honest to say that I don’t understand indexing to have any idea if that would remotely hold up logically.
Indexes are binary so comparisons on indexed fields are going to be pretty much the same regardless of the type.
My understanding is that the disadvantage of large text keys lies in the generation of the index during inserts.
That comparison would be interesting.
There is one thing you didn’t share during your test.
What is your locale & encoding?
I ask this because (for example):
1. A UTF8 string for a non-latin string would take more bytes than characters (resulting in larger tables and longer fetching/comparison time).
2. When the locale is not trivial (ala C) then the comparison has to go through some more complex routines (see how that can affect sorting for example). http://www.postgresql.org/docs/devel/static/locale.html#AEN32412
So the test is not entirely reflective of text vs integer fields performance issues.
We have a legacy database which uses VARCHAR(20) columns as IDs. These columns contain numbers, padded with ‘0’, e. g. ‘00000000000000012345’.
I wonder how (in)efficient such IDs are, compared to normal integers.
as you saw the data didn’t contain any non-latin characters (i mentioned it specifically). As such – the utf8 is irrelevant.
and – i was using utf8 with en_US.UTF-8 locale.
the problem starts when you got foreign keys, and you want to change user name of a guy.
Also, if there’s many tables linked through FK – then the difference becomes significant.
But like in every case, it's a choice, and you have to decide at design time. There's no one size fits all. Ignoring either choice is plain stupid.
Interesting article, but until there is a world-moving, compelling, difference-which-makes-a-difference kind of reason to use strings as the ID, count me among the stalinists who use BIGINT for their IDs.
One more thing: this article, but much more so the Berkus article linked from here shows that there is a widespread, fundamental misunderstanding of what an ID is and what its function is.
Great stuff depesz, as usual. I, however, would also be interested in seeing benchmarks for values including multibyte characters. Limiting the test to ASCII is, increasingly, misleading: lots of folks use accented characters, Asian language characters, and other fun stuff, in their usernames. And it's not just usernames that work that way, of course.
sure, but it just doesn’t matter. index doesn’t care about what the bytes represent (well, at least usually). So multibyte characters mean *only* increased length of string.
So “żółw” is 8 bytes.
Maybe I don’t understand indexing well enough, but won’t adding 1000s of characters make the btree much broader? That is, thousands of branches to scan at the root of the tree, rather than just 128. I would think that would make for slower index scans.
@depesz: Well, btrees very frequently do compare different values and expect to learn whether a value is smaller, equal or bigger. For that they do locale-specific comparison, which is noticeably more expensive for non a-zA-Z0-9 characters.
@david: Sure, the fanout is relevant, but thats noticeable by the tree getting deeper, not the root getting wider.
The gain in data size is visible in this case, but you didn't take into account (maybe I missed it) the case of a larger database, where related tables (e.g. posts, images…) must have a foreign key column holding the text identifier. Your example shows that this is a 100 MB difference. With 20 tables that gives 2 GB, so the gain in such a case would be an illusion.
In the case of a larger database those 2 GB don't matter. This is the basic problem whenever the data volume argument is brought up. If you have a larger database, then a single record in a table is larger too, and there are more records. So, in absolute numbers, the growth in database size is larger, but as a percentage it is much smaller.
This post / research is highly misleading, mainly due to being overly simplistic:
1) It doesn't test performance between string and INT (i.e. the title of this post), but instead it tests the impact (only part of it, actually) of adding an extra INT surrogate key field when a natural key already exists. Testing the performance difference between both datatypes would have to use the same table schema and data for both tests, where the only difference was in the structure of the PK.
2) It doesn’t consider down-stream impact on JOINs, backups, etc. If the field is a PK and FKed elsewhere, all of those tables will have a varying length field that is nearly always larger than the 4 byte int. So we aren’t just talking about a minor increase of a few bytes per row for a 300k rows, we are talking about (or should be talking about) a few extra bytes (or more, perhaps) for millions (or many millions?) of rows.
3) What about updates to the username field (i.e. the PK)? How often is a "natural" key entered incorrectly? Or how often does a natural key turn out to not be truly unique, such as with SSNs? A login name would seem to be ok (though some systems, ones that are properly keyed off of numeric IDs, do allow for changes). ISO codes such as 2 or 3 character country codes are fine. But most others?
4) The example uses username, which is typically case-insensitive, but has not been tested as case-insensitive. Of course, the true nature of this post regards using a string/text field as a PK. In that context–as it relates to JOINs to FKed tables on this field–it wouldn’t need to be tested for case insensitivity as the JOINs would be done as case-sensitive (binary collation / ctype would be best). But the possibility of needing to do a case-insensitive match cannot be ruled-out and hence needs to be represented with an additional test. This becomes even more important when considering that some readers of this post will use this recommendation in situations well beyond its very narrow applicability.
5) It really should be stated that the results of the same test could vary greatly between the various RDBMS’s, especially when considering vendor-specific optimization features.
I am not saying to never use text-based keys as there are cases when they are preferred (especially in cases like ISO country codes). However, most of the time, your system (the entire thing, not just one table) will benefit from a surrogate key of a numeric type. And I am not saying that all tables should have a surrogate key (unless required by some facility such as replication or something like Full Text Search for MS SQL Server) as there are cases where they, at best, add no value and in some cases even cause problems, such as with bridge tables used to allow for many-to-many relationships where the PK is naturally the combination of both sides of the relationship. But, something to consider, especially related to the specific example here of the username/login, is that most OS’s and RDBMS’s have a surrogate userid (or SID, etc).
1. I seriously doubt that minimally wider table makes any difference. What’s more – addition of integer key is the common case, and not “remove username, and instead add integer” – you have to store the text somewhere anyway. If you think otherwise, please do your test, and post results.
2. impact on joins would be the same as impact on normal search, mostly because joins are done using the same methods as "normal" queries. as for backups: the argument about space saving kills me every time i hear it. yeah. 5 bytes over 1 million rows is 5 megabytes. right. but there are more bytes in these rows, so the 5 MB is just as minuscule over 1 million rows as it would be over 1k rows.
3. how is that even relevant to this blogpost? did i say “never use integer pkeys”? No. I just measured performance difference. There are places where integer based pkeys are ok, and there are places where text based pkeys are ok.
4. sorry, but i don’t see your point. If I wanted to have case-insensitive username, then I would have unique index on lower(username). And your point would be as moot as it is now – index on (data) works basically the same whether the data is “depesz” or “DEPESZ”.
5. yeah. given that I spend so much time writing about mssql, sybase, informix, oracle, mysql, sqlite, i really should have known better and make sure to say that this particular post is related to postgresql only. /s
1) There might be a misunderstanding here. I am speaking about the title of the post and the statement at the beginning:
“One of the points that proponents of surrogate keys (i.e. those based on integer and sequences) raise is that comparing integers is faster than comparing texts.”
I understand that the text of the username needs to be stored somewhere. I am saying that the tests performed here are not determining whether searching on strings or integers is faster: they are determining if searching a string field is faster than searching an integer field in the same table if that table had an additional int field.
2) I wasn’t saying that the JOIN performance by itself would be worse, at least not any more than your “field = value” tests show (I suppose I wasn’t very clear there, and it wasn’t until #4 that I said that JOIN performance shouldn’t be impacted), I was saying that duplicating the string field into all FKed tables would have an impact. Your final test even proves this as the slightly larger table (the one with the additional INT field) is still faster (even if minimally) when searching on that INT field than searching on the string field in the table that is smaller by not having that INT field.
So now if we only have the username field as the PK and hence as the FK field in all related tables, all of those related tables are now bigger by their username data size minus 4 bytes (per row). Your test shows that a slightly smaller table is still at a slight disadvantage, yet all related tables will now be larger and hence show even more performance degradation. Yes, the degradation shown in the final test was only a few %, but now that will happen across all queries against this field in all of these related tables?
And you can laugh all you want about the 5 MB of savings, but _wasted_ space chews up memory, disk (I am not worried about PCs, but SAN storage that is not cheap), CPU, etc. And on the contrary, I am saying that one _should_ spend the extra space if it makes sense (i.e. the extra INT field). You need to keep in mind that we are not talking about just 1 table here. If the issue were that simple, then yes, 5 MB would be negligible (all other things being equal, of course).
3) What is relevant about the issue of updating the username field IF it is the PK is the manageability of it (i.e. needing to CASCADE that change to the FKed tables). This is not an issue of performance (outside of possible locking of rows necessary to make that change). I am mentioning it (and I think 1 or 2 others have as well) as it is an impact on the system. In the end you say:
“That, plus the fact that with text based primary keys we can remove one column, and one index means that benefits will most likely outweigh this small slowdown.”
And I am saying ‘most likely’? Based on what? You tested one part of a multi-part system. You _might_ be right, but at this point there is absolutely no way to know. And in fact, your statement has a low probability of being correct. In the case of needing to do the case-insensitive search (#4 below), you are now actually increasing the size of this table since you will need 2 indexes on username: one for the PK, and one with “lower(username)”.
So just to be clear, the point I am making is that the advice given here is not considering any secondary effects of this modeling decision, yet this decision is never made in a vacuum in the real world, only in testing ;-). And you never even detail what the benefits are. The only thing I see is a savings of 254 MB, yet you just laughed (#2 above) at the notion of saving a few MB here and there.
4) I didn’t realize that PostgreSQL could index an expression that was not a field in the table. Ok, but as I said above, adding that index eliminates the main benefit of removing the INT field: the 254 MB of savings, and the impact of maintaining the index, of the removed field and related index. So again we come back to the concept of total system impact.
5) *I* realize that you were speaking of PostgreSQL, but it seems that people quite frequently generalize information if it even remotely appears to be general, especially if they are beginners. All I was saying is that this _seems to me_ to be presented in such a way that makes it easy for people to think that it applies more generally to other RDBMSs. BUT, maybe I am being unfair, and that the burden really is on the reader for knowing better.
As tables size and index size grows you will really appreciate using an integer for your primary key.
Obviously the larger size of a UUID or other String will lead to much larger space taken up in memory and on disk to store index data.
One thing no one considers: at the lowest level CPUs are brilliant at loading, storing and, very importantly comparing ints. They do this in a single cycle using SILICON. Strings/Text need to be compared in SOFTWARE – much more time consuming.