June 7th, 2012 by depesz | Tags: , , , , , , , | 25 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

There is this idea that normal form in databases require you to use integer, auto incrementing, primary keys.

The idea was discussed by many people, I will just point you to series of three blog posts on the subject by Josh Berkus ( part 1, 2 and 3).

One of the points that proponents of surrogate keys (i.e. those based on integer and sequences) raise is that comparing integers is faster than comparing texts. So,

select * from users where id = 123

is faster than

select * from users where username = 'depesz'

Is it?

It definitely looks like it should be slower – after all – integer takes 4 bytes, and text (“depesz" in this case) takes 7. almost twice as bad.

So, since I haven't written anything on the blog for quite some time, I decided to take a closer look at this question.

Of course I need some test tables.

In case of int4-primary-key the table is:

CREATE TABLE users_s (
    id          INT4        PRIMARY KEY,
    username    TEXT        NOT NULL UNIQUE,
    password    TEXT        ,
    is_admin    BOOL        NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ ,
    blocked_by  INT4        REFERENCES users_s (id )
);

Version with text based primary key is similar:

CREATE TABLE users_n (
    username    TEXT        NOT NULL PRIMARY KEY,
    password    TEXT        ,
    is_admin    BOOL        NOT NULL DEFAULT 'false',
    created_tsz TIMESTAMPTZ NOT NULL,
    blocked_tsz TIMESTAMPTZ ,
    blocked_by  TEXT        REFERENCES users_s (username )
);

To these tables I loaded the same dataset – 10 million rows, with following properties:

  • username – random string 2 or more characters long (from: a-z, A-Z, 0-9)
  • password – random strgin, 32 letters
  • is_admin – false in ~ 99.99% of rows, true in ~ 0.01%
  • created_tsz – random timestamp between 2000-01-01 00:00:00 and now, made so that (in users_s table) it only rises (when ordering by id)
  • blocked_tsz – null in 99.9% rows. In other cases – random timestamp between created_tsz and now
  • blocked_by – null if blocked_tsz is null, in other case – id (or username) of random user that has “is_admin" set to true.

sample data:

$ select * from users_s where id >= random() * 10000000 limit 10;
   id    │  username  │             password             │ is_admin │      created_tsz       │ blocked_tsz │ blocked_by
─────────┼────────────┼──────────────────────────────────┼──────────┼────────────────────────┼─────────────┼────────────
 3223759 │ NmqLFYS1xa │ TNiwvhtqJGPYeLdbuSjpXWDMQCKsFgAa │ f        │ 2004-01-02 09:53:39+01 │ [null][null]
 3223760 │ g9LE46QtWU │ wFgkSRLMOHvdcTyWNxhYtlAVbmPQfUED │ f        │ 2004-01-02 09:53:53+01 │ [null][null]
 3223762 │ arjvTCObUi │ MswoWZpSYaALuVnHzCEIgtmvrBJxXQby │ f        │ 2004-01-02 09:54:35+01 │ [null][null]
 3223764 │ brJ4aMXng3 │ FzKLxrwWqJeVdkRIAjpaYCNUsPZihBcb │ f        │ 2004-01-02 09:55:36+01 │ [null][null]
 3223765 │ thZWJzMb7K │ rAckTDMwuozflENFZKUbhxPnaRisYCVJ │ f        │ 2004-01-02 09:57:38+01 │ [null][null]
 3223766 │ aykbcMFQ0e │ SejNzfrnmOhJyCXEivxautqkgDQbdTpL │ f        │ 2004-01-02 09:58:50+01 │ [null][null]
 3223767 │ WgR1yxm9Zu │ kAZVTmMxjqocCHgnOuseiQyNFtXGPKRh │ f        │ 2004-01-02 09:58:51+01 │ [null][null]
 3223768 │ HFOJsG9nKl │ zNYrptxwTPIRdQugEBicLebyZShaWHsG │ f        │ 2004-01-02 09:59:12+01 │ [null][null]
 3223770 │ fbXPnuUMgD │ BwDWRgpQasbZuAfzVqKonclCGOYJSkxL │ f        │ 2004-01-02 10:00:02+01 │ [null][null]
 3223776 │ yJ8A1V9rNL │ YwDSTceukhCXEiPoaVIMrzJgbtnvHQZp │ f        │ 2004-01-02 10:01:11+01 │ [null][null]
(10 rows)

length of username distribution:

$ select length(username), count(*) from users_s group by length order by length;
 lengthcount
────────┼─────────
      4999999
      61000000
      81000000
     101000000
     121000000
     141000000
     161000000
     181000000
     201000000
     221000000
     241
(11 rows)

(oops, looks like I had an “off by one" error in data generation script. But that's not very important.

Sizes of tables and indexes:

  • int4 based primary key
    • table data (with toast): 886 MB
    • primary key index (on “id"): 214 MB
    • username index: 318 MB
    • created_tsz index: 214 MB
    • total size: 1633 MB
  • text based primary key
    • table data (with toast): 847 MB
    • primary key index (on “username"): 318 MB
    • created_tsz index: 214 MB
    • total size: 1379 MB

So, from start it looks like text based primary key is better because it decreases size of data that have to be cached. But that's not the point of this blogpost. Let's see some speeds.

Generated list of 100000 ids, and 10 lists, each of 100000 usernames – each list contained only usernames of given length – from 4 to 22 characters.

Then, wrote a simple program, which did, in a loop:

  • pick random test (check by id, check by username of length 4, check by username …)
  • fetch next value to test (id or username)
  • run ->execute() on the query, getting time information
  • run ->fetchrow() on the query, getting time information

The query was always select * from table where field = ?.

This was done in a loop that queried all values (100000 ids, and 1 million usernames).

Results are below in a table.

Most of the columns should be self-explanatory, just a note on “95%" – this is time that 95% of queries finished in. So if it's “2 seconds" it means that 95% of queries finished in 2 seconds or less time, and only 5% was more than 2 seconds.

All times are in milliseconds (0.001 of a second) – i skipped the units to make it less cluttered (though it's still is).

test execute fetch query
min avg 95% max sum min avg 95% max sum min avg 95% max sum
100k-ids 0.068 0.637 0.452 150.973 63734.133 0.009 0.018 0.019 0.299 1782.907 0.077 0.655 0.473 150.992 65517.040
100k-u4 0.075 0.144 0.152 32.408 14372.208 0.008 0.016 0.017 0.427 1591.908 0.084 0.160 0.168 32.426 15964.116
100k-u6 0.075 0.143 0.152 17.031 14282.245 0.008 0.016 0.017 0.180 1590.029 0.084 0.159 0.169 17.050 15872.274
100k-u8 0.071 0.144 0.152 19.309 14420.358 0.008 0.016 0.017 0.396 1597.304 0.080 0.160 0.169 19.327 16017.662
100k-u10 0.071 0.199 0.154 82.696 19910.616 0.008 0.016 0.017 0.125 1588.379 0.080 0.215 0.172 82.717 21498.995
100k-u12 0.071 0.224 0.155 38.788 22438.004 0.008 0.016 0.017 0.432 1589.520 0.080 0.240 0.173 38.806 24027.524
100k-u14 0.069 0.266 0.157 106.409 26584.445 0.008 0.016 0.017 0.120 1581.030 0.077 0.282 0.175 106.429 28165.475
100k-u16 0.075 0.296 0.159 98.583 29613.015 0.008 0.016 0.017 0.188 1594.267 0.084 0.312 0.177 98.602 31207.282
100k-u18 0.075 0.212 0.155 52.395 21248.188 0.008 0.016 0.017 0.407 1591.861 0.084 0.228 0.173 52.417 22840.050
100k-u20 0.070 0.146 0.154 25.016 14587.051 0.008 0.016 0.017 0.444 1592.307 0.079 0.162 0.171 25.033 16179.358
100k-u22 0.073 0.146 0.155 14.897 14647.908 0.008 0.016 0.017 1.680 1583.701 0.082 0.162 0.171 14.915 16231.609

Table is huge, but the first thing that I see is that we can easily skip the “fetch" part – it shows very similar timings – actually fetching row with integer primary key is a bit slower (~10% slower), but that's because row with additional column is wider.

Also – I'm not really interested in all the other columns – 95% should be enough. Why not max? Well, there can be always some spikes which will cause general system performance degradation, and I don't want to punish some tests “on random". 95% of cases seems to be good enough to get some sensible results.

So, I can simplify the table to:

test 95% execute
100k-ids 0.452 ms
100k-u4 0.152 ms
100k-u6 0.152 ms
100k-u8 0.152 ms
100k-u10 0.154 ms
100k-u12 0.155 ms
100k-u14 0.157 ms
100k-u16 0.159 ms
100k-u18 0.155 ms
100k-u20 0.154 ms
100k-u22 0.155 ms

Much nicer table (to read, at least.

Whoa. But it looks like integer searching is ~ 3 times slower than text searching. It could be influenced by simple fact – better caching of the text-bases table – i.e. it was queried a million times, and integer-based table – only 100,000. So the text based table got unfair advantage.

So let's redo the test, this time, I will query them in order – 100,000 queries to integer based table, and then all the other tests, also in order. Results:

test 95% execute
100k-ids 0.143 ms
100k-u4 0.146 ms
100k-u6 0.146 ms
100k-u8 0.147 ms
100k-u10 0.148 ms
100k-u12 0.148 ms
100k-u14 0.148 ms
100k-u16 0.149 ms
100k-u18 0.150 ms
100k-u20 0.149 ms
100k-u22 0.150 ms

Much better. And a lesson for me, for future, about how to do such tests :)

Anyway – what do these results tell us? Well – as expected – text based searches are slower. But the difference is (in my opinion) negligible.

  • ~ 2.7% of performance decrease when switching to 8 character texts
  • ~ 4.8% when switching to (huge) 22 character texts

8 character string is using 9 bytes. 22 characters – 23 bytes. So we can see that the performance decrease is not linear with number of bytes.

That, plus the fact that with text based primary keys we can remove one column, and one index means that benefits will most likely outweigh this small slowdown.

Of course – this doesn't mean: get rid of all integer primary keys. It just means: do not treat “performance degradation" as an argument against using natural, text based, keys.

  1. 25 comments

  2. Jun 7, 2012

    I’ve often wondered about that. Thanks!

    Of course, the downside of using a text column for a primary key is if the primary key ever changes, all the rows referencing that key need to be updated as well. Seems like that scenario could happen more if using a text column that references, say, a username.

  3. # Thomas
    Jun 7, 2012

    Nice comparison, thanks. Might be interesting to see how this performs if a UUID is used for a primary key.

  4. # Joby
    Jun 7, 2012

    This is great info. It would be interesting to see if there is a significant difference joining tables on an integer key vs. a text key.

  5. # Tony H.
    Jun 7, 2012

    IMHO it’s important that index is used, reducing significantly the number of comparisons. On the other hand, if for example the (primary) key consists of multiple columns such that the underlying index can’t be used for some particular query, full scan may become necessary. Then I’d expect the total number of comparisons to grow and in turn higher difference in the query execution time.

  6. Jun 7, 2012

    @Joe:
    Sure. There are always benefits and drawbacks. I just wanted to clear the case about performance of scans.

    @Thomas:
    uuid length is 16 bytes. So I can only assume it would behave as text of 15 normal characters.

    @Joby:
    Same thing – join has to scan the table using some method. If it will use index scan – this is what you’ll get. If it will use sequential scan – you will get a bit faster response, since the table is smaller.

    @Tony:
    Not sure what you mean – of course index is used – I am fetching a row using primary key lookup.
    If you are talking in general – it doesn’t matter – if the planner cannot use index on text column – it wouldn’t use one on integer too.

  7. I too wonder about JOIN performance. That would be a neat test to write. Thanks!

  8. # Andres Freund
    Jun 7, 2012

    The difference will be noticeably bigger if your index gets bigger than the amount of shared_buffers and then when it gets bigger than ram.
    For one the int index is simply smaller, for another it will take quite a bit longer for the int index to get 4 levels deep than with strings.

  9. Jun 7, 2012

    what about UUID’s?

  10. Jun 7, 2012

    @Caleb: check earlier comments.

  11. Jun 8, 2012

    I think I would hope that UUID are faster than just a text search, but I’ll be honest to say that I don’t understand indexing to have any idea if that would remotely hold up logically.

  12. # Galaxiom
    Jun 8, 2012

    Indexes are binary so comparisons on indexed fields are going to be pretty much the same regardless of the type.

    My understanding is that the disadvantage of large text keys lies in the generation of the index during inserts.

    That comparison would be interesting.

  13. # Sefer
    Jun 8, 2012

    Hi Depesz,

    There is one thing you didn’t share during your test.
    What is your locale & encoding?

    I ask this because (for example):

    1. A UTF8 string for a non-latin string would take more bytes than characters (resulting in larger tables and longer fetching/comparison time).

    2. When the locale is not trivial (ala C) then the comparison has to go through some more complex routines (see how that can affect sorting for example). http://www.postgresql.org/docs/devel/static/locale.html#AEN32412

    So the test is not entirely reflective of text vs integer fields performance issues.
    :)

    Thanks,
    Sefer.

  14. # Stephan
    Jun 8, 2012

    Great article!

    We have a legacy database which uses VARCHAR(20) columns as IDs. These columns contain numbers, padded with ‘0’, e. g. ‘00000000000000012345’.

    I wonder how (in)efficient such IDs are, compared to normal integers.

  15. Jun 8, 2012

    @Sefer:
    as you saw the data didn’t contain any non-latin characters (i mentioned it specifically). As such – the utf8 is irrelevant.
    and – i was using utf8 with en_US.UTF-8 locale.

  16. # gj
    Jun 8, 2012

    the problem starts when you got foreign keys, and you want to change user name of a guy.
    Also, if there’s many tables linked through FK – then the difference becomes significant.

    But like in every case – its a choice, and you have to decide at design time. There’s no one size fits them all. Ignoring either choices is plain stupid.

  17. # Dawgbert the Elder
    Jun 8, 2012

    Interesting article, but until there is a world-moving, compelling, difference-which-makes-a-difference kind of reason to use strings as the ID, count me among the stalinists who use BIGINT for their IDs.

    One more thing: this article, but much more so the Berkus article linked from here shows that there is a widespread, fundamental misunderstanding of what an ID is and what its function is.

  18. Jun 8, 2012

    Great stuff depesz, as usual. I, however, would also be interested in seeing benchmarks for values including multibyte characters. Limiting the test to ASCII is, increasingly, misleading: lots of folks use accented characters, asian language charcters, and other fun stuff, in their usernames. And it’s not just usernames that work that way, of course.

  19. Jun 8, 2012

    @David:
    sure, but it just doesn’t matter. index doesn’t care about what the bytes represent (well, at least usually). So multibyte characters mean *only* increased length of string.
    So “żółw” is 8 bytes.

  20. # David E. Wheeler
    Jun 9, 2012

    @depesz:

    Maybe I don’t understand indexing well enough, but won’t adding 1000s of characters make the btree much broader? That is, thousands of branches to scan at the root of the tree, rather than just 128. I would think that would make for slower index scans.

  21. # Andres Freund
    Jun 11, 2012

    @depesz: Well, btrees very frequently do compare different values and expect to learn whether a value is smaller, equal or bigger. For that they do locale specific comparison which is noticeable more expensive for non a-zA-Z0-9 characters.

    @david: Sure, the fanout is relevant, but thats noticeable by the tree getting deeper, not the root getting wider.

  22. # Dawid
    Jun 11, 2013

    Zysk na rozmiarze danych w tym przypadku jest widoczny, jednak nie uwzględniłeś (może przeoczyłem) przypadku większej bazy – gdzie tabele powiązane (np. posty, obrazy…) muszą mieć kolumnę klucza zewnętrznego z tekstowym identyfikatorem. Na twoim przykładzie widać, że jest to 100MB różnicy. Przy 20 tabelach daje to 2GB, a więc zysk w takim przypadku byłby iluzją.
    Pozdrawiam

  23. Jun 11, 2013

    @Dawid:
    W przypadku większej bazy te 2GB nie mają znaczenia. To jest podstawowy problem gdy wyciąga się argument ilości danych. Jak masz większą bazę, to i pojedynczy rekord w tabeli jest większy i rekordów jest więcej. Więc – w liczbach bezwzglednych – wzrost wielkości bazy jest większy, ale procentowo – dużo mniejszy.

  24. Jun 30, 2015

    This post / research is highly misleading, mainly due to being overly simplistic:

    1) It doesn’t test performance between string and INT (i.e. the title of this post), but instead it tests the impact (only part of it, actually) of adding an extra INT surrogate key field when a natural key already exists. Testing the performance difference between both datatypes would have use the same table schema and data for both tests, where the only difference was in the structure of the PK.

    2) It doesn’t consider down-stream impact on JOINs, backups, etc. If the field is a PK and FKed elsewhere, all of those tables will have a varying length field that is nearly always larger than the 4 byte int. So we aren’t just talking about a minor increase of a few bytes per row for a 300k rows, we are talking about (or should be talking about) a few extra bytes (or more, perhaps) for millions (or many millions?) of rows.

    3) What about updates to the username field (i.e. the PK)? How often is a “natural” key entered in incorrectly? Or how often is a natural key turn out to not be truly unique, such as with SSNs? A login name would seem to be ok (though some systems–ones that are properly keyed off of numeric IDs–do allow for changes). ISO codes such as 2 or 3 characters country codes are fine. But most others?

    4) The example uses username, which is typically case-insensitive, but has not been tested as case-insensitive. Of course, the true nature of this post regards using a string/text field as a PK. In that context–as it relates to JOINs to FKed tables on this field–it wouldn’t need to be tested for case insensitivity as the JOINs would be done as case-sensitive (binary collation / ctype would be best). But the possibility of needing to do a case-insensitive match cannot be ruled-out and hence needs to be represented with an additional test. This becomes even more important when considering that some readers of this post will use this recommendation in situations well beyond its very narrow applicability.

    5) It really should be stated that the results of the same test could vary greatly between the various RDBMS’s, especially when considering vendor-specific optimization features.

    I am not saying to never use text-based keys as there are cases when they are preferred (especially in cases like ISO country codes). However, most of the time, your system (the entire thing, not just one table) will benefit from a surrogate key of a numeric type. And I am not saying that all tables should have a surrogate key (unless required by some facility such as replication or something like Full Text Search for MS SQL Server) as there are cases where they, at best, add no value and in some cases even cause problems, such as with bridge tables used to allow for many-to-many relationships where the PK is naturally the combination of both sides of the relationship. But, something to consider, especially related to the specific example here of the username/login, is that most OS’s and RDBMS’s have a surrogate userid (or SID, etc).

    Take care,
    Solomon…

  25. Jun 30, 2015

    @Solomon:

    1. I seriously doubt that minimally wider table makes any difference. What’s more – addition of integer key is the common case, and not “remove username, and instead add integer” – you have to store the text somewhere anyway. If you think otherwise, please do your test, and post results.

    2. impact on joins would be the same as impact on normal search – mostly because joins are done using the same methods as “normal” queries. as for backups – the argument about space saving kills me every time i hear it. yeah. 5 bytes over 1 million rows is 5 megabytes. right. but there are more bytes in these rows, so the 5 MB is just as miniscule over 1 million rows as it would be over 1k rows.

    3. how is that even relevant to this blogpost? did i say “never use integer pkeys”? No. I just measured performance difference. There are places where integer based pkeys are ok, and there are places where text based pkeys are ok.

    4. sorry, but i don’t see your point. If I wanted to have case-insensitive username, then I would have unique index on lower(username). And your point would be as moot as it is now – index on (data) works basically the same whether the data is “depesz” or “DEPESZ”.

    5. yeah. given that I spend so much time writing about mssql, sybase, informix, oracle, mysql, sqlite, i really should have known better and make sure to say that this particular post is related to postgresql only. /s

  26. Jun 30, 2015

    @DEPESZ

    1) There might be a misunderstanding here. I am speaking about the title of the post and the statement at the beginning:

    “One of the points that proponents of surrogate keys (i.e. those based on integer and sequences) raise is that comparing integers is faster than comparing texts.”

    I understand that the text of the username needs to be stored somewhere. I am saying that the tests performed here are not determining whether searching on strings or integers is faster: they are determining if searching a string field is faster than searching an integer field in the same table if that table had an additional int field.

    2) I wasn’t saying that the JOIN performance by itself would be worse, at least not any more than your “field = value” tests show (I suppose I wasn’t very clear there, and it wasn’t until #4 that I said that JOIN performance shouldn’t be impacted), I was saying that duplicating the string field into all FKed tables would have an impact. Your final test even proves this as the slightly larger table (the one with the additional INT field) is still faster (even if minimally) when searching on that INT field than searching on the string field in the table that is smaller by not having that INT field.

    So now if we only have the username field as the PK and hence as the FK field in all related tables, all of those related tables are now bigger by their username data size minus 4 bytes (per row). Your test shows that a slightly smaller table is still at a slight disadvantage, yet all related tables will now be larger and hence show even more performance degradation. Yes, the degradation shown in the final test was only a few %, but now that will happen across all queries against this field in all of these related tables?

    And you can laugh all you want about the 5 MB of savings, but _wasted_ space chews up memory, disk (I am not worried about PCs, but SAN storage that is not cheap), CPU, etc. And on the contrary, I am saying that one _should_ spend the extra space if it makes sense (i.e. the extra INT field). You need to keep in mind that we are not talking about just 1 table here. If the issue were that simple, then yes, 5 MB would be negligible (all other things being equal, of course).

    3) What is relevant about the issue of updating the username field IF it is the PK is the manageability of it (i.e. needing to CASCADE that change to the FKed tables). This is not an issue of performance (outside of possible locking of rows necessary to make that change). I am mentioning it (and I think 1 or 2 others have as well) as it is an impact on the system. In the end you say:

    “That, plus the fact that with text based primary keys we can remove one column, and one index means that benefits will most likely outweigh this small slowdown.”

    And I am saying ‘most likely’? Based on what? You tested one part of a multi-part system. You _might_ be right, but at this point there is absolutely no way to know. And in fact, your statement has a low probability of being correct. In the case of needing to do the case-insensitive search (#4 below), you are now actually increasing the size of this table since you will need 2 indexes on username: one for the PK, and one with “lower(username)”.

    So just to be clear, the point I am making is that the advice given here is not considering any secondary effects of this modeling decision, yet this decision is never done in a vaccuum in the real world, only in testing ;-). And you never even detail what the benefits are. The only thing I see is a savings of 254 MB, yet you just laughed (#2 above) at the notion of saving a few MB here and there.

    4) I didn’t realize that PostgreSQL could index an expression that was not a field in the table. Ok, but as I said above, adding that index eliminates the main benefit of removing the INT field: the 254 MB of savings, and the impact of maintaining the index, of the removed field and related index. So again we come back to the concept of total system impact.

    5) *I* realize that you were speaking of PostgreSQL, but it seems that people quite frequently generalize information if it even remotely appears to be general, especially if they are beginners. All I was saying is that this _seems to me_ to be presented in such a way that makes it easy for people to think that it applies more generally to other RDBMSs. BUT, maybe I am being unfair, and that the burden really is on the reader for knowing better.

Leave a comment