“= 123″ vs. “= ‘depesz’” – followup

Yesterday I wrote about selects on int4 vs. texts.

One of the comments that caught my attention was question about index creation time. So, let's see…

Before I will comment on sensibility of this test, let's see the timings.

Created tables with single column – text or int4. Each table got 1 million records. Text based tables had values of always the same length 4, 6, 8, 10, .. 22 characters.

Integer storing table had values in range 960 – 1073741659 (randomly chosen from 1 – 230).

All values, in all tables, are unique, and indexes where created using simple:

CREATE INDEX xxx ON TABLE (COLUMN)

Each index was created 10 times in a row (with drops in between of course). From these 10 times, I took the 2nd worst (which represents 90 percentile of times). Results:

table time
integer 2927.950 ms
text_4 6761.198 ms
text_6 6707.434 ms
text_8 7537.491 ms
text_10 7552.021 ms
text_12 7549.987 ms
text_14 7585.836 ms
text_16 8029.812 ms
text_18 8225.947 ms
text_20 8072.088 ms
text_22 8151.019 ms

Clearly – integer based indexes are way faster to create. That's fact. But what exactly does it tell us?

Let's face it – if you are in a position when you can choose between using surrogate and natural primary key – you (very usually) will index the “natural" column anyway.

For example – if you have users table – if you choose to have surrogate primary key – this index is faster. But you still have index on username (at least – every database I've seen so far had one).

So, while int4 indexes are faster than text ones, int4 index + text index will be always slower than just text index.

I think that concludes the topic – I am not trying to convert people to using natural keys with these posts – just merely showing some numbers.

6 thoughts on ““= 123″ vs. “= ‘depesz’” – followup”

  1. Ok sorry, one last “but” : I’d love to use natural keys instead of surrogate but…

    While we usually gain space on the users table, all the tables that reference users.login will be a bit bigger (index + data).

    I know we’re getting into an endless number of variations here (How many tables reference users ? Does it enable us to skip a join on users ?), but that’s one last thing to keep in mind (and bench for your own application).

  2. surrogates are appropriate in some cases, in fact users are perfect because sometimes users want to change usernames, or even if you use unique things like an email for a username sometimes those change. In fact I was designing a system that allowed the user to log in with multiple emails, and therefore required a surrogate. However in the same system my boss tried to convince me that every table MUST have a surrogate and I was like no, userid + phone number + ext is unique enough, I have no need to have a phoneid.

  3. Caleb,

    There seems to be this idea that keys or unique indexes can never change. There’s nothing preventing you from updating a column used in a key, including a username column.

  4. yes, and some people have said but you need something that uniquely identifies that row. “but those 3 columns do uniquely identify that row”. But it’s not a row identifier. “the database already knows how to identify the row’s location on the disk internally”.

    or “I’ve always done it this way” or “it’s easier”

  5. Unique indexes can change, but identifiers are more than just a unique value within a table. It is a pointer to data that is used outside of the database, which is why it should never change.

    Some ORMs require that IDs never change, which is a reasonable requirement to handle concurrency. If two users have the same record open, and one changed the “identifier”, then the other user’s instance has no association to the original record any longer. There could be legal requirements for not loosing the association, such as keeping track of if a letter has been sent and generated or taking someone off of a list of some kind. Natural keys that can be written ambiguously, typed incorrectly, or misunderstood through unfamiliar accents, so natural key corrections can break things.

    A standardized identifier format (such as an integer) used for all tables can simplify logging and auditing functions, as well as application design. For example, if it is a sequence, then using it for shards is going to work well. A natural key will probably distribute the data unevenly, so something else will have to be done to identify the shard.

    Also, natural keys that seem obvious are not always so. For example, for health insurance, a government issued unique ID is required to track the individual. So that ID should be a natural key, right (assuming that corrections to said ID have been mitigated somehow)? You can have claims against persons who do not yet have a government issued ID. So your natural key must allow duplicate NULL’s. It is not a key anymore. In this case, a birth sequence number (starting at 1 for each set of parents) is used to identify the child in combination with one of the parent’s ID. If you do not discover this requirement until after many parts of your application have been written, then redoing anything that makes use of that key can be expensive. It is always safer to use a surrogate.

  6. No, it isn’t always safer to use a surrogate. There was one example on PostgreSQL blogs where company had important Events table that had primary key that was surrogate with additional constraints that were later removed. General clusterfuck and duplication (no one knew if specific event was separate one) cost that company $100k. (costly meetings were scheduled for no reason other than wrong data in DB)

    Traditional school of DB design says that you should determine what keys specific relation has, write it down and then proceed to physical model, using surrogate keys if it’s a lot more efficient.

    Most people use surrogate (they should be called meaningless) keys without thinking about it, it’s worse than premature optimization because lot of them don’t even know it’s an optimization.

Comments are closed.