Why I’m not fan of uuid datatype

Recently, on irc, there were couple of cases where someone wanted to use uuid as datatype for their primary key.

I opposed, and tried to explain, but IRC doesn't really allow for longer texts, so figured I'll write a blogpost.

First problem – UUID values are completely opaque. That means – uuids generated for table sessions will be indistinguishable from the ones for table users.

If you're using natural keys, or something like serial, and you have two tables with very different numbers of rows, you can, usually see that value “12345678" doesn't really look OK as a key in table that has ~ 1000 rows.

Additionally, looking at list of uuids doesn't really give you any insight:

  • 5a77c4ee-658f-4d2c-bf7a-a22aacc5c220
  • 152b2b18-5714-474b-8d11-48c386920b2b
  • 0511f54a-9c49-4346-ba57-54a7e886060f
  • 524b67c9-a2e5-44cb-8d18-277825abb8df
  • e18cdb0a-a712-4c98-b4aa-81a72040dbd6
  • 7d2885e6-63ba-4e7c-b33c-8e28e81e30b5
  • 3bb1863c-01f2-4d0a-9452-f19de7fe8b61
  • b25ab666-dce8-47e3-8050-fb96db97b002
  • e0bd2b59-84fb-4f7a-9b50-ca4d79140591
  • 4860a2cb-a94b-453d-a97d-1add2bf8f6b2

Doesn't mean anything to me. But if I'd saw values like:

  • 72397
  • 185513
  • 137659
  • 195639
  • 476172
  • 282708
  • 530256
  • 459210
  • 476800
  • 822362

I can make some guesses about expected number of rows, or their order.

Second problem – you can't use uuids to sort in a way that gives you newest rows first.

Well, you can use uuid_generate_v1() but it has the drawback of containing also some form of MAC. Which I'm not entirely sure I like.

Third problem – it's pretty wide value.

It uses 16 bytes per value, which is twice what is used by int8.

Based on these problems – my point is that in majority of cases, you don't really need or want uuid, and can use simpler keys.

If you need just unique value – int8 sequence/identifier will be usually more than enough.

If you need something that will allow sharing ids between multiple machines – check instagram ids (which I coded in C and pl/PgSQL in this blogpost.

Of course there might be some situations where UUID is the only acceptable solution. If you have such case, please let me know in comments.

Finally, out of curiosity, let's try to see some performance tests.

Since we know that searching (select * from table where id = …) will have virtually the same speed, let's see how fast does it take to load data.

To test it I wrote a script:

DROP TABLE IF EXISTS tint;
DROP TABLE IF EXISTS tuuidv1;
DROP TABLE IF EXISTS tuuidv1mc;
DROP TABLE IF EXISTS tuuidv4;
CREATE TABLE tint (id INT8 generated always AS IDENTITY, payload TEXT, PRIMARY KEY (id) );
CREATE TABLE tuuidv1 (id uuid NOT NULL DEFAULT uuid_generate_v1(), payload TEXT, PRIMARY KEY (id) );
CREATE TABLE tuuidv1mc (id uuid NOT NULL DEFAULT uuid_generate_v1mc(), payload TEXT, PRIMARY KEY (id) );
CREATE TABLE tuuidv4 (id uuid NOT NULL DEFAULT uuid_generate_v4(), payload TEXT, PRIMARY KEY (id) );
\timing ON
checkpoint;
\echo doing tint
\copy tint (payload) FROM '/tmp/words.lst'
checkpoint;
\echo doing tuuidv1
\copy tuuidv1 (payload) FROM '/tmp/words.lst'
checkpoint;
\echo doing tuuidv1mc
\copy tuuidv1mc (payload) FROM '/tmp/words.lst'
checkpoint;
\echo doing tuuidv4
\copy tuuidv4 (payload) FROM '/tmp/words.lst'
checkpoint;

/tmp/words.lst file contained 1 million random lines generated using this ruby script:

words = File.open('/usr/share/dict/american-english').each_line.map { |l| l.chomp }.to_a
File.open("/tmp/words.lst", "w") do |f|
  1000000.times { |x| s = 1.upto(10).map { |i| words[rand() * words.size] }.join(' '); f.write(s + "\n") }
end

I ran the whole test five times, and then got results:

Table Time to load 1M rows (in ms)
best average worst
tint 1906.239 1924.722 1950.813
tuuidv1 4177.412 4199.647 4220.624
tuuidv1mc 4208.107 4236.575 4266.439
tuuidv4 5254.344 5298.116 5361.273

As you can see, in this particular test, INT8 based primary key was ~ twice as fast as any UUID.

Of course, if the table would be wider, the difference would be smaller. But then – these rows were pretty wide anyway – average length of payload was 93 characters.

19 thoughts on “Why I’m not fan of uuid datatype”

  1. Not contradicting all your points altogether, but since I was reading the following article about sequential UUIDs earlier today (before you published this post), I feel obligated to share it: https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

    Any comment on whether this changes some of your arguments against UUID (particularly about sorting and speed, which are addressed in that article).

  2. This is the very first article I mostly disagree.

    1. sorting by UUID — all best practices I know about application architecture always state that sorting by primary key is a bad practice and this should happen against `created_at` or so,
    2. UUIDs are indistinguishable — and that’s ok: enumeration is much more difficult than series what decreases potential disease size,
    3. list doesn’t give any insight — it’s not for insight but for solving problem: do you hate IPv6 because of their form?
    4. slower — everything costs; advantages are — for my perspective — more valuable than drawbacks

  3. @eRIZ

    Ad. 1 – only if we’re talking about general purpose (application, final) sorting. But if we change our rationale context to debugging, serial values give a direct hint about what’s going on inside, e.g. many gaps in sequence suggest there are many rolled back transactions. And having your `created_at` we can easily count absolute number of transactions or create a histogram of rollbacks as a function of time of day to investigate bottlenecks.

    Ad. 2 – only if the data is to be innumerable, but this approach (a.k.a. poor’s man security) is often not the best practice, especially when used (as 128-bit random) instead actual access policy.

    Ad. 3 – yes, I do hate IPv6 because of that (and many others). A great example of university-style bs with solving problems that noone had, not solving problems that existed (except for the address-space) and generating even bigger problems with impractical solutions. IPv6 should all be flushed down to the ocean and started over.

    Ad. 4 – WHAT advantageS? Randomness to prevent enumeration is singular.

  4. There are two more drawback when using UUIDs:
    1. they appear to be random, but the collisions still might happen – one extra thing to take care,
    2. like everything “random”, they eat system entropy, which might generate additional work and hard to debug (load/hw-specific) slowdowns.

    Ad. 1 – this might be solved …using additional SERIAL. Since 2-column PKEY with SERIAL makes no sense, the UUID becomes simple “anti-enumeration” access-scheme for front-end application, but doesn’t replace the sequence itself.

    Ad. 2 – this might be solved by async/external/delayed UUID generation, what means we also/still need the PKEY serial in the first place.

    In order to guarantee uniqueness without rechecks, some kind of sequencing/time-derivation process is required. If so, the natural ordering of resulting object MIGHT be it’s intristic feature. UUIDs make this harder than possible: https://github.com/uuidjs/uuid/issues/75 by not having an option to use different order.

    One possible solution would be to create own UUID-mangling function to create “sequential” UUIDs on INSERT instead parsing them in ORDER:

    https://ivanyu.me/blog/2016/03/28/time-based-version-1-uuids-ordering-in-postgresql/

    Worth reading:
    https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439

  5. UUIDs has many drawbacks, but has some advantages for developer.

    1. UUID has some standards so you can use them in different places in different languages and contexts, and they will work the same way.
    2. When you treat UUID as natural identifier, then it gives you some opportunities, like…
    3. UUID is good as identifier if you would like to generate id before persist entity. You generate it in your C/Java/Python code and it does the job. It is very important specially if you work with ORMs. They can manage entities before persist it.
    4. More you can generate your ids in many places in system and there will be very small chance to collision. If you need to generate or book ids from one place it can provide to bottlenecks.

    I’m looking at this from app developer point of view.

  6. This article, while focused on MySQL is an interesting comparison of UUIDs and bigint from a performance standpoint: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/

    In our case we switched to UUIDs for the following reasons:

    1. As a large multi-tenant enterprise class application deployed both on-premise and in the cloud, customers would want to migrate between these deployments. UUIDs made it possible to take a on-premise database and easily merge it with one of our existing cloud databases, essentially adding them as a tenant.

    2. On-premise databases, since the customer could access the DB directly would sometimes cause sequences to get out of sync. (Inserting without incrementing the sequence) Though rare, UUIDs eliminated all the extra checks we needed to add to keep sequences in sync, “just in case”.

    3. Due to application specific requirements, we needed to make a call to the DB to obtain the next ID in sequence, then perform the insert. Using UUIDs eliminated this step and actually improved overall insert performance for us.

    4. An added bonus, grep logs for a UUID and you can instantly locate what you want. With INTs it would take 10-15mins to find the exact log lines we needed as other methods were required.

  7. There is a completely different story at the data discovery side.

    With UUIDs, if you do select count(*) from a_table where not candidate_fk in (select the_pk from other_table), you will leave completely sure if there isn’t a relationship on that column. With serials, not so much.

  8. @Koziołek

    Ad. 1. so does the integer – this is standard “math” object. Actually, UUID is simply 128-bit integer with fancy randomness and possibility to mix-in some external values, like timestamp od some kind of hashes.
    However, noone seems to make use of that properties, and moreover embedding such fingerprint inside the object identifier, instead the separate column, is considered bad practice; just normalize your database instead.

    Actually, UUID generation scheme was designed for ad-hoc scenarios.

    Ad. 2. Like what? And what is natural in random number? Distribution?

    Ad. 3. It is, unless …it isn’t, as it doesn’t guarantee uniqueness. Just a shortcut to create hard to debug errors.

    Ad. 4. If you accept “very small chance of breaking data” you migh consider some non-ACID no-SQL databases, which would be faster and easier to manage in multi-tenant setups.

    I hope we’re all taking about ACID-compliant database with data integrity, not some fuzzy storage for pictures with cats?

  9. @Mike

    Ad. 1. until some collision happens, which is 100% sure based on Murphy’s laws.

    Ad. 2. out-of-sync PKEY? How about fixing the app instead masking the errors?

    Ad. 3. another myth – as UUIDs might and eventually will collide, you need to either handle such error (which is tempting not to do) or – if the sequence generation makes difference, use separate entity to generate them.

    All of this (and Koziołek cases) was described in “instagram ids”.

    And all your nightmares would come true as soon as some UUID generation node gets out of entropy or some “truly random number: 4” get’s into the application.

    Ad. 4. ?! What is the year on your calendar, to grep for IDs in logs? And what is the difference between grepping UUID and integer? Ever heard about “grep -w”?

    All I can see for now are bad practices and misuse of tooling…

  10. @Marcos

    Sorry for being rude, but if you do “SELECT … WHERE x IN (SELECT …)” …just stop doing that.
    To discover relationships we, in relational databases, use FKs on normalized tables. Either the data is a subset (with duplicates) of another (unique indexed) set, or it isn’t. If it only “somehow”, then fix the problem, not mask it, or don’t pretend you actually need RDBMS and go for some document store (even if it still was Postgres with some JSON/B). But if you do have PKEY column with relations, just use these relations properly (FK).

    There are shortcomings in Postgres FK, like missing partial FKs (“a.x REFERENCES b.y WHERE a.z>1”) I do really miss and some issues with mixed-order multicolumn indices, but UUIDs don’t help here.

  11. I typically only look at the first group of characters in a UUID – very much like you do with git commits. After a while it’s pretty much the same

    A serial or identity column also doesn’t really tell you (reliably) which one was the latest (think about sequence caching on the client!). The only reliable way to do that is to use a timestamp column populated with clock_timestamp().

    If for some reason the client can’t call the database for a new unique ID, a UUID is a perfect fit.

  12. @gotar

    If the original creator of a database bothered to document it with FKs, good table and column names, and comments, then there is no point on doing data discovery on it.

    But that said, I have seen a few bugs (not many, but well, we are all bikeshedding) on leggacy systems caused by inserting ids on the wrong columns, that only blew years later when one of the related tables got an extra record and the ids stopped being valid on both columns.

  13. @Marcos – if you do data discovery, it is some “post-mortem” process, long after the implementation. You do not choose the data types at this moment, the decision was already made before.

    Thus using UUIDs is not a solution for such FUTURE problem. Using FKs is.

    Do not encourage people to use invalid solution, that “usually works”, only “sometimes fails, but this is rare and won’t happen” (it will). Properly designed system should fail fast and hard when encoutering a bug, not “years later”.

    UUIDs don’t SOLVE any problem (mentioned here). They only make it LOOK like solved, allowing to develop fast and leaving fixing this for a “future considerations” i.e. never, like every duct tape solution.

  14. I will try to address all your points.

    > you can, usually see that value “12345678″ doesn’t really look OK as a key in table that has ~ 1000 rows
    Of course, you can say that but to prevent storing of senseless data in your tables another approaches should be considered. More to that, there are cases when it’s OK to have such gaps in surrogate PK values.
    > I can make some guesses about expected number of rows, or their order.
    There are a lot of ways to predict how your system is growing. And PK values definitely not on of them. More to that, when you expose PK to the external world you might want not to discover the amount of data you have. In this case, UUID is much better.
    > It uses 16 bytes per value, which is twice what is used by int8.
    I agree. But this is the price you pay for its uniqueness.

    And, finally, UUID works when you need to generate a key outside of your database, without any locking.

    Nevertheless, there is one drawback to UUID you glossed over. In PostgreSQL, when huge_page_writes is on, writing UUID to btree indexes will cause write amplification due to the UUID randomness (https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/).

    Instagram’s approach is good but incurs additional maintenance overhead.

  15. Three of your points argue that you cannot infer meaning from UUIDs: ordering, amount-of-data/growth and distinguishability (is that English?).

    I’d argue that this is good. It is good that a PK, or ID, does not convey such information.

    If you need ordering, add an explicit column to order by; anyone with long-lived databases will tell you this. UUIDs force you to do this: they force you to do The Right Thing at day 0.

    If you need growth, number of rows, and such data, use either the tooling made especially for that (built into many database) or keep some stats yourselves. Relying on PKs that were neither designed for this, nor constrained to keep helping you with this, is naive at best and restricting at worst. Gaps will occur. Merges will introduce gaps. Migrations will move Ids around. That is good, because if you have the freedom to do so, life (migrations, merges, vacuums, cleanups) will be easier than when there’s the “we cannot defrag the IDs because our monitoring!”.

    If you need distinguishability, make it explicit. Relying on accidental ranges (users are below 1000, tickets should range above 10000) is flakey at best and error-prone at worst. “Woops, I deleted the first 3000 tickets because I assumed all tickets had an ID higher than 10.000” (I did not make this up: a war story from a former colleague (whom might have made it up, IDK))

  16. 1. I am not a fan of ipv6 but saying it tried to solve only the “address space” problem is quite incorrect. ipv4 has a LOT of warts, and ipv6 tried solving some of those as well.

    2. I find UUID a more useful user identifier than an int8 PK. Having said that I keep both columns.

    3. I have not heard of uuid v4 collisions in real life situations so far. That would surely get a huge press as billions of data points use UUID as their sole unique identifier.

    4. What is the point of these artificial table loading examples? Many real life applications’ data grows organically. Surely you won’t design your schema solely around backup restoring speed?

  17. We use UUID’s for many things, mostly because different systems create the ID’s, and we do a lot of p2p-things etc.

    However, one thing we don’t do, but which I saw Stripe, the money processor, is doing with their IDs, is that they’re always prefixing them with what they are. Which is really nice. So you have ID’s like cus_1234f123098b238aoeu, crd_0923450897bac, inv_1234098701235, sub_023986098bfab391 – and that makes it really easy to see what is customer, credit card, invoice and subscription. Which kinda fixes the obtuseness of the ids.

    (They’re not UUID4’s afaics, but close to the same length, I just randomly pressed the keyboard to get something that looked like it).

  18. I’m not a fan of UUID either but I had this app https://github.com/yeo/bima
    which is basically a MFA app for desktop. They use SQLite and can run offline.
    When they want to sync, they send data to our server. I want to maintenance a
    as close as possible of what they had on sqlite(columns name, value etc) to our
    postgres db on backend.

    Without UUID? How can I achieve this problem nicely? Right now, I just generate
    a UUID for a `token` table say `uuid|secret|created_at` and send to backend, which
    persisted to postgres in same way.

    Would love to hear some solution because I really hate UUID…They feel heavy and
    generating them isn’t easy. It’s easy in term of using a library, but when thinking
    about wasting cpu cycle just to get a unique, collison-free among peers feel very
    wasteful.

  19. I agree with the performance impact (more storage required), but the other points don’t seem to be an issue IMHO.
    A pro that I see, is that by using UUIDs, you can pre-allocate the ID at the domain layer, without any extra round trip. That is if you have a new record and you want to know what its ID will be before it’s written to the DB you can just call the uuid function from your application. Otherwise you’d have to either: a) insert, and then check what ID was assigned, or b) so something like “select max(id) + 1..” which is even worse.

    Just my two cents.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.