Waiting for 8.4 – case insensitive text ( citext )

On 29th of July, Tom Lane committed patch written by David E. Wheeler, which added new contrib module: citext.

Log message is rather laconic:

Add a new, improved version of citext as a contrib module.
David E. Wheeler

So, let's look into details.

First, let's imagine a use case.

You have web system, which stores some information about users, but you want to make sure that every email is unique.

This is pretty simple:

    id serial PRIMARY KEY,
    email text NOT NULL UNIQUE

Which looks fine, until you'll realize that emails are case insensitive. This means that email: depesz@depesz.com is the same as depesz@DEPESZ.COM. But in database table defined as above – it will be treated as 2 separate rows:

# INSERT INTO users (email) VALUES ('depesz@depesz.com'), ('depesz@DEPESZ.COM');
# SELECT * FROM users;
 id |       email
  1 | depesz@depesz.com
  2 | depesz@DEPESZ.COM
(2 ROWS)

Of course – it is possible that you want it to behave that way, but usually such table mean that somebody didn't really thought well about it.

To make email case insensitive you basically had to:

  1. create unique index users_email_ci_key on users (lower(email))
  2. in every query you had to use lower: select * from users where lower(email) = lower(‘depesz@DePeSz.com');

Of course – you could also write a trigger that would automatically lowercase all data that's being put in the table, but this is not necessarily good solution, as it will modify the data user supplied – and he/she might want it to be “depesz@DEPESZ.COM", and don't like it to be “depesz@depesz.com".

Now, with CITEXT, it can be solved in a much nicer way.

First you need to add citext to your database. Depending on your installation the directory might be different, but this one workes for me:

psql -U test -d test -f \i /home/pgdba/work/share/postgresql/contrib/citext.sql

Now, let's add the table:

    id serial PRIMARY KEY,
    email citext NOT NULL UNIQUE

And let's add some data:

# INSERT INTO users (email) VALUES ('depesz@depesz.com'), ('depesz@DEPESZ.COM');
ERROR:  duplicate KEY VALUE violates UNIQUE CONSTRAINT "users_email_key"

Nice. So, let's add it one-by-one:

# INSERT INTO users (email) VALUES ('depesz@depesz.com');

OK, now add “bad" record:

ERROR:  duplicate KEY VALUE violates UNIQUE CONSTRAINT "users_email_key"

Sweet. New good one:

# INSERT INTO users (email) VALUES ('depesz@gmail.com');
# SELECT * FROM users;
 id |       email
  3 | depesz@depesz.com
  5 | depesz@gmail.com
(2 ROWS)

So, let's check if I can select data from here:

 id |       email
  3 | depesz@depesz.com
(1 ROW)

Nice. Works as expected. One more thing – will it use index?

# \d users
                         Table "public.users"
 Column |  Type   |                     Modifiers
 id     | integer | not null default nextval('users_id_seq'::regclass)
 email  | citext  | not null
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email_key" UNIQUE, btree (email)
                                  QUERY PLAN
 INDEX Scan USING users_email_key ON users  (cost=0.00..8.27 ROWS=1 width=36)
   INDEX Cond: (email = 'DEPESZ@DEPESZ.COM'::citext)
(2 ROWS)

Well, it uses index. I'm a bit worried about premature usage of index, as there is definitely not enough data in the table to make usage of index faster than seqscan, but at least we know it will use index when we will have more data.

Everything looks fine – another great feature of PostgreSQL 🙂

9 thoughts on “Waiting for 8.4 – case insensitive text ( citext )”

  1. Hey depesz, I’ve been waiting for you to blob citext. Thanks! I have just a couple of things to add:

    First, it needs to be compiled. That usually means `cd contrib/citext && gmake && gmake install`. Then you can install it in whatever database you like. I always put it into template1, myself, since I will be using it in damn near every app I write from here on in.

    The second thing to note is a caveat: citext is not a true cast-insensitive data type. Rather, it does exactly the same thing you do when you use `LOWER()` to get the same result. The downside is that it’s slower than a true case-insensitive data type would be. The upside, however, is that you can now create case-insensitive primary keys and constraints without having two indexes! And it’s a wee bit faster than doing `LOWER()` manually, as we’ve been having to do for lo these many years.

    Thanks for the coverage, I appreciate it!


  2. @Theory:
    Sorry for keeping you waiting. I had a backtrack “thanks” to my vacation, but now I think it’s all done (unless I missed something).

    As for compilation requirement.

    Well, when you’re installing pg84 from binary packages (when it will be release), you’ll just do apt-get install postgreqsl-contrib-8.4.

    If you’re compiling from source – then you usually already know how to get contrib modules 🙂

    But the point is still valid – I should have mention compilation requirement.

  3. Nice addition. Would be nice if PostgreSQL could one day resolve this in a more “global” manner by allowing users to choose the charset/collation similar to MySQL, where one has the choice between binary, case sensitive and case insensitive. This works all the way down to the column level. One of the few areas where MySQL is actually ahead of PostgreSQL in features.

  4. @Lukas:
    without a doubt it would be good.

    On the other hand – I just recently spent 3 hours trying to load a dump from mysql (to another mysql), and finally failed. Because of problems with setting collation/charset.

    It is most probably due to lack of knowledge of MySQL, but it looks very complicated.

  5. Is it possbile to make postgres use index with text_pattern_ops option for citext columns?

  6. @mtr:
    not that I know of.
    I guess the simplest solution would be index on lower(text_field) with text_pattern_ops.

  7. @DEPESZ

    Thanks for reply.
    Maybe I’ll ask question in other way. Can postgres use index for citext column used in like condition?
    Does it depend on collate of the column or database?

  8. @mtr: I guess it would be possible in C locale, but I don’t use C, so it’s not using index.

    Basically – if you want case insensitive searches, I would say that going with lower(text_field) like ” is better option.

  9. thanks.
    I checked C locale. For citext pgsql does not use index in “like ‘a%’ in any combination.
    For text it works fine for C locale even without text_pattern_ops. For locale other than C it’s necessary do use text_pattern_ops.

Comments are closed.