August 10th, 2008 by depesz | Tags: , , , , , | 4 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

On 29th of July, Tom Lane committed patch written by David E. Wheeler, which added new contrib module: citext.

Log message is rather laconic:

Add a new, improved version of citext as a contrib module.
 
David E. Wheeler

So, let's look into details.

First, let's imagine a use case.

You have web system, which stores some information about users, but you want to make sure that every email is unique.

This is pretty simple:

CREATE TABLE users (
id serial primary key,
email text not null unique
);

Which looks fine, until you'll realize that emails are case insensitive. This means that email: depesz@depesz.com is the same as depesz@DEPESZ.COM. But in database table defined as above – it will be treated as 2 separate rows:

# insert into users (email) values ('depesz@depesz.com'), ('depesz@DEPESZ.COM');
INSERT 0 2
 
# select * from users;
id | email
----+-------------------
1 | depesz@depesz.com
2 | depesz@DEPESZ.COM
(2 rows)

Of course – it is possible that you want it to behave that way, but usually such table mean that somebody didn't really thought well about it.

To make email case insensitive you basically had to:

  1. create unique index users_email_ci_key on users (lower(email))
  2. in every query you had to use lower: select * from users where lower(email) = lower(‘depesz@DePeSz.com');

Of course – you could also write a trigger that would automatically lowercase all data that's being put in the table, but this is not necessarily good solution, as it will modify the data user supplied – and he/she might want it to be “depesz@DEPESZ.COM", and don't like it to be “depesz@depesz.com".

Now, with CITEXT, it can be solved in a much nicer way.

First you need to add citext to your database. Depending on your installation the directory might be different, but this one workes for me:

psql -U test -d test -f \i /home/pgdba/work/share/postgresql/contrib/citext.sql

Now, let's add the table:

CREATE TABLE users (
id serial primary key,
email citext not null unique
);

And let's add some data:

# insert into users (email) values ('depesz@depesz.com'), ('depesz@DEPESZ.COM');
ERROR: duplicate key value violates unique constraint "users_email_key"

Nice. So, let's add it one-by-one:

# insert into users (email) values ('depesz@depesz.com');
INSERT 0 1

OK, now add “bad" record:

# insert into users (email) values ('DEPESZ@DEPESZ.COM');
ERROR: duplicate key value violates unique constraint "users_email_key"

Sweet. New good one:

# insert into users (email) values ('depesz@gmail.com');
INSERT 0 1
 
# select * from users;
id | email
----+-------------------
3 | depesz@depesz.com
5 | depesz@gmail.com
(2 rows)

So, let's check if I can select data from here:

# select * from users where email = 'DEPESZ@DEPESZ.COM';
id | email
----+-------------------
3 | depesz@depesz.com
(1 row)

Nice. Works as expected. One more thing – will it use index?

# \d users
Table "public.users"
Column | Type | Modifiers
--------+---------+----------------------------------------------------
id | integer | not null default nextval('users_id_seq'::regclass)
email | citext | not null
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"users_email_key" UNIQUE, btree (email)

# explain select * from users where email = 'DEPESZ@DEPESZ.COM';
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using users_email_key on users (cost=0.00..8.27 rows=1 width=36)
Index Cond: (email = 'DEPESZ@DEPESZ.COM'::citext)
(2 rows)

Well, it uses index. I'm a bit worried about premature usage of index, as there is definitely not enough data in the table to make usage of index faster than seqscan, but at least we know it will use index when we will have more data.

Everything looks fine – another great feature of PostgreSQL :)

  1. 4 comments

  2. Aug 11, 2008

    Hey depesz, I’ve been waiting for you to blob citext. Thanks! I have just a couple of things to add:

    First, it needs to be compiled. That usually means `cd contrib/citext && gmake && gmake install`. Then you can install it in whatever database you like. I always put it into template1, myself, since I will be using it in damn near every app I write from here on in.

    The second thing to note is a caveat: citext is not a true cast-insensitive data type. Rather, it does exactly the same thing you do when you use `LOWER()` to get the same result. The downside is that it’s slower than a true case-insensitive data type would be. The upside, however, is that you can now create case-insensitive primary keys and constraints without having two indexes! And it’s a wee bit faster than doing `LOWER()` manually, as we’ve been having to do for lo these many years.

    Thanks for the coverage, I appreciate it!

    —Theory

  3. Aug 11, 2008

    @Theory:
    Sorry for keeping you waiting. I had a backtrack “thanks” to my vacation, but now I think it’s all done (unless I missed something).

    As for compilation requirement.

    Well, when you’re installing pg84 from binary packages (when it will be release), you’ll just do apt-get install postgreqsl-contrib-8.4.

    If you’re compiling from source – then you usually already know how to get contrib modules :)

    But the point is still valid – I should have mention compilation requirement.

  4. Aug 11, 2008

    Nice addition. Would be nice if PostgreSQL could one day resolve this in a more “global” manner by allowing users to choose the charset/collation similar to MySQL, where one has the choice between binary, case sensitive and case insensitive. This works all the way down to the column level. One of the few areas where MySQL is actually ahead of PostgreSQL in features.

  5. Aug 11, 2008

    @Lukas:
    without a doubt it would be good.

    On the other hand – I just recently spent 3 hours trying to load a dump from mysql (to another mysql), and finally failed. Because of problems with setting collation/charset.

    It is most probably due to lack of knowledge of MySQL, but it looks very complicated.

Leave a comment