Waiting for 8.4 – database-level lc_collation and lc_ctype

On 23rd of September, Heikki Linnakangas committed patch that was written by Radek Strnad (actually committed patch is a stripped-down version of original).

What it does? It adds ability to have (finally!) different collation order and character categories for different databases.

Up until now, you had to set LC_COLLATE and LC_CTYPE when doing initdb, and then it was unchangeable. Database initdb'ed as LATIN2 would not work fully correctly when dealing with UTF-8.

Now, it has been changed:

Make LC_COLLATE and LC_CTYPE database-level settings. Collation and
ctype are now more like encoding, stored in new datcollate and datctype
columns in pg_database.
 
This is a stripped-down version of Radek Strnad's patch, with further
changes by me.

What it gives us?

For example, let's assume we have PostgreSQL instance initialized with “C" locale:

# \l
                                List of databases
   Name    | Owner  | Encoding  | Collation | Ctype |     Access Privileges
-----------+--------+-----------+-----------+-------+----------------------------
 depesz    | depesz | SQL_ASCII | C         | C     |
 postgres  | pgdba  | SQL_ASCII | C         | C     |
 template0 | pgdba  | SQL_ASCII | C         | C     | {=c/pgdba,pgdba=CTc/pgdba}
 template1 | pgdba  | SQL_ASCII | C         | C     | {=c/pgdba,pgdba=CTc/pgdba}
(4 rows)

Since “C" doesn't know anything about polish national characters it will not be able to sort polish text correctly. Nor can it upper() correctly:

# set client_encoding = 'UTF-8';
SET
 
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
 c | upper
---+-------
 a | A
 e | E
 w | W
 x | X
 ć | ć
 ł | ł
 ś | ś
 ź | ź
(8 rows)

(If you're not familiar with polish alphabet, just trust me – it's not ok. I'll show how it should look in a moment.).

With Pg <= 8.3 I would have to reinitdb, and thus convert all databases to new collation. Which is a bit troublesome.

Luckily, with 8.4, I can add new database, that will use another collation:

# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8' template template0;
CREATE DATABASE

The only problem is that I need to use template0. Otherwise I'll get:

# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8';
ERROR:  new collation is incompatible with the collation of the template database (C)
HINT:  Use the same collation as in the template database, or use template0 as template

(of course I could have created template1_pl, but then – I would have to create this template somehow 🙂

So, having this new database, let's try to run our test query in it:

# \c depesz_pl
You are now connected to database "depesz_pl".
 
# show client_encoding ;
 client_encoding
-----------------
 UTF8
(1 row)
 
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
 c | upper
---+-------
 a | A
 ć | Ć
 e | E
 ł | Ł
 ś | Ś
 w | W
 x | X
 ź | Ź
(8 rows)

YES! It works!

Also, \l output has been also modified to accommodate new information:

# \l
                                    List of databases
   Name    | Owner  | Encoding  |  Collation  |    Ctype    |     Access Privileges
-----------+--------+-----------+-------------+-------------+----------------------------
 depesz    | depesz | SQL_ASCII | C           | C           |
 depesz_pl | depesz | UTF8      | pl_PL.UTF-8 | pl_PL.UTF-8 |
 postgres  | pgdba  | SQL_ASCII | C           | C           |
 template0 | pgdba  | SQL_ASCII | C           | C           | {=c/pgdba,pgdba=CTc/pgdba}
 template1 | pgdba  | SQL_ASCII | C           | C           | {=c/pgdba,pgdba=CTc/pgdba}
(5 rows)

One thing – you can't change collation/ctype of existing database. Reason for this is pretty simple: indexes rely on collation (and can rely on ctype). So changing collation/ctype would require reindexation of all data. While this would be technically possible – you can do it by simply dumping database, creating new database with desired locale, and loading dump.

Of course this is still a long way to make PostgreSQL fully functional in multi-language environments, but at the very least it is a step in right direction. Awaited, and important step.

6 thoughts on “Waiting for 8.4 – database-level lc_collation and lc_ctype”

Steve says:

2008-09-29 at 06:43

This is so awesome! Thanks for the continued updates on 8.4.

You mention “this is stil la long way to make PostgreSQL fully functional in multi-language environments”. Could you probably post where it is currently lacking multi-language support or potentially cause problems?

Thanks!
depesz says:

2008-09-29 at 09:55

@Steve:
to make it fully functional I would need to supply colation/ctype on much smaller objects than databases: tables or columns.

for example – consider sorting texts in many languages. of course – you can get away with it using simply utf8 everywhere, but that introduces additional costs of text conversion if your output charset is different.
Thomas says:

2008-09-29 at 15:35

great new feature! i’ve been looking forward to have dbs with different collates on the same server!
Kari says:

2008-09-30 at 20:43

Finally!! I have been waiting for this for several years. Running several parallel postgres instances is quite .. inconvenient.
Vadym says:

2009-01-23 at 18:50

Really interesting. Too bad ms sql server does not provide ability to have more than one collation on the instance because of tempdb usage… for those who are interested in how it works in sql server see
http://db-staff.com/index.php/microsoft-sql-server/69-change-collation
Troy says:

2009-04-01 at 05:06

I used pg_collkey to get around collation order in 8.1.x, however since we are using 8.3 now this does not work anymore.

ORDER BY collkey(field,locale) ;

is there any other way currently in 8.3 or 8.4 i can achieve this?

Comments are closed.