Waiting for 8.4 - database-level lc_collation and lc_ctype

2008-09-28 22:26:49 CEST | Tags: , , , , ,

On 23rd of September, Heikki Linnakangas committed patch that was written by Radek Strnad (actually committed patch is a stripped-down version of original).

What it does? It adds ability to have (finally!) different collation order and character categories for different databases.

Up until now, you had to set LC_COLLATE and LC_CTYPE when doing initdb, and then it was unchangeable. Database initdb’ed as LATIN2 would not work fully correctly when dealing with UTF-8.

Now, it has been changed:

Make LC_COLLATE and LC_CTYPE database-level settings. Collation and
ctype are now more like encoding, stored in new datcollate and datctype
columns in pg_database.
 
This is a stripped-down version of Radek Strnad's patch, with further
changes by me.

What it gives us?

For example, let’s assume we have PostgreSQL instance initialized with “C” locale:

# \l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access Privileges
-----------+--------+-----------+-----------+-------+----------------------------
depesz | depesz | SQL_ASCII | C | C |
postgres | pgdba | SQL_ASCII | C | C |
template0 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
template1 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
(4 rows)

Since “C” doesn’t know anything about polish national characters it will not be able to sort polish text correctly. Nor can it upper() correctly:

# set client_encoding = 'UTF-8';
SET
 
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
c | upper
---+-------
a | A
e | E
w | W
x | X
ć | ć
ł | ł
ś | ś
ź | ź
(8 rows)

(If you’re not familiar with polish alphabet, just trust me - it’s not ok. I’ll show how it should look in a moment.).

With Pg <= 8.3 I would have to reinitdb, and thus convert all databases to new collation. Which is a bit troublesome.

Luckily, with 8.4, I can add new database, that will use another collation:

# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8' template template0;
CREATE DATABASE

The only problem is that I need to use template0. Otherwise I’ll get:

# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8';
ERROR: new collation is incompatible with the collation of the template database (C)
HINT: Use the same collation as in the template database, or use template0 as template

(of course I could have created template1_pl, but then - I would have to create this template somehow :)

So, having this new database, let’s try to run our test query in it:

# \c depesz_pl
You are now connected to database "depesz_pl".
 
# show client_encoding ;
client_encoding
-----------------
UTF8
(1 row)
 
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
c | upper
---+-------
a | A
ć | Ć
e | E
ł | Ł
ś | Ś
w | W
x | X
ź | Ź
(8 rows)

YES! It works!

Also, \l output has been also modified to accommodate new information:

# \l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access Privileges
-----------+--------+-----------+-------------+-------------+----------------------------
depesz | depesz | SQL_ASCII | C | C |
depesz_pl | depesz | UTF8 | pl_PL.UTF-8 | pl_PL.UTF-8 |
postgres | pgdba | SQL_ASCII | C | C |
template0 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
template1 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
(5 rows)

One thing - you can’t change collation/ctype of existing database. Reason for this is pretty simple: indexes rely on collation (and can rely on ctype). So changing collation/ctype would require reindexation of all data. While this would be technically possible - you can do it by simply dumping database, creating new database with desired locale, and loading dump.

Of course this is still a long way to make PostgreSQL fully functional in multi-language environments, but at the very least it is a step in right direction. Awaited, and important step.

4 Responses to “Waiting for 8.4 - database-level lc_collation and lc_ctype”

  1. Steve Says:

    This is so awesome! Thanks for the continued updates on 8.4.

    You mention “this is stil la long way to make PostgreSQL fully functional in multi-language environments”. Could you probably post where it is currently lacking multi-language support or potentially cause problems?

    Thanks!

  2. depesz Says:

    @Steve:
    to make it fully functional I would need to supply colation/ctype on much smaller objects than databases: tables or columns.

    for example - consider sorting texts in many languages. of course - you can get away with it using simply utf8 everywhere, but that introduces additional costs of text conversion if your output charset is different.

  3. Thomas Says:

    great new feature! i’ve been looking forward to have dbs with different collates on the same server!

  4. Kari Says:

    Finally!! I have been waiting for this for several years. Running several parallel postgres instances is quite .. inconvenient.

Leave a Reply