Waiting for 8.4 - database-level lc_collation and lc_ctype
On 23rd of September, Heikki Linnakangas committed patch that was written by Radek Strnad (actually committed patch is a stripped-down version of original).
What it does? It adds ability to have (finally!) different collation order and character categories for different databases.
Up until now, you had to set LC_COLLATE and LC_CTYPE when doing initdb, and then it was unchangeable. Database initdb’ed as LATIN2 would not work fully correctly when dealing with UTF-8.
Now, it has been changed:
Make LC_COLLATE and LC_CTYPE database-level settings. Collation and
ctype are now more like encoding, stored in new datcollate and datctype
columns in pg_database.
This is a stripped-down version of Radek Strnad's patch, with further
changes by me.
What it gives us?
For example, let’s assume we have PostgreSQL instance initialized with “C” locale:
# \l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access Privileges
-----------+--------+-----------+-----------+-------+----------------------------
depesz | depesz | SQL_ASCII | C | C |
postgres | pgdba | SQL_ASCII | C | C |
template0 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
template1 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
(4 rows)
Since “C” doesn’t know anything about polish national characters it will not be able to sort polish text correctly. Nor can it upper() correctly:
# set client_encoding = 'UTF-8';
SET
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
c | upper
---+-------
a | A
e | E
w | W
x | X
ć | ć
ł | ł
ś | ś
ź | ź
(8 rows)
(If you’re not familiar with polish alphabet, just trust me - it’s not ok. I’ll show how it should look in a moment.).
With Pg <= 8.3 I would have to reinitdb, and thus convert all databases to new collation. Which is a bit troublesome.
Luckily, with 8.4, I can add new database, that will use another collation:
# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8' template template0;
CREATE DATABASE
The only problem is that I need to use template0. Otherwise I’ll get:
# CREATE DATABASE depesz_pl with encoding 'utf8' collate 'pl_PL.UTF-8' ctype 'pl_PL.UTF-8';
ERROR: new collation is incompatible with the collation of the template database (C)
HINT: Use the same collation as in the template database, or use template0 as template
(of course I could have created template1_pl, but then - I would have to create this template somehow
So, having this new database, let’s try to run our test query in it:
# \c depesz_pl
You are now connected to database "depesz_pl".
# show client_encoding ;
client_encoding
-----------------
UTF8
(1 row)
# select c, upper(c) from (values ('a'), ('ć'), ('e'), ('ź'), ('x'), ('ł'), ('ś'), ('w')) as x (c) order by c;
c | upper
---+-------
a | A
ć | Ć
e | E
ł | Ł
ś | Ś
w | W
x | X
ź | Ź
(8 rows)
YES! It works!
Also, \l output has been also modified to accommodate new information:
# \l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access Privileges
-----------+--------+-----------+-------------+-------------+----------------------------
depesz | depesz | SQL_ASCII | C | C |
depesz_pl | depesz | UTF8 | pl_PL.UTF-8 | pl_PL.UTF-8 |
postgres | pgdba | SQL_ASCII | C | C |
template0 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
template1 | pgdba | SQL_ASCII | C | C | {=c/pgdba,pgdba=CTc/pgdba}
(5 rows)
One thing - you can’t change collation/ctype of existing database. Reason for this is pretty simple: indexes rely on collation (and can rely on ctype). So changing collation/ctype would require reindexation of all data. While this would be technically possible - you can do it by simply dumping database, creating new database with desired locale, and loading dump.
Of course this is still a long way to make PostgreSQL fully functional in multi-language environments, but at the very least it is a step in right direction. Awaited, and important step.
September 29th, 2008 at 06:43
This is so awesome! Thanks for the continued updates on 8.4.
You mention “this is stil la long way to make PostgreSQL fully functional in multi-language environments”. Could you probably post where it is currently lacking multi-language support or potentially cause problems?
Thanks!
September 29th, 2008 at 09:55
@Steve:
to make it fully functional I would need to supply colation/ctype on much smaller objects than databases: tables or columns.
for example - consider sorting texts in many languages. of course - you can get away with it using simply utf8 everywhere, but that introduces additional costs of text conversion if your output charset is different.
September 29th, 2008 at 15:35
great new feature! i’ve been looking forward to have dbs with different collates on the same server!
September 30th, 2008 at 20:43
Finally!! I have been waiting for this for several years. Running several parallel postgres instances is quite .. inconvenient.