March 4th, 2011 by depesz | Tags: , , , , , , , , | 8 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

On 8th of February, Peter Eisentraut committed patch:

Per-column collation support
 
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
 
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
 
Branch
------
master

This is really, really great.

Let's see. I have 9.1 database with American locale (en_US.UTF-8):

$ select name, setting from pg_settings where name ~ 'lc_';
name | setting
-------------+-------------
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
(6 rows)

Now, let's assume we want some Polish words:

$ create table normal_polish ( some_text text );
CREATE TABLE
 
$ copy normal_polish from '/tmp/polish';
COPY 14
 
$ select * from normal_polish ;
some_text
-----------
ćma
łódka
śnieg
mama
ser
car
lama
żółw
zorza
tata
źdźbło
coś
alfa
szopa
(14 rows)

and some Czech:

$ create table normal_czech ( some_text text );
CREATE TABLE
 
$ copy normal_czech from '/tmp/czech';
COPY 8
 
(depesz@[local]:5910) 13:56:54 [depesz]
$ select * from normal_czech ;
some_text
-----------
1234
červený
chleba
cihla
damašek
žluťoučký
kůň
zelí
(8 rows)

The problem is that while sorted outputs can look sensible to others – they are not properly sorted in Polish and Czech languages:

$ select * from normal_czech order by some_text;
some_text
-----------
1234
červený
chleba
cihla
damašek
kůň
zelí
žluťoučký
(8 rows)
 
$ select * from normal_polish order by some_text;
some_text
-----------
alfa
car
ćma
coś
lama
łódka
mama
ser
śnieg
szopa
tata
źdźbło
żółw
zorza
(14 rows)

Up to 9.0 you couldn't really do anything about it.

But now, I can:

$ create table collated_polish (some_text text collate "pl_PL.utf8");
CREATE TABLE
 
$ copy collated_polish from '/tmp/polish';
COPY 14
 
$ select * from collated_polish order by some_text;
some_text
-----------
alfa
car
coś
ćma
lama
łódka
mama
ser
szopa
śnieg
tata
zorza
źdźbło
żółw
(14 rows)

It's interesting, because I had pl_PL locales installed when doing initdb, but I didn't have cs_CZ, so it will fail:

$ create table collated_czech (some_text text collate "cs_CZ.utf8");
ERROR: collation "cs_CZ.utf8" for current database encoding "UTF8" does not exist
LINE 1: create table collated_czech (some_text text collate "cs_CZ.u...
^

Luckily, with all locale files for cs_CZ installed, fixing it is simple:

$ CREATE COLLATION "cs_CZ.utf8" ( locale = 'cs_CZ.UTF-8' );
CREATE COLLATION
 
$ create table collated_czech (some_text text collate "cs_CZ.utf8");
CREATE TABLE
 
$ copy collated_czech from '/tmp/czech';
COPY 8
 
$ select * from collated_czech order by some_text;
ERROR: could not create locale "cs_CZ.UTF-8": Success

The last error is actually pretty bad. I mean – we all know stories about error dialogs saying “error: success", but apparently now it did hit PostgreSQL too.

The problem in here is that I added the locale to system, but didn't restart PostgreSQL. So, a quick restart later:

$ select * from collated_czech order by some_text;
some_text
-----------
cihla
červený
damašek
chleba
kůň
zelí
žluťoučký
1234
(8 rows)

And how can you check what collations there are already? Simple, just:

$ select * from pg_collation;

Of course you can have columns of different collations within single table, like this:

$ create table collated_mix (english text, polish text collate "pl_PL.utf8", czech text collate "cs_CZ.utf8", basic text collate ucs_basic);
CREATE TABLE
 
$ \d collated_mix
Table "public.collated_mix"
Column | Type | Modifiers
---------+------+--------------------
english | text |
polish | text | collate pl_PL.utf8
czech | text | collate cs_CZ.utf8
basic | text | collate ucs_basic

Now. I did put there 10000 rows, each containing the same (within a row) values. Each value is 10 character string containing only these characters:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!@#$%^&*()-=_+[]{},./<>?;:"|

Data looks like this:

$ select * from collated_mix limit 10;
english | polish | czech | basic
------------+------------+------------+------------
[m49rLoOdL | [m49rLoOdL | [m49rLoOdL | [m49rLoOdL
4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c
)#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k
/s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v
j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP
.Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY
4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y
VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/
?RR;|8But# | ?RR;|8But# | ?RR;|8But# | ?RR;|8But#
2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez
(10 rows)

Sorting the rows using different columns show how's that different:

$ select * from collated_mix order by english limit 5;
english | polish | czech | basic
------------+------------+------------+------------
<01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp
0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t
01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP;
:01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf;
01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7
(5 rows)
 
$ select * from collated_mix order by polish limit 5;
english | polish | czech | basic
------------+------------+------------+------------
<01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp
0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t
01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP;
:01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf;
01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7
(5 rows)
 
$ select * from collated_mix order by czech limit 5;
english | polish | czech | basic
------------+------------+------------+------------
AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_
+!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug
Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b
aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY
aAi<dQAhld | aAi<dQAhld | aAi<dQAhld | aAi<dQAhld
(5 rows)
 
$ select * from collated_mix order by basic limit 5;
english | polish | czech | basic
------------+------------+------------+------------
!!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio
!!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq
!!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^
!!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^
!"I7<d.33g | !"I7<d.33g | !"I7<d.33g | !"I7<d.33g
(5 rows)

The interesting thing is ucs_basic collation, which works like “C" locale. That is – it doesn't care about real order of letters, just compares their numerical values. Which is usually not helpful, until you'd want to do like, on them:

$ create index i1 on collated_mix (english);
CREATE INDEX
 
$ create index i2 on collated_mix (polish);
CREATE INDEX
 
$ create index i3 on collated_mix (czech);
CREATE INDEX
 
$ create index i4 on collated_mix (basic);
CREATE INDEX

And now queries:

$ explain select * from collated_mix where english like 'abc%';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on collated_mix (cost=0.00..219.00 rows=1 width=44)
Filter: (english ~~ 'abc%'::text)
(2 rows)
 
$ explain select * from collated_mix where polish like 'abc%';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on collated_mix (cost=0.00..219.00 rows=1 width=44)
Filter: (polish ~~ 'abc%'::text)
(2 rows)
 
$ explain select * from collated_mix where czech like 'abc%';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on collated_mix (cost=0.00..219.00 rows=1 width=44)
Filter: (czech ~~ 'abc%'::text)
(2 rows)
 
$ explain select * from collated_mix where basic like 'abc%';
QUERY PLAN
------------------------------------------------------------------------
Index Scan using i4 on collated_mix (cost=0.00..8.27 rows=1 width=44)
Index Cond: ((basic >= 'abc'::text) AND (basic < 'abd'::text))
Filter: (basic ~~ 'abc%'::text)
(3 rows)

Of course, if you know anything about PostgreSQL, you know that you could use text_pattern_ops, and varchar_pattern_ops to get like ‘…%' from indexes, but the thing is – that you couldn't do it on primary key index!

Which made it necessary to have 2 indexes.

Now, you can have table, with primary key being email in ucs_basic collation, and the like operator will work well:

$ create table users ( email text collate ucs_basic primary key );
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "users_pkey" for table "users"
CREATE TABLE
 
$ explain select * from users where email like 'dep%';
QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=4.32..14.47 rows=7 width=32)
Filter: (email ~~ 'dep%'::text)
-> Bitmap Index Scan on users_pkey (cost=0.00..4.32 rows=7 width=0)
Index Cond: ((email >= 'dep'::text) AND (email < 'deq'::text))
(4 rows)

YEAH! Congratulations and thanks (a lot of them) go to author and reviewers.

  1. 8 comments

  2. # Lafriks
    Mar 4, 2011

    That’s just superb! Looks like this will be great release! :)

  3. Mar 5, 2011

    A big thank you to Peter and all the reviewers for the feature.

    And thank you Depesz for the ‘Waiting for’ series.

  4. Apr 24, 2011

    Am I the only person who is disappointed that “per-column” collation has been accepted rather than doing something with operator classes?

    The actual use case for arbitrary collation is being able to change the locale based on a user-specified property, like Accept-Language on HTTP. If a user comes to my website and speaks French, I need to sort the data I return to him in French collation order. If my website primarily serves users who speak English and French, it would be great to be able to have two indexes, one specified with English collation, and one specified with French collation.

    The easiest way to make that a reality, meanwhile, is to make this a property of an operator class. Much like I can use text_pattern_ops in order build indexes or do order by operations using the C locale, even when my database is set to en_US, I should be able to get text_eu_US_pattern_ops. In an /ideal/ world, using a “meta operator class”, like text_locale_ops(‘en_US’).

    Can someone please explain to me in what way is this ever a property of a column? What is the use case for having multiple columns, each with their own fixed locale?

  5. Apr 24, 2011

    @Jay:
    Reason is very simple. If it was query-based, than you couldn’t have index on the column.

    Indexes make sense only if there is well defined less-than and greater-than operators. Which – if you’d make the collation query-definable – wouldn’t be well defined, because it would be different from one query to another.

  6. # Marcin
    Jun 9, 2011

    @Depesz:
    Yes, your right. But wy not use functional indexes, like this:
    CREATE INDEX same_text_idx ON test (collation(same_text, ‘pl_PL.utf8′));
    CREATE INDEX same_text_idx ON test (collation(same_text, ‘cz_CZ.utf8′));

    SELECT * FROM test ORDER BY collation(same_text, ‘cz_CZ.utf8′);
    ?

  7. Jun 9, 2011

    @Marcin:
    Function based collation is possible in earlier Pgs too, just write smart enough collation() function.

    So it’s not that it has been chosen instead, but rather – new way has been added, which is based on normal indexes, and one-time definition of data collation (instead of redefinition on every query).

  8. # Marcin
    Jun 9, 2011

    Yes, but it’s pity that PostgreSQL don’t offer method like collation(str, locale) from my example, just “out of the box”;-)

  9. Sep 12, 2011

    @Depesz:

    The problem with defining it with the data is that it isn’t a property of the data. You now expect to find 7 identical columns in a row that all collate differently, but otherwise have the exact same contents. What you want are simply 7 different indexes, and we already have that feature using operator classes: you define indexes for the queries you want to accelerate, and you can specify the collations you want to support. You already can already use a custom operator during ORDER BY (thanks to USING), and you already can use a custom operator for comparison (thanks to, well, operators), but what you can’t do is have a “meta operator class”, something that takes an argument (the name of a collation) and returns an operator class: we don’t have a solution for the fact that there are an arbitrary number of collations. If this one feature were added, then this collation feature would already be supported, and we wouldn’t need to duplicate our data in our rows: the engine is already smart enough to choose the right index from the 7 indexes we have based on the operators being used (and if you don’t believe that, an obvious case is text_pattern_ops, which already smooths over a specific collation issue between whatever you are currently using and C collation for byte-based ordering).

Leave a comment