Waiting for 9.1 – Per-column collation support

On 8th of February, Peter Eisentraut committed patch:

Per-column collation support
 
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
 
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
 
Branch
------
master

This is really, really great.

Let's see. I have 9.1 database with American locale (en_US.UTF-8):

$ SELECT name, setting FROM pg_settings WHERE name ~ 'lc_';
    name     |   setting   
-------------+-------------
 lc_collate  | en_US.UTF-8
 lc_ctype    | en_US.UTF-8
 lc_messages | en_US.UTF-8
 lc_monetary | en_US.UTF-8
 lc_numeric  | en_US.UTF-8
 lc_time     | en_US.UTF-8
(6 ROWS)

Now, let's assume we want some Polish words:

$ CREATE TABLE normal_polish ( some_text text );
CREATE TABLE
 
$ copy normal_polish FROM '/tmp/polish';
COPY 14
 
$ SELECT * FROM normal_polish ;
 some_text 
-----------
 ćma
 łódka
 śnieg
 mama
 ser
 car
 lama
 żółw
 zorza
 tata
 źdźbło
 coś
 alfa
 szopa
(14 ROWS)

and some Czech:

$ CREATE TABLE normal_czech ( some_text text );
CREATE TABLE
 
$ copy normal_czech FROM '/tmp/czech';
COPY 8
 
(depesz@[LOCAL]:5910) 13:56:54 [depesz] 
$ SELECT * FROM normal_czech ;
 some_text 
-----------
 1234
 červený
 chleba
 cihla
 damašek
 žluťoučký
 kůň
 zelí
(8 ROWS)

The problem is that while sorted outputs can look sensible to others – they are not properly sorted in Polish and Czech languages:

$ SELECT * FROM normal_czech ORDER BY some_text;
 some_text 
-----------
 1234
 červený
 chleba
 cihla
 damašek
 kůň
 zelí
 žluťoučký
(8 ROWS)
 
$ SELECT * FROM normal_polish ORDER BY some_text;
 some_text 
-----------
 alfa
 car
 ćma
 coś
 lama
 łódka
 mama
 ser
 śnieg
 szopa
 tata
 źdźbło
 żółw
 zorza
(14 ROWS)

Up to 9.0 you couldn't really do anything about it.

But now, I can:

$ CREATE TABLE collated_polish (some_text text COLLATE "pl_PL.utf8");
CREATE TABLE
 
$ copy collated_polish FROM '/tmp/polish';
COPY 14
 
$ SELECT * FROM collated_polish ORDER BY some_text;
 some_text 
-----------
 alfa
 car
 coś
 ćma
 lama
 łódka
 mama
 ser
 szopa
 śnieg
 tata
 zorza
 źdźbło
 żółw
(14 ROWS)

It's interesting, because I had pl_PL locales installed when doing initdb, but I didn't have cs_CZ, so it will fail:

$ CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.utf8");
ERROR:  collation "cs_CZ.utf8" FOR CURRENT DATABASE encoding "UTF8" does NOT exist
LINE 1: CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.u...
                                               ^

Luckily, with all locale files for cs_CZ installed, fixing it is simple:

$ CREATE COLLATION "cs_CZ.utf8" ( locale = 'cs_CZ.UTF-8' );
CREATE COLLATION
 
$ CREATE TABLE collated_czech (some_text text COLLATE "cs_CZ.utf8");
CREATE TABLE
 
$ copy collated_czech FROM '/tmp/czech';
COPY 8
 
$ SELECT * FROM collated_czech ORDER BY some_text;
ERROR:  could NOT CREATE locale "cs_CZ.UTF-8": Success

The last error is actually pretty bad. I mean – we all know stories about error dialogs saying “error: success", but apparently now it did hit PostgreSQL too.

The problem in here is that I added the locale to system, but didn't restart PostgreSQL. So, a quick restart later:

$ SELECT * FROM collated_czech ORDER BY some_text;
 some_text 
-----------
 cihla
 červený
 damašek
 chleba
 kůň
 zelí
 žluťoučký
 1234
(8 ROWS)

And how can you check what collations there are already? Simple, just:

$ SELECT * FROM pg_collation;

Of course you can have columns of different collations within single table, like this:

$ CREATE TABLE collated_mix (english text, polish text COLLATE "pl_PL.utf8", czech text COLLATE "cs_CZ.utf8", basic text COLLATE ucs_basic);
CREATE TABLE
 
$ \d collated_mix 
     TABLE "public.collated_mix"
 COLUMN  | TYPE |     Modifiers      
---------+------+--------------------
 english | text | 
 polish  | text | COLLATE pl_PL.utf8
 czech   | text | COLLATE cs_CZ.utf8
 basic   | text | COLLATE ucs_basic

Now. I did put there 10000 rows, each containing the same (within a row) values. Each value is 10 character string containing only these characters:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!@#$%^&*()-=_+[]{},./<>?;:"|

Data looks like this:

$ SELECT * FROM collated_mix  LIMIT 10;
  english   |   polish   |   czech    |   basic    
------------+------------+------------+------------
 [m49rLoOdL | [m49rLoOdL | [m49rLoOdL | [m49rLoOdL
 4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c | 4HIQT-Kc*c
 )#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k | )#*aD>$e^k
 /s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v | /s5;@wIO/v
 j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP | j;_!NEjhqP
 .Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY | .Y0z)EvlsY
 4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y | 4l|9i!^@;y
 VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/ | VgwGF.UyZ/
 ?RR;|8But# | ?RR;|8But# | ?RR;|8But# | ?RR;|8But#
 2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez | 2xM1&^#kez
(10 ROWS)

Sorting the rows using different columns show how's that different:

$ SELECT * FROM collated_mix  ORDER BY english LIMIT 5;
  english   |   polish   |   czech    |   basic    
------------+------------+------------+------------
 <01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp
 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t
 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP;
 :01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf;
 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7
(5 ROWS)
 
$ SELECT * FROM collated_mix  ORDER BY polish LIMIT 5;
  english   |   polish   |   czech    |   basic    
------------+------------+------------+------------
 <01a=WS%gp | <01a=WS%gp | <01a=WS%gp | <01a=WS%gp
 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t | 0:1iRvr64t
 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP; | 01mM;g&BP;
 :01OJsyOf; | :01OJsyOf; | :01OJsyOf; | :01OJsyOf;
 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7 | 01uoah,MT7
(5 ROWS)
 
$ SELECT * FROM collated_mix  ORDER BY czech LIMIT 5;
  english   |   polish   |   czech    |   basic    
------------+------------+------------+------------
 AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_ | AAA>P>#>^_
 +!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug | +!AA/B/Zug
 Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b | Aa=Ca)K71b
 aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY | aAFJ^1EwsY
 aAi<dQAhld | aAi<dQAhld | aAi<dQAhld | aAi<dQAhld
(5 ROWS)
 
$ SELECT * FROM collated_mix  ORDER BY basic LIMIT 5;
  english   |   polish   |   czech    |   basic    
------------+------------+------------+------------
 !!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio | !!1<bJ;jio
 !!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq | !!>N(g"mYq
 !!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^ | !!N>s{*;c^
 !!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^ | !!QuMts/1^
 !"I7<d.33g | !"I7<d.33g | !"I7<d.33g | !"I7<d.33g
(5 ROWS)

The interesting thing is ucs_basic collation, which works like “C" locale. That is – it doesn't care about real order of letters, just compares their numerical values. Which is usually not helpful, until you'd want to do like, on them:

$ CREATE INDEX i1 ON collated_mix (english);
CREATE INDEX
 
$ CREATE INDEX i2 ON collated_mix (polish);
CREATE INDEX
 
$ CREATE INDEX i3 ON collated_mix (czech);
CREATE INDEX
 
$ CREATE INDEX i4 ON collated_mix (basic);
CREATE INDEX

And now queries:

$ EXPLAIN SELECT * FROM collated_mix  WHERE english LIKE 'abc%';
                          QUERY PLAN                           
---------------------------------------------------------------
 Seq Scan ON collated_mix  (cost=0.00..219.00 ROWS=1 width=44)
   FILTER: (english ~~ 'abc%'::text)
(2 ROWS)
 
$ EXPLAIN SELECT * FROM collated_mix  WHERE polish LIKE 'abc%';
                          QUERY PLAN                           
---------------------------------------------------------------
 Seq Scan ON collated_mix  (cost=0.00..219.00 ROWS=1 width=44)
   FILTER: (polish ~~ 'abc%'::text)
(2 ROWS)
 
$ EXPLAIN SELECT * FROM collated_mix  WHERE czech LIKE 'abc%';
                          QUERY PLAN                           
---------------------------------------------------------------
 Seq Scan ON collated_mix  (cost=0.00..219.00 ROWS=1 width=44)
   FILTER: (czech ~~ 'abc%'::text)
(2 ROWS)
 
$ EXPLAIN SELECT * FROM collated_mix  WHERE basic LIKE 'abc%';
                               QUERY PLAN                               
------------------------------------------------------------------------
 INDEX Scan USING i4 ON collated_mix  (cost=0.00..8.27 ROWS=1 width=44)
   INDEX Cond: ((basic >= 'abc'::text) AND (basic < 'abd'::text))
   FILTER: (basic ~~ 'abc%'::text)
(3 ROWS)

Of course, if you know anything about PostgreSQL, you know that you could use text_pattern_ops, and varchar_pattern_ops to get like ‘…%' from indexes, but the thing is – that you couldn't do it on primary key index!

Which made it necessary to have 2 indexes.

Now, you can have table, with primary key being email in ucs_basic collation, and the like operator will work well:

$ CREATE TABLE users ( email text COLLATE ucs_basic PRIMARY KEY );
NOTICE:  CREATE TABLE / PRIMARY KEY will CREATE implicit INDEX "users_pkey" FOR TABLE "users"
CREATE TABLE
 
$ EXPLAIN SELECT * FROM users WHERE email LIKE 'dep%';
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Bitmap Heap Scan ON users  (cost=4.32..14.47 ROWS=7 width=32)
   FILTER: (email ~~ 'dep%'::text)
   ->  Bitmap INDEX Scan ON users_pkey  (cost=0.00..4.32 ROWS=7 width=0)
         INDEX Cond: ((email >= 'dep'::text) AND (email < 'deq'::text))
(4 ROWS)

YEAH! Congratulations and thanks (a lot of them) go to author and reviewers.

8 thoughts on “Waiting for 9.1 – Per-column collation support”

  1. That’s just superb! Looks like this will be great release! 🙂

  2. A big thank you to Peter and all the reviewers for the feature.

    And thank you Depesz for the ‘Waiting for’ series.

  3. Am I the only person who is disappointed that “per-column” collation has been accepted rather than doing something with operator classes?

    The actual use case for arbitrary collation is being able to change the locale based on a user-specified property, like Accept-Language on HTTP. If a user comes to my website and speaks French, I need to sort the data I return to him in French collation order. If my website primarily serves users who speak English and French, it would be great to be able to have two indexes, one specified with English collation, and one specified with French collation.

    The easiest way to make that a reality, meanwhile, is to make this a property of an operator class. Much like I can use text_pattern_ops in order build indexes or do order by operations using the C locale, even when my database is set to en_US, I should be able to get text_eu_US_pattern_ops. In an /ideal/ world, using a “meta operator class”, like text_locale_ops(‘en_US’).

    Can someone please explain to me in what way is this ever a property of a column? What is the use case for having multiple columns, each with their own fixed locale?

  4. @Jay:
    Reason is very simple. If it was query-based, than you couldn’t have index on the column.

    Indexes make sense only if there is well defined less-than and greater-than operators. Which – if you’d make the collation query-definable – wouldn’t be well defined, because it would be different from one query to another.

  5. @Depesz:
    Yes, your right. But wy not use functional indexes, like this:
    CREATE INDEX same_text_idx ON test (collation(same_text, ‘pl_PL.utf8’));
    CREATE INDEX same_text_idx ON test (collation(same_text, ‘cz_CZ.utf8’));

    SELECT * FROM test ORDER BY collation(same_text, ‘cz_CZ.utf8’);
    ?

  6. @Marcin:
    Function based collation is possible in earlier Pgs too, just write smart enough collation() function.

    So it’s not that it has been chosen instead, but rather – new way has been added, which is based on normal indexes, and one-time definition of data collation (instead of redefinition on every query).

  7. Yes, but it’s pity that PostgreSQL don’t offer method like collation(str, locale) from my example, just “out of the box”;-)

  8. @Depesz:

    The problem with defining it with the data is that it isn’t a property of the data. You now expect to find 7 identical columns in a row that all collate differently, but otherwise have the exact same contents. What you want are simply 7 different indexes, and we already have that feature using operator classes: you define indexes for the queries you want to accelerate, and you can specify the collations you want to support. You already can already use a custom operator during ORDER BY (thanks to USING), and you already can use a custom operator for comparison (thanks to, well, operators), but what you can’t do is have a “meta operator class”, something that takes an argument (the name of a collation) and returns an operator class: we don’t have a solution for the fact that there are an arbitrary number of collations. If this one feature were added, then this collation feature would already be supported, and we wouldn’t need to duplicate our data in our rows: the engine is already smart enough to choose the right index from the 7 indexes we have based on the operators being used (and if you don’t believe that, an obvious case is text_pattern_ops, which already smooths over a specific collation issue between whatever you are currently using and C collation for byte-based ordering).

Comments are closed.