June 30th, 2011 by depesz | Tags: , , , , , , | 1 comment »
Did it help? If yes - maybe you can help me?

Every so often someone asks why sorting behaves irrational. Like here:

$ select string from test order by string;
  string
----------
 dean
 deer
 de luca
 depesz
 de vil
 dyslexia
(6 rows)

Why aren't “de luca" and “de vil" together?

The exact and quick answer is: because characters with unknown location in alphabet are ignored when sorting.

But let me show some details.

Most of you are familiar with alphabet like this:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z

others might know and understand alphabet like this:

a,ą,b,c,ć,d,e,ę,f,g,h,i,j,k,l,ł,m,n,ń,o,ó,p,r,s,ś,t,u,w,x,y,z,ź,ż

There are also alphabets with different characters. Like ù, æ or ë.

But that's not all. There are some languages which don't use any of these letters, they have their own. Like Russian:

а,б,в,г,д,е,ё,ж,з,и,й,к,л,м,н,о,п,р,с,т,у,ф,х,ц,ч,ш,щ,ъ,ы,ь,э,ю,я

To make it even more interesting – some languages use latin letters, but order of them in alphabet is different.

For example – in Czech language “CH" pair, is considered (as far as sorting is considered) as single character, and it sorted between H and I. Or in French, when in some cases, you order not from left to right, but from right to left.

So. To make sorting work at all we have to have tables, which say which character is before/after which. These are called “collations", and are part of system locales. PostgreSQL uses it to avoid bundling the same tables in its sources.

For example, rules for sorting for polish language look like this:

=$ perl -ne 'print if /^LC_COLLATE/../^END LC_COLLATE/' /usr/share/i18n/locales/pl_PL
LC_COLLATE
copy "iso14651_t1"
 
collating-symbol <aogonek>
collating-symbol <cacute>
collating-symbol <eogonek>
collating-symbol <lstroke>
collating-symbol <nacute>
collating-symbol <oacute>
collating-symbol <sacute>
collating-symbol <zacute>
collating-symbol <zdot>
 
reorder-after <a>
<aogonek>
 
reorder-after <c>
<cacute>
 
reorder-after <e>
<eogonek>
 
reorder-after <l>
<lstroke>
 
reorder-after <n>
<nacute>
 
reorder-after <o>
<oacute>
 
reorder-after <s>
<sacute>
 
reorder-after <z>
<zacute>
<zdot>
 
reorder-after <U00A0>
<U0020> <U0020>;IGNORE;<U0020>;<U0020>
reorder-after <U0061>
<U0105> <aogonek>;<BAS>;<MIN>;IGNORE
reorder-after <U0041>
<U0104> <aogonek>;<BAS>;<CAP>;IGNORE
 
reorder-after <U0063>
<U0107> <cacute>;<BAS>;<MIN>;IGNORE
reorder-after <U0043>
<U0106> <cacute>;<BAS>;<CAP>;IGNORE
 
reorder-after <U0065>
<U0119> <eogonek>;<BAS>;<MIN>;IGNORE
reorder-after <U0045>
<U0118> <eogonek>;<BAS>;<CAP>;IGNORE
 
reorder-after <U006C>
<U0142> <lstroke>;<BAS>;<MIN>;IGNORE
reorder-after <U004C>
<U0141> <lstroke>;<BAS>;<CAP>;IGNORE
 
reorder-after <U006E>
<U0144> <nacute>;<BAS>;<MIN>;IGNORE
reorder-after <U004E>
<U0143> <nacute>;<BAS>;<CAP>;IGNORE
 
reorder-after <U006F>
<U00F3> <oacute>;<BAS>;<MIN>;IGNORE
reorder-after <U004F>
<U00D3> <oacute>;<BAS>;<CAP>;IGNORE
 
reorder-after <U0073>
<U015B> <sacute>;<BAS>;<MIN>;IGNORE
reorder-after <U0053>
<U015A> <sacute>;<BAS>;<CAP>;IGNORE
 
reorder-after <U007A>
<U017A> <zacute>;<BAS>;<MIN>;IGNORE
<U017C> <zdot>;<BAS>;<MIN>;IGNORE
reorder-after <U005A>
<U0179> <zacute>;<BAS>;<CAP>;IGNORE
<U017B> <zdot>;<BAS>;<CAP>;IGNORE
 
reorder-end
 
END LC_COLLATE

What's most important – virtually all (at the very least: all I know) locales, don't specify order of most of the characters. Mostly because (as I understand) that (for example) space could be very well be before first letter, or after last – and it's just a matter of personal preference. While order of letters in alphabet is well established, and usually supported by local law or norms.

So, the point is – every character that has not strict position defined in collation – is ignored, because PostgreSQL, nor system, will decide for you if you want spaces before a, or after z.

Are there any ways to get sorting which takes into consideration spaces?

Sure.

If you'll use pseudo-locale “C", ordering is done using ASCII value of every character. Which ignores all special cases for languages, but uses spaces. Since I don't have (and am not really happy to setup) database with C locale, I can show it using system sort:

=$ locale | grep LC_COLLATE
LC_COLLATE="en_US.UTF-8"
 
=$ psql -qAtX -c  'select string from test' | sort
dean
deer
de luca
depesz
de vil
dyslexia

but if I'll switch to use C locale for sort:

=$ psql -qAtX -c  'select string from test' | LC_ALL=C sort
de luca
de vil
dean
deer
depesz
dyslexia

The problem that I mentioned, can be seen simply with:

=$ echo -e 'ą\nć\nę\nń\nó\nł\nś\nź\nż' | LC_ALL=pl_PL.UTF-8 sort
ą
ć
ę
ł
ń
ó
ś
ź
ż

that's correct order. And when I'll use C “locale":

=$ echo -e 'ą\nć\nę\nń\nó\nł\nś\nź\nż' | LC_ALL=C sort
ó
ą
ć
ę
ł
ń
ś
ź
ż

ó got moved to the beginning of the list.

With PostgreSQL 9.1 you will be able to use COLLATION definition per query:

$ select string from test order by string collate C;
ERROR:  collation "c" for encoding "UTF8" does not exist
LINE 1: select string from test order by string collate C;
                                                ^

This shows, that because of my database is in UTF-8, using C collation cannot work. For such cases we can use “ucs_basic" collation:

$ select string from test order by string collate ucs_basic;
  string
----------
 de luca
 de vil
 dean
 deer
 depesz
 dyslexia
(6 rows)

Unfortunately it also breaks accented letters:

$ select string from test order by string collate ucs_basic;
  string
----------
 de luca
 de vil
 dean
 deer
 depesz
 dyslexia
 ó
 ą
 ć
 ę
 ł
 ń
 ś
 ź
 ż

Which is pretty obvious, because order of ó, ą, and so on is defined in Polish language (and possibly some others, but not necessarily the same), while ucs_basic (as I understand it) orders using numerical value of unicode codepoints.

So, is it possible at all to get proper ordering, with some additional changes, that would cause spaces (or other characters) to be sorted “together"?

Well, there is solution, and quite simple (although not elegant).

We can write simple function, which will modify input string, and return another, that is sortable the way we want.

For example. Let's assume we want proper ordering, and spaces should be before letters. In my case, it means it should be “before a".

So, the easy change is to make letter “a" into “az", and space into “aa". Like this:

CREATE OR REPLACE FUNCTION order_with_spaces( TEXT ) RETURNS TEXT as $$
SELECT
    replace(
        replace( lower($1), 'a', 'az' ),
        ' ', 'aa'
    );
$$ language sql immutable;

One important thing – we need to change a to az first, and space to aa later, because if we'd do it the other way around string “b r" would get changed to “bazazr" and not “bazr".

And now we can see how it works:

$ select string, order_with_spaces( string ) from test order by order_with_spaces( string );
  string  | order_with_spaces
----------+-------------------
 ą        | ą
 ć        | ć
 de luca  | deaalucaz
 de vil   | deaavil
 dean     | deazn
 deer     | deer
 depesz   | depesz
 dyslexia | dyslexiaz
 ę        | ę
 ł        | ł
 ń        | ń
 ó        | ó
 ś        | ś
 ź        | ź
 ż        | ż
(15 rows)

As you can see both polish accents, and spaces are ordered the way we'd want.

Please note that the substitutions I used (aa and az) are safe in my locale, but you have to check if they are safe in your locale too, if you'd want to use such hack.

Using the same method we could also add to ordering dots, commas, or any other character that's important to you, but hasn't been included in collation tables.

The important factor in here is that the order_with_spaces() function is immutable, which means that you can use it as base for index. But this leads to one small issue – if you'll use it as base for index, and then you'll change the definition of the function (because, for example, you want to add special place for – character) – you will have to reindex all indexes that use the function.

  1. One comment

  2. Jul 8, 2011

    The technical details of this are not quite correct. The collation definitions on glibc do provide a sort order for all (well, maybe most) characters. The reason why spaces appear to be ignored is that collation elements are that sorted in multiple passes (usually 3 or 4), and the collation setting for space is

    IGNORE;IGNORE;IGNORE;

    so that it is ignored in the first three passed, unlike letters and other “more important” characters.

Sorry, comments for this post are disabled.