August 31st, 2007 by depesz | Tags: , , , , | 3 comments »
Did it help? If yes - maybe you can help me?

faber4 on irc asked about how to get ascii-based sorting, while his postgresql was initdb'ed with utf-8 based locale (en_US.UTF-8 to be exact).

what can we do about it?

first, let's check if my environment is ok:

# show lc_collate;
 lc_collate
-------------
 pl_PL.UTF-8
(1 row)

looks file. pl_PL is not en_US, but it's utf, so no ascii-based sorting should be done.

now, let's create some test data:

# create table test (x text);
CREATE TABLE
# copy test from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> aaa
>>  ab
>> 1ab
>> ab
>> \.

ok, so let's do the test:

# select * from test order by x;
  x
-----
 1ab
 aaa
 ab
  ab
(4 rows)

ok. it shows that we are not sorting using ascii-codes – if we did, " ab" would be first.

so, how to solve the problem?

the simplest possible approach is to convert all characters into some other in a way that their comparisons will be “like in ascii". the simple way to do it is to convert each character to it's hex-value.

it will convert " " (space) to “20", “a" to “61" and so on.

there is a small issue – i'm not sure how given locale will sort numbers vs. letters. so it would be best if i could use only numbers or only letters. this is simple as with hex-encoding we have only 16 different characters, so we can transpose then to letters without any problems.

so, let's create a ascii-recoding function:

CREATE OR REPLACE FUNCTION ascii_recode(TEXT) RETURNS TEXT AS $BODY$
my $input = shift;
$input =~ s/(.)/sprintf("%02x", ord $1)/ges;
$input =~ y/0-9a-f/a-p/;
return $input;
$BODY$ LANGUAGE plperl immutable;

now, let's test it:

# select ascii_recode('i love postgresql');
            ascii_recode
------------------------------------
 gjcagmgphggfcahagphdheghhcgfhdhbgm
(1 row)

looks good to me. how does it sort?

# select * from test order by ascii_recode(x);
  x
-----
  ab
 1ab
 aaa
 ab
(4 rows)

nice. it works.

what's important – this function is “immutable". which means we can make index on it to use it for sorting larger datasets.

of course – it is not perfect. it will have issues when dealing with multi-byte characters in recoded-strings. but – it's a trick. use with caution 🙂

  1. 3 comments

  2. # spoksss
    Jul 29, 2010

    select * from test order by x using ~<~

  3. Aug 3, 2010

    @Spoksss:
    Nice. I can’t find it anywhere in docs, though.

  4. # dedi
    Sep 17, 2011

    thx for your hint depesz. useful.

    in 8.1/8.3 postgresql version, may be possible to do this:

    select * from test order by ascii(substr(test,1,1));

    cmiiw.

Sorry, comments for this post are disabled.