postgresql tips & tricks

faber4 on irc asked about how to get ascii-based sorting, while his postgresql was initdb'ed with utf-8 based locale (en_US.UTF-8 to be exact).

what can we do about it?

first, let's check if my environment is ok:

# show lc_collate;
 lc_collate
-------------
 pl_PL.UTF-8
(1 row)

looks file. pl_PL is not en_US, but it's utf, so no ascii-based sorting should be done.

now, let's create some test data:

# CREATE TABLE test (x text);
CREATE TABLE
# copy test FROM stdin;
Enter DATA TO be copied followed BY a newline.
END WITH a backslash AND a period ON a line BY itself.
>> aaa
>>  ab
>> 1ab
>> ab
>> \.

ok, so let's do the test:

# SELECT * FROM test ORDER BY x;
  x
-----
 1ab
 aaa
 ab
  ab
(4 ROWS)

ok. it shows that we are not sorting using ascii-codes – if we did, " ab" would be first.

so, how to solve the problem?

the simplest possible approach is to convert all characters into some other in a way that their comparisons will be “like in ascii". the simple way to do it is to convert each character to it's hex-value.

it will convert " " (space) to “20", “a" to “61" and so on.

there is a small issue – i'm not sure how given locale will sort numbers vs. letters. so it would be best if i could use only numbers or only letters. this is simple as with hex-encoding we have only 16 different characters, so we can transpose then to letters without any problems.

so, let's create a ascii-recoding function:

CREATE OR REPLACE FUNCTION ascii_recode(TEXT) RETURNS TEXT AS $BODY$
my $input = shift;
$input =~ s/(.)/sprintf("%02x", ord $1)/ges;
$input =~ y/0-9a-f/a-p/;
RETURN $input;
$BODY$ LANGUAGE plperl immutable;

now, let's test it:

# SELECT ascii_recode('i love postgresql');
            ascii_recode
------------------------------------
 gjcagmgphggfcahagphdheghhcgfhdhbgm
(1 ROW)

looks good to me. how does it sort?

# SELECT * FROM test ORDER BY ascii_recode(x);
  x
-----
  ab
 1ab
 aaa
 ab
(4 ROWS)

nice. it works.

what's important – this function is “immutable". which means we can make index on it to use it for sorting larger datasets.

of course – it is not perfect. it will have issues when dealing with multi-byte characters in recoded-strings. but – it's a trick. use with caution 🙂

3 thoughts on “postgresql tips & tricks”

  1. thx for your hint depesz. useful.

    in 8.1/8.3 postgresql version, may be possible to do this:

    select * from test order by ascii(substr(test,1,1));

    cmiiw.

Comments are closed.