faber4 on irc asked about how to get ascii-based sorting, while his postgresql was initdb'ed with utf-8 based locale (en_US.UTF-8 to be exact).
what can we do about it?
first, let's check if my environment is ok:
# show lc_collate; lc_collate ------------- pl_PL.UTF-8 (1 row)
looks file. pl_PL is not en_US, but it's utf, so no ascii-based sorting should be done.
now, let's create some test data:
# create table test (x text); CREATE TABLE # copy test from stdin; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> aaa >> ab >> 1ab >> ab >> \.
ok, so let's do the test:
# select * from test order by x; x ----- 1ab aaa ab ab (4 rows)
ok. it shows that we are not sorting using ascii-codes – if we did, " ab" would be first.
so, how to solve the problem?
the simplest possible approach is to convert all characters into some other in a way that their comparisons will be “like in ascii". the simple way to do it is to convert each character to it's hex-value.
it will convert " " (space) to “20", “a" to “61" and so on.
there is a small issue – i'm not sure how given locale will sort numbers vs. letters. so it would be best if i could use only numbers or only letters. this is simple as with hex-encoding we have only 16 different characters, so we can transpose then to letters without any problems.
so, let's create a ascii-recoding function:
CREATE OR REPLACE FUNCTION ascii_recode(TEXT) RETURNS TEXT AS $BODY$ my $input = shift; $input =~ s/(.)/sprintf("%02x", ord $1)/ges; $input =~ y/0-9a-f/a-p/; return $input; $BODY$ LANGUAGE plperl immutable;
now, let's test it:
# select ascii_recode('i love postgresql'); ascii_recode ------------------------------------ gjcagmgphggfcahagphdheghhcgfhdhbgm (1 row)
looks good to me. how does it sort?
# select * from test order by ascii_recode(x); x ----- ab 1ab aaa ab (4 rows)
nice. it works.
what's important – this function is “immutable". which means we can make index on it to use it for sorting larger datasets.
of course – it is not perfect. it will have issues when dealing with multi-byte characters in recoded-strings. but – it's a trick. use with caution 🙂