Human/version sort in PostgreSQL – revisited

Couple of months ago I wrote how to do human sort in Pg by using arrays, and splitting string.

This works, but Matt mentioned in comments that it could be done with ICU collations.

So I looked into it …

First, I figured I'll write longer blogpost about collations, but to be honest – I gave up. There are many things to write about, and explaining them, with examples, would take extremely long blogpost.

Plus, it already was done by Peter Eisentrout in here, and here.

So, I figured I'll just show how to get human/version sort with these collations, and be done with it 🙂

First, let's get some sample data:

=$ CREATE TABLE invoices ( inv_no text );
=$ copy invoices (inv_no) FROM stdin;
a-2023-1
a-2023-3
a-2023-21
a-2023-50
a-2023-9
b-2022-5
b-2022-10
\.

Now, if I'd just order it using defaults, I'd get:

=$ SELECT inv_no FROM invoices ORDER BY inv_no;
  inv_no   
-----------
 a-2023-1
 a-2023-21
 a-2023-3
 a-2023-50
 a-2023-9
 b-2022-10
 b-2022-5
(7 ROWS)

Please note that a-2023-21 is before a-2023-3, a-2023-50 before a-2023-9 and b-2022-10 before b-2022-5.

Luckily this is easy fix.

We will need to create ICU collation – a way to tell Pg how to sort data.

Before we can do it, we should see if the Pg we're using does support ICU. To do so, in psql you can simply run:

=$ \dOS
                                                    List OF collations
   Schema   |          Name          | Provider |  COLLATE   |   Ctype    |    ICU Locale    | ICU Rules | Deterministic? 
------------+------------------------+----------+------------+------------+------------------+-----------+----------------
 pg_catalog | C                      | libc     | C          | C          |                  |           | yes
 pg_catalog | C.utf8                 | libc     | C.utf8     | C.utf8     |                  |           | yes
 pg_catalog | POSIX                  | libc     | POSIX      | POSIX      |                  |           | yes
 pg_catalog | af-NA-x-icu            | icu      |            |            | af-NA            |           | yes
 pg_catalog | af-ZA-x-icu            | icu      |            |            | af-ZA            |           | yes
...
 pg_catalog | zu-ZA-x-icu            | icu      |            |            | zu-ZA            |           | yes
 pg_catalog | zu-x-icu               | icu      |            |            | zu               |           | yes
(822 ROWS)

If you don't use psql, simply check:

=$ SELECT collname FROM pg_collation WHERE collprovider = 'i';
        collname        
------------------------
 unicode
 und-x-icu
 af-x-icu
...
 zu-x-icu
 zu-ZA-x-icu
(807 ROWS)

If there are rows with Provider being icu in the psql's \dOS output, or any rows from the select – you have icu support available 🙂

Now, we need to create new collation. And in here I wanted to show how easy it is, but … around two hours of experiments happened, and I couldn't have made it.

Finally Bones on Discord, pointed out that it will work if I remove typo.

So, the working code:

=$ CREATE collation human_sort (
    locale = 'und-u-kn',
    provider = icu,
    deterministic = FALSE
);
 
=$ SELECT * FROM invoices ORDER BY inv_no COLLATE human_sort;
  inv_no   
-----------
 a-2023-1
 a-2023-3
 a-2023-9
 a-2023-21
 a-2023-50
 b-2022-5
 b-2022-10
(7 ROWS)

For the curious ones, I wrote the locale as und-x-kn, and not und-u-kn.

For these even more curious, Peter wrote:

The specification for BCP 47 is RFC 5646. The -u- you see in the examples below is the extension identifier registered to Unicode.

Generally, the docs for collations are … well, they exist. That's the good thing I can say. No idea how to make them, though, as I tried to write a blogpost about various things possible with collations, and I failed.

Anyway, with this simple collation “human_sort", you can sort things like invoice numbers. Or even version numbers:

=$ SELECT *
FROM
    unnest(
        '{9.2.18,8.2.20,9.2.19,12.13,9.3.10,9.5.4,8.4.5,8.4.16,9.3.17,8.0.10,8.1,8.1.10,9.0.18,9.0.21,9.0.5,8.3.1,11.18,12.10,12.9,9.6.16}'::text[]
    ) AS v
ORDER BY v COLLATE human_sort;
   v    
--------
 8.0.10
 8.1
 8.1.10
 8.2.20
 8.3.1
 8.4.5
 8.4.16
 9.0.5
 9.0.18
 9.0.21
 9.2.18
 9.2.19
 9.3.10
 9.3.17
 9.5.4
 9.6.16
 11.18
 12.9
 12.10
 12.13
(20 ROWS)

Anyway. Collations are powerful. Even very powerful. Getting something that sorts, or compares, the way you want is definitely possible, but sometimes will take non-trivial amount of time 🙂

Also, I'd like to thank all people that helped me on IRC, Slack, and Discord when I couldn't figure out why und-x-kn-true doesn't work.