Human/version sort in PostgreSQL

Ever been in situation where you had to sort data that is partially text, and partially numerical?

Like invoice numbers: prefix-9, prefix-10, prefix-11, other-5, other-20 ? Normally you can't do order by as you will get them in wrong order: other-20 ⇒ other-5 ⇒ prefix-10 ⇒ prefix-11 ⇒ prefix-9. Can something be done with it?

As you could have guessed the answer is yes. Though it's not all that trivial.

First of all – we will need to split the data into separate parts that are textual, and separate that are numbers. For this I can use simple regexp matching:

=$ SELECT * FROM regexp_matches( 'ab-23-b-12-z', '([^0-9]*)([0-9]*)', 'g');
 regexp_matches 
----------------
 {ab-,23}
 {-b-,12}
 {-z,""}
 {"",""}
(4 ROWS)

This is pretty good first step.

Next one – to make it possible, for PostgreSQL, to sort 2, 5, 10 in this order, we need to make it understand that it is a number. regexp_matches returns array of texts.

We can't have array that contains text, and number, but we can make specialized datatype for this:

=$ CREATE TYPE sortable AS ( FIRST TEXT, SECOND INT8 );

With this in place, we can make every row returned from regexp_matches into single value using sortable type:

=$ SELECT (m[1], NULLIF(m[2], ''))::sortable
FROM regexp_matches( 'ab-23-b-12-z', '([^0-9]*)([0-9]*)', 'g') AS x (m);
   ROW    
----------
 (ab-,23)
 (-b-,12)
 (-z,)
 ("",)
(4 ROWS)

We need nullif, because, as you could have seen above, sometimes the 2nd element of array is empty string, and empty string can't be cast to integer. But if it is null – it can. So, nullif(…, ") makes it null if the input is empty string.

Sweet. Now, finally, to make it usable, we need to join it all together into array. But we should make sure that we force proper order. Thankfully, we can use ORDINALITY:"

=$ SELECT array_agg((m[1], NULLIF(m[2], ''))::sortable ORDER BY o) FROM regexp_matches( 'ab-23-b-12-z', '([^0-9]*)([0-9]*)', 'g') WITH ORDINALITY AS x (m, o);
                 array_agg                 
-------------------------------------------
 {"(ab-,23)","(-b-,12)","(-z,)","(\"\",)"}
(1 ROW)

The data in here doesn't look too interesting, and we definitely don't want to store it, so let's wrap it all in a function that we can use for order by:

=$ CREATE OR REPLACE FUNCTION human_sort( IN p_input TEXT ) RETURNS sortable[] AS $$
SELECT array_agg(
    (rm.m[1], NULLIF(rm.m[2], ''))::sortable
    ORDER BY rm.o
) FROM regexp_matches( p_input, '([^0-9]*)([0-9]*)', 'g') WITH ordinality AS rm (m, o);
$$ LANGUAGE SQL;

And let's test it:

=$ CREATE TABLE test_data (
    id INT8 generated always AS IDENTITY,
    the_thing TEXT
);
 
=$ copy test_data (the_thing) FROM stdin;
a-123
a-9
b-5
a
e
z
1
12
5
b-5-a
b-12-z
b-7-c
b-7-d-5
b-7-d-10
b-7-e
\.

and now, I can sort the data using function:

=$ SELECT * FROM test_data ORDER BY human_sort(the_thing);
 id | the_thing 
----+-----------
  7 | 1
  9 | 5
  8 | 12
  4 | a
  2 | a-9
  1 | a-123
  3 | b-5
 10 | b-5-a
 12 | b-7-c
 13 | b-7-d-5
 14 | b-7-d-10
 15 | b-7-e
 11 | b-12-z
  5 | e
  6 | z
(15 ROWS)

Of course you might want to tweak it a bit if you'd want, for example, simple numericals to be at the end, or perhaps handle fractional numerics, but the method that I described here should be clear enough to make any customizations easily.

I hope you'll find it useful 🙂


2 thoughts on “Human/version sort in PostgreSQL”

  1. @Matt:
    You’re absolutely right. The only issue is that I find their names/syntax rather opaque. Maybe I should do more research and write about it …

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.