Waiting for PostgreSQL 15 – Add assorted new regexp_xxx SQL functions.

On 3rd of August 2021, Tom Lane committed patch:

Add assorted new regexp_xxx SQL functions.
 
This patch adds new functions regexp_count(), regexp_instr(),
regexp_like(), and regexp_substr(), and extends regexp_replace()
with some new optional arguments.  All these functions follow
the definitions used in Oracle, although there are small differences
in the regexp language due to using our own regexp engine -- most
notably, that the default newline-matching behavior is different.
Similar functions appear in DB2 and elsewhere, too.  Aside from
easing portability, these functions are easier to use for certain
tasks than our existing regexp_match[es] functions.
 
Gilles Darold, heavily revised by me
 
Discussion: https://postgr.es/m/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

I missed it. Sorry. But found it after almost 4 months, and decided to write about.

I'm huge fan of regexps. Interestingly – I'm not fan of regexps in Pg, as the regexp engine in PostgreSQL has weirdish limits that don't sit well with me, especially since there are so many other options.

Anyway, up to Pg14, we had these regexp functions:

  • regexp_match
  • regexp_matches
  • regexp_replace
  • regexp_split_to_array
  • regexp_split_to_table

Now, we got 4 more:

regexp_count( string, pattern [, start [, flags]]

Returns how many times given pattern matches in source string, optionally with index that means that it should start searching from n-th characters in a string, and with some optional flags.

For example:

=$ SELECT regexp_count('depesz DEPESZ depesz DepEsz', 'de.e');
 regexp_count 
--------------
            2
(1 ROW)
 
=$ SELECT regexp_count('depesz DEPESZ depesz DepEsz', 'de.e', 2);
 regexp_count 
--------------
            1
(1 ROW)
 
=$ SELECT regexp_count('depesz DEPESZ depesz DepEsz', 'de.e', 1, 'i');
 regexp_count 
--------------
            4
(1 ROW)

regexp_instr( string, pattern [, start [, N [, endoption [, flags [, subexpr ]]]]])

This one is even more complex, with more optional parameters.

In it's simplest form it returns position at which pattern starts in string:

=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '..[aeiou]' ) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p |          substr          
---+--------------------------
 3 | cdefghijklmnopqrstuvwxyz
(1 ROW)

Second argument is simply character you want to start working from. So if I'd say I want to start from 4th:

=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '..[aeiou]', 4 ) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p ) FROM INPUT, ri;
 p |        substr        
---+----------------------
 7 | ghijklmnopqrstuvwxyz
(1 ROW)

Third is which match should regexp_instr consider for position calculation. By default it starts from first match, but we can easily make it look for third:

=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '..[aeiou]', 1, 3 ) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p  |     substr     
----+----------------
 13 | mnopqrstuvwxyz
(1 ROW)

Fourth option is endoption. It can be only 0/null/1, and if it's set to 1, regexp_instr will return position of first character after given match:

=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '..[aeiou]', 1, 3, 1 ) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p  |   substr    
----+-------------
 16 | pqrstuvwxyz
(1 ROW)

This leads to interesting question -what will it show for regexp that doesn't match?

=$ SELECT regexp_instr('aaa', 'b', 1, 1, 1);
 regexp_instr 
--------------
            0
(1 ROW)

and, obviously, what will happen if it matches at the end?

=$ SELECT regexp_instr('abc', 'c', 1, 1, 1);
 regexp_instr 
--------------
            4
(1 ROW)

Fifth argument is flags, which I mentioned earlier with regexp_count, but sixth is a new thing: subexpr. What does that mean?

It means that if my pattern contains elements in parens I can ask for specific set of parens:

=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '') AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p |          substr          
---+--------------------------
 3 | cdefghijklmnopqrstuvwxyz
(1 ROW)
 
=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '', 1) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p |         substr          
---+-------------------------
 4 | defghijklmnopqrstuvwxyz
(1 ROW)
 
=$ WITH INPUT AS (SELECT 'abcdefghijklmnopqrstuvwxyz'::text AS i),
ri AS (SELECT regexp_instr( i, '.(.[aeiou])(..)', 1, 1, 0, '', 2) AS p FROM INPUT)
SELECT ri.p, substr( INPUT.i, ri.p) FROM INPUT, ri;
 p |        substr         
---+-----------------------
 6 | fghijklmnopqrstuvwxyz
(1 ROW)

Complex. And shows that we really could use named arguments for built-in functions. I'd much rather see:

SELECT regexp_instr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', subexpr => 2 )

than

SELECT regexp_instr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', 1, 1, 0, '', 2 )

regexp_like( string, pattern [, flags])

This time simple true/false whether given pattern matches string. With optional flags:

=$ SELECT regexp_like( 'AAA', 'a+' ), regexp_like( 'AAA', 'a+', 'i');
 regexp_like | regexp_like 
-------------+-------------
 f           | t
(1 ROW)

regexp_substr( string, pattern [, start [, N [, flags [, subexpr ]]]])

This is somewhat similar to regexp_instr, at least when it comes to options that it gets. Aside from endoption, all the other are the same, and work the same way.

The difference is that instead of returning position, it will return what was matched. Including all selectors like start position, n-th match, flags, and subexpression number:

=$ SELECT n AS nth, s AS subexpression, regexp_substr( 'abcdefghijklmnopqrstuvwxyz', '.(.[aeiou])(..)', 1, n, '', s )
FROM generate_series(1,3) n, generate_series(0, 2) s;
 nth | subexpression | regexp_substr 
-----+---------------+---------------
   1 |             0 | cdefg
   1 |             1 | de
   1 |             2 | fg
   2 |             0 | mnopq
   2 |             1 | no
   2 |             2 | pq
   3 |             0 | stuvw
   3 |             1 | tu
   3 |             2 | vw
(9 ROWS)

Additionally, to the four new functions we also got extension to:

regexp_replace(source, pattern, replacement [, start [, N ]] [, flags ])

Two new options were added – start and N, which, just like with other functions mean that matching should considering string from character number START, and work on Nth match, so I can:

=$ SELECT regexp_replace('hubert xxx lubaczewski', '(\S+)', 'depesz', 1, 2);
      regexp_replace       
---------------------------
 hubert depesz lubaczewski
(1 ROW)

Pretty handy. Thanks a lot to all involved, and sorry for lag with description.