Waiting for 9.6 – Implement lookbehind constraints in our regular-expression engine.

On 30th of October, Tom Lane committed patch:

Implement lookbehind constraints in our regular-expression engine.
A lookbehind constraint is like a lookahead constraint in that it consumes
no text; but it checks for existence (or nonexistence) of a match *ending*
at the current point in the string, rather than one *starting* at the
current point.  This is a long-requested feature since it exists in many
other regex libraries, but Henry Spencer had never got around to
implementing it in the code we use.
Just making it work is actually pretty trivial; but naive copying of the
logic for lookahead constraints leads to code that often spends O(N^2) time
to scan an N-character string, because we have to run the match engine
from string start to the current probe point each time the constraint is
checked.  In typical use-cases a lookbehind constraint will be written at
the start of the regex and hence will need to be checked at every character
--- so O(N^2) work overall.  To fix that, I introduced a third copy of the
core DFA matching loop, paralleling the existing longest() and shortest()
loops.  This version, matchuntil(), can suspend and resume matching given
a couple of pointers' worth of storage space.  So we need only run it
across the string once, stopping at each interesting probe point and then
resuming to advance to the next one.
I also put in an optimization that simplifies one-character lookahead and
lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND
constraints, which already existed in the engine.  This avoids the overhead
of the LACON machinery entirely for these rather common cases.
The net result is that lookbehind constraints run a factor of three or so
slower than Perl's for multi-character constraints, but faster than Perl's
for one-character constraints ... and they work fine for variable-length
constraints, which Perl gives up on entirely.  So that's not bad from a
competitive perspective, and there's room for further optimization if
anyone cares.  (In reality, raw scan rate across a large input string is
probably not that big a deal for Postgres usage anyway; so I'm happy if
it's linear.)

Yeah – I have to write about it, because I actually do love regexps, and I always was bitching that our (PostgreSQL) regexp engine sucks. It still does, but slightly less, so I have to write about it.

What was added – lookbehind. What it is is (somewhat) described above, so let's see how to use it. Unfortunately I don't have ready examples there lookbehind would be the only possible choice of solving the problem, but at the very least you'll see the syntax.

Basically, we got two new “operators" in regexp:

  • (?<=...) - positive lookbehind
  • (?
  • Let's assume we have following string:

    xxaa12 yyab22 zzbb33

    And we want to find a number(s) that are preceded by two letters from (‘a', ‘e', ‘i', ‘o', ‘u', ‘y') set (only “12" in our example, as it's preceeded by “aa"), and return the number together with it's prefixing string.

    This can be done, with:

    $ SELECT regexp_matches( 'xxaa12 yyab22 zzbb33', E'([^0-9[:space:]]+)(?<=[aeiouy]{2})([0-9]+)', 'g' );
    (1 ROW)

    Please note that despite having three elements in ()s, we only got two matching elements – that's because lookbehind (and lookahead) are constraints, and not groupings.

    And what if we'd want all numbers that are not immediately after two “vowels"?

    $ SELECT regexp_matches( 'xxaa12 yyab22 zzbb33', E'([^0-9[:space:]]+)(?<![aeiouy]{2})([0-9]+)', 'g' );
    (2 ROWS)

    Two matches, as expected.

    While zero-width constraints are not very popular, they do have their uses, and I, for one, am very happy that we have it now. If we only could get greedy/non-greedy parts of single RE, that would be simply amazing. And named unicode character classes…

    Anyway – I'm happy that it got committed, thanks Tom.

One thought on “Waiting for 9.6 – Implement lookbehind constraints in our regular-expression engine.”

  1. Description of second operators seems became HTML’s tag, it ate a part of the description:

    Basically, we got two new “operators” in regexp:
    (?<=…) – positive lookbehind

Comments are closed.