On 9th of April, Tom Lane committed patch:
Support indexing of regular-expression searches in contrib/pg_trgm. This works by extracting trigrams from the given regular expression, in generally the same spirit as the previously-existing support for LIKE searches, though of course the details are far more complicated. Currently, only GIN indexes are supported. We might be able to make it work with GiST indexes later. The implementation includes adding API functions to backend/regex/ to provide a view of the search NFA created from a regular expression. These functions are meant to be generic enough to be supportable in a standalone version of the regex library, should that ever happen. Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
One day later Tom Lane added support for the same operations using GiST indexes (original patch was working only with GIN).
Ever since 9.1 we knew it is possible to speed up substring searches. Now – the same thing can be used to speed up regexps.
How does it work?
I made a test table:
$ \d man_lines TABLE "public.man_lines" COLUMN | TYPE | Modifiers ----------+------+----------- manpage | text | man_line | text | Indexes: "trgm_idx" gin (man_line gin_trgm_ops)
It contains manpages, split by line. Looks like this:
$ SELECT * FROM man_lines ORDER BY random() LIMIT 10; manpage | man_line --------------------------------+--------------------------------------------------------------------------------- Digest::MD5.3 | OR we can USE the addfile method FOR more efficient reading OF the file: perl5160delta.1 | This policy IS described IN greater detail IN perlpolicy. perltoc.1.gz | DESCRIPTION HTML::Tree::AboutTrees.3 | grow($s, $depth + 1); Mojolicious::Guides::Routing.3 | # /bye -> {controller => 'foo', action => 'bye', mymessage => 'Bye'} foo2zjs-wrapper.1.gz | the Foomatic printer DATABASE. perltoc.1.gz | config() bc.1.gz | "current balance = "; bal apt.conf.5.gz | LES OPTIONS DE DéBOGAGE mplayer.1.gz | vhq (10 ROWS)
Index that is created on the man_line column was done using:
$ CREATE extension pg_trgm; $ CREATE INDEX trgm_idx ON man_lines USING gin (man_line gin_trgm_ops);
Now, I can use this index in case of some regexps. Not all of course.
Some examples:
$ EXPLAIN SELECT * FROM man_lines WHERE man_line ~ 'a..c'; QUERY PLAN ------------------------------------------------------------------ Seq Scan ON man_lines (cost=0.00..56251.53 ROWS=34387 width=79) FILTER: (man_line ~ 'a..c'::text) (2 ROWS)
As you can see – index was not used. Because it is not possible to generate list of all trigrams that would match such expression. On the other hand, this, very simplistic, regexp – can use index:
$ EXPLAIN SELECT * FROM man_lines WHERE man_line ~ 'abc'; QUERY PLAN -------------------------------------------------------------------------- Bitmap Heap Scan ON man_lines (cost=17.32..662.42 ROWS=170 width=79) Recheck Cond: (man_line ~ 'abc'::text) -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..17.28 ROWS=170 width=0) INDEX Cond: (man_line ~ 'abc'::text) (4 ROWS)
What about something more complicated?
$ EXPLAIN analyze SELECT * FROM man_lines WHERE man_line ~ 'a{3}'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON man_lines (cost=17.32..662.42 ROWS=170 width=79) (actual TIME=8.003..10.376 ROWS=232 loops=1) Recheck Cond: (man_line ~ 'a{3}'::text) ROWS Removed BY INDEX Recheck: 74 -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..17.28 ROWS=170 width=0) (actual TIME=7.912..7.912 ROWS=306 loops=1) INDEX Cond: (man_line ~ 'a{3}'::text) Total runtime: 10.503 ms (6 ROWS)
Nice. What about something even more complicated:
$ EXPLAIN analyze SELECT * FROM man_lines WHERE man_line ~ '(a|b)(cd|e)[fg]'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON man_lines (cost=281.25..28377.26 ROWS=17193 width=79) (actual TIME=16.620..60.913 ROWS=14328 loops=1) Recheck Cond: (man_line ~ '(a|b)(cd|e)[fg]'::text) ROWS Removed BY INDEX Recheck: 2640 -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..276.95 ROWS=17193 width=0) (actual TIME=13.567..13.567 ROWS=16968 loops=1) INDEX Cond: (man_line ~ '(a|b)(cd|e)[fg]'::text) Total runtime: 61.497 ms (6 ROWS)
In case regexp contains other operators – it also can be indexed:
$ EXPLAIN analyze SELECT * FROM man_lines WHERE man_line ~* 'postgre.*sql'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON man_lines (cost=89.32..734.42 ROWS=170 width=79) (actual TIME=308.631..315.839 ROWS=1198 loops=1) Recheck Cond: (man_line ~* 'postgre.*sql'::text) ROWS Removed BY INDEX Recheck: 12 -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..89.28 ROWS=170 width=0) (actual TIME=308.554..308.554 ROWS=1210 loops=1) INDEX Cond: (man_line ~* 'postgre.*sql'::text) Total runtime: 315.934 ms (6 ROWS)
All in all – it looks simply amazing. Almost magical:
$ EXPLAIN analyze SELECT * FROM man_lines WHERE man_line ~* '(?:(?:p(?:ostgres(?:ql)?|g?sql)|sql)) (?:(?:(?:mak|us)e|do|is))'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan ON man_lines (cost=285.32..930.42 ROWS=170 width=79) (actual TIME=579.765..592.435 ROWS=123 loops=1) Recheck Cond: (man_line ~* '(?:(?:p(?:ostgres(?:ql)?|g?sql)|sql)) (?:(?:(?:mak|us)e|do|is))'::text) ROWS Removed BY INDEX Recheck: 1909 -> Bitmap INDEX Scan ON trgm_idx (cost=0.00..285.28 ROWS=170 width=0) (actual TIME=578.780..578.780 ROWS=2032 loops=1) INDEX Cond: (man_line ~* '(?:(?:p(?:ostgres(?:ql)?|g?sql)|sql)) (?:(?:(?:mak|us)e|do|is))'::text) Total runtime: 592.474 ms (6 ROWS)
Just in case: above regexp matches strings which contain one of:
- postgresql
- sql
- postgres
- pgsql
- psql
then a space, and then one of:
- do
- is
- use
- make
Of course it's not even close to what TSearch2 can do, but still – it's a really cool thing. Thanks Alexander.
A little note about regexp search vs tsearch2. I would prefere to not compare them, because they are different features for different use cases. Imagine you’re searching in logs or in some scientific data like DNA/protein sequences or chemical formulas. You probably could write a special parsers and special dictionaries for tsearch2. But you might as well use regexp search without such preliminary work.
@Alexander:
Of course. The thing is – just like people use LIKE/ILIKE for full text searching, they also (though to lesser extent) use regexps.
Anyway – I really like what you did. Great stuff.
@depesz:
Just to be clear. I’m not encourage people to do full text searching with regexps: they might do it on their own risk 🙂
Thank you for appreciating my work!
Wow! So many people have been telling me that it’s impossible for any database to use an index to support Regular Expression queries. I think it’s wonderful that the opposite is true, and I’m so happy that PostgreSQL the database for this because I use PostgreSQL for so many important projects.