indexable "field like '%something'"

for a long time everybody knew that you can't use an index for "LIKE" operations.

then came text_pattern_ops, so we could use indexes for prefix searches:

# \d depesz_test
                         Table "public.depesz_test"
 Column |  Type   |                        Modifiers
--------+---------+----------------------------------------------------------
 id     | integer | not null default nextval('depesz_test_id_seq'::regclass)
 email  | text    | not null
Indexes:
    "depesz_test_pkey" PRIMARY KEY, btree (id)
    "x" UNIQUE, btree (email text_pattern_ops)
# EXPLAIN ANALYZE SELECT COUNT(*) FROM depesz_test WHERE email LIKE 'dep%';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=96.71..96.72 rows=1 width=0) (actual time=0.983..0.985 rows=1 loops=1)
   ->  Bitmap Heap Scan on depesz_test  (cost=4.68..96.65 rows=24 width=0) (actual time=0.184..0.641 rows=155 loops=1)
         Filter: (email ~~ 'dep%'::text)
         ->  Bitmap Index Scan on x  (cost=0.00..4.67 rows=24 width=0) (actual time=0.158..0.158 rows=155 loops=1)
               Index Cond: ((email ~>=~ 'dep'::text) AND (email ~<~ 'deq'::text))
 Total runtime: 1.067 ms
(6 rows)
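the Index Cond above shows what text_pattern_ops buys us: the planner rewrites LIKE 'dep%' into a range scan between 'dep' and 'deq'. a toy sketch of that rewrite in Python – a sorted list stands in for the btree, the helper name and sample addresses are made up, and incrementing the last character is a simplification that only works cleanly for ascii:

```python
import bisect

def prefix_range(prefix: str):
    # simplified version of what the planner does for LIKE 'prefix%':
    # matching values lie in [prefix, prefix-with-last-char-incremented)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

# a sorted list standing in for the btree index on email
index = sorted(["adam@x.pl", "depesz@onet.pl", "dept@y.com", "zoe@z.org"])

lo, hi = prefix_range("dep")            # ('dep', 'deq')
start = bisect.bisect_left(index, lo)   # first entry >= 'dep'
stop = bisect.bisect_left(index, hi)    # first entry >= 'deq'
matches = index[start:stop]             # everything starting with 'dep'
```

the two bisect calls are the index-scan part; only the slice between them ever gets inspected, which is why the query above doesn't touch the whole table.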

but what if i'd like to search for '%something'? not a prefix, but a suffix. in my example – what can i do to use an index when searching for people from a given domain?

sanity check – perhaps it works now :):

# EXPLAIN ANALYZE SELECT COUNT(*) FROM depesz_test WHERE email LIKE '%pl';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=19881.37..19881.38 rows=1 width=0) (actual time=602.059..602.061 rows=1 loops=1)
   ->  Seq Scan on depesz_test  (cost=0.00..19776.34 rows=42011 width=0) (actual time=3.803..601.456 rows=282 loops=1)
         Filter: (email ~~ '%pl'::text)
 Total runtime: 602.131 ms
(4 rows)

ok. seq scan, so at least we're sure that the environment is sane 🙂

one possible approach would be to add another field, and copy the domain from email into it.

but then – some people have emails like @poczta.onet.pl. and what if i'd like to count them too when looking for onet.pl? i would still have the same problem.
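a quick illustration of why the extra-column approach falls short – the addresses here are made up for the example:

```python
# hypothetical sample addresses, including one on a subdomain
emails = ["depesz@onet.pl", "someone@poczta.onet.pl", "other@wp.pl"]

# an extra "domain" column would hold everything after the '@':
domains = [e.split("@", 1)[1] for e in emails]

# exact match on the domain column misses the subdomain address:
exact = [e for e, d in zip(emails, domains) if d == "onet.pl"]

# a suffix match on the whole address catches both:
suffix = [e for e in emails if e.endswith("onet.pl")]
```

`exact` contains only the direct onet.pl address, while `suffix` also picks up the poczta.onet.pl one – so what we really want is an indexable suffix search, not an extra column.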

fortunately there is a very simple solution. let's modify the search so it will not be a suffix search, but a prefix search. after all – it's just the other end of the string.

so, instead of: field like '%domain' i would have to write: reversed-field like 'niamod%';

seems simple.
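the equivalence is easy to sanity-check outside the database. a small Python sketch (the function and sample addresses are illustrative, not the actual pl/perl code):

```python
def text_reverse(s: str) -> str:
    # Python's slice-reverse stands in for the pl/perl function below
    return s[::-1]

def like_suffix(value: str, suffix: str) -> bool:
    # value LIKE '%suffix' -- the form we can't index
    return value.endswith(suffix)

def like_prefix(value: str, prefix: str) -> bool:
    # value LIKE 'prefix%' -- the indexable form with text_pattern_ops
    return value.startswith(prefix)

emails = ["depesz@onet.pl", "foo@example.com", "bar@gazeta.pl"]

# the two formulations select exactly the same rows:
a = [e for e in emails if like_suffix(e, "pl")]
b = [e for e in emails if like_prefix(text_reverse(e), text_reverse("pl"))]
assert a == b
```

reversing both the field and the pattern turns every suffix test into a prefix test on the reversed string, which is exactly what the functional index below will exploit.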

how do i reverse the field? in sql there is no function to do it. i could technically write a reversing function in plpgsql, but hey, we've got pl/perl 🙂

this small function does reversing:

CREATE FUNCTION text_reverse(text) RETURNS text AS $BODY$
    # reverse in scalar context reverses the string
    return scalar reverse shift;
$BODY$ LANGUAGE plperl IMMUTABLE;

if you don't understand it, don't worry. it works:

# SELECT text_reverse('depesz');
 text_reverse
--------------
 zseped
(1 row)

one important thing – please notice the 'immutable' word at the end of the function
definition. this flag makes it possible to create a functional index. and we like
functional indexes because with them we don't need another field in the table 🙂

so, now, let's create the index:

# CREATE UNIQUE INDEX rev_x ON depesz_test ( text_reverse(email) text_pattern_ops );

index created, let's see if it will be used:

# EXPLAIN ANALYZE SELECT COUNT(*) FROM depesz_test WHERE text_reverse(email) LIKE text_reverse('%pl');
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8098.19..8098.20 rows=1 width=0) (actual time=9.225..9.227 rows=1 loops=1)
   ->  Bitmap Heap Scan on depesz_test  (cost=146.76..8085.06 rows=5251 width=0) (actual time=0.583..8.468 rows=282 loops=1)
         Filter: (text_reverse(email) ~~ 'lp%'::text)
         ->  Bitmap Index Scan on rev_x  (cost=0.00..145.44 rows=5251 width=0) (actual time=0.314..0.314 rows=282 loops=1)
               Index Cond: ((text_reverse(email) ~>=~ 'lp'::text) AND (text_reverse(email) ~<~ 'lq'::text))
 Total runtime: 9.307 ms
(6 rows)

works nicely 🙂

instead of writing "like text_reverse('%pl')" i could have written "like 'lp%'", but this extra call doesn't cost me much, and it makes it more readable.

UPDATE

just after writing this, dim from #postgresql told me that there was an article on this in french. sorry, i didn't know about it. in case you'd like to check the french article, please follow this link.

dim also suggested doing a small benchmark of the plpgsql and plperl versions.

so i did, on my test dataset:

# SELECT COUNT(*), SUM(LENGTH(email)) FROM depesz_test;
  count  |   sum
---------+----------
 1050267 | 17935093
(1 row)

i used my plperl function and the plpgsql function from the french article, and ran the test – first the plperl version:

first try:

# EXPLAIN ANALYZE SELECT *, text_reverse(email) FROM depesz_test;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on depesz_test  (cost=0.00..279717.42 rows=1050267 width=22) (actual time=0.099..11762.904 rows=1050267 loops=1)
 Total runtime: 13899.248 ms
(2 rows)

second try:

# EXPLAIN ANALYZE SELECT *, text_reverse(email) FROM depesz_test;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on depesz_test  (cost=0.00..279717.42 rows=1050267 width=22) (actual time=0.103..11759.954 rows=1050267 loops=1)
 Total runtime: 13990.475 ms
(2 rows)

now plpgsql:

# EXPLAIN ANALYZE SELECT *, reverse(email) FROM depesz_test;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on depesz_test  (cost=0.00..279717.42 rows=1050267 width=22) (actual time=124.767..51294.990 rows=1050267 loops=1)
 Total runtime: 53241.475 ms
(2 rows)
# EXPLAIN ANALYZE SELECT *, reverse(email) FROM depesz_test;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on depesz_test  (cost=0.00..279717.42 rows=1050267 width=22) (actual time=0.199..54151.088 rows=1050267 loops=1)
 Total runtime: 56177.876 ms
(2 rows)

as we can see – perl is much faster when it comes to text operations. of course you will not see this difference in select times, but it will influence index creation/update times.
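the gap is plausible: perl's reverse is a single built-in implemented in C, while a plpgsql version presumably builds the result character by character in the interpreter. a rough analogy of the two strategies in Python – these are hypothetical stand-ins, not the actual benchmarked functions:

```python
def reverse_builtin(s: str) -> str:
    # one C-level operation, analogous to perl's scalar reverse
    return s[::-1]

def reverse_loop(s: str) -> str:
    # character-by-character loop, analogous to an interpreted
    # plpgsql implementation built from substr() calls
    out = ""
    for ch in s:
        out = ch + out
    return out

# both produce the same result; the built-in just does far less
# per-character work in the interpreter
assert reverse_builtin("depesz@onet.pl") == reverse_loop("depesz@onet.pl")
```

the select-time numbers above reflect exactly this kind of per-call overhead, multiplied by a million rows.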

8 thoughts on “indexable "field like '%something'"”

  1. Hello,
    I’m the author of the french article. It’s really amazing to see that you had the same idea at the same time 😉
    Thanks for the comparison, it’s very interesting. I’m aware of this speed issue and I’m writing a C function for reverse – it may be worth a benchmark?
    Also, did you test your reverse function while using a multi-byte encoding like UTF-8 ?

  2. @Thomas Reiss:
    actually it was utf-8. on the other hand, the dataset didn’t contain any multi-byte characters (it was all emails).
    i’ll redo the test with utf-8 characters, but i think i’ll wait and run the test with your custom c function as well – or if you do the test, i’ll just link to it.

  3. When I needed to match on email domains, rather than addr like '%@example.com' I used this:

    split_part(addr,'@',2)='example.com'

    You can then create an index like this:

    create index messages_by_addr_domain on messages (split_part(addr,'@',2));

    The main problem that I have had with this is that I may want to include john@support.example.com in my “all @example.com” query. For that, the reversing approach that you have described is better.

    Phil.
