Waiting for 9.1 – Faster LIKE/ILIKE

On 1st of February, Tom Lane committed this patch:

Support LIKE and ILIKE index searches via contrib/pg_trgm indexes.
 
Unlike Btree-based LIKE optimization, this works for non-left-anchored
search patterns.  The effectiveness of the search depends on how many
trigrams can be extracted from the pattern.  (The worst case, with no      
trigrams, degrades to a full-table scan, so this isn't a panacea.  But   
it can be very useful.)                                                 
 
Alexander Korotkov, reviewed by Jan Urbanski

LIKE operations are known to be slow. Very slow.

Let's check an example:

$ \d test
                           Table "public.test"
  Column   |  Type   |                     Modifiers                     
-----------+---------+---------------------------------------------------
 id        | integer | not null default nextval('test_id_seq'::regclass)
 some_text | text    | 
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

A very simple table, with some texts inside:

$ SELECT COUNT(*), MIN(LENGTH(some_text)), MAX(LENGTH(some_text)), SUM(LENGTH(some_text)) FROM test;
  COUNT  | MIN | MAX  |    SUM     
---------+-----+------+------------
 2000000 |  26 | 1090 | 1031552557
(1 ROW)

As you can see there are 2 million rows, with some_text ranging from 26 to 1090 characters – 1,031,552,557 characters in total.
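
For reference – the original data was loaded with COPY from a pre-generated file, but if you want to play with something comparable, a quick (hypothetical) way to fill such a table could be:

$ CREATE TABLE test (id serial, some_text text);
$ INSERT INTO test (some_text)
      SELECT repeat(md5(i::text), 1 + (i % 10))
      FROM generate_series(1, 2000000) i;

The lengths and content won't match my data exactly, but it's good enough for testing.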

Doing LIKE search on it will be slow:

$ EXPLAIN analyze SELECT * FROM test WHERE some_text LIKE '%gec%';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..162909.46 ROWS=200 width=526) (actual TIME=0.196..2640.421 ROWS=7787 loops=1)
   FILTER: (some_text ~~ '%gec%'::text)
 Total runtime: 2641.341 ms
(3 ROWS)

The choice of pattern doesn't change much – it's always ~2-3s.

But now, thanks to this patch, we can do something better/smarter. First we need to load the pg_trgm contrib module/extension.

This can be done using this syntax (which is a new thing, and will be covered in one of the upcoming “Waiting for” posts):

$ CREATE extension pg_trgm;
CREATE EXTENSION
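
(If you're on a pre-9.1 version, the old way of loading the module still works – run the pg_trgm.sql script shipped in contrib; the exact path depends on your installation, it's under the directory that pg_config --sharedir reports:)

$ \i SHAREDIR/contrib/pg_trgm.sql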

Now, with pg_trgm loaded, I can create a trigram (trgm) index on some_text:

$ CREATE INDEX test_trgm_gin ON test USING gin (some_text gin_trgm_ops);
CREATE INDEX

And now, jaw-dropping performance:

$ EXPLAIN analyze SELECT * FROM test WHERE some_text LIKE '%gec%';
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=17.55..797.20 ROWS=200 width=526) (actual TIME=4.933..36.857 ROWS=7787 loops=1)
   Recheck Cond: (some_text ~~ '%gec%'::text)
   ->  Bitmap INDEX Scan ON test_trgm_gin  (cost=0.00..17.50 ROWS=200 width=0) (actual TIME=2.843..2.843 ROWS=7787 loops=1)
         INDEX Cond: (some_text ~~ '%gec%'::text)
 Total runtime: 37.362 ms
(5 ROWS)

It's sooo cool. Right? Well. Kind of.

Selecting is brilliantly fast.

But writing. Not so much.

I created the table with no indexes at all. COPYing the data to the table took 56853.945 ms.

Creating btree index for primary key – 7684.400 ms.

Creating GIN index: 2932366.986 ms.

That's nearly 50 minutes!

Of course – you usually don't add 2 million rows at a time. But still – it means that every insert or update will be slower. By how much? Can't tell – you have to check your data. But just remember – these 50 minutes were the optimal situation: it was not adding to an existing index, it was plain index creation. Changes to an existing index will be more expensive.
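
If you want a rough number for your own data, a simple sanity check is to time the same batch of inserts with and without a trigram index – something along these lines (table and index names are just examples):

$ \timing
$ CREATE TABLE trgm_cost_test (some_text text);
$ INSERT INTO trgm_cost_test SELECT md5(i::text) FROM generate_series(1, 100000) i;
$ CREATE INDEX trgm_cost_test_gin ON trgm_cost_test USING gin (some_text gin_trgm_ops);
$ INSERT INTO trgm_cost_test SELECT md5(i::text) FROM generate_series(1, 100000) i;
$ DROP TABLE trgm_cost_test;

The difference between the two INSERT timings is, roughly, what the index maintenance costs you on this kind of data.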

Still – it's a great thing, and the performance on selects is amazing.
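
One more option, if GIN maintenance cost is a problem for your workload: pg_trgm also provides a GiST opclass (gist_trgm_ops), which is generally cheaper to build and update, but slower to search – I haven't benchmarked it here (index name below is just an example):

$ CREATE INDEX test_trgm_gist ON test USING gist (some_text gist_trgm_ops);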

What's more – ILIKE also uses the trigram index, it's just a bit slower:

$ EXPLAIN analyze SELECT * FROM test WHERE some_text ilike '%gec%';
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=17.55..797.20 ROWS=200 width=526) (actual TIME=5.205..116.651 ROWS=7787 loops=1)
   Recheck Cond: (some_text ~~* '%gec%'::text)
   ->  Bitmap INDEX Scan ON test_trgm_gin  (cost=0.00..17.50 ROWS=200 width=0) (actual TIME=3.047..3.047 ROWS=7787 loops=1)
         INDEX Cond: (some_text ~~* '%gec%'::text)
 Total runtime: 117.261 ms
(5 ROWS)

In my case there is no row count difference, as all my texts are lowercase.
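
One caveat from the commit message is worth repeating: how well this works depends on how many trigrams can be extracted from the pattern. You can check what pg_trgm sees in a given string with its show_trgm() function – a very short or repetitive pattern yields few useful trigrams, and in the worst case the search degrades towards scanning the whole table:

$ SELECT show_trgm('gec');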

12 thoughts on “Waiting for 9.1 – Faster LIKE/ILIKE”

  1. No. GiST is faster to index, but much slower on search. The trade-off is well known, so I didn’t really see the point.

  2. pg_trgm GiST index behaviour strongly depends on the signature size. Currently it is defined in the source by the SIGLENINT macro, and it is 3 (that is, a 96-bit signature). The optimal signature size depends on average string length, language and so on. I believe that 96 bits is not enough for most cases. I'll try to propose a patch to make the signature an index parameter. For now, you can try to find the optimal signature size by recompiling the pg_trgm contrib with another SIGLENINT value.

  3. That’s really cool. How much space does the index take up on disk, compared to the size of the table?

  4. @Isaac:
    I no longer have the original table (it's just test data, dropped as soon as I finish tests), so I had to make a new one.
    Again – 2M rows, with some_text lengths from 32 to 2328. Total length – 2,215,615,582.

    • Table size (without indexes/toast): 2024349696 – 1931 MB
    • Table size (with toast/fsm): 2465382400 – 2351 MB
    • PKEY size (index on int4): 44949504 – 43 MB
    • Trigram GIN index size: 8477753344 – 8085 MB
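
    (For the record, numbers like these can be obtained with queries along these lines – the relation names are from my test setup:)

    SELECT pg_relation_size('test'),        -- heap only
           pg_table_size('test'),           -- heap + TOAST + FSM
           pg_relation_size('test_pkey'),
           pg_relation_size('test_trgm_gin');
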
  5. The result is faster when using LIKE. But it does not get the case-insensitive data. I tried with POSIX, etc. But no luck.

  6. This is a great way of speeding up searches.

    However, what I am getting is: when my search criteria is short – say 3 characters (e.g. gec) – then it is fast,

    but when I increase it to say 7 characters – findthis – this increases the search time to almost 2 minutes.

    Any idea why this would happen?

  7. Great post

    It's been a while since the last comment, but anyway.

    How would you suggest going about it when the search is over several fields? Is it necessary to create an additional column with the concatenation of those fields, or is there a way to create an index suitable for this task?

    Thank you in advance

  8. @SJRM:

    there is no one canned response that will be sensible in all cases. It depends on the use case. You can start by adding an index on the concatenated fields, and go from there.
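
    For example (column and table names below are made up, just to illustrate), one possible starting point is an expression index over the concatenated fields, and then searching on the exact same expression so the planner can use it:

    CREATE INDEX some_table_concat_trgm ON some_table
        USING gin ((coalesce(col_a, '') || ' ' || coalesce(col_b, '')) gin_trgm_ops);

    SELECT * FROM some_table
     WHERE coalesce(col_a, '') || ' ' || coalesce(col_b, '') LIKE '%something%';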

    Also – if you have general questions (as in: not something that I should describe in blogpost, but didn’t, or there is a mistake) – you’ll be better off getting help from irc or mailing lists.

  9. DEPESZ

    Thank you for your answer. I have done your procedure over a table with 16000 records, and the queries after the index creation are on average 30% slower.

    The table has a varchar(1000) field which is the one I’m using to do the search, but the longest string inside the field is 184 characters long.
    Any guess?

  10. @SJRM:

    no idea. I can’t guess without *at least* seeing queries, \d of the tables, and explain analyze output.

    and again: please note that for support it’s better to ask on irc channel and/or mailing lists.

  11. @DEPESZ

    Sorry, my mistake.
    Noted. I will use those channels
    Thank you.
