February 19th, 2011 by depesz

On 1st of February, Tom Lane committed this patch:

Support LIKE and ILIKE index searches via contrib/pg_trgm indexes.
 
Unlike Btree-based LIKE optimization, this works for non-left-anchored
search patterns. The effectiveness of the search depends on how many
trigrams can be extracted from the pattern. (The worst case, with no
trigrams, degrades to a full-table scan, so this isn't a panacea. But
it can be very useful.)
 
Alexander Korotkov, reviewed by Jan Urbanski

LIKE operations are known to be slow. Very slow.

Let's check an example:

$ \d test
                           Table "public.test"
  Column   |  Type   |                     Modifiers
-----------+---------+---------------------------------------------------
 id        | integer | not null default nextval('test_id_seq'::regclass)
 some_text | text    |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

A very simple table, with some text inside:

$ select count(*), min(length(some_text)), max(length(some_text)), sum(length(some_text)) from test;
  count  | min | max  |    sum
---------+-----+------+------------
 2000000 |  26 | 1090 | 1031552557
(1 row)

As you can see, there are 2 million rows, with some_text ranging from 26 to 1090 characters, for a total of 1,031,552,557 characters.
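
I didn't keep the script that generated this data, but if you want a similar playground, something along these lines will do. This is just a sketch – the translate()/md5() trick (mapping hex digits to the letters g-p, so the strings can actually match '%gec%') is my illustration, not necessarily how the original data was made:

$ create table test (id serial, some_text text);
CREATE TABLE
$ insert into test (some_text)
      select repeat(
                 -- random 32-character string over the letters a-p
                 translate(md5(random()::text), '0123456789', 'ghijklmnop'),
                 -- repeated 1 to 34 times, giving lengths from 32 to 1088
                 1 + (random() * 33)::int
             )
      from generate_series(1, 2000000);
INSERT 0 2000000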

Doing LIKE search on it will be slow:

$ explain analyze select * from test where some_text like '%gec%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..162909.46 rows=200 width=526) (actual time=0.196..2640.421 rows=7787 loops=1)
Filter: (some_text ~~ '%gec%'::text)
Total runtime: 2641.341 ms
(3 rows)

The timing doesn't change much between runs – it's always around 2-3 seconds.

But now, thanks to this patch, we can do something better/smarter. First, we need to load the pg_trgm contrib/extension.

This can be done using the following syntax (which is a new thing, and will be covered in one of the upcoming “Waiting for” posts):

$ create extension pg_trgm;
CREATE EXTENSION
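
While we're at it – pg_trgm also ships a show_trgm() function, which is handy for checking which trigrams a given string (and thus a search pattern) breaks into. It shows something like this (the leading spaces come from the padding pg_trgm adds before extracting trigrams):

$ select show_trgm('gec');
        show_trgm
--------------------------
 {"  g"," ge","ec ",gec}
(1 row)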

Now, with pg_trgm loaded, I can create a trigram index on some_text:

$ create index test_trgm_gin on test using gin (some_text gin_trgm_ops);
CREATE INDEX

And now, jaw-dropping performance:

$ explain analyze select * from test where some_text like '%gec%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=17.55..797.20 rows=200 width=526) (actual time=4.933..36.857 rows=7787 loops=1)
Recheck Cond: (some_text ~~ '%gec%'::text)
-> Bitmap Index Scan on test_trgm_gin (cost=0.00..17.50 rows=200 width=0) (actual time=2.843..2.843 rows=7787 loops=1)
Index Cond: (some_text ~~ '%gec%'::text)
Total runtime: 37.362 ms
(5 rows)

It's sooo cool. Right? Well. Kind of.

Selecting is brilliantly fast.

But writing? Not so much.

I created the table with no indexes at all. COPYing the data into the table took 56853.945 ms.

Creating the btree index for the primary key – 7684.400 ms.

Creating GIN index: 2932366.986 ms.

That's nearly 50 minutes!
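
The index is also large on disk – exact numbers for a similar table are in the comments below. Checking this on your own setup is a one-liner with the built-in size functions:

$ select pg_size_pretty(pg_relation_size('test')) as table_size,
         pg_size_pretty(pg_relation_size('test_trgm_gin')) as index_size;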

Of course – you usually don't add 2 million rows at a time. But still – it means that every insert or update will be slower. By how much? Can't tell – you have to check with your own data. But remember – this 50 minutes was the optimal situation: it was not adding to an existing index, it was plain index creation. Changes to an existing index will be more expensive.
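
A simple way to gauge that overhead on your own data is psql's \timing – compare single inserts with and without the index. A trivial sketch (the sample value is arbitrary, and the rollback leaves the table and index exactly as they were):

$ \timing
Timing is on.
$ begin;
$ insert into test (some_text) values (repeat('some example text ', 20)); -- with test_trgm_gin in place
$ drop index test_trgm_gin;
$ insert into test (some_text) values (repeat('some example text ', 20)); -- and without it
$ rollback;

For bulk loads it's also worth knowing about GIN's fastupdate storage parameter (on by default since 8.4), which trades some search speed for cheaper inserts.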

Still – it's a great thing, and the performance on selects is amazing.
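
One more option, which also comes up in the comments below: pg_trgm provides a GiST operator class as well. A GiST trigram index builds much faster, but searches are slower. The incantation is the same, just with a different opclass (the index name is, of course, up to you):

$ create index test_trgm_gist on test using gist (some_text gist_trgm_ops);
CREATE INDEX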

What's more – ILIKE also uses these indexes, just a bit more slowly:

$ explain analyze select * from test where some_text ilike '%gec%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=17.55..797.20 rows=200 width=526) (actual time=5.205..116.651 rows=7787 loops=1)
Recheck Cond: (some_text ~~* '%gec%'::text)
-> Bitmap Index Scan on test_trgm_gin (cost=0.00..17.50 rows=200 width=0) (actual time=3.047..3.047 rows=7787 loops=1)
Index Cond: (some_text ~~* '%gec%'::text)
Total runtime: 117.261 ms
(5 rows)

In my case there is no difference in row count, as all my texts are lowercase.
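
Just to demonstrate that the case-insensitivity itself works, even though my data can't show it:

$ select 'ABCGECXYZ' like '%gec%', 'ABCGECXYZ' ilike '%gec%';
 ?column? | ?column?
----------+----------
 f        | t
(1 row)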

6 comments:

  1. Did you try using a GIST index instead of a GIN one?

  2. Feb 22, 2011

    No. GiST is faster to index, but much slower on search. The trade-off is well known, so I didn’t really see the point.

  3. Feb 22, 2011

    pg_trgm GiST index behaviour strongly depends on signature size. Currently it is defined in the source by the SIGLENINT macro, and it is 3 (that is, a 96-bit signature). Optimal signature size depends on average string length, language and so on. I believe that 96 bits is not enough in most cases. I'll try to propose a patch to make the signature size an index parameter. For now, you can try to find the optimal signature size by recompiling the pg_trgm contrib with a different SIGLENINT value.

  4. # Isaac
    Feb 23, 2011

    That’s really cool. How much space does the index take up on disk, compared to the size of the table?

  5. Feb 24, 2011

    @Isaac:
    I no longer have the original table (it’s just test data, dropped as soon as I finish tests). So I had to make new one.
    Again – 2M rows, with some_text lengths from 32 to 2328. Summarized length – 2,215,615,582.

    • Table size (without indexes/toast): 2024349696 bytes – 1931 MB
    • Table size (with toast/fsm): 2465382400 bytes – 2351 MB
    • PKEY size (index on int4): 44949504 bytes – 43 MB
    • Trigram GIN index size: 8477753344 bytes – 8085 MB
  6. # Radha Krishna
    Jun 28, 2013

    The result is faster when using LIKE. But it does not return case-insensitive matches. I tried with POSIX, etc. But no luck.
