Waiting for 9.6 – Phrase full text search.

On 7th of April, Teodor Sigaev committed patch:

Phrase full text search.
 
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
  should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
  readable
 
Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov

At first I didn't quite understand what this is about, so figured the best way would be to test it.

Generated list of 20 unique 5-letter words:

  • blond
  • codes
  • conga
  • ducks
  • flock
  • fondu
  • fuzes
  • heros
  • honey
  • laugh
  • leapt
  • lined
  • merit
  • mixes
  • moire
  • pined
  • pulse
  • rowed
  • waxed
  • zoned

Then, generated dataset of 1000 rows, each having from 5 to 25 words from this set (duplicates were possible), using this simple ruby script:

words=%w{ blond codes conga ducks flock fondu fuzes heros honey laugh leapt lined merit mixes moire pined pulse rowed waxed zoned }
1000.times do
  (rand() * 21 + 5).to_i.times do |i|
    printf " " if i > 0
    printf "%s", words.sample
  end
  printf "\n"
end

This data was then loaded into very simple table:

$ CREATE TABLE test (id serial PRIMARY KEY, sample_text text);

Now, let's find some messages there about “waxed ducks" (don't ask):

$ SELECT * FROM test WHERE sample_text @@ to_tsquery('waxed & ducks') LIMIT 5;
 id |                                                                   sample_text                                                                   
----+-------------------------------------------------------------------------------------------------------------------------------------------------
  7 | codes merit flock mixes lined laugh zoned mixes honey leapt blond moire moire moire pulse zoned mixes ducks waxed heros pulse ducks rowed
  9 | lined pulse zoned flock codes merit flock waxed leapt laugh flock honey lined lined pulse flock leapt pined laugh ducks flock conga
 15 | merit heros merit fondu ducks conga heros laugh moire honey pulse codes waxed ducks
 16 | merit leapt codes waxed leapt honey heros fuzes ducks lined conga flock fondu laugh waxed rowed fondu merit conga heros moire mixes waxed zoned
 22 | heros rowed honey zoned waxed honey mixes ducks flock heros merit heros codes pined mixes pined heros
(5 ROWS)

Now, this worked, and returned me all rows, but I'd like to search for messages that have the phrase “waxed ducks" in them, and not things like “waxed heros ducks".

This is where this new operator comes in:

$ SELECT * FROM test WHERE sample_text @@ to_tsquery(' waxed <-> ducks ') LIMIT 5;
 id  |                                                    sample_text                                                    
-----+-------------------------------------------------------------------------------------------------------------------
  15 | merit heros merit fondu ducks conga heros laugh moire honey pulse codes waxed ducks
  45 | rowed waxed rowed waxed ducks merit moire waxed codes fuzes honey waxed conga waxed pined rowed ducks
  47 | ducks codes pulse waxed fondu codes pulse pined waxed honey laugh lined laugh waxed ducks leapt lined leapt pulse
  51 | pined waxed ducks fondu rowed rowed waxed laugh ducks waxed
 110 | codes pined heros mixes fondu pulse merit zoned waxed rowed heros zoned moire waxed ducks heros
(5 ROWS)

We can also allow some distance between “waxed" and “ducks", using the more general way:

$ SELECT * FROM test WHERE sample_text @@ to_tsquery(' waxed <2> ducks ') LIMIT 5;
 id |                                                             sample_text                                                             
----+-------------------------------------------------------------------------------------------------------------------------------------
 15 | merit heros merit fondu ducks conga heros laugh moire honey pulse codes waxed ducks
 27 | conga codes ducks conga moire pulse mixes conga leapt honey fuzes pulse waxed heros ducks fondu laugh pulse rowed waxed
 31 | laugh moire fuzes flock blond fondu flock waxed pined ducks leapt pined zoned pined blond ducks waxed pulse laugh leapt leapt rowed
 45 | rowed waxed rowed waxed ducks merit moire waxed codes fuzes honey waxed conga waxed pined rowed ducks
 47 | ducks codes pulse waxed fondu codes pulse pined waxed honey laugh lined laugh waxed ducks leapt lined leapt pulse
(5 ROWS)

In here, I wanted to find rows that have words waxed and ducks separated by at most 1 other word.

Basically – this operator allows you to do phrase searches. And to make it even easier, there is new function phraseto_tsquery, which can be used like this:

$ SELECT phraseto_tsquery('PostgreSQL allows searching for waxed ducks.');
                            phraseto_tsquery                            
------------------------------------------------------------------------
 ( ( ( 'postgresql' <-> 'allow' ) <-> 'search' ) <2> 'wax' ) <-> 'duck'
(1 ROW)

This removed stopwords (as defined by default configuration), and used the distance operator to encode the phrase. Sweet. And how about speed?

Well, I generated more rows (1 million of them), and searched for a bit more specific phrase “pulse waxed ducks".

To search I used:

  • like operator: sample_text like ‘%pulse waxed ducks%'
  • tsearch for finding rows with all words, and then like to find the phrases: sample_text @@ to_tsquery(‘pulse & waxed & ducks') and sample_text like ‘%pulse waxed ducks%'
  • tsearch for finding phrase: sample_text @@ to_tsquery(‘( pulse <-> waxed ) <-> ducks')

First, as sanity check, I verified that all of them return the same number of rows:

$ SELECT COUNT(*) FROM test WHERE sample_text LIKE '%pulse waxed ducks%';
 COUNT 
-------
  1637
(1 ROW)
 
$ SELECT COUNT(*) FROM test WHERE sample_text @@ to_tsquery('pulse & waxed & ducks') AND sample_text LIKE '%pulse waxed ducks%';
 COUNT 
-------
  1637
(1 ROW)
 
$ SELECT COUNT(*) FROM test WHERE sample_text @@ to_tsquery('( pulse <-> waxed ) <-> ducks');
 COUNT 
-------
  1637
(1 ROW)

Then, I made gin index:

$ CREATE INDEX test_idx ON test USING gin (to_tsvector('english', sample_text));

and then I ran (modified, to include to_tsvector() call) queries, 3 times, and got best time. Results were interesting, to put it lightly:

$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE to_tsvector('english', sample_text) @@ to_tsquery('pulse & waxed & ducks') AND sample_text LIKE '%pulse waxed ducks%';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=436.77..436.78 ROWS=1 width=8) (actual TIME=139.012..139.013 ROWS=1 loops=1)
   ->  Bitmap Heap Scan ON test  (cost=432.25..436.77 ROWS=1 width=0) (actual TIME=68.765..138.907 ROWS=1637 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, sample_text) @@ to_tsquery('pulse & waxed & ducks'::text))
         FILTER: (sample_text ~~ '%pulse waxed ducks%'::text)
         ROWS Removed BY FILTER: 155838
         Heap Blocks: exact=15483
         ->  Bitmap INDEX Scan ON test_idx  (cost=0.00..432.25 ROWS=1 width=0) (actual TIME=66.779..66.779 ROWS=157475 loops=1)
               INDEX Cond: (to_tsvector('english'::regconfig, sample_text) @@ to_tsquery('pulse & waxed & ducks'::text))
 Planning TIME: 0.114 ms
 Execution TIME: 139.101 ms
(10 ROWS)
$ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE to_tsvector('english', sample_text) @@ to_tsquery('( pulse <-> waxed ) <-> ducks');
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=436.77..436.78 ROWS=1 width=8) (actual TIME=3545.417..3545.417 ROWS=1 loops=1)
   ->  Bitmap Heap Scan ON test  (cost=432.25..436.76 ROWS=1 width=0) (actual TIME=76.470..3545.217 ROWS=1637 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, sample_text) @@ to_tsquery('( pulse <-> waxed ) <-> ducks'::text))
         ROWS Removed BY INDEX Recheck: 155838
         Heap Blocks: exact=15483
         ->  Bitmap INDEX Scan ON test_idx  (cost=0.00..432.25 ROWS=1 width=0) (actual TIME=68.282..68.282 ROWS=157475 loops=1)
               INDEX Cond: (to_tsvector('english'::regconfig, sample_text) @@ to_tsquery('( pulse <-> waxed ) <-> ducks'::text))
 Planning TIME: 0.070 ms
 Execution TIME: 3545.495 ms
(9 ROWS)

It looks that in both cases index was used to find all rows that contained “pulse", “waxed" and “ducks", but filtering via like was much faster.

Not sure if I did something wrong, or is it just because it's new, unoptimized code, but it doesn't look all that great (for now).

All in all – it's great functionality, but I think we should get some better performance out of it (or explanation what I did wrong, which definitely could have happened).

12 thoughts on “Waiting for 9.6 – Phrase full text search.”

  1. When you use functional index and pgsql does a recheck after index scan, pgsql needs to reconstruct tsvector from the scratch (with a help of to_tsquery). This is a reason why phrase query in your exampl is so slow. Add separate tsvector column to table and rerun tests – it will change.

  2. @Theodor:

    Thanks for hint. Retested. Didn’t have previous data set, but generated new one, which had 1672 rows for pulse waxed ducks.

    Added tsvector column:

    $ ALTER TABLE test ADD COLUMN sample_tsvector tsvector;
    $ UPDATE test SET sample_tsvector = to_tsvector('english', sample_text);
    $ cluster test USING test_pkey;
    $ ALTER TABLE test SET WITHOUT cluster;
    $ $ CREATE INDEX gin_idx ON test USING gin (sample_tsvector);

    (cluster-related commands to remove update-bloat).

    Then ran the explain analyze again, using:

    $ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE sample_text LIKE '%pulse waxed ducks%';
    $ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE sample_tsvector @@ to_tsquery('pulse & waxed & ducks') AND sample_text LIKE '%pulse waxed ducks%';
    $ EXPLAIN analyze SELECT COUNT(*) FROM test WHERE sample_tsvector @@ to_tsquery('( pulse <-> waxed ) <-> ducks');

    Times I got, were:

    • like – 249ms
    • tsearch + like – 199ms
    • phrase tsearch – 741ms

    Doesn’t looks better than with functional index, but still not so great, and phrase tsearch looks slower than seq scan with like ?

  3. Seems, the bottleneck here is a re-evaluating of to_tsquery() on recheck clause. to_tsquery(text) is marked as stable, while to_tsquery(regconfig, text) as immutable. Try this:
    explain analyze select count(*) from test where sample_tsvector @@ to_tsquery(‘english’, ‘( pulse waxed ) ducks’)

    i.e. point fts configuration in to_tsquery call.

  4. @Teodor:

    OK, Redid, with:

    • explain analyze select count(*) from test where sample_text like ‘%pulse waxed ducks%’; – 258ms
    • explain analyze select count(*) from test where sample_tsvector @@ to_tsquery(‘english’, ‘pulse & waxed & ducks’) and sample_text like ‘%pulse waxed ducks%’; – 202ms
    • explain analyze select count(*) from test where sample_tsvector @@ to_tsquery(‘english’, ‘( pulse <-> waxed ) <-> ducks’); – 197ms

    So it looks like it’s faster, but not much faster. It could be because dataset is small. Will test with larger dataset…

  5. So, to followup – made another table. with 100 million rows, each containing 10-50 5 letter words, from list of 100 such words.

    For testing I picked phrase “weird camel froze wolfs”, which matched 31 rows.

    Like approach (plain like, with no tsearch) took 334 seconds.

    Normal tsearch + like for filtering for phrase took ~ 7739ms.

    Phrase tsearch took: 7306ms.

    So, while tsearch usage is great, it doesn’t look like phrase tsearch bought us much in terms of performance.

    For sanity check, the queries were:

    • explain analyze select count(*) from test where sample_tsvector @@ to_tsquery(‘english’, ‘weird & camel & froze & wolfs’) and sample_text like ‘%weird camel froze wolfs%’;
    • explain analyze select count(*) from test where sample_tsvector @@ to_tsquery(‘english’, ‘((weird <-> camel) <-> froze) <-> wolfs’);
  6. Phrase search is about functionality, it doesn’t involved any index support. Currently, it’s looks like syntactic sugar over filtering. But, phrase search could be much faster if index supports position filtering. The new index rum, created as extension (thanks to CREATE ACCESS METHOD and GENERIC WAL features by Alexander Korotkov), will have positions and could support phrase search effectively (as well as ranked output). Hope to show it in PGCon.

  7. You need the combined features of pg_trgm & fts for phrase search. pg_trgm index gin{,gist} gin{,gist}_trgm_ops can optimize like/ilike/regex queries:

    create index gist_trgm_idx on test using gist (sample_text gist_trgm_ops);
    analyze test;

    like/ilike:
    explain analyze select count(*) from test where sample_text like ‘%pulse waxed ducks%’;

    and regex

    explain analyze select count(*) from test where sample_text ~ ‘pulse waxed ducks’;

    You also may add word boundary checks for the regex query and play with pg_trgm.word_similarity_threshold GUC. 9.6 rocks!

  8. @Yayix:

    no, you don’t.

    What you showed is available for quite some time thanks to pg_Trgm, and is completely separate from fts, and its’s phrase search addition in 9.6.

  9. If you add a trigram index does the query below improve to lower than 199ms?

    select count(*) from test where sample_text like ‘%pulse waxed ducks%’;

  10. @Philip: most likely yes. I’m not really inclined into researching it now – you can probably do it on your own if you haver 15 minutes of spare time, because this was not the point of this blog post.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.