How to index data with tsearch?

Someone asked recently on Slack, whether one should build tsvector data in a field, and index it, or make index on expression.

We talked about it for a while, and I thought I'll reformat my thoughts into blogpost to avoid retyping it next time.

For starters – I don't like having tsvector column. It doesn't really serve purpose. Sure – you can put index on it, but you can also put index on to_tsvector() expression.

It is just there, using up disk space.

But, with indexes on expression situation gets complicated.

Let's assume we have table:

CREATE TABLE posts (
    id int8 generated always AS IDENTITY PRIMARY KEY,
    title text,
    headline text,
    main_body text
);

I can make an index like this:

CREATE INDEX fts_index ON posts USING gin (
    to_tsvector( 'english'::regconfig, title || headline || main_body )
);

But it will cause false positives. For example, if title will end with use, and headline will start with fully then tsvector will contain element for word usefully, even when it wasn't really there.

Trivial to fix:

CREATE INDEX fts_index ON posts USING gin (
    to_tsvector( 'english'::regconfig, title || ' ' || headline || ' ' || main_body )
);

Done. But is it? What is headline is null? Anything concatenated with null becomes null. Of course we can use coalesce:

CREATE INDEX fts_index ON posts USING gin (
    to_tsvector( 'english'::regconfig, COALESCE(title, '') || ' ' || COALESCE(headline, '') || ' ' || COALESCE(main_body, '') )
);

and it will work. The problem, though, is that it isn't really nice because to use index, query has to repeat the exact expression.

So, every query now will have to repeat to_tsvector( ‘english'::regconfig, coalesce(title, ") || ‘ ‘ || coalesce(headline, ") || ‘ ‘ || coalesce(main_body, ") ). And just imagine the fun of changing/typing queries if one would choose to use things like search weights, and the expression becomes longer and longer.

Suddenly keeping all the logic in trigger and updating a field, and then using the field to search seems sane. But perhaps there is another way?

There is. I can make a function, let's call it fts, that will take one argument, and this argument is whole row from posts table. And based on this row it will return tsvector:

CREATE FUNCTION fts( IN p_post posts )
    RETURNS tsvector
    LANGUAGE SQL
    immutable
    AS $$
SELECT to_tsvector( 'english'::regconfig, COALESCE(p_post.title, '') || ' ' || COALESCE(p_post.headline, '') || ' ' || COALESCE(p_post.main_body, '') )
$$;

With this in place I can:

CREATE INDEX fts_index ON posts USING gin ( fts(posts) );

And it will work well:

$ \d posts
                           TABLE "public.posts"
  COLUMNTYPE  │ Collation │ NULLABLEDEFAULT            
───────────┼────────┼───────────┼──────────┼──────────────────────────────
 id        │ BIGINT │           │ NOT NULL │ generated always AS IDENTITY
 title     │ text   │           │          │ 
 headline  │ text   │           │          │ 
 main_body │ text   │           │          │ 
Indexes:
    "posts_pkey" PRIMARY KEY, btree (id)
    "fts_index" gin (fts(posts.*))
 
$ EXPLAIN SELECT * FROM posts p WHERE fts(p) @@ plainto_tsquery('depesz');
                                                                                                        QUERY PLAN                                                                                                        
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Bitmap Heap Scan ON posts p  (cost=63.00..17914.45 ROWS=5000 width=78)
   Recheck Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
   ->  Bitmap INDEX Scan ON fts_index  (cost=0.00..61.75 ROWS=5000 width=0)
         INDEX Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
(4 ROWS)

Please note scan on fts_index.

There is also one cool thing. It's rather obscure, but will help nicely in this case.

Generally – if you have function that takes single argument, and this argument is typed to a table (technically to any composite type, but it's not important for now), you can use this as virtual field in queries.

This syntax is documented but finding it is not easy. If you want to read more on it, go to: docs on function calls and read the “Note" box, and then follow link to docs on “Using Composite Types in Queries".

You can see it used in here:

$ EXPLAIN SELECT p.* FROM posts p WHERE p.fts @@ plainto_tsquery('depesz');
                                                                                                        QUERY PLAN                                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON posts p  (cost=63.00..17911.17 ROWS=4999 width=469)
   Recheck Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
   ->  Bitmap INDEX Scan ON fts_index  (cost=0.00..61.75 ROWS=4999 width=0)
         INDEX Cond: (to_tsvector('english'::regconfig, ((((COALESCE(title, ''::text) || ' '::text) || COALESCE(headline, ''::text)) || ' '::text) || COALESCE(main_body, ''::text))) @@ plainto_tsquery('depesz'::text))
(4 ROWS)

Please note I used syntax: p.fts @@ and not fts(p) @@.

You might want to use this feature, or just stick with normal call (fts(p)) – either way, it's way simpler to write in all queries, and allows to easily “hide" complexities of building tsvector from day-to-day queries.

Of course, if you'd choose to go that way, and, after some time, you'd decide that you want to change fts() function definition (for example, add another field, or weights) – you will need to reindex index based on this, because index will not be aware that function definition has changed.

Please note that you can have many functions with the same name, that are differentiated based on argument type(s). This means that if you'd need to add tsearch indexes on more tables, you can add more fts() function, each with different argument type, and reuse name “fts" everywhere (or however you'd call these functions).

One day we might get generated columns, based on expressions, that do not need to be stored, which will be even nicer solution to the problem, but for now – simple function and function-based index seems (to me) to be cleanest solution.

Hope it will help someone 🙂

2 thoughts on “How to index data with tsearch?”

  1. well, no, because concat_ws is “stable”, and not “immutable” – so you can’t make index on it.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.