Do you really need a tsvector column?

When using tsearch one usually creates a tsvector column to put the data in, and then creates an index on it.

But, do you really need the column? I wrote once already that you don't have to, but then a person talked with me on IRC, and pointed me to this section of the docs:

One advantage of the separate-column approach over an expression index … Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches.

So, let's see how big of a problem it really is.

To test the case I downloaded a data dump from Wikipedia, and loaded it into a very simple table:

CREATE TABLE wiki_docs (
    id INT8 generated always as identity PRIMARY KEY,
    title TEXT,
    body TEXT
);

The loaded data was not parsed in any way, so the body actually contained the full XML dump of each article, like this:

=$ select * from wiki_docs where id = 1;
 id |        title        |                                                                         body                                                                          
----+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | AccessibleComputing |                                                                                                                                                      +
    |                     |     <title>AccessibleComputing</title>                                                                                                               +
    |                     |     <ns>0</ns>                                                                                                                                       +
    |                     |     <id>10</id>                                                                                                                                      +
    |                     |     <redirect title="Computer accessibility" />                                                                                                      +
    |                     |     <revision>                                                                                                                                       +
    |                     |       <id>1219062925</id>                                                                                                                            +
    |                     |       <parentid>1219062840</parentid>                                                                                                                +
    |                     |       <timestamp>2024-04-15T14:38:04Z</timestamp>                                                                                                    +
    |                     |       <contributor>                                                                                                                                  +
    |                     |         <username>Asparagusus</username>                                                                                                             +
    |                     |         <id>43603280</id>                                                                                                                            +
    |                     |       </contributor>                                                                                                                                 +
    |                     |       <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>+
    |                     |       <origin>1219062925</origin>                                                                                                                    +
    |                     |       <model>wikitext</model>                                                                                                                        +
    |                     |       <format>text/x-wiki</format>                                                                                                                   +
    |                     |       <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]                             +
    |                     |                                                                                                                                                      +
    |                     | {{rcat shell|                                                                                                                                        +
    |                     | {{R from move}}                                                                                                                                      +
    |                     | {{R from CamelCase}}                                                                                                                                 +
    |                     | {{R unprintworthy}}                                                                                                                                  +
    |                     | }}</text>                                                                                                                                            +
    |                     |       <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>                                                                                                   +
    |                     |     </revision>
(1 row)

Total size of the data is:

  • Number of rows: 25,083,768
  • Size of table, including TOAST, but without index(es): 60GB
  • Minimal length of body: 522 bytes
  • Maximal length of body: 879 kB (I purposely removed some rows from the input data as these caused problems with tsearch)
  • Total length of all article bodies: 102 GB
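
In case you'd like to reproduce these numbers, they can be obtained with a single aggregate query, something along these lines (pg_table_size includes TOAST, but not indexes):

=$ select
    count(*) as number_of_rows,
    pg_size_pretty( pg_table_size('wiki_docs') ) as table_size,
    min( octet_length(body) ) as min_body_bytes,
    max( octet_length(body) ) as max_body_bytes,
    pg_size_pretty( sum( octet_length(body) ) ) as total_body_size
from wiki_docs;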

With the data loaded into the table, I made another table (to avoid updating the existing one and causing bloat) using:

=$ create table wiki_docs_with_tsvector as
select
    id,
    title,
    body,
    setweight( to_tsvector('english', title), 'A') ||
    setweight( to_tsvector('english', body), 'B') as tsv
from
    wiki_docs;

This took “a while”.

In the process I got a bunch of NOTICE: word is too long to be indexed warnings, but since it doesn't really matter here, they can be safely ignored.

It's important to note that afterwards the size of the table increased – from 60 GB to 142 GB. This is because the tsvector data was materialized as a column in the table. This is interesting, as it more than doubled the size of the table!

Then it's time for indexes. On wiki_docs_with_tsvector the index was simple:

=$ create index ts_with_tsv on wiki_docs_with_tsvector using gist (tsv);

On the table without the tsvector I could have made an index on the expression with the setweight calls, but instead I opted to base it on a function, so it would be easier to write queries:

=$ create function get_tsvector( IN the_row wiki_docs ) returns tsvector as $$
select
    setweight( to_tsvector('english', the_row.title), 'A') ||
    setweight( to_tsvector('english', the_row.body), 'B');
$$ language sql immutable;  -- immutable is required for use in an index expression
CREATE FUNCTION

With this, I can simply:

=$ create index the_magic on wiki_docs using gist( get_tsvector(wiki_docs) );

Interestingly, the indexes were rather small – 5448 MB each.
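
If you want to verify the index sizes yourself, pg_relation_size can be used, something like:

=$ select pg_size_pretty( pg_relation_size('ts_with_tsv') ),
          pg_size_pretty( pg_relation_size('the_magic') );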

Now, let's see how queries behave. Let's start with something that matches a small number of rows – 77, to be exact – using the 'algorithm & digger' tsquery:

=$ select * from wiki_docs_with_tsvector where tsv @@ to_tsquery('english', 'algorithm & digger');

I ran each query three times, and picked the fastest run. First, the one with the tsvector materialized as a column in the table, and an index on it: full explain analyze, and its time: ~ 4.5 minutes.

Interestingly, this query had to touch (read or hit) a total of 7,122,789 buffers, which is ~ 54 GB of data.
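
The buffer numbers come from running explain with the buffers option enabled, something like:

=$ explain (analyze, buffers)
    select * from wiki_docs_with_tsvector where tsv @@ to_tsquery('english', 'algorithm & digger');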

Now, for the query that uses the index on the function:

=$ select * from wiki_docs x where get_tsvector(x) @@ to_tsquery('english', 'algorithm & digger');

The result is … terrible. Almost an hour of runtime! I guess it's all in the “regenerate tsvector from the function for index recheck” work. I can also try a query that selects more rows, which should make the problem even worse.

This time I searched for just algorithm, which returned ~ 44k rows, with the index recheck removing 1.7 million rows.
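
That is, the same queries as before, just with the simpler tsquery:

=$ select * from wiki_docs_with_tsvector where tsv @@ to_tsquery('english', 'algorithm');

=$ select * from wiki_docs x where get_tsvector(x) @@ to_tsquery('english', 'algorithm');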

With the GiST index on the tsvector column: explain, with time: ~ 11.5 minutes.

With the GiST index on the function: explain, with time: ~ 1.5 hours.

OK. So, clearly – if you're using GiST indexes, you should have a tsvector column. At the very least if you have data sizes similar to what I tested on.

On the other hand, I'd argue that 5 minutes (time to search for algorithm & digger) is too slow. So, let's see how GIN behaves.
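
For GIN, the index definitions are analogous to the GiST ones – just a different access method (the index names below are only examples):

=$ create index ts_with_tsv_gin on wiki_docs_with_tsvector using gin (tsv);

=$ create index the_magic_gin on wiki_docs using gin ( get_tsvector(wiki_docs) );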

As with GiST, let's first check with a search that returns only a few rows.

Explain of the query that uses the tsvector column shows a pretty respectable time of 0.984 ms.

How about the GIN index on the function? Execution time was 1.406 ms, with this plan.

This is a ~ 40% worse result. But please note one very important factor: the total table size, including the GIN index, of the table without the materialized tsvector column is 108 GB, while the table with the tsvector materialized as a column is 189 GB.
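
These totals can be checked with pg_total_relation_size, which counts the table, its TOAST, and all its indexes:

=$ select pg_size_pretty( pg_total_relation_size('wiki_docs') ) as without_tsv_column,
          pg_size_pretty( pg_total_relation_size('wiki_docs_with_tsvector') ) as with_tsv_column;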

This means, for me, that in the case of random access across various searches, GIN on a function will actually be faster, as Pg/OS can cache more of the index/table data in the same amount of memory.

Anyway, one last test: what about a query returning more rows?

Materialized column: explain shows a pretty good time of 67.564 ms.

And with the index on the function: explain shows a worse time: 114.025 ms – almost 70% slower.

Let's summarize the times and IO numbers in a single table:

 Index type | Rowcount       | tsvector as column: time | tsvector as column: buffers | tsvector as function: time | tsvector as function: buffers
------------+----------------+--------------------------+-----------------------------+----------------------------+-------------------------------
 GiST       | small (77)     |           268,132.528 ms |                   7,122,789 |           3,462,018.503 ms |                     5,564,861
 GiST       | large (44,543) |           695,815.255 ms |                  15,095,225 |           5,494,796.006 ms |                    11,506,126
 GIN        | small (77)     |                 0.984 ms |                         127 |                   1.406 ms |                           126
 GIN        | large (44,543) |                67.564 ms |                      42,136 |                 114.025 ms |                        41,476

While the number of blocks touched by the function-based queries isn't very different, please note that this is simply because all the tables used had literally zero updates or deletes. All of them were built 100% clean, with inserts of ready data.

In real apps I would expect some updates and/or deletes, which would make the table pages less densely packed, so the IO overhead of the materialized column would be, in my opinion, larger.

This, plus the maintenance overhead (larger tables mean longer vacuum, analyze, backup, dump, setup of replicas, and so on), makes me stand by my position that it doesn't make sense to use materialized tsvector columns. At the very least not if you're using GIN indexes, and your tsearch-indexed data comes from a single table.

As for how it would behave in your case – well, you'd have to try. With some realistic insert-to-(delete/update) ratios, some bloat, and potentially other traffic to the database.

Hope you found that interesting/helpful 🙂
