Waiting for 9.6 – Tsvector editing functions

On 11th of March, Teodor Sigaev committed patch:

Tsvector editing functions
 
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights
 
Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev

For those that don't know tsvector is datatype used by tsearch, which in turn is PostgreSQL's full text search engine.

Basically, whenever you index data for use for tsearch, you use (sometimes explicitly, sometimes not) tsvector values.

Which looks like this:

$ SELECT to_tsvector('english', 'Mandatory arguments to long options are mandatory for short options too.');
                          to_tsvector                          
---------------------------------------------------------------
 'argument':2 'long':4 'mandatori':1,7 'option':5,10 'short':9
(1 ROW)

As you can see some words have been removed (“to", “are", and “too"), and the rest were brought back to simplest form (mandatory => mandatori).

To make future examples smaller, let's store this tsvector in test table:

$ CREATE TABLE test AS SELECT to_tsvector('english', 'Mandatory arguments to long options are mandatory for short options too.') AS t;
SELECT 1
 
$ \d test
      TABLE "public.test"
 COLUMN |   TYPE   | Modifiers 
--------+----------+-----------
 t      | tsvector | 
 
$ SELECT * FROM test;
                               t                               
---------------------------------------------------------------
 'argument':2 'long':4 'mandatori':1,7 'option':5,10 'short':9
(1 ROW)

Now, let's see the new functions.

First of them is setweight. We could have used it before, like:

$ SELECT setweight(t, 'A') FROM test;
                              setweight                               
----------------------------------------------------------------------
 'argument':2A 'long':4A 'mandatori':1A,7A 'option':5A,10A 'short':9A
(1 ROW)

Which did set weight for all words in this tsvector (well, not words, lexemes).

Now, we can do it, by word:

$ SELECT setweight(t, 'A', '{mandatory,long,short}') FROM test;
                            setweight                            
-----------------------------------------------------------------
 'argument':2 'long':4A 'mandatori':1,7 'option':5,10 'short':9A
(1 ROW)

Unfortunately, to make it work for mandatori, I'd have to provide it as it was vectorized:

$ SELECT setweight(t, 'A', '{mandatori,long,short}') FROM test;
                             setweight                             
-------------------------------------------------------------------
 'argument':2 'long':4A 'mandatori':1A,7A 'option':5,10 'short':9A
(1 ROW)

Next function is delete(), which can be used to remove lexemes from tsvector. Like:

$ SELECT DELETE(t, 'short') FROM test;
                       DELETE                        
-----------------------------------------------------
 'argument':2 'long':4 'mandatori':1,7 'option':5,10
(1 ROW)

or even multiple:

$ SELECT DELETE(t, '{long,short}'::text[]) FROM test;
                   DELETE                   
--------------------------------------------
 'argument':2 'mandatori':1,7 'option':5,10
(1 ROW)

There is also unnest function which allows you to convert tsvector to set of rows with all the information:

SELECT (unnest(t)).* FROM test;
  lexeme   | positions | weights 
-----------+-----------+---------
 argument  | {2}       | {D}
 long      | {4}       | {D}
 mandatori | {1,7}     | {D,D}
 OPTION    | {5,10}    | {D,D}
 short     | {9}       | {D}
(5 ROWS)

Then, there are two functions for converting tsvectors to arrays, and back:

$ SELECT tsvector_to_array(t) FROM test;
           tsvector_to_array            
----------------------------------------
 {argument,long,mandatori,OPTION,short}
(1 ROW)

and:

$ SELECT array_to_tsvector('{argument,long,mandatori,option,short}');
               array_to_tsvector                
------------------------------------------------
 'argument' 'long' 'mandatori' 'option' 'short'
(1 ROW)

As you can see this conversion is not perfect, as it loses data (position, and priority), but it can be useful together with delete() or setweight().

And finally, there is filter. Filter allows you to get only part of the tsvector.

For example, let's assume we have this tsvector:

$ SELECT
    setweight(
        setweight(
            setweight(t, 'A', '{long,short}'),
            'B',
            '{argument,option}'
        ),
        'C',
        '{mandatori}'
    )
FROM test;
                              setweight                               
----------------------------------------------------------------------
 'argument':2B 'long':4A 'mandatori':1C,7C 'option':5B,10B 'short':9A
(1 ROW)

Filter makes it possible to fetch just elements with specific weights, like here:

$ SELECT FILTER($$'argument':2B 'long':4A 'mandatori':1C,7C 'option':5B,10B 'short':9A$$, '{B,C}');
                     FILTER                      
-------------------------------------------------
 'argument':2B 'mandatori':1C,7C 'option':5B,10B
(1 ROW)

All in all, looks like great addition to tsearch. Thanks guys.

5 thoughts on “Waiting for 9.6 – Tsvector editing functions”

  1. Hello, thanks for feature highlight.

    > Unfortunately, to make it work for mandatori, I’d have to provide it as it was vectorized

    Or just generate array by tsvector_to_array(‘mandatori long short’::tsvector)

  2. @Stas:
    cool, thanks. Missed it, maybe because I was writing about these functions in order 🙂

Comments are closed.