Waiting for PostgreSQL 11 – Add json(b)_to_tsvector function

On 7th of April 2018, Teodor Sigaev committed patch:

Add json(b)_to_tsvector function
 
Jsonb has a complex nature so there isn't best-for-everything way to convert it
to tsvector for full text search. Current to_tsvector(json(b)) suggests to
convert only string values, but it's possible to index keys, numerics and even
booleans value. To solve that json(b)_to_tsvector has a second required
argument contained a list of desired types of json fields. Second argument is
a jsonb scalar or array right now with possibility to add new options in a
future.
 
Bump catalog version
 
Author: Dmitry Dolgov with some editorization by me
Reviewed by: Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CA+q6zcXJQbS1b4kJ_HeAOoOc=unfnOrUEL=KGgE32QKDww7d8g@mail.gmail.com

This is actually pretty cool if you are using your json(b) column as a document store with “random" column names – keys that are in some values, but not always.

So, let's look at how it would be indexed as tsvector in older pg (10.3), and the new hotness.

As a test documents I'd pick:

$ SELECT id, jsonb_pretty(payload) FROM test;
 id |                                                jsonb_pretty
----+-------------------------------------------------------------------------------------------------------------
  1 | {                                                                                                          +
    |     "glossary": {                                                                                          +
    |         "title": "example glossary",                                                                       +
    |         "GlossDiv": {                                                                                      +
    |             "title": "S",                                                                                  +
    |             "GlossList": {                                                                                 +
    |                 "GlossEntry": {                                                                            +
    |                     "ID": "SGML",                                                                          +
    |                     "Abbrev": "ISO 8879:1986",                                                             +
    |                     "SortAs": "SGML",                                                                      +
    |                     "Acronym": "SGML",                                                                     +
    |                     "GlossDef": {                                                                          +
    |                         "para": "A meta-markup language, used to create markup languages such as DocBook.",+
    |                         "GlossSeeAlso": [                                                                  +
    |                             "GML",                                                                         +
    |                             "XML"                                                                          +
    |                         ]                                                                                  +
    |                     },                                                                                     +
    |                     "GlossSee": "markup",                                                                  +
    |                     "GlossTerm": "Standard Generalized Markup Language"                                    +
    |                 }                                                                                          +
    |             }                                                                                              +
    |         }                                                                                                  +
    |     }                                                                                                      +
    | }
  2 | {                                                                                                          +
    |     "menu": {                                                                                              +
    |         "id": "file",                                                                                      +
    |         "popup": {                                                                                         +
    |             "menuitem": [                                                                                  +
    |                 {                                                                                          +
    |                     "value": "New",                                                                        +
    |                     "onclick": "CreateNewDoc()"                                                            +
    |                 },                                                                                         +
    |                 {                                                                                          +
    |                     "value": "Open",                                                                       +
    |                     "onclick": "OpenDoc()"                                                                 +
    |                 },                                                                                         +
    |                 {                                                                                          +
    |                     "value": "Close",                                                                      +
    |                     "onclick": "CloseDoc()"                                                                +
    |                 }                                                                                          +
    |             ]                                                                                              +
    |         },                                                                                                 +
    |         "value": "File"                                                                                    +
    |     }                                                                                                      +
    | }
  3 | {                                                                                                          +
    |     "x": [                                                                                                 +
    |         1,                                                                                                 +
    |         2,                                                                                                 +
    |         3                                                                                                  +
    |     ]                                                                                                      +
    | }
(3 ROWS)

On 10.3, when I tried to convert jsonb value to tsvector for full text searching, I get:

$ SELECT id, to_tsvector(json(payload)) FROM test;
 id |                                                                                                    to_tsvector                                                                                                    
----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | '1986':9 '8879':8 'creat':22 'docbook':27 'exampl':1 'general':36 'glossari':2 'gml':29 'iso':7 'languag':19,24,38 'markup':18,23,33,37 'meta':17 'meta-markup':16 'sgml':5,11,13 'standard':35 'use':20 'xml':31
  2 | 'close':11 'closedoc':13 'createnewdoc':5 'file':1,15 'new':3 'open':7 'opendoc':9
  3 |
(3 ROWS)

As you can see values in 3rd row are fully missing. But now, thanks to the new functions, I can:

$ SELECT id, jsonb_to_tsvector('english', payload, '["all"]') FROM test;
 id |                                                                                                                                                                                         jsonb_to_tsvector
----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | '1986':25 '8879':24 'abbrev':21 'acronym':31 'creat':46 'docbook':51 'exampl':5 'general':66 'glossari':1,6 'glossdef':35 'glossdiv':8 'glossentri':15 'glosslist':13 'glossse':59 'glossseealso':53 'glossterm':63 'gml':55 'id':17 'iso':23 'languag':43,48,68 'markup':42,47,61,67 'meta':41 'meta-markup':40 'para':37 'sgml':19,29,33 'sorta':27 'standard':65 'titl':3,10 'use':44 'xml':57
  2 | 'close':29 'closedoc':33 'createnewdoc':17 'file':5,37 'id':3 'menu':1 'menuitem':9 'new':13 'onclick':15,23,31 'open':21 'opendoc':25 'popup':7 'valu':11,19,27,35
  3 | '1':3 '2':5 '3':7 'x':1
(3 ROWS)

Instead of all, I can use any of:

  • string
  • numeric
  • boolean
  • key

And you can use multiple, like here:

$ SELECT id, jsonb_to_tsvector('english', payload, '["numeric","key"]') FROM test;
 id |                                                                                  jsonb_to_tsvector
----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | 'abbrev':15 'acronym':19 'glossari':1 'glossdef':21 'glossdiv':5 'glossentri':11 'glosslist':9 'glossse':27 'glossseealso':25 'glossterm':29 'id':13 'para':23 'sorta':17 'titl':3,7
  2 | 'id':3 'menu':1 'menuitem':7 'onclick':11,15,19 'popup':5 'valu':9,13,17,21
  3 | '1':3 '2':5 '3':7 'x':1
(3 ROWS)

Pretty cool. That will definitely be useful 🙂 Thanks a lot.