Polish configuration for TSearch

Some time ago someone posted on Reddit that they have problems adding Polish configuration to their PostgreSQL.

While checking it, I found some interesting news. And of course figured out how to have Polish configuration…

First, the news: apparently in PostgreSQL 19, we will automatically get Polish dictionary, thanks to this commit from 5th of January 2026!

We can see it immediately:

=$ \dF polish
           List of text search configurations
   Schema   │  Name  │            Description
────────────┼────────┼───────────────────────────────────
 pg_catalog │ polish │ configuration for polish language
(1 row)
 
=$ SELECT * FROM ts_debug('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');
   alias   |    description    |    token     | dictionaries  | dictionary  |    lexemes
-----------+-------------------+--------------+---------------+-------------+---------------
 word      | Word, all letters | Niepokojące  | {polish_stem} | polish_stem | {niepokoj}
 blank     | Space symbols     |              | {}            |             |
 asciiword | Word, all ASCII   | tajemnice    | {polish_stem} | polish_stem | {tajemnic}
 blank     | Space symbols     | ,            | {}            |             |
 asciiword | Word, all ASCII   | stopniowe    | {polish_stem} | polish_stem | {stopniow}
 blank     | Space symbols     |              | {}            |             |
 asciiword | Word, all ASCII   | odkrywanie   | {polish_stem} | polish_stem | {odkrywan}
 blank     | Space symbols     |              | {}            |             |
 word      | Word, all letters | sekretów     | {polish_stem} | polish_stem | {sekret}
 blank     | Space symbols     |              | {}            |             |
 asciiword | Word, all ASCII   | i            | {polish_stem} | polish_stem | {i}
 blank     | Space symbols     |              | {}            |             |
 asciiword | Word, all ASCII   | rozbudowane  | {polish_stem} | polish_stem | {rozbudowan}
 blank     | Space symbols     |              | {}            |             |
 asciiword | Word, all ASCII   | teorie       | {polish_stem} | polish_stem | {teor}
 blank     | Space symbols     |              | {}            |             |
 word      | Word, all letters | przyciągnęły | {polish_stem} | polish_stem | {przyciągnęł}
 blank     | Space symbols     |              | {}            |             |
 word      | Word, all letters | tłumy        | {polish_stem} | polish_stem | {tłum}
 blank     | Space symbols     |              | {}            |             |
 word      | Word, all letters | widzów       | {polish_stem} | polish_stem | {widz}
 blank     | Space symbols     | .            | {}            |             |
(22 rows)
 
=$ SELECT * FROM to_tsvector('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');
                                                           to_tsvector
----------------------------------------------------------------------------------------------------------------------------------
 'i':6 'niepokoj':1 'odkrywan':4 'przyciągnęł':9 'rozbudowan':7 'sekret':5 'stopniow':3 'tajemnic':2 'teor':8 'tłum':10 'widz':11
(1 row)

We can see here that polish is recognized, and words are correctly stemmed (odkrywanie => odkrywan).

But the same doesn't work on currently releases PostgreSQLs, like Pg18:

=$ \dF polish
List of text search configurations
 Schema │ Name │ Description
────────┼──────┼─────────────
(0 rows)

Initially I tried to use method described in the reddit post, but it errored out on create text search dictionary because of missing files.

What one really has to do is install hunspell-pl (it is there for deb based systems, like Debian and Ubuntu).

After you will install it, in PostgreSQL tsearch data there will be interesting symlinkg:

=$ ls -l /usr/share/postgresql/18/tsearch_data | grep ^l
lrwxrwxrwx 1 root root    39 Apr 21 22:12 en_us.affix -> /var/cache/postgresql/dicts/en_us.affix
lrwxrwxrwx 1 root root    38 Apr 21 22:12 en_us.dict -> /var/cache/postgresql/dicts/en_us.dict
lrwxrwxrwx 1 root root    39 Apr 21 22:12 pl_pl.affix -> /var/cache/postgresql/dicts/pl_pl.affix
lrwxrwxrwx 1 root root    38 Apr 21 22:12 pl_pl.dict -> /var/cache/postgresql/dicts/pl_pl.dict

This tells us that we should be using pl_pl, and not polish. Unless we want to rename the files, which is also an option.

Or we can get more creative with configuration:

=$ CREATE TEXT SEARCH CONFIGURATION public.polish (COPY = pg_catalog.english);
CREATE TEXT SEARCH CONFIGURATION
 
=$ CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE = ispell,
    DictFile = pl_pl,
    AffFile = pl_pl
);
CREATE TEXT SEARCH DICTIONARY
 
=$ ALTER TEXT SEARCH CONFIGURATION public.polish
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH polish_hunspell, simple;
ALTER TEXT SEARCH CONFIGURATION
 
=$ ALTER TEXT SEARCH CONFIGURATION public.polish
    DROP MAPPING FOR email, url, url_path, sfloat, float;
ALTER TEXT SEARCH CONFIGURATION

It worked! So, let's test it:

=$ SELECT * FROM ts_debug('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');
   alias   |    description    |    token     |       dictionaries       |   dictionary    |         lexemes
-----------+-------------------+--------------+--------------------------+-----------------+--------------------------
 word      | Word, all letters | Niepokojące  | {polish_hunspell,simple} | polish_hunspell | {niepokojący,niepokoić}
 blank     | Space symbols     |              | {}                       |                 |
 asciiword | Word, all ASCII   | tajemnice    | {polish_hunspell,simple} | polish_hunspell | {tajemnice,tajemnica}
 blank     | Space symbols     | ,            | {}                       |                 |
 asciiword | Word, all ASCII   | stopniowe    | {polish_hunspell,simple} | polish_hunspell | {stopniowy}
 blank     | Space symbols     |              | {}                       |                 |
 asciiword | Word, all ASCII   | odkrywanie   | {polish_hunspell,simple} | polish_hunspell | {odkrywać}
 blank     | Space symbols     |              | {}                       |                 |
 word      | Word, all letters | sekretów     | {polish_hunspell,simple} | polish_hunspell | {sekrety,sekret}
 blank     | Space symbols     |              | {}                       |                 |
 asciiword | Word, all ASCII   | i            | {polish_hunspell,simple} | polish_hunspell | {i}
 blank     | Space symbols     |              | {}                       |                 |
 asciiword | Word, all ASCII   | rozbudowane  | {polish_hunspell,simple} | polish_hunspell | {rozbudowany,rozbudować}
 blank     | Space symbols     |              | {}                       |                 |
 asciiword | Word, all ASCII   | teorie       | {polish_hunspell,simple} | polish_hunspell | {teoria}
 blank     | Space symbols     |              | {}                       |                 |
 word      | Word, all letters | przyciągnęły | {polish_hunspell,simple} | polish_hunspell | {przyciągnąć}
 blank     | Space symbols     |              | {}                       |                 |
 word      | Word, all letters | tłumy        | {polish_hunspell,simple} | polish_hunspell | {tłumy,tłum}
 blank     | Space symbols     |              | {}                       |                 |
 word      | Word, all letters | widzów       | {polish_hunspell,simple} | polish_hunspell | {widz}
 blank     | Space symbols     | .            | {}                       |                 |
(22 rows)
 
=$ SELECT * FROM to_tsvector('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');
                                                                                                to_tsvector
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'i':6 'niepokoić':1 'niepokojący':1 'odkrywać':4 'przyciągnąć':9 'rozbudowany':7 'rozbudować':7 'sekret':5 'sekrety':5 'stopniowy':3 'tajemnica':2 'tajemnice':2 'teoria':8 'tłum':10 'tłumy':10 'widz':11
(1 row)

Sweet. Now, my old blogpost, and the reddit post suggest setting also StopWords. But if I'd try:

=$ CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE = ispell,
    DictFile = pl_pl,
    AffFile = pl_pl,
    StopWords = pl_pl
);
psql:create.polish.2.sql:9: ERROR:  could not open stop-word file "/usr/share/postgresql/18/tsearch_data/pl_pl.stop": No such file or directory

Yeah. There is no such file. Luckily, we can simply make such file. It will be enough to just make it as empty file, but let's show how stop file works.

In this file, I wrote two lines only:

stopniowy
odkrywać

And, with recreated dictionary, with StopWords, I'm getting this output:

=$ SELECT * FROM to_tsvector('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');
                                                                                   to_tsvector
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'i':6 'niepokoić':1 'niepokojący':1 'przyciągnąć':9 'rozbudowany':7 'rozbudować':7 'sekret':5 'sekrety':5 'tajemnica':2 'tajemnice':2 'teoria':8 'tłum':10 'tłumy':10 'widz':11
(1 row)

Sweet.

But – what can you do if you're not on Linux with debs?

It's not that complicated. Start by going to debian package page. On the right side there will be link to “libreoffice-dictionaries_VERSION.orig.tar.xz". Currently it's libreoffice-dictionaries_26.2.2.orig.tar.xz.

Download this file to some directory.

Then unpack polish files from there:

=$ tar xvf libreoffice-dictionaries_*.orig.tar.xz --wildcards 'libreoffice-*/dictionaries/pl_PL'
libreoffice-26.2.2.2/dictionaries/pl_PL/
libreoffice-26.2.2.2/dictionaries/pl_PL/th_pl_PL_v2.dat
libreoffice-26.2.2.2/dictionaries/pl_PL/plhyph.tex
libreoffice-26.2.2.2/dictionaries/pl_PL/pl_PL.dic
libreoffice-26.2.2.2/dictionaries/pl_PL/pl_PL.aff
libreoffice-26.2.2.2/dictionaries/pl_PL/hyph_pl_PL.dic
libreoffice-26.2.2.2/dictionaries/pl_PL/dictionaries.xcu
libreoffice-26.2.2.2/dictionaries/pl_PL/description.xml
libreoffice-26.2.2.2/dictionaries/pl_PL/README_pl.txt
libreoffice-26.2.2.2/dictionaries/pl_PL/README_en.txt
libreoffice-26.2.2.2/dictionaries/pl_PL/META-INF/

In here, we see that the files are in libreoffice-26.2.2.2/dictionaries/pl_PL/ directory. So, let's go there…

The problem now is that the files are in ISO-8859-2 encoding, and Pg requires UTF-8. Luckily, on any Linux you should have iconv tool, which you can use to convert the files:

=$ iconv -f iso-8859-2 -t utf-8 pl_PL.aff > pl_pl.affix
=$ iconv -f iso-8859-2 -t utf-8 pl_PL.dic > pl_pl.dict

There two files have to be copied to where tsearch expects to find data files, in my case it's /usr/share/postgresql/18/tsearch_data – as we seen earlier in error messages.

So, I need to copy these two files there, possibly using root:

=$ sudo cp -v pl_pl.affix pl_pl.dict /usr/share/postgresql/18/tsearch_data/
'pl_pl.affix' -> '/usr/share/postgresql/18/tsearch_data/pl_pl.affix'
'pl_pl.dict' -> '/usr/share/postgresql/18/tsearch_data/pl_pl.dict'

And that's all you need to have it working. Final SQL script that creates everything, and tests:

CREATE TEXT SEARCH CONFIGURATION public.polish (COPY = pg_catalog.english);
 
CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE = ispell,
    DictFile = pl_pl,
    AffFile = pl_pl,
    StopWords = pl_pl
);
 
ALTER TEXT SEARCH CONFIGURATION public.polish
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH polish_hunspell, simple;
 
ALTER TEXT SEARCH CONFIGURATION public.polish
    DROP MAPPING FOR email, url, url_path, sfloat, float;
 
SELECT * FROM to_tsvector('polish', 'Niepokojące tajemnice, stopniowe odkrywanie sekretów i rozbudowane teorie przyciągnęły tłumy widzów.');

Hope it will help someone 🙂

=$
|

Leave a Reply