September 15th, 2007 by depesz | Tags: , | 21 comments »
Did it help? If yes - maybe you can help me?

as most of you know postgresql can easily speedup searches using:

field like 'something%'

and (less easily):

field like '%something'

but how about:

field like '%something%'

general idea is to use some kind of full text search/indexing – tsearch, lucene, sphinx, you name it.

but sometimes you can't install fts/fti, or it doesn't really solve your problem. is there any help? let's find out.

so, first i created a test table:

create table texts (
id serial primary key,
user_text text not null
);

then, with some help from perl, i created test data:

# select min(x), max(x), avg(x), sum(x), count(*) from (select length(user_text) as x from texts) y;
min | max | avg | sum | count
-----+-----+----------------------+----------+--------
36 | 550 | 200.9747100000000000 | 20097471 | 100000
(1 row)

this texts contain from 6 to 55 words from /usr/share/dict/american-english. separated by spaces.

so, sample data looks like this:

# select * from texts order by length(user_text) asc limit 5;
id | user_text
-------+---------------------------------------
10676 | quells loch terry nickel Hancock erg
24317 | sled wag swarm breathy Myst prevents
7539 | unhurt uphill n formalize Ursa Janna
87960 | troy JFK ductile Glen frescos gavotte
4879 | mixt potters bevel Prague male siesta
(5 rows)

for simplicity sake i didn't add any other things (like numbers or punctuation characters). but the code i will present wil assume any characters can be there.

so, now, let's find out how long it really is to search this 100,000 records:

# explain analyze select * from texts where user_text ilike '%formal%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on texts (cost=0.00..4209.00 rows=10 width=208) (actual time=0.424..833.511 rows=431 loops=1)
Filter: (user_text ~~* '%formal%'::text)
Total runtime: 834.199 ms
(3 rows)

(next calls show times: 915.364, 739.113, 773.271, 743.347 and 776.002).

so, first let's ask a question. why do we want %something%'.

in most standard cases you dont actually need %something%. what you need is the ability to find any word starting with given string.

example? looking for “warm" you will be happy to find: warm, warmer, warmest. but “swarm" is not neccessarily what you're looking for.

so, knowing we don't need actual “substring" but just word-prefix search we can work with it.

we will need two tables. first – word dictionary, and second – list of words per text:

create table text_words (
id serial primary key,
word text not null
);
create unique index ui_text_words on text_words (word);
create index i_text_words_search on text_words (word text_pattern_ops);
create table words_in_texts (
id serial primary key,
text_id int4 not null references texts (id) on delete cascade on update cascade,
word_id int4 not null references text_words (id) on delete cascade on update cascade
);
create unique index ui_words_in_texts_witi on words_in_texts (word_id, text_id);
create index i_words_in_ti on words_in_texts (text_id);

the i_text_words_search index has text_pattern_ops because i dont use C locale. and i had to add ui_text_words because text_pattern_ops doesn't work for “=" operator (at least in 8.3 cvs head 🙂

now, i'll need something (a trigger?) that would split texts into words, and store them in dictionary, and put relations in words_in_texts. it appears that pl/perl is perfect for this job 🙂

i would like the trigger to be smart enough to split into “words" even if something else is there. numbers, zipcodes – these kind of things.

usually i would write separate triggers for insert and update, but in here i will use only one trigger for both situations, as the logic will be nearly the same. additionally – trigger “on delete" is not necce

so, our trigger looks like this:

CREATE OR REPLACE FUNCTION texts_words_iu() RETURNS TRIGGER LANGUAGE plperl AS $BODY$
spi_exec_query('DELETE FROM words_in_texts WHERE text_id = ' . $_TD->{'old'}->{'id'}) if $_TD->{'event'} eq 'UPDATE';
my $new_text = $_TD->{'new'}->{'user_text'};
return unless $new_text;
my @words = $new_text =~ m{
(
[0-9]+ (?: - [0-9]+ )+
|
-? [0-9]+ (?: \. [0-9]+ )?
|
-? \. [0-9]+
|
[a-z0-9_-]+
)
}xig;
my %uniq;
my @use_words = sort grep { !$uniq{$_}++ } map { lc } @words;
return if 0 == scalar @use_words;
$words_values = "( VALUES " . join(',', map { "('$_')" } @use_words) . " )";
my $dict_sql = "INSERT INTO text_words (word) SELECT x FROM $words_values y (x) WHERE NOT exists (SELECT * FROM text_words w1 WHERE w1.word = y.x)";
spi_exec_query( $dict_sql );
my $join_sql = "INSERT INTO words_in_texts (word_id, text_id) SELECT d.id, $_TD->{'new'}->{'id'} FROM $words_values y (x) join text_words d on y.x = d.word";
spi_exec_query( $join_sql );
return;
$BODY$;
CREATE TRIGGER texts_words_iu AFTER INSERT OR UPDATE ON texts FOR EACH ROW EXECUTE PROCEDURE texts_words_iu();

does it look scary? possibly. does it work? let's test:

# insert into texts (user_text) values ('hubert depesz lubaczewski 123. .456 789.01 -48 522-186-96-44') returning *;
id | user_text
--------+--------------------------------------------------------------
100028 | hubert depesz lubaczewski 123. .456 789.01 -48 522-186-96-44
(1 row)

# select * from text_words;
id | word
----+---------------
25 | -48
26 | .456
27 | 123
28 | 522-186-96-44
29 | 789.01
30 | depesz
31 | hubert
32 | lubaczewski
(8 rows)

# select * from words_in_texts;
id | text_id | word_id
----+---------+---------
9 | 100028 | 25
10 | 100028 | 26
11 | 100028 | 27
12 | 100028 | 28
13 | 100028 | 29
14 | 100028 | 30
15 | 100028 | 31
16 | 100028 | 32
(8 rows)

# update texts set user_text = 'some test text' where id = 100028;
UPDATE 1

# select * from text_words;
id | word
----+---------------
25 | -48
26 | .456
27 | 123
28 | 522-186-96-44
29 | 789.01
30 | depesz
31 | hubert
32 | lubaczewski
33 | some
34 | test
35 | text
(11 rows)

# select * from words_in_texts;
id | text_id | word_id
----+---------+---------
17 | 100028 | 33
18 | 100028 | 34
19 | 100028 | 35
(3 rows)

# delete from texts where id = 100028;
DELETE 1

# select * from words_in_texts;
id | text_id | word_id
----+---------+---------
(0 rows)

looks pretty cool to me. so, now let's do the same trick to all records in texts table:

update texts set user_text = user_text;

some time later.

# select relname, pg_size_pretty(pg_total_relation_size(oid)) from pg_class where relname in ('texts', 'text_words', 'words_in_texts');
relname | pg_size_pretty
----------------+----------------
text_words | 8016 kB
words_in_texts | 209 MB
texts | 25 MB
(3 rows)

ok, so our dictionary is 8 megabytes, and list of all words in texts is over 200 megabytes (about 8 times of original table).

but, is it any faster? let's find out.

# explain analyze select distinct t.* from texts t join words_in_texts wit on t.id = wit.text_id join text_words tw on wit.word_id = tw.id where tw.word like 'formal%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=977.08..978.67 rows=213 width=207) (actual time=11.035..12.627 rows=346 loops=1)
-> Sort (cost=977.08..977.61 rows=213 width=207) (actual time=11.032..11.579 rows=347 loops=1)
Sort Key: t.id, t.user_text
Sort Method: quicksort Memory: 146kB
-> Nested Loop (cost=4.74..968.84 rows=213 width=207) (actual time=0.126..9.939 rows=347 loops=1)
-> Nested Loop (cost=4.74..906.10 rows=213 width=4) (actual time=0.097..3.839 rows=347 loops=1)
-> Index Scan using i_text_words_search on text_words tw (cost=0.00..8.28 rows=7 width=4) (actual time=0.034..0.108 rows=12 loops=1)
Index Cond: ((word ~>=~ 'formal'::text) AND (word ~<~ 'formam'::text))
Filter: (word ~~ 'formal%'::text)
-> Bitmap Heap Scan on words_in_texts wit (cost=4.74..127.86 rows=32 width=8) (actual time=0.033..0.224 rows=29 loops=12)
Recheck Cond: (wit.word_id = tw.id)
-> Bitmap Index Scan on ui_words_in_texts_witi (cost=0.00..4.73 rows=32 width=0) (actual time=0.024..0.024 rows=29 loops=12)
Index Cond: (wit.word_id = tw.id)
-> Index Scan using texts_pkey on texts t (cost=0.00..0.28 rows=1 width=207) (actual time=0.010..0.012 rows=1 loops=347)
Index Cond: (t.id = wit.text_id)
Total runtime: 13.209 ms
(16 rows)

(next calls gave times: 9.190, 9.190, 9.318, 9.454 and 9.013ms).

it's important that we get less records than with like ‘%formal%', because our new approach will not find texts with word ‘informal'.

but, it looks like a nice speedup: going down from 750ms to 9ms is nice 🙂

on the other hand – is there any chance i could get ‘informal' text as well?

actually – yes. what we need is to add word_suffixes to text_words.

let's add new table:

create table text_words_suffixes (
id serial primary key,
word_id int4 not null references texts (id) on delete cascade on update cascade,
suffix text not null
);
create unique index ui_text_words_suffixes on text_words_suffixes (word_id, suffix);
create index i_text_words_suffixes_search on text_words_suffixes (suffix text_pattern_ops);

now, another trigger, this time on text_words table:

CREATE OR REPLACE FUNCTION text_words_suffixes_iu() RETURNS TRIGGER LANGUAGE plperl AS $BODY$
my $word_id = $_TD->{'new'}->{'id'};
my $word = $_TD->{'new'}->{'word'};
spi_exec_query("DELETE FROM text_words_suffixes WHERE word_id = $word_id") if $_TD->{'event'} eq 'UPDATE';
my @suffixes = map { substr($word, $_) } (0..length($word)-1);
my $suffixes_values = "( VALUES " . join(',', map { "('$_')" } @suffixes) . " )";
my $sql = "INSERT INTO text_words_suffixes (word_id, suffix) SELECT $word_id, y.x FROM $suffixes_values y (x)";
spi_exec_query( $sql );
return;
$BODY$;
CREATE TRIGGER text_words_suffixes_iu AFTER INSERT OR UPDATE ON text_words FOR EACH ROW EXECUTE PROCEDURE text_words_suffixes_iu();

now, another “smart" update:

UPDATE text_words SET word = word;

after update (plus dump/reload to get rid of bloat):

# select relname, pg_size_pretty(pg_total_relation_size(oid)) from pg_class where relname in ('texts', 'text_words', 'words_in_texts', 'text_words_suffixes');
relname | pg_size_pretty
---------------------+----------------
text_words | 8016 kB
text_words_suffixes | 62 MB
words_in_texts | 209 MB
texts | 25 MB
(4 rows)

and how it performs?

# explain analyze select distinct t.* from texts t join words_in_texts wit on t.id = wit.text_id join text_words_suffixes tw on wit.word_id = tw.word_id where tw.suffix like 'formal%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=6034.78..6044.68 rows=1321 width=207) (actual time=18.390..20.268 rows=431 loops=1)
-> Sort (cost=6034.78..6038.08 rows=1321 width=207) (actual time=18.385..18.945 rows=433 loops=1)
Sort Key: t.id, t.user_text
Sort Method: quicksort Memory: 178kB
-> Nested Loop (cost=4.73..5966.30 rows=1321 width=207) (actual time=0.167..17.231 rows=433 loops=1)
-> Nested Loop (cost=4.73..5078.16 rows=1321 width=4) (actual time=0.124..6.363 rows=433 loops=1)
-> Index Scan using i_text_words_suffixes_search on text_words_suffixes tw (cost=0.00..8.33 rows=42 width=4) (actual time=0.058..0.259 rows=15 loops=1)
Index Cond: ((suffix ~>=~ 'formal'::text) AND (suffix ~<~ 'formam'::text))
Filter: (suffix ~~ 'formal%'::text)
-> Bitmap Heap Scan on words_in_texts wit (cost=4.73..120.32 rows=31 width=8) (actual time=0.047..0.318 rows=29 loops=15)
Recheck Cond: (wit.word_id = tw.word_id)
-> Bitmap Index Scan on ui_words_in_texts_witi (cost=0.00..4.73 rows=31 width=0) (actual time=0.030..0.030 rows=29 loops=15)
Index Cond: (wit.word_id = tw.word_id)
-> Index Scan using texts_pkey on texts t (cost=0.00..0.66 rows=1 width=207) (actual time=0.017..0.019 rows=1 loops=433)
Index Cond: (t.id = wit.text_id)
Total runtime: 20.874 ms
(16 rows)

(next calls return in 11.382, 11.616, 11.457, 14.741, 11.272 ms).

please notice that this time all 431 rows got returned!

additional feature is that we an find words that have given part:

# select tw.* from text_words tw join text_words_suffixes tws on tw.id = tws.word_id where tws.suffix like 'formal%';
id | word
-------+---------------
35922 | formal
61750 | informal
95889 | formaldehyde
72849 | formalism
66074 | formalities
38209 | formality
24214 | informality
26966 | formalization
36419 | formalize
73154 | formalized
60399 | formalizes
56199 | formalizing
78044 | informally
89264 | formally
28901 | formals
(15 rows)

(time to get this list is around 0.3ms!)

at the end of this note – basically if you need something like this – please invest time and use proper fts/fti engine (tsearch2 for example).

but if you can't – there is an alternative. and it even works 🙂

  1. 21 comments

  2. # deepcored
    Sep 15, 2007

    This is a nice approach but sounds like reinventing the wheel. Why simply not to use tsearch2 ?

  3. Sep 15, 2007

    @deepcored:
    *i* would use tsearch. but some people dont want it (for whatever reason). this is hwoto for them.
    what’s more – this approach has one feature tsearch doesn’t – prefix search.

  4. # deepcored
    Sep 16, 2007

    As for performance, I think it would be nice to have intarray together with GIN index instead of plain table. I mean the words_in_texts table.
    Or the purpose of this article was to use the simplest solution possible?

  5. Sep 16, 2007

    @deepcored:
    1. i think that “keeping it simple” is a good approach
    2. i’m not really sure if intarray would be really good – especially when searching for multiple words.

  6. # Acid
    Nov 11, 2007

    i have one problem with this post, if i try to use it with UTF8 database, with poslish signs, if perl funsction get words like “międzyżecze” it spit it into words “mi” “dzy” “ecze”, i try tu ADD to [a-z0-9_-]+, something like that [ąśćżśa-z0-9_-] but this wont work its get the database error ERROR: creation of Perl function failed: ‘require’ trapped by operation mask at line 31.,line 31 is line with perl pattern

  7. Nov 11, 2007

    @Acid:
    hmm … strange. perhaps it’s something “required” because perl sees its utf8 characters?
    i’ll write about it to pgsql-general list, we’ll see how it will go.

  8. Nov 11, 2007

    change the regexp part to:
    [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ0-9_-]
    and remove “i” at the end of regexp. (i – the flag)

  9. # Acid
    Nov 11, 2007

    it’s “almost” working i write:

    my @words = $new_text =~ m{

    (

    [0-9]+ (?: – [0-9]+ )+

    |

    -? [0-9]+ (?: . [0-9]+ )?

    |

    -? . [0-9]+

    |

    [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ0-9_-]+

    )

    }xg;

    end it’s working for: “ą ł” but not for “ś ż ź ę ń ó”

    eny idea?

  10. # Acid
    Nov 11, 2007

    eh OK, a create it from pgAdmin, when i create it from command line it;s work perdectly

    thank you

  11. Nov 29, 2007

    @Acid:
    temat się rozwija.
    zostął zgłoszony bug do perla:
    http://rt.perl.org/rt3//Public/Bug/Display.html?id=47576
    kolesie od pg też nie próżnują i kombinują co z tym zrobić.

  12. # Acid
    Jan 3, 2008

    heh znalazlem buga :D:D

    no ale mam kolejny problemik nie chce działać na postgresie 8.1.9 ;/ przy takim triggerze http://rafb.net/p/XSm0lu47.html dostaje błąd [message:protected] => SQLSTATE[XX000]: Internal error: 7 ERROR: error from Perl trigger function: syntax error at or near “,” at line 44.
    [string:private] =>
    [code:protected] => XX000 przy robieniu update albo insert ;/ na 8.2.x dziala bez problemu

  13. # Rauan
    Jun 6, 2008

    I solved that trouble using plperlu (untrusted)

  14. # Ferenc
    Sep 22, 2008

    It is great article, but:
    If you search for a small length of string e.g. ‘and%’ or ‘a%’, it is slower than a ‘%and%’ or ‘%a%’ query 🙁

  15. # kookimebux
    Feb 1, 2009

    Hello. And Bye. 🙂

  16. # BartekR
    Mar 6, 2010

    Już się poddaję. Testuję powyższe rozwiązanie na WinXP z postgresem 8.3 i 8.4. Próbuję wkleić dokładnie cały kod tabel i triggera i za nic nie chce działać. Dane do tabeli texts dodają się, ale nic się nie pojawia w tabelach text_words i words_in_texts. Zupełnie jakby trigger nie działał. Działam na pgAdmin 1.10. mam w bazie plperl i plperlu, na maszynie Perl ActiveState 5.8

    Na innym środowisku – linux + postgres 8.2 działa bez zarzutu. Gdzie szukać co jest nie tak? Jakaś opcja środowiska, myk w konfiguracji? Trigger na tabeli zaznaczam jako aktywny i też nie pomaga.

  17. Mar 6, 2010

    @BartekR:
    sorry, ale windows to dla mnie czarna magia. nie mam pojęcia czemu nie działa, ale jeśli trigger założyłeś, i nie zgłąsza błędów (exception) to powinno działać.

  18. # BartekR
    Mar 7, 2010

    No właśnie sam nie wiem czemu nie działa. Kombinuję już jak koń pod górę i nic.
    Błędów nie zgłasza – trigger się tworzy poprawnie, za to nic nie robi.

    Jeszcze pokombinuję, bo nagle banalny trigger w plperlu poszedł. Mam trop, że może to prefiks schematu może coś daje.

  19. # BartekR
    Mar 7, 2010

    Jeszcze pytanie Zauważyłem, że problemy się pojawiają przy polskich znakach (ale to będzie do sprawdzenia), ale dodatkowo kiedy staram się dostać do zmiennych przez $_TD. Przykładowo podpięcie takiej funkcji jako triggera działa i dodaje słowa zdefiniowane w $new_text:

    CREATE OR REPLACE FUNCTION b()
    RETURNS “trigger” AS
    $BODY$

    my $new_text = “raz dwa trzy baba jaga patrzy”;

    my @words = $new_text =~ m{
    (
    [0-9]+ (?: – [0-9]+ )+
    |
    -? [0-9]+ (?: \. [0-9]+ )?
    |
    -? \. [0-9]+
    |
    [a-z0-9_-]+
    )
    }xig;
    my %uniq;

    my @use_words = sort grep { !$uniq{$_}++ } map { lc } @words;

    $words_values = “( VALUES ” . join(‘,’, map { “(‘$_’)” } @use_words) . ” )”;

    my $dict_sql = “INSERT INTO text_words (word) SELECT x FROM $words_values y (x) WHERE NOT exists (SELECT * FROM text_words w1 WHERE w1.word = y.x)”;
    spi_exec_query( $dict_sql );

    return;
    $BODY$
    LANGUAGE ‘plperl’ VOLATILE;

    Natomiast jak dam pole, z którego te dane pochodzą, np (tabela.tresc, na polu tabela jest trigger):
    $_TD->{new}{tresc}

    to nic się nie dzieje jeśli wprowadzę te “raz dwa trzy babajaga patrzy” do pola treść.
    I znów głupi jestem

  20. # BartekR
    Mar 8, 2010

    Pomogła przesiadka na plperlu. Nie wiem czemu, ale działa dokładnie tak jak w artykule. Może pod Win jest jakieś ograniczenie? Nie wiem.

  21. Jul 28, 2010

    I really appreciate your solution to this problem and FTS on a shoe string. I learned lots. I was wondering if it’s possible to split this query

    select * from texts where user_text ilike ‘%formal%’;

    into two parts:

    select * from texts where user_text ilike ‘formal%’
    union all
    select * from texts where text_reverse(user_text) like text_reverse(‘%formal’);

    thanks,
    Brian

  22. Jul 28, 2010

    @Brian:

    no. it’s not.
    For example: let’s assume user_text is ‘informality’. it will match ‘%formal%’ but it will not match ‘formal%’ nor ‘%formal’

Leave a comment