October 17th, 2010 by depesz | Tags: , , , , | 29 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

Before I'll start let me say that I am fan of what Oleg and Teodor did – their work is great, and I do appreciate their time and ideas.

But – I simply don't like the idea of using FTS (Full Text Search) inside of database. Why? Let me show you.

One more disclaimer: I do not believe that any problems (described in this blogpost) are related to TSearch itself. I think it's more of a general fault of putting (supposedly) fast search engine, in RDBMS with all of its overhead. For example – compare this blogpost which benchmarks TSearch against full text searching in Oracle.

In the rest of article I will be showing results of some tests, so first I'll tell you what's my test environment.

  • Hardware:
  • Software:
    • OS: Linux 2.6.32 (2.6.32-23-generic)
    • Distribution: Kubuntu lucid (10.04)
    • PostgreSQL: PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit; with integer datetimes; from ubuntu package version 8.4.4-0ubuntu10.04
    • Sphinx 0.9.9, compiled by myself, from source .tar.gz
  • Data: I used copy of wikipedia data from 2010-01-16, file downloaded: enwiki-20100116-pages-meta-current.xml, size: 55700771754 bytes (51.8 GB)

PostgreSQL was left with configuration from package, but I changed following settings:

  • max_connections = 5
  • shared_buffers = 768MB
  • work_mem = 512MB
  • maintenance_work_mem = 2048MB
  • effective_cache_size = 2048MB
  • fsync = off
  • synchronous_commit = off
  • wal_buffers = 16MB
  • checkpoint_completion_target = 0.9
  • checkpoint_segments = 60
  • autovacuum = off
  • autovacuum_freeze_max_age = 1000000000

plus some logging changes to log directly to files, rotated, all statements.

These settings are definitely not sensible for most of use-cases, but since there is nothing on this machine except Pg, I want it to use as much resources as possible, and to do tasks as quick as possible. Even at the cost of potential data corruption (fsync = off). After all – it's just a test!

Sphinx was configured with “./configure –without-mysql –without-pgsql“, configure output showed:

configuring Sphinx
------------------
 
checking for CFLAGS needed for pthreads... none
checking for LIBS needed for pthreads... -lpthread
checking for pthreads... found
checking whether to compile with MySQL support... no
checking whether to compile with PostgreSQL support... no
checking whether to use 64-bit document/word IDs... no
checking whether to compile with libstemmer support... no
checking for libexpat... found
checking for libiconv... found
checking for iconv() arg types... char **
checking for UnixODBC... not found
configure: WARNING: ODBC source support will NOT be available
checking for unaligned RAM access... no

As you can see, I choose to compare TSearch with Sphinx – please do not treat it as any kind of statement that sphinx is best – it's simply the engine I know.

Now. Total count of all records that I have to “play" is 19,093,861, but to make the comparison more informative, I decided to work also with smaller tables. So I loaded the data to tables named: pages_1000, pages_10000, pages_100000, pages_1000000 and pages_10000000. Each table has the same structure:

$ \d pages_all
Table "public.pages_all"
Column | Type | Modifiers
---------+--------------------------+-----------
id | integer |
title | text |
created | timestamp with time zone |
creator | text |
body | text |

As you can see – no indexes now, as it's just after data loading.

Some information about data in all those tables:

table_name rowcount title length body length
min avg max min avg max
pages_1000 1000 4 21 78 2 2805 129490
pages_10000 10000 3 25 112 1 1557 410345
pages_100000 100000 1 24 192 1 2073 966275
pages_1000000 1000000 1 22 200 1 2124 762490
pages_10000000 10000000 1 22 259 1 2680 984490

Also – physical data sizes on disk:

relname pg_relation_size pg_total_relation_size
pages_1000 560 kB 1840 kB
pages_10000 5976 kB 11 MB
pages_100000 65 MB 128 MB
pages_1000000 541 MB 1330 MB
pages_10000000 5379 MB 16 GB

Now. On all those tables, I will launch simple queries, which will select some portion of the rows based on word in body, and order by nothing, or by id, timestamp or by title.

In case of no ordering, I will get count of rows, and in case of ordering I will fetch first 50 ids – this is similar to what web systems do to page results – they need to get count, and then first page of results.

Since TSearch2 can work with 2 different sets of indexes (GiST and GIN), I'll test both.

To get sensible results, I will run each test 10 times, get rid of 4 extreme values (2 highest times, and 2 lowest times), and get average of rest of times.

Chosen words (words have been chosen to match various subsets of given table):

table name ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
word count % word count % word count % word count % word count % word count %
pages_1000 user 313 31.30 also 201 20.10 note 100 10.00 former 50 5.00 impact 10 1.00 trace 5 0.50
pages_10000 thank 2928 29.28 test 2012 20.12 cite 1028 10.28 want 504 5.04 enough 100 1.00 normal 50 0.50
pages_100000 welcome 29707 29.71 construct 19848 19.85 question 10212 10.21 north 5000 5.00 protect 1000 1.00 element 501 0.50
pages_1000000 class 301796 30.18 import 201569 20.16 first 100490 10.05 june 50018 5.00 meant 10003 1.00 wind 5001 0.50
pages_10000000 utc 3008374 30.08 class 2030718 20.31 ask 1007427 10.07 english 500304 5.00 upper 100539 1.01 environment 91982 0.92

above counts where checked using following type of query:

SELECT count(*)
FROM TABLE
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'WORD')

With database queries, we can use multiple different indexes – we can have GiST or GIN index.

There is a way to create index on both tsvector, and the secondary column, but in my tests, this index was used only to search, not to order. So, just to be sure, I will test (in PostgreSQL) two cases:

  • GiST index on body
  • GIN index on body

According to RhodiumToad on IRC, PostgreSQL cannot currently use GiST or GIN index for both searching and ordering, or one of them for searching and BTree for ordering.

So, to test PostgreSQL performance, I used following 2 files:

=$ cat run.test.data
pages_1000 user also note former impact trace
pages_10000 thank test cite want enough normal
pages_100000 welcome construct question north protect element
pages_1000000 class import first june meant wind
pages_10000000 utc class ask english upper environment

And the script itself is available for download.

Results of tests of TSearch (logs are also available for download).

First, index creation and sizes:

table index type size size/records time time/records
pages_1000 gist 204,800 B 204 B 0.8 s 0.8 ms
pages_1000 gin 2,867,200 B 2,867 B 1.0 s 1.0 ms
pages_10000 gist 2,105,344 B 210 B 4.8 s 0.5 ms
pages_10000 gin 12,795,904 B 1,279 B 5.5 s 0.6 ms
pages_100000 gist 27,885,568 B 278 B 72.9 s 0.7 ms
pages_100000 gin 127,565,824 B 1,275 B 82.4 s 0.8 ms
pages_1000000 gist 220,954,624 B 220 B 659.5 s 0.7 ms
pages_1000000 gin 1,057,144,832 B 1,057 B 822.9 s 0.8 ms
pages_10000000 gist 2,236,841,984 B 223 B 9,168.4 s 0.9 ms
pages_10000000 gin 10,583,187,456 B 1,058 B 88,613.0 s 8.9 ms

In case the numbers were not clear:

  • size – it's index size in bytes
  • size/records – how much space (on average) is used (in index) by single record from source table
  • time – average time to create the index
  • time/records – average time to index single row in this table

Couple of notes: index creation time is really long. It seems also that I hit some kind of threshold in the last test, as time of index creation skyrocketed (from 0.8ms per row to 8.9ms per row!)

Now. Searches. Due to the fact that we're dealing with lots of numbers, I did split it to 3 tables – first is for count(*), then for query for first 20 ids, ordered by created (timestamp), and first 20 ids, ordered by title.

count(*) times (all times in miliseconds):

table index type ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 gist 618.2 656.8 544.2 513.3 449.0 412.8
pages_1000 gin 2.7 2.5 2.6 2.5 2.4 2.3
pages_10000 gist 3,198.9 2,302.8 2,414.5 2,476.5 1,901.0 2,773.3
pages_10000 gin 7.0 6.3 5.2 4.1 3.3 2.8
pages_100000 gist 49,075.0 44,466.7 46,497.3 41,389.7 44,897.2 31,164.4
pages_100000 gin 38.2 30.0 22.0 13.3 6.2 4.9
pages_1000000 gist 389,507.0 384,140.1 426,959.2 358,398.0 279,168.7 247,768.3
pages_1000000 gin 275.6 227.0 178.1 118.1 37.6 25.2
pages_10000000 gist 6,309,379.2 5,458,583.2 5,775,298.4 4,457,977.6 4,203,556.0 4,988,674.2
pages_10000000 gin 111,110.4 87,755.1 76,945.0 68,812.2 262.6 244.5

“order by created" times (all times in miliseconds):

table index type ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 gist 612.8 649.5 538.1 509.2 442.4 408.1
pages_1000 gin 3.1 3.1 2.9 2.8 2.7 2.7
pages_10000 gist 3,163.5 2,288.3 2,395.3 2,457.8 1,885.3 2,747.1
pages_10000 gin 8.3 7.5 4.9 5.0 3.3 3.2
pages_100000 gist 48,619.5 44,112.9 46,061.2 41,082.9 44,503.1 30,890.6
pages_100000 gin 38.8 32.2 24.0 14.5 7.0 5.2
pages_1000000 gist 385,316.4 380,671.1 421,210.0 355,074.1 276,791.6 245,679.6
pages_1000000 gin 316.2 257.5 192.8 127.7 40.8 26.2
pages_10000000 gist 6,233,637.7 5,390,065.0 5,699,556.1 4,403,095.1 4,152,560.1 4,927,288.8
pages_10000000 gin 99,674.3 88,011.8 77,026.5 68,944.5 280.3 263.4

“order by title" times (all times in miliseconds):

table index type ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 gist 613.1 649.4 536.8 506.6 442.3 409.4
pages_1000 gin 3.5 3.1 3.0 2.9 2.7 2.8
pages_10000 gist 3,165.4 2,293.0 2,396.9 2,461.0 1,885.2 2,746.8
pages_10000 gin 11.1 10.4 5.2 6.2 4.0 3.2
pages_100000 gist 48,609.7 44,143.1 46,076.5 41,087.0 44,497.1 30,903.0
pages_100000 gin 46.4 37.4 26.6 16.0 7.3 5.5
pages_1000000 gist 385,426.9 380,763.5 421,239.5 355,047.7 276,817.9 245,661.1
pages_1000000 gin 367.2 295.4 214.4 137.7 43.4 26.9
pages_10000000 gist 6,238,224.6 5,392,749.8 5,700,966.5 4,404,191.9 4,153,318.1 4,926,796.8
pages_10000000 gin 99,566.1 88,179.8 77,109.1 68,867.8 302.8 286.3

Based on this, I can say that GiST indexes are really, really, much slower than GIN. But with GIN – just like with indexing – I hit some problems with 10 million rows.

Please also note how much depends on how many rows match full text search. The time differences are huge!

This wraps the numbers I got from TSearch. Now for Sphinx.

As you perhaps noticed, I compiled Sphinx without support for PostgreSQL. I did it in such a way, that I will need to write script that will get me data out of database, and convert to xml format.

To convert to XML, I wrote simple program, which also tests speed of export. Program is available for your viewing here.

table avg. export export size
pages_1000 0.183s 3.1 MB
pages_10000 1.171s 18 MB
pages_100000 12.888s 230 MB
pages_1000000 186.486s 2.3 GB
pages_10000000 2468.017s 29 GB

As we can see the sizes are pretty much scaling linearly. Times – well, with 10M we got time worse than linear, but still not bad.

Now for indexing. Full sphinx.conf is of course available for your review, but it's mostly all defaults, with exception of ram, but I assigned it the same ram as PostgreSQL had for work_mem.

Again, I indexed each data set 10 times, and got following times and sizes:

data set size size/records time time/records
pages_1000 1.7 MB 1657 B 0.203 s 0.203 ms
pages_10000 8.1 MB 840 B 1.119 s 0.112 ms
pages_100000 100 MB 1040 B 15.330 s 0.153 ms
pages_1000000 1.1 GB 1099 B 189.044 s 0.189 ms
pages_10000000 13 GB 1354 B 2580.042 s 0.258 ms

So, now we should check the indexes and run some queries on them.

To make the search as simple as possible, I will use search command line tool.

Script that I used to test it, is of course also available for download.

Results (all times in seconds, as they were taken from system time program), first count(*):

table ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 0.003 0.002 0.003 0.003 0.003 0.003
pages_10000 0.004 0.004 0.004 0.003 0.003 0.003
pages_100000 0.010 0.009 0.009 0.012 0.008 0.010
pages_1000000 0.064 0.046 0.055 0.049 0.047 0.046
pages_10000000 0.500 0.469 0.382 0.371 0.347 0.348

“order by created" times:

table ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 0.003 0.003 0.004 0.003 0.003 0.003
pages_10000 0.008 0.005 0.005 0.006 0.005 0.006
pages_100000 0.026 0.025 0.027 0.022 0.026 0.023
pages_1000000 0.110 0.110 0.109 0.108 0.108 0.109
pages_10000000 1.139 1.140 1.147 1.140 1.140 1.145

“order by title" times:

table ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
pages_1000 0.004 0.003 0.004 0.003 0.004 0.003
pages_10000 0.006 0.006 0.005 0.007 0.006 0.006
pages_100000 0.024 0.025 0.023 0.023 0.023 0.023
pages_1000000 0.110 0.111 0.110 0.110 0.110 0.110
pages_10000000 1.141 1.139 1.140 1.139 1.136 1.142

Results are (in my opinion) pretty clear. Let me show you comparison of one particular case, 10M index, searching with order by title, times in seconds:

index type ~ 30% word ~ 20% word ~ 10% word ~ 5% word ~ 1% word ~ 0.5% word
gist 6238.225 5392.750 5700.967 4404.192 4153.318 4926.797
gin 99.566 88.180 77.109 68.868 0.303 0.286
sphinx 1.141 1.139 1.140 1.139 1.136 1.142

While we see thet with words that return small number of records, gin index can actually be faster than sphinx, GiST is simply unusable. And as for GIN – it's indexing time is much slower.

With all this data, let me write some summary of pros and cons of both solution:

TSearch:

  • Pro:
    • you get it for free with pg
    • index is always up to date with data
    • fast results from GIN index with queries that return (without limit) small number of results
    • indexes are relatively small
  • Con:
    • Slow indexing
    • Slow results, unless your search query is (without limits) returning really small number of rows
    • Adding new search nodes requires either replicating whole database (hot standby) or just dataset, but this will force indexation on search nodes again – without ability to copy index from master

Sphinx (actually: most, if not any, search system outside of database):

  • Pro:
    • Fast indexing
    • Fast results
    • Ability to distribute index to many search nodes
    • Usually (not all, but at least some engines do support it easily) ability to partition indexes between many search nodes
  • Con:
    • indexing typically lags a bit behind database
    • another tool that needs maintenance
    • some queries are tricky (range based for example)

In my personal opinion – when you're even considering using full text search – you're (usually) worrying about performance. Given much easier scaling of search servers for Sphinx (again: or any other outside-of-database system), and the fact that virtually all shortcommings are quite easy to work around (for example range queries can be easily changed to multiple-alternatives with prefix matching), I see this as a very simple choice: use dedicated system for searches.

The only reasons I would actually consider using TSearch is when:

  • my dataset is small, so I don't really care about performance
  • I have very small number of users, and admins, so limiting number of tools becomes important
  • I absolutely, positively, need my search results to be up to date (which is virtually never the case, as in most cases when “bussiness" says “up to date" they mean “5 second old is not old". which makes it possible to work with Sphinx with ease).
  1. 29 comments

  2. # Isaac
    Oct 17, 2010

    Thanks for the in depth and detailed report! It would be interesting to compare the speed of a simple “body LIKE “%word%”” query as well, at least for the smaller row count dbs.

  3. # Jesper Krogh
    Oct 17, 2010

    Related to the “outliers” in GIN-indexing, there are a few fixes that may have interest to you not in 8.4.4:

    http://archives.postgresql.org/pgsql-hackers/2010-05/msg01743.php

    Which leads to wrong query-plans with high selectivity in the search terms.

    Combined with cost-estimating not “really implemented” in 8.4.4, which is sitting in a patch for 9.1 here:
    http://archives.postgresql.org/pgsql-hackers/2010-09/msg00400.php

    The last one fixes so PG not that eagerly switches to a Btree-scan for sorted results as earlier, which indeed is a better plan. I think that is what you see in when using large dataset, sorting and limit (as your 2 order by title/created limit 20).

    It would be nice to see that numbers with those patches. on.

    Thanks for a huge and nice wrapup of FTS in PG.

    Jesper

  4. # thomas
    Oct 17, 2010

    you missed an additional plus for in-db search: you receive a recordset back that you can use for more advanced queries, like joining to additional tables (practical usecase: fulltext search forum entries, joined against their author’s profiles, forum cateogries, excerpts etc).

    while this can also be done by external searches, getting the additional data and merging them with the fulltext search might be pretty annoying.

  5. # Oleg
    Oct 17, 2010

    Tsearch has to follow database semantics, which has very big overhead, but this is also a power of tsearch – you can do everything database provides. Sphinx and other popular search engines shine for simple search queries, when you don’t bother about consistence of your data. Once you want your search should be consistent and depends on many metadata, user roles, for example, you need internal search engine.

  6. Oct 17, 2010

    @Thomas:
    Sure. But I think that these data should be cached away anyway, so it’s not that big of a difference.

  7. Oct 17, 2010

    @Oleg:
    Hi, and thanks for comment.

    I am well aware that performance issues I see are not fault of TSearch itself, but rather are effect of constraints put in place by database environment. I hoped that it was clear from the preamble – I definitely don’t see TSearch as bad product – it’s just that it cannot be as fast as external system.

    As for running search in DB – sure. I can join tsearch queries with queries on another fields, but (at least in my experience), pg couldn’t sensibly use 2 indexes (gin/gist and some btree) for single query, so it was always using one of them, and filtered results using the other part of criteria.

  8. Oct 17, 2010

    Depesz,

    Thinks for the in-depth analysis. I still would use TSearch over sphinx for most of our use cases. I think our workloads are just different though. In our case full text search is usually a secondary search above our spatial and our complex joins. Even for e-commerce where we use no spatial, Tsearch is a secondary search. Having to merge as someone has mentioned is a BIG concern for us.

    So in short even if TSearch is slower (the speed is adequate for our needs), we prefer the simplicity of it to be able to easily integrate with our complex queries and the fact we know it will work on any OS platform we use — e.g. Windows /Linux etc without additional fuss.

    Then again if we were building a system where full text search is a primary concern and our queries are simplistic (alla a twitterific site) I think we would probably go with something like Sphinx as you suggest.

  9. # Arek
    Oct 18, 2010

    Basically I agree with Regina.

    Using Tsearch only for Full Text Search is probably not the best Idea.

    But having fts index with the rest of the data is a powerful toy.

    examples:

    1. As regina wrote: fts + spatial (find all items matching query ‘foo’ within the region ‘bar’)
    2. non fts ranking (find all items matching query ‘foo’, sorted by non-trivial expresion: fts_rank + distance_rank + featured_extra_bonus_rank )

    Sphinx can’t give you this, maybe Solr …

    Besides: I am not sure if sphinx or Solr supports ispell-based stemming.
    And this is HUGE advantage of TSearch.

  10. Oct 18, 2010

    @Arek:
    The problem I see is that even with those “great for tsearch” scenarios – if you hit “wrong” parameters – query becomes *much* slower.
    Get too many records based on region, and too many matching fts, and you’re doomed.
    This is why I showed performance for various words, with different selectivity in given table.

    Arek – as for “non fts ranking” – i’m not sure what you meant. In my example I sorted data by “non-fts-ranking” – i.e. i sorted using some other (than full text indexes) field.

    Also – some (at least solr) search systems have pretty good spatial querying.

  11. Oct 18, 2010

    Oleg & Teodor, I don’t know any Sergey here! Now I’ll read the article :)

  12. Oct 18, 2010

    So, I think you should have insisted a lot more in the pro/cons section. Tsearch is very good for selective searches and low-writes volumes, and you can’t get to use any other solution if you need consistency.

    Another thing you forgot to insist on, I think, is that e.g. the 30% matches for the 10M rows case, where you show that sphinx is so much better: it still is returning about 3M rows, right? Who wants a full text search result this big? That’s meaningless.

    All in all, I very much appreciate your article, you spent a good deal of efforts to show how the current limits of Tsearch fare compared to some non-database solution, and choosing the right solution for the right job needs those data.

    I’m just thinking the overall tone and conclusion aren’t fair. But of course we understand that you don’t like in-database full text search…

    Oh, a last note. Would you be interested in trying the same work with a pg_trgm filter, either alone or just before the tsearch one, to reduce the amount of searched data? — not sure it makes sense given the good performance of GIN for selective queries, but might help where you can’t afford its index maintenance costs and lean to GiST based approach instead.

    Regards, and keep up the good work!

  13. Oct 18, 2010

    @Dim:
    (names) oooh, I’m so sorry Teodor. Fixed, Dim – thanks a lot.

    as for 30% – remember that I returned only 50 rows.

    it’s pretty standard case for search – user can enter any criteria, and you page them by some number of rows. and user can be stupid (or purposely evil) and force some words with high number of matching rows. Besides – forget about 30% – problems start with 5%.

    As you said – I might be unfair, and I admit. I’m just seeing too many cases where people are being suggested to use tsearch as default solution for searching, where I think default should be “use dedicated search engine”, and use tsearch – only if you really know what you’re doing.

    as for pg_thrm – I’m not 100% sure what you mean by that – can you show example code/queries that illustrate this approach?

  14. # Lukasz
    Oct 18, 2010

    small correct – columns in table with xml putting results are in wrong order.

    Interesting article.
    I think that tsearch big advantage is ispell.

  15. Oct 18, 2010

    @Lukasz:
    thanks, fixed.

  16. # comboy
    Nov 6, 2010

    First of all, amazing work. Thanks. It’s really hard to find such well prepared and accurate posts.

    It’s worth noticing that numbers are still comparable when you look at the pages_1M comparison. And that’s the last case when gin index still fits in memory. When we’re talking about nice performance on lots of data, I think there’s no way that your index is not sitting in memory on the production machine. Memory is cheap nowadays, and given all pros of having all your stuff in db, I’m not convinced.

  17. Nov 6, 2010

    @Comboy:
    the biggest problem with tsearch that i see is that it is hard to scale. i.e. this part from post:

    “Adding new search nodes requires either replicating whole database (hot standby) or just dataset, but this will force indexation on search nodes again – without ability to copy index from master”

  18. Feb 18, 2011

    The newest beta (1.1.0) of Sphinx has real-time indexes, which apparently let you update them on the fly.

  19. # Nicolas Grilly
    Jun 7, 2011

    Thanks for this useful analysis! I’m wondering if all the bad performance numbers you noticed on the 10M rows dataset are not a direct consequence of MVCC concurrency. For each match in the index, PostgreSQL has to check the row visibility before returning it. This operation can be expensive when data does not fit in RAM.

    The bad performance on count(*) queries is a well known issue. It’s not specific to the FTS feature and is a direct consequence of MVCC concurrency. Other MVCC databases like for example MySQL with InnoDB “suffer” from this too. There is abundant discussions (and alternative solutions) on that topic in PostgreSQL mailing lists archive.

    Regarding the “order by created/title” queries, considering the fact that you fetch only the first 50 rows, maybe you can improve response time by just adding an index on columns “created” and “title”. We can expect the query planner to chose a plan that use that index first (because it has a greater selectivity) and check the full text GIN index only after.

    I’m really curious to know your thoughts about this. By the way, can you post the EXPLAIN associated to your queries? It can help understand the bad performance.

    Thanks again for your interesting post!

  20. Jun 7, 2011

    @Nicolas:
    As for MVCC and count(*) points – please read 2nd paragraph in the blogpost again.
    As for adding additional indexes – sure. it is possible to solve some specific cases with additional indexes, and possibly changing queries. But that’s not the point. The point is that (in my opinion) database-based full text search engines will always be slow (relatively of course).

    And as for explains – sorry – i wrote the blogpost over half a year ago, I don’t keep test datasets that long. Usually a week or two after blogpost I drop the test databases.

  21. # Nicolas Grilly
    Jun 7, 2011

    Thanks for your answer depesz.

    As for the count(*) query, there is no way to improve it in PostgreSQL. Sphinx (and other external search engines) is a definitive winner on that kind of query.

    But as for the “order by” queries, I’m not sure your benchmark is fair. Sphinx default configuration is to store attributes (“created” and “title” in your case) in a special .spa file and that file must fit into RAM in order to keep decent performance:
    http://sphinxsearch.com/docs/2.0.1/conf-docinfo.html

    It can be seen as a kind of index (I wrote “a kind” because as far as I know it’s not really an index). Sphinx probes that .spa file for each matching row, in order to select the 50 rows to fetch on the basis of timestamp or title.

    I think that a fair comparison requires to create indices on “created” and “title” columns in PostgreSQL. They are equivalent to Sphinx .spa files in terms of memory usage and they provide an efficient way to filter the matching rows similar to what Sphinx does with .spa files.

    Your point is that database-based full text search engines will always be relatively slow. But this is not what your benchark shows. PostgreSQL performance with GIN indices are very close to Sphinx (and sometimes better), except for the 10M rows dataset with queries matching 5 % rows and more. In this last case, PostgreSQL has to do 500 000 random access to the pages table, just to check the “created” or “title” column. My guess is performance decreases suddenly in this case just because the “working” dataset doesn’t fit in RAM anymore and it causes swap. Adding an index would fit into RAM and would be equivalent to what Sphinx does.

    If I’m right (and maybe I’m not), it would mean that PostgreSQL built-in full text search performance is not slow and is comparable to Sphinx performance in most cases, except for the count(*) case.

    Do you agree with this analysis or am I mistaken?

  22. Jun 7, 2011

    @Nicolas – I agree that for small datasets or very good selectivity of search phrases gin indexes are good enough. but in the apps that i’ve seen, getting more data to search in was considered a good thing, and we couldn’t trust users to use only good selectivity terms.
    so while gin would make perfect sense for someone knowing the dataset, and making smart queries, it doesn’t look good in general usage.

  23. # Nicolas Grilly
    Jun 7, 2011

    You wrote “in the apps that i’ve seen, getting more data to search in was considered a good thing, and we couldn’t trust users to use only good selectivity terms”.

    I agree with that, but I still don’t understand what makes you believe that PostgreSQL built-in FTS is inadequate in that use case. You use your benchmark to support your point. But I think that the performance drop-off you measured on the 10M rows dataset with a query mathing 5 % rows and more is typical of an application switching from memory to disk usage, just after having reached some threshold. The Sphinx benchmark doesn’t suffer from this performance drop-off because its .spa file is small enough to fit in RAM.

    This is why I’m wondering if we’re not comparing apples to oranges. I’m wondering if PostgreSQL, with adequate settings (i.e. creating indexes on columns “created” and “title”, equivalent to what Sphinx does with its .spa files), could achieve good performance even on the large dataset (10M rows) with low selectivity terms (matching 5 % or more).

    As far as I know, PostgreSQL GIN indices are just inverted indices and they are not conceptually different from inverted indices used by external search engines. So I’m not sure usage of GIN indices alone can explain the bad performance you noticed in some applications. I can think of two cases where full text search with GIN indices can show bad performance:
    - One is count(*) queries and has no solution.
    - Another is order by queries without an adequate index.

    Have you knowledge of another case where PostgreSQL FTS shows bad performance?

    I’m interested by your experience because I’m thinking of using PostgreSQL instead of Xapian to index and search a resume database :)

    Thanks!

  24. Jun 7, 2011

    @Nicolas:
    Personally, I don’t think it’s possible to get results (times) comparable to those of external search engines. You are free to do your own tests if you want, I’ll gladly read about them if they’ll prove me wrong.
    So far – we can talk, and talk, but without tests it will all be just talk. And sorry, but I’m just not willing to do the tests. I spent *a lot* of time doing tests and research to this blogpost, and I simply don’t have the time now (nor in foreseeable future) to redo it all, especially since I am *extremely* skeptical about results, because I know that even with btree indexes getting large datasets would be slow, and remember that we currently cannot use gin/gist for full text search, and btree for ordering.

  25. # Nicolas Grilly
    Jun 8, 2011

    Depez, I understand that you can’t do the tests again. Thanks for your time and advice anyway.

    Just for the record, I’ve spent some time building and testing PostgreSQL 8.4 on my machine with a dataset composed of 1 000 000 document. Document average length is 5 700 chars.

    I learned two things:

    1/ Exactly as in your benchmark, the response time is a function of query selectivity. On my machine, a query matching 5 % of documents runs in 60 ms. A query matching almost all documents runs in 430 ms. Conclusion: PostgreSQL can respond as fast as Sphinx on a good selectivity query (matching 5 % of documents), but it looks like it can be 7x slower on a low selectivity query (matching 90 % of documents).

    2/ I achieved to reproduce the performance drop-off you measured on your 10M dataset. When the number of matching rows reaches some threshold depending on your configuration, the query planner chooses a wrong plan, and decides to not use the GIN index. As a workaround I used set enable_indexscan = off; set enable_seqscan = off; to “force” the query planner to use the GIN index with a bitmap scan. It was enough to fix this issue. But it’s clearly a workaround. I think this issue is related to what Jesper wrote in a previous comment.

  26. # Misha
    Jun 9, 2011

    depesz,
    thanx agian for your work.

    in your current examples we have

    for pg: maintenance_work_mem = 2048MB

    and

    for sphinx: indexer { mem_limit = 512M }

    imho, we can significantly increase speed of sphinx index-building process (especially on large datasets) if we raise mem_limit up to 2047M (max) (respectably to pg maintenance_work_mem 2GB) and write_buffer, for example, up to 128M

    http://sphinxsearch.com/docs/manual-1.10.html
    mem_limit
    write_buffer

  27. # T.E.L
    Nov 2, 2012

    Thank you for the great article!

    We need blazing fast and not necessarily fully consistent full text search in a PostgreSQL 9.2 database which will contain somewhere between several 100 million and several billion objects — and we will need results delivered within a fraction of a second — we are currently trying tsearch, but we will definitely try an external dedicated fts engine.

  28. Nov 16, 2012

    Hello,

    As a follow-up to our one year old discussion, great news are coming from Oleg Bartunov and Alexander Korotkov:

    http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf

    They plan to add positional information to fulltext index. This will make ranking of documents very quick, and close to what we get from a dedicated fulltext search solution.

    Their slides are very interesting.

    Cheers,

    Nicolas

  29. # boris
    Sep 3, 2013

    Do you build indexes on function to_tsvector(‘english’, body) or you have a special column with to_tsvector(‘english’, body) precalculated ?
    I ran test on 9.2 and found that indexing a column with precalculated to_tsvector gives from 10 up to 50 times performance increase.

  30. Sep 3, 2013

    @Boris:
    I don’t. Simply because, as title stated, I am not a fan of tsearch.

    The difference you’re showing is weird. Technically – index on function (i think) should be faster.

    Please provide more information (test schema, script, test script) so I can look at it more.

Leave a comment