May 17th, 2008 by depesz

Today we have two interesting patches:

  • a patch by Teodor Sigaev and Oleg Bartunov, committed by Tom Lane, which adds partial-match capability to GIN indexes
  • a patch by Zoltan Boszormenyi, also committed by Tom, which adds a "RESTART" option to ALTER SEQUENCE, with some interesting consequences

Since describing the GIN patch will take much more time and blog-post space, I'll first go into the details of the ALTER SEQUENCE RESTART patch.

Basically, the newly added syntax lets you do something like this:

# create sequence test;
CREATE SEQUENCE
# select nextval('test');
nextval
---------
1
(1 row)
# select count(nextval('test')) from generate_series(1,10000);
count
-------
10000
(1 row)
# select nextval('test');
nextval
---------
10002
(1 row)
# alter sequence test restart;
ALTER SEQUENCE
# select nextval('test');
nextval
---------
1
(1 row)

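It doesn't seem very revolutionary. After all, you could always do:

SELECT setval('test', 1, false);

(the third argument, false, makes the next nextval() return exactly 1 instead of 2). But for that you have to know the start value of the sequence, and not all sequences start from 1.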
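
To illustrate why the start value matters, here is a minimal sketch with a sequence that does not start at 1. The sequence name test_from_100 and its values are made up for illustration. The behavior assumed is that a bare RESTART goes back to the start value recorded for the sequence, while RESTART WITH sets an explicit one.

```sql
-- Hypothetical sequence, not part of the original session.
CREATE SEQUENCE test_from_100 START WITH 100;

SELECT nextval('test_from_100');   -- first value handed out is 100
SELECT nextval('test_from_100');   -- 101

-- No need to remember the start value: RESTART goes back to it.
ALTER SEQUENCE test_from_100 RESTART;
SELECT nextval('test_from_100');   -- 100 again

-- An explicit value can still be given when needed.
ALTER SEQUENCE test_from_100 RESTART WITH 500;
SELECT nextval('test_from_100');   -- 500

DROP SEQUENCE test_from_100;
```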

What's more, having this syntax (or, to be more specific, this ability to restart a sequence) gave us something else:

# create table test (id serial primary key, something int4);
NOTICE: CREATE TABLE will create implicit sequence "test_id_seq" for serial column "test.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
# insert into test (something) values (15) returning *;
id | something
----+-----------
1 | 15
(1 row)
INSERT 0 1
# insert into test (something) select * from generate_series(1000, 10000, 5);
INSERT 0 1801
# TRUNCATE TABLE test RESTART IDENTITY;
TRUNCATE TABLE
# insert into test (something) values (50) returning *;
id | something
----+-----------
1 | 50
(1 row)
INSERT 0 1

In case you missed it: TRUNCATE can now automatically restart the sequence that generates the primary key. And this is what I find really cool.
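
As a rough sketch of what happens, RESTART IDENTITY behaves much like running ALTER SEQUENCE ... RESTART on the sequences owned by columns of the truncated table. The snippet below assumes the test table from the session above. pg_get_serial_sequence is used here only to look up the sequence name.

```sql
-- Look up the sequence behind the serial column.
SELECT pg_get_serial_sequence('test', 'id');

-- Roughly the same effect, done by hand:
TRUNCATE TABLE test;
ALTER SEQUENCE test_id_seq RESTART;

-- The default behavior, spelled out explicitly:
-- the table is emptied, but the sequence keeps counting.
TRUNCATE TABLE test CONTINUE IDENTITY;
```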

Now to the second (or first, depending on how you look at it) patch: partial matches in GIN.

Of course, the most prominent use case for GIN is TSearch2, so I'll concentrate on using partial matches in GIN with TSearch2.

First, I have a test table:

# \d pages
Table "public.pages"
Column | Type | Modifiers
--------+----------+----------------------------------------------------
id | integer | not null default nextval('pages_id_seq'::regclass)
url | text | not null
title | text |
body | text |
ft | tsvector |
Indexes:
"pages_pkey" PRIMARY KEY, btree (id)
"pages_url_key" UNIQUE, btree (url)
"tsearch_test" gin (ft)
Triggers:
tsvectorupdate BEFORE INSERT OR UPDATE ON pages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('ft', 'public.polish', 'url', 'title', 'body')
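
For anyone who wants to reproduce the setup, DDL along these lines should produce an equivalent table. This is a reconstruction from the \d output above, not the original script, and it assumes that a text search configuration named public.polish already exists.

```sql
-- Reconstructed from the \d output; not the author's original script.
CREATE TABLE pages (
    id    serial PRIMARY KEY,
    url   text NOT NULL UNIQUE,
    title text,
    body  text,
    ft    tsvector
);

-- GIN index on the precomputed tsvector column.
CREATE INDEX tsearch_test ON pages USING gin (ft);

-- Keep ft in sync with url, title and body on every write.
-- Assumes the public.polish configuration is already set up.
CREATE TRIGGER tsvectorupdate
    BEFORE INSERT OR UPDATE ON pages
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger('ft', 'public.polish', 'url', 'title', 'body');
```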

It contains some (but not much) data:

# select count(*), min(length(body)), max(length(body)), sum(length(body)) from pages;
count | min | max | sum
-------+------+--------+---------
443 | 1147 | 142286 | 2935151
(1 row)

This table contains pages from the Polish Wikipedia, so my TSearch configuration is also based on the Polish language. But it shouldn't matter in this case.

For my test I chose two words: drzwi and drzewa (door and trees).

First, let's check if tsearch can tokenize them properly:

# select to_tsvector('public.polish', 'drzwi drzewa');
to_tsvector
----------------------
'drzewo':2 'drzwi':1
(1 row)
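
The same normalization is applied on the query side, which is what lets a search for drzewa match documents indexed under drzewo. A small sketch to see it directly follows. The first query assumes the public.polish configuration used above, and the second uses the built-in simple configuration for contrast.

```sql
-- Query side: same configuration, same normalization as to_tsvector.
SELECT to_tsquery('public.polish', 'drzewa');

-- For contrast, the built-in 'simple' configuration does no stemming,
-- so the words are kept as typed (only lowercased).
SELECT to_tsvector('simple', 'drzwi drzewa');
```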

Looks fine to me. Now, let's check how fast I can search for trees with tsearch:

# explain analyze select * from pages where ft @@ to_tsquery('public.polish', 'drzewa');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using tsearch_test on pages (cost=0.00..16.31 rows=1 width=596) (actual time=0.038..0.111 rows=17 loops=1)
Index Cond: (ft @@ '''drzewo'''::tsquery)
Total runtime: 0.193 ms
(3 rows)

Pretty fast. How about doors?

# explain analyze select * from pages where ft @@ to_tsquery('public.polish', 'drzwi');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using tsearch_test on pages (cost=0.00..16.31 rows=1 width=596) (actual time=0.043..0.046 rows=1 loops=1)
Index Cond: (ft @@ '''drzwi'''::tsquery)
Total runtime: 0.099 ms
(3 rows)

Now, let's check for pages which contain either of these two words:

# explain analyze select * from pages where ft @@ to_tsquery('public.polish', 'drzwi|drzewa');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using tsearch_test on pages (cost=0.00..16.31 rows=1 width=596) (actual time=0.057..0.142 rows=18 loops=1)
Index Cond: (ft @@ '''drzwi'' | ''drzewo'''::tsquery)
Total runtime: 0.230 ms
(3 rows)

OK. Looks fine. Times are pretty good. For comparison purposes, let's check how "LIKE" will work:

# explain analyze select * from pages where (title||' '||body) like '%drzwi%';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on pages (cost=0.00..36.75 rows=1 width=596) (actual time=24.924..73.689 rows=2 loops=1)
Filter: (((title || ' '::text) || body) ~~ '%drzwi%'::text)
Total runtime: 73.746 ms
(3 rows)
# explain analyze select * from pages where (title||' '||body) like '%drzewa%';
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on pages (cost=0.00..36.75 rows=1 width=596) (actual time=9.499..72.118 rows=8 loops=1)
Filter: (((title || ' '::text) || body) ~~ '%drzewa%'::text)
Total runtime: 72.191 ms
(3 rows)
# explain analyze select * from pages where (title||' '||body) like '%drz%';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on pages (cost=0.00..36.75 rows=18 width=596) (actual time=0.180..63.877 rows=108 loops=1)
Filter: (((title || ' '::text) || body) ~~ '%drz%'::text)
Total runtime: 64.165 ms
(3 rows)

As you can see, the last LIKE test is not exactly comparable with 'drzwi|drzewa', as it also found other words, including words where "drz" is only a part of the word, not necessarily its beginning.

This new patch lets TSearch (and other applications) use GIN indexes for prefix searches. The TSearch syntax for this is 'prefix:*'. So, let's check how well (or how badly) it works:

# explain analyze select * from pages where ft @@ to_tsquery('public.polish', 'drz:*');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using tsearch_test on pages (cost=0.00..16.31 rows=1 width=596) (actual time=0.068..0.168 rows=23 loops=1)
Index Cond: (ft @@ '''drz'':*'::tsquery)
Total runtime: 0.262 ms
(3 rows)

Whoa. Pretty fast. And it found 5 new pages. I checked them manually and found that the matched tokens were:

  • drzeć
  • drzewiecki
  • drzeżdżon
  • drzwiczki
  • drzyzga

So – it works. And to give credit to Oleg and Teodor – it works really fast!
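
For a quick check of the prefix-matching semantics without any table or index, the tsvector and tsquery values can be written as literals. This is only a sketch of the matching rules, with made-up vectors. It says nothing about index usage or speed.

```sql
-- A prefix query matches any lexeme that starts with the given string.
SELECT 'drzewo:1 drzwi:2'::tsvector @@ 'drz:*'::tsquery;     -- true
SELECT 'drzewo:1 drzwi:2'::tsvector @@ 'drzw:*'::tsquery;    -- true (drzwi)
SELECT 'drzewo:1'::tsvector @@ 'drzw:*'::tsquery;            -- false

-- Prefix terms combine with the usual tsquery operators.
SELECT 'drzewo:1 las:2'::tsvector @@ 'drz:* & las'::tsquery; -- true
```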

One comment

digicon, May 23, 2008:

Sweeeeet! I have been waiting for this for some time! Thanks Oleg and Teodor!
