December 11th, 2010 by depesz | Tags: , , , , , , , , | 8 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

On 4th of December, Tom Lane committed really cool patch:

KNNGIST, otherwise known as order-by-operator support for GIST.

This commit represents a rather heavily editorialized version of
Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1
patches. I redid the opclass API to add a separate Distance method
instead of turning the Consistent method into an illogical mess,
fixed some bit-rot in the rbtree interfaces, and generally worked over
the code style and comments.
 
There's still no non-code documentation to speak of, but I'll work on
that separately. Some contrib-module changes are also yet to come
(right now, point <-> point is the only KNN-ified operator).
 
Teodor Sigaev and Tom Lane

The biggest problem with describing it is that is “just" adds some additional features that are not really visible from the SQL point of view, until someone will use it.

Luckily – very soon (same day) Tom Lane committed second patch, which adds KNNGIST usage to pg_trgm, so I have a way to show you how cool it is.

First, because I will be using pg_trgm in this example – you should know what it is. If you don't know – it is a way to match words using trigrams.

Simple example:

$ create table words ( word text );
CREATE TABLE
 
$ copy words from '/usr/share/dict/american-english-insane';
COPY 638645

Data look like this:

$ select * from words order by random() limit 10;
word
-----------------
heterograft's
totipalmation's
Cretaceous
Snoquamish's
inca
hydremia's
slovenlinesses
Lysite
Jud's
Aquebogue's
(10 rows)

Trigrams look like:

$ select show_trgm('depesz');
show_trgm
-------------------------------------
{" d"," de",dep,epe,esz,pes,"sz "}
(1 row)

With GiST indexes, I can find all words similar to given word:

$ CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);
CREATE INDEX

and now:

$ SELECT word, similarity(word, 'depesz') AS sml
FROM words
WHERE word % 'depesz'
ORDER BY sml DESC, word limit 10;
word | sml
---------+----------
depe | 0.5
depel | 0.444444
Depew | 0.444444
DePew | 0.444444
depend | 0.4
Depere | 0.4
deperm | 0.4
deepest | 0.363636
depeach | 0.363636
depends | 0.363636
(10 rows)

This is all sweet, but the problem lies in how the results are obtained:

$ explain analyze SELECT word, similarity(word, 'ing') AS sml
FROM words
WHERE word % 'ing'
ORDER BY sml DESC, word
LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1650.25..1650.28 rows=10 width=10) (actual time=105.885..105.885 rows=10 loops=1)
-> Sort (cost=1650.25..1651.85 rows=639 width=10) (actual time=105.884..105.884 rows=10 loops=1)
Sort Key: (similarity(word, 'ing'::text)), word
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on words (cost=29.44..1636.45 rows=639 width=10) (actual time=104.076..105.664 rows=645 loops=1)
Recheck Cond: (word % 'ing'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..29.28 rows=639 width=0) (actual time=104.037..104.037 rows=645 loops=1)
Index Cond: (word % 'ing'::text)
Total runtime: 105.922 ms
(9 rows)

Generally – PostgreSQL has to fetch all rows matching our criteria, sort it, and then get top 10.

But. With KNNGIST – PostgreSQL can get the rows already sorted in required order. Which means there is less rows to check in table data, and we can skip the sort. And it looks like this:

$ explain analyze SELECT word, similarity(word, 'ing') AS sml
FROM words
WHERE word % 'ing'
ORDER BY word <-> 'ing' LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..37.13 rows=10 width=10) (actual time=27.577..91.367 rows=10 loops=1)
-> Index Scan using trgm_idx on words (cost=0.00..2372.46 rows=639 width=10) (actual time=27.575..91.362 rows=10 loops=1)
Index Cond: (word % 'ing'::text)
Order By: (word <-> 'ing'::text)
Total runtime: 91.406 ms
(5 rows)

Time difference is not big, but the table I am using is not really impressive – ~ 600k words only.

We can also try to use it with geometry queries:

$ create table test ( position point );
CREATE TABLE

Table created. Now let's insert some random points:

$ insert into test (position) select point( random() * 1000, random() * 1000) from generate_series(1,1000000);
INSERT 0 1000000

1 million points should be enough for my example. All of them have both X and Y in range <0, 1000). Now we just need the index:

$ create index q on test using gist ( position );
CREATE INDEX

And we can find some rows close to center of the points cloud:

$ select *, position <-> point(500,500) from test order by position <-> point(500,500) limit 10;
position | ?column?
-------------------------------------+-------------------
(499.965638387948,499.452529009432) | 0.548548271254899
(500.473062973469,500.450353138149) | 0.65315122744144
(500.277776736766,500.743471086025) | 0.793668174518778
(499.986605718732,500.844359863549) | 0.844466095200968
(500.858531333506,500.130807515234) | 0.868439207229501
(500.96702715382,499.853323679417) | 0.978087654172406
(500.975443981588,500.170825514942) | 0.990289007195055
(499.201623722911,499.368405900896) | 1.01799596553335
(498.899147845805,500.683960970491) | 1.29602394829404
(498.38217580691,499.178630765527) | 1.81438764851559
(10 rows)

And how about speed?

$ explain analyze select *, position <-> point(500,500) from test order by position <-> point(500,500) limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.77 rows=10 width=16) (actual time=0.164..0.475 rows=10 loops=1)
-> Index Scan using q on test (cost=0.00..76512.60 rows=1000000 width=16) (actual time=0.163..0.473 rows=10 loops=1)
Order By: ("position" <-> '(500,500)'::point)
Total runtime: 0.505 ms
(4 rows)

Now, that's something really cool. Great work!

  1. 8 comments

  2. Dec 11, 2010

    The patch is great!
    Try to increase SIGLENINT in trgm.h. I beleive that 3 in too small for effective search.

  3. # comboy
    Dec 11, 2010

    Whoa. Anybody seen my jaw? I think I dropped it somewhere around here.

  4. Dec 12, 2010

    Is it on purpose that the trigram query with and without the index are not on the same literal?

  5. Feb 3, 2011

    Wow, will we be able to use this to improve the performance of finding zipcodes nearby?

  6. Feb 3, 2011

    @Mark:
    yes. As long as you will store also geographical location per zip code.

  7. Feb 4, 2011

    @depesz: great! From reading closely, it sounds like we have to wait for involved operators to be “KNNifed”. In my case that’s the “@” operator for the cube type. Hopefully that’s not too hard for someone.

    Although, perhaps due to this approach some other method of calculating zipcode distances will be faster than cube search method.

  8. Apr 29, 2011

    Thanks for the interesting post highlighting this new functionality? Do you know if it is possible to use this extension for nearest neighbour searches on 3D points (in my case neurons in the brain)? As far as I am aware the point type is 2D only.

  9. Apr 29, 2011

    @Greg:
    as far as I understand, this functionality will work as long as you’ll have point3d datatype. which (afaik) you don’t have currently. I am not aware of any 3d-related datatypes and their index support.

Leave a comment