Waiting for 9.1 – KNNGIST

On 4th of December, Tom Lane committed really cool patch:

KNNGIST, otherwise known as order-by-operator support for GIST.

This commit represents a rather heavily editorialized version of
Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1
patches.  I redid the opclass API to add a separate Distance method
instead of turning the Consistent method into an illogical mess,
fixed some bit-rot in the rbtree interfaces, and generally worked over
the code style and comments.
There's still no non-code documentation to speak of, but I'll work on
that separately.  Some contrib-module changes are also yet to come
(right now, point <-> point is the only KNN-ified operator).
Teodor Sigaev and Tom Lane

The biggest problem with describing it is that is “just" adds some additional features that are not really visible from the SQL point of view, until someone will use it.

Luckily – very soon (same day) Tom Lane committed second patch, which adds KNNGIST usage to pg_trgm, so I have a way to show you how cool it is.

First, because I will be using pg_trgm in this example – you should know what it is. If you don't know – it is a way to match words using trigrams.

Simple example:

$ CREATE TABLE words ( word text );
$ copy words FROM '/usr/share/dict/american-english-insane';
COPY 638645

Data look like this:

$ SELECT * FROM words ORDER BY random() LIMIT 10;
(10 ROWS)

Trigrams look like:

$ SELECT show_trgm('depesz');
 {"  d"," de",dep,epe,esz,pes,"sz "}
(1 ROW)

With GiST indexes, I can find all words similar to given word:

$ CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);

and now:

$ SELECT word, similarity(word, 'depesz') AS sml
    FROM words
    WHERE word % 'depesz'
    ORDER BY sml DESC, word LIMIT 10;
  word   |   sml
 depe    |      0.5
 depel   | 0.444444
 Depew   | 0.444444
 DePew   | 0.444444
 depend  |      0.4
 Depere  |      0.4
 deperm  |      0.4
 deepest | 0.363636
 depeach | 0.363636
 depends | 0.363636
(10 ROWS)

This is all sweet, but the problem lies in how the results are obtained:

$ EXPLAIN analyze SELECT word, similarity(word, 'ing') AS sml
    FROM words
    WHERE word % 'ing'
    ORDER BY sml DESC, word
    LIMIT 10;
                                                              QUERY PLAN
 LIMIT  (cost=1650.25..1650.28 ROWS=10 width=10) (actual TIME=105.885..105.885 ROWS=10 loops=1)
   ->  Sort  (cost=1650.25..1651.85 ROWS=639 width=10) (actual TIME=105.884..105.884 ROWS=10 loops=1)
         Sort KEY: (similarity(word, 'ing'::text)), word
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan ON words  (cost=29.44..1636.45 ROWS=639 width=10) (actual TIME=104.076..105.664 ROWS=645 loops=1)
               Recheck Cond: (word % 'ing'::text)
               ->  Bitmap INDEX Scan ON trgm_idx  (cost=0.00..29.28 ROWS=639 width=0) (actual TIME=104.037..104.037 ROWS=645 loops=1)
                     INDEX Cond: (word % 'ing'::text)
 Total runtime: 105.922 ms
(9 ROWS)

Generally – PostgreSQL has to fetch all rows matching our criteria, sort it, and then get top 10.

But. With KNNGIST – PostgreSQL can get the rows already sorted in required order. Which means there is less rows to check in table data, and we can skip the sort. And it looks like this:

$ EXPLAIN analyze SELECT word, similarity(word, 'ing') AS sml
    FROM words
    WHERE word % 'ing'
    ORDER BY word <-> 'ing' LIMIT 10;
                                                          QUERY PLAN
 LIMIT  (cost=0.00..37.13 ROWS=10 width=10) (actual TIME=27.577..91.367 ROWS=10 loops=1)
   ->  INDEX Scan USING trgm_idx ON words  (cost=0.00..2372.46 ROWS=639 width=10) (actual TIME=27.575..91.362 ROWS=10 loops=1)
         INDEX Cond: (word % 'ing'::text)
         ORDER BY: (word <-> 'ing'::text)
 Total runtime: 91.406 ms
(5 ROWS)

Time difference is not big, but the table I am using is not really impressive – ~ 600k words only.

We can also try to use it with geometry queries:

$ CREATE TABLE test ( POSITION point );

Table created. Now let's insert some random points:

$ INSERT INTO test (POSITION) SELECT point( random() * 1000, random() * 1000) FROM generate_series(1,1000000);
INSERT 0 1000000

1 million points should be enough for my example. All of them have both X and Y in range <0, 1000). Now we just need the index:


And we can find some rows close to center of the points cloud:

$ SELECT *, POSITION <-> point(500,500) FROM test ORDER BY POSITION <-> point(500,500) LIMIT 10;
              POSITION               |     ?COLUMN?      
 (499.965638387948,499.452529009432) | 0.548548271254899
 (500.473062973469,500.450353138149) |  0.65315122744144
 (500.277776736766,500.743471086025) | 0.793668174518778
 (499.986605718732,500.844359863549) | 0.844466095200968
 (500.858531333506,500.130807515234) | 0.868439207229501
 (500.96702715382,499.853323679417)  | 0.978087654172406
 (500.975443981588,500.170825514942) | 0.990289007195055
 (499.201623722911,499.368405900896) |  1.01799596553335
 (498.899147845805,500.683960970491) |  1.29602394829404
 (498.38217580691,499.178630765527)  |  1.81438764851559
(10 ROWS)

And how about speed?

$ EXPLAIN analyze SELECT *, POSITION <-> point(500,500) FROM test ORDER BY POSITION <-> point(500,500) LIMIT 10;
                                                        QUERY PLAN                                                        
 LIMIT  (cost=0.00..0.77 ROWS=10 width=16) (actual TIME=0.164..0.475 ROWS=10 loops=1)
   ->  INDEX Scan USING q ON test  (cost=0.00..76512.60 ROWS=1000000 width=16) (actual TIME=0.163..0.473 ROWS=10 loops=1)
         ORDER BY: ("position" <-> '(500,500)'::point)
 Total runtime: 0.505 ms
(4 ROWS)

Now, that's something really cool. Great work!

8 thoughts on “Waiting for 9.1 – KNNGIST”

  1. Whoa. Anybody seen my jaw? I think I dropped it somewhere around here.

  2. Is it on purpose that the trigram query with and without the index are not on the same literal?

  3. @depesz: great! From reading closely, it sounds like we have to wait for involved operators to be “KNNifed”. In my case that’s the “@” operator for the cube type. Hopefully that’s not too hard for someone.

    Although, perhaps due to this approach some other method of calculating zipcode distances will be faster than cube search method.

  4. Thanks for the interesting post highlighting this new functionality? Do you know if it is possible to use this extension for nearest neighbour searches on 3D points (in my case neurons in the brain)? As far as I am aware the point type is 2D only.

  5. @Greg:
    as far as I understand, this functionality will work as long as you’ll have point3d datatype. which (afaik) you don’t have currently. I am not aware of any 3d-related datatypes and their index support.

Comments are closed.