Waiting for 9.6 – Cube extension kNN support

On 18th of December, Teodor Sigaev committed patch:

Introduce distance operators over cubes:
<#> taxicab distance
<->  euclidean distance
<=> chebyshev distance
 
Also add kNN support of those distances in GiST opclass.
 
Author: Stas Kelvich

I once worked for a company that was using unofficial patch to add this, so I'm very happy that we finally got kNN for cubes.

What this is?

Well, for starters – it's contrib extension that you can load to your database using simple:

$ CREATE extension cube;
CREATE EXTENSION

Within the extension there are 61 elements, but the basic is “cube" datatype.

It's basically a point, or a “cube" in n-dimensional space.

For example, you can have following cubes:

‘(12)' – single point, in one dimension
‘(12,34)' – single point in two dimensions
‘(12,34,56,78,91)' – single point in 5 dimensions
‘(12),(24)' – single dimension “cube" from 12 to 24
‘(12,34),(56,78)' – two-dimensional cube (rectangle) defined by providing opposing points
‘(1,2,3,4,5),(6,7,8,9,10)' – cube in five dimensions

There is no real limit on number of dimensions.

Now, what are the different distances defined by the new patch? Let's see. For example purposes, let's start with one-dimensional points:

$ WITH x AS ( SELECT '(1)'::cube AS point_a, '(5)'::cube AS point_b )
SELECT
    point_a <#> point_b AS taxicab,
    point_a <-> point_b AS euclidean,
    point_a <=> point_b AS chebyshev
FROM x;
 taxicab | euclidean | chebyshev 
---------+-----------+-----------
       4 |         4 |         4
(1 ROW)

Not really interesting. What about more dimensions?

$ WITH x AS ( SELECT '(0,0)'::cube AS point_a, '(16,9)'::cube AS point_b )
SELECT
    point_a <#> point_b AS taxicab,
    point_a <-> point_b AS euclidean,
    point_a <=> point_b AS chebyshev
FROM x;
 taxicab |    euclidean     | chebyshev 
---------+------------------+-----------
      25 | 18.3575597506858 |        16
(1 ROW)

This shows pretty clearly what's what. taxicab is basically sum of distances across all axes, euclidean is distance in straight line, and chebyshev is largest distance from distances on all axes.

For me, euclidean looks the most interesting. So let's stick to it.

Now, let's imagine, we have a table with 1 million 3 dimensional points, where x, y, and z are all within -5000 .. 5000 range. Or better yet, let's not image, just make it:

$ CREATE TABLE test_cubes AS
    SELECT i AS id,
        cube(array[random() * 10000 - 5000, random() * 10000 - 5000, random() * 10000 - 5000])
    FROM generate_series(1,1000000) i;
SELECT 1000000
 
$ ALTER TABLE test_cubes ADD PRIMARY KEY (id);
ALTER TABLE

some sample data:

$ SELECT * FROM test_cubes tablesample system ( 0.01 ) LIMIT 10;
   id   |                           cube                           
--------+----------------------------------------------------------
 542641 | (-4178.8482805714, 4531.55778348446, -4325.33879298717)
 542642 | (-4550.62797293067, 3511.47029548883, -220.161587931216)
 542643 | (-4966.07152279466, 1456.31569903344, -2061.16470042616)
 542644 | (-1129.34995442629, -550.684169866145, 2578.38271558285)
 542645 | (4474.94647931308, 2999.37592353672, 1136.93070597947)
 542646 | (-1516.54057670385, 1515.51792863756, -4341.86951257288)
 542647 | (795.945581048727, 2787.07716614008, 2160.27483809739)
 542648 | (1535.92750895768, -2401.27934142947, -726.663023233414)
 542649 | (763.151990249753, -995.386703871191, -2518.4234790504)
 542650 | (2935.75372546911, 1408.53805933148, -1084.11618042737)
(10 ROWS)

Finding rows that are close to some specific point (let's say – (0,0,0)) can be done by something like this:

SELECT * FROM (
    SELECT id, cube, cube <-> cube(array[0.0,0.0,0.0]) AS distance
    FROM test_cubes tc
) AS x
ORDER BY distance ASC LIMIT 10;
   id   |                           cube                            |     distance     
--------+-----------------------------------------------------------+------------------
 915947 | (15.2430403977633, -53.2670319080353, 48.1952121481299)   | 73.4336805754876
 280346 | (9.85732767730951, -80.3589634597301, -41.3774838671088)  | 90.9220880118409
 243882 | (8.24892893433571, 74.6928667649627, 67.0726876705885)    | 100.726434492086
 195082 | (-15.9435207024217, 69.5352349430323, -78.6104751750827)  | 106.155318087336
 239258 | (-61.3826420158148, -46.3476357981563, -98.9516917616129) | 125.329044468573
 661586 | (-103.007820434868, 76.9877433776855, -1.76172703504562)  | 128.611147974336
 363224 | (27.2627454251051, -116.961468011141, 53.6130089312792)   | 131.520329280688
 336093 | (68.2466197758913, -35.0401923060417, -123.154250904918)  | 145.094402730191
 168317 | (120.665831491351, 83.5728365927935, -18.6012778431177)   | 147.954957480518
  11121 | (24.2509506642818, 133.48447624594, -62.4999403953552)    | 149.373547042966
(10 ROWS)

(I know I don't need subselect, just wanted to show the distance too, not only order by it.

But it's relatively slow:

                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=42443.64..42443.67 ROWS=10 width=36) (actual TIME=352.614..352.617 ROWS=10 loops=1)
   ->  Sort  (cost=42443.64..44943.64 ROWS=1000000 width=36) (actual TIME=352.612..352.614 ROWS=10 loops=1)
         Sort KEY: ((tc.cube <-> '(0, 0, 0)'::cube))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan ON test_cubes tc  (cost=0.00..20834.00 ROWS=1000000 width=36) (actual TIME=0.052..195.534 ROWS=1000000 loops=1)
 Planning TIME: 0.403 ms
 Execution TIME: 352.636 ms
(7 ROWS)

As you can see it's fetching all rows, calculates the distance, and sorts. All expensive.

But, since 9.1 we had the ability to use kNN searches, with proper indexing. And this new patch adds this capability to cubes. So how do I use it?

$ CREATE INDEX magic ON test_cubes USING gist (cube);
CREATE INDEX

And now, the same select, will have “slightly" different explain:

$ EXPLAIN analyze
SELECT * FROM (
    SELECT id, cube, cube <-> cube(array[0.0,0.0,0.0]) AS distance
    FROM test_cubes tc
) AS x
ORDER BY distance ASC LIMIT 10;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 LIMIT  (cost=0.41..1.37 ROWS=10 width=36) (actual TIME=1.213..1.656 ROWS=10 loops=1)
   ->  INDEX Scan USING magic ON test_cubes tc  (cost=0.41..96296.41 ROWS=1000000 width=36) (actual TIME=1.210..1.649 ROWS=10 loops=1)
         ORDER BY: (cube <-> '(0, 0, 0)'::cube)
 Planning TIME: 0.497 ms
 Execution TIME: 1.808 ms
(5 ROWS)

Of course the same index would be used for taxicab and chebyshev distances.

This is great, thanks guys.