# Waiting for 9.6 – Cube extension kNN support

On 18th of December, Teodor Sigaev committed patch:

```Introduce distance operators over cubes:
<#> taxicab distance
<->  euclidean distance
<=> chebyshev distance

Also add kNN support of those distances in GiST opclass.

Author: Stas Kelvich```

I once worked for a company that was using unofficial patch to add this, so I'm very happy that we finally got kNN for cubes.

What this is?

Well, for starters – it's contrib extension that you can load to your database using simple:

```\$ create extension cube;
CREATE EXTENSION```

Within the extension there are 61 elements, but the basic is “cube" datatype.

It's basically a point, or a “cube" in n-dimensional space.

For example, you can have following cubes:

• ‘(12)' – single point, in one dimension
• ‘(12,34)' – single point in two dimensions
• ‘(12,34,56,78,91)' – single point in 5 dimensions
• ‘(12),(24)' – single dimension “cube" from 12 to 24
• ‘(12,34),(56,78)' – two-dimensional cube (rectangle) defined by providing opposing points
• ‘(1,2,3,4,5),(6,7,8,9,10)' – cube in five dimensions

There is no real limit on number of dimensions.

Now, what are the different distances defined by the new patch? Let's see. For example purposes, let's start with one-dimensional points:

```\$ with x as ( select '(1)'::cube as point_a, '(5)'::cube as point_b )
select
point_a <#> point_b as taxicab,
point_a <-> point_b as euclidean,
point_a <=> point_b as chebyshev
from x;
taxicab | euclidean | chebyshev
---------+-----------+-----------
4 |         4 |         4
(1 row)```

Not really interesting. What about more dimensions?

```\$ with x as ( select '(0,0)'::cube as point_a, '(16,9)'::cube as point_b )
select
point_a <#> point_b as taxicab,
point_a <-> point_b as euclidean,
point_a <=> point_b as chebyshev
from x;
taxicab |    euclidean     | chebyshev
---------+------------------+-----------
25 | 18.3575597506858 |        16
(1 row)```

This shows pretty clearly what's what. taxicab is basically sum of distances across all axes, euclidean is distance in straight line, and chebyshev is largest distance from distances on all axes.

For me, euclidean looks the most interesting. So let's stick to it.

Now, let's imagine, we have a table with 1 million 3 dimensional points, where x, y, and z are all within -5000 .. 5000 range. Or better yet, let's not image, just make it:

```\$ create table test_cubes as
select i as id,
cube(array[random() * 10000 - 5000, random() * 10000 - 5000, random() * 10000 - 5000])
from generate_series(1,1000000) i;
SELECT 1000000

\$ alter table test_cubes add primary key (id);
ALTER TABLE```

some sample data:

```\$ select * from test_cubes tablesample system ( 0.01 ) limit 10;
id   |                           cube
--------+----------------------------------------------------------
542641 | (-4178.8482805714, 4531.55778348446, -4325.33879298717)
542642 | (-4550.62797293067, 3511.47029548883, -220.161587931216)
542643 | (-4966.07152279466, 1456.31569903344, -2061.16470042616)
542644 | (-1129.34995442629, -550.684169866145, 2578.38271558285)
542645 | (4474.94647931308, 2999.37592353672, 1136.93070597947)
542646 | (-1516.54057670385, 1515.51792863756, -4341.86951257288)
542647 | (795.945581048727, 2787.07716614008, 2160.27483809739)
542648 | (1535.92750895768, -2401.27934142947, -726.663023233414)
542649 | (763.151990249753, -995.386703871191, -2518.4234790504)
542650 | (2935.75372546911, 1408.53805933148, -1084.11618042737)
(10 rows)```

Finding rows that are close to some specific point (let's say – (0,0,0)) can be done by something like this:

```select * from (
select id, cube, cube <-> cube(array[0.0,0.0,0.0]) as distance
from test_cubes tc
) as x
order by distance asc limit 10;
id   |                           cube                            |     distance
--------+-----------------------------------------------------------+------------------
915947 | (15.2430403977633, -53.2670319080353, 48.1952121481299)   | 73.4336805754876
280346 | (9.85732767730951, -80.3589634597301, -41.3774838671088)  | 90.9220880118409
243882 | (8.24892893433571, 74.6928667649627, 67.0726876705885)    | 100.726434492086
195082 | (-15.9435207024217, 69.5352349430323, -78.6104751750827)  | 106.155318087336
239258 | (-61.3826420158148, -46.3476357981563, -98.9516917616129) | 125.329044468573
661586 | (-103.007820434868, 76.9877433776855, -1.76172703504562)  | 128.611147974336
363224 | (27.2627454251051, -116.961468011141, 53.6130089312792)   | 131.520329280688
336093 | (68.2466197758913, -35.0401923060417, -123.154250904918)  | 145.094402730191
168317 | (120.665831491351, 83.5728365927935, -18.6012778431177)   | 147.954957480518
11121 | (24.2509506642818, 133.48447624594, -62.4999403953552)    | 149.373547042966
(10 rows)```

(I know I don't need subselect, just wanted to show the distance too, not only order by it.

But it's relatively slow:

```                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=42443.64..42443.67 rows=10 width=36) (actual time=352.614..352.617 rows=10 loops=1)
->  Sort  (cost=42443.64..44943.64 rows=1000000 width=36) (actual time=352.612..352.614 rows=10 loops=1)
Sort Key: ((tc.cube <-> '(0, 0, 0)'::cube))
Sort Method: top-N heapsort  Memory: 25kB
->  Seq Scan on test_cubes tc  (cost=0.00..20834.00 rows=1000000 width=36) (actual time=0.052..195.534 rows=1000000 loops=1)
Planning time: 0.403 ms
Execution time: 352.636 ms
(7 rows)```

As you can see it's fetching all rows, calculates the distance, and sorts. All expensive.

But, since 9.1 we had the ability to use kNN searches, with proper indexing. And this new patch adds this capability to cubes. So how do I use it?

```\$ create index magic on test_cubes using gist (cube);
CREATE INDEX```

And now, the same select, will have “slightly" different explain:

```\$ explain analyze
select * from (
select id, cube, cube <-> cube(array[0.0,0.0,0.0]) as distance
from test_cubes tc
) as x
order by distance asc limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.41..1.37 rows=10 width=36) (actual time=1.213..1.656 rows=10 loops=1)
->  Index Scan using magic on test_cubes tc  (cost=0.41..96296.41 rows=1000000 width=36) (actual time=1.210..1.649 rows=10 loops=1)
Order By: (cube <-> '(0, 0, 0)'::cube)
Planning time: 0.497 ms
Execution time: 1.808 ms
(5 rows)```

Of course the same index would be used for taxicab and chebyshev distances.

This is great, thanks guys.