Waiting for PostgreSQL 10 – Implement multivariate n-distinct coefficients

I missed it completely, but on 24th of March 2017, Alvaro Herrera committed patch:

Implement multivariate n-distinct coefficients
 
 
Add support for explicitly declared statistic objects (CREATE
STATISTICS), allowing collection of statistics on more complex
combinations than individual table columns.  Companion commands DROP
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are
added too.  All this DDL has been designed so that more statistic types
can be added later on, such as multivariate most-common-values and
multivariate histograms between columns of a single table, leaving room
for permitting columns on multiple tables, too, as well as expressions.
 
This commit only adds support for collection of n-distinct coefficient
on user-specified sets of columns in a single table.  This is useful to
estimate number of distinct groups in GROUP BY and DISTINCT clauses;
estimation errors there can cause over-allocation of memory in hashed
aggregates, for instance, so it's a worthwhile problem to solve.  A new
special pseudo-type pg_ndistinct is used.
 
(num-distinct estimation was deemed sufficiently useful by itself that
this is worthwhile even if no further statistic types are added
immediately; so much so that another version of essentially the same
functionality was submitted by Kyotaro Horiguchi:
https://postgr.es/m/.173334..horiguchi.kyotaro@lab.ntt.co.jp
though this commit does not use that code.)
 
Author: Tomas Vondra.  Some code rework by Álvaro.
 
    Ideriha Takeshi
Discussion: https://postgr.es/m/.4080608@fuzzy.cz
    https://postgr.es/m/.ixlaueanxegqd5gr@alvherre.pgsql

Afterwards, there were a couple more commits related to it:

  • On 5th of April 2017, patch committed by Simon Riggs
  • On 17th of April 2017, patch committed by Alvaro Herrera
  • On 12th of May 2017, patch committed by Alvaro Herrera

So what is this about? The phrase from the first line, “Implement multivariate n-distinct coefficients”, seems pretty opaque.

In reality it's rather simple. As you know – PostgreSQL chooses what to do with a query (or rather, how to get its results) based on estimates of how many rows will match certain parts of the query.

So far PostgreSQL used only statistics on single columns, and if it had to produce an estimate based on conditions on two columns, it involved some black magic (well, math, but it's pretty similar to magic).

You could see the problem with a query like this:

$ CREATE TABLE test AS
SELECT i, i % 100 AS i_100, i % 500 AS i_500
FROM generate_series(1,100000) i;
SELECT 100000

The data looks like this:

$ SELECT * FROM test ORDER BY random() LIMIT 20;
   i   | i_100 | i_500 
-------+-------+-------
 21153 |    53 |   153
 59239 |    39 |   239
 64049 |    49 |    49
 33313 |    13 |   313
 25366 |    66 |   366
 31794 |    94 |   294
 39991 |    91 |   491
 49162 |    62 |   162
 71951 |    51 |   451
 82639 |    39 |   139
 61799 |    99 |   299
  5174 |    74 |   174
 13963 |    63 |   463
 31412 |    12 |   412
  4685 |    85 |   185
 35444 |    44 |   444
 45713 |    13 |   213
 13991 |    91 |   491
 15509 |     9 |     9
 99022 |    22 |    22
(20 ROWS)
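
A detail worth spelling out: since 500 is a multiple of 100, i_100 is always equal to i_500 % 100 – the second column fully determines the first. So there are only 500 distinct (i_100, i_500) combinations, each matching 100000 / 500 = 200 rows, which is easy to verify:

$ SELECT COUNT(DISTINCT (i_100, i_500)) FROM test;
 count 
-------
   500
(1 ROW)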

So the values are pretty predictable. And Pg can estimate counts pretty well in simple cases:

$ EXPLAIN analyze SELECT * FROM test WHERE i_100 = 34;
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..1791.00 ROWS=1010 width=12) (actual TIME=0.035..23.716 ROWS=1000 loops=1)
   FILTER: (i_100 = 34)
   ROWS Removed BY FILTER: 99000
 Planning TIME: 0.422 ms
 Execution TIME: 23.879 ms
(5 ROWS)
 
$ EXPLAIN analyze SELECT * FROM test WHERE i_500 = 78;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..1791.00 ROWS=200 width=12) (actual TIME=0.022..11.706 ROWS=200 loops=1)
   FILTER: (i_500 = 78)
   ROWS Removed BY FILTER: 99800
 Planning TIME: 0.057 ms
 Execution TIME: 11.737 ms
(5 ROWS)

Please note that the estimated “rows=” and the actual “rows=” (in the actual part of the explain output) are very similar.
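
These estimates come straight from the per-column statistics gathered by analyze – specifically from n_distinct, which you can peek at in the standard pg_stats view:

$ SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'test';

For this table it should report (roughly) 100 for i_100 and 500 for i_500, plus -1 for i (meaning “all values are distinct”). 100000 rows divided by 100 or 500 distinct values gives pretty much exactly the ~1000 and 200 row estimates above.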

But what will happen if I try to use two columns at once?

$ EXPLAIN analyze SELECT * FROM test WHERE i_100 = 91 AND i_500 = 491;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..2041.00 ROWS=2 width=12) (actual TIME=0.172..24.154 ROWS=200 loops=1)
   FILTER: ((i_100 = 91) AND (i_500 = 491))
   ROWS Removed BY FILTER: 99800
 Planning TIME: 0.453 ms
 Execution TIME: 24.233 ms
(5 ROWS)

The count is really off – the planner thinks there are two rows, but there are 200 of them.
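
Where does the 2 come from? Without multi-column statistics the planner simply assumes the columns are independent and multiplies the two single-column selectivities:

$ SELECT 100000 * (1/100.0) * (1/500.0) AS independent_estimate;
-- 100000 rows * 1/100 * 1/500 = 2

For truly independent columns that would be the right answer – but here the columns are strongly correlated, so it's off by a factor of 100.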

It can also be thrown off by providing values that clearly can't happen together:

$ EXPLAIN analyze SELECT * FROM test WHERE i_100 = 91 AND i_500 = 17;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..2041.00 ROWS=2 width=12) (actual TIME=24.990..24.990 ROWS=0 loops=1)
   FILTER: ((i_100 = 91) AND (i_500 = 17))
   ROWS Removed BY FILTER: 100000
 Planning TIME: 0.458 ms
 Execution TIME: 25.045 ms
(5 ROWS)

The difference (2 vs. 0) might not seem like much, but it would be so much better if we could get a proper estimate.

Well, now, in Pg 10, there will be a chance:

$ CREATE STATISTICS test_stats ON i_100, i_500 FROM test;
$ analyze test;
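
The new statistics object can be inspected in the pg_statistic_ext catalog (column names as of PG 10). Since I didn't list any statistic kinds, both kinds available in PG 10 – ndistinct, plus the functional dependencies added in one of the follow-up commits listed above – should get built:

$ SELECT stxname, stxkind, stxndistinct, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'test_stats';
-- stxndistinct should show that the (i_100, i_500) pair has ~500 distinct
-- combinations – far fewer than the "independent" guess of 100 * 500 = 50000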

And now the “bad” queries:

$ EXPLAIN analyze SELECT * FROM test WHERE i_100 = 91 AND i_500 = 491;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..2041.00 ROWS=199 width=12) (actual TIME=0.141..25.531 ROWS=200 loops=1)
   FILTER: ((i_100 = 91) AND (i_500 = 491))
   ROWS Removed BY FILTER: 99800
 Planning TIME: 0.545 ms
 Execution TIME: 25.617 ms
(5 ROWS)
 
$ EXPLAIN analyze SELECT * FROM test WHERE i_100 = 91 AND i_500 = 17;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..2041.00 ROWS=199 width=12) (actual TIME=11.803..11.803 ROWS=0 loops=1)
   FILTER: ((i_100 = 91) AND (i_500 = 17))
   ROWS Removed BY FILTER: 100000
 Planning TIME: 0.081 ms
 Execution TIME: 11.825 ms
(5 ROWS)

First query shows a clearly improved estimate. Second – not so much. Why? To be honest, no idea. If someone can enlighten me, it would be really great. In any case – these are just the first steps into getting proper multi-column statistics, so I'm very happy about it, and enthusiastic about the future – even if some bits of the functionality don't (yet) seem to work as expected (by me). It's a long road, but I'm very grateful that we started it. Thanks a lot to all involved.
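
One more thing worth showing: per the commit message, the main use of the n-distinct coefficients is estimating the number of distinct groups in GROUP BY and DISTINCT clauses. A quick way to see the effect (exact numbers will vary with the sampled statistics):

$ EXPLAIN SELECT i_100, i_500 FROM test GROUP BY i_100, i_500;

Without test_stats the planner would multiply the per-column distinct counts and expect about 100 * 500 = 50000 groups; with the ndistinct statistic in place the estimate should drop to something close to the real 500 – which matters for things like sizing hash aggregates.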

One thought on “Waiting for PostgreSQL 10 – Implement multivariate n-distinct coefficients”

  1. The multivariate feature in PG 10 only records a single number indicating the correlation of the columns – it doesn’t record the probability of specific column combinations. That feature is hopefully coming in PG 11.

    Your second-to-last query hit a common combination (both ending in 1) while your last query has no matching rows (the last digit is different).