Waiting for 8.5 – hinting for number of distinct values

Robert Haas wrote, and Tom Lane committed (on the 2nd of August), a patch which adds an interesting capability:

Log Message:
-----------
Add ALTER TABLE ... ALTER COLUMN ... SET STATISTICS DISTINCT
 
Robert Haas

The log message is rather terse, so what exactly does it do?

To plan a query, PostgreSQL has to know some basic statistics about the table. These are updated whenever you run ANALYZE on the table, and look like this:

# CREATE TABLE test (i int4, j int4);
# INSERT INTO test (i, j) SELECT CAST(random() * 10 AS int4), CAST(random() * 1000000 AS int4) FROM generate_series(1,10000);
# SELECT a.attname, s.* FROM pg_attribute a JOIN pg_statistic s ON a.attnum = s.staattnum AND a.attrelid = s.starelid WHERE a.attrelid = 'test'::regclass;
-[ RECORD 1 ]--------------------------------------------------------------------------
attname     | i
starelid    | 17567
staattnum   | 1
stanullfrac | 0
stawidth    | 4
stadistinct | 11
stakind1    | 1
stakind2    | 2
stakind3    | 3
stakind4    | 0
staop1      | 96
staop2      | 97
staop3      | 97
staop4      | 0
stanumbers1 | {0.106333,0.104,0.100667}
stanumbers2 | [NULL]
stanumbers3 | {0.114939}
stanumbers4 | [NULL]
stavalues1  | {5,9,6}
stavalues2  | {0,1,2,3,4,7,8,10}
stavalues3  | [NULL]
stavalues4  | [NULL]
-[ RECORD 2 ]--------------------------------------------------------------------------
attname     | j
starelid    | 17567
staattnum   | 2
stanullfrac | 0
stawidth    | 4
stadistinct | -0.9944
stakind1    | 1
stakind2    | 2
stakind3    | 3
stakind4    | 0
staop1      | 96
staop2      | 97
staop3      | 97
staop4      | 0
stanumbers1 | {0.000666667,0.000666667,0.000666667}
stanumbers2 | [NULL]
stanumbers3 | {0.00116388}
stanumbers4 | [NULL]
stavalues1  | {145623,493985,667920}
stavalues2  | {64,92162,194071,294372,395093,490197,590029,689411,785237,889573,999954}
stavalues3  | [NULL]
stavalues4  | [NULL]

These values are later used when planning a query (notice the rows= values for HashAggregate):

# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..170.11 rows=11 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)
 
# EXPLAIN SELECT DISTINCT j FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..269.44 rows=9944 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)
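
The link between stadistinct and the rows= estimates can be checked with a query. This is only a sketch: it reads the pg_stats view (a more readable face of pg_statistic, where the column is called n_distinct) together with pg_class.reltuples, and it assumes that a negative value is a negated fraction of the row count, which is what the -0.9944 and rows=9944 pair above suggests.

```sql
-- Sketch: turn n_distinct (stadistinct) into an absolute estimate.
-- Positive values are used as-is; negative values are treated as a
-- negated fraction of the table's estimated row count (reltuples).
SELECT s.attname,
       s.n_distinct,
       CASE
           WHEN s.n_distinct < 0 THEN -s.n_distinct * c.reltuples
           ELSE s.n_distinct
       END AS estimated_distinct
FROM pg_stats s
JOIN pg_namespace n ON n.nspname = s.schemaname
JOIN pg_class c ON c.relname = s.tablename AND c.relnamespace = n.oid
WHERE s.tablename = 'test';
```

For the test table above, estimated_distinct should come out close to the rows= values in the two plans.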

Now, ANALYZE is not really deterministic. It analyzes a random sample of the
table, so it might get the stats wrong. Just as important: you might have a
non-standard distribution of values, in which case the number of distinct
values in the statistics can be very different from reality.

So, here comes “ALTER TABLE … SET STATISTICS DISTINCT”. Basically, it forces the number of distinct values in a given column. The logic is like this:

  • if the set value is 0, the system uses the value from its own statistics (gathered by ANALYZE and stored in the pg_statistic table)
  • if the value is > 0, it is taken as the number of distinct values
  • if the value is < 0, the number of distinct values is calculated (at query planning time) as: estimated_row_count * -1 * value_set_by_user_with_alter_table
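
The three rules can be tried out as plain arithmetic. The query below is a minimal sketch, not what the planner actually runs: the column names setting and planner_distinct are made up for illustration, and 10000 stands in for the estimated row count of the test table.

```sql
-- Sketch of the rules above, with 10000 as the estimated row count.
-- NULL marks the "0" case, where the value gathered by ANALYZE is used.
SELECT setting,
       CASE
           WHEN setting > 0 THEN setting
           WHEN setting < 0 THEN 10000 * -1 * setting
           ELSE NULL
       END AS planner_distinct
FROM (VALUES (0), (5), (-0.75)) AS t(setting);
```

With a setting of -0.75 this works out to 7500 distinct values for a 10000-row table. The sketch leaves out the upper bound: the planner never expects more distinct values than there are rows.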

Let's see how it works.

First, this is how EXPLAIN looks when I haven't forced any number of distinct values:

# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..170.11 rows=11 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)

Now, let's play with it:

# ALTER TABLE test ALTER COLUMN i SET STATISTICS DISTINCT 5;
ALTER TABLE
 
# analyze test;
ANALYZE
 
# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..170.05 rows=5 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)

And let's test negative values:

# ALTER TABLE test ALTER COLUMN i SET STATISTICS DISTINCT -0.75;
ALTER TABLE
 
# analyze test;
ANALYZE
 
# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..245.00 rows=7500 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)

When confronted with nonsense values, the system either rejects them:

# ALTER TABLE test ALTER COLUMN i SET STATISTICS DISTINCT -2;
ERROR:  number of distinct values -2 is too low

or accepts them, but adjusts to reality:

# ALTER TABLE test ALTER COLUMN i SET STATISTICS DISTINCT 1000000000000;
ALTER TABLE
 
# analyze test;
ANALYZE
 
# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..270.00 rows=10000 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)

And of course we can go back to the system-computed number:

# ALTER TABLE test ALTER COLUMN i SET STATISTICS DISTINCT 0;
ALTER TABLE
 
# analyze test;
ANALYZE
 
# EXPLAIN SELECT DISTINCT i FROM test;
                           QUERY PLAN
----------------------------------------------------------------
 HashAggregate  (cost=170.00..170.11 rows=11 width=4)
   ->  Seq Scan on test  (cost=0.00..145.00 rows=10000 width=4)
(2 rows)
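
A note for readers on a released version: to my knowledge, the syntax shown in this post did not ship in this form. From PostgreSQL 9.0 on, the same override is spelled as a per-column option named n_distinct. The sketch below assumes such a version and reuses the test table from this post. It is not taken from the commit described above.

```sql
-- Assumes PostgreSQL 9.0 or later, where the override is a column option.
ALTER TABLE test ALTER COLUMN i SET (n_distinct = 5);
ANALYZE test;  -- the override takes effect at the next ANALYZE

-- Negative values are a fraction of the row count, as described above.
ALTER TABLE test ALTER COLUMN i SET (n_distinct = -0.75);
ANALYZE test;

-- Inspect the stored per-column options.
SELECT attname, attoptions
FROM pg_attribute
WHERE attrelid = 'test'::regclass AND attnum > 0;

-- Go back to the number computed by ANALYZE.
ALTER TABLE test ALTER COLUMN i RESET (n_distinct);
ANALYZE test;
```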

2 thoughts on “Waiting for 8.5 – hinting for number of distinct values”

  1. This never made it into an official release, right? You can only set the statistics target with `SET STATISTICS`.
