Waiting for 9.5 – TABLESAMPLE, SQL Standard and extensible

On 15th of May, Simon Riggs committed patch:

TABLESAMPLE, SQL Standard and extensible
 
Add a TABLESAMPLE clause to SELECT statements that allows
user to specify random BERNOULLI sampling or block level
SYSTEM sampling. Implementation allows for extensible
sampling functions to be written, using a standard API.
Basic version follows SQLStandard exactly. Usable
concrete use cases for the sampling API follow in later
commits.
 
Petr Jelinek
 
Reviewed by Michael Paquier and Simon Riggs

Getting a random sample of a table looks potentially interesting, but how does it work?

Let's make some random table:

CREATE TABLE test (
    id serial PRIMARY KEY,
    some_timestamp timestamptz,
    some_text text
);
CREATE TABLE
INSERT INTO test (some_timestamp, some_text)
    SELECT
        now() - random() * '1 year'::INTERVAL,
        'depesz #' || i
    FROM
        generate_series(1,100000) i;
INSERT 0 100000

The table is around 6MB:

                  List of relations
 Schema | Name | Type  | Owner  |  Size   | Description
--------+------+-------+--------+---------+-------------
 public | test | table | depesz | 5920 kB |
(1 row)

TABLESAMPLE has two built-in sampling methods: SYSTEM and BERNOULLI.
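
Both take, as an argument, the percentage of the table to sample. For example (a quick sketch against our test table; the counts will vary from run to run):

SELECT count(*) FROM test TABLESAMPLE SYSTEM ( 1 );
SELECT count(*) FROM test TABLESAMPLE BERNOULLI ( 1 );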

Before we go any further, we need to know how large the table is, in pages:

SELECT relpages FROM pg_class WHERE relname = 'test';
 relpages
----------
      736
(1 row)

OK. So, we have 736 pages and 100,000 rows, which means that on average a single page holds ~136 rows.
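
We can also ask PostgreSQL for this average directly, using planner statistics (a sketch; note that reltuples and relpages are estimates, kept up to date by VACUUM/ANALYZE):

SELECT reltuples::numeric / relpages AS avg_rows_per_page
FROM pg_class
WHERE relname = 'test';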

Let's say we'd like to get just 10 rows. 10 rows out of 100,000 means we want 0.0001 of the table, i.e. 0.01% (TABLESAMPLE takes a percentage), so:

EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE SYSTEM ( 0.01 );
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.016..0.021 rows=136 loops=1)
 Planning time: 0.102 ms
 Execution time: 0.045 ms
(3 rows)

That's too much – 136 rows instead of 10. Why is that?

Well, the SYSTEM TABLESAMPLE method randomly picks a single page and returns all rows from that page. This means it will be fast (pick a random value from 1-736, load that page (8kB), return all rows from it).

But it can't return less data than a single page holds.
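
We can see that the returned rows really do come from the same page by looking at the ctid system column – the first number in a ctid is the page number (just a quick sketch):

SELECT (ctid::text::point)[0]::int AS page_no, count(*)
FROM test TABLESAMPLE SYSTEM ( 0.01 )
GROUP BY 1;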

Of course we can then use secondary randomization:

EXPLAIN ANALYZE WITH x AS (SELECT * FROM test TABLESAMPLE SYSTEM ( 0.01 ))
SELECT * FROM x ORDER BY random() LIMIT 10;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.39..0.41 rows=8 width=44) (actual time=0.088..0.090 rows=10 loops=1)
   CTE x
     ->  Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.015..0.027 rows=136 loops=1)
   ->  Sort  (cost=0.31..0.33 rows=8 width=44) (actual time=0.087..0.087 rows=10 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on x  (cost=0.00..0.19 rows=8 width=44) (actual time=0.017..0.048 rows=136 loops=1)
 Planning time: 0.136 ms
 Execution time: 0.115 ms
(9 rows)

Usually using “ORDER BY random()” is slow, but here we're ordering only 136 rows, so it's fast enough.
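
As the commit message says, concrete use cases for the sampling API follow in later commits – for example, 9.5's contrib gains a tsm_system_rows module that lets you ask for an exact number of rows (still block-based, so not a fully uniform sample):

CREATE EXTENSION tsm_system_rows;
SELECT * FROM test TABLESAMPLE SYSTEM_ROWS ( 10 );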

There is a second method – BERNOULLI – that can return a smaller number of rows:

EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE BERNOULLI ( 0.01 );
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..736.08 rows=8 width=44) (actual time=0.465..2.742 rows=10 loops=1)
 Planning time: 0.107 ms
 Execution time: 2.758 ms
(3 rows)

Looks great – the number of rows is what I wanted (it will not always be exactly the given percentage, as it's random). But notice what happens when I add more data:

INSERT INTO test (some_timestamp, some_text)
    SELECT
        now() - random() * '1 year'::INTERVAL,
        'depesz #' || i
    FROM
        generate_series(100001, 1000000) i;
INSERT 0 900000

The table is now roughly 10 times larger. Times:

EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE SYSTEM ( 0.001 );
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.10 rows=10 width=25) (actual time=0.029..0.042 rows=136 loops=1)
 Planning time: 0.125 ms
 Execution time: 0.061 ms
(3 rows)

EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE BERNOULLI ( 0.001 );
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..7353.10 rows=10 width=25) (actual time=2.779..27.288 rows=15 loops=1)
 Planning time: 0.112 ms
 Execution time: 27.312 ms
(3 rows)

The time for SYSTEM tablesample is more or less the same. But in the case of BERNOULLI it's ~10x longer. Why? Basically, BERNOULLI has to seq scan the whole table and pick rows using some math, to return more or less the number of rows we want.

This means that while it is more precise, and gives every row exactly the same chance of being picked, it is slower.
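
Conceptually (this is a simplification, not the actual implementation), BERNOULLI ( p ) behaves like a sequential scan that keeps each row with probability p/100:

-- roughly what TABLESAMPLE BERNOULLI ( 0.001 ) does, conceptually:
-- 0.001% of rows = probability 0.001 / 100 per row
SELECT * FROM test WHERE random() < 0.001 / 100.0;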

In most cases, I think, SYSTEM sampling will be best, but it has to be understood that it works at the page level.
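
Also worth knowing: since sampling is random, two identical queries will generally return different rows. The SQL standard REPEATABLE ( seed ) clause, which 9.5 supports, fixes the seed so that repeated queries return the same sample (as long as the table doesn't change in the meantime):

SELECT count(*) FROM test TABLESAMPLE SYSTEM ( 0.1 ) REPEATABLE ( 42 );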

You also have to remember that TABLESAMPLE is applied before any WHERE conditions, so this query:

EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE SYSTEM ( 1 ) WHERE id < 10;

Will usually not return any rows – it first picks 1% of the pages, and only then filters them by id < 10. We would have to be lucky enough to randomly pick the very first page of the table for the filter to find any matching rows.
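
If you need to sample from a filtered set of rows, one workaround (a sketch; fine when the filtered set is small) is to filter first and then randomize, since in 9.5 TABLESAMPLE applies only to whole tables:

SELECT * FROM test WHERE id < 10 ORDER BY random() LIMIT 5;

All in all, I think it is useful. Thanks.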