Waiting for 9.6 – Bloom index contrib module

On 1st of April, Teodor Sigaev committed patch:

Bloom index contrib module
 
Module provides new access method. It is actually a simple Bloom filter
implemented as pgsql's index. It could give some benefits on search
with large number of columns.
 
Module is a single way to test generic WAL interface committed earlier.
 
Author: Teodor Sigaev, Alexander Korotkov
Reviewers: Aleksander Alekseev, Michael Paquier, Jim Nasby

Bloom filters are well known, but so far we didn't have any kind of support for them in PostgreSQL.

Now, we do.

Before I will go into how to use it, quick recap on what these are:

  • space efficient, usually with fixed storage size
  • can report false positive (reports that key exists in set, while in reality it doesn't)
  • cannot report false negatives (key exists in dataset, but bloom filter says it doesn't)

So, let's test the bloom index/filter.

First, we'll need some sample data. Commit message mentions tables with large number of columns, so let's make one:

$ CREATE TABLE test (
    id serial PRIMARY KEY,
    i1 int4,
    i2 int4,
    i3 int4,
    i4 int4,
    i5 int4,
    i6 int4,
    i7 int4,
    i8 int4,
    i9 int4
);
CREATE TABLE

Now, let's add some random values there:

$ INSERT INTO test (i1, i2, i3, i4, i5, i6, i7, i8, i9)
    SELECT
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000,
        random() * 10000000
    FROM generate_series(1, 1000000);
INSERT 0 1000000

Data is, predictably, like these rows:

$ SELECT * FROM test LIMIT 10;
 id |   i1    |   i2    |   i3    |   i4    |   i5    |   i6    |   i7    |   i8    |   i9    
----+---------+---------+---------+---------+---------+---------+---------+---------+---------
  1 |  298795 |  957989 | 2143834 | 4828269 | 4994857 | 2230378 | 7920577 | 4899663 | 8482559
  2 | 3656066 | 2301554 | 6373858 | 3963390 | 3888016 | 6758963 | 2191576 | 6746389 | 2916640
  3 | 9928686 | 2978275 | 9865285 | 2404846 | 9855007 | 8135719 | 9910402 | 3883705 | 7401476
  4 | 8705270 |  771728 | 7236429 | 3226839 | 1070523 | 8194418 | 5370673 | 5898792 | 3189275
  5 | 7601051 | 3819369 | 8088938 | 6083610 | 7475435 |  390492 | 2457467 | 1438825 | 4278508
  6 | 9216430 | 3630402 | 1024897 | 2133070 | 3559088 | 4003172 | 1998355 | 5963934 | 3858179
  7 |  134074 | 5874336 | 7741884 | 7535550 | 4579606 | 8513612 | 4771979 | 7806445 | 9584136
  8 | 2966397 | 3177118 | 5482928 | 6155672 |  778170 | 9302297 | 4244611 | 6861779 | 6777733
  9 | 4635103 | 9319247 | 8216558 | 8913611 | 8535677 | 1846959 | 9938507 |  668747 | 5406047
 10 | 3941679 | 2667102 | 1369982 | 7799858 | 2801176 | 7244318 | 5541742 |  336726 | 1823924
(10 ROWS)

Now. Let's assume we want to check for existence of row with defined values for specific columns:

$ EXPLAIN analyze SELECT * FROM test WHERE i2 = 123121 AND i4=3123123;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Seq Scan ON test  (cost=0.00..23334.00 ROWS=1 width=40) (actual TIME=71.526..71.526 ROWS=0 loops=1)
   FILTER: ((i2 = 123121) AND (i4 = 3123123))
   ROWS Removed BY FILTER: 1000000
 Planning TIME: 0.162 ms
 Execution TIME: 71.544 ms
(5 ROWS)

There is no such row, but as you can see it took a moment to get the results – Pg had to seq-scan whole table.

I could have created btree index on (i2, i4), but it would not be used for search on (i3, i7).

I could create 9 indexes, one for each column, and then “pray" for PostgreSQL to use it, but it's not always that simple:

$ CREATE INDEX x1 ON test (i1);
CREATE INDEX
 
$ CREATE INDEX x2 ON test (i2);
CREATE INDEX
 
$ CREATE INDEX x3 ON test (i3);
CREATE INDEX
 
$ CREATE INDEX x4 ON test (i4);
CREATE INDEX
 
$ CREATE INDEX x5 ON test (i5);
CREATE INDEX
 
$ CREATE INDEX x6 ON test (i6);
CREATE INDEX
 
$ CREATE INDEX x7 ON test (i7);
CREATE INDEX
 
$ CREATE INDEX x8 ON test (i8);
CREATE INDEX
 
$ CREATE INDEX x9 ON test (i9);
CREATE INDEX

And now, we could:

$ EXPLAIN analyze SELECT * FROM test WHERE i1 = 123423 AND i8 = 2412;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 INDEX Scan USING x8 ON test  (cost=0.42..8.45 ROWS=1 width=40) (actual TIME=0.035..0.035 ROWS=0 loops=1)
   INDEX Cond: (i8 = 2412)
   FILTER: (i1 = 123423)
 Planning TIME: 0.932 ms
 Execution TIME: 0.085 ms
(5 ROWS)

While it is fast, please note that PostgreSQL used index on i8 column only, and filtered i1. Which could backfire if there were many rows with i8 == 2412.

Also, size of indexes becomes non-trivial:

$ SELECT oid::regclass, pg_relation_size( oid ) FROM pg_class WHERE relname ~ '^(test|x[1-9])$';
 oid  | pg_relation_size 
------+------------------
 test |         68272128
 x1   |         22487040
 x2   |         22487040
 x3   |         22487040
 x4   |         22487040
 x5   |         22487040
 x6   |         22487040
 x7   |         22487040
 x8   |         22487040
 x9   |         22487040
(10 ROWS)

Each of these indexes is 22 MB, and this is very small dataset.

So, let's see how bloom filter will work.

First, I need to load the extension:

$ CREATE extension bloom;
CREATE EXTENSION

Now, I can create index:

$ CREATE INDEX bloomidx ON test USING bloom (i1,i2,i3,i4,i5,i6,i7,i8,i9);
CREATE INDEX

Index size is:

$ SELECT pg_relation_size('bloomidx');
 pg_relation_size 
------------------
         16072704
(1 ROW)

To make Pg use it, I had to delete all these btree indexes on all i[0-9] columns, but then:

EXPLAIN analyze SELECT * FROM test WHERE i5 = 312312 AND i7 = 312311;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=17848.00..17852.02 ROWS=1 width=40) (actual TIME=10.018..10.018 ROWS=0 loops=1)
   Recheck Cond: ((i5 = 312312) AND (i7 = 312311))
   ROWS Removed BY INDEX Recheck: 1243
   Heap Blocks: exact=1162
   ->  Bitmap INDEX Scan ON bloomidx  (cost=0.00..17848.00 ROWS=1 width=0) (actual TIME=7.252..7.252 ROWS=1243 loops=1)
         INDEX Cond: ((i5 = 312312) AND (i7 = 312311))
 Planning TIME: 0.381 ms
 Execution TIME: 10.047 ms
(8 ROWS)

As you can see both columns were used, and it took ~ 10ms.

We can also see that bloom index returned 1243 false positives, which were later on ruled out by filtering on heap scan.

We could play with index parameters more. These are described in docs.

For example, we could do something like this:

$ DROP INDEX bloomidx;
DROP INDEX
 
$ CREATE INDEX bloomidx ON test USING bloom (i1,i2,i3,i4,i5,i6,i7,i8,i9) WITH ( LENGTH=20,col1=5,col2=5,col3=5,col4=5,col5=5,col6=5,col7=5,col8=5,col9=5 );
CREATE INDEX

Now, the index is:

SELECT pg_relation_size('bloomidx');
 pg_relation_size 
------------------
         46292992
(1 ROW)

46MB, but it should cause less false positives:

$ SET enable_seqscan = FALSE;
SET
 
$ EXPLAIN analyze SELECT * FROM test WHERE i5 = 312312 AND i7 = 312311;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan ON test  (cost=32604.00..32608.02 ROWS=1 width=40) (actual TIME=11.725..11.725 ROWS=0 loops=1)
   Recheck Cond: ((i5 = 312312) AND (i7 = 312311))
   ->  Bitmap INDEX Scan ON bloomidx  (cost=0.00..32604.00 ROWS=1 width=0) (actual TIME=11.722..11.722 ROWS=0 loops=1)
         INDEX Cond: ((i5 = 312312) AND (i7 = 312311))
 Planning TIME: 0.564 ms
 Execution TIME: 11.765 ms
(6 ROWS)

Yeah. No false positives. But I had to disable seq scan, because PostgreSQL assumed that seq scan will be cheaper (most likely due to size of bloom index).

Anyway – it's up for tuning, and decent understanding of what the parameters exactly mean.

All in all, even if I don't fully grok it, it's great addition to PostgreSQL, and I'd like to thank all involved 🙂