Waiting for 9.5 – BRIN: Block Range Indexes.

On 7th of November, Alvaro Herrera committed patch:

BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.
 
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
 
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.
 
Catalog version bumped due to new builtin catalog entries.
 
There's more that could be done here, but this is a good step forwards.
 
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
 
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
 
PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.

Information in commit log is pretty simple to understand. BRIN is new index type. One that should have much lower overhead than BTree (though possibly slower).

Things that should be considered:

  • time to create index (create index on some number of rows)
  • time to update index (update some number of rows in table that is already indexed)
  • size of index
  • time to query using the index and = operator
  • time to query using the index and between operator

To test it I prepared two tables:

$ CREATE TABLE for_btree ( id int8, payload text );
CREATE TABLE
 
$ INSERT INTO for_btree (id, payload) SELECT i, repeat('depesz', 100) FROM generate_series (1, 1000000) i;
INSERT 0 1000000
 
$ CREATE TABLE for_brin ( id int8, payload text );
CREATE TABLE
 
$ INSERT INTO for_brin (id, payload) SELECT i, repeat('depesz', 100) FROM generate_series (1, 1000000) i;
INSERT 0 1000000

Both tables are ~ 650MB in size. First, I'll create index on for_btree, 3 times in a row, and use best time.

Best time for btree: 626.859 ms.

Best time for brin: 208.754 ms.

Updates – I'll simply update 30% of the rows by changing id to 1000000 + id, like:

UPDATE for_btree SET id = 1000000 + id WHERE id < 300000;

Best time for btree: 8398.461 ms.

Best time for brin: 1398.711 ms.

Size of btree index: 28MB.

Size of brin index: 64kB. (more on it later)

And now searching. I'll check for specific id (654321) and 5% rows range (100000 … 150000), with queries:

$ SELECT COUNT(*) FROM TABLE WHERE id = 654321::int8;
 
$ SELECT COUNT(*) FROM TABLE WHERE id BETWEEN 600000::int8 AND 650000::int8;

(note: I had to add ::int8, as apparently now int8 columns and simple integers don't mix well…)

Results:

  • Btree = – 0.356 ms
  • Btree between – 9.574 ms
  • Brin = – 17.386 ms
  • Brin between – 21.090 ms

So it looks that in size, creation and update time BRIN is clear winner, but searches using it are slower than with BTree.

One additional thing – BRIN is configurable. You can make it more detailed, or less. It is done by specifying pages_per_range storage parameter. The smaller the value, the larger will be the index. And possibly faster. But updates on it will take longer.

Let's see:

pages_per_range index size create time update time search (=) time search (between) time
10000 24 kB 369.234 ms 8567.034 ms 87.606 ms 154.182 ms
1000 24 kB 373.150 ms 8812.391 ms 96.339 ms 133.546 ms
100 40 kB 360.555 ms 8867.043 ms 98.941 ms 131.126 ms
10 296 kB 392.038 ms 8619.480 ms 99.834 ms 132.327 ms
1 2800 kB 629.255 ms 8600.095 ms 114.882 ms 147.765 ms

To be honest I don't quite understand why it doesn't speed up with larger indexes (i.e. searches), maybe it's due to specific distribution of data.

In any way – the index does what it was supposed to do: be small, and speed up (some) searches. Cool stuff.

13 thoughts on “Waiting for 9.5 – BRIN: Block Range Indexes.”

  1. does BRIN work well with UUID? I think someone else wrote that it’s best for ordered data. can it be used for unique constraint (in other words replace btree for a PK)?

  2. maybe an example of when the planner would choose this index over a btree index would be helpful, if it ever would

  3. Interesting patch… Netezza does this with all (well most) columns. It is fastest on the columns that are ordered or nearly ordered, when it can also stop looking at the page range indexes.For a few rows returned it will a;most always be slower than other index types (which for the most part Netezza DOESN’T have). Ideally this SHOULD apply to queries that would otherwise result in a table scan and the page ranges essentially would avoid scanning some pages. It also might apply when the trees and other index structures are simply too big or the selectivity depends on too many columns. (These are the cases implied in the description!) I suspect the cost may need jiggering (I don’t think it would be picked currently) and optimizing the way the index pages are arranged might help.

  4. @Caleb:
    UUIDs looks like work fine.

    And you can’t use it for unique indexes:

    $ CREATE UNIQUE INDEX t ON z USING brin (x);
    ERROR:  access method "brin" does NOT support UNIQUE indexes
  5. @Craig:
    I don’t think planner will choose brin – it is more expensive to use (and it didn’t choose brin, when btree was available in my tests).

    But it (brin) might be a good option for index on rerely, searched data, that we want to have some speedup, but not at a great cost when it comes to writes.

  6. @Craig, @mrusoff:

    After reading what mrusoff wrote, I retested.

    With table that contained just integer column, with unique values in range 1-1000000 (1 million), 1 million rows, and both indexed, I tested using query like:

    select * from table where id < _THRESHOLD_ Up to threshold of 558072 planner was choosing index only scan using btree index. Then, from 558073 up to 610247 is was choosing brin index scan. from 610248 to the end it chose seq scan.

  7. We have a few large btree indexes which only exist because of queries which are run just a few times a day, but without them those queries would never finish. Brin looks great for this.

    You have a small copy-paste mistake: “Size of btree index: 64kB”, should be “Size of brin index: 64kB”.

  8. Caleb, depesz: as I understand it so far, a BRIN index over a value with a random distribution (such as a UUID) would not be as useful, since in all likelihood, every/most page ranges will end up with min_value/max_value statistics that cover the majority of the value domain.

    For example unlike a timestamp, which is generally always increasing (as in the orders example), a version 4 UUID consists almost entirely of random bytes. That means there is no correlation between physical storage time (and hence page placement) and the value of the column.

    Someone please correct me if I’m wrong. This is cool tech, I’m just reading about it for the first time. 🙂

  9. @David W:

    The correlation with time is (in my opinion irrelevant). But you’re right that brin for uuids wouldn’t be very useful, unless you used something like uuid version 1.

    Generally, I would assume that brin makes sense when there are sensible ranges in sets of values that given field can have. These could be strings, integers, inets, mostly anything. But uuid is, for the sake of argument, random data, so there are no sensible range divisions.

  10. How can you get these times “create time

    update time”

    please tell me the method ! Please !

  11. Hi,

    Thank you for this post.

    In my opinion, BRIN indexes should not be compared to B*Tree indexes as they are conceived to optimize sequential scans (less pages to visit).

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.