Waiting for PostgreSQL 11 – Indexes with INCLUDE columns and their support in B-tree

On 7th of April 2018, Teodor Sigaev committed patch:

Indexes with INCLUDE columns and their support in B-tree
 
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.
 
Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.
 
In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.
 
Bump catalog version
 
Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
             David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/.4010101@postgrespro.ru

This basically means we're getting full covering indexes.

The idea is that Index Only Scan was able to return only values that were part of index.

So, if I'd do:

=$ CREATE TABLE test (id serial PRIMARY KEY, some_rand int4, larger text);
CREATE TABLE
 
=$ INSERT INTO test (some_rand, larger) SELECT random() * 500000, substr(md5(i::text), 1, 10) FROM generate_series(1,10000000) i;
INSERT 0 10000000
 
=$ CREATE INDEX small_idx ON test (some_rand, id);
CREATE INDEX
 
=$ vacuum freeze analyze test;
VACUUM

Then, getting columns included in index uses Index Only Scan:

=$ EXPLAIN SELECT some_rand, id FROM test WHERE some_rand > 30000 ORDER BY some_rand, id LIMIT 20;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 LIMIT  (cost=0.43..1.00 ROWS=20 width=8)
   ->  INDEX ONLY Scan USING small_idx ON test  (cost=0.43..267048.14 ROWS=9380326 width=8)
         INDEX Cond: (some_rand > 30000)
(3 ROWS)

But if I'll add larger column to select, it can't use only index, so will switch to Index Scan:

=$ EXPLAIN SELECT some_rand, id, larger FROM test WHERE some_rand > 30000 ORDER BY some_rand, id LIMIT 20;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 LIMIT  (cost=0.43..1.55 ROWS=20 width=19)
   ->  INDEX Scan USING small_idx ON test  (cost=0.43..521828.01 ROWS=9380326 width=19)
         INDEX Cond: (some_rand > 30000)
(3 ROWS)

Of course I could add largercolumn to index:

=$ create index large_idx on test (some_rand, id, larger);
CREATE INDEX

and now the query will use Index only scan:

=$ EXPLAIN SELECT some_rand, id, larger FROM test WHERE some_rand > 30000 ORDER BY some_rand, id LIMIT 20;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 LIMIT  (cost=0.56..1.31 ROWS=20 width=19)
   ->  INDEX ONLY Scan USING large_idx ON test  (cost=0.56..350178.38 ROWS=9380218 width=19)
         INDEX Cond: (some_rand > 30000)
(3 ROWS)

But it made the index larger:

=$ SELECT relname, pg_size_pretty( pg_relation_size(oid)) FROM pg_class WHERE relname IN ('small_idx','large_idx');
  relname  | pg_size_pretty 
-----------+----------------
 large_idx | 387 MB
 small_idx | 214 MB
(2 ROWS)

Now, how about we use this new idea:

=$ DROP INDEX large_idx;
DROP INDEX
 
=$ CREATE INDEX magic_idx ON test (some_rand, id) include (larger);
CREATE INDEX

And now:

=$ EXPLAIN SELECT some_rand, id, larger FROM test WHERE some_rand > 30000 ORDER BY some_rand, id LIMIT 20;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 LIMIT  (cost=0.43..1.18 ROWS=20 width=19)
   ->  INDEX ONLY Scan USING magic_idx ON test  (cost=0.43..349650.25 ROWS=9380218 width=19)
         INDEX Cond: (some_rand > 30000)
(3 ROWS)
 
=$ SELECT pg_size_pretty( pg_relation_size('magic_idx'::regclass));
 pg_size_pretty 
----------------
 386 MB
(1 ROW)

Index is slightly smaller, and can be used for the Index Only Scan.

So, the benefit is no in size of index (1MB is ~ 0.3% of the index size). But, it makes it possible to include data for columns that can't normally be included, because they lack appropriate access method (btree in my example).

Long story short – I, personally, don't see immediate usecase for this in the databases I work with, but I definitely welcome, and am grateful, for the new feature – as I might use it in the future. Thanks a lot 🙂

4 thoughts on “Waiting for PostgreSQL 11 – Indexes with INCLUDE columns and their support in B-tree”

  1. Hm, it looks much more interesting to combine UNIQUE/PRIMARY KEY and INCLUDE:

    create table t (id int, large text, primary key (id) include (large));

    In this case you don’t need two indexes: primary key to be sure of uniqueness and (id, large) to possible use of IndexOnlyScan.

  2. The overall index is slightly smaller, but what matters is that hot index pages contain more tuples, this reduces the height of B-tree, when index includes big attributes.

  3. Your test case ain’t right.
    Your not using this feature like why your suppose to use it. Your case can be better solved with “CLUSTER”. https://www.postgresql.org/docs/10/static/sql-cluster.html
    Then you trow somehow your index away and save 50% of the space and write TPS.(Your index and Tables have equal number of columns)

    This feature is like “grouping sets,cube” very handy for OLAP usecases. Heavy denormalized table structures with allot of Columns that are not used for each use case. With incluse you can create a subset of a big table. So abusing a index for queries that use a smaller subset of synconized table.

    Cheers

  4. Before someone asks:
    Just to add wat is the difference between:
    (1) create index large_idx on test (some_rand, id, larger);
    (2) create index large_idx on test (some_rand, id) INCLUDE larger;

    In the second the is no extra “sort” check done on “larger”. Making the maintainace done by postgres less heavy by figuring out which segment to write to.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.