Hypothetical indexes

There exists an extension to PostgreSQL, which lets you use hypothetical indexes.

What are there? That's simple – these are indexes that don't really exist. So what good are they?

Let's see.

First, I will need some test data. That will be relatively simple:

$ CREATE TABLE test_data (
    id serial PRIMARY KEY,
    some_timestamp timestamptz,
    some_integer int4,
    other_integer int4,
    another_integer int4
);
$ INSERT INTO test_data (some_timestamp, some_integer, other_integer, another_integer)
    SELECT
        now() - random() * '10 years'::INTERVAL,
        random() * 10,
        random() * 100,
        random() * 1000
    FROM
        generate_series(1,100000000);

100 million records, but the table is relatively small:

\dt+ test_data
                      List OF relations
 Schema |   Name    | TYPE  | Owner  |  SIZE   | Description 
--------+-----------+-------+--------+---------+-------------
 public | test_data | TABLE | depesz | 5746 MB | 
(1 ROW)

Now, let's assume we're considering adding index on (some_integer, other_integer), but we'd like to know if it would be used for queries that look for other_integer only.

Normally, I'd have to, on some side table/database:

$ CREATE INDEX test_index ON test_data (some_integer, other_integer);
CREATE INDEX
TIME: 243084.367 ms
 
$ EXPLAIN SELECT * FROM test_data WHERE other_integer = 50;
                             QUERY PLAN                              
---------------------------------------------------------------------
 Seq Scan ON test_data  (cost=0.00..1985295.00 ROWS=933401 width=24)
   FILTER: (other_integer = 50)
(2 ROWS)
 
TIME: 19.342 ms

The answer is “no, it will not use such index" – but the index creation was long, and pretty expensive in terms of IO and CPU.

Thanks to hypopg, I should be able to get the same answer in fraction of the time. So let's see how it works (I of course dropped this test index).

To install, I need to:

  1. sudo apt-get install pgxnclient
  2. pgxn install –unstable hypopg

First step install pgxn tool, which is then used to install actual utility.

To have it working, I have to have certain packages installed (basically pg headers), but this should be trivial to do by appropriate “apt-get install" or “yum install".

Installation took couple of seconds. Afterwards, I can:

$ CREATE extension hypopg;
CREATE EXTENSION

Which loads the code to my test database.

Now, sanity check that test_data table is without this additional index:

\d test_data
                                      TABLE "public.test_data"
     COLUMN      |           TYPE           |                       Modifiers                        
-----------------+--------------------------+--------------------------------------------------------
 id              | INTEGER                  | NOT NULL DEFAULT NEXTVAL('test_data_id_seq'::regclass)
 some_timestamp  | TIMESTAMP WITH TIME zone | 
 some_integer    | INTEGER                  | 
 other_integer   | INTEGER                  | 
 another_integer | INTEGER                  | 
Indexes:
    "test_data_pkey" PRIMARY KEY, btree (id)

OK. With this in place, I can create the hypothetical index. This is done using function call:

$ SELECT hypopg_create_index('create index on test_data (some_integer, other_integer)');
                    hypopg_create_index                    
-----------------------------------------------------------
 (16719,<16719>btree_test_data_some_integer_other_integer)
(1 ROW)

This took ~ 0.5s.

Now, I can try the explain again:

$ EXPLAIN SELECT * FROM test_data WHERE other_integer = 50;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Seq Scan ON test_data  (cost=0.00..1985295.10 ROWS=1210000 width=24)
   FILTER: (other_integer = 50)
(2 ROWS)

Index is not used. But perhaps I did something wrong? let's try to make index what will definitely be used:

$ SELECT hypopg_create_index('create index on test_data (some_integer, other_integer, another_integer)');
                           hypopg_create_index                           
-------------------------------------------------------------------------
 (16721,<16721>btree_test_data_some_integer_other_integer_another_integ)
(1 ROW)
 
$ EXPLAIN SELECT * FROM test_data WHERE some_integer = 5 AND other_integer = 50 AND another_integer = 500;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 INDEX Scan USING <16721>btree_test_data_some_integer_other_integer_another_integ ON test_data  (cost=0.07..464.07 ROWS=115 width=24)
   INDEX Cond: ((some_integer = 5) AND (other_integer = 50) AND (another_integer = 500))
(2 ROWS)

Hypopg works (from what I can tell) by modifying system catalog (information about indexes), but only for current backend. So whatever you're testing in your connection – it will not influence other, “normal" connections.

Also, as soon as you close your connection – the hypothetical indexes you created are gone. Of course you can also drop them manually (by calling hypopg_drop_index() function), but I would say that generally there is no point in doing it – since it all will be cleaned as soon as you'll disconnect.

All in all, it looks great, and while in a lot of cases it's trivial to tell if the index will be used, for all the not-so-obvious cases, hypopg will be great tool.

5 thoughts on “Hypothetical indexes”

  1. Nice 🙂 How does PG react when you actually run a query that tries to use an hypothetical index ?

  2. It would be nice if it actually works when i run the query. I tried it with the both types of index and the results were not acceptable.

    hypo- index:

    explain ANALYZE VERBOSE select * from test_data where some_integer = 5 and other_integer = 50 and another_integer = 500;
    +—————————————————————————————————————————-+
    | QUERY PLAN |
    +—————————————————————————————————————————-+
    | Seq Scan on pg_temp_2.test_data (cost=0.00..2485295.00 rows=13 width=24) (actual time=710.865..10830.540 rows=91 loops=1) |
    | Output: id, some_timestamp, some_integer, other_integer, another_integer |
    | Filter: ((test_data.some_integer = 5) AND (test_data.other_integer = 50) AND (test_data.another_integer = 500)) |
    | Rows Removed by Filter: 99999909 |
    | Planning time: 0.070 ms |
    | Execution time: 10830.618 ms |
    +—————————————————————————————————————————-+

    Standard index
    EXPLAIN ANALYZE VERBOSE select * from test_data where some_integer = 5 and other_integer = 50 and another_integer = 500;
    +————————————————————————————————————————————+
    | QUERY PLAN |
    +————————————————————————————————————————————+
    | Index Scan using test_index on pg_temp_2.test_data (cost=0.57..56.86 rows=13 width=24) (actual time=1.961..8.133 rows=91 loops=1) |
    | Output: id, some_timestamp, some_integer, other_integer, another_integer |
    | Index Cond: ((test_data.some_integer = 5) AND (test_data.other_integer = 50) AND (test_data.another_integer = 500)) |
    | Planning time: 0.091 ms |
    | Execution time: 8.172 ms |
    +————————————————————————————————————————————+

  3. I’m not sure how to respond to your questions. Pg cannot run queries using hypothetical indexes because they don’t exist. So when you actually run a query, for example by using explain analyze, it’s being run as if the index didn’t exist. Because it doesn’t.

  4. Yes, the hypothetical indexes only exists when a simple EXPLAIN is called. If you use ANALYZE, the hypothetical indexes are simply skipped.
    In fact, the hypothetical are not directly stored in the catalog but in memory. A hook on explain adds them to the planner’s list of indexes when called with a simple EXPLAIN. See the hypo_get_relation_info_hook function for more details.

  5. Ah that’s neat, explain statements and normal statements get a different view of the list of indexes, so there’s no error, thanks.

Comments are closed.