Why is it hard to automatically suggest what index to create?

Every now and then someone asks me to add index suggestions to explain.depesz.com.

I always respond with a polite decline. It is a complicated thing to do, and I just don't have the time.

Lately I was asked, on Slack, to add (to explain.depesz.com) a link to the pganalyze Index Advisor for Postgres.

So I checked it out. And the results prompted me to write this blog post.

Before I go into any details, I need to make sure that you understand: I am not bashing the fine people working on pganalyze. No. They put in the time, and got some results. But it needs to go a LONG way before it will be in a shape that I would consider usable.

For example, for the schema I typed in this table:

CREATE TABLE z (
    id serial PRIMARY KEY,
    a int4,
    b int4,
    c timestamptz,
    d text
);

and for the query I gave:

SELECT
    *
FROM
    z
WHERE
    a = 12
    AND b > 34
    AND c < '2021-10-01'
    AND d = 'done'

After pressing “ANALYZE” I was shown a recommendation:

Recommendation (3.6× faster)
USING btree (b, c, a, d)

If you're new to databases, this might look sensible. Realistically – not so much.

There is one big problem, and one that can be (potentially) easy to fix. The easy one first: columns that are compared with an inequality operator shouldn't be in the middle of an index. So columns b and c should be either on their own (with no other columns in the index), or at the end.
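Just to make that rule concrete, compare the suggested shape with shapes that follow it (index names here are made up, and this is only about column order, not yet about which columns to pick):

-- suggested: inequality columns b and c in the middle
CREATE INDEX z_suggested_idx ON z (b, c, a, d);

-- shapes that follow the rule: equality columns first,
-- a single inequality column at the end (or on its own)
CREATE INDEX z_adb_idx ON z (a, d, b);
CREATE INDEX z_c_idx ON z (c);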

But that is just the tip of the iceberg of problems.

Let's think about what indexes are potentially possible:

  • 4 versions, each on a single column
  • 6 versions on pairs of columns: (a,b) ; (a,c) ; (a,d) ; (d,a) ; (d,b) ; (d,c)
  • 4 versions on three columns: (a,d,b) ; (a,d,c) ; (d,a,b) ; (d,a,c)

And this is just the basics, as we can also add a “WHERE” clause to the index (a partial index), like:

CREATE INDEX whatever ON z (a, b) WHERE d = 'done'

This is yet another set of indexes.
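Just to show a few of the (made-up) variants this opens up – whether any of them makes sense depends entirely on how the table is actually queried:

CREATE INDEX z_ab_done_idx ON z (a, b) WHERE d = 'done';
CREATE INDEX z_b_a12_idx ON z (b) WHERE a = 12;
CREATE INDEX z_ab_old_idx ON z (a, b) WHERE c < '2021-10-01';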

Generally – one can't sensibly suggest an index, in non-trivial cases, without knowing at the very least two things:

  • What the selectivity of each condition is – that is, what percentage of all rows matches each separate condition (one way to get a rough estimate is shown right after this list)
  • Whether certain conditions are common or not. As in: do you always search for d = 'done', or not?
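For the first point you don't even need to scan the table – assuming it has been ANALYZEd, the planner's own statistics give a rough per-column picture:

SELECT
    attname,
    null_frac,          -- fraction of NULL values
    n_distinct,         -- estimated distinct values (negative means: fraction of row count)
    most_common_vals,   -- values that dominate the column
    most_common_freqs   -- and how often they appear
FROM pg_stats
WHERE tablename = 'z'
    AND attname IN ('a', 'b', 'c', 'd');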

Why does this matter?

Let's say that column a is unique. Then, whatever value I compare it with, I will get at most 1 row. This means that an index on (a) is optimal.

But let's say a is not unique, and that the selectivity of a = 12 is ~ 20%.

What then? Should I put it in an index, or not? What do I do if it has 20% selectivity on its own, but together with the condition on d it drops to 2%? Should I index (a,d)? (d,a)? (a) WHERE d = 'done'?
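Spelled out, these are the candidates being weighed (names, as always, made up):

CREATE INDEX z_ad_idx ON z (a, d);
CREATE INDEX z_da_idx ON z (d, a);
CREATE INDEX z_a_done_idx ON z (a) WHERE d = 'done';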

So, what can one do to figure out which index to create? You need to gather some more data.

For example, you can start with:

$ SELECT
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE a = 12) AS matching_a,
    COUNT(*) FILTER (WHERE b > 34) AS matching_b,
    COUNT(*) FILTER (WHERE c < '2021-10-01') AS matching_c,
    COUNT(*) FILTER (WHERE d = 'done') AS matching_d
FROM z;
 total_rows | matching_a | matching_b | matching_c | matching_d 
------------+------------+------------+------------+------------
     100000 |       6498 |        784 |      95152 |       4928
(1 ROW)

Based on this we see that the selectivity of the condition on c (this particular condition! Perhaps another condition on c would be better) is almost non-existent: 95% of the rows match it.

But it looks like the condition on b is the most selective. Let's see how the other conditions combine with it:

$ SELECT
    COUNT(*) FILTER (WHERE a = 12) AS matching_a,
    COUNT(*) FILTER (WHERE d = 'done') AS matching_d,
    COUNT(*) FILTER (WHERE a = 12 AND d = 'done' ) AS matching_ad
FROM z
WHERE b > 34;
 matching_a | matching_d | matching_ad 
------------+------------+-------------
         59 |         31 |           1
(1 ROW)

Whoa. This is great. So an index on (a, d, b) would narrow the search down to just one row.

But if you usually search for d = 'done', then an index on (a,b) WHERE d = 'done' would be better. The difference is that the index with WHERE will simply be smaller.

For example, on my test table, the index on (a, d, b) is 808kB, while the index on (a,b) WHERE d = 'done' is only 72kB. The smaller the index, the faster it is.
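If you want to check this on your own data, a simple query over the catalogs lists every index on the table together with its size:

SELECT
    c.relname AS index_name,
    pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'z'::regclass;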

What are the general rules when deciding on an index?

  • If the selectivity of a condition is low (almost all rows match) – there is no point in putting the column in the index
  • If you have a condition that you “always” use (or at least very often) – you can put it in the WHERE part of the index definition
  • Don't add too many columns to an index – it will make it larger, and less efficient. If you can get 2 columns to match 1000 rows out of a million, and adding a 3rd would reduce the match to 900 – don't add the 3rd. It will most likely be better in the long term to do an index scan on these two columns only, and filter the rest normally.
  • Remember that, generally speaking, only one inequality condition in a query can be efficiently handled by an index scan
  • Remember that any index columns placed after a column used for an inequality comparison will most likely not be used as index conditions – the conditions on them will be applied as filters instead

All things said – understanding this comes with practice. And if you don't know what to do, try various indexes, run the query through EXPLAIN ANALYZE, and compare the timings.
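A minimal version of that loop, with a made-up candidate index, could look like this:

-- create a candidate index
CREATE INDEX z_test_idx ON z (a, b) WHERE d = 'done';

-- run the query and look at the plan: is the index used at all, and do the
-- conditions show up under "Index Cond", or only under "Filter"?
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM z
WHERE a = 12
    AND b > 34
    AND c < '2021-10-01'
    AND d = 'done';

-- drop it, and try the next candidate
DROP INDEX z_test_idx;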