September 9th, 2010 by depesz | Tags: , , , | 19 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

This question (and its variants) show quite often on #postgresql on IRC. People get sequential scans, and are worried that it's slow and bad.

So, I hope that this blogpost will shed some light on the subject why indexes are being chosen to be used, or not.

Before I will go deep into explain plans, queries, statistics, let's ask: what is an index? Is it magic? Is it free?

Index is data structure, created alongside of table, that is used to speed up searches or sorts.

It's not free (index has to be kept up to date, so every insert/update/delete operation gets a bit more expensive.

So, how does PostgreSQL use index, when it chooses to?

Let's assume we have some table, with column “whatever" (integer, not unique), and index on it.

If PostgreSQL will choose to use it (let's assume the query is: SELECT * FROM table WHERE whatever = 123), it will:

  • find record in index
  • based on information from index, it will find appropriate page and record in table (heap data)

It looks great, but we have to understand that “find record in index" is relatively complicated (although usually fast) operation, which can easily involve multiple random page reads.

What do I mean by random page reads? When we'll go down to disk level, we can basically have 2 different kinds of access:

  • sequential – next block of data is read just after previous, so no disk head movement is needed
  • random – next block is read from random place on disk, which required head movement, and thus is slower (think: average seek time in disk technical specification)

So, even in best case, index scan requires (from disk):

  • seek to where index is
  • fetch page from index
  • seek to where table is
  • fetch page from table

Of course disk cache and/or shared_buffers help with this, but it's still “a lot of" operations.

On the other hand – best case for sequential scan is simpler:

  • seek to where table is
  • fetch page from table

Much nicer.

This can mean that if table is smaller than 2 pages (page is 8192 bytes, at least unless you modified it), PostgreSQL shouldn't use index at all! In reality it's probably a bit more complicated, so the easiest way to tell, is simply to test it:

$ create table test (whatever integer);
CREATE TABLE
 
$ create index testi on test (whatever);
CREATE INDEX
 
$ insert into test (whatever) select i from generate_series(1,10) i;
INSERT 0 10
 
$ select pg_relation_size('test');
pg_relation_size
------------------
8192
(1 row)
 
$ analyze test;
ANALYZE
 
$ explain select * from test where whatever = 5;
QUERY PLAN
----------------------------------------------------
Seq Scan on test (cost=0.00..1.12 rows=1 width=4)
Filter: (whatever = 5)
(2 rows)

As you can see, I ran analyze manually (to avoid having to wait for pg_autovacuum. Analyze is very important, and I will show it a bit later.

As expected – for such small table – index was not used.

After some test with various row counts, I found the exact threshold for this particular Pg to switch to index scan:

$ truncate test;
TRUNCATE TABLE
 
$ insert into test (whatever) select i from generate_series(1,501) i;
INSERT 0 501
 
$ analyze test;
ANALYZE
 
$ explain select * from test where whatever = 5;
QUERY PLAN
----------------------------------------------------
Seq Scan on test (cost=0.00..8.26 rows=1 width=4)
Filter: (whatever = 5)
(2 rows)
 
$ select pg_relation_size('test');
pg_relation_size
------------------
16384
(1 row)
 
$ insert into test (whatever) values (502);
INSERT 0 1
 
$ analyze test;
ANALYZE
 
$ select pg_relation_size('test');
pg_relation_size
------------------
16384
(1 row)
 
$ explain select * from test where whatever = 5;
QUERY PLAN
------------------------------------------------------------------
Index Scan using testi on test (cost=0.00..8.27 rows=1 width=4)
Index Cond: (whatever = 5)
(2 rows)

That was the first situation where PostgreSQL can sensibly choose not to use index.

Next situation. Remember that random reads from disk are much more expensive than sequential ones (due to seeks). This means that if we are fetching more than some percent of the page – usage of the index is not sensible.

Let's test this threshold. For this I'll need much larger test table:

$ truncate test;
TRUNCATE TABLE
 
$ insert into test (whatever) select generate_series(1,10000000);
INSERT 0 10000000
 
$ analyze test
ANALYZE
 
$ select pg_relation_size('test');
pg_relation_size
------------------
321257472
(1 row)

With this table, and this simple one-liner:

perl -e 'for (1..100) { printf "%3u : ",$_; my $thr = 10000000 / 100 * $_; system(qq{psql -qAt -c "explain analyze select * from test where whatever <= $thr" | head -n 1})}'

I can test when index is being used, and when it's switching to seq scan.

And it seems that the threshold in my case was ~ 55% of the table:

$ explain analyze select * from test where whatever < 5400000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using testi on test (cost=0.00..163836.29 rows=5424809 width=4) (actual time=0.100..9531.022 rows=5399999 loops=1)
Index Cond: (whatever < 5400000)
Total runtime: 17338.177 ms
(3 rows)
 
$ explain analyze select * from test where whatever < 5500000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..164217.00 rows=5525712 width=4) (actual time=0.056..10039.352 rows=5499999 loops=1)
Filter: (whatever < 5500000)
Total runtime: 17824.766 ms
(3 rows)

Please note how close the times are. With this in mind – let's force index scan and test 55%:

$ explain analyze select * from test where whatever < 5500000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using testi on test (cost=0.00..166882.09 rows=5525712 width=4) (actual time=0.145..9589.029 rows=5499999 loops=1)
Index Cond: (whatever < 5500000)
Total runtime: 17553.429 ms
(3 rows)

Nice. Time for index scan is actually a bit faster, but not really much. I will tell how to fix this small “problem" (i.e. seq scan should be chosen a bit later, perhaps at 60%) later in the post.

So, right now we know that index will not be used if:

  • table is too small
  • we get too many records (as percent of the table – getting 7million rows from 10 million rows is faster with seq scan, but getting the same 7 million rows from 1 billion row table should use index).

Another situation is that not all operations are indexable. For the very simple case – like operator, with ‘%' in the beginning of pattern is not indexable (you can work around using reverse, but still – with like ‘%whatever%' – you can't use index).

So, when you're using special data types, or non-trivial operators – always check if it's at all possible to index it.

Last situation is simple to explain, but not always simple to understand.

Let's assume that you have:

select * from test where whatever + 2 < 5;

That's the same as:

select * from test where whatever < 3;

Right? No!

$ explain select * from test where whatever + 2 < 5;
QUERY PLAN
------------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000189217.20 rows=3333360 width=4)
Filter: ((whatever + 2) < 5)
(2 rows)
 
$ explain select * from test where whatever < 3;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using testi on test (cost=0.00..37.63 rows=1000 width=4)
Index Cond: (whatever < 3)
(2 rows)

Why is that so?

Operations are done using functions. So basically “whatever + 2″ is the same as “int4pl(whatever, 2)" – i.e. call to function int4pl, with 2 arguments.

PostgreSQL doesn't know what which function does. So, knowing that name of the function is int4pl, doesn't mean anything to it, so it can't modify the query to the later form, and it has to assume that int4pl() function can return anything. Which means – we can't use index on (column) for search for functionname(column)!

This means that if you want your index to be used – you have to keep the indexed operation “alone" on it's side of comparison operator.

While example with “+2″ might sound trivial, I have seen more than often queries like:

select * from table where timestamp_column + '5 minutes'::interval < now()

Which has exactly the same problem!

Now, while we are at functions – let's assume you want case insensitive search. This is done using:

select * from table where lower(text_field) = lower('some_value')

Which is all nice, but index on text_field cannot be used for it! Luckily there is solution: functional index:

create index some_name on table ( lower(text_field) )

As you can see this creates index not on column, but on result of function call with value from the column. This is very cool. But does it always help?

Let's assume you want to find rows from given week. And by given I mean: you give (as parameter to the query) some timestamp, and you want all records that happened in the same week as the timestamp you gave. Example query:

$ explain analyze select * from test where date_trunc( 'week', whatever ) = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000002193.00 rows=500 width=8) (actual time=0.727..208.189 rows=425 loops=1)
Filter: (date_trunc('week'::text, whatever) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
Total runtime: 208.888 ms
(3 rows)

Table now looks like this:

$ \d test
Table "public.test"
Column | Type | Modifiers
----------+--------------------------+-----------
whatever | timestamp with time zone |
Indexes:
"test1" btree (whatever)
 
$ select count(*), min(whatever), max(whatever) from test;
count | min | max
--------+-------------------------------+-------------------------------
100000 | 2005-09-09 09:48:57.190155+02 | 2010-09-09 08:56:04.841355+02
(1 row)

Now let me ask you – what index to crate to be able to use it?

If you said: it's trivial: index on date_trunc( ‘week', whatever ) – well, you failed:

$ create index w on test (date_trunc('week', whatever));
ERROR: functions in index expression must be marked IMMUTABLE

What is this? Well, let's assume extreme case – can you sensibly make index on random() ? No. It changes all the time. And to be able to make index on something – we need to make sure that the value of this something doesn't change on its own. That's why you can't index on random().

But why you can't index on date_trunc( ‘week', whatever ) ? It cannot change on its own (without change in table). Right? Wrong!

$ select date_trunc( 'week', '2010-09-06 08:56:04.841355+02'::timestamptz );
date_trunc
------------------------
2010-09-06 00:00:00+02
(1 row)
 
$ set timezone = 'America/Los_Angeles';
SET
 
$ select date_trunc( 'week', '2010-09-06 08:56:04.841355+02'::timestamptz );
date_trunc
------------------------
2010-08-30 00:00:00-07
(1 row)

As you can see – value of date_trunc, with the same arguments, depend on some external data (namely: client time zone).

Some people in this case create wrapper function, which calls this function (date_trunc), but they mark the wrapper function as immutable.

It will work, but it's a really nice, and hard to debug, way to shoot yourself in the foot, in case your application will get changed later to accommodate various time zones.

What is the correct way to solve it? Unfortunately, it's a bit more convoluted, but it works:

$ create index w on test (date_trunc('week', whatever at time zone 'UTC'));
CREATE INDEX

Instead of using utc, you can use any time zone, as long as it's one, hardcoded timezone.

The problem with it is that you have to use exactly the same expression later on in queries:

$ explain analyze select * from test where date_trunc( 'week', whatever ) = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..2193.00 rows=500 width=8) (actual time=0.727..210.051 rows=425 loops=1)
Filter: (date_trunc('week'::text, whatever) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
Total runtime: 210.890 ms
(3 rows)
 
$ explain analyze select * from test where date_trunc( 'week', whatever at time zone 'UTC') = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=12.14..486.23 rows=500 width=8) (actual time=1.060..3.214 rows=426 loops=1)
Recheck Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
-> Bitmap Index Scan on w (cost=0.00..12.02 rows=500 width=0) (actual time=0.915..0.915 rows=426 loops=1)
Index Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
Total runtime: 4.289 ms
(5 rows)

Luckily you can wrap it in simple SQL function:

$ create function start_of_utc_week( timestamptz ) returns timestamp as $$ select date_trunc( 'week', $1 at time zone 'UTC' ) $$ language sql immutable;
CREATE FUNCTION
 
$ create index w on test ( start_of_utc_week( whatever ));
CREATE INDEX
 
$ explain analyze select * from test where start_of_utc_week(whatever) = start_of_utc_week( '2010-02-04 12:34:56+02'::timestamptz );
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=12.14..484.98 rows=500 width=8) (actual time=0.561..2.699 rows=426 loops=1)
Recheck Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = '2010-02-01 00:00:00'::timestamp without time zone)
-> Bitmap Index Scan on w (cost=0.00..12.02 rows=500 width=0) (actual time=0.409..0.409 rows=426 loops=1)
Index Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = '2010-02-01 00:00:00'::timestamp without time zone)
Total runtime: 3.770 ms
(5 rows)

( of course I don't have to use start_of_utc_week call on the right side of comparison, but I choose to, to make it simpler ).

In the beginning of the post I wrote that analyze is important. Why is it? Let me show you simple example. I turned off autovacuum, and restarted PostgreSQL, and then ran this:

$ create table test (whatever int4);
CREATE TABLE
 
$ create index test1 on test (whatever);
CREATE INDEX
 
$ insert into test (whatever) select generate_series(1,100000);
INSERT 0 100000

With this table, and no analyze yet, I ran 3 explains:

$ explain select * from test where whatever > 1000000;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=539.92..1325.92 rows=31440 width=4)
Recheck Cond: (whatever > 1000000)
-> Bitmap Index Scan on test1 (cost=0.00..532.06 rows=31440 width=0)
Index Cond: (whatever > 1000000)
(4 rows)
 
$ explain select * from test where whatever < 0;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=539.92..1325.92 rows=31440 width=4)
Recheck Cond: (whatever < 0)
-> Bitmap Index Scan on test1 (cost=0.00..532.06 rows=31440 width=0)
Index Cond: (whatever < 0)
(4 rows)
 
$ explain select * from test where whatever < 123;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=539.92..1325.92 rows=31440 width=4)
Recheck Cond: (whatever < 123)
-> Bitmap Index Scan on test1 (cost=0.00..532.06 rows=31440 width=0)
Index Cond: (whatever < 123)
(4 rows)

Please note that rows= values are the same, and they don't make any sense!

Now, let's analyze the table, and repeat these explains:

$ analyze test;
ANALYZE
 
$ explain select * from test where whatever > 1000000;
QUERY PLAN
-------------------------------------------------------------------
Index Scan using test1 on test (cost=0.00..8.43 rows=10 width=4)
Index Cond: (whatever > 1000000)
(2 rows)
 
$ explain select * from test where whatever < 0;
QUERY PLAN
-------------------------------------------------------------------
Index Scan using test1 on test (cost=0.00..8.43 rows=10 width=4)
Index Cond: (whatever < 0)
(2 rows)
 
$ explain select * from test where whatever < 123;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using test1 on test (cost=0.00..10.36 rows=120 width=4)
Index Cond: (whatever < 123)
(2 rows)

Much better. Why is it so? PostgreSQL contains some statistics about values in your table. These statistics are updated with analyze command (either run by hand, or via autovacuum). If these statistics are bad or missing you can get bad results. You saw what happens when statistics are missing. But what happens with bad stats?

$ update test set whatever = -1 * whatever;
UPDATE 100000

And now, let's test the query that previously returned ~ 120 rows, but now would return whole table (100% of rows). It should use sequential scan, but:

$ explain select * from test where whatever < 123;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using test1 on test (cost=0.00..12.46 rows=239 width=4)
Index Cond: (whatever < 123)
(2 rows)

Oooops.

Of course, it's also true “the other way around". Query that would previously return 90% of rows ( whatever > 10000 ), will now not return anything, so it should use index, but:

$ explain analyze select * from test where whatever > 10000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..3281.83 rows=180104 width=4) (actual time=38.424..38.424 rows=0 loops=1)
Filter: (whatever > 10000)
Total runtime: 38.468 ms
(3 rows)

Ooops. Of course with such small table its not really big difference, but usually your tables will be larger than 100k rows with single column of int4 type :)

Also – please remember that analyze stores only some certain number of statistics, and you can still get bad estimates. If that will happen – consider changing default_statistics_target, or just alter table … alter column … set statistics to some higher value.

Last note – as I shown earlier, PostgreSQL doesn't always finds perfectly what is the best threshold to change from given plan to another. You can influence this by modifying settings like cpu_*_cost, effective_cache_size and random_page_cost. By default random_page_cost is set to 4 – which means that getting random page is 4 times as expensive as getting page sequentially. This might seem too harsh, but actually it's pretty good approximation. If you do have really cool storage (think: lots of disks in raid10, or ssd storage) you can modify random_page_cost to better suit your needs.

As for other parameters – Josh Berkus once said that we should lower all cpu_*_cost parameters by 80% due to the fact that new cpus are much faster. I did test it, and haven't seen improvements, but your mileage my vary.

Effective_cache_size – this should be simply set to real value – how much data from disk is in your operating system disk cache. For example, on Linux you can use program “free":

=$ free
total used free shared buffers cached
Mem: 3088560 745012 2343548 0 121588 373008
-/+ buffers/cache: 250416 2838144
Swap: 1952760 0 1952760

`
And in here we see that right now my laptop is using ~ 370MB for disk cache. Remember to always check the value with all necessary programs (like – httpd/monitoring or any other tools you're running on your db server) running, and after some usage of database/system. This is important, as just after reboot the value will be (usually) much lower than after a week of normal work.

Hope this post will clear some confusion, and help others with diagnosing their PostgreSQL installations.

  1. 19 comments

  2. # Thom Brown
    Sep 9, 2010

    s/disk case/disk cache/

  3. Sep 9, 2010

    Fix please code:

    SELECT * FROM TABLE WHERE timestamp_column + ’5 minutes’::interval < now()

    to

    SELECT * FROM TABLE WHERE timestamp_column + ’5 minutes’::interval < now()

  4. Sep 9, 2010

    @Thom: thanks fixed.

    @Pavel: hmm – both of these lines are the same – not sure what you want me to change.

  5. Sep 9, 2010

    – It’s not free (index has to be kept up to date, so every insert/update/delete operation gets a bit more expensive

    A long held fallacy. ONLY i/u/d which change data on indexed columns are a “bit more expensive”, not “every”. Clearly, i/d will be, it’s the u that’s always been the issue. Intelligently chosen indexes, which is to say columns which are characteristic of the row, won’t be updated in the normal course of events. For an OLTP application, this should be a true fact.

  6. Sep 9, 2010

    @Robert:

    HOT chains, which you’re referring to, of course are reality, but (at least in my practice) it doesn’t help really much.

    Simple example of situation that I have on at least couple of database – “last_modified_tsz” column, which stored when record was last modified, which gets modified on every update, and is also indexes to allow fast access to recently modified rows. Simple solution for couple of things, and HOT gets effectively disabled.

  7. Sep 9, 2010

    @Pavel: Thanks, fixed. And thanks, Theo, for pointing me to the error.

  8. # Tuxie
    Sep 9, 2010

    I wish there existed a tool I could run to “brute force” test various values in the costs parameters to find the best (or reasonably close) ones for my particular hardware and data set…

  9. # slava44
    Sep 10, 2010

    Great article. The importance of keeping statistics current is true for many other DBs. However there are few things that unique to Postgres (v8.4):
    1) in Update, even if updated column is not in index, the row changes physical location (I think this is true even with HOT updates due to MVCC) and therefore requires update to index.
    2) select count(*) from t where index_column=123 requires table lookup IN ADDITION to index. I hope very much this will be addressed in 9.1 with index-only scans, or we’ll have to spend loads of money on SSDs.. :)

  10. # Frank
    Sep 10, 2010

    @Slava44: There has to be enough space within the same page to do a HOT update, the fillfactor can create this space. When a HOT update is done, there is no need for an update of the index, the index will still point to the same page (otherwise it wouldn’t be HOT). And to get a single record out of a page, PostgreSQL has to read the entire page anyway. But that’s only 8kb (by default, largest possible size is 32kb)

  11. # Ray Reaux
    Sep 10, 2010

    Very helpful article.

    I have a question that is off topic but related. I apologize if this is a very naive question. We have just created cluster indices on database tables that hold over 50 million records. The cluster index is on a composite key (int, varchar) that is not the primary key and is not unique.

    Now after ordering into chunks based on this cluster index, how will the records that are grouped together within that chunk be ordered onto the page? Would it be ordered based on the primary key?

    I understand as records are deleted and updated, the order of the “active” tuple does not stay constant within the page. So my question is whether clustering (and vacuum full) bring back that order and resort based on the primary key.

  12. Sep 10, 2010

    @Ray:
    you have to assume that inside “chunk” order of rows is random. i.e. – they will be stored in order that is dependand on location in data files on cluster time.

    I’m not sure I understood your quetion from last sentence. Generally – normal operations will break the order enforced by cluster (i.e. clustering is not permanent!). vacuum full will not restore ordering, but “CLUSTER” will.

  13. # Ray Reaux
    Sep 10, 2010

    Thanks despesz for responding quickly.

    My understanding of what you are saying is that there is no inherent order or sort that goes on except for the cluster index. As records are inserted, it goes into the first available location in the page that is found (replacing a tombstoned tuple or fill space).

    When a table is vacuum fulled, it is collapsed (tomb stoned tuples removed) in whatever order the active tuples are located within the page. Normal database operations of insertion and updates make the order non-deterministic.

    Therefore, if I want to make the cluster resort based on the sequencer that I am using as primary key and my composite key, then I need to explicitly create a cluster index that is a composite of the three.

  14. Sep 10, 2010

    @Ray: Yes, it should be on all 3.

  15. # Vishalakshi
    Jul 25, 2012

    Hi Depesz,

    I need your help in finding how to search special characters in full text search?

    Example text: india!d win

    I’m using index for the column “name_search” in the table “name_search_table”

    name_search_table_idx” gin (to_tsvector(‘english’::regconfig, name_search))

    when i use this query
    select name_search from name_search_table where to_tsvector(‘english’,name_search) @@ to_tsquery(E’india\!d\swin’);

    it results in syntax error.

    How can i rectify this??

    Thanks in advance..

  16. Jul 25, 2012

    @Vishalakshi:
    1. as you could know – I’m not fan of tsearch – so i’m not using it ( http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/ )
    2. try increasing number of backslashes.

  17. # Lior
    Jan 2, 2013

    Great article, very helpful.
    Is there are cases in which Index is corrupted and PostgreSql will not use the index since it is corrupted?

  18. Jan 2, 2013

    @Lior:
    Index can be invalid – which is the case during “CREATE INDEX CONCURRENTLY” – and then it’s not really used.
    It can be also corrupted (for example due to storage error), but then it’s used – even if it returns bad data :(

  19. # Kire
    May 6, 2013

    Hi,
    Thank you very much for this helpful post.
    I was following all the examples but on some places your example is using index scan, while in my system it’s using Seq Scan. Like here “explain analyze select * from test where whatever < 5500000;"
    In your example the index scan is used, while in my case it's used Seq Scan. I was lowering the number until 1400000 before index scan is starting to be used,do you maybe have some thoughts why is this like that?? Thank you in advance,
    Kire

  20. May 6, 2013

    @Kire: different setup maybe. random_page_cost, cpu_*_cost.

Leave a comment