Prepared statements gotcha – select * from depesz;

Friend from my previous employer told me that plans of execution of prepared statement, and the same statement run “as it" are different.

Well, I checked and this is what I found (it's not shocking, it's actually quite obvious, but You have to think about it for a while to “get it").

Let's assume we have pretty simple table:

CREATE TABLE users (
    id        BIGSERIAL,
    username  TEXT NOT NULL,
    is_active BOOL NOT NULL DEFAULT 'false',
    PRIMARY KEY (id)
);
CREATE UNIQUE INDEX idx_u on users (username) WHERE is_active = true;

Now, this index serves 2 purposes:

username should be unique, but only for active users
we usually search only for active users, so indexing all of them doesn't make sense

Let's put some data in it:

CREATE OR REPLACE FUNCTION random_username() RETURNS TEXT as $$
return join "", map { [ "a".."z" ]->[rand() * 26] } (1..(10 + rand() * 20));
$$ language plperl;
INSERT INTO users (username, is_active)
    SELECT random_username(), random() > 0.01 from generate_series(1,500000) i;

Function random_username() generates random user name – which is string of 10 – 29 random lower case letters.

I inserted half a million of rows to my table, and this is how it looks:

select
    count(*) as all_users,
    sum(case when is_active = true then 1 else 0 end) as active_users,
    sum(case when is_active = true then 0 else 1 end) as inactive_users
from users;
 all_users | active_users | inactive_users
-----------+--------------+----------------
    500000 |       494916 |           5084
(1 row)

OK. So let's get first 50 active users, sorted by username:

# EXPLAIN ANALYZE SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit 50;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.44 rows=50 width=29) (actual time=0.081..1.011 rows=50 loops=1)
   ->  Index Scan using idx_u on users  (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.076..0.821 rows=50 loops=1)
 Total runtime: 1.151 ms
(3 rows)

Pretty nice.

What happens if I do it via prepare?

# PREPARE test1 AS SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit 50;
PREPARE
# EXPLAIN ANALYZE EXECUTE test1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.44 rows=50 width=29) (actual time=0.038..0.429 rows=50 loops=1)
   ->  Index Scan using idx_u on users  (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.032..0.245 rows=50 loops=1)
 Total runtime: 0.581 ms
(3 rows)

So far so good. But what if I'll make the “true" parameter to plan?

# PREPARE test2 (bool) AS SELECT * FROM users WHERE is_active = $1 ORDER BY username ASC limit 50;
PREPARE
# EXPLAIN ANALYZE EXECUTE test2(true);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=18179.82..18179.95 rows=50 width=29) (actual time=1576.059..1576.244 rows=50 loops=1)
   ->  Sort  (cost=18179.82..18804.82 rows=250000 width=29) (actual time=1576.054..1576.124 rows=50 loops=1)
         Sort Key: username
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  Seq Scan on users  (cost=0.00..9875.00 rows=250000 width=29) (actual time=0.047..751.791 rows=494916 loops=1)
               Filter: (is_active = $1)
 Total runtime: 1576.363 ms
(7 rows)

Whoa. That's not good.

What about limit?

# PREPARE test3 (int4) AS SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit $1;
PREPARE
# EXPLAIN ANALYZE EXECUTE test3(50);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3409.78 rows=49550 width=29) (actual time=0.026..0.283 rows=50 loops=1)
   ->  Index Scan using idx_u on users  (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.015..0.142 rows=50 loops=1)
 Total runtime: 0.382 ms
(3 rows)

This one is “safe".

Do You know why it happens? It's pretty simple. Prepared plan is generated without knowledge about what will be the value of parameter.

So it can't use partial index as it wouldn't work well if I ran it with “false" as argument.

Is it a bug? Not really. What can You do about it? Basically: think. Query planing is done once, and if planner doesn't have all knowledge necessary – it will generate suboptimal plans. Very suboptimal in some cases 🙂

10 thoughts on “Prepared statements gotcha”

Bartek Jablonski says:

2008-05-10 at 21:57

Going to java world – jdbc driver for postgresql sinve version 7.4 can use server side prepared statements to speed up queries. I see that things are different – it’s better to be careful with this feature.
amad3us says:

2008-05-10 at 22:26

🙂
Marcin says:

2008-05-10 at 22:35

Ciekawe. A komentuje tutaj tylko eo?
depesz says:

2008-05-11 at 13:37

hmm .. I’ve been thinking about it for better part of the night, and I think that this gotcha doesn’t apply only to partial indexes. It can apply to any other case when knowing the value in advance might provide another plan.

Limit as well – it was not shown in above example, but I think it’s entirely possible for this issue to influent limit plans as well.

Same goes for standard indexes, with non perfect value distribution.

Basically – after thinking about it, I have to admit that I wouldn’t suggest using prepared statements in most of the cases.
Szymon says:

2008-05-13 at 09:52

What’s more, all this is nicely written in the PostgreSQL documentation at 38.10.2. Plan Caching.
Vincenzo Romano says:

2008-05-13 at 10:05

Well I had the very same problem in a number of cases, mostly involving the LIKE/ILIKE operator when one of the two arguments was a parameter.
Solution?
Embed the query into a PL/PgSQL function which will create a dynamic SQL query by expanding the parametric value and then execute it.
By doing so you force the planner to do its job soon before performing the query, when there is no unknown element on it.
It’d be nice, though, to have a “lazy query planner” to void such “gotchas”.
Lukas says:

2008-05-14 at 16:11

Generally for stateless web applications I recommend not using prepared statements. If at all one should use client side emulated prepared statements. Here are my thought in more detail:
http://pooteeweet.org/blog/1083
zawadaa says:

2008-05-25 at 20:51

Isn’t that case similar?
http://archives.postgresql.org/pgsql-committers/2008-03/msg00566.php
depesz says:

2008-05-25 at 20:54

@zawadaa:
it’s similar, but not the same – i.e. the patch doesn’t change postgresql behavior in described situation.
Pingback: Kari’s World » Blog Archive » Reducing memory usage

Comments are closed.