May 10th, 2008 by depesz | Tags: , , , , | 10 comments »
Did it help? If yes - maybe you can help me?

Friend from my previous employer told me that plans of execution of prepared statement, and the same statement run “as it" are different.

Well, I checked and this is what I found (it's not shocking, it's actually quite obvious, but You have to think about it for a while to “get it").

Let's assume we have pretty simple table:

CREATE TABLE users (
id BIGSERIAL,
username TEXT NOT NULL,
is_active BOOL NOT NULL DEFAULT 'false',
PRIMARY KEY (id)
);
CREATE UNIQUE INDEX idx_u on users (username) WHERE is_active = true;

Now, this index serves 2 purposes:

  • username should be unique, but only for active users
  • we usually search only for active users, so indexing all of them doesn't make sense

Let's put some data in it:

CREATE OR REPLACE FUNCTION random_username() RETURNS TEXT as $$
return join "", map { [ "a".."z" ]->[rand() * 26] } (1..(10 + rand() * 20));
$$ language plperl;
INSERT INTO users (username, is_active)
SELECT random_username(), random() > 0.01 from generate_series(1,500000) i;

Function random_username() generates random user name – which is string of 10 – 29 random lower case letters.

I inserted half a million of rows to my table, and this is how it looks:

select
count(*) as all_users,
sum(case when is_active = true then 1 else 0 end) as active_users,
sum(case when is_active = true then 0 else 1 end) as inactive_users
from users;
all_users | active_users | inactive_users
-----------+--------------+----------------
500000 | 494916 | 5084
(1 row)

OK. So let's get first 50 active users, sorted by username:

# EXPLAIN ANALYZE SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit 50;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.44 rows=50 width=29) (actual time=0.081..1.011 rows=50 loops=1)
-> Index Scan using idx_u on users (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.076..0.821 rows=50 loops=1)
Total runtime: 1.151 ms
(3 rows)

Pretty nice.

What happens if I do it via prepare?

# PREPARE test1 AS SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit 50;
PREPARE
# EXPLAIN ANALYZE EXECUTE test1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.44 rows=50 width=29) (actual time=0.038..0.429 rows=50 loops=1)
-> Index Scan using idx_u on users (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.032..0.245 rows=50 loops=1)
Total runtime: 0.581 ms
(3 rows)

So far so good. But what if I'll make the “true" parameter to plan?

# PREPARE test2 (bool) AS SELECT * FROM users WHERE is_active = $1 ORDER BY username ASC limit 50;
PREPARE
# EXPLAIN ANALYZE EXECUTE test2(true);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=18179.82..18179.95 rows=50 width=29) (actual time=1576.059..1576.244 rows=50 loops=1)
-> Sort (cost=18179.82..18804.82 rows=250000 width=29) (actual time=1576.054..1576.124 rows=50 loops=1)
Sort Key: username
Sort Method: top-N heapsort Memory: 20kB
-> Seq Scan on users (cost=0.00..9875.00 rows=250000 width=29) (actual time=0.047..751.791 rows=494916 loops=1)
Filter: (is_active = $1)
Total runtime: 1576.363 ms
(7 rows)

Whoa. That's not good.

What about limit?

# PREPARE test3 (int4) AS SELECT * FROM users WHERE is_active = true ORDER BY username ASC limit $1;
PREPARE
# EXPLAIN ANALYZE EXECUTE test3(50);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3409.78 rows=49550 width=29) (actual time=0.026..0.283 rows=50 loops=1)
-> Index Scan using idx_u on users (cost=0.00..34097.81 rows=495500 width=29) (actual time=0.015..0.142 rows=50 loops=1)
Total runtime: 0.382 ms
(3 rows)

This one is “safe".

Do You know why it happens? It's pretty simple. Prepared plan is generated without knowledge about what will be the value of parameter.

So it can't use partial index as it wouldn't work well if I ran it with “false" as argument.

Is it a bug? Not really. What can You do about it? Basically: think. Query planing is done once, and if planner doesn't have all knowledge necessary – it will generate suboptimal plans. Very suboptimal in some cases 🙂

  1. 10 comments

  2. # Bartek Jablonski
    May 10, 2008

    Going to java world – jdbc driver for postgresql sinve version 7.4 can use server side prepared statements to speed up queries. I see that things are different – it’s better to be careful with this feature.

  3. May 10, 2008

    🙂

  4. May 10, 2008

    Ciekawe. A komentuje tutaj tylko eo?

  5. May 11, 2008

    hmm .. I’ve been thinking about it for better part of the night, and I think that this gotcha doesn’t apply only to partial indexes. It can apply to any other case when knowing the value in advance might provide another plan.

    Limit as well – it was not shown in above example, but I think it’s entirely possible for this issue to influent limit plans as well.

    Same goes for standard indexes, with non perfect value distribution.

    Basically – after thinking about it, I have to admit that I wouldn’t suggest using prepared statements in most of the cases.

  6. # Szymon
    May 13, 2008

    What’s more, all this is nicely written in the PostgreSQL documentation at 38.10.2. Plan Caching.

  7. # Vincenzo Romano
    May 13, 2008

    Well I had the very same problem in a number of cases, mostly involving the LIKE/ILIKE operator when one of the two arguments was a parameter.
    Solution?
    Embed the query into a PL/PgSQL function which will create a dynamic SQL query by expanding the parametric value and then execute it.
    By doing so you force the planner to do its job soon before performing the query, when there is no unknown element on it.
    It’d be nice, though, to have a “lazy query planner” to void such “gotchas”.

  8. May 14, 2008

    Generally for stateless web applications I recommend not using prepared statements. If at all one should use client side emulated prepared statements. Here are my thought in more detail:
    http://pooteeweet.org/blog/1083

  9. # zawadaa
    May 25, 2008

    Isn’t that case similar?
    http://archives.postgresql.org/pgsql-committers/2008-03/msg00566.php

  10. May 25, 2008

    @zawadaa:
    it’s similar, but not the same – i.e. the patch doesn’t change postgresql behavior in described situation.

  1. 1 Trackback(s)

  2. Jan 19, 2010: Kari’s World » Blog Archive » Reducing memory usage

Leave a comment