postgresql tips & tricks – select * from depesz;

mage_ from #postgresql had interesting problem today.

he has a table with 2 date fields, and he wants to have list of all years from both fields. together. as one list.

his approach:

SELECT date_part('year', date1) FROM test
UNION
SELECT date_part('year', date2) FROM test;

is hardly satisfactory – it takes too long.

any way to speed it up?

ok, i asked some questions, got some data.

here's what we're starting with:

total number of rows: 56990
date1 is in range : 1937-01-01 to 2006-08-31
date2 is in range : 1937-12-31 to 1996-12-31
data will grow in time, but there will be never rows with date prior to 1800-01-01, and max of the dates will be similar to now()
there are indexes on date1 and date2 columns (two separete, standard, btree indexes)

ok. let's make a test table:

CREATE TABLE test (id serial PRIMARY KEY, date1 DATE, date2 DATE);

now let's create some data:

INSERT INTO test (date1, date2) SELECT random() * 40000 * '1 day'::INTERVAL + '1900-01-01'::DATE, random() * 40000 * '1 day'::INTERVAL + '1900-01-01'::DATE FROM generate_series(1,60000);

and the base indexes:

# CREATE INDEX d1 ON test (date1);
# CREATE INDEX d2 ON test (date2);

ok. now we have 60000 rows, with min/max values:

# SELECT MIN(date1), MAX(date1), MIN(date2), MAX(date2) FROM test;
    MIN     |    MAX     |    MIN     |    MAX
------------+------------+------------+------------
 1900-01-02 | 2009-07-06 | 1900-01-01 | 2009-07-06
(1 ROW)

it's not perfect copy of mage's table, but it's close enough.

now. let's try his approach just to get some idea on timing on my machine:

# EXPLAIN analyze SELECT date_part('year', date1) FROM test UNION SELECT date_part('year', date2) FROM test;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 UNIQUE  (cost=15152.10..15752.10 ROWS=120000 width=4) (actual TIME=1235.563..1731.876 ROWS=110 loops=1)
   ->  Sort  (cost=15152.10..15452.10 ROWS=120000 width=4) (actual TIME=1235.556..1505.904 ROWS=120000 loops=1)
         Sort KEY: (date_part('year'::text, (public.test.date1)::TIMESTAMP WITHOUT TIME zone))
         Sort Method:  external MERGE  Disk: 2336kB
         ->  Append  (cost=0.00..3590.00 ROWS=120000 width=4) (actual TIME=0.040..760.877 ROWS=120000 loops=1)
               ->  Seq Scan ON test  (cost=0.00..1195.00 ROWS=60000 width=4) (actual TIME=0.036..171.935 ROWS=60000 loops=1)
               ->  Seq Scan ON test  (cost=0.00..1195.00 ROWS=60000 width=4) (actual TIME=0.018..150.961 ROWS=60000 loops=1)
 Total runtime: 1763.287 ms
(8 ROWS)

ok, so we have all that we need.

first – let's check if postgresql will be smart enough not to do sort of 120000 rows, but use index on fields, and do merge-sort:

# CREATE INDEX q1 ON test (date_part('year', date1));
# CREATE INDEX q2 ON test (date_part('year', date2));

and retry the query.

result – no change.

hmm .. let's try different approach.

i know the range of possible years – it is 1800 to now(). since now() can change, let's assume, that now() == 2100. just to have safe 90 years of using the query.

for this – i will use standard indexes, so the extra ones can be removed:

# DROP INDEX q1;
# DROP INDEX q2;

so, now for the query. we will use our magic-tool: generate_series:

SELECT i
FROM
    generate_series(1800, 2100) i
WHERE
    EXISTS (
        SELECT *
        FROM test t
        WHERE t.date1 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
    ) OR EXISTS (
        SELECT *
        FROM test t
        WHERE t.date2 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
    );

if it's not obvious, algorithm is quite simple:

for every year in (1800 .. 2100) check if there is any row that has date1 between ‘YEAR-01-01' and ‘YEAR-12-31'.
if there is such a row – return this year, and go to next year
if thers is no such a row – check again but this time using date2 field.
if there is such a row – return year

that's pretty much it.

how does it work?

explain analyze showed this:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series i  (cost=0.00..5437.40 rows=750 width=4) (actual time=4.384..76.243 rows=110 loops=1)
   Filter: ((subplan) OR (subplan))
   SubPlan
     ->  Index Scan using d2 on test t  (cost=0.03..806.27 rows=300 width=12) (actual time=0.008..0.008 rows=0 loops=191)
           Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date))
     ->  Index Scan using d1 on test t  (cost=0.03..806.25 rows=300 width=12) (actual time=0.227..0.227 rows=0 loops=301)
           Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date))
 Total runtime: 76.642 ms
(8 rows)

pretty neat 🙂

the problem with it is that it doesn't look good. (1800, 2100) ? and what if i'd want to add something from 1799? or 2200 ? i would have to modify the query.

also – in our current case – we do a lot of empty runs – as i know that there is no row from before 1900, or after 2009.

any chance to get it done right?

first – let's find the lowest year number:

# EXPLAIN analyze SELECT least(MIN(date1), MIN(date2)) FROM test;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 RESULT  (cost=0.09..0.10 ROWS=1 width=0) (actual TIME=0.115..0.117 ROWS=1 loops=1)
   InitPlan
     ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.054..0.056 ROWS=1 loops=1)
           ->  INDEX Scan USING d1 ON test  (cost=0.00..2616.22 ROWS=60000 width=8) (actual TIME=0.049..0.049 ROWS=1 loops=1)
                 FILTER: (date1 IS NOT NULL)
     ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.040..0.043 ROWS=1 loops=1)
           ->  INDEX Scan USING d2 ON test  (cost=0.00..2616.24 ROWS=60000 width=8) (actual TIME=0.035..0.035 ROWS=1 loops=1)
                 FILTER: (date2 IS NOT NULL)
 Total runtime: 0.198 ms
(9 ROWS)

this looks reasonably fast.

now, the last possible year – quite simple as well:

SELECT greatest(MAX(date1), MAX(date2)) FROM test;

now i can rewrite the generate_series query to make it:

data-change-proof
not searching in years that definitelly doesn't have any data.

EXPLAIN analyze
SELECT i
FROM
    generate_series(
        (SELECT EXTRACT('year' FROM least(MIN(date1), MIN(date2)))::int4 FROM test),
        (SELECT EXTRACT('year' FROM greatest(MAX(date1), MAX(date2)))::int4 FROM test)
    ) i
WHERE
    EXISTS (
        SELECT *
        FROM test t
        WHERE t.date1 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
    ) OR EXISTS (
        SELECT *
        FROM test t
        WHERE t.date2 BETWEEN CAST(i::text||'-01-01' AS DATE) AND CAST(i::text||'-12-31' AS DATE)
    );
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 FUNCTION Scan ON generate_series i  (cost=0.21..5437.61 ROWS=750 width=4) (actual TIME=0.236..4.409 ROWS=110 loops=1)
   FILTER: ((subplan) OR (subplan))
   InitPlan
     ->  RESULT  (cost=0.09..0.10 ROWS=1 width=0) (actual TIME=0.082..0.084 ROWS=1 loops=1)
           InitPlan
             ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.035..0.037 ROWS=1 loops=1)
                   ->  INDEX Scan USING d1 ON test  (cost=0.00..2616.22 ROWS=60000 width=8) (actual TIME=0.031..0.031 ROWS=1 loops=1)
                         FILTER: (date1 IS NOT NULL)
             ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.022..0.023 ROWS=1 loops=1)
                   ->  INDEX Scan USING d2 ON test  (cost=0.00..2616.24 ROWS=60000 width=8) (actual TIME=0.018..0.018 ROWS=1 loops=1)
                         FILTER: (date2 IS NOT NULL)
     ->  RESULT  (cost=0.09..0.10 ROWS=1 width=0) (actual TIME=0.032..0.034 ROWS=1 loops=1)
           InitPlan
             ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.010..0.011 ROWS=1 loops=1)
                   ->  INDEX Scan Backward USING d1 ON test  (cost=0.00..2616.22 ROWS=60000 width=8) (actual TIME=0.006..0.006 ROWS=1 loops=1)
                         FILTER: (date1 IS NOT NULL)
             ->  LIMIT  (cost=0.00..0.04 ROWS=1 width=8) (actual TIME=0.009..0.011 ROWS=1 loops=1)
                   ->  INDEX Scan Backward USING d2 ON test  (cost=0.00..2616.24 ROWS=60000 width=8) (actual TIME=0.005..0.005 ROWS=1 loops=1)
                         FILTER: (date2 IS NOT NULL)
   SubPlan
     ->  INDEX Scan USING d2 ON test t  (cost=0.03..806.27 ROWS=300 width=12) (never executed)
           INDEX Cond: ((date2 >= ((($0)::text || '-01-01'::text))::DATE) AND (date2 <= ((($0)::text || '-12-31'::text))::DATE))
     ->  INDEX Scan USING d1 ON test t  (cost=0.03..806.25 ROWS=300 width=12) (actual TIME=0.030..0.030 ROWS=1 loops=110)
           INDEX Cond: ((date1 >= ((($0)::text || '-01-01'::text))::DATE) AND (date1 <= ((($0)::text || '-12-31'::text))::DATE))
 Total runtime: 4.735 ms

so i went from 1760 to 4.7. not bad.

of course the code is ugly. and much harder to understand. but hey – you can't win them all 🙂

2 thoughts on “postgresql tips & tricks”

alec says:

2007-07-13 at 08:21

No jestem pod wrażeniem ;). Indeksy częściowe są w tym wypadku kiepskie po prostu. Próbowałem z zapytaniem
SELECT DISTINCT date_part(‘year’, …
które wykorzystało indeks, ale czas wykonania skrócił się tylko o połowę w stosunku do czasu pierwotnego zapytania.
alec says:

2007-07-13 at 08:30

Oops, should be in English. I’m impressed. Expression indexes are used in SELECT DISTINCT date_part… but it’s only 50% faster that original query. So, you’re the winner 😉

Comments are closed.