July 10th, 2007 by depesz | Tags: , | 2 comments »
Did it help? If yes - maybe you can help me?

mage_ from #postgresql had interesting problem today.

he has a table with 2 date fields, and he wants to have list of all years from both fields. together. as one list.

his approach:

select date_part('year', date1) from test
union
select date_part('year', date2) from test;

is hardly satisfactory – it takes too long.

any way to speed it up?

ok, i asked some questions, got some data.

here's what we're starting with:

  1. total number of rows: 56990
  2. date1 is in range : 1937-01-01 to 2006-08-31
  3. date2 is in range : 1937-12-31 to 1996-12-31
  4. data will grow in time, but there will be never rows with date prior to 1800-01-01, and max of the dates will be similar to now()
  5. there are indexes on date1 and date2 columns (two separete, standard, btree indexes)

ok. let's make a test table:

create table test (id serial primary key, date1 date, date2 date);

now let's create some data:

insert into test (date1, date2) select random() * 40000 * '1 day'::interval + '1900-01-01'::date, random() * 40000 * '1 day'::interval + '1900-01-01'::date from generate_series(1,60000);

and the base indexes:

# create index d1 on test (date1);
# create index d2 on test (date2);

ok. now we have 60000 rows, with min/max values:

# select min(date1), max(date1), min(date2), max(date2) from test;
min | max | min | max
------------+------------+------------+------------
1900-01-02 | 2009-07-06 | 1900-01-01 | 2009-07-06
(1 row)

it's not perfect copy of mage's table, but it's close enough.

now. let's try his approach just to get some idea on timing on my machine:

# explain analyze select date_part('year', date1) from test union select date_part('year', date2) from test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Unique (cost=15152.10..15752.10 rows=120000 width=4) (actual time=1235.563..1731.876 rows=110 loops=1)
-> Sort (cost=15152.10..15452.10 rows=120000 width=4) (actual time=1235.556..1505.904 rows=120000 loops=1)
Sort Key: (date_part('year'::text, (public.test.date1)::timestamp without time zone))
Sort Method: external merge Disk: 2336kB
-> Append (cost=0.00..3590.00 rows=120000 width=4) (actual time=0.040..760.877 rows=120000 loops=1)
-> Seq Scan on test (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.036..171.935 rows=60000 loops=1)
-> Seq Scan on test (cost=0.00..1195.00 rows=60000 width=4) (actual time=0.018..150.961 rows=60000 loops=1)
Total runtime: 1763.287 ms
(8 rows)

ok, so we have all that we need.

first – let's check if postgresql will be smart enough not to do sort of 120000 rows, but use index on fields, and do merge-sort:

# create index q1 on test (date_part('year', date1));
# create index q2 on test (date_part('year', date2));

and retry the query.

result – no change.

hmm .. let's try different approach.

i know the range of possible years – it is 1800 to now(). since now() can change, let's assume, that now() == 2100. just to have safe 90 years of using the query.

for this – i will use standard indexes, so the extra ones can be removed:

# drop index q1;
# drop index q2;

so, now for the query. we will use our magic-tool: generate_series:

select i
from
generate_series(1800, 2100) i
where
exists (
select *
from test t
where t.date1 between cast(i::text||'-01-01' as date) and cast(i::text||'-12-31' as date)
) or exists (
select *
from test t
WHERE t.date2 between cast(i::text||'-01-01' as date) and cast(i::text||'-12-31' as date)
);

if it's not obvious, algorithm is quite simple:

  • for every year in (1800 .. 2100) check if there is any row that has date1 between ‘YEAR-01-01' and ‘YEAR-12-31'.
  • if there is such a row – return this year, and go to next year
  • if thers is no such a row – check again but this time using date2 field.
  • if there is such a row – return year

that's pretty much it.

how does it work?

explain analyze showed this:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.00..5437.40 rows=750 width=4) (actual time=4.384..76.243 rows=110 loops=1)
Filter: ((subplan) OR (subplan))
SubPlan
-> Index Scan using d2 on test t (cost=0.03..806.27 rows=300 width=12) (actual time=0.008..0.008 rows=0 loops=191)
Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date)) -> Index Scan using d1 on test t (cost=0.03..806.25 rows=300 width=12) (actual time=0.227..0.227 rows=0 loops=301)
Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date)) Total runtime: 76.642 ms (8 rows)

pretty neat 🙂

the problem with it is that it doesn't look good. (1800, 2100) ? and what if i'd want to add something from 1799? or 2200 ? i would have to modify the query.

also - in our current case - we do a lot of empty runs - as i know that there is no row from before 1900, or after 2009.

any chance to get it done right?

first - let's find the lowest year number:

# explain analyze select least(min(date1), min(date2)) from test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.09..0.10 rows=1 width=0) (actual time=0.115..0.117 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.054..0.056 rows=1 loops=1)
-> Index Scan using d1 on test (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.049..0.049 rows=1 loops=1)
Filter: (date1 IS NOT NULL)
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.040..0.043 rows=1 loops=1)
-> Index Scan using d2 on test (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.035..0.035 rows=1 loops=1)
Filter: (date2 IS NOT NULL)
Total runtime: 0.198 ms
(9 rows)

this looks reasonably fast.

now, the last possible year - quite simple as well:

select greatest(max(date1), max(date2)) from test;

now i can rewrite the generate_series query to make it:

  1. data-change-proof
  2. not searching in years that definitelly doesn't have any data.

explain analyze
select i
from
generate_series(
(select extract('year' from least(min(date1), min(date2)))::int4 from test),
(select extract('year' from greatest(max(date1), max(date2)))::int4 from test)
) i
where
exists (
select *
from test t
where t.date1 between cast(i::text||'-01-01' as date) and cast(i::text||'-12-31' as date)
) or exists (
select *
from test t
WHERE t.date2 between cast(i::text||'-01-01' as date) and cast(i::text||'-12-31' as date)
);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.21..5437.61 rows=750 width=4) (actual time=0.236..4.409 rows=110 loops=1)
Filter: ((subplan) OR (subplan))
InitPlan
-> Result (cost=0.09..0.10 rows=1 width=0) (actual time=0.082..0.084 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.035..0.037 rows=1 loops=1)
-> Index Scan using d1 on test (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.031..0.031 rows=1 loops=1)
Filter: (date1 IS NOT NULL)
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1)
-> Index Scan using d2 on test (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.018..0.018 rows=1 loops=1)
Filter: (date2 IS NOT NULL)
-> Result (cost=0.09..0.10 rows=1 width=0) (actual time=0.032..0.034 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.010..0.011 rows=1 loops=1)
-> Index Scan Backward using d1 on test (cost=0.00..2616.22 rows=60000 width=8) (actual time=0.006..0.006 rows=1 loops=1)
Filter: (date1 IS NOT NULL)
-> Limit (cost=0.00..0.04 rows=1 width=8) (actual time=0.009..0.011 rows=1 loops=1)
-> Index Scan Backward using d2 on test (cost=0.00..2616.24 rows=60000 width=8) (actual time=0.005..0.005 rows=1 loops=1)
Filter: (date2 IS NOT NULL)
SubPlan
-> Index Scan using d2 on test t (cost=0.03..806.27 rows=300 width=12) (never executed)
Index Cond: ((date2 >= ((($0)::text || '-01-01'::text))::date) AND (date2 <= ((($0)::text || '-12-31'::text))::date)) -> Index Scan using d1 on test t (cost=0.03..806.25 rows=300 width=12) (actual time=0.030..0.030 rows=1 loops=110)
Index Cond: ((date1 >= ((($0)::text || '-01-01'::text))::date) AND (date1 <= ((($0)::text || '-12-31'::text))::date)) Total runtime: 4.735 ms

so i went from 1760 to 4.7. not bad.

of course the code is ugly. and much harder to understand. but hey - you can't win them all 🙂

  1. 2 comments

  2. Jul 13, 2007

    No jestem pod wrażeniem ;). Indeksy częściowe są w tym wypadku kiepskie po prostu. Próbowałem z zapytaniem
    SELECT DISTINCT date_part(‘year’, …
    które wykorzystało indeks, ale czas wykonania skrócił się tylko o połowę w stosunku do czasu pierwotnego zapytania.

  3. Jul 13, 2007

    Oops, should be in English. I’m impressed. Expression indexes are used in SELECT DISTINCT date_part… but it’s only 50% faster that original query. So, you’re the winner 😉

Leave a comment