November 21st, 2007 by depesz | Tags: | 11 comments »
Did it help? If yes - maybe you can help me?

hannesd on irc had a problem with finding overlapping date/time ranges.

basically – in postgresql there is “overlaps" operator, but unfortunately it doesn't use indexes:

# CREATE TABLE test as
SELECT distinct least(a,b) as a, greatest(a,b) as b
FROM (
SELECT
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as a,
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as b
FROM generate_series(1,1000) i
) x
WHERE a <> b;

this query creates a 1000 row (or a bit less) table, with columns (a,b) which contain some dates.

there are some constraints on these dates, but they seem sensible:

  • a > b (which implies “a <> b")
  • pair (a, b) is unique

now, let's create indexes for this table – all possible – just in case:

CREATE INDEX i_a on test (a);
CREATE INDEX i_b on test (b);
CREATE UNIQUE INDEX i_ab on test (a,b);
CREATE UNIQUE INDEX i_ba on test (b,a);

finally, vacuum analyze, and we have a table ready for tests.

so, let's find, for every row, how many rows overlap with it:

SELECT
t1.*,
(SELECT count(*) FROM test t2 WHERE (t1.a, t1.b) overlaps (t2.a, t2.b))
FROM test t1;

explain for this:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
----
Seq Scan on test t1 (cost=0.00..28360.00 rows=1000 width=8) (actual time=3.185..2358.621 rows=1000 loops=1)
SubPlan
-> Aggregate (cost=28.34..28.34 rows=1 width=0) (actual time=2.356..2.357 rows=1 loops=1000)
-> Seq Scan on test t2 (cost=0.00..27.50 rows=333 width=0) (actual time=0.008..2.143 rows=671 loops=1000)
Filter: "overlaps"(($0)::timestamp with time zone, ($1)::timestamp with time zone, (a)::timestamp with time zone, (b)::timestamp with time zone)
Total runtime: 2359.122 ms
(6 rows)

what i don't like about it is that it does seq scan over test (t2) 1000 times! (loops = 1000).

which means, that if i will make the table longer it will take much more time. let's try it for 10000 rows:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test t1 (cost=0.00..2783595.00 rows=10000 width=8) (actual time=32.033..225270.922 rows=10000 loops=1)
SubPlan
-> Aggregate (cost=278.33..278.35 rows=1 width=0) (actual time=22.523..22.523 rows=1 loops=10000)
-> Seq Scan on test t2 (cost=0.00..270.00 rows=3333 width=0) (actual time=0.009..20.502 rows=6653 loops=10000)
Filter: "overlaps"(($0)::timestamp with time zone, ($1)::timestamp with time zone, (a)::timestamp with time zone, (b)::timestamp with time zone)
Total runtime: 225278.174 ms
(6 rows)

not really cool.

so, we thought about a way to find the overlapping ranges in some indexable way.

after some thinking i decided to try simple operators (<, >, <=; >=).

i think (i would write “i know", but i'm not sure what about some strange edge-cases, and i'm not reallyinto it at the moment) that ‘overlaps' can be transformed to:

(a,b) overlaps (c,d) becomes (c <= a AND d > a) OR ( c >= a AND c < b)

so, let's try to rewrite the query:

SELECT
t1.*,
(SELECT count(*) FROM test t2 WHERE
(t2.a <= t1.a AND t2.b > t1.a) OR ( t2.a >= t1.a AND t2.a < t1.b)
)
FROM test t1;

first, let's try if the results are the same:

CREATE TABLE x1 as
SELECT
t1.*,
(SELECT count(*) FROM test t2 WHERE (t1.a, t1.b) overlaps (t2.a, t2.b))
FROM test t1;
CREATE TABLE x2 as
SELECT
t1.*,
(SELECT count(*) FROM test t2 WHERE
(t2.a <= t1.a AND t2.b > t1.a) OR ( t2.a >= t1.a AND t2.a < t1.b)
)
FROM test t1;

and the test:

# SELECT count(*) FROM (SELECT * FROM x1 except SELECT * FROM x2) q;
count
-------
0
(1 row)
# SELECT count(*) FROM (SELECT * FROM x2 except SELECT * FROM x1) q;
count
-------
0
(1 row)

which looks cool – results are the same.

and explain analyze?

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test t1 (cost=0.00..1500471.60 rows=10000 width=8) (actual time=4.879..55707.457 rows=10000 loops=1)
SubPlan
-> Aggregate (cost=150.02..150.03 rows=1 width=0) (actual time=5.564..5.564 rows=1 loops=10000)
-> Bitmap Heap Scan on test t2 (cost=78.91..147.13 rows=1156 width=0) (actual time=1.212..3.461 rows=6630 loops=10000)
Recheck Cond: (((b > $0) AND (a <= $0)) OR ((a >= $0) AND (a < $1)))
-> BitmapOr (cost=78.91..78.91 rows=1161 width=0) (actual time=1.197..1.197 rows=0 loops=10000)
-> Bitmap Index Scan on i_ba (cost=0.00..73.58 rows=1111 width=0) (actual time=0.765..0.765 rows=3316 loops=10000)
Index Cond: ((b > $0) AND (a <= $0))
-> Bitmap Index Scan on i_a (cost=0.00..4.75 rows=50 width=0) (actual time=0.431..0.431 rows=3316 loops=10000)
Index Cond: ((a >= $0) AND (a < $1))
Total runtime: 55716.829 ms
(11 rows)

nicer. it still calls inner loop 10000 times, but it uses indexes.

now. this is basically done, but we can notice interesting pattern in index usage.

two indexes are being used:

  • i_ba (for where on b and a)
  • i_a (for where on a)

interesting is that both of these indexes could be simply removed, and postgresql should be able to use (for both bitmap index scans) index i_ab.

if this would be true, then our table will have only one index (i_b doesn't serve any purpose), so we would get faster writes, and less data on disc – so the query could (theoretically atleast) be a bit faster.

and how's this in reality?

DROP TABLE test;
CREATE TABLE test as
SELECT distinct least(a,b) as a, greatest(a,b) as b
FROM (
SELECT
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as a,
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as b
FROM generate_series(1,10000) i
) x
WHERE a <> b;
CREATE UNIQUE INDEX i_ab on test (a,b);
VACUUM VERBOSE ANALYZE;

and now re-run explain analyze of our query (the one without “overlaps"):

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test t1 (cost=0.00..1500474.00 rows=10000 width=8) (actual time=9.808..51627.642 rows=10000 loops=1)
SubPlan
-> Aggregate (cost=150.02..150.03 rows=1 width=0) (actual time=5.157..5.157 rows=1 loops=10000)
-> Bitmap Heap Scan on test t2 (cost=78.91..147.13 rows=1156 width=0) (actual time=0.951..3.132 rows=6690 loops=10000)
Recheck Cond: (((a <= $0) AND (b > $0)) OR ((a >= $0) AND (a < $1)))
-> BitmapOr (cost=78.91..78.91 rows=1161 width=0) (actual time=0.938..0.938 rows=0 loops=10000)
-> Bitmap Index Scan on i_ab (cost=0.00..73.58 rows=1111 width=0) (actual time=0.516..0.516 rows=3346 loops=10000)
Index Cond: ((a <= $0) AND (b > $0))
-> Bitmap Index Scan on i_ab (cost=0.00..4.75 rows=50 width=0) (actual time=0.421..0.421 rows=3346 loops=10000)
Index Cond: ((a >= $0) AND (a < $1))
Total runtime: 51636.248 ms

so, index i_ab really got used for both scans. query is a bit faster, but i think it's more due to fluctuations in system load.

so, we were able to find all overlapping ranges in around 55 seconds. given that we were doing this for 10000 rows, i can guess that for single row, we would be able to find all overlapping ranges (from list of 10000) in about 5ms.

not bad, not bad at all.

  1. 11 comments

  2. # Ants Aasma
    Nov 21, 2007

    That or criteria simplifies as
    SELECT t1.*, (SELECT count(*) FROM test t2 WHERE t2.a t1.a) FROM test t1;
    This way there is one fewer index scans.

  3. # Ants Aasma
    Nov 21, 2007

    Stupid wordpress ate half of the criteria, meant to say
    SELECT t1.*, (SELECT count(*) FROM test t2 WHERE (t2.a < t1.b AND t2.b > t1.a)) FROM test t1;

  4. Nov 21, 2007

    @Ants Aasma:
    i think the operator is missing in the “where’ clause.

    but i dont really see what operator could you use to make this query return the same results as “overlaps”.

  5. Nov 21, 2007

    @Ants Aasma:
    ah. sometimes the easiest approach is somehow hidden 🙂

  6. # Jeff Davis
    Nov 21, 2007

    You might be interested to see the pgsql-temporal project here:

    http://www.pgfoundry.org/projects/temporal/

    It provides a time interval type (what you call a time range) and also provides GiST support functions so that it can be efficiently indexed, including the “overlaps” operator.

  7. Nov 22, 2007

    @Jeff Davis:
    thanks for link, i’ll definitely check it.

  8. Mar 29, 2008

    What was the final conclusion of the query?

  9. Mar 29, 2008

    @T1:
    looks like Ants Aasma approach from his comment is the best.

  10. Apr 16, 2008

    hi there. I’m trying something similar in php though and with dates. Your problem is your condition: you are missing the case (c,d) is inside (a,b). so the condition should be
    (c>=a AND c=a AND d<=b) that translates into: either edges of interval (c,d) is inside interval (a,b) which is bad

  11. Apr 16, 2008

    ????
    what happened to my condition?
    (c>=a && c=a && d<b)

  12. Apr 16, 2008

    what happened to my condition?

    (c>=a a.Nd c=a a.Nd d<=b)

Leave a comment