November 22nd, 2007 by depesz | Tags: | 1 comment »
Did it help? If yes - maybe you can help me?

in previous post i described how to find overlapping ranges in a way that will work with indexes.

now, i'd like to discuss something else.

how to check how many ranges are in given moment?

generally it is simple, let's create test table:

CREATE TABLE test as
SELECT least(a,b) as a, greatest(a,b) as b
FROM (
SELECT
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as a,
( '1900-01-01'::date + 50000 * random() * '1 day'::INTERVAL)::date as b
FROM generate_series(1,10000) i
) x
WHERE a <> b;
CREATE INDEX i_a on test (a);
CREATE INDEX i_b on test (b);
CREATE UNIQUE INDEX i_ab on test (a,b);
CREATE UNIQUE INDEX i_ba on test (b,a);

for any given day i can:

select count(*)
from test
where '2007-11-01' between a and b;

which of course uses indexes:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=181.37..181.38 rows=1 width=0) (actual time=16.394..16.397 rows=1 loops=1)
-> Bitmap Heap Scan on test (cost=69.85..172.23 rows=3655 width=0) (actual time=1.433..9.651 rows=3386 loops=1)
Recheck Cond: ('2007-11-01'::date <= b)
Filter: ('2007-11-01'::date >= a)
-> Bitmap Index Scan on i_b (cost=0.00..68.94 rows=3825 width=0) (actual time=1.384..1.384 rows=3828 loops=1)
Index Cond: ('2007-11-01'::date <= b)
Total runtime: 16.481 ms

but the question is – how do i find dates which match the most ranges?

of course i could:

select
t1.a,
( select count(*) from test t2 where t1.a between t2.a and t2.b ) as count
from
test t1
order by count desc
limit 1;

but it's slow:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1553307.10..1553307.10 rows=1 width=4) (actual time=171289.465..171289.467 rows=1 loops=1)
-> Sort (cost=1553307.10..1553332.10 rows=10000 width=4) (actual time=171289.460..171289.460 rows=1 loops=1)
Sort Key: ((subplan))
Sort Method: top-N heapsort Memory: 17kB
-> Seq Scan on test t1 (cost=0.00..1553257.10 rows=10000 width=4) (actual time=35.276..171259.246 rows=10000 loops=1)
SubPlan
-> Aggregate (cost=155.30..155.31 rows=1 width=0) (actual time=17.115..17.117 rows=1 loops=10000)
-> Bitmap Heap Scan on test t2 (cost=57.53..152.52 rows=1111 width=0) (actual time=2.331..11.220 rows=3313 loops=10000)
Recheck Cond: ($0 <= b)
Filter: ($0 >= a)
-> Bitmap Index Scan on i_b (cost=0.00..57.25 rows=3333 width=0) (actual time=2.292..2.292 rows=8313 loops=10000)
Index Cond: ($0 <= b)
Total runtime: 171289.581 ms

painful. what's more – it seems that it would scale badly.

is there any other choice?

actually – there is.

let's write a function for this task:

CREATE OR REPLACE FUNCTION get_most_busy_day(OUT start_date date, OUT end_date date, OUT ranges INT4) as $BODY$
DECLARE
current_count INT4 := 0;
current_ok bool := false;
temprec record;
BEGIN
ranges := 0;
FOR temprec IN SELECT 's' as code, a as x FROM test union all SELECT 'f' as code, b as x FROM test ORDER BY x asc, code desc loop
IF temprec.code = 'f' THEN
current_count := current_count - 1;
IF current_ok = true THEN
current_ok := false;
end_date := temprec.x;
END IF;
continue;
END IF;
current_count := current_count + 1;
IF current_count > ranges THEN
start_date := temprec.x;
ranges := current_count;
current_ok := true;
END IF;
END loop;
RETURN;
END;
$BODY$ language plpgsql;

this function returns start and end of “virtual" range of days which have the largest number of concurrent ranges in source table.

example:

# SELECT * FROM get_most_busy_day();
start_date | end_date | ranges
------------+------------+--------
1968-02-12 | 1968-02-13 | 4993
(1 row)

it means that in my data, from 1968-02-12 to 1968-02-13 there are 4993 concurrent matching ranges in test table.

and what about speed?

# explain analyze SELECT * FROM get_most_busy_day();
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on get_most_busy_day (cost=0.00..0.26 rows=1 width=12) (actual time=270.127..270.130 rows=1 loops=1)
Total runtime: 270.175 ms

quite fast. especially when compared to pure-sql based approach 🙂

  1. 1 Trackback(s)

  2. Nov 23, 2007: Sheeri Kritzer Cabral » Blog Archive » Log Buffer #72 — a Carnival of the Vanities for DBAs - The MySQL She-BA

Leave a comment