Today some guy on IRC asked question, which I didn't fully understand, but which could (probably) be summarized: how to group data into 5 minute intervals, based on some timestamp column.
Well, it seems trivial (as long as you know how to do it), but since he clearly didn't know how to do it (or I misunderstood his problem), here's the explanation:
First, let's make some test table:
$ CREATE TABLE test AS SELECT i, now() - '6 months'::INTERVAL * random() AS ts FROM generate_series(1,100000) i;
Data looks like this:
$ SELECT * FROM test LIMIT 10; i | ts ----+------------------------------ 1 | 2010-05-11 08:58:18.94062+02 2 | 2010-05-09 00:01:20.46702+02 3 | 2010-09-29 10:51:48.29742+02 4 | 2010-08-11 20:46:35.64462+02 5 | 2010-05-07 05:39:20.96622+02 6 | 2010-05-14 08:08:43.15182+02 7 | 2010-10-15 09:50:42.86382+02 8 | 2010-05-11 22:21:28.36782+02 9 | 2010-05-05 19:56:36.95982+02 10 | 2010-08-26 01:33:35.25102+02 (10 ROWS)
some base statistics:
$ SELECT COUNT(*), MIN(ts), MAX(ts) FROM test; COUNT | MIN | MAX --------+------------------------------+------------------------------ 100000 | 2010-04-22 14:06:09.01422+02 | 2010-10-22 14:05:20.02542+02 (1 ROW)
Before I'll go to 5 minute intervals (which are somewhat tricky), let's see how to get data grouped into some simpler ranges:
by day:
$ SELECT date_trunc( 'day', ts ), COUNT(*) FROM test GROUP BY 1 ORDER BY 1 DESC LIMIT 10; date_trunc | COUNT ------------------------+------- 2010-10-22 00:00:00+02 | 301 2010-10-21 00:00:00+02 | 548 2010-10-20 00:00:00+02 | 571 2010-10-19 00:00:00+02 | 571 2010-10-18 00:00:00+02 | 543 (5 ROWS)
by hour:
$ SELECT date_trunc( 'hour', ts ), COUNT(*) FROM test GROUP BY 1 ORDER BY 1 DESC LIMIT 5; date_trunc | COUNT ------------------------+------- 2010-10-22 14:00:00+02 | 3 2010-10-22 13:00:00+02 | 27 2010-10-22 12:00:00+02 | 21 2010-10-22 11:00:00+02 | 32 2010-10-22 10:00:00+02 | 22 (5 ROWS)
So far it looks rather simple. For other date_trunc options, you can check the fine manual.
But there is no ‘5 minutes' truncation. So how can we do it?
One approach is to use epoch values, and integer mathemetics:
$ SELECT now(), 'epoch'::timestamptz + '300 seconds'::INTERVAL * (EXTRACT(epoch FROM now())::int4 / 300); now | ?COLUMN? -------------------------------+------------------------ 2010-10-22 14:15:21.397404+02 | 2010-10-22 14:15:00+02 (1 ROW)
Wait a moment …
$ SELECT now(), 'epoch'::timestamptz + '300 seconds'::INTERVAL * (EXTRACT(epoch FROM now())::int4 / 300); now | ?COLUMN? -------------------------------+------------------------ 2010-10-22 14:17:47.363632+02 | 2010-10-22 14:15:00+02 (1 ROW)
Nice. Of course using this in queries will get tedious pretty soon, but we can use functions to make it simpler:
CREATE FUNCTION ts_round( timestamptz, INT4 ) RETURNS TIMESTAMPTZ AS $$ SELECT 'epoch'::timestamptz + '1 second'::INTERVAL * ( $2 * ( EXTRACT( epoch FROM $1 )::INT4 / $2 ) ); $$ LANGUAGE SQL;
The cool thing is that when writing the function, there is not much point in hardcoding the 300 value, so instead I can make it variable. And now I can:
$ SELECT now(), ts_round( now(), 300 ); now | ts_round -------------------------------+------------------------ 2010-10-22 14:20:57.548683+02 | 2010-10-22 14:20:00+02 (1 ROW)
Or even, if I'd need to:
$ SELECT now(), ts_round( now(), 629 ); now | ts_round -------------------------------+------------------------ 2010-10-22 14:21:17.878654+02 | 2010-10-22 14:16:53+02 (1 ROW)
So, with this function at hand, I can do the grouping:
$ SELECT ts_round( ts, 300 ), COUNT(*) FROM test GROUP BY 1 ORDER BY 1 DESC LIMIT 5; ts_round | COUNT ------------------------+------- 2010-10-22 14:05:00+02 | 2 2010-10-22 14:00:00+02 | 1 2010-10-22 13:55:00+02 | 2 2010-10-22 13:50:00+02 | 1 2010-10-22 13:45:00+02 | 1 (5 ROWS)
And to verify that the values are indeed grouped correctly, I can use the trick I wrote about earlier, and do:
SELECT ts_round( ts, 300 ), COUNT(*), '- ' || array_to_string( array_agg( ts::text ), E'\n- ' ) FROM test GROUP BY 1 ORDER BY 1 DESC LIMIT 10; ts_round | COUNT | ?COLUMN? ------------------------+-------+-------------------------------- 2010-10-22 14:05:00+02 | 2 | - 2010-10-22 14:05:20.02542+02+ | | - 2010-10-22 14:05:03.43662+02 2010-10-22 14:00:00+02 | 1 | - 2010-10-22 14:01:11.71182+02 2010-10-22 13:55:00+02 | 2 | - 2010-10-22 13:57:30.87342+02+ | | - 2010-10-22 13:58:36.79662+02 2010-10-22 13:50:00+02 | 1 | - 2010-10-22 13:51:30.06702+02 2010-10-22 13:45:00+02 | 1 | - 2010-10-22 13:46:38.38062+02 2010-10-22 13:40:00+02 | 5 | - 2010-10-22 13:44:55.39182+02+ | | - 2010-10-22 13:42:52.27182+02+ | | - 2010-10-22 13:44:28.95342+02+ | | - 2010-10-22 13:41:53.43342+02+ | | - 2010-10-22 13:40:49.67022+02 2010-10-22 13:35:00+02 | 1 | - 2010-10-22 13:38:17.34702+02 2010-10-22 13:30:00+02 | 2 | - 2010-10-22 13:32:31.40142+02+ | | - 2010-10-22 13:31:16.23342+02 2010-10-22 13:25:00+02 | 4 | - 2010-10-22 13:26:56.86062+02+ | | - 2010-10-22 13:25:20.00622+02+ | | - 2010-10-22 13:25:37.89102+02+ | | - 2010-10-22 13:29:00.15342+02 2010-10-22 13:20:00+02 | 2 | - 2010-10-22 13:23:31.57422+02+ | | - 2010-10-22 13:21:27.50382+02 (10 ROWS)
Works quite nicely.
Or you could use window functions
select count(*) over(partition by ts_round(ts, 300), …
you could probably do it with bucket width.
I’d probably go for “GROUP BY DATE_TRUNC(‘hour’, ts), DATE_PART(‘minute’, ts) / 5”.
Hey depesz 🙂 . I can use ur trick. anyway here is a solution too..
select date_trunc(‘minute’,ms_click_date) – (extract(minute from ms_click_date)::integer % 5) * interval ‘1 minute’ , count(*) from user_sessions where us_click_date between ‘2010-06-15 00:00:01’ and ‘2010-06-15 02:00:01’ group by 1 order by 1;
oops corrected columns names.
select date_trunc(‘minute’,us_click_date) – (extract(minute from us_click_date)::integer % 5) * interval ‘1 minute’ , count(*) from user_sessions where us_click_date between ‘2010-06-15 00:00:01’ and ‘2010-06-15 02:00:01’ group by 1 order by 1;
This works great—thanks!