Grouping data into time ranges

Today some guy on IRC asked question, which I didn't fully understand, but which could (probably) be summarized: how to group data into 5 minute intervals, based on some timestamp column.

Well, it seems trivial (as long as you know how to do it), but since he clearly didn't know how to do it (or I misunderstood his problem), here's the explanation:

First, let's make some test table:

$ CREATE TABLE test AS
    SELECT i, now() - '6 months'::INTERVAL * random() AS ts
    FROM generate_series(1,100000) i;

Data looks like this:

$ SELECT * FROM test LIMIT 10;
 i  |              ts
----+------------------------------
  1 | 2010-05-11 08:58:18.94062+02
  2 | 2010-05-09 00:01:20.46702+02
  3 | 2010-09-29 10:51:48.29742+02
  4 | 2010-08-11 20:46:35.64462+02
  5 | 2010-05-07 05:39:20.96622+02
  6 | 2010-05-14 08:08:43.15182+02
  7 | 2010-10-15 09:50:42.86382+02
  8 | 2010-05-11 22:21:28.36782+02
  9 | 2010-05-05 19:56:36.95982+02
 10 | 2010-08-26 01:33:35.25102+02
(10 ROWS)

some base statistics:

$ SELECT COUNT(*), MIN(ts), MAX(ts) FROM test;
 COUNT  |             MIN              |             MAX
--------+------------------------------+------------------------------
 100000 | 2010-04-22 14:06:09.01422+02 | 2010-10-22 14:05:20.02542+02
(1 ROW)

Before I'll go to 5 minute intervals (which are somewhat tricky), let's see how to get data grouped into some simpler ranges:

by day:

$ SELECT date_trunc( 'day', ts ), COUNT(*)
    FROM test
    GROUP BY 1
    ORDER BY 1 DESC LIMIT 10;
       date_trunc       | COUNT
------------------------+-------
 2010-10-22 00:00:00+02 |   301
 2010-10-21 00:00:00+02 |   548
 2010-10-20 00:00:00+02 |   571
 2010-10-19 00:00:00+02 |   571
 2010-10-18 00:00:00+02 |   543
(5 ROWS)

by hour:

$ SELECT date_trunc( 'hour', ts ), COUNT(*)
    FROM test
    GROUP BY 1
    ORDER BY 1 DESC LIMIT 5;
       date_trunc       | COUNT
------------------------+-------
 2010-10-22 14:00:00+02 |     3
 2010-10-22 13:00:00+02 |    27
 2010-10-22 12:00:00+02 |    21
 2010-10-22 11:00:00+02 |    32
 2010-10-22 10:00:00+02 |    22
(5 ROWS)

So far it looks rather simple. For other date_trunc options, you can check the fine manual.

But there is no ‘5 minutes' truncation. So how can we do it?

One approach is to use epoch values, and integer mathemetics:

$ SELECT now(), 'epoch'::timestamptz + '300 seconds'::INTERVAL * (EXTRACT(epoch FROM now())::int4 / 300);
              now              |        ?COLUMN?
-------------------------------+------------------------
 2010-10-22 14:15:21.397404+02 | 2010-10-22 14:15:00+02
(1 ROW)

Wait a moment …

$ SELECT now(), 'epoch'::timestamptz + '300 seconds'::INTERVAL * (EXTRACT(epoch FROM now())::int4 / 300);
              now              |        ?COLUMN?
-------------------------------+------------------------
 2010-10-22 14:17:47.363632+02 | 2010-10-22 14:15:00+02
(1 ROW)

Nice. Of course using this in queries will get tedious pretty soon, but we can use functions to make it simpler:

CREATE FUNCTION ts_round( timestamptz, INT4 ) RETURNS TIMESTAMPTZ AS $$
    SELECT 'epoch'::timestamptz + '1 second'::INTERVAL * ( $2 * ( EXTRACT( epoch FROM $1 )::INT4 / $2 ) );
$$ LANGUAGE SQL;

The cool thing is that when writing the function, there is not much point in hardcoding the 300 value, so instead I can make it variable. And now I can:

$ SELECT now(), ts_round( now(), 300 );
              now              |        ts_round
-------------------------------+------------------------
 2010-10-22 14:20:57.548683+02 | 2010-10-22 14:20:00+02
(1 ROW)

Or even, if I'd need to:

$ SELECT now(), ts_round( now(), 629 );
              now              |        ts_round
-------------------------------+------------------------
 2010-10-22 14:21:17.878654+02 | 2010-10-22 14:16:53+02
(1 ROW)

So, with this function at hand, I can do the grouping:

$ SELECT ts_round( ts, 300 ), COUNT(*) FROM test GROUP BY 1 ORDER BY 1 DESC LIMIT 5;
        ts_round        | COUNT
------------------------+-------
 2010-10-22 14:05:00+02 |     2
 2010-10-22 14:00:00+02 |     1
 2010-10-22 13:55:00+02 |     2
 2010-10-22 13:50:00+02 |     1
 2010-10-22 13:45:00+02 |     1
(5 ROWS)

And to verify that the values are indeed grouped correctly, I can use the trick I wrote about earlier, and do:

SELECT
    ts_round( ts, 300 ),
    COUNT(*),
    '- ' || array_to_string(
        array_agg( ts::text ),
        E'\n- '
    )
FROM
    test
GROUP BY 1
ORDER BY 1 DESC LIMIT 10;
        ts_round        | COUNT |            ?COLUMN?
------------------------+-------+--------------------------------
 2010-10-22 14:05:00+02 |     2 | - 2010-10-22 14:05:20.02542+02+
                        |       | - 2010-10-22 14:05:03.43662+02
 2010-10-22 14:00:00+02 |     1 | - 2010-10-22 14:01:11.71182+02
 2010-10-22 13:55:00+02 |     2 | - 2010-10-22 13:57:30.87342+02+
                        |       | - 2010-10-22 13:58:36.79662+02
 2010-10-22 13:50:00+02 |     1 | - 2010-10-22 13:51:30.06702+02
 2010-10-22 13:45:00+02 |     1 | - 2010-10-22 13:46:38.38062+02
 2010-10-22 13:40:00+02 |     5 | - 2010-10-22 13:44:55.39182+02+
                        |       | - 2010-10-22 13:42:52.27182+02+
                        |       | - 2010-10-22 13:44:28.95342+02+
                        |       | - 2010-10-22 13:41:53.43342+02+
                        |       | - 2010-10-22 13:40:49.67022+02
 2010-10-22 13:35:00+02 |     1 | - 2010-10-22 13:38:17.34702+02
 2010-10-22 13:30:00+02 |     2 | - 2010-10-22 13:32:31.40142+02+
                        |       | - 2010-10-22 13:31:16.23342+02
 2010-10-22 13:25:00+02 |     4 | - 2010-10-22 13:26:56.86062+02+
                        |       | - 2010-10-22 13:25:20.00622+02+
                        |       | - 2010-10-22 13:25:37.89102+02+
                        |       | - 2010-10-22 13:29:00.15342+02
 2010-10-22 13:20:00+02 |     2 | - 2010-10-22 13:23:31.57422+02+
                        |       | - 2010-10-22 13:21:27.50382+02
(10 ROWS)

Works quite nicely.

6 thoughts on “Grouping data into time ranges”

  1. Or you could use window functions
    select count(*) over(partition by ts_round(ts, 300), …

  2. Hey depesz 🙂 . I can use ur trick. anyway here is a solution too..

    select date_trunc(‘minute’,ms_click_date) – (extract(minute from ms_click_date)::integer % 5) * interval ‘1 minute’ , count(*) from user_sessions where us_click_date between ‘2010-06-15 00:00:01’ and ‘2010-06-15 02:00:01’ group by 1 order by 1;

  3. oops corrected columns names.

    select date_trunc(‘minute’,us_click_date) – (extract(minute from us_click_date)::integer % 5) * interval ‘1 minute’ , count(*) from user_sessions where us_click_date between ‘2010-06-15 00:00:01’ and ‘2010-06-15 02:00:01’ group by 1 order by 1;

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.