Calculating backlog of events to handle

Yesterday on my favorite IRC channel fooqux asked interesting question. I took some more questions, and here is problem description:

We have a system which, every 5 minutes, takes a number of tasks to be done. Tasks are uniform. Within 5 minutes we can handle at most 100 tasks. Given the history of number of tasks added every 5 minutes, calculate backlog at any given moment.

Did you understand the problem? Well – I didn't. So, let's see the data, and expected output.

So. Let's see the data. I generated some random data, and here is how it looks:

# SELECT * FROM test ORDER BY stamp LIMIT 20;
        stamp        │ new_events
─────────────────────┼────────────
 2000-01-01 00:05:00 │         15
 2000-01-01 00:10:00 │         46
 2000-01-01 00:15:00 │         57
 2000-01-01 00:20:00 │         19
 2000-01-01 00:25:00 │        132
 2000-01-01 00:30:00 │         14
 2000-01-01 00:35:00 │         19
 2000-01-01 00:40:00 │        137
 2000-01-01 00:45:00 │         14
 2000-01-01 00:50:00 │        146
 2000-01-01 00:55:00 │         83
 2000-01-01 01:00:00 │         72
 2000-01-01 01:05:00 │         43
 2000-01-01 01:10:00 │         75
 2000-01-01 01:15:00 │         74
 2000-01-01 01:20:00 │         62
 2000-01-01 01:25:00 │        134
 2000-01-01 01:30:00 │         87
 2000-01-01 01:35:00 │         59
 2000-01-01 01:40:00 │        103
(20 ROWS)

Nothing really fancy here (side info: check those nice, Unicode frames in new psql 🙂 will be writing about them some time in future).

So, the idea is that every 5 minutes (every, there are no gaps) new number arrives. And this is number of items to be processed. And every 5 minutes we can process 100 tasks, which is fine when we get 15 new tasks, but when system gets 150 tasks, and it will do only 100 of them in its 5 minute period of tim – it will backlog.

Every backlog calculation has to start from some specific point in time, where we know it was 0. In here – we know that on ‘2000-01-01 00:05:00' it was 0 (because there were no events before).

Also, on every time frame we have to calculate how many tasks could be processed, and how many will have to be added or subtracted from backlog.

So, let's add first layer of calculation:

# SELECT *, new_events - 100 AS backlog_change FROM test ORDER BY stamp LIMIT 20;
        stamp        │ new_events │ backlog_change
─────────────────────┼────────────┼────────────────
 2000-01-01 00:05:00 │         15-85
 2000-01-01 00:10:00 │         46-54
 2000-01-01 00:15:00 │         57-43
 2000-01-01 00:20:00 │         19-81
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         14-86
 2000-01-01 00:35:00 │         19-81
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         14-86
 2000-01-01 00:50:00 │        14646
 2000-01-01 00:55:00 │         83-17
 2000-01-01 01:00:00 │         72-28
 2000-01-01 01:05:00 │         43-57
 2000-01-01 01:10:00 │         75-25
 2000-01-01 01:15:00 │         74-26
 2000-01-01 01:20:00 │         62-38
 2000-01-01 01:25:00 │        13434
 2000-01-01 01:30:00 │         87-13
 2000-01-01 01:35:00 │         59-41
 2000-01-01 01:40:00 │        1033
(20 ROWS)

Of course backlog cannot be less than zero – you can't have -1 tasks to be taken care of.

So, how can we deal with it? With help of Common Table Expressions, which were added in PostgreSQL 8.4.

Now we know what to use. But how?

Recursive CTE is not easy to comprehend at first, but with some trial and error we can grasp it.

First of all – Recursive CTE references itself. So, we need to start with something that it will be able to reference to.

First query is trivial:

# WITH backlog_info AS (
    SELECT *, new_events - 100 AS backlog
    FROM test
    WHERE stamp = '2000-01-01 00:05:00'
)
SELECT *
FROM backlog_info
ORDER BY stamp ASC
LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         15-85
(1 ROW)

I hope the code is clear. Couple of important points:

  • WHERE stamp = ‘…' is to select single row that we will start from. It should be that before it backlog is 0
  • ORDER BY and LIMIT – it's obviously not necessary now, since we have only 1 row to return, but let's have it from start – in next queries we will change only the WITH() part.
  • There is obvious bug – as I wrote above backlog cannot be < 0. It's shouldn't be possible.

So, given the bug, we have to fix it. It's fortunately, relatively simple:

# WITH backlog_info AS (
    SELECT *, greatest( new_events - 100, 0 ) AS backlog
    FROM test
    WHERE stamp = '2000-01-01 00:05:00'
)
SELECT *
FROM backlog_info
ORDER BY stamp ASC
LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
(1 ROW)

Just to remind: greatest( X, Y ) will return greater of these numbers. So if calculated backlog will be > 0 – it will be returned. If it will be < 0 – 0 will be greater, and backlog will be 0. Hope it's clear 🙂

Now. We have our first row ready. What about others?

In next rows we should reference previous one. Luckily it's very easy:

WITH RECURSIVE backlog_info AS (
    SELECT *, greatest( new_events - 100, 0 ) AS backlog
    FROM test
    WHERE stamp = '2000-01-01 00:05:00'
UNION ALL
    SELECT t.*, greatest( pt.backlog + t.new_events - 100, 0 ) AS backlog
    FROM test t, backlog_info pt
    WHERE t.stamp = pt.stamp + '5 minutes'::INTERVAL
)
SELECT *
FROM backlog_info
ORDER BY stamp ASC
LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
 2000-01-01 00:10:00 │         460
 2000-01-01 00:15:00 │         570
 2000-01-01 00:20:00 │         190
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         140
 2000-01-01 00:35:00 │         190
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         140
 2000-01-01 00:50:00 │        14646
 2000-01-01 00:55:00 │         8329
 2000-01-01 01:00:00 │         721
 2000-01-01 01:05:00 │         430
 2000-01-01 01:10:00 │         750
 2000-01-01 01:15:00 │         740
 2000-01-01 01:20:00 │         620
 2000-01-01 01:25:00 │        13434
 2000-01-01 01:30:00 │         8721
 2000-01-01 01:35:00 │         590
 2000-01-01 01:40:00 │        1033
(20 ROWS)

Nice. It works. But am I happy with it? well. Not really.

There are still 2 problems.

First is more psychological: what will happen if there will be no row for given every-5-minute point? Let's see:

# DELETE FROM test WHERE stamp = '2000-01-01 00:55:00';
 
# WITH RECURSIVE backlog_info AS (
    SELECT *, greatest( new_events - 100, 0 ) AS backlog
    FROM test
    WHERE stamp = '2000-01-01 00:05:00'
UNION ALL
    SELECT t.*, greatest( pt.backlog + t.new_events - 100, 0 ) AS backlog
    FROM test t, backlog_info pt
    WHERE t.stamp = pt.stamp + '5 minutes'::INTERVAL
)
SELECT *
FROM backlog_info
ORDER BY stamp ASC
LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
 2000-01-01 00:10:00 │         460
 2000-01-01 00:15:00 │         570
 2000-01-01 00:20:00 │         190
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         140
 2000-01-01 00:35:00 │         190
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         140
 2000-01-01 00:50:00 │        14646
(10 ROWS)

Quite bad, isn't it?

And 2nd. problem. Let's take a look at explain analyze output:

                                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=219.75..219.80 rows=20 width=16) (actual time=2765.042..2765.115 rows=20 loops=1)
   CTE backlog_info
     ->  Recursive Union  (cost=0.00..215.04 rows=101 width=16) (actual time=0.190..2760.145 rows=1000 loops=1)
           ->  Index Scan using some_idx on test  (cost=0.00..8.27 rows=1 width=12) (actual time=0.183..0.187 rows=1 loops=1)
                 Index Cond: (stamp = '2000-01-01 00:05:00'::timestamp without time zone)
           ->  Hash Join  (cost=0.33..20.48 rows=10 width=16) (actual time=1.367..2.754 rows=1 loops=1000)
                 Hash Cond: (t.stamp = (pt.stamp + '00:05:00'::interval))
                 ->  Seq Scan on test t  (cost=0.00..15.00 rows=1000 width=12) (actual time=0.006..1.353 rows=1000 loops=1000)
                 ->  Hash  (cost=0.20..0.20 rows=10 width=12) (actual time=0.008..0.008 rows=1 loops=1000)
                       ->  WorkTable Scan on backlog_info pt  (cost=0.00..0.20 rows=10 width=12) (actual time=0.002..0.003 rows=1 loops=1000)
   ->  Sort  (cost=4.71..4.96 rows=101 width=16) (actual time=2765.037..2765.058 rows=20 loops=1)
         Sort Key: backlog_info.stamp
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  CTE Scan on backlog_info  (cost=0.00..2.02 rows=101 width=16) (actual time=0.198..2763.362 rows=1000 loops=1)
 Total runtime: 2765.290 ms
(15 rows)

Do you see the problem? No? Check loops counts on scans. This is simply wrong.

(number 1000 comes from the fact that I have 1000 rows in there).

So, what should be the solution? It's quite simple actually – don't sweat it, write a stored procedure 🙂

CREATE OR REPLACE FUNCTION
    get_events_data_with_backlog(
        IN  from_when  TIMESTAMP,
        OUT stamp      TIMESTAMP,
        OUT new_events INT4,
        OUT backlog    INT4
    )
    RETURNS SETOF RECORD
    LANGUAGE plpgsql
    AS $$
DECLARE
    temprec        RECORD;
BEGIN
    backlog := 0;
    FOR temprec IN
        SELECT *
        FROM test t
        WHERE t.stamp >= COALESCE( from_when, '-infinity' )
        ORDER BY t.stamp ASC
    LOOP
        stamp      := temprec.stamp;
        new_events := temprec.new_events;
        backlog    := greatest( backlog + ( new_events - 100 ), 0 );
        RETURN NEXT;
    END LOOP;
    RETURN;
END;
$$;

How does it work? Sanely:

# SELECT * FROM get_events_data_with_backlog(NULL) LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
 2000-01-01 00:10:00 │         460
 2000-01-01 00:15:00 │         570
 2000-01-01 00:20:00 │         190
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         140
 2000-01-01 00:35:00 │         190
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         140
 2000-01-01 00:50:00 │        14646
 2000-01-01 00:55:00 │         8329
 2000-01-01 01:00:00 │         721
 2000-01-01 01:05:00 │         430
 2000-01-01 01:10:00 │         750
 2000-01-01 01:15:00 │         740
 2000-01-01 01:20:00 │         620
 2000-01-01 01:25:00 │        13434
 2000-01-01 01:30:00 │         8721
 2000-01-01 01:35:00 │         590
 2000-01-01 01:40:00 │        1033
(20 ROWS)

And what happens if I remove any row from within?

# DELETE FROM test WHERE stamp = '2000-01-01 00:55:00';
 
# SELECT * FROM get_events_data_with_backlog(NULL) LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
 2000-01-01 00:10:00 │         460
 2000-01-01 00:15:00 │         570
 2000-01-01 00:20:00 │         190
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         140
 2000-01-01 00:35:00 │         190
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         140
 2000-01-01 00:50:00 │        14646
 2000-01-01 01:00:00 │         7218
 2000-01-01 01:05:00 │         430
 2000-01-01 01:10:00 │         750
 2000-01-01 01:15:00 │         740
 2000-01-01 01:20:00 │         620
 2000-01-01 01:25:00 │        13434
 2000-01-01 01:30:00 │         8721
 2000-01-01 01:35:00 │         590
 2000-01-01 01:40:00 │        1033
 2000-01-01 01:45:00 │        10710
(20 ROWS)

OK. So, there are 2 good points, and 1 bad. Good: it's faster – only one seq scan over table, and that's all. Second: it doesn't break if there is no data in middle of dataset.

Bad: Well, it returns 20 rows, but these are wrong. Backlog for ‘2000-01-01 01:00:00' should be definitely 0. System had enough time to process data from ‘2000-01-01 00:50:00' – it had 10 minutes.

So, let's fix the last bug:

CREATE OR REPLACE FUNCTION
    get_events_data_with_backlog(
        IN  from_when  TIMESTAMP,
        OUT stamp      TIMESTAMP,
        OUT new_events INT4,
        OUT backlog    INT4
    )
    RETURNS SETOF RECORD
    LANGUAGE plpgsql
    AS $$
DECLARE
    temprec          RECORD;
    previous_stamp   TIMESTAMP;
    time_passed      INTERVAL;
    intervals_passed INT4;
    could_process    INT4;
BEGIN
    backlog := 0;
    FOR temprec IN
        SELECT *
        FROM test t
        WHERE t.stamp >= COALESCE( from_when, '-infinity' )
        ORDER BY t.stamp ASC
    LOOP
        IF previous_stamp IS NULL THEN
            previous_stamp := temprec.stamp - '5 minutes'::INTERVAL;
        END IF;
        stamp            := temprec.stamp;
        new_events       := temprec.new_events;
        time_passed      := temprec.stamp - previous_stamp;
        intervals_passed := EXTRACT(epoch FROM time_passed) / 300;
        could_process    := intervals_passed * 100;
        backlog          := greatest( backlog + ( new_events - could_process), 0 );
        RETURN NEXT;
        previous_stamp := temprec.stamp;
    END LOOP;
    RETURN;
END;
$$;

( of course calculations within function could be written in much terser format, but I wanted it to be clear what it does ).

And how does it work?

# DELETE FROM test WHERE stamp = '2000-01-01 00:55:00';
 
# SELECT * FROM get_events_data_with_backlog(NULL) LIMIT 20;
        stamp        │ new_events │ backlog
─────────────────────┼────────────┼─────────
 2000-01-01 00:05:00 │         150
 2000-01-01 00:10:00 │         460
 2000-01-01 00:15:00 │         570
 2000-01-01 00:20:00 │         190
 2000-01-01 00:25:00 │        13232
 2000-01-01 00:30:00 │         140
 2000-01-01 00:35:00 │         190
 2000-01-01 00:40:00 │        13737
 2000-01-01 00:45:00 │         140
 2000-01-01 00:50:00 │        14646
 2000-01-01 01:00:00 │         720
 2000-01-01 01:05:00 │         430
 2000-01-01 01:10:00 │         750
 2000-01-01 01:15:00 │         740
 2000-01-01 01:20:00 │         620
 2000-01-01 01:25:00 │        13434
 2000-01-01 01:30:00 │         8721
 2000-01-01 01:35:00 │         590
 2000-01-01 01:40:00 │        1033
 2000-01-01 01:45:00 │        10710
(20 ROWS)

Nice. Everything calculated properly.

4 thoughts on “Calculating backlog of events to handle”

  1. It’s just too bad that those “nice Unicode frames” are not stacked together in the HTML code block.

    A line-height:0.5em; instead of the current 1em in style.css:75 would take care of that…

  2. @Steve:
    good point. Will look into it, but since layout is (as you can definitely see) my priority – it might take a while to get to it 🙁

  3. @Efgé:
    That’s what fooqux wanted, and that’s what I originally tried to write, but it seems that I can’t write it with window functions. If you can – great. Show us – I mean – it’s perfectly possible that I overlooked something.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.