Filling in the blanks

Some time ago someone on irc asked interesting question. One that I couldn't answer then (didn't have an immediate idea, and didn't have time to spend on looking into it).

Now, I have some more time, and despite the fact that the person that had this problem no longer cares about it (he found some solution himself if I recall correctly), decided to look into it.

The problem is: having simple table:

$ CREATE TABLE test (
    id INT NOT NULL PRIMARY KEY,
    sample_date DATE NOT NULL,
    VALUE INT
);

And data there, like:

$ INSERT INTO test (id, sample_date, VALUE)
    VALUES
        (1, '2015-01-01', 100),
        (2, '2015-01-03', 120),
        (3, '2015-01-06', 140),
        (4, '2015-01-15', 100);

can we somehow return data that exists in the table, but also data for missing days, somehow “interpolating" the values?

So, we'd want to get:

 sample_date | VALUE | REAL
-------------+-------+------
 2015-01-01  |   100 | t
 2015-01-02  |   110 | f
 2015-01-03  |   120 | t
 2015-01-04  |   126 | f
 2015-01-05  |   134 | f
 2015-01-06  |   140 | t
 2015-01-07  |   136 | f
 2015-01-08  |   131 | f
 2015-01-09  |   127 | f
 2015-01-10  |   122 | f
 2015-01-11  |   118 | f
 2015-01-12  |   113 | f
 2015-01-13  |   109 | f
 2015-01-14  |   104 | f
 2015-01-15  |   100 | t
(4 ROWS)

“real" column has true for values coming from test table, and false for interpolated.

Since the values are rounded to integer, final results might be slightly different, but that's the general idea.

So, how can I do it?

First, I need somehow to get all rows, even with nulls. For this I will use generate_series. But hardcoding min/max values is not really nice, so let's make it check range from the table itself:

WITH range AS (
    SELECT MIN(sample_date), MAX(sample_date) FROM test
), all_days AS (
    SELECT
        g.day::DATE AS DAY
    FROM
        range r,
        lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)
)
SELECT * FROM all_days;
    DAY     
------------
 2015-01-01
 2015-01-02
 2015-01-03
 2015-01-04
 2015-01-05
 2015-01-06
 2015-01-07
 2015-01-08
 2015-01-09
 2015-01-10
 2015-01-11
 2015-01-12
 2015-01-13
 2015-01-14
 2015-01-15
(15 ROWS)

This is cool. Now, I need to join the test table, to get real data:

WITH range AS (
    SELECT MIN(sample_date), MAX(sample_date) FROM test
), all_days AS (
    SELECT
        g.day::DATE AS DAY
    FROM
        range r,
        lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)
), all_with_real AS (
    SELECT
        a.day,
        t.value,
        t.value IS NOT NULL AS REAL
    FROM
        all_days a
        LEFT OUTER JOIN test t ON a.day = t.sample_date
)
SELECT * FROM all_with_real ORDER BY 1;
    DAY     | VALUE  | REAL 
------------+--------+------
 2015-01-01 |    100 | t
 2015-01-02 | [NULL] | f
 2015-01-03 |    120 | t
 2015-01-04 | [NULL] | f
 2015-01-05 | [NULL] | f
 2015-01-06 |    140 | t
 2015-01-07 | [NULL] | f
 2015-01-08 | [NULL] | f
 2015-01-09 | [NULL] | f
 2015-01-10 | [NULL] | f
 2015-01-11 | [NULL] | f
 2015-01-12 | [NULL] | f
 2015-01-13 | [NULL] | f
 2015-01-14 | [NULL] | f
 2015-01-15 |    100 | t
(15 ROWS)

Now, how about actually filing in the values. I could try with subselects:

WITH range AS (
    SELECT MIN(sample_date), MAX(sample_date) FROM test
), all_days AS (
    SELECT
        g.day::DATE AS DAY
    FROM
        range r,
        lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)
), all_with_real AS (
    SELECT
        a.day,
        t.value,
        t.value IS NOT NULL AS REAL
    FROM
        all_days a
        LEFT OUTER JOIN test t ON a.day = t.sample_date
), all_with_prev_and_next AS (
    SELECT
        a.day,
        a.value,
        a.real,
        CASE
            WHEN a.real THEN NULL
            ELSE
                ( SELECT array[a.day - t.sample_date, t.value] FROM test t WHERE t.sample_date < a.day AND t.value IS NOT NULL ORDER BY t.sample_date DESC LIMIT 1 )
        END AS prev_day,
        CASE
            WHEN a.real THEN NULL
            ELSE
                ( SELECT array[t.sample_date - a.day, t.value] FROM test t WHERE t.sample_date > a.day AND t.value IS NOT NULL ORDER BY t.sample_date ASC LIMIT 1 )
        END AS next_day
    FROM
        all_with_real a
)
SELECT
    a.day,
    CASE
        WHEN a.real THEN a.value
        ELSE
            ( a.next_day[2] - a.prev_day[2] ) * a.prev_day[1] / ( a.next_day[1] + a.prev_day[1] ) + a.prev_day[2]
    END AS VALUE,
    a.real
FROM
    all_with_prev_and_next a
ORDER BY a.day;
    DAY     | VALUE | REAL 
------------+-------+------
 2015-01-01 |   100 | t
 2015-01-02 |   110 | f
 2015-01-03 |   120 | t
 2015-01-04 |   126 | f
 2015-01-05 |   133 | f
 2015-01-06 |   140 | t
 2015-01-07 |   136 | f
 2015-01-08 |   132 | f
 2015-01-09 |   127 | f
 2015-01-10 |   123 | f
 2015-01-11 |   118 | f
 2015-01-12 |   114 | f
 2015-01-13 |   109 | f
 2015-01-14 |   105 | f
 2015-01-15 |   100 | t
(15 ROWS)

But this will be relatively slow, as it has to do two scans of test table for every day we're missing data for.

Also, while I wrote it, and understand how it works, I wouldn't call it the nicest possible query.

So, is there, perhaps, any other approach we could do? Perhaps window functions could help?

Sure, they can. I found this solution. It is most likely not the best possible, but it works for me. Let's see:

WITH range AS (
    SELECT MIN(sample_date), MAX(sample_date) FROM test
), all_days AS (
    SELECT
        g.day::DATE AS DAY
    FROM
        range r,
        lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)
), all_with_real AS (
    SELECT
        a.day,
        t.value,
        t.value IS NOT NULL AS REAL
    FROM
        all_days a
        LEFT OUTER JOIN test t ON a.day = t.sample_date
), all_with_previous_and_next AS (
    SELECT
        a.*,
        MAX( CASE WHEN a.value IS NULL THEN NULL ELSE array[ a.day::TEXT, a.value::TEXT ] END ) OVER ( ORDER BY a.day ) AS previous_real,
        MIN( CASE WHEN a.value IS NULL THEN NULL ELSE array[ a.day::TEXT, a.value::TEXT ] END ) OVER ( ORDER BY a.day range BETWEEN CURRENT ROW AND unbounded following ) AS next_real
    FROM
        all_with_real a
)
SELECT
    a.day,
    COALESCE(
        a.value,
        a.previous_real[2]::INT4 + ( a.next_real[2]::INT4 - a.previous_real[2]::INT4 ) * ( a.day - a.previous_real[1]::DATE ) / ( a.next_real[1]::DATE - a.previous_real[1]::DATE )
    ) AS VALUE,
    a.real
FROM
    all_with_previous_and_next a
ORDER BY
    a.day;

Result is as expected, but how it works?

Let's see what each of the CTEs returns:

range:

    MIN     |    MAX     
------------+------------
 2015-01-01 | 2015-01-15
(1 ROW)

all_days:

SELECT * FROM all_days;
    DAY     
------------
 2015-01-01
 2015-01-02
 2015-01-03
 2015-01-04
 2015-01-05
 2015-01-06
 2015-01-07
 2015-01-08
 2015-01-09
 2015-01-10
 2015-01-11
 2015-01-12
 2015-01-13
 2015-01-14
 2015-01-15
(15 ROWS)

all_with_real:

    DAY     | VALUE  | REAL 
------------+--------+------
 2015-01-01 |    100 | t
 2015-01-02 | [NULL] | f
 2015-01-03 |    120 | t
 2015-01-04 | [NULL] | f
 2015-01-05 | [NULL] | f
 2015-01-06 |    140 | t
 2015-01-07 | [NULL] | f
 2015-01-08 | [NULL] | f
 2015-01-09 | [NULL] | f
 2015-01-10 | [NULL] | f
 2015-01-11 | [NULL] | f
 2015-01-12 | [NULL] | f
 2015-01-13 | [NULL] | f
 2015-01-14 | [NULL] | f
 2015-01-15 |    100 | t
(15 ROWS)

all_with_previous_and_next

    DAY     | VALUE  | REAL |  previous_real   |    next_real
------------+--------+------+------------------+------------------
 2015-01-01 |    100 | t    | {2015-01-01,100} | {2015-01-01,100}
 2015-01-02 | [NULL] | f    | {2015-01-01,100} | {2015-01-03,120}
 2015-01-03 |    120 | t    | {2015-01-03,120} | {2015-01-03,120}
 2015-01-04 | [NULL] | f    | {2015-01-03,120} | {2015-01-06,140}
 2015-01-05 | [NULL] | f    | {2015-01-03,120} | {2015-01-06,140}
 2015-01-06 |    140 | t    | {2015-01-06,140} | {2015-01-06,140}
 2015-01-07 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-08 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-09 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-10 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-11 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-12 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-13 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-14 | [NULL] | f    | {2015-01-06,140} | {2015-01-15,100}
 2015-01-15 |    100 | t    | {2015-01-15,100} | {2015-01-15,100}
(15 ROWS)

Of course this is the most magical. The previous_real and next_real, as expected, show previous date/value, and next date/value.

This is done using one nasty trick. Normally I could use:

MAX( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day )

To get previous day (and similar query with different frame to get next real day).

But – this doesn't give me value. I can't use “min/max" on the values, as the values can decrease in time.

I didn't want to do subselects to get the values for previous/next day, so instead I decided to abuse arrays. I cast both needed values to text, and merge them into array, but putting date as first column.

PostgreSQL date format YYYY-MM-DD has the great benefit of being sortable. So assuming the dates are unique, value will not influence ordering.

This is still not really nice – the dirty trick with arrays of text – it just doesn't look ok.

Can it be done in any other way?

Sure. First of all, maybe someone else will find a better way, but I figured that I can simply add custom aggregate (don't worry, this is not scary at all):

CREATE FUNCTION last_agg ( anyelement, anyelement ) RETURNS anyelement AS $$
    SELECT COALESCE( $2, $1 );
$$ LANGUAGE SQL;
 
CREATE aggregate last_agg ( anyelement ) (
    sfunc = last_agg,
    stype = anyelement
);

That's all. What it does is that it returns last not-null value from a group. This doesn't really make sense in case of normal grouping, as “last" is not really well defined there, but since PostgreSQL can have aggregates with “order by" clause, suddenly last makes a lot of sense.

For example, given these values, in this order:

  • 10
  • NULL
  • 20
  • NULL
  • 15
  • NULL

last_agg will return value 15.

Having this, we can change our monstrous query into something much nicer:

WITH range AS (
    SELECT MIN(sample_date), MAX(sample_date) FROM test
), all_days AS (
    SELECT
        g.day::DATE AS DAY
    FROM
        range r,
        lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)
), all_with_real AS (
    SELECT
        a.day,
        t.value,
        t.value IS NOT NULL AS REAL
    FROM
        all_days a
        LEFT OUTER JOIN test t ON a.day = t.sample_date
), all_with_previous_and_next AS (
    SELECT
        a.*,
        last_agg( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day ) AS previous_day,
        last_agg( a.value ) OVER ( ORDER BY a.day ) AS previous_value,
        last_agg( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day DESC ) AS next_day,
        last_agg( a.value ) OVER ( ORDER BY a.day DESC ) AS next_value
    FROM
        all_with_real a
)
SELECT
    a.day,
    COALESCE(
        a.value,
        a.previous_value + ( a.day - a.previous_day ) * ( a.next_value - a.previous_value ) / ( a.next_day - a.previous_day )
    ) AS VALUE,
    a.real
FROM
    all_with_previous_and_next a
ORDER BY
    a.day;

all_with_previous_and_next contains now different values, and they are even properly typed:

    DAY     | VALUE  | REAL | previous_day | previous_value |  next_day  | next_value 
------------+--------+------+--------------+----------------+------------+------------
 2015-01-01 |    100 | t    | 2015-01-01   |            100 | 2015-01-01 |        100
 2015-01-02 | [NULL] | f    | 2015-01-01   |            100 | 2015-01-03 |        120
 2015-01-03 |    120 | t    | 2015-01-03   |            120 | 2015-01-03 |        120
 2015-01-04 | [NULL] | f    | 2015-01-03   |            120 | 2015-01-06 |        140
 2015-01-05 | [NULL] | f    | 2015-01-03   |            120 | 2015-01-06 |        140
 2015-01-06 |    140 | t    | 2015-01-06   |            140 | 2015-01-06 |        140
 2015-01-07 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-08 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-09 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-10 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-11 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-12 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-13 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-14 | [NULL] | f    | 2015-01-06   |            140 | 2015-01-15 |        100
 2015-01-15 |    100 | t    | 2015-01-15   |            100 | 2015-01-15 |        100
(15 ROWS)

All this, and we still have just one scan of the underlying table. Not bad. Anyone has better idea?

13 thoughts on “Filling in the blanks”

  1. You don’t need the subqueries. Just use lead on the real data before joining it with all_days:

    WITH range AS (
        SELECT MIN(sample_date), MAX(sample_date) FROM test),
    all_days AS (
        SELECT g.day::DATE AS DAY
        FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY)),
    test_extended AS (
      SELECT t.sample_date DATE,
        lead(t.sample_date, 1, t.sample_date+1) OVER (ORDER BY t.sample_date) nextdate,
        t.value,
        lead(t.value) OVER (ORDER BY t.sample_date) nextvalue
        FROM test t),
    all_with_real AS (
        SELECT a.day, t.date, t.nextdate, t.value, (a.day=t.date) AS REAL, t.nextvalue
        FROM all_days a
        LEFT OUTER JOIN test_extended t ON a.day>=t.date AND a.day<t.nextdate
    )
    SELECT a.day,
      COALESCE(a.value + (a.nextvalue - a.value) * (a.day - a.date) / (a.nextdate - a.date), a.value) AS VALUE,
      a.real
    FROM all_with_real a
    ORDER BY a.day;
  2. @Anders:

    Nice. I knew that there has to be a simpler approach. Just didn’t get to it myself :/

  3. Argh, apparently I didn’t either: I didn’t like the range in the join (it looked slow), but I guess you can just produce the range in pieces:

    WITH test_extended AS (
      SELECT t.sample_date DATE,
        lead(t.sample_date, 1, t.sample_date+1) OVER (ORDER BY t.sample_date) nextdate,
        t.value,
        lead(t.value) OVER (ORDER BY t.sample_date) nextvalue
        FROM test t),
    all_with_real AS (
        SELECT g.day::DATE AS DAY, t.date, t.nextdate, t.value, (g.day::DATE=t.date) AS REAL, t.nextvalue
        FROM test_extended t, lateral generate_series(t.date, t.nextdate-1, '1 day'::INTERVAL) AS g(DAY))
    SELECT a.day,
      COALESCE(a.value + (a.nextvalue-a.value) * (a.day-a.date) / (a.nextdate-a.date), a.value) AS VALUE,
      a.real
    FROM all_with_real a
    ORDER BY a.day;
  4. FWIW, if you used temp views it would make the examples easier to follow. IE, replace all the WITH range, … all_days … with

    CREATE TEMP VIEW all_days AS
    with range as (
    select min(sample_date), max(sample_date) from test
    ) SELECT …

    This is especially handy as you build up examples; if each example is a temp view then you can just keep adding them to your JOIN clause.

  5. @Jim:
    True, will check how it goes next time.

    The point in the queries is to shown that it can be done with a query, without any special objects. I’m sure you know that view/cte are interchangeable, but I’m not so sure if this is common knowledge.

  6. Nice food for thought. Here is my attempt:

    with recursive
    bounds as (select min(sample_date),max(sample_date) from test),
    day(current) as (select bounds.min from bounds union all select (current + interval ‘1 day’)::date from day,bounds where current < bounds.max),
    dates as (select current, max(test.sample_date) filter (where test.sample_date = current) as hi_date from day, test group by current),
    samples as (select current,lo_date,hi_date,min(value) filter (where test.sample_date = lo_date) as lo_val,min(value) filter (where test.sample_date = hi_date) as hi_val from dates, test group by 1,2,3)
    select *, case when hi_date=lo_date then lo_val else lo_val + (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) end as intra from samples order by 1;

  7. Scratch my previous post.

    with recursive
    bounds as (select min(sample_date),max(sample_date) from test),
    day(current) as (select bounds.min from bounds union all select (current + interval ‘1 day’)::date from day,bounds where current < bounds.max),
    dates as (select current, max(test.sample_date) filter (where test.sample_date = current) as hi_date from day, test group by current),
    samples as (select current,lo_date,hi_date,min(value) filter (where test.sample_date = lo_date) as lo_val,min(value) filter (where test.sample_date = hi_date) as hi_val from dates, test group by 1,2,3)
    select *, case when hi_date=lo_date then lo_val else lo_val + (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) end as intra from samples order by 1;

  8. Last attempt to copy-paste, what is wrong with this thing 🙂

    WITH recursive
    bounds AS (SELECT MIN(sample_date),MAX(sample_date) FROM test),
    DAY(CURRENT) AS (SELECT bounds.min FROM bounds UNION ALL SELECT (CURRENT + INTERVAL '1 day')::DATE FROM DAY,bounds WHERE CURRENT < bounds.max),
    dates AS (SELECT CURRENT, MAX(test.sample_date) FILTER (WHERE test.sample_date = CURRENT) AS hi_date FROM DAY, test GROUP BY CURRENT),
    samples AS (SELECT CURRENT,lo_date,hi_date,MIN(VALUE) FILTER (WHERE test.sample_date = lo_date) AS lo_val,MIN(VALUE) FILTER (WHERE test.sample_date = hi_date) AS hi_val FROM dates, test GROUP BY 1,2,3)
    SELECT *, CASE WHEN hi_date=lo_date THEN lo_val ELSE lo_val + (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) END AS intra FROM samples ORDER BY 1;
  9. doesn’t work for me – doesn’t even run – complains about lo_date column not existing.

  10. I don’t know, the lo_date is declared in the the “dates” CTE, which is mungled each time i posted it. Simmilar to hi_date, only reversed.

    dates as (
    select current,
    max(test.sample_date) filter (where test.sample_date = current) as hi_date
    from day, test group by current),

  11. Final version, working:

    WITH recursive
     
    bounds AS (SELECT MIN(sample_date),MAX(sample_date) FROM test),
     
    DAY(CURRENT) AS (SELECT bounds.min FROM bounds UNION ALL SELECT (CURRENT +
    INTERVAL '1 day')::DATE FROM DAY,bounds WHERE CURRENT < bounds.max),
     
    dates AS (SELECT CURRENT, MAX(test.sample_date) FILTER (WHERE
    test.sample_date <= CURRENT) AS lo_date, MIN(test.sample_date) FILTER (WHERE
    test.sample_date >= CURRENT) AS hi_date FROM DAY, test GROUP BY CURRENT),
     
    samples AS (SELECT CURRENT,lo_date,hi_date,MIN(VALUE) FILTER (WHERE
    test.sample_date = lo_date) AS lo_val,MIN(VALUE) FILTER (WHERE
    test.sample_date = hi_date) AS hi_val FROM dates, test GROUP BY 1,2,3)
     
    SELECT *, CASE WHEN hi_date=lo_date THEN lo_val ELSE lo_val +
    (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) END AS intra FROM
    samples ORDER BY 1;
  12. Just put it on some paste site. Or use html pre tags in your comment, that should be allowed by wordpress, and it will not change anything inside.

  13. @Pg23193:
    Looks like it’s working, but your solution requires 3 scans overt test table, while mine need 2, and Andres – just one.

Comments are closed.