rownum anyone? cumulative sum in one query?

one of the nice (for some things) features of oracle is rownum() function.

i assume you know what it does – if you don't – i think a quick google will show it.

today i will show how to make rownum in postgresql. if you're not interested in rownum itself – please continue reading anyway, as some functions shown here have other uses as well 🙂

ok, to add rownum, one usually is told to create temp sequence, and add “nextval()" to your query. this is how it works:

first, our test dataset:

# \d test
     TABLE "public.test"
  COLUMN   | TYPE | Modifiers
-----------+------+-----------
 some_word | text | NOT NULL
Indexes:
    "test_pkey" PRIMARY KEY, btree (some_word)
# SELECT * FROM test;
 some_word
-----------
 a
 AND
 at
 been
 feast
 great
 have
 languages
 OF
 scraps
 stolen
 the
 they
(13 ROWS)

so, let's add temp sequence and test the approach:

# CREATE temp SEQUENCE q;
CREATE SEQUENCE
# SELECT NEXTVAL('q'), some_word FROM test;
 NEXTVAL | some_word
---------+-----------
       1 | a
       2 | AND
       3 | at
       4 | been
       5 | feast
       6 | great
       7 | have
       8 | languages
       9 | OF
      10 | scraps
      11 | stolen
      12 | the
      13 | they
(13 ROWS)

nice. it works great. does it? let's try again, but with some non-obvious order by:

# SELECT NEXTVAL('q'), some_word FROM test ORDER BY LENGTH(some_word) DESC;
 NEXTVAL | some_word
---------+-----------
      21 | languages
      23 | scraps
      24 | stolen
      19 | great
      18 | feast
      26 | they
      17 | been
      20 | have
      15 | AND
      25 | the
      22 | OF
      16 | at
      14 | a
(13 ROWS)

oops. now we have a problem. or actually – 2 problems:

  • sequence has to be reset after each statement! otherwise it will continue the count. which is not really useful
  • order of the nextval is somehow random

order can be easily fixed:

# SELECT NEXTVAL('q'), * FROM ( SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x;
 NEXTVAL | some_word
---------+-----------
      27 | languages
      28 | scraps
      29 | stolen
      30 | great
      31 | feast
      32 | they
      33 | been
      34 | have
      35 | AND
      36 | the
      37 | OF
      38 | at
      39 | a
(13 ROWS)

but resetting counter after each usage seems to be … not nice.

any better options?

basically we could write some plperl function that would show us the counter and store current_value in %_SHARED, but it would also not help us with the issue of resetting the counter.

the best option would be if we could somehow get “trigger before select". which we can't.

but, there is an option to “know" when we start new statement – we should just check transaction (and statement) id numbers.

so, let's do it.

<24 hours later>

hmm, there is nothing that really works like statement id!.

but, as tom lane suggested i can use statement timestamp. it is perfectly fine for me for 2 reasons:

  1. i will use it “per-backend", so concurrent queries on another backends do not worry me
  2. it is pretty precise – up to 0.000001 second. so – as long as you're not running 1 milion of selects per second – you should be fine 🙂

so, there goes some .c code:

#include "postgres.h"
#include "access/xact.h"
#include "fmgr.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
Datum get_statement_timestamp(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(get_statement_timestamp);
Datum
get_statement_timestamp(PG_FUNCTION_ARGS)
{
    PG_RETURN_TIMESTAMPTZ( GetCurrentStatementStartTimestamp() );
}

i compiled it to .so, and saved as /home/pgdba/statementid.so.

then i can create sql-accessible function for it:

CREATE FUNCTION get_statement_timestamp()
RETURNS timestamptz
AS '/home/pgdba/statementid.so', 'get_statement_timestamp'
LANGUAGE C RETURNS NULL ON NULL INPUT VOLATILE;

now, that i have a way to tell whether previous call was for different statement i can finally write my rownum() code:

CREATE OR REPLACE FUNCTION rownum()
    RETURNS INT4
    LANGUAGE plpgsql
    AS $BODY$
DECLARE
    current_id TEXT;
    current_rownum INT4;
    settings_id TEXT;
BEGIN
    current_id := get_statement_timestamp()::TEXT;
    settings_id := current_setting('depesz.rownum_name');
    IF settings_id IS DISTINCT FROM current_id THEN
        PERFORM set_config('depesz.rownum_name', current_id::TEXT, FALSE);
        current_rownum := 0;
    ELSE
        current_rownum := current_setting('depesz.rownum_count')::INT4;
    END IF;
    current_rownum := current_rownum + 1;
    PERFORM set_config('depesz.rownum_count', current_rownum::TEXT, FALSE);
    RETURN current_rownum;
END;
    $BODY$;

as you can see it's pretty simple. one important thing is that i use custom variable class – you have to define it in your postgresql.conf. as for how – check my previous posting.

so, let's check how it works:

# SELECT rownum(), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x;
 rownum | some_word
--------+-----------
      1 | languages
      2 | scraps
      3 | stolen
      4 | great
      5 | feast
      6 | they
      7 | been
      8 | have
      9 | AND
     10 | the
     11 | OF
     12 | at
     13 | a
(13 ROWS)

second call within the same session? works nicely as well (starts numbering from 1, not from 14).

hmmm .. but what if i'd like to have multiple rownums per statement? for example – i would like the rownum to restart for each length of some_word?

let's do something like this:

CREATE OR REPLACE FUNCTION rownum()
    RETURNS INT4
    LANGUAGE plpgsql
    AS $BODY$
DECLARE
BEGIN
    RETURN rownum('');
END;
    $BODY$;

this function works instead of previous rownum(), but it calls rownum(TEXT). which looks like:

CREATE OR REPLACE FUNCTION rownum(in_code TEXT)
    RETURNS INT4
    LANGUAGE plpgsql
    AS $BODY$
DECLARE
    current_id TEXT;
    current_rownum INT4;
    settings_id TEXT;
BEGIN
    current_id := get_statement_timestamp()::TEXT || in_code;
    settings_id := current_setting('depesz.rownum_name');
    IF settings_id IS DISTINCT FROM current_id THEN
        PERFORM set_config('depesz.rownum_name', current_id::TEXT, FALSE);
        current_rownum := 0;
    ELSE
        current_rownum := current_setting('depesz.rownum_count')::INT4;
    END IF;
    current_rownum := current_rownum + 1;
    PERFORM set_config('depesz.rownum_count', current_rownum::TEXT, FALSE);
    RETURN current_rownum;
END;
    $BODY$;

so, what good it is for?

first, let's check standard call:

# SELECT rownum(), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x;
 rownum | some_word
--------+-----------
      1 | languages
      2 | scraps
      3 | stolen
      4 | great
      5 | feast
      6 | they
      7 | been
      8 | have
      9 | AND
     10 | the
     11 | OF
     12 | at
     13 | a
(13 ROWS)

ok, so it works. and now, let's change invocation, by explicitly calling rownum(TEXT):

# SELECT rownum(LENGTH(x.some_word)::text), * FROM (SELECT some_word FROM test ORDER BY LENGTH(some_word) DESC ) x;
 rownum | some_word
--------+-----------
      1 | languages
      1 | scraps
      2 | stolen
      1 | great
      2 | feast
      1 | they
      2 | been
      3 | have
      1 | AND
      2 | the
      1 | OF
      2 | at
      1 | a
(13 ROWS)

now rownum restarts whenever i change rownum(TEXT) argument – length of some_word.

as i mentioned in the beginning – this trick has more uses.

i once wrote an article about cumulative sum. then i wrote another one with jan wiecks method.

all methods described there required some extra queries to make the output (preparation or sum clearer). but now, since we know how to differentiate function calls between statements, we can do it better.

let's use again the table that was described there:

# SELECT * FROM test;
 id | user_id | VALUE
----+---------+-------
  1 |    2222 |   120
  2 |    2222 |  1566
  3 |    1111 |    10
  4 |    2222 |   123
  5 |    3333 |  1567
  6 |    2222 |  1234
  7 |    1111 |     4
  8 |    1111 |     1
  9 |    1111 |     3
 10 |    1111 |     5
 11 |    1111 |     2
 12 |    3333 |  1588
(12 ROWS)

now, let's write cumulative_sum function using statement_timestamp:

CREATE OR REPLACE FUNCTION cumulative_sum(in_code TEXT, VALUE INT4)
    RETURNS INT4
    LANGUAGE plpgsql
    AS $BODY$
DECLARE
    current_id TEXT;
    current_sum INT4;
    settings_id TEXT;
BEGIN
    current_id := get_statement_timestamp()::TEXT || in_code;
    settings_id := current_setting('depesz.cumulative_sum_name');
    IF settings_id IS DISTINCT FROM current_id THEN
        PERFORM set_config('depesz.cumulative_sum_name', current_id::TEXT, FALSE);
        current_sum := 0;
    ELSE
        current_sum := current_setting('depesz.cumulative_sum_count')::INT4;
    END IF;
    current_sum := current_sum + COALESCE(VALUE, 0);
    PERFORM set_config('depesz.cumulative_sum_count', current_sum::TEXT, FALSE);
    RETURN current_sum;
END;
    $BODY$;

and let's test if it works:

# SELECT x.*, cumulative_sum(user_id::text, VALUE) FROM (SELECT * FROM test ORDER BY user_id ) x;
 id | user_id | VALUE | cumulative_sum
----+---------+-------+----------------
  7 |    1111 |     4 |              4
  8 |    1111 |     1 |              5
  9 |    1111 |     3 |              8
 10 |    1111 |     5 |             13
 11 |    1111 |     2 |             15
  3 |    1111 |    10 |             25
  1 |    2222 |   120 |            120
  2 |    2222 |  1566 |           1686
  4 |    2222 |   123 |           1809
  6 |    2222 |  1234 |           3043
  5 |    3333 |  1567 |           1567
 12 |    3333 |  1588 |           3155
(12 ROWS)

great 🙂 i think that there are literally hundreds of other uses, but finding and describing them all would be too much 🙂 do some experiments yourself. have fun.

21 thoughts on “rownum anyone? cumulative sum in one query?”

  1. It looks like the beginning of
    the implementation of OLAP extensions such as CUBE/ROLLUP
    for GROUP BY?

    That would be great:-)

  2. @Fabien Coelho:
    i’m not familia with olap. can you direct me to wherever i can read about it? let’s start with the extension that you find the simplest.

  3. These are extensions to keep intermediate sums when grouping stuff. It is not what you are doing here, but some issues are similar as one may(?) have to keep additionnal counters alive to store partial results. See for instance:

    http://www.research.ibm.com/journal/sj/414/colossi.html

    Or Oracle doc:

    http://download.oracle.com/docs/cd/B12037_01/server.101/b10736/aggreg.htm

    I do not think that it is simple to implement in Pg, as it may have to impact the actual aggregation implementation and possibly the query plan.

  4. Nice read.

    Having worked with Oracle’s rownum I have to say I hate it. The fact that the rownum is effectively random unless you use a subquery is just plain painful. You have to use it if you want to do the Oracle equivelant of the LIMIT/OFFSET query of PostgreSQL.

    There have been a few posts recently that have reminded me of the work I did using arrays to do related things (in my case I wanted a moving set of the last X rows for stock marketet moving averages and the like)

    There is a lot that can be done just using pgPL and SQL functions – though of course C is going to be fast enough.

    My articles on moving data sets using arrays is here:
    http://frakkle.com/archives/archive_2006-m04.php

  5. Very cool ! Satisfied by results and in postgres 8.2.* I don’t need C part as it has statement_timestamp() by default !

  6. I´ve failed to compile in 8.1.4… 🙁
    There is no GetCurrentStatementStartTimestamp()

  7. How could I do a select on the result?

    What I’m trying to do is to create a report of registrations for a particular month.
    It shows the number of registrations per day and the cumulated registrations up to that day (since ever).

    So I use your function to compute the running total, so I get

    Date ———— Registrations ——– Cumulated
    2008-04-28 —– 5 ——————– 5
    2008-04-29 —– 1 ——————– 6
    2008-04-30 —– 2 ——————– 8
    2008-05-01 —– 6 ——————– 14
    2008-05-02 —– 3 ——————– 17

    And then I want to select on the values from May, so I put a select around (wrapping) my previous query and want to get

    Date ———— Registrations ——– Cumulated
    2008-05-01 —– 6 ——————– 14
    2008-05-02 —– 3 ——————– 17

    Instead I get
    Date ———— Registrations ——– Cumulated
    2008-05-01 —– 6 ——————– 6
    2008-05-02 —– 3 ——————– 9

    Do you see any way of solving this?

    Thanks,
    Peter

  8. hi depesz,

    i had a similar problem and read your both articles, this one and the one you posted some time before. in your first article about cumulative sums you used
    a perl function, but with the problem of no auto reset the counter for the next query. here you provided another way to get cumulative sum using get_statement_timestamp. hm, the plperl-function is 2 times faster then this written in plpgsql, so i want to use the faster one ;-).
    i have found no way in plperl to access statement_timestamp() so i used the following workaround giving the current statements timestamp as param:

    select x.*, cumulative_sum(user_id::text, value, statement_timestamp()) from (select * from test order by user_id) x;

    CREATE OR REPLACE FUNCTION cumulative_sum(in_code text, in_value bigint, in_ts timestamp with time zone)
    RETURNS bigint AS
    $BODY$
    my ($code, $value, $ts) = @_;

    if ($code ne $_SHARED{‘cumulative_sum_code’} or $ts ne $_SHARED{‘ts’}) {
    $_SHARED{‘cumulative_sum_value’} = 0;
    $_SHARED{‘cumulative_sum_code’} = $code;
    $_SHARED{‘ts’} = $ts;
    }

    $_SHARED{‘cumulative_sum_value’} += $value;

    return $_SHARED{‘cumulative_sum_value’};
    $BODY$
    LANGUAGE ‘plperl’ VOLATILE
    COST 100;

    thank you for sharing your expiriences! i want to give something back 🙂

    jan

  9. Hi Depesz,

    My apologies for not getting back on the previous question.

    I was doing something like

    select *
    from
    (select mydate, cumulative_sum(‘x’, registrations) from mytable) subQ
    where mydate > now() – interval ’10 days’

    and the optimizer kept optimizing it to

    select mydate, cumulative_sum(‘x’, registrations) from mytable where mydate > now() – interval ’10 days’

    which ofcourse gives different values for the cumulative sum.
    I had posted the same on the postgres list and almost immediately got the response that I could tell the optimizer not to touch the inner query by putting an ‘offset 0’ at the end of the subquery.

    Now I recently upgraded my DB to 8.3 (mainly to be able to use FTS) and just found out the cumulative_sum() function doesn’t work any more because of the changed way current_setting() behaves (as you’ve debated already on one of the lists).

    A fix for this is simple and easy.
    Add the following line at the end of postgresql.conf (after custom_class_variables).
    depesz.cumulative_sum_name = ‘initial_value’

    That takes care of it.

    Regards,
    Peter

  10. I added custom_variable_classes = ‘depesz’ and restarted the DB,

    but when I run I get:

    ERROR: unrecognized configuration parameter “depesz.cumulative_sum_name”

    CONTEXT: PL/pgSQL function “cumulative_sum” line 7 at assignment

    ********** Error **********

    ERROR: unrecognized configuration parameter “depesz.cumulative_sum_name”

    SQL state: 42704

    Context: PL/pgSQL function “cumulative_sum” line 7 at assignment

  11. Hi,

    Can we use this as a transaction id?

    SELECT l.virtualxid
    FROM pg_lock_status() l(locktype text, database oid, relation oid, page integer, tuple smallint, virtualxid text, transactionid xid, classid oid, objid oid, objsubid smallint, virtualtransaction text, pid integer, mode text, granted boolean);

    May be it is not actual because of 8.4 has window-functions

  12. @AVN:
    Not sure about virtualxid, but I guess it would be slow – locks checking is not really super fast.

    And from 8.4 on we have window functions, as you said, so there is no point in making your own hacks.

  13. Unfortunately, 8.3 -> 8.4 migration in some cases is not a simple process, for example because of complex third-party patches for PostreSQL sources.

Comments are closed.