How to group messages into chats?

My Jabber server has a feature that logs all messages sent through it.

This is pretty cool, and useful. And now I've been asked to use it to create a list of conversations.

What exactly does this mean? Whenever I send (or receive) something, a record is stored in the database with the local user, the communication type (send/recv), the correspondent, when it happened, and the body of the message.

Based on this, we want to group messages into chats. How?

The table looks like this:

$ \d message_log
               Table "public.message_log"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 local_user    | text                        |
 correspondent | character varying(250)      |
 msg_type      | character varying(250)      |
 logged_when   | timestamp without time zone |
 msg_body      | text                        |

And the data in it looks like this:

$ SELECT md5(local_user), md5(correspondent), msg_type, logged_when, md5(msg_body) FROM message_log LIMIT 3;
               md5                |               md5                | msg_type |        logged_when         |               md5
----------------------------------+----------------------------------+----------+----------------------------+----------------------------------
 d5ecebb85919589a7cbf7f90eb34eacf | 09cc8925ddb803a1eb756001ad984c1c | sent     | 2007-07-23 23:23:38.802531 | 875d829b9d90d52dc62ea9e8aa7e0afc
 d5ecebb85919589a7cbf7f90eb34eacf | 09cc8925ddb803a1eb756001ad984c1c | recv     | 2007-07-23 23:23:45.314665 | d7567c72c61e30334500b029ec7e2033
 d5ecebb85919589a7cbf7f90eb34eacf | d59345ff3c632a9ae921c7e3aab661fe | sent     | 2007-07-23 23:43:00.96291  | eb43de8a086c848c0907bcff462154e6
(3 ROWS)

(sorry, I can't disclose sensitive information).

Now there is a question: what is a conversation? Of course, we don't want to do language analysis or check message subjects. Instead, let's try a simpler approach:

Disclaimer: the idea for how to do this is not my own; it's based on a RhodiumToad blog post.

if 2 messages are less than x minutes apart, they are both part of the same conversation

Of course, this applies only to messages between the same 2 parties.

So, assuming we'd have messages logged at:

  • 2007-07-23 23:23:38
  • 2007-07-23 23:24:01
  • 2007-07-24 05:01:20
  • 2007-07-24 08:24:59
  • 2007-07-24 08:29:40
  • 2007-07-24 08:33:15

and we assume that the conversation threshold is 5 minutes, we have 3 conversations here:

  • from 2007-07-23 23:23:38 to 2007-07-23 23:24:01 (2 messages)
  • on 2007-07-24 05:01:20, single message
  • from 2007-07-24 08:24:59 to 2007-07-24 08:33:15 (3 messages)
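The rule can be sketched procedurally before we turn to SQL. This is a minimal Python illustration (not part of the original post) that groups the sample timestamps with a 5-minute threshold:

```python
from datetime import datetime, timedelta

def group_into_chats(timestamps, threshold=timedelta(minutes=5)):
    """Group timestamps into conversations: a message starts a new
    conversation when the gap to the previous one exceeds the threshold."""
    chats = []
    for ts in sorted(timestamps):
        if chats and ts - chats[-1][-1] <= threshold:
            chats[-1].append(ts)   # within threshold: same conversation
        else:
            chats.append([ts])     # first message, or gap too big: new one
    return chats

msgs = [datetime(2007, 7, 23, 23, 23, 38), datetime(2007, 7, 23, 23, 24, 1),
        datetime(2007, 7, 24, 5, 1, 20), datetime(2007, 7, 24, 8, 24, 59),
        datetime(2007, 7, 24, 8, 29, 40), datetime(2007, 7, 24, 8, 33, 15)]
print([len(c) for c in group_into_chats(msgs)])  # [2, 1, 3]
```

Doing the same thing set-based, inside the database, is what the rest of the post is about.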

Seems easy enough. But how do we write a query to do it? Luckily, with PostgreSQL 8.4 we got window functions, which will be *very* useful here.

First, let's make a smaller table with the test values from above:

CREATE TABLE test ( ts TIMESTAMP);
INSERT INTO test (ts) VALUES
    ('2007-07-23 23:23:38'),
    ('2007-07-23 23:24:01'),
    ('2007-07-24 05:01:20'),
    ('2007-07-24 08:24:59'),
    ('2007-07-24 08:29:40'),
    ('2007-07-24 08:33:15');

The first part of the query marks which messages are within our threshold of the previous message. We can do it like this:

WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
)
SELECT
    t.ts,
    CASE
        WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <=  s.threshold THEN NULL
        ELSE t.ts
    END
FROM
    test t,
    setup s
ORDER BY t.ts;

Results are, as expected:

         ts          |         ts          
---------------------+---------------------
 2007-07-23 23:23:38 | 2007-07-23 23:23:38
 2007-07-23 23:24:01 | [null]
 2007-07-24 05:01:20 | 2007-07-24 05:01:20
 2007-07-24 08:24:59 | 2007-07-24 08:24:59
 2007-07-24 08:29:40 | [null]
 2007-07-24 08:33:15 | [null]
(6 rows)

The name of the 2nd column is bad, but it's not relevant.

In case you don't understand the CASE expression I used:

WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold THEN NULL

Does:

It takes the value of t.ts (the current row's ts) and subtracts the result of “lag(t.ts) OVER ( ORDER BY t.ts )", which is the value of the ts column in the previous row ("previous" as defined by “ORDER BY ts").

In effect, it shows the time difference between the current and previous rows:

$ SELECT ts, ts - lag(ts) OVER ( ORDER BY ts ) FROM test ORDER BY ts;
         ts          | ?COLUMN? 
---------------------+----------
 2007-07-23 23:23:38 | [NULL]
 2007-07-23 23:24:01 | 00:00:23
 2007-07-24 05:01:20 | 05:37:19
 2007-07-24 08:24:59 | 03:23:39
 2007-07-24 08:29:40 | 00:04:41
 2007-07-24 08:33:15 | 00:03:35
(6 ROWS)

Then we compare that value with the setup threshold (5 minutes). If it's less than the threshold, I just return NULL: the message is part of the previous conversation. If it's more than 5 minutes, it's clearly a new chat, and I need to mark it somehow; I chose to use the ts itself as the mark.
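For readers more comfortable stepping through this imperatively, the lag-and-compare step can be emulated in Python (an illustrative sketch, not the actual implementation):

```python
from datetime import datetime, timedelta

threshold = timedelta(minutes=5)
msgs = sorted([datetime(2007, 7, 23, 23, 23, 38),
               datetime(2007, 7, 23, 23, 24, 1),
               datetime(2007, 7, 24, 5, 1, 20),
               datetime(2007, 7, 24, 8, 24, 59),
               datetime(2007, 7, 24, 8, 29, 40),
               datetime(2007, 7, 24, 8, 33, 15)])

# emulate: CASE WHEN ts - lag(ts) OVER (ORDER BY ts) <= threshold
#               THEN NULL ELSE ts END
marks = []
prev = None  # lag() yields NULL for the first row, so the CASE hits ELSE
for ts in msgs:
    within = prev is not None and ts - prev <= threshold
    marks.append(None if within else ts)
    prev = ts

print([m.strftime('%H:%M:%S') if m else None for m in marks])
# ['23:23:38', None, '05:01:20', '08:24:59', None, None]
```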

With this resultset, I can now fill in the NULLs with values that will let me group the messages into chats:

WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
),
starts_marked AS (
    SELECT
        t.ts,
        CASE
            WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <=  s.threshold THEN NULL
            ELSE t.ts
        END AS chat_start
    FROM
        test t,
        setup s
)
SELECT
    ts,
    MAX(chat_start) OVER (ORDER BY ts)
FROM
    starts_marked
ORDER BY ts;

Results now:

         ts          |         max         
---------------------+---------------------
 2007-07-23 23:23:38 | 2007-07-23 23:23:38
 2007-07-23 23:24:01 | 2007-07-23 23:23:38
 2007-07-24 05:01:20 | 2007-07-24 05:01:20
 2007-07-24 08:24:59 | 2007-07-24 08:24:59
 2007-07-24 08:29:40 | 2007-07-24 08:24:59
 2007-07-24 08:33:15 | 2007-07-24 08:24:59
(6 rows)

Again, the column name is bad, but it's irrelevant.
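Why does MAX() as a window function fill the gaps? Because timestamps only grow, each new chat_start is the largest value seen so far, so a running maximum is the same as carrying the most recent non-NULL start forward. A hedged Python sketch, with integers standing in for timestamps:

```python
# chat_start values from the previous step: a value marks a chat's first
# message, None means "same chat as the previous row"
marks = [100, None, 200, 300, None, None]

# MAX(chat_start) OVER (ORDER BY ts): a running maximum over the rows
# seen so far, ignoring NULLs, as SQL aggregates do
filled, running = [], None
for m in marks:
    if m is not None:
        running = m if running is None else max(running, m)
    filled.append(running)
print(filled)  # [100, 100, 200, 300, 300, 300]
```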

As you can see, all rows that belong to the same conversation now have the same value in the max column, which makes it possible to:

WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
),
starts_marked AS (
    SELECT
        t.ts,
        CASE
            WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <=  s.threshold THEN NULL
            ELSE t.ts
        END AS chat_start
    FROM
        test t,
        setup s
),
grouped AS (
    SELECT
        ts,
        MAX(chat_start) OVER (ORDER BY ts) AS group_start
    FROM
        starts_marked
)
SELECT
    group_start,
    MAX(ts) AS group_end,
    COUNT(*) AS messages,
    MAX(ts) - group_start AS duration
FROM
    grouped
GROUP BY
    group_start
ORDER BY group_start;

Results:

     group_start     |      group_end      | messages | duration 
---------------------+---------------------+----------+----------
 2007-07-23 23:23:38 | 2007-07-23 23:24:01 |        2 | 00:00:23
 2007-07-24 05:01:20 | 2007-07-24 05:01:20 |        1 | 00:00:00
 2007-07-24 08:24:59 | 2007-07-24 08:33:15 |        3 | 00:08:16
(3 rows)
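The final GROUP BY is plain aggregation: for each conversation we take the first timestamp, the last timestamp, the row count, and their difference. In Python terms (an illustrative sketch, starting from the grouped conversations):

```python
from datetime import datetime, timedelta

# conversations as lists of timestamps: the output of the grouping step
chats = [
    [datetime(2007, 7, 23, 23, 23, 38), datetime(2007, 7, 23, 23, 24, 1)],
    [datetime(2007, 7, 24, 5, 1, 20)],
    [datetime(2007, 7, 24, 8, 24, 59), datetime(2007, 7, 24, 8, 29, 40),
     datetime(2007, 7, 24, 8, 33, 15)],
]

# per conversation: group_start, group_end, message count, duration
summary = [(c[0], c[-1], len(c), c[-1] - c[0]) for c in chats]
for start, end, count, duration in summary:
    print(start, end, count, duration)
```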

Which is pretty cool. But how do we change it to work with multiple correspondents? That's actually pretty trivial:

WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
),
starts_marked AS (
    SELECT
        local_user,
        correspondent,
        t.logged_when,
        CASE
WHEN t.logged_when - lag(t.logged_when) OVER ( PARTITION BY local_user, correspondent ORDER BY t.logged_when ) <=  s.threshold THEN NULL
            ELSE t.logged_when
        END AS chat_start
    FROM
        message_log t,
        setup s
),
grouped AS (
    SELECT
        local_user,
        correspondent,
        logged_when,
MAX(chat_start) OVER (PARTITION BY local_user, correspondent ORDER BY logged_when) AS group_start
    FROM
        starts_marked
)
SELECT
    md5(local_user),
    md5(correspondent),
    group_start,
    MAX(logged_when) AS group_end,
    COUNT(*) AS messages,
    MAX(logged_when) - group_start AS duration
FROM
    grouped
GROUP BY
    md5(local_user),
    md5(correspondent),
    group_start
ORDER BY
    md5(local_user),
    md5(correspondent),
    group_start;

Results look pretty cool:

               md5                |               md5                |        group_start         |         group_end          | messages |    duration     
----------------------------------+----------------------------------+----------------------------+----------------------------+----------+-----------------
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-03 21:17:29.281889 | 2010-09-03 21:25:59.092128 |       32 | 00:08:29.810239
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 05:21:00.081023 | 2010-09-04 05:21:00.081023 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 16:31:50.569392 | 2010-09-04 16:41:42.637592 |       58 | 00:09:52.0682
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 21:25:56.50246  | 2010-09-04 21:25:56.50246  |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-05 00:09:59.265376 | 2010-09-05 00:09:59.265376 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-05 08:41:11.776131 | 2010-09-05 08:41:11.776131 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-10 19:49:30.163026 | 2010-09-10 19:49:45.565422 |        3 | 00:00:15.402396
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-10 20:54:53.288434 | 2010-09-10 20:55:27.811443 |        6 | 00:00:34.523009
 20a80d47259e2f59315e6e46850e2cff | 09351174dfb598850656184da17446a9 | 2010-09-02 18:25:51.179718 | 2010-09-02 18:25:51.179718 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 09351174dfb598850656184da17446a9 | 2010-09-02 18:34:35.861093 | 2010-09-02 18:36:56.641371 |        6 | 00:02:20.780278
...
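What PARTITION BY buys us is that the whole threshold logic runs independently for every pair of parties. A minimal Python sketch of the same idea (the names alice/bob/carol are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta

threshold = timedelta(minutes=5)

# (local_user, correspondent, logged_when) rows
rows = [
    ('alice', 'bob',   datetime(2007, 7, 23, 23, 23, 38)),
    ('alice', 'bob',   datetime(2007, 7, 23, 23, 24, 1)),
    ('alice', 'carol', datetime(2007, 7, 23, 23, 24, 30)),
    ('alice', 'bob',   datetime(2007, 7, 24, 8, 29, 40)),
]

# PARTITION BY local_user, correspondent: group within each pair separately
chats = defaultdict(list)
for user, corr, ts in sorted(rows):
    key = (user, corr)
    if chats[key] and ts - chats[key][-1][-1] <= threshold:
        chats[key][-1].append(ts)   # continues the pair's current chat
    else:
        chats[key].append([ts])     # starts a new chat for this pair

print({k: [len(c) for c in v] for k, v in chats.items()})
# {('alice', 'bob'): [2, 1], ('alice', 'carol'): [1]}
```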

Now I could toss in some rounding (for example, conversations rounded up to the nearest 5, 10, or 15 minutes) and some grouping (all conversations with a given person during a month), and I'd have a nice system for automatically generating invoices for consulting-over-Jabber services. How to do that, though, I'll leave as an exercise for the readers 🙂

3 thoughts on “How to group messages into chats?”

  1. Well, it is mostly rescanning of a recordset (i.e. the list of messages to be grouped into chats). So, as long as your dataset is not too big, it will be fast. Generally, we are speaking about a seq scan (or index scan) of the source table, plus 3 materializations of changes to the recordset. With appropriate work_mem it should work really nicely.

  2. @depesz, thanks for the extremely useful example! You made my day.
