Who logged in to the system from multiple countries within 2 hours?

Yesterday someone posted a set of queries for interviews, all centered on answering business-like questions from a database.

Today this post is hidden behind some “subscribe to read more” thing, so I will not even link to it, but one question there caught my eye.

Since I can't copy-paste the text, I'll try to write what I remember:

Given a table sessions, with columns user_id, login_time, and country_id, list all cases where a single account logged in to the system from more than one country within a 2-hour time frame.

The idea behind it is that it would be a tool to find hacked accounts, based on the idea that you generally can't change country within 2 hours. Which is somewhat true.

The solution in the blog post suggested joining the sessions table with itself, using some inequality condition. I think we can do better…

First, of course, let's make a table, and put some data in it. Since I don't care about the size of the data, for now, let's make it small, with just a couple of well-defined cases:

=$ create table sessions (
    user_id int8,
    login_time timestamptz,
    country_id int8
);
CREATE TABLE

In real life there would be some indexes, foreign keys, and a primary key, but for this example, I just don't care enough.

Sample data. I will need at least three cases:

  1. a user that never changed country; that will be user_id == 1
  2. a user that changed country, but more than 2 hours apart; that will be user_id == 2
  3. a user that changed country, and did it within 2 hours; let's make that user_id == 3

So, the insert statement will be:

=$ insert into sessions (user_id, login_time, country_id) values
    (1, '2025-05-01 12:34:56', 100),
    (1, '2025-05-01 13:34:56', 100),
    (1, '2025-05-01 19:34:56', 100),
    (2, '2025-05-01 12:34:56', 100),
    (2, '2025-05-01 19:34:56', 200),
    (3, '2025-05-01 12:34:56', 100),
    (3, '2025-05-01 13:34:56', 200);
INSERT 0 7

Now, the basic approach with a join would be something like:

=$ select
    s_start.user_id,
    s_start.login_time,
    array_agg(distinct s_end.country_id) as countries
from
    sessions as s_start
    join sessions s_end on
        s_start.user_id = s_end.user_id and
        s_end.login_time between s_start.login_time and s_start.login_time + '2 hours'::interval
group by
    s_start.user_id, s_start.login_time
having
    count(distinct s_end.country_id) > 1;
 user_id |       login_time       | countries
---------+------------------------+-----------
       3 | 2025-05-01 12:34:56+02 | {100,200}
(1 row)

Simple-ish. But can it be done with just one scan of sessions?

I think we can. Using window functions, and specifically non-standard “frame” definitions…

Specifically, in the window definition we can use something like this: order by login_time range between current row and '2 hours'::interval following. This makes a frame that takes into account all rows whose login_time is between that of the current row and 2 hours into the future. We can see it working here:

=$ SELECT
    s.*,
    array_agg(login_time) over (partition by user_id order by login_time range between current row and '2 hours'::interval following)
FROM
    sessions s;
 user_id |       login_time       | country_id |                      array_agg
---------+------------------------+------------+-----------------------------------------------------
       1 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02","2025-05-01 13:34:56+02"}
       1 | 2025-05-01 13:34:56+02 |        100 | {"2025-05-01 13:34:56+02"}
       1 | 2025-05-01 19:34:56+02 |        100 | {"2025-05-01 19:34:56+02"}
       2 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02"}
       2 | 2025-05-01 19:34:56+02 |        200 | {"2025-05-01 19:34:56+02"}
       3 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02","2025-05-01 13:34:56+02"}
       3 | 2025-05-01 13:34:56+02 |        200 | {"2025-05-01 13:34:56+02"}
(7 rows)

The partition by user_id part makes it so that for each row we only take into account other rows with the same user_id.

Anyway, with this in place, we can change the query to actually show the country ids, and filter the output so that a row is shown only when there are multiple countries:

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg(distinct s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
ERROR:  DISTINCT is not implemented for window functions
LINE 4:     array_agg(distinct country_id) over (partition by user_i...
            ^

That throws a tiny wrench into my plans. But OK, let's make it work with plain array_agg, without distinct, and we can filter out duplicates later:

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
 user_id |       login_time       | countries
---------+------------------------+-----------
       1 | 2025-05-01 12:34:56+02 | {100,100}
       1 | 2025-05-01 13:34:56+02 | {100}
       1 | 2025-05-01 19:34:56+02 | {100}
       2 | 2025-05-01 12:34:56+02 | {100}
       2 | 2025-05-01 19:34:56+02 | {200}
       3 | 2025-05-01 12:34:56+02 | {100,200}
       3 | 2025-05-01 13:34:56+02 | {200}
(7 rows)

That worked as expected, so let's get rid of the duplicate countries. A relatively simple way, at least for me, is to write a custom aggregate. I know that this sounds scary, but it's just a matter of writing one very simple function, and one create aggregate statement:

=$ create function array_agg_unique( INOUT p_state int8[], IN p_newval int8 ) returns int8[] as $$
    select case
        when p_newval = any(p_state) then p_state
        else p_state || p_newval
    end;
$$ language sql;
CREATE FUNCTION
 
=$ create aggregate array_agg_unique( int8 ) (
    sfunc = array_agg_unique,
    stype = int8[],
    initcond = '{}'
);
CREATE AGGREGATE
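Before plugging the new aggregate into a window, we can sanity-check it on its own. A quick throwaway check, not from the original post, using a VALUES list:

```sql
-- Each new value is appended to the state array only when it is not
-- already there, so duplicates collapse and first-seen order is kept.
select array_agg_unique(v)
from (values (100), (100), (200), (100)) as t(v);
-- {100,200}
```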

That's all. So how does the new unique aggregate work?

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
 user_id |       login_time       | countries 
---------+------------------------+-----------
       1 | 2025-05-01 12:34:56+02 | {100}
       1 | 2025-05-01 13:34:56+02 | {100}
       1 | 2025-05-01 19:34:56+02 | {100}
       2 | 2025-05-01 12:34:56+02 | {100}
       2 | 2025-05-01 19:34:56+02 | {200}
       3 | 2025-05-01 12:34:56+02 | {100,200}
       3 | 2025-05-01 13:34:56+02 | {200}
(7 rows)

And now the filtering. The simplest way, for me, is to just wrap it in a CTE:

WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select *
from base
where array_upper(countries, 1) > 1;
 user_id |       login_time       | countries 
---------+------------------------+-----------
       3 | 2025-05-01 12:34:56+02 | {100,200}
(1 row)
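As a side note, the custom aggregate isn't strictly required. A sketch of an alternative (same idea, untested against this dataset): keep the plain array_agg window call, then deduplicate each finished array with unnest() in a lateral subquery. It uses cardinality(), which reads a bit more naturally than array_upper() for one-dimensional arrays:

```sql
WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select b.user_id, b.login_time, d.countries
from base b,
    -- deduplicate the finished array, row by row
    lateral (
        select array_agg(distinct c) as countries
        from unnest(b.countries) as u(c)
    ) d
where cardinality(d.countries) > 1;
```

Whether this beats the custom aggregate depends on the data: the per-row unnest() adds overhead, but it avoids creating database objects, which may matter if you can't create functions.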

Now, the question is: is it really faster than the join approach?

Let's look first at simple explain analyze of both queries. First the join approach:

  1. =$ explain analyze
  2. select
  3.     s_start.user_id,
  4.     s_start.login_time,
  5.     array_agg(distinct s_end.country_id) as countries
  6. from
  7.     sessions as s_start
  8.     join sessions s_end on
  9.         s_start.user_id = s_end.user_id and
  10.         s_end.login_time between s_start.login_time and s_start.login_time + '2 hours'::interval
  11. group by
  12.     s_start.user_id, s_start.login_time
  13. having
  14.     count(distinct s_end.country_id) > 1;
  15.                                                                  QUERY PLAN                                                                  
  16. ---------------------------------------------------------------------------------------------------------------------------------------------
  17.  GroupAggregate  (cost=194.32..209.42 rows=183 width=48) (actual time=0.147..0.163 rows=1.00 loops=1)
  18.    Group Key: s_start.user_id, s_start.login_time
  19.    Filter: (count(DISTINCT s_end.country_id) > 1)
  20.    Rows Removed by Filter: 6
  21.    Buffers: shared hit=8
  22.    ->  Sort  (cost=194.32..195.69 rows=549 width=24) (actual time=0.120..0.139 rows=9.00 loops=1)
  23.          Sort Key: s_start.user_id, s_start.login_time, s_end.country_id
  24.          Sort Method: quicksort  Memory: 25kB
  25.          Buffers: shared hit=8
  26.          ->  Hash Join  (cost=45.33..169.34 rows=549 width=24) (actual time=0.039..0.074 rows=9.00 loops=1)
  27.                Hash Cond: (s_start.user_id = s_end.user_id)
  28.                Join Filter: ((s_end.login_time >= s_start.login_time) AND (s_end.login_time <= (s_start.login_time + '02:00:00'::interval)))
  29.                Rows Removed by Join Filter: 8
  30.                Buffers: shared hit=2
  31.                ->  Seq Scan on sessions s_start  (cost=0.00..25.70 rows=1570 width=16) (actual time=0.004..0.013 rows=7.00 loops=1)
  32.                      Buffers: shared hit=1
  33.                ->  Hash  (cost=25.70..25.70 rows=1570 width=24) (actual time=0.026..0.030 rows=7.00 loops=1)
  34.                      Buckets: 2048  Batches: 1  Memory Usage: 17kB
  35.                      Buffers: shared hit=1
  36.                      ->  Seq Scan on sessions s_end  (cost=0.00..25.70 rows=1570 width=24) (actual time=0.004..0.013 rows=7.00 loops=1)
  37.                            Buffers: shared hit=1
  38.  Planning:
  39.    Buffers: shared hit=156
  40.  Planning Time: 0.435 ms
  41.  Execution Time: 0.243 ms
  42. (25 rows)

OK. So it took two separate Seq Scans of sessions (in lines 31 and 36). I guess that if there was more data, one of them would be changed into a set of Index Scans. We'll test that in a moment.

On the other hand my window+frame+aggregate approach generates this explain analyze:

=$ explain analyze
WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select *
from base
where array_upper(countries, 1) > 1;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on base  (cost=109.08..163.99 rows=523 width=48) (actual time=0.279..0.296 rows=1.00 loops=1)
   Filter: (array_upper(base.countries, 1) > 1)
   Rows Removed by Filter: 6
   Buffers: shared hit=50
   ->  WindowAgg  (cost=109.08..140.44 rows=1570 width=48) (actual time=0.221..0.283 rows=7.00 loops=1)
         Window: w1 AS (PARTITION BY s.user_id ORDER BY s.login_time RANGE BETWEEN CURRENT ROW AND '02:00:00'::interval FOLLOWING)
         Storage: Memory  Maximum Storage: 17kB
         Buffers: shared hit=50
         ->  Sort  (cost=109.04..112.96 rows=1570 width=24) (actual time=0.044..0.056 rows=7.00 loops=1)
               Sort Key: s.user_id, s.login_time
               Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=4
               ->  Seq Scan on sessions s  (cost=0.00..25.70 rows=1570 width=24) (actual time=0.006..0.015 rows=7.00 loops=1)
                     Buffers: shared hit=1
 Planning:
   Buffers: shared hit=63 dirtied=1
 Planning Time: 0.279 ms
 Execution Time: 0.385 ms
(18 rows)

Single Seq Scan, and some processing.

Great. So now, let's put in some more data. Let's say 100,000 rows, with random users, times, and countries:

=$ insert into sessions (user_id, login_time, country_id)
select
    floor( 1 + random() * 50 ),
    now() - '3 months'::interval * random(),
    floor( 1 + random() * random() * random() * random() * random() * 5 )
from generate_series(1,100000);
INSERT 0 100000

The multi-random expression for country simply makes country_id much less evenly distributed:

=$ select country_id, count(*) from sessions group by 1 order by 1;
 country_id | count 
------------+-------
          1 | 97623
          2 |  2119
          3 |   235
          4 |    23
        100 |     5
        200 |     2
(6 rows)

To make sure that we have it all nicely indexed, I'll add an index on (user_id, login_time):

=$ create index the_index on sessions (user_id, login_time);
CREATE INDEX

Sweet. And now, the moment of truth: I'll run both queries (the one using the join, and the one using the window) 3 times each, pick the fastest of each type, and we'll see how they behave:

Fastest JOIN run:

                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=17.24..2143404.69 rows=33298 width=48) (actual time=154.160..3097.721 rows=6316.00 loops=1)
   Group Key: s_start.user_id, s_start.login_time
   Filter: (count(DISTINCT s_end.country_id) > 1)
   Rows Removed by Filter: 93690
   Buffers: shared hit=585882
   ->  Incremental Sort  (cost=17.24..1919272.77 rows=22263351 width=24) (actual time=151.612..2689.938 rows=285069.00 loops=1)
         Sort Key: s_start.user_id, s_start.login_time, s_end.country_id
         Presorted Key: s_start.user_id, s_start.login_time
         Full-sort Groups: 8578  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
         Buffers: shared hit=585882
         ->  Nested Loop  (cost=0.84..770705.97 rows=22263351 width=24) (actual time=151.285..1929.886 rows=285069.00 loops=1)
               Buffers: shared hit=585882
               ->  Index Only Scan using the_index on sessions s_start  (cost=0.42..3052.52 rows=100007 width=16) (actual time=0.011..130.035 rows=100007.00 loops=1)
                     Heap Fetches: 0
                     Index Searches: 1
                     Buffers: shared hit=387
               ->  Index Scan using the_index on sessions s_end  (cost=0.42..5.46 rows=222 width=24) (actual time=0.003..0.007 rows=2.85 loops=100007)
                     Index Cond: ((user_id = s_start.user_id) AND (login_time >= s_start.login_time) AND (login_time <= (s_start.login_time + '02:00:00'::interval)))
                     Index Searches: 100007
                     Buffers: shared hit=585495
 Planning Time: 0.145 ms
 JIT:
   Functions: 13
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.490 ms (Deform 0.123 ms), Inlining 8.884 ms, Optimization 123.118 ms, Emission 19.230 ms, Total 151.722 ms
 Execution Time: 3106.287 ms
(26 rows)

and fastest query with custom aggregate and window-based approach:

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on base  (cost=49.51..8850.37 rows=33336 width=48) (actual time=0.890..957.652 rows=6316.00 loops=1)
   Filter: (array_upper(base.countries, 1) > 1)
   Rows Removed by Filter: 93691
   Buffers: shared hit=100241
   ->  WindowAgg  (cost=49.51..7350.26 rows=100007 width=48) (actual time=0.034..816.409 rows=100007.00 loops=1)
         Window: w1 AS (PARTITION BY s.user_id ORDER BY s.login_time RANGE BETWEEN CURRENT ROW AND '02:00:00'::interval FOLLOWING)
         Storage: Memory  Maximum Storage: 17kB
         Buffers: shared hit=100241
         ->  Index Scan using the_index on sessions s  (cost=0.42..5600.14 rows=100007 width=24) (actual time=0.011..169.658 rows=100007.00 loops=1)
               Index Searches: 1
               Buffers: shared hit=100241
 Planning Time: 0.068 ms
 Execution Time: 965.772 ms
(13 rows)

Sweet. So it seems to be ~ 3 times faster.

Hope it will help someone in some kind of job interview. Or at least show that there are many ways to get to an answer for a problem…
