Getting list of most common domains

Today, on #postgresql on IRC, guy (can't contact him now to get his permission to name him), said:

I have a table called problematic_hostnames. It contains a list of banned hostnames in column “hostname" (varchar). I would like to display the top 10 troll ISPs based on this. Does PG have a way of spotting a “pattern"? Some ISPs are example.net while others are foo.bar.example.net, so you can't just regexp the last X.Y (since that would cause “.co.uk" to be one of the top troll ISPs).

Well, the problem description is rather problematic, and would require some kind of AI to know which domains to show, and which not, but I decided it's cool task to work on.

So, let's say we have this table:

CREATE temp TABLE test ( domain_name TEXT );

(pretty simple, isn't it?)

Now, let's put some data in it. To make it more entertaining I generated list of all host names that show in “referrer" field in requests to my website (got from apache access.log). It looks like this:

# SELECT COUNT(*) FROM test;
 COUNT
-------
  8097
(1 ROW)
 
# SELECT * FROM test ORDER BY random() LIMIT 10;
        domain_name
----------------------------
 www.depesz.com
 www.depesz.com
 images.google.pl
 auta.cc
 www.kwick.de
 canalla-metal.blogspot.com
 www.depesz.com
 canalla-metal.blogspot.com
 canalla-metal.blogspot.com
 www.depesz.com
(10 ROWS)

Well, www.depesz.com is pretty common 🙂

Now. To get list of domains (like .com, depesz.com, blogspot.com) I will need something that will generate list of domains based on host name. Something like this function:

CREATE OR REPLACE FUNCTION generate_domain_list(in_domain TEXT) RETURNS setof TEXT AS $$
DECLARE
    elements   text[] := regexp_split_to_array(in_domain, E'\\.');
    min_i      int4   := array_lower( elements, 1 );
    max_i      int4   := array_upper( elements, 1 );
    temp_array text[];
BEGIN
    FOR i IN min_i .. max_i LOOP
        temp_array := elements[ i : max_i ];
        RETURN NEXT array_to_string(temp_array, '.');
    END loop;
    RETURN;
END;
$$ LANGUAGE plpgsql;

How does it work? Let's check:

# SELECT * FROM generate_domain_list('a.b.c.d.e');
 generate_domain_list
----------------------
 a.b.c.d.e
 b.c.d.e
 c.d.e
 d.e
 e
(5 ROWS)

Great.

Unfortunately PostgreSQL < 8.4 has one drawback regarding pl/pgsql functions – they can't be used in “SELECT" part of the query.

So, I will need to add a wrapper:

CREATE OR REPLACE FUNCTION sql_generate_domain_list(TEXT) RETURNS setof TEXT AS $$
    SELECT * FROM generate_domain_list($1);
$$ LANGUAGE SQL;

And check how does it work:

# SELECT sql_generate_domain_list('a.b.c.d.e');
 sql_generate_domain_list
--------------------------
 a.b.c.d.e
 b.c.d.e
 c.d.e
 d.e
 e
(5 ROWS)

Nice. Now, that we have this function we can try to get more detailed information from our test table:

# SELECT sql_generate_domain_list(domain_name) FROM test LIMIT 10;
 sql_generate_domain_list
--------------------------
 images.google.com
 google.com
 com
 www.depesz.com
 depesz.com
 com
 www.depesz.com
 depesz.com
 com
 www.depesz.com
(10 ROWS)

Great. And now, let's make nice summary:

# SELECT x AS domain_name, COUNT(*) FROM (SELECT sql_generate_domain_list(domain_name) AS x FROM test) AS q GROUP BY x ORDER BY COUNT DESC LIMIT 10;
        domain_name         | COUNT
----------------------------+-------
 com                        |  6864
 www.depesz.com             |  5215
 depesz.com                 |  5215
 blogspot.com               |  1139
 canalla-metal.blogspot.com |  1072
 pl                         |   687
 google.pl                  |   517
 images.google.pl           |   392
 cc                         |   173
 auta.cc                    |   172
(10 ROWS)

Looks promising, but I think there is no point in putting “com", “pl", “cc" or anything like this on the list. Basically – top level domains shouldn't be listed.

Of course I could add nice “WHERE" to my report query, but instead let's modify a bit original function:

CREATE OR REPLACE FUNCTION generate_domain_list(in_domain TEXT) RETURNS setof TEXT AS $$
DECLARE
    elements   text[] := regexp_split_to_array(in_domain, E'\\.');
    min_i      int4   := array_lower( elements, 1 );
    max_i      int4   := array_upper( elements, 1 );
    temp_array text[];
BEGIN
    FOR i IN min_i .. max_i - 1 LOOP
        temp_array := elements[ i : max_i ];
        RETURN NEXT array_to_string(temp_array, '.');
    END loop;
    RETURN;
END;
$$ LANGUAGE plpgsql;

(The only change is " – 1″ in the “FOR … LOOP" line.

Result:

# SELECT * FROM generate_domain_list('a.b.c.d.e');
 generate_domain_list
----------------------
 a.b.c.d.e
 b.c.d.e
 c.d.e
 d.e
(4 ROWS)

The top-level domain (e) has been skipped. and how do the report look like now?

# SELECT x AS domain_name, COUNT(*) FROM (SELECT sql_generate_domain_list(domain_name) AS x FROM test) AS q GROUP BY x ORDER BY COUNT DESC LIMIT 10;
        domain_name         | COUNT
----------------------------+-------
 www.depesz.com             |  5215
 depesz.com                 |  5215
 blogspot.com               |  1139
 canalla-metal.blogspot.com |  1072
 google.pl                  |   517
 images.google.pl           |   392
 auta.cc                    |   172
 google.com                 |   157
 myspace.com                |   147
 www.google.pl              |   124
(10 ROWS)

Nice. But – should we really put “www.google.pl" and “google.pl" as separate positions? I think not. Let's assume that leading “www." should be stripped. So, another modification of the function:

CREATE OR REPLACE FUNCTION generate_domain_list(in_domain TEXT) RETURNS setof TEXT AS $$
DECLARE
    use_domain TEXT   := regexp_replace(LOWER(in_domain), E'^www\\.', '');
    elements   text[] := regexp_split_to_array(use_domain, E'\\.');
    min_i      int4   := array_lower( elements, 1 );
    max_i      int4   := array_upper( elements, 1 );
    temp_array text[];
BEGIN
    FOR i IN min_i .. max_i - 1 LOOP
        temp_array := elements[ i : max_i ];
        RETURN NEXT array_to_string(temp_array, '.');
    END loop;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Changes now: newline “use_domain TEXT …", and elements constructor has been changes to use use_domain instead of in_domain.

Result?

For non “www." prefixed domain:

# SELECT * FROM generate_domain_list('a.b.c.d.e');
 generate_domain_list
----------------------
 a.b.c.d.e
 b.c.d.e
 c.d.e
 d.e
(4 ROWS)

But when the domain contains “www." prefix:

# SELECT * FROM generate_domain_list('www.google.pl');
 generate_domain_list
----------------------
 google.pl
(1 ROW)

Also – please note that I lowercased domain name.

So, let's regenerate report one more time:

# SELECT x AS domain_name, COUNT(*) FROM (SELECT sql_generate_domain_list(domain_name) AS x FROM test) AS q GROUP BY x ORDER BY COUNT DESC LIMIT 10;
        domain_name         | COUNT
----------------------------+-------
 depesz.com                 |  5215
 blogspot.com               |  1139
 canalla-metal.blogspot.com |  1072
 google.pl                  |   517
 images.google.pl           |   392
 auta.cc                    |   172
 google.com                 |   157
 myspace.com                |   147
 profile.myspace.com        |   119
 images.google.com          |    95
(10 ROWS)

Looks ok. Of course – it will still count “co.uk" as domain (which is not what the guy from irc wanted), but it can be easily fixed – either by another rule in function, or by simple ‘WHERE' in report.

4 thoughts on “Getting list of most common domains”

  1. Only SQL solution

    postgres=# create or replace function domain_list(varchar[]) returns setof varchar as $$select array_to_string($1,’.’) union all select domain_list($1[2:array_upper($1,1)]) where array_upper($1,1) >=1 $$ language sql;
    CREATE FUNCTION
    postgres=# select domain_list(array[‘a’,’b’,’c’,’d’]);
    domain_list
    ————-
    a.b.c.d
    b.c.d
    c.d
    d

    (5 rows)

    postgres=# create or replace function domain_list(varchar) returns setof varchar as $$ select domain_list(string_to_array($1, ‘.’))$$ language sql immutable strict;
    CREATE FUNCTION
    postgres=# select domain_list(‘a.b.c.d’);
    domain_list
    ————-
    a.b.c.d
    b.c.d
    c.d
    d

  2. I liked your explanation. I might be checking your blog more often.

  3. apparently this particular entry got really popular among comment spammers. because of this, i have to close the possibility to comment this post. if you think your comment should be allowed – please mail me at depesz @ depesz.com

Comments are closed.