June 23rd, 2009 by depesz | Tags: , , , , | No comments »
Did it help? If yes - maybe you can help me?

Let's say you imported some data, but it contains duplicates. You will have to handle them in some way, but to make sensible choice on how to handle it, you need more information.

So, let's start. We have table:

# \d users
Table "public.users"
Column | Type | Modifiers
------------+--------------------------+----------------------------------------------------
id | integer | not null default nextval('users_id_seq'::regclass)
username | text |
registered | timestamp with time zone |
Indexes:
"users_pkey" PRIMARY KEY, btree (id)

And this table contains:

# select count(*), count(distinct username) from users;
count | count
-------+-------
2000 | 1990
(1 row)

2000 rows, but only 1990 distinct usernames – i.e. 10 usernames are duplicated.

Of course finding them is simple:

# select username, count(*) from users group by username having count(*) > 1;
username | count
----------+-------
GTe | 2
Bpu | 2
wg2 | 2
ab2 | 2
GZ7 | 2
BWR | 2
baO | 2
NR0 | 2
8Bj | 2
Yht | 2
(10 rows)

But it doesn't really tell us anything interesting. We don't know these users ids, nor their registration date.

Can it be added in some way?

Let's start with the simplest approach:

# SELECT
u.username,
count(*),
ARRAY(SELECT q.id FROM users q WHERE q.username = u.username),
ARRAY(SELECT q.registered FROM users q WHERE q.username = u.username)
FROM
users u
GROUP BY
u.username
HAVING count(*) > 1
;
username | count | ?column? | ?column?
----------+-------+---------------+-----------------------------------------------------
GTe | 2 | {35249,36823} | {"2009-10-31 16:20:55+01","2009-09-10 15:44:15+02"}
Bpu | 2 | {35115,36797} | {"2010-02-25 05:23:29+01","2009-09-27 22:49:23+02"}
wg2 | 2 | {35595,36528} | {"2009-11-24 13:39:29+01","2009-12-20 16:16:08+01"}
ab2 | 2 | {35750,36252} | {"2009-10-18 20:36:29+02","2009-07-30 00:12:54+02"}
GZ7 | 2 | {35051,35737} | {"2009-10-01 02:09:32+02","2009-09-19 19:54:37+02"}
BWR | 2 | {35266,35703} | {"2009-10-06 15:54:51+02","2009-06-29 17:24:28+02"}
baO | 2 | {35487,36647} | {"2010-03-12 20:33:17+01","2010-04-15 14:23:06+02"}
NR0 | 2 | {36556,36782} | {"2009-08-13 14:12:38+02","2010-02-21 12:33:46+01"}
8Bj | 2 | {36280,36763} | {"2010-01-22 08:18:58+01","2009-06-28 14:10:50+02"}
Yht | 2 | {35222,36421} | {"2009-12-26 18:55:07+01","2010-01-12 04:27:03+01"}
(10 rows)

OK. It's better, but still not really what we need – we don't have any way to tell which date belongs to which user.

So, let's retry with this:

# SELECT
u.username,
count(*),
ARRAY(SELECT q.id::text || ' : ' || q.registered::TEXT FROM users q WHERE q.username = u.username)
FROM
users u
GROUP BY
u.username
HAVING count(*) > 1
;
username | count | ?column?
----------+-------+---------------------------------------------------------------------
GTe | 2 | {"35249 : 2009-10-31 16:20:55+01","36823 : 2009-09-10 15:44:15+02"}
Bpu | 2 | {"35115 : 2010-02-25 05:23:29+01","36797 : 2009-09-27 22:49:23+02"}
wg2 | 2 | {"35595 : 2009-11-24 13:39:29+01","36528 : 2009-12-20 16:16:08+01"}
ab2 | 2 | {"35750 : 2009-10-18 20:36:29+02","36252 : 2009-07-30 00:12:54+02"}
GZ7 | 2 | {"35051 : 2009-10-01 02:09:32+02","35737 : 2009-09-19 19:54:37+02"}
BWR | 2 | {"35266 : 2009-10-06 15:54:51+02","35703 : 2009-06-29 17:24:28+02"}
baO | 2 | {"35487 : 2010-03-12 20:33:17+01","36647 : 2010-04-15 14:23:06+02"}
NR0 | 2 | {"36556 : 2009-08-13 14:12:38+02","36782 : 2010-02-21 12:33:46+01"}
8Bj | 2 | {"36280 : 2010-01-22 08:18:58+01","36763 : 2009-06-28 14:10:50+02"}
Yht | 2 | {"35222 : 2009-12-26 18:55:07+01","36421 : 2010-01-12 04:27:03+01"}
(10 rows)

Well, it's nicer, but the ARRAY special characters, like { or " make it hard to read.

So, let's use array_to_string. Using it to separate the values by ‘, ‘ wouldn't help much, but we can do better:

SELECT
u.username,
count(*),
array_to_string(
ARRAY(SELECT q.id::text || ' : ' || q.registered::TEXT FROM users q WHERE q.username = u.username),
E'\n'
)
FROM
users u
GROUP BY
u.username
HAVING count(*) > 1
;
username | count | array_to_string
----------+-------+--------------------------------
GTe | 2 | 35249 : 2009-10-31 16:20:55+01
: 36823 : 2009-09-10 15:44:15+02
Bpu | 2 | 35115 : 2010-02-25 05:23:29+01
: 36797 : 2009-09-27 22:49:23+02
wg2 | 2 | 35595 : 2009-11-24 13:39:29+01
: 36528 : 2009-12-20 16:16:08+01
ab2 | 2 | 35750 : 2009-10-18 20:36:29+02
: 36252 : 2009-07-30 00:12:54+02
GZ7 | 2 | 35051 : 2009-10-01 02:09:32+02
: 35737 : 2009-09-19 19:54:37+02
BWR | 2 | 35266 : 2009-10-06 15:54:51+02
: 35703 : 2009-06-29 17:24:28+02
baO | 2 | 35487 : 2010-03-12 20:33:17+01
: 36647 : 2010-04-15 14:23:06+02
NR0 | 2 | 36556 : 2009-08-13 14:12:38+02
: 36782 : 2010-02-21 12:33:46+01
8Bj | 2 | 36280 : 2010-01-22 08:18:58+01
: 36763 : 2009-06-28 14:10:50+02
Yht | 2 | 35222 : 2009-12-26 18:55:07+01
: 36421 : 2010-01-12 04:27:03+01
(10 rows)

It is still 10 rows, but the 3rd column now contains literal line breaks.

For extra points we can prefix each line with “- " to make it obvious that it is list, and add leading empty line, to separate the entries from each other:

SELECT
u.username,
count(*),
'- ' || array_to_string(
ARRAY(SELECT q.id::text || ' : ' || q.registered::TEXT FROM users q WHERE q.username = u.username),
E'\n- '
) || E'\n' as details
FROM
users u
GROUP BY
u.username
HAVING count(*) > 1
;
# \i z.sql
username | count | details
----------+-------+----------------------------------
GTe | 2 | - 35249 : 2007-10-31 16:20:55+01
: - 36823 : 2007-09-10 15:44:15+02
:
Bpu | 2 | - 35115 : 2008-02-25 05:23:29+01
: - 36797 : 2007-09-27 22:49:23+02
:
wg2 | 2 | - 35595 : 2007-11-24 13:39:29+01
: - 36528 : 2007-12-20 16:16:08+01
:
ab2 | 2 | - 35750 : 2007-10-18 20:36:29+02
: - 36252 : 2007-07-30 00:12:54+02
:
GZ7 | 2 | - 35051 : 2007-10-01 02:09:32+02
: - 35737 : 2007-09-19 19:54:37+02
:
BWR | 2 | - 35266 : 2007-10-06 15:54:51+02
: - 35703 : 2007-06-29 17:24:28+02
:
baO | 2 | - 35487 : 2008-03-12 20:33:17+01
: - 36647 : 2008-04-15 14:23:06+02
:
NR0 | 2 | - 36556 : 2007-08-13 14:12:38+02
: - 36782 : 2008-02-21 12:33:46+01
:
8Bj | 2 | - 36280 : 2008-01-22 08:18:58+01
: - 36763 : 2007-06-28 14:10:50+02
:
Yht | 2 | - 35222 : 2007-12-26 18:55:07+01
: - 36421 : 2008-01-12 04:27:03+01
:
(10 rows)

Above technique can benefit you in writing real application code, but it's main strength is that it can turn table, that is understandable only by dba, into fully management readable report.

Leave a comment