how to remove duplicated rows?

this questions pops every once in a while on irc.

some guy has a table, and it contains duplicated rows.

basically there are 2 possible scenarios:

  • all columns are duplicated
  • only some columns are duplicated

so, how to remove duplicates from such tables?

let's check what to do when all columns are duplicated, like in here:

# SELECT * FROM test;
 some_text | some_int
-----------+----------
 a         |        1
 a         |        1
 b         |        1
 b         |        2
 b         |        2
(5 ROWS)

the simplest approach to fix the table is to:

# CREATE temp TABLE temp_test AS SELECT DISTINCT * FROM test;
TRUNCATE test;
INSERT INTO test SELECT * FROM temp_test;
DROP TABLE temp_test;

after this – contents of the table looks like this:

# SELECT * FROM test;
 some_text | some_int
-----------+----------
 a         |        1
 b         |        1
 b         |        2
(3 ROWS)

but what if you don't want to copy values between tables?

it's also quite simple – we will simply add new column, enter some unique data there, and then use it to differentiate the rows:

# CREATE temp SEQUENCE temp_seq;
CREATE SEQUENCE
# ALTER TABLE test ADD COLUMN unique_id int4;
ALTER TABLE
# UPDATE test SET unique_id = NEXTVAL('temp_seq');
UPDATE 5
# DELETE FROM test AS t1 WHERE EXISTS (SELECT * FROM test t2 WHERE (t2.some_text, t2.some_int) = (t1.some_text, t1.some_int) AND t2.unique_id > t1.unique_id);
DELETE 2
# ALTER TABLE test DROP COLUMN unique_id;
ALTER TABLE

after this it is good to do vacuum of the table.

also – if you dont have index on (some_text, some_int) – it will be good to add such a index to speed up the removal.

and what in case not all of the columns are duplicated? like in this case:

# SELECT * FROM test;
 id | some_text | some_int
----+-----------+----------
  2 | a         |        1
  3 | a         |        1
  4 | b         |        1
  5 | b         |        2
  6 | b         |        2
(5 ROWS)

it's simple – use the trick shown above, but this time you don't have to add new column – you already have it.

interesting thing is when you consider that only some_text should be unique. and if many records have the same some_text – we should keep the one with lowest some_int. and if more then one have the same some_int – keep the newest (with highest id).

how to do it?

simple:

# DELETE FROM test AS t1 WHERE EXISTS (SELECT * FROM test t2 WHERE t2.some_text = t1.some_text AND (t2.some_int < t1.some_int OR (t2.some_int = t1.some_int AND t2.id > t1.id)));
DELETE 3
# SELECT * FROM test;
 id | some_text | some_int
----+-----------+----------
  3 | a         |        1
  4 | b         |        1
(2 ROWS)

7 thoughts on “how to remove duplicated rows?”

  1. Instead of using a Sequence, is it possible to use the CTID to distinguish between two row?

  2. @Richard Broersma Jr.:
    it is of course possible (like Pavel showed), but i’m not really enthusiastic about using system-data in standard (ekhem) sql queries.
    the idea and usage is great. i just have “moral” issues with it 🙂

  3. @depesc, ofcourse .. on bigger tables can be slow, corelated subqueries are efective in 10% very specific cases and it’s better doesn’t use it

Comments are closed.