TSQL Find Duplicate Rows and remove them.

Matt_Hirst · 25 Sep 2018 at 15:39

I have a requirement at work to remove duplicate rows from a table, while leaving the first occurrence of each, and all the non-duplicates, in place.

I initially thought of SELECT DISTINCT, which gives me a unique set of data, but then I'd need to extract those rows and repopulate the table, which holds ~26,000,000 records.

What other options are there?

Matt

snips86x · 25 Sep 2018 at 15:40

Do you have any rows which are the "same" but have different identifiers? Not the UID, but another column which has something else in it.

tom_e · 25 Sep 2018 at 15:57

Join the table to itself, group the rows, then do a count. Where the count is greater than one, take those rows and use them to select from the original table with a ROW_NUMBER function, ordered by whatever column tells you which occurrence was first, and delete where the row number is greater than one.

Just a quick stream of thought but it should point you in the right direction.
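
A minimal sketch of the row-number step, assuming a hypothetical address table with an id column that records insertion order. It runs against in-memory SQLite from Python purely so it can be tried anywhere; SQL Server syntax will differ slightly, and the table and column names are made up.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO address (street, city) VALUES (?, ?)",
    [("1 High St", "Leeds"), ("1 High St", "Leeds"), ("2 Low Rd", "York"),
     ("1 High St", "Leeds"), ("3 Mill Ln", None), ("3 Mill Ln", None)],
)

# Number the rows inside each group of identical (street, city) values.
# ORDER BY id gives the earliest row number 1; every later copy gets 2, 3, ...
conn.execute("""
    DELETE FROM address
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY street, city ORDER BY id) AS rn
            FROM address
        )
        WHERE rn > 1
    )
""")
conn.commit()

remaining = conn.execute("SELECT id, street, city FROM address ORDER BY id").fetchall()
print(remaining)
```

In T-SQL the same numbering can go in a CTE and be deleted from directly (DELETE FROM the CTE WHERE rn > 1), which also works when the table has no id column at all.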

snips86x · 25 Sep 2018 at 17:17

That's where I was going with it. I've had to do this for a couple of clients.

john_s · 25 Sep 2018 at 20:19

If you have different IDs, it’s easier, but if not, have a look at ROW_NUMBER() / PARTITION BY / OVER.

AHarvey · 26 Sep 2018 at 06:39

Do a SELECT DISTINCT of the rows that occur more than once (COUNT > 1) into a support/temp table.

Then have another script that loops through the data a few thousand rows at a time, deleting rows from the first table if the data is in the support/temp table and marking them as deleted in the temp table. Looping will help if there are any replication issues.

We do it all the time when we decommission clients.
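
One possible reading of that loop, sketched with in-memory SQLite from Python rather than SQL Server. The support table here holds the ids of the surplus copies (an assumption, so that the first occurrence survives), and the table names, the deleted flag and the tiny batch size are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, city TEXT)")
# 10 distinct addresses, each stored 3 times: 30 rows, 20 of them surplus.
rows = [(f"{n} High St", "Leeds") for n in range(10)] * 3
conn.executemany("INSERT INTO address (street, city) VALUES (?, ?)", rows)

# Support table: ids of every row that is not the first copy in its group.
conn.execute(
    "CREATE TEMP TABLE to_delete (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("""
    INSERT INTO to_delete (id)
    SELECT a.id FROM address AS a
    WHERE a.id > (SELECT MIN(b.id) FROM address AS b
                  WHERE b.street = a.street AND b.city = a.city)
""")
conn.commit()

BATCH_SIZE = 7  # hypothetical; a few thousand would be more usual on a real table
batches = 0
while True:
    batch = [r[0] for r in conn.execute(
        "SELECT id FROM to_delete WHERE deleted = 0 ORDER BY id LIMIT ?",
        (BATCH_SIZE,),
    )]
    if not batch:
        break
    marks = ",".join("?" * len(batch))
    conn.execute(f"DELETE FROM address WHERE id IN ({marks})", batch)
    # Mark the batch as done in the support table so the loop can be resumed.
    conn.execute(f"UPDATE to_delete SET deleted = 1 WHERE id IN ({marks})", batch)
    conn.commit()  # one short transaction per batch
    batches += 1

total_left = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
print(batches, total_left)
```

Committing per batch keeps each transaction short, which is the point of looping instead of issuing one 26-million-row delete.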

Matt_Hirst · 26 Sep 2018 at 12:11

It's a load of address data; there are numerous null fields on certain rows, creating duplicate data.

Thanks for the help. I'll look at the above suggestions and see where they lead.
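
Null fields matter for which technique finds the duplicates. A small sketch, using in-memory SQLite from Python as a stand-in and a made-up address table: GROUP BY treats NULLs as belonging to the same group, while a self-join on "=" never matches NULL to NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO address (street, county) VALUES (?, ?)",
    [("1 High St", None), ("1 High St", None),
     ("2 Low Rd", "Kent"), ("2 Low Rd", "Kent")],
)

# GROUP BY puts NULLs in the same group, so both pairs are reported.
grouped = conn.execute("""
    SELECT street, county, COUNT(*) FROM address
    GROUP BY street, county
    HAVING COUNT(*) > 1
    ORDER BY street
""").fetchall()

# A self-join on "=" never matches NULL to NULL, so the NULL pair is missed.
joined = conn.execute("""
    SELECT a.id, b.id FROM address AS a
    JOIN address AS b
      ON a.street = b.street AND a.county = b.county AND a.id < b.id
    ORDER BY a.id
""").fetchall()

print(grouped)
print(joined)
```

The same NULL comparison rule applies in SQL Server under its default settings, so a join-based clean-up on nullable columns needs explicit NULL handling, whereas GROUP BY and PARTITION BY do not.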

Matt

kkelly · 27 Sep 2018 at 19:42

Although it's for Oracle, have a look at the first statement on the page below:

http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm

You should be able to just use straight SQL, basically what tom_e suggested.

FredFlint · 29 Sep 2018 at 18:06

Could create a hash for the row minus the id column and compare that: https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-2017
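
A rough sketch of the hashing idea, not of HASHBYTES itself: SQLite has no built-in hash function, so this registers a hypothetical row_hash helper from Python (SHA-256 over every column except id) and compares the stored hashes. Table, column and function names are all assumptions.

```python
import hashlib
import sqlite3

def row_hash(*values):
    # Hash every non-id column. NULL gets its own marker so it differs from ''.
    # The unit separator keeps ('ab', 'c') distinct from ('a', 'bc').
    parts = ["\x00NULL" if v is None else str(v) for v in values]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.create_function("row_hash", -1, row_hash)  # -1 = any number of arguments
conn.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, city TEXT, row_sha TEXT)"
)
conn.executemany(
    "INSERT INTO address (street, city) VALUES (?, ?)",
    [("ab", "c"), ("a", "bc"), ("ab", "c"), ("x", None), ("x", None), ("x", "")],
)
conn.execute("UPDATE address SET row_sha = row_hash(street, city)")
conn.commit()

# A row is a surplus copy if an earlier row carries the same hash.
duplicate_ids = [r[0] for r in conn.execute("""
    SELECT a.id FROM address AS a
    WHERE EXISTS (SELECT 1 FROM address AS b
                  WHERE b.row_sha = a.row_sha AND b.id < a.id)
    ORDER BY a.id
""")]
print(duplicate_ids)
```

Whatever hash is used, columns need a separator and NULLs need a marker before hashing; otherwise rows that differ only in where a value sits can collide.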

OspreyO · 1 Nov 2018 at 01:18

Something must have a timestamp, UID or similar. Get the first or last and delete the rest. Or something like that. If they are genuinely duplicates then there's something wrong with the database design.

I wouldn't be relying on DISTINCT if you are modifying/deleting data. You need to be more precise and confident than that.

https://weblogs.sqlteam.com/markc/archive/2008/11/11/60752.aspx

Also, I'd be more interested in finding out the cause of the duplicates and stopping it.
If deleting duplicates is something you do a lot, then something is wrong. Fix it.

Dj_Jestar · 5 Nov 2018 at 15:17

Code:

GROUP BY (the columns) HAVING COUNT(*) > 1
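
Spelled out as a full query, this could look like the sketch below. It uses in-memory SQLite from Python as a stand-in, with a made-up address table, and also counts how many surplus rows a clean-up should expect to remove, which is a handy sanity check before deleting anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO address (street, city) VALUES (?, ?)",
    [("1 High St", "Leeds")] * 3 + [("2 Low Rd", "York")] * 2 + [("3 Mill Ln", "Hull")],
)

# One result row per group of identical (street, city) values with more than one copy.
dup_groups = conn.execute("""
    SELECT street, city, COUNT(*) AS copies
    FROM address
    GROUP BY street, city
    HAVING COUNT(*) > 1
    ORDER BY copies DESC
""").fetchall()

# Rows a clean-up should remove: every copy beyond the first in each group.
surplus = sum(copies - 1 for _, _, copies in dup_groups)
print(dup_groups, surplus)
```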

h4rm0ny · 11 Nov 2018 at 09:28

The traditional approach is to join the table to itself on the fields you consider duplicates and distinguish them by the unique ID field, which is why it's always good to have a unique identifier in any table. Failing that, you can add a unique index to a table with the same structure and INSERT...IGNORE into it (MySQL syntax; in SQL Server the closest equivalent is a unique index created WITH (IGNORE_DUP_KEY = ON)) so you only get one copy, and then swap the tables. That presumes you can have some downtime to do it.
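
A sketch of the copy-and-swap variant, using in-memory SQLite from Python, where the statement is spelled INSERT OR IGNORE. The table and index names are hypothetical, and the columns are declared NOT NULL because SQLite unique indexes treat NULLs as distinct from each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO address (street, city) VALUES (?, ?)",
    [("1 High St", "Leeds"), ("2 Low Rd", "York"), ("1 High St", "Leeds"),
     ("2 Low Rd", "York"), ("3 Mill Ln", "Hull")],
)

# Same structure as the original, plus a unique index on the duplicate-defining columns.
conn.execute(
    "CREATE TABLE address_clean (id INTEGER PRIMARY KEY, street TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.execute("CREATE UNIQUE INDEX ux_address_clean ON address_clean (street, city)")

# Copy in id order so the first occurrence wins; later copies hit the
# unique index and are silently skipped.
conn.execute("""
    INSERT OR IGNORE INTO address_clean (id, street, city)
    SELECT id, street, city FROM address ORDER BY id
""")

# Swap the tables; the old one is kept under a new name as a fallback.
conn.execute("ALTER TABLE address RENAME TO address_old")
conn.execute("ALTER TABLE address_clean RENAME TO address")
conn.commit()

kept = conn.execute("SELECT id, street, city FROM address ORDER BY id").fetchall()
print(kept)
```

Keeping the old table around under another name until the result has been checked makes the swap easy to undo.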

Goksly · 11 Nov 2018 at 20:49

john_s said:
If you have different IDs, it’s easier, but if not, have a look at ROW_NUMBER() / PARTITION BY / OVER.

This. Anything greater than one = dust.