TSQL Find Duplicate Rows and remove them.

Associate
Joined
27 Jan 2005
Posts
1,347
Location
S. Yorks
I have a requirement at work to remove row duplicates from a table, but leaving the first occurrence and also the non duplicates in place.

I initially thought select distinct and that gives me a unique set of data, but then I'd need to extract these and back populate the table ~26,000,000 records.

What other options are there?

Matt
 
Join the table to itself group the rows then do a count, where count is greater than one take those rows and use them to select from the original table doing a row number function ordered by whatever cell tells you which occurrence was first and deleting where row number greater than two.

Just a quick stream of thought but it should point you in the right direction.
 
Do a SELECT DISTINCT into a support/temp table WHERE COUNT > 1

Then have another script that loops through the data a few thousand rows at a time and deletes them from the first table if the data is in the support/temp table, and marking it as deleted in the temp table. Looping will help if there is any replication issues.

We do it all the time when we decommission clients.
 
Its a load of address data, there are numerous null fields on certain rows creating duplicate data.

Thanks for the help will look at the above suggestions to see where it leads.

Matt
 
Something must have a timestamp, UID or similar. Get the first or last and delete the rest. Or something like that. If they are genuinely duplicates then there's something wrong with the database design.

I wouldn't be relying on DISTINCT if you are modifying/deleting data. You need to be more precise and confident than that.

https://weblogs.sqlteam.com/markc/archive/2008/11/11/60752.aspx

Also I'd be more interested in finding out the cause of the duplicates and stopping it.
If deleting duplicates is something you do a lot. Then something is wrong. Fix it.
 
The traditional approach is to join the table to itself on the fields you consider duplicates and distinguish them by the unique ID field. Which is why it's always good to have a unique identifier in any table. But failing that you can add a unique index to a table with the same structure and INSERT...IGNORE into it so you only get one copy and then swap the tables. That presumes you can have some downtime to do it.
 
Back
Top Bottom