
TSQL Find Duplicate Rows and remove them.

Discussion in 'HTML, Graphics & Programming' started by Matt_Hirst, Sep 25, 2018.

  1. Matt_Hirst

    Wise Guy

    Joined: Jan 27, 2005

    Posts: 1,083

    Location: S. Yorks

    I have a requirement at work to remove duplicate rows from a table, while leaving the first occurrence of each, and the non-duplicates, in place.

    I initially thought SELECT DISTINCT, and that does give me a unique set of data, but then I'd need to extract those rows and repopulate the table: ~26,000,000 records.

    What other options are there?

    Matt
     
  2. snips86x

    Capodecina

    Joined: Nov 14, 2012

    Posts: 15,447

    Location: Wiltshire

    Do you have any rows which are the "same" but have different identifiers? Not the UID, but another column that has something else in it.
     
    Last edited: Sep 25, 2018
  3. tom_e

    Man of Honour

    Joined: Dec 26, 2003

    Posts: 25,821

    Location: West mids

    Join the table to itself, group the rows, then do a count. Where the count is greater than one, take those rows and use them to select from the original table with a row number function, ordered by whatever column tells you which occurrence was first, and delete where the row number is greater than one.

    Just a quick stream of thought but it should point you in the right direction.
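
    Something like this, say (just a sketch: AddressData, Line1, Town, Postcode and LoadDate are made-up names, swap in your own):

    Code:
    ;WITH Numbered AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY Line1, Town, Postcode  -- the columns that define a "duplicate"
                   ORDER BY LoadDate                   -- whatever tells you which occurrence was first
               ) AS rn
        FROM AddressData
    )
    DELETE FROM Numbered
    WHERE rn > 1;  -- keeps the first occurrence, removes the rest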
     
  4. snips86x

    Capodecina

    Joined: Nov 14, 2012

    Posts: 15,447

    Location: Wiltshire

    That's where I was going with it; I've had to do this for a couple of clients.
     
  5. john_s

    Wise Guy

    Joined: Jun 25, 2009

    Posts: 1,172

    Location: Guernsey

    If you have different IDs, it’s easier, but if not, have a look at ROW_NUMBER() / PARTITION BY / OVER.
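
    For example, to see what would be classed as a duplicate before deleting anything (a rough sketch, placeholder names):

    Code:
    -- rn = 1 is the copy you'd keep; rn > 1 are the duplicates
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Line1, Town, Postcode  -- the "duplicate" definition
               ORDER BY (SELECT NULL)              -- no ID/timestamp? survivor is arbitrary
           ) AS rn
    FROM AddressData;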
     
  6. AHarvey

    Sgarrista

    Joined: Mar 6, 2008

    Posts: 9,248

    Location: Stoke area

    Do a SELECT DISTINCT into a support/temp table of the rows that appear more than once (GROUP BY the columns, HAVING COUNT(*) > 1).

    Then have another script that loops through the data a few thousand rows at a time, deletes them from the first table if the data is in the support/temp table, and marks them as deleted in the temp table. Looping helps if there are any replication issues.

    We do it all the time when we decommission clients.
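
    Batching it might look roughly like this (a sketch, using the same made-up names as earlier in the thread, with an arbitrary batch size):

    Code:
    -- Delete duplicates a few thousand at a time so the log/replication can keep up
    WHILE 1 = 1
    BEGIN
        ;WITH Numbered AS (
            SELECT ROW_NUMBER() OVER (
                       PARTITION BY Line1, Town, Postcode
                       ORDER BY LoadDate) AS rn
            FROM AddressData
        )
        DELETE TOP (5000) FROM Numbered
        WHERE rn > 1;

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left to delete
    END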
     
  7. Matt_Hirst

    Wise Guy

    Joined: Jan 27, 2005

    Posts: 1,083

    Location: S. Yorks

    It's a load of address data; there are numerous null fields on certain rows, creating duplicate data.

    Thanks for the help, I'll look at the above suggestions to see where they lead.

    Matt
     
  8. OspreyO

    Mobster

    Joined: Dec 12, 2006

    Posts: 3,058

    Something must have a timestamp, UID or similar. Get the first or last and delete the rest. Or something like that. If they are genuinely duplicates then there's something wrong with the database design.

    I wouldn't be relying on DISTINCT if you are modifying/deleting data. You need to be more precise and confident than that.

    https://weblogs.sqlteam.com/markc/archive/2008/11/11/60752.aspx

    Also, I'd be more interested in finding out the cause of the duplicates and stopping it.
    If deleting duplicates is something you do a lot, then something is wrong. Fix it.
     
  9. Dj_Jestar

    Caporegime

    Joined: Oct 18, 2002

    Posts: 28,322

    Location: Back in East London

    Code:
    GROUP BY (the columns) HAVING COUNT(*) > 1
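
    Fleshed out, that's something like (made-up names again):

    Code:
    SELECT Line1, Town, Postcode, COUNT(*) AS copies
    FROM AddressData
    GROUP BY Line1, Town, Postcode
    HAVING COUNT(*) > 1;  -- the groups that have duplicates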
     
  10. h4rm0ny

    Mobster

    Joined: Jun 25, 2011

    Posts: 4,999

    Location: Yorkshire and proud of it!

    The traditional approach is to join the table to itself on the fields you consider duplicates and distinguish them by the unique ID field, which is why it's always good to have a unique identifier in any table. But failing that, you can create a table with the same structure, add a unique index to it (in SQL Server that's a unique index with IGNORE_DUP_KEY on, the equivalent of MySQL's INSERT ... IGNORE), insert everything into it so you only get one copy of each row, and then swap the tables. That presumes you can have some downtime to do it.
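
    In SQL Server that might look something like this (only a sketch, made-up names, and note IGNORE_DUP_KEY silently discards the duplicate inserts):

    Code:
    -- Empty copy of the table, same structure
    SELECT * INTO AddressData_Clean FROM AddressData WHERE 1 = 0;

    -- Unique index that drops duplicate key inserts instead of erroring
    CREATE UNIQUE INDEX UX_Dedup
        ON AddressData_Clean (Line1, Town, Postcode)
        WITH (IGNORE_DUP_KEY = ON);

    -- Only one copy of each duplicate group makes it in
    INSERT INTO AddressData_Clean SELECT * FROM AddressData;

    -- Swap the tables (needs downtime)
    EXEC sp_rename 'AddressData', 'AddressData_Old';
    EXEC sp_rename 'AddressData_Clean', 'AddressData';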
     
  11. Goksly

    Capodecina

    Joined: Mar 5, 2003

    Posts: 10,506

    Location: Nottingham

    This. Anything greater than one = dust.