1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

TSQL Find Duplicate Rows and remove them.

Discussion in 'HTML, Graphics & Programming' started by Matt_Hirst, Sep 25, 2018.

  1. Matt_Hirst

    Wise Guy

    Joined: Jan 27, 2005

    Posts: 1,083

    Location: S. Yorks

    I have a requirement at work to remove row duplicates from a table, but leaving the first occurrence and also the non duplicates in place.

    I initially thought select distinct and that gives me a unique set of data, but then I'd need to extract these and back populate the table ~26,000,000 records.

    What other options are there?

    Matt
     
  2. snips86x

    Capodecina

    Joined: Nov 14, 2012

    Posts: 15,268

    Location: Wiltshire

    Do you have any rows which are the "same" but have different identifiers? Not the UID but another cell which has something else in it.
     
    Last edited: Sep 25, 2018
  3. tom_e

    Man of Honour

    Joined: Dec 26, 2003

    Posts: 25,491

    Location: West mids

    Join the table to itself group the rows then do a count, where count is greater than one take those rows and use them to select from the original table doing a row number function ordered by whatever cell tells you which occurrence was first and deleting where row number greater than two.

    Just a quick stream of thought but it should point you in the right direction.
     
  4. snips86x

    Capodecina

    Joined: Nov 14, 2012

    Posts: 15,268

    Location: Wiltshire

    Thats where I was going with it, had to do this for a couple of clients.
     
  5. john_s

    Wise Guy

    Joined: Jun 25, 2009

    Posts: 1,172

    Location: Guernsey

    If you have different IDs, it’s easier, but if not, have a look at ROW_NUMBER() / PARTITION BY / OVER.
     
  6. AHarvey

    Sgarrista

    Joined: Mar 6, 2008

    Posts: 8,929

    Location: Stoke area

    Do a SELECT DISTINCT into a support/temp table WHERE COUNT > 1

    Then have another script that loops through the data a few thousand rows at a time and deletes them from the first table if the data is in the support/temp table, and marking it as deleted in the temp table. Looping will help if there is any replication issues.

    We do it all the time when we decommission clients.
     
  7. Matt_Hirst

    Wise Guy

    Joined: Jan 27, 2005

    Posts: 1,083

    Location: S. Yorks

    Its a load of address data, there are numerous null fields on certain rows creating duplicate data.

    Thanks for the help will look at the above suggestions to see where it leads.

    Matt
     
  8. kkelly

    Mobster

    Joined: Oct 8, 2003

    Posts: 2,575

    Location: Glasgow

  9. FredFlint

    Wise Guy

    Joined: Feb 1, 2006

    Posts: 1,845