Occasional file corruption - how to trace source?

Discussion in 'General Hardware' started by sampo, 20 Oct 2017.

  1. sampo

    Gangster

    Joined: 15 Oct 2006

    Posts: 268

    Hi folks,

    Not sure if this is the right forum for this; mods please move if necessary.

    I transfer up to 1TB of data between a server and workstation each day. The files are large (~100MB) ASCII grids (simulation result files). Approximately 0.01-0.1% of these files seem to become corrupted at random; typically a few unexpected characters will appear somewhere in a grid and cause our processing scripts to fail.

    A few notes:
    - errors only appear once files have transferred from server to workstation (i.e. we have not detected any errors or corruption on server side)
    - if I zip a good archive on the workstation and then unzip it, sometimes a file in the unzipped copy will be corrupted

    The second point makes me think the fault lies not with the network but with the workstation (i.e. when the files are being written to the workstation drive). However, the workstation passes memtest and there are no SMART errors showing for the workstation drives.

    Any suggestions as to how I might trace the source of the error? Pretty certain it has to be hardware related...

    Thanks
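
    One quick way to narrow down where the corruption happens is to hash each file on both machines and compare digests after every hop. A minimal sketch (the paths are placeholders, not the OP's actual layout):

    ```python
    import hashlib

    def file_sha256(path, chunk_size=1 << 20):
        """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical paths: compare the server copy (e.g. over a network mount)
    # against the workstation copy after each transfer step.
    # if file_sha256("/mnt/server/grid_001.asc") != file_sha256("/work/grid_001.asc"):
    #     print("grid_001.asc changed in transit or on write")
    ```

    If the digests match right after the copy but differ on a later read, that points at the workstation's disk or memory rather than the network.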
     
  2. Armageus

    Don

    Joined: 19 May 2012

    Posts: 13,020

    Location: Spalding, Lincolnshire

    Has the workstation got ECC memory (and is it enabled)? The low error rates sound somewhat similar to the suggested error rates of non-ECC RAM.

    If it was a disk issue, you would likely be getting corruption on a more regular basis. If it was a network issue, the zipped versions would likely be damaged and fail to extract in the first place.
     
  3. sampo

    Gangster

    Joined: 15 Oct 2006

    Posts: 268

    Workstation is non-ECC; we are considering switching to ECC in future.

    What is odd is that this is a relatively recent issue; the workstation has been happy for a year with the same work package, but the errors have only just started creeping in.

    I've just checked and the disk (a 6TB Seagate Enterprise job) does have a few reallocated sectors (raw value of 40), not enough to trigger a SMART warning, but perhaps the drive is starting to wobble?
     
  4. Armageus

    Don

    Joined: 19 May 2012

    Posts: 13,020

    Location: Spalding, Lincolnshire

    Only way to rule it out is to try another drive.

    (Also a curveball suggestion, but it's not an early 6TB or an archive drive, is it - the sort that uses shingled magnetic recording (SMR)? That kind of data turnaround would likely cause issues on one of those.)

    I don't know what sort of data you are using or how critical it is, but dealing with 1TB a day, ECC would be top of my list.
     
  5. Quartz

    Capodecina

    Joined: 1 Apr 2014

    Posts: 13,982

    Location: Aberdeen

    This indicates that the problem is on the workstation. ECC RAM is a good start.

    I would replace the HDD as well, just to be sure, as the error you are seeing seems to be in the ballpark for a normal Uncorrectable Bit Error Rate on consumer drives.

    You should check out Backblaze's blog and see if your drive is mentioned there.
     
  6. sampo

    Gangster

    Joined: 15 Oct 2006

    Posts: 268

    Drive is a 6TB Seagate Enterprise Nearline (ST6000NM0024). I will swap it out and try that. ECC memory will mean a new workstation so this will happen when we next procure.

    Thanks folks
     
  7. BongoHunter

    Wise Guy

    Joined: 14 Apr 2014

    Posts: 2,382

    Location: West London

    That kind of drive is not really suited to your use case; I would definitely start there.
    I doubt it's memory related - there would be other noticeable impacts if that were the case. Any environmental changes in your office lately? Big new equipment, etc?
     
  8. Armageus

    Don

    Joined: 19 May 2012

    Posts: 13,020

    Location: Spalding, Lincolnshire

    On what grounds?

    No there wouldn't - whilst non-ECC is fine for normal daily use, when you are manipulating TBs of data, the likelihood is that you will run into single-bit memory errors, which can ultimately result in data corruption.
     
  9. BongoHunter

    Wise Guy

    Joined: 14 Apr 2014

    Posts: 2,382

    Location: West London

    I don't think workstations are the intended market - these drives are meant to sit somewhere between archive work and frequent access / disk intensive tasks, and all the performance and reliability data will have been produced for 24x7 enterprise usage.

    I can't imagine the use case the OP has described is what they expected when they sat down to design it, so a drive not hitting its MTBF figure in this situation would not surprise me. Memory errors at that frequency would surprise me though, especially considering the OP's problem has only surfaced recently.

    It would definitely be worth considering ECC memory, but much cheaper to check the disk...

    Edit: Ha! I'm from Spalding too! Small world
     
  10. Quartz

    Capodecina

    Joined: 1 Apr 2014

    Posts: 13,982

    Location: Aberdeen

    The UBER for this drive is 1 in 10^15 bits, or about 1 bit per 125 TB read. At ~1TB a day, that works out to 2-3 errors per year in your use case.
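
    The rough arithmetic behind that estimate, as a sketch (assuming ~1 TB read per day, 365 days a year, and the quoted 1-in-10^15 UBER):

    ```python
    # Expected unrecoverable bit errors per year at the drive's quoted UBER.
    uber = 1e-15            # errors per bit read (1 in 10^15, per the datasheet)
    bytes_per_day = 1e12    # ~1 TB of transfers per day
    bits_per_year = bytes_per_day * 8 * 365

    expected_errors_per_year = bits_per_year * uber
    print(expected_errors_per_year)  # ≈ 2.9 errors per year
    ```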

    Regardless of the cause of the problem, I think the OP should develop an automated means for checking for errors if not already done.
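
    One way to automate that check is a manifest approach: write a digest list on the server, then verify it on the workstation after each transfer. A minimal sketch (the file layout and function names are assumptions, not the OP's setup):

    ```python
    import hashlib
    import os

    def sha256_of(path, chunk_size=1 << 20):
        """SHA-256 hex digest of a file, read in 1 MiB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(directory, manifest_path):
        """Record a digest for every file in `directory` (run on the server)."""
        with open(manifest_path, "w") as out:
            for name in sorted(os.listdir(directory)):
                full = os.path.join(directory, name)
                if os.path.isfile(full):
                    out.write(f"{sha256_of(full)}  {name}\n")

    def verify_manifest(directory, manifest_path):
        """Return the names of files whose digests no longer match (run after transfer)."""
        bad = []
        with open(manifest_path) as f:
            for line in f:
                digest, name = line.rstrip("\n").split("  ", 1)
                if sha256_of(os.path.join(directory, name)) != digest:
                    bad.append(name)
        return bad
    ```

    Any non-empty result from `verify_manifest` flags a file that changed between the server and the workstation, which would catch the corruption before the processing scripts trip over it.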
     
  11. JasonM

    Mobster

    Joined: 19 Jun 2009

    Posts: 3,258

    I would try loading the corrupted file again directly from the workstation drive (without re-copying it). If the issue is still present, that rules out anything external to the workstation.

    If it loads ok from the workstation computer, then it's either in network, or on the external computer used to load the files.

    If after the above it's looking like a data issue external to the workstation, then you could try compressing the file with WinRAR with the data-recovery record option ticked. If the recovery record is able to correct the issue, then it's likely a network or data-write issue.