Occasional file corruption - how to trace source?

Associate
Joined
15 Oct 2006
Posts
268
Hi folks,

Not sure if this is the right forum for this; mods please move if necessary.

I transfer up to 1TB of data between a server and workstation each day. The files are large (~100MB) ASCII grids (simulation result files). Approximately 0.01-0.1% of these files seem to become corrupted at random; typically a few unexpected characters will appear somewhere in a grid and cause our processing scripts to fail.

A few notes:
- errors only appear once files have transferred from server to workstation (i.e. we have not detected any errors or corruption on server side)
- if I zip a good archive on the workstation and then unzip it, sometimes a file in the unzipped variant will corrupt

The second point makes me think the fault lies not with the network but with the workstation (i.e. when the files are being written to the workstation drive). However, the workstation passes memtest and there are no SMART errors showing for the workstation drives.
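For anyone wanting to reproduce the zip/unzip symptom in a loop rather than by hand, here is a minimal Python sketch. It assumes the standard `zipfile` and `hashlib` modules; the function names and the `roundtrip.zip` / `extracted` names are made up for illustration and are not what I actually ran.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a ~100MB grid never sits in memory whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def zip_roundtrip_ok(src, workdir):
    """Zip `src` into `workdir`, extract it again, and compare SHA-256 hashes.

    Returns True if the extracted copy is byte-identical to the source.
    """
    src = Path(src)
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Hypothetical archive name; put workdir on the drive under suspicion.
    archive = workdir / "roundtrip.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=src.name)

    out_dir = workdir / "extracted"
    with zipfile.ZipFile(archive) as zf:
        # extractall() raises zipfile.BadZipFile if a stored CRC-32 does not match.
        zf.extractall(out_dir)

    return sha256_of(src) == sha256_of(out_dir / src.name)
```

Calling `zip_roundtrip_ok` a few hundred times on the same known-good file and counting the `False` results should give a rough failure rate. If extraction passes the zip CRC check but the hashes still differ, that would hint that the damage happens after decompression (on write or on the later read-back) rather than inside the archive, though that is an inference and not proof.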

Any suggestions as to how I might trace the source of the error? Pretty certain it has to be hardware related...

Thanks
 
Don
Joined
19 May 2012
Posts
17,148
Location
Spalding, Lincolnshire
Has the workstation got ECC memory (and/or is it enabled)? The low error rates sound somewhat similar to the suggested error rates of non-ECC RAM.

If it was a disk issue, you would likely be getting corruption on a more regular basis. If it was a network issue, the zipped versions would likely be damaged and would not extract in the first place.
 
Associate
OP
Joined
15 Oct 2006
Posts
268
Workstation is non-ECC; we are considering switching to ECC in future.

What is odd is that this is a relatively recent issue; the workstation has been happy for a year with the same work package but the errors have only just started creeping in.

I've just checked and the disk (a 6TB Seagate Enterprise job) does have a few reallocated sectors (raw value of 40), not enough to trigger SMART but perhaps the drive is starting to wobble?
 
Don
Joined
19 May 2012
Posts
17,148
Location
Spalding, Lincolnshire
I've just checked and the disk (a 6TB Seagate Enterprise job) does have a few reallocated sectors (raw value of 40), not enough to trigger SMART but perhaps the drive is starting to wobble?

Only way to rule it out is to try another drive.

(Also a curveball suggestion: it's not an early 6TB or an archive drive, is it? The sort that uses shingled recording (SMR)? That kind of data turnover would likely cause issues on one of those.)

I don't know what sort of data you are using or how critical it is, but dealing with 1TB a day, ECC would be top of my list.
 
Soldato
Joined
1 Apr 2014
Posts
18,610
Location
Aberdeen
- if I zip a good archive on the workstation and then unzip it, sometimes a file in the unzipped variant will corrupt

This indicates that the problem is on the workstation. ECC RAM is a good start.

If it was a disk issue, you would likely be getting corruption on a more regular basis.

I would replace the HDD as well, just to be sure, as the error you are seeing seems to be in the ballpark for a normal Uncorrectable Bit Error Rate on consumer drives.

You should check out Backblaze's blog and see if your drive is mentioned there.
 
Associate
OP
Joined
15 Oct 2006
Posts
268
Drive is a 6TB Seagate Enterprise Nearline (ST6000NM0024). I will swap it out and try that. ECC memory will mean a new workstation so this will happen when we next procure.

Thanks folks
 
Soldato
Joined
14 Apr 2014
Posts
2,586
Location
East Sussex
That kind of drive is not really suited to your use case, would definitely start there.
I doubt it's memory related, there would be other noticeable impacts if that was the case - any environmental changes in your office lately? Big new equipment etc?
 
Don
Joined
19 May 2012
Posts
17,148
Location
Spalding, Lincolnshire
That kind of drive is not really suited to your use case, would definitely start there.

On what grounds?

I doubt it's memory related, there would be other noticeable impacts if that was the case

No, there wouldn't. Whilst non-ECC is fine for normal daily use, when you are manipulating TBs of data the likelihood is that you will run into single-bit memory errors, which can ultimately result in data corruption.
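If a known-good copy of a corrupted file is still on the server, a byte-level diff may help tell a bit flip from a larger write fault. A rough Python sketch follows; `diff_files` is a made-up name, and it only compares the overlapping part of the two files, so a length difference needs checking separately.

```python
def diff_files(good_path, bad_path, chunk_size=1 << 20):
    """Return (offset, good_byte, bad_byte, bits_flipped) for each differing byte.

    Files are read in 1 MiB chunks; only chunks that differ are scanned
    byte by byte, so identical regions are skipped quickly.
    """
    diffs = []
    base = 0
    with open(good_path, "rb") as fg, open(bad_path, "rb") as fb:
        while True:
            g = fg.read(chunk_size)
            b = fb.read(chunk_size)
            if not g and not b:
                break
            if g != b:
                for i, (x, y) in enumerate(zip(g, b)):
                    if x != y:
                        # XOR exposes which bits differ; count the set bits.
                        diffs.append((base + i, x, y, bin(x ^ y).count("1")))
            base += max(len(g), len(b))
    return diffs
```

As a rule of thumb only: isolated bytes with exactly one flipped bit are what a memory error would typically look like, whereas longer runs of garbage, or damage starting on a 512-byte or 4096-byte boundary, would point more towards the disk, controller or cabling.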
 
Soldato
Joined
14 Apr 2014
Posts
2,586
Location
East Sussex
On what grounds?
I don't think workstations are the intended market - these drives are meant to sit somewhere between archive work and frequent access / disk intensive tasks, and all the performance and reliability data will have been produced for 24x7 enterprise usage.

I can't imagine the use case the OP has described is what they expected when they sat down to design it, so a drive not hitting its MTBF figure in this situation would not surprise me. Memory errors at that frequency would surprise me though, especially considering the OP's problem has only surfaced recently.

It would definitely be worth considering ECC memory, but much cheaper to check the disk...

Edit: Ha! I'm from Spalding too! Small world
 
Soldato
Joined
1 Apr 2014
Posts
18,610
Location
Aberdeen
Drive is a 6TB Seagate Enterprise Nearline (ST6000NM0024).

The UBER for this drive is 1 in 10^15 or 1 bit in 125 TB. So that should be 2-3 times per year in your use case.
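The arithmetic behind those figures, as a small Python sketch. It assumes decimal terabytes, a steady 1TB read per day, and that the UBER figure is quoted per bits read; the function names are just for illustration.

```python
def tb_per_unrecoverable_error(bits_per_error=10**15):
    """Convert an UBER of 1 error per `bits_per_error` bits into decimal TB per error."""
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 10**12


def expected_errors_per_year(tb_per_day=1.0, bits_per_error=10**15):
    """Expected unrecoverable errors per year at a steady daily volume."""
    return tb_per_day * 365 / tb_per_unrecoverable_error(bits_per_error)


print(tb_per_unrecoverable_error())  # 125.0
print(expected_errors_per_year())    # 2.92
```

For comparison, taking the OP's figures at face value (about 10,000 files of ~100MB per day, with 0.01-0.1% corrupted) gives roughly 1-10 bad files per day, which is well above what UBER alone would predict.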

Regardless of the cause of the problem, I think the OP should develop an automated means for checking for errors if not already done.
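One way to do that is a checksum manifest: hash every file on the server before transfer, then verify on the workstation afterwards. A minimal Python sketch, assuming SHA-256 via the standard `hashlib` module; the function names and manifest layout are my own choice, not anything the OP is known to use.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(root, manifest_path):
    """Run on the server: record '<digest>  <relative path>' for every file under root."""
    root = Path(root)
    with open(manifest_path, "w", encoding="utf-8") as out:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                out.write(f"{sha256_of(p)}  {p.relative_to(root).as_posix()}\n")


def verify_manifest(root, manifest_path):
    """Run on the workstation: return relative paths that are missing or do not match."""
    root = Path(root)
    bad = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            digest, rel = line.rstrip("\n").split("  ", 1)
            p = root / rel
            if not p.is_file() or sha256_of(p) != digest:
                bad.append(rel)
    return bad
```

Keep the manifest file outside `root` so it does not end up hashing itself. Anything returned by `verify_manifest` can be re-transferred automatically before the processing scripts run.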
 
Soldato
Joined
19 Jun 2009
Posts
3,869
The second point makes me think the fault lies not with the network but with the workstation (i.e. when the files are being written to the workstation drive). However, the workstation passes memtest and there are no SMART errors showing for the workstation drives.

I would load it back from the workstation drive (without copying). If the issue is still present then it rules out anything external to the workstation computer.

If it loads ok from the workstation computer, then it's either in network, or on the external computer used to load the files.

If after the above it's looking like a data issue external to the workstation, then you could try compressing the file with WinRAR with the data-recovery (recovery record) option ticked. If the recovery data is able to correct the issue then it's likely a network or data write issue.
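The read-back test can be scripted by hashing the same file several times in place and seeing whether the result is stable. A minimal Python sketch under that assumption; `reread_digests` is a made-up name.

```python
import hashlib
from collections import Counter


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def reread_digests(path, passes=5):
    """Hash the same file `passes` times and count how often each digest appears."""
    return Counter(file_digest(path) for _ in range(passes))
```

More than one distinct digest would suggest the read path or RAM is unstable. A single digest that differs from the server's would suggest the bad data is actually stored on disk. One caveat: the OS will usually serve repeated reads from its file cache, so to exercise the drive itself the cache needs to be flushed between passes (or the machine rebooted), which this sketch does not do.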
 