Near-failing drive in Softraid

Ronald · 10 May 2015 at 12:09

I would like to hear opinions on the situation I have:

I have been running a set of 3x 2TB Western Digital Greens (WD20EARS) in an HP MicroServer N36L for over 3 years without any issues so far. They are configured as Linux Softraid 5 and perform very well.
Earlier this week smartd emailed some notifications about one of the drives apparently starting to fail.

SMART details:

Code:

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.19.0-16-generic] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA8058940
LU WWN Device Id: 5 0014ee 25b4cca04
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sun May 10 11:14:44 2015 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 246) Self-test routine in progress...
                                        60% of test remaining.
Total time to complete Offline 
data collection:                (37980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 366) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   173   167   021    Pre-fail  Always       -       6308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28719
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       55
193 Load_Cycle_Count        0x0032   106   106   000    Old_age   Always       -       284822
194 Temperature_Celsius     0x0022   109   104   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       65535
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       12
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   194   191   000    Old_age   Offline      -       1843

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     28699         270537619

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I marked the drive as failed and re-added it to the softraid to trigger a rebuild (a full rewrite of the drive), which completed successfully, but it doesn't make the drive look any healthier. There are still masses of pending sectors awaiting reallocation, so it may just have run out of spare space already?
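To keep an eye on the relevant counters between rebuilds, the attribute table can be parsed with a few lines of Python. This is only a minimal sketch: the three sample rows are copied from the smartctl output above, while `parse_attributes`, `suspicious` and the `WATCHED` list are hypothetical names chosen for illustration, and the regular expression assumes the column layout shown in this thread.

```python
import re

# Three rows copied from the smartctl attribute table above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       65535
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       12
"""

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Hypothetical watch list: reallocated, pending and offline-uncorrectable sectors.
WATCHED = {5, 197, 198}


def parse_attributes(text):
    """Return {attribute_id: {...}} for every row that matches the table layout."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, value, worst, thresh, raw = match.groups()
            attrs[int(attr_id)] = {
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs


def suspicious(attrs):
    """List the watched attribute IDs whose raw counter is non-zero."""
    return sorted(i for i in WATCHED if i in attrs and attrs[i]["raw"] > 0)


attributes = parse_attributes(SAMPLE)
for attr_id in suspicious(attributes):
    print(attr_id, attributes[attr_id]["name"], attributes[attr_id]["raw"])
```

On the sample rows this flags attributes 197 and 198 and leaves 5 alone, which matches the picture in the dump: nothing reallocated yet, but pending and uncorrectable sectors are reported.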

I'll have to replace at least the failing drive, but I see two options:

Grab https://www.overclockers.co.uk/showproduct.php?prodid=HD-398-WD which is specified to have exactly the same sector count as the failing drive, so it should be a fine replacement from the softraid driver's point of view.
Or upgrade to https://www.overclockers.co.uk/showproduct.php?prodid=BU-007-WD replacing all 3 drives and gain 2 TB effective space.
Backblaze statistics show drives start increasing failure rates after 3 years, so further failures are becoming more likely.
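The capacity side of the second option is simple arithmetic: RAID 5 spends one drive's worth of space on parity. The sketch below assumes the replacement set would be three 3 TB drives, which is not stated above and is only inferred from the 2 TB gain mentioned; `raid5_usable_tb` is a hypothetical helper name.

```python
def raid5_usable_tb(drive_count, drive_tb):
    """Usable RAID 5 capacity: one drive's worth of space goes to parity."""
    if drive_count < 3:
        # This sketch only models the usual case of three or more drives.
        raise ValueError("expected at least three drives")
    return (drive_count - 1) * drive_tb


current = raid5_usable_tb(3, 2)   # existing 3x 2 TB array
upgraded = raid5_usable_tb(3, 3)  # assumed 3x 3 TB replacement set
print(f"current: {current} TB, upgraded: {upgraded} TB, gain: {upgraded - current} TB")
```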

KIA · 10 May 2015 at 13:53

65535 is the max number of log entries, ~~so you should replace the drive as soon as possible~~.

Load_Cycle_Count is very high. Have a read of this.
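To put a number on "very high", the load cycle count can be divided by the power-on hours. This is a rough sketch using the raw values of attributes 193 and 9 from the smartctl output above; it gives an average over the drive's whole life and says nothing about how the cycles were distributed.

```python
# Raw values taken from the smartctl output earlier in the thread.
LOAD_CYCLE_COUNT = 284_822  # attribute 193
POWER_ON_HOURS = 28_719     # attribute 9

cycles_per_hour = LOAD_CYCLE_COUNT / POWER_ON_HOURS
seconds_between_cycles = 3600 / cycles_per_hour

print(f"{cycles_per_hour:.1f} load cycles per power-on hour")
print(f"about one cycle every {seconds_between_cycles / 60:.0f} minutes on average")
```

That works out to roughly ten head load cycles for every hour the drive has been powered on.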

What does WD's tool make of the drive?

EDIT: You should zero the drive before you do anything else. Review the SMART data after you've done this.

Ronald · 10 May 2015 at 16:44

KIA said:
65535 is the max number of log entries, ~~so you should replace the drive as soon as possible~~.

Load_Cycle_Count is very high. Have a read of this.

Thanks for that

Good thing the OS runs off a separate drive, it doesn't quite have the 2 million cycles ;-)

The timer is now disabled for all 3 drives.

What does WD's tool make of the drive?

EDIT: You should zero the drive before you do anything else. Review the SMART data after you've done this.

I'll have to take the drive out as the microserver runs Linux. ~~so will look at this later~~ Full test and zero will take another 9 or so hours, so will post results tomorrow.

Before I took the drive out I ran another long self test, and it passed...

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28723         -
# 2  Extended offline    Completed: read failure       10%     28699         270537619
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
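The LBA_of_first_error in the failed run can be turned into a position on the disk, which may help when checking that region again later. A minimal sketch, assuming the LBA is counted in 512-byte logical sectors as reported in the information section (512 bytes logical, 4096 bytes physical); `locate` is a hypothetical helper name.

```python
LOGICAL_SECTOR = 512                # bytes, from "Sector Sizes"
PHYSICAL_SECTOR = 4096              # bytes, from "Sector Sizes"
CAPACITY_BYTES = 2_000_398_934_016  # from "User Capacity"
FIRST_ERROR_LBA = 270_537_619       # from the failed extended self-test


def locate(lba):
    """Map a logical LBA to its byte offset and enclosing 4K physical sector."""
    per_physical = PHYSICAL_SECTOR // LOGICAL_SECTOR
    byte_offset = lba * LOGICAL_SECTOR
    return {
        "byte_offset": byte_offset,
        "physical_sector": lba // per_physical,
        "index_in_physical": lba % per_physical,
        "fraction_of_disk": byte_offset / CAPACITY_BYTES,
    }


location = locate(FIRST_ERROR_LBA)
print(f"byte offset:     {location['byte_offset']:,}")
print(f"physical sector: {location['physical_sector']:,}")
print(f"position:        {location['fraction_of_disk']:.1%} into the disk")
```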

Ronald · 10 May 2015 at 23:29

Zeroing the drive took 6 hours, an hour and a half more than rebuilding the RAID 5 array...

WD's tool has been showing PASS under SMART status all the time. I'll run Extended test overnight and report in the morning.

In the meantime... regardless of the outcome... I'm not sure I trust the drive.
So back to one of the questions in the opening post: would you replace just the one drive, or all 3 as a set?

Ronald · 11 May 2015 at 10:14

So WD's diagnostics tool gives it its blessing, with a pass throughout.

Code:

Test Option: QUICK TEST 
Model Number: WDC WD20EARS-00MVWB0 
Unit Serial Number: WD-WCAZA8058940 
Firmware Number: 51.0AB51 
Capacity: 2000.40 GB 
SMART Status: PASS 
Test Result: PASS 
Test Time: 17:04:55, May 10, 2015 

Test Option: WRITE ZEROS 
Model Number: WDC WD20EARS-00MVWB0 
Unit Serial Number: WD-WCAZA8058940 
Firmware Number: 51.0AB51 
Capacity: 2000.40 GB 
SMART Status: PASS 
Test Result: COMPLETE 
Test Time: 23:17:58, May 10, 2015 


Test Option: EXTENDED TEST 
Model Number: WDC WD20EARS-00MVWB0 
Unit Serial Number: WD-WCAZA8058940 
Firmware Number: 51.0AB51 
Capacity: 2000.40 GB 
SMART Status: PASS 
Test Result: PASS 
Test Time: 10:11:57, May 11, 2015

KIA · 11 May 2015 at 10:45

I'd replace all three with WD Reds.

What do these look like now?

Reallocated_Sector_Ct
Current_Pending_Sector

Ronald · 11 May 2015 at 11:32

I put it back in the microserver and am rebuilding the array (an about-to-fail drive can't be worse than no drive, after all).

So back to the smartmontools output, which actually shows slightly more detail than the WD diag tool... No changes from the doubtful state:

Code:

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.19.0-16-generic] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA8058940
LU WWN Device Id: 5 0014ee 25b4cca04
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon May 11 11:02:00 2015 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (37980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 366) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   179   167   021    Pre-fail  Always       -       6050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       86
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28738
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       85
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       56
193 Load_Cycle_Count        0x0032   106   106   000    Old_age   Always       -       284834
194 Temperature_Celsius     0x0022   113   104   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       65535
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       12
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   191   000    Old_age   Offline      -       60

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     28725         -
# 2  Extended offline    Completed without error       00%     28723         -
# 3  Extended offline    Completed: read failure       10%     28699         270537619
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
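Comparing the two smartctl dumps by eye is error-prone, so here is a small sketch that diffs a handful of raw values. The numbers are copied from the 10 May and 11 May outputs in this thread; the choice of attributes and the helper name `changed` are just for illustration.

```python
# Raw values from the 10 May smartctl output.
BEFORE = {
    "Reallocated_Sector_Ct": 0,
    "Power_On_Hours": 28719,
    "Load_Cycle_Count": 284822,
    "Current_Pending_Sector": 65535,
    "Offline_Uncorrectable": 12,
    "Multi_Zone_Error_Rate": 1843,
}

# Raw values from the 11 May smartctl output, after zeroing the drive.
AFTER = {
    "Reallocated_Sector_Ct": 0,
    "Power_On_Hours": 28738,
    "Load_Cycle_Count": 284834,
    "Current_Pending_Sector": 65535,
    "Offline_Uncorrectable": 12,
    "Multi_Zone_Error_Rate": 60,
}


def changed(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


delta = changed(BEFORE, AFTER)
for name, (old, new) in delta.items():
    print(f"{name}: {old} -> {new}")

unchanged = [name for name in BEFORE if name not in delta]
print("unchanged:", ", ".join(unchanged))
```

On these values the reallocated, pending and offline-uncorrectable counters come out unchanged, while Multi_Zone_Error_Rate drops from 1843 to 60.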