RAID 5 help

Hi Guys, bit of a weird one.

One of our clients has an HP server running Windows Server 2003. They turned it off by the power button and then complained about it not running properly. We had a look, and in Event Viewer there were hundreds of "ntfs" errors.

My manager was looking over it and spoke to HP, and they confirmed that 2 of the 3 disks in the RAID 5 had hit their read/write failure limit.

They said we will need to delete the RAID volume, replace the drives, recreate the RAID 5 volume and restore the system from a backup. We can't just replace the drives, as the array would carry the corrupted data over.

This was then passed over to me, and after a lot of googling it looks like there was a hotfix that resolved this "ntfs" error, which was a problem with Windows. I thought what the hell, so I installed it and set the system to restart overnight (all done remotely) and run a chkdsk.

Came back the next day, logged onto their system, and there hasn't been a single error since Tuesday morning, before I rebooted it. The server is running fine.

Do you think I still need to do the full backup and restore, given the server is running fine? Or can I just replace the drives?

Any help greatly appreciated.
 
A RAID 5 should not give R/W errors (the disk should get marked as failed and drop out of the array). They probably trashed a few files by buttoning it off. Do a tape backup, do a file copy backup, then kick off a chkdsk.

It will likely delete some rubbish and fix itself...

Then replace the suspect disks in the usual way.
 
Thanks for the reply mate.

So it will be safe to just replace the two drives, letting one rebuild, then pulling the other one out and putting the new one in to rebuild?
 
If it is an HP server with an HP Smart Array controller, just run the ACU (Array Controller Utility) and it will give you all the hardware details along with any faults. Examine each of the disks to see if they are reporting any errors. There is no point in replacing disks that don't need to be replaced; you are just risking the array going down while it is rebuilding (if another disk coincidentally happens to fail during the rebuild).
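
To show why a second failure during a rebuild is the thing to worry about, here is a rough Python sketch of RAID 5 style XOR parity on a single stripe. It is only an illustration: the block contents and the `xor_blocks` helper are made up for the example, and a real Smart Array controller handles striping, parity rotation and rebuilds in firmware.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk RAID 5: two data blocks plus one parity block.
data_a = b"NTFS"
data_b = b"MFT0"
parity = xor_blocks(data_a, data_b)

# One disk lost (say the one holding data_b): the missing block is simply
# the XOR of the two surviving blocks, so the rebuild succeeds.
rebuilt_b = xor_blocks(data_a, parity)
assert rebuilt_b == data_b

# Two disks lost: only the parity block survives. Many different data pairs
# produce the same parity, so the original contents cannot be recovered.
other_a = b"XXXX"
other_b = xor_blocks(other_a, parity)
assert xor_blocks(other_a, other_b) == parity
assert (other_a, other_b) != (data_a, data_b)
```

With three disks, every rebuild has to read both remaining disks in full, which is why swapping a disk that is actually healthy only adds risk.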
 
The disks are not reporting errors in there. The controller is saying it needs a critical update though, which I knew anyway.

So why is HP Insight Diagnostics reporting a read/write threshold failure, but not the ACU?

Do you think this is nothing to worry about then?
 
Would you mind posting the exact error? Insight Diagnostics is just collecting information from the Smart Array and the disks, so the ACU should be showing it too. I think my approach would be to flash the Smart Array and the disks (they nearly always require firmware updates also), and then re-run Insight Diagnostics and check everything again. If you are still being told there are problems with the disks, then most definitely get them swapped. Download the latest SPP ISO (it's from Feb 2013), boot from it, and it will put all the latest firmware on the server (assuming you can arrange an outage).
 
Thanks for the help so far mate.

The error is - Error: 640006: The Read and/or Write HARD error rate is above threshold

The controller is an E200
Server is a ProLiant DL180 G5

That's on disk 2 and 3. I will be going to their office tomorrow morning to update the firmware on the Controller and I was going to swap out one of the drives, leave it rebuilding and go back Monday morning and replace the other drive and help them with any other issues they have.
 
Mate, to me this looks like a firmware problem:

Error messages (critical to update):
SLOT 0 Smart Array E200i Controller HP Support Document ID c01318999: Upgrade controller firmware to prevent incorrect Bus Fault counts.
SLOT 0 Smart Array E200i Controller HP Support Document ID c01382041: Upgrade controller firmware to prevent false Predictive Failures.
SLOT 0 Smart Array E200i Controller HP Support Document ID c01587778: Upgrade controller firmware to prevent data write errors.
SLOT 0 Smart Array E200i Controller HP Support Document ID c01725956: Upgrade controller firmware to prevent Windows Blue Screen and STOP message.

http://h30499.www3.hp.com/t5/ProLia...e-error-640006-on-ML350G5/m-p/5598997#M127988
 
I think you may be right. Thanks for all your help, really appreciate it.

I did see those notifications but I just brushed past them, as updating the firmware was already one of the things on my list.

Will update the firmware tomorrow and check again like you said. Is the E200 firmware definitely on the SPP disk?
 