RAID 5 help

Hi Guys, bit of a weird one.

One of our clients has an HP server running Win 2k3 Server. They turned it off by the power button and then complained about it not running properly. We had a look, and in Event Viewer there were hundreds of "ntfs" errors.

My manager was looking over it and spoke to HP, and they confirmed that 2 of the 3 disks in the RAID 5 array had hit their read/write failure limit.

They said we will need to delete the RAID volume, replace the drives, recreate the RAID 5 volume and restore the system from a backup. We can't just replace the drives, as the rebuilt array would carry over the corrupted data.

This was then passed over to me, and after a lot of googling it looks like there was a hotfix that resolved this "ntfs" error, which was a problem with Windows. I thought what the hell, so I installed it and set the system to restart overnight (all done remotely) and also run a chkdsk.

Came back the next day, logged onto their system, and there hasn't been a single error since Tuesday morning, before I rebooted it. The server is running fine.

Do you think I still need to do the full backup and restore, given the server is running fine? Or can I just replace the drives?

Any help greatly appreciated
 
A RAID 5 should not give R/W errors (the disk should get marked as failed and drop out of the array). They probably trashed a few files by buttoning it. Do a tape backup, do a file copy backup, then kick off a chkdsk.

It will likely delete some rubbish and fix itself...

Then replace the suspect disks in the usual way...

Thanks for the reply mate.

So it will be safe to just replace the two drives one at a time: let the first one finish rebuilding, then pull the other one out and put the new one in to rebuild?
 
If it is an HP server with an HP Smart Array controller, just run the ACU (Array Controller Utility) and it will give you all the hardware details along with any faults. Examine each of the disks to see if they are reporting any errors. There is no point in replacing disks that don't need to be replaced; you are just risking the array going down while it is rebuilding (if another disk coincidentally happens to fail during the rebuild).
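To show why a second failure during a rebuild is fatal on RAID 5, here is a rough Python sketch of XOR parity on a single stripe. It is a simplified illustration only: the names (`xor_blocks`, `disk1`, `disk2`, `parity`) are made up for the example, and a real controller works on much larger stripes and rotates parity across the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk RAID 5: two data blocks plus parity.
disk1 = b"ntfs"
disk2 = b"data"
parity = xor_blocks([disk1, disk2])

# One disk lost: XOR the two survivors to rebuild the missing block.
rebuilt_disk2 = xor_blocks([disk1, parity])
assert rebuilt_disk2 == disk2

# Two disks lost: only one block survives, so there is nothing to XOR
# it against and the stripe cannot be reconstructed.
```

With three disks, every rebuild needs both of the other two to be readable for the whole rebuild, which is why the swaps have to be done one at a time with a finished rebuild in between.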

The disks are not reporting errors on there. The controller is saying it needs a critical update though, which I knew anyway.

So why is HP Insight Diagnostics reporting a read/write threshold failure, but not the ACU?

Do you think this is nothing to worry about then?
 
Thanks for the help so far mate.

The error is - Error: 640006: The Read and/or Write HARD error rate is above threshold

The controller is an E200
Server is a ProLiant DL180 G5

That's on disks 2 and 3. I will be going to their office tomorrow morning to update the firmware on the controller, and I was going to swap out one of the drives, leave it rebuilding, then go back Monday morning to replace the other drive and help them with any other issues they have.
 
I think you may be right. Thanks for all your help, really appreciate it.

I did see those notifications, but I just brushed past them as updating the firmware was one of the things on my list.

Will update the firmware tomorrow and check again like you said. Is the E200 firmware definitely on the SPP disk?
 