Major Problem

What I've been having to deal with yesterday & today :eek:

HP MSA 2312fc with a 12-bay expansion shelf, 24 disks in total:
Running 3 vdisks:
vdisk01 is 11 disks, RAID 5
vdisk02 is 5 disks, RAID 5
vdisk03 is 5 disks, RAID 5
3 hot spare disks

This MSA hosts all my VMware servers, 23 of them to be exact.
I have 3 physical ESX servers & one physical backup server.

The whole company is down, all 23 VMware servers down. Why, you ask?
All this from a power cut. The UPSs kicked in and shut the lot down as they should do; once the power came back, the UPSs & I powered it all back on.
Then....

The MSA has put vdisks 1 & 2 offline in a degraded state & vdisk 3 is in a critical state.
All ESX Servers are fine & so is the Backup Server.
I have managed to copy all the virtual server image files off vdisks 1 & 2 to 3 USB HDs (took what, 10 hours or so), but vdisk 3 is inaccessible. That's the 4 remaining servers I cannot lose, as the current backup for these is over 6 weeks old due to backup failures and incomplete backups. The servers in question are the company's accounts for itself & its customers, including payroll, so a big one to lose or roll back 6 weeks; another hosts a custom database package with, let's just say, about 12 years of data on it. The other two are minimal, as I could rebuild them with no problem and minor loss: one's a proxy server & the other is a web server that is currently backed up to a 3rd-party web provider.
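If anyone ever has to do the same, a rough copy-and-verify sketch like the one below is the sort of thing I mean (the source and USB paths here are made up for illustration, point it at wherever your images are exposed):

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical source (exported degraded vdisk) and USB target paths.
SOURCE = Path(r"\\backupserver\vdisk01_export")
TARGET = Path(r"E:\msa_evacuation")

def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large VM images don't load into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(src_root: Path, dst_root: Path) -> None:
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # Verify each copy immediately; a degraded array can hand back bad reads.
        if sha256_of(src) != sha256_of(dst):
            print(f"CHECKSUM MISMATCH: {src}")
        else:
            print(f"OK: {src}")

if __name__ == "__main__":
    copy_and_verify(SOURCE, TARGET)
```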

Been on the dong & bone to HP on & off for the past two days, tried a few things, & diags from HP say there is a communication issue between one of the controllers & the midplane in the MSA, not the expansion shelf.
They say the arrays will be intact (they think, as it's unlikely that more than 4 or 5 drives can blow in one go)???? I'm unsure, as none of the hot spares kicked in at all. One of the major issues is that we can only gather logs from one of the controllers; this one reports as OK, just degraded arrays. I could bring these online and boot most of the servers..... but for how long is anyone's guess, and if another drive goes then it's GAME OVER for those arrays.

Now waiting for an HP engineer & parts. They say they are bringing a new chassis, which includes the midplane board...... I asked: what if it's the controller? They said unlikely?????

It's painful just sitting here waiting for this part to tip up when I have the MDs, Chief Exec & every member of staff I see all asking me how long.......?
I'm also thinking: what if HP swap this part and it doesn't work?

I have locked myself in the server room LOL

STRESSED IS NOT THE WORD.... WOOOSARRR..... WOOOSARRRR :(
JUST THOUGHT I'D LET ALL THIS OUT, FEEL A BIT BETTER NOW.
 
SAN, what's the worst that can happen..... oh yeah... it broke...

Single point of failure = fail... our SAN barely survived a power down, 5 failed drives, a controller card and something else last time!!

Still, it's not your problem if the hardware support (HP) are crap... all you can do is manage the situation and keep everyone updated... make sure you tell them that you suspect it's a part HP are not going to bring for their own reasons; then, if it does not fix the issue, you can say you told them so...

I always wonder if this sort of kit needs power-cycling every 6 months, so that when bits break hopefully it's not enough to trash the lot (my thinking being that drives and other devices can be working fine but WILL fail, for whatever reason, after a power cycle...).
 
I'm unsure, as none of the hot spares kicked in at all. One of the major issues is that we can only gather logs from one of the controllers; this one reports as OK, just degraded arrays. I could bring these online and boot most of the servers.

As happened to us, all the drives probably failed at once after a power cycle... I think the drives get into a state where they will not survive a power cycle but, as long as they are kept running, are OK - we always get something fail when we power cycle our data center.
 
MuttsNutts said:
Been on the dong & bone
Oo-err.

Wait for HP to get there and replace what they feel is appropriate - it gives you a get-out for if it's not completely successful.
It's never nice having drives fail, let alone losing data.

You need to look at getting your backup solution sorted. 6 weeks is unacceptable.
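Even something as crude as a scheduled script that checks the age of the newest backup file and shouts when it's stale would have flagged this weeks ago. A minimal sketch, assuming the backups land somewhere like D:\Backups (the path and threshold are just examples):

```python
import sys
import time
from pathlib import Path

# Example values only - point this at wherever your backup jobs write to.
BACKUP_DIR = Path(r"D:\Backups")
MAX_AGE_DAYS = 2

def newest_backup_age_days(backup_dir: Path) -> float:
    """Return the age in days of the most recently modified file under backup_dir."""
    files = [p for p in backup_dir.rglob("*") if p.is_file()]
    if not files:
        return float("inf")
    newest = max(p.stat().st_mtime for p in files)
    return (time.time() - newest) / 86400

if __name__ == "__main__":
    age = newest_backup_age_days(BACKUP_DIR)
    if age > MAX_AGE_DAYS:
        print(f"WARNING: newest backup is {age:.1f} days old")
        sys.exit(1)  # non-zero exit so a scheduled task or monitoring tool can alert
    print(f"OK: newest backup is {age:.1f} days old")
```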
 
HP just rang, have said 4pm now instead of 3pm..... another hour to kill :-(
Just had a butty from the local supermarket delivered by one of the girls..... awww bless, thanks, I thought, until she asked me why she can't get her emails & how long it will be.
OMG..... Shhhhh. Have told the powers that be that HP are bringing X part, but I think it's this, that & the other. I said at least it's progress, and with this new part we should be able to gather logs from the other controller now and pinpoint the failure.......
 
Oo-err.

Wait for HP to get there and replace what they feel is appropriate - it gives you a get-out for if it's not completely successful.
It's never nice having drives fail, let alone losing data.

You need to look at getting your backup solution sorted. 6 weeks is unacceptable.

Already got my own get-out policy for the backups not working: 2 months ago I installed a trial of Veeam Backup, as Backup Exec 12.5 was proving not to be as good or stable. Veeam ran great and worked every time. I put the offer to the powers that be about buying Veeam; they came back saying it's too expensive and let's just stick with what we have for now...... Sod's Law, 6 weeks later: power cut & the MSA dies.........

Dare I say I told you so. With the Veeam backups I could have had most servers back online & running within about 3-4 hours (once the MSA was fixed, of course). Now it's a drag-and-drop import of the virtual server images one by one, which will take an age from the USB drives they're on.
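If it helps anyone in the same boat: once the copied folders are back on a datastore, the registering doesn't have to be click-by-click. The rough sketch below just walks a datastore path for .vmx files and prints register commands; it assumes an ESXi-style host where vim-cmd solo/registervm is available (older classic ESX used vmware-cmd -s register instead), so treat the exact command and the path as assumptions for your version:

```python
from pathlib import Path

# Hypothetical datastore mount point on the host - adjust for your environment.
DATASTORE_ROOT = Path("/vmfs/volumes/restored_datastore")

def registration_commands(datastore_root: Path) -> list:
    """Find every .vmx under the datastore and build a register command for it."""
    commands = []
    for vmx in sorted(datastore_root.rglob("*.vmx")):
        commands.append(f"vim-cmd solo/registervm {vmx}")
    return commands

if __name__ == "__main__":
    for cmd in registration_commands(DATASTORE_ROOT):
        print(cmd)  # review the list, then run the commands on the host one at a time
```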
 
Why should stuff break when it's power cycled? That makes no sense.

Motors / bearings / boards cooling down and contracting, possibly a power spike on the next power cycle, and everything expanding as it heats up again.... I'm sure there are more reasons.
 
Ouch, good luck man.

We had a Dell SAN go pop out in Dallas when they were having power cuts at the start of the year & it took ages for them to fix it (though that was mainly down to the 6-hour time difference, it still sort of working & us wanting to copy all the VM images off it).

They ended up replacing the backplane, a controller card & a partridge in a pear tree. We went through a number of support levels with them all saying "I've not seen this before". In the end they had to wipe the RAID card configs, which meant losing all the data (luckily we had backups & could also copy the VM images off in scheduled downtime). Once they did that it started behaving itself again.

Hopefully they can get it sorted & it's not as messy as that. You really should get the backup situation sorted though dude, 6 weeks is an epic amount of time between backups.
 
How you getting on with this?

I managed to get all the virtual servers apart from 2 back online fully yesterday.
The bad news concerns what I thought would have been OK: I pulled the servers from the corrupt, bad array over to a good one and, like I said, all but 2 booted fine.

Bad news is one of the failed servers was the MAIN DC as well as the FILE server, so all the users' data....

I'm currently rebuilding and going to have to restore AD etc. from the backups that are 6 weeks old, then chop and change whatever data I can salvage. The other failed server isn't that important anyhow; it could be redone in a few hours, which I will do once I've got the major one fixed.

Still on with it, and I've got some help too now, as I've not slept much since Monday :o
 
Is all your file data on a VMFS, or an RDM? It may be that you can get the data back using some sort of tool. We had a file server with an RDM that 'lost' everything when our SAN was upgraded, but all the data was still on there. Can't remember what we used to get it back, but it was a *nix-based tool which was free..... If that's the situation you are in, let me know and I'll do some digging to find what we used.
 
Is all your file data on a VMFS, or an RDM? It may be that you can get the data back using some sort of tool. We had a file server with an RDM that 'lost' everything when our SAN was upgraded, but all the data was still on there. Can't remember what we used to get it back, but it was a *nix-based tool which was free..... If that's the situation you are in, let me know and I'll do some digging to find what we used.

Could it have been TestDisk?
 
.... Doubt it, they'll struggle to stay trading if they lose 6 weeks of accounting and client data.


Oh well. What did we learn?...

a) single points of failure are not good
b) 6 weekly backup is bad

a + b = DISASTER!

The larger the company, the easier it is to make it redundant.
 
You really have to allow for a SAN failure,

but that gets expensive.

Smaller setups, I'm sure, would be better off with dual-machine clusters: so if you need 10 real servers, have 12 running as 6 clusters, each with their own separate storage (so 6 sets of disk enclosures).

Of course you could take the view that you need 10 but in a disaster you can run on 8, so not bother with the extra 2.

So a single server failure is no issue, and a disk enclosure failure is also no issue (assuming you had a backup, which the OP did not, but at least he would be up and running much faster, as the old server backups could be brought up on the working hardware).

I'm sure the extra servers / enclosures would (I am guessing) cost about the same as an expensive SAN in the first place.

Not the first time a SAN has caused a major disaster.
 
a) single points of failure are not good
b) 6 weekly backup is bad

a + b = DISASTER!

The larger the company, the easier it is to make it redundant.

True to an extent, but no matter how small you are, you can always afford to invest more in DR, because if you don't, stuff like this can happen and cost you 10x more in losses.

I hear a lot about fancy backup solutions not being "cost effective". Cost-effectiveness becomes a moot point when 90% of SMEs that lose data in a critical failure cease trading within 2 years.
Data is priceless, protect it at all costs :)
 