Power failure, monitoring and graceful shutdown - VMware

Soldato
Joined
18 May 2010
Posts
22,371
Location
London
The company I work for at the moment is in the early stages of getting itself in order.

This was one of the reasons why they brought me on board as they didn't have a dedicated Mon-Fri Sys Admin.

We had a major power cut last week. They don't have any Nagios alerts set up for the internal VMware equipment, despite me suggesting it to them.

The UPS they have only has 30 minutes of battery life. I'm not sure why the senior engineers didn't do a graceful shutdown of the system, but they just left it to run on the UPS and then die.

When we came in the next morning we found that all the VMware equipment and the SAN were down. It took us about 2 hours to get everything back up again and start up all the servers.

So I have two questions:

1. Is it possible to set up alerting when the systems go down? My manager said how would we do this if the systems lost power...

2. Graceful shutdown. Is it possible to get the VMware software to do a graceful shutdown of the servers in the event of a power failure?
 
Soldato
Joined
26 Sep 2007
Posts
4,137
Location
Newcastle
As above, APC UPSs with an NMC (Network Management Card) and the VMware appliance. Then configure everything with your desired parameters and it'll gracefully shut down the VMs and then the hosts. Not sure about the SAN though!
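To give a feel for the kind of parameters involved, here is a rough Python sketch of the timing and ordering logic only. It does not talk to the UPS, the NMC or VMware; the function names, the shutdown durations and the safety margin are all made up for illustration, and the real appliance has its own settings for this.

```python
from typing import List, Tuple


def should_start_shutdown(runtime_remaining_min: float,
                          vm_shutdown_min: float,
                          host_shutdown_min: float,
                          safety_margin_min: float = 5.0) -> bool:
    """Return True once the remaining battery runtime is no longer
    comfortably more than the time needed to shut everything down.

    All durations are assumed values you would measure yourself.
    """
    needed = vm_shutdown_min + host_shutdown_min + safety_margin_min
    return runtime_remaining_min <= needed


def shutdown_plan(vms: List[str], hosts: List[str]) -> List[Tuple[str, str]]:
    """Order the work: every VM first, then the hosts they ran on."""
    return [("vm", name) for name in vms] + [("host", name) for name in hosts]
```

With a 30 minute UPS, an assumed 10 minutes for the VMs and 5 for the hosts, this would wait until about 20 minutes of runtime are left and then work through the VMs before touching the hosts.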
 
Caporegime
Joined
18 Oct 2002
Posts
26,080
I wouldn't worry too much about actually shutting the SAN down. They have battery- or flash-backed cache, and once the VMs running off them have shut down there are no open files to worry about corrupting.
 
Associate
Joined
27 Nov 2002
Posts
827
Location
Desborough,Kettering
1. Is it possible to set up alerting when the systems go down? My manager said how would we do this if the systems lost power...

How about a standalone server running a monitoring app, with redundant network links to the infrastructure?
Monitor multiple devices with ping and SNMP GET to confirm they are up and responding.
Fit a USB 3G dongle and set it to raise an alarm by SMS/email/telephone if it loses access to multiple ESX hosts. Something like https://serverscheck.com/monitoring-software/ (I haven't spent time looking, but assume you can find a UK functional equivalent).
Obviously it will need its own dedicated UPS to give it a longer life than the 30 minutes of your core systems, although I imagine you could use a small UPS and a lightweight PC with low power consumption to keep the cost down.
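As a very rough illustration of the "alarm only if several hosts vanish" idea, here is a minimal Python sketch. It swaps ping/SNMP for a plain TCP connect to port 443 (the ESX management port), because ICMP from Python normally needs elevated privileges. The hostnames and the threshold of 2 are placeholders, and actually sending the SMS/email is left out.

```python
import socket
from typing import Callable, Iterable, List


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def unreachable_hosts(hosts: Iterable[str],
                      probe: Callable[[str], bool] = tcp_reachable) -> List[str]:
    """Return the hosts that failed the probe, in the order given."""
    return [h for h in hosts if not probe(h)]


def should_alarm(hosts: Iterable[str],
                 threshold: int = 2,
                 probe: Callable[[str], bool] = tcp_reachable) -> bool:
    """Alarm only when at least `threshold` hosts are unreachable,
    so a single rebooting host does not page anyone."""
    return len(unreachable_hosts(hosts, probe)) >= threshold
```

You would call something like should_alarm(["esx1.example.local", "esx2.example.local", "esx3.example.local"]) every minute or so from the standalone box and hand a True result to whatever drives the 3G dongle.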

All this assumes you have a single site to work from. If you have multiple office sites you could monitor from a satellite office over a site-to-site VPN. You may get false positives if the VPN itself drops, but a loss of connectivity will at least give you a heads-up that there may be a major issue.

You could offload the worry/cost out of hours by engaging an IT managed service provider to remotely monitor the environment and call you out in the event of connectivity loss to the whole of it. If they perform alternative connectivity checks before calling you out of hours, that should reduce false positives.

I do work for an IT managed service provider, but am happy to ignore that angle if you are looking for a DIY solution.
 