Disaster Recovery and VMware

Betsy's post is, in summary, excellent and I can't see anything wrong with it from my perspective. He has possibly overshot the price on a few items, but not by much. He has also rounded up a little in those cases, which I would say helps you out more!

Don't forget that whilst HA is excellent and really does help, you must not disregard the need for some sort of DR strategy. To be honest, given the expense, you could implement HA first and implement DR at some point in the very near future.

Hell, even if you used some software to grab full copies of your production VMs and copied them to tape or a portable drive for off-site storage, that would cover you as a basic principle in the short term. This wouldn't allow you to actively maintain business continuity, but if you have other data backups in place then a "Mickey Mouse" DR strategy of sorts would be implemented... enough until you find your feet, maybe.
 
HA for free? Well, I could write about this for hours on end.

Again, I don't know your situation well enough and the following is just a hypothetical construct from me based on what you have written.

The best way to tackle this hearts-and-minds problem is to list the business requirements and map them to the additional technical cost of delivering them.

How to do this efficiently and simply for HA server requirements?

What you may need to do is to list each application and get the head of the company to assign them a priority order and a recovery time objective. Don't let the application owners do this, as you will end up with a list of applications of equal priority which need to be recovered in under an hour ;)

You then need to perform a current state assessment, to confirm how long it would currently take to recover each of them individually and match against the target from the CEO/MD.

To work this (recovery time) out for each application you need to construct the "recovery chain", in terms of hours.

e.g. If you lose a server with 100GB of data on it, you will have to wait a minimum of 4 hours for the server hardware to arrive (if you are on platinum support), and you will then spend up to 6-8 hours building the OS and recovering the data, depending on the speed of your disk drives and tape library. This means the recovery chain is 1 business day: although it will only take you half a day (10-12 hours) in actual time, if the fault occurs at 7am you will not be operational again until around 7pm, thus missing a full business day.
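To make the recovery chain concrete, here is a rough Python sketch of the sum above. The step names, the function names and the date are made up for illustration; the hours are the worst-case figures from the example (4 hours for hardware, 8 hours for the rebuild and restore).

```python
from datetime import datetime, timedelta

# Hypothetical recovery chain for the lost-server example.
# Each step is (description, worst-case hours) and the steps run in sequence.
RECOVERY_CHAIN = [
    ("replacement hardware arrives", 4),
    ("rebuild OS and restore 100GB from tape", 8),
]

def recovery_hours(chain):
    """Total elapsed hours when every step runs one after another."""
    return sum(hours for _, hours in chain)

def back_online(failure_time, chain):
    """Clock time at which the application is usable again."""
    return failure_time + timedelta(hours=recovery_hours(chain))

failure = datetime(2010, 6, 1, 7, 0)  # arbitrary date, fault at 7am
restored = back_online(failure, RECOVERY_CHAIN)
print(recovery_hours(RECOVERY_CHAIN), "hours, back at", restored.strftime("%H:%M"))
```

With these numbers the chain is 12 hours, so a 7am fault puts you back online at 19:00, i.e. the whole business day is gone. Swap in your own steps and hours per application.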

The potential financial cost of missing a business day with a single application can be £0-£100m in my experience, although I am sure there are others which can lose more. You will need to place a £ sign against yours.

There are also other costs to factor in which may not be directly financial, such as reputational and operational ones. For example, if your website goes down and you provide services via the web, customers will lose faith in your ability to deliver if they see you are down (reputation). If your job is air-traffic control or another critical function, you may not be able to operate efficiently or without high risk of error while the computer system is down (operational). These can also be interrelated with financial costs, and doing a full assessment can be complex and expensive, so avoid going too deep and just identify the big ones.

Neither of these non-financial costs is generally acceptable to heads of business functions, so you will need to test their resolve in terms of how many occurrences they are willing to accept per year, which then drives the cost of your technical solution.

Multiple server loss events would constitute a disaster recovery scenario, which is different to your high-availability requirements, although they are related.

The difference between your current and target state can then be rolled into the business case. If there is no gap, there is no business case for the more expensive toys, as you do not need networked storage or the more expensive VMware licenses.
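A quick way to see whether there is a gap at all is to put the CEO/MD's targets next to your measured recovery chains. The sketch below is only an illustration in Python: the application names, priorities and hours are invented, and `App` and `recovery_gaps` are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    priority: int         # 1 = most important, assigned by the CEO/MD
    rto_hours: float      # target recovery time objective
    current_hours: float  # recovery chain measured in the current state

def recovery_gaps(apps):
    """Return (app, shortfall in hours) for apps missing their target, by priority."""
    gaps = [(a, a.current_hours - a.rto_hours)
            for a in apps if a.current_hours > a.rto_hours]
    return sorted(gaps, key=lambda pair: pair[0].priority)

# Made-up figures purely for illustration.
apps = [
    App("intranet", priority=3, rto_hours=8, current_hours=12),
    App("email", priority=1, rto_hours=4, current_hours=12),
    App("file server", priority=2, rto_hours=24, current_hours=12),
]

gaps = recovery_gaps(apps)
for app, shortfall in gaps:
    print(f"{app.name}: {shortfall:g}h over target")
```

Here email and the intranet miss their targets while the file server does not, so only the first two would feed the business case. An empty list means the current setup already meets the targets.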

There is a sting in the tail here, which you need to maintain perspective on. In smaller shops such as yours, the facilities which sit underneath your kit (i.e. the datacentre facility, power, cooling and network) will dictate the level of availability you will be able to provide with your virtualisation solution. VMware will only protect you against a limited set of faults.

If you lose your power, the server room chillers give way or your network paths are not fully diverse and go belly up, you will lose your entire cluster.

You can spend millions trying to protect yourselves against all the different types of faults. Just make sure you do not over-sell the server protection side as a panacea for all eventualities :P

Placing your entire environment into three racks inside a nice Tier 4 rented datacentre space would be a logical next step; however, the network pipes required to plumb you into said space remain prohibitive for most small businesses.

Technical note:

If you plump for vSphere Advanced Edition, it allows you to use 12 cores per socket, which I believe is the current limit for any commercially available AMD/Intel chip. Quad-cores will be fine; if you are buying new, I would consider the 6-core or 8-core models.

Regarding cores per license, I think this is only relevant if you go for the Standard License and are currently either an AMD shop or decide to go for the R810 with the top-end Intel chips as stated.

This (cores per license) is probably a short-term consideration, as it is common knowledge that VMware are considering switching to a per-VM licensing model for the next major release and have already done so with some of their new products, e.g. CapacityIQ.

My costs above include 2x Intel 8C/24MB chips @ 1.83GHz. This allows for all 4 sockets to be populated later in the R810 and allows scaling to 512GB RAM per box and 32 physical cores (64 logical), thus negating the need to buy more servers and junk the old ones. Business owners hate these types of gotchas.
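For anyone checking the core sums, here is the arithmetic as a few lines of Python. The 12-core limit is the Advanced Edition figure quoted above; the two threads per core assumes Hyper-Threading is enabled, and the function name is just a label I made up.

```python
SOCKETS = 4                # R810 with every socket populated
CORES_PER_SOCKET = 8       # the 8C chips from the costing
THREADS_PER_CORE = 2       # assumption: Hyper-Threading enabled
ADVANCED_CORE_LIMIT = 12   # cores per socket allowed by Advanced Edition (as quoted above)

def within_core_limit(cores_per_socket, limit):
    """True if a chip's core count is covered by the per-socket license limit."""
    return cores_per_socket <= limit

physical_cores = SOCKETS * CORES_PER_SOCKET
logical_cores = physical_cores * THREADS_PER_CORE

print(physical_cores, logical_cores)
print(within_core_limit(CORES_PER_SOCKET, ADVANCED_CORE_LIMIT))
```

That gives the 32 physical and 64 logical cores mentioned, with the 8-core chips sitting inside the 12-core-per-socket limit.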

I would steer clear of the AMD boxes as, in this specific area of IT, the Intel chips are generally over twice the speed with half the cores. AMD reps bleat on about memory bandwidth advantages, but in my experience this is rarely tangible in the real world for the majority of workloads.
 
As above - a good read. I echo the steering clear of AMD for a server environment.

I still need to work on our VM environment's DR - we're still doing tape backups, which causes the recovery time to be longer than I'd like, and we also have HA turned off as we're only running two ESXi nodes... I have some work to do on that front. Early stages though - ultimately we'll be mirroring our essential servers (file, email, DNS etc.) between two sites and then HA will be enabled.

From a server management perspective it's fantastic, it makes our lives so much easier when there is a centralised management point.

Our current setup is 2 DL380 G6s with Xeon E5540 processors and around 60GB of memory in each. They're currently running around 25 servers and they barely break a sweat. We have two SAN nodes with around 7TB of storage, but that is getting toward the high end of utilisation. However, the real beauty of the VM/SAN environment is that you have zero downtime when expanding and upgrading. Just plug it in and off you go.

ESXi-wise, we have it installed on USB drives mounted on the internal headers, and the backup is as easy as copying the install to an identical drive. On a failure we either swap the server and just stick the drive back in, or whip the top off the machine and put the backed-up USB drive in - server's back... so easy.
 
Following on a bit from Besty's words regarding the licensing.

If you can afford it, I would go for the Enterprise Plus licence; the main reason would be the distributed vSwitch.

This will allow you to create one switch that all your hosts can use, so you do not have to configure a switch on each host, something that we find invaluable. While we only have 3 hosts at the moment, it is one less thing we have to think about when adding in a new host.

Kimbie
 