Any enterprise-level VMware people here?

Just want to discuss a few Enterprise DR thoughts.

I'm planning to configure a 2nd data centre and want to run it Active/Active with the primary. The business wants as little downtime as possible in the event of a DR scenario. I already have offsite backups and replication, but the recovery time for that is way too long.

My thoughts are to run a stretched cluster over 2 locations, with SAN replication keeping the VMs in sync in near real time. Should DC1 fail, I should be able to bring VMs up in DC2 almost instantly.

I know stretched clusters are not usually recommended for long distances, but we have good networking in place so I can hopefully mitigate that. The 2 DCs will be in different countries, so the distances are large.

Any thoughts? Is anyone running a similar configuration who can share some experience of it?

Thanks.
 
I've operated a number of stretched clusters between multiple DCs, but always in a metro setup where <1ms latency is involved. Previously a stretched cluster with >10ms wasn't supported due to issues with vMotion, but this is now irrelevant with the improvements in vSphere 7. With what you are proposing (SAN sync replication) you would ideally want an orchestration piece to handle spinning up resources should a failure happen in the primary location, something like VMware SRM, or look at software like Zerto/Veeam to do the replication and failover part. Keep in mind that for near-seamless failover you would also need to look at how the networking would work for those VMs; a lot of stretched clusters use stretched layer 2 networking or SDN of some sort.
 
I'm in the early stages, but would ideally be looking at Layer 2 between the DCs. Latency I don't know yet, but we will have dual 10Gb links between them so I'm hoping it will be acceptable. If latency was a problem then I would need to re-think everything.

The SAN will be an active cluster using iSCSI with uniform connections between SAN and hosts, so I'm hoping VMware HA would be able to just vMotion/spin up VMs if a failure happened, with no need for SRM.

We don't have a huge number of VMs. At last count it was around 250, with approx 75TB of data in use.
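
As a rough sanity check on those numbers, here is a back-of-envelope sketch of how long an initial full copy of 75TB would take over dual 10Gb links. The assumptions are mine, not measured: both links are aggregated, TB means decimal terabytes, and `efficiency` is a hypothetical fudge factor for protocol overhead and other traffic on the links.

```python
def initial_sync_hours(data_tb: float, link_gbps: float,
                       links: int = 2, efficiency: float = 1.0) -> float:
    """Hours needed to copy data_tb (decimal TB) over `links` links of
    link_gbps each, at the given fraction of line rate."""
    bits_to_move = data_tb * 1e12 * 8                    # TB -> bytes -> bits
    usable_bps = link_gbps * 1e9 * links * efficiency    # aggregate usable rate
    return bits_to_move / usable_bps / 3600              # seconds -> hours


if __name__ == "__main__":
    # 75TB over 2 x 10Gb at full line rate, then at an assumed 50% efficiency
    print(f"{initial_sync_hours(75, 10):.1f} h at line rate")
    print(f"{initial_sync_hours(75, 10, efficiency=0.5):.1f} h at 50% efficiency")
```

This only covers the initial seed. Ongoing replication traffic depends on the change rate, which isn't known yet.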
 
Pure ActiveCluster? There's a lot of devil in the detail there: ActiveCluster, sync and async... The only situation where you could natively rely on VMware HA to recover VMs is if you have ActiveCluster/Peer Persistence/Metro Node.
 
Correct, Pure ActiveCluster.

Veeam is something I will look into, I've used it in the past but not in this type of scenario.

The thing I need to maintain is the RTO, this needs to be as low as possible.
 
So the main thing you need to understand is the RTT latency between the two DCs; then you have a number of options to provide a near-seamless failover. Latency permitting, ActiveCluster is great. I've been using it without any issues for the past 4 years.

If you don't have suitable latency, then the next best option that sticks with native SAN replication would be async with SRM. IMO of course.
 
Cool, good to hear from someone with a positive experience of ActiveCluster. We have 2 Pure Storage SANs at the moment, and this would be a 3rd, but the current ones don't replicate.
 
I've built loads of active/active clusters based on Dell / VMware.

The easiest by a million miles is a vSAN stretched cluster - ridiculously simple. This is pretty much my go-to solution these days... bang in some Dell vSAN Ready Nodes if you want it cheap, or VxRail if you want a turnkey solution. After that I've used Dell EMC's VPLEX to stretch traditional SANs (XtremIO, Unity, VNX, PowerStore etc), which works exceptionally well, but is expensive.

With whatever solution you opt for, distance is not the factor you think it is. There are solutions stretched over 100 miles or more. The bigger factor is latency. Make sure any solution you build has a minimum of 10Gb L2 links between your DCs and you'll be golden. Don't forget you'll need a third site to deploy a witness node to with any good active/active solution.
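
To put a rough number on the distance vs latency point, here is a minimal sketch that turns a fibre route length into a best-case RTT. It assumes light travels at roughly 200,000 km/s in fibre (about 200 km per millisecond) and ignores switching, routing and storage protocol overhead, so real figures will be higher. The function names are made up for illustration.

```python
FIBRE_KM_PER_MS = 200.0  # approx. speed of light in fibre: ~200,000 km/s


def min_rtt_ms(route_km: float) -> float:
    """Best-case round-trip time in ms for a fibre route of route_km
    (propagation delay only, there and back)."""
    return 2 * route_km / FIBRE_KM_PER_MS


def max_route_km(rtt_budget_ms: float) -> float:
    """Longest fibre route (km) that still fits inside an RTT budget,
    again counting propagation delay only."""
    return rtt_budget_ms * FIBRE_KM_PER_MS / 2


if __name__ == "__main__":
    miles = 100
    km = miles * 1.609344          # statute miles -> km
    print(f"{miles} miles: at least {min_rtt_ms(km):.2f} ms RTT")
    print(f"1 ms RTT budget: at most {max_route_km(1):.0f} km of fibre")
```

Note that the fibre route is usually longer than the straight-line distance, so measure the real RTT before committing to a design.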

Don't forget other parts of the solution though. An active/active vSphere environment is no good if other parts of the infrastructure aren't also resilient. For example, firewalls - go with a decent solution like a pair of Cisco Firepowers in a HA pair, load-balancers - a HA pair of F5s with the DNS (GTM) module, or NSX AVI (active/active load balancer).

Active/Active makes so much sense; it reduces, in most cases entirely, the need for a DR site with kit that costs an arm and a leg and isn't used for its entire life span.
 
If you're already invested in Pure, the best solution is to stick with it.

As I recall you need an RTT of 11ms or less between sites for ActiveCluster. If you don't have this, the next best thing is ActiveDR.

If adding another Pure array is out of budget, then look at Zerto.
 
We don't do ActiveCluster as we prefer to keep sites 'separate' - but we have around 30 Pure arrays which replicate and we have little to no issues with them. Everything from X90s to C60s. As Hellsmk2 says, stretched vSAN is another option, but you're realistically looking at sub-5ms latency there, so if you're going across separate countries this is probably a no-go. Zerto is good if you want to go towards that replication methodology, although it's been bought out by HPE who will no doubt ruin it like they have everything else they purchased.
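
Pulling the latency figures from this thread together, here is a small sketch that takes a measured RTT and lists which synchronous options would still be in play. The limits are simply the numbers quoted above (11ms for ActiveCluster, 5ms for stretched vSAN), so treat them as assumptions to confirm against current vendor documentation. The names in the code are hypothetical.

```python
# RTT limits in ms as quoted in this thread; confirm against vendor docs.
THREAD_LIMITS_MS = {
    "Pure ActiveCluster": 11.0,
    "vSAN stretched cluster": 5.0,
}

# Async alternatives suggested in the thread when latency is too high.
ASYNC_FALLBACKS = ["Pure ActiveDR", "async SAN replication with SRM", "Zerto"]


def sync_options_for(rtt_ms: float) -> list[str]:
    """Return the synchronous options whose quoted RTT limit is met.
    An empty list means falling back to one of ASYNC_FALLBACKS."""
    return [name for name, limit in THREAD_LIMITS_MS.items() if rtt_ms <= limit]


if __name__ == "__main__":
    for rtt in (3.0, 8.0, 25.0):
        options = sync_options_for(rtt) or ASYNC_FALLBACKS
        print(f"{rtt} ms RTT -> {', '.join(options)}")
```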

For example, firewalls - go with a decent solution like a pair of Cisco Firepowers in a HA pair

Do you work for Cisco? That's the first time that phrase has ever been written outside of Cisco/Cisco VARs, FirePowers are never, ever a decent solution. Palo/Fortigates are laughing their way to huge profits on the back of Firepower being so bad that people just want to run ASA code to this day.
 
Will add another thing to consider - if you run a vSAN stretched cluster on 2 sites with the witness on a third, remember to put a very good UPS in at least two locations...
You also need plenty of CPU/RAM/storage overhead to be able to run the entire company off only one site in case you need to do maintenance on VMware hosts/hardware in the other datacentre.
VCSA in v7 has the native ability to run a vSAN stretched cluster from only one location without the witness, but 6.7 does not.

Edit: with vSphere 7 the witness does not really need to be on a separate site; it can be in a separate location within one of the datacentre sites, but I would still think separate and dedicated hardware is a must for it.
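
The overhead point can be checked with simple arithmetic. Below is a minimal sketch, with made-up capacity figures, that asks whether the remaining site(s) can carry the whole workload when one site is down for maintenance or has failed. The `Site` class and its fields are hypothetical and only cover CPU and RAM; storage would need the same treatment.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    cpu_ghz: float   # usable CPU capacity at the site
    ram_gb: float    # usable RAM capacity at the site


def survives_loss_of(sites: list[Site], failed: str,
                     demand_cpu_ghz: float, demand_ram_gb: float) -> bool:
    """True if the sites other than `failed` can carry the full demand."""
    remaining = [s for s in sites if s.name != failed]
    cpu = sum(s.cpu_ghz for s in remaining)
    ram = sum(s.ram_gb for s in remaining)
    return cpu >= demand_cpu_ghz and ram >= demand_ram_gb


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the thread.
    sites = [Site("DC1", 400, 4096), Site("DC2", 400, 4096)]
    for down in ("DC1", "DC2"):
        ok = survives_loss_of(sites, down, demand_cpu_ghz=350, demand_ram_gb=3000)
        print(f"{down} down: {'OK' if ok else 'not enough capacity'}")
```

With two equal sites this boils down to keeping total demand under the capacity of one site, which is where the cost of the redundant kit comes from.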
 
Stretched clusters aren't "Disaster Recovery"; they're disaster/downtime avoidance at best, largely providing cost effectiveness. Whatever you end up with, you need to consider the fault domains and how easily you can recover in the event of a disaster. You need to fully understand your RTO/RPO and routinely test your recovery plans. Site Recovery Manager used to be rather good.
 
ecksmen - disaster avoidance is very true. If you plan properly for RAM/CPU capacity and generator power supply for at least one location, you can avoid loss of data. Over the last half a year I was several times in a position where only good UPSs saved us from shutting down the entire company. Not IT's fault, we were on the receiving end of problems elsewhere, but nevertheless...
 
Do you work for Cisco? That's the first time that phrase has ever been written outside of Cisco/Cisco VARs, FirePowers are never, ever a decent solution. Palo/Fortigates are laughing their way to huge profits on the back of Firepower being so bad that people just want to run ASA code to this day.

Firepower are the ones with the IPS stuff aren’t they from when they bought Sourcefire?
 
I work for VMware. vSAN stretched cluster is super easy provided you can satisfy the networking requirements, but as has been said it is not a DR solution.
 
They technically do IPS/IDS etc yeah, but they're pretty much universally reviled too because they're bad.

I just remember from my IPS selling days at a competitor to Sourcefire having demos that showed how easily they could be fooled/defeated ;)
 
Correct, Pure ActiveCluster.

Veeam is something I will look into, I've used it in the past but not in this type of scenario.

The thing I need to maintain is the RTO, this needs to be as low as possible.

Has your business actually defined an RTO or just said as fast as possible? If DR is the real driving factor in this expensive project then whoever is in charge of BC should have this defined and agreed with the business based on proper impact analysis etc. Then you need a risk analysis: what are the disasters you are actually trying to protect yourself from? An active/active cluster is potentially OK if you are concerned about a power outage at a single site (although to deliver like-for-like performance you end up with significant redundant kit at both locations), but it won't help you in a ransomware attack, as you will likely have replicated the issue in real time between the two DCs. Lots to think about before designing and implementing something expensive that might not help!
 