Any enterprise-level VMware people here?

Si. · 19 Nov 2021 at 11:52

Just want to discuss a few Enterprise DR thoughts.

I'm planning to configure a 2nd data centre and want to run it Active/Active with the primary. The business wants as little downtime as possible in the event of a DR scenario. I already have offsite backups and replication, but the recovery time for that is way too long.

My thoughts are to run a stretched cluster over 2 locations, with SAN replication to keep the VMs in sync in almost real time. Should DC1 fail, I should then be able to bring VMs up in DC2 almost instantly.

I know stretched clusters are not usually recommended for long distances, but we have good networking in place so I can hopefully mitigate that. The 2 DCs will be in different countries, so the distances are large.

Any thoughts? Is anyone running a similar configuration who can share some first-hand experience?

Thanks.

AlexD · 19 Nov 2021 at 12:59

I've operated a number of stretched clusters between multiple DCs, but always in a metro setup with <1ms latency. Previously a stretched cluster with >10ms wasn't supported due to issues with vMotion, but this is now irrelevant with the improvements in vSphere 7. With what you are proposing (SAN sync replication) you would ideally want an orchestration piece to handle spinning up resources should a failure happen in the primary location: something like VMware SRM, or software like Zerto/Veeam to do the replication and failover part. Keep in mind that for near-seamless failover you would also need to look at how the networking would work for those VMs; a lot of stretched clusters use layer 2 stretched networking or SDN of some sort.

Si. · 19 Nov 2021 at 13:07

I'm in the early stages, but would ideally be looking at Layer 2 between the DCs. Latency I don't know yet, but we will have dual 10Gb links between them so I'm hoping it will be acceptable. If latency was a problem then I would need to re-think everything.

The SAN will be an active cluster using iSCSI with uniform connections between SAN and hosts, so I'm hoping VMware HA would be able to just vMotion/spin up VMs if a failure happened. So no need for SRM.

We don't have a huge number of VMs. At last count it was around 250, with approx 75TB of data in use.
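
As a rough sanity check on those numbers, here is a back-of-envelope sketch of how long an initial full copy of the 75TB would take over the inter-DC links. It assumes decimal terabytes and that replication can use the whole link; the `efficiency` parameter is a hypothetical knob for protocol overhead and other traffic, not a measured figure.

```python
def seed_time_hours(data_tb: float, link_gbps: float, links: int = 1,
                    efficiency: float = 1.0) -> float:
    """Hours to copy data_tb (decimal TB) over `links` links of link_gbps each."""
    bits = data_tb * 1e12 * 8                           # TB -> bits
    usable_bps = link_gbps * 1e9 * links * efficiency   # aggregate usable bits/s
    return bits / usable_bps / 3600


# 75 TB in use and dual 10Gb links are the figures from this thread.
print(f"both links, line rate: {seed_time_hours(75, 10, links=2):.1f} h")
# 70% usable on a single link is an illustrative assumption.
print(f"one link, 70% usable:  {seed_time_hours(75, 10, efficiency=0.7):.1f} h")
```

So the first seed is somewhere between a working day and a full day depending on how much of the links replication really gets. Bandwidth only governs how quickly data can be copied; it says nothing about the round-trip latency that synchronous replication depends on.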

AlexD · 19 Nov 2021 at 13:19

Pure Active cluster? There's a lot of devil in the detail there: active cluster, sync and async... The only situation where you could natively rely on VMware HA to recover VMs is if you have ActiveCluster/PeerPersistence/MetroNode.

Si. · 19 Nov 2021 at 13:20

Correct, Pure Active cluster.

Veeam is something I will look into, I've used it in the past but not in this type of scenario.

The thing I need to maintain is the RTO, this needs to be as low as possible.

AlexD · 19 Nov 2021 at 13:26

So the main thing you need to understand is the RTT latency involved between the two DCs; then you have a number of options to provide a near-seamless failover. Latency permitting, ActiveCluster is great. I've been using it without any issues for the past 4 years.

If you don't have a suitable latency then the next best option to keep with native SAN replication would be async with SRM. IMO of course.

Si. · 19 Nov 2021 at 13:29

Cool, good to hear from someone with a positive experience of ActiveCluster. We have 2 Pure Storage SANs at the moment; this would be a 3rd, but the current ones don't replicate.

Hellsmk2 · 19 Nov 2021 at 19:20

I've built loads of active active clusters based on Dell / VMware.

The easiest by a million miles is a vSAN stretched cluster - ridiculously simple. This is pretty much my go-to solution these days... bang in some Dell vSAN Ready Nodes if you want it cheap, or VxRail if you want a turnkey solution. After that I've used Dell EMC's VPLEX to stretch traditional SANs (XtremIO, Unity, VNX, PowerStore etc), which works exceptionally well, but is expensive.

With whatever solution you opt for, distance is not the factor you think it is. There are solutions stretched over 100 miles or more. The bigger factor is latency. Make sure any solution you build has a minimum of 10Gb L2 links between your DCs and you'll be golden. Don't forget you'll need a third site to deploy a witness node to with any good active/active solution.

Don't forget other parts of the solution though. It's all well and good having an active/active vSphere environment, but not if other parts of the infrastructure aren't also resilient. For example, firewalls - go with a decent solution like a pair of Cisco Firepowers in a HA pair, load-balancers - a HA pair of F5s with the DNS (GTM) module, or NSX AVI (active/active LBer).

Active/Active makes so much sense; it reduces, in most cases entirely, the need for a DR site with kit that costs an arm and a leg and isn't used for its entire life span.

khrall · 19 Nov 2021 at 21:07

If you're already invested in Pure, the best solution is to stick with it.

As I recall you need an RTT of no more than 11ms between sites for ActiveCluster. If you don't have this, the next best thing is ActiveDR.

If adding another Pure array is out of budget, then look at Zerto.
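
To make that decision concrete once the links are in, here is a minimal sketch that times TCP handshakes as a rough RTT proxy and checks the samples against the limit. The 11ms default is the figure recalled above, not a verified spec, so confirm it against Pure's current documentation; `tcp_rtt_ms`, `pick_replication` and the sample values are all illustrative names and numbers.

```python
import socket
import statistics
import time

# Figure quoted from memory in this thread; verify before relying on it.
ACTIVECLUSTER_MAX_RTT_MS = 11.0


def tcp_rtt_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP handshake to host:port in milliseconds (rough RTT proxy)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def pick_replication(samples_ms, max_rtt_ms=ACTIVECLUSTER_MAX_RTT_MS):
    """Judge on the worst sample, since synchronous writes wait on slow trips too."""
    worst = max(samples_ms)
    median = statistics.median(samples_ms)
    choice = "ActiveCluster" if worst <= max_rtt_ms else "ActiveDR"
    return choice, median, worst


# Made-up samples; in practice collect them over days, for example with
# [tcp_rtt_ms(remote_host, 443) for _ in range(20)] against a host in the other DC.
samples = [3.1, 3.4, 2.9, 12.5]
print(pick_replication(samples))
```

Judging on the worst sample rather than the median is a deliberately cautious choice here; a percentile over a longer capture would be a reasonable alternative.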

Throrik · 21 Nov 2021 at 23:45

We don't do ActiveCluster as we prefer to keep sites 'separate' - but we have around 30 Pure Arrays which replicate and we have little to no issues with them. Everything from X90s to C60s. As Hellsmk2 says Stretched vSAN is another option, but you're realistically looking at sub-5ms latency there, so if you're going across separate countries this is probably a no go. Zerto is good if you want to go towards that replication methodology, although it's been bought out by HPE who will no doubt ruin it like they have everything else they purchased.

Hellsmk2 said:
For example, firewalls - go with a decent solution like a pair of Cisco Firepowers in a HA pair

Do you work for Cisco? That's the first time that phrase has ever been written outside of Cisco/Cisco VARs, FirePowers are never, ever a decent solution. Palo/Fortigates are laughing their way to huge profits on the back of Firepower being so bad that people just want to run ASA code to this day.

Robert T. · 24 Nov 2021 at 18:26

I'll add a few more things to consider - if you run a vSAN stretched cluster on 2 sites with the witness on a third, remember to put a very good UPS in at least two locations...
Plan for plenty of CPU/RAM/storage overhead so you can run the entire company off only one site in case you need to do maintenance on VMware hosts/hardware in the other datacentre.
VCSA in v7 has the native ability to run a vSAN stretched cluster from only one location without the witness, but 6.7 does not.

Edit: with VMware 7 the witness does not really need to be on a separate site; it can be in a separate location on one of the datacentre sites, but I would still think separate and dedicated hardware is a must for it..
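
The overhead point can be checked with a quick sketch like the one below: can either site on its own carry the whole load plus some headroom? All capacity and demand numbers are made up for illustration (only the 75TB is from earlier in the thread), and the 20% headroom is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_ghz: float
    ram_gb: float
    storage_tb: float


def shortfalls(site: Resources, total_demand: Resources, headroom: float = 0.2) -> dict:
    """Return what a single site lacks to carry total_demand plus headroom."""
    lacking = {}
    for field in ("cpu_ghz", "ram_gb", "storage_tb"):
        needed = getattr(total_demand, field) * (1 + headroom)
        have = getattr(site, field)
        if have < needed:
            lacking[field] = needed - have
    return lacking


# Illustrative figures only; replace with real inventory data.
demand = Resources(cpu_ghz=400, ram_gb=6000, storage_tb=75)
dc1 = Resources(cpu_ghz=600, ram_gb=8192, storage_tb=120)
dc2 = Resources(cpu_ghz=450, ram_gb=6144, storage_tb=100)

for name, site in (("DC1", dc1), ("DC2", dc2)):
    gaps = shortfalls(site, demand)
    print(name, "OK" if not gaps else f"short by {gaps}")
```

An empty result for both sites means maintenance or a full site loss can be absorbed by the other one; anything else shows which resource to buy more of.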

ecksmen · 3 Dec 2021 at 13:47

Stretched clusters aren't "Disaster Recovery"; they're Disaster / Downtime Avoidance at best, largely providing cost effectiveness. Whatever you end up with, you need to consider the fault domains and how easily you can recover in the event of a disaster. You need to fully understand your RTO / RPO and routinely test your recovery plans. Site Recovery Manager used to be rather good.

Robert T. · 3 Dec 2021 at 19:11

ecksmen - disaster avoidance is very true. If you plan properly for RAM/CPU capacity and a generator power supply for at least one location, you can avoid loss of data. Over the last half a year I was, multiple times, in a position where only good UPSs saved us from shutting down the entire company. Not IT's fault, we were on the receiving end of problems elsewhere, but nevertheless..

Ev0 · 3 Dec 2021 at 19:52

Throrik said:
Do you work for Cisco? That's the first time that phrase has ever been written outside of Cisco/Cisco VARs, FirePowers are never, ever a decent solution. Palo/Fortigates are laughing their way to huge profits on the back of Firepower being so bad that people just want to run ASA code to this day.

Firepower are the ones with the IPS stuff aren’t they from when they bought Sourcefire?

Throrik · 4 Dec 2021 at 01:35

Ev0 said:
Firepower are the ones with the IPS stuff aren’t they from when they bought Sourcefire?

They technically do IPS/IDS etc yeah, but they're pretty much universally reviled too because they're bad.

ChrisD. · 6 Dec 2021 at 21:56

I work for VMware. vSAN stretched cluster is super easy provided you can satisfy the networking requirements, but as has been said it is not a DR solution.

Ev0 · 6 Dec 2021 at 22:32

Throrik said:
They technically do IPS/IDS etc yeah, but they're pretty much universally reviled too because they're bad.

I just remember from my IPS selling days at a competitor to Sourcefire having demos that showed how easily they could be fooled/defeated

a1ex2001 · 30 Dec 2021 at 09:31

Si. said:
Correct, Pure Active cluster.

Veeam is something I will look into, I've used it in the past but not in this type of scenario.

The thing I need to maintain is the RTO, this needs to be as low as possible.

Has your business actually defined an RTO, or just said "as fast as possible"? If DR is your real driving factor in this expensive project then whoever is in charge of BC should have this defined and agreed with the business, based on proper impact analysis etc. Then you need a risk analysis: what are the disasters you are actually trying to protect yourself from? An active/active cluster is potentially OK if you are concerned about a power outage at a single site (although to deliver like-for-like performance you end up with significant redundant kit at both locations), but it won't help you in a ransomware attack, as you will likely have replicated the issue in real time between the two DCs. Lots to think about before designing and implementing something expensive that might not help!