Server 2016 Stretch Cluster with HV Shared Storage (FSW Replica Question)

Hi all,

I'm building a new 2016 RDS platform across two sites and want to use both user profile disks and User Environment Virtualization; the latter will also be used for desktop clients. There will be a dedicated platform for this, and I would like it to be highly available even in the event of a site outage.

This question is about disaster mitigation on the user profile disk platform, not RDS in general.

My design plan is as follows:

Site A (Hyper-V Cluster 1)
================

Server1 - Server 2016 DataCenter
Server2 - Server 2016 DataCenter
ServerFSW - Server 2016 Standard (HV Replica enabled between both HV clusters)

SAN 15k - 100 GB LUN - Logs
SAN 7k - 500 GB LUN - Data

Site B (Hyper-V Cluster 2)
================

Server3 - Server 2016 DataCenter
Server4 - Server 2016 DataCenter

SAN 15k - 100 GB LUN - Logs
SAN 7k - 500 GB LUN - Data

The five servers are all virtual, running on Hyper-V 2012 R2. Server1 and Server2 will share the storage on the SANs in Site A; Server3 and Server4 will share the storage on the SANs in Site B.

I will create a stretched cluster across Server1, Server2, Server3 and Server4, with ServerFSW as the file share witness.

ServerFSW will be enabled for Hyper-V replication between the two HV clusters.
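
For reference, a rough sketch of how the guest cluster and witness could be scripted from inside one of the four nodes (all names, paths and addresses here are made up for illustration):

Code:
# Create the stretched guest cluster; one static address per site
# subnet, since this is a multi-subnet cluster.
New-Cluster -Name 'UPD-CLU' `
    -Node 'Server1','Server2','Server3','Server4' `
    -StaticAddress 10.0.1.50, 10.0.2.50 `
    -NoStorage

# Point the quorum at a share hosted on the ServerFSW VM.
Set-ClusterQuorum -Cluster 'UPD-CLU' -FileShareWitness '\\ServerFSW\Witness'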

My question is:

Does having the witness server enabled for HV replication pose any risk? There may be the possibility of moving this server into Azure, but not yet. This is the only design I can come up with that makes the platform highly available between sites while giving users access to their profile disks when roaming.
 
If the file share witness is offline in that scenario, then how would you have sufficient votes to bring it online?
 
If the file share witness is offline in that scenario, then how would you have sufficient votes to bring it online?

Because Hyper-V will automatically detect that Site A is down and bring the witness server up at Site B. There may be a small delay, so we may have to manually bring the cluster back up, which isn't a problem. There will be enough votes though, which is the important thing.
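
If it doesn't come up on its own, failing the witness VM over by hand is only a couple of commands on the replica host anyway (VM name illustrative):

Code:
# On a Hyper-V host in Site B (the replica side), once Site A is confirmed down:
Start-VMFailover -VMName 'ServerFSW'      # fail over to the latest recovery point
Start-VM -VMName 'ServerFSW'
Complete-VMFailover -VMName 'ServerFSW'   # commit and discard the other recovery points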
 
It's been a while, and maybe I'm missing the point, but if the cluster cannot see the other nodes or the witness, then surely it would remain offline until manual intervention?

If the interruption isn't a catastrophic datacentre failure but a communications failure, then when communications are re-established you'll have duplicate machines online, surely? The whole point of node/disk/witness configurations is to avoid split-brain situations.
 
It's been a while, and maybe I'm missing the point, but if the cluster cannot see the other nodes or the witness, then surely it would remain offline until manual intervention?

The witness share server will always be accessible, as it's the only server enabled for HV replication.

If the interruption isn't a catastrophic datacentre failure but a communications failure, then when communications are re-established you'll have duplicate machines online, surely? The whole point of node/disk/witness configurations is to avoid split-brain situations.

There won't be any duplicate machines. Server1 and Server2 always remain in Site A; Server3 and Server4 always remain in Site B. The only server that can be up or down at either site is the witness share server. When comms are back, Hyper-V can move the witness server back to Site A or leave it in Site B. HV replicas can't be online in both HV clusters at the same time.
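
Moving it back is just a reversal of the replication direction followed by a planned failover, roughly like this (illustrative; each line runs on the host noted in the comment):

Code:
# On the Site B host now running the witness, once comms are back:
Set-VMReplication -VMName 'ServerFSW' -Reverse    # Site B is now primary, Site A the replica

# Later, a planned failover back to Site A:
Stop-VM -VMName 'ServerFSW'                       # Site B (primary) host
Start-VMFailover -VMName 'ServerFSW' -Prepare     # Site B (primary) host
Start-VMFailover -VMName 'ServerFSW'              # Site A (replica) host
Set-VMReplication -VMName 'ServerFSW' -Reverse    # Site A host
Start-VM -VMName 'ServerFSW'                      # Site A host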
 
I'm reading this as: if Site A goes down, the FSW will come up at Site B. Not sure if it's best practice though.

Yes that's it.

I can't see that there would be any issues, but I just thought I would ask the question. You never know :D

If there are any issues I'll just move it into Azure, which will become the holy grail of the "third site".

Microsoft do say best practice for a two-site stretched cluster with shared storage replication is to put the witness server in a third site... or Azure. I'm sure my method would work though.
 
So the point I am making is perhaps a little better explained here:

https://blogs.msdn.microsoft.com/clustering/2014/11/13/introducing-cloud-witness/

[Diagram from the blog post: two datacenters with two nodes each, plus a quorum witness]


In this example configuration, there are 2 nodes in 2 datacenters (referred as Sites). Note, it is possible for cluster to span more than 2 datacenters as well as each datacenter can have many more than 2 nodes. A typical cluster quorum configuration in this setup (automatic failover SLA) would give each node a vote. And then we need one extra vote of the quorum witness to allow cluster to keep running even if either one of the datacenter experiences power outage. Math is simple: There are 5 total votes, and you need 3 votes for the cluster to keep running.
In case of power outage in one datacenter, to give equal opportunity for the cluster in other datacenter to keep running, it is recommended to host the quorum witness in a location other than the two datacenters. This typically means requiring a 3rd separate datacenter (site) to host File Server that is backing the File Share which is used as the quorum witness (File Share Witness).
We received feedback from our customers, that most don’t have a 3rd separate datacenter that will host File Server backing the File Share Witness. This means customers host the File Server in one of the two datacenters, by extension making that datacenter the primary datacenter. In a scenario where there is power outage in the primary datacenter, the cluster would go down as the other datacenter would only have 2 votes which is below the quorum majority of 3 votes. For the customers that have 3rd separate datacenter to host the File Server, it is an overhead to maintain the highly available File Server backing the File Share Witness. Hosting VMs in public cloud that have the File Server for File Share Witness running in Guest OS is a significant overhead in terms of both setup & maintenance.
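
For completeness, the Cloud Witness that post introduces is a one-liner on 2016 once you have an Azure storage account (account name and key below are placeholders):

Code:
Set-ClusterQuorum -CloudWitness -AccountName '<StorageAccountName>' -AccessKey '<StorageAccountAccessKey>'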
 
^ That's my plan :cool:

The only slight change is that since we have no site 3 (I am toying with Azure btw), the witness server will seamlessly move between sites dependent on any outages that have occurred.

Since the witness server is enabled for HV Replica, it exists at both sites but is only active in one site at a time. If one of the sites goes down, it can be set to come up automatically, which gives the cluster quorum. However, I think I will just do it manually as it only takes two clicks of the mouse. It just means the cluster may be down for a few minutes. To be honest, it can be down for a couple of hours as it's only hosting profile disks. If we have a comms issue or a site issue, the last of my worries is profile disks.
 
^ That's my plan :cool:

The only slight change is that since we have no site 3 (I am toying with Azure btw), the witness server will seamlessly move between sites dependent on any outages that have occurred.

But you WON'T have three votes, the required amount.
 
But you WON'T have three votes, the required amount.

I don't understand why.

Two clustered servers plus the witness server equals three.

No matter which site goes offline, the votes will always equal three.

I guess the only way the vote would be two is if HV failed to bring the witness server online at the other site. It's never failed us yet though.
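
An easy way to sanity-check the votes is to ask the cluster itself; with dynamic quorum on 2012 R2/2016 the DynamicWeight column shows what each vote currently counts for:

Code:
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
Get-Cluster | Format-List DynamicQuorum, WitnessDynamicWeight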
 
OK, so it took me a couple of days, mainly due to the number of servers and LUNs I had to build. I created the stretched cluster across the four nodes and created a fifth file server hosting the witness share. This fifth server was then replicated using HV Replica to the second HV cluster at our remote site. I tested the failover of that server and all was good.

I brought the stretched cluster online, which worked OK; I could see and access the share I was hosting ready for profile disks.

I simulated a planned site outage. The fifth server came up at the second site as expected, but the cluster remained offline. However, when I manually started it again it worked fine and I could access the hosted share without issue.
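
If it ever refuses to come back on its own, the manual start boils down to forcing quorum on a surviving node (node name illustrative):

Code:
# On one of the Site B nodes:
Start-ClusterNode -Name 'Server3' -FixQuorum
# Old-school equivalent: net start clussvc /forcequorum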

The only thing I had to do was create two DNS entries for the witness server: one for the IP address it uses at Site A, and one for the IP address it uses at Site B.
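
Scripted, that's something like the below on the DNS server (zone name and addresses made up); a short TTL helps clients pick up the right record after a failover:

Code:
Add-DnsServerResourceRecordA -ZoneName 'corp.local' -Name 'ServerFSW' `
    -IPv4Address 10.0.1.10 -TimeToLive (New-TimeSpan -Minutes 5)
Add-DnsServerResourceRecordA -ZoneName 'corp.local' -Name 'ServerFSW' `
    -IPv4Address 10.0.2.10 -TimeToLive (New-TimeSpan -Minutes 5)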

So I was very happy that my method seemed to work.

THEN... it suddenly dawned on me.

Why the hell don't I just create one file server for profile disks and enable that for HV replication?! I think I have over-engineered this solution somewhat lol. It will mean it's offline for the time it takes to boot up, but that's fine.
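
If I go that route, enabling the replica on the one file server is straightforward (names are placeholders; the replica target would be the Hyper-V Replica Broker of the Site B cluster):

Code:
Enable-VMReplication -VMName 'FS-Profiles' `
    -ReplicaServerName 'HVClusterB-Broker.corp.local' `
    -ReplicaServerPort 80 `
    -AuthenticationType Kerberos `
    -ReplicationFrequencySec 300
Start-VMInitialReplication -VMName 'FS-Profiles'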

I guess I 'could' build two file servers (one at Site A, the other at Site B) hosting their own profile disks, and create two RDS collections dedicated to each site. Something I shall mull over this weekend. It would certainly help keep the users at each site using their local resources a bit better.
 