Server 2016 Stretch Cluster with HV Shared Storage (FSW Replica Question)

Hi all,

I'm building a new 2016 RDS platform across two sites and wish to use both user profile disks and user environment virtualisation, the latter also being used for desktop clients. There will be a dedicated platform for this and I would like it to be highly available, even in the event of a site outage.

This question is around disaster mitigation on the user profile disks platform, not RDS in general.

My design plan is as follows

Site A (Hyper-V Cluster 1)
================

Server1 - Server 2016 Datacenter
Server2 - Server 2016 Datacenter
ServerFSW - Server 2016 Standard (HV Replica enabled between both HV clusters)

SAN 15k - 100 GB LUN - Logs
SAN 7k - 500 GB LUN - Data

Site B (Hyper-V Cluster 2)
================

Server3 - Server 2016 Datacenter
Server4 - Server 2016 Datacenter

SAN 15k - 100 GB LUN - Logs
SAN 7k - 500 GB LUN - Data

The five servers are all virtual, running on Hyper-V 2012 R2 hosts. Servers 1 and 2 will share the storage on the SANs in Site A; Servers 3 and 4 will share the storage on the SANs in Site B.

I will create a stretched cluster across Servers 1, 2, 3 and 4, with ServerFSW hosting the file share witness.
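
For reference, this is roughly how I plan to script the cluster build in PowerShell. It's only a sketch; the cluster name, IP addresses and witness share path are placeholders, not the final values:

```powershell
# Minimal sketch of the stretched cluster build -- cluster name, IPs and the
# witness share path are placeholders, not the real values.
Import-Module FailoverClusters

# Validate the four nodes first
Test-Cluster -Node Server1, Server2, Server3, Server4

# Create the stretched cluster; one static address per site/subnet
New-Cluster -Name "UPD-Cluster" -Node Server1, Server2, Server3, Server4 `
    -StaticAddress 10.1.0.20, 10.2.0.20

# Point quorum at the file share witness hosted on ServerFSW
Set-ClusterQuorum -Cluster "UPD-Cluster" -FileShareWitness "\\ServerFSW\Witness"
```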

ServerFSW will be enabled for Hyper-V replication between both HV clusters.
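
Enabling the replica for the witness VM would look something like this from the Site A side. Host names, port and auth type are assumptions; if the HV clusters use a Replica Broker, the broker name goes in as the replica server, and the Site B side needs to be configured to accept replication first:

```powershell
# Sketch of enabling Hyper-V Replica for the witness VM -- host/broker names
# are placeholders; run against the primary 2012 R2 host in Site A.
Import-Module Hyper-V

Enable-VMReplication -ComputerName "HV-SiteA-01" -VMName "ServerFSW" `
    -ReplicaServerName "HV-SiteB-Broker" -ReplicaServerPort 80 `
    -AuthenticationType Kerberos

# Kick off the initial copy to Site B
Start-VMInitialReplication -ComputerName "HV-SiteA-01" -VMName "ServerFSW"
```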

My question is

Does having the witness server enabled for HV replication pose any risk? There may be the possibility of moving this server into Azure, but not yet. This is the only design I can come up with which makes the platform highly available between sites and gives users access to their profile disks when roaming.
 
If the fileshare witness is offline in that scenario, then how would you have sufficient votes to bring it online?

Because Hyper-V will automatically detect that Site A is down, it will bring the witness server up at Site B. There may be a small delay, so we may have to bring the cluster back up manually, which isn't a problem. There will be enough votes though, which is the important thing.
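
If the witness doesn't come up at Site B on its own, the unplanned failover is only a couple of commands on the Site B side anyway. A rough sketch, with host and VM names as placeholders:

```powershell
# Unplanned failover of the replica witness VM at Site B -- placeholder names.
Import-Module Hyper-V

# Bring the replica copy online
Start-VMFailover -ComputerName "HV-SiteB-01" -VMName "ServerFSW"
Start-VM         -ComputerName "HV-SiteB-01" -Name "ServerFSW"

# Commit the failover so the Site B copy is now the live one
Complete-VMFailover -ComputerName "HV-SiteB-01" -VMName "ServerFSW"
```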
 
It's been a while, and maybe I'm missing the point, but if the cluster cannot see the other nodes or the witness, then surely it would remain offline until manual intervention?

The witness share server will always be accessible, as it's the only server enabled for HV replication.

If the interruption isn't a catastrophic datacentre failure but a communications failure, then when communications are re-established surely you'll have duplicate machines online, and the whole point of node / disk / witness configurations is to avoid split-brain situations.

There won't be any duplicate machines. Servers 1 and 2 always remain in Site A, and Servers 3 and 4 always remain in Site B. The only server which can be up or down at either site is the witness share server. When comms are back, Hyper-V can move the witness server back to Site A or leave it in Site B. HV replicas can't be online in both HV clusters at the same time.
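
That's the bit that stops duplicates: once comms are back you reverse the replication direction rather than powering on both copies. Roughly this, with placeholder names:

```powershell
# Reverse replication after a failover so Site B holds the live witness and
# Site A holds the replica -- placeholder names again.
Import-Module Hyper-V

Set-VMReplication -ComputerName "HV-SiteB-01" -VMName "ServerFSW" -Reverse
```

Moving it back to Site A later is then just a planned failover in the other direction.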
 
I'm reading this as: if Site A goes down, the FSW will come up at Site B. Not sure if it's best practice though.

Yes that's it.

I can't see there would be any issues, but just thought I would ask the question. You never know :D

If there are any issues I'll just move it into Azure, which will become the holy grail of the "third site".

Microsoft do say best practice for a two-site stretched cluster with shared storage replication is to put the witness server in a third site... or Azure. I'm sure my method would work though.
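
If I do end up moving the witness to Azure, a 2016 cluster can use a Cloud Witness instead of the file share. Something along these lines, with the storage account name and key as placeholders:

```powershell
# Swap the file share witness for an Azure Cloud Witness (2016 feature).
# Storage account name and access key are placeholders.
Import-Module FailoverClusters

Set-ClusterQuorum -Cluster "UPD-Cluster" -CloudWitness `
    -AccountName "mystorageaccount" -AccessKey "<storage-account-key>"
```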
 
^ That's my plan :cool:

The only slight change is that since we have no site 3 (I am toying with Azure, btw) the witness server will seamlessly move between sites dependent on any outages that may have occurred.

Since the witness server is enabled for HV Replica, it exists at both sites but is only active in one site at a time. If one of the sites goes down, it can be set to come up automatically, which gives the cluster quorum. However, I think I will just do it manually as it only takes two clicks of the mouse. It just means the cluster may be down for a few minutes. To be honest, it could be down for a couple of hours as it's only hosting profile disks. If we have a comms issue or a site issue, the last of my worries is profile disks.
 
But you WON'T have three votes, the required amount.

I don't understand why

The two surviving cluster nodes plus the witness equals three votes out of five, which is a majority.

No matter which site goes offline, the surviving side will always have three votes.

I guess the only way the count would drop to two is if HV failed to bring the witness server online at the other site. It's never failed us yet though.
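
For what it's worth, this is a quick way to check where the votes actually sit (2012 R2 / 2016 use dynamic quorum, so the effective weights can differ from the assigned ones). The cluster name is a placeholder:

```powershell
# Check node and witness votes -- cluster name is a placeholder.
Import-Module FailoverClusters

Get-ClusterNode -Cluster "UPD-Cluster" |
    Select-Object Name, State, NodeWeight, DynamicWeight

# Witness vote and current quorum configuration
(Get-Cluster -Name "UPD-Cluster").WitnessDynamicWeight
Get-ClusterQuorum -Cluster "UPD-Cluster"
```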
 
OK, so it took me a couple of days, mainly due to the number of servers and LUNs I had to build. I created the stretched cluster across the four nodes and created a fifth file server hosting the witness share. This fifth server was then replicated using HV Replica to the second HV cluster at our remote site. I tested the failover of that server and all was good.

I brought the stretched cluster online, which worked OK; I could see and access the share I was hosting ready for profile disks.

I simulated a planned site outage. The fifth server came up at the second site as expected, but the cluster remained offline. However, when I manually started it again it worked fine and I could access the hosted share without issue.
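
For anyone following along, the PowerShell equivalent of bringing it back manually is roughly this. Node names are placeholders, and whether -ForceQuorum is actually needed depends on what's been lost; only force it when you're sure the other site really is down:

```powershell
# Force quorum on one surviving Site B node, then start the other normally.
# Placeholder node names; -ForceQuorum overrides quorum, so use with care.
Import-Module FailoverClusters

Start-ClusterNode -Name "Server3" -ForceQuorum
Start-ClusterNode -Name "Server4"
```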

The only thing I had to do was create two DNS entries for the witness server: one for the IP address it uses at Site A, and the second for the IP address it uses at Site B.
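
The two A records were nothing fancy; something along these lines on the DNS server, with a shortish TTL so clients pick up the other address sooner. Zone name, IPs and TTL are all placeholders:

```powershell
# Two A records for the witness, one per site -- zone, IPs and TTL are placeholders.
Import-Module DnsServer

Add-DnsServerResourceRecordA -ZoneName "corp.local" -Name "ServerFSW" `
    -IPv4Address "10.1.0.50" -TimeToLive 00:05:00
Add-DnsServerResourceRecordA -ZoneName "corp.local" -Name "ServerFSW" `
    -IPv4Address "10.2.0.50" -TimeToLive 00:05:00
```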

So I was very happy my method seemed to work.

THEN.... it suddenly dawned on me.

Why the hell don't I just create one file server for profile disks and enable that for HV replication! I think I have over-engineered this solution somewhat lol. It will mean it's offline for the time it takes to boot up, but that's fine.

I guess I 'could' build two file servers (one at Site A, the other at Site B) hosting their own profile disks and create two RDS collections dedicated to each site. Something I shall mull over this weekend. It will certainly help keep the users at each site using their local resources a bit better.
 