Windows DHCP Redundancy

iaind · 9 Mar 2010 at 09:23

Hi All,

We're running Windows (soon to be 2008) AD/DNS/DHCP servers, but the DHCP side of things is proving to be a weak link.

I've read conflicting opinions of how best to implement redundant DHCP servers, but nothing definitive...so what's best practice?

We've got 2 scopes, although only one needs to be redundant - the other is for phones on a different VLAN and an IP helper is used, so it would be overly complicated to make this work, and not really necessary as phones never get rebooted normally. The scope is 172.16.1.1 - 172.16.1.149 with 1.1 and 1.100 to 1.120 excluded from distribution (no idea why).

Should I just create a scope on another server of 1.150-1.254? Or should I set the scope on both servers to 1.1 - 1.254 and set opposing exclusions for half the subnet?

n3vrmind · 9 Mar 2010 at 10:03

opposing exclusions work perfectly and are imo more straightforward to implement than, say, clustering. so long as you have a decent enough lease, like 7 days, it will usually cover you for most eventualities

so far as backing up goes, i know it's not too much trouble to restore a system state, but i still prefer just taking a dump of the dhcp database with netsh, since it's plain text and it takes minutes to install dhcp on another machine, give it the IP address and import
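
A minimal sketch of the dump-and-replay approach described above, in Python purely for illustration. It assumes `netsh dhcp server dump` prints the configuration as a plain-text netsh script and that `netsh exec <file>` replays it on the replacement machine. The file path and the `runner` parameter are made up for the example, and the dump may need the server name or IP edited before it is replayed.

```python
import subprocess


def dump_command():
    # Prints the DHCP server configuration as a plain-text netsh script on stdout.
    return ["netsh", "dhcp", "server", "dump"]


def replay_command(dump_path):
    # Replays a previously saved dump file on the replacement server.
    return ["netsh", "exec", dump_path]


def save_dump(dump_path, runner=subprocess.run):
    """Run the dump and write its output to dump_path.

    runner is injectable (hypothetical parameter) so the function can be
    exercised on a machine that has no DHCP server role installed.
    """
    result = runner(dump_command(), capture_output=True, text=True, check=True)
    with open(dump_path, "w") as handle:
        handle.write(result.stdout)
    return dump_path
```

On the live server this would be called as something like `save_dump(r"C:\backup\dhcp-dump.txt")` from a scheduled task. On the rebuilt machine, `subprocess.run(replay_command(path), check=True)` would load the scopes back in.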

ethos · 9 Mar 2010 at 11:01

I have some notes on DHCP redundancy (not written by myself):

It will be tempting to build each DHCP server with half the scope but resist this. DHCP works by accepting whatever server responds first to the server location broadcast. If each has half of the DHCP range and it tries to renew with the wrong server, the workstation will get a NAK message. Unfortunately, the workstation may not always try again when it gets the NAK and will simply drop off the network. I like to call this NAK poisoning (though I am sure this is probably not my term)

To get around this, have both servers host the whole range and have reciprocal exclusions so if the renewal is out of range, the server will offer a substitute address rather than a NAK.

Example on a 192.168.1.0/24 network
Server 1: 192.168.1.0/24 range excludes 192.168.1.129-254
Server 2: 192.168.1.0/24 range excludes 192.168.1.1-128

Note that this is 50/50. The usual split is 80/20, where the 80 side can handle the entire required number of hosts on its own. Just remember, if you're going to split the servers across a routed boundary, to include an IP helper-address command (on Cisco) to allow foreign solicitations. The local one will almost always respond first unless it is down.
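
The reciprocal exclusions can be worked out mechanically for any subnet and split ratio. The sketch below is only an illustration: the function name and the `first_share` parameter are invented here, and it splits the usable host addresses, so an even split of a /24 lands at .127/.128 rather than the .128/.129 boundary used in the notes above. Either boundary works as long as the two exclusion ranges together cover the whole range without overlapping.

```python
import ipaddress


def reciprocal_exclusions(network, first_share=0.5):
    """Return (server1_exclusion, server2_exclusion) as (first, last) pairs.

    Both servers host the whole usable range and each one excludes the
    part the other hands out. first_share is the fraction of addresses
    that server 1 serves (0.5 for 50/50, 0.8 for 80/20).
    """
    net = ipaddress.ip_network(network)
    hosts = list(net.hosts())  # usable addresses, e.g. .1 to .254 in a /24
    split = round(len(hosts) * first_share)
    if not 0 < split < len(hosts):
        raise ValueError("first_share must leave addresses on both servers")
    server1_serves = hosts[:split]
    server2_serves = hosts[split:]
    # Server 1 excludes what server 2 serves, and the other way round.
    return (
        (server2_serves[0], server2_serves[-1]),
        (server1_serves[0], server1_serves[-1]),
    )


if __name__ == "__main__":
    for share in (0.5, 0.8):
        s1, s2 = reciprocal_exclusions("192.168.1.0/24", share)
        print(f"share={share}: server 1 excludes {s1[0]}-{s1[1]}, "
              f"server 2 excludes {s2[0]}-{s2[1]}")
```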

bigredshark · 9 Mar 2010 at 13:31

My favorite method would be clustering, my second a VM with a copy that can be started on a different host (automatically or manually).

I'm not much of a fan of the methods discussed, as it's a bit of a bodge really, and it gets worse when machines require reservations too...

iaind · 9 Mar 2010 at 13:48

bigredshark said:
My favorite method would be clustering, my second a VM with a copy that can be started on a different host (automatically or manually).

I'm not much of a fan of the methods discussed, as it's a bit of a bodge really, and it gets worse when machines require reservations too...

It's currently a VM with HA enabled, but it's not the host that's failing. Last night it ran out of nonpaged memory, I think due to our AV software. The machine was responding to pings and was technically up, just useless!

iaind · 9 Mar 2010 at 13:50

While I'm here, the other weak link is with our Wyse thin clients - they download a config file by FTP from that DC, so when it fails, they can't get a config.

I was going to set up an FTP server on a second host and use NLB to ensure one of them is available - can I create an NLB cluster with one 2008 machine and one 2003 machine?

The intention is to move the 2003 DC to 2008, and ultimately replicate the FTProot with DFS-R, but one step at a time!

n3vrmind · 9 Mar 2010 at 14:36

bigredshark said:
My favorite method would be clustering, my second a VM with a copy that can be started on a different host (automatically or manually).

I'm not much of a fan of the methods discussed, as it's a bit of a bodge really, and it gets worse when machines require reservations too...

clustering - why? shared storage requirement etc, it's unnecessary for the requirement
second copy on a VM - if the VM is a copy of the first machine, its machine AD account would become outdated. if it's a different machine with the same scopes and switched off, you would need manual intervention to bring it online and authorise the scopes.

there's absolutely no reason you can't have two up, both authorised, dishing out the same scopes. if you have opposing exclusions they work well together. I have this setup running at a large client with literally hundreds of scopes. It's seamless: you don't have a situation where a dhcp server stopped overnight and client PCs aren't able to get an IP the following morning.

paradigm · 9 Mar 2010 at 15:56

We just have a 2nd server configured with DHCP, but disabled.

We use a script that copies the configuration daily using netsh. In the event of the DHCP server failing, the script imports the saved config data to the backup server and then starts the service. All automagically.
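
The decision part of such a script could look roughly like the sketch below. This is not the poster's script: the function names, the `threshold` of three failed checks, and the `primary_alive` / `activate_standby` callables are all assumptions made for illustration. The probe should test DHCP itself rather than ping, since a server can answer pings while its DHCP service is useless, as happened earlier in this thread.

```python
def should_fail_over(probe_results, threshold=3):
    """True once the most recent `threshold` probes of the primary all failed."""
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)


def watchdog_step(history, primary_alive, activate_standby, threshold=3):
    """Record one probe of the primary and activate the standby if needed.

    primary_alive:    hypothetical callable returning True when the primary
                      DHCP service answers.
    activate_standby: hypothetical callable that imports the saved
                      configuration and starts DHCP on the backup server.
    Returns True when failover was triggered, so the caller can stop polling.
    """
    history.append(bool(primary_alive()))
    if should_fail_over(history, threshold):
        activate_standby()
        return True
    return False
```

Requiring several consecutive failures before acting is a design choice to avoid bringing up a second active server because of one dropped probe. The caller would run `watchdog_step` on a timer and stop once it returns True.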

This way we maintain all the reservations, scope options, etc etc.

bigredshark · 9 Mar 2010 at 16:25

n3vrmind said:
clustering - why? shared storage requirement etc, it's unnecessary for the requirement
second copy on a VM - if the VM is a copy of the first machine, its machine AD account would become outdated. if it's a different machine with the same scopes and switched off, you would need manual intervention to bring it online and authorise the scopes.

there's absolutely no reason you can't have two up, both authorised, dishing out the same scopes. if you have opposing exclusions they work well together. I have this setup running at a large client with literally hundreds of scopes. It's seamless: you don't have a situation where a dhcp server stopped overnight and client PCs aren't able to get an IP the following morning.

Why clustering - because the question was the best method for DHCP resiliency, and that method is pretty flawless, problem free and manufacturer recommended. No, it's not cheap, it can't be set up by an idiot and it does require shared storage, but if you're actually considering DHCP resiliency at all and don't have most of the requirements met already for the likes of resilient Exchange, then I'd question your priority list.

And yes, it is completely possible to do it with opposing scopes but it's ugly, a fudged way round doing it properly and generally bad design. There is a long list of things in this and other areas which are completely possible but also completely the wrong way to go about things...

edscdk · 11 Mar 2010 at 15:35

surely clustering is going to make it more complicated and likely to go wrong, why not just stick half the scope on one server and the other half on another.. (I think the MS way was 40/60 or 30/70, I never understood why, I just use 50/50)

Sp00n · 11 Mar 2010 at 15:45

I have half of our scope on one and half on the other, both servers have the same reservations.

It's a pain that they can't automatically replicate to each other, so I have to update it manually, but it doesn't change that much so I'm not that fussed.

Little_Crow · 11 Mar 2010 at 16:59

We use this, works great and saves an awful lot of hassle, and it's free.

It's extremely straightforward, all the scripts to backup/transfer your live DHCP DB are included, and it will e-mail you if the backup has had to kick in, and when the Live has come back.

bigredshark · 11 Mar 2010 at 17:11

edscdk said:
surely clustering is going to make it more complicated and likely to go wrong, why not just stick half the scope on one server and the other half on another.. (I think the MS way was 40/60 or 30/70, I never understood why, I just use 50/50)

Depends how well you set it up, correctly configured it pretty much never breaks. It's about the simplest cluster role going...

Yeah, it's more complex. Simpler still is not having DHCP resiliency at all, which is an option for most people who don't fancy clustering; you can always enable DHCP on your router/firewall in an emergency. Anywhere you need that service always available, clustering is the answer.

iaind · 11 Mar 2010 at 20:13

I went with splitting the scope in the end.

At the end of the day, we're a relatively small business (100 users) with a specific problem to solve and this was the obvious solution. I know it's not technically the best, but it was a quick and effective solution to the problem, which is what my manager was jumping on me for - after the second time Sophos killed the DC on a Tuesday morning, he wanted a solution fast. Obviously I'm talking to Sophos about it too, but regardless of the cause, that DC was an obvious weak point.

I daren't mention which firewall I ended up buying in case BRS shouts at me

n3vrmind · 11 Mar 2010 at 20:31

I presume Sophos is configured to not scan your NTDS databases?

iaind · 11 Mar 2010 at 20:37

It is indeed, exceptions set up as per MS guidelines.

For the past 2 weeks, at about 5:30 on a Tuesday morning, it has run out of nonpaged memory - I suspect it's something to do with the update process

matthab · 14 Mar 2010 at 04:56

The way I've always done it is to have a second server, either using the 80/20 split as described above or, if server 1 is your main DHCP server, having server 2 set up with the same scope but not activating the scope or starting the DHCP service until required.