Silly little connectivity issue for a client - am I missing something dumb?

crinkleshoes · 2 Sep 2018 at 10:42

Hoping I can pick some of the knowledgeable brains on here... just trying to get to the bottom of an issue for one of my clients.

Thanks in advance for the read, I'm going to try and keep it as concise as possible but need to get a little background in.

I am quite confident my assessment of the fault is correct, but they simply won't accept it's the case. I am still pushing my assessment as the correct answer to them, but with their stream of rejections for my claimed root cause, it has got me wondering if I have missed something basic or complex that I should be investigating further.

It's a single site for a relatively well known hotel brand in London.

My company look after their guest internet connectivity and WiFi networks (it should just be guest WiFi, but as usual, other networks are pushed through the same APs).

The load on their (now previous) router kept spiking to levels where it would cause issues and need a reboot far too regularly. We investigated and concluded that the network had grown beyond its original scope and, with no HA or load balancing in place (their choice, not ours), whenever that router was not performing adequately it caused significant slowdowns and connectivity issues on their guest WiFi... it took a while, but I finally got them to agree to a router upgrade with a secondary redundant unit as well.

The issue is wired, not wireless & the networks we do not control are not served by our router in any way - just pass straight through to the AP controller.

Our router only serves internet connectivity to their guest WiFi and a couple of other small networks they've added in since the inception of our service such as room control systems etc. All on unique VLANs with no interference or inter-VLAN communication.

As you'd expect, they have a main admin network for their main office PCs, IPTV, phones... etc, for which they have a couple of 1st line engineers on site with local management and 2nd/3rd line is handled by the brand's central IT team.

We connect through their switching infrastructure & our router is configured as a client (with no routing function, DHCP, DNS etc) on each of their VLANs, so that we can simply connect remotely to help with any ad-hoc support they request.

The installation of the new guest routers went ahead on Wednesday, we completed tests, they completed tests and I stayed onsite for a good while to ensure there was nothing unusual popping up after the change... all appeared to have completed successfully with no issues.

I head back to the hotel where I was staying for the night and, just as I'm about to arrive, I get a phone call from one of their on-site guys stating that they are having connectivity issues on their admin network: their PMS, Exchange and Internet connectivity are not functioning correctly, and it's a sporadic issue popping up on various PCs through the building, but not all at once.

I assure them that our controlled networks are isolated from everything else on their switching, so it's unlikely to be related to our work... this is about 3 hours after the swap completed and tests came back positive, showing no signs of any issues on their admin network - which was one of their main checks before I left.

I run through some initial fault finding over the phone, just some simple connectivity tests.

On a PC with the noted issue, the following tests were performed:

1. Ping to 8.8.8.8 - success
2. Ping to exchange server IP - success
3. Ping to Opera server IP - success
4. Ping to domain controller - success
5. Ping to both DHCP issued DNS servers - success
6. General web browsing - fail
7. Use of Outlook or Opera - fail (both configured using host name, not IP & no entry in hosts file)
8. Ping to Exchange server host name - fail (IP not resolved)
9. Ping to Opera server host name - fail (IP not resolved)
10. Change PC primary DNS server to Google Public DNS & test web browsing - success

During these tests, pings to the DNS, Exchange and Opera servers remained consistently strong, proving stable connectivity. Routes were also confirmed not to pass through our routers.
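For anyone wanting to script the comparison above rather than talk someone through it over the phone, here is a minimal Python sketch. It is not what was run on site: `resolves` and `classify` are hypothetical helper names, and the reachability result is assumed to come from a separate ping to the server's IP.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the OS resolver returns at least one address for hostname."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # The resolver could not turn the name into an address.
        return False


def classify(ip_reachable: bool, name_resolved: bool) -> str:
    """Label a fault from two observations about the same server.

    ip_reachable:  a ping to the server's IP address succeeded
    name_resolved: a lookup of the server's host name returned an address
    """
    if not ip_reachable:
        return "connectivity"
    if not name_resolved:
        return "name resolution"
    return "ok"


# Example (hypothetical host name, IP ping assumed to have succeeded):
#   classify(ip_reachable=True, name_resolved=resolves("exchange.example.internal"))
```

Tests 2 and 8 in the list correspond to `ip_reachable=True` with `name_resolved=False`, which this sketch labels as a name resolution fault.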

To me, that immediately screams... DNS ISSUE! Nothing else much even comes to mind, other than the odd chance that there is some port forwarding for DNS port 53 that's misdirecting the return of the queries... but this seems unlikely & even if it was there, it would be on their infrastructure, not ours.

Their DNS servers are centralised as part of their global SDWAN.

It's unlikely to be a routing issue, as we would not see a stable ping to all of those IPs... it was the correct host responding to the pings.

I then spend the next 30 minutes explaining to the 1st line guy & his manager, what the function of a DNS server is (these guys are supposed to offer network support). I push them to interact with their central IT team & also offer to explain to them and/or show them how to setup some workarounds, like adding the DNS role to their local domain controller (big site with central control, change management would be a lengthy process)... ok, so a temporary workaround would be simply to edit the hosts files on the PCs... a pain, but should resolve the issue. One side note, a little odd for me to see a Windows domain with a DC onsite that's not a local DNS server, but I'm sure they have their reasons.

This issue comes and goes across their network for the next couple of hours, with various phone calls and I keep pushing them to talk to their central IT guys. This doesn't appear to happen and shortly after the issue appears to subside... great... quiet evening.

Next morning rolls around and I get another phone call... the issue has returned... same tests performed, same results as above. Push them to talk to their IT team.

This time they do, the IT team say there are no known issues with the DNS servers, no reports from other hotels & they've run checks and see normal function.

So client look to push back to me, but at this moment the issue has subsided again and all has returned to normal.

Most of the day passes with no issue, then mid-late afternoon, it re-occurs & they're back on the phone denying any DNS issue & asking me to continue investigation. I agree and can find no faults or links to our equipment. The issue then subsides again before I can even re-do the ping tests after checking logs and configuration.

Friday, Saturday and today... similar obscure occurrences that seem to regularly fall within the 6am-3pm UK time window... then outside these hours, no occurrence, or at least none noticed and reported.

The way it's happening is a little weird... but an item of note is that, due to the brand and the distribution of the highest volume of their properties, that 6am-3pm time window happens to fall around the time I would expect to see the DNS servers under their highest load. Then things appear to settle in the quieter hours.

Given the behaviour and tests performed... I am stumped for any explanation other than a DNS issue.

Any ideas of other places to look that I'm not thinking of?

Thanks again for the lengthy read & any tips greatly appreciated.

crinkleshoes · 2 Sep 2018 at 11:45

Got connected by teamviewer today finally so could run the tests myself.

The secondary DNS server their DHCP server is pushing out doesn't ping! They told me it did... grrr.

I've checked & there's no route for the secondary DNS server out their SDWAN... it gets to the gateway and stops.

Doesn't go anywhere near our kit.

Happy days... that's a reasonable explanation for the intermittent DNS issues rolling round the PCs seemingly at random.
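A ping only shows whether a server is reachable, not whether it answers DNS. One way to test each DHCP-issued server directly is to send it a query and bypass the OS resolver. The sketch below is a rough illustration using only the Python standard library: it hand-builds a single A-record query as laid out in RFC 1035. The function names are hypothetical, and the addresses in the usage comment are documentation-range placeholders, not the client's real servers.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, echoed back by a server that answers


def build_query(hostname: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def server_answers(server_ip: str, hostname: str, timeout: float = 2.0) -> bool:
    """Return True if server_ip sends back any DNS response to the query."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeout, no route, or unreachable network.
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    # Same ID as the query, and the QR bit marks it as a response.
    return reply_id == QUERY_ID and bool(flags & 0x8000)


# Example with placeholder addresses:
#   for ip in ("192.0.2.10", "192.0.2.20"):
#       print(ip, server_answers(ip, "example.com"))
```

A server with no route to it, like the secondary here, would simply time out and return False.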

Caged · 2 Sep 2018 at 11:50

So as I understand it, your router has an interface on multiple networks, but isn't set as the gateway for any of the corporate stuff, it's just to allow you to have remote access if required?

Are you bridging multicast traffic at any point, giving clients an invalid IPv6 configuration that they would prefer over a working IPv4?

crinkleshoes · 2 Sep 2018 at 12:04

Thanks for the response, but no.

Just explored a bit more with client... looks like someone updated their DHCP scope to issue the wrong secondary DNS IP.

The primary is the one with highest load and I was right... peak times are 6am-3pm... so PCs went to query secondary & couldn't get a response because it was the wrong IP / no route for it in their system.
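The intermittent pattern can be illustrated with a simplified model of a client trying its DNS servers in order. This is only a sketch and not the exact algorithm of the Windows DNS client; the function name and the placeholder addresses are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence


def first_responding(
    servers: Sequence[str], responds: Callable[[str], bool]
) -> Optional[str]:
    """Return the first server in the list that answers, or None if none do."""
    for server in servers:
        if responds(server):
            return server
    return None


PRIMARY = "192.0.2.10"          # placeholder for the real primary
WRONG_SECONDARY = "192.0.2.20"  # placeholder for the mistyped secondary

servers = [PRIMARY, WRONG_SECONDARY]

# Quiet hours: the primary answers, so the bad secondary is never consulted.
quiet = first_responding(servers, lambda ip: ip == PRIMARY)

# Peak hours: the primary times out and the secondary has no route,
# so the lookup fails outright.
peak = first_responding(servers, lambda ip: False)

print(quiet, peak)
```

In this model the wrong secondary is invisible while the primary keeps up, and only shows itself when the primary misses a query.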

Zefan · 2 Sep 2018 at 12:19

Any chance of being able to get network traffic off the machine during the failures? From the way you've described their team I assume they don't have proper network monitoring. If you can get Wireshark on one of the machines, you could stop the guesswork and see exactly what's going on when the failures occur.

:edit: wrote this post a lot earlier but didn't post until now. I can see you've pinpointed it now but maybe an idea for the future.

crinkleshoes · 2 Sep 2018 at 12:34

Yeah, thanks for the reply.

I've been running fault finding over the phone with them since Wednesday afternoon and it took til today to get me on a teamviewer session.

I told them it was a DNS issue on Wednesday afternoon after 10 minutes on the phone!

Ahhh well, at least I warned them verbally and in writing that if this was found to be their issue and unrelated to my work on Wednesday... they might receive a bill in the post... my Sunday OOH rates aren't cheap either!

crinkleshoes · 2 Sep 2018 at 14:46

Done, scope updated...