Strange Network issue with one of our servers.

TL;DR: I think there is some strange backbone routing issue that started on 19th October, and I have no idea who to report it to.


We are having a weird network related issue with one of our servers that's hosted in a Manchester data centre.

I will explain a little about the background of how we use and connect to that server to show that the issue is a recent one.

On that server is a web service connected to a database. We have 50 of our clients calling in every 30 seconds, and each time one calls, the web service records some information. This has been the case for the last few years without any major issues.

However, since around 10.40am on 19th October 2023 about 40% of them are getting intermittent connection errors. The other 60% are working as they have been for years.

No recent changes have been made to client or webservice back-end code or database.

Below are the error counts for two clients, broken down by day.

Client 1 (Client15014)
Date        Errors
16/10/2023  0
17/10/2023  3
18/10/2023  0
19/10/2023  0
20/10/2023  1
21/10/2023  0
22/10/2023  0
23/10/2023  0

Client 2 (Client3553)
Date        Errors
16/10/2023  0
17/10/2023  3
18/10/2023  0
19/10/2023  127
20/10/2023  152
21/10/2023  238
22/10/2023  125
23/10/2023  114

Around 40% of the 50 clients dialling into that server are very similar to the Client 2 stats above, with a sharp increase in the number of errors from 19th October.
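
To put those counts in perspective, here is a rough Python sketch that turns the daily error counts for Client3553 into a failure percentage. It assumes exactly one connection every 30 seconds (2,880 per day). That figure is an assumption taken from the "every 30 seconds" description above; if a client opens more connections than that, the real percentages will be lower.

```python
# Daily error counts for Client3553, copied from the figures above.
errors_by_day = {
    "16/10/2023": 0,
    "17/10/2023": 3,
    "18/10/2023": 0,
    "19/10/2023": 127,
    "20/10/2023": 152,
    "21/10/2023": 238,
    "22/10/2023": 125,
    "23/10/2023": 114,
}

# Assumption: one call every 30 seconds, all day.
CALLS_PER_DAY = 24 * 60 * 60 // 30  # 2880


def failure_rate(errors: int, calls: int = CALLS_PER_DAY) -> float:
    """Return errors as a percentage of calls."""
    return 100.0 * errors / calls


if __name__ == "__main__":
    for day, errors in errors_by_day.items():
        print(f"{day}  {errors:4d} errors  ~{failure_rate(errors):.1f}% of calls")
```

On that assumption even the worst day (21/10) works out at roughly 8% of calls failing, which fits "intermittent" rather than "blocked".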

The errors the clients are reporting are very generic:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond ourIPAddress:443

The underlying connection was closed: An unexpected error occurred on a receive.

"Unable to connect to the remote server"

My assumptions (and yes, I understand assumptions can be dangerous!):

Given that it’s only 40% of our clients that are getting this issue, (and the same clients), we’re assuming that it’s not an issue our end. If it was our end we would expect to see the errors spread across all clients.

Given that even the 40% of clients that are erroring sometimes have short periods where the connection seems stable, we’re also assuming it isn’t client firewalls or some other blocking at their end. If it was a firewall or blocking, then it would either connect or not? (Yes, we've had lots of times in the past where we've lost contact due to IT people changing firewall rules!)

Anyone any ideas? Are there any backbone network engineers on this forum? Does anyone know who to even send this kind of stuff to? (We've already emailed our own hosting company, but they're giving the typical "we can't see any issues from here!" response.)

Thanks.
R.
 
What are the clients, are they Windows devices or similar?

See if a client that's working fine and one that's having issues can do the following:

- Ping the server.
- Traceroute to the server.
- Telnet to port 443 on the server.

Do you know what internet connectivity is being used by the clients? I've seen a few times where all clients from a certain ISP are getting dropped by an access list on an intermediary router, usually at the handoff into the provider that's hosting the server.
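
Telnet only tests one connection by hand. As a rough sketch (not something anyone in this thread has run), the same check can be repeated automatically in Python so that brief failures get caught with a timestamp. The function names, the five-second timeout and the one-second interval are arbitrary assumptions, and `ourserver.com` is just the placeholder name used in this thread.

```python
import socket
import time


def tcp_probe(host, port=443, timeout=5.0):
    """Attempt one TCP connection; return (ok, seconds_taken, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, time.monotonic() - start, str(exc)


def watch(host, port=443, attempts=3600, interval=1.0):
    """Probe repeatedly, print one timestamped line per attempt, return the failure count."""
    failures = 0
    for _ in range(attempts):
        ok, took, err = tcp_probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "OK  " if ok else "FAIL"
        print(f"{stamp}  {status}  {took * 1000:.0f} ms  {err}")
        if not ok:
            failures += 1
        time.sleep(interval)
    return failures


# Example (placeholder host name):
# watch("ourserver.com")
```

Running this on a working client and an affected client at the same time should show whether the TCP handshake itself is failing, or whether the connection only drops later during the HTTPS exchange.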
 
Thanks for the reply. The clients are all Windows machines at various versions, from Windows Server 2008 R2 Standard to Windows Server 2022 Datacenter.

I can also run the same software as our clients, and I'm seeing the same intermittent errors. However, one of my colleagues also runs it, and they are getting the working-perfectly end of the stick.

I will do the ping, tracert and telnet and get my colleague to do the same, then post the results here.

Unfortunately we have no idea what internet connectivity is being used, unless we ask all the various IT departments of the various clients we have (and IT departments usually don't want to be bothered by software vendors). Although if we thought that would help us solve the issue, then we could do that.

I will post back shortly with the ping etc..

R.
 
Right from my machine, which is displaying the intermittent issues,

PING - I let it run for a good 5 mins, and the below is fairly representative.
Code:
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=21ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=21ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=21ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=20ms TTL=120
Reply from 89.238.xxx.xxx: bytes=32 time=23ms TTL=120

Tracert - I did 12 of them, and they were all pretty similar.
Code:
Tracing route to ourserver.com [89.238.xxx.xxx]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.2
  2    18 ms    25 ms    21 ms  84.18.229.130
  3     *        *        *     Request timed out.
  4     7 ms     7 ms     6 ms  91.193.8.162
  5     *        *        *     Request timed out.
  6     9 ms     7 ms     7 ms  ae1.3101.ear1.London1.level3.net [4.69.141.62]
  7     8 ms     7 ms     7 ms  FranceTelecom-level3-100G.London1.Level3.net [4.68.73.154]
  8     *        *        *     Request timed out.
  9    14 ms    14 ms    14 ms  ethernet-29-1.pni1.ams2.nl.m247.ro [185.206.226.42]
 10     *        *       16 ms  be-4-2992.bb1n.ams2.nl.m247.ro [37.120.220.104]
 11    14 ms     *        *     hundredgige0-0-3-0.bb1n.lon2.uk.m247.ro [146.70.1.229]
 12    15 ms    14 ms    14 ms  hundredgige0-0-2-0.bb1n.lon1.uk.m247.ro [146.70.1.124]
 13    17 ms    14 ms    14 ms  te-7-1.bb1.lon1.uk.m247.ro [193.27.15.157]
 14    21 ms    21 ms    21 ms  te-0-0-0-20.bb1n.lon1.uk.m247.ro [193.27.15.238]
 15    21 ms     *       21 ms  hundredgige0-0-2-3.bb1n.lon2.uk.m247.ro [146.70.1.125]
 16    23 ms    21 ms    21 ms  be-101.bb1n.man4.uk.m247.ro [212.103.51.182]
 17    21 ms    20 ms    20 ms  vlan3101.core-dc1-agg1.man4.uk.m247.ro [37.120.220.3]
 18    26 ms    21 ms    20 ms  te-1-49.xs5a.man4.uk.m247.ro [77.243.185.87]
 19    20 ms    20 ms    20 ms  ourserver.com [89.238.xxx.xxx]

I can telnet straight into port 443.. tried 5 times and got in straight away each time..

I will post my colleagues results shortly..

R.
 
Right, my colleague's results..

Ping -
Code:
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118
Reply from 89.238.xxx.xxx: bytes=32 time=12ms TTL=118

Tracert -
Code:
Tracing route to ourserver.com [89.238.xxx.xxx]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  DSL-AC68U-F370 [192.168.62.1]
  2     5 ms     5 ms     5 ms  195.166.130.255
  3     6 ms     6 ms     6 ms  84.93.253.123
  4     6 ms     6 ms     6 ms  core1-BE1.southbank.ukcore.bt.net [195.99.125.130]
  5     6 ms     6 ms     6 ms  peer7-et-3-1-2.telehouse.ukcore.bt.net [109.159.252.230]
  6     6 ms     6 ms     6 ms  109.159.253.91
  7     *        *        *     Request timed out.
  8    12 ms    12 ms    12 ms  be-101.bb1n.man4.uk.m247.ro [212.103.51.182]
  9     *        *        *     Request timed out.
 10    13 ms    12 ms    12 ms  te-1-49.xs5a.man4.uk.m247.ro [77.243.185.87]
 11    12 ms    12 ms    12 ms  ourserver.com [89.238.xxx.xxx]

He was able to telnet into 443 as well.

r.
 
Was it working, or sodding about at the time you did those tests? It'd be good to see results from when it's working and when it isn't.

The connection issues are very intermittent, and last seconds only, i.e. the software on my machine is constantly sending messages (it's quite chatty), and only a few of the connections will fail. So there isn't really a time when it can't connect and a time when it will, if you see what I mean.

Thanks again for any and all help.

R.
 
Right, ping has been running for almost 3 hours solid, and not one single failure. Every tracert I've done in that time (about 15) is pretty much the same as the earlier one I posted.

In that time I've had quite a number of errors from our software, and also quite a few periods where it has worked perfectly for a while.
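
Since ping never fails but the software does, it may help to look at when the application errors cluster rather than just how many there are. A minimal Python sketch, assuming the failure timestamps can be pulled out of the client's error log; the function name and the 60-second gap are illustrative choices, not part of the software discussed here.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Burst = Tuple[datetime, datetime, int]  # (first failure, last failure, count)


def failure_bursts(
    failures: Iterable[datetime], max_gap: timedelta = timedelta(seconds=60)
) -> List[Burst]:
    """Group failure timestamps into bursts separated by more than max_gap."""
    bursts: List[Burst] = []
    for ts in sorted(failures):
        if bursts and ts - bursts[-1][1] <= max_gap:
            # Close enough to the previous failure: extend the current burst.
            first, _, count = bursts[-1]
            bursts[-1] = (first, ts, count + 1)
        else:
            # Too long since the last failure: start a new burst.
            bursts.append((ts, ts, 1))
    return bursts


if __name__ == "__main__":
    # Made-up timestamps purely to show the output shape, not real log data.
    t0 = datetime(2023, 10, 23, 14, 0, 0)
    sample = [t0, t0 + timedelta(seconds=30), t0 + timedelta(minutes=20)]
    for first, last, count in failure_bursts(sample):
        print(f"{first:%H:%M:%S} to {last:%H:%M:%S}: {count} failure(s)")
```

If the bursts on different affected clients line up in time, that would point at something shared on the network path rather than at any one client.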

So I'm no further forward really.. :(

We still have client systems that are intermittently erroring.. just like my own machine is.. and other client systems calling the exact same web services that have not had a single error since 19th.

I'm starting to think this is all a bit above my pay grade. :)
 
M247 are a garbage-tier provider. Your traces to Manchester are going via Amsterdam.

They also host tons of VPN services (https://m247.com/eu/services/host/vpn-servers/) so their IP ranges are treated as dirt.

If the only thing hitting your server is HTTPS then do A/B testing with it behind a Cloudflare Tunnel to avoid the bin fire that is M247's network and I bet the issues go away.
 
Thanks for the reply,

No arguments from me.. but their service has been fine for years.. with our services working without issues for a long time.

We are considering moving our servers to a new host, but in the meantime it would be good to get to the bottom of why 60% of our clients are still working perfectly yet the other 40% are intermittently erroring.

Will investigate the Cloudflare tunnel, however it feels like getting around an unknown issue rather than finding and sorting it.

R.
 
without reading deep into this and probably not applicable to you

If there was a load balancing server on the backend in front of two workers, that routes traffic according to, let's say, client IP, then for some clients it always goes to worker1 and for others to worker2 (broken).
A client hits the failed worker2 for a while, then the balancer temporarily blacklists it and all traffic goes to worker1 and works.

long shot
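
For anyone unfamiliar with the pattern, the idea above can be sketched in a few lines of Python. This is a generic illustration of IP-hash load balancing with a temporary blacklist, not a description of the setup in this thread; the worker names, the CRC32 hash and the example addresses are all assumptions.

```python
import zlib
from typing import AbstractSet

WORKERS = ["worker1", "worker2"]  # hypothetical backend names


def pick_worker(client_ip: str, blacklisted: AbstractSet[str] = frozenset()) -> str:
    """Choose a worker by hashing the client IP, skipping blacklisted workers."""
    available = [w for w in WORKERS if w not in blacklisted]
    if not available:
        raise RuntimeError("no workers available")
    # The same IP always hashes to the same index, so a client sticks to one worker.
    index = zlib.crc32(client_ip.encode("ascii")) % len(available)
    return available[index]


if __name__ == "__main__":
    # Documentation-range addresses (RFC 5737), not real clients.
    for ip in ("192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.42"):
        normal = pick_worker(ip)
        degraded = pick_worker(ip, blacklisted={"worker2"})
        print(f"{ip}: normally {normal}, with worker2 blacklisted {degraded}")
```

With a scheme like this, a fixed subset of clients would see errors while the rest never do, and the affected ones would recover whenever the broken worker is blacklisted.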
 
Yeah unfortunately not applicable in this instance, all the clients are being serviced by the same web service and the same back end server.
 
Some of the connections will be routing via Amsterdam and doing funny things in the M247 network as a result, and some will not: your colleague's Plusnet result doesn't go via Amsterdam, but you on Zzoomm do. The Cloudflare tunnel is a workaround, but if M247 can't/won't fix the routing then all you can do is move provider.
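
To compare paths from more clients without reading every trace by eye, something like the following Python sketch could pull the M247 site codes out of tracert output. It assumes M247 router names keep the `<site>.<country>.m247.ro` shape seen in the traces posted earlier; the regex and function name are illustrative, and the hop lines are copied from those two traces.

```python
import re
from typing import List

# Matches the site and country labels in M247 router names,
# e.g. "be-4-2992.bb1n.ams2.nl.m247.ro" -> ("ams2", "nl").
SITE_RE = re.compile(r"\.([a-z]+\d+)\.([a-z]{2})\.m247\.ro", re.IGNORECASE)


def m247_sites(trace_output: str) -> List[str]:
    """Return the M247 sites a trace passes through, in order, collapsing repeats."""
    sites: List[str] = []
    for line in trace_output.splitlines():
        match = SITE_RE.search(line)
        if match:
            site = f"{match.group(1).lower()}.{match.group(2).lower()}"
            if not sites or sites[-1] != site:
                sites.append(site)
    return sites


# M247 hops from the two traces posted earlier in the thread.
ZZOOMM_TRACE = """
  9    14 ms    14 ms    14 ms  ethernet-29-1.pni1.ams2.nl.m247.ro [185.206.226.42]
 10     *        *       16 ms  be-4-2992.bb1n.ams2.nl.m247.ro [37.120.220.104]
 11    14 ms     *        *     hundredgige0-0-3-0.bb1n.lon2.uk.m247.ro [146.70.1.229]
 12    15 ms    14 ms    14 ms  hundredgige0-0-2-0.bb1n.lon1.uk.m247.ro [146.70.1.124]
 13    17 ms    14 ms    14 ms  te-7-1.bb1.lon1.uk.m247.ro [193.27.15.157]
 14    21 ms    21 ms    21 ms  te-0-0-0-20.bb1n.lon1.uk.m247.ro [193.27.15.238]
 15    21 ms     *       21 ms  hundredgige0-0-2-3.bb1n.lon2.uk.m247.ro [146.70.1.125]
 16    23 ms    21 ms    21 ms  be-101.bb1n.man4.uk.m247.ro [212.103.51.182]
 17    21 ms    20 ms    20 ms  vlan3101.core-dc1-agg1.man4.uk.m247.ro [37.120.220.3]
 18    26 ms    21 ms    20 ms  te-1-49.xs5a.man4.uk.m247.ro [77.243.185.87]
"""

PLUSNET_TRACE = """
  8    12 ms    12 ms    12 ms  be-101.bb1n.man4.uk.m247.ro [212.103.51.182]
 10    13 ms    12 ms    12 ms  te-1-49.xs5a.man4.uk.m247.ro [77.243.185.87]
"""

if __name__ == "__main__":
    print("Zzoomm :", " -> ".join(m247_sites(ZZOOMM_TRACE)))
    print("Plusnet:", " -> ".join(m247_sites(PLUSNET_TRACE)))
```

Run against those two traces, this gives ams2.nl, lon2.uk, lon1.uk, lon2.uk, man4.uk for the Zzoomm path and only man4.uk for the Plusnet one, which is a compact way to show M247 the difference.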
 
Thanks at least that gives me something to throw at M247..
 
You're right in a way that Cloudflare Tunnel is a workaround (though the security aspect of it shouldn't be overlooked) but as a troubleshooting method if you can take a customer with 200 errors one day down to zero when connecting through the tunnel, and then back to 200 when connecting to the server directly then that's a pretty good indication that it's at least not your problem.
 
That is very true. As I said, I will investigate a Cloudflare tunnel. However, not having used one or knowing anything about them, it may take some understanding before we can get to that point, and possibly even code changes?! Let the investigations begin..

Thanks again!!
R.
 