Network gurus - Network Troubleshooting?

Have two sites linked by a BT LES10 line with Gigabit managed switches at either end.

Users in the smaller office access a terminal server in the main office but regularly complain of performance issues. Initially we did the usual troubleshooting involving pings etc. and found a problem with one of the switches. The switch was replaced and things seemed better, but users still complain of performance problems.

Short of doing more pinging, what's the best way to test the link between the two sites, either to highlight any problems or to prove that it's fine and the problem lies with the server or software? The issues aren't very regular, and things usually seem fine by the time we look, so I'm not sure packet captures would be very useful. Any other tools/methods?
 
A quick, easy and free way to do this is to run a packet blast from one end of the link to the other. You can get a small app called 'iperf', or a slightly more advanced version called 'jperf' with a GUI.

Install it on laptops connected at both ends; you should be able to send and receive a few hundred meg at least.
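If it helps, here's roughly what that looks like from the command line (a sketch using classic iperf2 syntax; 10.1.1.4 stands in for the listening laptop's address):

```shell
# On the laptop at one end of the link - start a listener:
iperf -s -i 1 -f m

# On the laptop at the other end - send TCP as fast as it will go for
# 10 seconds (10.1.1.4 is the listening laptop's address):
iperf -c 10.1.1.4 -i 1 -t 10 -f m

# Repeating the client run with -P 2 uses two parallel streams, which
# can show whether a single TCP session is being window-limited:
iperf -c 10.1.1.4 -P 2 -i 1 -t 10 -f m
```

Run each test a few times and at different times of day; a one-off clean result doesn't rule out intermittent congestion.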
 
Are the switches decent, and what are their queueing and buffering like? If they're Netgears doing cut-through, or with next to no buffers, then performance is going to suck. I assume you're just extending the layer 2 network over the link (which isn't a great approach but should just about work).

Basically, there's loads of things here. If you're spewing traffic into those gig switches at gig speed and just expecting them to sort out sending it through the LES10 then, well, you need to rethink things.
 
Well, I'm not 100% sure whether the link speeds are auto-negotiated or not, but I get the feeling they're manually configured (I'll check).

The link isn't strictly a LES10 any more; it's considerably faster now that it has different fibre modules at each end!

The switches are half decent, I believe, although they're not really that new and were configured by someone who no longer works here. There's an FSM7326P at one end and a GSM7312 at the other.

The switches are configured with a few VLANs to help with routing, so that the fibre link is the preferred path but, if it goes down, a new route over a VPN is used instead.

I guess there's lots of room for error in this configuration, so really I'm looking for tests which might highlight any potential problems. The connection always seems fine when we try to look at it, but users still complain about speed/reliability issues.
 
The connection always seems fine when we try to look at it, but users still complain about speed/reliability issues.

Try to observe your users reproducing the issues. It's quite possible that the problem actually lies in a particular software package or an overloaded server, and that the network itself is fine.
 
The link isn't strictly a LES10 any more; it's considerably faster now that it has different fibre modules at each end!

LES is administered remotely by BT and goes directly into a LES chassis. Unless you've increased the bandwidth via BT, you're not going to see an upgrade by swapping SFPs.
 
As has already been mentioned, get a laptop directly connected at each end and run jperf, which is the same as iperf (in fact it uses iperf), just with a Java GUI to make it a bit prettier :)

Set up some monitoring software to watch the switch ports and make sure they're not being maxed out, especially the connections to the LES10.

How many users are in the smaller office, and does all of their traffic traverse the 10Mb, or is there local internet breakout for them? If all the traffic traverses the 10Mb, it won't take much to max it out during the day, making any TS sessions seem sluggish.

Mike
 
One of our clients runs 7312's at either end of a LES100 (iirc?), performance is fine. I'll check tomorrow if they use theirs with SFPs or presented from the NTEs via copper ethernet.

One thought might be mismatched (asymmetric) routes, since you've mentioned other network paths/VPNs are involved.
 
LES is administered remotely by BT and goes directly into a LES chassis. Unless you've increased the bandwidth via BT, you're not going to see an upgrade by swapping SFPs.

Well, the speeds were tested before and after changing the modules at either end, and there was definitely an increase. The modules are now gigabit, although the measured speed wasn't quite up to that.

I guess it's a possibility that the issues are a result of using modules that are faster than the LES10 was supposed to run at, but there were issues before we changed the modules anyway.

Traffic through the LES10 during the day should be limited to RDP only, and I don't think 5-10 users' worth of RDP traffic would fill the link even if it were running at 10Mbit, would it? It would be nice to have some monitoring software that checks all this.
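As a back-of-envelope check (the per-session figure here is an assumption, not a measurement; plain RDP sessions are usually reckoned at somewhere around 50-150 kbit/s each):

```shell
# Rough RDP capacity estimate with assumed figures:
USERS=10
KBIT_PER_USER=150      # pessimistic per-session estimate for plain RDP
LINK_KBIT=10000        # a true LES10: 10 Mbit/s

TOTAL_KBIT=$((USERS * KBIT_PER_USER))
echo "Estimated load: ${TOTAL_KBIT} of ${LINK_KBIT} kbit/s"   # 1500 of 10000
```

So on those assumptions, 10 sessions sit well inside even a real 10Mbit link, but a single file copy, print job or clipboard transfer through an RDP session can burst far beyond that and make everyone's session feel sluggish.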

As for monitoring the ports on the switches, what software is good for that? Is there a decent free tool for this?
 
Try to observe your users reproducing the issues. It's quite possible that the problem actually lies in a particular software package or an overloaded server, and that the network itself is fine.

This is actually more difficult than it sounds, since we aren't generally on site unless there are problems. Whenever people mention a problem, it's normally cleared by the time we look :/ It doesn't help that they tend to wait ages before reporting it, too.
 
Well, I got jperf on there and did some random tests, which look fine to me! But what do I know, eh? :P Here's the output.


Certainly doesn't seem very slow for a LES10, and the UDP traffic seems pretty reliable until you ramp the speed up, but that's about all I can glean from this. Any other tests I should be doing?

Apologies for the wall of text :P

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 0.01 MByte (default)
------------------------------------------------------------
[408] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 54478
[ ID] Interval Transfer Bandwidth
[408] 0.0- 1.0 sec 32.3 MBytes 271 Mbits/sec
[408] 1.0- 2.0 sec 29.3 MBytes 246 Mbits/sec
[408] 2.0- 3.0 sec 30.2 MBytes 254 Mbits/sec
[408] 3.0- 4.0 sec 33.1 MBytes 278 Mbits/sec
[408] 4.0- 5.0 sec 31.6 MBytes 265 Mbits/sec
[408] 5.0- 6.0 sec 25.6 MBytes 215 Mbits/sec
[408] 6.0- 7.0 sec 30.4 MBytes 255 Mbits/sec
[408] 7.0- 8.0 sec 31.1 MBytes 261 Mbits/sec
[408] 8.0- 9.0 sec 30.8 MBytes 259 Mbits/sec
[408] 9.0-10.0 sec 32.1 MBytes 269 Mbits/sec
[408] 0.0-10.0 sec 307 MBytes 257 Mbits/sec
Done.

bin/iperf.exe -s -u -P 0 -i 1 -p 5001 -f m
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[200] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 53903
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[200] 0.0- 1.0 sec 0.12 MBytes 1.00 Mbits/sec 0.465 ms 1095779661/ 85 (1.3e+009%)
[200] 1.0- 2.0 sec 0.12 MBytes 1.00 Mbits/sec 0.423 ms 0/ 85 (0%)
[200] 2.0- 3.0 sec 0.12 MBytes 0.99 Mbits/sec 0.487 ms 0/ 84 (0%)
[200] 3.0- 4.0 sec 0.12 MBytes 1.01 Mbits/sec 0.432 ms 0/ 86 (0%)
[200] 4.0- 5.0 sec 0.12 MBytes 1.00 Mbits/sec 0.428 ms 0/ 85 (0%)
[200] 5.0- 6.0 sec 0.12 MBytes 1.00 Mbits/sec 0.429 ms 0/ 85 (0%)
[200] 6.0- 7.0 sec 0.12 MBytes 1.00 Mbits/sec 0.442 ms 0/ 85 (0%)
[200] 7.0- 8.0 sec 0.12 MBytes 1.00 Mbits/sec 0.430 ms 0/ 85 (0%)
[200] 8.0- 9.0 sec 0.12 MBytes 1.00 Mbits/sec 0.411 ms 0/ 85 (0%)
[200] 9.0-10.0 sec 0.12 MBytes 1.00 Mbits/sec 0.440 ms 0/ 85 (0%)
[200] 0.0-10.0 sec 1.19 MBytes 1.00 Mbits/sec 0.422 ms 0/ 852 (0%)
[200] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 54801
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[200] 0.0- 1.0 sec 0.60 MBytes 5.00 Mbits/sec 0.203 ms 0/ 425 (0%)
[200] 1.0- 2.0 sec 0.60 MBytes 5.00 Mbits/sec 0.168 ms 0/ 425 (0%)
[200] 2.0- 3.0 sec 0.60 MBytes 5.00 Mbits/sec 0.161 ms 0/ 425 (0%)
[200] 3.0- 4.0 sec 0.60 MBytes 5.01 Mbits/sec 0.121 ms 0/ 426 (0%)
[200] 4.0- 5.0 sec 0.60 MBytes 5.00 Mbits/sec 0.107 ms 0/ 425 (0%)
[200] 5.0- 6.0 sec 0.60 MBytes 5.00 Mbits/sec 0.120 ms 0/ 425 (0%)
[200] 6.0- 7.0 sec 0.60 MBytes 5.00 Mbits/sec 0.102 ms 0/ 425 (0%)
[200] 7.0- 8.0 sec 0.60 MBytes 5.00 Mbits/sec 0.098 ms 0/ 425 (0%)
[200] 8.0- 9.0 sec 0.60 MBytes 5.01 Mbits/sec 0.215 ms 0/ 426 (0%)
[200] 9.0-10.0 sec 0.60 MBytes 5.00 Mbits/sec 0.100 ms 0/ 425 (0%)
[200] 0.0-10.0 sec 5.96 MBytes 5.00 Mbits/sec 0.159 ms 0/ 4253 (0%)
[200] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 58120
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[200] 0.0- 1.0 sec 2.34 MBytes 19.6 Mbits/sec 0.031 ms 0/ 1669 (0%)
[200] 1.0- 2.0 sec 2.34 MBytes 19.6 Mbits/sec 0.037 ms 1/ 1670 (0.06%)
[200] 2.0- 3.0 sec 1.97 MBytes 16.5 Mbits/sec 0.133 ms 0/ 1407 (0%)
[200] 3.0- 4.0 sec 2.36 MBytes 19.8 Mbits/sec 0.020 ms 0/ 1685 (0%)
[200] 4.0- 5.0 sec 2.36 MBytes 19.8 Mbits/sec 0.026 ms 0/ 1682 (0%)
[200] 5.0- 6.0 sec 2.32 MBytes 19.4 Mbits/sec 0.085 ms 1/ 1653 (0.06%)
[200] 6.0- 7.0 sec 2.36 MBytes 19.8 Mbits/sec 0.025 ms 0/ 1684 (0%)
[200] 7.0- 8.0 sec 2.36 MBytes 19.8 Mbits/sec 0.099 ms 0/ 1684 (0%)
[200] 8.0- 9.0 sec 2.33 MBytes 19.6 Mbits/sec 0.227 ms 0/ 1663 (0%)
[200] 0.0-10.0 sec 23.1 MBytes 19.4 Mbits/sec 0.137 ms 2/16478 (0.012%)
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[200] 0.0- 1.0 sec 24.2 MBytes 203 Mbits/sec 0.102 ms 190/17440 (1.1%)
[200] 1.0- 2.0 sec 24.9 MBytes 209 Mbits/sec 0.125 ms 282/18013 (1.6%)
[200] 2.0- 3.0 sec 25.2 MBytes 211 Mbits/sec 0.127 ms 482/18446 (2.6%)
[200] 3.0- 4.0 sec 24.5 MBytes 205 Mbits/sec 0.116 ms 287/17732 (1.6%)
[200] 4.0- 5.0 sec 25.5 MBytes 214 Mbits/sec 0.116 ms 265/18488 (1.4%)
[200] 5.0- 6.0 sec 25.5 MBytes 214 Mbits/sec 0.126 ms 257/18465 (1.4%)
[200] 6.0- 7.0 sec 20.6 MBytes 173 Mbits/sec 0.129 ms 168/14875 (1.1%)
[200] 7.0- 8.0 sec 25.4 MBytes 213 Mbits/sec 0.122 ms 345/18487 (1.9%)
[200] 8.0- 9.0 sec 25.5 MBytes 214 Mbits/sec 0.127 ms 215/18419 (1.2%)
[200] 9.0-10.0 sec 24.4 MBytes 205 Mbits/sec 0.097 ms 119/17552 (0.68%)
[200] 0.0-10.0 sec 246 MBytes 206 Mbits/sec 0.153 ms 2610/177918 (1.5%)





bin/iperf.exe -s -P 0 -i 1 -p 5001 -f m
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 0.01 MByte (default)
------------------------------------------------------------
[408] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 54640
[428] local 10.1.1.4 port 5001 connected with 10.2.1.3 port 54641
[ ID] Interval Transfer Bandwidth
[428] 0.0- 1.0 sec 28.5 MBytes 239 Mbits/sec
[408] 0.0- 1.0 sec 28.6 MBytes 240 Mbits/sec
[SUM] 0.0- 1.0 sec 57.1 MBytes 479 Mbits/sec
[408] 1.0- 2.0 sec 28.0 MBytes 235 Mbits/sec
[428] 1.0- 2.0 sec 27.8 MBytes 233 Mbits/sec
[SUM] 1.0- 2.0 sec 55.8 MBytes 468 Mbits/sec
[408] 2.0- 3.0 sec 26.6 MBytes 223 Mbits/sec
[428] 2.0- 3.0 sec 26.2 MBytes 220 Mbits/sec
[SUM] 2.0- 3.0 sec 52.8 MBytes 443 Mbits/sec
[428] 3.0- 4.0 sec 27.2 MBytes 228 Mbits/sec
[408] 3.0- 4.0 sec 27.8 MBytes 233 Mbits/sec
[SUM] 3.0- 4.0 sec 55.0 MBytes 461 Mbits/sec
[408] 4.0- 5.0 sec 28.9 MBytes 242 Mbits/sec
[428] 4.0- 5.0 sec 28.5 MBytes 239 Mbits/sec
[SUM] 4.0- 5.0 sec 57.4 MBytes 481 Mbits/sec
[408] 5.0- 6.0 sec 25.5 MBytes 214 Mbits/sec
[428] 5.0- 6.0 sec 25.5 MBytes 214 Mbits/sec
[SUM] 5.0- 6.0 sec 51.0 MBytes 428 Mbits/sec
[408] 6.0- 7.0 sec 28.6 MBytes 240 Mbits/sec
[428] 6.0- 7.0 sec 28.2 MBytes 237 Mbits/sec
[ ID] Interval Transfer Bandwidth
[SUM] 6.0- 7.0 sec 56.8 MBytes 477 Mbits/sec
[428] 7.0- 8.0 sec 24.1 MBytes 202 Mbits/sec
[408] 7.0- 8.0 sec 24.4 MBytes 205 Mbits/sec
[SUM] 7.0- 8.0 sec 48.5 MBytes 406 Mbits/sec
[428] 8.0- 9.0 sec 26.4 MBytes 222 Mbits/sec
[408] 8.0- 9.0 sec 26.6 MBytes 223 Mbits/sec
[SUM] 8.0- 9.0 sec 53.0 MBytes 445 Mbits/sec
[428] 0.0-10.0 sec 274 MBytes 230 Mbits/sec
[408] 0.0-10.0 sec 276 MBytes 232 Mbits/sec
[SUM] 0.0-10.0 sec 550 MBytes 462 Mbits/sec
 
If you're running at those speeds you've replaced the BT NTE and all bets are off; there's nothing I can recommend to troubleshoot, as the infrastructure between the devices is unknown, the quality of the optics is unknown, and you're running it out of spec, unsupported, in a way I could never advise a business to. Good luck.

I'd also add: no matter how fast it is, it's not going to be line speed for the PCs, and that means something needs to manage the bandwidth. Whether that's queueing on the devices or TCP windowing on the app side depends on your implementation...
 
If you're running at those speeds you've replaced the BT NTE and all bets are off; there's nothing I can recommend to troubleshoot, as the infrastructure between the devices is unknown, the quality of the optics is unknown, and you're running it out of spec, unsupported, in a way I could never advise a business to. Good luck.

I'd also add: no matter how fast it is, it's not going to be line speed for the PCs, and that means something needs to manage the bandwidth. Whether that's queueing on the devices or TCP windowing on the app side depends on your implementation...

To be honest the connection should never be pushed that hard during work hours anyway. The most load it should be getting is around 20 users using remote desktop.

Do these Netgear switches have much in the way of traffic stats, like NetFlow or something? All I can find is total packet counts and the like :/ which doesn't really tell me whether a particular client is causing issues.
 
If you're running at those speeds you've replaced the BT NTE and all bets are off; there's nothing I can recommend to troubleshoot, as the infrastructure between the devices is unknown, the quality of the optics is unknown, and you're running it out of spec, unsupported, in a way I could never advise a business to. Good luck.

I'd also add: no matter how fast it is, it's not going to be line speed for the PCs, and that means something needs to manage the bandwidth. Whether that's queueing on the devices or TCP windowing on the app side depends on your implementation...

I assume he's broken a contract as well... which could potentially be a bigger issue?
 
You could run up PRTG (or similar) and graph the SNMP stats on the individual ports.

As above though - whoever "upgraded" the line has made the issue worse, as you have no baseline to go back to BT with.
 
You could run up PRTG (or similar) and graph the SNMP stats on the individual ports.

Presumably SNMP has to be enabled before we can do this? I've only got experience of using SNMP to monitor bandwidth on some of our firewalls/routers, and those required a little configuration.

Does SNMP polling affect the performance of the switches at all?
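Polling is read-only and very lightweight; tools like PRTG just read the interface octet counters every so often and take the delta. As a sketch of the arithmetic involved (the snmpget line assumes net-snmp is installed, SNMP is enabled on the switch with a read-only community of 'public', and an illustrative ifIndex of 12; the two counter samples are made-up values):

```shell
# A poller would read IF-MIB::ifInOctets for the LES-facing port, e.g.:
#   snmpget -v2c -c public <switch-ip> IF-MIB::ifInOctets.12
# then compute the rate from two samples taken a known interval apart:
SAMPLE1=123456789      # ifInOctets at t=0 (hypothetical)
SAMPLE2=198765432      # ifInOctets 60 seconds later (hypothetical)
INTERVAL=60

BPS=$(( (SAMPLE2 - SAMPLE1) * 8 / INTERVAL ))
echo "Average inbound rate: ${BPS} bit/s"    # roughly 10 Mbit/s here
```

Real tools also handle the 32-bit counters wrapping, which matters on a busy gigabit port; use the 64-bit ifHCInOctets counters where the switch supports them.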
 
I'm confused about the 1M/5M/20M test... have you forced it to that speed somehow?

If you're natively getting 200-odd though, this is probably going to be some flaky app/DNS-type issue and nothing to do with the 'network'.
Consider watching exactly what they're doing, as annoying and time-consuming as that sounds :p
 
I'm confused about the 1M/5M/20M test... have you forced it to that speed somehow?

If you're natively getting 200-odd though, this is probably going to be some flaky app/DNS-type issue and nothing to do with the 'network'.
Consider watching exactly what they're doing, as annoying and time-consuming as that sounds :p

Yeah, jperf has settings for different speeds with the UDP tests, so I tried a few different ones.
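For reference, those jperf settings map onto iperf's UDP target-bandwidth flag: the client pushes a constant UDP stream at the requested rate and the server reports loss and jitter (10.1.1.4 as the server address, per the outputs above):

```shell
# The three UDP rates tested above, as plain iperf client commands:
iperf -c 10.1.1.4 -u -b 1M  -i 1 -t 10 -f m
iperf -c 10.1.1.4 -u -b 5M  -i 1 -t 10 -f m
iperf -c 10.1.1.4 -u -b 20M -i 1 -t 10 -f m
```

Without -b, iperf2's UDP test defaults to only 1 Mbit/s, which is why a UDP run at defaults tends to look artificially clean.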
 
I assume he's broken a contract as well... which could potentially be a bigger issue?
^^ Much bigger.

If that IS a LES10 (which, the way you've described it, I'm still not convinced it is) and BT find out you've bypassed their NTEs, there will probably be financial recompense and possibly termination of the service.
To say you're in very dubious territory is an understatement. For the cost of upgrading your CDR from 10 to 100Mbit, it's not worth the risk.

As for performance issues, you will always get unusual things happening if your WAN link sits on the same broadcast domain as your LAN, because broadcast/multicast traffic from both sites will traverse it completely unchecked.
Active monitoring won't work well, as production data flowing over the link will affect your monitoring results. Also, without any QoS on it, your testing could actually have a detrimental effect on the production traffic.
The best way to monitor it would be passively, using sFlow/SNMP monitoring on the switches, or port-mirroring the LES-facing port to a monitoring workstation.


P.S. Two sites linked at layer 2 - I hope you have STP running on those switches.
 