Just thought I'd share this little nugget with you guys. Every now and again, a DBA will complain about the performance of their databases no matter what. You could have infinite bandwidth into a SAN made entirely of RAM with the cluster running on the most powerful hardware conceivable and still they'd find a way to whinge (instead of fixing their queries!)
In this instance though, there was something to it:
In the event log there were lots of Error 833s - "I/O requests taking longer than 15 seconds to complete"
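As a quick illustration (the log excerpt, file path and message wording below are invented for the example - the exact text varies between SQL Server versions), pulling the 833 stalls out of an exported error log is a one-regex job:

```python
import re

# Hypothetical excerpt of an exported SQL Server error log; the entries
# here are made up to mirror the shape of a real 833 message.
log_lines = [
    "2015-03-02 04:11:09.53 spid10s SQL Server has encountered 12 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\\Data\\build.mdf]",
    "2015-03-02 04:12:41.10 spid52  Login succeeded for user 'builder'.",
    "2015-03-02 04:21:33.07 spid10s SQL Server has encountered 3 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\\Data\\build.mdf]",
]

# Pull out the timestamp, occurrence count and file for each 833-style entry.
pattern = re.compile(
    r"^(?P<ts>\S+ \S+)\s+\S+\s+SQL Server has encountered "
    r"(?P<count>\d+) occurrence\(s\) of I/O requests taking longer than "
    r"15 seconds to complete on file \[(?P<file>[^\]]+)\]"
)

stalls = [m.groupdict() for line in log_lines if (m := pattern.match(line))]
print(stalls)
```

Even on a busy server this gives you the when and the where in seconds, which matters later when lining the stalls up against the storage-side graphs.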
You can see there the counters I started off looking at. Some counters are absolutely reliable, some are kinda reliable and others should never be used. These fall into the first category - they are very reliable indicators of what is going on, provided you interpret them correctly.
This server is used to keep track of multiple-terabyte clustered build operations. There are a lot of moving parts, but essentially it is extremely busy in bursts and fairly quiet the rest of the time. That makes troubleshooting a little trickier, but not impossible: there is a trickle of load most of the time, so you just have to watch and wait. Eventually (after about 20 minutes) you can see what happened - the disk queue length hit 1, all I/O activity stopped, and when it came back the average sec/transfer had spiked massively.
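To make that pattern concrete - and the numbers below are invented to mirror what the counters showed, not real captures - the detection logic amounts to "requests queued but nothing completing, followed by a latency spike":

```python
# Toy perfmon-style samples, one per second, as invented illustrations:
# (queue_len, transfers_per_sec, avg_sec_per_transfer)
samples = [
    (0, 40, 0.004),
    (0, 35, 0.005),
    (1, 0, 0.0),    # queue builds but no I/O completes: the stall
    (1, 0, 0.0),
    (0, 50, 0.180), # I/O resumes with a big latency spike
    (0, 45, 0.006),
]

def find_stalls(samples, latency_spike=0.050):
    """Flag seconds where requests are queued but nothing completes,
    and the first post-stall sample whose latency exceeds the threshold."""
    events = []
    stalled = False
    for i, (qlen, tps, lat) in enumerate(samples):
        if qlen > 0 and tps == 0:
            events.append((i, "stall"))
            stalled = True
        elif stalled and lat > latency_spike:
            events.append((i, "latency spike"))
            stalled = False
    return events

events = find_stalls(samples)
print(events)
```

The 50ms spike threshold is an arbitrary choice for the sketch; in practice you would baseline it against the volume's normal sec/transfer.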
At the same time, I had the very basic NetApp performance stats open:
You can see that the system isn't particularly busy but at the same time as the blip in the above image, the per-protocol latency graph for FC/FCoE just stops reporting a value. When it comes back, it is a bit higher than normal. Curious!
Time to go deeper... SSHing into the controller, elevating up to diag and running a statit showed nothing untoward on the disks. But lo, what is this?
]: FCP target 2b reported a port configuration change. FIP CVL received - the virtual FC link was disconnected.
Seems likely to cause an issue...
Off to the Nexus to confirm:
%PORT-5-IF_TRUNK_DOWN: %$VSAN xxx%$ Interface vfcxx, vsan xxx is down (waiting for flogi)
The interface is flapping very briefly (down for less than one second every 10 minutes or so), but even that is enough to force the FCoE fabric to reconverge, and reconvergence takes time.
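Working out the flap cadence is just a matter of diffing the syslog timestamps. A minimal sketch, assuming paired down/up messages (the excerpt below is invented - interface, VSAN and timestamps are placeholders, only the message shape mirrors the real entries):

```python
from datetime import datetime

# Invented Nexus-style syslog excerpt for illustration only.
log = [
    "2015 Mar  2 04:11:09 %PORT-5-IF_TRUNK_DOWN: Interface vfc12, vsan 101 is down (waiting for flogi)",
    "2015 Mar  2 04:11:10 %PORT-5-IF_TRUNK_UP: Interface vfc12, vsan 101 is up",
    "2015 Mar  2 04:21:33 %PORT-5-IF_TRUNK_DOWN: Interface vfc12, vsan 101 is down (waiting for flogi)",
    "2015 Mar  2 04:21:34 %PORT-5-IF_TRUNK_UP: Interface vfc12, vsan 101 is up",
]

def parse_ts(line):
    # Everything before the " %" facility marker is the timestamp.
    return datetime.strptime(line.split(" %")[0], "%Y %b %d %H:%M:%S")

downs = [parse_ts(l) for l in log if "IF_TRUNK_DOWN" in l]
ups = [parse_ts(l) for l in log if "IF_TRUNK_UP" in l]

# How long each outage lasted, and how far apart the flaps are.
durations = [(u - d).total_seconds() for d, u in zip(downs, ups)]
gaps = [(b - a).total_seconds() for a, b in zip(downs, downs[1:])]
print(durations, gaps)
```

Sub-second outages every ten-odd minutes is exactly the signature that is short enough to dodge most monitoring but long enough to trigger fabric reconvergence each time.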
This was the first time I had seen an in-service passive TwinAx cable failure - and not the first place I thought to check when a DBA reported a performance issue with a database!
If nothing else, this goes to show that in modern Enterprise IT, you need to have at least a rudimentary handle on all of the infrastructure that knits together to make up your environment. I don't think it is good enough any more to just be a server guy or a storage guy or a network guy if you want to really excel in any of those areas - especially as things become more and more converged.