DBAs aren't always lying... (troubleshooting fun)

Just thought I'd share this little nugget with you guys. Every now and again, a DBA will complain about the performance of their databases no matter what. You could have infinite bandwidth into a SAN made entirely of RAM with the cluster running on the most powerful hardware conceivable and still they'd find a way to whinge (instead of fixing their queries!)

In this instance though, there was something to it:

In the event log there were lots of Error 833s - I/O requests taking longer than 15 seconds to complete.
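As an aside, if you want to pick these out without trawling the log viewer by hand, one low-tech option is to scan the SQL Server ERRORLOG text file for the long-I/O warning text. A minimal sketch, assuming a hypothetical default instance path and that the log is UTF-16 encoded (both vary between versions and instances):

```python
# Scan a SQL Server ERRORLOG for the error 833 style "I/O requests taking
# longer than 15 seconds" warnings. The path below is a hypothetical default
# instance location -- point it at your own ERRORLOG.
from pathlib import Path

ERRORLOG = Path(r"C:\Program Files\Microsoft SQL Server"
                r"\MSSQL15.MSSQLSERVER\MSSQL\Log\ERRORLOG")

def slow_io_entries(path: Path):
    raw = path.read_bytes()
    # Recent SQL Server versions write the log as UTF-16 (BOM-prefixed);
    # fall back to latin-1 for older ANSI-encoded logs.
    enc = "utf-16" if raw[:2] in (b"\xff\xfe", b"\xfe\xff") else "latin-1"
    for line in raw.decode(enc, errors="replace").splitlines():
        if "taking longer than 15 seconds" in line:
            yield line.strip()

if __name__ == "__main__":
    for entry in slow_io_entries(ERRORLOG):
        print(entry)
```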

[Image: pause.png - the disk counters being watched]


You can see there the counters I started off looking at. Some counters are absolutely reliable, some are kinda reliable and others should never be used. These ones are in the first camp - they are very reliable indicators of what is going on when you interpret them correctly.

This server is used to keep track of multiple-terabyte clustered build operations. There are a lot of moving parts, but essentially it is extremely busy in bursts and not very busy the rest of the time. This makes troubleshooting a little trickier, but not impossible, as there is a trickle of load most of the time; you just have to watch and wait. Eventually (after about 20 minutes) you can see what happened - the disk queue length hit 1, all I/O activity stopped, and when it came back the average sec/transfer had spiked massively.
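If you don't fancy staring at Perfmon for twenty minutes, you can stream the same counters from the command line with typeperf and have a script flag the bad samples for you. A rough sketch - the _Total instance and the 15 ms threshold are just illustrative, so point it at the LogicalDisk instance that actually holds your data and log files:

```python
# Stream disk counters via typeperf and flag samples where latency spikes.
# Counter instances and the 15 ms threshold are illustrative only.
import csv
import subprocess

COUNTERS = [
    r"\LogicalDisk(_Total)\Avg. Disk sec/Transfer",
    r"\LogicalDisk(_Total)\Current Disk Queue Length",
]
THRESHOLD_S = 0.015  # 15 ms - roughly where things start to hurt

# One sample per second until interrupted; typeperf emits CSV on stdout.
proc = subprocess.Popen(["typeperf", *COUNTERS, "-si", "1"],
                        stdout=subprocess.PIPE, text=True)

header_seen = False
for row in csv.reader(proc.stdout):
    if not row:
        continue
    if not header_seen:
        header_seen = True  # first CSV row is the timestamp + counter names
        continue
    try:
        latency, queue = float(row[1]), float(row[2])
    except (IndexError, ValueError):
        continue  # typeperf also prints status lines we don't care about
    flag = "  <-- spike" if latency > THRESHOLD_S else ""
    print(f"{row[0]}  avg sec/transfer={latency:.4f}  queue={queue:.0f}{flag}")
```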

At the same time, I had the very basic NetApp performance stats open:

[Image: pause_netapp.png - the NetApp performance stats]


You can see that the system isn't particularly busy but at the same time as the blip in the above image, the per-protocol latency graph for FC/FCoE just stops reporting a value. When it comes back, it is a bit higher than normal. Curious!

Time to go deeper... SSHing into the controller, elevating up to diag and running a statit showed nothing untoward on the disks. But lo, what is this?

]: FCP target 2b reported a port configuration change. FIP CVL received - the virtual FC link was disconnected.

Seems likely to cause an issue...
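For anyone who'd rather script that controller-side check than type it at the console, here's a very rough paramiko sketch. It assumes a 7-Mode style CLI where priv set diag and statit behave as described above; the hostname and credentials are hypothetical, and a real script would handle prompts and errors properly:

```python
# Rough sketch: SSH to the controller, elevate to diag and collect a statit
# sample, as described above. Hostname/credentials are hypothetical and the
# priv set diag / statit -b / statit -e workflow assumes a 7-Mode style CLI.
import time
import paramiko

HOST, USER, PASSWORD = "filer01", "root", "********"   # hypothetical

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, username=USER, password=PASSWORD)
shell = client.invoke_shell()

def send(cmd, wait=2):
    """Send a command to the interactive session and return whatever came back."""
    shell.send((cmd + "\n").encode())
    time.sleep(wait)
    out = b""
    while shell.recv_ready():
        out += shell.recv(65535)
    return out.decode(errors="replace")

send("priv set diag")              # elevate, as in the post
send("statit -b")                  # begin collecting per-disk statistics
time.sleep(60)                     # let it see a minute or so of load
print(send("statit -e", wait=5))   # end collection and dump the report
client.close()
```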

Off to the Nexus to confirm:

%PORT-5-IF_TRUNK_DOWN: %$VSAN xxx%$ Interface vfcxx, vsan xxx is down (waiting for flogi)

The interface is flapping very, very briefly (down for less than one second every ten minutes or so), but obviously that's forcing the FCoE fabric to reconverge each time, and that takes time.
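To put numbers on how often and for how long, you can pair up the down/up messages in a saved copy of the switch log. A quick sketch; the filename, the timestamp format and the assumption that the matching recovery message is %PORT-5-IF_TRUNK_UP are all illustrative:

```python
# Pair up vfc trunk down/up syslog messages to measure flap frequency and
# outage duration. Filename, timestamp format and the IF_TRUNK_UP mnemonic
# are assumptions -- adjust to match your own syslog export.
import re
from datetime import datetime

LOGFILE = "nexus_syslog.txt"                          # hypothetical saved log export
TS_RE = r"(\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})"   # e.g. "2015 Mar 12 09:41:07"
TS_FMT = "%Y %b %d %H:%M:%S"

down_re = re.compile(TS_RE + r".*%PORT-5-IF_TRUNK_DOWN:.*Interface (vfc[^,\s]+)")
up_re   = re.compile(TS_RE + r".*%PORT-5-IF_TRUNK_UP:.*Interface (vfc[^,\s]+)")

events = []  # (timestamp, state, interface)
with open(LOGFILE) as fh:
    for line in fh:
        for regex, state in ((down_re, "down"), (up_re, "up")):
            m = regex.search(line)
            if m:
                events.append((datetime.strptime(m.group(1), TS_FMT), state, m.group(2)))

events.sort(key=lambda e: e[0])
went_down = {}
for ts, state, intf in events:
    if state == "down":
        went_down[intf] = ts
    elif intf in went_down:
        started = went_down.pop(intf)
        # Sub-second flaps show up as 0s at this resolution; the frequency of
        # the events is the interesting part.
        print(f"{intf}: down at {started}, back after {(ts - started).total_seconds():.0f}s")
```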

This is the first time I have seen an in-service passive TwinAx cable failure. Not the first place I thought to check when a DBA reports a performance issue with a database!

If nothing else, this goes to show that in modern Enterprise IT, you need to have at least a rudimentary handle on all of the infrastructure that knits together to make up your environment. I don't think it is good enough any more to just be a server guy or a storage guy or a network guy if you want to really excel in any of those areas - especially as things become more and more converged.
 
All I see here is that you clearly do not have sufficient monitoring in place; otherwise you'd have alerting configured that would have already flagged up a bad port on a switch. Doesn't appear "enterprise", but then that phrase is thrown around far too easily.

Define enterprise? We're quite small, only 10,000ish users across 60 countries with an annual turnover of billions. Maybe that doesn't qualify by your standards but it does by mine :)

I've never yet worked in an absolutely perfect environment! Monitoring is far from perfect no matter what you implement - we do monitor switchports (of course), but very short failures are more difficult to catch unless you're monitoring in a certain way. You've obviously got tiers of importance as well; some things are monitored much more closely than others. Monitoring isn't yet in my remit; it's getting better all the time though.
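For what it's worth, sub-second flaps like this one are exactly what interval-based polling misses: the link can drop and recover between two polls and the check sees nothing. Listening for the switch's own syslog messages as they arrive catches them regardless of duration. A bare-bones sketch; the port, the message patterns and the print-an-alert handling are all illustrative, and a real setup would feed an existing monitoring system instead:

```python
# Minimal event-driven catch for brief link flaps: listen for syslog messages
# from the switches and flag trunk/link-down events the moment they arrive,
# rather than polling interface state and hoping the poll lands inside the
# outage. Port, patterns and the print-an-alert handling are illustrative.
import socket

PATTERNS = ("IF_TRUNK_DOWN", "IF_DOWN", "LINK-3-UPDOWN")  # adjust per platform

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 514))        # standard syslog/UDP port (needs privileges)

while True:
    data, (src, _port) = sock.recvfrom(8192)
    message = data.decode(errors="replace")
    if any(p in message for p in PATTERNS):
        # In real life: push to the NMS / raise a ticket / page someone.
        print(f"ALERT from {src}: {message.strip()}")
```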

As always on OcUK, you get someone wading in with a point that wasn't the point of the thread at all :)
 
I think that entirely depends on a few factors and it influences the way you think about it. I've worked with more than 10,000 users in a city size area with more than 100 sites - that's a very different proposition to a global organisation with pockets of users ranging from less than 5 to more than 2,000. The latter is a different challenge to the former (and I personally prefer it because it means I get to jet all over the world! :D)

This also depends on how you classify users. I just count users as internal resources that I need to keep fed and watered to perform their job. That is the side of what we do that I look after (as part of our global team, of course). What I'm not directly responsible for is the millions upon millions of people that use our services every single day. It would be unrealistic to suggest that they make up our "user count", but they do matter!
 