DBAs aren't always lying... (troubleshooting fun)

Just thought I'd share this little nugget with you guys. Every now and again, a DBA will complain about the performance of their databases no matter what. You could have infinite bandwidth into a SAN made entirely of RAM with the cluster running on the most powerful hardware conceivable and still they'd find a way to whinge (instead of fixing their queries!)

In this instance though, there was something to it:

In the event log there were lots of Error 833s: SQL Server warning that I/O requests were taking longer than 15 seconds to complete.
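If you want to pull these out in bulk rather than clicking through the event viewer, here's a rough sketch of the sort of thing I mean - scanning a saved copy of the SQL Server errorlog for the 15-second I/O warnings. The file path and encoding are assumptions; adjust for your instance:

```python
import re
from collections import Counter

# Assumed path to a saved copy of the errorlog; newer SQL Server
# versions write it as UTF-16, older ones as ANSI - adjust to suit.
LOG_PATH = "ERRORLOG.txt"

# Matches the standard error 833 wording in the errorlog.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S*.*"
    r"I/O requests taking longer than 15 seconds to complete "
    r"on file \[(.+?)\]"
)

per_hour = Counter()
with open(LOG_PATH, encoding="utf-16", errors="replace") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            timestamp, db_file = match.groups()
            per_hour[timestamp[:13]] += 1  # bucket by hour
            print(f"{timestamp}  slow I/O on {db_file}")

# A quick histogram shows whether the warnings cluster in bursts.
for hour, count in sorted(per_hour.items()):
    print(f"{hour}:00  {count} warning(s)")
```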

[Image: pause.png - Perfmon disk counters, showing the pause in I/O activity]


You can see there the counters I started off looking at. Some counters are absolutely reliable, some are only somewhat reliable, and others should never be used. These fall into the first category: interpreted correctly, they are very reliable indicators of what is going on.

This server is used to keep track of multiple-terabyte clustered build operations. There are a lot of moving parts, but essentially it is extremely busy in bursts and not very busy the rest of the time. This makes troubleshooting a little trickier but not impossible, as there is a trickle of load most of the time; you just have to watch and wait. Eventually (after about 20 minutes) you can see what happened: the disk queue length hit 1, all I/O activity stopped, and when it came back the average sec/transfer had spiked massively.
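For the watch-and-wait part, you don't have to sit staring at Perfmon. Here's a minimal sketch of the same idea using the built-in typeperf tool - the counter paths are the standard LogicalDisk ones, but the sample interval and alert threshold are just illustrative:

```python
import csv
import subprocess

# Standard Windows counter paths (same counters as in the screenshot);
# the 1s sample interval and 15ms threshold are illustrative choices.
COUNTERS = [
    r"\LogicalDisk(_Total)\Avg. Disk sec/Transfer",
    r"\LogicalDisk(_Total)\Current Disk Queue Length",
]
LATENCY_ALERT = 0.015  # seconds per transfer; tune for your storage

# typeperf streams one CSV row per sample interval to stdout.
proc = subprocess.Popen(
    ["typeperf", "-si", "1", *COUNTERS],
    stdout=subprocess.PIPE, text=True,
)

for row in csv.reader(proc.stdout):
    if len(row) < 3:
        continue
    try:
        latency, queue = float(row[1]), float(row[2])
    except ValueError:
        continue  # header row, or a sample with no value
    if latency > LATENCY_ALERT:
        print(f"{row[0]}: avg sec/transfer spiked to {latency:.3f}s "
              f"(queue length {queue:.0f})")
```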

At the same time, I had the very basic NetApp performance stats open:

[Image: pause_netapp.png - NetApp per-protocol latency stats, with the FC/FCoE graph briefly not reporting]


You can see that the system isn't particularly busy, but at the same time as the blip in the image above, the per-protocol latency graph for FC/FCoE just stops reporting a value. When it comes back, it is a bit higher than normal. Curious!
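That "stops reporting" gap is easy to miss by eye. If you can get the latency samples out as timestamped CSV (the file name and column layout here are assumptions), flagging the gaps programmatically is straightforward:

```python
import csv
from datetime import datetime

SAMPLE_INTERVAL = 15  # seconds between samples in the export (assumed)
GAP_FACTOR = 3        # treat 3+ missed samples in a row as a gap

# Assumed export format: one "ISO-timestamp,latency_ms" row per sample.
samples = []
with open("fcp_latency.csv") as f:
    for ts, latency_ms in csv.reader(f):
        samples.append((datetime.fromisoformat(ts), float(latency_ms)))

for (t0, _), (t1, latency) in zip(samples, samples[1:]):
    gap = (t1 - t0).total_seconds()
    if gap > SAMPLE_INTERVAL * GAP_FACTOR:
        print(f"reporting gap {t0} -> {t1} ({gap:.0f}s); "
              f"latency on return: {latency:.1f} ms")
```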

Time to go deeper... SSHing into the controller, elevating up to diag and running a statit showed nothing untoward on the disks. But lo, what is this?

FCP target 2b reported a port configuration change. FIP CVL received - the virtual FC link was disconnected.

Seems likely to cause an issue...

Off to the Nexus to confirm:

%PORT-5-IF_TRUNK_DOWN: %$VSAN xxx%$ Interface vfcxx, vsan xxx is down (waiting for flogi)

The interface is flapping very, very briefly (down for less than one second every 10 minutes or so), but obviously each flap forces the FCoE fabric to reconverge, and that takes time.
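To actually quantify that rather than eyeballing the log, you can pair up the down/up messages from a saved copy of the switch syslog. A rough sketch - the file name is an assumption, and note that syslog's one-second timestamp resolution means a sub-second outage shows up as 0s, which is exactly why these are so easy to miss:

```python
from datetime import datetime

LOG_PATH = "nexus_syslog.txt"  # assumed saved copy of the switch log

down_at = None
flaps = []  # (went_down, came_back) timestamp pairs
with open(LOG_PATH) as log:
    for line in log:
        if ("%PORT-5-IF_TRUNK_DOWN" not in line
                and "%PORT-5-IF_TRUNK_UP" not in line):
            continue
        # NX-OS stamps syslog lines like "2014 Jul 21 10:15:01".
        stamp = " ".join(line.split()[:4])
        ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        if "IF_TRUNK_DOWN" in line:
            down_at = ts
        elif down_at is not None:
            flaps.append((down_at, ts))
            down_at = None

for (down, up), (next_down, _) in zip(flaps, flaps[1:]):
    print(f"down for {(up - down).total_seconds():.0f}s, "
          f"next flap {(next_down - up).total_seconds() / 60:.1f} min later")
```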

The first time I have seen an in-service passive TwinAx cable failure - and not the first place I'd think to check when a DBA reports a performance issue with a database!

If nothing else, this goes to show that in modern Enterprise IT, you need to have at least a rudimentary handle on all of the infrastructure that knits together to make up your environment. I don't think it is good enough any more to just be a server guy or a storage guy or a network guy if you want to really excel in any of those areas - especially as things become more and more converged.
 
Wouldn't have thought of it.

It's the old "but have you checked..." that comes with experience. I try my best to optimise performance as a system admin guy, but you can't be an expert in every field. Sometimes these things crop up and you have to do Columbo-level investigative work to get to the bottom of them. Good work though :)
 
I happen to work with a DBA who gets off on shaving seconds from his queries; he's never once tried to blame dodgy hardware. Our developers though, they are a nightmare for that :D
 
I've been trying to convince our Head of IT of exactly this - that you can't just be a server guy or a network guy any more - for a long time.

The way things are going (especially with desktop and application virtualisation), it's not even feasible to split up desktop and server support. Both need a working knowledge of the other in order to troubleshoot problems. I'm seeing it in job applications too - adverts for server engineers now have a huge list of desirable skills ranging from desktop to networks.
 
All I see here is that you clearly do not have sufficient monitoring in place; otherwise you'd have alerting configured that would already have flagged up a bad port on a switch. Doesn't appear "enterprise", but then that phrase is thrown around far too easily.
 
Define enterprise? We're quite small, only 10,000ish users across 60 countries with an annual turnover of billions. Maybe that doesn't qualify by your standards but it does by mine :)

I've never yet worked in an absolutely perfect environment! Monitoring is far from perfect no matter what you implement - we do monitor switchports (of course), but very short failures are more difficult to catch unless you're monitoring in a certain way. You've obviously got tiers of importance as well; some things are monitored much more closely than others. Monitoring isn't yet in my remit; it's getting better all the time though.
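To put a number on that: a status poller only catches a flap if a sample happens to land inside the outage window, so the catch rate is roughly outage length divided by polling interval. A quick illustrative simulation (all the numbers are assumptions) - which is why you want the switch pushing syslog/traps at you rather than relying on polling alone:

```python
import random

POLL_INTERVAL = 60.0  # seconds between status polls (assumed)
OUTAGE_LEN = 0.8      # seconds the link is down per flap (assumed)
TRIALS = 100_000

# A flap is caught only if the poll instant lands inside the outage.
caught = sum(
    random.uniform(0.0, POLL_INTERVAL) < OUTAGE_LEN
    for _ in range(TRIALS)
)
print(f"caught {caught / TRIALS:.1%} of flaps "
      f"(expected ~{OUTAGE_LEN / POLL_INTERVAL:.1%})")
```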

As always on OcUK, you get someone wading in with a point that wasn't the point of the thread at all :)
 
You class 10k users as small? I can't tell if you're joking or not.
 
Yeah, I'd consider 10k users pretty small. It's when you're into the 100,000+ range that things get a bit bigger.


Considering what you're doing, it sounds like fine diagnostics/analysis to me. :confused: Worst-case scenarios, DRA and "what if" scenarios are usually converged across network, server, client and sometimes power too, as best practice at our place. Sometimes convergence can help isolate performance issues and capacity problems, and improve the overall performance of the network.

Proactive response also helps, but sometimes it's difficult to find the time if you're constantly fighting fires or faced with an argument between client-side and server-side teams. Dare I say the name Dell here - they have been the direct cause of some issues similar to the OP's, with their "specialist" range of contracted cowboy engineers. They are meant to be "enterprise", but it always boils down to the engineer you get and whether they have REAL hands-on experience dealing with enterprise (or even business) customers at all. Good communication between teams before convergence will ALWAYS trump everything else, IMHO.
 

10k users is by no means small!
 
I think that depends entirely on a few factors, and it influences the way you think about it. I've worked with more than 10,000 users in a city-sized area with more than 100 sites - that's a very different proposition to a global organisation with pockets of users ranging from fewer than 5 to more than 2,000. The latter is a different challenge to the former (and I personally prefer it, because it means I get to jet all over the world! :D)

This also depends on how you classify users. I just count users as internal resources that I need to keep fed and watered so they can perform their jobs. That is the side of what we do that I tend to (as part of our global team, of course). What I'm not directly responsible for is the millions upon millions of people who use our services every single day. It would be unrealistic to suggest they make up our "user count", but they do matter!
 
Small is under 100 users; small-to-medium varies from 100+ to 250+ users; enterprise is 1,000+ users.

We haven't got anything that means looking after more than a few thousand yet. The contracts that have come up with five-figure user counts tend to be uneconomical, because those clients want to cut lots of unusual corners that only a few vendors in the market can actually accommodate. Good luck to them if they think they can get away with it, but from my perspective it's more hassle to take them on than not. Plus you end up doing more and more to please clients who can sink the company by not paying up on time; it's not worth the risk for the relatively small upside.

The other problem with these kinds of contracts is that if one of the huge competitors (Microsoft Azure etc.) comes up with an exact fit for what they need, it will be cheaper than what we can offer, and they will be out the door as soon as the contract is over - again, not worth the risk for the smallish margins. Our margins come from value-add work, which is our bread and butter, not tier 1 and 2 tech support for end users.
 