Network slows down then regains speed

At work we have one file server which connects around 15 workstations.
Its main function is to run the server version of a booking/notes/x-rays system.
We have noticed, especially more so recently, that the system grinds to a virtual standstill several times per day before resuming normal function. It is most frustrating, and I am unable to figure out what leads to the slowdowns.

The server itself is an HP tower with a single processor. The drives are in 2 separate banks of RAID: one handles all the main data (records, bookings, paper-notes equivalents) and the other bank is dedicated to digital images. Each is backed up separately each night, and this process completes without problem, well before anyone arrives back the next day.


What are the likely bottlenecks in such a system, and how am I likely to relieve them? There is a hardware company involved, but they are muppets. There is also software support, who have investigated and seem to think it isn't anything to do with them. When the system is grinding to a halt, I have viewed the server in Task Manager and nothing appears to be stealing major CPU cycles.
 
I would look to HW basics first, such as faulty RAM, PSU voltages and CPU/HD/memory temps.

Then capacities, such as lack of space on HDs/tape drives/page file, and whether there is enough memory.

Then Event Viewer, logs etc. would be a good start.
 
I'd do it the other way around: I'd be looking at HDD space, memory usage etc. first, as it's more likely to be something like that than a physical hardware problem.
 
Yup, hardware faults like dodgy memory are more likely to cause crashes etc. than slowness.

Rather than just looking at Task Manager, get some proper performance logs going, showing CPU, memory, paging, IO, network and so on. If it's slowing down, it's running out of some resource or another.
 
What software should I use to gauge and record the performance?
I won't be at the server much; more likely at a workstation, viewing the server before work, at lunchtime and after work.
 
It sounds like you're using Windows, so just set up Performance Monitor and leave it running for a couple of days. Ask some of the users to note down times when they experience slowdowns.

Some good counters to start with would be CPU, available memory, disk queue and TCP retransmissions.
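Once you've got a counter log, you can export it to CSV and line the spikes up against the times users reported slowdowns. As a rough sketch (the log excerpt and column names below are made up purely for illustration, not from a real export):

```python
import csv
from io import StringIO

# Hypothetical excerpt of a Performance Monitor counter log exported to
# CSV (e.g. via typeperf or relog). Values are invented for illustration.
SAMPLE = """\
Time,% Processor Time,Available MBytes,Avg. Disk Queue Length
09:00:00,12,850,0.4
09:15:00,18,820,0.7
09:30:00,35,790,14.2
09:45:00,15,830,0.5
"""

def flag_spikes(csv_text, counter, threshold):
    """Return timestamps where the given counter exceeded the threshold."""
    reader = csv.DictReader(StringIO(csv_text))
    return [row["Time"] for row in reader if float(row[counter]) > threshold]

# Compare these timestamps against the users' notes:
print(flag_spikes(SAMPLE, "Avg. Disk Queue Length", 2.0))  # ['09:30:00']
```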
 
When the system slows down the DISK QUEUE figures go off the chart.
CPU is up a bit but not loads, and RAM seems fine at this point also.
How would I investigate what is triggering the DISK QUEUE hammering?
 
Sounds like you've found the symptoms then. High disk queue length numbers mean the operating system is waiting on the disk subsystem to complete IO operations (reads/writes) and having to queue them because it can't complete them quickly enough, which is pausing threads on the server, which is in turn servicing clients. This doesn't necessarily mean there's anything wrong; it could just be the load being asked of the server.

Some things to check....

* You say you have 2 RAID arrays; check that neither array is generating errors because a disk has died. This would make it very slow, but you'd likely see it more often than a couple of times per day as stated.
* Check the disk queue length against both RAID arrays and see which one is bottlenecking your performance; it's unlikely to be both.
* What disk queue length numbers are you seeing? (We need to know how many disks to make sense of this number.)
* Once you've determined this, the easy answer is to upgrade that array (more disks, not necessarily more capacity, and faster disks), but it's worth determining root cause by seeing what data is held on the disk, when it is accessed, and by how many people. What's the pattern?
* Can you let us know how many disks of what type/speed/capacity, and what server, please.
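For context on why the number of disks matters: a common rule of thumb (a general guideline, not something specific to this thread) is that a sustained average disk queue length above about 2 outstanding I/Os per spindle points to a disk bottleneck. A minimal sketch of that check:

```python
def disk_bottlenecked(avg_queue_length, spindles, per_spindle_limit=2.0):
    """Rule of thumb: a sustained average queue length above ~2
    outstanding I/Os per physical disk suggests the array can't
    keep up with the load being asked of it."""
    return avg_queue_length / spindles > per_spindle_limit

# A 2-disk mirror averaging 14 outstanding I/Os is struggling:
print(disk_bottlenecked(14.0, 2))  # True
# The same array idling along at 1.5 is fine:
print(disk_bottlenecked(1.5, 2))   # False
```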
 
Neither array is generating errors from what I can see.

I believe the smaller of the arrays, which also happens to be the primary OS one, seems to be where the queues are occurring, but I am unsure how to investigate exactly which processes are causing the queue. How would I go about determining this?

The Scale is marked as 100.000 and the graph is going full on the bar. There are 2 drives, a basic mirror with 2 x 160GB hard drives, of which there is about 30GB of free space.
The second RAID has 2 x 750GB discs and several hundred GB free.

The server is an HP ProLiant ML350 with a Xeon [email protected] and 3GB RAM, with ATI ES1000 graphics.
Networking is provided by an HP ML373 multifunction gigabit server adapter going to a router/switch/thingy.
RAIDs are
LSI adapter Ultra 320 SCSI 2000(W/1020/1030)(StorPort) and
Smart Array E200I controller


Hope this proves useful information.
 
OK, so it sounds like your issues are coming from the OS RAID set. To find out what's causing it, you're going to have to do a little investigative work. There are a number of ways of doing this, but I'd probably try the following:

* Add the I/O columns to Task Manager and see if that tells you anything.
* Try FileMon or Process Monitor from Sysinternals to see what processes are using what files.
* Check the memory paging counters for activity. If the server is low on memory it will be paging to disk. Possible cause.
* If all that gets you nowhere, then maybe try PAL from Microsoft, which can analyse your logs and indicate issues.

Sometimes this type of thing is about following your gut and digging through some probable reasons, discounting them before striking gold :)
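If you save a Process Monitor capture as CSV, a quick tally of events per process will often point straight at the culprit. This is only a sketch: the excerpt, paths and process names below are invented, and the columns mirror ProcMon's defaults rather than an exact schema:

```python
import csv
from collections import Counter
from io import StringIO

# Illustrative excerpt of a Process Monitor export (File > Save > CSV),
# filtered to file I/O events on the busy array. All rows are made up.
SAMPLE = """\
"Time of Day","Process Name","Operation","Path"
"09:30:01","sqlservr.exe","WriteFile","C:\\Data\\bookings.mdf"
"09:30:01","sqlservr.exe","ReadFile","C:\\Data\\bookings.mdf"
"09:30:02","ntbackup.exe","ReadFile","C:\\Data\\notes.db"
"09:30:02","sqlservr.exe","WriteFile","C:\\Data\\bookings.ldf"
"""

def io_by_process(csv_text):
    """Count I/O events per process to see who is hammering the disk."""
    reader = csv.DictReader(StringIO(csv_text))
    return Counter(row["Process Name"] for row in reader)

print(io_by_process(SAMPLE).most_common())
# [('sqlservr.exe', 3), ('ntbackup.exe', 1)]
```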
 
The Scale is marked as 100.000

That's a generic scale and doesn't apply to the disk queue length. Look at the 'last' 'average' 'minimum' and 'maximum' at the bottom of the graph.

You haven't actually said what OS this is either. SBS by any chance...? A default install will almost certainly be paging with 3GB RAM.
 
Aye, the OS is 2003 SBS afaik.

I need time to actually look at the server stuff; unfortunately my job gives me virtually zero free time during the 9-1 and 2-5 periods, and I am not going to be remunerated for anything I find.
I will try to determine the cause, as it would make my working time easier if I discover the issue.
 
You might want to look at limiting the amount of memory the SBS Monitoring service uses for its SQL instance. The command is osql something or other; a quick Google should reveal it (I usually cap it to 100MB, and it could be using anything up to 1GB). If you've got Premium then do the same for the ISA instance which, again, will be having a field day with your RAM :)
On lesser boxes, I've just turned off SBS Monitoring completely in the past and it makes a marked improvement in terms of performance.
 