Computer crash when processing large amounts of data

KrazeyKami · 31 Jan 2011 at 12:58

Hi all,

As a short introduction to my tech level: I’ve been a system manager for the past 8 years, and I’ve been playing around with computers for nearly 20 years.
Until recently, I had never faced a problem with my own computer that I couldn’t solve or at least diagnose. That’s why I need your help, and a fresh look at things.

First, my setup:

Motherboard: ASUS P6T SE
Motherboard BIOS: v.0808
CPU: Intel Core i7-920 @ 2.66 GHz (stock)
CPU Heatsink: Prolimatech Megahalems + 2 Cooler Master 120mm fans
CPU Idle Temp: 30-35 degrees per core
CPU Load Temp: 45-50 degrees per core (Prime95)
Memory Part Number: 6 GB OCZ Gold PC3-8500U + 6 GB OCZ Gold PC3-10700U (Both OCZ3G1333LV6GK @ 1066 MHz, Stock)
Memory Voltage: 1.65 V
Video Card(s): Asus nVidia GTX295
Sound Card: Sound Blaster Fatal1ty X-Fi
PSU Model Number: Cooler Master 1000W Real PowerPro
Hard Drive(s): Intel X25-M SSD 80 GB (OS/Boot), 4x 2TB HD204UI Samsung Spinpoint F4 (all on ICH10 / SATA2)
Optical Drive(s): GGW-H20L Blu-ray RW
Other Cooling: Cooler Master Stacker 831 case with 6x Cooler Master 120mm fans
Operating System: Windows 7 Professional x64

This machine is running 24x7, rock solid. I never had any issues with performance or stability.

Now, for the problem:

Until a few weeks ago, I didn’t have the 4x 2TB drives in it.

I bought the drives, made a RAID5 array, using the ICH10R, and that’s when the problems began:
When copying (or downloading) large amounts of data to the array, and so creating high I/O on any of the 2TB drives, my computer suddenly reboots. There are no blue screens (although that option is switched on), and the only entries in Event Viewer are Event ID 6008 (the system rebooted unexpectedly) and Event ID 41 (possible cause: power failure).
No error codes, no nothing.
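
For anyone chasing the same symptom, the two events mentioned above can be pulled out of the System log in one go. This is only a minimal sketch, assuming the built-in `wevtutil` tool on Windows; the helper names and the default of 10 events are made up for illustration and are not something the original post used.

```python
import subprocess

# Event IDs named in the post: 41 (Kernel-Power) and 6008 (unexpected shutdown).
UNEXPECTED_REBOOT_IDS = (41, 6008)


def build_query(event_ids=UNEXPECTED_REBOOT_IDS, max_events=10):
    """Build a wevtutil command listing the newest matching System events."""
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on the event ID
        "/f:text",            # human-readable output
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # newest events first
    ]


def recent_unexpected_reboots(max_events=10):
    """Run the query (Windows only) and return the raw text output."""
    result = subprocess.run(
        build_query(max_events=max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `recent_unexpected_reboots()` on the affected machine should list the timestamps of the sudden reboots, which can then be lined up against when the copy jobs were started.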

I decided to break up the array and see what happens when I copy 1.5 TB of data from one drive to the next: the computer reboots again. The drives are now single SATA2 drives, each with one 2 TB partition. After 10-80 minutes, the system reboots without notice or error.

So, I started to systematically remove drives and test with the remaining ones.
Regardless of which drive is the source or the destination (I tried all combos, in all directions), the system reboots when a large amount of data is moved. This happens not just when copying existing data, but also when downloading while repairing (.PAR) and extracting files at the same time.

I tried the following:

Check for overheating: All values are well below 40 degrees C; (also checked drives);
Even put an active cooler on my southbridge (The ICH10);
Memory checks. Ran 3 different programs to check / test my memory, ran overnight for hours and hours, multiple passes, 0 errors.
Remove all other hardware, except for the absolute minimal necessary;
Swap / replace power cords, SATA cables, even rotate drive position on the SATA connectors;
Reset BIOS settings;
Reinstall Windows 7 (delete the entire 80GB partition on the SSD and reinstall, no other tweaks, but right after install, start the copy transactions) to rule out the possibility of faulty software and / or drivers;
Reformat / create the drives / partitions: Tried both MBR and GPT partition; Different block sizes;
Turn off Write Back Cache, to even further rule out a problem with my RAM;
Calculate the PSU needs; I tried multiple programs, even a paid one, and counted manually: Granted, on a full Direct3D load (games on high etc.), my GFX card needs around 450 Watts. This makes a grand total of 950 Watts. However, the problems occur while idle in Windows, where my GFX card consumes 100 Watts at most, making a total of (roughly) 600 Watts, well within the limits of my 1000W PSU;
CPU check / Prime95; runs for days, stable, without a single error;
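
The arithmetic behind the PSU estimate in the list above, written out as a rough sketch. All four wattages come from the post; the "rest of the system" figure is only derived from them (950 minus 450) and probably overstates what the rest draws on an idle desktop, so treat the headroom as a lower bound.

```python
PSU_RATING_W = 1000      # Cooler Master 1000W, from the post
GPU_FULL_LOAD_W = 450    # GTX295 under full Direct3D load, from the post
GPU_DESKTOP_W = 100      # GTX295 on the Windows desktop (upper bound), from the post
TOTAL_FULL_LOAD_W = 950  # grand total under full load, from the post

# Everything except the graphics card, derived from the two figures above.
REST_OF_SYSTEM_W = TOTAL_FULL_LOAD_W - GPU_FULL_LOAD_W


def estimate(gpu_watts):
    """Return (total draw, headroom) in watts for a given GPU draw."""
    total = REST_OF_SYSTEM_W + gpu_watts
    return total, PSU_RATING_W - total


for label, gpu in (("full load", GPU_FULL_LOAD_W), ("desktop", GPU_DESKTOP_W)):
    total, headroom = estimate(gpu)
    print(f"{label}: {total} W drawn, {headroom} W headroom")
```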

The facts:

The problem occurs only (and only) when copying / downloading a large amount of data;
The system runs flawlessly under high load (playing games, watching movies, running programs etc.);
I never had this problem before, but then again, I never had the space to start downloading 250 GB of data, or copying 1.5 TB of data to other drives (I didn’t even have 1.5 TB ^^).

I am able to reproduce a “fast” reboot error:
I created 3 separate batch files, which basically tell Robocopy to copy data from:

Drive D: to Drive E:
Drive D: to Drive F:
Drive D: to Drive G:

When I run these scripts separately, the system runs for a while but still reboots / crashes after an hour or so.
When I start these scripts all at once, it reboots within 5 minutes.
Remember that I had already cloned the drives, so I could rotate the source drive and systematically remove / switch a destination drive, thus trying all the different combos (and checking whether one of the drives might be faulty).
This also makes me doubt that it’s the sheer size of the data that causes the problem, because within 5 minutes not even 100 MB has been processed, and still it reboots.
This is making me think that the PSU might be the problem. As soon as these drives are actively called upon, I can imagine a sudden increase in load on the 12V rail, enough to overload it... Although my PSU has 6x 12V rails, I'm not too convinced this works as well as people say; there are many discussions on the web about the use of 6x 12V rails. Could it be that my 12V rail is maxed out? Consider that it's giving power to: the mobo, the CPU (4/6 pins, can't remember), the GFX card (both 6 and 8 pins), 5 drives (SSD + 4x 2 TB), an optical BD-RW, and of course the onboard devices (sound card) and USB devices (headset, webcam).
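
The "all three copies at once" reproduction described above could be sketched roughly as follows. Note the swap: this uses Python's `shutil.copytree` in one thread per destination instead of the three Robocopy batch files from the post, and the function names are invented for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree(source, destination):
    """Copy one directory tree and return the destination path."""
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return Path(destination)


def copy_to_all(source, destinations):
    """Copy `source` to every destination at the same time, one thread each."""
    with ThreadPoolExecutor(max_workers=len(destinations)) as pool:
        futures = [pool.submit(copy_tree, source, dest) for dest in destinations]
        # result() re-raises any copy error instead of hiding it.
        return [future.result() for future in futures]
```

A call such as `copy_to_all(r"D:\data", [r"E:\data", r"F:\data", r"G:\data"])` would mirror the D: to E:, F: and G: setup; the `data` folder name is a placeholder, since the post does not say which folders were copied.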

I am all out of ideas. If there is something that I haven’t checked / tried, please tell me. I think I wrote down everything I tried thus far; maybe I missed something, but I’ve been testing and trying for 3 weeks now.

For now my conclusion / suspects are:

The motherboard. Either a chip in the ICH10 was fried, or the SMBus got a dent;
The motherboard (or ICH10 / SMBus) is just not capable of processing such large amounts of data.
My PSU. Mainly, the 12V rail. It could be (maybe) that my 12V rail is loaded to its maximum, and that the extra drive activity makes the load fluctuate and tips it just above that maximum, thus crashing my computer. Looking at the symptoms (sudden reboots without any errors), it might be a more plausible cause than all the other things I tried. And yet, if you look at my hardware setup, I cannot imagine I've reached the 12V max.
However, I will be testing this week by taking the PSU from a second desktop and connecting my hard drives to that power supply. Or maybe I'll even swap my GTX295 for an el cheapo GFX card and see if the computer stays stable during copying…

I’ll post the results shortly. In the meantime, if anyone has seen this problem before, or has other ideas / solutions to try, please let me know here and I’ll try them.

P.S.:
According to the PSU calculator, this is the recommended Watts / Amperage for my setup with a full 100% load; Below that is a table of power my PSU can handle:

AFAIK, I can add the 12V rails together, so my PSU can handle max. 128 A on the 12V rail?
This is my PSU: http://www.coolermaster.com/product.php?product_id=2519

Does this mean my PSU should have more than enough power?
I'm still gonna try with a different PSU or GFX card to be sure, but according to the above I think all should be covered... Any thoughts?
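
A quick sanity check on adding the rails together, using only the 128 A and 1000 W figures from this post. It is just arithmetic, not the PSU's real specification: on multi-rail units the per-rail numbers are usually over-current limits, and the combined 12 V capacity is normally a separate (lower) figure on the label, which is not reproduced here.

```python
RAIL_VOLTAGE_V = 12
SUMMED_RAIL_LIMIT_A = 128   # per-rail limits added together, from the post
PSU_RATING_W = 1000         # total rating of the unit, from the post

summed_rail_power_w = RAIL_VOLTAGE_V * SUMMED_RAIL_LIMIT_A
# Upper bound on 12 V current if the whole rating went to the 12 V side.
max_12v_current_a = PSU_RATING_W / RAIL_VOLTAGE_V

print(f"{SUMMED_RAIL_LIMIT_A} A at {RAIL_VOLTAGE_V} V = {summed_rail_power_w} W")
print(f"A {PSU_RATING_W} W unit can supply at most {max_12v_current_a:.1f} A at 12 V")
```

Since 1536 W is more than the whole unit is rated for, the summed figure cannot be the real 12 V capacity; the usable total has to be somewhere below roughly 83 A.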

Many thanks in advance for thinking with me.

Kind regards,
Kami.

JonJ678 · 31 Jan 2011 at 14:10

My 12 GB X58 system does something very similar when the QPI voltage is too low. Your OP suggests the system is at stock volts; stock voltage will be fine for 6 GB of RAM but may not be for 12 GB.

My system also passed Intel Burn Test / memtest quite happily, but fell over if I moved 8 GB+ of data into RAM. Thoroughly irritating fault to track down. More QPI voltage resolved it.

Your PSU is probably fine, though six 12V rails is not what you want to see written on it really. Trying a less power-hungry GFX card could help eliminate this. Alternatively, if the PSU is struggling, running Intel Burn Test and FurMark at the same time should cause it to switch off.

I feel bad for writing such a concise reply when you've put considerable effort into your OP, but it's hard to see the description as anything other than the same instability I spent months hunting down.

matthab · 31 Jan 2011 at 14:14

You're trying to use an ICH10R chip for RAID5?

Try using a PCI RAID card. Onboard RAID chips can't cut the mustard and do exactly what you're describing.

KrazeyKami · 31 Jan 2011 at 14:51

@JonJ678;
Thank you for your reply. It is most certainly helpful, since I hadn't considered the QPI voltage yet. It would explain why the problem occurs when moving large chunks of data.
I will give it a shot tonight and remove 3 DIMMs, just to make sure and see what happens. I will look into adding QPI voltage; this sounds a bit to me like voltage OCing, something I'd rather not do (although I managed to OC the system to a stable 3.6 GHz, and 1339 MHz RAM). I will post the results ASAP.

@matthab;
Thank you for your reply. However, I'm no longer using RAID5, since I figured the ICH10R wasn't cutting it and was the culprit. But running the disks as single SATA drives still gives the same problems. I did manage to make a stable build, by the way, with RAID5 on my setup (with some nice speed results). If you're interested, read this post:

http://forums.overclockers.co.uk/showthread.php?t=18224121

Kind regards,
Kami.

KrazeyKami · 1 Feb 2011 at 08:02

JonJ678, you are a hero!

I have 2 different sets of RAM:
6 GB OCZ Gold PC3-8500U + 6 GB OCZ Gold PC3-10700U (Both OCZ3G1333LV6GK)

I bought the first 6 GB at a certain date, and later on I ordered the other 6 GB. The sets differ from each other, but OCZ claims this is exactly the same memory, although newly revised. They upped the SPD for the later sets of 6 GB:

http://www.ocztechnologyforum.com/f...shows-up-as-OCZ3G1333LV2G-per-module-Confused...

Anyway... I removed one set of 6 GB RAM, and presto... rock solid again!! I started copying 1.5 TB from 1 drive to 2 other drives, AND in the meantime started my downloads, including repairing + extracting the files... it ran all night, no problems. Usually this was enough to have it crash within 5 minutes.

Now, I wonder. Tonight, I'll try the other 6 GB and see whether it stays stable.
And then of course I need to have both sets run together.

Do you think that entering the settings manually in the BIOS (non-OC) would make a difference?

The memory sets seem to run on different SPD's:

http://www.yelong.nl/images/memory2.jpg

These are 2 DIMMs; the top one is the 'older', the bottom one is the 'newer'.

Look at the difference in timings, MHz, etc.
Yet OCZ claims this memory works perfectly together.

I placed this question on OCZ's forum as well, but maybe you can give your opinion on this. What would be the best way to make these DIMMs run stable at the same speeds?

P.S.:
In my older post on the OCZ forum, they told me to set the timings to 9-9-9-20. I did, and everything ran (more or less) stable @ 1339 MHz.

But I can't help but get the feeling that these modules will never run in sync, because of the big differences between timings and MHz.

Any thoughts on this?

Kind regards,
Kami.

KrazeyKami · 2 Feb 2011 at 08:05

Just an update on the issue:

I've tried the 6GB sets separately from each other. It seems that with 6GB, I can stress my PC all I want (high I/O, downloads, etc.) and it runs stable without crashing.
So, this rules out the possibility of faulty RAM.

It could very well be that (one of) the other 3 RAM sockets is broken / damaged. I'm just not sure if I can start by filling the A2, B2, C2 slots. I read that (and AFAIK it always was like this) you should start with A1, B1, C1 in the case of triple-channel memory. Anyway, let's assume the slots are fine. I intend to test this tonight, though.

I made a post on the OCZ forum a few days ago, no reply yet: http://www.ocztechnologyforum.com/f...3-10700U-(Both-OCZ3G1333LV6GK)-on-ASUS-P6T-SE

I tested the RAM sets separately and checked the settings / voltages etc. with CPU-Z. This is the result:

3x 2GB PC3-8500F:

3x 2GB PC3-10700F:

I notice that, indeed, the voltage is set to 1.5 volts by Auto.
Furthermore, I notice that they run on different timings: 7-7-7 vs 8-8-8.

If I look in the Timings Table, I see that, when comparing e.g. the 8-8-8 settings, the two sets have different frequencies, different tRAS, different tRC and different tRFC values.

I'm not sure if this would be an indicator of the problem with combining these DIMMs.
Does this mean they'd run on different, out-of-sync settings, or are they forced to the same settings? Also, the sticker on my memory says 9-9-9 @ 1.65 V.

I hope OCZ will respond soon to these questions. Basically, I'd like to know what the right settings are to enter manually in the BIOS, and whether they will give any problems with the 2 sets combined.

Oh, I also got a call from an Asus support agent last night. He confirmed that OCZ memory is known to get the wrong / too low voltages from the Auto settings. He recommends manually setting the voltage to 1.65 V. Still, I'd like OCZ to confirm this and also give me the rest of the settings.

I'll keep you all posted, and if someone has some facts on this, I'm all ears / eyes!

Kind regards,
Kami.

p.s.:
Just for the record: it seems this problem has nothing to do with my HDs, Silent Data Corruption, the ICH10(R) or any RAID5 constructions. That was just where I started my search to ID the problem.

JonJ678 · 3 Feb 2011 at 01:26

One thing there is easily answered: when using both sets at once you'll want to run it all at the slower 8-8-8 timings. All six sticks will run at exactly the same settings, which should be the slower of the two sets in each case. It is unlikely that it'll need 1.65V, but I do sympathise with Asus for giving an easy answer. The RAM won't care if it's run at 1.65V, but your processor *might do*. There's a rule of thumb which says to keep RAM voltage and QPI voltage within 0.45V of each other, so if you put the RAM up to 1.65V you should make sure the QPI is on at least 1.2V.
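
The "slowest of the two sets in each case" idea can be sketched as below. It is a simplified illustration only: the 7-7-7 and 8-8-8 timings and the 533 / 667 MHz clocks are the values mentioned in this thread, but which set reports which timings is an assumption, the tRAS numbers are placeholders rather than readings from these modules, and a real BIOS also has to settle tRC, tRFC and the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    clock_mhz: int  # memory clock as CPU-Z reports it (half the DDR rate)
    cl: int
    trcd: int
    trp: int
    tras: int


def common_profile(a, b):
    """Slowest clock and loosest (largest) timing of each pair."""
    return Profile(
        clock_mhz=min(a.clock_mhz, b.clock_mhz),
        cl=max(a.cl, b.cl),
        trcd=max(a.trcd, b.trcd),
        trp=max(a.trp, b.trp),
        tras=max(a.tras, b.tras),
    )


# Assumed pairing of timings to sets; the tRAS values are placeholders.
set_a = Profile(clock_mhz=533, cl=7, trcd=7, trp=7, tras=20)
set_b = Profile(clock_mhz=667, cl=8, trcd=8, trp=8, tras=22)

print(common_profile(set_a, set_b))
```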

On the old 775 platform, increasing northbridge voltage to regain stability with four sticks of RAM was standard practice. I do wonder what people using Intel / other non-OC motherboards do. Increasing the QPI voltage slightly is very unlikely to do any damage, and if it brings stability with 12 GB of RAM then it's worthwhile.

Keep QPI below 1.35V and you'll be within the bounds of the most timid overclocker. I'd go so far as to say that if it fails at 1.3V, it would have failed at stock anyway.

Otherwise it's possible you'll get away with running the RAM at a slower frequency, which won't make a measurable difference to performance anyway. I'm using 1600 MHz RAM at 1200 MHz because it helped with stability using six sticks. I do feel foolish for paying a premium for faster RAM only to run it slowly, however.
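
The two voltage rules of thumb from this post (RAM within 0.45V of QPI, QPI kept below 1.35V) can be written as a small check. A sketch only: the function name is made up, the values are in millivolts to avoid floating-point rounding at the 0.45V boundary, and these are forum rules of thumb rather than Intel's specification.

```python
MAX_RAM_QPI_GAP_MV = 450   # rule of thumb quoted above
QPI_CEILING_MV = 1350      # conservative ceiling quoted above


def check_voltages(ram_mv, qpi_mv):
    """Return a list of rule-of-thumb violations (empty if none)."""
    problems = []
    if ram_mv - qpi_mv > MAX_RAM_QPI_GAP_MV:
        problems.append(
            f"RAM is {ram_mv - qpi_mv} mV above QPI (limit {MAX_RAM_QPI_GAP_MV} mV)"
        )
    if qpi_mv >= QPI_CEILING_MV:
        problems.append(f"QPI at {qpi_mv} mV is not below {QPI_CEILING_MV} mV")
    return problems
```

For example, `check_voltages(1650, 1200)` returns an empty list, matching the 1.65V RAM / 1.2V QPI pairing above, while `check_voltages(1650, 1100)` flags the 550 mV gap.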

KrazeyKami · 3 Feb 2011 at 13:42

Hey Jon,

Thanks for your reply. I had a response from OCZ with the recommended settings:

http://www.ocztechnologyforum.com/f...3-10700U-(Both-OCZ3G1333LV6GK)-on-ASUS-P6T-SE

I tried everything, but nothing worked. An interesting note is that the Auto settings actually work, as long as the same 'type' of memory is used.
In fact, with the settings entered manually and altered a bit, the system still remains stable, no matter which set of memory is used.

I'm on another track of testing, because I suspect that the memory is of a different type. OCZ claims this is exactly the same stuff, but I disagree.

Tonight I'll test what happens when I fill each DIMM channel with the same type of memory: A1 + A2 = 10700, B1 + B2 = 8500. Then I'll fill C1 with 10700 and, if everything is still running smooth and stable, I'll fill C2 with 8500. If the system crashes again at that point, I think it's proven that the memory is of different types / speeds and gives an error when mixed in a channel.

Because OCZ and my local reseller both claim it's the same type (although a newer revision according to OCZ, and a completely different type of memory according to CPU-Z, SiSoft Sandra and Everest), I hope / think it's fair that they let me trade in the 8500 memory for 3 DIMMs of 10700 memory. It's the same in their book anyway.

I'll post here when I have some more info.

KrazeyKami · 3 Feb 2011 at 22:05

Well, I inserted the memory as stated before.

My computer was running rock solid with 10700 in A1,A2, and 8500 in B1,B2. CPU-Z confirmed the memory was running @ Dual Channel, and at the lowest speed (533 MHz).

After that, I added 10700 to C1... CPU-Z confirmed the memory was running @ Triple Channel. Guess what happened? After starting the data copy, it crashed almost instantly. After that, I removed the DIMM from C1 and added it to C2. It crashed. I removed the DIMM from C2 and added the 8500 DIMM to C1; it crashed. I removed it and placed it in C2; it crashed.

After confirming it crashes in Triple Channel *Mixed*, I went ahead and tested it on Dual Channel *Mixed*. The rig remained stable.
I even took it a step further and inserted the memory in A1,A2 and C1,C2, just to rule out defective DIMM slots. The rig remained as solid as ever.

So all in all, what does this mean?
This proves that, for some reason, the memory won't run stable when mixed @ Triple Channel. The DIMM slots are not the culprit, nor is the Triple Channel mode by itself. It only crashes when the 2 types of memory are mixed @ Triple Channel. @ Dual Channel, the 2 types run solid.

It freaks me out. Why would this memory run:
- stable when not mixed @ Dual Channel;
- stable when not mixed @ Triple Channel;
- stable when mixed @ Dual Channel;
- unstable when mixed @ Triple Channel?

AFAIAC, this proves that the memory is not stable when mixed in Triple Channel.

OR: The rig just can't handle more than 4 DIMMs @ the standard voltages. Now, I tried to up the voltages etc., as elaborately described in my previous posts, so I don't think that is the problem.

We will know for sure once I have my 6 DIMMs of 10700 memory.
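
The test results so far, and the two competing explanations, can be laid out as a small truth table. This is only a sketch of the reasoning: the DIMM counts are read off the configurations described in this thread, and the names are made up. It shows that both explanations fit everything observed up to this point, and that a run with six matched DIMMs is the test that tells them apart.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    dimms: int
    mixed: bool
    triple_channel: bool
    crashed: Optional[bool]  # None = not tested yet


# Configurations reported in the thread so far.
RUNS = [
    Run(dimms=3, mixed=False, triple_channel=True, crashed=False),
    Run(dimms=4, mixed=True, triple_channel=False, crashed=False),
    Run(dimms=5, mixed=True, triple_channel=True, crashed=True),
    Run(dimms=6, mixed=True, triple_channel=True, crashed=True),
]

HYPOTHESES = {
    "mixed sets in triple channel": lambda r: r.mixed and r.triple_channel,
    "more than 4 DIMMs": lambda r: r.dimms > 4,
}


def consistent(predicts_crash, runs):
    """True if the hypothesis predicts every observed outcome."""
    return all(predicts_crash(r) == r.crashed for r in runs)


for name, predicts_crash in HYPOTHESES.items():
    print(f"{name}: fits so far = {consistent(predicts_crash, RUNS)}")

# The planned test: six matched PC3-10700 DIMMs, outcome not known yet.
planned = Run(dimms=6, mixed=False, triple_channel=True, crashed=None)
for name, predicts_crash in HYPOTHESES.items():
    print(f"{name}: predicts a crash with 6 matched DIMMs = {predicts_crash(planned)}")
```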

KrazeyKami · 5 Feb 2011 at 10:10

Hi everyone,

Yesterday I got a call from my reseller, telling me he had the new memory DIMMs. I took them home, plugged them in and confirmed it was PC3-10700 memory. I returned the PC3-8500 memory to the reseller. I now have 6 DIMMs of PC3-10700 memory in my computer.

The system is running rock solid. I copied many TBs while downloading, playing games and running VMware; after that I ran Memtest and Prime95 overnight, and all is stable.

Now, to me this proves that the problem was the PC3-8500 memory being mixed with the PC3-10700 memory. Of course OCZ still claims that this memory is the same, but the fact is that the memory showing up as PC3-8500 is NOT COMPATIBLE with the memory showing up as PC3-10700. All these tests and real-life findings prove it.

So, no manual settings, no voltage tweaks, no faulty memory (as the PC3-8500 was tested for hours, 0 errors, AND ran fine on its own).

In the end the problem was exactly what I suspected:
They are 2 different types of memory, running on different SPDs and NOT able to run stable together under high loads on the RAM.

Hopefully this will help other people who are having trouble with 2 types of OCZ memory that 'should be the same'.

This thread can now be closed, as the problem is resolved by replacing the 'older revision of the same type of memory'.

Thank you, Jon, for pointing me towards checking the memory.
Even though I ran the memtests and they all came up clean, I'd never have come up with the idea of removing a set if you hadn't told me about your experiences.

We'll meet again at the next upgrade / failure of my rig.

Kind regards,
Kami.

-=Vect0r=- · 6 Feb 2011 at 10:16

Glad you got it sorted. Interesting to read, and good to see a Dutchman here!

KrazeyKami · 6 Feb 2011 at 15:07

-=Vect0r=- said:
Glad you got it sorted. Interesting to read, and good to see a Dutchman here!

Heheh thx

And I guess a picture says more than a 1000 words, ey?