WHEA Uncorrectable Error on GPU load (Win 10)

Hi all,

I hope this is the right place to post, but I'm currently desperate for help and would really appreciate it.

When my GPU is under load (games or Furmark tests), my computer crashes with a WHEA error and no dump file is created. To jump ahead, I have to note that this happens on Windows 10 only, not on Windows 7. I can run Furmark tests on Win7 for hours or play the whole day - there are no crashes and everything works really well. (I quite recently purchased Win10 and installed it as my second OS, so I can switch between both.)

PC specs:
  • Motherboard: GA-Z170X-Gaming 3 (rev 1.0)
  • CPU: i7-6700k @ stock clock
  • GPU: Zotac GTX-1080 Amp Extreme
  • RAM: 32GB Kingston @2400MHz (XMP is off)
  • Storage: Samsung and Sabrent M.2
  • PSU: Corsair RM850
  • PCIe: ASUS PCE-AC88 WiFi card
  • PCIe: StarTech PEX1394A2V FireWire card

I'm using a Focusrite Saffire Pro-40 (connected via the FireWire card) as my sound device.

The computer has the latest BIOS and driver updates.

What has been done:
  • BIOS reset
  • GPU test with MODS (MATS tests)
  • Windows Memory Diagnostics Tool test
  • Intel CPU diagnostic test
  • CPU stress test with Prime95
  • Completely removed Windows 10 (disk format) and reinstalled in offline mode, then installed Nvidia drivers - same problem
  • Tried multiple (10+) versions of Nvidia drivers (removed with DDU and installed in offline mode)
  • Added a custom fan profile to spin the fans at max under load (temperatures dropped to ~55C)

None of the tests report errors, and a clean Windows install does not help.

At some point I switched CPU Vcore Load-Line Calibration from Auto to Standard, and it seemed the issue was fixed (I could run Furmark tests and play without issues). Then yesterday I was playing Borderlands 2 and my game crashed with this error again. After restarting my PC I could play for 5-7 minutes and then the game would crash over and over. After five tries I booted into Windows 7 and could play for over 2 hours without issues. Later on I ran a Furmark test on Windows 10 and it resulted in WHEA again.

Another oddity: no dump files are created (full or mini).

Right now I've started Windows Memory Diagnostics in Extended mode and will see if it spots anything.

I'm a bit concerned about my PSU, as 12V drops to 11.3V (per sensor readings) when running Furmark (other voltages also drop slightly) - but then why is it stable on Windows 7? I will measure the voltages with a multimeter once my memory test completes, but I'm quite convinced it will show the same.

UPDATE: measured with a multimeter - there are no voltage drops on the 3.3V, 5V, 12V ATX, 12V GPU or 12V CPU lines.

I would appreciate any help diagnosing this issue, as I am out of ideas. Thanks in advance.
 
Was the Windows 10 you installed the latest, up-to-date version?

Did you install motherboard chipset drivers and any other drivers you need etc?

I've tried different Windows 10 versions. Currently it's the newest with all the updates. I also tried older versions without updating - same thing.
All chipset drivers have been installed. I've installed all the drivers from the Gigabyte site. I also tried letting Windows download them, but in both cases I get the BSOD (during a test or a game).

Voltage measurement through sensors isn't as reliable as a multimeter, but what I find strange is that it's rock solid on Windows 7. That seems software related, unless Windows is adjusting voltages differently.

What Windows power plan are you using, and have you tried launching the game in Windows 7 compatibility mode?

Sensors on Windows 7 show the same voltage drops when running Furmark. The only difference is that it does not BSOD. I'm on the `Full Performance` plan.

Multimeter reading to follow, but I am currently waiting for that memory test, as it has been running for 5 hours already (stuck on the classic 21%, which is the longest stage). If it hasn't moved by this evening, I will stop it.
 
Coming back with extra information.

I've run the Windows Memory Diagnostic test in Extended mode (it took days) and there are no errors.

Also, I've measured the voltages with a multimeter and there are nearly no voltage drops on the PSU pins (tested 3.3V, 5V and 12V on the board, plus 12V on the GPU and CPU). The voltage drops from 12.1V to 12.0V, which is nothing. Weirdly, the computer's sensors think there is a voltage drop (could it be a motherboard issue?).

I have to mention that I also have a WiFi and a FW card on PCI-E. I took the FW card out for the sake of testing and left only the WiFi (model: Asus PCE-AC88).

No idea where the problem lies. It might be something incompatible with Windows 10 - but what?

Any help highly appreciated.
 
Possibly the CPU.


As I've mentioned, I've done a CPU stress test with Prime95. A dead or damaged CPU would fail at that point. Also, Windows 7 has no issues with it.


I'm currently suspecting the WiFi card and the FireWire card the Saffire connects through.

I've disconnected both and my PC seems to be stable. Hopefully the problem will not reappear. But this has happened before - I thought I'd fixed the issue, only for it to appear again.


Also, the sensors reporting voltage drops (when there are none) seems weird to me. So many things to suspect!
 
The PC looks like it's from 2015/16 - is that accurate?

After putting in a new GPU to replace my old 1070 SC and running 3DMark tests, I had my first WHEA Uncorrectable Error crash when running the CPU physics test in Time Spy. As I'm OCing (4.9 GHz on an i9-9900K with a -200 MHz AVX offset), I suspected it was power draw related; I upped the voltage by 0.005V and so far it hasn't crashed again.

The RM850 is a fine PSU (I have a 2020 RM750x), but if it's from the 2015/2016 era and has been running since 2016, I would start considering a replacement, even at the same wattage.

The question is whether the PSU and motherboard are capable during spikes of current draw, and whether the motherboard voltage regulation is still able to ride out momentary spikes or handle Vdroop. How is your Load Line Calibration and Vcore in the BIOS, still on defaults?

The first two sections of this video are worth a watch: https://www.youtube.com/watch?v=NMIh8dTdJwI (and https://www.youtube.com/watch?v=xkeR1Z62wi8)

The motherboard reporting voltage drops is not unexpected. How extreme are the sags? Are you using HWiNFO64, Open Hardware Monitor or similar to log and watch them, or probing with a meter? NB: boards' own sensors typically misreport voltages slightly. My Asus Maximus Hero's Nuvoton chip's Vdrop and Vdroop numbers are always a bit off.


Of course, all this may be irrelevant and you're just really unlucky with PCIe or with changes to Windows power management. Perhaps one of the cards is more sensitive to voltage changes than it used to be, or Windows 10 is doing some power state management. How is Link State Power Management set in Power Options? Did you have the other cards in x1 and x4 slots, not sharing a x16 with your GPU?

I'd also query the Focusrite - the Saffire FireWire range is EOL and Focusrite don't recommend upgrading beyond Windows 10 1809, which suggests known incompatibilities. Windows 10 20H2 has well-documented serious issues with 1394.


Thanks for the reply.

I don't know where exactly the problem lies, but I've done 3 things at the same time: disconnected the Focusrite and the WiFi adapter, reassembled my GPU (changing thermal pads and paste), and connected an M-Audio FW 410 as my audio card. Since then I have not seen WHEA errors. I think it is in fact the Focusrite that makes it crash.

My CPU was slightly overclocked before that (set @4.4GHz without changing voltage), but it hasn't been for a year now. Load-Line Calibration is set to Normal at the moment. The weird part is that I changed it from default to Normal and the WHEA disappeared for a while, then returned after 2 weeks. Changing Load-Line Calibration to High did not help that time.

In regards to voltage drops, I would not blame my PSU anymore (I did at the very beginning and wanted to replace it), as I've measured the voltages with a multimeter while switching the load on and off - it's super stable on all pins. It's just the system reporting drops. Also, it does not seem that the crashes occurred during spikes; usually I needed to run Furmark for a minute and then it would crash. To monitor voltages I mainly used OHM and AIDA.

The PCIe x16 slot was not shared with anything and ran at x16 speed.

I guess at some point I'll just need to connect my Focusrite and run a test to see if it crashes my PC (for the sake of science).


But now a different problem has appeared - when my computer is under heavy load (gaming), sometimes it dies with a screen freeze. I have linked this to overheating. I wish I could find out what exactly overheats, but I have no idea how. Temperatures look very good everywhere, from both sensor reports and an IR thermometer. The CPU at full load barely goes over 50C. The GPU is also chill - around 60C with a hotspot of 71C. Opening the side panel keeps it from crashing, but with the panel closed I can usually play for around 2 hours. I have just received two Noctua A12x25 fans and will install them in the front (right now I have ULN fans, which perform badly).

Is there a way to detect what exactly overheats in my PC?

p.s.: I guess I will get some of the programs listed and test my system for a while.
p.p.s.: thanks for the YouTube links, I will check them out.
 
This is a long shot, but can you check that the cables are fully plugged in at the PSU? I'm wondering if a power cable may be heating up and breaking contact. The cables all have latches on the PSU and they should all be engaged.

Something else to try - and again another long shot - is moving the cables on the PSU just in case one of the sockets is iffy.

Cables are fully plugged in, I have ensured that. Also, I've moved all the cables to different PSU sockets.

Tricky that. I have higher temps on my machine due to positioning and bad airflow with the side closed. I tend to leave it open if I'm doing things which really thrash the CPU and GPU.
You can run GPU-Z and log its sensors to a text file, likewise you can run HWiNFO64 and log to a file. Then, use GenericLogViewer (https://www.hwinfo.com/forum/threads/logviewer-for-hwinfo-is-available.802/) to analyse data points after a crash.
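To make the log analysis concrete, here's a minimal Python sketch for pulling the last few sensor readings out of a CSV log after a crash - the rows at the end of the file show the state just before the machine died. The column names here are made up for illustration; real HWiNFO64/GPU-Z export headers depend on your board and settings:

```python
import csv
import io

def tail_readings(csv_text, columns, n=5):
    """Return the last n rows of a sensor CSV log, keeping only
    the requested columns - i.e. the readings just before a crash."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [{c: r[c] for c in columns} for r in rows[-n:]]

# Hypothetical log excerpt; real export headers will differ.
log = """Time,CPU [°C],GPU [°C],+12V
12:00:01,48,61,12.096
12:00:03,49,63,11.952
12:00:05,50,66,11.808
"""

for row in tail_readings(log, ["GPU [°C]", "+12V"], n=2):
    print(row)
```

For a real crash investigation you'd read the log file from disk instead of a string and widen `n` to cover the last minute or so of samples.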

How recently have you checked thermal paste applications on the CPU and ensured airflow is unobstructed? If you're running an AIO water cooler, made sure there's no airlocks inside it? No dust inside the heatsinks of main components?

I will try to log it with GPU-Z. I was monitoring my temps in different software and temperatures never seem to exceed 55C at any spot (except the GPU). Thermal paste and pads were changed on both the GPU and CPU a month ago as an attempt to solve the problem. It lowered temps and decreased the difference between the GPU temp and the GPU hot spot. I must note that during gaming the max CPU temp reaches 45C.

It is odd, but the PC only crashes when fully closed. It has never crashed with the side panel open.

I recently installed 2 new Noctua fans for intake and reconfigured all the fans in the PC to run according to temperature (previously they were nearly always at max speed). Also, I've slowed down the GPU fans, so the GPU temp goes up to 70C. Then I closed the panel and played two 2-hour sessions and it was stable. Yesterday I got a dead freeze after 40 minutes of gameplay again and opened the side.

Screen freezes can be disk related. I wouldn't have thought the WHEA error was, but it might be worth running sfc from the command prompt.
I will run a scan just in case, but I doubt the problem is the SSDs. Disk temperatures are quite low all the time and they have plenty of endurance left. Also, when the PC's side is open it never crashes.


At this point I think there might be some bad controller on the motherboard that overheats and produces errors. This is really upsetting though.
 
Is W10 on the M.2 drive under the GPU?

Heat from the GPU may be too much for the drive when the case panel is on.
I have two M.2 drives. The one with W10 (Sabrent) is not under the GPU and stays relatively cool. The second drive (Samsung 950 Pro) with W7 is under the GPU and heats up to 50-55C under load, but it's not in use.


I have been experimenting a bit and have some more weird information. I closed the side panel, reconfigured all the fans for low (quiet, but keeping temperatures all right) speeds, and at the same time downclocked the GPU by 99MHz. Also, I wanted to confirm the theory that the WHEA error was actually coming from my Focusrite FW card, so I plugged the card in and set it as the default device. Then I ran Furmark for 2 hours and my PC did not show any signs of crashing. And FPS did not drop a single time during the torture test.

The current setup has been running for 3 days now (and I play at least 2h every day) and it's stable: no WHEA and no freezes.

Taking the information above into account, I may conclude that it's not overheating that freezes my system, but a too-high frequency that can spike up automatically (this is the built-in Zotac boost chip). And the WHEA might be related to the ASUS WiFi card, which is defective. I have to add that the WiFi card would sometimes disconnect on Win10 and I needed to reboot my PC to fix it.

Of course this conclusion may be proven wrong if I get a new freeze or WHEA, but strangely, the system works perfectly at the moment.
 
@~cw yes, it's a nightmare.

I think I have found the cause of the problem and can now reproduce it. In fact, it is a WHEA Uncorrectable Error, not just a freeze.

As I have mentioned, I have 2 SSDs: a Sabrent and a Samsung (both in M.2 slots). When the GPU is at 100% load (e.g. a Furmark test or heavy gaming) and you copy files from the other drive (Samsung) to the OS drive (Sabrent), the data is copied into memory quickly, then the Sabrent sits at 100% busy and cannot store the data for 5-7 seconds, and then the WHEA appears. This also explains why my WHEA error could never be logged - the OS drive did not respond. It also explains why I never had this problem on Windows 7 (it does not detect the Sabrent drive for some reason). Since the OS drive's temperature is actually low (it sits behind the GPU), I am betting it's an issue with the PCI bus the drive is connected to.

I will be swapping drives and will be doing another set of tests to confirm that.
 