Ryzen 5800x WHEA errors

Jamin280672 · 1 Dec 2021 at 07:13

As the title suggests, I'm getting WHEA errors left, right and centre on my Ryzen 5800X when running the OCCT stress test. I've set everything to stock settings and I'm still getting them. The only thing running overclocked is my RAM, which is at its XMP settings but dialled in manually with a bit of extra voltage. The error reports an ID of 0, which I take to mean core 0; would that be right? I've removed PBO, Curve Optimizer, the lot, and I'm still getting them. Any ideas, anyone?

Running Windows 11
Ryzen 5800X
32GB (4x8GB Crucial Ballistix 3600MHz CL16 1.4V)
Gigabyte X570S Master
4 x NVMe drives
ASRock RX 6800 Phantom Gaming X GPU

Joxeon · 1 Dec 2021 at 07:18

RMA the CPU direct to AMD.

scubes · 1 Dec 2021 at 09:30

I had these WHEA errors on my 6700K. OCUK told me to up the CPU voltage a wee bit.

Jamin280672 · 1 Dec 2021 at 09:42

scubes said:
I had these WHEA errors on my 6700K. OCUK told me to up the CPU voltage a wee bit.

It's on auto, and when it boosts it pulls up to 1.5V, so it has all the voltage it wants.

jigger · 1 Dec 2021 at 11:22

It might be worth checking you don’t have two different memory types.

subroutine · 1 Dec 2021 at 15:11

These are the settings my system required to fix WHEA errors on core 0 -

RAM 1.38v
SOC 1.124v
VDDG IOD 1075
VDDG CCD 950

Anything over 950 for CCD caused more problems than it solved.

pc-guy · 1 Dec 2021 at 15:24

Have you cleared CMOS and rebooted with default settings? Without ram OC? And still gives you WHEA?

If that is the case then something is up either with CPU or RAM or both.

If not, then the problem is somewhere in the OC. If you want to test the RAM, it's best to boot into MemTest86 (outside Windows) and run it with XMP enabled and everything else on auto.

Donnie Fisher · 1 Dec 2021 at 16:11

I fitted a 5800X last night and I'm getting WHEA errors all over the place. I've tried various things so far, but games seem to be the main thing that triggers it.

I'm unsure at the moment if it's my cooler, though. The existing Wraith Prism is hitting 90°C in the likes of Cinebench (but around 70°C in OCCT).

I don't see how it's the RAM, because the same RAM (TridentZ 3000) was installed with a 2400G and a 3600 before this CPU, and it was rock solid through both of them.

RavenXXX2 · 1 Dec 2021 at 16:25

WD nvme drives can cause whea errors.

Donnie Fisher · 1 Dec 2021 at 16:40

Maybe so, but when you've had a system running for a fair while with no issues and then plop in a new CPU, why would it be the NVMe?

Jamin280672 · 1 Dec 2021 at 16:46

RavenXXX2 said:
WD nvme drives can cause whea errors.

3 X Samsung 980 pros and 1 X crucial nvme.

subroutine said:
These are the settings my system required to fix WHEA errors on core 0 -

RAM 1.38v
SOC 1.124v
VDDG IOD 1075
VDDG CCD 950

Anything over 950 for CCD caused more problems than it solved.

Thanks, I'll give this a try when I get home. It seems to be a common problem on the 5800X; I've been reading through a load of threads today. Apparently they fixed it with AGESA updates and then it came back. You can't flash back to an older AGESA unless your board has some sort of BIOS flashback. Thankfully my X570S Master does, so I may try some other BIOSes out.

Donnie Fisher · 1 Dec 2021 at 16:55

Out of curiosity, what cooler are you running with your 5800x ?

Jamin280672 · 1 Dec 2021 at 17:16

Donnie Fisher said:
Out of curiosity, what cooler are you running with your 5800x ?

I'm custom water cooling.

I didn't even know I was getting these errors. I'm not a big gamer; I just have a tinker with Call of Duty, Battlefield and the new Microsoft Flight Simulator every now and then when I'm bored. I was passing every benchmark I threw at it, like IBT, Cinebench (all of them, both multi and single core), memtest and all sorts. I decided to give it a run with OCCT last night and that's when I first discovered the WHEA errors, as OCCT shows them to you directly without opening Event Viewer.

I was using the Gigabyte Active OC Tuner in the BIOS, with PBO below 45A and a manual overclock of 4.7GHz when the CPU requested more than 45A. I immediately removed everything to do with overclocking the CPU: disabled the Active OC Tuner, disabled Curve Optimizer, even backed off on the fabric and RAM speed, and I'm still getting the WHEA-Logger errors.

Donnie Fisher · 1 Dec 2021 at 17:46

I’ll try running OCCT for a while then and see if it crashes and gives me an error. That said, my WHEA errors cause a BSOD and restart the machine, so I may not be able to see the error details.

Jamin280672 · 1 Dec 2021 at 18:23

Donnie Fisher said:
I’ll try running OCCT for a while then and see if it crashes and gives me an error. That said, my WHEA errors cause a BSOD and restart the machine, so I may not be able to see the error details.

Yes, do. It didn't take long at all. I just used the standard test with the medium data set, and it only took 2 minutes of testing before the WHEA errors started to appear by the tonne.

Joxeon · 1 Dec 2021 at 18:30

Check the event viewer and see which error it is.

Event 18 means CPU is borked, RMA will most likely be needed.

Event 19 means it's fclk related, lower fclk or play with voltages.

Jamin280672 · 1 Dec 2021 at 18:39

Joxeon said:
Check the event viewer and see which error it is.

Event 18 means CPU is borked, RMA will most likely be needed.

Event 19 means it's fclk related, lower fclk or play with voltages.

Thanks, I'll double check when I'm home. I don't finish till 9 tonight; I'll post some pics.

Donnie Fisher · 1 Dec 2021 at 19:10

Event 18 is what I've been able to see in the Event Viewer....

Code:

Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          01/12/2021 18:43:35
Event ID:      18
Task Category: None
Level:         Error
Keywords:    
User:          LOCAL SERVICE
Computer:      DESKTOP-O7F9G01
Description:
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
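
For anyone comparing entries like the one above, here is a minimal Python sketch that pulls the fields out of a pasted WHEA-Logger event and applies the Event 18 / Event 19 rule of thumb quoted earlier in the thread. The function names are made up for illustration. The APIC-ID-to-core mapping is an assumption (an 8-core, single-CCD part with SMT on, where the two threads of each core get consecutive APIC IDs), so treat the core number as a hint rather than a certainty.

```python
import re

# Rule of thumb from earlier in this thread, NOT an official Microsoft/AMD mapping.
THREAD_RULE_OF_THUMB = {
    18: "CPU suspected, RMA likely",
    19: "FCLK related: lower FCLK or adjust voltages",
}


def parse_whea_event(text: str) -> dict:
    """Collect 'Key: value' pairs from a pasted Event Viewer entry."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s*(.*)$", line)
        # Skip lines without a colon and keys with an empty value.
        if match and match.group(2).strip():
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def core_from_apic_id(apic_id: int, smt_enabled: bool = True) -> int:
    """ASSUMPTION: threads of one core have consecutive APIC IDs (SMT on)."""
    return apic_id // 2 if smt_enabled else apic_id


sample = """\
Source:        Microsoft-Windows-WHEA-Logger
Event ID:      18
Level:         Error
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
"""

event = parse_whea_event(sample)
event_id = int(event["Event ID"])
apic_id = int(event["Processor APIC ID"])

print("Event ID:", event_id)
print("Error type:", event["Error Type"])
print("Likely core (assumed mapping):", core_from_apic_id(apic_id))
print("Thread rule of thumb:", THREAD_RULE_OF_THUMB.get(event_id, "not covered"))
```

Under that assumption, APIC ID 0 would be the first thread of core 0, which lines up with the "core 0" reading earlier in the thread.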

Oddly though, whilst I'm typing this, OCCT is running with the CPU steady at 90°C, and it hasn't crashed (yet). The last crash, which the above log is from, came from messing about with DaVinci Resolve. If I did a 4K render using only the CPU, it completed fine, albeit at 90°C. If I did the exact same render but used the NVIDIA GPU encoder, it crashed within 30s of starting, with the CPU at 60°C and nowhere near as loaded.
That ties in with the other general crashes I'm getting, such as in games like Apex, which also use the GPU.

Sigh. I'll be disappointed if it needs returning. Is there anything more I can do to confirm it's the CPU?

I've tried variations of stock bios settings, DOCP ram applied, -0.1V offset, eco mode, PBO disabled. All seem to end up crashing.

Jamin280672 · 2 Dec 2021 at 05:42

Thanks for your help, guys. Looks like I've cracked it. I got home last night, reset the whole BIOS and then went through each setting one at a time with a quick test in between. It looks like either the RAM or the fabric was the problem. So far I've got PBO on with Curve Optimizer at -20 across all cores; I still need to set this per core and find out which cores are better than others. This is with +125MHz boost, so it's boosting to 4975MHz and holds there for quite a while.

I've enabled XMP and set the fabric to 1800MHz with the RAM at 3600MHz, but with some really rubbish timings that I still need to sort out. CPU first, though. RAM voltage is at 1.38V and the FCLK voltage is at 1.15V.

So far so good. I'm getting some pretty good results in the quick bench in CPU-Z, and as I type this OCCT has been running for 30 minutes without any errors, with the CPU bouncing between 4.7 and 4.8GHz. A bit more testing to do and then I'll start working on the RAM. I've had this kit running happily all day long at 3800MHz with 16-19-16-38-1T, GDM enabled, 1.45V.
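
The 3600MHz RAM / 1800MHz fabric pairing above is the usual 1:1 setup: DDR memory transfers twice per clock, so the real memory clock is half the advertised rate. A rough Python sketch of that arithmetic follows; the function names are made up for illustration, and whether a given chip is stable at a given FCLK is a separate question this does not answer.

```python
def memory_clock_mhz(ddr_rate: int) -> float:
    """DDR transfers twice per clock, so the memory clock is half the rated speed."""
    return ddr_rate / 2


def is_one_to_one(ddr_rate: int, fclk_mhz: int) -> bool:
    """True when the fabric clock matches the real memory clock (1:1)."""
    return memory_clock_mhz(ddr_rate) == fclk_mhz


# The settings mentioned in this post.
print(memory_clock_mhz(3600))     # memory clock for the 3600 XMP setting
print(is_one_to_one(3600, 1800))  # fabric at 1800 with RAM at 3600
print(memory_clock_mhz(3800))     # FCLK that a 1:1 setup at 3800 would need
```

So going back to 3800MHz at 1:1 would mean running the fabric at 1900MHz, which is a harder ask than 1800MHz and worth testing on its own before tightening timings.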

Donnie Fisher · 2 Dec 2021 at 07:36

Sounds promising. I spent more time with mine last night as well. I’m beginning to lean towards it being a power delivery quirk when the processor is stepping down. Apparently the 5000 series can momentarily draw a high spike of current when they step down, which can trip the VRM protection and lead to a hardware error being flagged. Older boards like mine were not aware of the spike, so they are more likely to trip.

The solution is possibly to increase one of the PBO limits so that it doesn’t trip. I’m still messing about to find a happy medium.

This might explain why I see the crashes at lighter loads and not when running stress tests.