Dreaded Kernel-Power 41

Swi1ch · 20 Dec 2022 at 22:10

Specs:

5900x
MSI x570 Tomahawk
32GB TG8P Edition
Nvidia 3080TI
Corsair RM1000x

Issue:

Initially freezing completely at random (no response at all to any input), while machine was idle. Had to kill power and reboot. Happened maybe 4 times over a week.

Since then randomly cutting to black and restarting, once or twice every couple days. So far only when idle or light usage (Played plenty of BF2042 without issue). Started after I had it unplugged for a week as I was away.

Only thing in the event log is the Kernel-Power 41. Machine is just over a year old and I've had no issues at all in that time. Tried the usual gumpf - setting power plan, making sure the drives don't ever sleep etc. - to no effect. Windows and graphics drivers are up to date. BIOS was up to date, although if there's been a new one in the last year I've not updated to it - kinda scared to, given the power issues.
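For anyone wanting to pull those Kernel-Power 41 entries out of the System log without clicking through Event Viewer, here is a minimal sketch that builds a `wevtutil` query. The helper name `build_query` and the default count are just illustrative assumptions; the provider name `Microsoft-Windows-Kernel-Power` is the one these events are normally logged under. It only actually runs the query on Windows.

```python
import platform
import subprocess


def build_query(provider: str, event_id: int, count: int = 10) -> list[str]:
    """Build a wevtutil command listing the newest matching System events.

    build_query is a hypothetical helper name, not part of any library.
    """
    xpath = f"*[System[Provider[@Name='{provider}'] and (EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",    # XPath filter on provider and event ID
        f"/c:{count}",    # limit the number of events returned
        "/rd:true",       # reverse direction: newest events first
        "/f:text",        # human-readable text instead of XML
    ]


if __name__ == "__main__" and platform.system() == "Windows":
    cmd = build_query("Microsoft-Windows-Kernel-Power", 41)
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
```

The timestamps in the output can then be lined up against what the machine was doing when it went down (idle, gaming, etc.).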

I've got a replacement PSU on the way just in case, but before I break the seal on that and lose my ability to return it, is there anything else I could try to help diagnose this? I don't even know where to start - literally the only thing I can say is the PSU looks a touch dusty and might need a clean.

Swi1ch · 20 Dec 2022 at 22:34

mickyflinn said:
I would test the memory with memtest just in case.

Will do, it just now completely froze up with no warning as I was sitting here watching TV. The machine was still actually on and the fans running etc, just completely unresponsive to input. Would seem to suggest it's not (purely) a PSU issue.

Swi1ch · 20 Dec 2022 at 22:39

Quartz said:
The latest Windows Update is known to cause BSODs.

https://www.theregister.com/2022/12/20/microsoft_windows_10_crash/

Oh man I guess this is a good thing - that was the day I got back from my trip and updated everything so hopefully that's all it is. If so you're a lifesaver pal.

At least I hope this is right? Looks like that is causing systems to boot directly into BSOD, and the workarounds are just to get a machine to boot rather than stay on.

Reckon it's worth me just formatting and reinstalling windows?

Swi1ch · 23 Dec 2022 at 13:11

Quartz said:
No. The fix is given in the article: boot to a command prompt and copy over one file.

Unfortunately this hasn't worked. I tried it exactly as given in the article/Microsoft page at first, then a tweaked version as there appears to be a duplicated slash. Reading more into it, I'm not sure this is related, as everyone else suffering from the BSODs caused by the update is getting them on boot-up, hence the need to use the recovery console.

I've also set windows to not automatically restart on crash to try and force a BSOD instead of a reboot, but it just resets.

Swi1ch · 23 Dec 2022 at 14:16

DoneADougalOnSofa said:
My 5800X did similar when I got it. Try LLC (load-line calibration) in the UEFI settings. Mode 1 sorted the problem straight out for me. (On both CPU and SOC settings).
BSODs at idle often mean the voltage drops a little too low- I think it's particular to individual chips.

May also be worth setting DRAM voltage at 1.37v or so rather than the 1.35v XMP/DOCP default.

This machine has run for a year with no issues, so it seems a bit odd that I would be having a voltage issue now? If a reformat doesn't help I'll give it a go.

I've still not seen an actual BSOD - either the machine hard-resets, or freezes up completely.

Swi1ch · 24 Dec 2022 at 02:09

I'm suddenly getting something in the event log:

WHEA-Logger
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 3
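If more of these turn up, it may help to see whether they keep naming the same APIC ID. A rough sketch that parses the event text above into fields and tallies the IDs across events; `parse_whea` and `apic_id_counts` are made-up helper names, and the parsing simply assumes the "Key: value" layout shown in this post.

```python
from collections import Counter


def parse_whea(text: str) -> dict[str, str]:
    """Turn 'Key: value' lines from a WHEA-Logger event into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines with no colon or nothing after it.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def apic_id_counts(events: list[str]) -> Counter:
    """Count how often each Processor APIC ID appears across events."""
    return Counter(
        parse_whea(e).get("Processor APIC ID", "unknown") for e in events
    )


sample = """A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 3"""

event = parse_whea(sample)
print(event["Error Type"], "- APIC ID", event["Processor APIC ID"])
print(apic_id_counts([sample]))
```

A single ID showing up again and again would at least hint at one particular core rather than something board-wide, though that is a rule of thumb rather than a diagnosis.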

Swi1ch · 25 Dec 2022 at 00:44

DoneADougalOnSofa said:
My 5800X did similar when I got it. Try LLC (load-line calibration) in the UEFI settings. Mode 1 sorted the problem straight out for me. (On both CPU and SOC settings).
BSODs at idle often mean the voltage drops a little too low- I think it's particular to individual chips.

May also be worth setting DRAM voltage at 1.37v or so rather than the 1.35v XMP/DOCP default.

Had a look into this as I'm still having issues.

My board apparently doesn't have an option for LLC.

My DRAM voltage is already at 1.37 - I never changed this myself.

Swi1ch · 25 Dec 2022 at 22:19

Dg834man said:
I will have a better look. Have you tried running with XMP disabled? Test both RAM sticks individually. What's your SOC voltage? Is the processor still under warranty? Test the basics, and if you can't find anything, pester someone for a replacement.

~~If I run with XMP disabled then my RAM gets shunted down to 2400MHz - I can test this if it matters? Both this machine and the previous one (no common components) ran like dirt without XMP enabled.~~
Same exact issues with XMP disabled.

I've not tested each stick individually but I ran memtest for 8 passes/12 hours with no issues - obviously not conclusive but should probably point in the direction of something else.

~~I cannot find my SOC voltage - google suggests that on AMD boards this could be called 'NB Voltage' but I still cannot find any reference in my BIOS.~~
SOC Voltage hovers around 1.1v.

Who and what am I chasing for a replacement? I think RAM and PSU are probably OK so I'm leaning motherboard or CPU? Having finally got something in my event viewer with WHEA errors seems to suggest that the CPU might be at fault?

Edit. Did some googling on the WHEA error and apparently updating BIOS is a thing that should happen - I've flashed to the latest BIOS and no change.

Swi1ch · 26 Dec 2022 at 19:42

Major774 said:
Probably PSU…

I thought I could get away with running an undervolted 5800X and an undervolted 3080 FE with a Seasonic 650W SFX PSU. But I would get random hard freezes, sudden restarts, etc.

A Silverstone 1000W SFX PSU solved my problems.

I'm currently running an RM1000x, so I guess that could be on its way out.

That said I've started having BSODs - volmgr.sys failures - which I don't think is PSU related.

I posted my minidump on the Microsoft forum and someone went through it and said that the volume manager driver was throwing errors, so they've had me run the System File Checker. I was on a fresh format so I didn't think it would do anything, yet the scan did find errors and apparently fixed them, so I'll see how that goes.

Swi1ch · 26 Dec 2022 at 19:52

labbby said:
Same mobo and CPU here... had this a while back and it was driving me insane. Tried all the hardware testing that I could; turns out a fresh Windows install sorted my issue.

Completely wiped and formatted with no improvement.

I'm currently trying Windows 11 to rule out the recent Windows 10 update issues.

Edit. Windows 11 no luck either

Swi1ch · 27 Dec 2022 at 16:41

malachi said:
I would strip everything to the bare minimum, fresh install of Windows. Update the bios, restore motherboard defaults after bios update and see from there.

Still no joy, I would look at the motherboard.

Done all this already (Apart from physically stripping the machine)

Event viewer is showing WHEA errors which seems to indicate the motherboard.

Had someone on the Microsoft forums check the WHEA dump and they reckon the CPU or MB is on its way out too.

Next step is to actually tear the PC apart, reseat everything, reapply thermal paste and go from there.

Edit. Took PC apart and it was pretty much immaculate. Reseated everything and no improvement.

Have a replacement MB on the way to test.

Swi1ch · 1 Jan 2023 at 00:48

Well I swapped out the motherboard and still having the same issues.

I guess I swap the PSU next.

Swi1ch · 2 Jan 2023 at 00:16

DoneADougalOnSofa said:
In 'Advanced mode' in the BIOS (on the Tomahawk board), click OC tab, then scroll down to DigitALL Power, LLC settings are in there.
Try Mode 1 for both CPU and SoC. I know it sounds daft, but maybe a Windows update has altered power draw or something, not beyond the realms...

Either way, hope you sort it soon

Thanks, I've found this and I'm testing now.

Edit. Unfortunately, no change.

sreeve1993 said:
This is a good suggestion.

Whilst OP is in the BIOS, they can temporarily turn off Turbo Boost for their CPU and only use the base clock as a next step if your step doesn't work.

A friend of mine had this similar issue and I found that his CPU used this feature, turned this off and it was resolved.

Is this specifically called Turbo Boost? I've got a big 'Game Boost' button in the BIOS that has always been off, and digging through the menus I've found a 'Precision Boost'.

Swi1ch · 4 Jan 2023 at 19:01

Well, new PSU and still the same issue. No idea what to try now.

Swi1ch · 4 Jan 2023 at 20:26

ncncore said:
Have you tried running single sticks of ram to rule that out? Ram tested fine in memtest when I had the same too.

It was a while back in DDR3 days but same issue.

Just tested now - exact same issue regardless of which RAM stick is installed.

I have a spare GPU in this machine I can test to rule that out.

I have 2 x m.2 drives in the problem machine that I can probably test one at a time to rule those out.

I might have a spare CPU cooler to test although I've been staring at HWMonitor when the machine crashes and the temps are fine.

Only thing I can't reasonably test now is the CPU.

Ship of Theseus incoming.

Swi1ch · 6 Jan 2023 at 19:51

So I have now tested both m.2 drives independently, and also tested a separate GPU. No improvement.

New motherboard, new PSU, GPU is fine, drives are fine, tested both RAM sticks.

So either both of my RAM sticks are having the same issue, or this is a CPU issue - I think?
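The process of elimination so far, written out as a quick sketch. The component list and notes are just my own summary of the tests in this thread, nothing more authoritative than that.

```python
# Everything in the build that could plausibly cause the crashes.
components = ["motherboard", "PSU", "GPU", "m.2 drives", "RAM", "CPU"]

# What has been swapped or tested, and how.
ruled_out = {
    "motherboard": "replaced, same issue",
    "PSU": "replaced, same issue",
    "GPU": "tested a separate GPU, same issue",
    "m.2 drives": "tested each drive independently, same issue",
    # Not 100% conclusive: both sticks could in theory share a fault.
    "RAM": "same issue with either stick alone, memtest clean for 8 passes",
}

suspects = [c for c in components if c not in ruled_out]
print("Still untested:", suspects)
```

That leaves the CPU as the only part that hasn't been swapped or tested in isolation.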

Swi1ch · 22 Jan 2023 at 13:15

An update for anyone who finds this thread in the future - sent my CPU back to AMD who apparently agreed it was faulty and have sent me a replacement.