Computer just shuts down unexpectedly during gaming

Gorbash12346 · 20 Jul 2024 at 12:46

Hi,
Bit of an odd one. After running pretty much flawlessly, my main computer has developed an issue in the last two days where it shuts down unexpectedly during gaming after a few seconds of stuttering. No blue screen, no blank screen, just a straight switch-off.

Spec is
I9 13900K w/ Corsair H170I elite
Asus Z790 extreme
2x 16GB Corsair Vengeance Black 32GB 7200MHz DDR5 (No XMP, as it was originally listed as a supported/approved set but was pulled from the listings after release :mad:)
RTX 4090 FE
Corsair HX1200
2TB Firecuda 530 nvme
Windows 10 64 bit pro (was 11 but had stability issues so went back to 10 for now.)

I've removed the header in case the switch was faulty. No change.
Ran HWMonitor to keep an eye on temps and voltages; nothing too excessive during stress testing.
Re-applied thermal paste (Kryonaut); the old application looked good. No change.
Updated to the BIOS that Asus released on the 12th to address a microcode issue, and ran into the news about 13th and 14th gen failing. I really hope this isn't it. Now running in Intel performance spec.
Also updated every driver on it, including the Intel Management Engine etc.
It actually seems to be getting more frequent.
Plenty of errors in Event Viewer, but I'm not seeing anything critical?
Everything else on the computer runs exceptionally cool, as it's inside a Corsair Obsidian 900D w/ all the fans.

I'm not sure what to do given I get no error log or blue screen to work from.

Any suggestions/ help would be very much appreciated!

Gorbash12346 · 20 Jul 2024 at 15:51

Tetras said:
The most obvious reason for a PC to just shut down out of the blue under load with no error messages is the PSU, but admittedly it is hard to trust these CPUs right now.

You haven't moved the PC lately and might have unsettled the power cables?

How is your 4090 connected to the power supply?

The majority of event viewer errors are meaningless, but WHEA errors would be more interesting.

What stability issues did you have with Windows 11?

Lots of WHEA errors

There's not even a second between them at points.

A corrected hardware error has occurred.

Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)

Primary Bus:Device:Function: 0x0:0x1C:0x4
Secondary Bus:Device:Function: 0x0:0x0:0x0
Primary Device Name:PCI\VEN_8086&DEV_7A3C&SUBSYS_88821043&REV_11
Secondary Device Name:

All I get on the critical error list is an unexpected shutdown.

This one is repeated most of the way through.
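
For anyone trying to read that event, here is a minimal Python sketch of how the device name and the Bus:Device:Function field can be split apart. It assumes the usual Windows `PCI\VEN_xxxx&DEV_xxxx&SUBSYS_xxxxxxxx&REV_xx` layout; the function names and the tiny vendor table are made up for illustration. Vendor 8086 is Intel and subsystem vendor 1043 is ASUS, i.e. an Intel root port on the ASUS board.

```python
import re

# Small illustrative table of well-known PCI vendor IDs (not exhaustive).
KNOWN_VENDORS = {0x8086: "Intel", 0x1043: "ASUSTeK", 0x10DE: "NVIDIA"}

# Hypothetical helper: pattern for the hardware ID layout assumed above.
_HWID = re.compile(
    r"PCI\\VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})"
    r"&SUBSYS_([0-9A-Fa-f]{8})&REV_([0-9A-Fa-f]{2})"
)

def parse_pci_hardware_id(hwid):
    """Split a PCI hardware ID string into numeric fields."""
    m = _HWID.fullmatch(hwid.strip())
    if m is None:
        raise ValueError(f"unrecognised hardware ID: {hwid!r}")
    vendor, device, subsys, rev = (int(g, 16) for g in m.groups())
    return {
        "vendor": vendor,
        "device": device,
        "subsys_id": subsys >> 16,         # upper 16 bits of SUBSYS
        "subsys_vendor": subsys & 0xFFFF,  # lower 16 bits of SUBSYS
        "revision": rev,
    }

def parse_bdf(text):
    """Turn '0x0:0x1C:0x4' into a (bus, device, function) tuple of ints."""
    bus, dev, fn = (int(part, 16) for part in text.split(":"))
    return bus, dev, fn

info = parse_pci_hardware_id(r"PCI\VEN_8086&DEV_7A3C&SUBSYS_88821043&REV_11")
print(KNOWN_VENDORS.get(info["vendor"], "unknown"), hex(info["device"]))
print(parse_bdf("0x0:0x1C:0x4"))
```

The same two fields appear in every copy of the event, so comparing them across entries is a quick way to confirm it really is one root port each time.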

The 4090 was bought at launch in May 2023 iirc, and the full 13900K rig was put together from new in December 2022. The power supply is a little older.

Power to the 4090 is using all individual PCIE cables, 4 into the 12-pin FE adaptor (no daisy-chained connectors). The cable is as straight as can be and fully seated (lots of room in a 900D).

It just crashed while typing this not even gaming. :rolleyes:

I was having memory stability issues that seemed to be significantly worse under Windows 11 than in 10, though this may have been more attributable to the dodgy BIOS issues the Extreme seemed to suffer on release, which is probably why it was discontinued so quickly. But at the time it seemed to make a difference. I haven't re-tried it since, to be honest.

Yes after seeing the reports coming out I was dreading it being the same issue.

Gorbash12346 · 20 Jul 2024 at 16:03

Tetras said:
Are you using a riser?

You could try setting it to PCI-E gen 3.0 in the BIOS.

A few of these PCIE errors are not a problem (especially if they're when the PC boots), but if they're very frequent then that's more alarming.

Presumably your memory is actually running at 4800 or 5200, if XMP is disabled?

No riser, just straight into the board, and supported as well.

Memory is 4800 CAS 40 at the moment.

Would that not strangle the performance of my 4090 going to 3.0?

Gorbash12346 · 20 Jul 2024 at 16:10

Tetras said:
No, you only lose a few %, but regardless, we're just trying to rule out the Intel issues because I'm afraid to say CPU connected devices producing errors is part of the symptoms.

NVIDIA GeForce RTX 4090 PCI-Express Scaling

The new NVIDIA GeForce RTX 4090 is a graphics card powerhouse, but what happens when you run it on a PCI-Express 4.0 x8 bus? In our mini-review we've also tested various PCI-Express 3.0, 2.0 and 1.1 configs to get a feel for how FPS scales with bandwidth.

www.techpowerup.com

Yeah, I'm getting pretty worried. I'll give it a go. I've just checked the power connector on the 4090 in case the old melting power pins problem was in progress (I've been trying my best not to disturb them since it was fitted in case I provoked the problem), but it's fully intact, with no sign of any issue there at all.

Gorbash12346 · 20 Jul 2024 at 16:25

Tetras said:
Glad to hear that, are you using a support bracket to prop up the 4090?

Because the bottom of the case is so far away, it's held up with a fine piece of black nylon attached to the overhead AIO. Not taking any chances. :cry:

Ok, going to give it a try with a few loops on 3DMark. I suspect Intel's new microcode is going to have it running a lot slower (and I thought I had left the performance losses of hardware vulnerabilities behind).

Gorbash12346 · 20 Jul 2024 at 17:20

Some unusual behaviour so far: the only cores boosting to higher clock speeds appear to be cores 4 and 5, hitting 5.8GHz, with the rest capping at 5.5GHz and the E-cores at 4.3 as usual. Temperatures are significantly down since the BIOS update, at 70 max; previously it was about 82-85.

Gorbash12346 · 20 Jul 2024 at 18:41

So after 20 loops of the 3DMark Steel Nomad stress test and a good chunk of time on Prime95, it's not crashed since changing to PCI-E gen 3, and there are no more WHEA errors, though there are some others I've not seen before.

Unable to open the job object \BaseNamedObjects\WmiProviderSubSystemHostJob for query access. The calling process may not have permission to open this job. The first four bytes (DWORD) of the Data section contains the status code.
Metadata staging failed, result=0x80070490 for container

Gorbash12346 · 20 Jul 2024 at 19:25

Tetras said:
It obviously isn't ideal to lose any performance, but it is not a big deal for gaming, even with the slowest profile Intel offer.

The top-end performance in benchmarks or long-run workloads can be impacted a lot more because they're more likely to exceed the power limits, or use the max single-core boost.

I think the frustration for me is mainly that I chose Intel over AMD at the time off the back of the increased memory bandwidth, with 7200 initially being touted as a perfectly stable speed during the pre-launch review cycle, and now I'm stuck at 4800, rolling back to PCIE gen 3, and running the biggest AIO I could get my hands on, and it's potentially still **** its pants. :mad:

Gorbash12346 · 21 Jul 2024 at 14:42

WHEA errors are back again. Same PCI Express root port. No crashes as yet, but some intermittent jittering in game that ties in with the times in Event Viewer.
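
One way to check whether the jitter really lines up with the logged errors, sketched in Python. It assumes the WHEA event times have already been exported from Event Viewer and parsed into datetime values; the function names and the 2-second tolerance are assumptions for illustration, not anything Windows provides.

```python
from datetime import datetime, timedelta

def events_near(event_times, observed_times, tolerance=timedelta(seconds=2)):
    """For each observed stutter time, list the logged events within `tolerance`."""
    events = sorted(event_times)
    return {
        seen: [e for e in events if abs(e - seen) <= tolerance]
        for seen in observed_times
    }

def shortest_gap(event_times):
    """Smallest interval between consecutive events, or None if fewer than two."""
    events = sorted(event_times)
    if len(events) < 2:
        return None
    return min(later - earlier for earlier, later in zip(events, events[1:]))

# Made-up timestamps, only to show the call shape.
start = datetime(2024, 7, 21, 14, 0, 0)
logged = [start, start + timedelta(seconds=1), start + timedelta(minutes=5)]
stutters = [start + timedelta(seconds=2)]
for seen, matches in events_near(logged, stutters).items():
    print(seen, "->", len(matches), "event(s) nearby")
print("shortest gap:", shortest_gap(logged))
```

A stutter with no nearby event would suggest the two are not as closely tied as they look.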

Gorbash12346 · 23 Jul 2024 at 16:09

Tetras said:
You could try laying the PC flat, for if the problem is the seating with the PCI-E slot and GPU sag, though I think the FE 3090 has a vapor chamber and those may not be designed to operate in a different orientation.

Did you set the graphics PCI-E gen only, or the M.2 PCI-E gen too?

Tried it outside the case on the motherboard box, but no change. I knocked both back to 3.0, which seemed to work for a while, but it came back. No actual crashes though, just the WHEA errors. I've swapped out my 4090 FE for my old 3090 FE to see if it made any difference, and so far it hasn't reported any more errors other than the metadata stuff. Though this happened yesterday as well and then they came back, so I'm not convinced at the moment. Temperatures and voltages are still within spec, though the 3090 memory runs a good bit hotter, so I've got its fans on full bore to keep it cool. (Should probably replace the thermal pads on it.)

No sign of any issues on the 4090 as far as sag goes; it's still straight along the edge with no signs of distortion or damage to the pins etc.

Gorbash12346 · 23 Jul 2024 at 16:13

Major774 said:
Crashed whilst typing and not gaming… CPU?

I would just down clock it and see if it’s stable.

Yep. Just a web browser open and HWMonitor logging in the background, and because it doesn't get time to do a dump file I've got nothing concrete to work with.

I'd clipped its max boost back to 5.5 already. No change. Though I've not had an actual crash for a few days, the WHEA errors were persisting.
Today I've had it back on the regular performance setting while running the 3090 to try and provoke the issue again, but no dice.

Gorbash12346 · 26 Jul 2024 at 12:25

I've tried a good few things now. Temperature is OK, but even with the new microcode-adjusted BIOS and Intel's performance setting w/ max boost clocks of 5.3, the cores are being reported as hitting 1.717 VID on a stress test. Surely that's not OK? In the BIOS, vcore is set at 1.350V.
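
A rough Python sketch for pulling the peak VID out of a monitoring log, in case it helps to have numbers to show the retailer. The CSV layout here (columns named `time` and `vid`) is a hypothetical one, not HWMonitor's real export format, and the 1.350 in the usage line is simply the BIOS vcore figure from above, not a recommended limit.

```python
import csv
import io

def summarize_vid(log_text, limit_volts):
    """Return (peak VID, rows above limit) from CSV text with 'time' and 'vid' columns."""
    rows = csv.DictReader(io.StringIO(log_text))
    readings = [(row["time"], float(row["vid"])) for row in rows]
    if not readings:
        return None, []
    peak = max(volts for _, volts in readings)
    over = [(stamp, volts) for stamp, volts in readings if volts > limit_volts]
    return peak, over

# Made-up sample log, only to show the call shape.
sample = "time,vid\n12:00:01,1.312\n12:00:02,1.717\n12:00:03,1.350\n"
peak, over = summarize_vid(sample, 1.350)
print("peak VID:", peak)
print("readings above limit:", over)
```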

Gorbash12346 · 2 Sep 2024 at 14:07

Ok, so the retailer has had my processor for RMA for the last month. They couldn't fault it, but are being told to use the latest microcode updates, which seem to be masking the issue. I also sent them my motherboard, as I did not accept that there was no issue, and some definitely faulty RAM sticks. All testing OK with the newest BIOS and microcode fixes.

Any idea what to do? Suggestions? They've had it for more than a month and I don't know what to do, as there doesn't seem to be a definitive way to test the processors to replicate the fault. AFAIK all they are doing is running AIDA64 stress/burn tests.
Are there any tests I can ask them to try to replicate the fault? It seems like the microcode updates have effectively applied a band-aid to my processor, but I've no confidence in it at all.

Intel did get back to me after it had been sent to the retailer and offered to warranty the processor, so I guess that's my next port of call.

Any help would be greatly appreciated!

Gorbash12346 · 2 Sep 2024 at 14:36

Tetras said:
For intermittent/hard to diagnose issues, I would always send the CPU back to the manufacturer where possible, because their testing is much more extensive and I'm not sure they even possess the capability to return the same CPU to you once the RMA has been accepted and the return is being processed.

Thanks Tetras, I think I'll do that then. Given VID was reporting 1.717 - 1.719V at idle, I really don't fancy keeping hold of it.

Not sure what to make of the RAM being tested without fault. Memtest was pretty definitive when testing those sticks, though all the errors were on even-numbered cores.