Edit: Solved - please don't waste your time reading this.
Specs as per sig, but for those on mobile:
AMD Threadripper 3960X at stock
Asus RoG Strix TRX40 E-Gaming
32GB (4x 8GB) 8Pack Edition RAM 3600c16 ('stock' @ DOCP 1.35v)
Sapphire Pulse Vega 56
Seasonic Prime Ultra Platinum 1000W PSU
Samsung Evo Plus 1TB NVMe
Since I built this machine I've had crashes under heavy load on Linux (mostly when running FoldingAtHome on GPU and BOINC on CPU at the same time). It ran OK under Windows so I assumed it was just a product of the hacky way I'd been running OpenCL on Linux to leverage the Vega card for FaH.
I separately noticed that if I enabled secure memory (core isolation in Windows) or TSME in the BIOS (to encrypt RAM), the machine would hard crash regularly, even in the BIOS. I figured it was a BIOS bug and just disabled it, albeit begrudgingly, since I spent thousands on this rig and it works fine on my outgoing 8700K machine.
This week, Asus released a BIOS update with AGESA 1.0.0.4 for the CPU. Since updating, my machine turns off every few minutes: just a hard power-off to a black screen, with a red light on the mobo and the OLED saying 'Check CPU'. Uh oh. I reset the BIOS to stock just in case, but the crashes still happen. I managed to boot into Windows, and RealBench (running the stress test with up to 32GB RAM) crashes after a second or two saying 'instability detected'. Everything's at stock!
If I run RB with anything less than my actual RAM (16GB, 8GB, 4GB) it runs all day long with no issues. As soon as I try to use 32GB it crashes out with 'instability detected'. I'm about to load a memtest live USB but am I right in suspecting it's probably the RAM? I know the mobo OLED said check CPU but a quick search suggests that's a very generic error and often means RAM or PSU.
Given that the machine is practically unbootable with either TSME or core memory isolation enabled, I'm starting to suspect the RAM above anything else. Does this sound reasonable?
If memtest passes I'll try the usual - two sticks instead of four, switch slots on the mobo, reseat everything etc. For now I'm just looking for preliminary thoughts. TIA.
Edit: I can't go back to the old BIOS because Asus kindly removed all previous versions from their site. I did keep a backup of the old BIOS file, but the new BIOS simply says 'Not a recognised BIOS file' when I try to use it. FFS Asus!
Edit 2: I booted a Linux live USB (Artix) just fine, and it passed dozens of runs of Prime95 (blend, small and smallest FFTs). It also passed a stress test of the RAM directly. I booted into Windows and RealBench crashed with instability detected (7z). I rebooted into the BIOS and changed DOCP to Default, which set the RAM to 2133MHz at 1.35v and the fabric speed to 1200MHz. Booted into Windows and RB still crashes after a few seconds with instability detected (7z). It's gotta be the RAM, right?!