Hmm, so as a bit of an update for anyone who comes across this: the crashing came back, but I think I might have a solution.
I tried flashing to the latest beta MSI BIOS. I had been getting errors in the event log saying "Display driver nvlddmkm stopped responding and has successfully recovered." When I Googled for a changelog for the BIOS, the first thing that came up was:
I have on an MSI B450-A Pro Max with an R5 3600 and 2070 Super. It completely solves the nvlddmkm driver crashes when cards switch between idle/active state (PCI-E lane speed changes).
It also completely solves the blackscreen crashes on a 5700XT in the very same situation.
https://www.reddit.com/r/Amd/comments/g750y5/updated_amd_agesa_comboam4pi_1005/
I do think that with my old PSU this was crashing the entire system. The new PSU seemed to cope with it better: the screen would flicker, but I could carry on.
However, this still didn't solve my crashing-whilst-gaming issue, so I thought I'd try reinstalling Windows. That didn't fix it. I ran FurMark for 20 minutes and it didn't crash or show any artefacts that I could see, which I found odd given I thought the GPU was bad. I ran both the Windows memory test and Memtest86+, and neither reported any errors. Then I ran OCCT's stress test, and that triggered a crash after a few minutes. I ran it again and it triggered another crash. Hmm, interesting!
When OCCT was causing crashes I noticed the following in my event log. I'd never noticed this before when crashing during gaming etc.
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
The details view of this entry contains further information.
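For anyone wanting to check their own logs for the same entry, here is a minimal Python sketch. It assumes you have copied the event text out of Event Viewer by hand, and it only splits the "Name: value" lines into fields. The helper names (parse_event_text, looks_like_machine_check) are made up for illustration and are not part of any Windows tool.

```python
# Rough sketch: split the text of a "fatal hardware error" event into fields.
# SAMPLE_EVENT is the event text quoted above; the helper names are
# hypothetical and not part of any Windows tool or library.

SAMPLE_EVENT = """A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
The details view of this entry contains further information."""


def parse_event_text(text):
    """Return a dict of the 'Name: value' lines found in the event text."""
    fields = {}
    for line in text.splitlines():
        # Only lines containing a colon carry a field; the rest is free text.
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def looks_like_machine_check(fields):
    """True when the event reports a Machine Check Exception."""
    return fields.get("Error Source") == "Machine Check Exception"


if __name__ == "__main__":
    parsed = parse_event_text(SAMPLE_EVENT)
    for name, value in parsed.items():
        print(f"{name}: {value}")
    print("Machine check:", looks_like_machine_check(parsed))
```

The useful bits to compare between crashes are the Error Source and Error Type fields, since those were the same each time OCCT brought the system down.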
After a bit of Googling, I found some people suggesting that memory can cause these errors. I remembered a friend was having issues with his motherboard crashing when using the XMP memory profile, but he has an older-generation Ryzen and slower memory, and I'd read before I even bought this set-up that these issues had been fixed in Ryzen 5 etc.
But I thought I'd give it a go anyway, so I disabled XMP, falling back to the default 2666MHz at 1.2V rather than 3600MHz at 1.35V. I ran OCCT for 25 minutes or so and it seemed to remain stable. I played some games for a few hours with friends last night and it didn't crash. I have read that some people were able to get back to near XMP speeds by manually setting the frequencies, timings etc., but I haven't tried that yet.
So maybe it was memory all along?