BSOD - MEMORY_MANAGEMENT

Associate
Joined
19 Jan 2023
Posts
7
Location
-
Hi all..

Back in November 2020, I bought an 8Pack Elite Asus XII Hero mobo bundle and went with Corsair Vengeance RAM (RGB Pro 128GB (4x32GB) DDR4 PC4-29200C18 3600MHz). Just before xmas, I started getting BSODs, and this has escalated to at least one a day. After the last BSOD, whose stop code was MEMORY_MANAGEMENT, I ran MS Memory Diagnostics, which reported hardware errors but didn't give me a lot of detail:

The Windows Memory Diagnostic tested the computer's memory and detected hardware errors. To identify and repair these problems, contact the computer manufacturer

XML:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-MemoryDiagnostics-Results" Guid="{5f92bc59-248f-4111-86a9-e393e12c6139}" />
    <EventID>1202</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2023-01-18T20:57:20.2215140Z" />
    <EventRecordID>5085</EventRecordID>
    <Correlation />
    <Execution ProcessID="5468" ThreadID="4092" />
    <Channel>System</Channel>
    <Computer>concept</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <Results xmlns="http://manifests.microsoft.com/win/2005/08/windows/Reliability/Postboot/Events">
      <CompletionType>Fail</CompletionType>
    </Results>
  </UserData>
</Event>
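
For anyone who would rather pull the result out of an exported event programmatically than read the raw XML, here is a minimal Python sketch using only the standard library. The function name `summarize_event` is made up for illustration, and the embedded XML is a trimmed copy of the event above; the only real information it carries is the event ID, the timestamp and the `CompletionType` value.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the event shown above (only the fields this sketch reads).
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-MemoryDiagnostics-Results" Guid="{5f92bc59-248f-4111-86a9-e393e12c6139}" />
    <EventID>1202</EventID>
    <TimeCreated SystemTime="2023-01-18T20:57:20.2215140Z" />
  </System>
  <UserData>
    <Results xmlns="http://manifests.microsoft.com/win/2005/08/windows/Reliability/Postboot/Events">
      <CompletionType>Fail</CompletionType>
    </Results>
  </UserData>
</Event>"""

# The event uses two default namespaces, so each needs a prefix for lookups.
NS = {
    "ev": "http://schemas.microsoft.com/win/2004/08/events/event",
    "res": "http://manifests.microsoft.com/win/2005/08/windows/Reliability/Postboot/Events",
}


def summarize_event(xml_text):
    """Return the event ID, creation time and completion type of one event."""
    root = ET.fromstring(xml_text)
    event_id = root.findtext("ev:System/ev:EventID", namespaces=NS)
    created = root.find("ev:System/ev:TimeCreated", NS).get("SystemTime")
    completion = root.findtext(
        "ev:UserData/res:Results/res:CompletionType", namespaces=NS
    )
    return {"event_id": int(event_id), "created": created, "completion": completion}


print(summarize_event(EVENT_XML))
```

As the event itself shows, the diagnostic only records a pass/fail style `CompletionType`, which is why it gives so little to go on.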

I'm a software engineer; hardware isn't my thing at all, and I'm very much at a loss here as to what I need to do or consider replacing.

I did remove all of the RAM modules and replace them one by one and re-run the diagnostics and every time, it threw the same hardware error.

Any information/advice/suggestions on this is massively welcomed.


Cheers..

Ian
 
Soldato
Joined
1 Feb 2006
Posts
3,397
Running 128GB at 3600 will always be difficult; dropping the frequency to 3200 might be a good idea.
Create a memtest86+ boot USB and run it with 2 sticks; if you get errors, remove one and try each stick on its own, then repeat with the other 2. It's unlikely that all 4 are bad, but not impossible. You could also drop the frequency to 3200 and try again. You should also run sfc /scannow to check for corrupted system files, as bad RAM can corrupt data on the OS drive.

 
Man of Honour
Joined
22 Jun 2006
Posts
11,675
I did remove all of the RAM modules and replace them one by one and re-run the diagnostics and every time, it threw the same hardware error.

Just to clarify, you're saying that every stick tested individually still produced memory errors? :o

Back in November 2020, I bought an 8Pack Elite Asus XII Hero mobo bundle and went with Corsair Vengeance RAM (RGB Pro 128GB (4x32GB) DDR4 PC4-29200C18 3600MHz). Just before xmas, I started getting BSODs.

Did anything at all change in your PC prior to Christmas (even if it seems unrelated)?

What DRAM voltage was your memory configured to run at and what frequency and DRAM voltage is it running at now?
 
Associate
OP
Joined
19 Jan 2023
Posts
7
Location
-
Running 128GB at 3600 will always be difficult; dropping the frequency to 3200 might be a good idea.
Create a memtest86+ boot USB and run it with 2 sticks; if you get errors, remove one and try each stick on its own, then repeat with the other 2. It's unlikely that all 4 are bad, but not impossible. You could also drop the frequency to 3200 and try again. You should also run sfc /scannow to check for corrupted system files, as bad RAM can corrupt data on the OS drive.


Thanks @FredFlint. I created a MemTest86 USB and ran the first test overnight last night. Reported ~1300 errors (I have a log and HTML report). Just replaced the RAM sticks with the other 2 and only 4 mins in, it's reporting 2201 errors!

The BSOD and underlying issues have already taken out my OS twice (most data is stored on one of the other drives installed, and I have backups, so not a massive issue per se).

I'm not sure how to change the frequency of the RAM tbh.. it's one reason I bought an OC'd bundle so I didn't have to think about/work out how to do this stuff and could just construct the box with simple components (that much hardware stuff I can handle :) ).

Just to clarify, you're saying that every stick tested individually still produced memory errors? :o



Did anything at all change in your PC prior to Christmas (even if it seems unrelated)?

What DRAM voltage was your memory configured to run at and what frequency and DRAM voltage is it running at now?

@Tetras yup, running the built-in Windows 10 Mem diagnostics test, it blew up on every individual stick.

Only thing that changed was that I installed my TrackIR 5.. but even then, it had been running fine for a few weeks before this became an issue, so in all honesty, there's nothing I could think of that is close enough to the time that would have caused this? No hardware (peripheral) changes have been made for probably 8 or so months either.

I don't know what the DRAM voltage is/was. I'm assuming this is part of the stats I can see in the ASUS BIOS screen (or is it the 1.350V in the screenshot below)? If not, and it's in the BIOS, I can have a look at the info there once this MemTest86 scan finishes (and since typing this post, at 13 mins in now, there are 8917 errors).

This is the HTML report from the first 2 sticks.. but tbh, other than the "fail", it means very little to me.
sticks_1-2_report.png



Second set of sticks failed dramatically and MemTest86 stopped due to too many errors (it reported 10001 by the time I'd actually posted the rest of the reply above). Here's the 2nd report:

sticks_3-4_report.png



Really appreciate your replies both of you and assistance in this... after being a software/devops engineer for many many years... this stuff makes me realise just how much I don't know :D


Cheers..

Ian
 
Man of Honour
Joined
22 Jun 2006
Posts
11,675

Usually, that number of memory errors is considered a catastrophic failure and the memory would be sent for replacement, with zero effort to investigate any further. But, for every stick in a set to be so evidently faulty? As FredFlint said, that's very unusual, though not impossible. Typically, if there's e.g. a problem with the memory config or overheating, you'll get some errors, but not thousands.

Since you're using such a large amount of memory, I'd want to check: what speed are they actually running at (CPU-Z can tell you that) and what is the DRAM voltage. 128GB is quite challenging for most systems and sometimes it needs to have the frequency lowered and to have more voltage, but it doesn't explain why a system that was stable for so long has had 4 sticks fail simultaneously, unless perhaps the BIOS was reset (or partially reset) at some point and you didn't notice.

Unfortunately your testing is only 100% meaningful if you know what speed it is running at and what the DRAM voltage is, so you need to establish this. I think the DRAM voltage in the images above is only reading from the memory profiles, not from the BIOS (so it doesn't tell us what it is using). If you've been running the sticks at XMP, I'd try again at stock (probably 2666), but if you don't have an 8Pack profile make sure you write everything down, because otherwise you'll lose his settings and your system might not be fully stable without them.
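
As a rough way to see what speed and voltage Windows thinks each stick is configured for, here is a hedged Python sketch that asks PowerShell for the `Win32_PhysicalMemory` WMI class and tidies up the result. The helper names `parse_modules` and `query_modules` and the `SAMPLE_JSON` values are made up for illustration (they are not readings from the system in this thread). The figures come from what the firmware reports rather than from a live sensor, so CPU-Z or the BIOS screen remains the reference.

```python
import json
import subprocess

# PowerShell pipeline: one JSON object per installed memory module.
PS_COMMAND = (
    "Get-CimInstance Win32_PhysicalMemory | "
    "Select-Object DeviceLocator, Capacity, Speed, "
    "ConfiguredClockSpeed, ConfiguredVoltage | ConvertTo-Json"
)

# Made-up sample output for a single module, purely to show the shape.
SAMPLE_JSON = """
{
  "DeviceLocator": "DIMM_A1",
  "Capacity": 34359738368,
  "Speed": 3600,
  "ConfiguredClockSpeed": 3600,
  "ConfiguredVoltage": 1350
}
"""


def parse_modules(json_text):
    """Turn the ConvertTo-Json output into a list of per-module dicts."""
    data = json.loads(json_text)
    if isinstance(data, dict):  # a single module is emitted as a bare object
        data = [data]
    modules = []
    for module in data:
        modules.append({
            "slot": module["DeviceLocator"],
            "size_gb": module["Capacity"] / 2**30,            # bytes -> GiB
            "rated_speed": module["Speed"],
            "configured_speed": module["ConfiguredClockSpeed"],
            "configured_volts": module["ConfiguredVoltage"] / 1000,  # mV -> V
        })
    return modules


def query_modules():
    """Run the PowerShell query (Windows only) and parse its output."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_modules(result.stdout)


print(parse_modules(SAMPLE_JSON))
```

On the affected Windows box, calling `query_modules()` would replace the sample data with the real per-slot values. A configured speed below the rated speed, or a voltage of 0, would suggest the profile is not applied or the firmware is not reporting it.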
 
Associate
OP
Joined
19 Jan 2023
Posts
7
Location
-
Thanks again @Tetras. Despite my lack of knowledge in the area, I had the same feeling about the failure rate: surely they can't all be bad! Initially, I removed all the sticks and re-ordered them in the mobo, primarily to re-seat them and to see whether they would continue to fail in a different order. (FWIW, they were bought at the same time as the rest of the components for this build, as a pack of 4/128GB rather than built up over time.) I then installed them 1 by 1 and ran the MS diagnostics, which threw up errors every time.

The main 2 games that I play that warm everything up are Assetto Corsa Competizione, and sometimes DCS World (trying to learn that one!). The other few games are much lighter/less demanding. I'm not a huge gamer tbh and bought/built this primarily as a "future-proofed" development box and the large amount of RAM due to running multiple Docker containers. Either way, AFAIK, things don't get too hot in there.. unless I'm playing ACC for a couple of hours, the box is almost silent. It sits under the desk in a big open space (not against a wall for example) so should have plenty of room to breathe. Just for completeness, the rest of the components are:

  • Lian-Li O11 XL case
  • GeForce RTX 2080 Ti
  • Kolink Continuum 1200W
  • EK-360 Water pump/rad AIO
  • Samsung 970 EVO plus 2TB
  • Samsung 860 Pro 4TB
  • Samsung 860 Pro 2TB
  • 6x Noctua fans (2 on the floor, 3 on the side next to mobo and 1 on the rear)

Regarding BIOS.. unless it was reset by something I have no idea about, I'd be puzzled as to how, as I don't access it and the box is running 24/7.. it reboots mostly for Windows updates.

I'll install CPU-Z and start the box up again to have a look at the BIOS stats this evening.


Cheers..

Ian
 
Man of Honour
Joined
22 Jun 2006
Posts
11,675
The main 2 games that I play that warm everything up are Assetto Corsa Competizione, and sometimes DCS World (trying to learn that one!). The other few games are much lighter/less demanding. I'm not a huge gamer tbh and bought/built this primarily as a "future-proofed" development box and the large amount of RAM due to running multiple Docker containers.

Just thought I should mention this, since you may have some important work stuff on this system. If your memory really is in the bad way it appears to be, your data will be prone to corruption, so I'd minimise how much I use it for the time being (since you could end up with it not booting at all).
 
Soldato
Joined
15 May 2012
Posts
5,812
Location
Louth, lincs
With that amount of RAM I'd imagine the SA (system agent) voltage was quite high; I'm not sure of the correct terminology on your system. Do you know what that voltage was, or was it left at auto? IMC (integrated memory controller) degradation could be a possibility.
 
Associate
Joined
31 Oct 2010
Posts
301
I'd be suspecting the memory controller.
It's very unlikely that all the sticks have failed at the same time with similar errors.
 
Soldato
Joined
30 Dec 2021
Posts
3,476
Location
Yorkshire
Yes, but you're also running a lot of RAM. The controller may just not be up to the job.

When it comes to RAM, the general rule is the more you have, the slower it runs. I can't see you ever getting it to run at 3600MHz.
 
Associate
OP
Joined
19 Jan 2023
Posts
7
Location
-
Hi all..
I forgot to update this thread!
I replaced all of the sticks with new ones and this has fixed the issue (with no further changes). However, I'm keen not to go through that again in a couple of years' time. It sounds from some of the comments here that essentially I was "lucky" that it was so stable for so long?

I'm now contemplating (and currently testing) just running 64GB in here (can put the other 32GB sticks in my lads' boxes) and leaving it at 3600. I'm also assuming, with my limited hardware knowledge that 64GB @ 3600 would be (far more) beneficial in general than 128GB at say, 3000 or 3200?

Appreciate all of the comments here. This is all a learning curve for me and more than happy to be educated



Cheers..

Ian
 
Man of Honour
Joined
16 Mar 2005
Posts
8,060
Location
Clevedon , Bristol
I doubt you were ever fully stable tbh. 128GB @ 3600 is quite a test for the mem controller.

What might have happened is lots of small errors over time, resulting in a stage where Windows says 'That's enough' and starts throwing more serious errors at you, like the BSOD.

You gradually corrupt the OS with small errors.

I'd be very surprised if you last long with the new set of RAM at the same settings.

I'd be tempted to stick with 2x32GB and reinstall/repair Windows.
 
Man of Honour
Joined
22 Jun 2006
Posts
11,675
I'm also assuming, with my limited hardware knowledge that 64GB @ 3600 would be (far more) beneficial in general than 128GB at say, 3000 or 3200?

The key point is, as Armageus said: do you actually need it for anything?

That said, outside of benchmarks you're unlikely to notice the difference between 3200 and 3600.

If you're primarily using your PC for work, then I'd be inclined to just use the stock frequency and timings for stability reasons (if you have a 10th gen i7 or i9 then that's 2933, 11th gen is 3200).
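
To put rough numbers on the speeds mentioned above, here is a small Python sketch of the usual theoretical peak bandwidth arithmetic: transfers per second times bytes per transfer times channels. It assumes a dual-channel platform with 64-bit (8-byte) channels, and the function name `peak_bandwidth_gbs` is made up for illustration. These are ceiling figures only; they say nothing about latency or what real applications achieve.

```python
def peak_bandwidth_gbs(mega_transfers, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (assumes 64-bit channels, dual channel)."""
    return mega_transfers * 1_000_000 * bus_bytes * channels / 1e9


# Speeds mentioned in the thread: stock 2933 and 3200, and XMP 3600.
for speed in (2933, 3200, 3600):
    print(f"DDR4-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s peak")

# Relative gain of 3600 over 3200 on paper.
gain = peak_bandwidth_gbs(3600) / peak_bandwidth_gbs(3200) - 1
print(f"3600 vs 3200: {gain:.1%} more peak bandwidth")
```

On paper, 3600 offers 12.5% more peak bandwidth than 3200, and that is before timings and real workloads are taken into account.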
 
Associate
OP
Joined
19 Jan 2023
Posts
7
Location
-
Depends if you actually need 128GB for anything? (e.g. if you use it for VM's, Databases, or big image manipulation etc)

Faster Memory is only faster, until you run out of it :)

Good point! :D Biggest DBs I have here to work with are 4.5GB (2x running side-by-side). I don't do any 3D rendering any more, and my Photoshop stuff is mostly game liveries, nothing crazy.


I doubt you were ever fully stable tbh. 128GB @ 3600 is quite a test for the mem controller.

What might have happened is lots of small errors over time, resulting in a stage where Windows says 'That's enough' and starts throwing more serious errors at you, like the BSOD.

You gradually corrupt the OS with small errors.

I'd be very surprised if you last long with the new set of RAM at the same settings.

I'd be tempted to stick with 2x32GB and reinstall/repair Windows.

This is a good point. I always left this box on 24/7.. pretty much the only times it ever got rebooted was for windows updates. The "trigger" seemed to be shutting it down completely and firing it back up the next day, which I'm assuming kind of potentially makes sense.

It's been fine for the past 6 months, but again, the box has been running 24/7. As 3 pretty decently spec'd desktops running in a 12' log cabin contribute a fair amount of heat even when not under load (and the kids have a habit of leaving games running overnight too!), I'm going to be shutting this box down at night, and I don't fancy the corruption again.

So far, 64GB seems to be handling everything just fine.. so looks like the kids will get a 16->32GB upgrade shortly :)


Cheers..

Ian
 