Diagnosing system crashes with ChatGPT...

Soldato
Joined
29 Aug 2006
Posts
4,352
Location
In a world of my own - but it's ok, they know me.
It's been a while since I had the wherewithal to be able to build my own systems so nowadays I just choose a base system from OCUK, select the upgrade components I want (case, GPU, etc) and get them to build it. My last purchase was about ten months ago and it was a Gaming Claymore - i9-14900kf with an ASUS 5080. It's been a fun machine but I've had a few stability issues which I honestly put down to upgrading to an i9, default microcode settings in the BIOS, needing time to 'bed in' and needing some tuning to get cooling right etc.

I would get a black screen of death with various modules being blamed - NVIDIA drivers more than anything, and we all know how buggy they can be.

However, a week ago I was getting regular crashes, especially "page fault in nonpaged area" in ntoskrnl, and eventually I was stuck in a boot-time recovery loop and couldn't recover the system. I used ChatGPT to help me try to figure out the problem and we went down several rabbit holes, but it was super detailed in trying to fix the problem.

I used this prompt: "act as a windows developer and systems architect with decades of experience and a specialty in diagnosing system and kernel crashes. I will upload a windows dump file and I want you to determine the most likely cause of the system crash." and uploaded a minidump for analysis. It analysed for over 7 minutes - installing various Python modules, including Volatility - and came back with a detailed analysis, which was pretty impressive to my eyes and is quoted below. If you are interested in reading the whole saga - which went on for several days - you can read it here: https://chatgpt.com/share/699d8d4e-f39c-8003-b2ee-645be132704d

One of the things I love about ChatGPT in these scenarios is that you can upload photographs to it and it will use them in its diagnosis too, which is very cool. I used it a few months ago to diagnose my broken central heating and it found the exact cause, so I was able to call out an engineer to fix it and saved myself the labour costs of diagnosis.

I wonder how many of you will actually read through it and guess what the likely underlying issue is before reaching the end! Although, to be fair, I've yet to implement the recommended fix, which will have to wait until payday... :)


I pulled the bugcheck and key header fields directly out of your dump.


What the dump shows


  • Dump type/signature: PAGE / DU64 (64-bit kernel crash dump)
  • OS build: 26100 (Windows 11 family build line)
  • Machine: 0x8664 (x64)
  • CPU count: 32
  • Bugcheck: 0x000001AA – EXCEPTION_ON_INVALID_STACK
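The header fields listed above can be read with a few lines of Python. This is only a minimal sketch: the byte offsets are an assumption taken from publicly documented, reverse-engineered layouts of the 64-bit dump header rather than an official specification, and `parse_dump_header` is a hypothetical helper name.

```python
import struct

# Assumed offsets in the 64-bit dump header (reverse-engineered layout,
# not an official specification).
OFF_MINOR_VERSION = 0x0C  # holds the OS build number
OFF_MACHINE = 0x30        # machine type, CPU count and bugcheck code follow
OFF_PARAMS = 0x40         # four 64-bit bugcheck parameters


def parse_dump_header(data: bytes) -> dict:
    """Extract the headline fields from the first bytes of a crash dump."""
    if len(data) < OFF_PARAMS + 32:
        raise ValueError("header too short")
    # The first eight bytes are the signature pair shown as PAGE / DU64.
    if data[0:4] != b"PAGE" or data[4:8] != b"DU64":
        raise ValueError("not a 64-bit kernel crash dump")
    (build,) = struct.unpack_from("<I", data, OFF_MINOR_VERSION)
    machine, cpus, bugcheck = struct.unpack_from("<III", data, OFF_MACHINE)
    params = struct.unpack_from("<4Q", data, OFF_PARAMS)
    return {
        "build": build,
        "machine": machine,
        "cpu_count": cpus,
        "bugcheck": bugcheck,
        "params": params,
    }
```

Reading the first few kilobytes of the dump file and passing them to `parse_dump_header` should be enough, since all of these fields sit at the very start of the file.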

Bugcheck parameters (from the dump header):


  • P1 (current stack pointer): 0xffffd2067f4336ea
  • P2 (stack limit type): 0x2 → KeExpandKernelStackAndCallout(Ex) stack
  • P3 / P4: pointers to the context record being unwound or dispatched (0xffffd2067f432cb0) and the exception record being dispatched (0xffffd2067f4334a8)
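As a rough illustration of how these four values can be interpreted, the sketch below decodes them in Python. The `STACK_LIMIT_TYPES` table is deliberately partial: only the 0x2 entry comes from the dump above, and the other two entries are recalled from Microsoft's bug check reference and should be checked against it. The function name `describe_0x1aa` is hypothetical.

```python
# Partial table; only 0x2 is confirmed by the dump above. The other
# entries are an assumption to verify against the bug check reference.
STACK_LIMIT_TYPES = {
    0x1: "processor DPC stack",
    0x2: "KeExpandKernelStackAndCallout(Ex) stack",
    0x3: "normal kernel thread stack",
}

KERNEL_SPACE_START = 0xFFFF800000000000  # upper canonical half on x64


def describe_0x1aa(p1: int, p2: int, p3: int, p4: int) -> dict:
    """Summarise the four bugcheck parameters for 0x1AA."""
    return {
        "stack_pointer": hex(p1),
        # x64 stack pointers are normally at least 8-byte aligned, so a
        # misaligned value is consistent with corruption (not proof of it).
        "stack_pointer_aligned": p1 % 8 == 0,
        "in_kernel_space": p1 >= KERNEL_SPACE_START,
        "expected_stack": STACK_LIMIT_TYPES.get(p2, "unknown, see reference"),
        "record_pointers": (hex(p3), hex(p4)),
    }


summary = describe_0x1aa(
    0xFFFFD2067F4336EA, 0x2, 0xFFFFD2067F432CB0, 0xFFFFD2067F4334A8
)
```

With the values from this dump, the stack pointer ends in 0xEA, so the alignment check fails, which fits the stack-corruption reading without proving it.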

Meaning of this bugcheck (practically)


EXCEPTION_ON_INVALID_STACK (0x1AA) means Windows tried to dispatch/unwind an exception and discovered the kernel stack pointer was outside a legal kernel stack range — i.e., the stack was corrupted or execution was happening on a stack that isn’t valid for the current context.


Because P2 = 0x2, the kernel believes it should have been on an expanded kernel stack (the one used when Windows temporarily expands a thread’s kernel stack to run a deep callout). That narrows the failure mode: corruption during/around a callout executed under KeExpandKernelStackAndCallout(Ex), or returning with a busted stack pointer.


Most likely cause (ranked)


1) Kernel stack corruption by a buggy third-party driver


This is the classic cause of 0x1AA: stack overwrite (local buffer overrun, incorrect structure size, bad function pointer/callback calling convention, etc.) during exception handling/unwind.


Drivers that most often do this in the real world:


  • Filter drivers (AV/EDR, DLP, encryption, backup)
  • Network filter / virtual switch / VPN / packet capture
  • File system minifilters
  • Occasionally GPU kernel components when they corrupt memory earlier and you only “see” it later during unwind.

From raw string extraction, your dump contains references to various .sys files including dam.sys, l1vhlwf.sys, kmpdc.sys, plus many Microsoft inbox drivers. (Important: this doesn’t prove they were on-stack at crash time, but it’s a useful “what’s present on the box” hint.)
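A raw string extraction of this kind can be approximated as below. This is a sketch under assumptions, not the method actually used in the analysis: it scans the bytes for anything that looks like a `.sys` file name in either ASCII or UTF-16LE, and `find_driver_names` is a hypothetical helper. As noted above, a hit only shows the name is present somewhere in the dump.

```python
import re

# Driver-like names stored as plain ASCII bytes.
ASCII_SYS = re.compile(rb"[A-Za-z0-9_\-]{1,32}\.sys", re.IGNORECASE)
# The same names stored as UTF-16LE (each character followed by a zero byte).
UTF16_SYS = re.compile(
    rb"(?:[A-Za-z0-9_\-]\x00){1,32}\.\x00s\x00y\x00s\x00", re.IGNORECASE
)


def find_driver_names(data: bytes) -> set:
    """Return the lower-cased .sys names found anywhere in the raw bytes."""
    names = {m.group().decode("ascii").lower() for m in ASCII_SYS.finditer(data)}
    names |= {
        m.group().decode("utf-16-le").lower() for m in UTF16_SYS.finditer(data)
    }
    return names
```

Random byte runs can occasionally resemble a name, so the result is best treated as a list of candidates to cross-check against the installed drivers.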


2) Memory (RAM) instability


If RAM is flipping bits or you have marginal timings/voltages, the stack pointer or saved frame state can be corrupted and only detected later during exception dispatch. This is the runner-up cause for 0x1AA.


3) Unusual callout or stack-switching usage (or a Windows bug)


Less common, but P2=2 sometimes shows up when a driver is doing something unusual with callouts, APC/DPC context, or stack switching (or a Windows bug around certain builds). Still, statistically: third-party driver first.


What I would do next to confirm the cause


A. Enable Driver Verifier


Target non-Microsoft drivers first (especially AV/EDR/VPN/virtualization/network filters).


Recommended Verifier settings to catch stack corruption quickly:


  • Special Pool
  • Pool Tracking
  • Force IRQL checking
  • I/O Verification
  • Deadlock detection
  • DDI compliance checking (if available)
  • Security checks
  • Misc checks

Avoid enabling every option on day one (DMA checking etc.) unless you’re prepared for boot loops.
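The option list above maps onto the bit flags accepted by `verifier /flags`. The sketch below builds that command line in Python; the flag values are the ones documented for Driver Verifier, while `example.sys` is a placeholder driver name (an assumption, not something taken from the dump) and `verifier_command` is a hypothetical helper.

```python
# Driver Verifier option bits as documented for "verifier /flags".
VERIFIER_FLAGS = {
    "special_pool": 0x00000001,
    "force_irql_checking": 0x00000002,
    "pool_tracking": 0x00000008,
    "io_verification": 0x00000010,
    "deadlock_detection": 0x00000020,
    "dma_checking": 0x00000080,
    "security_checks": 0x00000100,
    "miscellaneous_checks": 0x00000800,
    "ddi_compliance_checking": 0x00020000,
}

# The recommended set from the list above: everything except DMA checking.
RECOMMENDED = [name for name in VERIFIER_FLAGS if name != "dma_checking"]


def verifier_command(options, drivers):
    """Build a 'verifier /flags ... /driver ...' command line as a string."""
    mask = 0
    for name in options:
        mask |= VERIFIER_FLAGS[name]
    return f"verifier /flags 0x{mask:X} /driver {' '.join(drivers)}"


# "example.sys" is a placeholder; substitute the third-party drivers to test.
command = verifier_command(RECOMMENDED, ["example.sys"])
```

The resulting command would be run from an elevated prompt and takes effect after a reboot; `verifier /reset` clears the settings again if the machine becomes unbootable in normal mode.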


B. If this machine runs anything like Hyper-V / VBS / WSL2 / virtual switch / VPN


Temporarily:


  • Remove/disable the VPN/packet capture/NPCAP/WFP filter stack
  • Update NIC drivers + chipset
  • If safe, test with Hyper-V / VBS toggled off to isolate hypervisor/network-virtualization interactions

(Reason: 0x1AA + callout stack contexts are very often “filter driver did something naughty”.)


C. Rule out hardware instability


  • Disable XMP/EXPO and any CPU undervolt/overclock
  • Run a proper memory test pass (not just the quick Windows one)

If you want the exact driver name


I can usually identify it by walking:


  • the exception records at P3/P4,
  • the context record (RIP),
  • and the stack backtrace,
    then matching addresses to loaded modules.
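The last step, matching an address to a loaded module, amounts to a range lookup. A minimal sketch follows, assuming the loaded-module list is already available as (base, size, name) tuples; the base addresses and sizes in the example are invented for illustration and do not come from this dump.

```python
from bisect import bisect_right


def resolve(address, modules):
    """Map an address to 'module+0xoffset' using (base, size, name) tuples.

    The modules list must be sorted by base address. Returns None when the
    address falls outside every module.
    """
    bases = [base for base, _size, _name in modules]
    index = bisect_right(bases, address) - 1
    if index >= 0:
        base, size, name = modules[index]
        if address < base + size:
            return f"{name}+0x{address - base:x}"
    return None


# Hypothetical module list: these bases and sizes are made up.
MODULES = [
    (0xFFFFF80000000000, 0x01000000, "ntoskrnl.exe"),
    (0xFFFFF80002000000, 0x00020000, "example.sys"),
]
```

For instance, `resolve(0xFFFFF80002001234, MODULES)` gives `example.sys+0x1234`, which is the "driver.sys + offset" form a debugger would normally report.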

However, in this environment I can extract the bugcheck and headers reliably, but I can’t fully symbol-walk your Windows 11 build from this dump without the right kernel symbols and a debugger-style unwind.


If you can also upload MEMORY.DMP (full kernel dump) or a paired minidump from the same crash, I can usually fingerprint the culprit driver much more concretely (often down to driver.sys + offset).


Bottom line


Most likely root cause: kernel stack corruption caused by a buggy third-party kernel driver during a callout that required kernel stack expansion (P2=2).
Second most likely: memory instability corrupting stack state.


If you tell me what security/VPN/virtualization tooling is installed on that host (Defender-only vs. CrowdStrike/SentinelOne/etc., any VPN, any packet capture tools, Hyper-V/WSL2), I’ll narrow the suspect set to a very short list and give you the exact “rip it out / update this driver” order.
 
My last purchase was about ten months ago and it was a Gaming Claymore - i9-14900kf with an ASUS 5080. It's been a fun machine but I've had a few stability issues which I honestly put down to upgrading to an i9, default microcode settings in the BIOS, needing time to 'bed in' and needing some tuning to get cooling right etc.
Can you elaborate on this please?

By upgrading to an i9, do you mean that you upgraded the CPU, or that it came with this CPU?

By default microcode settings, are you referring to the mitigations posted in recent Intel BIOS?

You're checking some of the boxes here (symptoms of degraded/degrading CPU):

Which is a little concerning.
 
Can you elaborate on this please?

By upgrading to an i9, do you mean that you upgraded the CPU, or that it came with this CPU?

By default microcode settings, are you referring to the mitigations posted in recent Intel BIOS?

You're checking some of the boxes here (symptoms of degraded/degrading CPU):

Which is a little concerning.

It came with the i9 - I've never had one before, always had i7s so it was an upgrade to me.
 
Why do you need settings and cooling to settle in? PCs generally don't work like that.

If it's that unstable, reset all overclocking?
 
Yeah, I don't understand the "needs time to bed in" thing either.

Tuning your cooling, fair enough.

Page fault in nonpaged area: first question would be, is it overclocked? If so, load defaults.

Also, that's a common fault if the CPU is unstable, and that CPU, from memory, is one of the ones on the list for possible degradation.

I think Intel had a checklist of things to do to see whether it fixes it, or whether it still shows instability and needs an RMA.
 
CPU is stock and ASUS motherboard settings have been set to Intel defaults. ChatGPT suggested that the Aerocool Integrator PSU was insufficient for the job of running the CPU and 5080 and I should look at a better PSU. Do you think this is wrong?
 
CPU is stock and ASUS motherboard settings have been set to Intel defaults. ChatGPT suggested that the Aerocool Integrator PSU was insufficient for the job of running the CPU and 5080 and I should look at a better PSU. Do you think this is wrong?
I think you should ask OCUK, as they built it and it's still inside the 3 year limited warranty. You change the PSU and that's the warranty on ALL of it gone.
 
Depends what’s triggering the fault. If the crashes mostly occur at times when the CPU and GPU are under load, then possibly. I’ve no experience with those PSUs, but the issue doesn’t point to unstable power. How old is the system?
 
Why do you need settings and cooling to settle in? PCs generally don't work like that.

If it's that unstable, reset all overclocking?

I don’t know why the AI pointed this out in this case, but it’s probably a reference to heat cycling, which is a main cause of certain kinds of failures. Bedding-in time is a factor. I don’t think this is the cause here, though.
 
I think you should ask OCUK, as they built it and it's still inside the 3 year limited warranty. You change the PSU and that's the warranty on ALL of it gone.

This. If only Chat GPT could package everything up and carry the system to the post office…
 
Depends what’s triggering the fault. If the crashes mostly occur at times when the CPU and GPU are under load, then possibly. I’ve no experience with those PSUs, but the issue doesn’t point to unstable power. How old is the system?

Ten months old at the moment. It crashed at odd times before I did the Windows rebuild - under load, exiting games, when in sleep mode overnight. After the rebuild it's more stable, but there's still the occasional crash. I've run the Intel Diagnostic Tool and got a pass, and ran MemTest86 overnight from the BIOS and got a pass. I'm just running Cinebench at the moment and will likely go through a bunch of stress tests over the next few days to see what happens. I've set the BIOS to all the recommended Intel settings, but I still worry that the chip is degraded and will get worse over time.
 
This. If only Chat GPT could package everything up and carry the system to the post office…

I don't have the original packaging anymore - I had a garage clear-out last year. If I have to, I will drive the system to OCUK myself and hand it over to them.
 
Ten months old at the moment. It crashed at odd times before I did the Windows rebuild - under load, exiting games, when in sleep mode overnight. After the rebuild it's more stable, but there's still the occasional crash. I've run the Intel Diagnostic Tool and got a pass, and ran MemTest86 overnight from the BIOS and got a pass. I'm just running Cinebench at the moment and will likely go through a bunch of stress tests over the next few days to see what happens. I've set the BIOS to all the recommended Intel settings, but I still worry that the chip is degraded and will get worse over time.

So it shipped with the i9 as-is from OcUK ten months ago?
 
Perhaps the bed-in time refers to whatever thermal paste is on there. Some do improve after being used for a while.

Might have missed it, but did you check you're on the latest BIOS?
 
I think it’s pretty likely the issue is that the i9 is dying. Random bouts of throwing up errors are typically one of the first symptoms. Possibly ask Chat GPT to behave as an Intel engineer.
 
Perhaps the bed-in time refers to whatever thermal paste is on there. Some do improve after being used for a while.

Might have missed it, but did you check you're on the latest BIOS?

That is part of the bedding-in process, but it’s more to do with heat cycles of the die, motherboard, traces etc., as everything expands and contracts and gases off.
 