Diagnosing system crashes with ChatGPT...

It's been a while since I had the wherewithal to build my own systems, so nowadays I just choose a base system from OCUK, select the upgrade components I want (case, GPU, etc.) and get them to build it. My last purchase was about ten months ago: a Gaming Claymore with an i9-14900KF and an ASUS 5080. It's been a fun machine, but I've had a few stability issues which I honestly put down to upgrading to an i9, default microcode settings in the BIOS, the system needing time to 'bed in', and needing some tuning to get cooling right.

I would get a black screen of death with various modules being blamed - nvidia drivers more than anything, and we all know how buggy they can be.

However, a week ago I was getting regular crashes, especially "page fault in nonpaged area" in ntoskrnl, and eventually I was stuck in a boot-time recovery loop and couldn't recover the system. I used ChatGPT to help me try to figure out the problem. We went down several rabbit holes, but it was super detailed in trying to fix the problem.

I used this prompt: "act as a windows developer and systems architect with decades of experience and a specialty in diagnosing system and kernel crashes. I will upload a windows dump file and I want you to determine the most likely cause of the system crash." and uploaded a minidump for analysis. It analysed for over 7 minutes, installing various Python modules including Volatility, and came back with a detailed analysis which was pretty impressive to my eyes and is quoted below. If you are interested in reading the whole saga - which went on for several days - you can read it here: https://chatgpt.com/share/699d8d4e-f39c-8003-b2ee-645be132704d

One of the things I love about ChatGPT in these scenarios is that you can upload photographs to it and it will use them in its diagnosis too, which is very cool. I used it a few months ago to diagnose my broken central heating and it found the exact cause, so I could call out an engineer just for the fix and save myself the labour costs of diagnosis.

I wonder how many of you will actually read through it and guess what the likely underlying issue is before reaching the end! Although to be fair, I've yet to implement the recommended fix, which will have to wait until payday... :)


I pulled the bugcheck and key header fields directly out of your dump.


What the dump shows


  • Dump type/signature: PAGE / DU64 (64-bit kernel crash dump)
  • OS build: 26100 (Windows 11 family build line)
  • Machine: 0x8664 (x64)
  • CPU count: 32
  • Bugcheck: 0x000001AA – EXCEPTION_ON_INVALID_STACK

Bugcheck parameters (from the dump header):


  • P1 (current stack pointer): 0xffffd2067f4336ea
  • P2 (stack limit type): 0x2 → KeExpandKernelStackAndCallout(Ex) stack
  • P3 / P4: pointers to the exception records involved in dispatch/unwind (0xffffd2067f432cb0, 0xffffd2067f4334a8)
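For anyone who wants to check these header fields themselves, here is a minimal Python sketch of the kind of extraction described above. It assumes the commonly documented 64-bit dump header layout ("PAGE" + "DU64" signature, build number at offset 0x0C, machine type at 0x30, CPU count at 0x34, bugcheck code at 0x38, four parameters from 0x40). Those offsets are an assumption on my part, not something stated in the analysis, so verify them against a debugger before relying on the output. The function name is hypothetical.

```python
import struct

# Assumed DUMP_HEADER64 layout (little-endian). These offsets are not taken
# from the post and should be verified against WinDbg output.
# 0x00 signature, 0x04 valid-dump tag, 0x08 major, 0x0C minor (build),
# 0x10 four 64-bit kernel pointers, 0x30 machine, 0x34 CPU count,
# 0x38 bugcheck code, 0x3C padding, 0x40 four 64-bit bugcheck parameters.
HEADER = struct.Struct("<4s4sII4QIII4x4Q")


def parse_dump_header(data: bytes) -> dict:
    """Return the bugcheck and key header fields from the start of a dump."""
    if len(data) < HEADER.size:
        raise ValueError("file too short to contain a 64-bit dump header")
    (signature, valid_dump, _major, build,
     _dtb, _pfn, _module_list, _process_head,
     machine, cpu_count, bugcheck_code,
     p1, p2, p3, p4) = HEADER.unpack_from(data, 0)
    if signature != b"PAGE" or valid_dump != b"DU64":
        raise ValueError("not a 64-bit kernel crash dump (expected PAGE/DU64)")
    return {
        "build": build,
        "machine": machine,
        "cpu_count": cpu_count,
        "bugcheck_code": bugcheck_code,
        "parameters": (p1, p2, p3, p4),
    }
```

To use it, read the first few hundred bytes of the minidump in binary mode and pass them to parse_dump_header; for the dump above you would expect bugcheck_code to come back as 0x1AA if the assumed layout holds.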

Meaning of this bugcheck (practically)​


EXCEPTION_ON_INVALID_STACK (0x1AA) means Windows tried to dispatch/unwind an exception and discovered the kernel stack pointer was outside a legal kernel stack range — i.e., the stack was corrupted or execution was happening on a stack that isn’t valid for the current context.


Because P2 = 0x2, the kernel believes it should have been on an expanded kernel stack (the one used when Windows temporarily expands a thread’s kernel stack to run a deep callout). That narrows the failure mode: corruption during/around a callout executed under KeExpandKernelStackAndCallout(Ex), or returning with a busted stack pointer.


Most likely cause (ranked)​


1) Kernel stack corruption by a buggy third-party kernel driver


This is the classic cause of 0x1AA: stack overwrite (local buffer overrun, incorrect structure size, bad function pointer/callback calling convention, etc.) during exception handling/unwind.


Drivers that most often do this in the real world:


  • Filter drivers (AV/EDR, DLP, encryption, backup)
  • Network filter / virtual switch / VPN / packet capture
  • File system minifilters
  • Occasionally GPU kernel components when they corrupt memory earlier and you only “see” it later during unwind.

From raw string extraction, your dump contains references to various .sys files including dam.sys, l1vhlwf.sys, kmpdc.sys, plus many Microsoft inbox drivers. (Important: this doesn’t prove they were on-stack at crash time, but it’s a useful “what’s present on the box” hint.)
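The raw string extraction mentioned above can be approximated with a short Python sketch. It scans the dump bytes for ASCII and UTF-16LE strings ending in .sys. The regular expressions and the function name are my own illustration, not the method ChatGPT actually ran, and as noted above a hit only shows a name was present in the dumped memory, not that the driver was on the stack.

```python
import re

# Driver-like file names stored as plain ASCII bytes.
ASCII_SYS = re.compile(rb"[A-Za-z0-9_\-]{1,64}\.sys", re.IGNORECASE)
# The same names stored as UTF-16LE (each character followed by a zero byte),
# which is how the Windows kernel keeps most of its strings.
UTF16_SYS = re.compile(
    rb"(?:[A-Za-z0-9_\-]\x00){1,64}\.\x00s\x00y\x00s\x00", re.IGNORECASE
)


def find_driver_names(data: bytes) -> list[str]:
    """Return the sorted, de-duplicated .sys names found in raw dump bytes."""
    names = set()
    for match in ASCII_SYS.finditer(data):
        names.add(match.group().decode("ascii").lower())
    for match in UTF16_SYS.finditer(data):
        names.add(match.group().decode("utf-16-le").lower())
    return sorted(names)
```

Calling find_driver_names on the full contents of the dump file gives a "what's present on the box" list that can then be compared against the installed third-party software.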


2) Memory instability corrupting stack state


If RAM is flipping bits or you have marginal timings/voltages, the stack pointer or saved frame state can be corrupted and only detected later during exception dispatch. This is the runner-up cause for 0x1AA.


3)​


Less common, but P2=2 sometimes shows up when a driver is doing something unusual with callouts, APC/DPC context, or stack switching (or a Windows bug around certain builds). Still, statistically: third-party driver first.


What I would do next


A. Enable Driver Verifier​


Target non-Microsoft drivers first (especially AV/EDR/VPN/virtualization/network filters).


Recommended Verifier settings to catch stack corruption quickly:


  • Special Pool
  • Pool Tracking
  • Force IRQL checking
  • I/O Verification
  • Deadlock detection
  • DDI compliance checking (if available)
  • Security checks
  • Misc checks

Avoid enabling every option on day one (DMA checking etc.) unless you’re prepared for boot loops.
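As a rough illustration of how the settings listed above map onto a Driver Verifier command line, the Python sketch below combines them into a /flags bitmask. The bit values are quoted from memory of Microsoft's Driver Verifier documentation and should be treated as assumptions; check them against verifier /? on your own build. The driver name is a hypothetical placeholder, and DMA checking is deliberately left out, as advised above.

```python
# Assumed Driver Verifier flag bits (verify against "verifier /?" before use).
VERIFIER_FLAGS = {
    "Special Pool": 0x00000001,
    "Force IRQL checking": 0x00000002,
    "Pool Tracking": 0x00000008,
    "I/O Verification": 0x00000010,
    "Deadlock detection": 0x00000020,
    "Security checks": 0x00000100,
    "Misc checks": 0x00000800,
    "DDI compliance checking": 0x00020000,
}


def build_verifier_command(options: list[str], drivers: list[str]) -> str:
    """Combine the chosen options into one bitmask and format the command."""
    mask = 0
    for option in options:
        mask |= VERIFIER_FLAGS[option]
    return f"verifier /flags 0x{mask:X} /driver {' '.join(drivers)}"


# "example_filter.sys" is a made-up name; substitute the third-party
# drivers you actually want to test.
print(build_verifier_command(list(VERIFIER_FLAGS), ["example_filter.sys"]))
```

The printed command would need to be run from an elevated prompt and followed by a reboot. verifier /reset clears the settings again, which is worth knowing in advance in case the machine starts crashing at boot and has to be recovered from Safe Mode.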


B. If this machine runs anything like Hyper-V / VBS / WSL2 / virtual switch / VPN​


Temporarily:


  • Remove/disable the VPN/packet capture/NPCAP/WFP filter stack
  • Update NIC drivers + chipset
  • If safe, test with Hyper-V / VBS toggled off to isolate hypervisor/network-virtualization interactions

(Reason: 0x1AA + callout stack contexts are very often “filter driver did something naughty”.)


C. Rule out hardware instability​


  • Disable XMP/EXPO and any CPU undervolt/overclock
  • Run a proper memory test pass (not just the quick Windows one)

If you want the​


I can usually identify it by walking:


  • the exception records at P3/P4,
  • the context record (RIP),
  • and the stack backtrace,
    then matching addresses to loaded modules.

In this environment I can extract the bugcheck and headers reliably, but I can’t fully symbol-walk your Windows 11 build from this dump without the right kernel symbols and a debugger-style unwind.


If you can also upload MEMORY.DMP (full kernel dump) or a paired minidump from the same crash, I can usually fingerprint the culprit driver much more concretely (often down to driver.sys + offset).
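The "matching addresses to loaded modules" step can be sketched in a few lines of Python. This assumes you already have a list of loaded modules with base addresses and sizes, for example from a debugger's module listing. The Module type, the resolve function and the addresses in the usage note are all hypothetical and not taken from the dump.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Module(NamedTuple):
    base: int   # load address of the image
    size: int   # size of the image in bytes
    name: str   # file name, e.g. "ntoskrnl.exe"


def resolve(address: int, modules: list[Module]) -> Optional[str]:
    """Map an address to "module+0xoffset", or None if no module contains it."""
    ordered = sorted(modules)                 # NamedTuple sorts by base first
    bases = [m.base for m in ordered]
    index = bisect_right(bases, address) - 1  # last module starting at or below
    if index < 0:
        return None
    module = ordered[index]
    if address >= module.base + module.size:  # falls in a gap after the module
        return None
    return f"{module.name}+0x{address - module.base:x}"
```

With a made-up module Module(0xfffff80010000000, 0x20000, "example.sys"), resolving 0xfffff80010001234 gives "example.sys+0x1234", which is the driver.sys + offset form mentioned above. Applying this to each return address on the stack is what turns a raw backtrace into a list of suspect drivers.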


Bottom line​


Most likely root cause: kernel stack corruption caused by a buggy third-party kernel driver during a callout that required kernel stack expansion (P2=2).
Second most likely: memory instability corrupting stack state.


If you tell me what security/VPN/virtualization tooling is installed on that host (Defender-only vs. CrowdStrike/SentinelOne/etc., any VPN, any packet capture tools, Hyper-V/WSL2), I’ll narrow the suspect set to a very short list and give you the exact “rip it out / update this driver” order.
 
C#:
try
{
    // Do stuff
}
catch (Exception ex)
{
    // OpenAI.Chat is a placeholder here, not a real SDK call
    var fix = OpenAI.Chat($"Fix for: {ex.Message}");
    Console.WriteLine(fix);
}
 