While this originally looked like something with one simple cause, the difficulty in replicating it suggests it might be more complex.
Many years ago, I did some work that looked at accident investigations. Some of these are really simple. "The pilot felt suicidal and flew into a hill." But those are the exception. A lot require different factors to all come together in the wrong way. A good example of this was the wreck of the MSC Napoli. You might remember this - big container ship that got in trouble in the Channel in early 2007 and ended up beached in Lyme Bay. Made the news because people descended on the beach and looted everything from nappies to motorbikes. The accident investigation found that the ship's hull tore itself down the middle because:
- It had been in a collision a few years earlier and, while it had been repaired, the centre of the hull wasn't quite as strong as it had been.
- It was in a storm whose heavy seas were making waves hit the ship at precisely the wrong frequency.
- The loading of containers on the deck had put too much weight at each end of the ship and not enough in the middle, creating torsion in the middle of the hull as the waves hit the ship.
- The automatic regulator that reduces power to the propeller shafts when they're partially out of the water to reduce vibration failed, so a crew member was put on duty regulating it manually.
- Said crew member didn't speak the same language as the people working in that section of the ship and couldn't ask anybody to take over when his shift ended.
If just one of those things hadn't been true, the Napoli would have reached her destination and nobody would have realised how close we came to a shipwreck. But all of them were true, so the shipwreck happened.
And with the 4090s, we've got a load of factors in play, including:
- The make and model of the card.
- Whether the card is installed vertically or horizontally.
- The specific adapter shipped with it - 3x 8-pin vs 4x 8-pin and 150W vs 300W cables.
- Whether the cable is bent within 3cm.
- Whether the adapter has been plugged in perfectly.
If you sent a bunch of container ships into the Channel to see whether they broke up, this would be a) a really stupid idea and b) a really bad way to diagnose what had gone wrong, as you might never replicate the specific circumstances. I suspect only Nvidia themselves have enough of a sample size to work out what combination of factors is making things get burny.
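To put rough numbers on the sample-size point, here's a quick back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the 20% figure per factor is made up, the function names are mine, and it treats the five factors as independent and all required at once, which may well not be how the real failures work.

```python
from math import prod


def p_all_factors(factor_probs):
    """Probability that every factor is present in one build (assumes independence)."""
    return prod(factor_probs)


def p_at_least_one_failure(factor_probs, sample_size):
    """Chance that at least one build in the sample has every factor line up."""
    p = p_all_factors(factor_probs)
    return 1 - (1 - p) ** sample_size


# Made-up numbers: five factors, each present in 20% of builds.
ASSUMED_PROBS = [0.2] * 5

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        chance = p_at_least_one_failure(ASSUMED_PROBS, n)
        print(f"{n:>6} cards tested: {chance:.1%} chance of seeing at least one failure")
```

With those invented inputs, someone testing ten cards would almost never see a failure, while someone with ten thousand cards to look at almost certainly would. The real probabilities are unknown, but the shape of the problem is the same.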