Two different arguments, these. I'm talking about boiling figures down to an average. Take the car analogy. You have ten cars and you want to work out the average top speed of all ten, but five of them have a speed limiter engaged. You cannot then say, 'The average top speed of these ten cars is 130 mph' or whatever it comes to. There is junk data contributing to that number. I'm not arguing that they shouldn't be testing these titles and showing that the 4090 hits their CPU or engine limits (and it is engine limits in some cases, which is even worse). I'm arguing that the number output as an average is junk.
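To put rough numbers on the car analogy, here is a minimal Python sketch. Every figure in it is made up for illustration: the individual speeds, the 100 mph limiter, and the assumption that the limited cars would otherwise match the unlimited ones are not from any real test.

```python
from statistics import mean

# Hypothetical top speeds in mph. None of these figures come from a real test.
unlimited = [140, 150, 160, 170, 180]      # cars free to reach their true top speed
true_limited = [140, 150, 160, 170, 180]   # what the limited cars could do (assumed)

LIMITER_MPH = 100  # assumed limiter setting on the other five cars

# With the limiter engaged, we only ever observe the capped value.
observed_limited = [min(speed, LIMITER_MPH) for speed in true_limited]

observed_avg = mean(unlimited + observed_limited)   # the headline "average"
unlimited_only_avg = mean(unlimited)                # limited cars excluded
true_avg = mean(unlimited + true_limited)           # unknowable in practice

print(f"average over all ten cars as measured: {observed_avg} mph")
print(f"average over the five unlimited cars:  {unlimited_only_avg} mph")
print(f"average if no limiter were engaged:    {true_avg} mph")
```

Under these assumptions the ten-car average comes out at 130 mph while the unlimited cars average 160 mph, so half of the headline figure is set by the limiter rather than by the cars.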
Whether it hits CPU/engine limits or GPU limits is beside the point. The question that (most) consumers are interested in is "how will putting this GPU in my computer affect the games I play?" That is only reasonably reflected by testing those games in actual computer setups. Whether or not the 4090 is capable of more in a still-theoretical future system with a faster CPU, memory, etc. is not relevant to that question. In fact, being limited by the best CPUs available is a property of the card that is relevant to consumers, and excluding it would produce less relevant results than including it.
And I'm telling you that a methodology that presents an average with hidden confounds would be laughed out of the room in my workplace. It's junk. Worthless. The methodology has to exclude the limited results from said average, and it doesn't.
I don't agree that "behaviour in real world testing" is a hidden confounding factor. Rather, it is what is being measured. Any average hides detail, and that detail may or may not be very important to the audience, but hiding detail does not make using an average wrong. Now, if any particular source has failed to discuss the probable reasons for variations in performance, and why CPU limitation may explain some of the results, then I'd agree they're not informing their audience well.