You can do that & it would be wrong. If 99/100 reviews all run the built-in benchmark and you compare them to the 1/100 that runs a custom scene, the outlier isn't wrong just because the other 99 all get the same result. Aggregating data without discernment is not going to get you closer to the truth, quite the opposite. What you can do is take an average across multiple reviews and compare that.
To actually get closer to the truth requires active thinking & scrutiny on a per-review basis. There's no shortcut.