I haven't actually analyzed this question, so I can't say for sure, but it's important to know that a lot more differs between the GTX 480 and GTX 460 than just 480 vs. 336 shaders.
Internally, the GTX 480 (GF100) is organized as 15 cores, each containing two 16-wide computational units, for 32 CUDA cores per core ("CUDA cores" is a marketing name - they are not real cores, as they don't compute independently). The GTX 460 (GF104) has 7 cores, each with three 16-wide units (48 CUDA cores). Each GF104 core also has twice as many special-function units (which evaluate exponentials, trigonometric functions, etc.) as a GF100 core. Perhaps the most relevant change, though, is that GF104 is able to use more of its execution hardware at the same time (e.g., both the shaders and the special-function units), because it is superscalar - it can issue more than one instruction at the same time from each thread (strictly, from each warp) (http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2).
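As a quick sanity check on those numbers, here is a minimal sketch that rebuilds the advertised shader counts from the per-core layout described above. The names `chips`, `shader_count`, and `LANES_PER_UNIT` are just illustrative choices of mine, not anything from NVIDIA's tools.

```python
# Per-chip layout as described above: (cores, 16-wide units per core).
# All names here are hypothetical, chosen only for this illustration.
LANES_PER_UNIT = 16

chips = {
    "GTX 480 (GF100)": (15, 2),
    "GTX 460 (GF104)": (7, 3),
}


def shader_count(cores, units_per_core, lanes=LANES_PER_UNIT):
    """Advertised "CUDA core" count = cores * units per core * lanes per unit."""
    return cores * units_per_core * lanes


counts = {name: shader_count(*layout) for name, layout in chips.items()}

for name, total in counts.items():
    cores, units = chips[name]
    print(f"{name}: {cores} cores x {units * LANES_PER_UNIT} lanes = {total} CUDA cores")
```

The totals come out to 480 and 336, so the headline shader counts hide quite different layouts: fewer but wider cores on the GTX 460.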
My suspicion (and again, I haven't tested these guesses) is that a few things are behind the GTX 460 being more efficient than expected:
- Fewer SMs (the "cores" described above) may make it easier to load-balance the GPU
- More special-function hardware may increase throughput in a few parts of the code
- Superscalar execution provides an efficiency boost in per-shader terms