Between LGA 775 and LGA 1366, Intel changed how the cache is laid out between cores. On 775, each core has its own cache and there's a collective pool. If a core doesn't find the data in its own cache or in the pool, it then has to query the caches of the other cores, and only if that fails does it fetch the data from RAM. On 1366, each core has some cache dedicated to it, and the data in that cache is mirrored in the pool. That leaves less unique cache capacity overall, but if a core doesn't find data in its own cache or in the pool, it knows the data isn't stored anywhere on the chip.
The latter layout is more efficient, as you don't have cores repeatedly trying to access each other's caches. This, possibly combined with triple-channel RAM, accounts for much of the performance-per-clock improvement of the new Intel generation.
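To make the difference concrete, here's a rough toy model of the two lookup orders as I've described them. It is only a sketch: Python sets stand in for caches, the function names and addresses are made up for illustration, and real hardware is a good deal more involved than this.

```python
# Toy model only: sets of addresses stand in for caches.
# All names and values here are hypothetical, chosen for illustration.

def probes_old_layout(core, addr, private, pool):
    """775-style lookup as described above: own cache, pool, other cores, RAM."""
    probes = ["own"]
    if addr in private[core]:
        return probes
    probes.append("pool")
    if addr in pool:
        return probes
    # A miss in the pool says nothing about the other cores, so ask each one.
    for other, cache in enumerate(private):
        if other == core:
            continue
        probes.append(f"core{other}")
        if addr in cache:
            return probes
    probes.append("ram")
    return probes


def probes_new_layout(core, addr, private, pool):
    """1366-style lookup: the pool mirrors every private cache."""
    probes = ["own"]
    if addr in private[core]:
        return probes
    probes.append("pool")
    if addr in pool:
        return probes
    # A miss in the pool means no core holds the data, so go straight to RAM.
    probes.append("ram")
    return probes


private = [{0x10}, {0x20}, {0x30}, {0x40}]   # one private cache per core
old_pool = {0x50}                            # pool holds its own data only
new_pool = {0x50}.union(*private)            # pool also mirrors the private caches

print(probes_old_layout(0, 0x99, private, old_pool))
# ['own', 'pool', 'core1', 'core2', 'core3', 'ram']
print(probes_new_layout(0, 0x99, private, new_pool))
# ['own', 'pool', 'ram']
```

For an address that isn't cached anywhere, the old layout in this model checks six places before going to RAM and the new one checks three, at the cost of the pool spending some of its space on copies.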
The point I'm trying to make is that your question doesn't make much sense. Is a core the processing area with cache attached? Or without it attached? And what do you make of any shared memory? At best all you can do is take two chips with the same number of cores and run tests against each other, which does lead to Q6600 < Q9650 < i7 920 without HT, all tested at the same clock speed.
I think 45nm transistors switch quicker than 65nm, and likewise 32nm quicker than 45nm, so shrinking the process size will also account for some of the improvement.
I've no idea what AMD is doing though, over to BigWayne for that.