While the basic shaders on Fermi share very similar arithmetic functionality with the G80 cores, everything else is done quite differently: the geometry engine is completely different, threading, scheduling, cache, etc. are all vastly improved over previous designs, and the TMUs are very different too. Granted, in terms of rendering efficiency it doesn't work out much more efficient than a scaled-up GT200, but architecture-wise it only bears a passing similarity.
If you compare the R600 against Cayman, however, there's a vast amount of similarity: same thread dispatcher, similar setup for handling geometry. The only difference is that the design is now somewhat more modular and some things have been moved around for better efficiency, e.g. the Z/stencil cache.
EDIT: As I said before, and people ignored/laughed at me... the exercise with the 6 series has been about increasing the efficiency of the design and making DX11 etc. features more "integrated", so they work better with the whole rendering pipeline, rather than a redesign.
There is smeg all in R600 that bears a similarity to Cayman; top to bottom it's different. The scheduler is VASTLY different, and AMD showed/said this: the simplified shaders (one type vs. two) make things massively easier and save a LOT of core logic. That's from AMD's own mouth (and something I suggested weeks before we saw those slides). The shaders are now four identical units per block, not a 5-way arrangement of four simpler units and one über-complex one.
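A minimal sketch of that one-type-vs-two-types point, using my own slot names and glossing over the fact that real Cayman gangs three of its four lanes together for a transcendental:

[code]
# Toy illustration of "one shader type vs two", not AMD's real issue rules.
# Slot names are mine; real Cayman actually gangs three of its four lanes
# together for a transcendental, which is glossed over here.

VLIW5_SLOTS = {"x": {"mad"}, "y": {"mad"}, "z": {"mad"}, "w": {"mad"},
               "t": {"mad", "transcendental"}}                # R600-style: 4 simple + 1 "uber"
VLIW4_SLOTS = {s: {"mad", "transcendental"} for s in "xyzw"}  # Cayman-style: 4 identical

def slots_that_can_run(slots, op):
    return [name for name, capabilities in slots.items() if op in capabilities]

print("sin/cos on the 5-way layout can only go to:", slots_that_can_run(VLIW5_SLOTS, "transcendental"))
print("sin/cos on the 4-way layout can go to:", slots_that_can_run(VLIW4_SLOTS, "transcendental"))
[/code]

With only one kind of unit, the scheduler and compiler no longer have to special-case where the awkward ops land, which is where the saved control logic comes from.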
The 2900 XT had a 512-bit external memory bus and a 1024-bit internal ring bus. The external bus has since gone from 512-bit down to 256-bit, works with different kinds of memory, and is vastly more efficient (while Nvidia's memory controller seems to struggle badly with GDDR5; some might suggest they haven't come close to updating it to match the latest memory speeds).
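Rough numbers for the bus point, using the commonly quoted reference-card clocks (so treat them as approximate):

[code]
# Peak memory bandwidth = (bus width in bytes) * effective transfer rate.
# Clocks are the commonly quoted reference figures, so approximate.
def bandwidth_gb_s(bus_bits, effective_mts):
    return (bus_bits / 8) * effective_mts * 1e6 / 1e9

print("HD 2900 XT, 512-bit GDDR3 @ ~1656 MT/s: %.0f GB/s" % bandwidth_gb_s(512, 1656))
print("HD 6970,    256-bit GDDR5 @ ~5500 MT/s: %.0f GB/s" % bandwidth_gb_s(256, 5500))
[/code]

Half the bus width, yet considerably more bandwidth, because the controller actually keeps up with GDDR5.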
The internal ring bus, well, that was a HUGE die space expense, absolutely insane (though it should have been a lower percentage on a 65nm die with more shaders). That's gone, and that's one of the biggest architectural changes from either side in, well, a long time; it's a hugely immense change.
The front end has changed dramatically from the 5870 to the 6870, let alone from the 2900 XT to the 6970.
It's fairly simple. Nvidia operates more of a 1:1 design. To put it in terms of people: Nvidia has 100 people and 100 doors for them to pass through, which is efficient and very simple, and this hasn't changed since the 8800. It doesn't really matter what type of people go through those doors; there are 100 available and each door can handle anyone.
AMD has 400 people but only 40 doors, it has to work VERY hard to get them through as quickly as possible, and different people have to go through different doors. The general route from input to output through the Nvidia core is essentially easy because of this: scheduling is pretty easy, everything is predictable and simple, and the only cost of an architecture that's efficient (in terms of code and getting its full power) is size.
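A toy sketch of why the 40-door approach takes more work, assuming a made-up 50/50 mix of dependent and independent ops and ignoring memory latency, wavefront sizes and everything else:

[code]
import random

random.seed(1)
N = 100_000
# True = this op needs the result of the op right before it in the same thread.
depends_on_previous = [random.random() < 0.5 for _ in range(N)]

def scalar_slots(deps):
    # "100 people, 100 doors": each slot issues one op per cycle, and a
    # dependent op from one thread is simply hidden by running other threads,
    # so every issue slot does useful work.
    return len(deps)

def vliw_slots(deps, width=4):
    # "400 people, 40 doors": the compiler must pack up to `width` ops from
    # the SAME thread into one bundle, and a bundle has to close whenever the
    # next op depends on a result produced inside it.
    bundles, filled = 1, 0
    for dep in deps:
        if (dep and filled > 0) or filled == width:
            bundles += 1
            filled = 0
        filled += 1
    return bundles * width

print("scalar issue-slot utilisation: %.0f%%" % (100 * N / scalar_slots(depends_on_previous)))
print("VLIW4 issue-slot utilisation:  %.0f%%" % (100 * N / vliw_slots(depends_on_previous)))
[/code]

The extra raw units are only worth anything if the compiler and scheduler can keep them fed.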
With every generation I'd expect pretty fundamental and large changes right through the AMD design, because when you're trying to get 400 people through 40 doors there's a bunch of work to be done to improve things every generation. For Nvidia there isn't much to improve on; a one-door-for-everyone philosophy pretty much solves itself.
Yes, the shaders change and the functions they can perform change, but that's not "really" architecture. The same goes for AMD's shaders and what they can do.
As for the functions and wasted die space: the only thing "wasted" on the R600 die was the tessellator. Almost every other feature of DX10 as it then stood was about efficiency, and spending die space on efficiency improvements only fails when someone rips out the software that would use them.
Over-using features in a new DX11 benchmark doesn't in any way equate to waste on a die three years ago (or is it four now); suggesting so is just ridiculous, to be perfectly honest.
By your own account and description of DX11 features that would have been a waste several years ago, you'd equally have to say that, because 3DMark over-uses several features right now, the tessellation in the 480/580/6970 is also a waste, since it's simply not fast enough once those features are over-used to a completely ridiculous degree... no, that makes entirely no logical sense either.
Hardware support comes before software implementation, which comes before further hardware support and a continual increase in software use of a feature. It HAS to start somewhere; if it didn't, no one would ever use it.
It will still be YEARS, maybe 3-4 years, before full-scene tessellation at Unigine-and-beyond levels, across almost everything in a game, is completely standard. Does that mean the tessellation in the GTX 480 is a complete waste? Well, no, it won them Unigine...
As tessellation gets used more, the hardware for it will improve, efficiency will improve, knowledge of how to code for it efficiently will improve, and hardware workarounds to reduce overheads and make it usable with lower-performance hardware around it WILL happen. That only happens sooner the earlier it's introduced.
EDIT: Other than the tessellator, can you name another feature of the original DX10 spec that was in the R600, that was wasted and would only have hurt performance if used? Then, just for the heck of it, any idea how much die space it "wasted"? Also, for the record, tessellation is actually an efficiency-improving device being used to offset an increase in quality. I.e. take one fantastically detailed tessellated image and produce the same image without tessellation (not a flat image, but the same quality and detail), and tessellation is massively faster.
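A back-of-the-envelope illustration of that last point, with made-up mesh sizes just to show the ratio:

[code]
# Made-up numbers: a coarse control mesh expanded on-chip by the tessellator
# vs storing and fetching the equivalent fully detailed mesh from memory.
BYTES_PER_VERTEX  = 32                 # say, position + normal + UV
coarse_vertices   = 10_000             # control mesh fed to the tessellator
expansion_factor  = 16                 # extra geometry generated per patch
detailed_vertices = coarse_vertices * expansion_factor

dense_bytes  = detailed_vertices * BYTES_PER_VERTEX
coarse_bytes = coarse_vertices * BYTES_PER_VERTEX

print("pre-built detailed mesh:    %.2f MB of vertex data" % (dense_bytes / 1e6))
print("coarse mesh + tessellation: %.2f MB of vertex data" % (coarse_bytes / 1e6))
print("vertex data that never has to be stored or fetched: %dx less" % (dense_bytes // coarse_bytes))
[/code]

Same final detail either way; the tessellated path just stores and moves a fraction of the geometry.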
DX1 through "until MS dies" is 99% about making it easier and faster to implement new things people come up with; it's rarely about doing something you could never ever do before at the cost of killing performance.