Chill out - I wasn't insulting your beloved AMD, I was just explaining why increasing the number of transistors necessarily leads to a larger die size.
A more modular design will always come with a related cost in transistors, since an increased amount of control logic is required to connect the various modular units efficiently (I'll explain this in more depth shortly). The reason for taking a more modular approach is, as always, to improve scalability (e.g. to allow you to increase the number of SPs while maintaining as close to a linear performance increase as you can). This idea of increased modularity improving scalability is a general data-flow efficiency concept that reaches far outside the design of semiconductors.
---------
Anyway, getting back to why a more modular design costs extra transistors: Cypress and Barts both have two "RPEs" (render and processing engines), each of which is a modular block of SIMDs, TMUs, local cache and data-share logic (see pic below). Each of these has its own dispatch processor and modular cache. To link the two RPEs, a "global data share" is required, which, like anything else on the die, costs transistors. My understanding is that Cayman has three RPEs (up from two on Barts and Cypress). To link the three blocks it is necessary to at least double the amount of "global data share" logic (one data share to link block1 to block2, and one to link block2 to block3). So, you're seeing a 50% increase in the number of RPEs for a 100% increase in the number of transistors that connect them.
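To put rough numbers on that scaling, here is a small Python sketch. It assumes the RPEs are linked in a simple chain (one data-share link between neighbouring blocks), which is my reading of the layout and not something AMD has published. The helper names `chain_links` and `percent_increase` are made up purely for illustration.

```python
def chain_links(num_rpes: int) -> int:
    """Data-share links needed to join RPEs in a chain (assumed topology)."""
    if num_rpes < 1:
        raise ValueError("need at least one RPE")
    return num_rpes - 1


def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# Two RPEs (Barts/Cypress) versus the three I am assuming for Cayman.
rpe_growth = percent_increase(2, 3)
link_growth = percent_increase(chain_links(2), chain_links(3))
print(f"RPEs: +{rpe_growth:.0f}%, links: +{link_growth:.0f}%")
```

Under that chain assumption the link count grows faster than the block count. A different interconnect (a shared bus, say) would scale differently.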
Secondly: Consider that, in comparison to Cypress, each SPU now only has 4 SPs instead of 5. Each SPU has associated with it some control logic to link it to the rest of the SIMD core. So, in Cypress you have five SPs for every chunk of control logic, whereas in Cayman you have four. So, for a fixed total number of SPs (say 1600) you have more pieces of local (intra-SIMD) control logic in the Cayman design than in Cypress (400 instead of 320).
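The same arithmetic as a quick Python sketch. The 1600 SP total is just the illustrative figure from above, and `control_blocks` is a hypothetical helper that assumes exactly one chunk of control logic per SPU.

```python
def control_blocks(total_sps: int, sps_per_spu: int) -> int:
    """Number of SPUs, and so chunks of intra-SIMD control logic,
    assuming one chunk per SPU."""
    if total_sps % sps_per_spu != 0:
        raise ValueError("total SPs must divide evenly into SPUs")
    return total_sps // sps_per_spu


TOTAL_SPS = 1600  # fixed SP budget used for the comparison

five_wide = control_blocks(TOTAL_SPS, 5)  # Cypress-style SPU
four_wide = control_blocks(TOTAL_SPS, 4)  # Cayman-style SPU

print(five_wide, four_wide, four_wide / five_wide)
```

So for the same SP count, the 4-wide layout needs 25% more control blocks. How big each block is would be a separate question.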
[Barts core]
I'm not going to argue rumour-by-rumour, they change every day. As I understand it, Cayman has (natively) 1680 SPs, 96 TMUs and 48 ROPs, arranged into three RPEs (with 560 SPs / 32 TMUs / 16 ROPs in each RPE - just like Barts). If the rumours of yield issues are correct, we could see some of the SIMD cores disabled to work around manufacturing defects (which could account for the 1536 number), but deactivating these clusters would not reduce the overall die size.
I said that there is a strict limit on the number of TRANSISTORS you can fit in a given area on a given process, not "shaders". But there is no dramatic change in the architecture from Cypress to Cayman, and so no reason that transistor density will have improved dramatically either. It's also very reasonable to assume that the transistor density will continue to be better than Nvidia's GTX580 (since their architecture has not changed dramatically either).
Your opinion of "what should be easy a year later" is utterly irrelevant. These considerations should be made based on the semiconductor physics, and the logic of GPU design, not on your personal perception of how GPUs have improved historically.
The only problem with this post is that none of it's relevant, and that's a real shame, because it was really quite long.
You're ignoring so many key factors as to make everything you said irrelevant.
Let's take your analogy for example: 100 soldiers, 10 groups of 10, one commanding officer. That's great.
But in this situation we have 1600 soldiers, and not 1600 equal soldiers, but two VERY different types of soldier in each group. So let's say each group is 4 plain old riflemen and one fantastically well equipped elite forces unit who carries a mortar, an RPG, a gun, some explosives, etc. You need one guy to control the 4 basic guys and tell them what to do, and you need another guy who's able to properly direct the far more advanced unit.
Now, instead of just increasing the number of groups by dividing 1600 by 4 instead of 5, you've also taken out the ultra complex unit, and with it the complexity of telling that different guy how to perform. What you're supposing is that 1600/5 clusters of shaders + the core logic to control them would use fewer transistors than 1600/4 clusters of shaders + the core logic to control them. If that core logic was the same and used the same number of transistors, sure. Unfortunately that's the part you got wrong: the core logic won't be the same. Much of the reason to move to 4 identical shaders rather than 4 + 1 VERY different shader is the simplification at EVERY stage of controlling those shaders. The scheduler, the dispatcher, everything can be made more streamlined with one type of shader to control rather than two completely different shaders. Which means a 4-way shader + all its core logic probably uses the same or even fewer transistors than a 5-way shader and more complex core logic at every stage of the pipeline, inside and outside the RPE.
Each 4-way shader is smaller than a 5-way shader, and the core logic to control and balance the workflow and schedule the work is FAR simpler, as you're no longer waiting on the much more complex shader to do something far more slowly some of the time.
So you've increased the number of clusters, but reduced the complexity. You said an increasingly modular design WOULD use more transistors. I said it doesn't have to, not that it can't, and saying it HAS to is in fact completely incorrect.
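To make that concrete, here's a quick break-even sketch in Python. The per-shader and per-cluster transistor costs are made-up placeholder numbers, NOT real figures for Cypress or Cayman, and the function names are mine. It also assumes every SP costs the same in both designs, which if anything is generous to the 5-way layout.

```python
def total_transistors(total_sps: int, width: int,
                      shader_cost: int, control_cost: int) -> int:
    """Shader transistors plus one chunk of control logic per cluster."""
    clusters = total_sps // width
    return total_sps * shader_cost + clusters * control_cost


def break_even_control_cost(width_old: int, width_new: int,
                            control_cost_old: float) -> float:
    """Per-cluster control cost at which the new width matches the old
    design's total, assuming identical per-SP cost."""
    return control_cost_old * width_new / width_old


SHADER_COST = 1000      # placeholder transistors per SP
CONTROL_5WAY = 5000     # placeholder control logic per 5-way cluster

limit = break_even_control_cost(5, 4, CONTROL_5WAY)
five_way = total_transistors(1600, 5, SHADER_COST, CONTROL_5WAY)
four_way = total_transistors(1600, 4, SHADER_COST, 3500)  # simpler control

print(limit, five_way, four_way)
```

In this toy model the 4-way design comes out ahead as soon as its per-cluster control logic is more than 20% cheaper than the 5-way version. More clusters does not automatically mean more transistors.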
As for the 1680 shader rumour, there is no such rumour. 1536 is the ONLY rumour around, well, that and 1920 shaders.
You simply posted that it was using more transistors, and would be much bigger, etc., when in fact there are no rumours to back that up at all, unless you count Fud.
Increasing the transistor count would almost always mean a larger die size on the same process, but again, you didn't say this. You said it did have more transistors, and that's not a fact. Again, what if they've reduced the die size? The 6870, with its 1.7 billion transistor 255mm2 core, in several situations outperforms a 2.15 billion transistor 336mm2 core. Which, by the way, shows a very slight increase in transistor density in doing so.
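If you want to check the density claim, it's simple division. A small Python sketch using only the transistor counts and die areas quoted above (the function and variable names are just mine):

```python
def density_m_per_mm2(transistors_billions: float, die_area_mm2: float) -> float:
    """Transistor density in millions of transistors per mm^2."""
    return transistors_billions * 1000.0 / die_area_mm2


older_core = density_m_per_mm2(2.15, 336)  # the 2.15 billion, 336mm2 core
hd6870 = density_m_per_mm2(1.7, 255)       # the 6870's 1.7 billion, 255mm2 core

gain_percent = (hd6870 / older_core - 1.0) * 100.0
print(f"{older_core:.2f} vs {hd6870:.2f} M/mm2, {gain_percent:.1f}% denser")
```

That works out to roughly 6.4 versus 6.7 million transistors per mm2, so about 4% denser. Slight, like I said, but it's there.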
Who's to say they won't make a 2.15 billion transistor core at 330mm2 that's 35% faster, or a 2 billion transistor core that's 28% faster at 300mm2? No one, basically.
You stated rumours no one's heard as fact, along with other things that are often true but aren't without question fact.
As for RPEs, Cayman being 3 "RPE" is incredibly unlikely. Likewise, you have absolutely nothing to suggest that a global data share needs doubling with one more RPE. And the global data share is in NO WAY the only thing connecting the RPEs to the rest of the core. There are MANY more connections besides the global data share, so doubling it would in no way increase the amount of core logic connecting the RPEs by 100%. These are all things that aren't factual: some are guesses, some are possibilities and some are flat out wrong.
Cayman isn't a big architectural change? Again, utter rubbish. A new front end and a new type of shader... those are the two BIGGEST things in the entire architecture, and they will be completely different. Cayman is set to be all but a completely different architecture.
You also talked about transistor density as if I'd said the word shaders anywhere when I talked about transistor density. I didn't. The 6870 shows a higher transistor density, and AMD have a significant lead on Nvidia in that area. The simple fact is you stated, in a matter-of-fact way, that transistor density was basically an unavoidable constant. It's not, hence me stating that. Why you brought shaders up I don't know.
While I agree that there will likely be more core logic in a more modular design, that doesn't mean a more modular but simplified design will use more transistors, and again, fundamentally that's what you claimed. More modular, or the same design with more shaders = more transistors, sure, but that's not what you said.
Personally I think it should be bigger, but not "that" much bigger at all, and my bet for transistor count would be around the 2.6 billion mark, with a marginally increased transistor density but not much.
But please read what I said, and what you initially said. You made several claims about a "dramatic increase in rops" and various other things, and used that as a reason to claim several other things. We have NO CLUE how many ROPs it has, no clue if the ROPs will be in the same place or the same size, or whether they've doubled the performance of each ROP and kept the same number. 3 RPEs makes zero sense. There's a reason GF100 didn't have 15 clusters in the design but 16, and why almost every GPU design I've seen tries to remain symmetrical: a 3 RPE design would be completely and utterly impractical. It's possible, there's no reason it couldn't work, but not least for pure timing control, cores are generally designed symmetrically to keep everything equidistant from each other, otherwise you have one RPE on the opposite side of the core to the ROPs, etc. The smallest Fermi chip is a 96 shader design with TWO of their "RPEs", despite the desperate need for a smaller core, because 48 shaders just doesn't work. They now have a 48 shader 420GT, but it's a 96 shader part with one cluster disabled, and you can be certain they don't want to disable 40% of a core to sell a part in a price bracket. It's economically worthless.
Remember the reason for the thread: Fud claiming yields are in the tank because, loosely implied from having read their other BS articles, it's a HUGE core, which it simply won't be.
As for my opinion on what should be easy a year later, again you're talking out of your behind. Nvidia found a 512sp core impossible to make a year ago, and now they've managed to make one with okay yields (no idea what they are, but it's moved from non-releasable to releasable), a year later, on the same process, which has improved over time.
Every single process, ever, in the history of the universe, has had higher yields towards the end of its life than at the beginning, and almost every company that's ever had chips built has had no problems making a bigger core a year later than the one they managed fine a year earlier. It's not my wish or opinion, it's solid fact based on TSMC's results, and Intel's, and AMD's/GloFo's, over the past decade.