Chill out - I wasn't insulting your beloved AMD, I was just explaining why increasing the number of transistors necessarily leads to a larger die size.
drunkenmaster said:
its got a more modular design and that costs transistors, says who, you?
A more modular design will always come with a related cost in transistors, since an increased amount of control logic is required to connect the various modular units efficiently (I'll explain this in more depth shortly). The reason for taking a more modular approach is, as always, to improve scalability (e.g. to allow you to increase the number of SPs while maintaining as close to a linear performance increase as you can). This idea of increased modularity improving scalability is a general data-flow efficiency concept that reaches far outside the design of semiconductors.
-- aside --
To give a simple example of data flow and modularity in action, consider a military hierarchy. If you have only a small "army" of 100 fighting men, you can divide your troops into squads of 10 men, attach a single officer to each squad, and have every officer answer to a single overall commander. However, if you have an army of ten thousand and try the same flat approach, you end up with 100 squads of 100 men, and a commander who must control 100 officers. This introduces a massive inefficiency (one commander cannot control 100 men as efficiently as he can control ten), and the whole army grinds to a halt.

To get around this you introduce additional levels of hierarchy: you keep your squad size of ten, with a single officer per squad, and have each of the 1000 officers report to one of 100 "captains", who in turn report to one of ten "colonels", who report to a single general. You have maintained the efficiency, in that each commander only has to pass orders down through ten men, but you have introduced an overhead in the form of extra men (captains and colonels) who add nothing directly to your fighting force.

Extending the analogy to GPU design, the SPs, ROPs and TMUs are the 'fighting men', while the cache, interconnects and other control logic are the officers of various ranks. When you increase the size of your army (the total processing capacity of the GPU), you need more officers (control logic).
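To put rough numbers on that overhead, here's a quick sketch in Python (the fan-out of ten and the army sizes are just the figures from the analogy; nothing here is GPU-specific):

[code]
# Rough sketch of the command-hierarchy overhead from the analogy above.
# Each commander directly manages at most `fanout` subordinates; everyone
# above squad level is pure overhead (no direct fighting contribution).

def overhead(fighters, fanout=10):
    """Count the non-fighting 'officers' needed for a given army size."""
    total = 0
    level = fighters
    while level > 1:
        level = -(-level // fanout)  # ceiling division: managers one level up
        total += level
    return total

for army in (100, 10_000, 1_000_000):
    o = overhead(army)
    print(f"{army:>9} fighters -> {o} officers ({o / army:.1%} overhead)")
[/code]

Notice the overhead settles at roughly 11% of the army: the extra layers cost you men, but every layer stays manageable, which is the whole point.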
Anyway, this is a general concept of data-flow that is repeated all over the place... Those of you with experience using object-oriented programming languages will be familiar with this concept in a different way: The increased modularity of OO languages (in comparison to sequential languages) generally comes with a slight overhead in terms of runtime efficiency, but allows much larger programs to be written without the code becoming intractable (as it quickly does with sequential languages).
---------
Anyway, getting back to why a more modular design costs extra transistors. Cypress and Barts both have two "RPEs" (render and processing engines), which are modular blocks of SIMDs, TMUs, local cache and data-share logic (see pic below). Each RPE has its own dispatch processor and cache. To link the two RPEs, a "global data share" is required, which, like anything else on the die, costs transistors. My understanding is that Cayman has three RPEs (up from two on Barts and Cypress). To link three blocks it is necessary to at least double the amount of "global data share" logic (one data share to link block1 to block2, and one to link block2 to block3). So you're seeing a 50% increase in the number of RPEs for a 100% increase in the number of transistors that connect them.
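For illustration, here's how the inter-block wiring grows with the number of blocks. The chain case (each data share links one pair of neighbouring blocks) is the minimal assumption I'm making above; a full point-to-point mesh would be even worse:

[code]
# Illustration: how inter-block wiring grows with the number of modular
# blocks (RPEs). A chain needs n-1 links; a full point-to-point mesh
# needs n*(n-1)/2. The chain case is what gives the figure above:
# 50% more blocks (2 -> 3) for 100% more links (1 -> 2).

def chain_links(n):
    return n - 1             # block1<->block2, block2<->block3, ...

def mesh_links(n):
    return n * (n - 1) // 2  # every block linked directly to every other

for n in (2, 3, 4, 6):
    print(f"{n} RPEs: chain = {chain_links(n)} links, mesh = {mesh_links(n)} links")
[/code]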
Secondly, consider that in Cayman each SPU now has only 4 SPs, instead of the 5 it had in Cypress. Each SPU has some control logic associated with it, linking it to the rest of the SIMD core. So in Cypress you have five SPs for every chunk of control logic, whereas in Cayman you have four. For a fixed total number of SPs (say 1600), the Cayman design therefore has more pieces of local (intra-SIMD) control logic than Cypress (400 instead of 320).
[Barts core]
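That SPU arithmetic, spelled out in a quick sketch (the "one chunk of control logic per SPU" is the premise of my argument, not a verified hardware figure):

[code]
# The SPU arithmetic above, spelled out. Assumes one chunk of intra-SIMD
# control logic per SPU, which is the premise of the argument, not a
# verified hardware figure.

def control_chunks(total_sps, sps_per_spu):
    assert total_sps % sps_per_spu == 0
    return total_sps // sps_per_spu

sps = 1600
print("Cypress (5 SPs per SPU):", control_chunks(sps, 5), "chunks")  # 320
print("Cayman  (4 SPs per SPU):", control_chunks(sps, 4), "chunks")  # 400
[/code]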
drunkenmaster said:
Its got more transistors and is bigger than the 5870, because you say so, its got more shaders, again according to you, the biggest running rumour right now is it has 1536 shaders, does that somehow count as more than 1600 in your world?
I'm not going to argue rumour-by-rumour; they change every day. As I understand it, Cayman natively has 1680 SPs, 96 TMUs and 48 ROPs, arranged into three RPEs (560 SPs / 32 TMUs / 16 ROPs in each RPE - just like Barts). Perhaps, if the rumours of yield issues are correct, we could see some of the SIMD cores disabled to account for manufacturing errors (which could explain the 1536 number), but deactivating those clusters would not reduce the overall die size.
drunkenmaster said:
You can only fit a given number of shaders into an area on a process, wow, care to explain how AMD already on the 5870 had more than 10% more transistors in a die size over 10% smaller than Nvidia?
I said that there is a strict limit on the number of TRANSISTORS you can fit into a given area on a given process, not "shaders". This limit is determined largely by the physical size of the transistors, but also by the spacing required between them to prevent current leaking between neighbours. The "shader-powerhouse" design AMD implements allows them to pack transistors slightly closer together (on average) than the "heavy encapsulation" approach Nvidia takes. But there is no dramatic change in architecture from Cypress to Cayman, so there is no reason to expect transistor density to have improved dramatically either. It's also very reasonable to assume the transistor density will remain better than that of Nvidia's GTX580 (since their architecture has not changed dramatically either).
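To make "density" concrete, here's a back-of-envelope comparison using the commonly quoted figures for the existing parts (treat the numbers as ballpark, not gospel):

[code]
# Ballpark transistor-density comparison. Figures are the commonly
# quoted ones for each part and may be off by a little.
parts = {
    "HD 5870 (Cypress)": (2.15e9, 334),  # (transistors, die area in mm^2)
    "GTX 580 (GF110)":   (3.0e9, 520),
}

for name, (transistors, area_mm2) in parts.items():
    density = transistors / area_mm2 / 1e6  # millions of transistors per mm^2
    print(f"{name}: {density:.1f} M transistors/mm^2")
[/code]

That works out to roughly a 10% density advantage for Cypress, which is the sort of gap I'd expect to persist into Cayman.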
drunkenmaster said:
Either way, the simple fact is that at 530mm2 Nvidia eventually managed, with no care for the process, to get parts out with 480 shaders in high enough quantities for people to buy them when required. I think personally the 530mm2 is what I'd call "massive" for the 40nm process, if ANY cores can be made at 530mm2, it should be almost easy a year later to make something 450mm2, and every single rumour suggests a core way under 400mm2, thats by no means "massive".
Your opinion of "what should be easy a year later" is utterly irrelevant. These judgements should be based on semiconductor physics and the logic of GPU design, not on your personal perception of how GPUs have improved historically.
Far longer post than I intended - never mind.