
Analysis of Fermi

There’s been a lot of discussion on Fermi here recently, and rightly so since it’s a very interesting bit of technology :) We don’t know for sure what the price or the performance will be like, but I think there are a few interesting things we can conclude from the stuff we’ve seen so far.

Anyway, here are my views on Fermi and other related GPU stuff, so have at it and feel free to tear them up :) I know it's long, but consider this a replacement for 20 or so comments I never bothered to write in various threads :p BTW in case anyone cares about this stuff, I have a 4870x2 right now but I've owned plenty of cards from both manufacturers in the past.




1 - Scalability:

A lot has been said about whether the Fermi design will be able to scale to produce mid-range and low-end GPUs. It’s true that architectural scaling might be more difficult than with GT200, or r800, but there’s nothing to suggest it’s going to be a huge problem. If you look at the schematic diagram for Fermi, it’s still highly modular. You mostly have a design with 16 largely self-contained blocks (each with 32 ‘cores’), and a fair amount of L2 cache spanning the whole lot. There’s nothing to suggest that nvidia can’t produce a design with 4 or 8 blocks instead (i.e. 128 or 256 ‘cores’ total), and a suitably reduced amount of cache. We will have to wait and see, but I certainly wouldn’t write Fermi off based on this.


2 - Impact of manufacturing failures:

One of the biggest problems with the kind of architecture that nvidia has designed is that it becomes more sensitive to faults in manufacturing. This is largely because a fair amount of physical die-space is given over to cache, and other bits and pieces that span across multiple processing blocks (or ‘streaming multiprocessors’ as nvidia calls them).

Consider a chip with, say, 10% die space given over to global structures, and 90% contained in the individual processing blocks. If a failure occurs in one of the processing blocks, it can usually just be disabled, and the chip can be sold at a lower grade (GTX260 or whatever). If a failure occurs in one of the globally reaching structures (like the L2 cache), then the entire chip will almost certainly be unusable.

If we instead use a different design with double the proportion of die area given over to global structures (i.e. a 20%/80% split) then, assuming a single fault occurs at a random position, we double our chances of getting a dead chip from 10% up to 20%. If multiple faults occur then the odds become even grimmer, as the probabilities compound. Add to this that the GF100 chip is physically a lot bigger than the r800 (so we expect more faults per chip on average), and Fermi looks particularly vulnerable to the quality of the process at TSMC. Since TSMC is still reporting some issues with its 40nm process, this could turn out to be Fermi’s Achilles heel, at least in the short term.
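
To put some rough numbers on that compounding effect, here’s a back-of-the-envelope sketch using a simple Poisson defect model. The defect density and die areas are made-up illustrative figures (nothing from TSMC), and it’s just host-side arithmetic, no GPU required:

Code:
// Back-of-the-envelope Poisson defect model -- all numbers are illustrative guesses.
#include <cstdio>
#include <cmath>

// Probability that at least one defect lands in the non-salvageable "global" area:
// P(dead) = 1 - exp(-defect_density * die_area * global_fraction)
double dead_die_probability(double defects_per_mm2, double die_area_mm2, double global_fraction)
{
    double expected_global_defects = defects_per_mm2 * die_area_mm2 * global_fraction;
    return 1.0 - std::exp(-expected_global_defects);
}

int main()
{
    const double d = 0.005;  // defects per mm^2 -- a made-up figure, not a real TSMC number

    std::printf("~330mm^2 die, 10%% global: %.0f%% of dies dead\n",
                100.0 * dead_die_probability(d, 330.0, 0.10));
    std::printf("~530mm^2 die, 20%% global: %.0f%% of dies dead\n",
                100.0 * dead_die_probability(d, 530.0, 0.20));
    return 0;
}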


3 - GPGPU programming:

Nvidia are clearly pushing the GPGPU side of things hard with Fermi. The biggest markets for this are scientific and financial modelling, and their various sub-groups. These are potentially really huge markets, and the calculations they are interested in are generally highly suitable for acceleration with GPUs (since they can largely be broken into small and independent processes). The problem is that writing code to work efficiently with a GPU is a different animal even compared to writing for a CPU cluster. While certain codes can be easily adapted to work with GPUs, most of the time you will need to write a large part of the software from the ground up with stream processing in mind - at least if you want to use the GPU efficiently.
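
To give a flavour of what I mean by writing with stream processing in mind, here’s about the simplest possible CUDA kernel - a SAXPY, where the familiar CPU loop becomes one independent thread per element. Treat it as a toy sketch; real scientific codes are never this tidy, which is exactly the problem:

Code:
// The CPU loop  for (i = 0; i < n; ++i) y[i] = a*x[i] + y[i];  rewritten as a kernel:
// every element is handled by its own independent thread.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launched with enough 256-thread blocks to cover all n elements, e.g.
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);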

Rewriting software to take advantage of GPUs is time consuming, and expertise with GPU programming is still quite rare, so there hasn’t yet been the kick needed to really start the conversion en masse. There is a lot of interest, but very few people have actually made the switch so far (I work in scientific modelling, by the way...). Fermi could well be the kick that gets things rolling, since the ability to execute native C++ code will allow easier conversion of existing algorithms, and should require a little less specialist knowledge to get started. On top of that, the good double precision performance and error correction (ECC) on the memory fill in essentials that were missing from the previous generation.

I guess that nvidia are gambling on explosive growth in this sector over the next 2 or 3 years, and from my experience I’m pretty sure it’s going to happen. But still, it’s going to be a long time before the market for GPGPU applications matches that of gaming (way outside the lifetime of Fermi), so nvidia will need to stay competitive in the graphics market if they want to survive. I also wonder whether we will start to see two separate GPU designs in the next ‘real’ generation of nvidia cards...


4 - Efficiency:

In principle, Fermi is designed to handle a wide variety of data types more steadily and efficiently, whereas the r800 design is focused on raw number crunching (r800 has nearly double the floating point power of Fermi). Obviously the reason for this design choice is the GPGPU market, but I’m hoping that it could also lead to more consistent framerates in games. I guess this isn’t something we will know until release, but I can hope :p

Of course, another aspect of efficiency is the efficient use of transistors. I think it’s fair to assume that in terms of “performance per square mm of die space” r800 is going to be hands down the winner (as r700 was against GT200), so from that point of view it’s fair to say that ATI’s architecture is currently the more efficient. But it’s still worth considering that a rigid design like the r7/800 is not going to achieve the scalability of a more dynamic and adaptable design like Fermi as computational demand increases. So, when the next round of GPU designs comes around, ATI will be faced with an increasingly large bottleneck in efficiently feeding data to their ever-growing number of stream processors, whereas nvidia will have a new and flexible architecture to build upon.


5 - Physics:

One area where a Fermi-type architecture will really shine is in hardware physics. And no, I’m not talking about Nvidia’s highly pimped PhysX (which I wish would die a quick and quiet death... closed standards won’t get us anywhere). The type of computations which occur in collision physics are generally more complex than those involved in shading pixels etc. More importantly, they require much greater connectivity between data – that is, the result of one part of the calculation may rely heavily on another part. This requires more sophisticated communication between threads if the calculation is to be done efficiently. You can see this in action by comparing how a current high-end GPU fares against a CPU at hardware physics versus rendering. A good current generation GPU may be 5 times or so faster than a quad-core CPU at hardware physics, but at more “simple” stuff like rendering it’s going to be well over 100 times faster.
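
To illustrate the kind of thread-to-thread communication I mean, here’s a toy CUDA reduction kernel (not a physics solver!) where threads in a block share results through on-chip shared memory and have to synchronise between steps - the sort of pattern that collision detection and constraint solving lean on heavily:

Code:
// Toy block-wise sum, launched with 256-thread blocks, e.g.
//   block_sum<<<num_blocks, 256>>>(d_in, d_out, n);
__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float partial[256];               // on-chip memory shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                             // everyone must finish loading first

    // Tree reduction: each step consumes results produced by *other* threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];            // one partial sum per block
}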

Anyway, I guess my point is that the advanced scheduling and shared cache in Fermi should be ideal for performing hardware physics calculations. So, if developers actually start supporting a good open source physics API (like the one based on OpenCL that AMD was touting a few months back), and putting effort into designing physics features, then we could be in for some fun.



Overall:

I think that Fermi is a bold move by nvidia to establish themselves as the market leader in a new and expanding field, while maintaining their presence in a much bigger market. I don’t think there’s any doubt that Fermi will become the de facto choice for scientific stream computing and other GPGPU applications, but whether it can compete with ATI in the graphics market will depend on a few things: the performance of the thing, its ability to scale down to the mid- and low-end market, and, perhaps most importantly, the yields at TSMC. I’m quite confident it will perform well – probably better than most people are expecting – but as for the other two it’s going to be hard going for nvidia. I can only see them losing further market share to ATI this year and next, but Fermi could still turn out to be a good investment in the long term.



Thanks for reading :p
 
RE point 1: I think the question people have been asking is more along the lines of whether it can be scaled downwards cost-effectively, rather than whether the design allows for scaling downwards.
 

I guess we won't know for sure until we start seeing GTX340s and 320s, but I guess my point was that most of the extra "advanced logic" that Fermi is lumbered with is contained within the individual processing blocks (the SMs). So, by halving the number of SMs and halving the amount of cache you cut out half the fat :P

Of course we don't know how easy it will be to design a chip like that from the full GF100, but in principle it should scale well to smaller die sizes.
 
I think that Fermi is a bold move by nvidia ............ but Fermi could still turn out to be a good investment in the long term.

Interesting read - thanks. (Got to say I've got sick of reading the multitude of fanboy replies to any thread which refers to either nvidia or ati GPUs... hopefully this one can remain thoughtful.)

The technical side is, I am afraid, a little (no, a lot :o) beyond me, but everything I've read about Fermi leads me to the view that this may be one of those products that either sends the company making it to a massive new high or might push it into the arms of another company (Intel?).

I have no idea which way it might go and to be honest do not like the look of either possible result :(:(

I do not want to see EITHER AMD or Nvidia stop being major players in the GPU design/production market.
 
It's not real til we can hold one (minus wood screws). Performance numbers have been vague to say the least. Even estimations. nVidia usually don't miss a chance to talk their new kit up.
 
everything I've read about Fermi leads me to the view that this may be one of those products that either sends the company making it to a massive new high or might push it into the arms of another company

I agree :)

Nvidia is tied to this kind of sophisticated general purpose architecture now, way beyond the life of Fermi I suspect. If it can't compete with simpler designs in gaming, and if the GPGPU market doesn't grow as quickly as nvidia hopes, they could really start to struggle. Of course, if it performs well in both areas then nvidia could really start to dominate in terms of cash for R+D and marketing (which is a pretty scary thought considering some of their recent tactics!).
 
They have no choice but to release lower-end GF100 products. Yes, the G92 renaming went on 4 or 5 times, but architecturally you can't have a mid-range GT200 without it being the G92.

Seems they can't do that this time, unless they want a product line with DX10 parts - which won't sell. DX11 is going to breathe a little life back into PC gaming, and as such hardware sales will rise once we have an established lineup of games, so selling DX10 parts just won't make sense.

DX11, Windows 7 and a free GFW are a large step up for the PC - hopefully by March we will have a good DX11 lineup - whereas Vista, DX10 and a paid-for subscription to GFW pretty much killed it off.
 
I don't see why not... Read point 1 :)

Problem is, the low-end cards could be going up against ATI's next refresh of low-end cards if Fermi gets pushed back any further. Using prior generations as a timeline for when they get released: Nvidia's low-end 200 series cards have only recently been released, and ATI's low-end DirectX 11 parts are due to release soon, for example.

All I can say is thank god for competition stopping people getting complacent. Hopefully TSMC will now get its act together. I'm not letting Nvidia wholly off the hook for making it a behemoth, though, as ATI have had to adapt around TSMC after running aground a few years back.

Nice read though.
 
Regarding point 3: I agree with Duff-Man. I currently spend a lot of time on CUDA, researching genetic algorithms and cellular automata on GPU. The median speed-up I've experienced is circa 40x-60x. CUDA 3.0 looks like a serious step forward even from this.
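
For a flavour of the kind of kernel involved (a toy example only, nothing like my actual research code), a single Game of Life generation with one thread per cell looks something like this:

Code:
// One Game of Life generation, one thread per cell, wrapping at the edges.
// Toy example -- launched as e.g. life_step<<<dim3(w/16, h/16), dim3(16, 16)>>>(...).
__global__ void life_step(const unsigned char* in, unsigned char* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int neighbours = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + width) % width;    // toroidal wrap-around
            int ny = (y + dy + height) % height;
            neighbours += in[ny * width + nx];
        }

    unsigned char alive = in[y * width + x];
    out[y * width + x] = (neighbours == 3 || (alive && neighbours == 2)) ? 1 : 0;
}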

Given the continuing (albeit rather slow) increase in core-count, and even slower software adoption (see .net 4 and parallel libraries) it's not too surprising that coding for these things remains in the specialist arena.

Goodness knows that taking full advantage of the massively parallel execution model is no mean feat, but I've heard the same argument from VB devs used against C++ :-)

Typically, scientific systems spend more time in execution than in development, and the advantages promised are very enticing. Sadly, with Intel deferring Larrabee until who-knows-when if ever, and Fermi looking dodgy with this first release I feel like one of those supercomputer users who are told that if the procurement cycle breaches 12 months they should wait 6 months and do it then in 3!

I'll most likely take fermi for CUDA 3.0, but probably a low-end card for prototyping purposes (just like NV!) and wait until the refresh for the real hardware. But then I'm not much of a gamer. I find myself wondering if NV haven't sold out their core market in gaming a little, for a place at the table in an immature async one.

This seems to be the best of times and worst of times for GPGPU applications.

I'll probably keep an eye on ATI's efforts in the meantime, but they've been awfully slow and very poorly adopted so far. Then again, 6 months is an ice-age in this market.

Sorry for waffling on, but, as a parting question: would anyone who buys gfx for gaming only seriously look outside ATI in the near future?
 

Interesting points indeed - and a good point re: c#/.net 4 in particular, with its thrust towards parallel programming models and tools. Will be interesting to see if there is a tie-in from DirectCompute to the .net world to facilitate that, along with the extensions to Excel for HPC compute (yes, major financials really do need this - sad but true fact).
 
Honestly, I disagree with almost every point you make.

Fermi is in no way a bad design in and of itself - in a perfect world where manufacturing isn't a problem. When manufacturing is a problem, and it always is, it's just very bad. Firstly, the L2 cache is big for a GPU but relatively small in absolute terms, and when you factor in that GPUs already have vastly more transistors than CPUs, a modest cache on an already large core is actually a fairly minor increase in global functional units. Likewise, as it's almost 100% worthless to gaming, if it were a huge issue it could have been disabled to get better yields for gaming-only parts - not ideal, but not bad either, as probably less than 1% of the home market would want it.

But we're hearing about shader clusters not working, not the cache, which is fairly simple really and probably not the manufacturing issue at all. Then you take into account that, unlike system memory (where you're obviously looking at decent latency but huge bandwidth), cache makes a lot of difference on a CPU - L1/L2/L3 cache speeds all trounce bandwidth to main memory.

However, the use of L2 won't be nearly as effective in all situations on a GPU with what, 10 times the bandwidth of a CPU? In predictable calculations the latency will be less of an issue and the huge bandwidth could largely negate the L2; in other situations, we'll see, but gaming simply won't use it.

Likewise, I really don't see Fermi as a push for GPGPU; its GPGPU features are logical upgrades, almost every single one of which was also added to the 5XXX series - newer IEEE compliance, the ability to run different threads on different clusters, more cache (just not L2 cache wasting die space), etc, etc. It's 95% a GT200b with a few additions.

In fact its clusters look to have less core logic per shader than a GT200b (twice the shaders in each cluster), so it could even lose efficiency there.

As for manufacturing, as Rroff said, it's nothing to do with design, it's all about cost. Mid and low end run on smaller margins and larger sales, but a 40nm wafer costs the same no matter what you make on it, so high yields are essential to mid and low end - and their yields suck because of a design that really isn't good for the 40nm process. As per usual I'll point out that TSMC made a sucky, sucky process, but Nvidia ignored the fact that 98% of the planet knew they'd make a sucky process.

In reality Nvidia's long-term competitiveness hinges on changing their architecture entirely, towards a smaller, more parallel, lower-clocked core; they really can't avoid that at all.

As for two designs, one for GPGPU and one for gaming: they can't afford it. Right now GPGPU is less than 1% of their turnover - it's essentially nothing, it's minuscule - and even incredible growth in the sector still wouldn't allow them to fund a second design purely on GPGPU sales. The market is not there yet and we're years away, and by that stage (28nm, 22nm, and I forget what the next process is) they will simply not be able to make a Fermi-style core.

AMD's small, high-efficiency, lower-clocked cores look far more likely to maintain a fairly similar architecture throughout those processes. People looking for long-term stability in their GPGPU software design shouldn't be looking at Fermi; if it gets another generation on that design I'd be surprised.

The other issue is that, on top of worse yields, those mid/low-end parts will also be 50%+ bigger than their AMD equivalents. Their own 40nm parts can't compete with their OWN 55nm parts on price/performance, let alone AMD's parts, which are better on price, size and yields.

Nvidia for a couple of years now simply hasn't bothered releasing a current-gen mid/low end, but has filled that gap with last gen's parts, which they've now discontinued. We look set this year to have high-end Fermi and two-gen-old mid-range parts, with last-gen low-end parts.

AMD also have the ability to run more types of data - they have a wider range of shaders - and actually it's the GPGPU market that "should" have an easier time customising its software to use the full power of the 5+1 (or is it 4+1?) shader setup AMD has. Gaming seems to be struggling to keep all the shaders filled up; when it can, Nvidia's architecture can't compete. The problem for AMD is that it's an utter pain to code to get every last ounce of performance out of it.

I've almost no doubt Nvidia will have to go a similar route architecturally to reduce die size, so both companies will likely be on level pegging by the time games can use open-standard physics incredibly easily, and before GPGPU becomes more than 1% of their market. Unfortunately, again, by that time AMD will essentially have their own guys producing their own stuff in better fabs, while Nvidia will likely make theirs at TSMC (who will improve) or possibly even Samsung; as yet I can't see them moving to Global, and it might be in Global's interest to turn them down, but it's a slim possibility.


To sum up, I think Fermi is anything but a bold move. They knew of TSMC's problems years ago when the 2900XT happened (almost the same situation as Fermi, except AMD managed the feat of moving it UP a process and getting it out, which really is stupidly impressive). The bold move, 2-3 years ago, would have been a radical redesign to a small, efficient, manufacturable core. This design is a super-pumped 65nm design; it's anything but bold. It's clinging onto 3 years ago and a different manufacturing era with desperation, running blindly into problems everyone could foresee, due to a lack of courage to try something new.

Bold? Not even close. If Nvidia do finally take a bold step and radically change architecture for the next gen, I'll give them some credit for holding their hands up and saying "TSMC suck balls, we gotta change" :p

I'll be fair: it took the 2900XT disaster for AMD (well, ATi) to stick their hands up in the air and go..... nope, not gonna happen, time for a change. Remember, the 2900XT (actually not 100% sure) was bigger than the 8800; it was a monolithic core, a true beast - it's almost exactly what Fermi is in manufacturing terms, made almost impossible on a smaller process. It did have some very smart things, and some bold additions like the memory ring bus.

But they saw the issue, realised TSMC would always suck and changed their entire gameplan. We got a redesign incredibly quickly, and since then the architecture has not altered much at all - it's still the same fundamental design and ideas as the 5XXX series. Because they saw the problems coming and designed with them in mind, they rode out the 40nm problems with relative ease.

TSMC will happen again; production issues aren't set to get much easier. The past 15 years have been pretty easy, and every node drop from now on is going to be hugely harder. Nvidia have just had their 2900XT, which has hurt them all over the place - market share, cost, reputation, sales, future sales of an overly hot part, etc, etc. If they can't follow the very obvious path AMD took after the 2900XT they deserve to fail, but I really don't think they will.

Fermi will probably (bar a refresh) be their last huge-core design.


I can see slightly different GPGPU designs in the future, but it would require an easier design to start off from and isn't actually very likely. A gaming version of Fermi would ideally cut out the L2, but it's slap bang in the middle of the core; that's not a minor redesign, that's likely a full-on timing and layout nightmare - a fairly massive design change. If the L2 were on the outer edge it would be FAR easier to simply lop it off, but it would also have worse access to the shaders.

Now, if they found a way to make each side of the L2 its own core, with some kind of incredibly high-bandwidth link (think QPI/HT), and were able to produce a dual core, or a dual core with the L2 bolted in, maybe - but again that's a massive design difference and a lot of headaches. It's just cheaper, easier and faster to have a single design in these situations.

CPU guys who take 3-4 years for an architecture, a year to tape out, and then sell the thing in various forms for 2-3 years are one thing. A GPU guy with a refresh every 6 months, a new core every 12-18 months, a short tape-out and a much shorter design time - it's not worth the hassle at this stage of GPU evolution.

EDIT: I forgot to mention, the original GT200 was said to be clocked a little lower than they hoped, and the GT200b was a much smaller improvement than they wanted and was late - several months late - due to processing issues just dropping to 55nm. Revisions of GT200b at 40nm simply didn't work; they couldn't get anything above a half-size core out the door, and it took them, what, around a year, with several cores cancelled. Fermi isn't the first design on this basic architecture to have problems - it had problems at 65, 55 and now 40nm. On reflection, that is a bold design move: your previous cores have issues, everyone expects you can't make something double the size well at 40nm, the world knows TSMC sucks, and yet Nvidia boldly went ahead with the stupidest idea they've had to date. It definitely was bold, I was wrong about that :p

Seriously though, if 65/55nm had been entirely perfect, with zero delays, that's one thing, but TSMC have a horrible track record over maybe the past 4 years. I truly believe Intel's biggest problem with Larrabee is that it's trying to do fewer cores than Fermi, even faster, and finding it's unmanufacturable at competitive speeds - but then it is their first attempt.
 
Just a few points here:

I'll be fair: it took the 2900XT disaster for AMD (well, ATi) to stick their hands up in the air and go..... nope, not gonna happen, time for a change. Remember, the 2900XT (actually not 100% sure) was bigger than the 8800; it was a monolithic core, a true beast - it's almost exactly what Fermi is in manufacturing terms, made almost impossible on a smaller process.

G80 was a bigger core, but actually used slightly fewer transistors - this works because R600 was on a slightly smaller fabrication node (90nm vs. 80nm).

5+1 (or is it 4+1?) shader setup AMD has.

4+1. But that number has nothing to do with the data types it can process, nothing at all. In fact it becomes far less efficient with smaller and scalar data types on the Radeon architecture because (gross oversimplification warning!) it's designed with vector data types in mind. That doesn't happen on scalar architectures - every calculation is just mapped to a single 'core' - but obviously there's a transistor penalty for that.

A GPU guy with a refresh every 6 months, a new core every 12-18 months, a short tape-out and a much shorter design time - it's not worth the hassle at this stage of GPU evolution.

The GPU guys work smarter in my opinion. They're always working on multiple designs simultaneously. ATi started work on RV770 in 2005, and some people called that a minor architectural update (and in many senses, it kind of was) for crying out loud! The CPU guys just plunge loads of money into research of technology they could possibly incorporate (terascale from Intel strikes me as a prime example) and then add it in when the time is right.
 
Jesus Christ!

Drunkenmaster, go find a dictionary and look up the word:


CONCISE


And then attempt to put it into practice :p
 
Overall, Fermi is a bold move by Nvidia, but it will not succeed unless they manage to improve the yields from TSMC. The yields for ATI are currently bad enough, and given that Nvidia's are still at an early stage, TSMC will charge through the nose just to get a working chip, the stingy ********s. Looks like an ATI price drop is a long way off.
 