
AMD VEGA confirmed for 2017 H1

Infinity fabric is a really fast interconnect. Fast enough for CPUs to synchronize their L1 caches with. Think about this for a second: when a CPU issues a cache fence instruction (store/load fence) it ends up exchanging messages with other CPU cores. We are talking about the fastest thing after direct register-file access.
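
To make the "cores exchanging messages" point concrete, here is a minimal, generic C++ sketch (standard release/acquire fences, nothing AMD-specific): for the consumer to see the producer's write, the hardware has to move the cache line between the cores, which is exactly the kind of traffic a fast fabric has to carry.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> ready{false};
int data = 0;

void producer() {
    data = 42;                                            // plain store
    std::atomic_thread_fence(std::memory_order_release);  // store fence: publish earlier writes
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}     // spin until the flag flips
    std::atomic_thread_fence(std::memory_order_acquire);  // load fence: pull in the producer's writes
    std::cout << data << '\n';                            // guaranteed to print 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}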

Now, there are many ways you could use it in GPUs. One way would be to use it instead of PCIe and rely on CrossFire software or mGPU/DX12, but that is really not new: you will still have the current issues of SLI/CrossFire (maybe less stuttering), or you will need devs to write specifically for it.

The second way would be to create a single GPU as an MCM (multi-chip module), same as Ryzen/Naples. Again, there are many ways to go about it:

1) You can create a single "control" module with a shared memory controller, geometry engine, HBCC, hardware scheduler, etc. Then you create multiple modules with the shaders, and you use Infinity Fabric to connect them because it is that damn fast. This way you have a single GPU, so software is not affected at all, but it is dirt-cheap to make because you produce small shader modules (e.g. 512 shaders in each) and can build cards with many of them: a 4096-shader card would have 8 x 512-shader modules, a 2048-shader card would have 4 such modules, etc. (see the sketch after this list). Each shader module costs very little because the dies are small, so yields should be excellent. So you have a single shader-module design and maybe 2 control-module designs (high end / low end, because ROPs, TMUs etc. are not one-size-fits-all) and you mix and match.

2) You create a multi-chip card where the "control module" is smart enough to synchronize HBCC caches with other "control modules". I believe this is the approach they are taking, because that's what the HBCC does: it's similar to Ryzen's memory controller, and Infinity Fabric takes care of consistency and of moving data around as each chip works on it. This is closer to having 2 "traditional GPUs" on the same MCM which act as one (they share the same memory address space and let the HBCC manage data movement).

3) You do both.
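
A toy illustration of the mix-and-match arithmetic in option 1, assuming a hypothetical 512-shader building block (the numbers and SKU names are purely illustrative, not anything AMD has announced):

#include <cstdio>

// Hypothetical building block size, for illustration only.
constexpr int SHADERS_PER_MODULE = 512;

struct Sku {
    const char* name;
    int target_shaders;
};

int main() {
    const Sku lineup[] = {
        {"entry",     2048},
        {"mid-range", 4096},
        {"high-end",  6144},
    };
    for (const Sku& sku : lineup) {
        // One tiny die design reused N times per card, so yields stay high.
        int modules = sku.target_shaders / SHADERS_PER_MODULE;
        std::printf("%-9s %d shaders -> %d x %d-shader modules + 1 control module\n",
                    sku.name, sku.target_shaders, modules, SHADERS_PER_MODULE);
    }
}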

If they can get this to work (and I don't see why not; they managed to get it working with Ryzen), then Navi will be dirt-cheap to make even in a 6144-shader configuration.



Yes, as Roff alluded to above, it will require significant architectural changes before Infinity Fabric could really net us some true gains. I'm also not sure IF is mature enough for this to really happen yet.
http://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html

A lot of the issues reviewers have found in various multi-threaded tests with Ryzen come from the higher latency and variability in inter-core communication that Intel CPUs just don't have. For a graphics workload that effectively has thousands of threads, that could be quite a problem.
 
Here is a video of the Cycles render engine rendering a scene using 2 Nvidia GPUs (of different make): https://youtu.be/dreR2z8Kgyk?t=6m34s
The cards don't need to be in SLI to function; they just need to have drivers installed.

The scene is copied onto both GPUs and rendered out. From memory, Cycles works by firing a "photon" and bouncing it around the scene until it hits the camera. That means that parts of the image that haven't been rendered out yet can affect the reflections in the part that is currently being rendered. I must admit I'm not entirely sure how it all works from a coding perspective, but it does show that it is possible for 2 cards to render out different parts of a scene simultaneously.
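
Roughly how that kind of multi-device rendering can be organised; this is a generic tile-queue sketch (not Cycles' actual code), with two CPU threads standing in for the two mismatched GPUs:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Split the frame into tiles, let each "device" pull the next unclaimed tile
// from a shared counter, and write its result into one shared image buffer.
constexpr int WIDTH = 256, HEIGHT = 256, TILE = 64;
constexpr int TILES_X = WIDTH / TILE, TILES_Y = HEIGHT / TILE;

std::atomic<int> next_tile{0};
std::vector<float> image(WIDTH * HEIGHT, 0.0f);

void render_tile(int tx, int ty, int device_id) {
    // Stand-in for the real per-tile path tracing work.
    for (int y = ty * TILE; y < (ty + 1) * TILE; ++y)
        for (int x = tx * TILE; x < (tx + 1) * TILE; ++x)
            image[y * WIDTH + x] = float(device_id + 1);
}

void device_worker(int device_id) {
    for (;;) {
        int t = next_tile.fetch_add(1);     // claim the next tile
        if (t >= TILES_X * TILES_Y) break;  // nothing left to do
        render_tile(t % TILES_X, t / TILES_X, device_id);
    }
}

int main() {
    std::thread dev0(device_worker, 0), dev1(device_worker, 1);  // two unequal "GPUs"
    dev0.join();
    dev1.join();
    std::printf("rendered %d tiles into one %dx%d image\n", TILES_X * TILES_Y, WIDTH, HEIGHT);
}

A faster device simply ends up claiming more tiles, which is why cards of different make can share the job without being in SLI.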

The biggest issue seems to be distributing the workload and syncing up memory. HBCC could potentially take care of the memory issue. As for distributing the workload, either AdoredTV or NerdTechGasm mentioned that with the NCU coming in Vega, it should have an improved load-balancing system, which was one of the things that bottlenecked the Fury cards.


That is a very different rendering technology, so it is not really applicable.
 
That is a very different rendering technology, so it is not really applicable.
I am aware that it is a ray tracer, while most games work on rasterization (I think?). But considering that ray tracers work by bouncing light around a scene, I can't help but feel that ray tracing relies more on data and information from other parts of the screen than rasterization does, so getting it to work across multiple GPUs would be more challenging.
 
Yes, as Roff alluded to above, it will require significant architectural changes before Infinity Fabric could really net us some true gains. I'm also not sure IF is mature enough for this to really happen yet.
http://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html

A lot of the issues reviewers have found in various multi-threaded tests with Ryzen come from the higher latency and variability in inter-core communication that Intel CPUs just don't have. For a graphics workload that effectively has thousands of threads, that could be quite a problem.

Those issues are really exaggerated. Ryzen is very competitive, and anyone disputing that is just closing their eyes and refusing to see the truth. I think PC Perspective showed that the latency was 120ns instead of 40ns? That's nothing!

To give you some idea: 60 FPS means roughly 16 milliseconds per frame. That's 16,000,000 nanoseconds. For threads doing fine-grained message passing the extra latency does matter, but for the coarse-grained nature of rendering frames it's negligible.
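
Spelling out the back-of-the-envelope numbers (using the rounded 16 ms frame figure above and the 120 ns vs 40 ns latencies quoted earlier; these are the thread's numbers, not measurements of mine):

#include <cstdio>

int main() {
    const double frame_ns     = 16.0e6;  // ~16 ms per frame at 60 FPS (rounded)
    const double cross_ccx_ns = 120.0;   // quoted cross-CCX hop
    const double same_ccx_ns  = 40.0;    // quoted same-CCX hop

    // Extra cost of one hop, and what fraction of a frame that is.
    const double extra = cross_ccx_ns - same_ccx_ns;
    std::printf("one extra hop: %.0f ns = %.7f%% of a frame\n", extra, 100.0 * extra / frame_ns);

    // The penalty only becomes visible if it is paid a very large number of times per frame.
    std::printf("hops needed to burn 1 ms: %.0f\n", 1.0e6 / extra);
}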

The real thing to note here is that AMD have already done the hardest part: the HBCC. Having a single virtual "address space" instead of actual memory makes everything else possible.
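
Conceptually (and only conceptually; the real HBCC is hardware and far more involved), the idea is a page map over one large virtual space, with the controller deciding which pages currently live in fast local memory and migrating them on first touch:

#include <cstdio>
#include <unordered_map>

// Toy model of a "single virtual address space" controller. Pages live either
// in fast local memory or in a larger backing store. Purely illustrative.
enum class Location { Local, Backing };

struct ToyHbcc {
    std::unordered_map<long, Location> pages;  // virtual page -> current location

    bool touch(long page) {
        auto it = pages.find(page);
        if (it == pages.end() || it->second == Location::Backing) {
            pages[page] = Location::Local;     // migrate into local memory on demand
            return true;                       // slow access (a migration happened)
        }
        return false;                          // already local: fast access
    }
};

int main() {
    ToyHbcc hbcc;
    int migrations = 0;
    for (int pass = 0; pass < 2; ++pass)
        for (long page = 0; page < 4; ++page)
            migrations += hbcc.touch(page) ? 1 : 0;
    std::printf("migrations: %d (only on first touch)\n", migrations);  // prints 4
}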

Also of note: I recall Raja saying that Vega can already render directly out to the L2 cache from its ROPs. I believe he mentioned that before (maybe not in this last presentation; I think it was near the Ryzen launch). So the main 2 paths for memory I/O are already flowing through the HBCC.

Of course this whole thing can't be done in 1 go. They're taking gradual decisive steps towards their goal.

The question (as has been the case with AMD for the last few years) is whether they will execute/deliver.
 
Also of note: I recall Raja saying that Vega can already render directly out to the L2 cache from its ROPs. I believe he mentioned that before (maybe not in this last presentation; I think it was near the Ryzen launch). So the main 2 paths for memory I/O are already flowing through the HBCC.

Found it: http://wccftech.com/amd-vega-architecture-detailed-variable-width-simds-confirmed/

Sorry for the wccftech link, but it was the first Google result. I quote:

Finally, in Vega the render back-ends also known as Render Output Units or ROPs for short are now clients of the L2 cache rather than the memory pool. This implementation is especially advantageous in boosting performance of games that use deferred shading.

They mention deferred shading, but I will add that this also makes it easy for an HBCC to "reassemble" a single image from a frame that was distributed across several "control modules", as I called them above.
 
To give you some idea: 60 FPS means roughly 16 milliseconds per frame. That's 16,000,000 nanoseconds. For threads doing fine-grained message passing the extra latency does matter, but for the coarse-grained nature of rendering frames it's negligible.

Various queue and cache latencies, etc. all add up though; you quickly eat that budget. Also, a game ticking over at 60fps is probably spending half of that budget doing CPU stuff, etc. before rendering tasks. I think you are massively underestimating the implications of latency and bandwidth for a GPU versus a CPU.
 
Various queue and cache latencies, etc. all add up though; you quickly eat that budget. Also, a game ticking over at 60fps is probably spending half of that budget doing CPU stuff, etc. before rendering tasks. I think you are massively underestimating the implications of latency and bandwidth for a GPU versus a CPU.

Well, I never said it was easy. I just said they *claim* they've done one of the hardest parts (we've yet to see Vega in action). But if the HBCC works, then they've really taken a leap towards Navi.

Now as far as the CPU is concerned: it has to do that stuff anyway. Nothing changes there. We're talking about presenting the whole thing as a single GPU to the outside world, so all that CPU stuff will not change at all.

I mean the problem is really simple: how do you get something inside the GPU to talk to 4 small "workers", ask them to each do part of a job, then collect that and dish it out as a unified output? Well, you need:

- a super-fast bus with both super-small latency and super-high throughput: check (Infinity Fabric)
- something to take care of data movement and coherence: check (HBCC)
- something to glue everything together: check (Interposer)
- something to distribute/reassemble: todo (Navi) EDIT: and note that AMD is no slouch in creating sophisticated hardware schedulers
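
A toy version of that last "todo" item, with CPU threads standing in for the hypothetical worker modules (this reflects nothing about how Navi actually does it; it is just the shape of the distribute/reassemble problem):

#include <cstdio>
#include <thread>
#include <vector>

// One front end splits the frame, N workers fill their slice, and because they
// all write into one shared buffer the "reassembly" is already done at the end.
constexpr int WIDTH = 512, HEIGHT = 512, WORKERS = 4;

void worker(std::vector<int>& frame, int id) {
    const int rows = HEIGHT / WORKERS;
    for (int y = id * rows; y < (id + 1) * rows; ++y)  // this worker's strip
        for (int x = 0; x < WIDTH; ++x)
            frame[y * WIDTH + x] = id;                 // stand-in for real shading
}

int main() {
    std::vector<int> frame(WIDTH * HEIGHT);
    std::vector<std::thread> pool;
    for (int id = 0; id < WORKERS; ++id)
        pool.emplace_back(worker, std::ref(frame), id);  // distribute
    for (std::thread& t : pool) t.join();                // collect
    std::printf("%d workers produced one %dx%d frame\n", WORKERS, WIDTH, HEIGHT);
}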
 
You also have to consider that there are millions of polygons in a standard scene (30 million in Star Citizen), and thousands of shaders across thousands of cores rendering millions of polygons doing billions of calculations. You quickly find that latencies to cache etc. really can become enormous. An instruction cache miss on a CPU, on the order of 10ns, can have a big impact on performance.
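
The counter-arithmetic, for scale: same rounded 16 ms budget, the 30 million polygons mentioned above, and an assumed 10 ns stall per miss. Treating the stalls as if they simply add up is pessimistic (GPUs hide a lot of latency through parallelism), but it shows why small per-item costs matter at this scale:

#include <cstdio>

int main() {
    const double frame_ns = 16.0e6;  // ~16 ms frame budget at 60 FPS
    const double polygons = 30.0e6;  // Star Citizen figure quoted above
    const double miss_ns  = 10.0;    // assumed cost of one cache miss

    // If even 5% of polygons caused one extra 10 ns stall each (assumed rate):
    const double stalled = 0.05 * polygons * miss_ns;
    std::printf("extra stall time: %.1f ms of a %.1f ms budget\n",
                stalled / 1.0e6, frame_ns / 1.0e6);  // 15 ms of 16 ms: the budget is gone
}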
 
Now as far as the CPU is concerned: it has to do that stuff anyway. Nothing changes there. We're talking about presenting the whole thing as a single GPU to the outside world, so all that CPU stuff will not change at all.

I was talking in terms of your 16,000,000ns-per-frame figure - at 60fps you are probably spending around 10ms in game logic etc. and the GPU is only doing much for around 4ms - so way less time than you were budgeting for.
 
Well, I never said it was easy. I just said they *claim* they've done one of the hardest parts (we've yet to see Vega in action). But if the HBCC works, then they've really taken a leap towards Navi.

Now as far as the CPU is concerned: it has to do that stuff anyway. Nothing changes there. We're talking about presenting the whole thing as a single GPU to the outside world, so all that CPU stuff will not change at all.

I mean the problem is really simple: how do you get something inside the GPU to talk to 4 small "workers", ask them to each do part of a job, then collect that and dish it out as a unified output? Well, you need:

- a super-fast bus with both super-small latency and super-high throughput: check (Infinity Fabric)
- something to take care of data movement and coherence: check (HBCC)
- something to glue everything together: check (Interposer)
- something to distribute/reassemble: todo (Navi) EDIT: and note that AMD is no slouch in creating sophisticated hardware schedulers

I think they have already been working on the hardware scheduler in Vega. See below.

Edit: Found the video. Watch till 16:25
https://youtu.be/m5EFbIhslKU?t=13m33s

Look at the AMD slide for improved load balancing. I believe that each row after the intelligent workgroup distributor block refers to a geometry engine. If you keep watching, he says that GCN can support a maximum of 4 of these.

So a few things to consider: why haven't they shown only 4 rows? It would fit in the slide. A few theories:

1. Vega has more than 4 shader engines and AMD didn't want to reveal this.
2. The new IWD can scale to as many geometry engines as are present.
3. AMD just thought it looked better and it means nothing.
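
For what "improved load balancing" could mean in the abstract, here is a minimal sketch contrasting round-robin dispatch with picking the least-loaded engine. It is purely illustrative and says nothing about how the IWD is actually built:

#include <algorithm>
#include <cstdio>

constexpr int ENGINES = 4;

int main() {
    // Uneven batches of work, the realistic case for geometry.
    const int work[] = {9, 1, 1, 1, 9, 1, 1, 1};
    int rr[ENGINES] = {}, ll[ENGINES] = {};

    for (int i = 0; i < 8; ++i) {
        rr[i % ENGINES] += work[i];                      // round-robin dispatch
        *std::min_element(ll, ll + ENGINES) += work[i];  // least-loaded dispatch
    }
    std::printf("busiest engine, round-robin:  %d\n", *std::max_element(rr, rr + ENGINES));  // 18
    std::printf("busiest engine, least-loaded: %d\n", *std::max_element(ll, ll + ENGINES));  // 10
}

The frame is only as fast as the busiest engine, so the smarter distributor wins even though both hand out the same total work.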
 
Those issues are really exaggerated. Ryzen is very competitive, and anyone disputing that is just closing their eyes and refusing to see the truth. I think PC Perspective showed that the latency was 120ns instead of 40ns? That's nothing!

To give you some idea: 60 FPS means roughly 16 milliseconds per frame. That's 16,000,000 nanoseconds. For threads doing fine-grained message passing the extra latency does matter, but for the coarse-grained nature of rendering frames it's negligible.

The real thing to note here is that AMD have already done the hardest part: the HBCC. Having a single virtual "address space" instead of actual memory makes everything else possible.

Also of note: I recall Raja saying that Vega can already render directly out to the L2 cache from its ROPs. I believe he mentioned that before (maybe not in this last presentation; I think it was near the Ryzen launch). So the main 2 paths for memory I/O are already flowing through the HBCC.

Of course this whole thing can't be done in 1 go. They're taking gradual decisive steps towards their goal.

The question (as has been the case with AMD for the last few years) is whether they will execute/deliver.

In their "Best CPU" Toms Hardware recommended Intel CPU's over Ryzen CPU's despite their own reviews showing Ryzen 5 vs i5 was faster and less expensive even when you applied their very narrow 'these games only' not in what they said, but in their actual results.

Within hours of them posting this, utterly bamboozled members both old and new made 4 pages of posts basically asking Tom's Hardware "WTF?". They spent a while putting out fires and standing by their words, until someone pointed out in depth that the numbers in their own reviews, which those "Best CPU" conclusions were based on, actually proved the Ryzen 5 chips were the "best CPUs". Right from that moment on, they just clammed up.

In fact, other reviewers mocked them for the conclusions they stood by in that idiotic "Best CPUs" piece.

Don't trust a thing Tom's Hardware says or even does; they are fakes, shills. ;)

You know, I have always thought that this is where you will see which reviewers are the shills and which aren't: when competitors become real competitors, and reviewers have to invent or find very interesting ways to keep pushing one brand over another.

In that, I never thought Tom's Hardware would be one of them; then again, some of the ones I thought would be have turned out not to be Intel shills.
 
From my own perspective, I want AMD to do well, but I will still buy NVidia for as long as I keep my PG348Q screen. I don't see anything wrong at all with people wanting to buy "only" NVidia or AMD, as lots of gamers/enthusiasts have either G-Sync or FreeSync screens, so of course they want to buy their next GPU at a decent price. You were rather condescending and massively hyperbolic with your 2nd response, but if AMD have a good enough product, it will lure people in for purchases, which NVidia will not like, so NVidia cut prices to try and tempt potential lost customers back, then AMD lower prices to tempt them back, etc. You see how that works? That is the way the market works.

So yer, I hope AMD have a great product so I can buy NVidia cheaper ;)
You are one of the sort that will never really use AMD regardless of how good they are though. You've made many comments over the years that show this.

There's nothing wrong with buying nVidia because only they offer a product at the performance point that you want. But that isn't really what it's about for you.

I think people supporting nVidia and their Titans, though, is ridiculous. Every consumer who buys a Titan on release is complicit in nVidia's price creep, so you really have no right to complain about their prices.

Their Ti cards are still stupidly priced, but it's at least on the upper end of "in the real world".

Your bias has been so bad at times that you'd look to blame AMD for anything you could find, using faulty logic. For example:

"if AMD can do anything to help fix a problem, then it definitely means that they're the cause of it."

Which is something you literally said in the past. This was when Intel X79 boards were having PCIe 3.0 problems that caused conflicts with AMD cards trying to run at PCIe 3.0 speeds.

Granted, this is in the past and you seem to have matured away from that behaviour somewhat. But it's a trademark mentality for die hard nVidia fans, and that's the point that's being made.
 
I think some of the current performance degradations are more to do with cache misses between threads than with inter-CCX latency. Ryzen has a very different cache hierarchy compared to current Intel quad- to octa-core parts: with the Intel parts, the L3 cache is part of the ring bus and is shared amongst all cores, while with Ryzen, each CCX has its own L3 cache.

It is more than likely a case of interdependent threads ending up on different CCXs degrading performance, as well as threads hopping between CCXs. I think someone showed an increase in performance when disabling an entire CCX on an octa-core part to make a quad, instead of having a quad spread across two CCXs. Also, a thread hopping cores can cause performance degradation if it jumps CCX as well as core, meaning the data it depends on has been left behind in the CCX it was previously running on.


This is where devs coding direct support for Ryzen into their software becomes key: they need to adapt their software for Ryzen's cache hierarchy and quirks, considering we have only had Intel-style setups for so long.
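
One concrete thing devs can already do is keep tightly-coupled threads on the same CCX with explicit affinity, so they share an L3 and never pay the cross-CCX penalty. A minimal Linux-only sketch; the core numbers are machine-specific and assumed here purely for illustration:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin a std::thread to one CPU core via its native pthread handle.
void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

void worker(const char* name) { std::printf("%s running\n", name); }

int main() {
    std::thread producer(worker, "producer");
    std::thread consumer(worker, "consumer");
    pin_to_core(producer, 0);  // assumption: cores 0 and 1 sit on the same CCX on this machine
    pin_to_core(consumer, 1);
    producer.join();
    consumer.join();
}

Compile with -pthread; which logical cores map to which CCX varies by CPU and SMT setting, so real code would query the topology first.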
 
Seriously considering upgrading to a 1080 Ti, but this close to Computex I'm waiting to see what gets announced for Vega. I really hope AMD knock it out of the park with this one; time will tell :)

If not, I'll just have to settle for the 1070-equivalent Vega to hold me over, since I don't feel the 1080 Ti should cost what it does. If AMD can even manage to do that card well... if not, I'll sell my soul and just get a 1080 Ti too.
 