
DirectX® 12 for Enthusiasts: Explicit Multiadapter

Excellent read here, some new things:
It may surprise you to learn…

DirectX® 12 is the very first version of the DirectX® API that has specific features, techniques, and tools to support multi-GPU (mGPU) gaming. If you are indeed surprised, follow us as we take a trip through the complicated world of mGPU in PC gaming and how DirectX® 12 turns some classic wisdom on its head.

MULTI-GPU TODAY
Modern multi-GPU gaming has been possible since DirectX® 9, and has certainly grown in popularity during the long-lived DirectX® 11 era. Even so, many PC games hit the market with no specific support for multi-GPU systems. These games might exhibit no performance benefits from extra GPUs or, perhaps, even lower performance. Oh no!

Our AMD Gaming Evolved program helps solve these cases by partnering with major developers to add mGPU support to games and engines—with resounding success! For other applications not participating in the AMD Gaming Evolved program, AMD's talented software engineers can still add AMD CrossFire™ support through a driver update.1

All of this flows from the fact that DirectX® 11 doesn’t explicitly support multiple GPUs. Certainly the API does not prevent multi-GPU configurations, but it contains few tools or features to enable it with gusto. As a result, most games have used a classic “workaround” known as Alternate Frame Rendering (AFR).

HOW AFR WORKS
Graphics cards essentially operate with a series of buffers, where the results of rendering work are contained until called upon for display on-screen. With AFR mGPU, each graphics card buffers completed frames into a queue, and the GPUs take turns placing an image on screen.
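Purely as an illustration of that turn-taking (a minimal sketch, not real DirectX code: `Gpu`, `render_frame` and `present` are hypothetical stand-ins for whatever the driver and engine actually do), AFR boils down to a loop like this:

```cpp
#include <cstdint>
#include <vector>

struct Gpu { /* command queues, framebuffers, etc. */ };

// Hypothetical helpers standing in for the real driver/engine work.
void render_frame(Gpu& gpu, uint64_t frame_index);  // record and execute rendering for one frame
void present(Gpu& gpu, uint64_t frame_index);       // flip the finished frame onto the screen

void afr_loop(std::vector<Gpu>& gpus)
{
    for (uint64_t frame = 0; ; ++frame)
    {
        // Each GPU takes every Nth frame, so finished frames queue up
        // behind the one currently on screen - which is where the extra
        // input latency ("mouse lag") described below comes from.
        Gpu& owner = gpus[frame % gpus.size()];
        render_frame(owner, frame);
        present(owner, frame);
    }
}
```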

AFR is hugely popular for the framerate gains it provides, as more frames can be made available every second if new ones are always being readied up behind the one being seen by a user.

But AFR is not without its costs, as all this buffering of frames into long queues can increase the time between mouse movement and that movement being reflected on screen. Most gamers call this “mouse lag.”

Secondly, DirectX® 11 AFR works best on multiple GPUs of approximately the same performance. DirectX® 11 frequently cannot provide tangible performance benefits on “asymmetric configurations”, or multi-GPU pairings where one GPU is much more powerful than the other. The slower device just can’t complete its frames in time to provide meaningful performance uplifts for a user.

Thirdly, the modest GPU multi-threading in DirectX® 11 makes it difficult to fully utilize multiple GPUs, as it’s tough to break up big graphics jobs into smaller pieces.

INTRODUCING EXPLICIT MULTI-ADAPTER
DirectX® 12 addresses these challenges by incorporating multi-GPU support directly into the DirectX® specification for the first time with a feature called “explicit multi-adapter.” Explicit multi-adapter empowers game developers with precise control over the workloads of their engine, and direct control over the resources offered by each GPU in a system. How can that be used in games? Let’s take a look at a few of the options.
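The first thing explicit multi-adapter gives a developer is the ability to see and address each GPU individually. A minimal sketch of that step, using standard DXGI/D3D12 entry points (error handling omitted; the function name is just illustrative):

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

// Create a D3D12 device on every hardware adapter the system exposes.
std::vector<ComPtr<ID3D12Device>> CreateDevicesOnAllAdapters()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc = {};
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
            continue;  // skip WARP / software adapters

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    return devices;
}
```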

SPLIT-FRAME RENDERING
New DirectX® 12 multi-GPU rendering modes like “split-frame rendering” (SFR) can break each frame of a game into multiple smaller tiles, and assign one tile to each GPU in the system. These tiles are rendered in parallel by the GPUs and combined into a completed scene for the user. Parallel use of GPUs reduces render latency to improve FPS and VR responsiveness.

Some have described SFR as “two GPUs behaving like one much more powerful GPU.” That’s pretty exciting!
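As a rough illustration only (not AMD's or any engine's actual implementation), an SFR scheme might hand each GPU one horizontal strip of the same frame; `record_scene` and `composite_and_present` below are hypothetical helpers:

```cpp
#include <cstdint>
#include <vector>

struct Gpu {};
struct Rect { uint32_t left, top, right, bottom; };

// Hypothetical helpers: render the scene restricted to a screen-space region
// on one GPU, then stitch the finished regions together for display.
void record_scene(Gpu& gpu, const Rect& region);
void composite_and_present(const std::vector<Rect>& regions);

void sfr_frame(std::vector<Gpu>& gpus, uint32_t width, uint32_t height)
{
    std::vector<Rect> regions;
    const uint32_t strip = height / static_cast<uint32_t>(gpus.size());

    for (size_t i = 0; i < gpus.size(); ++i)
    {
        // Every GPU works on the *current* frame rather than queuing up
        // future frames as AFR does - which is why latency goes down.
        const uint32_t top    = static_cast<uint32_t>(i) * strip;
        const uint32_t bottom = (i + 1 == gpus.size()) ? height : top + strip;
        Rect r{0, top, width, bottom};
        record_scene(gpus[i], r);
        regions.push_back(r);
    }
    composite_and_present(regions);
}
```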

Trivia: The benefits of SFR have already been explored and documented with AMD’s Mantle in Firaxis Games’ Sid Meier’s Civilization®: Beyond Earth™.

ASYMMETRIC MULTI-GPU
DirectX® 12 offers native support for asymmetric multi-GPU, which we touched on in the “how AFR works” section. One example: a PC with an AMD APU and a high-performance discrete AMD Radeon™ GPU. This is not dissimilar from AMD Radeon™ Dual Graphics technology, but on an even more versatile scale!2

With asymmetric rendering in DirectX® 12, an engine can assign appropriately-sized workloads to each GPU in a system. Whereas an APU’s graphics chip might be idle in a DirectX® 11 game after the addition of a discrete GPU, that graphics silicon can now be used as a 3D co-processor responsible for smaller rendering tasks like physics or lighting. The larger GPU can handle the heavy lifting tasks like 3D geometry, and the entire scene can be composited for the user at higher overall performance.
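Purely as an illustration, a frame loop under that scheme might be split along these lines; the pass names and helper functions are assumptions rather than anything from the article:

```cpp
struct Gpu {};

// Hypothetical engine hooks: each submission goes to that device's own
// command queue, so the two GPUs actually execute concurrently.
void submit_geometry_pass(Gpu& gpu);        // heavy 3D geometry work
void submit_lighting_or_physics(Gpu& gpu);  // smaller tasks sized for the APU
void composite_and_present(Gpu& primary, Gpu& secondary);

void asymmetric_frame(Gpu& discrete_gpu, Gpu& apu_graphics)
{
    submit_geometry_pass(discrete_gpu);        // big GPU does the heavy lifting
    submit_lighting_or_physics(apu_graphics);  // APU graphics acts as a 3D co-processor
    composite_and_present(discrete_gpu, apu_graphics);
}
```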

4+4=8?
In the world of DirectX® 9 and 11, gamers are accustomed to a dual-GPU system only offering one GPU’s worth of RAM. This, too, is a drawback of AFR, which requires that each GPU contain an identical copy of a game’s data set to ensure synchronization and prevent scene corruption.

But DirectX® 12 once again turns conventional wisdom on its head. It’s not an absolute requirement that AFR be used, therefore it’s not a requirement that each GPU maintain an identical copy of a game’s data. This opens the door to larger game workloads and data sets that are divisible across GPUs, allowing for multiple GPUs to combine their memory into a single larger pool. This could certainly improve the texture fidelity of future games!
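One way to picture it (a toy sketch, assuming the engine manages resource placement itself; `upload_to` is a hypothetical helper): rather than mirroring every texture on every card, the data set is partitioned across the GPUs, so the usable pool approaches the sum of their memories.

```cpp
#include <vector>

struct Gpu {};
struct Texture {};

// Hypothetical helper: place one texture in a specific GPU's local memory.
void upload_to(Gpu& gpu, const Texture& t);

void distribute_textures(std::vector<Gpu>& gpus, const std::vector<Texture>& set)
{
    // Round-robin the data set across GPUs instead of duplicating it,
    // as AFR would require. A resource needed by the "other" GPU must then
    // be read or copied across the PCI-E bus - the bandwidth concern raised
    // further down this thread.
    for (size_t i = 0; i < set.size(); ++i)
        upload_to(gpus[i % gpus.size()], set[i]);
}
```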

WRAP-UP
A little realism is important, and it’s worth pointing out that developers must choose to adopt these features for their next-generation PC games. Not every feature will be used simultaneously, or immediately in the lifetime of DirectX® 12. Certainly DirectX® 11 still has a long life ahead of it with developers that don’t need or want the supreme control of 12.

Even with these things in mind, I’m excited about the future of PC gaming because developers already have expressed interest in explicit multi-adapter’s benefits—that’s why the feature made it into the API! So with time, demand from gamers, and a little help from AMD, we can make high-end PC gaming more powerful and versatile than ever before.

And that, my friends, is worth celebrating!

Robert Hallock is the Head of Global Technical Marketing at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
https://community.amd.com/community...r?hootPostID=93cda05897016216c639de09b83b3798
 
SFR is nothing new and not something that DX12 will revolutionise outside of very specific scenarios (not least, Lucid is a good example of this).

The useful part is that you can much more easily abstract and composite different elements of a game, e.g. completely farm UI rendering out to a separate GPU from the one doing the scene rendering - often games have to reduce the update rate of the UI to keep performance up for a variety of reasons (not least because you can tie the CPU up due to things like draw-call limitations).
 
SFR is nothing new and not something that DX12 will revolutionise outside of very specific scenarios (not least, Lucid is a good example of this).

The useful part is that you can much more easily abstract and composite different elements of a game, e.g. completely farm UI rendering out to a separate GPU from the one doing the scene rendering - often games have to reduce the update rate of the UI to keep performance up for a variety of reasons.

Agreed..
The fact you could get the other GPU to do other tasks could very well enable some breathtaking visuals.
 
If game devs have to spend time and money enabling things, they are not going to bother.

It was the game devs that killed Mantle, not AMD.
 
Welcome to the 1990s.


There are probably some scenarios where task separation could make sense. Let's say you have some kind of RTS with a split-screen mode showing two different battles, or the same battle from two different views (top down and on the ground); each separate view could then be rendered on a different GPU.

But for the most part, it won't make much difference. Split-frame rendering was tried and failed. The smaller you make the tiles, the more overhead there is, and the larger the tiles, the greater the variance between GPU loads (a GPU might get lots of tiles of quickly rendered surfaces with minimal fragment shading). With AFR the frame-to-frame difference is minimal, so each frame has the same workload.
 
If game devs have to spend time and money enabling things, they are not going to bother.

It was the game devs that killed Mantle, not AMD.

I don't agree. AMD and Nvidia need to babysit devs, or at the very least ply them with funds, to give PC gamers what their machines are capable of. Take GameWorks and how many titles it is in: you can see that Nvidia have taken the bull by the horns and helped devs get it implemented into games. I can see DX12 and both vendors doing well for their customers if they give plenty of support to the game devs.
 
If game devs have to spend time and money enabling things, they are not going to bother.

It was the game devs that killed Mantle, not AMD.

Some triple-A games might come with better support, but as you say, most game developers are under far too many constraints to bother with a tiny market percentage.
 
Once methods and code start circulating, the types of graphics rendering modes for multi-adapter will increase.

We are only at the start of the entire low-abstraction multi-GPU cycle. The devs need to learn how to do all of this.

Although I am still waiting on a good SFR SuperTiling implementation. Unfortunately I discovered that the SFR in Civ: BE is only split-screen, so it is less efficient than SuperTiling and does not scale as well.
 
Welcome to the 1990s.


There are probably some scenarios where task separation could make sense. Let's say you have some kind of RTS with a split-screen mode showing two different battles, or the same battle from two different views (top down and on the ground); each separate view could then be rendered on a different GPU.

But for the most part, it won't make much difference. Split-frame rendering was tried and failed. The smaller you make the tiles, the more overhead there is, and the larger the tiles, the greater the variance between GPU loads (a GPU might get lots of tiles of quickly rendered surfaces with minimal fragment shading). With AFR the frame-to-frame difference is minimal, so each frame has the same workload.
If you think SFR is the thing in that link that gets me excited for the future of PC gaming, you'd be very wrong. Read it again.
 
If you think SFR is the thing in that link that gets me excited for the future of PC gaming, you'd be very wrong. Read it again.

Well, shared resources are an absolute no-go anyway; PCI-E is far too slow, so no, 2 x 4GB Fury Xs are still going to have a copy of each resource on both GPUs.
 
Welcome to the 1990s.


There are probably some scenarios where task separation could make sense. Let's say you have some kind of RTS with a split-screen mode showing two different battles, or the same battle from two different views (top down and on the ground); each separate view could then be rendered on a different GPU.

But for the most part, it won't make much difference. Split-frame rendering was tried and failed. The smaller you make the tiles, the more overhead there is, and the larger the tiles, the greater the variance between GPU loads (a GPU might get lots of tiles of quickly rendered surfaces with minimal fragment shading). With AFR the frame-to-frame difference is minimal, so each frame has the same workload.

So you are going to judge explicit-GPU SuperTiling against older SFR SuperTiling that was hacked on top of an abstracted DirectX?

They can manage synchronisation between GPUs much more easily with DX12 than they could with 9-11. It will not be as large an issue this time around and should see far better performance.

And also, the smaller you make the tiles, the better the scaling: you only render certain tiles of the grid on one GPU and the rest on the other, then paste the result together. And you need small tiles if you want decent scaling beyond 2 cards.

Edit: Smaller tiles reduce the variance between both GPUs' workloads.
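To put the balancing argument in toy form (this is nobody's actual renderer, just a model): with a checkerboard assignment of an N x N supertile grid, each GPU's share samples complexity from all over the screen, so the finer the grid, the smaller the gap between the two GPUs' loads.

```cpp
#include <vector>

// Toy model: tile_cost[y][x] stands in for the rendering cost of that
// screen region. Two GPUs split an N x N supertile grid checkerboard-style.
double gpu_load(const std::vector<std::vector<double>>& tile_cost, int which_gpu)
{
    double total = 0.0;
    const int n = static_cast<int>(tile_cost.size());
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            if ((x + y) % 2 == which_gpu)   // alternate tiles between GPU 0 and GPU 1
                total += tile_cost[y][x];
    return total;
}
// With a coarse 2x2 grid the two sums can differ wildly if one corner of the
// screen is heavy; with an 8x8 grid each sum averages over 32 regions, so the
// loads stay much closer together.
```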
 
Well, shared resources are an absolute no-go anyway; PCI-E is far too slow, so no, 2 x 4GB Fury Xs are still going to have a copy of each resource on both GPUs.

I think it was a big mistake for AMD, going forward, to drop the CrossFire bridges. If cards are to do some of the things described above, we really need to see the return of bridges with much higher data throughput.
 
I think it was a big mistake for AMD, going forward, to drop the CrossFire bridges. If cards are to do some of the things described above, we really need to see the return of bridges with much higher data throughput.


Something like NVLink :D
 
So you are going to judge explicit-GPU SuperTiling against older SFR SuperTiling that was hacked on top of an abstracted DirectX?

They can manage synchronisation between GPUs much more easily with DX12 than they could with 9-11. It will not be as large an issue this time around and should see far better performance.

And also, the smaller you make the tiles, the better the scaling: you only render certain tiles of the grid on one GPU and the rest on the other, then paste the result together. And you need small tiles if you want decent scaling beyond 2 cards.

Edit: Smaller tiles reduce the variance between both GPUs' workloads.


Physics doesn't change; the same issues that plagued SFR in the past are still here today. That is just the nature of these things.


The smaller you make the tiles, the worse the scaling gets, because the overlap grows. For example, take a small triangle sent from the geometry shader: with larger tiles there is a greater chance that the projected triangle can be discarded; conversely, with smaller tiles there is a greater chance that both GPUs will have to process the triangle. The problem is also that massively parallel fragment shaders are much more efficient when there is less switching of shader code. Given 2 surfaces with 2 different fragment programs, it is much more efficient if each GPU can render each surface separately using a single fragment program; if both GPUs render both surfaces because of smaller tiles, then the shader engine has to optimize 2 different program executions.

The high variance is exactly why AFR just works so much better.
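The overlap point can be put into rough numbers with a toy example (not taken from any real renderer): count how many tiles a triangle's screen-space bounding box crosses at two different tile sizes.

```cpp
#include <cstdio>

// How many tiles does a screen-space bounding box touch for a given tile size?
// Smaller tiles => the same triangle straddles more of them, so more GPUs end
// up fetching and processing the same geometry.
int tiles_touched(int min_x, int min_y, int max_x, int max_y, int tile_size)
{
    const int tx = (max_x / tile_size) - (min_x / tile_size) + 1;
    const int ty = (max_y / tile_size) - (min_y / tile_size) + 1;
    return tx * ty;
}

int main()
{
    // A roughly 96-pixel triangle bounding box:
    std::printf("256px tiles: %d\n", tiles_touched(100, 100, 196, 196, 256)); // 1 tile
    std::printf("32px tiles:  %d\n", tiles_touched(100, 100, 196, 196, 32));  // 16 tiles
    return 0;
}
```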
 
Well, shared resources are an absolute no-go anyway; PCI-E is far too slow, so no, 2 x 4GB Fury Xs are still going to have a copy of each resource on both GPUs.

But why would Microsoft go through the effort of adding asymmetric multi-GPU to the DirectX 12 API if, as you put it, it's something that can't be done?

As Robert said:
Even with these things in mind, I’m excited about the future of PC gaming because developers already have expressed interest in explicit multi-adapter’s benefits—that’s why the feature made it into the API! So with time, demand from gamers, and a little help from AMD, we can make high-end PC gaming more powerful and versatile than ever before.
 
Physics doesn't change; the same issues that plagued SFR in the past are still here today. That is just the nature of these things.


The smaller you make the tiles, the worse the scaling gets, because the overlap grows. For example, take a small triangle sent from the geometry shader: with larger tiles there is a greater chance that the projected triangle can be discarded; conversely, with smaller tiles there is a greater chance that both GPUs will have to process the triangle. The problem is also that massively parallel fragment shaders are much more efficient when there is less switching of shader code. Given 2 surfaces with 2 different fragment programs, it is much more efficient if each GPU can render each surface separately using a single fragment program; if both GPUs render both surfaces because of smaller tiles, then the shader engine has to optimize 2 different program executions.

The high variance is exactly why AFR just works so much better.

No, the problems with synchronisation were due to the amount of abstraction in the API. When it worked right, it was brilliant and far better than AFR. The main reason they transitioned to AFR is, as you say, that it is easier to manage, but also because of the artificial FPS boost you get from the number of buffers you have to keep and the re-displayed frames. It bloats performance figures, making the cards look better when the latency is pants in reality.

And when I say variance in workload with smaller tiles, I mean the complexity of the scene is more evenly distributed with smaller tiles, so the overall performance is higher as the workload becomes more balanced.

But with a split-screen approach rather than small tiles, a horizontal split leaves you with simple scene complexity at the top of the screen and high complexity at the bottom, so overall performance and scaling are bad.

Yes, overdraw will degrade performance, but not by as much as you think in the end. Of course there is a point where making the tiles too small degrades performance, but an 8x8 grid will scale better than a 4x4 grid when it comes to SuperTiling, due to the more even distribution of the workload.
 
First time I have heard of it.

But yeah, until GPUs have a higher-speed interlink, shared memory is a dream.

An alternative may be possible with HBM/HMC, where both GPUs are stacked on the same interposer and everything is interlinked, but I imagine the complexity is huge.

AMD are doing this. They are releasing an exascale APU for supercomputers next year based on Zen, Greenland, HBM and their Coherent Data Fabric.

http://fudzilla.com/news/processors/38402-amd-s-coherent-data-fabric-enables-100-gb-s

and

http://techreport.com/news/28742/amd-exascale-heterogenous-processor-is-the-server-apu
http://hexus.net/tech/news/cpu/85184-amd-exascale-heterogenous-processor-sports-32-zen-cores/

It is an awesome chip and where most APUs will go, although I think the GPU part of the APU will only use the HBM. It would have been good if the HBM was either shared between CPU and GPU or split half each.

I wouldn't mind an APU system with 16GB of HBM; it would be tiny overall.

The difference with AMD's 'Coherent Data Fabric', from what I read, is that it will entirely replace PCI-E when compatible components are in a system. So if you have a Zen CPU with a Greenland GPU, it will use CDF instead of PCI-E to communicate, and so on; otherwise it will fall back to PCI-E if a component does not support CDF.

It will be interesting to see what Intel and Nvidia do: whether they adopt it, whether Intel ends up adopting NVLink, or whether all three come to some middle-ground compromise that is not PCI-E.
 