A different take on SLI / MGPU: pipelining?

Soldato
Joined
1 Apr 2014
Posts
18,605
Location
Aberdeen
Have AMD or Nvidia tried pipelining as a MGPU solution? That is, having different cards handle different parts of the render process in sequence.

For instance, if you have three cards, the first card could handle the basic T&L, then the second card could handle Hairworks, and the third could handle ray tracing, which would then be output to the monitor. This would negate the microstutter so prevalent in MGPU setups.

I remember that you used to be able to hand off PhysX to a separate card but that wasn't in a pipeline but call and return.
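
To make the idea concrete, here's a toy Python sketch (purely illustrative - not anything the drivers actually expose) of three cards acting as pipeline stages on successive frames. The stage names and 4 ms timings are made up:

Code:
    import queue
    import threading
    import time

    def stage(name, cost_s, inbox, outbox):
        """One 'card' in the pipeline: take a frame, work on it, pass it on."""
        while True:
            frame = inbox.get()
            if frame is None:                  # shutdown signal
                if outbox is not None:
                    outbox.put(None)
                return
            time.sleep(cost_s)                 # stand-in for this card's GPU work
            print(f"{name} finished frame {frame}")
            if outbox is not None:
                outbox.put(frame)              # hand the frame to the next card

    src, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    cards = [
        threading.Thread(target=stage, args=("card0: T&L", 0.004, src, q1)),
        threading.Thread(target=stage, args=("card1: effects", 0.004, q1, q2)),
        threading.Thread(target=stage, args=("card2: ray tracing", 0.004, q2, None)),
    ]
    for c in cards:
        c.start()
    for f in range(8):                         # submit 8 frames; the stages overlap
        src.put(f)
    src.put(None)
    for c in cards:
        c.join()

In a model like this the frame rate is set by the slowest stage, but every displayed frame carries the summed latency of all three stages plus the hand-offs between them - the usual trade-off with pipelining.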
 
Soldato
Joined
17 Aug 2003
Posts
20,158
Location
Woburn Sand Dunes
I think tile-based rendering is the closest we've seen. I'm guessing the reason why we haven't seen other kinds of workload balancing (outside of SLI/XFire) is probably mostly down to the lack of support from APIs like DirectX. Technically I guess there's no reason why the hardware couldn't do it if the card-to-card bandwidth and latency were sufficient. Somebody like Rroff would know more, I think.
 
Soldato
Joined
6 Feb 2019
Posts
17,547
Sounds like chiplet architecture.

You don't need multiple cards, just multiple chiplets on the card, with each chiplet handling a different workload.
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
As I understand it, this won't work because of latencies - for the same reason that it's not possible to have an 'RTX add-in board' to handle ray-tracing (discussed a lot at the time of the RTX launch).

Even just the latency created by the position of the tensor cores in the die structures creates issues when using a tech like DLSS at high fps.

It's different with PhysX because you're offloading a workload that follows a different timeline from the creation of the frame, and you're moving data that's needed for several other things anyway rather than for building the frame of the given microsecond.
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
As above, this hasn't really been possible, other than for stuff you can entirely offload like physics, due to problems with bandwidth and latency: both the latency of data in transit and the waiting on tasks to complete before parts are in a state to communicate the needed data.

With recent advances in substrate technology and semiconductor nodes getting so small, some form of this is likely the future of GPUs down the line for multi-package implementations - especially if they can create blocks that can be repurposed on the fly for different types of tasks depending on the load.
 
Soldato
Joined
24 Oct 2005
Posts
16,279
Location
North East
What about Infinity Fabric like that on AMD CPUs - could that be used to reduce latency etc. somehow in the future, or HBM-style 3D stacking tech, but for tying the GPU to something that does those things?
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
What about Infinity Fabric like that on AMD CPUs - could that be used to reduce latency etc. somehow in the future, or HBM-style 3D stacking tech, but for tying the GPU to something that does those things?

Advances in substrate technology are making interconnects kind of like IF feasible with respect to the demands on latency, etc., but it needs to be paired with, at a minimum, chips built on 7nm+ (EUV) to be able to fit everything as needed, and more realistically 5nm. I suspect, though I might be wrong, that IF-like interconnects will catch up with the requirements of current chips only for current chips to have moved on - but as smaller and smaller semiconductor nodes become prohibitively hard and costly to produce, it might provide an alternative path for making GPUs.

(This still won't enable a chiplet-like system to just work, though - it still needs a massive overhaul in terms of GPU architecture to move past the problems with SLI/CF.)
 

bru

Soldato
Joined
21 Oct 2002
Posts
7,360
Location
kent
@Rroff is quite right that with the current way that GPUs work this idea wouldn't work. But we know that NVIDIA is working on an MCM architecture for Hopper, or at least that is what the latest rumours suggest.
They have an awful lot of clever people, so they might be doing a complete rework of the way things are done.

Ampere might be good, but Hopper has the potential to be completely ground-breaking. Kinda like Ryzen has been for the CPU side of things.
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
Chiplets/MCM/etc is very different to actually having different discrete cards handle different elements of the rendering pipeline, though. What the OP describes is splitting the workload not in terms of finished frames or parts of frames (e.g. Alternate Frame Rendering, Tile Rendering, etc.) but rather different tasks assigned to different discrete units that are plugged into your motherboard and talk over bus/NVlink-type arrangements. This will be very difficult to achieve - even the distances (and resistances) involved within different parts of the same die can affect performance or introduce overhead, let alone having to move data between discrete cards while the frame is being rendered.
 
Caporegime
Joined
18 Oct 2002
Posts
32,617
Chiplets/MCM/etc is very different to actually having different discrete cards handle different elements of the rendering pipeline, though. What the OP describes is splitting the workload not in terms of finished frames or parts of frames (e.g. Alternate Frame Rendering, Tile Rendering, etc.) but rather different tasks assigned to different discrete units that are plugged into your motherboard and talk over bus/NVlink-type arrangements. This will be very difficult to achieve - even the distances (and resistances) involved within different parts of the same die can affect performance or introduce overhead, let alone having to move data between discrete cards while the frame is being rendered.



Over any kind of bus link it would be impossible, but a functional split of a GPU into separate chips on a single substrate is the future. Any kind of split-frame or alternate-frame rendering just doesn't work with modern rendering pipelines.

The basic problem is that current rendering algorithms have high spatio-temporal dependencies, so data for one part of a frame depends on data calculated in a different part of the frame, or potentially in an earlier frame. Interestingly, this doesn't exist with ray tracing, so as RTX becomes standard for all lighting, an MCM design becomes more feasible.
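
As a tiny illustration of that cross-frame dependency (the 0.9/0.1 blend and buffer size are made up), a temporal-accumulation style pass needs last frame's result before it can produce this frame's - which is exactly what breaks alternate-frame schemes:

Code:
    import numpy as np

    H, W = 4, 4
    history = np.zeros((H, W))                   # previous frame's accumulated result
    for frame in range(3):
        current = np.random.rand(H, W)           # stand-in for this frame's raw shading
        blended = 0.9 * history + 0.1 * current  # needs the *previous* frame's data
        history = blended                        # becomes the next frame's dependency
        print(f"frame {frame}: mean {blended.mean():.3f}")

With two cards alternating frames, that history buffer would have to be shipped across the link every single frame before the next frame could even start.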
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
Enlighten me.

The bus link between cards is too slow for that. This isn't just a bandwidth issue that'll be solved by PCIE 4.0, 5.0 or even 20.0 (or any version of NVlink connector). Even if the bandwidth was a hundred times greater, there would still be a problem. Put it this way - imagine how close the tensor cores are in the die to the shader cores used for rasterization. Even that can cause a problem in latency when doing DLSS at high framerates. Now think about how far apart two cores are on different cards, aside from all the architecture problems.

What we'll likely see instead is different islands on a fast substrate handling pipeline tasks, alongside more fixed-function cores (like the RT initiative). This will all be on a single "card" that you buy, however. I think it's very unlikely you'll ever be able to increase your performance by buying a 'second card to handle RTX', with more cards each handling a cog in the wheel, so to speak. Hairworks of course is different, since it's a physics API, follows a different timeline in the game logic and can be offloaded. That's not building the frame, that's just saying where objects will be, much as AI paths would.
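
Some rough arithmetic on the latency side (every number below is an assumption, just to show the order of magnitude):

Code:
    fps = 240
    frame_budget_us = 1_000_000 / fps      # ~4167 microseconds per frame
    round_trip_us = 10                     # assumed card-to-card round trip incl. software overhead
    dependent_hops = 50                    # assumed hand-offs that must wait for a reply

    overhead_us = dependent_hops * round_trip_us
    share = 100 * overhead_us / frame_budget_us
    print(f"frame budget {frame_budget_us:.0f} us, "
          f"interconnect overhead {overhead_us:.0f} us ({share:.0f}% of the frame)")

Even with fairly generous assumptions, the transit time alone eats a noticeable slice of a high-refresh frame budget, and that's before any of the architectural problems.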
 
Soldato
OP
Joined
1 Apr 2014
Posts
18,605
Location
Aberdeen
The bus link between cards is too slow for that.

Is it? The throughput of NVLink is 100 GB/sec and it's bidirectional, so you can simultaneously have 100 GB/sec from card A to card B and 100 GB/sec from B to C. If you assume that all the cards have all the textures in VRAM, how much data actually needs to be transferred from card to card?
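
A back-of-envelope answer, using assumed figures for the intermediate data (a fat 4K G-buffer handed from one card to the next each frame):

Code:
    width, height = 3840, 2160             # 4K frame
    bytes_per_pixel = 32                   # assumed G-buffer: position, normal, albedo, etc.
    fps = 144

    per_frame = width * height * bytes_per_pixel       # ~265 MB per hand-off
    per_second = per_frame * fps                        # ~38 GB/s one way
    print(f"{per_frame / 1e6:.0f} MB per frame, {per_second / 1e9:.1f} GB/s at {fps} fps")

On paper that fits inside 100 GB/s, which is really the point of the question - though as the next reply notes, bandwidth isn't the only constraint.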
 
Soldato
Joined
22 Nov 2009
Posts
13,252
Location
Under the hot sun.
Have AMD or Nvidia tried pipelining as a MGPU solution? That is, having different cards handle different parts of the render process in sequence.

For instance, if you have three cards, the first card could handle the basic T&L, then the second card could handle Hairworks, and the third could handle ray tracing, which would then be output to the monitor. This would negate the microstutter so prevalent in MGPU setups.

I remember that you used to be able to hand off PhysX to a separate card but that wasn't in a pipeline but call and return.

From 2021 onwards GPUs are going to be of MCM design, so it's not needed.
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
Is it? The throughput of NVLink is 100 GB/sec and it's bidirectional, so you can simultaneously have 100 GB/sec from card A to card B and 100 GB/sec from B to C. If you assume that all the cards have all the textures in VRAM, how much data actually needs to be transferred from card to card?

It isn't just about bandwidth - often for extreme speeds you have to queue up operations and despatch a lot at once, which adds prohibitive software latency if you have small serially dependent operations that each need the results of the one before to be able to start, never mind the physical link latency. Textures and shaders are often the base for building the materials used, so you'd have to copy (mirror) any modifications, including any frame-dependent changes, as well, and so on and on :s

Some form of pipelining will be utilised in future GPU architectures though - you can't just slap GPU cores together like they can do with CPU cores and get results significantly better than current CF/SLI implementations. The physical link being shorter and more direct might help, but it can't overcome the bigger issues - hence why various attempts at things like sideport access have been abandoned.
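
A crude model of that serial dependency (all numbers assumed), showing why the per-hop latency can't be hidden when each small operation needs the previous one's result from the other card:

Code:
    ops_per_frame = 200        # assumed dependent hand-offs per frame
    compute_us = 10            # assumed GPU work per operation, microseconds
    link_latency_us = 5        # assumed one-way card-to-card latency per hand-off

    single_card = ops_per_frame * compute_us
    split_cards = ops_per_frame * (compute_us + link_latency_us)
    print(f"one card : {single_card / 1000:.1f} ms per frame")
    print(f"two cards: {split_cards / 1000:.1f} ms per frame "
          f"(+{split_cards - single_card} us just sat in transit)")

Batching would normally hide that sort of overhead, but batching is exactly what you can't do when each operation has to wait for the previous one.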
 