A different take on SLI / MGPU: pipelining?

Soldato
Joined
1 Apr 2014
Posts
18,605
Location
Aberdeen
Have AMD or Nvidia tried pipelining as a MGPU solution? That is, having different cards handle different parts of the render process in sequence.

For instance, if you have three cards, the first card could handle the basic T&L, then the second card could handle Hairworks, and the third could handle ray tracing, which would then be output to the monitor. This would negate the microstutter so prevalent in MGPU setups.

I remember that you used to be able to hand off PhysX to a separate card but that wasn't in a pipeline but call and return.
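
To make the idea concrete, here's a toy Python sketch (purely illustrative - not anything the drivers actually expose) of three cards acting as pipeline stages on successive frames. The stage names and 4 ms timings are made up:

Code:
    import queue
    import threading
    import time

    def stage(name, cost_s, inbox, outbox):
        """One 'card' in the pipeline: take a frame, work on it, pass it on."""
        while True:
            frame = inbox.get()
            if frame is None:                  # shutdown signal
                if outbox is not None:
                    outbox.put(None)
                return
            time.sleep(cost_s)                 # stand-in for this card's GPU work
            print(f"{name} finished frame {frame}")
            if outbox is not None:
                outbox.put(frame)              # hand the frame to the next card

    src, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    cards = [
        threading.Thread(target=stage, args=("card0: T&L", 0.004, src, q1)),
        threading.Thread(target=stage, args=("card1: effects", 0.004, q1, q2)),
        threading.Thread(target=stage, args=("card2: ray tracing", 0.004, q2, None)),
    ]
    for c in cards:
        c.start()
    for f in range(8):                         # submit 8 frames; the stages overlap
        src.put(f)
    src.put(None)
    for c in cards:
        c.join()

In a model like this the frame rate is set by the slowest stage, but every displayed frame carries the summed latency of all three stages plus the hand-offs between them - the usual trade-off with pipelining.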
 
Soldato
Joined
17 Aug 2003
Posts
20,158
Location
Woburn Sand Dunes
I think tile-based rendering is the closest we've seen. I'm guessing the reason why we haven't seen other kinds of workload balancing (outside of SLI/XFire) is probably mostly down to the lack of support from APIs like DirectX. Technically I guess there's no reason why the hardware couldn't do it if the card-to-card bandwidth and latency were sufficient. Somebody like Rroff would know more, I think.
 
Soldato
Joined
6 Feb 2019
Posts
17,547
Sounds like chiplet architecture.

You don't need multiple cards, just multiple chiplets on the card, with each chiplet handling a different workload.
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
As I understand it, this won't work because of latencies - for the same reason that it's not possible to have an 'RTX add-in board' to handle ray-tracing (discussed a lot at the time of the RTX launch).

Even just the latency created by the position of the tensor cores in the die structures creates issues when using a tech like DLSS at high fps.

It's different with PhysX because you're offloading a workload that follows a different timeline from the creation of the frame, and you're moving data that's needed for several other things anyway rather than for building the frame of the given microsecond.
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
As above, this hasn't really been possible, other than for stuff you can entirely offload like physics, due to problems with bandwidth and latency: both the latency of data in transit and the waiting on tasks to complete before parts are in a state to communicate the needed data.

With recent advances in substrate technology and semiconductor nodes getting so small, some form of this is likely the future of GPUs down the line for multi-package implementations - especially if they can create blocks that can be repurposed on the fly for different types of tasks depending on the load.
 
Soldato
Joined
24 Oct 2005
Posts
16,279
Location
North East
What about Infinity Fabric like that on AMD CPUs - could that be used to reduce latency etc. somehow in the future, or HBM-style 3D stacking tech, but for tying the GPU to something that does those things?
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
What about Infinity Fabric like that on AMD CPUs - could that be used to reduce latency etc. somehow in the future, or HBM-style 3D stacking tech, but for tying the GPU to something that does those things?

Advances in substrate technology are making interconnects kind of like IF feasible with respect to the demands on latency, etc., but it needs to be paired with, at a minimum, chips built on 7nm+ (EUV) to be able to fit everything as needed, and more realistically 5nm. I suspect, though I might be wrong, that IF-like interconnects will catch up with the requirements of current chips only for current chips to have moved on - but as smaller and smaller semiconductor nodes become prohibitively hard and costly to produce, it might provide an alternative path for making GPUs.

(This still won't enable a chiplet-like system to just work, though - it still needs a massive overhaul in terms of GPU architecture to move past the problems with SLI/CF.)
 

bru

Soldato
Joined
21 Oct 2002
Posts
7,360
Location
kent
@Rroff is quite right that with the current way that GPUs work this idea wouldn't work. But we know that NVIDIA is working on an MCM architecture for Hopper, or at least that is what the latest rumours suggest.
They have an awful lot of clever people, so they might be doing a complete rework of the way things are done.

Ampere might be good, but Hopper has the potential to be completely ground-breaking. Kinda like Ryzen has been for the CPU side of things.
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
Chiplets/MCM/etc is very different to actually having different discrete cards handle different elements of the rendering pipeline, though. What the OP describes is splitting the workload not in terms of finished frames or parts of frames (e.g. Alternate Frame Rendering, Tile Rendering, etc.) but rather different tasks assigned to different discrete units that are plugged into your motherboard and talk over bus/NVlink-type arrangements. This will be very difficult to achieve - even the distances (and resistances) involved within different parts of the same die can affect performance or introduce overhead, let alone having to move data between discrete cards while the frame is being rendered.
 
Caporegime
Joined
18 Oct 2002
Posts
32,617
Chiplets/MCM/etc is very different to actually having different discrete cards handle different elements of the rendering pipeline, though. What the OP describes is splitting the workload not in terms of finished frames or parts of frames (e.g. Alternate Frame Rendering, Tile Rendering, etc.) but rather different tasks assigned to different discrete units that are plugged into your motherboard and talk over bus/NVlink-type arrangements. This will be very difficult to achieve - even the distances (and resistances) involved within different parts of the same die can affect performance or introduce overhead, let alone having to move data between discrete cards while the frame is being rendered.



Over any kind of bus link it would be impossible, but a functional split of a GPU into separate chips on a single substrate is the future. Any kind of split-frame or alternate-frame rendering just doesn't work with modern rendering pipelines.

The basic problem is that current rendering algorithms have high spatio-temporal dependencies, so data for one part of a frame depends on data calculated in a different part of the frame, or potentially in an earlier frame. Interestingly, this doesn't exist with ray tracing, so as RTX becomes standard for all lighting, an MCM design becomes more feasible.
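
As a tiny illustration of that cross-frame dependency (the 0.9/0.1 blend and buffer size are made up), a temporal-accumulation style pass needs last frame's result before it can produce this frame's - which is exactly what breaks alternate-frame schemes:

Code:
    import numpy as np

    H, W = 4, 4
    history = np.zeros((H, W))                   # previous frame's accumulated result
    for frame in range(3):
        current = np.random.rand(H, W)           # stand-in for this frame's raw shading
        blended = 0.9 * history + 0.1 * current  # needs the *previous* frame's data
        history = blended                        # becomes the next frame's dependency
        print(f"frame {frame}: mean {blended.mean():.3f}")

With two cards alternating frames, that history buffer would have to be shipped across the link every single frame before the next frame could even start.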
 
Associate
Joined
9 Apr 2017
Posts
188
Location
Eve Online
Enlighten me.

The bus link between cards is too slow for that. This isn't just a bandwidth issue that'll be solved by PCIE 4.0, 5.0 or even 20.0 (or any version of NVlink connector). Even if the bandwidth was a hundred times greater, there would still be a problem. Put it this way - imagine how close the tensor cores are in the die to the shader cores used for rasterization. Even that can cause a problem in latency when doing DLSS at high framerates. Now think about how far apart two cores are on different cards, aside from all the architecture problems.

What we'll likely see instead is different islands on a fast substrate handling pipeline tasks, alongside more fixed-function cores (like the RT initiative). This will all be on a single "card" that you buy, however. I think it's very unlikely you'll ever be able to increase your performance by buying a 'second card to handle RTX', with more cards each handling a cog in the wheel, so to speak. Hairworks of course is different, since it's a physics API, follows a different timeline in the game logic and can be offloaded. That's not building the frame, that's just saying where objects will be, much as AI paths would.
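
Some rough arithmetic on the latency side (every number below is an assumption, just to show the order of magnitude):

Code:
    fps = 240
    frame_budget_us = 1_000_000 / fps      # ~4167 microseconds per frame
    round_trip_us = 10                     # assumed card-to-card round trip incl. software overhead
    dependent_hops = 50                    # assumed hand-offs that must wait for a reply

    overhead_us = dependent_hops * round_trip_us
    share = 100 * overhead_us / frame_budget_us
    print(f"frame budget {frame_budget_us:.0f} us, "
          f"interconnect overhead {overhead_us:.0f} us ({share:.0f}% of the frame)")

Even with fairly generous assumptions, the transit time alone eats a noticeable slice of a high-refresh frame budget, and that's before any of the architectural problems.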
 
Soldato
OP
Joined
1 Apr 2014
Posts
18,605
Location
Aberdeen
The bus link between cards is too slow for that.

Is it? The throughput of NVLink is 100 GB/sec and it's bidirectional, so you can simultaneously have 100 GB/sec from card A to card B and 100 GB/sec from B to C. If you assume that all the cards have all the textures in VRAM, how much data actually needs to be transferred from card to card?
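
A back-of-envelope answer, using assumed figures for the intermediate data (a fat 4K G-buffer handed from one card to the next each frame):

Code:
    width, height = 3840, 2160             # 4K frame
    bytes_per_pixel = 32                   # assumed G-buffer: position, normal, albedo, etc.
    fps = 144

    per_frame = width * height * bytes_per_pixel       # ~265 MB per hand-off
    per_second = per_frame * fps                        # ~38 GB/s one way
    print(f"{per_frame / 1e6:.0f} MB per frame, {per_second / 1e9:.1f} GB/s at {fps} fps")

On paper that fits inside 100 GB/s, which is really the point of the question - though as the next reply notes, bandwidth isn't the only constraint.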
 
Soldato
Joined
22 Nov 2009
Posts
13,252
Location
Under the hot sun.
Have AMD or Nvidia tried pipelining as a MGPU solution? That is, having different cards handle different parts of the render process in sequence.

For instance, if you have three cards, the first card could handle the basic T&L, then the second card could handle Hairworks, and the third could handle ray tracing, which would then be output to the monitor. This would negate the microstutter so prevalent in MGPU setups.

I remember that you used to be able to hand off PhysX to a separate card but that wasn't in a pipeline but call and return.

From 2021 onwards GPUs are going to be of MCM design, so it's not needed.
 
Man of Honour
Joined
13 Oct 2006
Posts
91,002
Is it? The throughput of NVLink is 100 GB/sec and it's bidirectional, so you can simultaneously have 100 GB/sec from card A to card B and 100 GB/sec from B to C. If you assume that all the cards have all the textures in VRAM, how much data actually needs to be transferred from card to card?

It isn't just about bandwidth - often for extreme speeds you have to queue up operations and despatch a lot at once, which adds prohibitive software latency if you have small serially dependent operations that each need the results of the one before to be able to start, never mind the physical link latency. Textures and shaders are often the base for building the materials used, so you'd have to copy (mirror) any modifications, including any frame-dependent changes, as well, and so on and on :s

Some form of pipelining will be utilised in future GPU architectures though - you can't just slap GPU cores together like they can do with CPU cores and get results significantly better than current CF/SLI implementations. The physical link being shorter and more direct might help, but it can't overcome the bigger issues - hence why various attempts at things like sideport access have been abandoned.
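
A crude model of that serial dependency (all numbers assumed), showing why the per-hop latency can't be hidden when each small operation needs the previous one's result from the other card:

Code:
    ops_per_frame = 200        # assumed dependent hand-offs per frame
    compute_us = 10            # assumed GPU work per operation, microseconds
    link_latency_us = 5        # assumed one-way card-to-card latency per hand-off

    single_card = ops_per_frame * compute_us
    split_cards = ops_per_frame * (compute_us + link_latency_us)
    print(f"one card : {single_card / 1000:.1f} ms per frame")
    print(f"two cards: {split_cards / 1000:.1f} ms per frame "
          f"(+{split_cards - single_card} us just sat in transit)")

Batching would normally hide that sort of overhead, but batching is exactly what you can't do when each operation has to wait for the previous one.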
 