
A different take on SLI / MGPU: pipelining?

Discussion in 'Graphics Cards' started by Quartz, Jan 4, 2020.

  1. Quartz

    Sgarrista

    Joined: Apr 1, 2014

    Posts: 9,780

    Location: Aberdeen

    Have AMD or Nvidia tried pipelining as a MGPU solution? That is, having different cards handle different parts of the render process in sequence.

    For instance, if you have three cards, the first card could handle the basic t&l, then the second card could handle Hairworks, and the third could handle ray tracing, which would then be output to the monitor. This would negate the microstutter so prevalent in MGPU setups.

    I remember that you used to be able to hand off PhysX to a separate card but that wasn't in a pipeline but call and return.
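    To make the idea concrete, here's a toy sketch (Python, purely illustrative; the stage names are just the examples above, not real driver behaviour) of frames flowing through three "cards" in sequence, each doing its own pass and handing the frame on:

```python
# Toy model of the pipelining idea: three "cards" as pipeline stages
# connected by queues. Stage names are illustrative only.
import queue
import threading

def stage(name, inbox, outbox):
    """One 'card' in the pipeline: take a frame, do this card's pass, pass it on."""
    while True:
        frame = inbox.get()
        if frame is None:            # shutdown sentinel
            outbox.put(None)
            break
        frame.append(name)           # stand-in for this card's render work
        outbox.put(frame)

q1, q2, q3, done = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
cards = [("t&l", q1, q2), ("hairworks", q2, q3), ("ray tracing", q3, done)]
threads = [threading.Thread(target=stage, args=c) for c in cards]
for t in threads:
    t.start()

for i in range(3):                   # three frames enter the pipeline
    q1.put([f"frame{i}"])
q1.put(None)

frames = []
while (f := done.get()) is not None:
    frames.append(f)
for t in threads:
    t.join()
```

    Because each card only ever works on one stage, all three can be busy on different frames at once, and frames come out in order, so no AFR-style microstutter.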
     
  2. bru

    Soldato

    Joined: Oct 21, 2002

    Posts: 7,192

    Location: kent

    Interesting concept.
     
  3. SkeeterUK

    Capodecina

    Joined: Oct 24, 2005

    Posts: 14,446

    Location: North East

    You mean like Lucid Virtu Universal MVP GPU virtualization, which was tried, failed for some reason, and nobody liked in benchmarks?
     
  4. james.miller

    Capodecina

    Joined: Aug 17, 2003

    Posts: 17,955

    Location: Woburn Sand Dunes

    I think tile-based rendering is the closest we've seen. I'm guessing the reason we haven't seen other kinds of workload balancing (outside of SLI/Crossfire) is mostly down to the lack of support from APIs like DirectX. Technically I guess there's no reason the hardware couldn't do it if the card-to-card bandwidth and latency were sufficient. Somebody like Rroff would know more, I think.
     
  5. Grim5

    Wise Guy

    Joined: Feb 6, 2019

    Posts: 2,281

    Sounds like chiplet architecture.

    You don't need multiple cards, just multiple chiplets on the card, with each chiplet handling a different workload.
     
  6. Maldoror

    Gangster

    Joined: Apr 9, 2017

    Posts: 115

    Location: Eve Online

    As I understand it, this won't work because of latencies - for the same reason that it's not possible to have an 'RTX add-in board' to handle ray-tracing (discussed a lot at the time of the RTX launch).

    Even just the latency created by the position of the tensor cores in the die structures creates issues when using a tech like DLSS at high fps.

    It's different with PhysX because you are offloading a workload that follows a different timeline than the creation of the frame, and you're moving data that needs to do several other things anyway rather than build the frame of the given microsecond.
     
  7. Rroff

    Man of Honour

    Joined: Oct 13, 2006

    Posts: 65,673

    As above, this hasn't really been possible, other than stuff you can entirely offload like physics, due to problems with bandwidth and latency: both the latency of data in transit and the need to wait on tasks to complete before parts are in a state to communicate the needed data.

    With recent advances in substrate technology and semiconductor nodes getting so small, some form of this is likely the future of GPUs down the line for multi-package implementations, especially if they can create blocks that can be repurposed on the fly for different types of tasks depending on what the load is.
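    To put rough, entirely made-up numbers on the latency point: a pipeline runs at the speed of its slowest stage, but every frame still pays the full sum of stage times plus each inter-card handoff before it reaches the monitor.

```python
# All numbers are assumptions for illustration, not measurements.
stage_ms = [3.0, 2.0, 4.0]   # assumed per-card work per frame, in ms
hop_ms = 0.5                 # assumed per-handoff transfer + sync cost, in ms

# Throughput is set by the slowest stage (plus its handoff)...
throughput_fps = 1000 / (max(stage_ms) + hop_ms)

# ...but input-to-photon latency is the whole chain.
latency_ms = sum(stage_ms) + hop_ms * (len(stage_ms) - 1)
```

    So the framerate can look great while the frame you're seeing is several stages old, which is exactly the kind of latency problem that sinks this for games.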
     
  8. SkeeterUK

    Capodecina

    Joined: Oct 24, 2005

    Posts: 14,446

    Location: North East

    What about Infinity Fabric, like on AMD CPUs? Could that be used to reduce latency somehow in the future? Or HBM-style 3D memory tech, but for tying the GPU to something that does those things?
     
  9. Rroff

    Man of Honour

    Joined: Oct 13, 2006

    Posts: 65,673

    Advances in substrate technology are making interconnects kind of like IF feasible within the latency demands, but it needs to be paired with, at a minimum, chips built on 7nm+ (EUV) to be able to fit everything as needed, and more realistically 5nm. I suspect, though I might be wrong, that IF-like interconnects will catch up with the requirements inside current chips just as those chips move on again. But as smaller and smaller semiconductor nodes become prohibitively harder and costlier to produce, it might provide an alternative path for making GPUs.

    (This still won't enable a chiplet like system to just work though - still needs a massive overhaul in terms of GPU architecture to move past the problems with SLI/CF).
     
  10. SkeeterUK

    Capodecina

    Joined: Oct 24, 2005

    Posts: 14,446

    Location: North East

    So another 5-, maybe 10-year wait, then.
     
  11. bru

    Soldato

    Joined: Oct 21, 2002

    Posts: 7,192

    Location: kent

    @Rroff is quite right that this idea wouldn't work with the way GPUs currently operate. But we know that NVIDIA is working on an MCM architecture for Hopper, or at least that is what the latest rumours suggest.
    They have an awful lot of clever people, so they might be doing a complete rework of the way things are done.

    Ampere might be good, but Hopper has the potential to be completely groundbreaking. Kinda like Ryzen has been for the CPU side of things.
     
  12. bemaniac

    Mobster

    Joined: Jul 30, 2006

    Posts: 2,858

    Doesn't RTX work like this though? Like basically 2 cards in 1 with separate tasks.
     
  13. champion1642

    Gangster

    Joined: Aug 30, 2016

    Posts: 106

    I think the best way this could be achieved is with a chiplet design (like Ryzen CPUs) with a much faster Infinity Fabric.
     
  14. Maldoror

    Gangster

    Joined: Apr 9, 2017

    Posts: 115

    Location: Eve Online

    Chiplets/MCM/etc. are very different from actually having different discrete cards handle different elements of the rendering pipeline, though. What the OP describes is splitting the workload not in terms of finished frames or parts of frames (e.g. Alternate Frame Rendering, tile rendering, etc.) but rather different tasks assigned to different discrete units that are plugged into your motherboard and talk over bus/NVLink-type arrangements. This will be very difficult to achieve: even the distances (and resistances) involved within different parts of the same die can affect performance or introduce overhead, let alone having to move data between discrete cards while the frame is being rendered.
     
  15. Quartz

    Sgarrista

    Joined: Apr 1, 2014

    Posts: 9,780

    Location: Aberdeen

    Enlighten me.
     
  16. D.P.

    Caporegime

    Joined: Oct 18, 2002

    Posts: 30,425

    Over any kind of bus link it would be impossible but a functional split of a GPU into separate chips on a single substrate is the future. Any kind fof splti frame or alternate frame rendering just doesn't work with modern rendering pipelines.

    The basic problem is that current rendering algorithms have high spatio-temporal dependencies: data for one part of a frame depends on data calculated in a different part of the frame, or potentially in an earlier frame. Interestingly, this dependency doesn't exist with ray tracing, so as RTX becomes standard for all lighting, an MCM design becomes more feasible.
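    A toy illustration of the temporal side of that (the values and blend weight are made up, in the style of temporal AA): frame N reads frame N-1's result, so you can't start frame N on another chip until the previous frame's data has been produced and shipped over.

```python
# Illustrative only: temporal techniques (TAA, reprojection) blend each
# new frame with the previous frame's output, serializing the frames.
history = [0.0, 0.0, 0.0, 0.0]           # previous frame's pixel values (assumed)

def shade(current, history, alpha=0.9):
    # each output pixel mixes new shading with last frame's result
    return [alpha * c + (1 - alpha) * h for c, h in zip(current, history)]

frame1 = shade([1.0, 1.0, 1.0, 1.0], history)
frame2 = shade([1.0, 1.0, 1.0, 1.0], frame1)  # can't start until frame1 exists
```

    Ray tracing each frame from scratch doesn't have that read-the-last-frame dependency, which is the point about MCM becoming more feasible.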
     
  17. Maldoror

    Gangster

    Joined: Apr 9, 2017

    Posts: 115

    Location: Eve Online

    The bus link between cards is too slow for that. This isn't just a bandwidth issue that'll be solved by PCIe 4.0, 5.0 or even 20.0 (or any version of NVLink connector). Even if the bandwidth were a hundred times greater, there would still be a problem. Put it this way: imagine how close the tensor cores in the die are to the shader cores used for rasterization. Even that can cause a latency problem when doing DLSS at high framerates. Now think about how far apart two cores on different cards are, aside from all the architecture problems.

    What we'll likely see instead is different islands on a fast substrate handling pipeline tasks, alongside more fixed function cores (like the RT initiative). This will all be on a single "card" that you buy, however. I think it's very unlikely you'll be able to ever increase your performance by buying a 'second card to handle RTX' and buying more cards to each handle a cog in the wheel, so to speak. Hairworks of course is different, since that's a physics API, follows a different timeline in the game logic and can be offloaded. That's not building the frame, that's just saying where objects will be, much as AI paths would.
     
  18. Quartz

    Sgarrista

    Joined: Apr 1, 2014

    Posts: 9,780

    Location: Aberdeen

    Is it? The throughput of NVLink is 100 GB/s and it's bidirectional, so you can simultaneously have 100 GB/s from card A to card B and 100 GB/s from B to C. If you assume that all the cards have all the textures in VRAM, how much data actually needs to be transferred from card to card?
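    For a rough sense of scale (all sizes here are assumptions, not measurements): even shipping a fat 4K G-buffer between cards every frame would fit comfortably in 100 GB/s on bandwidth alone.

```python
# Back-of-envelope, assumed sizes: intermediate frame data per hop.
pixels = 3840 * 2160
bytes_per_pixel = 20     # assumed: depth + normals + albedo + motion vectors
fps = 144

gbuffer_mb = pixels * bytes_per_pixel / 1e6          # per-frame transfer
link_gbps = pixels * bytes_per_pixel * fps / 1e9     # sustained rate needed
# roughly 166 MB per frame, ~24 GB/s at 144 fps: well under 100 GB/s
```

    So on these assumed numbers the bandwidth isn't the bottleneck, which is why the replies keep coming back to latency instead.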
     
  19. Panos

    Capodecina

    Joined: Nov 22, 2009

    Posts: 12,035

    Location: Under the hot sun.

    2021-and-beyond GPUs are going to be of MCM design, so it's not needed.
     
  20. Rroff

    Man of Honour

    Joined: Oct 13, 2006

    Posts: 65,673

    It isn't just about bandwidth. Often, for extreme speeds, you have to queue up operations and dispatch a lot at once, which adds prohibitive software latency if you have small serially dependent operations that each need the results of the one before to be able to start, never mind physical link latency. Textures and shaders are often the base for building the materials used, so you'd have to copy (mirror) any modifications, including any frame-dependent changes, and so on and on :s

    Some form of pipelining will be utilised for future GPU architectures though. You can't just slap GPU cores together like they can with CPU cores and produce results significantly better than current CF/SLI implementations. The physical link being shorter and more direct might help, but it can't overcome the bigger issues, hence why various attempts at things like sideport access have been abandoned.
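    Back-of-envelope with assumed numbers for the serial-dependency point: when each small operation has to wait for the previous one's result to cross the link, per-hop latency adds up no matter how much bandwidth you have.

```python
# Assumed numbers only: chained operations pay link latency serially.
dependent_ops = 200      # assumed chained render operations per frame
hop_us = 5.0             # assumed round-trip link latency per op, in microseconds

stall_ms = dependent_ops * hop_us / 1000   # time lost to latency alone
frame_budget_ms = 1000 / 144               # whole frame budget at 144 fps
```

    On those figures you'd burn a full millisecond of a roughly 7 ms frame budget purely on waiting, before any actual work, and batching the operations to hide it is exactly what breaks when each one depends on the last.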