
The One Vision That Intel, AMD And Nvidia Are All Chasing – Why Heterogeneous Computing Is The Future

He's saying good grief because Layte gave a silly response which highlighted what I said.

from the link

NVLink: High-Speed GPU Interconnect

HSA is NOT a high-speed interconnect; that is ALL NVLink is. It's an alternative to PCI-E, nothing more, nothing less. The 'unified memory' claim has countless times been shown to be Nvidia being very dodgy. I described it precisely: a slightly easier-to-code-for situation where the separate CPU and GPU memory is virtualised as a single pool for the programmer to address. The actual copies to and from the CPU and GPU sections of memory still occur, and that is where, let's say, 70% of the latency issue with on-die GPU acceleration lies; the rest is in the GPU not being able to take the lead, instead running how and when the CPU tells it to. By no hardware definition is it unified memory, where all devices can access any point of a single shared pool of memory.
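To make the distinction concrete, here's a toy Python model (my own sketch, not real driver or hardware code): in the virtualised scheme a hidden copy happens every time the 'other' device touches the data, while in a genuinely unified pool both devices read the same memory and no copy ever occurs.

```python
# Toy model (my own sketch, not real hardware/driver code) of the two schemes.

class VirtualisedUnifiedMemory:
    """One virtual address space, two physical pools: touching data from the
    'other' device triggers a hidden copy between the CPU and GPU pools."""

    def __init__(self, data):
        self.cpu_pool, self.gpu_pool = list(data), []
        self.location = "cpu"
        self.copies = 0

    def access(self, device):
        if device != self.location:              # data lives on the other side
            self.copies += 1                     # hidden transfer over the bus
            if device == "gpu":
                self.gpu_pool = list(self.cpu_pool)
            else:
                self.cpu_pool = list(self.gpu_pool)
            self.location = device
        return self.gpu_pool if device == "gpu" else self.cpu_pool

class TrueUnifiedMemory:
    """One physical pool: every device reads the same memory, zero copies."""

    def __init__(self, data):
        self.pool = list(data)
        self.copies = 0

    def access(self, device):
        return self.pool                         # same memory for any device

virt = VirtualisedUnifiedMemory([1, 2, 3])
for dev in ("gpu", "cpu", "gpu"):
    virt.access(dev)
hsa_style = TrueUnifiedMemory([1, 2, 3])
for dev in ("gpu", "cpu", "gpu"):
    hsa_style.access(dev)
print(virt.copies, hsa_style.copies)             # 3 0: ping-ponging costs copies
```

The point of the sketch: in the virtualised scheme the copies don't disappear, they are just hidden from the programmer.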

Nor does it address any cross-platform compatibility. With HSA you can stick a few ARM cores with ANY other HSA IP and have them communicate together. NVLink will only work with IBM CPUs; it's not an industry standard, and it won't work with anything else.

HSA has no relation to, no resemblance to and nothing in common with NVLink at all. The problems addressed by HSA are NOT addressed by NVLink.

NVLink is like getting PCI-E 5.0: it allows greater bandwidth between the CPU and GPU, and from one GPU to another. It does literally not one single thing HSA is designed to do.

He's saying "Nvidia don't need HSA, they have NVLink"... when they aren't remotely similar technologies. You may as well say Nvidia don't need x86-64 support because they have PhysX, or AMD don't need PhysX because they have FreeSync. They are completely different technologies aimed at 100% different problems.

NVLink is 100% about increasing bandwidth to the GPU (which in turn reduces latency somewhat, but that's a different level of latency entirely); HSA is about treating the GPU as an equal to the CPU at a system level, enabling them to run significantly faster without getting in each other's way. Nothing about HSA is designed to tackle bandwidth. Mostly, though not entirely, HSA is about making on-die access to a GPU (which already has drastically lower latency) give such meaningful performance increases, along with a simple way to utilise it, that the entire industry will change how it writes programs. NVLink is about getting from A to B faster.
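For a rough sense of what "getting A to B faster" means, here's a back-of-envelope Python calculation. The figures are assumptions on my part: ~16 GB/s for PCIe 3.0 x16 and ~80 GB/s for NVLink, the low end of Nvidia's announced 80-200 GB/s range. It changes transfer time only; it says nothing about HSA-style shared access.

```python
# Back-of-envelope only. Assumed figures: PCIe 3.0 x16 at ~16 GB/s,
# NVLink at ~80 GB/s (the low end of Nvidia's announced 80-200 GB/s range).
# This is purely transfer time -- the programming model is untouched.

def transfer_ms(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb of data at the given bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_per_s * 1000.0

one_gb = 1.0
print(f"PCIe 3.0 x16: {transfer_ms(one_gb, 16.0):.1f} ms")  # 62.5 ms
print(f"NVLink:       {transfer_ms(one_gb, 80.0):.1f} ms")  # 12.5 ms
```

A 5x cut in copy time is real bandwidth progress, but the copy itself still happens; that is exactly the distinction being argued here.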
 
Yea, I'm not going to play this pointless game with you. The article and related sidebar links literally counter the rubbish you are trying to paint as fact. Just because you dress your rants up in as many words as possible does not make them any less factually wrong.

Well, go on then one last little treat. NVlink removes the vast majority of latency as the CPU and GPU can directly access memory located in each others pools without having to copy it out. We don't know anything about cache coherency as of yet though. But you knew that already of course, as surely you wouldn't have gone off on such a rant without knowing the basics...
 
Yea, I'm not going to play this pointless game with you. The article and related sidebar links literally counter the rubbish you are trying to paint as fact. Just because you dress your rants up in as many words as possible does not make them any less factually wrong.

That article 100% agrees with me; you just don't understand it.

Are you somehow missing where in that article NVLink is precisely described as a high-speed interconnect... and that HSA isn't an interconnect at all? Seriously, what about that can't you get?

The article headline itself is about how NVLink enables faster multi-GPU computing. Again, HSA isn't about multi-GPU computing. Notice it doesn't say faster GPU compute but faster MULTI-GPU computing. The one thing holding back multi-GPU computing is bandwidth, and that is the one thing NVLink is designed to address. These are entirely different technologies addressing entirely different problems.
 
Dude, seriously, read the whole article, not just the bits you want to.

that means GPU kernels will be able to access data in host system memory at the same bandwidth the CPU has to that memory—much faster than PCIe. Host and device portions of applications will be able to share data much more efficiently and cooperatively operate on shared data structure, and supporting larger problem sizes will be easier than ever.

I'd also read up on CUDA6 Unified Memory and how it will work. http://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked-memory-feeding-appetite-big-data/ http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
 
Well, go on then one last little treat. NVlink removes the vast majority of latency as the CPU and GPU can directly access memory located in each others pools without having to copy it out. We don't know anything about cache coherency as of yet though. But you knew that already of course, as surely you wouldn't have gone off on such a rant without knowing the basics...

In Nvidia's world it does, in everyone else's reality it doesn't.

Unified memory, as defined by the entire computing industry, is ONE pool of memory accessible by all devices. Nvidia's concept of unified memory is two pools accessible by EITHER device. A CPU accessing GPU memory is not a local memory call and absolutely takes a latency hit; a GPU accessing CPU memory is not a local memory call and absolutely takes a latency hit.

An ARM or AMD chip with HSA has one pool of memory which both the CPU and GPU access with little to no latency.

A CPU accessing its own memory will be many times faster than accessing the GPU's memory, and the same the other way around.

Every single thing in the article is about multi-GPU, nothing else. Notice there is not one benchmark on that page showing how NVLink improves compute performance on a single GPU; not one benchmark, in an article with MULTI-GPU in the title.

Nvidia has NO mechanism for industry-standard unified memory; there is no way for a GPU and CPU to use the same physical memory locally and act with the latency of on-die communication, none. Neither PCI-E nor NVLink works on the same scale as on-die communication in latency, power or usefulness.

Also, it's not out yet; the ONLY thing Nvidia offer for realistic use is Unified Virtual Addressing, which for a couple of years they flat out called unified memory as well.

http://www.nvidia.co.uk/object/nvidia-nvlink-technology-mar25-2014-uk.html

Nvidia's launch... and what do they compare it against? What is that? Oh right: Nvidia, in their own launch press release for NVLink, compare it to PCI-E.

It is an interconnect, HSA is NOT an interconnect, comparing them is absurd.
 
In Nvidia's world it does, in everyone else's reality it doesn't.

Here we get to the crux of the matter. Because this is Nvidia we are talking about, DM dismisses everything and anything that doesn't fit his agenda out of hand and then goes on to post big long spiels of text in the hope people will be confused enough by the end to just agree.

Sometimes I wonder why I bother.
 
Would it not be more logical for AMD, Intel and Nvidia to design entirely new processors that are good at running both CPU-type and GPU-type tasks, and which can be run together in parallel like GPUs can?

This would give the user the choice to add as many processors as needed, like you can with MGPUs, to get the job done.

Sadly this would mean the end of Windows, too bad Microsoft.:D

Having said that, I don't know much about the subject, so I may have just written a load of rubbish.:)
 
Here we get to the crux of the matter. Because this is Nvidia we are talking about, DM dismisses everything and anything that doesn't fit his agenda out of hand and then goes on to post big long spiels of text in the hope people will be confused enough by the end to just agree.

Sometimes I wonder why I bother.

Not the crux of the matter. I should clarify: in Nvidia's world it isn't unified memory either; they call a half attempt at it unified memory when none of the rest of the industry would. By your ridiculous logic, if AMD called FreeSync a CPU, you couldn't call AMD out for using a ridiculous term, because to do so would mean having an agenda? That would be much like when Nvidia told everyone they made the first GPU and held patents on a bunch of stuff they invented, meaning no one could challenge them... oh wait, they claimed a bunch of stuff and then lost a huge lawsuit because they were talking out of their behind. A company claiming something doesn't make it so.

Here is more proof

http://devblogs.nvidia.com/parallelforall/combine-openacc-unified-memory-productivity-performance/

If you read that correctly, you will see two of Nvidia's own benchmarks showing a 4% or so gain from 'unified' memory in one case, and around an 8-9% loss of performance in the other case, which, by the way, the post describes as the more realistic one.

In both cases the ONLY advantage attributed was easier coding; of the first case's performance he even says "I did not lose performance", as opposed to "wow, this performance is increased".

Here are the listed downsides of their 'unified memory':

There are some limitations of Unified Memory on current-generation GPU architectures:

Only heap allocations can be used with Unified Memory, no stack memory, static or global variables.
The total heap size for Unified Memory is limited to the GPU memory capacity.
When Unified Memory is accessed on the CPU, the runtime migrates all the touched pages back to the GPU before a kernel launch, whether or not the kernel uses them.
Concurrent CPU and GPU accesses to Unified Memory regions are not supported and result in segmentation faults.


If you understood a word of what you were talking about, you would realise that their terming this technology unified memory does not make it so. It has no particular performance upsides; its primary focus is to give the coder a single virtual memory space to work from. You say it has no copies; read it, it copies the data from one memory set to another, and you'd be hard pressed to do that without copies. It isn't across all memory either: if you have 64GB of system memory and 8GB of GPU memory, you get 8GB of 'unified memory'... have you ever seen that in a real unified memory system? Nope.

All these limitations exist because it's not freaking unified memory. It's allocating part of normal memory to a program which the CPU and GPU then access, but not concurrently, as highlighted in the downsides, which is another performance disadvantage.
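The page-migration rule quoted in that limitation list can be sketched as a toy model (my own illustrative Python, not real CUDA runtime code, and all the names are mine): every page the CPU has touched is migrated back to the GPU at the next kernel launch, whether or not the kernel actually uses it.

```python
# Toy model (my sketch, not real CUDA runtime code) of the migration rule
# quoted above: at kernel launch, every CPU-touched page migrates back to
# the GPU regardless of whether the kernel needs it.

class ManagedAllocation:
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.on_gpu = set(range(num_pages))  # all pages start GPU-resident
        self.migrations = 0

    def cpu_touch(self, page):
        self.on_gpu.discard(page)            # the touched page moves CPU-side

    def kernel_launch(self, pages_used):
        # The runtime ignores pages_used: ALL CPU-resident pages migrate back.
        cpu_resident = set(range(self.num_pages)) - self.on_gpu
        self.migrations += len(cpu_resident)
        self.on_gpu |= cpu_resident

buf = ManagedAllocation(num_pages=8)
for p in (0, 1, 2, 3):
    buf.cpu_touch(p)                         # CPU touches four pages
buf.kernel_launch(pages_used={0})            # kernel only needs one of them
print(buf.migrations)                        # 4: all four migrated anyway
```

Note how `pages_used` is ignored: that is exactly why the blog's "more realistic" benchmark loses performance under managed memory.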

This is NOT unified memory, this is Nvidia calling their software version of it the incorrect name to appear to offer the same features.

It still has no bearing on HSA: it does not address the same problems, is not an industry-wide standard open to anyone, and will never see the light of day in a desktop/gaming/console/mobile system. It is intended to improve multi-GPU, addresses none of the problems HSA does, and offers none of the performance benefits HSA does. They are two entirely unrelated technologies.


Again, Nvidia themselves call it a high-speed interconnect and compare it to PCI-E 3 and nothing else. Explain how HSA is an interconnect, or stop trolling.
 
“The only people making any real money in mobile are Apple, who sell cheap ARM hardware in shiny boxes. Qualcomm are pulling in most of their money from modems and licensing. Intel alone are making more than AMD and the commodity ARM players combined.”
That's incorrect on both counts: that only Apple are making money, and that Apple are selling cheap ARM hardware. Apple are using in-house designed CPUs and custom GPUs from Imagination.

Three of the biggest CPU makers are backing the HSA Foundation, along with the three biggest GPU players by unit shipment, as well as smaller players like AMD.


“Dude, seriously, read the whole article, not just the bits you want to.”
Drunkenmaster is correct; nothing Nvidia are doing is comparable. NVLink, if anything, is the opposite of the goal the HSA Foundation has; it is the very type of thing HSA wants to move away from. NVLink clearly doesn't solve 99% of the problems that HSA is trying to solve.

As DM pointed out, NVLink is just a high-speed interconnect; how is that comparable to HSA or the goals of HSA?
 
Would it not be more logical for AMD, Intel and Nvidia to design entirely new processors that are good at running both CPU-type and GPU-type tasks, and which can be run together in parallel like GPUs can?

This would give the user the choice to add as many processors as needed, like you can with MGPUs, to get the job done.

Sadly this would mean the end of Windows, too bad Microsoft.:D

Having said that, I don't know much about the subject, so I may have just written a load of rubbish.:)

The problem is we already have issues with general-purpose software making use of multiple cores as it is; trying to spread work over devices with potentially hundreds of cores is going to be even more complex.
 
The problem is we already have issues with general-purpose software making use of multiple cores as it is; trying to spread work over devices with potentially hundreds of cores is going to be even more complex.
HSA solves that, which is why many of the big players are backing it. I suggest you re-read the articles, as you seem to have misunderstood what HSA is, and you seem to have misunderstood Drunkenmaster, as what he said is correct.
 

That's because those tests were performed on current hardware, which is not designed around such concepts. The article I linked earlier talks about how Pascal is designed around the concept. PCIe 3.0, for example, would certainly explain the issues, as its latency and low bandwidth would drag down performance.

I'm not saying NVLink is HSA; I'm saying it is an enabler for similar concepts. Which it is, and which NV have talked about in their whitepapers. Just because you don't like it does not make it any less valid.
 
Would it not be more logical for AMD, Intel and Nvidia to design entirely new processors that are good at running both CPU-type and GPU-type tasks, and which can be run together in parallel like GPUs can?

This would give the user the choice to add as many processors as needed, like you can with MGPUs, to get the job done.

Sadly this would mean the end of Windows, too bad Microsoft.:D

Having said that, I don't know much about the subject, so I may have just written a load of rubbish.:)

Nvidia is in a pickle. They don't have an x86 license, and the reason for that may lie at their own feet, but it's what prevents them from using any of this unified memory and heterogeneous computing: they have no CPU to tie it into. A long gaze into the crystal ball could paint them in a situation where they are the lone player without a CPU, and thus cannot produce an x86 combo product. That's why they have been trying to get into the ARMs race, without much success unfortunately. They even went so far as to slap a Tegra on a GPU so they could call it unified memory.
 
HSA solves that, which is why many of the big players are backing it. I suggest you re-read the articles, as you seem to have misunderstood what HSA is, and you seem to have misunderstood Drunkenmaster, as what he said is correct.

Unfortunately it doesn't magically solve the problem of how you spread your code over many execution units. Code still has to be written to take advantage of 'moar corez'.
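A minimal illustration of the point, as a Python sketch (my own example, nothing to do with HSA itself): the serial loop uses one core no matter how many the machine has, and only the explicitly restructured version can spread the work across processes.

```python
# Minimal illustration (my example): serial code gains nothing from extra
# cores; the work must be explicitly restructured to run in parallel.
from multiprocessing import Pool

def work(x):
    return x * x

data = list(range(10))
serial = [work(x) for x in data]          # one core, always

if __name__ == "__main__":
    with Pool(processes=4) as pool:       # the code had to be rewritten
        parallel = pool.map(work, data)   # to gain anything from more cores
    assert parallel == serial
```

Whatever the platform provides, the parallel structure still has to come from the programmer; the standards argument is about not having to write it differently for every vendor.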
 
Unfortunately it doesn't magically solve the problem of how you spread your code over many execution units. Code still has to be written to take advantage of 'moar corez'.

Again you show a complete lack of understanding of the problem, because this is precisely what HSA is addressing. Oh no, wrong again.

Actually read up on something before talking nonsense.

First of all, NVLink doesn't enable similar concepts; it's 'more bandwidthz' and not much else. Second, you directly implied that Nvidia doesn't need HSA because it has NVLink; you were absolutely comparing them, as you can't make that statement without comparing them. Third, HSA's entire premise is creating an industry standard which the software industry can then code against once, for everything.

This is why Java support, many compilers and lots of tools have been created to make integrating GPU-accelerated code into ALL software dramatically easier than it is today. This is one of the base reasons behind getting everyone to support one way of doing things, yet you don't even know that.

When every company did GPU acceleration differently, every single program out there would have to carry individual and complex code for every single platform it targeted. The fundamental reason behind HSA is to get to a point where software guys code once and it works on all platforms.

HSA and NVLink aren't even close to comparable; they don't enable the same concepts. NVLink is a basic interconnect; HSA is a dramatically larger-scale industry standard that enables things neither NVLink nor Nvidia's attempt at 'unified memory' even attempts to enable.

What part of NVLink will enable an ARM core, an AMD memory controller, an AMD GPU, PowerVR video acceleration and several discrete Samsung IP blocks to be put together on the same chip and actually function? Nothing, correct.
 
Again you show a complete lack of understanding of the problem, because this is precisely what HSA is addressing. Oh no, wrong again.

No, you still have to sufficiently 'thread' your code to take advantage of any increase in execution units.

The main premise of HSA is to allow the CPU and GPU to work as one, and this is exactly what NVLink (and related technologies) enables, and what Nvidia are talking about in the whitepapers you have mentioned. The latest CUDA is already bringing UM to the table, and it will take advantage of the hardware when it is available. Funny that you talk about Java a lot: http://devblogs.nvidia.com/parallel...e-performance-java-power-systems-nvidia-gpus/ IBM are a pretty big beast when it comes to enterprise Java.

You say HSA is a standard, but it's a standard for markets the current incumbents have no interest in taking part in. Intel have no interest in plugging, or letting anyone else plug, into their architecture, and we know NV and IBM are doing their own thing.

HSA will likely gain some traction in the mobile ARM space, but with only AMD backing it in the desktop/laptop/server space (OK, the ARM guys have some hardware here that nobody is currently interested in), it's not going anywhere quickly there.


All this because I mentioned that IBM and NV were off cooking something up. It could have been left at that, but noooo, the evil green monster had to be slain, and here we are many posts later, and I for one am tired of it all and sometimes wish I'd never bothered. Why can't we have nice things? Why can't points be raised without someone going crazy? Nobody wanted to discuss my points, only beat them down because they didn't fit the agenda.

Give it a rest, Layte; you waded in without knowing whereof you speak. Just bow out gracefully.

The irony here is outstanding.
 