
The One Vision That Intel, AMD And Nvidia Are All Chasing – Why Heterogeneous Computing Is The Future

From A Beautifully Simple Concept To An Industry-Wide Vision: Heterogeneous Computing

All of Intel’s past actions and future roadmaps are strong indicators that this is the future that they envision. A future where CPUs and GPUs work seamlessly together to address the new challenges of CPU performance scaling and to address problems that CPUs and GPUs simply cannot solve separately.
AMD made it clear as early as 2006 that its goal was to build the ultimate heterogeneous processor.
The company was able to put its vision into practice more quickly with the Heterogeneous System Architecture Foundation and mold it into an industry-wide strategy that it hopes will bear fruit. The HSA Foundation was brought into existence as a collective industry effort to chase the untapped potential of heterogeneous designs and begin a new era of computing in which performance would once again scale at the rate of golden-age Silicon Valley.

AMD Forms The HSA Foundation

HSA stands for Heterogeneous System Architecture. To understand what this foundation is all about, we need to take a few steps back. AMD’s goal of building the ultimate heterogeneous processor meant that they had their work cut out for them. The company’s vision for the next era of computing has a far-reaching effect on the entire industry, which made industry-wide collaboration crucial to the success of any effort to bring it to reality. Luckily for AMD, many companies shared its aspirations. Industry giants such as Samsung, Qualcomm, ARM, Imagination Technologies, MediaTek and Texas Instruments joined AMD in its efforts, and the HSA Foundation was born.

So what is HSA & how does it solve the problem?

HSA is a relatively old concept based on a simple idea: run code on whichever processor would be fastest and most efficient at executing it. Serial code with a lot of branches and conditionals is well suited to the CPU, because that is the fastest and most efficient processor for that type of code. On the other hand, code that is fairly short, less conditional and massively parallel, such as the graphics code that calculates what colour each pixel on the screen should be, is well suited to a graphics processor.
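To make the distinction concrete, here is a toy Python sketch (illustrative only, not real GPU code) of the two workload shapes: a branchy loop whose iterations depend on each other, and a pure per-pixel function whose iterations are all independent.

```python
# Toy illustration of the two workload shapes described above.

def serial_branchy(values):
    # Each step depends on the previous result and takes a data-dependent
    # branch, so iterations must run in order: CPU-friendly.
    state = 0
    for v in values:
        if state % 2 == 0:
            state += v
        else:
            state -= v
    return state

def shade_pixel(x, y):
    # Pure function of (x, y) with no shared state, so every pixel
    # could be computed at the same time on a separate lane: GPU-friendly.
    return (x * 7 + y * 13) % 256

# A tiny 4x4 "frame": all 16 calls are independent of one another.
frame = [shade_pixel(x, y) for y in range(4) for x in range(4)]
```

The point is only the shape of the dependencies, not the work itself: the first loop has a serial chain, the second is what GPU programmers call embarrassingly parallel.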

GPUs differ from traditional CPUs in several key characteristics. CPUs generally have far more decode and branch-prediction resources, because they tend to deal with more complex, branchy code. GPUs, on the other hand, are designed with a heavy emphasis on execution resources, because they deal with code that is relatively less complex and data that is massively parallel. This means the weight falls on the execution engines rather than on a front end that has to handle the complexity of serial code.

A great yet simple example of a heterogeneous system is a gaming computer. The graphics processor does all the graphics heavy lifting, while the CPU deals with API communication, audio processing, artificial intelligence and gameplay physics such as bullet trajectories, hit boxes, etc. Now think of HSA as a significantly more sophisticated and versatile system based on the very same concept. Instead of the GPU and CPU working on two completely different tasks, such as graphics and AI, the processors can now work on and share the exact same task, such as physics, with each processor taking care of a different stage. The stages that would be completed faster on the CPU are done by the CPU, and the stages that are more appropriate for the GPU are handled by the GPU.

Luckily, this concept works exceptionally well because the majority of software out there has a healthy mix of serial and parallel workloads, making the heterogeneous processor the ideal candidate for a lot of software.

An example of such a task is a Suffix Array. Suffix Arrays are used in a variety of workloads, such as full text index search, lossless data compression and Bio-informatics.
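For readers unfamiliar with the data structure, here is a minimal sketch of what a suffix array is: the starting indices of all suffixes of a string, sorted lexicographically. Real implementations (e.g. the SA-IS algorithm) run in linear time; this naive O(n² log n) version is only illustrative.

```python
# Naive suffix array construction, for illustration only.
def suffix_array(text):
    # Sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

# Suffixes of "banana": a, ana, anana, banana, na, nana
print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Once built, the sorted order lets a text index answer substring queries by binary search, which is why the structure shows up in search, compression and bioinformatics.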

...

What does all of this mean?

Once you examine all the major players carefully, a crystal-clear image of the entire industry moving towards heterogeneous computing appears. AMD, Nvidia and Intel are all addressing the same challenges. As is usual with such cases, AMD chose to go for the open standard industry-wide route, where the entire industry (or as much of it as possible) collaborates to achieve a common goal. Nvidia chose to go for the proprietary route, while Intel took a more awkward position in the middle. They’re making sure their hardware is going to be up to snuff, but are leaving a lot of the industry-wide software challenges for the industry to deal with rather than address them directly like the HSA foundation is doing. Of course, there are exceptions to this, but they remain very specific and quite limited in scope.

...

Article http://wccftech.com/intel-amd-nvidia-future-industry-hsa/

I still find myself, year after year, wishing that Nvidia would learn to play well with others. If it had, then maybe we would be enjoying an optimised and enhanced PhysX integrated into past DirectX versions and, more excitingly, HSA extensions built into DirectX 12. Instead, I feel it will take many years before we get the performance benefits of heterogeneous computing in our day-to-day software.

What is your opinion?
 
I like the idea; however, look how long it is taking for multicore processing to be fully adopted. I can see this technology being prevalent in 2050.
 
That’s a clear enough summary, but it doesn’t add anything new. Most of the big apps, including Adobe's suite, which is probably the most widely used software suite in the world, are already GPU-accelerated via either CUDA or OpenGL. Ultimately that is what HSA is: enabling GPUs to accelerate applications and processes that were being done on the CPU.
 
It's all well and good supporting open efforts but when NVidia by themselves are implementing CUDA in a wider range of software it just goes to show how inefficient open collaborations can be.

You criticised PhysX, but NVidia have once again done a much better job of implementing it in games than AMD and co. with their 'open' alternative. How is it that a comparatively small company like NVidia is always able to do so much more to promote new technologies than AMD and its open-source alliances combined? It's not enough to pay lip service to open collaborations and then drag your feet about promoting them (unless all you care about is public image).
 
That’s a clear enough summary, but it doesn’t add anything new. Most of the big apps, including Adobe's suite, which is probably the most widely used software suite in the world, are already GPU-accelerated via either CUDA or OpenGL. Ultimately that is what HSA is: enabling GPUs to accelerate applications and processes that were being done on the CPU.

It's not, at all. HSA is not directly about GPU acceleration; it's about making a GPU work with a CPU with few to no penalties. Currently, what might save you 2 ms of computation time on the GPU can add 4 ms of latency to copy the data into the GPU's section of memory, process it, then copy it back for the CPU. HSA is about removing that latency by providing memory usable by both the CPU and the GPU at the same time.
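A back-of-envelope model, using the post's hypothetical 2 ms / 4 ms style figures (the function name and numbers are illustrative, not measurements), shows why removing the copy can flip offload from a net loss into a net win:

```python
# Offloading only pays off when the compute time saved exceeds the
# copy latency added; HSA-style shared memory removes the copy cost.

def offload_wins(cpu_ms, gpu_ms, copy_ms):
    """True if running on the GPU is a net win despite copy overhead."""
    return gpu_ms + copy_ms < cpu_ms

# Discrete GPU: 2 ms saved on compute, but 4 ms of copies -> net loss.
print(offload_wins(cpu_ms=10.0, gpu_ms=8.0, copy_ms=4.0))  # False
# Same kernel over shared memory (no copies) -> net win.
print(offload_wins(cpu_ms=10.0, gpu_ms=8.0, copy_ms=0.0))  # True
```

The same arithmetic explains the benchmark gap: the smaller each dispatched chunk of work, the more the fixed copy cost dominates.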

There is a reason why there are HSA benchmarks in which the same computation may show a 5% performance gain over the CPU when done via normal GPU compute, but a 300% gain when done via HSA.


Early APUs were sensible and the right step, but they lacked both the software and the hardware ecosystem to make full use of the GPU. While the GPU was unquestionably faster in many cases, the latency and overhead of accessing it erased most of the performance gain. QuickSync works because you're offloading one long task to the GPU, which means a large amount of work for a small initial overhead; most work for compute to be useful will mean constantly switching between CPU and GPU and using them concurrently. That means huge numbers of small accesses, each carrying the overhead, and that is where memory copying and other latency penalties stack up.

Enabling compute in any application has been doable since the start of computing; the ability to offload tasks has always been there, it's just incredibly difficult to do on an individual basis. Making other devices easy to access, with no overhead, and easy for coders to use is the step needed before offload becomes widely used. HSA, and support for it in commonly used languages, changes that entirely while removing the overhead.

This is the big deal with HSA: with Samsung, Apple, AMD, MediaTek, Qualcomm and almost every ARM device on board, 80-90% of all CPUs/GPUs/SoCs/APUs will be shipping with HSA support.

When you have thousands of software companies who want to use compute but have to program it with complex code on dozens of platforms individually, you have huge cost and little benefit, because the overheads make only a few stand-out cases faster. With massive industry-wide support you get what is already happening: Java support, and other language and API support. That means code that is easier to write, written once for all platforms. Combine that with the removed hardware overhead of the architecture and you've moved from difficult per-platform coding with a small performance benefit to easy write-once code for all platforms with a significantly larger one.
 
It's all well and good supporting open efforts but when NVidia by themselves are implementing CUDA in a wider range of software it just goes to show how inefficient open collaborations can be.

You criticised PhysX, but NVidia have once again done a much better job of implementing it in games than AMD and co. with their 'open' alternative. How is it that a comparatively small company like NVidia is always able to do so much more to promote new technologies than AMD and its open-source alliances combined? It's not enough to pay lip service to open collaborations and then drag your feet about promoting them (unless all you care about is public image).

More complete rubbish. You mean the same way FreeSync has fewer panels out than G-Sync because G-Sync was first? Oh right, the industry standard took longer but already has significantly more support, more panels coming, no cost penalty, and is better for everyone?

G-Sync will die; FreeSync, or Adaptive-Sync, will be used by everyone. Speed to market doesn't matter; what a technology does for the industry and the users, and how it's adopted, is what matters. Open standards take longer 99% of the time because the industry needs to get behind them, but once it does they become used far, FAR more widely than any closed standard. This has been proven time and time again.

PhysX hardware-acceleration support, after what, a decade of being out (and it's something Nvidia didn't even create), is woeful. PhysX/GameWorks games are often stuttering messes, and one of the few companies consistently implementing them, Ubisoft, is now more hated than EA due to the lack of quality of its games... yeah, woo, PhysX and closed standards, really pushing the industry forward. :rolleyes:
 
It's all well and good supporting open efforts but when NVidia by themselves are implementing CUDA in a wider range of software it just goes to show how inefficient open collaborations can be.

You criticised PhysX but NVidia have once again done a much better job of implementing it in games than AMD and co. with their 'open' alternative, how come a comparatively small company like NVidia are always able to do so much more work promoting new technologies than AMD and their open source alliances combined? it's not enough to just give lip service to open collaborations but then drag your feet about promoting them (unless all you care about is public image).

I don't follow what you mean. CUDA has little backing and addresses a tiny part of the world graphics market. Heterogeneous compute is being backed by all the big players, who make up 90%+ of the world market. About the only people not backing it are NVidia, and it will go ahead without them. Heterogeneous computing is massive and far bigger than CUDA.

Heterogeneous is more than just the desktop market; it's a solution for everything from the smallest chip to the biggest. The goal is a high-level programming language that works across all major CPUs, GPUs and DSPs. This might be oversimplified, but you write a program and it will run no matter which hardware you use. Not only run: the program will choose the best hardware to run on; for example, parallel processing will be done on the GPU. What this also means is that you can swap the CPU from Intel, MIPS or AMD to ARM or anything else and it doesn't matter. Same for the GPU: you can swap it and the high-level program still runs.
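The "program chooses the best hardware" idea can be sketched as a dispatcher. To be clear, none of these names come from the real HSA runtime; this is a hypothetical toy heuristic showing only that the runtime, not the programmer, picks the device based on the workload's shape.

```python
# Hypothetical device-dispatch heuristic (not a real HSA API):
# big, uniform, non-branchy workloads go to the GPU, everything
# else stays on the CPU.

def pick_device(n_items, branchy):
    """Toy heuristic: return which processor should run the work."""
    if branchy or n_items < 10_000:
        return "cpu"   # serial/branchy or too small to amortise dispatch
    return "gpu"       # large and data-parallel

print(pick_device(n_items=1_000_000, branchy=False))  # gpu
print(pick_device(n_items=500, branchy=True))         # cpu
```

Because the application only calls the dispatcher, swapping the underlying CPU or GPU vendor changes nothing in the application code, which is exactly the portability argument being made.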
 
It's all well and good supporting open efforts but when NVidia by themselves are implementing CUDA in a wider range of software it just goes to show how inefficient open collaborations can be.

You criticised PhysX, but NVidia have once again done a much better job of implementing it in games than AMD and co. with their 'open' alternative. How is it that a comparatively small company like NVidia is always able to do so much more to promote new technologies than AMD and its open-source alliances combined? It's not enough to pay lip service to open collaborations and then drag your feet about promoting them (unless all you care about is public image).

The open-source alternatives AMD promotes aren't actually AMD inventions, despite what AMD PR would have you believe. Bullet Physics was developed by a developer who worked at Sony, had a stint at AMD, and is now at Google.
OpenCL was developed by Apple and handed over to the Khronos Group, which is heavily supported by Nvidia.
 
If you look at VISC, they can somehow extract a great deal of context from instructions and bundle them off to virtual cores using resources from any number of real ones, for very little performance hit. AMD are one of the major investors, so we can assume they want priority and licensing considerations for something else they're working on... now what could that be?
 

Hey, that looks like really impressive stuff! Is support for this HSA method built into the programs listed on that chart, or has it been implemented just to show off the potential gains? I'm finding it difficult to find solid details of when we can expect to actually see it implemented in software; everything I find when I google it talks about the theoretical implications and shows off limited demos, but that might just be my fault for not knowing precisely what to look for.
 
HSA is an exciting tech that isn't really anywhere yet, other than in a few very specific applications and benchmarks as far as I am aware. It is a real shame, because it has fantastic potential.
 
Hey, that looks like really impressive stuff! Is support for this HSA method built into the programs listed on that chart, or has it been implemented just to show off the potential gains? I'm finding it difficult to find solid details of when we can expect to actually see it implemented in software; everything I find when I google it talks about the theoretical implications and shows off limited demos, but that might just be my fault for not knowing precisely what to look for.

Just LibreOffice and some Corel applications, as far as I know; there might be some Adobe stuff.

The problem is that for this to become viable, all hardware vendors need to agree on an implementation and then work together.

On one side you have the HSA founders, mainly AMD, ARM and Qualcomm, all agreed on an open-source strategy.

Then you have Nvidia, who think they have seen another way to lock you into their hardware and are accordingly going the proprietary route.
Then there's Intel, where it's unclear how they want to approach this; they are sort of sitting on the fence thinking "do we go with the HSA Foundation or go it alone?"

What we could end up with is three different approaches to the same thing, with one certainly needing proprietary software and specific hardware, namely one vendor's discrete GPUs, given they don't make APUs.

Some people may not like me saying this, but so far only one has bothered to get the ball rolling and make compatible hardware for it.
The problem is that that one is a minor player; it's not enough to give the ball any momentum.
 
This is going to end up with AMD and the ARM lot hawking their low-power/low-performance wares, gaining little interest beyond smartphone and tablet toys; the money and the heavy apps are not in this space. Intel will be doing their own thing, because Intel. Nvidia and IBM are cooking something up with NVLink and their tie-up; how much of it trickles down is anybody's guess.
 
This is going to end up with AMD and the ARM lot hawking their low-power/low-performance wares, gaining little interest beyond smartphone and tablet toys; the money and the heavy apps are not in this space. Intel will be doing their own thing, because Intel. Nvidia and IBM are cooking something up with NVLink and their tie-up; how much of it trickles down is anybody's guess.

The money certainly isn't in Desktops

HSA is good for High density cluster servers.
 
Nothing Nvidia are doing is remotely comparable. NVLink is just that, a link, and it's based on PCI-E, nothing more or less. IBM already have a PCI-E-based link that was a bit faster than the standard one; this is the next stage. It doesn't combine GPU and CPU on a single die, doesn't offer the same low-latency, low-overhead changes, and doesn't offer unified memory in the HSA sense (just memory that doesn't need to be managed separately: GPU and CPU addresses are virtualised as one, so code doesn't need to take them into account, but the hardware still has all the overheads).

AMD and ARM, along with Apple, Qualcomm and everyone else, are involved, and all higher performance leads to lower power. Look at the HSA benchmarks above: what can be achieved in multiple application areas provides higher performance than alternative GPGPU setups. There are undeniable overheads to current GPGPU that are impossible to remove without huge architectural changes; these are those changes.

For a while, low power meant, say, running at lower voltage and lower clock speeds: taking longer to complete actions, doing them at lower average power but higher total energy.

Chips work in such a way that if you can fence a block off and turn it off, you stop almost all leakage losses; while it's turned on, leakage is a problem. Everyone, literally everyone in the industry, now uses the power-saving model of "hurry up and shut down".

You can run something slowly at 3 W, taking 5 seconds before turning off to idle again, for 15 J of energy in total; the alternative is running at 10 W for only 1 second before turning off to idle again, for 10 J in total. Everyone in the industry has moved to the latter. Increases in performance mean power reduction, and power reduction means increases in performance; one is the other now. There is no difference in design goals between the highest- and lowest-end chips: higher performance per watt is the goal in every segment, and HSA is all about higher performance.
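The "hurry up and shut down" arithmetic is just energy = power × time, using the figures quoted in the paragraph above:

```python
# Race-to-idle: finishing faster at higher power can still use less
# total energy before the chip power-gates back to idle.

def energy_joules(watts, seconds):
    return watts * seconds

slow = energy_joules(3, 5)    # run slowly: 3 W for 5 s = 15 J
fast = energy_joules(10, 1)   # race to idle: 10 W for 1 s = 10 J
print(slow, fast)             # 15 10
```

This is why the claim works only when idle power is near zero: power-gating after the burst is what makes the fast case cheaper overall.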

As for saying the money isn't in mobile.... lol.
 
The money certainly isn't in Desktops

HSA is good for High density cluster servers.

No, it's in servers, workstations and HPC: markets where ARM and AMD are insignificant players right now, hence this attempt to create a "standard" in which the current incumbents have no interest in playing.
 
As for saying the money isn't in mobile.... lol.

The only people making any real money in mobile are Apple, who sell cheap ARM hardware in shiny boxes. Qualcomm are pulling in most of their money from modems and licensing. Intel alone are making more than AMD and the commodity ARM players combined.
 
Nothing Nvidia are doing is remotely comparable. NVLink is just that, a link, and it's based on PCI-E, nothing more or less. IBM already have a PCI-E-based link that was a bit faster than the standard one; this is the next stage. It doesn't combine GPU and CPU on a single die, doesn't offer the same low-latency, low-overhead changes, and doesn't offer unified memory in the HSA sense (just memory that doesn't need to be managed separately: GPU and CPU addresses are virtualised as one, so code doesn't need to take them into account, but the hardware still has all the overheads).

Sigh... One of these days you will educate yourself before going off on one of your rants. Here, I've done the hard bit for you. http://devblogs.nvidia.com/parallelforall/how-nvlink-will-enable-faster-easier-multi-gpu-computing/

I'll quote you the pertinent bit so you don't get confused:
Unified Memory and NVLink represent a powerful combination for CUDA® programmers. Unified Memory provides you with a single pointer to data and automatic migration of that data between the CPU and GPU. With 80 GB/s or higher bandwidth on machines with NVLink-connected CPUs and GPUs, that means GPU kernels will be able to access data in host system memory at the same bandwidth the CPU has to that memory—much faster than PCIe. Host and device portions of applications will be able to share data much more efficiently and cooperatively operate on shared data structure, and supporting larger problem sizes will be easier than ever.

In all current HSA-supporting hardware, the CPU component has far less than 80 GB/s of access to data stored in shared memory; the only real question is how much latency is reduced in NVLink and the other successors to current PCIe interconnects. I'd expect that is something being actively worked on.
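For scale, here is a rough transfer-time comparison between the quoted 80 GB/s NVLink figure and a PCIe 3.0 x16 link (roughly 16 GB/s peak). These are theoretical peaks, not measurements, and real transfers also pay latency and protocol overhead:

```python
# Simple bandwidth arithmetic: time to move a payload at a given rate.
def transfer_ms(gigabytes, gb_per_s):
    return gigabytes / gb_per_s * 1000

print(transfer_ms(1, 16))  # 62.5 ms for 1 GB over ~PCIe 3.0 x16
print(transfer_ms(1, 80))  # 12.5 ms for 1 GB over the quoted NVLink rate
```

Bandwidth shortens bulk copies by the same factor, but it does nothing for the per-dispatch latency that HSA-style shared memory targets, which is why the two approaches aren't directly comparable.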
 
Last edited: