Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
by Anand Lal Shimpi on May 14, 2012 3:46 PM EST - Posted in
- CPUs
- AMD
- Ask the Experts
- GPUs
AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju as he spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. His current role is as the Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.
Given what we know about the new AMD and its goal of building a Heterogeneous Systems Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.
The hardware strategy is clear: don't just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new, as both AMD and Intel have been headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel. The CPU and GPU are treated almost as equal-class citizens on AMD APUs, at least when it comes to die area.
The software strategy is what AMD is working on now. AMD’s Fusion12 Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.
That brings us to today. In advance of this year’s AFDS, Manju has agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Manju has a BS in Electrical Engineering (IIT, Bombay) and a PhD in Computer Information and Control Engineering (UMich, Ann Arbor) so make the questions as tough as you can. He'll be answering them on May 21st so keep the submissions coming.
101 Comments
BenchPress - Wednesday, May 16, 2012
How is Amdahl's Law an argument in favor of heterogeneous computing? It tells us that when you scale up the parallel processing, you get diminishing returns. So there's also a need to focus on sequential processing speed.
GPUs are incredibly slow at processing a single thread. They just achieve high throughput by using many threads. That's not a good thing in light of Amdahl's Law. Even more so since the explicit parallelism is finite. And so it's really no coincidence that the pipeline length of GPUs has been shortening ever since they became programmable, to improve the latencies and lower the number of threads.
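To put rough numbers on that diminishing-returns point, here is a minimal C sketch of Amdahl's Law; the 90% parallel fraction and the core counts are illustrative assumptions, not measurements of any real chip:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
     * fraction of the work and n is the number of processors. */
    static double amdahl(double p, double n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double n[] = { 4, 16, 64, 256, 1024 };
        for (int i = 0; i < 5; i++)
            printf("p = 0.90, n = %4.0f -> speedup %.2fx\n", n[i], amdahl(0.90, n[i]));
        /* With 90% parallel code the speedup saturates near 10x no matter how
         * many extra GPU-style cores are thrown at it; the serial 10% dominates. */
        return 0;
    }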
Please observe that this means the GPU is slowly evolving in the direction of a CPU architecture, where instruction latencies are very low, and caches, prefetching and out-of-order execution ensure that every thread advances as fast as possible so you only need a few and don't suffer from limited parallelism.
This convergence hasn't been slowing down. So it's obvious that in the future we'll end up with devices which combine the advantages of the CPU and GPU into one. With AVX2 that future isn't very far away any more.
TeXWiller - Friday, May 18, 2012
Say you want to have 1024 threads at your disposal and have limited chip resources. Integrating 256 or more SB-level cores on a single chip while using low power is quite difficult.
Instead, if you have 16 high-performance threads of compute from 4 to 8 very wide cores with all the vector goodness that fits in the power budget, combined with 246 Larrabee-style narrow cores with very good energy efficiency, you can have your cake and eat it too, so to speak.
Heterogeneous computing is all about using parallel cores for parallel problems and powerful sequential cores for sequential problems. Scaling simply stops otherwise. The concept of heterogeneous computing does not imply anything about the actual ISAs used; instead, it implies a model of computation. This seems to be the true issue in this discussion.
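A minimal sketch of that split in plain C with OpenMP (standing in here for whatever the pool of narrow cores actually runs; the array size and per-element work are made up):

    #include <omp.h>
    #include <stdio.h>

    #define N (1 << 20)
    static float data[N];

    int main(void)
    {
        /* Sequential phase: latency-bound setup, best served by one fast core. */
        for (int i = 0; i < N; i++)
            data[i] = (float)i;

        /* Parallel phase: throughput-bound loop, spread across however many
         * simple cores (or hardware threads) the chip provides. */
        float sum = 0.0f;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += data[i] * 0.5f;

        printf("sum = %f, threads available: %d\n", sum, omp_get_max_threads());
        return 0;
    }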
BenchPress - Friday, May 18, 2012
You're offering a solution in search of a problem. Nobody "wants" 1024 threads. In an ideal world we'd have one thread per process.
I'm afraid though you're confused about the difference between a CPU thread and a GPU thread. You have to carefully distinguish between threads, warps, wavefronts, strands, fibers, bundles, tiles, grids, etc. For instance, using Intel's terminology, a quad-core Haswell CPU will have no problem running 8 threads, 64 fibers and 2048 strands. In fact you can freely choose the number of fibers per thread and the number of strands per fiber. The optimal amount depends on the kernel's register count and available cache space. But a higher strand count definitely doesn't automatically equal higher performance.
Likewise, a high number of cores is never the right answer. You have to balance core count, SIMD width, and issue width. And when GPU manufacturers give you a 'compute core' count, they multiply all these numbers. Using this logic, mainstream Haswell will have 64 compute cores (at three times the clock frequency of most GPUs).
And this is why CPUs are actually much closer in compute density to GPUs than the marketing terminology might have you conclude.
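As a back-of-the-envelope illustration of that multiplication (the core count, SIMD width, port count and clock below are assumptions for a hypothetical quad-core AVX2 part, not published specs):

    #include <stdio.h>

    int main(void)
    {
        /* GPU-style marketing math: cores x SIMD lanes x FMA issue ports. */
        int cores = 4;          /* hypothetical quad-core CPU */
        int simd_lanes = 8;     /* 8 single-precision lanes per 256-bit vector */
        int fma_ports = 2;      /* assumed FMA issue ports per core */
        double clock_ghz = 3.5; /* assumed clock, not a spec */

        int marketing_cores = cores * simd_lanes * fma_ports;   /* "64 compute cores" */
        double peak_gflops = marketing_cores * 2.0 * clock_ghz; /* 2 flops per FMA */

        printf("\"compute cores\": %d, theoretical peak: %.0f GFLOPS\n",
               marketing_cores, peak_gflops);
        return 0;
    }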
TeXWiller - Saturday, May 19, 2012
I was talking only about hardware parallelism and the problems of scaling software once the parallelism available on a single chip exceeds a certain limit. I wasn't saying anything about programming frameworks or zones of locality.
You don't sound like you believe Intel's MIC will ever be a useful solution for HPC. That would be ironic considering your other posts. In the ideal world where we have a single thread per process, we also have that 10GHz Pentium 4, right?
BenchPress - Sunday, May 20, 2012
Intel's MIC is aimed at supercomputers, which is very different from the consumer market. The problem sizes and run times are both many orders of magnitude larger, so they require a very different architecture. The MIC will do fine in the HPC market, but it's not an architecture suited for consumer software.
Consumer software is a complex mix of ILP, DLP and TLP, and can switch between them on a microsecond scale or less. So the hardware has to cater for each type of parallelism. GPUs have a strong focus on DLP and can do a bit of TLP. CPUs can deal with ILP and TLP, and next year they'll add DLP to the list using AVX2.
Beyond that the core count will go up again, and we'll have TSX to efficiently synchronize between them.
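For what CPU-side DLP looks like in practice, here's a minimal FMA intrinsics sketch (assuming a compiler with AVX2/FMA support, e.g. built with -mavx2 -mfma; the data is made up):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        float r[8];

        __m256 va = _mm256_loadu_ps(a);
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_loadu_ps(c);
        __m256 vr = _mm256_fmadd_ps(va, vb, vc); /* r = a*b + c, 8 lanes per instruction */
        _mm256_storeu_ps(r, vr);

        for (int i = 0; i < 8; i++)
            printf("%.1f ", r[i]);
        printf("\n");
        return 0;
    }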
iwod - Tuesday, May 15, 2012
I think BenchPress actually got it right, unless someone could explain to me otherwise. Hardware without software means nothing. And just by looking at GPU compute adoption after all these years of NVIDIA trying to push it, you can tell it is only going to take off in the HPC space, where the gains justify the cost of re-engineering software away from traditional methods.
Someday GPGPU will eventually be so powerful that we simply cannot ignore it. But as x86 continues to evolve, with AVX adding even more FP power, taking advantage of that continual improvement is going to be much, much easier for developers.
Of course, that is assuming Intel keeps pushing AVX forward. Otherwise there simply isn't enough incentive to rewrite apps in OpenCL.
I have a feeling that even Apple has sort of abandoned (or slowed down on) OpenCL.
GullLars - Tuesday, May 15, 2012
I think he is partially correct. Having a wide vector unit within the out-of-order domain and connected to the flexible CPU caches will allow acceleration of sequential code for instructions and methods within a program. However, making use of explicit parallelism for larger heavy tasks which are massively or embarrassingly parallel will allow for a more efficient speedup.
Heterogeneous computing also allows easier addition of modular special-purpose accelerators which may be power-gated. Intel's Quick Sync and AES-NI are examples of the performance and power efficiency of such accelerators (see the sketch below).
As geometries become smaller, transistors become cheaper and power density increases. Dedicating die area to accelerators that may be power-gated, and having a large GPGPU area which can run at a lower clock and voltage and/or be clock/power gated in sections, will help manage thermal limitations. Thermals will likely become a problem of geometry (active transistor density and distribution), not just total chip power consumption.
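A minimal sketch of the AES-NI primitive being referred to, a single hardware AES round via _mm_aesenc_si128 (a real AES-128 encryption also needs key expansion and the full ten rounds; the state and round-key values below are placeholders; compile with -maes):

    #include <immintrin.h>  /* AES-NI and SSE intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Placeholder 128-bit state and round key. */
        __m128i state = _mm_set_epi32(0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c);
        __m128i rkey  = _mm_set_epi32(0x16157e2b, 0xa6d2ae28, 0x8815f7ab, 0x3c4fcf09);

        /* One AES round in hardware: SubBytes, ShiftRows, MixColumns, AddRoundKey. */
        state = _mm_aesenc_si128(state, rkey);

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, state);
        for (int i = 0; i < 16; i++)
            printf("%02x", out[i]);
        printf("\n");
        return 0;
    }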
Denithor - Monday, May 14, 2012
Guess that's the big question - what other areas of software are going to benefit substantially from being able to use GPU acceleration?
I was asking like a week ago in the forums if anyone thought we'd see physics for games show up being run on the iGPU (either on Intel or AMD processors). Especially in cases where a user already has a powerful discrete GPU, is there any advantage to buying a CPU with an on-die GPU as well, or are those going to be just extra baggage for most power users?
Denithor - Monday, May 14, 2012
And one other question: these days the drive in CPUs is to lower power/heat generation, compared to discrete GPUs, where it's nice to use less power but that's not really as much of a driver as increased performance.
I imagine that an integrated GPU gains a serious advantage from sharing a cache at some level with the CPU, making workflow much more efficient.
However, for these integrated GPUs to seriously challenge discrete cards, I think they are going to have to push the power consumption up significantly. Currently we don't mind using 70-100+W for the CPU and another 150-300W for the discrete GPU. Are there any plans to release a combined CPU+iGPU that will use like 200W or more? Or is the iGPU going to continue to be held back to a minimal performance level by power concerns?
Matias - Monday, May 14, 2012
When will we see GPU usage in parallel tasks such as file compression/decompression (RAR, mp3, flac), database management (SQL), even booting Windows etc? Sorry if my questions are too simplistic or ignorant.