AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute
by Ryan Smith on December 21, 2011 9:38 PM EST
Prelude: The History of VLIW & Graphics
Before we get into the nuts & bolts of Graphics Core Next, perhaps it’s best to start at the bottom, and then work our way up.
The fundamental unit of AMD’s previous designs has been the Streaming Processor, previously known as the SPU. In every modern AMD design other than Cayman (6900), this is a Very Long Instruction Word 5 (VLIW5) design; Cayman reduced this to VLIW4. As implied by the architectural name, each SP would in turn have 5 or 4 fundamental math units – what AMD now calls Radeon cores – which executed the individual instructions in parallel over as many clocks as necessary. Radeon cores were coupled with registers, a branch unit, and a special function (transcendental) unit as necessary to complete the SP.
VLIW designs excel at executing many operations from the same task in parallel by breaking the task up into smaller groupings called wavefronts. In AMD’s case a wavefront is a group of 64 pixels/values and the list of instructions to be executed against them. Ideally, in a wavefront a group of 4 or 5 instructions will come down the pipe and be completely non-interdependent, allowing every Radeon core to be fed. When dependent instructions come down the pipe, however, fewer instructions can be scheduled at once, and in the worst case only a single instruction can be scheduled. VLIW designs will never achieve perfect efficiency in this regard, but the farther real world utilization is from ideal efficiency, the weaker the benefits of VLIW.
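To make the packing problem concrete, here is a minimal sketch of how dependencies limit the number of slots that can be filled per clock. It is an illustration only, not AMD's compiler: the greedy packer, the function names `pack_bundles` and `utilization`, and the two toy programs are all our own assumptions, and the model ignores real-world details such as instruction latency, register port limits, and the special function unit.

```python
from typing import Dict, List, Set


def pack_bundles(deps: Dict[str, Set[str]], width: int) -> List[List[str]]:
    """Greedily pack instructions into VLIW bundles of at most `width` slots.

    `deps` maps each instruction name to the set of instructions whose
    results it needs. An instruction can only be placed in a bundle once
    all of its dependencies have been issued in an earlier bundle.
    """
    done: Set[str] = set()
    remaining = list(deps)  # dicts preserve insertion (program) order
    bundles: List[List[str]] = []
    while remaining:
        # Take the first `width` instructions whose inputs are all ready.
        bundle = [op for op in remaining if deps[op] <= done][:width]
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


def utilization(bundles: List[List[str]], width: int) -> float:
    """Fraction of available slots that actually hold an instruction."""
    filled = sum(len(bundle) for bundle in bundles)
    return filled / (len(bundles) * width)


# Hypothetical best case: five operations with no dependencies between them.
independent = {"a": set(), "b": set(), "c": set(), "d": set(), "e": set()}
# Hypothetical worst case: each operation needs the result of the previous one.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

for name, program in (("independent", independent), ("chain", chain)):
    for width in (5, 4):
        bundles = pack_bundles(program, width)
        print(f"{name:12s} VLIW{width}: {len(bundles)} bundle(s), "
              f"{utilization(bundles, width):.1%} of slots filled")
```

In this toy model the independent program fills a single VLIW5 bundle completely, while the dependency chain can only issue one instruction per bundle regardless of width, which is the worst case described above.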
The use of VLIW can be traced back to the first AMD DX9 GPU, R300 (Radeon 9700 series). If you recall our Cayman launch article, we mentioned that AMD initially used a VLIW design in those early parts because it allowed them to process a 4 component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time, which was by far the most common graphics operation. Even when moving to unified shaders in DX10 with R600 (Radeon HD 2900), AMD still kept the VLIW5 design because the gaming market was still DX9 and using those kinds of operations. But as new games and GPGPU programs have come out, efficiency has dropped over time, and based on AMD’s own internal research at the time of the Cayman launch, the average shader program was utilizing only 3.4 out of 5 Radeon cores. Shrinking from VLIW5 to VLIW4 fights this some, but utilization will always be a concern.
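As a quick back-of-the-envelope check on those figures, the sketch below turns the 3.4-out-of-5 number into a utilization percentage. The VLIW4 line rests on an assumption that is ours, not AMD's: that a 4-wide design could still find 3.4 operations per clock on average. With equally good scheduling a narrower design cannot fill more slots per clock than a wider one, so that figure should be read as an upper bound rather than a measurement.

```python
# Figure quoted in the article: on average 3.4 of 5 Radeon cores were in use.
AVG_SLOTS_USED = 3.4
VLIW5_WIDTH = 5
VLIW4_WIDTH = 4

vliw5_utilization = AVG_SLOTS_USED / VLIW5_WIDTH

# Assumption (not from the article): the same 3.4 operations per clock can
# still be found with only 4 slots available. Real shaders would likely fill
# somewhat fewer, so this is an upper bound, not a measured value.
vliw4_upper_bound = min(AVG_SLOTS_USED, VLIW4_WIDTH) / VLIW4_WIDTH

print(f"VLIW5 utilization: {vliw5_utilization:.0%}")
print(f"VLIW4 utilization (upper bound under our assumption): {vliw4_upper_bound:.0%}")
```

Even under that optimistic assumption some slots still go unfilled, which is why utilization remains a concern after the move to VLIW4.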
Finally, it’s worth noting what’s in charge of doing all of the scheduling. In the CPU world we throw things at the CPU and let it schedule actions as necessary – it can even go out-of-order (OoO) within a thread if it will be worth it. With VLIW, scheduling is the domain of the compiler. The compiler gets the advantage of knowing about the full program ahead of time and can intelligently schedule some things well in advance, but at the same time it’s blind to other conditions where the outcome is unknown until the program is run and data is provided. Because of this the schedule is said to be static – it’s set at the time of compilation and cannot be changed in-flight.
So why in an article about AMD Graphics Core Next are we going over the quick history of AMD’s previous designs? Without understanding the previous designs, we can’t understand what is new about what AMD is doing, or more importantly why they’re doing it.
83 Comments
EJ257 - Saturday, June 18, 2011
I can't believe it's been 6 years since the X360 and PS3 release. It seems like this latest generation of consoles stuck around a lot longer than previous versions did. Any speculations on what kind of hardware MS and Sony will throw into the next gen?
DanNeely - Sunday, June 19, 2011
They have. The big console makers, at the game devs' request, were trying to make the current generation last a decade to allow more time to recover the work expended figuring out how to best program them. The motion capture cameras were supposed to be the thing that kept the platforms from getting too stale. I suspect, however, that by planning to launch its new console early, Nintendo may have blown those plans out of the water.
jabber - Sunday, June 19, 2011
I'm pretty sure the hardware specs for both the next Xbox and Playstation have been set in stone already.
I'm still betting on a 2013 release too.
So right now GPU wise I reckon we're looking at GPUs currently sitting in the $100 range for both boxes. By 2013, the cost of these chips (suitably modified) will be down to $15-$10 a box.
I wouldn't have thought anything higher than a 5770 or 450 would be suitable/required.
Targon - Monday, June 20, 2011
It all depends on what you expect. Things feel a bit stagnant on the PC game front because consoles are not evolving, and too many companies want almost exactly the same experience on the PC version as what you have on the console.
Stargrazer - Saturday, June 18, 2011
Something doesn't feel right here. In itself, SIMD is about *Data* Level Parallelism, not Thread Level Parallelism. Sure, you could use SIMD units as part of some larger scheme that exploits TLP, but that's not what *SIMD* is about.
Loki726 - Saturday, June 18, 2011
If you use a strict definition of a SIMD programming model, then yes, you are probably right: SIMD is a single sequence of operations executed over multiple data elements.
However, over time SIMD has been used to refer to both the aforementioned programming model and the hardware used to implement it. The hardware typically consists of a single control unit that broadcasts instructions to multiple functional units. When people say "a SIMD", they typically mean that hardware implementation rather than the computing model.
If that wasn't confusing enough, in the 1980s GPUs started using that SIMD hardware to execute multiple threads as long as the threads were all executing the same instruction at the same time.
So the statement about using "a SIMD" to exploit TLP is accurate, if you take "a SIMD" to mean a processor pipeline with a single control unit that broadcasts to multiple functional units, and have some scheme for scheduling threads onto functional units.
RedemptionAD - Saturday, June 18, 2011
It seems like a good thing potentially. I hope that their good intentions are followed with good execution, at least better than Fermi.
Targon - Sunday, June 19, 2011
It should be interesting going forward. Now that AMD is finally into the 32nm process node, standalone GPUs also stand to gain quite a bit. As long as graphics don't become an afterthought to GPGPU, AMD should be in good shape. Radeon 7970 (if that is the next generation GPU) may really be a game changer.
Navier - Saturday, June 18, 2011
Will the GCN architecture be able to be virtualized? Can a VMWare/XEN/KVM/HyperV hypervisor create vGPUs accessible by VMs in much the same way as vCPUs are today? With GPUs being integrated within the CPU package it would be a waste of resources if it could not be virtualized.
This will become a critical feature for enterprise computing beyond HPC applications. One example would be gaming in a cloud computing environment, where a company provides a service that runs a game on their compute and graphics hardware and streams the output to your mobile device for you to enjoy.
hechacker1 - Saturday, June 18, 2011
Yeah I'm also curious about this. Perhaps with the IOMMU and other CPU-like features that the GPU now has, it would be much easier to timeshare the GPU.