NVIDIA Reveals Next-Gen Turing GPU Architecture: NVIDIA Doubles-Down on Ray Tracing, GDDR6, & More
by Ryan Smith on August 13, 2018 8:20 PM EST
Moments ago at NVIDIA’s SIGGRAPH 2018 keynote presentation, company CEO Jensen Huang formally unveiled the company’s much awaited (and much rumored) Turing GPU architecture. The next generation of NVIDIA’s GPU designs, Turing will be incorporating a number of new features and is rolling out this year. While the focus of today’s announcements is on the professional visualization (ProViz) side of matters, we expect to see this used in other upcoming NVIDIA products as well. And by the same token, today’s reveal should not be considered an exhaustive listing of all of Turing’s features.
Hybrid Rendering & Neural Networking: RT & Tensor Cores
So what does Turing bring to the table? The marquee feature, at least for NVIDIA’s ProViz crowd, is hybrid rendering, which combines ray tracing with traditional rasterization to exploit the strengths of both technologies. This announcement is essentially a continuation of NVIDIA’s RTX announcement from earlier this year, so if you thought that announcement was a little sparse, well, here is the rest of the story.
The big change here is that NVIDIA is going to be including even more ray tracing hardware with Turing in order to offer faster and more efficient hardware ray tracing acceleration. New to the Turing architecture is what NVIDIA is calling an RT core, the underpinnings of which we aren’t fully informed about at this time, but which serves as a dedicated ray tracing processor. These processor blocks accelerate both ray-triangle intersection checks and bounding volume hierarchy (BVH) manipulation, the latter being a very popular data structure for organizing objects for ray tracing.
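To give a rough sense of what those two operations look like in software, here is a minimal, purely illustrative Python sketch of a ray/box (BVH node) test and a Möller-Trumbore ray/triangle test. This is emphatically not NVIDIA's implementation; the function names, data layout, and recursive traversal description are our own assumptions for illustration only.

```python
import numpy as np

def ray_aabb_hit(orig, inv_dir, box_min, box_max):
    """Slab test: does the ray hit an axis-aligned bounding box (a BVH node)?"""
    t1 = (box_min - orig) * inv_dir
    t2 = (box_max - orig) * inv_dir
    tmin = np.max(np.minimum(t1, t2))
    tmax = np.min(np.maximum(t1, t2))
    return tmax >= max(tmin, 0.0)

def ray_triangle_hit(orig, d, v0, v1, v2, eps=1e-7):
    """Möller-Trumbore ray/triangle test: returns hit distance t, or None on a miss."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(d, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                    # ray is parallel to the triangle's plane
        return None
    s = orig - v0
    u = np.dot(s, p) / det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(d, q) / det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) / det
    return t if t > eps else None

# Traversing a BVH means running the cheap box test at every node to prune whole
# subtrees, then running the triangle test only on the leaves that survive.
orig = np.array([0.05, 0.05, -1.0])
d = np.array([0.05, 0.05, 1.0])           # ray direction (not normalized)
v0, v1, v2 = map(np.array, ([-1.0, -1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 1.0, 0.0]))
box_min, box_max = np.minimum(np.minimum(v0, v1), v2), np.maximum(np.maximum(v0, v1), v2)
if ray_aabb_hit(orig, 1.0 / d, box_min, box_max):
    print(ray_triangle_hit(orig, d, v0, v1, v2))   # ~1.0: the ray hits the triangle
```

In a real renderer the cheap box test gets applied at every node of the tree, so only a handful of leaf triangles ever reach the more expensive triangle test; offloading both from the shader cores is what the RT cores appear to be for.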
NVIDIA is stating that the fastest Turing parts can cast 10 billion (giga) rays per second, a 25x improvement in ray tracing performance over unaccelerated Pascal hardware.
The Turing architecture also carries over the tensor cores from Volta, and indeed these have even been enhanced over their Volta counterparts. The tensor cores are an important aspect of multiple NVIDIA initiatives. Along with speeding up ray tracing itself, NVIDIA’s other tool in their bag of tricks is to reduce the number of rays required in a scene by using AI denoising to clean up an image, which is something the tensor cores excel at. Of course that’s not all the tensor cores are for – NVIDIA’s entire AI/neural networking empire is all but built on them – so while not a primary focus for the SIGGRAPH crowd, this also confirms that NVIDIA’s most powerful neural networking hardware will be coming to a wider range of GPUs.
New to Turing is support for a wider range of precisions, and as such the potential for significant speedups in workloads that don't require high precisions. On top of Volta's FP16 precision mode, Turing's tensor cores also support INT8 and even INT4 precisions. These are 2x and 4x faster than FP16 respectively, and while NVIDIA's presentation doesn't dive too deep here, I would imagine they're doing something similar to the data packing they use for low-precision operations on the CUDA cores. And without going too deep ourselves here, while reducing the precision of a neural network has diminishing returns – by INT4 we're down to a total of just 16(!) values – there are certain models that really can get away with this very low level of precision. And as a result the lower precision modes, while not always useful, will undoubtedly make some users quite happy at the throughput, especially in inferencing tasks.
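As a concrete illustration of just how coarse INT4 is, here is a small Python sketch. It is our own toy example using simple symmetric per-tensor quantization, not whatever scheme NVIDIA's libraries actually use, and the function name is hypothetical.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Toy symmetric quantization: snap floats onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax                # one scale factor for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

weights = np.random.randn(10000).astype(np.float32)
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    err = np.abs(weights - q.astype(np.float32) * scale).mean()
    print(f"INT{bits}: {2 ** bits} representable values, mean abs error {err:.4f}")
```

The INT4 grid really does have only 16 values to work with, which is why the rounding error jumps so sharply; models that tolerate it tend to be inference workloads that have been trained or calibrated with quantization in mind.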
Getting back to hybrid rendering in general though, it’s interesting that despite these individual speed-ups, NVIDIA’s overall performance promises aren’t quite as extreme. All told, the company is promising a 6x performance boost versus Pascal, though it doesn’t specify which Pascal parts it is comparing against. Time will tell if even this is a realistic assessment, as even with the RT cores, ray tracing in general is still quite the resource hog.
Meanwhile, to better take advantage of the tensor cores outside of ray tracing and specialty deep learning software, NVIDIA will be rolling out an SDK, NVIDIA NGX, to integrate neural networking into image processing. Details here are sparse, but NVIDIA is envisioning using neural networking and the tensor cores for additional image and video processing, including methods like the upcoming Deep Learning Anti-Aliasing (DLAA).
Turing SM: Dedicated INT Cores, Unified Cache, Variable Rate Shading
Alongside the dedicated RT and tensor cores, the Turing architecture Streaming Multiprocessor (SM) itself is also learning some new tricks. In particular here, it’s inheriting one of Volta’s more novel changes, which saw the Integer cores separated out into their own blocks, as opposed to being a facet of the Floating Point CUDA cores. The advantage here – at least as much as we saw in Volta – is that it speeds up address generation and Fused Multiply Add (FMA) performance, though as with a lot of aspects of Turing, there’s likely more to it (and what it can be used for) than we’re seeing today.
Speaking of ALUs, one thing I'm still waiting to get confirmation on for Turing – though it's something the architecture almost certainly supports – is faster low precision operations (e.g. fast FP16). In Volta this manifested as FP16 operations running at 2x the FP32 rate and INT8 operations at 4x the INT32 rate, and I expect much the same here. Especially as the tensor cores already support this concept, it would be highly unusual not to bring it to the CUDA cores as well.
Fast FP16, rapid packed math, and other means of packing multiple smaller operations into a single larger operation are all key components of improving GPU performance at a time when Moore’s Law is slowing down. By only using data types as large (precise) as necessary, it’s possible to pack them together to get more work done in the same period of time. This in turn is particularly important to neural networking inference, but it is also increasingly important in game development, as not all shader programs require FP32 precision, and cutting down on precision can improve performance and cut down on precious memory bandwidth & register file usage.
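A minimal sketch of the packing idea, using NumPy on the CPU purely for illustration (the actual GPU mechanism is vector FP16 instructions operating on register pairs, which we can't reproduce in Python): two FP16 values occupy the same 32 bits as a single FP32 value, so a packed FP16 operation does twice the work per register and per unit of memory bandwidth, at the cost of roughly three decimal digits of precision.

```python
import numpy as np

a32 = np.array([1.0005, 3.1415927], dtype=np.float32)
a16 = a32.astype(np.float16)          # each value rounds to the nearest half-precision number
print(a32, "->", a16)

packed = a16.view(np.uint32)          # the FP16 pair occupies a single 32-bit word
print(hex(packed[0]))                 # two half-precision bit patterns, back to back

print(a32.nbytes, "bytes as FP32 vs", a16.nbytes, "bytes as FP16 for the same two values")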
The Turing SM also includes what NVIDIA is calling a “unified cache architecture.” As I’m still awaiting official SM diagrams from NVIDIA, it’s not clear if this is the same kind of unification we saw with Volta – where the L1 cache was merged with shared memory – or if NVIDIA has gone one step further. At any rate, NVIDIA is saying that it offers twice the bandwidth of the “previous generation,” though it’s unclear whether that means Pascal or Volta (the latter being more likely).
Finally, also tucked away in the Turing press release is a mention of support for variable rate shading. This is a relatively young, up-and-coming rendering technique about which there's limited information (especially as to how exactly NVIDIA is implementing it). But at a very high level it sounds like the next generation of NVIDIA's multi-res shading technology, which allows developers to render different areas of the screen at different effective resolutions, in order to concentrate quality (and rendering time) in the areas where it's most beneficial.
Feeding the Beast: GDDR6 Support
As the memory used by GPUs is developed by outside companies, there are no big secrets here. JEDEC and its Big 3 memory manufacturers (Samsung, SK Hynix, and Micron) have all been developing GDDR6 as the successor to both GDDR5 and GDDR5X, and NVIDIA has confirmed that Turing will support it. Depending on the manufacturer, first-generation GDDR6 is generally promoted as offering up to 16Gbps per pin of memory bandwidth, which is 2x that of NVIDIA’s late-generation GDDR5 cards, and 40% faster than NVIDIA’s most recent GDDR5X cards.
GPU Memory Math: GDDR6 vs. HBM2 vs. GDDR5X

| | NVIDIA Quadro RTX 8000 (GDDR6) | NVIDIA Quadro RTX 5000 (GDDR6) | NVIDIA Titan V (HBM2) | NVIDIA Titan Xp | NVIDIA GeForce GTX 1080 Ti | NVIDIA GeForce GTX 1080 |
| Total Capacity | 24 GB | 16 GB | 12 GB | 12 GB | 11 GB | 8 GB |
| B/W Per Pin | 14 Gbps | 14 Gbps | 1.7 Gbps | 11.4 Gbps | 11 Gbps | 11 Gbps |
| Chip Capacity | 2 GB (16 Gb) | 2 GB (16 Gb) | 4 GB (32 Gb) | 1 GB (8 Gb) | 1 GB (8 Gb) | 1 GB (8 Gb) |
| No. Chips/KGSDs | 12 | 8 | 3 | 12 | 11 | 8 |
| B/W Per Chip/Stack | 56 GB/s | 56 GB/s | 217.6 GB/s | 45.6 GB/s | 44 GB/s | 44 GB/s |
| Bus Width | 384-bit | 256-bit | 3072-bit | 384-bit | 352-bit | 256-bit |
| Total B/W | 672 GB/s | 448 GB/s | 652.8 GB/s | 547.7 GB/s | 484 GB/s | 352 GB/s |
| DRAM Voltage | 1.35 V | 1.35 V | 1.2 V (?) | 1.35 V | 1.35 V | 1.35 V |
Relative to GDDR5X, GDDR6 is not quite as big of a step up as some past memory generations, as many of GDDR6’s innovations were already baked into GDDR5X. Nonetheless, alongside HBM2 for very high end use cases, it is expected to become the backbone memory of the GPU industry. The principal changes here include lower operating voltages (1.35v), and internally the memory is now divided into two memory channels per chip. For a standard 32-bit wide chip, this means a pair of 16-bit memory channels, for a total of 16 such channels on a 256-bit card. While this means a very large number of channels, GPUs are also well-positioned to take advantage of them, since they are massively parallel devices to begin with.
NVIDIA for their part has confirmed that the first Turing Quadro cards will run their GDDR6 at 14Gbps, which happens to be the fastest speed grade offered by all of the Big 3 manufacturers. That said, NVIDIA has also confirmed that they will be using Samsung’s memory here, specifically Samsung’s cutting-edge 16Gb modules. This is important, as it means that for a typical 256-bit GPU, NVIDIA could outfit the card with the standard 8 modules and get 16GB of total capacity, or even 32GB if they use clamshell mode.
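The napkin math behind the channel and capacity figures in the last two paragraphs, as a quick sketch; the 256-bit card configuration is the hypothetical one from the text above, not a specific announced product.

```python
# Napkin math for a hypothetical 256-bit GDDR6 card using 16Gb (2GB) modules.
bus_width_bits    = 256
chip_width_bits   = 32          # standard GDDR6 device width
channels_per_chip = 2           # GDDR6 splits each chip into two 16-bit channels
module_density_gb = 16 / 8      # 16Gb per module = 2GB

chips     = bus_width_bits // chip_width_bits        # 8 modules
channels  = chips * channels_per_chip                # 16 independent 16-bit channels
capacity  = chips * module_density_gb                # 16 GB in a standard layout
clamshell = capacity * 2                             # 32 GB with two chips per 32-bit slot
print(chips, channels, capacity, clamshell)          # 8 16 16.0 32.0
```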
Odds & Ends: NVLink, VirtualLink, & 8K HEVC
Rounding out the Turing package thus far, NVIDIA has also briefly confirmed some of the external I/O features that the architecture will support. NVLink support will be present on at least some Turing products, and NVIDIA is tapping it for all three of their new Quadro cards, each of which is offered in two-way GPU configurations. I am assuming based on this that we are looking at two NVLinks per board – similar to the Quadro GV100 – though I’m waiting on confirmation of that given NVIDIA’s 100GB/sec transfer number.
One thing I’d like to note here before any of our more gaming-focused audience reads too much into this is that the presence of NVLink in Turing hardware doesn’t mean it’ll be used in consumer parts. Today’s event is all about ProViz, and it would be an entirely NVIDIA thing to do to limit that feature to Quadro and Tesla only. So we’ll see what happens once NVIDIA announces their obligatory consumer cards.
USB Type-C Alternate Modes

| | VirtualLink | DisplayPort (4 Lanes) | DisplayPort (2 Lanes) | Base USB-C |
| Video Bandwidth (Raw) | 32.4Gbps | 32.4Gbps | 16.2Gbps | N/A |
| USB 3.x Data Bandwidth | 10Gbps | N/A | 10Gbps | 10Gbps + 10Gbps |
| High Speed Lane Pairs | 6 | 4 | 4 | 4 |
| Max Power | Mandatory: 15W, Optional: 27W | Optional: Up To 100W | Optional: Up To 100W | Optional: Up To 100W |
Meanwhile gamers and ProViz users alike have something to look forward to for VR, with the addition of VirtualLink support. The USB Type-C alternate mode was announced last month, and supports 15W+ of power, 10Gbps of USB 3.1 Gen 2 data, and 4 lanes of DisplayPort HBR3 video all over a single cable. In other words, it’s a DisplayPort 1.4 connection with extra data and power that is intended to allow a video card to directly drive a VR headset. The standard is backed by NVIDIA, AMD, Oculus, Valve, and Microsoft, so Turing products will be the first of what we expect will ultimately be a number of products supporting the standard.
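For reference, the 32.4Gbps raw video figure in the table above is simply four DisplayPort HBR3 lanes at 8.1Gbps apiece. A quick sketch of that arithmetic; the effective-rate line assumes DisplayPort's usual 8b/10b line coding.

```python
hbr3_gbps_per_lane = 8.1
lanes = 4

raw_gbps = hbr3_gbps_per_lane * lanes        # 32.4 Gbps of raw video bandwidth
effective_gbps = raw_gbps * 8 / 10           # ~25.9 Gbps after 8b/10b line-coding overhead
print(raw_gbps, round(effective_gbps, 2))
```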
Finally, while NVIDIA only briefly touched upon the subject, we do know that their video encoder block, NVENC, has been updated for Turing. The latest iteration of NVENC specifically adds support for 8K HEVC encoding. Meanwhile NVIDIA has also been able to further tune the quality of their encoder, allowing them to achieve similar quality as before with a 25% lower video bitrate.
Performance Numbers
Along with the hardware specifications announced thus far, NVIDIA has also thrown out a handful of performance numbers for Turing hardware. It should be noted that there is a lot more we don’t know here than we do. However at a high level, these appear to be based around a mostly or completely enabled high-end Turing SKU featuring 4608 CUDA cores and 576 tensor cores. Clockspeeds are not disclosed, however as these numbers are profiled against Quadro hardware, we’re likely looking at lower clockspeeds than what we’ll see in any consumer hardware.
NVIDIA Quadro Specification Comparison

| | RTX 8000 | GV100 | P6000 | M6000 |
| CUDA Cores | 4608 | 5120 | 3840 | 3072 |
| Tensor Cores | 576 | 640 | N/A | N/A |
| ROPs | 96? | 128 | 96 | 96 |
| Boost Clock | ~1730MHz? | ~1450MHz | ~1560MHz | ~1140MHz |
| Memory Clock | 14Gbps GDDR6 | 1.7Gbps HBM2 | 9Gbps GDDR5X | 6.6Gbps GDDR5 |
| Memory Bus Width | 384-bit | 4096-bit | 384-bit | 384-bit |
| VRAM | 48GB | 32GB | 24GB | 24GB |
| ECC | ? | Full | Partial | Partial |
| Half Precision | 32 TFLOPs? | 29.6 TFLOPs? | N/A | N/A |
| Single Precision | 16 TFLOPs | 14.8 TFLOPs | 12 TFLOPs | 7 TFLOPs |
| Double Precision | ? | 7.4 TFLOPs | 0.38 TFLOPs | 0.22 TFLOPs |
| Tensor Performance | 500 TOPs (INT4) | 119 TFLOPs (FP16) | N/A | N/A |
| TDP | ? | 250W | 250W | 250W |
| GPU | Unnamed Turing | GV100 | GP102 | GM200 |
| Die Size | 754mm2 | 815mm2 | 471mm2 | 601mm2 |
| Transistor Count | 18.6B | 21.1B | 11.8B | 8B |
| Architecture | Turing | Volta | Pascal | Maxwell 2 |
| Manufacturing Process | TSMC 12nm FFN? | TSMC 12nm FFN | TSMC 16nm | TSMC 28nm |
| Launch Date | Q4 2018 | March 2018 | October 2016 | March 2016 |
Along with the aforementioned 10 GigaRays/sec figure for the RT cores, for the tensor cores NVIDIA is touting 500 trillion tensor operations per second (500 TOPs). For reference, NVIDIA frequently quotes the GV100 GPU as maxing out at 120 TOPs, however the two figures are not directly comparable. Specifically, while the GV100 figure is quoted for FP16 operations, Turing is being quoted with extremely low precision INT4, which just so happens to be one-quarter the size of FP16, and thus 4x the throughput. If we do normalize for precision, then Turing's tensor cores don't appear to have dramatically better throughput per core; rather, they offer more precision options than Volta did. At any rate, the 576 tensor cores in this chip come quite close to the 640 offered by GV100, but at the end of the day it's still a lower count.
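For those keeping score, the precision normalization works out like this; a sketch using the article's own quoted figures.

```python
turing_int4_tops  = 500      # NVIDIA's quoted figure for the top Turing part, at INT4
gv100_fp16_tflops = 120      # NVIDIA's usual figure for GV100, at FP16

# INT4 operands are one-quarter the width of FP16, so throughput scales roughly 4x.
turing_fp16_equivalent = turing_int4_tops / 4
print(turing_fp16_equivalent)   # 125.0 -- in the same ballpark as GV100's 120,
                                # despite having 576 tensor cores to GV100's 640
```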
As for the CUDA cores, NVIDIA is saying that the Turing GPU can offer 16 TFLOPS of performance. This is slightly ahead of the 15 TFLOPS single precision performance of the Tesla V100, or even a bit farther ahead of the 13.8 TFLOPS of the Titan V. Or if you’re looking for a more consumer-focused reference, it’s about 32% more than the Titan Xp. Some quick paper napkin math with these figures would put the GPU clockspeed at around 1730MHz, assuming there have been no other changes at the SM level which would throw off the traditional ALU throughput formulas.
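That napkin math, spelled out as a sketch; it assumes the usual 2 FLOPS per CUDA core per clock from fused multiply-adds and no other SM-level changes.

```python
cuda_cores  = 4608
fp32_tflops = 16

clock_hz = fp32_tflops * 1e12 / (cuda_cores * 2)   # FLOPS = cores x 2 (FMA) x clock
print(round(clock_hz / 1e6))                       # ~1736 MHz
```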
Meanwhile NVIDIA has stated that the Quadro cards will come with GDDR6 running at 14Gbps. And with the top two Quadro SKUs offering 48GB and 24GB of GDDR6 respectively, we are almost certainly looking at a 384-bit memory bus for this Turing GPU. By the numbers, this adds up to 672 GB/sec of memory bandwidth for both of the top two Quadro cards.
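And the corresponding memory bandwidth arithmetic, as a quick sketch using the figures above.

```python
bus_width_bits = 384
gbps_per_pin   = 14

total_gb_per_s = bus_width_bits * gbps_per_pin / 8   # Gb/s across the bus, divided by 8
print(total_gb_per_s)                                # 672.0 GB/s for both top Quadro SKUs
```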
Otherwise with the architectural shift, it’s difficult to make too many useful performance comparisons, especially against Pascal. From what we’ve seen with Volta, NVIDIA’s overall efficiency has gone up, especially in well-crafted compute workloads. So the roughly 33% improvement in on-paper compute throughput versus the Quadro P6000 may very well be underestimating things. As for consumer product speculation, I’ll hold off on that entirely.
I'll also quickly touch upon the die size of this new GPU. At 754mm2 it's not just large, but it's huge. Compared to other GPUs it's second in size only to NVIDIA's GV100, which for now at least remains the NVIDIA flagship in certain respects. And with 18.6 billion transistors, it's easy to see why the resulting chip needed to be so big. Pardon the pun, but it's clear that NVIDIA has some very big plans for this GPU, and enough so that they can justify having two GPUs in their product stack that are so enormously large.
NVIDIA for their part hasn't stated the specific model number of this GPU – if it's the traditional 102-class GPU, or even the 100-class GPU. The sheer size of the GPU has me leaning towards the latter, but the matter is somewhat arbitrary at this point since the only thing even remotely comparable is GV100. Either way, it does have me wondering whether we're going to see this GPU filter down to consumer products in some form; it's so large that NVIDIA may intend to keep it for their more profitable Quadro and Tesla GPUs. In which case this specific GPU won't be what eventually hits consumer cards.
Coming in Q4 2018, If Not Sooner
Wrapping things up, alongside the Turing architecture announcement, NVIDIA has announced that the first 3 Quadro cards based on Turing GPUs – the Quadro RTX 8000, RTX 6000, and RTX 5000 – will be shipping in Q4 of this year. As the very nature of this reveal is somewhat inverted – normally NVIDIA announces consumer parts first – I wouldn’t necessarily apply that same timeline to consumer cards, which don’t face quite as stringent validation requirements. Still, this means we’ll see Turing hardware in Q4 of this year, if not sooner. Interested Quadro buyers will want to start saving their pennies now: a top-tier Quadro RTX 8000 will set you back a cool $10,000.
Finally, as for NVIDIA’s Tesla customers, the Turing launch leaves Volta in a state of flux. NVIDIA has not told us whether Turing will eventually expand into the high-end Tesla space – replacing the GV100 – or if their one-off Volta part will remain the master of its domain for a generation. However as the other Tesla cards have been Pascal-powered thus far, they are very clear candidates for a Turing treatment in 2019.
83 Comments
Alistair - Tuesday, August 14, 2018 - link
You didn't understand my comment. I meant if nVidia has a choice of including Tensor and RT cores or not, adding them won't improve performance with current titles so it's a lot of extra money without payoff. I was hoping to see Quadro cards without RT capabilities announced. Memory and standard CUDA cores only. We can expect the Geforce cards to be almost identical but 4-6 times cheaper, just like with P5000 and P6000 vs GTX 1080 and Titan XP.
Dr. Swag - Tuesday, August 14, 2018 - link
You said "Downside is so much money and die area spent without increasing performance in existing titles and engines. I'm not sure I want to pay double without seeing double performance in my existing titles, but today is marking the beginning of the transition to ray-tracing it seems."This implies to me that you're seeing these as gaming cards. Especially around parts like "I'm not sure I want to pay double without seeing double performance in my existing titles."
The RT stuff is meant for workstations. That's the whole point. If you want stuff without it... Then you've got yourself a geforce card, lol. Or maybe even the regular geforce cards will have RT tracing hardware on it, though if they do I'd expect for there to be less of it.
Alistair - Tuesday, August 14, 2018 - link
The Quadro cards are usually exactly like the gaming cards. If you don't have reading comprehension, can't have a conversation...
Trackster11230 - Tuesday, August 14, 2018 - link
That's historically true, but maybe we'll see a shift here? Either way, I agree. That's money the company spent on areas that aren't gaming, but the GPU market as a whole is growing in several directions so I can understand their choices. They have to appeal to as many clients as possible, which means any one market segment likely won't get full resource utilization.
Dr. Swag - Thursday, August 16, 2018 - link
Yes but you are talking about these gpus being multiple times more expensive than current geforce gpus for much less than 2x gain in performance in games. Yes, that's true, but geforce cards will be released at similar price points to current gpus, making the "too expensive to justify the extra gaming performance" point invalid. These prices are for quadros, not geforce cards.
Yojimbo - Monday, August 13, 2018 - link
They are getting less transistors per unit area than for Volta, so I imagine it's on the 12 FFN process. The density increase of 7 nm should dominate any architectural changes that would reduce transistor density. I'm surprised they are coming out with a generation of 12 FFN GPUs now when it seems 7 nm should be ready by mid 2019.
Judging by Volta, Turing probably gets its performance advantage over Pascal mostly by being more energy efficient. So a larger die size (and/or a higher clock) is necessary to obtain a performance gain.
Interestingly, trying to estimate by the number of CUDA cores (assuming the top end Quadro part isn't using a significantly cut down chip), it seems like there are FP64 units on this GPU, unless the RT cores and other architectural improvements over Volta are taking up a whole lot of transistors.
Sales to hyperscalers should take precedence over sales for proviz, so the fact that these Quadros will be available soon seems to suggest that a V100 replacement isn't forthcoming based on this GPU, unless NVIDIA has been selling such a card to hyperscalers in stealth mode. Of course, backing this up is that fact that this is a GDDR6 part with lower memory bandwidth than the GV100 and the fact that other than the greater precision options in the tensor cores, this GPU doesn't offer much of an advantage over the V100 for data center workloads. Well, it's interesting then that it SEEMS to have double precision. Maybe it doesn't and those RT cores really do take up a large number of transistors.
DanNeely - Monday, August 13, 2018 - link
You've answered your own question. Launching on 12nm now gets them a new product now, a year after their previous refresh, and with major feature enhancements on the pro-side of the house an entire year sooner than if they waited for 7 nm. If the rumor mill is to be trusted, they did delay this generation a few months to empty out their retail channel of last gen parts after the crypto bubble popped; but going an entire year without a product refresh when they've got one would be insane.
Yojimbo - Tuesday, August 14, 2018 - link
DanNeely, I believe it costs a lot to prepare a design for high volume manufacturing. They aren't sitting on such a thing if they never make the preparations to begin with. Right now they have little competition, and they won't have competition until 2019. Even though Pascal has been out over two years, gamer market share uptake of Pascal GPUs has been low due to cryptocurrency. Those are reasons why NVIDIA could get away with not releasing a new generation until 2019.
In 2019, however, NVIDIA will have competition and it will be on 7 nm. It's not cheap to do a die shrink (or to come out with a new generation) so if 12 FFN Turing is only at the top of the market for a year that will eat into their margins. Those are reasons why NVIDIA would want to wait until 2019 to release their new generation on 7 nm.
Possibly NVIDIA believe that people won't be willing to purchase 2+ year old GPUs even though those GPUs are significantly faster than what the people currently have and there isn't anything better available than those 2+ yo GPUs on the market. Another possibility is that NVIDIA want to get the RTX hardware into the market as soon as possible in order to push the technology forward for future margins, giving up margins in the near term. The greater the percentage of RTX-capable cards in the market and the sooner they are released the more and sooner developers will design and build games using these technologies, and so the greater the demand for GPUs with these technologies will be in the future.
abufrejoval - Wednesday, August 15, 2018 - link
I think Nvidia is more concerned about keeping concept leadership for GPUs. Essentially GPU is somewhere between commodity (like RAM) and proprietary (CPU) and AMD is just too uncomfortably close and Intel soon threatening to do something similar.
So Nvidia cannot wait to have the next process size ready. They need to push a product to establish their "brand mixed bag of functionalities hitherto known as premium GPU", or risk losing that to someone else. AMD beefed up the compute part of their design because they understood the war will no longer be fought in games alone--perhaps even too much to cut it in games--even if the miners appreciated it for a while.
But that also highlights the risk: If you get the mixture wrong, if some specialty market like VR doesn't develop the way you anticipated it, you have a dud or lower margin product.
So they are seeding ray-tracing and putting stakes into the ground to claim the future GPU field. And you could argue that cheaper production of almost realistic animations could be as big or bigger than blockchain mining.
I am convinced they are thinking several generations ahead and it's becoming ever more challenging to find white spots on the map of compute that will last long enough for ROI.
Yojimbo - Wednesday, August 15, 2018 - link
NVIDIA's gross margins are better than Intel's now. There's nothing commodity-like about GPUs. Maybe what you are referring to is that Intel and AMD control the x86 instruction set whereas from a gaming point of view GPUs operate through APIs. But still NVIDIA achieve great margins on their gaming products because of superior research and development, which is an antithetical situation to a commodity market. As far as finding "white spots" in the market overall, there is a lot more room for innovation in GPUs than in CPUs. That is why CPUs have had very little increase in performance or abilities over the past 10 years while GPU performance and abilities have been bounding upwards. That trend will continue; there are plenty of legs left in GPU architectural innovation.