As someone who analyzes GPUs for a living, I have found NVIDIA’s Maxwell architecture to be one of the more vexing things in my life. The company’s 28nm refresh offered a huge performance-per-watt increase for only a modest die size increase, essentially allowing NVIDIA to offer a full generation’s performance improvement without a corresponding manufacturing improvement. We’ve had architectural updates on the same node before, but never anything quite like Maxwell.

The vexing aspect to me has been that while NVIDIA shared some details about how they improved Maxwell’s efficiency over Kepler, they have never disclosed all of the major improvements under the hood. We know, for example, that Maxwell implemented a significantly altered SM structure that was easier to reach peak utilization on, and thanks to its partitioning wasted much less power on interconnects. We also know that NVIDIA significantly increased the L2 cache size and did a number of low-level (transistor level) optimizations to the design. But NVIDIA has also held back information – the technical advantages that are their secret sauce – so I’ve never had a complete picture of how Maxwell compares to Kepler.

For a while now, a number of people have suspected that one of the ingredients of that secret sauce was that NVIDIA had applied some mobile power efficiency technologies to Maxwell. It was, after all, their original mobile-first GPU architecture, and now we have some data to back that up. Friend of AnandTech and all-around tech guru David Kanter of Real World Tech has gone digging through Maxwell/Pascal, and in an article & video published this morning, he outlines how he has uncovered very convincing evidence that NVIDIA implemented a tile-based rendering system with Maxwell.

In short, by playing around with some DirectX code specifically designed to look at triangle rasterization, he has come up with some solid evidence that NVIDIA’s handling of triangles has significantly changed since Kepler, and that their current method of triangle handling is consistent with a tile-based renderer.

NVIDIA Maxwell Architecture Rasterization Tiling Pattern (Image Courtesy: Real World Tech)

Tile-based rendering is something we’ve seen for some time in the mobile space, with both Imagination PowerVR and ARM Mali implementing it. The significance of tiling is that by splitting a scene up into tiles, the GPU can rasterize the scene tile by tile almost entirely on die, as opposed to the more memory- (and power-) intensive process of rasterizing the entire frame at once via immediate mode rendering. The trade-off with tiling, and why it’s a bit surprising to see it here, is that the PC legacy is immediate mode rendering, and this is still how most applications expect PC GPUs to work. So implementing tile-based rasterization on Maxwell means that NVIDIA has found a practical means to overcome the drawbacks of the method and the potential compatibility issues.
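To make the idea of splitting a scene into tiles a little more concrete, here is a minimal Python sketch of the binning step that tile-based renderers generally rely on: each triangle is assigned to every screen tile its bounding box touches, so that each tile can later be rasterized on its own with a small working set. This is purely illustrative and is not a description of what NVIDIA's hardware does; the function name `bin_triangles`, the 16-pixel tile size, and the bounding-box test are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tile edge length in pixels. The article does not state
# the tile size Maxwell uses; 16 is chosen only for illustration.
TILE = 16


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of the triangles whose
    screen-space bounding box overlaps that tile.

    `triangles` is a list of triangles, each a list of three (x, y)
    vertex positions in pixels.
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Convert the bounding box to tile coordinates, clamped to the screen.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        # Triangles entirely off screen produce empty ranges and are skipped.
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


if __name__ == "__main__":
    scene = [
        [(2, 2), (10, 2), (2, 10)],     # fits inside a single tile
        [(20, 4), (40, 4), (20, 12)],   # spans two tiles horizontally
    ]
    for tile_xy, tri_ids in sorted(bin_triangles(scene, 64, 32).items()):
        print(tile_xy, tri_ids)
```

A renderer built along these lines would then walk the bins one tile at a time, which is what allows the color and depth data for a tile to stay on die until the tile is finished. The bounding-box test is deliberately conservative: it can list a triangle in a tile it does not actually cover, which costs some wasted work but never drops geometry.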

In any case, Real World Tech’s article goes into greater detail about what’s going on, so I won’t spoil it further. But with this information in hand, we now have a more complete picture of how Maxwell (and Pascal) work, and consequently how NVIDIA was able to improve over Kepler by so much. Finally, at this point in time Real World Tech believes that NVIDIA is the only PC GPU manufacturer to use tile-based rasterization, which also helps to explain some of NVIDIA’s current advantages over Intel’s and AMD’s GPU architectures, and gives us an idea of what we may see them do in the future.

Source: Real World Tech

  • J0hnnyBGood - Monday, August 1, 2016 - link

    Given that hardware development takes a long time, the recently reduced R&D budget will bite them in 2 to 3 years.
  • filenotfound - Monday, August 1, 2016 - link

    Nvidia had relatively weak performance in DX12 asynchronous compute, especially on Maxwell.
    Is this a direct impact of choosing "tile based rasterization"?
    Or not at all?
  • Yojimbo - Monday, August 1, 2016 - link

    My uneducated guess is that is a shader scheduling issue, not a rasterization issue. NVIDIA used an inflexible scheduling method for mixed (graphics/compute) workloads in Maxwell. Pascal uses a better method that allows for dynamic balancing. The reason Polaris gets a larger speed boost than Pascal with asynchronous compute enabled compared with async disabled is probably because there's more 'air' in AMD's pipelines and so more resources available for asynchronous compute to take advantage of. In other words AMD's architecture is utilized less efficiently to begin with so more efficiency gain is available to be realized through asynchronous compute.
  • Scali - Monday, August 1, 2016 - link

    "NVIDIA used an inflexible scheduling method for mixed (graphics/compute) workloads in Maxwell."

    That is correct. When you run graphics and compute together, Maxwell splits up into two 'partitions', allocating some SMs to each partition.
    As long as you balance your workload so that both the graphics and compute work complete at around the same time, this can work nicely.
    However, since it cannot repartition 'on the fly', some SMs will sit idle once their work is done, until the other SMs have completed as well, and the GPU can schedule new work/repartition the SMs.

    So in theory, you can get gains from async compute on Maxwell. In practice it's even more difficult to tune for performance than GCN and Pascal already are.
  • killeak - Monday, August 1, 2016 - link

    The truth is that async compute makes it possible to use otherwise idle compute units, and GCN has a lot more ALU power than nVidia (Maxwell), but its normal utilization is much lower. That's why a Fury X (8.60 TFLOPS) competes with the 980 Ti (5.63 TFLOPS), and the 290X (also 5.63 TFLOPS) with the 980 (4.61 TFLOPS).

    So, even if Maxwell could execute mixed compute and graphics wavefronts like GCN does, the amount of unused ALU power is less.

    The fact that nVidia uses 32-thread wavefronts vs. 64 in GCN is part of the reason why nVidia keeps its units busier.

    Rasterization has nothing to do with ASync compute.
  • J0hnnyBGood - Monday, August 1, 2016 - link

    As I understand it, the ROPs are fixed function while compute runs on the shaders, so it shouldn't be related.
  • Scali - Monday, August 1, 2016 - link

    Intel has historically used a form of tile-based rendering as well, or 'Zone rendering', as they call it:
    Not sure if they still do, but I don't see why not. iGPUs are generally more bandwidth-restricted than dGPUs, so it makes even more sense there.
  • Yojimbo - Monday, August 1, 2016 - link

    Yeah I wish that video compared more architectures than just Maxwell and TerraScale 2.
  • Yojimbo - Monday, August 1, 2016 - link

    Or rather, Maxwell, Pascal, and TerraScale 2.
  • J0hnnyBGood - Monday, August 1, 2016 - link

    Back when that PDF was up to date, Intel licensed PowerVR and didn't make their own GPUs.
