The Bulldozer Aftermath: Delving Even Deeperby Johan De Gelas on May 30, 2012 1:15 AM EST
The Front End: Branch Prediction
Bulldozer's branch prediction units have been described in many articles. Most insiders agree that Bulldozer's decoupled branch predictor is a step forward from the K10's multi-level predictor. A better predictor might reduce the branch misprediction rate from 5 to 4%, but that is not the end of the story. Here's a quick rundown of the branch prediction capabilities of various CPU architectures.
|Architecture||Branch Misprediction Penalty|
|AMD K10 (Barcelona, Magny-Cours)||12 cycles|
|AMD Bulldozer||20 cycles|
|Pentium 4 (NetBurst)||20 cycles|
|Core 2 (Conroe, Penryn)||15 cycles|
|Sandy Bridge||14-17 cycles|
The numbers above show the minimum branch misprediction penalty, and the fact is that the Bulldozer architecture has a branch misprediction penalty that is 66% higher than the previous generation. That means that the branch prediction of Bulldozer must correctly predict 40% of the pesky branches that were mispredicted by the K10 to compensate (at the same clock). Unfortunately, that kind of massive branch prediction improvement is almost impossible to achieve.
Quite a few people have commented that Bulldozer is AMD's version of Intel's Pentium 4: it has a long pipeline, with high branch misprediction penalties, and it's built for high clock speeds that it cannot achieve. The table above seems to reinforce that impression, but the resemblance between Bulldozer and NetBurst is very superficial.
The minimum branch prediction penalty of the Bulldozer chip is indeed in the same range as Pentium 4. However, the maximum penalty could be a horrifying 100 cycles or more on the P4, while it's a lot lower on Bulldozer. In most common scenarios, the Bulldozer's branch misprediction penalty will be below 30 cycles.
Secondly, the Pentium 4's pipeline was 28 ("Willamette") to 39 ("Prescott") cycles. Bulldozer's pipeline is deep, but it's not that deep. The exact number is not known, but it's in the lower twenties. Really, Bulldozer's pipeline length is not that much higher than Intel's Nehalem or Sandy Bridge architectures (around 16 to 19 stages). The big difference is that the introduction of the µop cache (about 6KB) in Sandy Bridge can reduce the typical branch misprediction to 14 cycles. Only when the instruction is not found in the µop cache and must be fetched from the L1 data cache will the branch misprediction penalty increase to about 17 cycles. So on average, even if the efficiency of Bulldozer's and Sandy Bridge's branch predictors is more or less the same, Sandy Bridge will suffer a lot less from mispredictions.
The Front End: Shared Decoders
Quite a few reviewers, including our own Anand, have pointed out that two integer cores in Bulldozer share four decoders, while two integer cores in the older “K10” architecture each get three decoders. Two K10 cores thus have six decoders, while two Bulldozer cores only have four. Considering that the complexity of the x86 ISA leads to power hungry decoders, reducing the power by roughly 1/3 (e.g. four decoders instead of six for dual-core) with a small single-threaded performance hit is a good trade off if you want to fit 16 of these integer cores in a power envelope of 115W. Instead of 48 decoders, Bulldozer tries to get by with just 32.
The single-threaded performance disadvantage of sharing four decoders between two integer cores could have been lessened somewhat by x86 fusion (test + jump and CMP + jump; Intel calls this macro-op fusion) in the pre-decoding stages. Intel first introduced this with their “Core” architecture back in 2006. If you are confused by macro-ops and micro-ops fusion, take a look here.
However AMD decided to introduce this kind of fusion in Bulldozer later in the decoding pipeline than Intel, where x86 branch fusion is already present in the predecoding phases. The result is that the decoding bandwidth of all Intel CPUs since Nehalem has been up to five (!) x86-64 instructions, while x86 branch fusion does not increase the maximum decode rate of a Bulldozer module.
This is no trifle, as on average this kind of x86 fusion can happen once every ten x86 instructions. So why did AMD let this chance to improve the effective decoding rate pass even if that meant creating a bottleneck in some applications? The most likely reason is that doing this prior to decoding increases the complexity of the chip, and thus the power consumption. Even if AMD's version of x86 branch fusion does not increase the decoding bandwidth, it still offers advantages:
- Increased dispatch bandwidth
- Reduced scheduler queue occupancy
- Faster branch misprediction recovery
The first two increase performance without any extra (or very minimal) power consumption, the last one increases performance and reduces power consumption. AMD preferred to get more cores in the same power envelop over higher decode bandwidth and thus single-threaded performance.
Mark of Hardware.fr compared the performance of a four module CPU with only one core per module enabled with the standard configuration (two integer cores per module). Lightly threaded games were 3-5% faster, which is the first indication that the front end might be something of a bottleneck for some high IPC workloads, but not a big one.
The Memory Subsystem
One of the most important features of Intel's Core architecture was its speculative out-of-order memory pipeline. It gave the Core architecture a massive improvement in many integer benchmarks over the K8, which had a strictly in order pipeline. Barcelona improved this a bit by bringing the K10 to the level of the much older PIII architecture: out of order, but not speculative. Bulldozer now finally has memory disambiguation, a feature which Intel introduced in 2006 in their Core architecture, but there's more to the story.
Bulldozer can have up to 33% more memory instructions in flight, and each module (two integer stores) can do four load/stores per cycle. It's clear that AMD’s engineers have invested heavily in Memory Level Parallelism (MLP). Considering that MLP is often the most important bottleneck in server workloads, this is yet another sign that Bulldozer is targeted at the server world. In this particular area of its architecture, Bulldozer can even beat the Westmere Intel CPUs: two threads on top of the current Intel architecture have only 2 load/stores available and have to share the L1 data cache bandwidth.
The memory controller is improved too, as you can see in the stream benchmark results below. For details about our Stream binary, check here.
Bandwidth is 25% higher than Barcelona, while the clock speed of the RAM modules has only increased by 20%. Clearly, the Orochi die of the Opteron 6200 has a better memory controller than the Opteron 6100. The memory controller and load/store units are among the strongest parts of the Bulldozer architecture.
Post Your CommentPlease log in or sign up to comment.
View All Comments
thunderising - Wednesday, May 30, 2012 - linkGlad AMD has "Greater Performance" planned sometime in the future. Wow!
themossie - Wednesday, May 30, 2012 - link"There are also other factors at play, though, as it's already known that StarCraft II doesn't use more than two cores; theinstead, it's likely the..."
(feel free to remove comment after fixing this)
Nightraptor - Wednesday, May 30, 2012 - linkI am wondering if it would be possible to compare the processor performance of the Trinity A10 with a underclocked FX-4100 set to the same frequencies (I don't know if it is possible to disable the L3 cache on the FX-4100). This might give us a rough idea of how much the improvements of Piledriver have bought us. Just doing rough math in my head it would seem that they have to be pretty significant given how a FX-4100 compared to the Phenom II X4's (it lost alot, if not most of the time t of the time). The new A10 Trinity's on the other hand seem to win most of the time compared to the old architecture. Given that the A10 is a Piledriver based FX-4XXX series equivalent minus the L3 cache it would seem that Piledriver brought very significant enhancements. Either that or the Phenom II era processors responded much more poorly to the lack of L3 than Piledriver does.
coder543 - Wednesday, May 30, 2012 - linkI was hoping they would be doing the same thing, even though it would be challenging to draw real information out of comparing a desktop processor and a mobile processor.
SleepyFE - Wednesday, May 30, 2012 - link+1
kyuu - Wednesday, May 30, 2012 - link+2
If you can figure out some way to do a comparison and analysis of Piledriver's performance vs. Bulldozer, I think a great many of us are interested to see that. From benchmarks, it seems like Piledriver improved a great deal over Bulldozer, but it's difficult to tell without being able to compare two similar processors.
Aone - Wednesday, May 30, 2012 - linkYou can compare A10-4600M or A8-4500M versus mobile Llano or Phenom or even Turion to see tweaked BD is nothing of spectacular. For instance, in most cases A8-4500M (2.3GHz base) loses to Llano A8-3500M (1.5GHz base).
Nightraptor - Wednesday, May 30, 2012 - linkWhere are you getting the benchmarks from that a Trinity loses to Llano. In almost all the benchmarks I have been able to find (with the exception of a few) it seems that Trinity beats Llano, hence the original post. If the Piledriver enhancements were very minor I would've expected Trinity (a hacked quad core) to lose to Llano most of the time (a true quad core). This didn't appear to happen - at least not in the anandtech review.
Aone - Wednesday, May 30, 2012 - link"http://www.notebookcheck.net/AMD-A-Series-A8-4500M...
Look at "Show comparison chart". Great info!
Nightraptor - Thursday, May 31, 2012 - linkI'm not a big fan of the reliability of that website - They tend to be pretty scant on the test circumstances and configurations. Furthermore I'm curious where they are getting the informaiton for the A8-4500 as to the best of my knowledge the only Trinity in the wild at the moment is the A10 which AMD sent out in a custom made review laptop. All they list is a "K75D Sample".