Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
One of the stories around AMD’s initial generations of Zen processors was the effect of Simultaneous Multi-Threading (SMT) on performance. With this mode enabled, as is the default in most situations, users saw significant performance gains in workloads that could take advantage of it. There are two competing explanations for this increase: either the core is designed in a way that leaves it underutilized by a single thread, or the SMT strategy has been built efficiently enough to extract extra performance. In this review, we take a look at AMD’s latest Zen 3 architecture to observe the benefits of SMT.
What is Simultaneous Multi-Threading (SMT)?
We often consider each CPU core as being able to process one stream of serial instructions for whatever program is being run. Simultaneous Multi-Threading, or SMT, enables a processor to run two concurrent streams of instructions on the same processor core, sharing resources and filling potential downtime on one set of instructions by having a secondary set come in and take advantage of the underutilization. The two main limiting factors in most computing models are compute throughput and memory latency, and SMT is designed to interleave sets of instructions to optimize compute throughput while hiding memory latency.
An old slide from Intel, which has its own marketing term for SMT: Hyper-Threading
When SMT is enabled, depending on the processor, it will allow two, four, or eight threads to run on that core (we have seen some esoteric compute-in-memory solutions with 24 threads per core). Instructions from any thread are rearranged to be processed in the same cycle and keep utilization of the core resources high. Because multiple threads are used, this is known as extracting thread-level parallelism (TLP) from a workload, whereas a single thread with instructions that can run concurrently is instruction-level parallelism (ILP).
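To make the interleaving idea concrete, here is a minimal toy model, not a description of any real core. It assumes a single-issue core, a fixed memory latency, and a fixed-priority pick between threads; the function name `run` and the `"C"`/`"M"` encoding are purely illustrative.

```python
from collections import deque


def run(threads, mem_latency=3):
    """Toy single-issue core that issues at most one op per cycle.

    threads: list of strings where 'C' is a 1-cycle compute op and 'M' is a
    load that blocks its own thread for mem_latency further cycles.
    Returns (total_cycles, busy_cycles). All parameters are hypothetical.
    """
    queues = [deque(t) for t in threads]
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    cycle = busy = 0
    while any(queues):
        # Fixed priority: issue from the first thread that is ready.
        for i, q in enumerate(queues):
            if q and ready_at[i] <= cycle:
                if q.popleft() == "M":
                    ready_at[i] = cycle + 1 + mem_latency
                busy += 1
                break
        cycle += 1
    # Count any load still outstanding when the queues drain.
    return max(cycle, max(ready_at)), busy


stream = "CMCMC"
alone_cycles, alone_busy = run([stream])
smt_cycles, smt_busy = run([stream, stream])
print("one thread :", alone_cycles, "cycles,", alone_busy, "busy")
print("two threads:", smt_cycles, "cycles,", smt_busy, "busy")
```

In this sketch the second thread issues while the first is waiting on memory, so two streams finish in far fewer cycles than running them back to back. Real cores are wide, out-of-order, and share many more structures, so the numbers here only illustrate the direction of the effect.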
Is SMT A Good Thing?
It depends on who you ask.
SMT2 (two threads per core) involves creating core structures sufficient to hold and manage two instruction streams, as well as managing how those core structures share resources. For example, suppose one particular buffer in your core design is meant to handle up to 64 instructions in a queue. If the average occupancy is lower than that (such as 40), then the buffer is underutilized, and an SMT design will keep the buffer fed, on average, closer to the top. That buffer might be increased to 96 instructions in the design to account for this, ensuring that if both instruction streams are running at an ‘average’, then both will have sufficient headroom. This means two threads’ worth of use, for only 1.5 times the buffer size. If all else works out, then it is double the performance for less than double the core design in design area. But in ST mode, where that 96-entry buffer is on average only around 40% filled, the whole buffer has to be powered on all the time, so it might be wasting power.
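The buffer example above can be worked through numerically. This is a small sketch using only the figures from the paragraph (64 and 96 entries, an average of 40 per thread); the helper name `occupancy` is an illustrative assumption.

```python
def occupancy(entries, avg_per_thread, threads):
    """Average fraction of a shared buffer in use (hypothetical helper)."""
    return threads * avg_per_thread / entries


st_small = occupancy(64, 40, 1)  # original 64-entry buffer, one thread
smt_big = occupancy(96, 40, 2)   # enlarged 96-entry buffer, two threads
st_big = occupancy(96, 40, 1)    # enlarged buffer, but only one thread

print(f"64-entry, 1 thread : {st_small:.1%}")
print(f"96-entry, 2 threads: {smt_big:.1%}")
print(f"96-entry, 1 thread : {st_big:.1%}")
print(f"buffer growth      : {96 / 64:.2f}x for 2x the threads")
```

The enlarged buffer is better utilized with two threads than the original was with one, while the same enlarged buffer running a single thread sits mostly empty, which is the power concern the paragraph raises.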
But, if a core design benefits from SMT, then perhaps the core hasn’t been designed optimally for single-thread performance in the first place. If enabling SMT gives a user exactly double the performance and perfect scaling across the board, as if there were two cores, then perhaps there is a direct issue with how the core is designed, from execution units to buffers to cache hierarchy. Users have been known to complain that they only get a 5-10% gain in performance with SMT enabled, stating it doesn't work properly - this could just be because the core is designed better for ST. Similarly, stating that a +70% performance gain means that SMT is working well could be more of a signal of an unbalanced core design that wastes power.
This is the dichotomy of Simultaneous Multi-Threading. If it works well, then a user gets extra performance. But if it works too well, perhaps this is indicative of a core not suited to a particular workload. The answer to the question ‘Is SMT a good thing?’ is more complicated than it appears at first glance.
We can split up the systems that use SMT:
- High-performance x86 from Intel
- High-performance x86 from AMD
- High-performance POWER/z from IBM
- Some High-Performance Arm-based designs
- High-Performance Compute-In-Memory Designs
- High-Performance AI Hardware
Comparing to those that do not:
- High-efficiency x86 from Intel
- All smartphone-class Arm processors
- Successful High-Performance Arm-based designs
- Highly focused HPC workloads on x86 with compute bottlenecks
(Note that Intel calls its SMT implementation ‘HyperThreading’, which is a marketing term specifically for Intel).
At this point, we've only been discussing SMT where we have two threads per core, known as SMT2. Some of the more esoteric hardware designs go beyond two threads-per-core based SMT, and use up to eight. You will see this stylized in documentation as SMT8, compared to SMT2 or SMT4. This is how IBM approaches some of its designs. Some compute-in-memory applications go as far as SMT24!!
There is a clear trend between SMT-enabled systems and no-SMT systems, and that seems to be the marker of high-performance. The one exception to that is the recent Apple M1 processor and the Firestorm cores.
It should be noted that for systems that do support SMT, it can be disabled to force it down to one thread per core, to run in SMT1 mode. This has a few major benefits:
It enables each thread to have access to a full core worth of resources. In some workload situations, having two threads on the same core will mean sharing of resources, and cause additional unintended latency, which may be important for latency critical workloads where deterministic (the same) performance is required. It also reduces the number of threads competing for L3 capacity, should that be a limiting factor. Also, should any software be required to probe every other thread for data, for a 16-core processor like the 5950X that means only reaching out to 15 other threads rather than 31 other threads, reducing potential crosstalk limited by core-to-core connectivity.
The other aspect is power. With a single thread on a core and no other thread to jump in if resources are underutilized, when there is a delay caused by pulling something from main memory, then the power of the core would be lower, providing budget for other cores to ramp up in frequency. This is a bit of a double-edged sword if the core is still at a high voltage while waiting for data in an SMT disabled mode. SMT in this way can help improve performance per Watt, assuming that enabling SMT doesn’t cause competition for resources and arguably longer stalls waiting for data.
Mission critical enterprise workloads that require deterministic performance, and some HPC codes that require large amounts of memory per thread often disable SMT on their deployed systems. Consumer workloads are often not as critical (at least in terms of scale and $$$), and so the topic isn’t often covered in detail.
Most modern processors, when in SMT-enabled mode but running a single instruction stream on a core, will operate as if in SMT-off mode and give that stream full access to resources. Some software takes advantage of this, spawning only one thread for each physical core on the system. Because core structures can be dynamically partitioned (adjusts resources for each thread while threads are in progress) or statically shared (adjusts before a workload starts), situations where the two threads on a core are creating their own bottleneck would benefit from having only a single thread per core active. Knowing how a workload uses a core can help when writing software intended to make use of multiple cores.
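As a rough illustration of the "one thread per physical core" approach, the sketch below derives a worker count from the logical CPU count. Python's standard library reports logical CPUs only, so `threads_per_core` is an assumption the caller must supply (2 for an SMT2 part with SMT enabled, 1 with SMT disabled); the function name is hypothetical.

```python
import os


def one_worker_per_core(logical_cpus, threads_per_core=2):
    """Worker count for one software thread per physical core.

    Assumes every core exposes the same number of hardware threads;
    threads_per_core is supplied by the caller, not detected.
    """
    return max(1, logical_cpus // threads_per_core)


# os.cpu_count() returns logical CPUs and may return None on some platforms.
logical = os.cpu_count() or 1
print("logical CPUs:", logical)
print("workers (assuming SMT2 enabled):", one_worker_per_core(logical))
```

The resulting count could be passed as the size of a worker pool. Note that this only limits how many workers exist; whether the operating system places them on separate physical cores depends on its scheduler or on explicit affinity settings, which this sketch does not attempt.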
Here is an example of a Zen3 core, showing all the structures. One of the progress points with every new generation of hardware is to reduce the number of statically allocated structures within a core, as dynamic structures often give the best flexibility and peak performance. In the case of Zen3, only three structures are still statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen2.
SMT on AMD Zen3 and Ryzen 5000
Much like AMD’s previous Zen-based processors, the Ryzen 5000 series that uses Zen3 cores also has an SMT2 design. By default this is enabled in every consumer BIOS; however, users can choose to disable it through the firmware options.
For this article, we have run our AMD Ryzen 9 5950X processor, a 16-core high-performance Zen3 processor, in both SMT Off and SMT On modes through our test suite and through some industry standard benchmarks. The goals of these tests are to ascertain the answers to the following questions:
- Is there a single-thread benefit to disabling SMT?
- How much performance increase does enabling SMT provide?
- Is there a change in performance per watt in enabling SMT?
- Does having SMT enabled result in a higher workload latency?*
*more important for enterprise/database/AI workloads
The best argument for enabling SMT would be a No-Lots-Yes-No result. Conversely the best argument against SMT would be a Yes-None-No-Yes. But because the core structures were built with having SMT enabled in mind, the answers are rarely that clear.
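The four questions above map onto four simple ratios. The sketch below shows one way they might be computed from a pair of SMT Off and SMT On runs. The dictionary keys, the function name, and especially the numbers are placeholders invented for illustration; they are not measurements from this review.

```python
def smt_report(off, on):
    """Relative changes between SMT Off and SMT On runs (hypothetical helper).

    Each argument is a dict with 'st_score', 'mt_score', 'watts', 'latency_ms'.
    Higher scores are better; lower latency is better.
    """
    return {
        # Q1: single-thread benefit from disabling SMT
        "st_gain_smt_off": off["st_score"] / on["st_score"] - 1,
        # Q2: multi-thread gain from enabling SMT
        "mt_gain_smt_on": on["mt_score"] / off["mt_score"] - 1,
        # Q3: change in performance per watt from enabling SMT
        "perf_per_watt_change": (on["mt_score"] / on["watts"])
        / (off["mt_score"] / off["watts"]) - 1,
        # Q4: change in workload latency from enabling SMT
        "latency_change": on["latency_ms"] / off["latency_ms"] - 1,
    }


# PLACEHOLDER values only, chosen to make the arithmetic easy to follow.
smt_off = {"st_score": 100, "mt_score": 1000, "watts": 120, "latency_ms": 10}
smt_on = {"st_score": 100, "mt_score": 1250, "watts": 125, "latency_ms": 11}

for name, value in smt_report(smt_off, smt_on).items():
    print(f"{name}: {value:+.1%}")
```

With these placeholder inputs the pattern would read roughly No-Some-Yes-Yes, which shows how a real result can land between the two ideal cases described above.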
For our test suite, due to obtaining new 32 GB DDR4-3200 memory modules for Ryzen testing, we re-ran our standard test suite on the Ryzen 9 5950X with SMT On and SMT Off. As per our usual testing methodology, we test memory at official rated JEDEC specifications for each processor at hand.
|AMD AM4||Ryzen 9 5950X||MSI X570|
|GPU||Sapphire RX 460 2GB (CPU Tests), NVIDIA RTX 2080 Ti|
|PSU||OCZ 1250W Gold|
|SSD||Crucial MX500 2TB|
|OS||Windows 10 x64 1909, Spectre and Meltdown Patched|
|VRM supplemented with Silverstone SST-FHP141-VF 173 CFM fans|
Also many thanks to the companies that have donated hardware for our test systems, including the following:
|Hardware Providers for CPU and Motherboard Reviews|
|RX 460 Nitro||RTX 2080 Ti|
|Crucial SSDs||Corsair PSUs|
MrSpadge - Thursday, December 3, 2020
> We’ve known for many years that having two threads per core is not the same as having two cores
True, and I still read this as an argument against SMT in forums. IMO it should be pointed out clearly that the cost of implementing either also differs drastically: +100% core size for another core and ~5% for SMT.
WaltC - Thursday, December 3, 2020
Intel began its HT journey in order to pull more efficiency from each core--basically, as performance was being left on the table. Interestingly enough, after Athlon and A64, AMD roundly criticized Intel because the SMT thread was not done by a "real core"...and then proceeded to drop cores with two integer units--which AMD then labeled as "cores"...;) Intel's HT approach proved superior, obviously. IIRC. It's been awhile so the memories are vague...;)
The only problem with this article is that it tries to make calls about SMT hardware design without really looking hard at the software, and the case for SMT is a case for SMT software. Games will not use more than 4-8 threads simultaneously so of course there is little difference between SMT on and off when running most games on a 5950. You would likely see near the same results on a 5600 in terms of gaming. SMT on or off when running these games leaves most of the CPU's resources untouched. Programs designed and written to utilize a lot of threads, however, show a robust, healthy scaling with SMT on versus no SMT. So--without a doubt--SMT CPU design is superior to no SMT from the standpoint of the hardware's performance. The outlier is the software--not the hardware. And of course the hardware should never, ever be judged strictly by the software one arbitrarily decides to run on it. We learn a lot more about the limits of the software tested here than we learn about SMT--which is a solid performance design in CPU hardware.
WarlockOfOz - Friday, December 4, 2020
Very valid point about how games won't see a difference between 16 and 32 threads when they only use 6. Do you know if this type of analysis has been done at the lower end of the market?
WaltC - Friday, December 4, 2020
It's been common knowledge established a few years ago when AMD started pushing 8 core (and greater) CPUs that games don't require that many cores and that 6 cores is optimal for gaming right now. And if you do more than game, occasionally, and need more than 6 threads then SMT is there for you. As the new consoles are 8-core CPU designs, over time the number of cores required for optimal game performance will increase.
Flying Aardvark - Friday, December 4, 2020
Consoles are 8-core now, with 2 reserved for the OS. Count on 6-cores being optimal for gaming for quite some time.
Kangal - Friday, December 4, 2020
Those Jaguar cores were more like a 4c/8t processor to be fair. And they weren't that much better than Intel's Atom cores, a far cry from Intel's Core-i SkyLake architecture. And current gen consoles were very light on the OS, so maybe using 1-full core (or 2-threads-shared) leaving only 3-cores for games, but much better than the 2-core optimised games from the PS3/360 era.
The new gen consoles will be somewhat similar, using only 1-full core (2-threads) reserved for the OS. But this time we have an architecture that's on-par with Intel's Core-i SkyLake, with a modern full 8-core processor (SMT/HT optional). This time leaving a healthy 7-cores that's dedicated to games. Optimisations should come sooner than later, and we'll see the effects on PC ports by 2022. So we should see a widening gap between 4vs6-core, and to a lesser extent 6vs8-core in the future. I wouldn't future-proof my rig by going for a 5700x instead of a 5600x, I would do that for the next round (ie 2022 Zen4).
AntonErtl - Sunday, December 6, 2020
The 8 Jaguar cores are in no way like 4c/8t CPUs; if you use only half of them, you get half the performance (unless your application is memory/L2-bandwidth-limited). Their predecessor Bobcat is about twice as fast as a Bonnell core (Atom proper), and a little slower than Silvermont (the core that replaced Bonnell), about half as fast as Goldmont+ (all at the clock rates at which they were available in fanless mini-ITX boards), one third as fast as a 3.5GHz Excavator core, and one sixth as fast as a 4.2GHz Skylake.
Oxford Guy - Sunday, December 6, 2020
Worse IPC than Bulldozer as far as I know. Certainly worse than Piledriver.
Really sad. The "consoles" should have used something better than Jaguar. It's bad enough that the "consoles" are a parasitic drain on PC gaming in the first place. It's worse when they not only drain life with their superfluous walled gardens but also by foisting such a low-grade CPU onto the art.
Kangal - Thursday, December 24, 2020
The Jaguar cores share a lot of DNA with Bulldozer, but they aren't the same. It's like Intel's Atom chips compared to Intel Core-i chips. With that said, 2015 Puma+ was a slight improvement over 2013 Jaguar, which was a modest improvement over the initial 2011 Bobcat lineup. All this started in 2006 with AMD choosing to evolve their earlier Phenom2 cores which are derivatives of the AMD Athlon-64.
So just by their history, we can see they're inline with Intel's Atom architecture evolution, and basically a direct competitor. Where Intel had slightly less performance, but had much lower power-draw... making them the obvious winner. Leaving AMD to fill in the budget segments of the market.
As for the core arrangement, they don't have full proper cores as people expect them. Like the Bulldozer architecture, each core had to share resources like the decoder and floating-point unit. So in many instances, one core would have to wait for the other core. This boosts multithreaded performance with simple calculations in orderly patterns. However, with more complex calculations and erratic/dynamic patterns (ie Regular PC use), it causes a hit to the single-thread performance and notable hiccups. So my statement was true. This is more like a 4c/8t chipset, and it is less like a Core-i and much more like an Atom. But don't take my word for it, take Dr Ian Cutress. He said the same thing during the deep dive into the Jaguar microarchitecture, and recently in the Chuwi Aerobox (Xbox One S) article.
Now, there have been huge benefits to the Gaming PC industry, and game ports, due to the PS4/XB1. The first being the x86-64bit direct compatibility. Second was the cross-compatibility thanks to Vulkan and DirectX (moreso with PS4 Pro and XB1X). The third being that it forced game developers to innovate their game engines, so that they're less narrow and more multi-threaded. With PS5/XseX we now see a second huge push with this philosophy, and the improvements of fast single-thread performance and fast-flash storage access. So I think while we have legitimate reasons to groan about the architecture (especially in the PS4) upon release, we do have to recognize the conveniences that they also brought (especially in the XB1X). This is just to show that my stance wasn't about console bashing.
at_clucks - Monday, December 7, 2020
@Kangal, Jaguar APUs in consoles are definitely not "like a 4c/8t processor" because they don't use CMT. They are full 8 cores. Their IPC may be comparable with some newer Atoms although it's hard to benchmark how the later "Evolved Jaguar" cores in the mid generation console refresh compare against the regular Jaguar or Atom.