Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
by Dr. Ian Cutress on December 3, 2020 10:00 AM EST
For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
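As a minimal sketch of how such a percentage can be derived from a pair of raw results, the snippet below normalizes an SMT-on result against its SMT-off baseline. The function name, the `higher_is_better` flag, and the raw numbers are assumptions made for illustration; they are not taken from the benchmark suite itself.

```python
def smt_relative_percent(score_smt_off: float, score_smt_on: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT-on performance as a percentage of the SMT-off baseline.

    100.0 means no change; values above 100.0 mean SMT helped.
    """
    if score_smt_off <= 0 or score_smt_on <= 0:
        raise ValueError("scores must be positive")
    if higher_is_better:
        # Scores such as frames per second or points.
        ratio = score_smt_on / score_smt_off
    else:
        # Run times: a shorter time means higher performance, so invert.
        ratio = score_smt_off / score_smt_on
    return 100.0 * ratio


# Hypothetical raw numbers, for illustration only (not measured results).
print(round(smt_relative_percent(200.0, 250.0), 1))  # a score, higher is better
print(round(smt_relative_percent(120.0, 100.0, higher_is_better=False), 1))  # seconds
```

With these placeholder inputs the two calls print 125.0 and 120.0. Inverting the ratio for time-based tests keeps every entry on the same scale, so 100% always means parity with SMT disabled.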
Here are the single threaded results.
Single Threaded Tests, AMD Ryzen 9 5950X
Interestingly enough, our single threaded performance was within a single percentage point across the stack (SPEC being +1.2%). Given that running with SMT disabled should arguably give more resources to each thread for consistency, the fact that we see no difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.
The multithreaded tests are a bit more diverse:
Multithreaded Tests, AMD Ryzen 9 5950X
|Test|SMT Off|SMT On|
|3D Particle Movement|100%|165.7%|
|3DPM with AVX2|100%|177.5%|
|HandBrake 4K HEVC|100%|107.9%|
Here we have a number of different factors affecting the results.
Starting with the two tests that scored statistically worse with SMT2 enabled: yCruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. yCruncher is arguably a synthetic math benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so both are quite far away from real world use cases.
Most of the rest of the benchmarks gain between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that stop both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, etc.; each benchmark is likely different.
The outliers are 3DPM, 3DPM with AVX2, and Corona. These three are 45%+, with 3DPM at +65.7% and its AVX2 variant at +77.5%. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. They are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation: there is less waiting to pull data from the caches, and less contention, which adds some extra performance.
In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.
In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and SMT disabled, in either our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.
For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.
Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.
Most real-world workloads see a small uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
quadibloc - Monday, December 14, 2020
The SPARC chips used SMT a lot, even going beyond 2-way, so I'm surprised they weren't mentioned as examples.
mode_13h - Sunday, June 6, 2021
> When SMT is enabled, depending on the processor, it will allow two, four,
> or eight threads to run on that core
Intel's HD graphics GPUs win the oddball award for supporting 7 threads per EU, at least up through Gen 11, I think.
IIRC, AMD supports 12 threads per CU, on GCN. I don't happen to know how many "warps" Nvidia simultaneously executes per SM, in any of their generations.
mode_13h - Sunday, June 6, 2021
Thanks for looking at this, although I was disappointed in the testing methodology. You should be separately measuring how the benchmarks respond to simply having more threads, without introducing the additional variable of SMT on/off. One way to do this would be to disable half of the cores (often an option you see in BIOS) and disable SMT. Then separately re-test with SMT on, and then with SMT off but all cores on. This way, we could compare SMT on/off with the same number of threads. Ideally, you'd also do this on a single-die/single-CCX CPU, to ensure no asymmetry in which cores were disabled.
Even better would be to disable any turbo, so we could just see the pipeline behavior. Although, controlling for more variables poses a tradeoff between shedding more light on the ALU behavior and making the test less relevant to real-world usage.
The reason to hold the number of threads constant is that software performance doesn't scale linearly with the number of threads. Due to load-balancing issues or communication overhead (e.g. lock contention), performance of properly-designed software always scales sub-linearly with the number of threads. So, by keeping the number of threads constant, you'd eliminate that variable.
Of course, in real-world usage, users would be deciding between the two options you tested (SMT on/off; always using all cores). So, that was most relevant to the decision they face. It's just that you're limited in your insights into the results, if you don't separately analyze the thread-scaling of the benchmarks.
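The three-run comparison proposed in this comment can be sketched as follows. This is only an illustration under stated assumptions: the `Run` record, the `compare_smt_to_cores` helper, the core counts, and the scores are all hypothetical placeholders, not measurements from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run on a given configuration (hypothetical record)."""
    cores: int      # physical cores left enabled
    smt: bool       # whether SMT2 is enabled
    score: float    # benchmark score, higher is better

    @property
    def threads(self) -> int:
        # SMT2 exposes two hardware threads per physical core.
        return self.cores * (2 if self.smt else 1)


def compare_smt_to_cores(half_off: Run, half_on: Run, full_off: Run) -> dict:
    """Separate the effect of SMT from the effect of a higher thread count."""
    if half_on.threads != full_off.threads:
        raise ValueError("SMT-on and all-cores runs must expose equal thread counts")
    return {
        # Gain from enabling SMT on the same physical cores.
        "smt_gain": half_on.score / half_off.score,
        # Gain from doubling physical cores with SMT off.
        "core_gain": full_off.score / half_off.score,
        # Same thread count: SMT threads relative to real cores.
        "smt_vs_cores": half_on.score / full_off.score,
    }


# Placeholder scores for illustration only (not measured results).
result = compare_smt_to_cores(
    Run(cores=8, smt=False, score=100.0),
    Run(cores=8, smt=True, score=125.0),
    Run(cores=16, smt=False, score=190.0),
)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because the last two runs expose the same number of hardware threads, `smt_vs_cores` compares SMT threads against physical cores without the thread-scaling behaviour of the software getting in the way, while `core_gain` shows how far the benchmark falls short of linear scaling on its own.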
mode_13h - Sunday, June 6, 2021
Oops, I also intended to mention OS scheduling overhead as another source of overhead, when running more threads. We tend not to think of the additional work that more threads creates for the OS, but each thread the kernel has to manage and schedule has a nonzero cost.
mode_13h - Sunday, June 6, 2021
As for the article portion, I also thought too little consideration was given towards the relative amounts of ILP in different code. Something like a zip file compressor should have relatively little ILP, since each symbol in the output tends to have a variable length in the input, meaning decoding of the next symbol can't really start until the current one is mostly done. Text parsing and software compilation also tend to fall in this category.
So, I was disappointed not to see some specific cases of low-ILP (but high-TLP) highlighted, such as software compilation benchmarks. This is also a very relevant use case for many of us. I spend hours per week compiling software, yet I don't play video games or do 3D photo reconstruction.
mode_13h - Sunday, June 6, 2021
A final suggestion for any further articles on the subject: rather than speculate about why certain benchmarks are greatly helped or hurt by SMT, use tools that can tell you!! To this end, Intel has long provided VTune and AMD has a tool called μProf.