Huge Memory Bandwidth, but not for every Block

One highly intriguing aspect of the M1 Max, maybe less so for the M1 Pro, is the massive memory bandwidth that is available for the SoC.

Apple was keen to market their 400GB/s figure during the launch, but this number is so wild and out there that there are a lot of questions left open as to how the chip is able to take advantage of this kind of bandwidth, so it’s one of the first things to investigate.

Starting off with our memory latency tests, the new M1 Max changes system memory behaviour quite significantly compared to what we’ve seen on the M1. On the core and L2 side of things, there haven’t been any changes and we consequently don’t see many alterations in the results – it’s still a 3.2GHz peak core with 128KB of L1D at 3-cycle load-load latencies, and a 12MB L2 cache.

Where things are quite different is when we enter the system cache: instead of 8MB, on the M1 Max it’s now 48MB large, and also a lot more noticeable in the latency graph. While being much larger, it’s also evidently slower than the M1 SLC – the exact figures here depend on access pattern, but even the linear chain access shows that data has to travel a longer distance than on the M1 and corresponding A-chips.
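To make "access pattern" concrete, here is a minimal Python sketch of the pointer-chasing idea behind latency tests of this kind: every slot stores the index of the next slot to visit, so each load depends on the one before it. The function names are hypothetical, "linear chain" is assumed to mean a chain that walks neighbouring slots in order, and this is not the tool used for the results above. Timing it in Python would mostly measure interpreter overhead, so it only describes the pattern.

```python
import random


def build_linear_chain(n):
    """Sequential chain: 0 -> 1 -> ... -> n-1 -> 0 (prefetcher-friendly)."""
    return [(i + 1) % n for i in range(n)]


def build_random_chain(n, seed=0):
    """Random chain that still visits every slot exactly once per lap.

    Uses Sattolo's algorithm, which always yields a single cycle of
    length n, so the walk cannot get stuck in a short loop.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps, start=0):
    """Follow the chain; each step depends on the previous load."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx


def visits_every_slot(chain):
    """Check that one lap from slot 0 touches all slots and returns to 0."""
    seen = set()
    idx = 0
    for _ in range(len(chain)):
        seen.add(idx)
        idx = chain[idx]
    return len(seen) == len(chain) and idx == 0
```

In a compiled benchmark the same chain would be laid out across a buffer of the chosen test depth (for example 128MB), and the time per step would approximate the load-to-load latency at that depth.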

DRAM latency goes up this generation, even though the M1 Max’s memory is on paper faster in terms of frequency and bandwidth. At a comparable 128MB test depth, the new chip is roughly 15ns slower. The larger SLC, the more complex chip fabric, and possibly worse timings on the part of the new LPDDR5 memory could all add to the regression we’re seeing here. In practical terms, because the SLC is so much bigger this generation, workload latencies should still be lower for the M1 Max due to the higher cache hit rates, so performance shouldn’t regress.
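The argument that a bigger cache can offset slower DRAM can be sketched with a simple weighted average. In the Python sketch below only the 15ns DRAM regression comes from the text; the hit rates and absolute latencies are hypothetical placeholders chosen for illustration, and the two-level model ignores the L1 and L2 caches entirely.

```python
def average_latency_ns(hit_rate, cache_ns, dram_ns):
    """Average access latency for a two-level model (SLC, then DRAM)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ns + (1.0 - hit_rate) * dram_ns


# Hypothetical smaller, faster SLC in front of faster DRAM.
small_cache = average_latency_ns(hit_rate=0.30, cache_ns=40.0, dram_ns=100.0)

# Hypothetical larger, slower SLC in front of DRAM that is 15ns slower,
# but with a higher hit rate because more of the working set fits.
large_cache = average_latency_ns(hit_rate=0.60, cache_ns=50.0, dram_ns=115.0)

# With these made-up inputs the larger cache still comes out ahead.
large_cache_wins = large_cache < small_cache
```

Whether a real workload behaves this way depends entirely on how much of its working set lands in the 48MB SLC.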

A lot of people in the HPC audience were extremely intrigued to see a chip with such massive bandwidth – not because they care about the GPU or other offload engines of the SoC, but because of the possibility of the CPUs having access to such immense bandwidth, something that is otherwise only possible on larger server-class CPUs that cost a multiple of what the new MacBook Pros are sold at. It was also one of the first things I tested out – to see exactly how much bandwidth the CPU cores have access to.

Unfortunately, the news here isn’t the best-case scenario that we hoped for, as the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side.

From a single-core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors. We had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.

The little hump between 12MB and 64MB should be the SLC of 48MB in size; the reduction in bandwidth at the 12MB figure signals that the core is somehow limited in bandwidth when evicting cache lines back to the upper memory system. Our test here consists of reading, modifying, and writing back cache lines, with a 1:1 R/W ratio.

Going from 1 core/thread to 2, what the system is actually doing is spreading the workload across the two performance clusters of the SoC, so both threads are on their own cluster and have full access to the 12MB of L2. The “hump” after 12MB reduces in size, ending earlier now at +24MB, which makes sense as the 48MB SLC is now shared between two cores. Bandwidth here increases to 186GB/s.

Adding a third thread, there’s a bit of an imbalance across the clusters and DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s. This appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in that the bandwidth is able to jump up again, to a maximum of 243GB/s.
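The diminishing returns are easier to see when the quoted figures are laid out per thread. The short Python sketch below only re-arranges the numbers from the text (102, 186, 204, 224 and 243GB/s); the function and variable names are hypothetical.

```python
# Thread count on the performance cores -> DRAM bandwidth in GB/s,
# as quoted in the text above.
P_CORE_BANDWIDTH_GBPS = {1: 102, 2: 186, 3: 204, 4: 224}

# Maximum once the efficiency cluster is added, also from the text.
ALL_CLUSTERS_BANDWIDTH_GBPS = 243


def scaling_table(measured):
    """Return (threads, bandwidth, gain from this thread, speedup vs 1 thread)."""
    rows = []
    base = measured[min(measured)]
    previous = 0
    for threads in sorted(measured):
        bandwidth = measured[threads]
        rows.append((threads, bandwidth, bandwidth - previous, bandwidth / base))
        previous = bandwidth
    return rows


if __name__ == "__main__":
    for threads, bandwidth, gain, speedup in scaling_table(P_CORE_BANDWIDTH_GBPS):
        print(f"{threads} thread(s): {bandwidth} GB/s (+{gain}), {speedup:.2f}x")
    e_core_gain = ALL_CLUSTERS_BANDWIDTH_GBPS - P_CORE_BANDWIDTH_GBPS[4]
    print(f"adding the E-cluster: +{e_core_gain} GB/s")
```

The second thread adds 84GB/s because it lands on the other performance cluster, while the third and fourth add only about 20GB/s each.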

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.
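For reference, the 409GB/s and 204GB/s peaks follow directly from memory bus width and transfer rate. The Python sketch below assumes a 512-bit LPDDR5-6400 interface for the M1 Max and a 256-bit one for the M1 Pro; those figures come from the chips' published specifications rather than from this page, and the function name is hypothetical.

```python
def peak_bandwidth_gbps(bus_width_bits, transfers_per_second):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


LPDDR5_6400_TRANSFERS_PER_SECOND = 6400e6  # 6400 MT/s (assumed data rate)

m1_max_peak = peak_bandwidth_gbps(512, LPDDR5_6400_TRANSFERS_PER_SECOND)
m1_pro_peak = peak_bandwidth_gbps(256, LPDDR5_6400_TRANSFERS_PER_SECOND)

# Share of the M1 Max peak reached by the CPU-only test (243GB/s in the text).
cpu_share_of_peak = 243 / m1_max_peak

if __name__ == "__main__":
    print(f"M1 Max peak: {m1_max_peak:.1f} GB/s")
    print(f"M1 Pro peak: {m1_pro_peak:.1f} GB/s")
    print(f"CPU-only share of M1 Max peak: {cpu_share_of_peak:.0%}")
```

Under these assumptions the CPU-only result is a little under 60% of the M1 Max's theoretical peak.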

That begs the question: why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind; however, in my testing, I’ve had extreme trouble finding workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify one yet.

That leaves everything else on the SoC: the media engine, the NPU, and workloads that simply stress all parts of the chip at the same time. The new media engines on the M1 Pro and Max are now able to decode and encode ProRes RAW formats. The clip tested here is a 5K 12bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real-time, it’s able to do it at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frame rates. The SoC DRAM bandwidth while seeking around was around 40-50GB/s – I imagine that workloads that stress the CPU, GPU, and media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.

492 Comments

  • michael2k - Thursday, October 28, 2021

    Power consumption scales linearly with clock speed.

    Clock speed, however, is constrained by voltage. That said, we already know that the M1M itself has a 3.2GHz clock while the GPU is only running at 1.296GHz. It is unknown if there is any reason other than power for the GPU to run so slowly. If they could double the GPU clock (and therefore double its performance) without increasing its voltage, it would only draw about 112W. If they let it run at 3.2GHz it would draw 138W.

    Paired with the CPU drawing 40W the M1M would still be several times under the Mac Pro's current 902W. So that leaves open the possibility of a multiple chip solution (4 M1P still only draws 712W if the GPU is clocked to 3.2GHz) as well as clocking up slightly to 3.5GHz, assuming no need to increase voltage. Bumping up to 3.5GHz would still only consume 778W while giving us almost 11x the GPU power of the current M1P, which would be 11x the performance of the 3080 found in the GE76 Raider

    Also, you bring up AMD/Intel/NVIDIA at 5nm, without also considering that when Apple stops locking up 5nm it's because they will be at 4nm and 3nm.
  • uningenieromas - Thursday, October 28, 2021

    You would think that if Apple's silicon engineers are so freakin' good, they could basically work wherever they want...and, yep, they chose Apple. There might be a reason for that?
  • varase - Wednesday, November 3, 2021

    We're glad you shared your religious epiphany with the rest of us 😳.
  • Romulo Pulcinelli Benedetti - Sunday, May 22, 2022

    Sure, Intel and AMD would take all the hard work to advance humanity toward Apple level chips if Apple was not there, believe in this...
  • Alej - Tuesday, October 26, 2021

    The native ARM Mac scarcity I don’t fully get, a lot of games get ported to the switch which is already ARM. And if they are using Vulkan as the graphics API then there’s already MoltenVK to translate it to Metal, which even if not perfect and won’t use the 100% of available tricks and optimizations, it would run well enough.
  • Wrs - Tuesday, October 26, 2021

    @Alej It's a numbers and IDE game. 90 million Switches sold, all purely for gaming, supported by a company that exclusively does games. 20 million Macs sold yearly, most not for gaming in the least, built by a company not focused on gaming for that platform. iPhones are partially used for gaming, however, and sell many times the volume of the Switch, so as expected there's a strong gaming ecosystem.
  • Kangal - Friday, October 29, 2021

    Apple is happy where they are.
    However, if Apple were a little faster/wiser, they would've made the switch from Intel Macs to M1 Macs back in 2018 using the TSMC 7nm node, their Tempest/Vortex CPUs and their A12-GPU. They wouldn't be too far removed from the performance of the M1, M1P, M1X if scaled similarly.

    And even more interesting, what if Apple released a great Home Console?
    Something that is more compact than the Xbox Series S, yet more powerful than the Xbox Series X. That would leave both Microsoft and Sony scrambling. They could've designed a very ergonomic controller with much less latency, and they could've enticed all these AAA-developers to their platform (Metal v2 / Swift v4). It would be gaming-centric, with out-of-box support for iOS games/apps, and even a limited-time support (Rosetta v2) for legacy OS X Applications. They wouldn't be able to subsidise the pricing like Sony, but could basically front the costs from their own pocket to bring it to a palatable RRP. After 2 years, then they would be able to turn a profit from its hardware sales and software sales.

    I'm sure they could have been a hit. And it would then pivot to make MacBook Pro's more friendly for media consumption, and developer-supported. Strengthening their entire ecosystem, and leveraging their unique position in software and hardware to remain competitive.
  • kwohlt - Tuesday, October 26, 2021

    I think it is just you. Imagine a hypothetical ultra thin, fanless laptop that offered 20 hours of battery under load and could play games at desktop 3080 levels...Would you wish this laptop was louder, hotter, and had worse battery?

    No of course not. Consuming less power and generating less heat, while offering similar or better performance has always been the goal of computing. It's this trend that allows us to daily carry computing power that was once the size of a refrigerator in our pockets and on our wrists.
  • Wrs - Wednesday, October 27, 2021

    No, but I might wish it could scale upward to a desktop/console for way more performance than a 3080. :) That would also be an indictment of how poorly the 3080 is designed or fabricated, or how old it is.

    Now, if in the future silicon gets usurped by a technology that does not scale up in power density, then I could be forced to say yes.
  • turbine101 - Monday, October 25, 2021

    Why would developers waste their time on a device which will have barely any sales?

    The M1 Mac Max costs $6k NZD. That's just crazy, even the most devout Apple enthusiasts cannot justify this. And Mac is far less usable than iOS.