A new update to the Intel document for software developers indicates that the company will begin to introduce various AVX-512 instruction set extensions to its consumer CPUs soon. This will start from the codenamed Cannon Lake (CNL) and Ice Lake (ICL) processors, made using 10 nm process technologies. The new extensions will enable future chips to improve performance in certain applications. One of the main questions on AVX-512 is which consumer programs will actually support the AVX-512 when these CNL and ICL processors hit the market. In addition to the AVX-512, the upcoming processors will introduce a host of other new non-AVX-512 instructions.

AVX-512 Coming to Consumer CPUs

According to the Intel Architecture Instruction Set Extensions and Future Features Programming Reference document, Intel’s Cannon Lake CPUs will support AVX512F, AVX512CD, AVX512DQ, AVX512BW, and AVX512VL. This will bring the feature set of these CPUs to the current level of the Skylake-SP based processors. In addition, the Cannon Lake microarchitecture will support the AVX512_IFMA and AVX512_VBMI commands, but at this point, it is unclear whether the support will be limited to servers, or will also be featured in the consumer processors (the latter scenario is likely based on the document wording, but remains unclear).

Intel originally promised to release Cannon Lake processors in 2016 – 2017 timeframe, but delayed introduction of its 10 nm process technology to 2018, thus postponing the CPU launch as well. Initially it was expected that the Cannon Lake CPUs would generally resemble the Kaby Lake and Coffee Lake chips with some refinements, but the addition of the AVX-512 support means a rather tangible architecture improvement. For AVX-512, large the chunks of data require massive memory bandwidth, which the Skylake-SP cores get due to large caches and more memory controllers. Keeping in mind memory bandwidth and power consumption factors, the AVX-512 might not be supported by all Cannon Lake client CPUs, but only by those aimed at higher-performance machines (i.e., no AVX-512 for ULP mobile parts as well as entry-level desktop SKUs, but this is a speculation at this point). Meanwhile, a good news is that by the time AVX-512-supporting Cannon Lake processors arrive, programs for client PCs that take advantage of the latest extensions will likely be available.

The evolution of the AVX-512 on general-purpose CPUs is not going to stop. Intel’s Ice Lake processors will support AVX512_VPOPCNTDQ (which will also be supported by the Xeon Phi ‘Knights Mill’) commands as well as AVX512_VNNI, AVX512_VBMI2, AVX512+VPCLMULQDQ and AVX512_BITALG instructions. The ICL chips will also feature AVX-512 versions of known AES and GFNI algorithms for encryption and error corrections — AVX512+VAES and AVX512+GFNI.

Meanwhile, the Knights Mill will exclusively support AVX512_4FMAPS and AVX512_4VNNI (at least for a while, because an Intel filing with the Linux kernel states that the upcoming Xeon Phi and Xeon CPUs will support both commands, but descriptions of Linux patches are not always accurate, plus, plans tend to change).

AVX-512 Support Propogation by Various Intel CPUs
  Xeon, Core X General Xeon Phi  
Skylake-SP AVX512BW
Knights Landing
Cannon Lake AVX512VBMI
Knights Mill
Ice Lake AVX512_VNNI
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

As it turns out from Intel’s document, the Cannon Lake and Ice Lake processors will have an up-to-date AVX-512 support. It is unknown whether the CNL and the ICL cores will be used inside the future server processors (remember that Intel has server-specific 'Cascade Lake' product incoming), but if this is the case, then it looks like Intel’s cores for server and client computers will have the same feature-set going forward, at least when it comes to the AVX-512 support.

Adding the AVX-512 to consumer processors looks like an important development even though the instruction set was primarily designed to process large amounts of data common for servers and, to a degree, workstations (such as encoding, rendering, cryptography, deep learning, etc.). Apparently, Intel believes that 512-bit INT/FP calculations will be important for mainstream PCs as well. A big question is how exactly Intel plans to implement the AVX-512 in various Cannon Lake and Ice Lake processors going forward. Keep in mind that Intel’s six and eight-core Skylake-X CPUs officially support one fused FMA for AVX-512-F, but the chips with 10+ cores officially support dual 512-bit AVX-512-F ports and can offer up to two times higher performance. So in that respect, there is potential for further differentiation between products.

In the meantime, Intel’s Cannon Lake and Ice Lake CPUs will have a number of other new instructions for various matters and they are certainly worth looking at.

New Instructions to Improve Security, Performance of Upcoming CPUs

In a bid to speed up certain cryptography algorithms, Cannon Lake will feature the SHA-NI instruction set that is already supported by the Goldmont cores. SHA-NI is of a similar base to AES-NI, that was added several generations prior. Based on Intel’s publications, SHA-NI can speed up SHA1, SHA256 and SHA224 algorithms. In addition, the new CPUs will also support the UMIP security mechanism that prevents the execution of certain instructions in if their privilege level is insufficient for that, preventing certain apps from accessing the OS settings.

The Ice Lake chips will bring support for Fast Short REP MOV instruction that will enable fast moves of large amounts of data from one location to another, which will benefit optimized memory-intensive applications. Keep in mind that we are moving towards persistent memory for a number of server applications and therefore large amounts of data located in DRAM and/or NVDIMMs will be more common in the future.

Another interesting feature supported by the Ice Lake consumer processors is CLWB (Cache Line Write Back) command for NVMe programming. The feature is already supported by the Skylake-SP cores and is required to better handle SSDs connected to the processor, but will come into consumer products with Ice Lake. CLWB flushes the write caches, but does not invalidate the data, making it available if it is needed after the line is flushed, thus improving performance in certain situations. Given the Purley/Skylake-SP context, CLWB is something required for upcoming NVDIMMs (based on 3D XPoint), but it is not completely clear how Intel expects to use it in case of consumer platforms (they make sense for certain workstation applications and for that reason CLWB is supported by SKL-SP). In any case, the addition of CLWB will add some speed in certain cases when very fast SSDs are used and cache miss is an issue.

There are other features coming in the Goldmont Plus (the heart of upcoming Gemini Lake SoCs) and Ice Lake processors, namely PTWRITE and RDPID, which seem to be aimed mostly at software developers and which purpose may not benefit end users right away.

Instruction Set Extensions of Cannon Lake, Ice Lake and Goldmont+ CPUs
  Instruction Purpose Description
Cannon Lake SHA-NI Security Cryptography acceleration.

User-Mode Instruction Prevention
Security Prevents execution of certain instructions if the Current Privilege Level (CPL) is greater than 0. If these instructions were executed while in CPL > 0, user space applications could have access to system-wide settings such as the global and local descriptor tables, the task register and the interrupt descriptor table.
Ice Lake CLWB

Cache Line
Write Back
Performance Writes back modified data of a cache line similar to CLFLUSHOPT, but avoids invalidating the line from the cache (and instead transitions the line to non-modified state). CLWB attempts to minimize the compulsory cache miss if the same data is accessed temporally after the line is flushed if the same data is accessed temporally after the line is flushed.
Fast Short REP MOV Performance Enables fast moves of data from one location to another.

Read Processor ID
General Quickly reads processor ID to discover its feature set and apply optimizations/use specific code path if possible.
Goldmont Plus PTWRITE

Write Data to a Processor Trace Packet
Debugging Unclear.
UMIP Security See above
RDPID General See above
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

Some History

Intel and AMD have been adding various instruction set extensions to the x86 architecture since the mid-1990s. Throughout the recent 20 years, both companies have brought in hundreds of new instructions designed to improve performance in various applications by SIMD instructions and feeding CPU cores large amounts of data at once or by using special-purpose hardware. Intel’s latest mainstream extensions are called the AVX/AVX2 and their main purposes were increasing the width of the register file (both SIMD and integer) to 256 bits and the introduction of commands like the FMA3 (that serves the same purpose — does relatively complex computations in one instruction). To perform 256-bit AVX2 operations, CPUs have to lower their frequency to maintain stability, as cores tend to draw a lot of power under such workloads, but even at lower clock rates AVX/AVX2 make a lot of sense and increase overall throughput.

The next step in the evolution of the instruction set extensions that Intel made was the AVX-512. With AVX-512 the company decided to introduce different sets of instructions for different applications and implemented them in different products. Some of the AVX-512 extensions are aimed primarily at enterprise workloads, whereas the others are needed for supercomputers or high performance compute. Implementing all of them in in all products hardly makes a lot of sense for Intel and its customers, so the latest Skylake-SP Xeons (and the high-end desktop processors) support one set of AVX-512 commands and the Xeon Phis support another one. In the meantime, contemporary mainstream consumer CPUs do not support AVX-512 at all.  One of the reasons for this is because the physical implementation significantly increases die size (by up to 15% in case of the Skylake core). Other factors such as the cost associated with a die increase, and partly because client applications today cannot take advantage of such instructions, are also in the mix. In the future, this is going to change as Intel plans to enable support of certain AVX-512 variations in its future Cannon Lake and Ice Lake processors for mainstream consumers.

Wrapping Up

The addition of the AVX-512 to the future consumer CPUs is a good news for those who use such processors for things like video encoding, rendering or other applications that are common for workstations. Meanwhile, with the Ice Lake consumer chips, Intel is adding a deep learning-specific (AVX512_VNNI) 512-bit instructions as well as the NV-DIMM-oriented features such as CLWB, although immediate advantages for this market segment are unclear. Intel is opening this information up to allow developers to prepare for these processors and develop software in advance. In any case, all new features are always welcome by many because at some point they start to bring certain advantages.

Related Reading

Source: Intel (via WikiChip Twitter).

Comments Locked


View All Comments

  • andrebrait - Thursday, October 19, 2017 - link

    It's interesting to notice AMD has supported the SHA-NI instructions on their Zen architecture for a while now.
  • lefty2 - Thursday, October 19, 2017 - link

    Also, Zen does not lower it's frequency to perform 256-bit AVX2 operations. That is specific to Intel CPUs. It seems Zen has more efficient power management of it's AVX units
  • smilingcrow - Thursday, October 19, 2017 - link

    Doesn't Zen have noticeably lower clock speeds to start with?
    The best thing is to look at actual benchmarks as clock speed alone is only part of the picture.

    I did read that people over-clocking Coffee Lake i7 k series to ~5GHz were using an AVX offset of 3 so 4.7GHz for AVX loads sounds pretty good.
  • Elstar - Thursday, October 19, 2017 - link

    Right, clock speed isn't everything. Let's not forget about the Pentium 4 fiasco where they prioritized high frequencies over real world efficiency/performance.

    I think the AVX offset is often misunderstood. I think the way to read it is that Intel feels that non-AVX code is still common to be worth optimizing for, hence the slight "turbo boost" for running non-AVX code.
  • MrSpadge - Thursday, October 19, 2017 - link

    No, it's a penatly for AVX2/512 code, not a boost for other workloads (which are of course still very important). The higher throughput in AVX2/512 is the reason for the higher power draw - there's no free lunch here. And Intel decided that full clocks & voltages would put undue stress on the chips and power delivery, hence the clock speed penatly for AVX2/512
  • Elstar - Thursday, October 19, 2017 - link

    You're missing my point. Processor designers can either limit the max frequency to the slowest/hottest part of the CPU design, or they can implement dynamic frequency logic to allow different instruction mixes to run at different frequencies. Whether one views a given dynamic frequency mode as a "penalty" or a "reward" is a matter of perspective (and a human value judgement too). The fact that Intel refers to the AVX frequency is a "penalty" says more about internal Intel politics than it does technical design.
  • alistair.brogan - Thursday, October 19, 2017 - link

    You can make two versions of the same Intel CPU, and if you add AVX you need an offset as high as 400 mhz to run at the same temperature and voltage, so it is called a penalty.

    Point is you need to compare last years with this years. You are overthinking a little bit ;)
  • HStewart - Thursday, October 19, 2017 - link

    A similar thing could say about Core count - it not actually the number the cores that make the difference but how the cores performed.
  • smilingcrow - Thursday, October 19, 2017 - link

    It's called an AVX offset not a non-AVX boost but ultimately that's just semantics really.
    But when you consider that everything bar AVX runs at the maximum speed it makes sense to call it an AVX offfset as that is clear and simple to grasp.
  • pm9819 - Thursday, October 19, 2017 - link

    While the Pentium 4 was a beast to keep cool and insanely power inefficient it was also the CPU when it came to video editing and transcoding. And it also ran laps around AMD's cpu's in anything floating point intensive. I can't think of any "real world" task it was deficient in.

Log in

Don't have an account? Sign up now