The Xeon E5-2600: Dual Sandy Bridge for Servers
by Johan De Gelas on March 6, 2012 9:27 AM EST
Posted in: IT Computing, Virtualization, Xeon, Opteron, Cloud Computing
Intel's Sandy Bridge architecture was introduced to desktop users more than a year ago. Server parts, however, have been much slower to arrive, as it has taken Intel that long to transpose this new engine into a Xeon processor. Although the core architecture is the same, the system architecture differs significantly from that of the LGA-1155 CPUs, which made this CPU quite a challenge, even for Intel. After completing that work late last year, Intel first introduced the resulting design as the six-core high-end Sandy Bridge-E desktop CPU and has since been preparing SNB-E for use in Xeon processors. That took a few more months, but the wait for Xeon users is finally over: today Intel is launching its first SNB-E based Xeons.
Compared to its predecessor, the Xeon X5600, the Xeon E5-2600 offers a number of improvements:
A substantially improved core, as described in Anand's article. For example, the µop cache lowers the pressure on the decoding stages and lowers power consumption, killing two birds with one stone. Other core improvements include an improved branch prediction unit and a more efficient Out-of-Order backend with larger buffers.
A vastly improved Turbo 2.0. The CPU can briefly go beyond the TDP limits, and when returning to the TDP limit, the CPU can sustain a higher "steady-state" clockspeed. According to Intel, enabling turbo allows the Xeon E5 to perform 14% better in the SAP S&D 2-tier test. This compares well with the Turbo inside the Xeon 5600, which could only boost performance by 4% in the same SAP benchmark.
Support for AVX Instructions combined with doubling the load bandwidth should allow the Xeon to double the peak floating point performance compared to the Xeon "Westmere" 5600.
A bi-directional 32 byte ring interconnect that connects the 8 cores, the L3-cache, the QPI agent and the integrated memory controller. The ring replaces the individual wires from each core to the L3-cache. One of the advantages is that the wiring to the L3-cache can be simplified and it is easier to make the bandwidth scale with the number of cores. The disadvantage is that the latency is variable: it depends on how many hops a certain piece of data inside the L3-cache must cross before it ends up at the right core.
A faster QPI: revision 1.1, which delivers up to 8 GT/s instead of 6.4 GT/s (Westmere).
Lower latency to PCIe devices. Intel integrated a PCIe 3.0 I/O subsystem into the die, where it sits on the same bi-directional 32 byte ring as the cores. PCIe 3.0 runs at 8 GT/s (PCIe 2.0: 5 GT/s), and its encoding has less overhead. As a result, PCIe 3.0 can deliver up to about 1 GB/s per lane in each direction, which is twice as much as PCIe 2.0.
Removing the separate I/O hub from the path lowered PCIe latency by 25% on average according to Intel. If you only access the local memory, Intel measured 32% lower read latency.
Not only is the access latency to PCIe I/O devices significantly lower, but Intel's Data Direct I/O Technology also allows PCIe NICs to read and write directly to the L3-cache instead of to main memory. In extremely bandwidth constrained situations (using 4 InfiniBand controllers or similar), this lowers power consumption and reduces latency by another 18%, which is a boon to HPC users with 10G Ethernet or InfiniBand NICs.
The new Xeon also supports faster DDR3-1600: up to 2 DIMMs per channel can run at that speed.
Last but certainly not least: 2 additional cores and up to 66% more L3 cache (20 MB instead of 12 MB). Even with 8 cores and a PCIe agent (40 lanes), the Xeon E5 still runs at 2.2 GHz within a 95W TDP power envelope. Pretty impressive when compared with both the Opteron 6200 and Xeon 5600.
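The per-lane PCIe figures quoted above can be checked with a little arithmetic. The sketch below assumes the line encodings from the PCIe specifications, which the article does not spell out: 8b/10b for PCIe 2.0 and 128b/130b for PCIe 3.0. The function name and the decimal megabyte convention are our own choices for illustration.

```python
def lane_bandwidth_mb_s(gigatransfers_per_s, payload_bits, encoded_bits):
    """Usable bandwidth of one PCIe lane in one direction, in MB/s.

    Each transfer carries one bit on the wire; only payload_bits out of
    every encoded_bits are data (assumed encodings, see the text above).
    1 MB is taken as 10**6 bytes.
    """
    usable_gbit_s = gigatransfers_per_s * payload_bits / encoded_bits
    return usable_gbit_s * 1000 / 8  # Gbit/s -> MB/s


# PCIe 2.0: 5 GT/s with 8b/10b encoding (20% overhead)
PCIE2_MB_S = lane_bandwidth_mb_s(5.0, 8, 10)
# PCIe 3.0: 8 GT/s with 128b/130b encoding (about 1.5% overhead)
PCIE3_MB_S = lane_bandwidth_mb_s(8.0, 128, 130)

print(f"PCIe 2.0: {PCIE2_MB_S:.1f} MB/s per lane per direction")
print(f"PCIe 3.0: {PCIE3_MB_S:.1f} MB/s per lane per direction")
print(f"Ratio:    {PCIE3_MB_S / PCIE2_MB_S:.2f}x")
```

Under these assumptions a PCIe 3.0 lane delivers roughly 985 MB/s in each direction against 500 MB/s for PCIe 2.0. The transfer rate only rises by 60%, so the rest of the near-doubling comes from the leaner encoding.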
81 Comments
alpha754293 - Tuesday, March 6, 2012 - link
Thanks for running those. Are those results with HTT or without?
If you can write a little more about the run settings that you used (with/without HTT, number of processes), that would be great.
Very interesting results though.
It would have been interesting to see what the power consumption and total energy consumption numbers would be for these runs (to see if having the faster processor would really be that beneficial).
Thanks!
alpha754293 - Tuesday, March 6, 2012 - link
I should work with you more to get you running some Fluent benchmarks as well.
But, yes, HPC simulations DO take a VERY long time. And we beat the crap out of our systems on a regular basis.
jhh - Tuesday, March 6, 2012 - link
This is the most interesting part to me, as someone interested in high network I/O. With the packets going directly into cache, as long as they get processed before they get pushed out by subsequent packets, the packet processing code doesn't have to stall waiting for the packet to be pulled from RAM into cache. Potentially, the packet never needs to be written to RAM at all, avoiding using that memory capacity. In the other direction, web servers and the like can produce their output without ever putting the results into RAM.
meloz - Tuesday, March 6, 2012 - link
I wonder if this Data Direct I/O Technology has any relevance to audio engineering? I know that latency is a big deal for those guys. In past I have read some discussion on latency at gearslutz, but the exact science is beyond me.
Perhaps future versions of protools and other professional DAWs will make use of Data Direct I/O Technology.
Samus - Tuesday, March 6, 2012 - link
wow. 20MB of on-die cache. thats ridiculous.
PwnBroker2 - Tuesday, March 6, 2012 - link
dont know about the others but not ATT. still using AMD even on the new workstation upgrades but then again IBM does our IT support, so who knows for the future.
the new xeon's processors are beasts anyways, just wondering what the server price point will be.
tipoo - Tuesday, March 6, 2012 - link
"AMD's engineers probably the dumbest engineers in the world because any data in AMD processor is not processed but only transferred to the chipset."...What?
tipoo - Tuesday, March 6, 2012 - link
Think you've repeated that enough for one article?
tipoo - Wednesday, March 7, 2012 - link
Like the Ivy bridge comments, just for future readers note that this was a reply to a deleted troll and no longer applies.
IntelUser2000 - Tuesday, March 6, 2012 - link
Johan, you got the percentage numbers for LS-Dyna wrong.
You said for the first one: the Xeon E5-2660 offers 20% better performance, the 2690 is 31% faster. It is interesting to note that LS-Dyna does not scale well with clockspeed: the 32% higher clockspeed of the Xeon E5-2690 results in only a 14% speed increase.
E5-2690 vs Opteron 6276: +46%(621/426)
E5-2660 vs Opteron 6276: +26%(621/492)
E5-2690 vs E5-2660: +15%(492/426)
In the conclusion you said the E5 2660 is "56% faster than X5650, 21% faster than 6276, and 6C is 8% faster than 6276"
Actually...
LS Dyna Neon-
E5-2660 vs X5650: +77%(872/492)
E5-2660 vs 6276: +26%(621/492)
E5-2660 6C vs 6276: +9%(621/570)
LS Dyna TVC-
E5-2660 vs X5650: +78%(10833/6072)
E5-2660 vs 6276: +35%(8181/6072)
E5-2660 6C vs 6276: +13%(8181/7228)
It's funny how you got the % numbers for your conclusions. It's merely the ratio of lower number vs higher number multiplied by 100.
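For reference, the arithmetic in the comment above can be reproduced with a short Python sketch. It assumes the quoted LS-Dyna Neon figures are run times where lower is better (the comment does not state the units), and the function and dictionary names are made up for illustration.

```python
# LS-Dyna Neon figures as quoted in the comment above.
# Assumption: these are run times, so lower is better.
NEON_TIMES = {
    "X5650": 872,
    "Opteron 6276": 621,
    "E5-2660 6C": 570,
    "E5-2660": 492,
    "E5-2690": 426,
}


def speedup_pct(slower_time, faster_time):
    """How much faster the quicker system is, in percent."""
    return (slower_time / faster_time - 1.0) * 100.0


def time_ratio_pct(slower_time, faster_time):
    """Lower number divided by higher number, times 100.

    This is the calculation the commenter describes as the mistake:
    it is a ratio of run times, not a speedup.
    """
    return faster_time / slower_time * 100.0


PAIRS = [
    ("E5-2660", "X5650"),
    ("E5-2660", "Opteron 6276"),
    ("E5-2660 6C", "Opteron 6276"),
    ("E5-2690", "Opteron 6276"),
    ("E5-2690", "E5-2660"),
]

for fast, slow in PAIRS:
    pct = speedup_pct(NEON_TIMES[slow], NEON_TIMES[fast])
    print(f"{fast} vs {slow}: +{pct:.0f}%")
```

For the E5-2660 against the X5650, `speedup_pct` gives about 77%, while `time_ratio_pct` gives about 56, which matches the "56% faster" figure the commenter is disputing.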