Positioning x86 Petascale Performance with MIC Architecture
Intel’s reputation is now on the line to demonstrate that Many Integrated Core architecture can compete against GPUs 

Figure 1: Die shot of Intel MIC Architecture co-processor codenamed “Auburn Isle” — heart of “Knights Ferry” SDP card.
The Texas Advanced Computing Center (TACC) announcement of its forthcoming 10 petaflop/s Stampede supercomputer utilizing the Intel Many Integrated Core (MIC) architecture demonstrates a substantial commitment by Intel to design hardware that accelerates highly parallel workloads. Expected to be operational at the beginning of 2013, Stampede will contain many thousands of the MIC-based “Knights Corner” co-processors. The TACC announcement also indicated that a planned upgrade of the Intel hardware will increase performance by 50 percent to 15 petaflop/s. The Stampede supercomputer will contain 272 terabytes (272,000 gigabytes) of total memory and 14 petabytes (14 million gigabytes) of disk storage. With the TACC announcement, Intel’s reputation is now on the line to demonstrate that the MIC architecture can compete against GPUs and other computer architectures in the “leadership class” HPC market space.

Recently, I had the pleasure and good fortune to be briefed by Intel on the MIC architecture at one of their Hillsboro, OR, campuses. While my meeting was limited to a certain extent, as I did not wish to receive any confidential information, Intel still provided me with a wealth of information about MIC. Following is a short summary.

The MIC architecture is intended to leverage Intel’s deep understanding of x86 processor design and remarkable 22 nanometer manufacturing capability to capture some of the massively-parallel market space that is now “interesting.” Unlike GPUs, Intel stressed that MIC has been designed to support existing programming models. All that is required is that the software be written to scale to large numbers of processing cores on a symmetric multiprocessing (SMP) computer. Thus OpenMP, TBB (Intel’s Threading Building Blocks), pthreads, MPI and even OpenCL applications are purported to run efficiently.
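To make that claim concrete, here is a minimal sketch (my own illustration, not Intel code) of the kind of existing SMP source that should carry over unchanged: an OpenMP reduction that the runtime spreads across however many cores the hardware provides.

```cpp
#include <cstddef>
#include <vector>

// A dot product written for any SMP machine. With OpenMP enabled, the
// reduction spreads across all available cores; without it, the pragma
// is ignored and the loop simply runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Nothing in the source names a core count; scaling to 50+ cores is the runtime's job, which is exactly the property Intel says MIC depends on.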

The MIC hardware has been specifically designed to make porting MPI codes easy and to support thread-level parallelism (TLP) using the x86 instruction set. The idea behind TLP is to give each processing core a sufficiently large number of actively running threads, so that at least one thread can be ready to run at any time because all its dependencies have been satisfied (e.g. all the data it needs has been fetched from memory and there are sufficient internal processing elements available to perform the work). A programmer who uses TLP is essentially placing a bet that at least one of their application threads per core will always be ready to run. The more threads the programmer uses, the better the odds are that high application performance will be achieved. Hardware support for TLP requires fast thread switching inside the core and the ability to keep multiple memory transactions “in-flight” across the cores to hide the latency of fetching data from external sources. Limitations in hardware and systems resources provide an upper bound on the number of threads that can be active at any time.
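The bet described above can be sketched in ordinary threading code (a hypothetical illustration, not Intel's API): oversubscribe each core with several software threads so that, when one stalls on a memory fetch, another is ready to run.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// TLP in miniature: launch several software threads per hardware core.
// threads_per_core = 4 mirrors the per-core count discussed in the text;
// the work item (an atomic increment) is just a placeholder.
long run_oversubscribed(unsigned threads_per_core, long work_items) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;                 // fallback if unknown
    unsigned nthreads = cores * threads_per_core;
    std::atomic<long> done{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            // Each thread claims a strided share of the work.
            for (long i = t; i < work_items; i += nthreads)
                done.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    return done.load();
}
```

The more threads in flight, the better the odds that every core always has runnable work, which is the wager TLP programmers are placing.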

Intel MIC Architecture-based Software Development Platform card called “Knights Ferry”
Figure 2: Intel MIC Architecture-based Software Development Platform card called “Knights Ferry”
The current generation of MIC hardware, Knights Corner, uses Pentium-derived processing cores with HyperThreading technology and a vector unit containing 16 SIMD elements. The chip is built on Intel’s state-of-the-art 22 nm manufacturing process. Intel mentioned that they leverage the in-order execution of the Pentium design to quickly identify threads that are ready to run, which supports TLP. Programmers can exploit the multiple instruction multiple data (MIMD) capability of each core to keep four threads active, along with the SIMD vector unit, to best address application needs. Fifty or more of these modified Pentium cores are connected via a bi-directional ring inside a single MIC chip.
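A 16-element single-precision vector unit can process 16 floats per instruction, but only if loops are written so the compiler can map them onto the SIMD lanes. A minimal sketch (the pragma is standard OpenMP, not a MIC-specific directive):

```cpp
#include <cstddef>

// SAXPY written in a dependence-free form the compiler can vectorize:
// each iteration is independent, so 16 consecutive iterations can be
// packed into one 16-wide SIMD instruction. Without OpenMP the pragma
// is ignored and the loop still computes the same result serially.
void saxpy(std::size_t n, float a, const float* x, float* y) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The MIMD threads and the SIMD lanes compose: each of the four active threads per core can be running vectorized loops like this one.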

Understanding balance ratios and how they characterize workloads is required to understand the MIC architecture, the current ambiguity in core count, and the tweaking of other design parameters by the Intel designers. (Balance ratios are discussed in greater depth in my Scientific Computing article, “HPC Balance and Common Sense.”1) When targeting the balance ratios of the applications that will give each generation of MIC hardware the greatest penetration into the HPC market space, the Intel designers can dial-in the characteristics of each hardware component inside the chip, including the number of processor cores (and threads per core), the memory I/O units (and cache size per unit), plus any fixed logic or special function units. The current Knights Corner hardware characteristics still appear to be “under evaluation,” which explains the evolution from the 32-core Knights Ferry design and accounts for some of the ambiguity surrounding the current 50+ core Knights Corner chip.

The bi-directional ring interconnect within each chip lies at the heart of the performance and flexibility of the MIC design, as it dictates both the bandwidth into the processing cores and the latency of accessing the caches on the memory I/O units. An integrated L1 cache and an L2 cache shared among groups of processor cores help to reduce dependency on the ring interconnect. The bandwidth to memory external to the MIC chip is controlled by the number of memory I/O units, which also act as multiple shared L3 caches for the internal MIC processors. Thus, the Intel designers can adjust the number of cores and memory I/O units to achieve the best ratio of flop/s to external memory bandwidth (bytes/s) for target workloads. Memory bandwidth per flop is a key processor metric. Size, power, cost and other considerations limit this crucial metric, which controls the performance of so many real-world applications.
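A back-of-the-envelope helper makes the balance-ratio argument concrete (the numbers in the usage note below are illustrative, not Intel specifications):

```cpp
// Balance-ratio sketch: how much external memory bandwidth a kernel
// demands to sustain a given flop rate.
struct Kernel {
    double flops_per_elem;  // arithmetic operations per element
    double bytes_per_elem;  // memory traffic per element
};

// Bandwidth (bytes/s) required to keep the chip busy at flop_rate (flop/s).
double required_bandwidth(Kernel k, double flop_rate) {
    double intensity = k.flops_per_elem / k.bytes_per_elem;  // flop/byte
    return flop_rate / intensity;                            // bytes/s
}
```

For a streaming kernel like SAXPY (2 flops and 12 bytes of traffic per single-precision element), sustaining 1 Tflop/s would require roughly 6 TB/s of external bandwidth, far more than any external memory system provides, which is why on-chip caches and data reuse dominate the design.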

With Knights Corner, Intel is focusing on problems that exhibit high data locality and data reuse. In this case, expect larger internal caches and lower external memory bandwidth. Interestingly, those larger caches imply that sparse matrix and graph-based problems might still show good performance. Much will depend on the performance characteristics of the external memory subsystems. The 272 terabytes of physical memory in the TACC Stampede supercomputer indicate that MIC is being targeted to run big-memory production jobs. Let’s keep our fingers crossed that the memory subsystem supports efficient irregular memory accesses and large page sizes!

It is not clear what workloads Intel is using to dial-in the Knights Corner and future designs. The TOP500 HPL Linpack benchmark is a fairly safe bet. Phillip Colella’s seven dwarfs and the 13 dwarfs proposed by David Patterson’s group for problems outside of HPC are obvious choices for target kernel workloads. My sense is that hardware random number generation is a candidate fixed function unit that can be used to accelerate Monte Carlo methods. Intel announced this capability will be in the Ivy Bridge processors,2 so I expect MIC will get this capability (although probably not in Knights Corner). I am not certain whether there will be acceleration of transcendental functions.

Regardless of programming model (OpenMP, TBB, pthreads, MPI, OpenCL, etcetera), achieving high performance on Knights Corner hardware will likely require the use of at least four threads per core and/or extensive use of the vector unit. My conversation at Intel took an interesting turn when we discussed TLP and the number of active threads per core. If Intel decides that four threads per core are sufficient to achieve high TLP efficiency, then a single MIC chip will support more than 200 concurrent threads of execution.5 If eight threads per core prove necessary, then expect more than 400 concurrent threads per MIC chip containing 50+ cores. This indicates that Intel is still evaluating the best number of threads per core, if not for the current Knights Corner release then for future generations of chips. To be safe, forward-thinking programmers should assume that the number of active threads per core will increase in later generations.
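The safe response in code is to hard-code neither number. A trivial helper (my own convention, not an Intel API) that reproduces the arithmetic above:

```cpp
// Derive the launch width from core count and threads per core instead
// of hard-coding either, so the same code retargets a future chip that
// wants 8 threads per core instead of 4.
unsigned launch_width(unsigned cores, unsigned threads_per_core) {
    return cores * threads_per_core;
}
```

With the figures from the text, launch_width(50, 4) gives the 200-thread case and launch_width(50, 8) the 400-thread case; in practice both inputs would come from a runtime query or a configuration setting rather than constants.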

Having used Knights Ferry hardware, I can say that synchronization on a single memory location can dramatically slow program execution, because it causes the cores to serialize when accessing one cache or memory location. Intel claims that the ring has sufficient capacity that threads not participating in the synchronization operation will not be substantially affected by the traffic generated by this worst-case scenario. While this is good news, I expect that lock-free and wait-free data structures will be very popular on MIC! I also expect that reference-counted C++ smart pointers will need to be refactored to run efficiently on the MIC architecture.
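A small illustration of why this matters (a generic C++ sketch, not MIC-specific code): the two functions below compute the same count, but the first funnels every update through one shared memory location, while the second gives each thread a private accumulator and combines the partial results once at the end.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Worst case: every thread hammers one atomic location, so the cores
// serialize on that single cache line.
long count_shared(unsigned nthreads, long per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i)
                counter.fetch_add(1);          // all threads contend here
        });
    for (auto& th : pool) th.join();
    return counter.load();
}

// Contention-free alternative: private accumulation, one combine step.
long count_private(unsigned nthreads, long per_thread) {
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&partial, t, per_thread] {
            long local = 0;                    // thread-private
            for (long i = 0; i < per_thread; ++i)
                ++local;
            partial[t] = local;                // one shared write per thread
        });
    for (auto& th : pool) th.join();
    long sum = 0;
    for (long p : partial) sum += p;
    return sum;
}
```

Restructuring in this style, rather than relying on a single hot location, is the same instinct that motivates lock-free and wait-free designs.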

Existing programming models assume cache coherency across the processing cores. Kudos to Intel for preserving this programming characteristic in a 50+ core chip, as maintaining cache coherency across a large number of processor cores is a known scaling issue in the SMP model. Still, programmers should check to see if there are performance implications of “too much data locality” that might cause serialization bottlenecks or generate high traffic across the ring interconnect and degrade performance. All in all, it is clear that much will rely on the programmer’s ability to avoid performance pitfalls, which emphasizes the importance of any profiling tools that will be provided with MIC.
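One concrete pitfall of "too much data locality" is false sharing: two threads updating adjacent variables that happen to share a cache line force the coherency protocol to bounce that line across the ring on every write. The usual fix is to pad hot per-thread data out to a full line (64 bytes is a typical x86 line size, assumed here):

```cpp
#include <cstddef>

// Typical x86 cache-line size; an assumption, not a published MIC figure.
constexpr std::size_t CACHE_LINE = 64;

// Aligning and padding each per-thread counter to a full cache line
// keeps neighboring threads' updates on separate lines, so coherency
// traffic for one counter never touches another.
struct alignas(CACHE_LINE) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == CACHE_LINE,
              "each counter occupies exactly one cache line");
```

A profiler that can attribute ring traffic to source lines would make this class of problem far easier to find, which is why the tooling question matters.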

The current Knights Corner boards are PCIe-based and thus resemble a GPU in form factor. While Intel is not releasing any information about the power efficiency of the Knights Corner chips, Colfax did release results for an SGEMM benchmark that used eight of the older 32-core Knights Ferry boards in a single workstation to achieve 7.4 TF/s.3 From this benchmark we can infer:
1. that a newer Knights Corner board should deliver multiple teraflop/s of single-precision performance per PCIe card and
2. that a single board will consume less than 300 watts per card to conform to the PCIe specification.

Intel announced teraflop/s double-precision Knights Corner performance at Supercomputing 2011. During my Hillsboro meeting, Intel did mention that there is plenty of extra space in the vector unit encodings for the future. I take this to mean that significant improvements in double-precision performance are to be expected. This thought is reinforced by the TACC announcement of the already-planned upgrade for the Stampede supercomputer that will provide a 50 percent increase in performance, to 15 petaflop/s.

So, is MIC a co-processor or a multi-core processor architecture? The answer is that MIC is currently packaged as a co-processor running across a PCIe bus. This packaging can easily be changed. As Intel told me, an application just needs to be able to scale to a large number of cores running in an SMP environment to run on MIC. Does this mean that Linux can run on MIC? Yes, but I believe MIC represents an opportunity to add lock-free algorithms that stretch the scalability of the Linux kernel on massively parallel SMP hardware.

MIC reuses much of the design effort that was put into Larrabee. For this reason, the paper, “Larrabee: A Many-Core x86 Architecture for Visual Computing,”4 provides a good starting point for more detailed information about Knights Corner. A picture and additional information can be found at “Intel Knights Corner: More than 50 cores at 22nm Tri-Gate.”5

Happy “high core count” computing!

1. “HPC Balance and Common Sense,” Scientific Computing.
2. Intel random number acceleration: ,2817,2391367,00.asp#fbid=f5C3BufZOat
3. “Intel readying MIC x64 coprocessor for 2012”
4. “Larrabee: A Many-Core x86 Architecture for Visual Computing”
5. “Intel Knights Corner: More than 50 cores at 22nm Tri-Gate”

Rob Farber is an independent HPC expert to startups and fortune 100 companies, as well as government and academic organizations. He may be reached at