The GPU Performance Revolution
Intel’s entry into the massively parallel chip market adds fuel to survival-of-the-fittest product evolution 

Five years ago, NVIDIA disrupted “business as usual” in the high performance computing industry with the release of CUDA in February 2007. The September 2011 announcement by the Texas Advanced Computing Center (TACC) of a MIC-based (Many Integrated Core) Stampede supercomputer shows that Intel has decided to compete against graphics processing units (GPUs) and other computer architectures in the “leadership class” HPC market space with the Knights Corner (KNC) many-core processor chip. While similarly packaged as a PCIe device, the MIC architecture differs significantly from current GPU architectures.

Table 1: GPGPU and MIC architectural approaches to massive parallelism

| | GPGPU | Intel MIC |
| --- | --- | --- |
| Degree of Parallelism | Fermi supports 512 concurrent SIMT threads of execution. Kepler will triple this number to 1,536 threads. | Knights Corner is expected to support between 200 and 256 concurrent threads. |
| Achieving High Performance | A per-SM hardware scheduler keeps multiple computational units busy by identifying and dispatching any ready-to-run SIMD instructions. | A compiler or programmer utilizes special streaming SIMD extensions (SSE)-like instructions to keep each per-core vector unit busy. |
| Achieving Power Efficiency | The per-SM SIMD execution model requires less supporting logic, leading to high power efficiency and floating-point performance. Expect a 3x increase in Kepler double-precision efficiency. | Leverages the simplicity of the original Pentium design and the floating-point capability of a 512-bit vector unit, along with the power savings resulting from a 22 nm manufacturing process. |
| Data-parallel acceleration | Data-parallel operations are spread across the SMs of one or more GPU devices. | Data-parallel operations are accelerated by the per-core vector units and are spread across the cores of one or more devices. |
| Task-parallel acceleration | Concurrent kernel execution allows multiple kernels to run on one or more SMs. | Concurrent threads can run multiple tasks on the device. |
| MPI acceleration | MPI jobs are accelerated by using one or more GPUs per MPI process and optimized data transfer capabilities like GPUDirect. | MPI jobs are accelerated by using one or more MIC devices per MPI process, or one MPI process per MIC core. |

The budding benchmark battles between proponents of AMD, NVIDIA and Intel hardware will certainly exploit architectural differences to highlight the performance benefits of one design over another. The forthcoming public, and likely very vocal, performance debates will spur many-core and memory system innovations that will revolutionize the performance of future generations of products. Vive la Revolution!

The fact that Intel now has made a substantial commitment to teraflops-capable, massively-parallel hardware devices comes as no surprise. With nearly 1.3 billion CUDA-enabled GPUs sold to date, NVIDIA has highlighted the high-stakes attractiveness of these devices in the consumer and HPC markets.

Many in the computer industry, myself included, have observed that CPU and GPU technologies are following convergent evolutionary paths. As I note in my Scientific Computing article, “HPC’s Future,” the failure of Dennard’s scaling laws forced chip manufacturers to switch to parallelism to increase processor performance. Due to power and heat issues, many-core processors have become a necessity, as it is no longer possible to significantly increase the performance of a single processing core.

Although MIC is packaged as a PCIe device, as GPUs are, Intel has taken a different architectural approach that reflects a more traditional multiple instruction multiple data (MIMD) and vector design. GPUs, in contrast, are built around a single instruction multiple data (SIMD) streaming multiprocessor (SM) computational unit. Efficient task-based parallelism on the GPU is supported by hardware acceleration of a single instruction multiple thread (SIMT) model that can load-balance multiple tasks across the GPU's SMs.
Table 1, from my article “Convergent Evolution or Head-on Collision: Parallel Programming Approaches for MIC and GPU,” [1] provides a more detailed summary of differences.

The Code Conundrum
This new era of multi- and many-core computing has been disruptive to the software industry, as it requires that existing applications be redesigned to exploit parallelism (rather than clock speed) to achieve high application performance on this new parallel hardware. During this transition time to massively parallel programming, the owners of legacy code bases are faced with some difficult choices, because there are no generic “recompile and run” solutions. As I noted in my Scientific Computing article, “Redefining What is Possible”:
“Legacy applications and research efforts that do not invest in multi-threaded software will not benefit from modern multi-core processors, because single-threaded and poorly scaling software will not be able to utilize extra processor cores. As a result, computational performance will plateau at or near current levels, placing the projects that depend on these legacy applications at risk of both stagnation and loss of competitiveness.”

While teraflop/sec performance is compelling, there is no guarantee that MIC or GPUs will deliver high performance (or even a performance benefit) for any given application. This uncertainty, coupled with the risk and costs of a porting effort, has kept many customers with legacy codes from investing in this new technology.

Chip manufacturers, and the industry as a whole, have invested heavily in several programming models to make porting efforts as fast and risk-free as possible. Not surprisingly, well-established programming models have attracted much attention:
• message passing interface (MPI)
• directive-based programming, like OpenMP and OpenACC
• common libraries providing fast Fourier transform (FFT) and basic linear algebra subprograms (BLAS) functionality
• language platforms based on a strong-scaling execution model (CUDA and OpenCL)

The current packaging of GPU and MIC massively parallel chips as external PCIe devices complicates each of these programming models. For example, the overhead incurred by host/device data transfers breaks an assumption made by the symmetric multiprocessing (SMP) execution model: that any thread can access any data in a shared memory system without paying a significant performance penalty. Efforts like OpenACC (and potentially OpenMP 4.0) are attracting attention because they provide a standard method to specify data locality. Standardization is moving quickly to prevent a “Tower of Babel” proliferation of incompatible pragma specifications.

Language platforms based on a strong-scaling execution model, such as CUDA and OpenCL, will likely perform well on both MIC and GPU architectures. Under strong scaling, the runtime of a fixed-size problem falls linearly with the number of processing elements, which gives the best reduction in parallel code runtime. Scaling behavior of the computational kernels should not be an issue unless global atomic operations are utilized.
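To make the strong-scaling idea concrete, the short Python sketch below computes speedup and parallel efficiency for a fixed-size problem from a set of runtimes. The function name `strong_scaling` and the timing values are hypothetical and chosen only for illustration. They are not measurements of any GPU or MIC device.

```python
def strong_scaling(timings):
    """Return {n: (speedup, efficiency)} for a fixed-size problem.

    timings maps the number of processing elements n to the measured
    runtime in seconds; it must include an entry for n == 1.
    """
    t1 = timings[1]
    result = {}
    for n, tn in sorted(timings.items()):
        speedup = t1 / tn          # how much faster than one element
        efficiency = speedup / n   # 1.0 means perfectly linear scaling
        result[n] = (speedup, efficiency)
    return result


# Hypothetical runtimes (seconds) for the same problem size.
timings = {1: 64.0, 2: 32.0, 4: 16.0, 8: 10.0}
table = strong_scaling(timings)

for n, (speedup, efficiency) in table.items():
    print(f"{n:3d} elements: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

In this made-up data set, scaling is linear up to four processing elements and then falls off at eight. That is the kind of pattern one might see when a serialized step, such as contended global atomic updates, begins to dominate.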

In all programming approaches, high performance can be achieved when the compute-intensive portions of the application conform to the following three rules of high-performance co-processor programming. If not, expect floating-point performance to be either PCIe or device-memory limited.
1. Transfer the data across the PCIe bus onto the device and keep it there.
2. Give the device enough work to do.
3. Focus on data reuse within the co-processor(s) to avoid memory bandwidth bottlenecks.
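One way to see why these rules matter is a roofline-style bound, in which the attainable floating-point rate is capped either by the arithmetic peak or by how quickly operands can be delivered. The Python sketch below is a simplified model, not a benchmark. The peak, PCIe, and device-memory figures are assumed round numbers and do not describe any particular product.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound on sustained performance.

    The device cannot exceed its arithmetic peak, and it cannot compute
    faster than data arrives: bandwidth (GB/s) * reuse (flops/byte).
    """
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Assumed, illustrative hardware numbers (not vendor specifications).
PEAK_GFLOPS = 1000.0      # a "teraflops-capable" co-processor
PCIE_GB_S = 8.0           # host/device transfer rate
DEVICE_MEM_GB_S = 150.0   # on-board memory bandwidth

cases = [
    # (description, limiting bandwidth, flops performed per byte moved)
    ("data re-sent over PCIe, little reuse", PCIE_GB_S, 0.25),
    ("data kept on device, little reuse", DEVICE_MEM_GB_S, 0.25),
    ("data kept on device, heavy reuse", DEVICE_MEM_GB_S, 10.0),
]

for description, bandwidth, intensity in cases:
    bound = attainable_gflops(PEAK_GFLOPS, bandwidth, intensity)
    print(f"{description:40s} <= {bound:7.1f} GFLOP/s")
```

Under these assumptions, only the third case, where data stays on the device and is reused heavily, reaches the arithmetic peak. The first case is PCIe limited and the second is device-memory limited.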

From an architectural point of view, MIC has the ability to be used as a 50- to 60-core standalone Linux computer connected to the host system by the PCIe bus. This is a very interesting capability that effectively satisfies my first rule of high-performance co-processor programming by using the co-processor as a separate computer. As discussed in my article “Convergent Evolution or Head-on Collision: Parallel Programming Approaches for MIC and GPU,” running as a standalone device might be limited by the processing time of sequential sections of code (due to slow Pentium performance relative to a modern high-clock-rate processing core), on-board memory capacity limitations, and the ability of the application to use the wide vector unit. Please consult that article for more information.
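The effect of sequential sections on a standalone many-core device can be sketched with an Amdahl's-law-style runtime model. The Python example below is a deliberately simple illustration. The function name `runtime`, the workload split, the core counts, and the 4x serial slowdown are all assumptions, and the model supposes that the parallel portion scales perfectly and that each core delivers the same parallel throughput.

```python
def runtime(serial_s, parallel_s, cores, serial_slowdown=1.0):
    """Amdahl-style runtime estimate in seconds.

    serial_s        time of the sequential sections on a fast reference core
    parallel_s      time of the parallel sections on one reference core
    cores           number of cores sharing the parallel work
    serial_slowdown factor by which a simpler core is slower on serial code
    """
    return serial_s * serial_slowdown + parallel_s / cores


# Hypothetical workload: 10 s sequential, 90 s parallelizable.
host_only = runtime(10.0, 90.0, cores=8)
standalone = runtime(10.0, 90.0, cores=60, serial_slowdown=4.0)

print(f"8 fast cores:              {host_only:6.2f} s")
print(f"60 simple cores, 4x slower serial: {standalone:6.2f} s")
```

With these assumed numbers, the 60-core device finishes the parallel work far sooner, yet the slower sequential sections leave its total runtime longer than that of the eight fast cores.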

With prototypes demonstrating a sustained terabit per second (a trillion bits of data every second) of I/O capability along with a large data capacity, small physical footprint, and a claimed 70 percent power savings over current technology, hybrid memory cubes hold hope for future large-memory co-processors that can match the floating-point capability of our current processors. This is clearly a technology to watch, as it holds the potential to revolutionize many aspects of massively parallel computing, including the impact of co-processors, by reducing or eliminating both PCIe and bandwidth limitations.

Intel’s entry into the massively parallel chip market will add fuel to the evolutionary growth of both GPUs and their own products. The fact that MIC takes a completely different architectural approach is an added bonus that will certainly affect both GPU and MIC design for future generations of processors. A Darwinian “survival-of-the-fittest” revolution in product evolution will stimulate the design and management teams at NVIDIA, Intel and AMD as they strive to expand their individual (and indirectly collective) market share of these powerful devices and obviate bottlenecks, such as the bandwidth limitations of the PCIe bus. The desirable characteristics of new products, such as hybrid memory cubes, also hold much potential to revolutionize future designs.

From a performance point of view, the Knights Corner chip looks to be competitive with GPUs as a teraflops-capable co-processor. Pricing information is not available for NVIDIA Kepler or Intel KNC products, so it is not possible at this time to make a price vs. performance comparison. The TACC announcement shows that Intel is definitely looking at high performance computing. Meanwhile, NVIDIA has established a strong market presence and massive base of CUDA developers with products starting around the $150 to $180 price range and extending to HPC products priced in the thousands of dollars.

For developers, scientists, and consumers, the future is going to be both fun and exciting!

1. Doctor Dobb’s Journal:

Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at

BLAS Basic Linear Algebra Subprograms • FFT Fast Fourier Transform • GPGPU General-purpose Computing on Graphics Processing Units • GPU Graphics Processing Unit • KNC Knights Corner • MIC Many Integrated Core • MIMD Multiple Instruction Multiple Data • MPI Message Passing Interface • SIMD Single Instruction Multiple Data • SIMT Single Instruction Multiple Thread • SM Streaming Multiprocessor • SMP Symmetric Multiprocessing • SSE Streaming SIMD Extensions • TACC Texas Advanced Computing Center