Rob Farber
Integrated ARM/GPU technology and standalone Intel Xeon Phi devices will accelerate the transition away from the PCIe bus

Recent announcements by Intel and NVIDIA indicate that massively parallel computing with GPUs and Intel Xeon Phi will no longer require passing data via the PCIe bus. The bad news is that these standalone devices are still in the design phase and are not yet available for purchase. Instead of residing on the PCIe bus as a second-class system component like a disk or network controller, the new Knights Landing processor announced by Intel at ISC’13 will be able to run as a standalone processor just like a Sandy Bridge or any other multi-core CPU. Meanwhile, NVIDIA’s release of native ARM compilation in CUDA 5.5 provides a necessary next step toward Project Denver, which is NVIDIA’s integration of a 64-bit ARM processor and a GPU. This combination, termed a CP-GP (or ceepee-geepee) in the media, can leverage the energy savings and performance of both architectures.

Of course, the NVIDIA strategy also opens the door to GPU acceleration of mobile phones and other devices in the ARM-dominated low-power, consumer and real-time markets. Within the next 12 to 24 months, customers should start seeing big-memory standalone systems based on Intel and NVIDIA technology that only require power and a network connection. A separate x86 computer to host one or more GPU or Intel Xeon Phi coprocessors will no longer be required.

The introduction of standalone GPU and Intel Xeon Phi devices will affect the design decisions made when planning the next generation of leadership-class supercomputers, enterprise data center procurements, and teraflop/s workstations. It will also affect how these devices are programmed, because working around the performance limitations of the PCIe bus and managing multiple memory spaces will no longer be compulsory.

Knights Landing
The Intel Knights Landing marketing team has been quick to jump on the revolutionary aspects of the Knights Landing standalone capability by highlighting in the ISC’13 announcement that this processor family is no longer “bound” by offloading bottlenecks. Instead, Intel marketing is building the expectation that Knights Landing will be used in standalone big-memory workstations and servers or, if desired, in devices that fit the existing PCIe Intel Xeon Phi coprocessor form factor. It is reasonably safe to assume that the 14nm manufacturing process will give the Knights Landing processor family excellent power efficiency, just like its Knights Corner predecessor. It is also safe to assume that Knights Landing processors will provide more cores than the current Knights Corner chips.

Equally important in the Intel announcement is the Knights Landing “integrated on-package memory.” Intel is not disclosing much, but it seems clear that Knights Landing will have a hierarchical memory subsystem consisting of an internal top-level memory subsystem that will deliver very high performance followed by the second memory tier comprised of the much larger and slower memory system on the motherboard.
Unlike memory manufactured directly on the chip, on-package memory can be built out of separate chips. (These memory chips could even be stacked, but that is probably not going to happen in Knights Landing.) Using separate memory chips in the processor package provides capacity and performance without consuming space on the processor itself or affecting chip yields. Instead, the capacity of the on-package memory will be limited by the cost, power and heat envelope of the package itself. In a sense, the complete Knights Landing processor package can be viewed simplistically as a miniaturized Intel Xeon Phi coprocessor that communicates via memory rather than the PCIe bus. While no mention of on-package memory capacity has been made by Intel, it is likely that the Knights Landing internal memory capacity will be measured in gigabytes rather than megabytes.

Project Denver
Meanwhile, NVIDIA continues an aggressive bottom-up/top-down strategy in its march toward Project Denver by supporting native ARM compilation in the recent CUDA 5.5 software release, with a corresponding announcement at ISC’13. Traditionally, ARM processors have not been floating-point powerhouses, but the market for ARM is huge. Currently, the number of ARM chips shipped per year exceeds the number of x86 processors sold by a margin approaching 10x. It is expected that this number will increase as enterprise customers start to adopt ARM servers in their data centers.

As Sumit Gupta, General Manager of the Tesla Accelerated Computing business unit at NVIDIA, points out, “We think GPU accelerators are going to be, in effect, the floating-point units for ARM processors.”
In short, just as removing the dependence on the PCIe bus makes GPUs and Intel Xeon Phi coprocessors first-class citizens inside the computer, the CUDA 5.5 development environment has the ability to make ARM a first-class citizen in the HPC and enterprise computing worlds. A happy CUDA ARM marriage will also bring GPU computing to the ARM-dominated mass mobile computing and remote device markets, with GPU-accelerated apps written in CUDA being downloaded by future billions of mobile phone users. This is indeed a bold strategy.

PCIe technology

PCIe-based accelerators do provide substantial benefits both in terms of programming and performance.

As the graphic shows, my freely available numerical optimization teaching code recently achieved an average sustained 2.2 PF/s (petaflop/s or 1,000 trillion floating-point operations per second) using 3,000 Intel Xeon Phi equipped computational nodes. The average sustained performance includes all communications overhead and performance variations that result when using such a large number of nodes. In a July 2013 presentation at TACC, I showed students in one slide that they can transition from a “Hello World” program to exascale computation for a wide class of numerical optimization problems. Three slides later, they also understand how to use the example code to teach their computers how to read aloud. This presentation is freely available for download on the Texas Advanced Computing Center Web site, while the example code is discussed and provided for download in my Intel Xeon Phi tutorial on the Dr. Dobb’s Web site. This same code also runs on GPUs with high performance. GPU-oriented readers can view a similar “Hello World to Exascale” set of slides incorporating big data in my two NVIDIA GTC (GPU Technology Conference) talks. In the near future, this teaching code will run on the 18,688-GPU Titan supercomputer.

One benefit of PCIe-based technology is that multiple devices can be plugged into the same computer to create workstations and computational nodes with outstanding performance. Even with the overhead of the PCIe bus and offload programming, the numerical optimization and machine learning teaching code mentioned in the previous paragraph can deliver 700 GF/s to 900 GF/s (gigaflop/s) average sustained performance per device. Thus, four GPUs or Intel Xeon Phi devices can be plugged in to create a multi-teraflop-per-second workstation.
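
The 2.2 PF/s figure quoted earlier for 3,000 nodes can be turned into a per-node rate with simple division. The following is a minimal sketch of that arithmetic, not output from the teaching code. It assumes the work is spread evenly across nodes, and the four-device workstation estimate assumes, purely for illustration, that each device sustains the per-node average.

```python
# Back-of-the-envelope check of the per-node rate implied by the text.
PETA = 1e15
GIGA = 1e9

def per_node_gflops(total_flops, nodes):
    """Average sustained flop rate per node, in GF/s."""
    return total_flops / nodes / GIGA

total_flops = 2.2 * PETA   # 2.2 PF/s average sustained, from the article
nodes = 3000               # Xeon Phi equipped nodes, from the article

per_node = per_node_gflops(total_flops, nodes)
print(f"{per_node:.0f} GF/s per node")

# Assumption (not from the article): four devices in one workstation,
# each sustaining the per-node average computed above.
workstation_tflops = 4 * per_node / 1000.0
print(f"{workstation_tflops:.1f} TF/s for a four-device workstation")
```

The per-node result lands in the high hundreds of gigaflop/s, which is why a handful of devices is enough to reach the multi-teraflop/s range.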

Similarly, the aggregate memory performance of multiple coprocessors can be a compelling advantage for a broad class of problems in data mining and analytics. Graph-based algorithms are an important representative of this class of low-flop but memory-intensive computational problems. Instead of floating-point operations per second, performance on graph problems is evaluated in terms of edges per second, because an edge represents the basic unit of a relationship between entities. Edges can represent relationships between concepts in a semantic graph, pathways in a biochemical graph, or the flow of money within an economy. The boon and bane of big data is the sheer amount of information (e.g. numbers of edges) that can be used to create large graphs inside the computer.
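
To make the edges-per-second metric concrete, here is a minimal sketch of a breadth-first search that counts every edge it examines and divides by the elapsed time. The function name `bfs_edges_traversed` and the toy graph are hypothetical, chosen only for illustration; a graph this small says nothing about real performance.

```python
import time
from collections import deque

def bfs_edges_traversed(adj, source):
    """Breadth-first search returning (level of each reached vertex,
    number of edges examined)."""
    level = {source: 0}
    frontier = deque([source])
    edges = 0
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            edges += 1                     # every examined edge counts
            if v not in level:
                level[v] = level[u] + 1
                frontier.append(v)
    return level, edges

# Hypothetical toy graph stored as directed adjacency lists.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}

start = time.perf_counter()
levels, edges = bfs_edges_traversed(toy_graph, 0)
elapsed = time.perf_counter() - start

# Guard against a zero reading from the timer on very small inputs.
rate = edges / elapsed if elapsed > 0 else float("inf")
print(f"examined {edges} edges at {rate:.0f} edges/s")
```

On a large graph the same ratio, edges examined divided by wall-clock time, is the figure of merit the text refers to.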

By exploiting the aggregate memory bandwidth of multiple devices in a workstation, a data scientist can reap tremendous performance advantages without the latency and complexity of a Hadoop or other big data analytic framework. For example, a workstation containing four GPUs or Intel Xeon Phi coprocessors can provide well over half a terabyte per second of aggregate memory bandwidth, or roughly 10x the capability of the workstation memory system by itself.
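
The aggregate-bandwidth claim is a simple multiplication. The sketch below shows the arithmetic with placeholder numbers: the per-device and host bandwidth values are assumptions chosen for illustration, not measurements or vendor specifications, so substitute the figures for your own hardware.

```python
def aggregate_bandwidth_gbs(num_devices, per_device_gbs):
    """Total memory bandwidth across identical devices, in GB/s."""
    return num_devices * per_device_gbs

# Assumed, illustrative values (not from the article or a datasheet).
PER_DEVICE_GBS = 150.0   # hypothetical sustained bandwidth of one device
HOST_GBS = 60.0          # hypothetical host memory system bandwidth

device_total_gbs = aggregate_bandwidth_gbs(4, PER_DEVICE_GBS)
speedup_vs_host = device_total_gbs / HOST_GBS

print(f"{device_total_gbs:.0f} GB/s aggregate, "
      f"{speedup_vs_host:.0f}x the host memory system")
```

With these placeholder inputs, four devices exceed half a terabyte per second and come out at about ten times the assumed host figure, which matches the shape of the estimate in the text.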

Graph-based computation
Graph problems are notorious for being memory latency limited rather than memory bandwidth limited. However, the same latency-hiding technology that helps to sustain high flop rates in the massively parallel environment inside both GPUs and Intel Xeon Phi devices also helps to hide memory latency for graph-based computations. As a result, these devices are surprisingly competent graph analytic engines, especially when multiple devices are used inside a workstation.

The DARPA XDATA program has been investigating GPU and Intel coprocessor technology for information mining purposes. The freely available mpgraph SourceForge project, administered by SYSTAP LLC as part of their participation in the XDATA program, has several commonly used graph algorithms implemented via a GAS (Gather Apply Scatter) framework. The GAS framework is a powerful graph processing abstraction popularized by GraphLab and Google’s Pregel system. The beauty of the GAS framework is that the computation is expressed concisely, in a massively vertex-parallel fashion that is portable and amenable to optimization by the underlying GAS engine.
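
The gather, apply and scatter phases can be sketched in a few lines. The example below is a serial illustration of the general pattern applied to single-source shortest paths; it is not the mpgraph API, and the function name `gas_sssp` and the input layout are hypothetical. It assumes non-negative edge weights. In a real engine, each phase would run in parallel across the active vertices.

```python
import math

def gas_sssp(in_edges, num_vertices, source):
    """Single-source shortest paths as gather/apply/scatter supersteps.

    in_edges[v] is a list of (u, weight) pairs, one per edge u -> v.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0.0

    # Scatter needs out-neighbours, so derive them from the in-edge lists.
    out_nbrs = [[] for _ in range(num_vertices)]
    for v, edges in enumerate(in_edges):
        for u, _ in edges:
            out_nbrs[u].append(v)

    active = set(out_nbrs[source])
    while active:
        # Gather: each active vertex reduces (min) over its in-edges,
        # reading only the distances from the previous superstep.
        gathered = {
            v: min((dist[u] + w for u, w in in_edges[v]), default=math.inf)
            for v in active
        }
        # Apply: update vertex state when the gathered value improves it.
        changed = []
        for v, candidate in gathered.items():
            if candidate < dist[v]:
                dist[v] = candidate
                changed.append(v)
        # Scatter: activate the out-neighbours of every changed vertex.
        active = set()
        for v in changed:
            active.update(out_nbrs[v])
    return dist

# Hypothetical 4-vertex graph: 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (1).
toy_in_edges = [[], [(0, 4.0), (2, 2.0)], [(0, 1.0)], [(1, 1.0)]]
print(gas_sssp(toy_in_edges, 4, 0))
```

Because each vertex only reads neighbour state in gather and only writes its own state in apply, the per-vertex work is independent within a superstep, which is what makes the pattern map well onto massively parallel hardware.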

Those who download mpgraph can compile the Thrust-based source code GAS engine to run on GPUs, Intel Xeon Phi coprocessors, and multi-core processors using OpenMP and Intel Threading Building Blocks. Currently implemented algorithms include Breadth-First Search (BFS), PageRank, Single Source Shortest Paths (SSSP), connected components, and both exact and approximate betweenness centrality. The approximate betweenness centrality algorithm is based on the work by David Bader and Kamesh Madduri.

The DARPA XDATA program is also about using comparison to grow and select best-of-breed solutions. As a result, highly optimized “to the metal” GPU implementations of several graph algorithms have also been developed as part of the XDATA program. Several of these bare metal graph implementations, created by Yangzihao Wang at the University of California at Davis, are also freely available and can be downloaded from the XDATA repository, which will become open in the near future.

Interested readers can perform their own best-of-breed evaluation, comparing GPU and Intel Xeon Phi coprocessor performance based on the SYSTAP and “to the metal” codes against the optimized and freely available parallel GraphLab and GraphChi packages for multicore processors. Of special interest is the GraphChi project, an offshoot of GraphLab, which allows very large graphs to be efficiently processed out-of-core on a single laptop or workstation. GraphChi is another freely downloadable project on Google Code.

The future of massively parallel computing clearly resides in the first-class integration of parallel processors with the memory and IO subsystems. The dependence on the PCIe bus is a vestigial artifact that has lingered due to the tight financial and market interrelationships between the high-end gaming community and the current generation of computational GPGPU devices. The transition to integrated ARM/GPU technology (e.g. Project Denver) and standalone Intel Xeon Phi devices will accelerate the transition away from the PCIe bus. While the PCIe bus has been useful, the loss of dependence on it will be mourned by few.

Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at