Raising the Speed Limit
New GPU-to-GPU communications model increases cluster efficiency

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics accelerators a compelling platform for computationally demanding tasks across a wide variety of application domains. Thanks to the great computational power of the graphics processing unit (GPU), general-purpose computation on graphics processing units (GPGPU) has proven valuable in many areas of science and technology. The modern GPU is a highly data-parallel processor, optimized to deliver very high floating-point throughput on problems that map well to the single-program, multiple-data (SPMD) model.

On a GPU, this model works by launching thousands of threads that run the same program on different pieces of data. The GPU's ability to switch rapidly among this large pool of threads keeps the hardware busy at all times, effectively hiding memory latency; combined with the several layers of very high-bandwidth memory in modern GPUs, this yields high sustained performance. It is what enables GPU-accelerated HPC systems to meet the performance levels demanded by ever-increasing simulation complexity.
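The SPMD idea — one program, many threads, each on its own slice of the data — can be sketched on the CPU with POSIX threads. This is only an analogy for illustration (a real GPU launches thousands of lightweight hardware threads, not pthreads); the function and type names are ours, not from any GPU API.

```c
#include <pthread.h>

#define NTHREADS 8

typedef struct { float *data; int begin; int end; } slice_t;

/* The "single program": every thread runs this same routine. */
static void *scale(void *p) {
    slice_t *s = (slice_t *)p;
    for (int i = s->begin; i < s->end; i++)
        s->data[i] *= 2.0f;   /* same operation, different elements */
    return NULL;
}

/* Launch NTHREADS copies of the same routine, each on its own slice
 * of the array ("multiple data"). */
void spmd_scale(float *data, int n) {
    pthread_t t[NTHREADS];
    slice_t s[NTHREADS];
    int chunk = n / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        s[i] = (slice_t){ data, i * chunk,
                          (i == NTHREADS - 1) ? n : (i + 1) * chunk };
        pthread_create(&t[i], NULL, scale, &s[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
}
```

On a GPU the scheduler hides memory latency by swapping in ready threads whenever one stalls; with only eight OS threads the sketch cannot show that effect, only the programming model.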

GPU-based clusters are used for compute-intensive tasks such as finite element computations, gas dispersion simulations, heat shimmering simulations, nuclear explosion simulations and Monte Carlo simulations. Two of the world's petascale systems, according to the June 2010 release of the TOP500 list of supercomputers, use GPUs to achieve the desired performance while reducing total system power consumption. Because GPUs provide the highest core count and floating-point capability per node, a high-speed network is required to connect the GPU-central processing unit (CPU) platforms, or servers, to one another. In many cases, InfiniBand has been chosen as the high-speed interconnect for such systems.

Figure 1: Non-efficient GPU-InfiniBand data transfer mechanism
By providing low latency, high bandwidth and extremely low CPU overhead, InfiniBand has become the most widely deployed high-speed interconnect for high-performance computing, replacing proprietary or lower-performance solutions. The InfiniBand Architecture (IBA) is an industry-standard fabric designed to provide high bandwidth and low latency, scalability to 10,000 nodes and multiple CPU/GPU cores per server platform, and efficient utilization of compute resources. Mellanox ConnectX-2 InfiniBand adapters and IS5000 switches provide up to 40Gb/s of bandwidth between servers and up to 120Gb/s between switches. This bandwidth is matched with ultra-low application latency of 1µsec, and switch latencies under 100ns, enabling efficient scale-out of compute systems.

GPU-GPU communication model
While GPUs have been shown to provide worthwhile acceleration, with benefits in both price/performance and power/performance, several aspects of GPU-based clusters could still be improved for higher performance and efficiency. One issue with deploying clusters built from multi-GPU nodes is the interaction between the GPUs and the high-speed InfiniBand network, in particular the way GPUs use the network to transfer data between one another.

Figure 2: Efficient GPU InfiniBand data transfer mechanism (GPUDirect)
Prior to the development of GPUDirect technology, a performance issue existed between the user-mode DMA mechanism used by GPU devices and InfiniBand's remote direct memory access (RDMA) technology. The problem was the lack of a software/hardware mechanism for "pinning" pages of virtual memory to physical pages in a way that could be shared by both the GPU and the networking device. In general, GPUs use pinned memory in the host to increase DMA performance, either by eliminating the need for intermediate buffers or by pinning and unpinning regions of memory on the fly. The use of pinned memory buffers allows well-written code to achieve zero-copy message-passing semantics via RDMA.
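"Pinning" simply means locking virtual pages to physical frames so a DMA engine can target them without risk of a page fault. As a minimal sketch of the idea, the POSIX `mlock()` call does this for ordinary host memory; it is the same underlying operation that GPU and InfiniBand drivers perform when they register a buffer (the helper names below are ours, not a driver API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Allocate a buffer and pin its pages so they stay resident in
 * physical memory. mlock() is the POSIX analogue of the page-locking
 * a GPU or InfiniBand driver does when registering a DMA buffer. */
void *alloc_pinned(size_t len) {
    void *buf = malloc(len);
    if (buf != NULL && mlock(buf, len) != 0) {
        free(buf);            /* pinning failed (e.g. RLIMIT_MEMLOCK) */
        return NULL;
    }
    return buf;
}

/* Unpin and release the buffer. */
void free_pinned(void *buf, size_t len) {
    munlock(buf, len);
    free(buf);
}
```

Note that pinnable memory is a limited resource (capped by `RLIMIT_MEMLOCK` on Linux), which is one reason drivers pin and unpin regions on the fly rather than pinning everything.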

The lack of a mechanism for coordinating memory pinning between the user-mode accelerator libraries and the InfiniBand message-passing libraries creates a performance problem: a third device, the host CPU, must be responsible for moving the data between the separate GPU and InfiniBand pinned memory regions. The issue is depicted in Figure 1, which illustrates that a data transfer between remote GPUs requires three steps:
1. The GPU writes data to a host pinned memory, marked as system memory 1.
2. The host CPU copies the data from system memory 1 to system memory 2.
3. The InfiniBand device reads data from its pinned memory (system memory 2) and sends it to the InfiniBand pinned memory on the remote node.
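The three steps above can be sketched as three copies through two separate pinned regions. This is a deliberately simplified model (the buffer names are illustrative, not a real driver API; real steps 1 and 3 are DMA/RDMA transfers, not `memcpy`), but it makes the redundant middle copy visible:

```c
#include <string.h>

/* Sketch of the pre-GPUDirect path from Figure 1: the GPU's pinned
 * region (sysmem1) and the InfiniBand adapter's pinned region
 * (sysmem2) are separate, so the host CPU must copy between them. */
void staged_send(const char *gpu_data, size_t len,
                 char *sysmem1, char *sysmem2, char *wire) {
    memcpy(sysmem1, gpu_data, len);  /* step 1: GPU DMAs into its pinned region  */
    memcpy(sysmem2, sysmem1, len);   /* step 2: host CPU copy (the overhead)     */
    memcpy(wire, sysmem2, len);      /* step 3: HCA reads sysmem2 and RDMAs out  */
}
```

Only step 2 consumes host CPU cycles; steps 1 and 3 are performed by the GPU and network DMA engines.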

Step 2 not only requires host CPU involvement, which reduces CPU efficiency and introduces CPU overhead and CPU noise (CPU interrupts), but also increases the latency of GPU data communications. This overhead can account for up to 30 percent of the communication time, which can dramatically reduce the performance of latency-sensitive applications.
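An Amdahl-style bound follows directly from that figure: if the CPU copy is a fraction f of communication time, removing it can speed up communication by at most 1/(1 − f), about 1.43x for f = 0.30. The 30 percent figure is from the text; the function is just the arithmetic made explicit:

```c
/* Upper bound on communication speedup from eliminating a phase that
 * occupies `overhead_fraction` of the total communication time. */
double comm_speedup(double overhead_fraction) {
    return 1.0 / (1.0 - overhead_fraction);
}
```

This bounds only the communication portion; the end-to-end application gain depends on how communication-bound the workload is.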

Figure 3: Cellulose benchmark results with ECC enabled
GPUDirect communications model
The ultimate communication mechanism between GPUs and InfiniBand devices would perform DMA and RDMA operations directly between GPUs, bypassing the host entirely. Such an interface could conceivably allow an RDMA from one GPU directly to a GPU on a remote host. An intermediate solution can still use host memory for the data transactions, but eliminate the host CPU's involvement by having the acceleration devices and the InfiniBand adapters share the same pinned memory, as shown in Figure 2.

The new hardware/software mechanism, called GPUDirect, eliminates the need for the CPU to be involved in the data movement. It enables higher GPU-based cluster efficiency and paves the way for the creation of "floating point services." GPUDirect is based on a new interface between the GPU and the InfiniBand device that allows both devices to share pinned memory buffers, and lets the GPU notify the network device to stop using a pinned region before that region is destroyed. This communication interface allows the GPU to retain control of the user-space pinned memory while avoiding data-reliability issues.
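In the same simplified model used for Figure 1, the GPUDirect path collapses to two DMA transfers through one shared pinned region, with no CPU copy in between (again a sketch with illustrative names, not real driver code):

```c
#include <string.h>

/* Sketch of the GPUDirect path from Figure 2: the GPU driver and the
 * InfiniBand driver register the SAME pinned region, so the host CPU
 * copy of Figure 1 disappears entirely. */
void gpudirect_send(const char *gpu_data, size_t len,
                    char *shared_pinned, char *wire) {
    memcpy(shared_pinned, gpu_data, len); /* GPU DMAs into the shared region   */
    memcpy(wire, shared_pinned, len);     /* HCA RDMAs directly from that region */
}
```

Compared with the three-step staged path, one system-memory buffer, one copy and the associated CPU interrupts are gone, which is where the latency and efficiency gains come from.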

Performance evaluation
To evaluate the performance advantage of GPUDirect, we chose Amber, a molecular dynamics software package that is among the most widely used programs for biomolecular studies, with an extensive user base. Amber was developed in an active collaboration between David Case at Rutgers University, Tom Cheatham at the University of Utah, Tom Darden at the National Institute of Environmental Health Sciences (now at OpenEye), Ken Merz at Florida, Carlos Simmerling at SUNY-Stony Brook, Ray Luo at UC Irvine, and Junmei Wang at Encysive Pharmaceuticals. Amber was originally developed under the leadership of Peter Kollman.

One of the new features of Amber 11 is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. Therefore, we selected Amber as one of the first applications to be tested with the new GPUDirect technology.

The test environment, part of the HPC Advisory Council computing center, consisted of eight compute nodes connected via Mellanox ConnectX-2 adapters and switches. Each node includes one NVIDIA Fermi C2050 GPU. The Amber performance results with and without GPUDirect are presented in Figure 3 for the Cellulose (408,609 atoms) benchmark.

HPC demonstrates ever-increasing demands for more computing power, as seen on the TOP500 list of supercomputers and in petascale programs worldwide. GPU-based compute clusters are becoming the most cost-effective way to provide the next level of compute resources. For example, building a petascale proprietary system using Cray requires 20,000 nodes, while achieving the same level of performance using InfiniBand and GPU-based clusters requires only 5,000 nodes. The advantages of the second option are clear: space, manageability and affordability.

As GPU-based computing grows in popularity, there is a need to create direct communications between GPUs using the fastest available interconnects, such as InfiniBand, and to modify applications to utilize GPUs and parallel GPU computation more effectively. In this article, we have reviewed the new GPUDirect technology and demonstrated up to 33 percent performance improvement (which translates to the capability to run 33 percent more jobs per day) on only eight nodes, each with a single NVIDIA Fermi GPU and Mellanox ConnectX-2 InfiniBand adapter.

Gilad Shainer is Chairman of the HPC Advisory Council. Ali Ayoub is lead developer at Mellanox Technologies. Pak Lui is HPC Advisory Council Cluster Center Manager, and Tong Liu is Director of the HPC Advisory Council China Center of Excellence. They may be reached at