Raising the Speed Limit
New GPU-to-GPU communications model increases cluster efficiency
A GPU works by launching thousands of threads that run the same program on different pieces of data. The GPU's ability to switch rapidly among this large pool of threads keeps the hardware busy at all times, effectively hiding memory latency; combined with the several layers of very high-bandwidth memory available in modern GPUs, this yields high sustained performance. These capabilities enable HPC systems to meet the performance requirements mandated by ever-increasing simulation complexity.
GPU-based clusters are being used to perform compute-intensive tasks such as finite element computations, gas dispersion simulations, heat shimmering simulations, accurate nuclear explosion simulations and Monte Carlo simulations. Two of the world’s petascale systems, according to the June 2010 release of the TOP500 list of the world’s supercomputers, use GPUs to achieve the desired performance while reducing total system power consumption. Because GPUs provide the highest core counts and floating-point throughput, a high-speed network is required to connect the GPU-central processing unit (CPU) platforms, or servers. In many cases, InfiniBand has been chosen as the high-speed network for such systems.
|Figure 1: Inefficient GPU-InfiniBand data transfer mechanism|
GPU-GPU communication model
While GPUs have been shown to provide worthwhile performance acceleration, yielding benefits in both price/performance and power/performance, several areas of GPU-based clusters could be improved to provide higher performance and efficiency. One issue with deploying clusters of multi-GPU nodes involves the interaction between the GPUs and the high-speed InfiniBand network, in particular the way GPUs use the network to transfer data among themselves.
|Figure 2: Efficient GPU InfiniBand data transfer mechanism (GPUDirect)|
The lack of a mechanism for managing pinned memory shared between user-mode accelerator libraries and InfiniBand message-passing libraries creates performance issues: a third device, the host CPU, must be responsible for moving the data between the separate GPU and InfiniBand pinned memory regions. The issue is depicted in Figure 1, which shows that a data transfer between remote GPUs requires three steps:
1. The GPU writes data to a host pinned memory, marked as system memory 1.
2. The host CPU copies the data from system memory 1 to system memory 2.
3. The InfiniBand device reads data from its pinned memory (system memory 2) and sends it to the InfiniBand pinned memory on the remote node.
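The three-step path above can be modeled in a short Python sketch, where each pinned region is a plain buffer and every hop is an explicit copy. The buffer and function names mirror Figure 1 for illustration only; none of this is a real CUDA or InfiniBand verbs API.

```python
# Model of the staged (non-GPUDirect) transfer path of Figure 1.
# Plain buffers stand in for pinned memory regions; copies stand in
# for device DMA (steps 1 and 3) and the host-CPU memcpy (step 2).

def transfer_without_gpudirect(gpu_data: bytes) -> tuple[bytes, int]:
    copies = 0

    # Step 1: the GPU DMAs its result into host pinned "system memory 1".
    sysmem1 = bytearray(gpu_data)
    copies += 1

    # Step 2: the host CPU copies the data between the two separate
    # pinned regions -- the step that burns CPU cycles and adds latency.
    sysmem2 = bytearray(sysmem1)
    copies += 1

    # Step 3: the InfiniBand adapter reads "system memory 2" and sends
    # it to the InfiniBand pinned region on the remote node.
    remote_pinned = bytes(sysmem2)
    copies += 1

    return remote_pinned, copies

data, hops = transfer_without_gpudirect(b"simulation results")
assert data == b"simulation results" and hops == 3
```

The point of the sketch is the copy count: three hops, one of which (step 2) exists only to shuttle data between two pinned regions that the devices cannot share.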
Step 2 not only requires host CPU involvement, which reduces CPU efficiency and introduces CPU overhead and CPU noise (CPU interrupts), but it also increases the latency of GPU data communications. Such overhead can account for 30 percent of the communication time, which can dramatically reduce the performance of latency-sensitive, high-performance applications.
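A back-of-the-envelope model shows what removing that overhead is worth at the application level. The 30 percent figure comes from the text above; the communication fraction below is an assumed example, not a measured value.

```python
# If communication takes some fraction of total runtime (assumed here
# to be 40% -- an illustrative number, not a benchmark result), and 30%
# of that communication time is CPU-copy overhead (from the text),
# then eliminating the copy removes 0.40 * 0.30 = 12% of the runtime.

comm_fraction = 0.40   # assumed share of runtime spent in communication
overhead_share = 0.30  # share of communication time lost to the CPU copy

runtime_saved = comm_fraction * overhead_share
speedup = 1.0 / (1.0 - runtime_saved)

print(f"runtime saved: {runtime_saved:.0%}, speedup: {speedup:.2f}x")
```

For this assumed workload the savings is 12 percent of runtime, or about a 1.14x speedup; the more communication-bound the application, the larger the gain.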
|Figure 3: Cellulose benchmark results with ECC enabled|
The ultimate communication mechanism between GPUs and InfiniBand devices would perform DMA and RDMA operations directly between the GPUs, bypassing the host entirely. Such an interface could conceivably allow RDMA from one GPU device directly to another GPU on a remote host. An intermediate solution can use host memory for the data transactions but eliminate the host CPU’s involvement by having the acceleration devices and the InfiniBand adapters share the same pinned memory, as shown in Figure 2.
The new hardware/software mechanism, called GPUDirect, eliminates the need for the CPU to be involved in the data movement. It enables higher GPU-based cluster efficiency and paves the way for the creation of “floating point services.” GPUDirect is based on a new interface between the GPU and the InfiniBand device that enables both devices to share pinned memory buffers and lets the GPU notify the network device to stop using a pinned buffer before it is destroyed. This communication interface allows the GPU to maintain control of the user-space pinned memory and eliminates data reliability issues.
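The shared-pinned-buffer path of Figure 2 can be modeled with the same kind of Python sketch: both devices address one pinned region, so the intermediate CPU copy disappears. As before, plain buffers stand in for pinned memory and driver interfaces; nothing here is an actual CUDA or verbs API.

```python
# Model of the GPUDirect transfer path of Figure 2: the GPU and the
# InfiniBand adapter share a single host pinned region, so the
# host-CPU copy between two separate pinned regions is eliminated.

def transfer_with_gpudirect(gpu_data: bytes) -> tuple[bytes, int]:
    copies = 0

    # Step 1: the GPU DMAs its result into the shared pinned buffer.
    shared_pinned = bytearray(gpu_data)
    copies += 1

    # Step 2: the InfiniBand adapter reads the very same buffer and
    # sends it to the pinned region on the remote node -- no
    # intermediate CPU memcpy, no CPU interrupt on the data path.
    remote_pinned = bytes(shared_pinned)
    copies += 1

    return remote_pinned, copies

data, hops = transfer_with_gpudirect(b"simulation results")
assert data == b"simulation results" and hops == 2
```

Compared with the three-hop staged path of Figure 1, the same payload now moves in two device-driven hops, which is where the latency and CPU-efficiency gains come from.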
To evaluate the performance advantage of GPUDirect, we decided to use Amber, a molecular dynamics software package that is one of the most widely used programs for biomolecular studies, with an extensive user base. Amber was developed in an active collaboration between David Case at Rutgers University, Tom Cheatham at the University of Utah, Tom Darden at the National Institute of Environmental Health Sciences (now at OpenEye), Ken Merz at Florida, Carlos Simmerling at SUNY-Stony Brook, Ray Luo at UC Irvine, and Junmei Wang at Encysive Pharmaceuticals. Amber was originally developed under the leadership of Peter Kollman.
One of the new features of Amber 11 is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. Therefore, we selected Amber as one of the first applications to be tested with the new GPUDirect technology.
The test environment is part of the HPC Advisory Council computing center and included eight compute nodes connected via Mellanox ConnectX-2 adapters and switches. Each node includes one NVIDIA Fermi C2050 GPU. The Amber performance results with and without GPUDirect are presented in Figure 3 for the Cellulose (408,609 atoms) benchmark.
HPC shows ever-increasing demand for more computing power. We see this demand in the TOP500 supercomputers list and in petascale programs worldwide. GPU-based compute clusters are becoming the most cost-effective way to provide the next level of compute resources. For example, building a petascale proprietary system using Cray requires 20,000 nodes, while achieving the same level of performance using InfiniBand and GPU-based clusters requires only 5,000 nodes. The advantages of the latter option are clear: space, management and affordability.
As GPU-based computing becomes popular, there is a need to create direct communications between GPUs using the fastest available interconnect solutions, such as InfiniBand, and to modify applications to utilize GPUs and parallel GPU computations more effectively. In this article, we have reviewed the new GPUDirect technology and demonstrated up to 33 percent performance improvement (which translates to the capability to run 33 percent more jobs per day) on only eight nodes, each with a single NVIDIA Fermi GPU and Mellanox ConnectX-2 InfiniBand adapter.
Gilad Shainer is Chairman of the HPC Advisory Council. Ali Ayoub is lead developer at Mellanox Technologies. Pak Lui is HPC Advisory Council Cluster Center Manager, and Tong Liu is Director of the HPC Advisory Council China Center of Excellence. They may be reached at editor@ScientificComputing.com.