Scientific Computing

Maximizing MultiGPU Machines
Fri, 11/04/2011 - 10:06am
Rob Farber

Multiple GPU and hybrid CPU+GPU performance is heavily dependent upon vendor implementation of the PCIe bus

GPU technology provides orders of magnitude speedups with a single GPU over a conventional processor. Plugging two or four GPUs into a workstation or computational node can double or quadruple the performance of computational applications and games. Even more performance can be achieved by utilizing the multicore capability of the host processor in concert with the GPUs in a system.

While attractive, multiple GPU and hybrid CPU+GPU performance is heavily dependent upon the vendor implementation of the PCIe bus on the computer motherboard. Beware vendor shortcuts on the PCIe bus! With the right PCIe chipset, multiGPU applications can increase performance according to the number of GPUs in the system. Use a system with the wrong chipset, and that multiGPU investment will not deliver. Why waste money? Make certain the PCIe chipset can deliver all the performance of your GPU investment!

NVIDIA recently released CUDA 4.0, which contains a number of features that simplify the use of multiple GPUs within a workstation or computational node. The ICHEC (Irish Centre for High-End Computing) phiGEMM library makes use of the CUDA 4.0 features to provide a matrix multiply that concurrently utilizes multiple GPUs and the host processor. The phiGEMM performance gains are impressive, with single GPU + CPU performance equaling that of the Linpack HPL matrix multiply used to evaluate the TOP500 supercomputers in the world. When running a single matrix multiply across four GPUs, plus the host processor, phiGEMM can deliver over a teraflop (1071 Gflops) of double-precision matrix multiply performance on matrices that are larger than the memory of any single GPU!

Figure 1: phiGEMM performance results

Figure 1 shows performance for a 25000 x 25000 matrix, where the 12 cores of the dual 2.93 GHz Intel X5670 processors deliver a constant 130 Gflops. Without the performance boost provided by the Intel cores, the four NVIDIA GPUs deliver 942 Gflops of 64-bit floating-point performance, a 3.4-times speedup over a single GPU. The phiGEMM library is available for free download.1

The impact of a poorly performing PCIe bus chipset can be significant. For example, this same phiGEMM matrix multiply will run 17 percent slower on a shared PCIe bus, since it simply takes longer to transfer the data. In other words, each GPU gets half of the data bandwidth, because the bus is shared. Matrix multiply is a good example for demonstrating the performance of multiple GPUs, as the runtime becomes limited by floating-point performance rather than data transfers as the matrix size increases.

Many vendors advertise that their systems support multiple high-speed x16 PCIe slots, which is true subject to some performance assumptions. To save money, some vendors use PCIe chipsets that deliver full performance only when one device is active. Figure 2, taken from my book, CUDA Application Design and Development,2 shows that some PCIe implementations treat one of the GPUs as a second-class citizen. As can be seen in the graphical output from the NVIDIA Visual Profiler, the pink and red regions denoting data transfers take significantly longer on the second device noted as Device_1:Context_1.

The output of the NVIDIA Visual Profiler shows that running a 3-D fast Fourier transform (FFT) across multiple GPUs connected via a poorly performing PCIe bus can result in a nearly 60 percent decrease in performance. The reason for such a dramatic performance impact is that the FFT performs fewer computations per datum transferred than a matrix multiply. Fewer computations means the application is limited more by the speed of the data transfers than by computational throughput.
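
A rough way to quantify "computations per datum transferred" is to estimate floating-point operations per byte moved across the PCIe bus. The sketch below is a back-of-the-envelope model, not a measurement. It assumes the textbook operation counts (2n^3 for matrix multiply, 5 N log2 N for a complex FFT of N points) and assumes every operand crosses the bus exactly once in each direction. A blocked out-of-core multiply such as phiGEMM may move more data than this. The function names are illustrative.

```cpp
#include <cmath>

// Flops per byte transferred for C = A * B with n x n double-precision matrices.
// Assumes 2*n^3 flops, with A and B sent to the GPU and C returned once.
double gemmFlopsPerByte(double n) {
    const double flops = 2.0 * n * n * n;
    const double bytes = 3.0 * n * n * sizeof(double);
    return flops / bytes;
}

// Flops per byte transferred for a 3-D complex double-precision FFT
// on an m x m x m grid. Assumes 5*N*log2(N) flops for N = m^3 points,
// with the data sent once and returned once (16 bytes per complex value).
double fft3dFlopsPerByte(double m) {
    const double points = m * m * m;
    const double flops = 5.0 * points * std::log2(points);
    const double bytes = 2.0 * points * 2.0 * sizeof(double);
    return flops / bytes;
}
```

Under these assumptions, gemmFlopsPerByte(25000) is roughly 2083 flops per byte, while fft3dFlopsPerByte(256) is 3.75 (the 256-point grid is a hypothetical size chosen for illustration). A gap of that magnitude is consistent with the FFT being far more exposed to PCIe bandwidth than the matrix multiply.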

A poorly performing PCIe bus will have an even greater impact on the runtime of these types of applications. A 60 percent performance decrease is significant, as it can eliminate the performance benefit of multiple GPUs.

The Thrust C++ data parallel API became a standard part of the CUDA toolkit with the 4.0 release.3 Through the use of generic programming and functors (C++ objects that can be called like a function), C++ applications can be written that operate in a high-performance, massively parallel fashion on CUDA vectors and arrays. Thrust makes CUDA programming easy, since anyone who knows C++ already knows how to write programs for GPUs.

Figure 2: Output from the NVIDIA Visual Profiler

What is interesting from a multiGPU and combined CPU + GPU programming perspective is that Thrust can generate code that will run on both the host multi-core processor and the GPUs! Simply supply the two specifiers “__device__ __host__” on the functor's call operator. This tells the compiler to generate code for the functor that will run on both the host and GPU devices. I use this Thrust capability to write applications that make use of all the available computing resources within a workstation.
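
As a minimal sketch of the idea, the functor below carries both specifiers when compiled with nvcc and none when compiled with an ordinary host compiler. The HOST_DEVICE macro, the Saxpy functor, and the saxpyOnHost helper are illustrative names, not part of Thrust. The host path here uses std::transform so that the example builds without a GPU toolchain. With Thrust, the same functor object could instead be handed to thrust::transform operating on thrust::device_vector data.

```cpp
#include <algorithm>
#include <vector>

// nvcc defines __CUDACC__; a plain host compiler does not know the specifiers.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical functor computing a*x + y, callable on host and device.
struct Saxpy {
    float a;
    explicit Saxpy(float a_) : a(a_) {}
    HOST_DEVICE float operator()(float x, float y) const { return a * x + y; }
};

// Host-side use of the functor. Requires y.size() >= x.size().
std::vector<float> saxpyOnHost(float a,
                               const std::vector<float>& x,
                               const std::vector<float>& y) {
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy(a));
    return out;
}
```

Keeping the arithmetic in one functor means the host and GPU code paths cannot drift apart, which is helpful when work is split between the CPU cores and the GPUs.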

For my applications, I prefer to use host-based functors and OpenMP (Open MultiProcessing) directives to explicitly specify the parallelism on the host processor. OpenMP is an API supported by most compilers that can be used to create parallel applications for multi-core processors. Be aware that Thrust-based applications also can be transparently compiled so they run entirely on the host multi-core processor. No code changes are required! Instead, a preprocessor macro, “THRUST_DEVICE_BACKEND”, is defined at compile time to specify an OpenMP backend. Thrust is written in such a way that the compiler can then generate code that runs in parallel on the host multi-core processor without requiring a GPU. More information is available on the Thrust Web site.3
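
The host-side approach described above might look like the following sketch, in which a functor is applied inside an OpenMP parallel loop with a reduction. The Square functor and the parallelTransformReduce helper are hypothetical names chosen for illustration. Build with OpenMP enabled (for example, g++ -fopenmp); without that flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>

// Hypothetical host functor.
struct Square {
    double operator()(double x) const { return x * x; }
};

// Applies f to every element of x and sums the results.
// The OpenMP reduction clause gives each thread a private partial sum.
template <typename F>
double parallelTransformReduce(const std::vector<double>& x, F f) {
    double sum = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += f(x[i]);
    }
    return sum;
}
```

For the transparent Thrust route, releases from this period documented selecting the backend on the nvcc command line with a definition along the lines of -DTHRUST_DEVICE_BACKEND=THRUST_DEVICE_BACKEND_OMP together with -Xcompiler -fopenmp. Treat the exact spelling as an assumption and confirm it against the documentation for the Thrust version in use, since the macro name has changed in later releases.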

Programmers who use the Message Passing Interface (MPI) can use this Thrust-based capability to create distributed applications that run a separate MPI process on each GPU in a system, as well as on the host processor. Be aware that distributed applications also can make heavy use of the PCIe bus, which can expose shortcomings in the vendor PCIe implementation just like the multiGPU applications discussed previously.

All this information points to a simple, common-sense solution: purchase vendor hardware that does not compromise on the PCIe bus. Instead, look for hardware that will fully support high-performance PCIe data transfers to multiple GPUs. Even if your applications do not utilize multiple GPUs at the moment, don’t limit your application possibilities. Adding GPUs to increase performance by two to four times is a viable path forward.

For those running computational clusters, a poorly performing PCIe bus also can negatively impact performance. Benchmarks using the phiGEMM library, my FFT example, or your own multiple GPU examples should immediately highlight any problems. Do not rely on results that measure PCIe bandwidth to a single device. Only benchmarks that concurrently utilize multiple devices will expose limitations in the PCIe implementation.
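
One simple way to act on this advice is to measure each device's transfer bandwidth twice, once alone and once while every device is transferring, and then compare the two. The sketch below covers only the comparison step. The DeviceBandwidth type, the findThrottledDevices helper, and the threshold are hypothetical, and the bandwidth figures are assumed to come from your own timing code (for instance, timed cudaMemcpy calls issued after cudaSetDevice selects each GPU).

```cpp
#include <cstddef>
#include <vector>

// Host-to-device bandwidth for one GPU, in GB/s, as measured by the caller.
struct DeviceBandwidth {
    double soloGBs;        // only this device transferring
    double concurrentGBs;  // all devices transferring at the same time
};

// Returns the indices of devices whose concurrent bandwidth drops below
// minRatio times their solo bandwidth (for example, minRatio = 0.8).
std::vector<std::size_t> findThrottledDevices(
        const std::vector<DeviceBandwidth>& measurements, double minRatio) {
    std::vector<std::size_t> throttled;
    for (std::size_t i = 0; i < measurements.size(); ++i) {
        const DeviceBandwidth& m = measurements[i];
        if (m.soloGBs > 0.0 && m.concurrentGBs / m.soloGBs < minRatio) {
            throttled.push_back(i);
        }
    }
    return throttled;
}
```

A device that looks healthy in isolation but appears in the returned list is the kind of second-class citizen that a single-device bandwidth test would miss.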

Happy multiGPU computing!

References
1. phiGEMM: http://qe-forge.org/projects/phigemm
2. CUDA Application Design and Development is available for preorder from www.amazon.com and other book sellers.
3. Thrust: http://code.google.com/p/thrust

Rob Farber is a visiting HPC expert at the Irish Centre for High-End Computing (ICHEC), supported by Science Foundation Ireland. He may be reached at editor@ScientificComputing.com.
