Parallel Coding with GPUs: Should You Wait?
A look at real GPU options and some issues that still need to be addressed

While the cost of high performance computing (HPC) has fallen steadily in recent years, it may still put some people off. The advent of general purpose graphical processing units (GPGPUs) has both accelerated this cost reduction and improved the energy efficiency of some HPC installations. The experiences of those who have written GPGPU algorithms and who know HPC systems (some with GPGPU capability) may help you decide whether this technology is right for you.

Most hardware vendors have now added GPGPU capabilities to their standard HPC product ranges. Announcements by many different vendors and integrators make it clear that GPGPU computing is no longer confined to research groups but is now much more mainstream.

This article examines the background to GPU usage, some of the options that currently exist and some issues that still need to be addressed. It then offers an opinion as to whether now is the right time to start working with GPU technology for the current generation of hardware.

The Numerical Algorithms Group’s development teams have been working on numerical software products for parallel systems, including a regularly updated multicore library,1 for over 15 years. In addition, NAG has recently been working with Intel to evaluate the suitability of the Intel Many Integrated Core (Intel MIC) architecture for advanced numerical algorithms, and has been developing numerical routines for GPGPU architectures; it is this GPGPU work on which this article focuses.

GPGPU use in production supercomputers 
Experiments with the use of GPGPUs for numerical and scientific computing are ongoing at prestigious research institutions around the globe. High-profile examples are Oak Ridge National Laboratory in the U.S., the Swiss National Supercomputing Centre, the Chinese Academy of Sciences, Australia’s Commonwealth Scientific and Industrial Research Organisation and the UK National Supercomputing Service (HECToR), all of which have already added GPGPU hardware to their existing traditional HPC systems. Indeed, in the 37th edition of the TOP500 list of the world’s fastest supercomputers,2 three of the top five systems (two Chinese systems at No. 2 and No. 4 and the Japanese Tsubame 2.0 system at No. 5) use GPUs to accelerate computation. In total, 19 systems on the list now utilize GPU technology. The same Tsubame 2.0 system is also No. 3 on the June 2011 Green 500 list.3 It was awarded the title of “Greenest Production Supercomputer in the World,” since it is the only machine to feature in the top five of both the Green 500 and the TOP500 lists; this illustrates how power-efficient GPU-based computing can be.

High-End Computing Terascale Resource (HECToR) is the fastest supercomputer in the UK and is a facility that NAG knows very well. It is a resource funded by the UK Research Councils and is available for use by academia in the UK and Europe. NAG4 provides HECToR users with support for their computational science and engineering applications.

The main HECToR system is complemented by a GPGPU test facility. This currently consists of two nodes containing four NVIDIA Fermi C2050 GPUs, and a third containing two more C2050 GPUs plus two higher-memory C2070 GPUs. Each of these three nodes also contains a quad-core Intel CPU and 32GB of host RAM; together they comprise the production side of the test-bed system. A further node hosts a single NVIDIA Fermi C2050 GPU and a single AMD FireStream 9270 GPU, along with a quad-core Intel CPU and 16GB of host RAM. The whole system is housed in a cabinet infrastructure with an AMD Magny-Cours (12-core) head node, 6 TB of disk space and a shared 10 Gigabit Ethernet switch.

At present, the GPU system is primarily aimed at development and testing. As is common with many prototype systems, users typically want to experiment with the new hardware to gauge the possible benefits for their applications. The key task is to identify those parts of their code which will most benefit from GPU acceleration, and then to see whether the implementation (or possibly even the design) of the algorithm or application can be altered to use the GPU hardware efficiently. Many users need some help with this process and so reach out to experienced groups, both internal and external, for guidance. There is a lively debate in the HPC community about the benefits of GPGPU and how to realize these benefits (when they exist).

CPUs and GPUs: converging architectures?
From the point of view of designing algorithms, modern CPUs and GPUs are quite similar. Both have several independent processing units (called “cores” on CPUs) and both have single instruction, multiple data (SIMD) units attached to each core (the MMX, SSE, SSE2, SSE3, SSE4 and AVX units on CPUs). The most obvious difference is in the numbers: GPUs currently have more processing units and longer SIMD units than CPUs, run at lower clock speeds and, until recently, have had no caches. However, CPU vendors are rapidly increasing the core count (12+ core CPUs are readily available) and the length of the SIMD units (from 4 floats with SSE to 8 floats with AVX). Intel also has an 80-core research processor that may one day be used for HPC. On the GPU side, NVIDIA has recently added level 1 and level 2 caches to its products, greatly easing the implementation of many algorithms. Projecting these trends forward, the technologies don’t appear too different.

AMD’s Fusion architecture combines a GPU and a multicore CPU on the same chip, delivering an interesting hybrid of traditional multicore processing with a mid-range GPU coprocessor. NVIDIA’s recently announced ARM strategy seems geared toward delivering a powerful GPU paired with a mid-range CPU. Intel MIC is designed as a powerful coprocessor to sit alongside a traditional CPU. It seems as if the future might contain heterogeneous computing platforms, where a serial processor is tightly coupled with a coprocessor aimed at parallel SIMD floating point calculations.

A common theme across all architectures is that parallelism, and the effective use of SIMD units, will be key to getting good performance in the future. Ignoring or misusing the AVX units on a CPU could lead to an 8x drop in single precision floating point performance in the worst case. Breaking the “single instruction, multiple data” constraint by having one half of the SIMD unit execute one code path and the other half another code path (called “branching” or “warp divergence” in NVIDIA CUDA) will also degrade performance. Often, this is done unknowingly: for example, software implementations of special functions (e.g. erfc) often use two or three different approximations, depending on the value at which the special function is to be evaluated. If such a function is evaluated on 8 random numbers in an AVX unit, the evaluation may have to be done two or three times, and parts of the output discarded, in order to obtain the final answer. The details depend heavily on the hardware implementation (e.g. NVIDIA’s hardware handles the re-evaluation and masking for you). However, algorithmically, the key point is that SIMD units will become increasingly important and algorithms will have to be designed around them.

Implications for software development: lessons from GPGPU
With this context in mind, a technical team from NAG has been successfully developing numerical routines to run on NVIDIA GPUs. The routines were written in CUDA. The objectives were to explore the algorithmic and performance implications of GPGPU programming, as well as to evaluate the software tools and the development process. Some of their experiences may be useful to you. The team highlights several key points:

• Not all algorithms can be rewritten to run faster on parallel architectures. However, many can. Often, there is more parallelism in an algorithm than might initially be thought. The first step in performance coding is to profile the CPU code and identify “hot spots” — areas which take up most of the runtime. These might not be what you expect: for example, it may be that an external or legacy system, disk IO, or network traffic is slowing you down. (For the latter, note there is a growing literature on communication avoiding algorithms.) If the hot spots are compute-intensive sections, see to what extent they can be parallelized. This will inevitably require an in-depth analysis of the sections of code. You may wish to re-examine the application on a higher level, sketching out how parallel flows could be created or how data could be reused. Sometimes, it is more expensive to move data than to re-compute it.

• In the past, developers have relied on compilers to vectorize code automatically (i.e. use the SIMD units). Compilers differ quite a bit in how well they identify vectorizable sections of code and in how well they use the SIMD units. As SIMD units get longer, algorithms will have to be (re)designed around them, and programmers will have to make sure that compilers do, in fact, use the SIMD units. Sometimes, it may still make sense to use these units even though the code “branches” or creates “warp divergence.” This will depend to an extent on how efficiently the hardware handles such “branches.” It seems inevitable that, for the foreseeable future, we will have to keep a close eye on what the compilers actually do to our code. CUDA is quite useful in that it forces you to design your algorithm for the lowest common denominator: a single element in a SIMD unit. The algorithm is then built up to multiple (cooperating) SIMD units and, finally, to multiple (independent) blocks of SIMD units. This is a useful, fairly general model of parallel programming, and might even map well onto other architectures. Indeed, PGI has been working on a CUDA compiler that targets conventional x86 processors, and it will be interesting to see how well this combination performs on multicore CPUs.
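
The bottom-up design model described above can be mimicked in plain C. This is an illustrative sketch only (the names saxpy_element and saxpy_grid are invented here, not a real API): the "kernel" is written for a single element, and ordinary outer loops play the roles of blocks and threads.

```c
#include <assert.h>

/* "Kernel" written for a single element -- the lowest common
   denominator that CUDA forces you to think about first. */
static void saxpy_element(float a, const float *x, float *y, int i) {
    y[i] = a * x[i] + y[i];
}

/* Outer loops stand in for the grid of blocks and threads; on a GPU
   these iterations would execute in parallel across SIMD units. */
static void saxpy_grid(float a, const float *x, float *y,
                       int n, int block_size) {
    for (int block = 0; block * block_size < n; block++)
        for (int thread = 0; thread < block_size; thread++) {
            int i = block * block_size + thread;   /* global index */
            if (i < n)                 /* bounds guard, as in CUDA */
                saxpy_element(a, x, y, i);
        }
}
```

Because each element is computed independently, the same decomposition maps naturally onto SIMD lanes, OpenMP threads or CUDA thread blocks.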

• Beware of “performance comparisons.” There is much material, both scientific and commercial, claiming massive speedups from GPGPU, sometimes as much as several hundred times. Often, this compares a (rather powerful) GPU against a single (unoptimized) thread on a (rather old) CPU. There is a real question as to how one matches a CPU to a GPU. One approach is on price, but one seldom finds a researcher with a system containing an equally priced CPU and GPU. Frequently, a new GPU is added to an existing, and somewhat old, system and the results are compared. Sometimes the massive speedups are warranted, but the challenge is to discern these cases from results that are skewed by a hardware mismatch. Of course, the real point is that parallelizing code can lead to huge gains on new hardware, and that GPUs are fairly cheap (a new CPU may require a whole new system). In practice, some “expectation management” may have to be performed, especially when dealing with non-technical external departments which have influence over the project. Some researchers claim that, as a rule of thumb, a 10x speedup when comparing a new GPU (in double precision) against equivalent multithreaded code on a new CPU is not unrealistic for many applications.

• Precision affects runtimes on GPUs. Often, single and double precision calculations appear to take more or less the same time on a CPU. This is especially the case when calling some math libraries, where many single precision routines are implemented by simply calling the double precision counterparts with the single precision value. On a GPU, halving the precision can sometimes more than double the performance, since less data is moved around the device. It is worth knowing which parts of your application have to be done in double precision, and which parts can be done in single precision.
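
A small, self-contained C experiment illustrates both sides of the trade-off: a single-precision array occupies half the memory of a double-precision one (and so moves half the data), while naive single-precision accumulation can lose accuracy badly. The repeated-0.1 summation below is a standard textbook demonstration, not an NAG benchmark.

```c
#include <assert.h>
#include <math.h>

/* Sum n copies of 0.1 in single and in double precision.  The float
   version reads and writes half the bytes of the double version, but
   its accumulated round-off error is dramatically larger because 0.1
   is not exactly representable and the running sum grows large. */
static double sum_single(int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += 0.1f;
    return (double)s;
}

static double sum_double(int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += 0.1;
    return s;
}
```

With n = 10^7 the exact answer is 10^6; the double-precision sum is close, while the single-precision sum is off by a large margin. This is why it pays to know which parts of an application genuinely need double precision.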

• Numerical accuracy is sometimes difficult to ascertain. Floating point addition is not associative, which means that performing operations in a different order can lead to slightly different numerical results. Moreover, different implementations of special functions (e.g. sine or cosine) will have slightly different output. In almost all cases, running an application on both a CPU and a GPU will yield slightly different answers. Sometimes, this can create difficulties, for example, if the code is tested by a validation department. Often, it is assumed that the serial, CPU implementation is correct without any analysis of how the accumulated round-off errors, truncation errors or errors in special function implementations propagate through the code. It is assumed that the number produced by the CPU is correct without appreciating that this number itself could lie anywhere in a (small) interval, depending on how the algorithm is implemented and on how the compiler translates the code into assembly language. One should be careful not to be misled by spurious accuracy.
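
The non-associativity is easy to demonstrate in C, assuming IEEE double precision with round-to-nearest arithmetic. The volatile temporaries simply discourage the compiler from folding the expressions away:

```c
#include <assert.h>

/* (a + b) + c versus a + (b + c): the grouping changes the answer. */
static double add_left(double a, double b, double c) {
    volatile double t = a + b;   /* force the left grouping */
    return t + c;
}

static double add_right(double a, double b, double c) {
    volatile double t = b + c;   /* force the right grouping */
    return a + t;
}
```

With a = 1e16, b = -1e16 and c = 1.0, the left grouping gives 1.0 but the right grouping gives 0.0, because -1e16 + 1.0 rounds straight back to -1e16 (1.0 is only half a unit in the last place at that magnitude). A parallel reduction that regroups the sum can therefore legitimately differ from the serial result.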

• The size of GPU memory and the effects of caching make GPUs more programmer-friendly. NVIDIA C2070 cards now have 6 GB of memory, and all the “Fermi” series cards have L1 and L2 caches. The extra memory and the caching make programming the device much simpler than on previous-generation GPUs. Often, the GPU code looks quite similar to the equivalent multithreaded code on the CPU. The PCIe bus, however, has to be handled carefully. Moving data across the PCIe bus from the host to the GPU is expensive. Fortunately, this can often be overlapped with computation both on the GPU and on the CPU so that the transfer is essentially free, but this comes at the cost of added code complexity.

• NAG has developed most of its GPU routines in CUDA, but also has experience of code development in OpenCL. While the development tools and environments are now quite good, they are not perfect. This is perhaps more the case for OpenCL than for CUDA: these days, CUDA has performance analyzers, stable and effective debuggers, and significant support for the C++ language. Expect to spend some time learning the new programming environment, the execution and memory models and so on, which often can only be done through practice. But with tools such as Parallel Nsight, a development environment for NVIDIA’s GPUs integrated into Microsoft Visual Studio, the development effort is focused much more on designing the parallel algorithm than on implementing it.

• Plan for ongoing change. In the coming months and years, there will inevitably be plenty of new releases of toolkits and environments. Your build and release processes should allow for this and, from time to time, designs may have to be updated (or at least revisited) as new software and hardware features become available. This is not really different from the process needed to maintain compatibility with any software development environment. However, because the tools are quite new and developing rapidly, changes will occur at least once a year. For example, the CUDA release history has been:

• 2 releases in 2007
• 1 release in 2008
• 3 releases in 2009
• 3 releases in 2010
• 2 releases (to June) in 2011

Case study
The team at NAG has specifically been working with CUDA and, despite a large amount of combined experience in parallel numerical code, the initial work with CUDA 2.1 and 2.2 was demanding. This was primarily because the tools were relatively primitive, and debugging was a tedious affair. With the advent of Parallel Nsight and CUDA support for debugging, the development process now more closely resembles that of traditional parallel CPU code. However, writing parallel code, whether for a CPU or a GPU, is demanding.

A specific example is the work undertaken to efficiently implement the Mersenne Twister MT19937 pseudorandom number generator on an NVIDIA GPU. The MT19937 generator is well established and widely used in Monte Carlo simulations. Since Monte Carlo simulation lends itself to GPGPU, the question was how to parallelize this generator efficiently. This is a well-known, and difficult, computer science problem: the core issue is to find an efficient skip-ahead. The massive parallelism of the GPU makes this fundamental numerical problem more acute: “quick fixes” or workarounds that can be used for CPU-based code are simply impractical. For example, in the literature, some authors tend to ignore the cost of computing skip-ahead polynomials and instead focus on applying them efficiently. These polynomials are costly to compute, so the authors argue that a set of them should be calculated in advance and stored. At runtime, a particular polynomial is selected and applied to a given seed. This is equivalent to selecting an “independent” generator from a fixed set of predetermined “independent” generators. However, the massive parallelism of a GPU makes this approach untenable, especially if the routines are to have the same characteristics (e.g. output) as a serial routine. (More details are given in the paper “Parallelisation Techniques for Random Number Generators.”5)
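
Skip-ahead is much easier to see for a simple linear congruential generator than for MT19937, so here is an illustrative C sketch using an LCG (the constants are Knuth's MMIX parameters; this is emphatically not the MT19937 scheme). A jump of k steps costs O(log k) multiplications, because the k-step coefficients of the affine map are built by repeated squaring. MT19937's skip-ahead instead requires polynomial arithmetic over GF(2) on an enormous state, which is precisely the cost discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit LCG with modulus 2^64 (via unsigned wraparound).
   One step is the affine map  x -> A*x + C. */
#define A 6364136223846793005ULL   /* Knuth's MMIX multiplier */
#define C 1442695040888963407ULL

static uint64_t lcg_step(uint64_t x) {
    return A * x + C;
}

/* Skip ahead k steps in O(log k) multiplications.  (cur_a, cur_c)
   holds the coefficients of the map iterated 2^i times; each set bit
   of k composes that power into the accumulated map (acc_a, acc_c). */
static uint64_t lcg_skip(uint64_t x, uint64_t k) {
    uint64_t cur_a = A, cur_c = C;
    uint64_t acc_a = 1, acc_c = 0;       /* identity map */
    while (k) {
        if (k & 1) {                     /* acc := cur o acc */
            acc_c = cur_a * acc_c + cur_c;
            acc_a = cur_a * acc_a;
        }
        cur_c = (cur_a + 1) * cur_c;     /* cur := cur o cur */
        cur_a = cur_a * cur_a;
        k >>= 1;
    }
    return acc_a * x + acc_c;
}
```

With such a cheap jump, GPU thread t could seed its own substream by skipping t * substream_length steps ahead. The difficulty NAG faced is that no comparably cheap jump exists for MT19937.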

For this particular case, finding a good solution to the problem of parallelizing MT19937 proved much more time consuming and problematic for an experienced developer than was initially hoped, and some issues remain unresolved. For example, it is still very costly to compute skip-ahead polynomials, and there is a lack of literature on good parameters for creating independent streams and substreams for MT19937, something which is crucial for many highly parallel applications of the generator. The principle, then, is this: when working on complex problems for GPUs, and many-core CPUs in general, allow for extra delays beyond the semi-predictable delays you might normally expect in software projects where there is more built-up experience to guide you.

Should you wait?
The considered answer is that there is no need to wait: do start experimenting with GPUs now. Indeed, since CPU and GPU technology appears to be converging, the sooner practitioners start parallelizing their code (with the fine-grained parallelism the new hardware will require), the better. It is important to start analyzing code now to identify hotspots and to plan how these hotspots could be implemented on many-core, long-SIMD machines.

The tools are effective, and the hardware is available at competitive prices and may already be incorporated in your HPC service. The task of writing fine-grained, parallel code for GPUs and many-core architectures is by no means simple. But by starting on the learning curve now, you will not only reap the benefits of speedups today, you will also ensure you are ready to exploit the power of the next generation of parallel and massively parallel computer hardware.

Jacques du Toit is a technical consultant at The Numerical Algorithms Group. He may be reached at

1. NAG Library for SMP and Multicore:
2. Top 500: 
3. Green 500: 
4. NAG HPC skill sets: 
5. GPU Computing Gems: