Redefining What is Possible
Perfect storm of opportunities delivers fresh approaches

General-purpose graphics processing unit (GPGPU) technology has arrived during a perfect storm of opportunities. Multi-threaded software is now a necessity as x86 and other conventional processor designs have been forced to adopt a multi-core approach. From dual-core cell phones to IBM Power 7 systems that will support well over a million concurrent threads of execution, parallelism is now the path to performance.

Legacy applications and research efforts that do not invest in multi-threaded software will not benefit from modern multi-core processors, because single-threaded and poorly scaling software will not be able to utilize extra processor cores. As a result, computational performance will plateau at or near current levels, placing the projects that depend on these legacy applications at risk of both stagnation and loss of competitiveness.
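The plateau described above can be quantified with Amdahl's law, which this article does not name but which is the standard model for the effect: if a fraction p of a program's run time can be parallelized, the best possible speedup on n cores is 1 / ((1 - p) + p / n). The following Python sketch evaluates that bound. The function name and the sample fractions are illustrative assumptions, not measurements of any real application.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup from Amdahl's law.

    parallel_fraction: share of run time that can use extra cores (0..1).
    cores: number of processor cores available (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative values only: single-threaded, half-parallel, mostly parallel.
    for fraction in (0.0, 0.5, 0.95):
        for cores in (2, 8, 64):
            bound = amdahl_speedup(fraction, cores)
            print(f"p={fraction:.2f} cores={cores:3d} speedup<={bound:.2f}")
```

Under this model a purely single-threaded code (p = 0) gains nothing from extra cores, and even a code that is 95 percent parallel can never exceed a 20-fold speedup, however many cores are added.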

Graphics processors have matured into general-purpose computational devices at exactly the right time to be considered in this industry-wide retooling to utilize multi-threaded parallelism. To put this in very concrete terms, any teenager (or research effort) from Beijing, China, to New Delhi, India, can purchase a teraflop-capable graphics processor and start developing and testing massively parallel applications. Table 1 shows two inexpensive teraflop-capable offerings from AMD and NVIDIA that are available now for purchase.

These devices represent a peak floating-point capability that was beyond anything available to even the most advanced high performance computing (HPC) users until Sandia National Laboratories performed a trillion floating-point operations per second in December 1996 on the ASCI Red supercomputer. I wonder how many of those proposals for leading-edge research using a teraflop supercomputer can be performed today by students anywhere in the world using a few GPGPUs in a workstation with a fast RAID disk subsystem and a decent amount of host system memory.

Table 1: Two inexpensive teraflop-capable offerings from AMD and NVIDIA that are available now for purchase
From personal experience, current GPGPU flop rates meet or exceed the computational capability to which I had access as a scientist in the theoretical division at Los Alamos National Laboratory in the late 1990s. In addition, the machines I used were shared with other users, while current GPGPUs are inexpensive enough to be dedicated for use by a single individual. Installing four high-end GPUs in a workstation can create a machine with a peak flop rate comparable to the large MPP2 supercomputer that Pacific Northwest National Laboratory (PNNL) made available to users just a few years ago.

Competition is fierce in both commercial and academic circles, which is why commodity supercomputing in the hands of the masses is going to have a huge impact on both commercial products and scientific research. Plus, GPGPU technology has made the competition global and accessible to almost anyone who wishes to compete, as opposed to the past where competition was restricted to a relatively small community of scientists.

While programming GPGPUs to achieve high performance can be a challenge, and there are internal resource limitations that must be overcome, the wide variety of applications that already exhibit very high performance (as shown on the NVIDIA Web site and in the scientific literature) clearly demonstrates that people are willing to put in the time and effort needed to make this technology work. Those who ignore the potential of multi-threading and GPGPU devices place themselves in a position of competitive disadvantage.

Figure 1: Fastest CUDA Applications
Two dominant programming languages exist for creating GPGPU applications at this time. The first is NVIDIA’s Compute Unified Device Architecture (CUDA) and the second is a relative newcomer, OpenCL. Both of these programming environments provide minor extensions to the C or C++ languages that enable programming with massive numbers of threads and the ability to utilize various GPU memory spaces.
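To give a flavor of this programming model: in CUDA, a function marked `__global__` (a kernel) is executed by every thread in a grid of thread blocks, and each thread typically works out which array element it owns from the built-in `blockIdx`, `blockDim` and `threadIdx` variables. The sketch below emulates that index arithmetic sequentially in Python for a SAXPY operation (y = a*x + y). It is an analogy only, not CUDA code: the function names are hypothetical, and on a GPU the two loops in `launch` would not exist, because each (block, thread) pair would run as its own hardware thread.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Work done by ONE emulated thread: update at most one element."""
    i = block_idx * block_dim + thread_idx  # global index, as a CUDA thread computes it
    if i < n:  # guard: the last block may contain surplus threads
        y[i] = a * x[i] + y[i]


def launch(kernel, num_blocks, block_dim, *args):
    """Sequential stand-in for a GPU kernel launch (hypothetical helper)."""
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy(a, x, y, block_dim=256):
    """Run the emulated kernel over every element and return a new list."""
    n = len(x)
    result = list(y)  # copy so the caller's list is left unchanged
    num_blocks = (n + block_dim - 1) // block_dim  # round up to cover all n elements
    launch(saxpy_kernel, num_blocks, block_dim, n, a, x, result)
    return result


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0], block_dim=2))
```

The point of the design is that the kernel body contains no loop over the data: the parallelism comes entirely from launching one lightweight thread per element, which is what lets the same code spread across however many processing elements a device offers.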

CUDA is a quickly maturing software development capability provided free-of-charge by NVIDIA to develop applications for NVIDIA GPGPUs. It is currently the most mature and widely used GPGPU development platform. OpenCL is a newer standards-based cross-vendor and cross-platform software development platform. In other words, massively parallel OpenCL applications are not limited to a single vendor and have the ability to run unchanged on GPUs, conventional processors, field-programmable gate arrays (FPGAs) and even on parallel hardware that has not yet been invented.

Data-parallel extensions are a hot area of development. These extensions provide a very simple, yet powerful, API to utilize multi-core hardware and can greatly ease the burden of programming these devices. A very usable one for CUDA-based projects is the Thrust project hosted on Google Code.

The rate of development of, and demand for, these massively-threaded software development environments is extraordinary. For example, the first CUDA software development kit (SDK) was released in February 2007. Today, CUDA is part of the curriculum at nearly 300 educational institutions around the world, such as Harvard, Oxford, the Indian Institutes of Technology, National Taiwan University and the Chinese Academy of Sciences.

Application speed tells the story, as shown by the plot in Figure 1, of the fastest CUDA applications as reported on the NVIDIA Web site on September 8, 2010. Results are displayed in orders of magnitude performance increase over conventional hardware for a variety of applications.

The game-changing nature of GPU technology can be appreciated through the observation that, for the last three years, nascent developers around the world have been writing CUDA and OpenCL programs, unaware that there is anything unusual in creating programs that utilize many thousands of threads that concurrently utilize hundreds of processing elements. NVIDIA states that more than 100 million CUDA-enabled GPUs have already been sold, which gives a sense of the potential size of this developer population.

My expectation is that there will be extraordinary progress in the near future as this large base of free-thinking technically-literate massively-parallel developers grows and begins to leverage its capabilities. The personal computer (PC) put relatively powerful virtual memory systems into the hands of the masses. Access to one of these machines allowed an unknown student in Finland to write a piece of software which he posted on the Internet just to see if he understood operating system design. Without question, Linus Torvalds and the Linux operating system have had a tremendous impact on our technological world. Commodity supercomputing in the hands of the global public can similarly provide other bright individuals with the computational platform they can use to change our world.

Regardless of your interest in GPGPU technology, massive threading is changing all aspects of high performance computing (HPC) because the amount of parallelism available has increased so dramatically. Instead of looking for applications to scale well on a few tens to even thousands of processors, HPC scientists must now think in terms of millions of threads of concurrent operation. At the very highest end of HPC, for example, the soon-to-be-completed National Center for Supercomputing Applications (NCSA) Blue Waters project will provide a peak performance of 10 petaflops (10 quadrillion calculations every second) using over 300,000 IBM Power 7 processing cores that can simultaneously run 1.3 million concurrent threads of operation, where each thread has the ability to access any byte within a petabyte (1 quadrillion bytes) of globally accessible physical memory.
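A quick back-of-the-envelope calculation puts the Blue Waters figures quoted above in per-core terms. This Python sketch uses only the numbers in the preceding paragraph. Because the core count is given as "over 300,000", the per-core results should be read as rough upper bounds, and the variable and function names are illustrative.

```python
PEAK_FLOPS = 10e15      # 10 petaflops peak
CORES = 300_000         # quoted as "over 300,000", so treated as a lower bound
THREADS = 1_300_000     # 1.3 million concurrent threads
MEMORY_BYTES = 1e15     # 1 petabyte, taken as 1 quadrillion bytes


def per_unit(total: float, units: int) -> float:
    """Average share of a machine-wide total per core or per thread."""
    return total / units


threads_per_core = per_unit(THREADS, CORES)
gflops_per_core = per_unit(PEAK_FLOPS, CORES) / 1e9
gbytes_per_thread = per_unit(MEMORY_BYTES, THREADS) / 1e9

if __name__ == "__main__":
    print(f"threads per core:       {threads_per_core:.1f}")
    print(f"peak GFlop/s per core:  {gflops_per_core:.1f}")
    print(f"average GB per thread:  {gbytes_per_thread:.2f}")
```

On these assumptions each core runs a little over four threads and contributes roughly 33 GFlop/s of peak performance, which suggests that the headline numbers come from sheer scale rather than from unusually fast individual cores.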

At this point, most or all HPC and cluster vendors offer hybrid products that incorporate GPGPUs. A large number of supercomputer centers such as Tokyo Tech, Oak Ridge National Laboratory (ORNL), National Energy Research Scientific Computing Center (NERSC) and PNNL have experimental CPU/GPU clusters to evaluate this hybrid technology. China already has built the Nebulae supercomputer, located in Shenzhen, China, containing 64,960 Tesla GPGPUs that recently demonstrated 1.271 PFlop/s when running the Linpack benchmark. Nebulae is now ranked number two on the TOP500 list. It also has a peak theoretical capability of 3 PFlop/s, the highest ever on the TOP500 list. The seventh-ranked system in the TOP500 is another Chinese system, Tianhe-1, which utilizes both AMD and NVIDIA GPUs. Such large systems demonstrate that it is possible to exploit the excellent energy efficiency of GPGPU technology (flops per watt) to overcome the power and cooling challenges inherent in building extremely large supercomputers at petascale and beyond.

From teraflop GPGPU supercomputers in the hands of the masses to petascale hardware, this perfect storm of opportunities is going to bring fresh approaches and a host of new minds to the halls of academia, product development and HPC research. I expect many “dusty software decks” maintained over decades and at great expense will (and must) become no more than validation suites for the evaluation of newer software models and algorithms. The payoffs can be tremendous, as better, more computationally expensive approximations and analytic approaches become possible, as multi-threaded and GPGPU technology blow the dust away and clear the air for many exciting advances, as individuals from around the world redefine what is possible for high performance computing.

1. 20-part CUDA Series:
2. New Tutorial OpenCL Series:

Rob Farber is a senior research scientist at Pacific Northwest National Laboratory. He may be reached at