Back to the Future

Thu, 07/17/2008 - 9:33am
Rob Farber

The return of massively parallel systems

In the 1980s, the scientific community faced the challenge of programming the massively parallel Thinking Machines supercomputers with 65,536 processors. Modern systems, such as the 62,976-core Ranger supercomputer at the Texas Advanced Computing Center, the 212,992-core Blue Gene at Lawrence Livermore National Laboratory (last year’s TOP500 leader), and even commodity NVIDIA graphics processors present a similar challenge: how to map scientific computations efficiently onto large numbers of processing cores.

Massively parallel systems dramatically highlight the computation versus communications trade-offs that must be made in designing high-performance parallel software. The old joke that "a supercomputer is an expensive device that transforms a compute-bound problem into an I/O-limited problem" still applies. Modernizing this joke only requires removing the word "expensive" because, nowadays, low-cost commodity devices have enough processing cores to transform many scientific workloads into I/O-limited problems.

Mapping problems from many-core processors to massively parallel supercomputers involves simultaneously addressing two issues:
• fitting the problem efficiently onto each individual core (or computational node), and
• communicating data among all the cores in a manner that permits each to run at peak efficiency.

My February 2007 column, "HPC Balance and Common Sense," discussed how software must fit within the balance ratios of the available hardware in order to run efficiently. Unfortunately, there is no certainty that any given computational problem will fit onto your specific computer hardware. Current dual- and quad-core processors are especially susceptible to performance problems, as they have a high ratio of floating-point capability to memory bandwidth (that is, they can perform many more floating-point operations per second than all of their cores combined can fetch bytes from memory). Efficient software for these processors must make extensive reuse of local data in-cache, or else the processing cores will starve for data and performance will suffer. Similarly, bandwidth and latency also make it difficult to map problems across multiple computational units — be they graphics processors within a workstation or nodes in a massively parallel supercomputer. In general, it is difficult to maintain that balancing act between computation and communications and not turn your compute problem into an I/O-limited problem, or vice versa.
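The balance-ratio argument above can be made concrete with a little arithmetic. The sketch below uses illustrative, assumed peak numbers (not measurements of any specific processor) to estimate what fraction of peak a streaming, memory-bound kernel can hope to reach:

```python
# Illustrative only: the peak numbers below are assumptions,
# not measurements of any specific processor.
PEAK_GFLOPS = 40.0        # assumed aggregate peak of a quad-core chip (GFLOP/s)
PEAK_BANDWIDTH_GB = 10.0  # assumed memory bandwidth (GB/s)

# Machine balance: floating-point operations the chip can perform
# per byte it can fetch from main memory.
balance = PEAK_GFLOPS / PEAK_BANDWIDTH_GB  # FLOPs per byte

# A streaming kernel such as DAXPY (y = a*x + y) performs 2 FLOPs
# per element while touching 24 bytes (read x[i], read y[i],
# write y[i], at 8 bytes each).
daxpy_intensity = 2.0 / 24.0  # FLOPs per byte

# Without cache reuse, the kernel can reach at most this fraction
# of the chip's floating-point peak.
fraction_of_peak = daxpy_intensity / balance

print(f"machine balance: {balance:.1f} FLOPs/byte")
print(f"DAXPY intensity: {daxpy_intensity:.3f} FLOPs/byte")
print(f"attainable peak: {100 * fraction_of_peak:.1f}%")
```

With these assumed numbers the kernel is limited to roughly two percent of peak, which is why in-cache data reuse is the deciding factor for performance on such processors.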

Because the computer industry has switched to delivering multiple core processors rather than increasing clock speed (see my previous column), there is great interest in defining uniform programming models for these systems. In effect, we have traveled back to the future, history begins to repeat itself, and work that was done on the early generation massively parallel computers is now new again.

As you are aware from reading earlier installments of this column, NVIDIA is delivering the Compute Unified Device Architecture (CUDA) with their graphics processors. It is fairly straightforward to map single instruction, multiple data (SIMD) algorithms onto CUDA devices. More complex single program, multiple data (SPMD) algorithms can also work efficiently on these devices. (For a better understanding of the taxonomy of computer architectures, I recommend the Flynn’s taxonomy entry listed under Related Resources.) NVIDIA claims that the GPU-enabled C compiler will do the heavy lifting required to turn C code into GPU instructions.
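The SIMD mapping described above — one kernel, many data elements, each thread running the same instructions on its own element — can be sketched in a few lines. CUDA kernels are actually written in C and launched across thousands of hardware threads; the Python below is only a conceptual stand-in, with hypothetical names and sizes:

```python
# Conceptual sketch of the SIMD mapping CUDA encourages: the same
# kernel body runs once per data element. Names and sizes are
# illustrative, not NVIDIA's API.

def saxpy_kernel(i, a, x, y):
    """Body executed by 'thread' i: same instructions, different data."""
    y[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Stand-in for a GPU kernel launch: one invocation per element.
    On a real device these invocations execute concurrently."""
    for i in range(n):
        kernel(i, *args)

n = 8
a = 2.0
x = [float(i) for i in range(n)]  # 0.0, 1.0, ..., 7.0
y = [1.0] * n

launch(saxpy_kernel, n, a, x, y)
print(y)  # each element computed independently: y[i] = 2*i + 1
```

Because no element depends on any other, the loop parallelizes trivially — which is exactly why SIMD algorithms map so naturally onto these devices.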

Intel plans to compete directly with NVIDIA with their Larrabee product. Larrabee will deliver more than a teraflop of performance based on their successful Intel architecture (IA) cores. In a possible play on NVIDIA’s words, Intel’s senior vice president, Pat Gelsinger, was quoted by TG Daily as saying, "The industry believes that CUDA is heavy lifting," so Larrabee will use common libraries and run under the same operating system as IA processors. The challenge remains to teach developers how to use all those computational cores.

The wisdom of hindsight has demonstrated to me that some models of computation continue to map efficiently onto ever newer generations of machines. For example, my work in the 1980s on mapping machine learning and optimization problems to massively parallel SIMD computers is directly applicable to the current hardware and platforms. The lesson is that simpler is better for longevity. Minor tweaks have allowed these early software mappings to achieve linear scaling (with the number of processing cores) and excellent performance from the NVIDIA GPUs as well as other supercomputers, such as Ranger. Needless to say, I’m happy and look forward to working with newer products and the next generation supercomputers.

Looking forward, I will be interested to see how Intel and other vendors incorporate time-tested software models, including the message passing interface (MPI) and pthreads. Newer programming frameworks, such as MapReduce (the programming model Google uses to sort 20 petabytes per day; see Hadoop for a downloadable open-source implementation), and other useful programming patterns motivated by Google’s "Cloud Computing" will certainly have an effect. The computer industry is in a state of flux, and I will not be surprised if new programming models for many-core and massively parallel applications appear.
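For readers unfamiliar with the MapReduce pattern mentioned above, the classic word-count example can be sketched in a single process. Real frameworks such as Hadoop distribute the map and reduce phases across many nodes and handle shuffling, fault tolerance, and I/O; this is only a minimal illustration of the pattern itself:

```python
from collections import defaultdict

def map_phase(documents):
    """The 'map' step: emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework's shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """The 'reduce' step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])  # 3 2
```

The appeal for massively parallel systems is that the map invocations are independent and the reductions are grouped by key, so both phases scale out across cores or nodes with no shared state.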

Happy "scalable" computing!

Rob Farber is a senior research scientist in the Molecular Science Computing Facility at the William R. Wiley Environmental Molecular Sciences Laboratory, a Department of Energy national scientific user facility located at Pacific Northwest National Laboratory in Richland, WA. He may be reached at

Related Resources

CUDA Compute Unified Device Architecture | IA Intel Architecture | MPI Message Passing Interface | SIMD Single Instruction Multiple Data | SPMD Single Program Multiple Data
