
Estimating Current and Future Potentials

Fri, 01/07/2011 - 9:58am
Gerhard Wellein

In less than a decade, GPGPUs may provide a pathway to exascale computing
Since the glory days of the vector supercomputers, change has been the only constant in the architecture of leading-edge computing machinery. Over the past 25 years, the HPC community has adopted several technological breakthroughs driven by economies of scale in completely different IT markets. Prominent examples are the “killer micros” (in the beginning, RISC-based server processors;1 nowadays, desktop-like x86 processors), the IBM BlueGene series leveraging embedded-systems technology, and the use of standard components for interconnects (e.g. GigE) or storage subsystems (e.g. cost-effective SATA drives). The strong interest in the use of general purpose graphics processing units (GPGPUs) in scientific and technical computing can be considered the latest development along this line.

Raw performance numbers of GPGPUs, both in terms of floating point capabilities and bandwidth to device memory, have reached extremely appealing levels in recent years at a rather low price tag. Moreover, the introduction of programming interfaces appropriate for scientific and technical computing, such as CUDA and OpenCL, has initiated extensive evaluations of GPGPUs in many application areas. However, the strapline “supercomputer performance at your desktop” is not the only trend fueling GPGPU activities. GPGPUs and related architectures are also considered one of the few pathways to exascale computing within less than a decade.

So, it is not surprising that people are very enthusiastic about this technology and conduct extensive work on programming and optimizing kernels and applications for GPGPUs. Hardly any conference related to numerical simulation lacks talks about GPGPUs, and many authors are tempted to show ever-increasing speedups for their codes on GPUs — in particular, compared to traditional CPUs. In doing so, basic technological limitations of GPGPUs are all too often ignored, and only those parts of the full application that can “show off” compared to CPUs are considered. While double precision performance and the lack of memory resilience are addressed by NVIDIA’s latest GPGPU generation (“Fermi”), the typical use of GPGPUs as accelerators confronts a more fundamental and inveterate enemy: Amdahl’s law dictates that a very large fraction of any application must be accelerated to really profit from the blessings of massive on-chip parallelism; making half of the code infinitely fast only gives us a speedup of two!
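To spell out that arithmetic: in its standard form, Amdahl's law gives the overall speedup S for a code whose acceleratable runtime fraction is p, when that fraction is sped up by a factor s:

\[
  S(p, s) = \frac{1}{(1 - p) + p/s},
  \qquad
  \lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}.
\]

For p = 0.5, the limit is S = 2, as stated above; even p = 0.9 caps the achievable speedup at 10, no matter how fast the accelerated part becomes.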

To estimate the current and future potentials of GPGPUs for real-world codes, all these pros and cons have to be evaluated carefully and in an unbiased way. As the premier HPC conference, SC10 features a session on GPGPU performance with selected high-quality papers addressing these issues in different application areas of broad interest: hierarchical N-body simulation in cosmology, biomolecular MD simulation, and a bioinformatics application for genome alignment. The application codes used are open source projects,2,3,4 and are widely accepted in the respective communities. The talks will report on experience from developing, implementing and running full applications on single or multiple GPGPUs. They all focus on NVIDIA GPGPUs using NVIDIA’s CUDA programming interface — a typical combination for most research papers presented these days on GPGPUs in scientific and technical computing. We expect the session to stimulate a fruitful discussion among GPGPU and CPU programming experts, and to provide general insights for scientists who are mainly interested in the productivity and usability of full application codes on GPGPU clusters.

References
1. Brooks, E.: “The attack of the killer micros.” Teraflop Computing Panel, Supercomputing ’89 Conference, Reno, NV.
2. ChaNGa: www-hpcc.astro.washington.edu/tools/changa.html
3. LAMMPS: lammps.sandia.gov
4. MUMmer: mummer.sourceforge.net

Gerhard Wellein is High Performance Computing group leader at the Erlangen Regional Computing Center, Professor at the Department of Computer Science at the University of Erlangen, and the SC10 Conference GPGPU Performance Session Chair.


Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations
Biomolecular simulations have traditionally benefited from increases in processor clock speed and coarse-grain inter-node parallelism on large-scale clusters. With stagnating clock frequencies, the evolutionary path for microprocessor performance is maintained through core multiplication. Graphics processing units (GPUs) offer revolutionary performance potential at the cost of increased programming complexity. Furthermore, it has been extremely challenging to effectively utilize heterogeneous resources (host processor and GPU cores) for scientific simulations, as the underlying systems, programming models and tools are continually evolving.

In this paper, we present a parametric study demonstrating approaches to exploit the resources of heterogeneous systems to reduce the time-to-solution of a production-level application for biological simulations. By overlapping and pipelining computation and communication, we observe up to 10-fold application acceleration in multi-core and multi-GPU environments, illustrating significant performance improvements over code acceleration approaches where the host-to-accelerator ratio is static and constrained by a given algorithmic implementation.
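The abstract does not give implementation details, but the general technique it refers to, overlapping host-device transfers with kernel execution, can be illustrated with a minimal CUDA sketch using asynchronous copies and streams. Everything below (the kernel, the chunking scheme, the two-stream pipeline depth) is an illustrative assumption, not code from the paper:

#include <cuda_runtime.h>

// Illustrative kernel: stand-in for the real per-chunk simulation work.
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void pipeline(float *h_data, float *d_data, int n, int nchunks) {
    // h_data must be pinned (cudaMallocHost) for copies to overlap with kernels.
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);
    int chunk = n / nchunks;   // assume nchunks divides n evenly
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];          // alternate between two streams
        float *hp = h_data + c * chunk;
        float *dp = d_data + c * chunk;
        // Stage 1: asynchronous host-to-device copy of this chunk.
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        // Stage 2: kernel on the same stream; overlaps the other stream's copies.
        process<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk);
        // Stage 3: copy results back while later chunks are still in flight.
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
}

With enough chunks, the copy engine and the compute engine stay busy simultaneously, hiding much of the PCIe transfer time behind useful computation.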

Chair:
Gerhard Wellein, Erlangen Regional Computing Center

Authors: 
Scott Hampton, Oak Ridge National Laboratory
Sadaf Alam, Swiss National Supercomputing Centre
Paul Crozier, Sandia National Laboratories
Pratul Agarwal, Oak Ridge National Laboratory


Scaling Hierarchical N-Body Simulations on GPU Clusters
This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effect of various application parameters is modeled and experiments are carried out to quantify gains in performance.

Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present scaling performance results from experiments on the National Center for Supercomputing Applications (NCSA) Lincoln GPU cluster.
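The paper's actual kernels are not reproduced here, but "kernel organization and efficiency" for N-body force evaluation can be suggested by the classic shared-memory tiling pattern, in which each thread block cooperatively stages a tile of source bodies so every thread reuses it. This is a generic hedged sketch; the float4 position layout (mass in .w), the softening constant EPS2, and the assumption that the block size divides n are all illustrative:

#define EPS2 1.0e-4f  // assumed gravitational softening parameter

__global__ void body_forces(const float4 *pos, float4 *acc, int n) {
    extern __shared__ float4 tile[];   // one tile of source bodies per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = pos[i < n ? i : 0];
    float ax = 0.f, ay = 0.f, az = 0.f;
    for (int base = 0; base < n; base += blockDim.x) {
        tile[threadIdx.x] = pos[base + threadIdx.x];  // cooperative staging
        __syncthreads();
        for (int j = 0; j < blockDim.x; ++j) {        // reuse tile across threads
            float dx = tile[j].x - pi.x;
            float dy = tile[j].y - pi.y;
            float dz = tile[j].z - pi.z;
            float inv = rsqrtf(dx * dx + dy * dy + dz * dz + EPS2);
            float w = tile[j].w * inv * inv * inv;    // m_j / r^3
            ax += w * dx; ay += w * dy; az += w * dz;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = make_float4(ax, ay, az, 0.f);
}

// Launch (n a multiple of 256):
// body_forces<<<n / 256, 256, 256 * sizeof(float4)>>>(pos, acc, n);

In a hierarchical method such as the one used by ChaNGa, such a kernel would be applied only to interaction lists produced by the tree traversal, which is exactly where the balance and grain-size questions studied in the paper arise.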

Chair:
Gerhard Wellein, Erlangen Regional Computing Center

Authors:
Pritish Jetley, University of Illinois at Urbana-Champaign
Lukasz Wesolowski, University of Illinois at Urbana-Champaign
Filippo Gioachin, University of Illinois at Urbana-Champaign
Laxmikant V. Kale, University of Illinois at Urbana-Champaign
Thomas R. Quinn, University of Washington

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
GPUs offer drastically different performance characteristics compared to traditional multicore architectures. To explore the tradeoffs exposed by this difference, we refactor MUMmer, a widely used, highly engineered bioinformatics application that has both CPU- and GPU-based implementations. We synthesize our experience as three high-level guidelines for designing efficient GPU-based applications.

First, minimizing communication overheads is as important as optimizing the computation. Second, trading off higher computational complexity for a more compact in-memory representation is a valuable technique to increase overall performance (by enabling higher parallelism levels and reducing transfer overheads). Finally, ensuring that the chosen solution entails low pre- and post-processing overheads is essential to maximize the overall performance gains. Based on these insights, MUMmerGPU++, our GPU-based design of the MUMmer sequence alignment tool, achieves up to 4x speedup on realistic workloads compared to a previous, highly optimized GPU port.
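The second guideline can be made concrete with a small example that is not from the paper itself: for DNA data, storing bases at two bits each quarters the memory footprint, and hence the host-to-device transfer volume, at the price of extra shift-and-mask arithmetic on every access. The encoding scheme and function names below are a generic sketch under that assumption:

#include <stdint.h>

// Map A,C,G,T to 2-bit codes (illustrative scheme; real tools must also
// handle ambiguous bases such as 'N').
__host__ __device__ inline uint8_t encode_base(char b) {
    switch (b) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

// Pack a sequence at 4 bases per byte: 4x less memory, 4x less data to move.
void pack_sequence(const char *seq, uint8_t *packed, int n) {
    for (int i = 0; i < n; ++i) {
        if (i % 4 == 0) packed[i / 4] = 0;
        packed[i / 4] |= (uint8_t)(encode_base(seq[i]) << (2 * (i % 4)));
    }
}

// Reading a base back costs one shift and one mask per access -- the extra
// compute traded for the compact representation.
__host__ __device__ inline uint8_t base_at(const uint8_t *packed, int i) {
    return (packed[i / 4] >> (2 * (i % 4))) & 0x3u;
}

On a bandwidth- and transfer-bound application, the added per-access arithmetic is cheap relative to the fourfold reduction in data volume, which is the essence of the space/time tradeoff the paper advocates.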

Chair:
Gerhard Wellein, Erlangen Regional Computing Center

Authors:
Abdullah Gharaibeh, University of British Columbia
Matei Ripeanu, University of British Columbia
