Seeking Wisdom in the Clouds
Cloud computing can offer a convenient “faster, cheaper, greener” alternative to hardware ownership
“Faster, better, cheaper” and “green computing” are common themes in both general and high-performance computing nowadays. Newer multithreaded and general-purpose computing on graphics processing units (GPGPU) architectures have attracted significant attention as they transition our computational world to higher-performance, less-expensive and more power-efficient hardware. Cloud computing is an alternative application model that allows users to run small to very large distributed applications by leveraging idle computational resources at a number of institutions and data centers.
Significant cloud resources are available because many commercial data centers need to keep substantial numbers of machines powered on and available so they can instantly be tasked to handle sudden peak loads — say, if a large number of people respond to a promotion on a Web site or suddenly decide to perform search queries. Instead of wasting cycles, the data center can dedicate resources to cloud computing until a higher-priority customer needs the extra capacity.
Many highly visible names, such as Google, Amazon and Microsoft, already have leveraged their data centers to reap huge benefits from internal cloud computing efforts. The industry is now recognizing that it can generate revenue by making this excess hardware capability available to the public at heavily discounted rates. Solutions with funny names like Hadoop, the Elastic Compute Cloud (Amazon EC2) and Dryad are now making cloud computing accessible to all.
From a convenience point of view, using shared cloud resources could not be easier, as much of the infrastructure is accessed through a Web interface. As a by-product of competition, infrastructure providers must upgrade to the newest technology — and perhaps make test beds available for trying interesting new technologies. Users, in turn, benefit because they can migrate to the latest technology without any capital investment. In particular, direct hardware maintenance and support costs disappear, and cloud proponents can argue effectively that many of the indirect costs associated with hardware ownership no longer apply. For these and many other reasons, cloud computing offers a convenient “faster, cheaper, greener” alternative to hardware ownership.
The payback can be tremendous. As a teaching tool, cloud computing gives students access to computational resources well beyond those a normal educational institution could provide. Computational literacy is a core skill for today’s graduates that will form the basis for business and national competitiveness in the future.
Highlighting the ability to train users, the mathematics of evolution program at the University at Buffalo gives high school students and entry-level freshmen the opportunity to perform real science using cloud computing. It is one of many programs at a number of institutions currently exploring how to teach students to think about, and build, cloud-scale computational and data analysis applications. The hope is that, by performing real research, students will become motivated enough to pursue careers in science and technology.
Training students to write programs for the cloud also teaches them how to create parallel distributed applications and gives them hands-on experience in overcoming scaling and performance bottlenecks. In many ways, this is a win-win: the students learn to work with technologies that will be competitive in tomorrow’s job market, while the cloud infrastructure provider gets the chance to build a trained base of developers for its framework.
The cloud computing opportunity has not been lost on the large software houses. Microsoft opened up a free community technology preview of its Dryad effort, a programming model for writing parallel and distributed programs that can scale from a small cluster to a large data center. This infrastructure belongs to the same genre as the Hadoop MapReduce framework and Google’s BigTable, which are used to build “big data” analysis applications for the cloud.
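As a rough illustration of the MapReduce model that Hadoop implements at data-center scale, here is a minimal single-process word-count sketch. The function names and data are invented for illustration; this is not Hadoop’s API, which distributes the same three phases across many machines.

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model: map emits key/value
# pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(documents):
    """Emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud", "the grid and the cloud"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])    # 3
print(counts["cloud"])  # 2
```

Because map and reduce are side-effect-free over independent keys, a framework can run them on thousands of nodes and rerun any piece that fails — which is what makes the model attractive for cloud hardware.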
From a computational point of view, Dryad is very interesting because it structures all of an application’s computational tasks as a directed graph. This data structure gives the underlying software stack the ability to reorder the computation for greatest efficiency, schedule jobs and handle other complexity. Said another way, Dryad-based applications can best utilize the current state of the cloud to transparently and robustly complete a user’s work. At the moment, Microsoft is focusing Dryad on data-mining types of applications, but the graph-based implementation hints that the Dryad runtime could serve future large-scale HPC applications as well.
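The scheduling flexibility that a task graph buys can be sketched in a few lines of Python. The task names and graph below are invented, not Dryad’s API; the point is that any task whose inputs are complete may run next, so the runtime is free to reorder and parallelize.

```python
from collections import deque

# Hypothetical task graph: task -> tasks that depend on its output
edges = {
    "read": ["filter", "sort"],
    "filter": ["join"],
    "sort": ["join"],
    "join": [],
}

def schedule(edges):
    """Return one valid execution order (Kahn's topological sort)."""
    indegree = {t: 0 for t in edges}
    for deps in edges.values():
        for d in deps:
            indegree[d] += 1
    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()   # any ready task could run here,
        order.append(task)       # possibly on a different machine
        for d in edges[task]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return order

print(schedule(edges))  # e.g. ['read', 'filter', 'sort', 'join']
```

Whenever several tasks sit in the ready queue at once — here, "filter" and "sort" — a cloud runtime can dispatch them to different machines, and it can reschedule a task elsewhere if a node fails or slows down.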
Microsoft is not unique in creating a unified runtime for a complicated distributed environment. Right now, the design of scalable and flexible runtimes is an area of active research for both cloud computing and hybrid environments. The use of directed graphs also is not unique to Dryad, as is illustrated by the Directed Acyclic Graph Unified Environment (DAGuE) project at the University of Tennessee’s Innovative Computing Laboratory (ICL). However, HPC efforts like DAGuE are focused on hybrid CPU/GPU environments, as opposed to Dryad’s focus on the cloud environment. The StarPU software by INRIA (the Institut National de Recherche en Informatique et en Automatique) is another freely available runtime that is utilized by popular HPC projects, such as the Matrix Algebra on GPU and Multicore Architectures (MAGMA) project from the ICL team. MAGMA promises some wonderful speedup for matrix algebra using hybrid CPU and GPU architectures.
The current split in HPC computation between the cloud and hybrid environments reflects differences between HPC and cloud application needs. In particular, the bandwidth and latency of the network connecting the computational nodes of a supercomputer play a critical role in the computational efficiency of an HPC application. Supercomputer customers spend large amounts of money on the machine interconnect, because many important HPC problem domains tend to be dominated by network bandwidth and latency limitations. Commercial applications, on the other hand, tend to be more latency tolerant, which means the infrastructure providers can economize on the datacenter network.
The use of virtual machines in cloud computing also creates a performance challenge. The paper by Fabrizio Petrini and co-authors, “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q,” highlights the importance of eliminating latency in systems running HPC applications. As this paper showed, operating system jitter caused by system daemons that ran infrequently and only for a few microseconds at a time still caused a 2X decrease in application performance.
While 2X may not seem like much, especially when one considers the cost and other benefits of cloud computing, the extensive use of virtual machines and shared cloud computing environments only compounds the latency problem. Basically, an entire virtual machine can be preempted for some other task. For many HPC calculations, the slowest machine(s) limit the computational rate regardless of the cause of the delay (e.g., network limitations or other applications slowing the processor, memory or disk).
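A toy simulation makes the compounding effect concrete: in a bulk-synchronous computation, every step waits for the slowest node, so the chance that *some* node was interrupted grows with node count. All parameters below are invented for illustration, not measurements from ASCI Q or EC2.

```python
import random

# Toy model of jitter in a bulk-synchronous computation: each of the
# `steps` iterations ends at a barrier, so a step takes as long as the
# slowest of `nodes` processes. Each node independently suffers a rare
# `jitter`-long interruption with probability `p` per step.

def run(nodes, steps, work=1.0, jitter=10.0, p=0.01, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        slowest = max(
            work + (jitter if rng.random() < p else 0.0)
            for _ in range(nodes)
        )
        total += slowest
    return total

base = run(nodes=1, steps=1000)
big = run(nodes=1024, steps=1000)
# On one node, jitter rarely matters; across 1024 nodes, nearly every
# step catches at least one delayed node, and the slowdown approaches
# 1 + jitter/work as p * nodes grows.
print(big / base)
```

The same logic explains why preempted virtual machines hurt HPC workloads so badly: it only takes one stalled VM per step to stall everyone at the barrier.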
David Brown at Pacific Northwest National Laboratory (PNNL) demonstrated the limitation of the popular Amazon EC2 cloud computing network through a comparison with the PNNL Chinook supercomputer. The graph in Figure 1 illustrates performance on the Ohio State University multiple bandwidth test, which “evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver. This process is repeated for several iterations. The objective of this benchmark is to determine the achieved bandwidth and message rate from one node to another node with a configurable number of processes running on each node.” Note the nearly two orders of magnitude difference in performance between the EC2 cloud and the Chinook supercomputer.
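The arithmetic behind the benchmark’s reported numbers is straightforward: total bytes moved divided by measured wall time. The sketch below, with invented example figures, shows how aggregate bandwidth and message rate fall out of the quoted procedure.

```python
# Derive the two metrics the OSU multiple bandwidth test reports,
# given `pairs` process pairs each sending `window * iterations`
# messages of `size` bytes in `elapsed_s` seconds of wall time.

def osu_metrics(pairs, window, iterations, size, elapsed_s):
    msgs = pairs * window * iterations      # total messages sent
    bandwidth = msgs * size / elapsed_s     # aggregate bytes/second
    msg_rate = msgs / elapsed_s             # messages/second
    return bandwidth, msg_rate

# Hypothetical run: 4 pairs, 64-message windows, 1,000 iterations of
# 1 MiB messages, completing in 30 seconds of wall time
bw, rate = osu_metrics(pairs=4, window=64, iterations=1000,
                       size=1 << 20, elapsed_s=30.0)
print(f"{bw / 1e9:.1f} GB/s, {rate:.0f} msg/s")
```

A two-orders-of-magnitude gap in this metric means the cloud network simply cannot drain the same message volume per unit time, which is exactly the resource bandwidth-bound HPC codes depend on.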
That said, the benchmark competition has just begun, as illustrated by the Cycle Computing blog post, “A Couple More Nails in the Coffin of the Private Compute Cluster,” which cites fast benchmark results on the Amazon EC2 GPU cloud. As discussed in my Scientific Computing article, “HPC Balance and Common Sense,” it is important to look at the balance ratios for your workload to decide if a target hardware platform — or cloud computing environment — can support your computational needs. Clouds are here to stay, as they offer faster, cheaper, greener computational platforms for many applications. It is well worth downloading and trying Dryad, Hadoop and many of those other funny-sounding cloud applications.
1. The Dryad Project: http://research.microsoft.com/en-us/projects/Dryad/
2. The Innovative Computing Laboratory: http://icl.cs.utk.edu/
3. Fabrizio Petrini, Darren Kerbyson and Scott Pakin. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. In IEEE/ACM SC2003, Phoenix, AZ, November 2003. http://hpc.pnl.gov/people/fabrizio/papers/sc03_noise.pdf
4. Amazon EC2: http://aws.amazon.com/ec2/
5. HPC Balance and Common Sense: http://www.scientificcomputing.com/hpc-balance-and-common-sense.aspx
Rob Farber is a senior research scientist at Pacific Northwest National Laboratory. He may be reached at editor@ScientificComputing.com