Energy-efficient HPC is Heating Up
A new approach is needed to reach a set of common, useful metrics
Although there is a lot of discussion on how best to measure the energy efficiency of supercomputing centers, gauging interest in the topic is much more straightforward, judging by attendance at a Birds-of-a-Feather session at the recent 2010 International Supercomputing Conference in Hamburg, Germany. When we held our first BoF on energy-efficient HPC at ISC’08, only about 25 people turned out. Expecting a similar turnout for the “Setting Trends for Energy-Efficient Supercomputing” BoF this year, we were instead overwhelmed by a standing-room-only crowd estimated at 125 or more. (I co-organized the session with Horst Simon, Natalie Bates, Tahir Cader, Wu-chun Feng and Erich Strohmaier.)
This show of increased awareness indicates that power issues are becoming more urgent and will only become worse unless something dramatic changes. What we learned in the BoF discussion is that there are lots of workshops on this topic, and that people are sharing information through working groups in many countries.
The session brought together a number of key players in this field to seek common ground on how to accurately and completely measure energy use by HPC systems, including representatives of DOE’s Lawrence Berkeley National Laboratory, the TOP500 List, the Green500, the Energy Efficient HPC Working Group and The Green Grid. Our goal was to improve cooperation between industry and academic groups to develop better methods for measuring success when it comes to improving the efficiency of HPC systems. To put it bluntly, the current methods are not achieving that result.
For example, many vendors have adopted peak flop/s per watt as a measure of energy-efficiency success for marketing purposes. Although factual, these numbers don’t tell the whole story, since they provide no insight into how effective the systems are for real applications. Nor is it enough to measure the energy efficiency of individual machine components; the approach must go further still.
We need metrics that better reflect real-world use of these machines to run codes and applications. A metric that measures the energy a machine consumes to do useful work can help guide system architects through difficult design trade-offs and enable users to make well-informed purchasing decisions.
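The gap between the marketing number and a workload-oriented number can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical figures (peak rate, sustained rate on a real application, average power draw, time to solution are all made-up illustrations, not measurements from any actual system) to show how peak flop/s per watt can overstate efficiency relative to what the machine delivers on useful work:

```python
# All numbers below are hypothetical, for illustration only.
PEAK_FLOPS = 1.0e15        # advertised peak rate, flop/s
SUSTAINED_FLOPS = 1.2e14   # rate actually measured on a real application
POWER_WATTS = 2.0e6        # average system power draw during the run
RUNTIME_S = 3600.0         # time to solution for one workload, seconds

# The marketing metric: peak performance per watt.
peak_efficiency = PEAK_FLOPS / POWER_WATTS

# A workload-oriented metric: sustained performance per watt.
sustained_efficiency = SUSTAINED_FLOPS / POWER_WATTS

# Energy to solution: joules consumed to finish the useful work.
energy_to_solution_j = POWER_WATTS * RUNTIME_S

print(f"peak:      {peak_efficiency:.2e} flop/s per watt")
print(f"sustained: {sustained_efficiency:.2e} flop/s per watt")
print(f"energy to solution: {energy_to_solution_j / 3.6e6:.0f} kWh")
```

With these made-up figures the system sustains only 12 percent of peak, so the application-level flop/s-per-watt figure is roughly an order of magnitude below the advertised one, which is exactly the kind of insight a useful-work metric would surface.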
The efficiency with which a system integrates into the overall facility is also very important and can’t be overlooked. Even with an accurate and acceptable metric for each piece of equipment installed in a data center, total energy efficiency can still be poor if the facility is inefficient to cool or if delivering adequate power to the center is wasteful. Hiding or ignoring this energy burden might make a facility appear more energy-efficient, but that appearance won’t be borne out by the bottom line. We have to look at the end-to-end burden and not shift energy costs from the systems to the building infrastructure.
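One established way to capture this facility-level burden is The Green Grid’s Power Usage Effectiveness (PUE) metric, the ratio of total facility energy to the energy consumed by the IT equipment itself. Below is a minimal sketch; the monthly figures are hypothetical, chosen only to illustrate how cooling and power-delivery overhead shows up in the ratio:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy over IT energy.

    1.0 is the ideal; everything above 1.0 is overhead spent on
    cooling, power conversion and distribution, lighting, and so on.
    """
    return total_facility_kwh / it_equipment_kwh

# Hypothetical monthly figures: highly efficient compute nodes mean
# little if the building spends 80% as much again on overhead.
ratio = pue(total_facility_kwh=1_800_000, it_equipment_kwh=1_000_000)
print(f"PUE = {ratio:.2f}")  # 1.80: 0.8 kWh of overhead per kWh of compute
```

A single number like this makes the end-to-end burden visible, so a site cannot appear efficient simply by pushing energy costs from the systems onto the building infrastructure.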
Academic institutions, laboratories and other HPC end-users have a much more detailed understanding of their HPC application performance and requirements than their industry counterparts. On the other hand, industry has a long history of developing rigorous methods for measuring various performance aspects of its systems. If we combine this expertise in metrics with user experience, we can work together more effectively to establish rigorous, common standards for measuring total energy efficiency. However, the HPC community has to be able to agree on these metrics; otherwise, we will simply be comparing apples to oranges.
The BoF session enabled the universities, DOE laboratories and industry to lay out their respective roles in creating this new approach to energy-efficiency assessment for HPC systems. We believe this is how research institutions can work with the HPC industry to develop systems that are truly more efficient. As these systems are installed and tested, we have a responsibility to accurately measure the total energy cost, pointing out which areas are inefficient, and why. Based on our experience, many vendors don’t have a clear understanding of the end-user workload of the systems they provide.
The BoF at ISC’10 was a good step in this direction, but there is obviously much to be done. Our next steps include organizing additional workshops with the aim of developing a proposed measure of workload effectiveness. This proposed standard would then be circulated among the working groups for feedback and further refined. The immediate challenge is to capitalize on this heightened awareness and translate it into the momentum needed to reach a set of common, useful metrics.
John Shalf leads the Advanced Technologies Group at DOE’s National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. He may be reached at editor@ScientificComputing.com.