Natalie Bates chairs the Energy Efficient High Performance Computing Working Group (EE HPC WG). The purpose of the EE HPC WG is to drive implementation of energy conservation measures and energy efficient design in HPC. At ISC’14, Bates will chair the session titled Breaking Paradigms to Meet the Power Challenges, which will feature two presentations: the first describing why BMW moved their HPC center to Iceland and the second discussing the energy savings that can be achieved with SoC technology that integrates heterogeneous cores, NICs, stacked memory, and on-chip networks. We recently caught up with her to discuss some of today’s key power efficiency challenges.

What do you think are the biggest impediments in creating more energy-efficient HPC?

Bates: The biggest impediments in creating energy-efficient HPC are the scaling limitations of the silicon-based technologies that are used for fabricating the components of HPC systems. Perhaps, one day, an unconventional method of computing will open new versions of Moore’s Law and Dennard scaling. Quantum computing, optical computing, and biomolecular computing are just some of the many alternative methods of computing that could be much more energy-efficient than today’s silicon-based computing. Unfortunately, all of these methods of computing are in early stages of research and, although they are important to pursue, none of them can be counted on to help with creating more energy-efficient HPC in the near future.

A more immediate opportunity is what Barroso and Hölzle from Google call “energy proportional computing”: using resources in a balanced way so that all the energy consumed is optimally doing useful work. The corollary of this is that, when resources are not doing useful work, their energy consumption is nil or minimal. This is a general statement and can be applied to many aspects of computing systems. A ubiquitous implementation of this is sleep or power-down modes on electronic devices. Another more technically savvy, but commonly understood, implementation is the use of the dynamic voltage and frequency scaling feature of compute components. One simple and relevant way of describing this opportunity is to think about reducing idle power, which has been decreasing as a percentage of peak power, but is still quite high.
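To make the idle-power point concrete, here is a small sketch (not from the interview) of the linear power model often used when discussing energy proportionality; the wattage and utilization figures are hypothetical illustrations:

```python
# Illustrative sketch of energy proportionality using a simple linear power
# model (in the spirit of Barroso and Hölzle). All numbers are made up.

def power_draw(utilization, p_idle=100.0, p_peak=200.0):
    """Power (watts): an idle floor plus a utilization-proportional part."""
    return p_idle + (p_peak - p_idle) * utilization

def efficiency(utilization, p_idle=100.0, p_peak=200.0):
    """Useful work per watt, normalized so that full load at peak power = 1.0.
    A perfectly energy-proportional machine would score 1.0 at any load."""
    if utilization == 0:
        return 0.0
    work = utilization * p_peak  # work delivered scales with utilization
    return work / power_draw(utilization, p_idle, p_peak)

for u in (0.1, 0.5, 1.0):
    print(f"utilization {u:>4.0%}: relative efficiency {efficiency(u):.2f}")
```

The point of the sketch: with a high idle floor, a machine running at 10% load is far less than 10%-of-peak efficient, which is why reducing idle power matters so much.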

Although there have been major improvements in the energy consumed by the infrastructure, as evidenced by power usage effectiveness (PUE) measures on the order of 1.2 or less for many of the major supercomputing centers, there remains a strong regional dependence that can be an impediment to more energy-efficient HPC, as well as to reducing greenhouse gas emissions. This raises the question of site location. Can we leverage the climate advantages, as well as the renewable electricity sources, of locations like Iceland for siting our systems?

Creating more energy-efficient HPC is a continuous improvement process that requires the right tools for measuring, taking action, checking the results and iterating in a virtuous cycle. This is true for the infrastructure as well as for all levels of the system, from components through applications. We haven't had to think about energy efficiency, nor have we had the tools to measure it. Once the right tools are in place, we can start wrapping our heads around what the contributing factors to better efficiency are, and from that we can start influencing hardware designs.

There is also a lack of standardization and metrics for energy efficiency. Depending on the target group (e.g., application developers, system integrators, HPC data center administrators), the expectations and goals differ — sometimes these goals and optimizations conflict with each other. Should we care about FLOPS per watt, or cost per watt, or energy-delay products, or exceeding a power budget, or science accomplished per watt, or utilizing allocated power well?
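The conflict between metrics is easy to demonstrate with numbers. The following sketch uses entirely hypothetical run data (nothing here comes from the interview) to show two common candidates, FLOPS per watt and the energy-delay product, ranking the same pair of runs in opposite orders:

```python
# Hypothetical benchmark runs: same total work, different speed and power.
runs = {
    "run_a": {"flops": 1.0e15, "seconds": 100.0, "avg_watts": 400.0},
    "run_b": {"flops": 1.0e15, "seconds": 80.0,  "avg_watts": 550.0},
}

def flops_per_watt(r):
    """Sustained FLOPS divided by average power draw (higher is better)."""
    return (r["flops"] / r["seconds"]) / r["avg_watts"]

def energy_delay_product(r):
    """EDP = energy consumed (joules) x runtime (lower is better)."""
    energy_joules = r["avg_watts"] * r["seconds"]
    return energy_joules * r["seconds"]

for name, r in runs.items():
    print(f"{name}: {flops_per_watt(r):.3e} FLOPS/W, "
          f"EDP {energy_delay_product(r):.3e} J*s")
```

With these numbers, the slower, lower-power run wins on FLOPS per watt while the faster, hungrier run wins on energy-delay product — exactly the kind of conflict between optimization goals described above.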

As you can see, there are many impediments and opportunities to creating more energy-efficient HPC. We have been making steady progress on improving the energy efficiency of both infrastructure and systems, but we have also been steadily growing the power and energy requirements for HPC.

Is there still a place for performance at any cost, or will the entire HPC industry be forced to reorient its priorities around efficiency concerns?

Bates: I am not sure there is a potential for a "global" performance at any cost. There are pockets of applications or situations where performance at any cost does, and most likely will continue to, exist. Some of these pockets are applications or situations with a real-time aspect or a time-critical element. That being said, even these application spaces are looking to optimize the amount of work returned for the energy being consumed. Sloppy power and energy consumption will, hopefully, not have a major place going forward in the computing environment.

What types of approaches are being studied in the Energy Efficient HPC Working Group that you chair?

Bates: The purpose of the EE HPC WG is to drive energy efficiency measures and design in HPC. It is a forum for sharing of information (e.g., best practices and peer-to-peer exchange) as well as collective action (guidelines, recommendations, collaborations). There are over 450 members from 20 different countries. It is an open membership with participants from government agencies (50%), vendors (30%) and academe (20%). Most of the active participants are from the large supercomputing centers of the United States Department of Energy Laboratories and other large supercomputing centers in Europe. This collective voice provides a strong influence to encourage system integrators, standards bodies and other organizations to actively participate in the drive for energy efficiency measures and design.

Some of the most significant results to date for the EE HPC WG include:

  • Development of guidelines for liquid cooling inlet temperatures, which have been adopted and published by ASHRAE. These guidelines encourage more efficient cooling technologies.
  • Development of an improved power measurement methodology for use while running benchmarks and workloads. This was done in collaboration with the Green500, Top500 and the Green Grid and has been adopted by the Green500 as a supplement to their run-rules.
  • Development of an improved data center infrastructure energy efficiency metric TUE (total usage effectiveness) – Gauss Best Paper Award at ISC13. TUE is an improvement over the widely adopted power usage effectiveness (PUE) metric.
  • Development of procurement considerations for energy efficient HPC with a specific focus on measurement capabilities and requirements.

Other teams are in a more formative stage of activity. One team has developed a commissioning guideline for the building infrastructure that supports liquid-cooled HPC systems. This team is currently working with ASHRAE to adopt and promote these guidelines as part of their informative publications.

Another team is investigating the opportunities for large supercomputing sites to develop closer relationships with their electricity service providers. These relationships, similar to other commercial and industrial partnerships, are driven by mutual interest to reduce energy costs and improve electrical grid reliability.

How will energy efficiency be measured? Are specific goals being proposed?

Bates: Ultimately in HPC, performance is measured by the amount of meaningful work accomplished per dollar spent. Useful efficiency gains will be expressed in terms of being able to tackle larger problems for the same amount of money, or reducing the price of solving fixed-size problems. 

We need a better set of metrics. A single metric is not sufficient — it needs to be a customizable set of metrics based on what your computing center is trying to accomplish. 

More than a single metric is also required for comparing the energy efficiency of system architectures. High Performance Linpack (HPL) is effective for stressing the compute sub-system, but does not provide an effective stress for other subsystems, such as memory, interconnect and storage. There is an opportunity for developing and/or promoting other community-wide benchmarking efforts to supplement the extensive knowledge and information we have with HPL, both in understanding performance and energy efficiency.

For the data center, PUE has been successful in improving energy efficiency, but it is not perfect. One challenge is that PUE does not account for the power distribution and cooling losses inside computer systems, which is particularly problematic for high performance computing. Another challenge is that PUE is NOT intended to be used to compare data centers. The EE HPC WG has developed two metrics: ITUE (IT-power usage effectiveness), which is similar to PUE but applied “inside” the system, and TUE (total-power usage effectiveness), which combines the two for a total efficiency picture.
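A small sketch may help show how the three metrics relate. The energy figures below are made up for illustration; the relation TUE = ITUE × PUE follows directly from the ratio definitions:

```python
# Illustrative, made-up energy figures (kWh over some fixed interval).
facility_total = 1200.0  # everything the data center draws
it_total       = 1000.0  # power delivered to the IT systems
it_compute     = 850.0   # power reaching the actual compute components
                         # (excludes in-system fans, PSU losses, etc.)

pue  = facility_total / it_total    # classic data-center metric
itue = it_total / it_compute        # "PUE inside the system"
tue  = facility_total / it_compute  # end-to-end efficiency picture

print(f"PUE={pue:.2f}  ITUE={itue:.2f}  TUE={tue:.2f}")
```

Note that a site with an excellent PUE of 1.2 can still have a noticeably worse TUE once in-system losses are counted, which is the gap the EE HPC WG metrics were designed to expose.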

I mentioned earlier that PUE does not capture the broader environmental considerations of the HPC data center. There are other metrics developed by the Green Grid and the EE HPC WG that are measures of carbon and water usage effectiveness as well as energy re-use effectiveness. These metrics are not widely deployed, but could help to provide a broader focus on impact to the environment of HPC data centers.

I think that HPC provides a lot of social utility. There is better science and engineering that result in a net savings of energy and carbon. Consider the carbon footprint and the amount of energy used by the automotive industry for crash testing. They go through the entire process of building test vehicles and then crashing them to see how the vehicle and passengers would fare in a collision. Now, imagine replacing much of that with simulation: the energy used and the carbon footprint of modeling vehicle collisions is far less than that of building, crashing and disposing of vehicles. We have been talking about metrics like PUE and CUE. The Coefficient of Performance of Carbon is another metric we may want to develop. It is defined such that greater than zero is good, and there is no upper limit. I assert that COPcarbon for HPC is much greater than zero.

How will this new focus change the way HPC hardware and software components are designed?

Bates: HPC is enough of a niche that we don't have much influence over component design. Fortunately, saving energy is driving consumer electronics as well, and we can benefit from the hardware changes that are taking place there. For example, vendors will begin to create heterogeneous SoCs that borrow technologies and techniques from the cost- and power-sensitive design requirements found in the embedded world. Current multi-core chips have networks that are relatively simple and allow for all-to-all communication on chip. However, as core counts continue to increase, we will see more complex network-on-chip architectures begin to emerge. These on-chip networks will connect many components in a heterogeneous system that includes multiple core types, integrated NICs and stacked memory.

Changes to software design are going to be profound, but that's still very much an active research area, and the result won't be filtering into real-world best practices for a few years.

Digging a little deeper, how will processors, systems, application software and data centers need to change to deliver more energy efficient outcomes?

Bates: I think the key is to involve hardware vendors, application developers/scientists and system administrators in influencing the next generation of energy-efficient HPC. The process should be similar to the existing co-design process, and all areas, including computer architecture, system software, application software, programming models, and toolsets, should evolve to accomplish this goal. The goal will be to keep resources well utilized and to be able to dynamically reconfigure based on targets for power and energy and for actual work accomplished.

In closing, I’d like to recognize a few other members of the EE HPC WG who were particularly engaged in helping me to answer these questions. They are: Stephen Poole, Barry Rountree, Tapasya Patki, Sridutt Bhalachandra and Thomas Ilsche.