Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations.

Optimization for high performance and energy efficiency is a necessary next step after verifying that an application works correctly. In the HPC world, profiling means collecting data from hundreds to potentially many thousands of compute nodes over the length of a run. In other words, profiling is a big-data task, but one where the rewards can be significant, including potentially saving megawatts of power on a leadership-class system and/or reducing the time to solution so more scientists can utilize these precious resources.

The choice of profiling tool is important, as many HPC systems utilize GPUs or Intel Xeon Phi coprocessors to accelerate massively parallel, floating-point intensive sections of the code. While these devices have proven successful in providing massive parallelism at high FLOPs/watt, they also complicate profiling significantly by introducing one or more devices, each containing a separate memory space and requiring data movements across a PCIe bus. However, massively parallel accelerators are here to stay, which means that modern profiling tools need to support GPUs and/or Intel Xeon Phi.

James Reinders and Jim Jeffers noted in their book High Performance Parallelism Pearls that, by mid-2013, Intel Xeon Phi coprocessors “exceeded the combined FLOPs contributed by all the graphics processing units (GPUs) installed as floating-point accelerators in the TOP 500 list.” The only devices that contributed more FLOPs to the TOP500 list were Intel processors.

One of the motivators for this move to GPUs and Intel Xeon Phi coprocessors is the large FLOPs/watt ratios such devices can deliver. From a financial perspective, the Green 500 recognizes that the operating costs of today’s petascale systems are on par with the acquisition costs of the actual supercomputer hardware itself. As the industry moves to exascale computing and beyond, the focus on energy efficiency (and hence financial savings) will become even more pronounced. A joke in the HPC world is that the power companies should donate the hardware because they will recoup the cost in electricity charges.

The paper, “Energy Evaluation for Applications with Different Thread Affinities on the Intel Xeon Phi,” by Lawson et al., measured energy consumption as a function of thread affinity and number of threads on an Intel Xeon Phi. They showed that “varying thread affinity may improve both performance and energy, which is the most apparent under the compact affinity tests when the number of threads is larger than three per core. The energy savings reached as high as 48 percent for the CG NAS benchmark.”

To put this in perspective, the 33 PF/s Chinese Tianhe-2 supercomputer contains 48,000 Intel Xeon Phi coprocessors. This system has a peak power consumption of 24 megawatts (million watts). For applications that run across all 48,000 coprocessors, the Lawson paper indicates that literally megawatts of power can be saved through energy and performance profiling.

Tianhe-2 is not alone, as the Trinity supercomputer procurement by the NNSA will provide another 30 PF/s Intel Xeon Phi supercomputer for use by the United States, although this system will be powered by the stand-alone KNL-based Intel Xeon Phi devices. Similarly, the NERSC Cori procurement will provide another leadership-class, KNL-powered supercomputer for scientific research. Careful power and performance profiling across these machines, for example by simply investigating the appropriate thread affinity, can save significant power at each of these organizations.

On the GPU side, procurements such as Summit, which will be housed at Oak Ridge National Laboratory, and Sierra, to be installed at Lawrence Livermore National Laboratory, will deliver very large NVIDIA GPU-powered systems.

The petascale and potentially exascale-capable MAP profiler from UK-based developer Allinea will be utilized by both Los Alamos National Laboratory and Sandia National Laboratories for profiling codes on the Trinity supercomputer. This profiler also supports GPU-based systems.

For example, David Lecomber of Allinea relates a success story from moving the HemeLB code to the new Archer supercomputer in the UK. The application simply did not perform as expected but, as with most MPI codes, the programmers had no concrete data to go on except their intuition. Was the cause poor partitioning or poor network performance?

A 25 percent performance increase was achieved after the culprit was identified through profiling: a section of code that performed parallel I/O was preventing efficient CPU utilization. After reworking that section of code, the application started delivering the expected performance.

A key feature of the Allinea MAP profiler is that it uses adaptive sampling to avoid the big-data complications when profiling tens to hundreds of thousands of processes. In other words, the profiler’s sampling frequency adapts (and some early samples are discarded) so that the number of samples taken is roughly constant over the length of the run. Allinea uses this method to prevent over-sampling and to ensure that the final output shows consistent sampling, meaning the first N seconds of a long run have the same number of samples as the last N seconds. All the per-process data is then merged up a scalable tree to produce compact profile files rather than gigabytes (or terabytes) of profile data.
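The idea of keeping the sample count bounded while the sampling frequency adapts can be illustrated with a small Python sketch. This is not Allinea's implementation, whose internals are not described here; the class name AdaptiveSampler, its parameters, and the "drop every other sample and double the interval" rule are all assumptions chosen to show one simple way such a scheme could behave.

```python
class AdaptiveSampler:
    """Keep at most max_samples evenly spaced samples over a run of unknown length.

    Illustrative only: the class name, parameters, and thinning rule are
    assumptions made for this sketch, not Allinea MAP's implementation.
    """

    def __init__(self, max_samples=8, initial_interval=1.0):
        if max_samples < 2 or max_samples % 2 != 0:
            raise ValueError("max_samples must be an even number >= 2")
        self.max_samples = max_samples
        self.interval = initial_interval  # seconds between retained samples
        self.samples = []                 # retained (timestamp, value) pairs
        self._next_due = 0.0              # earliest time the next sample is accepted

    def offer(self, timestamp, value):
        """Offer a measurement; return True if it was retained."""
        if timestamp < self._next_due:
            return False  # too soon after the last retained sample
        if len(self.samples) >= self.max_samples:
            # Buffer is full: discard every other early sample and halve the
            # sampling frequency so the survivors stay evenly spaced.
            self.samples = self.samples[::2]
            self.interval *= 2
        self.samples.append((timestamp, value))
        self._next_due = timestamp + self.interval
        return True


def simulate_run(n_ticks, max_samples=8):
    """Feed one synthetic measurement per second for n_ticks seconds."""
    sampler = AdaptiveSampler(max_samples=max_samples, initial_interval=1.0)
    for t in range(n_ticks):
        sampler.offer(float(t), value=t % 5)  # placeholder metric, not real data
    return sampler
```

Under these assumptions, simulate_run(100) ends with seven retained samples spaced 16 seconds apart, so the early and late parts of the run are covered equally, and a much longer run still holds no more than eight samples.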

In contrast, Evan Felix and the Environmental Molecular Sciences Laboratory (EMSL) IT staff collect performance counters for every application that runs on their large supercomputers using the freely available NWPerf software, which can be downloaded from GitHub. The NWPerf data collection is designed to have a minimal impact on application performance yet provide a historical record that is invaluable to software vendors and scientists, and for evaluating balance metrics for future procurements.

The performance behavior of any run or set of runs can be accessed via a Web interface. While the collected information is large, it can still be visualized effectively by a “waterfall plot,” where data from each node is displayed as a line that is part of a three-dimensional plot displaying information from all the nodes utilized by the application.

Figure 1 shows the power utilization of a number of Intel Xeon Phi coprocessors when running NWChem on a PNNL cluster.

Figure 1: An NWChem run. Courtesy EMSL, a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory

The lower right waterfall plot in Figure 2 shows the relatively constant power draw during a Linpack run.

Figure 2: Profile of a Linpack run. Courtesy EMSL, a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory

Having a historical record of application performance measurements has proven invaluable when tracking down performance regressions in user codes. For example, users have approached the PNNL support staff complaining that an application appears to be running slower than usual. Is this a hardware problem or a performance regression? In one instance, a compiler update included a performance bug where the parallel loop generation exceeded the boundaries of an allocated region of memory. The application still computed the correct result, but performance slowed due to floating-point traps that occurred when reading undefined floating-point values from outside of allocated memory. The historical record showed that the problem began after the compiler update, and it provided the data that helped identify the problem so it could be communicated to the vendor, who could then fix it and restore high application performance.
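As a rough illustration of how a historical record can separate "before" from "after" a system change, the following Python sketch compares median run times on either side of a change date. The record layout (a plain list of date and run-time pairs), the function name slowdown_after, and the sample values are hypothetical; this does not reflect NWPerf's data format or interface.

```python
from datetime import date
from statistics import median


def slowdown_after(records, change_date):
    """Return median run time on/after change_date divided by the median before it.

    records is a list of (run_date, runtime_seconds) pairs for one application.
    A result well above 1.0 suggests a regression that began with the change.
    """
    before = [runtime for run_date, runtime in records if run_date < change_date]
    after = [runtime for run_date, runtime in records if run_date >= change_date]
    if not before or not after:
        raise ValueError("need runs on both sides of the change date")
    return median(after) / median(before)


# Hypothetical history for one application; the values are made up.
HISTORY = [
    (date(2015, 3, 1), 100.0),
    (date(2015, 3, 8), 102.0),
    (date(2015, 3, 15), 98.0),
    (date(2015, 4, 2), 130.0),   # runs after an assumed compiler update
    (date(2015, 4, 9), 128.0),
    (date(2015, 4, 16), 132.0),
]
COMPILER_UPDATE = date(2015, 4, 1)  # assumed change date for illustration
```

Medians are used here rather than means so that a single unusually slow or fast run does not dominate the comparison; that is a design choice of this sketch, not something prescribed by the tools discussed.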

This same historical profile record has proven invaluable for new machine procurements. As discussed in my 2007 Scientific Computing article “HPC Balance and Common Sense,” the performance of a new machine can be extrapolated using balance ratios and a knowledge of workload requirements from an existing machine. For example, if a frequently used application is memory bound, which would be represented by a low FLOPs/memory bandwidth ratio, then a new machine must have a similar or lower FLOPs/memory bandwidth ratio (that is, at least as much memory bandwidth per FLOP), else performance will likely suffer because the floating-point units will be starved for data. Similarly, the balance ratios of other top applications can be determined simply by calculating the cumulative time each application ran on a system along with a statistical description of the balance metrics (e.g., mean, standard deviation and harmonic mean).
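The balance-metric bookkeeping described above can be sketched in a few lines of Python. The function names, the choice to weight each application by its cumulative run time, and the workload numbers are assumptions made for illustration; they are not taken from the 2007 article or from any real system.

```python
import math


def balance_ratio(flops_per_second, bytes_per_second):
    """FLOPs per byte of memory traffic (a FLOPs/memory bandwidth ratio)."""
    return flops_per_second / bytes_per_second


def time_weighted_stats(ratios, hours):
    """Summarize per-application ratios, weighting each by cumulative run time."""
    total = sum(hours)
    mean = sum(r * h for r, h in zip(ratios, hours)) / total
    variance = sum(h * (r - mean) ** 2 for r, h in zip(ratios, hours)) / total
    harmonic = total / sum(h / r for r, h in zip(ratios, hours))
    return {"mean": mean, "stddev": math.sqrt(variance), "harmonic_mean": harmonic}


# Hypothetical workload: (application, cumulative hours, FLOP/s, bytes/s).
WORKLOAD = [
    ("app_a", 600.0, 2.0e11, 1.0e11),
    ("app_b", 300.0, 4.0e11, 1.0e11),
    ("app_c", 100.0, 8.0e11, 1.0e11),
]


def summarize(workload):
    """Compute time-weighted balance statistics for a list of workload tuples."""
    ratios = [balance_ratio(flops, bw) for _, _, flops, bw in workload]
    hours = [h for _, h, _, _ in workload]
    return time_weighted_stats(ratios, hours)
```

For this made-up workload, summarize(WORKLOAD) gives a time-weighted mean of 3.2 FLOPs per byte and a harmonic mean of roughly 2.58. The harmonic mean sits lower because it gives more influence to the low-ratio, memory-hungry applications.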

This article has barely scratched the surface of the potential uses of profiling information and of the wealth of debugging and profiling tools that are available to the end user. Regardless of the tool utilized, it is important to regularly profile production codes to ensure that silent performance regressions don’t enter the tool chain. The end result will be a happier user base, precious HPC computing resources made available to more users, and potential savings in both energy and energy costs.
