Power Optimization in HPC, Enterprise and Mobile Computing
Amazing juxtaposition of interests spurs marvelous increase in power efficiency, performance
Figure 1: The Portland Group demonstrated a small four-node ARM cluster at Supercomputing 2012.
The following facts demonstrate the sheer scale of mobile and data center usage that affected power consumption in 2011:
• Worldwide, data centers were estimated to consume 1.3 percent of the world’s electrical consumption (source: Time Tech).
• 90 percent of online users use search engines — that means 1.7 billion people (source: McKinsey).
• There are over 1.2 billion active mobile subscriptions (source: pingdom.com).
• Over 1.6 trillion searches a year are currently conducted globally (source: McKinsey).
• One Google search is equal to turning on a 60W light bulb for 17 seconds (source: Time Tech).
Clearly, enterprise transactions are dominating both data center design and power consumption. For example, only 12.5 megawatts of the 260 megawatts Google consumes is devoted to search. In contrast, the recently upgraded Oak Ridge National Laboratory (ORNL) Titan supercomputer consumes nine megawatts of power. However, Google’s power usage is distributed around the world, while the Titan supercomputer packs all that power consumption into the 200 racks that comprise a single machine.
Looking to the future, exascale supercomputers will perform 1,000 quadrillion (or 1 quintillion) computations per second. Even though Titan is ranked third on the Green500 list (the TOP500 supercomputer list reordered by energy efficiency), scaling Titan to exascale would multiply its power draw more than 50-fold, to roughly 511 megawatts. It is just not feasible, from either a technical or financial standpoint, to have each rack consume 2.5 megawatts (the rough equivalent of 25,000 one-hundred-watt bulbs), which explains why HPC computer scientists are intently focused on power consumption.
BIG AND SMALL NUMBERS
The tale of the big numbers tells the same story in the enterprise, where reducing power consumption has clear financial benefits. Trimming Google’s 260-megawatt power draw will clearly benefit the bottom line: a 10 percent reduction would nearly pay to run three Titan supercomputers. Now, multiply that interest by key players such as Amazon, Microsoft Bing, Facebook and others.
The tale of the small numbers tells the story in the mobile market, where a 10 percent reduction in power consumption translates to smartphones that run 10 percent longer. It is certain that every one of the more than 1.2 billion active mobile subscribers will experience a dead battery. It only takes one dead battery to create an educated consumer who will look for a longer-lasting product the next time they purchase a cell phone.
This convergence of interests has stimulated investment and a rapid evolution in computing hardware.
At the moment, ARM-based processors are the CPUs for the mobile market, while x86 hardware dominates servers in the enterprise. If we can believe the claims and benchmark results on the Internet, this stratification of processor by market segment is set to change in the next two years. Further, we can expect orders of magnitude increases in power efficiency.
In the x86-dominated enterprise space, an ARM-based Apache Web server startup notes it can service 5,500 requests per second while consuming 5 watts of power. On that same benchmark, an Intel Xeon E3-1240 server serviced 6,950 requests per second while consuming 102 watts of power. A 20x reduction in power consumption that preserves 80 percent of the performance can be a significant win for many in the enterprise computing space.
Innovators should note that the commodity nature of ARM devices makes it easy to build systems and computational clusters that consume — by current standards — extraordinarily small amounts of power. To demonstrate their software, The Portland Group built a small four-node ARM cluster that they showcased at Supercomputing 2012. As can be seen in Figure 1, when running (not idling), this cluster consumed 8.5 watts of power. The watt meter reports the total power consumption of the cluster, including the switch.
To compete in the ARM-dominated mobile market, Intel Labs demonstrated at last year’s IDF conference a 10-milliwatt (thousandths of a watt) processor running Linux, powered by a postage-stamp-sized solar cell illuminated by incandescent lighting. Intel claims this power consumption, with the processor awake and running, is far lower than the idle suspend state of many mobile processors. The stated goal for this Intel research project is a 300-fold improvement in energy efficiency over the next 10 years. Industry bloggers expect the release of the Medfield SoC (System on a Chip) to be Intel’s first definitive move into the mobile market.
In the HPC space, NVIDIA was busy at Supercomputing 2012 with the consumer launch of the K20 Kepler GPUs. ORNL reached the top spot on the TOP500 list by installing 18,688 of the Tesla K20X GPU accelerators. The flagship K20X GPU claims 3.95 teraflops single-precision and 1.31 teraflops double-precision peak floating-point performance. The K20 accelerator delivers a still-strong 3.52 teraflops of single-precision and 1.17 teraflops of double-precision performance at peak. The K20-based GPUs deliver this performance in a 225-watt PCIe card with memory and support hardware, which is why Titan is also ranked third on the Green500 list of supercomputers.
In the HPC space, the TACC (Texas Advanced Computing Center)’s 10 petaflop Stampede supercomputer is set to go online early in 2013. This will be the first production deployment of the new Intel Xeon Phi coprocessor. As an unknown quantity, these new Intel devices will be closely watched by the HPC community from both a power and a performance perspective. However, the No. 1 spot on the Green500 list is held by a small Intel Xeon Phi compute cluster.
What is unusual about the Intel Xeon Phi product line is that they were designed to be coprocessors that run Linux. From a performance point of view, the Intel Xeon Phi devices can provide a teraflop/s of vector floating-point performance by:
1. using pragmas to augment existing codes so they offload work from the host processor to the Intel Xeon Phi coprocessor(s)
2. recompiling the source code to run directly on the coprocessor as a separate native SMP (Symmetric Multi-Processor) many-core Linux computer
3. accessing the coprocessor as an accelerator through optimized libraries, such as the Intel Math Kernel Library (MKL)
4. using each coprocessor as an MPI node or, alternatively, as a device containing a cluster of MPI nodes
Unlike GPUs, which act only as accelerators, Intel Xeon Phi coprocessors can be used as support processors merely by compiling existing applications to run natively on the PCIe device. While Intel Xeon Phi coprocessors probably will not be performance stars for non-vector applications, they still can be used to speed applications through 61-core parallelism and high memory bandwidth.
ARM-based supercomputers also are being constructed. The Barcelona Supercomputing Center currently is building a supercomputer around the Tegra 3, an ARM-based tablet/cellphone chip. They claim their prototype boards are able to deliver 5 Gflop/s per watt. To put this in perspective, a nine-megawatt Titan-class supercomputer built from the Barcelona prototype technology could potentially deliver 45 petaflop/s of performance.
Both Intel Xeon Phi coprocessors and ARM supercomputers need to address the concern that there is a big difference between prototype results and production performance. It will be interesting to see how the TACC Stampede supercomputer, publicly rated at 10 petaflop/s, compares from a power and achieved-in-production floating-point efficiency with Titan. It is certain that the forthcoming accelerator-versus-coprocessor debates will be both vocal and visible. Looking beyond that debate, it will be interesting to see what the achieved floating-point efficiency will be on the Barcelona ARM-based supercomputer. With a strong petascale potential, ARM can be an HPC contender. The future for consumers of floating-point performance is going to be very, very interesting!
Nothing is free. Total system power consumption is a function of both the processor and supporting hardware, such as memory. As memory capacity increases, so does the power consumption of the memory subsystem. The designers of PCIe devices, such as GPUs and the Intel Xeon Phi coprocessors, are acutely aware of the memory issue, as are users limited by the memory capacity (6 to 8 GB maximum) of these devices.
I have noted in my past Scientific Computing articles that Hybrid Memory Cube (HMC) technology has the potential to disrupt the computer industry. The Hybrid Memory Cube Consortium consists of some of the industry’s heaviest hitters: Samsung, Microsoft, IBM, ARM, HP, Altera, Xilinx, Open-Silicon and SK Hynix. HMC devices are multi-chip modules (MCM) specifically designed to meet the needs of CPUs and GPUs in a way that is attractive to computer manufacturers. These devices offer extraordinary bandwidth per module (168 GB/s, or 1,344 Gb/s), consume 70 percent less energy and require 90 percent less space than conventional DDR3 memory. It is easy to see how six banks of HMC memory could accelerate massively parallel GPU and Intel Xeon Phi devices that currently provide around 200 GB/s of bandwidth.
At their Supercomputing 2012 booth, Micron said they are sampling product now and quoted production delivery in Q2 2013 for high-speed routers and real-time devices. The nice thing about HMC technology is that the vertical stacking and I/O interface of this memory architecture can work for other memory technologies, such as NAND flash, and for newer memory types in development, such as magneto-resistive random-access memory (MRAM) and phase-change memory (PCM).
Two MIT professors, Joel Dawson and David Perreault, claim to have solved a major problem in mobile computing with a new power amplifier that utilizes just half the power of current devices. Most people notice their smartphone gets warm whenever they stream videos or play games. The claim is that the cause of this heat is not the processor, but rather the power amplifiers in the cell phone that waste as much as 65 percent of their energy. The new technology, dubbed asymmetric multilevel outphasing, basically selects the best possible voltage to use when the mobile device is communicating in order to minimize power consumption.
Another mobile technology startup has devised tools that enable the manufacture of OLED displays that use a fraction of the energy of current displays yet produce displays that are brighter, sharper, viewable from multiple angles, compact, thin and flexible, and reduce glare so you don’t have to cup your hand over the screen to read it. Plus, these tools also reduce the cost of manufacture. While this all sounds too good to be true, apparently the major display manufacturers are already integrating this new manufacturing toolset.
Optimizing algorithms for energy consumption rather than performance is an area of active research, especially for mobile computing. Simply put, moving data around costs a lot more energy than processing it. For example, Chris Shore at ARM notes, “[i]f the cost of using an instruction is 1, then the cost of a tightly coupled memory (TCM) access is roughly 1/25, the cost of a cache access around 1/6. The cost of an external RAM access is seven times the cost of an instruction execution.” In other words, reuse data in registers or on-chip as much as possible to achieve both high performance and high energy efficiency.
An amazing juxtaposition of interests by consumers and the dominant technology companies in mobile, enterprise, and HPC has spurred a marvelous increase in the power efficiency and performance of our computer hardware over the past few years. Expect even greater progress in the next few years, as there is simply too great an economic and technology need for even more efficient computational devices. While our progress has been rapid, it is clear that our latest, best and most advanced power-efficient technology pales in comparison to the efficiency of even the simplest animal brains.
Rob Farber is an independent HPC expert to startups and fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.