Erik DeBenedictis is on the staff at Sandia National Labs and participates in the IEEE Rebooting Computing initiative and International Technology Roadmap for Semiconductors.

While the goal of the Department of Energy’s ambitious Exascale Computing Initiative (ECI) is to build an exaflop-scale supercomputer sometime in the 202X timeframe, the planning horizon extends further. The ECI program is about continuing performance scaling in the absence of technology scaling of the underlying substrate. This makes it appropriate to consider post-exaflops scenarios and to develop a de facto roadmap for long-lifespan issues such as software and the career paths of staff.

The key issue is a shift in the technology path for semiconductors (see Rebooting Supercomputing). While the semiconductor industry is poised to implement more transistors per chip and per dollar, the power efficiency of the transistor has not improved at historical rates. Left unchecked, this trend will cause successive generations of chips to consume more energy. The cost of purchasing a computer today is already close to the cost of the energy consumed over its lifetime. If the current trend is allowed to continue, supercomputing will come to be dominated by energy costs.

Let’s consider three scenarios for the evolution of supercomputers in the range of 1-50 exaflops.

A. Devices and scaling: Efforts continue to extend Moore’s Law in the most seamless way. This scenario assumes continuation of the original Moore’s Law, which predicted a rising number of devices per chip at constant power per unit area (or per chip, since chip area has remained fairly constant) and rising speed. The design scaling rule, called Dennard scaling, relies on the operating voltages of silicon circuits decreasing at a rate commensurate with the shrinking of transistors in order to keep power in bounds (power consumption being proportional to the square of the operating voltage). Unfortunately, voltage scaling has reached its limit, because any lower voltage would fall below the threshold required to turn a transistor completely on or completely off (the sub-threshold limit). The current class of transistor (called the MOSFET) cannot sustain lower voltages without leaking some current, which in turn causes power to rise again.
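The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the standard dynamic-power approximation (power per device proportional to C × V² × f) and a hypothetical linear shrink of 0.7 per generation. The function name and all normalized values are illustrative and do not come from diagram A.

```python
def power_density(generations, scale_voltage, shrink=0.7):
    """Relative chip power per unit area after some process generations.

    Everything is normalized to 1.0 for today's technology. The shrink
    factor of 0.7 per generation is an assumption for illustration.
    """
    k = shrink ** generations            # linear feature size relative to today
    capacitance = k                      # device capacitance shrinks with size
    frequency = 1.0 / k                  # classic scaling: speed rises as 1/k
    voltage = k if scale_voltage else 1.0  # Dennard scaling lowers V with size
    power_per_device = capacitance * voltage ** 2 * frequency
    devices_per_area = 1.0 / k ** 2      # density rises as 1/k^2
    return power_per_device * devices_per_area


if __name__ == "__main__":
    print("voltage scales:", power_density(2, scale_voltage=True))
    print("voltage stuck: ", power_density(2, scale_voltage=False))
```

Under these assumptions, power density stays constant when voltage scales with the transistor and grows roughly fourfold over two generations when voltage is stuck.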

This leads to the situation in diagram A, where two generations of scaling offer the possibility of a fourfold increase in system density and, hence, performance. Because power cannot be reduced as much as desired, perhaps only a doubling in density, or a commensurate lowering of performance, is possible within the power envelope. Rather than leaving half the chip unoccupied, designers tend to fill it with memory. The memory is useful, but not as useful as two more nodes.
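The trade-off in this paragraph can be expressed as a small calculation. This is only a sketch under simplifying assumptions: the power envelope is fixed, clock speed is unchanged, and the power of each compute node falls in proportion to the improvement in energy per operation. The function and parameter names are hypothetical.

```python
def usable_gain(density_gain, energy_per_op_improvement):
    """Return (compute_gain, idle_fraction) at a fixed power envelope.

    density_gain: how many times more nodes fit on the chip.
    energy_per_op_improvement: how many times less energy each operation needs.
    """
    # Nodes that can be powered are capped by both area and power.
    compute_gain = min(density_gain, energy_per_op_improvement)
    # Remaining area cannot hold active logic; it is typically given to memory.
    idle_fraction = 1.0 - compute_gain / density_gain
    return compute_gain, idle_fraction


if __name__ == "__main__":
    gain, idle = usable_gain(density_gain=4.0, energy_per_op_improvement=2.0)
    print(f"compute gain {gain}x, fraction of chip left for memory {idle}")
```

With a fourfold density gain and only a twofold improvement in energy per operation, the model gives a twofold compute gain and leaves half the chip for memory, matching the case described above.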

However, there is a worldwide search for a new, lower-voltage transistor (whose placeholder name is the “millivolt switch”) that will allow continued reduction in energy per operation. Commercialization of a millivolt switch will offer a largely transparent upgrade to users, as it will support the same architectures and, hence, boost the performance of existing software with minimal rewrite.

The millivolt switch is considered inevitable due to its technical feasibility and the overwhelming market forces that demand it, but there is no current laboratory demonstration of an obvious successor to current silicon transistor technologies and no timescale for its discovery and commercialization. The millivolt switch can increase energy efficiency by only 10× to 100× before the systems will encounter a reliability limit, but it is a reasonable step.

This scenario seems to be the most desirable technically, yet supercomputer users, and the electronics industry as a whole, have very little control over the timeframe. Millivolt switches must be perfected based on currently unknown principles of device physics and the unpredictable ingenuity of researchers, leading to commercialization of a new technology with a lead time of at least 10 years, or less if competitive pressures keep the work secret.

B. 3-D: Greater use of 3-D integration, as illustrated in diagram B, offers tremendous benefit, yet poses technology development challenges and will alter architecture enough to necessitate some reprogramming. Flash memory chips with 32 devices in the third dimension are in production, and manufacturers boast that the next generation will have 50+ layers. On the surface, one might think Moore’s Law could be extended into the third dimension using this approach, except that computers cannot be built entirely of memory, and the obvious solution of manufacturing computer logic in the third dimension has an insurmountable heat removal problem. Basically, heat can be produced throughout a 3-D solid, but can only be dissipated along a 2-D surface or face.
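The heat removal argument can be illustrated with a simple model: if every stacked layer generates heat but all of it must leave through one face of fixed area, the heat flux through that face grows linearly with the number of layers. The sketch below assumes a uniform per-layer power density and a single cooled face; the function names and the numbers in the demo are hypothetical and chosen only for illustration.

```python
def surface_heat_flux(layers, watts_per_cm2_per_layer):
    """Heat flux (W/cm^2) through the one cooled face of a stack."""
    return layers * watts_per_cm2_per_layer


def max_logic_layers(watts_per_cm2_per_layer, cooling_limit_w_per_cm2):
    """Largest number of stacked layers the cooled face can support."""
    return int(cooling_limit_w_per_cm2 // watts_per_cm2_per_layer)


if __name__ == "__main__":
    # Hypothetical values: 50 W/cm^2 per logic layer, 100 W/cm^2 cooling limit.
    print(surface_heat_flux(32, 50.0), "W/cm^2 for a 32-layer logic stack")
    print(max_logic_layers(50.0, 100.0), "logic layers fit the cooling limit")
```

In this model a low-power layer type, such as memory, can be stacked far higher than logic before reaching the same cooling limit, which is consistent with memory leading the move into the third dimension.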

This scenario seems to be leading to a tighter integration of processors and memory in two forms: architecturally different systems, namely 3-D integrated processor-in-memory (PIM) and 2.5D/silicon-interposer integrated processor-near-memory (PNM), and conventional CPUs attached to memory by much shorter interconnect wires. These changes are having a dramatic impact on the energy efficiency and performance of memory systems, since a large fraction of system energy and latency is in the memory hierarchy.

While the more conventional CPU and attached memory approach may allow all existing programs to run without change and still give the correct answer, shifting the sizes and access latencies in the memory hierarchy is very likely to require retuning of code for performance. The PIM/PNM architecture, however, may well require substantial revisions of software technologies to fully exploit the new architecture. There is hope that standard libraries and special APIs could be written for the new architectural components to reduce the amount of recoding while obtaining most of the benefit for the broadest array of users, but this is itself a challenge that would require significant research and development investment while still requiring some code changes.
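As one example of the retuning described above, blocked (tiled) algorithms often size their tiles to fit a level of fast memory, so a change in that memory's capacity changes the best tile size even though the program's results do not change. The sketch below uses a common rule of thumb for blocked matrix multiplication, that three square tiles should be resident at once. The function name, the rule of thumb as applied here, and the capacities in the demo are assumptions for illustration, not figures from any particular system.

```python
import math


def tile_size(fast_memory_bytes, element_bytes=8, tiles_resident=3):
    """Largest square tile edge such that the resident tiles fit in fast memory.

    Solves tiles_resident * b * b * element_bytes <= fast_memory_bytes for b.
    """
    elements_per_tile = fast_memory_bytes // (tiles_resident * element_bytes)
    return math.isqrt(elements_per_tile)


if __name__ == "__main__":
    # Hypothetical capacities: a 32 KiB cache versus 1 MiB of near memory.
    for capacity in (32 * 1024, 1024 * 1024):
        print(capacity, "bytes ->", tile_size(capacity), "x", tile_size(capacity))
```

A code tuned with a tile size baked in for one hierarchy would still run correctly on the other, but it would leave performance on the table until the parameter is retuned.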

This scenario is of intermediate desirability. Industry is very likely to develop 3-D technology for storage and mobile devices quite independently of supercomputers, given that nonvolatile storage technologies are far less energy intensive than logic. However, industry will spend R&D funds on developing the technology and is likely to demand premium prices for new products.

C. Architectural specialization: Progress can continue through architectural changes alone, as illustrated in diagram C, yet this could be problematic for software and algorithms. Traditional microprocessor architectures use only a few percent of their energy budget for the actual arithmetic, with the rest attributed to memory access and the interpretation of the processor’s instruction set. Just as a graphics processing unit (GPU) achieves high energy efficiency by using constrained data paths and applying one instruction stream to many ALUs in parallel, other GPU-like specialized devices could be made for other purposes. For example, a vendor might develop chip layouts for the six functions illustrated in diagram C. A chip could be created with, say, four of the six in an approach referred to as “dark silicon” or “dim silicon.” Already, dozens of such specialized accelerators occupy large fractions of the area of the systems-on-chip used in cell phones, but applying the approach to more general-purpose computing is neither mainstream nor well understood, and it remains a research topic. Using some sort of power management, perhaps only one of the four would run at full power at a time (as illustrated by the rotary switch). This would allow high performance because the one enabled function has both a customized layout and a lot of power available. The limitation of this approach is also shown in diagram C: the system is specialized to the functions provided, and the user has no recourse if a needed specialized function is not manufactured into the system from the beginning.
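The power-management idea in this scenario can be modeled in a few lines. The sketch below is a toy illustration, not a description of any real chip: the class name, the accelerator names, and the power budget are all hypothetical, and the functions shown in diagram C are not named here. It captures two points from the text: only the selected function receives power, and a function that was not built into the chip cannot be selected.

```python
class RotarySwitchChip:
    """Toy model of a chip whose accelerators share one power budget."""

    def __init__(self, accelerators, power_budget_watts):
        self.accelerators = tuple(accelerators)  # fixed at manufacture
        self.power_budget_watts = power_budget_watts
        self.active = None                       # every block starts dark

    def select(self, name):
        """Turn the switch to one accelerator; all others go dark."""
        if name not in self.accelerators:
            raise KeyError(f"{name!r} was not manufactured into this chip")
        self.active = name

    def power_map(self):
        """Watts delivered to each accelerator at this moment."""
        return {
            name: (self.power_budget_watts if name == self.active else 0.0)
            for name in self.accelerators
        }


if __name__ == "__main__":
    # Hypothetical accelerator names and budget, for illustration only.
    chip = RotarySwitchChip(["fft", "stencil", "sparse", "sort"], 100.0)
    chip.select("fft")
    print(chip.power_map())
```

Requesting a function outside the manufactured set raises an error in this model, which mirrors the lack of recourse for the user described above.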

This scenario of architectural specialization is more under the control of the supercomputer industry than any other, as it is likely that computers of this type can be created from commodity processors, general-purpose graphics processing units (GPGPUs), and software tools, with supercomputer-specific technology appearing at the circuit board or module level. However, for this to be fully realized, we need an economic model that would enable incorporation of targeted specialization into HPC chip designs, such as the one proposed by SOC for HPC []. Even so, this scenario is very likely to present the programmer with idiosyncratic architectures that demand extensive experience, and it may lead to code that is not easily repurposed for other applications.

None of these scenarios is universally best, as illustrated in diagram D. Each trades off technology risk for extra software effort during use.

This topic will be discussed further at the IEEE Rebooting Computing Birds-of-a-Feather session at SC15. The session, “Supercomputing Beyond Moore’s Law,” will take place on Wednesday, November 18, 2015, from 1:30 p.m. to 3:00 p.m. and will include introductory talks on the three scenarios above, followed by working groups that will hopefully persist beyond the actual session. Each working group is expected to create a plan to further develop one of the scenarios.

For more information please visit the IEEE Rebooting Computing Web site.

Erik DeBenedictis is on the staff at Sandia National Labs and participates in the IEEE Rebooting Computing initiative and International Technology Roadmap for Semiconductors. Erik's first connection with Scientific Computing was building a computer now called the Cosmic Cube as a graduate student at Caltech in the early 1980s. This research computer became the model for almost all supercomputers today. Erik works on various aspects of advanced computing, notably a "Beyond Moore" project at Sandia working in conjunction with the IEEE Rebooting Computing initiative and International Technology Roadmap for Semiconductors.