Jack Dongarra has been involved since the origin of the TOP500 list in 1993, which used his LINPACK Benchmark as the common application for evaluating the performance of supercomputers. Through the consistent use of the LINPACK benchmark, the TOP500 list has provided a standardized measure of supercomputer performance over the past 20 years. Dongarra holds appointments at the University of Tennessee, Oak Ridge National Laboratory and the University of Manchester. He specializes in numerical algorithms for linear algebra, parallel computing, the use of advanced computer architectures, programming methodology, and tools for parallel computers.
In addition to LINPACK and the TOP500, Dongarra has contributed to the design and implementation of the following open source software packages and systems: EISPACK, the BLAS, LAPACK, ScaLAPACK, Netlib, PVM, MPI, NetSolve, ATLAS and PAPI. He has published approximately 200 articles, papers, reports and technical memoranda, and he is coauthor of several books. He was awarded the IEEE Sidney Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a Fellow of the AAAS, ACM, IEEE and SIAM and a member of the National Academy of Engineering.
Dongarra received a Bachelor of Science in Mathematics from Chicago State University in 1972 and a Master of Science in Computer Science from the Illinois Institute of Technology in 1973. He received his Ph.D. in Applied Mathematics from the University of New Mexico in 1980. He worked at Argonne National Laboratory until 1989, becoming a senior scientist there. He is the director of the Innovative Computing Laboratory at the University of Tennessee.
Dongarra is a popular speaker at the International Supercomputing Conference (ISC), held each year in Germany. In recognition of his significant contributions to the conference over the years, Dongarra has been named an ISC Fellow. As a lead-in to ISC’13 to be held June 16 to 20 in Leipzig, the ISC’13 communications team posed a few questions to Dongarra on his role in the TOP500 List and the current state of HPC.
Q1: Jack, you are well-known in the scientific computing community for your work at Argonne National Lab and your Innovative Computing Laboratory at the University of Tennessee, but your name seems to most often pop up in connection with the TOP500 list of the world’s top supercomputers. What’s your role and how did you get involved in the project?
The story has its origins in the LINPACK software collection in the late ’70s. LINPACK is a software package for solving systems of linear equations, and along with that software package we produced a users’ guide. So we had a set of software and an accompanying document that described how users could use the package effectively. It described the calling sequences and gave various examples of how a user could, in fact, use this software to solve these kinds of problems. That document was produced and published by SIAM; the first edition came out in 1979. [LINPACK User's Guide, J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, SIAM Publications, Philadelphia, 1979].
In the appendix to that book, I collected some timing information about various machines. The idea was to give users a handle on how much time it would take to solve their problem if they used our software. The appendix contained a table that listed the time for solving a matrix of order 100 using the LINPACK software. That problem size was chosen because I was able to accommodate a matrix of that size on all the machines we were testing, from a Cray-1 down to a DEC PDP something or other. Using that same matrix size, we recorded the execution time and translated it into an execution rate. Based on that, a little table was published which said, “on this machine it took this long, at this rate of execution, for this problem,” with maybe 10 or 15 machines listed. So that’s the origin of the LINPACK Benchmark. I’ve maintained and added to the list of computers compared on this benchmark since 1979. Over the years, additional performance data was added, more as a hobby than anything else.
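The translation from a measured solve time to an execution rate is simple arithmetic. As a minimal sketch (the function name is mine; the 2/3·n³ + 2·n² operation count is the nominal count conventionally credited to the LINPACK benchmark for solving a dense n×n system by LU factorization):

```python
def linpack_rate_mflops(n, seconds):
    """Convert a LINPACK solve time into an execution rate (Mflop/s).

    Credits the nominal LINPACK operation count for a dense n x n
    LU solve: 2/3 * n^3 + 2 * n^2 floating point operations.
    """
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return flops / seconds / 1.0e6  # Mflop/s

# A machine solving the order-100 problem in one second would be
# credited with roughly 0.69 Mflop/s.
rate = linpack_rate_mflops(100, 1.0)
```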
From 1986 through 1992, Hans Meuer presented the Mannheim supercomputer statistics at the opening session of the Supercomputer Seminars at Mannheim University. In 2001, these Seminars were renamed the ‘International Supercomputing Conference – ISC’. There was an increasing interest in these statistics from year to year. In 1992, Hans released the last Mannheim statistics, with 530 installed supercomputers worldwide. The statistics simply counted the supercomputers at that time. In 1993 Hans approached me and suggested we develop a list of supercomputers that was ranked according to the LINPACK Benchmark and that was the origin of the TOP500 list.
Q2: The TOP500 rankings are based on the systems’ performance running the LINPACK benchmarking application you developed. How did LINPACK come to be and did you ever think it would take on this level of importance?
LINPACK is a benchmark that people often cite because there’s such a historical database of information there, because it’s fairly easy to run, it’s fairly easy to understand, and it captures in some sense the best and worst of programming.
The LINPACK Benchmark is a very floating point-intensive test. That is, it performs O(n³) floating point operations while moving only O(n²) data. I think it’s important to understand the era when this benchmark was started. This was a time when floating point operations were expensive compared to other operations and to data movement. So, perhaps it was then more reflective of the efficiency of scientific computations in general. Today, times have changed, and so have the applications that use supercomputers.
The LINPACK code itself is not particularly well suited to modern architectures: it doesn’t have a very good access pattern in terms of referencing data in memory, and as a result its performance is really quite inferior to what we can achieve today. That’s one reason why we built other packages subsequent to LINPACK. Even so, many applications are written the same way as LINPACK, so some people feel that LINPACK captures the essence of their application, not necessarily what it’s doing but the performance level it’s achieving. That’s another reason why it perhaps resonates with people when they look at performance.
But there is one thing to point out. If your computer doesn’t do well on the LINPACK Benchmark, you will probably be disappointed with the performance of your application on the computer.
Q3: You also developed the HPCS HPC Challenge benchmarks for DARPA. Can you “compare and contrast” that work with LINPACK?
The HPC Challenge suite of benchmarks examines the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance LINPACK (HPL) benchmark used in the TOP500 list. The HPC Challenge suite is designed to augment the TOP500 list, provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality, and provide a framework for including additional benchmarks. The HPC Challenge benchmarks are scalable with the size of data sets being a function of the largest HPL matrix for a system. The HPC Challenge benchmark suite has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems.
The collection of tests includes tests on a single processor (local) and tests over the complete system (global). In particular, to characterize the architecture of the system, we consider three testing scenarios:
- Local – only a single processor is performing computations.
- Embarrassingly Parallel – each processor in the entire system is performing computations, but they do not communicate with each other explicitly.
- Global – all processors in the system are performing computations and they explicitly communicate with each other.
The HPC Challenge benchmark consists at this time of seven performance tests: HPL, STREAM, RandomAccess, PTRANS, FFT (implemented using FFTE), DGEMM and Latency/Bandwidth. HPL is the LINPACK TPP (toward peak performance) benchmark; it stresses the floating point performance of a system. STREAM measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from multiprocessor memory. DGEMM measures the floating point rate of dense matrix-matrix multiplication. Latency/Bandwidth measures (as the name suggests) the latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible. HPCC attempts to span the space of high and low spatial and temporal locality.
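To make the bandwidth accounting concrete, here is a toy sketch of the STREAM “triad” idea in plain Python. It only illustrates how bytes moved per second are counted; the real STREAM benchmark is compiled code run on large arrays, and a Python loop’s timing says more about interpreter overhead than about memory bandwidth. The function name and default size are mine.

```python
import time

def triad_bandwidth_gbs(n=1_000_000, scalar=3.0):
    """Toy sketch of the STREAM triad kernel: a[i] = b[i] + scalar * c[i].

    STREAM charges 24 bytes per element for the triad: two 8-byte
    loads (b[i] and c[i]) and one 8-byte store (a[i]).
    """
    b = [float(i % 10) for i in range(n)]
    c = [float(i % 7) for i in range(n)]
    t0 = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]  # the triad kernel
    t1 = time.perf_counter()
    bytes_moved = 24 * n  # read b, read c, write a (8 bytes each)
    return bytes_moved / (t1 - t0) / 1.0e9  # GB/s
```

The point of the 24-bytes-per-element convention is that it makes results comparable across systems regardless of what the cache hierarchy actually does.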
Q4: It seems like every couple of years, a new “something500” list appears, based on the premise that a better way of measuring system performance is needed. They seem to want to build on the reputation of the TOP500, given their names and timing. Do you see them as competing with or complementing the TOP500 list?
No, I don’t see them as competing; they augment the TOP500. I only hope they will be around long enough to build a historical base of data for comparison.
Q5: All things considered, if you were going to be buying a large scale HPC system, what would you use for benchmarking?
The best benchmark is a set of software that will be used on the future computer. That is, benchmark the applications you intend to run on the system.
Q6: You’re not one to shy away from voicing your opinions. What’s your assessment of the current state of the road to exascale? What do you see as the two or three biggest obstacles to getting there?
As a number of recent studies make clear, technology trends over the next decade — broadly speaking, increases of 1000X in capability over today’s most massive computing systems, in multiple dimensions, as well as increases of similar scale in data volumes — will force a disruptive change in the form, function and interoperability of future software infrastructure components (I’ll call this the X-stack) and the system architectures incorporating them. The momentous nature of these changes can be illustrated for several critical system-level parameters:
- Concurrency – Moore’s Law scaling in the number of transistors is expected to continue through the end of the next decade, at which point the minimal VLSI geometries will be as small as five nanometers. Unfortunately, the end of Dennard scaling means that clock rates are no longer keeping pace, and may in fact be reduced in the next few years to reduce power consumption. As a result, the exascale systems on which the X-stack will run will likely be composed of hundreds of millions of arithmetic logic units (ALUs). Assuming there are multiple threads per ALU to cover main-memory and networking latencies, applications may contain 10 billion threads.
- Reliability – System architecture will be complicated by the increasingly probabilistic nature of transistor behavior due to reduced operating voltages, gate oxides, and channel widths/lengths resulting in very small noise margins. Given that state-of-the-art chips contain billions of transistors and the multiplicative nature of reliability laws, building resilient computing systems out of such unreliable components will become an increasing challenge. This cannot be cost-effectively addressed with pairing or triple modular redundancy (TMR); rather, it must be addressed by X-stack software and perhaps even by scientific applications.
- Power consumption – Twenty years ago, HPC systems consumed less than a megawatt. The Earth Simulator was the first such system to exceed 10 MW. Exascale systems could consume over 100 MW, and few of today’s computing centers have either adequate infrastructure to deliver such power or the budgets to pay for it. The HPC community may find itself measuring results in terms of power consumed, rather than operations performed. The X-stack and the applications it hosts must be conscious of this situation and act to minimize it.
Similarly dramatic examples could be produced for other key variables, such as storage capacity, efficiency and programmability.
Q7: Finally, can you take a few minutes to talk about your “real” job. What are the main projects at your Innovative Computing Laboratory? Any exciting new projects on the horizon?
The Innovative Computing Laboratory (ICL) aspires to be a world leader in enabling technologies and software for scientific computing. Our vision is to provide high performance tools to tackle science’s most challenging problems and to play a major role in the development of standards for scientific computing in general.
The major ICL projects are a reminder that we are building on more than two decades of remarkable creativity and dedication of the people who have come to ICL and the University of Tennessee. The names of the projects that they have helped lead — PVM, MPI, LAPACK, ScaLAPACK, BLAS, ATLAS, Netlib, TOP500, PAPI, NetSolve, Open-MPI, FT-MPI, the HPC Challenge and the LINPACK Benchmark — are familiar to HPC users around the world. And the software that they have produced now provides critical research infrastructure for hundreds of thousands, if not millions, of users. But if you look at the probable path to exascale computing and survey the set of revolutionary problems it presents (e.g., the need to exploit billion-way parallelism, order-of-magnitude increases in the number of execution faults, unprecedented constraints on energy use, and so on), it becomes clear that the most challenging years for ICL research may still lie in front of us.
I am fortunate that today’s ICL brings together one of the most talented and most experienced teams we have ever had. In the areas of numerical libraries, distributed computing, and performance monitoring and benchmarking, they have created and are leading groundbreaking projects — PLASMA, MAGMA, PaRSEC, PAPI-V and OpenMPI, to name a few — that are directly targeting some of the central problems in HPC today.