Visualization of dark matter in the Dark Sky Simulation, the first trillion-particle simulation to be made publicly available. It spans a region nearly 40 billion light-years across, and was produced using 80 million CPU hours on the Oak Ridge National Laboratory Titan supercomputer. Courtesy of the Dark Sky Simulation Team.

The field of cosmology was rocked in the late 1990s, when astronomers made the startling discovery that something was causing the universe to expand faster and faster over time. This phenomenon was dubbed “dark energy,” and the effort to understand the nature of dark energy and how it has impacted our universe has defined the field of cosmology ever since.

The experiment that revealed the existence of dark energy came from measurements of the properties of a few tens of exploding stars in distant galaxies — but today cosmologists work with measurements of millions of galaxies, and HPC is playing an increasingly important part in their work. We use galaxies to trace the structure of matter in the universe — both regular matter (like stars, gas and dust, which interact with light) and dark matter (which does not interact with light). The competing forces of gravitational attraction pulling matter together, and the expansion of the universe due to dark energy, have been imprinted on the structure of matter ever since the Big Bang.

Cosmology is the study of the universe as a whole — what it’s made of, how it started and how it’s evolved over its 13.7-billion-year lifespan. The problem with cosmology as an experimental science is that there’s not a lot we can do to experiment on the universe — instead, we must use computer simulations to try out our theories, and to see what the universe would look like under different theoretical models. If a particular simulated universe containing a specific type of dark energy matches the one we observe around us, then we know we have a plausible theoretical model for dark energy.

Simulating a universe is not a trivial matter, and some of the biggest tests of modern HPC systems come from cosmological simulations. These simulations need to be both large-scale (to sample a significant volume of a universe and so ensure we have a useful sample size) and high-resolution (to be able to track the matter that forms individual galaxies). The compute resources to calculate the gravitational forces on everything in a simulated universe, from everything else in that universe, are enormous — a typical cosmological simulation today requires tens of millions of CPU hours and thousands of compute cores — something that would take today’s best desktop computer over 2000 years to run. And that’s before we add the dimension of time.
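The “everything on everything” gravity calculation can be sketched in a few lines. The toy direct-summation code below is a hypothetical illustration (not from any production cosmology code) that computes the gravitational acceleration on each particle from every other particle, which is exactly why the cost blows up as the particle count grows:

```python
import numpy as np

def accelerations(pos, mass, G=1.0, soft=1e-3):
    """Direct-summation gravitational accelerations.

    Every particle feels every other particle, so the cost scales as
    O(N^2) -- this is why trillion-particle runs need supercomputers.
    (Real codes use tree or particle-mesh methods to cut the cost to
    roughly O(N log N); the softening length avoids a singularity
    when two particles get very close.)
    """
    # Pairwise separation vectors: r[i, j] = pos[j] - pos[i]
    r = pos[None, :, :] - pos[:, None, :]
    d2 = (r ** 2).sum(axis=-1) + soft ** 2     # softened squared distances
    inv_d3 = d2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)              # no self-force
    return G * (r * (mass[None, :, None] * inv_d3[..., None])).sum(axis=1)

# Tiny toy universe: 100 particles in a unit box
rng = np.random.default_rng(42)
pos = rng.random((100, 3))
mass = np.ones(100)
acc = accelerations(pos, mass)
print(acc.shape)
```

Because the pairwise forces are exactly antisymmetric, the total momentum change sums to zero — a useful sanity check on any gravity solver.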

The speed of light in a vacuum is constant, which means that light from distant galaxies has taken millions, or even billions, of years to reach us. This gives us a glimpse of what the universe looked like back when those galaxies emitted their light: we can effectively look back in time. It also means that cosmological simulations need to record what’s going on inside the simulation at regular time steps, so we can trace what those distant galaxies would look like to us today. We can’t just set a simulation running and see what comes out at the very end! We therefore have a problem that requires not just lots of compute time to simulate the forces acting on everything in a universe, but also plenty of IO bandwidth and disk capacity to write out snapshots of the simulation at regular intervals. This can add up to petabytes of data; for comparison, a petabyte is about as much data as is contained in the DNA of three times the entire population of the USA. Cosmological simulations are therefore not just an HPC problem, but also a big data problem. Modern supercomputers are already being pushed to their limits to produce these simulations and analyze their outputs. But to understand how the simulations relate to the universe we actually live in, we need to compare their outputs to real data collected by the telescopes that image the night sky.
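To see where the petabytes come from, here is a back-of-the-envelope estimate for a trillion-particle run. The per-particle byte count and snapshot cadence are illustrative assumptions, not figures from any particular simulation:

```python
# Back-of-the-envelope snapshot budget for a trillion-particle run.
# Assumptions (illustrative only): single precision, 3 position + 3
# velocity values stored per particle, ~50 snapshots over the run.
n_particles = 1_000_000_000_000           # 10^12 particles
bytes_per_particle = 6 * 4                # 6 floats, 4 bytes each
snapshot_bytes = n_particles * bytes_per_particle
print(f"one snapshot: {snapshot_bytes / 1e12:.0f} TB")            # 24 TB
n_snapshots = 50
print(f"full run: {snapshot_bytes * n_snapshots / 1e15:.1f} PB")  # 1.2 PB
```

Even under these conservative assumptions, a single run lands in petabyte territory before any analysis products are written.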

Analyzing the data collected by telescope surveys of the sky is itself a big data/HPC problem. Current surveys (such as the Dark Energy Survey) measure tens of millions of galaxies and, over the next 10 years, astronomers will collect data on tens of billions of galaxies. The data storage and analysis needs here are complex: astronomical data is noisy and can take many forms. We may have repeated observations of the same galaxy, taken on different nights under different observing conditions and therefore of different quality. We may have additional observations from other telescopes, with entirely different noise characteristics, that we wish to combine. In addition, many astronomical objects change over time (supernovae can explode over the space of a few weeks or months, for example, and the active black holes at the centers of galaxies can flare up), so the galaxy you observe one week can look fundamentally different the next. Putting all this together is a highly complex analysis problem that requires sophisticated algorithms — and the HPC resources to run them — to make sense of this noisy hash of data.
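As one concrete (and much simplified) example of combining observations of differing quality, repeated flux measurements of the same galaxy can be merged with an inverse-variance weighted mean, so that noisier nights count for less. The numbers below are invented for illustration:

```python
import numpy as np

def combine_observations(flux, sigma):
    """Inverse-variance weighted mean of repeated measurements.

    A standard way to combine observations of the same object taken
    under different conditions: each measurement is weighted by
    1/sigma^2, so the noisiest nights contribute the least.
    """
    flux = np.asarray(flux, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2
    mean = np.sum(w * flux) / np.sum(w)     # weighted average flux
    err = np.sqrt(1.0 / np.sum(w))          # uncertainty on the average
    return mean, err

# Three nights of (made-up) data; the second night is much noisier
flux = [10.2, 13.1, 10.4]
sigma = [0.3, 2.0, 0.4]
best, err = combine_observations(flux, sigma)
print(f"{best:.2f} +/- {err:.2f}")
```

Note how the discrepant 13.1 measurement barely moves the answer: its large error bar gives it a tiny weight.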

We can’t actually simulate the exact universe we live in — we’ll never match the location of every galaxy we see through our telescopes — so we must construct statistics that describe the interesting properties of the universe we want to measure. For example, we can construct statistics that describe the spatial distribution of galaxies in both our simulations and our data. Measuring how close galaxies are to each other, and how that has changed over time, tells us which of the competing forces of gravity and dark energy has the upper hand at any particular time. Comparing those statistics tells us how accurate our simulations are, and comparing combinations of statistics can give real insight. Inevitably, calculating these statistics over billions of galaxies also requires significant computing resources, and HPC techniques are being applied to enormous simulations and datasets to let us make these comparisons quickly.
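A simple version of such a clustering statistic is a pair count in separation bins — the raw ingredient of the two-point correlation function. The brute-force sketch below is illustrative only (production analyses use tree algorithms and MPI to scale to billions of galaxies); it counts how many mock-galaxy pairs fall at each separation:

```python
import numpy as np

def pair_counts(pos, bins):
    """Count galaxy pairs in separation bins.

    This is the raw ingredient of the two-point correlation function,
    the workhorse statistic for comparing the clustering of simulated
    and observed galaxies. Brute force is O(N^2); real pipelines use
    k-d trees and distributed computing to reach billions of objects.
    """
    # All pairwise distances; keep the upper triangle so each pair
    # of galaxies is counted exactly once.
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    iu = np.triu_indices(len(pos), k=1)
    return np.histogram(d[iu], bins=bins)[0]

rng = np.random.default_rng(0)
pos = rng.random((500, 3))            # 500 mock galaxies in a unit box
bins = np.linspace(0.0, 0.5, 6)       # five separation bins
counts = pair_counts(pos, bins)
print(counts, counts.sum())
```

Running the same pair count on a simulated catalog and an observed one, and comparing the histograms, is the essence of the simulation-versus-data comparison described above.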

Ultimately, ever-more-detailed simulations and ever-larger astronomical surveys are aiding cosmologists to understand the nature of our universe. But the lessons learned from the discovery of dark energy remain with us — the most transformative things happen when we *don’t* see what we expect. Possibly the most exciting prospect from the coming era of data/simulations is the possibility that we will observe something that doesn’t fit with any of our simulated universes — and then the fun really begins.

Deborah Bard, who earned her Ph.D. in particle physics from the University of Edinburgh, is a Big Data Architect at the Department of Energy's National Energy Research Scientific Computing Center (NERSC). She worked as a project scientist on the Large Synoptic Survey Telescope at the SLAC National Accelerator Laboratory and developed and taught a course at Stanford University on "Discovering the Cosmos."