Figure 1: A surfactant-laden water droplet begins to spread across a polymeric surface. The surfactant, called a trisiloxane (black and red atoms), sits at the surface of the water droplet (blue atoms) and enables faster spreading of the droplet on the surface (gray atoms) by lowering the interfacial energy of the water. Courtesy of R. E. Isele-Holder and A. E. Ismail

Today’s sixth installment continues the series covering how scientists are updating popular molecular dynamics, quantum chemistry and quantum materials code to take advantage of hardware advances, such as the forthcoming Intel Xeon Phi processors.

Both HPC hardware and software are being modernized to aid in simulations used in research. Yet what researchers can do with current software and hardware and what they would like to do are still quite far apart.

LAMMPS, an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator, is one of the most widely used software packages for molecular dynamics and is used by researchers throughout the world. Developed at Sandia National Laboratories, LAMMPS contains potentials for solid-state materials (metals, semiconductors) and soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used as a parallel particle simulator at the atomic, meso or continuum scale. LAMMPS is designed to run on single processors or in parallel using message-passing techniques and a spatial decomposition of the simulation domain.
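As a rough illustration of spatial decomposition, the sketch below assigns each atom to the subdomain (and therefore the process) that contains it. It is plain Python for readability, not LAMMPS code; the function names `owner_rank` and `decompose`, the row-major rank numbering and the box and grid values are all assumptions made for this example.

```python
def owner_rank(position, box_lengths, grid):
    """Return the rank of the subdomain that owns an atom.

    position    -- (x, y, z) coordinates of the atom
    box_lengths -- (Lx, Ly, Lz) size of the periodic simulation box
    grid        -- (px, py, pz) number of subdomains along each axis
    """
    cell = []
    for x, length, parts in zip(position, box_lengths, grid):
        wrapped = x % length                  # fold the atom back into the box
        index = int(wrapped / length * parts)
        cell.append(min(index, parts - 1))    # guard against rounding at the upper edge
    ix, iy, iz = cell
    # Row-major numbering of subdomains; the numbering scheme is an assumption.
    return (ix * grid[1] + iy) * grid[2] + iz


def decompose(positions, box_lengths, grid):
    """Group atom indices by the rank that owns them."""
    owners = {}
    for atom, position in enumerate(positions):
        rank = owner_rank(position, box_lengths, grid)
        owners.setdefault(rank, []).append(atom)
    return owners
```

Each process then only needs the atoms in its own subdomain plus a thin layer of neighbors from adjacent subdomains, which is what the message-passing step exchanges.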

Introducing Intel Parallel Computing Centers RWTH Aachen and CAS-CNIC

The development process of modernizing LAMMPS has already started. The Intel Parallel Computing Center (Intel PCC) at RWTH Aachen University (located in Aachen, North Rhine-Westphalia, Germany) and the Computer Network Information Center (CNIC) of the Chinese Academy of Sciences (CAS-CNIC) are leaders in optimizing HPC software to take advantage of Intel Xeon processors and Intel Xeon Phi coprocessors. These centers are optimizing LAMMPS’ code to run on current coprocessors and the next generation of processors to meet scientific computing demands.

Water droplet spreading across a polymeric surface

The kind of research simulations that can be performed using LAMMPS on HPC systems is illustrated in this example of how wetting occurs when a liquid droplet spreads across a liquid or solid surface. The dynamics of wetting are crucial for a wide variety of industrial applications, including the spreading of paints, adhesives, inks and pesticides. Of particular interest is the spreading of droplets containing surfactants, which are short polymeric chains that, when added to water, enable faster spreading by lowering water's interfacial energy, making it easier for droplets to expand and spread across a surface. However, even after decades of intensive study, the relationship between the chemical composition of surfactants and their ability to enable the spreading of water droplets remains unclear.

One reason for the difficulty is that the presence of explicit interfaces (in this case, the surface of the droplet as well as the surface onto which it falls) makes the long-ranged electrostatic and dispersion interactions between the droplet and the surface tricky to handle. Most previous simulation efforts have struggled to properly account for these interactions and, thus, have been unable to accurately capture the spreading dynamics, including the significant changes induced by different chemical structures.

According to Professor Ahmed E. Ismail, a former junior professor at RWTH Aachen University and now an assistant professor of chemical engineering at West Virginia University, “Through development and optimization of one of the long-range interaction kernels in LAMMPS, called the particle-particle particle-mesh (PPPM) method, we are able to compute both accurately and efficiently the physics of the wetting process, and have been able to show that different chemical structures do lead to vastly different spreading dynamics. In particular, we can show, in agreement with experiments, that one well-known class of surfactants, called trisiloxanes, exhibit much faster spreading dynamics, because they are able to ‘roll’ across a surface instead of having to ‘push’ their way across the surface, as other surfactant-laden droplets must. As we further develop and optimize these long-range interaction kernels, we will be able to extend this work to study even larger and more complex fluid-solid interfaces, including cell membranes and self-assembled monolayers.”

Modifications RWTH Aachen is making to LAMMPS

According to Paolo Bientinesi, Professor for Algorithm-Oriented Code Generation for High-Performance Architectures in the Computer Science department at RWTH Aachen University, “LAMMPS is actively used on just about every single supercomputer in the world, to perform materials science and computational chemistry simulations. With our optimizations, we target some of LAMMPS’ most used solvers; any meaningful speedup will result in a massive reduction of compute cycles, which in turn means energy savings.

“We are exploiting the built-in vectorization capabilities of the Intel Xeon Phi coprocessor. The most common techniques we use to improve the effectiveness of Intel’s vector units are loop fusion and fission, packing to the vector width, and explicit alignment. In terms of parallelism, we add a shared memory layer beneath MPI's domain decomposition to reduce communication,” says Bientinesi.

Hardware used by RWTH Aachen

RWTH Aachen mainly uses the hardware shown in the chart below in their optimization work. “By adopting modern Intel hardware, individual nodes grow in computational power so that it becomes possible to reduce the number of nodes participating in a simulation, thus lowering the total volume of inter-node communication,” states Rodrigo Canales, a scientific programmer working with Professors Ismail and Bientinesi at RWTH Aachen.
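The link between node count and communication volume can be made concrete with a back-of-the-envelope model. The sketch below assumes a periodic cubic box split into equal cubic subdomains, one per node, and uses total subdomain face area as a rough proxy for the data exchanged per timestep. The function name and the model itself are illustrative assumptions, not measurements from the RWTH Aachen team.

```python
def total_face_area(box_length, parts_per_axis):
    """Total subdomain face area for a periodic cubic box split into
    parts_per_axis**3 equal cubes, one cube per node.

    Every face borders another node, so the summed face area serves as
    a rough proxy for the volume of data exchanged each timestep.
    """
    if parts_per_axis < 2:
        return 0.0                      # a single node exchanges nothing
    side = box_length / parts_per_axis  # edge length of one subdomain
    nodes = parts_per_axis ** 3
    return nodes * 6 * side ** 2        # six faces per cube
```

In this model the total area equals 6 x box_length^2 x parts_per_axis, so halving the number of subdomains along each axis halves the estimated communication volume for the same simulation box.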

The RWTH Aachen team typically performs work on the following Intel HPC systems:

| System | CPUs | RAM |
| --- | --- | --- |
| 1 | 2x Intel Xeon Processor E5-2450 (formerly known as Sandy Bridge) | 48 GB |
| 1 | 2x Intel Xeon Phi coprocessor | 2x 8 GB |
| 2 | 2x Intel Xeon Processor E5-2680 v3 (formerly known as Haswell) | 64 GB |
| 2 | 1x Intel Xeon Phi coprocessor 5110P | 8 GB |

RWTH Aachen team optimizes Tersoff and Buckingham potentials

The RWTH Aachen team focuses on optimizing pair potentials and many-body potentials (Buckingham, Tersoff and AIREBO) in the LAMMPS package so they can run efficiently on Intel architectures. AIREBO and Tersoff potentials are encountered in simulations of carbon nanotubes, graphene and other hydrocarbons. “Vectorization of the Tersoff potential on the Intel Xeon Phi coprocessor is especially challenging, because the calculation consists of short loops, too short to effectively fill the coprocessor's long vector unit. Specifically, the calculation is performed by a three-fold nested loop. The outer loop iterates over all atoms in the simulation (tens of thousands of atoms, or more), while the two other loops iterate over the atom's neighbors. Typically, since the Tersoff potential describes covalent bonding behavior, each atom only has three or four neighbors. By contrast, the Intel Xeon Phi coprocessor's vector unit performs calculations on eight or 16 elements at once,” states Markus Höhnerbach, a doctoral candidate at RWTH Aachen.
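The mismatch between short neighbor loops and wide vector units can be put into numbers. The sketch below estimates what fraction of vector lanes would do useful work if a single atom's neighbor loop were vectorized on its own. The function name is hypothetical and the model ignores everything except loop length and vector width.

```python
import math


def lane_utilization(neighbors_per_atom, vector_width):
    """Fraction of vector lanes doing useful work when one short
    neighbor loop is vectorized by itself."""
    if neighbors_per_atom == 0:
        return 0.0
    vectors_needed = math.ceil(neighbors_per_atom / vector_width)
    return neighbors_per_atom / (vectors_needed * vector_width)
```

With the figures quoted above, `lane_utilization(4, 16)` gives 0.25 and `lane_utilization(3, 8)` gives 0.375, so most lanes would sit idle.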

The RWTH Aachen team did the following to limit — and possibly avoid — the costly under-utilization of the available hardware. First, they changed how neighbor lists are stored and rearranged the data used in the calculations to improve vectorization. This results in the loops being fused with the outer one (which iterates over all the atoms involved in the simulation), and vectorized explicitly.
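The effect of fusing a short inner loop with the outer atom loop can be sketched in a few lines. The example below is a simplified two-level illustration in plain Python, not the team's three-level Tersoff kernel: it flattens the atom loop and one neighbor loop into a single stream of pairs and cuts that stream into vector-width pieces. All function names are assumptions made for this example.

```python
def nested_chunks(neighbor_lists, width):
    """Number of vector operations issued when each atom's short
    neighbor loop is vectorized separately (one or more per atom)."""
    return sum(-(-len(nbrs) // width) for nbrs in neighbor_lists)  # ceiling division


def fused_pairs(neighbor_lists):
    """Flatten the atom loop and the neighbor loop into one list of (i, j) pairs."""
    return [(i, j) for i, nbrs in enumerate(neighbor_lists) for j in nbrs]


def fused_chunks(neighbor_lists, width):
    """Cut the fused pair list into vector-width pieces.
    Only the final piece can be partially filled."""
    pairs = fused_pairs(neighbor_lists)
    return [pairs[k:k + width] for k in range(0, len(pairs), width)]
```

For four atoms with three neighbors each and a vector width of eight, the separate loops issue four mostly empty vector operations, while the fused stream of 12 pairs needs only two.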

“While on an element-by-element basis this approach is more expensive, it also leads to the full exploitation of the vector units and, therefore, to considerable speedups for the full calculation,” says Paolo Bientinesi of RWTH Aachen.

Other methods used to optimize the Tersoff potential included adding Intel Xeon Phi coprocessor offload support, vectorization through intrinsics for the full range of available Intel hardware and explicit alignment to improve the effectiveness on vector units.

Optimizing Buckingham potentials

To add offloading support for the Buckingham potential, the team modified the initialization and compute routines. Canales states, “After preprocessing the simulation parameters — which determine the interaction of each atom pair — we copy them to the Intel Xeon Phi coprocessor and, to avoid costly data transfer, make sure they are available for the entire simulation.

“The compute routine, which is the core of our implementation, was also reorganized. At the beginning of the simulation, the atom positions are copied to the coprocessor; the computation is performed there, and at the end of each timestep the calculated forces for each atom, and the total energy, are retrieved.”
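To show what such a compute routine takes in and hands back, here is a minimal sketch of a Buckingham calculation in plain Python. It uses the standard Buckingham form E(r) = A exp(-r/rho) - C/r^6 with a simple cutoff. It assumes a single atom type, no periodic boundaries and no offloading, and the function names are hypothetical rather than taken from LAMMPS.

```python
import math


def buckingham_pair(r, A, rho, C):
    """Energy and radial force for E(r) = A*exp(-r/rho) - C/r**6."""
    repulsion = A * math.exp(-r / rho)
    dispersion = C / r ** 6
    energy = repulsion - dispersion
    force = repulsion / rho - 6.0 * dispersion / r   # force = -dE/dr
    return energy, force


def compute(positions, A, rho, C, cutoff):
    """Return per-atom forces and the total energy for one timestep."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    total_energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            delta = [positions[i][d] - positions[j][d] for d in range(3)]
            r = math.sqrt(sum(c * c for c in delta))
            if r >= cutoff:
                continue                      # pair is outside the interaction range
            energy, force = buckingham_pair(r, A, rho, C)
            total_energy += energy
            for d in range(3):
                push = force * delta[d] / r   # project radial force onto axis d
                forces[i][d] += push
                forces[j][d] -= push          # equal and opposite force on j
    return forces, total_energy
```

The inputs (positions and the preprocessed parameters A, rho and C) correspond to the data copied to the coprocessor, and the two return values correspond to the forces and total energy retrieved afterward.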

Vectorization was achieved for both the Intel Xeon Processor and the Xeon Phi coprocessor using single instruction, multiple data (SIMD) pragmas. To support vectorization efficiency, the team packed and aligned the simulation parameters and the position and force data from the particles. For the Intel Xeon Phi, the RWTH team also implemented vectorization through intrinsics.
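Packing data to the vector width can be illustrated with a short sketch. The example below pads arrays to a multiple of the vector width and stores coordinates as three separate arrays, one common layout for vectorized kernels. That layout is an assumption here, since the article does not say which one the team chose, and the function names are hypothetical. Memory-address alignment, the other half of the technique, has no direct equivalent in plain Python and is not shown.

```python
def pad_to_width(values, width, fill=0.0):
    """Return a copy of values padded so its length is a multiple of width."""
    remainder = len(values) % width
    if remainder == 0:
        return list(values)
    return list(values) + [fill] * (width - remainder)


def pack_positions(positions, width):
    """Convert [(x, y, z), ...] into three padded coordinate arrays
    so that each array can be consumed in full vector-width pieces."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    zs = [p[2] for p in positions]
    return pad_to_width(xs, width), pad_to_width(ys, width), pad_to_width(zs, width)
```

Padding means every vector load is full, at the cost of a few wasted lanes in the final piece. A real kernel would also have to make sure the padded entries contribute no force or energy.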

Figure 2: Buckingham potential speedup on Xeon Phi coprocessor 5110P. Courtesy of Rodrigo Canales, RWTH Aachen

Results of RWTH Aachen optimization tests

Benchmarking tests compared the processing speed of RWTH Aachen’s optimized code with the best that could be achieved in LAMMPS before the optimizations. The Buckingham test included three types of Buckingham potentials running in both single and double precision, with the speed measured for one thread. The Tersoff potential benchmarks were run on various computing systems, including a host Intel Xeon processor E5-2650 and an Intel Xeon Phi coprocessor 5110P. Figures 2 and 3 show the speedup of the RWTH Aachen optimized Buckingham and Tersoff code versus the LAMMPS USER-OMP package.

Figure 3: Portable speed-ups on Tersoff potential (single threaded, native) on various Intel processors and coprocessors. Courtesy of Markus Höhnerbach, RWTH Aachen

Modifications CAS-CNIC is making to LAMMPS code

CAS-CNIC focuses on developing efficient algorithms and codes targeting Intel Xeon processors and Intel Xeon Phi coprocessors as part of their LAMMPS code optimization work. Their research centers on two mainstream mesoscopic simulation techniques: the phase field method and dissipative particle dynamics (DPD). Their team has devised fast and stable compact exponential time differencing (cETD) multistep methods for phase field simulations. Their time-stepping methods are explicit in nature and, thus, free from the need to solve linear and nonlinear systems.
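Why an exponential time-stepping scheme can be explicit and still stable is easiest to see on a scalar model problem. The sketch below applies the simplest first-order exponential scheme (often called ETD1) to u' = -c*u + N(u), treating the stiff linear term exactly and the nonlinear term explicitly. It is a much-simplified relative of the compact multistep methods described above, not the CAS-CNIC scheme, and the function names are assumptions.

```python
import math


def etd1_step(u, h, c, nonlinear):
    """One first-order exponential time-differencing step for
    u' = -c*u + nonlinear(u), with c > 0.

    The linear decay is integrated exactly; the nonlinear term is
    frozen at its current value, so no system has to be solved.
    """
    decay = math.exp(-c * h)
    return decay * u + (1.0 - decay) / c * nonlinear(u)


def explicit_euler_step(u, h, c, nonlinear):
    """Standard explicit Euler step, shown only for comparison."""
    return u + h * (-c * u + nonlinear(u))
```

For a stiff case such as c = 1000 with step size h = 1 and no nonlinear term, the exponential step decays toward zero as the true solution does, whereas the explicit Euler step multiplies the value by -999 and blows up.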

Figure 4: Pipeline scheme for tensor dot product. Courtesy of CAS-CNIC

CAS-CNIC phase field simulation methods

In their phase field simulations, CAS-CNIC organizes 3-D phase variables as 2-D arrays, divides the matrices into pieces and uses both the CPU and Intel Many Integrated Core Architecture (Intel MIC) to compute the matrix multiplication. They design special mapping schemes for the tensor transpose operation and adopt pipeline techniques to hide the data movement caused by the offloading and transpose operations, as shown in Figure 4. Solving the Cahn-Hilliard equations and simulating phase separation dynamics achieves 1,300 GFLOPS in double precision on a computing node with two CPUs and MICs, which is 52 percent of peak performance.
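The two performance figures quoted above imply a peak rate for the node, which a one-line calculation recovers. The function name is hypothetical, and the result is derived only from the numbers in the text, not from a hardware specification.

```python
def implied_peak_gflops(achieved_gflops, fraction_of_peak):
    """Peak rate implied by an achieved rate and its share of peak."""
    return achieved_gflops / fraction_of_peak
```

Calling `implied_peak_gflops(1300.0, 0.52)` gives about 2,500 GFLOPS of double-precision peak for the node.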

Figure 5: Prototype program simulates DPD with MIC native mode under various optimization approaches. Courtesy of CAS-CNIC

CAS-CNIC dissipative particle dynamics (DPD) simulation methods

In their dissipative particle dynamics (DPD) work, CAS-CNIC uses a binary (two atom types) DPD system with N = 32,000 atoms. Two software packages were used for this work: the team's own CAS-CNIC prototype program and LAMMPS with an enhanced DPD package. The prototype program was tested in MIC native mode and LAMMPS in MIC offload mode. The prototype program showed a 6.8x speedup. Speedups within LAMMPS are expected to be 1.8x compared with baseline two-socket Ivy Bridge performance, as shown in Figure 5.
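For readers unfamiliar with DPD, the sketch below computes the force between one pair of particles using a commonly used textbook form of the method: a soft conservative repulsion, a dissipative drag and a random kick, all acting along the line between the particles. It is plain Python with hypothetical names and is not the CAS-CNIC prototype or the LAMMPS DPD package. The random number `theta` is passed in so that the function stays deterministic, and in a binary system the repulsion strength `a` would be looked up per pair of atom types.

```python
import math


def dpd_pair_force(r_ij, v_ij, a, gamma, kT, cutoff, dt, theta):
    """Force on particle i due to particle j in a textbook DPD model.

    r_ij  -- position of i minus position of j
    v_ij  -- velocity of i minus velocity of j
    a     -- conservative repulsion strength for this pair of types
    gamma -- friction coefficient
    kT    -- thermal energy
    theta -- zero-mean, unit-variance random number for this pair
    """
    r = math.sqrt(sum(c * c for c in r_ij))
    if r >= cutoff or r == 0.0:
        return [0.0, 0.0, 0.0]
    unit = [c / r for c in r_ij]
    w = 1.0 - r / cutoff                      # weight that vanishes at the cutoff
    sigma = math.sqrt(2.0 * gamma * kT)       # fluctuation-dissipation relation
    conservative = a * w
    dissipative = -gamma * w * w * sum(u * v for u, v in zip(unit, v_ij))
    random_part = sigma * w * theta / math.sqrt(dt)
    magnitude = conservative + dissipative + random_part
    return [magnitude * u for u in unit]
```

Because every pair needs the same short sequence of arithmetic, loops over such pairs are natural candidates for the vectorization and offloading described in this article.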

Wide-ranging benefits of CAS-CNIC LAMMPS code optimization

According to Dr. Zhong Jin, Professor at CAS-CNIC, “We developed an efficient DPD simulation code based on Intel MIC Architecture. This implementation is designed and optimized according to the nature of DPD simulation technique and will also fully take advantage of the computational power of MICs. This MIC-based implementation will provide a speedup of two compared to that based on a single CPU.”

The team has also worked on optimization of additional algorithms for their Intel PCC research, including a cETD library for Intel Xeon processors and Intel Xeon Phi coprocessors. The library is expected to reach 30 percent of peak performance in production-level simulations. Besides the phase field models, the cETD methods are applicable to a wide variety of partial differential equations, such as time-dependent advection-diffusion and Navier-Stokes equations for fluid dynamics, the Ginzburg-Landau equations for modeling superconductivity, and the Schrödinger equations for quantum mechanics.

Common software tools used in code modernization

Both the CAS-CNIC and RWTH Aachen teams use a wide variety of HPC software in their work to optimize LAMMPS code. In their work, they use Intel compilers, profiling tools and runtime libraries, such as Intel MPI and OpenMP; both of these are industry standards for parallel programming, MPI at the distributed-memory level, and OpenMP at the shared-memory level. To support their vectorization work, RWTH Aachen also uses OpenMP's new SIMD construct.

According to Höhnerbach, “Our team used the Intel Vector Advisor tool to help exploit the vector processing units on the Intel Xeon Phi coprocessor. It can analyze loops and give feedback about the achieved vectorization quality and possible issues causing the problem. It can also be used to assess the progress of the optimization effort by estimating the point where there is no room for improvement.”

How HPC will aid molecular dynamics research in the future

The software optimizations for LAMMPS performed at the CAS-CNIC and RWTH Aachen Intel PCCs will be available to all users as part of the USER-INTEL optimized package supplied with LAMMPS. The package is maintained by W. Michael Brown at Intel and the LAMMPS developers at Sandia National Laboratories. It includes optimizations for Intel Xeon and Xeon Phi processors that support a wide range of simulation models. The highly optimized Tersoff and Buckingham simulation models have already been made available with the package, and scientists can exploit significant improvements in molecular dynamics with these models today. LAMMPS can be downloaded at: http://lammps.sandia.gov

“Among the biggest challenges that we face in computational chemistry and materials modeling is that scientists' desire to solve bigger and more complex problems will always outstrip advances in computational power. And we're still many orders of magnitude away from having the resources to study many important biological, chemical and physical phenomena. Supercomputing is already making progress in these efforts, but it's unlikely we'll see a fully hardware-based solution to these issues in the next few decades. So, we'll need to figure out not only how to make our computational resources more efficient and more powerful, but also rethink our approaches to both the software that analyzes these problems, as well as the models that underlie them,” states Ismail.

To meet the needs of making HPC systems and software more efficient and powerful, the RWTH Aachen team will continue to optimize LAMMPS’ code. Future work will include optimizing the PPPM dispersion solver, adding support for the next-generation Intel Xeon Phi processor, and implementing other vectorization techniques, such as array notation and OpenMP 4.1 pragmas.

The CAS-CNIC team is also looking ahead, optimizing LAMMPS and other codes for modern hardware and identifying software and other methods that can be used as computing moves toward exascale. According to Jin, “Our focus is to modernize the software codes and turn them into a package that can be incorporated into the LAMMPS code so that everyone can use our modifications. We are currently writing codes incorporating the methods and calculations used for DPD code. Our work is also focusing on doing code modernization to improve vectorization to support offload to coprocessors and to support fast math calculations.”

Linda Barney is the founder and owner of Barney and Associates, a technical/marketing writing, training and web design firm in Beaverton, OR.

 
