Until now, supercomputing has required costly proprietary systems to run intensive workloads and store their data efficiently. That is changing with advances in silicon, networking and software, which bring higher performance, greater flexibility, and cost savings.

SciNet, Canada’s largest academic supercomputer center, is based in Toronto and serves thousands of researchers in biomedical, aerospace, climate sciences, and more. Its large-scale modeling, simulation, analysis and visualization applications involve complex sets of processes that sometimes run for weeks, and an interruption could destroy the results of the entire compute job. To guard against interruptions, most supercomputers use fast checkpointing, which periodically saves a job's state so that it can be restarted from the last checkpoint rather than from the beginning. But when a system is checkpointing, it's not computing - and at scale, as individual jobs become larger, checkpointing may take too long to complete, making the calculation difficult or even impossible to carry out.
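
To make the trade-off concrete, the following minimal Python sketch estimates what share of a job's wall-clock time goes to checkpointing. The interval and write-time figures are hypothetical, chosen only for illustration; they are not SciNet measurements, and the function name is an assumption of this sketch.

```python
def checkpoint_overhead(compute_interval_s: float, checkpoint_write_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Assumes the job alternates between computing for `compute_interval_s`
    seconds and pausing for `checkpoint_write_s` seconds to write a checkpoint.
    """
    if compute_interval_s <= 0 or checkpoint_write_s < 0:
        raise ValueError("interval must be positive and write time non-negative")
    cycle = compute_interval_s + checkpoint_write_s
    return checkpoint_write_s / cycle


# Hypothetical figures: one checkpoint per hour of computation.
slow = checkpoint_overhead(3600, 1200)  # 20-minute write to a slow file system
fast = checkpoint_overhead(3600, 60)    # 1-minute write to a fast buffer

print(f"slow storage: {slow:.1%} of wall-clock time spent checkpointing")
print(f"fast buffer:  {fast:.1%} of wall-clock time spent checkpointing")
```

Under these assumed numbers, shortening the checkpoint write is what returns time to computation, which is the motivation for absorbing checkpoint writes on faster media.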

So SciNet implemented a burst buffer – a fast intermediate layer between the non-persistent memory of the compute nodes and the storage. Instead of making the entire file system meet a very large write performance requirement, a portion of the file system (or in some cases, a separate file system) is configured to take a burst of write I/O at a very high rate. When flash storage is used as the burst buffer pool it has the added advantage of facilitating a faster restart (when needed) as checkpoint restarts often impose a very large random read load on the underlying storage.

Burst buffer approaches fall into two general categories:

  1. A fast centralized file system, up to the speeds the deployment can afford.  This is costly, uses a lot of power, and the controllers can become a bottleneck.
  2. Burst to a temporary location, perhaps local, and then move the individual host bursts to a central location later.  This is more economical, but because the checkpoint remains on the compute host's local media until it is moved, it is lost if the host itself fails - making the checkpoint potentially useless.
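
The second approach can be sketched in a few lines of Python. This is an illustration only: the directory names, file name and helper functions are hypothetical, and ordinary temporary directories stand in for a host's local media and the central file system.

```python
import shutil
import tempfile
from pathlib import Path


def burst_write(local_dir: Path, name: str, data: bytes) -> Path:
    """Step 1: absorb the write burst on fast local media."""
    path = local_dir / name
    path.write_bytes(data)
    return path


def drain(local_path: Path, central_dir: Path) -> Path:
    """Step 2: later, move the checkpoint to the central location."""
    central_dir.mkdir(parents=True, exist_ok=True)
    dest = central_dir / local_path.name
    shutil.move(str(local_path), str(dest))
    return dest


with tempfile.TemporaryDirectory() as root:
    local = Path(root) / "node0_local"   # stand-in for the host's local media
    central = Path(root) / "central_fs"  # stand-in for the central file system
    local.mkdir()

    ckpt = burst_write(local, "step_100.ckpt", b"simulation state")

    # Window of exposure: until drain() completes, only the host holds the data.
    safe_before_drain = (central / ckpt.name).exists()

    central_copy = drain(ckpt, central)
    safe_after_drain = central_copy.exists()

print(safe_before_drain, safe_after_drain)
```

The gap between the two steps is the weakness described above: a host failure before the drain completes takes the only copy of the checkpoint with it.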

Shared NVMe, For Greater Performance

Enter NVMe (Non-Volatile Memory Express), a standard that accelerates the transfer of data between computing resources and solid-state drives (SSDs), also known as flash storage. Because NVMe devices sit directly on the PCIe bus, they enable next-gen storage approaches more suitable for massive scalability.  The improvement over older drive interfaces such as SATA/AHCI is substantial: speedups of over 1000x have been claimed, NVMe supports roughly 65,000 command queues instead of only one, and it needs only a single message for a 4KB transfer as opposed to two, in addition to offering dramatically better power efficiency.

Because NVMe is costly, system architects usually allocate it to specific applications or teams whose work demands its low latency and high performance.  As a result, organizations end up with “storage silos” that hold unused NVMe flash capacity.  A recent advance now allows NVMe to be shared across a network with the same performance as if the resources were local.  SciNet’s team felt this approach would mitigate the cost of NVMe flash at supercomputing scale, and open the door to other scalability and flexibility advantages.

Three New Architectural Paradigms

In late 2017 SciNet created a peta-scale storage system that leverages the full performance of NVMe SSDs at scale, over the network. It deployed the shared NVMe via Excelero’s NVMesh® server SAN to create a unified, distributed pool of NVMe flash storage comprising 80 NVMe devices in just 10 NSD protocol-supporting servers. The approach enables three advantages:

1). Elastic NVMe. NVMesh and its patented Remote Direct Drive Access (RDDA) technology allow local NVMe drives to be used by remote compute nodes without consuming local CPU. As a result, the drives on every compute node are pooled for use by the cluster. In the simplest form, half of each drive can be used as a local burst buffer while the other half is reserved for the redundant copy of a peer. Thus, when a node fails, its scratch data is preserved and accessible from an alternate node - any node on the fabric. The system uses Mellanox’s InfiniBand as its high-speed, low-latency interconnect. This provided approximately 148 GB/s of write burst (device limited) and 230 GB/s of read throughput (network limited) – in addition to well over 20M random 4K IOPS.

2). Standard hardware. Checkpoints are typically saved in a shared, parallel file system; SciNet chose IBM Spectrum Scale, also known as the General Parallel File System (GPFS), as its de facto standard file system, deployed by Lenovo using its DSS-G appliances. Integration of the Excelero burst buffer with SciNet’s parallel file system was straightforward, and the system enables SciNet to scale both capacity and performance linearly as its research load grows.

3). Support for software-defined storage (SDS), which abstracts the underlying hardware into a single scalable pool of storage. SDS has seen good uptake for object storage and increasing use in the more traditional block storage used for high performance computing (HPC).  Being software-defined allows the NVMesh shared NVMe implementation to support SciNet’s Spectrum Scale file system without custom coding.
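
The half-local, half-peer layout described under "Elastic NVMe" can be illustrated with a short Python sketch. It assumes a simple ring in which each node mirrors to the next one; how NVMesh actually selects peers is not described here, so the ring, the node names and the function names are all hypothetical. The second part divides the throughput figures quoted above by the device and server counts, on the assumption that those figures refer to the 80-device, 10-server pool.

```python
from typing import Dict, List


def peer_mirror_map(nodes: List[str]) -> Dict[str, str]:
    """Map each node to the peer whose drive holds its redundant copy.

    Assumption: a simple ring, where node i mirrors to node i+1.
    """
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to mirror")
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


def recovery_source(failed: str, mirrors: Dict[str, str]) -> str:
    """Return the node whose drive still holds the failed node's scratch data."""
    return mirrors[failed]


nodes = [f"node{i}" for i in range(4)]  # hypothetical node names
mirrors = peer_mirror_map(nodes)
print(mirrors)
print("node3 scratch recoverable from:", recovery_source("node3", mirrors))

# Back-of-the-envelope averages from the figures quoted in the article.
WRITE_BURST_GBPS = 148  # device limited
READ_GBPS = 230         # network limited
DEVICES = 80
SERVERS = 10

per_device_write = WRITE_BURST_GBPS / DEVICES  # average GB/s per NVMe device
per_server_read = READ_GBPS / SERVERS          # average GB/s per server
devices_per_server = DEVICES // SERVERS

print(f"{per_device_write:.2f} GB/s write per device")
print(f"{per_server_read:.1f} GB/s read per server")
print(f"{devices_per_server} devices per server")
```

In this layout every node's scratch data exists on two hosts, so the loss of any single node leaves a usable copy on its ring neighbor, at the price of half the raw capacity.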

“For SciNet, using shared NVMe presented an extremely cost-effective method of achieving unheard-of burst buffer bandwidth,” said Daniel Gruner, Ph.D., chief technical officer, SciNet High Performance Computing Consortium. “By adding commodity flash drives and NVMesh software to compute nodes, and to a low-latency network fabric that was already provided for the supercomputer itself, the system provides redundancy without impacting target CPUs. This enables standard servers to go beyond their usual role in acting as block targets – the servers now can also act as file servers.”

Using pooled NVMe, SciNet gained important storage functionality with the highest performance available in the industry at a significantly reduced price – while assuring vital scientific research can progress swiftly.

Author’s Bio:  Tom Leyden is vice president of corporate marketing at Excelero, provider of a server SAN leveraging elastic NVMe at local performance.