The great strength of scalable systems is their ability to make the aggregate bandwidth available so parallel applications can achieve very high performance
The adage that a supercomputer is a complicated device that turns a compute bound problem into an IO bound problem is becoming ever more apparent in the age of big data. The trick to avoid the painful truth in this adage is to ensure that the application workload is dominated by streaming IO operations. The reason is that parallel file systems such as Lustre and GPFS are designed so they can scale to deliver the aggregate bandwidth of tens, hundreds and even many thousands of storage devices to match streaming bandwidth to increasingly demanding application workloads. With streaming IO and a parallel file system, the gating factor becomes how much an organization wishes to spend on storage-related hardware rather than the limitations of the storage technology. Meanwhile, the revolution in solid-state disk (SSD) pricing is being exploited to help parallel file systems deliver significantly higher performance on latency-limited big data workflows, plus greater bandwidth with fewer devices for streaming workflows. The need is real as innovations in computer architecture such as GPUs, coprocessors and Hybrid Memory Cubes redefine what is meant by big data and what can be done with it — along with the ability of ever smaller organizations to compete in the big data arena.
The Oak Ridge National Laboratory (ORNL) Titan supercomputer defines the current state-of-the-art in HPC storage. Built to run the scalable, high-reliability Lustre file system software, the Titan storage subsystem contains 20,000 disks, 40 petabytes (40,000 trillion bytes) of storage, and has the ability to deliver sustained performance approaching or exceeding a terabyte/s (TB/s) of storage bandwidth. By virtue of its GNU Public License (GPL) the Lustre file system used for Titan also runs on six of the 10 largest supercomputers in the TOP500 list, plus countless clusters around the world. Other distributed file systems include the Global Parallel File System (GPFS), PanFS, Ceph, glusterfs and several others.
The great strength of these scalable distributed file systems is their ability to make the aggregate bandwidth of a large number of storage devices available so parallel applications can achieve a very high streaming performance. This works very well in the HPC world, because most supercomputer applications generally read a large amount of sequential data when the job starts on the supercomputer, after which the application periodically writes a sequential data file to storage as a checkpoint for later analysis and potentially to use when restarting the job.
The common denominator in these workflows is that the file system and storage subsystem see mainly sequential (e.g. streaming) read and write operations from the client applications. Sequential access means that each hard disk drive in the storage array had the opportunity to deliver peak bandwidth. Additional benefits also occur because sequential I/O means that the buffers on individual servers will be very effective. Users on HPC systems are encouraged to use the file system tools to stripe individual large files across many storage devices to ensure maximum throughput. The file system also places a bet by distributing small individual files uniformly across the storage devices. The performance payoff occurs when applications that use many small files end up having their I/O operations uniformly distributed across the storage array, thus maximizing streaming bandwidth. Succinctly: tuning a distributed file system for streaming I/O means those expensive leadership class supercomputers can maximize their time running calculations while minimizing the time they spend performing I/O. The same design benefits translate beautifully to enterprise and smaller computational clusters as well.
The sheer number of CPU cores on the ORNL Titan supercomputer, 209,008 in all, highlights a particularly challenging problem in distributed file system design: specifically what happens when a large number of threads need to perform a file system operation at the same time.
The Lustre file system utilizes a single metadata server to remove as much latency as possible when processing both average and peak load file system requests. Other file systems such as GPFS distribute the file system metadata operations across many, or even all the nodes, so that large numbers of file operations will not bottleneck on a single server. Both approaches have benefits and drawbacks.
The advent of affordable, large SSD devices means that general file system metadata operations, which tend to be heavily seek dominated, are no longer limited by the storage media. While this sounds great, in practice, the Virtual File System layer (VFS) in the operating system or Installable File System layer (IFS) for Microsoft users tends to be CPU limited when performing the many string comparisons required for each filename lookup. Metadata servers can be built utilizing high-core-count processors, which is useful but does not really fix the scalability issue. File systems that distribute the metadata server across nodes fix the CPU bottleneck, but tend to be network latency-dominated due to the large number of messages that must be sent between servers during metadata operations. The point is that there is still work to be done in creating high-availability, scalable file system metadata servers.
Another tangible benefit of SSD devices is that all file systems (be they local or distributed) are now much, much better at supporting databases and other applications that exhibit non-streaming I/O patterns. Some of these new SSD devices can perform millions of random IO operations per second (IOPS). In contrast, current high-end disk drives can only perform a few hundred IOPS. The dramatic increase in device capability means that the full gamut of computational devices from top-tier HPC file systems to commodity laptops are now much more capable database engines. The challenge for distributed file systems (and users who wish to run databases on these file systems) is that the lock manager that preserves the integrity of each file’s contents and metadata must not get in the way. Because of the distributed nature of the storage units, the lock manager tends to be network latency-limited. Again, there is still work to be done in this area.
High availability is a must for any big-capacity file system. For example, it is just not possible to shut down the 40 petabytes of storage for the Titan supercomputer. The standing joke in file system design is that storage is persistent, which means that the error you experience today may have actually been caused by a problem that happened a year ago. Readers who have large terabyte disks connected to their personal computers know that it can take minutes to hours to scan and correct file system errors on just a few terabytes of storage. Unfortunately, checking petabytes of storage can take days to months. Such extended down times are just not possible with critical HPC and enterprise resources.
Instead, distributed file systems like Lustre include high-availability features that allow the file system integrity to be checked and corrected while the file system is in operation. Further, individual components in the distributed file system can fail or be purposefully shut down for maintenance or hardware/software upgrades without affecting the integrity of the file system as a whole. By design, the only impact will be a performance slowdown as active jobs wait for a backup server to assume the responsibilities of the failed or powered-down component. At the device level, Redundant Arrays of Independent Devices (RAID) protections and Logical Device Management (LVM) ensure error-free data availability.
In the enterprise and small business environment, open-source file systems, such as Lustre, can be downloaded just like any other open-source package. Many of these file systems have convenient interfaces for most operating systems. Lustre, for example, offers a native Linux client. The Common Internet File System (CIFS) gateways allow systems running other operating systems to utilize Lustre as well. Of course, many vendors also provide turn-key Lustre storage packages, as well as validated and warrantied components and upgrades.
With history as a guide, it is likely that the TB/s leadership class Titan storage subsystem of today will almost certainly become the expected level of enterprise file system performance in a few years. In particular, Hybrid Memory Cubes hold the promise of commodity laptops that contain a terabyte of battery-powered memory and workstations that contain hundreds of terabytes of tremendously fast TB/s memory. Titan has shown that it is possible with today’s technology to provide a TB/s of streaming bandwidth, which will certainly be required to support such huge memory devices. The challenge lies in dropping the costs of a TB/s of storage performance to a more affordable level. Even more so, much exciting work still needs to be done to exploit the capabilities of new technology such as SSD, GPUs and content addressable memory arrays to accelerate random-access dominated workloads, such as file system metadata servers and databases.
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.