Scalable Storage Solutions for Applied Big Data
The challenge lies in fitting all the pieces together so they work reliably, at the scale required, while maximizing the potential for future expansion
Considered in isolation, big data is nothing more than job security for tech vendors and system managers. Only through application can the value of big data be realized. For example, scraping the Internet for Web sites will clearly generate a big data set. In isolation, this information does nothing more than consume storage media and system administrator time. Google demonstrated that spectacular value can be realized by adding the ability to perform real-time distributed keyword searches on such big data sets. Similarly, many organizations are capitalizing on the actionable intelligence that can be found by solving big data “needle in the haystack” problems.
While application clearly defines value, from a computer design perspective, application depends entirely on the performance and scalability of the storage solution. Sans a performant storage solution, the entire value chain of people, big data, algorithms, parallel computation, expensive networks and high-performance computers breaks from data starvation and/or storage bottlenecks. Meanwhile, changes in the economics of computer storage, such as solid state disk storage dropping below the dollar per gigabyte barrier, plus innovations in computer memory, such as Hybrid Memory Cubes, are redefining the requirements for current and future high-performance storage solutions.
The key to big data lies in making it comprehensible to people, which is why even small organizations are jumping into the business of big data. Modern laptops easily can store and manipulate graphs with a billion nodes. Graphs are a key data structure for social media analysis, as they can represent the interactions between people, organizations and generic entities. The interactions can be attending the same events, going to the same Web sites, liking the same products, sending posts to each other, or linking to friends and colleagues. The challenge is that people don’t understand graphs with a billion nodes, graphs containing a thousand nodes, or even relatively small graphs containing a few tens of nodes. Those who create the commonly accepted metrics and data visualization tools stand an excellent chance of being richly rewarded at work and in the marketplace.
Enterprise systems managers know, and laptop users quickly discover, that the perfectly natural act of sharing data between users adds a tremendous amount of complexity to the storage infrastructure. Users tend to react badly when a weakness anywhere in the storage technology chain causes data loss, an inability to access the storage system, or even the perception of slowness. It does not matter what part of the data access path is the problem, be it the network, server(s) or physical disks. For example, an eight-year-old child really does not care that the WiFi network is too slow to let them watch cartoons without annoying freezes, just as the 40-year-old senior research scientist really only cares about getting the results of a storage intensive analysis back quickly. The pain point quickly transitions from personally annoying to very expensive when $30M supercomputers and/or expensive technical staff cannot work effectively due to an inadequate storage system.
Scalable storage is the only practical solution, because additional capability can be added as weaknesses are identified or as users become more data-intensive. Happily, scalable storage is a well-studied problem, and the market offers a plethora of storage products ranging from those that can support the needs of entertainment-oriented home users to the extreme trillion byte per second (TB/s) storage subsystem built for the Oak Ridge National Laboratory Titan supercomputer. From kilobits/s to terabytes/s, the storage wheel has been invented. The challenge lies in fitting all the pieces together so they work reliably, at the scale required, while maximizing the potential for future expansion.
From the perspective of an end user, the file-system software is the most visible component, as it provides the key interface that allows them to actually use data. Most home users and small organizations rely on the Common Internet File System (CIFS), which UNIX-oriented users often know through the Samba implementation. CIFS is natively supported by Microsoft Windows, Linux and Mac OS operating systems, and app store products are also offered for Android and iOS.
While CIFS is common, it is an older standard that has some inherent limitations that bound performance, reliability and scalability. For this reason, newer highly scalable, fault-tolerant HPC file-systems have been developed. Currently, GPFS and Lustre are the two most commonly used HPC file systems, but a number of other entries such as Ceph and GlusterFS also are available.
The core idea behind all these file-system designs is to aggregate the performance of a number of storage devices and/or servers to deliver very high data transfer rates to many clients simultaneously. Assuming the data is uniformly distributed so each device can deliver maximum bandwidth, it is easy to see that scaling the storage system to 10 devices will provide 10x the bandwidth of a single device; utilizing 100 devices will deliver 100x the bandwidth, and so on.
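The scaling argument above can be sketched with a toy model. The following Python is a minimal illustration, not a description of any particular file system: it assumes round-robin block placement and identical devices that stream in parallel, and the function names and figures (1 MB blocks, 100 MB/s per device) are hypothetical.

```python
def stripe_blocks(num_blocks, num_devices):
    """Round-robin placement: block i is stored on device i % num_devices.

    Returns the number of blocks held by each device.
    """
    counts = [0] * num_devices
    for block in range(num_blocks):
        counts[block % num_devices] += 1
    return counts


def aggregate_bandwidth(counts, block_mb, device_mb_per_s):
    """Effective MB/s when all devices stream their blocks in parallel.

    The transfer finishes when the busiest device finishes, so the
    effective rate is total data divided by that device's transfer time.
    """
    total_mb = sum(counts) * block_mb
    busiest_mb = max(counts) * block_mb
    return total_mb * device_mb_per_s / busiest_mb


if __name__ == "__main__":
    # Assumed figures for illustration only.
    for devices in (1, 10, 100):
        layout = stripe_blocks(1000, devices)
        print(devices, "devices:", aggregate_bandwidth(layout, 1, 100), "MB/s")
```

Under these assumptions, an evenly striped file read from 10 devices streams at ten times the single-device rate, and the same holds for 100 devices. The model ignores network and server limits, which real deployments must also scale.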
Random access, or latency-bound, data access patterns present a particular challenge for scalable file-systems. It is true that the number of random accesses (expressed in terms of IOPS, the number of I/O operations per second) will scale according to the number of devices, assuming a uniform distribution of accesses across all the devices. Unfortunately, commonly used random-access applications, such as relational databases, tend to have hotspots where small regions of storage are accessed and updated frequently. These hotspots break scalability and cause performance to be limited to the capabilities of just a few devices. Many common file operations (e.g., open, create, list, delete) require that the file-system modify or synchronize an internal database, which can force all the clients to wait while a few storage blocks are modified. This issue can be seen on even a commodity 24-core workstation. For example, it is quite reasonable to have a single thread or application running on each core attempt to open a file in a directory. As a result, the storage system will “see” 24 concurrent file opens. The problem only becomes worse in a cluster or cloud environment. In the extreme, the Titan supercomputer can potentially issue 299,008 simultaneous file-system metadata operations.
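The effect of a hotspot on IOPS scaling can be illustrated with a back-of-the-envelope model. This Python sketch assumes that every device sustains the same IOPS and that the system saturates as soon as its busiest device does. The function name and the request counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def sustainable_iops(requests_per_device, device_iops):
    """Total IOPS the system can sustain for a given access pattern.

    requests_per_device lists how many requests land on each device.
    The busiest device saturates first, so it bounds the whole system.
    """
    total = sum(requests_per_device)
    busiest = max(requests_per_device)
    return device_iops * total / busiest


if __name__ == "__main__":
    device_iops = 200  # assumed per-device capability

    uniform = [100] * 10            # accesses spread evenly over 10 devices
    hotspot = [500] + [50] * 10     # one of 11 devices takes half of all requests

    print("uniform:", sustainable_iops(uniform, device_iops))
    print("hotspot:", sustainable_iops(hotspot, device_iops))
```

In this model, a hotspot that draws half of all requests to one device caps the system at twice the IOPS of that single device, no matter how many devices are added.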
A partial fix is to utilize solid-state storage for either the file-system metadata server or within a tiered storage architecture. SSD devices are able to support orders of magnitude more IOPS than hard disks, making them a wonderful new addition for scalable storage. The challenge is that fast storage means that the file-system becomes limited by the ability of the CPU to perform the string comparisons required to look up a filename. While a high-end 24-core processor can help, it is not a cure-all. Some scalable file-systems like GPFS distribute the metadata operations across the nodes, which can help avoid the CPU string comparison limitation in the general case, but this introduces limitations due to a heavy reliance on the network interconnect (in particular, network latency). In short, the perfect solution has yet to be found for many database and file-system metadata operations.
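One generic way to spread metadata work across nodes is to hash each path name to a node. The Python sketch below illustrates that idea only. It is not how GPFS or any other named file system assigns metadata, and the function names and example paths are hypothetical.

```python
import hashlib
from collections import Counter


def metadata_node_for(path, num_nodes):
    """Pick the node responsible for a path's metadata by hashing its name.

    A cryptographic hash is used so the mapping is stable across runs
    and spreads similar names over different nodes.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def placement_histogram(paths, num_nodes):
    """Count how many paths each metadata node would serve."""
    return Counter(metadata_node_for(p, num_nodes) for p in paths)


if __name__ == "__main__":
    # Hypothetical file names, one per compute core.
    paths = ["/scratch/run42/rank%05d.out" % rank for rank in range(10000)]
    for node, count in sorted(placement_histogram(paths, 8).items()):
        print("node", node, "serves", count, "paths")
```

Each client can compute the responsible node locally, so lookups for different files proceed in parallel. The trade-off matches the one described above: every operation on a remotely owned path costs at least one network round trip, so interconnect latency becomes the limit.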
Looking to the future, new memory architectures, such as Hybrid Memory Cubes, hold the promise of commodity laptops that contain a terabyte of battery-powered memory and workstations that contain hundreds of terabytes of tremendously fast, TB/s memory. Of course, the demands on storage systems will increase as applications expand and users move from working with tens of gigabytes of data to datasets that fit in tens of terabytes of RAM. In all likelihood, current HPC file systems will be able to meet the streaming data bandwidth requirements of these future systems. Still, there is plenty of room for innovation to find the next-generation file-system solutions for insanely huge or fast (by current standards) database and metadata servers. Perhaps the next killer hardware platform will be a GPU or content-addressable array processor that will accelerate metadata name lookup and service latency-limited database requests.
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.