Confronting the Data Tsunami
A comprehensive look at the latest data management techniques
The book, Scientific Data Management: Challenges, Technology, and Deployment, edited by Arie Shoshani and Doron Rotem, describes cutting-edge technologies and solutions for managing and analyzing the vast amounts of data generated and used by NERSC, Oak Ridge National Laboratory (ORNL), the Large Hadron Collider (LHC) and many other leading-edge scientific institutions. Each chapter contains insights and experience gleaned by storage experts and luminaries who are confronting and managing the data tsunami that has inundated leading-edge scientific and supercomputing centers around the world. Individuals in a variety of scientific and commercial areas who are struggling to manage large amounts of data should find this book both educational and useful.
This is not a textbook, but rather a compendium of interrelated chapters written by individuals who teach, pioneer and generally are responsible for “[making] storage work” for their respective institutions and sponsors, such as the Office of Science. Each chapter is structured to stand alone: it focuses on one topic, contains an introduction, raises the issues, discusses solutions, and provides at least one example of a practical application. However, the chapters also refer to each other, so readers can flip to more detailed discussions of key topics elsewhere in the book. Extensive references guide the reader to external resources as well.
The book begins with a three-part discussion of storage technology, parallel access and dynamic storage management. Included is information about the newest solid-state storage technologies, which can potentially revolutionize storage by eliminating seek latencies, and about parallel file systems. These file systems run on workstations, yet can act as a common file system for national centers with many large supercomputers. Data integrity, device failure, failure statistics and other topics also are covered.
The focus then changes to the efficient movement of data and management of storage spaces. This includes an exploration of emerging database systems for scientific data. There is information on transparent data movement that frees users from having to understand the details of the underlying storage mechanisms.
How to best organize data for analysis purposes and effectively conduct searches over large datasets is the topic of the section on specialized retrieval techniques and database systems. Scientific data is treated as distinct from commercial data because it can contain hundreds of attributes per record and billions of records. In some instances, instruments such as the LHC can insert data into a database extraordinarily quickly. Topics covered include:
• the main differences between commercial DBMS and scientific data
• a taxonomy of external index methods
• retrieval methods, such as smart iterators and vertical databases that minimize the amount of data that must be locked in high-update environments
• a brief introduction to array structures, a very new concept for extremely large databases
Building on the previous sections, the book then reviews scientific process management, including metadata and provenance, how to successfully automate multistep scientific workflows, and how to collect metadata and lineage information automatically.
This book provides a comprehensive understanding of the latest techniques for managing data during scientific exploration processes, from data generation to data analysis. Enhanced by numerous detailed color images, it includes real-world examples of applications drawn from biology, ecology, geology, climatology and more.
Rob Farber is a senior research scientist at Pacific Northwest National Laboratory (PNNL). He may be reached at editor@ScientificComputing.com.




