Clusters, Parallel Computing and Workflow Management in Bioinformatics

Bioinformatics has become a principal driver of the migration from traditional multiprocessing servers to clusters of commodity-priced machines
Jeff Augen

Several years ago, dramatic advances in information technology and computer sciences made possible the launch of in silico biology. As the field matured, researchers became proficient at defining biological problems using mathematical constructs and building the computer infrastructure to solve those problems. Over time, it became clear that most biological problems lend themselves to solution in a clustered environment after division into a large number of small pieces. This mathematical property of most biological problems has now become a principal driver of one of the most important trends in the information technology industry — the migration from large multiprocessing computers to clusters of commodity-priced machines. The migration is driving the development of a variety of tools for application integration and resource management that are further solidifying the role of clusters as the dominant force in high performance technical computing.

Biotechnology differs from other technical disciplines because its computational component, bioinformatics, is built on a new class of computationally intense problems and associated algorithms that are still evolving. Furthermore, the emergence of new sub-disciplines within biotechnology, such as systems modeling, high throughput sequencing, and mRNA profile analysis, is likely to drive even more demand for unique and powerful IT platforms. Finally, unlike more mature industries, the biological world is experiencing explosive growth with regard to both the amount and type of available data.

The designers of bioinformatic algorithms have been quick to take advantage of the atomic nature of many biological problems by building parallel infrastructure, most often Linux clusters composed of commodity-priced machines. These clusters have now become a dominant force in bioinformatics, replacing large symmetric multiprocessing (SMP) systems whenever practical. Despite its recent emergence, bioinformatics has become a principal driver of one of the most important trends in information technology: the migration from traditional multiprocessing servers to clusters of commodity-priced machines.

Another property of most bioinformatic problems is that their solution requires a large number of discrete steps. Different algorithms, linked by their inputs and outputs, form the basis of the steps. Moreover, the steps are connected by a series of logical constructs: loops, conditionals, and branches. Taken together, the algorithms, connections, and flow logic make up a complete workflow that describes the underlying problem. Once properly described, individual components of the workflow can often be executed in parallel with varying degrees of granularity.
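As a rough illustration of the idea, a workflow can be written down as a set of steps plus the dependencies between them, and any steps whose inputs are ready can then run at the same time. The sketch below is a minimal, hypothetical example and not a description of any particular product: the names run_workflow, gc_fraction, WORKFLOW_STEPS and WORKFLOW_DEPS are invented for illustration, threads on one machine stand in for cluster nodes, and loops and conditionals are left out so that only the dependency structure is shown.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def gc_fraction(seq):
    """Fraction of bases in a sequence that are G or C (toy analysis step)."""
    return sum(base in "GC" for base in seq) / len(seq)


def run_workflow(steps, deps, max_workers=4):
    """Run each step as soon as all of its predecessors have finished.

    steps: dict mapping step name -> function(inputs) -> value, where
           inputs is a dict holding the results of that step's predecessors
    deps:  dict mapping step name -> set of step names it depends on
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # raises CycleError if the dependencies contain a cycle
    results = {}
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose predecessors are all complete.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, ())}
                running[pool.submit(steps[name], inputs)] = name
            # Block until at least one running step finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock downstream steps
    return results


# Hypothetical four-step workflow: load, two independent analyses, merge.
WORKFLOW_STEPS = {
    "load": lambda inp: ["ACGT", "GGCC", "ATAT", "GCGC"],
    "gc_first": lambda inp: [gc_fraction(s) for s in inp["load"][:2]],
    "gc_second": lambda inp: [gc_fraction(s) for s in inp["load"][2:]],
    "merge": lambda inp: inp["gc_first"] + inp["gc_second"],
}
WORKFLOW_DEPS = {
    "gc_first": {"load"},
    "gc_second": {"load"},
    "merge": {"gc_first", "gc_second"},
}
```

Calling run_workflow(WORKFLOW_STEPS, WORKFLOW_DEPS) runs "load" first, then "gc_first" and "gc_second" concurrently, and "merge" last. The granularity of parallelism is set entirely by how finely the steps are divided, which is the property the paragraph above describes.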

Coordination of tasks in a workflow-based clustered environment has traditionally involved the development of simple scripts that distribute computing jobs across a cluster and manage the output. This approach has evolved and, today, vendors are beginning to offer tools that integrate and manage application workloads across a cluster. Three major tasks must be addressed if a cluster is to be used as a "virtual supercomputer": resource management, data management, and application management. These tasks differ substantially in a number of ways. For example, resource management is a universal problem, cutting across all computing disciplines and evolving rapidly as computer, communications, and storage technologies change. In contrast, both data management and application management are far more specialized, and the best solutions are likely to depend significantly on details specific to particular types of applications or vertical market segments. As a result, it seems unlikely that any single software system will emerge to address all three tasks at once.

Resource management tools essentially address the question: "Who may do what, and when and where may they do it?" These tools are designed to address job management, user authentication, and the allocation of facilities such as computers, storage, and network capacity. The most common type of resource management tool is a batch queuing system. Frequently used batch queuing systems include open-source versions of the Portable Batch System (PBS) and Sun Grid Engine, as well as commercial versions of those two systems and Platform's LSF. A batch queuing system oversees execution of computing jobs that are submitted to it by users. However, such systems have drawbacks related to the fact that they are static and cannot dynamically manage and distribute workloads within a cluster. Complex jobs and computing environments are not static; their needs and capabilities change frequently. Without information about the internal activities of the jobs, batch queuing systems cannot overcome the constraints of static resource allocation or respond effectively to dynamic changes. More sophisticated solutions are evolving that are capable of monitoring the performance of specific tasks in a cluster and making real-time changes to the way tasks are distributed.
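The cost of a fixed allocation can be made concrete with a toy calculation. The sketch below is an illustrative simulation only and does not model how PBS, Sun Grid Engine, or LSF actually schedule work: it assumes a job made of tasks with uneven run times and compares assigning the tasks to workers in advance against handing each task to whichever worker becomes free first. The function names and the task durations are hypothetical.

```python
import heapq


def static_makespan(durations, n_workers):
    """Pre-assign tasks round-robin before anything runs.

    Returns the time at which the slowest worker finishes.
    """
    loads = [0.0] * n_workers
    for i, d in enumerate(durations):
        loads[i % n_workers] += d
    return max(loads)


def dynamic_makespan(durations, n_workers):
    """Hand each task, in order, to whichever worker becomes free first.

    Returns the time at which the last worker finishes.
    """
    free_at = [0.0] * n_workers  # min-heap of times at which workers are free
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


# Hypothetical job: two long tasks mixed in with six short ones.
TASK_DURATIONS = [8, 1, 1, 1, 8, 1, 1, 1]

STATIC_RESULT = static_makespan(TASK_DURATIONS, 4)    # 16.0
DYNAMIC_RESULT = dynamic_makespan(TASK_DURATIONS, 4)  # 9.0
```

With these assumed durations, the round-robin plan places both long tasks on the same worker and finishes at time 16, while the dynamic scheme finishes at time 9 because idle workers pick up the remaining tasks. A real system would not know the durations in advance, which is why observing tasks as they run matters.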

Data management tools address the accessibility and delivery of data, either from files or database systems. In a cluster (or grid) setting, two different types of data management tools are widely used:
• Database systems that both store large quantities of data and deliver it on demand in response to specific queries, and
• Data access systems that provide a single virtual interface for integrated access to, and delivery of, data that may be stored on a number of disjoint and distributed file and/or database systems.

In both cases, data management tools are responsible for user authentication for data access, reliable data delivery, data security and encryption, caching strategies to reduce access or delivery time and conserve network communication capacity, and other related issues.
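To make the second category and the caching point more tangible, the following is a minimal sketch of a data access layer that presents several separate stores through one interface and remembers records it has already fetched. Everything here is assumed for illustration: the class name CachedDataAccess, the source names, and the in-memory dictionaries standing in for remote files or databases. Authentication, encryption, and cache eviction are deliberately omitted.

```python
class CachedDataAccess:
    """One lookup interface over several independent data sources.

    sources: dict mapping source name -> callable(key) -> record.
    Each callable stands in for a remote file or database fetch.
    """

    def __init__(self, sources):
        self._sources = sources
        self._cache = {}          # (source, key) -> record
        self.backend_calls = 0    # counts fetches that reached a backend

    def get(self, source, key):
        cache_key = (source, key)
        if cache_key not in self._cache:
            # Cache miss: go to the underlying store and remember the result.
            self.backend_calls += 1
            self._cache[cache_key] = self._sources[source](key)
        return self._cache[cache_key]


# Hypothetical backends: two unrelated stores behind one interface.
SEQUENCE_STORE = {"seq1": "ACGT", "seq2": "GGCC"}
ANNOTATION_STORE = {"seq1": "hypothetical protein"}

access = CachedDataAccess({
    "sequences": SEQUENCE_STORE.__getitem__,
    "annotations": ANNOTATION_STORE.__getitem__,
})
```

Requesting access.get("sequences", "seq1") twice reaches the backend only once; the second call is served from the cache, which is the mechanism by which caching reduces delivery time and network traffic.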

Finally, most high performance technical computing solutions involve many applications that must be integrated into a complete solution. Because resource and data management tools do not address application integration, new tools are evolving to address these needs by allowing users to integrate applications into a single managed workflow whose individual tasks are dynamically distributed across a cluster. The intelligent combination of resource, data, and application management tools allows a typical cluster built of commodity-priced hardware to become a replacement for a traditional supercomputer.

Jeff Augen is president and CEO of TurboWorx. He may be contacted at