The sequencing machines that run today produce data several orders of magnitude faster than the machines used in the Human Genome Project. We at the Wellcome Trust Sanger Institute currently produce more sequences in one hour than we did in our first 10 years of operation. A great deal of computational resource is then needed to process that data. For instance, a single cancer genome sample produces DNA sequence data that requires up to 7,000 CPU hours for analysis, and we’re doing tens of thousands of these at once. The sheer scale is enormous, and the computational effort required is huge.
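As a rough illustration of that scale, the arithmetic can be sketched in a few lines of Python. The 7,000 CPU hours per sample is the upper figure quoted above, and the 17,000 cores matches the HPC clusters described later in this article. The batch size of 10,000 samples and the assumption that the work spreads perfectly across every core are illustrative assumptions only, not measurements from our pipelines.

```python
# Back-of-envelope sketch: the sample count and perfect scaling are assumptions.

def total_cpu_hours(samples: int, cpu_hours_per_sample: float = 7_000) -> float:
    """Total CPU hours for a batch, using the article's upper bound per sample."""
    return samples * cpu_hours_per_sample


def wall_clock_days(samples: int, cores: int,
                    cpu_hours_per_sample: float = 7_000) -> float:
    """Idealised elapsed time in days, assuming work divides evenly across cores."""
    return total_cpu_hours(samples, cpu_hours_per_sample) / cores / 24


if __name__ == "__main__":
    samples = 10_000   # hypothetical batch size ("tens of thousands" in the text)
    cores = 17_000     # cluster size quoted later in the article
    print(f"Total CPU hours: {total_cpu_hours(samples):,.0f}")
    print(f"Idealised wall-clock days: {wall_clock_days(samples, cores):.1f}")
```

Even under these generous assumptions, one batch would occupy the whole cluster for months, and real workloads also wait on storage and network I/O.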
We are always under pressure to keep pace with rapid data growth and around-the-clock access demands. With unpredictable data growth, it became difficult to scale storage sufficiently without overburdening our existing network infrastructure or encroaching beyond power and space constraints in the data center. As a result, we found ourselves facing a classic big data problem that was further exacerbated whenever new advances in sequencing technology produced more sequencing data more quickly and more cheaply than ever before. On one upgrade, we went — almost overnight — from 100 sequencers to the equivalent of 700 machines.
It is crucial for us to have scalable, reliable, high-performance storage that serves multiple purposes — ranging from sophisticated storage that supports a Lustre file system for complex computational analysis to bulk storage underlying the Integrated Rule-Oriented Data Management System (iRODS) for managing large data collections.
Therefore, we began exploring technologies that could play a significant role in improving our immediate and future architecture. We needed solutions that would give us a much better way to provide storage to our expanding user community with good access controls through iRODS. After all, if you need 10,000 cores to perform an extra layer of analysis in an hour, you have to scale both storage and compute to get answers quickly. You need a solution that can address everything from very small to extremely large data sets.
We ended up deploying the DDN SFA high-performance storage engine and the EXAScaler Lustre file system appliance as part of a 22-petabyte genomic storage environment, using DDN’s solutions primarily as high-performance parallel storage to feed our 17,000-core HPC clusters. This deployment has enabled unprecedented levels of throughput and scalability to support tens of thousands of genome sequences and the subsequent analysis of that data.
Since installing our initial SFA storage platform, we have kept pace with ever-increasing computational and analytical demands, taking advantage of DDN’s ongoing performance increases to achieve speeds of up to 20 GBps, enough to meet the needs of the most demanding workloads. To accommodate demands for increased bandwidth, we also upgraded our 10GbE network to 40GbE, and we plan to scale our current DDN storage to support the expanded network capacity. We are also exploring the DDN WOS distributed object storage platform as an option for increased collaboration and data sharing as part of a private cloud.
We are pushing our machines to the limit. To deliver our world-leading research, we need world-leading IT to go with it, and DDN helped fulfill that need. We’re making great progress in taming big data growth while scaling our environment more cost effectively. For more on what we do and our experience with DDN, I encourage you to check out this short video or watch this talk.
Tim Cutts is Head of Scientific Computing at the Wellcome Trust Sanger Institute. He may be reached at editor@ScientificComputing.com.