With traffic in both directions blurring the boundaries, where do you draw the line between these two data-intensive domains?
In late 2011, I wrote in Scientific Computing about the increasing collisions between commercial big data and data-intensive HPC. In the short time since then, the dimensions of these encounters have become clearer. This might qualify as the most serious rapprochement since business computing first branched off from scientific-technical computing in the early 1960s.
Traffic in both directions is blurring the boundaries between commercial computing and data-intensive HPC. Companies that focused only on commercial computing, such as LexisNexis, Oracle, SAP, SAS and EMC, have been forming high-performance analytics teams and moving into HPC. They are doing this to meet the escalating high-end analytics needs of existing customers and to capture new HPC customers. In the other direction, HPC stalwarts, such as Cray and SGI, are taking aim at both HPC and commercial markets by offering systems with graph analytics capabilities that commodity clusters can’t match. The major hardware OEMs (HP, IBM, Dell, et al.) and their storage and networking counterparts have prior, but separate, experience in both markets and are learning how to fuse their big data marketing and sales efforts.
There is still some definitional confusion. The commercial sector typically says that big data must exhibit three Vs: volume (lots of data), velocity (time criticality), and variety (multiple types of data). Often, a fourth V is added: value (the data must be worth something to someone). And let’s not forget: commercial big data is also a new trend, a.k.a. the “next big thing.”
We can all think of HPC applications that nicely fit the definition of the four Vs. Weather forecasting ensemble models, fluid-structure interactions and bioinformatics are a few examples that come to mind. But these are not new, and there’s the rub. It’s sometimes difficult for commercial folks, caught up in the passion of the next big thing, to acknowledge that the HPC community has been dealing with definition-compliant big data jobs for decades, going back to the birth of scientific-technical computing in the post-WWII era.
That is why, where big data is concerned, IDC uses a broad, descriptive rather than a narrow, prescriptive definition. We want to be in concert with the worldwide HPC community, which generally sees big data as including both established and newer data-intensive computing methods, such as large-scale graph analytics, semantic technologies and knowledge discovery algorithms.
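To give a concrete flavor of one of these methods, here is a minimal large-scale graph analytics sketch: a toy PageRank-style ranking in Python. Everything in it (the four-node graph, the damping factor, the iteration count) is an illustrative assumption, not a production technique; real HPC graph analytics runs on distributed systems over graphs with billions of edges.

```python
# Toy PageRank: rank nodes of a small directed graph by link structure.
# The graph, damping factor and iteration count are illustrative only.

# Adjacency list: node -> nodes it links to (hypothetical data).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node keeps a base share, plus damped shares from its in-links.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            share = damping * ranks[node] / len(out_links)
            for target in out_links:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

for node, rank in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(f"{node}: {rank:.3f}")
```

The same iterate-until-stable pattern underlies many knowledge discovery algorithms; what pushes such jobs into HPC territory is the scale of the graphs, not the arithmetic.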
Defining HPC big data comprehensively is also important, because users will increasingly seek to maximize insights and innovation by applying both established and newer methods to the same scientific or industrial problem, often using the same HPC system.
A good example is climate research, one of the most data-intensive of all HPC domains. As the report from the first Climate Knowledge Discovery Workshop (April 2011, Hamburg, Germany) notes: “Current approaches to data volumes are primarily focused on traditional methods, best-suited for large-scale phenomena and coarse-resolution data sets. The data volumes from climate modeling will increase dramatically due to both increasing resolution and number of processes described. What is needed is a suite of new techniques [i.e., knowledge discovery algorithms] interpreting and linking phenomena on and between different time- and length scales, as well as realms and processes. Such tools could provide unique insights into challenging features of the Earth system, including extreme events, nonlinear dynamics and chaotic regimes.”
Climate research’s vision of extracting new insights by applying existing and new algorithm types to their problems — for the most part separately, not as a mashup — points the way toward what other domains will inevitably do as well. IDC has long defined HPC as requiring the use of modeling and simulation. Now, analytics problems based on complex algorithms are gaining ground in the market. And some of the most challenging commercial analytics problems are already starting to use HPC resources. Consider these examples:
- eBay is using high-performance clusters and Hadoop-based workflows for fraud detection in the PayPal system.
- A probability-of-loan-default calculation that used to take as long as 20 hours now takes under one minute using commercial software on an HPC cluster. Recalculating an entire risk portfolio used to take up to 20 hours and now takes as little as 15 minutes (see the sketch after this list).
- Other commercial applications that now run on HPC systems include retention campaigns, regression analysis, and coupon redemption rate jobs.
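To make the parallelism behind speedups like the loan-default example concrete, here is a minimal sketch of data-parallel risk scoring in Python. The logistic model, its coefficients and the synthetic portfolio are purely hypothetical, invented for illustration; the point is only that each loan can be scored independently, so the work divides cleanly across cores or, at cluster scale, across nodes.

```python
import math
import random
from multiprocessing import Pool

def default_probability(loan):
    # Hypothetical logistic model of default probability; the features
    # (loan-to-value ratio, credit score) and coefficients are invented.
    ltv, credit = loan
    z = -4.0 + 3.0 * ltv - 0.004 * credit
    return 1.0 / (1.0 + math.exp(-z))

def portfolio_risk(loans, workers=4):
    # Each loan is independent, so the portfolio splits cleanly
    # across worker processes (or, at scale, cluster nodes).
    with Pool(workers) as pool:
        probs = pool.map(default_probability, loans, chunksize=10_000)
    return sum(probs) / len(probs)  # mean default probability

if __name__ == "__main__":
    random.seed(42)
    # Synthetic portfolio: (loan-to-value ratio, credit score) pairs.
    portfolio = [(random.uniform(0.3, 1.1), random.uniform(500, 830))
                 for _ in range(100_000)]
    print(f"mean default probability: {portfolio_risk(portfolio):.4f}")
```

Because there is no coupling between loans, doubling the worker count roughly halves the wall-clock time, which is the same embarrassingly parallel structure that lets commercial risk jobs exploit HPC clusters.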
The commercial healthcare field looks especially promising as a market for HPC big data resources. This, too, is not exactly new. In the 1990s, all of the following things happened:
- A women’s clinic in Germany began routinely using HPC to predict which expectant mothers would require surgery for Caesarean births, with the goal of avoiding traditional, riskier last-minute decisions during childbirth.
- A Washington, DC, hospital routinely employed HPC to “read” digital mammograms with better-than-human accuracy to spot early signs of breast cancer (microcalcifications).
- Hospitals in Europe and the U.S. began using HPC in surgical training, especially to convey the “feel” of various procedures as experienced by veteran surgeons (haptics).
Recent healthcare examples will up the ante by exploiting some of today’s most powerful systems:
- The Mayo Clinic will use an HPC system as a real-time, interactive expert system for outcomes-based patient diagnostics and treatment planning. Outcomes-based guidance will aggregate findings from 10 million archived patient records.
- DOE’s Joint Genome Institute has asked NERSC for access to its HPC infrastructure and expertise. Among other things, DOE wants to know how to turn cellulose into ethanol, and this is a genomics problem. In essence, it involves mimicking what happens in a termite’s stomach.
With commercial and HPC big data converging in many areas, where do you draw the line between the two data-intensive domains? The answer, in IDC’s view, is that a problem moves into the HPC realm when it requires HPC resources, especially software needing to run on HPC hardware to meet performance goals. It seems apparent that, over time, a growing number of commercial problems will scale to this level. This will cause more commercial vendors to expand into the HPC market, and more HPC vendors to expand in the opposite direction.
Steve Conway is IDC Research VP, HPC. He may be reached at editor@ScientificComputing.com.