Developing a Technology Roadmap for Data-intensive Computing

Wed, 08/08/2012 - 12:00pm

The role of group psychology in the transition to massively parallel computing 

Figure 1: NVIDIA Kepler GK110 Block Diagram
Technology and computational evangelists quickly learn that human psychology is a key component of any project roadmap. The truth in the quip that it took one social genius plus 500,000 scientists and engineers to put a man on the moon can be appreciated when one observes the social issues that surface in technical discussions within even small groups of people. As the name implies, data-intensive computing requires large amounts of data.

My article “Big Money for Big Data” in the June 2012 issue of HPC Source notes that massive parallelism is the only path forward for organizations wishing to cope with the ever-increasing size of big data sets. Just as technology is changing the meaning of “big” data, so is it increasing what is meant by “massive” parallelism. At the moment, GPU programmers work with tens of thousands of threads, while multi-core programmers utilize tens of threads. This technical dichotomy has created a tension between the CTO (chief technology officer), who is responsible for defining the data processing goals that keep an organization competitive over time, and those within the organization vested with the responsibility to evaluate, integrate and manage the technology used to reach those goals within a production environment.

Both parties are making a rational risk/reward calculation, but when and how should the performance gains attributed to massively parallel technology be included in the organization roadmap? While technology ultimately dictates production capability, the choice of the technology roadmap to meet the performance goals for the organization is a very human process. People differ in their risk affinity and perception of a reward. So, they naturally make different judgments. Uncertainty increases the importance of testing, but opinion tends to guide the decision-making process about what to test. Unfortunately, personal biases are a known danger that can inadvertently affect the test process to support a pre-conceived notion.

Figure 2: AMD 7970 GPU block diagram
While it seems counterintuitive, the success and rapid evolution of massively parallel hardware and development platforms like CUDA and OpenCL are, in a very real sense, a detriment, because people have become desensitized by repeated success stories and repeatedly having to ask the question “are the benefits of this technology now worth the risk and effort required to start using it?” Hype, the continued state of flux, lack of certainty, and need to frequently reevaluate the risks and benefits have taken a toll and induced a form of fatigue.

Regardless, organizations must continue to play this guessing game, because a failure to adopt this new technology early, yet at the right time, risks placing the organization at competitive disadvantage. Planning for incremental change is a challenge, but a mad rush to match the success of another organization is much worse. In addition, some organizations might still be recovering from the stress of a recent transition to threaded programming to exploit the performance potential of multicore processors.

The prospect of incorporating massively parallel co-processors into production workflows means the whole threading issue needs to be revisited. It also introduces the challenges of integrating new hardware into a production environment while making new, rather pervasive changes to the software development process.

At this juncture, GPU success stories might be viewed by decision makers as interesting, but lacking merit in assessing the benefits and costs of a transition to new technology. Progressive organizations might have tried some early experiments with GPU computing and decided to not pursue the technology at that time, which can make conversations about the new capabilities of these devices difficult. Still, GPU success stories continue to highlight what is possible, and Intel’s introduction of the MIC (Many Integrated Core) architecture further validates the market draw of these massively parallel devices. While validation and competition in the market are good, differences between MIC and GPU architectures further complicate the decision-making process, especially as people need to be educated so they understand the capabilities and drawbacks of both architectures.

Over the next three to five years, hybrid CPU/co-processor-based systems seem likely to be the platform of choice for many organizations, because they provide the greatest amount of performance, parallelism and flexibility within a desired cost and power envelope. Basically, these systems appear to be a safe bet to get the technology in house and start the process of capitalizing on whatever capability it can provide without incurring too much expense. Again, people’s perceptions matter: just as pulling on one thread begins the process of unraveling a sweater, so does the introduction of tens of thousands of threads per process hold the potential to gum up the software development and production data processing capabilities of an organization.

Figure 3: Intel Sandy Bridge Core I7 3960X block diagram
Proponents of massive parallelism can expect push-back from multiple sources — sometimes from unexpected individuals — as they work to introduce hybrid computing and new threading models, wean programmers from cache coherent SMP programming, and ultimately “rock the boat” at some very fundamental levels throughout an organization. Again, education is an essential part of the process.

Key to the move to massively parallel programming (again, think tens of thousands of threads per application) is the scalability of the threading model. The motivation for using such large numbers of threads is based on technical reasons including:

  • thread level parallelism (TLP), as opposed to traditional vector approaches
  • limiting communications between threads for scalability (e.g. utilizing a strong scaling execution model like CUDA or OpenCL)
  • tying data to computation so it can be parceled out to many processing elements
  • SIMD execution and others
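The first three points can be illustrated with a small sketch. It is written in Python purely for readability and is not GPU code; the function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical and chosen only for this example. Each worker receives its own slice of the data, the workers never communicate with one another, and the partial results are combined once at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Parcel the data out into at most num_chunks contiguous, independent pieces."""
    chunk_size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sum_of_squares(chunk):
    """Work done by one worker: it reads only the chunk it was handed."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    """Map independent chunks onto workers, then combine the partial results once."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # No worker talks to another worker; the only shared step is the final sum.
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

Because of the interpreter lock in CPython, these threads will not make the arithmetic run faster. The point is the shape of the decomposition: the same pattern of independent chunks plus a single combining step is what allows a threading model to scale when the workers are thousands of hardware threads rather than a handful of software ones.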

Visually, one can get a sense of the motivation for scaling to tens of thousands of threads by looking at the GPU parallelism shown in the block diagrams of a 2,880 concurrent thread NVIDIA GK110 Kepler GPU chip (Figure 1) and a 1,280 thread AMD Tahiti 7970 GPU chip (Figure 2). For comparison, the block diagram of an 8-core (16-thread) Intel Sandy Bridge Core I7 3960X also is provided in Figure 3. Both CPU and GPU devices require that the programmer utilize sufficient numbers of threads to keep the hardware processing elements busy.

Future devices will certainly offer even more parallelism as GPU chip designers leverage the latest fabrication techniques and architectural enhancements to maximize the number and efficiency of the on-chip processing elements. CPU chip designers are also adding cores, but just not as quickly. Cache coherency, non-SIMD architecture, and other characteristics present a challenge to the scalable replication of processor cores by CPU designers. In either case, the software programming model must scale to use any additional hardware parallelism, both now and in the future, or performance will be lost.

Various financial and personal factors enter into the choice of software development platform used to create and manage large numbers of threads. Common sense is one of the most important guides (along with expert advice) to find the best solution. In particular:

  • Avoid vendor lock-in, because the industry is in a state of flux.
  • Minimize risk by looking at software development platforms that are known to work for many people on real projects using massively parallel hardware.
  • Make certain the software platform allows hybrid CPU plus co-processor applications so various configurations of CPU, GPU, and/or MIC hardware can be utilized as the hardware evolves over time.

Education is critical; otherwise, people can become overwhelmed by a host of ideas that are important to efficient co-processor and hybrid application development. For example, SIMD execution, TLP, the ability to manage multiple asynchronous execution queues, asynchronous data movement, and transparent compilation for multiple device types are all important characteristics of the software development platform.

When possible, select a software platform that preserves the ability to migrate to an alternate development platform in the future. While prudent, skeptics might interpret discussion of potential migration pathways away from a proposed software roadmap as a sign that the roadmap and/or technology is not fully cooked.

People skills are important. Technically, discussions on a migration option can be couched in terms of Amdahl’s law: keep the sequential sections of code in a common language such as C or C++ and express the parallel sections so they can be migrated should it prove necessary to do so in the future. For example, CUDA and OpenCL are close enough that semi-automatic translators already exist — although many are only in research form. Similarly, C++ data parallel extensions, such as Thrust, are generic enough to provide some degree of comfort that future versions will be implemented for other platforms like OpenCL. Many organizations also are working on data parallel software extensions similar to Thrust.
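For readers who want the arithmetic behind the Amdahl’s law argument, the following is a minimal sketch in Python. The function name `amdahl_speedup` and its parameters are hypothetical, introduced only for illustration. It computes the standard bound: if a fraction p of the run time can be parallelized across N processing elements, the best possible speedup is 1 / ((1 - p) + p / N).

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Upper bound on speedup when only part of the run time can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at its original speed; the parallel part is divided by N.
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


if __name__ == "__main__":
    # Illustrative values only: a code that is 95 percent parallel.
    for n in (16, 1024, 32768):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95 percent of the run time parallelized, the bound approaches 20 no matter how many threads are added. This is why the discussion can focus on how the parallel sections are expressed and migrated, while the sequential sections stay in a common language.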

At this moment, technical innovation continues to rapidly evolve massively parallel devices into ever-more-capable computational tools. The introduction of MIC-based co-processors by Intel will accelerate the evolutionary process by adding competition. While both GPU and MIC devices support multiple programming models, the current PCIe-based packaging imposes memory capacity, data locality and bus bandwidth limitations. The extra bandwidth of the latest PCIe version 3 bus can help, but data transport overhead remains an issue.
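The impact of data transport overhead can be estimated with back-of-the-envelope arithmetic before any hardware is purchased. The sketch below is in Python, and its function names, parameters, and sample numbers are all hypothetical. It also assumes the transfer is not overlapped with computation, so it gives a pessimistic estimate for platforms that support asynchronous data movement.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_second):
    """Time to move num_bytes across the bus at a given sustained bandwidth."""
    return num_bytes / bandwidth_bytes_per_second


def offload_speedup(cpu_seconds, device_seconds, bytes_moved,
                    bandwidth_bytes_per_second):
    """Speedup of an offloaded task once the bus transfer time is charged to it."""
    total_device_seconds = device_seconds + transfer_seconds(
        bytes_moved, bandwidth_bytes_per_second)
    return cpu_seconds / total_device_seconds


if __name__ == "__main__":
    # Assumed, illustrative numbers: a kernel that is 10x faster on the device,
    # 8 GB moved over a link that sustains 8 GB/s.
    print(offload_speedup(cpu_seconds=10.0, device_seconds=1.0,
                          bytes_moved=8e9, bandwidth_bytes_per_second=8e9))
```

In this made-up case, a kernel that is 10 times faster on the device delivers only a 5 times improvement once the transfer is counted. A faster bus shrinks the transfer term but does not remove it.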

Profiling data movement within the existing organization workflows is an essential part of understanding where co-processors can fit. It also may uncover inefficiencies due to excess copying and highlight portions of the workflow that can achieve greater parallelism on the current hardware platform. With a bit of luck, even the process of evaluating existing workflows to see where to incorporate massively parallel co-processors can be beneficial.
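As a starting point for this kind of profiling, the time spent copying data can be recorded separately from the time spent computing on it. The following Python sketch shows one possible way to do that. The names `timed_stage`, `stage_seconds`, `run_workflow`, and `copy_fraction` are hypothetical, and the two-stage workflow is a stand-in for a real pipeline.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage of the workflow.
stage_seconds = defaultdict(float)


@contextmanager
def timed_stage(name):
    """Add the wall-clock time of the enclosed block to the named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] += time.perf_counter() - start


def run_workflow(records):
    """Toy workflow: stage (copy) the input, then compute on the staged copy."""
    with timed_stage("copy"):
        staged = list(records)  # stands in for a buffer copy or bus transfer
    with timed_stage("compute"):
        result = sum(x * x for x in staged)
    return result


def copy_fraction():
    """Fraction of the total measured time that was spent moving data."""
    total = sum(stage_seconds.values())
    return stage_seconds["copy"] / total if total > 0 else 0.0
```

A high copy fraction in an existing workflow suggests that excess copying, rather than a lack of processing elements, may be the first thing to address.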

For additional high-level technical discussion of programming models and massively parallel devices, such as the NVIDIA Kepler and Intel MIC products, please see my Scientific Computing article “The GPU Performance Revolution” and my article “Intel’s 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?” on the Dr. Dobb’s Journal Web site.

Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at

