Intel's New Omni Scale Fabric Platform: Taking on I/O Interconnect Challenges for Extreme-Scale HPC
Even as CPU power and memory bandwidth march forward, one bottleneck has hampered overall supercomputing performance for the past decade: the I/O interconnect.
The vision behind Intel’s new Omni Scale Fabric is to deliver a platform for the next generation of HPC systems — one that overcomes the growing cost, latency and power consumption problems posed by current fabric technology.
Omni Scale Fabric is an end-to-end solution consisting of adaptors, edge switches, director switches and fabric management and development tools. In addition, it will be integrated into the next generation of Intel Xeon Phi processors and future generations of Intel’s Xeon microprocessors.
“Raw compute power without the capability to connect and combine that power won’t support the industry’s drive for ExaFLOP computing,” said Barry Davis, General Manager of Intel’s High Performance Fabric Organization in the company’s Technical Computing Group. “The CPUs we’re bringing to market are scaling well. Memory bandwidth that those CPUs utilize is scaling well. What’s not scaling well is the I/O interconnect — the I/O fabric. It’s not moving at the same pace. That’s what Omni Scale Fabric is designed to address.”
Omni Scale Fabric is the result of several years of development by engineering teams assembled when Intel acquired QLogic's InfiniBand assets and the ASIC interconnect design team from Cray, along with other groups within Intel.
Its roots lie in a combination of True Scale Fabric, Intel's current interconnect technology, and high-performance interconnect intellectual property acquired from Cray. Built on top of InfiniBand, True Scale is a fabric designed specifically for HPC. Omni Scale Fabric is software compatible with True Scale, so existing HPC applications and the middleware built on top of True Scale will run on it.
The Intel Omni Scale Fabric is designed to address problems of cost, density and reliability in large-scale HPC clusters. As clusters grow more complex, costs rise, and with current-generation technology the interconnect fabric can account for up to 30 percent of total cluster cost. Omni Scale Fabric addresses this through a combination of host CPU integration, a design focused on what matters most to an HPC fabric, and the integration of high-density Intel Silicon Photonics in director-class switches.
Density is another major challenge. The integration built into Omni Scale Fabric helps here by removing the need for PCI Express-based adaptors in the server node and delivering the fabric interface directly from the Intel CPU. This is a level of integration the technical computing industry has long sought from Intel.
Large, complex clusters also pose major reliability challenges. Omni Scale Fabric addresses this with a set of unique capabilities in the switching fabric specifically designed to scale clustered solutions up to tens of thousands of nodes.
Finally, there is power consumption, a critical environmental and cost concern. The most widely used interface between a CPU and a fabric controller is PCI Express, an intermediate hop that adds latency, cost and power consumption. Omni Scale Fabric instead provides a tightly integrated point-to-point connection from the CPU to the fabric, eliminating that middle step and cutting latency and power use. This will have major implications as extreme-scale clusters emerge over the rest of the decade.
While Intel has designed Omni Scale Fabric as a solution for the future, it also protects investments in the present through compatibility with today's True Scale Fabric: existing applications and middleware built on True Scale will run on Omni Scale, an important consideration for many organizations.
“Omni Scale is going to drive down cluster cost,” Davis said, “and we're going to drive down power use. We're going to drive down the latency. We're going to improve the density. All these factors are major improvements and will have far-reaching impact on technical computing. This is a game-changing capability for the HPC community.”
Doug Black is a communications professional who has been involved in the HPC industry for close to 20 years. He may be reached at editor@ScientificComputing.com.