Maximizing Fabric Efficiency in HPC Clusters
Three techniques can enable higher levels of performance

High-performance computing users invest in performance-oriented interconnect fabrics such as InfiniBand to balance their systems against the computational capabilities of modern multi-core processors and GPUs. However, as HPC systems grow to hundreds of nodes, thousands of cores and numerous simultaneous jobs, users are increasingly sensitive to interconnect fabric efficiency as a major contributor to performance. Having made a significant investment in the fabric, HPC system managers must now look at how efficiently it actually enables communications.

In this article, I will examine the issue of fabric efficiency in HPC clusters and explain how three techniques (dispersive routing, adaptive routing and quality of service, or QoS) enable higher levels of efficiency.

Why fabric efficiency is crucial in HPC environments
As clusters increase in size, the amount and diversity of communications traffic in the fabric increase:
1. Within a single application, the stress of All2All and collective communications increases dramatically with the number of nodes and communicating processes.
2. Application messaging patterns may become more diverse as a result of improved highly parallel algorithms, new applications, introduction of GPU technology and exploitation of PGAS languages.
3. Fabrics are becoming multi-use facilities. HPC centers that once ran one job at a time now achieve high levels of system utilization by running multiple jobs. With growing user populations, job schedulers for such multi-user systems will be challenged to provide consistently ideal rank placement in the fabric. In addition, communications traffic from disparate workloads competes for fabric resources or requires conflicting optimizations.

While the statically routed InfiniBand standard supports a rich set of topologies, the dynamic nature of the considerations described above demands a more dynamic view of fabric behavior. Even fabrics configured as “fat-trees” with full bisection bandwidth can exhibit congestion, reducing overall fabric efficiency to about 60 percent, depending on message patterns.

Maximizing fabric efficiency means minimizing congestion that can slow communications. For example, a messaging pattern in which several nodes are talking to the same switch port can cause congestion that saps performance when other applications are trying to use the same communications resource. Alternatively, a large file server job that consumes fabric bandwidth can interfere with a compute-intensive application that needs to optimize latency through the fabric.

Getting the most efficient use of the fabric therefore becomes increasingly important as multi-core processors make messaging patterns more complex and as the number of nodes keeps growing. Fortunately, there are techniques that can be used to reduce congestion, prioritize traffic, and optimize the message flow through the fabric.

Key tools for enhancing fabric efficiency
An advantage of InfiniBand fabrics is the ability to provide many potential routing paths between any two nodes. However, conventional InfiniBand usage exploits only a single path between nodes. Since congestion is the major reason for reduced fabric efficiency, QLogic has chosen to focus on congestion reduction capabilities that exploit these additional routing paths. Other techniques, such as injection rate control, can keep congestion from spreading, but at the cost of underutilizing the fabric bandwidth. Injection rate control is therefore useful for containing end-point congestion rather than for significantly increasing overall fabric efficiency.

Fabric efficiency enhancement techniques that exploit additional routing paths include:
• Dispersive routing — Avoiding the formation of contention by load balancing traffic across multiple paths.
• Adaptive routing — Monitoring the fabric for early signs of congestion and, where present, selecting new routing paths via less congested portions of the fabric.
• Quality of service (QoS) — Multi-use fabrics can have workloads whose traffic patterns necessarily collide in the fabric but have very different QoS requirements. For example, a short low-latency message from a computational workload may contend with a large data transfer when servicing a file system request.

Each of these advanced fabric capabilities (dispersive routing, adaptive routing and QoS) is described in the sections that follow.

Dispersive routing load-balances message traffic
An advantage of InfiniBand fabrics is the ability to provide diverse routing paths. Dispersive routing exploits this capability by distributing MPI message traffic over multiple routes in order to load-balance traffic automatically, rather than having a pair of nodes communicate only via a single routing path. This technique minimizes the likelihood that congestion will occur and has been shown to be highly effective for oblivious message patterns, raising the fabric efficiency from 60 percent to 85 percent. It is applicable to the wide range of MPI communications libraries that support the open-source PSM interface.
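The load-balancing idea can be illustrated with a toy model. The sketch below is not the PSM algorithm; it simply compares pinning each node pair to one path against spreading each pair's messages round-robin over all paths. The function names, the four-path fabric and the traffic pattern are all assumptions made for illustration.

```python
from collections import Counter

def static_route(flows, num_paths):
    """Pin each (src, dst) pair to one path chosen from the destination ID."""
    load = Counter()
    for src, dst, messages in flows:
        load[dst % num_paths] += messages
    return load

def dispersive_route(flows, num_paths):
    """Spread each pair's messages round-robin over every available path."""
    load = Counter()
    for src, dst, messages in flows:
        for i in range(messages):
            load[(dst + i) % num_paths] += 1
    return load

def peak_load(load, num_paths):
    """Message count on the busiest path."""
    return max(load.get(p, 0) for p in range(num_paths))

# Hypothetical traffic: four node pairs whose destinations all map to path 0
# under the static rule, which is the worst case for single-path routing.
flows = [(0, 4, 100), (1, 8, 100), (2, 12, 100), (3, 16, 100)]
NUM_PATHS = 4

static_peak = peak_load(static_route(flows, NUM_PATHS), NUM_PATHS)
dispersive_peak = peak_load(dispersive_route(flows, NUM_PATHS), NUM_PATHS)
print(static_peak, dispersive_peak)  # 400 100
```

In this contrived pattern the busiest path carries 400 messages under single-path routing and 100 when traffic is dispersed. Real fabrics will show smaller differences that depend on topology and message pattern.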

Adaptive routing alleviates congestion
While dispersive routing helps minimize the potential for congestion depending on messaging patterns, congestion can still occur. Adaptive routing can handle this by continually monitoring the fabric for the formation of congestion, and moving network traffic between end nodes from over-utilized routing paths to under-utilized routing paths. It leverages intelligence in the switches themselves in order to monitor the performance of the available routing paths, and automatically shifts traffic to less congested routes. InfiniBand ordering is maintained such that all InfiniBand protocols operate transparently. This capability requires fabric management software that supports the adaptive routing function.

Adaptive routing benefits both single-use and multi-use fabrics. Long-running applications will often have various phases, each exhibiting a different messaging pattern. Adaptive routing detects congestion formation in the fabric, and transparently shifts routes to under-utilized paths. Multi-use environments add the complexity that jobs are continually entering and leaving portions of the fabric. Here too, adaptive routing detects any congestion formation as the fabric is disturbed, and transparently shifts routes away from over-utilized paths.
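The monitor-and-shift behavior described above can be sketched as a simple rebalancing loop. This is a minimal illustration under stated assumptions, not the switch firmware or fabric manager logic: the `rebalance` and `path_loads` functions, the congestion threshold and the flow rates are hypothetical, and the sketch ignores the packet-ordering guarantees a real implementation must preserve.

```python
def path_loads(path_of_flow, flow_rate, num_paths):
    """Sum the offered load on each path."""
    load = [0] * num_paths
    for flow, path in path_of_flow.items():
        load[path] += flow_rate[flow]
    return load

def rebalance(path_of_flow, flow_rate, num_paths, threshold):
    """Shift flows from the busiest path to the idlest one while the busiest
    path exceeds `threshold`. Mutates path_of_flow; returns the moves made."""
    moves = []
    while True:
        load = path_loads(path_of_flow, flow_rate, num_paths)
        hot = max(range(num_paths), key=lambda p: load[p])
        cold = min(range(num_paths), key=lambda p: load[p])
        if load[hot] <= threshold:
            break  # no path is congested
        candidates = [f for f, p in path_of_flow.items()
                      if p == hot and flow_rate[f] > 0]
        if not candidates:
            break
        # Move the smallest flow first so the change is least disruptive.
        flow = min(candidates, key=lambda f: flow_rate[f])
        if load[cold] + flow_rate[flow] >= load[hot]:
            break  # moving would only relocate the hot spot
        path_of_flow[flow] = cold
        moves.append((flow, hot, cold))
    return moves

# Hypothetical state: three flows share path 0, path 2 is idle.
routes = {"a": 0, "b": 0, "c": 0, "d": 1}
rates = {"a": 40, "b": 30, "c": 30, "d": 10}

moves = rebalance(routes, rates, num_paths=3, threshold=60)
print(moves)                         # [('b', 0, 2), ('c', 0, 1)]
print(path_loads(routes, rates, 3))  # [40, 40, 30]
```

Rerunning `rebalance` whenever the measured rates change mimics how routes would follow an application through its phases, or follow jobs as they enter and leave the fabric.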

QoS control prioritizes traffic
The HPC interconnect fabric can suffer significant losses in performance as multiple applications contend for resources. For example, if a message-intensive application requires ultra-low latency, adding a bandwidth-intensive application, such as a large file system transfer, can increase the overall latency by an order of magnitude.

QoS allows a priority and a bandwidth allocation to be assigned to each traffic class. In this way, a message-intensive application can retain its ultra-low latency, while a storage application’s traffic can retain its required fabric bandwidth, all with minimal conflict. The result is that, rather than increasing latency by an order of magnitude, adding the storage application only modestly impacts the message-passing application, while still providing the storage application with essentially full bandwidth.

In addition, the open-source PSM communications library can utilize QoS to minimize application latency. Specifically, MPI control messages and short messages can be assigned a higher priority than long MPI messages. This reduces latency for communications related to progress and reliable delivery, and for latency-sensitive messaging.
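The effect of giving short messages priority over bulk transfers can be shown with a single-link queue model. This is a sketch under simplifying assumptions: all packets are queued at time zero, transmission time equals packet size in arbitrary units, and the `drain` function and traffic class numbers are hypothetical rather than part of any InfiniBand or PSM interface.

```python
import heapq

def drain(packets, use_priority):
    """Return {name: completion_time} for packets queued on one link.

    packets: list of (name, traffic_class, size) in arrival order.
    A lower traffic_class value means a higher priority.
    """
    queue = []
    for order, (name, traffic_class, size) in enumerate(packets):
        # Without QoS every packet is served strictly in arrival order.
        key = (traffic_class, order) if use_priority else (0, order)
        heapq.heappush(queue, (key, name, size))
    clock, done = 0, {}
    while queue:
        _, name, size = heapq.heappop(queue)
        clock += size  # time to transmit this packet
        done[name] = clock
    return done

# Hypothetical mix: a short MPI message arrives behind two bulk storage packets.
packets = [("bulk-1", 1, 64), ("bulk-2", 1, 64), ("mpi-short", 0, 1)]

fifo = drain(packets, use_priority=False)
prio = drain(packets, use_priority=True)
print(fifo["mpi-short"], prio["mpi-short"])  # 129 1
print(fifo["bulk-2"], prio["bulk-2"])        # 128 129
```

In this model the short message completes at time 1 instead of 129, while the last bulk packet finishes only one time unit later, which matches the pattern described above: a large latency benefit for the message-passing traffic at little cost to the storage traffic.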

Whose responsibility is fabric efficiency?
As a practical matter, it is important that fabric management software provide these advanced routing and QoS capabilities transparently to all MPI communication libraries. QLogic has chosen to enable these fabric efficiency capabilities within the open-source PSM user-mode communications library, which supports all widely used open-source and commercial MPI communications libraries. In this way, no application changes are required, and the user is not restricted to a specific MPI communications library. Fabric management responsibility remains in the hands of the system administrator, who has an overall view of user commitments, system resources and current workload.

The larger the fabric and the more diverse the workload, the more important it is to maximize efficiency. Taken separately, dispersive routing, adaptive routing and QoS can each improve fabric efficiency by reducing congestion and prioritizing traffic. Used together, they offer a powerful set of capabilities for improving overall efficiency. QLogic tests have shown that these tools can increase overall efficiency from 60 to 85 percent, a relative improvement of roughly 42 percent. These techniques have been developed for the world’s largest compute clusters, but today’s advanced InfiniBand fabric software now makes them available for clusters of any size.
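The relative improvement implied by moving from 60 to 85 percent efficiency can be checked directly. The helper name below is an assumption used only for this calculation.

```python
def relative_gain(baseline_pct, improved_pct):
    """Relative improvement of improved_pct over baseline_pct, as a fraction."""
    return (improved_pct - baseline_pct) / baseline_pct

gain = relative_gain(60, 85)
print(f"{gain:.1%}")  # 41.7%
```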

Lloyd Dickman is CTO for InfiniBand Products in QLogic’s Network Solutions Group. He may be contacted at