Figure 1: The experimental version will merge into the mainstream compiler and become available for all users sometime after the 4.9 GNU compiler release around 2015.

Basic tools are now available to test efficacy; it’s worth taking a look

Encryption and nuclear weapons are two easily recognized examples where a combinatorial explosion (i.e., a situation where the number of possible combinations grows very rapidly as the number of options increases) is a sought-after characteristic. In the software development world, combinatorial explosions are bad. In particular, it is far too easy to become lost in the minutiae of writing code that can run efficiently on NVIDIA GPUs, AMD GPUs, x86, ARM and Intel Xeon Phi while also addressing the numerous compiler and user interface vagaries involved in building applications and user interfaces for Linux, OS X, Windows, iOS and Android devices.

The ideal is to have a single source tree of portable code that can compile and run efficiently on devices from mobile to HPC and, thus, eliminate the redundant costs of creating cross-platform, hardware-agnostic libraries and applications. Pragma-based programming standards like OpenACC (and potentially OpenMP 4.0) move us closer to the single-source tree ideal by providing a mechanism to annotate code to run efficiently on serial and parallel, as well as shared and discrete memory hardware devices ranging from the NVIDIA K1 for mobile products to the TACC Stampede, ORNL Titan and Tianhe-1 leadership-class supercomputers.

Free GNU-based OpenACC compilers are under development, which means the global developer community will soon be able to test the reality of a single, portable “from sequential to massively parallel” source base. Other freely available software packages, such as the Qt user interface generator, mean that the global developer community also has the means to write both command-line and GUI (graphical user interface) applications for Android, iOS, Microsoft Windows and X Window System devices. Thus, the creation of single-source code libraries and applications for mobile, PC, enterprise and HPC users might finally become a demonstrable reality in the near future.

The key contribution that OpenACC provides for single-source software projects that span mobile to HPC is the ability to use pragmas to define regions of code that can run efficiently in parallel on both multi-core shared-memory computers and massively parallel accelerators with discrete memory. These architectures span the gamut of currently available hardware devices used in everything from mobile phones to the latest generation leadership-class supercomputers. Compatibility with the OpenACC standard means that the compiler understands a set of pragmas, or language constructs, that provides hints about how a particular section of code should be parallelized. The advantage of OpenACC over OpenCL, which was the previous de facto standard for portable parallelism, is its support for multiple programming languages. (OpenCL is based on the C programming language.) Currently C, C++ and Fortran programs can be annotated with OpenACC pragmas, but there is no reason why other languages such as Java cannot also benefit from pragma-based annotation.
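To make the idea of a pragma-defined parallel region concrete, here is a minimal sketch in C++. The function name saxpy and its signature are illustrative assumptions, not something taken from a particular library. The only OpenACC-specific line is the pragma: a compiler built with OpenACC support may offload the loop, while any other C++ compiler simply ignores the annotation and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (an assumption for illustration): computes y = a*x + y.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();  // raw pointers so the data clauses
    float* yp = y.data();        // can describe the array extents

    // copyin: x is only read on the device; copy: y is read and written back.
    // Without OpenACC support this line is ignored and the loop runs serially.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The same source file therefore serves as both the sequential and the parallel version, which is the single-source property the article is describing.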

OpenACC leverages the success of the existing OpenMP pragma-based programming standard, which has been used successfully to write parallel programs for multi-core processors. Unlike OpenMP, which assumes a shared memory space where any processor can directly access any memory location, the newer OpenACC standard has been designed to efficiently support devices with separate, discrete memory, such as GPUs and Intel Xeon Phi. From a hardware point of view, this change is essential, because shared memory is no longer a sustainable memory model for massively parallel devices. Unfortunately, maintaining cache coherency across hundreds or thousands of processing cores just does not scale, as the messaging overhead becomes too burdensome. Further, designing with discrete memory systems has the advantage that the memory bandwidth does not have to be shared except during data transfer operations.
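The practical consequence of a discrete memory space is that data movement becomes something the programmer describes explicitly. The sketch below, with the hypothetical function scale_then_shift, uses an OpenACC data region so the array is transferred to the device once, used by two parallel loops, and transferred back once. On a compiler without OpenACC support the pragmas are ignored and the braces are just an ordinary block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example (an assumption for illustration): scale every element,
// then add an offset, written as two separate loops over the same array.
void scale_then_shift(std::vector<double>& v, double scale, double shift) {
    assert(!v.empty());
    const std::size_t n = v.size();
    double* p = v.data();

    // One host-to-device copy on entry and one device-to-host copy on exit.
    #pragma acc data copy(p[0:n])
    {
        // present: the array is already on the device, so no transfer here.
        #pragma acc parallel loop present(p[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            p[i] *= scale;
        }

        #pragma acc parallel loop present(p[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            p[i] += shift;
        }
    }
}
```

Keeping the array resident between the two loops avoids paying for the transfer twice, which is usually the first optimization worth trying on discrete-memory devices.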

The OpenMP standards committee has recognized the need to support discrete memory space devices in the forthcoming OpenMP 4.0 standard. The concerns about OpenMP 4.0 are twofold:
1. Will the standard be general enough so that all devices, including GPUs, will be able to run with high performance, and
2. Given that no OpenMP 4.0 compilers yet exist, how well does the standard work in practice?

Time will tell, and it is expected that OpenMP 4.0 and OpenACC 2.0 will eventually merge to become a single unified standard. Happily, early adopters will not be penalized: because pragmas merely annotate the source code, pragmas for different standards can, by design, be mixed without side effects. Thus OpenMP, OpenACC and OpenMP 4.0 pragmas can all coexist within the same source code without conflict.
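As a rough illustration of pragmas from different standards living in one source file, the following sketch annotates one loop with OpenMP and a second loop with OpenACC. The function sum_of_squares is a made-up example. A compiler that understands only one of the two standards ignores the other pragma, and a compiler that understands neither produces an ordinary serial function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example (an assumption for illustration): returns
// 0*0 + 1*1 + ... + (n-1)*(n-1) using two differently annotated loops.
double sum_of_squares(int n) {
    assert(n >= 0);
    std::vector<double> squares(static_cast<std::size_t>(n));
    double* p = squares.data();

    // OpenMP annotation: host threads fill the array on a multi-core CPU.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        p[i] = static_cast<double>(i) * i;
    }

    double sum = 0.0;
    // OpenACC annotation: the reduction may be offloaded to an accelerator.
    #pragma acc parallel loop reduction(+:sum) copyin(p[0:n])
    for (int i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}
```

Each loop carries annotations from only one standard here, which keeps the sketch valid whichever combination of standards the compiler enables.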

Free compiler availability is essential to adoption by the worldwide software community. Kudos to the GNU compiler developers who have created and made freely available the most commonly utilized compilers in the world!

According to Nathan Sidwell, Director of Sourcery Services at Mentor Embedded, their experimental version of a GNU OpenACC-enabled compiler is expected to be available in Q4 of 2014. The experimental version will merge into the mainstream compiler, and become available for all users, sometime after the 4.9 GNU compiler release around 2015, as shown in Figure 1. The Mentor Embedded effort is implementing a true compiler that generates PTX, rather than acting as a translator to OpenCL. Per a market requirements assessment, the experimental version of the compiler will only run on x86 machines. There is no technical reason why an ARM version of the OpenACC compiler cannot be implemented, so sign up if you are interested in creating an ARM implementation.

Nathan Sidwell also notes that their team is working closely with the OpenMP GNU team to “make the underlying implementation as OpenACC 2.0 and OpenMP 4.0 agnostic as possible.” This will eliminate much redundant work and help create a stable implementation for both standards as quickly as possible.

I was also happy to hear that their OpenACC effort is looking to add intelligence so the compiler can flag any confusion between pointers to objects that reside in different memory spaces. It is expected that C++ object compatibility will be based on classes for which is_trivially_copyable is true, meaning that device-compatible C++ objects will reside in contiguous regions of memory and adhere to certain other criteria, such as no virtual bases.
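The is_trivially_copyable criterion can be checked today with the standard C++11 type trait of the same name. The sketch below is not the Mentor Embedded implementation; the types Particle, Base and Derived and the helper device_compatible are hypothetical names used only to show which kinds of classes pass the test.

```cpp
#include <type_traits>

// A plain aggregate: contiguous storage, no virtual members or bases.
struct Particle {
    float x, y, z;
    float mass;
};

struct Base {
    int id;
};

// A virtual base class means the type is no longer trivially copyable.
struct Derived : virtual Base {
    float value;
};

// Hypothetical compile-time guard (an assumption for illustration): true when
// objects of type T can be moved between memory spaces as raw bytes.
template <typename T>
constexpr bool device_compatible() {
    return std::is_trivially_copyable<T>::value;
}

static_assert(device_compatible<Particle>(),
              "Particle can be copied to a device as raw bytes");
static_assert(!device_compatible<Derived>(),
              "a type with a virtual base is not trivially copyable");
```

Placing a static_assert like this next to a type that is meant to be used in offloaded code turns an obscure runtime failure into a compile-time error.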

Functioning OpenACC portability has been confirmed by several vendors, including PGI and CAPS, through tradeshow demonstrations that compile a single source code to run on AMD GPUs, NVIDIA GPUs, x86 multicore, ARM multicore and Intel Xeon Phi coprocessors. Expect multi-platform commercial compilers to become available in the 2014 to 2015 timeframe.

Software efforts that utilize OpenACC can capitalize on the power of both current and future generations of teraflop-per-second massively parallel devices. For example, NVIDIA claims their new K1 processor for the mobile and tablet market can deliver 360 gigaflop per second (billion floating-point operations per second) while consuming one watt of power.

Amdahl’s law tells us that, at the ideal limit, the time to solution will ultimately be dictated by the runtime of the serial sections of the code. In practice, those codes that achieve this ideal speedup will be rare. Experience has shown that speedups of 2x to 10x are frequently possible, which means that applications can become more interactive. Mobile applications in particular will benefit from these speedups. A much smaller number of applications can achieve speedups between 10x and 100x, which is disruptive, as computations that were previously out of reach become possible. For example, a speedup of two orders of magnitude means that a run that would previously have taken a year to complete can finish in just a few days on the newer device(s). Some HPC applications are very scalable, which means they can achieve speedups beyond 100x by simultaneously using a large number of devices. I use my scalable machine learning and numerical optimization framework as a teaching tool to demonstrate how to achieve both high performance, exceeding a teraflop per second on a single device, and over 13 petaflops per second when using 16,384 devices on the Oak Ridge National Laboratory leadership-class Titan supercomputer. In a nutshell, many scientists and software developers will find that OpenACC is the only tool they need to create massively parallel command-line driven applications and libraries for other developers.
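Amdahl’s law is simple enough to evaluate directly. If a fraction p of the runtime can be parallelized across n processing elements, the speedup is 1 / ((1 - p) + p / n), and the limit for very large n is 1 / (1 - p). The function names below are illustrative assumptions, and the sketch ignores data transfer and other overheads that real codes incur.

```cpp
#include <cassert>

// Speedup predicted by Amdahl's law.
// p: fraction of the serial runtime that can be parallelized (0 to 1).
// n: number of processing elements working on the parallel part (>= 1).
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound on the speedup as n grows without limit: only the serial
// fraction (1 - p) of the runtime remains.
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, a code that is 99 percent parallel can never run more than about 100 times faster, however many devices are used, which is why speedups beyond 100x require applications with a very small serial fraction.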

A GUI is used by many applications, especially those in the mobile market, to provide a more intuitive, interactive way for people to work with the computer. Writing GUIs is an art form unto itself, and a beautiful interface can make the experience of interacting with an application both efficient and pleasant. Vendors have realized the consumer appeal of a consistent and intuitive user interface, which is why the look-and-feel of the user interface has become a market discriminator for mobile devices and for-profit application software. Thus, market lock-in and platform-specific user expectations further complicate cross-platform GUI development far beyond the technical issues.

There are many efforts to provide cross-platform GUI development tools, all of which achieve varying degrees of success, and none of which will satisfy all users. That said, projects like the well-established Qt GUI framework allow the creation of user interfaces for all desktop operating systems, many real-time systems, and the popular mobile operating systems.

The idea behind Qt and the other GUI creation tools is that the developer need only create the user interface once using a set of controls and widgets (a small application with limited functionality). The developer can then generate a user interface for a destination device without having to deal with the combinatorial explosion of options that must be addressed for the different devices, input capabilities, screen resolutions, GPU capabilities, operating systems, runtimes, and numerous other factors that make GUI design a thankless task. In GUI design, the devil really is in the details, which is why it is well worth the time taken to seek out a good portable user interface design tool. Mobile vendors in particular are in a highly competitive evolutionary race to capture and keep customers with their user interface, so look for a GUI design tool that is actively developed to preserve performance and a platform’s current look and feel. Meanwhile, HPC and enterprise users are unforgiving of user interfaces that cannot manage large amounts of data.

With the development of OpenACC 2.0 (and potentially OpenMP 4.0), application developers now have a basic tool needed to write portable parallel code for devices ranging from mobile processors, to PCs and enterprise workstations, and even to the world’s largest leadership class supercomputers. Pragma-based programming has the advantage of letting programmers write software in the language of their choice, such as C, C++ and Fortran, to create libraries and applications that can run on a myriad of current devices, and even as-yet-unanticipated future devices. The parallel portability argument is compelling. The forthcoming release of OpenACC in the ubiquitous GNU compilers means that the global software development community will be able to test and verify how well this solution works in practice.

The risk of trying pragma-based parallel programming is low given that pragmas from competing standards (like OpenMP, OpenACC 2.0 and OpenMP 4.0) can coexist within the same code without side effects. This makes it easy to add those pragmas needed to parallelize a computationally expensive section of code and evaluate the runtime of the parallel region. If the speedup is compelling, then more human resources can be allocated to partitioning the computational problem to minimize data movement and exploit both massively parallel and multicore hardware platforms.
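A low-risk evaluation of this kind might look like the following sketch: annotate one expensive loop, then time it with and without the OpenACC compiler flag enabled. The names square_all, seconds_to_run and speedup are hypothetical helpers, and the timing uses only the standard C++ chrono library rather than any vendor profiling tool.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical kernel (an assumption for illustration): squares each element.
void square_all(std::vector<double>& v) {
    const std::size_t n = v.size();
    double* p = v.data();
    // Ignored by compilers without OpenACC support, so the same source
    // yields both the serial baseline and the parallel candidate.
    #pragma acc parallel loop copy(p[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        p[i] *= p[i];
    }
}

// Runs the callable once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_to_run(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Ratio of two measured runtimes; a value above 1 means the parallel
// build was faster than the serial build.
double speedup(double serial_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}
```

Because the region in this sketch has no async clause, the measured time includes the data transfers, which is the number that matters when deciding whether further effort to reduce data movement is worthwhile.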

Looking ahead, it is possible to wrap an application in a GUI to create a single-source tree commercial, gaming or research application that can run “from mobile to HPC.” While there is not, and probably never will be, a “one size fits all” solution for portable software development spanning all aspects of these markets, at least the basic tools are now available to test the efficacy of single-source tree applications that can be built for all these platforms. For this reason, it is worth taking a look at OpenACC 2.0 (and OpenMP 4.0 when it arrives) plus GUI generators like Qt.

Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at