Primitive Restart Makes GPGPU Tech Sparkle
Exploiting the full computational power of the GPGPU to render high-performance, high-quality graphics

GPGPU technology is dramatically changing what is possible for data visualization, as well as computation. The orders-of-magnitude increases in application performance reported in the recent literature succinctly convey the computational power of GPGPU devices. With such exciting floating-point performance, it is easy to forget that GPGPU technology is an outstanding visualization technology as well.

OpenGL is the most common graphics application programming interface (API) in high performance computing (HPC). It is standards-based, cross-language, and cross-platform, so it can be used to create applications that render 2-D and 3-D images on most visualization hardware. GPGPUs are no exception.

Primitive restart is a new feature added in the OpenGL 3.1 specification. It can greatly accelerate GPGPU applications and create better-looking images, because it allows primitive OpenGL rendering commands to be mixed with data on the device. When primitive restart is used in programs written in OpenCL or for NVIDIA's CUDA architecture (which supports C/C++ and OpenCL), performance-limiting data transfers across the PCIe bus can be avoided. Instead, the data can be generated on the GPGPU and then rendered without any significant amount of data being transferred between the host and the GPGPU.

Figure 1: Using primitive restart to draw two variable length lines
Succinctly, primitive restart allows the user to define a numeric value that acts as a token telling the OpenGL state machine to restart the current rendering instruction, beginning with the next data item. In this way, one OpenGL command can be used to draw multiple lines, each with a variable number of points, as illustrated in Figure 1, where the number 1000 is used as the primitive restart token.
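The effect of the restart token can be emulated on the CPU. The sketch below is not OpenGL code; it only mimics how a single index stream is divided into separate line strips. In OpenGL 3.1 itself, the feature is switched on with glEnable(GL_PRIMITIVE_RESTART) and the token is set with glPrimitiveRestartIndex. The function name split_on_restart and the index values are assumptions made for illustration; only the token value 1000 comes from Figure 1.

```python
def split_on_restart(indices, restart_index):
    """Emulate on the CPU how primitive restart divides one index stream."""
    primitives = []
    current = []
    for idx in indices:
        if idx == restart_index:
            # The token ends the current primitive; the next index starts a new one.
            if current:
                primitives.append(current)
            current = []
        else:
            current.append(idx)
    if current:
        primitives.append(current)
    return primitives


RESTART = 1000  # same token value as in Figure 1

# Hypothetical index stream: a 4-point line followed by a 6-point line.
index_stream = [0, 1, 2, 3, RESTART, 4, 5, 6, 7, 8, 9]
lines = split_on_restart(index_stream, RESTART)
print(lines)
```

A single draw command over index_stream would therefore render two lines of different lengths, which is the situation Figure 1 depicts.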

Figure 2: Two triangle strips
More complicated OpenGL commands can also be used to render surfaces with triangle strips and triangle fans. Examples show rendering of surfaces, including artificial 3-D terrain, at a hundred or more frames per second faster than older “optimized” OpenGL API calls like MultiDraw. The newest GPGPUs, such as those based on the NVIDIA Fermi architecture, can compute complex 3-D surfaces like the image in Figure 4 and can render them six times faster than previous-generation NVIDIA 10-series GPGPUs (over 3,000 frames per second). This implies the new GPGPUs hold marvelous potential to render large scientific data sets in real time.
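As a rough illustration of how a surface can be described with triangle strips and a restart token, the sketch below builds the index buffer for a regular grid of vertices, one strip per row of quads. It is an assumption-laden example rather than the layout used for the figures: the row-major vertex numbering, the function names, and the token value 0xFFFF are all hypothetical.

```python
def grid_strip_indices(rows, cols, restart_index):
    """Index buffer for a rows x cols vertex grid, one triangle strip per row of quads."""
    if restart_index < rows * cols:
        raise ValueError("restart index collides with a real vertex index")
    indices = []
    for r in range(rows - 1):
        if r > 0:
            indices.append(restart_index)  # separates this strip from the previous one
        for c in range(cols):
            indices.append(r * cols + c)        # vertex on the upper row
            indices.append((r + 1) * cols + c)  # vertex on the lower row
    return indices


def strip_triangles(strip):
    """Expand one triangle strip into explicit triangles."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        # Swapping the first two vertices on odd triangles keeps a consistent winding.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


RESTART = 0xFFFF  # hypothetical token, larger than any vertex index used here
indices = grid_strip_indices(3, 3, RESTART)
print(indices)
print(strip_triangles(indices[:6]))
```

For the 3 x 3 grid, the whole surface is two strips of four triangles each, carried in one index buffer of 13 entries.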

There are three general rules to achieving performance on GPGPU hardware for both computation and visualization:

  • Get (and keep) the data on the GPGPU to eliminate the PCIe memory bandwidth bottleneck.
  • Maximize the amount of work performed per call to the GPGPU to eliminate the latency incurred when passing even short commands and small amounts of data to the GPGPU over the PCIe bus.
  • Exploit internal resources on the GPGPU (such as registers and shared memory) to bypass internal memory bottlenecks and maximize performance.
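The second rule is where primitive restart helps directly. The sketch below, which uses hypothetical names and index values, packs several independent primitives into one index stream separated by the restart token, so that they could be submitted with one draw call instead of one call per primitive. It counts calls only and makes no claim about actual timings.

```python
def merge_with_restart(primitives, restart_index):
    """Concatenate several index lists into one stream separated by the restart token."""
    merged = []
    for i, prim in enumerate(primitives):
        if i > 0:
            merged.append(restart_index)
        merged.extend(prim)
    return merged


RESTART = 1000  # hypothetical token value

# Three hypothetical line strips of different lengths.
line_strips = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]

merged = merge_with_restart(line_strips, RESTART)
calls_without_restart = len(line_strips)  # one draw call per strip
calls_with_restart = 1                    # one draw call for the merged stream

print(merged)
print(calls_without_restart, calls_with_restart)
```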

Figure 3: Data rendered as four triangle fans (the center marked with a filled circle)
The numbers tell the story: a modern GPGPU can access global memory, the slowest memory on the GPGPU, at roughly 150 to 200 billion bytes per second (GB/s). In contrast, the latest and fastest PCIe bus (x16, version 2.0) can transfer data at 8 GB/s at best, or roughly 20 to 25 times slower than the GPGPU memory.
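The arithmetic behind that comparison can be checked in a few lines. The bandwidth figures are the ones quoted above; the 1 GB data set size is a hypothetical value chosen only to make the times concrete.

```python
GB = 1e9  # one billion bytes

pcie_bandwidth = 8 * GB          # PCIe x16 version 2.0, best case
global_mem_low = 150 * GB        # lower estimate for GPGPU global memory
global_mem_high = 200 * GB       # upper estimate for GPGPU global memory


def transfer_seconds(num_bytes, bytes_per_second):
    """Ideal time to move num_bytes at the given bandwidth, ignoring latency."""
    return num_bytes / bytes_per_second


data_set = 1 * GB  # hypothetical data set size

ratio_low = global_mem_low / pcie_bandwidth
ratio_high = global_mem_high / pcie_bandwidth

print(ratio_low, ratio_high)
print(transfer_seconds(data_set, pcie_bandwidth))
print(transfer_seconds(data_set, global_mem_high))
```

Under these assumptions, moving the data set across the PCIe bus takes about an eighth of a second per pass, which by itself would cap a host-fed renderer at roughly eight frames per second.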

Rendering performance can be optimized with primitive restart by arranging the data to achieve the highest reuse of the cache in the texture units. In this way, GPGPU global memory can be avoided to further increase rendering speed. Also, higher-quality images can be created by alternating the direction of tessellation, as noted in the primitive restart specification and illustrated in Figures 2 and 3.

Figure 4: Example surface
GPGPU computation combined with OpenGL primitive restart makes visualizations of big data interactive, as even big data can be fluidly rendered. Surprisingly, higher performance and higher-quality images can also be produced. Because the full computational power of the GPGPU can be exploited to generate the data on the device, PCIe bottlenecks and latencies are avoided, so high-performance, high-quality graphics can be rendered even when the images require irregular meshes or computationally expensive data generation.

1. OpenGL primitive restart documentation
2. Dr. Dobb’s Journal
3. NVIDIA Fermi whitepaper

Rob Farber is a senior research scientist at Pacific Northwest National Laboratory. He may be reached at