Guide into OpenMP: Easy multithreading programming for C++

By Joel Yliluoma, September 2007; last update in June 2016 for OpenMP 4.5

Abstract

This document attempts to give a quick introduction to OpenMP (as of version 4.5), a simple C/C++/Fortran compiler extension that makes it possible to add parallelism to existing source code without significantly rewriting it.

In this document, we concentrate on the C++ language in particular, and use GCC to compile the examples.


Preface: Importance of multithreading

As CPU speeds no longer improve as significantly as they did before, multicore systems are becoming more popular.

To harness that power, it is becoming important for programmers to be knowledgeable in parallel programming — making a program execute multiple things simultaneously.

This document attempts to give a quick introduction to OpenMP, a simple C/C++/Fortran compiler extension that makes it possible to add parallelism to existing source code without having to rewrite it entirely.

Support in different compilers

GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1, OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0 since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline option -fopenmp to enable it. OpenMP offloading has been supported since version 5.1, initially for Intel MIC targets only (Intel Xeon Phi KNL + emulation), and for NVidia (NVPTX) targets since approximately version 7.
Clang++ supports OpenMP 4.5 since version 3.9 (without offloading), OpenMP 4.0 since version 3.8 (for some parts), and OpenMP 3.1 since version 3.7. Add the commandline option -fopenmp to enable it.
Solaris Studio supports OpenMP 4.0 since version 12.4, and OpenMP 3.1 since version 12.3. Add the commandline option -xopenmp to enable it.
Intel C Compiler (icc) supports OpenMP 4.5 since version 17.0, OpenMP 4.0 since version 15.0, OpenMP 3.1 since version 12.1, OpenMP 3.0 since version 11.0, and OpenMP 2.5 since version 10.1. Add the commandline option -openmp to enable it. Add the -openmp-stubs option instead to enable the library without actual parallel execution.
Microsoft Visual C++ (cl) supports OpenMP 2.0 since version 2005. Add the commandline option /openmp to enable it.

Note: If your GCC complains that "-fopenmp" is valid for D but not for C++ when you try to use it, or does not recognize the option at all, your GCC version is too old. If your linker complains about missing GOMP functions, you forgot to specify "-fopenmp" when linking.

More information: http://openmp.org/wp/openmp-compilers/

Introduction to OpenMP in C++

OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.

Here are a few simple example programs demonstrating OpenMP.

You can compile them like this:

  g++ tmp.cpp -fopenmp

Example: Initializing a table in parallel (multiple threads)

This code divides the table initialization into multiple threads, which are run simultaneously. Each thread initializes a portion of the table.

  #include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp parallel for
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);
  
    // the table is now initialized
  }

Example: Initializing a table in parallel (single thread, SIMD)

This version requires compiler support for at least OpenMP 4.0, and the use of a parallel floating point library such as AMD ACML or Intel SVML (which can be used in GCC with e.g. -mveclibabi=svml).

  #include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp simd
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);
  
    // the table is now initialized
  }

Example: Initializing a table in parallel (multiple threads on another device)

OpenMP 4.0 added support for offloading code to different devices, such as a GPU. Therefore there can be three layers of parallelism in a single program: a single thread processing multiple data; multiple threads running simultaneously; and multiple devices running the same program simultaneously.

  #include <cmath>
  int main()
  {
    const int size = 256;
    double sinTable[size];
    
    #pragma omp target teams distribute parallel for map(from:sinTable[0:256])
    for(int n=0; n<size; ++n)
      sinTable[n] = std::sin(2 * M_PI * n / size);

    // the table is now initialized
  }

Example: Calculating the Mandelbrot fractal in parallel (host computer)

This program calculates the classic Mandelbrot fractal at a low resolution and renders it with ASCII characters, calculating multiple pixels in parallel.

 #include <complex>
 #include <cstdio>
 
 typedef std::complex<double> complex;
 
 int MandelbrotCalculate(complex c, int maxiter)
 {
     // iterates z = z*z + c until |z| >= 2 or maxiter is reached,
     // returns the number of iterations.
     complex z = c;
     int n=0;
     for(; n<maxiter; ++n)
     {
         if( std::abs(z) >= 2.0) break;
         z = z*z + c;
     }
     return n;
 }
 int main()
 {
     const int width = 78, height = 44, num_pixels = width*height;
     
     const complex center(-.7, 0), span(2.7, -(4/3.0)*2.7*height/width);
     const complex begin = center-span/2.0;//, end = center+span/2.0;
     const int maxiter = 100000;
   
   #pragma omp parallel for ordered schedule(dynamic)
     for(int pix=0; pix<num_pixels; ++pix)
     {
         const int x = pix%width, y = pix/width;
         
         complex c = begin + complex(x * span.real() / (width +1.0),
                                     y * span.imag() / (height+1.0));
         
         int n = MandelbrotCalculate(c, maxiter);
         if(n ==