Sep 10, 2019

One day, while browsing R's source code …

Why is this surprising?

  • Isn't R supposed to be single-threaded?
  • If the R Core team uses OpenMP, then OpenMP must be supported on all platforms that R supports (the multicore functions of the parallel package, on the other hand, only run on Unix-like systems)
  • As someone who has never looked into OpenMP: doesn't this look super easy?!

What is OpenMP?

  • "A specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs."
  • Developed by the OpenMP Architecture Review Board
  • First release for Fortran in 1997, for C/C++ in 1998
  • https://www.openmp.org/

Example: Summing up a binary matrix of doubles

Tricky to benchmark: repeated runs hit the OS page cache, so they are faster than cold runs.

Approach:

  1. Generate random matrix
  2. Load matrix into memory (via mmap)
  3. Compute sum of matrix
  4. Print sum

All benchmarks were performed using gcc 9.1.0 on an ancient dual-core (four-thread) Intel Core i5-3427U processor, with the matrix file on an SSD.

Generate random matrix

Load matrix into memory

// Map matrix
struct file_mapping matrix;
if (map_file(path, &matrix) == -1) {
    fprintf(stderr, "Could not map file\n");
    exit(EXIT_FAILURE);
}

// Compute sum of entire array
[...]

// Unmap file
if (unmap_file(&matrix) == -1) {
    fprintf(stderr, "Could not unmap file\n");
    exit(EXIT_FAILURE);
}

Code: file_mapping.h
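The original header is not reproduced in the talk; a plausible mmap-based sketch, matching the `struct file_mapping`, `map_file`, and `unmap_file` names used above (field names are assumptions), might look like this:

```c
#ifndef FILE_MAPPING_H
#define FILE_MAPPING_H

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct file_mapping {
    double *data;      // start of the mapped doubles
    ptrdiff_t length;  // number of doubles in the file
    size_t size;       // mapping size in bytes
};

// Map the whole file read-only; returns 0 on success, -1 on error.
static int map_file(const char *path, struct file_mapping *m) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor
    if (p == MAP_FAILED) return -1;
    m->data = p;
    m->size = (size_t)st.st_size;
    m->length = (ptrdiff_t)(m->size / sizeof(double));
    return 0;
}

// Release the mapping; returns 0 on success, -1 on error.
static int unmap_file(struct file_mapping *m) {
    return munmap(m->data, m->size);
}

#endif
```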

Sequential

double sum = 0;
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_sequential.c

Benchmark (with -O3):

$ hyperfine './sum_sequential ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_sequential ~/matrix_250000_1000.bin
  Time (mean ± σ):     447.3 ms ±   1.2 ms    [User: 337.6 ms, System: 107.9 ms]
  Range (min … max):   446.3 ms … 449.4 ms    10 runs

omp parallel for

double sum = 0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_parallel_for.c

Benchmark (with -O3):

$ hyperfine './sum_omp_parallel_for ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_parallel_for ~/matrix_250000_1000.bin
  Time (mean ± σ):     196.2 ms ±   1.5 ms    [User: 480.9 ms, System: 177.5 ms]
  Range (min … max):   193.8 ms … 199.8 ms    15 runs

omp simd

double sum = 0;
#pragma omp simd reduction(+:sum)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_simd.c

Benchmark (with -O3):

$ hyperfine './sum_omp_simd ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_simd ~/matrix_250000_1000.bin
  Time (mean ± σ):     310.3 ms ±   1.2 ms    [User: 197.8 ms, System: 111.0 ms]
  Range (min … max):   308.6 ms … 312.1 ms    10 runs

omp parallel for simd

double sum = 0;
#pragma omp parallel for simd reduction(+:sum)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_parallel_for_simd.c

Benchmark (with -O3):

$ hyperfine './sum_omp_parallel_for_simd ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_parallel_for_simd ~/matrix_250000_1000.bin
  Time (mean ± σ):     183.9 ms ±   0.9 ms    [User: 438.6 ms, System: 172.4 ms]
  Range (min … max):   182.3 ms … 186.0 ms    16 runs

Using POSIX threads (not using SIMD)

Just for fun.

Code: sum_pthread_collapsed_master.c
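The pthread file itself is not shown; a minimal hand-rolled version of the same idea (split the array into chunks, sum each chunk on its own thread, combine the partial sums in the main thread) might look like this. The names and the fixed thread count are illustrative, not the original file's:

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

struct job {
    const double *data;
    ptrdiff_t begin, end;
    double partial; // this thread's partial sum
};

// Thread body: sum the half-open range [begin, end).
static void *sum_range(void *arg) {
    struct job *j = arg;
    double s = 0;
    for (ptrdiff_t i = j->begin; i < j->end; i++)
        s += j->data[i];
    j->partial = s;
    return NULL;
}

// Split data into NTHREADS contiguous chunks and sum them in parallel.
static double parallel_sum(const double *data, ptrdiff_t n) {
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    ptrdiff_t chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        ptrdiff_t begin = t * chunk, end = (t + 1) * chunk;
        jobs[t].data = data;
        jobs[t].begin = begin < n ? begin : n;
        jobs[t].end = end < n ? end : n;
        pthread_create(&tid[t], NULL, sum_range, &jobs[t]);
    }
    double sum = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        sum += jobs[t].partial; // combine partial sums sequentially
    }
    return sum;
}
```

Compile with `-pthread`. This is roughly what `#pragma omp parallel for reduction(+:sum)` generates for you; the OpenMP version is one line instead of forty.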

Benchmark (with -O3):

$ hyperfine './sum_pthread_collapsed_master ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_pthread_collapsed_master ~/matrix_250000_1000.bin
  Time (mean ± σ):     194.3 ms ±   1.6 ms    [User: 475.3 ms, System: 163.4 ms]
  Range (min … max):   192.6 ms … 197.7 ms    15 runs

Summary

Task                    Time (ms)
sequential                  447.3
omp parallel for            196.2
omp simd                    310.3
omp parallel for simd       183.9
pthreads                    194.3

How to use OpenMP in R?
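For an R package, the portable way (documented in Writing R Extensions) is to add the OpenMP flags in src/Makevars:

    PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
    PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)

and then use the pragma in C code called via `.Call()`. A hedged sketch (the function name `omp_sum` is illustrative), guarding the include since OpenMP may be unavailable on some platforms:

```c
#include <R.h>
#include <Rinternals.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// .Call entry point: sum a numeric vector in parallel.
// If OpenMP is not available, the pragma is ignored and the
// loop runs sequentially.
SEXP omp_sum(SEXP x) {
    double *px = REAL(x);
    R_xlen_t n = XLENGTH(x);
    double s = 0;
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += px[i];
    return ScalarReal(s);
}
```

From R: `.Call("omp_sum", as.double(x))` after loading the compiled package.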

Thanks