Sep 10, 2019

One day, while browsing R's source code …

Why is this surprising?

  • Isn't R supposed to be single-threaded?
  • If the R Core team uses OpenMP, then OpenMP must be supported on all platforms that R supports (the multicore functions of the parallel package, on the other hand, only run on Unix-like systems)
  • As someone who has never looked into OpenMP: doesn't this look super easy?!

What is OpenMP?

  • "A specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs."
  • Developed by the OpenMP Architecture Review Board
  • First release for Fortran in 1997, for C/C++ in 1998
  • https://www.openmp.org/

Example: Summing up a binary matrix of doubles

Tricky to benchmark: repeated runs hit the OS page cache, so they are faster than cold runs.

Approach:

  1. Generate random matrix
  2. Load matrix into memory (via mmap)
  3. Compute sum of matrix
  4. Print sum

All benchmarks were performed using gcc 9.1.0 on an ancient dual-core (four-thread) Intel Core i5-3427U processor, with the matrix file on an SSD.

Generate random matrix

Load matrix into memory

// Map matrix
struct file_mapping matrix;
if (map_file(path, &matrix) == -1) {
    fprintf(stderr, "Could not map file\n");
    exit(EXIT_FAILURE);
}

// Compute sum of entire array
[...]

// Unmap file
if (unmap_file(&matrix) == -1) {
    fprintf(stderr, "Could not unmap file\n");
    exit(EXIT_FAILURE);
}

Code: file_mapping.h
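The original header is not reproduced in the talk; a plausible mmap-based sketch, matching the `struct file_mapping`, `map_file`, and `unmap_file` names used above (field names are assumptions), might look like this:

```c
#ifndef FILE_MAPPING_H
#define FILE_MAPPING_H

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct file_mapping {
    double *data;      // start of the mapped doubles
    ptrdiff_t length;  // number of doubles in the file
    size_t size;       // mapping size in bytes
};

// Map the whole file read-only; returns 0 on success, -1 on error.
static int map_file(const char *path, struct file_mapping *m) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after closing the descriptor
    if (p == MAP_FAILED) return -1;
    m->data = p;
    m->size = (size_t)st.st_size;
    m->length = (ptrdiff_t)(m->size / sizeof(double));
    return 0;
}

// Release the mapping; returns 0 on success, -1 on error.
static int unmap_file(struct file_mapping *m) {
    return munmap(m->data, m->size);
}

#endif
```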

Sequential

double sum = 0;
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_sequential.c

Benchmark (with -O3):

$ hyperfine './sum_sequential ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_sequential ~/matrix_250000_1000.bin
  Time (mean ± σ):     447.3 ms ±   1.2 ms    [User: 337.6 ms, System: 107.9 ms]
  Range (min … max):   446.3 ms … 449.4 ms    10 runs

omp parallel for

double sum = 0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_parallel_for.c

Benchmark (with -O3):

$ hyperfine './sum_omp_parallel_for ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_parallel_for ~/matrix_250000_1000.bin
  Time (mean ± σ):     196.2 ms ±   1.5 ms    [User: 480.9 ms, System: 177.5 ms]
  Range (min … max):   193.8 ms … 199.8 ms    15 runs

omp simd

double sum = 0;
#pragma omp simd reduction(+:sum)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_simd.c

Benchmark (with -O3):

$ hyperfine './sum_omp_simd ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_simd ~/matrix_250000_1000.bin
  Time (mean ± σ):     310.3 ms ±   1.2 ms    [User: 197.8 ms, System: 111.0 ms]
  Range (min … max):   308.6 ms … 312.1 ms    10 runs

omp parallel for simd

double sum = 0;
#pragma omp parallel for simd reduction(+:sum)
for (ptrdiff_t i = 0; i < matrix.length; i++) {
    sum += matrix.data[i];
}

Code: sum_omp_parallel_for_simd.c

Benchmark (with -O3):

$ hyperfine './sum_omp_parallel_for_simd ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_omp_parallel_for_simd ~/matrix_250000_1000.bin
  Time (mean ± σ):     183.9 ms ±   0.9 ms    [User: 438.6 ms, System: 172.4 ms]
  Range (min … max):   182.3 ms … 186.0 ms    16 runs

Using POSIX threads (not using SIMD)

Just for fun.

Code: sum_pthread_collapsed_master.c
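The pthread file itself is not shown; a minimal hand-rolled version of the same idea (split the array into chunks, sum each chunk on its own thread, combine the partial sums in the main thread) might look like this. The names and the fixed thread count are illustrative, not the original file's:

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

struct job {
    const double *data;
    ptrdiff_t begin, end;
    double partial; // this thread's partial sum
};

// Thread body: sum the half-open range [begin, end).
static void *sum_range(void *arg) {
    struct job *j = arg;
    double s = 0;
    for (ptrdiff_t i = j->begin; i < j->end; i++)
        s += j->data[i];
    j->partial = s;
    return NULL;
}

// Split data into NTHREADS contiguous chunks and sum them in parallel.
static double parallel_sum(const double *data, ptrdiff_t n) {
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    ptrdiff_t chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        ptrdiff_t begin = t * chunk, end = (t + 1) * chunk;
        jobs[t].data = data;
        jobs[t].begin = begin < n ? begin : n;
        jobs[t].end = end < n ? end : n;
        pthread_create(&tid[t], NULL, sum_range, &jobs[t]);
    }
    double sum = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        sum += jobs[t].partial; // combine partial sums sequentially
    }
    return sum;
}
```

Compile with `-pthread`. This is roughly what `#pragma omp parallel for reduction(+:sum)` generates for you; the OpenMP version is one line instead of forty.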

Benchmark (with -O3):

$ hyperfine './sum_pthread_collapsed_master ~/matrix_250000_1000.bin'
Benchmark #1: ./sum_pthread_collapsed_master ~/matrix_250000_1000.bin
  Time (mean ± σ):     194.3 ms ±   1.6 ms    [User: 475.3 ms, System: 163.4 ms]
  Range (min … max):   192.6 ms … 197.7 ms    15 runs

Summary

Task                    Time (ms)
sequential                  447.3
omp parallel for            196.2
omp simd                    310.3
omp parallel for simd       183.9
pthreads                    194.3

How to use OpenMP in R?
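For an R package, the portable way (documented in Writing R Extensions) is to add the OpenMP flags in src/Makevars:

    PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
    PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)

and then use the pragma in C code called via `.Call()`. A hedged sketch (the function name `omp_sum` is illustrative), guarding the include since OpenMP may be unavailable on some platforms:

```c
#include <R.h>
#include <Rinternals.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// .Call entry point: sum a numeric vector in parallel.
// If OpenMP is not available, the pragma is ignored and the
// loop runs sequentially.
SEXP omp_sum(SEXP x) {
    double *px = REAL(x);
    R_xlen_t n = XLENGTH(x);
    double s = 0;
    #pragma omp parallel for reduction(+:s)
    for (R_xlen_t i = 0; i < n; i++)
        s += px[i];
    return ScalarReal(s);
}
```

From R: `.Call("omp_sum", as.double(x))` after loading the compiled package.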

Thanks