Apr 9, 2021

GPGPU (General-Purpose Computing on GPUs)

  • What is a GPU?
    • Graphics Processing Unit
    • Designed for 3D computer graphics
      • These computations mostly involve vector and matrix operations
    • Scientists initially found cumbersome ways to exploit this for general computation (via graphics APIs such as OpenGL and DirectX)
      • Dedicated GPGPU platforms followed: CUDA (2006), OpenCL (2009), OpenACC (2012), …

CUDA

“CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.”

https://developer.nvidia.com/cuda-zone

Why look into this?

  • The HPCC currently features 562 GPUs
    • 39x2 NVIDIA Tesla K20 on intel14-k20 cluster
    • 48x8 NVIDIA Tesla K80 on intel16-k80 cluster
    • 8x8 NVIDIA Tesla V100 on intel18-v100 cluster
    • 9x4 NVIDIA Tesla V100S on amd20-v100 cluster

NVIDIA Tesla V100

  • 80 Streaming Multiprocessors (SMs)
  • 64 FP32 Cores / SM = 5120 FP32 Cores / GPU
  • 64 INT32 Cores / SM = 5120 INT32 Cores / GPU
  • 32 FP64 Cores / SM = 2560 FP64 Cores / GPU
  • 32 GB of dedicated memory
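
These figures can be confirmed at runtime through the CUDA Runtime API; a minimal sketch (assuming a single GPU visible as device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    // e.g., "Tesla V100-PCIE-32GB: 80 SMs, 32 GiB"
    printf("%s: %d SMs, %zu GiB\n", prop.name,
           prop.multiProcessorCount, prop.totalGlobalMem >> 30);
    return 0;
}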

(S/D)AXPY

A simple example to explore the CUDA programming model and R integration:

  • Level 1 BLAS routine
  • \(\boldsymbol{y} \leftarrow a * \boldsymbol{x} + \boldsymbol{y}\)
  • Single (S) or Double (D) precision

Modifications for R:

  • Use of DAXPY instead of SAXPY because R does not support single-precision floating point (numeric vectors are always doubles)
  • Return a copy of y because R’s copy-on-modify semantics mean functions should not modify their arguments in place

Simple CPU implementation:

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
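
A minimal usage sketch of the CPU version (hypothetical values, not from the original):

#include <cstdio>

void daxpy(int n, double a, double *x, double *y);

int main() {
    double x[] = {1.0, 2.0, 3.0};
    double y[] = {4.0, 5.0, 6.0};
    daxpy(3, 2.0, x, y);                      // y becomes {6, 9, 12}
    printf("%g %g %g\n", y[0], y[1], y[2]);   // prints: 6 9 12
    return 0;
}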

1. The device code (kernel):

__global__ void cuda_daxpy_kernel(int n, double a, double *d_x, double *d_y) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        d_y[i] = a * d_x[i] + d_y[i];
    }
}
  • The for loop is gone: the kernel body runs n times, once per thread
  • Threads are grouped into blocks; together the blocks form the grid
  • blockDim, blockIdx, and threadIdx are built-in variables set by the runtime at launch (the “magic”); see the toy launch below
  • The thread index can exceed n because the grid is rounded up to whole blocks, hence the i < n guard
  • The function is marked as __global__ and has a return type of void
  • d_ prefix means “data on device”
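
To make the index arithmetic concrete, a toy launch (a sketch; show_index is a hypothetical kernel, not part of the original):

#include <cstdio>

__global__ void show_index(void) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    printf("block %d, thread %d -> i = %d\n", (int)blockIdx.x, (int)threadIdx.x, i);
}

int main() {
    show_index<<<2, 4>>>();   // 2 blocks of 4 threads: i covers 0..7
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}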

2. The host code (error checking omitted):

void cuda_daxpy(int n, double a, double *h_x, double *h_y, double *h_out) {

    // Allocate device memory
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof (double));
    cudaMalloc(&d_y, n * sizeof (double));

    // Copy data from host to device
    cudaMemcpy(d_x, h_x, n * sizeof (double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof (double), cudaMemcpyHostToDevice);

    // Run device function (kernel)
    int threadsPerBlock = 256;
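    // Integer ceiling division: e.g., n = 1000 gives (1000 + 255) / 256 = 4 blocks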
    int blocksPerGrid = (n + (threadsPerBlock - 1)) / threadsPerBlock;
    cuda_daxpy_kernel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);

    // Copy data from device to host
    cudaMemcpy(h_out, d_y, n * sizeof (double), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_x);
    cudaFree(d_y);

}

  • CUDA Runtime API
    • cudaMalloc to allocate memory on device
    • cudaMemcpy to copy from host to device or device to host
    • cudaFree to free up allocated memory on device
  • <<<blocksPerGrid, threadsPerBlock>>> is special syntax for launching a kernel (i.e., a special compiler, nvcc, is needed)
  • h_ prefix means “data on host”
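
The omitted error checking usually follows a pattern like this sketch (CUDA_CHECK is a hypothetical macro name; the idiom itself is conventional):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_x, n * sizeof (double)));
// Kernel launches return no status; check them afterwards with:
// CUDA_CHECK(cudaGetLastError());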

3. The R entry point:

#include <Rinternals.h>

extern "C" SEXP C_daxpy(SEXP a, SEXP x, SEXP y) {

    // Allocate memory for result
    R_xlen_t n = Rf_xlength(y);
    SEXP out = PROTECT(Rf_allocVector(REALSXP, n));

    // Call host function code
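    // Note: n is narrowed from R_xlen_t to int here, so n must not exceed INT_MAX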
    cuda_daxpy(n, Rf_asReal(a), REAL(x), REAL(y), REAL(out));

    // Return result
    UNPROTECT(1); // out

    return out;

}
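
Optionally, the entry point can be registered when the shared object is loaded; a sketch appended to the same file (R_init_daxpy matches loading the file as daxpy.so):

#include <R_ext/Rdynload.h>

static const R_CallMethodDef callMethods[] = {
    {"C_daxpy", (DL_FUNC) &C_daxpy, 3},  // name, function pointer, argument count
    {NULL, NULL, 0}
};

extern "C" void R_init_daxpy(DllInfo *info) {
    R_registerRoutines(info, NULL, callMethods, NULL, NULL);
}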

4. Using CUDA on the HPCC:

  • Either be on a GPU development node (dev-intel14-k20, dev-intel16-k80, dev-amd20-v100) or request a GPU using the --gres=gpu:[type:]1 option, where type is optional and can be k20, k80, or v100 (see the example below)
  • CUDA module is already loaded: nvcc, nsys, nsight-sys, nvidia-smi, …
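
For example, an interactive session with one V100 could be requested as follows (a hypothetical request; adjust the time limit as needed):

salloc --gres=gpu:v100:1 --time=01:00:00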

5. Compile code:

nvcc -shared -Xcompiler -fPIC \
     -I/mnt/research/quantgen/tools/software/R/4.0.3/lib64/R/include \
     -arch=compute_35 \
     -o daxpy.so daxpy.cu
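
Note that -arch=compute_35 matches the K20's compute capability (3.5); for the V100 nodes, -arch=compute_70 would match the hardware.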

6. Run code:

dyn.load("daxpy.so")
a <- 2.0
x <- rnorm(1e8, 10)
y <- rnorm(1e8)
res <- .Call("C_daxpy", a, x, y)
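# Sanity check against the same computation in R (an added check, not in the original):
stopifnot(isTRUE(all.equal(res, a * x + y)))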

7. Profile code:

nsys profile --stats=true Rscript daxpy.R
# open the generated report in the Nsight Systems GUI: nsight-sys

Beyond DAXPY — GWAS

  • Same implementation as BGData rayOLS

  • Simulated height

  • Chromosome 22 of UK Biobank (409,629 x 9,927)

  • Time in milliseconds

       expr        min         lq       mean     median         uq       max neval
        CPU 39926.4360 40038.1847 40174.8242 40133.1558 40249.4981 40663.422    25
        GPU   334.4782   343.9245   538.7745   345.6474   346.3567  5188.819    25

Challenges:

  • Main memory of GPU limited
    • Chunking may be required, limiting parallel throughput
  • BED files require large data transfers
    • Filtering on host is expensive

Conclusions

  • Programming model is limited but especially suited to embarrassingly parallel problems
    • map (1:1), scatter (1:N), gather (N:1), reduce (all-to-one), scan (all-to-all), …
  • Difficulties with R
    • No access to R functions in device code (kernel)
    • CUDA is not aware of R’s missing values (see the NA sketch below)
  • Difficult to tune and debug, but great profiling support
  • CUDA only works on NVIDIA GPUs
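
On the missing-values point: R encodes NA_real_ as a NaN with a specific payload (low word 1954), which ordinary arithmetic collapses into a plain NaN. A device-side detection sketch (is_r_na is a hypothetical helper, assuming IEEE-754 doubles on both host and device):

__device__ bool is_r_na(double x) {
    // R's NA_real_ is a quiet NaN whose low 32 bits equal 1954
    unsigned long long bits = __double_as_longlong(x);
    return isnan(x) && (unsigned int)(bits & 0xffffffffULL) == 1954u;
}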

Thanks