Apr 9, 2021

GPGPU (General-Purpose Computing on GPUs)

  • What is a GPU?
    • Graphics Processing Unit
    • Designed for 3D computer graphics
      • These computations mostly involve vector and matrix operations
    • Scientists initially found cumbersome ways to exploit this for general computation (via graphics APIs such as OpenGL and DirectX)
      • Dedicated GPGPU platforms followed: CUDA (2006), OpenCL (2009), OpenACC (2012), …

CUDA

“CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.”

https://developer.nvidia.com/cuda-zone

Why look into this?

  • The HPCC currently features 562 GPUs
    • 39x2 NVIDIA Tesla K20 on intel14-k20 cluster
    • 48x8 NVIDIA Tesla K80 on intel16-k80 cluster
    • 8x8 NVIDIA Tesla V100 on intel18-v100 cluster
    • 9x4 NVIDIA Tesla V100S on amd20-v100 cluster

NVIDIA Tesla V100

  • 80 Streaming Multiprocessors (SMs)
  • 64 FP32 Cores / SM = 5120 FP32 Cores / GPU
  • 64 INT32 Cores / SM = 5120 INT32 Cores / GPU
  • 32 FP64 Cores / SM = 2560 FP64 Cores / GPU
  • 32 GB of dedicated memory
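
These figures can be confirmed at runtime through the CUDA Runtime API; a minimal sketch (assuming a single GPU visible as device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    // e.g., "Tesla V100-PCIE-32GB: 80 SMs, 32 GiB"
    printf("%s: %d SMs, %zu GiB\n", prop.name,
           prop.multiProcessorCount, prop.totalGlobalMem >> 30);
    return 0;
}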

(S/D)AXPY

A simple example to explore the CUDA programming model and R integration:

  • Level 1 BLAS routine
  • \(\boldsymbol{y} \leftarrow a * \boldsymbol{x} + \boldsymbol{y}\)
  • Single (S) or Double (D) precision

Modifications for R:

  • Use of DAXPY instead of SAXPY because R does not support single-precision floating point (numeric vectors are always doubles)
  • Return a copy of y because R’s copy-on-modify semantics mean functions should not modify their arguments in place

Simple CPU implementation:

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
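
A minimal usage sketch of the CPU version (hypothetical values, not from the original):

#include <cstdio>

void daxpy(int n, double a, double *x, double *y);

int main() {
    double x[] = {1.0, 2.0, 3.0};
    double y[] = {4.0, 5.0, 6.0};
    daxpy(3, 2.0, x, y);                      // y becomes {6, 9, 12}
    printf("%g %g %g\n", y[0], y[1], y[2]);   // prints: 6 9 12
    return 0;
}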

1. The device code (kernel):

__global__ void cuda_daxpy_kernel(int n, double a, double *d_x, double *d_y) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        d_y[i] = a * d_x[i] + d_y[i];
    }
}
  • The for loop is gone: the kernel body runs n times, once per thread
  • Threads are grouped into blocks; together the blocks form the grid
  • blockDim, blockIdx, and threadIdx are built-in variables set by the runtime at launch (the “magic”); see the toy launch below
  • The thread index can exceed n because the grid is rounded up to whole blocks, hence the i < n guard
  • The function is marked as __global__ and has a return type of void
  • d_ prefix means “data on device”
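
To make the index arithmetic concrete, a toy launch (a sketch; show_index is a hypothetical kernel, not part of the original):

#include <cstdio>

__global__ void show_index(void) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    printf("block %d, thread %d -> i = %d\n", (int)blockIdx.x, (int)threadIdx.x, i);
}

int main() {
    show_index<<<2, 4>>>();   // 2 blocks of 4 threads: i covers 0..7
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}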

2. The host code (error checking omitted):

void cuda_daxpy(int n, double a, double *h_x, double *h_y, double *h_out) {

    // Allocate device memory
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof (double));
    cudaMalloc(&d_y, n * sizeof (double));

    // Copy data from host to device
    cudaMemcpy(d_x, h_x, n * sizeof (double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof (double), cudaMemcpyHostToDevice);

    // Run device function (kernel)
    int threadsPerBlock = 256;
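    // Integer ceiling division: e.g., n = 1000 gives (1000 + 255) / 256 = 4 blocks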
    int blocksPerGrid = (n + (threadsPerBlock - 1)) / threadsPerBlock;
    cuda_daxpy_kernel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);

    // Copy data from device to host
    cudaMemcpy(h_out, d_y, n * sizeof (double), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_x);
    cudaFree(d_y);

}

  • CUDA Runtime API
    • cudaMalloc to allocate memory on device
    • cudaMemcpy to copy from host to device or device to host
    • cudaFree to free up allocated memory on device
  • <<<blocksPerGrid, threadsPerBlock>>> is special syntax for launching a kernel (i.e., a special compiler, nvcc, is needed)
  • h_ prefix means “data on host”
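
The omitted error checking usually follows a pattern like this sketch (CUDA_CHECK is a hypothetical macro name; the idiom itself is conventional):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_x, n * sizeof (double)));
// Kernel launches return no status; check them afterwards with:
// CUDA_CHECK(cudaGetLastError());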

3. The R entry point:

#include <Rinternals.h>

extern "C" SEXP C_daxpy(SEXP a, SEXP x, SEXP y) {

    // Allocate memory for result
    R_xlen_t n = Rf_xlength(y);
    SEXP out = PROTECT(Rf_allocVector(REALSXP, n));

    // Call host function code
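    // Note: n is narrowed from R_xlen_t to int here, so n must not exceed INT_MAX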
    cuda_daxpy(n, Rf_asReal(a), REAL(x), REAL(y), REAL(out));

    // Return result
    UNPROTECT(1); // out

    return out;

}
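
Optionally, the entry point can be registered when the shared object is loaded; a sketch appended to the same file (R_init_daxpy matches loading the file as daxpy.so):

#include <R_ext/Rdynload.h>

static const R_CallMethodDef callMethods[] = {
    {"C_daxpy", (DL_FUNC) &C_daxpy, 3},  // name, function pointer, argument count
    {NULL, NULL, 0}
};

extern "C" void R_init_daxpy(DllInfo *info) {
    R_registerRoutines(info, NULL, callMethods, NULL, NULL);
}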

4. Using CUDA on the HPCC:

  • Either be on a GPU development node (dev-intel14-k20, dev-intel16-k80, dev-amd20-v100) or request a GPU using the --gres=gpu:[type:]1 option, where type is optional and can be k20, k80, or v100 (see the example below)
  • CUDA module is already loaded: nvcc, nsys, nsight-sys, nvidia-smi, …
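
For example, an interactive session with one V100 could be requested as follows (a hypothetical request; adjust the time limit as needed):

salloc --gres=gpu:v100:1 --time=01:00:00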

5. Compile code:

nvcc -shared -Xcompiler -fPIC \
     -I/mnt/research/quantgen/tools/software/R/4.0.3/lib64/R/include \
     -arch=compute_35 \
     -o daxpy.so daxpy.cu
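
Note that -arch=compute_35 matches the K20's compute capability (3.5); for the V100 nodes, -arch=compute_70 would match the hardware.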

6. Run code:

dyn.load("daxpy.so")
a <- 2.0
x <- rnorm(1e8, 10)
y <- rnorm(1e8)
res <- .Call("C_daxpy", a, x, y)
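# Sanity check against the same computation in R (an added check, not in the original):
stopifnot(isTRUE(all.equal(res, a * x + y)))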

7. Profile code:

nsys profile --stats=true Rscript daxpy.R
# open the generated report in the Nsight Systems GUI: nsight-sys

Beyond DAXPY — GWAS

  • Same implementation as BGData rayOLS

  • Simulated height

  • Chromosome 22 of UK Biobank (409,629 x 9,927)

  • Time in milliseconds

       expr        min         lq       mean     median         uq       max neval
        CPU 39926.4360 40038.1847 40174.8242 40133.1558 40249.4981 40663.422    25
        GPU   334.4782   343.9245   538.7745   345.6474   346.3567  5188.819    25

Challenges:

  • Main memory of GPU limited
    • Chunking may be required, limiting parallel throughput
  • BED files require large data transfers
    • Filtering on host is expensive

Conclusions

  • Programming model is limited but especially suited to embarrassingly parallel problems
    • map (1:1), scatter (1:N), gather (N:1), reduce (all-to-one), scan (all-to-all), …
  • Difficulties with R
    • No access to R functions in device code (kernel)
    • CUDA is not aware of R’s missing values (see the NA sketch below)
  • Difficult to tune and debug, but great profiling support
  • CUDA only works on NVIDIA GPUs
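
On the missing-values point: R encodes NA_real_ as a NaN with a specific payload (low word 1954), which ordinary arithmetic collapses into a plain NaN. A device-side detection sketch (is_r_na is a hypothetical helper, assuming IEEE-754 doubles on both host and device):

__device__ bool is_r_na(double x) {
    // R's NA_real_ is a quiet NaN whose low 32 bits equal 1954
    unsigned long long bits = __double_as_longlong(x);
    return isnan(x) && (unsigned int)(bits & 0xffffffffULL) == 1954u;
}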

Thanks