Sep 10, 2019

Problem

Centering and scaling in R is slow and expensive.

Let's see some numbers on an intel18 HPCC compute node …

Centering

Generate a dummy matrix

n <- 100000
p <- 500
rand <- function(n, p) {
    matrix(data = rnorm(n * p, mean = 100, sd = 10), nrow = n, ncol = p)
}
X <- rand(n, p)

Generate random matrix of doubles with 400 MB in size.

One thousand and one ways to center in R

  • c_loop: for (j in 1:ncol(X)) X[, j] <- X[, j] - mean[X[, j] (in-place)
  • c_loop_precomputed: means <- colMeans(X); for (j in 1:ncol(X)) X[, j] <- X[, j] - means[j] (in-place)
  • c_apply: apply(X, 2, function(x) x - mean(x))
  • c_scale: scale(X, center = TRUE, scale = FALSE)
  • c_sweep: sweep(X, 2, colMeans(X), check.margin = FALSE)
  • c_linalg: X - rep(1, nrow(X)) %*% t(colMeans(X)) (Credits)

Benchmarks

library(microbenchmark)
microbenchmark(c_loop(X), c_loop_precomputed(X), c_apply(X), c_scale(X),
               c_sweep(X), c_linalg(X), times = 50)
## Unit: milliseconds
##                   expr       min        lq      mean    median       uq
##              c_loop(X) 1140.6247 1290.3500 1445.2759 1463.5481 1530.536
##  c_loop_precomputed(X)  683.2851  819.4876  931.0042  941.6843 1055.396
##             c_apply(X) 1339.6141 1533.3265 1691.6666 1750.0358 1844.104
##             c_scale(X)  923.4389  997.6793 1076.2210 1046.8453 1132.087
##             c_sweep(X)  902.1200  989.5979 1029.7515 1023.5344 1067.992
##            c_linalg(X)  170.2866  208.6570  228.7473  239.1603  243.018
##        max neval   cld
##  1952.3326    50    d 
##  1178.7784    50  b   
##  1979.6656    50     e
##  1543.0841    50   c  
##  1155.7243    50   c  
##   330.2129    50 a

One thousand and one ways to center in C

Benchmarks

library(microbenchmark)
microbenchmark(c_inplace(X), c_inplace_omp(X), c_blas(X), c_duplicate(X),
               c_duplicate_omp(X), c_alloc(X), c_alloc_omp(X), times = 50,
               setup = { X <- rand(n, p) })
## Unit: milliseconds
##                expr       min        lq      mean    median        uq
##        c_inplace(X)  69.63981  71.81472  73.56984  73.55544  75.51548
##    c_inplace_omp(X)  24.11459  25.12135  26.84304  26.33484  27.39767
##           c_blas(X)  74.56461  76.04124  77.68445  77.09719  77.72556
##      c_duplicate(X) 212.31280 218.43572 229.70650 222.36975 227.23836
##  c_duplicate_omp(X) 168.33961 173.98532 190.30355 177.84117 181.91254
##          c_alloc(X) 153.28024 156.95446 165.50865 160.21125 162.12550
##      c_alloc_omp(X)  31.38045  34.29273  49.25495  39.01961  46.85285
##        max neval    cld
##   78.05493    50   c   
##   34.44762    50 a     
##  109.36529    50   c   
##  372.31437    50      f
##  318.71433    50     e 
##  316.30628    50    d  
##  188.75457    50  b

This is an ideal …

Certain tradeoffs need to be made:

  • safety
  • integer vectors
  • missing values
  • BLAS

Tradeoffs: Safety

In-place modification of the data has potentially unexpected side effects.

Tradeoffs: Integer vectors

Integer values need to be coerced to doubles to compute the centered value, i.e., we cannot modify the data in-place.

We need to support integer vectors (for example the BEDMatrix package returns integers).

Tradeoffs: Missing values

  • Integers
    • Coded as \(-2^{31}\)
    • Test with x == NA_INTEGER
  • Doubles
    • Coded as NaN with a significand value of 1954 (the year Ross Ihaka was born)
    • Test with ISNA(x)

Tests are expensive but inevitable for integers.

Tradeoffs: BLAS

  • BLAS performance depends on implementation
  • BLAS may not be safe inside threaded code: "BLAS (and LAPACK) routines may be used inside threaded code. The reference implementations are thread-safe but external ones may not be (even single-threaded ones): this can lead to hard-to-track-down incorrect results or segfaults." R Installation and Administration
  • BLAS routines do not accept long vector inputs
  • BLAS does not support integer arithmetic

Tradeoffs: BLAS

We could use BLAS under the following conditions:

  • We only use it for small vectors (less than \(2^{31}\) elements)
  • We either only use it for double inputs, or coerce integer inputs to double inputs beforehand

Further considerations

Should we also

  • accept precomputed centers?
  • impute missing values?

Scaling

Computing the variance

Just a few notes because the operations are almost the same

  • Don't use scale(X, center = FALSE, scale = TRUE): "If ‘scale’ is ‘TRUE’ then scaling is done by dividing the (centered) columns of ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the root mean square otherwise."
  • Unfortunately, there is no equivalent of colMeans for computing the standard deviation of each column. The performance of apply(X, 2, sd) is bad. Some faster options are proposed here.

Both centering and scaling

Code and benchmark

library(microbenchmark)
microbenchmark(cs(X), scale(X, center = TRUE, scale = TRUE), times = 50)
## Unit: milliseconds
##                                   expr        min        lq       mean
##                                  cs(X)   31.47225   38.5872   85.03674
##  scale(X, center = TRUE, scale = TRUE) 4068.91624 4245.6604 4425.22855
##     median        uq      max neval cld
##    53.1925  109.4171  434.806    50  a 
##  4385.5309 4559.6897 4972.973    50   b

Thanks