Centering and scaling in R is slow and expensive.
Let's see some numbers on an intel18 HPCC compute node …
Sep 10, 2019
Centering and scaling in R is slow and expensive.
Let's see some numbers on an intel18 HPCC compute node …
n <- 100000
p <- 500
rand <- function(n, p) {
matrix(data = rnorm(n * p, mean = 100, sd = 10), nrow = n, ncol = p)
}
X <- rand(n, p)
Generate random matrix of doubles with 400 MB in size.
c_loop: for (j in 1:ncol(X)) X[, j] <- X[, j] - mean[X[, j] (in-place)c_loop_precomputed: means <- colMeans(X); for (j in 1:ncol(X)) X[, j] <- X[, j] - means[j] (in-place)c_apply: apply(X, 2, function(x) x - mean(x))c_scale: scale(X, center = TRUE, scale = FALSE)c_sweep: sweep(X, 2, colMeans(X), check.margin = FALSE)c_linalg: X - rep(1, nrow(X)) %*% t(colMeans(X)) (Credits)library(microbenchmark)
microbenchmark(c_loop(X), c_loop_precomputed(X), c_apply(X), c_scale(X),
c_sweep(X), c_linalg(X), times = 50)
## Unit: milliseconds ## expr min lq mean median uq ## c_loop(X) 1140.6247 1290.3500 1445.2759 1463.5481 1530.536 ## c_loop_precomputed(X) 683.2851 819.4876 931.0042 941.6843 1055.396 ## c_apply(X) 1339.6141 1533.3265 1691.6666 1750.0358 1844.104 ## c_scale(X) 923.4389 997.6793 1076.2210 1046.8453 1132.087 ## c_sweep(X) 902.1200 989.5979 1029.7515 1023.5344 1067.992 ## c_linalg(X) 170.2866 208.6570 228.7473 239.1603 243.018 ## max neval cld ## 1952.3326 50 d ## 1178.7784 50 b ## 1979.6656 50 e ## 1543.0841 50 c ## 1155.7243 50 c ## 330.2129 50 a
Criteria: fast, memory-efficient, can handle large matrices, portable, somewhat readable, no surprising results?
c_inplace: center_inplace.cc_inplace_omp: center_inplace_omp.cc_blas.c: center_blas.cc_duplicate: center_duplicate.cc_duplicate_omp: center_duplicate_omp.cc_alloc: center_alloc.cc_alloc_omp: center_alloc_omp.clibrary(microbenchmark)
microbenchmark(c_inplace(X), c_inplace_omp(X), c_blas(X), c_duplicate(X),
c_duplicate_omp(X), c_alloc(X), c_alloc_omp(X), times = 50,
setup = { X <- rand(n, p) })
## Unit: milliseconds ## expr min lq mean median uq ## c_inplace(X) 69.63981 71.81472 73.56984 73.55544 75.51548 ## c_inplace_omp(X) 24.11459 25.12135 26.84304 26.33484 27.39767 ## c_blas(X) 74.56461 76.04124 77.68445 77.09719 77.72556 ## c_duplicate(X) 212.31280 218.43572 229.70650 222.36975 227.23836 ## c_duplicate_omp(X) 168.33961 173.98532 190.30355 177.84117 181.91254 ## c_alloc(X) 153.28024 156.95446 165.50865 160.21125 162.12550 ## c_alloc_omp(X) 31.38045 34.29273 49.25495 39.01961 46.85285 ## max neval cld ## 78.05493 50 c ## 34.44762 50 a ## 109.36529 50 c ## 372.31437 50 f ## 318.71433 50 e ## 316.30628 50 d ## 188.75457 50 b
Certain tradeoffs need to be made:
In-place modification of the data has potentially unexpected side effects.
Integer values need to be coerced to doubles to compute the centered value, i.e., we cannot modify the data in-place.
We need to support integer vectors (for example the BEDMatrix package returns integers).
x == NA_INTEGERNaN with a significand value of 1954 (the year Ross Ihaka was born)ISNA(x)Tests are expensive but inevitable for integers.
We could use BLAS under the following conditions:
Should we also
Solutions to compute variance:
scale(X, center = FALSE, scale = TRUE): "If ‘scale’ is ‘TRUE’ then scaling is done by dividing the (centered) columns of ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the root mean square otherwise."colMeans for computing the standard deviation of each column. The performance of apply(X, 2, sd) is bad. Some faster options are proposed here.library(microbenchmark) microbenchmark(cs(X), scale(X, center = TRUE, scale = TRUE), times = 50)
## Unit: milliseconds ## expr min lq mean ## cs(X) 31.47225 38.5872 85.03674 ## scale(X, center = TRUE, scale = TRUE) 4068.91624 4245.6604 4425.22855 ## median uq max neval cld ## 53.1925 109.4171 434.806 50 a ## 4385.5309 4559.6897 4972.973 50 b