Eliminating observed values with zero variance in R – Make money from data

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.

I use the KDD Cup 2009 data sets as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with. It has 50,000 observations of 15,000 variables. To load it into R you’ll need a reasonably beefy machine. My workstation has 16GB of memory; if yours have less then use a sample of the data.

We load the data into R and propose a few ways in which we may identify the columns we need:

Code

#!/usr/bin/Rscript
## zero-var.R - find the fastest way of eliminating observations with zero variance
## © 2010 Allan Engelhardt, https://www.cybaea.net

## Read the data file.
## We have already converted it to R format and saved it, so we can do
# train <- readRDS(here::here("../train.rds"))
## instead of something like
# train <- data.table::fread(file = here::here("../orange_large_train.data"))
train <- 
  vroom::vroom(
    file = here::here("../orange_large_train.data"),
    ## Column types from the documentation
    col_types = paste0(strrep("d", 14740), strrep("c", 260))
  )
saveRDS(train, here::here("../train.rds"), compress = FALSE)

## Some suggestions for zero variance functions:
zv.1 <- function(x) {
    ## The literal approach
    y <- var(x, na.rm = TRUE)
    return(is.na(y) || y == 0)
}
zv.2 <- function(x) {
    ## As before, but avoiding direct comparison with zero
    y <- var(x, na.rm = TRUE)
    return(is.na(y) || y < .Machine$double.eps ^ 0.5)
}
zv.3 <- function(x) {
    ## Maybe it is faster to check for equality than to compute?
    y <- x[!is.na(x)]
    return(all(y == y[1]))
}
zv.4 <- function(x) {
    ## Taking out the special case may speed things up?
    ## (At least for this data set where this case is common.)
    z <- is.na(x)
    if ( all(z) ) return(TRUE);
    y <- x[!z]
    return(all(y == y[1]))
}
## Edited to add variant from {caret} / {tidymodels}
zv.5 <- function(x) {
    x <- x[!is.na(x)]
    length(unique(x)) < 2
}

Now we just have to load the very useful {rbenchmark} package and let the machine figure it out:

Code

library("rbenchmark")

cat("Running benchmarks:\n")
b <- benchmark(
  zv1 = {
    sapply(train, zv.1)
  },
  zv2 = {
    sapply(train, zv.2)
  },
  zv3 = {
    sapply(train, zv.3)
  },
  zv4 = {
    sapply(train, zv.4)
  },
  zv5 = {
    sapply(train, zv.5)
  },
  replications = 5,
  columns = c("test", "elapsed", "relative", "sys.self"),
  order = "elapsed"
)
print(b)

The answer (on my machine) is that it is faster to calculate than to check for equality:

2010 results

Running benchmarks:
  test elapsed relative sys.self
1  zv1  78.619 1.000000    6.395
2  zv2  79.276 1.008357    6.586
3  zv3 113.024 1.437617    1.735
4  zv4 118.579 1.508274    1.716

2023 results with base R

Running benchmarks:
  test elapsed relative sys.self
2  zv2  107.14    1.000     4.22
1  zv1  123.26    1.150     7.84
3  zv3  301.05    2.810    12.55
4  zv4  308.28    2.877    11.29
5  zv5  402.13    3.753    16.30

2023 results with {vroom}

Running benchmarks:
  test elapsed relative sys.self
2  zv2   83.36    1.000     1.31
1  zv1   85.09    1.021     3.09
3  zv3  358.60    4.302    26.25
4  zv4  495.50    5.944    22.61
5  zv5  698.60    8.381    24.83

The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.

Can you think of an even faster way to do it?