Eliminating observed values with zero variance in R

R
Published

8 March 2010

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I wanted to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.

I use the KDD Cup 2009 data set as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with: 50,000 observations of 15,000 variables. To load it into R you'll need a reasonably beefy machine. My workstation has 16GB of memory; if yours has less, then use a sample of the data.
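If you do need to subsample, one way (a sketch only; the 20% fraction and the stand-in data frame below are made up, not part of the original workflow) is to subset rows after loading:

```r
## Sketch: keep a random 20% of rows to fit in memory.
## `train` here is a small stand-in data frame, not the real KDD Cup data.
set.seed(1)  # reproducible subsample
train <- data.frame(V1 = rnorm(1000), V2 = runif(1000))
idx <- sample(nrow(train), size = nrow(train) %/% 5)  # 20% of the rows
train <- train[idx, ]
nrow(train)  # 200
```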

We load the data into R and propose a few ways in which we may identify the columns we need:

Code
``````#!/usr/bin/Rscript
## zero-var.R - find the fastest way of eliminating variables with zero variance
## © 2010 Allan Engelhardt, https://www.cybaea.net

## We have already converted it to R format and saved it, so we can do
# train <- data.table::fread(file = here::here("../orange_large_train.data"))
train <-
    vroom::vroom(
        file = here::here("../orange_large_train.data"),
        ## Column types from the documentation
        col_types = paste0(strrep("d", 14740), strrep("c", 260))
    )
saveRDS(train, here::here("../train.rds"), compress = FALSE)

## Some suggestions for zero variance functions:
zv.1 <- function(x) {
    ## The literal approach
    y <- var(x, na.rm = TRUE)
    return(is.na(y) || y == 0)
}
zv.2 <- function(x) {
    ## As before, but avoiding direct comparison with zero
    y <- var(x, na.rm = TRUE)
    return(is.na(y) || y < .Machine$double.eps ^ 0.5)
}
zv.3 <- function(x) {
    ## Maybe it is faster to check for equality than to compute?
    y <- x[!is.na(x)]
    return(all(y == y[1]))
}
zv.4 <- function(x) {
    ## Taking out the special case may speed things up?
    ## (At least for this data set where this case is common.)
    z <- is.na(x)
    if (all(z)) return(TRUE)
    y <- x[!z]
    return(all(y == y[1]))
}
## Edited to add variant from {caret} / {tidymodels}
zv.5 <- function(x) {
    x <- x[!is.na(x)]
    length(unique(x)) < 2
}``````

Now we just have to load the very useful {rbenchmark} package and let the machine figure it out:

Code
``````library("rbenchmark")

cat("Running benchmarks:\n")
b <- benchmark(
    zv1 = { sapply(train, zv.1) },
    zv2 = { sapply(train, zv.2) },
    zv3 = { sapply(train, zv.3) },
    zv4 = { sapply(train, zv.4) },
    zv5 = { sapply(train, zv.5) },
    replications = 5,
    columns = c("test", "elapsed", "relative", "sys.self"),
    order = "elapsed"
)
print(b)``````

The answer (on my machine) is that it is faster to calculate than to check for equality:

``````Running benchmarks:
  test elapsed relative sys.self
1  zv1  78.619 1.000000    6.395
2  zv2  79.276 1.008357    6.586
3  zv3 113.024 1.437617    1.735
4  zv4 118.579 1.508274    1.716``````
2023 results with base R
``````Running benchmarks:
  test elapsed relative sys.self
2  zv2  107.14    1.000     4.22
1  zv1  123.26    1.150     7.84
3  zv3  301.05    2.810    12.55
4  zv4  308.28    2.877    11.29
5  zv5  402.13    3.753    16.30``````
2023 results with {vroom}
``````Running benchmarks:
  test elapsed relative sys.self
2  zv2   83.36    1.000     1.31
1  zv1   85.09    1.021     3.09
3  zv3  358.60    4.302    26.25
4  zv4  495.50    5.944    22.61
5  zv5  698.60    8.381    24.83``````

The two functions based on the built-in `var()` function are easily the fastest (despite having to do arithmetic), while taking the special case out of the equality-based tests turns out to be a Bad Idea.

Can you think of an even faster way to do it?