R: Eliminating observed values with zero variance

On , wrote in CYBAEA Data:

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.

I use the KDD Cup 2009 data sets as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with. It has 50,000 observations of 15,000 variables. To load it into R you'll need a reasonably beefy machine. My workstation has 16GB of memory; if yours have less then use a sample of the data.

We load the data into R and propose a few ways in which we may identify the columns we need:

```#!/usr/bin/Rscript
## zero-var.R - find the fastest way of eliminating observations with zero variance
## © 2010 Allan Engelhardt, http://www.cybaea.net

## Read the data file.
## We have already converted it to R format and saved it, so we can do
## instead of something like
# train <- read.delim(file="../orange_large_train.data.bz2")

## Some suggestions for zero variance functions:
zv.1 <- function(x) {
## The literal approach
y <- var(x, na.rm = TRUE)
return(is.na(y) || y == 0)
}
zv.2 <- function(x) {
## As before, but avoiding direct comparison with zero
y <- var(x, na.rm = TRUE)
return(is.na(y) || y < .Machine\$double.eps ^ 0.5)
}
zv.3 <- function(x) {
## Maybe it is faster to check for equality than to compute?
y <- x[!is.na(x)]
return(all(y == y[1]))
}
zv.4 <- function(x) {
## Taking out the special case may speed things up?
## (At least for this data set where this case is common.)
z <- is.na(x)
if ( all(z) ) return(TRUE);
y <- x[!z]
return(all(y == y[1]))
}
```

Now we just have to load the very useful rbenchmark package and let the machine figure it out:

```library("rbenchmark")

cat("Running benchmarks:\n")
benchmark(
zv1 = { sapply(train, zv.1) },
zv2 = { sapply(train, zv.2) },
zv3 = { sapply(train, zv.3) },
zv4 = { sapply(train, zv.4) },
replications = 5,
columns = c("test", "elapsed", "relative", "sys.self"),
order = "elapsed"
)
```

The answer (on my machine) is that it is faster to calculate than to check for equality:

```Running benchmarks:
test elapsed relative sys.self
1  zv1  78.619 1.000000    6.395
2  zv2  79.276 1.008357    6.586
3  zv3 113.024 1.437617    1.735
4  zv4 118.579 1.508274    1.716
```

The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.

Can you think of an even faster way to do it?