I needed a fast way of eliminating variables with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance, and to do it as fast as possible, because my data sets are large, numerous, and changing fast. The final result surprised me a little.

I use the KDD Cup 2009 data sets as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with. It has 50,000 observations of 15,000 variables. To load it into R you’ll need a reasonably beefy machine. My workstation has 16GB of memory; if yours has less, use a sample of the data.

We load the data into R and propose a few ways in which we may identify the columns we need:
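The original listing is not reproduced here, so what follows is only a sketch of what four such candidates might look like. The names `zv1` to `zv4` mirror the labels in the benchmark output, but the function bodies are assumptions: two variants built on `var()` and two built on an equality check, the last one without the short-vector special case. The data frame `df` is a tiny synthetic stand-in, not the KDD Cup data.

```r
# Hypothetical candidates: the names match the benchmark labels,
# but the bodies are assumptions, not the original code.

# Variance-based: zero variance, or NA when too few values remain.
zv1 <- function(x) {
  v <- var(x, na.rm = TRUE)
  is.na(v) || v == 0
}

# Variance-based, using a tolerance instead of an exact comparison with zero.
zv2 <- function(x) {
  v <- var(x, na.rm = TRUE)
  is.na(v) || v < sqrt(.Machine$double.eps)
}

# Equality-based, with a special case for vectors of length 0 or 1.
zv3 <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) <= 1L) return(TRUE)
  all(x == x[1L])
}

# Equality-based, with the special case removed.
zv4 <- function(x) {
  x <- x[!is.na(x)]
  all(x == x[1L])
}

# Small synthetic stand-in for the real data (assumption).
df <- data.frame(a = c(1, 2, 3, 4), b = c(7, 7, 7, 7), c = c(NA, 1, 1, 2))

# Flag the zero-variance columns and drop them.
zero_var <- vapply(df, zv1, logical(1))
df_clean <- df[, !zero_var, drop = FALSE]
print(names(df_clean))
```

Any of the four functions can be swapped into the `vapply()` call; on this toy data frame they all flag the same constant column.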

Now we just have to load the very useful rbenchmark package and let the machine figure it out:
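As a rough illustration of that step, here is a minimal sketch of timing two approaches with `benchmark()` from rbenchmark. The helper names `zv_var` and `zv_eq`, the synthetic data frame, and the replication count are all assumptions chosen to keep the example small; the column selection matches the headers in the output shown in this post.

```r
library(rbenchmark)

# Synthetic stand-in data (assumption): one constant column among random ones.
set.seed(1)
df <- data.frame(a = rnorm(1000), b = rep(3, 1000), c = rnorm(1000))

# Hypothetical helpers: one variance-based, one equality-based.
zv_var <- function(x) {
  v <- var(x, na.rm = TRUE)
  is.na(v) || v == 0
}
zv_eq <- function(x) {
  x <- x[!is.na(x)]
  length(x) <= 1L || all(x == x[1L])
}

# Time each approach over all columns and report the chosen timing columns.
res <- benchmark(
  zv_var = vapply(df, zv_var, logical(1)),
  zv_eq  = vapply(df, zv_eq, logical(1)),
  replications = 10,
  columns = c("test", "elapsed", "relative", "sys.self"),
  order = "elapsed"
)
print(res)
```

Timings on data this small say little about the full data set, so treat the sketch as a template for the call rather than as evidence either way.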

The answer (on my machine) is that it is faster to calculate the variance than to check for equality:

```
Running benchmarks:
  test elapsed relative sys.self
1  zv1  78.619 1.000000    6.395
2  zv2  79.276 1.008357    6.586
3  zv3 113.024 1.437617    1.735
4  zv4 118.579 1.508274    1.716
```

The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.

Can you think of an even faster way to do it?