I needed a fast way of eliminating variables with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, numerous, and changing fast. The final result surprised me a little.
I use the KDD Cup 2009 data sets as my reference for this experiment. (You will need to register to download the data.) It is a realistic example of the type of customer data that I usually work with. It has 50,000 observations of 15,000 variables. To load it into R you’ll need a reasonably beefy machine. My workstation has 16GB of memory; if yours has less, use a sample of the data.
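If memory is tight, one simple way to work with a sample is to read only the first rows of the file. The sketch below is an illustration rather than the loading code used for this experiment: `read_sample` is a hypothetical helper, and it assumes a tab-delimited file with a header row. The small temporary file stands in for the real data so the example runs on its own.

```r
# Hypothetical helper (not from the original post): read only the first
# n rows of a tab-delimited file that has a header row.
read_sample <- function(path, n = 5000) {
  read.delim(path, nrows = n, stringsAsFactors = FALSE)
}

# Stand-in file so the sketch is self-contained; replace `tmp` with the
# path to the downloaded data.
tmp <- tempfile(fileext = ".tsv")
write.table(
  data.frame(Var1 = 1:10, Var2 = rep(3, 10)),
  tmp, sep = "\t", row.names = FALSE, quote = FALSE
)

small <- read_sample(tmp, n = 4)
dim(small)
```

Note that taking the first `n` rows is not a random sample; if the file is ordered in some meaningful way, a random subset of rows would be a safer choice.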
We load the data into R and propose a few ways of identifying the zero-variance columns:
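The original listing is not reproduced here, so the following is only a sketch of what such candidates might look like. All three function names are hypothetical, the variance-based one assumes that "core variance function" means base R's `var()`, and every function assumes numeric columns with at least two non-missing values.

```r
# Hypothetical candidates; none of these names come from the original post.

# Variance-based: a column is constant when its variance is exactly zero.
# Exact comparison with 0 can be fragile for floating-point data.
zv_var <- function(df) {
  vapply(df, function(x) isTRUE(var(x, na.rm = TRUE) == 0), logical(1))
}

# Equality-based: a column is constant when its minimum equals its maximum.
zv_range <- function(df) {
  vapply(df, function(x) {
    r <- range(x, na.rm = TRUE)
    r[1] == r[2]
  }, logical(1))
}

# Equality-based: a column is constant when it has at most one distinct
# non-missing value.
zv_unique <- function(df) {
  vapply(df, function(x) length(unique(x[!is.na(x)])) <= 1, logical(1))
}

# Tiny example: columns a and c are constant, b is not.
df <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c(0, 0, 0))
zv_var(df)
zv_range(df)
zv_unique(df)
```

Each function returns a named logical vector with one entry per column, so the results can be compared directly or used to index the data frame.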
The two functions based on the core variance function are easily the fastest, despite having to do arithmetic, while taking the special case out of the equality functions is a Bad Idea.
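A comparison like this can be timed with `system.time()`, and the resulting logical vector can then be used to drop the constant columns. The sketch below uses a small synthetic data frame rather than the KDD Cup data, and `is_zero_var` is a hypothetical name for a variance-based check that assumes numeric columns; timings will differ from machine to machine, so none are shown.

```r
set.seed(1)

# Synthetic stand-in data: 1000 rows, 200 numeric columns,
# with every tenth column forced to a constant value.
n_rows <- 1000
n_cols <- 200
m <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)
m[, seq(10, n_cols, by = 10)] <- 1
df <- as.data.frame(m)

# Hypothetical variance-based check (assumes numeric columns).
is_zero_var <- function(df) {
  vapply(df, function(x) isTRUE(var(x, na.rm = TRUE) == 0), logical(1))
}

# Time the check; "elapsed" is wall-clock seconds.
timing <- system.time(zero_var <- is_zero_var(df))
print(timing[["elapsed"]])

# Keep only the columns that actually vary.
reduced <- df[, !zero_var, drop = FALSE]
dim(reduced)
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.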