The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their submission.
The write-up from Hugh Miller and team at the University of Melbourne includes these points:
Decision tree, stump, or Random Forest base classifiers with a logistic or cross-entropy loss function (see the sketch after this list)
Models fit in an hour or so
Used the R statistical package
Most of the models were run on a Windows laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, and a 120GB hard drive.
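As a rough illustration of the approach summarized above, here is a minimal sketch of fitting boosted decision trees with a logistic (Bernoulli) loss in R. This is not the Melbourne team's actual code; the `gbm` package is one of several R options for this kind of model, and the `train`/`test` data frames and 0/1 target `churn` are hypothetical stand-ins for the prepared KDD Cup data.

```r
# Boosted shallow trees with a logistic loss (a sketch, not the winners' code)
library(gbm)

fit <- gbm(
  churn ~ .,                        # all prepared features as predictors
  data              = train,        # hypothetical prepared training frame
  distribution      = "bernoulli",  # logistic / cross-entropy loss
  n.trees           = 500,
  interaction.depth = 2,            # shallow trees, close to stumps
  shrinkage         = 0.05,
  bag.fraction      = 0.5
)

# Predicted probabilities on a hypothetical held-out set
p <- predict(fit, newdata = test, n.trees = 500, type = "response")
```

On modest hardware like the laptop described above, keeping the trees shallow and the number of trees in the hundreds is what keeps fitting times in the "hour or so" range.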
Impressive hardware selection! Well done, R. Weka was another popular tool among the top entrants. Key for all of them were clever data preparation and variable substitution. The fast-track winners from IBM document this in some detail:
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.
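A rough sketch of that coding scheme in R is below. The function names are made up for illustration, `x` is a hypothetical categorical column, and `v` a hypothetical numeric one; the sketch assumes missing categorical values have already been coded as their own level (see the next excerpt).

```r
# Ten indicator columns for the ten most common values, plus one for "other"
encode_top10 <- function(x) {
  top   <- names(head(sort(table(x), decreasing = TRUE), 10))
  cols  <- sapply(top, function(val) as.integer(x == val))
  other <- as.integer(!(x %in% top))
  cbind(cols, other = other)
}

# Range normalization that leaves zeros at zero, preserving sparsity
normalize_range <- function(v) {
  v / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
}
```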
We replaced missing values with the mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach to imputing missing values based on KNN.
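For the numeric case, the mean-plus-indicator idea amounts to something like the following sketch (the function name is hypothetical and `v` is one numeric column):

```r
# Mean imputation plus a flag recording where the value was missing
impute_with_flag <- function(v) {
  was_missing <- as.integer(is.na(v))
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  cbind(value = v, was_missing = was_missing)
}
```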
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.
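A simple way to reproduce the spirit of this step in R is equal-frequency binning plus an empirical mutual-information score against the target, as sketched below. This is only an illustration under those assumptions, not the IBM team's code; `v` is one numeric column and `y` a hypothetical binary target.

```r
# Discretize one numeric column into (up to) ten equal-frequency bins
discretize10 <- function(v) {
  breaks <- unique(quantile(v, probs = seq(0, 1, 0.1), na.rm = TRUE))
  as.integer(cut(v, breaks = breaks, include.lowest = TRUE))
}

# Empirical mutual information between the binned feature and the target,
# used only to rank candidate variables
mutual_info <- function(bins, y) {
  joint <- table(bins, y) / length(y)
  px <- rowSums(joint)
  py <- colSums(joint)
  sum(joint * log(joint / outer(px, py)), na.rm = TRUE)  # empty cells drop out
}
```

Ranking all numeric variables by `mutual_info` and keeping the top 100 binned versions as extra columns mirrors the step described above.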
We tried PCA on the large data set, but it did not seem to help.
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their predictions as extra features in the hope of capturing some non-additive interactions among features.
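The tree-as-feature trick can be sketched in R with `rpart`: fit a depth-limited tree on one or two attributes and append its predicted probability as a new column. Again this is an illustration, not the original code; `train`, the factor target `churn`, and the variable names `Var1`/`Var2` are hypothetical.

```r
library(rpart)

# Fit a shallow classification tree on a few variables and return its
# predicted probability for the positive class as a new feature
tree_feature <- function(data, target, vars, depth = 3) {
  f   <- reformulate(vars, response = target)
  fit <- rpart(f, data = data, method = "class",
               control = rpart.control(maxdepth = depth, cp = 0))
  predict(fit, data, type = "prob")[, 2]  # second factor level = positive class
}

# One-variable and two-variable tree features
train$tree_Var1      <- tree_feature(train, "churn", "Var1")
train$tree_Var1_Var2 <- tree_feature(train, "churn", c("Var1", "Var2"))
```

The single-variable trees capture non-linear (but still additive) effects, while the two-variable trees give the downstream linear or boosted model a chance to pick up simple interactions.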