How to win the KDD Cup Challenge with R and gbm

Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently) kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).

As a reminder of what we wrote before, the challenge provided two anonymized data set each of 50,000 mobile teleco customers and each entry having 15,000 variables. The task was to find the best churn, up-, and cross-sell models.

Hugh summarizes his team’s approach:

Feature selection was an important first step [CYBAEA: we mentioned before that this is key for all successful data mining projects]. We looked at how effective each individual variable was as a predictor, which also allowed us to reading parts of the data only, as the whole dataset didn’t fit in memory [CYBAEA: our emphasis]. The assessment here was homebrew, making a simple predictor on half the data and measuring performance (by the AUC measure) on the other half:

  • For categorical variables we just took the average number of 1's in the response for each category and used this as a predictor
  • For continuous variables we split the variable up into "bins", as you would a histogram, and again took the average number of 1's in the response for each bin as the predictor.

From this we came up with a set of about 200 variables for each model, which we continued to tinker with. The main model was a gradient boosted machine which used the "gbm" package in R. This basically fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the "1" response class. A fair amount of time was spent optimising the number of trees, how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit. The package itself is quite powerful as it gives some useful diagnostics such as relative variable importance, allowing us to exclude some and include others.

We used trees to avoid doing much data cleaning – they automatically allow for extreme results, non-linearity, missing values and handle both categorical and continuous variables. The main adjustment we had to make was to aggregate the smaller categories in the categorical variables, as they tended to distort the fits.

They did this on standard Windows laptops (Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive) against a competition that had more computing clusters available than Imelda Marcos had shoes. It is not what you’ve got, it’s how you use it ☺.

Congratulations to Hugh and his team!