KDD Cup 2009 – Make money from data

The results from the KDD Cup 2009 are both interesting and fundamentally not interesting. For this public data mining challenge Orange, the mobile telecommunications company, provided anonymized data sets on mobile customers: 50,000 records each of training and testing data with 15,000 variables. (The data sets are still available for download and there are also smaller data sets with only 230 variables.) The competition was to provide the best models for churn, cross-sell (“appetency”), and up-sell.

The problem with the competition is that we do not know what the data means: the variables are simply named Var1, Var2, …, Var15000. This makes it a purely statistical exercise in which no understanding of the business problem is required or helpful. That is really disappointing, and it made the challenge much (much) less interesting for me.

What is kind-of-interesting about the results is that you can still score about 75% on churn (where the score is the area under the ROC curve, which plots sensitivity against specificity). That is a little higher than I expected from purely statistical methods (my guess would have been around ⅔).
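To make the score concrete: the area under the curve can be read as the probability that a randomly chosen churner gets a higher model score than a randomly chosen non-churner. The sketch below computes it by brute force in plain Python. The function name `auc` and the toy labels and scores are made up for illustration and have nothing to do with the Orange data.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    churner with every non-churner.

    labels: 1 for a churner, 0 for a non-churner.
    scores: model output, higher means "more likely to churn".
    """
    churners = [s for y, s in zip(labels, scores) if y == 1]
    stayers = [s for y, s in zip(labels, scores) if y == 0]
    if not churners or not stayers:
        raise ValueError("need at least one churner and one non-churner")

    wins = 0.0
    for c in churners:
        for s in stayers:
            if c > s:
                wins += 1.0   # pair ranked correctly
            elif c == s:
                wins += 0.5   # ties count as half
    return wins / (len(churners) * len(stayers))


# Hypothetical toy data: 4 churners and 4 non-churners.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.7, 0.6]
toy_auc = auc(labels, scores)
print(toy_auc)
```

On this toy data 13 of the 16 churner/non-churner pairs are ordered correctly, so the score is 13/16, or about 0.81. A model that guesses at random scores around 0.5.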

The message to mobile operators is that if you do not know (about) 75% of your churners in advance, you are doing something very, very wrong. You are not even getting the statistics right. And if you know your business you can relatively easily get the score up into the 80-90% range: we have done that without getting very sophisticated in the analysis. Predicting who will churn in mobile telecommunications is not hard.

However, the number isn’t really interesting. Let’s just assume the sensitivity is 75%, so you know that proportion of churners in advance. Or even 80% or 90%. So what? Prediction is not the goal in business; action is. First, I need to know whom to retain. If the 25% (20% or 10%) I do not predict correctly are all the profitable customers who are churning, then I am no better off than without the model.
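A small worked example of that objection, as a sketch only: the customer records, the margin figures, and both function names below are invented for illustration. It compares plain sensitivity (share of churners flagged) with a value-weighted version (share of the positive margin at risk that sits with flagged churners).

```python
# Hypothetical churners: (customer_id, annual_margin, flagged_by_model).
# Six of the eight churners are flagged, but the two that are missed
# are the only profitable ones.
churners = [
    ("c1", -10.0, True),
    ("c2", -5.0, True),
    ("c3", 0.0, True),
    ("c4", 0.0, True),
    ("c5", -20.0, True),
    ("c6", -15.0, True),
    ("c7", 300.0, False),
    ("c8", 500.0, False),
]


def sensitivity(records):
    """Share of churners the model flagged in advance."""
    flagged = sum(1 for _, _, is_flagged in records if is_flagged)
    return flagged / len(records)


def value_weighted_sensitivity(records):
    """Share of the positive margin at risk that belongs to flagged churners."""
    at_risk = sum(max(margin, 0.0) for _, margin, _ in records)
    caught = sum(max(margin, 0.0) for _, margin, is_flagged in records if is_flagged)
    return caught / at_risk if at_risk else 0.0


print(sensitivity(churners))                 # 0.75
print(value_weighted_sensitivity(churners))  # 0.0
```

The model "knows" 75% of the churners, yet none of the margin that is walking out of the door.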

And second I need to know how to retain them. Profitably. That means understanding why they are churning, what offers would potentially retain them, and how they will behave in the next contract period so I can see if it would be profitable for me to extend the retention offers.

That is hard. The problem is not predicting who will churn. It is not even predicting which profitable customers will churn (the first objection above), though that is somewhat harder. Retention is not difficult. Profitable retention is. Predicting which offers will retain the customer and how they will behave given that offer is hard. And that is the (only) interesting business problem in retention.
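The arithmetic behind "profitable retention" can be sketched as follows, under loud assumptions: the function `offer_uplift`, its four inputs, and all the numbers are hypothetical, and in practice every one of those inputs is exactly the thing that is hard to predict. The margin with the offer is taken net of whatever the offer costs.

```python
def offer_uplift(p_churn, p_churn_with_offer, margin, margin_with_offer):
    """Expected change in next-period margin from making a retention offer.

    p_churn:             churn probability if nothing is done
    p_churn_with_offer:  churn probability if the offer is made
    margin:              next-period margin if the customer stays without the offer
    margin_with_offer:   next-period margin if the customer stays with the offer,
                         net of the cost of the offer
    """
    baseline = (1.0 - p_churn) * margin
    treated = (1.0 - p_churn_with_offer) * margin_with_offer
    return treated - baseline


# The offer halves the churn probability in both cases.
generous = offer_uplift(0.5, 0.25, margin=200.0, margin_with_offer=120.0)
modest = offer_uplift(0.5, 0.25, margin=200.0, margin_with_offer=160.0)
print(generous)  # -10.0: more customers retained, less money made
print(modest)    # 20.0: the same retention effect, now profitable
```

Both offers retain the same share of customers. Only one of them is worth making.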