On 2009-05-12 10:02:00, Allan Engelhardt wrote in CYBAEA Journal:
The results from the KDD Cup 2009 are both interesting and fundamentally not interesting. For this public data mining challenge Orange, the mobile telecommunications company, provided anonymous data sets on mobile customers: 50,000 records each of training and testing data with 15,000 variables. (The data sets are still available for download and there are also smaller data sets with only 230 variables.) The competition was to provide the best models for churn, cross-sell (“appetency”), and up-sell.
The problem with the competition is that we do not know what the data means: the variables are simply named Var1, Var2, ..., Var15000. This means that this is purely a statistical exercise and no understanding of the business problem is required or helpful. Which is really disappointing and made the challenge much (much) less interesting for me.
What is kind-of-interesting about the results is that you can still score some 75% of the churners (where the score is the area under the specificity-sensitivity curve). That is a little higher than I expected from purely statistical methods (my guess would have been around ⅔).
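For readers who want to see what that score measures, here is a minimal sketch of the area under the sensitivity-specificity (ROC) curve, computed through its pairwise-ranking form: the probability that a randomly chosen churner gets a higher model score than a randomly chosen non-churner. The function name `auc` and the toy data are made up for illustration and are not from the competition. Strictly speaking this is a ranking measure, so reading 0.75 as "75% of the churners" is a convenient simplification.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: model scores, higher means "more likely to churn".
    labels: 1 for a churner, 0 for a non-churner.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one churner and one non-churner")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three churners, five non-churners.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2, 0.7, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_auc = auc(scores, labels)
print(toy_auc)
```

The double loop is quadratic in the number of records, which is fine for a toy example; for 50,000 records a rank-based implementation from a statistics library would be the sensible choice.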
The message to mobile operators is that if you do not know (about) 75% of your churners in advance, you are doing something very wrong. You are not even getting the statistics right. And if you know your business you can relatively easily get the score up in the 80-90% range: we have done that without getting very sophisticated in the analysis. Predicting who will churn in mobile telecommunications is not hard.
However, the number isn’t really interesting. Let’s just assume the sensitivity is 75% so you know that proportion of churners in advance. Or even 80% or 90%. So what? Prediction is not the goal in business, action is. First I need to know who to retain. If the 25% (20% or 10%) I do not predict correctly are all the profitable customers who are churning, then I am no better off than without the model.
And second I need to know how to retain them. Profitably. That means understanding why they are churning, what offers would potentially retain them, and how they will behave in the next contract period so I can see if it would be profitable for me to extend the retention offers.
That is hard. The problem is not predicting who will churn. And not even to predict which profitable customers will churn (the first objection above), though that is somewhat harder. Retention is not difficult. Profitable retention is. Predicting which offers will retain the customer and how they will behave given that offer is hard. And that is the (only) interesting business problem in retention.
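The first objection above can be made concrete with a small sketch: the share of churners a model catches by head count can look healthy while the share of churning profit it catches is poor. Everything here is hypothetical, including the field names `flagged`, `churned` and `annual_profit`, the helper name, and the numbers; it assumes each churner's profit is positive.

```python
def recall_by_count_and_value(customers):
    """Return (share of churners flagged, share of churning profit flagged)."""
    churners = [c for c in customers if c["churned"]]
    caught = [c for c in churners if c["flagged"]]
    count_recall = len(caught) / len(churners)
    profit_at_risk = sum(c["annual_profit"] for c in churners)
    profit_caught = sum(c["annual_profit"] for c in caught)
    return count_recall, profit_caught / profit_at_risk


# Hypothetical customers: the model finds three low-value churners
# and misses the single high-value one.
customers = [
    {"flagged": True,  "churned": True,  "annual_profit": 10},
    {"flagged": True,  "churned": True,  "annual_profit": 10},
    {"flagged": True,  "churned": True,  "annual_profit": 10},
    {"flagged": False, "churned": True,  "annual_profit": 300},
    {"flagged": False, "churned": False, "annual_profit": 50},
    {"flagged": True,  "churned": False, "annual_profit": 40},
]
count_recall, value_recall = recall_by_count_and_value(customers)
print(count_recall, value_recall)
```

In this made-up case the model identifies 75% of the churners but only about 9% of the profit that is walking out of the door. It says nothing yet about which offer would retain anyone, which is the harder problem described above.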
On 2010-07-13 07:47:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.
The key technique is to draw a gradient line, which R does not support natively, so we have to roll our own code for that. Unfortunately, lines(..., type="l") does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary.
We also get a nice opportunity to use the under-appreciated read.fwf function.
Read more (~535 words).
On 2010-06-22 11:45:00, Allan Engelhardt wrote in CYBAEA Journal:
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is scary.
We now re-do the analysis four years later and, just because we can, we are using the leading companies of the London stock exchange instead of the largest American companies.
The results still hold. We called it the 3/2 rule: treble the number of workers and you halve their individual productivity. Large companies with ten times the number of employees are ¼ as productive as their smaller competitors.
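The two numbers in the rule are consistent with each other if one assumes productivity per employee follows a power law in headcount. That functional form is an assumption made here for illustration; the excerpt itself only states the two data points. Under that assumption, "treble and halve" fixes the exponent at ln 2 / ln 3, and the tenfold case follows by arithmetic. The names `K` and `relative_productivity` are made up for this sketch.

```python
import math

# Assumed form: productivity per employee is proportional to N ** (-K).
# "Treble the workers, halve the productivity" means 3 ** (-K) == 1/2,
# so K = ln 2 / ln 3 (roughly 0.63).
K = math.log(2) / math.log(3)


def relative_productivity(headcount_ratio):
    """Productivity per employee relative to a company with ratio 1."""
    return headcount_ratio ** (-K)


print(relative_productivity(3))    # the "treble" case
print(relative_productivity(10))   # the "ten times" case
```

The tenfold case comes out a little under a quarter, which matches the ¼ quoted above to the precision of a rule of thumb.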
Employee productivity is a big issue. If all the FTSE-100 companies achieved their average profits per employee, then the index would generate almost £1 trn of additional net profits for the economy.
Read more (~245 words).
On 2010-06-22 11:20:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
Read more (~763 words, 5 comments).
On 2010-06-17 09:05:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Following on from my previous post about improving performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages.
Read more (~300 words, 2 comments).
On 2010-06-15 10:21:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.
But recently David Smith suggested that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate.
The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.
Read more (~934 words, 1 comment).