The US$3 million Heritage Health Prize competition is on, so we take a look at how to get started with the R statistical computing and analysis platform.
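Getting started usually means loading a data set and having a first look at it. A minimal sketch of those first steps (using the built-in `iris` data purely as a stand-in, since no competition data appears here):

```r
# First steps in an R session: load a data frame and inspect it.
data(iris)       # a built-in example data set, used as a stand-in
str(iris)        # column names, types, and a preview of the values
summary(iris)    # per-column summary statistics
head(iris, 3)    # the first few rows
```

From there, the workflow is the same whatever the data: read it in, check its structure, then summarise before modelling.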
Do you have accurate and timely analysis of the quality of the customers you are acquiring? Most companies carefully track the quantity of new customers by the hour, day, or certainly the week, but it is still far less common to track the quality of the inflow as it happens. It is interesting to know that we have acquired, say, 1,000 new customers today, but far more informative to know that this inflow will bring in £22,000 of revenue over the next year at a 35% margin. Break it down by channel and product to see who is performing and who is not, and as a marketing manager I get really excited: I have the tools to do my job!
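As a sketch of what such a quality report might look like in R (the channel names, customer counts, and revenue figures below are all invented for illustration):

```r
# Hypothetical daily acquisition data: one row per new customer, with
# the channel that acquired them and a predicted first-year revenue.
acquisitions <- data.frame(
  channel      = c("search", "search", "display", "affiliate", "display", "search"),
  pred_revenue = c(30, 25, 10, 18, 12, 27)
)
margin <- 0.35  # assumed margin on the predicted revenue

# Quantity and quality of the inflow, broken down by channel
quality <- aggregate(pred_revenue ~ channel, data = acquisitions, FUN = sum)
quality$customers <- as.vector(table(acquisitions$channel)[quality$channel])
quality$margin_contribution <- quality$pred_revenue * margin
quality
```

The same breakdown by product (or channel crossed with product) is a matter of extending the grouping formula.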
We argued in our article on commercial churn modelling that you want to predict not only the probability of a customer leaving you but, even more importantly, what you can do about it. We want to predict why the customer is churning or, more precisely, the likelihood that he will stay (given that he was likely to leave) after we extend an offer or perform an action from a list of churn-management activities, as well as his profitability after the save.
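The commercial logic can be reduced to an expected-value calculation per customer and per action. A toy sketch, in which every number is an assumption for illustration rather than a modelled output:

```r
# Illustrative only: expected net value of extending one retention offer
# to a customer flagged as likely to churn.
p_churn      <- 0.60  # modelled probability the customer leaves if we do nothing
p_save       <- 0.40  # modelled probability the offer retains this at-risk customer
future_value <- 500   # predicted profitability of the customer if retained
offer_cost   <- 20    # cost of making the offer

expected_gain <- p_churn * p_save * future_value - offer_cost
expected_gain  # positive here, so the action is worth taking for this customer
```

Ranking customers and candidate actions by this expected gain is what turns a churn score into a churn-management programme.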
Churn modelling is easy; commercial churn modelling is hard. Let us compare the two to explain what we mean by the latter.
Why do we do analytics?
“You will come to know the truth, and the truth will set you free,” said the teacher, and while he wasn’t talking about commercial data mining, we think he could have been.
Feature selection is the data mining process of selecting those variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets, the performance of our selection process matters a great deal to us.
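To make the idea concrete, here is a deliberately simple filter-style sketch in base R: rank candidate variables by the strength of their correlation with the outcome and keep the strongest few. Real projects would use a proper package, and the data below is simulated purely for illustration.

```r
# Simulated data: two informative variables and two pure-noise variables.
set.seed(1)
n <- 200
x <- data.frame(
  x1     = rnorm(n),  # informative
  x2     = rnorm(n),  # informative
  noise1 = rnorm(n),  # irrelevant
  noise2 = rnorm(n)   # irrelevant
)
y <- 2 * x$x1 - 1.5 * x$x2 + rnorm(n, sd = 0.5)

# Score each variable by its absolute correlation with the outcome,
# then keep the top two.
scores   <- sapply(x, function(v) abs(cor(v, y)))
selected <- names(sort(scores, decreasing = TRUE))[1:2]
selected
```

A correlation filter like this is fast but blind to interactions, which is exactly why the more sophisticated approaches discussed below exist.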
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.
There are two main approaches to selecting the features (variables) we will use for the analysis: minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a given class of classification models), and all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.
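The difference shows up most clearly with redundant variables. In the toy sketch below (simulated data, with simple correlation-based stand-ins for the two approaches rather than the actual Boruta or caret algorithms), `x1` and `x1_copy` carry the same signal: an all-relevant method should flag both, while a minimal-optimal method needs only one of the pair.

```r
# Simulated data with a near-duplicate predictor.
set.seed(2)
n       <- 300
x1      <- rnorm(n)
x1_copy <- x1 + rnorm(n, sd = 0.1)   # near-duplicate of x1
noise   <- rnorm(n)
y       <- 3 * x1 + rnorm(n, sd = 0.5)
vars    <- data.frame(x1, x1_copy, noise)

# "All-relevant" stand-in: every variable whose correlation with y
# clearly beats the noise level.
relevance    <- sapply(vars, function(v) abs(cor(v, y)))
all_relevant <- names(relevance[relevance > 0.3])

# "Minimal-optimal" stand-in: a greedy step that keeps only the single
# strongest predictor, since its duplicate adds nothing further.
minimal_optimal <- names(which.max(relevance))

all_relevant     # both x1 and x1_copy
minimal_optimal  # just one of the duplicated pair
```

Which answer you want depends on the question: all-relevant selection is the right tool for understanding the domain, minimal-optimal for building a lean production model.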
Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolution.