The save() function in the R platform for statistical computing is very convenient, and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format that meant I could not recover my data.
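For readers who have not used it, here is a minimal sketch of how save() and load() behave next to saveRDS() and readRDS(). The object names and temporary files are invented for illustration, and the sketch makes no claim about which “feature” caused the problem described above.

```r
# Two toy objects to persist (names are illustrative only).
sales <- data.frame(id = 1:3, revenue = c(10, 20, 30))
note  <- "toy example"

# save() writes one or more named objects to a single file.
path <- tempfile(fileext = ".RData")
save(sales, note, file = path)

# Remove the objects, then restore them. load() recreates each object
# under its original name and invisibly returns those names.
rm(sales, note)
restored <- load(path)
print(restored)

# saveRDS()/readRDS() handle exactly one object, and the caller
# chooses the name it is assigned to on the way back in.
rds_path <- tempfile(fileext = ".rds")
saveRDS(sales, rds_path)
sales_copy <- readRDS(rds_path)
print(all.equal(sales, sales_copy))
```

Because load() silently writes into the calling environment, it can overwrite an existing object of the same name; readRDS() avoids that by returning the value instead.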
Because it is Friday and because we collect quotes, here is one on statistics being the best and worst of disciplines. Which of the two views is closest to your opinion?
For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client). Excel is installed pretty much everywhere, so it is sometimes the only way to get started on getting commercial value out of the data in the organisation. But I don’t like it, so let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.
The US$3 million Heritage Health Prize competition is on, so we take a look at how to get started using the R statistical computing and analysis platform.
Do you have accurate and timely analysis of the quality of the customers you are acquiring? Most companies carefully track the quantity of new customers by the hour, day, or certainly the week, but it is still less common to track the quality of the inflow as it happens. It is interesting to know that we have acquired, say, 1000 new customers today, but so very much more informative to know that this inflow will bring in £22,000 in revenue over the next year at a 35% margin. Break it down by channel and product to see who is performing and who is not, and I as a marketing manager get really excited: I have the tools to do my job!
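To make the arithmetic concrete, here is a small sketch in base R using the figures quoted above. The channel names and the split between them are hypothetical, invented only so the totals match the text, and the sketch assumes the same 35% margin applies to every channel.

```r
# Figures from the text: 1000 new customers, £22,000 expected
# first-year revenue, 35% margin.
customers <- 1000
revenue   <- 22000
margin    <- 0.35

revenue_per_customer <- revenue / customers       # £22 per customer
margin_total         <- revenue * margin          # about £7,700 in total
margin_per_customer  <- margin_total / customers  # about £7.70 per customer

# Hypothetical channel split (invented for illustration); the customer
# and revenue columns sum to the totals above.
inflow <- data.frame(
  channel   = c("web", "phone", "retail"),
  customers = c(600, 250, 150),
  revenue   = c(9000, 8000, 5000)
)

# Quality of the inflow per channel, assuming a uniform margin.
inflow$revenue_per_customer <- inflow$revenue / inflow$customers
inflow$margin               <- inflow$revenue * margin

# Rank channels by revenue per customer rather than by head count.
ranked <- inflow[order(-inflow$revenue_per_customer), ]
print(ranked)
```

In this made-up split the web channel brings in the most customers but the least revenue per customer, which is the kind of difference a quantity-only report would hide.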
We argued in our article on commercial churn modelling that you want to predict not only the probability of a customer leaving you but even more importantly what you can do about it. We want to predict why the customer is churning or, more precisely, his likelihood to stay (given that he was likely to leave) after we extend an offer or perform an action from a list of activities for churn management, as well as his profitability after the save.
Churn modelling is easy; commercial churn modelling is hard. Let us compare the two to explain what we mean by the latter.
Why do we do analytics?
You will come to know the truth, and the truth will set you free, said the teacher, and while he wasn’t talking about commercial data mining we think he could have been.
Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets, the performance of our process is very important to us.
Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.
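As a rough illustration of what feature selection with caret can look like, here is a minimal sketch using recursive feature elimination via rfe(). The toy data below is invented and is not the example from the post, and choosing rfe() with the linear-model helper lmFuncs is an assumption rather than a description of the method the post uses.

```r
library(caret)

set.seed(42)
n <- 200

# Toy data (invented): two informative predictors, six pure-noise ones.
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- paste0("x", 1:8)
y <- 3 * x$x1 - 2 * x$x2 + rnorm(n, sd = 0.5)

# Recursive feature elimination with linear models,
# evaluated by 5-fold cross-validation.
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
fit  <- rfe(x, y, sizes = 1:4, rfeControl = ctrl)

# Variables retained in the subset size that resampling picked.
selected <- predictors(fit)
print(selected)
```

With a signal this strong the two informative variables should be retained, though the selected subset may also include some noise variables, since rfe() picks the subset size with the best resampled error rather than the smallest adequate one.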