Why?

3 January 2011

Why do we do analytics? “You will come to know the truth, and the truth will set you free,” said the teacher, and while he wasn’t talking about commercial data mining, we think he could have been.

Read more (~370 words)

Benchmarking feature selection with Boruta and caret

25 November 2010

Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets, the run-time performance of the selection process matters a great deal to us.

Read more (~1290 words)

Feature selection: Using the caret package

16 November 2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package, while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

Read more (~990 words)

Feature selection: All-relevant selection with the Boruta package

15 November 2010

Feature selection is an important step for practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models), and all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.
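To make the all-relevant idea concrete, here is a minimal sketch of the shadow-feature comparison that Boruta is built around: each feature competes against shuffled copies of the features, and it counts as relevant only if it keeps beating the best shuffled copy. This is an illustration under loud assumptions, not the Boruta implementation. It is written in Python rather than R, it uses absolute Pearson correlation as a stand-in for random forest importance, and it replaces Boruta's statistical test with a fixed share of winning rounds. The data set and the names `signal`, `redundant` and `noise` are made up for the example.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def all_relevant(features, target, rounds=20, min_hit_share=0.9, seed=1):
    """Keep every feature that beats the best shuffled ("shadow") copy
    in at least min_hit_share of the rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        # Shuffling breaks any link to the target, so the best shadow score
        # shows how large an importance can get by chance alone.
        shadow_best = 0.0
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, abs_corr(shadow, target))
        for name, values in features.items():
            if abs_corr(values, target) > shadow_best:
                hits[name] += 1
    return [name for name in features if hits[name] >= min_hit_share * rounds]

# Toy data (hypothetical): two informative features that carry the same
# information, and one feature that is unrelated to the target.
n = 120
target = [1 if i < n // 2 else 0 for i in range(n)]
features = {
    "signal": [float(t) for t in target],
    "redundant": [2.0 * t + 0.1 * (i % 3) for i, t in enumerate(target)],
    "noise": [float(i % 2) for i in range(n)],
}

selected = all_relevant(features, target)
print(selected)
```

Both `signal` and `redundant` are kept, even though either one alone is enough to classify the toy target perfectly. A minimal-optimal method would be free to keep just one of them, which is exactly the difference between the two approaches.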

Read more (~1210 words)

Big data for R

5 August 2010

Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolution.

Read more (~760 words)

Area Plots with Intensity Coloring

13 July 2010

I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.

Read more (~420 words)

Employee productivity revisited

22 June 2010

We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is scary.

Read more (~300 words)

Employee productivity as function of number of workers revisited

22 June 2010

We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.
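The treble-and-halve rule can be turned into a small worked calculation. The sketch below assumes (and this is our assumption for illustration, not something stated above) that productivity per worker follows a power law in headcount, p(N) = c * N^(-alpha). The rule then pins down alpha, and from alpha we can read off what happens to total output. The function names are hypothetical.

```python
import math

# Assumption: per-worker productivity follows p(N) = c * N**(-alpha).
# "Treble the workers, halve productivity" means 3**(-alpha) == 1/2,
# so alpha = log(2) / log(3).
alpha = math.log(2) / math.log(3)

def relative_productivity(growth_factor):
    """Per-worker productivity after headcount grows by growth_factor,
    relative to the productivity before the growth."""
    return growth_factor ** (-alpha)

def relative_output(growth_factor):
    """Total output (headcount times productivity), relative to before."""
    return growth_factor * relative_productivity(growth_factor)

print(round(alpha, 3))               # 0.631
print(round(relative_output(3), 2))  # 1.5
```

Under that assumption, three times the workers produce only one and a half times the output, and it takes roughly 6.5 times the workforce to double it.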

Read more (~670 words)

Comparing standard R with Revolution for performance

17 June 2010

Following on from my previous post about improving the performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages.

Read more (~270 words)

Faster R through better BLAS

15 June 2010

Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.

Read more (~750 words)