CYBAEA Data and Analysis

By Allan Engelhardt

Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related areas.

This is the blog for practising data analysts and theoretical statisticians. The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.


R versus SAS/SPSS in corporations

2011-10-28 11:10:00 Allan Engelhardt wrote:

A recent question on one of the LinkedIn groups about the advantages of using R over commercial tools like SAS or IBM SPSS Modeler drew lots of comments in favour of R. We like R a lot and we use it extensively, but I also wanted to balance the discussion. R is great, but looking at commercial organizations near the end of 2011, it is not necessarily the right choice to make.

Friday quote: what is the question to which this number is the answer?

2011-08-26 09:05:00 Allan Engelhardt wrote:

John Kay muses on interpreting statistical data:

Always ask of such data “what is the question to which this number is the answer?”. “Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs” is the answer to the question “what is the highest profit number we can present without attracting flat disbelief?”.

A warning on the R save format

2011-08-23 07:20:00 Allan Engelhardt wrote:

The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.

I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?
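As a minimal sketch of the data-format approach, assuming the data is a plain data frame of numbers (the data frame `measurements` and the temporary file name are hypothetical, made up for illustration), a CSV round trip could look like this:

```r
# Hypothetical example data; any simple data frame would do.
measurements <- data.frame(id = 1:3, value = c(1.5, 2.5, 3.5))

# Write the data to a plain-text CSV file that other tools can also read.
csv_file <- tempfile(fileext = ".csv")
write.csv(measurements, csv_file, row.names = FALSE)

# Read it back into a new data frame.
restored <- read.csv(csv_file)
```

A CSV file does not record R-specific details such as factor levels or date classes, so those may need to be restored by hand after reading.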

Friday quote: the handmaiden and the whore

2011-08-19 12:04:00 Allan Engelhardt wrote:

Because it is Friday and because we collect quotes:

If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship. Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers. They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.

Spreadsheet errors

2011-04-20 11:19:00 Allan Engelhardt wrote:

For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client). Excel is installed pretty much everywhere, so it is sometimes the only way to start getting commercial value out of the data in the organisation. But I don’t like it, so let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.

Getting started with the Heritage Health Prize competition

2011-04-08 08:39:00 Allan Engelhardt wrote:

The US$ 3 million Heritage Health Prize competition is on, so we take a look at how to get started using the R statistical computing and analysis platform.

Benchmarking feature selection with Boruta and caret

2010-11-25 13:43:00 Allan Engelhardt wrote:

Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us.

Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches.

Neither approach is suitable out of the box for the sizes of data sets that we normally work with.

Feature selection: Using the caret package

2010-11-16 19:35:00 Allan Engelhardt wrote:

Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

Feature selection: All-relevant selection with the Boruta package

2010-11-15 10:04:00 Allan Engelhardt wrote:

Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the all-relevant feature selection which identifies all variables that are in some circumstances relevant for the classification.

In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed for the R statistical computing and analysis platform.

Big data for R

2010-08-05 08:22:00 Allan Engelhardt wrote:

Revolution Analytics recently announced their "big data" solution for R. This is great news and a lovely piece of work by the team at Revolution.

However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how.

Area Plots with Intensity Coloring

2010-07-13 07:47:00 Allan Engelhardt wrote:

I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.

The key technique is to draw a gradient line, which R does not support natively, so we have to roll our own code for that. Unfortunately, lines(..., type="l") does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary.

We also get a nice opportunity to use the under-appreciated read.fwf function.
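A minimal sketch of one way to draw such a gradient line in base graphics, using segments() (which accepts a vector of colours) rather than the loops the post describes. The helper name `gradient_line` and the sine-curve data are hypothetical:

```r
# Draw a line whose colour changes along its length.
# Each consecutive pair of points becomes one segment with its own colour.
gradient_line <- function(x, y, palette = heat.colors(length(x) - 1L), ...) {
  n <- length(x)
  stopifnot(n >= 2L, length(y) == n)
  cols <- rep_len(palette, n - 1L)  # one colour per segment
  segments(x[-n], y[-n], x[-1L], y[-1L], col = cols, ...)
  invisible(cols)
}

# Usage: set up an empty plot, then add the gradient line to it.
x <- seq(0, 2 * pi, length.out = 100)
plot(x, sin(x), type = "n")
gradient_line(x, sin(x), lwd = 2)
```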

Employee productivity as function of number of workers revisited

2010-06-22 11:20:00 Allan Engelhardt wrote:

We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary.

We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
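To put a number on that rule of thumb: if we assume, purely for illustration, that productivity per worker follows a power law in the number of workers, then "treble the workers, halve the productivity" fixes the exponent at -log(2)/log(3), or about -0.63. The function name below is hypothetical:

```r
# Assumption: productivity per worker is proportional to workers^exponent.
# Trebling the head count halves productivity, so 3^exponent must equal 1/2.
exponent <- -log(2) / log(3)

# Productivity per worker relative to the starting point,
# after the head count is multiplied by `scale`.
relative_productivity <- function(scale) scale^exponent

relative_productivity(3)  # one trebling: about 0.5
relative_productivity(9)  # two treblings: about 0.25
```

Under the same assumption, total output still grows with head count, only more slowly: nine times the workers gives about 9 * 0.25 = 2.25 times the output.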

Comparing standard R with Revolution for performance

2010-06-17 09:05:00 Allan Engelhardt wrote:

Following on from my previous post about improving performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages.

Faster R through better BLAS

2010-06-15 10:21:00 Allan Engelhardt wrote:

Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.

But recently David Smith suggested that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate.

The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.

R: Eliminating observed values with zero variance

2010-03-08 14:46:00 Allan Engelhardt wrote:

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
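For reference, a simple baseline for this task (not necessarily the fastest approach, and not necessarily the one the post arrives at) is to count the distinct non-missing values in each column. The function name and the example data frame are hypothetical:

```r
# Return the names of columns that hold at most one distinct non-missing value.
# Such columns have zero variance and carry no information for modelling.
zero_variance_columns <- function(df) {
  is_constant <- vapply(df, function(column) {
    length(unique(column[!is.na(column)])) <= 1L
  }, logical(1))
  names(df)[is_constant]
}

# Usage: drop the constant columns from a small example data frame.
example <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c("x", "x", "x"))
constant <- zero_variance_columns(example)
reduced <- example[, !(names(example) %in% constant), drop = FALSE]
```

Counting distinct values also works for character and factor columns, where var() would not apply directly.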

Beautiful Data

2009-07-27 19:38:00 Allan Engelhardt wrote:

O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.

Massively parallel database for analytics

2009-07-22 13:37:00 Allan Engelhardt wrote:

This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based backends to PostgreSQL are being implemented now. Read: Announcing release of HadoopDB.

The Knapsack Problem

2009-07-10 20:30:00 Allan Engelhardt wrote:

David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform. My reply in the comments seems to have disappeared for a while so here is my proposed solution:

OECD Statistics

2009-07-02 20:33:00 Allan Engelhardt wrote:

I am a sucker for good quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require subscription).

R tips: Determine if function is called from specific package

2009-06-16 10:27:00 Allan Engelhardt wrote:

I like the "multicore" library for a particular task. I can easily write a combination of if(require("multicore",...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions), in which case much CPU thrashing and grinding of teeth will result.

So, I needed a function to determine if my function was called from any function in the "multicore" library. Here it is.
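A minimal sketch of one way to perform such a check, which is not necessarily the approach taken in the post: walk the call stack and look at the package namespace in which each calling function was defined. The name `called_from_package` is hypothetical, and the usage below tests against "base" only because that package is always available:

```r
# Return TRUE if any function on the current call stack was defined
# in the namespace of package `pkg`, and FALSE otherwise.
called_from_package <- function(pkg) {
  # Frames 1 .. n-1 belong to our callers; frame n is this call itself.
  n <- sys.nframe()
  for (i in seq_len(n - 1L)) {
    fn_env <- environment(sys.function(i))
    if (is.null(fn_env)) next  # no enclosing environment to inspect
    # topenv() climbs from closures defined inside package functions
    # up to the package namespace (or the global environment).
    if (identical(environmentName(topenv(fn_env)), pkg)) return(TRUE)
  }
  FALSE
}

# Usage: lapply() is defined in base, so the check succeeds inside it.
worker <- function(i) called_from_package("base")
inside_lapply <- unlist(lapply(1, worker))
```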
