Blog posts from CYBAEA

OECD Statistics

2 July 2009

I am a sucker for good quality data. I wrote about the US Government data site before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require subscription).

Read more (~50 words)

R tips: Determine if function is called from specific package

16 June 2009

I like the “multicore” library for a particular task. I can easily write a combination of if(require("multicore",...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions), in which case much CPU thrashing and grinding of teeth will result.
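The fallback pattern described above might look roughly like the sketch below. It is an illustration only, not the code from the full post: the helper name par_lapply and the flag have_multicore are hypothetical, and the sketch does not address the nested-call problem.

```r
# Pick the apply function once: mclapply() if "multicore" loads, else lapply().
# suppressWarnings() keeps the fallback silent when the package is missing.
have_multicore <- suppressWarnings(require("multicore", quietly = TRUE))

# par_lapply is a hypothetical helper name, not part of any package.
par_lapply <- function(X, FUN, ...) {
  if (have_multicore) {
    multicore::mclapply(X, FUN, ...)
  } else {
    lapply(X, FUN, ...)
  }
}

# Works the same way with or without the parallel back end.
squares <- par_lapply(1:3, function(x) x^2)
```

On current R versions mclapply() is also provided by the base parallel package, so the same pattern applies with that package name swapped in.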

Read more (~170 words)

R tips: Installing Rmpi on Fedora Linux

12 June 2009

Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here.

Read more (~630 words)

Data Mashups in R from O’Reilly

9 June 2009

O’Reilly has published Data Mashups in R as a $4.99 PDF download in their Short Cut series. In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below. This is all done with the R statistical computing and analysis platform.

Read more (~110 words)

How to win the KDD Cup Challenge with R and gbm

1 June 2009

Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently), kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).

Read more (~450 words)

R used by KDD 2009 cup winner of slow challenge

31 May 2009

The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.

Read more (~380 words)

R tips: Use read.table instead of strsplit to split a text column into multiple columns

29 May 2009

Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format. He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard came up with a much nicer method using read.table on a textConnection object:
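A minimal sketch of the read.table approach follows. It is not necessarily the exact code posted on R-help: the data frame df, its ip column, and the sample addresses are made up for illustration.

```r
# Hypothetical example data: a data frame with an IP address column.
df <- data.frame(
  ip = c("192.168.10.2", "10.0.0.1", "192.168.2.15", "10.0.0.12"),
  stringsAsFactors = FALSE
)

# Read the column as a dot-separated table: one integer column per octet.
con <- textConnection(df$ip)
octets <- read.table(con, sep = ".", col.names = paste0("o", 1:4))
close(con)

# Sort numerically by each octet in turn, so that 2 comes before 10.
sorted <- df[order(octets$o1, octets$o2, octets$o3, octets$o4), , drop = FALSE]
```

Because read.table converts each octet to a number, the ordering is numeric rather than the character ordering that sorting the raw strings would give.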

Read more (~100 words)

Do social networks influence purchases?

22 May 2009

Harvard Business School has an interesting study titled Do Friends Influence Purchases in a Social Network? I would like to get my hands on the raw data (which is from the Korean social site Cyworld), but the outline conclusions seem plausible:

Read more (~170 words)

22 May 2009

I am always on the lookout for useful data sources for training in statistics, so I am excited that the US Government's data site has opened for business. Its purpose is to increase public access to high value, machine readable datasets generated by the US Government.

Read more (~90 words)

KDD Cup 2009

12 May 2009

The results from the KDD Cup 2009 are both interesting and fundamentally not interesting. For this public data mining challenge Orange, the mobile telecommunications company, provided anonymous data sets on mobile customers: 50,000 records each of training and testing data with 15,000 variables. (The data sets are still available for download and there are also smaller data sets with only 230 variables.) The competition was to provide the best models for churn, cross-sell (“appetency”), and up-sell.

Read more (~440 words)