I like the “multicore” library for a particular task. I can easily write a combination of if(require("multicore", ...)) that means my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is itself called from mclapply() (or one of the lower-level functions), in which case much CPU thrashing and grinding of teeth will result.
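The dispatch described above can be sketched like this; the wrapper name par_lapply is my own, and the original post only shows the require("multicore", ...) test:

```r
# A minimal sketch of the lapply/mclapply dispatch described above.
# The wrapper name par_lapply is invented for illustration.
par_lapply <- function(x, f, ...) {
  if (require("multicore", quietly = TRUE)) {
    # Parallel path: beware calling this from inside mclapply() itself,
    # or the nested workers will thrash the CPU as described above.
    mclapply(x, f, ...)
  } else {
    lapply(x, f, ...)  # serial fallback when multicore is unavailable
  }
}
```

When multicore is not installed, require() returns FALSE quietly and the serial branch runs, so the wrapper behaves the same everywhere and is merely faster where parallelism is available.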
Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here.
O’Reilly has published Data Mashups in R as a $4.99 PDF download in their Short Cut series. In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below. This is all done with the R statistical computing and analysis platform.
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 184.108.40.206). He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard comes up with a much nicer method using read.table on a
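The strsplit approach I proposed looks roughly like this; the data frame, the column name ip, and the sample addresses are made up for illustration:

```r
# Hypothetical data frame with an IP column (not from the original post).
df <- data.frame(ip = c("10.0.0.10", "10.0.0.2", "9.1.2.3"),
                 stringsAsFactors = FALSE)

# Split each address into its four octets and convert to numbers, so
# "10" sorts after "2" numerically rather than before it lexically.
octets <- do.call(rbind,
                  lapply(strsplit(df$ip, ".", fixed = TRUE), as.numeric))
df_sorted <- df[order(octets[, 1], octets[, 2], octets[, 3], octets[, 4]), ,
                drop = FALSE]
```

The fixed = TRUE argument matters: "." is a regular-expression metacharacter, so without it strsplit would split on every character.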
Harvard Business School has an interesting study titled Do Friends Influence Purchases in a Social Network?. I would like to get my hands on the raw data (which is from the Korean social site Cyworld), but the outline conclusions seem plausible:
I am always on the lookout for useful data sources for training in statistics, so I am excited that data.gov has opened for business. The purpose of Data.gov is to increase public access to high-value, machine-readable datasets generated by the US Government.
The results from the KDD Cup 2009 are both interesting and fundamentally not interesting. For this public data mining challenge Orange, the mobile telecommunications company, provided anonymized data sets on mobile customers: 50,000 records each of training and testing data, with 15,000 variables. (The data sets are still available for download, and there are also smaller data sets with only 230 variables.) The competition was to provide the best models for churn, cross-sell (“appetency”), and up-sell.