Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolutions.
However, if you want to replicate their analysis in standard R, then you can absolutely do so, and we show you how.
Data preparation
First you need to prepare the rather large data set that they use in the Revolutions white paper. The preparation script (big.R) does two passes over all the files, which is not needed: changing it to a single pass is left as an exercise for the reader… Note that the script will take a while to run and will need some 30-odd gig of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.
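For orientation, the core of such a preparation step might look like the sketch below. This is not the big.R script. It is a minimal illustration that assumes the CSV is already all numeric; the real airline files contain character columns (carrier codes and the like) that would first need recoding to integers. The toy file name and its values are made up. `read.big.matrix` with `backingfile` and `descriptorfile` is the {bigmemory} call that writes the kind of descriptor file the analysis attaches to.

```r
library(bigmemory)

# Hypothetical toy stand-in for airlines.csv; the values are made up
tmp <- tempdir()
csv <- file.path(tmp, "toy_airlines.csv")
write.csv(data.frame(DayOfWeek = c(1L, 2L, 3L),
                     ArrDelay  = c(5L, -3L, 12L)),
          csv, row.names = FALSE, quote = FALSE)

# Parse the CSV once into a file-backed big.matrix. The descriptor file
# lets later sessions re-attach without parsing the CSV again.
x <- read.big.matrix(csv, header = TRUE, type = "integer",
                     backingfile = "toy_airlines.bin",
                     backingpath = tmp,
                     descriptorfile = "toy_airlines.des")

print(dim(x))
print(colnames(x))
```

A later session can then pick the matrix up again with `attach.big.matrix(dget(...))` on the descriptor file, which is what the analysis script does with airlines.des.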
```r
## bigScale.R - Replicate the analysis from http://bit.ly/aTFXeN with normal R
## http://info.revolutionanalytics.com/bigdata.html
## See big.R for the preprocessing of the data

## Load required libraries
library("biglm")
library("bigmemory")
library("biganalytics")
library("bigtabulate")

## Use parallel processing if available
## (Multicore is for "anything-but-Windows" platforms)
if (require("multicore")) {
    library("doMC")
    registerDoMC()
} else {
    warning("Consider registering a multi-core 'foreach' processor.")
}

day.names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
               "Saturday", "Sunday")

## Attach to the data
descriptor.file <- "airlines.des"
data <- attach.big.matrix(dget(descriptor.file))

## Replicate Table 5 in the Revolutions document:
## Table 5
t.5 <- bigtabulate(data,
                   ccols = "DayOfWeek",
                   summary.cols = "ArrDelay", summary.na.rm = TRUE)

## Pretty-fy the output
stat.names <- dimnames(t.5$summary[[1]])[2][[1]]
t.5.p <- cbind(matrix(unlist(t.5$summary), byrow = TRUE,
                      nrow = length(t.5$summary),
                      ncol = length(stat.names),
                      dimnames = list(day.names, stat.names)),
               ValidObs = t.5$table)
print(t.5.p)
#             min  max     mean       sd    NAs ValidObs
# Monday    -1410 1879 6.669515 30.17812 385262 18136111
# Tuesday   -1426 2137 5.960421 29.06076 417965 18061938
# Wednesday -1405 2598 7.091502 30.37856 405286 18103222
# Thursday  -1395 2453 8.945047 32.30101 400077 18083800
# Friday    -1437 1808 9.606953 33.07271 384009 18091338
# Saturday  -1280 1942 4.187419 28.29972 298328 15915382
# Sunday    -1295 2461 6.525040 31.11353 296602 17143178

## Figure 1
plot(t.5.p[, "mean"], type = "l", ylab = "Average arrival delay")
```
Just like the Revolutions paper. You can now use biglm.big.matrix and bigglm.big.matrix for basic regression and there are also k-means clustering and other functions.
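A minimal sketch of what such a regression call might look like follows. It runs on made-up toy data rather than the airline set, and it is not a replication of the Revolutions regression example. The column names mirror the airline data; `fc` names the columns to be treated as factors.

```r
library(biglm)
library(bigmemory)
library(biganalytics)

# Toy stand-in for the airline data; all values are made up
set.seed(1)
toy <- cbind(DayOfWeek = rep(1:7, each = 20),
             ArrDelay  = as.integer(round(rnorm(140, mean = 7, sd = 30))))
x <- as.big.matrix(toy)

# Linear model of arrival delay on day of week, with DayOfWeek as a factor
fit <- biglm.big.matrix(ArrDelay ~ DayOfWeek, data = x, fc = "DayOfWeek")
print(summary(fit))
```

On the real data you would presumably pass the attached `data` object in place of `x`; the formula is ordinary R formula syntax.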
2023 update: use {speedglm}
Last year we migrated a UK insurer away from {RevoScaleR}, the product referenced here. Revolutions was acquired by Microsoft, who eventually abandoned the whole project.
We moved them to {speedglm} for their main models. Works well and was relatively painless. (It should be: {speedglm} is just another implementation of the same algorithm.)
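To give a flavour of how small the change can be, here is a hedged sketch that fits the same model with base `glm` and with `speedglm` and compares the coefficients. The built-in `mtcars` data and the Poisson model are stand-ins chosen purely for illustration; they have nothing to do with the insurer's actual models.

```r
library(speedglm)

# Built-in mtcars data stands in for a real modelling table
fit_glm   <- glm(carb ~ wt + hp, data = mtcars, family = poisson())
fit_speed <- speedglm(carb ~ wt + hp, data = mtcars, family = poisson())

# The two fits should agree up to numerical tolerance
print(cbind(glm = coef(fit_glm), speedglm = coef(fit_speed)))
```

The formula and `family` arguments carry over unchanged, which is most of why a migration like this tends to be low-drama.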
So if it is still around 2023 when you read this and you are stuck with the abandoned Microsoft version of R and {RevoScaleR}, then feel free to ping me so we can get you sorted.
I must admit here that I do not understand the Revolutions regression example, so I have not attempted to replicate it here. It seems kind of sad if they change the syntax to be incompatible with standard R formulas, which is what appears to be happening.