2010-08-05 08:22:00 Allan Engelhardt wrote in CYBAEA Data and Analysis:
Revolutions Analytics recently announced their "big data" solution for R. This is great news and a lovely piece of work by the team at Revolutions.
However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how.
First you need to prepare the rather large data set that they use in the Revolutions white paper. The preparation script shown below does two passes over alal the files which is not needed: changing it to a single pass is left as an exercise for the reader.... Note that the following script will take a while to run and will need some 30-odd gig of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.
#!/usr/bin/Rscript ## big.R - Preprocess the airline data ## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/) ## Install the packages we will use install.packages("bigmemory", dependencies = c("Depends", "Suggests", "Enhances")) ## Data sets are downloaded from the Data Expo '09 web site at ## http://stat-computing.org/dataexpo/2009/the-data.html for (year in 1987:2008) { file.name <- paste(year, "csv.bz2", sep = ".") if ( !file.exists(file.name) ) { url.text <- paste("http://stat-computing.org/dataexpo/2009/", year, ".csv.bz2", sep = "") cat("Downloading missing data file ", file.name, "\n", sep = "") download.file(url.text, file.name) } } ## Read sample file to get column names and types d <- read.csv("2008.csv.bz2") integer.columns <- sapply(d, is.integer) factor.columns <- sapply(d, is.factor) factor.levels <- lapply(d[, factor.columns], levels) n.rows <- 0L ## Process each file determining the factor levels ## TODO: Combine with next loop for (year in 1987:2008) { file.name <- paste(year, "csv.bz2", sep = ".") cat("Processing ", file.name, "\n", sep = "") d <- read.csv(file.name) n.rows <- n.rows + NROWS(d) new.levels <- lapply(d[, factor.columns], levels) for ( i in seq(1, length(factor.levels)) ) { factor.levels[[i]] <- c(factor.levels[[i]], new.levels[[i]]) } rm(d) } save(integer.columns, factor.columns, factor.levels, file = "factors.RData") ## Now convert all factors to integers so we can create a bigmatrix of the data col.classes <- rep("integer", length(integer.columns)) col.classes[factor.columns] <- "character" cols <- which(factor.columns) first <- TRUE csv.file <- "airlines.csv" # Write combined integer-only data to this file csv.con <- file(csv.file, open = "w") for (year in 1987:2008) { file.name <- paste(year, "csv.bz2", sep = ".") cat("Processing ", file.name, "\n", sep = "") d <- read.csv(file.name, colClasses = col.classes) ## Convert the strings to integers for ( i in seq(1, length(factor.levels)) ) { col <- cols[i] d[, col] <- match(d[, col], factor.levels[[i]]) } write.table(d, file = csv.con, sep = ",", row.names = FALSE, col.names = first) first <- FALSE } close(csv.con) ## Now convert to a big.matrix library("bigmemory") backing.file <- "airlines.bin" descriptor.file <- "airlines.des" data <- read.big.matrix(csv.file, header = TRUE, type = "integer", backingfile = backing.file, descriptorfile = descriptor.file, extraCols = c("age"))
All done now. Sample analysis:
## bigScale.R - Replicate the analysis from http://bit.ly/aTFXeN with normal R ## http://info.revolutionanalytics.com/bigdata.html ## See big.R for the preprocessing of the data ## Load required libraries library("biglm") library("bigmemory") library("biganalytics") library("bigtabulate") ## Use parallel processing if available ## (Multicore is for "anything-but-Windows" platforms) if ( require("multicore") ) { library("doMC") registerDoMC() } else { warning("Consider registering a multi-core 'foreach' processor.") } day.names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday") ## Attach to the data descriptor.file <- "airlines.des" data <- attach.big.matrix(dget(descriptor.file)) ## Replicate Table 5 in the Revolutions document: ## Table 5 t.5 <- bigtabulate(data, ccols = "DayOfWeek", summary.cols = "ArrDelay", summary.na.rm = TRUE) ## Pretty-fy the outout stat.names <- dimnames(t.5.2$summary[[1]])[2][[1]] t.5.p <- cbind(matrix(unlist(t.5$summary), byrow = TRUE, nrow = length(t.5$summary), ncol = length(stat.names), dimnames = list(day.names, stat.names)), ValidObs = t.5$table) print(t.5.p) # min max mean sd NAs ValidObs # Monday -1410 1879 6.669515 30.17812 385262 18136111 # Tuesday -1426 2137 5.960421 29.06076 417965 18061938 # Wednesday -1405 2598 7.091502 30.37856 405286 18103222 # Thursday -1395 2453 8.945047 32.30101 400077 18083800 # Friday -1437 1808 9.606953 33.07271 384009 18091338 # Saturday -1280 1942 4.187419 28.29972 298328 15915382 # Sunday -1295 2461 6.525040 31.11353 296602 17143178 ## Figure 1 plot(t.5.p[, "mean"], type = "l", ylab="Average arrival delay")
Just like the Revolutions paper. You can now use biglm.big.matrix and bigglm.big.matrix for basic regression and there are also k-means clustering and other functions.
I must admit here that I do not understand the Revolutions regression example, so I have not attempted to replicate it here. It seems kind of sad if they change the syntax to be incompatible with standard R formulas, which is what appears to be happening.
Credit to Michael Kane and Jay Emerson of Yale who showed much of this in their poster The Airline Data Set... What's the big deal?.
Subscribe to CYBAEA Data and Analysis
Jump to comments.
Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
Getting started with the Heritage Health Price competition
The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform .
Benchmarking feature selection with Boruta and caret
Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, …
A warning on the R save format
The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a d…
Join the discussion
error
I get an error:
> for (year in 1987:2008) {
+ file.name <- paste(year, "csv.bz2", sep = ".")
+ if ( !file.exists(file.name) ) {
+ url.text <- paste("http://stat-computing.org/dataexpo/2009/",
+ year, ".csv.bz2", sep = "")
+ cat("Downloading missing data file ", file.name, "\n", sep = "")
+ download.file(url.text)
+ }
+ }
Downloading missing data file 1987.csv.bz2
Error in .Internal(download(url, destfile, quiet, mode, cacheOK)) :
'destfile' is missing
>
Code updated
@jrara: My apologies! The code has now been updated to
download.file(url.text, file.name)
It should work now.
big data: performance
Great! Thanks a lot for this post!
I don't have the disk space to replicate it, so can you tell how long does it take to run and on what machine?
Timings
@Michael - The relevant timings are:
> system.time( data <- attach.big.matrix(dget(descriptor.file)) )
user system elapsed
0.004 0.001 0.023
> system.time( t.5 <- bigtabulate(data, ccols = "DayOfWeek", summary.cols = "ArrDelay", summary.na.rm = TRUE) )
user system elapsed
14.366 1.730 16.119
The rest is standard R on small data sets.
System is a quad-core Intel Xeon E5420 @ 2.50GHz and a relatively slow eSATA hard disk.
Hope this is useful.
Allan
re: timings
Allan, thanks. I was curious roughly how fast this analysis runs. And it's pretty fast I have to say... :)
Thanks again for the code, and for the work on 'big*' packages too.
script in Windows
I've tried to run the script in R using Windows. When I try to get the column labels from the 2008.csv.bz2 file it crashes out. So in fact I need a bigmemory pointer to just one file. Have you tried this in the Windows environment? Any remedies for Windows limitations?
@larry
Try using the nrows= parameter to read.csv, e.g.
d <- read.csv("2008.csv.bz2", nrows = 1e5)
This may get you over any memory or performance issues at this stage in the processing. Adjust as needed for your computer.