2011-08-23 07:20:00 Allan Engelhardt wrote in CYBAEA Data and Analysis:
The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.
I am using Windows on my travel laptop and Linux on my workstation. To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.
To illustrate the problem with the save file format, I created a file on the Linux machine simply as:
library("multicore")
a <- list(data = 1:10, fun = mclapply)
save(a, file = "a.RData")
What could be simpler? The mclapply is a function from the ‘multicore’ package but it clearly has no impact on the stored data. (We will show a more realistic example below – work with me here.)
But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:
Error in loadNamespace(name) : there is no package called 'multicore'
There is no way of getting to your precious data without installing the missing package.
If the package has been withdrawn or is no longer available then your data is basically lost.
Some suggestions from the helpful people on R-help:
[…] ./src/main/saveload.R and serialize.R to extract only the parts you need. “This is probably not worth the effort.”
[…] R CMD INSTALL --fake should be sufficient to let you load the data. Also suggests that the proposal above would be very hard indeed.
That is three good answers from three of the heavy-weights in the R community. Thank you all!
Martin’s comment is worth expanding. We can change the above example to:
library("multicore")
computeFunction <- function(...) {
    if (require(multicore)) mclapply(...)
    else lapply(...)
}
a <- list(data = 1:10, fun = computeFunction)
save(a, file = "a.RData")
Now everything works fine! No data is horribly lost: the file loads fine on the ‘multicore’-less machine.
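One way to see why the wrapper helps, sketched here with base R only: a function taken directly from a package has that package's namespace as its environment, and the save file has to record that reference, whereas a wrapper defined in your own session only mentions the package function by name. The example uses stats::median as a stand-in for mclapply (an assumption made purely for illustration, since ‘stats’ ships with every R installation), and the names direct, wrapper and indirect are hypothetical.

```r
## Stand-in for mclapply: stats::median lives in the 'stats' namespace.
direct <- list(data = 1:10, fun = stats::median)

## Wrapper defined in the session; the package function is only looked up
## by name at the moment the wrapper is called.
wrapper <- function(...) {
    if (requireNamespace("stats", quietly = TRUE)) stats::median(...)
    else NULL
}
indirect <- list(data = 1:10, fun = wrapper)

## TRUE means the function's environment is a package namespace.
directIsNs <- isNamespace(environment(direct$fun))
indirectIsNs <- isNamespace(environment(indirect$fun))
print(c(direct = directIsNs, indirect = indirectIsNs))
```

Only the first object drags a namespace along with it; the second stores nothing but the wrapper's own code.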
And for the more realistic example, I had been using caret::rfe as Martin knew in the example he provided:
library("caret")
library("multicore")  # provides mclapply, used as computeFunction below
data(BloodBrain)
x <- scale(bbbDescr[, -nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
set.seed(1)
lmProfile <- rfe(x, logBBB,
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),
                 rfeControl = rfeControl(functions = lmFuncs,
                                         number = 5,
                                         computeFunction = mclapply))
save(lmProfile, file = "lmProfile.RData")
Slightly less obvious that there is a reference to the external namespace in this code, but easy enough to see if you know what to look for.
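If you would rather not rely on spotting such references by eye, a rough check before saving might look like the sketch below. The helper findNamespaceRefs is hypothetical (it is not part of caret or base R). It only descends into plain lists and looks at the environment of each function it finds. It does not inspect attributes, S4 slots or environments captured by closures, so an empty result is a hint rather than a guarantee.

```r
## Hypothetical helper: names of the package namespaces that functions
## stored inside `x` belong to.
findNamespaceRefs <- function(x) {
    refs <- character(0)
    visit <- function(obj) {
        if (is.function(obj)) {
            env <- environment(obj)      # NULL for primitives such as sum
            if (!is.null(env) && isNamespace(env))
                refs <<- c(refs, getNamespaceName(env))
        } else if (is.list(obj)) {
            for (el in obj) visit(el)    # recurse into list elements
        }
    }
    visit(x)
    unique(unname(refs))
}

## stats::median stands in for a function from an add-on package.
risky <- list(data = 1:10, fun = stats::median)
safe <- list(data = 1:10, fun = function(x) x)
print(findNamespaceRefs(risky))
print(findNamespaceRefs(safe))
```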
For old files I will use the R CMD INSTALL --fake suggestion, but for new data I am going with the last approach and using a computeFunction like this:
### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl
### that does not leave a reference to the multicore package in the save file
MCCompute <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    if (require("multicore")) mclapply(X, FUN, ...)
    else lapply(X, FUN, ...)
}
I know that Max Kuhn is rewriting the caret package which should make this a moot point in the near future for that specific case. But the indirection approach is generally useful and will also be relevant in other situations.
My recommendations:
[…] save() function which is really for objects (data and code). Suitable formats include CSV and variants, HDF5, and CDF, as well as others.
[…] MCCompute function shown.
What is your approach? Suggestions in the comments below, please.
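The advice to keep data in a plain data format rather than in a save() file can be sketched as follows, assuming the data fits a simple table. The temporary file and the column name value are placeholders chosen for this example.

```r
a <- list(data = 1:10)

## Write only the data, as plain text, to a temporary CSV file.
csvFile <- tempfile(fileext = ".csv")
write.csv(data.frame(value = a$data), file = csvFile, row.names = FALSE)

## Reading it back needs nothing beyond base R, on any platform.
restored <- read.csv(csvFile)
print(restored$value)
```

A file like this can be opened with no add-on packages installed, because it contains no code at all.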
Join the discussion
This is why I use a Mac
Not R but data related… I got sick of losing my data so many times on Windows, so now I use a Mac and its Time Machine feature to get my older files back. I've never lost an R or any other file or version of a file since then.
It's good to have multiple layers of file protection.
agreed, with a little caveat
Agreed that data should usually be saved in a standard, text-based data format like CSV (or a relational database).
Note that you can use save in a slightly less risky way by setting ascii = TRUE. (In that case it works much like dput.)
The main advantage of the save/load functions is that they work much quicker than read.csv. So for big datasets that you want to read repeatedly, it can be useful to store them in R's binary format *as well as* in a plain text format.
Save Edits
I save all my editor scripts and my data in separate files. That way, even if I have to re-install R (new version or whatever), everything can be re-run starting from scratch.
new parallel package in R >= 2.14.0
R (>= 2.14.0) has a built-in package 'parallel' that replaces both the 'multicore' and 'snow' packages. It runs on all operating systems.
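Following this suggestion, the indirection idea from the post could be written against 'parallel' instead of 'multicore'. This is only a sketch and the name PCompute is hypothetical. It assumes that forking should be skipped on Windows, where mclapply cannot use more than one core, and falls back to lapply there.

```r
## Hypothetical wrapper: uses parallel::mclapply where forking is available
## and plain lapply otherwise. The package is only referenced by name in the
## function body, so saving this function stores no namespace reference.
PCompute <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    useFork <- .Platform$OS.type != "windows" &&
        requireNamespace("parallel", quietly = TRUE)
    if (useFork) parallel::mclapply(X, FUN, ...)
    else lapply(X, FUN, ...)
}

squares <- PCompute(1:3, function(x) x^2)
print(unlist(squares))
```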