A warning on the R save format

Posted in CYBAEA Data:

The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.

How to lose your data with save()

I am using Windows on my travel laptop and Linux on my workstation. To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.

To illustrate the problem with the save file format, I created a file on the Linux machine simply as:

a <- list(data = 1:10, fun = mclapply)
save(a, file = "a.RData")

What could be simpler? mclapply is a function from the ‘multicore’ package, but it clearly has no bearing on the stored data. (We will show a more realistic example below; work with me here.)

But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:

Error in loadNamespace(name) : there is no package called 'multicore'

There is no way of getting to your precious data without installing the missing package.

If the package has been withdrawn or is no longer available then your data is basically lost.
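
Why does a function that has nothing to do with the data block the load? A minimal sketch of the mechanism, using stats::median as a stand-in for mclapply (an assumption made only because the stats package ships with every R installation): a package function is a closure whose enclosing environment is the package namespace, and the save file records that namespace by name, so loading has to find the package again.

```r
# stats::median stands in for mclapply: stats ships with every R installation
pkgFun <- stats::median
ownFun <- function(x) x   # defined at top level, not inside any package

pkgEnv <- environment(pkgFun)
isNamespace(pkgEnv)                 # TRUE: bound to a package namespace
environmentName(pkgEnv)             # "stats"
isNamespace(environment(ownFun))    # FALSE: no package has to be reloaded
```

A function like ownFun can still be the thing that calls into a package, which is the basis of the indirection trick discussed further down.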

What can you do?

Some suggestions from the helpful people on R-help:

  1. (Uwe Ligges): You could try to rewrite ./src/main/saveload.c and serialize.c in the R sources to extract only the parts you need. “This is probably not worth the effort.”
  2. (Prof. Brian Ripley): You could try installing the missing package; R CMD INSTALL --fake should be sufficient to let you load the data. He also suggests that the proposal above would be very hard indeed.
  3. (Martin Morgan): Don't store package functions with your data.

That is three good answers from three of the heavyweights in the R community. Thank you all!

Martin’s comment is worth expanding. We can change the above example to:

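computeFunction <- function(...) {
    ## mclapply is looked up only when the function is called,
    ## so no reference to the package is stored in the save file
    if (require(multicore)) mclapply(...)
    else lapply(...)
}
a <- list(data = 1:10, fun = computeFunction)
save(a, file = "a.RData")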

Now everything works fine! No data is horribly lost: the file loads fine on the ‘multicore’-less machine.
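
One way to convince yourself before you travel is a round trip: save the object, load it into a fresh environment, and call the stored function. The following is only a sketch. safeApply is a hypothetical name, and it uses requireNamespace() rather than require() so that nothing gets attached as a side effect; that is a variation on the code above, not what I originally ran.

```r
# safeApply is a hypothetical wrapper; it looks up mclapply only when called
safeApply <- function(X, FUN, ...) {
    if (requireNamespace("multicore", quietly = TRUE)) {
        multicore::mclapply(X, FUN, ...)
    } else {
        lapply(X, FUN, ...)
    }
}

a <- list(data = 1:10, fun = safeApply)
path <- tempfile(fileext = ".RData")
save(a, file = path)

# Load into a fresh environment so the check does not overwrite anything
restored <- new.env()
load(path, envir = restored)

# The stored function is not bound to a package namespace
boundToPackage <- isNamespace(environment(restored$a$fun))

# The data came back intact and the function is still callable
dataIntact <- identical(restored$a$data, 1:10)
squares <- restored$a$fun(1:3, function(i) i^2)
```

A check in the same session cannot fully stand in for the other machine, of course; the real test is still loading the file where the package is missing.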

And now for the more realistic example. I had been using caret::rfe, as Martin knew, and the example he provided reflects that:


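library(caret)      # provides rfe, rfeControl, lmFuncs and the BloodBrain data
library(multicore)  # provides mclapply
data(BloodBrain)    # loads bbbDescr and logBBB

x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)

lmProfile <- rfe(x, logBBB,
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),
                 rfeControl = rfeControl(functions = lmFuncs,
                   number = 5,
                   ## Assumed completion: the call is cut off after number = 5;
                   ## passing mclapply here is what stores the reference
                   computeFunction = mclapply))
save(lmProfile, file = "lmProfile.RData")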

It is slightly less obvious that this code contains a reference to an external namespace, but it is easy enough to see if you know what to look for.
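
If you would rather not rely on eyeballing, the search can be sketched as a small recursive helper. findNamespaceRefs is a hypothetical function of my own naming, not part of base R or caret, and the nested list below is made-up stand-in data built from stats and utils functions so that it runs anywhere.

```r
# findNamespaceRefs: hypothetical helper, not part of base R or caret.
# Walks a (possibly nested) list and returns the names of the package
# namespaces that any function inside it is bound to.
findNamespaceRefs <- function(obj) {
    if (is.function(obj)) {
        env <- environment(obj)   # NULL for primitives such as sum
        if (!is.null(env) && isNamespace(env)) return(environmentName(env))
        return(character(0))
    }
    if (is.list(obj)) {
        found <- unlist(lapply(obj, findNamespaceRefs), use.names = FALSE)
        return(as.character(unique(found)))
    }
    character(0)
}

# Made-up object shaped loosely like a fitted result with a control list
nested <- list(data = 1:10,
               control = list(functions = list(fit = stats::lm),
                              computeFunction = utils::head))
findNamespaceRefs(nested)   # "stats" "utils"
```

This only looks at lists and the functions directly inside them. Other places an environment can hide, such as formulas, S4 slots, or the enclosing environments of your own closures, are not inspected, so treat an empty result as a hint rather than a guarantee.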

For old files I will use the R CMD INSTALL --fake suggestion, but for new data I am going with the last approach and using a computeFunction like this:

### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl 
### that does not leave a reference to the multicore package in the save file
MCCompute <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    ## Use mclapply when multicore is available, otherwise fall back to lapply
    if (require("multicore")) mclapply(X, FUN, ...)
    else lapply(X, FUN, ...)
}

I know that Max Kuhn is rewriting the caret package, which should make this a moot point in the near future for that specific case. But the indirection approach is generally useful and will also be relevant in other situations.


My recommendations:

  1. Save data in a data format rather than with the save() function, which is really for objects (data and code). Suitable formats include CSV and its variants, HDF5, and CDF, among others.
  2. Avoid references to packages in your objects by using the one-level indirection trick exemplified by the MCCompute function shown above.
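
For the first recommendation, a minimal sketch with CSV, using only base R. The object and the column name value are made up for illustration, and mean stands in for whatever package function happens to ride along with your data.

```r
# An object that mixes data and code, as in the examples above
a <- list(data = 1:10, fun = mean)

# Write only the data part to a plain-text format
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(value = a$data), file = csvPath, row.names = FALSE)

# Reading it back needs no package beyond base R
restored <- read.csv(csvPath)
roundTripOk <- identical(restored$value, a$data)
```

The price is that CSV carries no type information, so factors, dates and attributes need to be rebuilt after reading. For simple numeric data that is a small cost for a file any machine can open.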

What is your approach? Suggestions in the comments below, please.
