Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.

The caret package provides a very flexible framework for the analysis as we shall see, but first we set up the artificial test data set as in the previous article.

## Feature-bc.R - Compare Boruta and caret feature selection ## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/) run.name <- "feature-bc" library("caret") ## Load early to get the warnings out of the way: library("randomForest") library("ipred") library("gbm") set.seed(1) ## Set up artificial test data for our analysis n.var <- 20 n.obs <- 200 x <- data.frame(V = matrix(rnorm(n.var*n.obs), n.obs, n.var)) n.dep <- floor(n.var/5) cat( "Number of dependent variables is", n.dep, "\n") m <- diag(n.dep:1) ## These are our four test targets y.1 <- factor( ifelse( x$V.1 >= 0, 'A', 'B' ) ) y.2 <- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" ) y.2 <- factor(y.2) y.3 <- factor(rowSums(x[, 1:n.dep] >= 0)) y.4 <- factor(rowSums(x[, 1:n.dep] >= 0) %% 2)

The flexibility of the caret package is to a large extent implemented by using control objects. Here we specify to use the `randomForest`

classification algorithm (which is also what Boruta uses) and if the multicore package is available then we use that for extra perfomance (you can also use MPI etc – see the documentation):

control <- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE, returnResamp = "final", number = 50) if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) { control$workers <- multicore:::detectCores() control$computeFunction <- mclapply control$computeArgs <- list(mc.preschedule = FALSE, mc.set.seed = FALSE) }

We will consider from one to six features (using the `sizes`

variable) and then we simply let it lose:

sizes <- 1:6 ## Use randomForest for prediction profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control) cat( "rf : Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control) cat( "rf : Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control) cat( "rf : Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control) cat( "rf : Profile 4 predictors:", predictors(profile.4), fill = TRUE )

The results are:

rf : Profile 1 predictors: V.1 V.16 V.6 rf : Profile 2 predictors: V.1 V.2 rf : Profile 3 predictors: V.4 V.1 V.2 rf : Profile 4 predictors: V.10 V.11 V.7

If you recall the feature selection with Boruta article, then the results there were

- Profile 1:
`V.1, (V.16, V.17)`

- Profile 2:
`V.1, V.2, V,3, (V.8, V.9, V.4)`

- Profile 3:
`V.1, V.4, V.3, V.2, (V.7, V.6)`

- Profile 4:
`V.10, (V.11, V.13)`

To show the flexibility of caret, we can run the analysis with another of the built-in classifiers:

## Use ipred::ipredbag for prediction control$functions <- treebagFuncs profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control) cat( "treebag: Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control) cat( "treebag: Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control) cat( "treebag: Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control) cat( "treebag: Profile 4 predictors:", predictors(profile.4), fill = TRUE )

This gives:

treebag: Profile 1 predictors: V.1 V.16 treebag: Profile 2 predictors: V.2 V.1 treebag: Profile 3 predictors: V.1 V.3 V.2 treebag: Profile 4 predictors: V.10 V.11 V.1 V.7 V.13

And of course, if you have your own favourite model class that is not already implemented, then you can easily do that yourself. We like `gbm`

from the package of the same name, which is kind of silly to use here because it provides variable importance automatically as part of the fitting process, but may still be useful. It needs numeric predictors so we do:

## Use gbm for prediction y.1 <- as.numeric(y.1)-1 y.2 <- as.numeric(y.2)-1 y.3 <- as.numeric(y.3)-1 y.4 <- as.numeric(y.4)-1 gbmFuncs <- treebagFuncs gbmFuncs$fit <- function (x, y, first, last, ...) { library("gbm") n.levels <- length(unique(y)) if ( n.levels == 2 ) { distribution = "bernoulli" } else { distribution = "gaussian" } gbm.fit(x, y, distribution = distribution, ...) } gbmFuncs$pred <- function (object, x) { n.trees <- suppressWarnings(gbm.perf(object, plot.it = FALSE, method = "OOB")) if ( n.trees <= 0 ) n.trees <- object$n.trees predict(object, x, n.trees = n.trees, type = "link") } control$functions <- gbmFuncs n.trees <- 1e2 # Default value for gbm is 100 profile.1 <- rfe(x, y.1, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 1 predictors:", predictors(profile.1), fill = TRUE ) profile.2 <- rfe(x, y.2, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 2 predictors:", predictors(profile.2), fill = TRUE ) profile.3 <- rfe(x, y.3, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 3 predictors:", predictors(profile.3), fill = TRUE ) profile.4 <- rfe(x, y.4, sizes = sizes, rfeControl = control, verbose = FALSE, n.trees = n.trees) cat( "gbm : Profile 4 predictors:", predictors(profile.4), fill = TRUE )

And we get the results below:

gbm : Profile 1 predictors: V.1 V.10 V.11 V.12 V.13 gbm : Profile 2 predictors: V.1 V.2 gbm : Profile 3 predictors: V.4 V.1 V.2 V.3 V.7 gbm : Profile 4 predictors: V.11 V.10 V.1 V.6 V.7 V.18

It is all good and very flexible, for sure, but I can’t really say it is better than the Boruta approach for these simple examples.