The US$ 3 million Heritage Health Price [link to archive.org] competition is on, so we take a look at how to get started using the R statistical computing and analysis platform.
We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.
Prerequisites
Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.
Data preparation
Let’s load the data into R and do some basic housekeeping:
Code #!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
##############################
#### DATA PREPARATION
##++++
## Members
members <- read.csv ( file = "HHP_release1/Members_Y1.csv" ,
colClasses = rep ( "factor" , 3 ) ,
comment.char = "" )
##----
##++++
## Claims
claims.Y1 <- read.csv ( file = "HHP_release1/Claims_Y1.csv" ,
colClasses = c (
rep ( "factor" , 7 ) ,
"integer" , # paydelay
"character" , # LengthOfStay
"character" , # dsfs
"factor" , # PrimaryConditionGroup
"character" # CharlsonIndex
) ,
comment.char = "" )
## Utility function
make.numeric <- function ( cv , FUN = mean ) {
### make a character vector numeric by splitting on '-'
sapply ( strsplit ( gsub ( "[^[:digit:]]+" ,
" " ,
cv ,
perl = TRUE ) ,
" " ,
fixed = TRUE ) ,
function ( x ) FUN ( as.numeric ( x ) ) )
}
## Length of stay as days
{
z <- make.numeric ( claims.Y1 $ LengthOfStay )
z.week <- grepl ( "week" , claims.Y1 $ LengthOfStay , fixed = TRUE )
z [ z.week ] <- z [ z.week ] * 7 # Weeks are 7 days
z [ is.nan ( z ) ] <- 0
claims.Y1 $ LengthOfStay.days <- z
}
los.levels <- c ( "" , "1 day" , sprintf ( "%d days" , 2 : 6 ) ,
"1- 2 weeks" , "2- 4 weeks" , "4- 8 weeks" , "8-12 weeks" ,
"12-26 weeks" , "26+ weeks" )
stopifnot ( all ( claims.Y1 $ LengthOfStay %in% los.levels ) )
claims.Y1 $ LengthOfStay <- factor ( claims.Y1 $ LengthOfStay ,
levels = los.levels ,
labels = c ( "0 days" , los.levels [ - 1 ] ) ,
ordered = TRUE )
## Months since first claim
claims.Y1 $ dsfs.months <- make.numeric ( claims.Y1 $ dsfs )
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c ( "0- 1 month" , sprintf ( "%d-%2d months" , 1 : 11 , 2 : 12 ) )
claims.Y1 $ dsfs <- factor ( claims.Y1 $ dsfs , levels = dsfs.levels , ordered = TRUE )
## Index as numeric
claims.Y1 $ CharlsonIndex.numeric <- make.numeric ( claims.Y1 $ CharlsonIndex )
claims.Y1 $ CharlsonIndex <- factor ( claims.Y1 $ CharlsonIndex , ordered = TRUE )
##----
##++++
## Days in hospital
dih.Y2 <- read.csv ( file = "HHP_release1/DayInHospital_Y2.csv" ,
colClasses = c ( "factor" , "integer" ) ,
comment.char = "" )
names ( dih.Y2 ) [ 1 ] <- "MemberID" # Fix broken file
##----
save ( members , claims.Y1 , dih.Y2 ,
file = "HHPR1.RData" )
##############################
Scoring
We will need a function to score our predictions p against the actual values a. The formula is on the evaluation page and we implement it as:
Code #!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function ( p , a ) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
sqrt ( mean ( ( log ( 1 + p , 10 ) - log ( 1 + a , 10 ) ) ^ 2 ) )
}
##############################
The simplest benchmarks
The simplest models don’t really model at all: they just use the average and are simple benchmarks.
Code #!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
y <- dih.Y2 $ DaysInHospital_Y2 # Actual
p <- rep ( mean ( y ) , NROW ( dih.Y2 ) )
cat ( sprintf ( "Score using mean : %8.6f\n" , HPPScore ( p , y ) ) )
# Score using mean : 0.278725
p <- rep ( median ( y ) , NROW ( dih.Y2 ) )
cat ( sprintf ( "Score using median: %8.6f\n" , HPPScore ( p , y ) ) )
# Score using median: 0.267969
Simple single-variable linear models
OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:
Code #!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
library ( "reshape2" )
vars <- dcast ( claims.Y1 , MemberID ~ . , sum , value_var = "LengthOfStay.days" )
names ( vars ) [ 2 ] <- "LengthOfStay"
data <- merge ( vars , dih.Y2 )
model <- lm ( DaysInHospital_Y2 ~ LengthOfStay , data = data )
p <- predict ( model )
cat ( sprintf ( "Score using lm(LengthOfStay): %8.6f\n" , HPPScore ( p , y ) ) )
# Score using lm(LengthOfStay): 0.279062
model <- glm ( DaysInHospital_Y2 ~ LengthOfStay ,
family = quasipoisson ( ) ,
data = data )
p <- predict ( model , type= "response" )
cat ( sprintf ( "Score using glm(LengthOfStay): %8.6f\n" , HPPScore ( p , y ) ) )
# Score using glm(LengthOfStay): 0.278914
Let the competition begin.