Getting started with the Heritage Health Price competition

The US$ 3 million Heritage Health Price [link to archive.org] competition is on, so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

Data preparation

Let’s load the data into R and do some basic housekeeping:

Code

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### DATA PREPARATION

##++++
## Members
members   <- read.csv(file = "HHP_release1/Members_Y1.csv",
                      colClasses = rep("factor", 3),
                      comment.char = "")
##----
##++++
## Claims
claims.Y1 <- read.csv(file = "HHP_release1/Claims_Y1.csv",
                      colClasses = c(
                          rep("factor", 7),
                          "integer",    # paydelay
                          "character",  # LengthOfStay
                          "character",  # dsfs
                          "factor",     # PrimaryConditionGroup
                          "character"   # CharlsonIndex
                          ),
                      comment.char = "")
## Utility function
make.numeric <- function(cv, FUN = mean) {
### make a character vector numeric by splitting on '-'
    sapply(strsplit(gsub("[^[:digit:]]+",
                         " ",
                         cv,
                         perl = TRUE),
                    " ",
                    fixed = TRUE),
           function(x) FUN(as.numeric(x)))
}
## Length of stay as days
{
    z <- make.numeric(claims.Y1$LengthOfStay)
    z.week <- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE)
    z[z.week] <- z[z.week] * 7          # Weeks are 7 days
    z[is.nan(z)] <- 0
    claims.Y1$LengthOfStay.days <- z
}
los.levels <- c("", "1 day", sprintf("%d days", 2:6),
                "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",
                "12-26 weeks", "26+ weeks")
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels))
claims.Y1$LengthOfStay <- factor(claims.Y1$LengthOfStay,
                                 levels = los.levels,
                                 labels = c("0 days", los.levels[-1]),
                                 ordered = TRUE)
## Months since first claim
claims.Y1$dsfs.months <- make.numeric(claims.Y1$dsfs)
## dsfs is an ordered factor and gives the ordering of the claims
dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))
claims.Y1$dsfs <- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE)
## Index as numeric
claims.Y1$CharlsonIndex.numeric <- make.numeric(claims.Y1$CharlsonIndex)
claims.Y1$CharlsonIndex <- factor(claims.Y1$CharlsonIndex, ordered = TRUE)
##----
##++++
## Days in hospital
dih.Y2    <- read.csv(file = "HHP_release1/DayInHospital_Y2.csv",
                      colClasses = c("factor", "integer"),
                      comment.char = "")
names(dih.Y2)[1] <- "MemberID"          # Fix broken file
##----
save(members, claims.Y1, dih.Y2,
     file = "HHPR1.RData")
##############################

Scoring

We will need a function to score our predictions p against the actual values a. The formula is on the evaluation page and we implement it as:

Code

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

##############################
#### FUNCTION TO CALCULATE SCORE
HPPScore <- function(p, a) {
### Scorng function after
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226
    sqrt(mean((log(1 + p, 10) - log(1 + a, 10))^2))
}
##############################

The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

Code

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/

y <- dih.Y2$DaysInHospital_Y2           # Actual
p <- rep(mean(y), NROW(dih.Y2))
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))
# Score using mean  : 0.278725

p <- rep(median(y), NROW(dih.Y2))
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))
# Score using median: 0.267969

Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

Code

#!/usr/bin/Rscript
## example001.R - simple benchmarks for the HHP
## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/
library("reshape2")

vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days")
names(vars)[2] <- "LengthOfStay"
data <- merge(vars, dih.Y2)

model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)
p <- predict(model)
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using lm(LengthOfStay): 0.279062

model <- glm(DaysInHospital_Y2 ~ LengthOfStay,
             family = quasipoisson(),
             data = data)
p <- predict(model, type="response")
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))
# Score using glm(LengthOfStay): 0.278914

Let the competition begin.