The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform.

We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.

## Prerequisites

Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.

## Data preparation

Let’s load the data into R and do some basic housekeeping:

#!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ ############################## #### DATA PREPARATION ##++++ ## Members members <- read.csv(file = "HHP_release1/Members_Y1.csv", colClasses = rep("factor", 3), comment.char = "") ##---- ##++++ ## Claims claims.Y1 <- read.csv(file = "HHP_release1/Claims_Y1.csv", colClasses = c( rep("factor", 7), "integer", # paydelay "character", # LengthOfStay "character", # dsfs "factor", # PrimaryConditionGroup "character" # CharlsonIndex ), comment.char = "") ## Utility function make.numeric <- function (cv, FUN = mean) { ### make a character vector numeric by splitting on '-' sapply(strsplit(gsub("[^[:digit:]]+", " ", cv, perl = TRUE), " ", fixed = TRUE), function (x) FUN(as.numeric(x))) } ## Length of stay as days { z <- make.numeric(claims.Y1$LengthOfStay) z.week <- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE) z[z.week] <- z[z.week] * 7 # Weeks are 7 days z[is.nan(z)] <- 0 claims.Y1$LengthOfStay.days <- z } los.levels <- c("", "1 day", sprintf("%d days", 2:6), "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks", "12-26 weeks", "26+ weeks") stopifnot(all(claims.Y1$LengthOfStay %in% los.levels)) claims.Y1$LengthOfStay <- factor(claims.Y1$LengthOfStay, levels = los.levels, labels = c("0 days", los.levels[-1]), ordered = TRUE) ## Months since first claim claims.Y1$dsfs.months <- make.numeric(claims.Y1$dsfs) ## dsfs is an ordered factor and gives the ordering of the claims dsfs.levels <- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12)) claims.Y1$dsfs <- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE) ## Index as numeric claims.Y1$CharlsonIndex.numeric <- make.numeric(claims.Y1$CharlsonIndex) claims.Y1$CharlsonIndex <- factor(claims.Y1$CharlsonIndex, ordered = TRUE) ##---- ##++++ ## Days in hospital dih.Y2 <- read.csv(file = "HHP_release1/DayInHospital_Y2.csv", colClasses = c("factor", "integer"), comment.char = "") names(dih.Y2)[1] <- "MemberID" # Fix broken file ##---- save(members, claims.Y1, dih.Y2, file = "HHPR1.RData") ##############################

## Scoring

We will need a function to score our predictions `p`

against the actual values `a`

. The formula is on the evaluation page and we implement it as:

#!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ ############################## #### FUNCTION TO CALCULATE SCORE HPPScore <- function (p, a) { ### Scorng function after ### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation ### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&m=2226#post2226 sqrt(mean((log(1+p, 10) - log(1+a, 10))^2)) } ##############################

## The simplest benchmarks

The simplest models don’t really model at all: they just use the average and are simple benchmarks.

#!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ y <- dih.Y2$DaysInHospital_Y2 # Actual p <- rep(mean(y), NROW(dih.Y2)) cat(sprintf("Score using mean : %8.6f\n", HPPScore(p, y))) # Score using mean : 0.278725 p <- rep(median(y), NROW(dih.Y2)) cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y))) # Score using median: 0.267969

## Simple single-variable linear models

OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:

#!/usr/bin/Rscript ## example001.R - simple benchmarks for the HHP ## Copyright © 2011 CYBAEA Limited - http://www.cybaea.net/ library("reshape2") vars <- dcast(claims.Y1, MemberID ~ ., sum, value_var = "LengthOfStay.days") names(vars)[2] <- "LengthOfStay" data <- merge(vars, dih.Y2) model <- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data) p <- predict(model) cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y))) # Score using lm(LengthOfStay): 0.279062 model <- glm(DaysInHospital_Y2 ~ LengthOfStay, family = quasipoisson(), data = data) p <- predict(model, type="response") cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y))) # Score using glm(LengthOfStay): 0.278914

Let the competition begin.