Net Promoter Score confidence intervals 2: maths and stats

We looked at a rule of thumb for confidence intervals and what that means for a business manager in our previous post. Now we do the maths, stats, and R code for the practitioner.

The Net Promoter Score (NPS) is a metric for measuring the loyalty and engagement of a firm’s customers that is extremely popular in across many industries. Originally introduced by Fred Reichheld in a 2003 article in the Harvard Business Review (link); the current reference is the book The Ultimate Question 2.0 by Reichheld and Markey (Amazon paid link UK | US).

The score is calculated by asking the firm’s customers a single question, How likely is it that you would recommend our company/product/service to a friend or colleague?, on a scale from 0 (very unlikely) to 10 (very likely). Those customers who respond 9 or 10 are considered Promoters, those who score 7 or 8 are Passives, and the rest scoring 0-6 are Detractors. The Net Promoter Score (NPS) is then simply the percentage Promoters minus the percentage Detractors, which gives a number between -100 and +100.

Our concern here is estimating the measurement error of the score given the responses to the survey.

Maths and Stats: Closed form for the variance

Let’s start by deriving the closed form for the variance of the Net Promoter Score. Skip this one if you only want the code.

Let $ n $ be the number of respondents to our survey and let $ p $ and $ d $ be the proportion of respondents who are Promoters and Detractors, respectively. Then

\[ NPS = 100 \times ( p - d ) \]

and by the usual formula for variance we have

\[ \Var(NPS) = 100^2 \times ( \Var(p) + \Var(d) - 2 \times \Cov(p, d) ) \]

where $ \Cov $ denotes the covariance.

We can find the variance and covariance from the simple sample and the Categorical Distribution so $ \Var(p) = p \times (1-p) / n $ and $ \Cov(p, d) = {-p} d / n $ where we assume the total population $ N $ of possible respondents is much greater than $ n $. We then get

\[ \begin{align} \Var(NPS) & = 100^2 \times \frac{p(1-p) + d(1-d) + 2pd}{n} && \text{or,}\\\\\\\\ \sd(NPS) & = 100 \times \sqrt{\frac{p(1-p) + d(1-d) + 2pd}{n}} \end{align} \]

where $ \sd $ denotes the standard deviation.

For the special case $ p = d = ⅓ $ (which is one case of a zero NPS) we considered in the previous post and using the normal approximation for the 95% confidence interval $ \CI \approx \pm 1.96\sd $, we get for the full width of the interval

\[ \begin{align} \CI & \approx \frac{320}{\sqrt{n}} && \text{so,} \\\\\\\\ n & \approx \left( 320 / \CI \right)^2 \end{align} \]

For a full width confidence interval of 10 or 1, you need 1024 or 102,400 responses, respectively, which gives us the rule of thumb we mentioned in the previous post:

For the very impatient, a good rule of thumb is that with 1000 responses you (only) know your Net Promoter Score to within 10 points (±5) and it takes fully 100,000 responses to know the score to one point (±0.5).

Practical implementation using R

Knowing the formula for the variance is all well and good but we are data science people so let’s work with the data to create the confidence intervals from the data in a way that works even when the score is near 100 and which extends to other values you may want to derive from the data. In the code examples below we use the R software environment for statistical computing and graphics but it should be easy to translate to other languages.

To create our confidence intervals we use resampling techniques, specifically the bootstrap method as implemented in the boot library. The code below implements a simple function for bootstrapping the errors.

library("boot")
## Function for bootstrapping the sample error estimates
NPS.boot <- function(x, R = 1e4, conf = 0.95, type = "basic") {
  NPS <- function(x) 100*(sum(x >= 9) - sum(x <= 6))/length(x)
  b   <- boot(x, function(data, i) NPS(data[i]), R = R)
  ci  <- boot.ci(b, conf = conf, type = type)
  res <- as.list(c(NPS(x), length(x), ci[[type]][1, c(1, 4:5)]))
  names(res) <- c("NPS", "n", "conf", "Low", "High")
  return(res)
}

In the code, x is a vector of responses to the key NPS question (Likelihood to recommend), R is the number of resamples you want to do, and conf is the confidence interval. We take the confidence interval from the observed quantiles of the distribution as indicated by type = "basic"; see the boot.ci help for other options but this choice is our recommendation.

Then we can use the function simply as NPS.boot(nps$Likelihood.to.Recommend) if nps is our data of survey responses and Likelihood.to.Recommend is the integer 0-10 value from the response to the NPS question. Or, more likely, we can use it to generate confidence intervals for a range of periods, eg months, quarters, or years, as outlined below.

## Note that we use the data.table syntax below.
library("data.table")
## Assume nps holds your survey data, one row per response where
## - Likelihood.to.Recommend is the 0-10 score which we assume is not NA here
## - quarter is the time period over which you want to aggregate, here quarterly
nps[, NPS.boot(Likelihood.to.Recommend), by = quarter]

To try out the code above, we can generate some data

## Note that we use the data.table syntax and its IDate class below.
library("data.table")
set.seed(20160115)

n <- 12e3  # Generate this many responses
p <- c(rep(1,7), 3, 4, 4, 3) # Response probabilities; true NPS = 0

nps <-
  data.table(Likelihood.to.Recommend =
               sample(0L:10L, n, replace = TRUE, prob = p),
             Response.Date =
               sample(seq(as.IDate("2013-01-01"),
                          as.IDate("2015-12-31"),
                          by = "day"),
                      n, replace = TRUE))

# Add column `quarter`
nps[, quarter := round(Response.Date, digits = "quarters")]
setkey(nps, quarter)

qs <- nps[, NPS.boot(Likelihood.to.Recommend), by = quarter]
print(qs)

       quarter        NPS    n conf         Low      High
 1: 2013-01-01  0.3211991  934 0.95 -4.81798715  5.674518
 2: 2013-04-01 -0.3036437  988 0.95 -5.36437247  4.655870
 3: 2013-07-01  2.1632252 1017 0.95 -2.85152409  7.079646
 4: 2013-10-01 -0.4935834 1013 0.95 -5.42941757  4.343534
 5: 2014-01-01 -2.9927761  969 0.95 -8.25593395  2.270382
 6: 2014-04-01  0.3992016 1002 0.95 -4.79041916  5.489022
 7: 2014-07-01  1.7329256  981 0.95 -3.46585117  6.931702
 8: 2014-10-01  5.1536174 1009 0.95  0.09910803 10.208127
 9: 2015-01-01 -2.6679842 1012 0.95 -7.80632411  2.371542
10: 2015-04-01 -3.4644195 1068 0.95 -8.33333333  1.404494
11: 2015-07-01  0.1021450  979 0.95 -5.20939734  5.209397
12: 2015-10-01  1.0700389 1028 0.95 -3.98832685  6.031128

Note that, even with 1000 responses per quarter, one of the 95% confidence intervals (number 8) exclude the true value of NPS=0; with 12 periods that is quite reasonable (probability 46%). Note also that we have an average of 1000 responses per quarter and the full width of our confidence intervals are indeed around 10, as per our rule of thumb.

For a simple plot of the data we could use

library("ggplot2")

qs.plot <-
  ggplot(qs, aes(x = quarter, y = NPS, ymin = Low, ymax = High)) +
  scale_x_date() +
  geom_hline(aes(yintercept = 0), linetype = "dashed", colour = "red") +
  geom_point() + geom_errorbar(width = 30) +
  ggtitle(expression(atop("Quarterly NPS",
                          atop(italic("with 95% confidence intervals"), ""))))

plot(qs.plot)

Simulated quarterly NPS scores with bootstrapped 95% confidence intervals; red line indicates true population NPS of 0.

Note that there is no true trend in this data. If you think you see a pattern where the first two quarters of the year are low and the second two are high then you are just imagining it. If you think your score fell in Q1 2015 but is now on the mends since Q3, you are just imagining it.

The approach extends trivially to other metrics you want to calculate from the survey data. One of our automotive clients tracked the proportion of ‘negative’ responses 0-4 to the Likelihood to recommend question; negative in the sense that the customer had selected below the middle (5) on the 0-10 scale.

We can do that easily

library("boot")
library("data.table")
library("ggplot2")
## Function for bootstrapping the sample error estimates
negative.boot <- function(x, R = 1e4, conf = 0.95, type = "basic") {
  neg <- function(x) 100*sum(x <= 4)/length(x)
  b   <- boot(x, function(data, i) neg(data[i]), R = R)
  ci  <- boot.ci(b, conf = conf, type = type)
  res <- as.list(c(neg(x), length(x), ci[[type]][1, c(1, 4:5)]))
  names(res) <- c("Negative", "n", "conf", "Low", "High")
  return(res)
}

set.seed(1)   # Reproducible bootstrap values

nqs <- nps[, negative.boot(Likelihood.to.Recommend), by = quarter]

nqs.plot <-
  ggplot(nqs, aes(x = quarter, y = Negative, ymin = Low, ymax = High)) +
  scale_x_date() +
  geom_hline(aes(yintercept = 500/21), linetype = "dashed", colour = "red") +
  geom_point() +
  geom_errorbar(width = 30) +
  ggtitle(expression(atop("Proportion of Negative responses",
                       atop(italic("with 95% confidence intervals"), ""))))

plot(nqs.plot)

This gives us:

Simulated quarterly ‘negative’ (0-4) scores with bootstrapped 95% confidence intervals; red line indicates true population value of 500/21.

If you have, say, a threshold of 25% for this metric and you only use the mean from the data, you would have had 3 false positives in this three months period. Or if your threshold was 22.5% you would have had 1 false negative.

Conclusion

I hope to see many more plots of NPS with error bars. As we mentioned previously, the score is not really the core of the system; closing the loop with the customers, understanding the root causes, and taking appropriate action is the core. Nevertheless, it is easy for an organization to become somewhat fixated on the score which leads to demands to ‘explain’ every change in the measured value.

Hopefully this analysis will allow the practitioner to have an informed conversation about the magnitude of the change and whether it may just be down to sampling. But note that we do not advocate ignoring the ‘why’ question even when sample sizes are small and the change is not statistically significant: it is extremely valuable to be collecting hypotheses for what may affect the score and try to test these. That too is part of closing the loop.

Feel free to contact us with any questions or comments.