2010-06-22 11:20:00 Allan Engelhardt wrote in CYBAEA Data and Analysis:
We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.
Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&P 500 leading American companies that we analyzed four years ago.
We will of course use the R statistical computing and analysis platform for our analysis, and once again we are grateful to Yahoo Finance for providing the data.
The analysis script is available as ftse100.R and is really simple:
## ftse100.R - Display employee productivity for FTSE-100 constituents
## Copyright © 2010 Allan Engelhardt <http://www.cybaea.net/>
## All Rights Reserved.
## Get the index constituents.
ftse.100 <- read.csv(file = "http://uk.old.finance.yahoo.com/d/quotes.csv?s=@%5EFTSE&f=s&e=.csv", header = FALSE)
names(ftse.100) <- c("symbol")
data <- data.frame(symbol=NULL, employees=NULL, profit=NULL, sector=NULL)
## For each stock symbol, get employees, profit, and sector
for (symbol in ftse.100$symbol) {
    profile.url <- paste("http://uk.finance.yahoo.com/q/pr?s=", symbol, sep="")
    con <- url(profile.url, open = "r")
    text <- readChar(con, 2^24) # enough bytes
    close(con)
    ## Number of employees: strip thousands separators, NA if not numeric
    x <- sub('.*Number of employees:</td><td.*?>[[:space:]]*([[:digit:],]+).*', "\\1", text, ignore.case = TRUE)
    x <- gsub(',', '', x)
    empl <- tryCatch(as.integer(x), warning = function(x) NA)
    ## Net profit is quoted in millions
    x <- sub('.*Net Profit.*?</td><td.*?>[[:space:]]*([+-]?[[:digit:],]+).*', '\\1', text)
    x <- gsub(',', '', x)
    profit <- tryCatch(as.integer(x)*1e6, warning = function(x) NA)
    sector <- sub('.*Sector:</td><td.*?>(.*?)</td>.*', '\\1', text)
    if (any(c(empl, profit) <= 0, is.na(c(empl, profit)))) {
        cat("Error parsing symbol", symbol, "see", profile.url, "\n")
    } else {
        ## stringsAsFactors = TRUE keeps sector a factor (not the default from R 4.0.0);
        ## the plotting code below relies on levels(data$sector).
        data <- rbind(data, data.frame(symbol=symbol, employees=empl, profit=profit, sector=sector,
                                       stringsAsFactors = TRUE))
    }
    Sys.sleep(1)
}
## Save the data so we don't have to hit Yahoo all the time.
save(data, file = "data.RData")
## Save plot to file:
#png(filename="ftse100.png", width=800, height=800, pointsize=14, bg="white", res=100)
opar <- par(cex.sub = sqrt(sqrt(2)), font.sub = 3, font.lab = 2)
## x and y coordinates of plot and plot limits
x <- with(data, employees)
y <- with(data, profit/employees)
xlim <- c(10^floor(log10(min(x))), 10^ceiling(log10(max(x))))
ylim <- c(10^floor(log10(min(y))), 10^ceiling(log10(max(y))))
## Set up to display different color and symbols
plot_col <- 1
plot_pch <- 1
markers <- 21:25
pchs <- rep(markers, ceiling(length(levels(data$sector))/length(markers)))
palette(rainbow(length(levels(data$sector)), start=3/6, end=6/6))
# Make empty plot:
plot.new()
plot(profit/employees ~ employees, data = data[FALSE, ],
     type = "p", pch = pchs[plot_pch], col = plot_col,
     log="xy", xaxp = c(xlim, 1), yaxp = c(ylim, 1), xlim = xlim, ylim = ylim,
     main = "Profit per employee (FTSE 100)", xlab = "Employees", ylab = "Profit per employee (GBP)")
## Plot each sector
for (sector in levels(data$sector)) {
    plot.xy(xy.coords(with(data[data$sector == sector,], employees),
                      with(data[data$sector == sector,], profit/employees),
                      log = "xy", xlab = "", ylab = ""),
            type = "p", pch = pchs[plot_pch], col = plot_col, bg = plot_col)
    plot_pch <- plot_pch + 1
    plot_col <- plot_col + 1
}
legend(x = "bottomleft", legend = levels(data$sector), title = "Industry Sectors",
       col = palette(), pt.bg = palette(), pch = pchs, cex = 2/3, pt.cex = 1, ncol = 2)
## Fit a linear model to the log-log data:
m <- lm(log10(y) ~ log10(x))
xl <- c(xlim[1]*5, xlim[2]/5)
yl <- 10^predict(m, data.frame(x = xl))
lines(xl, yl, col = "darkred", lty = "dashed", lwd = 2)
t <- sprintf("Power = %0.3g", m$coefficients[2])
text(xl[2], yl[2], t, adj = c(0.25, -1.5), col = "darkred", font = 2)
## All done.
par(opar)
#dev.off() # needed only when the png() call above is enabled
Leave it to run and this is what you get:
The power law still broadly holds. In a large company, the productivity of the individual employee is only ¼ of the productivity in a company with one-tenth of the number of workers.
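As a quick check of that arithmetic: under a power law, productivity changes by the factor (employee ratio)^slope. The sketch below assumes the slopes quoted in the comments further down (about -0.60 for the FTSE-100 and -0.68 for the S&P 500); `scale_factor` is a name introduced here purely for illustration.

```r
## Productivity multiplier implied by a power law with the given slope
## when the number of employees changes by the factor `ratio`.
scale_factor <- function(ratio, slope) ratio^slope

ftse100 <- scale_factor(10, -0.60)  # ten times the workers, FTSE-100 slope
sp500   <- scale_factor(3,  -0.68)  # three times the workers, S&P 500 slope
print(round(c(ftse100 = ftse100, sp500 = sp500), 2))
```

The two multipliers come out at roughly a quarter and roughly a half, matching the "¼" above and the "treble the workers, halve the productivity" rule from the earlier analysis.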
The analysis for the FTSE All-Share index is easy (ftse-all.R) and gives a slope of -0.7605541 for the 301 companies with the required information, which is a markedly steeper decline. More convincingly, fitting only the companies with more than 1,000 employees (to avoid some of the bias from smaller companies needing large profits per employee in order to be big enough to afford a stock market listing) gives a slope of -0.2838.
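Restricting the fit to larger companies is a one-argument change to `lm`. The sketch below is not ftse-all.R: it uses a synthetic data frame (`toy`, generated with an assumed true slope of -0.3) as a stand-in for the scraped data, so only the technique carries over, not the numbers.

```r
set.seed(42)
## Synthetic stand-in for the scraped data (NOT the real FTSE figures).
n <- 300
toy <- data.frame(employees = round(10^runif(n, 2, 5.5)))
toy$profit <- toy$employees * 1e5 * toy$employees^(-0.3) * 10^rnorm(n, sd = 0.3)

## Fit the log-log model on companies with more than 1,000 employees only.
m.big <- lm(log10(profit / employees) ~ log10(employees),
            data = toy, subset = employees > 1000)
slope.big <- unname(coef(m.big)[2])
cat("Companies used:", nobs(m.big), " slope:", signif(slope.big, 3), "\n")
```

With the real data frame from the script, replacing `toy` by `data` should be all that is needed.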
Join the discussion
statistically significant effects of sector?
I see you plot with different symbols per sector. Could you run a model with fixed effects by sector to see if you get statistically/substantially different slopes or intercepts by sector? That is, does (say) the petrochemical sector have a reduced drop-off, while the IT sector has a sharper decrease in productivity?
Effects per sector
@Harlan:
Thank you for your comment. I like the suggestion, but there are so few samples in many of the sectors that I doubt you will get a statistical result. But intuitively we would expect an effect: the Investment Companies sector has different productivity from Transports (median 10,000,000 versus 2,599 in the All-Share data set).
Do
sort( with(data, tapply(profit/employees, sector, median)) )
to see the productive industries.
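For anyone who wants to try the sector model suggested above, one way to set it up is to compare a pooled fit against a fit with sector-specific intercepts and slopes. This is a sketch on synthetic data: the sectors "A", "B" and "C", their intercepts, and the common slope of -0.3 are all made up for illustration, and with the real data many sectors will have too few companies for the test to say much.

```r
set.seed(1)
## Synthetic data: three hypothetical sectors, 30 companies each.
toy <- data.frame(
  sector    = factor(rep(c("A", "B", "C"), each = 30)),
  employees = round(10^runif(90, 3, 5))
)
intercept  <- c(A = 5.0, B = 4.5, C = 4.0)[as.character(toy$sector)]
toy$profit <- toy$employees *
  10^(intercept - 0.3 * log10(toy$employees) + rnorm(90, sd = 0.2))

## Pooled model versus sector-specific intercepts and slopes.
pooled    <- lm(log10(profit / employees) ~ log10(employees), data = toy)
by_sector <- lm(log10(profit / employees) ~ log10(employees) * sector, data = toy)
cmp <- anova(pooled, by_sector)  # F test: do the sector terms improve the fit?
print(cmp)
```

The interaction coefficients of `by_sector` (see `summary(by_sector)`) are the per-sector differences in slope relative to the first sector.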
Explanation and mirages
Big companies are usually old, so their profitability has decreased and they are awaiting bankruptcy.
Another thing is that because of subcontractors the employee and accounting information is partly unreliable/irrelevant. Subcontractors are deliberately used to cook the books and to hide what really happens in the business.
Selection bias
If productivity and number of employees were independent, the largest companies would be those at the extremes of productivity and employee numbers. The FTSE sample will only include a company with few employees if it is extremely productive, but will include companies that are not very productive if they have large numbers of employees.
It's obviously more complicated than that but I would guess that the relationship would be weaker in the FTSE 250.
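The selection effect is easy to reproduce in a toy simulation. The sketch below makes two loud assumptions: employees and profit per employee are independent log-normal variables, and index membership is decided by total profit (a crude proxy for market capitalisation). All names and parameters are invented for illustration.

```r
set.seed(2010)
n <- 20000
## Independent by construction: log10 employees and log10 profit per employee.
pop <- data.frame(log.empl = rnorm(n, mean = 3, sd = 1),
                  log.prod = rnorm(n, mean = 4, sd = 1))
pop$log.profit <- pop$log.empl + pop$log.prod  # log10 of total profit

## Slope of the log-log fit, and the k companies with the largest total profit.
slope_of <- function(d) unname(coef(lm(log.prod ~ log.empl, data = d))[2])
top <- function(k) pop[order(pop$log.profit, decreasing = TRUE)[seq_len(k)], ]

slopes <- c(everyone = slope_of(pop),
            top.4000 = slope_of(top(4000)),
            top.400  = slope_of(top(400)))
print(round(slopes, 2))
```

In this toy model the full population shows essentially no relationship, while the selected "indices" show a clearly negative slope that gets steeper as the index gets narrower. That matches the guess above about a broader index, though it is only a property of the simulation, not of the real data.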
Re: Selection bias
@Bunbury:
Thank you for your comment. You are of course completely right about the selection bias and that is a very good point indeed.
Which makes it even more surprising that the relationship is stronger when a broader index is used. We found -0.60 for the FTSE-100, but -0.68 for the S&P 500 index and -0.76 for the FTSE All-Share.
Again, our analysis is by necessity limited to companies where we have the profit and employee information (and where both are greater than zero), which introduces another selection bias. But it is not immediately obvious to me that this bias should make the effect stronger.
Putnam Model
Somebody pointed us to the Putnam Model [1] which appears to predict a slope of -1/3. This is very close to our findings of -0.28 even though that model was only for software projects.
Does anyone have more experience with the Putnam Model and can explain the B factor?
[1] https://secure.wikimedia.org/wikipedia/en/wiki/Putnam_model
Don't find it too surprising ...
Basic micro-economics says that [with perfect knowledge] the profit-maximizing point is where marginal cost for that last unit/sale/employee = the marginal revenue of that last unit/sale/employee, and the marginal profit = 0.
From that viewpoint, each incremental employee adds a smaller and smaller addition to overall profits. AND dilutes the average profitability/productivity of all the employees hired up to that point.
Obviously the real world is messier.
But still, given any number x of employees, a company will go for all the lowest hanging fruit it can identify in its business environment. The next hired employee after that, and the next after that, can't go for the lowest-hanging fruit, those are already gone. They have to go for fruit that are less ripe, or more work to get, etc.
And of course, your data is consistent with that theory.
(I'm going to glibly bypass discussing economies of scale by asserting that, within industries, everyone faces cost functions from economies of scale that are more or less equivalent.)
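The dilution argument can be made concrete under the assumption (made here for illustration only) that total profit follows the same power law as the fit, $\Pi(n) = c\,n^{1+b}$, with $n$ the number of employees and the slope $b$ between $-1$ and $0$:

```latex
\begin{align*}
\text{average profit per employee:} \quad
  & \frac{\Pi(n)}{n} = c\,n^{b} \\
\text{marginal profit of one more employee:} \quad
  & \frac{d\Pi}{dn} = (1+b)\,c\,n^{b} = (1+b)\,\frac{\Pi(n)}{n}
\end{align*}
```

With $b = -0.60$, the next employee adds about 40% of the current average: still positive, but below the average, so the average falls. A pure power law with $b > -1$ never reaches the zero-marginal-profit point, so at best it describes companies that are still short of that optimum.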