List of posts in Data

Analytics for Marketing online training 25 - 28 September 2012

I am excited to be giving the Analytics for Marketing online training course on 25-28 September 2012. Sign up before 25 August 2012 for the early bird discount. Our friends at Revolution Analytics who will provide the infrastructure to host the event.

Update: For clarification, this is an online, instructor led training course. We are using the Cisco WebEx Training Center to provide the training room. This allows us to keep the interactivity of classroom training without everybody having to physically travel. There is a limit on the number of participants so book early to ensure your seat (and for the early bird discount).

R code for Chapter 2 of Non-Life Insurance Pricing with GLM

We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK | US).

At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.

R code for Chapter 1 of Non-Life Insurance Pricing with GLM

Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R, the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.

doSMP pulled

They have finally pulled that buggy unreliable piece of code that was doSMP from the CRAN mirrors while (I hear) Revolutions are re-writing it. To use all your cores for analysis on the Windows platform, you can try doSNOW instead; my code is something like the fragment below. Neither option is as attractive as doMC on anything-but-Windows platforms, but sometimes you have to work with legacy systems.

R versus SAS/SPSS in corporations

A recent question on one of the LinkedIn groups about the advantages of using R over commercial tools like SAS or IBM SPSS Modeller drew lots of comments for R. We like R a lot and we use it extensively, but I also wanted to balance the discussion. R is great, but looking at commercial organizations near the end of 2011 it is not necessarily the right choice to make.

Friday quote: what is the question to which this number is the answer?

John Kay muses on interpreting statistical data:

Always ask of such data “what is the question to which this number is the answer?”. “Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs” is the answer to the question “what is the highest profit number we can present without attracting flat disbelief?”.

A warning on the R save format

The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.

I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?

Friday quote: the handmaiden and the whore

Because it is Friday and because we collect quotes:

If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship. Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers. They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.

Spreadsheet errors

For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value of the data in the organisation. But I don’t like it and let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.