On 2009-05-06 15:33:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.
The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:
src, dest, start, duration,type,...
+447000000005,+447000000006,1238510028, 52,call,...
+447000000006,+447000000009,1238510627, 154,call,...
+447000000009,+447000000007,1238511103, 48,call,...
+447000000006,+447000000005,1238511145, 49,call,...
+447000000006,+447000000005,1238511678, 12,call,...
+447000000001,+447000000006,1238511735, 147,call,...
+447000000007,+447000000009,1238511806, 26,call,...
+447000000000,+447000000008,1238511825, 19,call,...
+447000000009,+447000000008,1238511900, 28,call,...
...
Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the .... In a Financial Services example, the records may be money transfers between accounts.
igraph packageWe are able to load the previous test data with 51 million records easily:
> library("igraph")
> m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"),
+ what=integer(0), skip=1, sep=','),
+ ncol=4, byrow=TRUE)
Read 205266564 items
> ### Vertices are numbered from zero in the igraph library
> m[,1] <- m[,1]-1; m[,2] <- m[,2]-1
> g <- graph.edgelist(m[,c(2,1)])
> E(g)$start <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")
> E(g)$duration <- m[,4]
> ns <- neighborhood.size(g, 1)
Time to up the ante! We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections. Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):
> library("igraph")
### Note that R can only handle 2^31-1 elements in a vector (on any
### platform, including 64-bit), so we need to read this file as a
### list.
> s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)
Read 700466826 records
> m <- as.vector(rbind(s[[2]], s[[1]]))
> print(length(m))
[1] 1400933652
> length(m) <- 1e9
> g <- graph(m, directed=TRUE)
> ns <- neighborhood.size(g, 1)
> summary(ns)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 35.00 40.00 42.92 47.00 101.00
> hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size",
sub="From cdr.1e6x1e1.1e9")
As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes. Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles. A mix of these two algorithms provide a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)
However, it is not all perfect. When we attempt to add the edge parameters in the obvious way it fails:
> length(s[[3]]) <- 0.5e9 > length(s[[4]]) <- 0.5e9 > E(g)$start <- s[[3]] Error: cannot allocate vector of size 3.7 Gb Execution halted > E(g)$duration <- s[[4]]
So we are just at the limit. Probably 100 million records is OK in this environment. But the core igraph library is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.
On 2010-07-13 07:47:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.
The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type="l") does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary.
We also get a nice opportunity to use the under-appreciated read.fwf function.
Read more (~535 words).
On 2010-06-22 11:45:00, Allan Engelhardt wrote in CYBAEA Journal:
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is scary.
We now re-do the analysis four years later and, just because we can, we are using the leading companies of the London stock exchange instead of the largest American companies.
The results still hold. We called it the 3/2 rule: treble the number of workers and you halve their individual productivity. Large companies with ten times the number of employees are ¼ as productive as their smaller competitors.
Employee productivity is a big issue. If all the FTSE-100 companies achieved their average profits per employee, then the index would generate almost £1 trn of additional net profits for the economy.
Read more (~245 words).
On 2010-06-22 11:20:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary.
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
Read more (~763 words, 5 comments).
On 2010-06-17 09:05:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Following on from my previous post about improving performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolutions Analytics have on their Revolutionary Performance pages.
Read more (~300 words, 2 comments).
On 2010-06-15 10:21:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.
But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a to a better linear algebra library. So I decided to investigate.
The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.
Read more (~934 words, 1 comments).
Join the discussion
"spinners"
hi, i just stumbled across this.
I also use the SNA to identify "spinners" (never heard them called that before :).
It actualy works quite well and you can get surprisingly accurate results by matching simple immediate neighbours (circle of friends). As long as you ensure muiltiple matches and similar usage profile, you can be almost certain of persistent tenure (customer matching). The really hard part is keeping the compuations effecient. Trying to match each 'spinners' friends against the 15 or so million possibilities (in Australia) isn't pretty, especially on a corporate data warehouse... My SQL queries aren't pretty either :)
Anyhow, its definately worthwhile and a great customer insight.
Cheers
Tim
This trick or Rmysql
Great post and stuff
I am also dealing with huge databases in R.
I wonder how efficient is this compared to putting the data into a database and making the queries in R through Rmysql. Have you ever tried that?
Best
Re: Great stuff
> Very interesting reading Allan.
Thank you.
> Was the 51M real CDR data?
No, simulated. I want to understand the SNA libraries and that is much easier when I know the underlying model.
> Also what is the purpose of the analysis? Churn prediction?
No, churn prediction is easy (as we recently wrote about). I wanted:
1. To understand “spinners”: prepaid customers who chuck their SIM card regularly for a new one with a better promotion. Extremely common behaviour in price-sensitive markets like Romania.
2. To model the customer value better: if you leave, your friends may leave as well (to stay on-net) or at least go Multi-SIM.
> If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/
Yes, I like Tim’s blog.
> Regarding the limits you've faced, is this 64-bit R and you're still having these limits?
Yes, this is all 64-bit R on 64-bit computers (x86_64 architecture). You still have the internal 2^31-1 limits within R: all 64 bit buys you is the ability to map more memory (and so have more <= 2^31-1 vectors around).
Great stuff
Very interesting reading Allan. Was the 51M real CDR data? Also what is the purpose of the analysis? Churn prediction?
If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/
Regarding the limits you've faced, is this 64-bit R and you're still having these limits?