2009-05-06 15:33:00 Allan Engelhardt wrote in CYBAEA Data and Analysis:
We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.
The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:
src, dest, start, duration,type,...
+447000000005,+447000000006,1238510028, 52,call,...
+447000000006,+447000000009,1238510627, 154,call,...
+447000000009,+447000000007,1238511103, 48,call,...
+447000000006,+447000000005,1238511145, 49,call,...
+447000000006,+447000000005,1238511678, 12,call,...
+447000000001,+447000000006,1238511735, 147,call,...
+447000000007,+447000000009,1238511806, 26,call,...
+447000000000,+447000000008,1238511825, 19,call,...
+447000000009,+447000000008,1238511900, 28,call,...
...
Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the .... In a Financial Services example, the records may be money transfers between accounts.
igraph packageWe are able to load the previous test data with 51 million records easily:
> library("igraph")
> m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"),
+ what=integer(0), skip=1, sep=','),
+ ncol=4, byrow=TRUE)
Read 205266564 items
> ### Vertices are numbered from zero in the igraph library
> m[,1] <- m[,1]-1; m[,2] <- m[,2]-1
> g <- graph.edgelist(m[,c(2,1)])
> E(g)$start <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")
> E(g)$duration <- m[,4]
> ns <- neighborhood.size(g, 1)
Time to up the ante! We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections. Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):
> library("igraph")
### Note that R can only handle 2^31-1 elements in a vector (on any
### platform, including 64-bit), so we need to read this file as a
### list.
> s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)
Read 700466826 records
> m <- as.vector(rbind(s[[2]], s[[1]]))
> print(length(m))
[1] 1400933652
> length(m) <- 1e9
> g <- graph(m, directed=TRUE)
> ns <- neighborhood.size(g, 1)
> summary(ns)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 35.00 40.00 42.92 47.00 101.00
> hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size",
sub="From cdr.1e6x1e1.1e9")
As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes. Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles. A mix of these two algorithms provide a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)
However, it is not all perfect. When we attempt to add the edge parameters in the obvious way it fails:
> length(s[[3]]) <- 0.5e9 > length(s[[4]]) <- 0.5e9 > E(g)$start <- s[[3]] Error: cannot allocate vector of size 3.7 Gb Execution halted > E(g)$duration <- s[[4]]
So we are just at the limit. Probably 100 million records is OK in this environment. But the core igraph library is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.
Subscribe to CYBAEA Data and Analysis
Jump to comments.
SNA with R: Loading your network data
We are interested in Social Network Analysis using the statistical analysis and computing platform R . As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages. The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.
4 easy steps to make Mason utf-8 Unicode clean with Apache, mod_perl, and DBI
This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache , mod_perl , and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data. You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl. That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right.
Employee productivity as function of number of workers revisited
We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis …
Area Plots with Intensity Coloring
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …
Join the discussion
Great stuff
Very interesting reading Allan. Was the 51M real CDR data? Also what is the purpose of the analysis? Churn prediction?
If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/
Regarding the limits you've faced, is this 64-bit R and you're still having these limits?
Re: Great stuff
> Very interesting reading Allan.
Thank you.
> Was the 51M real CDR data?
No, simulated. I want to understand the SNA libraries and that is much easier when I know the underlying model.
> Also what is the purpose of the analysis? Churn prediction?
No, churn prediction is easy (as we recently wrote about). I wanted:
1. To understand “spinners”: prepaid customers who chuck their SIM card regularly for a new one with a better promotion. Extremely common behaviour in price-sensitive markets like Romania.
2. To model the customer value better: if you leave, your friends may leave as well (to stay on-net) or at least go Multi-SIM.
> If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/
Yes, I like Tim’s blog.
> Regarding the limits you've faced, is this 64-bit R and you're still having these limits?
Yes, this is all 64-bit R on 64-bit computers (x86_64 architecture). You still have the internal 2^31-1 limits within R: all 64 bit buys you is the ability to map more memory (and so have more <= 2^31-1 vectors around).
This trick or Rmysql
Great post and stuff
I am also dealing with huge databases in R.
I wonder how efficient is this compared to putting the data into a database and making the queries in R through Rmysql. Have you ever tried that?
Best
"spinners"
hi, i just stumbled across this.
I also use the SNA to identify "spinners" (never heard them called that before :).
It actualy works quite well and you can get surprisingly accurate results by matching simple immediate neighbours (circle of friends). As long as you ensure muiltiple matches and similar usage profile, you can be almost certain of persistent tenure (customer matching). The really hard part is keeping the compuations effecient. Trying to match each 'spinners' friends against the 15 or so million possibilities (in Australia) isn't pretty, especially on a corporate data warehouse... My SQL queries aren't pretty either :)
Anyhow, its definately worthwhile and a great customer insight.
Cheers
Tim