SNA with R: Loading large networks using the igraph library

On 2009-05-06 15:33:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:

We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.

In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.

The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:

          src,         dest,     start,  duration,type,...
+447000000005,+447000000006,1238510028,        52,call,...
+447000000006,+447000000009,1238510627,       154,call,...
+447000000009,+447000000007,1238511103,        48,call,...
+447000000006,+447000000005,1238511145,        49,call,...
+447000000006,+447000000005,1238511678,        12,call,...
+447000000001,+447000000006,1238511735,       147,call,...
+447000000007,+447000000009,1238511806,        26,call,...
+447000000000,+447000000008,1238511825,        19,call,...
+447000000009,+447000000008,1238511900,        28,call,...
...

Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the .... In a Financial Services example, the records may be money transfers between accounts.

Loading the data in the igraph package

We are able to load the previous test data with 51 million records easily:

> library("igraph")
> m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), 
+                  what=integer(0), skip=1, sep=','), 
+             ncol=4, byrow=TRUE)
Read 205266564 items
> ### Vertices are numbered from zero in the igraph library
> m[,1] <- m[,1]-1; m[,2] <- m[,2]-1
> g <- graph.edgelist(m[,c(2,1)])
> E(g)$start    <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")
> E(g)$duration <- m[,4]
> ns <- neighborhood.size(g, 1)

Time to up the ante! We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections. Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):

> library("igraph")
### Note that R can only handle 2^31-1 elements in a vector (on any
### platform, including 64-bit), so we need to read this file as a
### list.
> s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)
Read 700466826 records
> m <- as.vector(rbind(s[[2]], s[[1]]))
> print(length(m))
[1] 1400933652
> length(m) <- 1e9
> g <- graph(m, directed=TRUE)
> ns <- neighborhood.size(g, 1)
> summary(ns)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   35.00   40.00   42.92   47.00  101.00 
> hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", 
       sub="From cdr.1e6x1e1.1e9")
[Distribution of neighborhood size plot]

As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes. Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles. A mix of these two algorithms provide a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)

Problems

However, it is not all perfect. When we attempt to add the edge parameters in the obvious way it fails:

> length(s[[3]]) <- 0.5e9
> length(s[[4]]) <- 0.5e9
> E(g)$start     <- s[[3]]
Error: cannot allocate vector of size 3.7 Gb
Execution halted
> E(g)$duration  <- s[[4]]

So we are just at the limit. Probably 100 million records is OK in this environment. But the core igraph library is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.

Subscribe to CYBAEA Data and Analysis

Join the discussion

Do you agree or disagree? Have a question of want to make a point? Join the discussion:

"spinners"

On 2010-01-28 11:25:00, TimManns said:

hi, i just stumbled across this.
I also use the SNA to identify "spinners" (never heard them called that before :).

It actualy works quite well and you can get surprisingly accurate results by matching simple immediate neighbours (circle of friends). As long as you ensure muiltiple matches and similar usage profile, you can be almost certain of persistent tenure (customer matching). The really hard part is keeping the compuations effecient. Trying to match each 'spinners' friends against the 15 or so million possibilities (in Australia) isn't pretty, especially on a corporate data warehouse... My SQL queries aren't pretty either :)

Anyhow, its definately worthwhile and a great customer insight.

Cheers

Tim

This trick or Rmysql

On 2009-06-10 07:03:00, Esteban Moro said:

Great post and stuff
I am also dealing with huge databases in R.
I wonder how efficient is this compared to putting the data into a database and making the queries in R through Rmysql. Have you ever tried that?
Best

Re: Great stuff

On 2009-05-14 10:16:00, Allan Engelhardt said:

> Very interesting reading Allan.
Thank you.

> Was the 51M real CDR data?
No, simulated. I want to understand the SNA libraries and that is much easier when I know the underlying model.

> Also what is the purpose of the analysis? Churn prediction?
No, churn prediction is easy (as we recently wrote about). I wanted:
1. To understand “spinners”: prepaid customers who chuck their SIM card regularly for a new one with a better promotion. Extremely common behaviour in price-sensitive markets like Romania.
2. To model the customer value better: if you leave, your friends may leave as well (to stay on-net) or at least go Multi-SIM.

> If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/
Yes, I like Tim’s blog.

> Regarding the limits you've faced, is this 64-bit R and you're still having these limits?
Yes, this is all 64-bit R on 64-bit computers (x86_64 architecture). You still have the internal 2^31-1 limits within R: all 64 bit buys you is the ability to map more memory (and so have more <= 2^31-1 vectors around).

Great stuff

On 2009-05-13 11:54:00, Shane Butler said:

Very interesting reading Allan. Was the 51M real CDR data? Also what is the purpose of the analysis? Churn prediction?

If you have not already seen it you may be interested in the SNA mobile churn prediction work Tim Manns is doing over at http://timmanns.blogspot.com/

Regarding the limits you've faced, is this 64-bit R and you're still having these limits?

Navigation

This site is standards compliant: