SNA with R: Loading large networks using the igraph library

Posted in CYBAEA Data:

We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.

In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.

The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:

          src,         dest,     start,  duration,type,...
+447000000005,+447000000006,1238510028,        52,call,...
+447000000006,+447000000009,1238510627,       154,call,...
+447000000009,+447000000007,1238511103,        48,call,...
+447000000006,+447000000005,1238511145,        49,call,...
+447000000006,+447000000005,1238511678,        12,call,...
+447000000001,+447000000006,1238511735,       147,call,...
+447000000007,+447000000009,1238511806,        26,call,...
+447000000000,+447000000008,1238511825,        19,call,...
+447000000009,+447000000008,1238511900,        28,call,...
...

Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the .... In a Financial Services example, the records may be money transfers between accounts.
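The `scan()`-based loading below assumes the identifiers have already been recoded as small integers. One way to sketch that preprocessing step in base R, using a hypothetical two-record extract (a real file would be read with `read.csv("cdr.csv")` instead of `textConnection()`):

```r
## Hypothetical two-record extract; all values are made up for the example.
cdr <- read.csv(textConnection(
"src,dest,start,duration,type
+447000000005,+447000000006,1238510028,52,call
+447000000006,+447000000009,1238510627,154,call"),
  colClasses = c("character", "character", "integer", "integer", "character"))

## Recode the phone numbers as consecutive integer ids so that the
## numeric edge-list loading used later can be applied.
ids <- unique(c(cdr$src, cdr$dest))
cdr$src  <- match(cdr$src,  ids)
cdr$dest <- match(cdr$dest, ids)
cdr[, c("src", "dest", "start", "duration")]
```

This keeps `ids` as the lookup table for translating vertex numbers back to phone numbers after the analysis.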

Loading the data in the igraph package

We are able to load the previous test data with 51 million records easily:

> library("igraph")
> m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), 
+                  what=integer(0), skip=1, sep=','), 
+             ncol=4, byrow=TRUE)
Read 205266564 items
> ### Vertices are numbered from zero in the igraph library
> m[,1] <- m[,1]-1; m[,2] <- m[,2]-1
> g <- graph.edgelist(m[,c(2,1)])
> E(g)$start    <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")
> E(g)$duration <- m[,4]
> ns <- neighborhood.size(g, 1)
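On small extracts it is easy to cross-check the neighbourhood sizes without igraph: assuming the default mode (which ignores edge direction), the 1-neighbourhood of a vertex is its set of distinct neighbours plus the vertex itself. A base-R sketch on a toy zero-based edge list (hypothetical data):

```r
## Toy zero-based edge list in the same two-column (from, to) layout
## as m[, c(2,1)] above; hypothetical data.
el <- matrix(c(1L, 0L,
               2L, 1L,
               0L, 2L,
               0L, 1L), ncol = 2, byrow = TRUE)

## Count distinct neighbours per vertex, ignoring edge direction;
## the 1-neighbourhood size is that count plus one (the vertex itself).
pairs <- unique(rbind(el, el[, 2:1]))
ns.check <- tabulate(pairs[, 1] + 1L) + 1L
ns.check  # one entry per vertex 0, 1, 2
```

For this fully connected triangle each vertex has neighbourhood size 3, which is what `neighborhood.size(g, 1)` should report for the same edges.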

Time to up the ante! We have a file of simulated call detail records containing over 700 million entries, where we suspect the simulation algorithm under-represents nodes with few connections. Let’s check on the first ½ billion records (which more or less fits in the available memory on this workstation):

> library("igraph")
> ### Note that R can only handle 2^31-1 elements in a vector (on any
> ### platform, including 64-bit), so we need to read this file as a
> ### list.
> s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)
Read 700466826 records
> m <- as.vector(rbind(s[[2]], s[[1]]))
> print(length(m))
[1] 1400933652
> length(m) <- 1e9
> g <- graph(m, directed=TRUE)
> ns <- neighborhood.size(g, 1)
> summary(ns)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   35.00   40.00   42.92   47.00  101.00 
> hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", 
+      sub="From cdr.1e6x1e1.1e9")
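The list-based `scan()` is needed because reading all four columns into a single vector would exceed R's limit of 2^31 − 1 elements, while the two interleaved edge columns still fit; the truncation to 1e9 elements keeps half a billion edges. The arithmetic, using the record count reported above:

```r
## Why scan() into a list rather than a single vector:
records <- 700466826
records * 4 > 2^31 - 1   # all four columns together exceed R's limit
records * 2 < 2^31 - 1   # the interleaved (src, dest) pairs still fit
1e9 / 2                  # edges kept after truncating m to 1e9 elements
```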

As we suspected, the Monte Carlo algorithm does not generate enough customers with small calling circles. Fortunately it is very easy to add these separately; the hard part is modelling the larger calling circles. A mix of the two algorithms provides a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter of our Monte Carlo simulation program, which was indeed 100 for this run.)
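The mixing idea can be illustrated with a toy generator. This is a sketch only: the real simulation program is not shown here, and all distributions and parameters below are made up for the example:

```r
## Illustrative sketch: mix a separate generator for small calling
## circles (geometric) with the main one (rounded normal), capping at
## 100 as in the run above. All parameters are hypothetical.
set.seed(42)
n     <- 1e5
small <- pmin(rgeom(n, prob = 0.3) + 1, 100)   # many tiny circles
large <- pmin(round(rnorm(n, 40, 9)), 100)     # bulk of the distribution
pick  <- runif(n) < 0.05                       # 5% drawn from the small mix
ns.sim <- ifelse(pick, small, pmax(large, 1))
summary(ns.sim)
```

Tuning the mixing weight and the small-circle distribution against the observed histogram would then recover the missing low end.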

Problems

However, it is not all perfect. When we attempt to add the edge attributes in the obvious way, it fails:

> length(s[[3]]) <- 0.5e9
> length(s[[4]]) <- 0.5e9
> E(g)$start     <- s[[3]]
Error: cannot allocate vector of size 3.7 Gb
Execution halted
> E(g)$duration  <- s[[4]]

So we are just at the limit; around 100 million records is probably comfortable in this environment. But the core igraph library is written in C and can be used directly from that language, which should give better performance; and since pointers are 8-byte structures on this machine, working at the C level would also avoid the vector length limits that R imposes on us.
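The 3.7 Gb in the error message above is consistent with the attribute assignment forcing a double-precision copy of the 0.5e9 values (8 bytes per element), though that interpretation is an assumption. A quick check of the arithmetic:

```r
## Back-of-the-envelope check of the allocation failure above:
n.edges <- 0.5e9
bytes   <- 8 * n.edges   # 8 bytes per double-precision element
bytes / 2^30             # in GiB; compare with the "3.7 Gb" error
```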
