We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
In our previous post on SNA we gave up on using the {statnet} package because it was not able to handle our data volumes. In this entry we have better success with the {igraph} package.
The task we are considering is still how to load the network data into the R package’s internal representation. We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:
src, dest, start, duration,type,...
+447000000005,+447000000006,1238510028, 52,call,...
+447000000006,+447000000009,1238510627, 154,call,...
+447000000009,+447000000007,1238511103, 48,call,...
+447000000006,+447000000005,1238511145, 49,call,...
+447000000006,+447000000005,1238511678, 12,call,...
+447000000001,+447000000006,1238511735, 147,call,...
+447000000007,+447000000009,1238511806, 26,call,...
+447000000000,+447000000008,1238511825, 19,call,...
+447000000009,+447000000008,1238511900, 28,call,...
...
Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start, and that the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction, which are represented by the …. In a Financial Services example, the records may be money transfers between accounts.
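Before tackling large files, it may help to see this format in an R data frame. The following is a minimal sketch, not part of the original analysis: "cdr.sample.csv" is a hypothetical file containing only the five concrete columns above (no trailing … fields).

> ### Hypothetical small extract; strip.white absorbs the spaces after
> ### the commas, and stringsAsFactors keeps the phone numbers as
> ### character strings rather than factors.
> cdr <- read.csv("cdr.sample.csv", strip.white=TRUE,
+                 stringsAsFactors=FALSE)
> cdr$start <- as.POSIXct(cdr$start, origin="1970-01-01", tz="UTC")
> head(cdr)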
Loading the data into the igraph package
We are able to load our previous test data of 51 million records easily:
> library("igraph")
> m <- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"),
+ what=integer(0), skip=1, sep=','),
+ ncol=4, byrow=TRUE)
Read 205266564 items
> ### Vertices are numbered from zero in the igraph library
> m[,1] <- m[,1]-1; m[,2] <- m[,2]-1
> g <- graph.edgelist(m[,c(2,1)])
> E(g)$start <- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")
> E(g)$duration <- m[,4]
> ns <- neighborhood.size(g, 1)
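One wrinkle: scan(..., what=integer(0)) assumes that src and dest are already integer identifiers, whereas a raw extract like the one above holds phone numbers. A hedged sketch of a preliminary mapping step, our addition reusing the cdr data frame from the earlier sketch, might be:

> ### Our addition, not part of the original load: map phone-number
> ### strings to contiguous integer codes (factor codes are 1-based,
> ### hence the -1 to match the zero-based numbering used above).
> ids  <- factor(c(cdr$src, cdr$dest))
> code <- as.integer(ids) - 1
> n    <- nrow(cdr)
> m    <- cbind(code[1:n], code[(n+1):(2*n)])
> levels(ids)[1]   ### recovers the original number for vertex 0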
Time to up the ante! We have a file of simulated call detail records containing over 700 million entries, where we suspect the algorithm used is generating too few nodes with small calling circles. Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):
Note that the comment below, that R can only handle \(2^{31} - 1\) elements in a vector, is very outdated.
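Since R 3.0.0, 64-bit builds support “long vectors” of up to \(2^{52}\) elements, which is easy to verify on a modern installation:

> ### Quick check on a modern R: a vector longer than 2^31-1 elements
> ### (this particular one needs about 3 Gb of memory).
> length(raw(3e9))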
> library("igraph")
### Note that R can only handle 2^31-1 elements in a vector (on any
### platform, including 64-bit), so we need to read this file as a
### list.
> s <- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)
Read 700466826 records
> m <- as.vector(rbind(s[[2]], s[[1]]))
> print(length(m))
[1] 1400933652
> length(m) <- 1e9
> g <- graph(m, directed=TRUE)
> ns <- neighborhood.size(g, 1)
> summary(ns)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   35.00   40.00   42.92   47.00  101.00
> hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size",
+      sub="From cdr.1e6x1e1.1e9")
As we suspected, the Monte Carlo algorithm does not provide enough customers with small calling circles. Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles. A mix of the two algorithms provides a reasonably good fit to actual customer behaviour. (The cut-off at 100 is a parameter to our Monte Carlo simulation program, which was indeed 100 for this run.)
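To quantify how thin the bottom of the distribution is, one quick check (our addition, not in the original session) is to tabulate the smallest neighbourhood sizes:

> ### Our addition. Note that an order-1 neighbourhood includes the
> ### vertex itself, so ns == k means k-1 neighbours.
> table(ns[ns <= 5])
> mean(ns <= 5)   ### fraction of vertices with at most 4 neighbours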
Problems
However, it is not all perfect. When we attempt to add the edge attributes in the obvious way, it fails:
> length(s[[3]]) <- 0.5e9
> length(s[[4]]) <- 0.5e9
> E(g)$start <- s[[3]]
Error: cannot allocate vector of size 3.7 Gb
Execution halted
> E(g)$duration <- s[[4]]
So we are just at the limit; probably 100 million records is OK in this environment. But the core igraph library is accessible from C, so better performance can probably be achieved that way; and since pointers are 8-byte structures on this machine, we should not suffer the silly limits that R imposes on us.
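Short of dropping down to C, one possible workaround within R, sketched here under the assumption that the downstream processing can be applied chunk by chunk, is to scan the file in pieces of (say) 100 million records:

> con <- file("cdr.1e6x1e1.csv", open="r")
> invisible(readLines(con, n=1))   ### skip the header line
> repeat {
+     s <- scan(con, what=rep(list(integer(0)),4), sep=',',
+               nmax=1e8, quiet=TRUE)
+     if (length(s[[1]]) == 0) break
+     ### ... process this chunk of up to 100 million records ...
+ }
> close(con)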