SNA with R: Loading your network data

On 2009-04-01 16:08:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:

We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages.

The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.

We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:

          src,         dest,     start,  duration,type,...
+447000000005,+447000000006,1238510028,        52,call,...
+447000000006,+447000000009,1238510627,       154,call,...
+447000000009,+447000000007,1238511103,        48,call,...
+447000000006,+447000000005,1238511145,        49,call,...
+447000000006,+447000000005,1238511678,        12,call,...
+447000000001,+447000000006,1238511735,       147,call,...
+447000000007,+447000000009,1238511806,        26,call,...
+447000000000,+447000000008,1238511825,        19,call,...
+447000000009,+447000000008,1238511900,        28,call,...
...

Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the .... In a Financial Services example, the records may be money transfers between accounts.

Implementation in the network class

In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges. That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.

In the terminology of the network class, that means that our network will be directed (calls and money transfers have a direction from one person to another) and will need to allow multiple edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).

We could consider dropping the multiple attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute. We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).
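
As a rough sketch of what that collapsed representation could look like, the snippet below aggregates a handful of calls into one row per ordered (src, dest) pair. The data frame is a hypothetical miniature built from the first few sample records with shortened integer identifiers, and the column names n_calls and total_duration are our own; it uses base R only and does not touch the network package.

```r
# Hypothetical miniature of the CDR extract (shortened integer identifiers)
cdr <- data.frame(
  src      = c(5L, 6L, 9L, 6L, 6L),
  dest     = c(6L, 9L, 7L, 5L, 5L),
  start    = c(1238510028, 1238510627, 1238511103, 1238511145, 1238511678),
  duration = c(52L, 154L, 48L, 49L, 12L)
)

# Number of calls per ordered (src, dest) pair
calls <- aggregate(duration ~ src + dest, data = cdr, FUN = length)
names(calls)[3] <- "n_calls"

# Total call duration per ordered (src, dest) pair
total <- aggregate(duration ~ src + dest, data = cdr, FUN = sum)
names(total)[3] <- "total_duration"

# One row per directed pair; the start times are lost, as noted above
collapsed <- merge(calls, total, by = c("src", "dest"))
print(collapsed)
```

The two calls from 6 to 5 end up as a single row, which is exactly the loss of detail (individual call times) that makes us prefer the multiple-edge representation for now.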

Mapping customer identifiers to network vertex numbers

One thing they seem to forget to tell you in the documentation is that when you import your data, your vertex identifiers (which in our case are customer or account numbers) must be replaced by vertex numbers, and that this numbering must be sequential and start from 1. Being used to an environment where the vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while. The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:

> n <- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)
Error in add.edges(g, as.list(x[, 1]), as.list(x[, 2]), edge.check = edge.check) : 
  (edge check) Illegal vertex reference in addEdges_R.  Exiting.

For the discussion that follows, we will assume that you have changed your identifiers externally to R.
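
If you would rather do the renumbering inside R, one possible approach is to build a lookup vector of the distinct identifiers and use match() to turn each identifier into its position. This is only a sketch on four of the sample records; the variable names are ours, and the resulting numbers depend on the sort order, so they will not match the hand-numbered extract used in the rest of this post.

```r
# A few source and destination identifiers from the sample extract
src  <- c("+447000000005", "+447000000006", "+447000000009", "+447000000006")
dest <- c("+447000000006", "+447000000009", "+447000000007", "+447000000005")

# Lookup table: position in 'ids' becomes the vertex number (1, 2, 3, ...)
ids <- sort(unique(c(src, dest)))

# Edge list with sequential vertex numbers starting from 1
edges <- cbind(src = match(src, ids), dest = match(dest, ids))
print(edges)

# The original identifier of any vertex can be recovered from the lookup
print(ids[edges[, "src"]])
```

Keeping ids around matters: it is the only way back from vertex numbers to customers once the analysis is done.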

Loading the data

The good news is that our data is essentially in a format that the network package calls edge list and which it can import directly.

I say “essentially” because for some strange reason the package expects the destination to come before the source which seems ass-backwards to me. But assume we have our data in a file cdr.csv like this (we only have calls here):

       src,      dest,     start,  duration
         5,         6,1238510028,        52
         6,         9,1238510627,       154
         9,         7,1238511103,        48
         6,         5,1238511145,        49
...

Then we can load the data into R easily:

> library("network")
> m <- matrix(scan(file="cdr.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)
Read 1896 items
> # Swap columns for ass-backward network package
> m[,c(1,2)] <- m[,c(2,1)]

> # Create network
> net <- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)

> summary(net)
Network attributes:
 vertices = 10
 directed = TRUE
 hyper = FALSE
 loops = FALSE
 multiple = TRUE
 bipartite = FALSE
 total edges = 474 
   missing edges = 0 
   non-missing edges = 474 
 density = 5.266667 

Vertex attributes:
 vertex.names:
   character valued attribute
   10 valid vertex names

No edge attributes

Network adjacency matrix:
Error in as.matrix.network.adjacency(x = x, attrname = attrname, ...) : 
  Multigraphs not currently supported in as.matrix.network.adjacency.  Exiting.
In addition: Warning message:
In network.density(x) :
  Network is multiplex - no general way to define density.  Returning value for a non-multiplex network (hope that's what you wanted).

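OK, that's an error and a warning, but both come from summary trying to print an adjacency matrix and a density for a multigraph; the network object itself was built. We have figured out how to load our network data into the network package in R.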
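
The density of 5.27 looks alarming for a quantity that normally lies between 0 and 1. The figure is consistent with simply dividing the edge count by the number of ordered vertex pairs, as the warning hints. The formula below is our reading of the output, not something we checked in the package source.

```r
# Numbers taken from the scan and summary output above
items    <- 1896   # "Read 1896 items"
columns  <- 4      # src, dest, start, duration
vertices <- 10

edges <- items / columns              # one edge per CDR line
possible <- vertices * (vertices - 1) # ordered pairs, no loops
density <- edges / possible

print(edges)
print(round(density, 6))
```

With multiple edges allowed there is no upper bound on this ratio, so values above 1 are expected rather than a sign that the load went wrong.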

Performance

We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks. We change our CDR simulator to emit the destination before the source, just like network likes it, and let it run.

The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:

> library("network")
> n <- network(matrix(scan(file="cdr.1e5x1e0.csv", 
                           what=integer(0), skip=1, sep=','), 
                      ncol=4, byrow=TRUE), 
               matrix.type="edgelist", directed=TRUE, multiple=TRUE)
Read 10581152 items
> proc.time()
   user  system elapsed 
138.304   1.597 140.878 
> save(n, file="n.RData", ascii=FALSE, compress=FALSE)

The size of the saved network object is 373MB (only 27MB compressed).

We can potentially save some time and memory by explicitly not performing the edge check (again, the documentation frustrates us and is silent on what the defaults are for the network call we used above). We try this for our next file, which has 51,316,641 lines of CDR data (again for 100k customers); calling add.edges directly also saves us some column swapping:

> library("network")
> m <- matrix(scan(file="cdr.51M.csv", 
                   what=integer(0), skip=1, sep=','),
              ncol=4, byrow=TRUE)
Read 205266564 items
> num_vert <- max(m[,1], m[,2])
> num_vert
[1] 100000
> n <- network.initialize(n=num_vert, directed=TRUE, multiple=TRUE)
> add.edges(n, tail=m[,2], head=m[,1], edge.check=FALSE)
> proc.time()
(several hours: I’ll let you know when it is done)
> rm(m)
> save(n, file="n.RData", ascii=FALSE, compress=TRUE)

Our attempted optimization did not seem to matter and this network is too big for the machine and the network package. Building the network was painful as I was working on the workstation at the same time. The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.
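
For scale, here is a back-of-the-envelope check on the raw input, taking the item count from the scan output above and assuming the usual 4 bytes per R integer. It suggests the input matrix is a small fraction of the 16GB, so the memory pressure presumably comes from building the network object rather than from holding the data.

```r
# Item count reported by scan for the 51M-line file
items   <- 205266564
columns <- 4

rows  <- items / columns   # number of CDR lines
bytes <- items * 4         # assumption: 4 bytes per integer, no overhead

print(rows)
print(bytes / 2^20)        # size of the integer matrix in MiB
```

Since only the source and destination columns are used for the edges, dropping start and duration before building the network would halve even this, although that is clearly not where the real cost lies.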

51 million might be a reasonable size data set for some Financial Services applications but it is clearly a trivial number of records for Telecommunications. I’ll need to do some more digging around.

Does anybody have any SNA benchmarks? I like the KXEN implementation for its simplicity and speed so I might get a copy and try it out. Any R performance experts who could make suggestions in the comments? How big are your networks?

Join the discussion

Eigenvector centrality in 10min for 400K nodes

On 2009-07-24 22:00:00, Nick Lim said:

Richard,

Saw your note about trying to run a 40K node system.

We have successfully run certain SNA metrics on large networks fairly quickly. Here is a recent real-life example: the data comes from a social network hosted by a Fortune 2000 company, with 400K nodes and 1M+ edges. Specifically, our client was trying to calculate eigenvector centrality, which you might know as a recursive iterative metric. On my Dell Latitude D810 laptop, which has 2GB of RAM, our software calculated the eigenvector centrality metric in about 5 minutes; it took about 5 min to load the data and about 6 min to complete the calculation and output the scores to a CSV file.

It'd be great for us to get together to discuss what you are trying to do with the 40K node network. Perhaps we can help each other out. We also have an eval that can be shipped out fairly quickly. Drop me a line.

Am trying hard not to sound like the vendor that I am, so please forgive me for that....

Nick
nick@sonamine.com

shorter, easier code

On 2009-06-21 19:49:00, David Chen said:

I struggled with this for a bit. Here's what I came up with:

#load the packages (sedist comes from sna)
library("network")
library("sna")
#read in the table
myData <- read.table("data.txt")
#instantiate the network object
net <- network(myData, matrix.type="edgelist")
#set the edgelist attributes
set.edge.attribute(net, "o", myData[[3]])
#run a mds
final <- cmdscale(sedist(as.sociomatrix(net, "o"), method="euclidean"))
#plot the mds
plot(final)
#add labels from the original data file
text(final[,1],final[,2], row.names(final), cex=0.6, pos=4, col="red")

This works for the following text file

a 2 0.3
2 3 0.25
3 4 0.1
4 5 0.1
5 6 0.1
6 2 0.75
a 3 0.82
2 4 0.86
3 5 0.7
4 6 0.86
5 2 0.1
6 3 0.12
a 4 0.13
2 5 0.02
3 6 0.34
4 2 0.88
5 3 0.76
6 4 0.98
a 5 0.64
2 6 0.71
3 2 0.21
4 3 0.12
5 4 0.33
6 5 0.32
a 6 0.1
2 a 0.1
3 a 0.85
4 a 0.19
5 a 0.81
6 a 0.21

avoiding errors, warnings

On 2009-04-27 13:34:00, David Chen said:

There's probably someone out there who has a better solution, but here's how I avoided the above errors:

# load the package
library("network")
# read in the myData table
myData <- read.table("data.txt")
# write the edgelist
elist <- cbind(myData[[1]],myData[[2]])
# create the network
net <- network(elist, matrix.type="edgelist")
# set the edge attribute 'val'
net %e% "val" <- as.list(myData[[3]])
# display the adjacency matrix with values
as.sociomatrix(net, "val")

Contents of data.txt:
1 3 0.1
2 4 0.6
3 1 0.2
4 2 0.5
2 1 0.4
1 2 1.0

A very helpful guide is posted at http://www.jstatsoft.org/v24/i02/

RE: SNA in R

On 2009-04-23 20:47:00, Richard said:

Thanks for the feedback. I found an upper limit of 5000 nodes before memory limitations were being reached in trying to compute SNA metrics. After a bit of digging around, I read about the igraph library. I have successfully processed data and produced SNA metrics for networks with 100,000 nodes and 500,000 transactions between them. Of course, some of the metrics take time to be processed, but that is understandable.

Re: SNA in R

On 2009-04-23 08:58:00, Allan Engelhardt said:

Richard: Functions like sna::degree do not work for graphs with multiple edges, but you are absolutely right that the function uses insane amounts of memory as anybody can verify by doing

> library("network"); library("sna")
> m <- cbind(seq(1,40e3,by=1), seq(40e3,1,by=-1))
> n <- network(m, matrix.type="edgelist", directed=TRUE, multiple=FALSE)
> # degree(n)

Where we have commented out the call to degree because it trashes our swap file.
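
To put a number on it: if the function builds a dense 40k x 40k matrix of doubles (our assumption, based on the error Richard reports), that alone is about 12 GiB. For a simple count of in-degree plus out-degree you do not need the matrix at all; the sketch below works directly on the edge list with base R's tabulate and is not a replacement for the other sna metrics.

```r
# Same edge list as above: 40k vertices, one outgoing edge each
n_vert <- 40e3
m <- cbind(seq(1, n_vert, by = 1), seq(n_vert, 1, by = -1))

# Size of a dense n x n matrix of 8-byte doubles, in GiB
dense_bytes <- n_vert^2 * 8
print(dense_bytes / 2^30)

# Degree counts straight from the edge list (column 1 = from, column 2 = to)
out_deg <- tabulate(m[, 1], nbins = n_vert)
in_deg  <- tabulate(m[, 2], nbins = n_vert)
total_deg <- out_deg + in_deg

print(table(total_deg))
```

This needs memory proportional to the number of vertices and edges rather than to the square of the number of vertices.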

In summary: do not use the network and sna libraries naively for normal networks: the libraries seem to be meant only for networks identified by people with clipboards (up to, say, around the Dunbar number of 150 nodes).

SNA in R

On 2009-04-23 07:36:00, Richard said:

Thanks for your web page about SNA in R. Really helped me out. Just one thing I'd like to know... you talk about 100k customers. Well I have an edgelist of only 40k customers. However, when I try to use an 'sna' library function such as 'degree' on the adjacency matrix (created with the network command), it throws an error because it has not got enough memory to create a vector that is of length (40kx40k). Did you do any similar analysis with your 100k customers and what did you experience?

The statnet documentation is very good by R standards

On 2009-04-01 19:58:00, Allan Engelhardt said:

I should make clear that the statnet documentation is very (very) good by the standards of R packages and I am not picking on the authors in particular. It is just that that is not a very high bar. And a different update: I killed the last (51M records) network build after a little over 16 hours when it still hadn’t finished. Note that the matrix of course loads fine – it is the actual creation of the network object that fails to complete.
