We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use the statnet group of packages here, which seems to be the most comprehensive and most actively maintained suite of network analysis packages.
The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.
We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:
src, dest, start, duration,type,...
+447000000005,+447000000006,1238510028, 52,call,...
+447000000006,+447000000009,1238510627, 154,call,...
+447000000009,+447000000007,1238511103, 48,call,...
+447000000006,+447000000005,1238511145, 49,call,...
+447000000006,+447000000005,1238511678, 12,call,...
+447000000001,+447000000006,1238511735, 147,call,...
+447000000007,+447000000009,1238511806, 26,call,...
+447000000000,+447000000008,1238511825, 19,call,...
+447000000009,+447000000008,1238511900, 28,call,...
...
Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction, which are represented by the … in the extract. In a Financial Services example, the records may be money transfers between accounts.
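To make the record layout concrete, here is a minimal sketch in base R that reads a few records of this shape. The inline text is a hypothetical stand-in for a real extract and contains only the five named columns. The column classes are our assumption, chosen so that the phone numbers stay character strings and are not parsed as numbers. We also assume that start is a count of seconds since the Unix epoch.

```r
# Hypothetical stand-in for a CDR extract; a real file would have more columns.
cdr_text <- "src,dest,start,duration,type
+447000000005,+447000000006,1238510028,52,call
+447000000006,+447000000009,1238510627,154,call
+447000000009,+447000000007,1238511103,48,call
+447000000006,+447000000005,1238511145,49,call"

# Read src and dest as character so the leading "+" is preserved.
cdr <- read.csv(
  text = cdr_text,
  colClasses = c("character", "character", "numeric", "integer", "character")
)

# Assumption: start is seconds since the Unix epoch; convert it for readability.
cdr$start_time <- as.POSIXct(cdr$start, origin = "1970-01-01", tz = "UTC")

str(cdr)
```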
Implementation in the network class
In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges. That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.
In the terminology of the network class, that means that our network will be directed (calls and money transfers have a direction from one person to another) and will need to allow multiple edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).
We could consider dropping the multiple attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute. We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).
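The collapsed representation can be prepared before any network object is built. The following is a minimal sketch in base R, assuming a data frame with the columns src, dest and duration; the sample rows and the names calls and simple_edges are ours, for illustration only.

```r
# Hypothetical call records with vertex ids already renumbered.
cdr <- data.frame(
  src      = c(5L, 6L, 9L, 6L, 6L),
  dest     = c(6L, 9L, 7L, 5L, 5L),
  duration = c(52L, 154L, 48L, 49L, 12L)
)

# Count each record once, then sum counts and durations per (src, dest) pair.
cdr$calls <- 1L
simple_edges <- aggregate(cbind(calls, duration) ~ src + dest,
                          data = cdr, FUN = sum)

print(simple_edges)
```

Each row of simple_edges is one directed edge, with calls and duration as candidate edge attributes. The start times are gone, which is exactly the loss of information described above.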
Mapping customer identifiers to network vertex numbers
One thing the documentation seems to forget to tell you is that when you import your data, your vertex identifiers (in our case customer or account numbers) must be replaced by vertex numbers, and that this numbering must be sequential and start from 1. Being used to an environment where the vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while. The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:
For the discussion that follows, we will assume that you have changed your identifiers externally to R.
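If you would rather do the renumbering inside R, a lookup table built with unique() and match() is enough. This is a minimal sketch in base R; the sample identifiers and the names ids and edges are ours.

```r
# Hypothetical raw identifiers, as they would appear in the CDR extract.
src  <- c("+447000000005", "+447000000006", "+447000000009", "+447000000006")
dest <- c("+447000000006", "+447000000009", "+447000000007", "+447000000005")

# Lookup table: the position of an identifier in ids is its vertex number.
ids <- sort(unique(c(src, dest)))

# Replace every identifier by its position, giving numbers in 1..length(ids).
edges <- data.frame(
  src  = match(src, ids),
  dest = match(dest, ids)
)

print(edges)
```

Keeping ids around lets you translate back later: ids[k] is the customer identifier of vertex k.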
Loading the data
The good news is that our data is essentially in a format that the network package calls edge list and which it can import directly.
I say “essentially” because for some strange reason the package expects the destination to come before the source which seems ass-backwards to me. But assume we have our data in a file cdr.csv like this (we only have calls here):
src, dest, start, duration
5, 6, 1238510028, 52
6, 9, 1238510627, 154
9, 7, 1238511103, 48
6, 5, 1238511145, 49
...
Then we can load the data into R easily:
OK, that’s a lot of warnings, but it basically worked. We have figured out how to load our network data into the network package in R.
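For readers who want to try this themselves, here is a minimal sketch of one way to build such a network. It is not the exact call used for this post. It assumes the network package is installed and builds the object with network.initialize and add.edges, where tail and head are passed by name so that nothing depends on column order. As far as we can tell, current releases draw an edge from tail to head, but it is worth checking the resulting edge list on your own version.

```r
library(network)

# Hypothetical stand-in for cdr.csv, using the four records shown above.
cdr <- data.frame(
  src      = c(5L, 6L, 9L, 6L),
  dest     = c(6L, 9L, 7L, 5L),
  start    = c(1238510028, 1238510627, 1238511103, 1238511145),
  duration = c(52L, 154L, 48L, 49L)
)

# Vertices are numbered 1..n, so n is the largest id that occurs.
n <- max(cdr$src, cdr$dest)

# Directed network that allows several edges between the same endpoints.
net <- network.initialize(n, directed = TRUE, multiple = TRUE)

# Assumption: an edge runs from tail (caller) to head (callee).
net <- add.edges(net, tail = as.list(cdr$src), head = as.list(cdr$dest))

# Keep the per-call details as edge attributes.
set.edge.attribute(net, "start", cdr$start)
set.edge.attribute(net, "duration", cdr$duration)

# Read the edges back to check count and direction.
print(network.edgecount(net))
print(as.matrix(net, matrix.type = "edgelist"))
```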
We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks. We change our CDR simulator to emit the destination before the source, just as network likes it, and let it run.
The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:
The size of the saved network object is 373MB (only 27MB compressed).
We can potentially save some time and memory by explicitly not performing the edge check (again, the documentation frustrates us: it is silent on what the defaults are for the network call we used above). We try this for our next file, which has 51,316,641 lines of CDR data (again for 100k customers); this also saves us some column swapping:
Our attempted optimization did not seem to matter and this network is too big for the machine and the network package. Building the network was painful as I was working on the workstation at the same time. The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.
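To get a feel for how build time and object size scale before committing to a file of this size, a small simulation along the following lines may help. It is a sketch only. The sizes are hypothetical and far smaller than our real files, the simulated calls are uniformly random and are not output from our CDR simulator, and we assume that add.edges accepts an edge.check argument, as its help page suggests.

```r
library(network)

set.seed(1)
n_customers <- 1000L   # hypothetical sizes, far smaller than the real files
n_calls     <- 20000L

# Simulated calls; the non-zero offset guarantees that nobody calls themselves.
src    <- sample.int(n_customers, n_calls, replace = TRUE)
offset <- sample.int(n_customers - 1L, n_calls, replace = TRUE)
dest   <- (src - 1L + offset) %% n_customers + 1L

net <- network.initialize(n_customers, directed = TRUE, multiple = TRUE)

# Assumption: edge.check = FALSE asks add.edges to skip its legality tests.
timing <- system.time(
  net <- add.edges(net, tail = as.list(src), head = as.list(dest),
                   edge.check = FALSE)
)

print(timing["elapsed"])
print(format(object.size(net), units = "MB"))
```

Repeating the run with larger values of n_calls gives a rough idea of whether memory or time will become the limit first on a given machine.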
51 million records might be a reasonably sized data set for some Financial Services applications, but it is clearly a trivial number of records for Telecommunications. I’ll need to do some more digging around.
Does anybody have any SNA benchmarks? I like the KXEN implementation for its simplicity and speed so I might get a copy and try it out. Any R performance experts who could make suggestions in the comments? How big are your networks?