Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related fields.
This is the blog for practising data analysts and theoretical statisticians. The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.
I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I wanted to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
Read more (~501 words).
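As an illustration of the problem in the entry above, here is a minimal sketch in R. It is not necessarily the approach the full post settles on: the helper name `zero_variance_columns` is hypothetical, and it assumes a column counts as constant when it holds at most one distinct value (so `NA` is treated as a value of its own).

```r
# Hypothetical helper (not from the post): names of the columns whose
# values never vary, i.e. that contain at most one distinct value.
zero_variance_columns <- function(df) {
  is_constant <- vapply(df, function(x) length(unique(x)) <= 1L, logical(1))
  names(df)[is_constant]
}

df <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c("x", "x", "x"))
zero_variance_columns(df)
# "a" "c"

# Drop the constant columns, keeping the result a data frame.
df[, setdiff(names(df), zero_variance_columns(df)), drop = FALSE]
```

Counting distinct values rather than calling `var()` lets the same check work for character and factor columns; whether it is the fastest option on large data is exactly the kind of question the post measures.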
O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.
Read more (~66 words).
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based backends to PostgreSQL are being implemented now. Read: Announcing release of HadoopDB.
Read more (~83 words).
David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform. My reply in the comments seems to have disappeared for a while so here is my proposed solution:
Read more (~165 words).
I am a sucker for good quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).
Read more (~53 words).
I like the "multicore" library for a particular task. I can easily write a check along the lines of if(require("multicore",...)) so that my function automatically uses the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is itself called from mclapply() (or one of the lower level functions), in which case much CPU thrashing and grinding of teeth will result.
So, I needed a function to determine if my function was called from any function in the "multicore" library. Here it is.
Read more (~190 words).
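The idea in the entry above can be sketched by inspecting the call stack. This is an illustration only, not the function from the post: the name `called_from` is hypothetical, and matching callers by bare function name (here just "mclapply") is an assumption that would miss renamed or anonymous wrappers.

```r
# Hypothetical helper (not the post's code): TRUE if any function on the
# current call stack has one of the given names.
called_from <- function(fun_names) {
  calls <- as.list(sys.calls())
  callers <- vapply(calls, function(cl) {
    head <- cl[[1]]
    # For pkg::fun(...) the head is itself a call; take its last element.
    if (is.call(head)) head <- head[[length(head)]]
    if (is.symbol(head)) as.character(head) else NA_character_
  }, character(1))
  any(callers %in% fun_names)
}

# Demonstration with an ordinary function standing in for the parallel caller.
outer_fn <- function() called_from("outer_fn")
outer_fn()
# TRUE

# At top level, outside any such caller, the check is FALSE.
called_from("mclapply")
```

A function could then fall back to plain `lapply()` whenever `called_from("mclapply")` is true, which avoids nesting one parallel loop inside another.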
Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here.
Read more (~414 words, 3 comments).
O’Reilly has published Data Mashups in R as a $4.99 PDF download in their Short Cut series. In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here. This is all done with the R statistical computing and analysis platform.
Read more (~108 words).
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently), kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).
Read more (~456 words).
The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.
Read more (~388 words).
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200). He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard comes up with a much nicer method using read.table on a textConnection object:
Read more (~157 words).
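The read.table-on-a-textConnection technique named in the entry above might look something like the following sketch. It is a reconstruction of the idea, not Peter Dalgaard's actual code; the sample addresses and the column names `o1` to `o4` are made up for illustration.

```r
# Sample data (made up for illustration).
ips <- c("10.2.3.4", "1.10.100.200", "1.9.0.1", "192.168.0.1")

# Parse the dotted quads into four integer columns, one per octet.
con <- textConnection(ips)
octets <- read.table(con, sep = ".", colClasses = "integer",
                     col.names = paste0("o", 1:4))
close(con)

# Order by first octet, then second, third and fourth.
sorted <- ips[do.call(order, octets)]
sorted
# "1.9.0.1" "1.10.100.200" "10.2.3.4" "192.168.0.1"
```

Sorting on the numeric octets is what puts 1.9.0.1 ahead of 1.10.100.200; a plain character sort would order them the other way round.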
I am always on the lookout for useful data sources for training in statistics, so I am excited that Data.gov has opened for business. The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government.
Read more (~93 words).
We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.
Read more (~736 words, 4 comments).
We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages.
The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.
Read more (~1292 words, 7 comments).
Swapping two columns in a matrix is really easy: m[ , c(1,2)] <- m[ , c(2,1)].
Read more (~84 words).
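A small worked example of the one-liner above, using a made-up 2 x 3 matrix. R evaluates the right-hand side in full before assigning, so no temporary variable is needed.

```r
# A made-up 2 x 3 matrix whose columns are (1,2), (3,4), (5,6).
m <- matrix(1:6, nrow = 2)

# Swap columns 1 and 2 in place.
m[, c(1, 2)] <- m[, c(2, 1)]

m
# The columns are now (3,4), (1,2), (5,6).
```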
When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.
Read more (~221 words, 3 comments).
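The full post describes the author's own method. Purely as an assumption about one way this could be done, the sketch below masks `q()` with a wrapper that defaults to not saving; it would typically go in a startup file such as `~/.Rprofile`.

```r
# One possible approach (an assumption, not necessarily the post's method):
# mask q() with a wrapper whose default is save = "no".
# base::q() still does the actual work, and q(save = "yes") remains available.
q <- function(save = "no", ...) base::q(save = save, ...)
```

This only affects explicit calls to `q()`; `quit()` is left alone. Starting R with the `--no-save` command-line option is another route to the same effect.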
In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your add-on packages up-to-date.
Read more (~571 words).
On 2010-03-08 14:46:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I wanted to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
Read more (~501 words).
On 2009-08-17 09:18:00, Allan Engelhardt wrote in CYBAEA Journal:
We knew the potential existed already, of course. Mobile devices in the USA generate some 600 billion transactions per day, each tagged with the location and time. Jeff Jonas: "Every call, text message, email and data transfer handled by your mobile device creates a transaction with your space-time coordinate[...]."
The mobile operators have this data, of course. We all know this (especially here where we have been using some of it for social network analysis). No real surprises here, except perhaps in the volumes.
But did you know that the operators are sharing your data? What is new, at least to me, is that this data is being provided to third parties that are leveraging specially designed analytics to make sense of our space-time-travel data.
Read more (~449 words, 1 comment).
On 2009-07-27 19:38:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.
Read more (~66 words).
On 2009-07-22 13:37:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based backends to PostgreSQL are being implemented now. Read: Announcing release of HadoopDB.
Read more (~83 words).
On 2009-07-22 06:59:00, Allan Engelhardt wrote in CYBAEA Journal:
The nice people at Velocity have released The B2B Content Marketing Workbook. It is behind a registration wall, which means we wouldn't normally recommend it, but you can just type junk in the fields if you are not comfortable with giving your personal details to a marketing agency. (Think about it....) If you are relatively new to the B2B world, say having joined a professional services or consulting organization, you may find this one useful.
Read more (~263 words).