CYBAEA Data and Analysis

By Allan Engelhardt

Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related areas.

This is the blog for practising data analysts and theoretical statisticians. The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.

R: Eliminating observed values with zero variance

On 2010-03-08 14:46:00, Allan Engelhardt wrote:

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance, and as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
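As a rough illustration of the task (not the solution the full post arrives at, and with no claim about speed), a minimal sketch might treat a column as having zero variance when it holds at most one distinct non-missing value. The helper name zero_variance_columns and the sample data frame are hypothetical, as is the choice to ignore NA values.

```r
# Hypothetical helper: names of the columns whose values never vary.
# Assumption: NA values are ignored, so an all-NA column counts as constant.
zero_variance_columns <- function(df) {
  is_constant <- vapply(
    df,
    function(x) length(unique(x[!is.na(x)])) <= 1L,
    logical(1)
  )
  names(df)[is_constant]
}

# Small made-up data frame: columns a and c are constant, b is not.
df <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c("x", "x", "x"))

constant <- zero_variance_columns(df)

# Drop the constant columns, keeping the result a data frame.
reduced <- df[, !(names(df) %in% constant), drop = FALSE]
```

Counting distinct values rather than calling var() lets the same check work for character and factor columns.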

Beautiful Data

On 2009-07-27 19:38:00, Allan Engelhardt wrote:

O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.

Massively parallel database for analytics

On 2009-07-22 13:37:00, Allan Engelhardt wrote:

This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based backends to PostgreSQL are being implemented now. Read: Announcing release of HadoopDB.

The Knapsack Problem

On 2009-07-10 20:30:00, Allan Engelhardt wrote:

David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform. My reply in the comments seems to have disappeared for a while so here is my proposed solution:

OECD Statistics

On 2009-07-02 20:33:00, Allan Engelhardt wrote:

I am a sucker for good quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require subscription).

R tips: Determine if function is called from specific package

On 2009-06-16 10:27:00, Allan Engelhardt wrote:

I like the "multicore" library for a particular task. I can easily write a combination of if(require("multicore",...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions), in which case much CPU thrashing and grinding of teeth will result.

So, I needed a function to determine if my function was called from any function in the "multicore" library. Here it is.
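One way such a check could be written, offered only as a sketch and not as the function from the full post, is to walk the call stack and test whether any calling function lives in the package's namespace. The name called_from_package is hypothetical, and the example uses "base" and lapply() as stand-ins so that it runs without "multicore" installed.

```r
# Hypothetical helper: TRUE if any function on the current call stack
# is defined in the namespace of package `pkg`.
called_from_package <- function(pkg) {
  # A package whose namespace is not loaded cannot be on the stack.
  if (!(pkg %in% loadedNamespaces())) return(FALSE)
  ns <- asNamespace(pkg)
  # Frames 1 .. (n - 1) belong to our callers; frame n is this function.
  for (i in seq_len(sys.nframe() - 1L)) {
    f <- sys.function(i)
    if (identical(environment(f), ns)) return(TRUE)
  }
  FALSE
}

# lapply() lives in the base namespace, so the check succeeds inside it.
via_lapply <- unlist(lapply(1:2, function(i) called_from_package("base")))

# Called directly, with no function from "stats" on the stack.
direct <- called_from_package("stats")
```

In the situation described above, the same test with pkg = "multicore" could guard the choice between mclapply() and lapply(). Functions created inside a package function at run time have a different environment, so this sketch only detects the package's own top-level functions.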

R tips: Installing Rmpi on Fedora Linux

On 2009-06-12 10:23:00, Allan Engelhardt wrote:

Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here.

Data Mashups in R from O'Reilly

On 2009-06-09 11:23:00, Allan Engelhardt wrote:
[Philadelphia County July 2009 Foreclosure Heat Map]

O’Reilly has published Data Mashups in R as a $4.99 PDF download in their Short Cut series. In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here. This is all done with the R statistical computing and analysis platform.

How to win the KDD Cup Challenge with R and gbm

On 2009-06-01 07:07:00, Allan Engelhardt wrote:

Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently), kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).

R used by KDD 2009 cup winner of slow challenge

On 2009-05-31 13:17:00, Allan Engelhardt wrote:

The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.

R tips: Use read.table instead of strsplit to split a text column into multiple columns

On 2009-05-29 10:53:00, Allan Engelhardt wrote:

Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200). He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard comes up with a much nicer method using read.table on a textConnection object:
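The idea can be sketched roughly as follows. This is an illustration of the read.table-on-textConnection technique, not a copy of the code posted to the list; the sample addresses and the column names o1 to o4 are made up.

```r
# Made-up data frame with IP addresses stored as text.
df <- data.frame(
  ip = c("10.2.3.4", "1.10.100.200", "1.9.0.1", "192.168.0.1"),
  stringsAsFactors = FALSE
)

# Treat the character vector as a file and let read.table split on ".".
con <- textConnection(df$ip)
octets <- read.table(con, sep = ".", colClasses = "integer",
                     col.names = c("o1", "o2", "o3", "o4"))
close(con)  # textConnection() returns an open connection; close it ourselves

# Sort numerically by each octet in turn.
sorted <- df[order(octets$o1, octets$o2, octets$o3, octets$o4), , drop = FALSE]
```

Because the octets are read as integers, 1.9.0.1 sorts before 1.10.100.200, which a plain text sort on the ip column would get wrong.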

Data.gov

On 2009-05-22 02:23:00, Allan Engelhardt wrote:

I am always on the lookout for useful data sources for training in statistics, so I am excited that Data.gov has opened for business. The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government.

SNA with R: Loading large networks using the igraph library

On 2009-05-06 15:33:00, Allan Engelhardt wrote:

We are interested in Social Network Analysis using the statistical analysis and computing platform R. The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.

In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package.

SNA with R: Loading your network data

On 2009-04-01 16:08:00, Allan Engelhardt wrote:

We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages.

The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.

R tips: Swapping columns in a matrix

On 2009-03-31 15:59:00, Allan Engelhardt wrote:

Swapping two columns in a matrix is really easy: m[ , c(1,2)] <- m[ , c(2,1)].
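A small worked example of the one-liner, using a made-up 2 x 3 matrix:

```r
# Made-up matrix: columns are (1, 2), (3, 4), (5, 6).
m <- matrix(1:6, nrow = 2)

# Work on a copy so the original stays available for comparison.
swapped <- m
swapped[, c(1, 2)] <- swapped[, c(2, 1)]
```

The assignment replaces values only, so if the matrix has column names they stay where they were.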

R tips: Eliminating the “save workspace image” prompt on exit

On 2009-03-26 08:14:00, Allan Engelhardt wrote:

When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.
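One possible approach, which may differ from the one in the full post, is to define a wrapper around quit() whose save argument defaults to "no". The sketch below assumes the definition is placed in a startup file such as ~/.Rprofile; starting R with the --no-save command line option is another route.

```r
# Assumption: this definition lives in a startup file such as ~/.Rprofile.
# It masks q() with a version that never asks about saving the workspace.
q <- function(save = "no", ...) {
  quit(save = save, ...)
}
```

Calling q() then exits without the prompt, while q(save = "yes") still saves the workspace on request.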

R tips: Keep your packages up-to-date

On 2009-03-25 20:59:00, Allan Engelhardt wrote:

In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.
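A minimal sketch of one way to do this, assuming a CRAN mirror is already configured, wraps update.packages() from the utils package. The wrapper name refresh_packages is hypothetical and the full post may recommend different settings.

```r
# Hypothetical wrapper: update all installed add-on packages without prompting.
# Assumption: a CRAN mirror is already set, e.g. via options(repos = ...).
refresh_packages <- function(...) {
  utils::update.packages(
    ask = FALSE,        # do not ask for confirmation per package
    checkBuilt = TRUE,  # also reinstall packages built under an older R
    ...
  )
}
```

Running refresh_packages() needs network access and write permission to the library directory, so it is defined here but not called.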
