On 2009-05-31 13:17:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.
The write-up (username/password may be required) from Hugh Miller and team at the University of Melbourne includes these points:
Impressive hardware selection! Well done R. Weka was another popular tool among the top entrants. Key for all of them were clever data preparation and variable substitution. The fast track winners from IBM document this in some detail:
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.
We tried PCA on the large data set, but it did not seem to help.
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.
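Two of the preparation steps in the IBM description (top-ten indicator columns for categorical variables, and mean imputation with a missing-value flag for numeric ones) can be sketched in a few lines. This is a minimal illustration in plain Python, not the team's actual code; the function names, the column-naming scheme, the use of `None` to mark a missing value, and the fallback fill of 0.0 for an all-missing column are assumptions made for the example.

```python
from collections import Counter


def top_k_indicator_columns(name, values, k=10):
    """Encode one categorical column as at most k + 1 binary columns.

    Hypothetical helper: one 0/1 column for each of the k most common
    values, plus one column flagging any value outside that top k.
    """
    counts = Counter(values)
    # Ties are broken by order of first appearance.
    top = [value for value, _ in counts.most_common(k)]
    top_set = set(top)
    columns = {
        f"{name}={value}": [int(v == value) for v in values] for value in top
    }
    # The extra column: the instance has a value that is not in the top k.
    columns[f"{name}=<other>"] = [int(v not in top_set) for v in values]
    return columns


def impute_mean_with_indicator(values):
    """Replace missing numeric values (None) by the column mean.

    Returns the filled column and a parallel 0/1 column recording
    which entries were missing. An all-missing column is filled with
    0.0, which is an arbitrary choice for this sketch.
    """
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    filled = [mean if v is None else v for v in values]
    missing = [int(v is None) for v in values]
    return filled, missing


# Small usage example with k=2 instead of the ten used in the write-up.
colours = ["red", "blue", "red", "green", "red", "blue"]
encoded = top_k_indicator_columns("colour", colours, k=2)

filled, was_missing = impute_mean_with_indicator([1.0, None, 3.0])
```

Note that when a variable has no more than k distinct values the `<other>` column is constant, so it would be dropped by the "remove constant attributes" step the team mentions.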
On 2010-03-08 14:46:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
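To make the problem concrete, here is a minimal sketch of the task in plain Python. It is not the R approach the post arrives at (that result is not shown in this excerpt); the function name and the dict-of-lists table layout are assumptions for illustration. The idea is that a column has zero variance exactly when every value equals the first one, so the scan can stop at the first mismatch instead of computing a variance.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical.

    `table` is assumed to map column name -> list of values.
    Empty columns are reported as constant. Columns containing NaN
    are never reported, because NaN does not compare equal to itself.
    """
    constant = []
    for name, values in table.items():
        iterator = iter(values)
        try:
            first = next(iterator)
        except StopIteration:
            constant.append(name)  # no values at all
            continue
        # all() short-circuits on the first value that differs.
        if all(v == first for v in iterator):
            constant.append(name)
    return constant


table = {
    "a": [1, 1, 1],
    "b": [1, 2, 1],
    "c": ["x", "x"],
    "d": [],
}
result = constant_columns(table)
```

Comparing against the first value also works for non-numeric columns, where a variance is not defined.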
Read more (~501 words).
On 2009-08-17 09:18:00, Allan Engelhardt wrote in CYBAEA Journal:
We knew the potential existed already, of course. Mobile devices in the USA generate some 600 billion transactions per day, each tagged with the location and time. Jeff Jonas: Every call, text message, email and data transfer handled by your mobile device creates a transaction with your space-time coordinate[...].
The mobile operators have this data, of course. We all know this (especially here where we have been using some of it for social network analysis). No real surprises here, except perhaps in the volumes.
But did you know that the operators are sharing your data? What is new, at least to me, is that this data is being provided to third parties that are leveraging specially designed analytics to make sense of our space-time-travel data.
Read more (~449 words, 1 comment).
On 2009-07-27 19:38:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.
Read more (~66 words).
On 2009-07-22 13:37:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But much more than a theoretical discussion, they have built a solution which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source. Alternative, column-based, backends to PostgreSQL are being implemented now. Read: Announcing release of HadoopDB.
Read more (~83 words).
On 2009-07-22 06:59:00, Allan Engelhardt wrote in CYBAEA Journal:
The nice people at Velocity have released The B2B Content Marketing Workbook. It is behind a registration wall, which means we wouldn’t normally recommend it, but you can just type junk in the fields if you are not comfortable with giving your personal details to a marketing agency. (Think about it....) If you are relatively new in the B2B world, say having joined a professional services or consulting organization, you may find this one useful.
Read more (~263 words).