Blog posts from CYBAEA

Feature selection: All-relevant selection with the Boruta package

15 November 2010

Feature selection is an important step in practical commercial data mining, which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models), and all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.

Read more (~1210 words)

Big data for R

5 August 2010

Revolution Analytics recently announced their “big data” solution for R. This is great news and a lovely piece of work by the team at Revolutions.

Read more (~760 words)

Area Plots with Intensity Coloring

13 July 2010

I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.

Read more (~420 words)

Employee productivity revisited

22 June 2010

We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is scary.

Read more (~300 words)

Employee productivity as function of number of workers revisited

22 June 2010

We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.
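
To make the "treble the workers, halve the productivity" rule concrete, here is a minimal sketch in Python (not the R used elsewhere on this blog). It assumes, purely for illustration, that per-worker productivity follows a power law in headcount; that functional form and the names `alpha` and `relative_productivity` are assumptions and do not come from the post.

```python
import math

# Assumption (not from the post): per-worker productivity follows a power law,
#   p(N) = c * N ** (-alpha)
# "Treble the workers, halve the productivity" then means 3 ** (-alpha) == 1/2,
# which gives alpha = ln(2) / ln(3).
alpha = math.log(2) / math.log(3)


def relative_productivity(growth_factor: float) -> float:
    """Per-worker productivity after headcount grows by growth_factor,
    relative to productivity before the growth."""
    return growth_factor ** (-alpha)


print(f"alpha = {alpha:.3f}")
print(f"x3 workers  -> per-worker productivity x{relative_productivity(3):.3f}")
print(f"x10 workers -> per-worker productivity x{relative_productivity(10):.3f}")
# Total output is headcount times per-worker productivity.
print(f"x3 workers  -> total output x{3 * relative_productivity(3):.2f}")
```

Under this assumed form, trebling the workforce still increases total output, but only by a factor of about 1.5 rather than 3.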

Read more (~670 words)

Comparing standard R with Revolutions for performance

17 June 2010

Following on from my previous post about improving the performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages.

Read more (~270 words)

Faster R through better BLAS

15 June 2010

Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.

Read more (~750 words)

Eliminating observed values with zero variance in R

8 March 2010

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.
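
As a rough illustration of the task (not the approach the full post arrives at), here is a minimal sketch in Python rather than R. It assumes the data is held as a dictionary mapping column names to lists of values; the function names `zero_variance_columns` and `drop_zero_variance` are hypothetical.

```python
from typing import Dict, List, Sequence


def zero_variance_columns(table: Dict[str, Sequence]) -> List[str]:
    """Return the names of columns whose values are all identical."""
    constant = []
    for name, values in table.items():
        # A column has zero variance exactly when every value equals the
        # first one, so the variance itself never needs to be computed.
        # Empty columns are left alone.
        if len(values) > 0 and all(v == values[0] for v in values):
            constant.append(name)
    return constant


def drop_zero_variance(table: Dict[str, Sequence]) -> Dict[str, Sequence]:
    """Return a copy of the table without its constant columns."""
    constant = set(zero_variance_columns(table))
    return {name: values for name, values in table.items() if name not in constant}


# Made-up example data: "country" is constant, the other columns vary.
data = {"age": [31, 45, 27], "country": [1, 1, 1], "spend": [9.5, 9.5, 12.0]}
print(zero_variance_columns(data))
print(list(drop_zero_variance(data)))
```

Comparing against the first value lets `all` stop at the first differing entry, which matters when most columns do vary.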

Read more (~470 words)

Your mobile phone knows everything about you … and it is telling

17 August 2009

We knew the potential existed already, of course. Mobile devices in the USA generate some 600 billion transactions per day, each tagged with the location and time. Jeff Jonas says, “Every call, text message, email and data transfer handled by your mobile device creates a transaction with your space-time coordinate[…]. Got a Blackberry? Every few minutes, it sends a heartbeat, creating a transaction whether you are using the phone or not.” That is some 7 million transactions per second, on average.
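
The per-second figure follows directly from the daily total. A quick check in Python, using only the 600 billion figure quoted above (the variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

transactions_per_day = 600e9  # daily figure quoted in the text
per_second = transactions_per_day / SECONDS_PER_DAY

print(f"{per_second / 1e6:.1f} million transactions per second")
```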

Read more (~440 words)

Beautiful Data

27 July 2009

O’Reilly’s recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.

Read more (~60 words)