Large data and R

Unlearn your small data mindset

“Unlearn your small data mindset and learn how to manipulate and analyse large and very large data sets using the R language and environment.”

The R language and environment for statistical computing and graphics is the de facto standard for advanced data mining across many fields of research and many industries. The environment provides an integrated suite of software facilities for data manipulation, calculation and graphical display.

Most books and materials about R focus on small data sets such as the iris data. This has led some to the misconception that R is unsuited for the manipulation and analysis of large data sets. Nothing could be further from the truth.

Working with large data sets in R requires a different approach from the one that works for small ones. This course will help you unlearn your small data mindset and teach you how to manipulate and analyse large and very large data sets using R.

This course is intended for people who are already familiar with R and comfortable using the environment, but who are finding it difficult to manipulate and analyse ever-growing data sets. At the end of this course you will be comfortable with large (tens of gigabytes) and very big (terabytes and beyond) data sets. Although the course is mainly technical, by the end you will have a practical understanding of when to use each approach to deliver results for your business, and you will have developed a set of practical strategies for big data analysis that you can use immediately.

Course contents

Part 1: Large Data on your computer

First we look at handling large data sets on a single computer. (By large data we mean tens of gigabytes.) This is where you unlearn your memory-inefficient habits and learn to produce lean, efficient analyses, and where you learn about file-backed storage for data that is too large to hold in memory.

Topics covered include:

  1. Efficient data structures: goodbye data.frame, hello data.table, sqldf, and friends.
  2. Working with databases: DBI, RSQLite, RPostgreSQL, and packages built on top of these.
  3. Analysing large data sets: biglm, speedglm, ff, biganalytics, and biglars.
  4. Introduction to compiled code: Rcpp and inline.

After completing this part you will have developed an approach for manipulating and analysing large data sets that you can apply to your work.
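
To give a flavour of the memory-efficient habits this part is about, here is a minimal sketch that computes a column mean by streaming a CSV file in chunks, so that only one chunk and two running totals are in memory at a time. It is an illustration only and is not taken from the course materials: it uses base R rather than the packages listed above, and the function name chunked_mean, the column names, and the small temporary file standing in for a large one are all assumptions made for the example.

```r
# Hypothetical stand-in for a file that is too large to load in one go.
path <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(id = 1:1000, value = rnorm(1000)), path, row.names = FALSE)

# Stream the file in fixed-size chunks, keeping only running totals in memory.
# chunked_mean is a hypothetical helper written for this sketch.
chunked_mean <- function(path, column, chunk_rows = 100L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once and strip the quotes that write.csv adds.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)

  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

result <- chunked_mean(path, "value")
```

The same pattern (read a chunk, update a small summary, discard the chunk) carries over to any statistic that can be accumulated incrementally; chunk_rows trades memory use against the number of read calls.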

Part 2: Big Data in the cloud

For truly Big Data problems a single computer is not enough: you need to split the workload across many computers using cloud computing or in-house clusters. This part of the course teaches you how. When you walk out you will be comfortable analysing truly massive data sets.

Topics covered include:

  1. Introduction to cloud computing: AWS, EC2, S3, EBS, Amazon Elastic MapReduce, and others.
  2. Overview of the Hadoop ecosystem: Hadoop, HDFS, MapReduce, Cassandra, HBase, Hive, Mahout, and others. What they are, what they are used for, and how they all fit together.
  3. R and Hadoop: hive, HadoopStreaming, RHIPE, and others.
  4. Round-up: when you really need big data and when you can get the business results from merely large data. A suggested best-practice strategy for big data analysis.

After completing this part you will be comfortable manipulating and analysing truly big data sets (terabytes and beyond).
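
The MapReduce idea behind much of this part can be sketched on a single machine in base R. The example below is a rough illustration under stated assumptions, not the course's own code and not how any of the packages above are called: the blocks list stands in for pieces of a file spread across a cluster, and map_block and reduce_counts are hypothetical helper names. Each block is mapped to partial word counts independently, and the partial results are then reduced into one total.

```r
# Hypothetical input: each element stands in for one block of a distributed file.
blocks <- list(
  c("big data in r", "r and hadoop"),
  c("large data in r")
)

# Map step: turn one block of lines into per-word counts.
map_block <- function(lines) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  table(words)
}

# Reduce step: merge two partial count vectors by summing counts per word.
reduce_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  out <- setNames(numeric(length(keys)), keys)
  out[names(a)] <- out[names(a)] + as.numeric(a)
  out[names(b)] <- out[names(b)] + as.numeric(b)
  out
}

# Locally the map step is a plain lapply; on a cluster each block would be
# processed on the machine that holds it.
partial <- lapply(blocks, map_block)
counts <- Reduce(reduce_counts, partial)
```

Because each block is mapped without reference to any other, the map step can run in parallel, and only the small partial counts need to move between machines for the reduce step.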

Course format

The two parts can be taken separately or combined into a single course. Each part is usually one week of training with in-class exercises and homework, though it can be customised.

We have successfully given the course in both physical and online (virtual) classrooms.

About this course

This course is provided by CYBAEA, experts in analysing commercial data to gain knowledge and genuine understanding of your markets and customers.

As the “more than analytics company” we understand how to exploit those insights and execute on that knowledge within the business to deliver profits and sustainable advantage.

The teacher is usually Allan Engelhardt, an expert with over 30 years' experience in big data analysis and more than 19 years' experience with R.

Dates and more information

Contact us to register your interest and receive more information on this course, and we will let you know the next time we run a public class.