The central limit theorem states that as larger samples are drawn from a population, the distribution of the sample means approaches a normal distribution whose mean equals the population mean. This holds whatever the shape of the population distribution (uniform, binomial, etc.), provided it has a finite variance.
By default R uses POSIX extended regular expressions. You can switch to PCRE regular expressions using perl = TRUE for base R or by wrapping patterns with perl() for stringr (older stringr releases; current ones use regex() instead). All functions can be used with literal searches using fixed = TRUE for base R or by wrapping patterns with fixed() for stringr.
See also Mark van der Loo, "A systematic approach to data cleaning with R", which covers the statistical value chain and the step from raw to technically correct data.
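The regular-expression switches described above can be sketched in base R. The vector `codes` is made up for illustration; `grepl()`, `perl = TRUE` and `fixed = TRUE` are standard base R.

```r
# Hypothetical values to search through
codes <- c("a.b", "axb", "A1", "abc")

# Default engine: "." is a wildcard, so both "a.b" and "axb" match
regex_match <- grepl("a.b", codes)

# fixed = TRUE: the pattern is a literal string, so only "a.b" matches
literal_match <- grepl("a.b", codes, fixed = TRUE)

# perl = TRUE: PCRE features such as lookahead become available.
# This keeps strings that contain at least one digit.
has_digit <- grepl("^(?=.*[0-9])", codes, perl = TRUE)

print(data.frame(codes, regex_match, literal_match, has_digit))
```

The lookahead pattern is one reason to switch engines: the default engine does not support that syntax.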
Resources and materials
- R for Reproducible Scientific Analysis
- 9:00am - 12:00pm, Thursday, November 10, 2016
Data - We will use a few CSV files:
- Gapminder, which you can get by running this code in R:
- A zip file of several CSV files:
Unzip those into your ‘data’ folder inside your workshop project folder.
- @u2ng - Reid Otsuji
- @jt14den - Tim Dennis
Graduate students and researchers
This workshop is geared to researchers wanting to use R for basic data manipulation and analysis. It will introduce participants to the basics of data tidying and manipulation (notably using tidyr & dplyr), how to set up data processing pipelines, and briefly cover working with a database. We will be using a genomics dataset for this course. This workshop is designed for novices, but we would like you to have some experience with R or have attended the Intro to R course on 11/8.
By Anna Kayfitz, CEO of StrategicDB Corp
As millions or billions of data elements come into your business each day, it is almost inevitable that some of them will lack the quality needed to build reliable business models. Ensuring that your data is clean should always be the first, and arguably most important, part of a data science workflow: without it, you will have difficulty seeing what is important and may make the wrong decisions because of duplicates, anomalies or missing information.
One of the most common and powerful data programming tools is R, an open source language and environment for statistical computing and graphics. R provides users with all the tools needed to create data science projects but, as with anything, it is only as good as the data that feeds into it. To help with that, there are a number of libraries within the R environment for data cleaning and manipulation before the start of any project.
Exploring the data
Most of the tools for exploring a set of data that you’ve imported already exist within the R platform.
This handy command simply gives an overview of all your data attributes, displaying the min, max, median, mean and category splits for each. It is a good method for quickly spotting any potential data anomalies.
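The command itself did not survive in this text. Base R's `summary()` matches the description, so the sketch below assumes that is what was meant. The `orders` data frame is made up for illustration.

```r
# Hypothetical data: one numeric column and one categorical column
orders <- data.frame(
  amount = c(12.5, 14, 13.2, 15.1, 980),
  region = factor(c("north", "south", "north", "east", "south"))
)

# Min, quartiles, median, mean and max for numeric columns;
# counts per level for factor columns
overview <- summary(orders)
print(overview)

# The same statistics for a single column
amount_stats <- summary(orders$amount)
print(amount_stats)
```

Here the maximum of 980 sits far above the median of 14, which is the kind of anomaly this overview helps to spot.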
Following on from this, you can use a histogram to better understand the distribution of your data. This will show any outliers within the dataset or within any numeric column that you particularly want to observe.
The plyr package
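To follow the plyr-based approach you will need to install the plyr package, using the standard R functionality for installing libraries, install.packages(). Note that base R's hist() function can also draw a histogram without any additional package.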
[Image: R Data Frame Cheat Sheet]
This will create a visualisation of your data so that you can spot any anomalies quickly. A boxplot visualisation uses the same package but splits the data into quartiles for outlier detection. Both of these combined will quickly tell you if you need to limit the dataset or only use certain segments of it within any algorithms or statistical modelling.
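The original plotting code is not reproduced in this text. As a stand-in, here is a minimal sketch using base R's `hist()`, `boxplot()` and `boxplot.stats()` rather than the plyr-based approach the article refers to. The `scores` vector is made up for illustration.

```r
# Hypothetical numeric column with one obvious anomaly
scores <- c(52, 55, 58, 60, 61, 63, 65, 67, 70, 250)

# Histogram: shows the overall shape and the isolated bar far to the right
hist(scores, breaks = 10, main = "Distribution of scores", xlab = "Score")

# Boxplot: draws the quartiles and marks points beyond the whiskers
boxplot(scores, main = "Scores by quartile", ylab = "Score")

# The values the boxplot treats as outliers
# (more than 1.5 times the interquartile range beyond the hinges)
flagged <- boxplot.stats(scores)$out
print(flagged)
```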
R has a number of pre-built methods for correcting data errors, such as converting values with simple logic as you might do in Excel or SQL. For example, as.character() converts a column to character strings.
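A minimal sketch of such conversions, using a made-up `raw` data frame. `as.character()` and `as.numeric()` are base R functions.

```r
# Hypothetical import: IDs read as numbers, spend read as text
raw <- data.frame(
  customer_id = c(1001, 1002, 1003),
  spend = c("19.99", "5", "n/a"),
  stringsAsFactors = FALSE
)

# IDs are labels, not quantities, so store them as character strings
raw$customer_id <- as.character(raw$customer_id)

# Text that is not a number becomes NA; the warning is suppressed here
# because that is the intended behaviour in this example
raw$spend <- suppressWarnings(as.numeric(raw$spend))

str(raw)
```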
However, if you want to start correcting the errors that you saw in your histogram or boxplot, there are additional packages that have the capability of doing just that.
The stringr package
There are a few different ways in which stringr can help cleanse your data, including trimming whitespace and replacing unnecessary words. These are quite standard bits of code, structured as str_trim(YOUR_DATA_FIELD), which simply removes leading and trailing whitespace.
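As a short sketch of both operations, assuming the stringr package is installed. The `raw_names` values and the choice of "Ltd" as the unwanted word are made up for illustration.

```r
library(stringr)

# Hypothetical field with stray whitespace and a word we want to drop
raw_names <- c("  Acme Ltd ", "Globex Ltd", " Initech")

# Remove leading and trailing whitespace
trimmed <- str_trim(raw_names)

# Remove the literal word "Ltd", then trim the space it leaves behind
cleaned <- str_trim(str_replace_all(trimmed, fixed("Ltd"), ""))

print(cleaned)
```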
However, what about removing the anomalies that our histogram revealed? A robust approach needs more than this, but as a basic example, we can tell R to replace all the outliers in our field with the median value of that field. This pulls the extreme values back toward the centre of the distribution and reduces the bias they introduce.
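One way to sketch that replacement in base R follows. The article does not define what counts as an outlier, so using the boxplot rule (`boxplot.stats()`) is an assumption made here, and the `values` vector is made up.

```r
# Hypothetical field containing one extreme value
values <- c(52, 55, 58, 60, 61, 63, 65, 67, 70, 250)

# Assumption: treat the points a boxplot would flag as the outliers
outliers <- boxplot.stats(values)$out

# Replace each flagged value with the median of the field
field_median <- median(values)
cleaned <- values
cleaned[cleaned %in% outliers] <- field_median

print(cleaned)
```

The median is used rather than the mean because the mean is itself pulled upward by the extreme value.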
[Image: R Data Cleaning Cheat Sheet Sample]
It is very simple in R to check for incomplete data and perform an action on that field. For example, this function will eliminate missing values completely from your chosen data column.
There are similar options to replace blank values with 0s or NA, depending on the field type, to improve the consistency of the dataset.
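The function referred to above is not shown in this text. The sketch below uses base R's `is.na()` as one plausible way to do both things. The `survey` data frame is made up, and filling ages with 0 is purely illustrative.

```r
# Hypothetical data with missing ages and a blank city
survey <- data.frame(
  id = 1:4,
  age = c(34, NA, 29, NA),
  city = c("Leeds", "", "York", "Bath"),
  stringsAsFactors = FALSE
)

# Count the missing values in one column
missing_ages <- sum(is.na(survey$age))

# Option 1: drop the rows where that column is missing
complete_rows <- survey[!is.na(survey$age), ]

# Option 2: keep the rows and make the gaps explicit and consistent
filled <- survey
filled$age[is.na(filled$age)] <- 0       # numeric field: fill with 0
filled$city[filled$city == ""] <- NA     # text field: blank becomes NA

print(complete_rows)
print(filled)
```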
The tidyr package
The tidyr package is designed to tidy your data. It works by identifying the variables in your dataset and moving them into columns using three main functions: gather(), separate() and spread().
The gather() function takes multiple columns and gathers them into key-value pairs. As an example, say you have exam score data with columns like this:
| Name | Exam A | Exam B |
The gather() function transforms that into usable columns like this.
[Image: R Data Wrangling Cheat Sheet]
Now we are truly able to analyse the exam scores. The separate() and spread() functions do similar things, which you can explore once you have the package, but ultimately they align your data as needed.
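The exam example can be sketched as follows, assuming the tidyr package is installed. The names and scores are made up, since the article's own table rows are not shown. In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), but they remain available.

```r
library(tidyr)

# Hypothetical exam data: one column per exam
exams <- data.frame(
  Name = c("Ana", "Ben"),
  `Exam A` = c(71, 64),
  `Exam B` = c(80, 58),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): collapse the two exam columns into key-value pairs,
# giving one row per student per exam
long <- gather(exams, key = "Exam", value = "Score", `Exam A`, `Exam B`)
print(long)

# spread(): the reverse step, back to one column per exam
wide <- spread(long, key = Exam, value = Score)
print(wide)
```

With the data in the long form, a per-exam average is a single grouped calculation rather than one calculation per column.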
Here are a few other packages of note that may be useful for data cleansing in R:
- The purrr package
The purrr package is designed for data wrangling by applying functions across lists and vectors. It is quite similar to the plyr package, albeit newer, and some users find it easier to use and more standardised in its functionality.
- The sqldf package
A lot of R users are more comfortable coding in SQL than in R. This package allows you to write SQL code within R to select your data elements.
- The janitor package
This package is able to find duplicates by multiple columns and make friendly column names with ease from your dataframe. It even has a get_dupes() function for finding duplicate values across multiple rows of data. If you are looking to dedupe your data in a more advanced manner, for example by finding different combinations or using fuzzy logic, you may want to look into a dedicated deduping tool instead.
- The splitstackshape package
This is an older package that can work with comma-separated values in a dataframe column. It is useful for survey or text analysis preparation.
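As one illustration from the list above, here is a minimal sketch of the janitor workflow, assuming the janitor package is installed. The `contacts` data frame is made up. `clean_names()` and `get_dupes()` are janitor functions.

```r
library(janitor)

# Hypothetical contact list with awkward column names and a repeated person
contacts <- data.frame(
  `First Name` = c("Ana", "Ben", "Ana", "Cy"),
  `Last Name` = c("Diaz", "Lee", "Diaz", "Oh"),
  `Email Address` = c("ana@example.com", "ben@example.com",
                      "ana.d@example.com", "cy@example.com"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert column names to a consistent snake_case form
tidy_contacts <- clean_names(contacts)
print(names(tidy_contacts))

# Find rows that share the same first and last name;
# the result includes a dupe_count column
dupes <- get_dupes(tidy_contacts, first_name, last_name)
print(dupes)
```

Checking duplicates on a subset of columns, as here, catches records that differ only in a field such as the email address.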
R has a huge number of packages, and this article only scratches the surface of what it can do. With new libraries popping up all the time, it is important to do your research and pick the right ones for you before starting any new project.
Bio: Anna Kayfitz is the CEO of StrategicDB Corp, a data cleansing and analytics company. She holds an MBA from the Schulich School of Business, and spent over 10 years working in data analytics and marketing roles prior to founding StrategicDB.