A collection of readings on data wrangling.
Approximate time: 75 minutes
The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the ‘dplyr’ package, data tidying with the ‘tidyr’ package, and data visualization with the ‘ggplot2’ package.
All of these packages use the same style of code, which is snake_case formatting for all function names and arguments. The tidy style guide is available for perusal.
Adding files to your working directory
We have three files that we need to bring in for this lesson:
- A normalized counts file (gene expression counts normalized for library size)
- A metadata file corresponding to the samples in our normalized counts dataset
- The differential expression results output from our DE analysis using DESeq2
Download the files to the data folder by right-clicking the links below:
- Normalized counts file: right-click here
- Differential expression results: right-click here
Choose to Save Link As or Download Linked File As and navigate to your Visualizations-in-R/data folder. You should now see the files appear in the data folder in the RStudio file directory.
Reading in the data files
Let’s read in all of the files we have downloaded:
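A minimal sketch of what this step might look like, assuming the three files were saved into the data folder with the placeholder filenames used below (swap in the actual names of the files you downloaded):

```r
library(tidyverse)

# Placeholder filenames -- replace with the names of the downloaded files
normalized_counts <- read.delim("data/normalized_counts.txt", row.names = 1)
metadata          <- read.delim("data/metadata.txt", row.names = 1)
res_tableOE       <- read.delim("data/DE_results.txt", row.names = 1)
```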
Tidyverse basics
As it is difficult to change how fundamental base R structures/functions work, the Tidyverse suite of packages creates and uses data structures, functions, and operators to make working with data more intuitive. The two most basic changes are in the use of pipes and tibbles.
Pipes
Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.
To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the ‘magrittr’ package and comes installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.
NOTE: Shortcut to write the pipe is shift + command + M
An example of using the pipe to run multiple commands:
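Here is a minimal sketch of the same calculation written with nested functions and then with the pipe (the numeric vector is purely illustrative):

```r
library(tidyverse)

# Nested functions read from the inside out
sqrt(sum(c(1, 4, 9, 16)))

# The pipe reads left to right: take the vector, sum it, then take the square root
c(1, 4, 9, 16) %>%
  sum() %>%
  sqrt()
```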
The pipe represents a much easier way of writing and deciphering R code, and we will be taking advantage of it for all future activities.
Exercises
1. Extract the `replicate` column from the `metadata` data frame (use the `$` notation) and save the values to a vector named `rep_number`.
2. Use the pipe (`%>%`) to perform two steps in a single line:
   - Turn `rep_number` into a factor.
   - Use the `head()` function to return the first six values of the `rep_number` factor.
Tibbles
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that are not normally allowed, such as numbers/symbols.
Important: tidyverse is very opinionated about row names. These packages insist that all column data (e.g. data.frame) be treated equally, and that special designation of a column as rownames should be deprecated. Tibble provides simple utility functions to handle rownames: rownames_to_column() and column_to_rownames(). More help for dealing with row names in tibbles can be found in the tibble package documentation.
Tibbles can be created directly using the tibble() function, or data frames can be converted into tibbles using as_tibble(name_of_df).
NOTE: The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.
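A short sketch of both routes, using a made-up metadata-style data.frame whose row names are sample IDs:

```r
library(tidyverse)

# Create a tibble directly
samples <- tibble(
  sample    = c("sample1", "sample2", "sample3"),
  replicate = c(1, 2, 3)
)

# Convert an existing data.frame, moving its row names into a column first
metadata_df <- data.frame(
  genotype  = c("Wt", "Wt", "KO"),
  row.names = c("sample1", "sample2", "sample3")
)

metadata_tb <- metadata_df %>%
  rownames_to_column(var = "sample") %>%
  as_tibble()
```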
Exercises
1. Create a tibble called `df_tibble` using the `tibble()` function to combine the vectors `species` and `glengths`.
2. Change the `metadata` data frame to a tibble called `meta_tibble`. Use the `rownames_to_column()` function to preserve the rownames, combined with using `%>%` and the `as_tibble()` function.
Differences between tibbles and data.frames
The main differences between tibbles and data.frames relate to printing and subsetting.
Printing
A nice feature of a tibble is that, when printing a variable to screen, it will show only the first 10 rows and the columns that fit on the screen by default. This is nice since you don’t have to use head() to take a quick look at your dataset. If it is desirable to view more of the dataset, the print() function can change the number of rows or columns displayed.
Subsetting
When subsetting base R data.frames, the default behavior is to simplify the output to the simplest data structure. Therefore, if subsetting a single column from a data.frame, R will output a vector (unless drop=FALSE is specified). In contrast, subsetting a single column of a tibble will by default return another tibble, not a vector.
Due to this behavior, some older functions do not work with tibbles, so if you need to convert a tibble to a data.frame, the function as.data.frame(name_of_tibble) will easily convert it.
Also note that if you use piping to subset a data frame, then the notation is slightly different, requiring a placeholder . prior to the [ ] or $.
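A brief sketch of this placeholder notation, using a toy tibble (names and values are made up):

```r
library(tidyverse)

df <- tibble(gene = c("A", "B", "C"), padj = c(0.001, 0.2, 0.04))

# Base-style subsetting
df$padj

# Piped subsetting needs the "." placeholder before $ or [ ]
df %>% .$padj
df %>% .[["padj"]]
```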
Tidyverse tools
While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate only the tools we will be using most for data wrangling and tidying.
Dplyr
The most useful tool in the tidyverse is dplyr. It’s a Swiss Army knife for data wrangling. dplyr has many handy functions that we recommend incorporating into your analysis:
- select() extracts columns and returns a tibble.
- arrange() changes the ordering of the rows.
- filter() picks cases based on their values.
- mutate() adds new variables that are functions of existing variables.
- rename() easily changes the name of one or more columns.
- summarise() reduces multiple values down to a single summary.
- pull() extracts a single column as a vector.
- _join() is a group of functions that merge two data frames together; it includes inner_join(), left_join(), right_join(), and full_join().
Note: dplyr underwent a massive revision in 2017, switching versions from 0.5 to 0.7. If you consult other dplyr tutorials online, note that many materials developed prior to 2017 are no longer correct. In particular, this applies to writing functions with dplyr (see Notes section below).
select()
To extract columns from a tibble we can use select().
Conversely, you can remove columns you don’t want with negative selection.
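As a hedged illustration, here is a toy version of a DESeq2-style results table; the gene names and values below are made up, and the later dplyr sketches in this section reuse this res_tbl object:

```r
library(tidyverse)

# Toy results table with the usual DESeq2-style columns (values are invented)
res_tbl <- tibble(
  gene           = c("Mov10", "Actb", "Gapdh", "Pou5f1"),
  baseMean       = c(1500.2, 3200.8, 0, 12.4),
  log2FoldChange = c(4.1, -0.2, NA, 1.7),
  pvalue         = c(1e-10, 0.60, NA, 0.004),
  padj           = c(2e-9, 0.75, NA, 0.02)
)

# Keep only the columns we care about
res_tbl %>%
  select(gene, baseMean, log2FoldChange, padj)

# Negative selection: drop the columns we don't want instead
res_tbl %>%
  select(-pvalue)
```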
arrange()
Note that the rows are sorted by the gene symbol. Let’s fix that and sort them by adjusted P value instead with arrange().
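Continuing with the toy res_tbl from the select() sketch:

```r
# Sort by adjusted P value, smallest first
res_tbl %>%
  arrange(padj)

# Wrap the column in desc() to sort in decreasing order
res_tbl %>%
  arrange(desc(baseMean))
```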
filter()
Let’s keep only genes that are expressed (baseMean above 0) with an adjusted P value below 0.01. You can perform multiple filter() operations together in a single command.
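With the toy res_tbl, that might look like:

```r
# Keep expressed genes (baseMean above 0) with an adjusted P value below 0.01
res_tbl %>%
  filter(baseMean > 0, padj < 0.01)

# Equivalently, chain two filter() calls
res_tbl %>%
  filter(baseMean > 0) %>%
  filter(padj < 0.01)
```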
mutate()
mutate() enables you to create a new column from an existing column. Let’s generate log10 calculations of our baseMeans for each gene.
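Again using the toy res_tbl:

```r
# Add a log10-transformed baseMean column alongside the existing columns
res_tbl %>%
  mutate(log10_baseMean = log10(baseMean))
```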
rename()
You can quickly rename an existing column with rename(). The syntax is new_name = old_name.
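For example, with the toy res_tbl:

```r
# Rename the gene column to symbol (new_name = old_name)
res_tbl %>%
  rename(symbol = gene)
```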
summarise()
You can perform column summarization operations with summarise().
Advanced: summarise() is particularly powerful in combination with the group_by() function, which allows you to group related rows together.
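A sketch of both, still with the toy res_tbl; the significance flag is created on the fly just for illustration:

```r
# A single summary across all rows
res_tbl %>%
  summarise(mean_baseMean = mean(baseMean))

# group_by() + summarise(): one summary row per group
res_tbl %>%
  mutate(significant = padj < 0.05) %>%
  group_by(significant) %>%
  summarise(
    n           = n(),
    mean_log2FC = mean(log2FoldChange, na.rm = TRUE)
  )
```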
Note: summarize() also works if you prefer to use American English. This applies across the board to any tidy functions, including in ggplot2 (e.g. color in place of colour).
pull()
In the recent dplyr 0.7 update, pull() was added as a quick way to access column data as a vector. This is very handy in chained operations with the pipe operator.
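For example, with the toy res_tbl:

```r
# Extract the adjusted P values of expressed genes as a plain vector
res_tbl %>%
  filter(baseMean > 0) %>%
  pull(padj)
```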
_join()
Dplyr has a powerful group of join operations, which join together a pair of data frames based on a variable or set of variables present in both data frames that uniquely identify all observations. These variables are called keys.
- inner_join: Only the rows with keys present in both datasets will be joined together.
- left_join: Keeps all the rows from the first dataset, regardless of whether they are in the second dataset, and joins the rows of the second that have keys in the first.
- right_join: Keeps all the rows from the second dataset, regardless of whether they are in the first dataset, and joins the rows of the first that have keys in the second.
- full_join: Keeps all rows in both datasets. Rows without matching keys will have NA values for those variables from the other dataset.
To practice with the join functions, we can use a couple of built-in R datasets.
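For example, dplyr ships with the small band_members and band_instruments data frames, which share a name column as the key:

```r
library(tidyverse)

inner_join(band_members, band_instruments, by = "name")  # only rows with keys in both
left_join(band_members, band_instruments, by = "name")   # all rows from band_members
right_join(band_members, band_instruments, by = "name")  # all rows from band_instruments
full_join(band_members, band_instruments, by = "name")   # all rows from both
```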
Tidyr
The purpose of Tidyr is to have well-organized or tidy data, which Tidyverse defines as having:
- Each variable in a column
- Each observation in a row
- Each value as a cell
There are two main functions in Tidyr, gather() and spread(). These functions allow for conversion between the long data format and the wide data format. The downstream use of the data will determine which format is required.
gather()
The gather() function changes a wide data format into a long data format. This function is particularly helpful when using ‘ggplot2’ to get all of the values to plot into a single column.
To use this function, you need to give the columns in the data frame you would like to gather together as a single column. Then, provide a name to give the column where all of the column names will be present using the key argument, and the name to give the column where all of the values will be present using the value argument.
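A sketch with a made-up wide-format counts table:

```r
library(tidyverse)

counts_wide <- tibble(
  gene    = c("Mov10", "Actb"),
  sample1 = c(10, 250),
  sample2 = c(12, 310),
  sample3 = c(95, 275)
)

# Gather the sample columns into key/value pairs
counts_long <- counts_wide %>%
  gather(key = "samplename", value = "counts", sample1:sample3)

counts_long
```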
spread()
The spread() function is the reverse of the gather() function. The categories of the key column will become separate columns, and the values in the value column are split across the associated key columns.
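And the reverse, spreading the long table from the gather() sketch back out to wide format:

```r
# One column per sample again
counts_long %>%
  spread(key = samplename, value = counts)
```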
Programming notes
Underneath the hood, tidyverse packages build upon the base R language using rlang, which is a complete rework of how functions handle variable names and evaluate arguments. This is achieved through the tidyeval framework, which interprets command operations using tidy evaluation. This is outside of the scope of the course, but explained in detail in the Programming with dplyr vignette, in case you’d like to understand how these new tools behave differently from base R.
Data wrangling is the transformation of raw data into a format that is easier to use. But what exactly does it involve? In this post, we find out.
Manipulation is at the core of data analytics. We don’t mean the sneaky kind, of course, but the data kind! Scraping data from the web, carrying out statistical analyses, creating dashboards and visualizations—all these tasks involve manipulating data in one way or another. But before we can do any of these things, we need to ensure that our data are in a format we can use. This is where the most important form of data manipulation comes in: data wrangling.
In this post, we explore data wrangling in detail. When you’ve finished reading, you’ll be able to answer:
First up…
1. What is data wrangling and why is it important?
Data wrangling is a term often used to describe the early stages of the data analytics process. It involves transforming and mapping data from one format into another. The aim is to make data more accessible for things like business analytics or machine learning. The data wrangling process can involve a variety of tasks. These include things like data collection, exploratory analysis, data cleansing, creating data structures, and storage.
Data wrangling is time-consuming. In fact, it can take up to 80% of a data analyst’s time. This is partly because the process is fluid, i.e. there aren’t always clear steps to follow from start to finish. However, it’s also because the process is iterative and the activities involved are labor-intensive. What you need to do depends on things like the source (or sources) of the data, their quality, your organization’s data architecture, and what you intend to do with the data once you’ve finished wrangling it.
Why is data wrangling important?
Insights gained during the data wrangling process can be invaluable. They will likely affect the future course of a project. Skipping or rushing this step will result in poor data models that impact an organization’s decision-making and reputation. So, if you ever hear someone suggesting that data wrangling isn’t that important, you have our express permission to tell them otherwise!
Unfortunately, because data wrangling is sometimes poorly understood, its significance can be overlooked. High-level decision-makers who prefer quick results may be surprised by how long it takes to get data into a usable format. Unlike the results of data analysis (which often provide flashy and exciting insights), there’s little to show for your efforts during the data wrangling phase. And as businesses face budget and time pressures, this makes a data wrangler’s job all the more difficult. The job involves careful management of expectations, as well as technical know-how.
2. Data wrangling vs. data cleaning: what is the difference?
Some people use the terms ‘data wrangling’ and ‘data cleaning’ interchangeably. This is because they’re both tools for converting data into a more useful format. It’s also because they share some common attributes. But there are some important differences between them:
- Data wrangling refers to the process of collecting raw data, cleaning it, mapping it, and storing it in a useful format. To confuse matters (and because data wrangling is not always well understood) the term is often used to describe each of these steps individually, as well as in combination.
- Data cleaning, meanwhile, is a single aspect of the data wrangling process. A complex process in itself, data cleaning involves sanitizing a data set by removing unwanted observations and outliers, fixing structural errors and typos, standardizing units of measure, validating, and so on. Data cleaning tends to follow more precise steps than data wrangling…albeit, not always in a very precise order! You can learn more about the data cleaning process in this post.
The distinction between data wrangling and data cleaning is not always clear-cut. However, you can generally think of data wrangling as an umbrella task. Data cleaning falls under this umbrella, alongside a range of other activities. These can involve planning which data you want to collect, scraping those data, carrying out exploratory analysis, cleansing and mapping the data, creating data structures, and storing the data for future use.
3. What is the data wrangling process?
The exact tasks required in data wrangling depend on what transformations you need to carry out to get a dataset into better shape. For instance, if your source data is already in a database, this will remove many of the structural tasks. But if it’s unstructured data (which is much more common) then you’ll have more to do.
The following steps are often applied during data wrangling. But the process is an iterative one. Some of the steps may not be necessary, others may need repeating, and they will rarely occur in the same order. But you still need to know what they all are!
Extracting the data
Not everybody considers data extraction part of the data wrangling process. But in our opinion, it’s a vital aspect of it. You can’t transform data without first collecting it. This stage requires planning. You’ll need to decide which data you need and where to collect them from. You’ll then pull the data in a raw format from its source. This could be a website, a third-party repository, or some other location. If it’s raw, unstructured data, roll your sleeves up, because there’s work to do! You can learn how to scrape data from the web in this post.
Carrying out exploratory data analysis (EDA)
EDA involves determining a dataset’s structure and summarizing its main features. Whether you do this immediately, or wait until later in the process, depends on the state of the dataset and how much work it requires. Ultimately, EDA means familiarizing yourself with the data so you know how to proceed. You can learn more about exploratory data analysis in this post.
Structuring the data
Freshly collected data are usually in an unstructured format. This means they lack an existing model and are completely disorganized. Unstructured data are often text-heavy but may contain things like ID codes, dates, numbers, and so on. To structure your dataset, you’ll usually need to parse it. In this context, parsing means extracting relevant information. For instance, you might parse HTML code scraped from a website, pulling out what you need and discarding the rest. The result might be a more user-friendly spreadsheet containing the useful data with columns, headings, classes, and so on.
Cleaning the data
Once your dataset has some structure, you can start applying algorithms to tidy it up. You can automate a range of algorithmic tasks using tools like Python and R. They can be used to identify outliers, delete duplicate values, standardize systems of measurement, and so on. You can learn about the data cleaning process in detail in this post.
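As a hedged illustration only, a few typical cleaning steps written in R might look like this; the customer table, column names, and thresholds below are all invented:

```r
library(tidyverse)

customers <- tibble(
  id     = c(1, 2, 2, 3, 4),
  height = c(1.72, 1.65, 1.65, 172, 1.80),   # one value mistakenly recorded in cm
  city   = c("Boston", "boston", "boston", "Chicago", "chicago")
)

cleaned <- customers %>%
  distinct() %>%                                          # drop exact duplicate rows
  mutate(
    height = if_else(height > 3, height / 100, height),  # standardize cm to meters
    city   = str_to_title(city)                          # fix inconsistent casing
  )
```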
Enriching the data
Once your dataset is in good shape, you’ll need to check if it’s ready to meet your requirements. At this stage, you may want to enrich it. Data enrichment involves combining your dataset with data from other sources. This might include internal systems or third-party providers. Your goal could be to accumulate a greater number of data points (to improve the accuracy of an analysis). Or it could simply be to fill in gaps, say by combining two databases of customer info where one contains telephone numbers and the other doesn’t.
Validating the data
Validating your data means checking it for consistency, quality, and accuracy. We can do this using pre-programmed scripts that check the data’s attributes against defined rules. This is also a good example of an overlap between data wrangling and data cleaning—validation is key to both. Because you’ll likely find errors, you may need to repeat this step several times.
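As a minimal sketch, rule-based checks in R can be as simple as stopifnot() calls; the rules below continue from the invented cleaned table in the previous sketch:

```r
# Fail loudly if any validation rule is broken
stopifnot(
  !any(duplicated(cleaned$id)),                        # customer ids must be unique
  all(cleaned$height > 0.5 & cleaned$height < 2.5),    # heights must be plausible (meters)
  all(!is.na(cleaned$city))                            # no missing cities
)
```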
Publishing the data
Last but not least, it’s time to publish your data. This means making the data accessible by depositing them into a new database or architecture. End-users might include data analysts, engineers, or data scientists. They may use the data to create business reports and other insights. Or they might further process it to build more complex data structures, e.g. data warehouses. After this stage, the possibilities are endless!
4. What tools do data wranglers use?
Data wranglers use many of the same tools applied in data cleaning. These include programming languages like Python and R, software like MS Excel, and open-source data analytics platforms like KNIME. Programming languages can be difficult to master, but they are a vital skill for any data analyst. However, Python is not that difficult to learn, and it allows you to write scripts for very specific tasks. We share some tips for learning Python in this post.
There are also visual data wrangling tools out there. The general aim of these is to make data wrangling easier for non-programmers and to speed up the process for experienced ones. Tools like Trifacta and OpenRefine can help you transform data into clean, well-structured formats.
A word of caution, though. While visual tools are more intuitive, they are sometimes less flexible. Because their functionality is more generic, they don’t always work as well on complex datasets. As a rule, the larger and more unstructured a dataset, the less effective these tools will be. Beginners should aim to combine programming expertise (scripting) with proprietary tools (for high-level wrangling).
Final thoughts
Data wrangling is vital to the early stages of the data analytics process. Before carrying out a detailed analysis, your data needs to be in a usable format. And that’s where data wrangling comes in. In this post, we’ve learned that:
- Data wrangling involves transforming and mapping data from a raw form into a more useful, structured format.
- Data wrangling can be used to prepare data for everything from business analytics to ingestion by machine learning algorithms.
- The terms ‘data wrangling’ and ‘data cleaning’ are often used interchangeably—but the latter is a subset of the former.
- While the data wrangling process is loosely defined, it involves tasks like data extraction, exploratory analyses, building data structures, cleaning, enriching, and validating; and storing data in a usable format.
- Data wranglers use a combination of visual tools like OpenRefine, Trifacta, or KNIME, and programming languages like Python and R, alongside software like MS Excel.
The best way to learn about data wrangling is to dive in and have a go. For a hands-on introduction to some of these techniques, why not try out our free, five-day data analytics short course? To learn more about data analytics, check out the following: