Notice how each {dplyr} function takes a data frame as input and returns a data frame as output. This makes the functions easy to use in a step-by-step fashion. For example, you could:
Filter babynames to just boys born in 2017
Select the name and n columns from the result
Arrange the rows of that result so that the most popular names appear near the top.
boys_2017 <- filter(babynames, year == 2017, sex == "M")
boys_2017 <- select(boys_2017, name, n)
boys_2017 <- arrange(boys_2017, desc(n))
boys_2017
# A tibble: 14,160 × 2
name n
<chr> <int>
1 Liam 18728
2 Noah 18326
3 William 14904
4 James 14232
5 Logan 13974
6 Benjamin 13733
7 Mason 13502
8 Elijah 13268
9 Oliver 13141
10 Jacob 13106
# ℹ 14,150 more rows
Redundancy
The result shows us the most popular boys' names from 2017, which is the most recent year in the data set. But take a look at the code. Do you notice how we re-create boys_2017 at each step so we will have something to pass to the next step? This is a redundant way to write R code: the same name is typed and overwritten at every step.
You could avoid creating boys_2017 by nesting your functions inside of each other, but this creates code that is hard to read:
arrange(select(filter(babynames, year == 2017, sex == "M"), name, n), desc(n))
There is a third way to write sequences of functions: the pipe operator, |>.
Explanation: Pipes
The pipe operator |> performs an extremely simple task: it passes the result on its left into the first argument of the function on its right. Put another way, x |> f(y) is the same as f(x, y). This piece of code punctuation makes it easy to write and read a series of functions that are applied in a step-by-step way. For example, we can use the pipe to rewrite our code above:
babynames |>
  filter(year == 2017, sex == "M") |>
  select(name, n) |>
  arrange(desc(n))
# A tibble: 14,160 × 2
name n
<chr> <int>
1 Liam 18728
2 Noah 18326
3 William 14904
4 James 14232
5 Logan 13974
6 Benjamin 13733
7 Mason 13502
8 Elijah 13268
9 Oliver 13141
10 Jacob 13106
# ℹ 14,150 more rows
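The rule that x |> f(y) is the same as f(x, y) can be checked on its own, without any data frames. The following is a minimal sketch using only base R functions; the numbers are illustrative values chosen for this example, and the native pipe requires R 4.1.0 or later.

```r
# Requires R 4.1.0 or later, where the native pipe |> was introduced.
x <- c(1, 4, 9)  # illustrative values, not taken from babynames

# One step: x |> f(y) is f(x, y)
rounded_piped  <- 3.14159 |> round(2)
rounded_direct <- round(3.14159, 2)

# Several steps: the pipe reads left to right instead of inside out
total_piped  <- x |> sqrt() |> sum()
total_nested <- sum(sqrt(x))

identical(rounded_piped, rounded_direct)  # TRUE
identical(total_piped, total_nested)      # TRUE
```

The babynames pipe above works the same way: each |> hands the data frame on its left to the first argument of the {dplyr} function on its right.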
As you read the code, pronounce |> as “and then”. You’ll notice that {dplyr} makes it easy to read pipes. Each function name is a verb, so our code resembles the statement, “Take babynames, and then filter it by year and sex, and then select the name and n columns, and then arrange the results by descending values of n.”
{dplyr} also makes it easy to write pipes. Each {dplyr} function returns a data frame that can be piped into another {dplyr} function, which will accept the data frame as its first argument. In fact, {dplyr} functions are written with pipes in mind: each function does one simple task. {dplyr} expects you to use pipes to combine these simple tasks to produce sophisticated results.
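To see why data frame in, data frame out matters, here is a small sketch that runs the same three verbs both step by step and as a pipe. The toy tibble and its values are hypothetical stand-ins for babynames, invented for this illustration so the example runs without the {babynames} package.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames (values are made up)
toy <- tibble(
  name = c("Ava", "Liam", "Noah", "Emma"),
  sex  = c("F", "M", "M", "F"),
  n    = c(10L, 30L, 20L, 40L)
)

# Step by step: every intermediate result is itself a data frame
step1 <- filter(toy, sex == "M")
step2 <- select(step1, name, n)
step3 <- arrange(step2, desc(n))
is.data.frame(step1)  # TRUE
is.data.frame(step2)  # TRUE

# The same steps as a pipe, with no intermediate names
piped <- toy |>
  filter(sex == "M") |>
  select(name, n) |>
  arrange(desc(n))

identical(piped, step3)  # TRUE
piped$name               # "Liam" "Noah"
```

Because each verb returns a data frame, any verb can follow any other in a pipe, and the piped version gives the same result as the step-by-step version.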
Exercise: Pipes
I’ll use pipes for the remainder of the tutorial, and I will expect you to as well. Let’s practice a little by writing a new pipe in the chunk below. The pipe should:
Filter babynames to just the girls that were born in 2017
Select the name and n columns
Arrange the results so that the most popular names are near the top.
Try to write your pipe without copying and pasting the code from above.
babynames |>
  filter(year == 2017, sex == "F") |>
  select(name, n) |>
  arrange(desc(n))
Revisiting the motivating example
You’ve now mastered a set of skills that will let you easily plot the popularity of your name over time. In the code chunk below, use a combination of {dplyr} and {ggplot2} functions with |> to:
Trim babynames to just the rows that contain your name and your sex
Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)
Plot the results as a line graph with year on the x axis and prop on the y axis
Note that the first argument of ggplot() takes a data frame, which means you can add ggplot() directly to the end of a pipe. However, you will need to switch from |> to + to finish adding layers to your plot.
babynames |>
  filter(name == "Andrew", sex == "M") |>
  select(year, prop) |>
  ggplot() +
  geom_line(aes(x = year, y = prop)) +
  labs(title = "Popularity of the name Andrew")
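The switch from |> to + can be tried on a small scale. This sketch assumes {dplyr} and {ggplot2} are installed and uses a hypothetical toy tibble with made-up values in place of babynames. It stores the plot in a variable so you can confirm that the pipe ends in a ggplot object.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (values are made up for illustration)
toy <- tibble(
  year = c(2015, 2016, 2017, 2015, 2016, 2017),
  sex  = c("M", "M", "M", "F", "F", "F"),
  prop = c(0.010, 0.012, 0.011, 0.020, 0.019, 0.021)
)

p <- toy |>
  filter(sex == "M") |>       # |> passes a data frame along
  select(year, prop) |>
  ggplot() +                  # from here on, + adds layers to the plot
  geom_line(aes(x = year, y = prop))

inherits(p, "ggplot")  # TRUE
```

This works because |> binds more tightly than +, so R finishes the whole pipe, including the call to ggplot(), before it adds the geom_line() layer. Printing p draws the plot.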
Recap
Together, select(), filter(), and arrange() let you quickly find information displayed within your data.
The next tutorial will show you how to derive information that is implied by your data, but not displayed within your data set.
In that tutorial, you will continue to use the |> operator, which is an essential part of programming with the {dplyr} package.
Pipes help make R expressive, like a spoken language. Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.
In the tidyverse, functions are like words: each does one simple task well. You can combine these tasks into pipes with |> to perform complex, customized procedures.