Pipes

Steps

Notice how each {dplyr} function takes a data frame as input and returns a data frame as output. This makes the functions easy to use in a step-by-step fashion. For example, you could:

  1. Filter babynames to just boys born in 2017
  2. Select the name and n columns from the result
  3. Arrange the rows so that the most popular names appear near the top.

boys_2017 <- filter(babynames, year == 2017, sex == "M")
boys_2017 <- select(boys_2017, name, n)
boys_2017 <- arrange(boys_2017, desc(n))
boys_2017
# A tibble: 14,160 × 2
   name         n
   <chr>    <int>
 1 Liam     18728
 2 Noah     18326
 3 William  14904
 4 James    14232
 5 Logan    13974
 6 Benjamin 13733
 7 Mason    13502
 8 Elijah   13268
 9 Oliver   13141
10 Jacob    13106
# ℹ 14,150 more rows

Redundancy

The result shows us the most popular boys' names from 2017, which is the most recent year in the data set. But take a look at the code. Do you notice how we re-create boys_2017 at each step so we will have something to pass to the next step? This is a repetitive way to write R code: every line has to name boys_2017 twice.

You could avoid creating boys_2017 by nesting your functions inside each other, but this creates code that is hard to read, because you have to read it from the inside out:

arrange(select(filter(babynames, year == 2017, sex == "M"), name, n), desc(n))
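
If you want to convince yourself that the step-by-step and nested versions really do the same work, a minimal sketch like the one below can help. It assumes {dplyr} is installed; toy is a made-up stand-in for babynames, and its values are purely illustrative.

```r
library(dplyr)

# `toy` is a hypothetical stand-in for babynames (made-up values)
toy <- tibble(
  year = c(2016, 2017, 2017, 2017),
  sex  = c("M", "M", "F", "M"),
  name = c("Liam", "Noah", "Emma", "Liam"),
  n    = c(10, 20, 30, 40)
)

# Style 1: overwrite an intermediate object at each step
step <- filter(toy, year == 2017, sex == "M")
step <- select(step, name, n)
step <- arrange(step, desc(n))

# Style 2: nest the calls, which must be read from the inside out
nested <- arrange(select(filter(toy, year == 2017, sex == "M"), name, n), desc(n))

# Check whether the two styles return the same data frame
identical(step, nested)
```

Both styles run the same three functions in the same order; they differ only in how the intermediate result is handed along.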

There is a third way to write sequences of functions: the pipe operator, |>.

Explanation: Pipes

The pipe operator |>, which has been built into R since version 4.1, performs an extremely simple task: it passes the result on its left into the first argument of the function on its right. Put another way, x |> f(y) is the same as f(x, y). This piece of code punctuation makes it easy to write and read a series of functions that are applied in a step-by-step way. For example, we can use the pipe to rewrite our code above:

babynames |> 
  filter(year == 2017, sex == "M") |> 
  select(name, n) |> 
  arrange(desc(n))
# A tibble: 14,160 × 2
   name         n
   <chr>    <int>
 1 Liam     18728
 2 Noah     18326
 3 William  14904
 4 James    14232
 5 Logan    13974
 6 Benjamin 13733
 7 Mason    13502
 8 Elijah   13268
 9 Oliver   13141
10 Jacob    13106
# ℹ 14,150 more rows

As you read the code, pronounce |> as “and then”. You’ll notice that {dplyr} makes it easy to read pipes. Each function name is a verb, so our code resembles the statement, “Take babynames, and then filter it by year and sex, and then select the name and n columns, and then arrange the results by descending values of n.”

{dplyr} also makes it easy to write pipes. Each {dplyr} function returns a data frame that can be piped into another {dplyr} function, which will accept the data frame as its first argument. In fact, {dplyr} functions are written with pipes in mind: each function does one simple task. {dplyr} expects you to use pipes to combine these simple tasks to produce sophisticated results.
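
The rule that x |> f(y) is the same as f(x, y) is easy to check for yourself. The sketch below uses only base R functions, so it does not depend on {dplyr} or babynames; the numbers are arbitrary and chosen just for illustration. It assumes R 4.1 or later.

```r
# Requires R 4.1 or later, where the native pipe |> was introduced
x <- c(4, 9, 16)

# x |> f() is the same as f(x)
identical(x |> sqrt(), sqrt(x))

# x |> f(y) is the same as f(x, y): the left side becomes the first argument
identical(c(3.14159, 2.71828) |> round(2), round(c(3.14159, 2.71828), 2))

# Chaining: "take x, and then take square roots, and then sum them"
total <- x |> sqrt() |> sum()
total
```

Note that the native pipe needs a function call on its right side, so write x |> sqrt() with the parentheses rather than x |> sqrt.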

Exercise: Pipes

I’ll use pipes for the remainder of the tutorial, and I will expect you to as well. Let’s practice a little by writing a new pipe in the chunk below. The pipe should:

  1. Filter babynames to just the girls that were born in 2017
  2. Select the name and n columns
  3. Arrange the results so that the most popular names are near the top.

Try to write your pipe without copying and pasting the code from above.

babynames |> 
  filter(year == 2017, sex == "F") |> 
  select(name, n) |> 
  arrange(desc(n))

Revisiting the motivating example

You’ve now mastered a set of skills that will let you easily plot the popularity of your name over time. In the code chunk below, use a combination of {dplyr} and {ggplot2} functions with |> to:

  1. Trim babynames to just the rows that contain your name and your sex
  2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)
  3. Plot the results as a line graph with year on the x axis and prop on the y axis

Note that the first argument of ggplot() takes a data frame, which means you can add ggplot() directly to the end of a pipe. However, you will need to switch from |> to + to finish adding layers to your plot.

babynames |> 
  filter(name == "Andrew", sex == "M") |> 
  select(year, prop) |> 
  ggplot() +
    geom_line(aes(x = year, y = prop)) +
    labs(title = "Popularity of the name Andrew")
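
The handoff from |> to + can also be seen in isolation. The sketch below assumes {dplyr} and {ggplot2} are installed and uses a small made-up data frame, trend, rather than real babynames values. It builds the same plot twice, once with the pipe and once without.

```r
library(dplyr)
library(ggplot2)

# `trend` is a hypothetical data frame with made-up values
trend <- tibble(
  year = c(2015, 2016, 2017),
  prop = c(0.004, 0.003, 0.002)
)

# |> passes the data frame into ggplot(); from there, + adds layers
piped <- trend |>
  ggplot() +
  geom_line(aes(x = year, y = prop))

# The same plot written without the pipe
unpiped <- ggplot(trend) +
  geom_line(aes(x = year, y = prop))

# Printing the object draws the plot
piped
```

The pipe hands a data frame to the next function, while + combines plot components, which is why the operator changes once ggplot() has been called.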

Recap

Together, select(), filter(), and arrange() let you quickly find information displayed within your data.

The next tutorial will show you how to derive information that is implied by your data, but not displayed within your data set.

In that tutorial, you will continue to use the |> operator, which is an essential part of programming with the {dplyr} package.

Pipes help make R expressive, like a spoken language. Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.

In the tidyverse, functions are like words: each does one simple task well. You can combine these tasks into pipes with |> to perform complex, customized procedures.
