Motivation
What are the most popular names of all time?
Let’s use babynames to answer a different question: what are the most popular names of all time?
This question seems simple enough, but to answer it we need to be more precise: how do you define “the most popular” names? Try to think of several definitions and then click Continue. After you continue, I will suggest two definitions of my own.
Two definitions of popular
I suggest that we focus on two definitions of popular, one that uses sums and one that uses ranks:
- Sums - A name is popular if the total number of children that have the name is large when you sum across years.
- Ranks - A name is popular if it consistently ranks among the top names from year to year.
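To make the two definitions concrete, here is a minimal sketch on a small made-up data frame. The `toy` tibble and its counts are hypothetical, not taken from babynames. The sketch uses {dplyr} functions that are introduced later in this tutorial, and it reads "ranks" as the average yearly rank, which is only one possible interpretation of "consistently ranks among the top names".

```r
library(dplyr)

# Hypothetical counts for three names over three years (not real babynames data)
toy <- tibble(
  year = rep(c(2000, 2001, 2002), each = 3),
  name = rep(c("Ann", "Bob", "Cy"), times = 3),
  n    = c(50, 40, 10,
           30, 25,  5,
           20, 90,  5)
)

# Sums: total number of children per name, summed across years
by_sum <- toy |>
  group_by(name) |>
  summarize(total = sum(n)) |>
  arrange(desc(total))

# Ranks: rank names within each year (1 = most common),
# then average each name's rank across years
by_rank <- toy |>
  group_by(year) |>
  mutate(rank = min_rank(desc(n))) |>
  group_by(name) |>
  summarize(mean_rank = mean(rank)) |>
  arrange(mean_rank)
```

In this toy data, Bob has the largest total because of one very large year, while Ann has the better average rank, so the two definitions can disagree.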
This raises a question:
Deriving information
Every data frame that you meet implies more information than it displays. For example, babynames does not display the total number of children who had your name, but babynames certainly implies what that number is. To discover the number, you only need to do a calculation:
babynames |>
  filter(name == "Andrew", sex == "M") |>
  summarize(total = sum(n))
# A tibble: 1 × 1
    total
    <int>
1 1283910
Useful functions
{dplyr} provides three functions that can help you reveal the information implied by your data:
- summarize()
- group_by()
- mutate()
Like select(), filter(), and arrange(), these functions all take a data frame as their first argument and return a new data frame as their output, which makes them easy to use in pipes.
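As a small illustration of that shared design, the sketch below calls summarize() both directly and through a pipe, then chains several steps together. The `toy` tibble is made up for this example and is not part of babynames.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- tibble(
  name = c("Ann", "Ann", "Bob"),
  n    = c(3, 4, 5)
)

# Direct call: the data frame is the first argument
direct <- summarize(toy, total = sum(n))

# Piped call: |> supplies the data frame as that first argument
piped <- toy |> summarize(total = sum(n))

# Each step returns a data frame, so the steps chain together
chained <- toy |>
  group_by(name) |>
  summarize(total = sum(n)) |>
  mutate(share = total / sum(total))
```

Here `direct` and `piped` hold the same one-row result, and `chained` is again a data frame, with one row per name.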
Let’s master each function and use them to analyze popularity as we go.