Counts

geom_count()

Boxplots provide an efficient way to explore the interaction of a continuous variable and a categorical variable. But what if you have two categorical variables?

You can see how observations are distributed across two categorical variables with geom_count(). geom_count() draws a point at each combination of values from the two variables. The size of the point is mapped to the number of observations with this combination of values. Rare combinations will have small points, frequent combinations will have large points.

Exercise 8: Count plots

Use geom_count() to plot the interaction of the cut and clarity variables in the diamonds data set.

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = clarity))

count()

You can use the count() function in the {dplyr} package to compute the count values displayed by geom_count(). To use count(), pass it a data frame and then the names of zero or more variables in the data frame. count() will return a new table that lists how many observations occur with each possible combination of the listed variables.

So for example, the code below returns the counts that you visualized in Exercise 8.

diamonds |> 
  count(cut, clarity)
# A tibble: 40 × 3
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows

Heat maps

Heat maps provide a second way to visualize the relationship between two categorical variables. They work like count plots, but use a fill color instead of a point size, to display the number of observations in each combination.

How to make a heat map

{ggplot2} does not provide a geom function for heat maps, but you can construct a heat map by plotting the results of count() with geom_tile().

To do this, set the x and y aesthetics of geom_tile() to the variables that you pass to count(). Then map the fill aesthetic to the n variable computed by count(). The plot below displays the same counts as the plot in Exercise 8.

diamonds |> 
  count(cut, clarity) |> 
  ggplot() +
    geom_tile(mapping = aes(x = cut, y = clarity, fill = n))

Exercise 9: Make a heat map

Practice the method above by re-creating the heat map below.

diamonds |> 
  count(color, cut) |> 
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

Good job!

Recap

Boxplots, dotplots and violin plots provide an easy way to look for relationships between a continuous variable and a categorical variable. Violin plots convey a lot of information quickly, but boxplots have a head start in popularity—they were easy to use when statisticians had to draw graphs by hand.

In any of these graphs, look for distributions, ranges, medians, skewness or anything else that catches your eye to change in an unusual way from distribution to distribution. Often, you can make patterns even more revealing with the fct_reorder() function from the {forcats} package (we’ll wait to learn about {forcats} until after you study factors).

Count plots and heat maps help you see how observations are distributed across the interactions of two categorical variables.

Next Topic