R
- Practical 3In this practical you will learn about exploratory data analysis
(EDA). EDA involves inspecting, transforming and visualising data with
the goal of finding interesting features to substantiate one or more
research questions. This practical will draw on skills you learned in
week 1 and week 2, while introducing you to some new material like
visualisation with ggplot()
.
You will need the following packages for this tutorial.
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(weathermetrics)
You will work with the mpg
dataset from the
ggplot2
package. mpg
contains a subset of the
fuel economy data that EPA makes available here. The data frame has 234 rows
and 11 variables:
You can also type ?mpg
into the console to learn
more.
Let’s get an overview of the data.
head(mpg)
In previous weeks we learned that str()
can be used to
get an overview of the data and the structure of a dataframe. You can
also use glimpse()
from the tidyverse
which
returns similar results in a more readable format.
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Before you begin exploring your data, you need to check that the class or mode of your variables is as expected. Remember from previous practicals, you have character/factor and numerical/integer types.
# All variables that are listed as 'character' are categorical and should be listed as 'factor'.
# Here we should change all of the character variables to factor variables with levels. For now, we leave the levels as they are.
# There are a couple of methods to do this, but the quickest way is using `mutate_if()` which performs a conditional operation. Base R is also fine.
mpg <- mpg %>%
mutate_if(is.character, as.factor)
# If we inspect the data gain using glimpse(), we now see that the character variables are converted into factors
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <fct> audi, audi, audi, audi, audi, audi, audi, audi, audi, aud…
## $ model <fct> a4, a4, a4, a4, a4, a4, a4, a4 quattro, a4 quattro, a4 qu…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
## $ drv <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
## $ class <fct> compact, compact, compact, compact, compact, compact, com…
Exploratory Data Analysis is a process whereby you examine and transform your data to extract interesting insights and understand what relationships (if any!) exist among the variables. In this way, EDA is closely related to principles of data cleaning because tidy data is easier to explore. Initial steps of EDA involve examining distributions, handling missings and outliers, removing redundant data and coding categorical variables properly.
EDA is very flexible and iterative process and gives the researcher a lot of freedom, but it is typically split into graphical and non-graphical analysis. Further within this framework we can also start to think about what kind of variation occurs within variables and what kind of covariation occurs between my variables.
Variation describes how the values of a variable change each time you measure it. A continuous variable that is measured on one occasion will have different values to that same variable measured on another occasion. This variation between measurement occasions are what we call “error”. Categorical variables also vary across measurement occasions or across individuals. Visualising variation is a very intuitive way to understand trends or patterns within a variable.
Covariation describes how the values of at least two variables vary together or are related. Again, visualising covariation is also the most intuitive way to understand patterns between variables.
If you are interested in a particular aspect of a variable you can summarise this information in a table. We have covered how to create nicely formatted tables in the previous practical, so the example below is just to refresh your memory.
Below we summarise hwy
by vehicle class
, by
asking for commonly used summary statistics.
mpg %>%
group_by(class) %>% # by class
summarise(min = min(hwy), # minimum of hwy
mean = mean(hwy), # mean
sd = sd(hwy), # standard deviation
max = max(hwy), # maximum
n = n()) %>% # nr of observations
kbl() %>%
kable_styling(latex_options = c("striped", "hover"), full_width = F)
class | min | mean | sd | max | n |
---|---|---|---|---|---|
2seater | 23 | 24.80000 | 1.303840 | 26 | 5 |
compact | 23 | 28.29787 | 3.781620 | 44 | 47 |
midsize | 23 | 27.29268 | 2.135930 | 32 | 41 |
minivan | 17 | 22.36364 | 2.062655 | 24 | 11 |
pickup | 12 | 16.87879 | 2.274280 | 22 | 33 |
subcompact | 20 | 28.14286 | 5.375012 | 44 | 35 |
suv | 12 | 18.12903 | 2.977973 | 27 | 62 |
3. Create a summary of engine displacement by year, including the minimum, maximum, median, and inter quantile range.
mpg %>%
group_by(year) %>% # by year
summarise(min = min(displ), # minimum of engine displacement
Q1 = quantile(displ, 0.25), # first quartile
median = median(displ), # median
Q3 = quantile(displ, 0.75), # third quartile
max = max(displ), # maximum
IQR = Q3 - Q1) %>% # inter quartile range
kbl() %>%
kable_styling(latex_options = c("striped", "hover"), full_width = F)
year | min | Q1 | median | Q3 | max | IQR |
---|---|---|---|---|---|---|
1999 | 1.6 | 2.2 | 3.0 | 4.0 | 6.5 | 1.8 |
2008 | 1.8 | 2.5 | 3.6 | 4.7 | 7.0 | 2.2 |
Plots can be made using base R functions like plot()
,
hist()
, or barplot()
. Here are some examples
of how this works on the mpg
data.
hist(mpg$displ)
barplot(table(mpg$cyl))
plot(x = mpg$displ, y = mpg$hwy,
xlab = "Highway mpg",
ylab = "Engine displacement (L)")
These plots are quick to produce and are useful for an initial
understanding of the data, but the syntax used to create them is
specific to the type of plot. In contrast, ggplot
has a
streamlined approach to plotting involving largely the same steps
regardless of plot type. Specifically, in ggplot
we build
up visualisations layer by layer using the +
operator
(which is similar to the %>%
operator we are already
familiar with).
The main, important, layers in ggplot
are:
ggplot()
aes()
geom()
facet()
What does this layering system look like? In Chapter 3: Data Visualisation of Hadley Wickham’s R4DS book, there is a general ggplot template to give you an idea of how plots are layered.
```{r}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>)) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION> +
<THEME>
```
Below is a real example of how this looks. Remember that you do not
have to / need to use all of the ggplot
options (and there
are many many more than presented here). However, this gives you an idea
of how the layers of these plots are built up.
ggplot(iris) + # Data
geom_point(mapping = aes(x = Sepal.Length, # Variable on the x-axis
y = Sepal.Width, # Variable on the y-axis
colour = Species)) + # Legend
labs(x = "Sepal Length",
y = "Sepal Width",
title = "Relationship between Sepal Length and Width by Species") +
coord_cartesian() + # Default standard for mapping x and y
facet_wrap(~Species) + # Splits plot by the species variable
theme_bw() # Sets the background theme
You can read about different ggplot2
options here. The reference
guides by RPubs, STHDA
and Data
Novia will be useful once you grasp the basics and want to make more
complicated visualisations.
Bar charts are often used to visualise categorical
variables with a finite set of values. Below is an example how to
visuailse the distribution of drv
- the type of drive train
in the mpg
dataset.
geom_bar
transforms each value of drv
into
a count to be plotted on the y-axis. The x-axis presents the category
associated with each count.
ggplot(mpg) +
geom_bar(aes(x = drv))
mpg
.ggplot(mpg) + # list the dataset
geom_bar(aes(x = class)) # geom_bar for a barplot, and the 'class' variable on the x-axis.
# geom_bar adds counts on the y-axis by default
ggplot
themes, and apply one
to the bar chart you just created.ggplot(mpg) +
geom_bar(aes(x = class)) +
theme_bw() # Here we add the theme
# There are different `ggplot` themes and the one you choose is largely down to personal preference, but you also want one that enhances your data visualisation. Here I choose `+ theme_bw()` but other examples include `+ theme_minimal()` and `+ theme_light()`.
Histograms are often used to visualise the
distribution of a continuous variable. Below is an example of how to
visualise the distribution of cty
- the number of city
miles per gallon in the mpg
dataset.
geom_hist
transforms the x-axis into equally spaced
“bins” and the height of each bar on the y-axis is the number of
observations falling in each bin.
ggplot(mpg) +
geom_histogram(aes(x = cty),
binwidth = 3)
You can change the value of binwidth
to adjust the width
of the intervals. Different binwidths can reveal different patterns in
the data.
binwidth
to 1
? Does the distribution
change?ggplot(mpg) +
geom_histogram(aes(x = cty),
binwidth = 1)
# The general distribution does not drastically change, but it is now much easier to spot outliers. Specifically, there are two extreme values to the right of the distribution where the city miles per gallon exceed 30 in a small number of cases.
Density plots are an alternative to histograms for
continuous data. Below is an example of the density of cty
from the mpg
data. The argument
fill = "darkseagreen"
just adds colour. You can also
specify colour = "<COLOUR NAME>"
to change the
density line.
geom_density
transforms the data into smoothed kernel
density estimates (basically a smooth histogram). It is useful for
continuous data that come from an underlying smooth distribution.
ggplot(mpg, aes(x = cty)) +
geom_density(fill = "darkseagreen")
You can also include raw data in a histogram in the form of “rug” marks.
+ geom_rug(size = 1, colour = "darkorange")
.ggplot(mpg, aes(x = cty)) +
geom_density(fill = "darkseagreen") +
geom_rug(size = 1,
colour = "darkorange")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Scatterplots are often used to explore how two
continuous variables covary together. Below is an example of how to
visualise covariation between displ
(engine displacement)
and hwy
(highway miles per gallon) in the mpg
data. If there is a clear pattern in how the points fall on the x- and
y-axes then we can say these variables covary.
geom_point
plots values of displ
on the
y-axis and values of hwy
on the x-axis.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy))
Some of these values might be clustered together based on another characteristic. You can explore this by adding colour to the aesthetic mappings, for example.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = class))
colour
aesthetic. Add a theme and change the titles
of the x- and y-axes.ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = trans)) +
labs(x = "Engine Displacement (L)",
y = "Highway miles per gallon",) +
theme_bw()
Different aesthetic mappings can be combined in one plot. For
example, you may want to add a line of best fit to a scatterplot plot,
which you can do using geom_smooth()
.
ggplot(mpg, aes(x = displ,
y = hwy)) +
geom_point() +
geom_smooth() +
labs(x = "Engine Displacement (L)",
y = "Highway miles per gallon",) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# `geom_smooth()` is used to add a line of best fit. The default method is "loess" with the formula x ~ y, but you can change this to any other method. For example, you can choose `method = "lm"` for a linear best fit line.
You can fit more than one line in the same plot. This is often useful in multilevel modelling, where different hierarchies (e.g., groups, locations, individuals) have different slopes.
10. Recreate the previous plot, but adding
colour = class
to the aesthetic mappings. Use
method = "lm"
and set se = FALSE
in
geom_smooth()
. What difference does this make?
ggplot(mpg, aes(x = displ,
y = hwy,
colour = class)) +
geom_point() +
geom_smooth(method = lm,
se = FALSE) +
labs(x = "Engine Displacement (L)",
y = "Highway miles per gallon",) +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
# Adding `colour = class` to the aesthetic mappings causes `geom_smooth()` to fit a line for each class of vehicle. Using `method = "lm"` with `se = FALSE` fits a linear line without the standard error. Dropping the standard error makes the plot more readable, since they are otherwise overlapping.
Boxplots are often used to explore the distribution
of a continuous variable broken down by a categorical variable. Below is
an example of hwy
broken down by drv
.
geom_boxplot
transforms the data to show summary
statistics that are represented by the different visual features of the
boxplot.
ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = hwy))
These features of the boxplot are:
# Lower horizontal line -> 25th percentile
# Thicker horizontal middle line -> median
# Upper horizontal line -> 75th percentile
# Vertical whiskers -> IQR
# 1.5 Points beyond the whiskers -> OUTLIERS
+ coord_flip()
to the plot above. What does
this do?ggplot(mpg) +
geom_boxplot(aes(x = drv, y = hwy)) +
coord_flip()
# coord_flip()` swaps the x- and y-axes.
# There are many outliers (points beyond the whiskers) for drive train type "f", which represents a front-wheel drive. There are no outliers beyond the IQR for the other types of drive train, 4-wheel drive and rear-wheel drive.
The variable trans
contains 10 different transmissions
types. These 10 categories can be assumed under 2 broader categories:
manual and auto.
mutate()
and fct_collapse()
to
collapse the 10 categories of trans
to just these 2
categories.mpg <- mpg %>%
mutate(trans = fct_collapse(trans,
"auto" = c("auto(av)",
"auto(l3)",
"auto(l4)",
"auto(l5)",
"auto(l6)",
"auto(s4)",
"auto(s5)",
"auto(s6)"),
"manual" = c("manual(m5)",
"manual(m6)")))
cty
on the y-axis and drv
on the x-axis, this
time adding the argument colour = trans
to
aes()
. What has changed?ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = cty,
colour = trans))
# Adding the argument `colour = trans` to the aesthetic mappings further splits the drive train into auto and manual.
# Note that you can also perform the above operations in one piped step.
mpg %>%
mutate(trans = fct_collapse(trans,
"auto" = c("auto(av)",
"auto(l3)",
"auto(l4)",
"auto(l5)",
"auto(l6)",
"auto(s4)",
"auto(s5)",
"auto(s6)"),
"manual" = c("manual(m5)",
"manual(m6)"))) %>%
ggplot() +
geom_boxplot(aes(x = drv,
y = cty,
colour = trans))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `trans = fct_collapse(...)`.
## Caused by warning:
## ! Unknown levels in `f`: auto(av), auto(l3), auto(l4), auto(l5), auto(l6), auto(s4), auto(s5), auto(s6), manual(m5), manual(m6)
Facets are a way to split your plot into many subplots according to some categorical variable.
To do this, you use the command facet_wrap()
to facet by
a single variable. The first argument to facet_wrap()
is a
formula initiated by ~
and a variable name. Below is an
example of how this can work on the mpg
data.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~ class, nrow = 2)
It is possible to facet on a combination of two variables using
facet_grid()
. Like before, the first argument is a formula,
but containing two variable names separated by ~
. Below is
an example of how this can work.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_grid(drv ~ cyl)
# There are no rear-wheel drives that have 4 cyclinders, so there is nothing to plot here.
displ
and
hwy
, faceting by the of manufacturer
name.ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~manufacturer, nrow = 3)
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
labs(x = "Engine displacement (litres)",
y = "Highway miles per gallon") +
facet_wrap(~manufacturer, nrow = 3)
If you have multiple plots that you want to arrange on the same page,
you can use ggarrange()
from the ggpubr
package.
The first step is creating some plots. For this exercise we will only
use the variables hwy
, cty
, and
class
.
A. A bar plot showing counts of class
, assigning
class
to the fill
aesthetic.
B. A box plot of class
by hwy
per
gallon, assigning class
to the colour
aesthetic.
C. A jittered scatterplot of hwy
and
cty
miles per gallon, assigning class
to the
colour
aesthetic.
Remember to save each plot as an object to be called upon later.
# plot A
p1 <- ggplot(data = mpg, aes(class, fill = class)) +
geom_bar() +
labs(x = "Vehicle class",
y = "Count") +
coord_flip() +
theme_bw()
# plot B
p2 <- ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy, colour = class)) +
labs(x = "Vehicle class",
y = "Highway mpg") +
coord_flip() +
theme_bw()
# plot C
p3 <- ggplot(data = mpg) +
geom_jitter(mapping = aes(x = cty, y = hwy, colour = class)) +
labs(x = "Vehicle class",
y = "Highway mpg") +
theme_bw()
# I create three plots names "p1", "p2", and "p3", using previous examples to guide me if necessary. For p1 and p2 I add `coord_flip()` to make the categories easier to read.
D. Use ggarrange()
to arrange the three plots in
one space. ggarrange()
takes the plots as arguments as well
as ncol
or nrow
to customise the
arrangement
ggarrange(p1, p2, p3,
ncol = 2,
nrow = 2)
# The only necessary arguments are the plots, but specifying `ncol` and `nrow` can improve the appearance. the repetition of the legend is usually something we want to avoid.
E. Repeat the code from the previous question, adding
common.legend = TRUE
and legend = "bottom"
to
ggarrange()
.
ggarrange(p1, p2, p3,
nrow = 2,
ncol = 2,
vjust = 0.5,
common.legend = TRUE,
legend = "bottom")
# Setting `common.legend = TRUE` gets rid of the repetitive legend and slightly improves the look. In this example I move the legend to the bottom position, but "top", "left", and "right" are other options.
End of practical