R
- Practical 3In this practical you will learn about exploratory data analysis
(EDA). EDA involves inspecting, transforming and visualising data with
the goal of finding interesting features to substantiate one or more
research questions. This practical will draw on skills you learned in
week 1 and week 2, while introducing you to some new material like
visualisation with ggplot()
.
You will need the following packages for this tutorial.
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(weathermetrics)
You will work with the mpg
dataset from the
ggplot2
package. mpg
contains a subset of the
fuel economy data that EPA makes available here. The data frame has 234 rows
and 11 variables:
You can also type ?mpg
into the console to learn
more.
Let’s get an overview of the data.
head(mpg)
In previous weeks we learned that str()
can be used to
get an overview of the data and the structure of a dataframe. You can
also use glimpse()
from the tidyverse
which
returns similar results in a more readable format.
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Before you begin exploring your data, you need to check that the class or mode of your variables is as expected. Remember from previous practicals, you have character/factor and numerical/integer types.
Are there any variables that do not have the class you expect it to?
Make any changes you feel are necessary.
Exploratory Data Analysis is a process whereby you examine and transform your data to extract interesting insights and understand what relationships (if any!) exist among the variables. In this way, EDA is closely related to principles of data cleaning because tidy data is easier to explore. Initial steps of EDA involve examining distributions, handling missings and outliers, removing redundant data and coding categorical variables properly.
EDA is very flexible and iterative process and gives the researcher a lot of freedom, but it is typically split into graphical and non-graphical analysis. Further within this framework we can also start to think about what kind of variation occurs within variables and what kind of covariation occurs between my variables.
Variation describes how the values of a variable change each time you measure it. A continuous variable that is measured on one occasion will have different values to that same variable measured on another occasion. This variation between measurement occasions are what we call “error”. Categorical variables also vary across measurement occasions or across individuals. Visualising variation is a very intuitive way to understand trends or patterns within a variable.
Covariation describes how the values of at least two variables vary together or are related. Again, visualising covariation is also the most intuitive way to understand patterns between variables.
If you are interested in a particular aspect of a variable you can summarise this information in a table. We have covered how to create nicely formatted tables in the previous practical, so the example below is just to refresh your memory.
Below we summarise hwy
by vehicle class
, by
asking for commonly used summary statistics.
mpg %>%
group_by(class) %>% # by class
summarise(min = min(hwy), # minimum of hwy
mean = mean(hwy), # mean
sd = sd(hwy), # standard deviation
max = max(hwy), # maximum
n = n()) %>% # nr of observations
kbl() %>%
kable_styling(latex_options = c("striped", "hover"), full_width = F)
class | min | mean | sd | max | n |
---|---|---|---|---|---|
2seater | 23 | 24.80000 | 1.303840 | 26 | 5 |
compact | 23 | 28.29787 | 3.781620 | 44 | 47 |
midsize | 23 | 27.29268 | 2.135930 | 32 | 41 |
minivan | 17 | 22.36364 | 2.062655 | 24 | 11 |
pickup | 12 | 16.87879 | 2.274280 | 22 | 33 |
subcompact | 20 | 28.14286 | 5.375012 | 44 | 35 |
suv | 12 | 18.12903 | 2.977973 | 27 | 62 |
3. Create a summary of engine displacement by year, including the minimum, maximum, median, and inter quantile range.
Plots can be made using base R functions like plot()
,
hist()
, or barplot()
. Here are some examples
of how this works on the mpg
data.
hist(mpg$displ)
barplot(table(mpg$cyl))
plot(x = mpg$displ, y = mpg$hwy,
xlab = "Highway mpg",
ylab = "Engine displacement (L)")
These plots are quick to produce and are useful for an initial
understanding of the data, but the syntax used to create them is
specific to the type of plot. In contrast, ggplot
has a
streamlined approach to plotting involving largely the same steps
regardless of plot type. Specifically, in ggplot
we build
up visualisations layer by layer using the +
operator
(which is similar to the %>%
operator we are already
familiar with).
The main, important, layers in ggplot
are:
ggplot()
aes()
geom()
facet()
What does this layering system look like? In Chapter 3: Data Visualisation of Hadley Wickham’s R4DS book, there is a general ggplot template to give you an idea of how plots are layered.
```{r}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>)) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION> +
<THEME>
```
Below is a real example of how this looks. Remember that you do not
have to / need to use all of the ggplot
options (and there
are many many more than presented here). However, this gives you an idea
of how the layers of these plots are built up.
ggplot(iris) + # Data
geom_point(mapping = aes(x = Sepal.Length, # Variable on the x-axis
y = Sepal.Width, # Variable on the y-axis
colour = Species)) + # Legend
labs(x = "Sepal Length",
y = "Sepal Width",
title = "Relationship between Sepal Length and Width by Species") +
coord_cartesian() + # Default standard for mapping x and y
facet_wrap(~Species) + # Splits plot by the species variable
theme_bw() # Sets the background theme
You can read about different ggplot2
options here. The reference
guides by RPubs, STHDA
and Data
Novia will be useful once you grasp the basics and want to make more
complicated visualisations.
Bar charts are often used to visualise categorical
variables with a finite set of values. Below is an example how to
visuailse the distribution of drv
- the type of drive train
in the mpg
dataset.
geom_bar
transforms each value of drv
into
a count to be plotted on the y-axis. The x-axis presents the category
associated with each count.
ggplot(mpg) +
geom_bar(aes(x = drv))
Create a bar chart that shows a count for the different
vehicle classes in mpg
.
Look up different ggplot
themes, and apply
one to the bar chart you just created.
Histograms are often used to visualise the
distribution of a continuous variable. Below is an example of how to
visualise the distribution of cty
- the number of city
miles per gallon in the mpg
dataset.
geom_hist
transforms the x-axis into equally spaced
“bins” and the height of each bar on the y-axis is the number of
observations falling in each bin.
ggplot(mpg) +
geom_histogram(aes(x = cty),
binwidth = 3)
You can change the value of binwidth
to adjust the width
of the intervals. Different binwidths can reveal different patterns in
the data.
binwidth
to 1
? Does the distribution
change?Density plots are an alternative to histograms for
continuous data. Below is an example of the density of cty
from the mpg
data. The argument
fill = "darkseagreen"
just adds colour. You can also
specify colour = "<COLOUR NAME>"
to change the
density line.
geom_density
transforms the data into smoothed kernel
density estimates (basically a smooth histogram). It is useful for
continuous data that come from an underlying smooth distribution.
ggplot(mpg, aes(x = cty)) +
geom_density(fill = "darkseagreen")
You can also include raw data in a histogram in the form of “rug” marks.
+ geom_rug(size = 1, colour = "darkorange")
.Scatterplots are often used to explore how two
continuous variables covary together. Below is an example of how to
visualise covariation between displ
(engine displacement)
and hwy
(highway miles per gallon) in the mpg
data. If there is a clear pattern in how the points fall on the x- and
y-axes then we can say these variables covary.
geom_point
plots values of displ
on the
y-axis and values of hwy
on the x-axis.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy))
Some of these values might be clustered together based on another characteristic. You can explore this by adding colour to the aesthetic mappings, for example.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = class))
colour
aesthetic. Add a theme and change the titles
of the x- and y-axes.Different aesthetic mappings can be combined in one plot. For
example, you may want to add a line of best fit to a scatterplot plot,
which you can do using geom_smooth()
.
You can fit more than one line in the same plot. This is often useful in multilevel modelling, where different hierarchies (e.g., groups, locations, individuals) have different slopes.
10. Recreate the previous plot, but adding
colour = class
to the aesthetic mappings. Use
method = "lm"
and set se = FALSE
in
geom_smooth()
. What difference does this make?
Boxplots are often used to explore the distribution
of a continuous variable broken down by a categorical variable. Below is
an example of hwy
broken down by drv
.
geom_boxplot
transforms the data to show summary
statistics that are represented by the different visual features of the
boxplot.
ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = hwy))
These features of the boxplot are:
What do each of these features tell us?
Add + coord_flip()
to the plot above. What
does this do?
What can you conclude about the highway miles per gallon of each type of drive train from the boxplot above?
The variable trans
contains 10 different transmissions
types. These 10 categories can be assumed under 2 broader categories:
manual and auto.
Use mutate()
and fct_collapse()
to collapse the 10 categories of trans
to just these 2
categories.
Use the previous example to create a boxplot mapping
cty
on the y-axis and drv
on the x-axis, this
time adding the argument colour = trans
to
aes()
. What has changed?
Facets are a way to split your plot into many subplots according to some categorical variable.
To do this, you use the command facet_wrap()
to facet by
a single variable. The first argument to facet_wrap()
is a
formula initiated by ~
and a variable name. Below is an
example of how this can work on the mpg
data.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~ class, nrow = 2)
It is possible to facet on a combination of two variables using
facet_grid()
. Like before, the first argument is a formula,
but containing two variable names separated by ~
. Below is
an example of how this can work.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_grid(drv ~ cyl)
Notice that there are empty cells in the bottom left facets above. What do you think this means?
Create a scatter plot of displ
and
hwy
, faceting by the of manufacturer
name.
Change the names on the axes to be more informative.
If you have multiple plots that you want to arrange on the same page,
you can use ggarrange()
from the ggpubr
package.
The first step is creating some plots. For this exercise we will only
use the variables hwy
, cty
, and
class
.
A. A bar plot showing counts of class
, assigning
class
to the fill
aesthetic.
B. A box plot of class
by hwy
per
gallon, assigning class
to the colour
aesthetic.
C. A jittered scatterplot of hwy
and
cty
miles per gallon, assigning class
to the
colour
aesthetic.
Remember to save each plot as an object to be called upon later.
D. Use ggarrange()
to arrange the three plots in
one space. ggarrange()
takes the plots as arguments as well
as ncol
or nrow
to customise the
arrangement
E. Repeat the code from the previous question, adding
common.legend = TRUE
and legend = "bottom"
to
ggarrange()
.
End of practical