R - Practical 2

Welcome to the second practical of Fundamental Techniques in Data Science with R! The aim of this practical is to learn a bit more about different data types and objects in R, how to transform data and create new variables, and how to work with pipes.
Start by creating a new R Project and open a new R Markdown file within it. If you can’t remember how to do this, you can find more instructions in the preparation practical.
You should by now have tidyverse installed, which we need for this practical. Within the tidyverse we will use the dplyr, magrittr, and forcats packages, which have useful functions for data manipulation, piping, and working with factors. We will also use kableExtra to create nicely formatted tables.
library(tidyverse)
library(kableExtra)
We are going to use the General Social Survey, which is a long-running US survey conducted by NORC at the University of Chicago. The survey monitors changes in social characteristics and attitudes. The survey is quite large, so we use a smaller version that is available in the forcats package. It has the following variables: year, marital, age, race, rincome, partyid, relig, denom, and tvhours.
If you want more information on the survey you can type ?gss_cat into the console. Let’s load the data.
gss_cat <- forcats::gss_cat
We can take a look at the data using head(), which shows us the first 6 rows of the data frame.
head(gss_cat)
The str() function tells us what the class of each variable is. You will notice that most variables in the gss_cat data are factors with different levels. In this tutorial we will work mainly with these variables to learn techniques for working with factors.
str(gss_cat)
## tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
## $ age : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
## $ race : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
## $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
## $ relig : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
## $ denom : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
## $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
Tibbles are modern re-workings of the data.frame. Tibbles keep the most important features of a traditional data.frame while introducing small tweaks to improve functionality. For instance, tibbles never change variable names or types, and won’t create row names. Another useful feature of tibbles is that they warn you if you do something they don’t like, such as using a variable that does not exist.
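In addition, there are two main differences when using tibbles vs. data.frame:
Printing: by default, tibbles print only the first 10 rows and the columns that fit on screen.
Subsetting: to pull out a single variable inside a pipe you need the special . placeholder (e.g., df %>% .$x or df %>% .[["x"]] instead of df$x or df[["x"]]).
Because tibble is part of the core tidyverse, most tidyverse functions produce tibbles. However, sometimes you need to coerce R data frames to tibbles using as_tibble(), like so: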
as_tibble(gss_cat)
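To make the point about variable names concrete, here is a minimal sketch that builds the same small object with data.frame() and with tibble(). The column name `my var` and the values are invented for illustration only.

```r
library(tibble)

# Base R data frames repair non-syntactic names by default (check.names = TRUE)
df <- data.frame(`my var` = 1:3, x = c("a", "b", "c"))
names(df)

# A tibble keeps the name exactly as it was typed
tb <- tibble(`my var` = 1:3, x = c("a", "b", "c"))
names(tb)

# A regular data frame can be coerced to a tibble with as_tibble()
class(as_tibble(df))
```

Comparing the two names() results shows where the data frame altered the column name and the tibble did not.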
Pipes are a useful tool that make it easier to express a sequence of multiple operations in R. Pipes, written %>%, come from the magrittr package, so they are automatically loaded when you use the tidyverse. Pipes make code more intuitive and easier to understand, which is good for practicing open science!
Let’s compare code written with and without pipes.
Below you see how the %>% is used to pass information from one line to the next. It’s easy to follow along and see exactly what is being done to the data, line by line.
gss_cat %>%
  filter(relig == "Protestant") %>%
  group_by(year, relig) %>%
  summarize(tvhours = mean(tvhours, na.rm = TRUE))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
In contrast, the following code performs the same operations but without piping. To understand what is happening you need to read the code from the inside out - which is much more difficult.
# Without pipes (nested function calls)
summarize(
  group_by(
    filter(gss_cat, relig == "Protestant"),
    year, relig
  ),
  tvhours = mean(tvhours, na.rm = TRUE)
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Pipes don’t (automatically) assign a new object as a result of the operations you make. You need to specify <- at the beginning if you wish to save the results. There is a special variation of the pipe that allows assignments, %<>%, but it is less obvious than <-.
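A minimal sketch of the difference between printing, saving with <-, and overwriting with %<>%. The scores data and the object names are made up for this example; note that %<>% is only available after loading magrittr explicitly.

```r
library(dplyr)
library(magrittr)  # loaded explicitly because %<>% is not attached by dplyr

scores <- tibble(id = 1:4, score = c(10, 25, 40, 55))  # made-up data

# Without assignment the result is only printed; `scores` itself is unchanged
scores %>% filter(score > 20)
nrow(scores)

# Save the result by putting <- at the start of the pipeline
high_scores <- scores %>% filter(score > 20)

# %<>% pipes `scores` into filter() and overwrites `scores` with the result
scores %<>% filter(score > 20)
nrow(scores)
```

Because %<>% silently replaces the original object, the explicit <- form is usually easier for a reader to follow.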
Pipes are not appropriate for every situation. You should consider not using pipes when:
Note: Cmd + Shift + M (Mac) and Ctrl + Shift + M (Windows/Linux) are useful RStudio keyboard shortcuts for inserting %>%.
Let’s take another look at gss_cat. Below is an overview of the data. There are several categorical variables with inconsistent and somewhat messy categories.
gss_cat[1:10, ] %>%
  kable() %>%
  kable_styling(latex_options = "striped")
year | marital | age | race | rincome | partyid | relig | denom | tvhours |
---|---|---|---|---|---|---|---|---|
2000 | Never married | 26 | White | $8000 to 9999 | Ind,near rep | Protestant | Southern baptist | 12 |
2000 | Divorced | 48 | White | $8000 to 9999 | Not str republican | Protestant | Baptist-dk which | NA |
2000 | Widowed | 67 | White | Not applicable | Independent | Protestant | No denomination | 2 |
2000 | Never married | 39 | White | Not applicable | Ind,near rep | Orthodox-christian | Not applicable | 4 |
2000 | Divorced | 25 | White | Not applicable | Not str democrat | None | Not applicable | 1 |
2000 | Married | 25 | White | $20000 - 24999 | Strong democrat | Protestant | Southern baptist | NA |
2000 | Never married | 36 | White | $25000 or more | Not str republican | Christian | Not applicable | 3 |
2000 | Divorced | 44 | White | $7000 to 7999 | Ind,near dem | Protestant | Lutheran-mo synod | NA |
2000 | Married | 44 | White | $25000 or more | Not str democrat | Protestant | Other | 0 |
2000 | Married | 47 | White | $25000 or more | Strong republican | Protestant | Southern baptist | 3 |
You will also notice that there are some missing values in the data. In R, missing values can be presented in different ways.
You can easily search an entire data set for missing values using anyNA(). If missing values are present this will return TRUE, and FALSE if there are none.
anyNA(gss_cat)
## [1] TRUE
So we know there is at least one missing value in gss_cat, but we don’t know where. You can find the position of missing values in the data using is.na(). This returns a TRUE or FALSE response for every cell, row by row and column by column.
Using is.na(), can you tell which variable has only standard missing values (NAs)?
is.na(gss_cat)
# Only `tvhours` has NA values, indicated by "TRUE" in the output.
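The logical matrix returned by is.na() is hard to scan by eye. One common follow-up, sketched here on a made-up data frame (the names toy, hours and status are invented for illustration), is to sum the TRUE values per column.

```r
# Made-up data with two standard missing values in `hours`
toy <- data.frame(
  hours  = c(12, NA, 2, NA),
  status = c("Married", "Divorced", "No answer", "Widowed")
)

anyNA(toy)               # is there any NA at all?
colSums(is.na(toy))      # number of NAs in each column
which(is.na(toy$hours))  # row positions of the NAs in `hours`
```

The same idea applies to the practical data: colSums(is.na(gss_cat)) gives one count per variable. Note that "No answer" in status is not counted, because R does not treat it as missing.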
Both anyNA() and is.na() are generic methods of detecting standard missing values, and the output is not very informative. Non-standard missing values are more difficult to find because R doesn’t know that they are missing. Depending on your research question, you may want to convert non-standard missing data responses to NA for smoother data manipulation. You may also decide that these types of responses are informative and that you want to keep them as they are.
Let’s take a look at the non-standard missing responses in rincome from the gss_cat data.
A very basic search can be done using count() on the column of interest.
Below we see responses that can be considered as missing values, such as “No answer”, “Don’t know”, “Refused”, and “Not applicable”. Some of these responses have the same meaning, like “No answer” and “Refused”, whereas others might have additional meaning, like “Not applicable”. Unemployed people might respond with the latter, whereas people who wish to keep their income private refuse to answer or say “Don’t know”. As a researcher, it is up to you how you handle these responses.
gss_cat %>%
  count(rincome)
Hint: You may use count() like in the example.
gss_cat %>%
  count(marital)
gss_cat %>%
  count(partyid)
gss_cat %>%
  count(relig)
gss_cat %>%
  count(denom)
# There are numerous ways to quantify missing values in R, but sticking to the very simple method using `count()` we can inspect each factor variable in turn. Almost all of the factor variables have some responses that could be considered to be missing data.
# There are more advanced methods to find missing values that involve less code and a bit less manual work. We won't go into this in detail today.
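If you decide that some non-standard responses should be treated as missing, one possible approach is to recode those levels to NA. The sketch below uses fct_recode() from forcats with NULL as the new level name. The income vector is made up to mimic the style of rincome, and which responses count as missing is an assumption you would make yourself.

```r
library(forcats)

# Made-up responses in the style of `rincome`
income <- factor(c("$25000 or more", "No answer", "Refused",
                   "$5000 to 5999", "Not applicable"))

# Naming the new level NULL removes the old level and turns those responses into NA
income_clean <- fct_recode(income,
  NULL = "No answer",
  NULL = "Refused"
)

# "Not applicable" is kept here on purpose, as it may carry extra meaning
table(income_clean, useNA = "ifany")
```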
In this section we will practice data transformation using functions from the dplyr package in the core tidyverse. You should be familiar with some of these functions already, but today we will go a little further.
Remember to use the %>% operator!
The filter() function allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
R provides the following options for filtering: >, >=, <, <=, != (not equal), and == (equal). You can also combine these with the following logical operators: & meaning “and”, | meaning “or”, and ! meaning “not”.
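A small sketch of how these operators combine inside filter(). The people data is made up for illustration, and %in% is an extra base R operator, not mentioned above, that is often a compact alternative to several | conditions on the same variable.

```r
library(dplyr)

# Made-up data
people <- tibble(
  marital = c("Married", "Divorced", "Widowed", "Married", "Never married"),
  age     = c(34, 51, 70, 62, 25)
)

# & : both conditions must hold
married_50 <- people %>% filter(marital == "Married" & age >= 50)
married_50

# | : at least one condition must hold; %in% expresses the same idea compactly
div_wid <- people %>% filter(marital %in% c("Divorced", "Widowed"))
div_wid

# ! : negate a condition
not_married <- people %>% filter(!(marital == "Married"))
not_married
```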
Use the filter() function to display only married people in the gss_cat data set.
gss_cat %>%
  filter(marital == "Married")
# Since we only want to find married people we will use the equal operator, ==, and encase the value we want in quotes.
Use the filter() function to display divorced AND widowed people in the gss_cat data set.
gss_cat %>%
  filter(marital == "Divorced" | marital == "Widowed")
# In this case we need to combine logical operators. The or operator is appropriate here since we are looking for two kinds of matches in the same variable (& would not work since people cannot be both divorced and widowed). We combine this with the equal operator.
Use the arrange() function to reorder the information in the data frame by the number of tv hours.
gss_cat %>%
  arrange(tvhours)
# arrange() only needs one argument here, the variable you wish to order by. It orders the rows of gss_cat by the tvhours column. The default is to arrange in ascending order.
Use desc() to re-order a column in descending order. Try doing this.
gss_cat %>%
  arrange(desc(tvhours))
# When combining these functions you need to wrap desc() around the variable inside arrange().
Hint: You need to combine filter() and arrange() using the %>% operator.
gss_cat %>%
  filter(marital == "Married") %>%
  arrange(tvhours)
# Using the pipe we can perform multiple operations at once without needing to save each interim step as an object.
Can you use arrange() and count() to find what the most common religion is?
gss_cat %>%
  count(relig) %>%
  arrange(desc(n))
# In this case, we are not passing an original variable to arrange() but n, the column of counts created by count(). This tells R to count all the categories in relig and then order these categories in descending order of frequency.
Hint: select(), group_by(), and summarize() are useful functions for this.
gss_cat %>%
  select(relig, tvhours) %>% # also works without using select
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE))
# Combining several operations can seem complex, but once you understand what is happening you can apply it to most other cases. In this example we tell R to take `gss_cat`, select only `relig` and `tvhours`, group by the categories of `relig`, and summarise `tvhours` within each group. Inside the `summarise()` function we tell it to take an average and to remove missing values.
Anything written within single '' or double "" quotes in R is treated as a string. R internally stores any string within double quotes, even if you created it with single quotes. Strings usually contain unstructured or semi-structured data and we can use regular expressions (regexps) to describe patterns within strings. The stringr package is used for string manipulation and is part of the core tidyverse. There are some rules around string creation:
The following examples are valid ways to create strings:
string1 <- "This is a string in double quotes"
string2 <- 'This is a string in single quotes'
string3 <- 'If I want to include a "double quote" inside a string, I use single quotes'
string4 <- "If I want to include a 'single quote' inside a string, I use double quotes"
Multiple strings can be stored in a character vector, which you can create with c().
c("one", "two", "three")
## [1] "one" "two" "three"
You can combine strings using str_c(), which can take additional arguments like sep to dictate how the strings are separated.
str_c("x", "y", "z")
str_c("x", "y", "z", sep = ", ")
# Using the `sep` argument separates each letter in the string by a comma, but other character separators are possible (including blank space).
The opposite operation, collapsing a character vector into a single string, can be performed with the collapse argument of str_c().
str_c(c("x", "y", "z"), collapse = "")
## [1] "xyz"
You can extract parts of a string using str_sub(), which takes the string as an argument, in addition to start and end arguments.
For example, the code below extracts the 1st through 3rd characters of each string.
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
Regular expressions (regexps) are a language to describe string patterns. You can use str_view() and str_view_all(), which take a character vector and a regular expression, to show pattern matches in a string. (In recent versions of stringr, str_view() already highlights every match and str_view_all() is deprecated.)
Consider a very simple case of pattern matching. Using str_view() we get one matching string - banana - in which the pattern occurs twice.
x <- c("apple", "banana", "pear")
str_view(x, "an")
## [2] │ b<an><an>a
The next example uses . to match any character on either side of a specified character. Using the string vector created previously, we get two matches - banana and pear.
str_view(x, ".a.")
## [2] │ <ban>ana
## [3] │ p<ear>
Regexps match any part of a string, but you can provide an anchor so that the pattern matches at the start or end of the string by providing:
^ to match the start of the string
$ to match the end of the string
The code below finds one match where a string starts with a - apple.
x <- c("apple", "banana", "pear")
str_view(x, "^a")
## [1] │ <a>pple
The code below finds one match where a string ends in a - banana.
str_view(x, "a$")
## [2] │ banan<a>
You can further combine ^ and $ to match a complete string. For example, the code below only finds one exact match - apple.
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
## [2] │ <apple>
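str_view() is meant for looking at matches. When you want to use the same anchored patterns in further code, a logical or filtered result is often more convenient. The sketch below uses str_detect() and str_subset(), two stringr functions that are not covered above, on a made-up vector.

```r
library(stringr)

x <- c("apple pie", "apple", "banana", "pear")

starts_with_a <- str_detect(x, "^a")  # TRUE where the string starts with "a"
ends_with_a   <- str_detect(x, "a$")  # TRUE where the string ends with "a"
exact_apple   <- str_subset(x, "^apple$")  # keeps only complete matches

starts_with_a
ends_with_a
exact_apple
```

A logical vector like starts_with_a can be passed directly to filter() when the strings are a column in a data frame.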
This is only the beginning of what you can do with the stringr package. You can read the Strings chapter in the R4DS book if you are keen to learn more.
Factors are used to store categorical data with levels. Levels are the fixed and known set of values for that variable. Factors can be created from both strings and integers and are useful when you want to display character vectors in a non-alphabetical order.
Some base R functions convert characters to factors automatically, meaning factors can pop up unexpectedly. However, this isn’t a problem in the tidyverse. The forcats package lets us work with factors and is part of the core tidyverse.
Storing categorical character information in a vector can lead to problems, such as typos and data ordered in a non-meaningful way. For instance, we create a character vector below with a typo in “Jam”. Notice that sorting this vector does not provide a meaningful order.
x <- c("Dec", "Apr", "Jam", "Mar")
sort(x)
## [1] "Apr" "Dec" "Jam" "Mar"
Factors can help us to avoid these problems. To create a factor, start by providing a list of valid levels. For example, below we create a vector month_levels containing abbreviated months of the year.
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Next you can create the factor and provide the levels to factor(). Any values not in the set will be converted to NA. For instance, “Jam” does not appear in y because there is no matching level. This helps us to spot our mistake and trace it back to where we created x.
y <- factor(x, levels = month_levels)
print(y)
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
In addition, we can now sort y in a meaningful way according to the structure and order we created in month_levels.
sort(y)
## [1] Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
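To see why sorting now follows calendar order, it can help to look at what the factor stores. The sketch below rebuilds x and y from the example above and inspects the integer codes behind the labels.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
y <- factor(x, levels = month_levels)

levels(y)      # the allowed values, in the order we defined
as.integer(y)  # position of each value within the levels: 12 4 NA 3
x[is.na(y)]    # the original entries that matched no level: "Jam"
```

sort() orders a factor by these integer positions rather than alphabetically, which is why Mar comes before Apr and Dec.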
In the next set of exercises we are going to learn how to present data in nicely formatted tables in R Markdown. We will continue to work with gss_cat.
gss_sample <- gss_cat[sample(nrow(gss_cat), 10),] # Takes a random sample of 10 observations from the data
The kable() function is part of the knitr package and generates nicely formatted tables from matrices or data frames in R Markdown. In this exercise we will use the gss_sample data created above to learn about kable().
Create a table of gss_sample using kable().
Hint: kable() is a simple function that can work with just one argument - the data frame.
kable(gss_sample) # A very simple table
year | marital | age | race | rincome | partyid | relig | denom | tvhours |
---|---|---|---|---|---|---|---|---|
2002 | Married | 51 | Other | Don’t know | Not str democrat | Protestant | Other | NA |
2008 | Never married | 19 | White | Not applicable | Not str republican | None | Not applicable | NA |
2008 | Never married | 21 | White | Not applicable | Not str republican | Protestant | Other baptists | NA |
2010 | Married | 60 | White | Refused | Not str republican | Catholic | Not applicable | 1 |
2006 | Never married | 40 | White | $25000 or more | Strong republican | Protestant | No denomination | NA |
2014 | Divorced | 47 | White | Not applicable | Independent | Protestant | Baptist-dk which | 12 |
2008 | Separated | 68 | Black | Not applicable | Strong democrat | Protestant | Baptist-dk which | 3 |
2006 | Widowed | 71 | White | Not applicable | Strong democrat | Protestant | United methodist | NA |
2010 | Separated | 78 | White | Refused | Not str republican | Protestant | Southern baptist | 4 |
2002 | Married | 54 | White | $5000 to 5999 | Ind,near rep | None | Not applicable | NA |
# Notice that the output from `kable()` is difficult to read. This is because the `kable()` function offers limited formatting options and is not optimised for html. Within `kable()` you can align data/text by running `kable(gss_sample, align = "l")`, for instance. For more advanced formatting we need another package. More on that soon.
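A minimal sketch of the formatting that kable() does offer on its own, using the align and caption arguments. The toy data frame and the caption text are made up for illustration.

```r
library(knitr)

# Made-up data
toy <- data.frame(
  year    = c(2002, 2008),
  marital = c("Married", "Divorced")
)

# align takes one letter per column: "l" (left), "c" (centre) or "r" (right).
# A single string such as "lr" is split into one letter per column.
tab <- kable(toy, align = "lr", caption = "A small example table")
tab
```

Anything beyond alignment, captions, column names and number formatting is where kableExtra comes in.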
Use select() alongside kable() to display only the variables “year”, “age”, “race” and “rincome”.
Hint: kable() can also be combined with the dplyr functions we used before. There are two ways to do this: you can wrap the kable() function around select(); you can also pipe the results from select() to kable().
kable(select(gss_sample, year, age, race, rincome)) # Method 1
select(gss_sample, year, age, race, rincome) %>%
  kable() # Method 2
year | age | race | rincome |
---|---|---|---|
2002 | 51 | Other | Don’t know |
2008 | 19 | White | Not applicable |
2008 | 21 | White | Not applicable |
2010 | 60 | White | Refused |
2006 | 40 | White | $25000 or more |
2014 | 47 | White | Not applicable |
2008 | 68 | Black | Not applicable |
2006 | 71 | White | Not applicable |
2010 | 78 | White | Refused |
2002 | 54 | White | $5000 to 5999 |
# Again, the output is very difficult to read. Let's learn how to improve this in the next exercise.
The kableExtra package extends the basic functionality of kable(). A nice thing about kableExtra is that its features work well with HTML and PDF outputs. You can install this package from CRAN as usual.
You can use the pipe operator, %>%, to push kable() output to kableExtra styling options. For example, you can create a striped table using the code below.
kable(gss_sample) %>% kable_styling(latex_options = "striped")
year | marital | age | race | rincome | partyid | relig | denom | tvhours |
---|---|---|---|---|---|---|---|---|
2002 | Married | 51 | Other | Don’t know | Not str democrat | Protestant | Other | NA |
2008 | Never married | 19 | White | Not applicable | Not str republican | None | Not applicable | NA |
2008 | Never married | 21 | White | Not applicable | Not str republican | Protestant | Other baptists | NA |
2010 | Married | 60 | White | Refused | Not str republican | Catholic | Not applicable | 1 |
2006 | Never married | 40 | White | $25000 or more | Strong republican | Protestant | No denomination | NA |
2014 | Divorced | 47 | White | Not applicable | Independent | Protestant | Baptist-dk which | 12 |
2008 | Separated | 68 | Black | Not applicable | Strong democrat | Protestant | Baptist-dk which | 3 |
2006 | Widowed | 71 | White | Not applicable | Strong democrat | Protestant | United methodist | NA |
2010 | Separated | 78 | White | Refused | Not str republican | Protestant | Southern baptist | 4 |
2002 | Married | 54 | White | $5000 to 5999 | Ind,near rep | None | Not applicable | NA |
kable(gss_sample) %>%
  kable_styling(latex_options = "striped",
                font_size = 8) # Change the font size here
year | marital | age | race | rincome | partyid | relig | denom | tvhours |
---|---|---|---|---|---|---|---|---|
2002 | Married | 51 | Other | Don’t know | Not str democrat | Protestant | Other | NA |
2008 | Never married | 19 | White | Not applicable | Not str republican | None | Not applicable | NA |
2008 | Never married | 21 | White | Not applicable | Not str republican | Protestant | Other baptists | NA |
2010 | Married | 60 | White | Refused | Not str republican | Catholic | Not applicable | 1 |
2006 | Never married | 40 | White | $25000 or more | Strong republican | Protestant | No denomination | NA |
2014 | Divorced | 47 | White | Not applicable | Independent | Protestant | Baptist-dk which | 12 |
2008 | Separated | 68 | Black | Not applicable | Strong democrat | Protestant | Baptist-dk which | 3 |
2006 | Widowed | 71 | White | Not applicable | Strong democrat | Protestant | United methodist | NA |
2010 | Separated | 78 | White | Refused | Not str republican | Protestant | Southern baptist | 4 |
2002 | Married | 54 | White | $5000 to 5999 | Ind,near rep | None | Not applicable | NA |
The kableExtra package also comes with different themes:
kable_paper
kable_classic
kable_classic_2
kable_minimal
kable_material
kable_material_dark
kable(gss_sample) %>%
  kable_classic() %>% # Add your theme here
  kable_styling(latex_options = "striped",
                font_size = 12)
year | marital | age | race | rincome | partyid | relig | denom | tvhours |
---|---|---|---|---|---|---|---|---|
2002 | Married | 51 | Other | Don’t know | Not str democrat | Protestant | Other | NA |
2008 | Never married | 19 | White | Not applicable | Not str republican | None | Not applicable | NA |
2008 | Never married | 21 | White | Not applicable | Not str republican | Protestant | Other baptists | NA |
2010 | Married | 60 | White | Refused | Not str republican | Catholic | Not applicable | 1 |
2006 | Never married | 40 | White | $25000 or more | Strong republican | Protestant | No denomination | NA |
2014 | Divorced | 47 | White | Not applicable | Independent | Protestant | Baptist-dk which | 12 |
2008 | Separated | 68 | Black | Not applicable | Strong democrat | Protestant | Baptist-dk which | 3 |
2006 | Widowed | 71 | White | Not applicable | Strong democrat | Protestant | United methodist | NA |
2010 | Separated | 78 | White | Refused | Not str republican | Protestant | Southern baptist | 4 |
2002 | Married | 54 | White | $5000 to 5999 | Ind,near rep | None | Not applicable | NA |
You can also combine kableExtra with other tidyverse functions like fct_recode() (from forcats) and summarise() (from dplyr). The example below uses fct_recode() to make the partyid categories easier to follow, then counts the observations in each category and passes the result to kbl() and some styling options.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican"  = "Strong republican",
    "Republican"  = "Not str republican",
    "Democrat"    = "Strong democrat",
    "Democrat"    = "Not str democrat",
    "Independent" = "Ind,near rep",
    "Independent" = "Ind,near dem"
  )) %>%
  group_by(partyid) %>%
  summarise(n = n()) %>%
  kbl() %>%
  kable_paper(bootstrap_options = "striped", full_width = FALSE)
partyid | n |
---|---|
No answer | 154 |
Don’t know | 1 |
Other party | 393 |
Republican | 5346 |
Independent | 8409 |
Democrat | 7180 |
There are many other styling options included in kableExtra. I recommend you try them out yourself.
Other packages for creating nice tables in R Markdown exist too, such as xtable, stargazer, and pander.