R - Practical 1
Welcome to this first practical of Fundamental Techniques of Data Science in R! Each week, the practical of that week covers (part of) the topics discussed in the lectures and the reading materials. The practicals prepare the students for the topics of the graded assignments of the workgroups.
For each practical, you are expected to hand in an R Markdown file and the corresponding HTML file showing that you followed all steps of the practical. These files won’t be graded, but students are expected to complete and hand in all practicals as part of meeting the course requirements. The deadline for each practical is the start of the Monday afternoon lecture of the following week. Students are allowed to skip one practical. However, a student who fails to submit a practical more than once loses the right to take a retake exam. Instructions on submitting practicals can be found on the course website under ‘R practical’.
The practicals are discussed in the Tuesday afternoon lecture, where there is room to ask questions or get help with the practicals. The answers to each practical can be found on the course website. It is strongly recommended to try the practical yourself before checking the answers. However, you can check the answers at any time, as long as you hand in your practical on time.
In this practical, we will make use of the following packages:
library(dplyr)
library(readr)
library(knitr)
library(kableExtra)
#Note that `dplyr` and `readr` are included in the `tidyverse` distribution. If you have `tidyverse` installed you can simply load `library(tidyverse)`. You can also load packages within `tidyverse` independently (as above), which is a bit quicker.
The following exercises are some basic (mathematical) operations to illustrate what you could code with R. Please run all exercises up to 11 in a single code chunk. Comment on half of the exercises in the code chunk (use a # after the code or on a different line). These comments can be helpful for others (or a later version of yourself) to understand what the code is supposed to do.
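As a minimal sketch of what commented code in a single chunk can look like (the object names x and y and the values are hypothetical and are not the answers to the exercises):

```r
# Store the value 3 in an object called x (hypothetical name)
x <- 3

# Typing the object name prints its value
x

y <- x * 2   # a comment can also follow the code on the same line

# Check whether y is equal to x + x; this returns TRUE
y == x + x
```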
2. Create an object a with value 1.
3. Verify that 1 is stored in a.
4. Square a.
5. Assign a + a to the object b, and check if b is equal to a + a.
6. Square b.
7. Multiply the answer of question 6 by a over b.
8. Assign the result to c.
9. Take the square root (use sqrt()) of c to the power b.
10. Multiply the answer of question 9 by a over (b to the power 6).
11. Round the answer from the previous question to 3 decimal places (use round(), and use ?round() to find out more about how to use this code).
Now that you know how to use R as a calculator and R Markdown for typesetting, we can move on to some more advanced operations.
A function in R is a piece of code that contains a set of statements organized to perform a specific task. For example, a function could be used to calculate the mean of some data, or to make a barplot of some other data. In R code, functions can be recognized by the parentheses after a word (e.g. mean() is a function).
Functions consist of a name, input arguments, the actions performed on those arguments, and the output that is returned.
For example, the function mean() takes a vector of numbers as input, then sums the numbers and divides the sum by the number of elements, and finishes by returning the obtained number.
The code below illustrates how a function is constructed.
# Example function
# Function name and the input arguments
function_name <- function(argument_1, argument_2, argument_3){
  # actions of the function
  x <- (argument_1 + argument_2) / argument_3
  # returning output
  return(x)
}
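To see the example function in action, it can be called with concrete values. The sketch below repeats the definition so that it runs on its own; the input values 4, 6, and 2 and the object names result_1 and result_2 are chosen purely for illustration.

```r
# Same example function as above
function_name <- function(argument_1, argument_2, argument_3) {
  x <- (argument_1 + argument_2) / argument_3
  return(x)
}

# Arguments can be matched by position ...
result_1 <- function_name(4, 6, 2)   # (4 + 6) / 2

# ... or by name, in which case the order does not matter
result_2 <- function_name(argument_3 = 2, argument_1 = 4, argument_2 = 6)

result_1                        # 5
identical(result_1, result_2)   # TRUE
```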
R has lots of built-in functions or functions in packages, such as seq(), mean(), min(), max(), and sum(). If you want more information about a built-in function, you can always run the code ?function_name() to retrieve documentation on the function.
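A short sketch of these built-in functions on a hypothetical vector (the object name numbers and its values are made up for illustration). args() is used here as a quick, non-interactive way to list the arguments of a function. The help page itself opens with ?round.

```r
# A hypothetical vector of even numbers from 2 to 10
numbers <- seq(from = 2, to = 10, by = 2)

min(numbers)   # smallest value: 2
max(numbers)   # largest value: 10
sum(numbers)   # 2 + 4 + 6 + 8 + 10 = 30
mean(numbers)  # 30 / 5 = 6

# Show which arguments round() accepts (see ?round for the full help page)
args(round)
```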
Functions can also be coded/created by a programmer themselves. This can
for example be useful if some longer code needs to be repeated multiple
times.
Create a sequence of numbers from 12 to 24, by using the function seq().
Sum the numbers from 28 to 63 by using the sum()-function.
Find the mean of the numbers from 25 to 82.
There are several ways to read data into R. One option is to use the readr package that comes with the tidyverse distribution. The function read_csv() reads comma-delimited files.
Download the file “flightdata.csv” from the course page and store it in your project folder. This file contains a sample from the “flights” dataset from the nycflights13 package, which contains airline data for all flights departing from NYC in 2013. Note that you have to assign the data to an object when reading it into R.
Read the flightdata.csv file into R with the readr package using the code below.
flight_data <- read_csv("flightdata.csv") # Imports the data
flight_data # View the data
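If the course file is not at hand, the same call can be tried on a small stand-in file. The sketch below writes a made-up three-row csv file to a temporary location and reads it back. The column names and values are hypothetical and are not the contents of flightdata.csv.

```r
library(readr)

# Write a tiny, made-up csv file to a temporary location
example_path <- tempfile(fileext = ".csv")
writeLines(c("carrier,distance,air_time",
             "AA,1000,150",
             "UA,500,80",
             "DL,750,110"),
           example_path)

# read_csv() returns a tibble; assign it to an object to keep it
example_data <- read_csv(example_path)

# Inspect the imported data
example_data
```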
To get other types of data into R, the tidyverse packages listed below are recommended.
haven reads SPSS, Stata, and SAS files
readxl reads Excel files (.xls and .xlsx)
There are different functions to summarise data, but the base R function summary() works well too.
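A minimal sketch of what summary() reports, using a small made-up data frame instead of the flight data (the object name example_data and its columns are hypothetical):

```r
# A small, made-up data frame (not the flight data)
example_data <- data.frame(
  distance = c(500, 750, 1000, 1250),
  carrier  = c("AA", "UA", "DL", "AA")
)

# For numeric columns summary() reports the minimum, quartiles, median,
# mean, and maximum; for character columns it reports length and type
summary(example_data)

# summary() also works on a single column
distance_summary <- summary(example_data$distance)
distance_summary
```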
Apply the summary() function to the data.
Sometimes we need to add new columns that are functions of existing columns, and mutate() does this.
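As a sketch of how mutate() behaves, the example below uses a made-up data frame of orders. The names orders, price, quantity, and total are hypothetical and unrelated to the flight data.

```r
library(dplyr)

# A small, made-up data set
orders <- data.frame(
  price    = c(2.50, 4.00, 1.25),
  quantity = c(4, 1, 8)
)

# mutate() adds the new column and returns the whole data set;
# the result has to be assigned to an object to keep it
orders2 <- mutate(orders, total = price * quantity)

orders2
```

The original object orders is left unchanged, which is why the result is stored under a new name.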
Add the column speed = distance / air_time * 60 with mutate(). Store the adjusted flight_data dataset under a new name flight_data2.
You might get a data set with more variables than you need. In this case, it is useful to narrow it down to just the variables you will be working with. select() can be used for this.
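A brief sketch of select() on a made-up data frame (all object and column names below are hypothetical):

```r
library(dplyr)

# A small, made-up data set with four columns
orders <- data.frame(
  id       = 1:3,
  price    = c(2.50, 4.00, 1.25),
  quantity = c(4, 1, 8),
  note     = c("a", "b", "c")
)

# Keep only the named columns
orders_small <- select(orders, id, price)

# A minus sign drops a column instead
orders_no_note <- select(orders, -note)

names(orders_small)     # "id" "price"
names(orders_no_note)   # "id" "price" "quantity"
```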
Use the select() function to keep a subset of the variables from flight_data2 and store it under flight_data3.
Sometimes when coding, you want to repeat the same code multiple times for different pieces of data, or you want to repeat the same action multiple times. For example, you could have a situation where you draw 10 random numbers and calculate their mean, and then want to repeat this action multiple times.
When repeating something multiple times (also called having iterations of something), you could use loops in R. Loops are pieces of code that are repeated a set number of times. In this practical, we discuss the for-loop.
for-loops
for-loops repeat the given loop for the number of elements in a provided sequence or vector. The following code shows how we loop over the numbers 1 to 10. Running this code provides the third power of each of the numbers from 1 to 10.
# Defining the loop
for(i in seq(1, 10)){
  # action you want to repeat, in this case each number to the power 3.
  print(i^3)
}
Note that for-loops always have the form described below. When using a for-loop, pay attention to the parentheses and brackets.
for(<NAME_FOR_ELEMENT_IN_LOOP> in <SEQUENCE_OR_VECTOR>){
  <WANTED ACTIONS>
}
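The situation mentioned earlier, drawing 10 random numbers and taking their mean several times, could be sketched with a for-loop as follows. The number of repetitions (5), the use of rnorm() for the random draws, and the object names are assumptions made for illustration.

```r
set.seed(123)  # makes the random draws reproducible

n_iterations <- 5                       # hypothetical number of repetitions
sample_means <- numeric(n_iterations)   # vector of zeros to store the results

for (i in seq_len(n_iterations)) {
  draws <- rnorm(10)               # draw 10 random numbers (standard normal)
  sample_means[i] <- mean(draws)   # store the mean of this iteration
}

sample_means
```

Storing each result at position i keeps all means available after the loop has finished, rather than only printing them.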
Create a for-loop that iterates over the numbers 1 to 12 and for each number takes the third power and divides that number by 13. Then print the output for each number.
As an alternative to loops, apply statements can be used to apply the same function to a list or vector of elements. For example, you can compute the mean of every column in your data set by using an apply statement.
The apply statements are several similar statements that are useful in different situations. Some examples are apply(), sapply(), lapply(), and mapply(). To learn the exact differences between the statements, please read the function documentation (e.g. ?apply()).
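One difference that is easy to see in practice is the type of output: lapply() returns a list, while sapply() tries to simplify the result to a vector. A minimal sketch with a made-up list (the name scores and its contents are hypothetical):

```r
# A made-up list with three numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# lapply() always returns a list
means_list <- lapply(scores, mean)

# sapply() simplifies the result to a named vector where possible
means_vector <- sapply(scores, mean)

means_list
means_vector   # a = 2, b = 15, c = 5
```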
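The standard apply() function has the input arguments X (the data, such as a matrix), MARGIN (1 to apply the function over rows, 2 to apply it over columns), and FUN (the function to apply).
Below, an example is provided where we want to calculate the mean for each column of a data matrix of 9 by 9 cells.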
# Create a 9 by 9 cell matrix with numbers 1 to 81
data <- matrix(1:81, nrow = 9, ncol = 9)
# apply with the input the data, margin and function
apply(X = data, MARGIN = 2, FUN = mean)
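The MARGIN argument decides whether the function is applied over rows or over columns. The sketch below contrasts the two on a small 2 by 3 matrix; the object names and the choice of sum() are for illustration only.

```r
# A small 2 by 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
small <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1 applies the function to each row
row_sums <- apply(X = small, MARGIN = 1, FUN = sum)

# MARGIN = 2 applies the function to each column
col_sums <- apply(X = small, MARGIN = 2, FUN = sum)

row_sums   # 9 12
col_sums   # 3 7 11
```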
Calculate the variance (use var()) of each row of an 8 by 8 matrix with numbers 1 to 64.
After making changes to data frames, or after creating output in the form of a data frame, you might want to save your data in a new file. For example, after pre-processing your data for the analysis, you might want to save a pre-processed version in addition to the version with raw data.
The write_csv() function saves a data frame as a csv file. The write_csv() function has two main arguments: the data frame to save, and the path to where you want the file to be located. Other options can also be specified, for example how to write missing values. Type ?write_csv to learn more about this function.
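As a sketch of the option for missing values, the example below writes a made-up data frame to a temporary file using the na argument of write_csv(). The object names, the values, and the choice to write missing values as empty cells are assumptions for illustration.

```r
library(readr)

# A made-up data frame with one missing value
example_data <- data.frame(id = 1:3, score = c(7.5, NA, 9))

# Write to a temporary file; na = "" writes missing values as empty cells
# (by default they are written as the string "NA")
example_path <- tempfile(fileext = ".csv")
write_csv(example_data, example_path, na = "")

# Show the raw lines of the file that was written
written_lines <- readLines(example_path)
written_lines
```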
Write flight_data3 to a file using the write_csv() function.
write_csv(flight_data3, "flight_data3.csv")
# As we work in an Rproject, the .csv file is automatically stored within the project folder
This concludes the practical for this week. Don’t forget to hand in your work on this practical! Find instructions for this on the course website!