In nine weeks, you will learn the basics of data handling with R and details about regression techniques in the context of statistical inference. We will also cover the connection between these concepts and research philosophy. During every lecture, we will cover a different theoretical topic. In addition to the lectures, there will also be a weekly computer lab exercise that connects the statistical theory to practice. You will also attend weekly workgroup meetings wherein you will work on solving motivating, real-world case studies.
The final grade is computed as follows:
Grade Component | Weight |
---|---|
Group assignment 1: Linear regression | 25% |
Group assignment 2: Logistic Regression | 25% |
Written Exam | 50% |
In addition to the grade components listed above, you will also do R exercises for the first 7 weeks of the course. These exercises will develop the skills needed to successfully complete the assignments.
To pass the course:
During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions, and hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.
We will use two open-source books in this course:
There is no need to purchase these books. The freely available online versions are sufficient. The relevant chapters will be linked in this dashboard where the reading is assigned. We will also use several external webpages and web apps. These resources will also be linked in this dashboard.
Week # | Topic | R Exercise | Workgroup | Reading |
---|---|---|---|---|
1 | The basics of R | How to work with R via scripts, projects, and markdown; How to import external data into R; How to write your own functions; How to iterate repetitive tasks | Form groups; Search for a dataset for the two group assignments; Formulate research questions | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
2 | Programmatic data manipulation 1 | Data types and objects in R; Data transformation; Working with pipes | Perform data transformations on your found dataset | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
3 | Programmatic data manipulation 2 | Data visualization; Data inspection; Data cleaning | Continue with data inspection and cleaning | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
4 | Multiple linear regression | Estimating linear models in R using the lm() function; Model fit and model comparison; Categorical predictors; Moderation | Find a best fitting model; Test your hypotheses | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1, excluding 16.1.4, and 16.2) |
5 | Model assumptions and diagnostics | Assumptions of the linear model; Leverage, outliers, and influential cases | Check assumptions of your model and inspect for unusual observations; Make adjustments if necessary; Draw conclusions; Submit Assignment 1 | ASWR: Chapter 13 |
6 | Generalized linear model and logistic regression | Estimating generalized linear models using the glm() function in R; Definition, estimation, and interpretation of logistic regression models | Perform data inspection and cleaning for the second assignment; Formulate hypothesis; Find a best fitting model and test your hypotheses | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
7 | Logistic regression assumptions and classification | Logistic regression assumptions; Classification; Confusion matrix | Check the assumptions of your model and make adjustments if necessary; Make classifications | ASWR: Chapter 17 (only 17.4); This webpage |
8 | Summary, catch-up, and questions | - | Interpret your final model as well as the confusion matrix; Draw conclusions; Submit Assignment 2 | - |
Regression techniques are widely used to quantify the relationship
between two or more variables. In data science, linear and logistic
regression are common and powerful techniques for evaluating such
relations. These techniques are only useful, however, once you
understand when and how to apply them. In this course, students will
learn how to apply linear and logistic regression with the R
statistical software package.
This course will introduce students to the principles of analytical
data science, linear and logistic regression, and the basics of
statistical learning. Students will develop fundamental R
programming skills and gain experience with the tidyverse: visualizing
data with ggplot2 and performing basic data wrangling with dplyr. This
course helps prepare students for an entry-level research career
(e.g., junior researcher or research assistant) or further education in
research (e.g., a [research] Master's program or a PhD).
At the end of this course, students are able to:
- Use the R statistical software platform to perform basic statistical programming, data manipulation, data visualization, and basic data wrangling.
- Use the R statistical software platform to perform, interpret, and evaluate linear and logistic regression analyses on real-world data.
- Interpret R output and use the results to answer research questions.
- Use R Markdown to document the results of a statistical analysis.
In this course, skills and knowledge are evaluated with two types of assignment.
In eight weeks, you will learn the basics of data handling and statistical programming with R and details about regression techniques in the context of statistical inference, prediction, and classification. Each week will comprise three class activities:
During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions, and hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.
Type of assignment: Group (4 students)
Grading: 25% of your final grade
Deadline: Monday December 16, 17:00
What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)
Where to submit: This Surfdrive folder
Description: For this assignment, you perform and report a multiple linear regression analysis in an R markdown document. The assignment will be graded on the following five dimensions:
Type of assignment: Group (4 students)
Grading: 25% of your final grade
Deadline: Thursday January 16, 17:00
What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)
Where to submit: This Surfdrive folder
Description: For this assignment, you perform and report a multiple logistic regression analysis in an R markdown document. The assignment will be graded on the following five dimensions:
This semester, you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. In this course, you will use both R and RStudio. The steps below will guide you through installing both R and RStudio. Please do so before the first meeting.
Bring a laptop computer to the course, and make sure that you have full write access and administrator rights on the machine. We will explore programming and compiling in this course, so you will need full access to your machine. Some corporate laptops come with limited access for their users, so I advise you to bring a personal laptop to the workgroup meetings.
R
You can obtain a copy of R here. We won’t use R directly in the course. Rather, we’ll call R through RStudio. Therefore, you also need to install RStudio.
RStudio Desktop
RStudio is an Integrated Development Environment (IDE) for R. You can download RStudio as stand-alone software here. The free and open-source RStudio Desktop version is sufficient.
Open RStudio, and copy-paste the following lines of code into the console window to execute them.
# Install the packages used in this course, along with their dependencies
install.packages(
  c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan", "lme4",
    "knitr", "rmarkdown", "plotly", "devtools", "class", "car", "MASS",
    "ISLR", "mice"),
  dependencies = TRUE
)
If you are not sure where to paste the code, use the following figure to identify the console:
When you are asked the following:
Type Yes in the console, and press the “Enter/Return” key (or click the corresponding button if the question presents as a dialog box).
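Once the installation has finished, you may want to confirm that every package is actually available. The following is only a minimal sketch and not part of the official installation steps; the object names `course_pkgs` and `missing_pkgs` are illustrative.

```r
# Packages requested by the installation command above.
course_pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
                 "lme4", "knitr", "rmarkdown", "plotly", "devtools", "class",
                 "car", "MASS", "ISLR", "mice")

# installed.packages() returns a matrix with one row per installed package;
# its row names are the package names.
missing_pkgs <- setdiff(course_pkgs, rownames(installed.packages()))

# An empty result means that all course packages were found.
print(missing_pkgs)
```

If `missing_pkgs` is not empty, you can rerun `install.packages()` for just those packages.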
If the suggested steps fail, or you have insufficient rights on your machine, you can use the following web-based solutions.
- Open a free account on posit.cloud to use an RStudio environment there.
- Use Utrecht University’s MyWorkPlace to access R and RStudio there. When you start a new MyWorkPlace session, you may need to (re)install packages.
Naturally, you will need internet access to use these services.
R
To familiarize yourself with basic R usage, complete the following exercise before the first lecture. This exercise will get you started with R and RStudio. You can also have a look at the Posit website for more detailed tutorials.
Suggested reading:
We expect you to be familiar with some basic statistical concepts such as:
To refresh your memory, you can have a look at the material below. Note that these topics are background knowledge for this course. The course material builds on this knowledge.
From ASWR: Chapter 5 (only 5.1 and 5.2) and Chapter 7 (only 7.1 and 7.2).
Furthermore, you may benefit from exploring the following shiny apps:
This week, we’ll cover some fundamentals of R programming.
- Working with R via scripts and R Markdown
- Importing data into R
- Writing R functions
- Iterating in R scripts
You can find the lecture slides here.
These readings are exam materials.
Well organized, tidy code is easier to debug, maintain, and understand. The tidyverse style guide provides a useful set of formatting conventions that you can apply to your own code.
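To make this week's topics of writing functions and iterating over repetitive tasks more concrete, here is a small sketch written in a tidy style. It is only an illustration and is not taken from the practical: the function name `standardize` and the choice of the built-in `mtcars` data are assumptions.

```r
# Standardize a numeric vector to a mean of 0 and a standard deviation of 1.
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Variables to process; these columns exist in the built-in mtcars dataset.
vars <- c("mpg", "hp", "wt")

# Iterate over the selected columns instead of copy-pasting the same code.
z_scores <- lapply(mtcars[vars], standardize)

# Check the result: each standardized variable should have a mean of about 0.
z_means <- vapply(z_scores, mean, numeric(1))
print(round(z_means, 10))
```

Wrapping the repeated calculation in a function means that a later change (e.g., different handling of missing values) only has to be made in one place.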
In this week’s workgroup meeting, we discuss the assignments and expectations for your work in this course. You will form groups and decide which research questions you will answer for Assignments 1 and 2. You will search for a dataset to use in the two group assignments and think of possible research questions.
You can find the slides for this workgroup meeting here.
Deadline:
Email the following information to your workgroup instructor before the end of the workgroup meeting.
R Practical
NOTE: Please read the Preparation page before starting with these practical exercises.
Name your Rmd and HTML files your_name_1.Rmd and your_name_1.html, respectively, where your_name is your full name in lower snake case, and the 1 indicates Practical 1.
Answers
You can find suggested answers to the practical below. We provide these answer files for two reasons:
Even though you have the solutions available, we strongly encourage you to seriously attempt answering each question in the exercises before checking the solutions.
This week, we start looking more closely at programmatic data manipulation in R.
- R objects and data types
You can find the lecture slides here.
These readings are exam materials.
In this week’s meeting, you refine your research questions and perform any necessary manipulations to the variables in your dataset.
You can find the slides for this workgroup meeting here.
R practical
This week’s R practical is about R objects and data types, performing basic data manipulations, and working with pipes.
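As a rough illustration of these three topics, the sketch below inspects data types, transforms a few variables, and chains the steps with a pipe. It is not the practical's own code: it assumes the built-in `mtcars` data, and it uses base R's native pipe `|>` (available from R 4.1) rather than the magrittr pipe so that it runs without extra packages.

```r
# Inspect the type and class of objects.
typeof(mtcars$mpg)  # storage type of a single column
class(mtcars)       # class of the whole object

# Chain several data manipulations with the native pipe.
small_cars <- mtcars |>
  transform(
    # Approximate conversion from miles per gallon to kilometers per liter.
    kpl = mpg * 0.425,
    # Recode the 0/1 transmission indicator as a labelled factor.
    am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  ) |>
  subset(cyl == 4, select = c(mpg, kpl, am))

# Check the structure of the result.
str(small_cars)
```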
Name your Rmd and HTML files your_name_2.Rmd and your_name_2.html, respectively, where your_name is your full name in lower snake case, and the 2 indicates Practical 2.
Answers
You can find suggested answers to the practical exercises here:
This week, we continue with our discussion of programmatic data processing.
ggplot2
You can find the lecture slides below.
Note: In the live lecture, we covered the first 15 minutes of the following video.
These readings are exam materials.
Below, you can find links to some useful resources.
In today’s workgroup, you will continue with inspecting and cleaning your group’s chosen dataset.
You can find the slides for this workgroup meeting here.
R practical
This week’s R practical will cover data visualization, inspection, and cleaning as well as writing functions in R.
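The sketch below shows what a first pass at data inspection and cleaning might look like. It is an assumption-laden illustration, not the practical itself: it uses the built-in `airquality` data and base R only, and it leaves out plotting (which the course does with ggplot2). Dropping incomplete rows is just one simple cleaning choice, and it is often too crude for real analyses.

```r
# Inspect the structure and the distribution of each variable.
str(airquality)
summary(airquality)

# Count the missing values per variable.
n_missing <- colSums(is.na(airquality))
print(n_missing)

# One simple cleaning step: keep only the rows without missing values.
air_clean <- airquality[complete.cases(airquality), ]

# Recode Month from a number (5 to 9) to a labelled factor.
air_clean$Month <- factor(air_clean$Month, levels = 5:9,
                          labels = month.name[5:9])

# Flag potentially unusual Ozone values using the boxplot rule.
ozone_outliers <- boxplot.stats(air_clean$Ozone)$out
print(ozone_outliers)
```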
Name your Rmd and HTML files your_name_3.Rmd and your_name_3.html, respectively, where your_name is your full name in lower snake case, and the 3 indicates Practical 3.
Answers
You can find suggested answers to the practical exercises here:
This week, we move on from R programming and begin our discussion of linear modeling.
- Estimating linear models in R using the lm() function
You can find the lecture slides here.
These readings are exam materials.
In this week’s workgroup, you will continue to work with your group on Assignment 1. In particular, you will build your multiple linear regression model. You will build up your final model in steps:
Finally, you will interpret the results of your optimal model.
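A stepwise model-building workflow of this kind could look roughly like the sketch below. The variables, the order of the steps, and the use of the built-in `mtcars` data are all assumptions made for illustration; your own steps should follow from your research questions.

```r
# Work on a copy of the built-in data and code transmission as a factor.
cars <- mtcars
cars$am_f <- factor(cars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Step 1: a baseline model with a single predictor.
fit1 <- lm(mpg ~ wt, data = cars)

# Step 2: add a second continuous predictor.
fit2 <- lm(mpg ~ wt + hp, data = cars)

# Step 3: add a categorical predictor and a moderation (interaction) effect.
fit3 <- lm(mpg ~ wt + hp + am_f + wt:am_f, data = cars)

# Compare the nested models with F-tests for the change in fit.
print(anova(fit1, fit2, fit3))

# Collect R-squared for each model and compute the change between steps.
r2 <- sapply(list(fit1, fit2, fit3), function(fit) summary(fit)$r.squared)
print(round(r2, 3))
print(round(diff(r2), 3))

# Interpret the coefficients of the final model.
summary(fit3)
```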
You can find the slides for this workgroup meeting here.
R practical
This week’s R practical is about linear regression and model comparison. The practical also includes more practice with data visualization.
Name your Rmd and HTML files your_name_4.Rmd and your_name_4.html, respectively, where your_name is your full name in lower snake case, and the 4 indicates Practical 4.
You can find an additional R code demonstration script here.
Answers
You can find suggested answers to the practical here:
This week, we wrap up our discussion of the linear model by considering how we can check if our model results are trustworthy.
You can find the lecture slides here.
Below, you can find links to some useful resources.
In this week’s workgroup, you will:
You can find the slides for this workgroup meeting here.
You must submit Assignment 1 by Monday December 16, 17:00.
R practical
This week’s R practical guides you through the various assumptions of the linear model as well as checks for outliers and influential cases.
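As a hedged sketch of such checks, the code below computes common case diagnostics for a linear model. The model and the built-in `mtcars` data are assumptions for illustration, and the cutoff used to flag high leverage (twice the average leverage) is only a rule of thumb, not a rule prescribed by this course.

```r
# Fit an example model (illustrative choice of variables).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals and fitted values, e.g., for checking linearity and
# constant residual variance.
res <- resid(fit)
fitted_vals <- fitted(fit)

# Leverage: how unusual each case is in terms of its predictor values.
lev <- hatvalues(fit)

# Externally studentized residuals: how poorly each case is predicted.
stud <- rstudent(fit)

# Cook's distance: how much each case influences the fitted model.
cook <- cooks.distance(fit)

# Flag cases with more than twice the average leverage (rule of thumb).
high_leverage <- names(lev)[lev > 2 * mean(lev)]
print(high_leverage)

# Show the five cases with the largest Cook's distance.
print(head(sort(cook, decreasing = TRUE), 5))
```

Calling `plot(fit)` produces the standard diagnostic plots for the same model.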
Name your Rmd and HTML files your_name_5.Rmd and your_name_5.html, respectively, where your_name is your full name in lower snake case, and the 5 indicates Practical 5.
Answers
You can find suggested answers to the practical here:
This week’s lecture covers the generalized linear model and the basics of logistic regression.
You can find the lecture slides here.
These readings are exam materials.
In this week’s workgroup, you will start working on Assignment 2.
You can find the slides for this workgroup meeting here.
R practical
This week’s R practical guides you through the basics of logistic regression analyses.
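A minimal logistic regression analysis might look like the sketch below. The outcome, predictor, and the built-in `mtcars` data are assumptions chosen for illustration, and the object names are hypothetical.

```r
# Model the probability of a manual transmission (am = 1) from car weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds (logit) scale.
summary(fit)

# Exponentiate the coefficients to obtain odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Predicted probabilities for two hypothetical cars
# (wt is measured in units of 1000 lbs).
new_cars <- data.frame(wt = c(2.5, 3.5))
pred_prob <- predict(fit, newdata = new_cars, type = "response")
print(pred_prob)
```

An odds ratio below 1 for `wt` means that the odds of a manual transmission decrease as weight increases.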
Name your Rmd and HTML files your_name_6.Rmd and your_name_6.html, respectively, where your_name is your full name in lower snake case, and the 6 indicates Practical 6.
Answers
You can find suggested answers to the practical here:
This week, we will cover the assumptions of logistic regression and how to evaluate classification performance via confusion matrices.
You can find the lecture slides here.
These readings are exam materials.
In this week’s workgroup, you will continue to work on Assignment 2. You will check the assumptions of your model and make adjustments, if necessary. Also, you will interpret the confusion matrix of your model.
You can find the slides for this workgroup meeting here.
R practical
This week’s R practical guides you through the process of checking the assumptions of the logistic regression model and evaluating classification performance.
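To illustrate how classification performance can be evaluated, the sketch below builds a confusion matrix by hand. The model, the built-in `mtcars` data, and the 0.5 classification threshold are assumptions for illustration; other thresholds may suit your research question better.

```r
# Fit an example logistic regression model.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities for the observed cases.
pred_prob <- predict(fit, type = "response")

# Classify each case using a 0.5 threshold (an illustrative choice).
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
observed <- factor(mtcars$am, levels = c(0, 1))

# Cross-tabulate predicted against observed classes.
conf_mat <- table(Predicted = pred_class, Observed = observed)
print(conf_mat)

# Accuracy: proportion of all cases classified correctly.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: proportion of observed 1s that were predicted as 1.
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])

# Specificity: proportion of observed 0s that were predicted as 0.
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))
```

Note that this evaluates the model on the same data used to fit it, which tends to give an optimistic picture of its performance.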
Name your Rmd and HTML files your_name_7.Rmd and your_name_7.html, respectively, where your_name is your full name in lower snake case, and the 7 indicates Practical 7.
Answers
You can find suggested answers to the practical here:
In this week’s lecture, we will wrap up the course, and I’ll give an overview of the main points we’ve covered.
You can find the lecture slides here.
In this week’s workgroup, you will finalize your second group project. First, you will confirm that you have performed all steps as discussed last week. If you have time left, you can fine-tune the interpretation of your results and polish the figures and tables in your markdown document.
You must submit Assignment 2 by Thursday January 16, 17:00.
R practical
There is no R practical this week. Use your time to finish the second assignment and prepare for the exam.
You can find the practice exam here.
Anything mentioned in the lectures may appear on the exam.
Anything covered in the required readings may appear on the exam.
Week # | Topic | Reading |
---|---|---|
1 | The basics of R | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
2 | Programmatic data manipulation 1 | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
3 | Programmatic data manipulation 2 | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
4 | Multiple linear regression | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1—not 16.1.4—and 16.2) |
5 | Model assumptions and diagnostics | ASWR: Chapter 13 |
6 | Generalized linear model and logistic regression | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
7 | Logistic regression assumptions and classification | ASWR: Chapter 17 (only 17.4); This webpage |
This is not a math class; we are not trying to test your ability to do calculations or manipulate equations. That being said, a certain degree of mathematical literacy is crucial to statistics and data science, so you will have to do some simple calculations on the exam. For example, you should be comfortable with the following.
Of course, you should also be able to do basic arithmetic operations that are too trivial to detail here (e.g., calculating the difference between the \(R^2\) statistics from two models that you are trying to compare).
Note: Although all examples above are shown in terms of simple linear regression models, you should also be able to do these calculations/interpretations using multiple linear regression models and models that include dummy codes and interactions.
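The sketch below gives a sense of the scale of calculation involved. All numbers are hypothetical and chosen only for illustration; they do not come from the course materials or the exam.

```r
# Hypothetical estimates from a simple linear regression: y = b0 + b1 * x.
b0 <- 2.0
b1 <- 0.5

# Predicted outcome for a case with x = 10.
x_new <- 10
y_hat <- b0 + b1 * x_new
print(y_hat)

# Hypothetical extension with a dummy code d and an interaction:
# y = b0 + b1 * x + b2 * d + b3 * x * d.
b2 <- 1.0
b3 <- 0.2

# Predicted outcome for a case in the group coded d = 1.
d_new <- 1
y_hat_d1 <- b0 + b1 * x_new + b2 * d_new + b3 * x_new * d_new
print(y_hat_d1)

# Difference between the R-squared statistics of two hypothetical models.
r2_small <- 0.38
r2_large <- 0.45
r2_change <- r2_large - r2_small
print(r2_change)
```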
If any of the course materials confuse you, feel free to ask about it during the final lecture meeting (even if your concerns relate to content from earlier weeks).
We will devote part (most?) of the final lecture to a dedicated Q&A session.