Practical 6

We will use the following packages in this practical:
library(dplyr)
library(magrittr)
library(ggplot2)
library(foreign)
library(kableExtra)
library(janitor)
library(readr)
In this practical, you will perform regression analyses using glm() and inspect variables by plotting them with ggplot().
Logistic regression is a supervised learning algorithm that classifies data into categories by predicting the probability that an observation falls into a particular category based on its features. In this tutorial we will consider binary classification, where we determine which of two categories a data point belongs to.
The logistic function can be described as:
\[ P(y = 1|X) = \text{sigmoid}(z) = \frac{1}{1+e^{-z}} \]

where

\[ z = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \dots + \hat{\beta_k}x_k \]

\(z\) is like the linear predictor in linear regression, but it is transformed by the sigmoid function so that the result can be interpreted as a probability (between 0 and 1). The probability is compared to a threshold to determine which class \(y\) belongs to based on \(X\). You can choose what this threshold is, and it is context dependent. For example, if you are predicting the chances of recovery in a clinical trial you might set a very high threshold of 0.90. A common threshold for low-stakes research is 0.50.
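As a quick illustration, the sigmoid transformation and a classification threshold take only a few lines of R (the coefficient and predictor values below are made up):

sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any real z to (0, 1)

z <- -1.5 + 0.8 * 2       # a made-up linear predictor: beta_0 + beta_1 * x_1
p <- sigmoid(z)           # predicted probability that y = 1
p

as.numeric(p > 0.50)      # classify using a 0.50 threshold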
The glm() function can be used to specify several different models, including the logistic regression model. The logistic regression model is specified by setting the family argument to “binomial”. You can save a model in an object and request summary statistics with the summary() command.
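A minimal sketch of the syntax (mydata and the variable names here are placeholders, not part of this practical):

model <- glm(outcome ~ predictor1 + predictor2,  # placeholder variable names
             family = "binomial",
             data   = mydata)                    # placeholder data frame
summary(model)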
For logistic regression, it is important to know and check which category the predicted probabilities refer to, so that you can interpret the model and its coefficients correctly. If your outcome variable is coded as a factor, the glm() function predicts the second level, which by default is the later one alphabetically. For example, if the categories are coded as 0 and 1, the model predicts the probability of belonging to “1”.
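You can check which level is predicted with levels(): for a two-level factor, the second level listed is the one whose probability is modelled. A small sketch:

y <- factor(c("no", "yes", "yes", "no"))
levels(y)                      # "no" "yes": P(yes) is what the model predicts

y2 <- relevel(y, ref = "yes")  # change the reference level if needed
levels(y2)                     # "yes" "no": now P(no) would be predicted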
When a model is stored in an object you can ask for the coefficients (model$coefficients), the predicted probabilities of belonging to the ‘higher’ category (model$fitted.values), and the AIC (model$aic). To investigate all additional model information stored in the object, inspect the model by selecting it in the environment list, or run str() on it.
Before we get started with logistic modelling it helps to understand how odds, log-odds, and probability are related. Essentially, they are all just different expressions of the same thing, and converting between them involves simple formulas.
The glm() function returns coefficients as log-odds by default. Most of us find it difficult to think in terms of log-odds, so instead we convert them to odds (or odds ratios) using the exp() function. If we want to go from odds to log-odds, we just take the logarithm using log().
The odds express how likely an event is relative to it not happening, and are defined as \(Odds = \frac{P}{1-P}\), where \(P\) is the probability of an event happening and \(1-P\) is the probability that it does not happen. For example, if we have an 80% chance of a sunny day, then we have a 20% chance of a rainy day. The odds would then equal \(\frac{.80}{.20} = 4\), meaning the odds of a sunny day are 4 to 1.
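Translating between the three scales in R:

p <- 0.80              # probability of a sunny day
odds <- p / (1 - p)    # odds = P / (1 - P) = 4
log_odds <- log(odds)  # log-odds, the scale glm() reports coefficients on

exp(log_odds)          # back from log-odds to odds
odds / (1 + odds)      # back from odds to probability: 0.80

Let’s consider this further with an example.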
The code below creates a data frame called data with a column conc giving different concentrations of the peptide-C protein that were tested for their ability to inhibit the flow of current across a membrane. The yes column contains counts of trials at each concentration where inhibition occurred, and the no column contains counts of trials where it did not.
data <- data.frame(conc = c(0.1, 0.5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
                   no   = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
                   yes  = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1, 7))
data
Add two new columns to the data:

- the total number of trials (no and yes trials for each row)
- the odds of inhibition (yes vs no)

Inspect the new columns. Do you notice anything unusual?
Add a new column to your dataset containing the corrected log-odds. You can compute the values for this column using the following formulation of the log-odds:
\[ log(odds) = log(\frac{yes + 0.5} {no + 0.5}) \]
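One way to compute these columns with dplyr (the column names total, prop, odds, and log_odds are suggestions, not prescribed by the practical):

data <- data %>%
  mutate(total    = no + yes,                       # total trials per row
         prop     = yes / total,                    # proportion of successes
         odds     = yes / no,                       # raw odds: 0 where yes = 0
         log_odds = log((yes + 0.5) / (no + 0.5)))  # corrected log-odds
data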
Specify a logistic regression model where:

- prop is the outcome
- conc is the only predictor

Interpret the slope estimate.
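A sketch of one way to specify this model, assuming the prop and total columns from the previous step (the object name fit_conc is just illustrative). When the outcome is a proportion, glm() needs to know how many trials lie behind each proportion, which is passed via the weights argument:

fit_conc <- glm(prop ~ conc,
                family  = binomial,
                data    = data,
                weights = total)  # number of trials behind each proportion
summary(fit_conc)

# The slope is on the log-odds scale; exponentiate for an odds ratio per
# one-unit increase in conc
exp(coef(fit_conc))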
You will work with the titanic data set, which you can find in the surfdrive folder, containing information on the fate of passengers on the infamous voyage.
The data set contains the following variables:

- Survived: this is the outcome variable that you are trying to predict, with 1 meaning a passenger survived and 0 meaning they did not
- Pclass: this is the ticket class the passenger was travelling in, with 1, 2, and 3 representing 1st, 2nd and 3rd class respectively
- Age: this is the age of the passenger in years
- Sex: this is the sex of the passenger, either male or female

Read in the data from the “titanic.csv” file, selecting only the variables Survived, Pclass, Sex and Age. If necessary, correct the class of the variables.
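A possible way to do this with readr and dplyr, assuming titanic.csv is in your working directory:

titanic <- read_csv("titanic.csv") %>%
  select(Survived, Pclass, Sex, Age) %>%
  mutate(Pclass = factor(Pclass),  # ticket class is categorical, not numeric
         Sex    = factor(Sex))
str(titanic)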
What relationships do you expect to find between the predictor variables and the outcome?
Investigate how many passengers survived in each class. You can do this visually by creating a bar plot, or by using the table() function. Search ??table for more information.
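For example, both approaches in a few lines:

# Counts of survivors and non-survivors per ticket class
table(titanic$Pclass, titanic$Survived)

# The same information as a bar plot
titanic %>%
  ggplot(aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "dodge")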
Similarly, investigate the relationship between survival and sex by creating a bar plot and a table.
Investigate the relationship between age and survival by creating a histogram of the age of survivors versus non-survivors.
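For instance, overlaying the two distributions with ggplot2 (the binwidth here is an arbitrary choice):

# Age distributions of survivors versus non-survivors
titanic %>%
  ggplot(aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 5)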
Specify a logistic regression model where “Survived” is the outcome and “Sex” is the only predictor.
What does the intercept mean? What are the odds and what are the log-odds of survival for males?
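A sketch of the model (the object name fit_sex is just illustrative):

fit_sex <- glm(Survived ~ Sex, family = binomial, data = titanic)
summary(fit_sex)

# Intercept: log-odds of survival for the reference category (female,
# alphabetically first); exponentiate to get odds and odds ratios
exp(coef(fit_sex))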
Specify a logistic regression model where “Survived” is the outcome and “Pclass” is the only predictor.
Which category is the reference group? What are their odds of survival?
What are the chances of survival for 2nd and 3rd class passengers?
Save this model as you will come back to it later.
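A sketch, saved here as fit1 on the assumption that this is the fit1 compared against fit2 in the model-fit section below:

fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic)
summary(fit1)

# Odds of survival for the reference class, and odds ratios for the others
exp(coef(fit1))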
What does the intercept mean when there is a continuous predictor?
How are the odds and log-odds interpreted for a continuous predictor?
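These questions are easiest to explore with a concrete model; purely as an illustration, assume Age is the only predictor:

fit_age <- glm(Survived ~ Age, family = binomial, data = titanic)

coef(fit_age)       # intercept: log-odds of survival when Age = 0
exp(coef(fit_age))  # slope (exponentiated): odds ratio per one-year increase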
Specify a logistic regression model where Survived is the outcome and Pclass plus an interaction between Sex and Age are the predictors. Save this model as we will return to it later.
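A sketch, saved as fit2 to match the comparison below; note that Sex * Age expands to the two main effects plus their interaction:

fit2 <- glm(Survived ~ Pclass + Sex * Age,
            family = binomial, data = titanic)
summary(fit2)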
Model selection is an important step, and there are several metrics for assessing model fit to help us select the best performing model. We will use deviance and information criteria to compare the fit of two models you saved before: fit1 and fit2.
Deviance is a measure of the goodness-of-fit of a GLM, where lower deviance indicates a better fitting model. R reports two types of deviance: the null deviance, which describes how well the outcome is predicted by a model with only an intercept, and the residual deviance, which describes how well the outcome is predicted by the fitted model.
We can use the anova() function to perform an analysis of deviance that compares the difference in deviances between competing models.

Compare the deviance of model 1 and model 2 using anova() with test = "Chisq".

AIC is Akaike’s Information Criterion, a method for assessing model quality through comparison of related models. AIC is based on the deviance but introduces a penalty for more complex models. The number itself is not meaningful on its own, and it is only useful when comparing models against one another. Like deviance, the model with the lowest AIC is best.
Use the AIC() function to get the AIC value for model 1 and model 2.

BIC is the Bayesian Information Criterion and is very similar to AIC, but penalises a complex model more than the AIC would. Complex models will have a larger score, indicating worse fit. One difference from the AIC is that the probability of selecting the correct model with the BIC increases as the sample size of the training set increases.
Use the BIC() function to get the BIC value for model 1 and model 2.
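Putting the three comparisons together (assuming fit1 and fit2 from earlier):

# Analysis of deviance: does the larger model reduce the deviance significantly?
anova(fit1, fit2, test = "Chisq")

# Information criteria: lower values indicate the preferred model
AIC(fit1, fit2)
BIC(fit1, fit2)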
Which model should we proceed with?
Often with logistic regression we are interested in how well our model can predict the outcome. The predict() function allows us to do this, taking the model and some data as its main arguments. Additionally, you can specify whether you want predict() to give you predictions as log-odds or as probabilities.
Proceed using the model you selected in the previous question.
Use the predict() function to generate predicted probabilities for the multivariate logistic model. predict() takes the following arguments:

- object, i.e. the logistic model
- newdata, i.e. a data set where we want to predict the outcome (we will use titanic)
- type, i.e. "link" for log-odds or "response" for probabilities (we will use type = "response")
- se.fit, i.e. set to TRUE to estimate the standard error of the probabilities

Remember to save the output to an object.
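For example (preds is an illustrative name):

preds <- predict(fit2,
                 newdata = titanic,
                 type    = "response",  # probabilities instead of log-odds
                 se.fit  = TRUE)

head(preds$fit)     # predicted probabilities
head(preds$se.fit)  # their standard errors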
Add the predicted probabilities and standard errors to the data set.
Calculate the confidence intervals for the predicted probabilities and add them to the data.
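One way to do both steps, using a simple Wald-style 95% interval on the probability scale (note that such intervals can fall slightly outside 0 and 1):

titanic <- titanic %>%
  mutate(prob  = preds$fit,
         se    = preds$se.fit,
         lower = prob - 1.96 * se,  # approximate 95% lower bound
         upper = prob + 1.96 * se)  # approximate 95% upper bound
head(titanic)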
You are now at the end of this week’s practical. Next week, we will discuss model assumptions, model prediction, and the confusion matrix.