Practical 6

We will use the following packages in this practical:
library(dplyr)
library(magrittr)
library(ggplot2)
library(foreign)
library(kableExtra)
library(janitor)
library(readr)
In this practical, you will perform regression analyses using glm() and inspect variables by plotting them with ggplot().
Logistic regression is a supervised learning algorithm that classifies data into categories by predicting the probability that an observation falls into a particular category based on its features. In this tutorial we will consider binary classification, where we determine which of two categories a data point belongs to.
The logistic function can be described as:
\[ P(y = 1|X) = \text{sigmoid}(z) = \frac{1}{1+e^{-z}} \]

where

\[ z = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \dots + \hat{\beta_k}x_k \]

\(z\) is like the linear predictor in linear regression, but it is transformed by the sigmoid function so that the result can be interpreted as a probability (between 0 and 1). The probability is compared to a threshold to determine which class \(y\) belongs to based on \(X\). You can choose what this threshold is, and it is context dependent. For example, if you are predicting the chances of recovery in a clinical trial you might set a very high threshold of 0.90. A common threshold for low-stakes research is 0.50.
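As a quick illustration, the sigmoid transformation and a classification threshold take only a few lines of R (the coefficient and predictor values below are made up):

sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any real z to (0, 1)

z <- -1.5 + 0.8 * 2       # a made-up linear predictor: beta_0 + beta_1 * x_1
p <- sigmoid(z)           # predicted probability that y = 1
p

as.numeric(p > 0.50)      # classify using a 0.50 threshold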
The glm() function can be used to specify several different models, including the logistic regression model. The logistic regression model is specified by setting the family argument to “binomial”. You can save a model in an object and request summary statistics with the summary() command.
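A minimal sketch of the syntax (mydata and the variable names here are placeholders, not part of this practical):

model <- glm(outcome ~ predictor1 + predictor2,  # placeholder variable names
             family = "binomial",
             data   = mydata)                    # placeholder data frame
summary(model)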
For logistic regression, it is important to know and check which category the predicted probabilities refer to, so that you can interpret the model and its coefficients correctly. If your outcome variable is coded as a factor, the glm() function predicts the second level, which by default is the later one alphabetically. For example, if the categories are coded as 0 and 1, the model predicts the probability of belonging to “1”.
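You can check which level is predicted with levels(): for a two-level factor, the second level listed is the one whose probability is modelled. A small sketch:

y <- factor(c("no", "yes", "yes", "no"))
levels(y)                      # "no" "yes": P(yes) is what the model predicts

y2 <- relevel(y, ref = "yes")  # change the reference level if needed
levels(y2)                     # "yes" "no": now P(no) would be predicted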
When a model is stored in an object you can ask for the coefficients (model$coefficients), the predicted probabilities of belonging to the ‘higher’ category (model$fitted.values), and the AIC (model$aic). To investigate all additional model information stored in the object, inspect the model by selecting it in the environment list, or run str() on it.
Before we get started with logistic modelling it helps to understand how odds, log-odds, and probability are related. Essentially, they are all just different expressions of the same thing, and converting between them involves simple formulas.
The glm() function returns coefficients as log-odds by default. Most of us find it difficult to think in terms of log-odds, so instead we convert them to odds (or odds ratios) using the exp() function. If we want to go from odds to log-odds, we just take the logarithm using log().
The odds express how likely an event is relative to it not happening, and are defined as \(Odds = \frac{P}{1-P}\), where \(P\) is the probability of an event happening and \(1-P\) is the probability that it does not happen. For example, if we have an 80% chance of a sunny day, then we have a 20% chance of a rainy day. The odds would then equal \(\frac{.80}{.20} = 4\), meaning the odds of a sunny day are 4 to 1.
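Translating between the three scales in R:

p <- 0.80              # probability of a sunny day
odds <- p / (1 - p)    # odds = P / (1 - P) = 4
log_odds <- log(odds)  # log-odds, the scale glm() reports coefficients on

exp(log_odds)          # back from log-odds to odds
odds / (1 + odds)      # back from odds to probability: 0.80

Let’s consider this further with an example.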
The code below creates a data frame called data with a column conc giving different concentrations of the peptide-C protein that were tested for their ability to inhibit the flow of current across a membrane. The yes column contains counts of trials at each concentration where inhibition occurred, and the no column contains counts of trials where it did not.
data <- data.frame(conc = c(0.1, 0.5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
                   no   = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
                   yes  = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1, 7))
data
Add two new columns to the data:

- the total number of trials (no and yes trials for each row)
- the odds of inhibition (yes vs no)

Inspect the new columns. Do you notice anything unusual?
Add a new column to your dataset containing the corrected log-odds. You can compute the values for this column using the following formulation of the log-odds:
\[ log(odds) = log(\frac{yes + 0.5} {no + 0.5}) \]
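One way to compute these columns with dplyr (the column names total, prop, odds, and log_odds are suggestions, not prescribed by the practical):

data <- data %>%
  mutate(total    = no + yes,                       # total trials per row
         prop     = yes / total,                    # proportion of successes
         odds     = yes / no,                       # raw odds: 0 where yes = 0
         log_odds = log((yes + 0.5) / (no + 0.5)))  # corrected log-odds
data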
Specify a logistic regression model where:

- prop is the outcome
- conc is the only predictor

Interpret the slope estimate.
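A sketch of one way to specify this model, assuming the prop and total columns from the previous step (the object name fit_conc is just illustrative). When the outcome is a proportion, glm() needs to know how many trials lie behind each proportion, which is passed via the weights argument:

fit_conc <- glm(prop ~ conc,
                family  = binomial,
                data    = data,
                weights = total)  # number of trials behind each proportion
summary(fit_conc)

# The slope is on the log-odds scale; exponentiate for an odds ratio per
# one-unit increase in conc
exp(coef(fit_conc))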
You will work with the titanic data set, which you can find in the surfdrive folder, containing information on the fate of passengers on the infamous voyage.
The data set contains the following variables:

- Survived: this is the outcome variable that you are trying to predict, with 1 meaning a passenger survived and 0 meaning they did not
- Pclass: this is the ticket class the passenger was travelling in, with 1, 2, and 3 representing 1st, 2nd and 3rd class respectively
- Age: this is the age of the passenger in years
- Sex: this is the sex of the passenger, either male or female

Read in the data from the “titanic.csv” file, selecting only the variables Survived, Pclass, Sex and Age. If necessary, correct the class of the variables.
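A possible way to do this with readr and dplyr, assuming titanic.csv is in your working directory:

titanic <- read_csv("titanic.csv") %>%
  select(Survived, Pclass, Sex, Age) %>%
  mutate(Pclass = factor(Pclass),  # ticket class is categorical, not numeric
         Sex    = factor(Sex))
str(titanic)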
What relationships do you expect to find between the predictor variables and the outcome?
Investigate how many passengers survived in each class. You can do this visually by creating a bar plot, or by using the table() function. Search ??table for more information.
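For example, both approaches in a few lines:

# Counts of survivors and non-survivors per ticket class
table(titanic$Pclass, titanic$Survived)

# The same information as a bar plot
titanic %>%
  ggplot(aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "dodge")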
Similarly, investigate the relationship between survival and sex by creating a bar plot and a table.
Investigate the relationship between age and survival by creating a histogram of the age of survivors versus non-survivors.
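For instance, overlaying the two distributions with ggplot2 (the binwidth here is an arbitrary choice):

# Age distributions of survivors versus non-survivors
titanic %>%
  ggplot(aes(x = Age, fill = factor(Survived))) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 5)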
Specify a logistic regression model where “Survived” is the outcome and “Sex” is the only predictor.
What does the intercept mean? What are the odds and what are the log-odds of survival for males?
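A sketch of the model (the object name fit_sex is just illustrative):

fit_sex <- glm(Survived ~ Sex, family = binomial, data = titanic)
summary(fit_sex)

# Intercept: log-odds of survival for the reference category (female,
# alphabetically first); exponentiate to get odds and odds ratios
exp(coef(fit_sex))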
Specify a logistic regression model where “Survived” is the outcome and “Pclass” is the only predictor.
Which category is the reference group? What are their odds of survival?
What are the chances of survival for 2nd and 3rd class passengers?
Save this model as you will come back to it later.
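A sketch, saved here as fit1 on the assumption that this is the fit1 compared against fit2 in the model-fit section below:

fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic)
summary(fit1)

# Odds of survival for the reference class, and odds ratios for the others
exp(coef(fit1))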
What does the intercept mean when there is a continuous predictor?
How are the odds and log-odds interpreted for a continuous predictor?
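These questions are easiest to explore with a concrete model; purely as an illustration, assume Age is the only predictor:

fit_age <- glm(Survived ~ Age, family = binomial, data = titanic)

coef(fit_age)       # intercept: log-odds of survival when Age = 0
exp(coef(fit_age))  # slope (exponentiated): odds ratio per one-year increase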
Specify a logistic regression model where Survived is the outcome and Pclass plus an interaction between Sex and Age are the predictors. Save this model as we will return to it later.
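A sketch, saved as fit2 to match the comparison below; note that Sex * Age expands to the two main effects plus their interaction:

fit2 <- glm(Survived ~ Pclass + Sex * Age,
            family = binomial, data = titanic)
summary(fit2)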
Model selection is an important step, and there are several metrics for assessing model fit to help us select the best performing model. We will use deviance and information criteria to compare the fit of two models you saved before: fit1 and fit2.
Deviance is a measure of the goodness-of-fit of a GLM, where lower deviance indicates a better fitting model. R reports two types of deviance: the null deviance, which describes how well the outcome is predicted by a model with only an intercept, and the residual deviance, which describes how well the outcome is predicted by the fitted model.
We can use the anova() function to perform an analysis of deviance that compares the difference in deviances between competing models.

Compare the deviance of model 1 and model 2 using anova() with test = "Chisq".

AIC is Akaike’s Information Criterion, a method for assessing model quality through comparison of related models. AIC is based on the deviance but introduces a penalty for more complex models. The number itself is not meaningful on its own, and it is only useful when comparing models against one another. Like deviance, the model with the lowest AIC is best.
Use the AIC() function to get the AIC value for model 1 and model 2.

BIC is the Bayesian Information Criterion and is very similar to AIC, but penalises a complex model more than the AIC would. Complex models will have a larger score, indicating worse fit. One difference from the AIC is that the probability of selecting the correct model with the BIC increases as the sample size of the training set increases.
Use the BIC() function to get the BIC value for model 1 and model 2.
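Putting the three comparisons together (assuming fit1 and fit2 from earlier):

# Analysis of deviance: does the larger model reduce the deviance significantly?
anova(fit1, fit2, test = "Chisq")

# Information criteria: lower values indicate the preferred model
AIC(fit1, fit2)
BIC(fit1, fit2)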
Which model should we proceed with?
Often with logistic regression we are interested in how well our model can predict the outcome. The predict() function allows us to do this, taking the model and some data as its main arguments. Additionally, you can specify whether you want predict() to give you predictions as log-odds or as probabilities.
Proceed using the model you selected in the previous question.
Use the predict() function to generate predicted probabilities for the multivariate logistic model. predict() takes the following arguments:

- object, i.e. the logistic model
- newdata, i.e. a data set where we want to predict the outcome (we will use titanic)
- type, i.e. "link" for log-odds or "response" for probabilities (we will use type = "response")
- se.fit, i.e. set to TRUE to estimate the standard error of the probabilities

Remember to save the output to an object.
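For example (preds is an illustrative name):

preds <- predict(fit2,
                 newdata = titanic,
                 type    = "response",  # probabilities instead of log-odds
                 se.fit  = TRUE)

head(preds$fit)     # predicted probabilities
head(preds$se.fit)  # their standard errors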
Add the predicted probabilities and standard errors to the data set.
Calculate the confidence intervals for the predicted probabilities and add them to the data.
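One way to do both steps, using a simple Wald-style 95% interval on the probability scale (note that such intervals can fall slightly outside 0 and 1):

titanic <- titanic %>%
  mutate(prob  = preds$fit,
         se    = preds$se.fit,
         lower = prob - 1.96 * se,  # approximate 95% lower bound
         upper = prob + 1.96 * se)  # approximate 95% upper bound
head(titanic)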
You are now at the end of this week’s practical. Next week, we will discuss model assumptions, model prediction, and the confusion matrix.