Practical 7

We will use the following packages in this practical:

- dplyr for data manipulation
- magrittr for piping
- readr for reading data
- ggplot2 for plotting
- kableExtra for tables
- pROC, regclass, and caret for model diagnostics

```r
library(dplyr)
library(magrittr)
library(ggplot2)
library(kableExtra)
library(readr)
library(pROC)
library(regclass)
library(caret)
```
In this practical, you will again perform logistic regression analyses using glm(), and discuss model assumptions and diagnostics using the titanic data set.
1. Read in the data from the “titanic.csv” file, which we also used for the previous practical.
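A minimal sketch of one way to do this, assuming "titanic.csv" sits in your working directory; the object name titanic is just a choice used in the examples below:

```r
# Read the data with readr and take a quick look
titanic <- read_csv("titanic.csv")
head(titanic)
```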
Fit the following two models and save them as fit1 and fit2:

- Survived ~ Pclass
- Survived ~ Age + Pclass*Sex
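For example, assuming the data are stored in an object called titanic and Survived is coded 0/1, the models could be fitted like this:

```r
# Logistic regressions via glm() with a binomial family
fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic)
fit2 <- glm(Survived ~ Age + Pclass * Sex, family = binomial, data = titanic)
summary(fit2)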
The first assumption of logistic regression is that the outcome should be binary and therefore follow a binomial distribution. This is easy to check: you just need to be sure that the outcome can only take one of two responses. You can plot the responses of the outcome variable to check this visually if you want. In our case, the possible outcomes are survived or did not survive.
Plot the outcome variable Survived using ggplot().
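One possible plot, assuming the titanic object from the sketch above:

```r
# Bar chart of the two outcome classes
titanic %>%
  ggplot(aes(x = factor(Survived))) +
  geom_bar() +
  labs(x = "Survived", y = "Count")
```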
If you are using logistic regression to make predictions or classifications, the accuracy will be affected by imbalance in the outcome classes. Notice in the plot you just made that there are more people who did not survive than people who did. A possible consequence is reduced accuracy in the classification of survivors.
A certain amount of imbalance is expected and can be handled well by the model in most cases. The effects of imbalance are context-dependent. Possible solutions to serious class imbalance are down-sampling or weighting the outcomes to balance the importance the model places on each class.
Sample size in logistic regression is a complex issue, but some suggest that it is ideal to have 10 cases per candidate predictor in your model. The minimum number of cases to include is then \(N = \frac{10k}{p}\), where \(k\) is the number of predictors and \(p\) is the smallest of the proportions of negative or positive cases in the population.
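As an illustration with made-up numbers: with 3 candidate predictors and 38% of cases in the smaller outcome class, the rule of thumb gives roughly 79 cases.

```r
# Hypothetical numbers, purely to illustrate the formula
k <- 3     # number of candidate predictors
p <- 0.38  # proportion of cases in the smaller outcome class
10 * k / p # suggested minimum N, about 79
```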
Check whether there are enough cases to justify fitting fit1, using the rule of thumb above.

You learned about the next assumption, influential values, in the linear regression practicals, but to remind you:
Influential values are extreme individual data points that can affect the fit of the logistic regression model. They can be visualised using Cook's distance and the residuals vs leverage plot.
Use the plot() function to visualise the outliers and influential points of fit2.

Hint: you need to specify the correct plot with the which argument. Check the lecture slides or search ??plot if you are unsure.
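As a sketch (the exact plots you need depend on the lecture slides), which = 4 gives Cook's distance and which = 5 the residuals vs leverage plot:

```r
plot(fit2, which = 4)  # Cook's distance
plot(fit2, which = 5)  # residuals vs leverage
```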
Lastly, it is important to note that the assumptions of a linear regression do not all map onto logistic regression. In logistic regression, we do not need:

- normally distributed residuals
- homoscedasticity (constant variance of the residuals)
However, deviance residuals are useful for determining whether individual points are not fit well by the model.
Hint: you can use some of the code from the lecture for the next few questions.
Use the resid() function to get the deviance residuals for fit2.
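For example (deviance is the default residual type for glm objects):

```r
# Deviance residuals for fit2
dev_res <- resid(fit2, type = "deviance")
head(dev_res)
```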
Compute the predicted logit values for the model.
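One way to do this is with predict() on the link scale:

```r
# Predictions on the linear predictor (logit) scale
logit_pred <- predict(fit2, type = "link")
```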
Plot the deviance residuals.
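For instance, using the objects created in the two sketches above:

```r
# Deviance residuals against the predicted logit values
plot(logit_pred, dev_res,
     xlab = "Predicted logit", ylab = "Deviance residual")
abline(h = 0, lty = 2)
```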
Pearson residuals can also be useful in logistic regression. They measure deviations between the observed and fitted values. Pearson residuals are easier to plot than deviance residuals, as they can be passed straight to the plot() function.
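A minimal sketch:

```r
# Pearson residuals for fit2, plotted against observation index
pearson_res <- resid(fit2, type = "pearson")
plot(pearson_res, ylab = "Pearson residual")
```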
In last week's practical, you learned how to use the predict() function to calculate predicted probabilities using the models. This week we will create predicted probabilities for the final two models from last week and compare the results using the confusion matrix.
Use the predict() function to get model-predicted probabilities for fit1 and fit2.
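For example (the object names prob1 and prob2 are just choices for the examples below):

```r
# Predicted probabilities on the response scale
prob1 <- predict(fit1, type = "response")
prob2 <- predict(fit2, type = "response")
```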
Create model-predicted classifications for survival, for fit1 and fit2.
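One common choice is a 0.5 cut-off on the predicted probabilities; the threshold itself is an assumption, and the sketch assumes Survived is coded 0/1:

```r
# Classify as survived (1) when the predicted probability exceeds 0.5
class1 <- ifelse(prob1 > 0.5, 1, 0)
class2 <- ifelse(prob2 > 0.5, 1, 0)
```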
You can read about the confusion matrix on Wikipedia. This section tells you how to get some useful metrics from the confusion matrix to evaluate model performance.
Create two confusion matrices (one for each model) using the classifications from the previous question. You can use the table() function, providing the modeled outcome as the true argument and the classifications as the pred argument.
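A sketch, building on the objects from the earlier examples and taking the true outcomes from each model's own model frame so the rows line up even if some observations were dropped for missing values:

```r
# True outcomes for the rows each model actually used
true1 <- model.frame(fit1)$Survived
true2 <- model.frame(fit2)$Survived

conf1 <- table(true = true1, pred = class1)
conf2 <- table(true = true2, pred = class2)
conf1
conf2
```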
Based on the confusion matrices, which model do you think makes better predictions?
Calculate the accuracy, sensitivity, specificity, false positive rate, and positive and negative predictive values from the confusion matrix of the model that makes the better predictions.
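A sketch of the calculations, assuming the confusion matrix of your chosen model is stored as cm in the table(true, pred) layout from above, with Survived coded 0/1 so that "1" is the positive class:

```r
cm <- conf2  # or conf1, whichever model you chose

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # false positive rate
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value
```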
Explain what the different metrics mean in substantive terms.
What does it mean for a model to have such low specificity, but high sensitivity?
The confusionMatrix() function from the caret package can do a lot of this for us. The function takes three arguments:

- data - a vector of predicted classes (in factor form)
- reference - a vector of true classes (in factor form)
- positive - a character string indicating the 'positive' outcome. If not specified, the confusion matrix assumes that the first specified category is the positive outcome.

You can type ??confusionMatrix into the console to learn more.
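For example, reusing the objects from the earlier sketches and assuming Survived is coded 0/1:

```r
# "1" (survived) is treated as the positive class here
confusionMatrix(data      = factor(class2),
                reference = factor(true2),
                positive  = "1")
```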