Practical 6
We will use the following packages in this practical:
library(dplyr)
library(magrittr)
library(ggplot2)
library(foreign)
library(kableExtra)
library(janitor)
library(readr)
In this practical, you will perform regression analyses using glm() and inspect variables by plotting them using ggplot().
Logistic regression is a supervised learning algorithm that classifies data into categories, by predicting the probability that an observation falls into a particular category based on its features. In this tutorial we will consider binary classification, where we determine which of two categories a data point belongs to.
The logistic function can be described as:
\[ P(y = 1|X) = \text{sigmoid}(z) = \frac{1}{1+e^{-z}} \] where
\[ z = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + ... + \hat{\beta_k}x_k \] \(z\) is like the linear predictor in linear regression, but it is transformed by the sigmoid function so that results can be interpreted as probabilities (between 0 and 1). The probability is compared to a threshold to determine what class \(y\) belongs to based on \(X\). You can choose what this threshold is and it is context dependent. For example, if you are predicting the chances of recovery in a clinical trial you might set a very high threshold of 0.90. A common threshold for low-stakes research is 0.50.
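As a small illustration of these formulas (with made-up coefficient values), the sketch below computes \(z\), passes it through the sigmoid, and applies a 0.50 threshold; note that plogis() is base R's built-in sigmoid function.
b0 <- -1.5; b1 <- 0.8; b2 <- 0.3   # hypothetical coefficients
x1 <- 2;    x2 <- 1                # a single observation
z <- b0 + b1 * x1 + b2 * x2        # linear predictor
p <- 1 / (1 + exp(-z))             # sigmoid: P(y = 1|X), identical to plogis(z)
ifelse(p >= 0.50, 1, 0)            # classify using a 0.50 threshold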
The glm() function can be used to specify several different models, including the logistic regression model. The logistic regression model is specified by setting the family argument to “binomial”. You can save a model in an object and request summary statistics with the summary() command.
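For example, a minimal (hypothetical) call looks like the sketch below, where outcome, predictor, and mydata are assumed names rather than objects from this practical:
fit <- glm(outcome ~ predictor, family = binomial, data = mydata)
summary(fit)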
For logistic regression, it is important to know and check which category the predicted probabilities refer to, so you can interpret the model and its coefficients correctly. If your outcome variable is coded as a factor, the glm() function predicts the second factor level, which by default is the level that comes last alphabetically. For example, if the categories are coded as 0 and 1, the model predicts the probability of belonging to “1”.
When a model is stored in an object you can ask for the coefficients (model$coefficients), the predicted probabilities of belonging to the ‘higher’ category (model$fitted.values), and the AIC (model$aic). To investigate all additional model information stored in the object, inspect the fitted model by selecting it in the Environment pane.
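Using the hypothetical fit object from the sketch above, extracting these components would look like:
fit$coefficients         # log-odds coefficients
head(fit$fitted.values)  # predicted probabilities of the 'higher' category
fit$aic                  # AIC of the model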
Before we get started with logistic modelling it helps to understand how odds, log-odds, and probability are related. Essentially, they are all just different expressions of the same thing, and converting between them involves simple formulas.
The glm() function returns coefficients as log-odds by default. Most of us find it difficult to think in terms of log-odds, so instead we convert them to odds (or odds ratios) using the exp() function. If we want to go from odds to log-odds, we just take the logarithm using log().
The odds of an event compare the probability that it happens to the probability that it does not: \(Odds = \frac{P}{1-P}\), where \(P\) is the probability of the event happening and \(1-P\) is the probability that it does not happen. For example, if we have an 80% chance of a sunny day, then we have a 20% chance of a rainy day. The odds would then equal \(\frac{.80}{.20} = 4\), meaning the odds of a sunny day are 4 to 1. Let’s consider this further with an example.
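The sunny-day example translates directly into R; plogis() and qlogis() handle the probability and log-odds conversions for you:
p        <- 0.80          # probability of a sunny day
odds     <- p / (1 - p)   # 0.80 / 0.20 = 4
log_odds <- log(odds)     # log(4), about 1.39
exp(log_odds)             # back to odds: 4
plogis(log_odds)          # back to probability: 0.8
qlogis(p)                 # log-odds straight from the probability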
The code below creates a data frame called data with a column conc giving different concentrations of the peptide-C protein tested for their ability to inhibit the flow of current across a membrane. The no and yes columns contain the number of trials in which inhibition did not or did occur, respectively.
data <- data.frame(conc = c(0.1, 0.5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
no = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
yes = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1 ,7)
)
data
Add new variables to data giving the total number of no and yes trials for each row, the proportion of yes (vs no) trials, and the corresponding log-odds (logit).
data <-
data %>%
mutate(total = no + yes,
prop = yes / total,
logit = qlogis(prop)
)
# The `qlogis()` function is equivalent to the log-odds (i.e., logit) function.
head(data)
# There are many zero proportions, which produce logit values of negative infinity. We can work around this issue by adding a constant (usually 0.5) to all cells before calculating the log-odds.
# We add the same value to the numerator and denominator of our odds formula, so we don't change the relative interpretation of the odds.
# We could also add a 1 to each cell. This option is conceptually interesting because the log of 1 equals 0: it's almost like adding zero to the odds while still correcting the issue.
You can compute the value of this column using the following formulation of the log-odds:
\[ log(odds) = log(\frac{yes + 0.5} {no + 0.5}) \]
robustLogit <- function(x, y) log((x + 0.5) / (y + 0.5))
data <- data %>%
mutate(logit2 = robustLogit(yes, no))
data
Fit a logistic regression model where prop is the outcome and conc is the only predictor, using total as the weights (the number of trials behind each prop). Interpret the slope estimate.
summary(glm(prop ~ conc,
family = binomial,
weights = total,
data = data))
##
## Call:
## glm(formula = prop ~ conc, family = binomial, data = data, weights = total)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.32701 0.33837 -3.922 8.79e-05 ***
## conc 0.01215 0.00496 2.450 0.0143 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16.683 on 10 degrees of freedom
## Residual deviance: 10.389 on 9 degrees of freedom
## AIC: 30.988
##
## Number of Fisher Scoring iterations: 4
# A unit increase in conc increases the log-odds of inhibition by 0.0121 units, and this increase is statistically significant.
# If we exponentiate the slope estimate, we get an interpretation in odds units, but the effect becomes multiplicative instead of additive: for every unit increase in conc, the odds of inhibition are multiplied by exp(0.0121), about 1.012. Note that an exponentiated coefficient above 1 means the odds increase, while one below 1 means the odds decrease.
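If you store the model in an object (here refitted as fit_conc purely for illustration), you can exponentiate the coefficients directly:
fit_conc <- glm(prop ~ conc, family = binomial, weights = total, data = data)
exp(coef(fit_conc))  # odds scale; the conc coefficient is about 1.012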
You will work with the titanic data set, which you can find in the surfdrive folder, containing information on the fate of passengers on the infamous voyage.
Survived: the outcome variable that you are trying to predict, with 1 meaning a passenger survived and 0 meaning they did not
Pclass: the ticket class the passenger was travelling on, with 1, 2, and 3 representing 1st, 2nd and 3rd class respectively
Age: the age of the passenger in years
Sex: the sex of the passenger, either male or female
Import the data set and inspect the variables Survived, Pclass, Sex and Age. If necessary, correct the class of the variables.
titanic <- read_csv("titanic.csv") %>%
mutate(Survived = as.factor(Survived),
Sex = as.factor(Sex),
Pclass = as.factor(Pclass))
## Rows: 891 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (3): Survived, Pclass, Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# We could say that:
# class is related to the outcome as passengers travelling on a higher class ticket have a higher probability of survival
# sex is related to the outcome as women have a higher probability of survival
# age is related to the outcome as younger passengers have a higher probability of survival
Check whether the data support these expectations, for example by making plots or by using the table() function. Search ??table for more information.
titanic %>%
ggplot(aes(Pclass, fill = Survived)) +
geom_bar(position = "dodge") +
labs(x = "Passenger Class",
y = "Count") +
theme_bw()
# The bar plot clearly shows that passengers in the lower classes were less likely to survive.
# We can also use the `prop.table()` function to investigate this. The argument `margin = 1` converts the counts to proportions within each row (i.e. within each class).
titanic %$%
table(Pclass, Survived) %>%
prop.table(margin = 1) %>%
round(2)
## Survived
## Pclass 0 1
## 1 0.37 0.63
## 2 0.53 0.47
## 3 0.76 0.24
titanic %$%
table(Sex, Survived) %>%
prop.table(margin = 1) %>%
round(2)
## Survived
## Sex 0 1
## female 0.26 0.74
## male 0.81 0.19
# The table shows the proportion of males and females that survived versus those who did not survive. Females are much more likely to have survived than males.
titanic %>%
ggplot(aes(Sex, fill = Survived)) +
geom_bar(position = "dodge") +
labs(x = "Sex",
y = "Count") +
theme_bw()
titanic %>%
ggplot(aes(Age, fill = Survived)) +
geom_histogram(colour = "white") +
labs(x = "Age",
y = "Count") +
facet_wrap(~Survived) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# The distribution of age is different for survivors and non-survivors. Younger passengers have higher chances of survival compared to older passengers.
glm(Survived ~ 1,
family = binomial,
data = titanic) %>%
summary()
##
## Call:
## glm(formula = Survived ~ 1, family = binomial, data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.47329 0.06889 -6.87 6.4e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1186.7 on 890 degrees of freedom
## AIC: 1188.7
##
## Number of Fisher Scoring iterations: 4
# A logistic regression without any predictors is simply modelling the log-odds of survival for the entire population (the intercept, beta0).
# The log-odds are -0.473, and the odds are $exp(-0.473) = 0.623$.
# We can also get the odds from a frequency table: 342 passengers survived and 549 did not, so the odds of survival are $342/549 = 0.623$. The log-odds are then $log(0.623) = -0.473$.
titanic %>%
count(Survived) %>%
mutate(prop = prop.table(n)) %>%
kbl(digits = 2) %>%
kable_paper(bootstrap_options = "striped", full_width = FALSE)
Survived | n   | prop
---------|-----|-----
0        | 549 | 0.62
1        | 342 | 0.38
glm(Survived ~ Sex,
family = binomial,
data = titanic) %>%
summary()
##
## Call:
## glm(formula = Survived ~ Sex, family = binomial, data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0566 0.1290 8.191 2.58e-16 ***
## Sexmale -2.5137 0.1672 -15.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 917.8 on 889 degrees of freedom
## AIC: 921.8
##
## Number of Fisher Scoring iterations: 4
# In the model with one dichotomous predictor we are modelling logit(p) = beta0 + beta1*male.
# The intercept is the log-odds of survival for women (1.0566), since the reference group is female.
# The log-odds of survival for men are 2.5137 lower than for women. The odds of survival for men are exp(-2.5137) = 0.081 times the odds for women, i.e. about 92% lower.
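One way to check these numbers yourself (a small sketch, assuming you store the model in an object such as fit_sex) is to convert the coefficients with plogis() and exp():
fit_sex <- glm(Survived ~ Sex, family = binomial, data = titanic)
plogis(coef(fit_sex)["(Intercept)"])  # probability of survival for women, about 0.74
exp(coef(fit_sex)["Sexmale"])         # odds ratio for men vs women, about 0.08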
glm(Survived ~ Pclass,
family = binomial,
data = titanic) %>%
summary()
##
## Call:
## glm(formula = Survived ~ Pclass, family = binomial, data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5306 0.1409 3.766 0.000166 ***
## Pclass2 -0.6394 0.2041 -3.133 0.001731 **
## Pclass3 -1.6704 0.1759 -9.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1083.1 on 888 degrees of freedom
## AIC: 1089.1
##
## Number of Fisher Scoring iterations: 4
# The reference group is 1st class passengers, represented by the intercept.
# The log-odds of survival for 1st class passengers are 0.5306.
# The corresponding odds are exp(0.5306) = 1.70, meaning that for 1st class passengers the odds of surviving are 1.7 times the odds of not surviving.
# For 2nd class passengers, the log-odds of survival are 0.6394 lower than for 1st class.
# The odds ratio is exp(-0.6394) = 0.53, meaning the odds of survival for 2nd class passengers are about 47% lower than for 1st class passengers.
# For 3rd class passengers, the log-odds of survival are 1.6704 lower than for 1st class.
# The odds ratio is exp(-1.6704) = 0.19, meaning the odds of survival for 3rd class passengers are about 81% lower than for 1st class passengers.
Now fit a model with Age as the only predictor and interpret the coefficients. Save this model as you will come back to it later.
fit1 <- glm(Survived ~ Age,
family = binomial,
data = titanic)
summary(fit1)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14327 0.17209 -0.832 0.4051
## Age -0.01120 0.00539 -2.077 0.0378 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1182.3 on 889 degrees of freedom
## AIC: 1186.3
##
## Number of Fisher Scoring iterations: 4
# In the case of a continuous predictor there is no real reference group. Instead, the intercept is the log-odds of survival when age = 0. In this model, the log-odds of survival for passengers of age 0 is -0.143, corresponding to odds of survival of 0.867 (= exp(-0.143)).
# For continuous predictors, the log-odds increase or decrease with every unit increase in the predictor. So, in our model:
# For every increase in age of one year, the log-odds of survival decrease by 0.011, meaning that as age increases the chances of survival decrease.
# For every increase in age of one year, the odds of survival are 0.99 (= exp(-0.0112)) times the odds at the previous age, i.e. about 1.1% lower.
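To make this more concrete, you can plug a few (arbitrarily chosen) ages into the fitted equation and convert the resulting log-odds back to probabilities with plogis():
ages <- c(10, 30, 60)
log_odds <- coef(fit1)["(Intercept)"] + coef(fit1)["Age"] * ages
plogis(log_odds)  # predicted survival probabilities: roughly 0.44, 0.38 and 0.31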
Now fit a model where Survived is the outcome, and Pclass plus an interaction between Sex and Age are the predictors. Save this model as we will return to it later.
fit2 <- glm(Survived ~ Pclass + Sex*Age, family = binomial, data = titanic)
summary(fit2)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex * Age, family = binomial,
## data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.834344 0.414982 6.830 8.49e-12 ***
## Pclass2 -1.264399 0.273220 -4.628 3.70e-06 ***
## Pclass3 -2.412614 0.250004 -9.650 < 2e-16 ***
## Sexmale -1.262875 0.433364 -2.914 0.003567 **
## Age -0.004202 0.011426 -0.368 0.713083
## Sexmale:Age -0.048460 0.014576 -3.325 0.000885 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 793.82 on 885 degrees of freedom
## AIC: 805.82
##
## Number of Fisher Scoring iterations: 5
# The interaction between age and sex is significant, suggesting the slopes for age on survival are different for males and females.
Model selection is an important step, and there are several metrics for assessing model fit to help us select the best performing model. We will use deviance and information criteria to compare the fit of the two models you saved before: fit1 and fit2.
Deviance is a measure of the goodness-of-fit of a GLM, where lower deviance indicates a better fitting model. R reports two types of deviance: the null deviance, which is the deviance of a model containing only an intercept, and the residual deviance, which is the deviance of the fitted model. The difference between the two tells you how much the predictors improve the fit.
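Both quantities are stored in the fitted glm object, so a minimal way to extract them is:
fit1$null.deviance  # deviance of the intercept-only model
fit1$deviance       # residual deviance of the fitted model
deviance(fit1)      # extractor function, same as fit1$deviance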
# You can use the `summary()` command to get the deviance statistics for each model. The null and residual deviance are below the model coefficients.
summary(fit1)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14327 0.17209 -0.832 0.4051
## Age -0.01120 0.00539 -2.077 0.0378 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 1182.3 on 889 degrees of freedom
## AIC: 1186.3
##
## Number of Fisher Scoring iterations: 4
summary(fit2)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex * Age, family = binomial,
## data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.834344 0.414982 6.830 8.49e-12 ***
## Pclass2 -1.264399 0.273220 -4.628 3.70e-06 ***
## Pclass3 -2.412614 0.250004 -9.650 < 2e-16 ***
## Sexmale -1.262875 0.433364 -2.914 0.003567 **
## Age -0.004202 0.011426 -0.368 0.713083
## Sexmale:Age -0.048460 0.014576 -3.325 0.000885 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 793.82 on 885 degrees of freedom
## AIC: 805.82
##
## Number of Fisher Scoring iterations: 5
# For model 1, the null deviance is 1186.7 and the residual deviance is 1182.3. For model 2, the null deviance is 1186.66 and the residual deviance is 793.82.
We can use the anova() function to perform an analysis of deviance that compares the difference in deviance between competing models.
Compare fit1 and fit2 using anova() with test = "Chisq".
anova(fit1, fit2, test = "Chisq")
# The analysis of deviance indicates that there is a reduction in residual deviance of 388.46 that is statistically significant. Model 2 is a better model. For a binomial model, the statistical test should be the chi-square difference test.
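As a sanity check (not part of the original exercise), you can reproduce the test by hand: the drop in deviance is referred to a chi-square distribution with degrees of freedom equal to the difference in residual degrees of freedom between the two models.
dev_diff <- deviance(fit1) - deviance(fit2)        # about 388.5
df_diff  <- df.residual(fit1) - df.residual(fit2)  # 889 - 885 = 4
pchisq(dev_diff, df = df_diff, lower.tail = FALSE) # p-value of the chi-square difference test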
AIC is Akaike’s Information Criterion, a method for assessing model quality through comparison of related models. AIC is based on the deviance but introduces a penalty for more complex models. The number itself is not meaningful on its own; it is only useful when comparing models against one another. As with deviance, the model with the lowest AIC is best.
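For models fitted to ungrouped 0/1 data, as here, the AIC works out to the residual deviance plus twice the number of estimated parameters; a quick check on fit2:
deviance(fit2) + 2 * length(coef(fit2))  # 793.82 + 2 * 6 = 805.82
AIC(fit2)                                # same value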
Use the AIC() function to get the AIC values for model 1 and model 2.
AIC(fit1, fit2)
# The AIC for model 2 is lower than the AIC for model 1, indicating that model 2 has a better fit
BIC is the Bayesian Information Criterion and is very similar to AIC, but it penalises a complex model more than the AIC would. Complex models will have a larger score, indicating worse fit. One difference from the AIC is that the probability of selecting the correct model with the BIC increases as the sample size of the training set increases.
Use the BIC() function to get the BIC values for model 1 and model 2.
BIC(fit1, fit2)
# The BIC for model 2 is lower than the BIC for model 1, indicating that model 2 has a better fit
# Overall, model 2 is preferred, as it has the lower residual deviance, AIC and BIC.
Often with logistic regression we are interested in how well our model can predict the outcome. The predict() function allows us to do this, taking the model and some data as its main arguments. Additionally, you can specify whether you want predict() to return predictions on the log-odds (link) scale or as probabilities.
Proceed using the model you selected in the previous question.
Use the predict() function to generate predicted probabilities for the multivariate logistic model. predict() takes the following arguments:
object, i.e. the logistic model
newdata, i.e. a data set for which we want to predict the outcome (we will use titanic)
type, which can be "link" for log-odds or "response" for probabilities (we will use type = "response")
se.fit, which can be set to TRUE to estimate the standard error of the probabilities
Remember to save the output to an object.
preds <- data.frame(predict(fit2, type = "response", se.fit = TRUE))
# The `type="response"` option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.
titanic$probs <- preds$fit
titanic$se <- preds$se.fit
# You can access the predicted probabilities using `$fit` and the standard errors as `$se.fit`.
titanic %<>%
mutate(ci_Lower = probs - 1.96 * se,
ci_Upper = probs + 1.96 * se)
# Now we have all the information to visualize our predicted probabilities using `ggplot()`:
# Note: `geom_ribbon()` allows us to display an interval for the `y` values using `ymin` and `ymax`.
titanic %>%
ggplot(aes(x = Age, y = probs)) +
geom_ribbon(aes(ymin = ci_Lower, ymax = ci_Upper, fill = Pclass), alpha = 0.2) +
geom_line(aes(color = Pclass), lwd = 1) +
ylab("Probability of Survival") +
theme_bw() +
facet_wrap(vars(Sex))
You are now at the end of this week’s practical. Next week, we will discuss model assumptions, model prediction, and the confusion matrix.