R - Practical 4
We will use the following packages in this practical:
library(dplyr)
library(magrittr)
library(ggplot2)
library(gridExtra)
In this practical, you will learn how to perform regression analyses in R using lm() and how to inspect variables by plotting them with ggplot(), repeating some topics of the last 3 weeks.
In this practical, we will use the built-in data set iris. This data set contains measurements of different iris species (flowers); you can find more information here.
A good way of eyeballing a relation between two continuous variables is by creating a scatterplot, for example a ggplot scatter plot (geom_point()). A loess curve can be added to the plot to get a general idea of the relation between the two variables. You can add a loess curve to a ggplot with stat_smooth(method = "loess").
To get a clearer idea of the general trend in the data (or of the relation), a regression line can be added to the plot. A regression line can be added in the same way as a loess curve; only the method argument of the function needs to be changed to "lm".
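For example, a minimal sketch combining a scatterplot, a loess curve, and a regression line (the choice of variables is only for illustration):
# Scatterplot of two iris variables with a loess curve (blue) and a regression line (red)
ggplot(iris, aes(x = Sepal.Width, y = Petal.Width)) +
  geom_point() +
  stat_smooth(method = "loess", se = FALSE, colour = "blue") +
  stat_smooth(method = "lm", se = FALSE, colour = "red")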
With the lm() function, you can specify a linear regression model. You can save a model in an object and request summary statistics with the summary() command. The model is always specified with the code outcome_variable ~ predictor.
When a model is stored in an object, you can ask for the coefficients with coefficients(). The next code block shows how you would specify a model where petal width is predicted by sepal width, and what the summary statistics for this model look like:
# Specify model: outcome = petal width, predictor = sepal width
iris_model1 <- lm(Petal.Width ~ Sepal.Width,
data = iris)
summary(iris_model1)
##
## Call:
## lm(formula = Petal.Width ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38424 -0.60889 -0.03208 0.52691 1.64812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1569 0.4131 7.642 2.47e-12 ***
## Sepal.Width -0.6403 0.1338 -4.786 4.07e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7117 on 148 degrees of freedom
## Multiple R-squared: 0.134, Adjusted R-squared: 0.1282
## F-statistic: 22.91 on 1 and 148 DF, p-value: 4.073e-06
The summary of the model provides the model call, a five-number summary of the residuals, the coefficients (with their standard errors, t-values, and p-values), the residual standard error, the (adjusted) R-squared, and the F-statistic of the model.
Individual elements can be extracted by calling specific model elements (e.g. iris_model1$coefficients).
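For instance, for the model fitted above:
# Two equivalent ways to extract the coefficients from the stored model
coefficients(iris_model1)
iris_model1$coefficients
# Or a single element, e.g. the slope for Sepal.Width
iris_model1$coefficients["Sepal.Width"]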
Specify a regression model where sepal length is predicted by petal width. Store this model as model1. Supply summary statistics for this model.
Based on the summary of the model, give a substantive interpretation of the regression coefficient.
Relate the summary statistics and coefficients to the plots you made in questions 2 - 4.
You can add additional predictors to a model. This can improve the fit and the predictions. When multiple predictors are used in a regression model, it is called multiple linear regression. You specify this model as outcome_variable ~ predictor_1 + predictor_2 + ... + predictor_n.
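For example, a multiple regression on the iris data could be specified as follows; the choice of predictors here is purely illustrative:
# Multiple linear regression: petal width predicted by two continuous predictors
iris_model2 <- lm(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris)
summary(iris_model2)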
Add another predictor to model1, store this under the name model2, and supply summary statistics. Again, give a substantive interpretation of the coefficients and the model.
Up to here, we only included continuous predictors in our models. We will now include a categorical predictor in the model as well.
When a categorical predictor is added, this predictor is split into several contrasts (or dummies), where each group is compared to a reference group. In our example iris data, the variable Species is a categorical variable that indicates the species of flower. This variable can be added as an example of a categorical predictor. The contrasts, and thus the dummy coding, can be inspected with contrasts().
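For the Species variable in the iris data, for example:
# Inspect the dummy coding R uses for the Species factor
contrasts(iris$Species)
# setosa is the reference category; versicolor and virginica each get a dummy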
Add Species to model2, store it under the name model3, and interpret the categorical coefficients of this new model.
Now that you have created multiple models, you can compare how well these models perform (compare the model fit). There are multiple ways of testing the model fit and of comparing models, as explained in the lecture and the reading material. In this practical, we use the following:
- AIC (use AIC() on the model object)
- BIC (use BIC() on the model object)
- MSE (use MSE() from the MLmetrics package, or calculate it by transforming the model$residuals)
- a comparison of two nested models (use anova() to compare 2 models)
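As a sketch of how these comparisons could look in code (assuming model1 and model2 have been fitted in the earlier exercises, and that the MLmetrics package is installed):
# Information criteria: lower values indicate a better trade-off between fit and complexity
AIC(model1, model2)
BIC(model1, model2)
# Mean squared error, via MLmetrics or by transforming the residuals by hand
MLmetrics::MSE(y_pred = fitted(model1), y_true = iris$Sepal.Length)
mean(model1$residuals^2)
# F-test comparing two nested models
anova(model1, model2)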
When fitting a regression line, the predicted values have some error in comparison to the observed values. The sum of the squared values of these errors is the sum of squares. A regression analysis finds the line for which the lowest possible sum of squares is obtained.
The image below shows how the predicted (on the blue regression line) and observed values (black dots) differ and how the predicted values have some error (red vertical lines).
When you have multiple predictors, it becomes harder or impossible to make a plot like the one above (you would need a plot with more dimensions). You can, however, still plot the observed values against the predicted values and infer the error terms from there.
Create a dataset of predicted values for model 1 by taking the outcome variable Sepal.Length and the fitted.values from the model.
Create an observed vs. predicted plot for model 1 (the red vertical lines are not required).
Create a dataset of predicted values and create a plot for model 2.
Compare the two plots and discuss the fit of the models based on what you see in the plots. You can combine them in one figure using the grid.arrange() function.
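One possible way to build such a plot (assuming model1 and model2 from the earlier exercises exist):
# Observed vs. predicted values for model 1
obs_pred1 <- data.frame(observed  = iris$Sepal.Length,
                        predicted = model1$fitted.values)
p1 <- ggplot(obs_pred1, aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "blue")  # line of perfect prediction
# After making a similar plot p2 for model 2, the two can be combined:
# grid.arrange(p1, p2, ncol = 2)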
A regression model can be used to predict values for new cases that were not used to build the model. The regression equation always consists of coefficients (\(\beta\)s) and observed variables (\(X\)):
\[\hat{y} = \beta_0 + \beta_1 * X_a + \beta_2 * X_b + \ldots + \beta_n * X_n\]
All terms can be made specific for the regression equation of the created model. For example, if we have a model where ‘happiness’ is predicted by age and income (scored from 1-50), the equation could look like:
\[\hat{y}_{happiness} = \beta_{intercept} + \beta_{age} * X_{age} + \beta_{income} * X_{income}\]
Then, we can fill in the coefficients obtained from the model. Given \(\beta_{intercept} = 10.2\), \(\beta_{age} = 0.7\), and \(\beta_{income} = 1.3\), the equation would become:
\[\hat{y}_{happiness} = 10.2 + 0.7 * X_{age} + 1.3 * X_{income}\]
If we now want to predict the happiness score for someone of age 28 and with an income score of 35, the prediction would become:
\[\hat{y}_{happiness} = 10.2 + 0.7 * 28 + 1.3 * 35 = 75.3\]
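In R, this arithmetic does not have to be done by hand: the predict() function fills in the regression equation of a fitted model for new data. A small sketch using the iris model fitted earlier:
# Predicted petal width for a new flower with a sepal width of 3.2 cm
new_flower <- data.frame(Sepal.Width = 3.2)
predict(iris_model1, newdata = new_flower)
# The same value follows from 3.1569 + (-0.6403) * 3.2, using the coefficients from the summary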
Adding a categorical predictor to the regression equation adds as many coefficient terms as there are contrasts. The previous regression equation for predicting happiness could be adjusted by adding 'living density' as a categorical predictor with levels 'big city', 'smaller city', and 'rural', where 'big city' would be the reference category. The equation could then be:
\[\hat{y}_{happiness} = 10.2 + 0.7 * X_{age} + 1.3 * X_{income} + 8.4 * X_{smaller city} + 17.9 * X_{rural}\]
When predicting a score for an equation with a categorical predictor, you just assign a 1 to the category that the observation belongs to, and 0s for all other categories.
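For example, for a person of age 28 with an income score of 35 who lives in a rural area (so \(X_{smaller city} = 0\) and \(X_{rural} = 1\)), the prediction would be:
\[\hat{y}_{happiness} = 10.2 + 0.7 * 28 + 1.3 * 35 + 8.4 * 0 + 17.9 * 1 = 93.2\]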
In regression equations with an interaction, an extra coefficient is added to the equation. For example, the happiness equation with age and income as predictors could have an added interaction term. The equation could then look like:
\[\hat{y}_{happiness} = 10.2 + 0.7 * X_{age} + 1.3 * X_{income} + 0.01 * X_{age} * X_{income}\]
For a person of age 36 and income score 30, the predicted score would be:
\[\hat{y}_{happiness} = 10.2 + 0.7 * 36 + 1.3 * 30 + 0.01 * 36 * 30 = 85.2\]