In these lab exercises, you are going to start working with multiple imputation (MI). Specifically, you will do some simple MI-based analyses using the mice package.
Load the mice package.
library(mice)
The mice package contains several datasets. Once the package is loaded, you can access these datasets in your workspace. In this lab, we will work with the nhanes dataset (Schafer, 1997, Table 6.14).
The nhanes dataset is a small data set with non-monotone missing values. These data comprise 25 observations of four variables: age group, body mass index, hypertension, and cholesterol.
Unless otherwise specified, all subsequent questions refer to the nhanes dataset.
Use the mice()
function to impute the missing data using stochastic
regression imputation.
Set the random number seed by specifying a value for the seed
argument.
imp1 <- mice(nhanes, method = "norm.nob", m = 1, seed = 235711)
##
## iter imp variable
## 1 1 bmi hyp chl
## 2 1 bmi hyp chl
## 3 1 bmi hyp chl
## 4 1 bmi hyp chl
## 5 1 bmi hyp chl
This code imputes the missing values in the dataset with a single (because
m = 1
) stochastic regression imputation. This approach does not account for
variability in the regression weights, so it is not ‘proper’ in the sense of
Rubin (1987). The variability of the imputed data will be
underestimated and, consequently, standard errors for statistics estimated from
these imputed data will be too small.
Although the above is always true in principle, for data with few missing values, the additional complexity of MI is not really necessary. The degree of standard error deflation produced by stochastic regression imputation increases with the missing data rate. So, stochastic regression imputation can still be a good (and simple) solution when the missing data rate is low (say, 5% or less missing on any given variable).
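To see concretely why this approach understates uncertainty, it helps to hand-roll stochastic regression imputation on a toy dataset. The base-R sketch below (simulated data, not nhanes) imputes each missing y from the fitted regression line plus a random residual draw. Note that the estimated weights and residual SD are treated as fixed, known values, which is exactly the source of the problem.

```r
# Hand-rolled stochastic regression imputation on simulated data.
set.seed(235711)
x <- rnorm(25)
y <- 2 + 0.5 * x + rnorm(25)
y[sample(25, 5)] <- NA                    # make 5 values missing

obs   <- !is.na(y)
fit   <- lm(y[obs] ~ x[obs])              # model fit to the observed cases
sigma <- summary(fit)$sigma               # residual SD, treated as known

# Prediction + residual draw; the weights themselves are never perturbed,
# so the uncertainty in the estimated coefficients is ignored.
y[!obs] <- coef(fit)[1] + coef(fit)[2] * x[!obs] + rnorm(sum(!obs), sd = sigma)
```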
Use the mice::complete() function to create a completed dataset.
(comp1 <- complete(imp1))
## age bmi hyp chl
## 1 1 33.14591 1.2730967 223.8749
## 2 2 22.70000 1.0000000 187.0000
## 3 1 33.18101 1.0000000 187.0000
## 4 3 20.67198 1.5426007 185.5465
## 5 1 20.40000 1.0000000 113.0000
## 6 3 22.27522 1.8590519 184.0000
## 7 1 22.50000 1.0000000 118.0000
## 8 1 30.10000 1.0000000 187.0000
## 9 2 22.00000 1.0000000 238.0000
## 10 2 23.90271 1.1955721 150.0048
## 11 1 30.23901 0.9862682 177.2913
## 12 2 24.64479 1.1240773 227.3365
## 13 3 21.70000 1.0000000 206.0000
## 14 2 28.70000 2.0000000 204.0000
## 15 1 29.60000 1.0000000 186.0343
## 16 1 32.82535 1.0013922 264.1071
## 17 3 27.20000 2.0000000 284.0000
## 18 2 26.30000 2.0000000 199.0000
## 19 1 35.30000 1.0000000 218.0000
## 20 3 25.50000 2.0000000 257.3953
## 21 1 30.74885 1.7896261 185.0158
## 22 1 33.20000 1.0000000 229.0000
## 23 1 27.50000 1.0000000 131.0000
## 24 3 24.90000 1.0000000 246.1195
## 25 2 27.40000 1.0000000 186.0000
The complete()
function replaces the missing values with the imputations
generated by the mice()
run. Now, we have a completed dataset to analyze.
Use the imputed data to regress bmi
onto age
.
fit1 <- lm(bmi ~ age, data = comp1)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ age, data = comp1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1483 -2.0483 0.5517 2.4856 5.7517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.8152 1.7343 18.921 1.61e-15 ***
## age -3.2669 0.8944 -3.653 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.64 on 23 degrees of freedom
## Multiple R-squared: 0.3671, Adjusted R-squared: 0.3396
## F-statistic: 13.34 on 1 and 23 DF, p-value: 0.001327
Check the documentation for the mice()
function.
?mice
Use the mice()
function to multiply impute the nhanes data using all
default options.
imp <- mice(nhanes)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
As you can see from the printed output, the algorithm ran for 5 iterations and
produced 5 imputations for each missing datum. These numbers result from the
default values for the maxit
and m
arguments, respectively.
The object returned by mice()
has the class mids (multiply imputed
data set). This object encapsulates all the relevant information mice()
produced when imputing the nhanes dataset (e.g., the original data, the
imputed values, the number of missing values, number of iterations, etc.).
Use the ls()
function to check the contents of the mids object you created
in Question 2.2.
What do you think is stored in the following slots?
- data
- imp
- method
- m
ls(imp)
## [1] "blocks" "blots" "call" "chainMean"
## [5] "chainVar" "data" "date" "formulas"
## [9] "ignore" "imp" "iteration" "lastSeedValue"
## [13] "loggedEvents" "m" "method" "nmis"
## [17] "post" "predictorMatrix" "seed" "version"
## [21] "visitSequence" "where"
The original, incomplete, data are stored in the data
slot.
imp$data
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
A list containing the imputed values is stored in the imp
slot.
imp$imp
## $age
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)
##
## $bmi
## 1 2 3 4 5
## 1 22.7 35.3 30.1 29.6 33.2
## 3 33.2 27.2 26.3 30.1 27.2
## 4 22.7 24.9 27.4 22.5 22.5
## 6 25.5 25.5 27.4 20.4 27.4
## 10 22.0 30.1 22.7 27.5 21.7
## 11 35.3 27.5 28.7 22.0 27.5
## 12 22.0 27.5 20.4 22.7 22.5
## 16 29.6 35.3 26.3 22.0 25.5
## 21 29.6 30.1 27.5 28.7 29.6
##
## $hyp
## 1 2 3 4 5
## 1 1 1 1 1 1
## 4 2 2 1 2 1
## 6 2 2 2 1 2
## 10 1 1 1 2 2
## 11 1 1 1 1 1
## 12 1 1 1 1 1
## 16 1 1 1 1 1
## 21 1 1 1 1 1
##
## $chl
## 1 2 3 4 5
## 1 238 218 204 131 218
## 4 186 186 218 206 186
## 10 199 284 187 186 118
## 11 187 118 187 113 131
## 12 187 206 238 238 199
## 15 131 187 187 238 204
## 16 187 186 238 131 187
## 20 206 284 218 204 206
## 21 187 238 238 238 204
## 24 218 186 206 206 186
A vector defining the elementary imputation methods is stored in the method
slot.
imp$method
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The number of imputations is stored in the m
slot.
imp$m
## [1] 5
Before proceeding any further, we need to check convergence. If the imputation model did not converge, or if the imputed values are not reasonable, there is no reason to proceed to the analysis phase.
Use the plot.mids()
function to create trace plots of the imputed data.
plot(imp)
These plots show the means and standard deviations of the imputed values for each variable across iterations and imputations. Each line represents a different imputed dataset, and the x-axis indexes the iteration.
If the imputation model has converged, we should see the lines mixing randomly with no trends in the latter parts of these trace plots. Although it’s difficult to judge with so few imputations, it looks like there may be some trending in these plots. We should probably consider increasing the number of iterations. We’ll come back to that idea below.
Use the mice::densityplot()
function to check the distributions of the
imputed values.
densityplot(imp)
In these plots, the blue curve represents the density estimated from the observed data, and the red curves represent the densities estimated from the imputed values. We want to see the red and blue curves behaving similarly (though not identically).
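Beyond the visual check, we can do a crude numeric comparison of the observed and imputed values. A sketch using the imp object from above (the imp slot holds one column of imputed chl values per imputation):

```r
# Mean of each imputation's chl draws vs. the observed-data mean.
colMeans(imp$imp$chl)
mean(imp$data$chl, na.rm = TRUE)
```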
At various places above, you should have also seen that mice()
created 5 sets
of imputations (which is the default). If we want more imputations (which we
frequently will), we can set an appropriate value for the m
argument in the
mice()
function.
Redo the imputation from Question 2.2.
This time, create m = 20 imputations, set the random number seed, and suppress the printed iteration history.
imp2 <- mice(nhanes, m = 20, seed = 235711, print = FALSE)
The print = FALSE
option simply suppresses the printed iteration information.
Check the convergence of the imputation model from Question 3.1 as you did before.
plot(imp2)
densityplot(imp2)
The trace plots suggest convergence of the imputation models, and the density plots suggest plausible imputed values. We’re good to go.
If we want to create a set of stand-alone multiply imputed datasets, we use the
mice::complete()
function to do so.
When completing the data with stochastic regression imputation, we only had one
sensible way to return the result because we only created a single set of
imputations. MI, on the other hand, creates m different sets of imputations,
so we have several ways to format the completed datasets. We can select between
these options via the action
argument in the mice::complete()
function.
If we specify a single number for the action
argument, the function will
return a single completed dataset corresponding to the requested imputation
number (e.g., the third imputation in the following code).
imp <- mice(nhanes, m = 3, print = FALSE)
complete(imp, 3)
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
We can also request that all m imputed datasets be collapsed into a single block of data with three different structures.

- action = "long" row-binds the imputed datasets one atop the other.
- action = "broad" column-binds the imputed datasets side-by-side.
- action = "repeated" does the same as "broad" but reorders the variables so the m replicates of each variable form a contiguous block of data.

complete(imp, "long")
## age bmi hyp chl .imp .id
## 1 1 35.3 2 206 1 1
## 2 2 22.7 1 187 1 2
## 3 1 27.2 1 187 1 3
## 4 3 25.5 1 204 1 4
## 5 1 20.4 1 113 1 5
## 6 3 21.7 2 184 1 6
## 7 1 22.5 1 118 1 7
## 8 1 30.1 1 187 1 8
## 9 2 22.0 1 238 1 9
## 10 2 22.7 1 187 1 10
## 11 1 22.5 1 187 1 11
## 12 2 22.7 1 187 1 12
## 13 3 21.7 1 206 1 13
## 14 2 28.7 2 204 1 14
## 15 1 29.6 1 199 1 15
## 16 1 22.0 1 131 1 16
## 17 3 27.2 2 284 1 17
## 18 2 26.3 2 199 1 18
## 19 1 35.3 1 218 1 19
## 20 3 25.5 2 206 1 20
## 21 1 22.0 1 187 1 21
## 22 1 33.2 1 229 1 22
## 23 1 27.5 1 131 1 23
## 24 3 24.9 1 184 1 24
## 25 2 27.4 1 186 1 25
## 26 1 27.4 1 131 2 1
## 27 2 22.7 1 187 2 2
## 28 1 30.1 1 187 2 3
## 29 3 22.5 1 206 2 4
## 30 1 20.4 1 113 2 5
## 31 3 21.7 1 184 2 6
## 32 1 22.5 1 118 2 7
## 33 1 30.1 1 187 2 8
## 34 2 22.0 1 238 2 9
## 35 2 30.1 1 184 2 10
## 36 1 35.3 1 199 2 11
## 37 2 22.7 1 131 2 12
## 38 3 21.7 1 206 2 13
## 39 2 28.7 2 204 2 14
## 40 1 29.6 1 187 2 15
## 41 1 25.5 1 238 2 16
## 42 3 27.2 2 284 2 17
## 43 2 26.3 2 199 2 18
## 44 1 35.3 1 218 2 19
## 45 3 25.5 2 218 2 20
## 46 1 33.2 1 199 2 21
## 47 1 33.2 1 229 2 22
## 48 1 27.5 1 131 2 23
## 49 3 24.9 1 204 2 24
## 50 2 27.4 1 186 2 25
## 51 1 27.2 2 199 3 1
## 52 2 22.7 1 187 3 2
## 53 1 26.3 1 187 3 3
## 54 3 27.5 2 206 3 4
## 55 1 20.4 1 113 3 5
## 56 3 24.9 2 184 3 6
## 57 1 22.5 1 118 3 7
## 58 1 30.1 1 187 3 8
## 59 2 22.0 1 238 3 9
## 60 2 22.7 2 186 3 10
## 61 1 22.0 1 187 3 11
## 62 2 35.3 1 206 3 12
## 63 3 21.7 1 206 3 13
## 64 2 28.7 2 204 3 14
## 65 1 29.6 1 131 3 15
## 66 1 35.3 1 199 3 16
## 67 3 27.2 2 284 3 17
## 68 2 26.3 2 199 3 18
## 69 1 35.3 1 218 3 19
## 70 3 25.5 2 186 3 20
## 71 1 26.3 1 187 3 21
## 72 1 33.2 1 229 3 22
## 73 1 27.5 1 131 3 23
## 74 3 24.9 1 184 3 24
## 75 2 27.4 1 186 3 25
complete(imp, "broad")
## age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1 1 35.3 2 206 1 27.4 1 131 1 27.2 2 199
## 2 2 22.7 1 187 2 22.7 1 187 2 22.7 1 187
## 3 1 27.2 1 187 1 30.1 1 187 1 26.3 1 187
## 4 3 25.5 1 204 3 22.5 1 206 3 27.5 2 206
## 5 1 20.4 1 113 1 20.4 1 113 1 20.4 1 113
## 6 3 21.7 2 184 3 21.7 1 184 3 24.9 2 184
## 7 1 22.5 1 118 1 22.5 1 118 1 22.5 1 118
## 8 1 30.1 1 187 1 30.1 1 187 1 30.1 1 187
## 9 2 22.0 1 238 2 22.0 1 238 2 22.0 1 238
## 10 2 22.7 1 187 2 30.1 1 184 2 22.7 2 186
## 11 1 22.5 1 187 1 35.3 1 199 1 22.0 1 187
## 12 2 22.7 1 187 2 22.7 1 131 2 35.3 1 206
## 13 3 21.7 1 206 3 21.7 1 206 3 21.7 1 206
## 14 2 28.7 2 204 2 28.7 2 204 2 28.7 2 204
## 15 1 29.6 1 199 1 29.6 1 187 1 29.6 1 131
## 16 1 22.0 1 131 1 25.5 1 238 1 35.3 1 199
## 17 3 27.2 2 284 3 27.2 2 284 3 27.2 2 284
## 18 2 26.3 2 199 2 26.3 2 199 2 26.3 2 199
## 19 1 35.3 1 218 1 35.3 1 218 1 35.3 1 218
## 20 3 25.5 2 206 3 25.5 2 218 3 25.5 2 186
## 21 1 22.0 1 187 1 33.2 1 199 1 26.3 1 187
## 22 1 33.2 1 229 1 33.2 1 229 1 33.2 1 229
## 23 1 27.5 1 131 1 27.5 1 131 1 27.5 1 131
## 24 3 24.9 1 184 3 24.9 1 204 3 24.9 1 184
## 25 2 27.4 1 186 2 27.4 1 186 2 27.4 1 186
complete(imp, "repeated")
## age.1 age.2 age.3 bmi.1 bmi.2 bmi.3 hyp.1 hyp.2 hyp.3 chl.1 chl.2 chl.3
## 1 1 1 1 35.3 27.4 27.2 2 1 2 206 131 199
## 2 2 2 2 22.7 22.7 22.7 1 1 1 187 187 187
## 3 1 1 1 27.2 30.1 26.3 1 1 1 187 187 187
## 4 3 3 3 25.5 22.5 27.5 1 1 2 204 206 206
## 5 1 1 1 20.4 20.4 20.4 1 1 1 113 113 113
## 6 3 3 3 21.7 21.7 24.9 2 1 2 184 184 184
## 7 1 1 1 22.5 22.5 22.5 1 1 1 118 118 118
## 8 1 1 1 30.1 30.1 30.1 1 1 1 187 187 187
## 9 2 2 2 22.0 22.0 22.0 1 1 1 238 238 238
## 10 2 2 2 22.7 30.1 22.7 1 1 2 187 184 186
## 11 1 1 1 22.5 35.3 22.0 1 1 1 187 199 187
## 12 2 2 2 22.7 22.7 35.3 1 1 1 187 131 206
## 13 3 3 3 21.7 21.7 21.7 1 1 1 206 206 206
## 14 2 2 2 28.7 28.7 28.7 2 2 2 204 204 204
## 15 1 1 1 29.6 29.6 29.6 1 1 1 199 187 131
## 16 1 1 1 22.0 25.5 35.3 1 1 1 131 238 199
## 17 3 3 3 27.2 27.2 27.2 2 2 2 284 284 284
## 18 2 2 2 26.3 26.3 26.3 2 2 2 199 199 199
## 19 1 1 1 35.3 35.3 35.3 1 1 1 218 218 218
## 20 3 3 3 25.5 25.5 25.5 2 2 2 206 218 186
## 21 1 1 1 22.0 33.2 26.3 1 1 1 187 199 187
## 22 1 1 1 33.2 33.2 33.2 1 1 1 229 229 229
## 23 1 1 1 27.5 27.5 27.5 1 1 1 131 131 131
## 24 3 3 3 24.9 24.9 24.9 1 1 1 184 204 184
## 25 2 2 2 27.4 27.4 27.4 1 1 1 186 186 186
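As an aside, the long format round-trips: if we keep the original incomplete data in the stack (include = TRUE), mice::as.mids() can rebuild a mids object from the .imp and .id columns. A sketch:

```r
# Stack the original data (.imp == 0) together with the m imputations,
# then reconstruct a mids object from the long-format data frame.
longDat <- complete(imp, "long", include = TRUE)
imp3    <- as.mids(longDat)
```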
The list format produced by action = all
is, by far, the most useful. This
option returns the imputed datasets in an m-element list where each element
contains one completed dataset. Since R prefers to work with lists, you will
almost certainly want to use the action = all
format when completing multiply
imputed datasets for analysis in R.
complete(imp, "all")
## $`1`
## age bmi hyp chl
## 1 1 35.3 2 206
## 2 2 22.7 1 187
## 3 1 27.2 1 187
## 4 3 25.5 1 204
## 5 1 20.4 1 113
## 6 3 21.7 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 1 187
## 11 1 22.5 1 187
## 12 2 22.7 1 187
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 199
## 16 1 22.0 1 131
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 206
## 21 1 22.0 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## $`2`
## age bmi hyp chl
## 1 1 27.4 1 131
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 22.5 1 206
## 5 1 20.4 1 113
## 6 3 21.7 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 30.1 1 184
## 11 1 35.3 1 199
## 12 2 22.7 1 131
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 25.5 1 238
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 218
## 21 1 33.2 1 199
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 204
## 25 2 27.4 1 186
##
## $`3`
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## attr(,"class")
## [1] "mild" "list"
Use the mice::complete()
function to generate a list of multiply imputed
datasets from the imputations you created in Question 3.1.
Save the result as miList
.
miList <- complete(imp2, "all")
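Because miList is an ordinary list, base R functions like lapply() can map an analysis over all of the completed datasets. For example, a sketch that collects the coefficients of the bmi-on-age regression from every imputation:

```r
# Fit the same model to each completed dataset and stack the coefficients
# into a matrix with one row per imputation.
coefs <- do.call(rbind, lapply(miList, function(d) coef(lm(bmi ~ age, data = d))))
coefs
```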
Use the mids object you created in Question 3.1 to answer the questions in this section.
After generating the multiply imputed datasets as we did above, the next step in an MI-based analysis is the so-called analysis phase. In the analysis phase, we fit a substantively interesting analysis model to each of the imputed datasets.
When working with mice, the simplest way to implement the analysis phase is
via the with.mids()
function. For example, the following code regresses bmi
onto chl
using the 3 imputed datasets created above.
fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
##
## call1 :
## mice(data = nhanes, m = 3, printFlag = FALSE)
##
## nmis :
## [1] 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 17.26467 0.04527
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 21.44927 0.02879
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.43439 0.03409
As you can see, the fit
object simply contains 3 fitted lm
objects. Each
of these objects represents a regression model fit to one of the imputed datasets
in the imp
object.
The fit
object returned by with.mids()
has the class mira (multiply
imputed repeated analyses).
class(fit)
## [1] "mira"
ls(fit)
## [1] "analyses" "call" "call1" "nmis"
The actual fitted models live in the analyses
slot of the mira
object. Each
of these fitted models will behave according to their individual types (e.g., a
fitted lm
object inside a mira
object will behave as any other lm
object
would).
library(dplyr)
fit$analyses[[1]] %>% summary()
##
## Call:
## lm(formula = bmi ~ chl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0387 -3.0300 -0.9996 2.2004 8.7099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.26467 4.30626 4.009 0.00055 ***
## chl 0.04527 0.02227 2.033 0.05374 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.044 on 23 degrees of freedom
## Multiple R-squared: 0.1523, Adjusted R-squared: 0.1155
## F-statistic: 4.134 on 1 and 23 DF, p-value: 0.05374
fit$analyses[[2]] %>% coef()
## (Intercept) chl
## 21.44927435 0.02878946
fit$analyses[[3]] %>% resid()
## 1 2 3 4 5 6
## -0.01769018 -4.10864703 -0.50864703 0.04370132 -3.88621430 -1.80638624
## 7 8 9 10 11 12
## -1.95664894 3.29135297 -6.54708040 -4.07456010 -4.80864703 7.84370132
## 13 14 15 16 17 18
## -5.75629868 1.31187518 4.70022098 8.08230982 -2.91507913 -0.91769018
## 19 20 21 22 23 24
## 7.43465818 -1.27456010 -0.50864703 4.95970196 2.60022098 -1.80638624
## 25
## 0.62543990
Use with.mids()
to regress bmi
onto age
and hyp
.
fit2 <- with(imp2, lm(bmi ~ age + hyp))
The final step in an MI-based analysis is the pooling phase wherein we
aggregate the multiple model fits from the analysis phase. If we’ve implemented
the analysis phase via the with.mids()
function, then we can pool using the
mice::pool()
function. For example, the following code will pool the regression
models estimated above.
poolFit <- pool(fit)
summary(poolFit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 19.71611122 5.04500901 3.908043 10.64744 0.002596098
## 2 chl 0.03604852 0.02456045 1.467747 14.68944 0.163257414
The poolFit
object has class mipo
(multiply imputed pooled object).
The mipo
object contains the pooled regression coefficients and standard
errors (pooled with the Rubin, 1987, rules), as well as some other
useful statistics (e.g., the fraction of missing information, fmi
).
ls(poolFit)
## [1] "call" "glanced" "m" "pooled"
ls(poolFit$pooled)
## [1] "b" "df" "dfcom" "estimate" "fmi" "lambda"
## [7] "m" "riv" "t" "term" "ubar"
poolFit$pooled$fmi
## [1] 0.3595712 0.2518258
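These pooled quantities follow Rubin's (1987) rules. For a single scalar parameter, the rules are simple enough to hand-roll in base R; the sketch below (pool_scalar is an illustrative helper, not a mice function) mirrors what pool() computes for each coefficient.

```r
# Rubin's rules for one scalar parameter, hand-rolled.
#   ests: the m per-imputation point estimates
#   vars: the m per-imputation squared standard errors
pool_scalar <- function(ests, vars) {
  m    <- length(ests)
  qbar <- mean(ests)              # pooled point estimate
  ubar <- mean(vars)              # within-imputation variance
  b    <- var(ests)               # between-imputation variance
  t    <- ubar + (1 + 1 / m) * b  # total variance
  list(est = qbar, se = sqrt(t), riv = (1 + 1 / m) * b / ubar)
}

# Example with made-up estimates from m = 3 imputations:
pool_scalar(c(0.045, 0.029, 0.034), c(0.0005, 0.0004, 0.0006))
```

The relative increase in variance (riv) is one ingredient in the fraction of missing information reported above.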
Use the mice::pool()
function to pool the estimates you generated in
Question 5.1.
What inferences can you draw from the MI estimates?
pool2 <- pool(fit2)
summary(pool2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 28.106490 2.866908 9.803765 13.27152 1.896402e-07
## 2 age -2.812772 1.214656 -2.315695 13.61421 3.672980e-02
## 3 hyp 2.843548 2.233504 1.273133 15.56480 2.216580e-01
At the \(\alpha = 0.05\) significance level, neither age
nor hyp
significantly
predict bmi
.
The workflow demonstrated above applies to many different types of analysis. As
long as the estimation function provides coef()
and vcov()
methods, this
workflow should succeed. In Lab 2c, we will explore alternative workflows that
you can use in situations where the mice()
\(\rightarrow\) with.mids()
\(\rightarrow\) pool()
process doesn’t work.
By default, mice()
will impute each incomplete variable using some flavor of
univariate supervised model wherein all other variables in the dataset act as
predictors. These individual, univariate models are called the elementary
imputation models (EIMs), and we are completely free to adjust their
specification. One such adjustment that we often want to make is restricting
which variables a given EIM uses as predictors.
In mice()
we adjust the right hand side (RHS) of the EIMs via the
predictorMatrix
argument.
The predictor matrix is stored in the predictorMatrix
slot of the mids
object. Let’s look at a default predictor matrix.
imp$predictorMatrix
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each row in the predictor matrix is a pattern vector that defines the predictors
used in the EIM for each variable in the dataset. Variables that are not imputed
still get a row in the predictor matrix; their rows are simply ignored by mice().
Consequently, the predictor matrix is square. A value of 1 in a matrix entry
indicates that the column variable is used as a predictor in the EIM for the row
variable. For example, the 1 at position [2, 1] indicates that variable age
was used in the imputation model for bmi
. Note that the diagonal is zero
because a variable cannot impute its own missing data.
We are not forced to use this version of the predictor matrix; mice
gives us
complete control over the predictor matrix. So, we can choose our own predictors
for each EIM to suit our particular problem. This flexibility can be very useful, for
example, when you have many variables, or when you have clear ideas or prior
knowledge about relations in the data.
We can use the mice::make.predictorMatrix()
function to initialize a default
predictor matrix, without doing any imputation. We can then modify this matrix
to fit our needs.
(pred <- make.predictorMatrix(nhanes))
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred
now contains the default predictor matrix for the nhanes
data. Altering the predictor matrix and using the updated version for imputation
is very simple. For example, the following code removes the variable hyp
from
the set of predictors in each EIM.
pred[ , "hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
We use the modified predictor matrix by supplying it to the predictorMatrix
argument in mice()
.
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 235711)
Create a custom predictor matrix for the nhanes data.
Specify the EIMs as follows:
pred1 <- make.predictorMatrix(nhanes)
pred1["age", ] <- 0
pred1["bmi", "age"] <- 0
pred1["hyp", "bmi"] <- 0
pred1["chl", c("age", "hyp")] <- 0
pred1
## age bmi hyp chl
## age 0 0 0 0
## bmi 0 0 1 1
## hyp 1 0 0 1
## chl 0 1 0 0
The mice::quickpred()
function provides a simple way to construct predictor
matrices via various selection rules. The following code will select a variable
as a predictor when it correlates at least \(\rho = 0.30\) with the incomplete
variable.
(pred <- quickpred(nhanes, mincor = 0.3))
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
Check the documentation for quickpred()
.
Take note of the different selection criteria you can apply.
?quickpred
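Several of these criteria can be combined in a single call. For instance, the sketch below selects predictors that meet both a minimum correlation and a minimum proportion of usable cases, while forcing age into every EIM:

```r
# Predictors must correlate at least 0.30 with the target AND offer at
# least 25% usable cases; age is always included.
quickpred(nhanes, mincor = 0.3, minpuc = 0.25, include = "age")
```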
Use quickpred()
to create a predictor matrix for the nhanes data.
This predictor matrix should satisfy the following conditions.
- Include age as a predictor in every EIM.
- Select any other predictors that correlate at least 0.45 with the incomplete variable.

Which variables are selected as predictors for each EIM?
(pred <- quickpred(nhanes, include = "age", mincor = 0.45))
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 0
## hyp 1 0 0 0
## chl 1 0 0 0
Age is the only predictor in each of the imputation models.
Use the predictor matrix you created in Question 6.3 to impute the nhanes data.
Use default settings except for the following options.

- Create m = 8 imputations.
- Impute every incomplete variable with Bayesian linear regression, norm.

imp <- mice(nhanes,
m = 8,
method = "norm",
predictorMatrix = pred,
seed = 235711,
print = FALSE)
Create trace plots and density plots for the mids object you created in Question 6.4.
plot(imp)
densityplot(imp)
For each column in the dataset, we must specify an imputation method. The
method
slot in a mids object shows which methods were applied to each column.
The following code shows the default methods that would be applied if we impute
the nhanes data.
make.method(nhanes)
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The age
variable is completely observed, so it is not imputed. The empty
string, ""
, tells mice()
not to impute the corresponding variable. The other
variables will all be imputed via pmm
(predictive mean matching), which is
the default method for numeric data.
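One defining feature of pmm is that it borrows its imputations from observed donor values, so every imputed value must already occur in the data. A quick sanity check (a sketch; impPmm is a throwaway object created just for this purpose):

```r
# All pmm-imputed bmi values should be drawn from the observed bmi values,
# so this check should return TRUE.
impPmm <- mice(nhanes, method = "pmm", m = 5, seed = 235711, print = FALSE)
all(unlist(impPmm$imp$bmi) %in% impPmm$data$bmi)
```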
In reality, these data are a mix of numeric and categorical variables, and the nhanes2 dataset encodes the categorical variables with appropriate types.
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
In nhanes2, the age
variable is a factor comprising three categories, and
hyp
is a binary factor. If we run mice()
without specifying a method
argument, the algorithm will attempt to match the imputation method for each
column to the column’s type.
Let’s see what happens when we initialize a default method vector for the nhanes2 dataset.
(meth <- make.method(nhanes2))
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
The imputation method for hyp
is now set to logreg
, which imputes by
logistic regression (the default for binary factors). The mice()
algorithm
has recognized the type of the new hyp
variable and updated the imputation
method appropriately.
We can get an up-to-date overview of the available methods via the methods()
function.
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.mpmm mice.impute.norm
## [23] mice.impute.norm.boot mice.impute.norm.nob
## [25] mice.impute.norm.predict mice.impute.panImpute
## [27] mice.impute.passive mice.impute.pmm
## [29] mice.impute.polr mice.impute.polyreg
## [31] mice.impute.quadratic mice.impute.rf
## [33] mice.impute.ri mice.impute.sample
## [35] mice.mids mice.theme
## see '?methods' for accessing help and source code
More detail on the methods is available in the Details section of the
documentation for mice()
, and we can see detailed documentation for each
method through the help page for the appropriate mice.impute.METHOD()
function.
?mice
?mice.impute.lasso.select.logreg
In Question 6.4, you changed the imputation method for all variables by
setting a single value for the method
argument. We can also change the methods
on a variable-by-variable basis. The following code will change the imputation
method for bmi
to Bayesian linear regression while leaving the remaining
methods unaffected.
(meth["bmi"] <- "norm")
## [1] "norm"
To use this custom method vector, we simply supply meth
to the method
argument when we call mice()
.
imp <- mice(nhanes2, method = meth, print = FALSE)
Create a custom method vector for the nhanes2 data.
Define the following methods.
- age: None
- bmi: Classification and regression trees
- hyp: Logistic regression
- chl: Predictive mean matching

(meth <- c(age = "", bmi = "cart", hyp = "logreg", chl = "pmm"))
## age bmi hyp chl
## "" "cart" "logreg" "pmm"
## OR
meth <- make.method(nhanes2)
meth["bmi"] <- "cart"
meth
## age bmi hyp chl
## "" "cart" "logreg" "pmm"
Note that both of the above approaches produce the same result. The first approach is more direct and probably preferable for smaller datasets. The second approach can be more efficient when working with many variables, especially when you only need to use a few non-default methods.
Impute the missing values in the nhanes2 dataset.
Set up the imputation run as follows.
imp1 <- mice(nhanes2,
m = 15,
method = meth,
predictorMatrix = pred1,
seed = 235711,
print = FALSE)
Create trace plots and density plots for the mids object you created in Question 7.2.
plot(imp1)
densityplot(imp1)
By default, mice()
will use five iterations. Sometimes, these five will be
enough, but often we will want to run the algorithm longer. We can set the
number of iterations via the maxit
argument.
The following code will run the default imputation model for ten iterations.
imp <- mice(nhanes2, maxit = 10)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
## 6 1 bmi hyp chl
## 6 2 bmi hyp chl
## 6 3 bmi hyp chl
## 6 4 bmi hyp chl
## 6 5 bmi hyp chl
## 7 1 bmi hyp chl
## 7 2 bmi hyp chl
## 7 3 bmi hyp chl
## 7 4 bmi hyp chl
## 7 5 bmi hyp chl
## 8 1 bmi hyp chl
## 8 2 bmi hyp chl
## 8 3 bmi hyp chl
## 8 4 bmi hyp chl
## 8 5 bmi hyp chl
## 9 1 bmi hyp chl
## 9 2 bmi hyp chl
## 9 3 bmi hyp chl
## 9 4 bmi hyp chl
## 9 5 bmi hyp chl
## 10 1 bmi hyp chl
## 10 2 bmi hyp chl
## 10 3 bmi hyp chl
## 10 4 bmi hyp chl
## 10 5 bmi hyp chl
plot(imp)
Notice that now the x-axis in the trace plots extends to 10 to reflect the higher number of iterations.
One common reason to increase the number of iterations is non-convergence. We'll sometimes observe trends when we inspect the trace plots of the imputed items. These trends tell us that the algorithm has not yet reached an equilibrium point. In such situations, the first thing to try (so long as you don't notice any obvious errors in your imputation model specification) is increasing the number of iterations. While we could accomplish this goal simply by rerunning mice() with a higher value for maxit, doing so would be very wasteful: we would have to recreate all of the existing iterations before generating any new ones. Thankfully, we don't need to be so inefficient.
We can easily extend the chains in any mids object with the mice.mids() function. This function takes a mids object as its primary argument and picks up the sampling where mice() left off (using exactly the same imputation model specification). The following code will extend the chains in our most recent mids object to 25 total iterations by adding another 15 iterations.
imp <- mice.mids(imp, maxit = 15, print = FALSE)
plot(imp)
Now, imp contains 25 iterations, which we can see from the range of the x-axis in the trace plots.
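Trace plots are a visual check, but the mids object also stores the chain summaries that plot() draws, so we can inspect them numerically as well. A rough sketch (this assumes the imputation of nhanes2 above, where chl is among the imputed variables):

```r
# imp$chainMean is an array of chain means: variables x iterations x chains.
# Looking at the final iterations for chl lets us see whether the per-chain
# means are still trending or have settled down.
cm <- imp$chainMean["chl", , ]  # iterations x chains matrix
tail(cm, 3)                     # last three iterations, one column per chain
```

If the rows of this matrix still drift systematically in one direction, that echoes the trends we would look for in the trace plots.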
Increase the number of iterations for the imputation you ran in Question 7.2 to 25.
imp1 <- mice.mids(imp1, maxit = 20, print = FALSE)
Recall that even if the imputation model converges (e.g., as demonstrated by trace plots), the imputed values still might not be reasonable estimates of the missing data. If the data are MCAR, then the imputations should have the same distribution as the observed data. In general, though, the distributions of the observed and imputed data may differ when the data are MAR. However, we don’t want to see very large discrepancies.
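To supplement the density plots with numbers, we can also compare the observed values to the stacked imputed values directly. A rough sketch (using the imp1 object from Question 7.2; in a mids object, imp1$imp$chl holds one column of imputed chl values per imputation):

```r
# Compare the distribution of observed chl values with all imputed chl values.
obs  <- nhanes2$chl[!is.na(nhanes2$chl)]  # observed chl values
imps <- unlist(imp1$imp$chl)              # imputed chl values, all m stacked
summary(obs)
summary(imps)
```

Grossly different summaries would not automatically invalidate the imputations under MAR, but they would warrant a closer look.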
Strip plots are another excellent visualization tool with which we can assess the comparability of the imputed and observed data. The mice::stripplot() function will generate strip plots from a mids object. The following code will create a strip plot of the observed and imputed chl values.
stripplot(imp, chl)
The observed data are plotted in blue, and the imputed data are plotted in red. Since chl was imputed with PMM (which draws imputations from the observed data), the imputed values have the same gaps as the observed data, and the imputed values are always within the range of the observed data. This figure indicates that the observed and imputed values of chl follow similar distributions, so we can conclude that these imputations are plausible.
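Because PMM fills in missing values by borrowing observed values from donor cases, we can verify this property directly. A quick sketch:

```r
# Every PMM-imputed chl value should be one of the observed chl values,
# since PMM copies its imputations from observed donor cases.
imputed_chl <- unlist(imp$imp$chl)
all(imputed_chl %in% nhanes2$chl)  # should be TRUE for PMM
```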
We can also call stripplot() without specifying any variables. Doing so will plot all of the imputed variables in an array.
stripplot(imp)
Create trace plots, density plots, and strip plots for the mids object you created in Question 8.1.
Based on these visualizations, would you say the imputation model has converged to a valid solution?
plot(imp1)
densityplot(imp1)
stripplot(imp1)
Yes, these results look pretty good. The trace plots show the chains mixing well without any trends, and the density and strip plots show no extreme or implausible values.
Rerun the imputation from Question 8.1.
Keep all the same settings, but use Bayesian linear regression to impute bmi and chl.
meth <- imp1$method
meth[c("bmi", "chl")] <- "norm"
meth
## age bmi hyp chl
## "" "norm" "logreg" "norm"
imp2 <- mice(nhanes2,
m = 15,
method = meth,
predictorMatrix = pred1,
maxit = 25,
seed = 235711,
print = FALSE)
Create trace plots, density plots, and strip plots for the mids object you created in Question 9.2.
Based on the visualizations, are the imputations from Question 8.1 or Question 9.2 more reasonable? Why?
plot(imp2)
densityplot(imp2)
stripplot(imp2)
Both sets of imputations seem acceptable and reasonable, but each set also has its own potential issues. The donor-based imputations of bmi and chl from Question 8.1 show a tendency to cluster around the center of the distribution (hence the "sharp" spikes in some of the imputed densities). The normal-theory imputations from Question 9.2, on the other hand, contain some very high and very low values for both bmi and chl.
The choice between these two results probably comes down to which aspects of the distribution are more important for your application. If maintaining the same range as the observed data is important, then the PMM- and CART-based imputations are probably better. If a more accurate representation of variability is important (and the out-of-bound values don’t matter), then the normal-theory imputations may be preferable.
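If you want the variability of normal-theory imputation but also need to keep the imputations within a plausible range, mice supports post-processing of imputed values via the post argument. The sketch below uses mice::squeeze() to clamp imputed chl values to an illustrative range; the bounds of 100 and 300 are assumptions for demonstration, not values taken from this lab.

```r
post <- imp2$post
# After each imputation step, squeeze the imputed chl values into [100, 300].
# (Bounds are hypothetical; choose limits that make sense for your data.)
post["chl"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(100, 300))"
imp3 <- mice(nhanes2,
             m      = 15,
             method = meth,
             post   = post,
             seed   = 235711,
             print  = FALSE)
```

This keeps the norm-based draws but removes the out-of-bounds values, at the cost of slightly truncating the imputation distribution.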
Use the multiply imputed data you created in Question 8.1 to estimate the following regression model.
\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{chl} + \varepsilon\)
fit <- with(imp1, lm(bmi ~ age + chl))
Pool the estimates from the model you estimated in Question 9.4.
Is the effect of chl on bmi significant after controlling for age?
est <- pool(fit)
summary(est)
## term estimate std.error statistic df p.value
## 1 (Intercept) 20.41801164 4.21217336 4.847382 13.76014 0.0002714008
## 2 age40-59 -2.42177927 2.20075229 -1.100432 13.42765 0.2904806485
## 3 age60-99 -3.23774834 2.46989035 -1.310887 12.05394 0.2143209558
## 4 chl 0.03711272 0.02384541 1.556388 11.61866 0.1464214159
No, cholesterol level does not significantly predict BMI after controlling for age (\(\beta = 0.04\), \(t[11.62] = 1.56\), \(p = 0.146\)).
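Behind the scenes, pool() combines the per-imputation results with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance adds the within-imputation variance to an inflated between-imputation variance. A self-contained sketch with made-up numbers (not taken from this lab's output):

```r
# Rubin's rules by hand, using hypothetical results from m = 3 analyses.
est <- c(0.035, 0.040, 0.038)  # point estimates from the m fitted models
se  <- c(0.020, 0.022, 0.021)  # their standard errors
m   <- length(est)

qbar  <- mean(est)           # pooled point estimate
W     <- mean(se^2)          # within-imputation variance
B     <- var(est)            # between-imputation variance
total <- W + (1 + 1/m) * B   # total variance (Rubin, 1987)
sqrt(total)                  # pooled standard error
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error, reflecting the extra uncertainty due to the missing data.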
Rerun the regression analysis from above using the imputed data from Question 9.2.
Do the results differ between the two versions of the analysis? Why or why not?
est2 <- with(imp2, lm(bmi ~ age + chl)) %>% pool()
summary(est2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 18.76625504 4.6061039 4.074214 9.172722 0.00267377
## 2 age40-59 -3.63438601 2.2470109 -1.617431 11.528841 0.13279824
## 3 age60-99 -4.10858709 2.5738980 -1.596251 9.889733 0.14185676
## 4 chl 0.05183625 0.0232639 2.228184 10.123331 0.04968446
The estimates differ slightly due to the different imputation methods for bmi and chl. Also, these differences were just enough to tip the effect of chl over into statistical significance at the \(\alpha = 0.05\) level.
End of Lab 2a