In these lab exercises, you are going to start working with multiple imputation (MI). Specifically, you will do some simple MI-based analyses using the mice package.
Load the mice package.
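A minimal setup might look like the following (the install line is only needed the first time you use the package):

```r
# Install mice once, if needed, then load it for this session
# install.packages("mice")
library(mice)
```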
The mice package contains several datasets. Once the package is loaded, you can access these datasets in your workspace. In this lab, we will work with the nhanes dataset (Schafer, 1997, Table 6.14).
The nhanes dataset is a small dataset with non-monotone missing values. These data comprise 25 observations of four variables: age group, body mass index, hypertension, and cholesterol.
Unless otherwise specified, all subsequent questions refer to the nhanes dataset.
Use the mice() function to impute the missing data using stochastic regression
imputation. Set the random number seed by specifying a value for the seed
argument.
Use the mice::complete() function to create a completed dataset.
Use the imputed data to regress bmi onto age.
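One possible solution sketch. The seed value 235711 is arbitrary, and "norm.nob" (non-Bayesian linear regression that adds residual noise) is assumed here as the mice method corresponding to stochastic regression imputation:

```r
library(mice)

# Single stochastic regression imputation: "norm.nob" draws imputed values
# from the fitted regression plus random residual noise
imp1 <- mice(nhanes, method = "norm.nob", m = 1, seed = 235711, print = FALSE)

# Fill in the imputed values to create one completed dataset
dat1 <- complete(imp1)

# Regress bmi onto age using the completed data
summary(lm(bmi ~ age, data = dat1))
```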
Check the documentation for the mice() function.
Use the mice() function to multiply impute the nhanes data using all default
options.
The object returned by mice()
has the class mids (multiply imputed
data set). This object encapsulates all the relevant information mice()
produced when imputing the nhanes dataset (e.g., the original data, the
imputed values, the number of missing values, number of iterations, etc.).
Use the ls()
function to check the contents of the mids object you created
in Question 2.2.
What do you think is stored in the following slots?
data
imp
method
m
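To check your answers, you can inspect the slots directly. A sketch, assuming your mids object from Question 2.2 is called imp:

```r
imp <- mice(nhanes, seed = 235711, print = FALSE)

ls(imp)        # names of all slots in the mids object
head(imp$data) # the original, incomplete data
imp$imp$bmi    # the imputed values for bmi (one column per imputation)
imp$method     # the imputation method used for each column
imp$m          # the number of imputations
```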
Before proceeding any further, we need to check convergence. If the imputation model did not converge, or if the imputed values are not reasonable, there is no reason to proceed to the analysis phase.
Use the plot.mids()
function to create trace plots of the imputed data.
Use the mice::densityplot()
function to check the distributions of the
imputed values.
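For a mids object named imp, these convergence checks might look like:

```r
plot(imp)         # trace plots of the chain means and SDs per imputed variable
densityplot(imp)  # densities of imputed vs. observed values
```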
At various places above, you should have also seen that mice() created 5 sets
of imputations (which is the default). If we want more imputations (which we
frequently will), we can set an appropriate value for the m argument in the
mice() function.
Redo the imputation from Question 2.2.
This time, change the following settings.
If we want to create a set of stand-alone multiply imputed datasets, we use the
mice::complete()
function to do so.
When completing the data with stochastic regression imputation, we only had one
sensible way to return the result because we only created a single set of
imputations. MI, on the other hand, creates m different sets of imputations,
so we have several ways to format the completed datasets. We can select between
these options via the action
argument in the mice::complete()
function.
If we specify a single number for the action
argument, the function will
return a single completed dataset corresponding to the requested imputation
number (e.g., the third imputation in the following code).
imp <- mice(nhanes, m = 3, print = FALSE)
complete(imp, 3)
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
We can also request that all m imputed datasets be collapsed into a single block of data, using one of three structures.
action = "long" row-binds the imputed datasets one atop the other.
action = "broad" column-binds the imputed datasets side-by-side.
action = "repeated" does the same as "broad" but reorders the variables so the
m replicates of each variable form a contiguous block of data.
complete(imp, "long")
## age bmi hyp chl .imp .id
## 1 1 35.3 2 206 1 1
## 2 2 22.7 1 187 1 2
## 3 1 27.2 1 187 1 3
## 4 3 25.5 1 204 1 4
## 5 1 20.4 1 113 1 5
## 6 3 21.7 2 184 1 6
## 7 1 22.5 1 118 1 7
## 8 1 30.1 1 187 1 8
## 9 2 22.0 1 238 1 9
## 10 2 22.7 1 187 1 10
## 11 1 22.5 1 187 1 11
## 12 2 22.7 1 187 1 12
## 13 3 21.7 1 206 1 13
## 14 2 28.7 2 204 1 14
## 15 1 29.6 1 199 1 15
## 16 1 22.0 1 131 1 16
## 17 3 27.2 2 284 1 17
## 18 2 26.3 2 199 1 18
## 19 1 35.3 1 218 1 19
## 20 3 25.5 2 206 1 20
## 21 1 22.0 1 187 1 21
## 22 1 33.2 1 229 1 22
## 23 1 27.5 1 131 1 23
## 24 3 24.9 1 184 1 24
## 25 2 27.4 1 186 1 25
## 26 1 27.4 1 131 2 1
## 27 2 22.7 1 187 2 2
## 28 1 30.1 1 187 2 3
## 29 3 22.5 1 206 2 4
## 30 1 20.4 1 113 2 5
## 31 3 21.7 1 184 2 6
## 32 1 22.5 1 118 2 7
## 33 1 30.1 1 187 2 8
## 34 2 22.0 1 238 2 9
## 35 2 30.1 1 184 2 10
## 36 1 35.3 1 199 2 11
## 37 2 22.7 1 131 2 12
## 38 3 21.7 1 206 2 13
## 39 2 28.7 2 204 2 14
## 40 1 29.6 1 187 2 15
## 41 1 25.5 1 238 2 16
## 42 3 27.2 2 284 2 17
## 43 2 26.3 2 199 2 18
## 44 1 35.3 1 218 2 19
## 45 3 25.5 2 218 2 20
## 46 1 33.2 1 199 2 21
## 47 1 33.2 1 229 2 22
## 48 1 27.5 1 131 2 23
## 49 3 24.9 1 204 2 24
## 50 2 27.4 1 186 2 25
## 51 1 27.2 2 199 3 1
## 52 2 22.7 1 187 3 2
## 53 1 26.3 1 187 3 3
## 54 3 27.5 2 206 3 4
## 55 1 20.4 1 113 3 5
## 56 3 24.9 2 184 3 6
## 57 1 22.5 1 118 3 7
## 58 1 30.1 1 187 3 8
## 59 2 22.0 1 238 3 9
## 60 2 22.7 2 186 3 10
## 61 1 22.0 1 187 3 11
## 62 2 35.3 1 206 3 12
## 63 3 21.7 1 206 3 13
## 64 2 28.7 2 204 3 14
## 65 1 29.6 1 131 3 15
## 66 1 35.3 1 199 3 16
## 67 3 27.2 2 284 3 17
## 68 2 26.3 2 199 3 18
## 69 1 35.3 1 218 3 19
## 70 3 25.5 2 186 3 20
## 71 1 26.3 1 187 3 21
## 72 1 33.2 1 229 3 22
## 73 1 27.5 1 131 3 23
## 74 3 24.9 1 184 3 24
## 75 2 27.4 1 186 3 25
complete(imp, "broad")
## age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1 1 35.3 2 206 1 27.4 1 131 1 27.2 2 199
## 2 2 22.7 1 187 2 22.7 1 187 2 22.7 1 187
## 3 1 27.2 1 187 1 30.1 1 187 1 26.3 1 187
## 4 3 25.5 1 204 3 22.5 1 206 3 27.5 2 206
## 5 1 20.4 1 113 1 20.4 1 113 1 20.4 1 113
## 6 3 21.7 2 184 3 21.7 1 184 3 24.9 2 184
## 7 1 22.5 1 118 1 22.5 1 118 1 22.5 1 118
## 8 1 30.1 1 187 1 30.1 1 187 1 30.1 1 187
## 9 2 22.0 1 238 2 22.0 1 238 2 22.0 1 238
## 10 2 22.7 1 187 2 30.1 1 184 2 22.7 2 186
## 11 1 22.5 1 187 1 35.3 1 199 1 22.0 1 187
## 12 2 22.7 1 187 2 22.7 1 131 2 35.3 1 206
## 13 3 21.7 1 206 3 21.7 1 206 3 21.7 1 206
## 14 2 28.7 2 204 2 28.7 2 204 2 28.7 2 204
## 15 1 29.6 1 199 1 29.6 1 187 1 29.6 1 131
## 16 1 22.0 1 131 1 25.5 1 238 1 35.3 1 199
## 17 3 27.2 2 284 3 27.2 2 284 3 27.2 2 284
## 18 2 26.3 2 199 2 26.3 2 199 2 26.3 2 199
## 19 1 35.3 1 218 1 35.3 1 218 1 35.3 1 218
## 20 3 25.5 2 206 3 25.5 2 218 3 25.5 2 186
## 21 1 22.0 1 187 1 33.2 1 199 1 26.3 1 187
## 22 1 33.2 1 229 1 33.2 1 229 1 33.2 1 229
## 23 1 27.5 1 131 1 27.5 1 131 1 27.5 1 131
## 24 3 24.9 1 184 3 24.9 1 204 3 24.9 1 184
## 25 2 27.4 1 186 2 27.4 1 186 2 27.4 1 186
complete(imp, "repeated")
## age.1 age.2 age.3 bmi.1 bmi.2 bmi.3 hyp.1 hyp.2 hyp.3 chl.1 chl.2 chl.3
## 1 1 1 1 35.3 27.4 27.2 2 1 2 206 131 199
## 2 2 2 2 22.7 22.7 22.7 1 1 1 187 187 187
## 3 1 1 1 27.2 30.1 26.3 1 1 1 187 187 187
## 4 3 3 3 25.5 22.5 27.5 1 1 2 204 206 206
## 5 1 1 1 20.4 20.4 20.4 1 1 1 113 113 113
## 6 3 3 3 21.7 21.7 24.9 2 1 2 184 184 184
## 7 1 1 1 22.5 22.5 22.5 1 1 1 118 118 118
## 8 1 1 1 30.1 30.1 30.1 1 1 1 187 187 187
## 9 2 2 2 22.0 22.0 22.0 1 1 1 238 238 238
## 10 2 2 2 22.7 30.1 22.7 1 1 2 187 184 186
## 11 1 1 1 22.5 35.3 22.0 1 1 1 187 199 187
## 12 2 2 2 22.7 22.7 35.3 1 1 1 187 131 206
## 13 3 3 3 21.7 21.7 21.7 1 1 1 206 206 206
## 14 2 2 2 28.7 28.7 28.7 2 2 2 204 204 204
## 15 1 1 1 29.6 29.6 29.6 1 1 1 199 187 131
## 16 1 1 1 22.0 25.5 35.3 1 1 1 131 238 199
## 17 3 3 3 27.2 27.2 27.2 2 2 2 284 284 284
## 18 2 2 2 26.3 26.3 26.3 2 2 2 199 199 199
## 19 1 1 1 35.3 35.3 35.3 1 1 1 218 218 218
## 20 3 3 3 25.5 25.5 25.5 2 2 2 206 218 186
## 21 1 1 1 22.0 33.2 26.3 1 1 1 187 199 187
## 22 1 1 1 33.2 33.2 33.2 1 1 1 229 229 229
## 23 1 1 1 27.5 27.5 27.5 1 1 1 131 131 131
## 24 3 3 3 24.9 24.9 24.9 1 1 1 184 204 184
## 25 2 2 2 27.4 27.4 27.4 1 1 1 186 186 186
The list format produced by action = "all" is, by far, the most useful. This
option returns the imputed datasets in an m-element list where each element
contains one completed dataset. Since R works naturally with lists, you will
almost certainly want to use the action = "all" format when completing multiply
imputed datasets for analysis in R.
complete(imp, "all")
## $`1`
## age bmi hyp chl
## 1 1 35.3 2 206
## 2 2 22.7 1 187
## 3 1 27.2 1 187
## 4 3 25.5 1 204
## 5 1 20.4 1 113
## 6 3 21.7 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 1 187
## 11 1 22.5 1 187
## 12 2 22.7 1 187
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 199
## 16 1 22.0 1 131
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 206
## 21 1 22.0 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## $`2`
## age bmi hyp chl
## 1 1 27.4 1 131
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 22.5 1 206
## 5 1 20.4 1 113
## 6 3 21.7 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 30.1 1 184
## 11 1 35.3 1 199
## 12 2 22.7 1 131
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 25.5 1 238
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 218
## 21 1 33.2 1 199
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 204
## 25 2 27.4 1 186
##
## $`3`
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## attr(,"class")
## [1] "mild" "list"
Use the mice::complete() function to generate a list of multiply imputed
datasets from the imputations you created in Question 3.1.
Save the result as miList.
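One way to do this, assuming your mids object from Question 3.1 is called imp:

```r
# A list with one completed dataset per imputation
miList <- complete(imp, "all")

length(miList)  # should equal the number of imputations, m
```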
Use the mids object you created in Question 3.1 to answer the questions in this section.
After generating the multiply imputed datasets as we did above, the next step in an MI-based analysis is the so-called analysis phase. In the analysis phase, we fit a substantively interesting analysis model to each of the imputed datasets.
When working with mice, the simplest way to implement the analysis phase is
via the with.mids()
function. For example, the following code regresses bmi
onto chl
using the 3 imputed datasets created above.
fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
##
## call1 :
## mice(data = nhanes, m = 3, printFlag = FALSE)
##
## nmis :
## [1] 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 17.26467 0.04527
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 21.44927 0.02879
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.43439 0.03409
As you can see, the fit
object simply contains 3 fitted lm
objects. Each
of these objects represents a regression model fit to one of the imputed datasets
in the imp
object.
The fit
object returned by with.mids()
has the class mira (multiply
imputed repeated analyses).
class(fit)
## [1] "mira"
ls(fit)
## [1] "analyses" "call" "call1" "nmis"
The actual fitted models live in the analyses
slot of the mira
object. Each
of these fitted models will behave according to their individual types (e.g., a
fitted lm
object inside a mira
object will behave as any other lm
object
would).
library(dplyr)
fit$analyses[[1]] %>% summary()
##
## Call:
## lm(formula = bmi ~ chl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0387 -3.0300 -0.9996 2.2004 8.7099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.26467 4.30626 4.009 0.00055 ***
## chl 0.04527 0.02227 2.033 0.05374 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.044 on 23 degrees of freedom
## Multiple R-squared: 0.1523, Adjusted R-squared: 0.1155
## F-statistic: 4.134 on 1 and 23 DF, p-value: 0.05374
fit$analyses[[2]] %>% coef()
## (Intercept) chl
## 21.44927435 0.02878946
fit$analyses[[3]] %>% resid()
## 1 2 3 4 5 6
## -0.01769018 -4.10864703 -0.50864703 0.04370132 -3.88621430 -1.80638624
## 7 8 9 10 11 12
## -1.95664894 3.29135297 -6.54708040 -4.07456010 -4.80864703 7.84370132
## 13 14 15 16 17 18
## -5.75629868 1.31187518 4.70022098 8.08230982 -2.91507913 -0.91769018
## 19 20 21 22 23 24
## 7.43465818 -1.27456010 -0.50864703 4.95970196 2.60022098 -1.80638624
## 25
## 0.62543990
Use with.mids() to regress bmi onto age and hyp.
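A sketch of this analysis phase, assuming the mids object is called imp:

```r
# Fit the same regression model to each of the m imputed datasets
fit2 <- with(imp, lm(bmi ~ age + hyp))

fit2$analyses[[1]]  # the model fitted to the first imputed dataset
```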
The final step in an MI-based analysis is the pooling phase wherein we
aggregate the multiple model fits from the analysis phase. If we’ve implemented
the analysis phase via the with.mids()
function, then we can pool using the
mice::pool()
function. For example, the following code will pool the regression
models estimated above.
poolFit <- pool(fit)
summary(poolFit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 19.71611122 5.04500901 3.908043 10.64744 0.002596098
## 2 chl 0.03604852 0.02456045 1.467747 14.68944 0.163257414
The poolFit
object has class mipo
(multiply imputed pooled object).
The mipo
object contains the pooled regression coefficients and standard
errors (pooled with the Rubin, 1987, rules), as well as some other
useful statistics (e.g., the fraction of missing information, fmi
).
ls(poolFit)
## [1] "call" "glanced" "m" "pooled"
ls(poolFit$pooled)
## [1] "b" "df" "dfcom" "estimate" "fmi" "lambda"
## [7] "m" "riv" "t" "term" "ubar"
poolFit$pooled$fmi
## [1] 0.3595712 0.2518258
Use the mice::pool()
function to pool the estimates you generated in
Question 5.1.
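A sketch of the pooling phase, assuming the mira object from Question 5.1 is called fit2:

```r
# Pool the m sets of estimates with Rubin's rules
pooled <- pool(fit2)
summary(pooled)  # pooled estimates, standard errors, and p-values
```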
What inferences can you draw from the MI estimates?
The workflow demonstrated above applies to many different types of analysis. As
long as the estimation function provides coef()
and vcov()
methods, this
workflow should succeed. In Lab 2c, we will explore alternative workflows that
you can use in situations where the mice()
\(\rightarrow\) with.mids()
\(\rightarrow\) pool()
process doesn’t work.
By default, mice()
will impute each incomplete variable using some flavor of
univariate supervised model wherein all other variables in the dataset act as
predictors. These individual, univariate models are called the elementary
imputation models (EIMs), and we are completely free to adjust their
specification. One such adjustment that we often want to make is restricting
which variables a given EIM uses as predictors.
In mice()
we adjust the right hand side (RHS) of the EIMs via the
predictorMatrix
argument.
The predictor matrix is stored in the predictorMatrix
slot of the mids
object. Let’s look at a default predictor matrix.
imp$predictorMatrix
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each row in the predictor matrix is a pattern vector that defines the predictors
used in the EIM for each variable in the dataset. Variables that are not imputed
still get a row in the predictor matrix; their rows are simply ignored by mice().
Consequently, the predictor matrix is square. A value of 1 in a matrix entry
indicates that the column variable is used as a predictor in the EIM for the row
variable. For example, the 1 at position [2, 1] indicates that the variable age
was used in the imputation model for bmi. Note that the diagonal is zero
because a variable cannot impute its own missing data.
We are not forced to use this default specification; mice gives us complete
control over the predictor matrix, so we can choose our own predictors for each
EIM in our particular problem. This flexibility can be very useful, for example,
when you have many variables or when you have clear prior knowledge about
relations in the data.
We can use the mice::make.predictorMatrix()
function to initialize a default
predictor matrix, without doing any imputation. We can then modify this matrix
to fit our needs.
(pred <- make.predictorMatrix(nhanes))
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred
now contains the default predictor matrix for the nhanes
data. Altering the predictor matrix and using the updated version for imputation
is very simple. For example, the following code removes the variable hyp
from
the set of predictors in each EIM.
pred[ , "hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
We use the modified predictor matrix by supplying it to the predictorMatrix
argument in mice()
.
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 235711)
Create a custom predictor matrix for the nhanes data.
Specify the EIMs as follows:
The mice::quickpred()
function provides a simple way to construct predictor
matrices via various selection rules. The following code will select a variable
as a predictor when it correlates at least \(\rho = 0.30\) with the incomplete
variable.
(pred <- quickpred(nhanes, mincor = 0.3))
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
Check the documentation for quickpred()
.
Take note of the different selection criteria you can apply.
Use quickpred()
to create a predictor matrix for the nhanes data.
This predictor matrix should satisfy the following conditions.
age as a predictor in every EIM.
Which variables are selected as predictors for each EIM?
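One way to force age into every EIM is the include argument of quickpred(); the mincor threshold shown here is only an example value, not the one the question requires:

```r
# Select predictors by correlation, but always include age
pred <- quickpred(nhanes, mincor = 0.3, include = "age")
pred  # inspect which predictors were selected for each EIM
```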
Use the predictor matrix you created in Question 6.3 to impute the nhanes data.
Use default settings except for the following options.
Impute every variable using the norm method (i.e., set method = "norm").
For each column in the dataset, we must specify an imputation method. The
method
slot in a mids object shows which methods were applied to each column.
The following code shows the default methods that would be applied if we impute
the nhanes data.
make.method(nhanes)
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The age
variable is completely observed, so it is not imputed. The empty
string, ""
, tells mice()
not to impute the corresponding variable. The other
variables will all be imputed via pmm
(predictive mean matching), which is
the default method for numeric data.
In reality, these data are a mix of numeric and categorical variables. The nhanes2 dataset contains appropriately typed categorical variables.
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
In nhanes2, the age
variable is a factor comprising three categories, and
hyp
is a binary factor. If we run mice()
without specifying a method
argument, the algorithm will attempt to match the imputation method for each
column to the column’s type.
Let’s see what happens when we initialize a default method vector for the nhanes2 dataset.
(meth <- make.method(nhanes2))
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
The imputation method for hyp
is now set to logreg
, which imputes by
logistic regression (the default for binary factors). The mice()
algorithm
has recognized the type of the new hyp
variable and updated the imputation
method appropriately.
We can get an up-to-date overview of the available methods via the methods()
function.
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.mpmm mice.impute.norm
## [23] mice.impute.norm.boot mice.impute.norm.nob
## [25] mice.impute.norm.predict mice.impute.panImpute
## [27] mice.impute.passive mice.impute.pmm
## [29] mice.impute.polr mice.impute.polyreg
## [31] mice.impute.quadratic mice.impute.rf
## [33] mice.impute.ri mice.impute.sample
## [35] mice.mids mice.theme
## see '?methods' for accessing help and source code
More detail on the methods is available in the Details section of the
documentation for mice()
, and we can see detailed documentation for each
method through the help page for the appropriate mice.impute.METHOD()
function.
?mice
?mice.impute.lasso.select.logreg
In Question 6.4, you changed the imputation method for all variables by
setting a single value for the method
argument. We can also change the methods
on a variable-by-variable basis. The following code will change the imputation
method for bmi
to Bayesian linear regression while leaving the remaining
methods unaffected.
(meth["bmi"] <- "norm")
## [1] "norm"
To use this custom method vector, we simply supply meth
to the method
argument when we call mice()
.
imp <- mice(nhanes2, method = meth, print = FALSE)
Create a custom method vector for the nhanes2 data.
Define the following methods.
age: None
bmi: Classification and regression trees
hyp: Logistic regression
chl: Predictive mean matching
By default, mice() will use five iterations. Sometimes, these five will be
will use five iterations. Sometimes, these five will be
enough, but often we will want to run the algorithm longer. We can set the
number of iterations via the maxit
argument.
The following code will run the default imputation model for ten iterations.
imp <- mice(nhanes2, maxit = 10)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
## 6 1 bmi hyp chl
## 6 2 bmi hyp chl
## 6 3 bmi hyp chl
## 6 4 bmi hyp chl
## 6 5 bmi hyp chl
## 7 1 bmi hyp chl
## 7 2 bmi hyp chl
## 7 3 bmi hyp chl
## 7 4 bmi hyp chl
## 7 5 bmi hyp chl
## 8 1 bmi hyp chl
## 8 2 bmi hyp chl
## 8 3 bmi hyp chl
## 8 4 bmi hyp chl
## 8 5 bmi hyp chl
## 9 1 bmi hyp chl
## 9 2 bmi hyp chl
## 9 3 bmi hyp chl
## 9 4 bmi hyp chl
## 9 5 bmi hyp chl
## 10 1 bmi hyp chl
## 10 2 bmi hyp chl
## 10 3 bmi hyp chl
## 10 4 bmi hyp chl
## 10 5 bmi hyp chl
plot(imp)
Notice that now the x-axis in the trace plots extends to 10 to reflect the higher number of iterations.
One common reason to increase the number of iterations is non-convergence. We’ll
sometimes observe trends when we inspect the trace plots of the imputed items.
These trends tell us that the algorithm has not yet reached an equilibrium point.
In such situations, the first thing to try (so long as you don’t notice any
obvious errors in your imputation model specification) is increasing the number
of iterations. While we could accomplish this goal simply by rerunning mice()
with a higher value for maxit
, doing so would be very wasteful. We would have
to recreate all of the existing iterations before generating any new iterations.
Thankfully, we don’t need to be so inefficient.
We can easily extend the chains in any mids object with the mice.mids()
function. This function takes a mids object as its primary argument and picks
up the sampling where mice()
left off (using exactly the same imputation model
specification). The following code will extend the chains in our most recent
mids object to 25 total iterations by adding on another 15 iterations.
imp <- mice.mids(imp, maxit = 15, print = FALSE)
plot(imp)
Now, imp
contains 25 iterations, which we can see by the range of the x-axis
in the trace plots.
Recall that even if the imputation model converges (e.g., as demonstrated by trace plots), the imputed values still might not be reasonable estimates of the missing data. If the data are MCAR, then the imputations should have the same distribution as the observed data. In general, though, the distributions of the observed and imputed data may differ when the data are MAR. However, we don’t want to see very large discrepancies.
Strip plots are another excellent visualization tool with which we can assess
the comparability of the imputed and observed data. The mice::stripplot()
function will generate strip plots from a mids object.
The following code will create a strip plot of the observed and imputed chl
values.
stripplot(imp, chl)
The observed data are plotted in blue, and the imputed data are plotted in red.
Since chl
was imputed with PMM (which draws imputations from the observed data),
the imputed values have the same gaps as the observed data, and the imputed
values are always within the range of the observed data. This figure indicates
that the observed and imputed values of chl
follow similar distributions. So,
we should conclude that these imputations are plausible.
We can also call stripplot()
without specifying any variables. Doing so will
plot all of the imputed variables in an array.
stripplot(imp)
Create trace plots, density plots, and strip plots for the mids object you created in Question 8.1.
Based on these visualizations, would you say the imputation model has converged to a valid solution?
Rerun the imputation from Question 8.1.
Keep all the same settings, but use Bayesian linear regression to impute bmi
and chl.
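One way to switch those two methods, assuming Question 8.1 imputed the nhanes2 data ("norm" is Bayesian linear regression in mice; any other settings from Question 8.1 would carry over from your earlier call):

```r
meth <- make.method(nhanes2)
meth[c("bmi", "chl")] <- "norm"  # Bayesian linear regression for bmi and chl

imp2 <- mice(nhanes2, method = meth, seed = 235711, print = FALSE)
```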
Create trace plots, density plots, and strip plots for the mids object you created in Question 9.2.
Based on the visualizations, are the imputations from Question 8.1 or Question 9.2 more reasonable? Why?
Use the multiply imputed data you created in Question 8.1 to estimate the following regression model.
\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{chl} + \varepsilon\)
Pool the estimates from the model you estimated in Question 9.4.
Is the effect of chl on bmi significant after controlling for age?
Rerun the regression analysis from above using the imputed data from Question 9.2.
Do the results differ between the two versions of the analysis? Why or why not?
End of Lab 2a