In these lab exercises, you will start working with multiple imputation (MI). Specifically, you will run some simple MI-based analyses using the mice package.


1 Single Imputation


1.1

Load the mice package.


The mice package contains several datasets. Once the package is loaded, you can access these datasets in your workspace. In this lab, we will work with the nhanes dataset (Schafer, 1997, Table 6.14).

The nhanes dataset is a small data set with non-monotone missing values. These data comprise 25 observations of four variables: age group, body mass index, hypertension, and cholesterol.

Unless otherwise specified, all subsequent questions refer to the nhanes dataset.
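
For example, once mice is loaded, you can inspect the nhanes data and its missing-data pattern directly (a minimal sketch; md.pattern() summarizes which combinations of variables are observed and missing):

library(mice)

# Inspect the data and its missing-data pattern
head(nhanes)
summary(nhanes)
md.pattern(nhanes)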


1.2

Use the mice() function to impute the missing data using stochastic regression imputation.

Set the random number seed by specifying a value for the seed argument.
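
One possible approach is sketched below; it assumes we use the "norm.nob" method (regression imputation with a random residual draw) as our flavor of stochastic regression imputation. The object name and seed value are arbitrary choices.

# Single stochastic regression imputation (m = 1); the seed value is arbitrary
imp1 <- mice(nhanes, method = "norm.nob", m = 1, seed = 235711, print = FALSE)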


1.3

Use the mice::complete() function to create a completed dataset.


1.4

Use the imputed data to regress bmi onto age.
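
Continuing the sketch from Question 1.2 (the object names imp1, dat1, and fit1 are assumptions, not part of the exercise):

# Fill in the imputed values and fit the regression on the completed data
dat1 <- complete(imp1)
fit1 <- lm(bmi ~ age, data = dat1)
summary(fit1)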


2 Multiple Imputation


2.1

Check the documentation for the mice() function.


2.2

Use the mice() function to multiply impute the nhanes data using all default options.


The object returned by mice() has the class mids (multiply imputed data set). This object encapsulates all the relevant information mice() produced when imputing the nhanes dataset (e.g., the original data, the imputed values, the number of missing values, and the number of iterations).
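
A minimal sketch of such a default run might look as follows (the print argument only suppresses the iteration log, and the object name imp is arbitrary):

imp <- mice(nhanes, print = FALSE)
class(imp)  # should be "mids"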


2.3

Use the ls() function to check the contents of the mids object you created in Question 2.2.

What do you think is stored in the following slots?

  • data
  • imp
  • method
  • m

Before proceeding any further, we need to check convergence. If the imputation model did not converge, or if the imputed values are not reasonable, there is no reason to proceed to the analysis phase.


2.4

Use the plot.mids() function to create trace plots of the imputed data.


2.5

Use the mice::densityplot() function to check the distributions of the imputed values.
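
Assuming imp is the mids object from Question 2.2, these convergence and plausibility checks might look like this:

plot(imp)           # trace plots of the chain means and standard deviations
densityplot(imp)    # compare the densities of observed and imputed values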


3 Creating More Imputations


At various places above, you should have seen that mice() created 5 sets of imputations (the default). If we want more imputations (as we frequently will), we can set an appropriate value for the m argument of mice().
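
For example, the following sketch requests 20 imputations (the object name imp20 and the seed value are arbitrary choices):

imp20 <- mice(nhanes, m = 20, seed = 235711, print = FALSE)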


3.1

Redo the imputation from Question 2.2.

This time, change the following settings.

  1. Create 20 imputations.
  2. Set a random number seed.

3.2

Check the convergence of the imputation model from Question 3.1 as you did before.


4 Output Formats


If we want to create a set of stand-alone multiply imputed datasets, we use the mice::complete() function to do so.

When completing the data with stochastic regression imputation, we only had one sensible way to return the result because we only created a single set of imputations. MI, on the other hand, creates m different sets of imputations, so we have several ways to format the completed datasets. We can select between these options via the action argument in the mice::complete() function.

If we specify a single number for the action argument, the function will return a single completed dataset corresponding to the requested imputation number (e.g., the third imputation in the following code).

imp <- mice(nhanes, m = 3, print = FALSE)

complete(imp, 3) 
##    age  bmi hyp chl
## 1    1 27.2   2 199
## 2    2 22.7   1 187
## 3    1 26.3   1 187
## 4    3 27.5   2 206
## 5    1 20.4   1 113
## 6    3 24.9   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 22.7   2 186
## 11   1 22.0   1 187
## 12   2 35.3   1 206
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 131
## 16   1 35.3   1 199
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 186
## 21   1 26.3   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 184
## 25   2 27.4   1 186

We can also request that all m imputed datasets be collapsed into a single block of data, in one of three different structures.

  • action = "long" row-binds the imputed datasets one atop the other.
  • action = "broad" column-binds the imputed datasets side-by-side.
  • action = "repeated" does the same as "broad" but reorders the variables so the m replicates of each variable form a contiguous block of data.

complete(imp, "long")
##    age  bmi hyp chl .imp .id
## 1    1 35.3   2 206    1   1
## 2    2 22.7   1 187    1   2
## 3    1 27.2   1 187    1   3
## 4    3 25.5   1 204    1   4
## 5    1 20.4   1 113    1   5
## 6    3 21.7   2 184    1   6
## 7    1 22.5   1 118    1   7
## 8    1 30.1   1 187    1   8
## 9    2 22.0   1 238    1   9
## 10   2 22.7   1 187    1  10
## 11   1 22.5   1 187    1  11
## 12   2 22.7   1 187    1  12
## 13   3 21.7   1 206    1  13
## 14   2 28.7   2 204    1  14
## 15   1 29.6   1 199    1  15
## 16   1 22.0   1 131    1  16
## 17   3 27.2   2 284    1  17
## 18   2 26.3   2 199    1  18
## 19   1 35.3   1 218    1  19
## 20   3 25.5   2 206    1  20
## 21   1 22.0   1 187    1  21
## 22   1 33.2   1 229    1  22
## 23   1 27.5   1 131    1  23
## 24   3 24.9   1 184    1  24
## 25   2 27.4   1 186    1  25
## 26   1 27.4   1 131    2   1
## 27   2 22.7   1 187    2   2
## 28   1 30.1   1 187    2   3
## 29   3 22.5   1 206    2   4
## 30   1 20.4   1 113    2   5
## 31   3 21.7   1 184    2   6
## 32   1 22.5   1 118    2   7
## 33   1 30.1   1 187    2   8
## 34   2 22.0   1 238    2   9
## 35   2 30.1   1 184    2  10
## 36   1 35.3   1 199    2  11
## 37   2 22.7   1 131    2  12
## 38   3 21.7   1 206    2  13
## 39   2 28.7   2 204    2  14
## 40   1 29.6   1 187    2  15
## 41   1 25.5   1 238    2  16
## 42   3 27.2   2 284    2  17
## 43   2 26.3   2 199    2  18
## 44   1 35.3   1 218    2  19
## 45   3 25.5   2 218    2  20
## 46   1 33.2   1 199    2  21
## 47   1 33.2   1 229    2  22
## 48   1 27.5   1 131    2  23
## 49   3 24.9   1 204    2  24
## 50   2 27.4   1 186    2  25
## 51   1 27.2   2 199    3   1
## 52   2 22.7   1 187    3   2
## 53   1 26.3   1 187    3   3
## 54   3 27.5   2 206    3   4
## 55   1 20.4   1 113    3   5
## 56   3 24.9   2 184    3   6
## 57   1 22.5   1 118    3   7
## 58   1 30.1   1 187    3   8
## 59   2 22.0   1 238    3   9
## 60   2 22.7   2 186    3  10
## 61   1 22.0   1 187    3  11
## 62   2 35.3   1 206    3  12
## 63   3 21.7   1 206    3  13
## 64   2 28.7   2 204    3  14
## 65   1 29.6   1 131    3  15
## 66   1 35.3   1 199    3  16
## 67   3 27.2   2 284    3  17
## 68   2 26.3   2 199    3  18
## 69   1 35.3   1 218    3  19
## 70   3 25.5   2 186    3  20
## 71   1 26.3   1 187    3  21
## 72   1 33.2   1 229    3  22
## 73   1 27.5   1 131    3  23
## 74   3 24.9   1 184    3  24
## 75   2 27.4   1 186    3  25
complete(imp, "broad")
##    age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1      1  35.3     2   206     1  27.4     1   131     1  27.2     2   199
## 2      2  22.7     1   187     2  22.7     1   187     2  22.7     1   187
## 3      1  27.2     1   187     1  30.1     1   187     1  26.3     1   187
## 4      3  25.5     1   204     3  22.5     1   206     3  27.5     2   206
## 5      1  20.4     1   113     1  20.4     1   113     1  20.4     1   113
## 6      3  21.7     2   184     3  21.7     1   184     3  24.9     2   184
## 7      1  22.5     1   118     1  22.5     1   118     1  22.5     1   118
## 8      1  30.1     1   187     1  30.1     1   187     1  30.1     1   187
## 9      2  22.0     1   238     2  22.0     1   238     2  22.0     1   238
## 10     2  22.7     1   187     2  30.1     1   184     2  22.7     2   186
## 11     1  22.5     1   187     1  35.3     1   199     1  22.0     1   187
## 12     2  22.7     1   187     2  22.7     1   131     2  35.3     1   206
## 13     3  21.7     1   206     3  21.7     1   206     3  21.7     1   206
## 14     2  28.7     2   204     2  28.7     2   204     2  28.7     2   204
## 15     1  29.6     1   199     1  29.6     1   187     1  29.6     1   131
## 16     1  22.0     1   131     1  25.5     1   238     1  35.3     1   199
## 17     3  27.2     2   284     3  27.2     2   284     3  27.2     2   284
## 18     2  26.3     2   199     2  26.3     2   199     2  26.3     2   199
## 19     1  35.3     1   218     1  35.3     1   218     1  35.3     1   218
## 20     3  25.5     2   206     3  25.5     2   218     3  25.5     2   186
## 21     1  22.0     1   187     1  33.2     1   199     1  26.3     1   187
## 22     1  33.2     1   229     1  33.2     1   229     1  33.2     1   229
## 23     1  27.5     1   131     1  27.5     1   131     1  27.5     1   131
## 24     3  24.9     1   184     3  24.9     1   204     3  24.9     1   184
## 25     2  27.4     1   186     2  27.4     1   186     2  27.4     1   186
complete(imp, "repeated")
##    age.1 age.2 age.3 bmi.1 bmi.2 bmi.3 hyp.1 hyp.2 hyp.3 chl.1 chl.2 chl.3
## 1      1     1     1  35.3  27.4  27.2     2     1     2   206   131   199
## 2      2     2     2  22.7  22.7  22.7     1     1     1   187   187   187
## 3      1     1     1  27.2  30.1  26.3     1     1     1   187   187   187
## 4      3     3     3  25.5  22.5  27.5     1     1     2   204   206   206
## 5      1     1     1  20.4  20.4  20.4     1     1     1   113   113   113
## 6      3     3     3  21.7  21.7  24.9     2     1     2   184   184   184
## 7      1     1     1  22.5  22.5  22.5     1     1     1   118   118   118
## 8      1     1     1  30.1  30.1  30.1     1     1     1   187   187   187
## 9      2     2     2  22.0  22.0  22.0     1     1     1   238   238   238
## 10     2     2     2  22.7  30.1  22.7     1     1     2   187   184   186
## 11     1     1     1  22.5  35.3  22.0     1     1     1   187   199   187
## 12     2     2     2  22.7  22.7  35.3     1     1     1   187   131   206
## 13     3     3     3  21.7  21.7  21.7     1     1     1   206   206   206
## 14     2     2     2  28.7  28.7  28.7     2     2     2   204   204   204
## 15     1     1     1  29.6  29.6  29.6     1     1     1   199   187   131
## 16     1     1     1  22.0  25.5  35.3     1     1     1   131   238   199
## 17     3     3     3  27.2  27.2  27.2     2     2     2   284   284   284
## 18     2     2     2  26.3  26.3  26.3     2     2     2   199   199   199
## 19     1     1     1  35.3  35.3  35.3     1     1     1   218   218   218
## 20     3     3     3  25.5  25.5  25.5     2     2     2   206   218   186
## 21     1     1     1  22.0  33.2  26.3     1     1     1   187   199   187
## 22     1     1     1  33.2  33.2  33.2     1     1     1   229   229   229
## 23     1     1     1  27.5  27.5  27.5     1     1     1   131   131   131
## 24     3     3     3  24.9  24.9  24.9     1     1     1   184   204   184
## 25     2     2     2  27.4  27.4  27.4     1     1     1   186   186   186

The list format produced by action = "all" is, by far, the most useful. This option returns the imputed datasets as an m-element list in which each element contains one completed dataset. Because most R workflows for multiply imputed data iterate over such a list (e.g., via lapply()), you will almost certainly want to use the action = "all" format when completing multiply imputed datasets for analysis in R.

complete(imp, "all")
## $`1`
##    age  bmi hyp chl
## 1    1 35.3   2 206
## 2    2 22.7   1 187
## 3    1 27.2   1 187
## 4    3 25.5   1 204
## 5    1 20.4   1 113
## 6    3 21.7   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 22.7   1 187
## 11   1 22.5   1 187
## 12   2 22.7   1 187
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 199
## 16   1 22.0   1 131
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 206
## 21   1 22.0   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 184
## 25   2 27.4   1 186
## 
## $`2`
##    age  bmi hyp chl
## 1    1 27.4   1 131
## 2    2 22.7   1 187
## 3    1 30.1   1 187
## 4    3 22.5   1 206
## 5    1 20.4   1 113
## 6    3 21.7   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 30.1   1 184
## 11   1 35.3   1 199
## 12   2 22.7   1 131
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 187
## 16   1 25.5   1 238
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 218
## 21   1 33.2   1 199
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 204
## 25   2 27.4   1 186
## 
## $`3`
##    age  bmi hyp chl
## 1    1 27.2   2 199
## 2    2 22.7   1 187
## 3    1 26.3   1 187
## 4    3 27.5   2 206
## 5    1 20.4   1 113
## 6    3 24.9   2 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 22.7   2 186
## 11   1 22.0   1 187
## 12   2 35.3   1 206
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 131
## 16   1 35.3   1 199
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 186
## 21   1 26.3   1 187
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 184
## 25   2 27.4   1 186
## 
## attr(,"class")
## [1] "mild" "list"

4.1

Use the mice::complete() function to generate a list of multiply imputed datasets from the imputations you created in Question 3.1.

Save the result as miList.
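
A possible one-liner, assuming imp20 is the 20-imputation mids object from Question 3.1:

miList <- complete(imp20, "all")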


5 Analysis and Pooling


Use the mids object you created in Question 3.1 to answer the questions in this section.


After generating the multiply imputed datasets as we did above, the next step in an MI-based analysis is the so-called analysis phase. In the analysis phase, we fit a substantively interesting analysis model to each of the imputed datasets.

When working with mice, the simplest way to implement the analysis phase is via the with.mids() function. For example, the following code regresses bmi onto chl using the 3 imputed datasets created above.

fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
## 
## call1 :
## mice(data = nhanes, m = 3, printFlag = FALSE)
## 
## nmis :
## [1]  0  9  8 10
## 
## analyses :
## [[1]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    17.26467      0.04527  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    21.44927      0.02879  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    20.43439      0.03409

As you can see, the fit object simply contains 3 fitted lm objects. Each of these objects represents a regression model fit to one of the imputed datasets in the imp object.

The fit object returned by with.mids() has the class mira (multiply imputed repeated analyses).

class(fit)
## [1] "mira"
ls(fit)
## [1] "analyses" "call"     "call1"    "nmis"

The actual fitted models live in the analyses slot of the mira object. Each of these fitted models behaves according to its own class (e.g., a fitted lm object inside a mira object behaves just like any other lm object).

library(dplyr)

fit$analyses[[1]] %>% summary()
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0387 -3.0300 -0.9996  2.2004  8.7099 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.26467    4.30626   4.009  0.00055 ***
## chl          0.04527    0.02227   2.033  0.05374 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.044 on 23 degrees of freedom
## Multiple R-squared:  0.1523, Adjusted R-squared:  0.1155 
## F-statistic: 4.134 on 1 and 23 DF,  p-value: 0.05374
fit$analyses[[2]] %>% coef()
## (Intercept)         chl 
## 21.44927435  0.02878946
fit$analyses[[3]] %>% resid()
##           1           2           3           4           5           6 
## -0.01769018 -4.10864703 -0.50864703  0.04370132 -3.88621430 -1.80638624 
##           7           8           9          10          11          12 
## -1.95664894  3.29135297 -6.54708040 -4.07456010 -4.80864703  7.84370132 
##          13          14          15          16          17          18 
## -5.75629868  1.31187518  4.70022098  8.08230982 -2.91507913 -0.91769018 
##          19          20          21          22          23          24 
##  7.43465818 -1.27456010 -0.50864703  4.95970196  2.60022098 -1.80638624 
##          25 
##  0.62543990

5.1

Use with.mids() to regress bmi onto age and hyp.


The final step in an MI-based analysis is the pooling phase wherein we aggregate the multiple model fits from the analysis phase. If we’ve implemented the analysis phase via the with.mids() function, then we can pool using the mice::pool() function. For example, the following code will pool the regression models estimated above.

poolFit <- pool(fit)
summary(poolFit)
##          term    estimate  std.error statistic       df     p.value
## 1 (Intercept) 19.71611122 5.04500901  3.908043 10.64744 0.002596098
## 2         chl  0.03604852 0.02456045  1.467747 14.68944 0.163257414

The poolFit object has class mipo (multiply imputed pooled object). The mipo object contains the pooled regression coefficients and standard errors (pooled using Rubin's (1987) rules), as well as some other useful statistics (e.g., the fraction of missing information, fmi).

ls(poolFit)
## [1] "call"    "glanced" "m"       "pooled"
ls(poolFit$pooled)
##  [1] "b"        "df"       "dfcom"    "estimate" "fmi"      "lambda"  
##  [7] "m"        "riv"      "t"        "term"     "ubar"
poolFit$pooled$fmi
## [1] 0.3595712 0.2518258

5.2

Use the mice::pool() function to pool the estimates you generated in Question 5.1.

What inferences can you draw from the MI estimates?


The workflow demonstrated above applies to many different types of analysis. As long as the estimation function provides coef() and vcov() methods, this workflow should succeed. In Lab 2c, we will explore alternative workflows that you can use in situations where the mice() \(\rightarrow\) with.mids() \(\rightarrow\) pool() process doesn’t work.
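
To illustrate the point about coef() and vcov(), here is a sketch of the same three-step workflow with a different estimation function; the glm() model below is just an example, not part of the exercises.

fitGlm <- with(imp, glm(chl ~ bmi + age))
summary(pool(fitGlm))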


6 Imputation Model Predictors


By default, mice() will impute each incomplete variable using some flavor of univariate supervised model wherein all other variables in the dataset act as predictors. These individual, univariate models are called the elementary imputation models (EIMs), and we are completely free to adjust their specification. One adjustment that we often want to make is restricting which variables a given EIM uses as predictors.

In mice(), we adjust the right-hand side (RHS) of the EIMs via the predictorMatrix argument.


The predictor matrix is stored in the predictorMatrix slot of the mids object. Let’s look at a default predictor matrix.

imp$predictorMatrix
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

Each row in the predictor matrix is a pattern vector that defines the predictors used in the EIM for the corresponding row variable. Variables that are not imputed still get a row in the predictor matrix; mice() simply ignores those rows. Consequently, the predictor matrix is square. A value of 1 in a matrix entry indicates that the column variable is used as a predictor in the EIM for the row variable. For example, the 1 at position [2, 1] indicates that age was used in the imputation model for bmi. Note that the diagonal is zero because a variable cannot be used to impute its own missing data.

We are not forced to use this default specification; mice gives us complete control over the predictor matrix, so we can choose our own predictors for each EIM. This flexibility can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about the relations in the data.

We can use the mice::make.predictorMatrix() function to initialize a default predictor matrix, without doing any imputation. We can then modify this matrix to fit our needs.

(pred <- make.predictorMatrix(nhanes))
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

The object pred now contains the default predictor matrix for the nhanes data. Altering the predictor matrix and using the updated version for imputation is very simple. For example, the following code removes the variable hyp from the set of predictors in each EIM.

pred[ , "hyp"] <- 0
pred
##     age bmi hyp chl
## age   0   1   0   1
## bmi   1   0   0   1
## hyp   1   1   0   1
## chl   1   1   0   0

We use the modified predictor matrix by supplying it to the predictorMatrix argument in mice().

imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 235711)

6.1

Create a custom predictor matrix for the nhanes data.

Specify the EIMs as follows (one possible encoding is sketched after the list):

  • \(Y_{bmi} = \beta_0 + \beta_1 X_{hyp} + \beta_2 X_{chl}\)
  • \(Y_{hyp} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{chl}\)
  • \(Y_{chl} = \beta_0 + \beta_1 X_{bmi}\)
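
One possible encoding of these EIMs is sketched below (rows of the predictor matrix index the imputation targets; the object name pred is arbitrary):

pred <- make.predictorMatrix(nhanes)
pred["bmi", ] <- 0
pred["bmi", c("hyp", "chl")] <- 1   # bmi imputed from hyp and chl
pred["hyp", ] <- 0
pred["hyp", c("age", "chl")] <- 1   # hyp imputed from age and chl
pred["chl", ] <- 0
pred["chl", "bmi"] <- 1             # chl imputed from bmi only
pred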

The mice::quickpred() function provides a simple way to construct predictor matrices via various selection rules. For example, the following code selects a variable as a predictor whenever its correlation with the incomplete target variable is at least \(\rho = 0.30\).

(pred <- quickpred(nhanes, mincor = 0.3))
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   0   0   1
## chl   1   1   1   0

6.2

Check the documentation for quickpred().

Take note of the different selection criteria you can apply.


6.3

Use quickpred() to create a predictor matrix for the nhanes data.

This predictor matrix should satisfy the following conditions.

  • Use age as a predictor in every EIM.
  • Select the remaining predictors such that they have a minimum correlation of \(\rho = 0.45\) with the imputation targets.

Which variables are selected as predictors for each EIM?


6.4

Use the predictor matrix you created in Question 6.3 to impute the nhanes data.

Use default settings except for the following options.

  • Create 8 imputations.
  • Set the imputation method to norm.
  • Set a random number seed.

6.5

Create trace plots and density plots for the mids object you created in Question 6.4.


7 Imputation Methods


For each column in the dataset, we must specify an imputation method. The method slot in a mids object shows which methods were applied to each column. The following code shows the default methods that would be applied if we impute the nhanes data.

make.method(nhanes)
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm"

The age variable is completely observed, so it is not imputed. The empty string, "", tells mice() not to impute the corresponding variable. The other variables will all be imputed via pmm (predictive mean matching), which is the default method for numeric data.

In reality, these data are a mix of numeric and categorical variables. The nhanes2 dataset contains appropriately typed categorical variables.

summary(nhanes2)
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10
str(nhanes2)
## 'data.frame':    25 obs. of  4 variables:
##  $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num  NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num  NA 187 187 NA 113 184 118 187 238 NA ...

In nhanes2, the age variable is a factor comprising three categories, and hyp is a binary factor. If we run mice() without specifying a method argument, the algorithm will attempt to match the imputation method for each column to the column’s type.

Let’s see what happens when we initialize a default method vector for the nhanes2 dataset.

(meth <- make.method(nhanes2))
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"

The imputation method for hyp is now set to logreg, which imputes by logistic regression (the default for binary factors). The mice() algorithm has recognized the type of the new hyp variable and updated the imputation method appropriately.


We can get an up-to-date overview of the available methods via the methods() function.

methods(mice)
##  [1] mice.impute.2l.bin              mice.impute.2l.lmer            
##  [3] mice.impute.2l.norm             mice.impute.2l.pan             
##  [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
##  [7] mice.impute.2lonly.pmm          mice.impute.cart               
##  [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
## [11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm   mice.impute.lda                
## [15] mice.impute.logreg              mice.impute.logreg.boot        
## [17] mice.impute.mean                mice.impute.midastouch         
## [19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
## [21] mice.impute.mpmm                mice.impute.norm               
## [23] mice.impute.norm.boot           mice.impute.norm.nob           
## [25] mice.impute.norm.predict        mice.impute.panImpute          
## [27] mice.impute.passive             mice.impute.pmm                
## [29] mice.impute.polr                mice.impute.polyreg            
## [31] mice.impute.quadratic           mice.impute.rf                 
## [33] mice.impute.ri                  mice.impute.sample             
## [35] mice.mids                       mice.theme                     
## see '?methods' for accessing help and source code

More detail on the methods is available in the Details section of the documentation for mice(), and we can see detailed documentation for each method through the help page for the appropriate mice.impute.METHOD() function.

?mice
?mice.impute.lasso.select.logreg

In Question 6.4, you changed the imputation method for all variables by setting a single value for the method argument. We can also change the methods on a variable-by-variable basis. The following code will change the imputation method for bmi to Bayesian linear regression while leaving the remaining methods unaffected.

(meth["bmi"] <- "norm")
## [1] "norm"

To use this custom method vector, we simply supply meth to the method argument when we call mice().

imp <- mice(nhanes2, method = meth, print = FALSE)

7.1

Create a custom method vector for the nhanes2 data.

Define the following methods (one possible method vector is sketched after the list).

  • age: None
  • bmi: Classification and regression trees
  • hyp: Logistic regression
  • chl: Predictive mean matching
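
One possible method vector is sketched below ("cart" is the mice name for classification and regression trees; the object name meth is arbitrary):

meth <- make.method(nhanes2)   # age defaults to "" (not imputed)
meth["bmi"] <- "cart"
meth["hyp"] <- "logreg"
meth["chl"] <- "pmm"
meth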

7.2

Impute the missing values in the nhanes2 dataset.

Set up the imputation run as follows.

  • Use the predictor matrix you created in Question 6.1.
  • Use the method vector you created in Question 7.1.
  • Create 15 imputations.
  • Set a random number seed.

7.3

Create trace plots and density plots for the mids object you created in Question 7.2.


8 Adjusting the Iterations


By default, mice() will use five iterations. Sometimes, these five will be enough, but often we will want to run the algorithm longer. We can set the number of iterations via the maxit argument.

The following code will run the default imputation model for ten iterations.

imp <- mice(nhanes2, maxit = 10)
## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl
##   6   1  bmi  hyp  chl
##   6   2  bmi  hyp  chl
##   6   3  bmi  hyp  chl
##   6   4  bmi  hyp  chl
##   6   5  bmi  hyp  chl
##   7   1  bmi  hyp  chl
##   7   2  bmi  hyp  chl
##   7   3  bmi  hyp  chl
##   7   4  bmi  hyp  chl
##   7   5  bmi  hyp  chl
##   8   1  bmi  hyp  chl
##   8   2  bmi  hyp  chl
##   8   3  bmi  hyp  chl
##   8   4  bmi  hyp  chl
##   8   5  bmi  hyp  chl
##   9   1  bmi  hyp  chl
##   9   2  bmi  hyp  chl
##   9   3  bmi  hyp  chl
##   9   4  bmi  hyp  chl
##   9   5  bmi  hyp  chl
##   10   1  bmi  hyp  chl
##   10   2  bmi  hyp  chl
##   10   3  bmi  hyp  chl
##   10   4  bmi  hyp  chl
##   10   5  bmi  hyp  chl
plot(imp)

Notice that now the x-axis in the trace plots extends to 10 to reflect the higher number of iterations.


One common reason to increase the number of iterations is non-convergence. We’ll sometimes observe trends when we inspect the trace plots of the imputed items. These trends tell us that the algorithm has not yet reached an equilibrium point. In such situations, the first thing to try (so long as you don’t notice any obvious errors in your imputation model specification) is increasing the number of iterations. While we could accomplish this goal simply by rerunning mice() with a higher value for maxit, doing so would be very wasteful. We would have to recreate all of the existing iterations before generating any new iterations. Thankfully, we don’t need to be so inefficient.

We can easily extend the chains in any mids object with the mice.mids() function. This function takes a mids object as its primary argument and picks up the sampling where mice() left off (using exactly the same imputation model specification). The following code will extend the chains in our most recent mids object to 25 total iterations by adding another 15 iterations.

imp <- mice.mids(imp, maxit = 15, print = FALSE)
plot(imp)

Now, imp contains 25 iterations, which we can see by the range of the x-axis in the trace plots.


8.1

Increase the number of iterations for the imputation you ran in Question 7.2 to 25.


9 More Diagnostics


Recall that even if the imputation model converges (e.g., as demonstrated by trace plots), the imputed values still might not be reasonable estimates of the missing data. If the data are MCAR, then the imputations should have the same distribution as the observed data. In general, though, the distributions of the observed and imputed data may differ when the data are MAR. However, we don’t want to see very large discrepancies.

Strip plots are another excellent visualization tool with which we can assess the comparability of the imputed and observed data. The mice::stripplot() function will generate strip plots from a mids object.

The following code will create a strip plot of the observed and imputed chl values.

stripplot(imp, chl)

The observed data are plotted in blue, and the imputed data are plotted in red. Since chl was imputed with PMM (which draws imputations from the observed data), the imputed values have the same gaps as the observed data, and the imputed values are always within the range of the observed data. This figure indicates that the observed and imputed values of chl follow similar distributions. So, we should conclude that these imputations are plausible.

We can also call stripplot() without specifying any variables. Doing so will plot all of the imputed variables in an array.

stripplot(imp)


9.1

Create trace plots, density plots, and strip plots for the mids object you created in Question 8.1.

Based on these visualizations, would you say the imputation model has converged to a valid solution?


9.2

Rerun the imputation from Question 8.1.

Keep all the same settings, but use Bayesian linear regression to impute bmi and chl.


9.3

Create trace plots, density plots, and strip plots for the mids object you created in Question 9.2.

Based on the visualizations, are the imputations from Question 8.1 or Question 9.2 more reasonable? Why?


9.4

Use the multiply imputed data you created in Question 8.1 to estimate the following regression model.

\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{chl} + \varepsilon\)


9.5

Pool the estimates from the model you estimated in Question 9.4.

Is the effect of chl on bmi significant after controlling for age?


9.6

Rerun the regression analysis from above using the imputed data from Question 9.2.

Do the results differ between the two versions of the analysis? Why or why not?


End of Lab 2a