In these lab exercises, you are going to start working with multiple imputation (MI). Specifically, you will do some simple MI-based analyses using the mice package.
Load the mice package.
library(mice)
The mice package contains several datasets. Once the package is loaded, you can access these datasets in your workspace. In this lab, we will work with the nhanes dataset (Schafer, 1997, Table 6.14).
The nhanes dataset is a small data set with non-monotone missing values. These data comprise 25 observations of four variables: age group, body mass index, hypertension, and cholesterol.
Unless otherwise specified, all subsequent questions refer to the nhanes dataset.
Use the mice()
function to impute the missing data using stochastic
regression imputation.
Set the random number seed by specifying a value for the seed
argument.
imp1 <- mice(nhanes, method = "norm.nob", m = 1, seed = 235711)
##
## iter imp variable
## 1 1 bmi hyp chl
## 2 1 bmi hyp chl
## 3 1 bmi hyp chl
## 4 1 bmi hyp chl
## 5 1 bmi hyp chl
This code imputes the missing values in the dataset with a single (because
m = 1
) stochastic regression imputation. This approach does not account for
variability in the regression weights, so it is not ‘proper’ in the sense of
Rubin (1987). The variability of the imputed data will be
underestimated and, consequently, standard errors for statistics estimated from
these imputed data will be too small.
Although the above is always true in principle, for data with few missing values, the additional complexity of MI is not really necessary. The degree of standard error deflation produced by stochastic regression imputation increases with the missing data rate. So, stochastic regression imputation can still be a good (and simple) solution when the missing data rate is low (say, 5% or less missing on any given variable).
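To see concretely why this approach understates uncertainty, it helps to hand-roll stochastic regression imputation on a toy dataset. The base-R sketch below (simulated data, not nhanes) imputes each missing y from the fitted regression line plus a random residual draw. Note that the estimated weights and residual SD are treated as fixed, known values, which is exactly the source of the problem.

```r
# Hand-rolled stochastic regression imputation on simulated data.
set.seed(235711)
x <- rnorm(25)
y <- 2 + 0.5 * x + rnorm(25)
y[sample(25, 5)] <- NA                    # make 5 values missing

obs   <- !is.na(y)
fit   <- lm(y[obs] ~ x[obs])              # model fit to the observed cases
sigma <- summary(fit)$sigma               # residual SD, treated as known

# Prediction + residual draw; the weights themselves are never perturbed,
# so the uncertainty in the estimated coefficients is ignored.
y[!obs] <- coef(fit)[1] + coef(fit)[2] * x[!obs] + rnorm(sum(!obs), sd = sigma)
```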
Use the mice::complete() function to create a completed dataset.
(comp1 <- complete(imp1))
## age bmi hyp chl
## 1 1 33.14591 1.2730967 223.8749
## 2 2 22.70000 1.0000000 187.0000
## 3 1 33.18101 1.0000000 187.0000
## 4 3 20.67198 1.5426007 185.5465
## 5 1 20.40000 1.0000000 113.0000
## 6 3 22.27522 1.8590519 184.0000
## 7 1 22.50000 1.0000000 118.0000
## 8 1 30.10000 1.0000000 187.0000
## 9 2 22.00000 1.0000000 238.0000
## 10 2 23.90271 1.1955721 150.0048
## 11 1 30.23901 0.9862682 177.2913
## 12 2 24.64479 1.1240773 227.3365
## 13 3 21.70000 1.0000000 206.0000
## 14 2 28.70000 2.0000000 204.0000
## 15 1 29.60000 1.0000000 186.0343
## 16 1 32.82535 1.0013922 264.1071
## 17 3 27.20000 2.0000000 284.0000
## 18 2 26.30000 2.0000000 199.0000
## 19 1 35.30000 1.0000000 218.0000
## 20 3 25.50000 2.0000000 257.3953
## 21 1 30.74885 1.7896261 185.0158
## 22 1 33.20000 1.0000000 229.0000
## 23 1 27.50000 1.0000000 131.0000
## 24 3 24.90000 1.0000000 246.1195
## 25 2 27.40000 1.0000000 186.0000
The complete()
function replaces the missing values with the imputations
generated by the mice()
run. Now, we have a completed dataset to analyze.
Use the imputed data to regress bmi
onto age
.
fit1 <- lm(bmi ~ age, data = comp1)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ age, data = comp1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1483 -2.0483 0.5517 2.4856 5.7517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.8152 1.7343 18.921 1.61e-15 ***
## age -3.2669 0.8944 -3.653 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.64 on 23 degrees of freedom
## Multiple R-squared: 0.3671, Adjusted R-squared: 0.3396
## F-statistic: 13.34 on 1 and 23 DF, p-value: 0.001327
Check the documentation for the mice()
function.
?mice
Use the mice()
function to multiply impute the nhanes data using all
default options.
imp <- mice(nhanes)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
As you can see from the printed output, the algorithm ran for 5 iterations and
produced 5 imputations for each missing datum. These numbers result from the
default values for the maxit
and m
arguments, respectively.
The object returned by mice()
has the class mids (multiply imputed
data set). This object encapsulates all the relevant information mice()
produced when imputing the nhanes dataset (e.g., the original data, the
imputed values, the number of missing values, number of iterations, etc.).
Use the ls()
function to check the contents of the mids object you created
in Question 2.2.
What do you think is stored in the following slots?
- data
- imp
- method
- m
ls(imp)
## [1] "blocks" "blots" "call" "chainMean"
## [5] "chainVar" "data" "date" "formulas"
## [9] "ignore" "imp" "iteration" "lastSeedValue"
## [13] "loggedEvents" "m" "method" "nmis"
## [17] "post" "predictorMatrix" "seed" "version"
## [21] "visitSequence" "where"
The original, incomplete, data are stored in the data
slot.
imp$data
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
A list containing the imputed values is stored in the imp
slot.
imp$imp
## $age
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)
##
## $bmi
## 1 2 3 4 5
## 1 22.7 35.3 30.1 29.6 33.2
## 3 33.2 27.2 26.3 30.1 27.2
## 4 22.7 24.9 27.4 22.5 22.5
## 6 25.5 25.5 27.4 20.4 27.4
## 10 22.0 30.1 22.7 27.5 21.7
## 11 35.3 27.5 28.7 22.0 27.5
## 12 22.0 27.5 20.4 22.7 22.5
## 16 29.6 35.3 26.3 22.0 25.5
## 21 29.6 30.1 27.5 28.7 29.6
##
## $hyp
## 1 2 3 4 5
## 1 1 1 1 1 1
## 4 2 2 1 2 1
## 6 2 2 2 1 2
## 10 1 1 1 2 2
## 11 1 1 1 1 1
## 12 1 1 1 1 1
## 16 1 1 1 1 1
## 21 1 1 1 1 1
##
## $chl
## 1 2 3 4 5
## 1 238 218 204 131 218
## 4 186 186 218 206 186
## 10 199 284 187 186 118
## 11 187 118 187 113 131
## 12 187 206 238 238 199
## 15 131 187 187 238 204
## 16 187 186 238 131 187
## 20 206 284 218 204 206
## 21 187 238 238 238 204
## 24 218 186 206 206 186
A vector defining the elementary imputation methods is stored in the method
slot.
imp$method
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The number of imputations is stored in the m
slot.
imp$m
## [1] 5
Before proceeding any further, we need to check convergence. If the imputation model did not converge, or if the imputed values are not reasonable, there is no reason to proceed to the analysis phase.
Use the plot.mids()
function to create trace plots of the imputed data.
plot(imp)
These plots show the means and standard deviations of the imputed values for each variable across iterations and imputations. Each line represents a different imputed dataset, and the x-axis indexes the iteration.
If the imputation model has converged, we should see the lines mixing randomly with no trends in the latter parts of these trace plots. Although it’s difficult to judge with so few imputations, it looks like there may be some trending in these plots. We should probably consider increasing the number of iterations. We’ll come back to that idea below.
Use the mice::densityplot()
function to check the distributions of the
imputed values.
densityplot(imp)
In these plots, the blue curve represents the density estimated from the observed data, and the red curves represent the densities estimated from the imputed values. We want to see the red and blue curves behaving similarly (though not identically).
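Beyond the visual check, we can do a crude numeric comparison of the observed and imputed values. A sketch using the imp object from above (the imp slot holds one column of imputed chl values per imputation):

```r
# Mean of each imputation's chl draws vs. the observed-data mean.
colMeans(imp$imp$chl)
mean(imp$data$chl, na.rm = TRUE)
```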
At various places above, you should have also seen that mice()
created 5 sets
of imputations (which is the default). If we want more imputations (which we
frequently will), we can set an appropriate value for the m
argument in the
mice()
function.
Redo the imputation from Question 2.2.
This time, create m = 20 imputations, set the random number seed, and suppress the printed iteration history.
imp2 <- mice(nhanes, m = 20, seed = 235711, print = FALSE)
The print = FALSE
option simply suppresses the printed iteration information.
Check the convergence of the imputation model from Question 3.1 as you did before.
plot(imp2)
densityplot(imp2)
The trace plots suggest convergence of the imputation models, and the density plots suggest plausible imputed values. We’re good to go.
If we want to create a set of stand-alone multiply imputed datasets, we use the
mice::complete()
function to do so.
When completing the data with stochastic regression imputation, we only had one
sensible way to return the result because we only created a single set of
imputations. MI, on the other hand, creates m different sets of imputations,
so we have several ways to format the completed datasets. We can select between
these options via the action
argument in the mice::complete()
function.
If we specify a single number for the action
argument, the function will
return a single completed dataset corresponding to the requested imputation
number (e.g., the third imputation in the following code).
imp <- mice(nhanes, m = 3, print = FALSE)
complete(imp, 3)
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
We can also request that all m imputed datasets be collapsed into a single block of data with three different structures.

- action = "long" row-binds the imputed datasets one atop the other.
- action = "broad" column-binds the imputed datasets side-by-side.
- action = "repeated" does the same as "broad" but reorders the variables so the m replicates of each variable form a contiguous block of data.

complete(imp, "long")
## age bmi hyp chl .imp .id
## 1 1 35.3 2 206 1 1
## 2 2 22.7 1 187 1 2
## 3 1 27.2 1 187 1 3
## 4 3 25.5 1 204 1 4
## 5 1 20.4 1 113 1 5
## 6 3 21.7 2 184 1 6
## 7 1 22.5 1 118 1 7
## 8 1 30.1 1 187 1 8
## 9 2 22.0 1 238 1 9
## 10 2 22.7 1 187 1 10
## 11 1 22.5 1 187 1 11
## 12 2 22.7 1 187 1 12
## 13 3 21.7 1 206 1 13
## 14 2 28.7 2 204 1 14
## 15 1 29.6 1 199 1 15
## 16 1 22.0 1 131 1 16
## 17 3 27.2 2 284 1 17
## 18 2 26.3 2 199 1 18
## 19 1 35.3 1 218 1 19
## 20 3 25.5 2 206 1 20
## 21 1 22.0 1 187 1 21
## 22 1 33.2 1 229 1 22
## 23 1 27.5 1 131 1 23
## 24 3 24.9 1 184 1 24
## 25 2 27.4 1 186 1 25
## 26 1 27.4 1 131 2 1
## 27 2 22.7 1 187 2 2
## 28 1 30.1 1 187 2 3
## 29 3 22.5 1 206 2 4
## 30 1 20.4 1 113 2 5
## 31 3 21.7 1 184 2 6
## 32 1 22.5 1 118 2 7
## 33 1 30.1 1 187 2 8
## 34 2 22.0 1 238 2 9
## 35 2 30.1 1 184 2 10
## 36 1 35.3 1 199 2 11
## 37 2 22.7 1 131 2 12
## 38 3 21.7 1 206 2 13
## 39 2 28.7 2 204 2 14
## 40 1 29.6 1 187 2 15
## 41 1 25.5 1 238 2 16
## 42 3 27.2 2 284 2 17
## 43 2 26.3 2 199 2 18
## 44 1 35.3 1 218 2 19
## 45 3 25.5 2 218 2 20
## 46 1 33.2 1 199 2 21
## 47 1 33.2 1 229 2 22
## 48 1 27.5 1 131 2 23
## 49 3 24.9 1 204 2 24
## 50 2 27.4 1 186 2 25
## 51 1 27.2 2 199 3 1
## 52 2 22.7 1 187 3 2
## 53 1 26.3 1 187 3 3
## 54 3 27.5 2 206 3 4
## 55 1 20.4 1 113 3 5
## 56 3 24.9 2 184 3 6
## 57 1 22.5 1 118 3 7
## 58 1 30.1 1 187 3 8
## 59 2 22.0 1 238 3 9
## 60 2 22.7 2 186 3 10
## 61 1 22.0 1 187 3 11
## 62 2 35.3 1 206 3 12
## 63 3 21.7 1 206 3 13
## 64 2 28.7 2 204 3 14
## 65 1 29.6 1 131 3 15
## 66 1 35.3 1 199 3 16
## 67 3 27.2 2 284 3 17
## 68 2 26.3 2 199 3 18
## 69 1 35.3 1 218 3 19
## 70 3 25.5 2 186 3 20
## 71 1 26.3 1 187 3 21
## 72 1 33.2 1 229 3 22
## 73 1 27.5 1 131 3 23
## 74 3 24.9 1 184 3 24
## 75 2 27.4 1 186 3 25
complete(imp, "broad")
## age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1 1 35.3 2 206 1 27.4 1 131 1 27.2 2 199
## 2 2 22.7 1 187 2 22.7 1 187 2 22.7 1 187
## 3 1 27.2 1 187 1 30.1 1 187 1 26.3 1 187
## 4 3 25.5 1 204 3 22.5 1 206 3 27.5 2 206
## 5 1 20.4 1 113 1 20.4 1 113 1 20.4 1 113
## 6 3 21.7 2 184 3 21.7 1 184 3 24.9 2 184
## 7 1 22.5 1 118 1 22.5 1 118 1 22.5 1 118
## 8 1 30.1 1 187 1 30.1 1 187 1 30.1 1 187
## 9 2 22.0 1 238 2 22.0 1 238 2 22.0 1 238
## 10 2 22.7 1 187 2 30.1 1 184 2 22.7 2 186
## 11 1 22.5 1 187 1 35.3 1 199 1 22.0 1 187
## 12 2 22.7 1 187 2 22.7 1 131 2 35.3 1 206
## 13 3 21.7 1 206 3 21.7 1 206 3 21.7 1 206
## 14 2 28.7 2 204 2 28.7 2 204 2 28.7 2 204
## 15 1 29.6 1 199 1 29.6 1 187 1 29.6 1 131
## 16 1 22.0 1 131 1 25.5 1 238 1 35.3 1 199
## 17 3 27.2 2 284 3 27.2 2 284 3 27.2 2 284
## 18 2 26.3 2 199 2 26.3 2 199 2 26.3 2 199
## 19 1 35.3 1 218 1 35.3 1 218 1 35.3 1 218
## 20 3 25.5 2 206 3 25.5 2 218 3 25.5 2 186
## 21 1 22.0 1 187 1 33.2 1 199 1 26.3 1 187
## 22 1 33.2 1 229 1 33.2 1 229 1 33.2 1 229
## 23 1 27.5 1 131 1 27.5 1 131 1 27.5 1 131
## 24 3 24.9 1 184 3 24.9 1 204 3 24.9 1 184
## 25 2 27.4 1 186 2 27.4 1 186 2 27.4 1 186
complete(imp, "repeated")
## age.1 age.2 age.3 bmi.1 bmi.2 bmi.3 hyp.1 hyp.2 hyp.3 chl.1 chl.2 chl.3
## 1 1 1 1 35.3 27.4 27.2 2 1 2 206 131 199
## 2 2 2 2 22.7 22.7 22.7 1 1 1 187 187 187
## 3 1 1 1 27.2 30.1 26.3 1 1 1 187 187 187
## 4 3 3 3 25.5 22.5 27.5 1 1 2 204 206 206
## 5 1 1 1 20.4 20.4 20.4 1 1 1 113 113 113
## 6 3 3 3 21.7 21.7 24.9 2 1 2 184 184 184
## 7 1 1 1 22.5 22.5 22.5 1 1 1 118 118 118
## 8 1 1 1 30.1 30.1 30.1 1 1 1 187 187 187
## 9 2 2 2 22.0 22.0 22.0 1 1 1 238 238 238
## 10 2 2 2 22.7 30.1 22.7 1 1 2 187 184 186
## 11 1 1 1 22.5 35.3 22.0 1 1 1 187 199 187
## 12 2 2 2 22.7 22.7 35.3 1 1 1 187 131 206
## 13 3 3 3 21.7 21.7 21.7 1 1 1 206 206 206
## 14 2 2 2 28.7 28.7 28.7 2 2 2 204 204 204
## 15 1 1 1 29.6 29.6 29.6 1 1 1 199 187 131
## 16 1 1 1 22.0 25.5 35.3 1 1 1 131 238 199
## 17 3 3 3 27.2 27.2 27.2 2 2 2 284 284 284
## 18 2 2 2 26.3 26.3 26.3 2 2 2 199 199 199
## 19 1 1 1 35.3 35.3 35.3 1 1 1 218 218 218
## 20 3 3 3 25.5 25.5 25.5 2 2 2 206 218 186
## 21 1 1 1 22.0 33.2 26.3 1 1 1 187 199 187
## 22 1 1 1 33.2 33.2 33.2 1 1 1 229 229 229
## 23 1 1 1 27.5 27.5 27.5 1 1 1 131 131 131
## 24 3 3 3 24.9 24.9 24.9 1 1 1 184 204 184
## 25 2 2 2 27.4 27.4 27.4 1 1 1 186 186 186
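As an aside, the long format round-trips: if we keep the original incomplete data in the stack (include = TRUE), mice::as.mids() can rebuild a mids object from the .imp and .id columns. A sketch:

```r
# Stack the original data (.imp == 0) together with the m imputations,
# then reconstruct a mids object from the long-format data frame.
longDat <- complete(imp, "long", include = TRUE)
imp3    <- as.mids(longDat)
```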
The list format produced by action = all
is, by far, the most useful. This
option returns the imputed datasets in an m-element list where each element
contains one completed dataset. Since R prefers to work with lists, you will
almost certainly want to use the action = all
format when completing multiply
imputed datasets for analysis in R.
complete(imp, "all")
## $`1`
## age bmi hyp chl
## 1 1 35.3 2 206
## 2 2 22.7 1 187
## 3 1 27.2 1 187
## 4 3 25.5 1 204
## 5 1 20.4 1 113
## 6 3 21.7 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 1 187
## 11 1 22.5 1 187
## 12 2 22.7 1 187
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 199
## 16 1 22.0 1 131
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 206
## 21 1 22.0 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## $`2`
## age bmi hyp chl
## 1 1 27.4 1 131
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 22.5 1 206
## 5 1 20.4 1 113
## 6 3 21.7 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 30.1 1 184
## 11 1 35.3 1 199
## 12 2 22.7 1 131
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 25.5 1 238
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 218
## 21 1 33.2 1 199
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 204
## 25 2 27.4 1 186
##
## $`3`
## age bmi hyp chl
## 1 1 27.2 2 199
## 2 2 22.7 1 187
## 3 1 26.3 1 187
## 4 3 27.5 2 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.7 2 186
## 11 1 22.0 1 187
## 12 2 35.3 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 131
## 16 1 35.3 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 26.3 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## attr(,"class")
## [1] "mild" "list"
Use the mice::complete()
function to generate a list of multiply imputed
datasets from the imputations you created in Question 3.1.
Save the result as miList
.
miList <- complete(imp2, "all")
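Because miList is an ordinary list, base R functions like lapply() can map an analysis over all of the completed datasets. For example, a sketch that collects the coefficients of the bmi-on-age regression from every imputation:

```r
# Fit the same model to each completed dataset and stack the coefficients
# into a matrix with one row per imputation.
coefs <- do.call(rbind, lapply(miList, function(d) coef(lm(bmi ~ age, data = d))))
coefs
```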
Use the mids object you created in Question 3.1 to answer the questions in this section.
After generating the multiply imputed datasets as we did above, the next step in an MI-based analysis is the so-called analysis phase. In the analysis phase, we fit a substantively interesting analysis model to each of the imputed datasets.
When working with mice, the simplest way to implement the analysis phase is
via the with.mids()
function. For example, the following code regresses bmi
onto chl
using the 3 imputed datasets created above.
fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
##
## call1 :
## mice(data = nhanes, m = 3, printFlag = FALSE)
##
## nmis :
## [1] 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 17.26467 0.04527
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 21.44927 0.02879
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.43439 0.03409
As you can see, the fit
object simply contains 3 fitted lm
objects. Each
of these objects represents a regression model fit to one of the imputed datasets
in the imp
object.
The fit
object returned by with.mids()
has the class mira (multiply
imputed repeated analyses).
class(fit)
## [1] "mira"
ls(fit)
## [1] "analyses" "call" "call1" "nmis"
The actual fitted models live in the analyses
slot of the mira
object. Each
of these fitted models will behave according to their individual types (e.g., a
fitted lm
object inside a mira
object will behave as any other lm
object
would).
library(dplyr)
fit$analyses[[1]] %>% summary()
##
## Call:
## lm(formula = bmi ~ chl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0387 -3.0300 -0.9996 2.2004 8.7099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.26467 4.30626 4.009 0.00055 ***
## chl 0.04527 0.02227 2.033 0.05374 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.044 on 23 degrees of freedom
## Multiple R-squared: 0.1523, Adjusted R-squared: 0.1155
## F-statistic: 4.134 on 1 and 23 DF, p-value: 0.05374
fit$analyses[[2]] %>% coef()
## (Intercept) chl
## 21.44927435 0.02878946
fit$analyses[[3]] %>% resid()
## 1 2 3 4 5 6
## -0.01769018 -4.10864703 -0.50864703 0.04370132 -3.88621430 -1.80638624
## 7 8 9 10 11 12
## -1.95664894 3.29135297 -6.54708040 -4.07456010 -4.80864703 7.84370132
## 13 14 15 16 17 18
## -5.75629868 1.31187518 4.70022098 8.08230982 -2.91507913 -0.91769018
## 19 20 21 22 23 24
## 7.43465818 -1.27456010 -0.50864703 4.95970196 2.60022098 -1.80638624
## 25
## 0.62543990
Use with.mids()
to regress bmi
onto age
and hyp
.
fit2 <- with(imp2, lm(bmi ~ age + hyp))
The final step in an MI-based analysis is the pooling phase wherein we
aggregate the multiple model fits from the analysis phase. If we’ve implemented
the analysis phase via the with.mids()
function, then we can pool using the
mice::pool()
function. For example, the following code will pool the regression
models estimated above.
poolFit <- pool(fit)
summary(poolFit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 19.71611122 5.04500901 3.908043 10.64744 0.002596098
## 2 chl 0.03604852 0.02456045 1.467747 14.68944 0.163257414
The poolFit
object has class mipo
(multiply imputed pooled object).
The mipo
object contains the pooled regression coefficients and standard
errors (pooled with the Rubin, 1987, rules), as well as some other
useful statistics (e.g., the fraction of missing information, fmi
).
ls(poolFit)
## [1] "call" "glanced" "m" "pooled"
ls(poolFit$pooled)
## [1] "b" "df" "dfcom" "estimate" "fmi" "lambda"
## [7] "m" "riv" "t" "term" "ubar"
poolFit$pooled$fmi
## [1] 0.3595712 0.2518258
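These pooled quantities follow Rubin's (1987) rules. For a single scalar parameter, the rules are simple enough to hand-roll in base R; the sketch below (pool_scalar is an illustrative helper, not a mice function) mirrors what pool() computes for each coefficient.

```r
# Rubin's rules for one scalar parameter, hand-rolled.
#   ests: the m per-imputation point estimates
#   vars: the m per-imputation squared standard errors
pool_scalar <- function(ests, vars) {
  m    <- length(ests)
  qbar <- mean(ests)              # pooled point estimate
  ubar <- mean(vars)              # within-imputation variance
  b    <- var(ests)               # between-imputation variance
  t    <- ubar + (1 + 1 / m) * b  # total variance
  list(est = qbar, se = sqrt(t), riv = (1 + 1 / m) * b / ubar)
}

# Example with made-up estimates from m = 3 imputations:
pool_scalar(c(0.045, 0.029, 0.034), c(0.0005, 0.0004, 0.0006))
```

The relative increase in variance (riv) is one ingredient in the fraction of missing information reported above.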
Use the mice::pool()
function to pool the estimates you generated in
Question 5.1.
What inferences can you draw from the MI estimates?
pool2 <- pool(fit2)
summary(pool2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 28.106490 2.866908 9.803765 13.27152 1.896402e-07
## 2 age -2.812772 1.214656 -2.315695 13.61421 3.672980e-02
## 3 hyp 2.843548 2.233504 1.273133 15.56480 2.216580e-01
At the \(\alpha = 0.05\) significance level, neither age
nor hyp
significantly
predict bmi
.
The workflow demonstrated above applies to many different types of analysis. As
long as the estimation function provides coef()
and vcov()
methods, this
workflow should succeed. In Lab 2c, we will explore alternative workflows that
you can use in situations where the mice()
\(\rightarrow\) with.mids()
\(\rightarrow\) pool()
process doesn’t work.
By default, mice()
will impute each incomplete variable using some flavor of
univariate supervised model wherein all other variables in the dataset act as
predictors. These individual, univariate models are called the elementary
imputation models (EIMs), and we are completely free to adjust their
specification. One such adjustment that we often want to make is restricting
which variables a given EIM uses as predictors.
In mice()
we adjust the right hand side (RHS) of the EIMs via the
predictorMatrix
argument.
The predictor matrix is stored in the predictorMatrix
slot of the mids
object. Let’s look at a default predictor matrix.
imp$predictorMatrix
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each row in the predictor matrix is a pattern vector that defines the predictors
used in the EIM for each variable in the dataset. Variables that are not imputed
still get a row in the predictor matrix; their rows are simply ignored by mice().
Consequently, the predictor matrix is square. A value of 1 in a matrix entry
indicates that the column variable is used as a predictor in the EIM for the row
variable. For example, the 1 at position [2, 1] indicates that variable age
was used in the imputation model for bmi
. Note that the diagonal is zero
because a variable cannot impute its own missing data.
We are not forced to use this version of the predictor matrix; mice
gives us
complete control over the predictor matrix. So, we can choose our own predictors
for each EIM to suit our particular problem. This flexibility can be very useful, for
example, when you have many variables, or when you have clear ideas or prior
knowledge about relations in the data.
We can use the mice::make.predictorMatrix()
function to initialize a default
predictor matrix, without doing any imputation. We can then modify this matrix
to fit our needs.
(pred <- make.predictorMatrix(nhanes))
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred
now contains the default predictor matrix for the nhanes
data. Altering the predictor matrix and using the updated version for imputation
is very simple. For example, the following code removes the variable hyp
from
the set of predictors in each EIM.
pred[ , "hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
We use the modified predictor matrix by supplying it to the predictorMatrix
argument in mice()
.
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 235711)
Create a custom predictor matrix for the nhanes data.
Specify the EIMs as follows:
pred1 <- make.predictorMatrix(nhanes)
pred1["age", ] <- 0
pred1["bmi", "age"] <- 0
pred1["hyp", "bmi"] <- 0
pred1["chl", c("age", "hyp")] <- 0
pred1
## age bmi hyp chl
## age 0 0 0 0
## bmi 0 0 1 1
## hyp 1 0 0 1
## chl 0 1 0 0
The mice::quickpred()
function provides a simple way to construct predictor
matrices via various selection rules. The following code will select a variable
as a predictor when it correlates at least \(\rho = 0.30\) with the incomplete
variable.
(pred <- quickpred(nhanes, mincor = 0.3))
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
Check the documentation for quickpred()
.
Take note of the different selection criteria you can apply.
?quickpred
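Several of these criteria can be combined in a single call. For instance, the sketch below selects predictors that meet both a minimum correlation and a minimum proportion of usable cases, while forcing age into every EIM:

```r
# Predictors must correlate at least 0.30 with the target AND offer at
# least 25% usable cases; age is always included.
quickpred(nhanes, mincor = 0.3, minpuc = 0.25, include = "age")
```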
Use quickpred()
to create a predictor matrix for the nhanes data.
This predictor matrix should satisfy the following conditions.
- Include age as a predictor in every EIM.
- Select any other predictors that correlate at least 0.45 with the incomplete variable.

Which variables are selected as predictors for each EIM?
(pred <- quickpred(nhanes, include = "age", mincor = 0.45))
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 0
## hyp 1 0 0 0
## chl 1 0 0 0
Age is the only predictor in each of the imputation models.
Use the predictor matrix you created in Question 6.3 to impute the nhanes data.
Use default settings except for the following options.

- Create m = 8 imputations.
- Impute every incomplete variable with Bayesian linear regression, norm.

imp <- mice(nhanes,
m = 8,
method = "norm",
predictorMatrix = pred,
seed = 235711,
print = FALSE)
Create trace plots and density plots for the mids object you created in Question 6.4.
plot(imp)
densityplot(imp)
For each column in the dataset, we must specify an imputation method. The
method
slot in a mids object shows which methods were applied to each column.
The following code shows the default methods that would be applied if we impute
the nhanes data.
make.method(nhanes)
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The age
variable is completely observed, so it is not imputed. The empty
string, ""
, tells mice()
not to impute the corresponding variable. The other
variables will all be imputed via pmm
(predictive mean matching), which is
the default method for numeric data.
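One defining feature of pmm is that it borrows its imputations from observed donor values, so every imputed value must already occur in the data. A quick sanity check (a sketch; impPmm is a throwaway object created just for this purpose):

```r
# All pmm-imputed bmi values should be drawn from the observed bmi values,
# so this check should return TRUE.
impPmm <- mice(nhanes, method = "pmm", m = 5, seed = 235711, print = FALSE)
all(unlist(impPmm$imp$bmi) %in% impPmm$data$bmi)
```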
In reality, these data are a mix of numeric and categorical variables, and the nhanes2 dataset encodes the categorical variables with appropriate types.
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
In nhanes2, the age
variable is a factor comprising three categories, and
hyp
is a binary factor. If we run mice()
without specifying a method
argument, the algorithm will attempt to match the imputation method for each
column to the column’s type.
Let’s see what happens when we initialize a default method vector for the nhanes2 dataset.
(meth <- make.method(nhanes2))
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
The imputation method for hyp
is now set to logreg
, which imputes by
logistic regression (the default for binary factors). The mice()
algorithm
has recognized the type of the new hyp
variable and updated the imputation
method appropriately.
We can get an up-to-date overview of the available methods via the methods()
function.
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.mpmm mice.impute.norm
## [23] mice.impute.norm.boot mice.impute.norm.nob
## [25] mice.impute.norm.predict mice.impute.panImpute
## [27] mice.impute.passive mice.impute.pmm
## [29] mice.impute.polr mice.impute.polyreg
## [31] mice.impute.quadratic mice.impute.rf
## [33] mice.impute.ri mice.impute.sample
## [35] mice.mids mice.theme
## see '?methods' for accessing help and source code
More detail on the methods is available in the Details section of the
documentation for mice()
, and we can see detailed documentation for each
method through the help page for the appropriate mice.impute.METHOD()
function.
?mice
?mice.impute.lasso.select.logreg
In Question 6.4, you changed the imputation method for all variables by
setting a single value for the method
argument. We can also change the methods
on a variable-by-variable basis. The following code will change the imputation
method for bmi
to Bayesian linear regression while leaving the remaining
methods unaffected.
(meth["bmi"] <- "norm")
## [1] "norm"
To use this custom method vector, we simply supply meth
to the method
argument when we call mice()
.
imp <- mice(nhanes2, method = meth, print = FALSE)
Create a custom method vector for the nhanes2 data.
Define the following methods.
- age: None
- bmi: Classification and regression trees
- hyp: Logistic regression
- chl: Predictive mean matching

(meth <- c(age = "", bmi = "cart", hyp = "logreg", chl = "pmm"))
## age bmi hyp chl
## "" "cart" "logreg" "pmm"
## OR
meth <- make.method(nhanes2)
meth["bmi"] <- "cart"
meth
## age bmi hyp chl
## "" "cart" "logreg" "pmm"
Note that both of the above approaches produce the same result. The first approach is more direct and probably preferable for smaller datasets. The second approach can be more efficient when working with many variables, especially when you only need to use a few non-default methods.
Impute the missing values in the nhanes2 dataset.
Set up the imputation run as follows.
imp1 <- mice(nhanes2,
m = 15,
method = meth,
predictorMatrix = pred1,
seed = 235711,
print = FALSE)
Create trace plots and density plots for the mids object you created in Question 7.2.
plot(imp1)
densityplot(imp1)
By default, mice()
will use five iterations. Sometimes, these five will be
enough, but often we will want to run the algorithm longer. We can set the
number of iterations via the maxit
argument.
The following code will run the default imputation model for ten iterations.
imp <- mice(nhanes2, maxit = 10)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
## 6 1 bmi hyp chl
## 6 2 bmi hyp chl
## 6 3 bmi hyp chl
## 6 4 bmi hyp chl
## 6 5 bmi hyp chl
## 7 1 bmi hyp chl
## 7 2 bmi hyp chl
## 7 3 bmi hyp chl
## 7 4 bmi hyp chl
## 7 5 bmi hyp chl
## 8 1 bmi hyp chl
## 8 2 bmi hyp chl
## 8 3 bmi hyp chl
## 8 4 bmi hyp chl
## 8 5 bmi hyp chl
## 9 1 bmi hyp chl
## 9 2 bmi hyp chl
## 9 3 bmi hyp chl
## 9 4 bmi hyp chl
## 9 5 bmi hyp chl
## 10 1 bmi hyp chl
## 10 2 bmi hyp chl
## 10 3 bmi hyp chl
## 10 4 bmi hyp chl
## 10 5 bmi hyp chl
plot(imp)
Notice that now the x-axis in the trace plots extends to 10 to reflect the higher number of iterations.
One common reason to increase the number of iterations is non-convergence. We'll sometimes observe trends when we inspect the trace plots of the imputed items. These trends tell us that the algorithm has not yet reached an equilibrium point. In such situations, the first thing to try (so long as you don't notice any obvious errors in your imputation model specification) is increasing the number of iterations. While we could accomplish this goal simply by rerunning mice() with a higher value for maxit, doing so would be very wasteful: we would have to recreate all of the existing iterations before generating any new ones. Thankfully, we don't need to be so inefficient.
We can easily extend the chains in any mids object with the mice.mids() function. This function takes a mids object as its primary argument and picks up the sampling where mice() left off (using exactly the same imputation model specification). The following code will extend the chains in our most recent mids object to 25 total iterations by adding another 15 iterations.
imp <- mice.mids(imp, maxit = 15, print = FALSE)
plot(imp)
Now, imp contains 25 iterations, which we can see from the range of the x-axis in the trace plots.
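Trace plots are a visual check, but the mids object also stores the chain summaries that plot() draws, so we can inspect them numerically as well. A rough sketch (this assumes the imputation of nhanes2 above, where chl is among the imputed variables):

```r
# imp$chainMean is an array of chain means: variables x iterations x chains.
# Looking at the final iterations for chl lets us see whether the per-chain
# means are still trending or have settled down.
cm <- imp$chainMean["chl", , ]  # iterations x chains matrix
tail(cm, 3)                     # last three iterations, one column per chain
```

If the rows of this matrix still drift systematically in one direction, that echoes the trends we would look for in the trace plots.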
Increase the number of iterations for the imputation you ran in Question 7.2 to 25.
imp1 <- mice.mids(imp1, maxit = 20, print = FALSE)
Recall that even if the imputation model converges (e.g., as demonstrated by trace plots), the imputed values still might not be reasonable estimates of the missing data. If the data are MCAR, then the imputations should have the same distribution as the observed data. In general, though, the distributions of the observed and imputed data may differ when the data are MAR. However, we don’t want to see very large discrepancies.
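To supplement the density plots with numbers, we can also compare the observed values to the stacked imputed values directly. A rough sketch (using the imp1 object from Question 7.2; in a mids object, imp1$imp$chl holds one column of imputed chl values per imputation):

```r
# Compare the distribution of observed chl values with all imputed chl values.
obs  <- nhanes2$chl[!is.na(nhanes2$chl)]  # observed chl values
imps <- unlist(imp1$imp$chl)              # imputed chl values, all m stacked
summary(obs)
summary(imps)
```

Grossly different summaries would not automatically invalidate the imputations under MAR, but they would warrant a closer look.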
Strip plots are another excellent visualization tool with which we can assess the comparability of the imputed and observed data. The mice::stripplot() function will generate strip plots from a mids object. The following code will create a strip plot of the observed and imputed chl values.
stripplot(imp, chl)
The observed data are plotted in blue, and the imputed data are plotted in red. Since chl was imputed with PMM (which draws imputations from the observed data), the imputed values have the same gaps as the observed data, and the imputed values are always within the range of the observed data. This figure indicates that the observed and imputed values of chl follow similar distributions, so we can conclude that these imputations are plausible.
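Because PMM fills in missing values by borrowing observed values from donor cases, we can verify this property directly. A quick sketch:

```r
# Every PMM-imputed chl value should be one of the observed chl values,
# since PMM copies its imputations from observed donor cases.
imputed_chl <- unlist(imp$imp$chl)
all(imputed_chl %in% nhanes2$chl)  # should be TRUE for PMM
```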
We can also call stripplot() without specifying any variables. Doing so will plot all of the imputed variables in an array.
stripplot(imp)
Create trace plots, density plots, and strip plots for the mids object you created in Question 8.1.
Based on these visualizations, would you say the imputation model has converged to a valid solution?
plot(imp1)
densityplot(imp1)
stripplot(imp1)
Yes, these results look pretty good. The trace plots show the chains mixing well without any trends, and the density and strip plots show no extreme or implausible values.
Rerun the imputation from Question 8.1.
Keep all the same settings, but use Bayesian linear regression to impute bmi and chl.
meth <- imp1$method
meth[c("bmi", "chl")] <- "norm"
meth
## age bmi hyp chl
## "" "norm" "logreg" "norm"
imp2 <- mice(nhanes2,
m = 15,
method = meth,
predictorMatrix = pred1,
maxit = 25,
seed = 235711,
print = FALSE)
Create trace plots, density plots, and strip plots for the mids object you created in Question 9.2.
Based on the visualizations, are the imputations from Question 8.1 or Question 9.2 more reasonable? Why?
plot(imp2)
densityplot(imp2)
stripplot(imp2)
Both sets of imputations seem acceptable and reasonable, but each set also has its own potential issues. The donor-based imputations of bmi and chl from Question 8.1 show a tendency to cluster around the center of the distribution (hence the "sharp" spikes in some of the imputed densities). The normal-theory imputations from Question 9.2, on the other hand, contain some very high and very low values for both bmi and chl.
The choice between these two results probably comes down to which aspects of the distribution are more important for your application. If maintaining the same range as the observed data is important, then the PMM- and CART-based imputations are probably better. If a more accurate representation of variability is important (and the out-of-bound values don’t matter), then the normal-theory imputations may be preferable.
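If you want the variability of normal-theory imputation but also need to keep the imputations within a plausible range, mice supports post-processing of imputed values via the post argument. The sketch below uses mice::squeeze() to clamp imputed chl values to an illustrative range; the bounds of 100 and 300 are assumptions for demonstration, not values taken from this lab.

```r
post <- imp2$post
# After each imputation step, squeeze the imputed chl values into [100, 300].
# (Bounds are hypothetical; choose limits that make sense for your data.)
post["chl"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(100, 300))"
imp3 <- mice(nhanes2,
             m      = 15,
             method = meth,
             post   = post,
             seed   = 235711,
             print  = FALSE)
```

This keeps the norm-based draws but removes the out-of-bounds values, at the cost of slightly truncating the imputation distribution.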
Use the multiply imputed data you created in Question 8.1 to estimate the following regression model.
\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{chl} + \varepsilon\)
fit <- with(imp1, lm(bmi ~ age + chl))
Pool the estimates from the model you estimated in Question 9.4.
Is the effect of chl on bmi significant after controlling for age?
est <- pool(fit)
summary(est)
## term estimate std.error statistic df p.value
## 1 (Intercept) 20.41801164 4.21217336 4.847382 13.76014 0.0002714008
## 2 age40-59 -2.42177927 2.20075229 -1.100432 13.42765 0.2904806485
## 3 age60-99 -3.23774834 2.46989035 -1.310887 12.05394 0.2143209558
## 4 chl 0.03711272 0.02384541 1.556388 11.61866 0.1464214159
No, cholesterol level does not significantly predict BMI after controlling for age (\(\beta = 0.04\), \(t[11.62] = 1.56\), \(p = 0.146\)).
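Behind the scenes, pool() combines the per-imputation results with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance adds the within-imputation variance to an inflated between-imputation variance. A self-contained sketch with made-up numbers (not taken from this lab's output):

```r
# Rubin's rules by hand, using hypothetical results from m = 3 analyses.
est <- c(0.035, 0.040, 0.038)  # point estimates from the m fitted models
se  <- c(0.020, 0.022, 0.021)  # their standard errors
m   <- length(est)

qbar  <- mean(est)           # pooled point estimate
W     <- mean(se^2)          # within-imputation variance
B     <- var(est)            # between-imputation variance
total <- W + (1 + 1/m) * B   # total variance (Rubin, 1987)
sqrt(total)                  # pooled standard error
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error, reflecting the extra uncertainty due to the missing data.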
Rerun the regression analysis from above using the imputed data from Question 9.2.
Do the results differ between the two versions of the analysis? Why or why not?
est2 <- with(imp2, lm(bmi ~ age + chl)) %>% pool()
summary(est2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 18.76625504 4.6061039 4.074214 9.172722 0.00267377
## 2 age40-59 -3.63438601 2.2470109 -1.617431 11.528841 0.13279824
## 3 age60-99 -4.10858709 2.5738980 -1.596251 9.889733 0.14185676
## 4 chl 0.05183625 0.0232639 2.228184 10.123331 0.04968446
The estimates differ slightly due to the different imputation methods for bmi and chl. Also, these differences were just enough to tip the effect of chl over into statistical significance at the \(\alpha = 0.05\) level.
End of Lab 2a