In these lab exercises, you will explore how certain pathological data characteristics can affect imputation results. You will also learn a very useful technique for imputing variables with known relations: Passive Imputation.


1 Setup


1.1

Load the mice package.


The mammalsleep dataset is part of mice. This dataset contains the Allison and Cicchetti (1976) data for mammalian species.

  • Unless otherwise specified, all questions in this section refer to the mammalsleep data.

1.2

Check the documentation for the mammalsleep dataset.


1.3

Summarize the data to get an overview of their structure.
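One possible way to carry out the setup steps above is sketched here. It assumes the mice package is already installed; the particular summaries chosen are only a suggestion.

```r
# Load mice, which also makes the mammalsleep data available.
library(mice)

# The documentation can be opened interactively with: ?mammalsleep

# Overview of the structure and the variable distributions.
str(mammalsleep)
summary(mammalsleep)

# Number of missing values per variable.
colSums(is.na(mammalsleep))
```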


2 Naive Imputation


2.1

Use mice() to multiply impute the mammalsleep data.

  • Create 5 imputations.
  • Use 10 iterations.
  • Use predictive mean matching as the imputation method.
  • Set a random number seed.
  • Keep all other options at their default values.

When you run this imputation, you will probably see a warning about “logged events”. When mice() encounters certain computational difficulties (e.g., extreme collinearity), it will take automatic remedial action and log the action in the loggedEvents slot of the mids object.

If you ever see a warning about logged events, you should check the loggedEvents slot to see what actions were taken. You want to assess if the actions were appropriate and judge the likely impact that the actions will have on your results. If the actions mice() has taken seem too extreme, you need to address the underlying data issues and rerun the imputation with the cleaned data.
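As a rough sketch of this workflow, the following code runs an imputation with the settings from Question 2.1 and then looks at the logged events. The object name imp_naive and the seed value are arbitrary choices made for illustration.

```r
library(mice)

# Naive imputation: 5 imputations, 10 iterations, PMM for all incomplete variables.
imp_naive <- mice(mammalsleep,
                  m = 5,
                  maxit = 10,
                  method = "pmm",
                  seed = 235711,        # arbitrary seed
                  printFlag = FALSE)

# Inspect the logged events (NULL if nothing was logged).
# Each row records the iteration, imputation, variable, and action involved.
head(imp_naive$loggedEvents)
```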


2.2

Check the contents of the loggedEvents slot in the mids object you created in Question 2.1.

  • What problem is mice() logging here?

We can get a sense of how impactful the issues recorded in loggedEvents have been through our usual diagnostic plots.


2.3

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.

What do you conclude vis-a-vis convergence and the validity of these imputations?


For didactic purposes, let’s “play the fool”, ignore any information we may have gleaned from the diagnostic plots, and analyze these imputed data via the usual process.


2.4

Use the imputed data you created in Question 2.1 to fit the following regression model.

\(Y_{sws} = \beta_0 + \beta_1 X_{\ln(bw)} + \beta_2 X_{odi} + \varepsilon\)

Pool the MI estimates, and check the RIV, \(\lambda\), and FMI.

  • What do you notice?
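A minimal sketch of the analysis and pooling steps follows. It re-creates the naive imputations so that the code is self-contained; the object names and the seed are arbitrary, and the model formula assumes the natural log of bw.

```r
library(mice)

# Re-create the naive imputations (same settings as Question 2.1).
imp_naive <- mice(mammalsleep, m = 5, maxit = 10, method = "pmm",
                  seed = 235711, printFlag = FALSE)

# Fit the regression model to each imputed dataset.
fit <- with(imp_naive, lm(sws ~ log(bw) + odi))

# Pool the estimates with Rubin's rules.
pooled <- pool(fit)
summary(pooled)

# Extract the RIV, lambda, and FMI for each coefficient.
cbind(term = names(coef(fit$analyses[[1]])),
      pooled$pooled[, c("riv", "lambda", "fmi")])
```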

The poor performance you should have noted above is largely driven by the species variable. This variable is a factor with 62 levels. So, when we include this variable as a predictor in the imputation models, it enters the model as a set of 61 dummy codes. These dummy codes produce the \(P > N\) problem noted in loggedEvents, which leads to poor imputations.
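The dimensions involved can be checked directly. The sketch below counts the dummy codes that species generates and compares the resulting number of predictor columns to the number of rows; the helper names are made up for illustration.

```r
library(mice)

# Number of levels in the species factor.
nlevels(mammalsleep$species)

# Dummy codes generated for species (drop the intercept column).
X <- model.matrix(~ species, data = mammalsleep)
n_dummies <- ncol(X) - 1
n_dummies

# Predictor columns when imputing any one variable:
# the species dummies plus every other variable except the target itself.
p <- n_dummies + (ncol(mammalsleep) - 2)
n <- nrow(mammalsleep)
c(P = p, N = n)
```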


2.5

Use mice() to re-impute the mammalsleep data.

Use the same settings from Question 2.1, but do not use species as a predictor in any of the imputation models.

  • Name the resulting mids object imp1.
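One way to exclude species from the imputation models is to zero out its column in the predictor matrix, as sketched below. The seed is arbitrary. Dropping the species column from the data before imputing would be another option.

```r
library(mice)

# Start from the default predictor matrix.
pred <- make.predictorMatrix(mammalsleep)

# Columns index predictors, so a zero column removes species from every model.
pred[, "species"] <- 0

imp1 <- mice(mammalsleep,
             m = 5,
             maxit = 10,
             method = "pmm",
             predictorMatrix = pred,
             seed = 235711,          # arbitrary seed
             printFlag = FALSE)
```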

2.6

If you get a warning about logged events, check the loggedEvents slot of the mids object you created in Question 2.5.


2.7

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.5.

What do you conclude vis-a-vis convergence and the validity of these imputations?


The convergence issues you should have noticed above are caused by structural features of the data. Total sleep (ts) is the sum of paradoxical sleep (ps) and slow wave sleep (sws). The imputation model treats ts as distinct and stochastically related to ps and sws, but ts is actually a deterministic function of ps and sws. This deterministic relation is ignored in the imputations, and the resulting circularity in the imputations keeps the model from finding a unique solution.
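You can examine this relation in the fully observed rows. The following sketch summarizes the difference between ts and the sum of sws and ps; any differences that are not exactly zero would presumably reflect rounding in the recorded values.

```r
library(mice)

# Keep only rows where all three sleep variables are observed.
obs <- na.omit(mammalsleep[, c("sws", "ps", "ts")])

# Difference between recorded total sleep and the sum of its components.
summary(obs$ts - (obs$sws + obs$ps))
```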

Thankfully, mice() offers a convenient routine for addressing exactly these types of known relations among variables in the imputation model: passive imputation. With passive imputation, we can account for transformations, combinations, and recoded variables when imputing their missing data.


3 Passive Imputation


Frequently, we need to transform, combine, or recode variables. When such a need arises with incomplete variables that we’d like to impute, we have a few options.

  1. We could impute the original, un-transformed, variable and transform the completed version afterwards (the so-called impute-then-transform approach).
  2. We could transform the incomplete variable and impute the transformed version as if it were any other variable (the so-called just-another-variable approach).

Both of these approaches have an important limitation, though. In neither case does the imputation model have access to both the original and the transformed versions of the variable in question. The imputations are either generated using the information in the raw version of the variable (in the impute-then-transform approach) or using the information in the transformation (in the just-another-variable approach), but not both. Note that keeping both the raw and transformed versions of a variable in the model is not an option since doing so induces perfectly collinear variables.
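The two approaches can be sketched with bmi in the boys data. This is only an illustration under default mice() settings; the object names and the seed are arbitrary.

```r
library(mice)

# 1. Impute-then-transform: leave bmi out, impute hgt and wgt, then compute bmi.
boys_nobmi <- boys[, setdiff(names(boys), "bmi")]
imp_itt <- mice(boys_nobmi, seed = 235711, printFlag = FALSE)
comp_itt <- complete(imp_itt, 1)
comp_itt$bmi <- with(comp_itt, wgt / (hgt / 100)^2)

# 2. Just-another-variable: impute bmi like any other incomplete variable.
imp_jav <- mice(boys, seed = 235711, printFlag = FALSE)
comp_jav <- complete(imp_jav, 1)

# Under JAV, nothing forces imputed bmi to agree with imputed hgt and wgt.
miss <- is.na(boys$bmi)
summary(comp_jav$bmi[miss] - with(comp_jav[miss, ], wgt / (hgt / 100)^2))
```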

To solve this problem, mice() implements a third approach called passive imputation. The goal of passive imputation is to maintain known, deterministic relations among incomplete variables throughout the imputation process and to allow the imputation model to use the transformed variables as predictors when imputing other variables (but not when imputing the raw variables from which the transformation is computed).

For example, we can use passive imputation to maintain the following deterministic function in the boys data \[\text{BMI} = \frac{\text{Weight}}{\text{Height}^2}\] or this compositional relation in the mammalsleep data \[\text{ts} = \text{ps}+\text{sws}.\]


To implement passive imputation, we need to adjust two features of the mice() setup:

  1. The method vector
    • We use the method vector to define the deterministic relations.
  2. The predictor matrix
    • We adjust the predictor matrix to keep a transformed variable from being used as a predictor of its raw version.

The following code will adjust the method vector from Question 2.5 to implement passive imputation for ts.

(meth <- imp1$method)
## species      bw     brw     sws      ps      ts     mls      gt      pi     sei 
##      ""      ""      ""   "pmm"   "pmm"   "pmm"   "pmm"   "pmm"      ""      "" 
##     odi 
##      ""
meth["ts"] <- "~ I(sws + ps)"
meth
##         species              bw             brw             sws              ps 
##              ""              ""              ""           "pmm"           "pmm" 
##              ts             mls              gt              pi             sei 
## "~ I(sws + ps)"           "pmm"           "pmm"              ""              "" 
##             odi 
##              ""

Now, ts will not be independently imputed along with the other variables. Rather, in each iteration, the most recently completed versions of sws and ps will be added together to define the updated version of ts.


The updated version of ts defined according to the deterministic relation described above can then be used as a predictor when imputing other variables, but we do not want to use ts to impute either sws or ps (to avoid circularity). So, we need to adjust the predictor matrix to satisfy this restriction.

(pred <- imp1$predictorMatrix)
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  1   1  1  1   1   1
## ps            0  1   1   1  0  1   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0
pred[c("sws", "ps"), "ts"] <- 0
pred
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  0   1  1  1   1   1
## ps            0  1   1   1  0  0   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0

Now, we can re-impute the mammalsleep data using passive imputation to account for the deterministic relation between ts, sws, and ps. We do so simply by using the updated method vector and predictor matrix in a regular run of mice().

imp <- mice(mammalsleep, 
            method = meth, 
            predictorMatrix = pred, 
            maxit = 20, 
            seed = 235711, 
            print = FALSE)

If we inspect the diagnostic plots for these imputations, we see much better performance than we achieved in Questions 2.1 or 2.5.

plot(imp)

densityplot(imp)

stripplot(imp)

We can see that the pathological non-convergence of Question 2.5 has been corrected by the passive imputation.
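Beyond the plots, you can check numerically that the imputed ts values respect the deterministic relation. The sketch below rebuilds the method vector and predictor matrix from scratch (assuming species was excluded by zeroing its column, as in the predictor matrix shown above) so that it can run on its own. It checks only the rows where ts was originally missing, because observed ts values are left untouched.

```r
library(mice)

# Rebuild the setup: species excluded, ts defined passively,
# and ts not used to predict sws or ps.
pred <- make.predictorMatrix(mammalsleep)
pred[, "species"] <- 0
pred[c("sws", "ps"), "ts"] <- 0

meth <- make.method(mammalsleep)
meth["ts"] <- "~ I(sws + ps)"

imp <- mice(mammalsleep, method = meth, predictorMatrix = pred,
            maxit = 20, seed = 235711, printFlag = FALSE)

# For each imputed dataset, compare imputed ts to sws + ps.
ts_missing <- is.na(mammalsleep$ts)
relation_holds <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  isTRUE(all.equal(d$ts[ts_missing], d$sws[ts_missing] + d$ps[ts_missing]))
})
relation_holds
```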


You will now implement passive imputation yourself using the boys dataset. The boys dataset is distributed with mice, so you will be able to access these data once you’ve loaded the mice package. The boys data are a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study.

Unless otherwise specified, all questions in this section refer to the boys dataset.


3.1

Check the documentation for the boys data.


3.2

Summarize the boys data to get a sense of their characteristics.


3.3

Use the mice::md.pattern() function to summarize the response patterns.

  • How many different missing data patterns are present in the boys data?
  • Which pattern occurs most frequently in these data?

3.4

Multiply impute the boys data using passive imputation for bmi.

Use passive imputation to maintain the known relation between bmi, wgt, and hgt.

  • Specify the method vector entry for bmi as "~ I(wgt / (hgt / 100)^2)".
  • Keep the default predictor matrix, for now.
  • Use 20 iterations.
  • Set a random number seed.
  • Leave all other settings at their default values.
  • Name the resulting mids object imp1.
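The method vector for this question can be set up along the following lines. This sketch only builds and prints the vector; passing it to mice() via the method argument is left to you.

```r
library(mice)

# Default methods for the boys data.
meth <- make.method(boys)

# Replace the bmi entry with the passive definition.
# hgt is measured in cm, so it is divided by 100 to get metres.
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth
```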

Run the following code to inspect the relation between the imputed BMI and the BMI calculated from the imputed height and weight. If the passive imputation was successful, these points should fall along a perfect line.

xyplot(imp1, 
       bmi ~ I(wgt / (hgt / 100)^2), 
       ylab = "Imputed BMI", 
       xlab = "Calculated BMI")


3.5

Create trace plots, density plots, and strip plots from the mids object you created in Question 3.4.

What do you conclude vis-a-vis convergence and the validity of these imputations?


Of course, the issues you should have spotted in the above imputations are to be expected since we have purposefully omitted the second part of passive imputation. We have not adjusted the predictor matrix, so we have circularity in the imputations. We used passive imputation to create the imputations for bmi, but bmi is still used as a predictor for wgt and hgt.
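Removing the circularity mirrors what was done for ts in the mammalsleep example. The following is a sketch under that assumption; the object name imp_passive and the seed are arbitrary. It also checks that imputed bmi values match the values computed from the imputed hgt and wgt.

```r
library(mice)

# Passive definition of bmi.
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"

# Rows are targets, columns are predictors:
# stop bmi from predicting the variables it is computed from.
pred <- make.predictorMatrix(boys)
pred[c("wgt", "hgt"), "bmi"] <- 0

imp_passive <- mice(boys, method = meth, predictorMatrix = pred,
                    maxit = 20, seed = 235711, printFlag = FALSE)

# Check the deterministic relation in the rows where bmi was imputed.
bmi_missing <- is.na(boys$bmi)
bmi_consistent <- sapply(seq_len(imp_passive$m), function(i) {
  d <- complete(imp_passive, i)[bmi_missing, ]
  isTRUE(all.equal(d$bmi, d$wgt / (d$hgt / 100)^2))
})
bmi_consistent
```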


3.6

Adjust the predictor matrix to remove the circularity described above.

Re-impute the boys data using the updated predictor matrix.

  • Keep all other settings the same as in Question 3.4.
  • Name the resulting mids object imp1.

3.7

Recreate the xyplot() from above using the imputations from Question 3.6.

Is the deterministic definition of bmi maintained in the imputed data?


3.8

Create trace plots, density plots, and strip plots from the mids object you created in Question 3.6.

What do you conclude vis-a-vis convergence and the validity of these imputations?


Just for fun: What you shouldn’t do with passive imputation

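Never define every variable in a deterministic relation passively. If bmi, wgt, and hgt are each computed from the other two, none of them is ever stochastically imputed, and the algorithm will never escape the starting values.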

meth <- make.method(boys)

meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"] <- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"] <- "~ I(sqrt(wgt / bmi) * 100)"

imp <- mice(boys, method = meth, seed = 235711, print = FALSE)

plot(imp, c("hgt", "wgt", "bmi"))


End of Lab 2b