In these lab exercises, you will explore how certain pathological data characteristics can affect imputation results. You will also learn a very useful technique for imputing variables with known relations: Passive Imputation.
Load the mice package.
The mammalsleep dataset is part of mice. This dataset contains the Allison and Cicchetti (1976) data for mammalian species.
Check the documentation for the mammalsleep dataset.
Summarize the data to get an overview of their structure.
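The first three steps could look something like the following minimal sketch. It only assumes that the mice package is installed; the extra missing-data count at the end is an illustrative addition, not part of the original exercise.

```r
library(mice)  # loading mice also makes the mammalsleep data available

# Open the documentation for the dataset
help("mammalsleep", package = "mice")

# Overview of the structure and the distribution of each variable
str(mammalsleep)
summary(mammalsleep)

# Illustrative extra: number of missing values per variable
colSums(is.na(mammalsleep))
```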
Use mice() to multiply impute the mammalsleep data.
When you run this imputation, you will probably see a warning about “logged
events”. When mice() encounters certain computational difficulties (e.g.,
extreme collinearity), it will take automatic remedial action and log the action
in the loggedEvents slot of the mids object.
If you ever see a warning about logged events, you should check the loggedEvents
slot to see what actions were taken. You want to assess if the actions were
appropriate and judge the likely impact that the actions will have on your
results. If the actions mice() has taken seem too extreme, you need to address
the underlying data issues and rerun the imputation with the cleaned data.
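As a rough sketch, the logged events from a default run could be inspected as shown below. The object name imp0 and the seed are assumptions made for illustration; the settings you used in Question 2.1 may differ.

```r
library(mice)

# Assumed settings: default methods and predictors. `imp0` is a hypothetical
# name for the Question 2.1 object; the seed only makes the run reproducible.
imp0 <- mice(mammalsleep, seed = 235711, printFlag = FALSE)

# The log is a data frame (or NULL if nothing was logged) with one row per
# event: iteration (it), imputation (im), target variable (dep),
# method (meth), and the variables that were removed (out).
events <- imp0$loggedEvents
head(events)

# Count the events per target variable
if (!is.null(events)) print(table(events$dep))
```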
Check the contents of the loggedEvents slot in the mids object you created
in Question 2.1.
What is mice() logging here?
We can get a sense of how impactful the issues noted in the loggedEvents log
have been through our usual diagnostic plots.
Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.
What do you conclude vis-a-vis convergence and the validity of these imputations?
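The three diagnostic plots could be produced along these lines. This is a sketch that again assumes a default run stored under the hypothetical name imp0; the plotting functions are the same ones used later in this lab.

```r
library(mice)

# Hypothetical stand-in for the Question 2.1 object (assumed default settings)
imp0 <- mice(mammalsleep, seed = 235711, printFlag = FALSE)

# Trace plots: means and SDs of the imputed values across iterations
print(plot(imp0))

# Density plots: observed data vs. each set of imputed values
print(densityplot(imp0))

# Strip plots: individual observed and imputed values per imputation
print(stripplot(imp0))
```

The explicit print() calls are only there so that the lattice plots are also drawn when the code is run via source() or inside a function.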
For didactic purposes, let’s “play the fool”, ignore any information we may have gleaned from the diagnostic plots, and analyze these imputed data via the usual process.
Use the imputed data you created in Question 2.1 to fit the following regression model.
\(Y_{sws} = \beta_0 + \beta_1 X_{ln(bw)} + \beta_2 X_{odi} + \varepsilon\)
Pool the MI estimates, and check the RIV, \(\lambda\), and FMI.
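One way to fit and pool this model is sketched below. The name imp0 and the imputation settings are assumptions; the model formula is a direct translation of the equation above, with log(bw) as the natural log of body weight.

```r
library(mice)

# Hypothetical stand-in for the Question 2.1 object (assumed default settings)
imp0 <- mice(mammalsleep, seed = 235711, printFlag = FALSE)

# Fit the regression model to each imputed dataset
fit0 <- with(imp0, lm(sws ~ log(bw) + odi))

# Pool the estimates with Rubin's rules
est0 <- pool(fit0)
summary(est0)

# RIV, lambda, and FMI are stored in the pooled table
est0$pooled[, c("estimate", "riv", "lambda", "fmi")]
```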
The poor performance you should have noted above is largely driven by the
species variable. This variable is a factor with 62 levels. So, when we include
this variable as a predictor in the imputation models, it enters the model as a
set of 61 dummy codes. These dummy codes produce the \(P > N\) problem noted in
the loggedEvents log, which leads to poor imputations.
Use mice() to re-impute the mammalsleep data.
Use the same settings from Question 2.1, but do not use species as a
predictor in any of the imputation models.
Save the resulting mids object as imp1.
If you get a warning about logged events, check the loggedEvents slot of the
mids object you created in Question 2.5.
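A minimal sketch of this step, assuming default settings otherwise: setting the species column of the predictor matrix to zero removes species as a predictor from every imputation model. The seed is an assumption; your settings from Question 2.1 may differ.

```r
library(mice)

# Start from the default predictor matrix (rows = targets, columns = predictors)
pred <- make.predictorMatrix(mammalsleep)

# Zero out the species column so it predicts nothing
pred[, "species"] <- 0

# Re-impute; other settings are assumed defaults
imp1 <- mice(mammalsleep,
             predictorMatrix = pred,
             seed = 235711,
             printFlag = FALSE)

# NULL here means that no events were logged
imp1$loggedEvents
```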
Create trace plots, density plots, and strip plots from the mids object you created in Question 2.5.
What do you conclude vis-a-vis convergence and the validity of these imputations?
The convergence issues you should have noticed above are caused by structural
features of the data. Total sleep (ts) is the sum of paradoxical sleep (ps)
and slow wave sleep (sws). The imputation model treats ts as distinct and
stochastically related to ps and sws, but ts is actually a deterministic
function of ps and sws. This deterministic relation is ignored in the
imputations, and the resulting circularity in the imputations keeps the model
from finding a unique solution.
Thankfully, mice() offers a convenient routine for addressing exactly these
types of known relations among variables in the imputation model: passive
imputation. With passive imputation, we can account for transformations,
combinations, and recoded variables when imputing their missing data.
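Frequently, we need to transform, combine, or recode variables. When such a need arises with incomplete variables that we’d like to impute, we have a few options. We can impute the raw variables and then apply the transformation to the imputed data (the impute-then-transform approach), or we can apply the transformation to the incomplete data and then impute the transformed variable like any other variable (the just-another-variable approach).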
Both of these approaches have an important limitation, though. In neither case does the imputation model have access to both the original and the transformed versions of the variable in question. The imputations are either generated using the information in the raw version of the variable (in the impute-then-transform approach) or using the information in the transformation (in the just-another-variable approach), but not both. Note that keeping both the raw and transformed versions of a variable in the model is not an option since doing so induces perfectly collinear variables.
To solve this problem, mice() implements a third approach called passive
imputation. The goal of passive imputation is to maintain known, deterministic
relations among incomplete variables throughout the imputation process and to
allow the imputation model to use the transformed variables as predictors when
imputing other variables (other than the raw versions of the transformed
variables themselves).
For example, we can use passive imputation to maintain the following
deterministic function in the boys data
\[\text{BMI} = \frac{\text{Weight}}{\text{Height}^2}\]
or this compositional relation in the mammalsleep data
\[\text{ts} = \text{ps}+\text{sws}.\]
To implement passive imputation, we need to adjust two features of the mice()
setup: the method vector and the predictor matrix.
The following code will adjust the method vector from Question 2.5 to
implement passive imputation for ts.
(meth <- imp1$method)
## species      bw     brw     sws      ps      ts     mls      gt      pi     sei
##      ""      ""      ""   "pmm"   "pmm"   "pmm"   "pmm"   "pmm"      ""      ""
##     odi
##      ""
meth["ts"] <- "~ I(sws + ps)"
meth
##         species              bw             brw             sws              ps
##              ""              ""              ""           "pmm"           "pmm"
##              ts             mls              gt              pi             sei
## "~ I(sws + ps)"           "pmm"           "pmm"              ""              ""
##             odi
##              ""
Now, ts will not be independently imputed along with the other variables.
Rather, in each iteration, the most recently completed versions of sws and ps
will be added together to define the updated version of ts.
The updated version of ts defined according to the deterministic relation
described above can then be used as a predictor when imputing other variables,
but we do not want to use ts to impute either sws or ps (to avoid
circularity). So, we need to adjust the predictor matrix to satisfy this
restriction.
(pred <- imp1$predictorMatrix)
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  1   1  1  1   1   1
## ps            0  1   1   1  0  1   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0
pred[c("sws", "ps"), "ts"] <- 0
pred
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  0   1  1  1   1   1
## ps            0  1   1   1  0  0   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0
Now, we can re-impute the mammalsleep data using passive imputation to account
for the deterministic relation between ts, sws, and ps. We do so simply by
using the updated method vector and predictor matrix in a regular run of mice().
imp <- mice(mammalsleep,
            method = meth,
            predictorMatrix = pred,
            maxit = 20,
            seed = 235711,
            print = FALSE)
If we inspect the diagnostic plots for these imputations, we see much better performance than we achieved in Questions 2.1 or 2.5.
plot(imp)
densityplot(imp)
stripplot(imp)
We can see that the pathological non-convergence of Question 2.5 has been corrected by the passive imputation.
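Beyond the plots, we can also check numerically that the deterministic relation holds in the completed data. The sketch below rebuilds the method vector and predictor matrix from scratch with make.method() and make.predictorMatrix() so that it runs on its own; that reconstruction is an assumption and may not match your imp1 settings exactly. Note that the check is restricted to the rows where ts was originally missing, because passive imputation only fills in missing ts values and leaves observed ts values untouched.

```r
library(mice)

# Rebuild the setup (assumed to mirror the steps described above)
meth <- make.method(mammalsleep)
meth["ts"] <- "~ I(sws + ps)"

pred <- make.predictorMatrix(mammalsleep)
pred[, "species"] <- 0           # species is not used as a predictor
pred[c("sws", "ps"), "ts"] <- 0  # ts does not predict its own components

imp <- mice(mammalsleep,
            method = meth,
            predictorMatrix = pred,
            maxit = 20,
            seed = 235711,
            printFlag = FALSE)

# For each completed dataset, compare the imputed ts with sws + ps
miss_ts <- is.na(mammalsleep$ts)
ts_ok <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  isTRUE(all.equal(as.numeric(d$ts[miss_ts]),
                   as.numeric((d$sws + d$ps)[miss_ts])))
})
ts_ok
```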
You will now implement passive imputation yourself using the boys dataset. The boys dataset is distributed with mice, so you will be able to access these data once you’ve loaded the mice package. The boys data are a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study.
Unless otherwise specified, all questions in this section refer to the boys dataset.
Check the documentation for the boys data.
Summarize the boys data to get a sense of their characteristics.
Use the mice::md.pattern() function to summarize the response patterns.
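These three exploratory steps might look like the sketch below. The plot = FALSE argument is an illustrative choice that suppresses the graphical display and returns only the pattern matrix.

```r
library(mice)

# Documentation and basic summaries for the boys data
help("boys", package = "mice")
summary(boys)

# Response patterns: each row is a pattern (1 = observed, 0 = missing);
# row names give the number of cases showing that pattern
pat <- mice::md.pattern(boys, plot = FALSE)
pat
```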
Multiply impute the boys data using passive imputation for bmi.
Use passive imputation to maintain the known relation between bmi, wgt, and
hgt.
Specify the method for bmi as "~ I(wgt / (hgt / 100)^2)".
Save the resulting mids object as imp1.
Run the following code to inspect the relation between the imputed BMI and the BMI calculated from the imputed height and weight. If the passive imputation was successful, these points should fall along a perfect line.
xyplot(imp1,
       bmi ~ I(wgt / (hgt / 100)^2),
       ylab = "Imputed BMI",
       xlab = "Calculated BMI")
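For reference, the imp1 object used in the plot above could be created roughly as follows. This is a sketch under assumed settings (default predictor matrix, an arbitrary seed). The division by 100 converts hgt from centimeters to meters before squaring.

```r
library(mice)

# Default method vector, then replace the entry for bmi with a passive formula
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth

# The predictor matrix is deliberately left at its default here
imp1 <- mice(boys, method = meth, seed = 235711, printFlag = FALSE)
```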
Create trace plots, density plots, and strip plots from the mids object you created in Question 3.4.
What do you conclude vis-a-vis convergence and the validity of these imputations?
Of course, the issues you should have spotted in the above imputations are to
be expected since we have purposefully omitted the second part of passive
imputation. We have not adjusted the predictor matrix, so we have circularity in
the imputations. We used passive imputation to create the imputations for bmi,
but bmi is still used as a predictor for wgt and hgt.
Adjust the predictor matrix to remove the circularity described above.
Re-impute the boys data using the updated predictor matrix.
Save the resulting mids object as imp1.
Recreate the xyplot() from above using the imputations from Question 3.6.
Is the deterministic definition of bmi maintained in the imputed data?
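A sketch of the corrected setup, together with a numerical version of the check, is given below. The object name imp_fix and the seed are assumptions for illustration. The check only looks at rows where bmi was originally missing, since those are the values that passive imputation fills in.

```r
library(mice)

# Passive imputation for bmi
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"

# Remove the circularity: bmi must not predict wgt or hgt
pred <- make.predictorMatrix(boys)
pred[c("wgt", "hgt"), "bmi"] <- 0

# `imp_fix` is a hypothetical name for the re-imputed mids object
imp_fix <- mice(boys,
                method = meth,
                predictorMatrix = pred,
                seed = 235711,
                printFlag = FALSE)

# Compare imputed bmi with bmi computed from the completed wgt and hgt
miss_bmi <- is.na(boys$bmi)
bmi_ok <- sapply(seq_len(imp_fix$m), function(i) {
  d <- complete(imp_fix, i)
  calc <- d$wgt / (d$hgt / 100)^2
  isTRUE(all.equal(as.numeric(d$bmi[miss_bmi]), as.numeric(calc[miss_bmi])))
})
bmi_ok
```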
Create trace plots, density plots, and strip plots from the mids object you created in Question 3.6.
What do you conclude vis-a-vis convergence and the validity of these imputations?
Just for fun: What you shouldn’t do with passive imputation
Never fix all relations. The algorithm will never escape the starting values.
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"] <- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"] <- "~ I(sqrt(wgt / bmi) * 100)"
imp <- mice(boys, method = meth, seed = 235711, print = FALSE)
plot(imp, c("hgt", "wgt", "bmi"))
End of Lab 2b