In these lab exercises you will explore some different workflows that you can use to implement an MI-based analysis.
Unless otherwise specified, you will be analyzing the boys dataset from the mice package in all following exercises.
Load the mice
, miceadds
, and mitools
packages.
library(mice)
library(miceadds)
library(mitools)
Multiply impute the boys data using passive imputation for bmi
.
Use passive imputation to maintain the known relation between bmi
, wgt
, and
hgt
.
bmi
as "~ I(wgt / (hgt / 100)^2)"
.imp1
.## Use the mice::make.method() function to generate a default method vector:
(meth <- make.method(boys))
## age hgt wgt bmi hc gen phb tv
## "" "pmm" "pmm" "pmm" "pmm" "polr" "polr" "pmm"
## reg
## "polyreg"
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth
## age hgt
## "" "pmm"
## wgt bmi
## "pmm" "~ I(wgt / (hgt / 100)^2)"
## hc gen
## "pmm" "polr"
## phb tv
## "polr" "pmm"
## reg
## "polyreg"
## Adjust the default predictor matrix to break circularity in the imputations:
(pred <- make.predictorMatrix(boys))
## age hgt wgt bmi hc gen phb tv reg
## age 0 1 1 1 1 1 1 1 1
## hgt 1 0 1 1 1 1 1 1 1
## wgt 1 1 0 1 1 1 1 1 1
## bmi 1 1 1 0 1 1 1 1 1
## hc 1 1 1 1 0 1 1 1 1
## gen 1 1 1 1 1 0 1 1 1
## phb 1 1 1 1 1 1 0 1 1
## tv 1 1 1 1 1 1 1 0 1
## reg 1 1 1 1 1 1 1 1 0
pred[c("hgt", "wgt"), "bmi"] <- 0
pred
## age hgt wgt bmi hc gen phb tv reg
## age 0 1 1 1 1 1 1 1 1
## hgt 1 0 1 0 1 1 1 1 1
## wgt 1 1 0 0 1 1 1 1 1
## bmi 1 1 1 0 1 1 1 1 1
## hc 1 1 1 1 0 1 1 1 1
## gen 1 1 1 1 1 0 1 1 1
## phb 1 1 1 1 1 1 0 1 1
## tv 1 1 1 1 1 1 1 0 1
## reg 1 1 1 1 1 1 1 1 0
imp1 <- mice(boys,
m = 10,
maxit = 20,
method = meth,
predictorMatrix = pred,
seed = 235711,
print = FALSE)
Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.
What do you conclude vis-a-vis convergence and the validity of these imputations?
plot(imp1)
densityplot(imp1)
stripplot(imp1)
Everything looks good. The imputation model seems to have converged, and the imputed values look reasonable.
First, we will continue to explore the workflow we’ve already been using wherein
we fit the analysis models using with.mids()
. This workflow applies under two
conditions:
coef()
and vcov()
methods.Use the imputed data you created in Question 2.1 to fit the following regression model.
\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{region=east} + \beta_3 X_{region=west} + \beta_4 X_{region=south} + \beta_5 X_{region = city} + \varepsilon\)
Pool the MI estimates.
fit <- with(imp1, lm(bmi ~ age + reg))
est <- pool(fit)
summary(est)
## term estimate std.error statistic df p.value
## 1 (Intercept) 16.0708253 0.30371010 52.915018 727.0207 1.597231e-251
## 2 age 0.2743943 0.01274888 21.523018 721.5205 9.427183e-80
## 3 regeast -0.7383680 0.32397996 -2.279055 734.0648 2.295011e-02
## 4 regwest -0.6060686 0.30858387 -1.964032 702.4842 4.992029e-02
## 5 regsouth -0.7078314 0.31704556 -2.232586 727.5962 2.587977e-02
## 6 regcity -0.1071192 0.38660995 -0.277073 712.9221 7.818044e-01
est
## Class: mipo m = 10
## term m estimate ubar b t dfcom
## 1 (Intercept) 10 16.0708253 0.091333418 8.240046e-04 0.0922398233 742
## 2 age 10 0.2743943 0.000160496 1.852606e-06 0.0001625339 742
## 3 regeast 10 -0.7383680 0.104381410 5.287288e-04 0.1049630117 742
## 4 regwest 10 -0.6060686 0.093316921 1.733713e-03 0.0952240052 742
## 5 regsouth 10 -0.7078314 0.099561140 8.697706e-04 0.1005178877 742
## 6 regcity 10 -0.1071192 0.147048004 2.199314e-03 0.1494672504 742
## df riv lambda fmi
## 1 727.0207 0.009924134 0.009826613 0.012539340
## 2 721.5205 0.012697304 0.012538104 0.015263940
## 3 734.0648 0.005571889 0.005541015 0.008239445
## 4 702.4842 0.020436634 0.020027343 0.022805499
## 5 727.5962 0.009609650 0.009518183 0.012229618
## 6 712.9221 0.016452083 0.016185793 0.018934176
The effects of age and all regions except city
are statistically significant.
BMI increases significantly as the boys age, after controlling for the region in
which they live. Boys from the eastern, western, and southern regions have
significantly lower BMIs than boys from the northern region, after controlling
for age.
All FMI estimates are very low, so we can conclude that we have not lost a very large proportion of information to the missing data.
You may have noticed that the output above only includes information about the
coefficients and their significance tests but no model fit information. The \(R^2\)
statistic is not normally distributed, so we should use a slightly more complex
pooling method for the \(R^2\). More information on the specifics of pooling the
\(R^2\) is available in this section
of FIMD and in Harel (2009). The
correct pooling rule is implemented by the mice::pool.r.squared()
function.
Check the documentation for the mice::pool.r.squared()
function.
?pool.r.squared
Use pool.r.squared()
to pool the \(R^2\) and the adjusted \(R^2\) for the model
you estimated in Question 3.1.
pool.r.squared(fit)
## est lo 95 hi 95 fmi
## R^2 0.4021204 0.3470422 0.4559365 0.005371515
pool.r.squared(fit, adjusted = TRUE)
## est lo 95 hi 95 fmi
## adj R^2 0.3980915 0.3429559 0.4520318 0.005425639
We also need special techniques to pool the \(m\) estimated \(F\) statistics in an MI-based analysis. The technical details of these pooling rules are too complex to detail here. The general method was outlined by Rubin (1987) and extended by Li, Raghunathan, and Rubin (1991), Li, Meng, Raghunathan, and Rubin (1991), and Meng and Rubin (1992).
mice::D1()
.mice::D2()
.mice::D3()
.Use the D1()
, D2()
, and D3()
functions to pool the \(F\) for the model you
estimated in Question 3.1.
D1(fit)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 98.0492 5 735.4648 742 3.630013e-79 0.01778486
D2(fit)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 98.91273 5 70009.46 NA 2.809564e-104 0.008986336
D3(fit)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 75.57824 5 120581 742 2.304436e-79 0.01795142
You should notice some differences between the three pooled statistics. Each statistic uses a different pooling formula based on different assumptions.
When the appropriate estimates are available, the \(D1\) statistic is usually a good choice. A more detailed discussion and comparison of these three statistics is available in this section of FIMD.
We can also use these functions (as well as the anova()
function) to do
significance testing for model comparisons using MI data.
Use the imputed data from Question 2.1 to estimate a restricted model
wherein bmi
is predicted by only age
.
Use the D1()
, D2()
, D3()
, and anova()
functions to compare this
restricted model with the full model you estimated in Question 3.1.
anova()
?fit0 <- with(imp1, lm(bmi ~ age))
D1(fit, fit0)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.190947 4 733.2749 742 0.06838933 0.019004
D2(fit, fit0)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.163995 4 10224.32 NA 0.07037812 0.02469387
D3(fit, fit0)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.193873 4 79672.51 742 0.0669729 0.01931525
anova(fit, fit0)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.190947 4 733.2749 742 0.06838933 0.019004
reg
factor, as a whole, is not a significant
predictor of bmi
.anova()
function appears to be using the \(D1\) statistic.Sometimes (rather often, actually), we need to process the imputed data before we can fit an analysis model. In such cases, we usually implement something like the following workflow.
mice()
mice::complete()
to create a list of multiply imputed datasetsUse mice::complete()
to create a list of imputed datasets from the mids
object you created in Question 2.1.
Name the resulting list impData
.
impData <- complete(imp1, "all")
Center age
on 18 in each of the imputed datasets you created in Question
4.1.
TIP: You can use lapply()
to broadcast the data transformation across all
elements in impData
.
impData <- lapply(impData, mutate, age18 = age - 18)
Use lapply()
to fit the model from Question 3.1 to each of the
transformed datasets produced in Question 4.2.
fits <- lapply(impData, function(x) lm(bmi ~ age18 + reg, data = x))
At this point, you should have a standard R list containing the 10 fitted lm objects. You have a few options for pooling these results.
mitools::MIcombine()
function, and directly submit the list of model fits from
Question 4.3 as input to the function.mice::as.mira()
function to first cast the
list from Question 4.3 as a mira object. You can then pool the
results using all the methods you’ve already learned.Check the documentation for mitools::MIcombine()
and mice::as.mira()
.
?MIcombine
?as.mira
Pool the fitted models from Question 4.3 using mitools::MIcombine()
.
est <- MIcombine(fits)
summary(est)
## Multiple imputation results:
## MIcombine.default(fits)
## results se (lower upper) missInfo
## (Intercept) 21.0099231 0.27423445 20.4724309 21.54741541 1 %
## age18 0.2743943 0.01274888 0.2494065 0.29938219 1 %
## regeast -0.7383680 0.32397996 -1.3733596 -0.10337632 1 %
## regwest -0.6060686 0.30858387 -1.2109145 -0.00122274 2 %
## regsouth -0.7078314 0.31704556 -1.3292369 -0.08642598 1 %
## regcity -0.1071192 0.38660995 -0.8648874 0.65064911 2 %
est$missinfo
## (Intercept) age18 regeast regwest regsouth regcity
## 0.006033750 0.012572599 0.005547800 0.020114678 0.009538124 0.016243063
Pool the fitted models from Question 4.3 using mice::as.mira()
and mice::pool()
.
What do you notice vis-a-vis the FMI/\(\lambda\) produced by these two pooling approaches?
est <- fits %>% as.mira() %>% pool()
summary(est)
## term estimate std.error statistic df p.value
## 1 (Intercept) 21.0099231 0.27423445 76.612996 733.3727 0.000000e+00
## 2 age18 0.2743943 0.01274888 21.523018 721.5205 9.427183e-80
## 3 regeast -0.7383680 0.32397996 -2.279055 734.0648 2.295011e-02
## 4 regwest -0.6060686 0.30858387 -1.964032 702.4842 4.992029e-02
## 5 regsouth -0.7078314 0.31704556 -2.232586 727.5962 2.587977e-02
## 6 regcity -0.1071192 0.38660995 -0.277073 712.9221 7.818044e-01
ls(est$pooled)
## [1] "b" "df" "dfcom" "estimate" "fmi" "lambda"
## [7] "m" "riv" "t" "term" "ubar"
est$pooled$fmi
## [1] 0.008725380 0.015263940 0.008239445 0.022805499 0.012229618 0.018934176
est$pooled$lambda
## [1] 0.006025730 0.012538104 0.005541015 0.020027343 0.009518183 0.016185793
It looks like the quantity called “missing information” by mitools::MIcombine()
is equivalent to the quantity called \(\lambda\) by mice::pool()
.
If we want to pool parameters from a modeling function that does not provide
coef()
and vcov()
functions, we cannot use mice::pool()
or
mitools::MIcombine()
to do so.
Fortunately, as long as we can estimate the parameters of interest and their
standard errors from each imputed dataset, we can still pool the results. We can
use the mice::pool.scalar()
function to do so.
Check the documentation for mice::pool.scalar()
.
?pool.scalar
The t.test()
function is one popular function for which we cannot use the
standard pooling workflow. The following code shows one possible workflow for
conducting a t-test using multiply imputed data.
We’ll conduct an independent samples t-test for the average testicular volume of boys who are younger than 13 and boys who are 13 or older.
library(magrittr)
## Run the t-test on each imputed dataset:
tests <- lapply(impData, function(x) t.test(tv ~ I(age < 13), data = x))
## Extract the estimated parameters (i.e., mean differences):
d <- sapply(tests, function(x) diff(x$estimate) %>% abs())
## Extract the standard errors:
se <- sapply(tests, "[[", x = "stderr")
## Pool the estimates:
pooled <- pool.scalar(Q = d, U = se^2, n = nrow(impData[[1]]))
## View the pooled parameter estimate:
pooled$qbar
## [1] 14.15356
## Compute the t-statistic using the pooled estimates:
(t <- pooled %$% (qbar / sqrt(t)))
## [1] 28.57915
## Compute the two-tailed p-value:
2 * pt(t, df = pooled$df, lower.tail = FALSE)
## [1] 7.117193e-30
Conduct the same t-test as above using listwise deletion.
Compare the MI-based results to the deletion-based results.
(out <- t.test(tv ~ I(age < 13), data = boys))
##
## Welch Two Sample t-test
##
## data: tv by I(age < 13)
## t = 19.162, df = 219.56, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 11.03961 13.57084
## sample estimates:
## mean in group FALSE mean in group TRUE
## 16.794118 4.488889
out$statistic
## t
## 19.16181
out$estimate %>% diff() %>% abs()
## mean in group TRUE
## 12.30523
The estimate t-statistic and parameter estimate are larger in the MI-based
analysis than in the deletion-based analysis. This difference is probably caused
by imputing many low testicular volumes (as we saw in the last practical). Hence
the mean of tv
in the age < 13
group is lower for the MI-based analysis.
If we want to do an ANOVA with MI data, the pooling techniques we’ve discussed so far can be a bit of a pain. We can easily estimate and pool the underlying linear model (since that’s just a linear regression model), but getting a pooled version of the standard ANOVA table would require quite a lot of work.
Thankfully, the miceadds::mi.anova()
function does all of the heavy lifting
for us.
Check the documentation for the miceadds::mi.anova()
function.
?mi.anova
Use mi.anova()
to estimate a factorial ANOVA wherein bmi
is the DV and
reg
and gen
are the IVs.
Use the mids object you created in Question 2.1.
mi.anova(imp1, "bmi ~ reg * gen")
## Univariate ANOVA for Multiply Imputed Data (Type 2)
##
## lm Formula: bmi ~ reg * gen
## R^2=0.3999
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## reg 166.9974 4 21678.60809 7.1222 0.00001 0.02412 0.03863
## gen 2450.4947 4 273.79663 90.7367 0.00000 0.35386 0.37093
## reg:gen 151.5760 16 24.18192 0.4948 0.92574 0.02189 0.03519
## Residual 4155.8863 NA NA NA NA NA NA
The interaction between reg
and gen
is non-significant, but there are
significant main effects for both reg
and gen
. Furthermore, the effect of
gen
appears to be much stronger than the effect of reg
.
End of Lab 2c