In these lab exercises you will explore some different workflows that you can use to implement an MI-based analysis.

Unless otherwise specified, you will be analyzing the boys dataset from the mice package in all following exercises.

1 Setup

1.1

Load the mice, miceadds, and mitools packages.

library(mice)
library(miceadds)
library(mitools)

2 Imputation Phase

2.1

Multiply impute the boys data using passive imputation for bmi.

Use passive imputation to maintain the known relation between bmi, wgt, and hgt.

Specify the method vector entry for bmi as "~ I(wgt / (hgt / 100)^2)".
Use 20 iterations.
Create 10 imputations.
Set a random number seed.
Leave all other settings at there default values.
Name the resulting mids object imp1.

## Use the mice::make.method() function to generate a default method vector:
(meth <- make.method(boys))

##       age       hgt       wgt       bmi        hc       gen       phb        tv 
##        ""     "pmm"     "pmm"     "pmm"     "pmm"    "polr"    "polr"     "pmm" 
##       reg 
## "polyreg"

meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth

##                        age                        hgt 
##                         ""                      "pmm" 
##                        wgt                        bmi 
##                      "pmm" "~ I(wgt / (hgt / 100)^2)" 
##                         hc                        gen 
##                      "pmm"                     "polr" 
##                        phb                         tv 
##                     "polr"                      "pmm" 
##                        reg 
##                  "polyreg"

## Adjust the default predictor matrix to break circularity in the imputations:
(pred <- make.predictorMatrix(boys))

##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   1  1   1   1  1   1
## wgt   1   1   0   1  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## phb   1   1   1   1  1   1   0  1   1
## tv    1   1   1   1  1   1   1  0   1
## reg   1   1   1   1  1   1   1  1   0

pred[c("hgt", "wgt"), "bmi"] <- 0
pred

##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   0  1   1   1  1   1
## wgt   1   1   0   0  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## phb   1   1   1   1  1   1   0  1   1
## tv    1   1   1   1  1   1   1  0   1
## reg   1   1   1   1  1   1   1  1   0

imp1 <- mice(boys, 
             m = 10,
             maxit = 20, 
             method = meth, 
             predictorMatrix = pred,
             seed = 235711, 
             print = FALSE)

2.2

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.

What do you conclude vis-a-vis convergence and the validity of these imputations?

plot(imp1)

densityplot(imp1)

stripplot(imp1)

Everything looks good. The imputation model seems to have converged, and the imputed values look reasonable.

3 Basic Workflow

First, we will continue to explore the workflow we’ve already been using wherein we fit the analysis models using with.mids(). This workflow applies under two conditions:

We don’t need to post-process the imputed datasets.
The modeling function we want to apply provides coef() and vcov() methods.

3.1

Use the imputed data you created in Question 2.1 to fit the following regression model.

\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{region=east} + \beta_3 X_{region=west} + \beta_4 X_{region=south} + \beta_5 X_{region = city} + \varepsilon\)

Pool the MI estimates.

What are the substantive conclusions?
What can you conclude from the FMIs?

fit <- with(imp1, lm(bmi ~ age + reg))
est <- pool(fit)

summary(est)

##          term   estimate  std.error statistic       df       p.value
## 1 (Intercept) 16.0708253 0.30371010 52.915018 727.0207 1.597231e-251
## 2         age  0.2743943 0.01274888 21.523018 721.5205  9.427183e-80
## 3     regeast -0.7383680 0.32397996 -2.279055 734.0648  2.295011e-02
## 4     regwest -0.6060686 0.30858387 -1.964032 702.4842  4.992029e-02
## 5    regsouth -0.7078314 0.31704556 -2.232586 727.5962  2.587977e-02
## 6     regcity -0.1071192 0.38660995 -0.277073 712.9221  7.818044e-01

est

## Class: mipo    m = 10 
##          term  m   estimate        ubar            b            t dfcom
## 1 (Intercept) 10 16.0708253 0.091333418 8.240046e-04 0.0922398233   742
## 2         age 10  0.2743943 0.000160496 1.852606e-06 0.0001625339   742
## 3     regeast 10 -0.7383680 0.104381410 5.287288e-04 0.1049630117   742
## 4     regwest 10 -0.6060686 0.093316921 1.733713e-03 0.0952240052   742
## 5    regsouth 10 -0.7078314 0.099561140 8.697706e-04 0.1005178877   742
## 6     regcity 10 -0.1071192 0.147048004 2.199314e-03 0.1494672504   742
##         df         riv      lambda         fmi
## 1 727.0207 0.009924134 0.009826613 0.012539340
## 2 721.5205 0.012697304 0.012538104 0.015263940
## 3 734.0648 0.005571889 0.005541015 0.008239445
## 4 702.4842 0.020436634 0.020027343 0.022805499
## 5 727.5962 0.009609650 0.009518183 0.012229618
## 6 712.9221 0.016452083 0.016185793 0.018934176

The effects of age and all regions except city are statistically significant. BMI increases significantly as the boys age, after controlling for the region in which they live. Boys from the eastern, western, and southern regions have significantly lower BMIs than boys from the northern region, after controlling for age.

All FMI estimates are very low, so we can conclude that we have not lost a very large proportion of information to the missing data.

You may have noticed that the output above only includes information about the coefficients and their significance tests but no model fit information. The \(R^2\) statistic is not normally distributed, so we should use a slightly more complex pooling method for the \(R^2\). More information on the specifics of pooling the \(R^2\) is available in this section of FIMD and in Harel (2009). The correct pooling rule is implemented by the mice::pool.r.squared() function.

3.2

Check the documentation for the mice::pool.r.squared() function.

?pool.r.squared

3.3

Use pool.r.squared() to pool the \(R^2\) and the adjusted \(R^2\) for the model you estimated in Question 3.1.

pool.r.squared(fit)

##           est     lo 95     hi 95         fmi
## R^2 0.4021204 0.3470422 0.4559365 0.005371515

pool.r.squared(fit, adjusted = TRUE)

##               est     lo 95     hi 95         fmi
## adj R^2 0.3980915 0.3429559 0.4520318 0.005425639

We also need special techniques to pool the \(m\) estimated \(F\) statistics in an MI-based analysis. The technical details of these pooling rules are too complex to detail here. The general method was outlined by Rubin (1987) and extended by Li, Raghunathan, and Rubin (1991), Li, Meng, Raghunathan, and Rubin (1991), and Meng and Rubin (1992).

The Li, Raghunathan, and Rubin (1991) method is implemented as mice::D1().
The Li, Meng, Raghunathan, and Rubin (1991) method is implemented as mice::D2().
The Meng and Rubin (1992) method is implemented as mice::D3().

3.4

Use the D1(), D2(), and D3() functions to pool the \(F\) for the model you estimated in Question 3.1.

D1(fit)

##    test statistic df1      df2 dfcom      p.value        riv
##  1 ~~ 2   98.0492   5 735.4648   742 3.630013e-79 0.01778486

D2(fit)

##    test statistic df1      df2 dfcom       p.value         riv
##  1 ~~ 2  98.91273   5 70009.46    NA 2.809564e-104 0.008986336

D3(fit)

##    test statistic df1    df2 dfcom      p.value        riv
##  1 ~~ 2  75.57824   5 120581   742 2.304436e-79 0.01795142

You should notice some differences between the three pooled statistics. Each statistic uses a different pooling formula based on different assumptions.

The \(D1\) statistic is a direct generalization of the standard Rubin (1987) pooling rules.
The \(D2\) statistic requires only the estimated parameters (while \(D1\) requires the parameter estimates and their asymptotic covariance matrices).
\(D3\) is actually a likelihood ratio test transformed to the scale of an F-statistic, so the theoretical underpinnings of the \(D3\) statistic are somewhat different from \(D1\) and \(D2\).

When the appropriate estimates are available, the \(D1\) statistic is usually a good choice. A more detailed discussion and comparison of these three statistics is available in this section of FIMD.

We can also use these functions (as well as the anova() function) to do significance testing for model comparisons using MI data.

3.5

Use the imputed data from Question 2.1 to estimate a restricted model wherein bmi is predicted by only age.

Use the D1(), D2(), D3(), and anova() functions to compare this restricted model with the full model you estimated in Question 3.1.

What conclusion do you draw from these tests?
Which version of the pooled F statistic is used by anova()?

fit0 <- with(imp1, lm(bmi ~ age))

D1(fit, fit0)

##    test statistic df1      df2 dfcom    p.value      riv
##  1 ~~ 2  2.190947   4 733.2749   742 0.06838933 0.019004

D2(fit, fit0)

##    test statistic df1      df2 dfcom    p.value        riv
##  1 ~~ 2  2.163995   4 10224.32    NA 0.07037812 0.02469387

D3(fit, fit0)

##    test statistic df1      df2 dfcom   p.value        riv
##  1 ~~ 2  2.193873   4 79672.51   742 0.0669729 0.01931525

anova(fit, fit0)

##    test statistic df1      df2 dfcom    p.value      riv
##  1 ~~ 2  2.190947   4 733.2749   742 0.06838933 0.019004

All tests agree that the reg factor, as a whole, is not a significant predictor of bmi.
The anova() function appears to be using the \(D1\) statistic.

4 Workflows with Data Processing

Sometimes (rather often, actually), we need to process the imputed data before we can fit an analysis model. In such cases, we usually implement something like the following workflow.

Impute the data with mice()
Use mice::complete() to create a list of multiply imputed datasets
Process each dataset
Apply our analysis model to each processed dataset
Pool the results

4.1

Use mice::complete() to create a list of imputed datasets from the mids object you created in Question 2.1.

Name the resulting list impData.

impData <- complete(imp1, "all")

4.2

Center age on 18 in each of the imputed datasets you created in Question 4.1.

TIP: You can use lapply() to broadcast the data transformation across all elements in impData.

impData <- lapply(impData, mutate, age18 = age - 18)

4.3

Use lapply() to fit the model from Question 3.1 to each of the transformed datasets produced in Question 4.2.

fits <- lapply(impData, function(x) lm(bmi ~ age18 + reg, data = x))

At this point, you should have a standard R list containing the 10 fitted lm objects. You have a few options for pooling these results.

If you simply want to pool the parameter estimates, you can use the mitools::MIcombine() function, and directly submit the list of model fits from Question 4.3 as input to the function.
If you want to continue to make use of the pooling utilities provided by the mice package, you can use the mice::as.mira() function to first cast the list from Question 4.3 as a mira object. You can then pool the results using all the methods you’ve already learned.

4.4

Check the documentation for mitools::MIcombine() and mice::as.mira().

?MIcombine
?as.mira

4.5

Pool the fitted models from Question 4.3 using mitools::MIcombine().

Summarize the pooled estimates.
Extract the fractions of missing information for the parameter estimates.

est <- MIcombine(fits)
summary(est)

## Multiple imputation results:
##       MIcombine.default(fits)
##                results         se     (lower      upper) missInfo
## (Intercept) 21.0099231 0.27423445 20.4724309 21.54741541      1 %
## age18        0.2743943 0.01274888  0.2494065  0.29938219      1 %
## regeast     -0.7383680 0.32397996 -1.3733596 -0.10337632      1 %
## regwest     -0.6060686 0.30858387 -1.2109145 -0.00122274      2 %
## regsouth    -0.7078314 0.31704556 -1.3292369 -0.08642598      1 %
## regcity     -0.1071192 0.38660995 -0.8648874  0.65064911      2 %

est$missinfo

## (Intercept)       age18     regeast     regwest    regsouth     regcity 
## 0.006033750 0.012572599 0.005547800 0.020114678 0.009538124 0.016243063

4.6

Pool the fitted models from Question 4.3 using mice::as.mira() and mice::pool().

Summarize the pooled estimates.
Extract the fractions of missing information for the parameter estimates.
Extract the \(\lambda\)s for the parameter estimates.

What do you notice vis-a-vis the FMI/\(\lambda\) produced by these two pooling approaches?

est <- fits %>% as.mira() %>% pool()
summary(est)

##          term   estimate  std.error statistic       df      p.value
## 1 (Intercept) 21.0099231 0.27423445 76.612996 733.3727 0.000000e+00
## 2       age18  0.2743943 0.01274888 21.523018 721.5205 9.427183e-80
## 3     regeast -0.7383680 0.32397996 -2.279055 734.0648 2.295011e-02
## 4     regwest -0.6060686 0.30858387 -1.964032 702.4842 4.992029e-02
## 5    regsouth -0.7078314 0.31704556 -2.232586 727.5962 2.587977e-02
## 6     regcity -0.1071192 0.38660995 -0.277073 712.9221 7.818044e-01

ls(est$pooled)

##  [1] "b"        "df"       "dfcom"    "estimate" "fmi"      "lambda"  
##  [7] "m"        "riv"      "t"        "term"     "ubar"

est$pooled$fmi

## [1] 0.008725380 0.015263940 0.008239445 0.022805499 0.012229618 0.018934176

est$pooled$lambda

## [1] 0.006025730 0.012538104 0.005541015 0.020027343 0.009518183 0.016185793

It looks like the quantity called “missing information” by mitools::MIcombine() is equivalent to the quantity called \(\lambda\) by mice::pool().

5 Workflows for Special Pooling

If we want to pool parameters from a modeling function that does not provide coef() and vcov() functions, we cannot use mice::pool() or mitools::MIcombine() to do so.

Fortunately, as long as we can estimate the parameters of interest and their standard errors from each imputed dataset, we can still pool the results. We can use the mice::pool.scalar() function to do so.

5.1

Check the documentation for mice::pool.scalar().

?pool.scalar

The t.test() function is one popular function for which we cannot use the standard pooling workflow. The following code shows one possible workflow for conducting a t-test using multiply imputed data.

We’ll conduct an independent samples t-test for the average testicular volume of boys who are younger than 13 and boys who are 13 or older.

library(magrittr)

## Run the t-test on each imputed dataset:
tests <- lapply(impData, function(x) t.test(tv ~ I(age < 13), data = x))

## Extract the estimated parameters (i.e., mean differences):
d <- sapply(tests, function(x) diff(x$estimate) %>% abs())

## Extract the standard errors:
se <- sapply(tests, "[[", x = "stderr")

## Pool the estimates:
pooled <- pool.scalar(Q = d, U = se^2, n = nrow(impData[[1]]))

## View the pooled parameter estimate:
pooled$qbar

## [1] 14.15356

## Compute the t-statistic using the pooled estimates:
(t <- pooled %$% (qbar / sqrt(t)))

## [1] 28.57915

## Compute the two-tailed p-value:
2 * pt(t, df = pooled$df, lower.tail = FALSE)

## [1] 7.117193e-30

5.2

Conduct the same t-test as above using listwise deletion.

Compare the MI-based results to the deletion-based results.

What differences do you observe?
What do you think causes these differences (if any)?

(out <- t.test(tv ~ I(age < 13), data = boys))

## 
##  Welch Two Sample t-test
## 
## data:  tv by I(age < 13)
## t = 19.162, df = 219.56, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  11.03961 13.57084
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##           16.794118            4.488889

out$statistic

##        t 
## 19.16181

out$estimate %>% diff() %>% abs()

## mean in group TRUE 
##           12.30523

The estimate t-statistic and parameter estimate are larger in the MI-based analysis than in the deletion-based analysis. This difference is probably caused by imputing many low testicular volumes (as we saw in the last practical). Hence the mean of tv in the age < 13 group is lower for the MI-based analysis.

If we want to do an ANOVA with MI data, the pooling techniques we’ve discussed so far can be a bit of a pain. We can easily estimate and pool the underlying linear model (since that’s just a linear regression model), but getting a pooled version of the standard ANOVA table would require quite a lot of work.

Thankfully, the miceadds::mi.anova() function does all of the heavy lifting for us.

5.3

Check the documentation for the miceadds::mi.anova() function.

?mi.anova

5.4

Use mi.anova() to estimate a factorial ANOVA wherein bmi is the DV and reg and gen are the IVs.

Use the mids object you created in Question 2.1.

What substantive conclusions do you draw from this model?

mi.anova(imp1, "bmi ~ reg * gen")

## Univariate ANOVA for Multiply Imputed Data (Type 2)  
## 
## lm Formula:  bmi ~ reg * gen
## R^2=0.3999 
## ..........................................................................
## ANOVA Table 
##                   SSQ df1         df2 F value  Pr(>F)    eta2 partial.eta2
## reg          166.9974   4 21678.60809  7.1222 0.00001 0.02412      0.03863
## gen         2450.4947   4   273.79663 90.7367 0.00000 0.35386      0.37093
## reg:gen      151.5760  16    24.18192  0.4948 0.92574 0.02189      0.03519
## Residual    4155.8863  NA          NA      NA      NA      NA           NA

The interaction between reg and gen is non-significant, but there are significant main effects for both reg and gen. Furthermore, the effect of gen appears to be much stronger than the effect of reg.

End of Lab 2c

Lab 2c: MI Workflows

Missing Data in R

Kyle M. Lang

Updated: 2025-01-28

1 Setup

1.1

2 Imputation Phase

2.1

2.2

3 Basic Workflow

3.1

3.2

3.3

3.4

3.5

4 Workflows with Data Processing

4.1

4.2

4.3

4.4

4.5

4.6

5 Workflows for Special Pooling

5.1

5.2

5.3

5.4