In these lab exercises, you will explore several workflows for implementing an analysis based on multiple imputation (MI).

Unless otherwise specified, you will analyze the boys dataset from the mice package in all of the following exercises.

1 Setup

1.1

Load the mice, miceadds, and mitools packages.

2 Imputation Phase

2.1

Multiply impute the boys data using passive imputation for bmi.

Use passive imputation to maintain the known relation between bmi, wgt, and hgt.

Specify the method vector entry for bmi as "~ I(wgt / (hgt / 100)^2)".
Use 20 iterations.
Create 10 imputations.
Set a random number seed.
Leave all other settings at their default values.
Name the resulting mids object imp1.
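One way the specification above might be translated into code is sketched below. The seed value is an arbitrary assumption, and printFlag = FALSE is added only to suppress the iteration log; everything else follows the settings listed above.

```r
library(mice)

## Start from the default method vector and override only the bmi entry
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"

## Arbitrary seed (an assumption); any fixed integer makes the run reproducible
imp1 <- mice(boys,
             method    = meth,
             maxit     = 20,
             m         = 10,
             seed      = 235711,
             printFlag = FALSE)

## Confirm which method was used for each variable
imp1$method
```

Note that the default predictor matrix still lets bmi predict wgt and hgt; the exercise asks you to keep that default, but it is something to bear in mind when you inspect the trace plots.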

2.2

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.

What do you conclude vis-a-vis convergence and the validity of these imputations?

3 Basic Workflow

First, we will continue to explore the workflow we’ve already been using wherein we fit the analysis models using with.mids(). This workflow applies under two conditions:

We don’t need to post-process the imputed datasets.
The modeling function we want to apply provides coef() and vcov() methods.
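As a minimal sketch of this basic workflow, the following code uses the small nhanes dataset that ships with mice (not the boys data from the exercises). The model, seed, and number of imputations are arbitrary choices made for illustration.

```r
library(mice)

## Impute a small built-in dataset (illustrative settings only)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

## with() fits the model to each imputed dataset and returns a mira object
fits <- with(imp, lm(bmi ~ age + chl))

## pool() combines the m sets of estimates into a mipo object
est <- pool(fits)

## Pooled coefficients, standard errors, and significance tests
summary(est)
```

No post-processing happens between imputation and analysis here, and lm() is a standard modeling function, so both conditions listed above are satisfied.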

3.1

Use the imputed data you created in Question 2.1 to fit the following regression model.

\(Y_{bmi} = \beta_0 + \beta_1 X_{age} + \beta_2 X_{region=east} + \beta_3 X_{region=west} + \beta_4 X_{region=south} + \beta_5 X_{region = city} + \varepsilon\)

Pool the MI estimates.

What are the substantive conclusions?
What can you conclude from the fractions of missing information (FMIs)?

You may have noticed that the output above only includes information about the coefficients and their significance tests but no model fit information. The \(R^2\) statistic is not normally distributed, so we should use a slightly more complex pooling method for the \(R^2\). More information on the specifics of pooling the \(R^2\) is available in this section of FIMD and in Harel (2009). The correct pooling rule is implemented by the mice::pool.r.squared() function.
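A call to pool.r.squared() might look like the sketch below. It again uses the nhanes data with arbitrary settings rather than the model from Question 3.1.

```r
library(mice)

## Illustrative imputation and model (not the exercise model)
imp  <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
fits <- with(imp, lm(bmi ~ age + chl))

## Pooled R-squared, with a confidence interval and FMI
r2 <- pool.r.squared(fits)

## Pooled adjusted R-squared
r2Adj <- pool.r.squared(fits, adjusted = TRUE)

r2
r2Adj
```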

3.2

Check the documentation for the mice::pool.r.squared() function.

3.3

Use pool.r.squared() to pool the \(R^2\) and the adjusted \(R^2\) for the model you estimated in Question 3.1.

We also need special techniques to pool the \(m\) estimated \(F\) statistics in an MI-based analysis. The technical details of these pooling rules are too complex to detail here. The general method was outlined by Rubin (1987) and extended by Li, Raghunathan, and Rubin (1991), Li, Meng, Raghunathan, and Rubin (1991), and Meng and Rubin (1992).

The Li, Raghunathan, and Rubin (1991) method is implemented as mice::D1().
The Li, Meng, Raghunathan, and Rubin (1991) method is implemented as mice::D2().
The Meng and Rubin (1992) method is implemented as mice::D3().

3.4

Use the D1(), D2(), and D3() functions to pool the \(F\) for the model you estimated in Question 3.1.

You should notice some differences between the three pooled statistics. Each statistic uses a different pooling formula based on different assumptions.

The \(D1\) statistic is a direct generalization of the standard Rubin (1987) pooling rules.
The \(D2\) statistic requires only the \(m\) complete-data test statistics (while \(D1\) requires the parameter estimates and their asymptotic covariance matrices).
\(D3\) is actually a likelihood ratio test transformed to the scale of an F-statistic, so the theoretical underpinnings of the \(D3\) statistic are somewhat different from \(D1\) and \(D2\).

When the appropriate estimates are available, the \(D1\) statistic is usually a good choice. A more detailed discussion and comparison of these three statistics is available in this section of FIMD.

We can also use these functions (as well as the anova() function) to do significance testing for model comparisons using MI data.
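The sketch below shows how a model comparison of this kind might be set up, using the nhanes data and two arbitrary nested models. In recent versions of mice, D1() and D2() rely on the mitml package, so the sketch assumes that mitml is installed.

```r
library(mice)

## Illustrative imputation (not the exercise data)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

## Two nested models fit to the same imputed data
full       <- with(imp, lm(bmi ~ age + chl))
restricted <- with(imp, lm(bmi ~ age))

## Pooled Wald-type comparison (D1) and pooled test statistics (D2)
d1 <- D1(full, restricted)
d2 <- D2(full, restricted)

d1
d2
```

The first argument is the larger model and the second is the model nested within it.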

3.5

Use the imputed data from Question 2.1 to estimate a restricted model wherein bmi is predicted by only age.

Use the D1(), D2(), D3(), and anova() functions to compare this restricted model with the full model you estimated in Question 3.1.

What conclusion do you draw from these tests?
Which version of the pooled F statistic is used by anova()?

4 Workflows with Data Processing

Sometimes (rather often, actually), we need to process the imputed data before we can fit an analysis model. In such cases, we usually implement something like the following workflow.

Impute the data with mice()
Use mice::complete() to create a list of multiply imputed datasets
Process each dataset
Apply our analysis model to each processed dataset
Pool the results
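The five steps above can be sketched as follows, using the nhanes data. The centering value of 200 and the variable name chl200 are hypothetical choices made only to give the processing step something to do.

```r
library(mice)

## 1. Impute the data
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

## 2. Create a list of completed datasets
dataList <- complete(imp, "all")

## 3. Process each dataset (here: center chl on an arbitrary value)
dataList <- lapply(dataList, function(d) {
  d$chl200 <- d$chl - 200
  d
})

## 4. Fit the analysis model to each processed dataset
fits <- lapply(dataList, function(d) lm(bmi ~ chl200, data = d))

## 5. Pool the results
pooled <- pool(as.mira(fits))
summary(pooled)
```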

4.1

Use mice::complete() to create a list of imputed datasets from the mids object you created in Question 2.1.

Name the resulting list impData.

4.2

Center age on 18 in each of the imputed datasets you created in Question 4.1.

TIP: You can use lapply() to broadcast the data transformation across all elements in impData.

4.3

Use lapply() to fit the model from Question 3.1 to each of the transformed datasets produced in Question 4.2.

At this point, you should have a standard R list containing the 10 fitted lm objects. You have a few options for pooling these results.

If you simply want to pool the parameter estimates, you can use the mitools::MIcombine() function, and directly submit the list of model fits from Question 4.3 as input to the function.
If you want to continue to make use of the pooling utilities provided by the mice package, you can use the mice::as.mira() function to first cast the list from Question 4.3 as a mira object. You can then pool the results using all the methods you’ve already learned.
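The following sketch contrasts the two pooling options on a toy example built from the nhanes data (an arbitrary model, not the one from Question 4.3). It compares only the pooled point estimates.

```r
library(mice)
library(mitools)

## Illustrative list of fitted models
imp  <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
fits <- lapply(complete(imp, "all"), function(d) lm(bmi ~ chl, data = d))

## Option 1: pool directly with mitools
mi1 <- MIcombine(fits)
mi1$coefficients

## Option 2: cast the list as a mira object, then pool with mice
mi2 <- pool(as.mira(fits))
mi2$pooled$estimate

## The pooled point estimates should agree
all.equal(unname(mi1$coefficients), mi2$pooled$estimate)
```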

4.4

Check the documentation for mitools::MIcombine() and mice::as.mira().

4.5

Pool the fitted models from Question 4.3 using mitools::MIcombine().

Summarize the pooled estimates.
Extract the fractions of missing information for the parameter estimates.

4.6

Pool the fitted models from Question 4.3 using mice::as.mira() and mice::pool().

Summarize the pooled estimates.
Extract the fractions of missing information for the parameter estimates.
Extract the \(\lambda\)s for the parameter estimates.

What do you notice vis-a-vis the FMI/\(\lambda\) produced by these two pooling approaches?

5 Workflows for Special Pooling

If we want to pool parameters from a modeling function that does not provide coef() and vcov() methods, we cannot use mice::pool() or mitools::MIcombine() to do so.

Fortunately, as long as we can estimate the parameters of interest and their standard errors from each imputed dataset, we can still pool the results. We can use the mice::pool.scalar() function to do so.
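To make the pooling arithmetic concrete, the sketch below applies Rubin's rules by hand to a single parameter and compares the result with pool.scalar(). The estimates, standard errors, and sample size are made-up values used purely for illustration.

```r
library(mice)

## Made-up estimates and standard errors from m = 5 imputed datasets
Q  <- c(2.1, 1.8, 2.4, 2.0, 2.2)
se <- c(0.50, 0.55, 0.48, 0.52, 0.51)
U  <- se^2
m  <- length(Q)

## Rubin's rules by hand
qbar  <- mean(Q)                  # pooled point estimate
ubar  <- mean(U)                  # within-imputation variance
b     <- var(Q)                   # between-imputation variance
total <- ubar + (1 + 1 / m) * b   # total variance

## The same computation via pool.scalar() (n = 100 is a hypothetical sample size)
pooled <- pool.scalar(Q = Q, U = U, n = 100)

all.equal(pooled$qbar, qbar)
all.equal(pooled$t, total)
```

The square root of the total variance is the pooled standard error, which is what the t-test workflow below uses.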

5.1

Check the documentation for mice::pool.scalar().

The t.test() function is one popular function for which we cannot use the standard pooling workflow. The following code shows one possible workflow for conducting a t-test using multiply imputed data.

We’ll conduct an independent-samples t-test comparing the average testicular volume (tv) of boys who are younger than 13 with that of boys who are 13 or older.

library(magrittr)

## Run the t-test on each imputed dataset:
tests <- lapply(impData, function(x) t.test(tv ~ I(age < 13), data = x))

## Extract the estimated parameters (i.e., mean differences):
d <- sapply(tests, function(x) diff(x$estimate) %>% abs())

## Extract the standard errors:
se <- sapply(tests, "[[", "stderr")

## Pool the estimates:
pooled <- pool.scalar(Q = d, U = se^2, n = nrow(impData[[1]]))

## View the pooled parameter estimate:
pooled$qbar

## [1] 14.15356

## Compute the t-statistic using the pooled estimates
## (pooled$t is the total variance of the pooled estimate):
(tStat <- pooled %$% (qbar / sqrt(t)))

## [1] 28.57915

## Compute the two-tailed p-value:
2 * pt(tStat, df = pooled$df, lower.tail = FALSE)

## [1] 7.117193e-30

5.2

Conduct the same t-test as above using listwise deletion.

Compare the MI-based results to the deletion-based results.

What differences do you observe?
What do you think causes these differences (if any)?

If we want to do an ANOVA with MI data, the pooling techniques we’ve discussed so far can be a bit of a pain. We can easily estimate and pool the underlying linear model (since that’s just a linear regression model), but getting a pooled version of the standard ANOVA table would require quite a lot of work.

Thankfully, the miceadds::mi.anova() function does all of the heavy lifting for us.

5.3

Check the documentation for the miceadds::mi.anova() function.

5.4

Use mi.anova() to estimate a factorial ANOVA wherein bmi is the DV and reg and gen are the IVs.

Use the mids object you created in Question 2.1.

What substantive conclusions do you draw from this model?

End of Lab 2c

Lab 2c: MI Workflows

Missing Data in R

Kyle M. Lang

Updated: 2025-01-28
