In these lab exercises, you will explore how certain pathological data characteristics can affect imputation results. You will also learn a very useful technique for imputing variables with known relations: Passive Imputation.

1 Setup

1.1

Load the mice package.

library(mice)

The mammalsleep dataset is part of mice. This dataset contains the Allison and Cicchetti (1976) data for mammalian species.

Unless otherwise specified, all questions in this section refer to the mammalsleep data.

1.2

Check the documentation for the mammalsleep dataset.

?mammalsleep

1.3

Summarize the data to get an overview of their structure.

head(mammalsleep)

##                     species       bw    brw sws  ps   ts  mls  gt pi sei odi
## 1          African elephant 6654.000 5712.0  NA  NA  3.3 38.6 645  3   5   3
## 2 African giant pouched rat    1.000    6.6 6.3 2.0  8.3  4.5  42  3   1   3
## 3                Arctic Fox    3.385   44.5  NA  NA 12.5 14.0  60  1   1   1
## 4    Arctic ground squirrel    0.920    5.7  NA  NA 16.5   NA  25  5   2   3
## 5            Asian elephant 2547.000 4603.0 2.1 1.8  3.9 69.0 624  3   5   4
## 6                    Baboon   10.550  179.5 9.1 0.7  9.8 27.0 180  4   4   4

summary(mammalsleep)

##                       species         bw                brw         
##  African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
##  African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
##  Arctic Fox               : 1   Median :   3.342   Median :  17.25  
##  Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
##  Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
##  Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
##  (Other)                  :56                                       
##       sws               ps              ts             mls         
##  Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
##  1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
##  Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
##  Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
##  3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
##  Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
##  NA's   :14       NA's   :12      NA's   :4       NA's   :4        
##        gt               pi             sei             odi       
##  Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4

str(mammalsleep)

## 'data.frame':    62 obs. of  11 variables:
##  $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ bw     : num  6654 1 3.38 0.92 2547 ...
##  $ brw    : num  5712 6.6 44.5 5.7 4603 ...
##  $ sws    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
##  $ ps     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
##  $ ts     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ mls    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
##  $ gt     : num  645 42 60 25 624 180 35 392 63 230 ...
##  $ pi     : int  3 3 1 5 3 4 1 4 1 1 ...
##  $ sei    : int  5 1 1 2 5 4 1 5 2 1 ...
##  $ odi    : int  3 3 1 3 4 4 1 4 1 1 ...

2 Naive Imputation

2.1

Use mice() to multiply impute the mammalsleep data.

Create 5 imputations.
Use 10 iterations.
Use predictive mean matching as the imputation method.
Set a random number seed.
Keep all other options at their default values.

imp <- mice(mammalsleep, maxit = 10, seed = 235711, print = FALSE)

When you run this imputation, you will probably see a warning about “logged events”. When mice() encounters certain computational difficulties (e.g., extreme collinearity), it will take automatic remedial action and log the action in the loggedEvents slot of the mids object.

If you ever see a warning about logged events, you should check the loggedEvents slot to see what actions were taken. You want to assess if the actions were appropriate and judge the likely impact that the actions will have on your results. If the actions mice() has taken seem too extreme, you need to address the underlying data issues and rerun the imputation with the cleaned data.

2.2

Check the contents of the loggedEvents slot in the mids object you created in Question 2.1.

What problem is mice() logging here?

head(imp$loggedEvents)

##   it im dep meth
## 1  1  1 sws  pmm
## 2  1  1 sws  pmm
## 3  1  1 sws  pmm
## 4  1  1  ps  pmm
## 5  1  1  ps  pmm
## 6  1  1  ps  pmm
##                                                                                                                                                                                                                                                                                                         out
## 1                                                                                                                                                                                                                                                       df set to 1. # observed cases: 48  # predictors: 71
## 2 speciesArctic Fox, speciesArctic ground squirrel, speciesAsian elephant, speciesBaboon, speciesDonkey, speciesGiraffe, speciesGorilla, speciesGray wolf, speciesJaguar, speciesKangaroo, speciesOkapi, speciesRaccoon, speciesRoe deer, speciesSlow loris, speciesYellow-bellied marmot, brw, ts, gt, odi
## 3                                                  mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
## 4                                                                                                                                                                                                                                                       df set to 1. # observed cases: 50  # predictors: 71
## 5                                                            speciesArctic Fox, speciesArctic ground squirrel, speciesDonkey, speciesGorilla, speciesGray wolf, speciesJaguar, speciesKangaroo, speciesRaccoon, speciesRoe deer, speciesSlow loris, speciesYellow-bellied marmot, bw, sws, ts, gt, sei, odi
## 6                                                  mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?

The imputation models seem to have more predictors (\(P = 71\)) than observations (\(N \approx 50\)). It looks like something may have gone “wrong” with the species variable. To get around the issue, mice() has applied a ridge penalty.

We can get a sense of how impactful the issues noted in the loggedEvents log have been through our usual diagnostic plots.

2.3

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.1.

What do you conclude vis-a-vis convergence and the validity of these imputations?

plot(imp)

densityplot(imp)

stripplot(imp)

The trace plots look OK. So, it appears that the imputation models have converged onto some equilibrium. The density plots and the strip plots, however, suggest very poor imputations. The density plots, in particular, clearly show far too little variability in the imputed values. Nearly all of the imputations collapse onto a small range of values (hence the “sharp” spikes in the density plots). Although the imputation models may have converged, it appears they have converged onto the wrong solution. These imputations are certainly not reasonable.
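To put a rough number on the collapse visible in the density plots, one informal check is to compare the spread of the imputed values with the spread of the observed values. The sketch below reruns the naive imputation from Question 2.1 and compares standard deviations for sws. The object names (imp_naive, sd_obs, sd_imp) are ours, and the comparison is only a quick diagnostic, not a formal test.

```r
library(mice)

# Rerun the naive imputation from Question 2.1.
# A warning about logged events is expected, so it is suppressed here.
imp_naive <- suppressWarnings(
  mice(mammalsleep, maxit = 10, seed = 235711, print = FALSE)
)

# Spread of the observed sws values
sd_obs <- sd(mammalsleep$sws, na.rm = TRUE)

# Spread of the imputed sws values, pooled over all imputations.
# imp_naive$imp$sws has one row per missing case and one column per imputation.
sd_imp <- sd(unlist(imp_naive$imp$sws))

c(observed = sd_obs, imputed = sd_imp)
```

An imputed standard deviation far below the observed one would be consistent with the spikes in the density plots.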

For didactic purposes, let’s “play the fool”, ignore any information we may have gleaned from the diagnostic plots, and analyze these imputed data via the usual process.

2.4

Use the imputed data you created in Question 2.1 to fit the following regression model.

\(Y_{sws} = \beta_0 + \beta_1 X_{ln(bw)} + \beta_2 X_{odi} + \varepsilon\)

Pool the MI estimates, and check the RIV, \(\lambda\), and FMI.

What do you notice?

est <- pool(with(imp, lm(sws ~ log(bw) + odi)))
summary(est)

##          term   estimate std.error statistic        df      p.value
## 1 (Intercept)  9.9944925 1.3428477  7.442759  7.795874 8.360758e-05
## 2     log(bw) -0.6146995 0.2856365 -2.152034  4.650517 8.815594e-02
## 3         odi -0.5202660 0.3426943 -1.518163 30.636875 1.392260e-01

est

## Class: mipo    m = 5 
##          term m   estimate       ubar          b          t dfcom        df
## 1 (Intercept) 5  9.9944925 0.74599710 0.88103582 1.80324008    59  7.795874
## 2     log(bw) 5 -0.6146995 0.01986307 0.05143763 0.08158823    59  4.650517
## 3         odi 5 -0.5202660 0.09327086 0.02014045 0.11743941    59 30.636875
##         riv    lambda       fmi
## 1 1.4172213 0.5863018 0.6629419
## 2 3.1075326 0.7565448 0.8201889
## 3 0.2591221 0.2057959 0.2530181

Although the estimates seem sensible, the RIV, \(\lambda\), and FMI values all suggest that the missing data have had a very large influence on the results. For example, the \(\lambda\) for \(\hat{\beta}_1\) tells us that about 76% of the sampling variance in \(\hat{\beta}_1\) is attributable to the missing data and our treatment thereof (the corresponding FMI is about 0.82). Although we already knew these imputations were suspect, these regression results further confirm the poor quality of the imputations.
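It can help to see where these three statistics come from. The sketch below recomputes them from the ubar (within-imputation variance), b (between-imputation variance), and df columns of the pooled output, using the standard pooling formulas. The numbers are copied from the printed table; the variable names are ours.

```r
# Values copied from the pooled output above (m = 5 imputations).
m    <- 5
term <- c("(Intercept)", "log(bw)", "odi")
ubar <- c(0.74599710, 0.01986307, 0.09327086)  # within-imputation variance
b    <- c(0.88103582, 0.05143763, 0.02014045)  # between-imputation variance
df   <- c(7.795874, 4.650517, 30.636875)       # pooled degrees of freedom

t_var  <- ubar + (1 + 1 / m) * b            # total variance (column t)
riv    <- (1 + 1 / m) * b / ubar            # relative increase in variance
lambda <- (1 + 1 / m) * b / t_var           # share of variance due to missingness
fmi    <- (riv + 2 / (df + 3)) / (riv + 1)  # fraction of missing information

data.frame(term, riv, lambda, fmi)
```

The results should match the riv, lambda, and fmi columns printed by pool() up to rounding.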

The poor performance you should have noted above is largely driven by the species variable. This variable is a factor with 62 levels, one for each row of the data. So, when we include this variable as a predictor in the imputation models, it enters the model as a set of 61 dummy codes. These dummy codes produce the \(P > N\) problem noted in the loggedEvents log, which leads to poor imputations.
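A quick way to confirm this count is to expand species into its dummy codes and compare the number of columns with the number of rows. This is a minimal sketch; the object names are ours.

```r
library(mice)

# Number of factor levels in species
n_levels <- nlevels(mammalsleep$species)

# Dummy codes created by default treatment contrasts.
# model.matrix() adds an intercept column, which we subtract.
n_dummies <- ncol(model.matrix(~ species, data = mammalsleep)) - 1

c(levels = n_levels, dummies = n_dummies, rows = nrow(mammalsleep))
```

With one level per row, the dummy codes alone nearly exhaust the sample size before any other predictor enters the model.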

2.5

Use mice() to re-impute the mammalsleep data.

Use the same settings from Question 2.1, but do not use species as a predictor in any of the imputation models.

Name the resulting mids object imp1.

pred <- imp$predictorMatrix
pred[ , "species"] <- 0

imp1 <- mice(mammalsleep, 
             predictorMatrix = pred, 
             maxit = 10, 
             seed = 235711, 
             print = FALSE)

2.6

If you get a warning about logged events, check the loggedEvents slot of the mids object you created in Question 2.5.

imp1$loggedEvents

##    it im dep meth out
## 1   1  2 mls  pmm  ts
## 2   2  2 mls  pmm  ts
## 3   2  2  gt  pmm  ts
## 4   3  5 mls  pmm  ts
## 5   4  1 mls  pmm  ts
## 6   4  3 mls  pmm  ts
## 7   4  3  gt  pmm  ts
## 8   4  4 mls  pmm  ts
## 9   4  4  gt  pmm  ts
## 10  4  5 mls  pmm  ts
## 11  4  5  gt  pmm  ts
## 12  5  1 mls  pmm  ts
## 13  5  4 mls  pmm  ts
## 14  5  4  gt  pmm  ts
## 15  5  5 mls  pmm  ts
## 16  5  5  gt  pmm  ts
## 17  7  3 mls  pmm  ts
## 18  7  3  gt  pmm  ts
## 19  7  5 mls  pmm  ts
## 20  7  5  gt  pmm  ts
## 21  8  2 mls  pmm  ts
## 22  8  3 mls  pmm  ts
## 23  9  2 mls  pmm  ts
## 24 10  5 mls  pmm  ts

This time, the logged events are telling us about collinearity problems and the actions taken to remedy the collinearity. Specifically, when imputing mls and gt, ts was collinear with other predictors, so it was removed from the model.

2.7

Create trace plots, density plots, and strip plots from the mids object you created in Question 2.5.

What do you conclude vis-a-vis convergence and the validity of these imputations?

plot(imp1)

densityplot(imp1)

stripplot(imp1)

This time, the trace plots suggest some serious identification issues. Notice how the individual lines in the plots of the means for ps and ts are stable but do not mix. This pattern is indicative of an under-identified model. Basically, the data do not contain enough information to define a unique solution for some parameters.

The imputed values themselves look mostly plausible, but that is beside the point. The imputation model must converge before we can move on to considering the plausibility of the imputed values.

The convergence issues you should have noticed above are caused by structural features of the data. Total sleep (ts) is the sum of paradoxical sleep (ps) and slow wave sleep (sws). The imputation model treats ts as distinct from, and stochastically related to, ps and sws, but ts is actually a deterministic function of ps and sws. This deterministic relation is ignored in the imputations, and the resulting circularity in the imputations keeps the model from finding a unique solution.
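You can inspect this relation directly in the fully observed rows. The sketch below computes the gap between the recorded total sleep and the sum of its two components. We do not show expected output, since small rounding differences in the recorded values may appear; the object names are ours.

```r
library(mice)

# Rows where all three sleep variables are observed
obs <- complete.cases(mammalsleep[, c("sws", "ps", "ts")])

# Difference between recorded total sleep and the sum of its components
gap <- with(mammalsleep[obs, ], ts - (sws + ps))

summary(gap)
```

Gaps at or near zero are what we would expect if ts is defined as ps + sws.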

Thankfully, mice() offers a convenient routine for addressing exactly these types of known relations among variables in the imputation model: passive imputation. With passive imputation, we can account for transformations, combinations, and recoded variables when imputing their missing data.

3 Passive Imputation

Frequently, we need to transform, combine, or recode variables. When such a need arises with incomplete variables that we’d like to impute, we have a few options.

We could impute the original, un-transformed, variable and transform the completed version afterwards (the so-called impute-then-transform approach).
We could transform the incomplete variable and impute the transformed version as if it were any other variable (the so-called just-another-variable approach).

Both of these approaches have an important limitation, though. In neither case does the imputation model have access to both the original and the transformed versions of the variable in question. The imputations are either generated using the information in the raw version of the variable (in the impute-then-transform approach) or using the information in the transformation (in the just-another-variable approach), but not both. Note that keeping both the raw and transformed versions of a variable in the model is not an option since doing so induces perfectly collinear variables.
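The collinearity problem is easy to demonstrate for a linear combination such as a sum. The following sketch uses simulated data (all names are hypothetical and unrelated to the lab datasets) to show that a design matrix containing two components and their total is rank deficient, so lm() cannot estimate a coefficient for the redundant column.

```r
set.seed(235711)

n     <- 50
part1 <- rnorm(n)
part2 <- rnorm(n)
total <- part1 + part2  # deterministic function of the two parts
y     <- rnorm(n)

# The design matrix has 4 columns but only 3 linearly independent ones.
X      <- cbind(intercept = 1, part1, part2, total)
rank_X <- qr(X)$rank
rank_X

# lm() reports NA for the aliased coefficient.
fit <- lm(y ~ part1 + part2 + total)
coef(fit)
```

An imputation model that uses both the components and their total as predictors runs into the same redundancy.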

To solve this problem, mice() implements a third approach called passive imputation. The goal of passive imputation is to maintain known, deterministic relations among incomplete variables throughout the imputation process and to allow the imputation model to use the transformed variables as predictors when imputing other variables (other than the raw version of the transformed variables, themselves).

For example, we can use passive imputation to maintain the following deterministic function in the boys data \[\text{BMI} = \frac{\text{Weight}}{\text{Height}^2}\] or this compositional relation in the mammalsleep data \[\text{ts} = \text{ps}+\text{sws}.\]

To implement passive imputation, we need to adjust two features of the mice() setup:

The method vector
- We use the method vector to define the deterministic relations.
The predictor matrix
- We adjust the predictor matrix to keep a transformed variable from being used as a predictor of its raw version.

The following code will adjust the method vector from Question 2.5 to implement passive imputation for ts.

(meth <- imp1$method)

## species      bw     brw     sws      ps      ts     mls      gt      pi     sei 
##      ""      ""      ""   "pmm"   "pmm"   "pmm"   "pmm"   "pmm"      ""      "" 
##     odi 
##      ""

meth["ts"] <- "~ I(sws + ps)"
meth

##         species              bw             brw             sws              ps 
##              ""              ""              ""           "pmm"           "pmm" 
##              ts             mls              gt              pi             sei 
## "~ I(sws + ps)"           "pmm"           "pmm"              ""              "" 
##             odi 
##              ""

Now, ts will not be independently imputed along with the other variables. Rather, in each iteration, the most recently completed version of sws and ps will be added together to define the updated version of ts.

The updated version of ts defined according to the deterministic relation described above can then be used as a predictor when imputing other variables, but we do not want to use ts to impute either sws or ps (to avoid circularity). So, we need to adjust the predictor matrix to satisfy this restriction.

(pred <- imp1$predictorMatrix)

##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  1   1  1  1   1   1
## ps            0  1   1   1  0  1   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0

pred[c("sws", "ps"), "ts"] <- 0
pred

##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  0   1  1  1   1   1
## ps            0  1   1   1  0  0   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0

Now, we can re-impute the mammalsleep data using passive imputation to account for the deterministic relation between ts, sws, and ps. We do so simply by using the updated method vector and predictor matrix in a regular run of mice().

imp <- mice(mammalsleep, 
            method = meth, 
            predictorMatrix = pred, 
            maxit = 20, 
            seed = 235711, 
            print = FALSE)

If we inspect the diagnostic plots for these imputations, we see much better performance than we achieved in Questions 2.1 or 2.5.

plot(imp)

densityplot(imp)

stripplot(imp)

We can see that the pathological non-convergence of Question 2.5 has been corrected by the passive imputation.
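Beyond the plots, we can check numerically that the passively imputed ts values equal the sum of the completed sws and ps values. The sketch below rebuilds the same setup from scratch so that it runs on its own. The object names (meth_chk, pred_chk, imp_chk) are ours. The check is restricted to rows where ts was originally missing, because passive imputation only fills the missing cells of ts.

```r
library(mice)

# Method vector: passive imputation for ts, defaults elsewhere
meth_chk <- make.method(mammalsleep)
meth_chk["ts"] <- "~ I(sws + ps)"

# Predictor matrix: drop species everywhere, and keep ts out of the
# models for its own components
pred_chk <- make.predictorMatrix(mammalsleep)
pred_chk[, "species"] <- 0
pred_chk[c("sws", "ps"), "ts"] <- 0

imp_chk <- suppressWarnings(
  mice(mammalsleep,
       method = meth_chk,
       predictorMatrix = pred_chk,
       maxit = 20,
       seed = 235711,
       print = FALSE)
)

# Largest absolute gap between ts and sws + ps in each completed dataset,
# looking only at rows where ts was originally missing
ts_missing <- is.na(mammalsleep$ts)
max_gap <- sapply(seq_len(imp_chk$m), function(i) {
  dat <- complete(imp_chk, i)
  max(abs(dat$ts - (dat$sws + dat$ps))[ts_missing])
})
max_gap
```

Gaps of zero (up to floating-point error) indicate that the deterministic relation holds for every imputed ts value.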

You will now implement passive imputation yourself using the boys dataset. The boys dataset is distributed with mice, so you will be able to access these data once you’ve loaded the mice package. The boys data are a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study.

Unless otherwise specified, all questions in this section refer to the boys dataset.

3.1

Check the documentation for the boys data.

?boys

3.2

Summarize the boys data to get a sense of their characteristics.

head(boys)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

summary(boys)

##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

str(boys)

## 'data.frame':    748 obs. of  9 variables:
##  $ age: num  0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
##  $ hgt: num  50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
##  $ wgt: num  3.65 3.37 3.14 4.27 5.03 ...
##  $ bmi: num  14.5 11.8 12.6 14.4 15.2 ...
##  $ hc : num  33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
##  $ gen: Ord.factor w/ 5 levels "G1"<"G2"<"G3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ phb: Ord.factor w/ 6 levels "P1"<"P2"<"P3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ tv : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reg: Factor w/ 5 levels "north","east",..: 4 4 4 4 4 4 4 3 3 2 ...

3.3

Use the mice::md.pattern() function to summarize the response patterns.

How many different missing data patterns are present in the boys data?
Which pattern occurs most frequently in these data?

(pats <- md.pattern(boys))

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

There are 13 total patterns. The pattern where gen, phb, and tv are missing occurs the most frequently.
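The same answers can be read off the md.pattern() result programmatically, which is handy when there are too many patterns to count by eye. This is a minimal sketch; the object names are ours.

```r
library(mice)

# plot = FALSE suppresses the graphical display and returns only the matrix.
pats <- md.pattern(boys, plot = FALSE)

# The last row holds per-variable missing counts, so it is not a pattern.
n_patterns <- nrow(pats) - 1

# Row names of the pattern rows give the number of cases with each pattern.
counts <- as.numeric(rownames(pats)[seq_len(n_patterns)])

# The last column counts missing variables per pattern, so drop it before
# reading off which variables are missing (coded 0) in the most common pattern.
top_pattern  <- pats[which.max(counts), -ncol(pats)]
missing_vars <- names(top_pattern)[top_pattern == 0]

n_patterns
max(counts)
missing_vars
```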

3.4

Multiply impute the boys data using passive imputation for bmi.

Use passive imputation to maintain the known relation between bmi, wgt, and hgt.

Specify the method vector entry for bmi as "~ I(wgt / (hgt / 100)^2)".
Keep the default predictor matrix, for now.
Use 20 iterations.
Set a random number seed.
Leave all other settings at their default values.
Name the resulting mids object imp1.
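The division by 100 in the method string converts height from centimeters to meters, which is the scale the BMI formula expects. As a quick sanity check, the sketch below applies the same expression to the first row of the boys data shown above (hgt = 50.1, wgt = 3.650) and compares the result with the recorded bmi of 14.54.

```r
# Values copied from the first row of head(boys)
wgt <- 3.650  # weight in kg
hgt <- 50.1   # height in cm

# Same expression as in the passive imputation method string
bmi_calc <- wgt / (hgt / 100)^2

round(bmi_calc, 2)
```

The rounded result agrees with the recorded value, so the formula matches the way bmi is defined in these data.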

## Use the mice::make.method() function to generate a default method vector:
(meth <- make.method(boys))

##       age       hgt       wgt       bmi        hc       gen       phb        tv 
##        ""     "pmm"     "pmm"     "pmm"     "pmm"    "polr"    "polr"     "pmm" 
##       reg 
## "polyreg"

meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth

##                        age                        hgt 
##                         ""                      "pmm" 
##                        wgt                        bmi 
##                      "pmm" "~ I(wgt / (hgt / 100)^2)" 
##                         hc                        gen 
##                      "pmm"                     "polr" 
##                        phb                         tv 
##                     "polr"                      "pmm" 
##                        reg 
##                  "polyreg"

imp1 <- mice(boys, method = meth, maxit = 20, seed = 235711, print = FALSE)

Run the following code to inspect the relation between the imputed BMI and the BMI calculated from the imputed height and weight. If the passive imputation was successful, these points should fall along a perfect line.

xyplot(imp1, 
       bmi ~ I(wgt / (hgt / 100)^2), 
       ylab = "Imputed BMI", 
       xlab = "Calculated BMI")

3.5

Create trace plots, density plots, and strip plots from the mids object you created in Question 3.4.

What do you conclude vis-a-vis convergence and the validity of these imputations?

plot(imp1)

densityplot(imp1)

stripplot(imp1)

Although the deterministic definition of bmi is now preserved in the completed data, the trace plots indicate some pathological behavior for bmi, hgt, and wgt. We also get some absurd imputations for bmi.

Of course, the issues you should have spotted in the above imputations are to be expected since we have purposefully omitted the second part of passive imputation. We have not adjusted the predictor matrix, so we have circularity in the imputations. We used passive imputation to create the imputations for bmi, but bmi is still used as a predictor for wgt and hgt.

3.6

Adjust the predictor matrix to remove the circularity described above.

Re-impute the boys data using the updated predictor matrix.

Keep all other settings the same as in Question 3.4.
Name the resulting mids object imp1.

(pred <- imp1$predictorMatrix)

##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   1  1   1   1  1   1
## wgt   1   1   0   1  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## phb   1   1   1   1  1   1   0  1   1
## tv    1   1   1   1  1   1   1  0   1
## reg   1   1   1   1  1   1   1  1   0

pred[c("hgt", "wgt"), "bmi"] <- 0
pred

##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   0  1   1   1  1   1
## wgt   1   1   0   0  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## phb   1   1   1   1  1   1   0  1   1
## tv    1   1   1   1  1   1   1  0   1
## reg   1   1   1   1  1   1   1  1   0

imp1 <- mice(boys, 
             method = meth, 
             predictorMatrix = pred, 
             maxit = 20,
             seed = 235711, 
             print = FALSE)

3.7

Recreate the xyplot() from above using the imputations from Question 3.6.

Is the deterministic definition of bmi maintained in the imputed data?

xyplot(imp1, 
       bmi ~ I(wgt / (hgt / 100)^2), 
       ylab="Imputed BMI", 
       xlab="Calculated BMI")

Yes, the relation is maintained. All points fall along the \(Y = X\) line.

3.8

Create trace plots, density plots, and strip plots from the mids object you created in Question 3.6.

What do you conclude vis-a-vis convergence and the validity of these imputations?

plot(imp1)

densityplot(imp1)

stripplot(imp1)

Everything looks good now. The trace plots indicate good convergence (though we clearly need more than the default 5 iterations for the model to stabilize). Judging from the density plots and strip plots, the imputations also seem sensible.
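If a run turns out to need more iterations, you do not have to start over: mice.mids() continues an existing mids object for additional iterations. The sketch below illustrates the idea on the small nhanes dataset that ships with mice, which we use here only as a fast stand-in; the object names are ours.

```r
library(mice)

# Start with the default number of iterations
imp_short <- mice(nhanes, maxit = 5, seed = 235711, print = FALSE)

# Continue the same chains for 15 more iterations
imp_long <- mice.mids(imp_short, maxit = 15, printFlag = FALSE)

# Total number of iterations run so far
imp_long$iteration
```

The trace plots from plot(imp_long) then cover all iterations, which makes it easier to judge whether the chains have stabilized.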

Just for fun: What you shouldn’t do with passive imputation

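Never define every variable in a relation via passive imputation. If each variable is a deterministic function of the others, no values are ever drawn from an imputation model, and the algorithm will never escape the starting values.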

meth <- make.method(boys)

meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"] <- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"] <- "~ I(sqrt(wgt / bmi) * 100)"

imp <- mice(boys, method = meth, seed = 235711, print = FALSE)

plot(imp, c("hgt", "wgt", "bmi"))

End of Lab 2b

Lab 2b: Passive Imputation

Missing Data in R

Kyle M. Lang

Updated: 2025-01-28
