In these lab exercises, you will explore diagnostic tools that you can use to evaluate the extent of a missing data problem.
Load packages.
Use the library() function to load the mice, ggmice, naniar, and ggplot2 packages into your workspace.
library(mice)
library(ggmice)
library(naniar)
library(ggplot2)
The mice package contains several datasets. Once an R package is loaded, the datasets contained therein will be accessible within the current R session.
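For instance, you can list all of the datasets bundled with mice (along with short descriptions) using the data() utility from base R.

## List the datasets provided by the mice package:
data(package = "mice")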
The nhanes dataset (Schafer, 1997, Table 6.14) is one of the datasets provided by the mice package. It is a small dataset with non-monotone missing values, containing 25 observations on four variables:

- age: Age group (1 = 20-39, 2 = 40-59, 3 = 60+)
- bmi: Body mass index (kg/m^2)
- hyp: Hypertensive (1 = no, 2 = yes)
- chl: Total serum cholesterol (mg/dL)

Unless otherwise specified, all questions below apply to the nhanes dataset.
Print the nhanes
dataset to the console.
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
Access the documentation for the nhanes
dataset.
?nhanes
Use the naniar::vis_miss()
function to visualize the missing data.
vis_miss(nhanes)
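If the rows are hard to read in the default ordering, vis_miss() can also cluster rows with similar missingness patterns.

## Cluster rows by missingness pattern:
vis_miss(nhanes, cluster = TRUE)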
Use the summary()
function to summarize the nhanes
dataset.
summary(nhanes)
## age bmi hyp chl
## Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0
## 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0
## Median :2.00 Median :26.75 Median :1.000 Median :187.0
## Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4
## 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0
## Max. :3.00 Max. :35.30 Max. :2.000 Max. :284.0
## NA's :9 NA's :8 NA's :10
Use the output of the summary()
function to answer the next two questions.
Which variable has the most missing values?
The cholesterol variable, chl, has the most missing values (i.e., 10).
Which variables, if any, have no missing values?
Only age
is fully observed.
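We can also confirm these answers programmatically by counting the missing values in each column.

## Count the missing values per variable:
colSums(is.na(nhanes))

## Which variable has the most missing values?
names(which.max(colSums(is.na(nhanes))))

## Which variables are fully observed?
names(which(colSums(is.na(nhanes)) == 0))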
Compute the proportion of missing values for each variable.
pm <- colMeans(is.na(nhanes))
## OR ##
pm <- is.na(nhanes) %>% colMeans()
Notice how the second solution uses the magrittr pipe (%>%) to clarify the sequence of steps in the process.
What is the proportion of missing data in bmi?
pm["bmi"]
## bmi
## 0.36
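If you also want the overall proportion of missing cells in the dataset, naniar provides prop_miss(); the base R equivalent is just as easy.

## Overall proportion of missing cells:
prop_miss(nhanes)

## Equivalent base R calculation:
mean(is.na(nhanes))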
Use the naniar::gg_miss_var() function to visualize the percent missing for each variable.
gg_miss_var(nhanes, show_pct = TRUE)
Notice that I have specified the show_pct = TRUE option to plot the percent missing. The default is to plot the counts of missing values.
Inspecting the missing data/response patterns is always useful (but may be difficult for datasets with many variables). The response patterns give an indication of how much information is missing and how the missingness is distributed.
Visualize the missing data patterns.
Use the plot_pattern() function from the ggmice package.
plot_pattern(nhanes)
There are 27 missing values in total: 10 for chl, 9 for bmi, and 8 for hyp. Moreover, there are 13 completely observed rows, four rows with one missing value, one row with two missing values, and seven rows with three missing values.
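You can verify these row-wise counts without the plot by tabulating the number of missing values in each row.

## Tabulate the number of missing values per row:
table(rowSums(is.na(nhanes)))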
How many observations would be available if we only analyzed complete cases?
The data would contain 13 observations if we excluded incomplete cases.
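Again, this number is easy to verify directly.

## Count the complete cases:
sum(complete.cases(nhanes))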
Which missing data pattern is most frequently observed?
The pattern wherein all variables are observed.
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
## Calculate covariance coverage:
cc <- md.pairs(nhanes)$rr / nrow(nhanes)
round(cc, 2)
## age bmi hyp chl
## age 1.00 0.64 0.68 0.60
## bmi 0.64 0.64 0.64 0.52
## hyp 0.68 0.64 0.68 0.56
## chl 0.60 0.52 0.56 0.60
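If you want to flag problematic pairs automatically, you can apply a threshold to the coverage matrix. For example, using an arbitrary threshold of 0.6:

## Flag variable pairs with coverage below 0.6 (arbitrary threshold):
which(cc < 0.6, arr.ind = TRUE)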
Calculate the flux statistics for each variable.
Use the flux() function from the mice package to compute a panel of flux statistics.
flux(nhanes)
## pobs influx outflux ainb aout fico
## age 1.00 0.0000000 1.0000000 0.0000000 0.36000000 0.4800000
## bmi 0.64 0.1643836 0.1111111 0.4444444 0.06250000 0.1875000
## hyp 0.68 0.1232877 0.1481481 0.3750000 0.07843137 0.2352941
## chl 0.60 0.2191781 0.1111111 0.5333333 0.06666667 0.1333333
Create a missingness vector for bmi.
This vector should be a logical vector of the same length as the bmi vector. The missingness vector should take the value TRUE for all missing entries in bmi and FALSE for all observed entries in bmi.
mBmi <- is.na(nhanes$bmi)
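A quick sanity check: the missingness vector should contain nine TRUE values, matching the nine missing entries in bmi.

## Check the missingness vector:
table(mBmi)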
Test if missingness on bmi depends on age.
Use the t.test() function and the missingness vector you created above.
out <- t.test(age ~ mBmi, data = nhanes)
out$statistic
## t
## 0.4095038
out$p.value
## [1] 0.6875405
Missingness in bmi does not significantly depend on age.
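The same recipe works for any variable's missingness indicator. For example, here is a sketch of the analogous test of whether missingness on chl depends on age.

## Does missingness on chl depend on age?
mChl <- is.na(nhanes$chl)
t.test(age ~ mChl, data = nhanes)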
Test if all missingness in nhanes
is MCAR.
Use the naniar::mcar_test()
function to implement the
Little (1988) MCAR Test.
out <- mcar_test(nhanes)
out$statistic
## [1] 7.999024
out$p.value
## [1] 0.5342446
The missing values in nhanes
appear to be MCAR.
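Note that mcar_test() returns more than the two values extracted above; printing the full result should also show the degrees of freedom and the number of missing data patterns used by the test.

## Inspect the full result of the MCAR test:
out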
Use the naniar::geom_miss_point() function to visualize the distribution of missing values between bmi and chl.
You will first need to set up the figure using ggplot(). You can then apply geom_miss_point() to plot the points.
ggplot(data = nhanes, mapping = aes(x = bmi, y = chl)) + geom_miss_point()
Missing values in chl tend to occur more frequently in the lower tail of bmi, and vice versa.
Real-world missing data problems are rarely as simple as the situation explored above. Now, we will consider a slightly more complex dataset. For the remaining exercises, you will analyze the Eating Attitudes data from Enders (2010). These data are available as eating_attitudes.rds.
This dataset includes 400 observations of the following 14 variables. Note that the variables are listed in the order in which they appear in the dataset.

- id: A numeric ID
- eat1:eat24: Seven indicators of a Drive for Thinness construct
- eat3:eat21: Three indicators of a Preoccupation with Food construct
- bmi: Body mass index
- wsb: A single item assessing Western Standards of Beauty
- anx: A single item assessing Anxiety Level

You can download the original data here, and you can access the code used to process the data here.
Read in the eating_attitudes.rds dataset.
dataDir <- "../../../data/"
ea <- readRDS(paste0(dataDir, "eating_attitudes.rds"))
NOTE: You may need to change dataDir to match the location of the data on your machine.
Summarize the EA data to get a sense of their characteristics.
head(ea)
## id eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
## 1 1 4 4 4 4 4 4 3 4 5 4 18.92319 9 11
## 2 2 6 5 6 6 6 7 6 5 6 5 26.00913 13 19
## 3 3 3 3 2 2 3 2 3 3 3 2 18.29959 6 8
## 4 4 3 3 4 3 4 4 3 3 5 4 18.16987 5 14
## 5 5 3 2 3 3 3 3 3 4 4 4 24.44818 10 7
## 6 6 4 5 4 5 4 4 4 4 5 4 25.10097 7 11
summary(ea)
## id eat1 eat2 eat10
## Min. : 1.0 Min. :2.000 Min. :1.000 Min. :1.000
## 1st Qu.:100.8 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :200.5 Median :4.000 Median :4.000 Median :4.000
## Mean :200.5 Mean :3.987 Mean :3.939 Mean :3.938
## 3rd Qu.:300.2 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :400.0 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :27 NA's :21 NA's :30
## eat11 eat12 eat14 eat24
## Min. :1.000 Min. :2.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.938 Mean :3.905 Mean :3.962 Mean :3.954
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :33 NA's :33
## eat3 eat18 eat21 bmi
## Min. :1.000 Min. :2.000 Min. :1.000 Min. :15.23
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:20.43
## Median :4.000 Median :4.000 Median :4.000 Median :22.42
## Mean :3.967 Mean :3.951 Mean :3.953 Mean :22.39
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:24.21
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :31.52
## NA's :32 NA's :19 NA's :5
## wsb anx
## Min. : 4.000 Min. : 4.00
## 1st Qu.: 8.000 1st Qu.:10.00
## Median : 9.000 Median :12.00
## Mean : 8.946 Mean :11.94
## 3rd Qu.:10.000 3rd Qu.:14.00
## Max. :14.000 Max. :20.00
## NA's :27 NA's :18
str(ea)
## 'data.frame': 400 obs. of 14 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ eat1 : num 4 6 3 3 3 4 5 4 4 6 ...
## $ eat2 : num 4 5 3 3 2 5 4 3 7 5 ...
## $ eat10: num 4 6 2 4 3 4 4 4 6 5 ...
## $ eat11: num 4 6 2 3 3 5 4 4 5 5 ...
## $ eat12: num 4 6 3 4 3 4 4 4 4 6 ...
## $ eat14: num 4 7 2 4 3 4 4 4 6 6 ...
## $ eat24: num 3 6 3 3 3 4 4 4 4 5 ...
## $ eat3 : num 4 5 3 3 4 4 3 6 4 5 ...
## $ eat18: num 5 6 3 5 4 5 3 6 4 6 ...
## $ eat21: num 4 5 2 4 4 4 3 5 4 5 ...
## $ bmi : num 18.9 26 18.3 18.2 24.4 ...
## $ wsb : num 9 13 6 5 10 7 11 8 10 12 ...
## $ anx : num 11 19 8 14 7 11 12 12 14 12 ...
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
library(magrittr) # Provides the extract2() alias function
library(dplyr)    # Provides the select() function
## Calculate covariance coverage:
cc <- (ea %>% select(-id) %>% md.pairs() %>% extract2("rr")) / nrow(ea)
round(cc, 2)
## eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
## eat1 0.93 0.89 0.87 0.93 0.86 0.93 0.86 0.93 0.86 0.89 0.92 0.88 0.90
## eat2 0.89 0.95 0.87 0.95 0.87 0.95 0.88 0.95 0.88 0.90 0.94 0.88 0.90
## eat10 0.87 0.87 0.92 0.92 0.86 0.92 0.86 0.92 0.86 0.88 0.92 0.86 0.88
## eat11 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat12 0.86 0.87 0.86 0.92 0.92 0.92 0.85 0.92 0.86 0.87 0.91 0.87 0.87
## eat14 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat24 0.86 0.88 0.86 0.92 0.85 0.92 0.92 0.92 0.86 0.88 0.90 0.85 0.88
## eat3 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat18 0.86 0.88 0.86 0.92 0.86 0.92 0.86 0.92 0.92 0.88 0.91 0.86 0.88
## eat21 0.89 0.90 0.88 0.95 0.87 0.95 0.88 0.95 0.88 0.95 0.94 0.89 0.91
## bmi 0.92 0.94 0.92 0.99 0.91 0.99 0.90 0.99 0.91 0.94 0.99 0.92 0.94
## wsb 0.88 0.88 0.86 0.93 0.87 0.93 0.85 0.93 0.86 0.89 0.92 0.93 0.89
## anx 0.90 0.90 0.88 0.96 0.87 0.96 0.88 0.96 0.88 0.91 0.94 0.89 0.96
Summarize the coverages from 6.3.
Covariance coverage matrices are often very large and, hence, difficult to parse. It can be useful to distill this information into a few succinct summaries to help extract the useful knowledge.
One of the most useful numeric summaries is the range. We’ll start there.
NOTE: The covariance coverage matrix is symmetric, so we only need to consider the elements below (or above) the diagonal.
## Range of coverages < 1.0:
## NOTE: Use lower.tri() to select only the elements below the diagonal.
uniqueCc <- cc[lower.tri(cc)]
uniqueCc[uniqueCc < 1] %>% range()
## [1] 0.8500 0.9875
Sometimes, we may want to count the number of coverages that satisfy some condition (e.g., fall below some threshold). When doing such a summary, we need to remember two idiosyncrasies of the covariance coverage matrix:

- The matrix is symmetric, so every off-diagonal element appears twice.
- The diagonal elements represent the proportion of observed cases for each individual variable, not the coverage for a pair of distinct variables.

So, to get a count of covariance coverages without double counting, we consider only the elements from the lower (or upper) triangle.
## Count the number of coverages lower than 0.9:
sum(uniqueCc < 0.9)
## [1] 32
Visualize the covariance coverage rates from 6.3.
As with numeric summaries, visualizations are also a good way to distill meaningful knowledge from the raw information in a covariance coverage matrix.
We’ll try two different visualizations.
First, we’ll create a simple histogram of the coverages to get a sense of their distribution. To see why such a visualization is useful, consider two hypothetical situations wherein the range of coverages is [0.2; 0.8]:

- In the first situation, most of the coverages fall close to 0.2, with only a few close to 0.8.
- In the second situation, most of the coverages fall close to 0.8, with only a few close to 0.2.

Clearly, the second situation is less problematic, but we cannot differentiate between these two simply by examining the range of coverages.
library(ggplot2)
## Simple histogram of coverages:
data.frame(Coverage = uniqueCc) %>%
ggplot(aes(Coverage)) +
geom_histogram(bins = 15, color = "black")
Next, we’ll create a heatmap of the coverage matrix itself. Such a visualization could help us locate clusters of variables with especially low coverages, for example.
## Convert the coverage matrix into a plottable, tidy-formatted data.frame:
pDat <- data.frame(Coverage = as.numeric(cc),
x = rep(colnames(cc), ncol(cc)),
y = rep(colnames(cc), each = ncol(cc))
)
## Create the heatmap via the "tile" geom:
ggplot(pDat, aes(x = x, y = y, fill = Coverage)) +
geom_tile() +
scale_x_discrete(limits = rev(colnames(cc))) +
scale_y_discrete(limits = colnames(cc)) +
scale_fill_distiller(palette = "Oranges") +
xlab(NULL) +
ylab(NULL)
Visualize the missing data patterns.
HINT:

- The plot_pattern() function from ggmice will create a nice visualization of the patterns.
- The md.pattern() function from mice will create a (somewhat less beautiful) visualization but will return a numeric pattern matrix that you can further analyze.

library(ggmice)
## Visualize the response patterns:
plot_pattern(ea)
## Create an analyzable pattern matrix (without visualization):
(pats <- md.pattern(ea, plot = FALSE))
## id eat11 eat14 eat3 bmi anx eat21 eat2 eat1 wsb eat10 eat18 eat12 eat24
## 242 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 2
## 3 1 1 1 1 1 1 1 1 1 1 1 0 0 1 2
## 2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3
## 14 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 0 1 1 0 2
## 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 2
## 2 1 1 1 1 1 1 1 1 1 1 0 0 1 1 2
## 2 1 1 1 1 1 1 1 1 1 1 0 0 0 1 3
## 12 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 0 1 1 0 1 2
## 3 1 1 1 1 1 1 1 1 1 0 1 0 0 1 3
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 3
## 5 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 2
## 2 1 1 1 1 1 1 1 1 0 1 1 1 0 1 2
## 2 1 1 1 1 1 1 1 1 0 1 1 1 0 0 3
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 0 3
## 3 1 1 1 1 1 1 1 1 0 0 1 1 1 1 2
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 3
## 9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 2
## 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 2
## 2 1 1 1 1 1 1 1 0 1 1 1 0 1 1 2
## 3 1 1 1 1 1 1 1 0 1 1 1 0 1 0 3
## 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 2
## 3 1 1 1 1 1 1 1 0 0 1 1 1 1 1 2
## 13 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 2
## 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 3
## 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 3
## 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 3
## 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 4
## 10 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 2
## 4 1 1 1 1 1 0 1 1 0 1 1 1 1 1 2
## 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 3
## 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 2
## 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 2
## 3 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 3
## 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 3
## 0 0 0 0 5 18 19 21 27 27 30 32 33 33 245
## Count the number of unique response patterns:
nrow(pats) - 1
## [1] 46
As shown by the above code, there are 46 unique response patterns in the EA data.
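Since md.pattern() stores each pattern's frequency in the row names of the returned matrix, we can also analyze those frequencies directly. A small sketch (the last row holds the column totals, so we drop it):

## Extract the pattern frequencies from the row names:
patCounts <- as.numeric(rownames(pats)[-nrow(pats)])

## How many cases fall into the most common pattern?
max(patCounts)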
End of Lab 1