In these lab exercises, you will explore diagnostic tools that you can use to evaluate the extent of a missing data problem.
Load packages.
Use the library() function to load the mice, ggmice, naniar, and ggplot2 packages into your workspace.
library(mice)
library(ggmice)
library(naniar)
library(ggplot2)
The mice package contains several datasets. Once an R package is loaded, the datasets contained therein will be accessible within the current R session.
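For instance, you can list all of the datasets bundled with mice (along with short descriptions) using the data() utility from base R.

## List the datasets provided by the mice package:
data(package = "mice")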
The nhanes dataset (Schafer, 1997, Table 6.14) is one of the datasets provided by the mice package. It is a small dataset with non-monotone missing values, containing 25 observations on four variables:

- age: Age group (1 = 20-39, 2 = 40-59, 3 = 60+)
- bmi: Body mass index (kg/m^2)
- hyp: Hypertensive (1 = no, 2 = yes)
- chl: Total serum cholesterol (mg/dL)

Unless otherwise specified, all questions below apply to the nhanes dataset.
Print the nhanes
dataset to the console.
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
Access the documentation for the nhanes
dataset.
?nhanes
Use the naniar::vis_miss()
function to visualize the missing data.
vis_miss(nhanes)
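If the rows are hard to read in the default ordering, vis_miss() can also cluster rows with similar missingness patterns.

## Cluster rows by missingness pattern:
vis_miss(nhanes, cluster = TRUE)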
Use the summary()
function to summarize the nhanes
dataset.
summary(nhanes)
## age bmi hyp chl
## Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0
## 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0
## Median :2.00 Median :26.75 Median :1.000 Median :187.0
## Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4
## 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0
## Max. :3.00 Max. :35.30 Max. :2.000 Max. :284.0
## NA's :9 NA's :8 NA's :10
Use the output of the summary()
function to answer the next two questions.
Which variable has the most missing values?
The cholesterol variable, chl, has the most missing values (i.e., 10).
Which variables, if any, have no missing values?
Only age
is fully observed.
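We can also confirm these answers programmatically by counting the missing values in each column.

## Count the missing values per variable:
colSums(is.na(nhanes))

## Which variable has the most missing values?
names(which.max(colSums(is.na(nhanes))))

## Which variables are fully observed?
names(which(colSums(is.na(nhanes)) == 0))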
Compute the proportion of missing values for each variable.
pm <- colMeans(is.na(nhanes))
## OR ##
pm <- is.na(nhanes) %>% colMeans()
Notice how the second solution uses the magrittr pipe (%>%) to clarify the sequence of steps in the process.
What is the proportion of missing data in bmi?
pm["bmi"]
## bmi
## 0.36
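If you also want the overall proportion of missing cells in the dataset, naniar provides prop_miss(); the base R equivalent is just as easy.

## Overall proportion of missing cells:
prop_miss(nhanes)

## Equivalent base R calculation:
mean(is.na(nhanes))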
Use the naniar::gg_miss_var() function to visualize the percent missing for each variable.
gg_miss_var(nhanes, show_pct = TRUE)
Notice that I have specified the show_pct = TRUE option to plot the percent missing. The default is to plot the counts of missing values.
Inspecting the missing data/response patterns is always useful (but may be difficult for datasets with many variables). The response patterns give an indication of how much information is missing and how the missingness is distributed.
Visualize the missing data patterns.
Use the plot_pattern() function from the ggmice package.
plot_pattern(nhanes)
There are 27 missing values in total: 10 for chl, 9 for bmi, and 8 for hyp. Moreover, there are 13 completely observed rows, four rows with one missing value, one row with two missing values, and seven rows with three missing values.
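You can verify these row-wise counts without the plot by tabulating the number of missing values in each row.

## Tabulate the number of missing values per row:
table(rowSums(is.na(nhanes)))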
How many observations would be available if we only analyzed complete cases?
The data would contain 13 observations if we excluded incomplete cases.
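Again, this number is easy to verify directly.

## Count the complete cases:
sum(complete.cases(nhanes))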
Which missing data pattern is most frequently observed?
The pattern wherein all variables are observed.
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
## Calculate covariance coverage:
cc <- md.pairs(nhanes)$rr / nrow(nhanes)
round(cc, 2)
## age bmi hyp chl
## age 1.00 0.64 0.68 0.60
## bmi 0.64 0.64 0.64 0.52
## hyp 0.68 0.64 0.68 0.56
## chl 0.60 0.52 0.56 0.60
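If you want to flag problematic pairs automatically, you can apply a threshold to the coverage matrix. For example, using an arbitrary threshold of 0.6:

## Flag variable pairs with coverage below 0.6 (arbitrary threshold):
which(cc < 0.6, arr.ind = TRUE)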
Calculate the flux statistics for each variable.
Use the flux() function from the mice package to compute a panel of flux statistics.
flux(nhanes)
## pobs influx outflux ainb aout fico
## age 1.00 0.0000000 1.0000000 0.0000000 0.36000000 0.4800000
## bmi 0.64 0.1643836 0.1111111 0.4444444 0.06250000 0.1875000
## hyp 0.68 0.1232877 0.1481481 0.3750000 0.07843137 0.2352941
## chl 0.60 0.2191781 0.1111111 0.5333333 0.06666667 0.1333333
Create a missingness vector for bmi.
This vector should be a logical vector of the same length as the bmi vector. The missingness vector should take the value TRUE for all missing entries in bmi and FALSE for all observed entries in bmi.
mBmi <- is.na(nhanes$bmi)
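A quick sanity check: the missingness vector should contain nine TRUE values, matching the nine missing entries in bmi.

## Check the missingness vector:
table(mBmi)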
Test if missingness on bmi depends on age.
Use the t.test() function and the missingness vector you created above.
out <- t.test(age ~ mBmi, data = nhanes)
out$statistic
## t
## 0.4095038
out$p.value
## [1] 0.6875405
Missingness in bmi does not significantly depend on age.
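The same recipe works for any variable's missingness indicator. For example, here is a sketch of the analogous test of whether missingness on chl depends on age.

## Does missingness on chl depend on age?
mChl <- is.na(nhanes$chl)
t.test(age ~ mChl, data = nhanes)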
Test if all missingness in nhanes
is MCAR.
Use the naniar::mcar_test()
function to implement the
Little (1988) MCAR Test.
out <- mcar_test(nhanes)
out$statistic
## [1] 7.999024
out$p.value
## [1] 0.5342446
The missing values in nhanes
appear to be MCAR.
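Note that mcar_test() returns more than the two values extracted above; printing the full result should also show the degrees of freedom and the number of missing data patterns used by the test.

## Inspect the full result of the MCAR test:
out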
Use the naniar::geom_miss_point() function to visualize the distribution of missing values between bmi and chl.
You will first need to set up the figure using ggplot(). You can then apply geom_miss_point() to plot the points.
ggplot(data = nhanes, mapping = aes(x = bmi, y = chl)) + geom_miss_point()
Missing values in chl tend to occur more frequently in the lower tail of bmi, and vice versa.
Real-world missing data problems are rarely as simple as the situation explored above. Now, we will consider a slightly more complex dataset. For the remaining exercises, you will analyze the Eating Attitudes data from Enders (2010). These data are available as eating_attitudes.rds.
This dataset includes 400 observations of the following 14 variables. Note that the variables are listed in the order in which they appear in the dataset.

- id: A numeric ID
- eat1:eat24: Seven indicators of a Drive for Thinness construct
- eat3:eat21: Three indicators of a Preoccupation with Food construct
- bmi: Body mass index
- wsb: A single item assessing Western Standards of Beauty
- anx: A single item assessing Anxiety Level

You can download the original data here, and you can access the code used to process the data here.
Read in the eating_attitudes.rds dataset.
dataDir <- "../../../data/"
ea <- readRDS(paste0(dataDir, "eating_attitudes.rds"))
NOTE: You may need to change dataDir to match the location of the data on your machine.
Summarize the EA data to get a sense of their characteristics.
head(ea)
## id eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
## 1 1 4 4 4 4 4 4 3 4 5 4 18.92319 9 11
## 2 2 6 5 6 6 6 7 6 5 6 5 26.00913 13 19
## 3 3 3 3 2 2 3 2 3 3 3 2 18.29959 6 8
## 4 4 3 3 4 3 4 4 3 3 5 4 18.16987 5 14
## 5 5 3 2 3 3 3 3 3 4 4 4 24.44818 10 7
## 6 6 4 5 4 5 4 4 4 4 5 4 25.10097 7 11
summary(ea)
## id eat1 eat2 eat10
## Min. : 1.0 Min. :2.000 Min. :1.000 Min. :1.000
## 1st Qu.:100.8 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :200.5 Median :4.000 Median :4.000 Median :4.000
## Mean :200.5 Mean :3.987 Mean :3.939 Mean :3.938
## 3rd Qu.:300.2 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :400.0 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :27 NA's :21 NA's :30
## eat11 eat12 eat14 eat24
## Min. :1.000 Min. :2.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.938 Mean :3.905 Mean :3.962 Mean :3.954
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :33 NA's :33
## eat3 eat18 eat21 bmi
## Min. :1.000 Min. :2.000 Min. :1.000 Min. :15.23
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:20.43
## Median :4.000 Median :4.000 Median :4.000 Median :22.42
## Mean :3.967 Mean :3.951 Mean :3.953 Mean :22.39
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:24.21
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :31.52
## NA's :32 NA's :19 NA's :5
## wsb anx
## Min. : 4.000 Min. : 4.00
## 1st Qu.: 8.000 1st Qu.:10.00
## Median : 9.000 Median :12.00
## Mean : 8.946 Mean :11.94
## 3rd Qu.:10.000 3rd Qu.:14.00
## Max. :14.000 Max. :20.00
## NA's :27 NA's :18
str(ea)
## 'data.frame': 400 obs. of 14 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ eat1 : num 4 6 3 3 3 4 5 4 4 6 ...
## $ eat2 : num 4 5 3 3 2 5 4 3 7 5 ...
## $ eat10: num 4 6 2 4 3 4 4 4 6 5 ...
## $ eat11: num 4 6 2 3 3 5 4 4 5 5 ...
## $ eat12: num 4 6 3 4 3 4 4 4 4 6 ...
## $ eat14: num 4 7 2 4 3 4 4 4 6 6 ...
## $ eat24: num 3 6 3 3 3 4 4 4 4 5 ...
## $ eat3 : num 4 5 3 3 4 4 3 6 4 5 ...
## $ eat18: num 5 6 3 5 4 5 3 6 4 6 ...
## $ eat21: num 4 5 2 4 4 4 3 5 4 5 ...
## $ bmi : num 18.9 26 18.3 18.2 24.4 ...
## $ wsb : num 9 13 6 5 10 7 11 8 10 12 ...
## $ anx : num 11 19 8 14 7 11 12 12 14 12 ...
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
library(magrittr) # Provides the extract2() alias function
library(dplyr)    # Provides the select() function
## Calculate covariance coverage:
cc <- (ea %>% select(-id) %>% md.pairs() %>% extract2("rr")) / nrow(ea)
round(cc, 2)
## eat1 eat2 eat10 eat11 eat12 eat14 eat24 eat3 eat18 eat21 bmi wsb anx
## eat1 0.93 0.89 0.87 0.93 0.86 0.93 0.86 0.93 0.86 0.89 0.92 0.88 0.90
## eat2 0.89 0.95 0.87 0.95 0.87 0.95 0.88 0.95 0.88 0.90 0.94 0.88 0.90
## eat10 0.87 0.87 0.92 0.92 0.86 0.92 0.86 0.92 0.86 0.88 0.92 0.86 0.88
## eat11 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat12 0.86 0.87 0.86 0.92 0.92 0.92 0.85 0.92 0.86 0.87 0.91 0.87 0.87
## eat14 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat24 0.86 0.88 0.86 0.92 0.85 0.92 0.92 0.92 0.86 0.88 0.90 0.85 0.88
## eat3 0.93 0.95 0.92 1.00 0.92 1.00 0.92 1.00 0.92 0.95 0.99 0.93 0.96
## eat18 0.86 0.88 0.86 0.92 0.86 0.92 0.86 0.92 0.92 0.88 0.91 0.86 0.88
## eat21 0.89 0.90 0.88 0.95 0.87 0.95 0.88 0.95 0.88 0.95 0.94 0.89 0.91
## bmi 0.92 0.94 0.92 0.99 0.91 0.99 0.90 0.99 0.91 0.94 0.99 0.92 0.94
## wsb 0.88 0.88 0.86 0.93 0.87 0.93 0.85 0.93 0.86 0.89 0.92 0.93 0.89
## anx 0.90 0.90 0.88 0.96 0.87 0.96 0.88 0.96 0.88 0.91 0.94 0.89 0.96
Summarize the coverages from 6.3.
Covariance coverage matrices are often very large and, hence, difficult to parse. It can be useful to distill this information into a few succinct summaries to help extract the useful knowledge.
One of the most useful numeric summaries is the range. We’ll start there.
NOTE: The covariance coverage matrix is symmetric, so we only need to consider the elements below (or above) the diagonal.
## Range of coverages < 1.0:
## NOTE: Use lower.tri() to select only the elements below the diagonal.
uniqueCc <- cc[lower.tri(cc)]
uniqueCc[uniqueCc < 1] %>% range()
## [1] 0.8500 0.9875
Sometimes, we may want to count the number of coverages that satisfy some condition (e.g., fall below some threshold). When doing such a summary, we need to remember two idiosyncrasies of the covariance coverage matrix:

- The matrix is symmetric, so every off-diagonal element appears twice.
- The diagonal elements represent the proportion of observed cases for each individual variable, not the coverage for a pair of distinct variables.

So, to get a count of covariance coverages without double counting, we consider only the elements from the lower (or upper) triangle.
## Count the number of coverages lower than 0.9:
sum(uniqueCc < 0.9)
## [1] 32
Visualize the covariance coverage rates from 6.3.
As with numeric summaries, visualizations are also a good way to distill meaningful knowledge from the raw information in a covariance coverage matrix.
We’ll try two different visualizations.
First, we’ll create a simple histogram of the coverages to get a sense of their distribution. To see why such a visualization is useful, consider two hypothetical situations wherein the range of coverages is [0.2; 0.8]:

- In the first situation, most of the coverages fall close to 0.2, with only a few close to 0.8.
- In the second situation, most of the coverages fall close to 0.8, with only a few close to 0.2.

Clearly, the second situation is less problematic, but we cannot differentiate between these two simply by examining the range of coverages.
library(ggplot2)
## Simple histogram of coverages:
data.frame(Coverage = uniqueCc) %>%
ggplot(aes(Coverage)) +
geom_histogram(bins = 15, color = "black")
Next, we’ll create a heatmap of the coverage matrix itself. Such a visualization could help us locate clusters of variables with especially low coverages, for example.
## Convert the coverage matrix into a plottable, tidy-formatted data.frame:
pDat <- data.frame(Coverage = as.numeric(cc),
x = rep(colnames(cc), ncol(cc)),
y = rep(colnames(cc), each = ncol(cc))
)
## Create the heatmap via the "tile" geom:
ggplot(pDat, aes(x = x, y = y, fill = Coverage)) +
geom_tile() +
scale_x_discrete(limits = rev(colnames(cc))) +
scale_y_discrete(limits = colnames(cc)) +
scale_fill_distiller(palette = "Oranges") +
xlab(NULL) +
ylab(NULL)
Visualize the missing data patterns.
HINT:

- The plot_pattern() function from ggmice will create a nice visualization of the patterns.
- The md.pattern() function from mice will create a (somewhat less beautiful) visualization but will return a numeric pattern matrix that you can further analyze.

library(ggmice)
## Visualize the response patterns:
plot_pattern(ea)
## Create an analyzable pattern matrix (without visualization):
(pats <- md.pattern(ea, plot = FALSE))
## id eat11 eat14 eat3 bmi anx eat21 eat2 eat1 wsb eat10 eat18 eat12 eat24
## 242 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 2
## 3 1 1 1 1 1 1 1 1 1 1 1 0 0 1 2
## 2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3
## 14 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 0 1 1 0 2
## 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 2
## 2 1 1 1 1 1 1 1 1 1 1 0 0 1 1 2
## 2 1 1 1 1 1 1 1 1 1 1 0 0 0 1 3
## 12 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 0 1 1 0 1 2
## 3 1 1 1 1 1 1 1 1 1 0 1 0 0 1 3
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 3
## 5 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 2
## 2 1 1 1 1 1 1 1 1 0 1 1 1 0 1 2
## 2 1 1 1 1 1 1 1 1 0 1 1 1 0 0 3
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 0 3
## 3 1 1 1 1 1 1 1 1 0 0 1 1 1 1 2
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 3
## 9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 2
## 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 2
## 2 1 1 1 1 1 1 1 0 1 1 1 0 1 1 2
## 3 1 1 1 1 1 1 1 0 1 1 1 0 1 0 3
## 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 2
## 3 1 1 1 1 1 1 1 0 0 1 1 1 1 1 2
## 13 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 2
## 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 3
## 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 3
## 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 3
## 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 4
## 10 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 2
## 4 1 1 1 1 1 0 1 1 0 1 1 1 1 1 2
## 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 3
## 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 2
## 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 2
## 3 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 3
## 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 3
## 0 0 0 0 5 18 19 21 27 27 30 32 33 33 245
## Count the number of unique response patterns:
nrow(pats) - 1
## [1] 46
As shown by the above code, there are 46 unique response patterns in the EA data.
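Since md.pattern() stores each pattern's frequency in the row names of the returned matrix, we can also analyze those frequencies directly. A small sketch (the last row holds the column totals, so we drop it):

## Extract the pattern frequencies from the row names:
patCounts <- as.numeric(rownames(pats)[-nrow(pats)])

## How many cases fall into the most common pattern?
max(patCounts)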
End of Lab 1