In these lab exercises, you will explore diagnostic tools that you can use to evaluate the extent of a missing data problem.
Load packages.
Use the library() function to load the mice, ggmice, naniar, and ggplot2 packages into your workspace.
The mice package contains several datasets. Once an R package is loaded, the datasets contained therein will be accessible within the current R session.
The nhanes dataset (Schafer, 1997, Table 6.14) is one of the datasets provided by the mice package. The nhanes dataset is a small dataset with non-monotone missing values. It contains 25 observations on four variables:
age
bmi
hyp
chl
Unless otherwise specified, all questions below apply to the nhanes dataset.
Print the nhanes dataset to the console.
Access the documentation for the nhanes dataset.
Use the naniar::vis_miss() function to visualize the missing data.
Use the summary() function to summarize the nhanes dataset.
Use the output of the summary() function to answer the next two questions.
Which variable has the most missing values?
Which variables, if any, have no missing values?
Compute the proportion of missing values for each variable.
What is the proportion of missing data in bmi?
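As a minimal sketch of the idea, the proportion of missing values per variable can be computed with base R alone. The example uses the built-in airquality data as a stand-in (an assumption made here so that the nhanes questions stay open); the object names dat, n_miss, and p_miss are illustrative.

```r
# Stand-in data: airquality ships with base R and contains missing values
dat <- airquality

# is.na() on a data frame returns a logical matrix (TRUE = missing)
n_miss <- colSums(is.na(dat))   # number of missing values per variable
p_miss <- colMeans(is.na(dat))  # proportion of missing values per variable

round(p_miss, 3)
```

The same two calls should work unchanged on nhanes once mice is loaded.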
Use the naniar::gg_miss_var() function to visualize the percentage of missing values for each variable.
Inspecting the missing data/response patterns is always useful (but may be difficult for datasets with many variables). The response patterns give an indication of how much information is missing and how the missingness is distributed.
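To make the notion of a response pattern concrete, the sketch below tabulates patterns by hand in base R. It again uses the built-in airquality data as a stand-in rather than nhanes, and the pattern-string encoding is just one illustrative choice, not what the mice or ggmice functions do internally.

```r
dat <- airquality
miss <- is.na(dat)  # logical matrix, TRUE = missing

# Encode each row's pattern as a string of 0/1 flags, one per variable
# (e.g., "100000" means only the first variable is missing)
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency of each distinct pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts

# Number of rows that would remain under complete case analysis
n_complete <- sum(complete.cases(dat))
n_complete
```

The all-zero pattern corresponds to the complete cases, so its count matches n_complete.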
Visualize the missing data patterns.
Use the plot_pattern() function from the ggmice package.
How many observations would be available if we only analyzed complete cases?
Which missing data pattern is most frequently observed?
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
Calculate the flux statistics for each variable.
Use the flux() function from the mice package to compute a panel of flux statistics.
Create a missingness vector for bmi.
This vector should be a logical vector of the same length as the bmi vector.
The missingness vector should take the value TRUE for all missing entries in bmi and FALSE for all observed entries in bmi.
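A missingness vector of this kind is what is.na() returns. The sketch below illustrates the construction on the Ozone column of the built-in airquality data (a stand-in chosen here, not part of the lab); ozone and m_ozone are illustrative names.

```r
ozone <- airquality$Ozone

# TRUE where the entry is missing, FALSE where it is observed
m_ozone <- is.na(ozone)

# Check the two required properties: logical type and matching length
is.logical(m_ozone)
length(m_ozone) == length(ozone)

# Counts of observed (FALSE) and missing (TRUE) entries
table(m_ozone)
```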
Test if missingness on bmi depends on age.
Use the t.test() function and the missingness vector you created above.
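The general shape of such a test is a two-sample comparison in which the missingness vector acts as the grouping variable. The following is a sketch on the built-in airquality data, with Temp playing the role of the predictor and Ozone the role of the incomplete variable; both are stand-ins assumed here for illustration.

```r
dat <- airquality
m_ozone <- is.na(dat$Ozone)  # grouping variable: TRUE = Ozone missing

# Welch two-sample t-test (the t.test() default): compares mean Temp
# between rows where Ozone is observed and rows where it is missing
out <- t.test(Temp ~ m_ozone, data = dat)

out$estimate  # group means of Temp
out$p.value
```

A clear difference in group means would suggest that missingness is related to the predictor, which counts as evidence against MCAR for that variable.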
Test if all missingness in nhanes is MCAR.
Use the naniar::mcar_test() function to implement the Little (1988) MCAR Test.
Use the naniar::geom_miss_point() function to visualize the distribution of missing values between bmi and chl.
You will first need to set up the figure using ggplot(). You can then apply geom_miss_point() to plot the points.
Real-world missing data problems are rarely as simple as the situation explored above. Now, we will consider a slightly more complex dataset. For the remaining exercises, you will analyze the Eating Attitudes data from Enders (2010). These data are available as eating_attitudes.rds.
This dataset includes 400 observations of the following 14 variables. Note that the variables are listed in the order that they appear in the dataset.
id: A numeric ID
eat1:eat24: Seven indicators of a Drive for Thinness construct
eat3:eat21: Three indicators of a Preoccupation with Food construct
bmi: Body mass index
wsb: A single item assessing Western Standards of Beauty
anx: A single item assessing Anxiety Level
You can download the original data here, and you can access the code used to process the data here.
Read in the eating_attitudes.rds dataset.
NOTE:
Summarize the EA data to get a sense of their characteristics.
Calculate the covariance coverage rates.
Use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
Summarize the coverages from 6.3.
Covariance coverage matrices are often very large and, hence, difficult to parse. It can be useful to distill this information into a few succinct summaries to help extract the useful knowledge.
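One possible way to distill a coverage matrix is sketched below in base R. The matrix is built directly from response indicators on the built-in airquality data (a stand-in, not the EA data), and the 0.75 cutoff is a hypothetical threshold chosen only for illustration.

```r
dat <- airquality
r <- !is.na(dat)  # response indicators, TRUE = observed

# Jointly observed counts for every pair of variables, as proportions of n
cc <- (t(r) %*% r) / nrow(dat)

# Keep each unique pair once by taking the lower triangle
cov_pairs <- cc[lower.tri(cc)]

# Succinct summaries: range and mean of the pairwise coverages
range(cov_pairs)
mean(cov_pairs)

# Flag variable pairs whose coverage falls below a hypothetical threshold
threshold <- 0.75
low <- which(cc < threshold & lower.tri(cc), arr.ind = TRUE)
low_pairs <- data.frame(
  var1 = rownames(cc)[low[, "row"]],
  var2 = colnames(cc)[low[, "col"]],
  coverage = cc[low]
)
low_pairs
```

With mice loaded, the rr element returned by md.pairs() holds the jointly observed counts, so dividing it by the number of rows should give the same matrix as cc.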
Visualize the covariance coverage rates from 6.3.
As with numeric summaries, visualizations are also a good way to distill meaningful knowledge from the raw information in a covariance coverage matrix.
Visualize the missing data patterns.
HINT:
The plot_pattern() function from ggmice will create a nice visualization of the patterns.
The md.pattern() function from mice will create a (somewhat less beautiful) visualization but will return a numeric pattern matrix that you can further analyze.
End of Lab 1