In these lab exercises, you will explore diagnostic tools that you can use to evaluate the extent of a missing data problem.

1 Preliminaries

1.1

Load packages.

Use the library() function to load the mice, ggmice, naniar, and ggplot2 packages into your workspace.
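
A minimal sketch of the loading step, assuming all four packages have already been installed (for example, with install.packages()):

```r
# Attach the four packages used throughout this lab.
# Assumption: each package is already installed from CRAN.
library(mice)     # imputation tools and the nhanes data
library(ggmice)   # ggplot2-based plots for incomplete data
library(naniar)   # missing data summaries and visualizations
library(ggplot2)  # general plotting
```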

The mice package contains several datasets. Once an R package is loaded, the datasets contained therein will be accessible within the current R session.

The nhanes dataset (Schafer, 1997, Table 6.14) is one of the datasets provided by the mice package. The nhanes dataset is a small dataset with non-monotone missing values. It contains 25 observations on four variables:

Age Group (age)
Body Mass Index (bmi)
Hypertension (hyp)
Cholesterol (chl)

Unless otherwise specified, all questions below apply to the nhanes dataset.

1.2

Print the nhanes dataset to the console.

1.3

Access the documentation for the nhanes dataset.
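
One possible way to carry out the printing and documentation steps above, assuming the mice package is installed:

```r
library(mice)

# Printing the data frame shows missing entries as NA.
print(nhanes)

# Open the help page for the dataset; ?nhanes is an equivalent shortcut.
help("nhanes", package = "mice")
```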

1.4

Use the naniar::vis_miss() function to visualize the missing data.
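
A short sketch of this step; the object name p_miss is only illustrative:

```r
library(mice)
library(naniar)

# vis_miss() draws one row per observation and one column per variable,
# shading missing and observed cells differently.
p_miss <- vis_miss(nhanes)
print(p_miss)
```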

2 Response Rates

2.1

Use the summary() function to summarize the nhanes dataset.
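
As a sketch, the call might look like this:

```r
library(mice)

# For every variable that contains missing values, summary() appends an
# "NA's" entry to the usual quantiles and mean.
summary(nhanes)
```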

Use the output of the summary() function to answer the next two questions.

2.2

Which variable has the most missing values?

2.3

Which variables, if any, have no missing values?

2.4

Compute the proportion of missing values for each variable.
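
One base R approach, shown here as a sketch (the name p_missing is illustrative), is to average the logical missingness indicators column by column:

```r
library(mice)

# is.na() returns a logical matrix of the same shape as the data;
# the column means of that matrix are the proportions missing per variable.
p_missing <- colMeans(is.na(nhanes))
round(p_missing, 2)
```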

2.5

What is the proportion of missing data in bmi?

2.6

Use the naniar::gg_miss_var() function to visualize the percentage of missing values for each variable.
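
A possible call is sketched below. By default, gg_miss_var() plots counts, so the show_pct argument is set to request percentages instead:

```r
library(mice)
library(naniar)

# show_pct = TRUE puts the percentage missing (rather than the count)
# on the axis.
p_var <- gg_miss_var(nhanes, show_pct = TRUE)
print(p_var)
```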

3 Response Patterns

Inspecting the missing data/response patterns is always useful (but may be difficult for datasets with many variables). The response patterns give an indication of how much information is missing and how the missingness is distributed.

3.1

Visualize the missing data patterns.

You can use the plot_pattern() function from the ggmice package.
What information can you glean from the figure?
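
A minimal sketch of the plotting call, leaving all optional arguments at their defaults:

```r
library(mice)
library(ggmice)

# Each row of the figure is one response pattern and each column is a
# variable; the margins report pattern frequencies and missing counts.
p_pat <- plot_pattern(nhanes)
print(p_pat)
```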

3.2

How many observations would be available if we only analyzed complete cases?
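
Besides reading the answer off the pattern plot, you could count the complete cases directly. A base R sketch:

```r
library(mice)

# complete.cases() flags the rows that have no missing values at all.
n_complete <- sum(complete.cases(nhanes))
n_complete
```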

3.3

Which missing data pattern is most frequently observed?

4 Coverage Rates

4.1

Calculate the covariance coverage rates.

You can use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.
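
The sketch below turns those counts into coverage rates by dividing by the sample size. The object name cc is illustrative:

```r
library(mice)

# md.pairs() returns four count matrices; the rr element holds the number
# of cases for which both variables in a pair are observed.
pairs <- md.pairs(nhanes)

# Dividing by the number of rows converts the counts to proportions.
cc <- pairs$rr / nrow(nhanes)
round(cc, 2)
```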

4.2

Calculate the flux statistics for each variable.

You can use the flux() function from the mice package to compute a panel of flux statistics.
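
A sketch of the call; the returned data frame has one row per variable and includes, among other columns, the proportion observed (pobs), influx, and outflux:

```r
library(mice)

# Compute the flux statistics for every variable in the dataset.
fl <- flux(nhanes)
round(fl, 2)
```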

5 Testing the Missingness

5.1

Create a missingness vector for bmi.

This vector should be a logical vector of the same length as the bmi vector. The missingness vector should take the value TRUE for all missing entries in bmi and FALSE for all observed entries in bmi.
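
A minimal sketch, using the illustrative name m_bmi for the missingness vector:

```r
library(mice)

# TRUE where bmi is missing, FALSE where bmi is observed.
m_bmi <- is.na(nhanes$bmi)

# Quick check of how many cases fall into each group.
table(m_bmi)
```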

5.2

Test if missingness on bmi depends on age.

Use the t.test() function and the missingness vector you created above.

What is the estimated t-statistic?
What is the p-value for this test?
What is the conclusion of this test?
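
One way to set up this test is sketched below, under the assumption that age is treated as numeric and that the default Welch (unequal variances) test is acceptable. The names m_bmi and tt are illustrative:

```r
library(mice)

# Missingness indicator for bmi (TRUE = missing).
m_bmi <- is.na(nhanes$bmi)

# Compare mean age between cases with observed and missing bmi.
# t.test() runs the Welch version unless var.equal = TRUE is given.
tt <- t.test(age ~ m_bmi, data = nhanes)

tt$statistic  # estimated t-statistic
tt$p.value    # p-value
```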

5.3

Test if all missingness in nhanes is MCAR.

Use the naniar::mcar_test() function to implement the Little (1988) MCAR Test.

What is the estimated \(\chi^2\)-statistic?
What is the p-value for this test?
What is the conclusion of this test?
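
A sketch of the call; the result is a one-row table whose columns include the test statistic, its degrees of freedom, and the p-value:

```r
library(mice)
library(naniar)

# Little's MCAR test applied to all variables in nhanes.
mcar <- mcar_test(nhanes)

mcar$statistic  # chi-squared statistic
mcar$df         # degrees of freedom
mcar$p.value    # p-value
```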

5.4

Use the naniar::geom_miss_point() function to visualize the distribution of missing values between bmi and chl.

You will first need to set up the figure using ggplot(). You can then apply geom_miss_point() to plot the points.

What conclusions can you draw from this figure, if any?
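
The two-step setup described above might look like the following sketch (the object name p_mp is illustrative):

```r
library(mice)
library(naniar)
library(ggplot2)

# Set up the axes with ggplot(), then add the points.
# geom_miss_point() places cases that are missing on a variable just below
# that variable's observed range so they remain visible in the plot.
p_mp <- ggplot(nhanes, aes(x = bmi, y = chl)) +
  geom_miss_point()
print(p_mp)
```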

6 More Complex Data

Real-world missing data problems are rarely as simple as the situation explored above. Now, we will consider a slightly more complex dataset. For the remaining exercises, you will analyze the Eating Attitudes data from Enders (2010). These data are available as eating_attitudes.rds.

This dataset includes 400 observations of the following 14 variables. Note that the variables are listed in the order in which they appear in the dataset.

id: A numeric ID
eat1:eat24: Seven indicators of a Drive for Thinness construct
eat3:eat21: Three indicators of a Preoccupation with Food construct
bmi: Body mass index
wsb: A single item assessing Western Standards of Beauty
anx: A single item assessing Anxiety Level

You can download the original data here, and you can access the code used to process the data here.

6.1

Read in the eating_attitudes.rds dataset.

NOTE:

In the following, I will refer to these data as the EA data.
Unless otherwise specified, the data analyzed in all following questions are the EA data.
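
Because the file is stored in RDS format, readRDS() is the natural reader. The sketch below assumes the file sits in the current working directory; both the path and the object name ea are assumptions that you should adapt to your own setup.

```r
# Assumed location of the data file; change this to match your setup.
ea_file <- "eating_attitudes.rds"

if (file.exists(ea_file)) {
  # readRDS() restores the single R object stored in the file.
  ea <- readRDS(ea_file)
  str(ea)
} else {
  message("File not found: ", ea_file)
}
```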

6.2

Summarize the EA data to get a sense of their characteristics.

Pay attention to the missing values.

6.3

Calculate the covariance coverage rates.

You can use the md.pairs() function from the mice package to count the number of jointly observed cases for every pair of variables.

6.4

Summarize the coverages from 6.3.

Covariance coverage matrices are often very large and, hence, difficult to parse. It can be useful to distill this information into a few succinct summaries to help extract the useful knowledge.
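
As an illustration of the kind of summary that can help, the sketch below uses nhanes as a stand-in so that it runs without the EA file; the same steps should carry over to the EA data. The object names and the 0.75 cutoff are arbitrary assumptions, not recommended values.

```r
library(mice)

# Coverage matrix for the stand-in data.
cc <- md.pairs(nhanes)$rr / nrow(nhanes)

# Keep each pair once by taking the strict lower triangle.
cc_pairs <- cc[lower.tri(cc)]
range(cc_pairs)
mean(cc_pairs)

# List the pairs whose coverage falls below the illustrative cutoff.
low <- which(cc < 0.75 & lower.tri(cc), arr.ind = TRUE)
data.frame(
  var1     = rownames(cc)[low[, "row"]],
  var2     = colnames(cc)[low[, "col"]],
  coverage = cc[low]
)
```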

6.5

Visualize the covariance coverage rates from 6.3.

As with numeric summaries, visualizations are also a good way to distill meaningful knowledge from the raw information in a covariance coverage matrix.
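
One option is a heat map of the coverage matrix. The sketch below again uses nhanes as a stand-in so that it is self-contained; the object names are illustrative, and this is only one of many reasonable plots.

```r
library(mice)
library(ggplot2)

# Coverage matrix for the stand-in data.
cc <- md.pairs(nhanes)$rr / nrow(nhanes)

# Reshape the matrix to long format: one row per pair of variables.
cc_long <- as.data.frame(as.table(cc), responseName = "coverage")

# Draw one tile per pair, colored by its coverage rate.
p_cc <- ggplot(cc_long, aes(x = Var1, y = Var2, fill = coverage)) +
  geom_tile() +
  scale_fill_gradient(limits = c(0, 1)) +
  labs(x = NULL, y = NULL, fill = "Coverage")
print(p_cc)
```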

6.6

Visualize the missing data patterns.

How many unique response patterns are represented in the EA data?

HINT:

The plot_pattern() function from ggmice will create a nice visualization of the patterns.
The md.pattern() function from mice will create a (somewhat less beautiful) visualization but will return a numeric pattern matrix that you can further analyze.
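
To show how the numeric pattern matrix can be analyzed, the sketch below counts the patterns in nhanes as a stand-in; the same logic should apply to the EA data. In the matrix returned by md.pattern(), each row except the last describes one pattern, the row names give the pattern frequencies, and the last row holds the number of missing values per variable.

```r
library(mice)

# plot = FALSE suppresses the figure and returns only the pattern matrix.
pats <- md.pattern(nhanes, plot = FALSE)

# Drop the final summary row when counting the unique patterns.
n_patterns <- nrow(pats) - 1
n_patterns

# Frequencies of the individual patterns, taken from the row names.
as.integer(rownames(pats)[seq_len(n_patterns)])
```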

End of Lab 1

Lab 1: Missing Data Basics

Missing Data in R

Kyle M. Lang

Updated: 2023-01-31