R for Data Science Exercises: Missing Values

R for Data Science 2nd Edition Exercises (Wickham, Çetinkaya-Rundel, and Grolemund, 2023)

Missing Values

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(janitor)
library(gt)
library(gtExtras)

Introduction

In data analysis and statistical modelling, the presence of missing values is a common challenge. Missing values can occur for various reasons, such as data entry errors, equipment malfunctions, or respondents choosing not to answer certain survey questions. Properly handling these missing values is crucial to ensure the integrity and accuracy of your analyses.

R, a powerful language for statistical computing and graphics, provides a comprehensive set of tools and functions to detect, manage, and analyse missing data. This introduction will cover the fundamental concepts and techniques for dealing with missing values in R, helping you maintain the robustness of your data workflows.

In R, missing values are represented by the NA (Not Available) symbol. This special value can be found in various data structures, including vectors, data frames, and matrices. It's important to note that NA is distinct from other special values such as NaN (Not a Number), Inf (Infinity), and -Inf (Negative Infinity).

Implicit Missing Values

Values can also be implicitly missing: they are not marked as NA but are still missing in the context of the analysis. The usual case is an observation that is simply absent from the data, such as a row that was never recorded. A placeholder can hide a missing value in the same way; for example, a 0 in a column of ages may mean that the age is unknown or not applicable. Identify such values and handle them before the analysis, because functions that skip NA will not skip them.

# Create a data frame with implicit and explicit missing values
df <- tribble(
  ~person, ~age, ~height,
  "John Doe", 30, 180,
  "Jane Doe", 0, 155,
  "Ben Doe", NA, 160
)

In the example above, the 0 in the age column for "Jane Doe" is an implicit missing value: it does not represent a valid age. The NA in the age column for "Ben Doe" is an explicit missing value because it is marked as NA. You can use dplyr::mutate() with dplyr::case_when() to replace implicit missing values such as this 0 with NA.
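
As a minimal sketch of that replacement, assuming an age of 0 is never valid in this data (the name df_clean is only for illustration):

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~person,    ~age, ~height,
  "John Doe",   30,     180,
  "Jane Doe",    0,     155,
  "Ben Doe",    NA,     160
)

# Assumption: an age of 0 is a placeholder, so turn it into an explicit NA.
# Every other age, including the existing NA, is kept as it is.
df_clean <- df |>
  mutate(age = case_when(age == 0 ~ NA_real_, TRUE ~ age))

df_clean
```

For a single placeholder value like this, dplyr::na_if(age, 0) is a shorter way to get the same result.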

You can tell implicit and explicit missing values apart by examining the data in context and applying domain knowledge. Implicit missing values may need extra cleaning and transformation steps before you can analyse the data. Always remember:

  • An explicit missing value is the presence of an absence: it is marked as NA and is easy to identify.
  • An implicit missing value is the absence of a presence: it may take domain knowledge and context to identify and handle it appropriately.

Pivoting

Pivoting is a common data transformation technique used to reshape data from a long format to a wide format or vice versa. Pivoting can be useful for summarizing and aggregating data, creating visualizations, and preparing data for analysis. In R, you can use the tidyr::pivot_longer() and tidyr::pivot_wider() functions to pivot data frames and tibbles.

Making data wider can make implicit missing values more explicit because every combination of the rows and new columns must have some value. For example, consider the following data frame:

df <- tribble(
  ~person, ~age, ~height,
  "John Doe", 30, 180,
  "Jane Doe", 0, 155,
  "Ben Doe", NA, 160
)

Pivoting this data frame longer puts every combination of person and variable on its own row, so the explicit NA for Ben Doe's age stays visible:

df |>
  pivot_longer(cols = -person, names_to = "variable", values_to = "value")

Pivoting this way makes it easier to spot and handle missing values. By default, making data longer preserves explicit missing values. If they are structurally missing, meaning they were only forced to exist by the shape of the data, you can use the values_drop_na = TRUE argument to drop them (make them implicit).
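
The wider direction described above can be sketched with a small table of quarterly prices (the same stocks data used in the next section), in which the row for 2021 Q1 is simply absent. The names wide and long are only for illustration.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Wider: every year/quarter cell must hold something, so the absent
# 2021 Q1 observation shows up as an explicit NA
wide <- stocks |>
  pivot_wider(id_cols = year, names_from = qtr, values_from = price)

# Longer again, this time dropping the NA cells (making them implicit)
long <- wide |>
  pivot_longer(
    cols = -year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )

wide
long
```

After the round trip, both the originally explicit NA (2020 Q4) and the originally implicit one (2021 Q1) are gone from long, which is why values_drop_na = TRUE is only appropriate when the missing cells carry no information.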

Complete

The tidyr::complete() function ensures that every combination of the specified columns is present in the data frame. It turns implicit missing values into explicit ones by adding a row, filled with NA, for each combination that does not appear in the data. You can also supply a fixed value for the new cells instead of NA, or spell out the full set of values a column should take.

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks |>
  complete(year, qtr)

stocks |>
  complete(year = 2019:2021, qtr)

complete() helps you identify and handle implicit missing values by making them explicit. Once every combination of the variables has a row, whether it holds NA or a fixed value, later steps see the gaps instead of silently skipping them.
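
Filling with a fixed value can be sketched with the fill argument of complete(), which takes a named list. The choice of 0 as the fill value is only an assumption for illustration; whether 0 is a sensible price depends on the data.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Add the absent 2021 Q1 row and use 0 instead of NA for missing prices
filled <- stocks |>
  complete(year, qtr, fill = list(price = 0))

filled
```

With the default settings the fill value also replaces the NA that was already explicit (2020 Q4). Recent versions of tidyr have an explicit argument for limiting the fill to the newly created rows.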

Joins

Joins are a powerful data manipulation technique used to combine data from multiple sources based on common keys or columns. Joins can help you merge data frames, tibbles, or data sets to create a unified data set for analysis. In R, you can use the dplyr::left_join(), dplyr::right_join(), dplyr::inner_join(), and dplyr::full_join() functions to perform different types of joins. Joins also matter for missing values: joining two data sets reveals observations that are present in one but absent from the other.

dplyr::anti_join(x, y) returns all rows from x that have no match in y. This makes it useful for finding implicit missing values and other discrepancies between data sets.

For example, we can use two anti_join()s to reveal that we're missing information for four airports and 722 planes mentioned in flights:

library(nycflights13)

flights |> 
  distinct(faa = dest) |> 
  anti_join(airports) |>
  count()

flights |> 
  distinct(tailnum) |> 
  anti_join(planes) |>
  count()
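
The same pattern can be sketched on toy data that does not need nycflights13. The tables trips and fleet are hypothetical stand-ins for flights and planes.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for flights and planes
trips <- tibble(
  tailnum = c("N1", "N2", "N2", "N3"),
  carrier = c("AA", "AA", "MQ", "MQ")
)
fleet <- tibble(
  tailnum = "N1",
  year    = 2004
)

# Tail numbers that appear in trips but have no row in fleet
missing_planes <- trips |>
  distinct(tailnum) |>
  anti_join(fleet, by = "tailnum")

missing_planes
```

Naming the key with by = "tailnum" avoids the message dplyr prints when it has to guess the join columns.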

Questions

  1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?

Answers

Solution 1:

The carriers MQ and AA have most of their aircraft tail numbers missing from the planes data set (@tbl-q1-ex3). A few other carriers have a small percentage of their tail numbers missing.

#| label: tbl-q1-ex3
#| tbl-cap: "Percentage of tail numbers of each carrier that is missing from planes data set"
#| tbl-cap-location: top

library(nycflights13)
data("flights")
data("planes")

# Create a vector of carriers that have tailnums missing from planes
car_vec <- flights |>
  distinct(tailnum, carrier) |>
  anti_join(planes, by = "tailnum") |>
  distinct(carrier) |>
  pull(carrier)

# Find the total number of distinct tailnums for these carriers
total_tails <- flights |>
  filter(carrier %in% car_vec) |>
  group_by(carrier) |>
  summarize(
    total_aircrafts = n_distinct(tailnum)
  )

# Percentage of each carrier's tailnums that have no match in planes
flights |>
  distinct(tailnum, carrier) |>
  anti_join(planes, by = "tailnum") |>
  count(carrier, name = "missing_tailnums") |>
  full_join(total_tails, by = "carrier") |>
  mutate(percentage_missing = missing_tailnums / total_aircrafts) |>
  arrange(desc(percentage_missing)) |>
  gt() |>
  gt_theme_538() |>
  cols_label_with(fn = ~ janitor::make_clean_names(., case = "title")) |>
  fmt_percent(columns = percentage_missing) |>
  tab_style(
    style = list(cell_text(weight = "bold")),
    locations = cells_body(columns = percentage_missing)
  )

More Missing Value Learnings

Last observation carried forward

Last observation carried forward (LOCF) is a method for imputing missing values in time series data. This technique involves replacing missing values with the most recent non-missing value in the series. LOCF is commonly used in longitudinal studies and clinical trials to handle missing data points.

height_weight <- tribble(
  ~person,    ~height_cm, ~weight_kg,
  "John Doe",        180,         70,
  NA,                200,        100,
  NA,                130,         NA,
  "Jane Doe",        155,         49
)

You can fill in these missing values with tidyr::fill(). Essentially, fill() fills in missing values with the most recent non-missing value in the column. It works like select(), taking a set of columns:

height_weight |>
  fill(everything())
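
fill() carries values downwards by default; its .direction argument changes that. A minimal sketch on a hypothetical table of daily readings:

```r
library(tidyr)
library(tibble)

# Hypothetical readings with a gap on days 2 and 3
readings <- tribble(
  ~day, ~value,
     1,     10,
     2,     NA,
     3,     NA,
     4,      7
)

# Default: last observation carried forward
down <- readings |>
  fill(value)

# Next observation carried backward instead
up <- readings |>
  fill(value, .direction = "up")

down$value  # 10 10 10 7
up$value    # 10 7 7 7
```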

Fixed Values

Fixed values are another common approach to handling missing data. This method involves replacing missing values with a specific constant or predefined value. Fixed values can be used when the missing data points are known to be constant or when the missing values are not expected to change over time.

You can use dplyr::coalesce() to replace missing values with fixed values. coalesce() takes a set of vectors and returns, at each position, the first non-missing value. If every value at a position is missing, it returns NA.

x <- c(2, 7, 12, 17, NA)
coalesce(x, 0)
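
The "first non-missing value" behaviour is easier to see with more than one vector. In this sketch the names primary and backup are hypothetical:

```r
library(dplyr)

primary <- c(1, NA, NA)
backup  <- c(NA, 2, NA)

# Position by position: take primary, else backup, else the fixed value 0
combined <- coalesce(primary, backup, 0)

combined  # 1 2 0
```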

Sometimes you hit the opposite problem: a fixed value such as 99 or -99 actually stands for a missing value. If you know this before the analysis, the simplest fix is to handle it when reading the data, using the na argument of readr::read_csv(), e.g., read_csv("data.csv", na = "99"). If you discover the problem later, you can use dplyr::na_if() to turn the fixed value into NA.

x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
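
Handling the placeholder at read time can be sketched without a file on disk by passing literal CSV text wrapped in I(), which assumes readr 2.0 or later. The column names id and score are made up for the example.

```r
library(readr)

# Literal CSV text standing in for a file such as "data.csv"
raw <- I("id,score\n1,42\n2,99\n3,17\n")

# The na argument replaces the default c("", "NA"), so list those as well
scores <- read_csv(raw, na = c("", "NA", "99"), show_col_types = FALSE)

scores
```

Note that na applies to every column, so a legitimate 99 in any other column would also be read as NA.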

NaN

NaN (Not a Number) is a special floating-point value that represents an undefined or unrepresentable result. In R, NaN is the result of an undefined operation, such as 0/0 or log(-1). NaN is distinct from NA (Not Available), which represents missing values in data structures, but it generally behaves much like NA. In the rare case that you need to tell them apart, use is.nan(); is.na() returns TRUE for both.

x <- c(NA, NaN)
x * 10
x == 1
is.na(x)
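
The difference between the two tests can be sketched in base R (the names na_flags and nan_flags are only for illustration):

```r
x <- c(NA, NaN, 1)

# is.na() treats both NA and NaN as missing
na_flags <- is.na(x)    # TRUE TRUE FALSE

# is.nan() is TRUE only for NaN
nan_flags <- is.nan(x)  # FALSE TRUE FALSE
```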

NaN values are usually encountered when a mathematical operation has an undefined result. Handle them deliberately in your analyses, because they propagate through later calculations just as NA does.

0 / 0 
0 * Inf
Inf - Inf
sqrt(-1)

Factors and Empty Groups

Empty groups are a common issue in data analysis and statistical modelling. Empty groups occur when there are no observations or data points for a particular category or level in a factor variable. Empty groups can lead to errors in statistical analyses, such as missing values in contingency tables or incorrect estimates of group means and variances. It's important to identify and handle empty groups to ensure the accuracy and reliability of your analyses.

For example, consider the following health information data frame with a factor variable (smoker/non-smoker):

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

You can use dplyr::count() to identify empty groups in the factor variable:

health |> 
  count(smoker)

In this example, the dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can ask count() to keep all the groups, even those not seen in the data, by using .drop = FALSE:

health |> 
  count(smoker, .drop = FALSE)

By keeping all the groups, you can identify and handle empty groups more effectively, ensuring the accuracy and reliability of your analyses. Empty groups can be handled by imputing missing values, creating synthetic data, or using statistical techniques to estimate group means and variances. Always remember to check for empty groups in your data and take appropriate steps to handle them.

The same problem arises with dplyr::group_by(). By default, group_by() drops empty groups, but you can keep them by using the .drop = FALSE argument:

health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )

Here mean(age) returns NaN for the empty group because mean(age) = sum(age)/length(age), which here is 0/0. max() and min() return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you'll get the minimum or maximum of the new data[1].
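
The point about min() on an empty vector can be checked with a short base R sketch. The vectors x and y are hypothetical; suppressWarnings() hides the warning that min() gives for an empty input.

```r
x <- numeric(0)     # an empty group
y <- c(34, 88, 75)  # new data arriving later

# min() of an empty vector is Inf (with a warning)
min_x <- suppressWarnings(min(x))

# Combining the per-group minimums gives the same answer as
# taking the minimum of the combined data
recombined <- min(min_x, min(y))
direct     <- min(c(x, y))

recombined == direct  # TRUE
```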

Exploring naniar (Tierney and Cook, 2023) and visdat (Tierney, 2017)

The naniar package provides a suite of functions for visualising and analysing missing values in data sets. You can use naniar to explore the patterns and distributions of missing values, identify missing data mechanisms, and impute missing values using various techniques. The naniar package complements the existing tools in R for handling missing values, providing additional insights and capabilities for data analysis and modelling.

The visdat package is another useful tool for visualising missing values in data sets. You can use visdat to create informative and interactive visualisations of missing values, helping you identify patterns and trends in your data. By visualising missing values, you can gain a deeper understanding of the data quality and integrity, enabling you to make informed decisions about handling missing values in your analyses.

library(naniar)
library(visdat)
library(nycflights13)

Some naniar and visdat Examples

  1. Visualize a data.frame to see what it contains

    planes |> 
      vis_dat() +
      labs(title = "A vis_dat() output to see the contents of a data-frame")
    
    flights |>
      slice_sample(n = 100) |>
      vis_dat() +
      labs(title = "A vis_dat() output to see the contents of a data-frame",
           col = "Type of variable") +
      theme(legend.position = "bottom") +
      scale_fill_brewer(palette = "Accent")
    
  2. Show missingness in particular, with percentages:

    flights |>
      slice_sample(n = 100) |>
      vis_miss() +
      labs(title = "A vis_miss() output to see the missing values",
           col = "Type of variable")
    
  3. Show the proportion of missingness for each variable

    flights |>
      gg_miss_var(show_pct = TRUE) +
      # gg_miss_var() flips its coordinates, so the percentage is the y aesthetic
      labs(title = "Percentage missing values for each variable in flights dataset",
           x = NULL, 
           y = "Percentage missing values (%)")
    
  4. To replace a value, say "99" or "-99", with NA, we use naniar::replace_with_na() or dplyr::na_if(). To replace NA with a given value, we use tidyr::replace_na().

  5. Using vis_expect() to see values that fulfil certain conditions

    flights |>
      select(day, dep_delay, arr_delay, air_time) |>
      slice_sample(n = 200) |>
      vis_expect(~.x >= 10)
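
The two replacements from item 4 can be sketched as a round trip on a plain vector, using dplyr::na_if() and tidyr::replace_na(). The placeholder -99 and the replacement 0 are assumptions for the example.

```r
library(dplyr)
library(tidyr)

x <- c(1, 4, -99, 7)

# Placeholder to NA
with_na <- na_if(x, -99)            # 1 4 NA 7

# NA to a chosen value
restored <- replace_na(with_na, 0)  # 1 4 0 7
```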
    

References

Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” Journal of Open Source Software 2: 355. https://doi.org/10.21105/joss.00355.
Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7). https://doi.org/10.18637/jss.v105.i07.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc.


  1. In other words, min(c(x, y)) is always equal to min(min(x), min(y)).