Learning R Featured

R for Data Science Exercises: Exploratory Data Analysis

The goal during EDA is to develop an understanding of your data using questions as tools to guide your exploration: 1. What type of variation occurs within my variables?; 2. What type of covariation occurs between my variables?

Lorcán Mason

Jul 3, 2024 • 18 min read

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Exploratory Data Analysis

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(nycflights13)
library(ggbeeswarm)
library(lvplot)
library(ggstance)

Introduction

Exploratory Data Analysis (EDA) is an repetitive cycle where you:

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

The goal during EDA is to develop an understanding of your data using questions as tools to guide your exploration. Two types of questions will always be useful for making discoveries within your data:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

By considering these questions, the objectives of EDA can be funnelled down to the following:

Explore the variation within the variables of your observations.
Deal with outliers and missing values in your data.
Explore the covariation between the variables of your observations.
Recognise how models can be used to explore patterns in your data.

Vocabulary

Variable: a quantity, quality, or property that you can measure.
Value: the state of a variable when you measure it. This can change.
Observation: a set of measurements made under similar conditions. One value per variable.
Tabular Data: observations of variables.
Tidy Data: 1 observation per row, 1 variable per column, 1 value per cell. Definition of "tidy" for a dataset can depend on the questions you're trying to answer.

Variation

variation: the tendency of values of a variable to change between measurements.
categorical variable: can only take certain values. Visualize variation with bar chart.
continuous variables: can take on infinite set of ordered values. Visualize variation with histogram.

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Error is inherent in any data collection process. Variables can vary if you measure across different subjects or at different times. Every variable has its own pattern of variation, which gives interesting information about how much it varies between measurements in the same observation as well as across observations.

Typical values

Typical values are the most common values of a variable. Think of a bell-curve - the higher the curve the more common a value is; the lower the curve the less-common the values are. The areas not under the curve show values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

A good starting point for EDA are Distributions. Distributions can answer some of these questions within the data, while some will require domain expertise. Many of these questions will prompt you to explore a relationship between variables, to see if the values of one variable can explain the behaviour of another variable.

Unusual values

Outliers are observations that are unusual; data points that don't seem to fit the general pattern of the data which are difficult to observe within distributions. Outliers can sometimes be data entry errors, sometimes simply values at the extremes, and other times they suggest important new discoveries.

If you encounter unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options:

Drop the entire row with the strange values
Replace the unusual values with missing values (NA).

The former is not recommended! A single invalid value doesn't imply that all other values for that observation are also invalid. Additionally, if you have low quality data, applying this approach to every variable might leave you without any data!

You may also want to make distinctions between outliers to discover a pattern, whereby the outlier or missing data actually gives important information.

Questions

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?
What does na.rm = TRUE do in mean() and sum()?
Using the nycflights13::flights^[1] recreate the frequency plot of scheduled_dep_time coloured by whether the flight was cancelled or not.

Answers

Solution 1:

Calculate summary statistics for these variables and plot their distributions.

summary(select(diamonds, x, y, z))

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) # x

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) # y

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) # z

There's a few noticeable features of the distributions:

x and y are larger than z,
There are outliers,
They are all right skewed

There are two types of outliers in this data. Some diamonds have values of zero and some have abnormally large values of x, y, or z.

summary(select(diamonds, x, y, z))

These appear to be either data entry errors, or an undocumented convention in the dataset for indicating missing values. Alternatively it could be that values of zero are the result of rounding values minuscule down, but since there are no diamonds with values of 0.01, that does not seem to be the case.

filter(diamonds, x == 0 | y == 0 | z == 0)

There are also some diamonds with values of y and z that are abnormally large. There are diamonds with y == 58.9 and y == 31.8, and one with z == 31.8. These are probably data entry errors as they don’t seem in line with the values of the other variables.

diamonds %>%
  arrange(desc(y)) %>%
  head() # y
diamonds %>%
  arrange(desc(z)) %>%
  head() # z

According to the diamonds documentation, x is length, y is width, and z is depth in mm. If documentation were unavailable, you would compare the values of the variables to match them to the length, width, and depth. A quick google search for the definitions of length, width, and depth with respect to diamond cuts will suffice. Depth can be expressed as a percentage of the length/width of the diamond, which means it should be less than both the length and the width.

summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))

It appears that depth (z) is always smaller than length (x) or width (y). Length is greater than width in less than half the observations.

Solution 2:

The price data has many spikes, but it is difficult to tell what each spike corresponds to. The following plots don't show much difference in the distributions in the last one or two digits.
There are no diamonds with a price of $1,500 (between $1,455 and $1,545, inclusive).
There's a swell in the distribution around $750.

ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  geom_histogram(binwidth = 10, center = 0)

ggplot(filter(diamonds), aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)

The last digits of prices are often not uniformly distributed. They are often round, ending in 0 or 5. Another common pattern is in values ending in 99.

diamonds %>%
  mutate(ending = price %% 10) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1, center = 0)

diamonds %>%
  mutate(ending = price %% 100) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

diamonds %>%
  mutate(ending = price %% 1000) %>%
  filter(ending >= 500, ending <= 800) %>%
  ggplot(aes(x = ending)) +
  geom_histogram(binwidth = 1)

Solution 3:

There are more than 70 times as many 1 carat diamonds as 0.99 carat diamond.

diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  count(carat)

Presumably there is a premium for a 1 carat diamond vs. a 0.99 carat diamond beyond the expected increase in price due to a 0.01 carat increase.^[2]

To check this, look at the number of diamonds in each carat range to see if there is an unusually low number of 0.99 carat diamonds, and an abnormally large number of 1 carat diamonds.

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)

Solution 4:

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms.
The xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000)) # coord_cartesian

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) + # xlim
  ylim(0, 3000) # ylim

Solution 5: Missing values are removed when the number of observations in each bin are calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()

In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category.

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.

Solution 6:

This option removes NA values from the vector prior to calculating the mean and sum.

mean(c(0, 1, 2, NA), na.rm = TRUE)
sum(c(0, 1, 2, NA), na.rm = TRUE)

Solution 7:

Frequency Plot

nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(aes(x = sched_dep_time)) + 
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)

Box Plot

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))

Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.

Questions

Based on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Instead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of "outlying values". One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?
Create a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?
If you have a small dataset, it's sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

Answers

Solution 1:

For investigating the general relationships of each variable with the price of the diamonds, first consider the variables: carat, clarity, color, and cut. Ignore the dimensions of the diamond since carat measures size, and thus incorporates most of the information contained in these variables.

Since both price and carat are continuous variables, use a scatter plot to visualise their relationship.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

However, since there is a large number of points in the data, use a boxplot by binning carat, as suggested in the chapter:

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), orientation = "x")

Note that the choice of the binning width is important, as if it were too large it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying patterns.

The variables color and clarity are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. There is a weak negative relationship between color and price. The scale of diamond color goes from D (best) to J (worst). Before plotting, reverse the order of the color levels so they will be in increasing order of quality on the x-axis.

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

There is also weak negative relationship between clarity and price. The scale of clarity goes from I1 (worst) to IF (best).

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

For both clarity and color, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices. Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of "Fair" (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.

Solution 2:

First, recreate this horizontal box plot of the distribution hwy by class, using geom_boxplot() and coord_flip():

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

In this case the output looks the same, but x and y aesthetics are flipped.

library("ggstance")

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))

In ggplot2 (since version 3.3.0) all geoms can choose the direction. The direction is inferred from the aesthetic mapping. In this case, switching x and y produces a horizontal boxplot.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))

The orientation argument is used to explicitly specify the axis orientation of the plot.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy), orientation = "y")

Solution 3:
Like box-plots, the boxes of the letter-value plot correspond to quantiles. However, they incorporate
far more quantiles than box-plots. They are useful for larger datasets because,

larger datasets can give precise estimates of quantiles beyond the quartiles
larger datasets should have more outliers (in absolute numbers).

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()

Solution 4:

The geom_freqpoly() is better for look-up: meaning that given a price, it is easy to tell which cut has the highest density. However, the overlapping lines makes it difficult to distinguish how the overall distributions relate to each other. The geom_violin() and faceted geom_histogram() have similar strengths and weaknesses. It is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc). However, since we can't easily compare the vertical values of the distribution, it is difficult to look up which category has the highest density for a given price. All of these methods depend on tuning parameters to determine the level of smoothness of the distribution.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500) # geom_freqpoly

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram() + # geom_histogram
  facet_wrap(~cut, ncol = 1, scales = "free_y")

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() + # geom_violin
  coord_flip()

Solution 5:

There are two methods:

geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points.

Use the mpg box plot example since these methods display individual points, they are better suited for smaller datasets.

ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "frowney"
  )

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

Two Categorical Variables

Questions

How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?
What different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic?
Calculate the counts that fall into each of the segments.
Use geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year.
What makes the plot difficult to read?
How could you improve it?

Answers

Solution 1:

To clearly show the distribution of cut within color, calculate a new variable prop which is the proportion of each cut within a color. This is done using a grouped mutate.


diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop))

Similarly, to scale by the distribution of color within cut,

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop))

Add limit = c(0, 1) to put the color scale between (0, 1). These are the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it ends up limiting the colors and makes it harder to compare within the dataset. However, using the default limits of the minimum and maximum values makes it easier to compare within the dataset the emphasizing relative differences, but harder to compare across datasets.

Solution 2:

The scale of diamond color goes from D (best) to J (worst). Colourless diamonds are considered better than the ones with yellowish-brownish tint. This dataset contains diamonds of 7 colours- “D” to “J”. D, E, F are colourless & G-J have a very faint colour. The code below calculates the count of each cut within each color.

diamonds %>%
  count(color, cut) %>%
  group_by(color)

To show this visually, execute the below code. This will show the proportion of each cut in each color. position="dodge" creates the segmented bars to display the count of each cut in each color category.

diamonds %>%
  ggplot(mapping = aes(fill=cut, x=color)) +
  geom_bar(position="dodge")

Solution 3:

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")

There are several things that could be done to improve it,

sort destinations by a meaningful quantity (distance, number of flights, average delay)
remove missing values

How to treat missing values is difficult. In this case, missing values correspond to airports which don't have regular flights (at least one flight each month) from NYC. These are likely smaller airports (with higher variance in their average due to fewer observations). When we group all pairs of (month, dest) again by destination, we should have a total count of 12 (one for each month) per group (dest). This makes it easy to filter.

flights %>%
  group_by(month, dest) %>%                                 # This gives (month, dest) pairs
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%                                        # group all (month, dest) pairs by destination ..
  filter(n() == 12) %>%                                     # and only select those that have one entry per month 
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")

Two Numerical Variables

Questions

Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon.
What do you need to consider when using cut_width() vs. cut_number()?
How does that impact a visualization of the 2d distribution of carat and price?
Visualize the distribution of carat, partitioned by price.
How does the price distribution of very large diamonds compare to small diamonds?
Is it as you expect, or does it surprise you?
Combine two of the techniques you've learned to visualize the combined distribution of cut, carat, and price.
Two dimensional plots reveal outliers that are not visible in one dimensional plots.
For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
Why is a scatterplot a better display than a binned plot for this case?
```
#| eval: false
diamonds |> 
  filter(x >= 4) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
Instead of creating boxes of equal width with cut_width(), we could create boxes that contain roughly equal number of points with cut_number().
What are the advantages and disadvantages of this approach?
```
#| eval: false
ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_number(carat, 20)))
```

Answers

Solution 1:

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, choose the bin widths and number large enough to aggregate observations and remove noise, but not so large as to remove all the signal/data.

If categorical colors are used, no more than eight colors should be used in order to keep them distinct. Using cut_number, split carats into quantiles (five groups).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

Alternatively, you can use cut_width to specify widths at which to cut. Choose 1-carat widths. Since there are very few diamonds larger than 2-carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.

ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

Solution 2:

Plotted with a box plot with 10 bins with an equal number of observations, and the width determined by the number of observations.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

Plotted with a box plot with 10 equal-width bins of $2,000. The argument boundary = 0 ensures that first bin is $0--$2,000.

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")

Solution 3:

The distribution of very large diamonds is more variable. This may be due to the way in which diamonds are selected for retail sales. For example, someone selling a diamond only finds it profitable to sell it if some combination size, cut, clarity, and color are above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional in all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. However, larger diamonds may be profitable regardless of the values of the other factors. Thus the data will show large diamonds with a wider variety of cut, clarity, and color and thus more variability in prices.

Solution 4:

There are many options to try, so solutions may vary. Here are a few options that work.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1) # 1 Facets by cut

ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot() # 2 Price of cuts using boxplots

ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot() # 3 Price of carats using boxplots

Solution 5:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

In this case, there is a strong relationship between x and y. The outliers in this case are not extreme in either x or y. A binned plot would not reveal these outliers, and may lead us to conclude that the largest value of x was an outlier even though it appears to fit the bivariate pattern well.

Solution 6:

cut_width(x, width), as used above in Solution 1, divides x (the dataset) into bins of width width.
By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summaries a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.

Patterns and Models

If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?

Patterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second. Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.

Reference

Wickham, H., Mine Çetinkaya-Rundel and Grolemund, G. (2023) R for data science. 2nd ed. Sebastopol, CA: O’Reilly Media.

Remember that when we need to be explicit about where a function (or dataset) comes from, we'll use the special form package::function() or package::dataset. ↩︎
Prices for diamonds rise at key weights, such as 0.8, 0.9, and 1 carat. See https://www.hpdiamonds.com/en-us/extra/136/education-undercarat.htm. ↩︎

R4DS Exploratory Data Analysis

Quarto Document

R4DS Exploratory Data Analysis.qmd

31 KB

R for Data Science 2nd Edition Exercises (Wickham, Mine Çetinkaya-Rundel and Grolemund, 2023)

Exploratory Data Analysis

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

Introduction

Vocabulary

Variation

Typical values

Unusual values

Questions

Answers

Covariation

Questions

Answers

Two Categorical Variables

Questions

Answers

Two Numerical Variables

Questions

Answers

Patterns and Models

Reference

Sign up for FREE to access the downloadable map!