R for Data Science Exercises: Web Scraping

R for Data Science, 2nd Edition Exercises (Wickham, Çetinkaya-Rundel, and Grolemund, 2023)

Web Scraping

Run the code in your script for the answers! I'm just exploring as I go.

Packages to load

library(tidyverse)
library(rvest)
library(gt)
library(gtExtras)
library(scales)
library(janitor)
library(prismatic)
library(ggrepel)

Introduction

Web scraping involves programmatically extracting data from websites. This covers:

  • Legal and Ethical Considerations: Understanding the legality and ethical implications of scraping data.
  • Tools and Packages: Introduction to httr, rvest, selectr, and xml2 for web scraping.
  • Techniques: Methods for scraping both static and dynamic websites.
  • Ethical Considerations:
    • robots.txt: Check the website's robots.txt file, which indicates the areas of the site that can or cannot be accessed by web crawlers (a quick check using the robotstxt package is sketched after this list).
    • Terms of Service: Read and respect the terms of service of the website.
    • Server Load: Be considerate of the impact your scraping activities may have on the server's performance.
  • Legal Considerations:
    • Copyright Laws: Be aware of the intellectual property rights related to the content you are scraping.
    • Terms of Service Violations: Scraping data in violation of a website's terms of service can lead to legal issues.
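
A robots.txt check can also be scripted. The sketch below assumes the robotstxt package (not loaded above) is installed; the path shown is only an illustrative example:

library(robotstxt)

# fetch and print the site's robots.txt file
get_robotstxt(domain = "fbref.com")

# check whether a specific path may be crawled
paths_allowed(paths = "/en/comps/9/Premier-League-Stats", domain = "fbref.com")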

Tools for Web Scraping

  • httr: A package for performing HTTP requests.
  • rvest: A package designed to simplify web scraping.
  • selectr: Translates CSS selectors into XPath expressions (used internally by rvest).
  • xml2: For parsing XML and HTML documents.

HTTP Requests with httr

HTTP requests are used to interact with web servers to retrieve or send data.

Basic Usage of httr

  • GET Request:

    library(httr)
    
    response <- GET("https://example.com")
    content(response, "text")
    
    • Use GET() to fetch data from a URL.
    • content(response, "text") retrieves the response content as text.
  • POST Request:

    response <- POST("https://example.com/post", body = list(key1 = "value1"))
    content(response, "text")
    
    • Use POST() to send data to a server.
  • Handling Errors:

    response <- GET("https://example.com")
    stop_for_status(response)
    
    • stop_for_status() checks for HTTP errors and stops the function if one is encountered.
  • Query Parameters:

    response <- GET("https://example.com", query = list(param1 = "value1", param2 = "value2"))
    content(response, "text")
    
    • Add query parameters using the query argument.
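
With the server-load point from the introduction in mind, it is also good practice to identify your client and pause between requests. A minimal sketch using httr's user_agent() helper and Sys.sleep(); the contact string is a hypothetical placeholder:

library(httr)

response <- GET(
  "https://example.com",
  user_agent("my-scraper (contact: you@example.com)")
)
Sys.sleep(2)  # wait before the next request so the server isn't overloaded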

Scraping HTML with rvest

rvest simplifies the process of scraping web data using CSS or XPath selectors.

Basic Usage of rvest

  • Reading HTML:

    library(rvest)
    
    page <- read_html("https://example.com")
    
    • read_html() loads the HTML content of a webpage.
  • CSS Selectors:

    title <- page %>% html_node("title") %>% html_text()
    links <- page %>% html_nodes("a") %>% html_attr("href")
    
    • html_node() selects a single HTML element.
    • html_nodes() selects multiple elements.
    • html_text() extracts the text from an element.
    • html_attr() retrieves the value of an attribute.
  • XPath Selectors:

    title <- page %>% html_node(xpath = "//title") %>% html_text()
    
    • Use xpath for selecting elements with XPath syntax.
  • Extracting Tables:

    tables <- page %>% html_table()
    
    • html_table() extracts tables from HTML as data frames.
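
Putting these functions together, a small end-to-end sketch (assuming the page at https://example.com contains links) might look like this:

library(rvest)

page <- read_html("https://example.com")

# collect link text and targets into a tibble (tidyverse is loaded above)
links <- tibble(
  text = page %>% html_nodes("a") %>% html_text(),
  href = page %>% html_nodes("a") %>% html_attr("href")
)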

Advanced Scraping Techniques

For more complex scraping tasks, such as dealing with JavaScript-rendered content or automating interactions, additional tools are necessary.

Dealing with JavaScript

  • RSelenium: A package that provides an R interface to Selenium WebDriver, enabling control of a web browser for scraping dynamic content.
  • chromote: A package that interacts with the Chrome DevTools Protocol.
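
For simpler dynamic pages, recent versions of rvest can render the page through chromote before parsing. A minimal sketch, assuming rvest >= 1.0.4 (which provides read_html_live()) and a local Chrome installation:

library(rvest)

# read_html_live() renders the page in a headless Chrome session via chromote
page <- read_html_live("https://example.com")
page %>% html_elements("p") %>% html_text()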

Automating Browser Actions with RSelenium

  • Starting a Selenium Server and Browser:

    library(RSelenium)
    
    rD <- rsDriver(browser = "chrome")
    remDr <- rD$client
    
    • rsDriver() starts a Selenium server and browser instance.
  • Navigating to a Webpage:

    remDr$navigate("https://example.com")
    
    • navigate() loads a specified URL in the browser.
  • Extracting Content:

    webElem <- remDr$findElement(using = "css selector", "p")
    webElem$getElementText()
    
    • findElement() locates an element using a CSS selector.
    • getElementText() retrieves the text of the element.
  • Closing the Browser:

    remDr$close()
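
    • close() ends the browser session.
  • Stopping the Server:

    rD$server$stop()

    • rD$server$stop() shuts down the Selenium server started by rsDriver() (assuming the rD object created above).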
    

Working with APIs

APIs offer a more structured and reliable way to get data compared to scraping web pages directly.

Interacting with APIs using httr

  • GET Request to an API:

    response <- GET("https://api.example.com/data")
    data <- content(response, "parsed")
    
    • content(response, "parsed") parses the response into R objects (e.g. lists or data frames) based on its content type.
  • POST Request to an API:

    response <- POST("https://api.example.com/submit", body = list(name = "John"))
    data <- content(response, "parsed")
    
    • Send data to an API endpoint using POST().
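
API responses are very often JSON, and a common pattern is to parse the raw text with jsonlite. A minimal sketch, assuming the hypothetical endpoint returns a JSON array of records:

library(httr)
library(jsonlite)

response <- GET("https://api.example.com/data")
stop_for_status(response)

# parse the JSON body into a data frame
data <- fromJSON(content(response, "text"), flatten = TRUE)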

Parsing and Cleaning Data

After extracting data, it's often necessary to clean and preprocess it for analysis.

Handling HTML with xml2

  • Parsing HTML:

    library(xml2)
    
    doc <- read_html("https://example.com")
    nodes <- xml_find_all(doc, "//p")
    texts <- xml_text(nodes)
    
    • read_html() loads HTML content.
    • xml_find_all() selects elements using XPath.
    • xml_text() retrieves the text content of elements.

Cleaning Data

Use data manipulation packages like dplyr and tidyr to clean and preprocess the scraped data.

library(dplyr)
library(tidyr)

# Example data cleaning steps
cleaned_data <- raw_data %>%
  filter(!is.na(value)) %>%
  mutate(new_column = as.numeric(old_column))

Premier League Example

[Figure premxG.png: xG for vs xG against, Premier League teams, 2023-24]

Scraping Data

# url storage
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"

# read_html to scrape the items on url page
full_table <- read_html(url)

# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
prem <- full_table %>%  
  html_nodes("table") %>% 
  .[[1]] %>% 
  html_table(fill=T) 
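
Before plotting, the gt package loaded earlier gives a quick look at the scraped table. A small sketch, assuming prem contains the Squad, xG, and xGA columns used in the plot below:

prem %>%
  select(Squad, xG, xGA) %>%
  head(5) %>%
  gt()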

Visualising Data

pl <- prem %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
  geom_smooth(method = "lm", color = "green", fill = "green") +
  geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))), 
             shape = 21, 
             alpha = .75, 
             size = 3) +
  geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
  theme(
    legend.position = "none",
    plot.background = element_rect(fill = "purple", colour = "purple"),
    panel.background = element_rect(fill = "purple", colour = "purple"),
    panel.grid.major = element_line(colour = "purple"),
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "white"),
    axis.text = element_text(colour = "white"),
    axis.title = element_text(colour = "white"),
    plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
    plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
  labs(title = "xG For vs xG Against of PL Teams",
       subtitle = "2023-2024 Season") +
  scale_y_reverse()

pl

# Saving Plot
# ggsave("premier_league.png", pl, height = 6, width = 6, dpi = 300)

Bundesliga Example

[Figure bunxG.png: xG for vs xG against, Bundesliga teams, 2023-24]

Scraping Data

# url storage
url <- "https://fbref.com/en/comps/20/Bundesliga-Stats"

# read_html to scrape the items on url page
full_table <- read_html(url)

# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
bund <- full_table %>%  
  html_nodes("table") %>% 
  .[[1]] %>% 
  html_table(fill=T) 

Visualising Data

bl <- bund %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
  geom_smooth(method = "lm", color = "green", fill = "green") +
  geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))), 
             shape = 21, 
             alpha = .75, 
             size = 3) +
  geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
  theme(
    legend.position = "none",
    plot.background = element_rect(fill = "purple", colour = "purple"),
    panel.background = element_rect(fill = "purple", colour = "purple"),
    panel.grid.major = element_line(colour = "purple"),
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "white"),
    axis.text = element_text(colour = "white"),
    axis.title = element_text(colour = "white"),
    plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
    plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
  labs(title = "xG For vs xG Against of BL Teams",
       subtitle = "2023-2024 Season") +
  scale_y_reverse()

bl

# Saving Plot
# ggsave("bundesliga.png", pl, height = 6, width = 6, dpi = 300)

Big 5 Leagues Example

[Figure big5xG.png: xG for vs xG against, Big 5 European league teams, 2023-24]

Scraping Data

# url storage
url <- "https://fbref.com/en/comps/Big5/Big-5-European-Leagues-Stats"

# read_html to scrape the items on url page
full_table <- read_html(url)

# html_nodes to pull all nodes under the "table" label
# the number [1] tells which table to pull from the list of tables
# html_table converts it to table format
big5 <- full_table %>%  
  html_nodes("table") %>% 
  .[[1]] %>% 
  html_table(fill=T) 

Visualising Data

b5 <- big5 %>% ggplot(aes(x = xG, y = xGA, label = Squad)) +
  geom_smooth(method = "lm", color = "green", fill = "green") +
  geom_point(aes(fill = "green", color = after_scale(clr_darken(fill, 0.3))), 
             shape = 21, 
             alpha = .75, 
             size = 3) +
  geom_text_repel(size = 2.5, color = "white", min.segment.length = unit(0.1, "lines")) +
  theme(
    legend.position = "none",
    plot.background = element_rect(fill = "purple", colour = "purple"),
    panel.background = element_rect(fill = "purple", colour = "purple"),
    panel.grid.major = element_line(colour = "purple"),
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "white"),
    axis.text = element_text(colour = "white"),
    axis.title = element_text(colour = "white"),
    plot.title = element_text(colour = "white", hjust=.5, face="bold", size = 15),
    plot.subtitle = element_text(colour = "white", hjust=.5, face="bold", size = 8)) +
  labs(title = "xG For vs xG Against of Big 5 League Teams",
       subtitle = "2023-2024 Season") +
  scale_y_reverse()

b5

# Saving Plot
# ggsave("big5.png", pl, height = 6, width = 6, dpi = 300)

References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media.