DATA 607 Week Five: Working with Tidy Data

💡

Editor's note: I typed the title into the website editor tool I have but the rest is HTML export from R Studio. When I have some spare time, I'll figure out how to get it all to play nice so that it fits the theme of my website overall. For now, the HTML is as exported from R Studio, same some tweaks to section headers.

Supporting Links

Other data science masters work on this site

Github repository of all my masters degree work

DATA 607 Week Five: Working with Tidy Data

Week Five Homework Overview

Airline Delays Chart

The image above was provided by the assignment and is basis for the work below. It describes arrival delays for two airlines across five destinations.

For this assignment, I need to:

Create a .CSV file based on the image above
- I was “encouraged to use a ‘wide’ structure similar to how the information appears above so that [I] can practice tidying and transformations…”
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines

Import and Tidy Data

To tidy the data, I had to spin my tires for a bit to find the right combination of tidyverse code to unlock the format I needed. It took me a long minute to realize that “col” needed to equal the entire comma separated header as a string. Once I figured that out, I was able to break it from the comma separated values in single cells into a traditional column/row setup.

The data initially came in as non-numeric, which I learned when I went to go do the summary that’s next and I got an error saying the values were characters. I returned back and turned the city columns of number of flights into numbers.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

delays_csv <- "https://storage.googleapis.com/data_science_masters_files/2024_fall/data_607_data_management/week_five_files/flight_delays_redux.csv"
delays_csv_raw <- read_csv(delays_csv)

## Rows: 4 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Airline,Status,Los Angeles,Phoenix,San Diego,San Francisco,Seattle
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

delays_split <- delays_csv_raw %>%
  separate(col = 'Airline,Status,Los Angeles,Phoenix,San Diego,San Francisco,Seattle', into = c('Airline','Status','Los Angeles','Phoenix','San Diego','San Francisco','Seattle'), sep = ",")

delays_split <- delays_split %>%
  mutate(across(c('Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'), as.numeric))

Analysis

For the final analysis piece, I first moved the data around a bit so I had the different cities as values in a column instead of columns on their own. It’s a lot easier to group different values from the same column together to do analysis than to have a setup where some key data is diffferent row values and others are different column values. In my experience, it’s best to keep like with like.

delays_city produces a basic table that shows on-time percent by airline, status, and city.

delays_analysis <- delays_split %>%
  pivot_longer(cols = c('Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'),
               names_to = "city", values_to = "flights")

delays_city <- delays_analysis %>%
  group_by(Airline, city) %>%
  mutate(total_flights = sum(flights)) %>%
  ungroup() %>%
  mutate(ontime_percent = (flights / total_flights) * 100) %>%
  select(Airline, city, Status, ontime_percent)

print(delays_city)

## # A tibble: 20 × 4
##    Airline city          Status  ontime_percent
##    <chr>   <chr>         <chr>            <dbl>
##  1 ALASKA  Los Angeles   on time          88.9 
##  2 ALASKA  Phoenix       on time          94.8 
##  3 ALASKA  San Diego     on time          91.4 
##  4 ALASKA  San Francisco on time          83.1 
##  5 ALASKA  Seattle       on time          85.8 
##  6 ALASKA  Los Angeles   delayed          11.1 
##  7 ALASKA  Phoenix       delayed           5.15
##  8 ALASKA  San Diego     delayed           8.62
##  9 ALASKA  San Francisco delayed          16.9 
## 10 ALASKA  Seattle       delayed          14.2 
## 11 AM WEST Los Angeles   on time          85.6 
## 12 AM WEST Phoenix       on time          92.1 
## 13 AM WEST San Diego     on time          85.5 
## 14 AM WEST San Francisco on time          71.3 
## 15 AM WEST Seattle       on time          76.7 
## 16 AM WEST Los Angeles   delayed          14.4 
## 17 AM WEST Phoenix       delayed           7.90
## 18 AM WEST San Diego     delayed          14.5 
## 19 AM WEST San Francisco delayed          28.7 
## 20 AM WEST Seattle       delayed          23.3

This chart breaks out flight status by city and airline. I like to use stacked bar charts for these types of percent comparisons because in early EDA, which is what this is, I’m looking for patterns that stand out right away or don’t quite make sense. Scanning across this, the question that immediately came to mind was: why does AM West struggle more in San Francisco than Alaska?

status_colors <- c("on time" = "#1f78b4", "delayed" = "#33a02c")

ggplot(delays_city, aes(x = city, y = ontime_percent, fill = Status)) +
  geom_bar(stat = "identity", position = "stack") + 
  facet_grid(~ Airline, scales = "free_x", space = "free_x") +  # Group by airline
  scale_fill_manual(values = status_colors) +  # Use status-specific colors
  labs(title = "Percent On-time and Delayed Flights by City and Airline",
       x = "City", y = "Percent") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))