This section documents how flight diversion data was collected, processed, and prepared for analysis. Understanding the data pipeline is crucial for interpreting the results and understanding the limitations of this analysis.
2 Data Sources
2.1 Primary Source: BTS OTMC-OTP Database
Source: U.S. Bureau of Transportation Statistics Portal: https://www.transtats.bts.gov/ Database: OTMC-OTP (Operations Time Series - On-Time Performance)
The BTS provides monthly CSV files containing detailed flight operation statistics for all U.S. carriers operating passenger services. Each file includes:
Flight identifiers (carrier, flight number, date)
Origin and destination airports
Scheduled, actual, and diverted times
Delay information
Diversion airport information
Coverage: July 2021 - December 2024 (45 months) Records: 25,386,632 total flight records Diversions Identified: 64,815 diverted flights
2.2 Secondary Source: Airport Coordinates
Source: Federal Aviation Administration & OpenFlights Database Portal: https://www.faa.gov/ and https://openflights.org/ Data: Airport geographic coordinates (latitude/longitude)
Geographic data for U.S. airports was obtained to enable spatial analysis and visualization. Each airport is represented by its IATA code (3-letter code) and geographic coordinates.
Coverage: 408 airports across the United States Accuracy: Latitude/longitude to seconds precision (~30 meter accuracy)
3 Data Download Process
3.1 Steps to Reproduce Data Collection
The BTS data can be downloaded manually or via programmatic access:
3.1.1 Manual Download (Current Process)
Visit https://www.transtats.bts.gov/
Navigate to the “Downloads” section
Select “OTMC-OTP” database
Download monthly CSV files (July 2021 - December 2024)
Store in a /data directory
3.1.2 Programmatic Approach (Future Enhancement)
# Example: Automating BTS data downloadimport requestsimport pandas as pd# The BTS API documentation would go here# This is a template for how to automate collection
4 Data Structure
4.1 Key Variables in Raw Data
Column
Description
Type
Notes
FlightDate
Date of flight departure
Date
Original scheduled date
Reporting_Airline
IATA code of reporting airline
String
2-3 letter code
Marketing_Airline_Network
Airline name
String
Full airline name
Origin
Origin airport IATA code
String
3-letter code
Dest
Intended destination IATA code
String
3-letter code
DepTime
Scheduled departure time
Integer
HHMM format
ActualDepTime
Actual departure time
Integer
HHMM format
DepDelay
Departure delay in minutes
Float
Negative = early, Positive = late
ArrDelay
Arrival delay in minutes
Float
Negative = early, Positive = late
Div1Airport
First diversion airport
String
If diverted
Duplicate
Record duplicate flag
String
Y/N
4.2 Example Data Row
FlightDate: 2024-12-18
Marketing_Airline_Network: American Airlines
Origin: LAX
Dest: SAN
Div1Airport: PHX
DepDelay: 42
ArrDelay: 285
5 Data Cleaning Pipeline
5.1 Step 1: File Consolidation
Multiple monthly CSV files were combined into a single dataset:
import pandas as pdimport glob# List all flight data files (2021-07 to 2024-12)file_list =sorted(glob.glob('data/*_OTMC_OTP.csv')) # Verify all files have consistent column namescolumns_set = []forfilein file_list: df = pd.read_csv(file, nrows=0) columns_set.append(set(df.columns))# Combine filesflights_list = [pd.read_csv(f) for f in file_list]flights = pd.concat(flights_list, ignore_index=True)
Result: 45 monthly files successfully consolidated into single DataFrame Total Records: 25,386,632 flight records
5.2 Step 2: Diversion Identification
Not all flights in the dataset represent diversions. A flight is considered “diverted” when:
The flight has a recorded Div1Airport value
The Div1Airport is different from the Dest (intended destination)
Result: 64,815 diverted flights identified Filtering Ratio: 0.26% of all flights were diverted
5.3 Step 3: Missing Data Handling
Empty Columns Removed: - Columns with 100% missing values were dropped - Reduced dimensionality for cleaner analysis
Time Data Cleaning: - FlightDate converted to datetime format - Missing departure times handled (indicates cancelled flights) - Departure delays filled where possible
The estimated actual arrival time was calculated to determine when diversions occurred:
# Calculate actual arrival datetimediverted_flights['EstimatedActualArrival'] = ( diverted_flights['FlightDate'] + pd.to_timedelta(diverted_flights['ArrDelay'], unit='m'))# Extract just the date for clusteringdiverted_flights['ArrivalDate'] = diverted_flights['EstimatedActualArrival'].dt.date
This feature is crucial for the temporal clustering analysis that identifies when diversions cluster together.
6 Data Quality Assessment
6.1 Data Completeness
Total Diversion Records: 64,815
Records with Complete Airport Codes: 99.8%
Records with Complete Coordinates: 99.75% (after geocoding)
Records with Airline Information: 100%
6.2 Data Consistency Checks
✅ Passed: - All airports are valid IATA codes - Delay values are numeric and reasonable - Dates are in valid range (July 2021 - December 2024) - Carrier codes are consistent with FAA registry - No impossible delay values (all within reasonable range for U.S. flights)
⚠️ Notes: - Some records have missing secondary diversion airports (only first diversion tracked by BTS) - A small percentage of flights have missing coordinates (handled by geocoding lookup)
7 Airport Geocoding
To enable geographic analysis, airports were geocoded with latitude/longitude:
7.1 Airport Coordinates Source
Airport coordinates were obtained from the Federal Aviation Administration and OpenFlights Database: - IATA airport codes matched to latitude/longitude coordinates - 408 U.S. airports successfully geocoded - Coordinates precise to seconds (approximately ±30 meters)
Bureau of Transportation Statistics (BTS): https://www.transtats.bts.gov/
FAA Airport Data: https://www.faa.gov/
OpenFlights Database: https://openflights.org/
Data Collection Date: December 2024 Analysis Period: July 2021 - December 2024 Total Records: 64,815 diversions from 25,386,632 total flights Data Quality: 100% completeness for primary analysis fields