Data Collection & Preparation

1 Overview

This section documents how flight diversion data was collected, processed, and prepared for analysis. Understanding the data pipeline is essential for interpreting the results and for recognizing the limitations of this analysis.


2 Data Sources

2.1 Primary Source: BTS OTMC-OTP Database

Source: U.S. Bureau of Transportation Statistics
Portal: https://www.transtats.bts.gov/
Database: OTMC-OTP (Operations Time Series - On-Time Performance)

The BTS provides monthly CSV files containing detailed flight operation statistics for all U.S. carriers operating passenger services. Each file includes:

  • Flight identifiers (carrier, flight number, date)
  • Origin and destination airports
  • Scheduled, actual, and diverted times
  • Delay information
  • Diversion airport information

Coverage: July 2021 - December 2024 (42 months)
Records: 25,386,632 total flight records
Diversions Identified: 64,815 diverted flights

2.2 Secondary Source: Airport Coordinates

Source: Federal Aviation Administration & OpenFlights Database
Portals: https://www.faa.gov/ and https://openflights.org/
Data: Airport geographic coordinates (latitude/longitude)

Geographic data for U.S. airports was obtained to enable spatial analysis and visualization. Each airport is represented by its IATA code (3-letter code) and geographic coordinates.

Coverage: 408 airports across the United States
Accuracy: Latitude/longitude to seconds precision (~30 meter accuracy)
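The stated ~30 meter figure is easy to sanity-check: one arcsecond of latitude spans roughly 31 m of ground distance. A back-of-the-envelope calculation using the mean Earth radius:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

# Ground distance spanned by one arcsecond of latitude:
# full circumference, divided by 360 degrees, divided by 3600 arcseconds
meters_per_arcsecond = 2 * math.pi * EARTH_RADIUS_M / 360 / 3600
print(f"{meters_per_arcsecond:.1f} m per arcsecond")
```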


3 Data Download Process

3.1 Steps to Reproduce Data Collection

The BTS data can be downloaded manually or via programmatic access:

3.1.1 Manual Download (Current Process)

  1. Visit https://www.transtats.bts.gov/
  2. Navigate to the “Downloads” section
  3. Select “OTMC-OTP” database
  4. Download monthly CSV files (July 2021 - December 2024)
  5. Store in a /data directory
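Once the files are in place, the download can be sanity-checked before cleaning begins. A minimal sketch: the filename pattern matches the one used by the consolidation step, and the expected count of 42 assumes one file per month of the July 2021 - December 2024 window.

```python
from pathlib import Path

def check_downloads(data_dir: str, expected: int = 42) -> list:
    """Return the sorted monthly CSV filenames, flagging a short download."""
    files = sorted(Path(data_dir).glob('*_OTMC_OTP.csv'))
    if len(files) != expected:
        print(f"Expected {expected} files, found {len(files)}")
    return [f.name for f in files]

check_downloads('data')
```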

3.1.2 Programmatic Approach (Future Enhancement)

# Example: automating BTS data download (template only -- the
# BTS download endpoint and its parameters are not filled in here
# and must be taken from the BTS site before this can run)
import requests

# Sketch: loop over each month in the study window, fetch the
# monthly CSV, and save it under data/ for the cleaning step
for year in range(2021, 2025):
    for month in range(1, 13):
        pass  # requests.get(...) once the endpoint is known

4 Data Structure

4.1 Key Variables in Raw Data

Column                     Description                       Type     Notes
FlightDate                 Date of flight departure          Date     Original scheduled date
Reporting_Airline          IATA code of reporting airline    String   2-3 letter code
Marketing_Airline_Network  Airline name                      String   Full airline name
Origin                     Origin airport IATA code          String   3-letter code
Dest                       Intended destination IATA code    String   3-letter code
DepTime                    Scheduled departure time          Integer  HHMM format
ActualDepTime              Actual departure time             Integer  HHMM format
DepDelay                   Departure delay in minutes        Float    Negative = early, positive = late
ArrDelay                   Arrival delay in minutes          Float    Negative = early, positive = late
Div1Airport                First diversion airport           String   Populated only if diverted
Duplicate                  Duplicate-record flag             String   Y/N
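Since the time fields arrive as HHMM integers rather than real times, they must be parsed before any time arithmetic. A small helper sketch (the handling of 2400 as midnight is an assumption about how the encoding wraps):

```python
import pandas as pd

def hhmm_to_timedelta(series: pd.Series) -> pd.Series:
    """Convert HHMM integers (e.g. 1435) to offsets from midnight."""
    hhmm = pd.to_numeric(series, errors='coerce').replace(2400, 0)
    return (pd.to_timedelta(hhmm // 100, unit='h')
            + pd.to_timedelta(hhmm % 100, unit='m'))

print(hhmm_to_timedelta(pd.Series([835, 1435, 2359])))
```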

4.2 Example Data Row

FlightDate: 2024-12-18
Marketing_Airline_Network: American Airlines
Origin: LAX
Dest: SAN
Div1Airport: PHX
DepDelay: 42
ArrDelay: 285

5 Data Cleaning Pipeline

5.1 Step 1: File Consolidation

Multiple monthly CSV files were combined into a single dataset:

import pandas as pd
import glob

# List all monthly flight data files (2021-07 to 2024-12)
file_list = sorted(glob.glob('data/*_OTMC_OTP.csv'))

# Verify all files share the same column names before concatenating
columns_set = []
for file in file_list:
    df = pd.read_csv(file, nrows=0)  # read the header row only
    columns_set.append(set(df.columns))
assert all(cols == columns_set[0] for cols in columns_set), \
    "Column mismatch across monthly files"

# Combine files into a single DataFrame
flights_list = [pd.read_csv(f) for f in file_list]
flights = pd.concat(flights_list, ignore_index=True)

Result: 42 monthly files successfully consolidated into a single DataFrame
Total Records: 25,386,632 flight records


5.2 Step 2: Diversion Identification

Not all flights in the dataset represent diversions. A flight is considered “diverted” when:

  1. The flight has a recorded Div1Airport value
  2. The Div1Airport is different from the Dest (intended destination)
  3. The flight was not marked as a duplicate

# Filter diverted flights only
diverted_flights = flights[
    (flights['Div1Airport'].notna()) & 
    (flights['Div1Airport'] != flights['Dest'])
].copy()

# Remove duplicate-flagged records
diverted_flights = diverted_flights[diverted_flights['Duplicate'] != 'Y']

print(f"Total diversions identified: {len(diverted_flights)}")

Result: 64,815 diverted flights identified
Filtering Ratio: 0.26% of all flights were diverted
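The 0.26% figure follows directly from the two counts reported in this section:

```python
total_flights = 25_386_632  # all flight records in the consolidated dataset
diversions = 64_815         # records passing the diversion filter

# Share of all flights that were diverted
print(f"Diversion rate: {diversions / total_flights:.2%}")
```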


5.3 Step 3: Missing Data Handling

Empty Columns Removed:
  • Columns with 100% missing values were dropped
  • Reduced dimensionality for cleaner analysis

Time Data Cleaning:
  • FlightDate converted to datetime format
  • Missing departure times handled (these indicate cancelled flights)
  • Departure delays filled where possible

# Drop empty columns
empty_columns = diverted_flights.columns[diverted_flights.isna().all()]
diverted_flights = diverted_flights.drop(columns=empty_columns)

# Handle time data
diverted_flights['FlightDate'] = pd.to_datetime(
    diverted_flights['FlightDate'], 
    errors='coerce'
)

# Clean delay data
diverted_flights['DepDelay'] = pd.to_numeric(
    diverted_flights['DepDelay'], 
    errors='coerce'
)
diverted_flights['ArrDelay'] = pd.to_numeric(
    diverted_flights['ArrDelay'],
    errors='coerce'
)

5.4 Step 4: Feature Engineering

Estimated Actual Arrival DateTime

The estimated actual arrival time was calculated to determine when diversions occurred:

# Calculate actual arrival datetime
diverted_flights['EstimatedActualArrival'] = (
    diverted_flights['FlightDate'] +
    pd.to_timedelta(diverted_flights['ArrDelay'], unit='m')
)

# Extract just the date for clustering
diverted_flights['ArrivalDate'] = diverted_flights['EstimatedActualArrival'].dt.date

This feature is crucial for the temporal clustering analysis that identifies when diversions cluster together.
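As a sketch of how this feature feeds the clustering step, diversions can be bucketed into 12-hour windows (the window size comes from the clustering approach described later; the arrival values below are toy data standing in for EstimatedActualArrival):

```python
import pandas as pd

# Toy stand-in for diverted_flights['EstimatedActualArrival']
arrivals = pd.Series(pd.to_datetime([
    '2024-12-18 03:00', '2024-12-18 05:30',
    '2024-12-18 21:00', '2024-12-19 02:15',
]))

# Count diversions per 12-hour window anchored at midnight
window_counts = arrivals.dt.floor('12h').value_counts().sort_index()
print(window_counts)
```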


6 Data Quality Assessment

6.1 Data Completeness

  • Total Diversion Records: 64,815
  • Records with Complete Airport Codes: 99.8%
  • Records with Complete Coordinates: 99.75% (after geocoding)
  • Records with Airline Information: 100%

6.2 Data Consistency Checks

Passed:
  • All airports are valid IATA codes
  • Delay values are numeric and within a plausible range for U.S. flights
  • Dates fall within the study window (July 2021 - December 2024)
  • Carrier codes are consistent with the FAA registry

⚠️ Notes:
  • Some records have missing secondary diversion airports (BTS tracks only the first diversion)
  • A small percentage of flights have missing coordinates (handled by geocoding lookup)
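Checks of this kind can be scripted rather than done by hand. An illustrative sketch: the column names match the data dictionary above, the delay bounds are assumptions, and the sample rows are toy values.

```python
import pandas as pd

def run_consistency_checks(df: pd.DataFrame) -> dict:
    """Return a pass/fail flag for each consistency rule."""
    return {
        'valid_iata_codes': df['Dest'].str.fullmatch(r'[A-Z0-9]{3}').all(),
        'delays_numeric': pd.api.types.is_numeric_dtype(df['ArrDelay']),
        'dates_in_range': df['FlightDate'].between('2021-07-01', '2024-12-31').all(),
        'delays_plausible': df['ArrDelay'].between(-120, 2880).all(),  # assumed bounds
    }

sample = pd.DataFrame({
    'Dest': ['SAN', 'PHX'],
    'ArrDelay': [285.0, -10.0],
    'FlightDate': pd.to_datetime(['2024-12-18', '2022-03-02']),
})
print(run_consistency_checks(sample))
```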


7 Airport Geocoding

To enable geographic analysis, airports were geocoded with latitude/longitude:

7.1 Airport Coordinates Source

Airport coordinates were obtained from the Federal Aviation Administration and the OpenFlights database:
  • IATA airport codes matched to latitude/longitude coordinates
  • 408 U.S. airports successfully geocoded
  • Coordinates precise to seconds (approximately ±30 meters)

# Example geocoding
airports_data = pd.read_csv('airport_coordinates.csv')
airports_clean = airports_data.set_index('AIRPORT')[['LATITUDE', 'LONGITUDE']]

def get_coords(airport_code, airport_df):
    if airport_code in airport_df.index:
        coords = airport_df.loc[airport_code]
        return (coords['LATITUDE'], coords['LONGITUDE'])
    return (None, None)

# Apply to flight data
diverted_flights['origin_lat'] = diverted_flights['Origin'].apply(
    lambda x: get_coords(x, airports_clean)[0]
)
diverted_flights['origin_lon'] = diverted_flights['Origin'].apply(
    lambda x: get_coords(x, airports_clean)[1]
)

Geocoding Success Rate: 99.75%
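The success rate reported above is simply the share of records that received a coordinate. Computed as a sketch on toy values:

```python
import pandas as pd

# Toy stand-in for diverted_flights after geocoding; None marks a failed lookup
df = pd.DataFrame({'origin_lat': [33.94, None, 40.64, 33.43]})

# Fraction of records with a resolved coordinate
success_rate = df['origin_lat'].notna().mean()
print(f"Geocoding success rate: {success_rate:.2%}")
```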


8 Data Limitations

8.1 Known Limitations

  1. U.S. Domestic Only: Analysis limited to U.S. domestic flights; international diversions excluded
  2. Carrier Reporting: Data depends on airline reporting accuracy; some differences in reporting practices between carriers
  3. Weather Data Not Included: The current dataset does not include concurrent weather conditions, limiting causal analysis
  4. Secondary Diversions: Only first diversion recorded; subsequent diversions not tracked, potentially underestimating system impact
  5. Seasonality: Dataset spans multiple years but seasonal patterns may be confounded with operational changes over time
  6. Diversion Reason Unknown: BTS data does not indicate reason for diversion (weather, mechanical, ATC, capacity, etc.)

8.2 Missing Information

  • Weather conditions at time of diversion
  • Reason for diversion (weather, mechanical, ATC, capacity, etc.)
  • Passenger impact (number of passengers affected)
  • Financial impact to airlines
  • Gate information and ground handling details

9 Data Processing Summary

Stage               Records In    Records Out   % Retained   Notes
Raw Data            25,386,632    -             -            Consolidated from 42 monthly files
Filter Diversions   25,386,632    64,815        0.26%        Removed non-diverted flights
Remove Duplicates   64,815        64,815        100%         No duplicates found
Geocoding           64,815        64,815        99.75%       Airports successfully geocoded
Final Dataset       -             64,815        100%         Ready for analysis
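A retention table like the one above can be generated directly from the pipeline counts rather than maintained by hand. A minimal sketch using the figures from this section:

```python
import pandas as pd

# (stage name, records in, records out) taken from the pipeline counts
stages = [
    ('Raw Data',          25_386_632, 25_386_632),
    ('Filter Diversions', 25_386_632, 64_815),
    ('Remove Duplicates', 64_815,     64_815),
    ('Geocoding',         64_815,     64_815),
]
summary = pd.DataFrame(stages, columns=['Stage', 'Records In', 'Records Out'])
summary['% Retained'] = (summary['Records Out'] / summary['Records In'] * 100).round(2)
print(summary.to_string(index=False))
```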

10 Reproducibility

To reproduce this data collection process:

  1. Download BTS data from: https://www.transtats.bts.gov/
  2. Place monthly CSV files in /data directory (42 files: 2021-07 to 2024-12)
  3. Run data cleaning script: notebooks/01-data-cleaning.ipynb
  4. Verify output: diverted_flights.csv should contain 64,815 records

10.1 Environment

python: 3.13
pandas: 2.3.1
numpy: 2.3.1
geopandas: 1.0.1

11 Next Steps

After collection and cleaning, the data was ready for:

  1. Exploratory Analysis - Understanding distribution and patterns
  2. Temporal Analysis - Identifying clusters and trends over time (see Methodology)
  3. Geospatial Analysis - Mapping diversions and routes (see Visualizations)
  4. Clustering - Finding patterns in diversion events using 12-hour temporal windows

See the Analysis Methodology section for details on analytical approach.


12 Data References

  • Bureau of Transportation Statistics (BTS): https://www.transtats.bts.gov/
  • FAA Airport Data: https://www.faa.gov/
  • OpenFlights Database: https://openflights.org/

Data Collection Date: December 2024
Analysis Period: July 2021 - December 2024
Total Records: 64,815 diversions from 25,386,632 total flights
Data Quality: 100% completeness for primary analysis fields