Data Collection & Preparation

1 Overview

This section documents how flight diversion data was collected, processed, and prepared for analysis. Understanding the data pipeline is essential for interpreting the results and for recognizing the limitations of this analysis.


2 Data Sources

2.1 Primary Source: BTS OTMC-OTP Database

Source: U.S. Bureau of Transportation Statistics
Portal: https://www.transtats.bts.gov/
Database: OTMC-OTP (Operations Time Series - On-Time Performance)

The BTS provides monthly CSV files containing detailed flight operation statistics for all U.S. carriers operating passenger services. Each file includes:

  • Flight identifiers (carrier, flight number, date)
  • Origin and destination airports
  • Scheduled, actual, and diverted times
  • Delay information
  • Diversion airport information

Coverage: July 2021 - December 2024 (42 months)
Records: 25,386,632 total flight records
Diversions Identified: 64,815 diverted flights

2.2 Secondary Source: Airport Coordinates

Source: Federal Aviation Administration & OpenFlights Database
Portals: https://www.faa.gov/ and https://openflights.org/
Data: Airport geographic coordinates (latitude/longitude)

Geographic data for U.S. airports was obtained to enable spatial analysis and visualization. Each airport is represented by its IATA code (3-letter code) and geographic coordinates.

Coverage: 408 airports across the United States
Accuracy: Latitude/longitude to seconds precision (~30 meter accuracy)
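The stated ~30 meter figure is easy to sanity-check: one arcsecond of latitude spans roughly 31 m of ground distance. A back-of-the-envelope calculation using the mean Earth radius:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

# Ground distance spanned by one arcsecond of latitude:
# full circumference, divided by 360 degrees, divided by 3600 arcseconds
meters_per_arcsecond = 2 * math.pi * EARTH_RADIUS_M / 360 / 3600
print(f"{meters_per_arcsecond:.1f} m per arcsecond")
```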


3 Data Download Process

3.1 Steps to Reproduce Data Collection

The BTS data can be downloaded manually or via programmatic access:

3.1.1 Manual Download (Current Process)

  1. Visit https://www.transtats.bts.gov/
  2. Navigate to the “Downloads” section
  3. Select “OTMC-OTP” database
  4. Download monthly CSV files (July 2021 - December 2024)
  5. Store in a /data directory
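Once the files are in place, the download can be sanity-checked before cleaning begins. A minimal sketch: the filename pattern matches the one used by the consolidation step, and the expected count of 42 assumes one file per month of the July 2021 - December 2024 window.

```python
from pathlib import Path

def check_downloads(data_dir: str, expected: int = 42) -> list:
    """Return the sorted monthly CSV filenames, flagging a short download."""
    files = sorted(Path(data_dir).glob('*_OTMC_OTP.csv'))
    if len(files) != expected:
        print(f"Expected {expected} files, found {len(files)}")
    return [f.name for f in files]

check_downloads('data')
```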

3.1.2 Programmatic Approach (Future Enhancement)

# Example: automating BTS data download (template only -- the
# BTS download endpoint and its parameters are not filled in here
# and must be taken from the BTS site before this can run)
import requests

# Sketch: loop over each month in the study window, fetch the
# monthly CSV, and save it under data/ for the cleaning step
for year in range(2021, 2025):
    for month in range(1, 13):
        pass  # requests.get(...) once the endpoint is known

4 Data Structure

4.1 Key Variables in Raw Data

Column                     Description                       Type     Notes
FlightDate                 Date of flight departure          Date     Original scheduled date
Reporting_Airline          IATA code of reporting airline    String   2-3 letter code
Marketing_Airline_Network  Airline name                      String   Full airline name
Origin                     Origin airport IATA code          String   3-letter code
Dest                       Intended destination IATA code    String   3-letter code
DepTime                    Scheduled departure time          Integer  HHMM format
ActualDepTime              Actual departure time             Integer  HHMM format
DepDelay                   Departure delay in minutes        Float    Negative = early, positive = late
ArrDelay                   Arrival delay in minutes          Float    Negative = early, positive = late
Div1Airport                First diversion airport           String   Populated only if diverted
Duplicate                  Duplicate-record flag             String   Y/N
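Since the time fields arrive as HHMM integers rather than real times, they must be parsed before any time arithmetic. A small helper sketch (the handling of 2400 as midnight is an assumption about how the encoding wraps):

```python
import pandas as pd

def hhmm_to_timedelta(series: pd.Series) -> pd.Series:
    """Convert HHMM integers (e.g. 1435) to offsets from midnight."""
    hhmm = pd.to_numeric(series, errors='coerce').replace(2400, 0)
    return (pd.to_timedelta(hhmm // 100, unit='h')
            + pd.to_timedelta(hhmm % 100, unit='m'))

print(hhmm_to_timedelta(pd.Series([835, 1435, 2359])))
```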

4.2 Example Data Row

FlightDate: 2024-12-18
Marketing_Airline_Network: American Airlines
Origin: LAX
Dest: SAN
Div1Airport: PHX
DepDelay: 42
ArrDelay: 285

5 Data Cleaning Pipeline

5.1 Step 1: File Consolidation

Multiple monthly CSV files were combined into a single dataset:

import pandas as pd
import glob

# List all monthly flight data files (2021-07 to 2024-12)
file_list = sorted(glob.glob('data/*_OTMC_OTP.csv'))

# Verify all files share the same column names before concatenating
columns_set = []
for file in file_list:
    df = pd.read_csv(file, nrows=0)  # read the header row only
    columns_set.append(set(df.columns))
assert all(cols == columns_set[0] for cols in columns_set), \
    "Column mismatch across monthly files"

# Combine files into a single DataFrame
flights_list = [pd.read_csv(f) for f in file_list]
flights = pd.concat(flights_list, ignore_index=True)

Result: 42 monthly files successfully consolidated into a single DataFrame
Total Records: 25,386,632 flight records


5.2 Step 2: Diversion Identification

Not all flights in the dataset represent diversions. A flight is considered “diverted” when:

  1. The flight has a recorded Div1Airport value
  2. The Div1Airport is different from the Dest (intended destination)
  3. The flight was not marked as a duplicate

# Filter diverted flights only
diverted_flights = flights[
    (flights['Div1Airport'].notna()) & 
    (flights['Div1Airport'] != flights['Dest'])
].copy()

# Remove duplicate-flagged records
diverted_flights = diverted_flights[diverted_flights['Duplicate'] != 'Y']

print(f"Total diversions identified: {len(diverted_flights)}")

Result: 64,815 diverted flights identified
Filtering Ratio: 0.26% of all flights were diverted
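The 0.26% figure follows directly from the two counts reported in this section:

```python
total_flights = 25_386_632  # all flight records in the consolidated dataset
diversions = 64_815         # records passing the diversion filter

# Share of all flights that were diverted
print(f"Diversion rate: {diversions / total_flights:.2%}")
```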


5.3 Step 3: Missing Data Handling

Empty Columns Removed:
  • Columns with 100% missing values were dropped
  • Reduced dimensionality for cleaner analysis

Time Data Cleaning:
  • FlightDate converted to datetime format
  • Missing departure times handled (these indicate cancelled flights)
  • Departure delays filled where possible

# Drop empty columns
empty_columns = diverted_flights.columns[diverted_flights.isna().all()]
diverted_flights = diverted_flights.drop(columns=empty_columns)

# Handle time data
diverted_flights['FlightDate'] = pd.to_datetime(
    diverted_flights['FlightDate'], 
    errors='coerce'
)

# Clean delay data
diverted_flights['DepDelay'] = pd.to_numeric(
    diverted_flights['DepDelay'], 
    errors='coerce'
)
diverted_flights['ArrDelay'] = pd.to_numeric(
    diverted_flights['ArrDelay'],
    errors='coerce'
)

5.4 Step 4: Feature Engineering

Estimated Actual Arrival DateTime

The estimated actual arrival time was calculated to determine when diversions occurred:

# Calculate actual arrival datetime
diverted_flights['EstimatedActualArrival'] = (
    diverted_flights['FlightDate'] +
    pd.to_timedelta(diverted_flights['ArrDelay'], unit='m')
)

# Extract just the date for clustering
diverted_flights['ArrivalDate'] = diverted_flights['EstimatedActualArrival'].dt.date

This feature is crucial for the temporal clustering analysis that identifies when diversions cluster together.
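As a sketch of how this feature feeds the clustering step, diversions can be bucketed into 12-hour windows (the window size comes from the clustering approach described later; the arrival values below are toy data standing in for EstimatedActualArrival):

```python
import pandas as pd

# Toy stand-in for diverted_flights['EstimatedActualArrival']
arrivals = pd.Series(pd.to_datetime([
    '2024-12-18 03:00', '2024-12-18 05:30',
    '2024-12-18 21:00', '2024-12-19 02:15',
]))

# Count diversions per 12-hour window anchored at midnight
window_counts = arrivals.dt.floor('12h').value_counts().sort_index()
print(window_counts)
```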


6 Data Quality Assessment

6.1 Data Completeness

  • Total Diversion Records: 64,815
  • Records with Complete Airport Codes: 99.8%
  • Records with Complete Coordinates: 99.75% (after geocoding)
  • Records with Airline Information: 100%

6.2 Data Consistency Checks

Passed:
  • All airports are valid IATA codes
  • Delay values are numeric and within a plausible range for U.S. flights
  • Dates fall within the study window (July 2021 - December 2024)
  • Carrier codes are consistent with the FAA registry

⚠️ Notes:
  • Some records have missing secondary diversion airports (BTS tracks only the first diversion)
  • A small percentage of flights have missing coordinates (handled by geocoding lookup)
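Checks of this kind can be scripted rather than done by hand. An illustrative sketch: the column names match the data dictionary above, the delay bounds are assumptions, and the sample rows are toy values.

```python
import pandas as pd

def run_consistency_checks(df: pd.DataFrame) -> dict:
    """Return a pass/fail flag for each consistency rule."""
    return {
        'valid_iata_codes': df['Dest'].str.fullmatch(r'[A-Z0-9]{3}').all(),
        'delays_numeric': pd.api.types.is_numeric_dtype(df['ArrDelay']),
        'dates_in_range': df['FlightDate'].between('2021-07-01', '2024-12-31').all(),
        'delays_plausible': df['ArrDelay'].between(-120, 2880).all(),  # assumed bounds
    }

sample = pd.DataFrame({
    'Dest': ['SAN', 'PHX'],
    'ArrDelay': [285.0, -10.0],
    'FlightDate': pd.to_datetime(['2024-12-18', '2022-03-02']),
})
print(run_consistency_checks(sample))
```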


7 Airport Geocoding

To enable geographic analysis, airports were geocoded with latitude/longitude:

7.1 Airport Coordinates Source

Airport coordinates were obtained from the Federal Aviation Administration and the OpenFlights database:
  • IATA airport codes matched to latitude/longitude coordinates
  • 408 U.S. airports successfully geocoded
  • Coordinates precise to seconds (approximately ±30 meters)

# Example geocoding
airports_data = pd.read_csv('airport_coordinates.csv')
airports_clean = airports_data.set_index('AIRPORT')[['LATITUDE', 'LONGITUDE']]

def get_coords(airport_code, airport_df):
    if airport_code in airport_df.index:
        coords = airport_df.loc[airport_code]
        return (coords['LATITUDE'], coords['LONGITUDE'])
    return (None, None)

# Apply to flight data
diverted_flights['origin_lat'] = diverted_flights['Origin'].apply(
    lambda x: get_coords(x, airports_clean)[0]
)
diverted_flights['origin_lon'] = diverted_flights['Origin'].apply(
    lambda x: get_coords(x, airports_clean)[1]
)

Geocoding Success Rate: 99.75%
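The success rate reported above is simply the share of records that received a coordinate. Computed as a sketch on toy values:

```python
import pandas as pd

# Toy stand-in for diverted_flights after geocoding; None marks a failed lookup
df = pd.DataFrame({'origin_lat': [33.94, None, 40.64, 33.43]})

# Fraction of records with a resolved coordinate
success_rate = df['origin_lat'].notna().mean()
print(f"Geocoding success rate: {success_rate:.2%}")
```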


8 Data Limitations

8.1 Known Limitations

  1. U.S. Domestic Only: Analysis limited to U.S. domestic flights; international diversions excluded
  2. Carrier Reporting: Data depends on airline reporting accuracy; some differences in reporting practices between carriers
  3. Weather Data Not Included: The current dataset does not include concurrent weather conditions, limiting causal analysis
  4. Secondary Diversions: Only first diversion recorded; subsequent diversions not tracked, potentially underestimating system impact
  5. Seasonality: Dataset spans multiple years but seasonal patterns may be confounded with operational changes over time
  6. Diversion Reason Unknown: BTS data does not indicate reason for diversion (weather, mechanical, ATC, capacity, etc.)

8.2 Missing Information

  • Weather conditions at time of diversion
  • Reason for diversion (weather, mechanical, ATC, capacity, etc.)
  • Passenger impact (number of passengers affected)
  • Financial impact to airlines
  • Gate information and ground handling details

9 Data Processing Summary

Stage               Records In    Records Out   % Retained   Notes
Raw Data            25,386,632    -             -            Consolidated from 42 monthly files
Filter Diversions   25,386,632    64,815        0.26%        Removed non-diverted flights
Remove Duplicates   64,815        64,815        100%         No duplicates found
Geocoding           64,815        64,815        99.75%       Airports successfully geocoded
Final Dataset       -             64,815        100%         Ready for analysis
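A retention table like the one above can be generated directly from the pipeline counts rather than maintained by hand. A minimal sketch using the figures from this section:

```python
import pandas as pd

# (stage name, records in, records out) taken from the pipeline counts
stages = [
    ('Raw Data',          25_386_632, 25_386_632),
    ('Filter Diversions', 25_386_632, 64_815),
    ('Remove Duplicates', 64_815,     64_815),
    ('Geocoding',         64_815,     64_815),
]
summary = pd.DataFrame(stages, columns=['Stage', 'Records In', 'Records Out'])
summary['% Retained'] = (summary['Records Out'] / summary['Records In'] * 100).round(2)
print(summary.to_string(index=False))
```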

10 Reproducibility

To reproduce this data collection process:

  1. Download BTS data from: https://www.transtats.bts.gov/
  2. Place monthly CSV files in /data directory (42 files: 2021-07 to 2024-12)
  3. Run data cleaning script: notebooks/01-data-cleaning.ipynb
  4. Verify output: diverted_flights.csv should contain 64,815 records

10.1 Environment

python: 3.13
pandas: 2.3.1
numpy: 2.3.1
geopandas: 1.0.1

11 Next Steps

After collection and cleaning, the data was ready for:

  1. Exploratory Analysis - Understanding distribution and patterns
  2. Temporal Analysis - Identifying clusters and trends over time (see Methodology)
  3. Geospatial Analysis - Mapping diversions and routes (see Visualizations)
  4. Clustering - Finding patterns in diversion events using 12-hour temporal windows

See the Analysis Methodology section for details on analytical approach.


12 Data References

  • Bureau of Transportation Statistics (BTS): https://www.transtats.bts.gov/
  • FAA Airport Data: https://www.faa.gov/
  • OpenFlights Database: https://openflights.org/

Data Collection Date: December 2024
Analysis Period: July 2021 - December 2024
Total Records: 64,815 diversions from 25,386,632 total flights
Data Quality: 100% completeness for primary analysis fields