Introduction

Artificial Intelligence (AI) and Machine Learning (ML) have become a staple in today’s data analytics. Whether your problem lies in healthcare, fintech (finance technology), or transportation, ML provides a novel way to make predictions from large amounts of data. In some cases, the dataset can be so overwhelmingly large that it’s difficult to sort the data in the first place; in these instances, extracting any sort of interpretation can be cumbersome.

This tutorial walks through an ML problem from start to finish. Specifically, we focus on solving a problem using Scikit-Learn to implement supervised ML. In doing so, we’ll use publicly available data to generate a model capable of predicting the type of flight delay experienced by an aircraft in 2015.

We begin by downloading an ugly dataset and piecing together its remnants until the data are ready to be processed. At that point, we take two approaches to understanding the dataset. The first approach is a conventional analytical approach. With this mindset, we produce a variety of plots comparing different airlines in an attempt to gain insight into our data and draw preliminary conclusions. Second, we’ll feed the same dataset into four machine learning models as a means to draw predictions from the data.

That’s a lot to handle in one article, so let’s jump right in! If you’d like to jump ahead, the Jupyter notebook for this analysis can be found here.

Understanding the data

Our publicly available dataset characterizes different flight delays experienced by major airlines in 2015. The dataset, found on Kaggle, comprises three comma-separated value (CSV) files: airlines.csv, airports.csv, and flights.csv. The data contained in each CSV was collected by the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics, which tracks the on-time performance of domestic flights operated by large airline companies.

At the time of writing, the dataset is nearly ten years old, and more complete datasets exist. However, I’ve chosen to analyze this particular dataset because it helps illustrate how different resources can be brought together to build a coherent dataset ready to be processed by modern ML algorithms. Let’s begin by ingesting the dataset and taking a look around.

import os
import tempfile
import opendatasets as od

# Create new temporary directory
data_folder = tempfile.mkdtemp()

# Download dataset into temporary directory
# A more complete dataset can be found at 'https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023'
dataset = 'https://www.kaggle.com/datasets/usdot/flight-delays'
od.download(dataset, data_folder, force=True)

When we try to extract the dataset from Kaggle, we get a dialog box asking for a username and a key; in this case, the dataset is open source, so you only need to enter something into the text boxes! Using “a” for both the username and the key works fine. After the credentials are processed, the file begins downloading, and the download progress is shown with an animated status bar.

Understanding data elements

The following categories are contained within flights.csv and are given in xx:yy – hour:minute format (e.g. 1536 means 3:36pm, 345 means 3:45am, 16 means 12:16am):

  • SCHEDULED_DEPARTURE
  • DEPARTURE_TIME
  • SCHEDULED_ARRIVAL
  • ARRIVAL_TIME
  • WHEELS_OFF
  • WHEELS_ON

Other parameters are given in minutes (negative values indicate the actual time occurred before the scheduled time by the same magnitude):

  • ARRIVAL_DELAY
  • DEPARTURE_DELAY
  • TAXI_IN
  • TAXI_OUT
  • SCHEDULED_TIME
  • ELAPSED_TIME
  • AIR_TIME

The DISTANCE parameter is given in miles. Already, we can see we’re going to need to format the dataset so our machine learning models properly interpret the input values! Here are some specific definitions of the parameters:

  • WHEELS_OFF: The time point when the aircraft’s wheels leave the ground.
  • WHEELS_ON: The time point when the aircraft’s wheels touch the ground.
  • TAXI_OUT: The duration between departure from the origin airport gate and WHEELS_OFF.
  • TAXI_IN: The duration between WHEELS_ON and gate arrival at the destination airport.
  • AIR_TIME: The duration between the WHEELS_OFF and WHEELS_ON times.

Based on the variables’ definitions, relationships between the variables can be written down directly (a short sketch for checking them numerically follows the list):

  • ARRIVAL_TIME = WHEELS_ON + TAXI_IN
  • ARRIVAL_DELAY = ARRIVAL_TIME – SCHEDULED_ARRIVAL
  • DEPARTURE_TIME = WHEELS_OFF – TAXI_OUT
  • DEPARTURE_DELAY = DEPARTURE_TIME – SCHEDULED_DEPARTURE
  • ELAPSED_TIME = AIR_TIME + TAXI_IN + TAXI_OUT
  • AIR_TIME = WHEELS_ON – WHEELS_OFF
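
To make the hh:mm encoding and these identities concrete, here is a minimal sketch that converts an hh:mm-encoded value to minutes past midnight and checks the first relationship. The helper name hhmm_to_minutes is our own, not part of the dataset.

# A minimal sketch: convert an hh:mm-encoded value (e.g. 1536) to minutes
# past midnight so the identities above can be checked numerically.
# The helper name `hhmm_to_minutes` is ours, not part of the dataset.
def hhmm_to_minutes(hhmm):
    hours, minutes = divmod(int(hhmm), 100)
    return 60 * hours + minutes

# Example: WHEELS_ON at 15:36 plus an 8-minute TAXI_IN
wheels_on = hhmm_to_minutes(1536)     # 936 minutes past midnight
arrival_time = wheels_on + 8          # ARRIVAL_TIME = WHEELS_ON + TAXI_IN
print(divmod(arrival_time, 60))       # (15, 44), i.e. an ARRIVAL_TIME of 15:44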

Examine contents of each dataset

Next, we’re going to parse airline, airport, and flight data from the CSV files contained within the downloaded ZIP file. We use the Pandas module in Python to read each file as a Pandas dataframe titled airlines, airports, and flights, respectively.

Airlines dataset

First we’ll consider the airlines dataframe. In structure, the dataframe contains two columns: one listing the International Air Transport Association (IATA) code for each airline, and the other containing the name of the respective airline. The “head” of the dataframe gives us the first five rows, indexed from 0 to 4. In no particular order, we see the airlines are United Airlines, American Airlines, US Airways, Frontier Airlines, and JetBlue Airways. Keep in mind that the airlines dataframe contains more than just these five; only five show up because we’ve only printed the first five rows of the dataframe.

(Note: Although we print the head of the dataframe here, I’ll do so sparingly as the tutorial progresses, since the final dataframe is large and unsightly.)

import pandas as pd

# Read each CSV dataset
airlines_path = os.path.join(data_folder,'flight-delays','airlines.csv')

# Read each CSV into Pandas dataframe
airlines = pd.read_csv(airlines_path,low_memory=False)

print(airlines.head())
  IATA_CODE                 AIRLINE
0 UA United Air Lines Inc.
1 AA American Airlines Inc.
2 US US Airways Inc.
3 F9 Frontier Airlines Inc.
4 B6 JetBlue Airways

Airports dataset

A list of the airlines is nice, but not altogether helpful. We also need information on which airports the airlines operate from and where each airport is located. The location information can be represented in many different ways (e.g., city/state/country, or latitude and longitude). Similar to the IATA code of each airline, each airport also has an associated IATA code. This information, along with the location and name of each airport, is included in the airports dataset. Again, we can print the head of the dataframe to see a list of airports alphabetically organized by IATA code.

# Read each CSV dataset
airports_path = os.path.join(data_folder,'flight-delays','airports.csv')

# Read each CSV into Pandas dataframe
airports = pd.read_csv(airports_path,low_memory=False)

print(airports.head())
  IATA_CODE                              AIRPORT         CITY STATE COUNTRY  \
0 ABE Lehigh Valley International Airport Allentown PA USA
1 ABI Abilene Regional Airport Abilene TX USA
2 ABQ Albuquerque International Sunport Albuquerque NM USA
3 ABR Aberdeen Regional Airport Aberdeen SD USA
4 ABY Southwest Georgia Regional Airport Albany GA USA

LATITUDE LONGITUDE
0 40.65236 -75.44040
1 32.41132 -99.68190
2 35.04022 -106.60919
3 45.44906 -98.42183
4 31.53552 -84.19447

Flights dataset

The remaining piece of information is the flight path data, which describes each flight operated by each airline within the time period covered by the dataset, characterized by departure and arrival information. The flight path data includes the year, month, and day of the flight; flight specifics (tail number, airline, and flight number); and the flight’s origin and destination information. It also includes data on whether flights were diverted, cancelled, or delayed; for delayed flights, the dataset further tells us which type of delay the flight experienced, and the length of each delay in minutes. Printing the head of the dataframe would give better insight into its column titles and content, but we skip that here given its size. Note that January 1, 2015 (MONTH=1, DAY=1, YEAR=2015) was a Thursday; thus the first day of the week (DAY_OF_WEEK=1) corresponds to Monday, and the January 1 rows carry DAY_OF_WEEK=4.

# Read each CSV dataset
flights_path = os.path.join(data_folder,'flight-delays','flights.csv')

# Read each CSV into Pandas dataframe
flights = pd.read_csv(flights_path,low_memory=False)
pd.set_option('display.max_columns', None)
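
As a quick sanity check on that convention (our own check, not part of the original analysis), Python’s datetime module agrees: isoweekday() returns 1 for Monday, so January 1, 2015 should return 4.

from datetime import datetime

# Sanity check (ours): isoweekday() returns 1 for Monday, so a Thursday
# returns 4, matching DAY_OF_WEEK=4 on the dataset's January 1 rows
print(datetime(2015, 1, 1).isoweekday())   # 4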

Clean & Split

Since our focus will ultimately be on completed (“successful”) flights that experienced delays, we’ll begin by splitting the dataset into two: successful flights and cancelled flights.

# Split dataset into cancelled flights and uncancelled flights
flown_flights = flights[flights['CANCELLED']==0]
cancl_flights = flights[flights['CANCELLED']==1]

Successful Flight Analysis

As we saw in the Introduction, our flight data contains different units for different columns. Our simplification begins by removing useless columns (e.g., TAIL_NUMBER) and combining columns based on their relationships. In doing so, we want to minimize the number of columns we keep while maintaining all the information we were given. For instance, ARRIVAL_DELAY is calculated from ARRIVAL_TIME and SCHEDULED_ARRIVAL; since the former accounts for both of the latter, removing ARRIVAL_TIME and SCHEDULED_ARRIVAL does not remove relevant information from our dataset. Here are some other simplifications:

  • We know we’re dealing only with the flights that have flown successfully, and have not been cancelled. This means we can remove the CANCELLED and CANCELLATION_REASON columns.
  • TAIL_NUMBER and FLIGHT_NUMBER will not impact the delay type, so we’ll get rid of those.
  • ARRIVAL_DELAY accounts for ARRIVAL_TIME and SCHEDULED_ARRIVAL, so we can keep the former and drop the latter two. Similarly, ARRIVAL_TIME is based on WHEELS_ON and TAXI_IN, so we can eliminate these two as well.
  • We can apply similar logic to DEPARTURE_DELAY and eliminate DEPARTURE_TIME, SCHEDULED_DEPARTURE, WHEELS_OFF, and TAXI_OUT.
  • ELAPSED_TIME accounts for AIR_TIME, but it also inherently accounts for DISTANCE (longer distances require longer air time to reach). We can remove the latter and preserve the former.
  • We think ARRIVAL_DELAY is the sum of the other delay types (DEPARTURE, AIR_SYSTEM, SECURITY, AIRLINE, LATE_AIRCRAFT, and WEATHER), but it’s best not to get this wrong. Let’s remove ARRIVAL_DELAY and define a new column to represent this data so we know for sure.
  • We’ll rearrange the columns to put all the _DELAY types together by moving DEPARTURE_DELAY to the right a few columns.

Conveniently, dropping these columns also eliminates every column in a format other than minutes or miles.

# Remove cancellation-related columns
clean_flown = flown_flights.drop(columns=['CANCELLED','CANCELLATION_REASON'])

# Remove irrelevant flight information (tail, flight number)
clean_flights = clean_flown.drop(columns=['TAIL_NUMBER','FLIGHT_NUMBER'])

# Remove redundant arrival data
clean_arrival = clean_flights.drop(columns=['ARRIVAL_TIME','SCHEDULED_ARRIVAL','WHEELS_ON','TAXI_IN'])

# Remove redundant departure data
clean_departure = clean_arrival.drop(columns=['DEPARTURE_TIME','SCHEDULED_DEPARTURE','WHEELS_OFF','TAXI_OUT'])

# Remove the redundant delay summation, air-time, and distance data
clean_arrival = clean_departure.drop(columns=['ARRIVAL_DELAY','AIR_TIME','DISTANCE'])

# Calculate the total delay for each flight
clean_arrival['TOTAL_DELAY'] = clean_arrival['DEPARTURE_DELAY'] + clean_arrival['AIR_SYSTEM_DELAY'] + clean_arrival['SECURITY_DELAY'] + clean_arrival['AIRLINE_DELAY'] + clean_arrival['LATE_AIRCRAFT_DELAY'] + clean_arrival['WEATHER_DELAY']

# Rearrange columns to group similar concepts
clean_arrival = clean_arrival[['YEAR','MONTH','DAY','DAY_OF_WEEK','AIRLINE','ORIGIN_AIRPORT','DESTINATION_AIRPORT',
                               'SCHEDULED_TIME','ELAPSED_TIME','DIVERTED', 'DEPARTURE_DELAY','AIR_SYSTEM_DELAY',
                               'SECURITY_DELAY','AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY','WEATHER_DELAY','TOTAL_DELAY']]

# Show current dataframe
clean_arrival.head()

These changes result in a table similar to the following:

The YEAR column is redundant to begin with, because the entire dataset falls within a single year (2015). But we can combine the YEAR, MONTH, and DAY columns into a single variable called DAY_OF_YEAR. Happily, the new variable has the added benefit of encoding seasonality: while it would be difficult to back out winter, spring, summer, and fall from separate day and month columns, DAY_OF_YEAR inherently accounts for the changing seasons, since its smallest and largest values both fall in winter.

from datetime import datetime

# Combine YEAR, MONTH, and DAY into a single column
clean_arrival['DAY_OF_YEAR'] = [datetime(y,m,d).timetuple().tm_yday for y,m,d in zip(clean_arrival['YEAR'],
                                                                                     clean_arrival['MONTH'],
                                                                                     clean_arrival['DAY'])]
# Consolidate calendar information
clean_doy = clean_arrival.drop(columns=['YEAR','MONTH','DAY'])

# Print output metrics
print('\n# rows:', clean_doy.shape[0], '\n# categories:',clean_doy.shape[1])
# rows: 5729195 
# categories: 15

Airline label consistency

Aside from the numerical values, we still need to handle our airline label values. In its current state, each airline is represented by an opaque 2-letter code such as “NK”; let’s convert these into something readable (e.g., ‘Spirit Airlines’) by bringing in the other datasets we ingested at the beginning.

# Generate dictionaries from dataframes
airline_dict = dict(zip(airlines.IATA_CODE, airlines.AIRLINE))

# Replace each 2-letter IATA code with the full airline name
clean_doy['AIRLINE'] = clean_doy['AIRLINE'].map(airline_dict)

Origin and Destination inconsistencies

Another glaring inconsistency within the flights dataframe lies in the airport identifiers. At present, the ORIGIN_AIRPORT and DESTINATION_AIRPORT columns contain both 3-letter airport codes like “LAX”, defined by the IATA (International Air Transport Association), and five-digit FAA numerical codes like “12789”. Although both formats are valid, we need to make them consistent in order to interpret and analyze the data. (Fun fact: five-digit airport codes are often used instead of IATA codes because the 3-letter codes can change over time!)

Link FAA and IATA data

This is where we need to be creative. While numerical FAA airport identifiers are not included in the original Kaggle dataset, they can be interpreted using two additional data files provided by the Bureau of Transportation Statistics (L_AIRPORT_ID.csv and L_AIRPORT.csv). The first file links the numerical ID value to an airport name, and the second links the alphabetic airport identifier to the same airport name. Using the airport name as a “crosswalk” between the two files, we can link the numerical ID and the alphabetic identifier.

# Define paths for the BTS files connecting airport IDs and IATA codes
# (downloaded separately from the Bureau of Transportation Statistics
# into a 'flight-delays' folder in the working directory)
id_path = os.path.join(os.getcwd(),'flight-delays','L_AIRPORT_ID.csv')
code_path = os.path.join(os.getcwd(),'flight-delays','L_AIRPORT.csv')

# Read CSV files as dataframes
airport_id = pd.read_csv(id_path, low_memory=False)
airport_code = pd.read_csv(code_path, low_memory=False)

# Merge dataframes on Airport name
merged_codes = pd.merge(airport_id, airport_code, on="Description")

# Rename dataframe columns
merged_codes = merged_codes.rename(columns={"Code_x": "FAA_ID", "Code_y": "IATA"})

# Show new results
merged_codes.head()

An example of the resulting table is shown below:

With the FAA-ID and IATA codes linked, FAA-ID values within the flights dataset can be converted to the IATA format in both the ORIGIN_AIRPORT and DESTINATION_AIRPORT columns. To make sure the conversion took place correctly, we can print the number of numerical codes within the airport identifier column to show that FAA-ID values are no longer present.

# Create dictionary linking airport IDs with IATA codes
orig_airport_dict = dict(zip(merged_codes.FAA_ID, merged_codes.IATA))

# Match airport ID numbers with IATA codes
orig_airport = []
for a in clean_doy.ORIGIN_AIRPORT:
    if str(a).isdecimal():
        orig_airport.append(orig_airport_dict[int(a)])
    else:
        orig_airport.append(a)

# Replace dataframe column
flown_origin = clean_doy.copy()
flown_origin.ORIGIN_AIRPORT = orig_airport

# Test if all elements in ORIGIN_AIRPORT are alphabetic identifiers
print('\nNew airport origin list:\n',flown_origin.ORIGIN_AIRPORT.unique())
print("\nAll elements only contain letters:",all([x.isalpha() for x in flown_origin.ORIGIN_AIRPORT]),'\n')

The code snippet prints the ORIGIN_AIRPORT list both before and after the conversion, giving us visual confirmation of the successful change. Given the length of each list, much of each list has been removed and replaced with ellipses (…).

Old airport origin list:
['ANC' 'LAX' 'SFO' 'SEA' 'LAS' 'DEN' 'SLC' 'PDX' 'FAI' 'MSP' 'PHX' 'SJU' 'PBG' 'IAG' 'PSE' 'BQN' 'ORD' 'GEG' 'HNL' 'ONT' 'MCO' 'BOS' ... 'MVY' 'WYS' 'DLG' 'AKN' 'GST' 'HYA' '14747' '14771' '12889' '12892' '14869' '10299' '11292' '14107' '11630' ... '13344' '13933' '15991' '10754' '14256' '12094' '13873' '11097' '10154' '10926' '13964' '13541' '15027' '10170' '10165' '15497' '12265' '14222' '14025' '13502']

New airport origin list:
['ANC' 'LAX' 'SFO' 'SEA' 'LAS' 'DEN' 'SLC' 'PDX' 'FAI' 'MSP' 'PHX' 'SJU' 'PBG' 'IAG' 'PSE' 'BQN' 'ORD' 'GEG' 'HNL' 'ONT' 'MCO' 'BOS' 'HIB' 'ABR' ... 'TXK' 'SPS' 'ABI' 'YAK' 'SJT' 'CDV' 'OTH' 'ADQ' 'PPG' 'ROW' 'HOB' 'BGM' 'BGR' 'ITH' 'ACK' 'MVY' 'WYS' 'DLG' 'AKN' 'GST' 'HYA' 'BSM']

All elements only contain letters: True

The same treatment can be done for the DESTINATION_AIRPORT column.

# Match airport ID numbers with IATA codes
dest_airport = []
for d in flown_origin.DESTINATION_AIRPORT:
    if str(d).isdecimal():
        dest_airport.append(orig_airport_dict[int(d)])
    else:
        dest_airport.append(d)

# Replace dataframe column
flown_dest = flown_origin.copy()
flown_dest.DESTINATION_AIRPORT = dest_airport

# Test if all elements in DESTINATION_AIRPORT are alphabetic identifiers
print("\nAll elements only contain letters:",all([x.isalpha() for x in flown_dest.DESTINATION_AIRPORT]))

Despite our efforts, there remains a single outlier. The outlier represents a case where the 3-letter code for an airport was changed (“BSM” was updated to “AUS”). Since this information is not present in our existing crosswalk, we make the change manually. Finally, we confirm there are no more differences between our known airports and those listed in our processed dataframe.

# Show difference between `flights` IATA codes and `airports` IATA codes
print('\nAirport list differences:')
print('   Before substitution:', len(list(set(flown_dest.ORIGIN_AIRPORT) - set(airports.IATA_CODE))))

# Handle each IATA code change
flown_dest.ORIGIN_AIRPORT = flown_dest.ORIGIN_AIRPORT.replace('BSM','AUS')
flown_dest.DESTINATION_AIRPORT = flown_dest.DESTINATION_AIRPORT.replace('BSM','AUS')

# Show output
print('    After substitution:', len(list(set(flown_dest.ORIGIN_AIRPORT) - set(airports.IATA_CODE))))
Airport list differences:
Before substitution: 1
After substitution: 0

Now that we know the 3-letter acronym of each airport, we can define the latitude and longitude for each origin and destination airport.

import numpy as np

# Copy the existing dataframe before adding information
flown_lla = flown_dest.copy()

# Create dictionaries mapping each airport's IATA code to its latitude and longitude
lats_dict = dict(zip(airports.IATA_CODE, airports.LATITUDE))
lons_dict = dict(zip(airports.IATA_CODE, airports.LONGITUDE))

# ---------- Add Latitude and Longitude for Origin ----------
# Look up the matching latitude and longitude for each origin airport
flown_lla['ORIGIN_LAT'] = flown_lla['ORIGIN_AIRPORT'].map(lats_dict)
flown_lla['ORIGIN_LON'] = flown_lla['ORIGIN_AIRPORT'].map(lons_dict)

# ---------- Add Latitude and Longitude for Destination ----------
# Look up the matching latitude and longitude for each destination airport
flown_lla['DESTINATION_LAT'] = flown_lla['DESTINATION_AIRPORT'].map(lats_dict)
flown_lla['DESTINATION_LON'] = flown_lla['DESTINATION_AIRPORT'].map(lons_dict)

# Show all columns when printing dataframes
pd.set_option('display.max_columns', None)

# Since the LLA values make the airport names redundant, ORIGIN_AIRPORT and DESTINATION_AIRPORT can be removed
flown_lla = flown_lla.drop(columns=['ORIGIN_AIRPORT','DESTINATION_AIRPORT'])
print(flown_lla.head())

Removing NaN delay values

The final troublesome issue is dealing with NaN values in our _DELAY columns. Since our machine learning model will be used to predict the _DELAY type, we need to remove all NaN values and ensure all data are explicitly defined. This turns into a bit of a discussion, so hang in there.

It turns out that if we remove all NaN values in each _DELAY column, we end up with a dataframe that contains no flights with a cumulative delay of 0.0 minutes. That is, there are no “perfect” on-time flights remaining. It is tempting to assume, then, that flights with NaN values in every delay column experienced no delays, so we should set all such values to 0.0 and treat these as “on-time” flights. But this cannot be the case, since the minimum and maximum values for DEPARTURE_DELAY are -82.0 and 1988.0, respectively. Since the DEPARTURE_DELAY values are non-zero, there had to be some cause for the departure delay (e.g., weather, airline issues, or a late aircraft); from this viewpoint, it seems the NaN values result from fields simply not being filled in correctly, rather than indicating that no delay occurred at all. Furthermore, flights that exhibit no delay in one or more _DELAY categories are often listed as 0.0 rather than NaN. Ultimately, it is unsafe to assume that NaN means 0.0 in the _DELAY columns.
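
A quick check along these lines supports that reasoning. The sketch below is our own, using the flown_lla dataframe from the previous step: it selects the rows where every _DELAY breakdown is NaN and shows that DEPARTURE_DELAY is still populated for them.

# A sketch (ours) backing up the reasoning above, using `flown_lla`
delay_cols = ['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY',
              'LATE_AIRCRAFT_DELAY','WEATHER_DELAY']

# Rows where every _DELAY breakdown is missing...
all_nan = flown_lla[flown_lla[delay_cols].isna().all(axis=1)]

# ...still carry (often nonzero) DEPARTURE_DELAY values, so NaN cannot
# simply mean "no delay occurred"
print(all_nan['DEPARTURE_DELAY'].describe())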

To avoid incorrect assumptions, then, we’ll simply remove all rows where a NaN value exists in any _DELAY column; doing so leaves us with 1,063,439 flights, each characterized by 17 columns. We can then iterate through the remaining rows and identify the dominant flight delay type for each flight, stored in a new MAX_DELAY column. Printing the number of occurrences of each delay type gives us a glimpse of how our ML categories will be distributed.

import numpy as np
import matplotlib.pyplot as plt

# Drop rows with NaN values in _DELAY columns
flown_noNA = flown_lla.dropna(subset = ['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY'])

# Find max delay type in each row
sub_df = flown_noNA[['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY']]
max_delay = sub_df.idxmax(axis=1)
flown_noNA.loc[:,'MAX_DELAY'] = max_delay

# Confirm there are no flights where every _DELAY value is 0.0
num_noDelay = len(flown_noNA[
                          (flown_noNA['AIR_SYSTEM_DELAY'] == 0.0) &
                          (flown_noNA['SECURITY_DELAY'] == 0.0) &
                          (flown_noNA['AIRLINE_DELAY'] == 0.0) &
                          (flown_noNA['LATE_AIRCRAFT_DELAY'] == 0.0) &
                          (flown_noNA['WEATHER_DELAY'] == 0.0)
                          ])
print('Number of flights with 0 delays: ',num_noDelay)

# Show sum for each max_delay column value
print('\nNumber of delay counts for each type:\n',flown_noNA['MAX_DELAY'].value_counts())

# Show shape of remaining dataframe
print('\nRemaining data shape: ',flown_noNA.shape)
Number of flights with 0 delays:  0

Number of delay counts for each type:
MAX_DELAY
LATE_AIRCRAFT_DELAY 410647
AIRLINE_DELAY 311386
AIR_SYSTEM_DELAY 303784
WEATHER_DELAY 35711
SECURITY_DELAY 1911

Remaining data shape: (1063439, 18)

Exploring the data

Due to its length and content, this portion of the analysis has been moved to a separate article. Enjoy the analytic portion, and head back here when you’ve had a chance to read it!

AI/ML Modeling

With our cleaned dataset in hand, we’ll begin setting up our AI/ML model to predict the type of delay experienced by a given flight based on the remaining flight parameters (day of week, airline, etc.).

Parameter Relationships

During our dataframe simplification, we removed columns that were strongly related to (or “highly correlated with”) one another. We can visualize the remaining correlations within our dataframe using only its numerical values. (The sorted_delay dataframe used below carries over from the data exploration in the companion article.) In the existing dataset, we notice that the DIVERTED column has null correlations because the column contains only one unique value (0). Since this parameter has no correlation with the other variables, we’ll remove it now and plot the remaining correlation matrix.

import seaborn as sns

# Remove `DIVERTED` (along with the leftover index and SCHEDULED_TIME columns)
rm_diverted = sorted_delay.drop(columns=['index','DIVERTED','SCHEDULED_TIME'])

# Plot a graphical correlation matrix for each pair of columns in the dataframe
corr2 = rm_diverted.corr(numeric_only=True)

plt.figure(figsize = (8,6))
sns.heatmap(corr2, annot=False)
sns.set(font_scale=0.7)
plt.show()

Next, we’ll take some steps to further shape up the dataframe. Earlier, we added one more column combining all five _DELAY columns; although highly correlated with them, the new MAX_DELAY column describes the maximum delay type experienced on any given flight and will act as the “truth” value for our ML model.

We’ve managed to remove strong correlations in our dataframe, but we still need to convert strings (e.g., the AIRLINE names) into numerical values. To keep track of which numerical values correspond to which airlines, we’ll quickly zip the two arrays together for later recall:

# Generate correlation between numerical values and airline names
airlines = rm_diverted['AIRLINE'].unique()
indices = np.arange(len(airlines))
airline_key = zip(airlines,indices)

# Show corresponding values
print(list(airline_key))

# Show result beforehand
print('\nUnique airlines before substitution:\n',rm_diverted['AIRLINE'].unique())

for k in np.arange(len(airlines)):
    rm_diverted.loc[rm_diverted['AIRLINE'] == airlines[k], 'AIRLINE'] = k

# Show result afterward
print('\nUnique airlines after substitution:\n',rm_diverted['AIRLINE'].unique())

# Generate correlation between numerical values and airline names
delays = rm_diverted['MAX_DELAY'].unique()
indices = np.arange(len(delays))
delay_key = zip(delays,indices)

# Create unique labels for ML characterization
uniqlbl = [x.split('_')[0] for x in rm_diverted['MAX_DELAY'].unique()]

for k in np.arange(len(delays)):
    rm_diverted.loc[rm_diverted['MAX_DELAY'] == delays[k], 'MAX_DELAY'] = k

# Remove any final NaN values from dataframe
clean_noNA = rm_diverted.dropna()

print("\nFinal dataset size:\n",clean_noNA.shape)
print('\nNaN values in final array:\n',clean_noNA.isnull().values.any(),'\n')
[('American Airlines Inc.', 0), ('American Eagle Airlines Inc.', 1), ('Hawaiian Airlines Inc.', 2), ('Skywest Airlines Inc.', 3), ('United Air Lines Inc.', 4), ('Delta Air Lines Inc.', 5), ('Atlantic Southeast Airlines', 6), ('Frontier Airlines Inc.', 7), ('JetBlue Airways', 8), ('Alaska Airlines Inc.', 9), ('Spirit Air Lines', 10), ('US Airways Inc.', 11), ('Southwest Airlines Co.', 12), ('Virgin America', 13)]

Unique airlines before substitution:
['American Airlines Inc.' 'American Eagle Airlines Inc.'
'Hawaiian Airlines Inc.' 'Skywest Airlines Inc.' 'United Air Lines Inc.'
'Delta Air Lines Inc.' 'Atlantic Southeast Airlines'
'Frontier Airlines Inc.' 'JetBlue Airways' 'Alaska Airlines Inc.'
'Spirit Air Lines' 'US Airways Inc.' 'Southwest Airlines Co.'
'Virgin America']

Unique airlines after substitution:
[0 1 2 3 4 5 6 7 8 9 10 11 12 13]

Final dataset size:
(1061932, 17)

NaN values in final array:
False

Split into Training / Test data

Now our dataset contains only numerical values. We then split the dataset into training data and test data at a ratio of 80:20. After the split, we need to rescale the input parameters (“features”) so that all features contribute equally to the model, a process known as “feature scaling”. Finally, the categorical columns are converted into numerical columns for the ML model.

Note that we’re also going to remove the obvious columns ending in _DELAY. Since we’re asking our classification model to predict the delay type, continuing to include the delay information would let the models score nearly perfectly, since each would only need to make a 1:1 correlation. Before the train/test split, the first five rows of our dataframe look like…

Now onto the train/test split!

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn import utils

# Remove obvious data columns
clean_delays = clean_noNA.drop(columns=['DEPARTURE_DELAY','AIR_SYSTEM_DELAY','SECURITY_DELAY','WEATHER_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY'])

# Splitting the data into independent and dependent variables
X = clean_delays.drop(columns=['MAX_DELAY']).values
y = clean_delays['MAX_DELAY'].values
feature_list = list(clean_delays.drop(columns=['MAX_DELAY']).columns)

# Creating the Training and Test set from data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert categorical labels into numerical labels for fitting
lab = preprocessing.LabelEncoder()
y_train = lab.fit_transform(y_train)
y_test = lab.transform(y_test)   # reuse the encoder fitted on the training labels

Next, we’ll define four classification models to process our data: Decision Tree, Random Forest, Gaussian Naive Bayes, and Logistic Regression classifiers. After each model has been fitted, we store its predictions, accuracy, and log loss for visualization.

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["axes.grid"] = False

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, log_loss

# --------- PLOT TOP ROW OF CLASSIFIERS ---------
classifiers = {
    "Dec. Tree": DecisionTreeClassifier(),
    "Rnd Forest": RandomForestClassifier(n_estimators = 25, criterion = 'gini'),
    "Naive Bayes": GaussianNB(),
    "Log. Reg.": LogisticRegression(),
}

f, axes = plt.subplots(1, 4, figsize=(12,6), sharey='row')

log_cols=["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

for i, (key, classifier) in enumerate(classifiers.items()):
    y_pred = classifier.fit(X_train, y_train).predict(X_test)
    cf_matrix = confusion_matrix(y_test, y_pred, normalize='true')

    #name = classifier.__class__.__name__    
    name = key
    train_predictions = classifier.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    train_predictions = classifier.predict_proba(X_test)
    ll = log_loss(y_test, train_predictions)
    log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)
    
    disp = ConfusionMatrixDisplay(cf_matrix,
                                  display_labels=uniqlbl,
                                 )
    disp.plot(ax=axes[i], xticks_rotation=45)
    disp.ax_.set_title(key)
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

# Plot confusion matrices
plt.style.use('dark_background')
sns.set_color_codes("muted")
plt.grid(False)
plt.tight_layout()
plt.show()

Which produces the following confusion matrices…

A cursory glance tells us that each of the four models has some difficulty predicting the delay type. Let’s get a better feel for the numerical performance of each model by plotting the classifier accuracy and log loss and printing each model’s score. (Note that for classifiers, Scikit-Learn’s score() method returns mean accuracy rather than a true R^2 value, despite the label in the printout below.)

# Print each model's score (for classifiers, .score() returns mean accuracy, not a regression R^2)
for i, (key, classifier) in enumerate(classifiers.items()):
    y_pred = classifier.fit(X_train, y_train).predict(X_test)
    r2_reg_model_test = round(classifier.score(X_test, y_test),2)
    print(classifier,"R^2 Value: {}".format(r2_reg_model_test))

# Image bar charts for Accuracy and Log Loss
import seaborn as sns
plt.figure(figsize=(14,2))
plt.subplot(1,2,1)
plt.style.use('dark_background')
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
plt.xlabel('Accuracy %')
plt.title('Classifier Accuracy')

plt.subplot(1,2,2)
sns.set_color_codes("muted")
sns.barplot(x='Log Loss', y='Classifier', data=log, color="g")
plt.xlabel('Log Loss')
plt.title('Classifier Log Loss')
plt.show()
DecisionTreeClassifier() R^2 Value: 0.55
RandomForestClassifier(n_estimators=25) R^2 Value: 0.62
GaussianNB() R^2 Value: 0.47
LogisticRegression() R^2 Value: 0.5

Of the four classifiers, RandomForest shows the most promise, so we’ll concentrate on this model moving forward. With the model in hand, we can plot the importance of each feature the model considered:

import time
from sklearn.inspection import permutation_importance

# Create a random forest classifier using all the feature variables
rf_model = RandomForestClassifier(n_estimators=25, criterion='gini')
rf_model.fit(X_train, y_train)

# Monitor feature importance for each permutation
result = permutation_importance(
    rf_model, X_test, y_test, n_repeats=10, n_jobs=4
)
forest_importances = pd.Series(result.importances_mean, index=feature_list)

# Plot feature importance using permutation on full model
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()

The bar chart illustrates that the RandomForest classifier considered TOTAL_DELAY, ELAPSED_TIME, and DAY_OF_YEAR to be the most significant factors in predicting the type of the delay experienced by a given flight, although each parameter wound up contributing a noticeable amount.

Hyperparameter tuning

In our analysis so far, we’ve run the RandomForest classifier with two hyperparameters defined: the number of estimators (n_estimators=25) and the criterion (‘gini’). But there is no shortage of other parameters we can alter according to the Scikit-Learn documentation.

After spending some time playing around with hyperparameter optimization, we found the optimal parameters for this RandomForest model to be (a sketch of one way to run such a search follows the list):

  • n_estimators = 30
  • min_samples_split = 12
  • min_samples_leaf = 2
  • max_features = ‘sqrt’
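
For illustration, here is a minimal sketch of how such a search might be run with Scikit-Learn’s RandomizedSearchCV. The parameter grid below is our own and simply brackets the values listed above; it is not the exact search we performed.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate hyperparameter values bracketing the optimum listed above
# (an illustrative grid, not the exact search we performed)
param_grid = {
    'n_estimators': [25, 30, 50],
    'min_samples_split': [2, 8, 12],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

# Sample 10 random combinations, scoring each with 3-fold cross-validation
search = RandomizedSearchCV(RandomForestClassifier(criterion='gini'),
                            param_distributions=param_grid,
                            n_iter=10, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)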

Unfortunately, our maximum accuracy is ~64%, and that will have to be good enough for now. We’ve tried different classifier models, but we’ve yet to exhaust the entire list of classifiers we could use; one way another classifier could slot into the same workflow is sketched below. We could also use better data, but we chose this dataset precisely for its broken, semi-functional nature, to illustrate how different datasets can be woven together.
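
As one example, Scikit-Learn’s HistGradientBoostingClassifier slots into the same fit/score pattern as the four models above. This is a quick, unbenchmarked sketch, not a result we report.

from sklearn.ensemble import HistGradientBoostingClassifier

# A classifier we did not try above: a histogram-based gradient-boosted
# tree ensemble (shown as an illustration only; not benchmarked here)
hgb = HistGradientBoostingClassifier()
hgb.fit(X_train, y_train)
print('HistGradientBoosting accuracy:', hgb.score(X_test, y_test))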

Conclusion

In this tutorial, we took an open-source, publicly available dataset and used standard cleaning methods, reasoning, and other tools to clean the dataset and prepare the data for analysis. Once the dataset was clean, we broke our analysis into two sections: one using more conventional data analysis methods, and another that utilized four machine learning models.

The first type of analysis dug into one specific airline (American Airlines), examining the impact of various flight parameters on delay time. This method was more of a general characterization than our machine learning approach. With the ML models, we asked each model to predict which delay type a given aircraft experienced, and the results were rocky to say the least. We then compared initial metrics for each classifier and examined feature importances within the most promising model to show which parameters made the largest difference!