Introduction
This article is an aside from a previous article on machine learning. The dataframe used in this tutorial (flown_noNA) is derived in the previous article after ingesting multiple CSV files, combining data, and removing redundancies within the resulting dataframe.
Exploring the data
Let’s begin by casting a wide net and plotting histograms of all features present within the dataset. Each histogram shows only the distribution of values within a single column. For instance, the first plot, DAY_OF_WEEK, illustrates that a roughly similar number of flights occurred on days 1 (Monday), 2, 3, 4, 5, and 7. By comparison, significantly fewer flights occurred on day 6 (Saturday). Similarly, the distribution under SCHEDULED_TIME indicates that a large majority of airline flights in 2015 were scheduled to last under 200 minutes.
flown_noNA.hist(bins=100, figsize=(20,15))
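The bar heights in these histograms can also be checked numerically. The following is a minimal sketch that uses a small, made-up stand-in for flown_noNA (the name toy_flights and all of its values are assumptions for illustration, not data from the real dataset). It counts the flights per DAY_OF_WEEK value and computes the share of flights scheduled for under 200 minutes.

```python
import pandas as pd

# Toy stand-in for flown_noNA; the values are invented for illustration only.
toy_flights = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 2, 3, 4, 4, 5, 6, 7, 7],
    "SCHEDULED_TIME": [95, 120, 60, 310, 150, 85, 200, 45, 130, 175],
})

# Number of flights per weekday, ordered by weekday rather than by count.
flights_per_day = toy_flights["DAY_OF_WEEK"].value_counts().sort_index()
print(flights_per_day)

# Share of flights scheduled for under 200 minutes.
share_under_200 = (toy_flights["SCHEDULED_TIME"] < 200).mean()
print(f"Share of flights under 200 minutes: {share_under_200:.0%}")
```

On the real dataframe, the same two expressions should work with flown_noNA in place of toy_flights.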
Delay time vs. airline
Let’s take this one step further and dig into TOTAL_DELAY as a function of airline, which we can do by comparing the distribution of delay times for each airline. An easy way to do this is to plot the delays for every airline on the same axis using the Python package “Seaborn”. To make things easier, we’ll sort the rows by TOTAL_DELAY in descending order and list the airlines in the order in which they first appear, so that the airline with the single longest delay is listed first and the airline whose longest delay is the shortest is listed last. To handle extreme cases (e.g. outliers), we’ll plot the logarithm of the total delay time; because the logarithm is monotonic, the ordering is the same as sorting on the log(Delay Time) values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme(style="dark", rc={'figure.figsize':(11,5)})
# Pre-emptively sort dataframe
sorted_delay = flown_noNA.sort_values(by='TOTAL_DELAY',ascending=False).reset_index()
descending_delays = sorted_delay['AIRLINE'].unique()
# Generate plot of delays sorted by decreasing delay
log_delay = np.log(sorted_delay['TOTAL_DELAY'])
delay_plot = sns.violinplot(sorted_delay, x="AIRLINE", y=log_delay, order=descending_delays)
delay_plot.set_xticklabels(delay_plot.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.xlabel("Airline")
plt.ylabel("log(Total delay)")
As the plot shows, American Airlines flights experienced significantly higher delays than any of the other airlines.
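One caveat on the ordering: taking unique() on a dataframe sorted by TOTAL_DELAY ranks each airline by its single largest delay, so one extreme flight can decide an airline's position on the axis. The sketch below contrasts that ordering with an alternative that ranks airlines by their median delay. It is an illustration only: toy_flights, the airline codes, and the delay values are invented, and the median ordering is a suggested variation rather than the method used in this article.

```python
import pandas as pd

# Toy stand-in for flown_noNA; airline codes and delays are invented.
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "AA", "DL", "DL", "DL", "WN", "WN", "WN"],
    "TOTAL_DELAY": [900, 5, 10, 60, 70, 80, 30, 35, 40],
})

# Ordering used in the article: first appearance after sorting by delay,
# which ranks each airline by its single largest delay.
order_by_max = (
    toy_flights.sort_values(by="TOTAL_DELAY", ascending=False)["AIRLINE"].unique()
)

# Alternative: rank airlines by their median delay, which is robust to outliers.
order_by_median = (
    toy_flights.groupby("AIRLINE")["TOTAL_DELAY"]
    .median()
    .sort_values(ascending=False)
    .index.to_numpy()
)

print("By largest delay:", list(order_by_max))
print("By median delay: ", list(order_by_median))
```

Either array can be passed to the order argument of sns.violinplot. In the toy data the two orderings disagree because a single 900-minute flight puts "AA" first in the max-based ordering, even though its typical delay is the smallest.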
Delay time vs. day of the week
In a similar manner, we can also plot the log(Total_Delay) values for each day of the week. As in the previous example, the plot we build here lists first the day of the week on which the longest delays occurred, while the day of the week with the shortest delays is plotted on the right-most portion of the X-axis.
# Reuse the already sorted dataframe to order days by decreasing delay
descending_DayOfWeek = sorted_delay['DAY_OF_WEEK'].unique()
# Generate plot of delays sorted by decreasing delay
delay_plot = sns.violinplot(sorted_delay, x="DAY_OF_WEEK", y=log_delay, order=descending_DayOfWeek)
delay_plot.set_xticklabels(delay_plot.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.xlabel("Day of Week")
plt.ylabel("log(Total delay)")
Interestingly, the plot suggests that traveling on the fifth day of the week (Friday) is associated with the longest flight delays. Subsequently, Saturday and Sunday are associated with the next longest flight delays, and weekday flights tend to experience the shortest total flight delays.
At first, this may not seem surprising since Friday and the weekends are generally associated with higher volume traffic in airports. It may be tempting, then, to blame the increased delay length on the increased number of flights.
But is this really the case? To validate our assumption, we’d need to look at the number of flights recorded on each day of the week. We can easily do this using Seaborn’s countplot function.
# Count DAY_OF_WEEK occurrences, plot and label
ax = sns.countplot(sorted_delay, x="DAY_OF_WEEK")
ax.bar_label(ax.containers[0])
Interestingly, the largest number of flights within the dataset actually occurred on Thursday (DoW=4), followed by Monday (DoW=1), then Friday (DoW=5). This tells us that our initial assumption is (at least partially) incorrect: the length of flight delays on a given day does not simply track the number of flights occurring on that day. If it did, Thursday’s flight delays should be the highest, but they came in fourth highest during the week!
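The comparison between flight volume and delay length can also be made numerically. The following is a rough sketch on invented data (toy_flights and its values are assumptions, not results from the real dataset): it tabulates the flight count and median delay for each day of the week and computes a Spearman rank correlation between the two, here obtained as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy stand-in for flown_noNA; values are invented for illustration only.
toy_flights = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 7],
    "TOTAL_DELAY": [20, 25, 30, 15, 20, 10, 20, 18, 22, 25, 30,
                    60, 70, 90, 50, 40, 45],
})

# Flight count and median delay for each day of the week.
per_day = toy_flights.groupby("DAY_OF_WEEK")["TOTAL_DELAY"].agg(
    n_flights="count", median_delay="median"
)

# Spearman rank correlation, computed as the Pearson correlation of ranks.
rank_corr = per_day["n_flights"].rank().corr(per_day["median_delay"].rank())

print(per_day)
print(f"Rank correlation between volume and median delay: {rank_corr:.2f}")
```

A value near +1 would support the idea that busier days have longer delays, while a value near zero or below would argue against it. With only seven days to compare, any such coefficient should be read as a rough indication rather than a firm result.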
Of course, we need to keep in mind that our conclusions are based on a reduced dataset, whose contents depend on our cleaning methods and assumptions.
Delay cause
Alternatively, we can query the database directly to find the root cause of the total delays. That is, we can use our cleaned dataset to plot DELAY_TYPE as listed in the database to find the leading ‘offender’ in flight delays.
# Generate plot of delay type for most delayed airline
largest_delay_airline = sorted_delay['AIRLINE'][0]
largest_delay_by_airline = sorted_delay[sorted_delay['AIRLINE']==largest_delay_airline]
# Retain only flight delay data
trimmed_df = largest_delay_by_airline[['AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY','TOTAL_DELAY']]
delay_melt = trimmed_df.melt('TOTAL_DELAY', var_name='TYPE', value_name='LENGTH')
# Sort data by delay length
sorted_by_delay_len = delay_melt.sort_values(by='LENGTH',ascending=False).reset_index()
sorted_by_delay_len['TYPE'] = [x.replace('_DELAY','').replace('_',' ').capitalize() for x in sorted_by_delay_len['TYPE']]
descending_Type = sorted_by_delay_len['TYPE'].unique()
sorted_by_delay_len['LENGTH'] = np.log(sorted_by_delay_len['LENGTH'])
# Display violin plot of delay type sorted by decreasing delay length
delay_plot = sns.violinplot(sorted_by_delay_len, x="TYPE", y="LENGTH",order=descending_Type)
plt.xlabel("Delay type")
plt.ylabel("log(Delay length)")
plt.title(largest_delay_airline + " delay causes")
The results show that “Airline delays” are the leading cause, followed by late aircraft and then weather delays. While not terribly helpful, the plot does show that late aircraft and weather are among the top three causes of flight delays. But this is where our current analysis ends (for now), and we leave off with some remaining questions: What constitutes an “Airline delay”? What causes “late aircraft”? How are weather and late aircraft associated?
These are questions that remain outside the scope of this course given our interest in developing an AI/ML approach to data analysis.
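As a numeric complement to the violin plot of delay causes, the sketch below computes the share of all delay minutes attributed to each cause. It uses invented data (toy_delays and its minute values are assumptions for illustration), with the column names taken from the dataframe above. Working with sums also sidesteps the logarithm, which is undefined for any zero-minute entries the delay columns may contain.

```python
import pandas as pd

delay_columns = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY",
]

# Toy stand-in for the single-airline dataframe; minutes are invented.
toy_delays = pd.DataFrame(
    [[10, 0, 30, 20, 0],
     [0, 0, 45, 15, 0],
     [5, 0, 0, 60, 25],
     [0, 5, 25, 0, 10]],
    columns=delay_columns,
)

# Total minutes attributed to each cause, as a share of all delay minutes.
minutes_by_cause = toy_delays.sum().sort_values(ascending=False)
share_by_cause = minutes_by_cause / minutes_by_cause.sum()

print(share_by_cause.round(2))
```

On the real data, the same calculation could be applied to the delay columns of largest_delay_by_airline to see whether the ranking by total minutes agrees with the ranking suggested by the violin plot.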
Exercises for the reader
Of course, there are many other ways to process the same dataset! For instance, we could instead focus on understanding cancelled flights and their cause, or we could also look at other factors such as:
- Delay causes across each airline
- Delay length across different geographic regions
- Delay cause across different geographic regions
- Delay as a function of flight direction
- Delay as a function of time of year
- . . .
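As a possible starting point for the first exercise in the list above, here is a minimal sketch that compares delay causes across airlines. The dataframe toy_flights, the airline codes, and the minute values are invented for illustration; only the column names come from the dataset used in this article.

```python
import pandas as pd

delay_columns = ["AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY"]

# Toy stand-in for flown_noNA; airline codes and minutes are invented.
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "WN", "WN"],
    "AIRLINE_DELAY": [30, 50, 10, 20, 5, 15],
    "LATE_AIRCRAFT_DELAY": [20, 0, 40, 30, 10, 0],
    "WEATHER_DELAY": [0, 10, 0, 0, 25, 35],
})

# Mean delay per cause for every airline (rows: airlines, columns: causes).
mean_delay_by_airline = toy_flights.groupby("AIRLINE")[delay_columns].mean()

# The cause with the largest mean delay for each airline.
leading_cause = mean_delay_by_airline.idxmax(axis=1)

print(mean_delay_by_airline)
print(leading_cause)
```

The same groupby pattern should carry over to the other exercises by swapping "AIRLINE" for a different grouping column, provided such a column exists in your copy of the data.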
Conclusion
In this tutorial, we used a “pre-existing” dataframe that had already been constructed, cleaned, and sanity checked to illustrate preliminary analysis on a large publicly available dataset. Walking through the dataset, we were able to ask ourselves questions related to the data, formulate hypotheses, and use other types of analysis to validate or invalidate our conclusions.
Although the analysis feels incomplete, this leaves much room for you to explore! What questions did you come up with, and which questions were you able to answer? What insights were you able to extract from the available data? Leave your answers in the comments, and happy analyzing!