Introduction

In our tutorial on flight delay analysis, we downloaded data from Kaggle. By ingesting a number of different data sources, we pieced together a large, coherent dataset containing ~1.3M flights. Each flight in the dataset was further characterized by a major airline carrier, departure / arrival information, and delay type. Using different supervised learning models, we were able to predict the type of flight delay for any given flight with a relatively low accuracy (64%). Although that number does not seem impressive, it is still 44 percentage points better than randomly guessing the type of flight delay. Based on our model’s performance, we tried tuning the hyperparameters to enhance prediction accuracy, but were ultimately unable to improve on it.

The tutorial presented in this article discusses a different but related topic: weather. Specifically, we’ll look at a unique dataset, one I collected myself. The dataset comprises METAR (Meteorological Aerodrome Report) data from airports in 48 states (AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, and WY) from January 1, 2016 through December 31, 2022.

With data collection complete, we’ll begin by ingesting and cleaning the data. Unlike the previous analysis, this article will not have a counterpart discussing how the same dataset might be processed using conventional signal processing methods. Similar to the previous article, there’s much to unpack, so let’s jump right in!

Understanding the data

Similar to the flight delay data, we’ll begin by reading in the CSV file, creating a copy, and printing its contents. In this dataset, we have 6,349,496 data points, each characterized by 32 features.

import pandas as pd

# Load dataset
raw_data = pd.read_csv('./state_metars/all_metars.csv')
dataset = raw_data.copy()
print('\nColumns:', list(dataset.columns))
print('\nDataset size:',dataset.shape)
Columns: ['Station_ID', 'Timestamp', 'Longitude', 'Latitude', 'Temperature_F', 'Dew_point_F', 'Relative_humidity', 'Wind_direction_deg', 'Wind_speed_kts', 'One_hour_precip', 'Pressure_altimeter', 'Sea_level_pressure', 'Visibility_mi', 'Wind_gust_kts', 'Sky_cover_L1', 'Sky_cover_L2', 'Sky_cover_L3', 'Sky_cover_L4', 'Sky_alt_L1_ft', 'Sky_alt_L2_ft', 'Sky_alt_L3_ft', 'Sky_alt_L4_ft', 'Weather_codes', 'ice_accretion_1hr', 'ice_accretion_3hr', 'ice_accretion_6hr', 'peak_wind_gust', 'peak_wind_drct', 'peak_wind_time', 'Apparent_temp_F', 'METAR', 'Snow_depth']

Dataset size: (6349496, 32)

Based on our previous example, however, we also know we’re going to do our best to remove redundant data, consolidate columns, and clean up the data as much as possible before feeding it to our learning model. For this dataset, there’s no shortage of cleaning to do.

Date/time information

Date and time information is always high on my list of items to check right away. Once we correct the datetime data, we can use plots and other visualizations to sanity check our work and look for obvious outliers or weird behavior in the data (e.g. ‘jumps’ or discontinuities).

import numpy as np
from datetime import datetime

# Remove known outliers
fix_dat = dataset.drop(dataset[dataset.Timestamp == "2018"].index)
fix_dat = fix_dat.dropna(subset = ['Timestamp','METAR'])

# Parse each timestamp into a datetime object
dt_info = [datetime.strptime(str(x),'%Y-%m-%d %H:%M') for x in fix_dat.Timestamp]

# Collapse YEAR, MONTH, and DAY into a single day-of-year column
fix_dat['DoY'] = [x.timetuple().tm_yday for x in dt_info]
fix_dat['Hour'] = [x.hour for x in dt_info]

This dataset is no different. When we try to convert the Timestamp column to Python datetime objects, we’re alerted to at least one outlier: a timestamp simply labeled “2018”. With no other information to go on, we’ll remove this data point along with any rows containing NaN values in the Timestamp or METAR columns. We’ll then convert each remaining, consistently formatted timestamp entry to a datetime object and use it to calculate a day-of-year (DoY) parameter. Doing so enables us to collapse three columns into one. For fun, we’re keeping the Hour data, because we don’t know whether the time of day has an effect on fog presence (though we suspect it does, since fog tends to form in the evenings, at night, and in the early morning).

Extract weather data

In the introduction, we mentioned that instead of downloading a ready-made dataset from Kaggle, I built a scraper to collect data from various online resources. The collection scripts attempted to extract meaningful weather data from automated METAR reports, but ran into trouble. The truth is, METAR reports are difficult to parse because their length, content, and format can change based on the current weather and other parameters.

To make sense of the METARs, we’ll begin by defining two lists: the first holds known weather conditions (wx_cnd) and the second holds known weather descriptors (wx_dsc), like “BL” for blowing. We’ll then iterate through each METAR: after splitting the report into its elements, we search each element for a known weather condition and capture the first match. This leaves us with one condition per report in the list metar_obs.

# Weather conditions and descriptors
wx_cnd = ['BR', 'CLR', 'DU', 'DZ', 'FG', 'FU', 'GR', 'GS', 'HZ', 'PL', 'PO', 'RA', 'SA', 'SG', 'SN', 'TS', 'UP']
wx_dsc = ['BC', 'BL', 'DR', 'FC', 'FZ', 'MI', 'N', 'PR', 'SH', 'SQ', 'VC']

def extract_wx(metar):
    # Skip the leading station identifier, then split the METAR into elements
    spc_idx = metar.find(' ')+1
    sh_met = metar[spc_idx:].split(' ')
    # Grab the first element containing a known weather condition
    match = next((s for s in sh_met if any(xs in s for xs in wx_cnd)), None)
    # No match (or an explicit CLR) means clear conditions
    if (match is None) or (match == 'CLR'):
        return 'Clear'
    return match
    
metar_obs = [extract_wx(x) for x in fix_dat['METAR']]

With each weather condition in hand, we still need to decipher the strings we’ve extracted, and order matters. Some reports pack multiple weather conditions into a single element; for instance, “BLRAFG” stands for “blowing rain and fog”. If we told our search to first replace strings containing the abbreviation for rain (RA), we would eliminate a datapoint for fog, and replacing fog first loses the rain label instead. Since our data comprise far more elements with rain than fog, it is better to replace the fog-like codes first and preserve the rarer events for the analysis. Other labels will never overlap; for instance, clear weather (CLR) will never be concurrent with rain (RA) or fog (FG), so we can overwrite those immediately.

# Replace clear labels
metar_clear = [('Clear' if ('CLR' in s) else s) for s in metar_obs]
# Replace fog-like phenomena
foglike = ['FG','BR','HZ']
metar_fog = [('Fog' if any(ext in s for ext in foglike) else s) for s in metar_clear]
# Replace rain-like phenomena
rainlike = ['DZ','RA','TS']
metar_rain = [('Rain' if any(ext in s for ext in rainlike) else s) for s in metar_fog]
# Replace snow-like phenomena
snowlike = ['GR','GS','PL','SG','SN']
metar_snow = [('Snow' if any(ext in s for ext in snowlike) else s) for s in metar_rain]
# Other
dustlike = ['DU','FU','PO','SA','UP']
metar_othr = [('Other' if any(ext in s for ext in dustlike) else s) for s in metar_snow]

In the code above, we group similar phenomena together. Fog, mist (BR), and haze (HZ) are similar by definition: each is a phenomenon of suspended water droplets. Similarly, drizzle (DZ), rain (RA), and thunderstorms (TS) are all forms of large water droplets hurtling toward the Earth, so we group them together as well. Frozen water comes in many forms, and hail (GR), small hail (GS), ice pellets (PL), snow grains (SG), and snow (SN) are grouped together too. Finally, we group the remaining unrelated items, dust (DU), smoke (FU), sand whirls (PO), sand (SA), and unknown precipitation (UP), under “Other”. Using these groups, we apply a broad label to each METAR and append the list to our dataframe as the Condition column.

# Count instances of each category and list the unique labels
for x in ['Fog','Rain','Snow','Clear','Other']:
    print(x+':', metar_othr.count(x))
print('Unique weather conditions:', list(set(metar_othr)))

# Add weather condition column onto existing dataframe
repl_wx = fix_dat.copy()
repl_wx['Condition'] = metar_othr

For fun, we’ll count the number of instances of each category in our dataset:

Fog: 392025
Rain: 637826
Snow: 279883
Clear: 5023260
Other: 16479
Unique weather conditions: ['Fog', 'Clear', 'Other', 'Rain', 'Snow']

Spring cleaning

At this point, we’ll begin to remove data. Below, I’ve listed the columns we’re removing and the reasons for removing each column.

  • Station_ID: The station’s ID name has no effect on fog.
  • Timestamp: Redundant, given the extracted day-of-year and hour information.
  • Wind_direction_deg: Fog depends on the presence of wind, not its direction.
  • Wind_gust_kts: We only need the wind magnitude (Wind_speed_kts).
  • Sky_cover_* / Sky_alt_*: We’re keeping only the level lowest to the ground (L1); L2 through L4 are dropped.
  • Weather_codes: Redundant with the Condition category we extracted from the METARs.
  • ice_accretion_*: If accretion has an effect, we only need the most recent window (the 1 hr column).
  • peak_wind_*: Includes the drct, gust, and time parameters; these are redundant at our temporal resolution.
  • METAR: Also redundant, now that we’ve interpreted each weather category from the METAR value.
  • Apparent_temp_F: Our analysis depends on the outside temperature, not the temperature we perceive it to be.

Right now, my hypothesis is that ice_accretion_1hr and Snow_depth won’t matter much, but we’ll keep them in just in case. Removing the columns is simple enough, and we’ll take a look at the remaining dataframe.

simp_df = repl_wx.drop(columns=['Station_ID', 'Timestamp', 'Wind_direction_deg', 'Wind_gust_kts',
                                'Sky_cover_L3', 'Sky_cover_L4', 'Sky_alt_L3_ft', 'Sky_alt_L4_ft',
                                'Weather_codes','peak_wind_gust', 'peak_wind_drct', 'peak_wind_time',
                                'METAR','ice_accretion_3hr','ice_accretion_6hr','Sky_cover_L2',
                                'Sky_alt_L2_ft','Apparent_temp_F'])
print(simp_df.head())
   Longitude  Latitude  Temperature_F  Dew_point_F  Relative_humidity  \
0   -86.2481   34.2316          41.00        39.20              93.24
1   -85.9613   32.9143          50.36        46.58              86.80
2   -87.6718   30.2905          54.50        51.80              90.57
3   -86.3902   31.3061          57.20        51.80              82.12
4   -86.2481   34.2316          39.20        37.40              93.19

   Wind_speed_kts  One_hour_precip  Pressure_altimeter  Sea_level_pressure  \
0             6.0              0.0               30.25                 NaN
1             5.0              0.0               30.23                 NaN
2             5.0              0.0               30.18                 NaN
3             9.0              0.0               30.19                 NaN
4             6.0              0.0               30.25                 NaN

   Visibility_mi Sky_cover_L1  Sky_alt_L1_ft  ice_accretion_1hr  Snow_depth  \
0           10.0          CLR            NaN                NaN         NaN
1           10.0          CLR            NaN                NaN         NaN
2           10.0          BKN         2700.0                NaN         NaN
3           10.0          FEW         1000.0                NaN         NaN
4           10.0          CLR            NaN                NaN         NaN

   DoY  Hour Condition
0    1     0     Clear
1    1     0     Clear
2    1     0     Clear
3    1     0     Clear
4    1     0     Clear

Adding intuition-based values

Snow depth / ice accretion

Pulling all unique values of simp_df‘s ice_accretion_1hr column produces non-zero values with one exception: NaN. For the rows where this column is NaN, the Temperature_F values are all at or above the freezing point (32 °F). It stands to reason, then, that we can set the NaN values in ice_accretion_1hr, and in Snow_depth as well, to 0.0.
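
A quick sanity check of that claim might look like the sketch below (using the column names above; to_numeric guards against the column still being read as strings):

# If the claim holds, the coldest temperature among the rows that are
# missing ice accretion data should sit at or above freezing (32 deg F)
mask = simp_df['ice_accretion_1hr'].isnull()
print(pd.to_numeric(simp_df.loc[mask, 'Temperature_F'], errors='coerce').min())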

Sky altitude

The Sky_alt_L1_ft column refers to the altitude of the lowest cloud layer when clouds are present. When the sky cover for the lowest cloud layer (Sky_cover_L1) reports clear skies (“Clear”, or CLR), there is no lowest layer of clouds to speak of, hence the NaN values. Using this knowledge, we can set the NaN values to something very, very high (e.g. 10,000 ft) when Sky_cover_L1 is clear.

Sea level pressure

The Sea_level_pressure column also contains NaN values, mixed in with valid readings. This one is harder to reason around: the only standard value is 1013.25 millibars at sea level, and as you gain or lose altitude relative to sea level, the atmospheric pressure (how much ‘atmosphere’ is weighing down on you) changes accordingly. Although it’s possible to calculate sea level pressure from the altimeter value, observation altitude, and temperature, our dataframe does not include the altitude of each station. Instead of filling in incorrect values, we’ll drop this column. It’s worth noting, however, that we could use additional Python tools to find the altitude from our Lat/Lon coordinates; we will not be doing that here.
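
For the curious, had station altitudes been available, one common approximation for the reduction looks like the sketch below (the function and its inputs are illustrative, not part of the original analysis; the altimeter setting would also need converting to station pressure in hPa first):

def sea_level_pressure_hpa(station_p_hpa, altitude_m, temp_c):
    # Common barometric reduction of station pressure to sea level;
    # altitude_m is exactly the value our dataset is missing
    return station_p_hpa * (1 - (0.0065 * altitude_m) /
                            (temp_c + 0.0065 * altitude_m + 273.15)) ** -5.257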

# Copy incoming dataframe
snw_ice_df = simp_df.copy()

# Convert measurement columns to float
float_cols = ['Latitude','Longitude','Temperature_F','Dew_point_F','Relative_humidity',
              'Wind_speed_kts', 'One_hour_precip', 'Pressure_altimeter', 'Sea_level_pressure',
              'Visibility_mi']
snw_ice_df = snw_ice_df.astype({c: float for c in float_cols})

# Correct NaN values for snow depth & ice accretion (above-freezing rows only)
snw_ice_df.loc[(snw_ice_df['Temperature_F'] >= 32.0) & (snw_ice_df['ice_accretion_1hr'].isnull()), 'ice_accretion_1hr'] = 0
snw_ice_df.loc[(snw_ice_df['Temperature_F'] >= 32.0) & (snw_ice_df['Snow_depth'].isnull()), 'Snow_depth'] = 0

# Correct NaN values for sky altitude
snw_ice_df.loc[(snw_ice_df['Condition'] == 'Clear') & (snw_ice_df['Sky_cover_L1'].isnull()), 'Sky_cover_L1'] = 'CLR'
snw_ice_df.loc[(snw_ice_df['Sky_cover_L1'] == 'CLR') & (snw_ice_df['Sky_alt_L1_ft'].isnull()), 'Sky_alt_L1_ft'] = 10000
snw_ice_df.loc[(snw_ice_df['Condition'] == 'Clear') & (snw_ice_df['Sky_alt_L1_ft'].isnull()), 'Sky_alt_L1_ft'] = 10000

# Drop sea level pressure (unrecoverable), plus the accretion and snow depth columns
snw_ice_df = snw_ice_df.drop(columns=['ice_accretion_1hr','Sea_level_pressure','Snow_depth'])

# Print remaining NaN counts by column
print('Remaining NaN values by column:\n', snw_ice_df[snw_ice_df.columns[snw_ice_df.isnull().any()]].isnull().sum())

Remaining NaN values by column:
Temperature_F          90286
Dew_point_F           160007
Relative_humidity     168543
Wind_speed_kts         89939
One_hour_precip       270770
Pressure_altimeter     22014
Visibility_mi         136739
Sky_cover_L1           20453
Sky_alt_L1_ft          20591

Handling missing values

Despite our best efforts, it seems we still have some NaN values to deal with, and at this point we need to make a decision. We can drop every row containing a NaN value; because the missing values overlap heavily across columns, we would still preserve 5,739,274 rows. On the other hand, we could use K-Nearest Neighbors to impute the missing numerical data from all the existing data; a sketch of that approach is shown after the cleanup below, but on a dataset this size it is very time consuming. In the interest of finishing our analyses, let’s skip the imputation in favor of dropping the rows.

# Delete rows containing NaN values
clean_df = snw_ice_df.dropna()
print(clean_df.shape)
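
For reference, the imputation route might have looked something like the sketch below, using scikit-learn’s KNNImputer (the n_neighbors value is an arbitrary choice here, and only the numeric columns are imputed):

from sklearn.impute import KNNImputer

# Impute only the numeric columns; Sky_cover_L1 and Condition are strings
num_cols = snw_ice_df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
imputed_df = snw_ice_df.copy()
imputed_df[num_cols] = imputer.fit_transform(imputed_df[num_cols])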

Prepare dataframe for ML model

With no NaN values remaining in our dataframe, our last chore is to convert all string values to numerical values, specifically in the columns that record sky cover and weather condition. This is done below as the first step.

def convert_str_to_num(df, col):
    # Map each unique string in `col` to an integer index and return the
    # modified dataframe along with the (label, index) key
    new_df = df.copy()
    var = new_df[col].unique()
    indices = np.arange(len(var))
    var_key = list(zip(var, indices))
    for k in indices:
        new_df.loc[new_df[col] == var[k], col] = k
    return new_df, var_key

skycov_df,skycov_key = convert_str_to_num(clean_df,'Sky_cover_L1')
dataset,condtn_key = convert_str_to_num(skycov_df,'Condition')
print(dataset.head())

# Unzip the (label, index) pairs; uniqlbl holds the class names for plotting
uniqlbl,uniqval = map(list,zip(*condtn_key))

import seaborn as sns
import matplotlib.pyplot as plt

# Plot a graphical correlation matrix for each pair of columns in the dataframe
corr2 = dataset.corr(numeric_only=True)

plt.figure(figsize = (6,4))
sns.heatmap(corr2, annot=False)
plt.show()

The correlation matrix illustrates that none of our features are highly correlated with one another; the possible exception is temperature and dewpoint, but that is a natural relationship. We’ll continue to treat them as independent variables and move on with the analysis.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
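
One step the imports above assume but the original code never shows is the train/test split itself. A minimal version might look like this (the Condition column as the target, an 80/20 split, and the names X, y, and feature_list are my assumptions):

# Assumed setup: all remaining columns as features, Condition as the target
X = dataset.drop(columns=['Condition']).astype(float)
y = dataset['Condition'].astype(int)
feature_list = list(X.columns)  # feature names, used later for the importance plot

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)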

# --------- PLOT TOP ROW OF CLASSIFIERS ---------
classifiers = {
    "Dec. Tree": DecisionTreeClassifier(),
    "Rnd Forest": RandomForestClassifier(n_estimators = 25, criterion = 'gini'),
    "Naive Bayes": GaussianNB(),
    "Log. Reg.": LogisticRegression(),
    "KNN": KNeighborsClassifier(3),
}

f, axes = plt.subplots(1, 5, figsize=(12,6), sharey='row')


for i, (key, classifier) in enumerate(classifiers.items()):
    y_pred = classifier.fit(X_train, y_train).predict(X_test)
    cf_matrix = confusion_matrix(y_test, y_pred, normalize='true')
    
    disp = ConfusionMatrixDisplay(cf_matrix,
                                  display_labels=uniqlbl,
                                 )
    disp.plot(ax=axes[i], xticks_rotation=45,values_format='.2f')
    disp.ax_.set_title(key)
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

f.text(0.4, 0.2, 'Predicted label', ha='center')
plt.rcParams["axes.grid"] = False
plt.style.use('dark_background')
sns.set_color_codes("muted")
plt.grid(False)
plt.tight_layout()
plt.show()

which produces five confusion matrices, color-coded by value.

Model Metrics

From the plots, it’s clear that Naive Bayes and Logistic Regression are sub-optimal choices for this problem. On the other hand, Decision Trees seem to give us the best performance, although the model’s largest confusion, between Clear days and Snow days, is a bit puzzling.

Scores of the individual models can also be compared using bar plots and other predictability metrics like Accuracy, Log Loss, F1 score, and recall; a short sketch of computing these follows the list below. What’s the difference between these?

  • Accuracy: The fraction of test-set predictions the model gets right (%).
  • Log loss (“cross-entropy”): Assesses classification performance by penalizing the gap between the true labels and the predicted probabilities.
  • Precision: Of the samples the classifier labels positive, the fraction that actually are positive.
  • Recall score: Of the samples that actually are positive, the fraction the classifier correctly identifies.
  • F1 score: The harmonic mean of precision and recall, balancing the contribution of each.
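
As a sketch of how these might be computed for any one of the fitted models (scikit-learn’s metrics module provides them all; the weighted averaging is my choice for the multi-class case):

from sklearn.metrics import accuracy_score, f1_score, log_loss, precision_score, recall_score

model = classifiers["Dec. Tree"]       # already fitted in the loop above
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)   # class probabilities, needed for log loss

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Log loss: ', log_loss(y_test, y_prob))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall:   ', recall_score(y_test, y_pred, average='weighted'))
print('F1 score: ', f1_score(y_test, y_pred, average='weighted'))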

The performance metrics for the DT classifier can be shown together in a report for conciseness:

from sklearn.metrics import classification_report

classifier = DecisionTreeClassifier()
y_pred = classifier.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred, target_names=uniqlbl))

resulting in the following table:

Although it’s great to be able to predict each of the categories correctly to a high degree, our current interest is mostly in fog-like phenomena. Across the board, our model’s precision, recall, and f1-score for fog come in at 93%. This is great, considering that the fog-like labels account for only 6.2% of the total samples (392,025 of 6,349,496).

Feature significance

For our purposes, a precision of 93% is great, so we’ve no need to tweak the model further. We can still probe deeper, however, by examining feature importance, both to see whether we can throw out any data columns and to gain some intuition about what the decision tree model was cueing off of. We can do that by invoking Scikit-Learn’s permutation_importance function.

import time
from sklearn.inspection import permutation_importance

# Create a classifier using all the feature variables
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Monitor feature importance for each permutation
result = permutation_importance(
    dt_model, X_test, y_test, n_repeats=10, n_jobs=4
)
feature_importances = pd.Series(result.importances_mean, index=feature_list)

# Plot feature importance using permutation on full model
fig, ax = plt.subplots()
feature_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()

which produces the following bar chart:

Interestingly, the top three features were determined to be visibility, dewpoint, and longitude (how far east or west you are on the Earth). Visibility seems somewhat intuitive; visibility distances are long in clear weather but become very short during rain, fog, snow, and dust storms, so this parameter by itself can’t separate fog from the other phenomena. The second feature, dewpoint, also seems intuitive, though we’d expect it to be intimately tied to temperature, since the dewpoint spread is heavily associated with the presence of fog. The third feature, longitude, is certainly more intriguing and will take some time to think through.

Finally, it’s worth noting that there are no insignificant features in the existing dataset: each feature enhanced the model’s accuracy, precision, and F1 score, whether it was the day of the year (DoY), the time of day (Hour), or even the pressure altimeter.

Conclusion

We began our analysis by scraping various internet sources for METAR data, accumulating more than 6 million datapoints, each characterized by 32 features. During the cleaning process, we reduced the feature count to just 13 parameters through intuition and feature engineering. We had hoped to compute sea level pressure by combining the temperature and Pressure_altimeter columns, but without station altitudes we were unable to do so.

Nonetheless, we wound up with a very strong model producing 93% precision for fog-like phenomena, including fog, mist, and haze. With such a high degree of precision, this model is ready to be saved out or loaded into an API for future use.
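
As a final, hypothetical step, persisting the fitted model is a one-liner with joblib (the filename is arbitrary):

import joblib

# Serialize the fitted decision tree; reload later with joblib.load()
joblib.dump(dt_model, 'fog_dt_model.joblib')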