Project Report

1 Introduction

1.1 Motivation

Most of us have experienced waiting at the airport for a delayed flight. As the holiday season approaches, departing on time becomes more and more challenging. This year, more than 4,000 flights have already been delayed as holiday travel spikes in the U.S. Flight delays are a particular concern at busy metropolitan airports like JFK.

JFK Airport in 2022 holiday season: We can’t stress this enough — plan ahead and arrive early

Source: Twitter@JFKairport

Using flights departing from JFK as an example, we explore the factors that are potentially related to flight delays (e.g., airlines, weather, and COVID conditions). By doing so, we hope to give holiday travelers a better sense of what to expect regarding flight delays this holiday season, so that they can plan their travel accordingly.

1.3 Initial Questions

In the draft project proposal, we planned to review the three airports serving New York City, i.e., JFK, LGA, and EWR. The initial questions focused on comparisons among these three airports, such as restaurants, shops and stores, lounges, and other facilities/services.

However, considering the availability and scale of the accessible data, we switched our main topic to JFK Airport’s delay and cancellation data from 11/1/2021 to 1/31/2022. Weather and COVID case data for the same time range were collected accordingly. Based on this information, our final questions are below:

What are the key trends of the flights at JFK during the last holiday season?
What are the potential factors that contribute to flight delays or cancellations at JFK during the last holiday season?
Are there significant associations of multiple factors with JFK flight delays and cancellations during the last holiday season?

2 Data Sources and Cleaning

We used three primary datasets for exploration, visualization, and statistical analysis, and an additional supplemental dataset for interactive mapping.

2.1 Delay and Cancellation

Flight delay and cancellation data for JFK departures were obtained from the Bureau of Transportation Statistics (BTS). We first wrote a function to iterate the reading-in process over each airline of interest from November 2021 to January 2022. Some new variables were created for subsequent analysis, including the scheduled departure hour, year, month, day, etc. Delay minutes were calculated as the difference between the actual and scheduled departure times. Flights with an actual elapsed time of 0 minutes were treated as cancellations. Air carrier codes were recoded as airline names for clarity.
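The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the midnight-rollover rule are hypothetical and are not taken from the BTS files or from our actual cleaning code.

```python
from datetime import datetime

# Hypothetical records; field names are illustrative, not BTS column names.
flights = [
    {"carrier": "DL", "date": "2021-11-24", "scheduled": "08:15", "actual": "08:47", "elapsed_min": 372},
    {"carrier": "B6", "date": "2021-12-27", "scheduled": "17:30", "actual": "17:30", "elapsed_min": 0},
    {"carrier": "AA", "date": "2022-01-03", "scheduled": "23:50", "actual": "00:20", "elapsed_min": 195},
]

# Carrier codes recoded as airline names.
AIRLINE_NAMES = {"DL": "Delta Air Lines", "B6": "JetBlue Airways", "AA": "American Airlines"}


def tidy(record):
    """Derive date parts, scheduled hour, delay minutes, and a cancellation flag."""
    sched = datetime.strptime(f"{record['date']} {record['scheduled']}", "%Y-%m-%d %H:%M")
    actual = datetime.strptime(f"{record['date']} {record['actual']}", "%Y-%m-%d %H:%M")
    delay = (actual - sched).total_seconds() / 60
    # Assumption: a departure more than 12 hours "early" really left after midnight.
    if delay < -720:
        delay += 1440
    # Rule from the text: actual elapsed time of 0 minutes means cancellation.
    cancelled = record["elapsed_min"] == 0
    return {
        "airline": AIRLINE_NAMES.get(record["carrier"], record["carrier"]),
        "year": sched.year,
        "month": sched.month,
        "day": sched.day,
        "scheduled_hour": sched.hour,
        "delay_min": None if cancelled else delay,
        "cancelled": cancelled,
    }


tidied = [tidy(r) for r in flights]
for row in tidied:
    print(row)
```

Cancelled flights get no delay value in this sketch, so they would drop out of the delay analysis and be counted only in the cancellation analysis.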

2.2 Weather

Hourly and daily weather information from 11/1/2021 to 1/31/2022 was obtained from the National Oceanic and Atmospheric Administration (NOAA). We selected the specific zip code for JFK and requested the raw data thereof. To align the time unit of the weather data with the flight information, and considering data availability, we picked the observation at the 51st minute of each hour to represent the weather of that hour, and the observation at 23:59 to represent the weather condition of the day. Only the date and the hourly/daily weather conditions of interest were kept in the resulting tidy dataset.
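As a rough illustration of this alignment rule, the sketch below keeps the observation at minute 51 as the hourly value and the observation at 23:59 as the daily value. The timestamps and temperatures are made up, and the tuple layout is an assumption rather than the NOAA file format.

```python
from datetime import datetime

# Hypothetical observations: (timestamp, dry bulb temperature in °F).
observations = [
    ("2021-11-01 00:51", 55.0),
    ("2021-11-01 01:15", 54.0),  # off-schedule reading, ignored
    ("2021-11-01 01:51", 53.0),
    ("2021-11-01 23:51", 50.0),
    ("2021-11-01 23:59", 52.5),  # daily summary row
]

hourly = {}  # (date, hour) -> value observed at minute 51
daily = {}   # date -> value observed at 23:59

for stamp, temp in observations:
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    day = ts.date().isoformat()
    if ts.minute == 51:
        hourly[(day, ts.hour)] = temp
    if ts.hour == 23 and ts.minute == 59:
        daily[day] = temp

print(hourly)
print(daily)
```

Keying the hourly values by date and hour makes it straightforward to join them to flights by scheduled departure hour.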

2.3 COVID Cases

Daily COVID case counts from 11/1/2021 to 1/31/2022 were obtained via API from NYC OpenData, provided by the NYC Department of Health and Mental Hygiene (DOHMH). The resulting dataset included date, year, month, day, and daily case counts.

2.4 Airport Locations

An additional source for the interactive map was U.S. domestic airport location information from the Humanitarian Data Exchange, where we extracted the latitude and longitude for those airports with delay and/or cancellation records.

2.5 Summary of Variables of Interest

Outcomes

Delay Time: Delay time in minutes
Cancellation Count: Daily flight cancellation counts at JFK airport

Potential Predictors

Categorical Variables

Times of the Day: Morning, Noon, Afternoon, Night
Months: November, 2021; December, 2021; January, 2022
Airlines: Alaska Airlines, American Airlines, Delta Air Lines, Endeavor Air, JetBlue Airways, Republic Airways, United Air Lines
Destination Airports: 66 destination airports with records of delay/cancellation

Continuous Variables

Carrier Delay: Carrier delay in minutes
Extreme Weather Delay: Extreme weather delay in minutes
Late Arrival Delay: Late arrival delay in minutes
NAS Delay: National Aviation System (NAS) delay in minutes
Security Delay: Security delay in minutes
Temperature: Dry bulb temperature (°F)
Humidity: Relative humidity (%)
Visibility: Visibility
Wind Speed: Wind speed (mph)
COVID Cases Count: New daily cases of COVID; and, daily cases of COVID with 6 days’ lag in time

3 Exploration and Visualization

3.1 Destination Airports

There are 66 and 65 destination airports in the delay and cancellation datasets, respectively. For efficiency in the statistical analysis, we did not include destination airport as a predictor in our models, but we still kept it as one of the predictors to explore.
First, we checked whether delay and cancellation counts differ across destination airports. Flights from JFK to LAX have the most delays (2293) and flights to BGR have the fewest (6). Flights from JFK to SFO have the most cancellations (76) and flights to BZN have the fewest (1). We also examined whether different airlines show different trends in delay and cancellation counts across the destination airports, but found nothing significant.
We found that LAX and SFO stand out in delay and cancellation counts, so we decided to take a closer look at the underlying factors behind those delays and cancellations. Some interesting findings are:

The airlines that fly from JFK to the two airports are different.
Both airports show an increasing trend in delay minutes from November to January.
There is a distinct difference in cancellation counts in each scheduled hour between the two airports.

3.2 Cancellation and Delay

We then created a Shiny App for the audience to engage in our data exploration process. Users can select the airline and month they are interested in and view the corresponding outputs. In the Cancellation tab, you can see the number of cancellations and the number of COVID cases on each day of a month. In the Delay tab, you can see the number of delays and the average delay time in minutes on each day of a month.

3.3 Categorical Predictors

Delay time in minutes is one of our outcomes of interest, and we decided to fit a linear regression model. Besides the main effects, we wanted to check whether there are any significant effect modifiers in our model. In this part, we investigated the interactions between the categorical predictors, including Times of the Day, Months, and Airlines.
Our exploration and visualization process hinted that between-group differences exist, and that adding interaction terms between the categorical predictors could be one option for building the linear regression model. The following plots suggest that interactions between the categorical variables might exist, but statistical analysis is still needed to confirm these findings. In our analysis, we mainly focused on the interaction between month and airline.

3.4 Continuous Predictors

We were also interested in whether our continuous predictors, including Carrier Delay, Extreme Weather Delay, Late Arrival Delay, NAS Delay, Security Delay, Temperature, Humidity, Visibility, and Wind Speed, could have different effects across times of the day, months, or airlines. Based on the graphs, we found that there could be significant interactions between:

Carrier Delay * Airline
Temperature * Month

As a result, these interaction terms, in addition to the other predictors, were carried forward to statistical testing. Analyses we considered include model cross validation, ANOVA, and Type III analysis.

4 Statistical Analysis

4.1 Predictive Model

4.1.1 Pre-analysis Data Exploration

Since the outcome variable, delay time, is continuous, we wanted to fit a linear regression model. However, the range of the outcome (from 0 to +∞) is a problem because it is not consistent with the range of a linear function (from -∞ to +∞). To bring them into agreement, we applied a log transformation to the outcome, which maps it onto the range -∞ to +∞. This log transformation also addressed the severe skewness observed in the distribution of delay time. After this step, our outcome variable changed from delay time to log (delay time), which still represents the delay time.

NOTE: We do have plenty of outliers with extremely long delay times. However, because these extreme observations could be indicative of the underlying relationship, we chose not to exclude them. After the log transformation, the outliers were much less of a concern.

Next, we checked whether any associations exist between the independent variables and the dependent variable (i.e., log (delay time)).
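To illustrate why the log transformation helps with both skewness and extreme values, here is a small sketch on made-up delay times. The data are hypothetical, and the sketch assumes strictly positive delays, since log(0) is undefined and the handling of zero delays is not covered here.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical delay times in minutes, with two extreme outliers.
delays = [5, 8, 12, 15, 20, 30, 45, 60, 240, 900]
log_delays = [math.log(d) for d in delays]  # assumes every delay is > 0

raw_skew = skewness(delays)
log_skew = skewness(log_delays)

print(f"skewness of delay:      {raw_skew:.2f}")
print(f"skewness of log(delay): {log_skew:.2f}")
```

On the log scale the 900-minute delay sits only a few units above the typical values, which is why the outliers become far less influential without being removed.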

Using boxplots, ANOVA, and pairwise comparisons across the groups of our categorical independent variables, we found that all 3 variables are highly significantly associated with the outcome (p-value < 2e-16), and thus we decided to include these 3 categorical variables (i.e., airlines, months, time of the day) in the final model.
Using scatterplots and Pearson correlation coefficients for all continuous variables, we did not find any significant linear association between the independent variables and delay time. However, our data source is imperfect and the data collection interval is limited, both of which might bias the scatterplots as well as the correlations. We therefore identified the 2 variables with correlation coefficients greater than 0.3 (moderate correlation), carrier delay time and late arrival delay time, and decided to include them in the model for prediction purposes.
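The correlation screen for continuous predictors can be sketched as follows. The values and variable names are hypothetical; only the 0.3 cutoff comes from the text above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical outcome and candidate predictors.
log_delay = [2.1, 2.8, 3.0, 3.9, 4.4, 5.2]
predictors = {
    "carrier_delay": [0, 5, 0, 20, 35, 90],
    "humidity": [60, 45, 72, 50, 66, 58],
}

THRESHOLD = 0.3  # "moderate correlation" cutoff used in the text

correlations = {name: pearson(values, log_delay) for name, values in predictors.items()}
selected = [name for name, r in correlations.items() if abs(r) > THRESHOLD]

print(correlations)
print(selected)
```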

4.1.2 Model Fitting

From the previous data exploration and visualization, we found some interesting trends that we think are worth further inspection and that suggest potential independent variables for predicting delay time.

Given the above information, we came up with 2 rationales for building the linear regression model:

Based on observed relationship only:

Include the 5 variables we identified above.

Model 1: delay ~ airline + month + time of the day + carrier delay time + late arrival delay time

r.squared	statistic	p.value	df
0.4220695	602.9567	<2.2e-16	13

Based on both common sense and observed relationship:

In addition to the 5 variables identified above, we also hypothesized that the remaining variables (i.e., extreme weather delay time, NAS delay time, security delay time, temperature, humidity, visibility, wind speed) would affect the delay time, based on common sense and experience.

Model 2: delay ~ airline + month + time of the day + carrier delay time + late arrival delay time + extreme weather delay time + NAS delay time + security delay time + temperature + humidity + visibility + wind speed

r.squared	statistic	p.value	df
0.4524891	443.2239	<2.2e-16	20

In addition, from the previous data exploration of interactions between independent variables, we identified 3 interaction terms (i.e., Temperature * Month, Carrier Delay * Airline, Month * Airline) that could potentially be added to the final model.

So, next we did cross validation to figure out the best model.

It turns out that Model 1 is the best model: its RMSE is similar to that of the other three models, but it has the fewest parameters. For parsimony, we chose Model 1 as the final model.
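The idea of comparing models by cross-validated RMSE and preferring the smaller one when the errors are similar can be illustrated with a generic sketch. Everything here is an assumption for illustration: the data are simulated, the helper names `fit_ols` and `cv_rmse` are hypothetical, and the 5-fold scheme is not necessarily the one used in our analysis.

```python
import math
import random


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * t for row, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def cv_rmse(X, y, k=5, seed=1):
    """k-fold cross-validated RMSE, pooling squared errors over all folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            pred = sum(b * x for b, x in zip(beta, X[i]))
            sq_err += (y[i] - pred) ** 2
    return math.sqrt(sq_err / len(y))


# Simulated data: y depends on x1 only; x2 is an irrelevant extra predictor.
rng = random.Random(0)
n = 60
x1 = [rng.uniform(0, 10) for _ in range(n)]
x2 = [rng.uniform(0, 10) for _ in range(n)]
y = [2 + 0.5 * a + rng.gauss(0, 0.5) for a in x1]

X_small = [[1.0, a] for a in x1]                 # intercept + x1
X_large = [[1.0, a, b] for a, b in zip(x1, x2)]  # intercept + x1 + x2

rmse_small = cv_rmse(X_small, y)
rmse_large = cv_rmse(X_large, y)
print(f"small model RMSE: {rmse_small:.3f}")
print(f"large model RMSE: {rmse_large:.3f}")
```

When the two RMSE values are close, as they should be here because x2 carries no signal, the model with fewer parameters is the more parsimonious choice.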

4.1.3 Model Diagnostics

The last step in building the predictive model is model diagnostics, where we plot residuals against fitted values to see whether our model has a good fit and predictive power.

Unfortunately, the answer is NO.

4.2 Poisson Model

4.2.1 Pre-analysis

We wanted to investigate factors that are related to, and may be used to predict, the daily flight cancellation count, since the effects of weather and COVID-19 were our primary research interests. Along with these predictors, we included categorical and time-dependent stratification variables in the model. The monthly flight cancellation counts clearly differ across months and across airlines. Monthly flight cancellations were highest in January 2022, followed by December and November 2021. Monthly cancellation counts also vary greatly among airlines. As a result, we used month and airline as stratification factors. We first used Poisson regression to estimate the risk ratio for daily flight cancellations in order to test for these variations.

Year, Month	Alaska Airlines	American Airlines	Delta Air Lines	Endeavor Air	JetBlue Airways	Republic Airways	United Air Lines
2021-Nov	6	33	2	1	5	4	1
2021-Dec	15	22	45	4	165	2	7
2022-Jan	41	139	182	156	416	185	10

From data exploration, we also found that the time trends of daily flight cancellations and COVID-19 cases did not coincide with one another. Rises in COVID-19 cases frequently take some time to affect our daily lives, including flight cancellations. Thus, we created new variables for the COVID case count with a time lag. We used lag selection criteria and the Akaike information criterion (AIC) to select 6 days as the optimal time lag. Then, we constructed a new variable named covid_lag6, equal to the COVID-19 case count with a 6-day time lag.
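Constructing the lagged variable amounts to looking up the case count from 6 days earlier for each date. The sketch below uses made-up counts and does not reproduce the AIC-based lag selection; only the name covid_lag6 and the 6-day lag come from the text.

```python
from datetime import date, timedelta

# Hypothetical daily case counts keyed by date.
start = date(2021, 11, 1)
cases = {start + timedelta(days=i): 1000 + 50 * i for i in range(10)}

LAG_DAYS = 6

# covid_lag6 for a given day is the case count from 6 days earlier;
# the first 6 days have no lagged value and are set to None.
covid_lag6 = {day: cases.get(day - timedelta(days=LAG_DAYS)) for day in cases}

for day in sorted(covid_lag6):
    print(day.isoformat(), cases[day], covid_lag6[day])
```

Days without a lagged value would have to be dropped from the model, or the case series would need to be extended 6 days before the start of the study window.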

4.2.2 Model Fitting

Our final Poisson model was:

Terms	Estimate (log(RR))	Estimated Adjusted RR	P-value
(Intercept)	2.443	11.509	1.41e-195
temperature	-0.119	0.888	0.00e+00
humidity	0.057	1.059	0.00e+00
windspeed	0.036	1.036	2.27e-154
covid_lag6	0.000	1.000	2.37e-05

To test the model in different months and airlines, we also conducted a stratified analysis. For illustration, we graphed the risk ratios (i.e., the exponentiated coefficients) that show the difference in daily flight cancellations at JFK airport by month and by airline. Along with the p-values, we also provided 95% confidence intervals, calculated by exponentiating the coefficient plus or minus 1.96 times its standard error.
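As a worked example of the exponentiation step, the sketch below converts a Poisson coefficient and its standard error into an exponentiated estimate with a 95% confidence interval. The coefficient -0.119 is the temperature estimate from the table above; the standard error of 0.002 is a hypothetical value, since standard errors are not reported here, and the function name is illustrative.

```python
import math


def exp_coef_ci(coef, se, z=1.96):
    """Exponentiated coefficient with a Wald interval: exp(coef ± z * se)."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Temperature coefficient from the table; the standard error is hypothetical.
estimate, lower, upper = exp_coef_ci(-0.119, 0.002)

print(f"exp(coef) = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the exponentiated estimate and can never fall below zero.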

5 Main Discussions

5.1 Predictive Model

The data source does not serve well for predicting delay time, probably because of the short data collection interval. Our dataset was collected between November 2021 and January 2022, so only 3 months are involved, and the weather is similar across these 3 months. This could explain why we observed the counterintuitive result of no association between delay time and the weather-specific independent variables.
A linear model might not be ideal for this data. We could later try categorizing the outcome and fitting a multinomial logistic regression to see if that works better.
The explanatory power of our model is low: it explains only 42% of the variability observed in the delay time with 5 predictors. Again, this may be due to limitations of our source data.

5.2 Poisson Model

We hypothesized that just two factors, adverse weather and COVID-19 cases, affect daily flight cancellations. Other elements that might influence real-life cancellations were not considered. As a result, we should note the limitation that this cancellation data does not include cancellations due to other factors, such as air traffic limitations. This restricted the information we could obtain from the cancellation dataset, preventing us from adjusting for variables we suspected were confounding the associations.
The association between COVID-19 cases and daily flight cancellations was not as strong as we expected. One reason might be that the study window was too short to detect the association, because COVID-19 cases act with a time lag and we only had 3 months of cancellation and COVID-19 case data.

6 Future Directions

6.1 Predictive Model

To better predict delay time, we need a larger dataset collected over a longer time interval. To achieve this, we could also collect data from more airports in the US, which would make our predictions more generalizable. Meanwhile, we need to try other modeling designs to find the one that fits our data best.

6.2 Poisson Model

To find the association between COVID-19 cases and daily flight cancellation counts, we could follow up this model with a longer study period. It could also be improved by adding more sources of cancellation counts to the model and by including types of cancellation.