We wanted to investigate factors that were related to and might be used to predict daily flight cancellation count. We hypothesized that there could be daily flight cancellations due to COVID breakouts or some specific weather conditions.


Data Description

We would like to explore the relationship between cancellation counts and some predictors. The number of cancellations each day is therefore regarded as the result (dependent variable), and exposures of interest are regarded as the independent variables.

  • daily_cancel: daily flight cancellation counts at JFK airport
  • airline_name: domestic airline
  • year_month: the date format in “Year Month”
  • temperature: daily average dry bulb temperature (°F)
  • humidity: daily average relative humidity (%)
  • windspeed: daily average relative windspeed (mph)
  • covid_case: new daily cases of COVID-19
  • covid_lag6: daily cases of COVID-19 with 6 days time lag


Distribution of Daily Flight Cancellations Counts

cancel_tidy %>% 
  ggplot(aes(x = daily_cancel, fill = "rgb(255, 65, 54)")) +
  geom_histogram(alpha = .4) +
  labs(
    x = "Number of Daily Flight Cancellations",
    y = "Count",
    title = "Distribution of Daily Flight Cancellations Counts",
    subtitle = "11/1/2021 to 1/31/2022"
  ) +
  scale_x_continuous(
    n.breaks = 10
  ) +
  theme(
    legend.position = "none"
  )

From 11/1/2021 to 1/31/2022, the minimum number of daily cancellations is 1, while the daily cancellations can reach up to as high as 333.


Stratification Factors

We discovered that the cancellation count each month varied by month and airline. Monthly flight cancellations were the highest in January 2022, followed by December and November 2021. Monthly cancellation rates fluctuate greatly amongst airlines as well. As a result, we used the month and the airline as stratification factors.

cancel_airline %>%
  arrange(airline_name) %>% 
  pivot_wider(
  names_from = airline_name, 
  values_from = Total_number_of_cancellation) %>%
  knitr::kable(
    col.names = c("Year, Month", "Alaska Airlines", "American Airlines", "Delta Air Lines", "Endeavor Air",
                  "JetBlue Airways", "Republic Airways", "United Air Lines")) %>% 
  kableExtra::kable_styling(bootstrap_options = "hover")
Year, Month Alaska Airlines American Airlines Delta Air Lines Endeavor Air JetBlue Airways Republic Airways United Air Lines
2021-Nov 6 33 2 1 5 4 1
2021-Dec 15 22 45 4 165 2 7
2022-Jan 41 139 182 156 416 185 10
cancel_airline %>% 
  mutate(year_month = fct_relevel(year_month, "2021-Nov", "2021-Dec", "2022-Jan")) %>%
  ggplot(aes(x = year_month, y = Total_number_of_cancellation, fill = airline_name)) +
  geom_col(position = "dodge", alpha = .7) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    x = "",
    y = "Count by Month",
    fill = "Airline",
    title = "Flight Cancellations by Month",
    subtitle = "Nov, 2021 to Jan, 2022"
  )

cancel_airline %>% 
  mutate(year_month = fct_relevel(year_month, "2021-Nov", "2021-Dec", "2022-Jan")) %>% 
  ggplot(aes(x = airline_name, y = Total_number_of_cancellation, fill = year_month)) +
  geom_col(position = "dodge", alpha = .7) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    x = "Airline",
    y = "Count by Month",
    fill = "Year, Month",
    title = "Flight Cancellations by Month",
    subtitle = "Nov, 2021 to Jan, 2022"
  )


Lagging COVID Cases

We hypothesized that the time trend of daily flight cancellations and COVID-19 cases did not coincide with one other. The rises of COVID-19 cases frequently took some time to affect our daily lives, including flight cancellations. Thus, we created new variables regarding COVID cases count with time lag.

cancel_covid = cancel_tidy  %>%
  dplyr::select(date, daily_cancel, covid_case) %>% 
  distinct()

cancel_plot = cancel_covid %>%
  mutate(text_label = str_c("Date: ", date, "\nCancellation Count: ", daily_cancel)) %>% 
  plot_ly(x = ~date, y = ~daily_cancel, type = "scatter", mode = "lines+markers",
          text = ~text_label, alpha = 0.5, hoverinfo = "text") %>% 
  layout(annotations = list(x = 0.1 , y = 0.8, text = "Daily Flight Cncellations", showarrow = F,
                            xref = "paper", yref = "paper"))
         
covid_plot <-  cancel_covid  %>%
  mutate(text_label = str_c("Date: ", date, "\nCOVID Cases Count: ", covid_case)) %>% 
  plot_ly(
    x = ~date, y = ~covid_case, type = "scatter", mode = "lines+markers",
   text = ~text_label, alpha = 0.5, hoverinfo = "text") %>% 
  layout(annotations = list(x = 0.1 , y = 0.8, text = "Daily COVID Cases", showarrow = F, 
                            xref = "paper", yref = "paper"))

scatter_plot <- subplot(cancel_plot, covid_plot, nrows = 2, shareX = TRUE) %>% 
  layout(title = list(text = "Time Trend of Daily Cancellations vs. Daily COVID Cases"),
         xaxis = list(title = "Date"),
         showlegend = FALSE)

scatter_plot

We used the lag selection criteria and Akaike information criterion (AIC) to select 6 days as the optimal time lag. Then, we constructed a new variable named covid_lag6, equivalent to the value of COVID-19 cases with a 6-day time lag.


Poisson Regression Model


poisson = glm(daily_cancel ~ temperature + humidity + windspeed + covid_lag6, 
              family = "poisson",
              data = cancel_tidy)

poisson %>% 
  broom::tidy() %>% 
  mutate(OR = exp(estimate),
    p.value = format(p.value, scientific = TRUE, digits = 3)
  ) %>% 
  dplyr::select(term, estimate, OR, p.value) %>% 
  knitr::kable(digits = 3, col.names = c("Terms", "Estimate (log(OR))", "Estimated Adjusted OR", "P-value")) %>% 
  kable_styling(bootstrap_options = "hover")
Terms Estimate (log(OR)) Estimated Adjusted OR P-value
(Intercept) 2.443 11.509 1.41e-195
temperature -0.119 0.888 0.00e+00
humidity 0.057 1.059 0.00e+00
windspeed 0.036 1.036 2.27e-154
covid_lag6 0.000 1.000 2.37e-05

The above is the outcome of Poisson regression. All predictor coefficients are statistically significant (associated with a p-value of 0.05), indicating that temperature, humidity, windspeed, and COVID-19 cases indeed impact the rate of daily flight cancellation.

Exponentiating the Poisson regression coefficient provides the incidence rate ratio of daily flight cancellation. Take temperature as an example; its parameter estimate is -0.119, which means that increasing the temperature by one degree Fahrenheit should decrease the difference in the logs of predicted daily flight cancellations by 0.119 occurrences while holding the other variables in the model constant.

We obtained an estimated adjusted OR after exponentiating all of the coefficients. Again, take temperature as an example: a one-degree Fahrenheit increase in temperature is expected to reduce the estimated rate ratio of daily flight cancellations by a factor of 0.888 while holding all other variables constant in the model.


Stratified by Month

cancel_tidy %>% 
  nest(df = -month) %>%
  mutate(
    models = map(.x = df, ~ glm(daily_cancel ~ temperature + humidity + windspeed + covid_lag6,family = "poisson", data = .x)),
    results = map(models, broom::tidy)
  ) %>% 
  unnest(results) %>% 
  dplyr::select(month, term, estimate) %>% 
  mutate(term = fct_inorder(term)) %>% 
  pivot_wider(
    names_from = term, 
    values_from = estimate) %>% 
  knitr::kable(digits = 6) %>% 
  kable_styling(bootstrap_options = "hover")
month (Intercept) temperature humidity windspeed covid_lag6
November -0.289407 0.050352 -0.033685 0.029304 3.3e-05
December 0.136457 -0.012493 0.015354 0.053751 3.2e-05
January 2.570463 -0.122494 0.060112 0.031507 0.0e+00


OR and 95% CI

Overview of estimated OR, 95% CI and P-value.

poisson_by_month = cancel_tidy %>%
  nest(data = -month) %>% 
  mutate(
    models = map(.x = data, ~glm(daily_cancel ~ temperature + humidity + windspeed + covid_lag6, family = "poisson", data = .x)),
    results = map(models, broom::tidy)
    ) %>% 
  dplyr::select(month, results) %>% 
  unnest(results) %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error),
    p.value = format(p.value, scientific = TRUE, digits = 3)
  ) %>% 
  dplyr::select(month, term, OR, CI_lower,CI_upper, p.value) 

poisson_by_month %>% 
  filter(term != "(Intercept)" ) %>% 
  knitr::kable(digits = 3, align = "llccc", col.names = c("Month", "Terms", "Estimated Adjusted OR", "CI Lower Bound", "CI Upper Bound", "P-value")) %>% 
  kable_styling(bootstrap_options = "hover") %>% 
  scroll_box(width = "100%", height = "220px")
Month Terms Estimated Adjusted OR CI Lower Bound CI Upper Bound P-value
November temperature 1.052 1.006 1.099 2.60e-02
November humidity 0.967 0.940 0.995 2.12e-02
November windspeed 1.030 0.996 1.064 8.25e-02
November covid_lag6 1.000 0.999 1.001 9.52e-01
December temperature 0.988 0.970 1.006 1.79e-01
December humidity 1.015 1.007 1.024 1.62e-04
December windspeed 1.055 1.039 1.071 5.33e-12
December covid_lag6 1.000 1.000 1.000 2.51e-33
January temperature 0.885 0.881 0.889 0.00e+00
January humidity 1.062 1.059 1.065 0.00e+00
January windspeed 1.032 1.029 1.035 4.06e-92
January covid_lag6 1.000 1.000 1.000 8.81e-01


Plot

poisson_by_month %>%
  filter(term != "(Intercept)") %>%
  mutate(month = fct_relevel(month, "November", "December", "January")) %>% 
  ggplot(aes(x = month, y = OR, color = term)) + 
  geom_point(show.legend = FALSE, aes()) +
  geom_errorbar(aes(ymin = CI_lower, 
                    ymax = CI_upper)) +
  labs(
    title = "Estimated OR with 95% CI in Daily Flight Cancellations by Month",
    x = "Month",
    y = "Estimated OR with 95% CI"
  ) +
  theme(legend.position = "right", legend.title = element_blank(),
        text = element_text(size = 10),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) + 
  facet_grid(. ~ term)


Stratified by Airline

cancel_tidy %>% 
  nest(df = -airline_name) %>%
  mutate(
    models = map(.x = df, ~ glm(daily_cancel ~ temperature + humidity + windspeed + covid_lag6,family = "poisson", data = .x)),
    results = map(models, broom::tidy)
  ) %>% 
  unnest(results) %>% 
  dplyr::select(airline_name, term, estimate) %>% 
  mutate(term = fct_inorder(term)) %>% 
  pivot_wider(
    names_from = term, 
    values_from = estimate) %>% 
  knitr::kable(digits = 6) %>% 
  kable_styling(bootstrap_options = "hover")
airline_name (Intercept) temperature humidity windspeed covid_lag6
American Airlines 1.377950 -0.119697 0.063910 0.048125 8e-06
Alaska Airlines 3.116599 -0.110591 0.045749 0.035340 3e-06
JetBlue Airways 1.987016 -0.130157 0.068785 0.031199 4e-06
Delta Air Lines 3.122657 -0.101793 0.044541 0.032940 -4e-06
Endeavor Air 2.736109 -0.133060 0.071020 0.015299 5e-06
Republic Airways 2.102555 -0.128872 0.059484 0.046475 9e-06
United Air Lines 3.431154 -0.077203 0.029393 0.042199 -5e-06


OR and 95% CI

poisson_by_airline = cancel_tidy %>%
  nest(data = -airline_name) %>% 
  mutate(
    models = map(.x = data, ~glm(daily_cancel ~ temperature + humidity + windspeed + covid_lag6, family = "poisson", data = .x)),
    results = map(models, broom::tidy)
    ) %>% 
  dplyr::select(airline_name, results) %>% 
  unnest(results) %>% 
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error),
    p.value = format(p.value, scientific = TRUE, digits = 3)
  ) %>% 
  dplyr::select(airline_name, term, OR, CI_lower,CI_upper, p.value) 

poisson_by_airline %>% 
  filter(term != "(Intercept)" ) %>% 
  knitr::kable(digits = 3, 
               align = "llccc", 
               col.names = c("Airline", "Terms", "Estimated Adjusted OR", "CI Lower Bound", "CI Upper Bound", "P-value")) %>% 
  kable_styling(bootstrap_options = "hover") %>% 
  scroll_box(width = "100%", height = "220px")
Airline Terms Estimated Adjusted OR CI Lower Bound CI Upper Bound P-value
American Airlines temperature 0.887 0.880 0.894 1.14e-196
American Airlines humidity 1.066 1.061 1.071 1.17e-131
American Airlines windspeed 1.049 1.042 1.056 1.09e-46
American Airlines covid_lag6 1.000 1.000 1.000 1.43e-04
Alaska Airlines temperature 0.895 0.888 0.903 5.56e-139
Alaska Airlines humidity 1.047 1.041 1.053 4.36e-57
Alaska Airlines windspeed 1.036 1.028 1.044 2.99e-19
Alaska Airlines covid_lag6 1.000 1.000 1.000 2.13e-01
JetBlue Airways temperature 0.878 0.871 0.885 2.37e-212
JetBlue Airways humidity 1.071 1.066 1.076 5.54e-170
JetBlue Airways windspeed 1.032 1.025 1.038 4.61e-23
JetBlue Airways covid_lag6 1.000 1.000 1.000 6.23e-02
Delta Air Lines temperature 0.903 0.894 0.912 1.62e-85
Delta Air Lines humidity 1.046 1.040 1.051 2.11e-67
Delta Air Lines windspeed 1.033 1.027 1.041 1.48e-21
Delta Air Lines covid_lag6 1.000 1.000 1.000 8.70e-02
Endeavor Air temperature 0.875 0.866 0.885 1.55e-125
Endeavor Air humidity 1.074 1.066 1.081 9.53e-85
Endeavor Air windspeed 1.015 1.007 1.024 1.66e-04
Endeavor Air covid_lag6 1.000 1.000 1.000 3.45e-02
Republic Airways temperature 0.879 0.871 0.888 4.60e-146
Republic Airways humidity 1.061 1.055 1.068 1.80e-77
Republic Airways windspeed 1.048 1.040 1.055 1.55e-38
Republic Airways covid_lag6 1.000 1.000 1.000 9.78e-05
United Air Lines temperature 0.926 0.913 0.938 4.56e-29
United Air Lines humidity 1.030 1.024 1.036 3.03e-21
United Air Lines windspeed 1.043 1.030 1.056 6.75e-11
United Air Lines covid_lag6 1.000 1.000 1.000 4.94e-02


Plot

poisson_by_airline %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = airline_name, y = OR, color = term)) + 
  geom_point(show.legend = FALSE, aes()) +
  geom_errorbar(aes(ymin = CI_lower, 
                    ymax = CI_upper)) +
  labs(
    title = "Estimated OR with 95% CI in Daily Flight Cancellations by Airline",
    x = "Airline",
    y = "Estimated OR with 95% CI"
  ) +
  theme(legend.position = "right", legend.title = element_blank(),
        text = element_text(size = 10),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) + 
  facet_grid(. ~ term)


Summary

Our Poisson regression model indicates that all our main effects (weather-related factors and COVID-19 cases) are statistically significant, as we proposed. Since the data exploration and pre-analysis indicated that there indeed are differences by month and airline, we then stratified our Poisson model by these two factors. An interesting finding is that although the regression output shows that daily COVID-19 cases have a statistically significant effect on daily flight cancellation, after stratified by month and airline, there is no obvious trend between daily COVID-19 cases and daily flight cancellation based on graph.


Discussion

We hypothesized that just two factors, adverse weather and COVID-19 cases, affect daily flight cancellations. I More elements that might influence real-life cancellation were not considered. As a result, we should note the constraint that this cancellation data does not include cancellations due to other factors, such as air traffic limitations. This restricted the information we could obtain from the cancellation dataset, preventing us from adjusting for variables we suspected were confounding the correlations.

What’s more, the association between COVID-19 cases and daily flight cancellations was not as strong as we expected. One reason might be that the research window was too short to discover the association because COVID-19 cases had a time lag and we only counted 3-month cancellations and COVID-19 cases.