KrishnaBathula 1
Abstract—Data is growing exponentially and takes different forms: structured, unstructured and semi-structured.
This study uses big data analytics concepts and machine learning algorithms to work with structured data on flights,
airports, airlines and weather. The objective is to characterize the correlations between data points across the
datasets and to use these associations to identify the key features that can disrupt flight schedules, as the basis for
an impact analysis. The domino effect passed on to stopover and connecting flights en route to their destinations is
also predicted. These insights provide a basis for disaster management and for the recovery of valuable air time, since
flight delays affect the finances of airport authorities, airlines and flyers, and harm the environment through
increased consumption of resources such as fuel and gas.
Index Terms—Big data analytics, data science, data mining, flight schedule disruptions, flight delays, weather analysis.
I. INTRODUCTION
THE U.S. Department of Transportation tracks the punctuality of domestic flight arrivals operated across the USA
by large airlines. These statistics show an apparent increase in air transportation in recent years and point towards
progressively congested airports and airspace. This extensive usage of airspace brings a heightened risk of operational
disruptions, which can be traced in delays, costs, deteriorated quality of service, airline setbacks, passenger
discontent, etc. In addition, airlines constantly pursue strategies to optimize their operating profits while competing
with their counterparts. It is a challenging task to balance higher turnover with customer satisfaction and
sustainability [1].
According to analysis by the Bureau of Transportation Statistics (BTS), nearly 20% of commercial flights run behind
schedule, which results in heavy losses to airline companies as well as unnecessary distress to customers. Weather
contributes substantially to flight delays as well as late arrivals. Flight delay prediction is an active area of
research as demand for air travel increases. According to the U.S. Department of Transportation's Air Travel Consumer
Report, in January 2017 alone, U.S. airlines recorded a mean on-time arrival rate of 76.0%, down from 81.0% in the 2016
data, which reached up to 82.5% in the final quarter of 2016. Various earlier studies have therefore attempted to
determine patterns in air traffic and in the deviation of flights from their planned schedules.
In this study, we focus on the flight disruptions that arise from air system delays, security delays, airline delays,
late aircraft delays and weather delays. Our aim is to find the correlation between flight and weather attributes and to
determine whether we must consider all the constraints that affect delays or only those with a major impact. This impact
analysis in turn provides grounds for determining the influence of these delays on the connecting flights from the
destinations. The data from the various datasets are examined for the features that affect overall delays. These
patterns in itineraries and delays can be used to predict possible future delays. We use data science algorithms to
study in depth several key factors that contribute to airline performance and to determine the correlation among these
characteristics. We also propose a model that predicts flight disruptions with respect to changes in weather and
forecasts the repercussions for subsequent flights, connecting flights being one such instance. These observations will
help the airline industry take essential precautions for operational effectiveness and for optimizing time and cost.
When the historical data is examined for the exact cause of delay, weather emerges as the leading factor, at almost
61.89%, ahead of the other causes (Table 1). All these measures are based on the number of flight operations.
Rebollo et al. created a model that predicts future network-related delays by exploiting the system-level dependencies
between airports [5]. In a similar fashion, Hansen et al. analyzed the trend in flight delays in the U.S. domestic
system by estimating an econometric model of routine system delay that combines the effects of arrival queuing, terminal
weather conditions, and seasonal and secular influences. The results suggested that, even after controlling for these
factors, delays decreased gradually from 2000 through mid-2003, but the trend reversed sharply thereafter [4].
Another group of researchers, Mueller et al., developed a statistical method to analyze departure and arrival data and
to characterize the delay data [6]. Belcastro et al. [7] developed a model for predicting the delay of a scheduled
flight. Choi et al. [8], on the other hand, proposed a model to forecast airline delays caused by inclement weather
conditions using data mining techniques and supervised machine learning algorithms.
Over time, numerous analytical models and simulation methods have been developed to analyze flight delay, including
deterministic queuing models, neural networks and econometric models.
III. METHODOLOGY
In this research we predict flight delays due to weather disruption, which can help airlines and passengers prepare an
appropriate plan of action. We use machine learning techniques to predict the delay.
The project is divided into three parts: data engineering, exploratory data analysis, and prediction of
connecting-flight disruptions.
The first step towards a successful data analysis is ensuring data quality. The airline data is obtained from the U.S.
Department of Transportation (DOT), and the weather data from the National Oceanic and Atmospheric Administration
(NOAA).
A. Data Preprocessing:
The data is also thoroughly examined against integrity criteria. Since we expect the model to work with offline,
nearline and online data, we removed irrelevant and unnecessary parameters that could overburden the dataset. As part of
data cleansing, we also dropped null values and assigned zero to Not a Number (NaN) values. Time fields such as
scheduled time and air time were found to be stored as floating-point numbers and need conversion to a standard
date-time format. The categorical data, which includes the factors that contribute most to flagging the key filters, is
assigned numeric values. Finally, the data is analyzed for distribution after cleansing, conversion and preprocessing.
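The cleansing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the field names (`SCHEDULED_DEPARTURE`, `WEATHER_DELAY`, `AIRLINE`), the HHMM float encoding of times, and the treatment of 2400.0 as midnight are all hypothetical.

```python
import math
from datetime import time


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hhmm_to_time(value):
    """Convert an HHMM float such as 1430.0 to datetime.time(14, 30)."""
    hours, minutes = divmod(int(value), 100)
    return time(hours % 24, minutes)  # assumption: 2400.0 encodes midnight


def clean(rows):
    """Drop rows without a scheduled time, zero-fill delays, encode airlines."""
    airline_codes = {}  # categorical label -> integer code, by first appearance
    cleaned = []
    for row in rows:
        if is_missing(row["SCHEDULED_DEPARTURE"]):
            continue  # null in a required field: drop the row
        delay = row["WEATHER_DELAY"]
        cleaned.append({
            "SCHEDULED_DEPARTURE": hhmm_to_time(row["SCHEDULED_DEPARTURE"]),
            "WEATHER_DELAY": 0.0 if is_missing(delay) else delay,
            "AIRLINE": airline_codes.setdefault(row["AIRLINE"], len(airline_codes)),
        })
    return cleaned


# Made-up sample rows for illustration only.
rows = [
    {"SCHEDULED_DEPARTURE": 1430.0, "WEATHER_DELAY": float("nan"), "AIRLINE": "AA"},
    {"SCHEDULED_DEPARTURE": None, "WEATHER_DELAY": 12.0, "AIRLINE": "DL"},
    {"SCHEDULED_DEPARTURE": 5.0, "WEATHER_DELAY": 20.0, "AIRLINE": "DL"},
]
cleaned = clean(rows)
```

Here the second row is dropped because its scheduled time is null, while the NaN delay in the first row is replaced by zero.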
The different datasets, namely the airline, flight, airport and weather datasets, are then integrated and normalized to
identify the correlating factors that affect flight cancellations (Fig. 2).
All airlines operate from airports, each of which can be either an origin or a destination. From these data we select
only the top 50 busiest airports, ranked by the sum of their departures and arrivals. The predictors are chosen based on
their delay factors. A Random Forest classifier is fitted to the data frame for the small feature set, and a feature
importance score is extracted for each feature, such as departure delay and arrival delay due to weather.
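The airport ranking described above (departures plus arrivals) can be sketched as follows. The flight list is made up, `busiest_airports` is a hypothetical helper name, and keeping only flights whose origin and destination are both in the selected set is an assumption about how the filter was applied. The Random Forest importance step is not shown.

```python
from collections import Counter


def busiest_airports(flights, top_n=50):
    """Rank airports by departures plus arrivals and return the top_n codes."""
    traffic = Counter()
    for origin, destination in flights:
        traffic[origin] += 1       # one departure
        traffic[destination] += 1  # one arrival
    return [airport for airport, _ in traffic.most_common(top_n)]


# Made-up (origin, destination) pairs; top_n=2 keeps the example small.
flights = [
    ("ATL", "ORD"), ("ORD", "ATL"), ("ATL", "DFW"),
    ("DFW", "ATL"), ("ORD", "ATL"), ("SEA", "ATL"),
]
top = busiest_airports(flights, top_n=2)

# Assumption: keep flights whose origin and destination are both selected.
kept = [f for f in flights if f[0] in top and f[1] in top]
```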
1) Departure Delay Prediction: The feature set used in departure delay forecasting is identified with the help of a
correlation matrix. Therefore, in predicting the delay, each feature is relatively significant.
2) Arrival Delay Prediction: As departure delays affect arrivals, we select the origin airport.
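As an illustration of how a correlation matrix can guide feature selection, the following sketch computes pairwise Pearson coefficients between two weather features and departure delay. The column names and all values are invented for the example and are not taken from the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(a, b): r} for every ordered pair of named columns."""
    return {(a, b): pearson(columns[a], columns[b])
            for a in columns for b in columns}


# Invented sample values, for illustration only.
columns = {
    "precipitation_in": [0.0, 0.1, 0.5, 1.2, 2.0],
    "visibility_mi": [10.0, 9.0, 6.0, 3.0, 1.0],
    "departure_delay_min": [2.0, 5.0, 18.0, 40.0, 75.0],
}
matrix = correlation_matrix(columns)
```

Features whose coefficient against the delay column is large in absolute value, whether positive or negative, are the candidates a correlation-based selection would retain.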
Data Integration: Flight data for the top 50 busiest airports is integrated with weather data by origin and destination
airport at the time of takeoff and landing. While analyzing the data we found that approximately 65% of flights
originate and land at these airports. For every weather station we use the average weather parameters, i.e. annual mean
temperature, annual mean precipitation, annual mean visibility, etc. Two data frames are created for simplicity, one for
the origin and one for the destination. They are identical except for the column names.
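A minimal sketch of this integration step, using plain dictionaries in place of data frames: each flight is joined to the weather record of one endpoint airport, and the weather columns are prefixed so that the origin and destination frames differ only in column names. The helper `attach_weather`, the field names and the weather values are hypothetical placeholders, not real station data.

```python
def attach_weather(flights, weather_by_airport, airport_key, prefix):
    """Join each flight to the weather record of one endpoint airport."""
    joined = []
    for flight in flights:
        weather = weather_by_airport.get(flight[airport_key])
        if weather is None:
            continue  # no weather data for this airport: skip the flight
        row = dict(flight)
        row.update({f"{prefix}_{name}": value for name, value in weather.items()})
        joined.append(row)
    return joined


# Placeholder annual means, invented for the example.
weather_by_airport = {
    "ATL": {"MEAN_TEMP_F": 62.0, "MEAN_PRECIP_IN": 0.14, "MEAN_VISIBILITY_MI": 9.1},
    "ORD": {"MEAN_TEMP_F": 51.0, "MEAN_PRECIP_IN": 0.10, "MEAN_VISIBILITY_MI": 8.7},
}
flights = [
    {"FLIGHT": "AA100", "ORIGIN": "ATL", "DEST": "ORD", "ARRIVAL_DELAY": 12.0},
    {"FLIGHT": "DL200", "ORIGIN": "ORD", "DEST": "XXX", "ARRIVAL_DELAY": 0.0},
]
origin_frame = attach_weather(flights, weather_by_airport, "ORIGIN", "ORIGIN")
dest_frame = attach_weather(flights, weather_by_airport, "DEST", "DEST")
```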
The delays are plotted against weather conditions such as heavy rain, measured in inches (Fig. 5). The impact analysis
plot is shown in Fig. 4.
Fig. 4 Rain-affected delays. Fig. 5 Delays vs. precipitation rate.
C. Model
The input data is split into training and testing data. The intent of our model is to predict arrival delay, which gives
us the reference time window for the connecting flight. To narrow the time window that determines whether delayed
passengers can board a connecting flight, we start with arrival delay, which is difficult to predict because the
majority of flights have zero or only a small arrival delay. We break the problem into two subparts:
1) With the delay threshold set at more than 5 minutes, we perform binary classification by training a logistic
regression model and record the resulting P value, i.e., the output probability of a delay.
2) Predicted Delay
We perform linear regression, with the model trained on the positive delays from the binary classification above.
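The two-stage idea can be sketched with a single feature in pure Python. This is an illustration under assumptions, not the study's implementation: it uses one made-up feature (precipitation), fits the classifier by plain batch gradient descent, trains the regressor on flights whose observed delay exceeds the threshold, and reports a delay of zero when the predicted probability is below 0.5. The training values are invented.

```python
import math

DELAY_THRESHOLD_MIN = 5.0  # a flight counts as delayed above this many minutes


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.5, epochs=5000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def fit_linear(xs, ys):
    """One-feature ordinary least squares in closed form."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Invented training data: precipitation (inches), arrival delay (minutes).
precip = [0.0, 0.1, 0.2, 0.3, 1.0, 1.5, 2.0, 2.5]
delay = [0.0, 2.0, 1.0, 4.0, 20.0, 30.0, 40.0, 50.0]

# Stage 1: classify delayed vs. on time.
labels = [1 if d > DELAY_THRESHOLD_MIN else 0 for d in delay]
w, b = fit_logistic(precip, labels)

# Stage 2: regress delay length on the delayed flights only.
delayed = [(x, d) for x, d in zip(precip, delay) if d > DELAY_THRESHOLD_MIN]
slope, intercept = fit_linear([x for x, _ in delayed], [d for _, d in delayed])


def predict(x):
    """Return (probability of delay, predicted delay in minutes)."""
    p = sigmoid(w * x + b)
    minutes = slope * x + intercept if p >= 0.5 else 0.0
    return p, minutes
```

Training the regressor only on delayed flights keeps the many zero-delay flights from pulling the predicted delay length towards zero, which is the motivation for splitting the problem.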
D. Model Evaluation:
The model has been trained to predict arrival delays given flight features such as flight number, origin and
destination, together with weather features such as precipitation, wind speed and visibility, which are the key
characteristics. To discount the effect of weather on historical delays, we predict the arrival delay for each flight
under the mean weather conditions of the origin and destination airports.
IV. RESULTS
In the preliminary results, we obtained 30% positive predictions (Fig. 6), which is on the low side. We selected the
K-fold cross-validation method, which is deterministic. If we use more than one model, there is a possibility of
overfitting the data, and this approach may bias the parameters of the model. We achieved these results using
regularization techniques.
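For reference, a deterministic K-fold split can be sketched as below. This is a generic illustration, not the study's code: it assumes contiguous, unshuffled folds (which is what makes the split reproducible), and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample appears in exactly one test fold, so every observation is used once for validation and k - 1 times for training.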
Fig. 6 Results
V. CONCLUSIONS
With the help of this model, airlines and passengers can get advance notice of expected delays in their journey.
This study proposed a prediction model for airline delays triggered by extreme weather conditions, which uses logistic
regression to classify the delay and linear regression to estimate it. Specifically, the model was built on historical
weather and airline data for the top 50 busiest airports using machine learning algorithms. The project has
incorporated and shown the importance of regression analysis in machine learning and big data analytics, as well as of
cross-validation and regularization in building sound models. Because the data was imbalanced, we applied data cleansing
techniques. We were also able to infer that there is a significant subsequent impact on connecting flights when the
aircraft arrives at the stopover destination more than 45 minutes late.
The model's prediction performance on the validation set and the test set was analyzed. We believe a few more methods
could improve the model in the future. For future work, we can also incorporate the expenses that could be saved by
predicting delays, since every second of delay can lead to losses. We will study how one percent can be saved by
avoiding delays that are within control.
VI. REFERENCES
[1] Burmester, Gerrit, et al. “Big Data and Data Analytics in Aviation.” SpringerLink, Springer, Cham,
https://link.springer.com/chapter/10.1007/978-3-319-75058-3_5.
[2] “Aviation Data & Statistics.” Federal Aviation Administration, 1 Aug. 2017, https://www.faa.gov/data_research/aviation_data_statistics/.
[3] Allan, S. S., S. G. Gaddy, and J. E. Evans (2001), Delay Causality and Reduction at the New York City Airports Using
Terminal Weather Information, MIT, Lexington, Massachusetts.
[4] Hansen, M., and C. Y. Hsiao (2005), Going South? An Econometric Analysis of US Airline Flight Delays from 2000 to 2004,
Presented at the 84th Annual Meeting of the Transportation Research Board (TRB), Washington D.C.’05.
[5] J. J. Rebollo and H. Balakrishnan, “Characterization and prediction of air traffic delays,” Transportation Research
Part C: Emerging Technologies, vol. 44, pp. 231–241, 2014.
[6] E. Mueller and G. Chatterji, “Analysis of Aircraft Arrival and Departure Delay Characteristics,” AIAA's Aircraft
Technology, Integration, and Operations (ATIO) 2002 Technical Forum, 2002.
[7] L. Belcastro, F. Marozzo, D. Talia, and P. Trunfio, “Using Scalable Data Mining for Predicting Flight Delays,” ACM
Transactions on Intelligent Systems and Technology, vol. 8, no. 1, pp. 1-20, 2016.
[8] S. Choi, Y. J. Kim, S. Briceno and D. Mavris, “Prediction of weather induced airline delays based on machine learning
algorithms,” 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, 2016, pp. 1-6.