Predicting Flight Delays
After reviewing the model, we saw that this variable did not have the effect we expected: the
adjusted R^2 stayed almost the same, and adding variables that contribute nothing only makes
the model harder to understand, so we removed it.
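
The comparison above relies on the adjusted R^2, which penalizes every extra predictor. The
sketch below uses the standard formula; the function name and the numbers are hypothetical
and are not taken from our model. They only illustrate that a variable adding almost no
explanatory power can leave the adjusted value flat or even slightly lower.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: the extra predictor raises plain R^2 by only 0.0001.
before = adjusted_r2(r2=0.8000, n_obs=1000, n_predictors=3)
after = adjusted_r2(r2=0.8001, n_obs=1000, n_predictors=4)

# The penalty for the extra predictor outweighs the tiny gain in R^2.
print(before, after)
```
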
We thought that another interesting variable could be acceleration. Even when a flight
departs later than scheduled, the pilot may increase the airplane's speed to make up time
and give customers a better service. To capture this, we computed the difference between
the departure delay and the arrival delay, and we considered that a flight accelerated when
it recovered more than 30 minutes. We then built a similar binary table so that we could
integrate this variable (an indirect measure of distance and air time) into the model, but
the result remained the same.
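
A minimal sketch of the acceleration flag described above. The field names dep_delay and
arr_delay and the sample flights are assumptions made for illustration; only the 30-minute
threshold comes from our analysis.

```python
def accelerated(dep_delay: float, arr_delay: float, threshold: float = 30.0) -> int:
    """Return 1 if the flight recovered more than `threshold` minutes in the air."""
    recovered = dep_delay - arr_delay
    return 1 if recovered > threshold else 0


# Hypothetical flights (delays in minutes).
flights = [
    {"dep_delay": 45, "arr_delay": 10},  # recovered 35 minutes
    {"dep_delay": 20, "arr_delay": 15},  # recovered 5 minutes
    {"dep_delay": 60, "arr_delay": 30},  # recovered exactly 30 minutes
]
flags = [accelerated(f["dep_delay"], f["arr_delay"]) for f in flights]
print(flags)
```

Only the first flight is flagged, because the rule asks for strictly more than 30 minutes.
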
Then we decided to add the variables distance and air time directly, and we finally got an
increase in our adjusted R^2 to 0.88. We thought this was a very good number, but we wanted
to improve it a little more. We looked for a way to capture the idea that, when the
departure was late and the distance was considerable (200 km), the airplane could
accelerate. So we related the flights that accelerated to the airline they belong to, and
our model improved by only 0.05.
We also ran a correlation analysis to find out which airlines were most correlated with the
arrival delay, so that we could separate good airlines from bad ones. We still defined a
delay as 30 minutes or more. We obtained positive correlations of 0.04 and 0.06 for two of
the airlines, but we consider these too low. So we conclude that the airline does not
affect the arrival delay much.
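
One way to run this kind of check is to correlate a 0/1 indicator for each airline with a
0/1 indicator for "arrived 30 minutes or more late". The sketch below assumes that reading;
the carrier codes, field names and rows are hypothetical, and the Pearson coefficient is
written out by hand so the example needs no external library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def carrier_delay_correlation(rows, carrier, threshold=30):
    """Correlate 'flight belongs to carrier' with 'arrival delay >= threshold'."""
    is_carrier = [1 if r["carrier"] == carrier else 0 for r in rows]
    is_late = [1 if r["arr_delay"] >= threshold else 0 for r in rows]
    return pearson(is_carrier, is_late)


# Hypothetical rows: every "AA" flight is late and every "BB" flight is on time.
rows = [
    {"carrier": "AA", "arr_delay": 40},
    {"carrier": "AA", "arr_delay": 35},
    {"carrier": "BB", "arr_delay": 5},
    {"carrier": "BB", "arr_delay": 0},
]
corr_aa = carrier_delay_correlation(rows, "AA")
corr_bb = carrier_delay_correlation(rows, "BB")
```

In this toy data the two indicators line up perfectly, which is the opposite extreme from
the values of 0.04 and 0.06 that we observed.
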
One of the few things we had left, and one we consider important, was the date. It sounds
very logical that if you travel on a holiday or any other busy date, you are more likely to
arrive late at your destination, since the number of flights at the airport is higher than
on normal days. So we plotted the number of flights on each day of each month, and we
discovered a pattern: there are far fewer flights in January and February, and these
flights had more arrival delay than those during the rest of the year. Since we are
analyzing New York, where winter hits hard, we thought this might be due to snowfalls, so
we integrated this variable into our model, but the R^2 stayed almost the same. We also saw
that on some days of the week the delay was above the rest, so we did a correlation
analysis comparing day of the week against arrival delay and “winter”, but the model did
not improve.
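
The monthly comparison and the "winter" variable can be sketched as follows. Treating only
January and February as winter is an assumption based on the months named above, and the
field names and rows are hypothetical.

```python
from collections import defaultdict

WINTER_MONTHS = {1, 2}  # assumption: January and February only


def monthly_summary(rows):
    """Return {month: (number of flights, mean arrival delay)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for r in rows:
        counts[r["month"]] += 1
        totals[r["month"]] += r["arr_delay"]
    return {m: (counts[m], totals[m] / counts[m]) for m in sorted(counts)}


def is_winter(month: int) -> int:
    """Binary 'winter' variable: 1 for the assumed winter months, 0 otherwise."""
    return 1 if month in WINTER_MONTHS else 0


# Hypothetical rows (arrival delay in minutes).
rows = [
    {"month": 1, "arr_delay": 30},
    {"month": 1, "arr_delay": 10},
    {"month": 7, "arr_delay": 5},
]
summary = monthly_summary(rows)
winter_flags = [is_winter(r["month"]) for r in rows]
```
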
We got a little frustrated at this point, since we had already tried many ways to integrate
the different variables and none of them seemed to work. In the correlation analysis,
however, we saw that in winter 2014 the correlation with delay was high, so we looked in
the news for the exact days that had a lot of arrival delays and found that there had been
a heavy snowfall on all of them. We concluded that if there is a possibility of snowfall on
the date you plan to travel, the delay will increase. But then we realized that all the
variables we were analyzing had to do with departure delay, so they were already
represented by that variable, and that was why our model was not getting any better. We
therefore decided to focus only on the variables that can affect the arrival delay
independently of the departure, and this was the destination airport.
We calculated the average arrival delay for each destination airport, and we created a
binary variable that was 1 for the airports with an average arrival delay above 10 minutes
and 0 for the rest. We ran the regression analysis, and the model remained the same, so we
concluded that the destination airport does not have any significant effect on the arrival
delay.
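
The destination variable described above can be built roughly like this. The airport codes,
field names and rows are hypothetical; only the 10-minute threshold comes from our
analysis.

```python
from collections import defaultdict


def slow_destination_flags(rows, threshold=10.0):
    """Return {dest: 1 if its mean arrival delay is above threshold, else 0}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        totals[r["dest"]] += r["arr_delay"]
        counts[r["dest"]] += 1
    return {d: 1 if totals[d] / counts[d] > threshold else 0 for d in totals}


# Hypothetical rows (arrival delay in minutes).
rows = [
    {"dest": "AAA", "arr_delay": 20},
    {"dest": "AAA", "arr_delay": 4},   # AAA mean: 12 minutes
    {"dest": "BBB", "arr_delay": 10},
    {"dest": "BBB", "arr_delay": 10},  # BBB mean: exactly 10 minutes
]
dest_flag = slow_destination_flags(rows)

# Attach the airport-level flag to every flight so it can enter the regression.
flight_flags = [dest_flag[r["dest"]] for r in rows]
```

An airport whose average is exactly 10 minutes gets 0, because the rule asks for an average
above 10 minutes.
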