GNR 652 Assignment 2
Report
Roll Number: 183170017
A) Flight Carriers
a) The data contains few flights from the carriers ‘CO’, ‘OH’ and ‘UA’,
which is why their significance in the model will be low. The model would be
less biased if the data were uniformly distributed across carriers.
b) Most of the flight carriers have a similar percentage of delayed flights,
so the carrier has little effect on delays.
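The per-category comparison used in this and the following sections (flight count and percentage delayed for each value of a predictor) can be sketched as below. This is a minimal illustration, not the code used for the report: the records are made-up toy values, and the helper name `delay_summary` is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records (carrier, delayed flag); NOT the assignment data.
records = [
    ("CO", 1), ("CO", 0),
    ("OH", 0), ("OH", 0), ("OH", 1), ("OH", 0),
    ("UA", 0),
]

def delay_summary(rows):
    """Return {category: (flight_count, percent_delayed)} for (category, flag) rows."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for category, is_delayed in rows:
        totals[category] += 1
        delayed[category] += is_delayed
    return {c: (totals[c], 100.0 * delayed[c] / totals[c]) for c in totals}

summary = delay_summary(records)
for carrier, (count, pct) in sorted(summary.items()):
    print(f"{carrier}: {count} flights, {pct:.1f}% delayed")
```

A category with very few flights gives an unstable percentage, which is why the counts are reported next to the delay rates.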
B) DISTANCE
a) The distance ‘214’ has more flights than any other distance, so it will
carry more significance in the model than the others.
b) Its percentage of delays is considerably lower than that of the other
distances, so flights at this distance are less likely to be delayed.
C) ORIGIN
a) ‘DCA’ has more data points than the other origins and fewer delays,
so flights departing from it are less likely to be delayed.
D) DAY_WEEK
a) The data shows a pattern: although more flights run on the middle days
of the week, the percentage of delays on those days is lower than on the
others. Mid-week flights therefore have a lower chance of delay, while
flights at the ends of the week are delayed more often.
E) DAY_OF_MONTH
a) The percentage of delays increases with the week of the month, while the
approximate number of flights in each week remains the same.
b) Flights are therefore delayed more often towards the end of the month.
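The report does not state how days were grouped into weeks. One possible grouping, assumed here purely for illustration, is fixed 7-day bins (days 1-7 are week 1, days 8-14 are week 2, and so on). The records below are contrived toy values, and `week_of_month` is a hypothetical helper.

```python
from collections import defaultdict

def week_of_month(day):
    """Map a day of month (1-31) to a week index (1-5) using 7-day bins."""
    if not 1 <= day <= 31:
        raise ValueError(f"day of month out of range: {day}")
    return (day - 1) // 7 + 1

# Hypothetical toy records (day_of_month, delayed flag); NOT the assignment data.
toy = [(1, 0), (3, 0), (9, 0), (12, 1), (17, 0), (20, 1), (25, 1), (28, 1), (31, 1)]

flights = defaultdict(int)
delays = defaultdict(int)
for day, is_delayed in toy:
    wk = week_of_month(day)
    flights[wk] += 1
    delays[wk] += is_delayed

# Percentage of delayed flights per week of the month.
weekly_pct = {wk: 100.0 * delays[wk] / flights[wk] for wk in sorted(flights)}
print(weekly_pct)
```

With this binning, week 5 only covers days 29-31, so it holds fewer flights than the other weeks.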
F) DEST
a) ‘LGA’ has far more data points than the other destinations and a lower
percentage of delays, so flights to it are more likely to be on time.
2) Preprocess the dataset (to remove null values, generate dummy variables
etc.) and divide the dataset into 60% train and 40% test. Prepare a logistic
model that can obtain accurate classifications of new flights based on their
predictor information.
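The steps named in this task (dropping null rows, generating dummy variables, a 60/40 split and a logistic fit) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the toy rows are contrived so that Delay equals Weather, the helper names are hypothetical, and plain batch gradient descent stands in for whatever solver was actually used in the assignment.

```python
import math
import random

def drop_nulls(rows):
    """Keep only rows in which every field is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def one_hot(rows, column):
    """Replace a categorical column with 0/1 dummy columns named column_level."""
    levels = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(new)
    return encoded

def split_rows(rows, train_frac=0.6, seed=0):
    """Shuffle reproducibly, then cut into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (bias, weights)."""
    n, d = len(X), len(X[0])
    bias, weights = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

# Contrived toy rows (Delay equals Weather by construction); NOT the assignment data.
raw = [{"ORIGIN": o, "Weather": wthr, "Delay": wthr}
       for wthr in (0, 1) for o in ("DCA", "IAD") for _ in range(5)]
raw.append({"ORIGIN": None, "Weather": 0, "Delay": 0})  # row with a null value

clean = one_hot(drop_nulls(raw), "ORIGIN")
train, test = split_rows(clean, train_frac=0.6)
features = [k for k in clean[0] if k != "Delay"]

X_train = [[r[f] for f in features] for r in train]
y_train = [r["Delay"] for r in train]
bias, weights = fit_logistic(X_train, y_train)

def predict(row):
    """Predicted probability of delay for one encoded row."""
    return sigmoid(bias + sum(w * row[f] for w, f in zip(weights, features)))

accuracy = sum((predict(r) >= 0.5) == bool(r["Delay"]) for r in test) / len(test)
print(len(train), len(test), accuracy)
```

Keeping every dummy level together with a bias term makes the dummy columns collinear. Gradient descent still runs, but one level per variable is usually dropped when the coefficients need to be interpreted.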
4) Perform variable selection, and reduce the size of the model, only keeping
the relevant variables based on the analysis done earlier.
A) Based on t-statistics at an 80% confidence level, the following
variables were found to significantly affect delays:
B) ['Delay', 'DEST_LGA', 'DISTANCE_214', 'Weather_0', 'ORIGIN_DCA']
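A selection rule of this kind can be sketched as follows, assuming a large-sample normal approximation to the t-statistic and a two-sided test: a variable is kept when |beta / SE| exceeds the critical value for the chosen confidence level. The coefficient and standard-error values below are hypothetical placeholders, not results from the report, and `select_by_t_stat` is an illustrative name.

```python
from statistics import NormalDist

def select_by_t_stat(coefs, std_errs, confidence=0.80):
    """Return (kept_names, critical_value) for a two-sided normal-approximation test."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    kept = [name for name, beta in coefs.items()
            if abs(beta / std_errs[name]) > critical]
    return kept, critical

# Hypothetical coefficients and standard errors; NOT values from the report.
coefs = {"x_strong": 0.50, "x_weak": 0.10, "x_negative": -0.40}
std_errs = {"x_strong": 0.20, "x_weak": 0.20, "x_negative": 0.25}

kept, critical = select_by_t_stat(coefs, std_errs)
print(f"critical |t| at 80%: {critical:.4f}")
print("kept:", kept)
```

An 80% level is a loose threshold (critical value near 1.28, against about 1.96 at 95%), so it retains more variables than the usual 95% choice would.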
6) Find the ideal weather conditions for the highest chance of an ontime flight
from DC to New York.
A) The model predicts that flight delays depend mostly on the following
variables, all with positive (+ve) beta values:
'Delay': 0.09849605
'DEST_LGA': 0.0844899
'DISTANCE_214': 0.06678279
'Weather_0': 0.19673315
'ORIGIN_DCA': 0.09010007
B) Therefore the best weather condition is 0, i.e. no weather-related issues.
The outcome does not correlate strongly with the day of the week or month, but
the visualization shows that the middle days of the week and the early days of
the month are associated with a lower chance of delay.
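One way to compare the reported beta values is to convert them to odds ratios with exp(beta). The sketch below assumes the betas are log-odds coefficients of the fitted logistic model. The report does not state the intercept, the feature scaling, or which outcome is coded as the positive class, so the ratios only rank the variables by strength of effect and cannot be turned into probabilities.

```python
import math

# Beta values as listed in the report.
betas = {
    "Delay": 0.09849605,
    "DEST_LGA": 0.0844899,
    "DISTANCE_214": 0.06678279,
    "Weather_0": 0.19673315,
    "ORIGIN_DCA": 0.09010007,
}

# exp(beta) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in betas.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: beta={betas[name]:.4f}, odds ratio={ratio:.3f}")
```

Under these assumptions 'Weather_0' has the largest odds ratio, which matches the report's reading that weather is the dominant factor.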