
GNR 652 Assignment 2

The document summarizes the analysis of a flight delay dataset. Various visualizations are created to explore trends in the data by carrier, distance, origin, day of week, and destination. Logistic regression models are fitted on preprocessed training and test sets to classify flight delays, achieving up to 88% accuracy. Variable selection identifies five key predictors of delay: delay, destination as LGA, distance of 214 miles, weather of 0 (no issues), and origin as DCA. Refitting the model on the selected variables maintains high accuracy, suggesting ideal conditions for on-time travel from DCA to New York are no weather issues with a 214 mile flight.

Uploaded by

Sayan Rakshit

GNR 652 Assignment 2

Report
Roll Number: 183170017

1) Show visualisations to explore the dataset and understand the underlying
trends (often called Exploratory Data Analysis). Choose visualisation
methods you think best represent the data.
A bar graph is the best way to represent this data, because we need to compare
flight counts across categories and, within each category, see how delayed and
on-time flights are distributed.
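
The two quantities read off each bar graph in this section (the number of flights per category and the percentage of them that were delayed) can be computed as in the minimal sketch below. The toy records and the `delay_share_by` helper are hypothetical illustrations, not the assignment's data or code; with pandas the same summary would usually be a `groupby`.

```python
from collections import Counter

# Hypothetical toy records: (carrier, status). Not the assignment's data.
flights = [
    ("CO", "ontime"), ("CO", "delayed"), ("CO", "ontime"), ("CO", "ontime"),
    ("OH", "ontime"), ("OH", "ontime"),
    ("UA", "delayed"),
]

def delay_share_by(records):
    """Return {category: (n_flights, percent_delayed)} for (category, status) pairs."""
    totals = Counter(cat for cat, _ in records)
    delayed = Counter(cat for cat, status in records if status == "delayed")
    return {cat: (n, 100.0 * delayed[cat] / n) for cat, n in totals.items()}

summary = delay_share_by(flights)
for cat, (n, pct) in sorted(summary.items()):
    print(f"{cat}: {n} flights, {pct:.1f}% delayed")
```

Plotting the counts and the percentages side by side makes it clear when a low delay percentage rests on only a handful of flights.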

A) Flight Carriers

a) The data contains very few flights from the ‘CO’, ‘OH’ and ‘UA’ carriers,
which is why their significance in the model will be low. The model would be
less biased if the data were uniformly distributed across carriers.
b) Most of the flight carriers have a similar percentage of delayed flights,
so the carrier does not appear to affect delays.

B) DISTANCE
a) There are more flights with distance ‘214’ than with any other distance, so
this value will have more significance than the others.
b) Its percentage of delays is also noticeably lower than the others, so these
flights have a lower probability of being delayed.

C) ORIGIN

a) ‘DCA’ has more data points than the other origins and a lower number of
delays, so flights from it have a lower probability of delay.

D) DAY_WEEK

a) Although more flights run midweek, the percentage of delays midweek is
lower than on other days. Midweek flights therefore have a lower chance of
delay, while flights at the ends of the week are delayed more often.

E) DAY_OF_MONTH
a) The percentage of delays increases from week to week through the month,
while the approximate number of flights in each week stays the same.
b) Flights are therefore delayed more often towards the end of the month.

F) DEST

a) ‘LGA’ has far more data points than the other destinations and a lower
percentage of delays, so flights to it have a higher probability of being on time.

2) Preprocess the dataset (to remove null values, generate dummy variables
etc.) and divide the dataset into 60% train and 40% test. Prepare a logistic
model that can obtain accurate classifications of new flights based on their
predictor information.

After pre-processing the data, I obtained the following results:

A) F1score = 0.8792
B) test_accuracy = 87.62%
C) confusion matrix = [[126  43]
                       [ 66 646]]
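
The pre-processing steps named in the question (dropping null values, generating dummy variables, and a 60/40 split) might look roughly like the sketch below. It uses only the standard library and hypothetical toy rows; the helpers `one_hot` and `split_60_40` and the placeholder level "OTHER" are illustrative assumptions, not the code used for the results above.

```python
import random

# Hypothetical toy rows; column names mirror the dummy names used in this report.
raw_rows = [
    {"ORIGIN": "DCA", "DEST": "LGA", "Weather": 0},
    {"ORIGIN": "DCA", "DEST": "OTHER", "Weather": 0},
    {"ORIGIN": "OTHER", "DEST": "LGA", "Weather": 1},
    {"ORIGIN": "DCA", "DEST": "LGA", "Weather": 0},
    {"ORIGIN": "OTHER", "DEST": "OTHER", "Weather": 0},
    {"ORIGIN": None, "DEST": "LGA", "Weather": 0},   # row with a null value
    {"ORIGIN": "DCA", "DEST": "LGA", "Weather": 1},
    {"ORIGIN": "DCA", "DEST": "OTHER", "Weather": 0},
    {"ORIGIN": "OTHER", "DEST": "LGA", "Weather": 0},
    {"ORIGIN": "DCA", "DEST": "LGA", "Weather": 0},
    {"ORIGIN": "OTHER", "DEST": "LGA", "Weather": 0},
]

def one_hot(rows, column):
    """Replace a categorical column with 0/1 dummy columns named '<column>_<level>'."""
    levels = sorted({row[column] for row in rows}, key=str)
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded

def split_60_40(rows, seed=0):
    """Shuffle reproducibly, then cut into 60% train and 40% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# 1) drop rows containing nulls, 2) dummy-encode each categorical column, 3) split
clean = [row for row in raw_rows if all(v is not None for v in row.values())]
for column in ("ORIGIN", "DEST", "Weather"):
    clean = one_hot(clean, column)
train, test = split_60_40(clean)
print(len(train), len(test))
```

When dummies feed a logistic model with an intercept, one level per variable is normally dropped to avoid perfectly collinear columns; the sketch keeps all levels only for readability.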
3) Interpret the model and coefficients and present some insights
Most of the variables in the above model had very small t-statistics,
because the outcome depends mostly on the delay at the origin.

4) Perform variable selection, and reduce the size of the model, only keeping
the relevant variables based on the analysis done earlier.
A) Based on the t-statistics at an 80% confidence level, the following
variables were found to significantly affect the delays:
B) ['Delay', 'DEST_LGA', 'DISTANCE_214', 'Weather_0', 'ORIGIN_DCA']
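
As a rough illustration of this selection rule: under a large-sample normal approximation, an 80% two-sided confidence level corresponds to keeping variables whose |t| exceeds about 1.28. The t-statistics and variable names in the sketch below are hypothetical placeholders (the report does not list the actual values), and the normal approximation itself is an assumption.

```python
from statistics import NormalDist

confidence = 0.80
# Two-sided critical value: 10% of probability in each tail at 80% confidence.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Hypothetical t-statistics, for illustration only.
t_stats = {"var_a": 2.4, "var_b": -1.5, "var_c": 0.3, "var_d": -0.9}

# Keep a variable when its |t| is beyond the critical value.
selected = [name for name, t in t_stats.items() if abs(t) > z_crit]
print(f"critical |t| = {z_crit:.4f}, selected = {selected}")
```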

5) Conclude the analysis by fitting a new model on these selected variables
and report the same. Report the accuracy.
Based on the above variables, the following results were obtained:
A) F1score = 0.8821
B) Test_accuracy = 88.08%
C) confusion matrix = [[134  47]
                       [ 58 642]]
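
The reported metrics can be cross-checked directly from the two confusion matrices. The sketch below assumes rows are true classes and columns are predicted classes, and that the F1 score is the support-weighted average over both classes; the report does not state either convention, but under these assumptions the computed values agree with the reported ones up to rounding in the last digit.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def weighted_f1(cm):
    """Support-weighted mean of per-class F1 (assumes rows = true, columns = predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                          # TP + FN for class k
        predicted = sum(cm[i][k] for i in range(n))   # TP + FP for class k
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        f1 = 2 * tp / (support + predicted) if support + predicted else 0.0
        score += support * f1
    return score / total

cm_full = [[126, 43], [66, 646]]      # model on all variables (section 2)
cm_reduced = [[134, 47], [58, 642]]   # model on the selected variables (section 5)

for name, cm in (("full", cm_full), ("reduced", cm_reduced)):
    print(f"{name}: accuracy = {accuracy(cm):.4f}, weighted F1 = {weighted_f1(cm):.4f}")
```

Both matrices sum to 881 test flights, so the two models are evaluated on test sets of the same size.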

6) Find the ideal weather conditions for the highest chance of an ontime flight
from DC to New York.
A) The model indicates that flight delays depend mostly on the following
variables, all of which have positive (+ve) beta values:

'Delay', [0.09849605],
'DEST_LGA', [0.0844899],
'DISTANCE_214', [0.06678279],
'Weather_0', [0.19673315],
'ORIGIN_DCA', [0.09010007]

B) Therefore the best weather condition is 0, i.e. no weather-related issues.
The model shows little correlation between delays and the day of the week or
month, but the visualisations suggest that midweek days and the early days of
the month have a lower chance of delay.
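
One way to read the beta values listed in A) is as odds ratios. The sketch below assumes they are log-odds coefficients of a standard logistic model, so exp(beta) is the multiplicative change in the odds of the positive class for a one-unit increase in the variable. Which outcome is coded as the positive class is not stated in the report, so the direction of the effect is left open here.

```python
import math

# Coefficients as reported in A) above.
betas = {
    "Delay": 0.09849605,
    "DEST_LGA": 0.0844899,
    "DISTANCE_214": 0.06678279,
    "Weather_0": 0.19673315,
    "ORIGIN_DCA": 0.09010007,
}

# exp(beta): factor by which the odds change when the variable goes from 0 to 1.
odds_ratios = {name: math.exp(b) for name, b in betas.items()}
strongest = max(betas, key=betas.get)

for name, ratio in sorted(odds_ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: odds ratio = {ratio:.3f}")
print("largest coefficient:", strongest)
```

On these numbers the weather indicator has roughly twice the coefficient of any other selected variable, which is consistent with the conclusion in B).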
