STT | First and last name | MSSV | Class | Department | Sign to attend
1 | Trần Quang Linh | 2153527 | CC01 | Transportation |
2 | Đào Bình Phương Nam | 2153592 | CC01 | Transportation |
3 | Phan Tuấn Anh | 2210121 | CC01 | Transportation |
4 | Lê Quốc Anh | 2252025 | CC01 | Transportation |
5 | Phạm Thành Lộc | 2252462 | CC01 | Transportation |
Class: CC01
TABLE OF CONTENTS
1. INTRODUCTION
   Background
   Objective of the analysis
2. DATA INTRODUCTION
   1. Data description
3. THEORETICAL BASIS
   2.1 Linear regression
      2.1.1 Introducing the multiple linear regression model
   2.2 Estimating the parameters of the multiple linear regression model
      2.2.1 Population Regression Function (PRF)
      Sample Regression Function (SRF)
   2.3 Model fit
   2.4 Testing the general significance level of the model (special case of the Wald test)
   2.5 The matrix method for the multiple linear regression problem
   2.6 One-factor analysis of variance
3. DATA PREPROCESSING
4. DESCRIPTIVE STATISTICS
5. INFERENTIAL STATISTICS
6. DISCUSS AND EXPAND
7. CONCLUSION
8. REFERENCES
ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to Ms. Phan Thi Huong – the
lecturer for the Statistical Probability course and our project supervisor. Her
wholehearted guidance enabled the team to complete the assignment on schedule and
effectively address encountered challenges. Her mentorship has served as a guiding
beacon for every action of the team, maximizing the supportive relationship between the
instructor and students within the educational environment.
1. INTRODUCTION
The comprehensive analysis of air traffic passenger statistics holds paramount
importance in understanding the dynamics of aviation within a specific region. In the
context of this report, our focus is directed towards Canada, a nation characterized by
vast geographical landscapes and diverse urban centers. The dataset under examination,
sourced from the Air Traffic Passenger Statistics, encapsulates a wealth of information
crucial for unraveling the intricate patterns and trends that govern air travel within the
Canadian territory.
Background:
Canada, as one of the largest countries globally, boasts a sophisticated air
transportation network that plays a pivotal role in connecting its varied regions. With a
myriad of airports and a robust aviation sector, the air traffic patterns reflect not only the
economic activities but also the cultural diversity and regional nuances that define the
Canadian landscape. The significance of dissecting air traffic passenger statistics in
Canada extends beyond the mere quantification of travel; it serves as a lens through
which we can comprehend the economic vibrancy, tourism dynamics, and the
interconnectedness of its diverse provinces and territories.
This report aspires to provide stakeholders, policymakers, and industry enthusiasts with a
nuanced understanding of the air travel landscape in Canada. By elucidating the
intricacies of air traffic passenger statistics, we endeavor to contribute to a broader
discourse on the factors shaping the aviation sector within this diverse and expansive
nation.
2. Data introduction
1. Data description
The data can be used to study the effect of different factors on air traffic passenger numbers and are summarized in the table below. The population is the set of all flights around the world; to be more specific, the sample consists of flights from Canada, for which the number of passengers is counted and categorized by IATA code.
Variable | Data type | Type | Unit | Description
Operating Airline | – | Categorical | None | The airline that operated the flight.
Published Airline | – | Categorical | None | The airline that published the fare for the flight.
Published Airline IATA Code | – | Categorical | None | The IATA code of the airline that published the fare for the flight.
Passenger Count | { x ∈ N | 1 ≤ x ≤ 659837 } | Continuous | People | The number of passengers on the flight.
Operating Airline IATA Code | String | Categorical | None | The IATA code of the airline that operated the flight.
Year | { x ∈ N | 2005 ≤ x ≤ 2016 } | Continuous | None | The year of the activity.
3. Theoretical basis
2.1 LINEAR REGRESSION
Regression is a statistical method for establishing a relationship between a dependent
variable and a set of independent variables. A model with one dependent variable and
two or more independent variables is called a multiple regression model. For example:
household expenditure on food depends on household size, income, geographical location,
etc.; a country's child mortality rate depends on per capita income, education level, etc.;
a person's salary depends on position, experience, age, etc.
Thus, "linear regression" is a method to predict the value of the dependent variable
(Y) based on the values of the independent variables (X). The term "linear" indicates
that the model is linear (first order) in the population parameters β1, β2, …, βk. It can
be used whenever we want to predict a continuous quantity, for example the time a user
stays on a certain page, or the number of people who have visited a certain website.
Using the collected data, we estimate the population regression function, that is, we
estimate the population parameters β1, β2, …, βk.
2.2 Ordinary Least Squares Method
The least squares method was introduced by the German mathematician Carl Friedrich
Gauss. The idea of this method is to minimize the sum of squares of the residuals.
Thus, to obtain the most appropriate regression, we choose the intercept and slope
estimates so that the sum of squared residuals is as small as possible.
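Formally, writing e_i for the residual of the i-th observation, the OLS estimates β̂1, …, β̂k are the values that solve:

```latex
\min_{\hat\beta_1,\dots,\hat\beta_k}\;\sum_{i=1}^{n} e_i^2
  \;=\; \sum_{i=1}^{n}\left(y_i - \hat\beta_1 - \hat\beta_2 x_{2i} - \cdots - \hat\beta_k x_{ki}\right)^2
```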
2.2.1. Assumptions of the least squares method for multiple linear regression
models
The least squares (OLS) method is a very reliable method for estimating model
parameters; however, the estimated model must satisfy 7 assumptions. When the
assumptions are satisfied, the OLS estimator is the most efficient linear unbiased
estimator; that is, the OLS method provides the best linear unbiased estimator (BLUE).
This result is called the Gauss–Markov theorem: among all linear unbiased
combinations of Y, the OLS estimator has the smallest variance. The assumptions are
as follows:
1/ The regression function is linear according to the coefficients.
This means that the actual process of regression is described by the relationship of
the form:
y = β 1+ β 2x 2+ β 3x 3+ β 4x 4+ … + β kx k+ u
Or the actual relationship can be rewritten, for example, as taking the logarithm of
both sides.
…
$$\frac{\partial \sum e_i^2}{\partial \hat\beta_k} = -2\sum_i \left(y_i - (\hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki})\right) x_{ki} = 0$$
The system of equations we have is called the standard system of equations of the
sample regression. We can solve these k standard equations to find the k β^ unknown
coefficients. β^ 1 , β^ 2 ,…, β^ k are called least squares estimates.
2.2.3. In case the regression function has 2 explanatory variables
The parameters β̂1, β̂2, β̂3 are calculated from the following system of normal equations:
$$\hat\beta_1 + \hat\beta_2 \bar x_2 + \hat\beta_3 \bar x_3 = \bar y$$
$$\hat\beta_1 \sum x_{2i} + \hat\beta_2 \sum x_{2i}^2 + \hat\beta_3 \sum x_{2i} x_{3i} = \sum y_i x_{2i}$$
$$\hat\beta_1 \sum x_{3i} + \hat\beta_2 \sum x_{2i} x_{3i} + \hat\beta_3 \sum x_{3i}^2 = \sum y_i x_{3i}$$
where:
$$\bar x_2 = \sum x_{2i}/n; \quad \bar x_3 = \sum x_{3i}/n; \quad \bar y = \sum y_i/n$$
Solving the system of equations, we find:
$$\hat\beta_1 = \bar y - \hat\beta_2 \bar x_2 - \hat\beta_3 \bar x_3$$
$$\hat\beta_2 = \frac{\left(\sum (y_i-\bar y)(x_{2i}-\bar x_2)\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (y_i-\bar y)(x_{3i}-\bar x_3)\right)\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}$$
$$\hat\beta_3 = \frac{\left(\sum (y_i-\bar y)(x_{3i}-\bar x_3)\right)\left(\sum (x_{2i}-\bar x_2)^2\right) - \left(\sum (y_i-\bar y)(x_{2i}-\bar x_2)\right)\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}$$
Variance and standard deviation are calculated by the following formulas:
$$\operatorname{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum (x_{2i}-\bar x_2)^2\,(1-r_{23}^2)}; \quad \operatorname{Se}(\hat\beta_2) = \sqrt{\operatorname{Var}(\hat\beta_2)}$$
$$\operatorname{Var}(\hat\beta_3) = \frac{\sigma^2}{\sum (x_{3i}-\bar x_3)^2\,(1-r_{23}^2)}; \quad \operatorname{Se}(\hat\beta_3) = \sqrt{\operatorname{Var}(\hat\beta_3)}$$
$$\operatorname{Cov}(\hat\beta_2,\hat\beta_3) = \frac{-r_{23}\,\sigma^2}{(1-r_{23}^2)\sqrt{\sum (x_{2i}-\bar x_2)^2}\sqrt{\sum (x_{3i}-\bar x_3)^2}}$$
where:
$$r_{23}^2 = \frac{\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right)}$$
In the above formulas, σ² is the variance of u_i, which is unknown. Its unbiased estimate is:
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-k}$$
Where: k is the number of parameters in the model.
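As a quick numerical check of the closed-form solution for two explanatory variables, the sketch below evaluates the formulas for β̂1, β̂2, β̂3 on a small made-up dataset and compares them with a generic least-squares solver. It is written in Python with numpy (rather than the report's R) purely as an illustration; all data values are hypothetical.

```python
import numpy as np

# Hypothetical small dataset: y depends on two explanatory variables x2, x3.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Centered sums used in the closed-form solution.
d2, d3, dy = x2 - x2.mean(), x3 - x3.mean(), y - y.mean()
S22, S33, S23 = (d2**2).sum(), (d3**2).sum(), (d2 * d3).sum()
Sy2, Sy3 = (dy * d2).sum(), (dy * d3).sum()

den = S22 * S33 - S23**2
b2 = (Sy2 * S33 - Sy3 * S23) / den            # beta_2 hat
b3 = (Sy3 * S22 - Sy2 * S23) / den            # beta_3 hat
b1 = y.mean() - b2 * x2.mean() - b3 * x3.mean()  # beta_1 hat

# Cross-check against the generic least-squares solver.
X = np.column_stack([np.ones_like(x2), x2, x3])
ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose([b1, b2, b3], ref))  # True
```

Both routes solve the same normal equations, so they agree to machine precision.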
2.3 Model fit
To know how much of the variation in the dependent variable the model explains, we use R².
We have:
$$\sum (y_i - \bar y)^2 = \sum \left[(y_i - \hat y_i) + (\hat y_i - \bar y)\right]^2 = \sum e_i^2 + 2\sum e_i(\hat y_i - \bar y) + \sum (\hat y_i - \bar y)^2$$
Put:
Σ (y_i – ȳ)²: SST – Total Sum of Squares
Σ (ŷ_i – ȳ)²: SSE – Explained Sum of Squares
Σ e_i²: SSR – Residual Sum of Squares
Since Σ e_i(ŷ_i – ȳ) = 0 (because Σ e_i ŷ_i = 0 and Σ e_i = 0),
we can write: SST = SSE + SSR
Meaning of the components:
SST is the sum of squares of all the deviations between the observed values Y i
and the mean.
SSE is the sum of squares of all the deviations between the values of the
dependent variable Y obtained from the sample regression function and their
mean. This part measures the accuracy of the regression function.
SSR is the sum of squares of all the differences between the observed Y values
and the values obtained from the regression function.
SST is divided into two parts: one caused by SSE and one caused by SSR.
From SST = SSE + SSR, dividing both sides by SST, we have:
$$1 = \frac{SSE}{SST} + \frac{SSR}{SST} = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2} + \frac{\sum e_i^2}{\sum (y_i - \bar y)^2}$$
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} = \frac{\left(\sum (y_i - \bar y)(\hat y_i - \bar y)\right)^2}{\left(\sum (y_i - \bar y)^2\right)\left(\sum (\hat y_i - \bar y)^2\right)}$$
The ratio of the total variation explained by the model to the total variation to be
explained is called the coefficient of determination, or the "goodness of fit" statistic.
From the definition of R², we see that R² measures the proportion (or percentage) of the
total deviation of Y from its mean that is explained by the model. R² is therefore used
to measure the suitability of the regression function:
0 ≤ R² ≤ 1.
A high R² means that the estimated model explains a large share of the variation in the
dependent variable.
If R² = 1, the regression line explains 100% of the variation in y.
If R² = 0, the model provides no information about the variation in the
dependent variable y.
In a multivariate regression model, the proportion of the total variation in y explained
jointly by the variables x2 and x3 is called the multiple coefficient of determination,
denoted R²:
$$R^2 = \frac{\hat\beta_2 \sum (y_i-\bar y)(x_{2i}-\bar x_2) + \hat\beta_3 \sum (y_i-\bar y)(x_{3i}-\bar x_3)}{\sum (y_i-\bar y)^2} = 1 - \frac{\sum e_i^2}{\sum (y_i-\bar y)^2}$$
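The decomposition SST = SSE + SSR and the two equivalent expressions for R² can be verified numerically. The following sketch (Python/numpy, illustrative only, with made-up data) fits a model with an intercept and checks that R² = 1 − SSR/SST equals the squared correlation between y and ŷ.

```python
import numpy as np

# Hypothetical data; fit y on x with an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
e = y - y_hat

SST = ((y - y.mean())**2).sum()      # total sum of squares
SSE = ((y_hat - y.mean())**2).sum()  # explained sum of squares
SSR = (e**2).sum()                   # residual sum of squares

R2 = 1 - SSR / SST
r = np.corrcoef(y, y_hat)[0, 1]      # correlation between y and fitted values
print(np.isclose(SST, SSE + SSR), np.isclose(R2, r**2))  # True True
```

Both identities hold exactly for any OLS fit that includes an intercept.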
2.3. Confidence intervals and testing of regression coefficients
2.3.1. Estimating confidence intervals for regression coefficients
The purpose of regression analysis is not only to estimate β1, β2, …, βk but also
to examine the nature of the dependence. Therefore, it is necessary to know the
probability distribution of β̂1, β̂2, …, β̂k. These distributions depend on the distribution
of u_i.
Under the OLS assumptions, u_i has distribution N(0, σ²). The estimated coefficients
follow a normal distribution:
$$\hat\beta_j \sim N\left(\beta_j, \operatorname{Var}(\hat\beta_j)\right)$$
$$\frac{\hat\beta_j - \beta_j}{\operatorname{Se}(\hat\beta_j)} \sim T(n-k)$$
The error variance is estimated from the least squares residuals, where k is the number
of coefficients in the multivariate regression equation:
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-k}$$
For a two-sided interval, we find t_{α/2}(n−3) satisfying:
$$P\left(-t_{\alpha/2}(n-3) \le \frac{\hat\beta_j - \beta_j}{\operatorname{Se}(\hat\beta_j)} \le t_{\alpha/2}(n-3)\right) = 1-\alpha$$
Test for βj:
H0: βj = 0 ⇔ xj has no impact
H1: βj ≠ 0 ⇔ xj has an impact
βj < 0 ⇔ xj has a negative effect
βj > 0 ⇔ xj has a positive effect
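A minimal illustration of the interval above, sketched in Python with numpy and scipy on hypothetical data (not the report's R workflow): it estimates σ̂², the standard error of a slope coefficient, and the two-sided (1 − α) confidence interval using the t quantile with n − k degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical simple dataset: one explanatory variable plus intercept (k = 2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 6.2, 6.8, 8.1])
n, k = len(y), 2

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
sigma2 = (e**2).sum() / (n - k)            # unbiased estimate of sigma^2
cov = sigma2 * np.linalg.inv(X.T @ X)      # variance-covariance matrix of beta hat
se = np.sqrt(np.diag(cov))                 # standard errors

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)  # t_{alpha/2}(n - k)
lo, hi = beta[1] - t_crit * se[1], beta[1] + t_crit * se[1]
print(lo < beta[1] < hi)  # True; the interval is centred on the estimate
```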
2.4 Testing the general significance level of the model (special case of the
Wald test)
2.4.1. Overview of the Wald test
Suppose we have the 2 models below:
(U): Y = β 1 + β 2 X 2 + β 3 X 3 + β 4 X 4 + u
(R): Y = β 1 + β 2 X 2 + v
Model U is called the unrestricted model, and model R is called the restricted
model, because β 3 and β 4 are forced to equal 0 in the R model. We can test the
joint hypothesis β 3 = β 4 = 0 against the alternative hypothesis that at least one of these
coefficients is not equal to 0. Testing this joint hypothesis is called the Wald test;
the procedure is as follows:
Specify the unrestricted and restricted models as:
(U): Y = β 1 + β 2 X 2 + … + β m X m + β m+1 X m+1 + … + β k X k + u
(R): Y = β 1 + β 2 X 2 + … + β m X m + v
Hypothesis:
H 0 : β m+1 = … = β k = 0
H 1 : “The parameters are not all 0 at the same time”
(U) contains k unknown regression coefficients and (R) contains m unknown
regression coefficients. Therefore, model R has (k – m) fewer parameters than U. The
question we ask is whether the (k – m) omitted variables have a meaningful association
effect on Y.
The test statistic for this hypothesis is:
$$F_c = \frac{(SSR_R - SSR_U)/(k-m)}{SSR_U/(n-k)} = \frac{(R_U^2 - R_R^2)/(k-m)}{(1-R_U^2)/(n-k)} \sim F(k-m,\; n-k)$$
where R² is the unadjusted coefficient of determination. Under the null hypothesis, F_c has an F
distribution with (k – m) degrees of freedom for the numerator and (n – k) degrees of
freedom for the denominator.
Reject the hypothesis H 0 when: F c > F(α, k – m, n – k) or the p-value of the F statistic
is less than the given level of significance.
2.4.2. Testing the significance of the model
In the multivariate regression model, the "null" hypothesis that the model is not
significant means that all individual slope coefficients are equal to 0.
The Wald test (commonly known as the F test) is applied as follows:
Step 1: The "null" hypothesis is H 0 : β 2 = β 3 = … = β k = 0.
The alternative hypothesis is H 1 : "at least one of the β values is non-zero".
Step 2: Regress Y on a constant term and X 2 , X 3 , …, X k , then calculate
the sums of squared residuals SSR_U and SSR_R. The F distribution is the ratio of two
independent chi-squared distributed random variables. This gives us the
test statistic:
$$F_c = \frac{(SSR_R - SSR_U)/(k-m)}{SSR_U/(n-k)} \sim F(k-m,\; n-k)$$
Since H 0 : β 2 = β 3 = … = β k = 0, the test statistic for this hypothesis becomes:
$$F_c = \frac{SSE/(k-1)}{SSR/(n-k)} \sim F(k-1,\; n-k)$$
Step 3: Look up the critical value in the F table with (k – 1) degrees of freedom for the
numerator and (n – k) for the denominator, at the given significance level α.
Step 4: Reject the hypothesis H 0 at the α level of significance if F c > F(α; k – 1,
n – k). For the p-value method, calculate the value p = P(F > F c | H 0 ) and reject
the hypothesis H 0 if p < the significance level α.
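The Wald/F procedure can be sketched as follows (Python with numpy/scipy, hypothetical simulated data, illustrative only): fit the restricted and unrestricted models, form F_c from their residual sums of squares, and read the p-value off the F(k − m, n − k) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
# By construction, y depends only on x2; x3 is irrelevant.
y = 1.0 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

def ssr(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return (e**2).sum()

ones = np.ones(n)
X_U = np.column_stack([ones, x2, x3])  # unrestricted model: k = 3 coefficients
X_R = np.column_stack([ones, x2])      # restricted model:   m = 2 coefficients
k, m = 3, 2

Fc = ((ssr(X_R, y) - ssr(X_U, y)) / (k - m)) / (ssr(X_U, y) / (n - k))
p = stats.f.sf(Fc, k - m, n - k)       # P(F > Fc | H0)
print(f"F = {Fc:.3f}, p-value = {p:.3f}")
```

Dropping a regressor can never reduce the residual sum of squares, so F_c is always non-negative.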
2.5 The matrix method solves the multiple linear regression problem
We see that in multivariate regression, if we use the gradual substitution method to
find regression parameters, it will be very complicated and take a lot of time, especially
for models with many explanatory variables. A simpler way to solve it is to use the
matrix method.
Let's say we have a population regression function (PRF):
y = β 1+ β 2x 2+ β 3x 3+ β 4x 4+ … + β kx k+ u
We write for each observation as follows:
y 1 = β 1 + β 2 x 2;1 + β 3 x 3;1 + β 4 x 4;1 + … + β k x k;1 + u 1
y 2 = β 1 + β 2 x 2;2 + β 3 x 3;2 + β 4 x 4;2 + … + β k x k;2 + u 2
⋮ ⋮ ⋮
y n = β 1 + β 2 x 2;n + β 3 x 3;n + β 4 x 4;n + … + β k x k;n + u n
Writing in matrix form we have:
Y = Xβ + U
In which: Y is an n×1 vector
X is an n×k matrix
β is a k×1 vector
U is an n×1 vector
With x 1 represented as a column of ones:
$$x_1 = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
For example, if k = 2:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{21} \\ 1 & x_{22} \\ \vdots & \vdots \\ 1 & x_{2n} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$
The sample regression function is written as follows:
$$\hat y_i = \hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki}$$
$$y_i = \hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki} + e_i$$
Or: Y = Xβ̂ + e
Then the OLS estimates are found by minimizing:
$$\sum e_i^2 = \sum \left(y_i - (\hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki})\right)^2 \rightarrow \min$$
$$e'e = \sum e_i^2 = (Y - X\hat\beta)'(Y - X\hat\beta) = Y'Y - \hat\beta'X'Y - Y'X\hat\beta + \hat\beta'X'X\hat\beta = Y'Y - 2\hat\beta'X'Y + \hat\beta'X'X\hat\beta$$
$$\frac{\partial(e'e)}{\partial\hat\beta} = -2X'Y + 2X'X\hat\beta = 0 \;\Rightarrow\; X'Y = X'X\hat\beta \;\Rightarrow\; \hat\beta = (X'X)^{-1}X'Y$$
From there we can easily calculate the parameters of the sample regression function;
and by applying the rules for calculating variance and standard deviation, we can also
easily find the variance and standard deviation of the regression parameters.
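The matrix formula β̂ = (X′X)⁻¹X′Y can be checked directly. The sketch below (Python/numpy, simulated data, illustrative only, not the report's R code) computes β̂ by the formula and confirms that it matches a standard least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
# Hypothetical design matrix: a column of ones plus three random regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same answer from the library solver.
ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, ref))  # True
```

In practice a solver (or QR decomposition) is preferred over explicitly inverting X′X, for numerical stability; the explicit formula is shown here only to mirror the derivation.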
2.6 One-factor analysis of variance
Theory of one-factor analysis of variance
Analysis of variance compares the means of two or more groups based on the mean
values of samples observed from these groups, and uses hypothesis testing to draw
conclusions about the equality of the population means.
One-factor analysis of variance method
Analysis of variance is used in tests that compare the mean values of two or more
samples drawn from different populations. It can be seen as an extension of the t or z
tests (which compare two mean values).
The purpose of one-factor analysis of variance is to evaluate the influence of a certain
factor (artificial or natural) on the observed values, Y i (i = 1, 2, …, k).
Model:
Factor level: 1, 2, …, k
Observations:
Y11  Y21  …  Yk1
Y12  Y22  …  Yk2
…    …    …  …
Y1N  Y2N  …  YkN
Group totals: T1, T2, …, Tk; overall total T
Group means: Ȳ1, Ȳ2, …, Ȳk; overall mean Ȳ
ANOVA table:
Source of variation | df | Sum of squares | Mean square (MS)
Treatments | k − 1 | $SSB = \sum_{i=1}^{k} \frac{T_i^2}{n} - \frac{T^2}{N}$ | $MSB = \frac{SSB}{k-1}$
Error | k(n − 1) = N − k | SSW = SST − SSB | $MSW = \frac{SSW}{N-k}$
Total | kn − 1 = N − 1 | $SST = \sum_{i=1}^{k}\sum_{j=1}^{n} Y_{ij}^2 - \frac{T^2}{N}$ |
Hypotheses:
H 0: μ1 = μ2 = … = μk ⇔ "the mean values are equal"
H 1: μi ≠ μj for some i ≠ j ⇔ "at least two mean values differ"
Test statistic: F = MSB/MSW
Decision rule: if F < F α (k − 1; N − k), accept the hypothesis H 0; otherwise reject it.
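The table's computation can be sketched numerically (Python with numpy/scipy, hypothetical data with k = 3 groups of n = 5 observations each): build SSB, SSW and F from the sums of squares and compare with scipy's one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups (k = 3, n = 5 each, N = 15).
g1 = np.array([5.1, 4.9, 5.3, 5.0, 4.8])
g2 = np.array([5.6, 5.8, 5.5, 5.9, 5.7])
g3 = np.array([4.2, 4.4, 4.1, 4.3, 4.5])
groups = [g1, g2, g3]
k, N = 3, 15

allv = np.concatenate(groups)
T = allv.sum()                                            # overall total
SSB = sum(g.sum()**2 / len(g) for g in groups) - T**2 / N  # between groups
SST = (allv**2).sum() - T**2 / N                           # total
SSW = SST - SSB                                            # within groups

F_manual = (SSB / (k - 1)) / (SSW / (N - k))

# scipy computes the same one-way ANOVA F statistic.
F_scipy, p = stats.f_oneway(g1, g2, g3)
print(np.isclose(F_manual, F_scipy))  # True
```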
3. Data preprocessing
Code:
data = read.csv("C:/Users /Desktop/Air_Traffic_Passenger_Statistics.csv")
head(data)
names(data)
Explanation: read the data file and store it in an object named data.
Result:
Code:
names(data)
Explanation: output the names of all the variables in data.
Result :
Code:
data$Price.Category.Code <- as.factor(data$Price.Category.Code)
data$Operating.Airline <- as.factor(data$Operating.Airline)
data$Operating.Airline.IATA.Code <- as.factor(data$Operating.Airline.IATA.Code)
data$Published.Airline <- as.factor(data$Published.Airline)
data$Published.Airline.IATA.Code <- as.factor(data$Published.Airline.IATA.Code)
data$GEO.Summary <- as.factor(data$GEO.Summary)
data$GEO.Region <- as.factor(data$GEO.Region)
data$Activity.Type.Code <- as.factor(data$Activity.Type.Code)
data$Terminal <- as.factor(data$Terminal)
data$Boarding.Area <- as.factor(data$Boarding.Area)
data$Adjusted.Activity.Type.Code <- as.factor(data$Adjusted.Activity.Type.Code)
data$Month <- as.factor(data$Month)
Explanation : Convert categorical variables to type factor.
Code:
data_cut=data[c(3,4,7,8,10,11,12,15)]
head(data_cut)
Explanation: extract a sub-dataset named data_cut containing the variables relevant to the topic.
Result :
Code:
new_DF <- subset(data_cut, GEO.Region == "Canada")
Explanation: extract the subset of the data where the categorical variable GEO.Region
equals "Canada" for analysis.
Result:
Code:
apply(is.na(new_DF),2,which)
apply(is.na(new_DF),2,sum)
Explanation: check for and output the missing values of each variable in the data set.
Result :
4. Descriptive statistics
Code:
a = mean(new_DF$Adjusted.Passenger.Count)
b = median(new_DF$Adjusted.Passenger.Count)
c = sd(new_DF$Adjusted.Passenger.Count)
d = max(new_DF$Adjusted.Passenger.Count)
e = min(new_DF$Adjusted.Passenger.Count)
des=data.frame(a,b,c,d,e)
t(des)
Result :
Code:
table(new_DF$Operating.Airline)
table(new_DF$Operating.Airline.IATA.Code)
table(new_DF$GEO.Summary)
table(new_DF$GEO.Region)
table(new_DF$Price.Category.Code)
table(new_DF$Terminal)
table(new_DF$Boarding.Area)
Result :
Code :
hist(new_DF$Adjusted.Passenger.Count)
Result :
Code:
qqnorm(new_DF$Adjusted.Passenger.Count, col="red")
qqline(new_DF$Adjusted.Passenger.Count)
Result:
Code:
shapiro.test(new_DF$Adjusted.Passenger.Count)
Result:
Comment:
Hypothesis H 0 : the adjusted number of passengers follows a normal distribution.
Hypothesis H 1 : the adjusted number of passengers does not follow a normal
distribution.
Because the p-value is smaller than the 5% significance level, we reject H 0 . So the
adjusted number of passengers does not follow a normal distribution.
Code:
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline.IATA.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Summary,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Region,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Price.Category.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Terminal,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Boarding.Area,data = new_DF)
Result :
Comment: based on the interquartile ranges, Air Canada (AC) has the highest median
value and the most extreme outliers compared to other airlines.
Comment: United Airlines and United Airlines – Pre 07/01/2013 have the same airline
code, so their values are merged.
Comment: The GEO Summary box plot indicates a median value of approximately 2000
with an IQR suggesting moderate data dispersion.
Comment: the "Low Fare" category shows less variability and lower passenger counts,
while the "Other" category has a wider range and a higher average. The plot effectively
visualizes the passenger-count distribution for these two price categories, providing
useful insight for decision-making.
Comment: Area E has the highest median and wide range of values. Passengers may have
better experiences in Boarding Area E.
Code:
library(ggplot2)
ggplot(new_DF, aes(x = Operating.Airline, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Operating.Airline and Adjusted.Passenger.Count",
x = "Operating Airline",
y = "Adjusted Passenger Count")
Result:
Comment: The data illustrates a difference in the adjusted passenger counts among
various airline codes. Air Canada (AC) distinctly leads, indicating outstanding scale in
Canada.
Code:
ggplot(new_DF, aes(x = Operating.Airline.IATA.Code, y = Adjusted.Passenger.Count)) +
  geom_point() +
  labs(title = "Relationship between Operating.Airline.IATA.Code and Adjusted.Passenger.Count",
       x = "Operating.Airline.IATA.Code",
       y = "Adjusted Passenger Count")
Result:
Comment: United Airlines and United Airlines – Pre 07/01/2013 have the same airline
code, so their values are merged.
Code:
ggplot(new_DF, aes(x = GEO.Summary, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between GEO.Summary and Adjusted.Passenger.Count",
x = "GEO.Summary",
y = "Adjusted Passenger Count")
Result:
Comment: for GEO Summary, the values range from 0 to 40,000 passengers, with most
observations below 30,000. The data suggests that people in Canada rarely use airplanes
for domestic travel.
Code:
ggplot(new_DF, aes(x = Price.Category.Code, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Price.Category.Code and Adjusted.Passenger.Count",
x = "Price.Category.Code",
y = "Adjusted Passenger Count")
Result:
Comment: There is a significant difference in passenger counts between the “Low Fare”
and “Other” price categories. The “Other” category is vastly more popular than the "Low
Fare".
Code:
ggplot(new_DF, aes(x = Terminal, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Terminal and Adjusted.Passenger.Count",
x = "Terminal",
y = "Adjusted Passenger Count")
Result:
Comment: based on the distribution in the graph, the values vary considerably across
the International, Terminal 1, and Terminal 2 terminals.
Code:
ggplot(new_DF, aes(x = Boarding.Area, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Boarding.Area and Adjusted.Passenger.Count",
x = "Boarding.Area",
y = "Adjusted Passenger Count")
Result:
Comment: Boarding area B has some distinctive values, which are obviously higher than
the mean group of values.
5. Inferential statistics
a, Consider a linear regression model that includes Adjusted Passenger Count
as a dependent variable, and all remaining variables are independent variables. Use
the lm() command to execute the multiple linear regression model.
Code:
m1=lm(Adjusted.Passenger.Count ~ Operating.Airline+Operating.Airline.IATA.Code+Price.Category.Code+Terminal+Boarding.Area, data = new_DF)
summary(m1)
Result:
b, Based on the results of the linear regression model above, which variables
would you remove from the model at the 5% significance level?
H0: The regression coefficients are not statistically significant.
H1: The regression coefficients are statistically significant.
Comment:
Based on the results of the linear model, the Pr(>|t|) values of the variables are almost
all less than 0.05, except for a few sub-levels that are not statistically significant.
In addition, some variables have a Pr value of NA (not available), typically because the
corresponding coefficients cannot be estimated due to collinearity. All variables that
are not statistically significant will be excluded from the model.
c, Consider two linear models that include Adjusted Passenger Count as the
dependent variable but:
+ Model M1 contains all remaining variables as independent variables
+ Model M2 is to remove some variables from model M1.
Code:
m2=lm(Adjusted.Passenger.Count ~ Operating.Airline +Terminal, data = new_DF)
summary(m2)
Result:
Comment: removing the variables Operating.Airline.IATA.Code, Boarding.Area and
Price.Category.Code, whose Pr values are not statistically significant or are NA in
model m1, gives us model m2, which includes only variables that are statistically
significant. This helps optimize the model and focus on the important variables,
improving the statistical relationship between the variables in the regression model.
Comment:
Set up the test hypotheses:
H0: the two models m1 and m2 have the same efficiency.
H1: the two models m1 and m2 have different efficiency.
Looking at the ANOVA comparison of the two models m1 and m2, the Pr(>F) value is
0.3402. At the 5% significance level, since Pr(>F) > 5%, we cannot conclude that there
is a significant difference in regression performance between the two models. To choose
the better regression model, we can consider the adjusted R² coefficient. The adjusted
R² of model m1 is 0.6117, larger than that of model m2 at 0.5928. Therefore, although
model m1 contains variables that are not statistically significant, overall m1 still
appears to perform somewhat better than m2.
Code:
plot(m1,which=1)
Result:
Comment: the graph above plots the fitted values against the corresponding residuals.
The line on the graph is close to horizontal with a slight curve, which shows that the
relationship between the predictor variables X and the dependent variable Y is nearly,
but not completely, linear. In addition, the residuals are dispersed relatively evenly
around the line y = 0 (except for some outliers), suggesting that the variance of the
errors is roughly constant.
7. Conclusion
In this assignment, we analyzed 17 attributes of air traffic passenger data across
15,007 flight records using descriptive statistics and graphs.
8. References
1. Phan Thi Huong, Lecture on Statistical Probability.
2. Nguyen Tien Dung (editor), Nguyen Dinh Huy, Probability – Statistics & Data
Analysis, 2019.
3. Nguyen Dinh Huy (editor), Nguyen Ba Thi, Probability and Statistics Textbook,
2018.
4. J. Chambers, D. Hand, W. Härdle (series eds.), Introductory Statistics with R.
5. Applied Statistics with R, 2020.
6. Nguyen Canh Huy, Lecture on Quantitative Economics.
7. Hoang Van Ha, Sample example of multiple regression.
8. Data: https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics