
VIET NAM NATIONAL UNIVERSITY, HO CHI MINH CITY

HO CHI MINH UNIVERSITY OF TECHNOLOGY




BIG ASSIGNMENT REPORT

STATISTICAL PROBABILITY SUBJECT

Supervisor (GVHD): Phan Thị Hường

No. | Full name | Student ID (MSSV) | Class | Department
1 | Trần Quang Linh | 2153527 | CC01 | Transportation
2 | Đào Bình Phương Nam | 2153592 | CC01 | Transportation
3 | Phan Tuấn Anh | 2210121 | CC01 | Transportation
4 | Lê Quốc Anh | 2252025 | CC01 | Transportation
5 | Phạm Thành Lộc | 2252462 | CC01 | Transportation

Class: CC01

TABLE OF CONTENTS
1. INTRODUCTION
   Background
   Objective of the analysis
2. DATA INTRODUCTION
   1. Data description
3. THEORETICAL BASIS
   2.1 Linear regression
     2.1.1 Introducing the multiple linear regression model
   2.2 Estimate the parameters of the multiple linear regression model
     2.2.1 Population Regression Function (PRF)
     Sample Regression Function (SRF)
   2.3 Model fit
   2.4 Testing the general significance level of the model (special case of the Wald test)
   2.5 The matrix method solves the multiple linear regression problem
   2.6 One-factor analysis of variance
3. DATA PREPROCESSING
4. DESCRIPTIVE STATISTICS
5. INFERENTIAL STATISTICS
6. DISCUSS AND EXPAND
7. CONCLUSION
8. REFERENCES

ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to Ms. Phan Thi Huong – the
lecturer for the Statistical Probability course and our project supervisor. Her
wholehearted guidance enabled the team to complete the assignment on schedule and
effectively address encountered challenges. Her mentorship has served as a guiding
beacon for every action of the team, maximizing the supportive relationship between the
instructor and students within the educational environment.

In conclusion, we wish to express profound gratitude to all individuals, professors, and mentors who have invested their time in guiding our team. This gratitude serves as a deep-seated belief and a significant motivator that propelled our team to successfully complete this project.

1. INTRODUCTION
The comprehensive analysis of air traffic passenger statistics holds paramount
importance in understanding the dynamics of aviation within a specific region. In the
context of this report, our focus is directed towards Canada, a nation characterized by
vast geographical landscapes and diverse urban centers. The dataset under examination,
sourced from the Air Traffic Passenger Statistics, encapsulates a wealth of information
crucial for unraveling the intricate patterns and trends that govern air travel within the
Canadian territory.

Background:
Canada, as one of the largest countries globally, boasts a sophisticated air
transportation network that plays a pivotal role in connecting its varied regions. With a
myriad of airports and a robust aviation sector, the air traffic patterns reflect not only the
economic activities but also the cultural diversity and regional nuances that define the
Canadian landscape. The significance of dissecting air traffic passenger statistics in
Canada extends beyond the mere quantification of travel; it serves as a lens through
which we can comprehend the economic vibrancy, tourism dynamics, and the
interconnectedness of its diverse provinces and territories.

Objective of the Analysis:


The primary objective of this analysis is to derive meaningful insights from the air
traffic passenger statistics within Canada. By scrutinizing key variables such as the
adjusted passenger count across different regions, airlines, and operational parameters,
we aim to identify patterns, trends, and potential factors influencing the variability in
passenger numbers. Through a rigorous statistical examination, we seek to answer pivotal
questions surrounding the drivers of air travel demand within the Canadian context.

This report aspires to provide stakeholders, policymakers, and industry enthusiasts with a
nuanced understanding of the air travel landscape in Canada. By elucidating the
intricacies of air traffic passenger statistics, we endeavor to contribute to a broader
discourse on the factors shaping the aviation sector within this diverse and expansive
nation.

2. Data introduction

1. Data description
The data can be used to study the effect of different factors on air traffic passenger numbers; the variables are described in the table below. The population is all flights around the world, and the sample consists of flights from Canada, for which the number of passengers is counted and categorized by IATA code.
Variable | Data type | Type | Unit | Description
Operating Airline | – | Categorical | None | The airline that operated the flight.
Published Airline | – | Categorical | None | The airline that published the fare for the flight.
Published Airline IATA Code | – | Categorical | None | The IATA code of the airline that published the fare for the flight.
GEO Summary | x = 0 or x = 1 | Categorical | None | A summary of the geographic region (String) (0 = Domestic; 1 = International).
GEO Region | {x ∈ N : 0 ≤ x ≤ 8} | Categorical | None | The geographic region (String) (0 = Asia; 1 = Australia/Oceania; 2 = Canada; 3 = Central America; 4 = Europe; 5 = Mexico; 6 = Middle East; 7 = South America; 8 = US).
Terminal | {x ∈ N : 0 ≤ x ≤ 3} | Categorical | None | The terminal of the flight (String) (0 = International; 1 = Terminal 1; 2 = Terminal 2; 3 = Terminal 3).
Boarding Area | {x ∈ N : 0 ≤ x ≤ 7} | Categorical | None | The boarding area of the flight (String) (0 = A; 1 = B; 2 = C; 3 = D; 4 = E; 5 = F; 6 = G; 7 = Other).
Passenger Count | {x ∈ N : 1 ≤ x ≤ 659837} | Continuous | People | The number of passengers on the flight.
Activity Type Code | {x ∈ N : 0 ≤ x ≤ 2} | Categorical | None | The type of activity (String) (0 = Deplaned; 1 = Enplaned; 2 = Thru/Transit).
Price Category Code | x = 0 or x = 1 | Categorical | None | The price category of the fare (String) (0 = Low fare; 1 = Other).
Operating Airline IATA Code | – | Categorical | None | The IATA code of the airline that operated the flight (String).
Year | {x ∈ N : 2005 ≤ x ≤ 2016} | Continuous | None | The year of the activity.
Month | {x ∈ N : 1 ≤ x ≤ 12} | Continuous | None | The month of the activity.
3. Theoretical basis
2.1 LINEAR REGRESSION
Regression is a statistical method for establishing a relationship between a dependent
variable and a set of independent variables. A model with one dependent variable with
two or more independent variables is called multiple regression (also called multiple
regression). For example: Household expenditure on food depends on household size,
income, geographical location, etc.; A country's child mortality rate depends on per capita
income, education level, etc.; A person's salary depends on position, experience, age,...
The modeling workflow (originally shown as a flow chart):
1. Theory: state the hypotheses
2. Data collection
3. Set up the model
4. Estimate the parameters and check: is the model appropriate?
   - If no: find another model (return to step 3).
   - If yes: use the model for forecasting and decision making.

2.1.1 Introducing the multiple linear regression model


The multiple linear regression model has the following general form:

Y = β1 + β2X2 + β3X3 + … + βkXk + u

where:
- Y: dependent variable
- Xi: independent variables
- β1: intercept (free coefficient)
- βi: partial regression coefficient; βi measures the partial impact of variable Xi on Y under the condition that the other variables in the model remain unchanged. More specifically, holding the other variables fixed, the expected value of Y increases by βi units when Xi increases by 1 unit.
- u: random error.

Thus, "linear regression" is a method to predict the value of the dependent variable (Y) from the values of the independent variables (X). The term "linear" indicates that the model is linear (first order) in the population parameters. It can be used whenever we want to predict a continuous quantity, for example the time a user spends on a certain page or the number of people who have visited a certain website. Using the collected data, we estimate the population regression function, that is, we estimate the population parameters β1, β2, …, βk.

2.2 Estimate the parameters of the multiple linear regression model


2.2.1 Population Regression Function (PRF)
With Y the dependent variable and X2, X3, …, Xk the independent variables, Y is random and has a certain probability distribution. Hence the conditional expectation E(Y|X2, X3, …, Xk) exists and is a definite value, and we define F(X2, X3, …, Xk) = E(Y|X2, X3, …, Xk).

For an individual i there exists (X2,i, X3,i, …, Xk,i, Yi), and in general Yi ≠ F(X2,i, X3,i, …, Xk,i); the difference is the error ui = Yi − F(X2,i, X3,i, …, Xk,i). Therefore:

Yi = E(Y|X2, X3, …, Xk) + ui

The population regression function (PRF):
- Y = E(Y|X) + u
- E(Y|X) = F(X)

Sample Regression Function (SRF):

Because the population is not observed, we cannot know the population mean of the dependent variable exactly; we must rely on sample data to estimate it.

On a sample of n individuals, let Ŷ = F̂(X2, X3, …, Xk) be the sample regression function. For a sample individual, Yi ≠ F̂(X2,i, X3,i, …, Xk,i) yields ei = Yi − F̂(X2,i, X3,i, …, Xk,i); ei is called the residual of the SRF.

The general sample regression function is written as follows:

ŷi = β̂1 + β̂2 x2,i + β̂3 x3,i + … + β̂k xk,i

with residual ei = yi − ŷi.

Notation: β̂m is an estimate of βm. We want β̂m to be an unbiased, and moreover efficient, estimate of βm.

Estimating the SRF: choose a method to estimate the parameters of F by finding the parameters of F̂ and taking the observed values of these parameters as approximate values for the parameters of F.

2.2 Ordinary Least Squares Method

The least squares method was introduced by the German mathematician Carl Friedrich Gauss. The idea of the method is to minimize the sum of squared residuals. In other words, to obtain the most appropriate regression, we choose the intercept and slope estimates that make the residuals as small as possible.
2.2.1. Assumptions of the least squares method for multiple linear regression
models
The least squares (OLS) method is a very reliable method in estimating model
parameters, however the estimated model must satisfy 7 assumptions. When the
assumptions are satisfied, the least squares (OLS) estimator is the most efficient unbiased
linear estimator among the estimators. Therefore, the OLS method provides the best
linear unbiased estimate (BLUE).
This result is called the Gauss - Markov Theorem, according to this theory the OLS
estimator is BLUE, meaning that among all unbiased linear combinations of Y, the OLS
estimator has the smallest variance. The assumptions are as follows:
1/ The regression function is linear according to the coefficients.
This means that the actual process of regression is described by the relationship of
the form:
y = β1 + β2x2 + β3x3 + β4x4 + … + βkxk + u
Or the actual relationship can be rewritten, for example, as taking the logarithm of
both sides.

2/ E(u i ) = 0 : The expectation of the random factors u i is equal to 0.


The overall average error is 0. This means that some error values are positive and
some error values are negative. Because the function is considered an average, it can
be assumed that the above random errors will cancel each other out, on average, in the
population.
3/ Var(ui) = σ²: equal and homogeneous variance for all ui.
All ui are identically distributed with the same variance σ², so that Var(ui) = E(ui²) = σ².
4/ Cov(ui, uj) = 0 for i ≠ j: there is no correlation between the ui.
There is no correlation between the error terms of different observations. For time series data (data collected from one source over many different time periods), the error ui in one period has no correlation with the error in the previous period.
5/ Cov(ui, xi) = 0: u and X have no correlation with each other.
When any explanatory variable becomes larger or smaller, the error term does not change with it.
6/ ui follows a normal distribution.
This is important when constructing confidence intervals and performing hypothesis tests in small samples; it becomes less important as the sample size grows.
7/ There is no perfect linear dependence between x2, x3, …, xk.
If x2, x3, …, xk have a linear relationship, we say there is multicollinearity.
2.2.2. Estimate
Let yi denote the actual value of variable y at observation i, ŷi the value from the sample regression function, and ei = yi − ŷi the residual.

Minimizing Σ(yi − ŷi)² is therefore equivalent to minimizing Σei², from which we can find β̂1, β̂2, …, β̂k.

We have:

Σei² = Σ( yi − (β̂1 + β̂2 x2,i + β̂3 x3,i + β̂4 x4,i + … + β̂k xk,i) )²

The first-order conditions for this minimization are:

∂Σei²/∂β̂1 = −2 Σ( yi − (β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki) ) = 0
∂Σei²/∂β̂2 = −2 Σ( yi − (β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki) ) x2i = 0
⋮
∂Σei²/∂β̂k = −2 Σ( yi − (β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki) ) xki = 0

This system is called the system of normal equations of the sample regression. Solving these k equations gives the k unknown coefficients; β̂1, β̂2, …, β̂k are called the least squares estimates.
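The first-order conditions above can be illustrated in R. This is a minimal sketch on simulated data (not the report's dataset): after an OLS fit, the residuals satisfy the normal equations, i.e. they sum to zero and are orthogonal to every regressor.

```r
# Minimal sketch (simulated data): check the OLS first-order conditions.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x2 - 0.5 * x3 + rnorm(n)

fit <- lm(y ~ x2 + x3)   # OLS fit
e   <- resid(fit)        # residuals e_i

# Normal equations imply sum(e) = 0 and sum(e * x_j) = 0 for each regressor:
c(sum(e), sum(e * x2), sum(e * x3))  # all numerically ~ 0
```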
9
2.2.3. In case the regression function has 2 explanatory variables
The parameters β̂1, β̂2, β̂3 are found from the following system of normal equations:

β̂1 + β̂2 x̄2 + β̂3 x̄3 = ȳ
β̂1 Σx2i + β̂2 Σx2i² + β̂3 Σx2i x3i = Σ yi x2i
β̂1 Σx3i + β̂2 Σx2i x3i + β̂3 Σx3i² = Σ yi x3i

where x̄2 = Σx2i/n; x̄3 = Σx3i/n; ȳ = Σyi/n.

Solving the system gives:

β̂1 = ȳ − β̂2 x̄2 − β̂3 x̄3

β̂2 = [ (Σ(yi − ȳ)(x2i − x̄2))(Σ(x3i − x̄3)²) − (Σ(yi − ȳ)(x3i − x̄3))(Σ(x2i − x̄2)(x3i − x̄3)) ] / [ (Σ(x2i − x̄2)²)(Σ(x3i − x̄3)²) − (Σ(x2i − x̄2)(x3i − x̄3))² ]

β̂3 = [ (Σ(yi − ȳ)(x3i − x̄3))(Σ(x2i − x̄2)²) − (Σ(yi − ȳ)(x2i − x̄2))(Σ(x2i − x̄2)(x3i − x̄3)) ] / [ (Σ(x2i − x̄2)²)(Σ(x3i − x̄3)²) − (Σ(x2i − x̄2)(x3i − x̄3))² ]

The variances and standard errors are calculated by the following formulas:

Var(β̂2) = σ² / [ Σ(x2i − x̄2)² (1 − r23²) ];  Se(β̂2) = √Var(β̂2)
Var(β̂3) = σ² / [ Σ(x3i − x̄3)² (1 − r23²) ];  Se(β̂3) = √Var(β̂3)
Cov(β̂2, β̂3) = −r23 σ² / [ (1 − r23²) √(Σ(x2i − x̄2)²) √(Σ(x3i − x̄3)²) ]

where r23² = (Σ(x2i − x̄2)(x3i − x̄3))² / [ (Σ(x2i − x̄2)²)(Σ(x3i − x̄3)²) ]

In the above formulas, σ² is the variance of ui, which is unknown. Its unbiased estimate is:

σ̂² = Σei² / (n − k)

where k is the number of parameters in the model.
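The closed-form estimates for the two-explanatory-variable case can be checked numerically. A minimal R sketch on simulated data (illustrative variable names, not the report's dataset), compared against lm():

```r
# Sketch: closed-form OLS estimates with two explanatory variables.
set.seed(2)
n  <- 200
x2 <- rnorm(n)
x3 <- 0.3 * x2 + rnorm(n)        # correlated regressors
y  <- 5 + 1.5 * x2 - 2 * x3 + rnorm(n)

# Centered cross-products used in the formulas above:
S_y2 <- sum((y - mean(y)) * (x2 - mean(x2)))
S_y3 <- sum((y - mean(y)) * (x3 - mean(x3)))
S_22 <- sum((x2 - mean(x2))^2)
S_33 <- sum((x3 - mean(x3))^2)
S_23 <- sum((x2 - mean(x2)) * (x3 - mean(x3)))

b2 <- (S_y2 * S_33 - S_y3 * S_23) / (S_22 * S_33 - S_23^2)
b3 <- (S_y3 * S_22 - S_y2 * S_23) / (S_22 * S_33 - S_23^2)
b1 <- mean(y) - b2 * mean(x2) - b3 * mean(x3)

fit <- lm(y ~ x2 + x3)
all.equal(unname(coef(fit)), c(b1, b2, b3))  # should agree
```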
2.3 Model fit
To measure how much of the variation in the dependent variable the model explains, we use R².

We have: Σ(yi − ȳ)² = Σ[(yi − ŷi) + (ŷi − ȳ)]² = Σ[ei + (ŷi − ȳ)]²
= Σei² + 2Σei(ŷi − ȳ) + Σ(ŷi − ȳ)²

Define:
- Σ(yi − ȳ)² : SST – Total Sum of Squares
- Σ(ŷi − ȳ)² : SSE – Explained Sum of Squares
- Σei² : SSR – Residual Sum of Squares

Since Σei(ŷi − ȳ) = 0 (because Σei ŷi = 0 and Σei = 0), we can write:

SST = SSE + SSR

Meaning of the components:
- SST is the sum of squared deviations of the observed values Yi from their mean.
- SSE is the sum of squared deviations of the values of the dependent variable obtained from the sample regression function from their mean; this part measures what the regression function captures.
- SSR is the sum of squared differences between the observed Y values and the values obtained from the regression function.
- SST is thus divided into two parts: one accounted for by SSE and one by SSR.

From SST = SSE + SSR, dividing both sides by SST:

1 = SSE/SST + SSR/SST = Σ(ŷi − ȳ)²/Σ(yi − ȳ)² + Σei²/Σ(yi − ȳ)²

R² = SSE/SST = 1 − SSR/SST = (Σ(yi − ȳ)(ŷi − ȳ))² / [ (Σ(yi − ȳ)²)(Σ(ŷi − ȳ)²) ]

The ratio of the total variation explained by the model to the total variation to be explained is called the coefficient of determination, or "goodness of fit" statistic. By definition, R² measures the proportion (%) of the total deviation of Y from its mean that is explained by the model, and it is used to assess the fit of the regression function:

0 ≤ R² ≤ 1.
- An R² close to 1 means that the estimated model explains a large share of the variation in the dependent variable.
- If R² = 1, the regression explains 100% of the variation in y.
- If R² = 0, the model provides no information about the variation in the dependent variable y.

In a multivariate regression model, the proportion of the total variation in y accounted for by all the variables x2 and x3 is called the multiple coefficient of determination, denoted R²:

R² = [ β̂2 Σ(yi − ȳ)(x2i − x̄2) + β̂3 Σ(yi − ȳ)(x3i − x̄3) ] / Σ(yi − ȳ)² = 1 − Σei² / Σ(yi − ȳ)²
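The decomposition SST = SSE + SSR and the definition of R² can be verified directly. A minimal R sketch on simulated data (not the report's dataset):

```r
# Sketch: verify SST = SSE + SSR and R^2 = SSE/SST = 1 - SSR/SST.
set.seed(3)
n  <- 150
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + x2 + 0.5 * x3 + rnorm(n)

fit  <- lm(y ~ x2 + x3)
yhat <- fitted(fit)
e    <- resid(fit)

SST <- sum((y - mean(y))^2)      # total sum of squares
SSE <- sum((yhat - mean(y))^2)   # explained sum of squares
SSR <- sum(e^2)                  # residual sum of squares

R2 <- SSE / SST
c(SST - (SSE + SSR), R2 - summary(fit)$r.squared)  # both ~ 0
```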
2.3. Confidence intervals and testing of regression coefficients
2.3.1. Estimate confidence intervals for regression coefficients
The purpose of regression analysis is not only to estimate β1, β2, …, βk but also to examine the nature of the dependence. Therefore, we need to know the probability distributions of β̂1, β̂2, …, β̂k, which depend on the distribution of ui.

Under the OLS assumptions, ui has distribution N(0, σ²). The estimated coefficients then satisfy:

β̂j ~ N(βj, Var(β̂j))

(β̂j − βj) / Se(β̂j) ~ T(n − k)

The error variance is estimated from the least squares residuals, where k is the number of coefficients in the multivariate regression equation:

σ̂² = Σei² / (n − k)

For a two-sided interval we find t_{α/2}(n − k) such that:

P( −t_{α/2}(n − k) ≤ (β̂j − βj)/Se(β̂j) ≤ t_{α/2}(n − k) ) = 1 − α

The 1 − α confidence interval for βj is:

[ β̂j − t_{α/2}(n − k) Se(β̂j) ; β̂j + t_{α/2}(n − k) Se(β̂j) ]
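The confidence-interval formula can be checked against R's built-in confint(). A minimal sketch on simulated data (not the report's dataset):

```r
# Sketch: t-based confidence interval for a slope, by hand vs confint().
set.seed(4)
n  <- 80
x2 <- rnorm(n)
y  <- 1 + 0.8 * x2 + rnorm(n)

fit <- lm(y ~ x2)
k   <- length(coef(fit))                                  # number of coefficients
b   <- coef(fit)["x2"]
se  <- summary(fit)$coefficients["x2", "Std. Error"]

alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = n - k)                    # t_{alpha/2}(n - k)

ci_manual <- c(b - tcrit * se, b + tcrit * se)
ci_manual
confint(fit, "x2", level = 0.95)  # should match the manual interval
```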

2.3.2. Test the hypothesis for βj

Testing the statistical significance of a regression coefficient asks whether the explanatory variable really affects the dependent variable, i.e. whether the regression coefficient is statistically significant.

Suppose we hypothesize βj = βj*. If this hypothesis is correct, then:

T = (β̂j − βj*) / Se(β̂j) ~ T(n − k)

We have the following table:

Type of hypothesis | Hypothesis H0 | Hypothesis H1 | Rejection region
Two-sided | βj = βj* | βj ≠ βj* | |t| > t_{α/2}(n − k)
Right-sided | βj ≤ βj* | βj > βj* | t > t_α(n − k)
Left-sided | βj ≥ βj* | βj < βj* | t < −t_α(n − k)

Testing βj:
- H0: βj = 0 ⇔ xj has no impact
- H1: βj ≠ 0 ⇔ xj has an impact
  - βj < 0 ⇔ xj has a negative effect
  - βj > 0 ⇔ xj has a positive effect
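The t-test for a coefficient can be reproduced by hand and compared with the values that summary(lm) reports. A minimal sketch on simulated data (not the report's dataset):

```r
# Sketch: t statistic and two-sided p-value for H0: beta_2 = 0.
set.seed(8)
n  <- 60
x2 <- rnorm(n)
y  <- 2 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x2)
k   <- length(coef(fit))
b   <- coef(fit)["x2"]
se  <- summary(fit)$coefficients["x2", "Std. Error"]

t_stat <- unname((b - 0) / se)                            # test H0: beta_2 = 0
p_two  <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)

c(t_stat, summary(fit)$coefficients["x2", "t value"])     # should be equal
```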

2.4 Testing the general significance level of the model (special case of the Wald test)
2.3.3. Overview of Wald testing
Suppose we have the two models below:
(U): Y = β1 + β2X2 + β3X3 + β4X4 + u
(R): Y = β1 + β2X2 + v
Model U is called the unrestricted model and model R the restricted model, because β3 and β4 are forced to equal 0 in model R. We can test the joint hypothesis β3 = β4 = 0 against the alternative that at least one of these coefficients is non-zero. Testing this joint hypothesis is called the Wald test; the procedure is as follows.
Set up the unrestricted and restricted models:
(U): Y = β1 + β2X2 + … + βmXm + β(m+1)X(m+1) + … + βkXk + u
(R): Y = β1 + β2X2 + … + βmXm + v
Hypotheses:
- H0: β(m+1) = … = βk = 0
- H1: "at least one of these parameters is non-zero"
(U) contains k unknown regression coefficients and (R) contains m unknown regression coefficients, so model R has (k − m) fewer parameters than U. The question we ask is whether the (k − m) omitted variables have a jointly significant effect on Y.
The test statistic for this hypothesis is:

Fc = [ (SSR_R − SSR_U)/(k − m) ] / [ SSR_U/(n − k) ] = [ (R²_U − R²_R)/(k − m) ] / [ (1 − R²_U)/(n − k) ] ~ F(α; k − m, n − k)

where R² is the uncorrected coefficient of determination. Under the null hypothesis, Fc has an F distribution with (k − m) degrees of freedom for the numerator and (n − k) degrees of freedom for the denominator.
Reject H0 when Fc > F(α; k − m, n − k), or when the p-value of the F statistic is less than the given level of significance.
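The restricted-versus-unrestricted F statistic can be computed by hand and compared with R's anova() on nested models. A minimal sketch on simulated data (x3 and x4 play the role of the restricted-out variables):

```r
# Sketch: Wald/F test of the joint restriction beta3 = beta4 = 0.
set.seed(9)
n  <- 150
x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + 0.7 * x2 + 0.4 * x3 + rnorm(n)

fitU <- lm(y ~ x2 + x3 + x4)   # unrestricted model (k = 4 coefficients)
fitR <- lm(y ~ x2)             # restricted model  (m = 2 coefficients)
k <- 4; m <- 2

SSR_U <- sum(resid(fitU)^2)
SSR_R <- sum(resid(fitR)^2)
Fc <- ((SSR_R - SSR_U) / (k - m)) / (SSR_U / (n - k))

c(Fc, anova(fitR, fitU)$F[2])  # manual statistic vs anova() -- equal
```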
2.3.4. Test the significance of the model

In the multivariate regression model, the "null" hypothesis that the model is not significant means that all individual slope coefficients are simultaneously equal to 0. The Wald test (commonly known as the F test) is conducted as follows:
- Step 1: The null hypothesis is H0: β2 = β3 = … = βk = 0. The alternative hypothesis is H1: "at least one of the β values is non-zero".
- Step 2: Regress Y on a constant term and X2, X3, …, Xk, then calculate the sums of squared residuals SSR_U and SSR_R. The F distribution is the ratio of two independent chi-squared random variables (each divided by its degrees of freedom). This gives the statistic:

Fc = [ (SSR_R − SSR_U)/(k − m) ] / [ SSR_U/(n − k) ] ~ F(α; k − m, n − k)

Since H0: β2 = β3 = … = βk = 0, the test statistic for this hypothesis reduces to:

Fc = [ SSE/(k − 1) ] / [ SSR/(n − k) ] ~ F(α; k − 1, n − k)

- Step 3: Look up the F-table value corresponding to (k − 1) degrees of freedom for the numerator and (n − k) for the denominator, at the given significance level α.
- Step 4: Reject H0 at the α level of significance if Fc > F(α; k − 1, n − k). With the p-value method, calculate p = P(F > Fc | H0) and reject H0 if p < α.
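The overall-significance F statistic above can be computed directly from the SSE/SSR decomposition and compared with the value reported by summary(lm). A minimal sketch on simulated data:

```r
# Sketch: overall F test of H0: beta2 = ... = betak = 0.
set.seed(5)
n  <- 120
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 3 + 0.6 * x2 + rnorm(n)          # x3 is truly irrelevant here

fit <- lm(y ~ x2 + x3)
k   <- length(coef(fit))
SSE <- sum((fitted(fit) - mean(y))^2)  # explained sum of squares
SSR <- sum(resid(fit)^2)               # residual sum of squares

Fc <- (SSE / (k - 1)) / (SSR / (n - k))
p  <- pf(Fc, k - 1, n - k, lower.tail = FALSE)

c(Fc, unname(summary(fit)$fstatistic["value"]))  # should be equal
```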
2.5 The matrix method solves the multiple linear regression problem
In multivariate regression, finding the parameters by successive substitution is very complicated and time-consuming, especially for models with many explanatory variables. A simpler way is to use the matrix method.

Suppose we have a population regression function (PRF):

y = β1 + β2x2 + β3x3 + β4x4 + … + βkxk + u

Written out for each observation:

y1 = β1 + β2 x2,1 + β3 x3,1 + β4 x4,1 + … + βk xk,1 + u1
y2 = β1 + β2 x2,2 + β3 x3,2 + β4 x4,2 + … + βk xk,2 + u2
⋮
yn = β1 + β2 x2,n + β3 x3,n + β4 x4,n + … + βk xk,n + un

In matrix form:

Y = Xβ + U

where Y is an n×1 matrix, X is an n×k matrix, β is a k×1 matrix, and U is an n×1 matrix. The first column of X, x1, is a column of ones: x1 = (1, 1, …, 1)'.

For example, if k = 2:

[y1]   [1  x21]        [u1]
[y2] = [1  x22] [β1] + [u2]
[⋮ ]   [⋮   ⋮ ] [β2]   [⋮ ]
[yn]   [1  x2n]        [un]

The sample regression function is written as follows:

ŷi = β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki
yi = β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki + ei

or: Y = Xβ̂ + e

The OLS estimates are then found by minimizing:

Σei² = Σ( yi − (β̂1 + β̂2 x2i + β̂3 x3i + … + β̂k xki) )² → min

e'e = Σei² = (Y − Xβ̂)'(Y − Xβ̂) = Y'Y − β̂'X'Y − Y'Xβ̂ + β̂'X'Xβ̂ = Y'Y − 2β̂'X'Y + β̂'X'Xβ̂

∂(e'e)/∂β̂ = −2X'Y + 2X'Xβ̂ = 0 ⇒ X'Y = X'Xβ̂ ⇒ β̂ = (X'X)⁻¹X'Y

From this we can easily compute the parameters of the sample regression function, and by applying the rules for variances and standard deviations we can just as easily find the variances and standard errors of the regression parameters.
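The matrix solution β̂ = (X'X)⁻¹X'Y is easy to verify numerically. A minimal R sketch on simulated data, checked against coef(lm()):

```r
# Sketch: OLS via the matrix formula beta-hat = (X'X)^{-1} X'Y.
set.seed(6)
n  <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 - x2 + 0.5 * x3 + rnorm(n)

X <- cbind(1, x2, x3)                       # n x k design matrix (column of ones first)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) b = X'y

drop(beta_hat)
coef(lm(y ~ x2 + x3))                       # same values
```

Using solve(A, b) rather than solve(A) %*% b avoids forming the explicit inverse, which is numerically preferable.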
2.6 One-factor analysis of variance
 Theory of one-factor variance analysis
Analysis of variance compares the means of two or more groups, based on the mean values of samples observed from these groups, and uses hypothesis testing to draw conclusions about the equality of the population means.
 One-factor variance analysis method
Analysis of variance is used in tests to compare the mean values of two or more samples drawn from sub-populations. It can be seen as an extension of the t or z tests (which compare two mean values).
 The purpose of one-factor analysis of variance:
Evaluate the influence of a certain factor (artificial or natural) on the observed values.

Model (k treatment groups, n observations per group, N = kn observations in total):

Treatment:     1     2    …    k
Observations: Y11   Y21   …   Yk1
              Y12   Y22   …   Yk2
              …     …     …   …
              Y1n   Y2n   …   Ykn
Group totals: T1    T2    …   Tk   (grand total T)
Group means:  Ȳ1    Ȳ2    …   Ȳk   (grand mean Ȳ)

ANOVA table:

Source of variation | df | Sum of squares | Mean square (MS)
Treatments | k − 1 | SSB = Σ(i=1..k) Ti²/n − T²/N | MSB = SSB/(k − 1)
Error | N − k | SSW = SST − SSB | MSW = SSW/(N − k)
Total | N − 1 | SST = Σ(i=1..k) Σ(j=1..n) Yij² − T²/N |

 Hypotheses:
H0: μ1 = μ2 = … = μk ⇔ "all mean values are equal"
H1: μi ≠ μj for some i ≠ j ⇔ "at least two mean values differ"
 Test statistic: F = MSB/MSW
 Decision: if F < F_α(k − 1; N − k), accept (do not reject) the hypothesis H0.
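The one-factor ANOVA computations can be reproduced by hand and compared with R's aov(). A minimal sketch on simulated data with k = 3 groups of n = 10 observations each:

```r
# Sketch: one-factor ANOVA (SSB, SSW, F) by hand vs aov().
set.seed(7)
k <- 3; n <- 10; N <- k * n
g <- factor(rep(1:k, each = n))                 # group labels
y <- rnorm(N, mean = c(5, 6, 8)[g])             # simulated group means differ

Ti  <- tapply(y, g, sum)                        # group totals T_i
Tt  <- sum(y)                                   # grand total T
SSB <- sum(Ti^2 / n) - Tt^2 / N                 # between-group sum of squares
SST <- sum(y^2) - Tt^2 / N                      # total sum of squares
SSW <- SST - SSB                                # within-group sum of squares

Fc <- (SSB / (k - 1)) / (SSW / (N - k))         # F = MSB / MSW

c(Fc, anova(aov(y ~ g))[["F value"]][1])        # manual vs aov() -- equal
```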

3. Data preprocessing
Code:
data= read.csv("C:/Users /Desktop/Air_Traffic_Passenger_Statistics.csv")
head(data)
names(data)
Explanation: read the CSV file and store it in an object named data.
Result :

Code:

names(data)
Explanation : outputs all variables in data.
Result :

Code:
data$Price.Category.Code <- as.factor(data$Price.Category.Code)
data$Operating.Airline <- as.factor(data$Operating.Airline)
data$Operating.Airline.IATA.Code <- as.factor(data$Operating.Airline.IATA.Code)
data$Published.Airline <- as.factor(data$Published.Airline)
data$Published.Airline.IATA.Code <- as.factor(data$Published.Airline.IATA.Code)
data$GEO.Summary <- as.factor(data$GEO.Summary)
data$GEO.Region <- as.factor(data$GEO.Region)
data$Activity.Type.Code <- as.factor(data$Activity.Type.Code)
data$Terminal <- as.factor(data$Terminal)
data$Boarding.Area <- as.factor(data$Boarding.Area)
data$Adjusted.Activity.Type.Code <- as.factor(data$Adjusted.Activity.Type.Code)
data$Month <- as.factor(data$Month)
Explanation : Convert categorical variables to type factor.
Code:
data_cut=data[c(3,4,7,8,10,11,12,15)]
head(data_cut)
Explanation: extract a sub-dataset named data_cut containing the variables relevant to the topic.

Result :

Code:
new_DF <- subset(data_cut, GEO.Region == "Canada")
Explanation: extract the records of a specific region, Canada, from the categorical variable GEO.Region for analysis.
Result:

Code:
apply(is.na(new_DF),2,which)
apply(is.na(new_DF),2,sum)
Explanation: check for and report the missing values of each variable in the dataset.
Result :

Comment: The file has no missing data.

4. Descriptive statistics
Code:
a = mean(new_DF$Adjusted.Passenger.Count)
b = median(new_DF$Adjusted.Passenger.Count)
c = sd(new_DF$Adjusted.Passenger.Count)
d = max(new_DF$Adjusted.Passenger.Count)
e = min(new_DF$Adjusted.Passenger.Count)
des=data.frame(a,b,c,d,e)
t(des)
Result :

Code:
table(new_DF$Operating.Airline)
table(new_DF$Operating.Airline.IATA.Code)
table(new_DF$GEO.Summary)
table(new_DF$GEO.Region)
table(new_DF$Price.Category.Code)
table(new_DF$Terminal)
table(new_DF$Boarding.Area)

Result :

Code :
hist(new_DF$Adjusted.Passenger.Count)
Result :

Code:
qqnorm(new_DF$Adjusted.Passenger.Count, col="red")
qqline(new_DF$Adjusted.Passenger.Count)

Result:

Code:
shapiro.test(new_DF$Adjusted.Passenger.Count)
Result:

Comment:
Hypothesis H 0 : The adjusted number of passengers follows a normal distribution.
Hypothesis H 1 : The adjusted number of passengers does not follow a normal
distribution.
Because the p-value is smaller than the 5% significance level, we reject H0. Thus the adjusted number of passengers does not follow a normal distribution.

Code:
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline.IATA.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Summary,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Region,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Price.Category.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Terminal,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Boarding.Area,data = new_DF)

Result :

Comment: Based on the boxplots, Air Canada (AC) has the highest median value and the most extreme outliers compared to the other airlines.

Comment: United Airlines and United Airlines – Pre 07/01/2013 have the same airline code, so their values are merged.


Comment: The GEO Summary box plot indicates a median value of approximately 2000
with an IQR suggesting moderate data dispersion.

Comment: The “Low Fare” category shows lower passenger counts with less variability, while the “Other” category spans a wider range with a higher average. The boxplot effectively visualizes the passenger-count distribution of these two fare categories, providing useful input for decision-making.

Comment: The International terminal shows a significantly higher median and interquartile range than Terminal 1 and Terminal 3, suggesting that it handles a much larger volume of passengers.

Comment: Area E has the highest median and the widest range of values, indicating that it
handles the largest passenger volumes among the boarding areas.

Code:
library(ggplot2)
ggplot(new_DF, aes(x = Operating.Airline, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Operating.Airline and Adjusted.Passenger.Count",
x = "Operating Airline",
y = "Adjusted Passenger Count")
Result:

Comment: The data illustrates clear differences in adjusted passenger counts among the
airlines. Air Canada (AC) distinctly leads, indicating its outstanding scale of operations in
Canada.

Code:
ggplot(new_DF, aes(x = Operating.Airline.IATA.Code, y = Adjusted.Passenger.Count)) +
  geom_point() +
  labs(title = "Relationship between Operating.Airline.IATA.Code and Adjusted.Passenger.Count",
       x = "Operating.Airline.IATA.Code",
       y = "Adjusted Passenger Count")
Result:

Comment: United Airlines and United Airlines - Pre 07/01/2013 have the same airline
code, so their values are merged.

Code:
ggplot(new_DF, aes(x = GEO.Summary, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between GEO.Summary and Adjusted.Passenger.Count",
x = "GEO.Summary",
y = "Adjusted Passenger Count")
Result:

Comment: For GEO Summary, adjusted passenger counts range from 0 to about 40,000, with most
observations below 30,000. Domestic counts are noticeably lower, suggesting that people in
Canada rely less on airplanes for domestic travel.

Code:
ggplot(new_DF, aes(x = Price.Category.Code, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Price.Category.Code and Adjusted.Passenger.Count",
x = "Price.Category.Code",
y = "Adjusted Passenger Count")
Result:

Comment: There is a significant difference in passenger counts between the “Low Fare”
and “Other” price categories. The “Other” category is vastly more popular than the "Low
Fare".

Code:
ggplot(new_DF, aes(x = Terminal, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Terminal and Adjusted.Passenger.Count",
x = "Terminal",
y = "Adjusted Passenger Count")
Result:

Comment: Based on the distribution in the graph, the values vary considerably across
the International, Terminal 1, and Terminal 2 terminals.

Code:
ggplot(new_DF, aes(x = Boarding.Area, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Boarding.Area and Adjusted.Passenger.Count",
x = "Boarding.Area",
y = "Adjusted Passenger Count")
Result:

Comment: Boarding area B contains some distinctly high values, noticeably above the
bulk of the observations.

5. Inferential statistics
a, Consider a linear regression model that includes Adjusted Passenger Count
as a dependent variable, and all remaining variables are independent variables. Use
the lm() command to execute the multiple linear regression model.
Code:
m1 = lm(Adjusted.Passenger.Count ~ Operating.Airline + Operating.Airline.IATA.Code +
          Price.Category.Code + Terminal + Boarding.Area, data = new_DF)
summary(m1)
Result:

b, Based on the results of the linear regression model above, which variables
will you remove from the model at the 5% significance level?
H0: The regression coefficients are not statistically significant.
H1: The regression coefficients are statistically significant.

Comment:
Based on the results of the linear model, the Pr(>|t|) values of most variables are below
0.05, except for a few factor levels that are not statistically significant. In addition, some
coefficients have a Pr(>|t|) of NA (not available), which typically arises from collinearity
among the predictors. All variables that are not statistically significant will be excluded
from the model.
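The manual screening described above can be automated. The sketch below uses a small synthetic data set (the report's new_DF and m1 are not reproduced here) to show how the Pr(>|t|) column of summary() can be extracted and used to list coefficients that fail the 5% test; note that aliased (NA) coefficients are dropped from the summary table automatically:

```r
# Hypothetical sketch: flag coefficients whose Pr(>|t|) >= 0.05
# in a fitted lm, using synthetic data instead of the report's new_DF.
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 * df$x1 + rnorm(100)        # y depends on x1 only

fit <- lm(y ~ x1 + x2, data = df)
coefs <- summary(fit)$coefficients    # aliased (NA) coefficients are omitted here
p_values <- coefs[, "Pr(>|t|)"]

# Coefficients that are not statistically significant at the 5% level
print(names(p_values)[p_values >= 0.05])
```

The printed names are the candidate terms to drop before refitting a reduced model, which is the same procedure applied to m1 in this report.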
c, Consider two linear models that include Adjusted Passenger Count as the
dependent variable but:
+ Model M1 contains all remaining variables as independent variables
+ Model M2 is to remove some variables from model M1.
Code:
m2=lm(Adjusted.Passenger.Count ~ Operating.Airline +Terminal, data = new_DF)
summary(m2)
Result:

Comment: Removing the variables Operating.Airline.IATA.Code, Boarding.Area,
and Price.Category.Code, whose coefficients either do not reach statistical significance or
produce NA values in model m1, gives model m2, which includes only statistically
significant variables. This simplifies the model and focuses on the
important variables, improving the interpretability of the regression model.

6. Discuss and expand

Use the anova() command to compare the two regression models, comment on
which model is more reasonable, and draw a diagnostic plot of the selected model.
Code:
anova(m1,m2)
Result:

Comment:
Set up the test hypotheses:
H0: The two models m1 and m2 perform equally well.
H1: The two models m1 and m2 perform differently.
The ANOVA comparison of models m1 and m2 gives Pr(>F) = 0.3402. Since Pr(>F) > 0.05, we
cannot reject H0 at the 5% significance level: the extra variables in m1 do not significantly
improve the fit. To choose between the models, we can also consider the adjusted R²
coefficient. The adjusted R² of model m1 is 0.6117, larger than that of model m2 (0.5928).
Therefore, although model m1 contains some variables that are not statistically significant,
it still explains slightly more of the variance than m2.
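This model-comparison procedure can be sketched in a few lines. The illustrative example below uses synthetic data (the report's m1 and m2 are not reproduced here) to show anova() on a pair of nested models together with the adjusted R² comparison:

```r
# Hypothetical sketch: compare a full and a reduced nested model
# with anova() and adjusted R^2, on synthetic data.
set.seed(2)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 1 + 3 * df$x1 + rnorm(100)       # x2 contributes nothing

full    <- lm(y ~ x1 + x2, data = df)    # analogue of m1
reduced <- lm(y ~ x1, data = df)         # analogue of m2

print(anova(reduced, full))              # if Pr(>F) > 0.05, the extra term adds little
print(c(adj_R2_full    = summary(full)$adj.r.squared,
        adj_R2_reduced = summary(reduced)$adj.r.squared))
```

For nested models the full model's raw R² is never lower than the reduced model's, so the adjusted R², which penalizes extra parameters, is the fairer yardstick when the F-test finds no significant difference.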
Code:
plot(m1,which=1)
Result:

Comment: The graph above plots the fitted values against the corresponding
residuals. The red smoothing line is roughly horizontal with some curvature, which shows
that the relationship between the predictor variables X and the dependent variable Y is
close to, but not perfectly, linear. In addition, the residuals are fairly evenly dispersed
around the line y = 0 (apart from some outliers), suggesting that the error variance is
approximately constant.

7. Conclusion
In this assignment, we examined 17 attributes of air traffic passenger data across
15,007 records using descriptive statistics and graphs.

We also built a multiple linear regression model to estimate the passenger
count, but due to its low accuracy (adjusted R² of 61.17%), the model was not selected. We
then applied an ANOVA comparison, one of the methods for analyzing and
evaluating regression models of the passenger count.

8. References
1. Phan Thi Huong, Lecture on Statistical Probability.
2. Nguyen Tien Dung (editor), Nguyen Dinh Huy, Probability - Statistics & Data Analysis, 2019.
3. Nguyen Dinh Huy (editor), Nguyen Ba Thi, Probability and Statistics Textbook, 2018.
4. J. Jambers, D. Hand, W. Hardle, Introductory Statistics with R.
5. Applied Statistics with R, 2020.
6. PhD. Nguyen Canh Huy, Lecture on Quantitative Economics.
7. Hoang Van Ha, Sample example of multiple regression.
8. Data: https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics
