STT | First and last name | MSSV | Class | Department | Sign to attend
1 | Trần Quang Linh | 2153527 | CC01 | Transportation |
2 | Đào Bình Phương Nam | 2153592 | CC01 | Transportation |
3 | Phan Tuấn Anh | 2210121 | CC01 | Transportation |
4 | Lê Quốc Anh | 2252025 | CC01 | Transportation |
5 | Phạm Thành Lộc | 2252462 | CC01 | Transportation |
Class: CC01
TABLE OF CONTENTS
1. INTRODUCTION
   Background
   Objective of the analysis
2. DATA INTRODUCTION
   1. Data description
3. THEORETICAL BASIS
   2.1 Linear regression
      2.1.1 Introducing the multiple linear regression model
   2.2 Estimating the parameters of the multiple linear regression model
      2.2.1 Population Regression Function (PRF)
      Sample Regression Function (SRF)
   2.3 Model fit
   2.4 Testing the general significance level of the model (special case of the Wald test)
   2.5 The matrix method for the multiple linear regression problem
   2.6 One-factor analysis of variance
3. DATA PREPROCESSING
4. DESCRIPTIVE STATISTICS
5. INFERENTIAL STATISTICS
6. DISCUSS AND EXPAND
7. CONCLUSION
8. REFERENCES
ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to Ms. Phan Thi Huong – the
lecturer for the Statistical Probability course and our project supervisor. Her
wholehearted guidance enabled the team to complete the assignment on schedule and
effectively address encountered challenges. Her mentorship has served as a guiding
beacon for every action of the team, maximizing the supportive relationship between the
instructor and students within the educational environment.
1. INTRODUCTION
The comprehensive analysis of air traffic passenger statistics holds paramount
importance in understanding the dynamics of aviation within a specific region. In the
context of this report, our focus is directed towards Canada, a nation characterized by
vast geographical landscapes and diverse urban centers. The dataset under examination,
sourced from the Air Traffic Passenger Statistics, encapsulates a wealth of information
crucial for unraveling the intricate patterns and trends that govern air travel within the
Canadian territory.
Background:
Canada, as one of the largest countries globally, boasts a sophisticated air
transportation network that plays a pivotal role in connecting its varied regions. With a
myriad of airports and a robust aviation sector, the air traffic patterns reflect not only the
economic activities but also the cultural diversity and regional nuances that define the
Canadian landscape. The significance of dissecting air traffic passenger statistics in
Canada extends beyond the mere quantification of travel; it serves as a lens through
which we can comprehend the economic vibrancy, tourism dynamics, and the
interconnectedness of its diverse provinces and territories.
This report aspires to provide stakeholders, policymakers, and industry enthusiasts with a
nuanced understanding of the air travel landscape in Canada. By elucidating the
intricacies of air traffic passenger statistics, we endeavor to contribute to a broader
discourse on the factors shaping the aviation sector within this diverse and expansive
nation.
2. Data introduction
1. Data description
The data can be used to study the effect of different factors on air traffic passenger numbers and are summarized in the table below. The population is the set of all flights around the world; to be more specific, the sample consists of flights from Canada, for which the number of passengers is counted and categorized by IATA code.
Variable | Data type | Type | Unit | Description
Operating Airline | – | Categorical | None | The airline that operated the flight.
Published Airline | – | Categorical | None | The airline that published the fare for the flight.
Published Airline IATA Code | – | Categorical | None | The IATA code of the airline that published the fare for the flight.
Passenger Count | { x ∈ N | 1 ≤ x ≤ 659837 } | Continuous | People | The number of passengers on the flight.
Operating Airline IATA Code | String | Categorical | None | The IATA code of the airline that operated the flight.
Year | { x ∈ N | 2005 ≤ x ≤ 2016 } | Continuous | None | The year of the activity.
3. Theoretical basis
2.1 LINEAR REGRESSION
Regression is a statistical method for establishing a relationship between a dependent
variable and a set of independent variables. A model with one dependent variable and
two or more independent variables is called a multiple regression model. For example:
household expenditure on food depends on household size, income, geographical location,
etc.; a country's child mortality rate depends on per capita income, education level, etc.;
a person's salary depends on position, experience, age, etc.
Thus, "linear regression" is a method to predict the value of the dependent variable
(Y) based on the values of the independent variables (X). The term "linear" indicates
that the model is linear (first order) in the population parameters β1, β2, …, βk. It can
be used whenever we want to predict a continuous quantity, for example the time a user
stays on a certain page, or the number of people who have visited a certain website.
Using the collected data, we estimate the population regression function, that is, we
estimate the population parameters β1, β2, …, βk.
2.2 Ordinary Least Squares Method
The least squares method was introduced by the German mathematician Carl Friedrich
Gauss. The idea of this method is to minimize the sum of squares of the residuals.
Thus, to obtain the most appropriate regression, we choose the intercept and slope
estimates so that the sum of squared residuals is as small as possible.
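Formally, writing e_i for the residual of the i-th observation, the OLS estimates β̂1, …, β̂k are the values that solve:

```latex
\min_{\hat\beta_1,\dots,\hat\beta_k}\;\sum_{i=1}^{n} e_i^2
  \;=\; \sum_{i=1}^{n}\left(y_i - \hat\beta_1 - \hat\beta_2 x_{2i} - \cdots - \hat\beta_k x_{ki}\right)^2
```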
2.2.1. Assumptions of the least squares method for multiple linear regression
models
The least squares (OLS) method is a very reliable method for estimating model
parameters; however, the estimated model must satisfy 7 assumptions. When the
assumptions are satisfied, the OLS estimator is the most efficient linear unbiased
estimator; that is, the OLS method provides the best linear unbiased estimator (BLUE).
This result is called the Gauss–Markov theorem: among all linear unbiased
combinations of Y, the OLS estimator has the smallest variance. The assumptions are
as follows:
1/ The regression function is linear according to the coefficients.
This means that the actual process of regression is described by the relationship of
the form:
y = β 1+ β 2x 2+ β 3x 3+ β 4x 4+ … + β kx k+ u
Or the actual relationship can be rewritten, for example, as taking the logarithm of
both sides.
…
$$\frac{\partial \sum e_i^2}{\partial \hat\beta_k} = -2\sum_i \left(y_i - (\hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki})\right) x_{ki} = 0$$
The system of equations we have is called the standard system of equations of the
sample regression. We can solve these k standard equations to find the k β^ unknown
coefficients. β^ 1 , β^ 2 ,…, β^ k are called least squares estimates.
2.2.3. In case the regression function has 2 explanatory variables
The parameters β̂1, β̂2, β̂3 are calculated from the following system of normal equations:
$$\hat\beta_1 + \hat\beta_2 \bar x_2 + \hat\beta_3 \bar x_3 = \bar y$$
$$\hat\beta_1 \sum x_{2i} + \hat\beta_2 \sum x_{2i}^2 + \hat\beta_3 \sum x_{2i} x_{3i} = \sum y_i x_{2i}$$
$$\hat\beta_1 \sum x_{3i} + \hat\beta_2 \sum x_{2i} x_{3i} + \hat\beta_3 \sum x_{3i}^2 = \sum y_i x_{3i}$$
where:
$$\bar x_2 = \sum x_{2i}/n; \quad \bar x_3 = \sum x_{3i}/n; \quad \bar y = \sum y_i/n$$
Solving the system of equations, we find:
$$\hat\beta_1 = \bar y - \hat\beta_2 \bar x_2 - \hat\beta_3 \bar x_3$$
$$\hat\beta_2 = \frac{\left(\sum (y_i-\bar y)(x_{2i}-\bar x_2)\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (y_i-\bar y)(x_{3i}-\bar x_3)\right)\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}$$
$$\hat\beta_3 = \frac{\left(\sum (y_i-\bar y)(x_{3i}-\bar x_3)\right)\left(\sum (x_{2i}-\bar x_2)^2\right) - \left(\sum (y_i-\bar y)(x_{2i}-\bar x_2)\right)\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right) - \left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}$$
Variance and standard deviation are calculated by the following formulas:
$$\operatorname{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum (x_{2i}-\bar x_2)^2\,(1-r_{23}^2)}; \quad \operatorname{Se}(\hat\beta_2) = \sqrt{\operatorname{Var}(\hat\beta_2)}$$
$$\operatorname{Var}(\hat\beta_3) = \frac{\sigma^2}{\sum (x_{3i}-\bar x_3)^2\,(1-r_{23}^2)}; \quad \operatorname{Se}(\hat\beta_3) = \sqrt{\operatorname{Var}(\hat\beta_3)}$$
$$\operatorname{Cov}(\hat\beta_2,\hat\beta_3) = \frac{-r_{23}\,\sigma^2}{(1-r_{23}^2)\sqrt{\sum (x_{2i}-\bar x_2)^2}\sqrt{\sum (x_{3i}-\bar x_3)^2}}$$
where:
$$r_{23}^2 = \frac{\left(\sum (x_{2i}-\bar x_2)(x_{3i}-\bar x_3)\right)^2}{\left(\sum (x_{2i}-\bar x_2)^2\right)\left(\sum (x_{3i}-\bar x_3)^2\right)}$$
In the above formulas, σ² is the variance of u_i, which is unknown. Its unbiased estimate is:
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-k}$$
Where: k is the number of parameters in the model.
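As a quick numerical check of the closed-form solution for two explanatory variables, the sketch below evaluates the formulas for β̂1, β̂2, β̂3 on a small made-up dataset and compares them with a generic least-squares solver. It is written in Python with numpy (rather than the report's R) purely as an illustration; all data values are hypothetical.

```python
import numpy as np

# Hypothetical small dataset: y depends on two explanatory variables x2, x3.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Centered sums used in the closed-form solution.
d2, d3, dy = x2 - x2.mean(), x3 - x3.mean(), y - y.mean()
S22, S33, S23 = (d2**2).sum(), (d3**2).sum(), (d2 * d3).sum()
Sy2, Sy3 = (dy * d2).sum(), (dy * d3).sum()

den = S22 * S33 - S23**2
b2 = (Sy2 * S33 - Sy3 * S23) / den            # beta_2 hat
b3 = (Sy3 * S22 - Sy2 * S23) / den            # beta_3 hat
b1 = y.mean() - b2 * x2.mean() - b3 * x3.mean()  # beta_1 hat

# Cross-check against the generic least-squares solver.
X = np.column_stack([np.ones_like(x2), x2, x3])
ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose([b1, b2, b3], ref))  # True
```

Both routes solve the same normal equations, so they agree to machine precision.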
2.3 Model fit
To know how much of the variation in the dependent variable the model explains, we use R².
We have:
$$\sum (y_i - \bar y)^2 = \sum \left[(y_i - \hat y_i) + (\hat y_i - \bar y)\right]^2 = \sum e_i^2 + 2\sum e_i(\hat y_i - \bar y) + \sum (\hat y_i - \bar y)^2$$
Put:
Σ (y_i – ȳ)²: SST – Total Sum of Squares
Σ (ŷ_i – ȳ)²: SSE – Explained Sum of Squares
Σ e_i²: SSR – Residual Sum of Squares
Since Σ e_i(ŷ_i – ȳ) = 0 (because Σ e_i ŷ_i = 0 and Σ e_i = 0),
we can write: SST = SSE + SSR
Meaning of the components:
SST is the sum of squares of all the deviations between the observed values Y i
and the mean.
SSE is the sum of squares of all the deviations between the values of the
dependent variable Y obtained from the sample regression function and their
mean. This part measures the accuracy of the regression function.
SSR is the sum of squares of all the differences between the observed Y values
and the values obtained from the regression function.
SST is divided into two parts: one caused by SSE and one caused by SSR.
From SST = SSE + SSR, dividing both sides by SST, we have:
$$1 = \frac{SSE}{SST} + \frac{SSR}{SST} = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2} + \frac{\sum e_i^2}{\sum (y_i - \bar y)^2}$$
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} = \frac{\left(\sum (y_i - \bar y)(\hat y_i - \bar y)\right)^2}{\left(\sum (y_i - \bar y)^2\right)\left(\sum (\hat y_i - \bar y)^2\right)}$$
The ratio of the total variation explained by the model to the total variation to be
explained is called the coefficient of determination, or the "goodness of fit" statistic.
From the definition of R², we see that R² measures the proportion (or percentage) of the
total deviation of Y from its mean that is explained by the model. R² is therefore used
to measure the suitability of the regression function:
0 ≤ R² ≤ 1.
A high R² means that the estimated model explains a large share of the variation in the
dependent variable.
If R² = 1, the regression line explains 100% of the variation in y.
If R² = 0, the model provides no information about the variation in the
dependent variable y.
In a multivariate regression model, the proportion of the total variation in y explained
jointly by the variables x2 and x3 is called the multiple coefficient of determination,
denoted R²:
$$R^2 = \frac{\hat\beta_2 \sum (y_i-\bar y)(x_{2i}-\bar x_2) + \hat\beta_3 \sum (y_i-\bar y)(x_{3i}-\bar x_3)}{\sum (y_i-\bar y)^2} = 1 - \frac{\sum e_i^2}{\sum (y_i-\bar y)^2}$$
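The decomposition SST = SSE + SSR and the two equivalent expressions for R² can be verified numerically. The following sketch (Python/numpy, illustrative only, with made-up data) fits a model with an intercept and checks that R² = 1 − SSR/SST equals the squared correlation between y and ŷ.

```python
import numpy as np

# Hypothetical data; fit y on x with an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
e = y - y_hat

SST = ((y - y.mean())**2).sum()      # total sum of squares
SSE = ((y_hat - y.mean())**2).sum()  # explained sum of squares
SSR = (e**2).sum()                   # residual sum of squares

R2 = 1 - SSR / SST
r = np.corrcoef(y, y_hat)[0, 1]      # correlation between y and fitted values
print(np.isclose(SST, SSE + SSR), np.isclose(R2, r**2))  # True True
```

Both identities hold exactly for any OLS fit that includes an intercept.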
2.3. Confidence intervals and testing of regression coefficients
2.3.1. Estimating confidence intervals for regression coefficients
The purpose of regression analysis is not only to estimate β1, β2, …, βk but also
to examine the nature of the dependence. Therefore, it is necessary to know the
probability distribution of β̂1, β̂2, …, β̂k. These distributions depend on the distribution
of u_i.
Under the OLS assumptions, u_i has distribution N(0, σ²). The estimated coefficients
follow a normal distribution:
$$\hat\beta_j \sim N\left(\beta_j, \operatorname{Var}(\hat\beta_j)\right)$$
$$\frac{\hat\beta_j - \beta_j}{\operatorname{Se}(\hat\beta_j)} \sim T(n-k)$$
The error variance is estimated from the least squares residuals, where k is the number
of coefficients in the multivariate regression equation:
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-k}$$
For a two-sided interval, we find t_{α/2}(n−3) satisfying:
$$P\left(-t_{\alpha/2}(n-3) \le \frac{\hat\beta_j - \beta_j}{\operatorname{Se}(\hat\beta_j)} \le t_{\alpha/2}(n-3)\right) = 1-\alpha$$
Test for βj:
H0: βj = 0 ⇔ xj has no impact
H1: βj ≠ 0 ⇔ xj has an impact
βj < 0 ⇔ xj has a negative effect
βj > 0 ⇔ xj has a positive effect
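A minimal illustration of the interval above, sketched in Python with numpy and scipy on hypothetical data (not the report's R workflow): it estimates σ̂², the standard error of a slope coefficient, and the two-sided (1 − α) confidence interval using the t quantile with n − k degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical simple dataset: one explanatory variable plus intercept (k = 2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 6.2, 6.8, 8.1])
n, k = len(y), 2

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
sigma2 = (e**2).sum() / (n - k)            # unbiased estimate of sigma^2
cov = sigma2 * np.linalg.inv(X.T @ X)      # variance-covariance matrix of beta hat
se = np.sqrt(np.diag(cov))                 # standard errors

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)  # t_{alpha/2}(n - k)
lo, hi = beta[1] - t_crit * se[1], beta[1] + t_crit * se[1]
print(lo < beta[1] < hi)  # True; the interval is centred on the estimate
```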
2.4 Testing the general significance level of the model (special case of the
Wald test)
2.4.1. Overview of the Wald test
Suppose we have the 2 models below:
(U): Y = β 1 + β 2 X 2 + β 3 X 3 + β 4 X 4 + u
(R): Y = β 1 + β 2 X 2 + v
Model U is called the unrestricted model, and model R is called the restricted
model, because β 3 and β 4 are forced to equal 0 in the R model. We can test the
joint hypothesis β 3 = β 4 = 0 against the alternative hypothesis that at least one of these
coefficients is not equal to 0. Testing this joint hypothesis is called the Wald test;
the procedure is as follows:
Specify the unrestricted and restricted models as:
(U): Y = β 1 + β 2 X 2 + … + β m X m + β m+1 X m+1 + … + β k X k + u
(R): Y = β 1 + β 2 X 2 + … + β m X m + v
Hypothesis:
H 0 : β m+1 = … = β k = 0
H 1 : “The parameters are not all 0 at the same time”
(U) contains k unknown regression coefficients and (R) contains m unknown
regression coefficients. Therefore, model R has (k – m) fewer parameters than U. The
question we ask is whether the (k – m) omitted variables have a meaningful association
effect on Y.
The test statistic for this hypothesis is:
$$F_c = \frac{(SSR_R - SSR_U)/(k-m)}{SSR_U/(n-k)} = \frac{(R_U^2 - R_R^2)/(k-m)}{(1-R_U^2)/(n-k)} \sim F(k-m,\; n-k)$$
where R² is the unadjusted coefficient of determination. Under the null hypothesis, F_c has an F
distribution with (k – m) degrees of freedom for the numerator and (n – k) degrees of
freedom for the denominator.
Reject the hypothesis H 0 when: F c > F(α, k – m, n – k) or the p-value of the F statistic
is less than the given level of significance.
2.4.2. Testing the significance of the model
In the multivariate regression model, the "null" hypothesis that the model is not
significant means that all individual slope coefficients are equal to 0.
The Wald test (commonly known as the F test) is applied as follows:
Step 1: The "null" hypothesis is H 0 : β 2 = β 3 = … = β k = 0.
The alternative hypothesis is H 1 : "at least one of the β values is non-zero".
Step 2: Regress Y on a constant term and X 2 , X 3 , …, X k , then calculate
the sums of squared residuals SSR_U and SSR_R. The F distribution is the ratio of two
independent chi-squared distributed random variables. This gives us the
test statistic:
$$F_c = \frac{(SSR_R - SSR_U)/(k-m)}{SSR_U/(n-k)} \sim F(k-m,\; n-k)$$
Since H 0 : β 2 = β 3 = … = β k = 0, the test statistic for this hypothesis becomes:
$$F_c = \frac{SSE/(k-1)}{SSR/(n-k)} \sim F(k-1,\; n-k)$$
Step 3: Look up the critical value in the F table with (k – 1) degrees of freedom for the
numerator and (n – k) for the denominator, at the given significance level α.
Step 4: Reject the hypothesis H 0 at the α level of significance if F c > F(α; k – 1,
n – k). For the p-value method, calculate the value p = P(F > F c | H 0 ) and reject
the hypothesis H 0 if p < the significance level α.
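The Wald/F procedure can be sketched as follows (Python with numpy/scipy, hypothetical simulated data, illustrative only): fit the restricted and unrestricted models, form F_c from their residual sums of squares, and read the p-value off the F(k − m, n − k) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
# By construction, y depends only on x2; x3 is irrelevant.
y = 1.0 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

def ssr(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return (e**2).sum()

ones = np.ones(n)
X_U = np.column_stack([ones, x2, x3])  # unrestricted model: k = 3 coefficients
X_R = np.column_stack([ones, x2])      # restricted model:   m = 2 coefficients
k, m = 3, 2

Fc = ((ssr(X_R, y) - ssr(X_U, y)) / (k - m)) / (ssr(X_U, y) / (n - k))
p = stats.f.sf(Fc, k - m, n - k)       # P(F > Fc | H0)
print(f"F = {Fc:.3f}, p-value = {p:.3f}")
```

Dropping a regressor can never reduce the residual sum of squares, so F_c is always non-negative.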
2.5 The matrix method solves the multiple linear regression problem
We see that in multivariate regression, if we use the gradual substitution method to
find regression parameters, it will be very complicated and take a lot of time, especially
for models with many explanatory variables. A simpler way to solve it is to use the
matrix method.
Let's say we have a population regression function (PRF):
y = β 1+ β 2x 2+ β 3x 3+ β 4x 4+ … + β kx k+ u
We write for each observation as follows:
y 1 = β 1 + β 2 x 2;1 + β 3 x 3;1 + β 4 x 4;1 + … + β k x k;1 + u 1
y 2 = β 1 + β 2 x 2;2 + β 3 x 3;2 + β 4 x 4;2 + … + β k x k;2 + u 2
⋮ ⋮ ⋮
y n = β 1 + β 2 x 2;n + β 3 x 3;n + β 4 x 4;n + … + β k x k;n + u n
Writing in matrix form we have:
Y = Xβ + U
In which: Y is an n×1 vector
X is an n×k matrix
β is a k×1 vector
U is an n×1 vector
With x 1 represented as a column of ones:
$$x_1 = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
For example, if k = 2:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{21} \\ 1 & x_{22} \\ \vdots & \vdots \\ 1 & x_{2n} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$
The sample regression function is written as follows:
$$\hat y_i = \hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki}$$
$$y_i = \hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki} + e_i$$
Or: Y = Xβ̂ + e
Then the OLS estimates are found by minimizing:
$$\sum e_i^2 = \sum \left(y_i - (\hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \cdots + \hat\beta_k x_{ki})\right)^2 \rightarrow \min$$
$$e'e = \sum e_i^2 = (Y - X\hat\beta)'(Y - X\hat\beta) = Y'Y - \hat\beta'X'Y - Y'X\hat\beta + \hat\beta'X'X\hat\beta = Y'Y - 2\hat\beta'X'Y + \hat\beta'X'X\hat\beta$$
$$\frac{\partial(e'e)}{\partial\hat\beta} = -2X'Y + 2X'X\hat\beta = 0 \;\Rightarrow\; X'Y = X'X\hat\beta \;\Rightarrow\; \hat\beta = (X'X)^{-1}X'Y$$
From there we can easily calculate the parameters of the sample regression function;
and by applying the rules for calculating variance and standard deviation, we can also
easily find the variance and standard deviation of the regression parameters.
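The matrix formula β̂ = (X′X)⁻¹X′Y can be checked directly. The sketch below (Python/numpy, simulated data, illustrative only, not the report's R code) computes β̂ by the formula and confirms that it matches a standard least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
# Hypothetical design matrix: a column of ones plus three random regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same answer from the library solver.
ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, ref))  # True
```

In practice a solver (or QR decomposition) is preferred over explicitly inverting X′X, for numerical stability; the explicit formula is shown here only to mirror the derivation.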
2.6 One-factor analysis of variance
Theory of one-factor analysis of variance
Analysis of variance compares the means of two or more groups based on the mean
values of samples observed from these groups, and uses hypothesis testing to draw
conclusions about the equality of the population means.
One-factor analysis of variance method
Analysis of variance is used in tests that compare the mean values of two or more
samples drawn from different populations. It can be seen as an extension of the t or z
tests (which compare two mean values).
The purpose of one-factor analysis of variance is to evaluate the influence of a certain
factor (artificial or natural) on the observed values, Y i (i = 1, 2, …, k).
Model:
Factor level: 1, 2, …, k
Observations:
Y11  Y21  …  Yk1
Y12  Y22  …  Yk2
…    …    …  …
Y1N  Y2N  …  YkN
Group totals: T1, T2, …, Tk; overall total T
Group means: Ȳ1, Ȳ2, …, Ȳk; overall mean Ȳ
ANOVA table:
Source of variation | df | Sum of squares | Mean square (MS)
Treatments | k − 1 | $SSB = \sum_{i=1}^{k} \frac{T_i^2}{n} - \frac{T^2}{N}$ | $MSB = \frac{SSB}{k-1}$
Error | k(n − 1) = N − k | SSW = SST − SSB | $MSW = \frac{SSW}{N-k}$
Total | kn − 1 = N − 1 | $SST = \sum_{i=1}^{k}\sum_{j=1}^{n} Y_{ij}^2 - \frac{T^2}{N}$ |
Hypotheses:
H 0: μ1 = μ2 = … = μk ⇔ "the mean values are equal"
H 1: μi ≠ μj for some i ≠ j ⇔ "at least two mean values differ"
Test statistic: F = MSB/MSW
Decision rule: if F < F α (k − 1; N − k), accept the hypothesis H 0; otherwise reject it.
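The table's computation can be sketched numerically (Python with numpy/scipy, hypothetical data with k = 3 groups of n = 5 observations each): build SSB, SSW and F from the sums of squares and compare with scipy's one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups (k = 3, n = 5 each, N = 15).
g1 = np.array([5.1, 4.9, 5.3, 5.0, 4.8])
g2 = np.array([5.6, 5.8, 5.5, 5.9, 5.7])
g3 = np.array([4.2, 4.4, 4.1, 4.3, 4.5])
groups = [g1, g2, g3]
k, N = 3, 15

allv = np.concatenate(groups)
T = allv.sum()                                            # overall total
SSB = sum(g.sum()**2 / len(g) for g in groups) - T**2 / N  # between groups
SST = (allv**2).sum() - T**2 / N                           # total
SSW = SST - SSB                                            # within groups

F_manual = (SSB / (k - 1)) / (SSW / (N - k))

# scipy computes the same one-way ANOVA F statistic.
F_scipy, p = stats.f_oneway(g1, g2, g3)
print(np.isclose(F_manual, F_scipy))  # True
```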
3. Data preprocessing
Code:
data = read.csv("C:/Users /Desktop/Air_Traffic_Passenger_Statistics.csv")
head(data)
names(data)
Explanation: read the data file and store it in an object named data.
Result:
Code:
names(data)
Explanation: output the names of all the variables in data.
Result :
Code:
data$Price.Category.Code <- as.factor(data$Price.Category.Code)
data$Operating.Airline <- as.factor(data$Operating.Airline)
data$Operating.Airline.IATA.Code <- as.factor(data$Operating.Airline.IATA.Code)
data$Published.Airline <- as.factor(data$Published.Airline)
data$Published.Airline.IATA.Code <- as.factor(data$Published.Airline.IATA.Code)
data$GEO.Summary <- as.factor(data$GEO.Summary)
data$GEO.Region <- as.factor(data$GEO.Region)
data$Activity.Type.Code <- as.factor(data$Activity.Type.Code)
data$Terminal <- as.factor(data$Terminal)
data$Boarding.Area <- as.factor(data$Boarding.Area)
data$Adjusted.Activity.Type.Code <- as.factor(data$Adjusted.Activity.Type.Code)
data$Month <- as.factor(data$Month)
Explanation : Convert categorical variables to type factor.
Code:
data_cut=data[c(3,4,7,8,10,11,12,15)]
head(data_cut)
Explanation: extract a sub-dataset named data_cut containing the variables relevant to the topic.
Result :
Code:
new_DF <- subset(data_cut, GEO.Region == "Canada")
Explanation: extract the subset of the data where the categorical variable GEO.Region
equals "Canada" for analysis.
Result:
Code:
apply(is.na(new_DF),2,which)
apply(is.na(new_DF),2,sum)
Explanation: check for and output the missing values of each variable in the data set.
Result :
4. Descriptive statistics
Code:
a = mean(new_DF$Adjusted.Passenger.Count)
b = median(new_DF$Adjusted.Passenger.Count)
c = sd(new_DF$Adjusted.Passenger.Count)
d = max(new_DF$Adjusted.Passenger.Count)
e = min(new_DF$Adjusted.Passenger.Count)
des=data.frame(a,b,c,d,e)
t(des)
Result :
Code:
table(new_DF$Operating.Airline)
table(new_DF$Operating.Airline.IATA.Code)
table(new_DF$GEO.Summary)
table(new_DF$GEO.Region)
table(new_DF$Price.Category.Code)
table(new_DF$Terminal)
table(new_DF$Boarding.Area)
Result :
Code :
hist(new_DF$Adjusted.Passenger.Count)
Result :
Code:
qqnorm(new_DF$Adjusted.Passenger.Count, col="red")
qqline(new_DF$Adjusted.Passenger.Count)
Result:
Code:
shapiro.test(new_DF$Adjusted.Passenger.Count)
Result:
Comment:
Hypothesis H 0 : the adjusted number of passengers follows a normal distribution.
Hypothesis H 1 : the adjusted number of passengers does not follow a normal
distribution.
Because the p-value is smaller than the 5% significance level, we reject H 0 . So the
adjusted number of passengers does not follow a normal distribution.
Code:
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Operating.Airline.IATA.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Summary,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~GEO.Region,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Price.Category.Code,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Terminal,data = new_DF)
boxplot(new_DF$Adjusted.Passenger.Count~Boarding.Area,data = new_DF)
Result :
Comment: based on the interquartile ranges, Air Canada (AC) has the highest median
value and the most extreme outliers compared to other airlines.
Comment: United Airlines and United Airlines – Pre 07/01/2013 have the same airline
code, so their values are merged.
Comment: The GEO Summary box plot indicates a median value of approximately 2000
with an IQR suggesting moderate data dispersion.
Comment: the "Low Fare" category shows less variability and lower passenger counts,
while the "Other" category has a wider range and a higher average. The plot effectively
visualizes the passenger-count distribution for these two price categories, providing
useful insight for decision-making.
Comment: Area E has the highest median and wide range of values. Passengers may have
better experiences in Boarding Area E.
Code:
library(ggplot2)
ggplot(new_DF, aes(x = Operating.Airline, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Operating.Airline and Adjusted.Passenger.Count",
x = "Operating Airline",
y = "Adjusted Passenger Count")
Result:
Comment: The data illustrates a difference in the adjusted passenger counts among
various airline codes. Air Canada (AC) distinctly leads, indicating outstanding scale in
Canada.
Code:
ggplot(new_DF, aes(x = Operating.Airline.IATA.Code, y = Adjusted.Passenger.Count)) +
  geom_point() +
  labs(title = "Relationship between Operating.Airline.IATA.Code and Adjusted.Passenger.Count",
       x = "Operating.Airline.IATA.Code",
       y = "Adjusted Passenger Count")
Result:
Comment: United Airlines and United Airlines – Pre 07/01/2013 have the same airline
code, so their values are merged.
Code:
ggplot(new_DF, aes(x = GEO.Summary, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between GEO.Summary and Adjusted.Passenger.Count",
x = "GEO.Summary",
y = "Adjusted Passenger Count")
Result:
Comment: for GEO Summary, the values range from 0 to 40,000 passengers, with most
observations below 30,000. The data suggests that people in Canada rarely use airplanes
for domestic travel.
Code:
ggplot(new_DF, aes(x = Price.Category.Code, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Price.Category.Code and Adjusted.Passenger.Count",
x = "Price.Category.Code",
y = "Adjusted Passenger Count")
Result:
Comment: There is a significant difference in passenger counts between the “Low Fare”
and “Other” price categories. The “Other” category is vastly more popular than the "Low
Fare".
Code:
ggplot(new_DF, aes(x = Terminal, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Terminal and Adjusted.Passenger.Count",
x = "Terminal",
y = "Adjusted Passenger Count")
Result:
Comment: based on the distribution in the graph, the values vary considerably across
the International, Terminal 1, and Terminal 2 terminals.
Code:
ggplot(new_DF, aes(x = Boarding.Area, y = Adjusted.Passenger.Count)) +
geom_point() +
labs(title = "Relationship between Boarding.Area and Adjusted.Passenger.Count",
x = "Boarding.Area",
y = "Adjusted Passenger Count")
Result:
Comment: Boarding area B has some distinctive values, which are obviously higher than
the mean group of values.
5. Inferential statistics
a, Consider a linear regression model that includes Adjusted Passenger Count
as a dependent variable, and all remaining variables are independent variables. Use
the lm() command to execute the multiple linear regression model.
Code:
m1=lm(Adjusted.Passenger.Count ~ Operating.Airline+Operating.Airline.IATA.Code+Price.Category.Code+Terminal+Boarding.Area, data = new_DF)
summary(m1)
Result:
b, Based on the results of the linear regression model above, which variables
would you remove from the model at the 5% significance level?
H0: The regression coefficients are not statistically significant.
H1: The regression coefficients are statistically significant.
Comment:
Based on the results of the linear model, the Pr(>|t|) values of the variables are almost
all less than 0.05, except for a few sub-levels that are not statistically significant.
In addition, some variables have a Pr value of NA (not available), typically because the
corresponding coefficients cannot be estimated due to collinearity. All variables that
are not statistically significant will be excluded from the model.
c, Consider two linear models that include Adjusted Passenger Count as the
dependent variable but:
+ Model M1 contains all remaining variables as independent variables
+ Model M2 is to remove some variables from model M1.
Code:
m2=lm(Adjusted.Passenger.Count ~ Operating.Airline +Terminal, data = new_DF)
summary(m2)
Result:
Comment: removing the variables Operating.Airline.IATA.Code, Boarding.Area and
Price.Category.Code, whose Pr values are not statistically significant or are NA in
model m1, gives us model m2, which includes only variables that are statistically
significant. This helps optimize the model and focus on the important variables,
improving the statistical relationship between the variables in the regression model.
Comment:
Set up the test hypotheses:
H0: the two models m1 and m2 have the same efficiency.
H1: the two models m1 and m2 have different efficiency.
Looking at the ANOVA comparison of the two models m1 and m2, the Pr(>F) value is
0.3402. At the 5% significance level, since Pr(>F) > 5%, we cannot conclude that there
is a significant difference in regression performance between the two models. To choose
the better regression model, we can consider the adjusted R² coefficient. The adjusted
R² of model m1 is 0.6117, larger than that of model m2 at 0.5928. Therefore, although
model m1 contains variables that are not statistically significant, overall m1 still
appears to perform somewhat better than m2.
Code:
plot(m1,which=1)
Result:
Comment: the graph above plots the fitted values against the corresponding residuals.
The line on the graph is close to horizontal with a slight curve, which shows that the
relationship between the predictor variables X and the dependent variable Y is nearly,
but not completely, linear. In addition, the residuals are dispersed relatively evenly
around the line y = 0 (except for some outliers), suggesting that the variance of the
errors is roughly constant.
7. Conclusion
In this assignment, we analyzed 17 attributes of air traffic passenger data across
15,007 flight records using descriptive statistics and graphs.
8. References
1. Phan Thi Huong, Lecture on Statistical Probability.
2. Nguyen Tien Dung (editor), Nguyen Dinh Huy, Probability – Statistics & Data
Analysis, 2019.
3. Nguyen Dinh Huy (editor), Nguyen Ba Thi, Probability and Statistics Textbook,
2018.
4. J. Chambers, D. Hand, W. Härdle (series eds.), Introductory Statistics with R.
5. Applied Statistics with R, 2020.
6. Nguyen Canh Huy, Lecture on Quantitative Economics.
7. Hoang Van Ha, Sample example of multiple regression.
8. Data: https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics