Lecture 3
cityu hk

LECTURE 3

REGRESSION ANALYSIS
- MULTIPLE REGRESSION

1
AGENDA

 Last class:
 Ŷᵢ = 0.326 + 0.1578 Xᵢ  For every $1 increase in taxi fare, what can we expect?
 r² = 0.5533  What does it say about our model?
 H₀: β₁ = 0  p-value is very, very close to 0, which implies…

 Basic Concepts of Multiple Linear Regression


 Using Categorical (Dummy) Variables
 Measures of Variation and Statistical Inference

2
FORMULATION OF MULTIPLE REGRESSION
MODEL

 A multiple regression model relates one dependent variable to two or more
independent variables in a linear function:

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + β_K X_Kᵢ + εᵢ

where Yᵢ is the dependent variable, β₀ is the population intercept, β₁, …, β_K are the population slope coefficients, X₁ᵢ, …, X_Kᵢ are the independent variables, and εᵢ is the random error

 K is the number of independent variables (e.g., K = 1 for simple linear regression)


 β₀, β₁, β₂, …, β_K are the K + 1 parameters in a multiple regression model with K independent
variables
 b₀, b₁, b₂, …, b_K denote the corresponding sample intercept and sample slope coefficients
3
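The sample coefficients b₀, b₁, …, b_K can be estimated from data by ordinary least squares. A minimal sketch, assuming numpy is available and using simulated data (not the lecture's taxi data):

```python
# Estimate (b0, b1, b2) by least squares on simulated data whose true
# relationship is Y = 2 + 0.5*X1 - 1.0*X2 + noise.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X1 = rng.uniform(0, 20, n)                     # a continuous X-variable
X2 = rng.integers(0, 2, n).astype(float)       # a 0/1 X-variable
Y = 2.0 + 0.5 * X1 - 1.0 * X2 + rng.normal(0, 0.1, n)

# Design matrix: a column of 1s for the intercept, then the K X-variables
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)      # b = (b0, b1, b2)
```

With this much data and little noise, the estimates land very close to the true parameters (2, 0.5, −1).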
MULTIPLE REGRESSION, 2 EXPLANATORY
VARIABLES

 Say we have 𝑛 data points or 𝑛 observations


 Our observations are of the form (X₁₁, X₂₁, Y₁), (X₁₂, X₂₂, Y₂), …, (X₁ₙ, X₂ₙ, Yₙ)

Observation # | Taxi – pre-tipped fare | RatecodeID (1=NYC, 2=JFK) | Taxi – tips | (X₁ᵢ, X₂ᵢ, Yᵢ)
#1  | 8.30  | 1 | 1.65 | (8.30, 1, 1.65)
#2  | 15.30 | 1 | 1.00 | (15.30, 1, 1.00)
#3  | 7.80  | 1 | 1.25 | (7.80, 1, 1.25)
…
#27 | 52.80 | 2 | 5.00 | (52.80, 2, 5.00)

4

Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
TLC Trip Record Data: January 2019 Yellow Taxi Trip Records, published by NYC Taxi & Limousine Commission
(We will need to “fix” the RatecodeID column later…)
FORMULATION OF MULTIPLE REGRESSION
MODEL

5
FORMULATION OF MULTIPLE REGRESSION
MODEL

 Coefficients in a multiple regression net out the impact of each independent
variable in the regression equation
 The estimated slope coefficient, bⱼ, measures the change in the average value of
Y as a result of a one-unit increase in Xⱼ, holding all other independent variables
constant – the “ceteris paribus” effect

Ŷ = b₀ + b₁X₁ + b₂X₂ + ⋯ + bⱼXⱼ + ⋯ + b_K X_K

6
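The "ceteris paribus" reading can be checked directly: holding everything else fixed, raising Xⱼ by one unit moves the fitted value by exactly bⱼ. A short sketch, using the two-variable taxi estimates that appear later in the lecture (any coefficient values would do):

```python
# Demonstrate that a one-unit increase in X1, with X2 held fixed,
# changes the fitted value by exactly b1.
b0, b1, b2 = 1.3771, 0.1488, -0.9521

def y_hat(x1, x2):
    """Fitted value of the two-variable regression."""
    return b0 + b1 * x1 + b2 * x2

# Increase the fare X1 by $1 while holding the area indicator X2 fixed:
change = y_hat(11.0, 1) - y_hat(10.0, 1)
print(round(change, 4))  # 0.1488, i.e. exactly b1
```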
EXAMPLE – USING CATEGORICAL (DUMMY)
VARIABLES

 Last time, we did a simple linear regression on taxi fare and tips.
 We want to see if the location also affects the tip.
 Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport
 Can we use column E as-is? Consider two trips from NYC and JFK, both with
fares of $10.

Observation #i | Taxi – pre-tipped fare X₁ᵢ | RatecodeID (1=NYC, 2=JFK) X₂ᵢ | What the model looks like: Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ
e.g.1 | 10.00 | 1 | Ŷ₁ = b₀ + 10b₁ + b₂
e.g.2 | 10.00 | 2 | Ŷ₂ = b₀ + 10b₁ + 2b₂

b₂ vs 2b₂? Double the bonus?

7
USING CATEGORICAL (DUMMY) VARIABLES

 Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport


 Let’s define a new column: AreaID. We are “inside” the area if we are in NYC,
“outside” the area if we are NOT in NYC (i.e. JFK, etc).
 We can pre-process the data so that 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0
if we are outside NYC

Observation #i | Taxi – pre-tipped fare X₁ᵢ | AreaID (1=NYC, 0=JFK) X₂ᵢ | What the model looks like: Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ
e.g.1 | 10.00 | 1 | Ŷ₁ = b₀ + 10b₁ + b₂
e.g.2 | 10.00 | 0 | Ŷ₂ = b₀ + 10b₁

8
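This pre-processing step is a simple recode. A sketch in pure Python (the column names are from the lecture's spreadsheet):

```python
# Recode RatecodeID (1 = NYC, 2 = JFK) into a 0/1 AreaID dummy:
# 1 = inside NYC, 0 = outside NYC.
ratecode = [1, 1, 1, 2, 2, 1]
area_id = [1 if code == 1 else 0 for code in ratecode]
print(area_id)  # [1, 1, 1, 0, 0, 1]
```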


USING CATEGORICAL (DUMMY) VARIABLES
 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0 if we are outside NYC
 Interpretation:
 If 𝑏2 > 0: Everything else remaining constant, we expect to receive a bonus tip of
$|𝑏2 | when we pick up a passenger in NYC
 If 𝑏2 < 0: Everything else remaining constant, we expect our tip to reduce by $|𝑏2 |
when we pick up a passenger in NYC.
 This variable incorporates a fixed tip amount for NYC vs non-NYC trips, NOT a
change in the tips %!

Observation #i | Taxi – pre-tipped fare X₁ᵢ | AreaID (1=NYC, 0=JFK) X₂ᵢ | What the model looks like: Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ
e.g.1 | 10.00 | 1 | Ŷ₁ = b₀ + 10b₁ + b₂
e.g.2 | 10.00 | 0 | Ŷ₂ = b₀ + 10b₁

9


USING CATEGORICAL (DUMMY) VARIABLES

 Useful when an explanatory variable isn’t numerical (e.g. colours, locations)


 Use 0/1 variables: 0 = “is not / does not fit the definition”, 1 = “is / fits the definition”
 If a category has c choices, then we need c − 1 dummy variables
 E.g. product design: a product can be red, yellow, or blue, and we want to see how
colour affects popularity. In a regression model, we need 2 dummy variables:
 X₁ = 1 if it is red, and 0 otherwise
 X₂ = 1 if it is yellow, and 0 otherwise

Obs #i | Red? X₁ᵢ | Yellow? X₂ᵢ | What the model looks like: Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ⋯
e.g.1 (Red)    | 1 | 0 | Ŷ₁ = b₀ + b₁ + ⋯
e.g.2 (Yellow) | 0 | 1 | Ŷ₂ = b₀ + b₂ + ⋯
e.g.3 (Blue)   | 0 | 0 | Ŷ₃ = b₀ + ⋯

10
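The c − 1 rule can be sketched in pure Python: a category with c = 3 choices becomes 2 dummy variables, with the omitted choice (blue) acting as the baseline absorbed by b₀.

```python
# Encode a 3-choice colour category as c - 1 = 2 dummy variables
# (red?, yellow?); blue is the baseline, encoded as (0, 0).
colours = ["red", "yellow", "blue", "red"]
dummies = [(1 if c == "red" else 0, 1 if c == "yellow" else 0) for c in colours]
print(dummies)  # [(1, 0), (0, 1), (0, 0), (1, 0)]
```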
BUILDING THE MODEL
 After fixing the categorical variable for AreaID, we can fill in the regression
window.

11
MODEL OUTPUT
 Excel’s Output:

Ŷ = 1.3771 + 0.1488 X₁ − 0.9521 X₂

12

*Scientific notation: 1.7284E − 226 = 1.7284 × 10−226 ≈ 0


INTERPRETATION OF ESTIMATES

 The estimated multiple regression equation:


Ŷ = 1.3771 + 0.1488 X₁ − 0.9521 X₂
 Ŷ = estimated taxi tip in $
 X₁ = pre-tip fare amount in $
 X₂ = area indicator (NYC = 1, non-NYC (JFK) = 0)
 Interpretation of the estimated slope coefficients:
 b₁ = 0.1488 says that the estimated average tip increases by $0.1488 for each $1
increase in pre-tip taxi fare, given that the other independent variables remain constant
 b₂ = −0.9521 says that the estimated average tip is $0.9521 lower when the trip starts in
NYC rather than at JFK, given that the other independent variables remain constant
13
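Plugging two $10 trips into the estimated equation makes the dummy interpretation concrete: the coefficient b₂ is exactly the predicted NYC-vs-JFK tip gap. A small sketch:

```python
# Compare fitted tips for two $10 trips, one starting in NYC (X2 = 1)
# and one starting at JFK (X2 = 0), using the estimated equation.
b0, b1, b2 = 1.3771, 0.1488, -0.9521

def tip_hat(fare, nyc):
    return b0 + b1 * fare + b2 * nyc

nyc_tip = tip_hat(10.0, 1)
jfk_tip = tip_hat(10.0, 0)
print(round(nyc_tip - jfk_tip, 4))  # -0.9521, i.e. exactly b2
```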
COMPARISON OF MODELS
 Suppose we add more explanatory variables
 𝑋1 = Pre-tip amount in $
 𝑋2 = Area indicator (NYC =1, Non-NYC (JFK) = 0)
 𝑋3 = # of riders
 X₄ = New Year’s Day indicator (Jan 1 = 1, otherwise = 0)


Ŷ = 1.3181 + 0.1485 X₁ − 0.9501 X₂ + 0.0404 X₃ + 0.0503 X₄

14
INTERPRETATION OF ESTIMATES

 Multiple regression model:


𝑌෠ = 1.3181 + 0.1485 𝑋1 − 0.9501 𝑋2 + 0.0404𝑋3 + 0.0503𝑋4
 The estimated slope coefficient
 b₁ = 0.1485 says that the estimated average tip increases by $0.1485 for each $1
increase in pre-tip taxi fare, holding all other things equal
 b₂ = −0.9501 says that the estimated average tip is $0.9501 lower when the trip starts in
NYC rather than at JFK, holding all other things equal
 b₃ = 0.0404 says that the estimated average tip increases by $0.0404 for each
additional rider, holding all other things equal
 b₄ = 0.0503 says that the estimated average tip increases by $0.0503 if the trip is on
New Year’s Day, holding all other things equal
15
EVALUATE THE MODEL

 𝑟 2 and adjusted 𝑟 2
 F-test for overall model significance
 t-test for a particular 𝑋-variable significance

16
MEASURES OF VARIATION - 𝑟 2

 𝑌෠ = 1.3181 + 0.1485 𝑋1 − 0.9501 𝑋2 + 0.0404𝑋3 + 0.0503𝑋4


 Total variation of the 𝑌-variable is made up of two parts

𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸


where
SST = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²   (total)
SSR = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)²   (regression)
SSE = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²   (error)

[Figure: decomposition of the variation of Y; the four X-variables are pre-tip fare, area, # of passengers, and New Year’s Day]

17
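The decomposition SST = SSR + SSE holds exactly for any least-squares fit that includes an intercept. A sketch verifying it on a small made-up sample, assuming numpy:

```python
# Fit a least-squares line and verify SST = SSR + SSE numerically.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

X = np.column_stack([np.ones_like(x), x])   # intercept + one X-variable
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
assert abs(sst - (ssr + sse)) < 1e-9        # the identity holds
```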
MEASURES OF VARIATION - 𝑟 2

 We can ALWAYS increase 𝑟 2 by adding variables that don’t explain the changes in 𝑌
 Easier to see with less data. See “r-squared comparison” tab in spreadsheet
 We add one more column of 0/1s. 1 = odd number row, 0 = even number row

Vs.

18
MEASURES OF VARIATION - 𝑟 2

 What is the net effect of adding a new 𝑋-variable?


 r² increases, even if the new X-variable explains an insignificant proportion of the
variation of the Y-variable
 Is it fair to use r² for comparing models with different numbers of X-variables?

 A degree of freedom* will be lost, as a slope coefficient has to be estimated for that
new 𝑋-variable
 Did the new 𝑋-variable add enough explanatory power to offset the loss of one degree of
freedom?

 Degrees of freedom of the residual = n − (K + 1) = n − 1 − K

*Degrees of freedom: Number of independent pieces of information (data values) in the random sample.
If 𝐾 + 1 parameters (intercept, slopes) must be estimated before the sum of squares errors, SSE, can be calculated from a sample of size
n, the degrees of freedom are equal to 𝑛 − (𝐾 + 1) (𝐾 + 1 coefficients of b0, b1, …, bK).
19
MEASURES OF VARIATION – ADJUSTED 𝑟 2

(Recall: r² = 1 − SSE/SST)

 Adjusted r² = 1 − [SSE/(n − K − 1)] / [SST/(n − 1)] = 1 − (1 − r²)(n − 1)/(n − K − 1)

 Measures the proportion of variation of the Y values that is explained by the
regression equation with independent variables X₁, X₂, …, X_K, after adjusting
for the sample size (n) and the number of X-variables used (K)
 Smaller than or equal to r², and can be negative
 Penalizes the excessive use of X-variables
 Useful for comparing models with different numbers of X-variables

20
EXAMPLE – ADJUSTED 𝑟 2
 Compare the models that we’ve built
 Number of Observations: 197,103
 SST: 1,163,798

                              | 1 explanatory variable (pre-tip fare) | 2 explanatory variables (pre-tip fare, area ID) | 4 explanatory variables
Degrees of freedom – residual | 197,101  | 197,100  | 197,098
SSE                           | 519,852  | 517,136  | 516,911
r²                            | 0.553314 | 0.555647 | 0.555841
Adjusted r²                   | 0.553312 | 0.555643 | 0.555832

21
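The 4-variable column of the table can be reproduced directly from the slide's n, SST, and SSE figures:

```python
# Reproduce r-squared and adjusted r-squared for the 4-variable taxi model.
n, K = 197_103, 4
sst, sse = 1_163_798, 516_911

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - K - 1)) / (sst / (n - 1))
print(round(r2, 6), round(adj_r2, 6))  # 0.555841 0.555832
```

With n this large relative to K, the adjustment is tiny; it matters far more for small samples with many X-variables.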
INFERENCE: OVERALL MODEL SIGNIFICANCE

 Is the model significant? Do we need a model?


 F-test

22
OVERALL MODEL SIGNIFICANCE: F-TEST

 F-test for the overall model significance

 Null hypothesis 𝐻0 : 𝛽1 = 𝛽2 = ⋯ = 𝛽𝐾 = 0 (none of the 𝑋-variables affects 𝑌)

 Alternative hypothesis: 𝐻1 : 𝐴𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝛽𝑖 ≠ 0 (at least one 𝑋-variable affects 𝑌)

 We want to REJECT the null hypothesis by showing that the probability of seeing
our values of b₁, b₂, …, b_K is “low” if H₀ were indeed true.

 F-statistic:

F = MSR/MSE = (SSR/K) / (SSE/(n − K − 1)), with (K, n − 1 − K) degrees of freedom (d.f.)
23
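The F-statistic for the 4-variable taxi model follows from the slide's sums of squares, and matches the F ≈ 61,664 that Excel reports:

```python
# Compute the F-statistic from SST, SSE and the degrees of freedom.
n, K = 197_103, 4
sst, sse = 1_163_798, 516_911
ssr = sst - sse              # SST = SSR + SSE

msr = ssr / K                # mean square, regression
mse = sse / (n - K - 1)      # mean square, error
f_stat = msr / mse
print(round(f_stat))  # 61664
```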
OVERALL MODEL SIGNIFICANCE: F-TEST

 F = MSR/MSE = (SSR/K) / (SSE/(n − K − 1)), with (K, n − 1 − K) degrees of freedom (d.f.)

 First decide on the size of the rejection region α (one tail)  level of significance

 Method 1 (with F-table): Rejection region approach

 Reject 𝐻0 if F > critical value (C.V.) = 𝐹𝛼,𝐾,(𝑛−𝐾−1)

 Method 2 (with Excel output): p-value approach

 p-value = 𝑃(𝐹 ≥ F)

 Reject 𝐻0 if p-value < 𝛼


24
OVERALL MODEL SIGNIFICANCE: F-TEST
Probability distribution of F. Suppose α = 0.05.

 At the 5% significance level, p-value ≈ 0 < 5%, so H₀ is rejected.

 α = tail area = P(F ≥ C.V.)
 p-value = P(F ≥ F-statistic)

[Figure: F distribution with C.V. = F_{α,K,(n−K−1)} = 2.37; the F-statistic calculated from the sample data, 61,664, lies far inside the rejection region]

25
SIGNIFICANCE OF A PARTICULAR X-VARIABLE:
T-TEST
 Even if we reject the 𝐻0 in our F-test, we cannot distinguish which 𝑋-variable(s)
has a significant impact on the 𝑌-variable
 t-test for a particular 𝑋-variable’s significance
 Null 𝐻0 : 𝛽𝑖 = 0 (𝑋𝑖 has no linear relationship with 𝑌, given presence of other 𝑋-
variable(s))
 Alternative 𝐻1 : 𝛽𝑖 ≠ 0 (𝑋𝑖 is linearly related to 𝑌, given presence of other 𝑋-
variable(s))

26
SIGNIFICANCE OF A PARTICULAR X-VARIABLE:
T-TEST

 Null H₀: β₁ = 0
 Method 1: Rejection region approach
 Reject H₀ if |t| > C.V. = t_{α/2,(n−K−1)}

 Method 2: p-value approach


 p-value = 𝑃(|T| ≥ |t|)
 Reject 𝐻0 if p-value < 𝛼

[Figure: Student’s t-distribution with rejection regions of area α/2 in each tail.
If α = 5%, then C.V. = t_{0.025,(n−5)} ≈ 1.96; the t-statistic calculated from the
sample data, t = 348.81, lies far beyond the critical value.]

27
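Both decision rules on the slide can be sketched for one coefficient. The t-statistic below is the value reported for the pre-tip fare; with n ≈ 197,000 the t critical value is essentially the normal one, 1.96, and a two-sided normal approximation stands in for the exact t p-value:

```python
# Rejection-region and p-value approaches for one slope coefficient.
import math

t_stat = 348.81        # |t| for the pre-tip fare coefficient
critical_value = 1.96  # approx. t_{0.025, n-K-1} for large n

reject = abs(t_stat) > critical_value             # method 1
p_value = math.erfc(abs(t_stat) / math.sqrt(2))   # method 2 (normal approx.)
print(reject, p_value < 0.05)  # True True
```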
EXAMPLE

 Conclusion: the p-value is smaller than 5%, so we reject H₀. The pre-tip fare is significantly
related to the tips, given the presence of the other X-variables.
 What about the other variables?

 According to the t-test results, the p-value for each of the four explanatory variables
is smaller than 5%.
 This indicates that each explanatory variable is significantly related to the tips paid in NYC,
given the presence of the other X-variables.

*Scientific notation: 6.41657E − 08 = 6.41657 × 10−8 = 0.0000000642 ≈ 0


28
EXAMPLE

 What does the table look like if there is an insignificant explanatory variable?
 Added a fifth variable to label rows as “odd” or “even” (see the “5var – odd/even” tab)

 The p-value for “Odd/Even transaction” is LARGER than 5%, so we cannot reject
H₀. This indicates that the odd/even indicator is not significantly related to the tips
paid in NYC, given the presence of the other X-variables.

29
VARIABLES SELECTION STRATEGIES

 Some of the independent variables are insignificant based on t-test results


 We may consider eliminating insignificant independent variables using the following
methods:
 All possible regressions
 Backward elimination
 Forward selection
 Stepwise regression

30
ALL POSSIBLE REGRESSIONS

 To develop all the possible regression models between the dependent variable
and all possible combinations of independent variables
 If there are 𝐾 𝑋-variables to consider using, there are (2𝐾 −1) possible
regression models to be developed
 The criteria for selecting the best model may include
 Mean Sum of Squares Errors (MSE)
 Adjusted 𝑟 2
 Disadvantages of all possible regressions
 No unique conclusion: different criteria will lead to different conclusions
 Looks at overall model performance, but not individual variable significance
 When there is a large number of potential X-variables, the computational time can be long

31
BACKWARD ELIMINATION

 Evaluate individual variable significance

Step 1: Build a model using all potential X-variables (e.g. X₁, X₂, X₃, X₄, X₅)
Step 2: Identify the least significant X-variable using the t-test
Step 3: Remove this X-variable if its p-value is larger than the specified level of
significance; otherwise terminate the procedure
Step 4: Develop a new regression model after removing this X-variable, and repeat
steps 2 and 3 until all remaining X-variables are significant

32
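The steps above can be sketched with numpy alone. As an assumption for the sketch, p-values are replaced by the large-sample rule |t| > 1.96 at α = 5%; X₃ below is pure noise, so the procedure should typically drop it and keep X₁ and X₂:

```python
# Backward elimination on simulated data: start with all candidates,
# repeatedly drop the least significant X-variable.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))                                   # candidates X1, X2, X3
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # X3 is unused noise

def t_stats(Xmat, y):
    """Slope t-statistics for an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), Xmat])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    mse = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(np.diag(mse * np.linalg.inv(A.T @ A)))
    return (b / se)[1:]                                       # skip the intercept

cols = [0, 1, 2]                                              # Step 1: all candidates
while cols:
    t = t_stats(X[:, cols], y)
    weakest = int(np.argmin(np.abs(t)))                       # Step 2
    if abs(t[weakest]) > 1.96:                                # all remaining significant
        break
    cols.pop(weakest)                                         # Step 3: remove it
# Step 4: the loop refits and repeats until only significant variables remain
print(cols)
```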
FORWARD SELECTION
 Evaluate individual variable significance

Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant X-variable using the t-test (e.g. comparing the
candidate models {X₁}, {X₂}, …, and after X₁ enters, {X₁, X₂}, {X₁, X₃}, …)
Step 3: Add this X-variable if its p-value is smaller than the specified
level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this X-variable, and repeat
steps 2 and 3 until all significant X-variables are entered

33
STEPWISE REGRESSION

 Evaluate individual variable significance


 An 𝑋-variable entering can later leave; an 𝑋-variable eliminated can later go back in

Step 1: Start with a model which only contains the intercept term
Step 2: Identify the most significant 𝑋-variable, add this 𝑋-variable if its p-value is smaller
than the specified level of significance; otherwise terminate the procedure
Step 3: Identify the least significant 𝑋-variable from the model, remove this 𝑋-variable if
its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant 𝑋-variables are entered and none of them
have to be removed

34
PRINCIPLE OF MODEL BUILDING

 A good model should


 Have few independent variables
 Have high predictive power
 Have low correlation between independent variables
 Be easy to interpret

35
