1 - Regression (NEW)
Relationship between variables
The objective of many investigations is to
understand and explain the relationship among
variables.
Frequently, one wants to know how and to what
extent a certain variable (dependent variable) is
related to a set of other variables (independent
variables).
TYPES OF RELATIONSHIP
1 -- Deterministic relationship (functional relationship)
The relationship between the two variables is known exactly:
• Pressure of a gas in a container is related to the temperature
• Velocity of water in an open channel is related to the width of the channel
• Displacement of a particle at a certain time is related to its velocity: if we let d_0 be the displacement of the particle from the origin at time t = 0 and v be the velocity, the displacement at time t is
d_t = d_0 + v t
Regression Analysis
Regression Analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables:
Y = f(X) + e
Note:
• f( ) describes the systematic variation in the relationship.
• e represents the unsystematic variation (or random error) in the relationship.
• Y = dependent variable, response, predictand, regressand; X = independent variable, stimulus, predictor, regressor.
Examples
• Sales = f(Adv. Expenditure) + e
STEP1: Graphical Method: Scatter Plot
A scatter plot is a graphical tool that may suggest what type of mathematical function would be appropriate for summarizing the data. A variety of functions are useful in fitting models to data.
[Scatter plot of Strength (160–200) vs Hardwood (10–30), showing an increasing trend]
STEP2: Numerical Method: Best Fit Line or Regression Line
[Scatter plot of Strength vs Hardwood with fitted line]
The observed data points do not all fall on a single line but cluster about it. Many lines can be drawn through the scatter; the best-fitting line is the one for which the total deviation of the points from the line is least (minimum).
The method of LEAST SQUARES results in a line that minimizes the sum of squared vertical distances from the observed data points to the line (i.e. the random error). Any other line has a larger sum.
Population Regression Model
Y_i = β0 + β1 X_i + e_i
where X is the regressor, Y the response, β0 and β1 the regression coefficients, and e_i the random error.
• In formulating the above relation between strength and amount of hardwood, we are ignoring the fact that strength of paper depends on other characteristics as well.
Estimation of Unknown Parameters
Best fit line to the data: LEAST SQUARES LINE
A least squares line is described in terms of its Y-intercept (the height at which it intercepts the Y-axis) and its slope (the angle of the line). The line can be expressed by the following relation:
Ŷ = b0 + b1 X   (estimated sample regression line)
where
• b1 = slope of the line: b1 = S_XY / S²_X
• S²_X = variance of X: S²_X = Σ(X − X̄)²/n = (1/n)[ΣX² − (ΣX)²/n]
• b0 = Y-intercept: b0 = Ȳ − b1 X̄
Strength (Y)  Hardwood (X)  XY     X²    Y²      Ŷ        e = Y − Ŷ   e²
160           10            1600   100   25600   162.6    −2.6        6.76
171           15            2565   225   29241   172.0    −1.0        1.00
175           15            2625   225   30625   172.0     3.0        9.00
182           20            3640   400   33124   181.4     0.6        0.36
184           20            3680   400   33856   181.4     2.6        6.76
181           20            3620   400   32761   181.4    −0.4        0.16
188           25            4700   625   35344   190.8    −2.8        7.84
193           25            4825   625   37249   190.8     2.2        4.84
195           28            5460   784   38025   196.4    −1.4        2.07
200           30            6000   900   40000   200.2    −0.2        0.04
1829          208           38715  4684  335825  1829.04  −0.04       38.83
ΣXY = 38715, ΣX = 208, ΣY = 1829, ΣX² = 4684, ΣY² = 335825
X̄ = 20.8, Ȳ = 182.9
S_XY = (1/10)[38715 − (208)(1829)/10] = 67.18
S²_X = (1/10)[4684 − (208)²/10] = 35.76
S²_Y = (1/10)[335825 − (1829)²/10] = 130.09
b1 = S_XY / S²_X = 67.18 / 35.76 = 1.88
b0 = Ȳ − b1 X̄ = 182.9 − 1.88(20.8) = 143.8
Estimated regression line: Strength = 143.8 + 1.88 Hardwood
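As a quick check, the slope and intercept can be reproduced from the column totals in the table above (a minimal Python sketch; variable names are illustrative):

```python
# Least-squares slope and intercept for the strength/hardwood data,
# reproducing the hand calculation from the column totals.
n = 10
sum_x, sum_y = 208, 1829
sum_xy, sum_x2 = 38715, 4684

s_xy = (sum_xy - sum_x * sum_y / n) / n   # covariance (n in denominator)
s_x2 = (sum_x2 - sum_x ** 2 / n) / n      # variance of X
b1 = s_xy / s_x2                          # slope
b0 = sum_y / n - b1 * sum_x / n           # intercept

print(round(s_xy, 2), round(s_x2, 2))  # 67.18 35.76
print(round(b1, 2), round(b0, 1))      # 1.88 143.8
```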
Measuring The Reliability Of The Estimating Equation
[Scatterplot of Strength vs Hardwood with the fitted regression line]
The observed values of Y do not all fall on the regression line but scatter away from it. The degree of scatter of the observed values about the regression line is measured by the standard error of estimate (or standard error of regression), denoted by Se.
The standard error of estimate measures the variability of the observed points about the regression line. A small variation indicates that the estimated regression is adequate.
Standard Error of Estimate
S²e = Σ(Y − Ŷ)² / (n − 2) = Σe² / (n − 2) = 38.83/8 = 4.9
Se = √4.9 = 2.21
(The Ŷ, e = Y − Ŷ and e² columns were computed in the table above, giving Σe² = 38.83.)
The standard deviation of regression can be interpreted as the average deviation of the points from the estimated regression line; hence a small deviation is desired for a better fitted model.
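The standard error of estimate follows directly from the residual sum of squares in the table (a short Python sketch; small differences from the slide's 4.9 and 2.21 are rounding):

```python
# Standard error of estimate for the strength/hardwood fit,
# from the residual sum of squares in the table (sum of e^2 = 38.83).
import math

n = 10
sse = 38.83                  # sum of squared residuals
s_e2 = sse / (n - 2)         # residual variance, df = n - 2
s_e = math.sqrt(s_e2)        # standard error of estimate

print(round(s_e2, 1), round(s_e, 1))  # 4.9 2.2
```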
Precision of Estimators b0 & b1
• The OLS estimators are random variables: they may take different values for different samples, and the variance of a random variable is a measure of its dispersion around its mean.
SE(b0) = Se √(1/n + X̄²/(n S²_X)) = 2.53
SE(b1) = Se √(1/(n S²_X)) = 0.117
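These standard errors can be verified numerically from the summary statistics computed earlier (a minimal sketch using the rounded slide values):

```python
# Standard errors of the OLS estimators b0 and b1, using
# S_e = 2.21, S_X^2 = 35.76 and X-bar = 20.8 from the slides.
import math

n = 10
s_e = 2.21          # standard error of estimate
s_x2 = 35.76        # variance of X (n in the denominator)
x_bar = 20.8

se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / (n * s_x2))
se_b1 = s_e * math.sqrt(1 / (n * s_x2))

print(round(se_b0, 2), round(se_b1, 3))  # 2.53 0.117
```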
Inference in Simple Linear Regression
(From samples to population)
STEP3: Test the significance of regression by t test
Test the hypothesis that there is no linear relation between strength and amount of hardwood in the population, i.e. strength is not related with hardwood, at the 5% level of significance.
1) Construction of hypotheses
Ho: β1 = 0
H1: β1 ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b1 − β1)/SE(b1) = (1.88 − 0)/0.117 = 16.05*
4) Decision rule: Reject Ho if t_cal ≤ −t(α/2, n−2) or t_cal ≥ t(α/2, n−2), where t(0.025, 8) = 2.306.
Since 16.05 > 2.306, Ho is rejected: the regression is significant.
95% C.I. for β1
b1 ± t(α/2, n−2) SE(b1)
1.88 ± 2.306 × 0.117
(1.61, 2.15)
A 90% C.I. can be interpreted as follows: if we take 100 samples of the same size under the same conditions and compute 100 C.I.s about the parameter, one from each sample, then 90 such C.I.s will contain the parameter (i.e. not all the constructed C.I.s).
A confidence interval estimate of a parameter is more informative than a point estimate because it reflects the precision of the estimator.
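The interval arithmetic is a one-liner (using the slide's rounded values):

```python
# 95% confidence interval for the slope, using b1 = 1.88,
# SE(b1) = 0.117 and t(0.025, 8) = 2.306.
b1, se_b1, t_crit = 1.88, 0.117, 2.306

margin = t_crit * se_b1
ci = (round(b1 - margin, 2), round(b1 + margin, 2))

print(ci)  # (1.61, 2.15)
```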
Relation between Confidence Interval and Two-Sided Hypothesis
If the constructed confidence interval does not contain the value of the parameter under Ho, then reject Ho.
Ho: β1 = 0
H1: β1 ≠ 0
Since the 95% C.I. (1.61, 2.15) does not contain 0, Ho is rejected.
Partition of variation in dependent variable
Total variation = Explained variation + Unexplained variation
STEP4: Test the significance of regression by F test (ANOVA)
ANOVA TABLE
Source of Variation (S.O.V)  DF       SS      MSS = SS/df  Fcal    Ftab            p-Value
Regression                   1        1262.1  1262.1       257.6*  F0.05(1,8)=5.318  0.000
Error                        8        38.8    4.9
Total                        10−1=9   1300.9
R² = (Explained/Total) × 100 = (1262.1/1300.9) × 100 = 97%
About 97% of the variation in strength is due to the amount of hardwood.
Possible values of R²: 0 ≤ R² ≤ 1 (or 0 ≤ R² ≤ 100%)
[Residual plot: residuals vs fitted values Ŷ, 160–210]
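The F statistic and R² follow from the sums of squares in the table (using the mean squares as reported there):

```python
# Reproducing the ANOVA quantities from the table:
# SSR = 1262.1, SSE = 38.8, SST = 1300.9, MSE = 4.9.
ssr, sst = 1262.1, 1300.9
msr, mse = 1262.1, 4.9      # mean squares as reported in the table

f_stat = msr / mse          # overall F statistic
r2 = ssr / sst              # coefficient of determination

print(round(f_stat, 1), round(r2 * 100))  # 257.6 97
```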
STEP5: Estimation in regression
(Predicting unknown value of Y from known value of X)
For example, the predicted strength at 23% hardwood is
Strength = 143.8 + 1.88(23) = 187.04
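As an end-to-end check, the whole fit-and-predict pipeline can be run from the raw data (a sketch; the tiny difference from the slide's 187.04 comes from the slide rounding its coefficients):

```python
# Fit the least-squares line from the raw strength/hardwood data
# and predict strength at 23% hardwood, as in STEP5.
x = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
y = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / n

b1 = s_xy / s_x2
b0 = y_bar - b1 * x_bar
pred = b0 + b1 * 23

print(round(pred, 1))  # 187.0
```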
Regression
• Linear: Simple, Multiple
• Non-Linear: Intrinsically Linear, Intrinsically Non-Linear
Non-Linear Regression
It is easy to deal with regression that is linear in parameters, but in some situations the models are non-linear. The non-linear models can be divided into two types:
• Intrinsically Linear
• Intrinsically Non-Linear models
The models that can be transformed into linear models by applying some suitable transformation are called intrinsically linear models, and the models that cannot be transformed into linear models are called intrinsically non-linear models.
Regression on Transformed Variables
Nonlinearity (in parameters) is visually determined from the scatter diagram; sometimes, because of prior experience or underlying theory, we know in advance that the model is nonlinear.
EXAMPLE: The number (Y) of bacteria per unit volume present in a culture after X hours is given in the following table. Fit a least squares curve having the form Y = β0 β1^X ε to the data. Estimate the value of Y when X = 7.
Y    X
32   0
47   1
65   2
92   3
132  4
190  5
275  6
[Scatterplots: Y vs X (curved, exponential growth) and log(Y) = Y* vs X (approximately linear)]
Y = β0 β1^X ε  ⟹  log Y = log β0 + X log β1 + log ε  ⟹  Y* = β0* + β1* X + ε*
X  Y    Y* = log(Y)
0  32   1.50515
1  47   1.67210
2  65   1.81291
3  92   1.96379
4  132  2.12057
5  190  2.27875
6  275  2.43933
Estimation of the regression of Y* on X by least squares:
b0* = 1.51, b1* = 0.154
Y* = 1.51 + 0.154 X
Y* = 1.51 + 0.154 X
log(b0) = 1.51 ⟹ b0 = antilog(1.51) = 32.36
log(b1) = 0.154 ⟹ b1 = antilog(0.154) = 1.43
Y = b0 b1^X = 32.36 (1.43)^X
At X = 7: Y = 32.36 (1.43)^7 ≈ 396 (using the rounded coefficients).
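The whole transform-fit-backtransform procedure can be sketched in Python. Note that carrying full precision gives a slightly different intercept (32.15 vs the slide's 32.36, which uses the rounded 1.51) and hence a lower estimate at X = 7 (≈387 vs ≈396):

```python
# Fit Y = b0 * b1**X by least squares on log10(Y), as in the
# worked example, then estimate Y at X = 7.
import math

x = [0, 1, 2, 3, 4, 5, 6]
y = [32, 47, 65, 92, 132, 190, 275]
ystar = [math.log10(v) for v in y]
n = len(x)

x_bar = sum(x) / n
ys_bar = sum(ystar) / n
b1_star = sum((xi - yi_x) * (yi - ys_bar) * 0 for xi, yi_x, yi in []) or \
          sum((xi - x_bar) * (yi - ys_bar) for xi, yi in zip(x, ystar)) / \
          sum((xi - x_bar) ** 2 for xi in x)
b0_star = ys_bar - b1_star * x_bar

b0 = 10 ** b0_star   # antilog of the intercept
b1 = 10 ** b1_star   # antilog of the slope
y7 = b0 * b1 ** 7    # estimate at X = 7

print(round(b0_star, 2), round(b1_star, 3))  # 1.51 0.154
print(round(b0, 2), round(b1, 2))            # 32.15 1.43
print(round(y7))                             # 387
```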
EXAMPLE: Enzyme kinetics data, where V (mM/min) is the reaction velocity at substrate concentration C (mM), assumed to follow the Michaelis–Menten model V = Vmax C/(Km + C):
C (mM)  V (mM/min)
2       0.24
3       0.28
4       0.33
8       0.40
16      0.45
[Scatter plot of V vs C (0–16 mM) showing saturating growth]
Transformation to Linear form
V = Vmax C / (Km + C)
⟹ 1/V = (Km + C)/(Vmax C) = Km/(Vmax C) + C/(Vmax C)
⟹ 1/V = 1/Vmax + (Km/Vmax)(1/C)
i.e. Y* = b0* + b1* X* (linear form), with Y* = 1/V, b0* = 1/Vmax, b1* = Km/Vmax, X* = 1/C.
[Lineweaver–Burk plot: 1/V vs 1/C (with 1/C and 1/V tabulated alongside C and V), approximately linear over 1/C = 0.0–0.7]
Estimated line: Y* = b0* + b1* X* = 1.998 + 4.266 X*
Estimated Linear Regression Equation:
Y* = b0* + b1* X* = 1.998 + 4.266 X*
R² = 99.287%
Calculation of Vmax and Km
b0* = 1/Vmax ⟹ 1.998 = 1/Vmax ⟹ Vmax = 1/1.998 = 0.5005
b1* = Km/Vmax ⟹ 4.266 = Km/Vmax ⟹ Km = Vmax × 4.266 = 0.5005 × 4.266 = 2.135
Estimated Michaelis–Menten Equation:
V = 0.5005 C / (2.135 + C)
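The back-calculation from the fitted Lineweaver–Burk line is a direct inversion of the two relations above (a minimal sketch using the slide's fitted coefficients):

```python
# Recover Vmax and Km from the fitted Lineweaver-Burk line
# 1/V = 1.998 + 4.266 * (1/C).
b0_star, b1_star = 1.998, 4.266

vmax = 1 / b0_star      # since b0* = 1 / Vmax
km = vmax * b1_star     # since b1* = Km / Vmax

print(round(vmax, 4), round(km, 3))  # 0.5005 2.135
```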
Regression with more than one independent variable
Multiple linear regression describes the dependence of the mean value of the response variable (Y) on given values of two or more independent variables (X).
There are many applications where several explanatory variables affect the dependent variable, for example:
1) Yield of a crop depends upon the fertility of the land, dose of fertilizer applied, quantity of seed, etc.
2) The grade point average of students depends on aptitude, mental ability, hours devoted to study, and the type and nature of grading by teachers.
3) The systolic blood pressure of a person depends upon one's weight, age, etc.
Multiple Linear Regression with two regressors
A real estate agency collects the following data concerning
Y = Sales price of a house (in thousands of dollars)
X1 = Home Size (in hundreds of square feet)
X2 = Rating (an overall "niceness rating" for the house expressed on a scale from 0 (worst) to 10 (best), provided by the real estate agency)
Sales price (Y)  Home Size (X1)  Rating (X2)
120.0            23              5
65.4             11              2
115.4            20              9
91.0             17              3
94.0             15              8
110.6            21              4
129.0            24              7
85.2             13              6
109.0            19              7
115.0            25              2
• STEP1: Identify the type of relationship (functional form) between variables by using a graphical tool (scatter plot)
• STEP2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
• STEP3: Test the overall significance of regression by F-test (ANOVA)
• Test the individual significance of regression by t-test
STEP1: Identify the type of relationship between variables by using a graphical tool (scatter plot)
Population Regression Model
Y = β0 + β1 X1 + β2 X2 + e
where β1 and β2 are the partial regression coefficients, Y is the response, and e is the random error.
Estimation of parameters from sample data
(Don't worry: we will use the computer for calculation.)
b1 = [S²x2 Sx1y − Sx1x2 Sx2y] / [S²x1 S²x2 − (Sx1x2)²]
b2 = [S²x1 Sx2y − Sx1x2 Sx1y] / [S²x1 S²x2 − (Sx1x2)²]
b0 = ȳ − b1 x̄1 − b2 x̄2
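The closed-form two-regressor formulas can be applied directly to the house-price data (a self-contained sketch; the helper names are illustrative):

```python
# Compute b0, b1, b2 for the house-price data with the closed-form
# two-regressor formulas (sums of squares and cross-products of deviations).
y  = [120.0, 65.4, 115.4, 91.0, 94.0, 110.6, 129.0, 85.2, 109.0, 115.0]
x1 = [23, 11, 20, 17, 15, 21, 24, 13, 19, 25]
x2 = [5, 2, 9, 3, 8, 4, 7, 6, 7, 2]

def mean(v):
    return sum(v) / len(v)

def s(a, b):
    # sum of cross-products of deviations from the means
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)

den = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / den
b2 = (s11 * s2y - s12 * s1y) / den
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print(round(b0, 2), round(b1, 2), round(b2, 2))  # 19.56 3.74 2.56
```

These match the computer output on the next slide, which is reassuring given how error-prone the hand calculation is.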
• STEP2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
Computer Output: Estimation by Method of Least Squares
              Coefficients  Standard Error  t Stat  P-value
Intercept     b0 = 19.56    3.26            6.00    0.00
Home Size X1  b1 = 3.74     0.15            24.56   0.00
Rating X2     b2 = 2.56     0.29            8.85    0.00
Price = 19.56 + 3.74 Size + 2.56 Rating
• The value of b1 = 3.74 indicates that the mean sale price is expected to increase by $3,740 (3.74 thousand) with each 100-square-foot increase in house size, keeping the effect of rating constant.
ANOVA
            df  SS       MS       F        Significance F
Regression  2   3277.31  1638.66  350.87*  9.58E-08
Residual    7   32.69    4.67
Total       9   3310.00
STEP 3: Test the individual significance of regression by t-test
Significance of X1
Ho: β1 = 0
H1: β1 ≠ 0
t = (b1 − β1)/SE(b1) = (3.74 − 0)/0.152 = 24.56
Significance of X2
Ho: β2 = 0
H1: β2 ≠ 0
R² = (Explained/Total) × 100 = (3277.31/3310) × 100 = 99%
99% of the variation in house price is explained by the two factors.
[Residual plot: residuals vs predicted price, 90–120]
STEP 5: Use estimated regression equation for prediction of unknown value of dependent variable
Estimate the price of a house with 1500 square feet area with rating of 7 (recall that X1 is in hundreds of square feet, so X1 = 15):
Price = 19.56 + 3.74(15) + 2.56(7) = 93.58, i.e. about $93,580.
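The prediction is a single plug-in of the fitted equation (using the rounded coefficients from the output above):

```python
# Predicted sale price for a 1500-sq-ft house (X1 = 15 hundred sq ft)
# with niceness rating X2 = 7, from the fitted equation.
b0, b1, b2 = 19.56, 3.74, 2.56

price = b0 + b1 * 15 + b2 * 7
print(round(price, 2))  # 93.58 (thousand dollars)
```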
Y = Strength, X = Cotton %
Y (Strength)  X (Cotton %)
10            15
15            20
19            25
17            30
10            35
Scatter plot
[Scatter plot of cotton % (12–37) vs strength (8–20): strength rises to a peak near 25% cotton and then falls, suggesting a quadratic relationship]
Quadratic Regression
Y = β0 + β1 X + β2 X² + e
Setting X1 = X and X2 = X² turns this into a regression that is linear in the coefficients:
Y = β0 + β1 X1 + β2 X2 + e
Computer output
R Square 0.968
Standard Error 1.028
ANOVA
df SS MS F p value
Regression 2 64.69 32.34 30.59 0.03
Residual 2 2.11 1.06
Total 4 66.80
Analysis of Variance (Overall Significance)
Ho: β1 = β2 = 0
H1: At least one is not equal to zero
ANOVA
            df  SS     MS     F      p value
Regression  2   64.69  32.34  30.59  0.03
Residual    2   2.11   1.06
Total       4   66.80
Since the p-value 0.03 < 0.05, the quadratic regression is significant overall.
Significance of Quadratic Regression
Regression Statistics
R Square 0.968
Standard Error 1.028
Coefficient of Determination = 97%
The value of X at which the maximum or minimum of the quadratic regression occurs:
X = −b1/(2b2) = 25.27
and the corresponding extreme value is
b0 − b1²/(4b2) = 18.67
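The vertex formulas can be evaluated from the coefficients in the computer output below; because those printed coefficients are rounded, the result differs slightly from the slide's 25.27 and 18.67:

```python
# Vertex of the fitted quadratic Y = b0 + b1*X + b2*X^2, using the
# coefficients reported in the computer output (-36.086, 4.326, -0.086).
b0, b1, b2 = -36.086, 4.326, -0.086

x_star = -b1 / (2 * b2)           # cotton % giving maximum strength
y_star = b0 - b1 ** 2 / (4 * b2)  # maximum predicted strength

print(round(x_star, 2), round(y_star, 2))  # 25.15 18.32
```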
Linear vs Quadratic Regression
Linear Regression (Y = b0 + b1X)
Regression Statistics
Multiple R 0.077
R Square 0.006
Adjusted R Square −0.325
Standard Error 4.705
Observations 5
ANOVA
            df  SS    MS     F      Significance F
Regression  1   0.4   0.4    0.018  0.902
Residual    3   66.4  22.13
Total       4   66.8
[Residual plot for the linear fit: large, patterned residuals (−6 to +6) against predicted Y]
Quadratic Regression (Y = b0 + b1X + b2X²)
Regression Statistics
R Square 0.968
Adjusted R Square 0.937
Standard Error 1.028
Observations 5
ANOVA
            df  SS     MS     F      Significance F
Regression  2   64.69  32.34  30.59  0.03
Residual    2   2.11   1.06
Total       4   66.80
              Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept     −36.086       6.542           −5.516  0.031    −64.234    −7.937
X1 (Cotton)   4.326         0.553           7.816   0.016    1.945      6.707
X2 (Cotton)²  −0.086        0.011           −7.798  0.016    −0.133     −0.038
[Residual plot for the quadratic fit: small residuals (−1.5 to 0) against predicted Y]