
1 - Regression (NEW)

1. Regression analysis is used to understand the relationship between variables and estimate the dependence of a dependent variable on one or more independent variables.
2. Relationships between variables can be either deterministic (functional) or probabilistic (statistical). Deterministic relationships are known exactly, while probabilistic relationships contain random error and must be approximated statistically.
3. Regression analysis estimates a function that describes the systematic relationship between a dependent variable and independent variables, while accounting for random/unexplained variability with an error term. The parameters of this population regression model are estimated from sample data.

Uploaded by

Hassan Latif

Relationship Between Variables

The objective of many investigations is to
understand and explain the relationship among
variables.
Frequently, one wants to know how and to what
extent a certain variable (dependent variable) is
related to a set of other variables (independent
variables).
TYPES OF RELATIONSHIP
1-- Deterministic relationship (functional relationship)
The relationship between two variables is known exactly.
• Pressure of a gas in a container is related to the temperature
• Velocity of water in an open channel is related to the width of the channel
• Displacement of a particle at a certain time is related to its velocity: if we let $d_0$ be the displacement of the particle from the origin at time t = 0 and v be the velocity, the displacement at time t is
$d_t = d_0 + vt$
• Area of a circle: $A = \pi r^2$
• Newton's law of gravity: $F = k\,(m_1 m_2 / r^2)$
2-- Probabilistic relationship (statistical relationship)
The relationship between variables is not known exactly, so we have to approximate it and develop models that characterize its main features.
• Electrical energy consumption of a house is related to the size of the house (x, in square feet), but it is unlikely to be a deterministic relationship, as it is possible for different houses to use different amounts of electricity even if they are the same size.
• The fuel usage of an automobile is related to the vehicle weight, but the relationship is not a deterministic one, as it is possible for different automobiles to have different fuel usage even if they weigh the same.
The collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner is called Regression Analysis.
Regression
Regression is used to investigate the dependence of one variable, called the dependent variable and denoted by Y, on one or more independent variables denoted by X. It provides an equation for estimating or predicting the average value of the dependent variable from known values of the independent variables.

Independent Variable      Dependent Variable
Explanatory variable      Explained variable
Regressor                 Regressand
Predictor                 Predicted
Regression Analysis
Regression Analysis is used to estimate a function f( ) that
describes the relationship between a continuous dependent
variable and one or more independent variables.

$Y = f(X_1, X_2, X_3, \ldots, X_k) + e$

Note:
• f( ) describes systematic variation in the relationship.
• e represents the unsystematic variation (or random error) in the relationship.
• Y = dependent variable (response, predictand, regressand); X = independent variable (stimulus, predictor, regressor).
Examples
• Sales = f(Adv. expenditure) + e
• Fiber = f(Weight of jute plant) + e
• Consumption exp. = f(Income) + e
• Price of a car = f(Age of car) + e
• Yield = f(Fertilizer, seed rate, rainfall) + e
• Marks = f(Study hours, IQ level) + e
• Demand = f(Price, consumer income, price of related commodities, consumer taste, adv. expenses for creation of demand) + e
Model building with one independent variable
Example: The tensile strength of a paper product is related to the amount of hardwood in the pulp. Ten samples are produced in the pilot plant and the data obtained are shown below.

Strength (Y)   Hardwood (X)
160            10
171            15
175            15
182            20
184            20
181            20
188            25
193            25
195            28
200            30

There is probably a relation between strength and hardwood (HW) concentration: as the amount of HW increases, there is a corresponding increase in paper strength.
But how would we measure and quantify this relationship?
There are two ways to identify the relation between variables:
(a) Graphical Method
(b) Numerical Method
A possible strategy for regression model building
• STEP 1: Identify the type of relationship (functional form) between variables by using a graphical tool (scatter plot) or underlying theory
• STEP 2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
• STEP 3: Test the significance of regression by
  • t-test
  • F-test (ANOVA)
• STEP 4: Measure the goodness of fit of the model (Coefficient of Determination)
• STEP 5: Use the estimated regression equation for prediction of an unknown value of the dependent variable
STEP 1: Graphical Method: Scatter Plot
A scatter plot is a graphical tool that may suggest what type of mathematical function would be appropriate for summarizing the data. A variety of functions are useful in fitting models to data.

[Figure: Scatter plot of Strength vs Hardwood. x-axis: Hardwood (10 to 30); y-axis: Strength (160 to 200).]
STEP 2: Numerical Method: Best Fit Line or Regression Line
The observed data points do not all fall on a straight line but cluster about it. Many lines can be drawn through the data points; the problem is to select the BEST FIT among them, i.e. the line for which the sum of squared errors is least (minimum).

[Figure: Scatter plot of Strength vs Hardwood with the fitted line. x-axis: Hardwood (10 to 30); y-axis: Strength (160 to 200). The line is described by its Y-intercept (a) and slope (b).]

The method of LEAST SQUARES results in a line that minimizes the sum of squared vertical distances from the observed data points to the line (i.e. the random errors). Any other line has a larger sum.
Population Regression Model
$Y_i = \beta_0 + \beta_1 X_i + e_i$
where $Y_i$ is the response, $X_i$ is the regressor, $\beta_0$ and $\beta_1$ are the regression coefficients, and $e_i$ is the random error.
• In formulating the above relation between strength and amount of hardwood, we are ignoring the fact that the strength of paper depends on other characteristics as well. Thus we are basically assuming that all these effects are absorbed by the error term.
Population linear regression model between strength and hardwood
$Y = \beta_0 + \beta_1 X + \varepsilon$

The error term is actually a combination of four different effects:
1. It accounts for the effect of variables omitted from the model
2. It captures the effects of nonlinearities in the relationship between Y and X
3. Measurement errors in X and Y are also absorbed by the error term
4. The error term also includes inherently unpredictable random effects
Estimation of Unknown Parameters
After formulating the model, the next step is to obtain the "best" estimates for the unknown parameters $\beta_0$ and $\beta_1$.
There are several methods for the estimation of the parameters:
• Method of Least Squares (OLS)
• Method of Maximum Likelihood (MLE)
• Method of Moments (MM)
Best fit line to the data: LEAST SQUARES LINE
A least squares line is described in terms of its Y-intercept (the height at which it crosses the Y-axis) and its slope (the steepness of the line, i.e. the change in Y per unit change in X). The line can be expressed by the following relation:
$\hat{Y} = b_0 + b_1 X$   (estimated sample regression line)
where
► $b_1$ = slope of the line: $b_1 = \frac{S_{XY}}{S_X^2}$
► $b_0$ = intercept of the line: $b_0 = \bar{Y} - b_1 \bar{X}$

$S_{XY}$ = covariance between X and Y:
$S_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} = \frac{1}{n}\left[\sum XY - \frac{\sum X \sum Y}{n}\right]$

$S_X^2$ = variance of X:
$S_X^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{1}{n}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]$
Strength (Y)  Hardwood (X)  XY     X^2   Y^2     Y^       e = Y - Y^  e^2
160           10            1600   100   25600   162.6    -2.6        6.76
171           15            2565   225   29241   172.0    -1.0        1.00
175           15            2625   225   30625   172.0     3.0        9.00
182           20            3640   400   33124   181.4     0.6        0.36
184           20            3680   400   33856   181.4     2.6        6.76
181           20            3620   400   32761   181.4    -0.4        0.16
188           25            4700   625   35344   190.8    -2.8        7.84
193           25            4825   625   37249   190.8     2.2        4.84
195           28            5460   784   38025   196.4    -1.4        2.07
200           30            6000   900   40000   200.2    -0.2        0.04
Total: 1829   208           38715  4684  335825  1829.04  -0.04       38.83

$\sum XY = 38715$, $\sum X = 208$, $\sum Y = 1829$, $\sum X^2 = 4684$, $\sum Y^2 = 335825$
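With n = 10, $\sum XY = 38715$, $\sum X = 208$, $\sum Y = 1829$, $\sum X^2 = 4684$ and $\sum Y^2 = 335825$:

$\bar{X} = 20.8 \qquad \bar{Y} = 182.9$

$S_{XY} = \frac{1}{10}\left[38715 - \frac{(208)(1829)}{10}\right] = 67.18$

$S_X^2 = \frac{1}{10}\left[4684 - \frac{(208)^2}{10}\right] = 35.76$

$S_Y^2 = \frac{1}{10}\left[335825 - \frac{(1829)^2}{10}\right] = 130.09$

$b_1 = \frac{S_{XY}}{S_X^2} = \frac{67.18}{35.76} = 1.88$

$b_0 = \bar{Y} - b_1 \bar{X} = 182.9 - 1.88(20.8) = 143.8$

Strength = 143.8 + 1.88 Hardwood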
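The hand calculation above can be checked with a short script. This is a minimal sketch in plain Python, assuming only the ten data pairs from the example; the function name `least_squares_line` is illustrative and not part of the original slides.

```python
# Paper-strength data from the worked example
strength = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]   # Y
hardwood = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]             # X


def least_squares_line(x, y):
    """Return (b0, b1) for the line y = b0 + b1*x using b1 = S_XY / S_X^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    s_xy = (sum_xy - sum_x * sum_y / n) / n      # covariance of X and Y (divisor n)
    s_x2 = (sum_x2 - sum_x ** 2 / n) / n         # variance of X (divisor n)
    b1 = s_xy / s_x2                             # slope
    b0 = sum_y / n - b1 * sum_x / n              # intercept: Y-bar minus b1 * X-bar
    return b0, b1


b0, b1 = least_squares_line(hardwood, strength)
print(f"Strength = {b0:.1f} + {b1:.2f} * Hardwood")
```

Because the divisor n appears in both the covariance and the variance, it cancels in the slope, so the same line results if n - 1 is used in both places.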
Interpretation of the estimated parameters
Strength = 143.8 + 1.88 Hardwood
• The value b1 = 1.88 indicates that the average strength of paper is expected to increase by 1.88 units for each one-percentage-point increase in the amount of hardwood in the pulp.
• The value b0 = 143.8 indicates that the estimated average strength of paper is 143.8 when there is no hardwood in the pulp. This interpretation is not always valid: be careful in interpreting the intercept when the scope of the model does not cover X = 0.
• The observed range of hardwood (the explanatory variable) in the experiment was 10 to 30 percent (i.e. the scope of the model), so it would be an unreasonable extrapolation to expect this rate of increase to continue if the hardwood percentage were raised beyond that range. It is safe to use the results of regression only within the range of the observed values of the independent variable (i.e. within the scope of the model).
Properties of the Least Squares Line
• The sum of the errors (residuals) is always equal to zero
• The sum of squared errors is a minimum
• The least squares line always passes through the point of means $(\bar{X}, \bar{Y})$ of the data
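These properties can be verified numerically on the paper-strength data. The sketch below is illustrative only: it refits the line, then checks the residual sum, the fitted value at the mean of X, and that two arbitrarily perturbed lines (chosen here purely as examples) give a larger sum of squared errors.

```python
strength = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]   # Y
hardwood = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]             # X

n = len(hardwood)
x_bar = sum(hardwood) / n
y_bar = sum(strength) / n

# Least squares slope and intercept from deviations about the means
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hardwood, strength))
      / sum((x - x_bar) ** 2 for x in hardwood))
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared vertical distances from the data to a given line."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(hardwood, strength))


sum_residuals = sum(y - (b0 + b1 * x) for x, y in zip(hardwood, strength))
fitted_at_mean = b0 + b1 * x_bar          # line evaluated at X-bar
sse_least_squares = sse(b0, b1)
sse_shifted = sse(b0 + 1.0, b1)           # same slope, intercept moved by 1
sse_tilted = sse(b0, b1 + 0.1)            # same intercept, steeper slope

print(round(sum_residuals, 10), fitted_at_mean, y_bar)
print(sse_least_squares, sse_shifted, sse_tilted)
```

The residual sum is zero only up to floating-point rounding, which is why it is rounded before printing.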
Measuring The Reliability Of The Estimating Equation

[Figure: Scatterplot of Strength vs Hardwood. x-axis: Hardwood (10 to 30); y-axis: Strength (160 to 200).]

The observed values of Y do not all fall on the regression line but scatter away from it. The degree of scatter of the observed values about the regression line is measured by the standard error of estimate (or standard error of regression), denoted by $S_e$.
The standard error of estimate measures the variability of the observed points about the regression line. A small variation indicates that the estimated regression is adequate.
Standard Error of Estimate

$S_e^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2} = \frac{\sum e^2}{n - 2} \approx 4.9$

$S_e = \sqrt{S_e^2} = 2.21$

Strength (Y)  Hardwood (X)  Y^       e = Y - Y^  e^2
160           10            162.6    -2.6        6.76
171           15            172.0    -1.0        1.00
175           15            172.0     3.0        9.00
182           20            181.4     0.6        0.36
184           20            181.4     2.6        6.76
181           20            181.4    -0.4        0.16
188           25            190.8    -2.8        7.84
193           25            190.8     2.2        4.84
195           28            196.4    -1.4        2.07
200           30            200.2    -0.2        0.04
Total: 1829   208           1829.04  -0.04       38.83

The standard deviation of regression can be interpreted as the average deviation of the points from the estimated regression line, and hence a small deviation is desired for a better-fitted model.
Precision of Estimators b0 & b1
• The OLS estimators are random variables, as they may take different values for different samples, and the variance of a random variable is a measure of its dispersion around the mean.
• The smaller the variance, the closer, on average, individual values are to the mean; thus the variance of an estimator is an indicator of the precision of the estimator. The standard deviation (square root of the variance) of an estimator is called its standard error.

$SE(b_0) = S_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{n S_x^2}} = 2.53 \qquad SE(b_1) = S_e \sqrt{\frac{1}{n S_x^2}} = 0.117$
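As a rough check of these formulas, the sketch below recomputes $S_e$, $SE(b_0)$ and $SE(b_1)$ from the raw data. It assumes the same ten observations; variable names such as `sxx` (which equals $n S_x^2$) are illustrative. Because it carries full precision rather than the rounded $S_e = 2.21$, the last digits may differ slightly from the slide values.

```python
import math

strength = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]   # Y
hardwood = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]             # X

n = len(hardwood)
x_bar = sum(hardwood) / n
y_bar = sum(strength) / n

sxx = sum((x - x_bar) ** 2 for x in hardwood)        # equals n * S_x^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hardwood, strength))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of estimate: residual sum of squares over n - 2
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hardwood, strength))
s_e = math.sqrt(sse / (n - 2))

# Standard errors of the intercept and the slope
se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)
se_b1 = s_e / math.sqrt(sxx)

print(f"S_e = {s_e:.2f}, SE(b0) = {se_b0:.2f}, SE(b1) = {se_b1:.3f}")
```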
Inference in Simple Linear Regression
(From samples to population)

• Generally, more is sought in regression analysis than a description of observed data. One usually wishes to draw inferences about the relationship of the variables in the population from which the sample was taken.
• The slope and the intercept estimated from a single sample typically differ from the population values and vary from sample to sample. To use these estimates for inference about the population values, the sampling distributions of the two statistics are needed.
STEP 3: Test the significance of regression by t-test
Test the hypothesis that there is no linear relation between strength and amount of hardwood in the population, i.e. that strength is not related to hardwood, at the 5% level of significance.
1) Construction of hypotheses
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
2) Level of significance
$\alpha = 5\%$
3) Test statistic
$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{1.88 - 0}{0.117} \approx 16.05^{*}$
4) Decision rule: Reject $H_0$ if $t_{cal} \le -t_{\alpha/2,\,(n-2)}$ or $t_{cal} \ge t_{\alpha/2,\,(n-2)}$, where $t_{0.025,\,8} = 2.306$
5) Result: Since 16.05 > 2.306, reject $H_0$ and conclude that there is a significant relationship between strength and amount of hardwood.
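The five steps of the test can be sketched in a few lines. This is only an illustration: the function name `slope_t_test` is hypothetical, and the critical value 2.306 is taken from the slide rather than computed. With the rounded inputs 1.88 and 0.117 the statistic comes out near 16.07, marginally different from the 16.05 reported above, and the decision is the same.

```python
def slope_t_test(b1, se_b1, t_critical, beta1_null=0.0):
    """Two-sided t-test for the slope; returns (t_cal, reject_h0)."""
    t_cal = (b1 - beta1_null) / se_b1
    reject_h0 = abs(t_cal) >= t_critical     # rejection region in both tails
    return t_cal, reject_h0


# Rounded estimates and tabulated t(0.025, 8) from the worked example
t_cal, reject_h0 = slope_t_test(b1=1.88, se_b1=0.117, t_critical=2.306)
print(f"t = {t_cal:.2f}, reject H0: {reject_h0}")
```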
Confidence intervals for regression parameters
• A statistic calculated from a sample provides a point estimate of the unknown parameter.
• A point estimate can be thought of as the single best guess for the population value.
• While the estimated value from the sample is typically different from the value of the unknown population parameter, the hope is that it isn't too far away.
• Based on the sample estimates, it is possible to calculate a range of values that, with a designated likelihood, includes the population value. Such a range is called a confidence interval.
95% C.I for $\beta_1$
$b_1 \pm t_{\alpha/2,\,(n-2)}\, SE(b_1)$
$1.88 \pm 2.306 \times 0.117$
$(1.61,\ 2.15)$

A 90% C.I can be interpreted as follows: if we take 100 samples of the same size under the same conditions and compute 100 C.I.s for the parameter, one from each sample, then about 90 of these C.I.s are expected to contain the parameter (i.e. not all the constructed C.I.s).
A confidence interval estimate of a parameter is more informative than a point estimate because it reflects the precision of the estimator.
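The interval arithmetic is simple enough to script. The sketch below uses the rounded slope, standard error and tabulated t value quoted on the slide; the helper name `confidence_interval` is an assumption made for illustration.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) for estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# 95% interval for the slope: b1 = 1.88, SE(b1) = 0.117, t(0.025, 8) = 2.306
lower, upper = confidence_interval(1.88, 0.117, 2.306)
contains_zero = lower <= 0.0 <= upper

print(f"95% C.I for beta1: ({lower:.2f}, {upper:.2f})")
print("Interval contains 0:", contains_zero)
```

Checking whether the interval contains 0 gives the same decision as the two-sided test of $\beta_1 = 0$ at the matching significance level.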
Relation between confidence interval and two-sided hypothesis
If the constructed confidence interval does not contain the value of the parameter under $H_0$, then reject $H_0$.
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

95% C.I for $\beta_1$: (1.61, 2.15)

As the constructed confidence interval (1.61, 2.15) does not contain the value of $\beta_1$ under $H_0$ (i.e. 0), reject $H_0$.
Partition of variation in dependent variable
Total variation = Explained variation (Regression) + Unexplained variation (Error)
Total variation = $n S_y^2$ = 1300.9
Explained variation = $n\, b_1 S_{xy}$ = 1262.1
Unexplained variation = Total - Explained = 38.8
STEP 3: Test the significance of regression by F-test (ANOVA)

ANOVA TABLE
(S.O.V = source of variation, DF = degrees of freedom, SS = sum of squares, MSS = mean sum of squares = SS/DF)
S.O.V        DF        SS       MSS      Fcal     Ftab                  p-value
Regression   1         1262.1   1262.1   257.6*   F.05(1,8) = 5.318     0.000
Error        8         38.8     4.9
Total        10-1 = 9  1300.9

Relation between F and t for testing $\beta_1 = 0$:
$F = t^2$, i.e. $257.6 = (16.05)^2$
Decision of rejecting $H_0$ by p-value (probability value)
The p-value is the probability, computed assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed. It measures the support the data give to $H_0$; when that support is very small (p-value < α), reject $H_0$.
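The ANOVA quantities can be recomputed directly from the data. The sketch below is an illustration under the same ten observations. It keeps full precision, so the F ratio comes out at roughly 260 rather than the 257.6 obtained from the rounded mean square 4.9. The tabulated critical value 5.318 is copied from the table above.

```python
strength = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]   # Y
hardwood = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]             # X

n = len(hardwood)
x_bar = sum(hardwood) / n
y_bar = sum(strength) / n

sxx = sum((x - x_bar) ** 2 for x in hardwood)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hardwood, strength))
b1 = sxy / sxx

sst = sum((y - y_bar) ** 2 for y in strength)   # total variation, n * S_y^2
ssr = b1 * sxy                                  # explained variation, n * b1 * S_xy
sse = sst - ssr                                 # unexplained variation

msr = ssr / 1                                   # regression DF = 1
mse = sse / (n - 2)                             # error DF = n - 2
f_cal = msr / mse
r_squared = ssr / sst

f_tab = 5.318                                   # F(0.05; 1, 8) from the slide
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"F = {f_cal:.1f}, reject H0: {f_cal > f_tab}, R^2 = {r_squared:.2%}")
```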
STEP 4: Goodness of Fit
The coefficient of determination tells us the percent of variation in the dependent variable explained by the independent variable.

$R^2 = \frac{\text{Explained}}{\text{Total}} \times 100 = \frac{1262.1}{1300.9} \times 100 = 97\%$

About 97% of the variation in strength is due to the amount of hardwood and the remainder is due to some other unknown factors, i.e. there is no need to improve the model by introducing other variables or another functional form.

Possible values of $R^2$: $0 \le R^2 \le 1$, or as a percentage $0\% \le R^2 \le 100\%$

[Figure: Structureless residual plot (Y hat vs Residual). x-axis: Y hat (160 to 210); y-axis: Residuals (-4 to 4).]
STEP 5: Estimation in regression
(Predicting unknown value of Y from known value of X)
• Estimate the strength of paper when hardwood concentration is 23

Strength = 143.8 + 1.88(23) = 187.04

• NOTE: Predictions made using an estimated regression function may have little or no validity for values of the independent variables that are substantially different from those represented in the sample.
Regression
• Linear
  • Simple
  • Multiple
• Non-Linear
  • Intrinsically Linear
  • Intrinsically Non-Linear
Non-Linear Regression
It is easy to deal with regression that is linear in the parameters, but in some situations the models are non-linear. Non-linear models can be divided into two types:
• Intrinsically Linear models
• Intrinsically Non-Linear models
Models that can be transformed into linear models by applying a suitable transformation are called intrinsically linear models, and models that cannot be transformed into linear models are called intrinsically non-linear models.
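Regression on Transformed Variables
Nonlinearity (in parameters) is visually determined from the scatter diagram, and sometimes, because of prior experience or underlying theory, we know in advance that the model is nonlinear.

Each row below gives the nonlinear model, the transformation applied, and the resulting linear form:
1. $Y = \beta_0 \beta_1^{X} \epsilon$ ; $\log(Y) = \log(\beta_0) + X \log(\beta_1) + \log(\epsilon)$ ; $Y^* = \beta_0^* + \beta_1^* X + \epsilon^*$
2. $Y = \beta_0 X^{\beta_1} \epsilon$ ; $\log(Y) = \log(\beta_0) + \beta_1 \log(X) + \log(\epsilon)$ ; $Y^* = \beta_0^* + \beta_1 X^* + \epsilon^*$
3. $Y = \beta_0 e^{\beta_1 X} \epsilon$ ; $\ln(Y) = \ln(\beta_0) + \beta_1 X + \ln(\epsilon)$ ; $Y^* = \beta_0^* + \beta_1 X + \epsilon^*$
4. $Y = \frac{1}{\beta_0 + \beta_1 X + \epsilon}$ ; $\frac{1}{Y} = \beta_0 + \beta_1 X + \epsilon$ ; $Y^* = \beta_0 + \beta_1 X + \epsilon$
5. $Y = \frac{X}{\beta_0 X + \beta_1 + X\epsilon}$ ; $\frac{1}{Y} = \beta_0 + \beta_1 \frac{1}{X} + \epsilon$ ; $Y^* = \beta_0 + \beta_1 X^* + \epsilon$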
EXAMPLE: The number (Y) of bacteria per unit volume present in a culture after X hours is given in the following table. Fit a least squares curve having the form $Y = \beta_0 \beta_1^{X} \epsilon$ to the data. Estimate the value of Y when X = 7.

Y     X
32    0
47    1
65    2
92    3
132   4
190   5
275   6
[Figure: Two scatter plots. Left: Y vs X (Y from 0 to 300). Right: Log(Y) = Y* vs X (Y* from 1.5 to 2.4). x-axis in both panels: X (0 to 6).]

$Y = \beta_0 \beta_1^{X} \epsilon$, so $\log(Y) = \log(\beta_0) + X \log(\beta_1) + \log(\epsilon)$, i.e. $Y^* = \beta_0^* + \beta_1^* X + \epsilon^*$
X   Y     Y* = log(Y)
0   32    1.50515
1   47    1.6721
2   65    1.81291
3   92    1.96379
4   132   2.12057
5   190   2.27875
6   275   2.43933

Estimation of the regression of Y* on X by least squares:
b0* = 1.51, b1* = 0.154

Y* = 1.51 + 0.154 X
log(b0) = 1.51, so b0 = antilog(1.51) = 32.36

log(b1) = 0.154, so b1 = antilog(0.154) = 1.43

$\hat{Y} = b_0\, b_1^{X}$

$\hat{Y} = 32.36\,(1.43)^{X}$

Estimate growth after 7 hours:

$\hat{Y} = 32.36\,(1.43)^{7} = 395.70$
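The transform-fit-back-transform procedure can be sketched as follows. This is a minimal illustration assuming base-10 logarithms, as in the table of Y* values; the helper `fit_line` is a hypothetical name. Because it carries full precision instead of rounding the coefficients to 1.51 and 0.154, its prediction at X = 7 is somewhat lower than the slide figure (roughly 387 rather than 395.70), which shows how sensitive an exponential model is to coefficient rounding.

```python
import math

hours = [0, 1, 2, 3, 4, 5, 6]                 # X
counts = [32, 47, 65, 92, 132, 190, 275]      # Y


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


# Step 1: linearise with Y* = log10(Y)
y_star = [math.log10(y) for y in counts]

# Step 2: ordinary least squares of Y* on X
b0_star, b1_star = fit_line(hours, y_star)

# Step 3: back-transform with antilogs to recover Y = b0 * b1**X
b0 = 10 ** b0_star
b1 = 10 ** b1_star


def predict(x):
    return b0 * b1 ** x


pred_7 = predict(7)
print(f"Y* = {b0_star:.3f} + {b1_star:.4f} X")
print(f"Y  = {b0:.2f} * {b1:.3f}^X, prediction at X = 7: {pred_7:.1f}")
```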
Example: The Michaelis-Menten equation for an enzyme reaction shows the relation between substrate concentration and reaction rate. Fit the Michaelis-Menten equation and estimate Vmax and Km.

$V = \frac{V_{max} \times C}{K_m + C}$

V = Reaction rate (mM/min) [Y]
C = Substrate concentration (mM) [X]
Vmax = Theoretical maximum rate of the process
Km = Michaelis coefficient

C (mM)   V (mM/min)
1.5      0.21
2        0.24
3        0.28
4        0.33
8        0.40
16       0.45

Linear regression: V = 0.2291 + 0.01552 C, R² = 84.67%

[Figure: Scatter plot of C vs V (Michaelis-Menten plot). x-axis: C (mM), 0 to 16; y-axis: V (mM/min), 0.20 to 0.45.]
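Transformation to Linear form

$V = \frac{V_{max} \times C}{K_m + C} \;\Rightarrow\; \frac{1}{V} = \frac{K_m + C}{V_{max} \times C} = \frac{K_m}{V_{max} \times C} + \frac{C}{V_{max} \times C}$

$\frac{1}{V} = \frac{1}{V_{max}} + \frac{K_m}{V_{max}} \cdot \frac{1}{C}$

This is the linear form Y* = b0* + b1* X*, with Y* = 1/V, b0* = 1/Vmax, b1* = Km/Vmax and X* = 1/C.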
C (mM)   V (mM/min)   1/C      1/V
1.5      0.21         0.6667   4.7619
2        0.24         0.5000   4.1667
3        0.28         0.3333   3.5714
4        0.33         0.2500   3.0303
8        0.40         0.1250   2.5000
16       0.45         0.0625   2.2222

[Figure: Scatter plot of 1/C vs 1/V (Lineweaver-Burk plot). x-axis: 1/C (0.0 to 0.7); y-axis: 1/V (2.0 to 5.0).]
Estimated Linear Regression Equation:
Y* = b0* + b1* X* = 1.998 + 4.266 X*
R² = 99.287%

Calculation of Vmax and Km

$b_0^* = \frac{1}{V_{max}} = 1.998$, so $V_{max} = \frac{1}{1.998} = 0.5005$

$b_1^* = \frac{K_m}{V_{max}} = 4.266$, so $K_m = V_{max} \times 4.266 = 0.5005 \times 4.266 = 2.135$

Estimated Michaelis-Menten Equation

$V = \frac{0.5005\, C}{2.135 + C}$
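The double-reciprocal fit can be reproduced with the sketch below. It assumes the six (C, V) pairs from the example and takes reciprocals at full precision rather than the four-decimal table values, so the last digit of each result may differ slightly. The helper names `fit_line` and `rate` are illustrative.

```python
conc = [1.5, 2, 3, 4, 8, 16]                    # C, substrate concentration (mM)
rate_obs = [0.21, 0.24, 0.28, 0.33, 0.40, 0.45] # V, reaction rate (mM/min)


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


# Double-reciprocal transformation: X* = 1/C, Y* = 1/V
x_star = [1 / c for c in conc]
y_star = [1 / v for v in rate_obs]

b0_star, b1_star = fit_line(x_star, y_star)

# Back out the kinetic parameters from b0* = 1/Vmax and b1* = Km/Vmax
v_max = 1 / b0_star
k_m = b1_star * v_max


def rate(c):
    """Fitted Michaelis-Menten curve."""
    return v_max * c / (k_m + c)


print(f"1/V = {b0_star:.3f} + {b1_star:.3f} (1/C)")
print(f"Vmax = {v_max:.4f}, Km = {k_m:.3f}")
```

A design note: least squares on the reciprocal scale weights the low-concentration points heavily, so the estimates are best treated as the result of this particular linearisation.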
Regression with more than one independent variable
Multiple linear regression is a relationship that describes the dependence of the mean value of the response variable (Y) on given values of two or more independent variables (X).

There are many applications where several explanatory variables affect the dependent variable, for example:
1) The yield of a crop depends upon the fertility of the land, dose of the fertilizer applied, quantity of seed, etc.
2) The grade point average of students depends on aptitude, mental ability, hours devoted to study, and the type and nature of grading by teachers.
3) The systolic blood pressure of a person depends upon one's weight, age, etc.
Multiple Linear Regression with two regressors
A real estate agency collects the following data concerning
Y = Sales price of a house (in thousands of dollars)
X1 = Home size (in hundreds of square feet)
X2 = Rating (an overall "niceness rating" for the house expressed on a scale from 0 (worst) to 10 (best), provided by the real estate agency)

Sales price (Y)   Home Size (X1)   Rating (X2)
120.0             23               5
65.4              11               2
115.4             20               9
91.0              17               3
94.0              15               8
110.6             21               4
129.0             24               7
85.2              13               6
109.0             19               7
115.0             25               2
• STEP 1: Identify the type of relationship (functional form) between variables by using a graphical tool (scatter plot)
• STEP 2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
• STEP 3: Test the overall significance of regression by F-test (ANOVA)
  • Test the individual significance of regression by t-test
• STEP 4: Measure the goodness of fit of the model (Coefficient of Determination)
• STEP 5: Use the estimated regression equation for prediction of an unknown value of the dependent variable
STEP 1: Identify the type of relationship between variables by using a graphical tool (scatter plot)
Population Regression Model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e$
where Y is the response, $\beta_1$ and $\beta_2$ are the partial regression coefficients, and e is the random error.
Estimation of parameters from sample data

Don't worry, we will use a computer for the calculation.

$b_1 = \frac{(S_{x_2}^2)(S_{x_1 y}) - (S_{x_1 x_2})(S_{x_2 y})}{(S_{x_1}^2)(S_{x_2}^2) - (S_{x_1 x_2})^2}$

$b_2 = \frac{(S_{x_1}^2)(S_{x_2 y}) - (S_{x_1 x_2})(S_{x_1 y})}{(S_{x_1}^2)(S_{x_2}^2) - (S_{x_1 x_2})^2}$

$b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2$
• STEP 2: Estimate the relation from sample data by using the Method of Least Squares and interpret it

Computer Output: Estimation by Method of Least Squares
                Coefficients   Standard Error   t Stat   P-value
Intercept       b0 = 19.56     3.26             6.00     0.00
Home Size X1    b1 = 3.74      0.15             24.56    0.00
Rating X2       b2 = 2.56      0.29             8.85     0.00

Price = 19.56 + 3.74 Size + 2.56 Rating
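The coefficient estimates can be reproduced without a statistics package by applying the two-regressor formulas for b1, b2 and b0 directly. The sketch below is illustrative: helper names such as `cov` and `fit_two_regressors` are assumptions, and variances and covariances use divisor n (the divisor cancels in the ratios, so n - 1 would give the same coefficients).

```python
price = [120.0, 65.4, 115.4, 91.0, 94.0, 110.6, 129.0, 85.2, 109.0, 115.0]  # Y ($1000s)
size = [23, 11, 20, 17, 15, 21, 24, 13, 19, 25]                            # X1 (100 sq ft)
rating = [5, 2, 9, 3, 8, 4, 7, 6, 7, 2]                                    # X2 (0 to 10)


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Covariance with divisor n; cov(u, u) is the variance of u."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / len(u)


def fit_two_regressors(y, x1, x2):
    """Closed-form least squares for y = b0 + b1*x1 + b2*x2."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    denom = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / denom
    b2 = (s11 * s2y - s12 * s1y) / denom
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


b0, b1, b2 = fit_two_regressors(price, size, rating)


def predict(home_size, home_rating):
    return b0 + b1 * home_size + b2 * home_rating


print(f"Price = {b0:.2f} + {b1:.2f} Size + {b2:.2f} Rating")
print(f"Predicted price for Size = 15, Rating = 7: {predict(15, 7):.2f}")
```

With more than two regressors these closed forms become unwieldy, which is why the slides defer to computer output. A matrix-based least squares routine is the usual choice in that case.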
Price = 19.56 + 3.74 Size + 2.56 Rating
• The value b1 = 3.74 indicates that the mean sale price is expected to increase by $3,740 (3.74 thousand) with each 100 square feet increase in house size, keeping the effect of rating constant.
• The value b2 = 2.56 indicates that the mean sale price is expected to increase by $2,560 (2.56 thousand) with each 1 point increase in rating, keeping the effect of house size constant.
STEP 3: Overall significance of regression by F-test (ANOVA)
$H_0: \beta_1 = \beta_2 = 0$
i.e. both independent variables have no influence on the dependent variable
$H_1$: At least one $\beta$ is not equal to zero
i.e. at least one independent variable has a significant influence on the dependent variable

ANOVA
             df   SS             MS         F         Significance F
Regression   2    3277.311817    1638.656   350.87*   9.57533E-08
Residual     7    32.69218291    4.670312
Total        9    3310.004
STEP 3: Test the individual significance of regression by t-test
Significance of X1
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{3.74 - 0}{0.152} = 24.56$

Significance of X2
$H_0: \beta_2 = 0$
$H_1: \beta_2 \neq 0$

                Coefficients   Standard Error   t Stat   P-value
Intercept       19.56          3.261            6.000    0.001
Home Size X1    3.74           0.152            24.561   0.000
Rating X2       2.56           0.289            8.851    0.000

Both independent variables have a significant effect on the dependent variable, so keep both variables in the final model.
STEP 4: Measure the goodness of fit of the model (Coefficient of Determination)

$R^2 = \frac{\text{Explained}}{\text{Total}} \times 100 = \frac{3277.31}{3310} \times 100 = 99\%$

99% of the variation in house price has been explained by square feet of constructed area and rating, while only 1% is due to other unknown factors.

[Figure: Residual plot. x-axis ticks from 90 to 120; y-axis: Residuals (-40 to 30).]
STEP 5: Use the estimated regression equation for prediction of an unknown value of the dependent variable

Estimation of the Price (dependent) variable

Estimate the price of a house with 1500 square feet area and a rating of 7 (Size = 15, since size is measured in hundreds of square feet).

Price = 19.56 + 3.74 Size + 2.56 Rating

Price = 19.56 + 3.74 × 15 + 2.56 × 7
= 93.58 thousand dollars
= $93,580
Polynomial Regression ( Quadratic Regression)
Example:- A product development engineer is interested in investigating
the tensile strength of a new synthetic fiber that will be used to make
cloth for men's shirts.
He knows from experience that the strength is affected by the weight
percent of cotton used in the blend of materials for the fiber.

The engineer decides to test specimens at five levels of cotton weight percent: (15, 20, 25, 30, 35). Fit a curve of appropriate degree and estimate the cotton percent for maximum tensile strength.
Y = Strength
X = Cotton %

Y (Strength)   X1 (Cotton)
10             15
15             20
19             25
17             30
10             35
Scatter plot
[Figure: Scatter plot of Cotton vs Strength. x-axis: cotton (12 to 37); y-axis: strength (8 to 20).]
Quadratic regression model
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + e$
where $\beta_1$ is the linear regression coefficient and $\beta_2$ is the quadratic regression coefficient.

For estimation, convert the quadratic regression to a multiple linear regression by setting $X = X_1$ and $X^2 = X_2$:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e$

Computer output
R Square         0.968
Standard Error   1.028

ANOVA
             df   SS      MS      F       p value
Regression   2    64.69   32.34   30.59   0.03
Residual     2    2.11    1.06
Total        4    66.80

                Coefficients   SE      t Stat   P-value
Intercept       -36.086        6.542   -5.516   0.031
X1 (Cotton)     4.326          0.553   7.816    0.016
X2 (Cotton)^2   -0.086         0.011   -7.798   0.016

$\text{Strength} = -36.09 + 4.33\,\text{Cotton} - 0.0857\,\text{Cotton}^2$
Analysis of Variance (Overall Significance)

$H_0: \beta_1 = \beta_2 = 0$
$H_1$: At least one $\beta$ is not equal to zero

ANOVA
             df   SS      MS      F       p value
Regression   2    64.69   32.34   30.59   0.03
Residual     2    2.11    1.06
Total        4    66.80
Significance of Quadratic Regression

                Coefficients   Standard Error   t Stat   P-value
Intercept       -36.086        6.542            -5.516   0.031
X1 (Cotton)     4.326          0.553            7.816    0.016
X2 (Cotton)^2   -0.086         0.011            -7.798   0.016
Regression Statistics
R Square         0.968
Standard Error   1.028

Coefficient of Determination = 97%

97% of the variation in strength has been explained by cotton percentage and the remaining 3% is due to other unknown factors.

Estimate the strength with 22% cotton:

Strength with 22 percent cotton = 17.5943
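The value of X at which the maximum or minimum value of the quadratic regression occurs is

$X = \frac{-b_1}{2 b_2} = 25.23$

The maximum or minimum value of Y is

$b_0 - \frac{b_1^2}{4 b_2} = 18.49$

These values use the unrounded coefficients (b0 = -36.0857, b1 = 4.3257, b2 = -0.0857). Since $b_2 < 0$ here, the turning point is a maximum.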
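The quadratic fit and its turning point can be sketched in plain Python by solving the normal equations for Y = b0 + b1 X + b2 X². The small Gaussian-elimination helper below is written for illustration only and is not the method used to produce the computer output above. The turning point is sensitive to how the coefficients are rounded, so results from rounded coefficients can differ from these in the first or second decimal place.

```python
def solve_linear_system(a, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]       # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_quadratic(x, y):
    """Least squares coefficients (b0, b1, b2) via the normal equations."""
    s = [sum(xi ** p for xi in x) for p in range(5)]                  # sums of X^0 .. X^4
    t = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in range(3)] # sums of Y * X^p
    a = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve_linear_system(a, t)


cotton = [15, 20, 25, 30, 35]       # X, cotton weight percent
strength = [10, 15, 19, 17, 10]     # Y, tensile strength

b0, b1, b2 = fit_quadratic(cotton, strength)

x_opt = -b1 / (2 * b2)              # location of the turning point
y_opt = b0 - b1 ** 2 / (4 * b2)     # fitted strength at the turning point
pred_22 = b0 + b1 * 22 + b2 * 22 ** 2

print(f"Strength = {b0:.3f} + {b1:.3f} Cotton + {b2:.4f} Cotton^2")
print(f"Turning point at X = {x_opt:.2f}, Y = {y_opt:.2f}; Y(22) = {pred_22:.4f}")
```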
Linear vs Quadratic Regression
Linear Regression (Y = b0 + b1 X)

Regression Statistics
Multiple R          0.077
R Square            0.006
Adjusted R Square   -0.325
Standard Error      4.705
Observations        5

ANOVA
             df   SS     MS          F            Significance F
Regression   1    0.4    0.4         0.01807229   0.901572184
Residual     3    66.4   22.133333
Total        4    66.8

                Coefficients   Standard Error   t Stat      P-value
Intercept       13.2           7.730459236      1.7075312   0.18626064
X1 (Cotton %)   0.04           0.297545515      0.1344332   0.90157218

[Figure: Residual plot for the linear fit. x-axis: Predicted Y (13.6 to 14.8); y-axis: Residuals (-6 to 6).]
Quadratic Regression (Y = b0 + b1 X + b2 X²)

Regression Statistics
Multiple R          0.984
R Square            0.968
Adjusted R Square   0.937
Standard Error      1.028
Observations        5

ANOVA
             df   SS      MS      F       Significance F
Regression   2    64.69   32.34   30.59   0.03
Residual     2    2.11    1.06
Total        4    66.80

                Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept       -36.086        6.542            -5.516   0.031     -64.234     -7.937
X1 (Cotton)     4.326          0.553            7.816    0.016     1.945       6.707
X2 (Cotton)^2   -0.086         0.011            -7.798   0.016     -0.133      -0.038

[Figure: Residual plot for the quadratic fit. x-axis: Predicted Y (8.00 to 20.00); y-axis: Residuals (-1.50 to 1.00).]
