1 - Regression (NEW)
Relationship between variables
The objective of many investigations is to
understand and explain the relationship among
variables.
Frequently, one wants to know how and to what
extent a certain variable (dependent variable) is
related to a set of other variables (independent
variables).
TYPES OF RELATIONSHIP
1 -- Deterministic relationship (functional relationship)
The relationship between the two variables is known exactly:
• Pressure of a gas in a container is related to the temperature
• Velocity of water in an open channel is related to the width of the channel
• Displacement of a particle at a certain time is related to its velocity: if we let d_0 be the displacement of the particle from the origin at time t = 0 and v be the velocity, the displacement at time t is
d_t = d_0 + v t
Regression Analysis
Regression Analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables:
Y = f(X) + e
Note:
• f( ) describes the systematic variation in the relationship.
• e represents the unsystematic variation (or random error) in the relationship.
• Y = dependent variable, response, predictand, regressand; X = independent variable, stimulus, predictor, regressor.
Examples
• Sales = f(Adv. Expenditure) + e
STEP1: Graphical Method: Scatter Plot
A scatter plot is a graphical tool that may suggest what type of mathematical function would be appropriate for summarizing the data. A variety of functions are useful in fitting models to data.
[Scatter plot of Strength (160–200) vs Hardwood (10–30), showing an increasing trend]
STEP2: Numerical Method: Best Fit Line or Regression Line
[Scatter plot of Strength vs Hardwood with fitted line]
The observed data points do not all fall on a single line but cluster about it. Many lines can be drawn through the scatter; the best-fitting line is the one for which the total deviation of the points from the line is least (minimum).
The method of LEAST SQUARES results in a line that minimizes the sum of squared vertical distances from the observed data points to the line (i.e. the random error). Any other line has a larger sum.
Population Regression Model
Y_i = β0 + β1 X_i + e_i
where X is the regressor, Y the response, β0 and β1 the regression coefficients, and e_i the random error.
• In formulating the above relation between strength and amount of hardwood, we are ignoring the fact that strength of paper depends on other characteristics as well.
Estimation of Unknown Parameters
Best fit line to the data: LEAST SQUARES LINE
A least squares line is described in terms of its Y-intercept (the height at which it intercepts the Y-axis) and its slope (the angle of the line). The line can be expressed by the following relation:
Ŷ = b0 + b1 X   (estimated sample regression line)
where
• b1 = slope of the line: b1 = S_XY / S²_X
• S²_X = variance of X: S²_X = Σ(X − X̄)²/n = (1/n)[ΣX² − (ΣX)²/n]
• b0 = Y-intercept: b0 = Ȳ − b1 X̄
Strength (Y)  Hardwood (X)  XY     X²    Y²      Ŷ        e = Y − Ŷ   e²
160           10            1600   100   25600   162.6    −2.6        6.76
171           15            2565   225   29241   172.0    −1.0        1.00
175           15            2625   225   30625   172.0     3.0        9.00
182           20            3640   400   33124   181.4     0.6        0.36
184           20            3680   400   33856   181.4     2.6        6.76
181           20            3620   400   32761   181.4    −0.4        0.16
188           25            4700   625   35344   190.8    −2.8        7.84
193           25            4825   625   37249   190.8     2.2        4.84
195           28            5460   784   38025   196.4    −1.4        2.07
200           30            6000   900   40000   200.2    −0.2        0.04
1829          208           38715  4684  335825  1829.04  −0.04       38.83
ΣXY = 38715, ΣX = 208, ΣY = 1829, ΣX² = 4684, ΣY² = 335825
X̄ = 20.8, Ȳ = 182.9
S_XY = (1/10)[38715 − (208)(1829)/10] = 67.18
S²_X = (1/10)[4684 − (208)²/10] = 35.76
S²_Y = (1/10)[335825 − (1829)²/10] = 130.09
b1 = S_XY / S²_X = 67.18 / 35.76 = 1.88
b0 = Ȳ − b1 X̄ = 182.9 − 1.88(20.8) = 143.8
Estimated regression line: Strength = 143.8 + 1.88 Hardwood
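As a quick check, the slope and intercept can be reproduced from the column totals in the table above (a minimal Python sketch; variable names are illustrative):

```python
# Least-squares slope and intercept for the strength/hardwood data,
# reproducing the hand calculation from the column totals.
n = 10
sum_x, sum_y = 208, 1829
sum_xy, sum_x2 = 38715, 4684

s_xy = (sum_xy - sum_x * sum_y / n) / n   # covariance (n in denominator)
s_x2 = (sum_x2 - sum_x ** 2 / n) / n      # variance of X
b1 = s_xy / s_x2                          # slope
b0 = sum_y / n - b1 * sum_x / n           # intercept

print(round(s_xy, 2), round(s_x2, 2))  # 67.18 35.76
print(round(b1, 2), round(b0, 1))      # 1.88 143.8
```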
Measuring The Reliability Of The Estimating Equation
[Scatterplot of Strength vs Hardwood with the fitted regression line]
The observed values of Y do not all fall on the regression line but scatter away from it. The degree of scatter of the observed values about the regression line is measured by the standard error of estimate (or standard error of regression), denoted by Se.
The standard error of estimate measures the variability of the observed points about the regression line. A small variation indicates that the estimated regression is adequate.
Standard Error of Estimate
S²e = Σ(Y − Ŷ)² / (n − 2) = Σe² / (n − 2) = 38.83/8 = 4.9
Se = √4.9 = 2.21
(The Ŷ, e = Y − Ŷ and e² columns were computed in the table above, giving Σe² = 38.83.)
The standard deviation of regression can be interpreted as the average deviation of the points from the estimated regression line; hence a small deviation is desired for a better fitted model.
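The standard error of estimate follows directly from the residual sum of squares in the table (a short Python sketch; small differences from the slide's 4.9 and 2.21 are rounding):

```python
# Standard error of estimate for the strength/hardwood fit,
# from the residual sum of squares in the table (sum of e^2 = 38.83).
import math

n = 10
sse = 38.83                  # sum of squared residuals
s_e2 = sse / (n - 2)         # residual variance, df = n - 2
s_e = math.sqrt(s_e2)        # standard error of estimate

print(round(s_e2, 1), round(s_e, 1))  # 4.9 2.2
```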
Precision of Estimators b0 & b1
• The OLS estimators are random variables: they may take different values for different samples, and the variance of a random variable is a measure of its dispersion around its mean.
SE(b0) = Se √(1/n + X̄²/(n S²_X)) = 2.53
SE(b1) = Se √(1/(n S²_X)) = 0.117
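These standard errors can be verified numerically from the summary statistics computed earlier (a minimal sketch using the rounded slide values):

```python
# Standard errors of the OLS estimators b0 and b1, using
# S_e = 2.21, S_X^2 = 35.76 and X-bar = 20.8 from the slides.
import math

n = 10
s_e = 2.21          # standard error of estimate
s_x2 = 35.76        # variance of X (n in the denominator)
x_bar = 20.8

se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / (n * s_x2))
se_b1 = s_e * math.sqrt(1 / (n * s_x2))

print(round(se_b0, 2), round(se_b1, 3))  # 2.53 0.117
```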
Inference in Simple Linear Regression
(From samples to population)
STEP3: Test the significance of regression by t test
Test the hypothesis that there is no linear relation between strength and amount of hardwood in the population, i.e. strength is not related with hardwood, at the 5% level of significance.
1) Construction of hypotheses
Ho: β1 = 0
H1: β1 ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b1 − β1)/SE(b1) = (1.88 − 0)/0.117 = 16.05*
4) Decision rule: Reject Ho if t_cal ≤ −t(α/2, n−2) or t_cal ≥ t(α/2, n−2), where t(0.025, 8) = 2.306.
Since 16.05 > 2.306, Ho is rejected: the regression is significant.
95% C.I. for β1
b1 ± t(α/2, n−2) SE(b1)
1.88 ± 2.306 × 0.117
(1.61, 2.15)
A 90% C.I. can be interpreted as follows: if we take 100 samples of the same size under the same conditions and compute 100 C.I.s about the parameter, one from each sample, then 90 such C.I.s will contain the parameter (i.e. not all the constructed C.I.s).
A confidence interval estimate of a parameter is more informative than a point estimate because it reflects the precision of the estimator.
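The interval arithmetic is a one-liner (using the slide's rounded values):

```python
# 95% confidence interval for the slope, using b1 = 1.88,
# SE(b1) = 0.117 and t(0.025, 8) = 2.306.
b1, se_b1, t_crit = 1.88, 0.117, 2.306

margin = t_crit * se_b1
ci = (round(b1 - margin, 2), round(b1 + margin, 2))

print(ci)  # (1.61, 2.15)
```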
Relation between Confidence Interval and Two-Sided Hypothesis
If the constructed confidence interval does not contain the value of the parameter under Ho, then reject Ho.
Ho: β1 = 0
H1: β1 ≠ 0
Since the 95% C.I. (1.61, 2.15) does not contain 0, Ho is rejected.
Partition of variation in dependent variable
Total variation = Explained variation + Unexplained variation
STEP4: Test the significance of regression by F test (ANOVA)
ANOVA TABLE
Source of Variation (S.O.V)  DF       SS      MSS = SS/df  Fcal    Ftab            p-Value
Regression                   1        1262.1  1262.1       257.6*  F0.05(1,8)=5.318  0.000
Error                        8        38.8    4.9
Total                        10−1=9   1300.9
R² = (Explained/Total) × 100 = (1262.1/1300.9) × 100 = 97%
About 97% of the variation in strength is due to the amount of hardwood.
Possible values of R²: 0 ≤ R² ≤ 1 (or 0 ≤ R² ≤ 100%)
[Residual plot: residuals vs fitted values Ŷ, 160–210]
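The F statistic and R² follow from the sums of squares in the table (using the mean squares as reported there):

```python
# Reproducing the ANOVA quantities from the table:
# SSR = 1262.1, SSE = 38.8, SST = 1300.9, MSE = 4.9.
ssr, sst = 1262.1, 1300.9
msr, mse = 1262.1, 4.9      # mean squares as reported in the table

f_stat = msr / mse          # overall F statistic
r2 = ssr / sst              # coefficient of determination

print(round(f_stat, 1), round(r2 * 100))  # 257.6 97
```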
STEP5: Estimation in regression
(Predicting unknown value of Y from known value of X)
For example, the predicted strength at 23% hardwood is
Strength = 143.8 + 1.88(23) = 187.04
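As an end-to-end check, the whole fit-and-predict pipeline can be run from the raw data (a sketch; the tiny difference from the slide's 187.04 comes from the slide rounding its coefficients):

```python
# Fit the least-squares line from the raw strength/hardwood data
# and predict strength at 23% hardwood, as in STEP5.
x = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
y = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / n

b1 = s_xy / s_x2
b0 = y_bar - b1 * x_bar
pred = b0 + b1 * 23

print(round(pred, 1))  # 187.0
```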
Regression
• Linear: Simple, Multiple
• Non-Linear: Intrinsically Linear, Intrinsically Non-Linear
Non-Linear Regression
It is easy to deal with regression that is linear in parameters, but in some situations the models are non-linear. The non-linear models can be divided into two types:
• Intrinsically Linear
• Intrinsically Non-Linear models
The models that can be transformed into linear models by applying some suitable transformation are called intrinsically linear models, and the models that cannot be transformed into linear models are called intrinsically non-linear models.
Regression on Transformed Variables
Nonlinearity (in parameters) is visually determined from the scatter diagram; sometimes, because of prior experience or underlying theory, we know in advance that the model is nonlinear.
EXAMPLE: The number (Y) of bacteria per unit volume present in a culture after X hours is given in the following table. Fit a least squares curve having the form Y = β0 β1^X ε to the data. Estimate the value of Y when X = 7.
Y    X
32   0
47   1
65   2
92   3
132  4
190  5
275  6
[Scatterplots: Y vs X (curved, exponential growth) and log(Y) = Y* vs X (approximately linear)]
Y = β0 β1^X ε  ⟹  log Y = log β0 + X log β1 + log ε  ⟹  Y* = β0* + β1* X + ε*
X  Y    Y* = log(Y)
0  32   1.50515
1  47   1.67210
2  65   1.81291
3  92   1.96379
4  132  2.12057
5  190  2.27875
6  275  2.43933
Estimation of the regression of Y* on X by least squares:
b0* = 1.51, b1* = 0.154
Y* = 1.51 + 0.154 X
Y* = 1.51 + 0.154 X
log(b0) = 1.51 ⟹ b0 = antilog(1.51) = 32.36
log(b1) = 0.154 ⟹ b1 = antilog(0.154) = 1.43
Y = b0 b1^X = 32.36 (1.43)^X
At X = 7: Y = 32.36 (1.43)^7 ≈ 396 (using the rounded coefficients).
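The whole transform-fit-backtransform procedure can be sketched in Python. Note that carrying full precision gives a slightly different intercept (32.15 vs the slide's 32.36, which uses the rounded 1.51) and hence a lower estimate at X = 7 (≈387 vs ≈396):

```python
# Fit Y = b0 * b1**X by least squares on log10(Y), as in the
# worked example, then estimate Y at X = 7.
import math

x = [0, 1, 2, 3, 4, 5, 6]
y = [32, 47, 65, 92, 132, 190, 275]
ystar = [math.log10(v) for v in y]
n = len(x)

x_bar = sum(x) / n
ys_bar = sum(ystar) / n
b1_star = sum((xi - yi_x) * (yi - ys_bar) * 0 for xi, yi_x, yi in []) or \
          sum((xi - x_bar) * (yi - ys_bar) for xi, yi in zip(x, ystar)) / \
          sum((xi - x_bar) ** 2 for xi in x)
b0_star = ys_bar - b1_star * x_bar

b0 = 10 ** b0_star   # antilog of the intercept
b1 = 10 ** b1_star   # antilog of the slope
y7 = b0 * b1 ** 7    # estimate at X = 7

print(round(b0_star, 2), round(b1_star, 3))  # 1.51 0.154
print(round(b0, 2), round(b1, 2))            # 32.15 1.43
print(round(y7))                             # 387
```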
EXAMPLE: Enzyme kinetics data, where V (mM/min) is the reaction velocity at substrate concentration C (mM), assumed to follow the Michaelis–Menten model V = Vmax C/(Km + C):
C (mM)  V (mM/min)
2       0.24
3       0.28
4       0.33
8       0.40
16      0.45
[Scatter plot of V vs C (0–16 mM) showing saturating growth]
Transformation to Linear form
V = Vmax C / (Km + C)
⟹ 1/V = (Km + C)/(Vmax C) = Km/(Vmax C) + C/(Vmax C)
⟹ 1/V = 1/Vmax + (Km/Vmax)(1/C)
i.e. Y* = b0* + b1* X* (linear form), with Y* = 1/V, b0* = 1/Vmax, b1* = Km/Vmax, X* = 1/C.
[Lineweaver–Burk plot: 1/V vs 1/C (with 1/C and 1/V tabulated alongside C and V), approximately linear over 1/C = 0.0–0.7]
Estimated line: Y* = b0* + b1* X* = 1.998 + 4.266 X*
Estimated Linear Regression Equation:
Y* = b0* + b1* X* = 1.998 + 4.266 X*
R² = 99.287%
Calculation of Vmax and Km
b0* = 1/Vmax ⟹ 1.998 = 1/Vmax ⟹ Vmax = 1/1.998 = 0.5005
b1* = Km/Vmax ⟹ 4.266 = Km/Vmax ⟹ Km = Vmax × 4.266 = 0.5005 × 4.266 = 2.135
Estimated Michaelis–Menten Equation:
V = 0.5005 C / (2.135 + C)
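The back-calculation from the fitted Lineweaver–Burk line is a direct inversion of the two relations above (a minimal sketch using the slide's fitted coefficients):

```python
# Recover Vmax and Km from the fitted Lineweaver-Burk line
# 1/V = 1.998 + 4.266 * (1/C).
b0_star, b1_star = 1.998, 4.266

vmax = 1 / b0_star      # since b0* = 1 / Vmax
km = vmax * b1_star     # since b1* = Km / Vmax

print(round(vmax, 4), round(km, 3))  # 0.5005 2.135
```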
Regression with more than one independent variable
Multiple linear regression describes the dependence of the mean value of the response variable (Y) on given values of two or more independent variables (X).
There are many applications where several explanatory variables affect the dependent variable, for example:
1) Yield of a crop depends upon the fertility of the land, dose of fertilizer applied, quantity of seed, etc.
2) The grade point average of students depends on aptitude, mental ability, hours devoted to study, and the type and nature of grading by teachers.
3) The systolic blood pressure of a person depends upon one's weight, age, etc.
Multiple Linear Regression with two regressors
A real estate agency collects the following data concerning
Y = Sales price of a house (in thousands of dollars)
X1 = Home Size (in hundreds of square feet)
X2 = Rating (an overall "niceness rating" for the house expressed on a scale from 0 (worst) to 10 (best), provided by the real estate agency)
Sales price (Y)  Home Size (X1)  Rating (X2)
120.0            23              5
65.4             11              2
115.4            20              9
91.0             17              3
94.0             15              8
110.6            21              4
129.0            24              7
85.2             13              6
109.0            19              7
115.0            25              2
• STEP1: Identify the type of relationship (functional form) between variables by using a graphical tool (scatter plot)
• STEP2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
• STEP3: Test the overall significance of regression by F-test (ANOVA)
• Test the individual significance of regression by t-test
STEP1: Identify the type of relationship between variables by using a graphical tool (scatter plot)
Population Regression Model
Y = β0 + β1 X1 + β2 X2 + e
where β1 and β2 are the partial regression coefficients, Y is the response, and e is the random error.
Estimation of parameters from sample data
(Don't worry: we will use the computer for calculation.)
b1 = [S²x2 Sx1y − Sx1x2 Sx2y] / [S²x1 S²x2 − (Sx1x2)²]
b2 = [S²x1 Sx2y − Sx1x2 Sx1y] / [S²x1 S²x2 − (Sx1x2)²]
b0 = ȳ − b1 x̄1 − b2 x̄2
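The closed-form two-regressor formulas can be applied directly to the house-price data (a self-contained sketch; the helper names are illustrative):

```python
# Compute b0, b1, b2 for the house-price data with the closed-form
# two-regressor formulas (sums of squares and cross-products of deviations).
y  = [120.0, 65.4, 115.4, 91.0, 94.0, 110.6, 129.0, 85.2, 109.0, 115.0]
x1 = [23, 11, 20, 17, 15, 21, 24, 13, 19, 25]
x2 = [5, 2, 9, 3, 8, 4, 7, 6, 7, 2]

def mean(v):
    return sum(v) / len(v)

def s(a, b):
    # sum of cross-products of deviations from the means
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)

den = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / den
b2 = (s11 * s2y - s12 * s1y) / den
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print(round(b0, 2), round(b1, 2), round(b2, 2))  # 19.56 3.74 2.56
```

These match the computer output on the next slide, which is reassuring given how error-prone the hand calculation is.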
• STEP2: Estimate the relation from sample data by using the Method of Least Squares and interpret it
Computer Output: Estimation by Method of Least Squares
              Coefficients  Standard Error  t Stat  P-value
Intercept     b0 = 19.56    3.26            6.00    0.00
Home Size X1  b1 = 3.74     0.15            24.56   0.00
Rating X2     b2 = 2.56     0.29            8.85    0.00
Price = 19.56 + 3.74 Size + 2.56 Rating
• The value of b1 = 3.74 indicates that the mean sale price is expected to increase by $3,740 (3.74 thousand) with each 100-square-foot increase in house size, keeping the effect of rating constant.
ANOVA
            df  SS       MS       F        Significance F
Regression  2   3277.31  1638.66  350.87*  9.58E-08
Residual    7   32.69    4.67
Total       9   3310.00
STEP 3: Test the individual significance of regression by t-test
Significance of X1
Ho: β1 = 0
H1: β1 ≠ 0
t = (b1 − β1)/SE(b1) = (3.74 − 0)/0.152 = 24.56
Significance of X2
Ho: β2 = 0
H1: β2 ≠ 0
R² = (Explained/Total) × 100 = (3277.31/3310) × 100 = 99%
99% of the variation in house price is explained by the two factors.
[Residual plot: residuals vs predicted price, 90–120]
STEP 5: Use estimated regression equation for prediction of unknown value of dependent variable
Estimate the price of a house with 1500 square feet area with rating of 7 (recall that X1 is in hundreds of square feet, so X1 = 15):
Price = 19.56 + 3.74(15) + 2.56(7) = 93.58, i.e. about $93,580.
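The prediction is a single plug-in of the fitted equation (using the rounded coefficients from the output above):

```python
# Predicted sale price for a 1500-sq-ft house (X1 = 15 hundred sq ft)
# with niceness rating X2 = 7, from the fitted equation.
b0, b1, b2 = 19.56, 3.74, 2.56

price = b0 + b1 * 15 + b2 * 7
print(round(price, 2))  # 93.58 (thousand dollars)
```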
Y = Strength, X = Cotton %
Y (Strength)  X (Cotton %)
10            15
15            20
19            25
17            30
10            35
Scatter plot
[Scatter plot of cotton % (12–37) vs strength (8–20): strength rises to a peak near 25% cotton and then falls, suggesting a quadratic relationship]
Quadratic Regression
Y = β0 + β1 X + β2 X² + e
Setting X1 = X and X2 = X² turns this into a regression that is linear in the coefficients:
Y = β0 + β1 X1 + β2 X2 + e
Computer output
R Square 0.968
Standard Error 1.028
ANOVA
df SS MS F p value
Regression 2 64.69 32.34 30.59 0.03
Residual 2 2.11 1.06
Total 4 66.80
Analysis of Variance (Overall Significance)
Ho: β1 = β2 = 0
H1: At least one is not equal to zero
ANOVA
            df  SS     MS     F      p value
Regression  2   64.69  32.34  30.59  0.03
Residual    2   2.11   1.06
Total       4   66.80
Since the p-value 0.03 < 0.05, the quadratic regression is significant overall.
Significance of Quadratic Regression
Regression Statistics
R Square 0.968
Standard Error 1.028
Coefficient of Determination = 97%
The value of X at which the maximum or minimum of the quadratic regression occurs:
X = −b1/(2b2) = 25.27
and the corresponding extreme value is
b0 − b1²/(4b2) = 18.67
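The vertex formulas can be evaluated from the coefficients in the computer output below; because those printed coefficients are rounded, the result differs slightly from the slide's 25.27 and 18.67:

```python
# Vertex of the fitted quadratic Y = b0 + b1*X + b2*X^2, using the
# coefficients reported in the computer output (-36.086, 4.326, -0.086).
b0, b1, b2 = -36.086, 4.326, -0.086

x_star = -b1 / (2 * b2)           # cotton % giving maximum strength
y_star = b0 - b1 ** 2 / (4 * b2)  # maximum predicted strength

print(round(x_star, 2), round(y_star, 2))  # 25.15 18.32
```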
Linear vs Quadratic Regression
Linear Regression (Y = b0 + b1X)
Regression Statistics
Multiple R 0.077
R Square 0.006
Adjusted R Square −0.325
Standard Error 4.705
Observations 5
ANOVA
            df  SS    MS     F      Significance F
Regression  1   0.4   0.4    0.018  0.902
Residual    3   66.4  22.13
Total       4   66.8
[Residual plot for the linear fit: large, patterned residuals (−6 to +6) against predicted Y]
Quadratic Regression (Y = b0 + b1X + b2X²)
Regression Statistics
R Square 0.968
Adjusted R Square 0.937
Standard Error 1.028
Observations 5
ANOVA
            df  SS     MS     F      Significance F
Regression  2   64.69  32.34  30.59  0.03
Residual    2   2.11   1.06
Total       4   66.80
              Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept     −36.086       6.542           −5.516  0.031    −64.234    −7.937
X1 (Cotton)   4.326         0.553           7.816   0.016    1.945      6.707
X2 (Cotton)²  −0.086        0.011           −7.798  0.016    −0.133     −0.038
[Residual plot for the quadratic fit: small residuals (−1.5 to 0) against predicted Y]