regression lecture notes
Regression: Study of relationship between a dependent variable and one or more independent
variables.
Simple regression: deals with one independent variable
Multiple Regression: deals with more than one independent variable
Example: Law of demand
Simple Regression: Quantity demanded = 𝛽0 + 𝛽1 𝑃𝑟𝑖𝑐𝑒 + 𝑒
Multiple Regression:
Quantity demanded = 𝛽0 + 𝛽1 𝑃𝑟𝑖𝑐𝑒 + 𝛽2 𝑃𝑟𝑖𝑐𝑒 𝑜𝑓 𝑆𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑒 + 𝛽3 𝐶𝑜𝑛𝑠𝑢𝑚𝑒𝑟’𝑠 𝐼𝑛𝑐𝑜𝑚𝑒 + 𝑒
Scatter plot: For simple regression, it gives a rough idea of (i) direction, (ii) strength, and
(iii) linearity of the relationship. In addition, outliers or high leverage points can also be
detected.
[Figure: scatter plot with Experience (years) on the horizontal axis]
Residual: Actual y – Predicted y i.e. e = 𝑦 − 𝑦̂
Least Squares: We want to determine the straight-line equation that describes the relationship
between experience and sales as closely as possible. How can we do this? The answer is the method
of Least Squares. Least Squares Criterion (Weiss p-648): The best-fitting line is the one whose
intercept (𝛽0) and slope (𝛽1) make the sum of squared residuals as small as possible.
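As a rough illustration of the criterion, the sketch below fits a line in R to a small, invented set of experience and sales values (not the data behind these notes) and checks that moving the fitted intercept or slope makes the sum of squared residuals larger. The helper sse() is a hypothetical name introduced here.

```r
# Invented illustration data (NOT the data used in these notes)
experience <- c(1, 3, 4, 6, 8, 9, 11, 13)            # years
sales      <- c(85, 90, 97, 104, 110, 118, 123, 133) # $ thousands

# Least squares fit of sales on experience
fit <- lm(sales ~ experience)
b0 <- unname(coef(fit)[1])  # estimated intercept
b1 <- unname(coef(fit)[2])  # estimated slope

# Hypothetical helper: sum of squared residuals for any candidate line
sse <- function(intercept, slope) {
  sum((sales - (intercept + slope * experience))^2)
}

sse_best    <- sse(b0, b1)        # at the least squares estimates
sse_shifted <- sse(b0 + 1, b1)    # same slope, intercept moved by 1
sse_tilted  <- sse(b0, b1 + 0.5)  # same intercept, slope moved by 0.5

print(c(best = sse_best, shifted = sse_shifted, tilted = sse_tilted))
```

Any other line tried in place of the fitted one should give a larger value than sse_best.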
Using the method of Least Squares, the best fitting line is:
\( \hat{y} = 80 + 4x \), with \( R^2 = 0.93 \)
Intercept (80) Interpretation: A salesperson with no experience is predicted to have sales of
$80 (thousands).
Slope (4) Interpretation: With every one-year increase in experience, sales increase on
average by $4 (thousands).
Prediction: Predict sales for a salesperson who has 9 years of experience:
\( \hat{y} = 80 + 4(9) = 116 \), i.e. $116 (thousands)
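The same arithmetic can be checked in R. This is only a sketch that plugs the estimates quoted above (80 and 4) into the fitted line; the function name predict_sales is an assumption made for illustration.

```r
# Estimates quoted in the notes: intercept 80, slope 4 (sales in $ thousands)
b0 <- 80
b1 <- 4

# Hypothetical helper: predicted sales for a given number of years of experience
predict_sales <- function(years) {
  b0 + b1 * years
}

pred_9  <- predict_sales(9)        # 80 + 4 * 9 = 116, i.e. $116 (thousands)
pred_01 <- predict_sales(c(0, 1))  # 80 and 84: the intercept, and one slope step above it

print(pred_9)
print(pred_01)
```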
[Note: Manual calculation of the slope and intercept from given x, y pairs is not included in the SI
syllabus; it is already covered in ITS]. Students should take care to write interpretations of
coefficients and predictions using the given units, not generic words like a one unit increase
in…
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + ⋯ + 𝛽𝑘 𝑥𝑘 + 𝑒
The parameters of the multiple regression are estimated by minimizing the residual sum of
squares:
\[ S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Statistical software (e.g., R) or Excel can be used to estimate the multiple regression model.
𝑦̂ = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑏3 𝑥3 + ⋯ + 𝑏𝑘 𝑥𝑘
Note: Greek letters (e.g., 𝛽1) are used for parameters and lowercase Latin letters (e.g., 𝑏1) are
used for sample estimates.
Weiss Module A Example A.4, p-A11: Price of Orion car. Here we want to model the price ($)
of a used car of ‘Orion’ brand using age of car and number of miles driven as independent
variables.
Coefficient of determination R2: Proportion of total variation in dependent variable (y) that
is explained by independent variables of the model.
SST = SSR + SSE
(Total sum of squares = Sum of squares due to regression + Sum of squares due to error)
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
For the Orion data R2 = 0.936. This shows that 93.6% of the variation in car prices is explained
by this model through the age of the car and the number of miles driven.
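The decomposition SST = SSR + SSE and the two forms of R2 can be verified numerically. Since the Orion file is not reproduced in these notes, the sketch below uses R's built-in mtcars data purely as a stand-in; the choice of mpg, wt and hp is an assumption made for illustration.

```r
# Stand-in data: R's built-in mtcars (the Orion data are not reproduced here)
model <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg    # observed dependent variable
y_hat <- fitted(model) # fitted values from the regression

SST <- sum((y - mean(y))^2)      # total sum of squares
SSE <- sum((y - y_hat)^2)        # sum of squares due to error
SSR <- sum((y_hat - mean(y))^2)  # sum of squares due to regression

R2_from_SSR <- SSR / SST
R2_from_SSE <- 1 - SSE / SST

print(c(SST = SST, SSR_plus_SSE = SSR + SSE))
print(c(R2_from_SSR, R2_from_SSE, summary(model)$r.squared))
```

Both forms should agree with the R-squared value that summary() reports for the same model.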
Testing Statistical Significance of Regression Coefficients:
For the hypothesis: 𝐻 0 ∶ 𝛽1 = 0 (Age of car is not a useful predictor of car price)
Against : 𝐻 1 ∶ 𝛽1 ≠ 0 (Age of car is a useful predictor of car price)
The test statistic is:
\[ t = \frac{b_1 - \beta_1}{SE(b_1)} \]
Here 𝑏1 is the sample estimate of the parameter and 𝛽1 is the value of the parameter under
the null hypothesis (zero here), so the statistic reduces to 𝑏1 / SE(𝑏1).
The t statistic has a Student's t distribution with n – (k+1) degrees of freedom (n = number of
observations or sample size, k + 1 = number of model parameters including the intercept = 3 here).
The t statistic is –2.45 and the critical values are t(0.025, 11 – 3 = 8 df) = ±2.306. Since
|–2.45| > 2.306, the null hypothesis is rejected and we conclude that the age of the car is indeed
a useful predictor of its price.
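The critical value and decision above can be reproduced with R's t distribution functions. This is a sketch using the figures quoted in the notes (n = 11, k = 2, t = –2.45); because the t statistic is rounded, the computed p-value will differ slightly from the software-reported one.

```r
n      <- 11     # number of observations (from the notes)
k      <- 2      # number of independent variables: age and miles
t_stat <- -2.45  # reported t statistic for age (rounded)
alpha  <- 0.05   # significance level

df     <- n - (k + 1)            # degrees of freedom: 11 - 3 = 8
t_crit <- qt(1 - alpha / 2, df)  # upper two-sided critical value, about 2.306

reject_two_sided <- abs(t_stat) > t_crit      # TRUE means reject H0
p_two_sided      <- 2 * pt(-abs(t_stat), df)  # two-tail p-value from the rounded t

print(c(df = df, t_crit = t_crit))
print(reject_two_sided)
print(p_two_sided)
```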
Ex: Test the hypothesis (at 5% level) that there is a negative relationship between age and
price of car, i.e. test that as age increases, the price of the car decreases.
For the hypothesis: 𝐻0 ∶ 𝛽1 = 0 (No or positive relationship)
Against: 𝐻1 ∶ 𝛽1 < 0 (As age increases, car price decreases)
The t statistic is –2.45 and the critical value is t(0.05, 8 df) = –1.860. Since –2.45 < –1.860,
the null hypothesis is rejected, and we conclude that there is indeed a negative relationship
between the age and the price of the car.
[Alternatively, the p-value of the test = software-reported two-tail p-value / 2 = 0.03974/2 =
0.01987 < 0.05 (halving is valid here because the estimated coefficient is negative, as the
alternative states). Hence the null hypothesis is rejected in favor of the alternative at the 5%
significance level]. There is indeed a negative relationship between age and car price.
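The one-tailed version of the test can be sketched the same way. The values below are those quoted in the notes; the variable names are assumptions made for illustration.

```r
t_stat <- -2.45  # reported t statistic for age (rounded)
df     <- 8      # 11 - 3 degrees of freedom
alpha  <- 0.05   # significance level

# Lower-tail critical value for H1: beta1 < 0, about -1.860
t_crit_lower <- qt(alpha, df)
reject_lower <- t_stat < t_crit_lower  # TRUE means reject H0

# One-tail p-value obtained by halving the reported two-tail p-value
p_two_tail <- 0.03974
p_one_tail <- p_two_tail / 2           # 0.01987

# One-tail p-value computed directly from the rounded t statistic
p_from_t <- pt(t_stat, df)

print(t_crit_lower)
print(reject_lower)
print(c(halved = p_one_tail, from_rounded_t = p_from_t))
```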
Prediction from the model: Suppose we want to predict the price of an Orion which is 4 years
old and has already been driven 50,000 miles.
𝑦̂ = 18303.5 − 950.4 (4) − 0.0821 (50000) = $10,396.9 (please state the unit of y, not just a
number)
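The prediction can be checked by plugging the quoted coefficients into R. This sketch does not need the data file; the vector names are assumptions made for illustration.

```r
# Estimated coefficients quoted in the notes
b <- c(intercept = 18303.5, age = -950.4, miles = -0.0821)

# Predictor values for the car of interest (the leading 1 multiplies the intercept)
x <- c(1, 4, 50000)  # 4 years old, 50,000 miles

y_hat <- sum(b * x)  # 18303.5 - 950.4 * 4 - 0.0821 * 50000
print(y_hat)         # predicted price in dollars

# If a fitted lm object for the Orion data is available (called model1 here by assumption),
# the same prediction, using unrounded coefficients, would be obtained with:
# predict(model1, newdata = data.frame(age = 4, miles = 50000))
```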
Input the y range and x range (the x variables must be in adjacent columns). Tick the Labels box
if the row of variable names is also selected (including it is recommended).
Note: The Analysis ToolPak must be installed in Excel. If it is not, do this within Excel:
File > Options > Add-Ins > Analysis ToolPak > Go > tick Analysis ToolPak > OK
The Analysis ToolPak then appears as ‘Data Analysis’ in the Data tab.
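# Multiple regression in R
orion <- read.csv(file.choose())  # choose orion1.csv; the file must be saved in CSV format
attach(orion)                     # optional: lm() below already uses data = orion
head(orion)                       # inspect the first few rows
model1 <- lm(price ~ age + miles, data = orion)  # regress price on age and miles
summary(model1)                   # coefficients, standard errors, t tests, R-squared, F test
# present the coefficient table with 5 decimals (avoids scientific notation)
round(summary(model1)$coefficients, 5)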
\[ F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)} \]
This F test has k and n – (k+1) degrees of freedom and is reported by Excel (in the ANOVA
section) and by all statistical software. [Note: manual calculation of the F statistic using the
above formula is not required].
For the Orion case, F = 58.61 with p-value = 0.0000167 < 0.05, so the null hypothesis is rejected,
and we conclude that at least one variable (age or miles or both) has a significant impact on price.
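The reported p-value can be reproduced from the F distribution, and the F statistic can be roughly recovered from R2, because dividing SSR and SSE by SST turns the formula into (R2/k) / ((1 – R2)/(n – k – 1)). The sketch below uses the figures quoted in the notes; the small gap between 58.5 and 58.61 comes from rounding R2 to 0.936.

```r
n      <- 11     # number of observations
k      <- 2      # number of independent variables (age, miles)
F_stat <- 58.61  # F statistic reported for the Orion model
R2     <- 0.936  # reported coefficient of determination (rounded)

# Upper-tail p-value of the overall F test with k and n - k - 1 degrees of freedom
p_value <- pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# F statistic implied by the rounded R2
F_from_R2 <- (R2 / k) / ((1 - R2) / (n - k - 1))

print(p_value)
print(F_from_R2)
```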
Note: In any multiple regression problem, students are expected to be able to do the following.
These aspects can be tested in exams.
(1) Estimate parameters from software. Alternatively, software output or estimated eq can be
provided in textbook form along with relevant statistics e.g., standard error, R sq etc.
(2) Interpret each coefficient in practical terms. (take care of units of measurement)
(3) Interpret R sq in practical terms. (i.e. in percent variation explained)
(4) Make prediction of y given relevant predictor values.
(5) Test hypothesis on individual coefficients.
The overall F test (of the hypothesis that all parameters except the intercept are zero) and its
p-value can also be asked.
Interpretation:
200: Average house price in the South Zone is $200,000
50: House price in East Zone is on average $50,000 higher than in the South Zone.
-75: House price in West Zone is on average $75,000 lower than in the South Zone.
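These interpretations follow from switching the zone indicators on and off. The sketch below assumes a model with an intercept of 200 and 0/1 indicators for East and West (South as the base zone), with price in $1000; the indicator and function names are assumptions, since the estimated equation itself is not shown here.

```r
# Coefficients from the interpretation above (price in $1000, South Zone as base)
b0     <- 200  # average price in the South Zone
b_east <- 50   # East Zone difference from South
b_west <- -75  # West Zone difference from South

# Hypothetical helper: average price given 0/1 zone indicators
zone_price <- function(east, west) {
  b0 + b_east * east + b_west * west
}

price_south <- zone_price(east = 0, west = 0)  # both indicators off: base zone
price_east  <- zone_price(east = 1, west = 0)
price_west  <- zone_price(east = 0, west = 1)

print(c(South = price_south, East = price_east, West = price_west))
```

Each indicator coefficient is therefore the gap between that zone and the base zone, not the zone's own average price.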
Anderson Example pdf p-786,
Some further exercises from Anderson:
Ex 4 pdf p-769, Ex 5 pdf p-769 (check Rev = 83.23 + 2.29 TVAd + 1.30 NPAd),
Ex 14 pdf p-775 (only parts d and f), Ex 34 pdf p-791, Ex 38 pdf p-793
Estimate the regression model, write down the estimated eq, interpret the coefficients. Predict
price of Nikon camera of 14 megapixels with a weight of 6 oz and score of 55. Interpret Rsq.
Test the hypothesis (at 5%) that the average price of Canon is significantly less than Nikon.
Ex2: Consider the data of sales prices of 176 houses to be explained by value of land, value
of improvement (all three variables in $1000) and the city area where the house is located.
(CHEVAL is the base area). The estimated regression is as follows.
\[ \widehat{Sales} = -16.93 + 1.594\,Land + 1.301\,Imp - 82.97\,DAVISISLES + 10.187\,HUNTERSGREE - 47.28\,HYDEPARK \]
Interpret each coefficient. Predict the price of a house located in Cheval that has value of land
and improvement as 100 and 200 (thousands of dollars). Test the hypothesis (at 5%) that
average prices in the Hydepark area are significantly less than in the Cheval area.