regression lecture notes

The document provides an overview of regression analysis, including simple and multiple regression, and their applications in predicting dependent variables based on independent variables. It explains the method of least squares for estimating parameters, the interpretation of coefficients, and the significance testing of regression coefficients. Additionally, it discusses the use of qualitative variables in regression models and provides examples of how to conduct regression analysis using statistical software like Excel and R.

Uploaded by

Ammar Shaikh
Copyright: © All Rights Reserved

Regression Analysis -1

Regression: The study of the relationship between a dependent variable and one or more
independent variables.
Simple regression: deals with one independent variable.
Multiple regression: deals with more than one independent variable.
Example: Law of demand
Simple regression: Quantity demanded = 𝛽0 + 𝛽1 Price + 𝑒
Multiple regression:
Quantity demanded = 𝛽0 + 𝛽1 Price + 𝛽2 Price of Substitute + 𝛽3 Consumer's Income + 𝑒

Scatter plot: For simple regression, it gives a rough idea of (i) direction, (ii) strength, and
(iii) linearity of the relationship. In addition, outliers or high leverage points can also be
detected.

Consider the data in Anderson Ex 7, pdf p-694:

[Figure: Scatter plot of Experience and Sales. x-axis: Experience (years), 0 to 14; y-axis: Sales, 0 to 160.]

Residual: Actual y – Predicted y i.e. e = 𝑦 − 𝑦̂

Least Squares: We want to determine the straight-line equation that describes the relationship
between experience and sales as closely as possible. How can we do this? The answer is the
method of least squares. Least-squares criterion (Weiss p-648): The best-fitting line is the one
whose intercept (𝛽0) and slope (𝛽1) make the sum of squared residuals as small as possible.
Using the method of least squares, the best-fitting line is:

𝑦̂ = 80 + 4𝑥,  R² = 0.93
Intercept (80) interpretation: For a salesperson with no experience, the predicted sales are $80
(thousands).
Slope (4) interpretation: With every one-year increase in experience, sales increase on
average by $4 (thousands).
Prediction: Predict sales for a salesperson who has 9 years of experience:
𝑦̂ = 80 + 4(9) = $116 (thousands)
[Note: Manual calculation of the slope and intercept from given (x, y) pairs is not included in
the SI syllabus; it is already covered in ITS.] Students should write interpretations of
coefficients and predictions using the given units (e.g., years, $ thousands), not generic
phrases like "a one unit increase in…".
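
The prediction step can be sketched in R as follows. This is a minimal illustration that only uses the fitted coefficients reported in the notes (80 and 4); the helper name predict_sales is hypothetical, and the underlying Anderson data are not reproduced here.

```r
# Fitted line reported in the notes: y-hat = 80 + 4x (sales in $ thousands)
b0 <- 80   # intercept
b1 <- 4    # slope: change in sales ($ thousands) per extra year of experience

# Hypothetical helper: predicted sales for a given number of years of experience
predict_sales <- function(years) b0 + b1 * years

predict_sales(9)         # salesperson with 9 years of experience
predict_sales(c(0, 1))   # the intercept, and the intercept plus one slope step
```

The second call shows the two interpretations side by side: the value at 0 years is the intercept, and moving from 0 to 1 year changes the prediction by exactly the slope.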

Regression Analysis: Multiple Regression


In real life, a response variable (y) is affected not by one but many independent variables.
Thus a more flexible model is the multiple regression model given by:

𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + ⋯ + 𝛽𝑘 𝑥𝑘 + 𝑒
The parameters of the multiple regression are estimated by minimizing the residual sum of
squares:
𝑆 = ∑ (𝑦𝑖 − 𝑦̂𝑖)², where the sum runs over the observations i = 1, …, n.

Statistical software (e.g., R) or Excel can be used to estimate the multiple regression model.
𝑦̂ = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2 + 𝑏3 𝑥3 + ⋯ + 𝑏𝑘 𝑥𝑘
Note: Greek letters (e.g., 𝛽1) denote population parameters, and lowercase Roman letters
(e.g., 𝑏1) denote their sample estimates.
Weiss Module A Example A.4, p-A11: Price of Orion car. Here we want to model the price ($)
of a used car of ‘Orion’ brand using age of car and number of miles driven as independent
variables.

Car   Age (years)   Miles    Price ($)
1     5             57000    8500
2     4             40000    10300
3     6             77000    7000
4     5             60000    8200
5     5             49000    8900
6     5             47000    9800
7     6             58000    6600
8     6             39000    9500
9     2             8000     16900
10    7             69000    7000
11    7             89000    4800
The Excel software output for this regression is as follows:
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.967530
R Square             0.936115
Adjusted R Square    0.920144
Standard Error       880.505444
Observations         11

ANOVA
              Df    SS             MS          F          Significance F
Regression    2     90883135.85    45441568    58.61236   1.67E-05
Residual      8     6202318.693    775289.8
Total         10    97085454.55

              Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept     18303.5208     1134.76186       16.12983    2.19E-07   15686.76    20920.29
Age(years)    -950.4270      387.4188755      -2.45323    0.039736   -1843.82    -57.0375
Miles         -0.0821        0.025520666      -3.21889    0.01226    -0.141      -0.0233
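
As a cross-check, the same fit can be sketched in R directly from the Orion table, without a CSV file. The data frame name orion and the column names age, miles and price are assumptions chosen for this illustration; the values are transcribed from the table in these notes.

```r
# Orion data transcribed from the table in these notes
orion <- data.frame(
  age   = c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7),
  miles = c(57000, 40000, 77000, 60000, 49000, 47000,
            58000, 39000, 8000, 69000, 89000),
  price = c(8500, 10300, 7000, 8200, 8900, 9800,
            6600, 9500, 16900, 7000, 4800)
)

# Least-squares fit of price on age and miles
fit <- lm(price ~ age + miles, data = orion)

round(coef(fit), 4)                    # compare with the Coefficients column
round(summary(fit)$r.squared, 6)       # compare with R Square
round(confint(fit, level = 0.95), 4)   # compare with Lower 95% / Upper 95%
```

Entering the data inline keeps the example self-contained; with a real file, read.csv() would replace the data.frame() call.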

The estimated model in equation form is:


𝑦̂ = 18303.5 − 950.4 𝑥1 − 0.0821𝑥2

where y = price of car ($), 𝑥1 = age of car (years), 𝑥2 = miles driven.


Interpretation of Parameter Estimates:
Intercept 𝑏0 = 18303.5: For an Orion car that is brand new (age 0 years) and has been driven
0 miles, the estimated price is $18,303.5. [Note: In many cases the intercept is not meaningful
to interpret, since zero values of the x variables are not sensible.]
Coefficient of age 𝑏1 = −950.4: Keeping the number of miles driven fixed, if the age of the car
increases by one year, its price decreases by $950.4 on average.
Coefficient of miles driven 𝑏2 = −0.0821: Keeping the age of the car fixed, if the number of miles
driven increases by one, the car's price decreases by $0.082 on average (which corresponds to a
decrease of $82.1 for every 1000 miles driven).

Coefficient of determination R2: Proportion of total variation in dependent variable (y) that
is explained by independent variables of the model.
SST = SSR + SSE
(Total sum of square = Sum of square due to regression + sum of square due to error)
R² = SSR / SST = 1 − SSE / SST
For the Orion data R² = 0.936. This shows that 93.6% of the variation in car prices is explained
by the age of the car and the number of miles driven through this model.
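
The R² value can be checked from the sums of squares in the ANOVA section of the Excel output. The sketch below also recomputes the "Adjusted R Square" line using the standard adjustment formula, which the notes report but do not define, so treat that part as an added illustration.

```r
# Sums of squares copied from the ANOVA section of the Excel output
SSR <- 90883135.85   # regression
SSE <- 6202318.693   # residual (error)
SST <- 97085454.55   # total

r2_from_ssr <- SSR / SST
r2_from_sse <- 1 - SSE / SST
round(c(r2_from_ssr, r2_from_sse), 6)   # both forms give the same R-squared

# Adjusted R-squared: n = 11 observations, k = 2 predictors
n <- 11; k <- 2
adj_r2 <- 1 - (1 - r2_from_sse) * (n - 1) / (n - k - 1)
round(adj_r2, 6)
```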
Testing Statistical Significance of Regression Coefficients:
For the hypothesis 𝐻0: 𝛽1 = 0 (age of car is not a useful predictor of car price)
against 𝐻1: 𝛽1 ≠ 0 (age of car is a useful predictor of car price),
the test statistic is:
𝑡 = (𝑏1 − 𝛽1) / SE(𝑏1)

Here 𝑏1 is the sample estimate of the parameter and 𝛽1 is the value of the parameter under
the null hypothesis (here 0).
The t statistic has a Student's t distribution with n − (k+1) degrees of freedom (n = number of
observations or sample size; k+1 = number of model parameters including the intercept, which is 3 here).
The t statistic is −950.4270 / 387.4189 = −2.45, and the critical values are t(0.025, 11−3 = 8 df) =
±2.306. Since |−2.45| > 2.306, the null hypothesis is rejected at the 5% level, and we
conclude that the age of the car is indeed a useful predictor of its price.
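
The two-sided test can be reproduced in R from the coefficient and standard error in the Excel output. This is a sketch using base R's qt() and pt(); the variable names are chosen for illustration.

```r
b1    <- -950.4270       # estimated coefficient of age (Excel output)
se_b1 <- 387.4188755     # its standard error
n <- 11; k <- 2
df <- n - (k + 1)        # 8 degrees of freedom

t_stat  <- (b1 - 0) / se_b1            # hypothesized value under H0 is 0
t_crit  <- qt(1 - 0.05 / 2, df)        # two-sided 5% critical value
p_value <- 2 * pt(-abs(t_stat), df)    # two-sided p-value

round(c(t = t_stat, crit = t_crit, p = p_value), 5)
abs(t_stat) > t_crit                   # TRUE means reject H0 at the 5% level
```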
Ex: Test the hypothesis (at the 5% level) that there is a negative relationship between age and
price of car, i.e. test that as age increases, the price of the car decreases.
For the hypothesis 𝐻0: 𝛽1 ≥ 0 (no or positive relationship)
against 𝐻1: 𝛽1 < 0 (as age increases, car price decreases),
the t statistic is −2.45 and the critical value is t(0.05, 8 df) = −1.860. Since −2.45 < −1.860,
the null hypothesis is rejected, and we conclude that there is indeed a negative relationship
between the age and price of a car.
[Alternatively, because the estimate has the sign stated in 𝐻1, the p-value of the test =
software-reported two-tail p-value / 2 = 0.03974/2 = 0.01987 < 0.05. Hence the null hypothesis
is rejected in favor of the alternative at the 5% significance level.]
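
A short R sketch of the one-sided version, starting from the reported t statistic; it shows the lower-tail critical value and that the one-sided p-value is half of the two-tail value reported by the software.

```r
t_stat <- -2.45323    # t statistic for age from the Excel output
df     <- 8           # n - (k + 1) = 11 - 3

t_crit_lower <- qt(0.05, df)     # lower-tail 5% critical value
p_one_sided  <- pt(t_stat, df)   # P(T <= t) when beta1 = 0

round(c(crit = t_crit_lower, p = p_one_sided), 5)
t_stat < t_crit_lower            # TRUE means reject H0 in favor of beta1 < 0
```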
Prediction from the model: Suppose we want to predict the price of an Orion that is 4 years
old and has already been driven 50,000 miles.
𝑦̂ = 18303.5 − 950.4 (4) − 0.0821 (50000) = $10,396.9 (please state the unit of y, not just a
number)
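
The same prediction as a small R sketch, using the rounded coefficients from the estimated equation (with a fitted lm object, predict() would serve the same purpose at full precision).

```r
# Rounded estimates from the equation in the notes
b <- c(intercept = 18303.5, age = -950.4, miles = -0.0821)

# 1 for the intercept, then age (years) and miles driven
new_car <- c(1, 4, 50000)

price_hat <- sum(b * new_car)
price_hat   # predicted price in $
```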

Estimation of multiple regression in Excel:


Go to Data tab > Data Analysis > Regression

Input the Y range and X range (the x variables must be in adjacent columns). Tick "Labels" if
the row of variable names is included in the selection (including it is suggested).
Note: The Analysis ToolPak add-in must be enabled in Excel. If it is not, do this within Excel:
File > Options > Add-Ins > Go > tick Analysis ToolPak > OK
The 'Data Analysis' command then appears in the Data tab.
# Multiple regression in R
orion <- read.csv(file.choose())   # choose orion1.csv; the file must be saved in CSV format
head(orion)                        # inspect the first rows and the column names
model1 <- lm(price ~ age + miles, data = orion)
summary(model1)
# Present the coefficient table with 5 decimals (avoids scientific notation)
round(summary(model1)$coefficients, 5)

F Test for overall significance in multiple regression:


𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0  vs  𝐻1: at least one 𝛽𝑖 ≠ 0 (i = 1, 2, …, k)

𝐹 = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))
This F statistic has k and n − (k+1) degrees of freedom and is reported by Excel (in the ANOVA
section) and by all statistical software. [Note: manual calculation of the F test using the
above formula is not required.]
For the Orion case, F = 58.61 with p-value = 0.0000167, so the null hypothesis is rejected, and
we conclude that at least one variable (age or miles or both) has a significant impact on price.
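
Although the manual calculation is not required, a short R sketch shows how the reported F statistic and its p-value follow from the ANOVA sums of squares. pf() is base R's F distribution function; the variable names are illustrative.

```r
# Sums of squares from the ANOVA section of the Excel output
SSR <- 90883135.85
SSE <- 6202318.693
n <- 11; k <- 2

MSR <- SSR / k              # mean square due to regression
MSE <- SSE / (n - k - 1)    # mean square error
F_stat <- MSR / MSE

# Upper-tail probability of F with (k, n - k - 1) degrees of freedom
p_value <- pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(F = F_stat, p = p_value)
```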
Note: In any multiple regression problem, students are expected to be able to do the
following. These aspects can be tested in exams.

(1) Estimate parameters from software. Alternatively, software output or the estimated equation
can be provided in textbook form along with relevant statistics, e.g., standard errors, R² etc.
(2) Interpret each coefficient in practical terms (take care of units of measurement).
(3) Interpret R² in practical terms (i.e. as percent of variation explained).
(4) Make a prediction of y given relevant predictor values.
(5) Test hypotheses on individual coefficients.
The overall F test (of the hypothesis that all parameters except the intercept are zero) and its
p-value can also be asked.

Anderson Ex 4, 5, pdf p-769:


[Ex 5: Estimated eq: Predicted Revenue = 83.23 + 2.29 TV + 1.30 Newspaper]

Use of Qualitative /Categorical Independent Variables:


(i) Qualitative X variable (of Binary Category)
How can you measure the effect of gender on wage, i.e. are wages for males and females the
same on average? Here we want to explain y = wage with the help of x = gender, coded as a
dummy variable: Female = 1 if the person is female, 0 if male.
Example: Let the estimated regression be: Predicted Wage (Rs.) = 20,000 − 3500 Female
The intercept: The average wage for a male is Rs. 20,000.
The slope: The average wage of a female is Rs. 3500 lower than that of a male.
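
A minimal R sketch of how the dummy variable works in this equation; the helper name wage_hat is hypothetical, and the coefficients are the illustrative ones given above.

```r
# Estimated equation from the notes: predicted wage (Rs.) = 20000 - 3500 * Female
wage_hat <- function(female) 20000 - 3500 * female   # female: 1 = female, 0 = male

wage_hat(0)                 # average wage for a male (the intercept)
wage_hat(1)                 # average wage for a female
wage_hat(1) - wage_hat(0)   # the difference equals the dummy coefficient
```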
(ii) Qualitative X variable with k categories. (We use k − 1 dummy variables, one fewer than
the number of categories, to avoid perfect collinearity.)
Consider house prices ($1000) in three city zones: East, West and South.
We code and include any two dummy variables, e.g.
East = 1 if the house is in the East Zone, 0 otherwise
West = 1 if the house is in the West Zone, 0 otherwise
keeping the South Zone as the reference. The estimated model may look like:
Predicted Price = 200 + 50 East − 75 West

Interpretation:
200: Average house price in the South Zone is $200,000
50: House price in East Zone is on average $50,000 higher than in the South Zone.
-75: House price in West Zone is on average $75,000 lower than in the South Zone.
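
In R the k − 1 dummy columns are created automatically when a factor is used as a predictor. The sketch below uses model.matrix() to display them with South as the reference level, and then applies the illustrative coefficients from the equation above; no house data are fitted here.

```r
# A factor with South as the reference (first) level
zone <- factor(c("South", "East", "West"), levels = c("South", "East", "West"))

# model.matrix() builds the k - 1 = 2 dummy columns that lm() would use
X <- model.matrix(~ zone)
X

# Coefficients from the illustrative equation in the notes ($1000)
b <- c(200, 50, -75)          # intercept, East, West
price_hat <- drop(X %*% b)    # predicted price for one house in each zone
names(price_hat) <- as.character(zone)
price_hat
```

With real data, relevel() or the levels argument of factor() controls which category serves as the reference.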
Anderson Example, pdf p-786.
Some further exercises from Anderson:
Ex 4 pdf p-769, Ex 5 pdf p-769 (check: Revenue = 83.23 + 2.29 TVAd + 1.30 NPAd)
Ex 14 pdf p-775 (only parts d and f), Ex 34 pdf p-791, Ex 38 pdf p-793

Some further exercises (especially to illustrate dummy x variables)


Ex1: Consider a sample of Canon and Nikon cameras whose prices ($) are to be explained by the
number of megapixels, the weight (oz.), an overall score (ranging from 0 to 100) and the brand.

Observation   Brand   Price_$   Megapixels   Weight_oz   Score   Brand dummy (1 = Canon, 0 = Nikon)
1             Canon   330       10           7           66      1
2             Canon   200       12           5           66      1
3             Canon   300       12           7           65      1
4             Canon   200       10           6           62      1
5             Canon   180       12           5           62      1
6             Canon   200       12           7           61      1
7             Canon   200       14           5           60      1
8             Canon   130       10           7           60      1
9             Canon   130       12           5           59      1
10            Canon   110       16           5           55      1
11            Canon   90        14           5           52      1
12            Canon   100       10           6           51      1
13            Canon   90        12           7           46      1
14            Nikon   270       16           5           65      0
15            Nikon   300       16           7           63      0
16            Nikon   200       14           6           61      0
17            Nikon   400       14           7           59      0
18            Nikon   120       14           5           57      0
19            Nikon   170       16           6           56      0
20            Nikon   150       12           5           56      0
21            Nikon   230       14           6           55      0
22            Nikon   180       12           6           53      0
23            Nikon   130       12           6           53      0
24            Nikon   80        12           7           52      0
25            Nikon   80        14           7           50      0
26            Nikon   100       12           4           46      0
27            Nikon   110       12           5           45      0
28            Nikon   130       14           4           42      0

Estimate the regression model, write down the estimated equation, and interpret the coefficients.
Predict the price of a Nikon camera with 14 megapixels, a weight of 6 oz and a score of 55.
Interpret R². Test the hypothesis (at 5%) that the average price of Canon cameras is
significantly lower than that of Nikon cameras.
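
One possible way to set up Ex1 in R is sketched below. The data are transcribed from the table above; the data frame name cameras and the column names are assumptions made for this illustration, and the numerical results are left for the student to read off the output.

```r
# Camera data transcribed from the table (canon: 1 = Canon, 0 = Nikon)
cameras <- data.frame(
  price      = c(330, 200, 300, 200, 180, 200, 200, 130, 130, 110, 90, 100, 90,
                 270, 300, 200, 400, 120, 170, 150, 230, 180, 130, 80, 80, 100,
                 110, 130),
  megapixels = c(10, 12, 12, 10, 12, 12, 14, 10, 12, 16, 14, 10, 12,
                 16, 16, 14, 14, 14, 16, 12, 14, 12, 12, 12, 14, 12, 12, 14),
  weight     = c(7, 5, 7, 6, 5, 7, 5, 7, 5, 5, 5, 6, 7,
                 5, 7, 6, 7, 5, 6, 5, 6, 6, 6, 7, 7, 4, 5, 4),
  score      = c(66, 66, 65, 62, 62, 61, 60, 60, 59, 55, 52, 51, 46,
                 65, 63, 61, 59, 57, 56, 56, 55, 53, 53, 52, 50, 46, 45, 42),
  canon      = rep(c(1, 0), times = c(13, 15))
)

fit <- lm(price ~ megapixels + weight + score + canon, data = cameras)
coefs <- summary(fit)$coefficients
round(coefs, 5)                  # estimates, standard errors, t values, p-values
summary(fit)$r.squared           # R-squared

# Predicted price ($) of a Nikon with 14 megapixels, 6 oz and score 55
new_cam <- data.frame(megapixels = 14, weight = 6, score = 55, canon = 0)
predict(fit, newdata = new_cam)

# One-sided test of H1: coefficient of canon < 0 (Canon cheaper than Nikon)
t_canon <- coefs["canon", "t value"]
p_lower <- pt(t_canon, df = fit$df.residual)   # lower-tail p-value
p_lower
```

The lower-tail p-value is computed directly with pt() rather than by halving the two-tail value, so it remains valid whichever sign the estimated coefficient turns out to have.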
Ex2: Consider the sale prices of 176 houses, to be explained by the value of land, the value
of improvements (all three variables in $1000) and the city area where the house is located
(CHEVAL is the base area). The estimated regression, with standard errors (SE) listed in the
order of the coefficients, is as follows.
Interpret each coefficient. Predict the price of a house located in Cheval whose land and
improvement values are 100 and 200 (thousands of dollars). Test the hypothesis (at 5%) that
average prices in the Hydepark area are significantly lower than in the Cheval area.
Predicted Sales = −16.93 + 1.594 Land + 1.301 Imp − 82.97 DAVISISLES + 10.187 HUNTERSGREE −
47.28 HYDEPARK

SE: 20.33 (intercept), 0.091 (Land), 0.0468 (Imp), 32.536 (DAVISISLES), 22.731 (HUNTERSGREE),
28.396 (HYDEPARK)
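
The mechanics of Ex2 can be sketched in R from the reported estimates alone. The sketch assumes that the six standard errors are listed in the same order as the six coefficients, and that the residual degrees of freedom are n minus the number of estimated coefficients; the vector names are illustrative.

```r
# Reported estimates and standard errors (order assumed to match)
b  <- c(intercept = -16.93, land = 1.594, imp = 1.301,
        davisisles = -82.97, huntersgree = 10.187, hydepark = -47.28)
se <- c(intercept = 20.33, land = 0.091, imp = 0.0468,
        davisisles = 32.536, huntersgree = 22.731, hydepark = 28.396)

# House in Cheval (all area dummies are 0), land = 100, improvement = 200 ($1000)
x_new <- c(1, 100, 200, 0, 0, 0)
price_hat <- sum(b * x_new)
price_hat                        # predicted sale price in $1000

# One-sided test of H1: coefficient of HYDEPARK < 0
n  <- 176
df <- n - length(b)              # 176 - 6 = 170
t_hyde <- b[["hydepark"]] / se[["hydepark"]]
t_crit <- qt(0.05, df)           # lower-tail 5% critical value
c(t = t_hyde, crit = t_crit)
t_hyde < t_crit                  # TRUE means reject H0 at the 5% level
```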
