
Week 03 Regression


22-08-2024

TOD 533
Correlation, Introduction to Regression
Amit Das
TODS / AMSOM / AU
amit.das@ahduni.edu.in

Association between interval variables


• Do two interval variables “move together”?
• When one takes on “high” values (relative to its mean),
what does the other do?
• Pearson correlation coefficient

$$ r = \frac{\sum Z_x Z_y}{N}, \qquad -1 \le r \le 1 $$
• When high (low) z-scores of the two variables co-occur, the
correlation coefficient is larger


Computing the correlation coefficient


Student     Task 1      Task 1    Task 2      Task 2    Product of
            raw score   z-score   raw score   z-score   z-scores
1           42          +1.78     90          +1.21     +2.15
2           9           -1.04     40          -1.65     +1.72
3           28          +0.58     92          +1.33     +0.77
4           11          -0.87     50          -1.08     +0.94
5           8           -1.13     49          -1.13     +1.28
6           15          -0.53     63          -0.33     +0.17
7           14          -0.62     68          -0.05     +0.03
8           25          +0.33     75          +0.35     +0.12
9           40          +1.61     89          +1.16     +1.87
10          20          -0.10     72          +0.18     -0.02
SUM         212         0         688         0         +9.03
MEAN        21.2        0         68.8        0         +0.903
STD. DEV.   11.69       1         17.47       1
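
The table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the z-scores use the population standard deviation (dividing by N), which is what the 11.69 and 17.47 in the table correspond to. The function names are illustrative and not part of the slides.

```python
from math import sqrt

# Raw scores from the table above
task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]

def z_scores(values):
    """Standardize values using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson r as the mean of the products of paired z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

r = pearson_r(task1, task2)
print(round(r, 3))  # 0.903
```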

Eyeballing correlation


Statistical significance of r
• Null hypothesis: the population correlation is zero (ρ = 0)
• Compute the test statistic
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
• Compare against t-distribution with df = n-2

• For r = 0.903 with n = 10,
• test statistic = 5.94, compared against the t-distribution with 8 df
• p-value (2-tailed) = 0.0003 << 0.05
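
As a rough check of these numbers, the sketch below computes the test statistic and approximates the two-tailed p-value by numerically integrating the t density with Simpson's rule. The helper names and the choice of numerical integration are assumptions made for illustration; a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt

def t_statistic(r, n):
    """Test statistic for the null hypothesis of zero correlation."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """1 minus the area between -|t| and +|t|, via Simpson's rule (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t

t = t_statistic(0.903, 10)
p = two_tailed_p(t, df=8)
print(round(t, 2))  # 5.94
print(p < 0.05)     # True
```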

Correlation and sample size


• Significance of r depends on sample size
• for larger n, smaller value of r might be significant

Value of r required to reach statistical significance at …

Sample size   10% (two-tailed)   5% (two-tailed)
12            0.497              0.576
22            0.360              0.423
32            0.296              0.349
42            0.257              0.304
52            0.231              0.273
102           0.164              0.195

• for very large n, a very small r might be significant


• statistical vs. managerial significance
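
The thresholds in the table follow from inverting the test statistic: if t* is the critical t-value for df = n-2, the smallest significant correlation is r* = t* / sqrt(t*² + df). The sketch below applies that relation. The critical t-values are standard t-table entries and are an assumption here, since the slides do not list them.

```python
from math import sqrt

# Two-tailed critical t-values from a standard t-table (not from the slides):
# df -> (10% level, 5% level)
CRITICAL_T = {
    10: (1.812, 2.228),
    20: (1.725, 2.086),
    30: (1.697, 2.042),
    40: (1.684, 2.021),
    50: (1.676, 2.009),
    100: (1.660, 1.984),
}

def critical_r(t_crit, df):
    """Smallest |r| that reaches significance, given the critical t for df = n - 2."""
    return t_crit / sqrt(t_crit ** 2 + df)

for df, (t10, t05) in CRITICAL_T.items():
    n = df + 2
    print(n, round(critical_r(t10, df), 3), round(critical_r(t05, df), 3))
```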


Association between ordinal variables


• The Spearman rank correlation coefficient

$$ r_s = 1 - \frac{6 \sum d^2}{n\,(n^2 - 1)} $$
• where d is the difference in the ranks of a given individual for the two
variables
• suitable for ordinal data
• less affected than Pearson r by outliers

Rank Correlation example


Student   Task 1      Rank 1   Task 2      Rank 2   (Difference
          raw score            raw score            in ranks)²
1         42          1        90          2        1
2         9           9        40          10       1
3         28          3        92          1        4
4         11          8        50          8        0
5         8           10       49          9        1
6         15          6        63          7        1
7         14          7        68          6        1
8         25          4        75          4        0
9         40          2        89          3        1
10        20          5        72          5        0

• Spearman rank correlation coefficient

$$ r_s = 1 - \frac{6 \times 10}{10 \times (100 - 1)} \approx 0.94 \;(94\%) $$
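
The same calculation can be sketched in Python. The ranking helper is illustrative and assumes there are no tied scores, which holds for this example; tied data would need average ranks.

```python
# Raw scores from the rank correlation example
task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]

def ranks_descending(values):
    """Rank 1 = highest score. Assumes no ties, as in the example data."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman rank correlation from the sum of squared rank differences."""
    rx, ry = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rs = spearman_rs(task1, task2)
print(round(rs, 2))  # 0.94
```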


Correlation and regression …1


• Earlier, we examined whether two interval-scaled variables are
associated (“move together”) using the correlation coefficient
$$ -1 \le r \le +1 $$
• linear regression frames the same question in a slightly different form
• by modeling the dependent variable Y as a linear function of the independent
variable X
Y = a + bX

The linear regression model


[Figure: scatterplot of price in dollars (vertical axis) against area in square feet (horizontal axis), with a fitted line showing intercept a and slope b = p/q]

Relation of apartment prices to floor area (hypothetical)


The best-fit regression line


• More than one line can be passed through the cloud (“scatterplot”) of
Y on X
• each line denotes a combination of a and b
• For each line
• for each data point compute error = Yobs – Ypred
• square the errors and add them up: Σe²
• The best-fit (least-squares) regression line

Y = A + BX (note A, B in caps) minimizes Σe²

Solution to minimization problem

• For the mathematically inclined, here’s how A and B (optimum values
of a and b) may be calculated:

$$ B = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} $$

$$ A = \bar{Y} - B\bar{X} $$
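
The two formulas translate directly into code. As an illustration only, the sketch below regresses the Task 2 scores on the Task 1 scores from the correlation example; the slides do not fit this particular regression, so the pairing of X and Y is an assumption.

```python
# Task scores from the earlier correlation example (X = Task 1, Y = Task 2)
task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]

def least_squares(x, y):
    """Return (A, B): intercept and slope of the least-squares line Y = A + BX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

def sum_squared_errors(x, y, a, b):
    """Sum of squared errors for the line Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

A, B = least_squares(task1, task2)
print(round(A, 2), round(B, 2))  # 40.17 1.35
```

Any other line, for example one with a slightly different slope, gives a larger value from `sum_squared_errors` than the pair (A, B).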


Interpreting the slope

[Figure: three scatterplots of Y against X, one for each sign of the slope B]

• B > 0: larger values of X are associated with larger values of Y
• B < 0: larger values of X are associated with smaller values of Y
• B = 0: the value of Y does not depend on X; the best estimate of Y is simply its mean

Scale Invariance (or not)


• Let us say that, for area measured in square feet, the slope B of the
best-fit regression line is 500
• If we measure area in square meters, the value of B would work out
to be 5382
• Is that a problem?
• $500 per square foot vs $5382 per square meter?
• we can standardize all X and Y values before we start … then regression
coefficient B is scale-free
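
The figure of 5382 is just a unit conversion of the slope, taking 1 square meter as roughly 10.764 square feet:

```latex
B_{\text{per m}^2}
  = 500\ \frac{\$}{\text{ft}^2} \times 10.764\ \frac{\text{ft}^2}{\text{m}^2}
  \approx 5382\ \frac{\$}{\text{m}^2}
```

The fitted relationship is the same in both cases; only the number used to express it changes.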


Correlation and regression …2


• The correlation coefficient r and the regression slope B
are related as follows:
$$ r = B \, \frac{S_X}{S_Y} $$
• where SX and SY are the standard deviations of X and Y
respectively
• r also has the benefit of being scale-invariant
• it does not matter whether area is measured in square feet or
square meters, or whether price is measured in INR or USD
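
A small numerical check of both points, again using the task scores from the correlation example as stand-in data (an assumption; the slides do not supply apartment data). Rescaling X by an arbitrary factor changes the slope B but leaves r unchanged.

```python
from math import sqrt

task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]   # X
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]  # Y

def mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    """Population standard deviation (divide by N)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def slope(x, y):
    """Least-squares slope B of Y on X."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def r_from_slope(x, y):
    """Correlation recovered from the slope: r = B * S_X / S_Y."""
    return slope(x, y) * pop_sd(x) / pop_sd(y)

factor = 10.7639  # arbitrary change of units for X
x_rescaled = [v / factor for v in task1]

print(round(slope(task1, task2), 3), round(slope(x_rescaled, task2), 3))
print(round(r_from_slope(task1, task2), 3), round(r_from_slope(x_rescaled, task2), 3))
```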

Standardized regression coefficients


• Recall that regression coefficients are not scale-invariant
• i.e. they depend on the units of measurement
• To get scale-invariant coefficients
• standardize Y as well as X1, X2, …, Xn, estimate
$$ z_Y = C + D_1 z_{X_1} + D_2 z_{X_2} + \dots + D_n z_{X_n} $$
• the z-score of Y is modeled as a function of the z-scores of Xi … the
coefficients Di are scale-invariant
• Also used when the relative magnitudes of Xi differ
widely (in their “natural” units)


Generalizing to multiple regression


• How does Y vary with the levels of multiple
“explanatory” variables?
Y = A + B1X1 + B2X2 + … + BnXn
• Bi is the slope of Y on dimension Xi
• B1, B2, …, Bn called “partial” regression coefficients
• the magnitudes (and even signs) of B1, B2, …, Bn depend on which
other variables are included in the multiple regression model
• might not agree in magnitude (or even sign) with the bivariate
correlation coefficient r between Xi and Y

Predictive power
• R = bivariate correlation between Yobserved and Ypredicted
(how well do they agree?)
• Consider the proportionate reduction in prediction
error (PRE) from using the model, relative
to the baseline of predicting Y using just its mean Ȳ:

$$ \mathrm{PRE} = \frac{\sum (Y_{obs} - \bar{Y})^2 - \sum (Y_{obs} - Y_{pred})^2}{\sum (Y_{obs} - \bar{Y})^2} $$

• turns out that PRE = R²
• R² or R-square measures the predictive power of the
multiple regression model
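
The identity PRE = R² can be checked numerically. The sketch below does so for a one-predictor least-squares fit on the task scores from the correlation example; the data choice and function names are assumptions made for illustration.

```python
from math import sqrt

task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]   # X
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]  # Y observed

def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept and slope of Y on X."""
    mx, my = mean(x), mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def pre(y_obs, y_pred):
    """Proportionate reduction in squared prediction error versus predicting the mean."""
    y_bar = mean(y_obs)
    baseline = sum((y - y_bar) ** 2 for y in y_obs)
    model = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))
    return (baseline - model) / baseline

a, b = fit_line(task1, task2)
y_pred = [a + b * x for x in task1]
R = correlation(task2, y_pred)
pre_value = pre(task2, y_pred)
print(round(pre_value, 4), round(R ** 2, 4))
```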


Hypothesis-testing in regression
• Consider Y = A + B1X1 + B2X2 + … + BkXk
• For the null hypothesis H0 that ALL the coefficients Bi
are zero, B1 = B2 = … = Bk = 0
• and the alternate hypothesis Ha that at least one Bi is
NOT zero, Bi ≠ 0
• the test statistic is
$$ F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} $$
• k = number of explanatory variables Xi
• n = number of observations (sample size)
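
As a small worked case, the sketch below evaluates the statistic for a single-predictor model, reusing r = 0.903 and n = 10 from the correlation example (so R² = r² and k = 1). Applying the F-test to that example is an illustration, not something the slides do.

```python
from math import sqrt

def f_statistic(r_squared, n, k):
    """Overall F statistic for a regression with k explanatory variables and n observations."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

r, n, k = 0.903, 10, 1
f = f_statistic(r ** 2, n, k)

# With one predictor, F equals the square of the t statistic for r
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)

print(round(f, 2))       # 35.34
print(round(t ** 2, 2))  # 35.34
```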

Overall F-test of model


• The test statistic is compared against the
F-distribution with df1 = k and df2 = n-(k+1)
• If the test statistic is large, the area to the right of this value will be
small
• small p-value enables rejection of the null hypothesis (H0: all Bi are zero)
• note that this is more likely if R² is large
• A model that fails this test is no better than no model (in terms of
prediction error)


Significance of coefficients
• Whether each coefficient Bi differs significantly from
zero is tested using the test statistic
$$ t = \frac{B_i}{\sigma_{B_i}} $$
(value of coefficient / standard error)
• compared against t-distribution with n-(k+1) df
• Each coefficient can be tested in this manner
• H0: coefficient is zero vs. Ha: coefficient is not zero
• When a coefficient Bi fails this test, it is not significantly
different from zero, and the term involving Xi can be
dropped from the model
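
For a model with a single explanatory variable, the standard error of the slope has a simple closed form, sqrt(MSE / Σ(X - X̄)²) with MSE = SSE / (n - (k+1)). The sketch below uses it on the task scores from the correlation example. Both the data choice and the restriction to k = 1 are assumptions; with several predictors the standard errors come from the full regression output.

```python
from math import sqrt

task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]   # X
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]  # Y

def slope_t_statistic(x, y):
    """Return (slope, standard error, t) for a one-predictor least-squares fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    k = 1  # one explanatory variable
    std_error = sqrt(sse / (n - (k + 1)) / sxx)
    return slope, std_error, slope / std_error

slope, std_error, t_value = slope_t_statistic(task1, task2)
print(round(slope, 3), round(std_error, 3), round(t_value, 2))
```

The resulting t is compared against the t-distribution with n-(k+1) = 8 df, the same reference distribution used for the test of r in this two-variable case.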

Desirable properties of regression model


• High R²
• indicates that a large proportion of the variation in Y is explained by the
independent variables
• Significant F-test
• the null hypothesis that all Bi are zero can be rejected at the chosen significance level
• Significant coefficients (t-test)
• change in each explanatory variable significantly affects the level of the
dependent variable


Another example: Boston housing prices


Variables
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

Excerpt of Boston housing data


crim      zn     indus   chas   nox     ptratio   b        lstat   medv
0.00632   18     2.31    0      0.538   15.3      396.9    4.98    24
0.02731   0      7.07    0      0.469   17.8      396.9    9.14    21.6
0.02729   0      7.07    0      0.469   17.8      392.83   4.03    34.7
0.03237   0      2.18    0      0.458   18.7      394.63   2.94    33.4
0.06905   0      2.18    0      0.458   18.7      396.9    5.33    36.2
0.02985   0      2.18    0      0.458   18.7      394.12   5.21    28.7
0.08829   12.5   7.87    0      0.524   15.2      395.6    12.43   22.9
0.14455   12.5   7.87    0      0.524   15.2      396.9    19.15   27.1
0.21124   12.5   7.87    0      0.524   15.2      386.63   29.93   16.5


Boston housing regression model

Boston housing: Regression model predictions


crim      zn     indus   chas   nox     ptratio   b        lstat   medv   Predicted values   Residuals
0.00632   18     2.31    0      0.538   15.3      396.9    4.98    24     30.0               -6.00
0.02731   0      7.07    0      0.469   17.8      396.9    9.14    21.6   25.0               -3.43
0.02729   0      7.07    0      0.469   17.8      392.83   4.03    34.7   30.6               4.13
0.03237   0      2.18    0      0.458   18.7      394.63   2.94    33.4   28.6               4.79
0.06905   0      2.18    0      0.458   18.7      396.9    5.33    36.2   27.9               8.26
0.02985   0      2.18    0      0.458   18.7      394.12   5.21    28.7   25.3               3.44
0.08829   12.5   7.87    0      0.524   15.2      395.6    12.43   22.9   23.0               -0.10
0.14455   12.5   7.87    0      0.524   15.2      396.9    19.15   27.1   19.5               7.56
0.21124   12.5   7.87    0      0.524   15.2      386.63   29.93   16.5   11.5               4.98

• Negative residuals (actual – predicted) -> underpriced? -> good value?


• Positive residuals -> overpriced?
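
The residual screen can be sketched as follows, using the medv and predicted values from the table. Because the predicted values shown are rounded to one decimal, the residuals computed here differ slightly from the tabulated ones (for example -3.4 rather than -3.43). The labels simply mirror the questions above and are not a valuation rule.

```python
# medv (actual, in $1000s) and predicted values copied from the table
medv = [24, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]
predicted = [30.0, 25.0, 30.6, 28.6, 27.9, 25.3, 23.0, 19.5, 11.5]

def label(residual):
    """Tentative reading of a residual (actual minus predicted)."""
    if residual < 0:
        return "underpriced?"
    if residual > 0:
        return "overpriced?"
    return "as predicted"

residuals = [round(actual - pred, 2) for actual, pred in zip(medv, predicted)]
labels = [label(r) for r in residuals]

for actual, pred, res, tag in zip(medv, predicted, residuals, labels):
    print(actual, pred, res, tag)
```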


Getting carried away … the story of Zillow
