Chapter 3

Chapter 3 discusses the implications of heteroskedasticity and cluster sampling for statistical inference in business analytics. It highlights that violations of homoskedasticity and independence bias the usual variance estimators and invalidate standard inference procedures, and presents solutions such as heteroskedasticity-corrected covariance matrix estimators and cluster-robust estimators. The chapter emphasizes the importance of using these robust methods to ensure valid results in regression analysis.


Statistical Foundations of Business Analytics

Chapter 3: Heteroskedasticity and Cluster Sampling

Tim Ederer

Mini 2, 2024
Tepper Business School
Introduction

With Chapters 1 and 2, we are now equipped to make inferences about β


• Relies on assumptions EXO, RANK, IID, and HOMOSKEDASTICITY

What happens when HOMOSKEDASTICITY is not satisfied?


• Var(ε_i | x_i) = σ_i² instead of Var(ε_i | x_i) = σ²

What happens when IID is not satisfied?


• Cov(ε_i, ε_j | X) ≠ 0 instead of Cov(ε_i, ε_j | X) = 0

1 / 16
Reminder: Variance of OLS Estimator

Remember that the variance of β̂ has the following expression

$$
\mathrm{Var}(\hat\beta \mid X) = (X'X)^{-1} X' \,\mathrm{Var}(\varepsilon \mid X)\, X \,(X'X)^{-1}
$$

Problem: Var(ε|X ) is a complex object


 
$$
\mathrm{Var}(\varepsilon \mid X) =
\begin{pmatrix}
\mathrm{Var}(\varepsilon_1 \mid X) & \mathrm{Cov}(\varepsilon_1, \varepsilon_2 \mid X) & \cdots & \mathrm{Cov}(\varepsilon_1, \varepsilon_n \mid X) \\
\mathrm{Cov}(\varepsilon_1, \varepsilon_2 \mid X) & \mathrm{Var}(\varepsilon_2 \mid X) & \cdots & \mathrm{Cov}(\varepsilon_2, \varepsilon_n \mid X) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(\varepsilon_1, \varepsilon_n \mid X) & \mathrm{Cov}(\varepsilon_2, \varepsilon_n \mid X) & \cdots & \mathrm{Var}(\varepsilon_n \mid X)
\end{pmatrix}
$$

2 / 16
Reminder: Homoskedasticity and IID

HOMOSKEDASTICITY + IID greatly simplify the problem!


 2 
σ 0 ... 0
 0 σ2 . . . 0
2 2 ′ −1
Var(ε|X ) =  . ..  = σ In =⇒ Var(β̂|X ) = σ (X X )
 
.. . .
 .. . . .
0 0 ... σ2

This has two important consequences


• β̂ is BLUE
• Var(β̂|X ) is easy to estimate
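In code, this estimate is just σ̂²(X′X)⁻¹ with σ̂² the usual degrees-of-freedom-corrected residual variance. A minimal numpy sketch on simulated data (variable names are illustrative, not from the course):

```python
# Classical OLS covariance under homoskedasticity: sigma2_hat * (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)      # homoskedastic errors, sigma = 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])          # unbiased estimator of sigma^2
cov_classical = sigma2_hat * XtX_inv                   # estimate of Var(beta_hat | X)
se = np.sqrt(np.diag(cov_classical))                   # classical standard errors
```

This is exactly the covariance matrix that standard regression output reports, and it is only valid under HOMOSKEDASTICITY.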

Let’s start by seeing what happens when we relax HOMOSKEDASTICITY

3 / 16
Heteroskedasticity
Heteroskedasticity: Definition

Heteroskedasticity means that the variance of εi is not constant across i

Var(ε_i | x_i) = σ_i² ≠ σ²

Examples
• Volatility in earnings increases with education: Var(ε_i | x_i) = γ_1 + γ_2 educ_i
• House price variance is higher in neighborhood A vs neighborhood B

4 / 16
Heteroskedasticity: Visualisation

Under HOMOSKEDASTICITY, plotting the residuals ε̂_i against x_i should show an even band of points around zero
• The variance of the residuals should not depend on x_i

5 / 16
Heteroskedasticity: Visualisation
Under heteroskedasticity, plotting the residuals ε̂_i against x_i can show a fanning-out pattern
• The variance of the residuals is increasing with x_i in this case

Why is this a problem?


6 / 16
Heteroskedasticity: Consequences

Assume now that the variance of εi is not constant across i


 2 
σ1 0 . . . 0
 0 σ22 . . . 0 
Var(ε|X ) =  . ..  = Ω
 
.. . .
 .. . . .
0 0 ... σn2

The variance of β̂ now has the following expression

$$
\mathrm{Var}(\hat\beta \mid X) = (X'X)^{-1} X' \Omega X (X'X)^{-1} \neq \sigma^2 (X'X)^{-1}
$$

7 / 16
Heteroskedasticity: Consequences

Two important consequences


• β̂ is not BLUE anymore (but it is still unbiased and consistent!)
• Our estimator for Var(β̂|X ) is biased

Under heteroskedasticity our inference procedure collapses


• The distribution of test statistics is no longer known
• Confidence intervals are wrong

This will eventually lead you to wrong conclusions!

8 / 16
Illustration in R

 
Consider the model y_i = β_1 + β_2 x_i + ε_i with β = (0, 1)′

• We assume that x_i ∼ N(0, 1) and ε_i | x_i ∼ N(0, σ_i²)
• Set σ_i = 1 + 0.5x_i + 0.1x_i²

Draw 10000 samples of n = 100 observations


• For each sample: compute β̂_2 and its confidence interval at level α = 5%
• The confidence interval should contain 1 for 95% of the samples

Results
• It contains 1 for only 88% of the samples
• Confidence intervals are narrower than they should be because the standard errors are biased!
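The slides run this experiment in R; the following is an equivalent sketch in Python/numpy under the same design (not the course code, and using 2,000 draws instead of 10,000 to keep it quick):

```python
# Monte Carlo: naive 95% CIs for beta_2 undercover when errors are heteroskedastic.
import numpy as np

rng = np.random.default_rng(42)
n, n_sims = 100, 2000
covered = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    sigma_i = 1 + 0.5 * x + 0.1 * x**2        # always positive for this polynomial
    eps = sigma_i * rng.normal(size=n)        # heteroskedastic errors
    y = 0 + 1 * x + eps                       # true beta = (0, 1)
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se2 = np.sqrt(s2 * XtX_inv[1, 1])         # naive (homoskedastic) std error
    covered += (b[1] - 1.96 * se2 <= 1.0 <= b[1] + 1.96 * se2)
coverage = covered / n_sims                   # noticeably below the nominal 0.95
```

The coverage rate comes out well below 95%, matching the undercoverage reported on the slide.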

9 / 16
Testing for the Presence of Heteroskedasticity

The presence of heteroskedasticity can be visualized/tested


• We do not observe ε_i = y_i − x_i′β, but we do observe the residuals ε̂_i = y_i − x_i′β̂

Method 1: plot ε̂i against xi


• If the variance of ε̂i varies with xi this is evidence of heteroskedasticity
• Problem: not easy to visualize when xi is multidimensional

Method 2: White (1980) test


 
• Step 1: run the auxiliary regression ε̂_i² = γ_0 + z_i′γ_1 + ν_i where z_i = (x_i, x_i²)′
• Step 2: test H_0 : γ_1 = 0 ⇐⇒ H_0 : E[ε̂_i² | z_i] = Var(ε̂_i | z_i) = γ_0 (homoskedasticity)
• Rejecting H0 is evidence of the presence of heteroskedasticity
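The two steps can be sketched in numpy on simulated data; one common way to carry out Step 2 is the LM statistic n·R² from the auxiliary regression, which is χ²(2) under H_0 (the critical value is hard-coded below):

```python
# White-style test: regress squared residuals on z_i = (x_i, x_i^2).
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n) * (1 + np.abs(x))  # heteroskedastic by construction

# Residuals from the main regression
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid2 = (y - X @ beta_hat) ** 2

# Step 1: auxiliary regression of squared residuals on z_i = (x_i, x_i^2)
Z = np.column_stack([np.ones(n), x, x**2])
gamma_hat = np.linalg.lstsq(Z, resid2, rcond=None)[0]
fitted = Z @ gamma_hat
r2 = 1 - ((resid2 - fitted) ** 2).sum() / ((resid2 - resid2.mean()) ** 2).sum()

# Step 2: LM statistic n * R^2 is chi2(2) under H0 (gamma_1 = 0)
lm_stat = n * r2
crit_5pct = 5.991          # chi2(2) critical value at the 5% level
reject = lm_stat > crit_5pct
```

With errors this strongly heteroskedastic, the test rejects H_0 comfortably.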

10 / 16
Heteroskedasticity Robust Variance Estimators

Good news: there exists a very simple solution to this problem


• Our estimator for the variance of β̂ is biased...
• Why not simply find another estimator for Var(β̂|X )?

Solution: Heteroskedasticity-Corrected Covariance Matrix Estimators (HCCME)


 2 
ε̂1 0 . . . 0
HC
 0 ε̂22 . . . 0 
′ −1 ′ b ′ −1
Var (β̂|X ) = (X X ) X ΩX (X X ) where Ω =  .
 
c b .. . . .. 
 .. . . .
0 0 ... ε̂2n

The HC estimator is unbiased under both heteroskedasticity and homoskedasticity


• Can derive robust standard errors: $\widehat{\mathrm{s.e.}}^{HC}(\hat\beta_k) = \sqrt{\widehat{\mathrm{Var}}^{HC}(\hat\beta \mid X)_{(k,k)}}$

11 / 16
Summary

Heteroskedasticity is a major problem for inference


• Introduces bias in estimator of variance of β̂
• Interpretation of results can be severely affected

You can test for the presence of heteroskedasticity


• Either visually or formally

But more importantly you can change your estimator for Var(β̂|X )!
• HC estimator is unbiased under homoskedasticity AND heteroskedasticity
• It can be computed at no cost in any statistical software
• There is no excuse for not using it next time you run a regression!

12 / 16
Cluster Sampling
Relaxing IID

What happens when individuals are not sampled independently?


• Introduces dependence between observations
• Cov(ε_i, ε_j | X) = σ_ij ≠ 0

The variance covariance matrix of ε becomes very complex


 2 
σ1 σ12 . . . σ1n
σ12 σ22 . . . σ2n 
Var(ε|X ) =  .
 
.. .. .. 
 .. . . . 
σ1n σ2n . . . σn2

13 / 16
Cluster Sampling

Focus on case where you sample groups (or clusters) instead of individuals
• Example: you sample households or villages instead of individuals
• Independence across clusters but dependence within clusters
• Cov(ε_i, ε_j | X) = σ_ij ≠ 0 for i and j in the same cluster c = 1, ..., C

The variance covariance matrix of ε has a block structure


   2 
Ω1 0 . . . 0 σ1 σ12 ... σ1nc
 0 Ω2 . . . 0   σ12 σ22 ... σ2nc 
Var(ε|X ) =  . ..  where Ωc =  ..
   
.. . . .. .. .. 
 .. . . .   . . . . 
0 0 . . . ΩC σ1nc σ2nc ... σn2c

14 / 16
Consequences

Same consequences as heteroskedasticity


• β̂ is not BLUE anymore
• Our estimator for Var(β̂|X ) is biased

Solution: Cluster-Robust Variance Covariance Estimator


$$
\widehat{\mathrm{Var}}^{CR}(\hat\beta \mid X) = (X'X)^{-1} \left( \sum_{c=1}^{C} X_c' \widehat{\Omega}_c X_c \right) (X'X)^{-1}
\quad \text{where} \quad
\widehat{\Omega}_c =
\begin{pmatrix}
\hat\varepsilon_1^2 & \hat\varepsilon_1 \hat\varepsilon_2 & \cdots & \hat\varepsilon_1 \hat\varepsilon_{n_c} \\
\hat\varepsilon_1 \hat\varepsilon_2 & \hat\varepsilon_2^2 & \cdots & \hat\varepsilon_2 \hat\varepsilon_{n_c} \\
\vdots & \vdots & \ddots & \vdots \\
\hat\varepsilon_1 \hat\varepsilon_{n_c} & \hat\varepsilon_2 \hat\varepsilon_{n_c} & \cdots & \hat\varepsilon_{n_c}^2
\end{pmatrix}
$$

The Cluster-Robust estimator is approximately unbiased when the number of clusters C is large


• Also robust to the presence of heteroskedasticity!
• Can derive cluster-robust standard errors: $\widehat{\mathrm{s.e.}}^{CR}(\hat\beta_k) = \sqrt{\widehat{\mathrm{Var}}^{CR}(\hat\beta \mid X)_{(k,k)}}$
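A numpy sketch of the cluster-robust sandwich, summing X_c′ε̂_c ε̂_c′X_c over clusters (the function name and the simulated cluster design are illustrative, not from the course):

```python
# Cluster-robust sandwich: the meat sums X_c' e_c e_c' X_c over clusters c.
import numpy as np

def cluster_cov(X, y, cluster_ids):
    """Return (beta_hat, cluster-robust covariance matrix)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    k = X.shape[1]
    meat = np.zeros((k, k))
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        s = X[mask].T @ e[mask]        # X_c' e_c, a k-vector
        meat += np.outer(s, s)         # X_c' e_c e_c' X_c
    return b, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(3)
C, nc = 50, 10                          # 50 clusters of 10 observations each
cluster_ids = np.repeat(np.arange(C), nc)
u = rng.normal(size=C)[cluster_ids]     # shared within-cluster shock
x = rng.normal(size=C * nc)
y = 1 + 2 * x + u + rng.normal(size=C * nc)
X = np.column_stack([np.ones(C * nc), x])
beta_hat, cov_cr = cluster_cov(X, y, cluster_ids)
se_cr = np.sqrt(np.diag(cov_cr))        # cluster-robust standard errors
```

Because the within-cluster shock u makes errors correlated inside each cluster, the cluster-robust standard errors are the appropriate ones here, not the classical or HC ones.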

15 / 16
Summary

Important issues arise when HOMOSKEDASTICITY and IID are not satisfied
• Estimator of Var(β̂|X ) is biased
• Inference procedure breaks down: confidence intervals are wrong, tests are unreliable

But there is an easy fix!


• Heteroskedasticity: use HCCME for Var(β̂|X )
• Cluster sampling: use cluster-robust estimator for Var(β̂|X )
• These fixes can be implemented at no cost on any software!

Only important assumption remaining: EXO


• Chapter 4 studies what we should do when EXO does not hold

16 / 16
