Chapter 8 of FIN2016 discusses nonlinear regression functions, focusing on dummy variables, polynomials, logarithmic functions, interaction terms, and variable rescaling. It illustrates the use of dummy variables in regression analysis, the importance of avoiding the dummy variable trap, and how to interpret polynomial regression models. The chapter emphasizes the need for nonlinear models when linear relationships do not adequately fit the data.

FIN2016 Introductory Econometrics

Chapter 08: Nonlinear Regression Functions

Po-Yu Liu

National Taiwan University

April 8, 2025

1/75
Table of Contents

1. Dummy Variables

2. Polynomials in X

3. Logarithmic Functions of X or Y

4. Interaction Terms

5. Rescaling & Shifting of Variables

2/75
Dummy variable example
▶ Let’s say you have available G dummy variables that together
are mutually exclusive and exhaustive of the population
▶ Example: smoker with G = 2
▶ Two dummies:
▶ smoker equal 1 if person is a smoker (zero otherwise)
▶ nonsmoker equal 1 if person is a non-smoker (zero otherwise)
▶ If you are interested in the association between smoking and
birthweight, consider the following specifications:
▶ birthweight = β0 + β1 smoker + ui
▶ birthweight = β0 + β2 nonsmoker + ui
▶ birthweight = β0 + β1 smoker + β2 nonsmoker + ui

3/75
Dummy variable trap
> dt = readxl::read_xlsx('data/birthweight_smoking.xlsx') %>% setDT
> model1 = feols(birthweight ~ smoker, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3432.060 11.8905 288.63802 < 2.2e-16 ***
smoker -253.228 26.8104 -9.44516 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 583.5 Adj. R2: 0.02828

> dt[, nonsmoker := 1-smoker]


> model2 = feols(birthweight ~ nonsmoker, dt, vcov = 'hetero')
> model2
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3178.832 24.0294 132.28925 < 2.2e-16 ***
nonsmoker 253.228 26.8104 9.44516 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 583.5 Adj. R2: 0.02828 4/75
Dummy variable trap
▶ In each regression, the group represented by “zero” is the
so-called benchmark or default group (represented by the
constant term)
▶ Absolute value of slope coefficient is identical
▶ Regression with both smoker and nonsmoker will throw an
error (that’s the dummy variable trap)
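▶ A minimal sketch of what happens if you try it anyway (assuming the same dt and the
nonsmoker column created on the previous slide); depending on the software, the collinear
dummy is either dropped with a message or the call stops with an error:

# hypothetical check: smoker + nonsmoker together are perfectly collinear with the intercept
model_trap = feols(birthweight ~ smoker + nonsmoker, dt, vcov = 'hetero')
model_trap   # inspect which regressor, if any, was kept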

5/75
Example: number of prenatal visits
▶ Example: number of prenatal visits with G = 4
▶ Four dummies:
▶ tripre0 equal 1 if never went for prenatal health visits
(presumably a problematic group)
▶ tripre1 equal 1 if first prenatal health visit in 1st trimester
(presumably the most common group)
▶ tripre2 equal 1 if first prenatal health visit in 2nd trimester
▶ tripre3 equal 1 if first prenatal health visit in 3rd trimester
▶ We’ve just learned: only need to use a subset of three
dummies
▶ Which subset should we use?
▶ It doesn’t matter: as long as we use any three, we are not
throwing out any information
▶ However: the unused dummy implicitly defines the benchmark
group

6/75
Benchmark groups in regression
▶ Benchmark: first prenatal health visit in 1st trimester
> fml = birthweight ~ smoker + alcohol + tripre0 + tripre2 + tripre3
> model1 = feols(fml, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3454.5493 12.4817 276.769748 < 2.2e-16 ***
smoker -228.8476 26.5489 -8.619854 < 2.2e-16 ***
alcohol -15.1000 69.7031 -0.216633 8.2851e-01
tripre0 -697.9687 146.5788 -4.761732 2.0106e-06 ***
tripre2 -100.8373 31.5530 -3.195810 1.4089e-03 **
tripre3 -136.9553 67.6958 -2.023099 4.3152e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 578.1 Adj. R2: 0.044873

▶ Predicted birthweight for a mother in the tripre1 group who smoked and drank:

3454.5493 − 228.8476 − 15.1000 = 3210.602
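▶ The same number can be recovered with predict() (a sketch assuming model1 from this slide;
the one-row data frame below is made up for illustration):

# mother in the tripre1 benchmark group who smoked and drank
newrow = data.frame(smoker = 1, alcohol = 1, tripre0 = 0, tripre2 = 0, tripre3 = 0)
predict(model1, newdata = newrow)   # roughly 3210.6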
7/75
Another benchmark group
▶ Benchmark: never went for prenatal health visit
> fml = birthweight ~ smoker + alcohol + tripre1 + tripre2 + tripre3
> model2 = feols(fml, dt, vcov = 'hetero')
> model2
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2756.5806 146.0773 18.870703 < 2.2e-16 ***
smoker -228.8476 26.5489 -8.619854 < 2.2e-16 ***
alcohol -15.1000 69.7031 -0.216633 8.2851e-01
tripre1 697.9687 146.5788 4.761732 2.0106e-06 ***
tripre2 597.1315 149.1019 4.004856 6.3564e-05 ***
tripre3 561.0135 160.9453 3.485739 4.9783e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 578.1 Adj. R2: 0.044873

▶ Predicted birthweight for a mother in the tripre1 group who smoked and drank:

2756.5806 − 228.8476 − 15.1000 + 697.9687 = 3210.602
8/75
Table of Contents

1. Dummy Variables

2. Polynomials in X

3. Logarithmic Functions of X or Y

4. Interaction Terms

5. Rescaling & Shifting of Variables

9/75
Motivation: Relationship with varying slopes

FIGURE 8.1 Population Regression Functions with Different Slopes
(a) Constant slope; (b) slope depends on the value of X1; (c) slope depends on the value of X2
(one population regression function when X2 = 1, another when X2 = 0)

In Figure 8.1(a), the population regression function has a constant slope. In Figure 8.1(b), the slope of the
population regression function depends on the value of X1. In Figure 8.1(c), the slope of the population
regression function depends on the value of X2.
10/75
Motivation: Linear relationship does not fit data well

FIGURE 8.2 Scatterplot of Test Scores vs. District Income with a Linear OLS Regression Function
There is a positive correlation between test scores and district income (correlation = 0.71), but the linear
OLS regression line does not adequately describe the relationship between these variables.
(District income is measured in thousands of dollars.)

▶ In practice, we often still just use the linear specification because it’s “good enough”
and easy to understand. We want causal inference, not prediction

11/75
Multiple regression model with polynomials of X
▶ Consider the following multiple regression model:

Yi = β0 + β1 Xi + β2 Xi^2 + · · · + βr Xi^r + ui

▶ This is just the linear multiple regression model — except that


the regressors are powers of X
▶ Estimation, hypothesis testing, etc. proceeds as in the
multiple regression model using OLS
▶ The coefficients are difficult to interpret, but the regression
function itself is interpretable

▶ This is still “linear” regression: think of X^2 as a new regressor that is
different from X. Then Y is still a “linear” combination of the regressors

12/75
Polynomial regression example
▶ We will illustrate the use of polynomials using the textbook’s
data on test scores and student-teacher ratios
▶ Here we focus on the following two variables only:
▶ testscr is average test score in school district i
▶ avginc is the average income in school district i (thousands of
dollars per capita)
▶ Quadratic specification:

testscr = β0 + β1 avginc + β2 avginc^2 + ui

▶ Cubic specification:

testscr = β0 + β1 avginc + β2 avginc^2 + β3 avginc^3 + ui

13/75
Estimation of the quadratic specification in R
> dt = haven::read_dta('data/caschool.dta') %>% setDT
> fml = testscr ~ avginc + avginc^2
> model1 = feols(fml, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 607.301735 2.901754 209.28782 < 2.2e-16 ***
avginc 3.850995 0.268094 14.36434 < 2.2e-16 ***
I(avginc^2) -0.042308 0.004780 -8.85051 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.7 Adj. R2: 0.554045

▶ Implementation varies with the language / package you use.
Sometimes it is easier to just create a new column that is X^2

14/75
Compare the predicted values between linear and quadratic specification

▶ The p-value for the t-statistic on the quadratic term is less than 0.01%, so we can reject the
hypothesis that β2 = 0 at all conventional significance levels. This formal hypothesis test
supports our informal inspection of Figures 8.2 and 8.3: the quadratic model fits the data
better than the linear model

FIGURE 8.3 Scatterplot of Test Scores vs. District Income with Linear and Quadratic Regression Functions
The quadratic OLS regression function fits the data better than the linear OLS regression function.
(District income is measured in thousands of dollars.)

15/75
Interpretation of polynomial regression
▶ How to interpret the estimated PRF?
▶ Estimated PRF is:

  testscr-hat i = 607 + 3.85 avginc i − 0.042 avginc i^2

▶ Predicted change in testscr i for a change in avginc i from $5,000 to
  $6,000 per capita:

  ∆ testscr-hat i = (607 + 3.85 · 6 − 0.0423 · 6^2) − (607 + 3.85 · 5 − 0.0423 · 5^2)
                  = 3.4
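▶ The same numbers can be reproduced from the estimated coefficients (a sketch assuming model1 is
the quadratic fit from the previous slide); they match the table on the next slide:

b = coef(model1)                      # (Intercept), avginc, I(avginc^2)
pred = function(x) b[1] + b[2]*x + b[3]*x^2
pred(6) - pred(5)                     # roughly 3.4
pred(26) - pred(25)                   # roughly 1.7
pred(46) - pred(45)                   # roughly 0.0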

16/75
Interpretation of polynomial regression
▶ Predicted effects for different values of avginc i :
∆avginc ∆testscr
From $5,000 to $6,000 3.4
From $25,000 to $26,000 1.7
From $45,000 to $46,000 0.0
▶ The effect of changing avginc i on testscr i is decreasing in
avginc i
▶ The second derivative is negative (because the coefficient
estimate on the quadratic term is negative)
▶ Caution: do not extrapolate outside the range of the data

17/75
Estimation of the cubic specification in R
> dt[, avginc2 := avginc^2]
> dt[, avginc3 := avginc^3]
> fml = testscr ~ avginc + avginc2 + avginc3
> model2 = feols(fml, dt, vcov = 'hetero')
> model2
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 600.078985 5.102062 117.61500 < 2.2e-16 ***
avginc 5.018677 0.707350 7.09504 5.6063e-12 ***
avginc2 -0.095805 0.028954 -3.30890 1.0181e-03 **
avginc3 0.000685 0.000347 1.97509 4.8919e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.6 Adj. R2: 0.555228

18/75
Testing for linearity
▶ Testing the null hypothesis of linearity, against the alternative
that the population regression is quadratic and/or cubic:

H0 : population coefficients on avginc^2 and avginc^3 are both 0


H1 : at least one of these coefficients is nonzero

19/75
Testing for linearity

▶ The hypothesis that the population regression is linear is rejected at the 5% significance
level against the alternative that it is a polynomial of (up to) third order
20/75
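▶ A sketch of how this joint test can be run on the cubic fit (assuming model2 and the avginc2,
avginc3 columns created earlier); per the slide, the null of linearity is rejected at the 5% level:

# H0: the coefficients on avginc2 and avginc3 are jointly zero
wald(model2, keep = c('avginc2', 'avginc3'))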
Table of Contents

1. Dummy Variables

2. Polynomials in X

3. Logarithmic Functions of X or Y

4. Interaction Terms

5. Rescaling & Shifting of Variables

21/75
The (natural) log

FIGURE 8.4 The Logarithm Function, y = ln(x)
The logarithmic function y = ln(x) is steeper for small than for large values of x, is defined only
for x > 0, and has slope 1/x.

The difference between the logarithm of x + ∆x and the logarithm of x is approximately ∆x/x, the
percentage change in x divided by 100. That is,

    ln(x + ∆x) − ln(x) ≅ ∆x/x   (when ∆x/x is small),   (8.16)

where “≅” means “approximately equal to.” The derivation of this approximation relies on calculus,
but it is readily demonstrated by trying out some values of x and ∆x
22/75
Using logarithmic transformations in regression
▶ Using logarithmic transformations of both the dependent and
independent variables can be useful when estimating
coefficients
▶ (Two main reasons: % interpretation & mitigate outliers)
▶ Using the student test score example, let’s focus on two
variables:
▶ Yi : test score in school district i
▶ Xi : average income in school district i (this is a proxy for
socio-economic status of the district)
▶ Let’s look at the simple regression model:

Yi = β0 + β1 Xi + ui
▶ We estimate β1 by running a regression of Yi on Xi
▶ But what do we estimate when instead we:
▶ Run a regression of ln Yi on Xi
▶ Run a regression of Yi on ln Xi
▶ Run a regression of ln Yi on ln Xi
23/75
Properties of the logarithm
▶ The logarithm has useful features based on calculus
▶ Compare the independent variable at two values x1 and x0 (it
works the same for the dependent variable)
▶ Starting at x0 , you change the variable by ∆x := x1 − x0
▶ Define the following: x̃1 = ln(x1 ) and x̃0 = ln(x0 )
▶ The corresponding change in the logarithm captures:

  ∆x̃ := x̃1 − x̃0 = ln(x1 ) − ln(x0 ) = ln(x0 + ∆x) − ln(x0 )
      = ln((x0 + ∆x)/x0 ) = ln(1 + ∆x/x0 ) ≈ ∆x/x0 = percentage change
▶ The difference in the logarithmic values of x1 and x0 is
approximately equal to the percentage change between x1 and
x0
▶ The difference in logarithms approximates percentage changes
24/75
Example of logarithmic approximation
▶ For example:

x0 = 50 x̃0 = ln(x0 ) = 3.91


x1 = 52 x̃1 = ln(x1 ) = 3.95
∆x
=⇒ = 4% =⇒ ∆x̃ = 0.04
x0
▶ Another example:
▶ If ∆x̃ = 0.07 then you know that x increased by 7%
▶ In a few slides we will have:
∆x̃ = 1 which means that x increased by 100%
▶ (Aside: the log-approximation works best when the change
from x0 to x1 is small)
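▶ A quick sketch of how the approximation degrades for larger changes (base R only):

log(1.01)   # 0.00995, very close to the exact 1% change
log(1.10)   # 0.09531, still close to 10%
log(2.00)   # 0.69315, noticeably below the exact 100% change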

25/75
Example of logarithmic approximation
> log(11)-log(10)
[1] 0.09531018
> log(110)-log(100)
[1] 0.09531018
> log(1100)-log(1000)
[1] 0.09531018
> log(11000)-log(10000)
[1] 0.09531018
▶ No matter how big x is, ∆ log(x) gives the same value (≈ 0.095), telling us
there’s roughly a 10% change

▶ Remember: Use natural log, not log 10

26/75
Back to the regression model
▶ You create log-versions of both Xi and Yi :

  X̃i := ln Xi
  Ỹi := ln Yi

▶ Now compare the following four specifications:
    Specification        Population regression function
    (1) linear-linear    Yi = β0 + β1 Xi
    (2) linear-log       Yi = β0 + β1 X̃i
    (3) log-linear       Ỹi = β0 + β1 Xi
    (4) log-log          Ỹi = β0 + β1 X̃i
▶ The interpretation of the slope coefficient β1 differs in each
case
▶ The generic interpretation of the slope coefficient β1 is:
By how much does the dependent variable change, on
average, when the independent variable changes by one unit? 27/75
What does this mean in the different specifications?

(1) β1 = ∆Yi /∆Xi , therefore ∆Xi = 1 =⇒ ∆Yi = β1
    X up by 1 unit, Y up by β1 units

(2) β1 = ∆Yi /∆X̃i , therefore ∆X̃i = 1 =⇒ ∆Yi = β1
    X up by 100%, Y up by β1 units

(3) β1 = ∆Ỹi /∆Xi , therefore ∆Xi = 1 =⇒ ∆Ỹi = β1
    X up by 1 unit, Y up by 100 · β1 %

(4) β1 = ∆Ỹi /∆X̃i , therefore ∆X̃i = 1 =⇒ ∆Ỹi = β1
    X up by 100%, Y up by 100 · β1 %
28/75
Linear-log specification
> model3 = feols(testscr ~ log(avginc), dt, vcov = 'hetero')
> model3
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 557.8323 3.83994 145.2711 < 2.2e-16 ***
log(avginc) 36.4197 1.39694 26.0710 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.6 Adj. R2: 0.561461

▶ Interpretation:
A 100% increase in avginc is associated with an increase in testscr
by 36.42 points (the measurement units of testscr ) on average
▶ Or alternatively:
A 1% increase in avginc is associated with an increase in testscr by
0.3642 points on average
29/75
Linear-log specification

FIGURE 8.5 The Linear-Log Regression Function
The estimated linear-log regression function Ŷ = β̂0 + β̂1 ln(X) captures much of the nonlinear
relation between test scores and district income (in thousands of dollars).

For X + ∆X, the expected value is given by ln(Y + ∆Y) = β0 + β1 (X + ∆X). Thus the difference
between these expected values is ln(Y + ∆Y) − ln(Y) = [β0 + β1 (X + ∆X)] − [β0 + β1 X] = β1 ∆X.
From the approximation in Equation (8.16), however, if β1 ∆X is small, then
ln(Y + ∆Y) − ln(Y) ≅ ∆Y/Y. Thus ∆Y/Y ≅ β1 ∆X. If ∆X = 1, so that X changes by one unit,
then ∆Y/Y changes by β1
30/75
Log-linear specification
> model4 = feols(log(testscr) ~ avginc, dt, vcov = 'hetero')
> model4
OLS estimation, Dep. Var.: log(testscr)
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.439362 0.002894 2225.2097 < 2.2e-16 ***
avginc 0.002844 0.000175 16.2436 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.0206 Adj. R2: 0.497011

▶ Interpretation:
A $1,000 increase in avginc (one unit, since avginc is in thousands of dollars) is
associated with a 0.28% increase in testscr on average

31/75
Log-log specification
> model5 = feols(log(testscr) ~ log(avginc), dt, vcov = 'hetero')
> model5
OLS estimation, Dep. Var.: log(testscr)
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.336349 0.005925 1069.5006 < 2.2e-16 ***
log(avginc) 0.055419 0.002145 25.8414 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.019339 Adj. R2: 0.556725
▶ Interpretation:
An increase by 100% in avginc will increase testscr by 5.5% on
average
▶ Or alternatively:
An increase by 1% in avginc will increase testscr by 0.055% on
average
▶ The coefficient β1 measures the elasticity of Y with respect to X
32/75
Log-linear and log-log

FIGURE 8.6 The Log-Linear and Log-Log Regression Functions
In the log-linear regression function, ln(Y) is a linear function of X. In the log-log regression
function, ln(Y) is a linear function of ln(X). (Vertical axis: ln(Test score); horizontal axis:
district income in thousands of dollars.)
As you can see in Figure 8.6, the log-log specification fits better than the log-
linear specification. This is consistent with the higher R2 for the log-log regression
(0.557) than for the log-linear regression (0.497). Even so, the log-log specification
does not fit the data especially well: At the lower values of income, most of the obser-
vations fall below the log-log curve, while in the middle income range most of the
observations fall above the estimated regression function.
33/75
Table of Contents

1. Dummy Variables

2. Polynomials in X

3. Logarithmic Functions of X or Y

4. Interaction Terms

5. Rescaling & Shifting of Variables

34/75
Interactions between two binary regressors
▶ We will illustrate the use of interaction terms using the
textbook’s data on test scores and student-teacher ratios
▶ Consider the following multiple regression model:

testscr i = β0 + β1 str i + β2 el pct i + ui


▶ Where:
▶ testscr i : average test score in school district i
▶ str i : average student-teacher ratio in school district i
▶ el pct i : percent of English learners in school district i
(remember, this data set is from California where many
students are native Spanish speakers)

35/75
R code for multiple regression
▶ When you run this regression in R, this is what you get:
> model6 = feols(testscr ~ str + el_pct, dt, vcov = 'hetero')
> model6
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.032249 8.728224 78.59930 < 2.2e-16 ***
str -1.101296 0.432847 -2.54431 0.011309 *
el_pct -0.649777 0.031032 -20.93909 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 14.4 Adj. R2: 0.42368

36/75
Interpreting the results
▶ If district i could decrease str i by one unit while holding
el pct i constant, it can expect an increase in average test
scores of 1.10
▶ If district i could decrease el pct i by one percentage point
(say from 25% to 24%) while holding str i constant, it can
expect an increase in average test scores of 0.65
▶ Both effects are statistically significant at the 5% level

37/75
Effect of class size reduction
▶ Perhaps a class size reduction is more effective in some
circumstances than in others
▶ Perhaps the effect of student-teacher ratio on test scores
varies with the percentage of English learners
▶ This would be the case, for example, if English learners
benefit disproportionately from smaller class sizes (and
therefore lower student-teacher ratios)
▶ More technically, ∆testscr /∆str might depend on el pct
▶ More generally, ∆Y /∆X1 might depend on X2
▶ How to model such interactions between X1 and X2 ?

38/75
Baseline model with interaction terms
▶ Baseline model:

Yi = β0 + β1 D1i + β2 D2i + ui

▶ Where D1i and D2i are binary regressors (dummy variables)


▶ β1 is the effect on Yi of changing D1i = 0 to D1i = 1
▶ In this specification, the effect does not depend on the value
of D2i
▶ To allow the effect of changing D1i to depend on D2i , include
the interaction term D1i × D2i as a separate regressor:

Yi = β0 + β1 D1i + β2 D2i + β3 (D1i × D2i ) + ui

39/75
Interpreting the coefficients
▶ Compare the PRF when D1i changes from 0 to 1
while D2i is fixed at q ∈ {0, 1}:

E [Yi |D1i = 0, D2i = q] = β0 + β2 q


E [Yi |D1i = 1, D2i = q] = β0 + β1 + β2 q + β3 q

▶ And their difference:

E [Yi |D1i = 1, D2i = q] − E [Yi |D1i = 0, D2i = q] = β1 + β3 q

▶ The effect of D1i now depends on the value q ∈ {0, 1} of D2i


▶ Interpretation of β3 :
increment to the effect of D1i on Yi when D2i = 1

40/75
Illustration with dummy variables
▶ For illustration, define the following two dummy variables:
  HiSTR := 1 if str ≥ 20, 0 if str < 20

  HiEL := 1 if el pct ≥ 10, 0 if el pct < 10

▶ You want to estimate:

testscr i = β0 +β1 HiSTR i +β2 HiELi +β3 (HiSTR i ×HiELi )+ui

41/75
Interpreting the results
> dt[, hi_str := ifelse(str >= 20, 1, 0)]
> dt[, hi_el_pct := ifelse(el_pct >= 10, 1, 0)]
> model7 = feols(testscr ~ hi_str * hi_el_pct, dt, vcov = 'hetero')
> model7
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 664.14329 1.38809 478.458830 < 2.2e-16 ***
hi_str -1.90784 1.93221 -0.987386 3.2403e-01
hi_el_pct -18.16295 2.34595 -7.742249 7.5024e-14 ***
hi_str:hi_el_pct -3.49434 3.12123 -1.119539 2.6356e-01
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 16.0 Adj. R2: 0.290475
▶ Effect of HiSTR i when HiELi = 0 is −1.9
▶ Effect of HiSTR i when HiELi = 1 is −1.9 − 3.5 = −5.4
▶ Class size reduction is estimated to have a bigger effect when the percent
of English learners is large
▶ However, the interaction term is not statistically significant
42/75
Interactions between a continuous and a binary regressor
▶ Baseline model:

Yi = β0 + β1 Xi + β2 Di + ui

▶ Where Di is binary and Xi is continuous


▶ β1 is the effect on Yi of changing Xi
▶ In this specification, the effect does not depend on the value
of Di
▶ To allow the effect of changing Xi to depend on Di , include
the interaction term Di × Xi as a separate regressor:

Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui

43/75
Interpreting the coefficients

Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui

▶ Compare the PRF when X changes from x to x + 1 while Di


is fixed at q ∈ {0, 1}:

E [Yi |Xi = x, Di = q] = β0 + β1 x + β2 q + β3 (q × x)
E [Yi |Xi = x + 1, Di = q] = β0 + β1 (x + 1) + β2 q + β3 (q × (x + 1))

▶ And their difference:

E [Yi |Xi = x + 1, Di = q] − E [Yi |Xi = x, Di = q] = β1 + β3 q

▶ The effect of X now depends on the value q ∈ {0, 1} of Di


▶ Interpretation of β3 : increment to the effect of Xi on Yi when
Di = 1
44/75
Two different PRFs
▶ You could view these two cases as two different PRFs:
▶ The intercept is different
▶ The slope is different
▶ To see this, rewrite the model:

Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui
= (β0 + β2 Di ) + (β1 + β3 Di )Xi + ui

▶ To make this more explicit, set Di = 0 to obtain:

Yi = β0 + β1 Xi + ui

▶ and set Di = 1 to obtain:

Yi = (β0 + β2 ) + (β1 + β3 )Xi + ui

45/75
R code for interaction terms
> model8 = feols(testscr ~ str * hi_el_pct, dt, vcov = 'hetero')
> model8
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 682.245839 11.867814 57.487067 < 2.2e-16 ***
str -0.968460 0.589102 -1.643961 0.10094
hi_el_pct 5.639141 19.514556 0.288971 0.77275
str:hi_el_pct -1.276613 0.966919 -1.320289 0.18746
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 15.8 Adj. R2: 0.305368
▶ Effect of str i when HiELi = 0 is −0.97
▶ Effect of str i when HiELi = 1 is −0.97 − 1.28 = −2.25
▶ Class size reduction is estimated to have a bigger effect when
the percent of English learners is large
▶ But which effects are significant? 46/75
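▶ Before turning to significance, the two slopes can be read off the coefficient vector directly
(a sketch assuming model8 from above):

b = coef(model8)
b['str']                          # slope of str when hi_el_pct = 0, about -0.97
b['str'] + b['str:hi_el_pct']     # slope of str when hi_el_pct = 1, about -2.25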
Comparing the two PRFs
▶ Comparing the two PRFs:

Yi = β0 + β1 Xi + ui Di = 0
Yi = (β0 + β2 ) + (β1 + β3 )Xi + ui Di = 1
▶ Three hypotheses we could look at:
1. The two PRFs are the same: β2 = 0 and β3 = 0
> wald(model8, keep = c('str:hi_el_pct', 'hi_el_pct'))
Wald test, H0: joint nullity of hi_el_pct and str:hi_el_pct
stat = 89.9, p-value < 2.2e-16, on 2 and 416 DoF, VCOV: Heteroskedasticity-robust.
Rejected
2. The two PRFs have the same slope: β3 = 0
Coefficient on the interaction term has t-statistic of -1.32
Not rejected
3. The two PRFs have the same intercept: β2 = 0
Coefficient on HiEL has t-statistic of 0.289
Not rejected
47/75
Interactions between two continuous regressors
▶ Baseline model:

Yi = β0 + β1 X1i + β2 X2i + ui

▶ Where X1i and X2i are both continuous


▶ β1 is the effect on Yi of changing X1i
▶ In this specification, the effect does not depend on the value
of X2i
▶ To allow the effect of changing X1i to depend on X2i , include
the interaction term X1i × X2i as a separate regressor:

Yi = β0 + β1 X1i + β2 X2i + β3 (X1i × X2i ) + ui

48/75
Interpreting the coefficients

Yi = β0 + β1 X1i + β2 X2i + β3 (X1i × X2i ) + ui

▶ Compare the PRF when X1i changes from x to x + 1 while


X2i is fixed at q ∈ R:

E [Yi |X1i = x, X2i = q] = β0 + β1 x + β2 q + β3 (q × x)


E [Yi |X1i = x + 1, X2i = q] = β0 + β1 (x + 1) + β2 q + β3 (q × (x + 1))

▶ And their difference:

E [Yi |X1i = x +1, X2i = q]−E [Yi |X1i = x, X2i = q] = β1 +β3 q

▶ The effect of X1i now depends on the value q ∈ R of X2i


▶ Interpretation of β3 : increment to the effect of X1i on Yi
when X2i = q 49/75
R code for continuous interaction
> model9 = feols(testscr ~ str * el_pct, dt, vcov = 'hetero')
> model9
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.338525 11.759345 58.365370 < 2.2e-16 ***
str -1.117018 0.587514 -1.901264 0.057958 .
el_pct -0.672911 0.374123 -1.798636 0.072801 .
str:el_pct 0.001162 0.018536 0.062676 0.950054
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 14.4 Adj. R2: 0.422299

50/75
Interpreting the results
▶ Estimated effect of class size reduction is nonlinear because
the size of the effect itself depends on el pct i
    el pct value    location            slope of str
    1.94            25th percentile     -1.12
    8.85            median              -1.11
    23.00           75th percentile     -1.09
    43.92           90th percentile     -1.07
▶ For example, at the median of el pct i (8.85% are English
learners), the effect of str i on test scores is −1.11
▶ The effect of str i is decreasing in el pct i (absolute value)
▶ But the differences do not seem large
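▶ The slopes in the table can be reproduced from the fitted coefficients (a sketch assuming
model9 and dt from above; the percentile values are approximate):

b = coef(model9)
ep = quantile(dt$el_pct, c(0.25, 0.50, 0.75, 0.90))   # roughly 1.94, 8.85, 23.0, 43.9
b['str'] + b['str:el_pct'] * ep                       # slope of str at each el_pct value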

51/75
Checking statistical significance
▶ Interaction term is not significant at the 5% level
▶ Neither is the coefficient on str
▶ But
> wald(model9, keep = c('str:el_pct', 'str'))
Wald test, H0: joint nullity of str and str:el_pct
stat = 3.88966, p-value = 0.0212, on 2 and 416 DoF,
VCOV: Heteroskedasticity-robust.
Rejected
▶ Yet another example in which one should not conduct a joint
hypothesis by looking at the coefficients individually
▶ An F -test is required

52/75
Summary for interaction terms

FIGURE 8.8 Regression Functions Using Binary and Continuous Variables
(a) Different intercepts, same slope; (b) different intercepts, different slopes;
(c) same intercept, different slopes

Interactions of binary variables and continuous variables can produce three different population
regression functions: (a) β0 + β1 X + β2 D allows for different intercepts but has the same slope,
(b) β0 + β1 X + β2 D + β3 (X × D) allows for different intercepts and different slopes, and
(c) β0 + β1 X + β2 (X × D) has the same intercept but allows for different slopes.
53/75
Table of Contents

1. Dummy Variables

2. Polynomials in X

3. Logarithmic Functions of X or Y

4. Interaction Terms

5. Rescaling & Shifting of Variables

54/75
CEO compensation example
▶ Suppose I estimate the following model of CEO compensation
▶ Salary for CEO i is in $000s; ROE is a %

salaryi = α + βROEi + ui

> dt = haven::read_dta('data/CEOSAL1.DTA') %>% setDT


> feols(salary ~ roe, dt, vcov = 'hetero')
OLS estimation, Dep. Var.: salary
Observations: 209
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 963.1913 121.10623 7.95328 1.1658e-13 ***
roe 18.5012 6.82945 2.70903 7.3131e-03 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1,360.0 Adj. R2: 0.008421

55/75
Interpreting the estimates

salary-hat i = 963.2 + 18.5 ROEi

▶ What do these coefficients tell us?


▶ 1 percentage point increase in ROE is associated with $18,500
increase in salary
▶ Average salary for CEO with ROE = 0 was equal to $963,200

56/75
Scaling the dependent variable
▶ What if I change measurement of salary from $000s to $s by
multiplying it by 1,000?
▶ Estimates were:

α̂ = 963.2
β̂ = 18.50

▶ Now, they will be:

α̂ = 963, 200
β̂ = 18, 500

57/75
Scaling the dependent variable
> dt[, salary_dollar := salary * 1000]

salary salary dollar


(1) (2)
Constant 963.2∗∗∗ 963,191.3∗∗∗
(121.1) (121,106.2)
roe 18.50∗∗∗ 18,501.2∗∗∗
(6.829) (6,829.4)

Observations 209 209


R2 0.01319 0.01319
Adjusted R2 0.00842 0.00842

58/75
Scaling y continued. . .
▶ Scaling y by an amount c just causes all the estimates to be
scaled by the same amount
▶ Mathematically, easy to see why:

y = α + βx + u
cy = (cα) + (cβ)x + cu

▶ New intercept = cα
▶ New slope = cβ

59/75
Interpreting scaled coefficients
▶ Notice, the scaling has no effect on the relationship between
ROE and salary
▶ I.e., because y is expressed in $s now, β̂ = 18, 500 means that
a one percentage point increase in ROE is still associated with
$18,500 increase in salary

60/75
Scaling the independent variable
▶ What if I instead change measurement of ROE from
percentage to decimal? (i.e., multiply ROE by 1/100)
▶ Estimates were:

α̂ = 963.2
β̂ = 18.50

▶ Now, they will be:

α̂ = 963.2
β̂ = 1, 850

61/75
Scaling the independent variable
> dt[, roe_dec := roe / 100]

salary
(1) (2)
Constant 963.2∗∗∗ 963.2∗∗∗
(121.1) (121.1)
roe 18.50∗∗∗
(6.829)
roe dec 1,850.1∗∗∗
(682.9)

Observations 209 209


R2 0.01319 0.01319
Adjusted R2 0.00842 0.00842

62/75
Scaling x continued. . .
▶ Scaling x by an amount k just causes the slope on x to be
scaled by 1/k
▶ Mathematically, easy to see why:

y = α + βx + u
y = α + (β/k)(kx) + u
▶ New slope = β/k
▶ Will interpretation of estimates change?
▶ Answer: Again, no!

63/75
Scaling both x and y
▶ If we scale y by an amount c and x by amount k, then we get:

y = α + βx + u
cy = (cα) + (cβ/k)(kx) + cu
▶ Intercept scaled by c
▶ Slope scaled by c/k
▶ When is scaling useful?

64/75
Scaling both x and y
salary salary dollar
(1) (2) (3) (4)
Constant 963.2∗∗∗ 963.2∗∗∗ 963,191.3∗∗∗ 963,191.3∗∗∗
(121.1) (121.1) (121,106.2) (121,106.2)
roe 18.50∗∗∗ 18,501.2∗∗∗
(6.829) (6,829.4)
roe dec 1,850.1∗∗∗ 1,850,118.6∗∗∗
(682.9) (682,944.8)

Observations 209 209 209 209


R2 0.01319 0.01319 0.01319 0.01319
Adjusted R2 0.00842 0.00842 0.00842 0.00842

65/75
Practical applications of scaling #1
▶ No one wants to see a coefficient of 0.000000456 or
1,234,567,890
▶ Just scale the variables for cosmetic purposes!
▶ It will affect coefficients & SEs
▶ However, it won’t affect t-stats or inference
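▶ A sketch of this invariance with the CEO data (assuming dt still holds that data and the
roe_dec column created earlier):

m_raw  = feols(salary ~ roe,     dt, vcov = 'hetero')
m_resc = feols(salary ~ roe_dec, dt, vcov = 'hetero')
coeftable(m_raw)    # the 't value' column here ...
coeftable(m_resc)   # ... is identical to the one in the rescaled regression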

66/75
Practical applications of scaling #2
▶ To improve interpretation, in terms of estimated magnitudes,
it’s helpful to scale the variables by their sample standard
deviations
▶ Let σx and σy be sample standard deviations of x and y
respectively
▶ Let c, the scalar for y , be equal to 1/σy
▶ Let k, the scalar for x, be equal to 1/σx
▶ I.e., units of x and y are now standard deviations

67/75
Practical applications of scaling #2
▶ With the prior rescaling, how would we interpret a slope
coefficient of 0.25?
▶ Answer = a 1 s.d. increase in x is associated with a 0.25 (one-quarter) s.d.
increase in y
▶ The slope tells us how many standard deviations y changes,
on average, for a standard deviation change in x
▶ Is 0.25 large in magnitude? What about 0.01?

68/75
Standard deviation interpretation
> dt[, salary_sd := salary / sd(salary)]
> dt[, roe_sd := roe / sd(roe)]
salary salary sd
(1) (2)
Constant 963.2∗∗∗ 0.7019∗∗∗
(121.1) (0.0882)
roe 18.50∗∗∗
(6.829)
roe sd 0.1148∗∗∗
(0.0424)

Observations 209 209


R2 0.01319 0.01319
Adjusted R2 0.00842 0.00842
▶ 1 sd increase in roe ⇒ 0.11 sd increase in salary (can also
calculate from unscaled regression, just a bit tedious)
> dt[, sd(salary)] # 1372.34530795889
> dt[, sd(roe)] # 8.5185086590749
> dt[, sd(roe)] * 18.5012 / dt[, sd(salary)]
0.114841819685806 69/75
Shifting the variables
▶ Suppose we instead add c to y and k to x (i.e., we shift y
and x up by c and k respectively)
▶ Will the estimated slope change?

70/75
Shifting continued. . .
▶ No! Only the estimated intercept will change
▶ Mathematically, easy to see why:

y = α + βx + u
y + c = α + c + βx + u
y + c = α + c + β(x + k) − βk + u
y + c = (α + c − βk) + β(x + k) + u

▶ New intercept = α + c − βk
▶ Slope remains the same

71/75
Shifting continued. . .
> dt[, salary_demean := salary - mean(salary)]
> dt[, roe_demean := roe - mean(roe)]

salary salary demean


(1) (2) (3) (4)
Constant 963.2∗∗∗ 1,281.1∗∗∗ -317.9∗∗∗ 1.37 × 10−13
(121.1) (94.53) (121.1) (94.53)
roe 18.50∗∗∗ 18.50∗∗∗
(6.829) (6.829)
roe demean 18.50∗∗∗ 18.50∗∗∗
(6.829) (6.829)

Observations 209 209 209 209


R2 0.01319 0.01319 0.01319 0.01319
Adjusted R2 0.00842 0.00842 0.00842 0.00842

72/75
Practical application of shifting
▶ To improve interpretation, sometimes helpful to demean x by
its sample mean
▶ Let µx be the sample mean of x; regress y on x − µx
▶ Intercept now reflects expected value of y for x = µx

y = (α + βµx ) + β(x − µx ) + u
E (y |x = µx ) = α + βµx
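▶ A sketch with the CEO data (assuming dt and the roe_demean column from the previous slide);
because the OLS line passes through the sample means, the intercept after demeaning should equal
the sample average of salary:

m_dm = feols(salary ~ roe_demean, dt, vcov = 'hetero')
coef(m_dm)['(Intercept)']   # should match ...
mean(dt$salary)             # ... the sample mean (about 1,281, as in column (2) of the earlier table)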

73/75
Shifting in interaction

Y = α′ + β1′ X1 + β2′ X2 + β3′ X1 X2 + ϵ′


Y = α + β1 (X1 − µX1 ) + β2 (X2 − µX2 ) + β3 (X1 − µX1 )(X2 − µX2 ) + ϵ

▶ β3 = β3′ ⇒ Shifting does not affect the interaction term


▶ β1 vs. β1′
▶ β1′ : Effect of X1 when X2 is zero (Y = α′ + β1′ X1 + ϵ′ )
▶ β1 : Effect of X1 when X2 is mean (Y = α + β1 (X1 − µX1 ) + ϵ)
▶ β2 vs. β2′
▶ β2′ : Effect of X2 when X1 is zero
▶ β2 : Effect of X2 when X1 is mean
▶ Shifting does not affect the coefficient on the interaction term, but it does
affect the coefficients on the single terms.
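▶ A sketch of this invariance (assuming dt, roe_demean from earlier, and a ros_demean column
created here for illustration; the table on the next slide reports the same pattern):

dt[, ros_demean := ros - mean(ros)]
m_raw = feols(salary ~ roe * ros, dt, vcov = 'hetero')
m_dm  = feols(salary ~ roe_demean * ros_demean, dt, vcov = 'hetero')
coef(m_raw)['roe:ros']                  # same interaction coefficient ...
coef(m_dm)['roe_demean:ros_demean']     # ... as after demeaning both variables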

74/75
Shifting in interaction
salary
(1) (2) (3) (4)
Constant 1,018.3∗∗∗ 1,278.4∗∗∗ 1,367.7∗∗∗ 911.2∗∗∗
(160.0) (100.9) (140.0) (116.0)
roe 20.33∗ 21.37∗∗∗
(11.87) (8.074)
ros -1.734 -1.445∗
(1.250) (0.8687)
roe × ros 0.0168
(0.0831)
roe demean 21.37∗∗∗ 20.33∗
(8.074) (11.87)
ros demean -1.445∗ -1.734
(0.8687) (1.250)
roe demean × ros demean 0.0168
(0.0831)
roe demean × ros 0.0168
(0.0831)
roe × ros demean 0.0168
(0.0831)

Observations 209 209 209 209


R2 0.01784 0.01784 0.01784 0.01784
Adjusted R2 0.00347 0.00347 0.00347 0.00347 75/75
