Chapter 08 Nonlinear Regression Functions (1)
Po-Yu Liu
April 8, 2025
1/75
Table of Contents
1. Dummy Variables
2. Polynomials in X
3. Logarithmic Functions of X or Y
4. Interaction Terms
2/75
Dummy variable example
▶ Suppose you have G dummy variables that are mutually exclusive and
exhaustive: every member of the population belongs to exactly one group
▶ Example: smoking status with G = 2
▶ Two dummies:
▶ smoker equals 1 if the person is a smoker (zero otherwise)
▶ nonsmoker equals 1 if the person is a non-smoker (zero otherwise)
▶ If you are interested in the association between smoking and
birthweight, consider the following specifications:
▶ birthweight = β0 + β1 smoker + ui
▶ birthweight = β0 + β2 nonsmoker + ui
▶ birthweight = β0 + β1 smoker + β2 nonsmoker + ui
3/75
Dummy variable trap
> dt = readxl::read_xlsx('data/birthweight_smoking.xlsx') %>% setDT
> model1 = feols(birthweight ~ smoker, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3432.060 11.8905 288.63802 < 2.2e-16 ***
smoker -253.228 26.8104 -9.44516 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 583.5 Adj. R2: 0.02828
5/75
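The trap itself is not demonstrated above; a minimal sketch of what happens if both dummies enter the regression (the nonsmoker dummy is constructed here for illustration). Because smoker + nonsmoker = 1 for every observation, the regressors are perfectly collinear with the intercept, and feols should detect this and drop one of them:

# dt was loaded on the previous slide
dt[, nonsmoker := 1 - smoker]  # complementary dummy, for illustration
# Dummy variable trap: smoker + nonsmoker equals the intercept column,
# so one regressor is perfectly collinear and gets removed by feols
model_trap = feols(birthweight ~ smoker + nonsmoker, dt, vcov = 'hetero')
model_trap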
Example: number of prenatal visits
▶ Example: number of prenatal visits with G = 4
▶ Four dummies:
▶ tripre0 equals 1 if the mother never went for prenatal health visits
(presumably a problematic group)
▶ tripre1 equals 1 if the first prenatal health visit was in the 1st trimester
(presumably the most common group)
▶ tripre2 equals 1 if the first prenatal health visit was in the 2nd trimester
▶ tripre3 equals 1 if the first prenatal health visit was in the 3rd trimester
▶ We’ve just learned: only need to use a subset of three
dummies
▶ Which subset should we use?
▶ It doesn’t matter: as long as we use any three, we are not
throwing out any information
▶ However: the unused dummy implicitly defines the benchmark
group
6/75
Benchmark groups in regression
▶ Benchmark: first prenatal health visit in 1st trimester
> fml = birthweight ~ smoker + alcohol + tripre0 + tripre2 + tripre3
> model1 = feols(fml, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: birthweight
Observations: 3,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3454.5493 12.4817 276.769748 < 2.2e-16 ***
smoker -228.8476 26.5489 -8.619854 < 2.2e-16 ***
alcohol -15.1000 69.7031 -0.216633 8.2851e-01
tripre0 -697.9687 146.5788 -4.761732 2.0106e-06 ***
tripre2 -100.8373 31.5530 -3.195810 1.4089e-03 **
tripre3 -136.9553 67.6958 -2.023099 4.3152e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 578.1 Adj. R2: 0.044873
1. Dummy Variables
2. Polynomials in X
3. Logarithmic Functions of X or Y
4. Interaction Terms
9/75
Motivation: Relationship with varying slopes
FIGURE 8.1 Population Regression Functions with Different Slopes
(a) Constant slope   (b) Slope depends on the value of X1   (c) Slope depends on the value of X2
In Figure 8.1(a), the population regression function has a constant slope. In Figure 8.1(b), the slope of the population regression function depends on the value of X1. In Figure 8.1(c), the slope of the population regression function depends on the value of X2.
10/75
Motivation: Linear relationship does not fit data well
FIGURE 8.2 Scatterplot of Test Scores vs. District Income with a Linear OLS Regression Function
(Test scores and district income, in thousands of dollars, are strongly positively correlated, but a straight line does not fit the scatter well.)
▶ A polynomial regression of degree r lets the regression function curve:
Yi = β0 + β1 Xi + β2 Xi^2 + · · · + βr Xi^r + ui
12/75
Polynomial regression example
▶ We will illustrate the use of polynomials using the textbook’s
data on test scores and student-teacher ratios
▶ Here we focus on the following two variables only:
▶ testscr is average test score in school district i
▶ avginc is the average income in school district i (thousands of
dollars per capita)
▶ Quadratic specification:
testscr_i = β0 + β1 avginc_i + β2 avginc_i^2 + u_i
▶ Cubic specification:
testscr_i = β0 + β1 avginc_i + β2 avginc_i^2 + β3 avginc_i^3 + u_i
13/75
Estimation of the quadratic specification in R
> dt = haven::read_dta('data/caschool.dta') %>% setDT
> fml = testscr ~ avginc + I(avginc^2)
> model1 = feols(fml, dt, vcov = 'hetero')
> model1
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 607.301735 2.901754 209.28782 < 2.2e-16 ***
avginc 3.850995 0.268094 14.36434 < 2.2e-16 ***
I(avginc^2) -0.042308 0.004780 -8.85051 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.7 Adj. R2: 0.554045
14/75
Compare the predicted values between linear and quadratic specification
FIGURE 8.3 Scatterplot of Test Scores vs. District Income with Linear and Quadratic Regression Functions
▶ The t-statistic on the quadratic term exceeds (in absolute value) the 5% critical value of this test (which is 1.96); indeed, the p-value for the t-statistic is less than 0.01%, so we can reject the hypothesis that β2 = 0 at all conventional significance levels
▶ This formal hypothesis test supports our informal inspection of Figures 8.2 and 8.3: the quadratic model fits the data better than the linear model
15/75
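The fitted-value comparison in Figure 8.3 can be reproduced along these lines (a sketch; the linear model is fit here only for comparison, and base graphics are used for the plot):

# Linear model for comparison with the quadratic model1 estimated above
model_lin = feols(testscr ~ avginc, dt, vcov = 'hetero')

# Plot the data and overlay both fitted regression functions
ord = order(dt$avginc)
plot(dt$avginc, dt$testscr,
     xlab = 'District income (thousands of dollars)', ylab = 'Test score')
lines(dt$avginc[ord], predict(model_lin)[ord], lwd = 2)            # linear fit
lines(dt$avginc[ord], predict(model1)[ord],  lwd = 2, lty = 2)     # quadratic fit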
Interpretation of polynomial regression
▶ How do we interpret the estimated PRF?
▶ Using the estimates from the quadratic regression above, the estimated PRF is:
predicted testscr = 607.3 + 3.85 avginc − 0.0423 avginc^2
16/75
Interpretation of polynomial regression
▶ Predicted effects for different values of avginc i :
∆avginc ∆testscr
From $5,000 to $6,000 3.4
From $25,000 to $26,000 1.7
From $45,000 to $46,000 0.0
▶ The effect of changing avginc i on testscr i is decreasing in
avginc i
▶ The second derivative is negative (because the coefficient
estimate on the quadratic term is negative)
▶ Caution: do not extrapolate outside the range of the data
17/75
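These predicted effects come straight from the estimated quadratic PRF; a minimal sketch of the calculation, using the coefficients from the output above:

b = coef(model1)   # names: (Intercept), avginc, I(avginc^2)
effect = function(x0, x1) {
  # Change in predicted testscr when avginc moves from x0 to x1 (in $000s)
  (b['avginc'] * x1 + b['I(avginc^2)'] * x1^2) -
    (b['avginc'] * x0 + b['I(avginc^2)'] * x0^2)
}
effect(5, 6)    # ~ 3.4
effect(25, 26)  # ~ 1.7
effect(45, 46)  # ~ 0.0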
Estimation of the cubic specification in R
> dt[, avginc2 := avginc^2]
> dt[, avginc3 := avginc^3]
> fml = testscr ~ avginc + avginc2 + avginc3
> model2 = feols(fml, dt, vcov = 'hetero')
> model2
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 600.078985 5.102062 117.61500 < 2.2e-16 ***
avginc 5.018677 0.707350 7.09504 5.6063e-12 ***
avginc2 -0.095805 0.028954 -3.30890 1.0181e-03 **
avginc3 0.000685 0.000347 1.97509 4.8919e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.6 Adj. R2: 0.555228
18/75
Testing for linearity
▶ Test the null hypothesis of linearity against the alternative that the
population regression is quadratic and/or cubic:
H0: β2 = 0 and β3 = 0 vs. H1: at least one of β2, β3 is nonzero
▶ This is a joint hypothesis, so it calls for an F-test (see the sketch below)
19/75
Testing for linearity
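The test itself is not shown on these slides; a minimal sketch using fixest's wald() (the same helper used later in this deck), applied to the cubic model2:

# Joint test of H0: the coefficients on avginc2 and avginc3 are both zero
wald(model2, keep = c('avginc2', 'avginc3'))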
1. Dummy Variables
2. Polynomials in X
3. Logarithmic Functions of X or Y
4. Interaction Terms
21/75
The (natural) log
[Figure: the natural logarithm y = ln(x), plotted for x from 0 to 120]
▶ A key approximation:
ln(x + ∆x) − ln(x) ≅ ∆x / x   (when ∆x / x is small)   (8.16)
where “≅” means “approximately equal to.” The derivation of this approximation
relies on calculus, but it is readily demonstrated by trying out some values of x
22/75
Using logarithmic transformations in regression
▶ Using logarithmic transformations of both the dependent and
independent variables can be useful when estimating
coefficients
▶ (Two main reasons: % interpretation & mitigate outliers)
▶ Using the student test score example, let’s focus on two
variables:
▶ Yi : test score in school district i
▶ Xi : average income in school district i (this is a proxy for
socio-economic status of the district)
▶ Let’s look at the simple regression model:
Yi = β0 + β1 Xi + u
▶ We estimate β1 by running a regression of Yi on Xi
▶ But what do we estimate when instead we:
▶ Run a regression of ln Yi on Xi
▶ Run a regression of Yi on ln Xi
▶ Run a regression of ln Yi on ln Xi
23/75
Properties of the logarithm
▶ The logarithm has useful features based on calculus
▶ Compare the independent variable at two values x1 and x0 (it
works the same way for the dependent variable)
▶ Starting at x0 , you change the variable by ∆x := x1 − x0
▶ Define the following: x̃1 = ln(x1 ) and x̃0 = ln(x0 )
▶ The corresponding change in the logarithm is:
∆x̃ := x̃1 − x̃0 = ln(x1 ) − ln(x0 ) ≅ ∆x / x0
i.e., the change in the log is approximately the proportional change in x
25/75
Example of logarithmic approximation
> log(11)-log(10)
[1] 0.09531018
> log(110)-log(100)
[1] 0.09531018
> log(1100)-log(1000)
[1] 0.09531018
> log(11000)-log(10000)
[1] 0.09531018
▶ No matter how big x is, a 10% increase in x changes log(x) by the
same amount, ≈ 0.095 ≈ 0.1: ∆ log(x) measures the proportional change
26/75
Back to the regression model
▶ You create log-versions of both Xi and Yi :
Xei := ln Xi
Yei := ln Yi
▶ Now compare the following four specifications:
Specification Population regression function
(1) linear-linear Yi = β0 + β1 Xi
(2) linear-log Yi = β0 + β1 Xei
(3) log-linear Yei = β0 + β1 Xi
(4) log-log Yei = β0 + β1 Xei
▶ The interpretation of the slope coefficient β1 differs in each
case
▶ The generic interpretation of the slope coefficient β1 is:
By how much does the dependent variable change, on
average, when the independent variable changes by one unit?
27/75
What does this mean in the different specifications?
(1) β1 = ∆Yi / ∆Xi , so ∆Xi = 1 =⇒ ∆Yi = β1
X up by 1 unit, Y up by β1 units
(2) β1 = ∆Yi / ∆X̃i , so ∆X̃i = 1 =⇒ ∆Yi = β1
X up by 100%, Y up by β1 units
(3) β1 = ∆Ỹi / ∆Xi , so ∆Xi = 1 =⇒ ∆Ỹi = β1
X up by 1 unit, Y up by 100 · β1 %
(4) β1 = ∆Ỹi / ∆X̃i , so ∆X̃i = 1 =⇒ ∆Ỹi = β1
X up by 100%, Y up by 100 · β1 %
28/75
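A sketch of how all four specifications could be estimated and compared side by side with fixest (etable() collects the results; variable names as elsewhere in the deck):

m1 = feols(testscr ~ avginc,           dt, vcov = 'hetero')  # linear-linear
m2 = feols(testscr ~ log(avginc),      dt, vcov = 'hetero')  # linear-log
m3 = feols(log(testscr) ~ avginc,      dt, vcov = 'hetero')  # log-linear
m4 = feols(log(testscr) ~ log(avginc), dt, vcov = 'hetero')  # log-log
etable(m1, m2, m3, m4)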
Linear-log specification
> model3 = feols(testscr ~ log(avginc), dt, vcov = 'hetero')
> model3
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 557.8323 3.83994 145.2711 < 2.2e-16 ***
log(avginc) 36.4197 1.39694 26.0710 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 12.6 Adj. R2: 0.561461
▶ Interpretation:
A 100% increase in avginc is associated with an increase in testscr
by 36.42 points (the measurement units of testscr ) on average
▶ Or alternatively:
A 1% increase in avginc is associated with an increase in testscr by
0.3642 points on average
29/75
Linear-log specification
FIGURE 8.5 The Linear-Log Regression Function
(test scores plotted against district income, in thousands of dollars, with the fitted linear-log regression function)

Log-linear specification
▶ Interpretation:
An increase of one unit in avginc (i.e., $1,000) is associated with an increase
in testscr of about 0.28% on average
31/75
Log-log specification
> model5 = feols(log(testscr) ~ log(avginc), dt, vcov = 'hetero')
> model5
OLS estimation, Dep. Var.: log(testscr)
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.336349 0.005925 1069.5006 < 2.2e-16 ***
log(avginc) 0.055419 0.002145 25.8414 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.019339 Adj. R2: 0.556725
▶ Interpretation:
A 100% increase in avginc is associated with an increase in testscr of 5.5%
on average
▶ Or alternatively:
A 1% increase in avginc is associated with an increase in testscr of 0.055%
on average
▶ The coefficient β1 measures the elasticity of Y with respect to X
32/75
Log-linear and log-log specifications
FIGURE 8.6 The Log-Linear and Log-Log Regression Functions
(ln(test score) plotted against district income, in thousands of dollars, with the fitted log-log regression curve)
As you can see in Figure 8.6, the log-log specification fits better than the log-
linear specification. This is consistent with the higher R2 for the log-log regression
(0.557) than for the log-linear regression (0.497). Even so, the log-log specification
does not fit the data especially well: At the lower values of income, most of the obser-
vations fall below the log-log curve, while in the middle income range most of the
observations fall above the estimated regression function.
33/75
Table of Contents
1. Dummy Variables
2. Polynomials in X
3. Logarithmic Functions of X or Y
4. Interaction Terms
34/75
Interactions between two binary regressors
▶ We will illustrate the use of interaction terms using the
textbook’s data on test scores and student-teacher ratios
▶ Consider the following multiple regression model (estimated on the next slide):
testscr_i = β0 + β1 str_i + β2 el_pct_i + u_i
35/75
R code for multiple regression
▶ When you run this regression in R, this is what you get:
> model6 = feols(testscr ~ str + el_pct, dt, vcov = 'hetero')
> model6
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.032249 8.728224 78.59930 < 2.2e-16 ***
str -1.101296 0.432847 -2.54431 0.011309 *
el_pct -0.649777 0.031032 -20.93909 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 14.4 Adj. R2: 0.42368
36/75
Interpreting the results
▶ If district i decreased str_i by one unit while holding el_pct_i
constant, it could expect an increase in average test scores of 1.10 points
▶ If district i decreased el_pct_i by one percentage point (say from
25% to 24%) while holding str_i constant, it could expect an increase
in average test scores of 0.65 points
▶ Both effects are statistically significant at the 5% level
37/75
Effect of class size reduction
▶ Perhaps a class size reduction is more effective in some
circumstances than in others
▶ Perhaps the effect of student-teacher ratio on test scores
varies with the percentage of English learners
▶ This would be the case, for example, if English learners
benefit disproportionately from smaller class sizes (and
therefore lower student-teacher ratios)
▶ More technically, ∆testscr / ∆str might depend on el_pct
▶ More generally, ∆Y / ∆X1 might depend on X2
▶ How to model such interactions between X1 and X2 ?
38/75
Baseline model with interaction terms
▶ Baseline model:
Yi = β0 + β1 D1i + β2 D2i + ui
▶ With an interaction between the two binary regressors:
Yi = β0 + β1 D1i + β2 D2i + β3 (D1i × D2i ) + ui
39/75
Interpreting the coefficients
▶ Compare the PRF when D1i changes from 0 to 1
while D2i is fixed at q ∈ {0, 1}:
E [Yi |D1i = 0, D2i = q] = β0 + β2 q
E [Yi |D1i = 1, D2i = q] = β0 + β1 + β2 q + β3 q
▶ The difference is β1 + β3 q: the effect of D1i depends on the value of D2i
40/75
Illustration with dummy variables
▶ For illustration, define the following two dummy variables:
HiSTR := 1 if str ≥ 20, and 0 if str < 20
HiEL := 1 if el_pct ≥ 10, and 0 if el_pct < 10
41/75
Interpreting the results
> dt[, hi_str := ifelse(str >= 20, 1, 0)]
> dt[, hi_el_pct := ifelse(el_pct >= 10, 1, 0)]
> model7 = feols(testscr ~ hi_str * hi_el_pct, dt, vcov = 'hetero')
> model7
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 664.14329 1.38809 478.458830 < 2.2e-16 ***
hi_str -1.90784 1.93221 -0.987386 3.2403e-01
hi_el_pct -18.16295 2.34595 -7.742249 7.5024e-14 ***
hi_str:hi_el_pct -3.49434 3.12123 -1.119539 2.6356e-01
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 16.0 Adj. R2: 0.290475
▶ Effect of HiSTR i when HiELi = 0 is −1.9
▶ Effect of HiSTR i when HiELi = 1 is −1.9 − 3.5 = −5.4
▶ Class size reduction is estimated to have a bigger effect when the percent
of English learners is large
▶ However, the interaction term is not statistically significant
42/75
Interactions between a continuous and a binary regressor
▶ Baseline model:
Yi = β0 + β1 Xi + β2 Di + ui
Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui
43/75
Interpreting the coefficients
Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui
E [Yi |Xi = x, Di = q] = β0 + β1 x + β2 q + β3 (q × x)
E [Yi |Xi = x + 1, Di = q] = β0 + β1 (x + 1) + β2 q + β3 (q × (x + 1))
▶ The difference is β1 + β3 q: the effect of a one-unit increase in Xi depends on Di
▶ Equivalently, grouping terms:
Yi = β0 + β1 Xi + β2 Di + β3 (Di × Xi ) + ui
   = (β0 + β2 Di ) + (β1 + β3 Di )Xi + ui
▶ When Di = 0, this reduces to Yi = β0 + β1 Xi + ui
45/75
R code for interaction terms
> model8 = feols(testscr ~ str * hi_el_pct, dt, vcov = 'hetero')
> model8
OLS estimation, Dep. Var.: testscr
Observations: 420
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 682.245839 11.867814 57.487067 < 2.2e-16 ***
str -0.968460 0.589102 -1.643961 0.10094
hi_el_pct 5.639141 19.514556 0.288971 0.77275
str:hi_el_pct -1.276613 0.966919 -1.320289 0.18746
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 15.8 Adj. R2: 0.305368
▶ Effect of str i when HiELi = 0 is −0.97
▶ Effect of str i when HiELi = 1 is −0.97 − 1.28 = −2.25
▶ Class size reduction is estimated to have a bigger effect when
the percent of English learners is large
▶ But which effects are significant?
46/75
Comparing the two PRFs
▶ Comparing the two PRFs:
Yi = β0 + β1 Xi + ui Di = 0
Yi = (β0 + β2 ) + (β1 + β3 )Xi + ui Di = 1
▶ Three hypotheses we could look at:
1. The two PRFs are the same: β2 = 0 and β3 = 0
> wald(model8, keep = c('str:hi_el_pct', 'hi_el_pct'))
Wald test, H0: joint nullity of hi_el_pct and str:hi_el_pct
stat = 89.9, p-value < 2.2e-16, on 2 and 416 DoF, VCOV: Heteroskedasticity-robust.
Rejected
2. The two PRFs have the same slope: β3 = 0
Coefficient on the interaction term has t-statistic of -1.32
Not rejected
3. The two PRFs have the same intercept: β2 = 0
Coefficient on HiEL has t-statistic of 0.289
Not rejected
47/75
Interactions between two continuous regressors
▶ Baseline model:
Yi = β0 + β1 X1i + β2 X2i + ui
▶ With an interaction between the two continuous regressors:
Yi = β0 + β1 X1i + β2 X2i + β3 (X1i × X2i ) + ui
48/75
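The estimation itself is not shown on these slides; a sketch of how model9 (referenced below) would presumably be fit, with the interaction entered via the * operator:

# Continuous-by-continuous interaction: str, el_pct, and str x el_pct
model9 = feols(testscr ~ str * el_pct, dt, vcov = 'hetero')
model9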
Interpreting the coefficients
▶ With the interaction term, the effect of X1 on Y depends on X2 :
∆Y / ∆X1 = β1 + β3 X2
▶ In the test-score example, the effect of str on testscr depends on el_pct:
∆testscr / ∆str = β1 + β3 el_pct
50/75
Interpreting the results
▶ Estimated effect of class size reduction is nonlinear because
the size of the effect itself depends on el pct i
el_pct value    Location           Slope of str
1.94            25th percentile    -1.12
8.85            Median             -1.11
23.00           75th percentile    -1.09
43.92           90th percentile    -1.07
▶ For example, at the median of el pct i (8.85% are English
learners), the effect of str i on test scores is −1.11
▶ The effect of str i is decreasing in el pct i (absolute value)
▶ But the differences do not seem large
51/75
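Assuming model9 from the earlier sketch, the slopes in the table above can be computed directly from the estimated coefficients (coefficient names as reported by fixest):

b = coef(model9)
el_vals = quantile(dt$el_pct, c(0.25, 0.50, 0.75, 0.90))
# Slope of str at each value of el_pct: beta1 + beta3 * el_pct
b['str'] + b['str:el_pct'] * el_vals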
Checking statistical significance
▶ Interaction term is not significant at the 5% level
▶ Neither is the coefficient on str
▶ But
> wald(model9, keep = c('str:el_pct', 'str'))
Wald test, H0: joint nullity of str and str:el_pct
stat = 3.88966, p-value = 0.0212, on 2 and 416 DoF,
VCOV: Heteroskedasticity-robust.
Rejected
▶ Yet another example in which one should not conduct a joint
hypothesis by looking at the coefficients individually
▶ An F -test is required
52/75
Summary for interaction terms
[Figure: population regression functions with a binary regressor D and a continuous regressor X]
(a) Different intercepts, same slope
(b) Different intercepts, different slopes
(c) Same intercept, different slopes
Interactions of binary variables and continuous variables can produce three different population regression functions:
(a) β0 + β1 X + β2 D allows for different intercepts but has the same slope, (b) β0 + β1 X + β2 D + β3 (X × D) allows
for different intercepts and different slopes, and (c) β0 + β1 X + β2 (X × D) has the same intercept but allows for
different slopes.
53/75
Table of Contents
1. Dummy Variables
2. Polynomials in X
3. Logarithmic Functions of X or Y
4. Interaction Terms
54/75
CEO compensation example
▶ Suppose I estimate the following model of CEO compensation
▶ Salary for CEO i is in $000s; ROE is a %
salaryi = α + βROEi + ui
55/75
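The estimation is not shown; a minimal sketch, assuming a data.table dt that contains the CEO's salary (in $000s) and roe (in %), the variables used in the tables that follow (model_ceo is a hypothetical name):

# Point estimates should match the next slides: alpha-hat = 963.2, beta-hat = 18.50
model_ceo = feols(salary ~ roe, dt)
model_ceo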
Interpreting the estimates
▶ The estimates (reported on the following slides) are α̂ = 963.2 and β̂ = 18.50
▶ Since salary is measured in $000s, a one-percentage-point increase in ROE is
associated with an $18,500 higher salary on average
▶ The intercept of 963.2 is the predicted salary ($963,200) when ROE = 0
56/75
Scaling the dependent variable
▶ What if I change measurement of salary from $000s to $s by
multiplying it by 1,000?
▶ Estimates were:
α̂ = 963.2
β̂ = 18.50
▶ After rescaling salary, the estimates become:
α̂ = 963,200
β̂ = 18,500
57/75
Scaling the dependent variable
> dt[, salary_dollar := salary * 1000]
58/75
Scaling y continued. . .
▶ Scaling y by an amount c just causes all the estimates to be
scaled by the same amount
▶ Mathematically, easy to see why:
y = α + βx + u
cy = (cα) + (cβ)x + cu
▶ New intercept = cα
▶ New slope = cβ
59/75
Interpreting scaled coefficients
▶ Notice, the scaling has no effect on the relationship between
ROE and salary
▶ I.e., because y is expressed in $s now, β̂ = 18, 500 means that
a one percentage point increase in ROE is still associated with
$18,500 increase in salary
60/75
Scaling the independent variable
▶ What if I instead change measurement of ROE from
percentage to decimal? (i.e., multiply ROE by 1/100)
▶ Estimates were:
α̂ = 963.2
β̂ = 18.50
▶ After rescaling ROE, the estimates become:
α̂ = 963.2
β̂ = 1,850
61/75
Scaling the independent variable
> dt[, roe_dec := roe / 100]
                        salary
                   (1)            (2)
Constant         963.2***       963.2***
                (121.1)        (121.1)
roe               18.50***
                 (6.829)
roe_dec                       1,850.1***
                               (682.9)
62/75
Scaling x continued. . .
▶ Scaling x by an amount k just causes the slope on x to be
scaled by 1/k
▶ Mathematically, easy to see why:
y = α + βx + u
y = α + (β/k)(kx) + u
▶ New slope = β/k
▶ Will interpretation of estimates change?
▶ Answer: Again, no!
63/75
Scaling both x and y
▶ If we scale y by an amount c and x by amount k, then we get:
y = α + βx + u
cy = (cα) + (cβ/k)(kx) + cu
▶ Intercept scaled by c
▶ Slope scaled by c/k
▶ When is scaling useful?
64/75
Scaling both x and y
                        salary                      salary_dollar
                  (1)          (2)             (3)              (4)
Constant        963.2***     963.2***      963,191.3***     963,191.3***
               (121.1)      (121.1)       (121,106.2)      (121,106.2)
roe              18.50***                  18,501.2***
                (6.829)                    (6,829.4)
roe_dec                     1,850.1***                      1,850,118.6***
                             (682.9)                         (682,944.8)
65/75
Practical applications of scaling #1
▶ No one wants to see a coefficient of 0.000000456 or
1,234,567,890
▶ Just scale the variables for cosmetic purposes!
▶ It will affect coefficients & SEs
▶ However, it won’t affect t-stats or inference
66/75
Practical applications of scaling #2
▶ To improve interpretation, in terms of estimated magnitudes,
it’s helpful to scale the variables by their sample standard
deviations
▶ Let σx and σy be sample standard deviations of x and y
respectively
▶ Let c, the scalar for y , be equal to 1/σy
▶ Let k, the scalar for x, be equal to 1/σx
▶ I.e., units of x and y are now standard deviations
67/75
Practical applications of scaling #2
▶ With the prior rescaling, how would we interpret a slope
coefficient of 0.25?
▶ Answer: a 1 s.d. increase in x is associated with a 0.25 (i.e., 1/4) s.d.
increase in y
▶ The slope tells us how many standard deviations y changes,
on average, for a standard deviation change in x
▶ Is 0.25 large in magnitude? What about 0.01?
68/75
Standard deviation interpretation
> dt[, salary_sd := salary / sd(salary)]
> dt[, roe_sd := roe / sd(roe)]
                     salary        salary_sd
                  (1)            (2)
Constant        963.2***        0.7019***
               (121.1)         (0.0882)
roe              18.50***
                (6.829)
roe_sd                          0.1148***
                               (0.0424)
70/75
Shifting continued. . .
▶ Will shifting y by c and x by k (e.g., demeaning) change the slope?
No! Only the estimated intercept will change
▶ Mathematically, easy to see why:
y = α + βx + u
y + c = (α + c) + βx + u
y + c = (α + c) + β(x + k) − βk + u
y + c = (α + c − βk) + β(x + k) + u
▶ New intercept = α + c − βk
▶ Slope remains the same
71/75
Shifting continued. . .
> dt[, salary_demean := salary - mean(salary)]
> dt[, roe_demean := roe - mean(roe)]
72/75
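The re-estimation is not shown on the slide; a sketch of what it would look like — the slope should equal β̂ = 18.50 from before, and with both variables demeaned the intercept is (essentially) zero:

model_demean = feols(salary_demean ~ roe_demean, dt)
coef(model_demean)   # slope unchanged; intercept ~ 0 since both variables have mean zero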
Practical application of shifting
▶ To improve interpretation, sometimes helpful to demean x by
its sample mean
▶ Let µx be the sample mean of x; regress y on x − µx
▶ Intercept now reflects expected value of y for x = µx
y = (α + βµx ) + β(x − µx ) + u
E (y |x = µx ) = α + βµx
73/75
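A quick way to see this in the CEO example, using roe_demean created earlier: when y is regressed on demeaned x, the estimated intercept equals the sample mean of y:

model_shift = feols(salary ~ roe_demean, dt)
coef(model_shift)['(Intercept)']   # equals mean(dt$salary)
mean(dt$salary)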
Shifting in interaction
▶ Now add a second regressor, ros, and its interaction with roe:
salary = α + β1 roe + β2 ros + β3 (roe × ros) + u
▶ What happens when roe and/or ros are demeaned before forming the interaction?
▶ Demeaning changes the main-effect coefficients (and the intercept), but not
the coefficient on the interaction term
74/75
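A sketch of how the four columns in the table below could be produced (ros_demean is constructed here analogously to roe_demean; etable() lays the models out side by side):

dt[, ros_demean := ros - mean(ros)]
m1 = feols(salary ~ roe * ros, dt)                    # raw regressors
m2 = feols(salary ~ roe_demean * ros_demean, dt)      # both demeaned
m3 = feols(salary ~ roe_demean * ros, dt)             # roe demeaned only
m4 = feols(salary ~ roe * ros_demean, dt)             # ros demeaned only
etable(m1, m2, m3, m4)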
Shifting in interaction
Dep. var.: salary
Columns: (1) roe, ros; (2) both demeaned; (3) roe demeaned only; (4) ros demeaned only
                            (1)          (2)          (3)          (4)
Constant                 1,018.3***   1,278.4***   1,367.7***     911.2***
                          (160.0)      (100.9)      (140.0)      (116.0)
roe                        20.33*                                  21.37***
                          (11.87)                                 (8.074)
ros                        -1.734                   -1.445*
                          (1.250)                   (0.8687)
roe × ros                  0.0168
                          (0.0831)
roe_demean                              21.37***     20.33*
                                       (8.074)      (11.87)
ros_demean                             -1.445*                     -1.734
                                       (0.8687)                   (1.250)
roe_demean × ros_demean                 0.0168
                                       (0.0831)
roe_demean × ros                                     0.0168
                                                    (0.0831)
roe × ros_demean                                                   0.0168
                                                                  (0.0831)