Topic One: Linear Regression Regularization
Eric B. Laber
Statistics 561

"The world needs another penalized regression method."
—Nobody (circa 2010)¹

¹ Relax, I'm 98.5 percent joking.
Warm-up (5 minutes)
- True or false:
  - Masking occurs when there is a single large outlier
  - The Gauss-Markov theorem says the least squares estimator minimizes MSE
  - In the TV show 'Cheer' there are only two teams competing for a national title in Navarro's division
Roadmap
- Ridge regression
- Lasso
Review: fitting a linear model
- The least squares estimator \(\widehat{\beta}_n\) solves the estimating equation
  \[
  \mathbb{P}_n X\,(Y - X^{\top}\beta) = 0
  \]
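As a quick illustration, here is a minimal numpy sketch of solving this estimating equation (equivalently, the normal equations); the data-generating choices are illustrative, not from the slides:

```python
import numpy as np

# Minimal sketch: solve the empirical estimating equation
#   P_n X (Y - X'beta) = 0   <=>   (X'X) beta = X'Y
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations (np.linalg.lstsq is the numerically safer route)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```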
Review: fitting a linear model cont’d
Estimation and approximation error
- Suppose \(Y = f(X) + \epsilon\)
- Let \(\beta^{*} = \arg\min_{\beta \in \mathbb{R}^{p}} P(Y - X^{\top}\beta)^{2}\)
- Then
  \[
  E(Y - X^{\top}\widehat{\beta}_n)^{2}
  = \mathrm{Var}(\epsilon)
  + E\{f(X) - X^{\top}\beta^{*}\}^{2}
  + E\{X^{\top}(\widehat{\beta}_n - \beta^{*})\}^{2}
  = \text{Noise} + \text{Approx. Err} + \text{Est. Err}
  \]
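A small Monte Carlo sketch of this decomposition; the choice of \(f\), the N(0, I) design, and the sample sizes are illustrative assumptions (with \(E[XX^{\top}] = I\), the population least squares coefficient is \(\beta^{*} = E[X f(X)]\)):

```python
import numpy as np

# Monte Carlo check of: E(Y - X'bhat)^2 = Var(eps) + Approx. Err + Est. Err
rng = np.random.default_rng(1)
p, n, sigma = 2, 50, 1.0
f = lambda X: np.sin(2 * X[:, 0]) + 0.5 * X[:, 1]   # illustrative regression fn

# Population quantities via a large Monte Carlo sample
Xbig = rng.normal(size=(500_000, p))
beta_star = Xbig.T @ f(Xbig) / Xbig.shape[0]
approx_err = np.mean((f(Xbig) - Xbig @ beta_star) ** 2)

Xtest = rng.normal(size=(100_000, p))
ytest = f(Xtest) + sigma * rng.normal(size=Xtest.shape[0])
reps, est_err, lhs = 500, 0.0, 0.0
for _ in range(reps):                                # average over training sets
    X = rng.normal(size=(n, p))
    y = f(X) + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    est_err += np.mean((Xtest @ (beta_hat - beta_star)) ** 2) / reps
    lhs += np.mean((ytest - Xtest @ beta_hat) ** 2) / reps

print("E(Y - X'beta_hat)^2      :", round(lhs, 3))
print("Noise + Approx + Est err :", round(sigma**2 + approx_err + est_err, 3))
```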
Parsimonious models
² Note: this might be you! If you don't trust your own models it will be hard to make progress.
Parsimonious models cont’d
- Justification for parsimonious models:
  - Occam's Razor
  - (Medicine) the true optimal decision rule is often simple
  - More nuanced mathematical arguments³
⁵ For the best in psychic vampire repellent see:
  https://goop.com/paper-crane-apothecary-psychic-vampire-repellent/p/
⁶ This is a real one.
Roadmap
- Ridge regression
- Lasso
Finding the ‘true’ model
⁷ We'll mostly use Python in this course, but sometimes a guy already has some R code written. Geez, get off my back.
Finding the ‘true’ model
- If you want optimal predictions, you may not want to use the 'true' model even if it were available to you
- Estimating small effects inflates variance but does little to improve prediction
- What constitutes a small effect depends on residual variance and sample size (not an absolute)
Blank page for notes
Hard- and soft-thresholding
- Soft-thresholding: \(\widehat{\mu}_n = \bar{X}_n\, g(|\bar{X}_n|)\) for some function \(g : \mathbb{R}^{+} \to \mathbb{R}^{+}\)
Soft-thresholding: try it at home!
- Consider an estimator of the form \(\widetilde{\mu}_n = \alpha \bar{X}_n\); derive the value of \(\alpha_{\mathrm{opt}}\) that minimizes MSE over \(\alpha\)
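A sketch of the derivation, assuming \(X_1,\dots,X_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\):
\[
\mathrm{MSE}(\alpha)
 = E(\alpha\bar{X}_n - \mu)^2
 = \alpha^2\Bigl(\mu^2 + \tfrac{\sigma^2}{n}\Bigr) - 2\alpha\mu^2 + \mu^2 .
\]
Setting the derivative \(2\alpha(\mu^2 + \sigma^2/n) - 2\mu^2\) to zero gives
\[
\alpha_{\mathrm{opt}} = \frac{\mu^2}{\mu^2 + \sigma^2/n}
 = \frac{1}{1 + \sigma^2/(n\mu^2)} \in (0, 1),
\]
i.e., the MSE-optimal estimator shrinks \(\bar{X}_n\) toward zero, and more aggressively when \(\mu^2\) is small relative to \(\sigma^2/n\).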
Hooray! A new estimator of the mean?
⁹ For a fun paper on this topic see Furnival, George M., and Robert W. Wilson. "Regressions by leaps and bounds." Technometrics 42.1 (2000): 69-79.
Roadmap
- Ridge regression
- Lasso
People say sometimes that Beauty is superficial. That may be so. But at least it is not so superficial as Thought is. To me, Beauty is the wonder of wonders. It is only shallow people who do not judge by appearances. The true mystery of the world is the visible, not the invisible.
—Excerpt, "What to Expect When You're Expecting."
Ridge regression: a superficial first look
Historical side-note
¹⁰ The ridge trace plot is what we might call a solution path today (nothing is new).
Ridge regression: orthogonal case
\[
\widehat{\beta}^{\lambda}_{n,j} = \frac{\widehat{\beta}_{n,j}}{1+\lambda}
\]
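A short derivation of this shrinkage formula, assuming the \(n \times p\) design matrix \(X\) satisfies \(X^{\top}X = I\) and the penalized criterion is \(\|Y - X\beta\|^2 + \lambda\|\beta\|^2\) (other normalizations, e.g. \(X^{\top}X = nI\), change only the scale of \(\lambda\)):
\[
\widehat{\beta}^{\lambda}_{n}
 = (X^{\top}X + \lambda I)^{-1}X^{\top}Y
 = \frac{X^{\top}Y}{1+\lambda}
 = \frac{\widehat{\beta}_{n}}{1+\lambda},
\]
since \(X^{\top}X = I\) implies \(\widehat{\beta}_n = X^{\top}Y\).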
Blank page for notes
Ridge regression cont’d
Tuning λn (one more criterion)
- Empirical Bayes (EB) in a nutshell:
  - Posit a Bayesian model
  - Use the marginal distribution to obtain frequentist estimators of the hyper-parameters¹¹
¹¹ This is a rich area but we don't have the bandwidth to cover it in depth in this class.
¹² Credit to Derek Bingham for this nugget of wisdom.
Tuning λn with EB
- Posterior mean is
  \[
  \widehat{\beta}_n = \bigl(\mathbb{P}_n XX^{\top} + \lambda_n I\bigr)^{-1}\,\mathbb{P}_n\bigl(XY + \lambda_n \mu\bigr),
  \]
  where \(\lambda_n = \sigma^2/(n\tau^2)\)
- Setting \(\mu = 0\) yields the ridge estimator; given estimators \(\widehat{\sigma}_n^2\) and \(\widehat{\tau}_n^2\), we obtain a plug-in estimator of \(\lambda_n\), i.e., \(\widehat{\lambda}_n = \widehat{\sigma}_n^2/(n\widehat{\tau}_n^2)\)
Tuning λn with EB cont’d
- Marginal distribution¹³: \(E Y^2 = \tau^2\,\mathrm{trace}(XX^{\top}) + \sigma^2\), so
  \[
  \mathbb{P}_n Y^2 \approx \tau^2\,\mathrm{trace}(\mathbb{P}_n XX^{\top}) + \sigma^2
  \quad\Longrightarrow\quad
  \widehat{\tau}_n^2 = \frac{\mathbb{P}_n Y^2 - \widehat{\sigma}_n^2}{\mathrm{trace}(\mathbb{P}_n XX^{\top})}
  \]
¹³ To align with classic derivations, we're conditioning on X and treating it as fixed here, which makes the expression \(\mathbb{P}_n XX^{\top}\) a bit of an abuse of notation.
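A minimal sketch of the EB plug-in recipe in Python; estimating \(\sigma^2\) from OLS residuals is an assumption here, since the slides leave that choice open:

```python
import numpy as np

# EB plug-in tuning of lambda_n (sigma^2 from OLS residuals is an assumption).
def eb_lambda(X, y):
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ beta_ols) ** 2) / (n - p)        # residual variance
    # trace(P_n XX') = average squared row norm of X
    tau2_hat = (np.mean(y**2) - sigma2_hat) / np.mean(np.sum(X**2, axis=1))
    return sigma2_hat / (n * max(tau2_hat, 1e-12))                # guard against tau2 <= 0

def ridge(X, y, lam_n):
    # Ridge on the P_n scale: (P_n XX' + lam_n I)^{-1} P_n XY
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam_n * np.eye(p), X.T @ y / n)

# Example usage (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)); y = X @ rng.normal(size=5) + rng.normal(size=100)
lam = eb_lambda(X, y); beta_eb = ridge(X, y, lam)
```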
Tuning λn with EB more notes
- Note that \(\widehat{\lambda}_n\) is a well-defined statistic without reference to the Bayes model
- We can use this estimator and analyze it from a frequentist point-of-view
- The Bayesian connection is just icing
Blank page for notes
Ridge regression: BIC
- Let X denote the design matrix; the effective degrees of freedom are
  \[
  \mathrm{df}(\lambda) = \mathrm{trace}\bigl\{X(X^{\top}X + \lambda I)^{-1}X^{\top}\bigr\}
  \]
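A minimal sketch of using \(\mathrm{df}(\lambda)\) inside a BIC-style criterion; the specific BIC form below (n·log(RSS/n) + df·log n) is a common convention and an assumption, since the slide only defines \(\mathrm{df}(\lambda)\):

```python
import numpy as np

def ridge_df(X, lam):
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # hat matrix
    return np.trace(H)

def ridge_bic(X, y, lam):
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + ridge_df(X, lam) * np.log(n)

# Pick lambda over a grid (illustrative data and grid)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)); y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
grid = np.logspace(-3, 3, 25)
best = min(grid, key=lambda lam: ridge_bic(X, y, lam))
print(best, ridge_df(X, best))
```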
Asymptotic behavior
Asymptotic behavior notes
Housekeeping: project
- Key points:
  - Groups of 1-5
  - Due April 23
  - Instructor approval required by March 1
¹⁴ Search reddit or quora for 'why dropout works'.
Ridge regression and dropout cont’d
\[
\widetilde{\beta}_n
 = \arg\min_{\beta}\; \sum_{k=1}^{B}\sum_{i=1}^{n}\bigl\{Y_i - (X_i \circ Z_{ki})^{\top}\beta/(1-\phi)\bigr\}^2
 \approx \arg\min_{\beta}\; E_Z\,\mathbb{P}_n\bigl\{Y - (X \circ Z)^{\top}\beta/(1-\phi)\bigr\}^2,
\]
where \(\circ\) denotes the elementwise (Hadamard) product and the \(Z\)'s are dropout masks.
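A numerical sketch of the dropout-ridge connection, assuming each mask entry is i.i.d. Bernoulli\((1-\phi)\) (the standard dropout setup; an assumption here, as is the closed-form comparison): the minimizer of the averaged objective solves a ridge-type system with diagonal penalty \((\phi/(1-\phi))\,\mathrm{diag}(\mathbb{P}_n X_j^2)\), which the code checks by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, phi, B = 500, 4, 0.3, 2000
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)

# Average the normal equations of the dropout objective over B masks
G_mc, b_mc = np.zeros((p, p)), np.zeros(p)
for _ in range(B):
    W = X * rng.binomial(1, 1 - phi, size=(n, p)) / (1 - phi)
    G_mc += W.T @ W / (n * B)
    b_mc += W.T @ y / (n * B)
beta_dropout = np.linalg.solve(G_mc, b_mc)

# Closed-form "tilted ridge" solution: [P_n XX' + (phi/(1-phi)) diag(P_n X_j^2)]^{-1} P_n XY
G = X.T @ X / n + (phi / (1 - phi)) * np.diag(np.mean(X**2, axis=0))
beta_ridge_like = np.linalg.solve(G, X.T @ y / n)
print(np.round(beta_dropout, 3), np.round(beta_ridge_like, 3))
```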
Ridge regression and dropout discussion
Ridge regression and noise addition
\[
\widetilde{\beta} = \arg\min_{\beta}\; E_Z\,\mathbb{P}_n\bigl\{Y - (X + Z)^{\top}\beta\bigr\}^2
\]
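A one-line calculation shows why noise addition reproduces ridge, assuming \(Z\) is independent of \((X, Y)\) with mean zero and \(\mathrm{Cov}(Z) = \lambda I\) (the noise distribution is an assumption here):
\[
E_Z\bigl\{Y - (X+Z)^{\top}\beta\bigr\}^2
 = (Y - X^{\top}\beta)^2 - 2\,E_Z\bigl(Z^{\top}\beta\bigr)(Y - X^{\top}\beta) + E_Z\bigl(Z^{\top}\beta\bigr)^2
 = (Y - X^{\top}\beta)^2 + \lambda\|\beta\|^2,
\]
so that
\[
\widetilde{\beta} = \arg\min_{\beta}\; \mathbb{P}_n(Y - X^{\top}\beta)^2 + \lambda\|\beta\|^2,
\]
i.e., the ridge estimator.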
Blank page for notes
Ridge regression: discussion
- Does not perform variable selection, i.e., all variables are kept in the model; this can be a problem if parsimony is critical, but we can always threshold small values to zero
Principal components regression
¹⁶ Assume predictors have been centered.
Principal components regression cont’d
Principal components regression cont’d
- Pros:
  - Can reduce MSE relative to standard least squares
  - Sometimes the principal components are interpretable (see pcr.R and the sketch below)
  - Parallel computation possible
- Cons:
  - Interpretation of principal components is subjective
  - Doesn't involve Y in the construction of the features (the PCs)
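A minimal Python sketch of principal components regression via the SVD (pcr.R is the course's R example; this is an illustrative companion, not a port, and the data below are made up):

```python
import numpy as np

def pcr(X, y, k):
    """Regress y on the first k principal components of the centered predictors."""
    Xc = X - X.mean(axis=0)                          # predictors centered, per the slide
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:k].T                                # scores on the first k PCs
    gamma = np.linalg.lstsq(T, y - y.mean(), rcond=None)[0]
    beta = Vt[:k].T @ gamma                          # map back to original coordinates
    return beta, y.mean() - X.mean(axis=0) @ beta    # (slope, intercept)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)); X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
y = X[:, 0] + rng.normal(size=200)
beta, b0 = pcr(X, y, k=3)
```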
Roadmap
- Ridge regression
- Lasso
A nice relaxing quiz
- True or false:
  - Maximum likelihood is limited to parametric models
  - Large coefficient standard errors can be a sign of near collinearity
  - The world's most expensive donkey cheese sells for several thousand dollars a pound
Lasso: superficial overview
- The lasso estimator solves
  \[
  \widehat{\beta}^{\tau}_n = \arg\min_{\beta}\; \mathbb{P}_n(Y - X^{\top}\beta)^2 + \tau\|\beta\|_1,
  \]
  where \(\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|\) and \(\tau > 0\) is a tuning parameter
Lasso vs. ridge (a stolen picture)
Lasso: orthogonal predictors
- Suppose \(X^{\top}X = I\); then
  \[
  \mathbb{P}_n(Y - X^{\top}\beta)^2 + \tau\|\beta\|_1
  = \mathbb{P}_n Y^2 - \|\widehat{\beta}_n\|^2 + \|\widehat{\beta}_n - \beta\|^2 + \tau\|\beta\|_1,
  \]
Lasso: orthogonal predictors
- In the orthogonal case we can compute \(\widehat{\beta}^{\tau}_{n,j}\) explicitly:
  \[
  \widehat{\beta}^{\tau}_{n,j} = \mathrm{sgn}(\widehat{\beta}_{n,j})\bigl(|\widehat{\beta}_{n,j}| - \tau/2\bigr)_{+},
  \]
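A minimal numpy sketch of this soft-thresholding solution, using the unnormalized criterion \(\|y - X\beta\|^2 + \tau\|\beta\|_1\) with orthonormal columns so that the \(\tau/2\) threshold above applies exactly; the data and \(\tau\) are illustrative:

```python
import numpy as np

def soft_threshold(b, t):
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

rng = np.random.default_rng(0)
n, p, tau = 100, 5, 0.6
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))     # columns orthonormal: Q'Q = I
beta_true = np.array([3.0, -1.5, 0.0, 0.0, 0.2])
y = Q @ beta_true + 0.5 * rng.normal(size=n)

beta_ols = Q.T @ y                               # OLS when Q'Q = I
beta_lasso = soft_threshold(beta_ols, tau / 2)   # lasso solution, coordinate-wise
print(np.round(beta_ols, 2), np.round(beta_lasso, 2))
```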
Computing the lasso solution
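Coordinate descent is one standard way to compute the lasso solution for a general (non-orthogonal) design (the slides may use a different algorithm); a minimal sketch, again for the unnormalized criterion \(\|y - X\beta\|^2 + \tau\|\beta\|_1\):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, tau, n_iter=200):
    """Coordinate descent for ||y - X beta||^2 + tau * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X**2, axis=0)                     # ||x_j||^2
    r = y.copy()                                      # current residual y - X beta
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r + col_ss[j] * beta[j]   # x_j'(y - X_{-j} beta_{-j})
            new = soft_threshold(rho, tau / 2.0) / col_ss[j]
            r += X[:, j] * (beta[j] - new)            # update residual
            beta[j] = new
    return beta

# Illustrative usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=100)
print(np.round(lasso_cd(X, y, tau=20.0), 2))
```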
Lasso discussion
- Potential problems:
  - Shrinks all the coefficients, even those that are not close to zero, which can lead to excessive bias
  - Selects at most min(n, p) variables, which is problematic in some settings
Adaptive lasso
- Recall the nature of shrinkage for the lasso
[Figure: soft-thresholding plot of \(\widehat{\beta}^{\tau}_{n}\) against \(\widehat{\beta}_{n}\) with \(\tau = 12.5\)]
Adaptive lasso cont’d
Adaptive lasso: orthogonal case
For comparison:
\[
\text{Lasso: } \widehat{\beta}^{\tau}_{n,j} = \mathrm{sgn}(\widehat{\beta}_{n,j})\Bigl(|\widehat{\beta}_{n,j}| - \frac{\tau}{2}\Bigr)_{+},
\qquad
\text{Ridge: } \widehat{\beta}^{\lambda}_{n,j} = \frac{\widehat{\beta}_{n,j}}{1+\lambda}
\]
- R code example: adaptiveLasso.R
Adaptive lasso: fitting non-orthogonal case
Adaptive lasso: fitting non-orthogonal case cont’d
3. Compute the adaptive lasso estimator \(\widehat{\beta}^{\delta}_{n,j} = \widehat{\beta}_{n,j}\,\widetilde{\beta}^{\delta}_{n,j}\)
4. Select \(\delta\) using BIC/AIC, etc.
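A minimal Python sketch of the rescaling recipe; the initial OLS estimator, the use of \(|\widetilde{\beta}_{n,j}|^{\delta}\) as weights (to keep the rescaling well-defined for non-integer \(\delta\)), and scikit-learn's penalty scaling are all assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, tau, delta=1.0):
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]    # initial estimator (OLS, assumed)
    w = np.abs(beta_init) ** delta                      # weights |beta-tilde|^delta
    Xs = X * w                                          # rescale each column
    # scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1
    fit = Lasso(alpha=tau, fit_intercept=False).fit(Xs, y)
    return fit.coef_ * w                                # scale back to original coordinates

# Illustrative usage
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8)); y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)
print(np.round(adaptive_lasso(X, y, tau=0.1, delta=1.0), 2))
```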
Adaptive lasso discussion
Robustness and the lasso
Robustness and lasso equivalence thm
- Claim: the lasso and robust regression solutions are equivalent in that
  \[
  \widetilde{\beta}_n
  = \arg\min_{\beta}\, \max_{\Delta \in M} \|Y - (X + \Delta)\beta\|_2
  = \arg\min_{\beta}\, \|Y - X\beta\|_2 + \sum_{j=1}^{p} c_j |\beta_j|
  \]
Blank page for notes
Robustness and lasso discussion
Penalization and regularization discussion
Thank you.
eric.laber@duke.edu
laber-labs.com