
Topic 1: Linear regression and regularization
Eric B. Laber

Department of Statistical Science, Duke University

Statistics 561
The world needs another penalized regression method.
—Nobody (circa 2010).1

Penalization is like the word ‘myself,’ it’s used too much
(often incorrectly) by people wishing to seem more intelligent
than they are.
—Joel Vaughan

1
Relax, I’m 98.5 percent joking.
Warm-up (5 minutes)

I Explain to your group


I Why is penalization/regularization used with predictive
models?
I How does penalization fit into the bias-variance trade-off?
I What’s the connection between ridge regression and a
Bayesian linear model?

I True or false
I Masking occurs when there is a single large outlier
I The Gauss-Markov theorem says the least squares estimator
minimizes MSE
I In the TV show ‘Cheer’ there are only two teams competing
for a national title in Navarro’s division.

1 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
Review: fitting a linear model

I The ordinary least squares estimator is

\hat{\beta}_n = \arg\min_{\beta \in \mathbb{R}^p} P_n (Y - X^\top\beta)^2 = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^\top\beta)^2

I Alternatively, view \hat{\beta}_n as the solution to

P_n X (Y - X^\top\beta) = 0

I If P_n X X^\top is invertible, \hat{\beta}_n = (P_n X X^\top)^{-1} P_n X Y
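A minimal R sketch of the matrix formula above, on simulated data (names and data are illustrative, not from the course code):

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_star <- c(2, -1, 0.5)
y <- drop(X %*% beta_star + rnorm(n))

# Closed-form OLS: solve (Pn X X') beta = Pn X Y, i.e., the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same fit via lm() (no intercept, to match the matrix formula)
coef(lm(y ~ X - 1))
```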

2 / 56
Review: fitting a linear model cont’d

I What to do if P_n X X^\top is (nearly) singular?

3 / 56
Estimation and approximation error

I Suppose Y = f(X) + \epsilon
I Let \beta^* = \arg\min_{\beta \in \mathbb{R}^p} P(Y - X^\top\beta)^2

I Decompose the mean squared error as

E(Y - X^\top\hat{\beta}_n)^2 = \mathrm{Var}(\epsilon) + E\{f(X) - X^\top\beta^*\}^2 + E\left\{X^\top(\hat{\beta}_n - \beta^*)\right\}^2
= Noise + Approx. Err. + Est. Err.

I A complex model reduces the approximation error but
increases the estimation error

4 / 56
Parsimonious models

I Predictive models used to inform decision making


I Generate interesting hypotheses for further study
I Forecasts weighed by stakeholders
I Drive automatic action, e.g., decide if and what type of push
notification to send a patient in mHealth

I Need to build trust with stakeholders2


I Models must be validated in domain context
I Interpretable models often required

2
Note* this might be you! If you don’t trust your own models it will be hard
to make progress.
5 / 56
Parsimonious models cont’d
I Justification for parsimonious models
I Occam’s Razor
I (Medicine) the true optimal decision rule is often simple
I More nuanced mathematical arguments3

I Linear models aren’t really that interpretable


I Easy when abstracted away from the problem
I Notion of one variable moving while all others are held fixed can be
nonsensical to domain experts in some contexts4
I But the fewer terms they have, the more interpretable they
tend to be (we’ll explore lists and trees later in this course)
3
Duke’s own Cynthia Rudin has some excellent work in this area. Check out
https://impact.duke.edu/story/whats-in-the-box and her papers on this topic.
4
Imagine increasing the square-footage of a grocery store without changing
the products, back-of-house inventory, layout, etc.
6 / 56
It is our experience and strong belief that better models
and a better understanding of one’s data result from
focused data analysis, guided by substantive theory.
–Gwyneth Paltrow5

Automatic model-building procedures should be avoided


at all cost. –D.R. Cox6

5
For the best in psychic vampire repellent see:
https://goop.com/paper-crane-apothecary-psychic-vampire-repellent/p/
6
This is a real one.
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
Finding the ‘true’ model

I Assume that \beta_j^* = 0 for j \in J

I Only need to regress Y on the X_j with j \notin J
I Let X_{J^c} denote the relevant predictors
I Question: suppose an oracle gave you J; is

\hat{\beta}_n^{J^c} = \arg\min_{\beta} P_n (Y - X_{J^c}^\top\beta)^2

the optimal estimator?

I R code example: growingLinearModel.R7

7
We’ll mostly use python in this course but sometimes a guy already has
some R code written. Geez, get off my back.
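I have not seen growingLinearModel.R, but a simulation in its spirit might look like the sketch below: keep the true model fixed and watch test error deteriorate as irrelevant predictors are added (all names and settings are illustrative).

```r
set.seed(561)
n <- 100; p_max <- 50
beta_star <- c(2, -1, 0.5)                  # only three nonzero coefficients
X  <- matrix(rnorm(n * p_max), n, p_max)
y  <- drop(X[, 1:3] %*% beta_star + rnorm(n))
Xt <- matrix(rnorm(n * p_max), n, p_max)    # independent test set
yt <- drop(Xt[, 1:3] %*% beta_star + rnorm(n))

test_mse <- sapply(3:p_max, function(p) {
  fit <- lm(y ~ X[, 1:p] - 1)
  mean((yt - Xt[, 1:p] %*% coef(fit))^2)
})
plot(3:p_max, test_mse, type = "l",
     xlab = "number of predictors", ylab = "test MSE")
```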
8 / 56
Finding the ‘true’ model

I If you want optimal predictions, you may not want to use the
‘true’ model even if it were available to you
I Estimating small effects inflates variance but does little to
improve prediction
I What constitutes a small effect depends on residual variance
and sample size (not an absolute)

I Exercise: suppose X_1, \dots, X_n \sim_{i.i.d.} (\mu, \sigma^2); derive the MSE
of the estimators \tilde{\mu}_n \equiv 0 and \hat{\mu}_n = \bar{X}_n. When is \tilde{\mu}_n
preferable to \hat{\mu}_n?
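A sketch of one route to the answer (my addition, not part of the original notes):

\mathrm{MSE}(\tilde{\mu}_n) = E(0 - \mu)^2 = \mu^2, \qquad \mathrm{MSE}(\hat{\mu}_n) = \mathrm{Var}(\bar{X}_n) = \sigma^2/n,

so the trivial estimator \tilde{\mu}_n is preferable exactly when \mu^2 < \sigma^2/n, i.e., when the true effect is small relative to the noise level at the available sample size.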

9 / 56
Blank page for notes
Hard- and soft-thresholding

I Consider our toy example X1 , . . . , Xn ∼i.i.d. (µ, σ 2 )

I Idea: reduce MSE by adaptively shrinking our estimator X n


I Hard-thresholding: \hat{\mu}_n^H = \bar{X}_n\, 1\{f(|\bar{X}_n|) \ge \tau\} for some function
f : \mathbb{R}^+ \to \mathbb{R}^+

I Soft-thresholding: \hat{\mu}_n^S = \bar{X}_n\, g(|\bar{X}_n|) for some function
g : \mathbb{R}^+ \to \mathbb{R}^+
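An illustrative R sketch (not from the course code) comparing the raw mean with hard- and soft-thresholded versions by Monte Carlo MSE:

```r
set.seed(1)
mu <- 0.2; sigma <- 1; n <- 25
tau <- 2 * sigma / sqrt(n)                         # an arbitrary threshold
est <- replicate(5000, {
  xbar <- mean(rnorm(n, mu, sigma))
  c(raw  = xbar,
    hard = xbar * (abs(xbar) >= tau),              # keep or kill
    soft = xbar * max(0, 1 - tau / abs(xbar)))     # shrink toward zero
})
rowMeans((est - mu)^2)                             # Monte Carlo MSE of each
```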

10 / 56
Soft-thresholding: try it at home!
I Consider an estimator of the form \tilde{\mu}_n = \alpha\bar{X}_n; derive the value
of \alpha_{opt} that minimizes the MSE over \alpha
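One way the derivation might go (a sketch I am adding, not the official solution):

\mathrm{MSE}(\alpha\bar{X}_n) = \alpha^2\sigma^2/n + (\alpha - 1)^2\mu^2,

which is minimized at

\alpha_{opt} = \frac{\mu^2}{\mu^2 + \sigma^2/n},

so the optimal shrinkage factor is governed by the signal-to-noise ratio, as noted on the next slide.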

11 / 56
Hooray! A new estimator of the mean?

I αopt X n resembles an empirical Bayes estimator of the mean 8

I Optimal shrinkage depends on signal-to-noise ratio


I Common theme for adaptive shrinkage methods, e.g., adaptive
ridge and adaptive lasso

I Not very exciting for a 1D example, but can be extremely


useful/effective in higher dimensions

I Warning: hard-thresholding estimators are often non-regular,
which means we cannot estimate their sampling distributions
uniformly ⇒ standard inference procedures, e.g., the bootstrap or
series approximations, perform poorly
8
You will derive the analog for the hard-thresholding estimator \bar{X}_n\, 1\{|\bar{X}_n| \ge \alpha\} in
HW2.
12 / 56
Quick aside: all subsets

I Before moving on to penalized regression, we should mention


the obvious approach of looking at all models and choosing
the ‘best one’ based on some criterion

I An intuitive approach is to examine all possible models


I Choose model with lowest BIC/AIC, etc.
I p variables ⇒ 2^p possible models
I Branch-and-bound algorithms can reduce the search space,
making an all-subsets search feasible9 for p ≤ 50

I R code example: allSubsets.R

9
For a fun paper on this topic see Furnival, George M., and Robert W.
Wilson. ”Regressions by leaps and bounds.” Technometrics 42.1 (2000): 69-79.
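The slide references allSubsets.R; a possible sketch of the same idea uses the leaps package, which implements the branch-and-bound search (variable names are illustrative):

```r
library(leaps)
# x: an n x p predictor matrix, y: the response vector
all_sub <- regsubsets(x, y, nvmax = ncol(x), method = "exhaustive")
ss <- summary(all_sub)
ss$which[which.min(ss$bic), ]   # predictors in the lowest-BIC model
```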
13 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
People say sometimes that Beauty is superficial. That
may be so. But at least it is not so superficial as Thought
is. To me, Beauty is the wonder of wonders. It is only
shallow people who do not judge by appearances. The
true mystery of the world is the visible, not the invisible.
—Excerpt ”What to Expect When You’re Expecting.”
Ridge regression: a superficial first look

I The least squares estimator is unbiased and has minimum variance
among all linear unbiased estimators (Gauss-Markov)
I As we’ve seen, this does not mean it minimizes MSE
I Recall MSE = bias^2 + variance ⇒ a small increase in bias plus a big
reduction in variance = smaller MSE

I Shrink regression coefficients toward zero by solving

\hat{\beta}_n^\lambda = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \lambda\|\beta\|^2,

where λ ≥ 0 is a tuning parameter
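A minimal R sketch of the closed-form ridge solution and its ‘ridge trace’ (illustrative code; scaling conventions for λ differ across texts and software, so this matches the displayed objective with centered, scaled data):

```r
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) / nrow(X) + lambda * diag(p),
        crossprod(X, y) / nrow(X))
}

set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(3, -2, 1, 0, 0) + rnorm(n))

lambdas <- 10^seq(-3, 2, length.out = 50)
coefs   <- sapply(lambdas, function(l) ridge_fit(X, y, l))
matplot(log10(lambdas), t(coefs), type = "l",
        xlab = "log10(lambda)", ylab = "coefficient")   # a 'ridge trace'
```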

15 / 56
Historical side-note

I Proposed by Hoerl and Kennard (1970)


I Goal was to stabilize OLS estimator
I Proved there always exists λ* s.t.

E\|\hat{\beta}_n^{\lambda^*} - \beta^*\|^2 < E\|\hat{\beta}_n - \beta^*\|^2,

but λ* depends on β*; suggested looking at the ‘ridge trace’ and
picking the value at which the coefficients appear to ‘stabilize’10; we’ll
look at other data-driven tuning methods

I Fact: Hoerl is a fun name to say, try it out

10
the ridge trace plot is what we might call a solution path today (nothing is
new)
16 / 56
Ridge regression: orthogonal case

I Ridge regression estimator can give dramatic reduction in


MSE, especially when the dimension p is large

I Consider first the orthogonal case P_n X X^\top = I_p, where

\hat{\beta}_{n,j}^\lambda = \frac{\hat{\beta}_{n,j}}{1 + \lambda},

thus E(\hat{\beta}_{n,j}^\lambda) = \beta_j^*/(1 + \lambda) and \mathrm{Var}(\hat{\beta}_{n,j}^\lambda) = \mathrm{Var}(\hat{\beta}_{n,j})/(1 + \lambda)^2

I Exercise: suppose p = 1, find the value of λ that minimizes
MSE(\hat{\beta}_{n,1}^\lambda). What is the optimal λ for general p?
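One route to the answer (a sketch I am adding under the orthogonal-design setup above, where \mathrm{Var}(\hat{\beta}_{n,j}) = \sigma^2/n):

\mathrm{MSE}(\hat{\beta}_{n,1}^\lambda) = \frac{\lambda^2 (\beta_1^*)^2 + \sigma^2/n}{(1 + \lambda)^2},

which is minimized at \lambda_{opt} = \sigma^2/\{n(\beta_1^*)^2\}; for general p, summing the coordinate-wise MSEs gives \lambda_{opt} = p\sigma^2/(n\|\beta^*\|^2). Either way the best amount of shrinkage grows as the signal shrinks, tying back to the signal-to-noise theme.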

17 / 56
Blank page for notes
Ridge regression cont’d

I In the non-orthogonal case

\hat{\beta}_n^\lambda = (P_n X X^\top + n^{-1}\lambda I_p)^{-1} P_n X Y
= \{I + n^{-1}\lambda (P_n X X^\top)^{-1}\}^{-1} (P_n X X^\top)^{-1} P_n X Y
= \{I + n^{-1}\lambda (P_n X X^\top)^{-1}\}^{-1} \hat{\beta}_n

I How to choose λ ≥ 0 in this case?


I Information criteria AIC, BIC, etc.
I Cross-validation (generalized cross-validation)
I You’ll explore these criteria in lab

19 / 56
Tuning λn (one more criterion)

I Can use empirical Bayes (EB) to select λn

I EB in a nutshell
I Posit a Bayesian model
I Use the marginal distribution to obtain frequentist estimators of the
hyper-parameters11

I Pro-tip: if you want to derive an estimator that performs well


in practice, posit a Bayesian model, derive posterior mean, call
this a frequentist estimator and hide all evidence you ever
considered Bayes approach12

11
This is a rich area but we don’t have the bandwidth to cover it in depth in
this class.
12
Credit to Derek Bingham for this nugget of wisdom.
20 / 56
Tuning λn with EB

I Assume the linear model is correct: Y = X^\top\beta + \epsilon where
\epsilon \sim \mathrm{Normal}(0, \sigma^2), \beta \sim \mathrm{Normal}(\mu, \tau^2 I), and \beta \perp \epsilon

I The posterior mean is

\hat{\beta}_n = (P_n X X^\top + \lambda_n I)^{-1} (P_n X Y + \lambda_n \mu),

where \lambda_n = \sigma^2/(n\tau^2)
I Setting µ = 0 yields the ridge estimator; given estimators \hat{\sigma}_n^2 and
\hat{\tau}_n^2, we obtain a plug-in estimator of \lambda_n, i.e., \hat{\lambda}_n = \hat{\sigma}_n^2/(n\hat{\tau}_n^2)

21 / 56
Tuning λn with EB cont’d

I Marginal distribution13: E Y^2 = \tau^2\,\mathrm{trace}(X X^\top) + \sigma^2

I The MOM estimator matches

P_n Y^2 \approx \tau^2\,\mathrm{trace}(P_n X X^\top) + \sigma^2;

plugging in \hat{\sigma}_n^2 = P_n (Y - X^\top\hat{\beta}_n)^2 yields

\hat{\tau}_n^2 = \frac{P_n Y^2 - \hat{\sigma}_n^2}{\mathrm{trace}(P_n X X^\top)}

and thus \hat{\lambda}_n = \hat{\sigma}_n^2/(\hat{\tau}_n^2 n)

13
To align with classic derivations, we’re conditioning on X and treating it as
fixed here. Which makes the expression Pn X X | a bit of an abuse of notation.
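A small R sketch of this plug-in rule (my illustration; assumes a centered response y and a centered, scaled n x p design matrix X, as in the slides):

```r
n <- nrow(X); p <- ncol(X)
beta_ols   <- solve(crossprod(X), crossprod(X, y))
sigma2_hat <- mean((y - X %*% beta_ols)^2)          # Pn(Y - X'beta_hat)^2
tau2_hat   <- (mean(y^2) - sigma2_hat) /
              sum(diag(crossprod(X) / n))           # trace(Pn X X')
lambda_hat <- sigma2_hat / (n * tau2_hat)
beta_eb    <- solve(crossprod(X) / n + lambda_hat * diag(p),
                    crossprod(X, y) / n)            # ridge at lambda_hat
```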
22 / 56
Tuning λn with EB more notes

I Note that \hat{\lambda}_n is a well-defined statistic without reference to
the Bayes model
I We can use this estimator and analyze it from a frequentist
point-of-view
I Bayesian connection is just icing

I Related idea: when we talk about RL, we might use a
Bayesian framework for exploration-exploitation but not
require these models to be correct when analyzing algorithm
performance

23 / 56
Blank page for notes
Ridge regression: BIC
I Let X denote the design matrix; the effective degrees of freedom are

\mathrm{df}(\lambda) = \mathrm{trace}\left\{X (X^\top X + \lambda I)^{-1} X^\top\right\};

to compute this efficiently we use the SVD of X:

X = U D V^\top \Rightarrow \mathrm{df}(\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}

I The BIC for each value of λ is

\mathrm{BIC}(\lambda) = \log\{\mathrm{RSS}(\lambda)\} + \mathrm{df}(\lambda)\log(n)/n;

choose \hat{\lambda}_{BIC} = \arg\min_\lambda \mathrm{BIC}(\lambda)
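A short R sketch of this BIC computation over a grid of λ values (illustrative; X and y as in the earlier sketches, with λ on the X^\top X scale to match the df formula):

```r
d2      <- svd(X)$d^2
lambdas <- 10^seq(-3, 4, length.out = 100)
bic <- sapply(lambdas, function(l) {
  beta <- solve(crossprod(X) + l * diag(ncol(X)), crossprod(X, y))
  rss  <- sum((y - X %*% beta)^2)
  df   <- sum(d2 / (d2 + l))
  log(rss) + df * log(nrow(X)) / nrow(X)
})
lambda_bic <- lambdas[which.min(bic)]
```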

24 / 56
Asymptotic behavior

I Under what conditions on \{\lambda_n\}_{n \ge 1} will:

I \hat{\beta}_n^{\lambda_n} \to \beta^*
I \sqrt{n}\,(\hat{\beta}_n^{\lambda_n} - \beta^*) converge to a Gaussian limit

I Before we mathematize anything, what does your intuition


say? Should λn shrink? Grow? Converge to a constant?

25 / 56
Asymptotic behavior notes
Housekeeping: project
I Key points
I Groups of 1-5
I Due April 23
I Instructor approval required by March 1

I What you’ll produce: a PRFAQ + Technical Appendix/POC


I Pitching a new idea to improve science/society
I PR = Press release: a one-page description of the
launch of your product
I FAQ = Frequently Asked Questions: 2-5 pages identifying
potential pitfalls, solutions, contingency plans
I Technical appendix + POC: the math and code needed for a
proof of concept or minimal viable product
I Imagine you’re pitching to VC, gov funding agency, private
foundation, etc.
27 / 56
Housekeeping project: cont’d
I What you owe me by March 1
I Your group!
I Title + one paragraph summary of your idea
I An outline of technical results you plan to include

I How to find a group?


I Use your study buddies
I I will have a speed-dating session Wed in class

I How to find a project idea


I Something you’re already working on or passionate about
I Look around, where is there need? Where is there potential?
Think big!
28 / 56
Ridge regression: prostate data example
I Example: prostate cancer data set
I n = 97 patients, response is log prostate-specific antigen (lpsa)
I Predictors:
I lcavol: log (cancer volume)
I lweight: log(prostate weight)
I age: age
I lbph: log (benign prostatic hyperplasia)
I svi: seminal vesicle invasion
I lcp: log (capsular penetration)
I gleason: Gleason score
I pgg45: percent of Gleason scores 4 or 5

I R code example: ridgeExample.R
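One possible sketch in the spirit of ridgeExample.R (an assumption on my part: it uses the prostate data as packaged in the archived ElemStatLearn package, with the columns listed above; glmnet with alpha = 0 fits ridge):

```r
library(glmnet)
data(prostate, package = "ElemStatLearn")
x <- scale(as.matrix(prostate[, 1:8]))    # the eight predictors above
y <- prostate$lpsa - mean(prostate$lpsa)

fit <- glmnet(x, y, alpha = 0)            # alpha = 0 gives the ridge penalty
plot(fit, xvar = "lambda")                # solution path, i.e., a ridge trace
cv  <- cv.glmnet(x, y, alpha = 0)
coef(cv, s = "lambda.min")
```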


29 / 56
Ridge regression and dropout

I Dropout is popular heuristic to reduce overfitting in machine


learning models (esp. nnets)
I Randomly ‘zero-out’ some components in the training input
I Lots of heuristic explanations for why this works14; with a
linear model we can obtain a more rigorous answer

I Let Z ∈ {0, 1}^p be a vector of independent Bernoulli random
variables with P(Z_j = 0) = \phi (and P(Z_j = 1) = 1 - \phi)
I Suppose Z is independent of (X, Y)
I Random dropout: X \circ Z = (X_1 Z_1, \dots, X_p Z_p)^\top

14
Search reddit or quora for ‘why dropout works’.
30 / 56
Ridge regression and dropout cont’d

I Imagine for each i = 1, \dots, n we generate a bajillion15 values
of Z, say Z_i^1, \dots, Z_i^B, and subsequently generate the new dataset
\left\{\left(Y_i,\; X_i \circ Z_i^k/(1 - \phi)\right) : k = 1, \dots, B,\; i = 1, \dots, n\right\}

I Now suppose we fit a linear model using these data

\tilde{\beta}_n = \arg\min_{\beta} \sum_{k=1}^B \sum_{i=1}^n \left\{Y_i - (X_i \circ Z_i^k)^\top\beta/(1 - \phi)\right\}^2
\approx \arg\min_{\beta} E_Z P_n \left\{Y - (X \circ Z)^\top\beta/(1 - \phi)\right\}^2,

where we’ve approximated the sample average with an
expectation (recall we can generate as many Z’s as we want)
15
By a bajillion, I mean a lot. Also, this lecture is sponsored by ”Bajillion
Dollar Properties,” watch now on Amazon Prime.
31 / 56
Ridge regression and dropout cont’d
I Suppose the data have been centered and scaled; derive a closed-form
expression for \tilde{\beta}_n
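A sketch of how the derivation might go (my addition, not the official solution): write W = Z/(1 - \phi), so E W_j = 1 and \mathrm{Var}(W_j) = \phi/(1 - \phi). Then

E_Z\{Y - (X \circ W)^\top\beta\}^2 = (Y - X^\top\beta)^2 + \frac{\phi}{1 - \phi}\sum_{j=1}^p X_j^2 \beta_j^2,

and taking P_n with scaled predictors (P_n X_j^2 = 1) gives the ridge objective P_n(Y - X^\top\beta)^2 + \frac{\phi}{1-\phi}\|\beta\|^2, so \tilde{\beta}_n is the ridge estimator with \lambda = \phi/(1 - \phi).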

32 / 56
Ridge regression and dropout discussion

I Reducing the info available to the model to make a prediction at
any given point (X \mapsto X \circ Z) has a regularizing effect
I General strategy can be used with models that are harder to
penalize explicitly, e.g., this is used when building trees in
random forests (we’ll talk more about this later)
I Smoother version of dropout can be obtained by replacing
Bernoulli’s with any unit-mean r.v.’s

33 / 56
Ridge regression and noise addition

I Another way to reduce the info available to the model is to replace
X with X + Z, with Z a vector of independent (0, λ) r.v.’s
(mean zero, variance λ)

I As with dropout, note that this is a general strategy that can


be applied with any ML algorithm (for supervised learning)

I In-class exercise: compute

\tilde{\beta} = \arg\min_{\beta} E_Z P_n \{Y - (X + Z)^\top\beta\}^2
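A sketch of the computation (my addition): expanding the square and using E Z = 0 and E Z Z^\top = \lambda I,

E_Z\{Y - (X + Z)^\top\beta\}^2 = (Y - X^\top\beta)^2 + \lambda\|\beta\|^2,

so \tilde{\beta} again solves the ridge problem with penalty parameter λ.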

34 / 56
Blank page for notes
Ridge regression: discussion

I Reduces MSE by shrinking regression coefficients, very


effective in a wide range of settings (battle-tested)

I Does not perform variable selection, i.e., all variables are kept
in the model; this can be a problem if parsimony is critical, but
we can always threshold small values to zero

I Choose amount of penalization using information criteria


(BIC, AIC, etc.) or cross-validation

35 / 56
Principal components regression

I A closely related alternative to ridge regression is principal


components regression

I Review: principal components16


I Write X^\top X/n = V D^2 V^\top where V = [v_1, \dots, v_p]
I Z_j = X^\top v_j is called the jth principal component of X
I \mathrm{Var}(Z_1) \ge \mathrm{Var}(Z_2) \ge \dots \ge \mathrm{Var}(Z_p)

I Idea: regress Y on Z_1, \dots, Z_q for some q ≤ p instead of
regressing on X
I Capture important features of X
I Reduce dimension ⇒ bias-variance trade-off

16
Assume predictors have been centered
36 / 56
Principal components regression cont’d

I Matrix of principal components is Z = XV


I The PC regression design matrix is orthogonal: Z^\top Z = n D^2 (why?)
I Compute regression of Y on Z1 , . . . , Zq using series of
univariate regressions
I Tuning parameter q can be chosen using BIC/AIC,
cross-validation, etc.

I R code example: pcr.R
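I have not seen pcr.R; a possible sketch of the idea in base R (the pls package’s pcr() is another option), for a given predictor matrix x and response y:

```r
pcs <- prcomp(x, center = TRUE, scale. = TRUE)
q   <- 3                                    # number of components kept
Z   <- pcs$x[, 1:q, drop = FALSE]           # first q principal components
pcr_fit <- lm(y ~ Z)
summary(pcr_fit)                            # regression on Z1, ..., Zq
```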

37 / 56
Principal components regression cont’d

I Pros:
I Reduce MSE relative to standard least squares
I Sometimes interpret principal components (see pcr.R)
I Parallel computation possible

I Cons:
I Interpretation of principal components subjective
I Doesn’t involve Y in construction of features (pc’s)

38 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
A nice relaxing quiz

I Explain to your group


I What does the acronym LASSO stand for?
I What is the problem of inference after model selection?
I What goes wrong in regression when p ≫ n?

I True or false
I Maximum likelihood is limited to parametric models
I Large coefficient std errors can be sign of near collinearity
I The world’s most expensive Donkey cheese sells for several
thousand dollars a pound

39 / 56
Lasso: superficial overview

I Simultaneous estimation and model selection via penalization

\hat{\beta}_n^\tau = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \tau\|\beta\|_1,

where \|\beta\|_1 = \sum_{j=1}^p |\beta_j| and τ > 0 is a tuning parameter

I Looks like ridge but use of L1 norm yields sparse solutions
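A brief R sketch with glmnet (illustrative; alpha = 1 gives the lasso, with x and y as in the prostate example):

```r
library(glmnet)
fit_lasso <- glmnet(x, y, alpha = 1)
plot(fit_lasso, xvar = "lambda")        # coefficients drop to exactly zero
cv_lasso  <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.1se")        # a sparse coefficient vector
```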

40 / 56
Lasso vs. ridge (a stolen picture)

41 / 56
Lasso: orthogonal predictors

I Suppose X^\top X = I; then

P_n (Y - X^\top\beta)^2 + \tau\|\beta\|_1 = P_n Y^2 - \|\hat{\beta}_n\|^2 + \|\hat{\beta}_n - \beta\|^2 + \tau\|\beta\|_1,

so we can re-write the lasso solution as

\hat{\beta}_n^\tau = \arg\min_{\beta} \|\hat{\beta}_n - \beta\|^2 + \tau\|\beta\|_1
= \arg\min_{\beta} \sum_{j=1}^p \left\{(\hat{\beta}_{n,j} - \beta_j)^2 + \tau|\beta_j|\right\},

i.e., we can look at each component separately
I R code example: lasso.R

42 / 56
Lasso: orthogonal predictors

I In the orthogonal case we can compute \hat{\beta}_{n,j}^\tau explicitly:

\hat{\beta}_{n,j}^\tau = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \tau/2\right)_+,

where (u)_+ = \max(0, u)

I Can give us further insight into lasso soln (back to lasso.R)
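The soft-thresholding operator in the display above, as a small R helper (illustrative):

```r
soft_threshold <- function(b, tau) sign(b) * pmax(abs(b) - tau / 2, 0)

curve(soft_threshold(x, tau = 4), from = -10, to = 10,
      xlab = "least squares estimate", ylab = "lasso estimate")
abline(0, 1, lty = 2)   # identity line for reference
```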

43 / 56
Computing the lasso solution

I Can reformulate lasso objective as quadratic program,


however, faster iterative algorithms exist

I As with ridge, select the tuning parameter using BIC/AIC,
cross-validation, etc.
I Problem: \hat{\beta}_n^\tau is not a linear estimator; how do we define degrees of
freedom?
I An approximate degrees of freedom is the number of nonzero
components of \hat{\beta}_n^\tau:

\mathrm{df}(\tau) = \sum_{j=1}^p 1\{\hat{\beta}_{n,j}^\tau \ne 0\}

44 / 56
Lasso discussion

I Tool for simultaneous variable selection and estimation


I Can be solved very quickly for large problems
I Leads to parsimonious models ⇒ interpretable
I Variants exist, e.g., for GLMs, Cox-PH, etc. (more on this later)

I Potential problems
I Shrinks all the coefficients, even those that are not close to
zero, which can lead to excessive bias
I Selects at most min(n, p) variables, problematic in some settings

45 / 56
Adaptive lasso
I Recall nature of shrinkage for lasso
(Figure: the lasso soft-thresholding function with τ = 12.5; the shrunken estimate plotted against the least squares estimate \hat{\beta}_n over the range -20 to 20.)
46 / 56
Adaptive lasso cont’d

I Lasso shrinks all coefficients toward zero


I Introduces excessive bias when true regression coef is large
I Potentially better strategy is to shrink more aggressively when
coefficients are small and less aggressively if they are large
I How can we do this if true coefficients are unknown?

I Idea: use ordinary least squares estimates as surrogates

47 / 56
Adaptive lasso cont’d

I Adaptive lasso estimator

\hat{\beta}_n^\delta = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \delta \sum_{j=1}^p |\beta_j|/|\hat{\beta}_{n,j}|

I If |βbn,j | is small ⇒ more shrinkage is applied

I Note: if p is large, the ridge estimator can be used to form the weights

48 / 56
Adaptive lasso: orthogonal case

I In the orthogonal case the adaptive lasso estimator is

\hat{\beta}_{n,j}^\delta = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \frac{\delta}{2|\hat{\beta}_{n,j}|}\right)_+

for comparison

Lasso: \hat{\beta}_{n,j}^\tau = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \frac{\tau}{2}\right)_+
Ridge: \hat{\beta}_{n,j}^\lambda = \frac{\hat{\beta}_{n,j}}{1 + \lambda}
I R code example: adaptiveLasso.R

49 / 56
Adaptive lasso: fitting non-orthogonal case

I Adaptive lasso objective can be recast as quadratic program


I We can use existing lasso software by modifying design matrix
I Write

P_n (Y - X^\top\beta)^2 + \delta \sum_{j=1}^p |\beta_j|/|\hat{\beta}_{n,j}| = P_n (Y - \tilde{X}^\top\tilde{\beta})^2 + \delta \sum_{j=1}^p |\tilde{\beta}_j|,

where \tilde{X}_j = \hat{\beta}_{n,j} X_j and \tilde{\beta}_j = \beta_j/\hat{\beta}_{n,j}

50 / 56
Adaptive lasso: fitting non-orthogonal case cont’d

I Pseudo-code (assumes centered, scaled data)

1. Fit ordinary least squares, say \hat{\beta}_n

2. Create the scaled design matrix \tilde{X} = X\,\mathrm{diag}(\hat{\beta}_n)

3. For each δ under consideration compute the lasso estimator

\tilde{\beta}_n^\delta = \arg\min_{\beta} P_n (Y - \tilde{X}^\top\beta)^2 + \delta\|\beta\|_1,

then compute the adaptive lasso estimator \hat{\beta}_{n,j}^\delta = \hat{\beta}_{n,j}\,\tilde{\beta}_{n,j}^\delta
4. Select δ using BIC/AIC, etc.
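A direct R translation of these steps (a sketch; it selects δ by cross-validation rather than BIC, and glmnet can also handle the weights through its penalty.factor argument):

```r
library(glmnet)
beta_ols <- coef(lm(y ~ x - 1))                  # step 1 (centered, scaled data)
x_tilde  <- sweep(x, 2, beta_ols, `*`)           # step 2: X diag(beta_ols)
cv_fit   <- cv.glmnet(x_tilde, y, alpha = 1)     # step 3, with CV over delta
beta_til <- as.vector(coef(cv_fit, s = "lambda.min"))[-1]
beta_ad  <- beta_ols * beta_til                  # rescale back to the X scale
```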

51 / 56
Adaptive lasso discussion

I Reduce excessive bias due to overshrinkage of large coefficients


I An alternative is to refit the model selected by the lasso using least
squares; this generally works well in practice

I Can use existing statistical software, generally much more


computationally efficient than alternative proposals (SCAD,
etc.)

52 / 56
Robustness and the lasso

I Let c_1, \dots, c_p be positive constants and define

\mathcal{M} = \left\{\Delta \in \mathbb{R}^{n \times p} : \sqrt{\sum_{i=1}^n \Delta_{i,j}^2} \le c_j,\; j = 1, \dots, p\right\},

i.e., the set of n × p matrices where the norm of the jth column is
bounded by c_j for j = 1, \dots, p

I The robust linear regression estimator under perturbations \mathcal{M} is

\tilde{\beta}_n = \arg\min_{\beta} \left\{\max_{\Delta \in \mathcal{M}} \|Y - (X + \Delta)\beta\|_2\right\}

53 / 56
Robustness and lasso equivalence thm
I Claim: the lasso and robust regression solutions are equivalent in that

\tilde{\beta}_n = \arg\min_{\beta} \left\{\max_{\Delta \in \mathcal{M}} \|Y - (X + \Delta)\beta\|_2\right\}
= \arg\min_{\beta} \left\{\|Y - X\beta\|_2 + \sum_{j=1}^p c_j |\beta_j|\right\}

54 / 56
Blank page for notes
Robustness and lasso discussion

I Showed equivalence between so-called square-root lasso and


robust regression
I Setting c_j ≡ τ yields the standard (square-root) lasso
I Setting c_j ≡ 1/|\hat{\beta}_{n,j}| yields the adaptive (square-root) lasso

I The square-root lasso has some desirable properties in terms


of tuning (see papers by A. Belloni here at Duke), however,
the solution paths are the same

55 / 56
Penalization and regularization discussion

I Bias-variance trade-off ⇒ regularization needed to improve


predictive performance

I Ridge regression smoothly (soft) penalizes coefficients but can
be combined with thresholding to obtain sparse solutions

I Lasso automatically yields sparse solutions but may perform


poorly if true signal is dense and comprised of weak signals

I Showed that several information deletion/distortion methods


are equivalent to penalized regression ⇒ more general
strategy for regularizing complex estimators

56 / 56
Thank you.

eric.laber@duke.edu

laber-labs.com
