
Topic 1: Linear regression and regularization
Eric B. Laber

Department of Statistical Science, Duke University

Statistics 561
The world needs another penalized regression method.
—Nobody (circa 2010).1

Penalization is like the word ‘myself,’ it’s used too much
(often incorrectly) by people wishing to seem more intelligent
than they are.
—Joel Vaughan

1
Relax, I’m 98.5 percent joking.
Warm-up (5 minutes)

I Explain to your group


I Why is penalization/regularization used with predictive
models?
I How does penalization fit into the bias-variance trade-off?
I What’s the connection between ridge regression and a
Bayesian linear model?

I True or false
I Masking occurs when there is a single large outlier
I The Gauss-Markov theorem says the least squares estimator
minimizes MSE
I In the TV show ‘Cheer’ there are only two teams competing
for a national title in Navarro’s division.

1 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
Review: fitting a linear model

I The ordinary least squares estimator is

\hat{\beta}_n = \arg\min_{\beta \in \mathbb{R}^p} P_n (Y - X^\top\beta)^2 = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^\top\beta)^2

I Alternatively, view \hat{\beta}_n as the solution to

P_n X (Y - X^\top\beta) = 0

I If P_n X X^\top is invertible, \hat{\beta}_n = (P_n X X^\top)^{-1} P_n X Y
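A minimal R sketch of the matrix formula above, on simulated data (names and data are illustrative, not from the course code):

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_star <- c(2, -1, 0.5)
y <- drop(X %*% beta_star + rnorm(n))

# Closed-form OLS: solve (Pn X X') beta = Pn X Y, i.e., the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same fit via lm() (no intercept, to match the matrix formula)
coef(lm(y ~ X - 1))
```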

2 / 56
Review: fitting a linear model cont’d

I What to do if P_n X X^\top is (nearly) singular?

3 / 56
Estimation and approximation error

I Suppose Y = f(X) + \epsilon
I Let \beta^* = \arg\min_{\beta \in \mathbb{R}^p} P(Y - X^\top\beta)^2

I Decompose the mean squared error as

E(Y - X^\top\hat{\beta}_n)^2 = \mathrm{Var}(\epsilon) + E\{f(X) - X^\top\beta^*\}^2 + E\left\{X^\top(\hat{\beta}_n - \beta^*)\right\}^2
= Noise + Approx. Err. + Est. Err.

I A complex model reduces the approximation error but
increases the estimation error

4 / 56
Parsimonious models

I Predictive models used to inform decision making


I Generate interesting hypotheses for further study
I Forecasts weighed by stakeholders
I Drive automatic action, e.g., decide if and what type of push
notification to send a patient in mHealth

I Need to build trust with stakeholders2


I Models must be validated in domain context
I Interpretable models often required

2
Note* this might be you! If you don’t trust your own models it will be hard
to make progress.
5 / 56
Parsimonious models cont’d
I Justification for parsimonious models
I Occam’s Razor
I (Medicine) the true optimal decision rule is often simple
I More nuanced mathematical arguments3

I Linear models aren’t really that interpretable


I Easy when abstracted away from the problem
I Notion of one variable moving while all others are held fixed can be
nonsensical to domain experts in some contexts4
I But the fewer terms they have, the more interpretable they
tend to be (we’ll explore lists and trees later in this course)
3
Duke’s own Cynthia Rudin has some excellent work in this area. Check out
https://impact.duke.edu/story/whats-in-the-box and her papers on this topic.
4
Imagine increasing the square-footage of a grocery store without changing
the products, back-of-house inventory, layout, etc.
6 / 56
It is our experience and strong belief that better models
and a better understanding of one’s data result from
focused data analysis, guided by substantive theory.
–Gwyneth Paltrow5

Automatic model-building procedures should be avoided


at all cost. –D.R. Cox6

5
For the best in psychic vampire repellent see:
https://goop.com/paper-crane-apothecary-psychic-vampire-repellent/p/
6
This is a real one.
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
Finding the ‘true’ model

I Assume that \beta_j^* = 0 for j \in J

I Only need to regress Y on the X_j with j \notin J
I Let X_{J^c} denote the relevant predictors
I Question: suppose an oracle gave you J; is

\hat{\beta}_n^{J^c} = \arg\min_{\beta} P_n (Y - X_{J^c}^\top\beta)^2

the optimal estimator?

I R code example: growingLinearModel.R7

7
We’ll mostly use python in this course but sometimes a guy already has
some R code written. Geez, get off my back.
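I have not seen growingLinearModel.R, but a simulation in its spirit might look like the sketch below: keep the true model fixed and watch test error deteriorate as irrelevant predictors are added (all names and settings are illustrative).

```r
set.seed(561)
n <- 100; p_max <- 50
beta_star <- c(2, -1, 0.5)                  # only three nonzero coefficients
X  <- matrix(rnorm(n * p_max), n, p_max)
y  <- drop(X[, 1:3] %*% beta_star + rnorm(n))
Xt <- matrix(rnorm(n * p_max), n, p_max)    # independent test set
yt <- drop(Xt[, 1:3] %*% beta_star + rnorm(n))

test_mse <- sapply(3:p_max, function(p) {
  fit <- lm(y ~ X[, 1:p] - 1)
  mean((yt - Xt[, 1:p] %*% coef(fit))^2)
})
plot(3:p_max, test_mse, type = "l",
     xlab = "number of predictors", ylab = "test MSE")
```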
8 / 56
Finding the ‘true’ model

I If you want optimal predictions, you may not want to use the
‘true’ model even if it were available to you
I Estimating small effects inflates variance but does little to
improve prediction
I What constitutes a small effect depends on residual variance
and sample size (not an absolute)

I Exercise: suppose X_1, \dots, X_n \sim_{i.i.d.} (\mu, \sigma^2); derive the MSE
of the estimators \tilde{\mu}_n \equiv 0 and \hat{\mu}_n = \bar{X}_n. When is \tilde{\mu}_n
preferable to \hat{\mu}_n?
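A sketch of one route to the answer (my addition, not part of the original notes):

\mathrm{MSE}(\tilde{\mu}_n) = E(0 - \mu)^2 = \mu^2, \qquad \mathrm{MSE}(\hat{\mu}_n) = \mathrm{Var}(\bar{X}_n) = \sigma^2/n,

so the trivial estimator \tilde{\mu}_n is preferable exactly when \mu^2 < \sigma^2/n, i.e., when the true effect is small relative to the noise level at the available sample size.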

9 / 56
Blank page for notes
Hard- and soft-thresholding

I Consider our toy example X1 , . . . , Xn ∼i.i.d. (µ, σ 2 )

I Idea: reduce MSE by adaptively shrinking our estimator X n


I Hard-thresholding: \hat{\mu}_n^H = \bar{X}_n\, 1\{f(|\bar{X}_n|) \ge \tau\} for some function
f : \mathbb{R}^+ \to \mathbb{R}^+

I Soft-thresholding: \hat{\mu}_n^S = \bar{X}_n\, g(|\bar{X}_n|) for some function
g : \mathbb{R}^+ \to \mathbb{R}^+
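An illustrative R sketch (not from the course code) comparing the raw mean with hard- and soft-thresholded versions by Monte Carlo MSE:

```r
set.seed(1)
mu <- 0.2; sigma <- 1; n <- 25
tau <- 2 * sigma / sqrt(n)                         # an arbitrary threshold
est <- replicate(5000, {
  xbar <- mean(rnorm(n, mu, sigma))
  c(raw  = xbar,
    hard = xbar * (abs(xbar) >= tau),              # keep or kill
    soft = xbar * max(0, 1 - tau / abs(xbar)))     # shrink toward zero
})
rowMeans((est - mu)^2)                             # Monte Carlo MSE of each
```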

10 / 56
Soft-thresholding: try it at home!
I Consider an estimator of the form \tilde{\mu}_n = \alpha\bar{X}_n; derive the value
of \alpha_{opt} that minimizes the MSE over \alpha
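One way the derivation might go (a sketch I am adding, not the official solution):

\mathrm{MSE}(\alpha\bar{X}_n) = \alpha^2\sigma^2/n + (\alpha - 1)^2\mu^2,

which is minimized at

\alpha_{opt} = \frac{\mu^2}{\mu^2 + \sigma^2/n},

so the optimal shrinkage factor is governed by the signal-to-noise ratio, as noted on the next slide.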

11 / 56
Hooray! A new estimator of the mean?

I αopt X n resembles an empirical Bayes estimator of the mean 8

I Optimal shrinkage depends on signal-to-noise ratio


I Common theme for adaptive shrinkage methods, e.g., adaptive
ridge and adaptive lasso

I Not very exciting for a 1D example, but can be extremely


useful/effective in higher dimensions

I Warning: hard-thresholding estimators are often non-regular,
which means we cannot estimate their sampling distributions
uniformly ⇒ standard inference procedures, e.g., the bootstrap or
series approximations, perform poorly
8
You will derive the analog for the hard-thresholding estimator \bar{X}_n\, 1\{|\bar{X}_n| \ge \alpha\} in
HW2.
12 / 56
Quick aside: all subsets

I Before moving on to penalized regression, we should mention


the obvious approach of looking at all models and choosing
the ‘best one’ based on some criterion

I An intuitive approach is to examine all possible models


I Choose model with lowest BIC/AIC, etc.
I p variables ⇒ 2^p possible models
I Branch-and-bound algorithms can reduce the search space,
making an all-subsets search feasible9 for p ≤ 50

I R code example: allSubsets.R

9
For a fun paper on this topic see Furnival, George M., and Robert W.
Wilson. ”Regressions by leaps and bounds.” Technometrics 42.1 (2000): 69-79.
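The slide references allSubsets.R; a possible sketch of the same idea uses the leaps package, which implements the branch-and-bound search (variable names are illustrative):

```r
library(leaps)
# x: an n x p predictor matrix, y: the response vector
all_sub <- regsubsets(x, y, nvmax = ncol(x), method = "exhaustive")
ss <- summary(all_sub)
ss$which[which.min(ss$bic), ]   # predictors in the lowest-BIC model
```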
13 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
People say sometimes that Beauty is superficial. That
may be so. But at least it is not so superficial as Thought
is. To me, Beauty is the wonder of wonders. It is only
shallow people who do not judge by appearances. The
true mystery of the world is the visible, not the invisible.
—Excerpt ”What to Expect When You’re Expecting.”
Ridge regression: a superficial first look

I The least squares estimator is unbiased and has minimum variance
among all linear unbiased estimators (Gauss-Markov)
I As we’ve seen, this does not mean it minimizes MSE
I Recall MSE = bias^2 + variance ⇒ a small increase in bias plus a big
reduction in variance = smaller MSE

I Shrink regression coefficients toward zero by solving

\hat{\beta}_n^\lambda = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \lambda\|\beta\|^2,

where λ ≥ 0 is a tuning parameter
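A minimal R sketch of the closed-form ridge solution and its ‘ridge trace’ (illustrative code; scaling conventions for λ differ across texts and software, so this matches the displayed objective with centered, scaled data):

```r
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) / nrow(X) + lambda * diag(p),
        crossprod(X, y) / nrow(X))
}

set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(3, -2, 1, 0, 0) + rnorm(n))

lambdas <- 10^seq(-3, 2, length.out = 50)
coefs   <- sapply(lambdas, function(l) ridge_fit(X, y, l))
matplot(log10(lambdas), t(coefs), type = "l",
        xlab = "log10(lambda)", ylab = "coefficient")   # a 'ridge trace'
```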

15 / 56
Historical side-note

I Proposed by Hoerl and Kennard (1970)


I Goal was to stabilize OLS estimator
I Proved there always exists λ* s.t.

E\|\hat{\beta}_n^{\lambda^*} - \beta^*\|^2 < E\|\hat{\beta}_n - \beta^*\|^2,

but λ* depends on β*; suggested looking at the ‘ridge trace’ and
picking the value at which the coefficients appear to ‘stabilize’10; we’ll
look at other data-driven tuning methods

I Fact: Hoerl is a fun name to say, try it out

10
the ridge trace plot is what we might call a solution path today (nothing is
new)
16 / 56
Ridge regression: orthogonal case

I Ridge regression estimator can give dramatic reduction in


MSE, especially when the dimension p is large

I Consider first the orthogonal case P_n X X^\top = I_p, where

\hat{\beta}_{n,j}^\lambda = \frac{\hat{\beta}_{n,j}}{1 + \lambda},

thus E(\hat{\beta}_{n,j}^\lambda) = \beta_j^*/(1 + \lambda) and \mathrm{Var}(\hat{\beta}_{n,j}^\lambda) = \mathrm{Var}(\hat{\beta}_{n,j})/(1 + \lambda)^2

I Exercise: suppose p = 1, find the value of λ that minimizes
MSE(\hat{\beta}_{n,1}^\lambda). What is the optimal λ for general p?
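One route to the answer (a sketch I am adding under the orthogonal-design setup above, where \mathrm{Var}(\hat{\beta}_{n,j}) = \sigma^2/n):

\mathrm{MSE}(\hat{\beta}_{n,1}^\lambda) = \frac{\lambda^2 (\beta_1^*)^2 + \sigma^2/n}{(1 + \lambda)^2},

which is minimized at \lambda_{opt} = \sigma^2/\{n(\beta_1^*)^2\}; for general p, summing the coordinate-wise MSEs gives \lambda_{opt} = p\sigma^2/(n\|\beta^*\|^2). Either way the best amount of shrinkage grows as the signal shrinks, tying back to the signal-to-noise theme.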

17 / 56
Blank page for notes
Ridge regression cont’d

I In the non-orthogonal case

\hat{\beta}_n^\lambda = (P_n X X^\top + n^{-1}\lambda I_p)^{-1} P_n X Y
= \{I + n^{-1}\lambda (P_n X X^\top)^{-1}\}^{-1} (P_n X X^\top)^{-1} P_n X Y
= \{I + n^{-1}\lambda (P_n X X^\top)^{-1}\}^{-1} \hat{\beta}_n

I How to choose λ ≥ 0 in this case?


I Information criteria AIC, BIC, etc.
I Cross-validation (generalized cross-validation)
I You’ll explore these criteria in lab

19 / 56
Tuning λn (one more criterion)

I Can use empirical Bayes (EB) to select λn

I EB in a nutshell
I Posit a Bayesian model
I Use the marginal distribution to obtain frequentist estimators of the
hyper-parameters11

I Pro-tip: if you want to derive an estimator that performs well


in practice, posit a Bayesian model, derive posterior mean, call
this a frequentist estimator and hide all evidence you ever
considered Bayes approach12

11
This is a rich area but we don’t have the bandwidth to cover it in depth in
this class.
12
Credit to Derek Bingham for this nugget of wisdom.
20 / 56
Tuning λn with EB

I Assume the linear model is correct: Y = X^\top\beta + \epsilon where
\epsilon \sim \mathrm{Normal}(0, \sigma^2), \beta \sim \mathrm{Normal}(\mu, \tau^2 I), and \beta \perp \epsilon

I The posterior mean is

\hat{\beta}_n = (P_n X X^\top + \lambda_n I)^{-1} (P_n X Y + \lambda_n \mu),

where \lambda_n = \sigma^2/(n\tau^2)
I Setting µ = 0 yields the ridge estimator; given estimators \hat{\sigma}_n^2 and
\hat{\tau}_n^2, we obtain a plug-in estimator of \lambda_n, i.e., \hat{\lambda}_n = \hat{\sigma}_n^2/(n\hat{\tau}_n^2)

21 / 56
Tuning λn with EB cont’d

I Marginal distribution13: E Y^2 = \tau^2\,\mathrm{trace}(X X^\top) + \sigma^2

I The MOM estimator matches

P_n Y^2 \approx \tau^2\,\mathrm{trace}(P_n X X^\top) + \sigma^2;

plugging in \hat{\sigma}_n^2 = P_n (Y - X^\top\hat{\beta}_n)^2 yields

\hat{\tau}_n^2 = \frac{P_n Y^2 - \hat{\sigma}_n^2}{\mathrm{trace}(P_n X X^\top)}

and thus \hat{\lambda}_n = \hat{\sigma}_n^2/(\hat{\tau}_n^2 n)

13
To align with classic derivations, we’re conditioning on X and treating it as
fixed here. Which makes the expression Pn X X | a bit of an abuse of notation.
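A small R sketch of this plug-in rule (my illustration; assumes a centered response y and a centered, scaled n x p design matrix X, as in the slides):

```r
n <- nrow(X); p <- ncol(X)
beta_ols   <- solve(crossprod(X), crossprod(X, y))
sigma2_hat <- mean((y - X %*% beta_ols)^2)          # Pn(Y - X'beta_hat)^2
tau2_hat   <- (mean(y^2) - sigma2_hat) /
              sum(diag(crossprod(X) / n))           # trace(Pn X X')
lambda_hat <- sigma2_hat / (n * tau2_hat)
beta_eb    <- solve(crossprod(X) / n + lambda_hat * diag(p),
                    crossprod(X, y) / n)            # ridge at lambda_hat
```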
22 / 56
Tuning λn with EB more notes

I Note that \hat{\lambda}_n is a well-defined statistic without reference to
the Bayes model
I We can use this estimator and analyze it from a frequentist
point-of-view
I Bayesian connection is just icing

I Related idea: when we talk about RL, we might use a
Bayesian framework for exploration-exploitation but not
require these models to be correct when analyzing algorithm
performance

23 / 56
Blank page for notes
Ridge regression: BIC
I Let X denote the design matrix; the effective degrees of freedom are

\mathrm{df}(\lambda) = \mathrm{trace}\left\{X (X^\top X + \lambda I)^{-1} X^\top\right\};

to compute this efficiently we use the SVD of X:

X = U D V^\top \Rightarrow \mathrm{df}(\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}

I The BIC for each value of λ is

\mathrm{BIC}(\lambda) = \log\{\mathrm{RSS}(\lambda)\} + \mathrm{df}(\lambda)\log(n)/n;

choose \hat{\lambda}_{BIC} = \arg\min_\lambda \mathrm{BIC}(\lambda)
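A short R sketch of this BIC computation over a grid of λ values (illustrative; X and y as in the earlier sketches, with λ on the X^\top X scale to match the df formula):

```r
d2      <- svd(X)$d^2
lambdas <- 10^seq(-3, 4, length.out = 100)
bic <- sapply(lambdas, function(l) {
  beta <- solve(crossprod(X) + l * diag(ncol(X)), crossprod(X, y))
  rss  <- sum((y - X %*% beta)^2)
  df   <- sum(d2 / (d2 + l))
  log(rss) + df * log(nrow(X)) / nrow(X)
})
lambda_bic <- lambdas[which.min(bic)]
```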

24 / 56
Asymptotic behavior

I Under what conditions on \{\lambda_n\}_{n \ge 1} will:

I \hat{\beta}_n^{\lambda_n} \to \beta^*
I \sqrt{n}\,(\hat{\beta}_n^{\lambda_n} - \beta^*) converge to a Gaussian limit

I Before we mathematize anything, what does your intuition


say? Should λn shrink? Grow? Converge to a constant?

25 / 56
Asymptotic behavior notes
Housekeeping: project
I Key points
I Groups of 1-5
I Due April 23
I Instructor approval required by March 1

I What you’ll produce: a PRFAQ + Technical Appendix/POC


I Pitching a new idea to improve science/society
I PR = Press release: a one-page description of the
launch of your product
I FAQ = Frequently Asked Questions: 2-5 pages identifying
potential pitfalls, solutions, contingency plans
I Technical appendix + POC: the math and code needed for a
proof of concept or minimal viable product
I Imagine you’re pitching to VC, gov funding agency, private
foundation, etc.
27 / 56
Housekeeping project: cont’d
I What you owe me by March 1
I Your group!
I Title + one paragraph summary of your idea
I An outline of technical results you plan to include

I How to find a group?


I Use your study buddies
I I will have a speed-dating session Wed in class

I How to find a project idea


I Something you’re already working on or passionate about
I Look around, where is there need? Where is there potential?
Think big!
28 / 56
Ridge regression: prostate data example
I Example: prostate cancer data set
I n = 97 patients, response is log prostate-specific antigen (lpsa)
I Predictors:
I lcavol: log (cancer volume)
I lweight: log(prostate weight)
I age: age
I lbph: log (benign prostatic hyperplasia)
I svi: seminal vesicle invasion
I lcp: log (capsular penetration)
I gleason: Gleason score
I pgg45: percent of Gleason scores 4 or 5

I R code example: ridgeExample.R
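One possible sketch in the spirit of ridgeExample.R (an assumption on my part: it uses the prostate data as packaged in the archived ElemStatLearn package, with the columns listed above; glmnet with alpha = 0 fits ridge):

```r
library(glmnet)
data(prostate, package = "ElemStatLearn")
x <- scale(as.matrix(prostate[, 1:8]))    # the eight predictors above
y <- prostate$lpsa - mean(prostate$lpsa)

fit <- glmnet(x, y, alpha = 0)            # alpha = 0 gives the ridge penalty
plot(fit, xvar = "lambda")                # solution path, i.e., a ridge trace
cv  <- cv.glmnet(x, y, alpha = 0)
coef(cv, s = "lambda.min")
```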


29 / 56
Ridge regression and dropout

I Dropout is popular heuristic to reduce overfitting in machine


learning models (esp. nnets)
I Randomly ‘zero-out’ some components in the training input
I Lots of heuristic explanations for why this works14; with a
linear model we can obtain a more rigorous answer

I Let Z ∈ {0, 1}^p be a vector of independent Bernoulli random
variables with P(Z_j = 0) = \phi (and P(Z_j = 1) = 1 - \phi)
I Suppose Z is independent of (X, Y)
I Random dropout: X \circ Z = (X_1 Z_1, \dots, X_p Z_p)^\top

14
Search reddit or quora for ‘why dropout works’.
30 / 56
Ridge regression and dropout cont’d

I Imagine for each i = 1, \dots, n we generate a bajillion15 values
of Z, say Z_i^1, \dots, Z_i^B, and subsequently generate the new dataset
\left\{\left(Y_i,\; X_i \circ Z_i^k/(1 - \phi)\right) : k = 1, \dots, B,\; i = 1, \dots, n\right\}

I Now suppose we fit a linear model using these data

\tilde{\beta}_n = \arg\min_{\beta} \sum_{k=1}^B \sum_{i=1}^n \left\{Y_i - (X_i \circ Z_i^k)^\top\beta/(1 - \phi)\right\}^2
\approx \arg\min_{\beta} E_Z P_n \left\{Y - (X \circ Z)^\top\beta/(1 - \phi)\right\}^2,

where we’ve approximated the sample average with an
expectation (recall we can generate as many Z’s as we want)
15
By a bajillion, I mean a lot. Also, this lecture is sponsored by ”Bajillion
Dollar Properties,” watch now on Amazon Prime.
31 / 56
Ridge regression and dropout cont’d
I Suppose the data have been centered and scaled; derive a closed-form
expression for \tilde{\beta}_n
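A sketch of how the derivation might go (my addition, not the official solution): write W = Z/(1 - \phi), so E W_j = 1 and \mathrm{Var}(W_j) = \phi/(1 - \phi). Then

E_Z\{Y - (X \circ W)^\top\beta\}^2 = (Y - X^\top\beta)^2 + \frac{\phi}{1 - \phi}\sum_{j=1}^p X_j^2 \beta_j^2,

and taking P_n with scaled predictors (P_n X_j^2 = 1) gives the ridge objective P_n(Y - X^\top\beta)^2 + \frac{\phi}{1-\phi}\|\beta\|^2, so \tilde{\beta}_n is the ridge estimator with \lambda = \phi/(1 - \phi).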

32 / 56
Ridge regression and dropout discussion

I Reducing the info available to the model to make a prediction at
any given point (X \mapsto X \circ Z) has a regularizing effect
I General strategy can be used with models that are harder to
penalize explicitly, e.g., this is used when building trees in
random forests (we’ll talk more about this later)
I Smoother version of dropout can be obtained by replacing
Bernoulli’s with any unit-mean r.v.’s

33 / 56
Ridge regression and noise addition

I Another way to reduce the info available to the model is to replace
X with X + Z, with Z a vector of independent (0, λ) r.v.’s
(mean zero, variance λ)

I As with dropout, note that this is a general strategy that can


be applied with any ML algorithm (for supervised learning)

I In-class exercise: compute

\tilde{\beta} = \arg\min_{\beta} E_Z P_n \{Y - (X + Z)^\top\beta\}^2
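A sketch of the computation (my addition): expanding the square and using E Z = 0 and E Z Z^\top = \lambda I,

E_Z\{Y - (X + Z)^\top\beta\}^2 = (Y - X^\top\beta)^2 + \lambda\|\beta\|^2,

so \tilde{\beta} again solves the ridge problem with penalty parameter λ.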

34 / 56
Blank page for notes
Ridge regression: discussion

I Reduces MSE by shrinking regression coefficients, very


effective in a wide range of settings (battle-tested)

I Does not perform variable selection, i.e., all variables are kept
in the model; this can be a problem if parsimony is critical, but
we can always threshold small values to zero

I Choose amount of penalization using information criteria


(BIC, AIC, etc.) or cross-validation

35 / 56
Principal components regression

I A closely related alternative to ridge regression is principal


components regression

I Review: principal components16


I Write X^\top X/n = V D^2 V^\top where V = [v_1, \dots, v_p]
I Z_j = X^\top v_j is called the jth principal component of X
I \mathrm{Var}(Z_1) \ge \mathrm{Var}(Z_2) \ge \dots \ge \mathrm{Var}(Z_p)

I Idea: regress Y on Z_1, \dots, Z_q for some q ≤ p instead of
regressing on X
I Capture important features of X
I Reduce dimension ⇒ bias-variance trade-off

16
Assume predictors have been centered
36 / 56
Principal components regression cont’d

I Matrix of principal components is Z = XV


I The PC regression design matrix is orthogonal: Z^\top Z = n D^2 (why?)
I Compute regression of Y on Z1 , . . . , Zq using series of
univariate regressions
I Tuning parameter q can be chosen using BIC/AIC,
cross-validation, etc.

I R code example: pcr.R
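I have not seen pcr.R; a possible sketch of the idea in base R (the pls package’s pcr() is another option), for a given predictor matrix x and response y:

```r
pcs <- prcomp(x, center = TRUE, scale. = TRUE)
q   <- 3                                    # number of components kept
Z   <- pcs$x[, 1:q, drop = FALSE]           # first q principal components
pcr_fit <- lm(y ~ Z)
summary(pcr_fit)                            # regression on Z1, ..., Zq
```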

37 / 56
Principal components regression cont’d

I Pros:
I Reduce MSE relative to standard least squares
I Sometimes interpret principal components (see pcr.R)
I Parallel computation possible

I Cons:
I Interpretation of principal components subjective
I Doesn’t involve Y in construction of features (pc’s)

38 / 56
Roadmap

I Review and reminders

I All subsets (you don’t want the truth)

I Ridge regression

I Lasso
A nice relaxing quiz

I Explain to your group


I What does the acronym LASSO stand for?
I What is the problem of inference after model selection?
I What goes wrong in regression when p ≫ n?

I True or false
I Maximum likelihood is limited to parametric models
I Large coefficient std errors can be sign of near collinearity
I The world’s most expensive Donkey cheese sells for several
thousand dollars a pound

39 / 56
Lasso: superficial overview

I Simultaneous estimation and model selection via penalization

\hat{\beta}_n^\tau = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \tau\|\beta\|_1,

where \|\beta\|_1 = \sum_{j=1}^p |\beta_j| and τ > 0 is a tuning parameter

I Looks like ridge but use of L1 norm yields sparse solutions
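A brief R sketch with glmnet (illustrative; alpha = 1 gives the lasso, with x and y as in the prostate example):

```r
library(glmnet)
fit_lasso <- glmnet(x, y, alpha = 1)
plot(fit_lasso, xvar = "lambda")        # coefficients drop to exactly zero
cv_lasso  <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.1se")        # a sparse coefficient vector
```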

40 / 56
Lasso vs. ridge (a stolen picture)

41 / 56
Lasso: orthogonal predictors

I Suppose X^\top X = I; then

P_n (Y - X^\top\beta)^2 + \tau\|\beta\|_1 = P_n Y^2 - \|\hat{\beta}_n\|^2 + \|\hat{\beta}_n - \beta\|^2 + \tau\|\beta\|_1,

so we can re-write the lasso solution as

\hat{\beta}_n^\tau = \arg\min_{\beta} \|\hat{\beta}_n - \beta\|^2 + \tau\|\beta\|_1
= \arg\min_{\beta} \sum_{j=1}^p \left\{(\hat{\beta}_{n,j} - \beta_j)^2 + \tau|\beta_j|\right\},

i.e., we can look at each component separately
I R code example: lasso.R

42 / 56
Lasso: orthogonal predictors

I In the orthogonal case we can compute \hat{\beta}_{n,j}^\tau explicitly:

\hat{\beta}_{n,j}^\tau = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \tau/2\right)_+,

where (u)_+ = \max(0, u)

I Can give us further insight into lasso soln (back to lasso.R)
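The soft-thresholding operator in the display above, as a small R helper (illustrative):

```r
soft_threshold <- function(b, tau) sign(b) * pmax(abs(b) - tau / 2, 0)

curve(soft_threshold(x, tau = 4), from = -10, to = 10,
      xlab = "least squares estimate", ylab = "lasso estimate")
abline(0, 1, lty = 2)   # identity line for reference
```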

43 / 56
Computing the lasso solution

I Can reformulate lasso objective as quadratic program,


however, faster iterative algorithms exist

I As with ridge, select the tuning parameter using BIC/AIC,
cross-validation, etc.
I Problem: \hat{\beta}_n^\tau is not a linear estimator; how do we define degrees of
freedom?
I An approximate degrees of freedom is the number of nonzero
components of \hat{\beta}_n^\tau:

\mathrm{df}(\tau) = \sum_{j=1}^p 1\{\hat{\beta}_{n,j}^\tau \ne 0\}

44 / 56
Lasso discussion

I Tool for simultaneous variable selection and estimation


I Can be solved very quickly for large problems
I Leads to parsimonious models ⇒ interpretable
I Variants exist, e.g., for GLMs, Cox-PH, etc. (more on this later)

I Potential problems
I Shrinks all the coefficients, even those that are not close to
zero, which can lead to excessive bias
I Selects at most min(n, p) variables, problematic in some settings

45 / 56
Adaptive lasso
I Recall nature of shrinkage for lasso
(Figure: the lasso soft-thresholding function with τ = 12.5; the shrunken estimate plotted against the least squares estimate \hat{\beta}_n over the range -20 to 20.)
46 / 56
Adaptive lasso cont’d

I Lasso shrinks all coefficients toward zero


I Introduces excessive bias when true regression coef is large
I Potentially better strategy is to shrink more aggressively when
coefficients are small and less aggressively if they are large
I How can we do this if true coefficients are unknown?

I Idea: use ordinary least squares estimates as surrogates

47 / 56
Adaptive lasso cont’d

I Adaptive lasso estimator

\hat{\beta}_n^\delta = \arg\min_{\beta} P_n (Y - X^\top\beta)^2 + \delta \sum_{j=1}^p |\beta_j|/|\hat{\beta}_{n,j}|

I If |βbn,j | is small ⇒ more shrinkage is applied

I Note: if p is large, the ridge estimator can be used to form the weights

48 / 56
Adaptive lasso: orthogonal case

I In the orthogonal case the adaptive lasso estimator is

\hat{\beta}_{n,j}^\delta = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \frac{\delta}{2|\hat{\beta}_{n,j}|}\right)_+

for comparison

Lasso: \hat{\beta}_{n,j}^\tau = \mathrm{sgn}(\hat{\beta}_{n,j})\left(|\hat{\beta}_{n,j}| - \frac{\tau}{2}\right)_+
Ridge: \hat{\beta}_{n,j}^\lambda = \frac{\hat{\beta}_{n,j}}{1 + \lambda}
I R code example: adaptiveLasso.R

49 / 56
Adaptive lasso: fitting non-orthogonal case

I Adaptive lasso objective can be recast as quadratic program


I We can use existing lasso software by modifying design matrix
I Write

P_n (Y - X^\top\beta)^2 + \delta \sum_{j=1}^p |\beta_j|/|\hat{\beta}_{n,j}| = P_n (Y - \tilde{X}^\top\tilde{\beta})^2 + \delta \sum_{j=1}^p |\tilde{\beta}_j|,

where \tilde{X}_j = \hat{\beta}_{n,j} X_j and \tilde{\beta}_j = \beta_j/\hat{\beta}_{n,j}

50 / 56
Adaptive lasso: fitting non-orthogonal case cont’d

I Pseudo-code (assumes centered, scaled data)

1. Fit ordinary least squares, say \hat{\beta}_n

2. Create the scaled design matrix \tilde{X} = X\,\mathrm{diag}(\hat{\beta}_n)

3. For each δ under consideration compute the lasso estimator

\tilde{\beta}_n^\delta = \arg\min_{\beta} P_n (Y - \tilde{X}^\top\beta)^2 + \delta\|\beta\|_1,

then compute the adaptive lasso estimator \hat{\beta}_{n,j}^\delta = \hat{\beta}_{n,j}\,\tilde{\beta}_{n,j}^\delta
4. Select δ using BIC/AIC, etc.
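A direct R translation of these steps (a sketch; it selects δ by cross-validation rather than BIC, and glmnet can also handle the weights through its penalty.factor argument):

```r
library(glmnet)
beta_ols <- coef(lm(y ~ x - 1))                  # step 1 (centered, scaled data)
x_tilde  <- sweep(x, 2, beta_ols, `*`)           # step 2: X diag(beta_ols)
cv_fit   <- cv.glmnet(x_tilde, y, alpha = 1)     # step 3, with CV over delta
beta_til <- as.vector(coef(cv_fit, s = "lambda.min"))[-1]
beta_ad  <- beta_ols * beta_til                  # rescale back to the X scale
```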

51 / 56
Adaptive lasso discussion

I Reduce excessive bias due to overshrinkage of large coefficients


I An alternative is to refit the model selected by the lasso using least
squares; this generally works well in practice

I Can use existing statistical software, generally much more


computationally efficient than alternative proposals (SCAD,
etc.)

52 / 56
Robustness and the lasso

I Let c_1, \dots, c_p be positive constants and define

\mathcal{M} = \left\{\Delta \in \mathbb{R}^{n \times p} : \sqrt{\sum_{i=1}^n \Delta_{i,j}^2} \le c_j,\; j = 1, \dots, p\right\},

i.e., the set of n × p matrices where the norm of the jth column is
bounded by c_j for j = 1, \dots, p

I The robust linear regression estimator under perturbations \mathcal{M} is

\tilde{\beta}_n = \arg\min_{\beta} \left\{\max_{\Delta \in \mathcal{M}} \|Y - (X + \Delta)\beta\|_2\right\}

53 / 56
Robustness and lasso equivalence thm
I Claim: the lasso and robust regression solutions are equivalent in that

\tilde{\beta}_n = \arg\min_{\beta} \left\{\max_{\Delta \in \mathcal{M}} \|Y - (X + \Delta)\beta\|_2\right\}
= \arg\min_{\beta} \left\{\|Y - X\beta\|_2 + \sum_{j=1}^p c_j |\beta_j|\right\}

54 / 56
Blank page for notes
Robustness and lasso discussion

I Showed equivalence between so-called square-root lasso and


robust regression
I Setting c_j ≡ τ yields the standard (square-root) lasso
I Setting c_j ≡ 1/|\hat{\beta}_{n,j}| yields the adaptive (square-root) lasso

I The square-root lasso has some desirable properties in terms


of tuning (see papers by A. Belloni here at Duke), however,
the solution paths are the same

55 / 56
Penalization and regularization discussion

I Bias-variance trade-off ⇒ regularization needed to improve


predictive performance

I Ridge regression smoothly (soft) penalizes coefficients but can
be combined with thresholding to obtain sparse solutions

I Lasso automatically yields sparse solutions but may perform


poorly if true signal is dense and comprised of weak signals

I Showed that several information deletion/distortion methods


are equivalent to penalized regression ⇒ more general
strategy for regularizing complex estimators

56 / 56
Thank you.

eric.laber@duke.edu

laber-labs.com
