01 Lecturenote SRM
Learning Objective
Understand that many machine learning algorithms solve the structural risk minimization problem,
which is essentially minimizing a combination of loss function and model complexity penalty.
Application
Build a spam filter that works well on the training data and generalizes well to
unseen test data.
In fact, this will be our first implementation project for the course. Take some
time to answer the following warm-up questions:
(1) What does our data look like? (2) What are the features? (3) What is the
prediction task? (4) How well do you think a linear classifier will perform? (5) How
do you measure the performance of a (linear) classifier?
1 Introduction
1.1 Machine Learning Problem
Assume we have a dataset
D = {(xi , yi )}i=1,...,n , (1)
our goal is to learn
y = h(x) (2)
such that:
h(xi ) = yi , ∀i = 1, . . . , n and
h(x∗ ) = y ∗ for unseen test data
Note that we can write y = h(f(x)) with f : X → R: we obtain classification models with class labels {−1, +1}
by choosing the sign function h(a) = sign(a), and regression models by using the identity function h(a) = I(a) = a.
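To make the composition concrete, here is a minimal Python sketch (the function names and the linear score f are illustrative choices, not part of the lecture):

```python
import numpy as np

def f(x, w):
    # An illustrative score function f : R^d -> R (here: linear)
    return float(np.dot(w, x))

def classify(x, w):
    # Classification: h(a) = sign(a), labels in {-1, +1}
    return 1 if f(x, w) >= 0 else -1

def regress(x, w):
    # Regression: h(a) = I(a) = a, the identity
    return f(x, w)
```

The same real-valued score drives both model types; only the outer function h changes.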
Recap: Occam’s razor says one should pick the simplest model that adequately explains the data.
This principle is formalized by the following optimization problem:

min_w L(w) = min_w (1/n) ∑_{i=1}^n l(h_w(x_i), y_i) + λ r(w)    (5)
where the objective function L(w) combines a loss function penalizing a high training error and a regularizer
penalizing the model complexity. λ is a model hyperparameter that controls the trade-off between the terms.
We are interested in choosing λ to minimize the true risk (test error). As the true risk is unknown we may
resort to cross-validation to choose λ.1 Minimizing the (surrogate) training loss alone is known as Empirical
Risk Minimization (erm). Extending the objective function to incorporate a regularizer leads to Structural
Risk Minimization (srm).
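A minimal Python sketch of Eq. (5) (the function names and the plugged-in squared loss / l2 regularizer are my own illustrative choices):

```python
import numpy as np

def srm_objective(w, X, y, lam, loss, reg):
    """L(w) = (1/n) * sum_i loss(h_w(x_i), y_i) + lam * r(w)."""
    preds = X @ w                                   # linear model; X is n x d here
    train_loss = np.mean([loss(p, t) for p, t in zip(preds, y)])
    return train_loss + lam * reg(w)

# example plug-ins: squared loss and the l2 regularizer
squared = lambda p, t: (p - t) ** 2
l2 = lambda w: np.sum(w ** 2)
```

Swapping in other loss/regularizer pairs yields the different models discussed below.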
This provides us with a unified view on many machine learning methods. By plugging in different (surrogate)
loss functions and regularizers, we obtain different machine learning models. Remember for example the
unconstrained SVM formulation:
min_{w,b} C ∑_{i=1}^n max[1 − y_i(w^T x_i + b), 0] + ||w||_2^2    (6)

Here f_w(x_i) = w^T x_i + b, the sum is the hinge loss, and ||w||_2^2 is the l2-regularizer. SVM thus uses the
hinge loss as its error measure and the l2-regularizer to penalize complex solutions; this matches Eq. (5) with
λ = 1/C.
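A sketch of the objective in Eq. (6) (assuming data stored row-wise in an n x d matrix; names are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    # hinge loss: max[1 - y_i * (w^T x_i + b), 0], summed over the data
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)
    # l2-regularizer ||w||_2^2 (note: the bias b is not regularized)
    return C * hinge.sum() + float(np.dot(w, w))
```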
1 In Bayesian machine learning, we truly incorporate the choice of λ into the learning objective.
2 Loss Functions
2.1 Commonly Used Binary Classification Loss Functions
For binary classification we would naturally like to minimize the zero-one loss l(h_w(x), y), also called the true
classification error. However, it is discontinuous and non-convex, which makes it impractical to optimize directly,
so we resort to various surrogate loss functions l(f_w(x), y). Table 1 summarizes the most commonly used
classification losses.
What do all these loss functions look like? The illustration below shows the zero-one and exponential losses,
where the input/x-axis is the “correctness” of the prediction f_w(x_i) y_i.
Exercise 2.1. Add the hinge loss and log loss functions to this plot.
Exercise 2.2. In what scenario would you want to use the Huber loss instead of squared or absolute loss?
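The standard surrogate losses can all be written as functions of the margin z = f_w(x_i) y_i; a small sketch (variable names are mine):

```python
import numpy as np

# each loss takes the margin z = f_w(x_i) * y_i
zero_one = lambda z: np.where(z <= 0, 1.0, 0.0)    # true classification error
exponential = lambda z: np.exp(-z)                 # used e.g. in AdaBoost
hinge = lambda z: np.maximum(1.0 - z, 0.0)         # used in SVM
log_loss = lambda z: np.log1p(np.exp(-z))          # used in logistic regression
```

Evaluating these on a grid of z values reproduces the plot discussed above.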
Absolute Loss:
∣f_w(x_i) − y_i∣
• more robust to outliers than the squared loss
• minimized by the conditional median of the observations3
2 For the same input location and multiple observations, the minimal squared loss is achieved for the mean of the observations.
3 For the same input location and multiple observations, the minimal absolute loss is achieved for the median of the observations.
Huber Loss:

l(z_i) = { (1/2) z_i^2      if ∣z_i∣ < δ
         { δ(∣z_i∣ − δ/2)   otherwise,      where z_i = f_w(x_i) − y_i

• also known as Smooth Absolute Loss
• once-differentiable
• ADVANTAGE: “Best of Both Worlds” of squared and absolute loss
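A direct transcription of the piecewise definition (sketch; vectorized over residuals):

```python
import numpy as np

def huber(z, delta=1.0):
    # z = f_w(x_i) - y_i; quadratic near zero, linear in the tails
    a = np.abs(z)
    return np.where(a < delta,
                    0.5 * z ** 2,
                    delta * (a - 0.5 * delta))
```

At |z| = δ both branches give δ²/2 and the slopes match, which is why the loss is once-differentiable.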
Log-Cosh Loss:

log(cosh(f_w(x_i) − y_i)),      where cosh(x) = (e^x + e^{−x})/2

• ADVANTAGE: similar to Huber loss, but twice differentiable everywhere.
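A one-line sketch (for very large residuals np.cosh can overflow; fine for illustration):

```python
import numpy as np

def log_cosh(z):
    # z = f_w(x_i) - y_i; behaves like z^2/2 near 0 and like |z| - log(2) for large |z|
    return np.log(np.cosh(z))
```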
ε-Insensitive Loss:

l(z_i) = { 0           if ∣z_i∣ < ε
         { ∣z_i∣ − ε   otherwise,      where z_i = f_w(x_i) − y_i

• yields sparse solutions (cf. support vectors)
• used in SVM regression
• ε regulates the sensitivity of the loss and hence the number of support vectors used in the SVM
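A sketch of the ε-insensitive loss (vectorized):

```python
import numpy as np

def eps_insensitive(z, eps=0.1):
    # z = f_w(x_i) - y_i; exactly zero inside the eps-tube around the target
    return np.maximum(np.abs(z) - eps, 0.0)
```

Points with |z| < ε contribute nothing to the loss (and no support vector), which is the source of the sparsity.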
What do all these loss functions look like? The illustration below shows the squared loss, Huber loss with
δ = 1, and log-cosh loss, where the input/x-axis is the “correctness” of the prediction z = f_w(x_i) − y_i.
Exercise 2.3. Add the functions for the absolute loss and the Huber loss with δ = 2 to this plot.
Exercise 2.4. In the same or a new plot, graph the functions for the ε-insensitive loss for ε = 1 and ε = 0.1.
3 Regularizers
Similar to the SVM primal derivation, we can establish the following equivalence:

min_{w,b} ∑_{i=1}^n l(w^T x_i + b, y_i) + λ r(w)   ⟺   min_{w,b} ∑_{i=1}^n l(w^T x_i + b, y_i)   subject to: r(w) ≤ B    (7)

For each λ ≥ 0 there exists a B ≥ 0 such that the two formulations in Eq. (7) are equivalent, and vice versa.
In previous sections, the l2 -regularizer has been introduced as the component in SVM that reflects the
complexity of solutions. Besides the l2 -regularizer, other types of useful regularizers and their properties are
listed in Table 3.
l1 -regularizer: r(w) = ∣∣w∣∣_1 = ∑_j ∣w_j∣
• + convex and sparsity inducing (good for feature selection)
• − not differentiable at w_j = 0
Elastic Net: r(w) = α∣∣w∣∣_1 + (1 − α)∣∣w∣∣_2^2 with α ∈ [0, 1)
• combines the sparsity of l1 with the strict convexity (unique solution) of l2
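As code, the three regularizers look like this (the elastic net mixing parameter α below is one common parameterization, my choice here):

```python
import numpy as np

# regularizers are functions of the weight vector w only
l2 = lambda w: np.sum(w ** 2)            # ridge penalty ||w||_2^2
l1 = lambda w: np.sum(np.abs(w))         # lasso penalty ||w||_1, sparsity inducing

def elastic_net(w, alpha=0.5):
    # convex combination of the l1 and l2 penalties
    return alpha * l1(w) + (1.0 - alpha) * l2(w)
```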
Figure 1a shows plots of some commonly used regularizers. Note that regularizers are functions of w and
not of xi. Figure 1b shows the effect that adding a regularizer to the loss-function minimization has on the
optimization problem and its solution.
Figure 1: (a) Commonly used regularizers. (b) Contours of a loss function, the constraint region for the
l2-regularizer, and the optimal solution.
(a) Add the constraint regions for the l1 and elastic net regularizers to the plot in Figure 1b.
(b) The l1 -regularizer yields sparse solutions for w. Explain why.
(c) What is the advantage of using the elastic net regularizer compared to l1 regularization? What is its
advantage compared to l2 regularization? Briefly explain one advantage each.
Notation: Here we use X ∈ Rd×n , where d is the number of feature dimensions and n is the number of
training data points4 . Caution: this is the transpose of what they use in fcml.
• y = [y1 , ..., yn ]
4 I personally prefer this notation since you can quickly determine whether we are using inner or outer products. E.g. the
inner product in matrix notation will then look like X^T X, which is more intuitive as it aligns with the inner product of vectors
x^T x. However, both notations are found in the literature. So, always be careful about the definition of X.
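A quick shape check of this convention (illustrative sketch):

```python
import numpy as np

d, n = 3, 5                        # d feature dimensions, n training points
X = np.ones((d, n))                # columns are the data points x_i

G = X.T @ X                        # n x n: entries are inner products x_i^T x_j
S = X @ X.T                        # d x d: sum of outer products x_i x_i^T
print(G.shape, S.shape)
```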
Ridge Regression:

min_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ∣∣w∣∣_2^2

• + convex and differentiable
• Solve in closed form or with gradient descent

Lasso:

min_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ∣∣w∣∣_1

• + sparsity inducing (good for feature selection)
• + convex
• Solve with (sub-)gradient descent or LARS (least angle regression)
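A subgradient-descent sketch for the Lasso objective (row-wise n x d data; learning rate and step count are arbitrary illustrative values):

```python
import numpy as np

def lasso_subgradient(X, y, lam, lr=0.01, steps=2000):
    """Minimize (1/n) * sum (w^T x_i - y_i)^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of the squared loss
        grad += lam * np.sign(w)               # a subgradient of lam * ||w||_1
        w -= lr * grad
    return w
```

Note that np.sign(0) = 0 is a valid subgradient at the kink; production solvers prefer coordinate descent or proximal methods.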
Logistic Regression:

min_w (1/n) ∑_{i=1}^n log(1 + e^{−y_i(w^T x_i + b)})

• often also l1 or l2 regularized
• P(y = +1 ∣ x) = 1/(1 + e^{−(w^T x + b)})
• Solve with gradient descent or Newton’s method
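A plain gradient-descent sketch for unregularized logistic regression with labels in {-1, +1} (function names and step sizes are my own):

```python
import numpy as np

def logistic_gd(X, y, lr=0.1, steps=500):
    """Minimize (1/n) * sum log(1 + exp(-y_i (w^T x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        m = y * (X @ w + b)              # margins y_i (w^T x_i + b)
        s = -y / (1.0 + np.exp(m))       # derivative of the log loss w.r.t. f
        w -= lr * (X.T @ s) / n
        b -= lr * s.mean()
    return w, b

def predict_proba(x, w, b):
    # P(y = +1 | x) = 1 / (1 + exp(-(w^T x + b)))
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```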
SVM:

min_{w,b} C ∑_{i=1}^n max[1 − y_i(w^T x_i + b), 0] + ∣∣w∣∣_2^2

• hinge loss with the l2-regularizer, λ = 1/C (cf. Eq. (6))
• yields a sparse solution in terms of support vectors
5 Summary
srm tries to find the best model by explicitly adding a weighted complexity penalty (regularization term) to
the loss function optimization. List of concepts and terms to understand from this lecture:
• erm
• srm
• overfitting
• underfitting
• Occam’s razor
• training loss, testing loss
• surrogate loss
• regularizer
• hyperparameter
(a) Using your own words, summarize each of the above concepts in 1-2 sentences by retrieving the knowledge
from the top of your head.
(b) What is the difference between erm and srm?
(c) What is the difference between overfitting and underfitting?
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!
Our Application
With respect to our spam filter application, we are now able to come up with
objective functions, aka learning models, that will (hopefully) work well on the
training data and are able to generalize well to unseen test data.
The next step is to actually build the spam filter, which means we will need
to solve the srm problem (Eq. 5) for various combinations of loss functions and
regularizers. In the next lecture, we will cover a variety of optimization techniques
to help us with that.