01 Lecturenote SRM

The document outlines the first lecture of the CSE517A Machine Learning course, focusing on Structural Risk Minimization (SRM) as a key concept in machine learning. It discusses the balance between minimizing loss and model complexity, introduces various loss functions for classification and regression, and explains the role of regularizers. The lecture also sets the stage for a practical application by building a spam filter using linear classifiers.


CSE517A Machine Learning Fall 2022

Lecture 1: Structural Risk Minimization


Instructor: Marion Neumann
Reading: fcml Ch1 (Linear Modeling); esl 3.4.3, 10.6

Learning Objective
Understand that many machine learning algorithms solve the structural risk minimization problem,
which is essentially minimizing a combination of loss function and model complexity penalty.

Application
Build a spam filter that works well on the training data and generalizes well to
unseen test data.
In fact, this will be our first implementation project for the course. Take some
time to answer the following warm-up questions:
(1) What does our data look like? (2) What are the features? (3) What is the prediction task? (4) How well do you think a linear classifier will perform? (5) How do you measure the performance of a (linear) classifier?

1 Introduction
1.1 Machine Learning Problem
Assume we are given a dataset

D = {(xi , yi )}i=1,...,n . (1)

Our goal is to learn a hypothesis

y = h(x) (2)

such that:

h(xi ) = yi , ∀i = 1, . . . , n and
h(x∗ ) = y ∗ for unseen test data

Note that we can write y = h(f (x)) with f ∶ Rd → R. We obtain classification models with class labels {−1, +1} by choosing the sign function h(a) = sign(a), and regression models by using the identity function h(a) = a.
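This decomposition can be sketched in a few lines of numpy. The linear form of f and the names `classify` and `regress` are illustrative choices, not part of the notes:

```python
import numpy as np

def f(x, w, b):
    """Illustrative linear score function f : R^d -> R."""
    return np.dot(w, x) + b

def classify(x, w, b):
    """Classification: h(a) = sign(a), labels in {-1, +1}."""
    return np.sign(f(x, w, b))

def regress(x, w, b):
    """Regression: h(a) = a (identity)."""
    return f(x, w, b)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
print(classify(x, w, b))  # sign(1*3 - 2*1 + 0.5) = sign(1.5) = 1.0
print(regress(x, w, b))   # 1.5
```

The same score function f is reused for both tasks; only the outer link function h changes.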

Question: How do we find such a function h?


Answer: Optimize some performance measure of the model (aka the function/hypothesis h).

⇒ minimize expected risk:


min_h R[h] = min_h ∫ l(h(x), y) dp(x, y), (3)

where l is a loss function (e.g. the squared loss) and p(x, y) is the joint probability of the data, which is unknown.


⇒ use empirical risk instead:


min_h Remp [h] = min_h (1/n) ∑_{i=1}^{n} l(h(xi ), yi ) (4)
Empirical risk minimization minimizes the training error. However, this tends to overfit the training data, and typically a simpler model is preferred (Occam’s razor).

Recap: Occam’s razor says one should pick the simplest model that adequately explains the data.

1.2 Structural Risk Minimization


The goal of structural risk minimization is to balance fitting the training data against model complexity. Training then amounts to learning the model parameters w by solving the following optimization problem:

min_w L(w) = min_w (1/n) ∑_{i=1}^{n} l(hw (xi ), yi ) + λ r(w), (5)

where the first term is the training loss and the second term, λ r(w), is the regularizer.

where the objective function L(w) combines a loss function penalizing a high training error and a regularizer
penalizing the model complexity. λ is a model hyperparameter that controls the trade-off between the terms.
We are interested in choosing λ to minimize the true risk (test error). As the true risk is unknown, we may resort to cross-validation to learn λ.1 Minimizing the (surrogate) training loss alone is known as Empirical Risk Minimization (erm); extending the objective function to incorporate a regularizer leads to Structural Risk Minimization (srm).
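As a minimal numpy sketch of the objective in Eq. (5), assuming the squared loss and the l2-regularizer (both introduced later in these notes) and the X ∈ R^{d×n} data layout used in Section 4; the name `srm_objective` is ours:

```python
import numpy as np

def srm_objective(w, X, y, lam):
    """L(w) = (1/n) * sum_i l(w^T x_i, y_i) + lambda * r(w),
    with squared loss and r(w) = ||w||_2^2. X is d x n (columns are points)."""
    n = X.shape[1]
    preds = X.T @ w                              # f_w(x_i) = w^T x_i for all i
    train_loss = (1.0 / n) * np.sum((preds - y) ** 2)  # empirical risk (Eq. 4)
    regularizer = np.dot(w, w)                   # ||w||_2^2
    return train_loss + lam * regularizer

# lambda = 0 recovers plain empirical risk minimization (ERM)
X = np.array([[1.0, 2.0, 3.0]])   # d = 1, n = 3
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0])
print(srm_objective(w, X, y, lam=0.0))   # perfect fit: 0.0
print(srm_objective(w, X, y, lam=0.1))   # adds 0.1 * ||w||^2 on top
```

Increasing λ penalizes the same fit more heavily, which is exactly the loss/complexity trade-off described above.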

This provides us with a unified view on many machine learning methods. By plugging in different (surrogate)
loss functions and regularizers, we obtain different machine learning models. Remember for example the
unconstrained SVM formulation:
min_w C ∑_{i=1}^{n} max[1 − yi (wT xi + b), 0] + ∣∣w∣∣22 (6)

where fw (xi ) = wT xi + b; the first term is the hinge loss and the second term is the l2 -regularizer.

SVM uses the hinge loss as error measure and the l2 -regularizer to penalize complex solutions; this corresponds to λ = 1/C.
1 In Bayesian machine learning, we truly incorporate the choice of λ into the learning objective.

2 Loss Functions
2.1 Commonly Used Binary Classification Loss Functions
For binary classification we naturally want to minimize the zero-one loss l(hw (x), y), also called the true classification error. However, it is non-continuous and thus impractical to optimize, so we resort to various surrogate loss functions l(fw (x), y). Table 1 summarizes the most commonly used classification losses.

Table 1: loss functions for classification, y ∈ {−1, +1}

Zero-One Loss: δ(hw (xi ) ≠ yi )
  • Usage: true classification loss
  • Comments: non-continuous and thus impractical to optimize

Hinge Loss: max[1 − fw (xi ) yi , 0]p
  • Usage: standard SVM (p = 1); (differentiable) squared hinge loss SVM (p = 2)
  • Comments: when used for the standard SVM, the loss relates to the margin between the linear separator and its closest point in either class; only differentiable everywhere for p = 2

Log-Loss: log(1 + e−fw (xi ) yi )
  • Usage: logistic regression
  • Comments: one of the most popular loss functions in machine learning, since its outputs can be interpreted as well-calibrated probabilities

Exponential Loss: e−fw (xi ) yi
  • Usage: AdaBoost
  • Comments: very aggressive and thus sensitive to label noise; the loss of a misprediction increases exponentially with the value of −fw (xi ) yi

What do all these loss functions look like? The illustration below shows the zero-one and exponential losses, where the input/x-axis is the “correctness” of the prediction fw (xi ) yi .

Exercise 2.1. Add the hinge loss and log loss functions to this plot.

Additional Notes on Classification Loss Functions


1. Zero-one loss is zero when the prediction is correct, and one when incorrect.
2. As z → ∞, log-loss, exp-loss, and hinge loss become increasingly parallel.
3. The exponential loss and the hinge loss are both upper bounds of the zero-one loss. (For the exponential
loss, this is an important aspect in Adaboost.)
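The losses in Table 1 can all be written as functions of the correctness z = fw (xi ) yi. A sketch (function names are ours; the zero-one loss counts z ≤ 0 as an error, treating sign(0) as a misprediction):

```python
import numpy as np

def zero_one(z):
    """True classification error: 1 if misclassified (z <= 0), else 0."""
    return np.where(z <= 0, 1.0, 0.0)

def hinge(z, p=1):
    """Hinge loss; p = 1 standard SVM, p = 2 squared hinge."""
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    """Log-loss, as used in logistic regression."""
    return np.log(1.0 + np.exp(-z))

def exp_loss(z):
    """Exponential loss, as used in AdaBoost."""
    return np.exp(-z)

z = np.array([-1.0, 0.5, 2.0])
print(hinge(z))       # [2.  0.5 0. ]
print(exp_loss(0.0))  # 1.0
# Note 3 above: exp and hinge losses upper-bound the zero-one loss
assert np.all(exp_loss(z) >= zero_one(z)) and np.all(hinge(z) >= zero_one(z))
```

Evaluating these on a grid of z values is one way to produce the plot asked for in Exercise 2.1.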

Exercise 2.2. In what scenario would you want to use the Huber loss instead of squared or absolute loss?

2.2 Commonly Used Regression Loss Functions


Unsurprisingly, regression models (where the predictions are real-valued) also have their own loss functions, summarized in Table 2. Note that for regression hw (x) = fw (x), so the loss is computed directly on the real-valued prediction.

Table 2: loss functions for regression, y ∈ R

Squared Loss: (fw (xi ) − yi )2
  • most popular regression loss function
  • w∗ will be related to the mean of the observations in D 2
  • ADVANTAGE: differentiable everywhere
  • DISADVANTAGE: tries to accommodate every sample → sensitive to outliers/noise
  • used in Ordinary Least Squares (ols)

Absolute Loss: ∣fw (xi ) − yi ∣
  • also a very popular loss function
  • w∗ will be related to the median of the observations in D 3
  • ADVANTAGE: less sensitive to noise
  • DISADVANTAGE: not differentiable at 0

2 For the same input location and multiple observations, the minimal squared loss is achieved for the mean of the observations.
3 For the same input location and multiple observations, the minimal absolute loss is achieved for the median of the observations.

Huber Loss: l(zi ) = ½ zi2 if ∣zi ∣ < δ, and δ(∣zi ∣ − δ/2) otherwise, where zi = fw (xi ) − yi
  • also known as Smooth Absolute Loss
  • once-differentiable
  • ADVANTAGE: “best of both worlds” of squared and absolute loss
  • takes on the behavior of the squared loss when the residual is small, and of the absolute loss when the residual is large

Log-Cosh Loss: log(cosh(fw (xi ) − yi )), where cosh(x) = (ex + e−x )/2
  • ADVANTAGE: similar to the Huber loss, but twice differentiable everywhere

ε-Insensitive Loss: l(zi ) = 0 if ∣zi ∣ < ε, and ∣zi ∣ − ε otherwise, where zi = fw (xi ) − yi
  • yields sparse solutions (cf. support vectors)
  • used in SVM regression
  • ε regulates the sensitivity of the loss and hence the number of support vectors used in the SVM

What do all these loss functions look like? The illustration below shows the squared loss, the Huber loss with δ = 1, and the log-cosh loss, where the input/x-axis is the residual z = fw (xi ) − yi .

Exercise 2.3. Add the functions for the absolute loss and the Huber loss with δ = 2 to this plot.

Exercise 2.4. In the same or a new plot, graph the functions for the ε-insensitive loss for ε = 1 and ε = 0.1.
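The regression losses from Table 2 can be sketched as functions of the residual z = fw (xi ) − yi; function names and default parameter values are illustrative:

```python
import numpy as np

def squared(z):
    return z ** 2

def absolute(z):
    return np.abs(z)

def huber(z, delta=1.0):
    """Quadratic near 0, linear in the tails (once-differentiable)."""
    return np.where(np.abs(z) < delta,
                    0.5 * z ** 2,
                    delta * (np.abs(z) - 0.5 * delta))

def log_cosh(z):
    """Twice differentiable everywhere, Huber-like shape."""
    return np.log(np.cosh(z))

def eps_insensitive(z, eps=0.1):
    """Zero inside the eps-tube, linear outside (SVM regression)."""
    return np.maximum(np.abs(z) - eps, 0.0)

z = 3.0
print(squared(z), absolute(z), huber(z))  # 9.0 3.0 2.5
# In the quadratic regime the Huber loss matches the scaled squared loss:
print(huber(0.1))  # = 0.5 * 0.1**2
```

Evaluating these on a grid of residuals is one way to produce the plots asked for in Exercises 2.3 and 2.4.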

3 Regularizers
Similar to the SVM primal derivation, we can establish the following equivalence:
min_{w,b} ∑_{i=1}^{n} l(wT xi + b, yi ) + λ r(w) ⇐⇒ min_{w,b} ∑_{i=1}^{n} l(wT xi + b, yi ) subject to: r(w) ≤ B (7)

For each λ ≥ 0, there exists B ≥ 0 such that the two formulations in Eq. (7) are equivalent, and vice versa.
In previous sections, the l2 -regularizer has been introduced as the component in SVM that reflects the
complexity of solutions. Besides the l2 -regularizer, other types of useful regularizers and their properties are
listed in Table 3.

Table 3: types of regularizers

l2 -regularizer: r(w) = wT w = (∣∣w∣∣2 )2
  • ADVANTAGE: strictly convex
  • ADVANTAGE: differentiable
  • DISADVANTAGE: puts weight on all features, i.e. relies on all features to some degree (ideally we would like to avoid this) – these are known as dense solutions

l1 -regularizer: r(w) = ∣∣w∣∣1
  • convex (but not strictly convex)
  • DISADVANTAGE: not differentiable at 0 (the point to which minimization is intended to bring us)
  • ADVANTAGE: sparse (i.e. not dense) solutions

Elastic Net: r(w) = α∣∣w∣∣1 + (1 − α)∣∣w∣∣22 , α ∈ [0, 1)
  • ADVANTAGE: strictly convex (i.e. unique solution)
  • DISADVANTAGE: non-differentiable

lp -norm (often 0 < p ≤ 1): ∣∣w∣∣p = (∑_{i=1}^{d} ∣wi ∣p )^{1/p}
  • DISADVANTAGE: non-convex → solution is initialization dependent
  • ADVANTAGE: very sparse solutions
  • DISADVANTAGE: not differentiable
Figure 1a shows plots of some commonly used regularizers. Note that regularizers are functions of w and not of xi . Figure 1b shows the effect that adding a regularizer to the loss function minimization has on the optimization problem and its solution.
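The regularizers in Table 3 are cheap to compute; a sketch (function names are ours, and the default α and p values are illustrative):

```python
import numpy as np

def l2(w):
    """r(w) = w^T w = ||w||_2^2 (strictly convex, differentiable)."""
    return float(np.dot(w, w))

def l1(w):
    """r(w) = ||w||_1 (convex, sparsity-inducing, kink at 0)."""
    return float(np.sum(np.abs(w)))

def elastic_net(w, alpha=0.5):
    """Convex combination of l1 and squared l2, alpha in [0, 1)."""
    return alpha * l1(w) + (1.0 - alpha) * l2(w)

def lp(w, p=0.5):
    """lp "norm" with 0 < p <= 1 (non-convex for p < 1)."""
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([3.0, -4.0, 0.0])
print(l2(w))           # 25.0
print(l1(w))           # 7.0
print(elastic_net(w))  # 0.5*7 + 0.5*25 = 16.0
```

Note that the l2 value grows quadratically in the large coordinates while l1 grows linearly, which is one intuition for why l1 penalizes many small nonzero weights relatively more and pushes them to exactly zero.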

(a) Common regularizers. (b) Contours of a loss function, constraint region for the
l2 -regularizer, and optimal solution.

Figure 1: Illustrations for a two dimensional feature space (d = 2).

Exercise 3.1. Regularizers

(a) Add the constraint region for the l1 and elastic net regularizers to the plot in Figure 1b.
(b) The lp -regularizer yields sparse solutions for w. Explain why.
(c) What is the advantage of using the elastic net regularizer compared to l1 regularization? What is its advantage compared to l2 regularization? Briefly explain one advantage each.

4 Famous SRM models


This section covers several special cases of srm – all performing structural risk minimization – such as Ordinary Least Squares, Ridge Regression, Lasso, and Logistic Regression. Table 4 provides their loss functions, regularizers, and solutions.

Notation: Here we use X ∈ Rd×n , where d is the number of feature dimensions and n is the number of
training data points4 . Caution: this is the transpose of what they use in fcml.

Table 4: special cases of srm

Ordinary Least Squares:
  min_w (1/n) ∑_{i=1}^{n} (wT xi − yi )2
  • squared loss; no regularization
  • solution: w = (XX T )−1 X y T , where X = [x1 , ..., xn ] and y = [y1 , ..., yn ]

Ridge Regression:
  min_w (1/n) ∑_{i=1}^{n} (wT xi − yi )2 + λ∣∣w∣∣22
  • squared loss; l2 -regularization
  • solution: w = (XX T + λI)−1 X y T

Lasso:
  min_w (1/n) ∑_{i=1}^{n} (wT xi − yi )2 + λ∣∣w∣∣1
  • + sparsity inducing (good for feature selection); + convex
  • − not strictly convex (no unique solution); − not differentiable (at 0)
  • solution: solve with (sub-)gradient descent or LARS (least angle regression)

Logistic Regression:
  min_w (1/n) ∑_{i=1}^{n} log(1 + e−yi (wT xi +b) )
  • often also l1 - or l2 -regularized
  • P (y = +1 ∣ x) = 1/(1 + e−(wT x+b) )
  • solution: solve with gradient descent or Newton’s method

SVM:
  cf. Equation (6)
  • hinge loss; l2 -regularization
  • solution: solve the dual quadratic program (QP)

4 I personally prefer this notation since you can quickly determine whether we are using inner or outer products. E.g. the inner product in matrix notation will then look like X T X, which is more intuitive as it aligns with the inner product of vectors xT x. However, both notations are found in the literature. So, always be careful about the definition of X.
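The OLS and ridge closed forms from Table 4 are easy to check numerically. A sketch in the notes' X ∈ Rd×n convention, on synthetic noiseless data (using np.linalg.solve rather than an explicit matrix inverse, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))           # columns are the training points x_i
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X                        # noiseless targets y_i = w_true^T x_i

# OLS: w = (X X^T)^{-1} X y
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Ridge: w = (X X^T + lambda I)^{-1} X y
lam = 0.1
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

print(np.allclose(w_ols, w_true))   # True: noiseless data, well-posed system
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # ridge shrinks w
```

The second print illustrates the effect of the l2-regularizer: for λ > 0 the ridge solution is shrunk toward zero relative to the OLS solution.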

5 Summary
srm tries to find the best model by explicitly adding a weighted complexity penalty (regularization term) to
the loss function optimization. List of concepts and terms to understand from this lecture:

• erm
• srm
• overfitting
• underfitting
• Occam’s razor
• training loss, testing loss
• surrogate loss
• regularizer
• hyperparameter

Exercise 5.1. Practice Retrieving!


For this summary exercise, it is intended that your answers are based on your own (current) understanding
of the concepts (and not on the definitions you read and copy from these notes or from elsewhere). Don’t
hesitate to say it out loud to your seat neighbor, your pet or stuffed animal, or to yourself before writing
it down. Research studies show that this practice of retrieval and phrasing out loud will help you retain
the knowledge!

(a) Using your own words, summarize each of the above concepts in 1-2 sentences by retrieving the knowledge from the top of your head.
(b) What is the difference between erm and srm?
(c) What is the difference between overfitting and underfitting?

And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!

Our Application
With respect to our spam filter application, we are now able to come up with
objective functions, aka learning models, that will (hopefully) work well on the
training data and are able to generalize well to unseen test data.

The next step is to actually build the spam filter, which means we will need
to solve the srm problem (Eq. 5) for various combinations of loss functions and
regularizers. In the next lecture, we will cover a variety of optimization techniques
to help us with that.
