01 Lecturenote SRM
Learning Objective
Understand that many machine learning algorithms solve the structural risk minimization problem,
which is essentially minimizing a combination of loss function and model complexity penalty.
Application
Build a spam filter that works well on the training data and generalizes well to
unseen test data.
In fact, this will be our first implementation project for the course. Take some
time to answer the following warm-up questions:
(1) What does our data look like? (2) What are the features? (3) What is the
prediction task? (4) How well do you think a linear classifier will perform? (5) How
do you measure the performance of a (linear) classifier?
1 Introduction
1.1 Machine Learning Problem
Assume we have a dataset
D = {(xi , yi )}i=1,...,n , (1)
our goal is to learn
y = h(x) (2)
such that:
h(xi ) = yi , ∀i = 1, . . . , n and
h(x∗ ) = y ∗ for unseen test data
Note that we can write y = h(f(x)) with f : X → R: we obtain classification models with class labels {−1, +1}
by choosing the sign function h(a) = sign(a), and regression models by using the identity function h(a) = I(a) = a.
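To make the composition concrete, here is a minimal Python sketch (the function names and the linear score f are illustrative choices, not part of the lecture):

```python
import numpy as np

def f(x, w):
    # An illustrative score function f : R^d -> R (here: linear)
    return float(np.dot(w, x))

def classify(x, w):
    # Classification: h(a) = sign(a), labels in {-1, +1}
    return 1 if f(x, w) >= 0 else -1

def regress(x, w):
    # Regression: h(a) = I(a) = a, the identity
    return f(x, w)
```

The same real-valued score drives both model types; only the outer function h changes.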
Recap: Occam’s razor says one should pick the simplest model that adequately explains the data.
This principle is formalized by the following optimization problem:

min_w L(w) = min_w (1/n) ∑_{i=1}^n l(h_w(x_i), y_i) + λ r(w)    (5)
where the objective function L(w) combines a loss function penalizing a high training error and a regularizer
penalizing the model complexity. λ is a model hyperparameter that controls the trade-off between the terms.
We are interested in choosing λ to minimize the true risk (test error). As the true risk is unknown we may
resort to cross-validation to choose λ.1 Minimizing the (surrogate) training loss alone is known as Empirical
Risk Minimization (erm). Extending the objective function to incorporate a regularizer leads to Structural
Risk Minimization (srm).
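A minimal Python sketch of Eq. (5) (the function names and the plugged-in squared loss / l2 regularizer are my own illustrative choices):

```python
import numpy as np

def srm_objective(w, X, y, lam, loss, reg):
    """L(w) = (1/n) * sum_i loss(h_w(x_i), y_i) + lam * r(w)."""
    preds = X @ w                                   # linear model; X is n x d here
    train_loss = np.mean([loss(p, t) for p, t in zip(preds, y)])
    return train_loss + lam * reg(w)

# example plug-ins: squared loss and the l2 regularizer
squared = lambda p, t: (p - t) ** 2
l2 = lambda w: np.sum(w ** 2)
```

Swapping in other loss/regularizer pairs yields the different models discussed below.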
This provides us with a unified view on many machine learning methods. By plugging in different (surrogate)
loss functions and regularizers, we obtain different machine learning models. Remember for example the
unconstrained SVM formulation:
min_{w,b} C ∑_{i=1}^n max[1 − y_i(w^T x_i + b), 0] + ||w||_2^2    (6)

Here f_w(x_i) = w^T x_i + b, the sum is the hinge loss, and ||w||_2^2 is the l2-regularizer. SVM thus uses the
hinge loss as its error measure and the l2-regularizer to penalize complex solutions; this matches Eq. (5) with
λ = 1/C.
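A sketch of the objective in Eq. (6) (assuming data stored row-wise in an n x d matrix; names are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    # hinge loss: max[1 - y_i * (w^T x_i + b), 0], summed over the data
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)
    # l2-regularizer ||w||_2^2 (note: the bias b is not regularized)
    return C * hinge.sum() + float(np.dot(w, w))
```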
1 In Bayesian machine learning, we truly incorporate the choice of λ into the learning objective.
2 Loss Functions
2.1 Commonly Used Binary Classification Loss Functions
For binary classification we would naturally like to minimize the zero-one loss l(h_w(x), y), also called the true
classification error. However, it is discontinuous and non-convex, which makes it impractical to optimize directly,
so we resort to various surrogate loss functions l(f_w(x), y). Table 1 summarizes the most commonly used
classification losses.
What do all these loss functions look like? The illustration below shows the zero-one and exponential losses,
where the input/x-axis is the “correctness” of the prediction f_w(x_i) y_i.
Exercise 2.1. Add the hinge loss and log loss functions to this plot.
Exercise 2.2. In what scenario would you want to use the Huber loss instead of squared or absolute loss?
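The standard surrogate losses can all be written as functions of the margin z = f_w(x_i) y_i; a small sketch (variable names are mine):

```python
import numpy as np

# each loss takes the margin z = f_w(x_i) * y_i
zero_one = lambda z: np.where(z <= 0, 1.0, 0.0)    # true classification error
exponential = lambda z: np.exp(-z)                 # used e.g. in AdaBoost
hinge = lambda z: np.maximum(1.0 - z, 0.0)         # used in SVM
log_loss = lambda z: np.log1p(np.exp(-z))          # used in logistic regression
```

Evaluating these on a grid of z values reproduces the plot discussed above.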
Absolute Loss:
∣f_w(x_i) − y_i∣
• more robust to outliers than the squared loss
• minimized by the conditional median of the observations3
2 For the same input location and multiple observations, the minimal squared loss is achieved for the mean of the observations.
3 For the same input location and multiple observations, the minimal absolute loss is achieved for the median of the observations.
Huber Loss:

l(z_i) = { (1/2) z_i^2      if ∣z_i∣ < δ
         { δ(∣z_i∣ − δ/2)   otherwise,      where z_i = f_w(x_i) − y_i

• also known as Smooth Absolute Loss
• once-differentiable
• ADVANTAGE: “Best of Both Worlds” of squared and absolute loss
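A direct transcription of the piecewise definition (sketch; vectorized over residuals):

```python
import numpy as np

def huber(z, delta=1.0):
    # z = f_w(x_i) - y_i; quadratic near zero, linear in the tails
    a = np.abs(z)
    return np.where(a < delta,
                    0.5 * z ** 2,
                    delta * (a - 0.5 * delta))
```

At |z| = δ both branches give δ²/2 and the slopes match, which is why the loss is once-differentiable.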
Log-Cosh Loss:

log(cosh(f_w(x_i) − y_i)),      where cosh(x) = (e^x + e^{−x})/2

• ADVANTAGE: similar to Huber loss, but twice differentiable everywhere.
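A one-line sketch (for very large residuals np.cosh can overflow; fine for illustration):

```python
import numpy as np

def log_cosh(z):
    # z = f_w(x_i) - y_i; behaves like z^2/2 near 0 and like |z| - log(2) for large |z|
    return np.log(np.cosh(z))
```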
ε-Insensitive Loss:

l(z_i) = { 0           if ∣z_i∣ < ε
         { ∣z_i∣ − ε   otherwise,      where z_i = f_w(x_i) − y_i

• yields sparse solutions (cf. support vectors)
• used in SVM regression
• ε regulates the sensitivity of the loss and hence the number of support vectors used in the SVM
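A sketch of the ε-insensitive loss (vectorized):

```python
import numpy as np

def eps_insensitive(z, eps=0.1):
    # z = f_w(x_i) - y_i; exactly zero inside the eps-tube around the target
    return np.maximum(np.abs(z) - eps, 0.0)
```

Points with |z| < ε contribute nothing to the loss (and no support vector), which is the source of the sparsity.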
What do all these loss functions look like? The illustration below shows the squared loss, Huber loss with
δ = 1, and log-cosh loss, where the input/x-axis is the “correctness” of the prediction z = f_w(x_i) − y_i.
Exercise 2.3. Add the functions for the absolute loss and the Huber loss with δ = 2 to this plot.
Exercise 2.4. In the same or a new plot, graph the functions for the ε-insensitive loss for ε = 1 and ε = 0.1.
3 Regularizers
Similar to the SVM primal derivation, we can establish the following equivalence:

min_{w,b} ∑_{i=1}^n l(w^T x_i + b, y_i) + λ r(w)   ⟺   min_{w,b} ∑_{i=1}^n l(w^T x_i + b, y_i)   subject to: r(w) ≤ B    (7)

For each λ ≥ 0 there exists a B ≥ 0 such that the two formulations in Eq. (7) are equivalent, and vice versa.
In previous sections, the l2 -regularizer has been introduced as the component in SVM that reflects the
complexity of solutions. Besides the l2 -regularizer, other types of useful regularizers and their properties are
listed in Table 3.
l1 -regularizer: r(w) = ∣∣w∣∣_1 = ∑_j ∣w_j∣
• + convex and sparsity inducing (good for feature selection)
• − not differentiable at w_j = 0
Elastic Net: r(w) = α∣∣w∣∣_1 + (1 − α)∣∣w∣∣_2^2 with α ∈ [0, 1)
• combines the sparsity of l1 with the strict convexity (unique solution) of l2
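As code, the three regularizers look like this (the elastic net mixing parameter α below is one common parameterization, my choice here):

```python
import numpy as np

# regularizers are functions of the weight vector w only
l2 = lambda w: np.sum(w ** 2)            # ridge penalty ||w||_2^2
l1 = lambda w: np.sum(np.abs(w))         # lasso penalty ||w||_1, sparsity inducing

def elastic_net(w, alpha=0.5):
    # convex combination of the l1 and l2 penalties
    return alpha * l1(w) + (1.0 - alpha) * l2(w)
```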
Figure 1a shows plots of some commonly used regularizers. Note that regularizers are functions of w and
not of xi. Figure 1b shows the effect that adding a regularizer to the loss-function minimization has on the
optimization problem and its solution.
Figure 1: (a) Commonly used regularizers. (b) Contours of a loss function, the constraint region for the
l2-regularizer, and the optimal solution.
(a) Add the constraint regions for the l1 and elastic net regularizers to the plot in Figure 1b.
(b) The l1 -regularizer yields sparse solutions for w. Explain why.
(c) What is the advantage of using the elastic net regularizer compared to l1 regularization? What is its
advantage compared to l2 regularization? Briefly explain one advantage each.
Notation: Here we use X ∈ Rd×n , where d is the number of feature dimensions and n is the number of
training data points4 . Caution: this is the transpose of what they use in fcml.
• y = [y1 , ..., yn ]
4 I personally prefer this notation since you can quickly determine whether we are using inner or outer products. E.g. the
inner product in matrix notation will then look like X^T X, which is more intuitive as it aligns with the inner product of vectors
x^T x. However, both notations are found in the literature. So, always be careful about the definition of X.
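A quick shape check of this convention (illustrative sketch):

```python
import numpy as np

d, n = 3, 5                        # d feature dimensions, n training points
X = np.ones((d, n))                # columns are the data points x_i

G = X.T @ X                        # n x n: entries are inner products x_i^T x_j
S = X @ X.T                        # d x d: sum of outer products x_i x_i^T
print(G.shape, S.shape)
```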
Ridge Regression:

min_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ∣∣w∣∣_2^2

• + convex and differentiable
• Solve in closed form or with gradient descent

Lasso:

min_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ∣∣w∣∣_1

• + sparsity inducing (good for feature selection)
• + convex
• Solve with (sub-)gradient descent or LARS (least angle regression)
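A subgradient-descent sketch for the Lasso objective (row-wise n x d data; learning rate and step count are arbitrary illustrative values):

```python
import numpy as np

def lasso_subgradient(X, y, lam, lr=0.01, steps=2000):
    """Minimize (1/n) * sum (w^T x_i - y_i)^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of the squared loss
        grad += lam * np.sign(w)               # a subgradient of lam * ||w||_1
        w -= lr * grad
    return w
```

Note that np.sign(0) = 0 is a valid subgradient at the kink; production solvers prefer coordinate descent or proximal methods.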
Logistic Regression:

min_w (1/n) ∑_{i=1}^n log(1 + e^{−y_i(w^T x_i + b)})

• often also l1 or l2 regularized
• P(y = +1 ∣ x) = 1/(1 + e^{−(w^T x + b)})
• Solve with gradient descent or Newton’s method
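A plain gradient-descent sketch for unregularized logistic regression with labels in {-1, +1} (function names and step sizes are my own):

```python
import numpy as np

def logistic_gd(X, y, lr=0.1, steps=500):
    """Minimize (1/n) * sum log(1 + exp(-y_i (w^T x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        m = y * (X @ w + b)              # margins y_i (w^T x_i + b)
        s = -y / (1.0 + np.exp(m))       # derivative of the log loss w.r.t. f
        w -= lr * (X.T @ s) / n
        b -= lr * s.mean()
    return w, b

def predict_proba(x, w, b):
    # P(y = +1 | x) = 1 / (1 + exp(-(w^T x + b)))
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```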
SVM:

min_{w,b} C ∑_{i=1}^n max[1 − y_i(w^T x_i + b), 0] + ∣∣w∣∣_2^2

• hinge loss with the l2-regularizer, λ = 1/C (cf. Eq. (6))
• yields a sparse solution in terms of support vectors
5 Summary
srm tries to find the best model by explicitly adding a weighted complexity penalty (regularization term) to
the loss function optimization. List of concepts and terms to understand from this lecture:
• erm
• srm
• overfitting
• underfitting
• Occam’s razor
• training loss, testing loss
• surrogate loss
• regularizer
• hyperparameter
(a) Using your own words, summarize each of the above concepts in 1-2 sentences by retrieving the knowledge
from the top of your head.
(b) What is the difference between erm and srm?
(c) What is the difference between overfitting and underfitting?
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!
Our Application
With respect to our spam filter application, we are now able to come up with
objective functions, aka learning models, that will (hopefully) work well on the
training data and are able to generalize well to unseen test data.
The next step is to actually build the spam filter, which means we will need
to solve the srm problem (Eq. 5) for various combinations of loss functions and
regularizers. In the next lecture, we will cover a variety of optimization techniques
to help us with that.