
CSE517A Machine Learning Fall 2022

Lecture 4: MLE and MAP for Discriminative Supervised Learning


Instructor: Marion Neumann
Reading: fcml 2.8 (mle), 3.8 (map), 4.2-4.3 (map), 5.2 (Bayes Classifier and Logistic Regression)

Application
Let’s consider our yield prediction problem from last lecture. It can be cast as a classical discriminative supervised learning problem: predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide, by modeling p(y ∣ x), which incorporates a reasonable way to model the noise in the observed data (https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html).
In addition to the point estimate of the yield for a given amount of treated area, it will be very informative for the farmer to know the expected deviation from this point estimate. In other words, we would like to provide the standard deviation as an estimate of the uncertainty.
(image source: http://www.corncapitalinnovations.com/production/300-bushel-corn/)

1 Introduction
1.1 Predictive Distribution
In discriminative supervised machine learning our goal is to model the posterior predictive distribution:

p(y ∣ D, x) = ∫_θ p(y, θ ∣ D, x) dθ = ∫_θ p(y ∣ D, x, θ) p(θ ∣ D) dθ    (1)

This makes sense, since we really want to incorporate all possible models parameterized by their respective
model parameters θ weighted by the parameter’s probability (i.e. the posterior probability over parameters);
cf. fcml 3.8.6.

Unfortunately, the above integral is generally intractable in closed form, and sampling techniques such as Monte Carlo approximations are used to approximate the distribution. So, oftentimes we will not actually use this distribution for predictions but instead estimate the model parameters via mle or map and then plug those into our model p(y ∣ x, θ̂) for predictions. We will meet the posterior predictive distribution again when discussing Gaussian processes later in the course.
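To make the Monte Carlo idea concrete, here is a minimal sketch of approximating Eq. (1) by averaging the likelihood over posterior samples. The helpers sample_posterior and likelihood are hypothetical stand-ins for whatever model and sampler are being used; they are not defined in these notes.

```python
import numpy as np

def posterior_predictive_mc(y_star, x_star, sample_posterior, likelihood, S=1000):
    # p(y* | D, x*) ≈ (1/S) * sum_s p(y* | x*, theta_s), with theta_s ~ p(theta | D).
    # `sample_posterior()` is assumed to return one posterior sample of theta,
    # `likelihood(y, x, theta)` to return p(y | x, theta).
    samples = [sample_posterior() for _ in range(S)]
    return np.mean([likelihood(y_star, x_star, theta) for theta in samples])
```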

1.2 Parameter Estimation


Usually, there are two assumptions in discriminative supervised learning.

Assumptions for Discriminative Supervised Learning:


(1) The xi are known ⇒ the xi are independent of the model parameters w ⇒ p(X ∣ w) = p(X) and p(w ∣ X) = p(w).
(2) The yi’s are independent given the input features xi and w.

Our goal is to estimate w directly from D = {(xi , yi )}_{i=1}^n using the joint conditional likelihood p(y ∣ X, w).


Lemma 1.1. Maximizing the (data) likelihood p(D ∣ w) = p(y, X ∣ w) is equivalent to maximizing the (joint)
conditional likelihood p(y ∣ X, w).
Notation Reminder: X = [x1 , ..., xn ] ∈ R^{d×n} where xi ∈ R^d; y = [y1 , ..., yn ]⊺ ∈ R^n.
Exercise 1.1. Prove Lemma 1.1. hint: use assumption (1).

Maximum Likelihood Estimation


Choose w to maximize the joint conditional likelihood p(y ∣ X, w).

ŵMLE = arg max_w p(y ∣ X, w)
     = arg max_w ∏_{i=1}^n p(yi ∣ xi , w)                                  (2)
     = arg max_w ∑_{i=1}^n log p(yi ∣ xi , w) ,

where the last sum is the log-likelihood.

Maximum-a-posteriori Estimation
Bayesian Way: Model w as a random variable drawn from a prior p(w) and use the posterior p(w ∣ D). Choose w to maximize the posterior over parameters p(w ∣ X, y).

ŵMAP = arg max_w p(w ∣ X, y)
     = arg max_w p(y ∣ X, w) p(w)                                          (3)
     = arg max_w ∑_{i=1}^n log p(yi ∣ xi , w) + log p(w) ,

where p(y ∣ X, w) is the likelihood, p(w) is the prior, and the sum term is the same as in the MLE.

2 Example: Linear Regression


Model Assumption: yi = w⊺ xi + εi ∈ R, where we use the Gaussian distribution (cf. fcml 2.5.3) to model the noise εi ∼ N (0, σ²), which is independent and identically distributed (iid).

⇒ yi ∣ xi , w ∼ N (w⊺ xi , σ²)  ⇒  p(yi ∣ xi , w) = 1/√(2πσ²) · e^{−(w⊺ xi − yi)² / (2σ²)}    (4)

2.1 Learning Phase


To train our model we estimate w from D.

MLE
Use Eq.(2):

ŵMLE = arg max_w ∑_{i=1}^n log p(yi ∣ xi , w)
     = arg max_w ∑_{i=1}^n [ log(1/√(2πσ²)) + log(e^{−(w⊺ xi − yi)² / (2σ²)}) ]
     = arg max_w ∑_{i=1}^n −(w⊺ xi − yi)²                                  (5)
     = arg min_w (1/n) ∑_{i=1}^n (w⊺ xi − yi)² ,

where the last expression is the OLS/squared loss objective.

The loss l(w) = (1/n) ∑_{i=1}^n (w⊺ xi − yi)² is thus the squared loss, a.k.a. Ordinary Least Squares (OLS). OLS can be optimized with gradient descent, Newton’s method, or in closed form.

Closed Form Solution: w = (XX ⊺ )−1 Xy.


Note: We need to take the inverse; for low-dimensional data this is fine since XX⊺ is d × d, but for high-dimensional data we will have to resort to an approximate solution.
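As a concrete illustration, here is a minimal NumPy sketch of the closed-form OLS solution, assuming the notes’ convention that X ∈ R^{d×n} stores the data points as columns; the toy data at the end is purely illustrative.

```python
import numpy as np

def ols_closed_form(X, y):
    # w = (X X^T)^{-1} X y; solve() avoids forming the inverse explicitly.
    return np.linalg.solve(X @ X.T, X @ y)

# toy 1-d example: y ≈ 2x plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1, 50))          # d = 1, n = 50 (columns are points)
y = 2.0 * X[0] + rng.normal(0.0, 0.1, size=50)
w_mle = ols_closed_form(X, y)                    # approximately [2.0]
```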

MAP
Additional Model Assumption: prior distribution:

w ∼ N (0, σp² I) ,   p(w) = 1/√(2πσp²) · e^{−w⊺w / (2σp²)}

Convince yourself that this prior is a conjugate prior to our likelihood.
Now, use Eq.(3):
ŵMAP = arg max_w ∑_{i=1}^n log p(yi ∣ xi , w) + log p(w)
     = arg min_w 1/(2σ²) ∑_{i=1}^n (w⊺ xi − yi)² + 1/(2σp²) w⊺w            (6)
     = arg min_w (1/n) ∑_{i=1}^n (w⊺ xi − yi)² + λ ||w||₂² ,

where the first term is the squared loss and the second term is the l2-regularization.

This formulation is known as ridge regression and we have derived it before in a frequentist setting using
structural risk minimization (srm). Note that λ is a hyperparameter controlling the amount of regularization
used/needed. It can be learned via cross-validation.

Closed Form Solution: w = (XX ⊺ + λI)−1 Xy.


Note: The solution is numerically more stable as the term λI makes the matrix to invert less likely to be
ill-conditioned.
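A corresponding sketch of the ridge (MAP) closed form, again assuming X is d × n with data points as columns; λ is passed in as a hyperparameter (e.g. chosen by cross-validation).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # w = (X X^T + lam*I)^{-1} X y; the lam*I term improves the conditioning
    # of the linear system being solved.
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# w_map = ridge_closed_form(X, y, lam=0.1)   # with X, y as in the OLS sketch above
```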

2.2 Prediction Phase


Use the estimated model parameters ŵ in the predictive distribution p(y∗ ∣ x∗ , ŵ). For linear regression we have

p(y∗ ∣ x∗ , ŵ) = 1/√(2πσ²) · e^{−(ŵ⊺ x∗ − y∗)² / (2σ²)} .
The point estimate would be given by the mean of this distribution: ŷ ∗ = ŵ⊺ x∗ .
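The sketch below turns this into code: the point prediction is ŵ⊺ x∗, and as an assumed plug-in estimate of the noise standard deviation σ we use the standard deviation of the training residuals. That plug-in choice is not derived in these notes, but it gives the farmer the requested uncertainty estimate alongside the point prediction.

```python
import numpy as np

def predict_with_uncertainty(w_hat, X_train, y_train, x_star):
    # Point estimate: mean of the predictive Gaussian N(w_hat^T x*, sigma^2).
    y_hat = w_hat @ x_star
    # Plug-in estimate of sigma from the training residuals (an assumption,
    # not part of the lecture's derivation).
    sigma_hat = np.std(y_train - w_hat @ X_train)
    return y_hat, sigma_hat
```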

2.3 Summary
• mle solution is equivalent to ordinary least squares regression.
• map solution is equivalent to regularized ols using an l2 regularizer.

• We could use a different noise model such as the full Gaussian N (µ, Σ), multiplicative noise, or non-
stationary noise (e.g. heteroscedastic noise) to make this model more expressive.

Exercise 2.1. True or false? Justify your answer.


(a) If n → ∞, MAP can recover from a wrong prior distribution over parameters, where we assume that
our prior distribution is strictly larger than zero on [0,1].
(b) The MAP solution to linear regression is numerically less stable to compute than the MLE solution.

3 Example: Logistic Regression


Model Assumption: We need to squash w⊺ xi to get a value in [0,1]. In logistic regression we model
p(y ∣ x, w) and assume that it takes on the form:
p(y ∣ x, w) = Ber ( y ∣ 1 / (1 + e^{−w⊺ x}) ) ,    (7)
where we use the Bernoulli distribution (cf. fcml 2.3.1):


Ber(a ∣ θ) = θ if a = 1, and Ber(a ∣ θ) = 1 − θ if a = −1.

For binary classification our observations are y ∈ {−1, +1} and we can write Eq.(7) as p(y ∣ x, w) = 1 / (1 + e^{−y(w⊺ x)}).

Exercise 3.1. Verify that p(y ∣ x, w) = 1 / (1 + e^{−y(w⊺ x)}) is equivalent to Eq.(7).

3.1 Learning Phase


MLE
Now, plug this into Eq.(2) to get:
ŵMLE = arg max_w ∑_{i=1}^n log p(yi ∣ xi , w)
     = arg min_w ∑_{i=1}^n log(1 + e^{−yi(w⊺ xi)}) ,                        (8)

where the last expression is the negative log likelihood (nll).

We need to estimate the parameter w. To find its value at the minimum, we can try to find solutions of ∇w ∑_{i=1}^n log(1 + e^{−yi(w⊺ xi)}) = 0. This equation has no closed form solution, so we will use Gradient Descent on the negative log likelihood nll(w) = ∑_{i=1}^n log(1 + e^{−yi(w⊺ xi)}).
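A minimal gradient-descent sketch for this nll, assuming labels yi ∈ {−1, +1} and X stored column-wise (d × n) as in the notes; the step size and iteration count are illustrative choices, not part of the lecture.

```python
import numpy as np

def logistic_mle_gd(X, y, step=0.1, iters=1000):
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (w @ X)                 # y_i * w^T x_i for all i, shape (n,)
        # d/dw log(1 + exp(-m_i)) = -y_i * x_i / (1 + exp(m_i))
        coeffs = -y / (1.0 + np.exp(margins))
        w -= step * (X @ coeffs)              # gradient of nll(w)
    return w
```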

MAP
In the MAP estimate we treat w as a random variable and can specify a prior belief distribution over it.
Additional Model Assumption:

w ∼ N (0, σ² I) ,   p(w) = 1/√(2πσ²) · e^{−w⊺w / (2σ²)}
Then the MAP estimator is given by

ŵMAP = arg min_w ∑_{i=1}^n log(1 + e^{−yi(w⊺ xi)}) + λ ||w||₂² ,            (9)

where the objective is the negative log posterior (nlp).

Once again, there is no closed form solution, but we can use Gradient Descent on the negative log posterior nlp(w) = ∑_{i=1}^n log(1 + e^{−yi(w⊺ xi)}) + λ ||w||₂² to find the optimal parameters. Note again that we derived this objective before via srm using the log-loss and l2-regularization (frequentist approach).
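For the MAP objective the only change relative to the MLE sketch above is the gradient of the regularizer, 2λw; a sketch under the same assumptions (yi ∈ {−1, +1}, X stored column-wise, illustrative step size and λ):

```python
import numpy as np

def logistic_map_gd(X, y, lam=0.1, step=0.1, iters=1000):
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        coeffs = -y / (1.0 + np.exp(y * (w @ X)))     # same term as in the MLE gradient
        w -= step * (X @ coeffs + 2.0 * lam * w)      # nll gradient + l2 penalty gradient
    return w
```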
Exercise 3.2. Derive Eq.(9), the negative log-posterior for logistic regression.

[optional] True Bayesian Logistic Regression


Did you notice that in order to get the MAP solution we modeled the posterior as the product of the likelihood and the prior? This means that we have to approximate/model two distributions, p(y ∣ X, w) and p(w). Alternatively, we can directly model the posterior p(w ∣ X, y). We have two options:
• Model the posterior via Laplace approximation (most common approach).

• Derive an algorithm for sampling from the posterior and use this as an approximation.
We will not cover this approach in this course. For further reference see FCML 4.4 and 4.5.

3.2 Prediction Phase


Use ŵ in Eq. (7):
p(y∗ ∣ x∗ , ŵ) = Ber ( y∗ ∣ 1 / (1 + e^{−ŵ⊺ x∗}) ) .
To get a point estimate, this means

ŷ∗ = 1 if ŵ⊺ x∗ ≥ 0, and ŷ∗ = −1 if ŵ⊺ x∗ < 0,

which just simplifies to ŷ∗ = sign(ŵ⊺ x∗).

3.3 Summary
Logistic regression is easy to
• fit (estimate w directly from D, linear in dn),
• interpret via the log odds: log [ p(y = 1 ∣ x) / p(y = −1 ∣ x) ] = w⊺ x,
• extend to multi-class classification: p(y = c ∣ x, w) = e^{wc⊺ x} / ∑_{c′} e^{wc′⊺ x} (see the sketch below).
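A small sketch of the multi-class (softmax) extension mentioned in the last bullet, assuming a weight matrix W with one row wc per class; purely illustrative.

```python
import numpy as np

def softmax_probs(W, x):
    # p(y = c | x, W) = exp(w_c^T x) / sum_{c'} exp(w_{c'}^T x)
    scores = W @ x                        # one score per class, shape (C,)
    scores = scores - scores.max()        # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```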

Exercise 3.3. One benefit of LR is that it is easy to interpret. This can be seen by looking at the log odds:

log [ p(y = 1 ∣ x, w) / p(y = −1 ∣ x, w) ]

Show that

log [ p(y = 1 ∣ x, w) / p(y = −1 ∣ x, w) ] = w⊺ x .

Our Application

Back to our application of predicting the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with pesticides. The data clearly shows a non-linear relationship between x and y. How could you use the mle and map solutions developed in Section 2 to model this trend?
(image source: https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html)
