04 Lecturenote MLE MAP Discriminative
Application
Let’s consider our yield prediction problem from the last lecture. This can be cast as a classical discriminative supervised learning problem: predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide, by modeling p(y ∣ x) in a way that incorporates a reasonable model of the noise in the observed data (https://www.developer.com/mgmt/real-world-machine-learning-model-evaluation-and-optimization.html).
In addition to the point estimate of the yield for a given amount of treated area, it will be very informative for the farmer to know the expected deviation from this point estimate. In other words, we would like to provide the standard deviation as an estimate of uncertainty (see also http://www.corncapitalinnovations.com/production/300-bushel-corn/).
1 Introduction
1.1 Predictive Distribution
In discriminative supervised machine learning our goal is to model the posterior predictive distribution:
p(y ∣ D, x) = ∫_θ p(y, θ ∣ D, x) dθ
            = ∫_θ p(y ∣ D, x, θ) p(θ ∣ D) dθ        (1)
This makes sense, since we really want to incorporate all possible models parameterized by their respective model parameters θ, weighted by each parameter setting’s probability (i.e. the posterior probability over parameters); cf. fcml 3.8.6.
Unfortunately, the above integral is generally intractable in closed form, and sampling techniques, such as Monte Carlo approximations, are used to approximate the distribution. So, oftentimes we will not actually use this distribution for predictions but instead estimate the model parameters via mle or map and then plug those into our model p(y ∣ x, θ̂) for predictions. We will meet the posterior predictive distribution again when discussing Gaussian processes later in the course.
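For concreteness, a Monte Carlo approximation of Eq. (1) replaces the integral with an average over S posterior samples (S is just an illustrative sample count):

p(y ∣ D, x) ≈ (1/S) ∑_{s=1}^S p(y ∣ x, θ^(s)),   θ^(s) ∼ p(θ ∣ D),

whereas the plug-in approach described above uses the single point estimate θ̂, i.e. p(y ∣ D, x) ≈ p(y ∣ x, θ̂).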
Our goal is to estimate w directly from D = {(xᵢ, yᵢ)}_{i=1}^n using the joint conditional likelihood p(y ∣ X, w).
Lemma 1.1. Maximizing the (data) likelihood p(D ∣ w) = p(y, X ∣ w) is equivalent to maximizing the (joint)
conditional likelihood p(y ∣ X, w).
Notation Reminder: X = [x₁, ..., xₙ] ∈ R^{d×n} where xᵢ ∈ R^d;  y = [y₁, ..., yₙ]⊺ ∈ Rⁿ.
Exercise 1.1. Prove Lemma 1.1. hint: use assumption (1).
Maximum-a-posteriori Estimation
Bayesian Way: Model w as a random variable with prior p(w) and use the posterior p(w ∣ D). Choose w to maximize the posterior over parameters p(w ∣ X, y).
yᵢ ∣ xᵢ, w ∼ N(w⊺xᵢ, σ²)   ⇒   p(yᵢ ∣ xᵢ, w) = 1/√(2πσ²) · e^{−(w⊺xᵢ − yᵢ)²/(2σ²)}        (4)
MLE
Use Eq.(2):
ŵ_MLE = arg max_w ∑_{i=1}^n log p(yᵢ ∣ xᵢ, w)
      = arg max_w ∑_{i=1}^n [ log(1/√(2πσ²)) + log(e^{−(w⊺xᵢ − yᵢ)²/(2σ²)}) ]
      = arg max_w ∑_{i=1}^n −(w⊺xᵢ − yᵢ)²
      = arg min_w (1/n) ∑_{i=1}^n (w⊺xᵢ − yᵢ)²        (5)

The last expression is the OLS/squared-loss objective.
The loss ℓ(w) = (1/n) ∑_{i=1}^n (w⊺xᵢ − yᵢ)² is thus the squared loss, also known as Ordinary Least Squares (OLS). OLS can be optimized with gradient descent, Newton’s method, or in closed form, as sketched below.
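As an illustration, here is a minimal numpy sketch of the closed-form solution via the normal equations and of plain gradient descent on the squared loss, using the notation above (X ∈ R^{d×n}, y ∈ Rⁿ); the synthetic data, step size, and iteration count are illustrative choices only.

import numpy as np

# synthetic data for illustration: X is d x n, y has length n (notation as above)
rng = np.random.default_rng(0)
d, n = 3, 100
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = X.T @ w_true + 0.1 * rng.normal(size=n)   # additive Gaussian noise

# closed-form OLS: solve the normal equations (X X^T) w = X y
w_ols = np.linalg.solve(X @ X.T, X @ y)

# alternative: gradient descent on l(w) = (1/n) * sum_i (w^T x_i - y_i)^2
w_gd = np.zeros(d)
eta = 0.1                                     # step size (illustrative)
for _ in range(500):
    grad = (2.0 / n) * X @ (X.T @ w_gd - y)   # gradient of the squared loss
    w_gd -= eta * grad

In practice one would typically call np.linalg.lstsq(X.T, y, rcond=None) rather than forming X X⊺ explicitly; both solve the same least-squares problem.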
MAP
Additional Model Assumption (prior distribution):

w ∼ N(0, σ_p² I),   p(w) = (2πσ_p²)^{−d/2} e^{−w⊺w/(2σ_p²)}
Convince yourself that this prior is conjugate to our likelihood.
Now, use Eq.(3):
ŵ_MAP = arg max_w ∑_{i=1}^n log p(yᵢ ∣ xᵢ, w) + log p(w)
      = arg min_w (1/(2σ²)) ∑_{i=1}^n (w⊺xᵢ − yᵢ)² + (1/(2σ_p²)) w⊺w
      = arg min_w (1/n) ∑_{i=1}^n (w⊺xᵢ − yᵢ)² + λ∣∣w∣∣₂²        (6)

The first term is again the squared loss; the second term is the ℓ2-regularization.
This formulation is known as ridge regression; we have derived it before in a frequentist setting using structural risk minimization (srm). In the last step the constants are absorbed into λ = σ²/(nσ_p²), a hyperparameter controlling the amount of regularization. It can be chosen via cross-validation, as sketched below.
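A minimal sketch, assuming X ∈ R^{d×n} and y ∈ Rⁿ as above, of the closed-form ridge solution and of a simple hold-out search for λ; the function names, candidate grid, and split fraction are illustrative.

import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X X^T + n*lam*I)^(-1) X y."""
    d, n = X.shape
    return np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

def holdout_lambda(X, y, lams, frac=0.8):
    """Pick lambda by validation error on a single hold-out split (illustrative)."""
    d, n = X.shape
    n_tr = int(frac * n)
    X_tr, y_tr = X[:, :n_tr], y[:n_tr]
    X_va, y_va = X[:, n_tr:], y[n_tr:]
    errs = []
    for lam in lams:
        w = ridge_fit(X_tr, y_tr, lam)
        errs.append(np.mean((X_va.T @ w - y_va) ** 2))
    return lams[int(np.argmin(errs))]

For example, lam = holdout_lambda(X, y, [1e-3, 1e-2, 1e-1, 1.0]) picks the best value from a small grid; a proper k-fold cross-validation would average the validation error over several splits instead of using a single hold-out set.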
2.3 Summary
• mle solution is equivalent to ordinary least squares regression.
• map solution is equivalent to regularized ols using an l2 regularizer.
• We could use a different noise model such as the full Gaussian N(µ, Σ), multiplicative noise, or non-stationary noise (e.g. heteroscedastic noise) to make this model more expressive.
We need to estimate the parameter w. To find the values of the parameter at the minimum, we can try to find solutions of ∇_w ∑_{i=1}^n log(1 + e^{−yᵢ(w⊺xᵢ)}) = 0. This equation has no closed-form solution, so we will use Gradient Descent on the negative log likelihood nll(w) = ∑_{i=1}^n log(1 + e^{−yᵢ(w⊺xᵢ)}), as sketched below.
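A minimal gradient-descent sketch for this nll, using the gradient ∇nll(w) = −∑_{i=1}^n yᵢ xᵢ σ(−yᵢ w⊺xᵢ) with the sigmoid σ(z) = 1/(1 + e^{−z}); X ∈ R^{d×n}, labels yᵢ ∈ {−1, +1}, and the step size and iteration count are illustrative and may need tuning.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_nll_gd(X, y, eta=0.01, n_iters=2000):
    """Gradient Descent on nll(w) = sum_i log(1 + exp(-y_i * w^T x_i)).
    X: d x n design matrix, y: labels in {-1, +1}."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X.T @ w)                # y_i * w^T x_i for all i
        grad = -X @ (y * sigmoid(-margins))    # gradient of the nll
        w -= eta * grad
    return w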
MAP
In the MAP estimate we treat w as a random variable and can specify a prior belief distribution over it.
Additional Model Assumption:
w ∼ N(0, σ² I),   p(w) = (2πσ²)^{−d/2} e^{−w⊺w/(2σ²)}
Then the MAP estimator is given by
ŵ_MAP = arg min_w ∑_{i=1}^n log(1 + e^{−yᵢ(w⊺xᵢ)}) + λ∣∣w∣∣₂²        (9)

where the objective on the right is the negative log posterior (nlp).
Once again, this function has no closed-form solution, but we can use Gradient Descent on the negative log posterior nlp(w) = ∑_{i=1}^n log(1 + e^{−yᵢ(w⊺xᵢ)}) + λ∣∣w∣∣₂² to find the optimal parameter, as sketched below. Note again that we derived this before via srm using the log-loss and ℓ2-regularization (frequentist approach).
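Compared with the sketch for the nll above, only the gradient changes: it picks up the regularization term 2λw. A minimal sketch, reusing the sigmoid helper from the earlier block (λ is the hyperparameter of Eq. (9)):

def logistic_nlp_grad(w, X, y, lam):
    """Gradient of nlp(w) = sum_i log(1 + exp(-y_i * w^T x_i)) + lam * ||w||_2^2."""
    margins = y * (X.T @ w)
    return -X @ (y * sigmoid(-margins)) + 2.0 * lam * w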
Exercise 3.2. Derive Eq.(9), the negative log-posterior for logistic regression.
• Derive an algorithm for sampling from the posterior and use this as an approximation.
We will not cover this approach in this course. For further reference see FCML 4.4 and 4.5.
3.3 Summary
Logistic regression is easy to
• fit (estimate w directly from D, linear in dn)
• interpret as log odds: log [ p(y = 1 ∣ x) / p(y = −1 ∣ x) ] = w⊺x
• extend to multi-class classification: p(y = c ∣ x, w) = e^{w_c⊺ x} / ∑_{c′} e^{w_{c′}⊺ x} (see the sketch below)
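A small numpy sketch of these multi-class (softmax) probabilities, where W is assumed to collect the class weight vectors w_c as its columns (a notational choice for the sketch, not from the lecture):

import numpy as np

def softmax_probs(W, x):
    """p(y = c | x, W) = exp(w_c^T x) / sum_c' exp(w_c'^T x).
    W: d x C matrix whose columns are the class weight vectors w_c."""
    scores = W.T @ x                # w_c^T x for every class c
    scores = scores - scores.max()  # subtract the max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()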
Exercise 3.3. One benefit of LR is that it is easy to interpret. This can be seen by looking at the log odds

log [ p(y = 1 ∣ x, w) / p(y = −1 ∣ x, w) ].

Show that

log [ p(y = 1 ∣ x, w) / p(y = −1 ∣ x, w) ] = w⊺x.
Our Application