Learning With Maximum Likelihood

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see…
$$
\begin{aligned}
\mu_{\text{mle}}
&= \arg\max_{\mu} \; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) \\
&= \arg\max_{\mu} \; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) && \text{(by i.i.d.)} \\
&= \arg\max_{\mu} \; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) && \text{(monotonicity of log)} \\
&= \arg\max_{\mu} \; -\frac{1}{2\sigma^2}\sum_{i=1}^{R} (x_i - \mu)^2 && \text{(plug in formula for Gaussian)} \\
&= \arg\min_{\mu} \; \sum_{i=1}^{R} (x_i - \mu)^2 && \text{(after simplification)}
\end{aligned}
$$

The maximizing $\mu$ is the $\mu$ such that $\partial LL / \partial \mu = 0$, i.e. the $\mu$ at which the derivative of the sum of squares vanishes:

$$
0 = \frac{\partial}{\partial \mu}\sum_{i=1}^{R} (x_i - \mu)^2 = \sum_{i=1}^{R} -2(x_i - \mu)
$$

Thus

$$
\mu_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i
$$
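As a quick numerical check of this result (not part of the original slides), the following Python sketch confirms that the sample mean minimizes the sum of squared deviations; the data and search grid are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=50)    # x_1, ..., x_R, drawn i.i.d.

def sse(mu):
    # Sum of squared deviations: the quantity the MLE minimizes
    return np.sum((x - mu) ** 2)

mu_mle = x.mean()                               # closed-form answer: (1/R) * sum(x_i)
grid = np.linspace(x.min(), x.max(), 2001)
grid_best = grid[np.argmin([sse(m) for m in grid])]

print(mu_mle, grid_best)   # the grid minimizer lands (up to grid spacing) on the sample mean
```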
Lawks-a-lawdy!
We already have

$$
\mu_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i
$$

Now treat both $\mu$ and $\sigma^2$ as unknown. The log-likelihood of the data is

$$
LL = \log p(x_1, \ldots, x_R \mid \mu, \sigma^2)
= -\frac{R}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2
$$

with partial derivatives

$$
\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)
\qquad
\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2
$$

Setting both to zero:

$$
0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)
\quad\Longrightarrow\quad
\mu = \frac{1}{R}\sum_{i=1}^{R} x_i
$$

$$
0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2
\quad\Longrightarrow\quad
\text{what?}
$$

Solving the second equation for $\sigma^2$ gives the pair of estimates

$$
\mu_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i
\qquad
\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2
$$
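In code, both estimates follow directly from the formulas above. A minimal sketch (the function name and test data are my own, not from the slides):

```python
import numpy as np

def gaussian_mle(x):
    """Maximum likelihood estimates (mu, sigma^2) for i.i.d. univariate Gaussian data."""
    x = np.asarray(x, dtype=float)
    R = x.size
    mu_mle = x.sum() / R                          # (1/R) * sum_i x_i
    sigma2_mle = np.sum((x - mu_mle) ** 2) / R    # (1/R) * sum_i (x_i - mu_mle)^2
    return mu_mle, sigma2_mle

# Example: recover the parameters of a known Gaussian from a large sample
rng = np.random.default_rng(1)
sample = rng.normal(loc=-1.0, scale=3.0, size=10_000)
print(gaussian_mle(sample))   # close to (-1.0, 9.0)
```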
$$
E[\mu_{\text{mle}}] = E\!\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu
$$

so $\mu_{\text{mle}}$ is unbiased.

$$
E[\sigma^2_{\text{mle}}]
= E\!\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2\right]
= E\!\left[\frac{1}{R}\sum_{i=1}^{R}\Bigl(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\Bigr)^{\!2}\right]
= \left(1 - \frac{1}{R}\right)\sigma^2
$$

so $\sigma^2_{\text{mle}}$ is biased.

So define

$$
\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}}
\qquad\text{so that}\qquad
E[\sigma^2_{\text{unbiased}}] = \sigma^2
$$

which gives

$$
\sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2
$$
Which estimator should you use, $\sigma^2_{\text{mle}}$ or $\sigma^2_{\text{unbiased}}$? Answer:
•It depends on the task.
•And it doesn't make much difference once R becomes large.
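A small simulation (illustrative, not from the slides) makes the bias visible for small R: averaging $\sigma^2_{\text{mle}}$ over many repeated samples gives roughly $(1 - 1/R)\sigma^2$, while $\sigma^2_{\text{unbiased}}$ averages to roughly $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true, R, trials = 4.0, 5, 100_000

mle_vals = np.empty(trials)
unbiased_vals = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=R)
    s = np.sum((x - x.mean()) ** 2)
    mle_vals[t] = s / R              # sigma^2_mle
    unbiased_vals[t] = s / (R - 1)   # sigma^2_unbiased

print(mle_vals.mean())        # about (1 - 1/5) * 4 = 3.2
print(unbiased_vals.mean())   # about 4.0
```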
For the multivariate Gaussian $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the maximum likelihood estimates are

$$
\boldsymbol{\mu}_{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k
\qquad
\boldsymbol{\Sigma}_{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}_{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}_{\text{mle}})^T
$$

Componentwise, for $1 \le i \le m$,

$$
\mu_{\text{mle},\,i} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}
$$

where $x_{ki}$ is the value of the $i$th component of $\mathbf{x}_k$ (the $i$th attribute of the $k$th record).

The unbiased estimate of the covariance is

$$
\boldsymbol{\Sigma}_{\text{unbiased}}
= \frac{\boldsymbol{\Sigma}_{\text{mle}}}{1 - \frac{1}{R}}
= \frac{1}{R-1}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}_{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}_{\text{mle}})^T
$$
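These matrix formulas translate directly into a few lines of numpy. A sketch (names are mine, not from the slides), for a data matrix X with R rows (records) and m columns (attributes):

```python
import numpy as np

def multivariate_gaussian_mle(X):
    """MLE of (mu, Sigma) for rows of X assumed i.i.d. from N(mu, Sigma)."""
    X = np.asarray(X, dtype=float)
    R = X.shape[0]
    mu_mle = X.mean(axis=0)                              # (1/R) * sum_k x_k
    centered = X - mu_mle
    Sigma_mle = (centered.T @ centered) / R              # (1/R) * sum_k (x_k - mu)(x_k - mu)^T
    Sigma_unbiased = (centered.T @ centered) / (R - 1)   # divide by R-1 instead of R
    return mu_mle, Sigma_mle, Sigma_unbiased

# Example usage on synthetic 2-dimensional data
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.3], [0.3, 1.0]], size=5_000)
mu_hat, Sigma_hat, _ = multivariate_gaussian_mle(X)
```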
Confidence intervals
We need to talk
• Suppose you have $x_1, x_2, \ldots, x_R \sim$ (i.i.d.) $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
• MAP: Which $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ maximizes $p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid x_1, x_2, \ldots, x_R)$?

Step 1: Prior: $(\nu_0 - m - 1)\boldsymbol{\Sigma} \sim IW(\nu_0,\; (\nu_0 - m - 1)\boldsymbol{\Sigma}_0)$, $\;\boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N(\boldsymbol{\mu}_0,\; \boldsymbol{\Sigma}/\kappa_0)$

Step 2:

$$
\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k
\qquad
\boldsymbol{\mu}_R = \frac{\kappa_0\boldsymbol{\mu}_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R}
\qquad
\kappa_R = \kappa_0 + R
\qquad
\nu_R = \nu_0 + R
$$

$$
(\nu_R + m - 1)\boldsymbol{\Sigma}_R
= (\nu_0 + m - 1)\boldsymbol{\Sigma}_0
+ \sum_{k=1}^{R}(\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T
+ \frac{(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T}{1/\kappa_0 + 1/R}
$$

Step 3: Posterior: $(\nu_R + m - 1)\boldsymbol{\Sigma} \sim IW(\nu_R,\; (\nu_R + m - 1)\boldsymbol{\Sigma}_R)$, $\;\boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N(\boldsymbol{\mu}_R,\; \boldsymbol{\Sigma}/\kappa_R)$

• Conjugate priors: the prior form and the posterior form are the same, and both are characterized by "sufficient statistics" of the data.
• The marginal distribution on $\boldsymbol{\mu}$ is a Student-t.
• One point of view: it's pretty academic if R > 30.
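A sketch of the Step 1 to Step 3 update in numpy, following the parameterization on this slide (the function and variable names are my own assumptions):

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, Sigma0):
    """Update the conjugate prior of Step 1 with data X (R x m) and
    return the posterior parameters (mu_R, kappa_R, nu_R, Sigma_R) of Step 3."""
    X = np.asarray(X, dtype=float)
    R, m = X.shape
    xbar = X.mean(axis=0)

    # Step 2: combine prior "pseudo-counts" with the observed data
    kappa_R = kappa0 + R
    nu_R = nu0 + R
    mu_R = (kappa0 * np.asarray(mu0) + R * xbar) / (kappa0 + R)

    centered = X - xbar
    scatter = centered.T @ centered                  # sum_k (x_k - xbar)(x_k - xbar)^T

    d = (xbar - np.asarray(mu0)).reshape(-1, 1)
    shrink = (d @ d.T) / (1.0 / kappa0 + 1.0 / R)    # pulls the estimate toward mu0

    Sigma_R = ((nu0 + m - 1) * np.asarray(Sigma0) + scatter + shrink) / (nu_R + m - 1)
    return mu_R, kappa_R, nu_R, Sigma_R
```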
• Inputs → Classifier → Predict category: Dec Tree, Joint BC, Naïve BC
• Inputs → Density Estimator → Probability: Joint DE, Naïve DE, Gauss DE
• Inputs → Regressor → Predict real no.