
Learning with Maximum Likelihood

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2004, Andrew W. Moore                                   Sep 6th, 2001


Maximum Likelihood learning of Gaussians for Data Mining
• Why we should care
• Learning Univariate Gaussians
• Learning Multivariate Gaussians
• What's a biased estimator?
• Bayesian Learning of Gaussians


Why we should care
• Maximum Likelihood Estimation is a very very very very fundamental part of data analysis.
• "MLE for Gaussians" is training wheels for our future techniques
• Learning Gaussians is more useful than you might guess…


Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)  [Sneer]

MLE: For which μ is x1, x2, … xR most likely?

MAP: Which μ maximizes p(μ | x1, x2, … xR, σ²)?

Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…


MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ (you do know σ²)
• MLE: For which μ is x1, x2, … xR most likely?

\mu^{\text{mle}} = \arg\max_{\mu} p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)


Algebra Euphoria

\mu^{\text{mle}} = \arg\max_{\mu} p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)

    = ?   (by i.i.d.)

    = ?   (monotonicity of log)

    = ?   (plug in formula for Gaussian)

    = ?   (after simplification)


Algebra Euphoria

\mu^{\text{mle}} = \arg\max_{\mu} p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)

    = \arg\max_{\mu} \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2)                      (by i.i.d.)

    = \arg\max_{\mu} \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2)                  (monotonicity of log)

    = \arg\max_{\mu} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2 \right)   (plug in formula for Gaussian)

    = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2                                   (after simplification)


Intermission: A General Scalar MLE strategy

Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained

*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.

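The recipe is mechanical enough to hand to a computer-algebra system. Below is a minimal sketch (my addition, not from the slides) that applies steps 1–4 to the known-σ² Gaussian case from the previous slides using sympy; the dataset and symbol names are made up for illustration.

```python
import sympy as sp

# Step 0: a tiny made-up dataset and the unknown parameter
data = [2.0, 3.5, 4.1, 2.9]
mu = sp.Symbol('mu')
sigma2 = 1.0  # assumed known, as on the previous slides

# Step 1: LL = log P(Data | mu, sigma2), a sum of i.i.d. Gaussian log-densities
LL = sum(sp.log(1 / sp.sqrt(2 * sp.pi * sigma2))
         - (x - mu) ** 2 / (2 * sigma2) for x in data)

# Step 2: work out dLL/dmu
dLL = sp.diff(LL, mu)

# Steps 3-4: set dLL/dmu = 0 and solve for mu
mu_mle = sp.solve(sp.Eq(dLL, 0), mu)[0]

# Step 5 (sanity check): second derivative is negative, so this is a maximum
assert sp.diff(LL, mu, 2).subs(mu, mu_mle) < 0

print(mu_mle)                  # equals the sample mean
print(sum(data) / len(data))
```
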
The MLE μ

\mu^{\text{mle}} = \arg\max_{\mu} p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)

             = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2

             = \mu \text{ such that } 0 = \frac{\partial}{\partial\mu} \sum_{i=1}^{R} (x_i - \mu)^2

             = ?   (what?)


The MLE μ

\mu^{\text{mle}} = \arg\max_{\mu} p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)

             = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2

             = \mu \text{ such that } 0 = \frac{\partial}{\partial\mu} \sum_{i=1}^{R} (x_i - \mu)^2
                                        = \sum_{i=1}^{R} -2(x_i - \mu)

Thus  \mu = \frac{1}{R} \sum_{i=1}^{R} x_i

Lawks-a-lawdy!

\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i

• The best estimate of the mean of a distribution is the mean of the sample!

At first sight:
This kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their "Statistics" book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.
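As a quick sanity check (my addition, not part of the slides), here is a minimal numpy sketch showing that the arg-min of the sum of squared deviations really is the sample mean; the data values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)    # made-up sample, true mu = 5

mu_mle = x.mean()                                # the MLE derived on the slide

# Brute-force check: the sum of squared deviations is minimized at the sample mean
candidates = np.linspace(x.min(), x.max(), 2001)
sse = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
print(mu_mle, candidates[np.argmin(sse)])        # agree up to the grid spacing
```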


A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)^T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)

1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus:

\frac{\partial LL}{\partial \boldsymbol{\theta}} =
\begin{pmatrix} \dfrac{\partial LL}{\partial \theta_1} \\ \dfrac{\partial LL}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial LL}{\partial \theta_n} \end{pmatrix}

3. Solve the set of simultaneous equations

\frac{\partial LL}{\partial \theta_1} = 0, \quad
\frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad
\frac{\partial LL}{\partial \theta_n} = 0

4. Check that you're at a maximum

If you can't solve them, what should you do? (One option is sketched below.)

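When the simultaneous equations have no closed-form solution, a standard fallback is numerical optimization of the log-likelihood. The following sketch (my addition, not from the slides) maximizes the Gaussian log-likelihood with scipy by minimizing its negative; the data and starting point are made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=200)    # made-up data

def neg_log_likelihood(theta):
    """theta = (mu, log_sigma); optimizing log(sigma) keeps sigma positive."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                   - (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Compare against the closed-form answers from the upcoming slides
print(mu_hat, x.mean())
print(sigma_hat, x.std())      # np.std uses the 1/R (MLE) form by default
```
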
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
    = -R\left(\log\sqrt{2\pi} + \log\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2

\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)

\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2


MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
    = -R\left(\log\sqrt{2\pi} + \log\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2

Setting both partial derivatives to zero:

0 = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)

0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2


MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)
    = -R\left(\log\sqrt{2\pi} + \log\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{R} (x_i - \mu)^2

0 = \frac{1}{\sigma^2} \sum_{i=1}^{R} (x_i - \mu)  \;\Rightarrow\;  \mu = \frac{1}{R}\sum_{i=1}^{R} x_i

0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{R} (x_i - \mu)^2  \;\Rightarrow\;  \sigma^2 = \text{what?}


MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, σ²)
• But you don't know μ or σ²
• MLE: For which θ = (μ, σ²) is x1, x2, … xR most likely?

\mu^{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i

\sigma^2_{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} (x_i - \mu^{\text{mle}})^2
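A minimal numpy sketch of these two formulas (my addition; the dataset is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=500)   # made-up data

R = len(x)
mu_mle = x.sum() / R                            # (1/R) * sum of x_i
sigma2_mle = ((x - mu_mle) ** 2).sum() / R      # (1/R) * sum of squared deviations

# numpy's built-ins give the same answers (np.var defaults to the 1/R form)
assert np.isclose(mu_mle, x.mean())
assert np.isclose(sigma2_mle, x.var())
print(mu_mle, sigma2_mle)
```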


Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E\left[\mu^{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu

μ^mle is unbiased


Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2\right]
    = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] \ne \sigma^2

σ²_mle is biased


MLE Variance Bias
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
    = \left(1 - \frac{1}{R}\right)\sigma^2 \ne \sigma^2

Intuition check: consider the case of R = 1.
Why should our guts expect that σ²_mle would be an underestimate of true σ²?
How could you prove that?


Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(μ, σ²) then

E\left[\sigma^2_{\text{mle}}\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right]
    = \left(1 - \frac{1}{R}\right)\sigma^2

So define  \sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}}   so that   E\left[\sigma^2_{\text{unbiased}}\right] = \sigma^2

\sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2
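To see the (1 − 1/R) shrinkage in practice, here is a small Monte Carlo sketch (my addition, not from the slides) comparing the two variance estimators over many repeated small samples; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
true_sigma2 = 4.0
R = 5                       # deliberately small so the bias is visible
trials = 200_000

samples = rng.normal(0.0, np.sqrt(true_sigma2), size=(trials, R))
mu_mle = samples.mean(axis=1, keepdims=True)

sigma2_mle = ((samples - mu_mle) ** 2).sum(axis=1) / R              # 1/R version
sigma2_unbiased = ((samples - mu_mle) ** 2).sum(axis=1) / (R - 1)   # 1/(R-1) version

print(sigma2_mle.mean())        # close to (1 - 1/R) * 4.0 = 3.2
print(sigma2_unbiased.mean())   # close to 4.0
```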


Unbiaseditude discussion
• Which is best?

\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2

\sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu^{\text{mle}})^2

Answer:
• It depends on the task
• And it doesn't make much difference once R → large


Don’t get too excited about being
unbiased
• Assume x1, x2, … xR ~(i.i.d) N(,2)
• Suppose we had these estimators for the mean
R
1
 suboptimal 
R7 R
x
i 1
i

Are either of these unbiased?


 crap  x1 Will either of them asymptote to the
correct value as R gets large?
Which is more useful?

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 28
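If you want to check your answers to these questions empirically, here is a small simulation sketch (my addition, using the two estimators as written above, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu, sigma = 10.0, 2.0
trials = 5_000

for R in (10, 100, 1000):
    samples = rng.normal(true_mu, sigma, size=(trials, R))
    mu_suboptimal = samples.sum(axis=1) / (R + 7)   # divides by R+7 instead of R
    mu_crap = samples[:, 0]                          # ignores all but the first datapoint
    # Print the average value (reveals bias) and the spread (reveals usefulness)
    print(R,
          round(mu_suboptimal.mean(), 3), round(mu_suboptimal.std(), 3),
          round(mu_crap.mean(), 3), round(mu_crap.std(), 3))
```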


MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k

\boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T


MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k
\qquad\qquad
\mu_i^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} x_{ki}

where 1 ≤ i ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and μ_i^mle is the ith component of μ^mle.

\boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k

\boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

\sigma_{ij}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (x_{ki} - \mu_i^{\text{mle}})(x_{kj} - \mu_j^{\text{mle}})

where 1 ≤ i ≤ m, 1 ≤ j ≤ m, x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and σ_ij^mle is the (i,j)th component of Σ^mle.

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MLE: For which θ = (μ, Σ) is x1, x2, … xR most likely?

\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k

\boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

\boldsymbol{\Sigma}^{\text{unbiased}} = \frac{\boldsymbol{\Sigma}^{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R} (\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T

Q: How would you prove this?
A: Just plug through the MLE recipe.

Note how Σ^mle is forced to be symmetric non-negative definite.
Note the unbiased case.
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?

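A short numpy sketch of the m-dimensional formulas above (my addition; the toy data are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=300)   # R=300 records, m=2

R = X.shape[0]
mu_mle = X.sum(axis=0) / R

centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / R            # sum of outer products, divided by R
Sigma_unbiased = (centered.T @ centered) / (R - 1)

# np.cov uses the 1/(R-1) form by default (rowvar=False treats rows as records)
assert np.allclose(Sigma_unbiased, np.cov(X, rowvar=False))
print(mu_mle)
print(Sigma_mle)
```
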
Confidence intervals
We need to talk

We need to discuss how accurate we expect μ^mle and σ^mle to be as a function of R.
And we need to consider how to estimate these accuracies from data…
• Analytically *
• Non-parametrically (using randomization and bootstrapping) *
But we won't. Not yet.

* Will be discussed in future Andrew lectures…just before we need this technology.


Structural error

Actually, we need to talk about something else too…
What if we do all this analysis when the true distribution is in fact not Gaussian?
How can we tell? *
How can we survive? *

* Will be discussed in future Andrew lectures…just before we need this technology.


Gaussian MLE in action

Using R = 392 cars from the "MPG" UCI dataset supplied by Ross Quinlan.


Data-starved Gaussian MLE

Using three subsets of MPG. Each subset has 6 randomly-chosen cars.


Bivariate MLE in action



Multivariate MLE

Covariance matrices are not exciting to look at



Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• But you don't know μ or Σ
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (μ, Σ)

Step 1a: Put a prior on Σ:

\boldsymbol{\Sigma} \sim \mathrm{IW}\!\left(\nu_0,\; (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0\right)

This thing is called the Inverse-Wishart distribution. A PDF over SPD (symmetric positive-definite) matrices!
  • Σ₀: (roughly) my best guess of Σ
  • ν₀ small: "I am not sure about my guess of Σ₀"
  • ν₀ large: "I'm pretty sure about my guess of Σ₀"

Step 1b: Put a prior on μ | Σ:

\boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N\!\left(\boldsymbol{\mu}_0,\; \boldsymbol{\Sigma}/\kappa_0\right)

Together, "Σ" and "μ | Σ" define a joint distribution on (μ, Σ).
  • μ₀: my best guess of μ;  E[μ] = μ₀
  • κ₀ small: "I am not sure about my guess of μ₀"
  • κ₀ large: "I'm pretty sure about my guess of μ₀"

Notice how we are forced to express our ignorance of μ proportionally to Σ.


Being Bayesian: MAP estimates for Gaussians

Why do we use this form of prior?
• Actually, we don't have to.
• But it is computationally and algebraically convenient…
• …it's a conjugate prior.


Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(μ, Σ)
• MAP: Which (μ, Σ) maximizes p(μ, Σ | x1, x2, … xR)?

Step 1: Prior:
\boldsymbol{\Sigma} \sim \mathrm{IW}\!\left(\nu_0,\; (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0\right), \qquad
\boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N\!\left(\boldsymbol{\mu}_0,\; \boldsymbol{\Sigma}/\kappa_0\right)

Step 2:
\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R} \mathbf{x}_k

\boldsymbol{\mu}_R = \frac{\kappa_0 \boldsymbol{\mu}_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R}, \qquad
\kappa_R = \kappa_0 + R, \qquad
\nu_R = \nu_0 + R

(\nu_R - m - 1)\,\boldsymbol{\Sigma}_R = (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0
    + \sum_{k=1}^{R} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T
    + \frac{(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T}{1/\kappa_0 + 1/R}

Step 3: Posterior:
\boldsymbol{\Sigma} \sim \mathrm{IW}\!\left(\nu_R,\; (\nu_R - m - 1)\,\boldsymbol{\Sigma}_R\right), \qquad
\boldsymbol{\mu} \mid \boldsymbol{\Sigma} \sim N\!\left(\boldsymbol{\mu}_R,\; \boldsymbol{\Sigma}/\kappa_R\right)

Result:  μ^map = μ_R,   E[Σ | x1, x2, … xR] = Σ_R

• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean the prior form and posterior form are the same and characterized by "sufficient statistics" of the data.
• The marginal distribution on μ is a student-t.
• One point of view: it's pretty academic if R > 30.

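Here is a minimal numpy sketch of the Step 2 update above (my addition, not from the slides). It assumes the (ν₀, κ₀, μ₀, Σ₀) parametrization written on this slide; all prior values and data are made up.

```python
import numpy as np

def niw_map_update(X, mu0, kappa0, nu0, Sigma0):
    """Step 2 of the slide: combine the prior (mu0, kappa0, nu0, Sigma0) with data X."""
    R, m = X.shape
    xbar = X.mean(axis=0)

    mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
    kappa_R = kappa0 + R
    nu_R = nu0 + R

    centered = X - xbar
    scatter = centered.T @ centered                     # sum_k (x_k - xbar)(x_k - xbar)^T
    d = (xbar - mu0).reshape(-1, 1)
    Sigma_R = ((nu0 - m - 1) * Sigma0 + scatter
               + (d @ d.T) / (1.0 / kappa0 + 1.0 / R)) / (nu_R - m - 1)
    return mu_R, kappa_R, nu_R, Sigma_R

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=20)   # made-up data

# Made-up prior: "I guess the mean is near (1, 1) and the covariance is near I"
mu_R, kappa_R, nu_R, Sigma_R = niw_map_update(
    X, mu0=np.array([1.0, 1.0]), kappa0=2.0, nu0=6.0, Sigma0=np.eye(2))

print(mu_R)      # mu_map
print(Sigma_R)   # E[Sigma | data]
```
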
Where we're at

                        Inputs:   Categorical          Real-valued      Mixed Real / Cat
                                  inputs only          inputs only      okay

Classifier    (predict category)  Joint BC, Naïve BC                    Dec Tree

Density       (probability)       Joint DE, Naïve DE   Gauss DE
Estimator

Regressor     (predict real no.)


What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian parameters
• Understand "biased estimator" versus "unbiased estimator"
• Appreciate the outline behind Bayesian estimation of Gaussian parameters


Useful exercise
• We'd already done some MLE in this class without even telling you!
• Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial
  M(p1, p2, … pn)
  where P(xk = j | p) = pj
• What is the MLE p = (p1, p2, … pn)?
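For reference, a tiny numpy sketch of the estimator this exercise is after (my addition). It assumes the answer is the vector of empirical frequencies, which you can verify with the MLE recipe plus a Lagrange multiplier for the constraint that the pj sum to 1.

```python
import numpy as np

def multinomial_mle(x, n):
    """MLE of (p_1, ..., p_n) from categorical observations x_k in {1, ..., n}:
    p_j = (# of records with x_k = j) / R, i.e. the empirical frequencies."""
    x = np.asarray(x)
    counts = np.array([(x == j).sum() for j in range(1, n + 1)])
    return counts / len(x)

# Made-up arity-3 data
x = [1, 3, 3, 2, 3, 1, 3, 2, 3, 3]
print(multinomial_mle(x, n=3))   # -> [0.2, 0.2, 0.6]
```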
