
The EM Algorithm

Generative Models
So far, we looked at probability theory as a tool to express the belief of an ML algorithm that the true label is such and such
Likelihood: given a model $\mathbf{w}$, it tells us $\mathbb{P}[y \mid \mathbf{x}, \mathbf{w}]$
We also looked at how to use probability theory to express our beliefs about which models are preferred by us and which are not
Prior: this just tells us $\mathbb{P}[\mathbf{w}]$
Notice that in all of this, the data features $\mathbf{x}$ were always considered constant and never questioned as being random or flexible
Can we also talk about $\mathbb{P}[\mathbf{x} \mid y]$?
Very beneficial: given a label $y$, this would allow us to generate a new $\mathbf{x}$ from the distribution $\mathbb{P}[\mathbf{x} \mid y]$
Can generate new cat images, new laptop designs (GANs do this very thing!)
Generative Algorithms
ML algos that can learn distributions of the form $\mathbb{P}[\mathbf{x}]$ or $\mathbb{P}[\mathbf{x} \mid y]$ or $\mathbb{P}[\mathbf{x}, y]$
A slightly funny bit of terminology used in machine learning
Discriminative algorithms: those that only use $\mathbb{P}[y \mid \mathbf{x}]$ to do their stuff
Generative algorithms: those that use $\mathbb{P}[\mathbf{x}]$ or $\mathbb{P}[\mathbf{x} \mid y]$ etc. to do their stuff
Generative algorithms have their advantages and disadvantages
More expensive: slower train times, slower test times, larger models
Overkill: often we need only $\mathbb{P}[y \mid \mathbf{x}]$ to make predictions – discriminative algorithms are enough!
More frugal: can work even if we have very little training data (e.g. RecSys)
More robust: can work even if features are corrupted, e.g. some features missing
A recent application of generative techniques (GANs etc.) allows us to generate novel examples of a certain class of data points
A very simple generative model
Given a few feature vectors (never mind labels for now) $\mathbf{x}^1, \ldots, \mathbf{x}^n \in \mathbb{R}^d$
We wish to learn a probability distribution with support over $\mathbb{R}^d$
This distribution should capture interesting properties of the data in a way that allows us to do things like generate similar-looking feature vectors etc.
Let us try to learn a Gaussian with identity covariance as this distribution, i.e. we wish to learn $\boldsymbol\mu \in \mathbb{R}^d$ so that the distribution $\mathcal{N}(\boldsymbol\mu, I_d)$ explains the data well
One way is to look for a $\boldsymbol\mu$ that achieves maximum likelihood, i.e. MLE!!
As before, assume that our feature vectors were independently generated
$\hat{\boldsymbol\mu} = \arg\max_{\boldsymbol\mu} \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu]$ which, upon applying first-order optimality, gives us $\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^i$
We just learnt $\mathcal{N}(\hat{\boldsymbol\mu}, I_d)$ as our generating distribution for the data features!
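As a minimal sketch of this MLE (NumPy, with synthetic data standing in for the feature vectors; all names and numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature vectors (n points in d dimensions), a stand-in for real data
n, d = 500, 2
X = rng.normal(loc=[3.0, -1.0], scale=1.0, size=(n, d))

# MLE for the mean of a Gaussian with identity covariance:
# first-order optimality gives mu_hat = average of the data points
mu_hat = X.mean(axis=0)

# We can now "generate" new feature vectors from N(mu_hat, I_d)
new_samples = rng.normal(loc=mu_hat, scale=1.0, size=(10, d))
print(mu_hat, new_samples.shape)
```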
Still more powerful generative model?
Suppose we are concerned that a single Gaussian cannot capture all the variations in our data
Can we learn 2 (or more) Gaussians to represent our data instead?
Such a generative model is often called a mixture of Gaussians
The Expectation Maximization (EM) algorithm is a very powerful technique for performing this and several other tasks
Soft clustering, learning Gaussian mixture models (GMM)
Robust learning, mixed regression
Also underlies more powerful variational algorithms such as the VAE
Learning a Mixture of Two Gaussians
We suspect that instead of one Gaussian, two Gaussians are involved in generating our feature vectors
Let us call them $\mathcal{N}(\boldsymbol\mu^1, I_d)$ and $\mathcal{N}(\boldsymbol\mu^2, I_d)$
Each of these is called a component of this GMM
Covariance matrices, as well as more than two components, can also be incorporated
Since we are unsure which data point came from which component, we introduce a latent variable $z^i \in \{1, 2\}$ per data point to denote this
This means that if someone tells us that $z^i = 1$, the first Gaussian is responsible for that data point and consequently the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol\mu^1, I_d)$. Similarly, if someone tells us that $z^i = 2$, the second Gaussian is responsible for that data point and the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol\mu^2, I_d)$
The English word “latent” means hidden or dormant or concealed
Nice name since this variable describes something that was hidden from us
These latent variables may seem similar to the one we used in (soft) k-means
Not an accident – the connections will be clear soon!
Latent variables can be discrete or continuous
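To make this generative story concrete, here is a small sampling sketch (the means, mixing proportions, and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical component means for a 2-component GMM with identity covariance
mu = np.array([[0.0, 0.0],
               [5.0, 5.0]])
n, d = 300, 2

# Latent variable z picks the responsible component for each point
# (equal mixing proportions assumed for simplicity)
z = rng.integers(low=0, high=2, size=n)

# Each x^i is drawn from N(mu[z_i], I_d)
X = mu[z] + rng.normal(size=(n, d))
print(X.shape, np.bincount(z))
```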
MLE with Latent Variables
We wish to obtain the maximum (log-)likelihood models, i.e.
$\{\hat{\boldsymbol\mu}^1, \hat{\boldsymbol\mu}^2\} = \arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2} \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu^1, \boldsymbol\mu^2]$
Since we do not know the values of the latent variables, force them into the expression using the law of total probability
We did a similar thing (introducing models) in predictive posterior calculations
$\sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\mu^1, \boldsymbol\mu^2] = \sum_{i=1}^n \log \left( \sum_{z=1}^2 \mathbb{P}[z^i = z] \cdot \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2] \right)$
Very difficult optimization problem – NP-hard in general
However, two heuristics exist which work reasonably well in practice
They are also theoretically sound if the data is “nice” (details in a learning theory course)
Heuristic 1: Alternating Optimization
Convert the original optimization problem
$\arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2} \sum_{i=1}^n \log \left( \sum_{z=1}^2 \mathbb{P}[z^i = z] \cdot \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2] \right)$
to a double maximization problem (assume $\mathbb{P}[z^i = z]$ const)
$\arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2}\; \max_{z^1, \ldots, z^n} \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid z^i, \boldsymbol\mu^1, \boldsymbol\mu^2]$
In several ML problems with latent vars, although the above double optimization problem is (still) difficult, the following two problems are easy
Step 1: Fix $\{\boldsymbol\mu^1, \boldsymbol\mu^2\}$ and update the latent variables to their optimal values, i.e. $z^i \leftarrow \arg\max_{z \in \{1,2\}} \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2]$
Step 2: Fix the latent variables $z^1, \ldots, z^n$ and update $\{\boldsymbol\mu^1, \boldsymbol\mu^2\}$ to their optimal values, i.e. $\{\boldsymbol\mu^1, \boldsymbol\mu^2\} \leftarrow \arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2} \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid z^i, \boldsymbol\mu^1, \boldsymbol\mu^2]$
Keep alternating between Step 1 and Step 2 till you are tired or till the process has converged!
The intuition behind reducing things to a double optimization is that it may often be the case that one of the terms in the summation dominates, and if this is the case, then approximating the sum by the largest term should be okay
The most important difference between the original and the new problem is that the original has a “sum of log of sum” which is very difficult to optimize, whereas the new problem gets rid of this and looks simply like an MLE problem. We know how to solve MLE problems very easily!
Heuristic 1 at Work
As discussed before, we assume a mixture of two Gaussians $\mathcal{N}(\boldsymbol\mu^1, I_d)$ and $\mathcal{N}(\boldsymbol\mu^2, I_d)$
Step 1 becomes $z^i \leftarrow \arg\min_{z \in \{1,2\}} \|\mathbf{x}^i - \boldsymbol\mu^z\|_2^2$, i.e. assign each point to its nearest mean
Step 2 becomes $\boldsymbol\mu^k \leftarrow \frac{1}{n_k} \sum_{i:\, z^i = k} \mathbf{x}^i$
Thus, $\boldsymbol\mu^1$ and $\boldsymbol\mu^2$ become the averages of the points assigned to them, where $n_k$ is the number of data points for which we have $z^i = k$
Repeat!
Isn’t this like the k-means clustering algorithm?
Not just “like” – this is the k-means algorithm! This means that the k-means algorithm is one heuristic way to compute an MLE which is difficult to compute directly!
Indeed! Notice that even here, instead of choosing just one value of the latent variable $z^i$ at each time step, we can instead use a distribution over its support $\{1, 2\}$
I have a feeling that the second heuristic will also give us something familiar!
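Before moving to the second heuristic, here is a minimal code sketch of Heuristic 1 for two spherical Gaussians (the function name alt_opt_two_gaussians and the random initialization are my own choices; this is exactly 2-means clustering):

```python
import numpy as np

def alt_opt_two_gaussians(X, n_iters=20, seed=0):
    """Heuristic 1 (alternating optimization) for a 2-component mixture.

    Assumes identity covariance and a constant prior on z, so Step 1
    reduces to nearest-mean (hard) assignment, i.e. k-means with K=2."""
    rng = np.random.default_rng(seed)
    # Initialize the two means with two distinct random data points
    mu = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iters):
        # Step 1: fix means, pick the best z^i for each point (hard assignment)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, 2)
        z = dists.argmin(axis=1)
        # Step 2: fix assignments, update each mean as the average of its points
        for k in range(2):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z
```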
Heuristic 2: Expectation Maximization
Original problem: $\arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2} \sum_{i=1}^n \log \left( \sum_{z=1}^2 \mathbb{P}[z^i = z] \cdot \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2] \right)$
Step 1 (E step) consists of two sub-steps
Step 1.1: Assume our current model estimates are $\{\boldsymbol\mu^1_t, \boldsymbol\mu^2_t\}$
Use the current models to ascertain how likely the different values of $z^i$ are for the $i$-th data point, i.e. compute $w^i_z = \mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\mu^1_t, \boldsymbol\mu^2_t]$ for both $z = 1, 2$
Step 1.2: Use the weights $w^i_z$ to set up a new objective function
As before, assume $\mathbb{P}[z^i = z]$ constant for the sake of simplicity
$\tilde{\mathcal{L}}(\boldsymbol\mu^1, \boldsymbol\mu^2) = \sum_{i=1}^n \sum_{z=1}^2 w^i_z \log \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2]$
Step 2 (M step): Maximize the new objective function to get the new models
$\{\boldsymbol\mu^1_{t+1}, \boldsymbol\mu^2_{t+1}\} = \arg\max_{\boldsymbol\mu^1, \boldsymbol\mu^2} \tilde{\mathcal{L}}(\boldsymbol\mu^1, \boldsymbol\mu^2)$
Repeat!
Yet again, the new problem gets rid of the treacherous “sum of log of sum” terms which are difficult to optimize. The new problem instead looks simply like a weighted MLE problem with weights $w^i_z$, and we know how to solve MLE problems very easily!
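A corresponding sketch of Heuristic 2 for the same two-Gaussian mixture (identity covariance and the constant-prior assumption from above; the max-subtraction is only for numerical stability, and the function name is my own):

```python
import numpy as np

def em_two_gaussians(X, n_iters=50, seed=0):
    """Heuristic 2 (EM) for a mixture of two Gaussians with identity
    covariance and equal mixing proportions."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iters):
        # E step: w[i, z] proportional to N(x^i; mu^z, I), normalized over z
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, 2)
        unnorm = np.exp(-0.5 * (sq_dists - sq_dists.min(axis=1, keepdims=True)))
        w = unnorm / unnorm.sum(axis=1, keepdims=True)
        # M step: each mean becomes a weighted average of all the points
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu, w
```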
Derivation of the E Step
Jensen’s inequality tells us that $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for any convex function $f$. We use the fact that $\log$ is a concave function, and so the inequality reverses, since every concave function is the negative of a convex function
Let $\boldsymbol\theta$ denote the models $\{\boldsymbol\mu^1, \boldsymbol\mu^2\}$ to avoid clutter. Also let $\boldsymbol\theta_t$ denote our current estimate of the model
We just need to see the derivation for a single point, say the $i$-th point
The steps, in order: apply the law of total probability; simply multiply and divide by the same term $\mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]$; apply Jensen’s inequality; rename $\mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]$ as $w^i_z$; drop the leftover term since it does not depend on $\boldsymbol\theta$ (it depends only on $\boldsymbol\theta_t$)
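Written out for a single data point in the notation above (the “const” collects the terms that depend only on $\boldsymbol\theta_t$, not on $\boldsymbol\theta$), the chain of steps reads:

```latex
\begin{align*}
\log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\theta]
  &= \log \sum_{z=1}^{2} \mathbb{P}[\mathbf{x}^i, z^i = z \mid \boldsymbol\theta]
     && \text{(law of total probability)} \\
  &= \log \sum_{z=1}^{2} \mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]
     \cdot \frac{\mathbb{P}[\mathbf{x}^i, z^i = z \mid \boldsymbol\theta]}
                {\mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]}
     && \text{(multiply and divide by the same term)} \\
  &\ge \sum_{z=1}^{2} \mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]
     \log \frac{\mathbb{P}[\mathbf{x}^i, z^i = z \mid \boldsymbol\theta]}
               {\mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]}
     && \text{(Jensen's inequality, $\log$ concave)} \\
  &= \sum_{z=1}^{2} w^i_z \log \mathbb{P}[\mathbf{x}^i, z^i = z \mid \boldsymbol\theta]
     + \text{const}
     && \text{(renaming $w^i_z$; const does not depend on $\boldsymbol\theta$)}
\end{align*}
```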
The EM Algorithm
Note: assumptions such as $\mathbb{P}[z^i = z]$ const are made for the sake of simplicity. One can execute EM perfectly well without making these assumptions; however, the updates then get more involved – be careful not to make mistakes
If we instantiate the EM algorithm with the GMM likelihoods, we will recover the soft k-means algorithm
Thus, the soft k-means algorithm is yet another heuristic way (the k-means algo was the first) to compute an MLE which is difficult to compute directly!
EM for GMM
1. Initialize means $\boldsymbol\mu^1_0, \boldsymbol\mu^2_0$
2. For $i = 1, \ldots, n$ and $z = 1, 2$:
   1. Let $\tilde{w}^i_z = \mathcal{N}(\mathbf{x}^i; \boldsymbol\mu^z_t, I_d)$
   2. Let $w^i_z = \tilde{w}^i_z / (\tilde{w}^i_1 + \tilde{w}^i_2)$ (normalize)
3. Let $n_z = \sum_{i=1}^n w^i_z$
4. Update $\boldsymbol\mu^z_{t+1} = \frac{1}{n_z} \sum_{i=1}^n w^i_z\, \mathbf{x}^i$
5. Repeat until convergence
The EM algorithm has pros and cons over alternating optimization
Con: EM is usually more expensive to execute than alternating optimization
Pro: EM ensures that the objective value of the original problem, i.e. $\sum_{i=1}^n \log \left( \sum_{z} \mathbb{P}[z^i = z] \cdot \mathbb{P}[\mathbf{x}^i \mid z^i = z, \boldsymbol\mu^1, \boldsymbol\mu^2] \right)$, always keeps going up at every iteration – monotonic progress!!
However, there is no guarantee that we will ever reach the global maximum
May converge to, and get stuck at, a local maximum instead
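One way to see the monotonic-progress claim in practice is to track the original objective across EM iterations on synthetic data. This is a self-contained sketch (data, helper names, and the small tolerance guarding against floating-point noise are all mine):

```python
import numpy as np

def log_likelihood(X, mu):
    # Original objective: sum_i log( 0.5*N(x^i; mu^1, I) + 0.5*N(x^i; mu^2, I) )
    d = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_comp = np.log(0.5) - 0.5 * d * np.log(2 * np.pi) - 0.5 * sq
    m = log_comp.max(axis=1, keepdims=True)               # log-sum-exp trick
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())

def em_step(X, mu):
    # One E step + M step (identity covariance, equal mixing weights)
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    unnorm = np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)))
    w = unnorm / unnorm.sum(axis=1, keepdims=True)
    return (w.T @ X) / w.sum(axis=0)[:, None]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
mu = X[rng.choice(len(X), 2, replace=False)]
prev = -np.inf
for t in range(30):
    mu = em_step(X, mu)
    cur = log_likelihood(X, mu)
    assert cur >= prev - 1e-6   # monotonic progress of the original objective
    prev = cur
print(prev, mu)
```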
The Q Function
Let $Q_t(\boldsymbol\theta)$ be the new objective function constructed at time step $t$
We call it $Q_t$ instead of just $Q$ since $Q_t$ uses the values $w^i_z$ which are calculated using $\boldsymbol\theta_t$. Thus, the $Q$ function keeps changing
The EM algorithm constructs a new $Q_t$ function at each time step during the E-step and maximizes it during the M-step
$\boldsymbol\theta_{t+1} = \arg\max_{\boldsymbol\theta} Q_t(\boldsymbol\theta)$
We have already seen that $Q_t(\boldsymbol\theta) \le \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\theta]$ for all $\boldsymbol\theta$
Easy enough to show that $Q_t(\boldsymbol\theta_t) = \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i \mid \boldsymbol\theta_t]$
Some indication as to why EM increases the likelihood at each iteration
Alternating optimization can be seen as a cousin of EM that uses a $Q$ function of the form $Q_t(\boldsymbol\theta) = \sum_{i=1}^n \log \mathbb{P}[\mathbf{x}^i, \hat{z}^i \mid \boldsymbol\theta]$ where $\hat{z}^i = \arg\max_z \mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]$
The Generic EM Algorithm
1. Initialize model $\boldsymbol\theta_0$
2. For every latent variable $z^i$ and every possible value $z$ it could take, compute $w^i_z = \mathbb{P}[z^i = z \mid \mathbf{x}^i, \boldsymbol\theta_t]$
3. Compute the Q-function $Q_t(\boldsymbol\theta)$
4. Update $\boldsymbol\theta_{t+1} = \arg\max_{\boldsymbol\theta} Q_t(\boldsymbol\theta)$
5. Repeat until convergence
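The generic pseudo-code above can be mirrored by a minimal loop skeleton (the names generic_em, e_step, and m_step are my own placeholders; the user supplies the model-specific callables, e.g. the GMM steps sketched earlier):

```python
from typing import Any, Callable

def generic_em(theta0: Any,
               e_step: Callable[[Any], Any],
               m_step: Callable[[Any], Any],
               n_iters: int = 100) -> Any:
    """Generic EM loop: e_step(theta_t) returns the posterior weights w^i_z
    (step 2), and m_step(w) builds and maximizes the Q-function (steps 3-4).
    A convergence check on theta could replace the fixed iteration count."""
    theta = theta0
    for _ in range(n_iters):
        w = e_step(theta)
        theta = m_step(w)
    return theta
```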
A pictorial depiction of the EM algorithm
[Figure omitted: the log-likelihood (red curve) plotted against $\boldsymbol\theta$, together with successive $Q_t$ curves. The red curve is not necessarily an inverted quadratic – this is just an illustration. The $Q_t$ curves always lie below the red curve and always touch it at $\boldsymbol\theta_t$. Since the M-step only maximizes $Q_t$, the iterates can get stuck at a local maximum.]
Mixed Regression
An example of latent variables in a supervised learning task
We have regression training data $\{(\mathbf{x}^i, y^i)\}_{i=1}^n$
Example: $\mathbf{x}$ denotes age and $y$ denotes time spent on a website
There are two subpopulations in the data (gender) which behave differently even if age is the same
An indication that our features may be incomplete/latent
Sure, we could try clustering this data first and then apply regression models separately on both clusters. However, using latent variables may be beneficial since 1) clustering (e.g. k-means) may not necessarily work well, since the points here are really not close to two centroids (instead, they lie close to two lines, which k-means is really not meant to handle), and 2) using latent variables, we can elegantly cluster and learn regression models jointly!!
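For illustration, mixed-regression data of this kind can be simulated as follows (the slopes, ranges, and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy mixed-regression data: each point follows one of two hypothetical lines
n = 200
x = rng.uniform(18, 60, size=n)                  # e.g. age
z = rng.integers(0, 2, size=n)                   # hidden subpopulation
w_true = np.array([0.5, 2.0])                    # two different slopes
y = w_true[z] * x + rng.normal(0, 1.0, size=n)   # e.g. time spent on website
```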
Latent Variables to the Rescue
As before, if we believe that our data is best explained using two linear regression models instead of one, we should work with a mixed model (aka mixture of experts)
We will fit two regression models $\mathbf{w}^1, \mathbf{w}^2$ to the data and use a latent variable $z^i \in \{1, 2\}$ to keep track of which data point belongs to which model
Let us use Gaussian likelihoods since we are comfortable with them
$\mathbb{P}[y^i \mid \mathbf{x}^i, z^i = 1, \mathbf{w}^1, \mathbf{w}^2] = \mathcal{N}(y^i; \langle \mathbf{w}^1, \mathbf{x}^i \rangle, 1)$
$\mathbb{P}[y^i \mid \mathbf{x}^i, z^i = 2, \mathbf{w}^1, \mathbf{w}^2] = \mathcal{N}(y^i; \langle \mathbf{w}^2, \mathbf{x}^i \rangle, 1)$
We could have had separate variances $\sigma_1^2$ and $\sigma_2^2$ for the two components as well, which we could also learn. However, this would make things more tedious, so for now let us assume $\sigma_1 = \sigma_2 = 1$ and also that $\mathbb{P}[z^i = 1] = \mathbb{P}[z^i = 2] = \frac{1}{2}$
Note: this is not generative learning since we are still learning discriminative distributions of the form $\mathbb{P}[y \mid \mathbf{x}]$
We will see soon how to perform generative learning in supervised settings
MLE for Mixed Regression
$\arg\max_{\mathbf{w}^1, \mathbf{w}^2} \sum_{i=1}^n \log \mathbb{P}[y^i \mid \mathbf{x}^i, \mathbf{w}^1, \mathbf{w}^2]$ which, upon introducing latent variables, gives $\arg\max_{\mathbf{w}^1, \mathbf{w}^2} \sum_{i=1}^n \log \left( \sum_{z=1}^2 \mathbb{P}[z^i = z] \cdot \mathbb{P}[y^i \mid \mathbf{x}^i, z^i = z, \mathbf{w}^1, \mathbf{w}^2] \right)$
Method 1: Alternating Optimization
As before, assume $\mathbb{P}[z^i = z]$ constant for the sake of simplicity to get
$\arg\max_{\mathbf{w}^1, \mathbf{w}^2}\; \max_{z^1, \ldots, z^n} \sum_{i=1}^n \log \mathbb{P}[y^i \mid \mathbf{x}^i, z^i, \mathbf{w}^1, \mathbf{w}^2]$
Step 1: Fix $\{\mathbf{w}^1, \mathbf{w}^2\}$ and update all $z^i$
Step 2: Fix all $z^i$ and update the models $\{\mathbf{w}^1, \mathbf{w}^2\}$
Alternating Optimization for MR
As before, we assumed the likelihood distributions as $\mathbb{P}[y^i \mid \mathbf{x}^i, z^i = z, \mathbf{w}^1, \mathbf{w}^2] = \mathcal{N}(y^i; \langle \mathbf{w}^z, \mathbf{x}^i \rangle, 1)$ for $z = 1, 2$
Step 1 becomes $z^i \leftarrow \arg\min_{z \in \{1,2\}} \left(y^i - \langle \mathbf{w}^z, \mathbf{x}^i \rangle\right)^2$
i.e. assign every data point to its “closest” line, or the line which fits it better
Step 2 becomes $\mathbf{w}^z \leftarrow \arg\min_{\mathbf{w}} \sum_{i:\, z^i = z} \left(y^i - \langle \mathbf{w}, \mathbf{x}^i \rangle\right)^2$
i.e. perform least squares on the data points assigned to each component
May incorporate a prior as well to add a regularizer (ridge regression)
Repeat!
AltOpt for MR
1. Initialize models $\mathbf{w}^1_0, \mathbf{w}^2_0$
2. For $i = 1, \ldots, n$, update $z^i_{t+1} = \arg\min_{z} \left(y^i - \langle \mathbf{w}^z_t, \mathbf{x}^i \rangle\right)^2$
3. Update $\mathbf{w}^z_{t+1}$ using least squares on the points with $z^i_{t+1} = z$
4. Repeat until convergence
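A sketch of AltOpt for MR (two components, unit noise variance, no regularizer; the function name and defaults are mine, and np.linalg.lstsq does the per-component least squares):

```python
import numpy as np

def altopt_mixed_regression(X, y, n_iters=30, seed=0):
    """Alternating optimization for mixed regression with two components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(2, d))                       # initialize the two models
    for _ in range(n_iters):
        # Step 1: assign each point to the line that fits it better
        resid = (y[:, None] - X @ W.T) ** 2           # (n, 2) squared residuals
        z = resid.argmin(axis=1)
        # Step 2: least squares on the points assigned to each component
        for k in range(2):
            mask = (z == k)
            if mask.sum() >= d:
                W[k], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    return W, z
```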
EM for Mixed Regression
Original problem: $\arg\max_{\mathbf{w}^1, \mathbf{w}^2} \sum_{i=1}^n \log \left( \sum_{z=1}^2 \mathbb{P}[z^i = z] \cdot \mathbb{P}[y^i \mid \mathbf{x}^i, z^i = z, \mathbf{w}^1, \mathbf{w}^2] \right)$
Step 1 (E step) consists of two sub-steps
Step 1.1: Assume our current model estimates are $\{\mathbf{w}^1_t, \mathbf{w}^2_t\}$
Use the current models to ascertain how likely the different values of $z^i$ are for the $i$-th data point, i.e. compute $w^i_z = \mathbb{P}[z^i = z \mid \mathbf{x}^i, y^i, \mathbf{w}^1_t, \mathbf{w}^2_t]$ for both $z = 1, 2$
Step 1.2: Use the weights $w^i_z$ to set up a new objective function
As before, assume $\mathbb{P}[z^i = z]$ constant for the sake of simplicity
$\tilde{\mathcal{L}}(\mathbf{w}^1, \mathbf{w}^2) = \sum_{i=1}^n \sum_{z=1}^2 w^i_z \log \mathbb{P}[y^i \mid \mathbf{x}^i, z^i = z, \mathbf{w}^1, \mathbf{w}^2]$
Step 2 (M step): Maximize the new objective function to get the new models
$\{\mathbf{w}^1_{t+1}, \mathbf{w}^2_{t+1}\} = \arg\max_{\mathbf{w}^1, \mathbf{w}^2} \tilde{\mathcal{L}}(\mathbf{w}^1, \mathbf{w}^2)$
Repeat!
EM for MR
1. Initialize models $\mathbf{w}^1_0, \mathbf{w}^2_0$ (for two components)
2. For $i = 1, \ldots, n$ and $z = 1, 2$:
   1. Let $\tilde{w}^i_z = \mathcal{N}(y^i; \langle \mathbf{w}^z_t, \mathbf{x}^i \rangle, 1)$
   2. Let $w^i_z = \tilde{w}^i_z / (\tilde{w}^i_1 + \tilde{w}^i_2)$ (normalize)
3. Update $\mathbf{w}^z_{t+1} = \left( \sum_{i=1}^n w^i_z\, \mathbf{x}^i (\mathbf{x}^i)^\top \right)^{-1} \left( \sum_{i=1}^n w^i_z\, y^i \mathbf{x}^i \right)$ (apply first-order optimality, i.e. weighted least squares)
4. Repeat until convergence
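And a matching sketch of EM for MR (soft weights from the E step, weighted least squares in the M step; unit noise variance, equal mixing proportions, and the function name is my own):

```python
import numpy as np

def em_mixed_regression(X, y, n_iters=50, seed=0):
    """EM for mixed regression with two linear models."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(2, d))
    for _ in range(n_iters):
        # E step: w[i, z] proportional to N(y^i; <w^z, x^i>, 1), normalized over z
        resid = (y[:, None] - X @ W.T) ** 2                        # (n, 2)
        unnorm = np.exp(-0.5 * (resid - resid.min(axis=1, keepdims=True)))
        w = unnorm / unnorm.sum(axis=1, keepdims=True)
        # M step: weighted least squares for each component
        for k in range(2):
            A = (X * w[:, k:k + 1]).T @ X                          # sum_i w_i x x^T
            b = (X * w[:, k:k + 1]).T @ y                          # sum_i w_i y x
            W[k] = np.linalg.solve(A, b)
    return W, w
```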
