Generative Models
So far, we looked at probability theory as a tool to express the belief of
an ML algorithm that the true label is such and such
Likelihood: given a model $\mathbf{w}$, it tells us $P(y \mid \mathbf{x}, \mathbf{w})$
We also looked at how to use probability theory to express our beliefs
about which models are preferred by us and which are not
Prior: this just tells us $P(\mathbf{w})$
Notice that in all of this, the data features were always considered
constant and never questioned as being random or flexible
Can we also talk about $P(\mathbf{x} \mid y)$?
Very beneficial: given a label $y$, this would allow us to generate a new $\mathbf{x}$ from the
distribution $P(\mathbf{x} \mid y)$
Can generate new cat images, new laptop designs (GANs do this very thing!)
Generative Algorithms
ML algos that can learn distributions of the form $P(\mathbf{x})$ or $P(\mathbf{x} \mid y)$ or $P(\mathbf{x}, y)$
A slightly funny bit of terminology used in machine learning
Discriminative Algorithms: those that only use $P(y \mid \mathbf{x})$ to do their stuff
Generative Algorithms: those that use $P(\mathbf{x} \mid y)$ or $P(\mathbf{x}, y)$ etc. to do their stuff
Generative Algorithms have their advantages and disadvantages
More expensive: slower train times, slower test times, larger models
An overkill: often, we need only $P(y \mid \mathbf{x})$ to make predictions – disc. algos are enough!
More frugal: can work even if we have very little training data (e.g. RecSys)
More robust: can work even if features are corrupted, e.g. some features are missing
A recent application of generative techniques (GANs etc) allows us to
Generate novel examples of a certain class of data points
A very simple generative model
Given a few feature vectors (never mind labels for now) $\mathbf{x}^1, \dots, \mathbf{x}^n \in \mathbb{R}^d$
We wish to learn a probability distribution $P(\mathbf{x})$ with support over $\mathbb{R}^d$
This distribution should capture interesting properties about the data in a way
that allows us to do things like generate similar-looking feature vectors etc
Let us try to learn a spherical Gaussian as this distribution i.e. we
wish to learn $\boldsymbol{\mu} \in \mathbb{R}^d$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, I_d)$ explains the data well
One way is to look for a $\boldsymbol{\mu}$ that achieves maximum likelihood i.e. MLE!!
As before, assume that our feature vectors were independently generated
$\hat{\boldsymbol{\mu}}_{\text{MLE}} = \arg\max_{\boldsymbol{\mu}} \sum_{i=1}^n \ln \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}, I_d)$ which, upon applying first order optimality, gives us $\hat{\boldsymbol{\mu}}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}^i$
We just learnt $\mathcal{N}(\hat{\boldsymbol{\mu}}_{\text{MLE}}, I_d)$ as our generating dist. for data features!
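A minimal sketch of this calculation, assuming NumPy and synthetic data in place of our actual feature vectors: the MLE is just the sample mean, and the learnt $\mathcal{N}(\hat{\boldsymbol{\mu}}_{\text{MLE}}, I_d)$ can then be sampled to generate new feature vectors.

import numpy as np

# Minimal sketch: MLE for the mean of a spherical Gaussian N(mu, I_d).
# Synthetic data stands in for the feature vectors x^1, ..., x^n.
rng = np.random.default_rng(0)
true_mu = np.array([2.0, -1.0])
X = rng.normal(loc=true_mu, scale=1.0, size=(500, 2))   # n x d data matrix

# First-order optimality of the log-likelihood gives the sample mean.
mu_mle = X.mean(axis=0)
print("MLE estimate of mu:", mu_mle)

# New samples can now be drawn from N(mu_mle, I_d), i.e. from the learnt generative model.
new_points = rng.normal(loc=mu_mle, scale=1.0, size=(5, 2))
print(new_points)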
Still more powerful generative model?
Suppose we are concerned that a single Gaussian cannot capture all
the variations in our data
Can we learn 2 (or more) Gaussians to represent our data instead?
Such a generative model is often called a mixture of Gaussians
The Expectation Maximization (EM) algorithm is a very powerful
technique for performing this and several other tasks
Soft clustering, learning Gaussian mixture models (GMM)
Robust learning, Mixed Regression
Also underlies more powerful variational algorithms such as VAE
Learning a Mixture of Two Gaussians
We suspect that instead of one Gaussian, two Gaussians are involved in generating our feature vectors
Let us call them $\mathcal{N}(\boldsymbol{\mu}_1, I_d)$ and $\mathcal{N}(\boldsymbol{\mu}_2, I_d)$
Each of these is called a component of this GMM
Covariance matrices, more than two components can also be incorporated
Since we are unsure which data point came from which component,
we introduce a latent variable $z^i \in \{1, 2\}$ per data point to denote this (see the short sketch below)
If someone tells us that $z^i = 1$, this means that the first Gaussian is responsible for that data point and consequently, the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_1, I_d)$. Similarly, if someone tells us that $z^i = 2$, this means that the second Gaussian is responsible for that data point and the likelihood expression is $\mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_2, I_d)$
The English word “latent” means hidden or dormant or concealed
Nice name since this variable describes something that was hidden from us
These latent variables may seem similar to the one we used in (soft) k-means
Not an accident – the connections will be clear soon!
Latent variables can be discrete or continuous
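A short sketch of the generative story above, assuming NumPy; the means and mixing weights below are made up purely for illustration: first draw the latent $z^i$, then draw $\mathbf{x}^i$ from the corresponding Gaussian.

import numpy as np

# Minimal sketch of the generative story: draw the latent z^i first, then draw x^i
# from the corresponding Gaussian. The means and mixing weights below are made up.
rng = np.random.default_rng(1)
mu = np.array([[0.0, 0.0], [5.0, 5.0]])    # mu_1, mu_2
pi = np.array([0.5, 0.5])                  # P(z = 1), P(z = 2), assumed uniform

n = 1000
z = rng.choice(2, size=n, p=pi)            # latent component assignments (hidden in practice)
X = rng.normal(loc=mu[z], scale=1.0)       # x^i ~ N(mu_{z^i}, I_d)
print(X.shape, np.bincount(z))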
MLE with Latent Variables
We wish to obtain the maximum (log) likelihood models i.e.
$(\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2) = \arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \ln P(\mathbf{x}^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$
Since we do not know the values of latent variables, force them
into the expression using the law of total probability
We did a similar thing (introduce models) in predictive posterior calculations
$\sum_{i=1}^n \ln P(\mathbf{x}^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \sum_{i=1}^n \ln\left(\sum_{z^i \in \{1,2\}} P(\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2)\right)$
Very difficult optimization problem – NP-hard in general
However, two heuristics exist which work reasonably well in practice
Also theoretically sound if data is “nice” (details in a learning theory course)
Heuristic 1: Alternating Optimization
Convert the original optimization problem
$\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \ln\left(\sum_{z^i \in \{1,2\}} P(\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2)\right)$
to a double maximization problem (assume $P(z^i)$ const)
$\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2}\ \max_{z^1, \dots, z^n} \sum_{i=1}^n \ln P(\mathbf{x}^i \mid z^i, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$
Keep alternating between step 1 and step 2 till you are tired or till the process has converged!
In several ML problems with latent vars, although the above double
optimization problem is (still) difficult, the following two problems are easy
Step 1: Fix $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ and update the latent variables to their optimal values i.e. $\hat{z}^i = \arg\max_{z \in \{1,2\}} \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_z, I_d)$
Step 2: Fix the latent variables and update $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ to their optimal values i.e. $(\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2) = \arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \ln \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_{\hat{z}^i}, I_d)$
The intuition behind reducing things to a double optimization is that it may mostly be the case that only one of the terms in the summation will dominate and, if this is the case, then approximating the sum by the largest term should be okay
The most important difference between the original and the new problem is that the original has a sum of log of sum which is very difficult to optimize whereas the new problem gets rid of this and looks simply like an MLE problem. We know how to solve MLE problems very easily!
Heuristic 1 at Work
As discussed before, we assume a mixture of two Gaussians $\mathcal{N}(\boldsymbol{\mu}_1, I_d)$ and $\mathcal{N}(\boldsymbol{\mu}_2, I_d)$
Step 1 becomes $\hat{z}^i = \arg\min_{k \in \{1,2\}} \left\| \mathbf{x}^i - \boldsymbol{\mu}_k \right\|_2$
Step 2 becomes $(\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2) = \arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \ln \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_{\hat{z}^i}, I_d)$
Thus, $\hat{\boldsymbol{\mu}}_1 = \frac{1}{n_1} \sum_{i: \hat{z}^i = 1} \mathbf{x}^i$ and $\hat{\boldsymbol{\mu}}_2 = \frac{1}{n_2} \sum_{i: \hat{z}^i = 2} \mathbf{x}^i$ where $n_k$ is the number of data points for which we have $\hat{z}^i = k$
Repeat!
Isn’t this like the k-means clustering algorithm?
Not just “like” – this is the k-means algorithm! This means that the k-means algorithm is one heuristic way to compute an MLE which is difficult to compute directly!
I have a feeling that the second heuristic will also give us something familiar!
Indeed! Notice that even here, instead of choosing just one value of the latent variables $z^i$ at each time step, we can instead use a distribution over their support $\{1, 2\}$
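A minimal sketch of Heuristic 1 on synthetic data, assuming NumPy; with identity covariances the two steps are exactly the 2-means updates above. The data-generation choices are made up for illustration.

import numpy as np

# Minimal sketch of Heuristic 1 (alternating optimization) for a mixture of two
# spherical Gaussians; with identity covariances this is exactly 2-means clustering.
def alt_opt_two_gaussians(X, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=2, replace=False)]       # initialise mu_1, mu_2
    for _ in range(n_iters):
        # Step 1: fix the means and pick the most likely component for each point,
        # which for identity covariance is simply the nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)   # (n, 2)
        z_hat = dists.argmin(axis=1)
        # Step 2: fix the assignments and set each mean to the average of its points
        # (the MLE for that component); keep the old mean if a component is empty
        for k in range(2):
            if np.any(z_hat == k):
                mu[k] = X[z_hat == k].mean(axis=0)
    return mu, z_hat

# Example usage on synthetic data from two well-separated components:
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (300, 2)), rng.normal([5, 5], 1.0, (300, 2))])
mu_hat, z_hat = alt_opt_two_gaussians(X)
print(mu_hat)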
Heuristic 2: Expectation Maximization
Original Prob: $\arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \ln\left(\sum_{z^i \in \{1,2\}} P(\mathbf{x}^i, z^i \mid \boldsymbol{\mu}_1, \boldsymbol{\mu}_2)\right)$
Step 1 (E Step) Consists of two sub-steps
Step 1.1 Assume our current model estimates are $\boldsymbol{\mu}_1^t, \boldsymbol{\mu}_2^t$
Use the current models to ascertain how likely are different values of $z^i$ for the $i$-th data point i.e. compute $q^i_k = P(z^i = k \mid \mathbf{x}^i, \boldsymbol{\mu}_1^t, \boldsymbol{\mu}_2^t)$ for both $k \in \{1, 2\}$
Step 1.2 Use weights $q^i_k$ to set up a new objective function
As before, assume $P(z^i)$ constant for sake of simplicity
$\tilde{Q}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \sum_{i=1}^n \sum_{k \in \{1,2\}} q^i_k \ln \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_k, I_d)$
Yet again, the new problem gets rid of the treacherous “sum of log of sum” terms which are difficult to optimize. The new problem instead looks simply like a weighted MLE problem with weights $q^i_k$ and we know how to solve MLE problems very easily!
Step 2 (M Step) Maximize the new obj. fn. to get new models
$(\boldsymbol{\mu}_1^{t+1}, \boldsymbol{\mu}_2^{t+1}) = \arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \tilde{Q}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \arg\max_{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2} \sum_{i=1}^n \sum_{k \in \{1,2\}} q^i_k \ln \mathcal{N}(\mathbf{x}^i; \boldsymbol{\mu}_k, I_d)$
Repeat!
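A minimal sketch of Heuristic 2 under the same assumptions as the text (identity covariances, uniform $P(z^i)$), again assuming NumPy and synthetic data; the E step computes the weights $q^i_k$ and the M step takes weighted averages.

import numpy as np

# Minimal sketch of Heuristic 2 (EM) for a mixture of two spherical Gaussians with
# identity covariances and uniform P(z), matching the assumptions in the text.
def em_two_gaussians(X, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=2, replace=False)]       # current models mu_1^t, mu_2^t
    for _ in range(n_iters):
        # E step: q^i_k = P(z^i = k | x^i, mu^t) via Bayes rule with a uniform prior on z
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (n, 2)
        logq = -0.5 * sq_dists                                            # log-likelihood up to constants
        logq -= logq.max(axis=1, keepdims=True)                           # for numerical stability
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)                                 # normalise the weights
        # M step: maximise the weighted MLE objective; for Gaussian means this is
        # just a weighted average of the data points
        mu = (q.T @ X) / q.sum(axis=0)[:, None]
    return mu, q

# Example usage on synthetic data from two well-separated components:
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (300, 2)), rng.normal([5, 5], 1.0, (300, 2))])
mu_hat, q = em_two_gaussians(X)
print(mu_hat)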
Derivation of the E Step
Jensen’s inequality tells us that $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$ for any convex function $f$. We used the fact
that $\ln$ is a concave function and so the inequality reverses, since every concave
function is the negative of a convex function
Let $\Theta$ denote the models $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ to avoid clutter. Also let $\Theta^t$ denote our
current estimate of the model
Just need to see the derivation for a single point, say the $i$-th point
$\ln P(\mathbf{x}^i \mid \Theta) = \ln \sum_{z^i \in \{1,2\}} P(\mathbf{x}^i, z^i \mid \Theta)$ (law of total probability)
$= \ln \sum_{z^i \in \{1,2\}} P(z^i \mid \mathbf{x}^i, \Theta^t) \cdot \frac{P(\mathbf{x}^i, z^i \mid \Theta)}{P(z^i \mid \mathbf{x}^i, \Theta^t)}$ (simply multiply and divide by the same term)
$\ge \sum_{z^i \in \{1,2\}} P(z^i \mid \mathbf{x}^i, \Theta^t) \ln \frac{P(\mathbf{x}^i, z^i \mid \Theta)}{P(z^i \mid \mathbf{x}^i, \Theta^t)}$ (Jensen’s inequality)
$= \sum_{k \in \{1,2\}} q^i_k \ln P(\mathbf{x}^i, z^i = k \mid \Theta) + \text{const}$ (just renaming $q^i_k = P(z^i = k \mid \mathbf{x}^i, \Theta^t)$; the denominator does not depend on $\Theta$)
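A small numeric sanity check of this bound, assuming NumPy and the same spherical-Gaussian, uniform-$P(z)$ setting; the specific point and models below are made up: the log-likelihood at any $\Theta$ should never fall below the bound built from the posterior weights under $\Theta^t$.

import numpy as np

# Small numeric check of the bound derived above: for any point x and any models Theta,
#   ln P(x | Theta) >= sum_k q_k * ln( P(x, z = k | Theta) / q_k ),
# where q_k = P(z = k | x, Theta^t) are the posterior weights under the current estimate.
# Spherical Gaussians with uniform P(z) are assumed; all numbers below are made up.
rng = np.random.default_rng(4)
x = rng.normal(size=2)
mu_t = np.array([[0.0, 0.0], [3.0, 3.0]])        # current estimate Theta^t
mu = np.array([[1.0, -1.0], [2.0, 4.0]])         # some other models Theta

def joint(x, mu_k):
    # P(x, z = k | Theta) for identity covariance and P(z = k) = 1/2
    d = len(x)
    return 0.5 * np.exp(-0.5 * np.sum((x - mu_k) ** 2)) / (2 * np.pi) ** (d / 2)

q = np.array([joint(x, mu_t[0]), joint(x, mu_t[1])])
q /= q.sum()                                     # posterior weights under Theta^t

joints = np.array([joint(x, mu[0]), joint(x, mu[1])])
log_lik = np.log(joints.sum())                   # ln P(x | Theta)
lower_bound = np.sum(q * np.log(joints / q))
print(log_lik, lower_bound, log_lik >= lower_bound)   # the bound holds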
Latent Variables to the Rescue
Once again, the information about which model generated each data point may be incomplete/latent
We could have had separate noise variances $\sigma_1$ and $\sigma_2$ for the two components as well, which we could also learn. However, this would make things more tedious so for now, let us assume $\sigma_1 = \sigma_2 = 1$ and also that $P(z^i = 1) = P(z^i = 2) = \frac{1}{2}$
As before, if we believe that our data is best explained using two
linear regression models instead of one, we should work with a
mixed model (aka mixture of experts)
Will fit two regression models to the data and use a latent variable
$z^i \in \{1, 2\}$ to keep track of which data point belongs to which model
Let us use Gaussian likelihoods since we are comfortable with them
$P(y^i \mid \mathbf{x}^i, z^i = 1, \mathbf{w}_1, \mathbf{w}_2) = \mathcal{N}(y^i; \langle \mathbf{w}_1, \mathbf{x}^i \rangle, 1)$
$P(y^i \mid \mathbf{x}^i, z^i = 2, \mathbf{w}_1, \mathbf{w}_2) = \mathcal{N}(y^i; \langle \mathbf{w}_2, \mathbf{x}^i \rangle, 1)$
Note: this is not generative learning since we are still learning
discriminative distributions of the form $P(y \mid \mathbf{x})$
Will see soon how to perform generative learning in supervised settings
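A short sketch of this mixed regression setup, assuming NumPy; the weight vectors, dimensions, and noise level are made up for illustration: a hidden $z^i$ picks which of the two linear models generates each $y^i$.

import numpy as np

# Minimal sketch of the mixed regression setup: two linear models w_1, w_2, a hidden
# z^i per point choosing the model, and unit-variance Gaussian noise as assumed above.
# The weight vectors, dimensions, and sample size are made up for illustration.
rng = np.random.default_rng(5)
n, d = 400, 3
w1, w2 = rng.normal(size=d), rng.normal(size=d)
W = np.stack([w1, w2])                          # (2, d)
X = rng.normal(size=(n, d))
z = rng.choice(2, size=n)                       # latent: which expert generated each point
y = (X * W[z]).sum(axis=1) + rng.normal(size=n) # y^i ~ N(<w_{z^i}, x^i>, 1)
print(X.shape, y.shape)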
MLE for Mixed Regression
$\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln P(y^i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2)$ which, upon introducing latent variables, gives $\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln\left(\sum_{z^i \in \{1,2\}} P(y^i, z^i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2)\right)$
Method 1: Alternating Optimization
$\arg\max_{\mathbf{w}_1, \mathbf{w}_2}\ \max_{z^1, \dots, z^n} \sum_{i=1}^n \ln P(y^i, z^i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2)$
As before, assume $P(z^i)$ constant for sake of simplicity to get
$\arg\max_{\mathbf{w}_1, \mathbf{w}_2}\ \max_{z^1, \dots, z^n} \sum_{i=1}^n \ln P(y^i \mid \mathbf{x}^i, z^i, \mathbf{w}_1, \mathbf{w}_2)$
Step 1: Fix $\mathbf{w}_1, \mathbf{w}_2$ and update all the $z^i$
Step 2: Fix all the $z^i$ and update the models $\mathbf{w}_1, \mathbf{w}_2$
Alternating Optimization for MR
As before, we assumed the likelihood distributions as $P(y^i \mid \mathbf{x}^i, z^i = k, \mathbf{w}_1, \mathbf{w}_2) = \mathcal{N}(y^i; \langle \mathbf{w}_k, \mathbf{x}^i \rangle, 1)$ for $k \in \{1, 2\}$
Step 1 becomes $\hat{z}^i = \arg\min_{k \in \{1,2\}} \left( y^i - \langle \mathbf{w}_k, \mathbf{x}^i \rangle \right)^2$
i.e. assign every data point to its “closest” line or the line which fits it better
Step 2 becomes $\hat{\mathbf{w}}_k = \arg\min_{\mathbf{w}} \sum_{i: \hat{z}^i = k} \left( y^i - \langle \mathbf{w}, \mathbf{x}^i \rangle \right)^2$ for $k \in \{1, 2\}$
i.e. perform least squares on the data points assigned to each component
May incorporate a prior as well to add a regularizer (ridge regression)
Repeat!
AltOpt for MR
1. Initialize models $\mathbf{w}_1^0, \mathbf{w}_2^0$
2. For $i \in [n]$, update the assignments using $\hat{z}^i = \arg\min_{k \in \{1,2\}} \left( y^i - \langle \mathbf{w}_k^{t-1}, \mathbf{x}^i \rangle \right)^2$
3. Update the models using $\mathbf{w}_k^t = \arg\min_{\mathbf{w}} \sum_{i: \hat{z}^i = k} \left( y^i - \langle \mathbf{w}, \mathbf{x}^i \rangle \right)^2$
4. Repeat until convergence
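A minimal sketch of AltOpt for mixed regression on synthetic two-line data, assuming NumPy; the function name and data-generation choices are made up for illustration.

import numpy as np

# Minimal sketch of AltOpt for mixed regression: assign each point to the line that
# fits it better, then refit each line by least squares on its assigned points.
def alt_opt_mixed_regression(X, y, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(2, X.shape[1]))                # initialise w_1, w_2
    for _ in range(n_iters):
        # Step 1: hard-assign each point to its "closest" line (smaller squared residual)
        residuals = (y[:, None] - X @ W.T) ** 2         # (n, 2)
        z_hat = residuals.argmin(axis=1)
        # Step 2: ordinary least squares per component (a ridge term could be added
        # here if we also wanted the regularizer mentioned above)
        for k in range(2):
            mask = z_hat == k
            if mask.sum() >= X.shape[1]:                # skip update if a component is starved
                W[k], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    return W, z_hat

# Example usage on synthetic two-line data:
rng = np.random.default_rng(6)
X = rng.normal(size=(400, 3))
z = rng.choice(2, size=400)
W_true = rng.normal(size=(2, 3))
y = (X * W_true[z]).sum(axis=1) + 0.1 * rng.normal(size=400)
W_hat, z_hat = alt_opt_mixed_regression(X, y)
print(W_true)
print(W_hat)   # should match W_true up to a possible swap of the two rows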
EM for Mixed Regression
Original Prob: $\arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \ln\left(\sum_{z^i \in \{1,2\}} P(y^i, z^i \mid \mathbf{x}^i, \mathbf{w}_1, \mathbf{w}_2)\right)$
Step 1 (E Step) Consists of two sub-steps
Step 1.1 Assume our current model estimates are $\mathbf{w}_1^t, \mathbf{w}_2^t$
Use the current models to ascertain how likely are different values of $z^i$ for the $i$-th data point i.e. compute $q^i_k = P(z^i = k \mid \mathbf{x}^i, y^i, \mathbf{w}_1^t, \mathbf{w}_2^t)$ for both $k \in \{1, 2\}$
Step 1.2 Use weights $q^i_k$ to set up a new objective function
As before, assume $P(z^i)$ constant for sake of simplicity
Step 2 (M Step) Maximize the new obj. fn. to get new models
$(\mathbf{w}_1^{t+1}, \mathbf{w}_2^{t+1}) = \arg\max_{\mathbf{w}_1, \mathbf{w}_2} \sum_{i=1}^n \sum_{k \in \{1,2\}} q^i_k \ln \mathcal{N}(y^i; \langle \mathbf{w}_k, \mathbf{x}^i \rangle, 1)$
Repeat!
EM for MR
1. Initialize models $\mathbf{w}_1^0, \mathbf{w}_2^0$ (for the two components)
2. For $i \in [n]$, compute the weights using the current models: let $\tilde{q}^i_k = \exp\left( -\tfrac{1}{2} \left( y^i - \langle \mathbf{w}_k^t, \mathbf{x}^i \rangle \right)^2 \right)$ and then $q^i_k = \tilde{q}^i_k / (\tilde{q}^i_1 + \tilde{q}^i_2)$ (normalize)
3. Update $\mathbf{w}_k^{t+1} = S_k^{-1} \mathbf{b}_k$ where $S_k = \sum_{i=1}^n q^i_k \, \mathbf{x}^i (\mathbf{x}^i)^\top$ and $\mathbf{b}_k = \sum_{i=1}^n q^i_k \, y^i \mathbf{x}^i$ (apply first order optimality)
4. Repeat until convergence
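A minimal sketch of EM for mixed regression under the same assumptions (unit-variance Gaussian likelihoods, uniform $P(z^i)$), assuming NumPy; the closed-form M step is the standard weighted least squares solution obtained from first order optimality.

import numpy as np

# Minimal sketch of EM for mixed regression with unit-variance Gaussian likelihoods
# and uniform P(z): the E step computes soft weights q^i_k per point per line, the
# M step solves a weighted least squares problem for each line.
def em_mixed_regression(X, y, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(2, X.shape[1]))                # current models w_1^t, w_2^t
    for _ in range(n_iters):
        # E step: unnormalised weights exp(-(y - <w_k, x>)^2 / 2), then normalise
        sq_res = (y[:, None] - X @ W.T) ** 2            # (n, 2)
        logq = -0.5 * sq_res
        logq -= logq.max(axis=1, keepdims=True)         # for numerical stability
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)
        # M step: weighted least squares per component,
        #   w_k <- (sum_i q^i_k x^i x^i^T)^{-1} (sum_i q^i_k y^i x^i)
        for k in range(2):
            S = (X * q[:, k:k + 1]).T @ X               # (d, d)
            b = X.T @ (q[:, k] * y)                     # (d,)
            W[k] = np.linalg.solve(S, b)
    return W, q

# Example usage with the same kind of synthetic data as before:
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
z = rng.choice(2, size=400)
W_true = rng.normal(size=(2, 3))
y = (X * W_true[z]).sum(axis=1) + 0.1 * rng.normal(size=400)
W_hat, q = em_mixed_regression(X, y)
print(W_true)
print(W_hat)   # again, correct up to a possible swap of the two components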