CS236 Hw1 Answers
\[
\arg\max_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right] \;=\; \arg\min_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right].
\]
Solution
We rely on the known property that if ψ is a strictly monotonically decreasing function, then the following two problems are equivalent:
\[
\arg\max_{\theta \in \Theta} f(\theta) \quad \text{and} \quad \arg\min_{\theta \in \Theta} \psi(f(\theta)). \tag{1}
\]
This property can be proven by contradiction, and we assume familiarity with it. It now suffices to show that there exists a strictly monotonically decreasing ψ such that
\[
\psi\!\left(\mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right]\right) = \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right]. \tag{2}
\]
Note that
\begin{align}
\mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right] &= \mathbb{E}_{\hat{p}(x)} \mathbb{E}_{\hat{p}(y \mid x)}\left[\log \hat{p}(y \mid x) - \log p_\theta(y \mid x)\right] \tag{3}\\
&= \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y \mid x)\right] - \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right]. \tag{4}
\end{align}
Since the first term does not depend on θ, it is a constant in our optimization problem, and we can simply choose the strictly monotonically decreasing function
\[
\psi(z) = \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y \mid x)\right] - z. \tag{5}
\]
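As an optional numerical sanity check (not part of the graded solution), the following sketch verifies the equivalence on a toy discrete problem; the joint distribution and the one-parameter model family below are made-up assumptions used only for illustration.

import numpy as np

# Made-up empirical joint p_hat(x, y) over x in {0, 1} and y in {0, 1}.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x

def p_theta(theta):
    """One-parameter model: p_theta(y=1|x) = sigmoid(theta * (x - 0.5))."""
    x = np.array([[0.0], [1.0]])
    p1 = 1.0 / (1.0 + np.exp(-theta * (x - 0.5)))
    return np.hstack([1.0 - p1, p1])   # rows index x, columns index y

thetas = np.linspace(-5.0, 5.0, 1001)
# E_{p_hat(x,y)}[log p_theta(y|x)] and E_{p_hat(x)}[KL(p_hat(y|x) || p_theta(y|x))] on a grid.
loglik = np.array([np.sum(p_xy * np.log(p_theta(t))) for t in thetas])
kl = np.array([np.sum(p_xy * (np.log(p_y_given_x) - np.log(p_theta(t)))) for t in thetas])

# The same theta maximizes the expected log-likelihood and minimizes the expected KL.
assert thetas[np.argmax(loglik)] == thetas[np.argmin(kl)]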
Consider a mixture of k Gaussians, where $y \in \{1, 2, \ldots, k\}$ denotes the mixture id and $x \in \mathbb{R}^n$ denotes n-dimensional real-valued points. The generative process for this mixture can be specified as:
\[
p_\theta(y) = \pi_y, \quad \text{where } \sum_{y=1}^{k} \pi_y = 1, \tag{6}
\]
\[
p_\theta(x \mid y) = \mathcal{N}\!\left(x \mid \mu_y, \sigma^2 I\right), \tag{7}
\]
where we assume a diagonal covariance structure for modeling each of the Gaussians in the mixture. Such a model is parameterized by $\theta = (\pi_1, \pi_2, \ldots, \pi_k, \mu_1, \mu_2, \ldots, \mu_k, \sigma)$, where $\pi_i \in \mathbb{R}_{++}$, $\mu_i \in \mathbb{R}^n$, and $\sigma \in \mathbb{R}_{++}$. Now consider the multi-class logistic regression model for directly predicting y from x as:
\[
p_\gamma(y \mid x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}, \tag{8}
\]
parameterized by $\gamma = (w_1, \ldots, w_k, b_1, \ldots, b_k)$. Show that $p_\theta(y \mid x)$ under the mixture model can be expressed in the form of $p_\gamma(y \mid x)$ for an appropriate choice of $\gamma$.
Solution
Note that
\begin{align}
p_\theta(y \mid x) &= \frac{p_\theta(x, y)}{p_\theta(x)} \tag{10}\\
&= \frac{\pi_y \cdot \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y)\right) \cdot Z^{-1}(\sigma)}{\sum_i \pi_i \cdot \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i)\right) \cdot Z^{-1}(\sigma)}, \tag{11}
\end{align}
where Z(σ) is the Gaussian partition function (which is a function of σ). Further algebraic manipulations show
that
\begin{align}
p_\theta(y \mid x) &= \frac{\exp\!\left(-\frac{1}{2\sigma^2}\left(x^\top x - 2x^\top \mu_y + \mu_y^\top \mu_y\right) + \ln \pi_y\right)}{\sum_i \exp\!\left(-\frac{1}{2\sigma^2}\left(x^\top x - 2x^\top \mu_i + \mu_i^\top \mu_i\right) + \ln \pi_i\right)} \tag{12}\\
&= \frac{\exp\!\left(\frac{1}{2\sigma^2}\left(2x^\top \mu_y - \mu_y^\top \mu_y\right) + \ln \pi_y\right)}{\sum_i \exp\!\left(\frac{1}{2\sigma^2}\left(2x^\top \mu_i - \mu_i^\top \mu_i\right) + \ln \pi_i\right)} \tag{13}\\
&= \frac{\exp\!\left(x^\top \frac{\mu_y}{\sigma^2} + \left(-\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y\right)\right)}{\sum_i \exp\!\left(x^\top \frac{\mu_i}{\sigma^2} + \left(-\frac{\mu_i^\top \mu_i}{2\sigma^2} + \ln \pi_i\right)\right)}. \tag{14}
\end{align}
This matches the logistic regression form (8) with $w_y = \mu_y / \sigma^2$ and $b_y = -\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y$.
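To double-check the algebra, here is a small numerical verification (illustrative only; the dimensions, mixture weights, and means below are randomly generated assumptions): the Bayes-rule posterior agrees with the softmax form (14).

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
k, n, sigma = 3, 4, 0.7                       # made-up sizes for this check
pi = rng.dirichlet(np.ones(k))                # mixture weights
mu = rng.normal(size=(k, n))                  # component means
x = rng.normal(size=n)

# Posterior p_theta(y|x) via Bayes' rule, as in (10)-(11).
joint = np.array([pi[y] * multivariate_normal.pdf(x, mean=mu[y], cov=sigma**2 * np.eye(n))
                  for y in range(k)])
posterior = joint / joint.sum()

# Softmax form (14) with w_y = mu_y / sigma^2 and b_y = -mu_y.mu_y / (2 sigma^2) + log pi_y.
w = mu / sigma**2
b = -np.sum(mu**2, axis=1) / (2 * sigma**2) + np.log(pi)
logits = x @ w.T + b
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

assert np.allclose(posterior, softmax)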
Problem 3: Conditional Independence and Parameterization (16 points)
Consider a collection of n discrete random variables $\{X_i\}_{i=1}^{n}$, where the number of outcomes for $X_i$ is $|\mathrm{val}(X_i)| = k_i$.
1. [2 points] Without any conditional independence assumptions, what is the total number of independent
parameters needed to describe the joint distribution over (X1 , . . . , Xn )?
2. [12 points] Let 1, 2, . . . , n denote the topological sort for a Bayesian network over the random variables $X_1, X_2, \ldots, X_n$. Let m be a positive integer in $\{1, 2, \ldots, n-1\}$. Suppose, for every $i > m$, the random variable $X_i$ is conditionally independent of all of its ancestors given the m previous ancestors in the topological ordering. Mathematically, we impose the independence assumptions
\[
X_i \perp \{X_1, \ldots, X_{i-m-1}\} \mid \{X_{i-m}, \ldots, X_{i-1}\}
\]
for $i > m$. For $i \le m$, we impose no conditional independence of $X_i$ with respect to its ancestors. Derive the total number of independent parameters needed to specify the joint distribution over $(X_1, \ldots, X_n)$.
3. [2 points] Under what independence assumptions is it possible to represent the joint distribution over $(X_1, \ldots, X_n)$ with $\sum_{i=1}^{n}(k_i - 1)$ total independent parameters?
Solution
1. There are $\prod_{i=1}^{n} k_i$ unique configurations. Without independence assumptions, the number of independent parameters needed is $\left(\prod_{i=1}^{n} k_i\right) - 1$.
2. The random variables $\{X_i\}_{i=1}^{m}$ are part of a complete graph and thus require $\left(\prod_{i=1}^{m} k_i\right) - 1$ parameters. When $i > m$, each random variable $X_i$ requires $(k_i - 1) \prod_{j=i-m}^{i-1} k_j$ parameters. The total is thus
\[
\sum_{i=1}^{m} (k_i - 1) \prod_{j=1}^{i-1} k_j + \sum_{i=m+1}^{n} (k_i - 1) \prod_{j=i-m}^{i-1} k_j = \left(\prod_{i=1}^{m} k_i\right) - 1 + \sum_{i=m+1}^{n} (k_i - 1) \prod_{j=i-m}^{i-1} k_j. \tag{17}
\]
3. If the distribution is fully factorized (i.e., $p(x_1, \ldots, x_n) = \prod_i p(x_i)$), then we only need $\sum_{i=1}^{n} (k_i - 1)$ independent parameters.
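As a quick check of these counts, here is a short helper (the outcome counts ks and the choice m = 2 below are made-up values for illustration):

import math

def num_params_bn(ks, m):
    """Independent parameters when, for i > m, X_i depends only on its m
    immediate predecessors in the topological order; ks[i-1] = k_i."""
    n = len(ks)
    total = math.prod(ks[:m]) - 1                                   # complete graph over X_1, ..., X_m
    for i in range(m + 1, n + 1):                                   # 1-indexed i > m
        total += (ks[i - 1] - 1) * math.prod(ks[i - m - 1:i - 1])   # (k_i - 1) * prod_{j=i-m}^{i-1} k_j
    return total

ks = [2, 3, 2, 4, 3]             # made-up outcome counts k_1, ..., k_n
print(math.prod(ks) - 1)         # part 1: no independence assumptions
print(num_params_bn(ks, m=2))    # part 2: formula (17)
print(sum(k - 1 for k in ks))    # part 3: fully factorized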
Consider modeling a joint distribution over real-valued variables $(x_1, \ldots, x_n)$ autoregressively in the forward order,
\[
p_f(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_f(x_i \mid x_{<i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})\right),
\]
using neural networks $\{\mu_i\}_{i=1}^{n}$ and $\{\sigma_i\}_{i=1}^{n}$ that can represent any function $\mu_i : \mathbb{R}^{i-1} \to \mathbb{R}$ and $\sigma_i : \mathbb{R}^{i-1} \to \mathbb{R}_{++}$. Your friend chooses to factor the model in the reverse order using equally powerful neural networks $\{\hat{\mu}_i\}_{i=1}^{n}$ and $\{\hat{\sigma}_i\}_{i=1}^{n}$ that can represent any function $\hat{\mu}_i : \mathbb{R}^{n-i} \to \mathbb{R}$ and $\hat{\sigma}_i : \mathbb{R}^{n-i} \to \mathbb{R}_{++}$:
\[
p_r(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_r(x_i \mid x_{>i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \hat{\mu}_i(x_{>i}), \hat{\sigma}_i^2(x_{>i})\right). \tag{20}
\]
Do these models cover the same hypothesis space of distributions? In other words, given any choice of $\{\mu_i, \sigma_i\}_{i=1}^{n}$, does there always exist a choice of $\{\hat{\mu}_i, \hat{\sigma}_i\}_{i=1}^{n}$ such that $p_f = p_r$? If yes, provide a proof. Else, provide a counterexample.
[Hint: Consider the case where n = 2.]
Solution
They do not cover the same hypothesis space. To see why, consider the simple case of describing a joint distribution over $(X_1, X_2)$ using the forward versus reverse factorizations. Consider the forward factorization where
\[
p_f(x_1, x_2) = p_f(x_1)\, p_f(x_2 \mid x_1) = \mathcal{N}\!\left(x_1 \mid 0, \sigma_1^2\right) \mathcal{N}\!\left(x_2 \mid \mu_2(x_1), \epsilon^2\right)
\]
for some fixed $\epsilon > 0$, and for which
\[
\mu_2(x_1) = \begin{cases} 0 & \text{if } x_1 \le 0 \\ 1 & \text{otherwise.} \end{cases} \tag{24}
\]
(*) This construction makes $p_f(x_2)$ a mixture of two distinct Gaussians, which $p_r(x_2)$ cannot match, since $p_r(x_2)$ is strictly Gaussian. Any counterexample of this form, which makes $p_f(x_2)$ non-Gaussian, suffices for full credit.
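To visualize (*), one can simulate the forward model above and inspect the marginal of $x_2$; in the sketch below, $\sigma_1 = 1$ and $\epsilon = 0.1$ are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = 0.1                                    # arbitrary small epsilon
x1 = rng.normal(size=100_000)                # x1 ~ N(0, 1), i.e., sigma_1 = 1
x2 = np.where(x1 <= 0, 0.0, 1.0) + eps * rng.normal(size=x1.size)   # x2 | x1 ~ N(mu2(x1), eps^2)

# p_f(x2) puts roughly half its mass near 0 and half near 1 ...
print(np.mean(x2 < 0.5))                     # ~0.5
# ... and essentially no mass near its overall mean 0.5, whereas a single Gaussian with
# the same mean and variance (the only shape p_r(x2) can take) puts substantial mass there.
print(np.mean(np.abs(x2 - 0.5) < 0.1))       # ~0.0
fit = norm(loc=x2.mean(), scale=x2.std())
print(fit.cdf(0.6) - fit.cdf(0.4))           # ~0.15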
(**) Interestingly, we can also build intuition about the distribution $p_f(x_1 \mid x_2)$. If one chooses a very small positive $\epsilon$, then the corresponding $p_f(x_1 \mid x_2)$ will approach a truncated Gaussian distribution, which cannot be approximated by the Gaussian $p_r(x_1 \mid x_2)$.
Optionally, we can prove (*) and a variant of (**), which states that, for any $\epsilon > 0$, the distribution $p_f(x_1 \mid x_2)$ is an unevenly-weighted mixture of two truncated Gaussians. Throughout, let $N_i(x_2) \triangleq \mathcal{N}(x_2 \mid i, \epsilon^2)$ for $i \in \{0, 1\}$, and recall that
\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}. \tag{25}
\]
Proof of (*). Marginalizing over $x_1$ and using $\Pr(X_1 \le 0) = \Pr(X_1 > 0) = \tfrac{1}{2}$ gives
\[
p_f(x_2) = \int p_f(x_1)\, p_f(x_2 \mid x_1)\, dx_1 = \tfrac{1}{2} N_0(x_2) + \tfrac{1}{2} N_1(x_2),
\]
an evenly-weighted mixture of two distinct Gaussians, which is not Gaussian.
Proof of (**) variant. The numerator is simply
\[
p_f(x_1, x_2) = \begin{cases} p_f(x_1)\, N_0(x_2) & \text{if } x_1 \le 0 \\ p_f(x_1)\, N_1(x_2) & \text{if } x_1 > 0. \end{cases} \tag{29}
\]
Combining the numerator and denominator thus yields
\[
p_f(x_1 \mid x_2) = \begin{cases} p_f(x_1) \cdot \dfrac{2 N_0(x_2)}{N_0(x_2) + N_1(x_2)} & \text{if } x_1 \le 0 \\[2ex] p_f(x_1) \cdot \dfrac{2 N_1(x_2)}{N_0(x_2) + N_1(x_2)} & \text{if } x_1 > 0, \end{cases} \tag{30}
\]
where $p_f(x_1)$ is multiplied by the weighting term
\[
v_i = \frac{2 N_i(x_2)}{N_0(x_2) + N_1(x_2)}. \tag{31}
\]
Note that $v_i / 2$ can be interpreted as the posterior probability of the $i$th Gaussian mixture component when $x_2$ is observed. For any choice of $x_2 \ne 0.5$, note that $v_1 \ne v_0$. Thus, when $x_2 \ne 0.5$, $p_f(x_1 \mid x_2)$ will exhibit a sudden density transition when $x_1$ crosses 0. One should be able to see that $p_f(x_1 \mid x_2)$ is an unevenly-weighted mixture of two truncated Gaussian distributions, which $p_r(x_1 \mid x_2)$ cannot match. Furthermore, for $x_2 = 1$, as $\epsilon \to 0$ the posterior weights $(v_0/2, v_1/2)$ approach $(0, 1)$, which in turn causes $p_f(x_1 \mid x_2 = 1)$ to approach a truncated Gaussian.
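The behavior of the weights in (31) can also be checked numerically (again assuming the construction above; the particular values of $x_2$ and $\epsilon$ below are arbitrary):

import numpy as np
from scipy.stats import norm

def weights(x2, eps):
    # v_0 and v_1 from (31), with N_i(x2) = N(x2 | i, eps^2).
    n0 = norm.pdf(x2, loc=0.0, scale=eps)
    n1 = norm.pdf(x2, loc=1.0, scale=eps)
    return 2 * n0 / (n0 + n1), 2 * n1 / (n0 + n1)

print(weights(0.5, eps=0.3))     # (1.0, 1.0): no density jump at x1 = 0
print(weights(0.6, eps=0.3))     # unequal weights -> discontinuity at x1 = 0
for eps in [0.3, 0.1, 0.05]:     # as eps -> 0 with x2 = 1, (v0/2, v1/2) -> (0, 1)
    v0, v1 = weights(1.0, eps)
    print(v0 / 2, v1 / 2)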
Consider the marginal likelihood $p(x) = \int p(z)\, p(x \mid z)\, dz$ of a latent variable model. When z is high-dimensional, evaluating this integral is computationally intractable even if we can tractably evaluate the prior and the conditional likelihood for any given x and z. We can, however, use Monte Carlo to estimate the integral. To do so, we draw k samples from the prior p(z), and our estimate is given as:
\[
A(z^{(1)}, \ldots, z^{(k)}) = \frac{1}{k} \sum_{i=1}^{k} p(x \mid z^{(i)}), \quad \text{where } z^{(i)} \sim p(z). \tag{33}
\]
1. [5 points] An estimator θ̂ is an unbiased estimator of θ if and only if E[θ̂] = θ. Show that A is an unbiased
estimator of p(x).
2. [5 points] Is log A an unbiased estimator of log p(x)? Explain why or why not.
Solution
The estimator A is unbiased since
\begin{align}
\mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[A(z^{(1)}, \ldots, z^{(k)})\right] &= \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{z^{(i)}}\left[p(x \mid z^{(i)})\right] \tag{34}\\
&= \mathbb{E}_{p(z)}\left[p(x \mid z)\right] \tag{35}\\
&= \int p(z)\, p(x \mid z)\, dz \tag{36}\\
&= p(x). \tag{37}
\end{align}
The estimator log A is not guaranteed to be unbiased since, by Jensen’s inequality,
\begin{align}
\mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[\log A(z^{(1)}, \ldots, z^{(k)})\right] &\le \log \mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[A(z^{(1)}, \ldots, z^{(k)})\right] \tag{38}\\
&= \log p(x). \tag{39}
\end{align}
Note that since log is strictly concave, equality holds if and only if the random variable A is deterministic.
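The following sketch illustrates both claims on a made-up one-dimensional latent-variable model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, chosen only because its marginal is available in closed form as $p(x) = \mathcal{N}(x \mid 0, 2)$:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, k, trials = 1.5, 10, 20_000

# Monte Carlo estimator A from (33): average of p(x|z^(i)) over k prior samples.
z = rng.normal(size=(trials, k))                     # z^(i) ~ p(z) = N(0, 1)
A = norm.pdf(x, loc=z, scale=1.0).mean(axis=1)       # p(x|z) = N(x | z, 1)

true_px = norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))   # closed-form marginal for this toy model
print(A.mean(), true_px)                             # E[A] ~= p(x): unbiased
print(np.log(A).mean(), np.log(true_px))             # E[log A] < log p(x): biased downward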
[Figure 1 diagram: for each position $i = 0, \ldots, T$, the token index $x_i$ is mapped to a 64-dim embedding $e_{x_i}$, a 128-dim LSTM output $h_i$, a 657-dim logit vector $l_i$, and a 657-dim probability vector $p_i$.]
Figure 1: The architecture of our model. T is the sequence length of a given input. $x_i$ is the token index. $e_{x_i}$ is the trainable embedding of token $x_i$. $h_i$ is the output of the LSTM stack. $l_i$ is the logit vector and $p_i$ is the probability vector. Nodes in gray (please view in color) contain trainable parameters.
There are a total of 657 different characters in NIPS 2015 papers, including alphanumeric characters as well as many non-ASCII symbols. During training, we first convert each character to a number in the range 0 to 656. Then, for each number, we use a 64-dimensional trainable vector as its embedding. The embeddings are fed into a four-layer LSTM network, where each layer contains 128 units. The output vectors of the LSTM network are finally passed through a fully-connected layer to form a 657-way softmax representing the probability distribution of the next token. See Figure 1 for an illustration.
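For concreteness, the described architecture could be written roughly as the following PyTorch module; this is a sketch based on the description above, not the actual starter code, and the class name, layer names, and forward signature are assumptions.

import torch.nn as nn

class CharLSTM(nn.Module):
    # Sketch of the Figure 1 model: 64-dim embeddings, a 4-layer LSTM with 128
    # units per layer, and a 657-way output layer.
    def __init__(self, vocab_size=657, embed_dim=64, hidden_dim=128, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h_prev=None):
        # x: (seq_len, batch) integer token indices.
        emb = self.embedding(x)            # (seq_len, batch, 64)
        out, h = self.lstm(emb, h_prev)    # (seq_len, batch, 128)
        logits = self.fc(out)              # (seq_len, batch, 657)
        return logits, h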
Training such models can be computationally expensive, requiring specialized GPU hardware. In this particular assignment, we provide a pretrained generative model. After loading this pretrained model into PyTorch, you are expected to implement and answer the following questions.
1. [4 points] Suppose we wish to find an efficient bit representation for the 657 characters. That is, every
character is represented as (a1 , a2 , · · · , an ), where ai ∈ {0, 1}, ∀i = 1, 2, · · · , n. What is the minimal n
that we can use?
Solution: 10, since $2^9 = 512 < 657 \le 1024 = 2^{10}$, so the minimal n is $\lceil \log_2 657 \rceil = 10$.
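A one-line check of this count:

import math
print(math.ceil(math.log2(657)))   # 10, since 2**9 = 512 < 657 <= 1024 = 2**10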
2. [6 points] If the size of vocabulary increases from 657 to 900, what is the increase in the number of
parameters? [Hint: The number of parameters in the LSTM module in Fig. 1 does not change.]
Solution:
\[
\underbrace{(900 - 657) \times 64}_{\text{embeddings}} + \underbrace{(900 - 657) \times 128 + \underbrace{900 - 657}_{\text{bias}}}_{\text{fully-connected layer}} = 46899.
\]
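A quick arithmetic check of the count above:

delta = 900 - 657
print(delta * 64 + delta * 128 + delta)   # 46899 = new embedding rows + new fc weights + new fc biases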
Note: For the following questions, you will need to complete the starter code in designated areas. After the code is completed, run main.py to produce the related files for submission. Run the script ./make_submission.sh to generate hw1.zip and upload it to GradeScope.
3. [10 points] In the starter code, complete the method sample in model.py to generate 5 paragraphs each
of length 1000 from this model.
²Neural Information Processing Systems (NIPS) is a top machine learning conference.
Solutions:
Code:
def sample(self, seq_len):
    """
    Sample a string of length `seq_len` from the model.

    :param seq_len [int]: String length
    :return [list]: A list of length `seq_len` that contains the index of each generated character.
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        texts = []
        # Draw the first input character from the empirical character frequencies.
        x = np.random.choice(voc_freq.shape[0], 1, p=voc_freq)[None, :]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # TODO: Complete the code here.
        for i in range(seq_len):
            logits, h_prev = self.forward(x, h_prev)
            np_logits = logits[-1, :].to('cpu').numpy()
            # Softmax over the logits at the last time step.
            probs = np.exp(np_logits) / np.sum(np.exp(np_logits))
            ix = np.random.choice(np.arange(self.vocab_size), p=probs.ravel())
            x = torch.tensor(ix, dtype=torch.int64)[None, None].to(self.device)
            texts.append(ix)
    return texts
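Assuming the dataset object exposes the character-to-index mapping char2idx used later in this assignment (the inverse mapping idx2char below is constructed by hand and is hypothetical), the sampled indices can be decoded back into text as follows:

# Hypothetical usage: decode 5 samples of length 1000 back into strings.
idx2char = {i: c for c, i in dataset.char2idx.items()}   # invert the provided mapping
paragraphs = ["".join(idx2char[i] for i in rnn.sample(seq_len=1000)) for _ in range(5)]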
4. [10 points] Complete the method compute_prob in model.py to compute the log-likelihoods for each
string. Plot a separate histogram of the log-likelihoods of strings within each file.
[Figure 2: histograms of the log-likelihoods of the strings in each file.]
Solutions:
Code:
def compute_prob(self, strings):
    """
    Compute the probability for each string in `strings`.

    :param strings [np.ndarray]: an integer array of length N
    :return [float]: the log-likelihood
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        x = strings[None, 0, None]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # The first character is scored under the empirical character frequencies.
        ll = np.log(voc_freq[strings[0]])
        # TODO: Complete the code here
        for i in range(len(strings) - 1):
            logits, h_prev = self.forward(x, h_prev)
            log_softmax = F.log_softmax(logits, dim=1)
            # Accumulate the log-probability of the next observed character.
            ll += log_softmax[-1, strings[i + 1]].item()
            x = strings[None, i + 1, None]
            x = torch.from_numpy(x).type(torch.int64).to(self.device)
    return ll
Solutions:
Code:
# Classify each snippet by thresholding its log-likelihood under the model.
for snippet in snippets:
    ll = rnn.compute_prob(np.asarray([dataset.char2idx[c] for c in snippet]))
    if ll < -600:
        lbls.append(0)
    elif ll < -200:
        lbls.append(1)
    else:
        lbls.append(2)