
CS 236 Homework 1 Solutions

Instructors: Stefano Ermon and Aditya Grover


{ermon,adityag}@cs.stanford.edu

Available: 10/01/2018; Due: 23:59 PST, 10/15/2018

Problem 1: Maximum Likelihood Estimation and KL Divergence (10 points)


Let p̂(x, y) denote the empirical data distribution over a space of inputs x ∈ X and outputs y ∈ Y. For
example, in an image recognition task, x can be an image and y can be whether the image contains a cat or not.
Let pθ (y|x) be a probabilistic classifier parameterized by θ, e.g., a logistic regression classifier with coefficients
θ. Show that the following equivalence holds:

$$\arg\max_{\theta\in\Theta} \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y|x)\right] = \arg\min_{\theta\in\Theta} \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y|x)\,\|\,p_\theta(y|x)\right)\right].$$

where $D_{\mathrm{KL}}$ denotes the KL divergence:

$$D_{\mathrm{KL}}\left(p(x)\,\|\,q(x)\right) = \mathbb{E}_{x\sim p(x)}\left[\log p(x) - \log q(x)\right].$$

Solution
We rely on the known property that if ψ is a strictly monotonically decreasing function, then the following two problems are equivalent:

$$\max_{\theta} f(\theta) \;\equiv\; \min_{\theta} \psi(f(\theta)). \tag{1}$$

This property can be proven by contradiction, and we assume familiarity with it. It therefore suffices to show that there exists a strictly monotonically decreasing ψ such that

$$\psi\left(\mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y|x)\right]\right) = \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y|x)\,\|\,p_\theta(y|x)\right)\right]. \tag{2}$$

Note that

$$\mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y|x)\,\|\,p_\theta(y|x)\right)\right] = \mathbb{E}_{\hat{p}(x)}\,\mathbb{E}_{\hat{p}(y|x)}\left[\log \hat{p}(y|x) - \log p_\theta(y|x)\right] \tag{3}$$
$$= \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y|x)\right] - \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y|x)\right]. \tag{4}$$

Since the first term is a constant in our optimization problem (it does not depend on θ), we simply choose the strictly monotonically decreasing function

$$\psi(z) = \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y|x)\right] - z. \tag{5}$$
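
As an optional numerical check (not part of the original solution), the following NumPy sketch builds a small empirical distribution over binary x and y, sweeps a one-parameter logistic model pθ(y = 1|x) = sigmoid(θx + 0.1) over a grid, and confirms that the θ maximizing the expected log-likelihood also minimizes the expected conditional KL. The toy distribution and all variable names are illustrative assumptions.

import numpy as np

# Toy empirical joint p_hat(x, y) over x in {0, 1} (rows) and y in {0, 1} (columns).
p_hat = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
p_x = p_hat.sum(axis=1)                # p_hat(x)
p_y_given_x = p_hat / p_x[:, None]     # p_hat(y | x)

def model(theta):
    """Illustrative one-parameter classifier: p_theta(y=1 | x) = sigmoid(theta * x + 0.1)."""
    p1 = 1.0 / (1.0 + np.exp(-(theta * np.array([0.0, 1.0]) + 0.1)))
    return np.stack([1.0 - p1, p1], axis=1)   # rows index x, columns index y

thetas = np.linspace(-5.0, 5.0, 2001)
loglik = []   # E_{p_hat(x, y)}[log p_theta(y | x)]
kl = []       # E_{p_hat(x)}[KL(p_hat(y | x) || p_theta(y | x))]
for theta in thetas:
    q = model(theta)
    loglik.append(np.sum(p_hat * np.log(q)))
    kl.append(np.sum(p_x[:, None] * p_y_given_x * (np.log(p_y_given_x) - np.log(q))))

# The two objectives differ by a constant in theta, so their optima coincide on the grid.
assert np.argmax(loglik) == np.argmin(kl)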

Problem 2: Logistic Regression and Naive Bayes (12 points)


A mixture of k Gaussians specifies a joint distribution pθ(x, y), where y ∈ {1, . . . , k} signifies the mixture id and x ∈ R^n denotes n-dimensional real-valued points. The generative process for this mixture can be specified as:

$$p_\theta(y) = \pi_y, \quad \text{where } \sum_{y=1}^{k} \pi_y = 1, \tag{6}$$
$$p_\theta(x|y) = \mathcal{N}(x \,|\, \mu_y, \sigma^2 I), \tag{7}$$

where we assume a diagonal covariance structure for modeling each of the Gaussians in the mixture. Such a model is parameterized by θ = (π1, π2, . . . , πk, µ1, µ2, . . . , µk, σ), where πi ∈ R++, µi ∈ R^n, and σ ∈ R++. Now consider the multi-class logistic regression model for directly predicting y from x:

$$p_\gamma(y|x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}, \tag{8}$$

parameterized by vectors γ = {w1, w2, . . . , wk, b1, b2, . . . , bk}, where wi ∈ R^n and bi ∈ R.


Show that for any choice of θ, there exists γ such that

$$p_\theta(y|x) = p_\gamma(y|x). \tag{9}$$

Solution
Note that

$$p_\theta(y|x) = \frac{p_\theta(x, y)}{p_\theta(x)} \tag{10}$$
$$= \frac{\pi_y \cdot \exp\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y)\right) \cdot Z^{-1}(\sigma)}{\sum_i \pi_i \cdot \exp\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i)\right) \cdot Z^{-1}(\sigma)}, \tag{11}$$

where Z(σ) is the Gaussian partition function (which is a function of σ). Further algebraic manipulation shows that

$$p_\theta(y|x) = \frac{\exp\left(-\frac{1}{2\sigma^2}(x^\top x - 2x^\top \mu_y + \mu_y^\top \mu_y) + \ln \pi_y\right)}{\sum_i \exp\left(-\frac{1}{2\sigma^2}(x^\top x - 2x^\top \mu_i + \mu_i^\top \mu_i) + \ln \pi_i\right)} \tag{12}$$
$$= \frac{\exp\left(\frac{1}{2\sigma^2}(2x^\top \mu_y - \mu_y^\top \mu_y) + \ln \pi_y\right)}{\sum_i \exp\left(\frac{1}{2\sigma^2}(2x^\top \mu_i - \mu_i^\top \mu_i) + \ln \pi_i\right)} \tag{13}$$
$$= \frac{\exp\left(x^\top \frac{\mu_y}{\sigma^2} + \left(-\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y\right)\right)}{\sum_i \exp\left(x^\top \frac{\mu_i}{\sigma^2} + \left(-\frac{\mu_i^\top \mu_i}{2\sigma^2} + \ln \pi_i\right)\right)}. \tag{14}$$

Thus, when θ = (σ, π, µ1, . . . , µk), simply set

$$w_y = \frac{\mu_y}{\sigma^2} + \alpha, \tag{15}$$
$$b_y = -\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y + \beta, \tag{16}$$

where α and β are allowed to be any constants (with respect to y).
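
As an optional numerical check (not part of the original solution; all variable names are illustrative), the NumPy sketch below draws a random mixture θ, computes pθ(y|x) directly via Bayes' rule, and compares it against the softmax form with w_y = µ_y/σ² and b_y = −µ_yᵀµ_y/(2σ²) + ln π_y (taking α = 0 and β = 0).

import numpy as np

rng = np.random.default_rng(0)
k, n, sigma = 3, 4, 0.7

# Random mixture parameters theta = (pi_1..pi_k, mu_1..mu_k, sigma).
pi = rng.dirichlet(np.ones(k))
mu = rng.normal(size=(k, n))
x = rng.normal(size=n)

# Posterior via Bayes' rule: p_theta(y|x) proportional to pi_y * N(x | mu_y, sigma^2 I).
log_joint = np.log(pi) - np.sum((x - mu) ** 2, axis=1) / (2 * sigma ** 2)
posterior = np.exp(log_joint - log_joint.max())
posterior /= posterior.sum()

# Softmax form of Eq. (8) with w_y = mu_y / sigma^2 and b_y = -mu_y^T mu_y / (2 sigma^2) + ln pi_y.
w = mu / sigma ** 2
b = -np.sum(mu ** 2, axis=1) / (2 * sigma ** 2) + np.log(pi)
logits = x @ w.T + b
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

assert np.allclose(posterior, softmax)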

Problem 3: Conditional Independence and Parameterization (16 points)
Consider a collection of n discrete random variables {Xi} for i = 1, . . . , n, where the number of outcomes for Xi is |val(Xi)| = ki.

1. [2 points] Without any conditional independence assumptions, what is the total number of independent
parameters needed to describe the joint distribution over (X1 , . . . , Xn )?
2. [12 points] Let 1, 2, . . . , n denote the topological sort for a Bayesian network over the random variables X1, X2, . . . , Xn. Let m be a positive integer in {1, 2, . . . , n − 1}. Suppose that, for every i > m, the random variable Xi is conditionally independent of all of its earlier ancestors given the m immediately preceding variables in the topological ordering. Mathematically, we impose the independence assumptions

$$p(X_i \mid X_{i-1}, X_{i-2}, \dots, X_2, X_1) = p(X_i \mid X_{i-1}, X_{i-2}, \dots, X_{i-m})$$

for i > m. For i ≤ m, we impose no conditional independence of Xi with respect to its ancestors. Derive the total number of independent parameters needed to specify the joint distribution over (X1, . . . , Xn).
3. [2 points] Under what independence assumptions is it possible to represent the joint distribution over (X1, . . . , Xn) with $\sum_{i=1}^{n}(k_i - 1)$ total independent parameters?

Solution

1. There are $\prod_{i=1}^{n} k_i$ unique configurations. Without independence assumptions, the number of independent parameters needed is $\left(\prod_{i=1}^{n} k_i\right) - 1$.
2. The random variables X1, . . . , Xm form a complete (fully connected) subgraph and thus require $\left(\prod_{i=1}^{m} k_i\right) - 1$ parameters. For each i > m, the conditional $p(X_i \mid X_{i-1}, \dots, X_{i-m})$ requires $(k_i - 1)\prod_{j=i-m}^{i-1} k_j$ parameters. The total is thus

$$\sum_{i=1}^{m}(k_i - 1)\prod_{j=1}^{i-1} k_j + \sum_{i=m+1}^{n}(k_i - 1)\prod_{j=i-m}^{i-1} k_j = \left(\prod_{i=1}^{m} k_i\right) - 1 + \sum_{i=m+1}^{n}(k_i - 1)\prod_{j=i-m}^{i-1} k_j. \tag{17}$$

3. If the distribution is fully factorized (i.e., $p(x_1, \dots, x_n) = \prod_i p(x_i)$), then we only need $\sum_{i=1}^{n}(k_i - 1)$ independent parameters.
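
The sketch below (not part of the original solution; the function name and example values are illustrative) computes the count in Eq. (17) for a small example and checks the telescoping identity used for the first m variables, namely $\sum_{i=1}^{m}(k_i - 1)\prod_{j=1}^{i-1} k_j = \prod_{i=1}^{m} k_i - 1$.

from math import prod

def num_params(ks, m):
    """Independent parameters for the Bayesian network of part 2.

    ks[i - 1] is the number of outcomes k_i; X_1..X_m form a fully connected block,
    and each X_i with i > m depends on its m immediate predecessors.
    """
    n = len(ks)
    full_block = prod(ks[:m]) - 1
    tail = sum((ks[i - 1] - 1) * prod(ks[i - m - 1:i - 1])   # (k_i - 1) * prod_{j=i-m}^{i-1} k_j
               for i in range(m + 1, n + 1))
    return full_block + tail

ks, m = [2, 3, 2, 4, 3], 2
# Telescoping identity for the first m variables.
assert sum((ks[i - 1] - 1) * prod(ks[:i - 1]) for i in range(1, m + 1)) == prod(ks[:m]) - 1
print(num_params(ks, m))  # 5 + 1*(2*3) + 3*(3*2) + 2*(2*4) = 45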

Problem 4: Autoregressive Models (12 points)


Consider a set of n univariate continuous real-valued random variables (X1, . . . , Xn). You have access to powerful neural networks {µi} and {σi}, i = 1, . . . , n, that can represent any function µi : R^{i−1} → R and σi : R^{i−1} → R++. We shall, for notational simplicity, define R^0 = {0}. You choose to build the following Gaussian autoregressive model in the forward direction:

$$p_f(x_1, \dots, x_n) = \prod_{i=1}^{n} p_f(x_i \mid x_{<i}) = \prod_{i=1}^{n} \mathcal{N}\left(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})\right), \tag{18}$$

where $x_{<i}$ denotes

$$x_{<i} = \begin{cases} (x_1, \dots, x_{i-1})^\top & \text{if } i > 1 \\ 0 & \text{if } i = 1. \end{cases} \tag{19}$$

Your friend chooses to factor the model in the reverse order using equally powerful neural networks {µ̂i} and {σ̂i}, i = 1, . . . , n, that can represent any function µ̂i : R^{n−i} → R and σ̂i : R^{n−i} → R++:

$$p_r(x_1, \dots, x_n) = \prod_{i=1}^{n} p_r(x_i \mid x_{>i}) = \prod_{i=1}^{n} \mathcal{N}\left(x_i \mid \hat{\mu}_i(x_{>i}), \hat{\sigma}_i^2(x_{>i})\right), \tag{20}$$

where $x_{>i}$ denotes

$$x_{>i} = \begin{cases} (x_{i+1}, \dots, x_n)^\top & \text{if } i < n \\ 0 & \text{if } i = n. \end{cases} \tag{21}$$

Do these models cover the same hypothesis space of distributions? In other words, given any choice of {µi, σi}, does there always exist a choice of {µ̂i, σ̂i} such that pf = pr? If yes, provide a proof. Else, provide a counterexample.
[Hint: Consider the case where n = 2.]

Solution
They do not cover the same hypothesis space. To see why, consider the simple case of describing a joint distribution over (X1, X2) using the forward versus reverse factorizations. Consider the forward factorization where

$$p_f(x_1) = \mathcal{N}(x_1 \mid 0, 1) \tag{22}$$
$$p_f(x_2 \mid x_1) = \mathcal{N}(x_2 \mid \mu_2(x_1), \epsilon) \tag{23}$$

for some fixed variance ε > 0, and where

$$\mu_2(x_1) = \begin{cases} 0 & \text{if } x_1 \le 0 \\ 1 & \text{otherwise.} \end{cases} \tag{24}$$

(*) This construction makes pf(x2) a mixture of two distinct Gaussians, which pr(x2) cannot match, since pr(x2) is by construction a single Gaussian. Any counterexample of this form, which makes pf(x2) non-Gaussian, suffices for full credit.

(**) Interestingly, we can also reason about the distribution pf(x1|x2). If one chooses a very small positive ε, then the corresponding pf(x1|x2) will approach a truncated Gaussian distribution, which cannot be approximated by the Gaussian pr(x1|x2).¹

Optionally, we can prove (*) and a variant of (**) which states that, for any ε > 0, the distribution

$$p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)} \tag{25}$$

is a mixture of truncated Gaussians whose mixture weights depend on ε.


Proof of (*). We exploit the fact that µ2 is a step function by noting that

$$p_f(x_2) = \int_{-\infty}^{\infty} p_f(x_1, x_2)\, dx_1 \tag{26}$$
$$= \int_{-\infty}^{0} p_f(x_1)\,\mathcal{N}(x_2 \mid 0, \epsilon)\, dx_1 + \int_{0}^{\infty} p_f(x_1)\,\mathcal{N}(x_2 \mid 1, \epsilon)\, dx_1 \tag{27}$$
$$= \frac{1}{2}\left(\mathcal{N}_0(x_2) + \mathcal{N}_1(x_2)\right). \tag{28}$$

For notational simplicity, we introduce in Eq. (28) the shorthand Nµ(x2) := N(x2 | µ, ε). The use of a step function for µ2 thus partitions the space of x1 so that the marginal distribution of x2 is a mixture of two Gaussians.

¹This observation will be useful when we move on to variational autoencoders p(z, x) (where z is a latent variable) and discuss the importance of having good variational approximations of the true posterior p(z|x).

Proof of (**) variant. The numerator is simply

$$p_f(x_1, x_2) = \begin{cases} p_f(x_1)\,\mathcal{N}_0(x_2) & \text{if } x_1 \le 0 \\ p_f(x_1)\,\mathcal{N}_1(x_2) & \text{if } x_1 > 0. \end{cases} \tag{29}$$

Combining the numerator and denominator thus yields

$$p_f(x_1 \mid x_2) = \begin{cases} p_f(x_1) \cdot \dfrac{2\,\mathcal{N}_0(x_2)}{\mathcal{N}_0(x_2) + \mathcal{N}_1(x_2)} & \text{if } x_1 \le 0 \\[2ex] p_f(x_1) \cdot \dfrac{2\,\mathcal{N}_1(x_2)}{\mathcal{N}_0(x_2) + \mathcal{N}_1(x_2)} & \text{if } x_1 > 0, \end{cases} \tag{30}$$

where pf(x1) is multiplied by the weighting term

$$v_i = \frac{2\,\mathcal{N}(x_2 \mid i, \epsilon)}{\mathcal{N}(x_2 \mid 0, \epsilon) + \mathcal{N}(x_2 \mid 1, \epsilon)}. \tag{31}$$

Note that vi/2 can be interpreted as the posterior probability of the i-th Gaussian mixture component when x2 is observed. For any choice of x2 ≠ 0.5, note that v1 ≠ v0. Thus, when x2 ≠ 0.5, pf(x1|x2) experiences a sudden density transition when x1 crosses 0. One should be able to see that pf(x1|x2) is an unevenly-weighted mixture of two truncated Gaussian distributions, which pr(x1|x2) cannot match. Furthermore, as ε → 0, the posterior weights (v0/2, v1/2) approach (0, 1), which in turn causes pf(x1|x2 = 1) to approach a truncated Gaussian.
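
A quick simulation (not part of the original solution; purely illustrative) makes the counterexample concrete: sampling from the forward model with a small ε produces a clearly bimodal marginal for x2, which no single Gaussian pr(x2) can match.

import numpy as np

rng = np.random.default_rng(0)
eps = 0.01            # small conditional variance
n_samples = 100_000

# Forward model: x1 ~ N(0, 1), then x2 | x1 ~ N(mu2(x1), eps) with a step-function mean.
x1 = rng.normal(0.0, 1.0, size=n_samples)
mu2 = np.where(x1 <= 0, 0.0, 1.0)
x2 = rng.normal(mu2, np.sqrt(eps))

# Any Gaussian has excess kurtosis 0; this bimodal marginal is strongly platykurtic.
std = x2.std()
excess_kurtosis = np.mean((x2 - x2.mean()) ** 4) / std ** 4 - 3.0
print(f"mean={x2.mean():.3f}, std={std:.3f}, excess kurtosis={excess_kurtosis:.3f}")
# Roughly: mean ~ 0.5, std ~ 0.51, excess kurtosis ~ -1.9 (far from the Gaussian value 0).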

Problem 5: Monte Carlo Integration (10 points)


A latent variable generative model specifies a joint probability distribution p(x, z) between a set of observed variables x ∈ X and a set of latent variables z ∈ Z. From the definition of conditional probability, we can express the joint distribution as p(x, z) = p(z)p(x|z). Here, p(z) is referred to as the prior distribution over z and p(x|z) is the likelihood of the observed data conditioned on the latent variables. One natural objective for learning a latent variable model is to maximize the marginal likelihood of the observed data, given by:

$$p(x) = \int_z p(x, z)\, dz. \tag{32}$$

When z is high-dimensional, exact evaluation of the marginal likelihood is computationally intractable even if we can tractably evaluate the prior and the conditional likelihood for any given x and z. We can, however, use Monte Carlo to estimate the above integral. To do so, we draw k samples from the prior p(z), and our estimate is given as:

$$A(z^{(1)}, \dots, z^{(k)}) = \frac{1}{k}\sum_{i=1}^{k} p(x \mid z^{(i)}), \quad \text{where } z^{(i)} \sim p(z). \tag{33}$$

1. [5 points] An estimator θ̂ is an unbiased estimator of θ if and only if E[θ̂] = θ. Show that A is an unbiased
estimator of p(x).
2. [5 points] Is log A an unbiased estimator of log p(x)? Explain why or why not.

Solution
The estimator A is unbiased since

$$\mathbb{E}_{z^{(1)}, \dots, z^{(k)}}\left[A(z^{(1)}, \dots, z^{(k)})\right] = \frac{1}{k}\sum_{i=1}^{k} \mathbb{E}_{z^{(i)}}\left[p(x \mid z^{(i)})\right] \tag{34}$$
$$= \mathbb{E}_{p(z)}\left[p(x \mid z)\right] \tag{35}$$
$$= \int p(z)\, p(x \mid z)\, dz \tag{36}$$
$$= p(x). \tag{37}$$

The estimator log A is not guaranteed to be unbiased since, by Jensen's inequality,

$$\mathbb{E}_{z^{(1)}, \dots, z^{(k)}}\left[\log A(z^{(1)}, \dots, z^{(k)})\right] \le \log \mathbb{E}_{z^{(1)}, \dots, z^{(k)}}\left[A(z^{(1)}, \dots, z^{(k)})\right] \tag{38}$$
$$= \log p(x). \tag{39}$$

Note that since log is strictly concave, equality holds if and only if the random variable A is deterministic.
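
To make both parts concrete, here is an optional simulation (not part of the original solution; all values are illustrative) using a toy model whose marginal is known in closed form: z ~ N(0, 1) and x|z ~ N(z, 1), so that p(x) = N(x | 0, 2). Averaging the estimator over many trials matches p(x), while the average of log A falls below log p(x).

import numpy as np

def normal_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x, k, n_trials = 1.3, 10, 20_000

# Toy latent-variable model: z ~ N(0, 1), x | z ~ N(z, 1)  =>  p(x) = N(x | 0, 2).
true_px = normal_pdf(x, 0.0, 2.0)

# A = (1/k) * sum_i p(x | z_i), with z_i ~ p(z); one estimate per row (trial).
z = rng.normal(0.0, 1.0, size=(n_trials, k))
A = normal_pdf(x, z, 1.0).mean(axis=1)

print(f"p(x)          = {true_px:.4f}")
print(f"mean of A     = {A.mean():.4f}   (close to p(x): unbiased)")
print(f"log p(x)      = {np.log(true_px):.4f}")
print(f"mean of log A = {np.log(A).mean():.4f}   (below log p(x): biased, per Jensen)")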

[Figure 1 (diagram omitted): token indices x0, . . . , xT → 64-dim embeddings ex0, . . . , exT → LSTM outputs h0, . . . , hT (128-dim) → fully-connected layer logits l0, . . . , lT (657-dim) → softmax probabilities p0, . . . , pT (657-dim).]

Figure 1: The architecture of our model. T is the sequence length of a given input. xi is the index token, exi is the trainable embedding of token xi, hi is the output of the LSTMs, li is the logit, and pi is the probability. Nodes in gray (please view in color) contain trainable parameters.

Problem 6: Programming assignment (40 points)


In this programming assignment, we will use an autoregressive generative model to generate text from machine learning papers. In particular, we will train a character-based recurrent neural network (RNN) to generate paragraphs. The training dataset consists of all papers published in NIPS 2015.² The model used in this assignment is a four-layer Long Short-Term Memory (LSTM) network. LSTM is a variant of RNN that performs better in modeling long-term dependencies. See this blog post for a friendly introduction.

There are a total of 657 different characters in NIPS 2015 papers, including alphanumeric characters as well as many non-ASCII symbols. During training, we first convert each character to a number in the range 0 to 656. Then, for each number, we use a 64-dimensional trainable vector as its embedding. The embeddings are then fed into a four-layer LSTM network, where each layer contains 128 units. The output vectors of the LSTM network are finally passed through a fully-connected layer to form a 657-way softmax representing the probability distribution of the next token. See Figure 1 for an illustration.
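
The architecture described above can be summarized with a short PyTorch sketch. This is an illustrative reconstruction, not the provided starter code; the class name, argument names, and tensor layout are assumptions.

import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Illustrative reconstruction of the model in Figure 1 (not the provided starter code)."""

    def __init__(self, vocab_size=657, embed_dim=64, hidden_dim=128, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)    # 657 tokens -> 64-dim embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers)  # four LSTM layers, 128 units each
        self.fc = nn.Linear(hidden_dim, vocab_size)             # 128-dim outputs -> 657-way logits

    def forward(self, x, h_prev=None):
        # x: (seq_len, batch) tensor of integer token indices.
        emb = self.embedding(x)              # (seq_len, batch, 64)
        out, h = self.lstm(emb, h_prev)      # (seq_len, batch, 128)
        logits = self.fc(out)                # (seq_len, batch, 657); softmax applied downstream
        return logits, h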

Training such models can be computationally expensive, requiring specialized GPU hardware. In this particular assignment, we provide a pretrained generative model. After loading this pretrained model into PyTorch, you are expected to implement and answer the following questions.

1. [4 points] Suppose we wish to find an efficient bit representation for the 657 characters. That is, every character is represented as (a1, a2, . . . , an), where ai ∈ {0, 1} for all i = 1, 2, . . . , n. What is the minimal n that we can use?

Solution: 10, since 2^9 = 512 < 657 ≤ 1024 = 2^10 (i.e., n = ⌈log2 657⌉ = 10).
2. [6 points] If the size of the vocabulary increases from 657 to 900, what is the increase in the number of parameters? [Hint: The number of parameters in the LSTM module in Fig. 1 does not change.]

Solution:

$$\underbrace{(900 - 657) \times 64}_{\text{embeddings}} + \underbrace{(900 - 657) \times 128 + \underbrace{(900 - 657)}_{\text{bias}}}_{\text{fully-connected layer}} = 46899.$$
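
A one-line check of the arithmetic (illustrative only): each new character adds a 64-dimensional embedding row plus one new output unit of the fully-connected layer (128 weights and 1 bias).

# Increase in parameters when the vocabulary grows from 657 to 900 characters.
delta = 900 - 657
print(delta * 64 + delta * (128 + 1))  # 46899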

Note: For the following questions, you will need to complete the starter code in the designated areas. After the code is completed, run main.py to produce the related files for submission. Run the script ./make_submission.sh to generate hw1.zip and upload it to Gradescope.
3. [10 points] In the starter code, complete the method sample in model.py to generate 5 paragraphs each
of length 1000 from this model.
²Neural Information Processing Systems (NIPS) is a top machine learning conference.

Solutions:
Code:
def sample(self, seq_len):
    """
    Sample a string of length `seq_len` from the model.
    :param seq_len [int]: String length
    :return [list]: A list of length `seq_len` that contains the index of each generated character.
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        texts = []
        # Draw the first character from the empirical character frequencies.
        x = np.random.choice(voc_freq.shape[0], 1, p=voc_freq)[None, :]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # TODO: Complete the code here.
        for i in range(seq_len):
            logits, h_prev = self.forward(x, h_prev)
            np_logits = logits[-1, :].to('cpu').numpy()
            # Softmax over the logits of the last time step, then sample the next character.
            probs = np.exp(np_logits) / np.sum(np.exp(np_logits))
            ix = np.random.choice(np.arange(self.vocab_size), p=probs.ravel())
            x = torch.tensor(ix, dtype=torch.int64)[None, None].to(self.device)
            texts.append(ix)
    return texts
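
To turn the sampled indices back into readable text, one could invert the dataset's char2idx mapping. The idx2char construction below is an assumption (only char2idx appears elsewhere in the starter code), and rnn refers to the loaded model as in the later snippet-classification code.

# Hypothetical usage: decode 1000 sampled indices into a paragraph of text.
idx2char = {i: c for c, i in dataset.char2idx.items()}
paragraph = ''.join(idx2char[i] for i in rnn.sample(seq_len=1000))
print(paragraph)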

4. [10 points] Complete the method compute_prob in model.py to compute the log-likelihoods for each
string. Plot a separate histogram of the log-likelihoods of strings within each file.

[Figure 2 (omitted): histograms of the per-string log-likelihoods for each input file.]

Solutions:
Code:
def compute_prob(self, strings):
    """
    Compute the probability for each string in `strings`
    :param strings [np.ndarray]: an integer array of length N
    :return [float]: the log-likelihood
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        x = strings[None, 0, None]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # Log-probability of the first character under the empirical character frequencies.
        ll = np.log(voc_freq[strings[0]])
        # TODO: Complete the code here
        for i in range(len(strings) - 1):
            logits, h_prev = self.forward(x, h_prev)
            log_softmax = F.log_softmax(logits, dim=1)
            # Accumulate the log-probability of the next observed character.
            ll += log_softmax[-1, strings[i + 1]].item()
            x = strings[None, i + 1, None]
            x = torch.from_numpy(x).type(torch.int64).to(self.device)
    return ll

Figures: See Figure 2.


5. [10 points] Can you determine the category of an input string by only looking at its log-likelihood? We
now provide new strings in snippets.pkl. Try to infer whether the string is generated randomly, copied
from Shakespeare’s work or retrieved from NIPS publications. You will need to complete the code in
main.py.

Solutions:
Code:
for snippet in snippets:
    ll = rnn.compute_prob(np.asarray([dataset.char2idx[c] for c in snippet]))
    # Threshold the log-likelihood: under the NIPS-trained model, random strings tend to
    # score lowest and NIPS-like text highest (labels assumed to follow the order in the question).
    if ll < -600:
        lbls.append(0)
    elif ll < -200:
        lbls.append(1)
    else:
        lbls.append(2)
