CS236 Hw1 Answers
\[
\arg\max_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right] \;=\; \arg\min_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right].
\]
Solution
We rely on the known property that if ψ is a strictly monotonically decreasing function, then the following two problems are equivalent:
\[
\arg\max_{\theta \in \Theta} f(\theta) \quad \text{and} \quad \arg\min_{\theta \in \Theta} \psi(f(\theta)). \tag{1}
\]
This property can be proven by contradiction, and we assume familiarity with it. It now suffices to show that there exists a strictly monotonically decreasing ψ such that
\[
\psi\!\left(\mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right]\right) = \mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right]. \tag{2}
\]
Note that
\begin{align}
\mathbb{E}_{\hat{p}(x)}\left[D_{\mathrm{KL}}\left(\hat{p}(y \mid x) \,\|\, p_\theta(y \mid x)\right)\right] &= \mathbb{E}_{\hat{p}(x)} \mathbb{E}_{\hat{p}(y \mid x)}\left[\log \hat{p}(y \mid x) - \log p_\theta(y \mid x)\right] \tag{3}\\
&= \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y \mid x)\right] - \mathbb{E}_{\hat{p}(x,y)}\left[\log p_\theta(y \mid x)\right]. \tag{4}
\end{align}
Since the first term does not depend on θ, it is a constant in our optimization problem, and we can simply choose the strictly monotonically decreasing function
\[
\psi(z) = \mathbb{E}_{\hat{p}(x,y)}\left[\log \hat{p}(y \mid x)\right] - z. \tag{5}
\]
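As an optional numerical sanity check (not part of the graded solution), the following sketch verifies the equivalence on a toy discrete problem; the joint distribution and the one-parameter model family below are made-up assumptions used only for illustration.

import numpy as np

# Made-up empirical joint p_hat(x, y) over x in {0, 1} and y in {0, 1}.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x

def p_theta(theta):
    """One-parameter model: p_theta(y=1|x) = sigmoid(theta * (x - 0.5))."""
    x = np.array([[0.0], [1.0]])
    p1 = 1.0 / (1.0 + np.exp(-theta * (x - 0.5)))
    return np.hstack([1.0 - p1, p1])   # rows index x, columns index y

thetas = np.linspace(-5.0, 5.0, 1001)
# E_{p_hat(x,y)}[log p_theta(y|x)] and E_{p_hat(x)}[KL(p_hat(y|x) || p_theta(y|x))] on a grid.
loglik = np.array([np.sum(p_xy * np.log(p_theta(t))) for t in thetas])
kl = np.array([np.sum(p_xy * (np.log(p_y_given_x) - np.log(p_theta(t)))) for t in thetas])

# The same theta maximizes the expected log-likelihood and minimizes the expected KL.
assert thetas[np.argmax(loglik)] == thetas[np.argmin(kl)]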
Consider a mixture of k Gaussians, where $y \in \{1, 2, \ldots, k\}$ denotes the mixture id and $x \in \mathbb{R}^n$ denotes n-dimensional real-valued points. The generative process for this mixture can be specified as:
\[
p_\theta(y) = \pi_y, \quad \text{where } \sum_{y=1}^{k} \pi_y = 1, \tag{6}
\]
\[
p_\theta(x \mid y) = \mathcal{N}\!\left(x \mid \mu_y, \sigma^2 I\right), \tag{7}
\]
where we assume a diagonal covariance structure for modeling each of the Gaussians in the mixture. Such a model is parameterized by $\theta = (\pi_1, \pi_2, \ldots, \pi_k, \mu_1, \mu_2, \ldots, \mu_k, \sigma)$, where $\pi_i \in \mathbb{R}_{++}$, $\mu_i \in \mathbb{R}^n$, and $\sigma \in \mathbb{R}_{++}$. Now consider the multi-class logistic regression model for directly predicting y from x as:
\[
p_\gamma(y \mid x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}, \tag{8}
\]
parameterized by $\gamma = (w_1, \ldots, w_k, b_1, \ldots, b_k)$. Show that $p_\theta(y \mid x)$ under the mixture model can be expressed in the form of $p_\gamma(y \mid x)$ for an appropriate choice of $\gamma$.
Solution
Note that
\begin{align}
p_\theta(y \mid x) &= \frac{p_\theta(x, y)}{p_\theta(x)} \tag{10}\\
&= \frac{\pi_y \cdot \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_y)^\top (x - \mu_y)\right) \cdot Z^{-1}(\sigma)}{\sum_i \pi_i \cdot \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_i)^\top (x - \mu_i)\right) \cdot Z^{-1}(\sigma)}, \tag{11}
\end{align}
where Z(σ) is the Gaussian partition function (which is a function of σ). Further algebraic manipulations show
that
\begin{align}
p_\theta(y \mid x) &= \frac{\exp\!\left(-\frac{1}{2\sigma^2}\left(x^\top x - 2x^\top \mu_y + \mu_y^\top \mu_y\right) + \ln \pi_y\right)}{\sum_i \exp\!\left(-\frac{1}{2\sigma^2}\left(x^\top x - 2x^\top \mu_i + \mu_i^\top \mu_i\right) + \ln \pi_i\right)} \tag{12}\\
&= \frac{\exp\!\left(\frac{1}{2\sigma^2}\left(2x^\top \mu_y - \mu_y^\top \mu_y\right) + \ln \pi_y\right)}{\sum_i \exp\!\left(\frac{1}{2\sigma^2}\left(2x^\top \mu_i - \mu_i^\top \mu_i\right) + \ln \pi_i\right)} \tag{13}\\
&= \frac{\exp\!\left(x^\top \frac{\mu_y}{\sigma^2} + \left(-\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y\right)\right)}{\sum_i \exp\!\left(x^\top \frac{\mu_i}{\sigma^2} + \left(-\frac{\mu_i^\top \mu_i}{2\sigma^2} + \ln \pi_i\right)\right)}. \tag{14}
\end{align}
This matches the logistic regression form (8) with $w_y = \mu_y / \sigma^2$ and $b_y = -\frac{\mu_y^\top \mu_y}{2\sigma^2} + \ln \pi_y$.
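To double-check the algebra, here is a small numerical verification (illustrative only; the dimensions, mixture weights, and means below are randomly generated assumptions): the Bayes-rule posterior agrees with the softmax form (14).

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
k, n, sigma = 3, 4, 0.7                       # made-up sizes for this check
pi = rng.dirichlet(np.ones(k))                # mixture weights
mu = rng.normal(size=(k, n))                  # component means
x = rng.normal(size=n)

# Posterior p_theta(y|x) via Bayes' rule, as in (10)-(11).
joint = np.array([pi[y] * multivariate_normal.pdf(x, mean=mu[y], cov=sigma**2 * np.eye(n))
                  for y in range(k)])
posterior = joint / joint.sum()

# Softmax form (14) with w_y = mu_y / sigma^2 and b_y = -mu_y.mu_y / (2 sigma^2) + log pi_y.
w = mu / sigma**2
b = -np.sum(mu**2, axis=1) / (2 * sigma**2) + np.log(pi)
logits = x @ w.T + b
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

assert np.allclose(posterior, softmax)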
Problem 3: Conditional Independence and Parameterization (16 points)
Consider a collection of n discrete random variables $\{X_i\}_{i=1}^{n}$, where the number of outcomes for $X_i$ is $|\mathrm{val}(X_i)| = k_i$.
1. [2 points] Without any conditional independence assumptions, what is the total number of independent
parameters needed to describe the joint distribution over (X1 , . . . , Xn )?
2. [12 points] Let 1, 2, . . . , n denote the topological sort for a Bayesian network over the random variables $X_1, X_2, \ldots, X_n$. Let m be a positive integer in $\{1, 2, \ldots, n-1\}$. Suppose, for every $i > m$, the random variable $X_i$ is conditionally independent of all of its ancestors given the m previous ancestors in the topological ordering. Mathematically, we impose the independence assumptions
\[
X_i \perp \{X_1, \ldots, X_{i-m-1}\} \mid \{X_{i-m}, \ldots, X_{i-1}\}
\]
for $i > m$. For $i \le m$, we impose no conditional independence of $X_i$ with respect to its ancestors. Derive the total number of independent parameters needed to specify the joint distribution over $(X_1, \ldots, X_n)$.
3. [2 points] Under what independence assumptions is it possible to represent the joint distribution over $(X_1, \ldots, X_n)$ with $\sum_{i=1}^{n}(k_i - 1)$ total independent parameters?
Solution
1. There are $\prod_{i=1}^{n} k_i$ unique configurations. Without independence assumptions, the number of independent parameters needed is $\left(\prod_{i=1}^{n} k_i\right) - 1$.
2. The random variables $\{X_i\}_{i=1}^{m}$ are part of a complete graph and thus require $\left(\prod_{i=1}^{m} k_i\right) - 1$ parameters. When $i > m$, each random variable $X_i$ requires $(k_i - 1) \prod_{j=i-m}^{i-1} k_j$ parameters. The total is thus
\[
\sum_{i=1}^{m} (k_i - 1) \prod_{j=1}^{i-1} k_j + \sum_{i=m+1}^{n} (k_i - 1) \prod_{j=i-m}^{i-1} k_j = \left(\prod_{i=1}^{m} k_i\right) - 1 + \sum_{i=m+1}^{n} (k_i - 1) \prod_{j=i-m}^{i-1} k_j. \tag{17}
\]
3. If the distribution is fully factorized (i.e., $p(x_1, \ldots, x_n) = \prod_i p(x_i)$), then we only need $\sum_{i=1}^{n} (k_i - 1)$ independent parameters.
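As a quick check of these counts, here is a short helper (the outcome counts ks and the choice m = 2 below are made-up values for illustration):

import math

def num_params_bn(ks, m):
    """Independent parameters when, for i > m, X_i depends only on its m
    immediate predecessors in the topological order; ks[i-1] = k_i."""
    n = len(ks)
    total = math.prod(ks[:m]) - 1                                   # complete graph over X_1, ..., X_m
    for i in range(m + 1, n + 1):                                   # 1-indexed i > m
        total += (ks[i - 1] - 1) * math.prod(ks[i - m - 1:i - 1])   # (k_i - 1) * prod_{j=i-m}^{i-1} k_j
    return total

ks = [2, 3, 2, 4, 3]             # made-up outcome counts k_1, ..., k_n
print(math.prod(ks) - 1)         # part 1: no independence assumptions
print(num_params_bn(ks, m=2))    # part 2: formula (17)
print(sum(k - 1 for k in ks))    # part 3: fully factorized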
Consider modeling a joint distribution over real-valued variables $(x_1, \ldots, x_n)$ autoregressively in the forward order,
\[
p_f(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_f(x_i \mid x_{<i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \mu_i(x_{<i}), \sigma_i^2(x_{<i})\right),
\]
using neural networks $\{\mu_i\}_{i=1}^{n}$ and $\{\sigma_i\}_{i=1}^{n}$ that can represent any function $\mu_i : \mathbb{R}^{i-1} \to \mathbb{R}$ and $\sigma_i : \mathbb{R}^{i-1} \to \mathbb{R}_{++}$. Your friend chooses to factor the model in the reverse order using equally powerful neural networks $\{\hat{\mu}_i\}_{i=1}^{n}$ and $\{\hat{\sigma}_i\}_{i=1}^{n}$ that can represent any function $\hat{\mu}_i : \mathbb{R}^{n-i} \to \mathbb{R}$ and $\hat{\sigma}_i : \mathbb{R}^{n-i} \to \mathbb{R}_{++}$:
\[
p_r(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_r(x_i \mid x_{>i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \hat{\mu}_i(x_{>i}), \hat{\sigma}_i^2(x_{>i})\right). \tag{20}
\]
Do these models cover the same hypothesis space of distributions? In other words, given any choice of $\{\mu_i, \sigma_i\}_{i=1}^{n}$, does there always exist a choice of $\{\hat{\mu}_i, \hat{\sigma}_i\}_{i=1}^{n}$ such that $p_f = p_r$? If yes, provide a proof. Else, provide a counterexample.
[Hint: Consider the case where n = 2.]
Solution
They do not cover the same hypothesis space. To see why, consider the simple case of describing a joint distribution over $(X_1, X_2)$ using the forward versus reverse factorizations. Consider the forward factorization where
\[
p_f(x_1, x_2) = p_f(x_1)\, p_f(x_2 \mid x_1) = \mathcal{N}\!\left(x_1 \mid 0, \sigma_1^2\right) \mathcal{N}\!\left(x_2 \mid \mu_2(x_1), \epsilon^2\right)
\]
for some fixed $\epsilon > 0$, and for which
\[
\mu_2(x_1) = \begin{cases} 0 & \text{if } x_1 \le 0 \\ 1 & \text{otherwise.} \end{cases} \tag{24}
\]
(*) This construction makes $p_f(x_2)$ a mixture of two distinct Gaussians, which $p_r(x_2)$ cannot match, since $p_r(x_2)$ is strictly Gaussian. Any counterexample of this form, which makes $p_f(x_2)$ non-Gaussian, suffices for full credit.
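To visualize (*), one can simulate the forward model above and inspect the marginal of $x_2$; in the sketch below, $\sigma_1 = 1$ and $\epsilon = 0.1$ are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = 0.1                                    # arbitrary small epsilon
x1 = rng.normal(size=100_000)                # x1 ~ N(0, 1), i.e., sigma_1 = 1
x2 = np.where(x1 <= 0, 0.0, 1.0) + eps * rng.normal(size=x1.size)   # x2 | x1 ~ N(mu2(x1), eps^2)

# p_f(x2) puts roughly half its mass near 0 and half near 1 ...
print(np.mean(x2 < 0.5))                     # ~0.5
# ... and essentially no mass near its overall mean 0.5, whereas a single Gaussian with
# the same mean and variance (the only shape p_r(x2) can take) puts substantial mass there.
print(np.mean(np.abs(x2 - 0.5) < 0.1))       # ~0.0
fit = norm(loc=x2.mean(), scale=x2.std())
print(fit.cdf(0.6) - fit.cdf(0.4))           # ~0.15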
(**) Interestingly, we can also build intuition about the distribution $p_f(x_1 \mid x_2)$. If one chooses a very small positive $\epsilon$, then the corresponding $p_f(x_1 \mid x_2)$ will approach a truncated Gaussian distribution, which cannot be approximated by the Gaussian $p_r(x_1 \mid x_2)$.
Optionally, we can prove (*) and a variant of (**), which states that, for any $\epsilon > 0$, the distribution $p_f(x_1 \mid x_2)$ is an unevenly-weighted mixture of two truncated Gaussians. Throughout, let $N_i(x_2) \triangleq \mathcal{N}(x_2 \mid i, \epsilon^2)$ for $i \in \{0, 1\}$, and recall that
\[
p_f(x_1 \mid x_2) = \frac{p_f(x_1, x_2)}{p_f(x_2)}. \tag{25}
\]
Proof of (*). Marginalizing over $x_1$ and using $\Pr(X_1 \le 0) = \Pr(X_1 > 0) = \tfrac{1}{2}$ gives
\[
p_f(x_2) = \int p_f(x_1)\, p_f(x_2 \mid x_1)\, dx_1 = \tfrac{1}{2} N_0(x_2) + \tfrac{1}{2} N_1(x_2),
\]
an evenly-weighted mixture of two distinct Gaussians, which is not Gaussian.
Proof of (**) variant. The numerator is simply
\[
p_f(x_1, x_2) = \begin{cases} p_f(x_1)\, N_0(x_2) & \text{if } x_1 \le 0 \\ p_f(x_1)\, N_1(x_2) & \text{if } x_1 > 0. \end{cases} \tag{29}
\]
Combining the numerator and denominator thus yields
\[
p_f(x_1 \mid x_2) = \begin{cases} p_f(x_1) \cdot \dfrac{2 N_0(x_2)}{N_0(x_2) + N_1(x_2)} & \text{if } x_1 \le 0 \\[2ex] p_f(x_1) \cdot \dfrac{2 N_1(x_2)}{N_0(x_2) + N_1(x_2)} & \text{if } x_1 > 0, \end{cases} \tag{30}
\]
where $p_f(x_1)$ is multiplied by the weighting term
\[
v_i = \frac{2 N_i(x_2)}{N_0(x_2) + N_1(x_2)}. \tag{31}
\]
Note that $v_i / 2$ can be interpreted as the posterior probability of the $i$th Gaussian mixture component when $x_2$ is observed. For any choice of $x_2 \ne 0.5$, note that $v_1 \ne v_0$. Thus, when $x_2 \ne 0.5$, $p_f(x_1 \mid x_2)$ will exhibit a sudden density transition when $x_1$ crosses 0. One should be able to see that $p_f(x_1 \mid x_2)$ is an unevenly-weighted mixture of two truncated Gaussian distributions, which $p_r(x_1 \mid x_2)$ cannot match. Furthermore, for $x_2 = 1$, as $\epsilon \to 0$ the posterior weights $(v_0/2, v_1/2)$ approach $(0, 1)$, which in turn causes $p_f(x_1 \mid x_2 = 1)$ to approach a truncated Gaussian.
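The behavior of the weights in (31) can also be checked numerically (again assuming the construction above; the particular values of $x_2$ and $\epsilon$ below are arbitrary):

import numpy as np
from scipy.stats import norm

def weights(x2, eps):
    # v_0 and v_1 from (31), with N_i(x2) = N(x2 | i, eps^2).
    n0 = norm.pdf(x2, loc=0.0, scale=eps)
    n1 = norm.pdf(x2, loc=1.0, scale=eps)
    return 2 * n0 / (n0 + n1), 2 * n1 / (n0 + n1)

print(weights(0.5, eps=0.3))     # (1.0, 1.0): no density jump at x1 = 0
print(weights(0.6, eps=0.3))     # unequal weights -> discontinuity at x1 = 0
for eps in [0.3, 0.1, 0.05]:     # as eps -> 0 with x2 = 1, (v0/2, v1/2) -> (0, 1)
    v0, v1 = weights(1.0, eps)
    print(v0 / 2, v1 / 2)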
Consider the marginal likelihood $p(x) = \int p(z)\, p(x \mid z)\, dz$ of a latent variable model. When z is high-dimensional, evaluating this integral is computationally intractable even if we can tractably evaluate the prior and the conditional likelihood for any given x and z. We can, however, use Monte Carlo to estimate the integral. To do so, we draw k samples from the prior p(z), and our estimate is given as:
\[
A(z^{(1)}, \ldots, z^{(k)}) = \frac{1}{k} \sum_{i=1}^{k} p(x \mid z^{(i)}), \quad \text{where } z^{(i)} \sim p(z). \tag{33}
\]
1. [5 points] An estimator θ̂ is an unbiased estimator of θ if and only if E[θ̂] = θ. Show that A is an unbiased
estimator of p(x).
2. [5 points] Is log A an unbiased estimator of log p(x)? Explain why or why not.
Solution
The estimator A is unbiased since
\begin{align}
\mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[A(z^{(1)}, \ldots, z^{(k)})\right] &= \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{z^{(i)}}\left[p(x \mid z^{(i)})\right] \tag{34}\\
&= \mathbb{E}_{p(z)}\left[p(x \mid z)\right] \tag{35}\\
&= \int p(z)\, p(x \mid z)\, dz \tag{36}\\
&= p(x). \tag{37}
\end{align}
The estimator log A is not guaranteed to be unbiased since, by Jensen’s inequality,
\begin{align}
\mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[\log A(z^{(1)}, \ldots, z^{(k)})\right] &\le \log \mathbb{E}_{z^{(1)}, \ldots, z^{(k)}}\left[A(z^{(1)}, \ldots, z^{(k)})\right] \tag{38}\\
&= \log p(x). \tag{39}
\end{align}
Note that since log is strictly concave, equality holds if and only if the random variable A is deterministic.
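The following sketch illustrates both claims on a made-up one-dimensional latent-variable model, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, chosen only because its marginal is available in closed form as $p(x) = \mathcal{N}(x \mid 0, 2)$:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, k, trials = 1.5, 10, 20_000

# Monte Carlo estimator A from (33): average of p(x|z^(i)) over k prior samples.
z = rng.normal(size=(trials, k))                     # z^(i) ~ p(z) = N(0, 1)
A = norm.pdf(x, loc=z, scale=1.0).mean(axis=1)       # p(x|z) = N(x | z, 1)

true_px = norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))   # closed-form marginal for this toy model
print(A.mean(), true_px)                             # E[A] ~= p(x): unbiased
print(np.log(A).mean(), np.log(true_px))             # E[log A] < log p(x): biased downward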
[Figure 1 diagram: for each position $i = 0, \ldots, T$, the token index $x_i$ is mapped to a 64-dim embedding $e_{x_i}$, a 128-dim LSTM output $h_i$, a 657-dim logit vector $l_i$, and a 657-dim probability vector $p_i$.]
Figure 1: The architecture of our model. T is the sequence length of a given input. $x_i$ is the token index. $e_{x_i}$ is the trainable embedding of token $x_i$. $h_i$ is the output of the LSTM stack. $l_i$ is the logit vector and $p_i$ is the probability vector. Nodes in gray (please view in color) contain trainable parameters.
There are a total of 657 different characters in NIPS 2015 papers, including alphanumeric characters as well as many non-ASCII symbols. During training, we first convert each character to a number in the range 0 to 656. Then, for each number, we use a 64-dimensional trainable vector as its embedding. The embeddings are fed into a four-layer LSTM network, where each layer contains 128 units. The output vectors of the LSTM network are finally passed through a fully-connected layer to form a 657-way softmax representing the probability distribution of the next token. See Figure 1 for an illustration.
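For concreteness, the described architecture could be written roughly as the following PyTorch module; this is a sketch based on the description above, not the actual starter code, and the class name, layer names, and forward signature are assumptions.

import torch.nn as nn

class CharLSTM(nn.Module):
    # Sketch of the Figure 1 model: 64-dim embeddings, a 4-layer LSTM with 128
    # units per layer, and a 657-way output layer.
    def __init__(self, vocab_size=657, embed_dim=64, hidden_dim=128, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h_prev=None):
        # x: (seq_len, batch) integer token indices.
        emb = self.embedding(x)            # (seq_len, batch, 64)
        out, h = self.lstm(emb, h_prev)    # (seq_len, batch, 128)
        logits = self.fc(out)              # (seq_len, batch, 657)
        return logits, h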
Training such models can be computationally expensive, requiring specialized GPU hardware. In this particular assignment, we provide a pretrained generative model. After loading this pretrained model into PyTorch, you are expected to implement and answer the following questions.
1. [4 points] Suppose we wish to find an efficient bit representation for the 657 characters. That is, every
character is represented as (a1 , a2 , · · · , an ), where ai ∈ {0, 1}, ∀i = 1, 2, · · · , n. What is the minimal n
that we can use?
Solution: 10, since $2^9 = 512 < 657 \le 1024 = 2^{10}$, so the minimal n is $\lceil \log_2 657 \rceil = 10$.
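A one-line check of this count:

import math
print(math.ceil(math.log2(657)))   # 10, since 2**9 = 512 < 657 <= 1024 = 2**10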
2. [6 points] If the size of vocabulary increases from 657 to 900, what is the increase in the number of
parameters? [Hint: The number of parameters in the LSTM module in Fig. 1 does not change.]
Solution:
\[
\underbrace{(900 - 657) \times 64}_{\text{embeddings}} + \underbrace{(900 - 657) \times 128 + \underbrace{900 - 657}_{\text{bias}}}_{\text{fully-connected layer}} = 46899.
\]
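A quick arithmetic check of the count above:

delta = 900 - 657
print(delta * 64 + delta * 128 + delta)   # 46899 = new embedding rows + new fc weights + new fc biases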
Note: For the following questions, you will need to complete the starter code in designated areas. After the code is completed, run main.py to produce the related files for submission. Run the script ./make_submission.sh to generate hw1.zip and upload it to GradeScope.
3. [10 points] In the starter code, complete the method sample in model.py to generate 5 paragraphs each
of length 1000 from this model.
²Neural Information Processing Systems (NIPS) is a top machine learning conference.
Solutions:
Code:
def sample(self, seq_len):
    """
    Sample a string of length `seq_len` from the model.

    :param seq_len [int]: String length
    :return [list]: A list of length `seq_len` that contains the index of each generated character.
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        texts = []
        # Draw the first input character from the empirical character frequencies.
        x = np.random.choice(voc_freq.shape[0], 1, p=voc_freq)[None, :]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # TODO: Complete the code here.
        for i in range(seq_len):
            logits, h_prev = self.forward(x, h_prev)
            np_logits = logits[-1, :].to('cpu').numpy()
            # Softmax over the logits at the last time step.
            probs = np.exp(np_logits) / np.sum(np.exp(np_logits))
            ix = np.random.choice(np.arange(self.vocab_size), p=probs.ravel())
            x = torch.tensor(ix, dtype=torch.int64)[None, None].to(self.device)
            texts.append(ix)
    return texts
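Assuming the dataset object exposes the character-to-index mapping char2idx used later in this assignment (the inverse mapping idx2char below is constructed by hand and is hypothetical), the sampled indices can be decoded back into text as follows:

# Hypothetical usage: decode 5 samples of length 1000 back into strings.
idx2char = {i: c for c, i in dataset.char2idx.items()}   # invert the provided mapping
paragraphs = ["".join(idx2char[i] for i in rnn.sample(seq_len=1000)) for _ in range(5)]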
4. [10 points] Complete the method compute_prob in model.py to compute the log-likelihoods for each
string. Plot a separate histogram of the log-likelihoods of strings within each file.
[Figure 2: histograms of the log-likelihoods of the strings in each file.]
Solutions:
Code:
def compute_prob(self, strings):
    """
    Compute the probability for each string in `strings`.

    :param strings [np.ndarray]: an integer array of length N
    :return [float]: the log-likelihood
    """
    voc_freq = self.dataset.voc_freq
    with torch.no_grad():
        h_prev = None
        x = strings[None, 0, None]
        x = torch.from_numpy(x).type(torch.int64).to(self.device)
        # The first character is scored under the empirical character frequencies.
        ll = np.log(voc_freq[strings[0]])
        # TODO: Complete the code here
        for i in range(len(strings) - 1):
            logits, h_prev = self.forward(x, h_prev)
            log_softmax = F.log_softmax(logits, dim=1)
            # Accumulate the log-probability of the next observed character.
            ll += log_softmax[-1, strings[i + 1]].item()
            x = strings[None, i + 1, None]
            x = torch.from_numpy(x).type(torch.int64).to(self.device)
    return ll
Solutions:
Code:
# Classify each snippet by thresholding its log-likelihood under the model.
for snippet in snippets:
    ll = rnn.compute_prob(np.asarray([dataset.char2idx[c] for c in snippet]))
    if ll < -600:
        lbls.append(0)
    elif ll < -200:
        lbls.append(1)
    else:
        lbls.append(2)