
CS224N: Natural Language Processing with Deep Learning

Winter 2018 Midterm Exam


This examination consists of 17 printed sides, 5 questions, and 100 points. The exam accounts for 20% of your total grade. Please write your answers on the exam paper in the spaces provided. You may use the 16th page if necessary, but you must make a note in the question's answer box. You have 80 minutes to complete the exam. Exams turned in after the end of the examination period will be either penalized or not graded at all. The exam is closed book and allows only a single page of notes. You are not allowed to: use a phone, laptop/tablet, calculator or spreadsheet, access the internet, communicate with others, or use anything else with programming capabilities. You must disable all networking and radios ("airplane mode").
If you are taking the exam remotely, please send us the exam by Tuesday, February 13 at 5:50 pm PDT as a scanned PDF copy to scpd-distribution@lists.stanford.edu.
Stanford University Honor Code: I attest that I have not given or received aid in this
examination, and that I have done my share and taken an active part in seeing to it that others as
well as myself uphold the spirit and letter of the Honor Code.

SUNet ID: ______________          Signature: ______________

Name (printed): ______________          ☐ SCPD student

Question            Points    Score

Multiple Choice        18
Short Questions        32
Word Vectors           12
Backpropagation        17
RNNs                   21
Total:                100

The standard of academic conduct for Stanford students is as follows:


1. The Honor Code is an undertaking of the students, individually and collectively: a. that they will not
give or receive aid in examinations; that they will not give or receive unpermitted aid in class work,
in the preparation of reports, or in any other work that is to be used by the instructor as the basis of
grading; b. that they will do their share and take an active part in seeing to it that they as well as
others uphold the spirit and letter of the Honor Code.
2. The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
3. While the faculty alone has the right and obligation to set academic requirements, the students and
faculty will work together to establish optimal conditions for honorable academic work.
1. Multiple Choice (18 points)
For each of the following questions, color all the circles you think are correct. No
explanations are required.
(a) (2 points) Which of the following statements about Skip-gram are correct?
A. It predicts the center word from the surrounding context words
√ B. The final word vector for a word is the average or sum of the input vector v and output vector u corresponding to that word
C. When it comes to a small corpus, it has better performance than GloVe
D. It makes use of global co-occurrence statistics
(b) (2 points) Which of the following statements about dependency trees are correct?
A. Each word is connected to exactly one dependent (i.e., each word has exactly one outgoing edge)
B. A dependency tree with crossing edges is called a "projective" dependency tree.
√ C. Assuming it parses the sentence correctly, the last transition made by the dependency parser from class and assignment 2 will always be a RIGHT-ARC connecting ROOT to some word.
D. None of the above
(c) (2 points) Which of the following statements is true of language models?
A. Neural window-based models share weights across the window
B. Neural window-based language models suffer from the sparsity problem, but n-gram language models do not
C. The number of parameters in an RNN language model grows with the number of time steps
√ D. Neural window-based models can be parallelized, but RNN language models cannot

Solution: D. Gradients must flow through each time step for RNNs, whereas neural window-based models can perform their forward- and back-propagation in parallel.

(d) (2 points) Assume x and y get multiplied together element-wise as x ∗ y. Sx and Sy are the shapes of x and y respectively. For which Sx and Sy will NumPy / TensorFlow throw an error?
A. Sx = [1, 10], Sy = [10, 10]
B. Sx = [10, 1], Sy = [10, 1]
√ C. Sx = [10, 100], Sy = [100, 10]
D. Sx = [1, 10, 100], Sy = [1, 1, 1]
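
As a quick check, here is a minimal NumPy sketch that tries each shape pair under element-wise multiplication; only the pair in option C fails to broadcast:

import numpy as np

# Shapes follow NumPy broadcasting rules: align trailing dimensions;
# a pair of dimensions matches if they are equal or one of them is 1.
pairs = [
    ((1, 10), (10, 10)),        # broadcasts to (10, 10)
    ((10, 1), (10, 1)),         # identical shapes, result (10, 1)
    ((10, 100), (100, 10)),     # 100 vs 10 and 10 vs 100: error
    ((1, 10, 100), (1, 1, 1)),  # broadcasts to (1, 10, 100)
]

for sx, sy in pairs:
    x, y = np.ones(sx), np.ones(sy)
    try:
        print(sx, "*", sy, "->", (x * y).shape)
    except ValueError as e:
        print(sx, "*", sy, "-> ValueError:", e)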
(e) (2 points) Suppose that you are training a neural network for classification, but you notice that the training loss is much lower than the validation loss. Which of the following can be used to address the issue (select all that apply)?
√ A. Use a network with fewer layers
B. Decrease dropout probability
√ C. Increase L2 regularization weight
D. Increase the size of each hidden layer

Solution: A, C – the model is overfitting, and both of these are valid techniques to address it. However, since dropout is a form of regularization, decreasing it (B) will have the opposite effect, as will increasing network size (D).

(f) (2 points) Suppose a classifier predicts each possible class with equal probability. If there are 10 classes, what will the cross-entropy error be on a single example?
A. −log(10)
B. −0.1 log(1)
√ C. −log(0.1)
D. −10 log(0.1)

Solution: C. Cross-entropy loss simplifies to the negative log of the predicted probability for the correct class. At the start of training, we have approximately uniform probabilities for the 10 classes, or 0.1 for each class, so the loss should be approximately -log(0.1).
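
A one-line numerical check of this in NumPy (the correct class index is arbitrary):

import numpy as np

y_hat = np.full(10, 0.1)            # uniform predicted distribution over 10 classes
y = np.zeros(10); y[3] = 1.0        # one-hot true label, arbitrary correct class
loss = -np.sum(y * np.log(y_hat))   # cross-entropy reduces to -log(p_correct)
print(loss, -np.log(0.1))           # both ~2.3026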

(g) (2 points) Suppose we have a loss function f(x, y; θ), defined in terms of parameters θ, inputs x and labels y. Suppose that, for some special input and label pair (x0, y0), the loss equals zero. True or false: it follows that the gradient of the loss with respect to θ is equal to zero.
True    √ False
(h) (2 points) During backpropagation, when the gradient flows backwards through the sigmoid or tanh non-linearities, it cannot change sign.
√ True    False

Solution: A. True. The local gradient for both of these non-linearities is always positive.

(i) (2 points) Suppose we are training a neural network with stochastic gradient descent on minibatches. True or false: summing the cost across the minibatch is equivalent to averaging the cost across the minibatch, if in the first case we divide the learning rate by the minibatch size.
√ True    False

Solution: A. True. There is only a constant factor difference between averaging and summing, and you can simply rescale the learning rate to get the same update.

2. Short Questions (32 points)
Please write answers to the following questions in a sentence or two.
(a) Suppose we want to classify movie review text as (1) either positive or negative
sentiment, and (2) either action, comedy or romance movie genre. To perform
these two related classification tasks, we use a neural network that shares the first
layer, but branches into two separate layers to compute the two classifications. The
loss is a weighted sum of the two cross-entropy losses.
h = ReLU(W0 x + b0 )
ŷ1 = softmax(W1 h + b1 )
ŷ2 = softmax(W2 h + b2 )
J = αCE(y1 , ŷ1 ) + βCE(y2 , ŷ2 )

Here input x ∈ R^10 is some vector encoding of the input text, label y1 ∈ R^2 is a one-hot vector encoding the true sentiment, label y2 ∈ R^3 is a one-hot vector encoding the true movie genre, h ∈ R^10 is a hidden layer, W0 ∈ R^{10×10}, W1 ∈ R^{2×10}, W2 ∈ R^{3×10} are weight matrices, and α and β are scalars that balance the two losses.
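
For concreteness, a minimal NumPy sketch of this shared-layer, two-head forward pass and weighted loss (the weights here are random placeholders, not trained values):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=10)                          # input encoding, R^10
W0, b0 = rng.normal(size=(10, 10)), np.zeros(10) # shared first layer
W1, b1 = rng.normal(size=(2, 10)), np.zeros(2)   # sentiment head
W2, b2 = rng.normal(size=(3, 10)), np.zeros(3)   # genre head
alpha, beta = 1.0, 1.0

h = relu(W0 @ x + b0)
y_hat1 = softmax(W1 @ h + b1)
y_hat2 = softmax(W2 @ h + b2)

y1 = np.array([1.0, 0.0])        # true sentiment (one-hot)
y2 = np.array([0.0, 1.0, 0.0])   # true genre (one-hot)
J = -alpha * np.sum(y1 * np.log(y_hat1)) - beta * np.sum(y2 * np.log(y_hat2))
print(J)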
i. (4 points) To compute backpropagation for this network, we can use the multivariable chain rule. Assuming we already know:
∂ŷ1/∂x = Δ1,   ∂ŷ2/∂x = Δ2,   ∂J/∂ŷ1 = δ3^T,   ∂J/∂ŷ2 = δ4^T
What is ∂J/∂x?

Solution: δ3^T Δ1 + δ4^T Δ2 (row convention), or Δ1 δ3^T + Δ2 δ4^T (column convention).

If you are using the row convention (numerator-layout notation), the dimensions of the quantities in the question are δ3^T: 1×2, Δ1: 2×10, δ4^T: 1×3, Δ2: 3×10.
If you are using the column convention (denominator-layout notation), the dimensions are δ3^T: 2×1, Δ1: 10×2, δ4^T: 3×1, Δ2: 10×3.

ii. (3 points) When we train this model, we find that it underfits the training data.
Why might underfitting happen in this case? Provide at least one suggestion
to reduce underfitting.

Solution: The model is too simple (just two layers, the hidden layer's dimension is only 10, the input feature dimension is only 10).
Anything increasing the complexity of the model would be accepted, including:

• Increasing the dimension of the hidden layer
• Adding more layers
• Splitting the model into two (with more overall parameters)

(b) Practical Deep Learning Training


i. (2 points) In assignment 1, we saw that we could use gradient check, which
calculates numerical gradients using the central difference formula, as a way
to validate the accuracy of our analytical gradients. Why don’t we use the
numerical gradient to train neural networks in practice?

Solution: Since calculating the complete numerical gradient for a single update requires iterating through all dimensions of all the parameters and computing two forward passes for each iteration, this becomes too expensive to use for training in practice.
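
For reference, a sketch of such a central-difference gradient check in NumPy; the loop over every parameter entry (two forward passes each) is exactly what makes it impractical for training:

import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    # Central-difference gradient of a scalar function f at parameters theta.
    grad = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = theta[idx]
        theta[idx] = old + eps
        f_plus = f(theta)
        theta[idx] = old - eps
        f_minus = f(theta)
        theta[idx] = old                       # restore the parameter
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Example: check against the known gradient of f(theta) = sum(theta**2).
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t**2), theta))  # ~[2, -4, 6]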

ii. (3 points) In class, we learned how the ReLU activation function (ReLU(z) = max(0, z)) could "die" or become saturated/inactive when the input is negative. A friend of yours suggests the use of another activation function f(z) = max(0.2z, 2z), which he claims will remove the saturation problem. Would this address the problem? Why or why not?

Solution: Yes – it addresses the saturation problem of ReLU and also is not a purely linear function. In fact, this is just an instance of the Leaky ReLU non-linearity, frequently used to train neural nets in practice.
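
A small NumPy sketch illustrating why f(z) = max(0.2z, 2z) cannot die: its local gradient is 0.2 or 2, never zero (the non-differentiable point z = 0 is ignored here):

import numpy as np

z = np.array([-3.0, -0.5, 0.5, 3.0])
f = np.maximum(0.2 * z, 2 * z)       # 0.2z for z < 0, 2z for z > 0
grad = np.where(z > 0, 2.0, 0.2)     # piecewise derivative, never zero
print(f)      # [-0.6 -0.1  1.   6. ]
print(grad)   # [0.2 0.2 2.  2. ]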

iii. (3 points) The same friend also proposes another activation function g(z) =
1.5z. Would this be a good activation function? Why or why not?

Solution: No, this is not a good idea – this is no longer a non-linear func-
tion, so the neural network boils down to a simple linear predictor.

iv. (4 points) There are several tradeoffs when choosing the batch size for training.
Name one advantage for having very large batch sizes during training. Also,
name one advantage for having very small batch sizes during training.

Solution: Larger batch sizes give you updates that are closer to the overall
gradient on the dataset; small batches give you less exact but more frequent
updates.

v. (2 points) Suppose we are training a feed-forward neural network with several hidden layers (all with ReLU non-linearities) and a softmax output. A friend suggests that you should initialize all the weights (including biases) to zeros. Is this a good idea or not? Explain briefly in 1-2 sentences.

Solution: No – all neurons will receive the same update if they all start off with the same initialization, due to symmetry. In fact, because everything is zeros to start off with, there will be no gradient flow to any of the lower layers of the network (since the upstream gradients get multiplied by the W_i's and h_i's, which are all 0). (Full credit was also given to students who mentioned saturation/dead ReLUs, or thinking about non-differentiability at 0 for ReLU.)

(c) (4 points) Word2Vec represents a family of embedding algorithms that are com-
monly used in a variety of contexts. Suppose in a recommender system for online
shopping, we have information about co-purchase records for items x1 , x2 , . . . , xn
(for example, item xi is commonly bought together with item xj ). Explain how you
would use ideas similar to Word2Vec to recommend similar items to users who have
shown interest in any one of the items.

Solution: We can treat items that are co-purchased with x as being in the 'context' of item x (1 point). We can use those co-purchase records to build item embeddings akin to Word2Vec (2 points; you need to mention that item embeddings are created). Then we can use a similarity metric, such as finding the items with the largest cosine similarity to the average basket, to determine item recommendations for users (1 point).
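
One possible sketch of this idea using the gensim library (the baskets and item IDs below are hypothetical, and the argument names assume gensim ≥ 4.0):

from gensim.models import Word2Vec

# Treat each co-purchase basket as a "sentence" of item IDs, so items
# bought together appear in each other's context.
baskets = [
    ["item_1", "item_7", "item_42"],
    ["item_7", "item_42", "item_3"],
    ["item_5", "item_1", "item_7"],
]

# sg=1 selects the skip-gram architecture.
model = Word2Vec(baskets, vector_size=50, window=5, min_count=1, sg=1)

# Recommend items whose embeddings have the largest cosine similarity to
# an item the user has shown interest in.
print(model.wv.most_similar("item_7", topn=3))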

(d) In lectures and Assignment 2 you learned about transition-based dependency parsing. Another model for parsing sentences is a graph-based dependency parser. It takes as input a sentence [w1, w2, ..., wT]. First, it encodes the sentence as a sequence of d-dimensional vectors [h1, h2, ..., hT] (usually using a bidirectional LSTM, but the particular details don't matter for this question). Next, it assigns a score s(i, j) to each possible dependency i → j going from word wi to word wj (in this question, the model only predicts which edges are in the dependency graph, not the types of the edges). The score is computed as s(i, j) = h_i^T A h_j, where A is a d × d weight matrix. There is also a score for having an edge going from ROOT to a word wj, given as s(ROOT, j) = w^T h_j, where w is a d-dimensional vector of weights. Lastly, the model assigns each word j the head whose edge scores the highest: argmax_{i ∈ [1,...,T]} s(i, j).
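
A minimal NumPy sketch of this scoring scheme (random placeholder encodings and weights; in practice one would also mask the self-edge scores s(j, j)):

import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                         # sentence length, encoder dimension
H = rng.normal(size=(T, d))         # rows are the encoded words h_1 ... h_T
A = rng.normal(size=(d, d))
w = rng.normal(size=d)

S = H @ A @ H.T                     # S[i, j] = h_i^T A h_j
root = H @ w                        # root[j] = w^T h_j
scores = np.vstack([root, S])       # row 0 = ROOT, rows 1..T = candidate heads
heads = scores.argmax(axis=0)       # highest-scoring head per word (0 means ROOT)
print(heads)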
i. (3 points) Is there a kind of parse tree this model can produce, but a transition-
based dependency parser can’t? If so, what is this kind of tree?

Solution: Unlike a transition-based parser, this model can parse sentences with non-projective dependency trees (i.e., with crossing edges).

ii. (2 points) Suppose the model instead scored each edge with a simple dot product: s(i, j) = h_i^T h_j. Why would this not work as well?

Solution: Any of the following is correct:

• The score of an edge i → j will be the same as the score of an edge j → i, which doesn't make sense.
• s(i, i) will be high, so words may get linked to themselves.
• Similar words will score highly, but often similar words should not be connected in a dependency tree.
• A allows the model to express more complex interactions between the elements of the h vectors.

iii. (2 points) What is one disadvantage of graph-based dependency parsers compared to transition-based dependency parsers?

Solution: Any of the following is correct:

• Graph-based dependency parsers are slower (quadratic time instead of linear in the length of the sentence).
• Graph-based dependency parsers are not constrained to produce valid parse trees, so they may be more likely to produce invalid ones.
• Graph-based dependency parsers make parsing decisions independently, so they can't use features based on previous parsing decisions.

3. Word Vectors (12 points)
(a) (3 points) Although pre-trained word vectors work very well in many practical
downstream tasks, in some settings it’s best to continue to learn (i.e. ‘retrain’)
the word vectors as parameters of our neural network. Explain why retraining the
word vectors may hurt our model if our dataset for the specific task is too small.

Solution: See the TV/television/telly example from lecture 4. Word vectors in the training data move around; word vectors not in the training data don't move around. This destroys the structure of the word vector space. Could also be phrased as an overfitting or generalization problem.

(b) (2 points) Give 2 examples of how we can evaluate word vectors. For each example,
please indicate whether it is intrinsic or extrinsic.

Solution: Intrinsic: word vector analogies; word vector distances and their correlation with human judgments.
Extrinsic: named entity recognition (finding a person, location, organization, and so on).

(c) (4 points) In lectures, we saw how word vectors can alternatively be learned via co-
occurrence count-based methods. How does Word2Vec compare with these meth-
ods? Please briefly explain one advantage and one disadvantage of the Word2Vec
model.

Solution:
Advantages of word2vec: scales with corpus size; captures complex patterns beyond word similarity.
Disadvantages: slower than co-occurrence count-based methods; makes inefficient use of global co-occurrence statistics.

(d) (3 points) Alice and Bob have each used the Word2Vec algorithm to obtain word embeddings for the same vocabulary of words V. In particular, Alice has obtained 'context' vectors u_w^A and 'center' vectors v_w^A for every w ∈ V, and Bob has obtained 'context' vectors u_w^B and 'center' vectors v_w^B for every w ∈ V.
Suppose that, for every pair of words w, w′ ∈ V, the inner product is the same in both Alice and Bob's model: (u_w^A)^T v_{w′}^A = (u_w^B)^T v_{w′}^B. Does it follow that, for every word w ∈ V, v_w^A = v_w^B? Why or why not?

Solution: No. The Word2Vec model only optimizes the inner products between word vectors for words in the same context.
One can rotate all word vectors by the same amount and the inner products will still be the same. Alternatively, one can scale the set of context vectors by a factor of k and the set of center vectors by a factor of 1/k. Such transformations preserve inner products, but the sets of vectors can be different.
Note that degenerate solutions (all zero vectors, etc.) are discouraged.
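
A small NumPy check of both transformations (random placeholder vectors):

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))          # 'context' vectors as rows
V = rng.normal(size=(5, 3))          # 'center' vectors as rows

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal (rotation) matrix
U_rot, V_rot = U @ Q, V @ Q                    # rotate both sets the same way
U_scaled, V_scaled = 2.0 * U, 0.5 * V          # scale by k and by 1/k

print(np.allclose(U @ V.T, U_rot @ V_rot.T))       # True: inner products unchanged
print(np.allclose(U @ V.T, U_scaled @ V_scaled.T)) # True: inner products unchanged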

4. Backpropagation (17 points)
In class, we used the sigmoid function as our activation function. The most widely
used activation function in deep learning now is ReLU: Rectified Linear Unit:

ReLU(z) = max(0, z)

In this problem, we'll explore using the ReLU function in a neural network. Throughout this problem, you are allowed (and highly encouraged) to use variables to represent intermediate gradients (i.e. δ1, δ2, etc.).


(a) (2 points) Compute the gradient of the ReLU function, i.e. derive ∂ReLU(z)/∂z.

Hint: You can use the following notation:
1{x > 0} = 1 if x > 0, and 0 if x ≤ 0

Solution: ∂ReLU(z)/∂z = 1{z > 0}

(b) Now, we’ll use the ReLU function to construct a multi-layer neural network.

Below, the figure on the left shows a computation graph for a single ReLU hidden
layer at layer i. The second figure shows the computation graph for the full neural
network we’ll use in this problem. The network is specified by the equations below:

[Figures: computation graph for a single ReLU layer i (left) and for the full neural network with 2 ReLU layers (right)]

z1 = W1 x + b1
h1 = ReLU(z1 )
z2 = W2 h1 + b2
h2 = ReLU(z2 )
ŷ = softmax(h2 )
J = CE(y, ŷ)
CE(y, ŷ) = −Σ_i y_i log(ŷ_i)

The dimensions of our parameters and variables are x ∈ R^{Dx×1}, W1 ∈ R^{H×Dx}, b1 ∈ R^H, W2 ∈ R^{Dy×H}, b2 ∈ R^{Dy}, ŷ ∈ R^{Dy×1}. Note: x is a single column vector.
i. (4 points) Compute the gradients ∂J/∂W2 and ∂J/∂b2.

Hint: Recall from PA1 that ∂J/∂θ = ŷ − y, where θ is the input to the softmax.

Solution:
δ1 = ∂J/∂h2 = ŷ − y
δ2 = ∂J/∂z2 = δ1 · ∂h2/∂z2 = δ1 ∘ 1{z2 > 0}
∂J/∂b2 = δ2
∂J/∂W2 = δ2 h1^T

ii. (4 points) Compute the gradients ∂J/∂W1 and ∂J/∂b1. You may use gradients you have already derived in the previous part.

Solution:
δ3 = ∂J/∂h1 = δ2 · ∂z2/∂h1 = W2^T δ2
δ4 = ∂J/∂z1 = δ3 · ∂h1/∂z1 = δ3 ∘ 1{z1 > 0}
∂J/∂b1 = δ4
∂J/∂W1 = δ4 x^T
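
Putting parts i and ii together, here is a NumPy sketch of the forward pass and these gradients, with small placeholder dimensions and random weights:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
Dx, H, Dy = 4, 5, 3
x = rng.normal(size=(Dx, 1))                     # single column vector
W1, b1 = rng.normal(size=(H, Dx)), np.zeros((H, 1))
W2, b2 = rng.normal(size=(Dy, H)), np.zeros((Dy, 1))
y = np.zeros((Dy, 1)); y[1] = 1.0                # one-hot label

# Forward pass
z1 = W1 @ x + b1
h1 = relu(z1)
z2 = W2 @ h1 + b2
h2 = relu(z2)
y_hat = softmax(h2)
J = -np.sum(y * np.log(y_hat))

# Backward pass, following the delta notation from the solution
delta1 = y_hat - y                      # dJ/dh2
delta2 = delta1 * (z2 > 0)              # dJ/dz2
grad_b2 = delta2
grad_W2 = delta2 @ h1.T
delta3 = W2.T @ delta2                  # dJ/dh1
delta4 = delta3 * (z1 > 0)              # dJ/dz1
grad_b1 = delta4
grad_W1 = delta4 @ x.T
print(grad_W1.shape, grad_W2.shape)     # (5, 4) (3, 5)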

(c) (7 points) When neural networks become very deep (i.e. have many layers), they become difficult to train due to the vanishing gradient problem – as the gradient is back-propagated through many layers, repeated multiplication can make the gradient extremely small, so that performance plateaus or even degrades.

An effective approach, particularly in computer vision applications, is ResNet. The core idea of ResNet is skip connections that skip one or more layers. See the computation graph below:

Neural network with skip connections

z1 = W1 x + b1
h1 = ReLU(z1 )
d = h1 + x
z2 = W2 d + b2
h2 = ReLU(z2 )
θ = h2 + d
ŷ = softmax(θ)
J = CE(y, ŷ)

For this part, the dimensions from part (b) still apply, but assume Dy = Dx = H. Compute the gradient ∂J/∂x. Again, you are allowed (and highly encouraged) to use variables to represent intermediate gradients (i.e. δ1, δ2, etc.).

Hint: Use the computational graph to compute the upstream and local gradients for each node. Recall that downstream = upstream * local. Alternatively, compute each of these gradients in order to build up your answer: ∂J/∂θ, ∂J/∂h2, ∂J/∂z2, ∂J/∂d, ∂J/∂x. Show your work so we are able to give partial credit!

Please write your answer in the box provided on the next page.

Solution:
δ1 = ∂J/∂θ = ŷ − y
δ2 = ∂J/∂z2 = (∂J/∂θ)(∂θ/∂h2)(∂h2/∂z2) = δ1 ∘ 1{z2 > 0}
δ3 = ∂J/∂d = (∂J/∂z2)(∂z2/∂d) + (∂J/∂θ)(∂θ/∂d) = W2^T δ2 + δ1
∂J/∂x = (∂J/∂d)(∂d/∂h1)(∂h1/∂x) + (∂J/∂d)(∂d/∂x) = W1^T (δ3 ∘ 1{z1 > 0}) + δ3
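
A NumPy sketch of this skip-connection network and the ∂J/∂x derivation above (random placeholder weights, with D = Dx = Dy = H):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=(D, 1))
W1, b1 = rng.normal(size=(D, D)), np.zeros((D, 1))
W2, b2 = rng.normal(size=(D, D)), np.zeros((D, 1))
y = np.zeros((D, 1)); y[2] = 1.0

# Forward pass
z1 = W1 @ x + b1
h1 = relu(z1)
d = h1 + x                   # first skip connection
z2 = W2 @ d + b2
h2 = relu(z2)
theta = h2 + d               # second skip connection
y_hat = softmax(theta)
J = -np.sum(y * np.log(y_hat))

# Backward pass
delta1 = y_hat - y                            # dJ/dtheta
delta2 = delta1 * (z2 > 0)                    # dJ/dz2
delta3 = W2.T @ delta2 + delta1               # dJ/dd (sum of both incoming paths)
grad_x = W1.T @ (delta3 * (z1 > 0)) + delta3  # dJ/dx (through h1 and through the skip)
print(grad_x.shape)                           # (4, 1)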

5. RNNs (21 points)
RNNs are versatile! In class, we learned that this family of neural networks has many important advantages and can be used in a variety of tasks. They are commonly used in many state-of-the-art architectures for NLP.
(a) For each of the following tasks, state how you would run an RNN to do that task.
In particular, specify how the RNN would be used at test time (not training time),
and specify
1. how many outputs i.e. number of times the softmax ŷ(t) is called from your
RNN. If the number of outputs is not fixed, state it as arbitrary.
2. what each ŷ(t) is a probability distribution over (e.g. distributed over all species
of cats)
3. which inputs are fed at each time step to produce each output
The inputs are specified below.
i. (3 points) Named-Entity Recognition: For each word in a sentence, classify
that word as either a person, organization, location, or none.
Inputs: A sentence containing n words.

Solution: Number of Outputs: n outputs, one per input word at each time
step.

Each ŷ(t) is a probability distribution over 4 NER categories.

Each word in the sentence is fed into the RNN and one output is produced
at every time step corresponding to the predicted tag/category for each
word.

ii. (3 points) Sentiment Analysis: Classify the sentiment of a sentence ranging from negative to positive (integer values from 0 to 4).
Inputs: A sentence containing n words.

Solution: Number of Outputs: 1 output. n outputs is also acceptable if they say, for instance, to take the average of all outputs.

Each ŷ(t) is a probability distribution over 5 sentiment values.

Each word in the sentence is fed into the RNN and one output is produced
from the hidden states (by either taking only the final, max, or mean across
all states) corresponding to the sentiment value of the sentence.

iii. (3 points) Language models: generating text from a chatbot that was trained
to speak like you by predicting the next word in the sequence.
Input: A single start word or token that is fed into the first time step of the
RNN.

Solution: Number of Outputs: arbitrary

Each ŷ(t) is a probability distribution over the vocabulary.

The previous output is fed as input for the next time step and produces
the next output corresponding to the next predicted word of the generated
sentence.
As a detail, the first input can also be a designated <START> token, and the first output would be the first word of the sentence.
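
A sketch of such a test-time generation loop with a vanilla RNN language model (random placeholder weights and a toy vocabulary; greedy decoding for simplicity):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 20, 16                      # vocabulary size, hidden size
E = rng.normal(size=(V, H))        # word embeddings
Wh = rng.normal(size=(H, H)) * 0.1
Wx = rng.normal(size=(H, H)) * 0.1
Wy = rng.normal(size=(V, H)) * 0.1

h = np.zeros(H)
word = 0                           # index of the start token
generated = []
for _ in range(10):                # arbitrary number of outputs
    h = np.tanh(Wh @ h + Wx @ E[word])
    y_hat = softmax(Wy @ h)        # distribution over the vocabulary
    word = int(np.argmax(y_hat))   # greedy pick; fed back as the next input
    generated.append(word)
print(generated)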

(b) You build a sentiment analysis system that feeds a sentence into a RNN, and then
computes the sentiment class between 0 (very negative) and 4 (very positive), based
only on the final hidden state of the RNN.
i. (2 points) What is one advantage that an RNN would have over a neural
window-based model for this task?

Solution: There are multiple answers: We can process arbitrary-length inputs. It can encode temporal information. ('Takes ordering into consideration' is only partially correct, because theoretically a window-based model also can, although it is hard to.) Shared weights. Fewer parameters. The number of parameters would increase proportionally to the input size of the neural window-based network, whereas it would stay constant for RNNs since weights are shared at every time step.

ii. (2 points) You observe that your model predicts very positive sentiment for
the following passage:
Yesterday turned out to be a terrible day.
I overslept my alarm clock, and to make matters worse,
my dog ate my homework. At least my dog seems happy...
Why might the model misclassify the appropriate sentiment for this sentence?

Solution: The final word in the sentence is ’happy’ which has very positive
sentiment. Since we only use the final hidden state to compute the output,
the final word would have too much impact in the classification. In addition,
because the sentence is quite long, information from earlier time steps may
not survive due to the vanishing gradient problem.

iii. (4 points) Your friend suggests using an LSTM instead. Recall the units of an
LSTM cell are defined as
i_t = σ(W^(i) x_t + U^(i) h_{t−1})
f_t = σ(W^(f) x_t + U^(f) h_{t−1})
o_t = σ(W^(o) x_t + U^(o) h_{t−1})
c̃_t = tanh(W^(c) x_t + U^(c) h_{t−1})
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)
where the final output of the last LSTM cell is defined by ŷ_t = softmax(h_t W + b).
The final cost function J uses the cross-entropy loss. Consider an LSTM for
two time steps, t and t − 1.

[Figure: LSTM computation graph unrolled over time steps t−1 and t, showing inputs x(t−1), x(t), cell states c(t−1), c(t), hidden states h(t−1), h(t), the output y(t), and the weight U^(c) applied at each step]
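
For reference, a NumPy sketch of one LSTM step exactly as the cell is defined above (random placeholder weights; biases are omitted, as in the equations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H = 6, 8
x_t = rng.normal(size=D)
h_prev, c_prev = np.zeros(H), np.zeros(H)
Wi, Wf, Wo, Wc = (rng.normal(size=(H, D)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(H, H)) for _ in range(4))

i_t = sigmoid(Wi @ x_t + Ui @ h_prev)
f_t = sigmoid(Wf @ x_t + Uf @ h_prev)
o_t = sigmoid(Wo @ x_t + Uo @ h_prev)
c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)
c_t = f_t * c_prev + i_t * c_tilde     # forget gate f_t gives dc_t/dc_{t-1} = f_t
h_t = o_t * np.tanh(c_t)
print(h_t.shape, c_t.shape)            # (8,) (8,)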

Derive the gradient ∂J/∂U^(c) in terms of the following gradients: ∂h_t/∂h_{t−1}, ∂h_{t−1}/∂U^(c), ∂J/∂h_t, ∂c_t/∂U^(c), ∂c_{t−1}/∂U^(c), ∂c_t/∂c_{t−1}, ∂h_t/∂c_t, and ∂h_t/∂o_t. Not all of the gradients may be used. You can leave the answer in the form of the chain rule and do not have to calculate any individual gradients in your final result.

Solution:
∂J/∂U^(c) = Σ_{i=t−1}^{t} (∂J/∂U^(c))_i
∂J/∂U^(c) = ∂J/∂h_t [ ∂h_t/∂c_t ( ∂c_t/∂U^(c) + ∂c_t/∂c_{t−1} · ∂c_{t−1}/∂U^(c) ) + ∂h_t/∂h_{t−1} · ∂h_{t−1}/∂U^(c) ]

Because ∂c_t/∂U^(c) is ambiguous, the following solution is also accepted:
∂J/∂U^(c) = Σ_{i=t−1}^{t} (∂J/∂U^(c))_i
∂J/∂U^(c) = ∂J/∂h_t [ ∂h_t/∂c_t · ∂c_t/∂U^(c) + ∂h_t/∂h_{t−1} · ∂h_{t−1}/∂U^(c) ]

We must consider the gradient going through the current hidden state h_t to the memory cells c_t, c_{t−1}, and to the previous hidden state h_{t−1}.

iv. (2 points) Which part of the gradient ∂J/∂U^(c) allows LSTMs to mitigate the effect of the vanishing gradient problem? Explain in two sentences or less how this would help classify the correct sentiment for the sentence in part (b).

Solution: ∂c_t/∂c_{t−1}. Since ∂c_t/∂c_{t−1} = f_t, the forget gate, which can act as the identity function when f_t = 1, this allows the gradients entering the current cell to pass to the previous cell. This would help the model take into consideration words that appear many time steps away. For the sentence in part (b), it would take more into account the negative sentiment at the beginning.

v. (2 points) Rather than using the last hidden state to output the sentiment of
a sentence, what could be a better solution to improve the performance of the
sentiment analysis task?

Solution: We can take advantage of every cell by taking a max pool/average/sum of all hidden states.

