
CS224N: Natural Language Processing with Deep Learning

Winter 2018 Midterm Exam


This examination consists of 17 printed sides, 5 questions, and 100 points. The exam accounts for 20% of your total grade. Please write your answers on the exam paper in the spaces provided. You may use the 16th page if necessary, but you must make a note in the question's answer box. You have 80 minutes to complete the exam. Exams turned in after the end of the examination period will be either penalized or not graded at all. The exam is closed book and allows only a single page of notes. You are not allowed to: use a phone, laptop/tablet, calculator or spreadsheet, access the internet, communicate with others, or use anything else with programming capabilities. You must disable all networking and radios ("airplane mode").
If you are taking the exam remotely, please send us the exam by Tuesday, February 13 at 5:50 pm PDT as a scanned PDF copy to scpd-distribution@lists.stanford.edu.
Stanford University Honor Code: I attest that I have not given or received aid in this
examination, and that I have done my share and taken an active part in seeing to it that others as
well as myself uphold the spirit and letter of the Honor Code.

SUNet ID: ______________          Signature: ______________

Name (printed): ______________          ☐ SCPD student

Question            Points    Score

Multiple Choice        18
Short Questions        32
Word Vectors           12
Backpropagation        17
RNNs                   21
Total:                100

The standard of academic conduct for Stanford students is as follows:


1. The Honor Code is an undertaking of the students, individually and collectively: a. that they will not
give or receive aid in examinations; that they will not give or receive unpermitted aid in class work,
in the preparation of reports, or in any other work that is to be used by the instructor as the basis of
grading; b. that they will do their share and take an active part in seeing to it that they as well as
others uphold the spirit and letter of the Honor Code.
2. The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
3. While the faculty alone has the right and obligation to set academic requirements, the students and
faculty will work together to establish optimal conditions for honorable academic work.
1. Multiple Choice (18 points)
For each of the following questions, color all the circles you think are correct. No
explanations are required.
(a) (2 points) Which of the following statements about Skip-gram are correct?
A. It predicts the center word from the surrounding context words
√ B. The final word vector for a word is the average or sum of the input vector v and output vector u corresponding to that word
C. When it comes to a small corpus, it has better performance than GloVe
D. It makes use of global co-occurrence statistics
(b) (2 points) Which of the following statements about dependency trees are correct?
A. Each word is connected to exactly one dependent (i.e., each word has exactly one outgoing edge)
B. A dependency tree with crossing edges is called a "projective" dependency tree.
√ C. Assuming it parses the sentence correctly, the last transition made by the dependency parser from class and assignment 2 will always be a RIGHT-ARC connecting ROOT to some word.
D. None of the above
(c) (2 points) Which of the following statements is true of language models?
A. Neural window-based models share weights across the window
B. Neural window-based language models suffer from the sparsity problem, but n-gram language models do not
C. The number of parameters in an RNN language model grows with the number of time steps
√ D. Neural window-based models can be parallelized, but RNN language models cannot

Solution: D. Gradients must flow through each time step for RNNs, whereas neural window-based models can perform their forward- and back-propagation in parallel.

(d) (2 points) Assume x and y get multiplied together element-wise as x ∗ y. Sx and Sy are the shapes of x and y respectively. For which Sx and Sy will NumPy / TensorFlow throw an error?
A. Sx = [1, 10], Sy = [10, 10]
B. Sx = [10, 1], Sy = [10, 1]
√ C. Sx = [10, 100], Sy = [100, 10]
D. Sx = [1, 10, 100], Sy = [1, 1, 1]
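
As a quick check, here is a minimal NumPy sketch that tries each shape pair under element-wise multiplication; only the pair in option C fails to broadcast:

import numpy as np

# Shapes follow NumPy broadcasting rules: align trailing dimensions;
# a pair of dimensions matches if they are equal or one of them is 1.
pairs = [
    ((1, 10), (10, 10)),        # broadcasts to (10, 10)
    ((10, 1), (10, 1)),         # identical shapes, result (10, 1)
    ((10, 100), (100, 10)),     # 100 vs 10 and 10 vs 100: error
    ((1, 10, 100), (1, 1, 1)),  # broadcasts to (1, 10, 100)
]

for sx, sy in pairs:
    x, y = np.ones(sx), np.ones(sy)
    try:
        print(sx, "*", sy, "->", (x * y).shape)
    except ValueError as e:
        print(sx, "*", sy, "-> ValueError:", e)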
(e) (2 points) Suppose that you are training a neural network for classification, but you notice that the training loss is much lower than the validation loss. Which of the following can be used to address the issue (select all that apply)?
√ A. Use a network with fewer layers
B. Decrease dropout probability
√ C. Increase L2 regularization weight
D. Increase the size of each hidden layer

Solution: A, C – the model is overfitting, and both of these are valid techniques to address it. However, since dropout is a form of regularization, decreasing it (B) will have the opposite effect, as will increasing network size (D).

(f) (2 points) Suppose a classifier predicts each possible class with equal probability. If there are 10 classes, what will the cross-entropy error be on a single example?
A. −log(10)
B. −0.1 log(1)
√ C. −log(0.1)
D. −10 log(0.1)

Solution: C. Cross-entropy loss simplifies to the negative log of the predicted probability for the correct class. At the start of training, we have approximately uniform probabilities for the 10 classes, or 0.1 for each class, so the loss should be approximately -log(0.1).
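
A one-line numerical check of this in NumPy (the correct class index is arbitrary):

import numpy as np

y_hat = np.full(10, 0.1)            # uniform predicted distribution over 10 classes
y = np.zeros(10); y[3] = 1.0        # one-hot true label, arbitrary correct class
loss = -np.sum(y * np.log(y_hat))   # cross-entropy reduces to -log(p_correct)
print(loss, -np.log(0.1))           # both ~2.3026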

(g) (2 points) Suppose we have a loss function f(x, y; θ), defined in terms of parameters θ, inputs x and labels y. Suppose that, for some special input and label pair (x0, y0), the loss equals zero. True or false: it follows that the gradient of the loss with respect to θ is equal to zero.
True    √ False
(h) (2 points) During backpropagation, when the gradient flows backwards through the sigmoid or tanh non-linearities, it cannot change sign.
√ True    False

Solution: A. True. The local gradient for both of these non-linearities is always positive.

(i) (2 points) Suppose we are training a neural network with stochastic gradient descent on minibatches. True or false: summing the cost across the minibatch is equivalent to averaging the cost across the minibatch, if in the first case we divide the learning rate by the minibatch size.
√ True    False

Solution: A. True. There is only a constant factor difference between averaging and summing, and you can simply rescale the learning rate to get the same update.

2. Short Questions (32 points)
Please write answers to the following questions in a sentence or two.
(a) Suppose we want to classify movie review text as (1) either positive or negative
sentiment, and (2) either action, comedy or romance movie genre. To perform
these two related classification tasks, we use a neural network that shares the first
layer, but branches into two separate layers to compute the two classifications. The
loss is a weighted sum of the two cross-entropy losses.
h = ReLU(W0 x + b0 )
ŷ1 = softmax(W1 h + b1 )
ŷ2 = softmax(W2 h + b2 )
J = αCE(y1 , ŷ1 ) + βCE(y2 , ŷ2 )

Here input x ∈ R^10 is some vector encoding of the input text, label y1 ∈ R^2 is a one-hot vector encoding the true sentiment, label y2 ∈ R^3 is a one-hot vector encoding the true movie genre, h ∈ R^10 is a hidden layer, W0 ∈ R^{10×10}, W1 ∈ R^{2×10}, W2 ∈ R^{3×10} are weight matrices, and α and β are scalars that balance the two losses.
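
For concreteness, a minimal NumPy sketch of this shared-layer, two-head forward pass and weighted loss (the weights here are random placeholders, not trained values):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=10)                          # input encoding, R^10
W0, b0 = rng.normal(size=(10, 10)), np.zeros(10) # shared first layer
W1, b1 = rng.normal(size=(2, 10)), np.zeros(2)   # sentiment head
W2, b2 = rng.normal(size=(3, 10)), np.zeros(3)   # genre head
alpha, beta = 1.0, 1.0

h = relu(W0 @ x + b0)
y_hat1 = softmax(W1 @ h + b1)
y_hat2 = softmax(W2 @ h + b2)

y1 = np.array([1.0, 0.0])        # true sentiment (one-hot)
y2 = np.array([0.0, 1.0, 0.0])   # true genre (one-hot)
J = -alpha * np.sum(y1 * np.log(y_hat1)) - beta * np.sum(y2 * np.log(y_hat2))
print(J)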
i. (4 points) To compute backpropagation for this network, we can use the multivariable chain rule. Assuming we already know:
∂ŷ1/∂x = Δ1,   ∂ŷ2/∂x = Δ2,   ∂J/∂ŷ1 = δ3^T,   ∂J/∂ŷ2 = δ4^T
What is ∂J/∂x?

Solution: δ3^T Δ1 + δ4^T Δ2 (row convention), or Δ1 δ3^T + Δ2 δ4^T (column convention).

If you are using the row convention (numerator-layout notation), the dimensions of the quantities in the question are δ3^T: 1×2, Δ1: 2×10, δ4^T: 1×3, Δ2: 3×10.
If you are using the column convention (denominator-layout notation), the dimensions are δ3^T: 2×1, Δ1: 10×2, δ4^T: 3×1, Δ2: 10×3.

ii. (3 points) When we train this model, we find that it underfits the training data.
Why might underfitting happen in this case? Provide at least one suggestion
to reduce underfitting.

Solution: The model is too simple (just two layers, the hidden layer's dimension is only 10, the input feature dimension is only 10).
Anything increasing the complexity of the model would be accepted, including:

• Increasing the dimension of the hidden layer
• Adding more layers
• Splitting the model into two (with more overall parameters)

(b) Practical Deep Learning Training


i. (2 points) In assignment 1, we saw that we could use gradient check, which
calculates numerical gradients using the central difference formula, as a way
to validate the accuracy of our analytical gradients. Why don’t we use the
numerical gradient to train neural networks in practice?

Solution: Since calculating the complete numerical gradient for a single update requires iterating through all dimensions of all the parameters and computing two forward passes for each iteration, this becomes too expensive to use for training in practice.
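
For reference, a sketch of such a central-difference gradient check in NumPy; the loop over every parameter entry (two forward passes each) is exactly what makes it impractical for training:

import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    # Central-difference gradient of a scalar function f at parameters theta.
    grad = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = theta[idx]
        theta[idx] = old + eps
        f_plus = f(theta)
        theta[idx] = old - eps
        f_minus = f(theta)
        theta[idx] = old                       # restore the parameter
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Example: check against the known gradient of f(theta) = sum(theta**2).
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t**2), theta))  # ~[2, -4, 6]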

ii. (3 points) In class, we learned how the ReLU activation function (ReLU(z) = max(0, z)) could "die" or become saturated/inactive when the input is negative. A friend of yours suggests the use of another activation function f(z) = max(0.2z, 2z), which he claims will remove the saturation problem. Would this address the problem? Why or why not?

Solution: Yes – it addresses the saturation problem of ReLU and also is not a purely linear function. In fact, this is just an instance of the Leaky ReLU non-linearity, frequently used to train neural nets in practice.
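
A small NumPy sketch illustrating why f(z) = max(0.2z, 2z) cannot die: its local gradient is 0.2 or 2, never zero (the non-differentiable point z = 0 is ignored here):

import numpy as np

z = np.array([-3.0, -0.5, 0.5, 3.0])
f = np.maximum(0.2 * z, 2 * z)       # 0.2z for z < 0, 2z for z > 0
grad = np.where(z > 0, 2.0, 0.2)     # piecewise derivative, never zero
print(f)      # [-0.6 -0.1  1.   6. ]
print(grad)   # [0.2 0.2 2.  2. ]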

iii. (3 points) The same friend also proposes another activation function g(z) =
1.5z. Would this be a good activation function? Why or why not?

Solution: No, this is not a good idea – this is no longer a non-linear func-
tion, so the neural network boils down to a simple linear predictor.

iv. (4 points) There are several tradeoffs when choosing the batch size for training.
Name one advantage for having very large batch sizes during training. Also,
name one advantage for having very small batch sizes during training.

Solution: Larger batch sizes give you updates that are closer to the overall
gradient on the dataset; small batches give you less exact but more frequent
updates.

v. (2 points) Suppose we are training a feed-forward neural network with several hidden layers (all with ReLU non-linearities) and a softmax output. A friend suggests that you should initialize all the weights (including biases) to zeros. Is this a good idea or not? Explain briefly in 1-2 sentences.

Solution: No – all neurons will receive the same update if they all start off with the same initialization, due to symmetry. In fact, because everything is zeros to start off with, there will be no gradient flow to any of the lower layers of the network (since the upstream gradients get multiplied by the W_i's and h_i's, which are all 0). (Full credit was also given to students who mentioned saturation/dead ReLUs, or thinking about non-differentiability at 0 for ReLU.)

(c) (4 points) Word2Vec represents a family of embedding algorithms that are com-
monly used in a variety of contexts. Suppose in a recommender system for online
shopping, we have information about co-purchase records for items x1 , x2 , . . . , xn
(for example, item xi is commonly bought together with item xj ). Explain how you
would use ideas similar to Word2Vec to recommend similar items to users who have
shown interest in any one of the items.

Solution: We can treat items that are co-purchased with x as being in the 'context' of item x (1 point). We can use those co-purchase records to build item embeddings akin to Word2Vec (2 points; you need to mention that item embeddings are created). Then we can use a similarity metric, such as finding the items with the largest cosine similarity to the average basket, to determine item recommendations for users (1 point).
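
One possible sketch of this idea using the gensim library (the baskets and item IDs below are hypothetical, and the argument names assume gensim ≥ 4.0):

from gensim.models import Word2Vec

# Treat each co-purchase basket as a "sentence" of item IDs, so items
# bought together appear in each other's context.
baskets = [
    ["item_1", "item_7", "item_42"],
    ["item_7", "item_42", "item_3"],
    ["item_5", "item_1", "item_7"],
]

# sg=1 selects the skip-gram architecture.
model = Word2Vec(baskets, vector_size=50, window=5, min_count=1, sg=1)

# Recommend items whose embeddings have the largest cosine similarity to
# an item the user has shown interest in.
print(model.wv.most_similar("item_7", topn=3))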

(d) In lectures and Assignment 2 you learned about transition-based dependency parsing. Another model for parsing sentences is a graph-based dependency parser. It takes as input a sentence [w1, w2, ..., wT]. First, it encodes the sentence as a sequence of d-dimensional vectors [h1, h2, ..., hT] (usually using a bidirectional LSTM, but the particular details don't matter for this question). Next, it assigns a score s(i, j) to each possible dependency i → j going from word wi to word wj (in this question, the model only predicts which edges are in the dependency graph, not the types of the edges). The score is computed as s(i, j) = h_i^T A h_j, where A is a d × d weight matrix. There is also a score for having an edge going from ROOT to a word wj, given as s(ROOT, j) = w^T h_j, where w is a d-dimensional vector of weights. Lastly, the model assigns each word j the head whose edge scores the highest: argmax_{i ∈ [1,...,T]} s(i, j).
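
A minimal NumPy sketch of this scoring scheme (random placeholder encodings and weights; in practice one would also mask the self-edge scores s(j, j)):

import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                         # sentence length, encoder dimension
H = rng.normal(size=(T, d))         # rows are the encoded words h_1 ... h_T
A = rng.normal(size=(d, d))
w = rng.normal(size=d)

S = H @ A @ H.T                     # S[i, j] = h_i^T A h_j
root = H @ w                        # root[j] = w^T h_j
scores = np.vstack([root, S])       # row 0 = ROOT, rows 1..T = candidate heads
heads = scores.argmax(axis=0)       # highest-scoring head per word (0 means ROOT)
print(heads)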
i. (3 points) Is there a kind of parse tree this model can produce, but a transition-
based dependency parser can’t? If so, what is this kind of tree?

Solution: Unlike a transition-based parser, this model can parse sentences with non-projective dependency trees (i.e., with crossing edges).

ii. (2 points) Suppose the model instead scored each edge with a simple dot product: s(i, j) = h_i^T h_j. Why would this not work as well?

Solution: Any of the following is correct:

• The score of an edge i → j will be the same as the score of an edge j → i, which doesn't make sense.
• s(i, i) will be high, so words may get linked to themselves.
• Similar words will score highly, but often similar words should not be connected in a dependency tree.
• A allows the model to express more complex interactions between the elements of the h vectors.

iii. (2 points) What is one disadvantage of graph-based dependency parsers compared to transition-based dependency parsers?

Solution: Any of the following is correct:

• Graph-based dependency parsers are slower (quadratic time instead of linear in the length of the sentence).
• Graph-based dependency parsers are not constrained to produce valid parse trees, so they may be more likely to produce invalid ones.
• Graph-based dependency parsers make parsing decisions independently, so they can't use features based on previous parsing decisions.

3. Word Vectors (12 points)
(a) (3 points) Although pre-trained word vectors work very well in many practical
downstream tasks, in some settings it’s best to continue to learn (i.e. ‘retrain’)
the word vectors as parameters of our neural network. Explain why retraining the
word vectors may hurt our model if our dataset for the specific task is too small.

Solution: See the TV/television/telly example from lecture 4. Word vectors in the training data move around; word vectors not in the training data don't move around. This destroys the structure of the word vector space. Could also be phrased as an overfitting or generalization problem.

(b) (2 points) Give 2 examples of how we can evaluate word vectors. For each example,
please indicate whether it is intrinsic or extrinsic.

Solution: Intrinsic: word vector analogies; word vector distances and their correlation with human judgments.
Extrinsic: named entity recognition (finding a person, location, organization, and so on).

(c) (4 points) In lectures, we saw how word vectors can alternatively be learned via co-
occurrence count-based methods. How does Word2Vec compare with these meth-
ods? Please briefly explain one advantage and one disadvantage of the Word2Vec
model.

Solution:
Advantages of word2vec: scales with corpus size; captures complex patterns beyond word similarity.
Disadvantages: slower than co-occurrence count-based methods; makes inefficient use of global co-occurrence statistics.

(d) (3 points) Alice and Bob have each used the Word2Vec algorithm to obtain word embeddings for the same vocabulary of words V. In particular, Alice has obtained 'context' vectors u_w^A and 'center' vectors v_w^A for every w ∈ V, and Bob has obtained 'context' vectors u_w^B and 'center' vectors v_w^B for every w ∈ V.
Suppose that, for every pair of words w, w′ ∈ V, the inner product is the same in both Alice and Bob's model: (u_w^A)^T v_{w′}^A = (u_w^B)^T v_{w′}^B. Does it follow that, for every word w ∈ V, v_w^A = v_w^B? Why or why not?

Solution: No. The Word2Vec model only optimizes the inner products between word vectors for words in the same context.
One can rotate all word vectors by the same amount and the inner products will still be the same. Alternatively, one can scale the set of context vectors by a factor of k and the set of center vectors by a factor of 1/k. Such transformations preserve inner products, but the sets of vectors can be different.
Note that degenerate solutions (all zero vectors, etc.) are discouraged.
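
A small NumPy check of both transformations (random placeholder vectors):

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))          # 'context' vectors as rows
V = rng.normal(size=(5, 3))          # 'center' vectors as rows

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal (rotation) matrix
U_rot, V_rot = U @ Q, V @ Q                    # rotate both sets the same way
U_scaled, V_scaled = 2.0 * U, 0.5 * V          # scale by k and by 1/k

print(np.allclose(U @ V.T, U_rot @ V_rot.T))       # True: inner products unchanged
print(np.allclose(U @ V.T, U_scaled @ V_scaled.T)) # True: inner products unchanged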

4. Backpropagation (17 points)
In class, we used the sigmoid function as our activation function. The most widely
used activation function in deep learning now is ReLU: Rectified Linear Unit:

ReLU(z) = max(0, z)

In this problem, we'll explore using the ReLU function in a neural network. Throughout this problem, you are allowed (and highly encouraged) to use variables to represent intermediate gradients (i.e. δ1, δ2, etc.).


(a) (2 points) Compute the gradient of the ReLU function, i.e. derive ∂ReLU(z)/∂z.

Hint: You can use the following notation:
1{x > 0} = 1 if x > 0, and 0 if x ≤ 0

Solution: ∂ReLU(z)/∂z = 1{z > 0}

(b) Now, we’ll use the ReLU function to construct a multi-layer neural network.

Below, the figure on the left shows a computation graph for a single ReLU hidden
layer at layer i. The second figure shows the computation graph for the full neural
network we’ll use in this problem. The network is specified by the equations below:

[Figures: computation graph for a single ReLU layer i (left) and for the full neural network with 2 ReLU layers (right)]

z1 = W1 x + b1
h1 = ReLU(z1 )
z2 = W2 h1 + b2
h2 = ReLU(z2 )
ŷ = softmax(h2 )
J = CE(y, ŷ)
CE(y, ŷ) = −Σ_i y_i log(ŷ_i)

The dimensions of our parameters and variables are x ∈ R^{Dx×1}, W1 ∈ R^{H×Dx}, b1 ∈ R^H, W2 ∈ R^{Dy×H}, b2 ∈ R^{Dy}, ŷ ∈ R^{Dy×1}. Note: x is a single column vector.
i. (4 points) Compute the gradients ∂J/∂W2 and ∂J/∂b2.

Hint: Recall from PA1 that ∂J/∂θ = ŷ − y, where θ is the input to the softmax.

Solution:
δ1 = ∂J/∂h2 = ŷ − y
δ2 = ∂J/∂z2 = δ1 · ∂h2/∂z2 = δ1 ∘ 1{z2 > 0}
∂J/∂b2 = δ2
∂J/∂W2 = δ2 h1^T

ii. (4 points) Compute the gradients ∂J/∂W1 and ∂J/∂b1. You may use gradients you have already derived in the previous part.

Solution:
δ3 = ∂J/∂h1 = δ2 · ∂z2/∂h1 = W2^T δ2
δ4 = ∂J/∂z1 = δ3 · ∂h1/∂z1 = δ3 ∘ 1{z1 > 0}
∂J/∂b1 = δ4
∂J/∂W1 = δ4 x^T
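
Putting parts i and ii together, here is a NumPy sketch of the forward pass and these gradients, with small placeholder dimensions and random weights:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
Dx, H, Dy = 4, 5, 3
x = rng.normal(size=(Dx, 1))                     # single column vector
W1, b1 = rng.normal(size=(H, Dx)), np.zeros((H, 1))
W2, b2 = rng.normal(size=(Dy, H)), np.zeros((Dy, 1))
y = np.zeros((Dy, 1)); y[1] = 1.0                # one-hot label

# Forward pass
z1 = W1 @ x + b1
h1 = relu(z1)
z2 = W2 @ h1 + b2
h2 = relu(z2)
y_hat = softmax(h2)
J = -np.sum(y * np.log(y_hat))

# Backward pass, following the delta notation from the solution
delta1 = y_hat - y                      # dJ/dh2
delta2 = delta1 * (z2 > 0)              # dJ/dz2
grad_b2 = delta2
grad_W2 = delta2 @ h1.T
delta3 = W2.T @ delta2                  # dJ/dh1
delta4 = delta3 * (z1 > 0)              # dJ/dz1
grad_b1 = delta4
grad_W1 = delta4 @ x.T
print(grad_W1.shape, grad_W2.shape)     # (5, 4) (3, 5)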

(c) (7 points) When neural networks become very deep (i.e. have many layers), they become difficult to train due to the vanishing gradient problem – as the gradient is back-propagated through many layers, repeated multiplication can make the gradient extremely small, so that performance plateaus or even degrades.

An effective approach, particularly in computer vision applications, is ResNet. The core idea of ResNet is skip connections that skip one or more layers. See the computation graph below:

Neural network with skip connections

z1 = W1 x + b1
h1 = ReLU(z1 )
d = h1 + x
z2 = W2 d + b2
h2 = ReLU(z2 )
θ = h2 + d
ŷ = softmax(θ)
J = CE(y, ŷ)

For this part, the dimensions from part (b) still apply, but assume Dy = Dx = H. Compute the gradient ∂J/∂x. Again, you are allowed (and highly encouraged) to use variables to represent intermediate gradients (i.e. δ1, δ2, etc.).

Hint: Use the computational graph to compute the upstream and local gradients for each node. Recall that downstream = upstream * local. Alternatively, compute each of these gradients in order to build up your answer: ∂J/∂θ, ∂J/∂h2, ∂J/∂z2, ∂J/∂d, ∂J/∂x. Show your work so we are able to give partial credit!

Please write your answer in the box provided on the next page.

Solution:
δ1 = ∂J/∂θ = ŷ − y
δ2 = ∂J/∂z2 = (∂J/∂θ)(∂θ/∂h2)(∂h2/∂z2) = δ1 ∘ 1{z2 > 0}
δ3 = ∂J/∂d = (∂J/∂z2)(∂z2/∂d) + (∂J/∂θ)(∂θ/∂d) = W2^T δ2 + δ1
∂J/∂x = (∂J/∂d)(∂d/∂h1)(∂h1/∂x) + (∂J/∂d)(∂d/∂x) = W1^T (δ3 ∘ 1{z1 > 0}) + δ3
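
A NumPy sketch of this skip-connection network and the ∂J/∂x derivation above (random placeholder weights, with D = Dx = Dy = H):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=(D, 1))
W1, b1 = rng.normal(size=(D, D)), np.zeros((D, 1))
W2, b2 = rng.normal(size=(D, D)), np.zeros((D, 1))
y = np.zeros((D, 1)); y[2] = 1.0

# Forward pass
z1 = W1 @ x + b1
h1 = relu(z1)
d = h1 + x                   # first skip connection
z2 = W2 @ d + b2
h2 = relu(z2)
theta = h2 + d               # second skip connection
y_hat = softmax(theta)
J = -np.sum(y * np.log(y_hat))

# Backward pass
delta1 = y_hat - y                            # dJ/dtheta
delta2 = delta1 * (z2 > 0)                    # dJ/dz2
delta3 = W2.T @ delta2 + delta1               # dJ/dd (sum of both incoming paths)
grad_x = W1.T @ (delta3 * (z1 > 0)) + delta3  # dJ/dx (through h1 and through the skip)
print(grad_x.shape)                           # (4, 1)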

5. RNNs (21 points)
RNNs are versatile! In class, we learned that this family of neural networks has many important advantages and can be used in a variety of tasks. They are commonly used in many state-of-the-art architectures for NLP.
(a) For each of the following tasks, state how you would run an RNN to do that task.
In particular, specify how the RNN would be used at test time (not training time),
and specify
1. how many outputs i.e. number of times the softmax ŷ(t) is called from your
RNN. If the number of outputs is not fixed, state it as arbitrary.
2. what each ŷ(t) is a probability distribution over (e.g. distributed over all species
of cats)
3. which inputs are fed at each time step to produce each output
The inputs are specified below.
i. (3 points) Named-Entity Recognition: For each word in a sentence, classify
that word as either a person, organization, location, or none.
Inputs: A sentence containing n words.

Solution: Number of Outputs: n outputs, one per input word at each time
step.

Each ŷ(t) is a probability distribution over 4 NER categories.

Each word in the sentence is fed into the RNN and one output is produced
at every time step corresponding to the predicted tag/category for each
word.

ii. (3 points) Sentiment Analysis: Classify the sentiment of a sentence ranging from negative to positive (integer values from 0 to 4).
Inputs: A sentence containing n words.

Solution: Number of Outputs: 1 output. n outputs is also acceptable if they say, for instance, to take the average of all outputs.

Each ŷ(t) is a probability distribution over 5 sentiment values.

Each word in the sentence is fed into the RNN and one output is produced
from the hidden states (by either taking only the final, max, or mean across
all states) corresponding to the sentiment value of the sentence.

iii. (3 points) Language models: generating text from a chatbot that was trained
to speak like you by predicting the next word in the sequence.
Input: A single start word or token that is fed into the first time step of the
RNN.

Solution: Number of Outputs: arbitrary

Each ŷ(t) is a probability distribution over the vocabulary.

The previous output is fed as input for the next time step and produces
the next output corresponding to the next predicted word of the generated
sentence.
As a detail, the first input can also be a designated <START> token, and the first output would be the first word of the sentence.
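
A sketch of such a test-time generation loop with a vanilla RNN language model (random placeholder weights and a toy vocabulary; greedy decoding for simplicity):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 20, 16                      # vocabulary size, hidden size
E = rng.normal(size=(V, H))        # word embeddings
Wh = rng.normal(size=(H, H)) * 0.1
Wx = rng.normal(size=(H, H)) * 0.1
Wy = rng.normal(size=(V, H)) * 0.1

h = np.zeros(H)
word = 0                           # index of the start token
generated = []
for _ in range(10):                # arbitrary number of outputs
    h = np.tanh(Wh @ h + Wx @ E[word])
    y_hat = softmax(Wy @ h)        # distribution over the vocabulary
    word = int(np.argmax(y_hat))   # greedy pick; fed back as the next input
    generated.append(word)
print(generated)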

(b) You build a sentiment analysis system that feeds a sentence into a RNN, and then
computes the sentiment class between 0 (very negative) and 4 (very positive), based
only on the final hidden state of the RNN.
i. (2 points) What is one advantage that an RNN would have over a neural
window-based model for this task?

Solution: There are multiple answers: We can process arbitrary-length inputs. It can encode temporal information. ('Takes ordering into consideration' is only partially correct, because theoretically a window-based model also can, although it is hard to.) Shared weights. Fewer parameters. The number of parameters would increase proportionally to the input size of the neural window-based network, whereas it would stay constant for RNNs since weights are shared at every time step.

ii. (2 points) You observe that your model predicts very positive sentiment for
the following passage:
Yesterday turned out to be a terrible day.
I overslept my alarm clock, and to make matters worse,
my dog ate my homework. At least my dog seems happy...
Why might the model misclassify the appropriate sentiment for this sentence?

Solution: The final word in the sentence is ’happy’ which has very positive
sentiment. Since we only use the final hidden state to compute the output,
the final word would have too much impact in the classification. In addition,
because the sentence is quite long, information from earlier time steps may
not survive due to the vanishing gradient problem.

iii. (4 points) Your friend suggests using an LSTM instead. Recall the units of an
LSTM cell are defined as
i_t = σ(W^(i) x_t + U^(i) h_{t−1})
f_t = σ(W^(f) x_t + U^(f) h_{t−1})
o_t = σ(W^(o) x_t + U^(o) h_{t−1})
c̃_t = tanh(W^(c) x_t + U^(c) h_{t−1})
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)
where the final output of the last LSTM cell is defined by ŷ_t = softmax(h_t W + b).
The final cost function J uses the cross-entropy loss. Consider an LSTM for
two time steps, t and t − 1.

[Figure: LSTM computation graph unrolled over time steps t−1 and t, showing inputs x(t−1), x(t), cell states c(t−1), c(t), hidden states h(t−1), h(t), the output y(t), and the weight U^(c) applied at each step]
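
For reference, a NumPy sketch of one LSTM step exactly as the cell is defined above (random placeholder weights; biases are omitted, as in the equations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H = 6, 8
x_t = rng.normal(size=D)
h_prev, c_prev = np.zeros(H), np.zeros(H)
Wi, Wf, Wo, Wc = (rng.normal(size=(H, D)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(H, H)) for _ in range(4))

i_t = sigmoid(Wi @ x_t + Ui @ h_prev)
f_t = sigmoid(Wf @ x_t + Uf @ h_prev)
o_t = sigmoid(Wo @ x_t + Uo @ h_prev)
c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)
c_t = f_t * c_prev + i_t * c_tilde     # forget gate f_t gives dc_t/dc_{t-1} = f_t
h_t = o_t * np.tanh(c_t)
print(h_t.shape, c_t.shape)            # (8,) (8,)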

Derive the gradient ∂J/∂U^(c) in terms of the following gradients: ∂h_t/∂h_{t−1}, ∂h_{t−1}/∂U^(c), ∂J/∂h_t, ∂c_t/∂U^(c), ∂c_{t−1}/∂U^(c), ∂c_t/∂c_{t−1}, ∂h_t/∂c_t, and ∂h_t/∂o_t. Not all of the gradients may be used. You can leave the answer in the form of the chain rule and do not have to calculate any individual gradients in your final result.

Solution:
∂J/∂U^(c) = Σ_{i=t−1}^{t} (∂J/∂U^(c))_i
∂J/∂U^(c) = ∂J/∂h_t [ ∂h_t/∂c_t ( ∂c_t/∂U^(c) + ∂c_t/∂c_{t−1} · ∂c_{t−1}/∂U^(c) ) + ∂h_t/∂h_{t−1} · ∂h_{t−1}/∂U^(c) ]

Because ∂c_t/∂U^(c) is ambiguous, the following solution is also accepted:
∂J/∂U^(c) = Σ_{i=t−1}^{t} (∂J/∂U^(c))_i
∂J/∂U^(c) = ∂J/∂h_t [ ∂h_t/∂c_t · ∂c_t/∂U^(c) + ∂h_t/∂h_{t−1} · ∂h_{t−1}/∂U^(c) ]

We must consider the gradient going through the current hidden state h_t to the memory cells c_t, c_{t−1}, and to the previous hidden state h_{t−1}.

iv. (2 points) Which part of the gradient ∂J/∂U^(c) allows LSTMs to mitigate the effect of the vanishing gradient problem? Explain in two sentences or less how this would help classify the correct sentiment for the sentence in part (b).

Solution: ∂c_t/∂c_{t−1}. Since ∂c_t/∂c_{t−1} = f_t, the forget gate, which can act as the identity function when f_t = 1, this allows the gradients entering the current cell to pass to the previous cell. This would help the model take into consideration words that appear many time steps away. For the sentence in part (b), it would take more into account the negative sentiment at the beginning.

v. (2 points) Rather than using the last hidden state to output the sentiment of
a sentence, what could be a better solution to improve the performance of the
sentiment analysis task?

Solution: We can take advantage of every cell by taking a max pool/average/sum of all hidden states.

