11 RNN
Spring 2017
• Markov model:
p(X) = \prod_{t=1}^{n} p(x_t \mid x_{t-1})
• With RNNs, can do non-Markovian models:
p(X) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1})
RNN: Transducer Architecture
• Predict output for every time step
Language Modeling
• Input: X = x_1, \ldots, x_n
• Goal: compute 𝑝(𝑋)
• Model:
p(X) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1})
p(x_t \mid x_1, \ldots, x_{t-1}) = O(s_t) = O(R(s_{t-1}, x_t))
O(s_t) = \mathrm{softmax}(s_t W + b)
• Predict the next token \hat{y}_t as we go:
\hat{y}_t = \arg\max O(s_t)
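A minimal sketch of this transducer-style language model, assuming PyTorch; the layer sizes, variable names, and toy input are illustrative, not taken from the slides:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 128, 256       # illustrative sizes

embed = nn.Embedding(vocab_size, embed_dim)               # maps token ids x_t to vectors
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)     # s_t = R(s_{t-1}, x_t)
out = nn.Linear(hidden_dim, vocab_size)                   # s_t W + b

x = torch.randint(0, vocab_size, (1, 20))                 # one sequence of 20 token ids
states, _ = rnn(embed(x))                                 # states: (1, 20, hidden_dim)
probs = torch.softmax(out(states), dim=-1)                # O(s_t) = softmax(s_t W + b)
y_hat = probs.argmax(dim=-1)                              # \hat{y}_t = argmax O(s_t), per step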
RNN: Transducer Architecture
• Predict output for every time step
• Examples:
– Language modeling
– POS tagging
– NER
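For per-token tagging tasks such as POS tagging or NER, the same transducer applies; only the output layer changes to score tags instead of next words. A hedged sketch, assuming PyTorch, with an illustrative tag-set size and input:

import torch
import torch.nn as nn

num_tags = 17                                             # e.g., a small POS tag set; illustrative
rnn = nn.RNN(input_size=128, hidden_size=256, batch_first=True)
tag_out = nn.Linear(256, num_tags)                        # O(s_t) now scores tags

word_vecs = torch.randn(1, 20, 128)                       # embedded sentence: batch=1, 20 tokens
states, _ = rnn(word_vecs)
tags = tag_out(states).argmax(dim=-1)                     # one predicted tag per time step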
RNN: Encoder Architecture
• Similar to acceptor
• Difference: last state is used as input to
another model and not for prediction
O(s_t) = s_t \;\Rightarrow\; y_n = s_n
• Example:
– Sentence embedding
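A sketch of the encoder use, assuming PyTorch with illustrative sizes: run the RNN over the sentence and keep only the final state s_n as the sentence embedding, to be consumed by another model.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=128, hidden_size=256, batch_first=True)
word_vecs = torch.randn(1, 20, 128)      # embedded sentence: batch=1, 20 tokens
_, h_n = rnn(word_vecs)                  # h_n: (1, 1, 256), the last state s_n
sentence_embedding = h_n.squeeze(0)      # y_n = s_n, passed to a downstream model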
Bidirectional RNNs
• RNN decisions are based on historical data only
– How can we account for future input?
• When is it relevant? Feasible?
• When all of the input is available in advance; so not for real-time (streaming) input, for example.
• Probabilistic model, for example for language modeling:
p(X \mid s_0) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1}, s_0)
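A minimal bidirectional sketch, assuming PyTorch: each position's representation concatenates a forward state over x_1..x_t and a backward state over x_n..x_t, which is only possible when the whole input is available up front.

import torch
import torch.nn as nn

birnn = nn.RNN(input_size=128, hidden_size=256, batch_first=True, bidirectional=True)
word_vecs = torch.randn(1, 20, 128)      # full sentence known in advance
states, _ = birnn(word_vecs)             # states: (1, 20, 512) = [forward; backward] per position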
Example: Caption Generation
• Given: image 𝐼
• Goal: generate caption
• Set s_0 = \mathrm{CNN}(I)
• Model:
p(X \mid I) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1}, I)
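A sketch of this conditioning, assuming PyTorch; the CNN feature is replaced by a random placeholder and all sizes are illustrative. The image enters only through the initial state s_0:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 128, 256       # illustrative sizes
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
out = nn.Linear(hidden_dim, vocab_size)

image_feat = torch.randn(1, hidden_dim)                   # stand-in for CNN(I)
s0 = image_feat.unsqueeze(0)                              # (1, 1, hidden_dim), the initial state
caption_so_far = torch.randint(0, vocab_size, (1, 5))     # tokens generated so far
states, _ = rnn(embed(caption_so_far), s0)                # every step conditioned on I via s_0
next_word_probs = torch.softmax(out(states[:, -1]), dim=-1)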
LSTM
[LSTM cell diagram: input, cell state, and output; image by Tim Rocktäschel]
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)
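A single LSTM step that mirrors these equations directly, as a sketch in NumPy; dimensions, initialization, and the toy usage are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                          # forget gate
    i_t = sigmoid(W_i @ z + b_i)                          # input gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)     # new cell state
    o_t = sigmoid(W_o @ z + b_o)                          # output gate
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# toy usage: hidden size 4, input size 3
d_h, d_x = 4, 3
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(d_h, d_h + d_x))
b = np.zeros(d_h)
h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h),
                     W(), W(), W(), W(), b, b, b, b)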