Deep learning
Supervised learning
Unsupervised learning
Final remarks
[Figure: Dtrain → Learner → f]
Learner:
min_{θ∈Rd} TrainLoss(θ)
For t = 1, ..., T:
  For (x, y) ∈ Dtrain:
    θ ← θ − ηt ∇θ Loss(x, y, θ)
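A minimal sketch of this update loop in Python (NumPy), assuming the squared-loss linear predictor fθ(x) = w · x shown on the next slide and a constant step size; the dataset and hyperparameters are toy values, not from the slides:

import numpy as np

def sgd(train, d, T=100, eta=0.1):
    # Stochastic gradient descent on squared loss for a linear predictor f(x) = w . x.
    w = np.zeros(d)                      # parameters theta = w
    for t in range(T):
        for x, y in train:
            residual = w.dot(x) - y      # prediction minus target
            grad = residual * x          # gradient of 0.5 * (w.x - y)^2 w.r.t. w
            w -= eta * grad              # theta <- theta - eta * gradient
    return w

# Toy dataset where the true relationship is y = 2*x1 - x2.
train = [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), -1.0)]
print(sgd(train, d=2))                   # approaches [2, -1]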
[Figure: input features x1, x2, x3 feeding into the output fθ(x)]
Output: fθ(x) = w · x
Parameters: θ = w
[Figure: computation graph in → mid → out, with the partial derivatives ∂out/∂mid and ∂mid/∂in on the edges]
Chain rule:
∂out/∂in = (∂out/∂mid) (∂mid/∂in)
[Figure: backpropagation through a one-hidden-layer network: hidden units h1, h2, h3 are sigmoids of weighted combinations of φ(x), the output combines them with weights w1, w2, w3, and the loss is the squared residual (prediction − y); the backward pass multiplies gradients edge by edge, using σ′ = hj(1 − hj) at each hidden unit]
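As a concrete instance of the chain rule and the σ′ = h(1 − h) fact above, here is a hedged NumPy sketch of the forward and backward pass for a one-hidden-layer network with squared loss; the shapes, names, and placement of the weights are illustrative assumptions, not the slides' exact notation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(V, w, phi_x, y):
    # Forward pass: prediction = w . sigmoid(V phi(x)), loss = 0.5 * residual^2.
    h = sigmoid(V.dot(phi_x))                        # hidden activations h_j
    residual = w.dot(h) - y                          # prediction minus target
    loss = 0.5 * residual ** 2
    # Backward pass (chain rule), using sigma'(z) = h * (1 - h).
    grad_w = residual * h                            # d loss / d w
    grad_h = residual * w                            # d loss / d h
    grad_V = np.outer(grad_h * h * (1 - h), phi_x)   # d loss / d V
    return loss, grad_V, grad_w

V = 0.1 * np.random.randn(3, 2)
w = 0.1 * np.random.randn(3)
loss, grad_V, grad_w = forward_backward(V, w, phi_x=np.array([1.0, -2.0]), y=1.0)
print(loss, grad_V.shape, grad_w.shape)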
Intuitions:
• Hierarchical feature representations
• Can simulate a bounded computation logic circuit (original motivation from McCulloch/Pitts, 1943)
• Learn this computation (and potentially more, because networks are real-valued)
• Depth k + 1 logic circuits can represent more than depth k (counting argument)
• Formal theory/understanding is still incomplete
Unsupervised learning
• Humans rarely get direct supervision; can we learn from raw sensory information?
If we can compress a data point and still reconstruct it, then we have learned something generally useful.
General framework:
[Figure: x → Encode → h → Decode → x̂]
minimize ‖x − x̂‖²
Input: points x1, ..., xn
Encode(x) = U⊤x
Decode(h) = Uh
PCA objective:
minimize Σ_{i=1}^{n} ‖xi − Decode(Encode(xi))‖²
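A small sketch of this linear encoder/decoder in NumPy, assuming Encode(x) = U⊤x and Decode(h) = Uh with U built from the top-k principal directions (computed via an SVD of the centered data; the centering step is an implementation detail added here, not shown on the slide):

import numpy as np

def pca_autoencoder(X, k):
    # Return Encode/Decode functions based on the top-k principal directions of X.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    U = Vt[:k].T                                # d x k matrix of principal directions
    encode = lambda x: U.T.dot(x - mu)          # h = U^T (x - mu)
    decode = lambda h: U.dot(h) + mu            # x_hat = U h + mu
    return encode, decode

X = np.random.randn(100, 5) @ np.random.randn(5, 5)   # toy data with correlated features
encode, decode = pca_autoencoder(X, k=2)
x = X[0]
print(np.linalg.norm(x - decode(encode(x))) ** 2)     # per-point reconstruction error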
Autoencoders
Increase the dimensionality of the hidden representation h:
[Figure: x → Encode → h → Decode → x̂, with encoder parameters W, b and decoder parameters W′, b′]
Encode(x) = σ(Wx + b)
Decode(h) = σ(W′h + b′)
Loss function:
minimize ‖x − Decode(Encode(x))‖²
Key: non-linearity makes life harder, prevents degeneracy
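A hedged NumPy sketch of the nonlinear autoencoder objective above; the parameters are random rather than trained, and the dimensions are toy values chosen so that the hidden layer is larger than the input, as the slide suggests:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k = 4, 6                                        # hidden dimension k exceeds input dimension d
rng = np.random.default_rng(0)
W, b = rng.normal(size=(k, d)), np.zeros(k)        # encoder parameters
W2, b2 = rng.normal(size=(d, k)), np.zeros(d)      # decoder parameters (W', b')

encode = lambda x: sigmoid(W.dot(x) + b)           # Encode(x) = sigma(W x + b)
decode = lambda h: sigmoid(W2.dot(h) + b2)         # Decode(h) = sigma(W' h + b')

x = rng.normal(size=d)
loss = np.linalg.norm(x - decode(encode(x))) ** 2  # ||x - Decode(Encode(x))||^2
print(loss)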
Types of noise:
• Blankout: Corrupt([1, 2, 3, 4]) = [0, 2, 3, 0]
• Gaussian: Corrupt([1, 2, 3, 4]) = [1.1, 1.9, 3.3, 4.2]
Objective:
minimize ‖x − Decode(Encode(Corrupt(x)))‖²
Algorithm: pick example, add fresh noise, SGD update
Key: noise makes life harder, prevents degeneracy
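A minimal sketch of the two corruption schemes in NumPy; the blankout probability and noise scale are assumptions for illustration, and fresh noise is drawn on every call, matching the algorithm above:

import numpy as np

rng = np.random.default_rng(0)

def corrupt_blankout(x, p=0.5):
    # Zero out each coordinate independently with probability p, cf. [1,2,3,4] -> [0,2,3,0].
    return x * (rng.random(x.shape) >= p)

def corrupt_gaussian(x, sigma=0.1):
    # Add fresh Gaussian noise to every coordinate, cf. [1,2,3,4] -> [1.1,1.9,3.3,4.2].
    return x + sigma * rng.normal(size=x.shape)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(corrupt_blankout(x))
print(corrupt_gaussian(x))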
Denoising autoencoders
MNIST: 60,000 images of digits (784 dimensions)
[Figure: Encode/Decode applied to MNIST digit images]
Test time: Encode′(Encode(x))
Probabilistic models
So far: Decode(Encode(x))
Instead: define a joint distribution p(x, h) over the input x and hidden variables h
Two types:
• Restricted Boltzmann machines (Markov network)
• Deep belief networks (Bayesian network)
Word-document count matrix N(w, c) (rows: words w, columns: documents c):
         Doc1  Doc2
cats      1     0
dogs      0     1
have      1     1
tails     1     1
[Figure: low-rank factorization of the count matrix N, of the form N ≈ Θ S V⊤]
pθ(g = 1 | w, c) = (1 + exp(−θw · βc))−1
p̂(w, c) ∝ N(w, c)
Objective:
max_{θ,β}  Σ_{w,c} p̂(w, c) log p(g = 1 | w, c) + k Σ_{w,c′} p̂(w) p̂(c′) log p(g = 0 | w, c′)
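A hedged NumPy sketch of this objective for a single (word, context) pair, approximating the second term by drawing k negative contexts from p̂(c′); the vocabulary sizes, vector dimension, and the uniform sampling distribution are toy placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
num_words, num_contexts, dim, k = 10, 8, 5, 3
theta = 0.1 * rng.normal(size=(num_words, dim))         # word vectors theta_w
beta = 0.1 * rng.normal(size=(num_contexts, dim))       # context vectors beta_c
p_context = np.full(num_contexts, 1.0 / num_contexts)   # toy stand-in for p_hat(c')

def pair_objective(w, c):
    # log p(g = 1 | w, c) plus k sampled negative terms log p(g = 0 | w, c').
    positive = np.log(sigmoid(theta[w].dot(beta[c])))
    negatives = rng.choice(num_contexts, size=k, p=p_context)
    negative = sum(np.log(sigmoid(-theta[w].dot(beta[cn]))) for cn in negatives)
    return positive + negative

print(pair_objective(w=2, c=5))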
Analogies
Differences in context vectors capture relations, e.g. vec(king) − vec(man) + vec(woman) ≈ vec(queen).
Intuition:
• Helps less given large amounts of labeled data, but doesn’t mean unsupervised learning is solved — quite the opposite!
Prior knowledge
Max-pooling
AlexNet
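The max-pooling slide survives only as a title, so here is a small NumPy sketch of non-overlapping 2×2 max-pooling on a feature map, one way convolutional networks such as AlexNet build in prior knowledge about local invariance; the pooling size is an assumption:

import numpy as np

def max_pool_2x2(feature_map):
    # Take the maximum over each non-overlapping 2x2 block of a 2-D feature map.
    h, w = feature_map.shape
    blocks = feature_map[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
print(max_pool_2x2(fm))    # [[4 2] [2 8]]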
Paris Talks Set Stage for Action as Risks to the Climate Rise
No conditional independence assumptions!
[Figure: recurrent neural network unrolled over the inputs x1, x2, x3, x4]
h1 = Encode(x1)
x2 ∼ Decode(h1)
h2 = Encode(h1, x2)
x3 ∼ Decode(h2)
h3 = Encode(h2, x3)
x4 ∼ Decode(h3)
h4 = Encode(h3, x4)
Update context vector: ht = Encode(ht−1, xt)
Predict next character: xt+1 ∼ Decode(ht)
The context ht compresses x1, ..., xt
Encode(ht−1, xt) = σ(V ht−1 + W xt) = ht
Decode(ht) = softmax(W′ ht) = p(xt+1)
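A hedged NumPy sketch of one Encode step and one Decode step with these parameterizations; the vocabulary size, hidden size, one-hot input encoding, and random parameters are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 27, 16                               # toy character vocabulary and hidden size
V = 0.1 * rng.normal(size=(hidden, hidden))
W = 0.1 * rng.normal(size=(hidden, vocab))
W2 = 0.1 * rng.normal(size=(vocab, hidden))          # W' in the slides

def encode(h_prev, x_t):
    # h_t = sigma(V h_{t-1} + W x_t), with a logistic sigma.
    return 1.0 / (1.0 + np.exp(-(V.dot(h_prev) + W.dot(x_t))))

def decode(h_t):
    # p(x_{t+1}) = softmax(W' h_t).
    z = W2.dot(h_t)
    p = np.exp(z - z.max())                          # shift for numerical stability
    return p / p.sum()

h = np.zeros(hidden)
x = np.eye(vocab)[3]                                 # one-hot encoding of character 3
h = encode(h, x)
print(decode(h).sum())                               # distribution over the next character sums to 1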
[Figure: unrolled computation of h5 from x1, ..., x5]
If V = 0.1 (with x1 = 1, x2 = · · · = 0 and σ = identity), then:
• Value: ht = 0.1^(t−1) W
• Gradient: ∂ht/∂W = 0.1^(t−1) (vanishes as the length increases)
Additive combinations
[Figure: hidden states h1, ..., h5 over inputs x1, ..., x5]
What if:
ht = ht−1 + W xt
Then (set x1 = 1, x2 = x3 = · · · = 0, σ = identity function):
• Value: ht = W
• Gradient: ∂ht/∂W = 1 for any t
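A small numeric check of both claims, using scalars with σ = identity, x1 = 1, and all later inputs zero; the chain length T is an arbitrary choice:

# Multiplicative update h_t = V h_{t-1} + W x_t versus additive update h_t = h_{t-1} + W x_t.
T, V, W = 20, 0.1, 1.0
x = [1.0] + [0.0] * (T - 1)            # x1 = 1, the rest 0
h_mult, grad_mult = 0.0, 0.0           # hidden value and d h_t / d W, multiplicative update
h_add, grad_add = 0.0, 0.0             # same quantities for the additive update
for x_t in x:
    h_mult, grad_mult = V * h_mult + W * x_t, V * grad_mult + x_t
    h_add, grad_add = h_add + W * x_t, grad_add + x_t
print(grad_mult)   # 0.1**(T-1): vanishes as the length increases
print(grad_add)    # 1.0 for any t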
Sample text generated character by character by a recurrent neural network:
Naturalism and decision for the majority of Arab countries’ capitalide was
grounded by the Irish language by [[John Clair]], [[An Imperial Japanese
Revolt]], associated with Guangzham’s sovereignty. His generals were
the powerful ruler of the Portugal in the [[Protestant Immineners]], which
could be said to be directly in Cantonese Communication, which followed
a ceremony and set inspired prison, training. The emperor travelled back
to [[Antioch, Perth, October 25—21]] to note, the Kingdom of Costa
Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]], known
in western [[Scotland]], near Italy to the conquest of India with the
conflict.
Sequence-to-sequence model
Motivation: machine translation
x: Je crains l’homme d’un seul livre.
y: Fear the man of one book.
[Figure: encoder RNN reads x1, x2, x3 into states h1, h2, h3; the decoder continues with states h4, h5, h6 and outputs y4, y5, y6]
Attention-based models
[Figure: same encoder-decoder, but each decoder state also attends to the encoder states h1, h2, h3]
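A hedged sketch of one attention step over the encoder states: the current decoder state scores each encoder state (plain dot products here, one common choice), a softmax turns the scores into weights, and the context is the weighted average of the encoder states; the dimensions are toy values:

import numpy as np

def attend(decoder_state, encoder_states):
    # Dot-product attention: weight encoder states by their similarity to the decoder state.
    scores = encoder_states.dot(decoder_state)        # one score per encoder position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over positions
    context = weights.dot(encoder_states)             # weighted average of h1, h2, h3
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 8))              # h1, h2, h3
decoder_state = rng.normal(size=8)                    # current decoder state
context, weights = attend(decoder_state, encoder_states)
print(weights, context.shape)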
Machine translation
Email responder
Image captioning
Final remarks
Sequence-to-sequence encoder-decoder and attention-based models
[Figure: x → fθ → y]
Data: