
Lecture 19: Deep Learning

CS221 / Autumn 2015 / Liang


Google Trends
Query: deep learning

CS221 / Autumn 2015 / Liang 1


[figure from Li Deng]

Speech recognition (2009-2011)

• Steep drop in word error rate (WER) due to deep learning


• IBM, Google, Microsoft all switched over from GMM-HMM
CS221 / Autumn 2015 / Liang 2
[Krizhevsky et al., 2012, a.k.a. AlexNet]

Object recognition (2012)

• Landslide win in ImageNet competition


• Computer vision community switched to CNNs

CS221 / Autumn 2015 / Liang 3


[figure from Honglak Lee]

What is deep learning?


Philosophy: learn high-level abstractions automatically

CS221 / Autumn 2015 / Liang 4


A brief history
• 1950-60s: modeling brain using neural networks (Rosenblatt,
Hebb, etc.)
• 1969: research stagnated after Minsky and Papert’s paper
• 1986: popularization of backpropagation by Rumelhart, Hinton,
Williams
• 1990s: convolutional neural networks (LeCun)
• 1990s: recurrent neural networks (Schmidhuber)
• 2006: revival of deep networks (Hinton et al.)
• 2013-: massive industrial interest
Key problem: it was difficult to get the training of multi-layer neural networks to work!

CS221 / Autumn 2015 / Liang 5


What’s different today
Computation (time/memory) Information (data)

Deep learning is fundamentally empirical

CS221 / Autumn 2015 / Liang 6


Roadmap

Supervised learning

Unsupervised learning

Convolutional neural networks

Recurrent neural networks

Final remarks

CS221 / Autumn 2015 / Liang 7


Framework
Dtrain → Learner → f, where f maps an input x to a prediction

The Learner consists of an optimization problem and an optimization algorithm

CS221 / Autumn 2015 / Liang 8


Review: optimization
Regression:
Loss(x, y, θ) = (fθ(x) − y)²
Key idea: minimize training loss
TrainLoss(θ) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, θ)

min_{θ∈ℝᵈ} TrainLoss(θ)

Algorithm: stochastic gradient descent

For t = 1, . . . , T :
For (x, y) ∈ Dtrain :
θ ← θ − ηt ∇θ Loss(x, y, θ)

CS221 / Autumn 2015 / Liang 9
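A minimal NumPy sketch of this SGD loop for the squared loss with a linear predictor fθ(x) = w · x (not from the lecture; the synthetic data, step size, and number of passes are illustrative assumptions):

import numpy as np

# Synthetic regression data (illustrative assumption): y = x . w_true + noise
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Stochastic gradient descent on the squared loss for f_theta(x) = w . x
w = np.zeros(d)
eta = 0.1                        # step size (assumed constant here)
for t in range(20):              # T passes over the training set
    for x_i, y_i in zip(X, y):
        residual = w @ x_i - y_i          # f_theta(x) - y
        grad = 2 * residual * x_i         # gradient of (f_theta(x) - y)^2
        w = w - eta * grad

print("learned w:", w)           # should be close to w_true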


Review: linear predictors
[diagram: inputs x1, x2, x3 connected by weights w to the output fθ(x)]

Output:

fθ (x) = w · x

Parameters: θ = w

CS221 / Autumn 2015 / Liang 10


Review: neural networks
[diagram: inputs x1, x2, x3 feed through weights V and σ into hidden units h1, h2, which connect by weights w to the output fθ(x)]

Intermediate hidden units:
hj(x) = σ(vj · x),  where σ(z) = 1/(1 + e^(−z))
Output:
fθ (x) = w · h(x)
Parameters: θ = (V, w)

CS221 / Autumn 2015 / Liang 11


Summary so far
Neural network predictor: fθ (x) = w · σ(Vx)

Squared loss: Loss(x, y, θ) = (fθ (x) − y)2

Next step: compute the gradient ∇θ Loss(x, y, θ)

CS221 / Autumn 2015 / Liang 12


Basic building blocks
a + b:  ∂/∂a = 1,  ∂/∂b = 1
a − b:  ∂/∂a = 1,  ∂/∂b = −1
a · b:  ∂/∂a = b,  ∂/∂b = a
max(a, b):  ∂/∂a = 1[a > b],  ∂/∂b = 1[a < b]
σ(a):  ∂/∂a = σ(a)(1 − σ(a))

CS221 / Autumn 2015 / Liang 13


Composing functions
in → function1 → mid → function2 → out

Chain rule:
∂out/∂in = (∂out/∂mid) · (∂mid/∂in)

CS221 / Autumn 2015 / Liang 14


Computing the gradient
 2
k
X
Loss(x, y, w) =  wj σ(vj · φ(x)) − y 
j=1

Assume labels {1, 2, 3} and correct label is y = 1


[computation graph: the squared loss expands into a tree — the residual Σ_j wj hj − y at the top, products wj · hj below, and hj = σ(vj · φ(x)) at the leaves; each edge is labeled with its local gradient, e.g. 2 · residual, hj(1 − hj), φ(x)]

CS221 / Autumn 2015 / Liang 15
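The following NumPy sketch walks the chain rule through a two-layer network w · σ(Vφ(x)) with the squared loss, mirroring the computation graph above; the toy dimensions and random data are assumptions for illustration, and a finite-difference check is included:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes and data (assumed): 3 features, k = 3 hidden units
rng = np.random.default_rng(0)
phi_x = rng.normal(size=3)            # phi(x)
y = 1.0
V = rng.normal(size=(3, 3))           # rows are v_j
w = rng.normal(size=3)

# Forward pass
h = sigma(V @ phi_x)                  # h_j = sigma(v_j . phi(x))
pred = w @ h                          # f_theta(x)
residual = pred - y
loss = residual ** 2

# Backward pass (chain rule, node by node as in the graph)
d_pred = 2 * residual                 # d loss / d pred
grad_w = d_pred * h                   # d pred / d w_j = h_j
d_h = d_pred * w                      # d pred / d h_j = w_j
d_z = d_h * h * (1 - h)               # sigma'(z) = sigma(z)(1 - sigma(z))
grad_V = np.outer(d_z, phi_x)         # d z_j / d v_j = phi(x)

# Numerical check of one entry (finite differences)
eps = 1e-6
V2 = V.copy(); V2[0, 0] += eps
loss2 = (w @ sigma(V2 @ phi_x) - y) ** 2
print(grad_V[0, 0], (loss2 - loss) / eps)   # should roughly agree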


[Andrej Karpathy’s demo]

CS221 / Autumn 2015 / Liang 16


Deep neural networks
1-layer neural network:
score = w · x

2-layer neural network:
score = w · σ(Vx)

3-layer neural network:
score = w · σ(Uσ(Vx))

...

CS221 / Autumn 2015 / Liang 17
Depth
x → h → h′ → h″ → h‴ → fθ(x)

Intuitions:
• Hierarchical feature representations
• Can simulate a bounded-computation logic circuit (original motivation from McCulloch/Pitts, 1943)
• Learn this computation (and potentially more, because networks are real-valued)
• Depth k + 1 logic circuits can represent more than depth k (counting argument)
• Formal theory/understanding is still incomplete

CS221 / Autumn 2015 / Liang 18


Supervised learning
• Construct deep neural networks by composing non-linearities (σ) and linear transformations (matrix multiplication)

• Train via SGD, using backpropagation to compute gradients

• Non-convex optimization, but works empirically given enough compute and data

CS221 / Autumn 2015 / Liang 19


Roadmap

Supervised learning

Unsupervised learning

Convolutional neural networks

Recurrent neural networks

Final remarks

CS221 / Autumn 2015 / Liang 20


Motivation
• Deep neural networks require lots of data

• Sometimes there is not very much labeled data, but plenty of unlabeled data (text, images, videos)

• Humans rarely get direct supervision; can we learn from raw sensory information?

CS221 / Autumn 2015 / Liang 21


Autoencoders
Analogy:
AAAABBBBB → "4 A's, 5 B's" → AAAABBBBB

Key idea: autoencoders

If we can compress a data point and still reconstruct it, then we have learned something generally useful.

General framework:
x → Encode → h → Decode → x̂

minimize ‖x − x̂‖²

CS221 / Autumn 2015 / Liang 22


Principal component analysis

Input: points x1, . . . , xn

Encode(x) = U⊤x
Decode(h) = Uh

(assume the xi's are mean zero and U is orthogonal)

PCA objective:
minimize Σ_{i=1}^n ‖xi − Decode(Encode(xi))‖²
CS221 / Autumn 2015 / Liang 23
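A small sketch of PCA as an autoencoder, using NumPy's SVD to obtain an orthogonal U; the random data and the choice d = 2 are assumptions:

import numpy as np

# Toy data (assumed): n mean-zero points in R^5, keep d = 2 components
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                # assume the x_i are mean zero

d = 2
# Columns of U are the top principal directions (orthogonal)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:d].T                          # shape (5, d)

def encode(x):
    return U.T @ x                    # h = U^T x

def decode(h):
    return U @ h                      # x_hat = U h

recon_error = sum(np.sum((x - decode(encode(x))) ** 2) for x in X)
print("PCA reconstruction error:", recon_error)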
Autoencoders
What if we increase the dimensionality of the hidden layer?
x → Encode → h → Decode → x̂

• Problem: we learn nothing, since Encode and Decode can just be set to the identity function!

• Need to control the complexity of Encode and Decode somehow...

CS221 / Autumn 2015 / Liang 24


Non-linear autoencoders
Non-linear transformation (e.g., logistic function):

Encode(x) = σ(Wx + b)
Decode(h) = σ(W′h + b′)

Loss function:
minimize ‖x − Decode(Encode(x))‖²
Key: non-linearity makes life harder, prevents degeneracy

CS221 / Autumn 2015 / Liang 25


Denoising autoencoders
Corrupt(x) → Encode → h → Decode → x̂

Types of noise:
• Blankout: Corrupt([1, 2, 3, 4]) = [0, 2, 3, 0]
• Gaussian: Corrupt([1, 2, 3, 4]) = [1.1, 1.9, 3.3, 4.2]
Objective:
minimize ‖x − Decode(Encode(Corrupt(x)))‖²
Algorithm: pick example, add fresh noise, SGD update
Key: noise makes life harder, prevents degeneracy

CS221 / Autumn 2015 / Liang 26
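A sketch of this training loop for a one-hidden-layer denoising autoencoder with blankout noise, written in plain NumPy; the sizes, step size, and corruption probability are assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, rng, p=0.25):
    # Blankout noise: zero each coordinate independently with probability p
    return x * (rng.random(x.shape) > p)

# Assumed sizes: 8-dimensional inputs in [0, 1], 4 hidden units
rng = np.random.default_rng(0)
X = rng.random((200, 8))
W, b = 0.1 * rng.normal(size=(4, 8)), np.zeros(4)     # encoder parameters
Wp, bp = 0.1 * rng.normal(size=(8, 4)), np.zeros(8)   # decoder parameters
eta = 0.5

for t in range(10):
    for x in X:
        x_tilde = corrupt(x, rng)                 # add fresh noise each time
        h = sigma(W @ x_tilde + b)                # Encode(Corrupt(x))
        x_hat = sigma(Wp @ h + bp)                # Decode(h)
        # SGD step on ||x - x_hat||^2 via the chain rule
        d_xhat = -2 * (x - x_hat) * x_hat * (1 - x_hat)
        d_h = (Wp.T @ d_xhat) * h * (1 - h)
        Wp -= eta * np.outer(d_xhat, h); bp -= eta * d_xhat
        W  -= eta * np.outer(d_h, x_tilde); b  -= eta * d_h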


[Figure 7 of Vincent et al. (2010)]

Denoising autoencoders
MNIST: 60,000 images of digits (784 dimensions)

200 learned filters (rows of W ):

CS221 / Autumn 2015 / Liang 27


Stacked denoising autoencoders
Goal: learn hierarchical features
Train first layer:
Corrupt(x) → Encode → h → Decode → x̃

Train second layer:
Corrupt(h) → Encode′ → h′ → Decode′ → h̃

...
Test time: Encode′(Encode(x))
CS221 / Autumn 2015 / Liang 28
Probabilistic models
So far:

Decode(Encode(x))

Probabilistic model: distribution over inputs and hidden states

p(x, h)

Two types:
• Restricted Boltzmann machines (Markov network)
• Deep belief network (Bayesian network)

For simplicity, assume x and h are binary vectors

CS221 / Autumn 2015 / Liang 29


Restricted Boltzmann machines
Markov network (factor graph) over hidden units h and visible units x

Sampling: h | x is easy, x is hard

p(x, h) ∝ exp(h⊤Wx + b⊤h + c⊤x)

Learning: SGD; the gradient requires summing over all (x, h)

Contrastive divergence: initialize x, take 1 step of Gibbs sampling

CS221 / Autumn 2015 / Liang 30
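A sketch of contrastive divergence (CD-1) for a small binary RBM, following the Gibbs-sampling recipe above; the sizes, learning rate, and random data are assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: binary x in {0,1}^6, binary h in {0,1}^3
rng = np.random.default_rng(0)
X = (rng.random((100, 6)) > 0.5).astype(float)
W = 0.01 * rng.normal(size=(3, 6))     # coupling in h^T W x
b = np.zeros(3)                        # hidden bias
c = np.zeros(6)                        # visible bias
eta = 0.05

for t in range(10):
    for x in X:
        # Positive phase: p(h | x) is easy (it factorizes)
        ph = sigma(W @ x + b)
        h = (rng.random(3) < ph).astype(float)
        # One step of Gibbs sampling: reconstruct x, then h again
        px1 = sigma(W.T @ h + c)
        x1 = (rng.random(6) < px1).astype(float)
        ph1 = sigma(W @ x1 + b)
        # CD-1 update: data statistics minus reconstruction statistics
        W += eta * (np.outer(ph, x) - np.outer(ph1, x1))
        b += eta * (ph - ph1)
        c += eta * (x - x1)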


Deep belief networks
Bayesian network over hidden units h and visible units x

Sampling: h | x is hard, x is easy

p(x | h) ∝ exp(h⊤Wx + b⊤h + c⊤x)

Learning: maximum likelihood is intractable, so use the same algorithm as for the RBM; repeat layer by layer to get deep (as with stacked denoising autoencoders)

CS221 / Autumn 2015 / Liang 31


Distributional semantics: warmup
The new design has ___ lines.

Let's try to keep the kitchen ___.

I forgot to ___ out the cabinet.

What does ___ mean?

CS221 / Autumn 2015 / Liang 32


Distributional semantics
The new design has ___ lines.

Observation: context can tell us a lot about word meaning

Autoencoding: predict x from x

Distributional methods: predict x from context

CS221 / Autumn 2015 / Liang 33


General recipe
1. Form a word-context matrix N of counts (data), with one row per word w and one column per context c

2. Perform dimensionality reduction (generalize): factor N into a matrix Θ whose rows are the word vectors θw ∈ ℝᵈ

CS221 / Autumn 2015 / Liang 34


[Deerwester/Dumais/Furnas/Landauer/Harshman, 1990]

Latent semantic analysis


Data:
Doc1: Cats have tails.
Doc2: Dogs have tails.
Matrix: context = documents that the word appears in

Doc1 Doc2
cats 1 0
dogs 0 1
have 1 1
tails 1 1

CS221 / Autumn 2015 / Liang 35


[Deerwester/Dumais/Furnas/Landauer/Harshman, 1990]

Latent semantic analysis


Dimensionality reduction: SVD

N ≈ Θ S V⊤ (rows indexed by words w, columns by documents c; the rows of Θ give the word vectors)

• Used for information retrieval

• Match query to documents in latent space rather than on keywords

CS221 / Autumn 2015 / Liang 36
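A small sketch of this recipe on the two-document corpus above, using a plain word-document count matrix and a truncated SVD; the choice d = 2 is arbitrary:

import numpy as np

# The toy corpus from the slide: context = documents a word appears in
docs = [["cats", "have", "tails"], ["dogs", "have", "tails"]]
vocab = sorted({w for doc in docs for w in doc})
N = np.zeros((len(vocab), len(docs)))          # word-document count matrix
for j, doc in enumerate(docs):
    for w in doc:
        N[vocab.index(w), j] += 1

# Dimensionality reduction via truncated SVD: N ~= Theta S V^T
d = 2
Theta, S, Vt = np.linalg.svd(N, full_matrices=False)
word_vectors = Theta[:, :d] * S[:d]            # rows are theta_w in R^d

for w in vocab:
    print(w, np.round(word_vectors[vocab.index(w)], 3))
# "have" and "tails" appear in exactly the same documents,
# so they end up with identical word vectors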


[Mikolov/Sutskever/Chen/Corrado/Dean, 2013 (word2vec)]

Skip-gram model with negative sampling


Data:
Cats and dogs have tails.
Matrix: context = words in a window

cats and dogs have tails


cats 0 1 1 0 0
and 1 0 1 1 0
dogs 1 1 0 1 1
have 0 1 1 0 1
tails 0 0 1 1 0

CS221 / Autumn 2015 / Liang 37


[Mikolov/Sutskever/Chen/Corrado/Dean, 2013 (word2vec)]

Skip-gram model with negative sampling


Cats are smarter than the best AI.

Dimensionality reduction: logistic regression with SGD

Model: predict good (w, c) using logistic regression

pθ(g = 1 | w, c) = (1 + exp(−θw · βc))^(−1)

Positives: (w, c) from data

Negatives: (w, c′) for irrelevant c′ (k times more)

+(cats, AI) −(cats, linguistics) −(cats, statistics)

CS221 / Autumn 2015 / Liang 38


[Mikolov/Sutskever/Chen/Corrado/Dean, 2013 (word2vec)]

Skip-gram model with negative sampling


Data distribution:

p̂(w, c) ∝ N (w, c)

Objective:
max_{θ,β} Σ_{w,c} p̂(w, c) log p(g = 1 | w, c) + k Σ_{w,c′} p̂(w) p̂(c′) log p(g = 0 | w, c′)

CS221 / Autumn 2015 / Liang 39
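A sketch of skip-gram with negative sampling as logistic regression trained by SGD; the tiny corpus, window size, dimensions, and learning rate are assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed setup: small vocabulary, d-dimensional word and context vectors
vocab = ["cats", "and", "dogs", "have", "tails"]
d, k, eta = 10, 2, 0.05
rng = np.random.default_rng(0)
theta = 0.1 * rng.normal(size=(len(vocab), d))   # word vectors theta_w
beta = 0.1 * rng.normal(size=(len(vocab), d))    # context vectors beta_c

def sgd_step(w, c, g):
    # One logistic-regression SGD step on pair (w, c) with label g in {0, 1}
    p = sigma(theta[w] @ beta[c])                # p_theta(g = 1 | w, c)
    grad = p - g                                 # d(-log-likelihood)/d(score)
    theta[w], beta[c] = (theta[w] - eta * grad * beta[c],
                         beta[c] - eta * grad * theta[w])

corpus = [0, 1, 2, 3, 4]                         # "cats and dogs have tails"
for t in range(50):
    for i, w in enumerate(corpus):
        for j in range(max(0, i - 1), min(len(corpus), i + 2)):
            if j == i:
                continue
            sgd_step(w, corpus[j], 1)            # positive: observed (w, c)
            for _ in range(k):                   # k negatives with random c'
                sgd_step(w, rng.integers(len(vocab)), 0)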


2D visualization of word vectors

CS221 / Autumn 2015 / Liang 40


[Mikolov/Yih/Zweig, 2013; Levy/Goldberg, 2014]

Analogies
Differences in context vectors capture relations:

θking − θman ≈ θqueen − θwoman (gender)

θfrance − θfrench ≈ θmexico − θspanish (language)

θcar − θcars ≈ θapple − θapples (plural)

Intuition: the context features behave roughly like

θking ≈ [crown, he], θman ≈ [he], θqueen ≈ [crown, she], θwoman ≈ [she]

so both differences θking − θman and θqueen − θwoman pick out roughly [crown]

CS221 / Autumn 2015 / Liang 41
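A sketch of answering analogy queries by nearest neighbor (cosine similarity) to θb − θa + θc; the vectors and vocab arguments are assumed to come from any of the embedding methods above:

import numpy as np

def analogy(a, b, c, vectors, vocab):
    # Return the word whose vector is closest to theta_b - theta_a + theta_c
    target = (vectors[vocab.index(b)] - vectors[vocab.index(a)]
              + vectors[vocab.index(c)])
    best, best_sim = None, -np.inf
    for w in vocab:
        if w in (a, b, c):
            continue                              # exclude the query words
        v = vectors[vocab.index(w)]
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Usage (hypothetical vectors/vocab from a trained model):
# analogy("man", "king", "woman", vectors, vocab)   # ideally returns "queen"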


Unsupervised learning
• Principle: make up prediction tasks (e.g., predict x from x itself or from its context)

• Hard task → pressure to learn something

• Loss minimization using SGD

• Discriminative fine-tuning: initialize a feedforward neural network and backpropagate to optimize task accuracy

• Helps less given large amounts of labeled data, but that doesn't mean unsupervised learning is solved; quite the opposite!

CS221 / Autumn 2015 / Liang 42


Roadmap

Supervised learning

Unsupervised learning

Convolutional neural networks

Recurrent neural networks

Final remarks

CS221 / Autumn 2015 / Liang 43


Motivation
[diagram: an image x densely connected to hidden units through a full weight matrix W]

• Observation: images are not arbitrary vectors

• Goal: leverage spatial structure of images (translation invariance)

CS221 / Autumn 2015 / Liang 44


[figure from Andrej Karpathy]

Prior knowledge

• Local connectivity: each hidden unit operates on a local image patch (3 instead of 7 connections per hidden unit)

• Parameter sharing: the processing of each image patch is the same (3 parameters instead of 3 · 5)

• Intuition: try to match a pattern in the image

CS221 / Autumn 2015 / Liang 45


Fully-connected: every hidden unit is connected to the entire input.

Convolutional: each depth column is produced from a localized region (in height/width).

[Andrej Karpathy’s demo]


CS221 / Autumn 2015 / Liang 46
[figure from Andrej Karpathy]

Max-pooling

• Intuition: test whether a pattern exists somewhere in a neighborhood

• Reduce computation, prevent overfitting

CS221 / Autumn 2015 / Liang 47
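A NumPy sketch of a single convolution (shared kernel, local connectivity) followed by 2x2 max-pooling; the toy image and kernel are assumptions:

import numpy as np

def conv2d_valid(image, kernel):
    # Slide one shared kernel over the image (parameter sharing, local connectivity)
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    # Keep the max in each size-by-size block: "does the pattern occur nearby?"
    H, W = x.shape
    H, W = H - H % size, W - W % size
    return x[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# Toy example (assumed): a 6x6 image and a 3x3 vertical-edge detector
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.]] * 3)
feature_map = conv2d_valid(image, kernel)   # 4x4
pooled = max_pool(feature_map)              # 2x2
print(feature_map.shape, pooled.shape)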


Example of function evaluation

[Andrej Karpathy’s demo]

CS221 / Autumn 2015 / Liang 48


[Krizhevsky et al., 2012, a.k.a. AlexNet]

AlexNet

• Non-linearity: use ReLU (max(z, 0)) instead of the logistic function

• Data augmentation: translation, horizontal reflection, varying intensity; dropout (guards against overfitting)
• Computation: parallelize across two GPUs (6 days)
• Impressive results: 15% error; next best was 25%!

CS221 / Autumn 2015 / Liang 49


Summary
• Intuition: spatial regularity across the input

• Key idea: locality and parameter sharing

• Dominant in computer vision

• Applications to text classification and speech recognition

CS221 / Autumn 2015 / Liang 50


Roadmap

Supervised learning

Unsupervised learning

Convolutional neural networks

Recurrent neural networks

Final remarks

CS221 / Autumn 2015 / Liang 51


Motivation
Model sequences (sentences):

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12

Paris Talks Set Stage for Action as Risks to the Climate Rise

Goal: rich probabilistic model

p(x1 )p(x2 | x1 )p(x3 | x1 , x2 )p(x4 | x1 , x2 , x3 ) · · ·

No conditional independence assumptions!

CS221 / Autumn 2015 / Liang 52


Recurrent neural networks
[diagram: h1 → h2 → h3 → h4 with inputs x1, x2, x3, x4]

Unrolled:
h1 = Encode(x1)
x2 ∼ Decode(h1)
h2 = Encode(h1, x2)
x3 ∼ Decode(h2)
h3 = Encode(h2, x3)
x4 ∼ Decode(h3)
h4 = Encode(h3, x4)

Update context vector: ht = Encode(ht−1, xt)
Predict next character: xt+1 ∼ Decode(ht)
The context ht compresses x1, . . . , xt

CS221 / Autumn 2015 / Liang 53


[Elman, 1990]

Simple recurrent network


[diagram: h1 → h2 → h3 → h4 with inputs x1, x2, x3, x4]

Encode(ht−1, xt) = σ(V xt + W ht−1) = ht

Decode(ht): sample xt+1 ∼ softmax(W′ ht) = p(xt+1)

CS221 / Autumn 2015 / Liang 54
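A sketch of this simple recurrent network in NumPy, with one-hot inputs; the vocabulary size, hidden dimension, and random parameters are assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed sizes: vocabulary of 5 symbols (one-hot x_t), hidden dimension 8
rng = np.random.default_rng(0)
vocab_size, hidden = 5, 8
V = 0.1 * rng.normal(size=(hidden, vocab_size))   # input-to-hidden
W = 0.1 * rng.normal(size=(hidden, hidden))       # hidden-to-hidden
Wp = 0.1 * rng.normal(size=(vocab_size, hidden))  # hidden-to-output

def encode(h_prev, x_t):
    return sigma(V @ x_t + W @ h_prev)            # h_t

def decode(h_t):
    return softmax(Wp @ h_t)                      # distribution over x_{t+1}

# Run the recurrence over a short sequence of one-hot inputs
h = np.zeros(hidden)
for idx in [0, 3, 1, 4]:
    x = np.eye(vocab_size)[idx]
    h = encode(h, x)
    p_next = decode(h)
print(np.round(p_next, 3))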


Vanishing gradient problem
[unrolled chain: h1 → h2 → h3 → h4 → h5 with inputs x1, . . . , x5]

(set x1 = 1, x2 = x3 = · · · = 0, σ = identity function)

h5 = V · V · V · V · W x1 = V⁴ W

If V = 0.1, then
• Value: ht = 0.1^(t−1) W
• Gradient: ∂ht/∂W = 0.1^(t−1) (vanishes as the length increases)
CS221 / Autumn 2015 / Liang 55
Additive combinations
[unrolled chain: h1 → h2 → h3 → h4 → h5 with inputs x1, . . . , x5]

What if:
ht = ht−1 + W xt
Then:
(set x1 = 1, x2 = x3 = · · · = 0, σ = identity function)

• Value: ht = W
• Gradient: ∂ht/∂W = 1 for any t

CS221 / Autumn 2015 / Liang 56


[Hochreiter & Schmidhuber, 1997]

Long Short Term Memory (LSTM)


API:
(ht , ct ) = LSTM(ht−1 , ct−1 , xt )
Input gate:
it = σ(Wi xt + Ui ht−1 + Vi ct−1 + bi )
Forget gate (initialize with bf large, so close to 1):
ft = σ(Wf xt + Uf ht−1 + Vf ct−1 + bf )
Cell: additive combination of RNN update with previous cell
ct = it tanh(Wc xt + Uc ht−1 + bc ) + ft ct−1
Output gate:
ot = σ(Wo xt + Uo ht−1 + Vo ct + bo )
Hidden state:
ht = ot tanh(ct )

CS221 / Autumn 2015 / Liang 57
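A sketch of this LSTM cell in NumPy, following the gate equations above (including the large forget-gate bias); the dimensions and random parameters are assumptions:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_lstm(xdim, hdim, rng):
    # Parameters for the gates above; b_f initialized large so f_t starts near 1
    def mats():
        return (0.1 * rng.normal(size=(hdim, xdim)),
                0.1 * rng.normal(size=(hdim, hdim)),
                0.1 * rng.normal(size=(hdim, hdim)))
    return {"i": (*mats(), np.zeros(hdim)),
            "f": (*mats(), 2.0 * np.ones(hdim)),     # large forget bias
            "o": (*mats(), np.zeros(hdim)),
            "c": (0.1 * rng.normal(size=(hdim, xdim)),
                  0.1 * rng.normal(size=(hdim, hdim)), None, np.zeros(hdim))}

def lstm_step(p, h_prev, c_prev, x_t):
    Wi, Ui, Vi, bi = p["i"]; Wf, Uf, Vf, bf = p["f"]
    Wo, Uo, Vo, bo = p["o"]; Wc, Uc, _, bc = p["c"]
    i = sigma(Wi @ x_t + Ui @ h_prev + Vi @ c_prev + bi)       # input gate
    f = sigma(Wf @ x_t + Uf @ h_prev + Vf @ c_prev + bf)       # forget gate
    c = i * np.tanh(Wc @ x_t + Uc @ h_prev + bc) + f * c_prev  # additive cell update
    o = sigma(Wo @ x_t + Uo @ h_prev + Vo @ c + bo)            # output gate
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
p = make_lstm(xdim=4, hdim=6, rng=rng)
h, c = np.zeros(6), np.zeros(6)
for x in rng.normal(size=(10, 4)):                             # a length-10 sequence
    h, c = lstm_step(p, h, c, x)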


[from Andrej Karpathy’s blog]

Character-level language modeling


Sampled output:

Naturalism and decision for the majority of Arab countries’ capitalide was
grounded by the Irish language by [[John Clair]], [[An Imperial Japanese
Revolt]], associated with Guangzham’s sovereignty. His generals were
the powerful ruler of the Portugal in the [[Protestant Immineners]], which
could be said to be directly in Cantonese Communication, which followed
a ceremony and set inspired prison, training. The emperor travelled back
to [[Antioch, Perth, October 25—21]] to note, the Kingdom of Costa
Rica, unsuccessful fashioned the [[Thrales]], [[Cynth’s Dajoard]], known
in western [[Scotland]], near Italy to the conquest of India with the
conflict.

CS221 / Autumn 2015 / Liang 58


[from Andrej Karpathy’s blog]

CS221 / Autumn 2015 / Liang 59


[Sutskever et al., 2014]

Sequence-to-sequence model
Motivation: machine translation
x: Je crains l'homme d'un seul livre.
y: Fear the man of one book.
[diagram: encoder states h1, h2, h3 read inputs x1, x2, x3; decoder states h4, h5, h6 produce outputs y4, y5, y6]

Read in a sentence first, output according to RNN:


ht = Encode(ht−1 , xt or yt−1 ), yt = Decode(ht )
CS221 / Autumn 2015 / Liang 60
Attention-based models

Motivation: long sentences — compress to finite dimensional vector?

Eine Folge von Ereignissen bewirkte, dass aus Beethovens Studienreise


nach Wien ein dauerhafter und endgültiger Aufenthalt wurde. Kurz nach
Beethovens Ankunft, am 18. Dezember 1792, starb sein Vater. 1794
besetzten französische Truppen das Rheinland, und der kurfürstliche Hof
musste fliehen.

Key idea: attention

Learn to look back at your notes.

CS221 / Autumn 2015 / Liang 61


[Bahdanau et al., 2015]

Attention-based models
[diagram: decoder states h4, h5, h6 attend back to encoder states h1, h2, h3 (inputs x1, x2, x3) while producing y4, y5, y6]

Distribution over input positions:
αt = softmax([Attend(h1, ht−1), . . . , Attend(hL, ht−1)])

Generate with the attended input:
ht = Encode(ht−1, yt−1, Σ_{j=1}^L αt,j hj)

CS221 / Autumn 2015 / Liang 62
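A sketch of one attention step in NumPy; the Attend score function used here (a small additive MLP) is an assumption, since the slide leaves Attend abstract:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed: L encoder states h_1..h_L (rows of H), decoder state h_prev, both dim d
rng = np.random.default_rng(0)
L, d = 6, 8
H = rng.normal(size=(L, d))              # encoder hidden states
h_prev = rng.normal(size=d)              # previous decoder state
Wa = 0.1 * rng.normal(size=(d, 2 * d))   # parameters of the assumed Attend function
va = 0.1 * rng.normal(size=d)

def attend(h_j, h_prev):
    # One common choice (an assumption here): a small MLP score
    return va @ np.tanh(Wa @ np.concatenate([h_j, h_prev]))

alpha = softmax(np.array([attend(H[j], h_prev) for j in range(L)]))
context = alpha @ H                      # sum_j alpha_j h_j, fed into Encode(...)
print(np.round(alpha, 3), context.shape)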


[Bahdanau et al., 2015]

Machine translation

CS221 / Autumn 2015 / Liang 63


[Google, 2015]

Email responder

CS221 / Autumn 2015 / Liang 64


[Xu et al., 2015]

Image captioning

CS221 / Autumn 2015 / Liang 65


Summary
• Recurrent neural networks: model sequences (a non-linear version of a Kalman filter or HMM)

• Logic intuition: learning a program with a for loop (reduce)

• LSTMs mitigate the vanishing gradient problem

• Attention-based models: useful when only part of the input is relevant at a time

• Newer models with "external memory": memory networks, neural Turing machines

CS221 / Autumn 2015 / Liang 66


Roadmap

Supervised learning

Unsupervised learning

Convolutional neural networks

Recurrent neural networks

Final remarks

CS221 / Autumn 2015 / Liang 67


Computation
...wait for a long time...
Better optimization algorithms: SGD, SGD+momentum, AdaGrad, AdaDelta, Nesterov, Adam
Buy GPUs:

...wait for a long time...

CS221 / Autumn 2015 / Liang 68


Theory: why does it work?
Two questions:
• Approximation: why are neural networks good hypothesis classes?
• Optimization: why can SGD optimize a high-dimensional non-convex problem?
Partial answers:
• 1-layer neural networks can approximate any continuous function on a compact set [Cybenko, 1989; Barron, 1993]
• Generating random features works too [Rahimi/Recht, 2009; Andoni et al., 2014]
• Use statistical physics to analyze loss surfaces [Choromanska et al., 2014]

CS221 / Autumn 2015 / Liang 69


Summary
Phenomenon → Ideas

Fixed vectors → feedforward NNs
Spatial structure → convolutional NNs
Sequences → recurrent NNs, LSTMs
Sequence-to-sequence → encoder-decoder, attention-based models
Unsupervised → belief networks, RBMs, autoencoders

CS221 / Autumn 2015 / Liang 70
References
Tutorials:
• http://deeplearning.net/tutorial/
• http://deeplearning.stanford.edu/tutorial/
• http://cs.stanford.edu/people/karpathy/convnetjs/
Software:
• Caffe (Berkeley): centered around computer vision
• Theano (Montreal); also see Keras: Python
• Torch (Facebook): fast, but write in Lua
• TensorFlow (Google): new!

CS221 / Autumn 2015 / Liang 71


Outlook
Extensibility: able to compose modules

LSTM Attend Encode

Learning programs: think about the analogy with a computer

x → fθ → y

Data:

reinforcement learning? unsupervised learning?

CS221 / Autumn 2015 / Liang 72
