
Energy­Based Models in Document

Recognition and Computer Vision

Yann LeCun
The Courant Institute of Mathematical Sciences
New York University
Collaborators:
Marc'Aurelio Ranzato, Sumit Chopra, Fu­Jie Huang, Y­Lan Boureau

See: [LeCun et al. 2006]: “A Tutorial on Energy­Based Learning”


[Ranzato et al. AI­Stats'07], [Ranzato et al. NIPS06], [Ranzato et al. ICDAR '07]
http://yann.lecun.com/exdb/publis/
Yann LeCun
The Challenges of Pattern Recognition,
Computer Vision, and Visual Neuroscience

How do we learn “invariant representations”?


From the image of an airplane, how do we extract a
representation that is invariant to pose, illumination,
background, clutter, object instance....
How can a human (or a machine) learn those
representations by just looking at the world?

How can we learn visual categories from just a few examples?


I don't need to see many airplanes before I can
recognize every airplane (even really weird ones)

Yann LeCun
Two Big Problems in Learning and Recognition

1. The “Normalization Problem” (a.k.a. the Partition Function Problem)


Give high probability (or low energy) to good answers
Give low probability (or high energy) to bad answers
There are too many bad answers!
The normalization constant of probabilistic models is a sum over too
many terms.

2. The “Deep Learning Problem”


Training “Deep Belief Networks” is a necessary step towards solving the
invariance problem in visual recognition (and perception in general).
How do we train deep architectures with lots of non-linear stages?

This talk has three parts:


Energy-Based learning: circumventing the intractable partition function
problem.
Supervised methods for deep visual learning: convolutional nets
Unsupervised methods to learn deep, invariant feature hierarchies:
“Deep belief networks”.
Yann LeCun
Part 1: Energy-Based Learning:
circumventing the intractable partition function problem
Highly popular methods in the Machine Learning and Natural Language
Processing Communities have their roots in Handwriting Recognition
Conditional Random Fields, and related learning models with
“structured outputs” are descendants of discriminative learning
methods for word-level handwriting recognition.

A Tutorial on Energy-Based Learning: [LeCun et al., 2006]

Discriminative Training for “Structured Output” models


The whole literature on discriminative speech recognition [1987-]
The whole literature on neural-net/HMM hybrids for speech [Bottou
1991, Bengio 1993, Haffner 1993, Bourlard 1994]
Graph Transformer Networks [LeCun & al. Proc IEEE 1998]
Conditional Random Fields [Lafferty & al 2001]
Max Margin Markov Nets [Altun & al 2003, Taskar & al 2003]

Yann LeCun
Energy­Based Model for Decision­Making

Model: measures the compatibility between an observed variable X and a variable to be predicted Y through an energy function E(Y,X).

Inference: search for the Y that minimizes the energy within a set of possible answers. If the set has low cardinality, we can use exhaustive search.
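To make the inference step concrete, here is a minimal sketch (not from the slides; the prototype-distance energy and the two-class answer set are made up for illustration):

    import numpy as np

    prototypes = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}

    def energy(y, x):
        # Toy energy: squared distance between a class prototype and the observation.
        return float(np.sum((prototypes[y] - x) ** 2))

    def infer(energy, x, answer_set):
        """Return the answer y* that minimizes E(y, x) over a small discrete set."""
        energies = [energy(y, x) for y in answer_set]
        return answer_set[int(np.argmin(energies))], min(energies)

    y_star, e_star = infer(energy, np.array([0.9, 0.1]), list(prototypes))
    print(y_star, e_star)   # -> cat, 0.02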

Yann LeCun
Complex Tasks: Inference is non­trivial

When the cardinality or dimension of Y is large, exhaustive search is impractical.

We need to use “smart” inference procedures: min-sum, Viterbi, min cut, belief propagation, gradient descent, ...

Yann LeCun
Converting Energies to Probabilities

Energies are uncalibrated


The energies of two separately-trained systems cannot be combined
The energies are uncalibrated (measured in arbitrary units)

How do we calibrate energies?


We turn them into probabilities (positive numbers that sum to 1).
Simplest way: the Gibbs distribution, P(Y|X) = exp(-β E(Y,X)) / ∫_y exp(-β E(y,X)), where the denominator is the partition function and β is the inverse temperature.
Other ways can be reduced to Gibbs by a suitable redefinition of the energy.
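A minimal numerical sketch (not from the slides; beta and the energies are arbitrary illustrative values) of turning energies over a small discrete answer set into Gibbs probabilities:

    import numpy as np

    def gibbs(energies, beta=1.0):
        """Convert energies E(y, x) over a discrete set of answers into P(y|x)."""
        # Subtract the minimum energy for numerical stability; Z is the partition function.
        unnormalized = np.exp(-beta * (energies - energies.min()))
        Z = unnormalized.sum()
        return unnormalized / Z

    energies = np.array([0.5, 2.0, 3.0])     # lower energy = better answer
    print(gibbs(energies, beta=1.0))          # highest probability on the 0.5-energy answer
    print(gibbs(energies, beta=100.0))        # large beta -> nearly a hard argmin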

Yann LeCun
Handwriting recognition
Sequence labeling

Integrated segmentation and recognition of sequences.
Each segmentation and recognition hypothesis is a path in a graph.
Inference = finding the shortest path in the interpretation graph.
Un-normalized hierarchical HMMs, a.k.a. Graph Transformer Networks [LeCun, Bottou, Bengio, Haffner 1998]

Yann LeCun
Latent Variable Models

The energy includes “hidden” variables Z whose value is never given to us

Yann LeCun
What can the latent variables represent?

Variables that would make the task easier if they were known:
Face recognition: the gender of the person, the orientation of
the face.
Object recognition: the pose parameters of the object
(location, orientation, scale), the lighting conditions.
Parts of Speech Tagging: the segmentation of the sentence
into syntactic units, the parse tree.
Speech Recognition: the segmentation of the sentence into
phonemes or phones.
Handwriting Recognition: the segmentation of the line into
characters.

In general, we will search for the value of the latent variable that
allows us to get an answer (Y) of smallest energy.

Yann LeCun
Probabilistic Latent Variable Models

Marginalizing over latent variables instead of minimizing.

Equivalent to traditional energy-based inference with a redefined energy function (a free energy that marginalizes over Z).

Reduces to traditional minimization over Z when Beta -> infinity
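For reference, the redefined energy in [LeCun et al. 2006] is the free energy F(Y,X) = -(1/β) log ∫_z exp(-β E(Y,z,X)) dz. A small sketch (illustrative energies only) of how it interpolates between marginalization and minimization over a discrete latent variable:

    import numpy as np
    from scipy.special import logsumexp

    def free_energy(energies_over_z, beta):
        """F(Y,X) = -(1/beta) * log sum_z exp(-beta * E(Y,z,X)) over a discrete latent Z."""
        return -logsumexp(-beta * energies_over_z) / beta

    E_yz = np.array([1.0, 2.5, 4.0])          # energies E(Y,z,X) for three latent values z
    for beta in (0.5, 1.0, 10.0, 1000.0):
        print(beta, free_energy(E_yz, beta))   # approaches min(E_yz) = 1.0 as beta grows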

Yann LeCun
Training an EBM

Training an EBM consists in shaping the energy function so that the energy of the correct answer is lower than the energies of all other answers.
Training sample: X = image of an animal, Y = “animal”

E(animal, X) < E(y, X)   ∀ y ≠ animal

Yann LeCun
Architecture and Loss Function

Family of energy functions: E = { E(W,Y,X) : W ∈ 𝒲 }

Training set: S = { (X^i, Y^i) : i = 1..P }

Loss functional / loss function: L(E,S) / L(W,S) measures the quality of an energy function on the training set.

Training: W* = argmin_W L(W,S)

Form of the loss functional (invariant under permutations and repetitions of the samples):

L(W,S) = (1/P) Σ_i L(Y^i, E(W, •, X^i)) + R(W)

where L(Y^i, E(W, •, X^i)) is the per-sample loss, Y^i the desired answer, E(W, •, X^i) the energy surface for a given X^i as Y varies, and R(W) a regularizer.
Yann LeCun
Designing a Loss Functional

Push down on the energy of the correct answer

Pull up on the energies of the incorrect answers, particularly if they are lower than that of the correct answer.
Yann LeCun
Architecture + Inference Algo + Loss Function = Model

1. Design an architecture: a particular form for E(W,Y,X).
2. Pick an inference algorithm for Y: MAP or conditional distribution, belief prop, min cut, variational methods, gradient descent, MCMC, HMC, ...
3. Pick a loss function: in such a way that minimizing it with respect to W over a training set will make the inference algorithm find the correct Y for a given X.
4. Pick an optimization method.

PROBLEM: What loss functions will make the machine approach


the desired behavior?

Yann LeCun
Examples of Loss Functions: Energy Loss

Energy Loss
Simply pushes down on the energy of the correct answer
For some architectures the energy loss WORKS; for others the energy surface COLLAPSES (becomes flat, giving low energy to every answer).
Yann LeCun
Negative Log­Likelihood Loss

Conditional probability of the samples (assuming independence)

Gibbs distribution:

We get the NLL loss by dividing by P and Beta:

Reduces to the perceptron loss when Beta­>infinity
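A minimal sketch (not the slides' notation) of the per-sample NLL loss over a discrete answer set, computed directly from energies; the contrastive term is the log partition function:

    import numpy as np
    from scipy.special import logsumexp

    def nll_loss(energies, correct_idx, beta=1.0):
        """L = E(W, Y_correct, X) + (1/beta) * log sum_y exp(-beta * E(W, y, X))."""
        log_Z = logsumexp(-beta * energies)
        return energies[correct_idx] + log_Z / beta

    energies = np.array([0.2, 1.5, 3.0])        # energy of each candidate answer
    print(nll_loss(energies, correct_idx=0))    # small when the correct answer has the lowest energy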


Yann LeCun
Negative Log­Likelihood Loss

Pushes down on the energy of the correct answer

Pulls up on the energies of all answers in proportion to their probability

Yann LeCun
Negative Log­Likelihood Loss

A probabilistic model is an EBM in which:


The energy can be integrated over Y (the variable to be predicted)
The loss function is the negative log-likelihood

Negative Log Likelihood Loss has been used for a long time in many
communities for discriminative learning with structured outputs
Speech recognition: many papers going back to the early 90's [Bengio 92], [Bourlard 94]. They call it “Maximum Mutual Information”.
Handwriting recognition [Bengio LeCun 94], [LeCun et al. 98]
Bio-informatics [Haussler]
Conditional Random Fields [Lafferty et al. 2001]
Lots more......
In all the above cases, it was used with non-linearly parameterized
energies.

Yann LeCun
A Simpler Loss Function: Perceptron Loss

Perceptron Loss [LeCun et al. 1998], [Collins 2002]


Pushes down on the energy of the correct answer
Pulls up on the energy of the machine's answer
Always positive. Zero when answer is correct
No “margin”: technically does not prevent the energy surface from
being almost flat.
Works pretty well in practice, particularly if the energy
parameterization does not allow flat surfaces.
This is often called “discriminative Viterbi training” in the
speech and handwriting literature
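As an illustration (a sketch over a discrete answer set, not from the slides), the perceptron loss is the energy of the correct answer minus the minimum energy over all answers:

    import numpy as np

    def perceptron_loss(energies, correct_idx):
        """L = E(W, Y_correct, X) - min_y E(W, y, X); zero when the machine's answer is correct."""
        return energies[correct_idx] - energies.min()

    print(perceptron_loss(np.array([0.2, 1.5, 3.0]), 0))  # 0.0 -> already correct
    print(perceptron_loss(np.array([2.0, 1.5, 3.0]), 0))  # 0.5 -> pushes down E(correct), pulls up E(argmin)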

Yann LeCun
A Better Loss Function: Generalized Margin Losses

First, we need to define the Most Offending Incorrect Answer

Most Offending Incorrect Answer: discrete case

Most Offending Incorrect Answer: continuous case

Yann LeCun
Examples of Generalized Margin Losses

Hinge Loss
[Altun et al. 2003], [Taskar et al. 2003]
With the linearly-parameterized binary
classifier architecture, we get linear SVMs

Log Loss
“soft hinge” loss
With the linearly-parameterized binary
classifier architecture, we get linear
Logistic Regression
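A sketch (illustrative margin and energies) of the generalized hinge loss using the most offending incorrect answer, i.e. the lowest-energy answer among the incorrect ones:

    import numpy as np

    def hinge_loss(energies, correct_idx, margin=1.0):
        """L = max(0, margin + E(correct) - E(most offending incorrect answer))."""
        incorrect = np.delete(energies, correct_idx)
        most_offending = incorrect.min()           # lowest energy among the incorrect answers
        return max(0.0, margin + energies[correct_idx] - most_offending)

    print(hinge_loss(np.array([0.2, 1.5, 3.0]), 0))  # 0.0: correct answer wins by at least the margin
    print(hinge_loss(np.array([0.2, 0.9, 3.0]), 0))  # 0.3: margin violated, loss is positive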

Yann LeCun
Examples of Margin Losses: Square­Square Loss

Square­Square Loss
[LeCun-Huang 2005]
Appropriate for positive energy
functions

Example: learning Y = X^2 — NO COLLAPSE!
Yann LeCun
Other Margin­Like Losses

LVQ2 Loss [Kohonen, Oja], [Driancourt & Bottou 1991]

Minimum Classification Error Loss [Juang, Chou, Lee 1997]

Square­Exponential Loss [Osadchy, Miller, LeCun 2004]

Yann LeCun
What Makes a “Good” Loss Function

Good and bad loss functions

Yann LeCun
Advantages/Disadvantages of various losses

Loss functions differ in how they pick the point(s) whose energy is
pulled up, and how much they pull them up

Losses with a log partition function in the contrastive term pull up all
the bad answers simultaneously.
This may be good if the gradient of the contrastive term can be
computed efficiently
This may be bad if it cannot, in which case we might as well use
a loss with a single point in the contrastive term

Variational methods pull up many points, but not as many as with the
full log partition function.

Efficiency of a loss/architecture: how many energies are pulled up for


a given amount of computation?
The theory for this is to be developed

Yann LeCun
Energy­Based Factor Graphs: Energy = Sum of “factors”

Sequence labeling: the output is a sequence Y1, Y2, Y3, Y4, ...
NLP parsing, machine translation, speech/handwriting recognition, biological sequence analysis.
The factors ensure grammatical consistency: they give low energy to consistent sub-sequences of output symbols.
The graph is generally simple (chain or tree), so inference is easy (dynamic programming, min-sum).
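A minimal min-sum (Viterbi) sketch for a chain-structured energy; the unary and pairwise energy tables are made-up illustrative values:

    import numpy as np

    def min_sum_chain(unary, pairwise):
        """unary: (T, K) energies per position/label; pairwise: (K, K) transition energies.
        Returns the minimum-energy label sequence (Viterbi decoding on a chain)."""
        T, K = unary.shape
        cost = unary[0].copy()                     # best energy of any path ending in each label
        backptr = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            total = cost[:, None] + pairwise + unary[t][None, :]   # indexed by (prev_label, label)
            backptr[t] = total.argmin(axis=0)
            cost = total.min(axis=0)
        labels = [int(cost.argmin())]
        for t in range(T - 1, 0, -1):
            labels.append(int(backptr[t, labels[-1]]))
        return labels[::-1], float(cost.min())

    unary = np.array([[0.1, 2.0], [1.5, 0.3], [0.2, 1.0]])   # 3 positions, 2 labels
    pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])             # switching labels costs 1
    print(min_sum_chain(unary, pairwise))                     # ([0, 0, 0], 1.8)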

Yann LeCun
Simple Energy­Based Factor Graphs with “Shallow” Factors

Linearly parameterized factors:

with the NLL loss: Lafferty's Conditional Random Fields
with the hinge loss: Taskar's Max Margin Markov Nets
with the perceptron loss: Collins's sequence labeling model
with the log loss: Altun/Hofmann's sequence labeling model

Yann LeCun
Deep/non­linear Factors for Speech and Handwriting

Trainable speech/handwriting recognition systems that integrate neural nets (or other “deep” classifiers) with dynamic time warping, Hidden Markov Models, or other graph-based hypothesis representations: training the feature extractor as part of the whole process.

with the LVQ2 loss: Driancourt and Bottou's speech recognizer (1991)
with NLL: Bengio's speech recognizer (1992), Haffner's speech recognizer (1993)
with Minimum Empirical Error loss: Ljolje and Rabiner (1990)
with NLL: Bengio (1992), Haffner (1993), Bourlard (1994)
with MCE: Juang et al. (1997)
Late normalization scheme (un-normalized HMM): Bottou pointed out the label bias problem (1991); Denker and Burges proposed a solution (1995)
Yann LeCun
Really Deep Factors &
implicit graphs: GTN

Handwriting Recognition with


Graph Transformer Networks

Un­normalized hierarchical
HMMs
Trained with Perceptron loss
[LeCun, Bottou, Bengio,
Haffner 1998]
Trained with NLL loss
[Bengio, LeCun 1994],
[LeCun, Bottou, Bengio,
Haffner 1998]

Answer = sequence of symbols

Latent variable = segmentation


Yann LeCun
Check Reader

Graph transformer network


trained to read check amounts.

Trained globally with


Negative­Log­Likelihood loss.

50% correct, 49% reject, 1% error (detectable later in the process).

Fielded in 1996, used in many


banks in the US and Europe.

Processes an estimated 10% of


all the checks written in the
US.

Yann LeCun
What's so bad about probabilistic models?

Why bother with a normalization since we don't use it for decision making?
Why insist that P(Y|X) have a specific shape, when we only care about the position of its
minimum?
When Y is high­dimensional (or simply combinatorial), normalizing becomes intractable
(e.g. Language modeling, image restoration, large DoF robot control...).
A tiny number of models are pre­normalized (Gaussian, exponential family)
A very small number are easily normalizable
A large number have intractable normalization
A huuuge number can't be normalized at all (examples will be shown).
Normalization forces us to take into account areas of the space that we don't actually care
about because our inference algorithm never takes us there.
If we only care about making the right decisions, maximizing the likelihood solves a
much more complex problem than we have to.

Yann LeCun
EBM

Unlike traditional classifiers, EBMs can represent multiple alternative outputs


The normalization in probabilistic models is often an unnecessary aggravation,
particularly if the ultimate goal of the system is to make decisions.
EBMs with appropriate loss function avoid the necessity to compute the partition
function and its derivatives (which may be intractable)
EBMs give us complete freedom in the choice of the architecture that models the
joint “incompatibility” (energy) between the variables.
We can use architectures that are not normally allowed in the probabilistic
framework (like neural nets).
The inference algorithm that finds the most offending (lowest-energy) incorrect answer does not need to be exact: our model may give low energy to far-away regions of the landscape, but if our inference algorithm never finds those regions, they do not affect us. They do, however, affect normalized probabilistic models.

Yann LeCun
Part 2: Deep Supervised Learning for Vision:
The Convolutional Network Architecture

Convolutional Networks:
[LeCun et al., Neural Computation, 1988]
[LeCun et al., Proc IEEE 1998]

Face Detection and pose estimation with convolutional networks:


[Vaillant, Monrocq, LeCun, IEE Proc Vision, Image and Signal
Processing, 1994]
[Osadchy, Miller, LeCun, JMLR vol 8, May 2007]

Category­level object recognition with invariance to pose and lighting


[LeCun, Huang, Bottou, CVPR 2004]
[Huang, LeCun, CVPR 2005]

autonomous robot driving


[LeCun et al. NIPS 2005]
Yann LeCun
The Traditional Architecture for Recognition

Pre-processing / Feature Extraction (this part is mostly hand-crafted) -> Trainable Classifier

The raw input is pre­processed through a hand­crafted feature extractor

The trainable classifier is often generic (task independent)

Yann LeCun
End­to­End Learning

Trainable Feature Extraction -> Trainable Classifier

The entire system is integrated and trainable “end­to­end”.

In some of the models presented here, there will be no discernible


difference between the feature extractor and the classifier.

We can embed general prior knowledge about images into the


architecture of the system.

Yann LeCun
An Old Idea for Local Shift Invariance

[Hubel & Wiesel 1962]: architecture of the cat's visual cortex


simple cells detect local features
complex cells “pool” the outputs of simple cells within a
retinotopic neighborhood.
“Simple cells”: multiple convolutions producing retinotopic feature maps.
“Complex cells”: pooling/subsampling.

Yann LeCun
The Multistage Hubel­Wiesel Architecture

Building a complete artificial vision system:


Stack multiple stages of simple cells / complex cells layers
Higher stages compute more global, more invariant features
Stick a classification layer on top
[Fukushima 1971-1982]
neocognitron
[LeCun 1988-2007]
convolutional net
[Poggio 2002-2006]
HMAX
[Ullman 2002-2006]
fragment hierarchy
[Lowe 2006]
HMAX

QUESTION: How do we
find (or learn) the filters?

Yann LeCun
Convolutional Network

Hierarchical/multilayer: features get progressively more global, invariant, and numerous


dense features: feature detectors applied everywhere (no interest points)
broadly tuned (possibly invariant) features: sigmoid units are on half the time.
Global discriminative training: The whole system is trained “end­to­end” with a gradient­
based method to minimize a global loss function
Integrates segmentation, feature extraction, and invariant classification in one fell swoop.

Yann LeCun
Convolutional Net Architecture

input 1@32x32 -> 5x5 convolution -> Layer 1: 6@28x28 -> 2x2 pooling/subsampling -> Layer 2: 6@14x14 -> 5x5 convolution -> Layer 3: 12@10x10 -> 2x2 pooling/subsampling -> Layer 4: 12@5x5 -> 5x5 convolution -> Layer 5: 100@1x1 -> Layer 6: 10 outputs

Convolutional net for handwriting recognition (400,000 synapses)


Convolutional layers (simple cells): all units in a feature plane share the same weights
Pooling/subsampling layers (complex cells): for invariance to small distortions.
Supervised gradient­descent learning using back­propagation
The entire network is trained end­to­end. All the layers are trained simultaneously.
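A rough PyTorch sketch of a network in this style; the layer sizes follow the figure above, but the activation choice and the overall wiring are illustrative assumptions, not the original LeNet-5 implementation:

    import torch
    import torch.nn as nn

    class SmallConvNet(nn.Module):
        """Convolution layers (simple cells) alternating with subsampling (complex cells)."""
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5),    # 1@32x32  -> 6@28x28
                nn.Tanh(),
                nn.AvgPool2d(2),                   # 6@28x28  -> 6@14x14
                nn.Conv2d(6, 12, kernel_size=5),   # 6@14x14  -> 12@10x10
                nn.Tanh(),
                nn.AvgPool2d(2),                   # 12@10x10 -> 12@5x5
                nn.Conv2d(12, 100, kernel_size=5), # 12@5x5   -> 100@1x1
                nn.Tanh(),
            )
            self.classifier = nn.Linear(100, n_classes)   # Layer 6: 10 outputs

        def forward(self, x):
            h = self.features(x)
            return self.classifier(h.flatten(1))

    net = SmallConvNet()
    print(net(torch.zeros(1, 1, 32, 32)).shape)   # torch.Size([1, 10])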

Yann LeCun
MNIST Handwritten Digit Dataset

Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples

Yann LeCun
Results on MNIST Handwritten Digits
CLASSIFIER | DEFORMATION | PREPROCESSING | ERROR (%) | Reference
linear classifier (1-layer NN) | | none | 12.00 | LeCun et al. 1998
linear classifier (1-layer NN) | | deskewing | 8.40 | LeCun et al. 1998
pairwise linear classifier | | deskewing | 7.60 | LeCun et al. 1998
K-nearest-neighbors (L2) | | none | 3.09 | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2) | | deskewing | 2.40 | LeCun et al. 1998
K-nearest-neighbors (L2) | | deskew, clean, blur | 1.80 | Kenneth Wilder, U. Chicago
K-NN L3, 2-pixel jitter | | deskew, clean, blur | 1.22 | Kenneth Wilder, U. Chicago
K-NN, shape context matching | | shape context feature | 0.63 | Belongie et al. IEEE PAMI 2002
40 PCA + quadratic classifier | | none | 3.30 | LeCun et al. 1998
1000 RBF + linear classifier | | none | 3.60 | LeCun et al. 1998
K-NN, Tangent Distance | | subsamp 16x16 pixels | 1.10 | LeCun et al. 1998
SVM, Gaussian Kernel | | none | 1.40 |
SVM deg 4 polynomial | | deskewing | 1.10 | LeCun et al. 1998
Reduced Set SVM deg 5 poly | | deskewing | 1.00 | LeCun et al. 1998
Virtual SVM deg-9 poly | Affine | none | 0.80 | LeCun et al. 1998
V-SVM, 2-pixel jittered | | none | 0.68 | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered | | deskewing | 0.56 | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE | | none | 4.70 | LeCun et al. 1998
2-layer NN, 300 HU, MSE | Affine | none | 3.60 | LeCun et al. 1998
2-layer NN, 300 HU | | deskewing | 1.60 | LeCun et al. 1998
3-layer NN, 500+150 HU | | none | 2.95 | LeCun et al. 1998
3-layer NN, 500+150 HU | Affine | none | 2.45 | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg | | none | 1.53 | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE | | none | 1.60 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Affine | none | 1.10 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE | Elastic | none | 0.90 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Elastic | none | 0.70 | Simard et al., ICDAR 2003
Convolutional net LeNet-1 | | subsamp 16x16 pixels | 1.70 | LeCun et al. 1998
Convolutional net LeNet-4 | | none | 1.10 | LeCun et al. 1998
Convolutional net LeNet-5 | | none | 0.95 | LeCun et al. 1998
Conv. net LeNet-5 | Affine | none | 0.80 | LeCun et al. 1998
Boosted LeNet-4 | Affine | none | 0.70 | LeCun et al. 1998
Conv. net, CE | Affine | none | 0.60 | Simard et al., ICDAR 2003
Conv. net, CE | Elastic | none | 0.40 | Simard et al., ICDAR 2003
Yann LeCun
Some Results on MNIST (from raw images: no preprocessing)

CLASSIFIER | DEFORMATION | ERROR (%) | Reference

Knowledge-free methods (a fixed permutation of the pixels would make no difference):
2-layer NN, 800 HU, CE | | 1.60 | Simard et al., ICDAR 2003
3-layer NN, 500+300 HU, CE, reg | | 1.53 | Hinton, in press, 2005
SVM, Gaussian Kernel | | 1.40 | Cortes 92 + many others
??? | | 0.95 |

Convolutional nets:
Convolutional net LeNet-5 | | 0.80 | Ranzato et al. NIPS 2006
Convolutional net LeNet-6 | | 0.70 | Ranzato et al. NIPS 2006
??? | | 0.60 |

Training set augmented with affine distortions:
2-layer NN, 800 HU, CE | Affine | 1.10 | Simard et al., ICDAR 2003
Virtual SVM deg-9 poly | Affine | 0.80 | Scholkopf
Convolutional net, CE | Affine | 0.60 | Simard et al., ICDAR 2003

Training set augmented with elastic distortions:
2-layer NN, 800 HU, CE | Elastic | 0.70 | Simard et al., ICDAR 2003
Convolutional net, CE | Elastic | 0.40 | Simard et al., ICDAR 2003
??? | | 0.39 |

Note: some groups have obtained good results with various amounts of preprocessing: [deCoste and Schoelkopf]
get 0.56% with an SVM on deskewed images; [Belongie] get 0.63% with “shape context” features;
[CENPARMI] get below 0.4% with features and SVM; [Liu] get 0.42% with features and SVM.
Yann LeCun
Invariance and Robustness to Noise

Yann LeCun
Recognizing Multiple Characters with Replicated Nets

Yann LeCun
Handwriting Recognition

Yann LeCun
Face Detection and Pose Estimation with Convolutional Nets
Training: 52,850, 32x32 grey­level images of faces, 52,850 non­faces.

Each sample: used 5 times with random variation in scale, in­plane rotation, brightness
and contrast.

2nd phase: half of the initial negative set was replaced by false positives of the initial
version of the detector .

Yann LeCun
Face Detection: Results

Data Set-> TILTED PROFILE MIT+CMU


False positives per image-> 4.42 26.9 0.47 3.36 0.5 1.28

Our Detector 90% 97% 67% 83% 83% 88%

Jones & Viola (tilted) 90% 95% x x

Jones & Viola (profile) x 70% 83% x

Rowley et al 89% 96% x

Schneiderman & Kanade 86% 93% x

Yann LeCun
Face Detection and Pose Estimation: Results

Yann LeCun
Face Detection with a Convolutional Net

Yann LeCun
Applying a ConvNet on Sliding Windows is Very Cheap!

A network with a 96x96 input window, applied to a 120x120 image, produces a 3x3 map of outputs.
Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales.
Convolutional nets can be replicated over large images very cheaply.
The network is applied to multiple scales spaced by a factor of 1.5.

Yann LeCun
Building a Detector/Recognizer:
Replicated Convolutional Nets

Computational cost for replicated convolutional net:


96x96 ­> 4.6 million multiply­accumulate operations
120x120 ­> 8.3 million multiply­accumulate operations
240x240 ­> 47.5 million multiply­accumulate operations
480x480 ­> 232 million multiply­accumulate operations
Computational cost for a non­convolutional detector of the
same size, applied every 12 pixels:
96x96 ­> 4.6 million multiply­accumulate operations
120x120 ­> 42.0 million multiply­accumulate operations
240x240 ­> 788.0 million multiply­accumulate operations
480x480 -> 5,083 million multiply-accumulate operations
(Non-convolutional detector: 96x96 window, 12-pixel shift, 84x84 overlap between successive windows.)
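A small PyTorch sketch of why replication is cheap (toy filter counts and a 32x32 window, not the face network above): a fully convolutional net applied to a larger image directly yields a grid of window scores, sharing all convolutions between overlapping windows:

    import torch
    import torch.nn as nn

    # Toy fully convolutional "detector": designed so a 32x32 input gives a 1x1 output.
    detector = nn.Sequential(
        nn.Conv2d(1, 8, 5), nn.Tanh(), nn.MaxPool2d(2),    # 32 -> 28 -> 14
        nn.Conv2d(8, 16, 5), nn.Tanh(), nn.MaxPool2d(2),   # 14 -> 10 -> 5
        nn.Conv2d(16, 1, 5),                                # 5 -> 1 (one score per window)
    )

    print(detector(torch.zeros(1, 1, 32, 32)).shape)   # [1, 1, 1, 1]: one 32x32 window, one score
    print(detector(torch.zeros(1, 1, 64, 64)).shape)   # [1, 1, 9, 9]: scores for 81 windows shifted
                                                        # by 4 pixels, computed in a single pass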
TV sport categorization (with Alex Niculescu, Cornell)

Classifying TV sports snapshots into 7 categories: auto racing, baseball,


basketball, bicycle, golf, soccer, football.

123,900 training images (300 sequences with 59 frames for each sport)

82,600 test images (200 sequences with 59 frames for each sport)

Preprocessing: convert to YUV, high­pass filter the Y component, crop,


subsample to 72x60 pixels

Results:
frame-level accuracy: 61% correct
Sequence-level accuracy 68% correct (simple voting scheme).

Yann LeCun
TV sport categorization (with Alex Niculescu, Cornell)

Yann LeCun
C. Elegans Embryo Phenotyping

[Ning et al. IEEE Trans. Image Processing, Nov 2005]


Analyzing results for Gene Knock­Out Experiments
Pipeline: Raw input -> ConvNet labeling -> CCPoE cleanup -> Elastic model fitting

CCPoE = Convolutional Conditional Product of Experts [Ning et al, IEEE TIP 2005]
(similar to Field of Experts [Roth & Black, CVPR 2005])
Visual Navigation for a Mobile Robot

[LeCun et al. NIPS 2005]

Mobile robot with two cameras


The convolutional net is trained to emulate
a human driver from recorded sequences of
video + human­provided steering angles.
The network maps stereo images to steering
angles for obstacle avoidance
LAGR: Learning Applied to Ground Robotics

Getting a robot to drive autonomously in


unknown terrain solely from vision (camera
input).
Our team (NYU/Net­Scale Technologies
Inc.) is one of 8 participants funded by
DARPA
All teams received identical robots and can
only modify the software (not the hardware)
The robot is given the GPS coordinates of a
goal, and must drive to the goal as fast as
possible. The terrain is unknown in advance.
The robot is run 3 times through the same
course.

Yann LeCun
Training a ConvNet On­line to detect obstacles
[Hadsell et al. Robotics Science and Systems 2007]

Raw image; traversability labels from stereo (12 meters); traversability labels from ConvNet (30 meters)

Yann LeCun
Generic Object Detection and Recognition
with Invariance to Pose and Illumination
50 toys belonging to 5 categories: animal, human figure, airplane, truck, car
10 instances per category: 5 instances used for training, 5 instances for testing
Raw dataset: 972 stereo pairs of each object instance; 48,600 image pairs total.

For each instance:
18 azimuths: 0 to 350 degrees, every 20 degrees
9 elevations: 30 to 70 degrees from horizontal, every 5 degrees
6 illuminations: on/off combinations of 4 lights
2 cameras (stereo): 7.5 cm apart, 40 cm from the object
Yann LeCun
Data Collection, Sample Generation

Image capture setup: objects are painted green so that:
- all features other than shape are removed
- objects can be segmented, transformed, and composited onto various backgrounds
(Figure: original image, object mask, shadow factor, composite image.)


Yann LeCun
Textured and Cluttered Datasets

Yann LeCun
Convolutional Network

Stereo input 2@96x96 -> 5x5 convolution (16 kernels) -> Layer 1: 8@92x92 -> 4x4 subsampling -> Layer 2: 8@23x23 -> 6x6 convolution (96 kernels) -> Layer 3: 24@18x18 -> 3x3 subsampling -> Layer 4: 24@6x6 -> 6x6 convolution (2400 kernels) -> Layer 5: 100 -> fully connected (500 weights) -> Layer 6: 5 outputs
90,857 free parameters, 3,901,162 connections.
The architecture alternates convolutional layers (feature detectors) and subsampling layers
(local feature pooling for invariance to small distortions).
The entire network is trained end­to­end (all the layers are trained simultaneously).
A gradient­based algorithm is used to minimize a supervised loss function.
Yann LeCun
Alternated Convolutions and Subsampling

“Simple cells”: multiple convolutions; local features are extracted everywhere.
“Complex cells”: the averaging/subsampling layer builds robustness to variations in feature locations.
Hubel/Wiesel '62, Fukushima '71, LeCun '89, Riesenhuber & Poggio '02, Ullman '02, ...

Yann LeCun
Normalized­Uniform Set: Error Rates

Linear Classifier on raw stereo images: 30.2% error.


K­Nearest­Neighbors on raw stereo images: 18.4% error.
K­Nearest­Neighbors on PCA­95: 16.6% error.
Pairwise SVM on 96x96 stereo images: 11.6% error
Pairwise SVM on 95 Principal Components: 13.3% error.
Convolutional Net on 96x96 stereo images: 5.8% error.



Yann LeCun
Normalized­Uniform Set: Learning Times

SVM: using a parallel implementation by Graf, Durdanovic, and Cosatto (NEC Labs).
Chop off the last layer of the convolutional net and train an SVM on it.
Yann LeCun
Jittered­Cluttered Dataset

Jittered­Cluttered Dataset:
291,600 stereo pairs for training, 58,320 for testing
Objects are jittered: position, scale, in­plane rotation, contrast, brightness,
backgrounds, distractor objects,...
Input dimension: 98x98x2 (approx 18,000)

Yann LeCun
Experiment 2: Jittered­Cluttered Dataset

291,600 training samples, 58,320 test samples


SVM with Gaussian kernel 43.3% error
Convolutional Net with binocular input: 7.8% error
Convolutional Net + SVM on top: 5.9% error
Convolutional Net with monocular input: 20.8% error
Smaller mono net (DEMO): 26.0% error
Dataset available from http://www.cs.nyu.edu/~yann
Yann LeCun
Jittered­Cluttered Dataset

The convex loss, VC bounds, and representer theorems don't seem to help: OUCH!
Chop off the last layer and train an SVM on it: it works!
Yann LeCun
What's wrong with SVMs? they are shallow!

SVM with Gaussian kernels is based on matching global templates.
It is a “shallow” architecture: there is no way to learn invariant recognition tasks with such naïve architectures (unless we use an impractically large number of templates).
The number of necessary templates grows exponentially with the number of dimensions of variation.
Global templates are in trouble when the variations include: category, instance shape, configuration (for articulated objects), position, azimuth, elevation, scale, illumination, texture, albedo, in-plane rotation, background luminance, background texture, background clutter, ...
(Architecture: Input -> global template matchers, each training sample being a template -> features (similarities) -> linear combination -> output.)

SVM is glorified template matching


Examples (Monocular Mode)

Yann LeCun
Learned Features

Layer 3
Layer 2

Layer 1
Input

Yann LeCun
Examples (Monocular Mode)

Yann LeCun
Natural Images (Monocular Mode)

Yann LeCun
Commercially Deployed applications of Convolutional Nets

Faxed form reader


Developed at AT&T Bell Labs in the early 90's
Commercially deployed in 1994

Check Reading system:


Developed at AT&T Bell Labs in the mid 90's
Commercially deployed by NCR in 1996
First practical system for reading handwritten checks
Read 10 to 20% of all the checks in the US in the late 90's

Face detector / Person detector / Intrusion detector


Developed at NEC Research Institute in 2002/2003
Commercially deployed in 2004 by Vidient Technologies
Used at San Francisco Airport (among others).

Yann LeCun
Supervised Convolutional Nets: Pros and Cons

Convolutional nets can be trained to perform a wide variety of visual


tasks.
Global supervised gradient descent can produce parsimonious
architectures

BUT: they require lots of labeled training samples


60,000 samples for handwriting
120,000 samples for face detection
25,000 to 350,000 for object recognition

Since low­level features tend to be non task specific, we should be able to


learn them unsupervised.

Hinton has shown that layer­by­layer unsupervised “pre­training” can be


used to initialize “deep” architectures
[Hinton & Salakhutdinov, Science 2006]

Can we use this idea to reduce the number of necessary labeled examples?
Yann LeCun
Models Similar to ConvNets

HMAX
[Poggio &
Riesenhuber
2003]
[Serre et al.
2007]
[Mutch and Lowe
CVPR 2006]

Difference?
the features are
not learned

HMAX is very
similar to
Fukushima's
Neocognitron
[from Serre et al. 2007]
Yann LeCun
Part 3:
Unsupervised Training of “Deep” Energy­Based Models,
Learning Invariant Feature Hierarchies

Why do we need Deep Learning?


“scaling learning algorithms towards AI” [Bengio and LeCun 2007]

Deep Belief Networks, Deep Learning


Stacked RBM [Hinton, Osindero, and Teh, Neural Comp 2006]
Stacked autoencoders [Bengio et al. NIPS 2006]
Stacked sparse features [Ranzato & al., NIPS 2006]
Improved stacked RBM [Salakhutdinov & Hinton, AI-Stats 07]

Unsupervised Learning of Invariant Feature Hierarchies


learning features for Caltech-101 [Ranzato et al. CVPR 2006]
learning features hierarchies for hand-writing [Ranzato et al ICDAR'07]

[See Marc'Aurelio Ranzato's poster on Wednesday]


Yann LeCun
Why do we need “Deep” Architectures?
[Bengio & LeCun 2007]
Conjecture: we won't solve the perception problem without solving the
problem of learning in deep architectures [Hinton]
Neural nets with lots of layers
Deep belief networks
Factor graphs with a “Markov” structure

We will not solve the perception problem with kernel machines


Kernel machines are glorified template matchers
You can't handle complicated invariances with templates (you would
need too many templates)

Many interesting functions are “deep”


Any function can be approximated with 2 layers (linear combination
of non-linear functions)
But many interesting functions are more efficiently represented with multiple layers
Stupid example: binary addition
Yann LeCun
The Basic Idea of Deep Learning
[Hinton et al. 2005 ­ 2007]
Unsupervised Training of Feature Hierarchy [Hinton et al. 2005 – 2007]
Each layer is designed to extract higher-level features from
lower-level ones
Each layer is trained unsupervised with a reconstruction criterion
The layers are trained one after the other, in sequence.

(Diagram: INPUT Y -> encoder/decoder with reconstruction-error cost -> LEVEL 1 FEATURES -> encoder/decoder with reconstruction-error cost -> LEVEL 2 FEATURES)
Yann LeCun
Encoder­Decoder Architecture for Unsupervised Learning

A principle on which unsupervised algorithms can be built is reconstruction of the input Y from a code (feature vector) Z:
reconstruction from compact feature vectors (e.g. PCA);
reconstruction from sparse overcomplete feature vectors [Olshausen & Field 1997], [Ranzato et al. NIPS 06];
approximation of data likelihood: Restricted Boltzmann Machine [Hinton 2005-...].

Reconstruction energy: E(Y,W) = min_Z E(Y,Z,W),   Z*(Y) = argmin_Z E(Y,Z,W)
(Diagram: INPUT Y -> ENCODER -> FEATURES (CODE) Z -> DECODER -> reconstruction cost against Y.)
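A toy sketch of the reconstruction energy E(Y,W) = min_Z E(Y,Z,W), assuming a linear decoder and squared error so that the inner minimization over the code has a closed form; dimensions and data are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 16, 4                      # input dimension, code dimension
    W = rng.normal(size=(D, K))       # decoder weights (columns = basis vectors)

    def reconstruction_energy(y, W):
        """E(Y, W) = min_Z ||W Z - Y||^2, solved in closed form by least squares."""
        z_star = np.linalg.lstsq(W, y, rcond=None)[0]
        return float(np.sum((W @ z_star - y) ** 2)), z_star

    y_in_span = W @ rng.normal(size=K)            # a point the decoder can reconstruct exactly
    y_random = rng.normal(size=D)                 # a generic point: higher energy
    print(reconstruction_energy(y_in_span, W)[0]) # ~0
    print(reconstruction_energy(y_random, W)[0])  # > 0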
Yann LeCun
What is Energy­Based Unsupervised Learning?
Probabilistic View:
Produce a probability density function that:
has high value in regions of high sample P(Y)
density
has low value everywhere else (integral=1)
Training: maximize the data likelihood
(intractable)

Energy­Based View: E(Y)


produce an energy function E(Y) that:
has low value in regions of high sample
density
has high(er) value everywhere else

Y
Yann LeCun
Unsupervised Training of Energy­Based Models

Basic Idea:
push down on the energy of training samples
pull up on the energy of everything else
but this is often intractable

Approximation #1: Contrastive Divergence [Hinton et al 2005]


Push down on the energy of the training samples
Pull up on the energies of configuration that have low energy near
the training samples (to create local minima of the energy
surface)

Approximation #2: Minimizing the information content of the code


[Ranzato et al. AI­Stats 2007]
Reduce the information content of the code by making it sparse
This has the effect of increasing the reconstruction error for non-
training samples.

Yann LeCun
Deep Learning for Non­Linear Dimensionality Reduction

Restricted Boltzmann Machine: simple energy function

E(Y,Z,W) = - Σ_ij Y_i W_ij Z_j

code units are binary and stochastic
training with contrastive divergence

From: [Hinton and Salakhutdinov, Science 2006]
Yann LeCun
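An illustrative sketch (toy sizes, bias terms omitted) of one contrastive-divergence (CD-1) update for an RBM with the energy above; this is a generic textbook recipe, not the authors' code:

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 6, 3                                    # visible units Y, hidden code units Z
    W = 0.01 * rng.normal(size=(D, K))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_step(Y, W, lr=0.1):
        """One CD-1 update on a batch Y of binary vectors (biases omitted for brevity)."""
        p_z = sigmoid(Y @ W)                        # P(Z=1 | Y)
        z = (rng.random(p_z.shape) < p_z) * 1.0     # sample the hidden code
        p_y_recon = sigmoid(z @ W.T)                # P(Y=1 | Z): the "reconstruction"
        p_z_recon = sigmoid(p_y_recon @ W)          # hidden probabilities for the reconstruction
        # Positive phase pushes down the energy of data; negative phase pulls up reconstructions.
        grad = (Y.T @ p_z - p_y_recon.T @ p_z_recon) / len(Y)
        return W + lr * grad

    Y = (rng.random((32, D)) < 0.5) * 1.0           # toy binary "data"
    for _ in range(10):
        W = cd1_step(Y, W)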


RBM: filters trained on MNIST

“bubble” detectors

Yann LeCun
Non­Linear Dimensionality Reduction: MNIST

[Hinton and Salakhutdinov, Science 2006]

[Salakhutdinov and Hinton, AI­Stats 2007]:


< 1.00% error on MNIST using K-NN on 30 dimensions:
BEST ERROR RATE OF ANY KNOWLEDGE-FREE METHOD!!!
Yann LeCun
Encoder/Decoder Architecture for
learning Sparse Feature Representations
Energy of decoder (reconstruction error): ||Wd f(Z) - X||^2
Energy of encoder (prediction error): ||Wc X - Z||^2
f is a sparsifying logistic applied to the code Z.

Algorithm:
1. Find the code Z that minimizes the reconstruction error AND is close to the encoder output.
2. Update the weights Wd of the decoder to decrease the reconstruction error.
3. Update the weights Wc of the encoder to decrease the prediction error.
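A rough sketch of the three-step procedure, with made-up sizes and step sizes and a plain logistic standing in for the sparsifying logistic; gradient steps are used both for the code inference and for the weight updates (not the authors' implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 64, 100                        # input size, (overcomplete) code size
    Wd = 0.1 * rng.normal(size=(D, K))    # decoder weights
    Wc = 0.1 * rng.normal(size=(K, D))    # encoder weights

    def f(z):                             # stand-in for the sparsifying logistic
        return 1.0 / (1.0 + np.exp(-z))

    def df(z):
        return f(z) * (1.0 - f(z))

    def train_step(x, Wd, Wc, n_infer=20, lr_z=0.1, lr_w=0.01):
        # 1. Inference: find Z minimizing ||Wd f(Z) - X||^2 + ||Wc X - Z||^2 by gradient
        #    descent, starting from the encoder's prediction.
        z = Wc @ x
        for _ in range(n_infer):
            recon_err = Wd @ f(z) - x
            grad_z = 2 * df(z) * (Wd.T @ recon_err) + 2 * (z - Wc @ x)
            z -= lr_z * grad_z
        # 2. Update the decoder to decrease the reconstruction error for this code.
        recon_err = Wd @ f(z) - x
        Wd -= lr_w * 2 * np.outer(recon_err, f(z))
        # 3. Update the encoder so that its output Wc X moves toward the inferred code Z.
        Wc -= lr_w * 2 * np.outer(Wc @ x - z, x)
        return Wd, Wc

    x = rng.normal(size=D)                # one toy input "patch"
    Wd, Wc = train_step(x, Wd, Wc)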
Yann LeCun
MNIST Dataset

Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples

Yann LeCun
Training on handwritten digits

60,000 28x28 images


196 units in the code
η = 0.01, β = 1 (sparsifying logistic)
learning rate 0.001
L1, L2 regularizer 0.005

Encoder “direct” filters learned on handwritten digits (MNIST).

Forward propagation through encoder and decoder: after training there is no need to minimize in code space.
Training The Layers of a Convolutional Net Unsupervised

Extract windows from the MNIST images

Train the sparse encoder/decoder on those windows

Use the resulting encoder weights as the convolution kernels of a


convolution network

Repeat the process for the second layer

Train the resulting network supervised.
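A schematic sketch of how the resulting encoder weights might be copied into the first convolution layer; the shapes (50 filters of size 5x5) match the experiment described next, but the weights here are random stand-ins for the trained ones:

    import numpy as np
    import torch
    import torch.nn as nn

    # Suppose a sparse encoder/decoder was trained on 5x5 patches and produced
    # 50 encoder filters of size 25 (stubbed here with random numbers).
    Wc = np.random.randn(50, 25).astype(np.float32)

    conv1 = nn.Conv2d(1, 50, kernel_size=5)            # first layer of the supervised conv net
    with torch.no_grad():
        conv1.weight.copy_(torch.from_numpy(Wc).reshape(50, 1, 5, 5))

    # conv1 now starts from unsupervised filters instead of a random initialization,
    # and the whole network is then fine-tuned with supervised gradient descent.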

Yann LeCun
Unsupervised Training of Convolutional Filters

CLASSIFICATION EXPERIMENTS
IDEA: improving supervised learning by pre-training supervised filters in first conv. layer
with the unsupervised method (*)
sparse representations & lenet6 (1->50->50->200->10)

The baseline: lenet6 initialized randomly


Test error rate: 0.70%. Training error rate: 0.01%.

unsupervised filters in first conv. layer


Experiment 1
Train on 5x5 patches to find 50 features
Use the scaled filters in the encoder to initialize the kernels in
the first convolutional layer
Test error rate: 0.60%. Training error rate: 0.00%.

Experiment 2
Same as experiment 1, but training set augmented by elastically distorted digits (random
initialization gives test error rate equal to 0.49%).
Test error rate: 0.39%. Training error rate: 0.23%.
(*)[Hinton, Osindero, Teh “A fast learning algorithm for deep belief nets” Neural Computaton 2006]
Best Results on MNIST (from raw images: no preprocessing)

CLASSIFIER | DEFORMATION | ERROR (%) | Reference

Knowledge-free methods:
2-layer NN, 800 HU, CE | | 1.60 | Simard et al., ICDAR 2003
3-layer NN, 500+300 HU, CE, reg | | 1.53 | Hinton, in press, 2005
SVM, Gaussian Kernel | | 1.40 | Cortes 92 + many others
Unsupervised Stacked RBM + backprop | | 0.95 | Hinton, Neur Comp 2006

Convolutional nets:
Convolutional net LeNet-5 | | 0.80 | Ranzato et al. NIPS 2006
Convolutional net LeNet-6 | | 0.70 | Ranzato et al. NIPS 2006
Conv. net LeNet-6 + unsup learning | | 0.60 | Ranzato et al. NIPS 2006

Training set augmented with affine distortions:
2-layer NN, 800 HU, CE | Affine | 1.10 | Simard et al., ICDAR 2003
Virtual SVM deg-9 poly | Affine | 0.80 | Scholkopf
Convolutional net, CE | Affine | 0.60 | Simard et al., ICDAR 2003

Training set augmented with elastic distortions:
2-layer NN, 800 HU, CE | Elastic | 0.70 | Simard et al., ICDAR 2003
Convolutional net, CE | Elastic | 0.40 | Simard et al., ICDAR 2003
Conv. net LeNet-6 + unsup learning | Elastic | 0.39 | Ranzato et al. NIPS 2006

Yann LeCun
MNIST Errors (0.42% error)

Yann LeCun
Training on natural image patches

Berkeley data set

100,000 12x12 patches


200 units in the code
η = 0.02, β = 1 (sparsifying logistic)
learning rate 0.001
L1 regularizer 0.001
fast convergence: < 30 min.
Natural image patches: Filters

200 decoder filters (reshaped columns of matrix Wd)


Learning Invariant Feature Hierarchies

Learning Shift Invariant Features

(Diagram, standard feature extractor: INPUT Y -> ENCODER -> FEATURES Z (CODE) -> DECODER -> reconstruction-error cost.
Diagram, invariant feature extractor: INPUT Y -> ENCODER -> INVARIANT FEATURES Z (CODE) plus TRANSFORMATION PARAMETERS U -> DECODER -> reconstruction-error cost.)

Yann LeCun
Learning Invariant Feature Hierarchies

Learning Shift Invariant Features

(a) The input image is convolved with the encoder filter bank, giving feature maps (e.g. 17x17).
(b) Max pooling over positions within each feature map yields the shift-invariant representation (e.g. the code “1001”), while the “switch” records the transformation parameters: the positions of the maxima.
(c) The decoder places its basis functions at the recorded positions (upsampling followed by convolutions).
(d) The result is the reconstruction of the input image.
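A small numpy sketch of the idea with toy sizes (cross-correlation written out directly, one max per feature map over the whole image): the encoder keeps each filter's best response and its location, and the decoder pastes the corresponding basis function back at that location:

    import numpy as np

    rng = np.random.default_rng(0)
    H, F, K = 12, 5, 4                       # image size, filter size, number of features
    filters = rng.normal(size=(K, F, F))     # encoder filter bank; reused as decoder basis functions
    image = rng.normal(size=(H, H))

    def encode(image, filters):
        """Per filter: max response (invariant feature) and its position (transformation parameter)."""
        n = H - F + 1
        features, positions = np.zeros(K), []
        for k in range(K):
            resp = np.array([[np.sum(filters[k] * image[i:i+F, j:j+F])
                              for j in range(n)] for i in range(n)])
            idx = np.unravel_index(resp.argmax(), resp.shape)
            features[k] = resp[idx]
            positions.append(idx)
        return features, positions

    def decode(features, positions, filters):
        """Place each basis function at its recorded position, scaled by the feature value."""
        recon = np.zeros((H, H))
        for k, (i, j) in enumerate(positions):
            recon[i:i+F, j:j+F] += features[k] * filters[k]
        return recon

    features, positions = encode(image, filters)
    print(features.shape, positions, decode(features, positions, filters).shape)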

Yann LeCun
Shift Invariant Global Features on MNIST

Learning 50 Shift Invariant Global Features on MNIST:


50 filters of size 20x20 movable in a 28x28 frame (81 positions)
movable strokes!

Yann LeCun
Example of Reconstruction

Any character can be reconstructed as a


linear combination of a small number of
basis functions.

(Figure: original digit = reconstruction, shown as the sum of the activated decoder basis functions in the feed-back layer; red squares: decoder bases.)
Yann LeCun
Learning Invariant Filters in a Convolutional Net

Yann LeCun
Influence of Number of Training Samples

Yann LeCun
Generic Object Recognition: 101 categories + background

Caltech­101 dataset: 101 categories


accordion airplanes anchor ant barrel bass beaver binocular bonsai brain
brontosaurus buddha butterfly camera cannon car_side ceiling_fan cellphone
chair chandelier cougar_body cougar_face crab crayfish crocodile crocodile_head
cup dalmatian dollar_bill dolphin dragonfly electric_guitar elephant emu
euphonium ewer Faces Faces_easy ferry flamingo flamingo_head garfield
gerenuk gramophone grand_piano hawksbill headphone hedgehog helicopter ibis
inline_skate joshua_tree kangaroo ketch lamp laptop Leopards llama lobster
lotus mandolin mayfly menorah metronome minaret Motorbikes nautilus octopus
okapi pagoda panda pigeon pizza platypus pyramid revolver rhino rooster
saxophone schooner scissors scorpion sea_horse snoopy soccer_ball stapler
starfish stegosaurus stop_sign strawberry sunflower tick trilobite umbrella watch
water_lilly wheelchair wild_cat windsor_chair wrench yin_yang

Only 30 training examples per category!

A convolutional net trained with backprop (supervised) gets 20%


correct recognition.

Training the filters with the sparse invariant unsupervised method raises this to 54% correct (details on the following slides).

Yann LeCun
Training the 1st stage filters

12x12 input windows (complex cell receptive fields)

9x9 filters (simple cell receptive fields)

4x4 pooling

Yann LeCun
Training the 2nd stage filters
13x13 input windows (complex cell receptive fields on 1st features)

9x9 filters (simple cell receptive fields)

Each output feature map combines 4 input feature maps

5x5 pooling

Yann LeCun
Generic Object Recognition: 101 categories + background

9x9 filters at the first level

9x9 filters at the second level

Yann LeCun
Shift­Invariant Feature Hierarchies on Caltech­101

Two layers of filters trained unsupervised, with a supervised classifier on top.
54% correct on Caltech-101 with 30 examples per class; 20% correct with purely supervised backprop.
Pipeline: input image 140x140 -> first-level feature extraction: convolution with 64 9x9 filters, max-pooling over 4x4 windows and squashing -> 64 feature maps of 33x33 (8 shown) -> second-level feature extraction: convolution with 2048 9x9 filters, max-pooling over 5x5 windows and squashing -> 512 feature maps of 5x5 (2 shown).
Yann LeCun
Recognition Rate on Caltech 101

(Figure: per-category recognition rates on Caltech-101, best vs. worst categories; the best, such as dollar bill, skate and okapi, reach 100%, while the worst, such as wild cat at 1%, are close to zero.)


Yann LeCun
Caltech 256

Yann LeCun
Conclusion

Energy-Based Models are a general framework for probabilistic and non-probabilistic learning
Make the energy of training samples low, make the energy of
everything else high (e.g. Discriminant HMM, Graph Transformer
Networks, Conditional Random Fields, Max Margin Markov Nets,...)

Invariant vision tasks require deep learning


shallow models such as SVM can't learn complicated invariances.

Deep Supervised Learning works well with lots of samples


Convolutional nets have record accuracy on handwriting recognition
and face detection, and can be applied to many tasks.

Unsupervised Learning can reduce the need for labeled samples


Stacks of sequentially-trained RBMs or sparse encoder-decoder
layers learn good features without requiring labeled samples

Learning invariant feature hierarchies


yields excellent accuracy for shape recognition
Yann LeCun
Thank You

Yann LeCun
