Energy Based Models in Document Recognition and Computer Vision
Yann LeCun
The Courant Institute of Mathematical Sciences
New York University
Collaborators:
Marc'Aurelio Ranzato, Sumit Chopra, Fu-Jie Huang, Y-Lan Boureau
Two Big Problems in Learning and Recognition
Energy-Based Model for Decision-Making
Complex Tasks: Inference is Non-Trivial
When the cardinality or dimension of Y is large, exhaustive search is impractical.
We need to use “smart” inference procedures: min-sum, Viterbi, min-cut, belief propagation, gradient descent, ...
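As a toy illustration (not from the slides): when Y ranges over a small discrete set, inference in an energy-based model is just an exhaustive argmin over the energies; the “smart” procedures above replace this brute-force search when Y is large or structured. A minimal Python sketch with a made-up linear energy:

import numpy as np

def energy(w, x, y):
    # Hypothetical energy for a discrete label y: E(W, Y, X) = -w[y] . x
    return -np.dot(w[y], x)

def infer(w, x, labels):
    # Energy-based decision rule: return the answer with the lowest energy
    return min(labels, key=lambda y: energy(w, x, y))

w = np.random.randn(3, 4)   # one weight vector per label
x = np.random.randn(4)
print(infer(w, x, labels=[0, 1, 2]))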
Converting Energies to Probabilities
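For reference, the standard construction: energies are converted to probabilities through the Gibbs distribution,

P(Y \mid X, W) = \frac{e^{-\beta E(W, Y, X)}}{\int_y e^{-\beta E(W, y, X)}}

where \beta is a positive constant and the integral (or sum) in the denominator, the partition function, runs over all possible answers. That denominator is exactly what makes normalization expensive.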
Handwriting recognition / sequence labeling.
Unnormalized hierarchical HMMs, a.k.a. Graph Transformer Networks [LeCun, Bottou, Bengio, Haffner 1998].
Latent Variable Models
What can the latent variables represent?
Variables that would make the task easier if they were known:
Face recognition: the gender of the person, the orientation of the face.
Object recognition: the pose parameters of the object (location, orientation, scale), the lighting conditions.
Parts-of-speech tagging: the segmentation of the sentence into syntactic units, the parse tree.
Speech recognition: the segmentation of the sentence into phonemes or phones.
Handwriting recognition: the segmentation of the line into characters.
In general, we will search for the value of the latent variable that allows us to get an answer (Y) of smallest energy.
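In symbols (standard formulation, added here for reference): inference minimizes over both the answer and the latent variable, which amounts to defining an energy in which the latent variable has been minimized out:

(\hat{Y}, \hat{Z}) = \arg\min_{Y, Z} E(W, Y, Z, X), \qquad E(W, Y, X) = \min_{Z} E(W, Y, Z, X).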
Probabilistic Latent Variable Models
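In the probabilistic version, the latent variable is marginalized over rather than minimized over; one standard way to write this (consistent with the Gibbs form above) replaces the hard minimum by a “soft” minimum:

E_\beta(W, Y, X) = -\frac{1}{\beta} \log \int_z e^{-\beta E(W, Y, z, X)},

which reduces to the plain minimum over Z as \beta \to \infty.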
Training an EBM
Architecture and Loss Function
Training set: S = \{ (X^i, Y^i),\ i = 1 \dots P \}
Loss functional: \mathcal{L}(W, S) = \frac{1}{P} \sum_{i=1}^{P} L(Y^i, E(W, \cdot, X^i)) + R(W)
where L(Y^i, E(W, \cdot, X^i)) is the per-sample loss, Y^i is the desired answer for a given X^i, E(W, \cdot, X^i) is the energy surface as Y varies, and R(W) is a regularizer.
Designing a Loss Functional
Examples of Loss Functions: Energy Loss
Energy Loss
Simply pushes down on the energy of the correct answer
[Figure: with the energy loss, training WORKS!! for some architectures but COLLAPSES!!! for others.]
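In symbols, the energy loss is simply the energy of the desired answer,

\mathcal{L}_{energy}(W, Y^i, X^i) = E(W, Y^i, X^i),

so nothing in the loss itself pushes up the energies of incorrect answers: whether training works or collapses depends entirely on the architecture.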
Negative Log-Likelihood Loss
Gibbs distribution:
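P(Y \mid X^i, W) = e^{-\beta E(W, Y, X^i)} / \int_y e^{-\beta E(W, y, X^i)}, as above. Taking the negative log and averaging over the training set gives the standard form of the loss:

\mathcal{L}_{nll}(W, S) = \frac{1}{P} \sum_{i=1}^{P} \Big[ E(W, Y^i, X^i) + \frac{1}{\beta} \log \int_y e^{-\beta E(W, y, X^i)} \Big]

The first term pushes down on the energy of the desired answer; the log-partition term pulls up on the energies of all answers.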
Negative Log-Likelihood Loss
The Negative Log-Likelihood loss has been used for a long time in many communities for discriminative learning with structured outputs:
Speech recognition: many papers going back to the early '90s [Bengio 92], [Bourlard 94]; they call it “Maximum Mutual Information”.
Handwriting recognition: [Bengio, LeCun 94], [LeCun et al. 98].
Bio-informatics: [Haussler].
Conditional Random Fields: [Lafferty et al. 2001].
Lots more...
In all the above cases, it was used with non-linearly parameterized energies.
A Simpler Loss Function: Perceptron Loss
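The perceptron loss is \mathcal{L}_{perceptron}(W, Y^i, X^i) = E(W, Y^i, X^i) - \min_Y E(W, Y, X^i): push down on the correct answer, pull up on the model's current best answer. A minimal Python sketch of the resulting update for a toy linear multi-class energy E(W, Y, X) = -w_Y \cdot X (a hypothetical setup, not the structured graph-transformer version mentioned later in the talk):

import numpy as np

def perceptron_step(W, x, y_true, lr=1.0):
    energies = -W @ x                  # E(W, y, x) = -w_y . x for each label y
    y_star = int(np.argmin(energies))  # current best (lowest-energy) answer
    if y_star != y_true:
        # Gradient step on E(W, y_true, x) - E(W, y_star, x)
        W[y_true] += lr * x            # push down the energy of the desired answer
        W[y_star] -= lr * x            # pull up the energy of the offending answer
    return W

W = perceptron_step(np.zeros((3, 4)), np.array([1.0, 0.0, -1.0, 2.0]), y_true=1)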
A Better Loss Function: Generalized Margin Losses
Examples of Generalized Margin Losses
Hinge Loss [Altun et al. 2003], [Taskar et al. 2003]
With the linearly-parameterized binary classifier architecture, we get linear SVMs.
Log Loss (“soft hinge” loss)
With the linearly-parameterized binary classifier architecture, we get linear logistic regression.
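With \bar{Y}^i denoting the most offending incorrect answer (the lowest-energy answer with the wrong label) and m a positive margin, the standard forms are:

\mathcal{L}_{hinge} = \max\big(0,\ m + E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)\big), \qquad \mathcal{L}_{log} = \log\big(1 + e^{E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)}\big).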
Examples of Margin Losses: Square-Square Loss
Square-Square Loss [LeCun-Huang 2005]
Appropriate for positive energy functions.
Learning Y = X^2:
[Figure: energy surfaces during training: NO COLLAPSE!!]
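For positive energies, the square-square loss treats the two terms separately (standard form, with margin m):

\mathcal{L}_{sq\text{-}sq} = E(W, Y^i, X^i)^2 + \big( \max(0,\ m - E(W, \bar{Y}^i, X^i)) \big)^2.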
Other Margin-Like Losses
What Makes a “Good” Loss Function
Advantages/Disadvantages of various losses
Loss functions differ in how they pick the point(s) whose energy is pulled up, and in how much they pull them up.
Losses with a log partition function in the contrastive term pull up all the bad answers simultaneously.
This may be good if the gradient of the contrastive term can be computed efficiently.
This may be bad if it cannot, in which case we might as well use a loss with a single point in the contrastive term.
Variational methods pull up many points, but not as many as with the full log partition function.
Energy-Based Factor Graphs: Energy = Sum of “factors”
Sequence Labeling
The output is a sequence Y1, Y2, Y3, Y4, ...
NLP parsing, MT, speech/handwriting recognition, biological sequence analysis.
The factors ensure grammatical consistency: they give low energy to consistent sub-sequences of output symbols.
The graph is generally simple (chain or tree).
Inference is easy (dynamic programming, min-sum).
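For a chain-structured factor graph, min-sum inference is ordinary dynamic programming (Viterbi with energies instead of log-probabilities). A self-contained Python sketch with made-up unary and pairwise factor energies:

import numpy as np

def min_sum_chain(unary, pairwise):
    """unary: (T, K) energies of each label at each position; pairwise: (K, K) transition energies.
    Returns the label sequence minimizing sum_t unary[t, y_t] + sum_t pairwise[y_t, y_t+1]."""
    T, K = unary.shape
    cost = unary[0].copy()                 # best energy of any prefix ending in each label
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise + unary[t][None, :]   # (previous label, current label)
        back[t] = np.argmin(total, axis=0)
        cost = total.min(axis=0)
    y = [int(np.argmin(cost))]             # backtrack the minimizing sequence
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(min_sum_chain(rng.random((4, 3)), rng.random((3, 3))))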
Simple Energy-Based Factor Graphs with “Shallow” Factors
Deep/nonlinear Factors for Speech and Handwriting
Unnormalized hierarchical HMMs.
Trained with the Perceptron loss [LeCun, Bottou, Bengio, Haffner 1998].
Trained with the NLL loss [Bengio, LeCun 1994], [LeCun, Bottou, Bengio, Haffner 1998].
What's so bad about probabilistic models?
Why bother with normalization, since we don't use it for decision making?
Why insist that P(Y|X) have a specific shape, when we only care about the position of its minimum?
When Y is high-dimensional (or simply combinatorial), normalizing becomes intractable (e.g. language modeling, image restoration, large-DoF robot control...).
A tiny number of models are pre-normalized (Gaussian, exponential family).
A very small number are easily normalizable.
A large number have intractable normalization.
A huge number can't be normalized at all (examples will be shown).
Normalization forces us to take into account areas of the space that we don't actually care about, because our inference algorithm never takes us there.
If we only care about making the right decisions, maximizing the likelihood solves a much more complex problem than we have to.
EBM
Part 2: Deep Supervised Learning for Vision:
The Convolutional Network Architecture
Convolutional Networks:
[LeCun et al., Neural Computation, 1989]
[LeCun et al., Proc IEEE 1998]
[Figure: preprocessing / feature extraction followed by a trainable classifier.]
End-to-End Learning
[Figure: trainable feature extraction followed by a trainable classifier, trained together.]
An Old Idea for Local Shift Invariance
[Figure: multiple convolutions followed by pooling/subsampling.]
The Multistage Hubel-Wiesel Architecture
QUESTION: How do we find (or learn) the filters?
Convolutional Network
Convolutional Net Architecture
Input: 1@32x32 -> Layer 1 (5x5 convolution): 6@28x28 -> Layer 2 (2x2 pooling/subsampling): 6@14x14 -> Layer 3 (5x5 convolution): 12@10x10 -> Layer 4 (2x2 pooling/subsampling): 12@5x5 -> Layer 5 (5x5 convolution): 100@1x1 -> Layer 6: 10 outputs.
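A minimal PyTorch sketch of this layer stack (an approximation for illustration, not the original implementation: subsampling is rendered as average pooling and tanh non-linearities are assumed):

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, 5),     # 1@32x32 -> 6@28x28
    nn.Tanh(),
    nn.AvgPool2d(2),        # 6@28x28 -> 6@14x14
    nn.Conv2d(6, 12, 5),    # 6@14x14 -> 12@10x10
    nn.Tanh(),
    nn.AvgPool2d(2),        # 12@10x10 -> 12@5x5
    nn.Conv2d(12, 100, 5),  # 12@5x5 -> 100@1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(100, 10),     # Layer 6: 10 class scores
)
print(lenet(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])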
MNIST Handwritten Digit Dataset
Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples
Results on MNIST Handwritten Digits
CLASSIFIER DEFORMATION PREPROCESSING ERROR (%) Reference
linear classifier (1-layer NN) none 12.00 LeCun et al. 1998
linear classifier (1-layer NN) deskewing 8.40 LeCun et al. 1998
pairwise linear classifier deskewing 7.60 LeCun et al. 1998
K-nearest-neighbors, (L2) none 3.09 Kenneth Wilder, U. Chicago
K-nearest-neighbors, (L2) deskewing 2.40 LeCun et al. 1998
K-nearest-neighbors, (L2) deskew, clean, blur 1.80 Kenneth Wilder, U. Chicago
K-NN L3, 2 pixel jitter deskew, clean, blur 1.22 Kenneth Wilder, U. Chicago
K-NN, shape context matching shape context feature 0.63 Belongie et al. IEEE PAMI 2002
40 PCA + quadratic classifier none 3.30 LeCun et al. 1998
1000 RBF + linear classifier none 3.60 LeCun et al. 1998
K-NN, Tangent Distance subsamp 16x16 pixels 1.10 LeCun et al. 1998
SVM, Gaussian Kernel none 1.40
SVM deg 4 polynomial deskewing 1.10 LeCun et al. 1998
Reduced Set SVM deg 5 poly deskewing 1.00 LeCun et al. 1998
Virtual SVM deg-9 poly Affine none 0.80 LeCun et al. 1998
V-SVM, 2-pixel jittered none 0.68 DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered deskewing 0.56 DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE none 4.70 LeCun et al. 1998
2-layer NN, 300 HU, MSE, Affine none 3.60 LeCun et al. 1998
2-layer NN, 300 HU deskewing 1.60 LeCun et al. 1998
3-layer NN, 500+150 HU none 2.95 LeCun et al. 1998
3-layer NN, 500+150 HU Affine none 2.45 LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg none 1.53 Hinton, unpublished, 2005
2-layer NN, 800 HU, CE none 1.60 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Affine none 1.10 Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE Elastic none 0.90 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Elastic none 0.70 Simard et al., ICDAR 2003
Convolutional net LeNet-1 subsamp 16x16 pixels 1.70 LeCun et al. 1998
Convolutional net LeNet-4 none 1.10 LeCun et al. 1998
Convolutional net LeNet-5, none 0.95 LeCun et al. 1998
Conv. net LeNet-5, Affine none 0.80 LeCun et al. 1998
Boosted LeNet-4 Affine none 0.70 LeCun et al. 1998
Conv. net, CE Affine none 0.60 Simard et al., ICDAR 2003
Conv. net, CE Elastic none 0.40 Simard et al., ICDAR 2003
Some Results on MNIST (from raw images: no preprocessing)
Note: some groups have obtained good results with various amounts of preprocessing: [deCoste and Schoelkopf]
get 0.56% with an SVM on deskewed images; [Belongie] get 0.63% with “shape context” features;
[CENPARMI] get below 0.4% with features and SVM; [Liu] get 0.42% with features and SVM.
Invariance and Robustness to Noise
Recognizing Multiple Characters with Replicated Nets
Handwriting Recognition
Face Detection and Pose Estimation with Convolutional Nets
Training: 52,850 32x32 grey-level images of faces and 52,850 non-faces.
Each sample is used 5 times, with random variations in scale, in-plane rotation, brightness and contrast.
2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector.
Face Detection: Results
Face Detection and Pose Estimation: Results
Face Detection with a Convolutional Net
Applying a ConvNet on Sliding Windows is Very Cheap!
[Figure: a network with a 96x96 input field applied to a 120x120 image produces a 3x3 grid of outputs.]
Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales.
Convolutional nets can be replicated over large images very cheaply.
The network is applied at multiple scales spaced by a factor of 1.5.
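A toy illustration of why replication is cheap (the layer sizes below are hypothetical, chosen only so that the 96x96 window and the 12-pixel output shift of the next slide are reproduced): written fully convolutionally, the same network that produces a single output on a 96x96 window produces a 3x3 grid of outputs on a 120x120 image, and the computation over the overlapping regions is shared.

import torch
import torch.nn as nn

# Hypothetical fully-convolutional classifier with an overall subsampling ratio of 4 * 3 = 12
net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Tanh(), nn.AvgPool2d(4),
    nn.Conv2d(8, 24, 6), nn.Tanh(), nn.AvgPool2d(3),
    nn.Conv2d(24, 100, 6), nn.Tanh(),
    nn.Conv2d(100, 2, 1),   # final classifier written as a 1x1 convolution (e.g. face / non-face)
)
print(net(torch.randn(1, 1, 96, 96)).shape)    # torch.Size([1, 2, 1, 1]): one window
print(net(torch.randn(1, 1, 120, 120)).shape)  # torch.Size([1, 2, 3, 3]): 3x3 grid, 12-pixel shift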
Building a Detector/Recognizer:
Replicated Convolutional Nets
[Figure: replicated 96x96 windows with a 12-pixel shift; adjacent windows share an 84x84 overlap.]
TV sport categorization (with Alex Niculescu, Cornell)
123,900 training images (300 sequences of 59 frames for each sport).
82,600 test images (200 sequences of 59 frames for each sport).
Results:
Frame-level accuracy: 61% correct.
Sequence-level accuracy: 68% correct (simple voting scheme).
TV sport categorization (with Alex Niculescu, Cornell)
C. Elegans Embryo Phenotyping
Pipeline: Raw input -> ConvNet labeling -> CCPoE cleanup -> Elastic model fitting.
CCPoE = Convolutional Conditional Product of Experts [Ning et al., IEEE TIP 2005] (similar to Fields of Experts [Roth & Black, CVPR 2005])
Visual Navigation for a Mobile Robot
Training a ConvNet Online to detect obstacles
[Hadsell et al., Robotics Science and Systems 2007]
Generic Object Detection and Recognition
with Invariance to Pose and Illumination
50 toys belonging to 5 categories: animal, human figure, airplane, truck, car.
10 instances per category: 5 instances used for training, 5 instances for testing.
Raw dataset: 972 stereo pairs of each object instance; 48,600 image pairs total.
Convolutional Network
Stereo input: 2@96x96 -> Layer 1 (5x5 convolution, 16 kernels): 8@92x92 -> Layer 2 (4x4 subsampling): 8@23x23 -> Layer 3 (6x6 convolution, 96 kernels): 24@18x18 -> Layer 4 (3x3 subsampling): 24@6x6 -> Layer 5 (6x6 convolution, 2400 kernels): 100 -> Layer 6 (fully connected, 500 weights): 5 outputs.
90,857 free parameters, 3,901,162 connections.
The architecture alternates convolutional layers (feature detectors) and subsampling layers (local feature pooling for invariance to small distortions).
The entire network is trained end-to-end (all the layers are trained simultaneously).
A gradient-based algorithm is used to minimize a supervised loss function.
Alternated Convolutions and Subsampling
[Figure: multiple convolutions (“simple cells”) followed by averaging/subsampling (“complex cells”).]
Local features are extracted everywhere.
The averaging/subsampling layer builds robustness to variations in feature locations.
Hubel/Wiesel '62, Fukushima '71, LeCun '89, Riesenhuber & Poggio '02, Ullman '02, ...
Normalized-Uniform Set: Error Rates
Jittered-Cluttered Dataset:
291,600 stereo pairs for training, 58,320 for testing.
Objects are jittered: position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ...
Input dimension: 98x98x2 (approx. 18,000).
Experiment 2: Jittered-Cluttered Dataset
Learned Features
[Figure: input and the learned features at Layer 1, Layer 2, and Layer 3.]
Examples (Monocular Mode)
Natural Images (Monocular Mode)
Commercially Deployed Applications of Convolutional Nets
Supervised Convolutional Nets: Pros and Cons
Can we use this idea to reduce the number of necessary labeled examples?
Models Similar to ConvNets
HMAX [Poggio & Riesenhuber 2003], [Serre et al. 2007], [Mutch and Lowe CVPR 2006]
Difference? The features are not learned.
HMAX is very similar to Fukushima's Neocognitron.
[Figure from Serre et al. 2007]
Part 3:
Unsupervised Training of “Deep” Energy-Based Models,
Learning Invariant Feature Hierarchies
[Figure: stacked encoder-decoder modules: INPUT Y -> encoder/decoder with reconstruction COST -> LEVEL 1 FEATURES -> encoder/decoder with reconstruction COST -> LEVEL 2 FEATURES.]
Encoder-Decoder Architecture for Unsupervised Learning
Unsupervised Training of Energy-Based Models
Basic Idea:
push down on the energy of training samples,
pull up on the energy of everything else,
but this is often intractable.
Deep Learning for Non-Linear Dimensionality Reduction
Restricted Boltzmann Machine:
simple energy function E(Y, Z, W) = -\sum_{ij} Y_i W_{ij} Z_j
code units are binary stochastic
training with contrastive divergence
“bubble” detectors
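A minimal NumPy sketch of contrastive-divergence (CD-1) training for this energy function (bias terms omitted for brevity; toy random binary data stands in for real inputs):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Y = (rng.random((100, 16)) > 0.5).astype(float)   # toy binary data, shape (samples, visible)
W = 0.01 * rng.standard_normal((16, 8))           # E(Y, Z, W) = -sum_ij Y_i W_ij Z_j
lr = 0.1

for epoch in range(10):
    pz = sigmoid(Y @ W)                            # P(Z_j = 1 | Y) for the data
    Z = (rng.random(pz.shape) < pz).astype(float)  # sample the binary stochastic code
    py = sigmoid(Z @ W.T)                          # one Gibbs step: reconstruct Y ...
    Y_neg = (rng.random(py.shape) < py).astype(float)
    pz_neg = sigmoid(Y_neg @ W)                    # ... and re-infer the code
    # CD-1 update: push down the energy of the data, pull up that of the reconstructions
    W += lr * (Y.T @ pz - Y_neg.T @ pz_neg) / len(Y)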
Non-Linear Dimensionality Reduction: MNIST
Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples
Training on handwritten digits
Unsupervised Training of Convolutional Filters
CLASSIFICATION EXPERIMENTS
IDEA: improve supervised learning by pre-training the filters of the first convolutional layer with the unsupervised method (*).
Sparse representations & LeNet-6 (1->50->50->200->10).
Experiment 2: same as Experiment 1, but with the training set augmented by elastically distorted digits (random initialization gives a test error rate of 0.49%).
Test error rate: 0.39%. Training error rate: 0.23%.
(*) [Hinton, Osindero, Teh, "A fast learning algorithm for deep belief nets", Neural Computation 2006]
Best Results on MNIST (from raw images: no preprocessing)
MNIST Errors (0.42% error)
Training on natural image patches
[Figure: two encoder-decoder architectures side by side. Left: INPUT Y -> ENCODER -> CODE Z -> DECODER -> reconstruction COST. Right: the invariant version, in which the ENCODER produces INVARIANT FEATURES (CODE) Z together with TRANSFORMATION PARAMETERS U, and the DECODER uses both to reconstruct INPUT Y.]
Learning Invariant Feature Hierarchies
[Figure: a 17x17 input image goes through the encoder (convolutions producing feature maps, then max pooling whose "switch" positions serve as transformation parameters) and the decoder (upsampling driven by the switches, then convolutions) to produce a 17x17 reconstruction.]
Shift Invariant Global Features on MNIST
Example of Reconstruction
[Figure: ORIGINAL DIGIT and its RECONSTRUCTION as a sum of the ACTIVATED DECODER BASIS FUNCTIONS (in the feedback layer); red squares: decoder bases.]
Learning Invariant Filters in a Convolutional Net
Influence of Number of Training Samples
Generic Object Recognition: 101 categories + background
Training the 1st stage filters
4x4 pooling
Training the 2nd stage filters
13x13 input windows (complex-cell receptive fields on the 1st-stage features)
5x5 pooling
Generic Object Recognition: 101 categories + background
Shift-Invariant Feature Hierarchies on Caltech-101
Two layers of filters trained unsupervised, plus a supervised classifier on top.
54% correct on Caltech-101 with 30 examples per class.
[Figure: input image 140x140 -> first-level feature extraction (convolution, 64 9x9 filters) -> 8 among the 64 33x33 feature maps -> second-level feature extraction (convolution, 2048 9x9 filters) -> 2 among the 512 5x5 feature maps.]
Recognition Rate on Caltech 101
Conclusion