Energy Based Models in Document Recognition and Computer Vision
Yann LeCun
The Courant Institute of Mathematical Sciences
New York University
Collaborators:
Marc'Aurelio Ranzato, Sumit Chopra, Fu-Jie Huang, Y-Lan Boureau
Two Big Problems in Learning and Recognition
Energy-Based Model for Decision-Making
Complex Tasks: Inference is Non-Trivial
When the cardinality or dimension of Y is large, exhaustive search is impractical.
We need to use “smart” inference procedures: min-sum, Viterbi, min-cut, belief propagation, gradient descent, ...
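As a toy illustration (not from the slides): when Y ranges over a small discrete set, inference in an energy-based model is just an exhaustive argmin over the energies; the “smart” procedures above replace this brute-force search when Y is large or structured. A minimal Python sketch with a made-up linear energy:

import numpy as np

def energy(w, x, y):
    # Hypothetical energy for a discrete label y: E(W, Y, X) = -w[y] . x
    return -np.dot(w[y], x)

def infer(w, x, labels):
    # Energy-based decision rule: return the answer with the lowest energy
    return min(labels, key=lambda y: energy(w, x, y))

w = np.random.randn(3, 4)   # one weight vector per label
x = np.random.randn(4)
print(infer(w, x, labels=[0, 1, 2]))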
Converting Energies to Probabilities
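For reference, the standard construction: energies are converted to probabilities through the Gibbs distribution,

P(Y \mid X, W) = \frac{e^{-\beta E(W, Y, X)}}{\int_y e^{-\beta E(W, y, X)}}

where \beta is a positive constant and the integral (or sum) in the denominator, the partition function, runs over all possible answers. That denominator is exactly what makes normalization expensive.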
Handwriting recognition / sequence labeling.
Unnormalized hierarchical HMMs, a.k.a. Graph Transformer Networks [LeCun, Bottou, Bengio, Haffner 1998].
Latent Variable Models
What can the latent variables represent?
Variables that would make the task easier if they were known:
Face recognition: the gender of the person, the orientation of the face.
Object recognition: the pose parameters of the object (location, orientation, scale), the lighting conditions.
Parts-of-speech tagging: the segmentation of the sentence into syntactic units, the parse tree.
Speech recognition: the segmentation of the sentence into phonemes or phones.
Handwriting recognition: the segmentation of the line into characters.
In general, we will search for the value of the latent variable that allows us to get an answer (Y) of smallest energy.
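In symbols (standard formulation, added here for reference): inference minimizes over both the answer and the latent variable, which amounts to defining an energy in which the latent variable has been minimized out:

(\hat{Y}, \hat{Z}) = \arg\min_{Y, Z} E(W, Y, Z, X), \qquad E(W, Y, X) = \min_{Z} E(W, Y, Z, X).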
Probabilistic Latent Variable Models
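In the probabilistic version, the latent variable is marginalized over rather than minimized over; one standard way to write this (consistent with the Gibbs form above) replaces the hard minimum by a “soft” minimum:

E_\beta(W, Y, X) = -\frac{1}{\beta} \log \int_z e^{-\beta E(W, Y, z, X)},

which reduces to the plain minimum over Z as \beta \to \infty.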
Training an EBM
Architecture and Loss Function
Training set: S = \{ (X^i, Y^i),\ i = 1 \dots P \}
Loss functional: \mathcal{L}(W, S) = \frac{1}{P} \sum_{i=1}^{P} L(Y^i, E(W, \cdot, X^i)) + R(W)
where L(Y^i, E(W, \cdot, X^i)) is the per-sample loss, Y^i is the desired answer for a given X^i, E(W, \cdot, X^i) is the energy surface as Y varies, and R(W) is a regularizer.
Designing a Loss Functional
Examples of Loss Functions: Energy Loss
Energy Loss
Simply pushes down on the energy of the correct answer
[Figure: with the energy loss, training WORKS!! for some architectures but COLLAPSES!!! for others.]
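In symbols, the energy loss is simply the energy of the desired answer,

\mathcal{L}_{energy}(W, Y^i, X^i) = E(W, Y^i, X^i),

so nothing in the loss itself pushes up the energies of incorrect answers: whether training works or collapses depends entirely on the architecture.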
Negative Log-Likelihood Loss
Gibbs distribution:
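P(Y \mid X^i, W) = e^{-\beta E(W, Y, X^i)} / \int_y e^{-\beta E(W, y, X^i)}, as above. Taking the negative log and averaging over the training set gives the standard form of the loss:

\mathcal{L}_{nll}(W, S) = \frac{1}{P} \sum_{i=1}^{P} \Big[ E(W, Y^i, X^i) + \frac{1}{\beta} \log \int_y e^{-\beta E(W, y, X^i)} \Big]

The first term pushes down on the energy of the desired answer; the log-partition term pulls up on the energies of all answers.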
Negative Log-Likelihood Loss
The Negative Log-Likelihood loss has been used for a long time in many communities for discriminative learning with structured outputs:
Speech recognition: many papers going back to the early '90s [Bengio 92], [Bourlard 94]; they call it “Maximum Mutual Information”.
Handwriting recognition: [Bengio, LeCun 94], [LeCun et al. 98].
Bio-informatics: [Haussler].
Conditional Random Fields: [Lafferty et al. 2001].
Lots more...
In all the above cases, it was used with non-linearly parameterized energies.
A Simpler Loss Function: Perceptron Loss
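The perceptron loss is \mathcal{L}_{perceptron}(W, Y^i, X^i) = E(W, Y^i, X^i) - \min_Y E(W, Y, X^i): push down on the correct answer, pull up on the model's current best answer. A minimal Python sketch of the resulting update for a toy linear multi-class energy E(W, Y, X) = -w_Y \cdot X (a hypothetical setup, not the structured graph-transformer version mentioned later in the talk):

import numpy as np

def perceptron_step(W, x, y_true, lr=1.0):
    energies = -W @ x                  # E(W, y, x) = -w_y . x for each label y
    y_star = int(np.argmin(energies))  # current best (lowest-energy) answer
    if y_star != y_true:
        # Gradient step on E(W, y_true, x) - E(W, y_star, x)
        W[y_true] += lr * x            # push down the energy of the desired answer
        W[y_star] -= lr * x            # pull up the energy of the offending answer
    return W

W = perceptron_step(np.zeros((3, 4)), np.array([1.0, 0.0, -1.0, 2.0]), y_true=1)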
A Better Loss Function: Generalized Margin Losses
Examples of Generalized Margin Losses
Hinge Loss [Altun et al. 2003], [Taskar et al. 2003]
With the linearly-parameterized binary classifier architecture, we get linear SVMs.
Log Loss (“soft hinge” loss)
With the linearly-parameterized binary classifier architecture, we get linear logistic regression.
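With \bar{Y}^i denoting the most offending incorrect answer (the lowest-energy answer with the wrong label) and m a positive margin, the standard forms are:

\mathcal{L}_{hinge} = \max\big(0,\ m + E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)\big), \qquad \mathcal{L}_{log} = \log\big(1 + e^{E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)}\big).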
Examples of Margin Losses: Square-Square Loss
Square-Square Loss [LeCun-Huang 2005]
Appropriate for positive energy functions.
Learning Y = X^2:
[Figure: energy surfaces during training: NO COLLAPSE!!]
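For positive energies, the square-square loss treats the two terms separately (standard form, with margin m):

\mathcal{L}_{sq\text{-}sq} = E(W, Y^i, X^i)^2 + \big( \max(0,\ m - E(W, \bar{Y}^i, X^i)) \big)^2.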
Other Margin-Like Losses
What Makes a “Good” Loss Function
Advantages/Disadvantages of various losses
Loss functions differ in how they pick the point(s) whose energy is pulled up, and in how much they pull them up.
Losses with a log partition function in the contrastive term pull up all the bad answers simultaneously.
This may be good if the gradient of the contrastive term can be computed efficiently.
This may be bad if it cannot, in which case we might as well use a loss with a single point in the contrastive term.
Variational methods pull up many points, but not as many as with the full log partition function.
Energy-Based Factor Graphs: Energy = Sum of “factors”
Sequence Labeling
The output is a sequence Y1, Y2, Y3, Y4, ...
NLP parsing, MT, speech/handwriting recognition, biological sequence analysis.
The factors ensure grammatical consistency: they give low energy to consistent sub-sequences of output symbols.
The graph is generally simple (chain or tree).
Inference is easy (dynamic programming, min-sum).
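For a chain-structured factor graph, min-sum inference is ordinary dynamic programming (Viterbi with energies instead of log-probabilities). A self-contained Python sketch with made-up unary and pairwise factor energies:

import numpy as np

def min_sum_chain(unary, pairwise):
    """unary: (T, K) energies of each label at each position; pairwise: (K, K) transition energies.
    Returns the label sequence minimizing sum_t unary[t, y_t] + sum_t pairwise[y_t, y_t+1]."""
    T, K = unary.shape
    cost = unary[0].copy()                 # best energy of any prefix ending in each label
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise + unary[t][None, :]   # (previous label, current label)
        back[t] = np.argmin(total, axis=0)
        cost = total.min(axis=0)
    y = [int(np.argmin(cost))]             # backtrack the minimizing sequence
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(min_sum_chain(rng.random((4, 3)), rng.random((3, 3))))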
Simple Energy-Based Factor Graphs with “Shallow” Factors
Deep/nonlinear Factors for Speech and Handwriting
Unnormalized hierarchical HMMs.
Trained with the Perceptron loss [LeCun, Bottou, Bengio, Haffner 1998].
Trained with the NLL loss [Bengio, LeCun 1994], [LeCun, Bottou, Bengio, Haffner 1998].
What's so bad about probabilistic models?
Why bother with normalization, since we don't use it for decision making?
Why insist that P(Y|X) have a specific shape, when we only care about the position of its minimum?
When Y is high-dimensional (or simply combinatorial), normalizing becomes intractable (e.g. language modeling, image restoration, large-DoF robot control...).
A tiny number of models are pre-normalized (Gaussian, exponential family).
A very small number are easily normalizable.
A large number have intractable normalization.
A huge number can't be normalized at all (examples will be shown).
Normalization forces us to take into account areas of the space that we don't actually care about, because our inference algorithm never takes us there.
If we only care about making the right decisions, maximizing the likelihood solves a much more complex problem than we have to.
EBM
Part 2: Deep Supervised Learning for Vision:
The Convolutional Network Architecture
Convolutional Networks:
[LeCun et al., Neural Computation, 1989]
[LeCun et al., Proc IEEE 1998]
[Figure: preprocessing / feature extraction followed by a trainable classifier.]
End-to-End Learning
[Figure: trainable feature extraction followed by a trainable classifier, trained together.]
An Old Idea for Local Shift Invariance
[Figure: multiple convolutions followed by pooling/subsampling.]
The Multistage Hubel-Wiesel Architecture
QUESTION: How do we find (or learn) the filters?
Convolutional Network
Convolutional Net Architecture
Input: 1@32x32 -> Layer 1 (5x5 convolution): 6@28x28 -> Layer 2 (2x2 pooling/subsampling): 6@14x14 -> Layer 3 (5x5 convolution): 12@10x10 -> Layer 4 (2x2 pooling/subsampling): 12@5x5 -> Layer 5 (5x5 convolution): 100@1x1 -> Layer 6: 10 outputs.
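A minimal PyTorch sketch of this layer stack (an approximation for illustration, not the original implementation: subsampling is rendered as average pooling and tanh non-linearities are assumed):

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, 5),     # 1@32x32 -> 6@28x28
    nn.Tanh(),
    nn.AvgPool2d(2),        # 6@28x28 -> 6@14x14
    nn.Conv2d(6, 12, 5),    # 6@14x14 -> 12@10x10
    nn.Tanh(),
    nn.AvgPool2d(2),        # 12@10x10 -> 12@5x5
    nn.Conv2d(12, 100, 5),  # 12@5x5 -> 100@1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(100, 10),     # Layer 6: 10 class scores
)
print(lenet(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])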
MNIST Handwritten Digit Dataset
Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples
Results on MNIST Handwritten Digits
CLASSIFIER DEFORMATION PREPROCESSING ERROR (%) Reference
linear classifier (1-layer NN) none 12.00 LeCun et al. 1998
linear classifier (1-layer NN) deskewing 8.40 LeCun et al. 1998
pairwise linear classifier deskewing 7.60 LeCun et al. 1998
K-nearest-neighbors, (L2) none 3.09 Kenneth Wilder, U. Chicago
K-nearest-neighbors, (L2) deskewing 2.40 LeCun et al. 1998
K-nearest-neighbors, (L2) deskew, clean, blur 1.80 Kenneth Wilder, U. Chicago
K-NN L3, 2 pixel jitter deskew, clean, blur 1.22 Kenneth Wilder, U. Chicago
K-NN, shape context matching shape context feature 0.63 Belongie et al. IEEE PAMI 2002
40 PCA + quadratic classifier none 3.30 LeCun et al. 1998
1000 RBF + linear classifier none 3.60 LeCun et al. 1998
K-NN, Tangent Distance subsamp 16x16 pixels 1.10 LeCun et al. 1998
SVM, Gaussian Kernel none 1.40
SVM deg 4 polynomial deskewing 1.10 LeCun et al. 1998
Reduced Set SVM deg 5 poly deskewing 1.00 LeCun et al. 1998
Virtual SVM deg-9 poly Affine none 0.80 LeCun et al. 1998
V-SVM, 2-pixel jittered none 0.68 DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered deskewing 0.56 DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE none 4.70 LeCun et al. 1998
2-layer NN, 300 HU, MSE, Affine none 3.60 LeCun et al. 1998
2-layer NN, 300 HU deskewing 1.60 LeCun et al. 1998
3-layer NN, 500+150 HU none 2.95 LeCun et al. 1998
3-layer NN, 500+150 HU Affine none 2.45 LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg none 1.53 Hinton, unpublished, 2005
2-layer NN, 800 HU, CE none 1.60 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Affine none 1.10 Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE Elastic none 0.90 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Elastic none 0.70 Simard et al., ICDAR 2003
Convolutional net LeNet-1 subsamp 16x16 pixels 1.70 LeCun et al. 1998
Convolutional net LeNet-4 none 1.10 LeCun et al. 1998
Convolutional net LeNet-5, none 0.95 LeCun et al. 1998
Conv. net LeNet-5, Affine none 0.80 LeCun et al. 1998
Boosted LeNet-4 Affine none 0.70 LeCun et al. 1998
Conv. net, CE Affine none 0.60 Simard et al., ICDAR 2003
Conv. net, CE Elastic none 0.40 Simard et al., ICDAR 2003
Some Results on MNIST (from raw images: no preprocessing)
Note: some groups have obtained good results with various amounts of preprocessing: [deCoste and Schoelkopf]
get 0.56% with an SVM on deskewed images; [Belongie] get 0.63% with “shape context” features;
[CENPARMI] get below 0.4% with features and SVM; [Liu] get 0.42% with features and SVM.
Invariance and Robustness to Noise
Recognizing Multiple Characters with Replicated Nets
Handwriting Recognition
Face Detection and Pose Estimation with Convolutional Nets
Training: 52,850 32x32 grey-level images of faces and 52,850 non-faces.
Each sample is used 5 times, with random variations in scale, in-plane rotation, brightness and contrast.
2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector.
Face Detection: Results
Face Detection and Pose Estimation: Results
Face Detection with a Convolutional Net
Applying a ConvNet on Sliding Windows is Very Cheap!
[Figure: a network with a 96x96 input field applied to a 120x120 image produces a 3x3 grid of outputs.]
Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales.
Convolutional nets can be replicated over large images very cheaply.
The network is applied at multiple scales spaced by a factor of 1.5.
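A toy illustration of why replication is cheap (the layer sizes below are hypothetical, chosen only so that the 96x96 window and the 12-pixel output shift of the next slide are reproduced): written fully convolutionally, the same network that produces a single output on a 96x96 window produces a 3x3 grid of outputs on a 120x120 image, and the computation over the overlapping regions is shared.

import torch
import torch.nn as nn

# Hypothetical fully-convolutional classifier with an overall subsampling ratio of 4 * 3 = 12
net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Tanh(), nn.AvgPool2d(4),
    nn.Conv2d(8, 24, 6), nn.Tanh(), nn.AvgPool2d(3),
    nn.Conv2d(24, 100, 6), nn.Tanh(),
    nn.Conv2d(100, 2, 1),   # final classifier written as a 1x1 convolution (e.g. face / non-face)
)
print(net(torch.randn(1, 1, 96, 96)).shape)    # torch.Size([1, 2, 1, 1]): one window
print(net(torch.randn(1, 1, 120, 120)).shape)  # torch.Size([1, 2, 3, 3]): 3x3 grid, 12-pixel shift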
Building a Detector/Recognizer:
Replicated Convolutional Nets
[Figure: replicated 96x96 windows with a 12-pixel shift; adjacent windows share an 84x84 overlap.]
TV sport categorization (with Alex Niculescu, Cornell)
123,900 training images (300 sequences of 59 frames for each sport).
82,600 test images (200 sequences of 59 frames for each sport).
Results:
Frame-level accuracy: 61% correct.
Sequence-level accuracy: 68% correct (simple voting scheme).
TV sport categorization (with Alex Niculescu, Cornell)
C. Elegans Embryo Phenotyping
Pipeline: Raw input -> ConvNet labeling -> CCPoE cleanup -> Elastic model fitting.
CCPoE = Convolutional Conditional Product of Experts [Ning et al., IEEE TIP 2005] (similar to Fields of Experts [Roth & Black, CVPR 2005])
Visual Navigation for a Mobile Robot
Training a ConvNet Online to detect obstacles
[Hadsell et al., Robotics Science and Systems 2007]
Generic Object Detection and Recognition
with Invariance to Pose and Illumination
50 toys belonging to 5 categories: animal, human figure, airplane, truck, car.
10 instances per category: 5 instances used for training, 5 instances for testing.
Raw dataset: 972 stereo pairs of each object instance; 48,600 image pairs total.
Convolutional Network
Stereo input: 2@96x96 -> Layer 1 (5x5 convolution, 16 kernels): 8@92x92 -> Layer 2 (4x4 subsampling): 8@23x23 -> Layer 3 (6x6 convolution, 96 kernels): 24@18x18 -> Layer 4 (3x3 subsampling): 24@6x6 -> Layer 5 (6x6 convolution, 2400 kernels): 100 -> Layer 6 (fully connected, 500 weights): 5 outputs.
90,857 free parameters, 3,901,162 connections.
The architecture alternates convolutional layers (feature detectors) and subsampling layers (local feature pooling for invariance to small distortions).
The entire network is trained end-to-end (all the layers are trained simultaneously).
A gradient-based algorithm is used to minimize a supervised loss function.
Alternated Convolutions and Subsampling
[Figure: multiple convolutions (“simple cells”) followed by averaging/subsampling (“complex cells”).]
Local features are extracted everywhere.
The averaging/subsampling layer builds robustness to variations in feature locations.
Hubel/Wiesel '62, Fukushima '71, LeCun '89, Riesenhuber & Poggio '02, Ullman '02, ...
Normalized-Uniform Set: Error Rates
Jittered-Cluttered Dataset:
291,600 stereo pairs for training, 58,320 for testing.
Objects are jittered: position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ...
Input dimension: 98x98x2 (approx. 18,000).
Experiment 2: Jittered-Cluttered Dataset
Learned Features
[Figure: input and the learned features at Layer 1, Layer 2, and Layer 3.]
Examples (Monocular Mode)
Natural Images (Monocular Mode)
Commercially Deployed Applications of Convolutional Nets
Supervised Convolutional Nets: Pros and Cons
Can we use this idea to reduce the number of necessary labeled examples?
Models Similar to ConvNets
HMAX [Poggio & Riesenhuber 2003], [Serre et al. 2007], [Mutch and Lowe CVPR 2006]
Difference? The features are not learned.
HMAX is very similar to Fukushima's Neocognitron.
[Figure from Serre et al. 2007]
Part 3:
Unsupervised Training of “Deep” Energy-Based Models,
Learning Invariant Feature Hierarchies
[Figure: stacked encoder-decoder modules: INPUT Y -> encoder/decoder with reconstruction COST -> LEVEL 1 FEATURES -> encoder/decoder with reconstruction COST -> LEVEL 2 FEATURES.]
Encoder-Decoder Architecture for Unsupervised Learning
Unsupervised Training of Energy-Based Models
Basic Idea:
push down on the energy of training samples,
pull up on the energy of everything else,
but this is often intractable.
Deep Learning for Non-Linear Dimensionality Reduction
Restricted Boltzmann Machine:
simple energy function E(Y, Z, W) = -\sum_{ij} Y_i W_{ij} Z_j
code units are binary stochastic
training with contrastive divergence
“bubble” detectors
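A minimal NumPy sketch of contrastive-divergence (CD-1) training for this energy function (bias terms omitted for brevity; toy random binary data stands in for real inputs):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Y = (rng.random((100, 16)) > 0.5).astype(float)   # toy binary data, shape (samples, visible)
W = 0.01 * rng.standard_normal((16, 8))           # E(Y, Z, W) = -sum_ij Y_i W_ij Z_j
lr = 0.1

for epoch in range(10):
    pz = sigmoid(Y @ W)                            # P(Z_j = 1 | Y) for the data
    Z = (rng.random(pz.shape) < pz).astype(float)  # sample the binary stochastic code
    py = sigmoid(Z @ W.T)                          # one Gibbs step: reconstruct Y ...
    Y_neg = (rng.random(py.shape) < py).astype(float)
    pz_neg = sigmoid(Y_neg @ W)                    # ... and re-infer the code
    # CD-1 update: push down the energy of the data, pull up that of the reconstructions
    W += lr * (Y.T @ pz - Y_neg.T @ pz_neg) / len(Y)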
Non-Linear Dimensionality Reduction: MNIST
Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples
Training on handwritten digits
Unsupervised Training of Convolutional Filters
CLASSIFICATION EXPERIMENTS
IDEA: improve supervised learning by pre-training the filters of the first convolutional layer with the unsupervised method (*).
Sparse representations & LeNet-6 (1->50->50->200->10).
Experiment 2: same as Experiment 1, but with the training set augmented by elastically distorted digits (random initialization gives a test error rate of 0.49%).
Test error rate: 0.39%. Training error rate: 0.23%.
(*) [Hinton, Osindero, Teh, "A fast learning algorithm for deep belief nets", Neural Computation 2006]
Best Results on MNIST (from raw images: no preprocessing)
MNIST Errors (0.42% error)
Training on natural image patches
[Figure: two encoder-decoder architectures side by side. Left: INPUT Y -> ENCODER -> CODE Z -> DECODER -> reconstruction COST. Right: the invariant version, in which the ENCODER produces INVARIANT FEATURES (CODE) Z together with TRANSFORMATION PARAMETERS U, and the DECODER uses both to reconstruct INPUT Y.]
Learning Invariant Feature Hierarchies
[Figure: a 17x17 input image goes through the encoder (convolutions producing feature maps, then max pooling whose "switch" positions serve as transformation parameters) and the decoder (upsampling driven by the switches, then convolutions) to produce a 17x17 reconstruction.]
Shift Invariant Global Features on MNIST
Example of Reconstruction
[Figure: ORIGINAL DIGIT and its RECONSTRUCTION as a sum of the ACTIVATED DECODER BASIS FUNCTIONS (in the feedback layer); red squares: decoder bases.]
Learning Invariant Filters in a Convolutional Net
Influence of Number of Training Samples
Generic Object Recognition: 101 categories + background
Training the 1st stage filters
4x4 pooling
Training the 2nd stage filters
13x13 input windows (complex-cell receptive fields on the 1st-stage features)
5x5 pooling
Generic Object Recognition: 101 categories + background
Shift-Invariant Feature Hierarchies on Caltech-101
Two layers of filters trained unsupervised, plus a supervised classifier on top.
54% correct on Caltech-101 with 30 examples per class.
[Figure: input image 140x140 -> first-level feature extraction (convolution, 64 9x9 filters) -> 8 among the 64 33x33 feature maps -> second-level feature extraction (convolution, 2048 9x9 filters) -> 2 among the 512 5x5 feature maps.]
Recognition Rate on Caltech 101
Conclusion