Deep Learning
Yann LeCun
The Courant Institute of Mathematical Sciences
New York University
http://yann.lecun.com
The Challenges of Machine Learning
Can we find learning methods that solve really complex problems end-to-end, such as vision, natural language, speech...?
How can we build/learn internal representations of the world that allow us to discover its hidden structure?
How can we learn internal representations that capture the relevant information and eliminate irrelevant variability?
How can a human or a machine learn internal representations by just looking at the world?
The Next Frontier in Machine Learning: Learning Representations
The representation of the input, and the metric to compare them, are assumed to be "intelligently designed."
Example: Support Vector Machines require a good input representation and a good kernel function.
How can different sentences with the same meaning be mapped to the same internal representation?
The trainable classifier is often generic (task independent) and "simple" (linear classifier, kernel machine, nearest neighbor, ...).
Good Representations are Hierarchical
[Diagram: Trainable Feature Extractor -> Trainable Feature Extractor -> Trainable Classifier]
Words -> Parts of Speech -> Sentences -> Text
Pixels -> Edges -> Textons -> Parts -> Objects -> Scenes
“Deep” Learning: Learning Hierarchical Representations
[Diagram: Trainable Feature Extractor -> Trainable Feature Extractor -> Trainable Classifier]
If the visual system is deep and learned, what is the learning algorithm?
What learning algorithm can train neural nets as "deep" as the visual system (10 layers)?
Do we really need deep architectures?
Deep machines are more efficient than shallow ones for representing certain classes of functions, particularly those involved in visual recognition.
We need an efficient parameterization of the class of functions that are useful for "AI" tasks.
Why are Deep Architectures More Efficient?
[Bengio & LeCun 2007 “Scaling Learning Algorithms Towards AI”]
A deep architecture trades space for time (or breadth for depth).
Depth-breadth tradeoff:
Example: N-bit parity requires an exponential number of gates if we restrict ourselves to 2 layers (a DNF formula with an exponential number of minterms).
Example: N-bit addition requires O(N) gates and O(N) layers using N one-bit adders with ripple-carry propagation, but lots of gates (some polynomial in N) if we restrict ourselves to two layers (e.g. Disjunctive Normal Form).
Bad news: almost all Boolean functions have a DNF formula with an exponential number of minterms, O(2^N).
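To make the tradeoff concrete, here is a minimal sketch (my own illustration, not code from the talk) of the O(N)-gate, O(N)-layer ripple-carry construction; a depth-2 DNF circuit for the same function would need vastly more gates.

    # Depth-breadth tradeoff for N-bit addition: a ripple-carry adder uses
    # O(N) gates arranged in O(N) layers; a 2-layer (DNF) circuit needs far more.
    def full_adder(a, b, carry_in):
        # One-bit full adder: roughly 5 gates (2 XOR, 2 AND, 1 OR).
        s = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return s, carry_out

    def ripple_carry_add(x_bits, y_bits):
        # Chain N full adders; the carry path makes the circuit O(N) deep.
        carry, out = 0, []
        for a, b in zip(x_bits, y_bits):
            s, carry = full_adder(a, b, carry)
            out.append(s)
        return out + [carry]

    # Example: 6 + 3 = 9, with 4-bit inputs given LSB first.
    print(ripple_carry_add([0, 1, 1, 0], [1, 1, 0, 0]))  # [1, 0, 0, 1, 0] = 9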
Strategies (a parody of [Hinton 2007])
Denial: kernel machines can approximate anything we want, and the VC bounds guarantee generalization. Why would we need anything else?
Unfortunately, kernel machines with common kernels can only represent a tiny subset of functions efficiently.
Optimism: let's look for learning models that can be applied to the largest possible subset of the AI-set, while requiring the smallest amount of task-specific knowledge for each task.
Convolutional Networks: [Vaillant, Monrocq, LeCun, IEE Proc. Vision, Image and Signal Processing, 1994]
The loss surface is non-convex, ill-conditioned, has saddle points, and has flat spots.
Back-prop doesn't work well with networks that are tall and skinny.
Short and fat nets with fixed first layers aren't very different from SVMs.
For reasons that are not well understood theoretically, back-prop works well when networks are highly structured.
An Old Idea for Local Shift Invariance
[Diagram: multiple convolutions followed by pooling/subsampling]
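A minimal sketch of the idea (my own, with an arbitrary toy kernel): convolve, then pool/subsample, so that a small shift of a feature mostly lands in the same pooled cell.

    import numpy as np

    def convolve2d_valid(image, kernel):
        # Plain "valid" 2D convolution: the feature-detection (simple cell) stage.
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    def subsample2x2(fmap):
        # 2x2 average pooling with stride 2: a feature that shifts by one
        # pixel usually stays in the same pooled cell (local shift invariance).
        h, w = fmap.shape
        return fmap[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).mean(axis=(1, 3))

    edge_kernel = np.array([[1.0, -1.0]])      # toy oriented-edge detector
    image = np.random.rand(8, 8)
    print(subsample2x2(convolve2d_valid(image, edge_kernel)).shape)  # (4, 3)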
The Multistage Hubel-Wiesel Architecture
[Fukushima 1971-1982] Neocognitron
[LeCun 1988-2007] Convolutional net
[Poggio 2002-2006] HMAX
[Ullman 2002-2006] Fragment hierarchy
[Lowe 2006] HMAX
QUESTION: How do we find (or learn) the filters?
Getting Inspiration from Biology: Convolutional Network
Broadly tuned (possibly invariant) features: sigmoid units are on half the time.
Global discriminative training: the whole system is trained "end-to-end" with a gradient-based method to minimize a global loss function.
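A minimal sketch of what "end-to-end" means here (assumptions mine: a toy two-stage net and plain gradient descent): every stage receives gradients from the same global loss.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 2)) * 0.5       # stage 1: trainable feature extractor
    W2 = rng.normal(size=(1, 4)) * 0.5       # stage 2: trainable classifier
    X = rng.normal(size=(100, 2))
    Y = X[:, :1] * X[:, 1:]                  # toy target function
    lr = 0.05

    for step in range(2000):
        H = np.tanh(X @ W1.T)                # forward through stage 1
        P = H @ W2.T                         # forward through stage 2
        E = P - Y                            # global quadratic loss: mean(E**2)
        dP = 2 * E / len(X)                  # back-propagate the one global loss
        dW2 = dP.T @ H
        dH = dP @ W2
        dW1 = (dH * (1 - H**2)).T @ X
        W1 -= lr * dW1                       # both stages trained simultaneously
        W2 -= lr * dW2

    print("final loss:", float(np.mean(E**2)))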
Convolutional Net Architecture
Input: 1@32x32
Layer 1: 6@28x28 (5x5 convolution)
Layer 2: 6@14x14 (2x2 pooling/subsampling)
Layer 3: 12@10x10 (5x5 convolution)
Layer 4: 12@5x5 (2x2 pooling/subsampling)
Layer 5: 100@1x1 (5x5 convolution)
Layer 6: 10 outputs
Convolutional layers (simple cells): all units in a feature plane share the same weights
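The feature-map sizes follow from simple arithmetic; a small sketch (mine) tracing them:

    def conv_size(n, k):
        # "Valid" convolution: output side = input side - kernel side + 1.
        return n - k + 1

    def pool_size(n, p):
        # Non-overlapping pooling/subsampling with stride p.
        return n // p

    n = 32                                   # input: 1@32x32
    n = conv_size(n, 5); print("L1:", n)     # 28 -> 6@28x28
    n = pool_size(n, 2); print("L2:", n)     # 14 -> 6@14x14
    n = conv_size(n, 5); print("L3:", n)     # 10 -> 12@10x10
    n = pool_size(n, 2); print("L4:", n)     # 5  -> 12@5x5
    n = conv_size(n, 5); print("L5:", n)     # 1  -> 100@1x1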
Any Architecture works
Any connection is permissible.
Any module is permissible.
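A minimal sketch (my own rendering of the idea, not code from the talk): any module that implements a forward pass and a back-propagation pass can be wired into the network, in any topology.

    import numpy as np

    class Tanh:
        # A module only needs fprop and bprop to be usable anywhere in the graph.
        def fprop(self, x):
            self.y = np.tanh(x)
            return self.y
        def bprop(self, dy):
            return dy * (1 - self.y ** 2)    # chain rule through the module

    class Linear:
        def __init__(self, n_in, n_out, rng=np.random.default_rng(0)):
            self.W = rng.normal(size=(n_out, n_in)) * 0.1
        def fprop(self, x):
            self.x = x
            return self.W @ x
        def bprop(self, dy):
            self.dW = np.outer(dy, self.x)   # gradient w.r.t. the weights
            return self.W.T @ dy             # gradient w.r.t. the input

    # Any stack (or DAG) of such modules is trainable end-to-end.
    net = [Linear(3, 4), Tanh(), Linear(4, 1)]
    x, g = np.ones(3), np.array([1.0])
    for m in net: x = m.fprop(x)
    for m in reversed(net): g = m.bprop(g)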
Deep Supervised Learning is Hard
Example: what is the loss function for the simplest 2-layer neural net ever?
Function: a 1-1-1 neural net. Map 0.5 to 0.5 and -0.5 to -0.5 (the identity function) with quadratic cost.
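Even this tiny net has a nasty loss surface. A minimal sketch (assuming tanh units, an assumption on my part since the unit type isn't stated here): the surface is non-convex, with symmetric minima, saddle points, and a plateau around the origin.

    import numpy as np

    def loss(w1, w2):
        # Quadratic cost of the 1-1-1 net y = w2 * tanh(w1 * x) on the two
        # training pairs (0.5 -> 0.5) and (-0.5 -> -0.5).
        return sum((w2 * np.tanh(w1 * x) - t) ** 2
                   for x, t in [(0.5, 0.5), (-0.5, -0.5)])

    # Coarse scan: (w1, w2) and (-w1, -w2) give identical outputs (two
    # symmetric minima), and the origin sits on a flat, saddle-like region.
    for w1 in (-2.0, 0.0, 2.0):
        print([round(loss(w1, w2), 3) for w2 in (-2.0, 0.0, 2.0)])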
MNIST Handwritten Digit Dataset
MNIST: 60,000 training samples, 10,000 test samples.
Results on MNIST Handwritten Digits
Some Results on MNIST (from raw images: no preprocessing)
Note: some groups have obtained good results with various amounts of preprocessing, such as deskewing (e.g. 0.56% using an SVM with smart kernels [deCoste and Schoelkopf]) and hand-designed feature representations (e.g. 0.63% with "shape context" and nearest neighbor [Belongie]).
Invariance and Robustness to Noise
Handwriting Recognition
Face Detection and Pose Estimation with Convolutional Nets
Each sample is used 5 times, with random variation in scale, in-plane rotation, brightness, and contrast.
2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector, as sketched below.
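A minimal sketch of that two-phase bootstrapping (train_detector and score are hypothetical placeholder functions, not the actual training code):

    import random

    def bootstrap_negatives(positives, negatives, background_patches,
                            train_detector, score, threshold=0.5):
        # Phase 1: train an initial detector on the initial negative set.
        detector = train_detector(positives, negatives)
        # Collect "hard negatives": background patches the detector fires on.
        false_pos = [p for p in background_patches
                     if score(detector, p) > threshold]
        # Phase 2: replace half of the negatives with false positives; retrain.
        negatives = list(negatives)
        random.shuffle(negatives)
        keep = negatives[:len(negatives) // 2]
        new_negatives = keep + false_pos[:len(negatives) - len(keep)]
        return train_detector(positives, new_negatives)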
Face Detection: Results
Face Detection and Pose Estimation: Results
Face Detection with a Convolutional Net
Generic Object Detection and Recognition
with Invariance to Pose and Illumination
10 instances per category: 5 instances used for training, 5 instances for testing.
Raw dataset: 972 stereo pairs for each object instance; 48,600 image pairs total.
For each instance:
18 azimuths
9 elevations
6 illuminations (on/off combinations of 4 lights)
2 cameras (stereo)
[Images: training instances and test instances]
Textured and Cluttered Datasets
Experiment 1: Normalized-Uniform Set: Representations
Convolutional Network
Stereo input: 2@96x96
Layer 1: 8@92x92 (5x5 convolution, 16 kernels)
Layer 2: 8@23x23 (4x4 subsampling)
Layer 3: 24@18x18 (6x6 convolution, 96 kernels)
Layer 4: 24@6x6 (3x3 subsampling)
Layer 5: 100 (6x6 convolution, 2400 kernels)
Layer 6: fully connected (500 weights)
The entire network is trained end-to-end (all the layers are trained simultaneously).
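A back-of-the-envelope sketch (my own arithmetic from the numbers above, assuming each of the 8 Layer-1 maps sees both stereo inputs, which is consistent with the 16 kernels): weight sharing keeps the free-parameter count tiny relative to the connection count.

    # Layer 1: 16 kernels of 5x5, output 8 feature maps of 92x92.
    kernels, ksize = 16, 5 * 5
    maps, map_units = 8, 92 * 92

    free_weights = kernels * ksize              # 400 shared weights
    connections = maps * map_units * 2 * ksize  # each unit sums 2 * 25 inputs
    print(free_weights, connections)            # 400 vs 3,385,600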
Normalized-Uniform Set: Error Rates
Jittered-Cluttered Dataset:
Convolutional Net + SVM on top:
Jittered-Cluttered Dataset
K-NN and SVMs with Gaussian kernels are based on matching global templates.
There is no way to learn invariant recognition tasks with such naïve architectures (unless we use an impractically large number of templates).
The number of necessary templates grows exponentially with the number of dimensions of variation (a worked example follows below).
Global templates are in trouble when the variations include: category, instance shape, configuration (for articulated objects), position, azimuth, elevation, scale, illumination, texture, albedo, in-plane rotation, background luminance, background texture, background clutter, ...
[Diagram: Input -> Features (similarities): global template matchers (each training sample is a template) -> linear combinations -> Output]
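The worked example (the numbers are mine and purely illustrative): even a coarse sampling of each factor of variation blows up combinatorially.

    # Template count under d independent factors of variation, each sampled
    # at only 5 values: growth is 5**d.
    values_per_factor = 5
    for n_factors in (2, 4, 6, 8, 10):
        print(n_factors, "factors ->", values_per_factor ** n_factors, "templates")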
Examples (Monocular Mode)
Learned Features
[Images: learned features at Layer 1 and Layer 3, with the input]
Visual Navigation for a Mobile Robot
Convolutional Nets for Image Region Labeling
[Images: input image, stereo labels, classifier output]
Industrial Applications of ConvNets
AT&T/Lucent/NCR
Vidient Inc
Vidient Inc's "SmartCatch" system deployed in several airports and facilities around the US for detecting intrusions, tailgating, and abandoned objects (Vidient is a spin-off of NEC).
NEC Labs
Google
NCR, ???
Microsoft
France Telecom
Prototype runs at lower speed because of the narrow memory bus on the dev board.
A high-end FPGA could deliver very high speed: 1024 multipliers at 500 MHz give roughly 500 Gcps peak performance (1024 x 500x10^6 = 512x10^9 connections per second).
CNP Architecture
Systolic Convolver: 7x7 kernel in 1 clock cycle
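A reference model of what the convolver computes (a software sketch of mine, not the FPGA design): a 7x7 kernel costs 49 multiply-accumulates per output pixel, and the systolic array performs all 49 in parallel in a single clock cycle.

    import numpy as np

    def conv7x7_pixel(window, kernel):
        # One output pixel: 49 MACs. The systolic array does all 49 at once;
        # this software loop does them one per iteration.
        acc = 0.0
        for i in range(7):
            for j in range(7):
                acc += window[i, j] * kernel[i, j]
        return acc

    window, kernel = np.random.rand(7, 7), np.random.rand(7, 7)
    print(np.isclose(conv7x7_pixel(window, kernel), np.sum(window * kernel)))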
Design
Sigmoid X to Y
Face detector on CNP
Results
FPGA Custom Board: NYU ConvNet Proc
DVI output
Models Similar to ConvNets
HMAX
Difference?