
Deep Learning

Yann Le Cun
The Courant Institute of Mathematical Sciences
New York University
http://yann.lecun.com

Yann LeCun
The Challenges of Machine Learning

How can we use learning to progress towards AI?

Can we find learning methods that scale?

Can we find learning methods that solve really complex problems end-to-end, such as vision, natural language, speech, ...?

How can we learn the structure of the world?

How can we build/learn internal representations of the world that allow us to discover its hidden structure?

How can we learn internal representations that capture the relevant information and eliminate irrelevant variability?

How can a human or a machine learn internal representations by just looking at the world?

Yann LeCun
The Next Frontier in Machine Learning: Learning Representations

The big success of ML has been to learn classifiers from labeled data.

The representation of the input, and the metric to compare inputs, are assumed to be “intelligently designed.”

Example: Support Vector Machines require a good input representation and a good kernel function.

The next frontier is to “learn the features.”

The question: how can a machine learn good internal representations?

In language, good representations are paramount.

What makes the words “cat” and “dog” semantically similar?

How can different sentences with the same meaning be mapped to the same internal representation?

How can we leverage unlabeled data (which is plentiful)?


Yann LeCun
The Traditional “Shallow” Architecture for Recognition

[Diagram: raw input → pre-processing / feature extraction (this part is mostly hand-crafted) → internal representation → “simple” trainable classifier]

The raw input is pre-processed through a hand-crafted feature extractor.

The features are not learned.

The trainable classifier is often generic (task-independent) and “simple” (linear classifier, kernel machine, nearest neighbor, ...).

The most common Machine Learning architecture: the kernel machine.


Yann LeCun
The Next Challenge of ML, Vision (and Neuroscience)

How do we learn invariant representations?

From the image of an airplane, how do we extract a representation that is invariant to pose, illumination, background, clutter, object instance, ...?

How can a human (or a machine) learn those representations by just looking at the world?

How can we learn visual categories from just a few examples?

I don't need to see many airplanes before I can recognize every airplane (even really weird ones).

Yann LeCun
Good Representations are Hierarchical

[Diagram: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier]

In language: hierarchy in syntax and semantics

Words -> Parts of Speech -> Sentences -> Text

Objects, Actions, Attributes, ... -> Phrases -> Statements -> Stories

In vision: part-whole hierarchy

Pixels -> Edges -> Textons -> Parts -> Objects -> Scenes

Yann LeCun
“Deep” Learning: Learning Hierarchical Representations

[Diagram: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier, with a learned internal representation at each stage]

Deep learning: learning a hierarchy of internal representations.

From low-level features, to mid-level invariant representations, to object identities.

Representations become increasingly invariant as we go up the layers.

Using multiple stages gets around the specificity/invariance dilemma.


Yann LeCun
The Primate's Visual System is Deep

The recognition of everyday objects is a very fast process.

The recognition of common objects is essentially “feed-forward.”

But not all of vision is feed-forward.

Much of the visual system (all of it?) is the result of learning.

How much prior structure is there?

If the visual system is deep and learned, what is the learning algorithm?

What learning algorithm can train neural nets as “deep” as the visual system (10 layers?)?

Unsupervised vs. supervised learning?

What is the loss function?

What is the organizing principle?

Broader question (Hinton): what is the learning algorithm of the neocortex?

Yann LeCun
Do we really need deep architectures?

We can approximate any function as closely as we want with a shallow architecture. Why would we need deep ones?

Kernel machines and 2-layer neural nets are “universal.”

Deep learning machines:

Deep machines are more efficient for representing certain classes of functions, particularly those involved in visual recognition.

They can represent more complex functions with less “hardware.”

We need an efficient parameterization of the class of functions that are useful for “AI” tasks.
Yann LeCun
Why are Deep Architectures More Efficient?
[Bengio & LeCun 2007 “Scaling Learning Algorithms Towards AI”]
A deep architecture trades space for time (or breadth for depth):

more layers (more sequential computation),

but less hardware (less parallel computation).

Depth-breadth tradeoff.

Example 1: N-bit parity (see the sketch after this list)

Requires N-1 XOR gates in a tree of depth log(N).

Requires an exponential number of gates if we restrict ourselves to 2 layers (a DNF formula with an exponential number of minterms).

Example 2: circuit for adding two N-bit binary numbers

Requires O(N) gates, and O(N) layers, using N one-bit adders with ripple carry propagation.

Requires many gates (some polynomial in N) if we restrict ourselves to two layers (e.g. Disjunctive Normal Form).

Bad news: almost all Boolean functions have a DNF formula with an exponential number of minterms, O(2^N).
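To make the parity example concrete, here is a minimal Python sketch (my own illustration, not from the slides): a balanced XOR tree uses N-1 gates at depth about log2(N), whereas a two-layer DNF circuit for the same function needs on the order of 2^(N-1) minterms.

```python
from functools import reduce

def parity_tree(bits):
    """N-bit parity computed with a balanced tree of 2-input XOR gates.

    Uses N-1 gates at depth about log2(N); a two-layer (DNF) circuit for the
    same function needs on the order of 2^(N-1) minterms.
    """
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return parity_tree(bits[:mid]) ^ parity_tree(bits[mid:])

# Sanity check against a flat left-to-right XOR reduction.
bits = [1, 0, 1, 1, 0, 1, 0, 0]
assert parity_tree(bits) == reduce(lambda a, b: a ^ b, bits)
print(parity_tree(bits))  # -> 0
```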
Yann LeCun
Strategies (a parody of [Hinton 2007])

Defeatism: since no good parameterization of the “AI-set” is available, let's parameterize a much smaller set for each specific task through careful engineering (preprocessing, kernel, ...).

Denial: kernel machines can approximate anything we want, and the VC bounds guarantee generalization. Why would we need anything else?

Unfortunately, kernel machines with common kernels can only represent a tiny subset of functions efficiently.

Optimism: let's look for learning models that can be applied to the largest possible subset of the AI-set, while requiring the smallest amount of task-specific knowledge for each task.

There is a parameterization of the AI-set with neurons.

Is there an efficient parameterization of the AI-set with computer technology?

Today, the ML community oscillates between defeatism and denial.


Yann LeCun
Supervised Deep Learning,
The Convolutional Network Architecture

Convolutional Networks:

[LeCun et al., Neural Computation, 1988]

[LeCun et al., Proc. IEEE, 1998] (handwriting recognition)

Face detection and pose estimation with convolutional networks:

[Vaillant, Monrocq, LeCun, IEE Proc. Vision, Image and Signal Processing, 1994]

[Osadchy, Miller, LeCun, JMLR vol. 8, May 2007]

Category-level object recognition with invariance to pose and lighting:

[LeCun, Huang, Bottou, CVPR 2004]

[Huang, LeCun, CVPR 2006]

Autonomous robot driving:

[LeCun et al., NIPS 2005]


Yann LeCun
Deep Supervised Learning is Hard

The loss surface is non-convex, ill-conditioned, has saddle points, has flat spots...

For large networks, it will be horrible! (Not really, actually.)

Back-prop doesn't work well with networks that are tall and skinny:

lots of layers with few hidden units.

Back-prop works fine with short and fat networks,

but over-parameterization becomes a problem without regularization.

Short and fat nets with fixed first layers aren't very different from SVMs.

For reasons that are not well understood theoretically, back-prop works well when the networks are highly structured,

e.g. convolutional networks.

Yann LeCun
An Old Idea for Local Shift Invariance

[Hubel & Wiesel 1962]:

Simple cells detect local features.

Complex cells “pool” the outputs of simple cells within a retinotopic neighborhood.

[Diagram: retinotopic feature maps; multiple convolutions (“simple cells”) followed by pooling/subsampling (“complex cells”)]
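As an illustration of the pooling idea, here is a minimal NumPy sketch (my own, not from the slides): “complex cell” responses are obtained by pooling “simple cell” responses over small neighborhoods, giving local shift invariance.

```python
import numpy as np

def pool2x2(simple_cells, mode="max"):
    """Pool a 2D map of "simple cell" responses over non-overlapping 2x2
    neighborhoods, producing half-resolution "complex cell" responses that
    are locally invariant to ~1-pixel shifts."""
    h, w = simple_cells.shape
    blocks = simple_cells[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

# A feature detected at two slightly shifted positions pools to the same output.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[0, 1] = 1.0
print(pool2x2(a))
print(pool2x2(b))  # identical to pool2x2(a)
```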

Yann LeCun
The Multistage Hubel-Wiesel Architecture

Building a complete artificial vision system:

Stack multiple stages of simple cell / complex cell layers.

Higher stages compute more global, more invariant features.

Stick a classification layer on top.

[Fukushima 1971-1982]: Neocognitron

[LeCun 1988-2007]: Convolutional net

[Poggio 2002-2006]: HMAX

[Ullman 2002-2006]: Fragment hierarchy

[Lowe 2006]: HMAX

QUESTION: How do we find (or learn) the filters?

Yann LeCun
Getting Inspiration from Biology: Convolutional Network

Hierarchical/multilayer: features get progressively more global, invariant, and numerous.

Dense features: feature detectors are applied everywhere (no interest points).

Broadly tuned (possibly invariant) features: sigmoid units are on half the time.

Global discriminative training: the whole system is trained “end-to-end” with a gradient-based method to minimize a global loss function.
Yann LeCun
Convolutional Net Architecture

[Diagram: input 1@32x32 → Layer 1: 6@28x28 (5x5 convolution) → Layer 2: 6@14x14 (2x2 pooling/subsampling) → Layer 3: 12@10x10 (5x5 convolution) → Layer 4: 12@5x5 (2x2 pooling/subsampling) → Layer 5: 100@1x1 (5x5 convolution) → Layer 6: 10 outputs]

Convolutional net for handwriting recognition (400,000 synapses)

Convolutional layers (simple cells): all units in a feature plane share the same weights

Pooling/subsampling layers (complex cells): for invariance to small distortions.

Supervised gradient-descent learning using back-propagation
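Below is a minimal PyTorch sketch of the layer sizes listed above; it is a rough reconstruction (not the original Lush code), and the choice of tanh units and plain average pooling is an assumption.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """Rough reconstruction of the 1@32x32 -> 6 -> 12 -> 100 -> 10 ConvNet above.

    Average pooling and tanh are assumptions; the original used trainable
    subsampling layers and sigmoid-like squashing units.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 1@32x32 -> 6@28x28
            nn.AvgPool2d(2),                               # 6@28x28 -> 6@14x14
            nn.Conv2d(6, 12, kernel_size=5), nn.Tanh(),    # 6@14x14 -> 12@10x10
            nn.AvgPool2d(2),                               # 12@10x10 -> 12@5x5
            nn.Conv2d(12, 100, kernel_size=5), nn.Tanh(),  # 12@5x5 -> 100@1x1
        )
        self.classifier = nn.Linear(100, 10)               # Layer 6: 10 classes

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

net = LeNetStyle()
print(net(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```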


Yann LeCun
Back-propagation: deep supervised gradient-based learning

Yann LeCun
Any Architecture works

Any connection is permissible:

networks with loops must be “unfolded in time.”

Any module is permissible,

as long as it is continuous and differentiable almost everywhere with respect to the parameters, and with respect to its non-terminal inputs.
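A minimal sketch of the “unfolding in time” idea, using PyTorch autograd as a modern stand-in (the module and sizes below are made up for illustration): the feedback loop is unrolled over time steps so the graph becomes an ordinary deep feed-forward graph that back-prop can handle.

```python
import torch
import torch.nn as nn

# Hypothetical module with a feedback loop: the state feeds back into itself.
# Training it with back-prop means unfolding the loop over T time steps.
cell = nn.Linear(4 + 3, 4)          # maps [state, input] -> next state
readout = nn.Linear(4, 1)

x_seq = torch.randn(5, 3)           # 5 time steps of 3-d input
state = torch.zeros(4)
for x_t in x_seq:                   # explicit unfolding in time
    state = torch.tanh(cell(torch.cat([state, x_t])))

loss = (readout(state) - 1.0).pow(2).mean()
loss.backward()                     # gradients flow through all unfolded steps
print(cell.weight.grad.shape)       # torch.Size([4, 7])
```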

Yann LeCun
Deep Supervised Learning is Hard

Example: what does the loss function look like for the simplest 2-layer neural net ever?

Function: a 1-1-1 neural net. Map 0.5 to 0.5 and -0.5 to -0.5 (the identity function), with quadratic cost:
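The slide's surface plot is not reproduced here; the following NumPy sketch evaluates that two-weight loss surface, assuming a tanh hidden unit (an assumption on my part):

```python
import numpy as np

def loss(w1, w2):
    """Quadratic cost of the 1-1-1 net y = w2 * tanh(w1 * x) on the two examples
    (0.5 -> 0.5) and (-0.5 -> -0.5); the tanh hidden unit is an assumption."""
    def err(x, t):
        return (w2 * np.tanh(w1 * x) - t) ** 2
    return err(0.5, 0.5) + err(-0.5, -0.5)

# Evaluate on a grid: even for this tiny network the surface is non-convex,
# with two symmetric valleys and a saddle point at the origin.
w = np.linspace(-4, 4, 9)
W1, W2 = np.meshgrid(w, w)
print(np.round(loss(W1, W2), 2))
```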

Yann LeCun
MNIST Handwritten Digit Dataset

Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples

Yann LeCun
Results on MNIST Handwritten Digits

Yann LeCun
Some Results on MNIST (from raw images: no preprocessing)

Note: some groups have obtained good results with various amounts of preprocessing, such as deskewing (e.g. 0.56% using an SVM with smart kernels [DeCoste and Schoelkopf]) or hand-designed feature representations (e.g. 0.63% with “shape context” and nearest neighbor [Belongie]).
Yann LeCun
Invariance and Robustness to Noise

Yann LeCun
Handwriting Recognition

Yann LeCun
Face Detection and Pose Estimation with Convolutional Nets

Training: 52,850 32x32 grey-level images of faces, and 52,850 non-faces.

Each sample is used 5 times, with random variation in scale, in-plane rotation, brightness, and contrast (sketched below).

2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector.
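A minimal sketch of that kind of jittering using torchvision transforms (my choice of library; the jitter ranges below are illustrative, not the values used for the detector):

```python
from torchvision import transforms

# Each 32x32 face crop is re-used several times with random scale, in-plane
# rotation, brightness, and contrast jitter (ranges here are illustrative).
jitter = transforms.Compose([
    transforms.RandomAffine(degrees=20, scale=(0.9, 1.1)),   # rotation + scale
    transforms.ColorJitter(brightness=0.3, contrast=0.3),    # photometric jitter
    transforms.ToTensor(),
])

# Typical use, given a 32x32 PIL image `face` (hypothetical variable):
# samples = [jitter(face) for _ in range(5)]
```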

Yann LeCun
Face Detection: Results

Yann LeCun
Face Detection and Pose Estimation: Results

Yann LeCun
Face Detection with a Convolutional Net

Yann LeCun
Generic Object Detection and Recognition
with Invariance to Pose and Illumination

50 toys belonging to 5 categories: animal, human figure, airplane, truck, car

10 instances per category: 5 instances used for training, 5 instances for testing.

Raw dataset: 972 stereo pairs of each object instance; 48,600 image pairs total.

For each instance:

18 azimuths: 0 to 350 degrees, every 20 degrees

9 elevations: 30 to 70 degrees from horizontal, every 5 degrees

6 illuminations: on/off combinations of 4 lights

2 cameras (stereo)

[Images: training instances and test instances]

Yann LeCun
Textured and Cluttered Datasets

Yann LeCun
Experiment 1: Normalized-Uniform Set: Representations

1 – Raw stereo input: 2 images, 96x96 pixels; input dim. = 18,432

2 – Raw monocular input: 1 image, 96x96 pixels; input dim. = 9,216

3 – Subsampled mono input: 1 image, 32x32 pixels; input dim. = 1,024

[Image: first 60 eigenvectors (EigenToys)]

4 – PCA-95 (EigenToys): first 95 principal components; input dim. = 95
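A minimal sketch of the PCA-95 representation using scikit-learn (my choice of tool; random data stands in for the NORB training images):

```python
import numpy as np
from sklearn.decomposition import PCA

# images: one 96x96 grey-level image per row, flattened to 9,216 dimensions.
# Random data is used here as a stand-in for the actual training images.
images = np.random.rand(1000, 96 * 96)

pca = PCA(n_components=95)               # keep the first 95 principal components
codes = pca.fit_transform(images)        # PCA-95 representation, shape (1000, 95)
eigentoys = pca.components_.reshape(-1, 96, 96)  # "EigenToys" basis images
print(codes.shape, eigentoys.shape)      # (1000, 95) (95, 96, 96)
```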

Yann LeCun
Convolutional Network

[Diagram: stereo input 2@96x96 → Layer 1: 8@92x92 (5x5 convolution, 16 kernels) → Layer 2: 8@23x23 (4x4 subsampling) → Layer 3: 24@18x18 (6x6 convolution, 96 kernels) → Layer 4: 24@6x6 (3x3 subsampling) → Layer 5: 100 (6x6 convolution, 2400 kernels) → Layer 6: fully connected (500 weights)]

90,857 free parameters, 3,901,162 connections.

The architecture alternates convolutional layers (feature detectors) and subsampling layers (local feature pooling for invariance to small distortions).

The entire network is trained end-to-end (all the layers are trained simultaneously).

Yann LeCun
Normalized-Uniform Set: Error Rates

Linear Classifier on raw stereo images: 30.2% error.

K-Nearest-Neighbors on raw stereo images: 18.4% error.

K-Nearest-Neighbors on PCA-95: 16.6% error.

Pairwise SVM on 96x96 stereo images: 11.6% error.

Pairwise SVM on 95 principal components: 13.3% error.

Convolutional Net on 96x96 stereo images: 5.8% error.

[Images: training instances and test instances]


Yann LeCun
Normalized-Uniform Set: Learning Times

SVM: using a parallel implementation by Graf, Durdanovic, and Cosatto (NEC Labs).

Chop off the last layer of the convolutional net and train an SVM on it (see the sketch below).
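A minimal sketch of that hybrid (PyTorch and scikit-learn are my choices here, not the tools used at the time): extract the activations just below the chopped-off layer and fit an SVM on them.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Stand-in for a trained ConvNet: `features` is everything but the last layer.
features = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Conv2d(8, 24, 5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Flatten(),
)

def convnet_codes(x):
    """Activations of the layer just below the (chopped-off) classifier."""
    with torch.no_grad():
        return features(x).numpy()

# Stand-in data: 64 random 32x32 "images" with random labels.
X = torch.randn(64, 1, 32, 32)
y = torch.randint(0, 5, (64,)).numpy()

svm = SVC(kernel="rbf")
svm.fit(convnet_codes(X), y)          # SVM trained on ConvNet features
print(svm.score(convnet_codes(X), y))
```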
Yann LeCun
Jittered-Cluttered Dataset

Jittered-Cluttered Dataset:

291,600 stereo pairs for training, 58,320 for testing.

Objects are jittered: position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ...

Input dimension: 98x98x2 (approx. 18,000).


Yann LeCun
Experiment 2: Jittered-Cluttered Dataset

291,600 training samples, 58,320 test samples

SVM with Gaussian kernel: 43.3% error.

Convolutional Net with binocular input: 7.8% error.

Yann LeCun
Convolutional Net + SVM on top:
Jittered-Cluttered Dataset

Chop off the last layer, and train an SVM on it: it works!

The convex loss, VC bounds, and representer theorems don't seem to help. OUCH!
Yann LeCun
What's wrong with K-NN and SVMs?

K-NN and SVM with Gaussian kernels are based on matching global templates.

Both are “shallow” architectures.

There is no way to learn invariant recognition tasks with such naïve architectures (unless we use an impractically large number of templates).

The number of necessary templates grows exponentially with the number of dimensions of variation.

Global templates are in trouble when the variations include: category, instance shape, configuration (for articulated objects), position, azimuth, elevation, scale, illumination, texture, albedo, in-plane rotation, background luminance, background texture, background clutter, ...

[Diagram: Input → Global Template Matchers (each training sample is a template) → Features (similarities) → Linear Combinations → Output]
Examples (Monocular Mode)

Yann LeCun
Learned Features

[Figure: learned features at the input, Layer 1, and Layer 3]

Yann LeCun
Examples (Monocular Mode)

[Image-only slides]

Yann LeCun
Visual Navigation for a Mobile Robot

[LeCun et al. NIPS 2005]

Mobile robot with two cameras

The convolutional net is trained to emulate a human driver from recorded sequences of video + human-provided steering angles.

The network maps stereo images to steering angles for obstacle avoidance.
Convolutional Nets for Counting/Classifying Zebra Fish

Head – Straight Tail – Curved Tail


Yann LeCun
C. Elegans Embryo Phenotyping

Analyzing results for Gene Knock-Out Experiments



Convolutional Nets For Brain Imaging and Biology

Brain tissue reconstruction from slice images [Jain, ..., Denk, Seung 2007]

Sebastian Seung's lab at MIT.

3D convolutional net for image segmentation.

ConvNets outperform MRFs, Conditional Random Fields, Mean Shift, Diffusion, ... [ICCV'07]

Yann LeCun
Convolutional Nets for Image Region Labeling

Long-range obstacle labeling for vision-based mobile robot navigation

(more on this later...)

[Figures: input image, stereo labels, and classifier output for several scenes]

Yann LeCun

Industrial Applications of ConvNets

AT&T/Lucent/NCR

Check reading, OCR, handwriting recognition (deployed 1996)

Vidient Inc

Vidient Inc's “SmartCatch” system deployed in several airports and facilities around the US for detecting intrusions, tailgating, and abandoned objects (Vidient is a spin-off of NEC)

NEC Labs

Cancer cell detection, automotive applications, kiosks

Google

OCR, ???

Microsoft

OCR, handwriting recognition, speech detection

France Telecom

Face detection, HCI, cell phone-based applications

Other projects: HRL (3D vision), ...


Yann LeCun
CNP: FPGA Implementation of ConvNets

Implementation on a low-end Xilinx FPGA.

Xilinx Spartan-3A DSP: 250 MHz, 126 multipliers.

Face detector ConvNet at 640x480: 5e8 connections.

8 fps with a 200 MHz clock: 4 Gcps effective.

Prototype runs at lower speed because of the narrow memory bus on the dev board.

Very lightweight, very low power.

Custom board the size of a matchbox (4 chips: FPGA + 3 RAM chips).

Good for vision-based navigation of micro UAVs.

A high-end FPGA could deliver very high speed: 1024 multipliers at 500 MHz: 500 Gcps peak performance.

Yann LeCun
CNP Architecture

Yann LeCun
Systolic Convolver: 7x7 kernel in 1 clock cycle

Yann LeCun
Design

Soft CPU used as a micro-sequencer.

The micro-program is a C program on the soft CPU.

16x16 fixed-point multipliers.

Weights on 16 bits, neuron states on 8 bits.

Instruction set includes:

Convolve X with kernel K, result in Y, with sub-sampling ratio S.

Sigmoid X to Y.

Multiply/divide X by Y (for contrast normalization).

Microcode is generated automatically from the network description in Lush.
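A minimal Python sketch (my own illustration) of the kind of fixed-point convolve-and-subsample step implied above, with 16-bit weights and 8-bit states; the scaling factors and rounding are assumptions.

```python
import numpy as np

W_FRAC, X_FRAC = 12, 6            # assumed fractional bits for weights / states

def to_fixed(x, frac_bits, width):
    """Quantize to signed fixed point with the given width and fraction."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return np.clip(np.round(x * (1 << frac_bits)), lo, hi).astype(np.int32)

def convolve_subsample(state, kernel, sub=2):
    """Valid convolution of 8-bit states with 16-bit weights, accumulated in
    wide integers, rescaled, then subsampled by `sub`."""
    kh, kw = kernel.shape
    h, w = state.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.int64)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(state[i:i + kh, j:j + kw].astype(np.int64) * kernel)
    out = out >> W_FRAC                       # rescale back to the state format
    return np.clip(out[::sub, ::sub], -128, 127).astype(np.int8)

x = to_fixed(np.random.rand(16, 16), X_FRAC, 8).astype(np.int8)          # 8-bit states
k = to_fixed(np.random.randn(7, 7) * 0.05, W_FRAC, 16).astype(np.int16)  # 16-bit weights
print(convolve_subsample(x, k).shape)        # (5, 5)
```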

Yann LeCun
Face detector on CNP

Yann LeCun
Results

Clock speed is limited by the low memory bandwidth on the development board.

The dev board uses a single DDR with a 32-bit bus.

The custom board will use a 128-bit memory bus.

Currently uses a single 7x7 convolver.

We have space for 2, but the memory bandwidth limits us.

Current implementation: 5 fps at 512x384.

Custom board will yield 30 fps at 640x480.

On the order of 10^10 connections per second, peak.


Yann LeCun
Results

[Image-only slides]

Yann LeCun
FPGA Custom Board: NYU ConvNet Proc

Xilinx Virtex-4 FPGA, 8x5 cm board.

Dual camera port, expansion and I/O port.

Dual QDR RAM for fast memory bandwidth.

MicroSD port for easy configuration.

DVI output.

Serial communication to an optional host.

Yann LeCun
Models Similar to ConvNets

HMAX:

[Poggio & Riesenhuber 2003]

[Serre et al. 2007]

[Mutch and Lowe, CVPR 2006]

Difference? The features are not learned.

HMAX is very similar to Fukushima's Neocognitron.

[Figure from Serre et al. 2007]
Yann LeCun
