
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 11:

MULTILAYER PERCEPTRONS
Neural Networks
3

- Networks of processing units (neurons) with connections (synapses) between them
- Large number of neurons: 10^10
- Large connectivity: 10^5
- Parallel processing
- Distributed computation/memory
- Robust to noise, failures
Understanding the Brain
4

- Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
- Reverse engineering: from hardware to theory
- Parallel processing: SIMD vs MIMD
- Neural net: SIMD with modifiable local memory
- Learning: update by training/experience
Perceptron
5

y = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x

w = [w_0, w_1, \ldots, w_d]^T
x = [1, x_1, \ldots, x_d]^T

(Rosenblatt, 1962)
What a Perceptron Does
6

- Regression: y = wx + w_0
- Classification: y = 1(wx + w_0 > 0)

[Figure: perceptrons for regression (linear output) and classification (thresholded output), with input x, bias unit x_0 = +1, and weights w, w_0]

y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-w^T x)}
K Outputs
7

Regression:
  y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x
  y = W x

Classification:
  o_i = w_i^T x
  y_i = \frac{\exp o_i}{\sum_k \exp o_k}
  choose C_i if y_i = \max_k y_k
Training
8

- Online (instances seen one by one) vs batch (whole sample) learning:
  - No need to store the whole sample
  - Problem may change in time
  - Wear and degradation in system components
- Stochastic gradient descent: update after a single pattern
- Generic update rule (LMS rule), see the sketch below:
  \Delta w_{ij}^t = \eta (r_i^t - y_i^t) x_j^t
  Update = LearningFactor \cdot (DesiredOutput - ActualOutput) \cdot Input
Training a Perceptron: Regression
9

- Regression (linear output):
  E^t(w \mid x^t, r^t) = \frac{1}{2}(r^t - y^t)^2 = \frac{1}{2}\left(r^t - w^T x^t\right)^2
  \Delta w_j^t = \eta (r^t - y^t) x_j^t
Classification
10

- Single sigmoid output:
  y^t = \text{sigmoid}(w^T x^t)
  E^t(w \mid x^t, r^t) = -r^t \log y^t - (1 - r^t) \log (1 - y^t)
  \Delta w_j^t = \eta (r^t - y^t) x_j^t

- K > 2 softmax outputs (see the sketch below):
  y_i^t = \frac{\exp w_i^T x^t}{\sum_k \exp w_k^T x^t}
  E^t(\{w_i\}_i \mid x^t, r^t) = -\sum_i r_i^t \log y_i^t
  \Delta w_{ij}^t = \eta (r_i^t - y_i^t) x_j^t
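For concreteness, a minimal sketch of the K > 2 softmax case with one online cross-entropy update per pattern; variable names, hyperparameters, and data shapes are hypothetical, not from the slides.

```python
import numpy as np

def softmax(o):
    o = o - o.max()                    # subtract max for numerical stability
    e = np.exp(o)
    return e / e.sum()

def train_softmax_perceptron(X, R, K, eta=0.1, epochs=100):
    """Online training of K linear units with softmax outputs.

    X : (N, d) inputs, R : (N, K) one-hot desired outputs.
    """
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])          # prepend bias x_0 = +1
    W = np.random.uniform(-0.01, 0.01, (K, d + 1))
    for _ in range(epochs):
        for t in np.random.permutation(N):
            y = softmax(W @ Xb[t])                # y_i = exp(o_i) / Σ_k exp(o_k)
            W += eta * np.outer(R[t] - y, Xb[t])  # Δw_ij = η (r_i − y_i) x_j
    return W
```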


Learning Boolean AND
11
XOR
12

- No w_0, w_1, w_2 satisfy:
  w_0 \le 0
  w_2 + w_0 > 0
  w_1 + w_0 > 0
  w_1 + w_2 + w_0 \le 0

(Minsky and Papert, 1969)
Multilayer Perceptrons
13

y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(w_h^T x) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

(Rumelhart et al., 1986)


14
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
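To make this decomposition concrete, here is a small sketch of the forward pass of a two-hidden-unit MLP that computes XOR; the specific weight values are one hand-picked choice, an assumption for illustration rather than something taken from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hand-picked weights (illustrative choice): each row is [bias, w_1, w_2]
W = np.array([[-5.0,  10.0, -10.0],   # z_1 ≈ x1 AND NOT x2
              [-5.0, -10.0,  10.0]])  # z_2 ≈ NOT x1 AND x2
v = np.array([-5.0, 10.0, 10.0])      # y ≈ z_1 OR z_2

def mlp_xor(x1, x2):
    x = np.array([1.0, x1, x2])                   # input with bias unit x_0 = +1
    z = sigmoid(W @ x)                            # hidden layer
    y = sigmoid(v @ np.concatenate(([1.0], z)))   # output layer
    return y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, int(mlp_xor(x1, x2) > 0.5))     # prints the XOR truth table
```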
Backpropagation
15

y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(w_h^T x) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

Chain rule:
\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial z_h} \frac{\partial z_h}{\partial w_{hj}}

Regression
16

E(W, v \mid X) = \frac{1}{2} \sum_t (r^t - y^t)^2

Forward:
z_h = \text{sigmoid}(w_h^T x)
y^t = \sum_{h=1}^{H} v_h z_h^t + v_0

Backward:
\Delta v_h = \eta \sum_t (r^t - y^t) z_h^t
\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}}
             = -\eta \sum_t \frac{\partial E}{\partial y^t} \frac{\partial y^t}{\partial z_h^t} \frac{\partial z_h^t}{\partial w_{hj}}
             = \eta \sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
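Putting the forward and backward passes together, here is a minimal NumPy sketch of stochastic backpropagation for a single-output regression MLP with one sigmoid hidden layer; the function name, hidden-layer size, learning rate, epoch count, and toy target function are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=10, eta=0.05, epochs=500):
    """Backprop for y = sum_h v_h z_h + v_0 with z_h = sigmoid(w_h^T x)."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])             # prepend bias x_0 = +1
    W = np.random.uniform(-0.01, 0.01, (H, d + 1))   # hidden weights w_hj
    v = np.random.uniform(-0.01, 0.01, H + 1)        # output weights v_h
    for _ in range(epochs):
        for t in np.random.permutation(N):
            z = sigmoid(W @ Xb[t])                   # forward: hidden layer
            zb = np.concatenate(([1.0], z))          # prepend z_0 = +1
            y = v @ zb                               # forward: linear output
            delta = r[t] - y                         # (r^t − y^t)
            dv = eta * delta * zb                    # Δv_h = η (r^t − y^t) z_h^t
            # Δw_hj = η (r^t − y^t) v_h z_h (1 − z_h) x_j
            dW = eta * np.outer(delta * v[1:] * z * (1 - z), Xb[t])
            v += dv
            W += dW
    return W, v

# Hypothetical usage: fit a noisy sine
X = np.random.uniform(-3, 3, (200, 1))
r = np.sin(X[:, 0]) + 0.1 * np.random.randn(200)
W, v = train_mlp_regression(X, r)
```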
Regression with Multiple Outputs
17

E(W, V \mid X) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2

y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}

\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t) z_h^t

\Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t) v_{ih} \right] z_h^t (1 - z_h^t) x_j^t

[Figure: network with inputs x_j, hidden units z_h, outputs y_i, and weights w_{hj}, v_{ih}]
18-20
[Figures: hidden-unit activations w_h x + w_0, sigmoid outputs z_h, and weighted contributions v_h z_h on a regression example]

Two-Class Discrimination
21

- One sigmoid output y^t for P(C_1 \mid x^t), with P(C_2 \mid x^t) \equiv 1 - y^t

y^t = \text{sigmoid}\left( \sum_{h=1}^{H} v_h z_h^t + v_0 \right)

E(W, v \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]

\Delta v_h = \eta \sum_t (r^t - y^t) z_h^t

\Delta w_{hj} = \eta \sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
K>2 Classes
22

o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}

y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \approx P(C_i \mid x^t)

E(W, v \mid X) = -\sum_t \sum_i r_i^t \log y_i^t

\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t) z_h^t

\Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t) v_{ih} \right] z_h^t (1 - z_h^t) x_j^t
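The K > 2 case differs from the regression loop sketched earlier only in the softmax output and the inner sum over outputs in the hidden-layer update. A brief sketch, with function names, layer sizes, and hyperparameters assumed for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def train_mlp_classifier(X, R, K, H=10, eta=0.05, epochs=300):
    """Backprop for a K-class MLP: softmax outputs, cross-entropy error.

    X : (N, d) inputs, R : (N, K) one-hot labels.
    """
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])
    W = np.random.uniform(-0.01, 0.01, (H, d + 1))   # hidden weights w_hj
    V = np.random.uniform(-0.01, 0.01, (K, H + 1))   # output weights v_ih
    for _ in range(epochs):
        for t in np.random.permutation(N):
            z = sigmoid(W @ Xb[t])
            zb = np.concatenate(([1.0], z))
            y = softmax(V @ zb)                      # y_i = exp(o_i) / Σ_k exp(o_k)
            delta = R[t] - y                         # (r_i − y_i)
            dV = eta * np.outer(delta, zb)           # Δv_ih = η (r_i − y_i) z_h
            # Δw_hj = η [Σ_i (r_i − y_i) v_ih] z_h (1 − z_h) x_j
            dW = eta * np.outer((delta @ V[:, 1:]) * z * (1 - z), Xb[t])
            V += dV
            W += dW
    return W, V
```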
Multiple Hidden Layers
23

- MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks

z_{1h} = \text{sigmoid}(w_{1h}^T x) = \text{sigmoid}\left( \sum_{j=1}^{d} w_{1hj} x_j + w_{1h0} \right), \quad h = 1, \ldots, H_1

z_{2l} = \text{sigmoid}(w_{2l}^T z_1) = \text{sigmoid}\left( \sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0} \right), \quad l = 1, \ldots, H_2

y = v^T z_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0
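A minimal sketch of this two-hidden-layer forward pass in NumPy; the layer sizes and random weights are placeholders chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H1, H2 = 4, 8, 3                                 # illustrative layer sizes
W1 = np.random.uniform(-0.1, 0.1, (H1, d + 1))      # first hidden layer, w_1h
W2 = np.random.uniform(-0.1, 0.1, (H2, H1 + 1))     # second hidden layer, w_2l
v  = np.random.uniform(-0.1, 0.1, H2 + 1)           # output weights

def forward(x):
    z1 = sigmoid(W1 @ np.concatenate(([1.0], x)))   # z_1h = sigmoid(w_1h^T x)
    z2 = sigmoid(W2 @ np.concatenate(([1.0], z1)))  # z_2l = sigmoid(w_2l^T z_1)
    return v @ np.concatenate(([1.0], z2))          # y = v^T z_2

y = forward(np.random.randn(d))
```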
Improving Convergence
24

- Momentum (see the sketch below):
  \Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha \Delta w_i^{t-1}

- Adaptive learning rate:
  \Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\eta & \text{otherwise} \end{cases}
Overfitting/Overtraining
25

- Number of weights: H(d + 1) + (H + 1)K

26
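As a quick worked example of the weight-count formula (the values d = 10, H = 5, K = 3 are made up for illustration): a single-hidden-layer MLP with 10 inputs, 5 hidden units, and 3 outputs has H(d + 1) + (H + 1)K = 5·11 + 6·3 = 55 + 18 = 73 free parameters.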
Structured MLP
27

 Convolutional networks (Deep learning)

(Le Cun et al, 1989)


Weight Sharing
28
Hints
29

- Invariance to translation, rotation, size
- Virtual examples (Abu-Mostafa, 1995)
- Augmented error: E' = E + \lambda_h E_h
  If x' and x are the "same": E_h = [g(x \mid \theta) - g(x' \mid \theta)]^2
  Approximation hint:
  E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in [a_x, b_x] \\ (g(x \mid \theta) - a_x)^2 & \text{if } g(x \mid \theta) < a_x \\ (g(x \mid \theta) - b_x)^2 & \text{if } g(x \mid \theta) > b_x \end{cases}
Tuning the Network Size
30

- Destructive: weight decay
  \Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i
  E' = E + \frac{\lambda}{2} \sum_i w_i^2

- Constructive: growing networks
  (Ash, 1989; Fahlman and Lebiere, 1989)


Bayesian Learning
31

- Consider weights w_i as random variables, with prior p(w_i)

p(w \mid X) = \frac{p(X \mid w) \, p(w)}{p(X)}
\hat{w}_{MAP} = \arg\max_w \log p(w \mid X)
\log p(w \mid X) = \log p(X \mid w) + \log p(w) + C
p(w) = \prod_i p(w_i) \quad \text{where} \quad p(w_i) = c \cdot \exp\left[ -\frac{w_i^2}{2(1/2\lambda)} \right]
E' = E + \lambda \|w\|^2

- Weight decay, ridge regression, regularization:
  cost = data-misfit + \lambda \cdot complexity
More about Bayesian methods in chapter 14
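To show how the Gaussian prior above turns into weight decay in the gradient update, a minimal sketch; the step function name, learning rate, and λ value are assumptions for illustration.

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.05, lam=0.01):
    """Gradient step on E' = E + λ ||w||^2 (MAP estimate with a Gaussian prior):
    Δw_i = −η ∂E/∂w_i − 2ηλ w_i, i.e. the weights decay toward zero."""
    return w - eta * grad_E - 2 * eta * lam * w

# Hypothetical usage: grad_E is the ordinary backprop gradient of E w.r.t. w
# w = weight_decay_step(w, grad_E)
```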
Dimensionality Reduction
32

Autoencoder networks
33
Learning Time
34

- Applications:
  - Sequence recognition: speech recognition
  - Sequence reproduction: time-series prediction
  - Sequence association
- Network architectures:
  - Time-delay networks (Waibel et al., 1989)
  - Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
35
Recurrent Networks
36
Unfolding in Time
37
Deep Networks
38

- Layers of feature extraction units
- Can have local receptive fields as in convolutional networks, or can be fully connected
- Can be trained layer by layer using an autoencoder in an unsupervised manner
- No need to craft the right features, the right basis functions, or the right dimensionality reduction method; the network learns multiple layers of abstraction by itself, given a lot of data and a lot of computation
- Applications in vision, language processing, ...
