06 NeuralNetworks 2024
Neural Networks
Dr. Ioannis Patras
School of EECS
• Neural Networks:
– Inspired by algorithms that try to mimic the brain
– Formalized as systems of interconnected “neurons”
– Each neuron does a very simple calculation
– Many layers of neurons put together.
– The collective can do qualitatively more powerful
computation than any individual unit.
Neural Nets now!
• Until the late 2000s, they were largely shunned.
– Now they are back!
• Superhuman performance at:
– Character recognition
– Face recognition
– Object recognition
– Atari video game playing
– Go playing(!)
• Pervasive:
– Run OCR in banks/mail.
– Every Facebook photo upload (millions per day) goes
through a deep CNN.
– Power Cortana's, Google's, and Siri's speech recognition.
In the Media
DeepMind Neural Atari
…and scandal!
Brain vs Computers
• The brain is massively parallel, but slow:
– 100 billion (10^11) neurons
– 100 trillion (10^14-10^15) connections (synapses)
– ~10^2 operations per second
• Contrast fast serial computers:
– 1-10^3 processors
– 1-10 billion operations per second
Von Neumann computer versus biological neural system
Artificial neural networks have an architecture roughly similar to the biological neural system.
Can think of a logistic regression unit as a neuron (function) that multiplies the
input by the parameters (weights) and squashes the resulting sum through the
sigmoid.
• Logistic Regression Unit => Perceptron if the weight strength goes to infinity, so the
sigmoid => step (see the sketch below).
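To make this concrete, here is a minimal Python sketch (the weight and input values are made up for illustration): a logistic unit squashes the weighted sum of its inputs through a sigmoid, and replacing the sigmoid with a hard threshold gives a perceptron.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logistic_unit(theta, x):
    """Weighted sum of the inputs squashed through the sigmoid."""
    a = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(a)

def perceptron_unit(theta, x):
    """Same weighted sum, but with a hard step instead of a sigmoid."""
    a = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 if a > 0 else 0.0

# Illustrative weights: theta[0] acts on the constant bias input x0 = 1.
theta = [-1.0, 2.0, 2.0]
x = [1.0, 0.3, 0.8]              # x0 = 1 (bias), then x1, x2
print(logistic_unit(theta, x))   # soft output in (0, 1)
print(perceptron_unit(theta, x)) # hard 0/1 decision
```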
Feed Forward Neural Net
• Connected set of logistic regression units
– Arranged in layers.
– Each unit’s output is a non-linear function (e.g.,
sigmoid, step function) of a linear combination of its
inputs.
• Input layer size:
– Size of input to classify/regress
• Output layer size
– Size of target for classification/regression
• Any number of hidden neurons
• Any number of hidden layers
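A rough sketch of this layered structure in Python (the layer sizes and random weights below are illustrative, not from the slides): a forward pass simply repeats the same unit-level computation layer by layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Propagate an input through a stack of layers.

    weights is a list of (W, b) pairs, one per layer; each unit applies a
    sigmoid to a linear combination of the previous layer's outputs.
    """
    h = x
    for W, b in weights:
        h = sigmoid(W @ h + b)
    return h

# Toy network: 2 inputs -> 3 hidden units -> 1 output (sizes are arbitrary).
rng = np.random.default_rng(0)
weights = [(rng.normal(size=(3, 2)), np.zeros(3)),
           (rng.normal(size=(1, 3)), np.zeros(1))]
print(forward(np.array([0.5, -1.0]), weights))
```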
A Sigmoid/Logistic Neuron
[Figure: a logistic neuron with weights on the inputs followed by a sigmoid activation]
$$\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \qquad h_\theta(\mathbf{x}) = \sigma(\theta^\top \mathbf{x}) = \frac{1}{1 + e^{-\theta^\top \mathbf{x}}}$$
A Single Neuron
[Figure: a single neuron with inputs x0, x1, x2 and output hθ(x)]
A Neural Network
[Figures: feed-forward networks of increasing depth. Inputs x0, x1, x2 feed a layer of hidden units h¹ⱼ(x); in the deeper variants these feed further hidden layers h²ⱼ(x) and h³ⱼ(x) before the output hθ(x)]
Can you draw a linear decision boundary to separate the two classes?
How can a NN help?
• It can combine decision boundaries at each
level of the network.
– The decision of one layer can become the input of
the next
• This can make more complex boundaries
overall.
• Let's look at some examples…
Logical Functions
• Neural AND?
• Neural OR?
• Neural XOR?
Logistic Unit AND
[Figure: a single logistic unit computing AND: bias input x0 = 1 with weight W0 = −1.5, inputs x1, x2 ∈ {0, 1} with weights W1 = W2 = 1, output hw(x)]
The weighted sum −1.5 + x1 + x2 is positive only when x1 = x2 = 1, so hw(x) > 0.5 only for that input.
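A quick Python check of the slide's weights (W0 = −1.5, W1 = W2 = 1): the sigmoid output exceeds 0.5 only when both inputs are 1.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def and_unit(x1, x2, w=(-1.5, 1.0, 1.0)):
    """Logistic unit with the AND weights from the slide (bias input x0 = 1)."""
    a = w[0] * 1.0 + w[1] * x1 + w[2] * x2
    return sigmoid(a)

for x1 in (0, 1):
    for x2 in (0, 1):
        out = and_unit(x1, x2)
        print(x1, x2, round(out, 2), out > 0.5)  # above 0.5 only for (1, 1)
```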
Neural Net XOR
[Figure: a two-layer network computing XOR: a hidden OR unit (bias −0.5, weights 1, 1) and a hidden AND unit (bias −1.5, weights 1, 1) over x1, x2, combined by an output unit with weights +1 and −1 that computes [OR] AND [NOT AND]]
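A sketch of the same construction in Python. The hidden OR and AND weights follow the slide; the output unit's weights (+1 on the OR path, −1 on the AND path, bias −0.5) and the gain factor that pushes each sigmoid towards a step are assumptions added so the soft units behave like logic gates.

```python
import math

def unit(bias, w1, w2, x1, x2, gain=10.0):
    """Logistic unit; the gain scales all weights so the sigmoid acts like a step."""
    a = gain * (bias + w1 * x1 + w2 * x2)
    return 1.0 / (1.0 + math.exp(-a))

def xor_net(x1, x2):
    h_or  = unit(-0.5, 1.0, 1.0, x1, x2)   # ~1 unless both inputs are 0
    h_and = unit(-1.5, 1.0, 1.0, x1, x2)   # ~1 only when both inputs are 1
    # Output: [OR] AND [NOT AND]; these output weights are an illustrative choice.
    return unit(-0.5, 1.0, -1.0, h_or, h_and)

for x1 in (0, 1):
    for x2 in (0, 1):
        out = xor_net(x1, x2)
        print(x1, x2, round(out, 2), out > 0.5)  # above 0.5 only for (0,1) and (1,0)
```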
How to Come up With these Weights?
[Figure: a three-layer network: inputs x0, x1, x2 feed hidden units h¹ⱼ(x), then h²ⱼ(x), then output units h³ⱼ(x)]
[Figure: a two-layer network over inputs x1, x2, with first-layer units h¹ⱼ(x) feeding a second-layer unit]
If the units had no non-linearity, the second-layer output would just be
$$h^2_1(\mathbf{x}) = \sum_j w^2_{1j}\, h^1_j(\mathbf{x}) = \sum_j w^2_{1j} \sum_i w^1_{ji}\, x_i = \sum_i \Big( \sum_j w^2_{1j}\, w^1_{ji} \Big)\, x_i$$
• No matter how many layers….
– Still a simple linear model
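A small numerical check of this point (the matrices below are arbitrary): composing two purely linear layers W1 and W2 gives exactly the single linear map W2·W1, so depth adds nothing without a non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first linear layer: 2 inputs -> 3 units
W2 = rng.normal(size=(1, 3))   # second linear layer: 3 units -> 1 output
x = rng.normal(size=2)

two_linear_layers = W2 @ (W1 @ x)     # "deep" but purely linear network
one_linear_layer  = (W2 @ W1) @ x     # equivalent single linear model
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```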
[Figure: the same network with a bias unit h¹₀ feeding the second layer]
With the sigmoid non-linearity, a second-layer unit instead computes
$$h^2_1(\mathbf{x}) = \sigma\!\left(w^2_{10} + w^2_{11}\, h^1_1(\mathbf{x}) + w^2_{12}\, h^1_2(\mathbf{x}) + w^2_{13}\, h^1_3(\mathbf{x})\right)$$
The sigmoid activation is $\sigma(a) = \frac{1}{1 + e^{-a}}$, with derivative $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$.
For a single example, the error at output unit k is
$$E_n(\mathbf{w}) = 0.5\,\big(h(a_k) - y\big)^2$$
and the output error term is
$$\delta_k = \frac{\partial E_n(\mathbf{w})}{\partial a_k} = h'(a_k)\,\big(h(a_k) - y\big)$$
For a regression output with identity activation, $h(a_k) = a_k$ and $h'(a_k) = 1$, so $\delta_k = h(a_k) - y$.
Output Error Term: Binary Classifier
• Output error term: $\delta_k = \dfrac{\partial E_n(\mathbf{w})}{\partial a_k}$
• For a binary classifier:
– Cost: squared deviation (or cross-entropy)
– Activation: sigmoid
With squared error $E_n(\mathbf{w}) = 0.5\,\big(h(a_k) - y\big)^2$ and a sigmoid output,
$$\delta_k = \frac{\partial E_n(\mathbf{w})}{\partial a_k} = h'(a_k)\,\big(h(a_k) - y\big) = h(a_k)\,\big(1 - h(a_k)\big)\,\big(h(a_k) - y\big)$$
"% #! ! "% #! !
= ! " #! = ! " # #"!
!
"$ "!
#"!
"$ "!
# $!
Summary: Binary Classification
• Iterate:
– Propagate forward to find activations a and h(a) of all
internal and output units: $a_j = \sum_i w_{ji}\, h(a_i)$
– Evaluate error term δ for output units:
$$\delta_k = h'(a_k)\,\big(h(a_k) - y_k\big)$$
– Backpropagate output δs to obtain internal δs.
! " = %# "$ " !" #!"! !
!
– Use δs to get a gradient update for each weight:
"% #! ! "% #! !
= ! " #! = ! " # #"!
!
"$ "!
#"!
"$ "!
# $!
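A compact sketch of one such training loop for a single-hidden-layer binary classifier, using squared error and sigmoid units as above (the network sizes, data, learning rate, and the omission of bias terms are simplifications for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy problem: 2 inputs -> 3 sigmoid hidden units -> 1 sigmoid output.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, y = np.array([0.5, -1.0]), 1.0
eta = 0.1  # learning rate

for step in range(100):
    # Forward propagation: activations a and outputs h(a) of every unit.
    a1 = W1 @ x;  h1 = sigmoid(a1)
    a2 = W2 @ h1; h2 = sigmoid(a2)

    # Output error term: delta_k = h'(a_k) (h(a_k) - y) with a sigmoid output.
    delta2 = h2 * (1 - h2) * (h2 - y)

    # Backpropagate to the hidden units: delta_j = h'(a_j) * sum_k w_kj delta_k.
    delta1 = h1 * (1 - h1) * (W2.T @ delta2)

    # Gradient update for each weight: dE/dw_ji = delta_j * h(a_i) (x_i for layer 1).
    W2 -= eta * np.outer(delta2, h1)
    W1 -= eta * np.outer(delta1, x)

print(sigmoid(W2 @ sigmoid(W1 @ x)))  # output moves towards the target y = 1
```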
Training Overview
• Implement forward propagation to get h(x) for
any x.
• Implement cost function computation.
• Implement backpropagation to compute
partial derivatives.
• Iterate forward & backward propagation.
Summary
• Training neural nets:
– Use backpropagation to minimise their cost
function.
– Errors at later nodes are backpropagated to
compute errors at earlier nodes.
– Errors at each node give gradient update for that
node.
Overview
• Neural Nets Intro
• From Neurons to Neural Nets
• Multi-Layer Neural Nets for Non-Linear
Functions
• Neural Net Prediction Details
• Training Neural Nets
• More considerations
Batch vs Online
• The updates on the past couple of slides are gradients
with respect to one single training example n.
– Online Gradient Descent:
• Iterate over data n, and weights w(i,j):
$$w_{ij} \leftarrow w_{ij} - \eta\, \frac{\partial E_n(\mathbf{w})}{\partial w_{ij}}$$
– Batch Gradient Descent:
• Iterate over weights w(i,j):
"%! "! !
$ "# = $ "# # $ !
! "$ "#
Gradient Checking
– When implementing gradient-descent algorithms like
backprop….
– May be useful to check correctness of your derivatives
numerically.
The backprop gradient and the numerical estimate below should be equal:
$$\frac{\partial E_n(\mathbf{w})}{\partial w_{ji}} = \delta_j\, h(a_i)$$
– Can implement finite difference numerical
differentiation to check them
$$\frac{\partial E_n(\mathbf{w})}{\partial w_{ij}} \approx \frac{E_n(w_{ij} + \epsilon) - E_n(w_{ij} - \epsilon)}{2\epsilon}$$
– I.e., perturb the current weight and re-compute the
network’s error. Get gradient from this change in error.
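A minimal sketch of such a check, assuming a function error(w) that runs forward propagation and returns En for a flat weight vector w (a placeholder for your own network code):

```python
import numpy as np

def numerical_gradient(error, w, eps=1e-5):
    """Finite-difference estimate of dEn/dw_i, perturbing each weight in turn.

    w is assumed to be a flat 1-D vector of all the network's weights.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (error(w_plus) - error(w_minus)) / (2 * eps)
    return grad

# Usage (grad_from_backprop is a hypothetical array from your backprop code):
# assert np.allclose(numerical_gradient(error, w), grad_from_backprop, atol=1e-4)
```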
Gradient Checking
– Aside: You could also implement gradient descent
by numerical differentiation, but slow!
• Each forward propagation costs O(W).
• Each weight must be perturbed individually at cost O(W).
– => Overall cost O(W²)
$$\frac{\partial E_n(\mathbf{w})}{\partial w_{ij}} \approx \frac{E_n(w_{ij} + \epsilon) - E_n(w_{ij} - \epsilon)}{2\epsilon}$$