AI Lec24-25
Learning – Recap
Neurons

[Figure: a biological neuron — dendrites, synapses, axon, cell body]

o Basic unit of a neural network
o Very simple processing units
o A neuron takes inputs, does some math with them, and produces one output
[Figure: a single neuron with inputs x1, x2, x3, weights W1, W2, W3, bias b, and activation f(x) = max(x, 0)]

x1 → x1 ∗ w1
x2 → x2 ∗ w2
x3 → x3 ∗ w3

(x1 ∗ w1) + (x2 ∗ w2) + (x3 ∗ w3) + b        y = f(x1 ∗ w1 + x2 ∗ w2 + x3 ∗ w3 + b)

(Worked numbers on the slide: inputs 3, 4, 2 with weights −1, 0, −1 and bias +1 give 3·(−1) + 4·0 + 2·(−1) + 1 = −4, and f(−4) = 0.)

1- All the inputs are multiplied by a weight
2- The weighted inputs are added together with a bias b
3- The sum is passed through an activation function
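The three steps can be sketched directly in code. A minimal Python illustration (the function names are mine, not from the lecture):

```python
def relu(x):
    # The slide's activation function: f(x) = max(x, 0)
    return max(x, 0.0)

def neuron(inputs, weights, bias):
    # Step 1: multiply every input by its weight
    # Step 2: add the weighted inputs together with the bias b
    # Step 3: pass the sum through the activation function
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(weighted_sum)

# Example with three inputs: f(3*(-1) + 4*0 + 2*(-1) + 1) = f(-4) = 0
y = neuron([3, 4, 2], [-1, 0, -1], 1)
```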
Neuron Model
Model Parameters

• Neurons have a set of parameters, called weights (like the slope in linear regression)
• Called w's
• The neuron model also has a bias (like the y-intercept in linear regression)
• Called b
• Together, the w's and the b create a line in the space to divide it into regions (classes)
• Finally, the neuron has a non-linear activation function
• Called f()
• The nonlinearity will allow us to create curves instead of lines to divide the space

[Figure: the same neuron — f(x) = max(x, 0), y = f(x1 ∗ w1 + x2 ∗ w2 + x3 ∗ w3 + b)]
Neuron Example
Weights, Bias, and Activations

Inputs (X)    Weights (W)    X*W
    5.4            8          43.2
  -10.2            5         -51.0
   -0.1           22          -2.2

Linear combination:   -10.0
Bias:                 +10.0
Output (activation):    0      (f(-10.0 + 10.0) = f(0) = 0)
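The arithmetic in this example can be checked in a few lines. A quick sketch (the activation is the ReLU f(x) = max(x, 0) from the earlier slides):

```python
inputs  = [5.4, -10.2, -0.1]
weights = [8, 5, 22]
bias    = 10.0

# Linear combination: 43.2 + (-51.0) + (-2.2) = -10.0
linear = sum(x * w for x, w in zip(inputs, weights))

# Add the bias and apply the activation: f(-10.0 + 10.0) = f(0) = 0
output = max(linear + bias, 0.0)
```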
Neuron Activation Functions
Connecting Neurons to Build Neural Networks
Neural Network – Brain Analogy
Sigmoid
• Towards the ends of the function, y values react very little to changes in x.
• The derivative values in these regions are very small and converge to 0. This is called the vanishing gradient, and the learning is minimal.
• When slow learning occurs, the optimization algorithm cannot get maximum performance from the neural network model.

Tanh
• Its derivative is steeper compared to the sigmoid's.
• It is more efficient because it has a wider range, [-1, 1], for faster learning.
• The problem of gradients at the ends of the function continues.

ReLU
• It doesn't saturate.
• It converges faster than some other activation functions (sigmoid and tanh).
• The most commonly used activation function, because of its simplicity during the optimization process (backpropagation).
• It is not computationally expensive.

Leaky ReLU
• Can be used as an improvement over the ReLU activation function.
• It has all the properties of ReLU, and it overcomes the ReLU drawbacks.
• Unlike ReLU, the value of the hyper-parameter (α) is defined prior to the training and hence cannot be adjusted during training time. The value of α chosen might not be the most optimal value.
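The four activations compared above can be written out directly. A small illustrative sketch (plain Python; the 0.01 for Leaky ReLU's α is just a common example value, not from the lecture):

```python
import math

def sigmoid(x):
    # Squashes to (0, 1); saturates at both ends -> vanishing gradients
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Wider output range [-1, 1] and steeper derivative than sigmoid,
    # but still saturates at the ends of the function
    return math.tanh(x)

def relu(x):
    # max(x, 0): does not saturate for positive x; cheap to compute
    return max(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but with a small slope alpha for negative inputs;
    # alpha is a hyper-parameter fixed before training
    return x if x > 0 else alpha * x
```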
• input signals 'x' and weights 'w' are multiplied
• weights correspond to connection strengths
• signals are added up – if they are enough, FIRE!

a = Σ (i = 1..M) xi wi
if (a ≥ t) then output = 1, else output = 0
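The fire rule above, as a sketch in Python:

```python
def perceptron(xs, ws, t):
    # Multiply each input signal x by its connection weight w,
    # add the products up, and FIRE (output 1) if the total
    # reaches the threshold t; otherwise output 0
    a = sum(x * w for x, w in zip(xs, ws))
    return 1 if a >= t else 0
```

For example, with w1 = 1.0, w2 = 0.2, t = 0.05, the input (1, 0) makes the unit fire.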
Is this a good decision boundary?

if Σ (i = 1..M) xi wi ≥ t then output = 1, else output = 0

Different weight settings to try:
• w1 = 1.0, w2 = 0.2, t = 0.05
• w1 = 2.1, w2 = 0.2, t = 0.05
• w1 = 1.9, w2 = 0.02, t = 0.05
• w1 = -0.8, w2 = 0.03, t = 0.05
[Figure: a threshold unit with threshold t = 1.0 and inputs x2, x3 weighted by w2, w3]
if Σ (i = 1..M) xi wi ≥ t then output = 1, else output = 0

if Σ (i = 1..M) xi wi − t ≥ 0 then output = 1, else output = 0

if Σ (i = 1..M) xi wi + (−1 ∗ t) ≥ 0 then output = 1, else output = 0

if Σ (i = 1..M) xi wi + (x0 ∗ w0) ≥ 0 then output = 1, else output = 0

if Σ (i = 0..M) xi wi ≥ 0 then output = 1, else output = 0

We now treat the threshold like any other weight, with a permanent input of -1 (x0 = -1, w0 = t)
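The rewrite can be checked numerically: folding the threshold into the weight vector (an extra permanent input x0 = -1 with weight w0 = t) leaves every decision unchanged. A quick sketch:

```python
def with_threshold(xs, ws, t):
    # Original rule: fire when the weighted sum reaches the threshold t
    return 1 if sum(x * w for x, w in zip(xs, ws)) >= t else 0

def with_bias_weight(xs, ws, t):
    # Same rule with the threshold treated as a weight w0 = t
    # on a permanent input x0 = -1, now compared against 0
    xs = [-1.0] + list(xs)
    ws = [t] + list(ws)
    return 1 if sum(x * w for x, w in zip(xs, ws)) >= 0 else 0
```

Both versions agree on every input, e.g. with weights (0.3, -0.1) and threshold 0.2.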
Can a Perceptron solve this problem? ….. NO.
A perceptron can only solve LINEARLY SEPARABLE problems.
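The classic non-separable case is XOR. A brute-force sketch over a grid of weights and thresholds illustrates that no single-unit setting reproduces XOR (a grid search for illustration, not a proof — the proof follows from combining the constraints for inputs (0,1), (1,0), and (1,1)):

```python
# XOR truth table: output 1 only when the two inputs differ
xor_cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def fires(x1, x2, w1, w2, t):
    # Single threshold unit: fire if the weighted sum reaches t
    return 1 if x1 * w1 + x2 * w2 >= t else 0

# Try every (w1, w2, t) combination on a coarse grid in [-2.0, 2.0]
grid = [i / 10.0 for i in range(-20, 21)]
solutions = [
    (w1, w2, t)
    for w1 in grid for w2 in grid for t in grid
    if all(fires(x1, x2, w1, w2, t) == y for (x1, x2), y in xor_cases)
]
# No combination on the grid classifies all four XOR cases correctly
```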
Perceptron learning example: the logical AND operation.

Threshold: θ = 0.2; learning rate: η = 0.1
Initial weights: w1 = 0.3, w2 = −0.1
Decision boundary: w1 x1 + w2 x2 − θ = 0, i.e. w1 x1 + w2 x2 − 0.2 = 0
Update Rule: wi = wi + η (t − o) xi
Training data (AND): (x1, x2) → Yd: (0, 0) → 0, (0, 1) → 0, (1, 0) → 0, (1, 1) → 1

Training Example 1 (x1 = 0, x2 = 0, Yd = 0):
0.3 × 0 − 0.1 × 0 − 0.2 = −0.2 < 0 → Output: 0, error e = 0
w1 = 0.3 + 0.1 (0 − 0) 0 = 0.3
w2 = −0.1 + 0.1 (0 − 0) 0 = −0.1

Training Example 2 (x1 = 0, x2 = 1, Yd = 0):
0.3 × 0 − 0.1 × 1 − 0.2 = −0.3 < 0 → Output: 0, error e = 0
w1 = 0.3 + 0.1 (0 − 0) 0 = 0.3
w2 = −0.1 + 0.1 (0 − 0) 1 = −0.1

Training Example 3 (x1 = 1, x2 = 0, Yd = 0):
0.3 × 1 − 0.1 × 0 − 0.2 = 0.1 > 0 → Output: 1, error e = −1
w1 = 0.3 + 0.1 (0 − 1) 1 = 0.2
w2 = −0.1 + 0.1 (0 − 1) 0 = −0.1
PERCEPTRON LEARNING: LOGICAL OPERATION AND

Epoch  x1 x2  Yd | Initial w1  w2 |  Y   e | Final w1  w2
  1     0  0   0 |     0.3  −0.1 |  0   0 |   0.3  −0.1
        0  1   0 |     0.3  −0.1 |  0   0 |   0.3  −0.1
        1  0   0 |     0.3  −0.1 |  1  −1 |   0.2  −0.1
        1  1   1 |     0.2  −0.1 |  0   1 |   0.3   0.0
  2     0  0   0 |     0.3   0.0 |  0   0 |   0.3   0.0
        0  1   0 |     0.3   0.0 |  0   0 |   0.3   0.0
        1  0   0 |     0.3   0.0 |  1  −1 |   0.2   0.0
        1  1   1 |     0.2   0.0 |  1   0 |   0.2   0.0
  3     0  0   0 |     0.2   0.0 |  0   0 |   0.2   0.0
        0  1   0 |     0.2   0.0 |  0   0 |   0.2   0.0
        1  0   0 |     0.2   0.0 |  1  −1 |   0.1   0.0
        1  1   1 |     0.1   0.0 |  0   1 |   0.2   0.1
  4     0  0   0 |     0.2   0.1 |  0   0 |   0.2   0.1
        0  1   0 |     0.2   0.1 |  0   0 |   0.2   0.1
        1  0   0 |     0.2   0.1 |  1  −1 |   0.1   0.1
        1  1   1 |     0.1   0.1 |  1   0 |   0.1   0.1
  5     0  0   0 |     0.1   0.1 |  0   0 |   0.1   0.1
        0  1   0 |     0.1   0.1 |  0   0 |   0.1   0.1
        1  0   0 |     0.1   0.1 |  0   0 |   0.1   0.1
        1  1   1 |     0.1   0.1 |  1   0 |   0.1   0.1

Threshold: θ = 0.2; learning rate: η = 0.1
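The whole table can be reproduced in a few lines. A sketch (rounding after each update keeps the decimal arithmetic exact, matching the hand computation):

```python
# AND truth table: (x1, x2) -> desired output Yd
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

theta, eta = 0.2, 0.1   # threshold and learning rate from the table
w = [0.3, -0.1]         # initial weights from the table

for epoch in range(5):
    for (x1, x2), yd in data:
        y = 1 if w[0] * x1 + w[1] * x2 >= theta else 0   # actual output
        e = yd - y                                        # error
        # Perceptron update rule: wi = wi + eta * e * xi
        w[0] = round(w[0] + eta * e * x1, 10)
        w[1] = round(w[1] + eta * e * x2, 10)

# After 5 epochs the weights settle at w1 = 0.1, w2 = 0.1,
# which classifies AND correctly with threshold 0.2
```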
Multilayer Perceptron
◼ A multilayer perceptron is a feedforward neural network with
one or more hidden layers.
◼ The network consists of an input layer of source neurons, at
least one middle or hidden layer of computational neurons,
and an output layer of computational neurons.
◼ The input signals are propagated in a forward direction on a
layer-by-layer basis.
Multilayer Perceptron
Simple Neural Network

[Figure: a fully connected network — inputs x1 … xn, hidden layer 1 (Ha, Hb, …, Hz), hidden layer 2 (V$, V#, …, V*), output y1]

This network has:
• n inputs (x1, x2, …, xn)
• a hidden layer with n neurons (Ha, Hb, …, Hz)
• a hidden layer with n neurons (V$, V#, …, V*)
• an output layer with one or more neurons (y)
• All neurons are connected to each other → Called fully connected
• A neural network can have any number of layers with any number of neurons in those layers

There can be multiple hidden layers!
A hidden layer is any layer between the input (first) layer and the output (last) layer.
Neural Networks in Practice
Designing Neural Network Architectures
Number of Hidden Layers
[Figure: two fully connected architectures — one with two hidden layers (h1 … hn feeding v1 … vn) and one with a single hidden layer — mapping inputs x1 … xn to output y]
o In general, using the same number of neurons for all hidden layers will suffice.
o For some datasets, having a large first layer and following it up with smaller layers will lead to better performance, as the first layer can learn a lot of lower-level features that can feed into higher-order features in subsequent layers.
o Usually, you will get more of a performance boost from adding more layers than from adding more neurons in each layer.
o Remember: when choosing the number of layers/neurons, if the number is too small, your network will not be able to learn the underlying patterns in your data and will thus be useless.
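A forward pass through a small fully connected network like the ones sketched above: each hidden layer is a row of ReLU neurons, every neuron seeing every output of the previous layer. A minimal illustrative version (the layer sizes and weights below are arbitrary examples, not from the lecture):

```python
def relu(x):
    return max(x, 0.0)

def layer_forward(inputs, weights, biases):
    # One fully connected layer: each neuron computes the weighted sum
    # of ALL inputs plus its bias, then applies the activation
    return [relu(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, layers):
    # layers: list of (weights, biases) pairs, applied in order,
    # so the output of each layer feeds the next
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

# A 2-3-1 network: 2 inputs, one hidden layer of 3 neurons, 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]], [0.0, 0.0, 0.0]),  # hidden
    ([[1.0, 1.0, 1.0]], [0.0]),                                 # output
]
out = mlp_forward([1.0, 2.0], layers)
```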
Example
Handwritten Digit Recognition
(Hello World of PR/ANN)
Trained Network