AI Lec24-25

The document provides an overview of neurons as the basic units of neural networks, detailing their structure and function, including inputs, weights, biases, and activation functions. It explains how neurons connect to form neural networks and discusses various activation functions like Sigmoid, Tanh, ReLU, and Leaky ReLU. Additionally, it touches on the decision-making process of neurons and the limitations of perceptrons in solving non-linearly separable problems.


Types of Machine Learning – Recap
Neurons
o Basic unit of a neural network
o Very simple processing units
o A neuron takes inputs, does some math with them, and produces one output

[Figure: biological neuron with dendrites, synapses, axon, and cell body]

A neuron computes its output in three steps:
1. Each input is multiplied by a weight: 𝑥1 → 𝑥1∗𝑤1, 𝑥2 → 𝑥2∗𝑤2, 𝑥3 → 𝑥3∗𝑤3
2. All the weighted inputs are added together with a bias b: (𝑥1∗𝑤1) + (𝑥2∗𝑤2) + (𝑥3∗𝑤3) + b
3. The weighted sum is passed through an activation function: y = f(𝑥1∗𝑤1 + 𝑥2∗𝑤2 + 𝑥3∗𝑤3 + b)

[Figure: with inputs x1 = 3, x2 = 4, x3 = 2, weights w1 = −1, w2 = 0, w3 = −1, bias b = +1, and f(x) = max(x, 0), the weighted sum is −4 and the output is y = f(−4) = 0]
Neuron Model
Model Parameters
• Neurons have a set of parameters, called weights (like the slope in linear regression)
• Called w's
• The neuron model also has a bias (like the y-intercept in linear regression)
• Called b
• Together, the w's and the b create a line in the space that divides it into regions (classes)
• Finally, the neuron has a non-linear activation function
• Called f()
• The nonlinearity will allow us to create curves instead of lines to divide the space

y = f(𝑥1∗𝑤1 + 𝑥2∗𝑤2 + 𝑥3∗𝑤3 + b), with f(x) = max(x, 0)
Neuron Example
Weights, Bias, and Activation (f(x) = max(x, 0))

Inputs (X)   Weights (W)   Product (X*W)
   5.4            8             43.2
 -10.2            5            -51.0
  -0.1           22             -2.2
 101.4           -5           -507.0
   0.0            2              0.0
  12.0           -3            -36.0

Linear combination: -553.0
Bias: +10.0
Output (activation): f(-553.0 + 10.0) = f(-543.0) = 0
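The example above can be sketched in a few lines of code. This is a minimal illustration, not library code; the function name `neuron` is our own.

```python
def neuron(xs, ws, b):
    """A single neuron: weighted sum of inputs plus bias, passed through ReLU."""
    s = sum(x * w for x, w in zip(xs, ws)) + b
    return max(s, 0.0)  # ReLU activation: f(x) = max(x, 0)

xs = [5.4, -10.2, -0.1, 101.4, 0.0, 12.0]
ws = [8, 5, 22, -5, 2, -3]
print(neuron(xs, ws, b=10.0))  # → 0.0 (linear combination -553.0 + bias 10.0 = -543.0, clipped by ReLU)
```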
Neuron Activation Functions
Connecting Neurons to Build Neural Networks

Neural Network
Brain Analogy
• A neural network is nothing more than a bunch of neurons connected together.
• It is a system consisting of a large number of neurons, each of which can process information on its own, rather than having a single CPU process each piece of information one after the other.
Designing Neural Networks Architectures
Activation Functions

Sigmoid
• Towards the ends of the function, y values react very little to changes in x.
• The derivative values in these regions are very small and converge to 0. This is called the vanishing gradient, and learning is minimal.
• When slow learning occurs, the optimization algorithm cannot get maximum performance from the neural network model.

Tanh
• Its derivative is steeper than the Sigmoid's.
• It is more efficient because it has a wider range, [-1, 1], for faster learning.
• The problem of gradients at the ends of the function continues.

ReLU
• It doesn't saturate.
• It converges faster than some other activation functions (Sigmoid and Tanh).
• The most commonly used activation function, because of its simplicity during the optimization process (backpropagation).
• It is not computationally expensive.

Leaky ReLU
• Can be used as an improvement over the ReLU activation function.
• It has all the properties of ReLU, and it overcomes the ReLU drawbacks.
• Unlike ReLU, the value of its hyper-parameter (α) is defined prior to training and hence cannot be adjusted during training time. The value of α hence chosen might not be the most optimal value.
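The four activation functions compared above can be written directly from their definitions; a minimal sketch (α = 0.01 is a typical default for Leaky ReLU, not a value from the slides):

```python
import math

def sigmoid(x):
    """Saturates towards 0 and 1 at the ends: vanishing gradient."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Like sigmoid but with the wider output range [-1, 1]."""
    return math.tanh(x)

def relu(x):
    """Doesn't saturate for x > 0; cheap to compute."""
    return max(x, 0.0)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but a small fixed slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

print(sigmoid(0.0))      # → 0.5
print(relu(-4.0))        # → 0.0
print(leaky_relu(-4.0))  # small negative value instead of a hard zero
```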
• Input signals 'x' and weights 'w' are multiplied
• Weights correspond to connection strengths
• Signals are added up – if they are enough, FIRE!

a = Σ xi wi  (i = 1 … M)

if (a ≥ t) output = 1, else output = 0

[Figure: incoming signals x1, x2, x3 arrive over connections with strengths w1, w2, w3; the adder produces the activation level a, which determines the output signal]
Is this a good decision boundary?

if Σ xi wi ≥ t (i = 1 … M) then output = 1, else output = 0

Try several weight settings, each with t = 0.05:
• w1 = 1.0,  w2 = 0.2
• w1 = 2.1,  w2 = 0.2
• w1 = 1.9,  w2 = 0.02
• w1 = −0.8, w2 = 0.03

Changing the weights/threshold makes the decision boundary move.


a = Σ xi wi  (i = 1 … M)

x = [ 1.0, 0.5, 2.0 ]
w = [ 0.2, 0.5, 0.5 ]
t = 1.0

Q1. What is the activation, a, of the neuron?

a = (1.0 × 0.2) + (0.5 × 0.5) + (2.0 × 0.5) = 1.45

Q2. Does the neuron fire?

if (activation > threshold) output = 1, else output = 0
1.45 > 1.0 …. So yes, it fires.

Q3. What if we set the threshold at 0.5 and weight #3 to zero?

a = (1.0 × 0.2) + (0.5 × 0.5) + (2.0 × 0.0) = 0.45

if (activation > threshold) output = 1, else output = 0
0.45 < 0.5 …. So no, it does not fire.
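Q1–Q3 can be checked in code; a minimal sketch with our own helper names `activation` and `fires`:

```python
def activation(xs, ws):
    """Weighted sum a = sum of xi * wi."""
    return sum(x * w for x, w in zip(xs, ws))

def fires(xs, ws, t):
    """1 if the activation exceeds the threshold t, else 0."""
    return 1 if activation(xs, ws) > t else 0

x = [1.0, 0.5, 2.0]
w = [0.2, 0.5, 0.5]
print(activation(x, w))    # → 1.45
print(fires(x, w, t=1.0))  # → 1 (fires)

w3_zero = [0.2, 0.5, 0.0]          # weight #3 set to zero
print(fires(x, w3_zero, t=0.5))    # → 0 (does not fire: a = 0.45 < 0.5)
```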
We can rearrange the decision rule….

if (Σ xi wi) ≥ t             then output = 1, else output = 0   (i = 1 … M)
if (Σ xi wi) − t ≥ 0         then output = 1, else output = 0
if (Σ xi wi) + (−1 × t) ≥ 0  then output = 1, else output = 0
if (Σ xi wi) + (x0 × w0) ≥ 0 then output = 1, else output = 0   (x0 = −1, w0 = t)
if Σ xi wi ≥ 0               then output = 1, else output = 0   (now i = 0 … M)

We now treat the threshold like any other weight, with a permanent input of -1.
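The equivalence of the two forms is easy to demonstrate; a small sketch (function names are ours, values from the earlier example):

```python
def fire_threshold(xs, ws, t):
    """Original rule: fire when the weighted sum reaches the threshold t."""
    return 1 if sum(x * w for x, w in zip(xs, ws)) >= t else 0

def fire_bias(xs, ws, t):
    """Rearranged rule: fold t in as weight w0 with permanent input x0 = -1."""
    xs0 = [-1.0] + list(xs)   # permanent input x0 = -1
    ws0 = [t] + list(ws)      # w0 = t, treated like any other weight
    return 1 if sum(x * w for x, w in zip(xs0, ws0)) >= 0 else 0

x, w, t = [1.0, 0.5, 2.0], [0.2, 0.5, 0.5], 1.0
print(fire_threshold(x, w, t), fire_bias(x, w, t))  # → 1 1 (both forms agree)
```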
Can a Perceptron solve this problem? ….. NO.

Perceptrons only solve LINEARLY SEPARABLE problems.
Perceptron Learning: Logical Operation AND (worked examples, Epoch 1)

Threshold: θ = 0.2; learning rate: η = 0.1
Initial weights: w1 = 0.3, w2 = −0.1
Decision boundary: 𝑤1𝑥1 + 𝑤2𝑥2 − 𝜃 = 0, i.e. 𝑤1𝑥1 + 𝑤2𝑥2 − 0.2 = 0

Update Rule: 𝑤𝑖 = 𝑤𝑖 + 𝜂 (𝑡 − 𝑜) 𝑥𝑖

Training Example 1 (x1 = 0, x2 = 0, desired output t = 0):
0.3 × 0 − 0.1 × 0 − 0.2 = −0.2 < 0 → Output: 0, error e = 0
𝑤1 = 0.3 + 0.1 (0 − 0) 0 = 0.3
𝑤2 = −0.1 + 0.1 (0 − 0) 0 = −0.1

Training Example 2 (x1 = 0, x2 = 1, desired output t = 0):
0.3 × 0 − 0.1 × 1 − 0.2 = −0.3 < 0 → Output: 0, error e = 0
𝑤1 = 0.3 + 0.1 (0 − 0) 0 = 0.3
𝑤2 = −0.1 + 0.1 (0 − 0) 1 = −0.1

Training Example 3 (x1 = 1, x2 = 0, desired output t = 0):
0.3 × 1 − 0.1 × 0 − 0.2 = 0.1 > 0 → Output: 1, error e = −1
𝑤1 = 0.3 + 0.1 (0 − 1) 1 = 0.2
𝑤2 = −0.1 + 0.1 (0 − 1) 0 = −0.1

The complete five-epoch run is tabulated next.
PERCEPTRON LEARNING: LOGICAL OPERATION AND
Threshold: θ = 0.2; learning rate: η = 0.1

Epoch  x1 x2  Yd | Initial w1, w2 | Y   e | Final w1, w2
1      0  0   0  |  0.3  −0.1     | 0   0 |  0.3  −0.1
       0  1   0  |  0.3  −0.1     | 0   0 |  0.3  −0.1
       1  0   0  |  0.3  −0.1     | 1  −1 |  0.2  −0.1
       1  1   1  |  0.2  −0.1     | 0   1 |  0.3   0.0
2      0  0   0  |  0.3   0.0     | 0   0 |  0.3   0.0
       0  1   0  |  0.3   0.0     | 0   0 |  0.3   0.0
       1  0   0  |  0.3   0.0     | 1  −1 |  0.2   0.0
       1  1   1  |  0.2   0.0     | 1   0 |  0.2   0.0
3      0  0   0  |  0.2   0.0     | 0   0 |  0.2   0.0
       0  1   0  |  0.2   0.0     | 0   0 |  0.2   0.0
       1  0   0  |  0.2   0.0     | 1  −1 |  0.1   0.0
       1  1   1  |  0.1   0.0     | 0   1 |  0.2   0.1
4      0  0   0  |  0.2   0.1     | 0   0 |  0.2   0.1
       0  1   0  |  0.2   0.1     | 0   0 |  0.2   0.1
       1  0   0  |  0.2   0.1     | 1  −1 |  0.1   0.1
       1  1   1  |  0.1   0.1     | 1   0 |  0.1   0.1
5      0  0   0  |  0.1   0.1     | 0   0 |  0.1   0.1
       0  1   0  |  0.1   0.1     | 0   0 |  0.1   0.1
       1  0   0  |  0.1   0.1     | 0   0 |  0.1   0.1
       1  1   1  |  0.1   0.1     | 1   0 |  0.1   0.1
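The whole training run can be reproduced in a short loop. A sketch assuming the firing rule is "weighted sum − θ ≥ 0" (consistent with the Y column of the table, e.g. sum 0.0 fires in epoch 2):

```python
# Perceptron learning of AND: theta = 0.2, eta = 0.1, w = [0.3, -0.1].
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.3, -0.1]
theta, eta = 0.2, 0.1

for epoch in range(5):
    for (x1, x2), t in data:
        # round() guards the fire-at-exactly-zero rows against float noise
        s = round(w[0] * x1 + w[1] * x2 - theta, 9)
        o = 1 if s >= 0 else 0        # actual output Y
        e = t - o                     # error
        w[0] += eta * e * x1          # update rule: wi = wi + eta*(t - o)*xi
        w[1] += eta * e * x2

print([round(v, 1) for v in w])  # → [0.1, 0.1], matching the table's final row
```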
Multilayer Perceptron
◼ A multilayer perceptron is a feedforward neural network with
one or more hidden layers.
◼ The network consists of an input layer of source neurons, at
least one middle or hidden layer of computational neurons,
and an output layer of computational neurons.
◼ The input signals are propagated in a forward direction on a
layer-by-layer basis.
Simple Neural Network

This network has:
• n inputs (x1, x2, …, xn)
• a hidden layer (Hidden Layer 1) with n neurons (Ha, Hb, …, Hz)
• a second hidden layer (Hidden Layer 2) with n neurons (V$, V#, …, V*)
• an output layer with one or more neurons (y1)
• All neurons in one layer are connected to all neurons in the next → called fully connected
• A neural network can have any number of layers, with any number of neurons in those layers

A hidden layer is any layer between the input (first) layer and the output (last) layer. There can be multiple hidden layers!
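A forward pass through such a fully connected network is just the single-neuron computation repeated per layer. A minimal sketch with illustrative layer sizes (3 inputs → 2 hidden → 2 hidden → 1 output) and random stand-in weights, since the slides show no trained values:

```python
import random

random.seed(0)  # reproducible stand-in weights

def dense_layer(inputs, n_neurons):
    """One fully connected layer: each neuron is ReLU(sum(x*w) + b)."""
    outputs = []
    for _ in range(n_neurons):
        ws = [random.uniform(-1, 1) for _ in inputs]  # one weight per input
        b = random.uniform(-1, 1)                     # bias
        s = sum(x * w for x, w in zip(inputs, ws)) + b
        outputs.append(max(s, 0.0))                   # ReLU activation
    return outputs

x = [1.0, 0.5, 2.0]       # inputs
h1 = dense_layer(x, 2)    # hidden layer 1
h2 = dense_layer(h1, 2)   # hidden layer 2
y = dense_layer(h2, 1)    # output layer
print(y)
```

Each layer's output pattern feeds the next layer, which is exactly the "pattern of activations causes a pattern of activations" behavior described later for the trained digit-recognition network.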
Neural Networks in Practice
Designing Neural Networks Architectures
Number of Hidden Layers

[Figure: a network with four hidden layers between the inputs (x1 … xn) and the output y]

o The number of hidden layers depends on the problem
o You're essentially trying to design the best network architecture: not too big, not too small, just right
o Generally, 1–5 hidden layers will serve you well for most problems
Designing Neural Networks Architectures
Neurons per Hidden Layer

[Figure: a network whose first hidden layer (h1 … hn) is larger than its second (v1, v2), feeding a single output y]

o In general, using the same number of neurons for all hidden layers will suffice
o For some datasets, having a large first layer followed by smaller layers will lead to better performance, as the first layer can learn a lot of lower-level features that feed into higher-order features in subsequent layers
o Usually, you will get more of a performance boost from adding more layers than from adding more neurons to each layer
o Remember: when choosing the number of layers/neurons, if the number is too small, your network will not be able to learn the underlying patterns in your data and will thus be useless
Example
Handwritten Digit Recognition
(the "Hello World" of PR/ANN)

Trained Network
• A pattern of activations in one layer causes a specific pattern of activations in the next layer

How do we recognize digits?
• The upper neuron is expected to fire for all samples with an upper loop
How do we recognize the loop?
• As a combination of small edges?