
Artificial Neural Networks

- The nuts and bolts of a neural net


Why the rage?
Neuronal structure in Brain
Quick Q: Will it fire?
Input 1 (x1) = 0.6
Input 2 (x2) = 1.0
Weight 1 (w1) = 0.5
Weight 2 (w2) = 0.8

Activation function is a step function with threshold = 1.0
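A minimal sketch of the computation behind this question (plain Python; the function name is just illustrative): the weighted sum is 0.6·0.5 + 1.0·0.8 = 1.1, which meets the 1.0 threshold, so the neuron fires.

def step(z, threshold=1.0):
    # Step activation: output 1 once the weighted sum reaches the threshold
    return 1 if z >= threshold else 0

x = [0.6, 1.0]   # inputs x1, x2
w = [0.5, 0.8]   # weights w1, w2

z = sum(xi * wi for xi, wi in zip(x, w))   # 0.6*0.5 + 1.0*0.8 = 1.1
print(z, step(z))                          # 1.1 >= 1.0, so the neuron fires (output 1)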


Structure of a network
Demo
http://www.emergentmind.com/neural-network
Size of a network

2-layer ANN vs. 3-layer ANN


Setting number of layers and their sizes
Optimising the weight
Moving across the steepest descent
Tensorflow playground
http://playground.tensorflow.org
Why biases
Degrees of freedom
Why activation function
Parameters vs Hyperparameters
Data split
- Small data set

- Large data set


Regularisation
- L1

- L2

- Dropout
Weight initialisation
- Uniform distribution

- He initialisation

- Xavier initialisation
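As a rough sketch of the three schemes listed above (NumPy; the exact scaling conventions vary between references, and the layer sizes here are only illustrative), they differ mainly in how the spread of the random values is chosen:

import numpy as np

fan_in, fan_out = 256, 128   # example layer sizes (illustrative)

# Uniform: a fixed small range, independent of layer size
w_uniform = np.random.uniform(-0.05, 0.05, size=(fan_in, fan_out))

# Xavier/Glorot: variance scaled by fan-in and fan-out (suits sigmoid/tanh)
limit = np.sqrt(6.0 / (fan_in + fan_out))
w_xavier = np.random.uniform(-limit, limit, size=(fan_in, fan_out))

# He: variance scaled by fan-in only (suits ReLU)
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)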
Learning rate
Hyperparameters
- Epochs

- Batch Size

- Number of neurons

- Number of hidden layers

- Momentum
Training of a neural net
Overview of Learning

1)Model initialisation
2)Forward propagate
3)Loss function
4)Optimising weights
5)Backpropagation
6)Weight update
7)Iteration until convergence
1) Model initialisation

Input    Desired output
  0            0
  1            2
  2            4
  3            6
  4            8
2) Forward propagate

Input    Actual output of the model (y = 3x)
  0            0
  1            3
  2            6
  3            9
  4           12
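A minimal sketch of steps 1 and 2 for this toy example: the model is the single-weight function y = w·x, the weight is (arbitrarily) initialised to 3, and the forward pass simply applies it to every input.

inputs  = [0, 1, 2, 3, 4]
targets = [0, 2, 4, 6, 8]   # desired outputs (the true mapping is y = 2x)

w = 3.0                      # initial (wrong) guess for the weight

predictions = [w * x for x in inputs]
print(predictions)           # [0.0, 3.0, 6.0, 9.0, 12.0]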
3) Loss function

Input    Actual    Desired    Absolute Error    Square Error
  0         0         0              0                0
  1         3         2              1                1
  2         6         4              2                4
  3         9         6              3                9
  4        12         8              4               16
Total       -         -             10               30
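A short sketch of how both error columns in the table are computed for the current weight w = 3:

inputs  = [0, 1, 2, 3, 4]
targets = [0, 2, 4, 6, 8]
w = 3.0

abs_error = sum(abs(w * x - t) for x, t in zip(inputs, targets))
sq_error  = sum((w * x - t) ** 2 for x, t in zip(inputs, targets))
print(abs_error, sq_error)   # 10.0 30.0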
4) Optimising the weight
4a) Differentiation

Input    Desired    Output (W=3)    Square Error    Output (W=3.0001)    Square Error
  0          0            0               0                0                  0
  1          2            3               1                3.0001             1.0002
  2          4            6               4                6.0002             4.0008
  3          6            9               9                9.0003             9.0018
  4          8           12              16               12.0004            16.0032
Total        -            -              30                -                 30.006
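The table nudges the weight from 3 to 3.0001 and compares the squared-error totals; the derivative of the loss with respect to the weight is approximated by that finite difference. A minimal sketch:

def total_sq_error(w, inputs=(0, 1, 2, 3, 4), targets=(0, 2, 4, 6, 8)):
    # Sum of squared errors of the model y = w*x on the toy data set
    return sum((w * x - t) ** 2 for x, t in zip(inputs, targets))

eps = 1e-4
grad = (total_sq_error(3.0 + eps) - total_sq_error(3.0)) / eps
print(grad)   # ~60, i.e. (30.006 - 30) / 0.0001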
4b) Moving across the steepest descent
5) Backpropagation

dJ/da = dJ/dv · dv/da

dJ/db = dJ/dv · dv/du · du/db
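A tiny numeric illustration of the chain rule used here (the functions v and J are made up purely for the example): differentiate each step locally, then multiply the local gradients along the path.

# Suppose v = 3*a + 2 and J = v**2.  Then dJ/da = dJ/dv * dv/da = (2*v) * 3.
a = 1.5
v = 3 * a + 2          # forward step:  v = 6.5
J = v ** 2             # forward step:  J = 42.25

dJ_dv = 2 * v          # local gradient of the outer step
dv_da = 3              # local gradient of the inner step
dJ_da = dJ_dv * dv_da  # chain rule: 2 * 6.5 * 3 = 39.0
print(dJ_da)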
6) Weight update

New weight = old weight − learning rate × derivative


Learning rate

• If the learning rate is too large, you may never converge to the minimum. If it is too small, convergence takes a very long time. So we need to strike a balance and find an optimum value.

• Several weight-update methods exist. These methods are called optimisers. The delta rule is the simplest and most intuitive one; it is known as standard gradient descent.
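A minimal sketch of this update rule on the y = 2x toy example, using the analytic gradient of the squared error (2·Σ(w·x − t)·x); the learning rate of 0.01 is an illustrative choice. The weight moves from the initial guess of 3 toward the correct value 2.

inputs  = [0, 1, 2, 3, 4]
targets = [0, 2, 4, 6, 8]

w, lr = 3.0, 0.01          # initial weight and learning rate (illustrative values)

for step in range(100):
    grad = sum(2 * (w * x - t) * x for x, t in zip(inputs, targets))
    w = w - lr * grad      # new weight = old weight - learning rate * derivative
print(w)                   # approaches 2.0, the desired mapping y = 2x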
7) Iteration until convergence

Depends on many factors


• Learning rate
• Optimisation method
• Random initialisation
• Quality of training set
A step-by-step example of the math involved
Random initialisation
Forward pass
Matrix representation

GPU vs CPU
Calculating Output
Loss function / Error

Repeating the process for the other output


Backward pass
Weight update
Hidden layer

Hidden layer weight updates

Finally, we've updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085. At that point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs. the 0.01 target) and 0.984065734 (vs. the 0.99 target).
Simulation

https://www.mladdict.com/linear-regression-simulator
https://www.mladdict.com/neural-network-simulator
Effect of Batch size

• Updating the parameters using all of the training data at once is not efficient. You can update the parameters several times per pass if you only use part of the data each time.

• On the other hand, updating from a single sample (online updating) is noisy if that sample is not a good representation of the whole data set. You can consider a mini-batch to be an approximation of the whole data set.
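A sketch of how mini-batches slice the training set so that several updates happen per pass (epoch); the batch size of 2 is arbitrary, and the gradient/update step is left as a placeholder comment.

import random

data = list(zip([0, 1, 2, 3, 4], [0, 2, 4, 6, 8]))
batch_size = 2                      # illustrative choice

random.shuffle(data)
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]  # a mini-batch approximates the full data set
    # ...compute the gradient on `batch` and update the weights here...
    print(batch)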
Activation functions

• Sigmoid
• Tanh
• Relu (Rectified linear Unit)
Sigmoid
Tanh
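Both of these saturating activations can be written as plain functions; a minimal sketch:

import math

def sigmoid(x):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1); tanh(x) = 2*sigmoid(2x) - 1
    return math.tanh(x)

print(sigmoid(0.0), tanh(0.0))    # 0.5 0.0
print(sigmoid(5.0), tanh(5.0))    # ~0.993 ~0.9999 (already nearly saturated)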
Vanishing gradient

If your weight matrix W is initialized too large, the output of the matrix multiply can have a very large range (e.g. numbers between -400 and 400), which makes all outputs in the vector z almost binary: either 1 or 0. But in that case z*(1-z), which is the local gradient of the sigmoid nonlinearity, becomes zero ("vanishes") in both cases, making the gradient for both x and W zero. The rest of the backward pass comes out all zero from that point on due to the multiplication in the chain rule. The same holds for tanh, since it is just a scaled and shifted version of the sigmoid.
Relu

The Rectified Linear Unit has become very popular in recent years. It computes the function f(x) = max(0, x). In other words, the activation is simply thresholded at zero. There are several pros and cons to using ReLUs:

● It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions (e.g. by a factor of 6 in Krizhevsky et al.). It is argued that this is due to its linear, non-saturating form.
● Compared to tanh/sigmoid neurons, which involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
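That thresholding is a one-liner; a minimal sketch with NumPy:

import numpy as np

def relu(z):
    # f(x) = max(0, x), applied elementwise: just threshold at zero
    return np.maximum(0, z)

z = np.array([[-2.0, 0.5], [3.0, -0.1]])
print(relu(z))   # [[0.  0.5] [3.  0. ]]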
Dying ReLu

Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, ReLU units can irreversibly die during training; it's like permanent, irrecoverable brain damage. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high.
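A small sketch of the failure mode (the numbers are invented for illustration): once the pre-activation is negative for every input, the ReLU output and its gradient are zero everywhere, so the weights can never recover.

import numpy as np

x = np.array([0.3, 0.8, 1.5, 2.0])        # all inputs are positive
w, b = 0.5, -10.0                          # bias knocked far negative by a large update

pre_activation = w * x + b                 # negative for every datapoint
output = np.maximum(0, pre_activation)     # always 0
grad_mask = (pre_activation > 0).astype(float)
print(output, grad_mask)                   # zeros everywhere: no gradient ever flows back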
Experiments: Dropout

In each layer of the neural network, the neurons become dependent on each other, and some neurons gain more influence than others. The dropout layer randomly mutes different neurons, so each neuron has to build a distinct contribution to the final output. The second popular method to prevent overfitting is applying an L1 or L2 regularizer function on each layer.
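A sketch of (inverted) dropout applied to one layer's activations during training; the keep probability of 0.8 is an illustrative choice.

import numpy as np

def dropout(activations, keep_prob=0.8):
    # Randomly mute neurons; scale the survivors so the expected value is unchanged
    # ("inverted dropout"). At test time the layer is used as-is, without a mask.
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

a = np.array([0.2, 1.3, 0.7, 2.1, 0.4])
print(dropout(a))   # some entries zeroed, the rest scaled up by 1/keep_prob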
Experiments: Regularisation

The neural network with regularization functions outperforms the one without them. The L2 regularization function penalizes models that are too complex: it measures how much each weight contributes to the final output and penalizes the ones with large coefficients.
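A sketch of how the L2 penalty is added on top of the data loss; the regularization strength lambda and the example weights are illustrative values.

import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-3):
    # Adds lambda * sum(w^2), so weights with large coefficients are penalised
    return data_loss + lam * sum(np.sum(w ** 2) for w in weights)

weights = [np.array([[0.5, -1.2], [2.0, 0.1]])]
print(l2_regularized_loss(0.30, weights))   # 0.30 + 1e-3 * (0.25 + 1.44 + 4.0 + 0.01)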
Experiments: Batch size

As we see in the result, a large batch size requires fewer cycles, and each training step is more accurate. In comparison, a smaller batch size is noisier but takes more steps to compensate for it. A large batch size requires fewer learning steps, but you need more memory and time to compute each step.
Experiments: Learning Rate

The learning rate is often considered one of the most important hyperparameters due to its impact. It regulates how much the weights are adjusted at each learning step. If the learning rate is too high or too low, the network might not converge, like the large learning rate above. There is no fixed way of designing neural networks; a lot of it comes down to experimentation. Look at what others have done by adding layers and tuning hyperparameters. If you have access to a lot of computing power, you can create programs to design and tune networks automatically.
Experiments : Optimiser

As we can see, the adaptive learning-rate methods (Adagrad, Adadelta, RMSprop, and Adam) are the most suitable and provide the best convergence for these scenarios.
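To make "adaptive learning rate" concrete, here is a sketch of a single Adam update step using the standard update equations (the hyperparameters β1 = 0.9, β2 = 0.999, ε = 1e-8 and the example gradient are the usual defaults, assumed here for illustration):

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update: running averages of the gradient and its square,
    # bias-corrected, give each weight its own effective step size.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, grad=np.array([60.0]), m=m, v=v, t=1)
print(w)   # the weight moves by roughly the learning rate, regardless of gradient scale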
