Artificial Neural Networks - DL
Regularisation
- L2
- Dropout
Weight initialisation
- Uniform distribution
- He initialisation
- Xavier initialisation
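A minimal NumPy sketch of the three schemes listed above, for a single fully connected layer; the layer sizes and the uniform range are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 256  # illustrative layer sizes

# Plain uniform initialisation in a small, fixed range
W_uniform = rng.uniform(-0.05, 0.05, size=(n_in, n_out))

# Xavier/Glorot initialisation (suits sigmoid/tanh layers):
# range scaled by the number of inputs and outputs
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_in, n_out))

# He initialisation (suits ReLU layers): normal with std = sqrt(2 / n_in)
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```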
Learning rate
Hyperparameters
- Epochs
- Batch Size
- Number of neurons
- Momentum
Training of a neural net
Overview of Learning
1) Model initialisation
2) Forward propagate
3) Loss function
4) Optimising weights
5) Backpropagation
6) Weight update
7) Iteration until convergence
1) Model initialisation
x (input)    y (target)
0            0
1            2
2            4
3            6
4            8
2) Forward propagate
x    prediction (initial weight w = 3)
0    0
1    3
2    6
3    9
4    12
3) Loss function
x     prediction    target    absolute error    squared error
0     0             0         0                 0
1     3             2         1                 1
2     6             4         2                 4
3     9             6         3                 9
4     12            8         4                 16
Total                         10                30
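Steps 1-3 of the walkthrough fit in a few lines of NumPy; a sketch that reproduces the tables above (training data y = 2x, weight initialised to 3, error totals of 10 and 30):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)  # inputs
y = 2 * x                                   # targets: y = 2x
w = 3.0                                     # step 1: model initialised with w = 3

y_hat = w * x                               # step 2: forward propagate -> 0, 3, 6, 9, 12

abs_error = np.sum(np.abs(y_hat - y))       # step 3: total absolute error -> 10
sq_error = np.sum((y_hat - y) ** 2)         # total squared error -> 30
print(abs_error, sq_error)
```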
4) Optimising the weight
4a) Differentiation
x     target    prediction (w = 3)    squared error    prediction (w = 3.0001)    squared error
0     0         0                     0                0                          0
1     2         3                     1                3.0001                     1.0002
2     4         6                     4                6.0002                     4.0008
3     6         9                     9                9.0003                     9.0018
4     8         12                    16               12.0004                    16.0032
Total                                 30                                          30.006
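The same finite-difference estimate in code: nudge the weight by 0.0001, recompute the total squared error, and divide the change in error by the change in weight to approximate the derivative.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 2 * x

def total_squared_error(w):
    return np.sum((w * x - y) ** 2)

w, eps = 3.0, 1e-4
e0 = total_squared_error(w)        # 30.0
e1 = total_squared_error(w + eps)  # ~30.006
slope = (e1 - e0) / eps            # ~60: the error grows when w grows,
print(e0, e1, slope)               # so w should be decreased
```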
4b) Moving across the error curve
5) Backpropagation
• If the learning rate is too big, you can never converge to the low point; if it is too small, you will take a lot of time to converge. So we need to strike a balance and find an optimum value.
• Several weight update methods exist. These methods are called optimisers. The delta rule is the simplest and most intuitive one; we call it standard gradient descent (see the sketch below).
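A minimal sketch of one delta-rule update on the toy example; the learning rate of 0.01 is an illustrative choice, not a value from the slides.

```python
learning_rate = 0.01   # illustrative: too big diverges, too small crawls
w = 3.0
gradient = 60.0        # slope estimated in step 4a
w = w - learning_rate * gradient  # delta rule: step against the gradient
print(w)               # 2.4 -- already closer to the true weight of 2
```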
7) Iteration until convergence
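Continuing the sketch, the forward pass, gradient estimate, and weight update are repeated until the gradient (and hence the error) stops shrinking:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 2 * x
w, learning_rate, eps = 3.0, 0.01, 1e-4

def total_squared_error(w):
    return np.sum((w * x - y) ** 2)

for step in range(100):
    grad = (total_squared_error(w + eps) - total_squared_error(w)) / eps
    w -= learning_rate * grad      # delta-rule update
    if abs(grad) < 1e-3:           # crude convergence check
        break

print(step, w)                     # w converges to ~2, the slope of the training data
```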
GPU vs CPU
Calculating Output
Loss function / Error
Hidden layer weight updates
Finally, we’ve updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally,
the error on the network was 0.298371109. After this first round of backpropagation, the total error
is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000
times, for example, the error plummets to 0.0000351085. At this point, when we feed forward 0.05
and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
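The numbers quoted above come from Matt Mazur's "A Step by Step Backpropagation Example": a 2-2-2 network with sigmoid activations, inputs 0.05 and 0.10, and targets 0.01 and 0.99. A sketch of its forward pass, assuming that example's published initial weights (they are not listed on this slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])            # inputs
t = np.array([0.01, 0.99])            # targets

# Initial weights/biases taken from Mazur's example (assumed here)
W1 = np.array([[0.15, 0.20],          # input  -> hidden
               [0.25, 0.30]])
b1 = 0.35
W2 = np.array([[0.40, 0.45],          # hidden -> output
               [0.50, 0.55]])
b2 = 0.60

h = sigmoid(W1 @ x + b1)              # hidden activations ~ [0.593, 0.597]
o = sigmoid(W2 @ h + b2)              # outputs ~ [0.751, 0.773]
total_error = np.sum(0.5 * (t - o) ** 2)
print(total_error)                    # ~0.298371, the starting error quoted above
```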
Simulation
https://www.mladdict.com/linear-regression-simulator
https://www.mladdict.com/neural-network-simulator
Effect of Batch size
• Updating the parameters only once per pass over all of the training data is not efficient. If you use only part of the data (a mini-batch) at a time, you can update the parameters several times per pass, as sketched below.
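A sketch of this on the earlier toy model: with a batch size of 2 (an illustrative choice) the weight is updated three times per pass over the five data points instead of once.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 2 * x
w, lr, batch_size = 3.0, 0.01, 2               # batch size 2 is an illustrative choice

for epoch in range(20):
    for start in range(0, len(x), batch_size):
        xb, yb = x[start:start + batch_size], y[start:start + batch_size]
        grad = np.sum(2 * (w * xb - yb) * xb)  # gradient of the batch's squared error
        w -= lr * grad                         # one update per mini-batch
print(w)                                       # ~2.0
```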
Activation functions
• Sigmoid
• Tanh
• ReLU (Rectified Linear Unit)
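Minimal NumPy definitions of the three activations and their derivatives; the derivative of the sigmoid is the z*(1-z) local gradient discussed under "Vanishing gradient" below.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_sigmoid(s):
    z = sigmoid(s)
    return z * (1.0 - z)              # local gradient; ~0 when z saturates at 0 or 1

def tanh(s):
    return np.tanh(s)

def d_tanh(s):
    return 1.0 - np.tanh(s) ** 2      # also ~0 when tanh saturates at -1 or 1

def relu(s):
    return np.maximum(0.0, s)

def d_relu(s):
    return (s > 0).astype(float)      # 0 for negative inputs, 1 for positive inputs
```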
Sigmoid
Tanh
Vanishing gradient
If your weight matrix W is initialised too large, the output of the matrix multiply can have a very large range (e.g. numbers between -400 and 400), which makes all outputs in the vector z almost binary: either 1 or 0. But in that case z*(1-z), which is the local gradient of the sigmoid nonlinearity, becomes zero ("vanishes") in both cases, making the gradients for both x and W zero. The rest of the backward pass will come out all zero from this point on, due to the multiplication in the chain rule. The same is the case for tanh, as it is just a scaled version of the sigmoid.
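A small demonstration of the effect described above, with weight scales chosen purely for illustration: with overly large weights the pre-activations are huge, the sigmoid outputs saturate at 0 or 1, and the local gradient z*(1-z) collapses towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

W_reasonable = rng.normal(0.0, 0.1, size=(100, 100))  # sensibly scaled weights
W_too_large = rng.normal(0.0, 10.0, size=(100, 100))  # initialised far too large

for W in (W_reasonable, W_too_large):
    z = sigmoid(W @ x)                 # layer output
    local_grad = z * (1 - z)           # local gradient of the sigmoid
    print((local_grad < 1e-3).mean())  # fraction of neurons whose gradient has
                                       # vanished: tiny for the sensible scale,
                                       # most of the layer for the large weights
```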
ReLU
Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, ReLU units can irreversibly die during training. It's like permanent, irrecoverable brain damage. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high.
Experiments: Dropout
In each layer of the neural network, the neurons become dependent on each other. Some neurons gain more influence than others. The dropout layer randomly mutes different neurons. This way each neuron has to make a distinct contribution to the final output. The second popular method to prevent overfitting is applying an L1 or L2 regulariser function on each layer.
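A sketch of (inverted) dropout applied to a layer's activations: during training a random mask mutes some neurons and scales up the survivors, so the layer can be used unchanged at test time. The drop probability of 0.5 is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly mute neurons during training."""
    if not training:
        return activations                            # no muting at test time
    keep = rng.random(activations.shape) >= p_drop    # which neurons survive
    return activations * keep / (1.0 - p_drop)        # muted neurons output zero

h = rng.normal(size=8)        # some hidden-layer activations
print(dropout(h))             # roughly half the entries are zeroed on each call
```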
Experiments: Regularisation
The neural network with regularisation functions outperforms the one without them. The L2 regulariser punishes functions that are too complex: it measures how much each weight contributes to the final output and then punishes the ones with large coefficients.
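In code, L2 regularisation simply adds a penalty proportional to the squared weights to the loss; the lambda value below is illustrative.

```python
import numpy as np

def l2_regularised_loss(y_hat, y, weights, lam=0.01):
    """Data loss plus an L2 penalty that punishes large coefficients."""
    data_loss = np.sum((y_hat - y) ** 2)
    penalty = lam * np.sum(weights ** 2)   # grows with the magnitude of the weights
    return data_loss + penalty

# The penalty's gradient contribution is 2 * lam * weights, which shrinks every
# weight slightly on each update (also known as weight decay).
```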
Experiments: Batch size
As we see in the result, a large batch size requires fewer cycles but gives more accurate training steps. In comparison, a smaller batch size is more random but takes more steps to compensate for it. A large batch size requires fewer learning steps, but you need more memory and time to compute each step.
Experiments: Learning Rate
The learning rate is often considered one of the most important hyperparameters because of its impact. It controls how much the weights are adjusted at each learning step. If the learning rate is too high, training may never converge, like the large learning rate example above; if it is too low, training takes a very long time. There is no fixed way of designing neural networks. A lot of it comes down to experimentation. Look at what others have done by adding layers and tuning hyperparameters. If you have access to a lot of computing power, you can create programs to design and tune networks for you.
Experiments: Optimiser
As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are
most suitable and provide the best convergence for these scenarios.
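For reference, a sketch of the Adam update rule (the adaptive method most commonly used as a default), applied here to the earlier toy problem; the beta and epsilon values are Adam's usual defaults, and the learning rate is an illustrative choice.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = 2 * x

w = 3.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # betas/eps are Adam's usual defaults
m = v = 0.0

for t in range(1, 2001):
    grad = np.sum(2 * (w * x - y) * x)          # gradient of the squared error
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)    # per-parameter adaptive step size

print(w)                                        # ends up close to the true weight of 2
```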