NN_Notes

Uploaded by

SOUMYODEEP
Copyright
© © All Rights Reserved

Summary Notes

● NN-1: Intro to NN
What are the issues with Classic ML Algos?
- They require manual feature engineering to be able to create complex
boundaries
- They might not always work well with big datasets, and we are living in a big-data
regime
- They work poorly for sparse data and for unstructured data (image/ text/ speech
data)
Clearly, all these algos have limitations, which calls for a more powerful ML model
that can work in the conditions mentioned above. This brings us to Neural Networks (NN).

NN models power pretty much every minute of your digital life


- Online Ads (Google, YouTube)
- Data compression: Done using Autoencoders
- Image enhancement: Eg Magic Eraser
- Gmail: Eg Autocomplete, smart reply

Where did the inspiration for a neuron come from?


NN is loosely inspired by the biological neurons found in the human brain.
In the brain, there exist biological neurons that are connected to each other, forming a
network
In simple terms, we understood that
- Neuron takes input(s)
- Perform some computations.
- Ultimately, it fires/passes the output to further neurons, for further processing.

How does an Artificial neuron do its computations?


Consider an artificial neuron.
- It receives input features: x1,x2,x3
- Every input has a weight associated with it: w1,w2,w3
- These weights are multiplied by the input values, thereby telling us how
important a given input is to the neuron.
- Inputs are processed by taking the weighted sum.
- Also, a bias term is added.
- The net input becomes: w1x1 + w2x2 + w3x3 + b = z (let)
- There is a function, called activation function f, which is associated with a neuron
- The neuron applies this function on the net input value:
f(z) = f( w1x1 + w2x2 + w3x3 + b )
- The result of this function becomes the output o1 = f(z)
- This output is then forwarded to other neurons.
This flow of computations is known as Forward Propagation.
Notice that we are going from left to right during Forward Propagation.
What happens if we put the sigmoid function as the activation function?
We get a Logistic Regression model, as the output of this becomes:
o1 = sigmoid( w1x1 + w2x2 + w3x3 + b )
This neuron, where the activation is a sigmoid function is called a Logistic Regression
Unit (LRU)
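
The computation of a single neuron can be sketched in a few lines of NumPy. This is only an illustration: the input values, weights and bias below are made-up numbers, and `neuron_forward` is a hypothetical helper name, not something from the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    z = np.dot(w, x) + b          # z = w1*x1 + w2*x2 + w3*x3 + b
    return activation(z)          # o1 = f(z)

# Illustrative values only (assumed, not from the notes)
x = np.array([1.0, 2.0, 3.0])     # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.1])   # weights w1, w2, w3
b = 0.2                           # bias

o1 = neuron_forward(x, w, b)      # here z = 0.5, so o1 = sigmoid(0.5)
print(o1)
```

With the sigmoid as `activation`, this neuron is exactly the Logistic Regression Unit described above; passing a different function changes the model without changing the structure.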
We can diagrammatically represent a neuron with 2 inputs as:

What if the model looked the same, but the activation function is different? Would
it still be called the same NN?
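If we keep the same neuron but pair its output with the hinge loss instead of the sigmoid and log loss, we get a Linear SVM model.
(Strictly, hinge is a loss computed on z rather than an activation, so the neuron's own output is the linear score z.)
Forward propagation here would look like:-
- z = w1x1 + w2x2 + … + wdxd + b
- ŷ = fhinge(z)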

Note:
- A single neuron is able to represent such powerful models, just imagine what will
happen if we use multiple neurons.
- Those models would be able to represent really complex relations.

A brief history of Artificial Neural Networks


Perceptron
- The Perceptron is the very first NN-based model. It was designed by Rosenblatt in
1957
- It is different from an LRU because it used a step function for activation, which is:

Challenges:
- As we tried to increase the complexity of NNs, they performed poorly.
- There was no effective way of training the NN.
- This put the whole area in deep freeze for a decade or so.
- A breakthrough came in 1986 when backpropagation was popularized by Rumelhart,
Hinton and Williams.
- This also brought up a lot of hype for artificial intelligence, but all of that hype
died down by the 1990s
- There were two main bottlenecks:-
- We didn't have computational power
- Nor did we have enough data to train
- Backprop was failing for NNs with more depth.
- This time period is called the AI winter, where the funding for AI dried up by 1995
- Meanwhile, Geoff Hinton continued his research and finally came up with a
solution to this problem in 2006.
- Also, the discovery of using ReLU and Leaky ReLU as activation functions was
another breakthrough.

How do NN fare against classical ML (based on training data)?


An LRU can only help in the case of binary classification (0 or 1).
How can we adapt a NN to do multi-class classification?
Suppose we wish to do multi-class classification for 3 classes: A, B, and C.
Recall multi-class classification:
- We calculated the probability that a given data point belongs to class A, B, or C
respectively.
- Then, we returned the class with the highest probability as the answer.
- This gives us the intuition that perhaps, our output layer should have 3 outputs.
One for each class.

We cannot perform multi-class classification using a sigmoid because we might get
p > 0.5 for more than one class.
- The model would then predict the presence of multiple classes in the output, e.g. [1, 1, 0], [1, 1,
1], [1, 0, 1]
- Conclusion: We want these probability values to sum to 1, as we had in Logistic
Regression (p and 1-p).
It should be $p_A + p_B + p_C = 1$
In order to do this, we use the softmax function as activation in the neurons of the
output layer.
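
The difference between per-class sigmoids and softmax can be seen on a small example. This is a sketch with made-up scores for the three classes; subtracting the row maximum inside `softmax` is a common numerical-stability trick and is not mentioned in the notes.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; shifting by the max avoids overflow in exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Assumed raw scores (net inputs) for classes A, B, C of one data point
z = np.array([[2.0, 1.0, 0.1]])

sig = 1.0 / (1.0 + np.exp(-z))   # independent sigmoids: several can exceed 0.5
p = softmax(z)                   # softmax: probabilities compete and sum to 1

print("sigmoid:", sig, "sum =", sig.sum())
print("softmax:", p, "sum =", p.sum())
```

Here every sigmoid output is above 0.5, so thresholding would predict all three classes at once, while the softmax outputs form a single probability distribution.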

The model now looks like:-


Note:
- There will be 2*3 = 6 weights, and they'll be stored as a 2x3 matrix: $W_{2 \times 3}$
- There will be 3 biases, one for each neuron, and they'll be stored as a 1x3 matrix: $b_{1 \times 3}$

How will we calculate the loss for this multi-class classification NN Model?
We use Categorical Cross Entropy.

where,
$k$ -> number of classes
$y_{ij}$ -> one hot encoded label. Ex: [1,0,0], [0,1,0], or [0,0,1]
$P_{ij}$ -> Calculated probability of datapoint i belonging to class j
This can be seen as log loss extended to the multiclass setting. For k=2, we will get log
loss formulation.
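
A minimal sketch of this loss in NumPy, using the symbols defined above. Two things are assumptions here because the formula itself is not shown in the notes: the loss is averaged over the data points, and probabilities are clipped to avoid log(0).

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Mean over data points i of -sum_j y_ij * log(P_ij)."""
    p = np.clip(p, eps, 1.0)                       # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Two illustrative data points, k = 3 classes
y = np.array([[1, 0, 0],
              [0, 1, 0]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y, p)

# For k = 2 the same function reduces to the usual log loss
y_bin = np.array([1, 0])                           # binary labels
p_bin = np.array([0.9, 0.3])                       # predicted P(class 1)
log_loss = -np.mean(y_bin * np.log(p_bin) + (1 - y_bin) * np.log(1 - p_bin))
cce_k2 = categorical_cross_entropy(
    np.stack([y_bin, 1 - y_bin], axis=1),
    np.stack([p_bin, 1 - p_bin], axis=1),
)
print(loss, log_loss, cce_k2)
```

Only the probability assigned to the true class contributes to each data point's loss, because every other term is multiplied by a 0 in the one-hot label.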

How to train NN?


Let, m -> no of training examples
d -> no of features
n -> no of classes/neurons in the output layer

Process to train a NN is:-


- Randomly initialise parameters: W and b matrices
- Do forward propagation
- $Z = X_{m \times d} \cdot W_{d \times n} + b_{1 \times n}$
- $A = activation(Z)$
- Calculate the Loss
- Repeat until Loss converges
- update $w_i = w_i - lr \cdot \frac{\partial Loss}{\partial w_i}$
- calculate the output using hypothesis and updated params
- calculate the Loss
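
The training process above can be sketched end to end for a single softmax layer. This is an illustration under several assumptions that are not in the notes: the data is a synthetic three-cluster toy set, the loop runs for a fixed 200 steps instead of testing for convergence, and the gradient `(A - Y) / m` is the standard result for softmax combined with categorical cross entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 150, 2, 3                       # examples, features, classes

# Synthetic toy data: three well-separated clusters (assumed, for illustration)
centers = np.array([[0.0, 4.0], [-4.0, -2.0], [4.0, -2.0]])
labels = rng.integers(0, n, size=m)
X = centers[labels] + rng.normal(scale=0.5, size=(m, d))
Y = np.eye(n)[labels]                     # one-hot labels, shape (m, n)

# Randomly initialise parameters
W = rng.normal(scale=0.01, size=(d, n))
b = np.zeros((1, n))
lr = 0.1

def forward(X, W, b):
    Z = X @ W + b                         # Z = X.W + b
    Z = Z - Z.max(axis=1, keepdims=True)  # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)   # A = softmax(Z)

def loss_fn(A, Y):
    return -np.mean(np.sum(Y * np.log(A + 1e-12), axis=1))

losses = []
for step in range(200):                   # fixed budget instead of a convergence test
    A = forward(X, W, b)                  # forward propagation
    losses.append(loss_fn(A, Y))          # calculate the loss
    dZ = (A - Y) / m                      # gradient of the loss w.r.t. Z
    W -= lr * (X.T @ dZ)                  # w = w - lr * dLoss/dw
    b -= lr * dZ.sum(axis=0, keepdims=True)

accuracy = np.mean(forward(X, W, b).argmax(axis=1) == labels)
print(losses[0], losses[-1], accuracy)
```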

● NN-3: Backpropagation and Activation functions


What are MLPs?
If we wish to get a more complex decision boundary, we need to add another layer of
neurons in the model. This is known as the hidden layer.
- These models are known as MLPs (Multi Layer Perceptrons). Or N-layered NN
- They are based on the idea of function composition.
- We can have as many hidden layers as we want, however, the greater the
number, the higher the risk of overfitting.
- The activation of hidden layers also needs to always be non-linear, otherwise,
we will not be able to get complex features.

A 2-layer model looks like:-

Since there are multiple layers, we modify the notation to cater to it


- Weights: $W^{L}_{ij}$
- Bias: $b^{L}_{i}$

Note:-
- Weights in layer-1 will be stored as a 2x4 matrix: $W^{1}_{2 \times 4}$
- Biases in layer-1 will be stored as a 1x4 matrix: $b^{1}_{1 \times 4}$
- Weights in layer-2 will be stored as a 4x3 matrix: $W^{2}_{4 \times 3}$
- Biases in layer-2 will be stored as a 1x3 matrix: $b^{2}_{1 \times 3}$

Why do we need to add a hidden layer to increase complexity?


The idea is that Stacking a non-linearity over a linear function, and repeating the process
helps create complex features
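
A small numerical check of why the non-linearity matters. In this sketch (random weights, sizes chosen to match the 2-4-3 network above), two stacked linear layers are shown to be equivalent to one linear layer, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 3))

# Two linear layers with no activation in between
two_linear = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W_eq = W1 @ W2
b_eq = b1 @ W2 + b2
one_linear = X @ W_eq + b_eq
collapsed = np.allclose(two_linear, one_linear)

# With a non-linearity between the layers, no single linear layer matches
with_relu = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
differs = not np.allclose(with_relu, one_linear)

print(collapsed, differs)
```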

What does forward propagation look like for this NN?


Let h -> no of neurons in the hidden layer.
m -> no of training examples
d -> no of features
n -> no of classes/neurons in the output layer

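Forward propagation:-
$$Z^{1}_{m \times h} = X_{m \times d} \cdot W^{1}_{d \times h} + b^{1}_{1 \times h}$$
$$A^{1}_{m \times h} = f^{1}(Z^{1}_{m \times h})$$
$$Z^{2}_{m \times n} = A^{1}_{m \times h} \cdot W^{2}_{h \times n} + b^{2}_{1 \times n}$$
$$A^{2}_{m \times n} = f^{2}(Z^{2}_{m \times n})$$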
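
The four forward-propagation equations translate almost line by line into NumPy. The choice of ReLU for the hidden activation and softmax for the output activation is an assumption made for this sketch; the notes leave both functions generic.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

m, d, h, n = 5, 2, 4, 3                 # examples, features, hidden units, classes
rng = np.random.default_rng(0)
X = rng.normal(size=(m, d))
W1, b1 = rng.normal(size=(d, h)), np.zeros((1, h))
W2, b2 = rng.normal(size=(h, n)), np.zeros((1, n))

Z1 = X @ W1 + b1                        # (m, h); b1 is broadcast over the m rows
A1 = relu(Z1)                           # (m, h)
Z2 = A1 @ W2 + b2                       # (m, n)
A2 = softmax(Z2)                        # (m, n), each row is a probability vector

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```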

How do we train this complex NN? What is Backpropagation?


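- In order to update the parameters, we need to find their gradients. For this, we
use the Backpropagation algorithm, where we traverse from right to left in the
NN.
- It is based on the concept of chain rule of differentiation.

Gradient of $w^{3}_{11}$
We’ll encounter $a^{3}_{1}$ while going from loss towards $w^{3}_{11}$
Therefore gradient:
$$\frac{\partial L}{\partial w^{3}_{11}} = \frac{\partial L}{\partial a^{3}_{1}} \cdot \frac{\partial a^{3}_{1}}{\partial z^{3}_{1}} \cdot \frac{\partial z^{3}_{1}}{\partial w^{3}_{11}}$$

Gradient of $w^{2}_{11}$
We’ll encounter $a^{3}_{1}$ and $a^{2}_{1}$ while going from loss towards $w^{2}_{11}$
Therefore gradient:
$$\frac{\partial L}{\partial w^{2}_{11}} = \frac{\partial L}{\partial a^{3}_{1}} \cdot \frac{\partial a^{3}_{1}}{\partial z^{3}_{1}} \cdot \frac{\partial z^{3}_{1}}{\partial a^{2}_{1}} \cdot \frac{\partial a^{2}_{1}}{\partial z^{2}_{1}} \cdot \frac{\partial z^{2}_{1}}{\partial w^{2}_{11}}$$

Gradient of $w^{1}_{11}$
There are 2 possible paths to reach $w^{1}_{11}$
Path-1 : $L \rightarrow a^{3}_{1} \rightarrow a^{2}_{1} \rightarrow a^{1}_{1} \rightarrow w^{1}_{11}$
Path-2 : $L \rightarrow a^{3}_{1} \rightarrow a^{2}_{2} \rightarrow a^{1}_{1} \rightarrow w^{1}_{11}$

We need to combine the derivatives from path 1 and 2 by adding them up.

Therefore gradient:
$$\frac{\partial L}{\partial w^{1}_{11}} = \frac{\partial L}{\partial a^{3}_{1}} \cdot \frac{\partial a^{3}_{1}}{\partial z^{3}_{1}} \cdot \frac{\partial z^{3}_{1}}{\partial a^{2}_{1}} \cdot \frac{\partial a^{2}_{1}}{\partial z^{2}_{1}} \cdot \frac{\partial z^{2}_{1}}{\partial a^{1}_{1}} \cdot \frac{\partial a^{1}_{1}}{\partial z^{1}_{1}} \cdot \frac{\partial z^{1}_{1}}{\partial w^{1}_{11}} + \frac{\partial L}{\partial a^{3}_{1}} \cdot \frac{\partial a^{3}_{1}}{\partial z^{3}_{1}} \cdot \frac{\partial z^{3}_{1}}{\partial a^{2}_{2}} \cdot \frac{\partial a^{2}_{2}}{\partial z^{2}_{2}} \cdot \frac{\partial z^{2}_{2}}{\partial a^{1}_{1}} \cdot \frac{\partial a^{1}_{1}}{\partial z^{1}_{1}} \cdot \frac{\partial z^{1}_{1}}{\partial w^{1}_{11}}$$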

What are the different activation functions?


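- Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Hyperbolic tan function: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- ReLU: $ReLU(z) = max(z, 0)$
- Leaky ReLU: $Leaky\ ReLU(z) = max(z, \alpha z)$; $\alpha$ is a small positive slope applied to negative inputs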
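
The four activation functions listed above, written as a NumPy sketch. The default `alpha=0.01` for Leaky ReLU is a commonly used value chosen here for illustration; the notes do not fix it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1 this keeps z when z > 0 and alpha * z otherwise
    return np.maximum(z, alpha * z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
print(leaky_relu(z))
```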

What is the vanishing gradient problem?


- Downside of both sigmoid and tanh is that their gradient is ~0 for most
values of z (everywhere except a narrow band around z = 0)
- This hampers the gradient descent process, as the calculated gradients become
very small.
- For eg, suppose we wish to update weight $w^{1}_{11}$. Its gradient is calculated as:

- So, the product of these terms inside the bracket will become very small.
- In fact, as the number of layers in the NN increases, this product will become
smaller and smaller.
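
A rough numerical illustration of the effect. The sigmoid derivative never exceeds 0.25, so a chain of L such factors is at most $0.25^L$. This sketch looks only at the activation-derivative factors and ignores the weight terms that also appear in the real chain rule product, so it is an intuition aid, not an exact gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)          # the derivative peaks at z = 0
far_grad = sigmoid_grad(5.0)          # already tiny a few units away from 0

# Best-case product of L sigmoid derivatives along one backprop path
products = {L: max_grad ** L for L in (2, 5, 10, 20)}

print(max_grad, far_grad)
for L, value in products.items():
    print(L, value)
```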

Backprop for MLP

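● Calculating dZ2

$$dZ^{2} = \frac{\partial L}{\partial Z^{2}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}}$$

For a softmax output $p$ with categorical cross entropy, where $k$ is the true class:

$$dZ^{2} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial Z^{2}} = p_i - I(i == k)$$

● Calculating dW2

$$dW^{2} = \frac{\partial L}{\partial W^{2}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial W^{2}}$$

$$dW^{2} = dZ^{2} \cdot \frac{\partial Z^{2}}{\partial W^{2}}$$

$$dW^{2} = dZ^{2} \cdot A^{1}$$

(with the matrix shapes used earlier, this product is computed as $(A^{1})^{T} dZ^{2}$)

● Calculating db2

$$db^{2} = \frac{\partial L}{\partial b^{2}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial b^{2}}$$

$$db^{2} = dZ^{2} \cdot \frac{\partial Z^{2}}{\partial b^{2}} = dZ^{2}$$

● Calculating dA1

$$dA^{1} = \frac{\partial L}{\partial A^{1}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial A^{1}}$$

$$dA^{1} = dZ^{2} \cdot \frac{\partial Z^{2}}{\partial A^{1}} = dZ^{2} \cdot W^{2}$$

(in matrix form: $dZ^{2} (W^{2})^{T}$)

● Calculating dZ1

$$dZ^{1} = \frac{\partial L}{\partial Z^{1}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial A^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}}$$

$$dZ^{1} = dA^{1} \cdot \frac{\partial A^{1}}{\partial Z^{1}}$$

● Calculating dW1

$$dW^{1} = \frac{\partial L}{\partial W^{1}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial A^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}} \cdot \frac{\partial Z^{1}}{\partial W^{1}}$$

$$dW^{1} = dZ^{1} \cdot \frac{\partial Z^{1}}{\partial W^{1}} = dZ^{1} \cdot X$$

(in matrix form: $X^{T} dZ^{1}$)

● Calculating db1

$$db^{1} = \frac{\partial L}{\partial b^{1}} = \frac{\partial L}{\partial A^{2}} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial Z^{2}}{\partial A^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}} \cdot \frac{\partial Z^{1}}{\partial b^{1}}$$

$$db^{1} = dZ^{1} \cdot \frac{\partial Z^{1}}{\partial b^{1}} = dZ^{1} \cdot 1 = dZ^{1}$$
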
Summarizing forward and backward prop for MLP
While performing forward prop,
- we store/cache the values of $Z_j$, $W_j$, $b_j$ for every layer $j$ in order to use them during back prop
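
The backprop quantities dZ2, dW2, db2, dA1, dZ1, dW1 and db1 can be sketched for a two-layer MLP as follows. Assumptions made for this illustration: the hidden activation is tanh, the output is softmax with categorical cross entropy averaged over the batch, and the data is random. One analytic gradient entry is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, params):
    W1, b1, W2, b2 = params
    Z1 = X @ W1 + b1
    A1 = np.tanh(Z1)
    Z2 = A1 @ W2 + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2                  # cached for the backward pass

def loss_fn(A2, Y):
    return -np.mean(np.sum(Y * np.log(A2), axis=1))

def backward(X, Y, params):
    W1, b1, W2, b2 = params
    m = X.shape[0]
    Z1, A1, Z2, A2 = forward(X, params)
    dZ2 = (A2 - Y) / m                     # p_i - I(i == k), averaged over the batch
    dW2 = A1.T @ dZ2                       # dZ2 . A1 in matrix form
    db2 = dZ2.sum(axis=0, keepdims=True)   # dZ2 summed over the batch
    dA1 = dZ2 @ W2.T                       # dZ2 . W2 in matrix form
    dZ1 = dA1 * (1.0 - A1 ** 2)            # dA1 times tanh'(Z1)
    dW1 = X.T @ dZ1                        # dZ1 . X in matrix form
    db1 = dZ1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
m, d, h, n = 6, 2, 4, 3
X = rng.normal(size=(m, d))
Y = np.eye(n)[rng.integers(0, n, size=m)]
params = [rng.normal(scale=0.5, size=s) for s in [(d, h), (1, h), (h, n), (1, n)]]
dW1, db1, dW2, db2 = backward(X, Y, params)

# Finite-difference check of dL/dW1[0, 0]
eps = 1e-6
W1_plus, W1_minus = params[0].copy(), params[0].copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
loss_plus = loss_fn(forward(X, [W1_plus] + params[1:])[3], Y)
loss_minus = loss_fn(forward(X, [W1_minus] + params[1:])[3], Y)
numeric = (loss_plus - loss_minus) / (2 * eps)
print(dW1[0, 0], numeric)
```

Note that each gradient has the same shape as the parameter it belongs to, which is what makes the update `W = W - lr * dW` well defined.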

Can we use Neural Networks for the Regression task?


- Yes, if the activation function for the output layer is a linear function, then NN will
do regression.
- The activations for intermediate layers still need to be non-linear, otherwise, NN
will not be able to map complex relationships.

Tensorflow and keras

Tensorflow

TensorFlow is the premier open-source deep learning framework developed and maintained by
Google

Keras
● Using TensorFlow directly can be challenging.
● In TensorFlow 2, Keras has become the default high-level API
● No need to separately install keras
● The modern tf.keras API brings Keras’s simplicity and ease of use to the
TensorFlow project.

Why keras ?
● easy to build and use giant deep learning models
● Light-weight and quick
● can support other backends as well besides tensorflow, eg: Theano
● Open source

dir()
It returns a list of the attributes and methods of any object
- dir(tf.keras)
- dir(tf.keras.activations)
- dir(tf.data)

Ways of writing code in Keras


● Sequential API
● Functional API

Keras Sequential API


● Simplest and recommended API to start with
● Called “Sequential” because we add layers to the model one by one in a
linear manner, from input to output.
● You can select optimizers, loss functions and metrics while writing code

How to import ?

Dense layer helps us define one layer of a Feedforward NN.

Example model using Sequential API


Note : The layers in the sequential model interact with each other therefore we don't need to
define the input shape for all the layers.

We can check model weights using model.weights

model.add()
● Instead of passing the list of layers as an argument while creating a model instance, we
can use the add method.

model.summary()
To print the summary of model we have created

Custom names
● keras has provided the names by itself.
● We can also give custom names to the layer as well
Plotting model
Weights and Bias Initializer
In keras, in Dense layer,

1) the biases are set to zero (zeros) by default


2) the weights are set according to glorot_uniform

For our own custom initializer, we can use the bias_initializer and kernel_initializer arguments

Compiler - loss and optimizer

What things to decide while compiling model ?

- Loss
- Optimizer

we can define a list of metrics which we might want to track during the training, like accuracy

● Another way to define optimizer = keras.optimizers.Adam(learning_rate=0.01)


● We can also pass customized loss and optimizer functions in keras models.
● These metrics will be calculated and saved after each epoch (one pass of whole data to
update the model).

Epoch
● To avoid memory issues, data is passed in small batches instead of whole
● Each pass of mini-batch is called an iteration.
● Each pass of whole datasets is called an Epoch.
● One epoch means that each sample in the training dataset has had an opportunity to
update the internal model parameters.
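
The relationship between batches, iterations and epochs is simple arithmetic. The sample count, batch size and epoch count below are made-up numbers; the sketch assumes the last, smaller batch is still used rather than dropped.

```python
import math

num_samples = 1000      # assumed dataset size
batch_size = 32         # assumed mini-batch size
epochs = 5              # assumed number of passes over the data

# One iteration = one mini-batch; the final batch may be smaller than batch_size
iterations_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (iterations_per_epoch - 1) * batch_size
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch, last_batch_size, total_iterations)
```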

Training : model.fit ()

It means updating the weights using the optimizer and loss functions on the dataset.
model.fit(X_train, y_train)

X_train = (num_samples, num_features)

y_train = (num_samples, num_classes) or y_train = (num_samples, )

Arguments we can pass to this method

- epochs ( number of epochs you want to train for )
- batch_size ( batch size, usually a power of 2 ($2^x$) like 4, 8, 16, 32 )
- validation_split ( fraction of the training data to hold out as validation data )
- verbose ( 0 for silent training, 1 for a progress bar, 2 for one line per epoch )

Note: After training, the weights will have moved away from their initial uniform distribution (it often looks closer to a bell shape) and the biases will generally no longer be zero

History object

● model.fit returns a history object which contains the record of the progress of NN training.
● History object contains records of loss and metrics values for each epoch.
● As an alternative to dir(), the __dict__ attribute can be used to retrieve all the keys
associated with the object on which it is called.

The history object's __dict__ contains another dictionary under the key "history"
The model saves all the loss and metrics values for each epoch inside this history dictionary,
where the values for each metric are stored in separate lists.

Prediction and Evaluation

Evaluate the model


loss, accuracy = model.evaluate(X_test, y_test)

● model.evaluate returns the loss value & metrics value for the model.
● weights/parameters are not updated during evaluation (and prediction) which means
only forward pass, no backward pass
Predictions
pred = model.predict(X_test)

● To get predictions on unseen data, the model.predict method is used
● It returns raw output from the model (i.e. probabilities of an observation belonging to
each one of the 4 classes)
● The sum of the probabilities of an observation belonging to each of the 4 classes will be 1, i.e.
np.sum(pred, axis=1)

To know the class an observation belongs to, using these 4 probability values

● Find the index having the largest probability and that will be the predicted class.
● pred_class = np.argmax(pred, axis = 1)

To check accuracy of the model using sklearn's accuracy_score

from sklearn.metrics import accuracy_score


acc_score = accuracy_score(y_test, pred_class)
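
The post-processing steps can be tried without a trained model. In this sketch `pred` is a hand-written stand-in for the output of `model.predict` on three observations, and the accuracy is computed with plain NumPy, which for hard class labels is the fraction of matching predictions.

```python
import numpy as np

# Stand-in for model.predict output: 3 observations, 4 class probabilities each
pred = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.2, 0.5, 0.1]])
y_test = np.array([1, 0, 3])                 # assumed true class indices

row_sums = np.sum(pred, axis=1)              # each row sums to 1
pred_class = np.argmax(pred, axis=1)         # index of the largest probability
acc = np.mean(pred_class == y_test)          # fraction of correct predictions

print(row_sums, pred_class, acc)
```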

Callbacks
A callback defines a set of functions which are executed at different stages of the training
procedure.
They can be used to view internal states of the model during training.
● For example, we may want to print loss, accuracy or lr every 2000th epoch.
Examples:

1. on_epoch_begin : function will execute before every epoch


2. on_epoch_end : function will execute after every epoch
3. verbose=1, model training prints associated data after every epoch
4. verbose=0, model training prints nothing

Customized callback example

● The custom class will inherit from tf.keras.callbacks.Callback.


● All methods in the tf.keras.callbacks.Callback class will be available for our customized class,
and we can also override them.

We will have to pass a list of callback objects to callbacks argument of the fit method
Note: We can pass callback objects to evaluate and predict method as well.

The parent class tf.keras.callbacks.Callback supports various kinds of methods which we can
override

● Global methods
at the beginning/ending of training
● Batch-level method
at the beginning/ending of a batch
● Epoch-level method
at the beginning/ending of an epoch

Other examples include:

1. CSVLogger - save history object in a csv file: csv_logger = keras.callbacks.CSVLogger("file_name.csv")
2. EarlyStopping - stop the training when model starts to overfit
3. ModelCheckpoint - saves the intermediate model weights
4. LearningRateScheduler - control/change Learning Rate in between epoch

Tensorboard

● It is used to closely monitor the training process


● It can be used to visualize information regarding the training process like
❖ Metrics - loss, accuracy
❖ Visualize the model graphs
❖ Histograms of W, b, or other tensors as they change during training - distributions
❖ Displaying images, text, and audio data

Ways to install

pip install tensorboard

conda install -c conda-forge tensorboard

To load tensorboard in the notebook

%load_ext tensorboard

TensorBoard will store all the logs in this log directory.


log_folder = 'logs'

● It will read from these logs in order to display the various visualizations.

If we want to reload the TensorBoard extension, we can use the reload magic method

%reload_ext tensorboard

Import

Callback arguments:

● log_dir - (Path)
○ directory where logs will be saved
○ This directory should not be used by any other callbacks.
● update_freq - (int/str)
○ how frequently losses/metrics are updated.
○ when set to 'batch', losses/metrics will be updated after every batch/iteration
○ when set to an integer N, the update will be done after N batches
○ when set to 'epoch', the update will be done after every epoch
● histogram_freq - (int)
○ how frequently (in epochs) histograms (distribution of W) will be updated.
○ Setting this to 0 means histograms will not be computed.
● write_graph - (Bool), True if we want to visualize the model graph
● write_images - (Bool), True if we want to visualize our model weights as images

Launch tensorboard using following command

%tensorboard --logdir={log_folder}
Weight Initialization

Why need different Optimizer techniques ?

Ans: When training a Deep NN, the model tends to have an immense training time due to
- Large number of layers in NN
- Large number of neurons in NN
This makes the number of parameters (the weight matrices) of the NN large.

What are the issues with Deep NN ?


1. Dead Neuron: When the weight $W = 0$ and bias $b = 0$ for all the layers of the NN

- As the activation function is ReLU, which means:
$$\frac{\partial relu(z)}{\partial z} = 1 \text{ if } z > 0 \quad \text{and} \quad \frac{\partial relu(z)}{\partial z} = 0 \text{ if } z \le 0$$
- every $z$ is 0, so every gradient is 0. This makes the weight update zero, meaning the NN doesn't learn during training
time
- Similarly, when $W = k$ and bias $b = k$ for all the layers of the NN, where $k$ is a
constant
- All neurons in a layer compute the same output and receive the same gradient, so the model acts as a single-neuron NN, in spite of there being N neurons.
2. Exploding Gradients
- If the Deep NN uses a linear activation function for all of its L layers:
$a = g(z) = z$
- Then the weight matrices are $[W^{1}, W^{2}, W^{3}, \ldots, W^{L-1}, W^{L}]$, and the final layer output is:
$$\hat{y} = g(W^{L} \times a^{L-1})$$

As the activation is linear, $g(W^{L} a^{L-1}) = W^{L} a^{L-1}$ and $a^{L-1} = g(W^{L-1} a^{L-2})$:
$$\hat{y} = W^{L} \times g(W^{L-1} \times a^{L-2})$$

Since the NN has layers $\in [1, L]$:
$$\hat{y} = W^{L} W^{L-1} \ldots W^{2} W^{1} X$$

- Now for a Deep NN, L is a large value, so $\prod_{i=1}^{L} W^{i}$ becomes a very large value when the weights are larger than 1 in magnitude
- Therefore the gradient values become exponentially high
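
Both issues can be reproduced on a tiny network. This sketch makes several assumptions for the sake of a compact demo: a 3-4-1 regression network with mean squared error, random data, tanh for the constant-initialization case, and scaled identity matrices standing in for the layer weights in the product $W^{L} \ldots W^{1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

def hidden_gradients(W1, b1, W2, b2, act, act_grad):
    """One forward/backward pass; returns hidden activations and weight gradients."""
    Z1 = X @ W1 + b1
    A1 = act(Z1)
    y_hat = A1 @ W2 + b2
    d_out = 2 * (y_hat - y) / len(X)            # gradient of mean squared error
    dW2 = A1.T @ d_out
    dW1 = X.T @ ((d_out @ W2.T) * act_grad(Z1))
    return A1, dW1, dW2

relu = lambda z: np.maximum(z, 0.0)
relu_grad = lambda z: (z > 0).astype(float)
tanh_grad = lambda z: 1.0 - np.tanh(z) ** 2

# Case 1: W = 0, b = 0 with ReLU -> every weight gradient is zero
_, dW1_zero, dW2_zero = hidden_gradients(
    np.zeros((3, 4)), np.zeros((1, 4)), np.zeros((4, 1)), np.zeros((1, 1)),
    relu, relu_grad)

# Case 2: W = k, b = k -> all hidden neurons are copies of each other
k = 0.5
A1_const, dW1_const, _ = hidden_gradients(
    np.full((3, 4), k), np.full((1, 4), k), np.full((4, 1), k), np.full((1, 1), k),
    np.tanh, tanh_grad)
same_activations = np.allclose(A1_const, A1_const[:, [0]])
same_gradients = np.allclose(dW1_const, dW1_const[:, [0]])

# Case 3: product of L identical linear layers grows or shrinks exponentially
def product_norm(scale, L, width=4):
    P = np.eye(width)
    for _ in range(L):
        P = (scale * np.eye(width)) @ P
    return np.linalg.norm(P)

grows = product_norm(1.5, 50)
shrinks = product_norm(0.5, 50)
print(same_activations, same_gradients, grows, shrinks)
```

Because identical neurons receive identical gradients, they stay identical after every update, which is why the initial symmetry has to be broken by random initialization.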

How to avoid these issues ?


Weight Initialization strategies

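1. Uniform Distribution: We initialize the weights as:

$$w^{k}_{ij} \sim uniform\left[\frac{-1}{\sqrt{fan_{in}}}, \frac{1}{\sqrt{fan_{out}}}\right]$$

- Where $fan_{in}$ is the number of inputs to a neuron while $fan_{out}$ is the number of outputs of
the neuron

2. Glorot/Xavier init:
a. Normal Distribution
$$w^{k}_{ij} \sim N(0, \sigma_{ij}), \quad where\ \sigma_{ij} = \sqrt{\frac{2}{fan_{in} + fan_{out}}}$$
b. Uniform Distribution
$$w^{k}_{ij} \sim uniform\left[\frac{-\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}}\right]$$

Note: Used when activation function is tanh

3. He Init:
a. Normal Distribution
$$w^{k}_{ij} \sim N(0, \sigma_{ij}), \quad where\ \sigma_{ij} = \sqrt{\frac{2}{fan_{in}}}$$
b. Uniform Distribution
$$w^{k}_{ij} \sim uniform\left[\frac{-\sqrt{6}}{\sqrt{fan_{in}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}}}\right]$$

Note: Used when activation function is ReLU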
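
The Glorot and He formulas can be sketched directly with NumPy's random generator. These are plain illustrations of the formulas, not the Keras initializers themselves (whose internals may differ, for example in how the normal variants are sampled); the layer size 300 x 200 is arbitrary.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 300, 200
W_gu = glorot_uniform(fan_in, fan_out, rng)
W_hn = he_normal(fan_in, fan_out, rng)

# A uniform on [-l, l] has variance l^2 / 3, so both Glorot variants share one std
print(W_gu.std(), np.sqrt(2.0 / (fan_in + fan_out)))
print(W_hn.std(), np.sqrt(2.0 / fan_in))
```

The uniform and normal variants of each scheme are designed to give the same weight variance; only the shape of the distribution differs.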

Why need to initialize the weight based on input and output of the neuron ?

The derivative of $Z^{L}_{1}$ w.r.t $w^{L}_{11}$ is defined as:
$$\frac{\partial Z^{L}_{1}}{\partial w^{L}_{11}} = a^{L-1}_{1} = activation(Z^{L-1}_{1})$$

And observe that $Z^{L-1}_{1}$ is nothing but a function of the weights $[W^{1}, W^{2}, \ldots, W^{L-1}]$
- Hence if $Z^{L-1}_{1}$ has a greater number of inputs, more weighted terms are summed into it, so each weight value influences it drastically, leading to exploding gradients

Optimizer

Why does SGD and Mini Batch Gradient Descent (GD), take so many epochs while
training Deep NN ?

Ans: Mini-batch GD takes steps ($V$) that are noisy, so the GD tends to:
- Move in directions that do not point straight at the minima
- Hence due to all these noisy steps, the GD takes so many epochs

Why does mini-Batch GD have noisy steps ?

Ans: Because the training data is divided into batches
- For some batches the model has a very small loss
- while for a few batches the loss is quite high
- making the gradients of the weights fluctuate
How to reduce these noisy steps ?

Ans: By taking a weighted average (with weight β) of the previous optimizer step ($V_{t-1}$) and
the current gradient ($\Delta w_t$), hence:
$$V_t = \beta V_{t-1} + (1 - \beta) \Delta w_t$$

Note: $t$ denotes the $t^{th}$ iteration, where 1 iteration = Forward + Backward Propagation
With the weightage (β) being introduced,
- The direction of $V_3$ will tend to be influenced by all previous gradients $\Delta W_1$, $\Delta W_2$ along
with the current gradient $\Delta W_3$
- Thus making the optimizer take a step in a direction that avoids the noisy
step
- This is known as Exponential Moving Average

This idea of Exponential Moving Average can be considered as a ball moving down a hill thus:
- β → Friction
- $V_{t-1}$ → Velocity/Momentum
- $\Delta w_t$ → Acceleration

How does Gradient Descent implement Exponential Moving Average ?


Ans: for some iteration t and layer $k$ of the NN :
- We find the gradients $dw^{k}$, $db^{k}$

the exponential moving average is introduced as:

$$V_{dw^{k}} = \beta \times V_{dw^{k}} + (1 - \beta) \times dw^{k}$$

Similarly:
$$V_{db^{k}} = \beta \times V_{db^{k}} + (1 - \beta) \times db^{k}$$

Hence Weight updation with learning rate α becomes:

$$w^{k} = w^{k} - \alpha \times V_{dw^{k}}$$
$$b^{k} = b^{k} - \alpha \times V_{db^{k}}$$
Note: This Optimizer is called Gradient Descent with Momentum
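
A minimal sketch of this update on a toy problem. The objective $f(w) = w^2$, the Gaussian noise added to its gradient (to imitate mini-batch noise), and the values of the learning rate and β are all assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """V = beta * V + (1 - beta) * grad, then w = w - lr * V."""
    v = beta * v + (1.0 - beta) * grad
    w = w - lr * v
    return w, v

rng = np.random.default_rng(0)
w, v = 5.0, 0.0
raw_grads, smoothed = [], []
for t in range(200):
    grad = 2.0 * w + rng.normal(scale=1.0)   # noisy gradient of f(w) = w^2
    w, v = momentum_step(w, v, grad)
    raw_grads.append(grad)
    smoothed.append(v)

# After the initial descent, the averaged step fluctuates far less than the raw gradient
print(w, np.std(raw_grads[50:]), np.std(smoothed[50:]))
```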

How to further reduce the oscillations of Gradient Descent ?

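Ans: The optimizer tends to oscillate when the gradient of one parameter is much greater
than the other
- Meaning Δb >>> Δw

Hence, to damp the movement in that direction, we keep a moving average of the squared gradients:

$$V_{dw^{k}} = \beta V_{dw^{k}} + (1 - \beta)(dw^{k})^{2}$$
$$V_{db^{k}} = \beta V_{db^{k}} + (1 - \beta)(db^{k})^{2}$$

And weight updation becomes:

$$w^{k} = w^{k} - \alpha \times \frac{dw^{k}}{\sqrt{V_{dw^{k}}} + \epsilon}$$
$$b^{k} = b^{k} - \alpha \times \frac{db^{k}}{\sqrt{V_{db^{k}}} + \epsilon}$$
where $\epsilon$ is a very small value $= 10^{-8}$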

How is squaring useful ?

Ans: When the gradient along the direction in which the optimizer oscillates is higher, then:
- the square of the gradient will be much higher
- thus making $V_{db^{k}} > V_{dw^{k}}$ and $\frac{1}{\sqrt{V_{db^{k}}}} < \frac{1}{\sqrt{V_{dw^{k}}}}$
- Therefore after weight updation, $w^{k}$ reaches its optimal value faster

Note: This approach is known as RMSprop
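
The equalizing effect of dividing by the root of the squared-gradient average can be seen in a single step. The objective below, $f(w, b) = \frac{1}{2}(w^2 + 100 b^2)$, is an assumed toy bowl in which the gradient along b is 100 times the gradient along w.

```python
import numpy as np

def rmsprop_step(param, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """Moving average of squared gradients, then a normalized update."""
    s = beta * s + (1.0 - beta) * grad ** 2
    param = param - lr * grad / (np.sqrt(s) + eps)
    return param, s

params = np.array([1.0, 1.0])                          # [w, b]
s = np.zeros(2)
grad = np.array([params[0], 100.0 * params[1]])        # [dw, db], db is 100x dw

new_params, s = rmsprop_step(params, s, grad)
step = params - new_params                             # size of the move per parameter
plain_gd_step = 0.01 * grad                            # what vanilla GD would do

print(step, plain_gd_step)
```

Plain gradient descent would move 100 times further along b than along w, while the RMSprop step is the same size in both directions, which is what damps the oscillation.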

Is there a way to combine both RMSprop’s decreased oscillation and Momentum fast
convergence ?

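Ans: This is done by Adam which defines

- Momentum:
$$V_{dw^{k}} = \beta_1 V_{dw^{k}} + (1 - \beta_1) dw^{k}$$
$$V_{db^{k}} = \beta_1 V_{db^{k}} + (1 - \beta_1) db^{k}$$

- RMSprop:
$$S_{dw^{k}} = \beta_2 S_{dw^{k}} + (1 - \beta_2)(dw^{k})^{2}$$
$$S_{db^{k}} = \beta_2 S_{db^{k}} + (1 - \beta_2)(db^{k})^{2}$$

Now both in RMSprop and Momentum, the initial averaged out values are biased,
- so to kickstart the algorithm, bias correction is done such that:

$$V^{Corrected}_{dw^{k}} = \frac{V_{dw^{k}}}{1 - \beta_1^{t}}$$
$$V^{Corrected}_{db^{k}} = \frac{V_{db^{k}}}{1 - \beta_1^{t}}$$
$$S^{Corrected}_{dw^{k}} = \frac{S_{dw^{k}}}{1 - \beta_2^{t}}$$
$$S^{Corrected}_{db^{k}} = \frac{S_{db^{k}}}{1 - \beta_2^{t}}$$

Therefore Weight updation becomes:

$$w^{k} = w^{k} - \alpha \times \frac{V^{Corrected}_{dw^{k}}}{\sqrt{S^{Corrected}_{dw^{k}}} + \epsilon}$$
$$b^{k} = b^{k} - \alpha \times \frac{V^{Corrected}_{db^{k}}}{\sqrt{S^{Corrected}_{db^{k}}} + \epsilon}$$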
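
The Adam equations combine into a short update function. This sketch assumes a toy objective $f(w) = w^2$, a learning rate of 0.05 and the commonly used defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$; none of these values come from the notes.

```python
import numpy as np

def adam_step(param, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1.0 - beta1) * grad           # momentum term
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # RMSprop term
    v_hat = v / (1.0 - beta1 ** t)                 # bias correction, t starts at 1
    s_hat = s / (1.0 - beta2 ** t)
    param = param - lr * v_hat / (np.sqrt(s_hat) + eps)
    return param, v, s

w = np.array([5.0])
v = np.zeros(1)
s = np.zeros(1)
first_step_size = None
for t in range(1, 2001):
    grad = 2.0 * w                                  # gradient of f(w) = w^2
    w_new, v, s = adam_step(w, v, s, grad, t, lr=0.05)
    if t == 1:
        first_step_size = abs(w[0] - w_new[0])
    w = w_new

print(first_step_size, w)
```

Thanks to the bias correction, the very first step already has size close to the learning rate, instead of being shrunk by the zero-initialized averages.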
Hyperparameter tuning

How to reduce the overfitting of NN and improve performance?

1. Regularization: L2 regularization cannot be applied in a NN the way it is for a single weight vector
- because the weights form L layered matrices $W = [W^{1}, W^{2}, W^{3}, \ldots, W^{L}]$
Hence the Frobenius norm of each weight matrix is used:
$$Reg = \frac{\lambda}{2n} \sum_{k=1}^{L} ||W^{k}||_F^{2}$$

Where $n$ is the number of samples and $k$ is the current layer

We define $||W^{k}||_F^{2}$ as:
$$||W^{k}||_F^{2} = \sum_{i=1}^{n^{k-1}} \sum_{j=1}^{n^{k}} (w^{k}_{ij})^{2}$$

Where $n^{k}$ is the number of neurons in the current layer $k$ and $n^{k-1}$ is the number of
neurons in the previous layer $k - 1$

How are the weights updated ?


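Ans: Loss is defined as
$$Loss(L) = \frac{1}{2n} \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \frac{\lambda}{2n} \sum_{k=1}^{L} ||W^{k}||_F^{2}$$
Hence Gradient becomes:
$$\frac{dL}{dW^{k}} = (From\ Backprop) + \frac{\lambda}{n} W^{k}$$

Hence weight Updation becomes:

$$w^{k} = w^{k} - \alpha (From\ Backprop) - \alpha \frac{\lambda}{n} w^{k}$$
$$w^{k} = \left(1 - \alpha \frac{\lambda}{n}\right) w^{k} - \alpha (From\ Backprop)$$
Note: the extra $\left(1 - \alpha \frac{\lambda}{n}\right)$ is known as weight decay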
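
The equivalence between adding the regularization gradient and shrinking the weights can be checked numerically. In this sketch `backprop_grad` is a random matrix standing in for the data-loss gradient, and the values of α, λ and n are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                  # weights of one layer
backprop_grad = rng.normal(size=(4, 3))      # stand-in for the gradient from backprop
alpha, lam, n = 0.1, 0.5, 32                 # learning rate, lambda, number of samples

# Squared Frobenius norm = sum of all squared entries
frobenius_sq = np.sum(W ** 2)
reg = lam / (2 * n) * frobenius_sq

# Form 1: add (lambda / n) * W to the gradient, then take a normal step
grad = backprop_grad + (lam / n) * W
W_form1 = W - alpha * grad

# Form 2: shrink the weights first (weight decay), then apply the backprop gradient
W_form2 = (1 - alpha * lam / n) * W - alpha * backprop_grad

print(reg, np.allclose(W_form1, W_form2))
```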
2. Dropout: Regularizes the NN by :
- Randomly dropping units (and the edges attached to them) of the NN during training
- By creating a mask ($d^{k}$) through a random probability matrix ( $P(W^{k})$ ) for the $k^{th}$
layer such that:
$Mask(d^{k}) = 1$ if $P(W^{k}) >$ dropout rate ($r$)
else $Mask(d^{k}) = 0$

- During test time, all the weights are scaled by a factor of $p = 1 - r$ (the keep probability)

What is the need for scaling the weights during test time ?

Ans: During training, only a fraction $p$ of the units is active on average
- So each layer sees inputs that are, in expectation, $p$ times as large as they would be with every unit active
- Hence at test time, when nothing is dropped, scaling by $p$ keeps the expected activations the same as in training

Note: During Test time, no Dropout takes place
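
A sketch of the masking step. Note that it uses the "inverted dropout" variant, an assumption made here because it is the form most libraries implement: the kept activations are divided by $p = 1 - r$ during training, which has the same effect in expectation as scaling by $p$ at test time, and leaves the test-time forward pass untouched.

```python
import numpy as np

def dropout_forward(a, rate, rng, training=True):
    """Inverted dropout on activations a with dropout rate r."""
    if not training or rate == 0.0:
        return a                                        # no dropout at test time
    keep_prob = 1.0 - rate
    mask = (rng.random(a.shape) > rate).astype(a.dtype) # 1 = keep, 0 = drop
    return a * mask / keep_prob                         # rescale the kept units

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                                # dummy activations
train_out = dropout_forward(a, rate=0.3, rng=rng, training=True)
test_out = dropout_forward(a, rate=0.3, rng=rng, training=False)

dropped_fraction = (train_out == 0).mean()
print(dropped_fraction, train_out.mean(), test_out.mean())
```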

3. Batch normalization: Standardizing the input is one of the important steps for reaching
the global minima
- But after activation functions, weight multiplications and biases are applied, the
inputs to the hidden layers tend to have different distributions
- These changed distributions get amplified as we go down the layers of the NN
- This is known as Internal Covariate Shift

Hence data standardization is performed as :

$$mean\ (\mu) = \frac{1}{m} \sum_{i=1}^{m} z_i$$
$$Variance\ (\sigma^{2}) = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu)^{2}$$
$$Z_{norm} = \frac{z_i - \mu}{\sqrt{\sigma^{2} + \epsilon}}$$
where $m$ is the number of examples in the mini-batch (the statistics are computed per neuron, across the batch) and $\epsilon = e^{-10}$ is a small constant for numerical stability

Is having normal distribution for all layers a good thing ?

Ans: No, since two layers having the exact same mean and variance makes the 2nd layer
redundant, therefore:
$$\hat{Z} = \gamma \times Z_{norm} + \beta$$

Where $\gamma$, $\beta$ become two learnable parameters
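
The normalization and the learnable rescaling can be sketched as one training-mode forward pass. Assumptions in this illustration: statistics are taken over the batch axis for each neuron, `eps=1e-5` is used as the small constant, and the running averages that a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Standardize each column (neuron) over the batch, then scale and shift."""
    mu = Z.mean(axis=0, keepdims=True)
    var = Z.var(axis=0, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_hat = gamma * Z_norm + beta
    return Z_hat, Z_norm

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # 64 examples, 4 neurons
gamma = np.full((1, 4), 2.0)                       # learnable scale (fixed here)
beta = np.full((1, 4), 0.5)                        # learnable shift (fixed here)

Z_hat, Z_norm = batch_norm_forward(Z, gamma, beta)
print(Z_norm.mean(axis=0), Z_norm.std(axis=0))
print(Z_hat.mean(axis=0), Z_hat.std(axis=0))
```

After normalization every neuron has mean 0 and standard deviation 1; γ and β then let the network move each neuron to whatever mean and spread training finds useful.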


4. Early Stopping: Sometimes the NN's validation performance improves up to a certain epoch and
degrades on later training epochs
- So in order to prevent the model from updating weights on the later training epochs
- The weights of the model with the best validation score are stored using the
ModelCheckpoint callback
- After that, the model is allowed to run for a set number of further epochs without improvement, after which training stops
- This stopping of training is done through another callback,
EarlyStopping

5. LearningRateDecay: Sometimes the model keeps bouncing around the global minima without settling,
- due to a high learning rate
Or it takes a lot of epochs to reach the global minima
- when the learning rate is quite small

Hence to make the NN train faster with high accuracy,
- An adaptive Learning rate is used such that
- The learning rate is reduced gradually over epochs
- So the NN first quickly reaches around the global minima
- Then converges to global minima with a smaller learning rate
- This is implemented using a callback LearningRateScheduler
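
One possible decay rule is sketched below. The function name `step_decay` and its parameters (halve the rate every 10 epochs, starting from 0.1) are hypothetical choices for illustration; the notes do not prescribe a particular schedule, and the wiring into a Keras callback is not shown here.

```python
def step_decay(epoch, initial_lr=0.1, decay_rate=0.5, decay_every=10):
    """Multiply the learning rate by decay_rate once every decay_every epochs."""
    return initial_lr * decay_rate ** (epoch // decay_every)

schedule = [step_decay(epoch) for epoch in range(30)]

# Large steps early on, smaller steps later for the final convergence
print(schedule[0], schedule[10], schedule[20])
```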

Practical Aspects

What are all the hyperparameters for NN ?


1. Number of epochs
2. NN depth and complexity
3. Batch Normalization
4. Learning rate
5. Regularization
6. Dropout
7. Choosing optimizer
8. Collecting more data

Note: Collecting data should be the last resort, since


- Data in general is hard to get
- Collecting data is time intensive
- Does not guarantee an improved performance of NN

What order should the major hyperparameters be tuned for NN ?


1. Learning Rate
2. β value of GD with Momentum
3. Number of Hidden units/ Neurons
4. Batch size
5. Number of layers of NN
6. Learning Rate Decay
7. β1, β2, ϵ of Adam

With such a variety of hyperparameters to experiment with,


- it becomes very important to know which hyperparameter to tune to enhance model
performance
- This is known as Orthogonalization of NN

What to tweak if NN has a bad training performance ?


Ans: Clearly NN underfits hence:
● More epochs
● Deeper and Complex NN
● Different Optimizer
● More data

What to tweak if NN has good training performance but bad validation performance ?
Ans: Clearly NN overfits hence:
● Use simple NN
● Regularization
● Dropout
● Batch Normalization
● Diverse training samples

What to tweak if NN has bad testing performance but good training and validation
performance ?
Ans: Though it's not a good practice to tune the NN for test data, some tweaks are:
● Changing loss function
● More Validation data

How to find the correct value for the hyperparameters ?


Ans: Perform Random Search and get some hyperparameter value ranges
- Followed by Grid Search to have more accurate results
Before Hyperparameter tuning, it's very important to have an error analysis done of the model
- Hence it's a good practice to know the human performance on the task
- Along with the best attainable performance, known as Bayes Optimal error

Why need Human performance ?

Ans: Helps in identifying whether the model has
- a high bias and low variance
- or a low bias and high variance

Note: The difference/gap between Human and Training error is called Avoidable Bias

Autoencoders

What are autoencoders (AE)?


- Autoencoders are a self-supervised learning method where the target output ($\hat{x}_i$) is the same
as the input ($x_i$)
- The network comprises 3 parts:
- Encoder
- Bottleneck layer => used as encoding/embedding
- Decoder

- The encoder part compresses the input to a lower dimensionality
embedding/encoding
- The decoder produces output by expanding this lower dimensionality
embedding and tries to reconstruct input from it.
- Goal: $\hat{x}_i \approx x_i$

Do note: It is also called unsupervised learning as they don’t need labels to train on.

Applications

● Dimensionality Reduction
○ We can use AE to reduce the dimensionality of the data.
○ Do note that
■ The dim. Reduction is data specific
● For example: If the AE has been trained on handwritten
digits, we can’t expect it to compress cats and dogs
images.
● It’ll be able to meaningfully compress data similar to what it
has been trained on
■ The output of the decoder will not be exactly the same as input i.e.
there’ll be loss of information.
○ Code: Link

● Denoising AE
○ In order to make sure that AE doesn’t overfit i.e. it doesn’t simply learn to
copy input to output
■ We add random noise to that data

What happens when we add random noise?


● If AE recreates the noisy input, it means it has overfitted
● Think of it as regularizing the data
○ As there is no pattern to noise, network shouldn’t recreate it
● Code: Link

● Recommender Sys using AE


○ We can use AE to generate embeddings to find similar items (i.e. as a
Recommender System)
○ In order to do so
■ We feed sparse data as input to the network
■ Learn the dense embeddings
■ Find similar items using dense embeddings
○ For example:
■ Find similar movies for a given user-item interaction matrix
● We feed movies vector (item vector) as input to AE
● The network learns the dense embeddings.
● Using cosine similarity on these embeddings, we find
similar movies (higher the score, more similar the movie ).
○ Code: Link
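
The last step of the recommender example, finding similar items from dense embeddings, can be sketched with cosine similarity in NumPy. The three 2-dimensional `embeddings` below are hand-written placeholders for what an encoder would produce; no autoencoder is trained here.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    E_unit = E / np.clip(norms, 1e-12, None)       # guard against zero vectors
    return E_unit @ E_unit.T

# Placeholder embeddings for 3 movies (assumed values, for illustration)
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0]])

sim = cosine_similarity_matrix(embeddings)
np.fill_diagonal(sim, -np.inf)                     # ignore each movie's match with itself
most_similar = sim.argmax(axis=1)                  # higher score = more similar

print(most_similar)
```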

Is it necessary for the encoder and decoder to be symmetric ?


● Not necessarily.
● Earlier we used to keep them symmetric i.e. the $k^{th}$ and $(n - k)^{th}$ layers will have the same
number of neurons
Why did we keep the network symmetric ?
● It was because of weights sharing (weights tying)
● Weights were shared between encoder and decoder
● So as to reduce the number of parameters.
