NN_Notes
● NN-1: Intro to NN
What are the issues with classic ML algorithms?
- They require manual feature engineering to create complex decision boundaries.
- They might not always work well with big datasets, and we are living in a big-data regime.
- They work poorly with sparse data and with unstructured data (image/text/speech data).
Clearly, all these algos have some limitations which call for a more powerful ML model
that can work in the conditions mentioned above. This brings us to Neural Networks (NN).
What if the model looked the same, but the activation function were different? Would it still be called the same NN?
If we replace the activation function with a hinge function, we get a Linear SVM model.
Forward propagation here would look like:-
- z = w1x1 + w2x2 + … + wdxd + b
- ŷ = f_hinge(z)
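A minimal NumPy sketch of this single-neuron forward pass (the feature values, weights, bias, and label below are made up purely for illustration):

import numpy as np

x = np.array([0.5, -1.2, 3.0])     # one datapoint with d = 3 features (made-up)
w = np.array([0.4, 0.1, -0.2])     # weights w1..wd (made-up)
b = 0.3                            # bias
y = 1                              # true label in {-1, +1}

z = np.dot(w, x) + b               # z = w1x1 + w2x2 + ... + wdxd + b
y_hat = np.sign(z)                 # SVM-style prediction from the raw score
hinge_loss = max(0.0, 1 - y * z)   # hinge loss used to train the linear SVM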
Note:
- A single neuron is able to represent such powerful models, just imagine what will
happen if we use multiple neurons.
- Those models would be able to represent really complex relations.
Challenges:
- As we tried to increase the complexity of NNs, they performed poorly.
- There was no optimal way of training them.
- This put the whole area in a deep freeze for a decade or so.
- A breakthrough came in 1986 when backpropagation was introduced by Geoff Hinton and his team.
- This also created a lot of hype around artificial intelligence, but all of that hype died down by the 1990s.
- There were two main bottlenecks:-
  - We didn't have the computational power.
  - Nor did we have enough data to train on.
- Backprop was failing for NNs with more depth.
- This time period is called the AI winter; funding for AI dried up by 1995.
- Meanwhile, Geoff Hinton continued his research and finally came up with a solution to this problem in 2006.
- The discovery of using ReLU and Leaky ReLU as activation functions was another breakthrough.
How will we calculate the loss for this multi-class classification NN model?
We use Categorical Cross Entropy. For a single datapoint i, averaged over all datapoints:
L = − Σ_{j=1..k} y_ij · log(p_ij)
where,
k -> number of classes
y_ij -> one-hot encoded label. Ex: [1,0,0], [0,1,0], or [0,0,1]
p_ij -> calculated probability of datapoint i belonging to class j
This can be seen as log loss extended to the multiclass setting. For k = 2, we recover the log loss formulation.
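A small NumPy sketch of this loss on made-up values (the label and probability arrays below are illustrative, not from the notes):

import numpy as np

y = np.array([[1, 0, 0],
              [0, 1, 0]])                          # one-hot labels for 2 datapoints, k = 3 classes
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])                    # predicted class probabilities (rows sum to 1)

loss = -np.mean(np.sum(y * np.log(p), axis=1))     # categorical cross entropy
print(loss)                                        # ~0.29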
- A = activation(Z)
- Calculate the Loss
- Repeat until the Loss converges:
  - update w_i = w_i − lr · (∂Loss/∂w_i)
  - calculate the output using the hypothesis and the updated params
  - calculate the Loss (a toy example of this loop follows below)
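A runnable toy example of this loop using a single linear neuron and squared-error loss on made-up data (not from the notes; it just shows the structure of the update rule):

import numpy as np

# toy data (made-up): 4 datapoints, 2 features, real-valued target y = 2*x1 + 1*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 2.0, 3.0, 0.0])

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(500):                     # "repeat until the loss converges" (fixed budget here)
    y_hat = X @ w + b                        # output from the hypothesis
    loss = np.mean((y_hat - y) ** 2)         # calculate the loss
    dw = 2 * X.T @ (y_hat - y) / len(y)      # dLoss/dw
    db = 2 * np.mean(y_hat - y)              # dLoss/db
    w -= lr * dw                             # w_i = w_i - lr * dLoss/dw_i
    b -= lr * db

print(w, b, loss)                            # w -> approximately [2, 1], b -> approximately 0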
Note:-
- Weights in layer-1 will be stored as a 2x4 matrix: W^1 (shape 2x4)
- Biases in layer-1 will be stored as a 1x4 matrix: b^1 (shape 1x4)
- Weights in layer-2 will be stored as a 4x3 matrix: W^2 (shape 4x3)
- Biases in layer-2 will be stored as a 1x3 matrix: b^2 (shape 1x3)
Forward propagation:-
Z^1_(m x h) = X_(m x d) . W^1_(d x h) + b^1_(1 x h)
A^1_(m x h) = f^1(Z^1_(m x h))
Z^2_(m x n) = A^1_(m x h) . W^2_(h x n) + b^2_(1 x n)
A^2_(m x n) = f^2(Z^2_(m x n))
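A minimal NumPy sketch of these two matrix products for the 2-4-3 network above (random inputs and parameters, just to show the shapes; ReLU and softmax are assumed choices for f^1 and f^2):

import numpy as np

m, d, h, n = 5, 2, 4, 3                        # 5 datapoints, 2 inputs, 4 hidden units, 3 outputs
rng = np.random.default_rng(0)

X  = rng.normal(size=(m, d))                   # X_(m x d)
W1 = rng.normal(size=(d, h)); b1 = np.zeros((1, h))   # W1_(2x4), b1_(1x4)
W2 = rng.normal(size=(h, n)); b2 = np.zeros((1, n))   # W2_(4x3), b2_(1x3)

Z1 = X @ W1 + b1                               # Z1 = X.W1 + b1   -> (m x h)
A1 = np.maximum(0, Z1)                         # A1 = f1(Z1), ReLU here
Z2 = A1 @ W2 + b2                              # Z2 = A1.W2 + b2  -> (m x n)
A2 = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)   # A2 = f2(Z2), softmax here

print(Z1.shape, A1.shape, Z2.shape, A2.shape)  # (5, 4) (5, 4) (5, 3) (5, 3)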
Gradient of w^2_11
We'll encounter a^3_1 and a^2_1 while going from the loss towards w^2_11.
Therefore the gradient is:
∂L/∂w^2_11 = (∂L/∂a^3_1) . (∂a^3_1/∂z^3_1) . (∂z^3_1/∂a^2_1) . (∂a^2_1/∂z^2_1) . (∂z^2_1/∂w^2_11)
Gradient of w^1_11
There are 2 possible paths to reach w^1_11:
Path-1 : L -> a^3_1 -> a^2_1 -> a^1_1 -> w^1_11
Path-2 : L -> a^3_1 -> a^2_2 -> a^1_1 -> w^1_11
Therefore the gradient is the sum over both paths:
∂L/∂w^1_11 = (∂L/∂a^3_1) . (∂a^3_1/∂z^3_1) . (∂z^3_1/∂a^2_1) . (∂a^2_1/∂z^2_1) . (∂z^2_1/∂a^1_1) . (∂a^1_1/∂z^1_1) . (∂z^1_1/∂w^1_11)
           + (∂L/∂a^3_1) . (∂a^3_1/∂z^3_1) . (∂z^3_1/∂a^2_2) . (∂a^2_2/∂z^2_2) . (∂z^2_2/∂a^1_1) . (∂a^1_1/∂z^1_1) . (∂z^1_1/∂w^1_11)
- Each ∂a/∂z factor is an activation derivative (at most 0.25 for a sigmoid), so the product of these terms inside the bracket will become very small.
- In fact, as the number of layers in the NN increases, this product becomes smaller and smaller (the vanishing gradient problem).
● Calculating dZ2
dZ^2 = ∂L/∂Z^2 = (∂L/∂A^2) . (∂A^2/∂Z^2)
dZ^2 = (∂L/∂p) . (∂p/∂Z^2) = p_i − I(i == k)
● Calculating dW2
dW^2 = ∂L/∂W^2 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂W^2)
dW^2 = dZ^2 . (∂Z^2/∂W^2)
dW^2 = dZ^2 . A^1
● Calculating db2
db^2 = ∂L/∂b^2 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂b^2)
db^2 = dZ^2 . (∂Z^2/∂b^2) = dZ^2
● Calculating dA1
dA^1 = ∂L/∂A^1 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂A^1)
dA^1 = dZ^2 . (∂Z^2/∂A^1) = dZ^2 . W^2
● Calculating dZ1
dZ^1 = ∂L/∂Z^1 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂A^1) . (∂A^1/∂Z^1)
dZ^1 = dA^1 . (∂A^1/∂Z^1)
● Calculating dW1
dW^1 = ∂L/∂W^1 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂A^1) . (∂A^1/∂Z^1) . (∂Z^1/∂W^1)
dW^1 = dZ^1 . (∂Z^1/∂W^1) = dZ^1 . X
● Calculating db1
db^1 = ∂L/∂b^1 = (∂L/∂A^2) . (∂A^2/∂Z^2) . (∂Z^2/∂A^1) . (∂A^1/∂Z^1) . (∂Z^1/∂b^1)
db^1 = dZ^1 . (∂Z^1/∂b^1) = dZ^1 . 1 = dZ^1
Summarizing forward and backward prop for MLP
While performing forward prop,
- we store/cache the values of Z^j, W^j, and b^j in order to use them during back prop (see the sketch below)
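A self-contained NumPy sketch of one forward + backward pass using these formulas. Softmax output with one-hot labels Y is assumed, ReLU is used for the hidden activation (so ∂A^1/∂Z^1 is the ReLU derivative), and the transposes are the batched matrix form of the equations above:

import numpy as np

rng = np.random.default_rng(0)
m, d, h, n = 5, 2, 4, 3
X  = rng.normal(size=(m, d))
Y  = np.eye(n)[rng.integers(0, n, size=m)]          # one-hot labels (made-up)
W1 = rng.normal(size=(d, h)); b1 = np.zeros((1, h))
W2 = rng.normal(size=(h, n)); b2 = np.zeros((1, n))

# forward prop (cache Z1, A1 for backprop)
Z1 = X @ W1 + b1
A1 = np.maximum(0, Z1)                              # ReLU
Z2 = A1 @ W2 + b2
A2 = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)   # softmax probabilities p

# backward prop, following the derivations above
dZ2 = (A2 - Y) / m                                  # dZ2 = p_i - I(i == k), averaged over the batch
dW2 = A1.T @ dZ2                                    # dW2 = dZ2 . A1 (matrix form)
db2 = dZ2.sum(axis=0, keepdims=True)                # db2 = dZ2 (summed over the batch)
dA1 = dZ2 @ W2.T                                    # dA1 = dZ2 . W2
dZ1 = dA1 * (Z1 > 0)                                # dZ1 = dA1 . dA1/dZ1 (ReLU derivative)
dW1 = X.T @ dZ1                                     # dW1 = dZ1 . X (matrix form)
db1 = dZ1.sum(axis=0, keepdims=True)                # db1 = dZ1 (summed over the batch)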
Tensorflow
TensorFlow is the premier open-source deep learning framework, developed and maintained by Google.
Keras
● Using TensorFlow directly can be challenging.
● In TensorFlow 2, Keras has become the default high-level API.
● There is no need to install Keras separately.
● The modern tf.keras API brings Keras's simplicity and ease of use to the TensorFlow project.
Why Keras?
● Easy to build and use even very large deep learning models
● Lightweight and quick
● Can support other backends besides TensorFlow, e.g. Theano
● Open source
dir()
It returns a list of the attributes and methods of any object, e.g.:
- dir(tf.keras)
- dir(tf.keras.activations)
- dir(tf.data)
How to import ?
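A typical set of imports used in the sketches that follow (a common convention, not prescribed by the notes):

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers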
model.add()
● Instead of passing the list of layers as an argument while creating a model instance, we
can use the add method.
model.summary()
Prints a summary of the model we have created.
Custom names
● Keras assigns layer names automatically.
● We can also give custom names to layers, as shown below.
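A minimal sketch (using the imports shown earlier) of building a Sequential model with add(), custom layer names, and summary(); the layer sizes and names are illustrative:

model = keras.Sequential(name="mlp_demo")                       # custom model name (illustrative)
model.add(keras.Input(shape=(2,)))                              # input with 2 features
model.add(layers.Dense(4, activation="relu", name="hidden_1"))  # custom layer name
model.add(layers.Dense(3, activation="softmax", name="output"))
model.summary()                                                 # prints layers, output shapes, #params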
Plotting model
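One way to visualize the architecture (requires pydot and graphviz to be installed):

keras.utils.plot_model(model, to_file="model.png", show_shapes=True)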
Weights and Bias Initializer
In Keras, the Dense layer lets us specify the weight (kernel) and bias initializers.
When compiling the model we pass:
- Loss
- Optimizer
- optionally, a list of metrics we might want to track during training, like accuracy (see the sketch below)
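A sketch combining both ideas: explicit initializers on a Dense layer, then compile() with a loss, an optimizer, and a metrics list. The specific choices below are just examples:

model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(4, activation="relu",
                 kernel_initializer="he_normal",       # weight initializer
                 bias_initializer="zeros"),            # bias initializer
    layers.Dense(3, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",         # loss
              optimizer="adam",                        # optimizer
              metrics=["accuracy"])                    # metrics tracked during training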
Epoch
● To avoid memory issues, data is passed in small batches instead of all at once.
● Each pass of a mini-batch is called an iteration.
● Each pass over the whole dataset is called an Epoch.
● One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.
Training : model.fit()
Training means updating the weights using the optimizer and loss function on the dataset.
model.fit(X_train, y_train)
Note: After training, the weight values change from their initialization and the biases are no longer zero.
History object
● model.fit returns a History object which contains a record of the NN training progress.
● The History object contains the loss and metric values recorded for each epoch.
● As an alternative to dir(), the __dict__ attribute can be used to retrieve all the keys associated with the object it is called on.
● The History object's dictionary contains another dictionary under the key "history", where the model has saved the loss and metric values for each epoch in separate lists (see the sketch below).
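A short sketch of fitting and inspecting the history, continuing from the compiled model above (the epoch count and validation split are illustrative, and X_train/y_train are assumed from the notes):

history = model.fit(X_train, y_train,
                    epochs=20, batch_size=32,
                    validation_split=0.2)      # also records val_loss / val_accuracy

print(history.history.keys())                  # e.g. dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
print(history.history["loss"])                 # one loss value per epoch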
● model.evaluate returns the loss value and metric values for the model.
● Weights/parameters are not updated during evaluation (and prediction), which means only a forward pass, no backward pass.
Predictions
pred = model.predict(X_test)
To know the class an observation belongs to from these 4 probability values:
● Find the index having the largest probability; that will be the predicted class.
● pred_class = np.argmax(pred, axis = 1)
Callbacks
A callback defines a set of functions which are executed at different stages of the training
procedure.
They can be used to view internal states of the model during training.
● For example, we may want to print loss, accuracy or lr every 2000th epoch.
Examples:
We have to pass a list of callback objects to the callbacks argument of the fit method.
Note: We can pass callback objects to evaluate and predict method as well.
The parent class tf.keras.callbacks.Callback provides various kinds of methods which we can override (see the example below):
● Global methods
called at the beginning/end of training
● Batch-level methods
called at the beginning/end of a batch
● Epoch-level methods
called at the beginning/end of an epoch
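A minimal custom callback that overrides an epoch-level method; the print interval is an arbitrary choice for illustration:

class EpochLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):      # epoch-level method
        if (epoch + 1) % 5 == 0:                   # print every 5th epoch (arbitrary interval)
            print(f"epoch {epoch + 1}: {logs}")    # logs holds the loss and any compiled metrics

model.fit(X_train, y_train, epochs=20, callbacks=[EpochLogger()])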
Tensorboard
Ways to install: TensorBoard ships with TensorFlow; it can also be installed separately with pip install tensorboard.
%load_ext tensorboard
● TensorBoard reads from the log directory in order to display the various visualizations.
If we want to reload the TensorBoard extension, we can use the reload magic command:
%reload_ext tensorboard
Import
Callback arguments:
● log_dir - (Path)
○ directory where the logs will be saved
○ This directory should not be reused by any other callbacks.
● update_freq - (int/str)
○ how frequently losses/metrics are written.
○ when set to 'batch', losses/metrics are written after every batch/iteration
○ when set to an integer N, they are written after every N batches
○ when set to 'epoch', they are written after every epoch
● histogram_freq - (int)
○ how frequently (in epochs) histograms (distribution of weights) are computed.
○ Setting this to 0 means histograms will not be computed.
● write_graph - (Bool), True if we want to visualize the model graph in TensorBoard
● write_images - (Bool), True if we want to visualize the model weights as images
%tensorboard --logdir={log_folder}
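Putting it together (log_folder is an assumed path for illustration):

log_folder = "logs/fit"                          # assumed log directory
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_folder,
    histogram_freq=1,                            # weight histograms every epoch
    update_freq="epoch",
    write_graph=True,
)
model.fit(X_train, y_train, epochs=20, callbacks=[tb_callback])
# then, in a notebook cell: %tensorboard --logdir={log_folder}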
Weight Initialization
When training a deep NN, the model tends to take an immense training time due to:
- the large number of layers in the NN
- the large number of neurons in the NN
This makes the parameters (weight matrices) of the NN large in size.
- For a deep NN, L (the number of layers) is large, which makes the product ∏_{i=1..L} W^i very large.
- Therefore the gradient values become exponentially large (exploding gradients).
In the formulas below, fan_in is the number of inputs to a neuron, while fan_out is the number of outputs of the neuron.
2. Glorot/Xavier init:
a. Normal Distribution
w^k_ij ~ N(0, σ^2), where σ^2 = 2 / (fan_in + fan_out)
b. Uniform Distribution
w^k_ij ~ Uniform[ -sqrt(6 / (fan_in + fan_out)), +sqrt(6 / (fan_in + fan_out)) ]
3. He Init:
a. Normal Distribution
w^k_ij ~ N(0, σ^2), where σ^2 = 2 / fan_in
b. Uniform Distribution
w^k_ij ~ Uniform[ -sqrt(6 / fan_in), +sqrt(6 / fan_in) ]
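In Keras these correspond to built-in initializer strings, e.g. (a sketch, the layer sizes are illustrative):

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="tanh", kernel_initializer="glorot_normal"),  # Glorot/Xavier normal
    layers.Dense(64, activation="relu", kernel_initializer="he_uniform"),     # He uniform
    layers.Dense(3, activation="softmax"),
])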
Why do we need to initialize the weights based on the number of inputs and outputs of the neuron?
The derivative of Z^L_1 w.r.t. w^L_11 is defined as:
∂Z^L_1 / ∂w^L_11 = a^(L-1)_1 = activation(Z^(L-1)_1)
And observe that Z^(L-1)_1 is nothing but a function of the weights [W^1, W^2, ..., W^(L-1)].
- Hence if Z^(L-1)_1 has a greater number of inputs, its value (and therefore the gradient of each weight) is influenced drastically, leading to exploding gradients.
Optimizer
Why do SGD and Mini-Batch Gradient Descent (GD) take so many epochs while training a deep NN?
Ans: Because the mini-batch gradients are noisy, the optimizer zig-zags instead of heading straight to the minimum. Momentum fixes this by taking a weighted average (β) of the previous optimizer step (V_(t-1)) along with the current gradient (Δw_t):
V_t = β V_(t-1) + (1 − β) Δw_t
Note: t denotes the t-th iteration, where 1 iteration = Forward + Backward Propagation.
With the weightage (β) introduced,
- The direction of V_3 will be influenced by all the previous gradients ΔW_1, ΔW_2 along with the current gradient ΔW_3.
- This makes the optimizer take a step in a direction that avoids the noisy step.
- This is known as an Exponential Moving Average.
This Exponential Moving Average can be thought of as a ball rolling down a hill, where:
- β → Friction
- V_(t-1) → Velocity/Momentum
- Δw_t → Acceleration
Similarly, for the bias (see the sketch below):
V_db = β V_db + (1 − β) db
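A tiny NumPy sketch of this exponential moving average for a single parameter (the gradient sequence is made up):

import numpy as np

beta = 0.9
v = 0.0                                        # V_0
grads = [1.0, 0.9, -0.2, 1.1, 0.8]             # made-up noisy gradients dw_t
for t, dw in enumerate(grads, start=1):
    v = beta * v + (1 - beta) * dw             # V_t = beta*V_(t-1) + (1-beta)*dw_t
    print(t, round(v, 3))                      # smoothed update direction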
RMSprop: the optimizer tends to oscillate when the gradient of one parameter is much greater than that of another,
- meaning Δb >>> Δw
RMSprop dampens these oscillations by dividing each update by the root of a running average of the squared gradients:
w = w − α · dw / (sqrt(V_dw) + ε)
b = b − α · db / (sqrt(V_db) + ε)
where ε is a very small value ≈ 10^-8
- Therefore after the weight update, w reaches its optimal value faster.
Is there a way to combine both RMSprop's decreased oscillation and Momentum's fast convergence?
Ans: Yes, this is what the Adam optimizer does.
- Momentum:
V_dw = β1 V_dw + (1 − β1) dw
V_db = β1 V_db + (1 − β1) db
- RMSprop:
S_dw = β2 S_dw + (1 − β2) (dw)^2
S_db = β2 S_db + (1 − β2) (db)^2
Now in both RMSprop and Momentum the initial averaged-out values are biased (towards zero),
- so to kickstart the algorithm a bias correction is done such that:
V_dw_corrected = V_dw / (1 − β1^t)
V_db_corrected = V_db / (1 − β1^t)
S_dw_corrected = S_dw / (1 − β2^t)
S_db_corrected = S_db / (1 − β2^t)
The parameters are then updated as:
w = w − α · V_dw_corrected / (sqrt(S_dw_corrected) + ε)
b = b − α · V_db_corrected / (sqrt(S_db_corrected) + ε)
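A NumPy sketch of repeated Adam updates for a single weight matrix, following these equations (the gradient dw here is a made-up placeholder that would normally come from backprop):

import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

rng = np.random.default_rng(0)
w    = rng.normal(size=(4, 3))         # weight matrix being trained
V_dw = np.zeros_like(w)                # Momentum accumulator
S_dw = np.zeros_like(w)                # RMSprop accumulator

for t in range(1, 101):                # 100 iterations
    dw = rng.normal(size=w.shape)      # placeholder gradient (would come from backprop)
    V_dw = beta1 * V_dw + (1 - beta1) * dw
    S_dw = beta2 * S_dw + (1 - beta2) * dw**2
    V_corr = V_dw / (1 - beta1**t)     # bias correction
    S_corr = S_dw / (1 - beta2**t)
    w -= alpha * V_corr / (np.sqrt(S_corr) + eps)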
Hyperparameter tuning
1. L2 Regularization:
where n^k is the number of neurons in the current layer k and n^(k-1) is the number of neurons in the previous layer k-1,
w^k = w^k − α·(From Backprop) − α·(λ/n^k)·w^k
w^k = (1 − α·λ/n^k)·w^k − α·(From Backprop)
Note: the extra (1 − α·λ/n^k) factor is known as weight decay.
2. Dropout: Regularizes the NN by:
- Dropping weights (edges) of the NN
- Creating a mask (d^k) from a random probability matrix (P(W^k)) for the k-th layer such that:
Mask (d^k) = 1 if P(W^k) > dropout rate (r)
else Mask (d^k) = 0
3. Batch normalization: Standardizing the input is one of the important steps for reaching the global minimum.
- But after the weight multiplications, bias additions, and activation functions, the inputs to the hidden layers tend to have different distributions.
- These changed distributions get amplified as we go deeper into the NN.
- This is known as Internal Covariate Shift.
Should every layer then be normalized to exactly the same mean and variance?
Ans: No, since two layers having the exact same mean and variance would make the 2nd layer redundant. Therefore a learnable scale and shift are applied:
Ẑ = γ × Z_norm + β
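A sketch of how these three ideas appear in a Keras model; the layer sizes, dropout rate, and λ below are illustrative:

from tensorflow.keras import regularizers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization (weight decay)
    layers.BatchNormalization(),                              # learns gamma and beta per unit
    layers.Dropout(0.3),                                      # drops 30% of activations during training
    layers.Dense(3, activation="softmax"),
])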
Practical Aspects
What to tweak if the NN has good training performance but bad validation performance?
Ans: Clearly the NN overfits, hence:
● Use a simpler NN
● Regularization
● Dropout
● Batch Normalization
● More diverse training samples
What to tweak if the NN has bad testing performance but good training and validation performance?
Ans: Though it's not good practice to tune the NN for test data, some tweaks are:
● Changing the loss function
● More validation data
Note: The difference/gap between the Human and Training error is called the Avoidable Bias.
Autoencoders
Do note: Autoencoders are also considered unsupervised learning as they don't need labels to train on.
Applications
● Dimensionality Reduction
○ We can use AE to reduce the dimensionality of the data.
○ Do note that
■ The dimensionality reduction is data-specific
● For example: if the AE has been trained on handwritten digits, we can't expect it to compress cat and dog images.
● It will only be able to meaningfully compress data similar to what it has been trained on.
■ The output of the decoder will not be exactly the same as the input, i.e. there will be some loss of information.
○ Code: Link
● Denoising AE
○ In order to make sure that the AE doesn't overfit, i.e. it doesn't simply learn to copy the input to the output,
■ we add random noise to the input data and train the AE to reconstruct the clean input (see the sketch below).
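A minimal Keras sketch of an autoencoder for dimensionality reduction; the input size, bottleneck size, and optimizer are illustrative, and for a denoising AE you would fit on a noisy copy of X_train as the input with the clean X_train as the target:

input_dim, code_dim = 784, 32                                        # e.g. flattened 28x28 images -> 32-d code

inputs  = keras.Input(shape=(input_dim,))
encoded = layers.Dense(code_dim, activation="relu")(inputs)          # encoder / bottleneck
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)     # decoder

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=10)                       # input == target (no labels needed)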