
Unit 2

This document covers key concepts in neural networks, including activation functions (like Sigmoid, ReLU, and Softmax) and optimization techniques addressing errors, bias-variance trade-off, and overfitting-underfitting. It explains the necessity of activation functions for introducing non-linearity, the types of errors that can occur during training, and strategies to balance bias and variance for improved model performance. Additionally, it outlines methods to mitigate overfitting and underfitting in machine learning models.
Copyright
© © All Rights Reserved

UNIT-II

Topics:
Activation functions: Sigmoid, ReLU, Hyperbolic Functions, Softmax
Optimization: Types of errors, bias-variance trade-off, overfitting-underfitting,
Cross Validation, Feature Selection, Gradient Descent (GD), Momentum Based
GD, Stochastic GD, Regularization (dropout, DropConnect, batch
normalization), Hyperparameters.

Activation Functions
What is an Activation Function?
 An activation function in the context of neural networks is a
mathematical function applied to the output of a neuron. The purpose of
an activation function is to introduce non-linearity into the model,
allowing the network to learn and represent complex patterns in the data.
 Without non-linearity, a neural network would essentially behave like a
linear regression model, regardless of the number of layers it has.
 A neuron first computes the weighted sum of its inputs and adds a bias;
the activation function is then applied to this value and determines
whether, and how strongly, the neuron is activated.
 Each neuron in the network therefore works with its weights, its bias,
and its activation function.
 During training, the weights and biases of the neurons are updated on
the basis of the error at the output. This process is known as
back-propagation.
 Differentiable activation functions make back-propagation possible, since
their gradients are used together with the error to update the weights and
biases.
Elements of a Neural Network
 Input Layer: This layer accepts the input features. It brings information
from the outside world into the network. No computation is performed at
this layer; its nodes simply pass the information (features) on to the
hidden layer.
 Hidden Layer: Nodes of this layer are not exposed to the outer world;
they are part of the abstraction provided by any neural network. The
hidden layer performs all sorts of computation on the features entered
through the input layer and transfers the result to the output layer.
 Output Layer: This layer brings up the information learned by the
network to the outer world.

Need for an Activation function:


A neural network without an activation function is essentially just a linear
regression model. The activation function applies a non-linear transformation to
its input, making the network capable of learning and performing more complex tasks.
Mathematical Proof:
Suppose you have a neural net with two input features (i1 and i2), one hidden
layer, and an output layer.
Elements of the network are as follows:
Hidden layer i.e., layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
 z(1) is the vectorized output of layer 1
 W(1) is the vectorized weights assigned to the neurons of the
hidden layer, i.e. w1, w2, w3 and w4
 X is the vectorized input features, i.e. i1 and i2
 b(1) is the vectorized bias assigned to the neurons in the hidden
layer, i.e. b1 and b2
 a(1) is the activation of layer 1; with no non-linear activation it is
simply the linear function z(1).
Layer 2 i.e., Output Layer:
Note: Input for layer 2 is the output of layer 1.
z(2) = W(2)*a(1) + b(2)
a(2) = z(2)
Calculation at layer 2:
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b

Final output: z(2) = W*X + b

The result is again a linear function of the input, even after adding a
hidden layer. Hence we can conclude that no matter how many hidden
layers we attach to the neural net, the network as a whole behaves like a
single linear layer, because the composition of two linear functions is itself
a linear function.
A network cannot learn non-linear patterns with only linear functions attached
to its neurons. A non-linear activation function lets it learn such patterns
from the error. Hence, we need an activation function.
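The collapse of two linear layers into one can be checked numerically. The sketch below is only an illustration: the weight and bias values are made up, and the helper names (matvec, matmul, add) are not from the notes.

```python
# Minimal sketch: a 2-input, 2-hidden-neuron, 1-output network with
# identity ("linear") activations. All numeric values are illustrative.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1 = [[0.5, -1.0], [2.0, 0.25]]   # hidden-layer weights (w1..w4)
b1 = [0.1, -0.3]                  # hidden-layer biases (b1, b2)
W2 = [[1.5, -0.5]]                # output-layer weights
b2 = [0.2]                        # output-layer bias
X = [3.0, -2.0]                   # input features (i1, i2)

# Two-layer forward pass without any non-linearity.
a1 = add(matvec(W1, X), b1)
z2 = add(matvec(W2, a1), b2)

# Equivalent single layer: W = W(2) W(1), b = W(2) b(1) + b(2).
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
z_single = add(matvec(W, X), b)

print(z2, z_single)
```

Both forward passes give the same output, which is the point of the proof: without a non-linearity the hidden layer adds no expressive power.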
Variants of Activation Functions:
Linear Function:
 Equation: A linear function has the equation of a straight line,
i.e. y = x
 No matter how many layers we have, if all are linear in nature, the final
activation of the last layer is nothing but a linear function of the
input of the first layer.
 Range: -inf to +inf
 Uses: The linear activation function is typically used in just one place,
i.e. the output layer.
 Issues: The derivative of a linear function is a constant that does not
depend on the input “x”, so the gradient carries no information about the
input and the function adds no non-linearity to the
model.

For example: Calculation of the price of a house is a regression problem.
The house price may take any large or small value, so we can apply a linear
activation at the output layer. Even in this case, the neural net must have a
non-linear function in the hidden layers.
Sigmoid Function:

 It is a function which is plotted as an ‘S’ shaped graph.
 Equation: A = 1 / (1 + e^{-x})
 Nature: Non-linear. Notice that for X values between -2 and 2 the curve is
very steep. This means that small changes in X in this region bring about
large changes in the value of Y.
 Value Range: 0 to 1
 Uses: Usually used in the output layer of a binary classifier, where the
result is either 0 or 1. Since the value of the sigmoid function lies between
0 and 1, the result can be predicted to be 1 if the value is greater
than 0.5 and 0 otherwise.
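As a small sketch of the sigmoid and the 0.5 decision rule described above: the function names are illustrative, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x}), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # safe: x is negative, so e is in (0, 1)
    return e / (1.0 + e)

def predict_label(x, threshold=0.5):
    """Binary decision: 1 if sigmoid(x) exceeds the threshold, else 0."""
    return 1 if sigmoid(x) > threshold else 0

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), predict_label(x))
```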
Tanh Function:

 An activation that often works better than the sigmoid function in hidden
layers is the tanh function, also known as the hyperbolic tangent function.
It is actually a scaled and shifted version of the sigmoid function. Both are
similar and can be derived from each other.
 Equation: f(x) = tanh(x) = 2 / (1 + e^{-2x}) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1
 Value Range: -1 to +1
 Nature: non-linear
 Uses: Usually used in the hidden layers of a neural network. Because its
values lie between -1 and 1, the mean of the hidden layer's output comes out
to be 0 or very close to it, which helps in centering the data. This makes
learning for the next layer much easier.
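The relation between tanh and sigmoid can be checked against the standard library. This is a minimal sketch; tanh_via_sigmoid is an illustrative name, not an established API.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare with math.tanh on a few sample points.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x))
```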
ReLU Function:
 It stands for Rectified Linear Unit. It is the most widely used
activation function, chiefly implemented in the hidden layers of a neural
network.
 Equation: A(x) = max(0, x). It gives an output x if x is positive and 0
otherwise.
 Value Range: [0, inf)
 Nature: non-linear, which means we can backpropagate the errors
and stack multiple layers of neurons activated by the ReLU
function.
 Uses: ReLU is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At any time only some
of the neurons are activated, which makes the network sparse and
efficient to compute.
In simple words, networks using ReLU usually train much faster than those using
the sigmoid or tanh function.
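A short sketch of ReLU, its gradient, and the sparsity it produces. The pre-activation values are made up for illustration, and taking the gradient at exactly 0 to be 0 is a common convention rather than something the notes specify.

```python
def relu(x):
    """Rectified linear unit: x if x is positive, else 0."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative used in back-propagation (0 is used at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

# Illustrative pre-activation values for five neurons.
pre_activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Fraction of neurons that are active (non-zero output).
active_fraction = sum(1 for a in outputs if a > 0) / len(outputs)
print(outputs, grads, active_fraction)
```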
Softmax Function:
The Softmax function generalizes the sigmoid function to more than two classes
and is handy when we are trying to handle multi-class classification problems.
 Equation: softmax(z_i) = e^{z_i} / (e^{z_1} + e^{z_2} + ... + e^{z_K}) for K classes
 Nature: non-linear
 Uses: Usually used when trying to handle multiple classes. The Softmax
function is commonly found in the output layer of image classification
models. It squeezes the output for each class between 0 and 1 by
exponentiating each output and dividing by the sum of the exponentials, so
the outputs add up to 1.
 Output: The Softmax function is ideally used in the output layer of the
classifier, where we are trying to obtain the probabilities that define
the class of each input.
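A minimal softmax sketch follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result; the logit values are illustrative.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw outputs (logits) for three classes.
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the one with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

With two classes and the second logit fixed at 0, softmax reduces to the sigmoid of the first logit, which is why it is described as a generalization of the sigmoid.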
Conclusion:
 The basic rule of thumb is that if you really don’t know which activation
function to use, simply use ReLU, as it is a general-purpose activation
function for hidden layers and is used in most cases these days.
 If your output is for binary classification, then the sigmoid function is
a very natural choice for the output layer.
 If your output is for multi-class classification, then Softmax is very
useful for predicting the probabilities of each class.

Optimization
Types of errors:
In deep learning, errors can arise from various sources, and they can
generally be categorized into several types:
1. Training Errors:
o Underfitting: Occurs when the model is too simple to capture the
underlying patterns in the data, resulting in high training and
validation errors.
o Overfitting: Happens when the model learns the noise in the
training data rather than the actual signal, leading to low training
error but high validation error.
2. Optimization Errors:
o Gradient Vanishing: In deep networks, gradients can become very
small, making it difficult for the model to learn.
o Gradient Exploding: Opposite of vanishing gradients, where
gradients become excessively large, leading to unstable training.
3. Data Errors:
o Noisy Data: Presence of irrelevant or misleading information in
the dataset that can misguide the learning process.
o Imbalanced Data: When classes in the dataset are not represented
equally, leading to biased model predictions.
4. Model Errors:
o Incorrect Architecture: Choosing a model architecture that is not
suitable for the specific task can lead to poor performance.
o Hyperparameter Tuning Errors: Poor choices in
hyperparameters (like learning rate, batch size, etc.) can hinder the
training process.
5. Implementation Errors:
o Coding Bugs: Errors in the implementation of the model or
training process can lead to unexpected behavior.
o Library/Framework Misuse: Misunderstanding how to use a
deep learning library or framework can introduce errors.
6. Evaluation Errors:
o Wrong Metrics: Using inappropriate evaluation metrics that don’t
accurately reflect model performance can lead to misleading
conclusions.
o Data Leakage: When information from the validation/test set is
unintentionally used during training, resulting in overly optimistic
performance estimates.
7. Inference Errors:
o Domain Shift: When the model is applied to data that comes from
a different distribution than the training data, leading to poor
performance.
o Adversarial Attacks: Deliberate manipulation of input data to
trick the model into making incorrect predictions.
Each type of error can significantly impact the performance and reliability of a
deep learning model, and identifying the source of errors is crucial for
improvement.

Bias Variance Trade-off:


Introduction:
A learning model’s performance is evaluated based on how
accurate its predictions are and how well it generalizes to another independent
dataset it has not seen.
The errors in a learning model can be broken down into 2 parts:
1. Reducible Error
Reducible error is further broken down
into the square of the bias and the variance.
2. Irreducible Error
Irreducible error is error that cannot be reduced
no matter which learning model you use.
An imbalance between bias and variance causes the learning model to either
overfit or underfit the given data.
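For squared-error loss, this breakdown is usually written as shown below. The notation is an assumption, not taken from the notes: the data are modelled as y = f(x) + ε with zero-mean noise of variance σ², and f̂ denotes the fitted model, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms together form the reducible error; the last term does not depend on the model.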
What exactly is Bias?
Bias is the inability of a learning model to capture the true
relationship between the data variables. It is caused by the erroneous
assumptions that are inherent to the learning algorithm.
For example, in linear regression, the relationship between the X
and the Y variable is assumed to be linear, when in reality the relationship may
not be perfectly linear.
Let’s look at an example of an artificial dataset with the variables study
hours and marks.
The graph (not reproduced here) shows the original relationship between the
variables. Notice that there is a limit to the marks you can get on the test.
That is, even if you study for an extraordinary amount of time, there is always
a certain ‘maximum mark’ you can score. The curve flattens beyond a certain
value on the X-axis. So, the relationship is only piecewise linear. This sort of
behaviour will not be captured by the vanilla linear regression model.
You can expect an algorithm like linear regression to have high
bias error, whereas an algorithm like a decision tree has lower bias. Why?
Because decision trees don’t make such hard assumptions. The same is the case
with algorithms like k-Nearest Neighbours, Support Vector Machines, etc.

In general,
1. High Bias indicates more assumptions in the learning algorithm
about the relationships between the variables.
2. Less Bias indicates fewer assumptions in the learning algorithm.

What is the Variance Error?


Variance error is closely tied to the model overfitting on a
particular dataset. If the model learns to fit very closely to the points of a
particular dataset, then when it is used to predict on another dataset, it may
not predict as accurately as it did on the first.
Variance is the difference in the fits between different datasets.
Generally, nonlinear learning algorithms like decision trees have
high variance. It is even higher if the branches are not pruned during training.
Low-variance ML algorithms: Linear Regression, Logistic Regression, Linear
Discriminant Analysis.
High-variance ML algorithms: Decision Trees, k-NN, and Support Vector Machines.
Let’s look at the same dataset and try to fit the training data better, using
more complex functions to reduce the error.
Suppose we get nearly zero error on the training data. Now let’s apply this
curve to the test data.

The errors on the test data are larger in this case. If the errors differ
greatly between datasets, then the model has high variance. At
the same time, this type of curvy model will have low bias because it is able to
capture the relationships in the training data, unlike a straight line.
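The contrast between a high-bias and a high-variance model can be sketched on a synthetic study-hours dataset. Everything here is an assumption made for illustration: the saturating ground truth, the noise level, and the choice of a 1-nearest-neighbour predictor as the flexible model in place of the curvy fit described above.

```python
import random

random.seed(0)

def true_marks(hours):
    # Hypothetical ground truth: marks rise linearly, then saturate at a maximum.
    return min(10.0 * hours, 80.0)

def make_data(xs):
    """Attach noisy marks to each study-hours value."""
    return [(x, true_marks(x) + random.gauss(0.0, 5.0)) for x in xs]

train = make_data([0.5 * i for i in range(1, 25)])
test = make_data([0.5 * i + 0.25 for i in range(1, 24)])

def fit_line(data):
    """Ordinary least-squares straight line (simple, high-bias model)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_1nn(data):
    """1-nearest-neighbour predictor (flexible, high-variance model)."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line, nn = fit_line(train), fit_1nn(train)
results = {
    "line": (mse(line, train), mse(line, test)),
    "1-NN": (mse(nn, train), mse(nn, test)),
}
for name, (train_err, test_err) in results.items():
    print(name, "train MSE:", round(train_err, 2), "test MSE:", round(test_err, 2))
```

The 1-NN model reproduces the training points exactly, so its training error is zero while its test error is not; that gap is the symptom of variance. The straight line cannot follow the flattening curve, so it keeps a non-zero error even on the training data, which reflects bias.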
Bias-Variance Trade-off:

Summary:
a. If a model uses a simple machine learning algorithm, like
the linear model in the above example, the model will
have high bias and low variance (underfitting the data).
b. If a model uses a complex machine learning algorithm, then
it will have high variance and low bias (overfitting the data).
c. You need to find a good balance between the bias and
variance of the model. This trade-off in
complexity is what is referred to as the bias-variance
trade-off. A model with a good balance of bias and variance
neither overfits nor underfits the data.
d. This trade-off applies to all forms of supervised learning:
classification, regression, and structured output learning.
How to fix bias and variance problems?
Fixing High Bias:
1. Adding more input features will help the model fit the data better.
2. Add polynomial features to increase the complexity of the
model.
3. Decrease the regularization term, which lets the model fit the
training data more closely.
Fixing High Variance:
1. Reduce the input features; use only the features with the highest
feature importance to reduce overfitting.
2. Getting more training data will help in this case, because a high
variance model will not work well on an independent dataset if
you have very little data.

Overfitting and Underfitting:


Overfitting
Definition
Overfitting occurs when a model learns the details and noise in the training data
to the extent that it negatively impacts its performance on new data. Essentially,
the model becomes too tailored to the training data and fails to generalize well.
Symptoms
 High accuracy on training data but significantly lower accuracy on
validation/test data.
 The training loss continues to decrease while validation loss starts to
increase.
Causes
1. Complex Model Architecture: Using a neural network with too many
layers or parameters for the amount of training data.
2. Insufficient Data: Not enough data to represent the underlying
distribution.
3. Noise in Data: If the training data contains a lot of noise, the model
might learn those noisy patterns.
Example
Imagine training a deep neural network to classify images of cats and dogs. If
the dataset consists of only a few hundred images, and the model has millions of
parameters, it might learn to memorize the images instead of identifying the
characteristics that differentiate cats from dogs. As a result, it performs
exceptionally well on the training set but fails to classify new, unseen images
correctly.
Solutions
 Regularization: Techniques like L1/L2 regularization add a penalty for
larger weights.
 Dropout: Randomly setting a fraction of neurons to zero during training
helps prevent reliance on specific neurons.
 Early Stopping: Monitor validation loss and stop training when it begins
to increase.
 Data Augmentation: Use transformations (e.g., rotation, flipping) to
create more training examples.
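Of the solutions above, early stopping is the easiest to sketch without a framework. The function below is a simplified illustration: the patience parameter and the validation-loss values are assumptions, and real training loops usually also restore the weights saved at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Illustrative validation losses: improving at first, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```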
Underfitting
Definition
Underfitting occurs when a model is too simple to capture the underlying trends
in the data, resulting in poor performance on both training and validation
datasets.
Symptoms
 High training loss and high validation loss.
 The model does not learn enough from the training data to make accurate
predictions.
Causes
1. Simple Model Architecture: Using too few layers or parameters for the
complexity of the data.
2. Inadequate Training: Not enough training epochs or a learning rate
that’s too high, causing the model to skip optimal weights.
Example
Consider trying to fit a linear regression model to a dataset that has a quadratic
relationship. If you use a simple linear model, it will fail to capture the curve of
the data, resulting in both high training and validation errors.
Solutions
 Increase Model Complexity: Add more layers or neurons to the
network.
 Feature Engineering: Introduce polynomial features or other relevant
transformations.
 Adjust Training Parameters: Tune the learning rate or increase the number
of training epochs.
Visual Example:

Overfitting Example Visualization


 Training Loss: Decreases steadily.
 Validation Loss: Decreases initially but then starts to increase after a
certain point, creating a divergence.
 Graph: The training loss curve decreases steadily, while the validation
loss curve is U-shaped: it dips and then rises.
Underfitting Example Visualization
 Training Loss: Remains high throughout.
 Validation Loss: Also remains high and parallels training loss.
 Graph: Both curves are high and do not show significant improvement
over epochs, indicating that the model has not learned the data well.
Balancing the Two
To achieve optimal performance, it’s essential to find a sweet spot between
underfitting and overfitting. This often involves:
 Using techniques like cross-validation to assess model performance
across different subsets of data.
 Experimenting with different model architectures and hyperparameters.
 Utilizing tools such as learning curves to visualize performance and
diagnose issues.
By carefully monitoring training and validation metrics, one can refine the
model to achieve better generalization on unseen data.
Cross Validation:
Cross-validation is a crucial technique in deep learning used to assess the
performance of a model and ensure its generalizability. It helps mitigate issues
like overfitting and underfitting by providing a more reliable estimate of model
performance on unseen data.
What is Cross-Validation?
Cross-validation involves dividing the dataset into multiple subsets (or folds) to
train and evaluate the model multiple times. The goal is to maximize the use of
the data while ensuring that the model's performance is evaluated fairly.
Common Types of Cross-Validation
1. K-Fold Cross-Validation:
o The dataset is divided into k equally sized folds.
o The model is trained on k−1 folds and tested on the remaining fold.
This process is repeated k times, with each fold serving as the test
set once.
o The final performance metric is the average of the metrics obtained
from each fold.
Example: For k=5, if you have 100 samples, the data will be split into 5 groups
of 20. The model will train on 80 samples and test on the remaining 20, cycling
through each fold.
2. Stratified K-Fold Cross-Validation:
o Similar to K-Fold, but it ensures that each fold has a proportional
representation of different classes, especially useful for imbalanced
datasets.
3. Leave-One-Out Cross-Validation (LOOCV):
o A special case of K-Fold where k is equal to the number of
samples. This means that each training set is created by taking all
samples except one.
o While thorough, LOOCV can be computationally expensive,
especially for large datasets.
4. Time Series Cross-Validation:
o Used for time-dependent data. The training set includes all past
data points, and the model is validated on the next time step,
moving forward in time.
o This method respects the temporal order of data and is essential in
time series forecasting.
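The time series variant can be sketched as an expanding-window split over sample indices. The function name and its parameters (initial_train, horizon) are illustrative choices, not a reference to a particular library.

```python
def expanding_window_splits(n_samples, initial_train, horizon=1):
    """Return (train_indices, validation_indices) pairs in time order.

    Each split trains on all samples before a cut-off and validates on
    the next `horizon` samples, so no future data leaks into training.
    """
    splits = []
    end = initial_train
    while end + horizon <= n_samples:
        train_idx = list(range(0, end))
        val_idx = list(range(end, end + horizon))
        splits.append((train_idx, val_idx))
        end += horizon
    return splits

splits = expanding_window_splits(n_samples=6, initial_train=3)
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```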
Benefits of Cross-Validation
 Better Estimation of Model Performance: By using different subsets of
the data for training and testing, cross-validation provides a more robust
measure of how well the model is likely to perform on unseen data.
 Reduced Variance: Evaluating the model on multiple test sets helps
reduce the variability in performance metrics, leading to a more stable
estimate.
 Improved Hyperparameter Tuning: Cross-validation allows for more
reliable comparisons of model configurations, helping to select the best
hyperparameters.
Considerations When Using Cross-Validation
1. Computational Cost: Cross-validation can be resource-intensive,
particularly in deep learning where training models can take significant
time. It may require parallelization or using a subset of the data for
quicker evaluations.
2. Data Leakage: Care must be taken to ensure that data used in training
does not leak into the validation set, especially when preprocessing steps
(like normalization) are involved.
3. Choice of k: Selecting the right number of folds is important. A larger k
increases the training set size for each iteration but can lead to higher
computational costs.
Example Workflow Using K-Fold Cross-Validation in Deep Learning
1. Split the Data: Choose k (e.g., 5) and split the dataset into k folds.
2. Iterate Through Each Fold:
o For each fold:
 Use k−1 folds for training and the remaining fold for
validation.
 Train the model on the training set.
 Evaluate the model on the validation set and record the
performance metric (e.g., accuracy, loss).
3. Average Results: After all folds have been processed, compute the
average of the performance metrics to get the final estimate.
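The workflow above can be sketched with plain index lists. This is a minimal illustration under stated assumptions: k_fold_indices and cross_validate are hypothetical helper names, and train_and_score stands for whatever user-supplied routine trains a model on the training indices and returns a metric on the validation indices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

# With 100 samples and k = 5, each fold trains on 80 and validates on 20.
split_sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(100, 5)]
print(split_sizes)
```

In deep learning the callback would build a fresh model for every fold, so that no weights learned on one fold carry over to the next.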
Conclusion
Cross-validation is an essential tool in deep learning for evaluating model
performance and ensuring generalizability. By carefully implementing cross-
validation, practitioners can make more informed decisions about model
selection and tuning, ultimately leading to better-performing models in real-
world applications.

Feature Selection:
Feature selection is a critical step in the data preprocessing phase, especially in
deep learning, where the number of features can be vast. Properly selecting
features can improve model performance, reduce training time, and enhance
interpretability. Here’s an in-depth look at feature selection in deep learning.
What is Feature Selection?
Feature selection involves identifying and selecting a subset of relevant features
(or variables) from the original dataset. The goal is to improve the model's
efficiency and effectiveness by focusing on the most informative features while
discarding irrelevant or redundant ones.
Importance of Feature Selection
1. Improved Model Performance: Removing irrelevant features can lead
to better generalization, as the model focuses on the most important
information.
2. Reduced Overfitting: A simpler model with fewer features is less likely
to overfit to noise in the training data.
3. Decreased Computational Cost: Fewer features can significantly reduce
the training time and resource consumption.
4. Enhanced Interpretability: A model with fewer features is often easier
to interpret and understand.
Methods of Feature Selection
1. Filter Methods:
o These methods assess the relevance of features using statistical
measures, independently of any machine learning algorithm.
o Examples:
 Correlation Coefficient: Measures the linear relationship
between features and the target variable.
 Chi-Squared Test: Evaluates the association between
categorical features and the target.
 Mutual Information: Measures the dependency between
features and the target variable.
o Advantages: Fast and scalable; works well with high-dimensional
data.
o Disadvantages: Ignores feature interactions.
2. Wrapper Methods:
o These methods evaluate subsets of features based on the model’s
performance.
o Examples:
 Recursive Feature Elimination (RFE): Recursively
removes the least important features based on model
performance.
 Forward Selection: Starts with no features and adds them
one at a time based on model performance.
 Backward Elimination: Starts with all features and
removes the least significant ones.
o Advantages: Takes into account feature interactions; usually yields
better performance.
o Disadvantages: Computationally expensive, especially with deep
learning models.
3. Embedded Methods:
o These methods perform feature selection during the model training
process.
o Examples:
 L1 Regularization (Lasso): Encourages sparsity in the
model weights, effectively selecting features by driving
some coefficients to zero.
 Tree-based Methods: Algorithms like Random Forests and
Gradient Boosting can provide feature importance scores
based on how much each feature contributes to reducing
impurity.
o Advantages: Combines feature selection and model training;
typically more efficient than wrapper methods.
o Disadvantages: May still involve some computational overhead.
4. Dimensionality Reduction Techniques:
o While not strictly feature selection, these techniques reduce the
number of features by transforming the feature space.
o Examples:
 Principal Component Analysis (PCA): Transforms features
into a lower-dimensional space while preserving variance.
 t-Distributed Stochastic Neighbor Embedding (t-SNE):
Primarily used for visualization, but can help identify
patterns in high-dimensional data.
o Advantages: Can capture complex relationships; useful for
visualization.
o Disadvantages: Transformed features may be harder to interpret.
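As an example of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target. The tiny dataset and the feature names are made up for illustration, and treating a constant feature as having zero correlation is a convention chosen here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_top_k(features, target, k):
    """Rank features by |correlation| with the target and keep the top k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Illustrative data only.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [7, 9, 8, 7, 9, 8],
    "constant": [1, 1, 1, 1, 1, 1],
}
target = [12, 19, 33, 41, 48, 62]

selected, scores = select_top_k(features, target, k=1)
print(selected, scores)
```

Because each feature is scored on its own, this approach is fast but, as noted above, it ignores interactions between features.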
Practical Considerations
1. Data Preprocessing: Properly preprocessing the data (e.g.,
normalization, handling missing values) is essential before feature
selection.
2. Correlation Check: Analyze correlations among features to identify
highly correlated features that can be candidates for removal.
3. Domain Knowledge: Leverage domain expertise to inform which
features are likely to be relevant.
4. Cross-Validation: Use cross-validation when evaluating feature subsets
to ensure that the selection process does not lead to overfitting.
Conclusion
Feature selection is a vital process in deep learning that can enhance model
performance, reduce complexity, and improve interpretability. By employing
various methods—filter, wrapper, embedded, and dimensionality reduction—
practitioners can effectively select the most relevant features for their models.
The right approach will depend on the specific context, including data
characteristics and the modelling task at hand.

Gradient Descent:
Gradient descent is an optimization algorithm used to minimize a function by
iteratively moving in the direction of steepest descent, which is the negative
gradient of the function. It’s widely used in machine learning and deep learning
to minimize loss functions.
Key Concepts:
1. Objective Function: The function you want to minimize (e.g., a loss
function in machine learning).
2. Gradient: A vector of partial derivatives that points in the direction of the
steepest ascent of the function.
3. Learning Rate (α): A hyperparameter that controls the size of the steps
taken toward the minimum. A small learning rate might lead to slow
convergence, while a large learning rate can cause overshooting.
Steps in Gradient Descent:
1. Initialize Parameters: Start with initial values for the parameters
(weights).
2. Compute the Gradient: Calculate the gradient of the objective function
with respect to the parameters.
3. Update Parameters: Adjust the parameters in the opposite direction of
the gradient:
θ=θ−α∇J(θ)
where θ represents the parameters, α is the learning rate, and ∇J(θ) is the
gradient.
4. Repeat: Continue computing the gradient and updating the parameters
until convergence (when changes become negligible).
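The four steps can be sketched on a one-dimensional example. The objective J(θ) = (θ − 3)², the starting point, and the stopping tolerance are all illustrative assumptions.

```python
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a 1-D function given its gradient.

    Stops when the size of the update falls below `tol`
    or after `max_iters` iterations.
    """
    theta = theta0
    n_steps = 0
    for _ in range(max_iters):
        step = alpha * grad(theta)   # alpha * gradient of J at theta
        theta = theta - step         # move against the gradient
        n_steps += 1
        if abs(step) < tol:
            break
    return theta, n_steps

# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at 3.
theta_min, n_steps = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta_min, n_steps)
```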
Types of Gradient Descent:
1. Batch Gradient Descent: Uses the entire dataset to compute the
gradient. It can be slow for large datasets.
2. Stochastic Gradient Descent (SGD): Updates parameters using one data
point at a time, leading to faster iterations but more noisy updates.
3. Mini-batch Gradient Descent: A compromise between batch and
stochastic, it uses a small subset of the dataset to compute the gradient.
Convergence:
In non-convex functions, gradient descent may converge to a local minimum
rather than the global one. Techniques like momentum and adaptive learning
rates (like Adam or RMSprop) can help improve convergence and make it easier to
escape poor local minima, while regularization mainly improves generalization.
Applications:
 Training machine learning models (like neural networks)
 Linear regression
 Logistic regression
Momentum based GD:
Momentum-based gradient descent is an enhancement of the basic gradient
descent algorithm that helps accelerate convergence and reduce oscillations,
especially in the context of high-dimensional optimization problems.
Key Concepts:
1. Momentum: The idea behind momentum is to keep track of the previous
gradients to smooth out the updates. This is analogous to a physical
object that gains momentum as it moves; it won't stop immediately when
a force (gradient) changes direction.
2. Velocity Update: Instead of updating parameters directly using the
gradient, momentum computes a velocity vector that combines the
current gradient and the previous velocity. This helps to keep moving in
the direction of previous gradients.
Momentum Algorithm Steps:
1. Initialize: Set initial values for parameters θ, learning rate α, and
momentum coefficient β (commonly set between 0.5 and 0.9).
2. Velocity Initialization: Initialize the velocity v to zero or a small value.
3. Gradient Calculation: For each iteration, calculate the gradient of the
objective function with respect to the parameters.
4. Update Velocity:
v=βv+(1−β)∇J(θ)
where ∇J(θ) is the current gradient.
5. Update Parameters:
θ=θ−αv
6. Repeat: Continue this process until convergence.
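The algorithm steps above can be sketched as follows, using the same velocity update v = βv + (1 − β)∇J(θ). The objective, the hyperparameter values, and the fixed iteration count are illustrative assumptions.

```python
def momentum_gd(grad, theta0, alpha=0.1, beta=0.9, n_iters=500):
    """Gradient descent with momentum on a 1-D function."""
    theta = theta0
    v = 0.0                                  # velocity starts at zero
    for _ in range(n_iters):
        g = grad(theta)
        v = beta * v + (1.0 - beta) * g      # blend previous velocity and gradient
        theta = theta - alpha * v            # step along the velocity
    return theta

# J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); minimum at theta = 3.
theta_min = momentum_gd(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta_min)
```

Some libraries use the variant v = βv + ∇J(θ) without the (1 − β) factor; the two forms differ only in how the learning rate is scaled.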
Benefits:
 Faster Convergence: By incorporating past gradients, momentum allows
for larger steps in directions where gradients consistently point, leading to
faster convergence.
 Reduced Oscillation: It dampens the oscillations that can occur in
narrow ravines of the loss landscape, improving stability.
Variants:
 Nesterov Accelerated Gradient (NAG): An extension of momentum
that looks ahead by computing the gradient at the "lookahead" position
(i.e., the current position plus the momentum) before updating the
velocity. This can provide a more accurate update.
Applications:
Momentum-based gradient descent is particularly effective in training deep
learning models, where the loss landscape can be complex with many local
minima and ravines.
Stochastic GD:
Stochastic Gradient Descent (SGD) is an optimization algorithm commonly
used in machine learning and statistics to minimize a loss function. It is
particularly useful for large datasets since it updates the model parameters more
frequently than traditional gradient descent methods. Here’s a detailed
breakdown of SGD, including its formula and workings:
Overview of Gradient Descent
In traditional gradient descent, the algorithm updates the model parameters
based on the gradient of the loss function computed using the entire dataset. The
update rule for gradient descent is given by:
θ=θ−α∇J(θ)
where:
 θ represents the model parameters,
 α is the learning rate,
 ∇J(θ) is the gradient of the loss function J with respect to the parameters
θ.
Stochastic Gradient Descent
In contrast, stochastic gradient descent updates the parameters using only a
single training example (or a small batch) at each iteration. This can lead to
faster convergence and the ability to escape local minima due to the noise
introduced by using a subset of data.
Update Rule
The update rule for SGD is as follows:
θ=θ−α∇Ji(θ)
where:
 Ji(θ) is the loss function computed for a single training example i.
 ∇Ji(θ) is the gradient of the loss with respect to the parameters θ for that
specific example.
Steps in Stochastic Gradient Descent
1. Initialization: Start with random values for the parameters θ.
2. Shuffle the Training Data: This helps in ensuring that the updates are
not biased by the order of the training examples.
3. Iterate Over Training Examples:
o For each training example i:
1. Compute the gradient of the loss function: ∇Ji(θ).
2. Update the parameters: θ=θ−α∇Ji(θ).
4. Repeat: Go through the entire dataset multiple times (epochs) until
convergence.
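The four steps can be illustrated on a one-feature linear regression. This is a minimal sketch under stated assumptions: the per-example loss Ji = 0.5(w·xi + b − yi)², the function name sgd_linear_regression, the noise-free toy data, and the values α = 0.05 and 500 epochs are all chosen for the example.

```python
import random


def sgd_linear_regression(xs, ys, alpha=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by SGD on the per-example loss J_i = 0.5 * (w*x_i + b - y_i)**2."""
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)  # step 1: random initialization
    indices = list(range(len(xs)))
    for _ in range(epochs):                                # step 4: several passes (epochs)
        rng.shuffle(indices)                               # step 2: shuffle the training data
        for i in indices:                                  # step 3: one example at a time
            error = w * xs[i] + b - ys[i]
            grad_w, grad_b = error * xs[i], error          # gradient of J_i with respect to (w, b)
            w -= alpha * grad_w                            # theta = theta - alpha * grad J_i(theta)
            b -= alpha * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # noise-free toy data, so the exact fit is w = 2, b = 1
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

Because the toy data contain no noise, every example agrees on the same solution and the iterates settle near w = 2, b = 1; with noisy data the parameters would keep fluctuating around the minimum unless the learning rate is reduced.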
Advantages of Stochastic Gradient Descent
 Faster Updates: Since each update uses one example rather than the whole
dataset, updates are cheap and frequent, so on large datasets SGD often
makes progress faster than batch gradient descent.
 Better Generalization: The noise in the updates can help the model
escape local minima, potentially leading to better generalization.
 Online Learning: SGD can be used for online learning, adapting to new
data points as they arrive.
Disadvantages
 Noisy Updates: The updates can be noisy, which may lead to oscillations
and prevent convergence in some cases.
 Learning Rate Sensitivity: Choosing an appropriate learning rate is
crucial; a rate that's too high can cause divergence, while one that's too
low can result in slow convergence.
Variants of SGD
To mitigate some of the disadvantages, several variants of SGD have been
developed:
1. Mini-Batch Gradient Descent: Combines the benefits of both SGD and
batch gradient descent by using a small batch of training examples.
2. Momentum: Helps accelerate SGD in the relevant direction and dampens
oscillations by adding a fraction of the previous update to the current
update.
3. Adam (Adaptive Moment Estimation): Combines ideas from
momentum and RMSProp, adapting the learning rate for each parameter
individually.
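As an illustration of the third variant, a single Adam update can be sketched as follows. The function name adam_step is hypothetical; the default values (α = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8) are the commonly cited ones and are assumptions here, since these notes do not specify them.

```python
import math


def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count, m and v are the running moments."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]        # momentum-like average of gradients
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]   # RMSProp-like average of squared gradients
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                        # bias correction for the zero start
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [ti - alpha * mh / (math.sqrt(vh) + eps)                   # per-parameter step size
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta0 = [1.0, -2.0]
grad0 = [1.0, -2.0]                      # hypothetical gradient values for the example
theta1, m1, v1 = adam_step(theta0, grad0, [0.0, 0.0], [0.0, 0.0], t=1)
print(theta1)  # each parameter moves by roughly alpha, regardless of gradient size
```

The example shows the "adaptive" part: on the first step both parameters move by about α even though one gradient is twice as large as the other.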
Regularization:
Regularization is a technique used in machine learning and statistics to prevent
overfitting, which occurs when a model learns the noise in the training data
rather than the underlying patterns. By adding a penalty to the loss function,
regularization helps to constrain the complexity of the model. Here are a few
common types of regularization:
1. L1 Regularization (Lasso): This adds the sum of the absolute values of
the coefficients as a penalty to the loss function. It can lead to sparse
solutions, effectively performing feature selection by driving some
coefficients to zero.
2. L2 Regularization (Ridge): This adds the sum of the squared coefficients
as a penalty. It shrinks all coefficients toward zero rather than
completely eliminating some, which can help with multicollinearity.
3. Elastic Net: A combination of L1 and L2 regularization, Elastic Net
balances the benefits of both methods, allowing for feature selection
while maintaining some of the benefits of Ridge regression.
4. Dropout: In neural networks, dropout regularization randomly sets a
fraction of the neurons to zero during training, which helps prevent co-
adaptation of features.
5. Early Stopping: This technique involves monitoring the model's
performance on a validation set and stopping training when performance
begins to degrade, thus avoiding overfitting.
Regularization is crucial for building robust models that generalize well to
unseen data.
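The three penalty-based methods (L1, L2, Elastic Net) can be written as small functions added to a data loss. This is a sketch only: the function names, the strength parameter lam, and the mixing parameter l1_ratio are assumptions for illustration, and some texts scale the L2 term by 1/2.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties."""
    return l1_ratio * l1_penalty(weights, lam) + (1 - l1_ratio) * l2_penalty(weights, lam)


def regularized_loss(data_loss, weights, lam, penalty=l2_penalty):
    """Total objective = data loss + penalty on the weights."""
    return data_loss + penalty(weights, lam)


example_weights = [1.0, -2.0, 0.5]
print(l1_penalty(example_weights, 0.1))   # 0.1 * (1 + 2 + 0.5)
print(l2_penalty(example_weights, 0.1))   # 0.1 * (1 + 4 + 0.25)
```

Dropout and early stopping do not fit this pattern, because they change the training procedure rather than adding a term to the loss.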
Dropout:
Mechanics of Dropout:
 Basic Concept: During training, dropout randomly sets a fraction of the
neurons in a layer to zero (i.e., "drops them out"). This process occurs
independently for each training example and each forward pass.
 Dropout Rate: This is the probability of dropping out a neuron, typically
denoted as p. Common values for p range from 0.2 to 0.5. A dropout rate
of 0.5 means that, on average, half of the neurons will be dropped during
training.
 Forward Pass: When dropout is applied to a layer, the layer's output is
multiplied elementwise by a random binary mask. In the common "inverted
dropout" formulation, each retained (not dropped) neuron is also
multiplied by 1/(1−p) during training to maintain the expected output level.
 Backward Pass: During backpropagation, gradients are computed only
for the neurons that were active (not dropped out). This ensures that only
the weights of the active neurons are updated.
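The forward and backward passes described above can be sketched for a single layer's activations. This is a minimal illustration of inverted dropout; the function names and the list-based representation are assumptions, and real frameworks operate on tensors.

```python
import random


def dropout_forward(activations, p, training=True, rng=random):
    """Inverted dropout: drop each unit with probability p, scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        mask = [1.0] * len(activations)                 # evaluation: every unit is kept
    else:
        keep = 1.0 - p
        mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_backward(grad_output, mask):
    """Gradients pass only through units that were kept, with the same scaling."""
    return [g * m for g, m in zip(grad_output, mask)]


rng = random.Random(0)
out, mask = dropout_forward([1.0] * 10000, p=0.5, rng=rng)
print(sum(out) / len(out))  # close to 1.0: the expected output level is preserved
```

Because the scaling is done during training, nothing needs to be rescaled when dropout is switched off for evaluation.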
Implementation in Neural Networks:
Dropout can be applied in different ways:
 Fully Connected Layers: Commonly used in dense layers, dropout helps
prevent overfitting by randomly ignoring some neurons during training.
 Convolutional Layers: While dropout is less commonly applied to
convolutional layers due to their spatial structure, it can still be used,
typically after the convolutional and activation operations.
 Recurrent Neural Networks (RNNs): Dropout can also be applied to
RNNs, although it’s done differently (often between layers and not on the
recurrent connections) to maintain the sequential nature of data.
Variations of Dropout:
 Spatial Dropout: This variation is specifically designed for
convolutional layers, where entire feature maps are dropped out instead
of individual neurons. This maintains the spatial structure of the data.
 Gaussian Dropout: Instead of dropping out units, this variation scales
the outputs of neurons by a Gaussian noise. It allows for a smoother
transition during training.
 DropConnect: Instead of dropping out the activations of neurons,
DropConnect drops the connections (weights) between neurons during
training, which can be thought of as a form of dropout for weights.
Advantages of Dropout:
 Prevention of Overfitting: By ensuring that no single neuron or pathway
becomes overly specialized, dropout encourages more general features to
be learned.
 Improved Generalization: Models trained with dropout tend to perform
better on unseen data, as they have learned to operate robustly even with
various neurons inactive.
 Implicit Ensembling: Dropout does not add capacity to the network, but
it effectively trains an ensemble of many sub-networks with shared
weights, as different combinations of neurons are active each time.
Practical Considerations:
 Hyperparameter Tuning: The dropout rate is a critical hyperparameter.
Common practice is to start with a rate around 0.5 for fully connected
layers and lower values for convolutional layers.
 Training Time: Because dropout adds noise to each update, it usually
slows convergence, so more epochs may be needed to reach optimal
performance.
 When to Use Dropout: It is most beneficial in larger networks where
overfitting is a concern, particularly when the dataset is small relative to
the model complexity.
 Evaluation Mode: During testing or validation, dropout is turned off, and
all neurons are used, ensuring that the full capacity of the model is
utilized.
Drop Connect:
DropConnect is a regularization technique used in deep learning to prevent
overfitting, similar to Dropout, but with a different approach. Here’s a detailed
overview of DropConnect:
Concept:
 DropConnect involves randomly setting a subset of weights in a neural
network to zero during training. Instead of dropping entire neurons (as in
Dropout), DropConnect drops connections (weights) between neurons.
 This means that, during each forward pass, some weights are temporarily
removed from the network, which forces the model to learn more robust
features and prevents reliance on any single connection.
Mechanism
1. Random Weight Selection:
o For each training iteration, a binary mask is generated where each
weight has a probability p of being retained (1) and a probability
1−p of being set to zero (0).
o This mask is applied to the weights of the layer, effectively
removing some connections.
2. Forward Pass:
o The forward pass is conducted using the modified weights. The
output of the layer is computed with the remaining active weights.
3. Backpropagation:
o During backpropagation, the gradients are only computed for the
active weights, and the inactive weights do not contribute to the
update.
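A sketch of the forward pass for one fully connected layer, following the mechanism above with p as the retention probability. The function name and the list-of-rows weight layout are assumptions, and no rescaling of the surviving weights is applied, which is a simplification.

```python
import random


def dropconnect_forward(x, W, b, p, rng=random):
    """DropConnect for one dense layer.

    x: inputs (length n_in); W: n_out rows of n_in weights; b: n_out biases;
    p: probability that each individual weight is retained.
    """
    mask = [[1 if rng.random() < p else 0 for _ in row] for row in W]   # one mask entry per weight
    out = [sum(w * m * xi for w, m, xi in zip(row, mrow, x)) + bj       # masked weighted sum plus bias
           for row, mrow, bj in zip(W, mask, b)]
    return out, mask


x = [1.0, 2.0]
W = [[0.5, -1.0], [2.0, 0.0]]
b = [0.1, -0.2]
out_all, _ = dropconnect_forward(x, W, b, p=1.0)    # every weight retained: ordinary dense layer
out_none, _ = dropconnect_forward(x, W, b, p=0.0)   # every weight dropped: only the biases remain
sampled, sampled_mask = dropconnect_forward(x, W, b, p=0.5, rng=random.Random(0))
```

During backpropagation the same mask would be applied to the weight gradients, so dropped connections receive no update in that iteration.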
Benefits:
 Improved Generalization: By reducing the complexity of the model in
each training iteration, DropConnect helps the network to generalize
better to unseen data.
 Prevent Overfitting: It mitigates the risk of overfitting, especially in
scenarios with limited training data.
Comparison with Dropout:
 Dropout: Drops entire neurons, which means that the outputs from those
neurons are not considered during training.
 DropConnect: Drops individual weights, which gives finer-grained
control over the network’s connectivity and can lead to improved
performance in certain architectures.
Implementation:
 DropConnect can be implemented by modifying the forward pass of a
layer in a neural network. Many deep learning libraries (like TensorFlow
or PyTorch) allow for custom layers where you can integrate
DropConnect.
 The retention probability p needs to be tuned as a hyperparameter,
similar to how you would tune the dropout rate in Dropout.
Use Cases:
 DropConnect has been found effective in various tasks, particularly in
deep networks where overfitting is a concern.
 It’s useful in scenarios where you want to explore sparsity in model
weights, potentially leading to more efficient networks.
Limitations:
 Computational Cost: The stochastic nature of DropConnect can increase
training time, as the model may require more epochs to converge.
 Implementation Complexity: Compared to Dropout, it can be slightly
more complex to implement and tune.
Batch Normalization
 Batch normalization is a technique used to improve the training of
deep neural networks.
 It helps address issues like internal covariate shift, which refers to
the changes in the distribution of network activations during
training.
Need for Batch Normalization:
 Batch Normalization extends the concept of normalization from just
the input layer to the activations of each hidden layer throughout the
neural network.
 By normalizing the activations of each layer, Batch Normalization helps
to alleviate the internal covariate shift problem, which can hinder the
convergence of the network during training.
 The inputs to each hidden layer are the activations from the previous
layer. If these activations are normalized, it ensures that the network is
consistently presented with inputs that have a similar distribution,
regardless of the training stage. This stability in the distribution of
inputs allows for smoother and more efficient training.
 By applying Batch Normalization into the hidden layers of the network,
the gradients propagated during backpropagation are less likely to
vanish or explode, leading to more stable training dynamics. This
ultimately facilitates faster convergence and better performance of the
neural network on the given task.
Fundamentals of Batch Normalization:
The following are the steps taken to perform batch normalization.
Step 1: Compute the Mean and Variance of Mini-Batches
For a mini-batch of activations x1, x2, ..., xm, the mean μB and variance σB² of
the mini-batch are computed:
μB = (1/m) Σ xi
σB² = (1/m) Σ (xi − μB)²
where each sum runs over i = 1, ..., m.
Step 2: Normalization
a. Each activation xi is normalized using the computed mean
and variance of the mini-batch.
b. The normalization process subtracts the mean μB from each
activation and divides by the square root of the variance σB²,
ensuring that the normalized activations have zero mean
and unit variance.
c. Additionally, a small constant ϵ is added to the denominator
for numerical stability, particularly to prevent division by
zero.
x̂i = (xi − μB) / √(σB² + ϵ)
Step 3: Scale and Shift the Normalized Activations.
The normalized activations x̂i are then scaled by a learnable
parameter γ and shifted by another learnable parameter β.
These parameters allow the model to learn the optimal scaling and
shifting of the normalized activations, giving the network additional
flexibility.
yi = γ x̂i + β
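The three steps can be put together for a mini-batch of scalar activations. This is a minimal training-time sketch: the function name batch_norm_forward, the default ϵ = 1e-5, and the use of one feature per call are assumptions, and the running statistics that frameworks keep for inference are not covered here.

```python
import math


def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization of one feature over a mini-batch x = [x1, ..., xm]."""
    m = len(x)
    mu = sum(x) / m                                          # step 1: mini-batch mean
    var = sum((xi - mu) ** 2 for xi in x) / m                # step 1: mini-batch variance
    x_hat = [(xi - mu) / math.sqrt(var + eps) for xi in x]   # step 2: normalize
    y = [gamma * xh + beta for xh in x_hat]                  # step 3: scale and shift
    return y, x_hat, mu, var


batch = [1.0, 2.0, 3.0, 4.0]
y, x_hat, mu, var = batch_norm_forward(batch)
print(mu, var)   # 2.5 1.25
print(x_hat)     # values with mean 0 and variance very close to 1
```

With γ = 1 and β = 0 the output equals the normalized activations; during training γ and β would be updated by gradient descent like any other parameters.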
Benefits of Batch Normalization
 Faster Convergence: Batch Normalization reduces internal
covariate shift, allowing for faster convergence during training.
 Higher Learning Rates: With Batch Normalization, higher
learning rates can be used with less risk of divergence.
 Regularization Effect: Batch Normalization introduces a slight
regularization effect that can reduce the need for additional
regularization techniques like dropout.
Hyperparameters:
In deep learning, hyperparameters are the parameters that are set before
training a model, and they govern the overall structure and performance of the
model. Unlike parameters (weights and biases) that are learned during training,
hyperparameters must be specified beforehand and are typically tuned through
experimentation to find the optimal configuration for a given task.
Types of Hyperparameters:
Hyperparameters can be grouped into various categories based on their
role in the model architecture, training process, and optimization process:
a) Model Architecture Hyperparameters
These control the architecture and design of the model.
 Number of Layers: Determines how deep the network is. The more
layers, the more complex the functions the model can learn. Common
layer types are:
o Fully connected layers (dense layers)
o Convolutional layers (in CNNs)
o Recurrent layers (in RNNs, LSTMs)
 Number of Neurons per Layer: Defines the number of units or nodes in
each layer. More neurons allow the model to capture more complex
patterns, but may also lead to overfitting.
 Activation Functions: Functions applied to the output of neurons (e.g.,
ReLU, sigmoid, tanh, softmax). The choice of activation function can
significantly impact the network’s ability to learn.
 Dropout Rate: In some layers, random neurons are dropped out during
training to prevent overfitting. This dropout rate is a hyperparameter (e.g.,
0.2 means 20% of neurons are dropped out).
b) Optimization Hyperparameters
These affect how the model is trained and optimized.
 Learning Rate: The most critical hyperparameter for optimization. It
controls how much to change the model parameters with respect to the
loss gradient during training. Too large a learning rate might cause the
model to overshoot the optimal solution, while too small a rate might lead
to slow convergence.
 Batch Size: The number of training samples used in one iteration before
the model parameters are updated. Smaller batch sizes can lead to noisy
gradients but may help generalize better, while larger batch sizes lead to
smoother gradient estimates.
 Epochs: The number of times the model will iterate over the entire
training dataset. More epochs can improve the model’s performance but
increase the risk of overfitting.
 Momentum: A technique used to accelerate gradients in the relevant
direction, preventing oscillations. Momentum helps the optimizer by
accumulating a "memory" of past gradients, improving convergence.
 Weight Decay (L2 Regularization): A technique used to prevent
overfitting by penalizing large weights. This is done by adding a
regularization term to the loss function, which is proportional to the
squared magnitude of the weights.
 Optimizer Type: Determines how gradients are turned into parameter
updates, including whether the step size is adapted per parameter
over time. Common optimizers include:
o SGD (Stochastic Gradient Descent)
o Adam (Adaptive Moment Estimation)
o RMSProp (Root Mean Square Propagation)
o Adagrad
Each optimizer has its own characteristics and advantages for different
types of problems.
c) Regularization Hyperparameters
Regularization techniques help to prevent overfitting by limiting the
complexity of the model.
 Dropout Rate: As mentioned earlier, dropout is a regularization method
where randomly selected neurons are ignored during training. This helps
to prevent the model from overfitting.
 L1 and L2 Regularization: These methods add a penalty to the loss
function based on the size of the model’s weights. L1 regularization leads
to sparse weights, while L2 regularization encourages smaller weights.
d) Data-Related Hyperparameters
These define how the data is processed and fed into the model.
 Input Data Size: The size of the input data that the model expects. It
could be the size of the image (in image classification) or the sequence
length (in text or time-series models).
 Data Augmentation: Methods like rotation, scaling, and flipping (for
image data) can be used to artificially increase the size of the training
dataset, helping to prevent overfitting.
 Shuffle: Whether the data should be shuffled before training or not. This
can prevent the model from memorizing the order of the data during
training.
 Normalization and Scaling: Preprocessing steps like normalizing data
(scaling the input features to a range, like [0,1]) to make the training
process more efficient and stable.
e) Learning Rate Scheduling
Sometimes, it’s helpful to adjust the learning rate during training to fine-
tune the convergence behavior.
 Step Decay: The learning rate is reduced by a factor every few epochs
(e.g., after every 10 epochs).
 Exponential Decay: The learning rate decays exponentially over time.
 Cosine Annealing: The learning rate follows a cosine curve, decreasing
smoothly from its initial value to a minimum as training progresses
(optionally restarting periodically).
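The three schedules can be written as functions of the epoch number. This is a sketch under stated assumptions: the function names and the parameters drop, every, k, total_epochs and lr_min are hypothetical names chosen for the example, and the cosine version shown has no restarts.

```python
import math


def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(lr0, epoch, k=0.05):
    """Learning rate decays as lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)


def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Learning rate follows half a cosine wave from lr0 down to lr_min."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 10, 25, 50, 100):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total_epochs=100))
```

All three start at the initial rate lr0 at epoch 0; they differ in how quickly and how smoothly the rate falls afterwards.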
Hyperparameter Tuning:
Since hyperparameters greatly influence the model’s performance,
selecting the optimal combination is essential. However, manually tuning
hyperparameters can be time-consuming, so several techniques exist to
automate this process:
a) Grid Search
Grid search involves exhaustively trying all possible combinations of a
predefined set of hyperparameters. This can be computationally expensive,
especially if there are many hyperparameters to tune.
b) Random Search
Instead of trying all combinations, random search samples a fixed number
of random combinations from specified ranges or distributions. This method
can be more efficient than grid search, especially for models with many
hyperparameters.
c) Bayesian Optimization
Bayesian optimization uses probabilistic models to predict the best
hyperparameters based on past results, progressively improving the search
process. Popular libraries like Optuna and Hyperopt implement this.
d) Genetic Algorithms
Genetic algorithms apply principles of natural evolution (such as
mutation, crossover, and selection) to explore the hyperparameter space.
e) Gradient-Based Optimization
Methods like hypergradient descent adapt the hyperparameters by
computing gradients with respect to the hyperparameters themselves,
fine-tuning them along with the model parameters.
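The first two techniques, grid search and random search, are simple enough to sketch directly. Everything here is illustrative: toy_validation_loss is a hypothetical stand-in for "train a model and return its validation loss", and the function names, search ranges and trial count are assumptions.

```python
import itertools
import math
import random


def grid_search(score, grid):
    """Try every combination in `grid` (name -> list of values); lower score is better."""
    names = list(grid)
    best_score, best_config = float("inf"), None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        s = score(config)
        if s < best_score:
            best_score, best_config = s, config
    return best_score, best_config


def random_search(score, samplers, n_trials, seed=0):
    """Draw n_trials random configurations; `samplers` maps name -> function(rng) -> value."""
    rng = random.Random(seed)
    best_score, best_config = float("inf"), None
    for _ in range(n_trials):
        config = {name: draw(rng) for name, draw in samplers.items()}
        s = score(config)
        if s < best_score:
            best_score, best_config = s, config
    return best_score, best_config


def toy_validation_loss(config):
    """Hypothetical loss surface with its minimum at lr = 0.01, dropout = 0.3."""
    return (math.log10(config["lr"]) + 2.0) ** 2 + (config["dropout"] - 0.3) ** 2


grid_best = grid_search(toy_validation_loss,
                        {"lr": [1e-3, 1e-2, 1e-1], "dropout": [0.1, 0.3, 0.5]})
random_best = random_search(toy_validation_loss,
                            {"lr": lambda rng: 10 ** rng.uniform(-4, 0),   # log-uniform learning rate
                             "dropout": lambda rng: rng.uniform(0.0, 0.6)},
                            n_trials=30)
print(grid_best)
print(random_best)
```

The grid here costs 3 × 3 = 9 evaluations and grows multiplicatively with each added hyperparameter, whereas the random search budget (30 trials) is fixed regardless of how many hyperparameters are sampled.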
Common Hyperparameters in Deep Learning Models
Here’s a summary of some of the most commonly used hyperparameters
in deep learning:
Category | Hyperparameters
Model Architecture | Number of layers, Number of neurons per layer, Activation functions, Dropout rate
Optimization | Learning rate, Batch size, Epochs, Momentum, Optimizer type
Regularization | L1/L2 regularization, Dropout rate
Data Processing | Input data size, Data augmentation, Normalization/scaling, Shuffle
Learning Rate Scheduling | Step decay, Exponential decay, Cosine annealing
Importance of Hyperparameter Tuning
The optimal hyperparameter configuration can significantly improve the
performance of a model. Poor choices in hyperparameters can lead to:
 Underfitting: The model is too simple, unable to capture the complexity
of the data (e.g., insufficient layers or neurons).
 Overfitting: The model is too complex and learns the noise in the
training data, leading to poor generalization on new data (e.g., too many
layers or neurons, too many epochs, too little regularization).
Therefore, careful experimentation and systematic tuning are necessary to
achieve the best possible model performance.