Unit 2
Topics:
Activation functions: Sigmoid, ReLU, Hyperbolic Functions, Softmax
Optimization: Types of errors, bias-variance trade-off, overfitting-underfitting,
Cross Validation, Feature Selection, Gradient Descent (GD), Momentum Based
GD, Stochastic GD, Regularization (dropout, drop connect, batch
normalization), Hyper parameters.
Activation Functions
What is an Activation Function?
An activation function in the context of neural networks is a
mathematical function applied to the output of a neuron. The purpose of
an activation function is to introduce non-linearity into the model,
allowing the network to learn and represent complex patterns in the data.
Without non-linearity, a neural network would essentially behave like a
linear regression model, regardless of the number of layers it has.
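As a small illustration of this point (a sketch with made-up scalar weights, not
part of the original notes), the Python snippet below shows that two stacked
layers with no activation collapse into a single linear layer, while inserting a
non-linearity between them breaks that equivalence. The names w1, b1, w2, b2 and
the use of max(0, z) as the non-linearity are assumptions made for the example.

```python
# Two "layers" with no activation: y = w2 * (w1 * x + b1) + b2.
# All weights are made-up illustrative values.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    hidden = w1 * x + b1        # layer 1, no activation
    return w2 * hidden + b2     # layer 2, no activation


# The same mapping written as ONE linear layer.
w_eq = w2 * w1
b_eq = w2 * b1 + b2


def one_linear_layer(x):
    return w_eq * x + b_eq


def two_layers_with_nonlinearity(x):
    hidden = max(0.0, w1 * x + b1)   # non-linearity between the layers
    return w2 * hidden + b2


inputs = [-2.0, -0.5, 0.0, 1.5]

# True: without an activation, depth adds no expressive power.
collapsed = all(
    abs(two_linear_layers(x) - one_linear_layer(x)) < 1e-12 for x in inputs
)

# True: with the non-linearity, the network is no longer a single line.
nonlinear_differs = any(
    abs(two_layers_with_nonlinearity(x) - one_linear_layer(x)) > 1e-12
    for x in inputs
)
```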
A neuron first computes the weighted sum of its inputs and adds a bias; the
activation function is then applied to this value and determines how strongly
the neuron is activated.
Each neuron in the network therefore has its own weights, a bias, and an
activation function.
In a neural network, the weights and biases of the neurons are updated on the
basis of the error at the output. This process is known as backpropagation.
Differentiable activation functions make backpropagation possible, because
their gradients are propagated along with the error to update the weights and
biases.
Elements of a Neural Network
Input Layer: This layer accepts the input features. It provides information
from the outside world to the network. No computation is performed at
this layer; its nodes just pass the information (features) on to the
hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world;
they are part of the abstraction provided by any neural network. The
hidden layer performs all sorts of computation on the features entered
through the input layer and transfers the result to the output layer.
Output Layer: This layer brings up the information learned by the
network to the outer world.
Tanh Function:
The tanh function, also known as the hyperbolic tangent function, almost
always works better than the sigmoid function in hidden layers. It is a
scaled and shifted version of the sigmoid function, so each can be derived
from the other.
Equation: f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1
Value Range: -1 to +1
Nature: non-linear
Uses: Usually used in the hidden layers of a neural network. Because its
values lie between -1 and 1, the mean of the hidden-layer activations comes
out to be 0 or very close to it, which helps center the data. This makes
learning for the next layer much easier.
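The relationship between tanh and sigmoid given above can be checked
numerically. The following is a minimal sketch using only Python's math module;
the helper names sigmoid and tanh_from_sigmoid are chosen for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh written as a scaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


xs = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Largest difference between the identity and the library tanh;
# it should be at the level of floating-point rounding.
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in xs)

# All outputs stay strictly inside the range (-1, +1).
in_range = all(-1.0 < tanh_from_sigmoid(x) < 1.0 for x in xs)
```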
ReLU Function:
ReLU stands for Rectified Linear Unit. It is the most widely used
activation function and is chiefly implemented in the hidden layers of
neural networks.
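The notes do not spell out the ReLU formula; the sketch below assumes the
standard definition f(x) = max(0, x). The derivative at exactly x = 0 is a
matter of convention, and the value 0 used here is an assumption of this
example.

```python
def relu(x):
    """Standard ReLU: passes positive values through, clips negatives to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0 (0 chosen at x == 0)."""
    return 1.0 if x > 0 else 0.0


samples = [-2.0, -0.5, 0.0, 1.5]
outputs = [relu(x) for x in samples]        # [0.0, 0.0, 0.0, 1.5]
grads = [relu_grad(x) for x in samples]     # [0.0, 0.0, 0.0, 1.0]
```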
Optimization
Types of errors:
In deep learning, errors can arise from various sources, and they can
generally be categorized into several types:
1. Training Errors:
o Underfitting: Occurs when the model is too simple to capture the
underlying patterns in the data, resulting in high training and
validation errors.
o Overfitting: Happens when the model learns the noise in the
training data rather than the actual signal, leading to low training
error but high validation error.
2. Optimization Errors:
o Gradient Vanishing: In deep networks, gradients can become very
small, making it difficult for the model to learn.
o Gradient Exploding: Opposite of vanishing gradients, where
gradients become excessively large, leading to unstable training.
3. Data Errors:
o Noisy Data: Presence of irrelevant or misleading information in
the dataset that can misguide the learning process.
o Imbalanced Data: When classes in the dataset are not represented
equally, leading to biased model predictions.
4. Model Errors:
o Incorrect Architecture: Choosing a model architecture that is not
suitable for the specific task can lead to poor performance.
o Hyperparameter Tuning Errors: Poor choices in
hyperparameters (like learning rate, batch size, etc.) can hinder the
training process.
5. Implementation Errors:
o Coding Bugs: Errors in the implementation of the model or
training process can lead to unexpected behavior.
o Library/Framework Misuse: Misunderstanding how to use a
deep learning library or framework can introduce errors.
6. Evaluation Errors:
o Wrong Metrics: Using inappropriate evaluation metrics that don’t
accurately reflect model performance can lead to misleading
conclusions.
o Data Leakage: When information from the validation/test set is
unintentionally used during training, resulting in overly optimistic
performance estimates.
7. Inference Errors:
o Domain Shift: When the model is applied to data that comes from
a different distribution than the training data, leading to poor
performance.
o Adversarial Attacks: Deliberate manipulation of input data to
trick the model into making incorrect predictions.
Each type of error can significantly impact the performance and reliability of a
deep learning model, and identifying the source of errors is crucial for
improvement.
In general,
1. High bias indicates that the learning algorithm makes more assumptions
about the relationships between the variables.
2. Low bias indicates that the learning algorithm makes fewer assumptions.
In this case the errors on the test data are higher. If the errors differ widely
across different datasets, the model has high variance. At the same time, this
type of curvy model has low bias because, unlike a straight line, it is able to
capture the relationships in the training data.
Bias-Variance Trade-off:
Summary:
a. If a model uses a simple machine learning algorithm, such as
a linear model, it will have high bias and low variance
(underfitting the data).
b. If a model uses a complex machine learning algorithm, then
it will have high variance and low bias (overfitting the data).
c. You need to find a good balance between the bias and
variance of the model. This trade-off in complexity is what
is referred to as the bias-variance trade-off. A model with
an optimal balance of bias and variance neither overfits
nor underfits the data.
d. This trade-off applies to all forms of supervised learning:
classification, regression, and structured output learning.
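To make the underfitting and overfitting cases concrete, here is a small sketch
on made-up toy data. It compares a constant predictor (very simple, high bias)
with a 1-nearest-neighbour predictor that memorizes the training set (very
flexible, high variance). Both models and the data are assumptions chosen for
illustration; they are not the models used in the original notes.

```python
def mse(model, xs, ys):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = 2x plus alternating +1 / -1 "noise" (made-up values).
train_x = [float(i) for i in range(8)]
train_y = [2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [x + 0.4 for x in train_x[:-1]]
test_y = [2.0 * x + (-1.0 if i % 2 == 0 else 1.0) for i, x in enumerate(test_x)]

mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """High bias: ignores x and always predicts the training mean."""
    return mean_y


def memorizing_model(x):
    """High variance: returns the label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


models = {"constant": constant_model, "memorizing": memorizing_model}

# For each model: (training error, test error).
errors = {
    name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
    for name, m in models.items()
}
```

The constant model has a large error on both sets (underfitting), while the
memorizing model has zero training error but a non-zero test error, which is the
gap that characterizes overfitting.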
How to fix bias and variance problems?
Fixing High Bias:
1. Add more input features to help the model fit the data better.
2. Add polynomial features to increase the complexity of the
model.
3. Decrease the regularization term to rebalance bias and
variance.
Fixing High Variance:
1. Reduce the input features, keeping only the features with higher
importance, to reduce overfitting.
2. Get more training data. A high-variance model does not work well
on an independent dataset when it has been trained on very
little data.
Feature Selection:
Feature selection is a critical step in the data preprocessing phase, especially in
deep learning, where the number of features can be vast. Properly selecting
features can improve model performance, reduce training time, and enhance
interpretability. Here’s an in-depth look at feature selection in deep learning.
What is Feature Selection?
Feature selection involves identifying and selecting a subset of relevant features
(or variables) from the original dataset. The goal is to improve the model's
efficiency and effectiveness by focusing on the most informative features while
discarding irrelevant or redundant ones.
Importance of Feature Selection
1. Improved Model Performance: Removing irrelevant features can lead
to better generalization, as the model focuses on the most important
information.
2. Reduced Overfitting: A simpler model with fewer features is less likely
to overfit to noise in the training data.
3. Decreased Computational Cost: Fewer features can significantly reduce
the training time and resource consumption.
4. Enhanced Interpretability: A model with fewer features is often easier
to interpret and understand.
Methods of Feature Selection
1. Filter Methods:
o These methods assess the relevance of features using statistical
measures, independently of any machine learning algorithm.
o Examples:
Correlation Coefficient: Measures the linear relationship
between features and the target variable.
Chi-Squared Test: Evaluates the association between
categorical features and the target.
Mutual Information: Measures the dependency between
features and the target variable.
o Advantages: Fast and scalable; works well with high-dimensional
data.
o Disadvantages: Ignores feature interactions.
2. Wrapper Methods:
o These methods evaluate subsets of features based on the model’s
performance.
o Examples:
Recursive Feature Elimination (RFE): Recursively
removes the least important features based on model
performance.
Forward Selection: Starts with no features and adds them
one at a time based on model performance.
Backward Elimination: Starts with all features and
removes the least significant ones.
o Advantages: Takes into account feature interactions; usually yields
better performance.
o Disadvantages: Computationally expensive, especially with deep
learning models.
3. Embedded Methods:
o These methods perform feature selection during the model training
process.
o Examples:
L1 Regularization (Lasso): Encourages sparsity in the
model weights, effectively selecting features by driving
some coefficients to zero.
Tree-based Methods: Algorithms like Random Forests and
Gradient Boosting can provide feature importance scores
based on how much each feature contributes to reducing
impurity.
o Advantages: Combines feature selection and model training;
typically more efficient than wrapper methods.
o Disadvantages: May still involve some computational overhead.
4. Dimensionality Reduction Techniques:
o While not strictly feature selection, these techniques reduce the
number of features by transforming the feature space.
o Examples:
Principal Component Analysis (PCA): Transforms features
into a lower-dimensional space while preserving variance.
t-Distributed Stochastic Neighbor Embedding (t-SNE):
Primarily used for visualization, but can help identify
patterns in high-dimensional data.
o Advantages: Can capture complex relationships; useful for
visualization.
o Disadvantages: Transformed features may be harder to interpret.
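As an illustration of a filter method, the sketch below ranks features by the
absolute value of their Pearson correlation with the target and keeps the top
two. The feature names, the data values, and the choice of keeping two features
are all made up for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical dataset: three candidate features and one target.
features = {
    "size":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # rises with the target
    "age":   [6.0, 5.5, 4.0, 3.5, 2.0, 1.0],   # falls as the target rises
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],   # unrelated to the target
}
target = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]

# Score each feature independently of any model (the filter idea).
scores = {name: abs(pearson(col, target)) for name, col in features.items()}

# Keep the two features with the strongest linear relationship.
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:2]
```

Because each feature is scored on its own, this approach is fast but, as noted
above, it cannot detect features that are only useful in combination.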
Practical Considerations
1. Data Preprocessing: Properly preprocessing the data (e.g.,
normalization, handling missing values) is essential before feature
selection.
2. Correlation Check: Analyze correlations among features to identify
highly correlated features that can be candidates for removal.
3. Domain Knowledge: Leverage domain expertise to inform which
features are likely to be relevant.
4. Cross-Validation: Use cross-validation when evaluating feature subsets
to ensure that the selection process does not lead to overfitting.
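The cross-validation mentioned in the last item can be sketched as a k-fold
split: the data is divided into k folds, and each fold is used once for
validation while the rest is used for training. The function name
k_fold_indices is hypothetical, and the sketch does not shuffle the data, which
a practical version usually would.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)

# Every sample is used for validation exactly once across the folds.
all_validation = sorted(i for _, validation in folds for i in validation)
```

A feature subset would then be scored by training on each train split,
evaluating on the matching validation split, and averaging the k results.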
Conclusion
Feature selection is a vital process in deep learning that can enhance model
performance, reduce complexity, and improve interpretability. By employing
various methods—filter, wrapper, embedded, and dimensionality reduction—
practitioners can effectively select the most relevant features for their models.
The right approach will depend on the specific context, including data
characteristics and the modelling task at hand.
Gradient Descent:
Gradient descent is an optimization algorithm used to minimize a function by
iteratively moving in the direction of steepest descent, which is the negative
of the gradient of the function. It is widely used in machine learning and deep
learning to minimize loss functions.
Key Concepts:
1. Objective Function: The function you want to minimize (e.g., a loss
function in machine learning).
2. Gradient: A vector of partial derivatives that points in the direction of the
steepest ascent of the function.
3. Learning Rate (α): A hyperparameter that controls the size of the steps
taken toward the minimum. A small learning rate might lead to slow
convergence, while a large learning rate can cause overshooting.
Steps in Gradient Descent:
1. Initialize Parameters: Start with initial values for the parameters
(weights).
2. Compute the Gradient: Calculate the gradient of the objective function
with respect to the parameters.
3. Update Parameters: Adjust the parameters in the opposite direction of
the gradient:
θ=θ−α∇J(θ)
where θ represents the parameters, α is the learning rate, and ∇J(θ) is the
gradient.
4. Repeat: Continue computing the gradient and updating the parameters
until convergence (when changes become negligible).
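The four steps above can be sketched on a one-dimensional example. The objective
J(θ) = (θ − 3)², the starting point, the learning rate, and the stopping
tolerance are all assumptions chosen for illustration.

```python
def grad_J(theta):
    """Gradient of the example objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0      # step 1: initialize the parameter
alpha = 0.1      # learning rate

for step in range(1000):
    gradient = grad_J(theta)          # step 2: compute the gradient
    update = alpha * gradient
    theta = theta - update            # step 3: move against the gradient
    if abs(update) < 1e-9:            # step 4: stop when changes are negligible
        break
```

After the loop, theta is very close to 3, the minimizer of the example
objective.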
Types of Gradient Descent:
1. Batch Gradient Descent: Uses the entire dataset to compute the
gradient. It can be slow for large datasets.
2. Stochastic Gradient Descent (SGD): Updates parameters using one data
point at a time, leading to faster iterations but more noisy updates.
3. Mini-batch Gradient Descent: A compromise between batch and
stochastic, it uses a small subset of the dataset to compute the gradient.
Convergence:
Gradient descent may converge to local minima, especially in non-convex
functions. Techniques like momentum, adaptive learning rates (like Adam or
RMSprop), and regularization can help improve convergence and avoid local
minima.
Applications:
Training machine learning models (like neural networks)
Linear regression
Logistic regression
Momentum based GD:
Momentum-based gradient descent is an enhancement of the basic gradient
descent algorithm that helps accelerate convergence and reduce oscillations,
especially in the context of high-dimensional optimization problems.
Key Concepts:
1. Momentum: The idea behind momentum is to keep track of the previous
gradients to smooth out the updates. This is analogous to a physical
object that gains momentum as it moves; it won't stop immediately when
a force (gradient) changes direction.
2. Velocity Update: Instead of updating parameters directly using the
gradient, momentum computes a velocity vector that combines the
current gradient and the previous velocity. This helps to keep moving in
the direction of previous gradients.
Momentum Algorithm Steps:
1. Initialize: Set initial values for parameters θ, learning rate α, and
momentum coefficient β (commonly set between 0.5 and 0.9).
2. Velocity Initialization: Initialize the velocity v to zero or a small value.
3. Gradient Calculation: For each iteration, calculate the gradient of the
objective function with respect to the parameters.
4. Update Velocity:
v=βv+(1−β)∇J(θ)
where ∇J(θ) is the current gradient.
5. Update Parameters:
θ=θ−αv
6. Repeat: Continue this process until convergence.
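The update rules above can be sketched on the same kind of toy problem. The
objective J(θ) = (θ − 3)², the values α = 0.1 and β = 0.9, and the fixed number
of iterations are assumptions made for this example.

```python
def grad_J(theta):
    """Gradient of the example objective J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0      # initial parameter
v = 0.0          # velocity initialized to zero
alpha = 0.1      # learning rate
beta = 0.9       # momentum coefficient

for _ in range(500):
    g = grad_J(theta)
    v = beta * v + (1.0 - beta) * g   # velocity: running average of gradients
    theta = theta - alpha * v         # parameters move along the velocity
```

Note that this follows the (1 − β)-weighted form of the velocity update used in
these notes; some references omit the (1 − β) factor, which changes the
effective step size.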
Benefits:
Faster Convergence: By incorporating past gradients, momentum allows
for larger steps in directions where gradients consistently point, leading to
faster convergence.
Reduced Oscillation: It dampens the oscillations that can occur in
narrow ravines of the loss landscape, improving stability.
Variants:
Nesterov Accelerated Gradient (NAG): An extension of momentum
that looks ahead by computing the gradient at the "lookahead" position
(i.e., the current position plus the momentum) before updating the
velocity. This can provide a more accurate update.
Applications:
Momentum-based gradient descent is particularly effective in training deep
learning models, where the loss landscape can be complex with many local
minima and ravines.
Stochastic GD:
Stochastic Gradient Descent (SGD) is an optimization algorithm commonly
used in machine learning and statistics to minimize a loss function. It is
particularly useful for large datasets since it updates the model parameters more
frequently than traditional gradient descent methods. Here’s a detailed
breakdown of SGD, including its formula and workings:
Overview of Gradient Descent
In traditional gradient descent, the algorithm updates the model parameters
based on the gradient of the loss function computed using the entire dataset. The
update rule for gradient descent is given by:
θ=θ−α∇J(θ)
where:
θ represents the model parameters,
α is the learning rate,
∇J(θ) is the gradient of the loss function J with respect to the parameters
θ.
Stochastic Gradient Descent
In contrast, stochastic gradient descent updates the parameters using only a
single training example (or a small batch) at each iteration. This can lead to
faster convergence and the ability to escape local minima due to the noise
introduced by using a subset of data.
Update Rule
The update rule for SGD is as follows:
θ=θ−α∇J_i(θ)
where:
J_i(θ) is the loss function computed for a single training example i.
∇J_i(θ) is the gradient of the loss with respect to the parameters θ for that
specific example.
Steps in Stochastic Gradient Descent
1. Initialization: Start with random values for the parameters θ.
2. Shuffle the Training Data: This helps in ensuring that the updates are
not biased by the order of the training examples.
3. Iterate Over Training Examples:
o For each training example i:
1. Compute the gradient of the loss function: ∇J_i(θ).
2. Update the parameters: θ=θ−α∇J_i(θ).
4. Repeat: Go through the entire dataset multiple times (epochs) until
convergence.
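The steps above can be sketched for a one-variable linear model. The noise-free
toy data (generated from y = 2x + 1), the squared-error loss per example, the
zero initialization (the notes suggest random values), and the learning rate
are assumptions made for this example.

```python
import random

random.seed(0)  # fixed seed so the shuffling is reproducible

# Toy data generated from y = 2x + 1 (made-up, noise-free).
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w, b = 0.0, 0.0   # step 1: initialize the parameters
alpha = 0.05      # learning rate

for epoch in range(500):              # step 4: several passes over the data
    random.shuffle(data)              # step 2: shuffle the training examples
    for x, y in data:                 # step 3: one example at a time
        error = (w * x + b) - y
        # Gradient of the per-example loss 0.5 * error^2 w.r.t. w and b.
        grad_w = error * x
        grad_b = error
        w -= alpha * grad_w
        b -= alpha * grad_b
```

Each update uses a single example, so the parameters change 7 times per epoch
here instead of once, which is what makes SGD attractive for large datasets.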
Advantages of Stochastic Gradient Descent
Faster Updates: Since it uses one example at a time, it can converge
faster than batch gradient descent.
Better Generalization: The noise in the updates can help the model
escape local minima, potentially leading to better generalization.
Online Learning: SGD can be used for online learning, adapting to new
data points as they arrive.
Disadvantages
Noisy Updates: The updates can be noisy, which may lead to oscillations
and prevent convergence in some cases.
Learning Rate Sensitivity: Choosing an appropriate learning rate is
crucial; a rate that's too high can cause divergence, while one that's too
low can result in slow convergence.
Variants of SGD
To mitigate some of the disadvantages, several variants of SGD have been
developed:
1. Mini-Batch Gradient Descent: Combines the benefits of both SGD and
batch gradient descent by using a small batch of training examples.
2. Momentum: Helps accelerate SGD in the relevant direction and dampens
oscillations by adding a fraction of the previous update to the current
update.
3. Adam (Adaptive Moment Estimation): Combines ideas from
momentum and RMSProp, adapting the learning rate for each parameter
individually.
Regularization:
Regularization is a technique used in machine learning and statistics to prevent
overfitting, which occurs when a model learns the noise in the training data
rather than the underlying patterns. By adding a penalty to the loss function,
regularization helps to constrain the complexity of the model. Here are a few
common types of regularization:
1. L1 Regularization (Lasso): This adds the absolute value of the
coefficients as a penalty to the loss function. It can lead to sparse
solutions, effectively performing feature selection by driving some
coefficients to zero.
2. L2 Regularization (Ridge): This adds the square of the coefficients as a
penalty. It tends to shrink all coefficients toward zero rather than
completely eliminating some, which can help with multicollinearity.
3. Elastic Net: A combination of L1 and L2 regularization, Elastic Net
balances the benefits of both methods, allowing for feature selection
while maintaining some of the benefits of Ridge regression.
4. Dropout: In neural networks, dropout regularization randomly sets a
fraction of the neurons to zero during training, which helps prevent co-
adaptation of features.
5. Early Stopping: This technique involves monitoring the model's
performance on a validation set and stopping training when performance
begins to degrade, thus avoiding overfitting.
Regularization is crucial for building robust models that generalize well to
unseen data.
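As a small worked example of the penalty terms in items 1 and 2, the sketch
below adds an L1 or L2 penalty to a data loss. The weight values, the data loss
of 1.0, and the regularization strength lam = 0.1 are hypothetical numbers
chosen for illustration.

```python
def l1_penalty(weights, lam):
    """L1 (Lasso) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (Ridge) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]   # hypothetical model weights
data_loss = 1.0                   # hypothetical loss on the training data
lam = 0.1                         # regularization strength (hyperparameter)

# Sum of |w| is 4.0, so the L1-regularized loss is about 1.0 + 0.4.
loss_l1 = data_loss + l1_penalty(weights, lam)

# Sum of w^2 is 6.5, so the L2-regularized loss is about 1.0 + 0.65.
loss_l2 = data_loss + l2_penalty(weights, lam)
```

The large weight (-2.0) contributes 2.0 to the L1 sum but 4.0 to the L2 sum,
which is why L2 penalizes large weights more heavily.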
Dropout:
Mechanics of Dropout:
Basic Concept: During training, dropout randomly sets a fraction of the
neurons in a layer to zero (i.e., "drops them out"). This process occurs
independently for each training example and each forward pass.
Dropout Rate: This is the probability of dropping out a neuron, typically
denoted as p. Common values for p range from 0.2 to 0.5. A dropout rate
of 0.5 means that, on average, half of the neurons will be dropped during
training.
Forward Pass: When dropout is applied to a layer, its output is
multiplied by a binary mask. In the common inverted-dropout formulation,
the output of each retained (not dropped) neuron is also multiplied by
1/(1−p) during training to maintain the expected output level.
Backward Pass: During backpropagation, gradients are computed only
for the neurons that were active (not dropped out). This ensures that only
the weights of the active neurons are updated.
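The forward and backward passes described above can be sketched as follows.
This is a minimal inverted-dropout sketch over a plain list of activations; the
function names dropout_forward and dropout_backward and the rng argument are
assumptions of this example, not the API of any particular library.

```python
import random


def dropout_forward(activations, p, training=True, rng=random):
    """Apply inverted dropout with drop probability p (0 <= p < 1)."""
    if not training or p == 0.0:
        # Evaluation mode: every neuron is kept and nothing is rescaled.
        return list(activations), [1.0] * len(activations)
    keep_scale = 1.0 / (1.0 - p)
    # Mask entry is 0 for a dropped neuron and 1/(1-p) for a retained one.
    mask = [0.0 if rng.random() < p else keep_scale for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_backward(upstream_grads, mask):
    """Gradients flow only through retained neurons, with the same scaling."""
    return [g * m for g, m in zip(upstream_grads, mask)]


rng = random.Random(0)            # fixed seed for reproducibility
activations = [1.0] * 10000
out, mask = dropout_forward(activations, p=0.5, rng=rng)

# Roughly half the outputs are 0 and the rest are 2, so the mean stays near 1.
mean_out = sum(out) / len(out)
```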
Implementation in Neural Networks:
Dropout can be applied in different ways:
Fully Connected Layers: Commonly used in dense layers, dropout helps
prevent overfitting by randomly ignoring some neurons during training.
Convolutional Layers: While dropout is less commonly applied to
convolutional layers due to their spatial structure, it can still be used,
typically after the convolutional and activation operations.
Recurrent Neural Networks (RNNs): Dropout can also be applied to
RNNs, although it’s done differently (often between layers and not on the
recurrent connections) to maintain the sequential nature of data.
Variations of Dropout:
Spatial Dropout: This variation is specifically designed for
convolutional layers, where entire feature maps are dropped out instead
of individual neurons. This maintains the spatial structure of the data.
Gaussian Dropout: Instead of dropping out units, this variation multiplies
the outputs of neurons by Gaussian noise. It allows for a smoother
transition during training.
DropConnect: Instead of dropping out the activations of neurons,
DropConnect drops the connections (weights) between neurons during
training, which can be thought of as a form of dropout for weights.
Advantages of Dropout:
Prevention of Overfitting: By ensuring that no single neuron or pathway
becomes overly specialized, dropout encourages more general features to
be learned.
Improved Generalization: Models trained with dropout tend to perform
better on unseen data, as they have learned to operate robustly even with
various neurons inactive.
Ensemble Effect: Dropout effectively trains an ensemble of many
sub-networks with shared weights, as a different combination of neurons
is active on each forward pass.
Practical Considerations:
Hyperparameter Tuning: The dropout rate is a critical hyperparameter.
Common practice is to start with a rate around 0.5 for fully connected
layers and lower values for convolutional layers.
Training Time: While dropout can speed up convergence in some cases
by reducing overfitting, it might increase training time since more epochs
may be needed to reach optimal performance.
When to Use Dropout: It is most beneficial in larger networks where
overfitting is a concern, particularly when the dataset is small relative to
the model complexity.
Evaluation Mode: During testing or validation, dropout is turned off, and
all neurons are used, ensuring that the full capacity of the model is
utilized.
Drop Connect:
DropConnect is a regularization technique used in deep learning to prevent
overfitting, similar to Dropout, but with a different approach. Here’s a detailed
overview of DropConnect:
Concept:
DropConnect involves randomly setting a subset of weights in a neural
network to zero during training. Instead of dropping entire neurons (as in
Dropout), DropConnect drops connections (weights) between neurons.
This means that, during each forward pass, some weights are temporarily
removed from the network, which forces the model to learn more robust
features and prevents reliance on any single connection.
Mechanism
1. Random Weight Selection:
o For each training iteration, a binary mask is generated where each
weight has a probability p of being retained (1) and a probability
1−p of being set to zero (0).
o This mask is applied to the weights of the layer, effectively
removing some connections.
2. Forward Pass:
o The forward pass is conducted using the modified weights. The
output of the layer is computed with the remaining active weights.
3. Backpropagation:
o During backpropagation, the gradients are only computed for the
active weights, and the inactive weights do not contribute to the
update.
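The mechanism above can be sketched for a single fully connected layer. Here
p_keep is the retention probability p from step 1. The function names, the toy
weights, and the absence of any output rescaling are assumptions of this
example; practical implementations may differ.

```python
import random


def dropconnect_forward(x, weights, bias, p_keep, rng):
    """Forward pass of one dense layer with a random mask on the weights.

    weights holds one row of input weights per output neuron.
    """
    # 1 keeps a connection (probability p_keep), 0 removes it.
    mask = [
        [1.0 if rng.random() < p_keep else 0.0 for _ in row] for row in weights
    ]
    outputs = [
        sum(w * m * xi for w, m, xi in zip(row, mask_row, x)) + b
        for row, mask_row, b in zip(weights, mask, bias)
    ]
    return outputs, mask


def dropconnect_weight_grads(x, upstream, mask):
    """Gradient w.r.t. each weight; it is zero wherever the mask is zero."""
    return [
        [g * m * xi for m, xi in zip(mask_row, x)]
        for g, mask_row in zip(upstream, mask)
    ]


rng = random.Random(0)                         # fixed seed for reproducibility
x = [1.0, 2.0, 3.0]                            # toy input
W = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]       # toy weights (2 outputs)
b = [0.0, 0.1]

out, mask = dropconnect_forward(x, W, b, p_keep=0.5, rng=rng)
grads = dropconnect_weight_grads(x, upstream=[1.0, 1.0], mask=mask)
```

Unlike Dropout, where a whole neuron's output is zeroed, each output neuron here
can lose a different subset of its incoming connections.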
Benefits:
Improved Generalization: By reducing the complexity of the model in
each training iteration, DropConnect helps the network to generalize
better to unseen data.
Prevent Overfitting: It mitigates the risk of overfitting, especially in
scenarios with limited training data.
Comparison with Dropout:
Dropout: Drops entire neurons, which means that the outputs from those
neurons are not considered during training.
DropConnect: Drops individual weights, which allows more fine-grained
control over the network's connectivity and can lead to improved
performance in certain architectures.
Implementation:
DropConnect can be implemented by modifying the forward pass of a
layer in a neural network. Many deep learning libraries (like TensorFlow
or PyTorch) allow for custom layers where you can integrate
DropConnect.
The retention probability p needs to be tuned as a hyperparameter, similar
to how you would tune the dropout rate in Dropout.
Use Cases:
DropConnect has been found effective in various tasks, particularly in
deep networks where overfitting is a concern.
It’s useful in scenarios where you want to explore sparsity in model
weights, potentially leading to more efficient networks.
Limitations:
Computational Cost: The stochastic nature of DropConnect can increase
training time, as the model may require more epochs to converge.
Implementation Complexity: Compared to Dropout, it can be slightly
more complex to implement and tune.
Batch Normalization
Batch normalization is a technique used to improve the training of
deep neural networks.
It helps address issues like internal covariate shift, which refers to
the changes in the distribution of network activations during
training.
Need for Batch Normalization:
Batch Normalization extends the concept of normalization from just
the input layer to the activations of each hidden layer throughout the
neural network.
By normalizing the activations of each layer, Batch Normalization helps
to alleviate the internal covariate shift problem, which can hinder the
convergence of the network during training.
The inputs to each hidden layer are the activations from the previous
layer. If these activations are normalized, it ensures that the network is
consistently presented with inputs that have a similar distribution,
regardless of the training stage. This stability in the distribution of
inputs allows for smoother and more efficient training.
By applying Batch Normalization to the hidden layers of the network,
the gradients propagated during backpropagation are less likely to
vanish or explode, leading to more stable training dynamics. This
ultimately facilitates faster convergence and better performance of the
neural network on the given task.
Fundamentals of Batch Normalization:
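A minimal sketch of the standard batch-normalization forward pass for a single
activation across one mini-batch is given below. It assumes the usual
formulation: subtract the batch mean, divide by the batch standard deviation,
then apply a learnable scale (gamma) and shift (beta). The parameter names, the
eps value, and the toy activations are assumptions of this example, and the
running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((v - mean) ** 2 for v in batch) / n
    # eps keeps the division stable when the variance is close to zero.
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in batch]


# Activations of one hidden unit for a mini-batch of four examples (made up).
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

# After normalization the batch has mean close to 0 and variance close to 1.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((z - norm_mean) ** 2 for z in normalized) / len(normalized)
```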
Hyperparameters:
In deep learning, hyperparameters are the parameters that are set before
training a model, and they govern the overall structure and performance of the
model. Unlike parameters (weights and biases) that are learned during training,
hyperparameters must be specified beforehand and are typically tuned through
experimentation to find the optimal configuration for a given task.
Types of Hyperparameters:
Hyperparameter Tuning:
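One simple way to tune a hyperparameter is a grid search: try each candidate
value, train, and keep the value with the lowest loss. The sketch below does
this for the learning rate of gradient descent on the toy objective
J(θ) = (θ − 3)². The objective, the candidate values, and the number of steps
are assumptions chosen for illustration; in practice the comparison would use a
validation loss.

```python
def final_loss(alpha, steps=50):
    """Run gradient descent on J(theta) = (theta - 3)^2 and return final J."""
    theta = 0.0
    for _ in range(steps):
        theta -= alpha * 2.0 * (theta - 3.0)
    return (theta - 3.0) ** 2


# Candidate learning rates (the hyperparameter being tuned).
candidates = [0.001, 0.01, 0.1, 1.1]

# Train once per candidate and record the resulting loss.
results = {alpha: final_loss(alpha) for alpha in candidates}

# Keep the candidate with the lowest loss.
best_alpha = min(results, key=results.get)
```

In this toy setting the smallest rates converge too slowly within 50 steps and
the largest one diverges, so the search selects 0.1.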