Deep Learning c1
● Data Analysts
● Statisticians
● Deep Learning Introduction
Artificial intelligence is the ability of a machine to mimic intelligent human behavior. Machine learning enables a system to learn automatically from experience. Deep learning is a type of machine learning that trains models using deep neural networks and sophisticated algorithms.
Why deep learning
Deep learning excels at identifying intricate patterns in large amounts of data, which is why it performs so well in these settings.
It can handle tasks like speech and image recognition and language translation, and it can play chess and Go at superhuman level.
Deep learning can sometimes surpass conventional machine learning techniques.
Use traditional machine learning when you have a clear understanding of the features that matter or only a limited dataset.
Use deep learning when you have a lot of data and want the model to identify the key patterns and features automatically. Deep learning is especially effective for tasks like speech and image recognition.
Human Neural network
● Synapses: Synapses allow neurons to communicate with one another. Synapses are intersections where one
neuron's axon joins with another's dendrite or cell body. Chemical molecules called neurotransmitters pass
signals between these synapses. Synapses can vary over time in terms of their potency and effectiveness, which
is essential for memory and learning.
● Neural circuits: Neural circuits or networks are made up of neurons. These circuits are in charge of processing
information, carrying out several tasks, and managing body processes. Specific brain functions are linked to
various brain areas. For instance, the occipital lobe is in charge of processing visual data, whereas the frontal
lobe is connected to higher-order cognitive abilities.
● Learning and Memory: Learning and memory are essential capabilities of the human brain's networks. They involve changes in the strength and connectivity of synapses. For long-term memories to form, specific neural pathways often must be stimulated repeatedly and consistently.
Deep Neural network
Input Layer:
Each neuron in the input layer represents a feature or input variable; the input layer is the network's point of entry for the data it processes. The dimensionality of the input data determines the number of input neurons.
● In image classification tasks, each neuron in the input layer might represent a pixel in an image.
● In natural language processing tasks, each neuron might represent a word, a character, or a word embedding.
● In tabular data tasks, each neuron could correspond to a feature or attribute.
Components- Hidden layer
Hidden Layer:
There may be one or more hidden layers between the input layer and the output layer, each made up of many neurons (nodes). These layers are responsible for discovering intricate patterns and features in the incoming data; it is the presence of hidden layers that allows the network to approximate complex, non-linear functions.
Output Layer:
The output layer produces the network's predictions or outputs. The number of neurons in the output layer depends on the nature of the task:
● For binary classification, there is typically one output neuron, with the output representing the probability of
belonging to one class.
● For multi-class classification, there are as many output neurons as there are classes, often using softmax
activation to produce class probabilities.
● For regression tasks, there is typically one output neuron per continuous target variable.
Components- Weights
Note: The vanishing gradient problem is a challenge that arises during the training of deep neural networks, particularly those with many layers. It occurs when the gradients of the
loss function with respect to the model's parameters (weights and biases) become very small as they are propagated backward through the network during the training process.
When gradients become vanishingly small, it can lead to slow convergence and difficulty in training deep networks effectively.
Components- Backpropagation
Backpropagation computes the gradients of the loss function with respect to every weight and bias by applying the chain rule backward through the network. Gradient optimization algorithms, also known as optimizers, then use these gradients to train machine learning models, including neural networks: they aim to find good values of the model's parameters (weights and biases) by iteratively updating them based on the gradients of the loss function with respect to those parameters. Common optimizers include:
● Gradient Descent
● Stochastic Gradient Descent (SGD)
● SGD with Momentum
● Mini-batch Gradient Descent
● Adagrad (Adaptive Gradient Algorithm)
● RMSprop (Root Mean Square Propagation)
● Adam (Adaptive Moment Estimation)
Components- Gradient optimization- GD
Gradient descent is an optimization technique used to update model parameters. It operates by repeatedly moving the parameters in the direction opposite to the gradient, which is the direction of steepest descent of the loss. The objective is to find the set of parameters that minimizes the loss function. Gradient descent is widely used to train neural networks and other machine learning models: starting from an initial guess, it iteratively adjusts the parameters until the function reaches a (local) minimum.
The learning rate is a hyperparameter of gradient descent that governs how large each parameter update is. Computing the gradients is expensive when the dataset is huge. Gradient descent works well for convex functions, but for non-convex functions it can get stuck in local minima and it is not obvious how far to travel along the gradient.
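A minimal sketch of plain (batch) gradient descent for a toy linear-regression problem in NumPy; the data, learning rate, and number of steps are illustrative assumptions rather than values from the slides.

import numpy as np

# toy regression data: y ≈ 3x plus noise (illustrative)
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 0.1 * np.random.randn(100)

w, b = 0.0, 0.0     # model parameters
lr = 0.1            # learning rate (hyperparameter)

for step in range(500):
    error = (w * X[:, 0] + b) - y
    # gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    # step in the direction opposite to the gradient (steepest descent)
    w -= lr * grad_w
    b -= lr * grad_b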
Components- Gradient optimization- SGD
Unlike classical gradient descent, which computes the average gradient over the whole dataset, stochastic gradient
descent adjusts the model parameters at each iteration using random examples or subsets of the training data. This
stochastic feature adds variability to SGD, allowing it to avoid local minima and find better solutions in large datasets.One
of the primary benefits of stochastic gradient descent is its ability to handle large-scale datasets.
Because each iteration uses only a portion of the dataset rather than the whole thing, the path the algorithm takes is noisier than that of full gradient descent, so SGD needs more iterations to reach a local minimum. Total computation time grows as the number of iterations rises, but even after increasing the number of iterations the overall cost is usually still lower than that of the full gradient descent optimizer. As a result, stochastic gradient descent should be chosen over the gradient descent algorithm when the data is large and processing time is a crucial consideration.
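A minimal sketch of how the update loop changes for SGD, continuing the toy linear-regression example from the gradient descent sketch above; the epoch count is an illustrative assumption. Each update uses a single randomly chosen example instead of the full dataset.

for epoch in range(20):
    for i in np.random.permutation(len(y)):      # visit examples in a new random order each epoch
        error_i = (w * X[i, 0] + b) - y[i]
        # noisy gradient estimated from one example; cheap but adds variability
        w -= lr * 2 * error_i * X[i, 0]
        b -= lr * 2 * error_i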
Components- Gradient optimization- SGD with momentum
Because SGD needs many iterations to reach a good minimum, convergence can be slow. Stochastic gradient descent with momentum addresses this: the momentum term helps the loss function converge faster. Plain stochastic gradient descent oscillates between gradient directions as it updates the weights; adding a fraction of the previous update to the current update damps these oscillations and speeds up the process. One thing to keep in mind when using this algorithm is that the learning rate should usually be reduced when a strong momentum term is used.
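A minimal sketch of updates with momentum, again continuing the toy example above; the momentum coefficient 0.9 is a common choice used here as an illustrative assumption.

beta = 0.9
v_w, v_b = 0.0, 0.0                 # "velocity": running mix of past updates

for step in range(500):
    error = (w * X[:, 0] + b) - y
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    # add a fraction of the previous update to the current one
    v_w = beta * v_w + lr * grad_w
    v_b = beta * v_b + lr * grad_b
    w -= v_w
    b -= v_b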
Components- Gradient optimization- Mini batch GD
Mini-batch gradient descent computes the gradient over small batches of training examples rather than a single example (as in SGD) or the entire dataset (as in batch gradient descent). This balances the stability of full-batch updates with the speed and memory efficiency of SGD, and it is the default behaviour in most deep learning frameworks.
Components- Gradient optimization- RMSprop
RMSprop is an adaptive learning rate optimization algorithm designed to get around some of the drawbacks of Adagrad and basic gradient descent. It adapts the learning rate for each parameter separately, based on past gradient information. The fundamental concept is to divide the learning rate by a moving average of the root mean square of previous gradients, so parameters with consistently large gradients take smaller effective steps.
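A minimal sketch of the RMSprop per-parameter update, continuing the same toy example; the decay rate 0.9, epsilon, and learning rate are conventional values used as illustrative assumptions.

lr_rms, rho, eps = 0.01, 0.9, 1e-8
s_w, s_b = 0.0, 0.0                 # moving averages of squared gradients

for step in range(500):
    error = (w * X[:, 0] + b) - y
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    s_w = rho * s_w + (1 - rho) * grad_w ** 2
    s_b = rho * s_b + (1 - rho) * grad_b ** 2
    # divide the learning rate by the root mean square of past gradients
    w -= lr_rms * grad_w / (np.sqrt(s_w) + eps)
    b -= lr_rms * grad_b / (np.sqrt(s_b) + eps)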
Components- Gradient optimization- ADAM
Adaptive Moment Estimation (Adam) is a gradient descent optimization algorithm. It is quite effective on complex problems involving many variables or large amounts of data, works well in practice, and uses minimal memory. Conceptually, it combines the 'gradient descent with momentum' algorithm with the RMSprop algorithm.
Each of those two approaches has its own merits, and Adam relies on both strengths to produce a better-behaved gradient descent. The step size is regulated so that the optimizer can pass over local-minima obstacles with sufficiently large steps while approaching the minimum with as little oscillation as possible. In this way Adam exploits the advantages of the two techniques above to reach a good minimum efficiently.
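In practice these optimizers are rarely hand-coded; a minimal sketch of selecting Adam (with an illustrative learning rate) for a Keras model:

from tensorflow import keras

# Adam combines momentum (first moment) and RMSprop-style scaling (second moment)
optimizer = keras.optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=optimizer, loss='mse')   # attach it to any Keras model when compiling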
Components - Loss function
A loss function is a mathematical function that measures how well a model performs on a given dataset. In deep learning, models are trained by minimizing this loss. Mean squared error (MSE) and cross-entropy loss are the two most commonly used loss functions in deep learning.
Binary Cross-entropy: Binary cross-entropy computes the cross-entropy between the true labels and the predicted outputs. It is used for two-class problems, such as cat vs. dog classification (labels 1 or 0).
Categorical Cross-entropy: The categorical cross-entropy loss function computes the loss between true labels and predicted labels. It is mainly used for multi-class classification problems.
Components - Loss function
Sparse Categorical Cross-entropy: Used when there are two or more classes in the classification task, just like categorical cross-entropy. The one minor difference is that with sparse categorical cross-entropy the labels are expected to be provided as integers rather than one-hot vectors.
Mean Squared Error: MSE measures how close the predictions are to the actual values in a regression task. It is computed by taking the difference between each predicted and actual value and squaring it; squaring removes the sign so that positive and negative errors do not cancel out.
Mean Absolute Error: MAE is computed by taking the absolute difference between each predicted and actual value. Because the errors are not squared, MAE is less sensitive to outliers than MSE, so it is a reasonable choice when the data contains outliers.
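A minimal sketch of how these losses are selected when compiling a Keras model; the layer sizes and input shape are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# binary classification: sigmoid output, binary cross-entropy
model = keras.Sequential([layers.Dense(8, activation='relu', input_shape=(4,)),
                          layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

# other tasks only change the output layer and the loss string:
#   'categorical_crossentropy'        multi-class, one-hot labels, softmax output
#   'sparse_categorical_crossentropy' multi-class, integer labels, softmax output
#   'mse' or 'mae'                    regression, linear output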
ANN
An Artificial Neural Network (ANN) is a computational model inspired by the human brain. It is made up of interconnected nodes (neurons) that process and evaluate data. ANNs learn and make predictions by finding patterns and relationships in the data, and they are frequently employed in tasks like image recognition, language processing, and others that call for the recognition of complicated patterns. During training the network adjusts the weights of its connections, improving over time at making predictions or classifying data. ANNs are essentially a mechanism for computers to simulate human learning and decision-making processes, although in a much more simplified manner.
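The Keras snippets that follow assume TensorFlow's Keras API; the imports below are not shown on the original slides and are needed to run them, and the layer sizes and input shapes are illustrative.

from tensorflow import keras
from tensorflow.keras import layers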
Typical ANN for regression
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)   # single linear output for a continuous target
])
ANN CODE FOR BINARY CLASSIFICATION
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')   # probability of the positive class
])
Typical ANN for MULTICLASS classification
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(5,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(3, activation='softmax')   # one probability per class (3 classes here)
])
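A minimal sketch of compiling and training the multi-class model above; the feature matrix, integer class labels, and training settings are illustrative assumptions.

import numpy as np

X_train = np.random.rand(200, 5)                 # 200 samples, 5 features (illustrative)
y_train = np.random.randint(0, 3, size=200)      # integer labels for 3 classes (illustrative)

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)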
CNN
Convolutional Neural Networks (CNNs), also known as ConvNets, are a subset of deep learning neural networks that
are specifically made for tasks involving the interpretation of spatial and visual data. CNNs are particularly good at
identifying patterns and features in visual data since they are inspired by the human visual system.
The convolution operation is the central process of a convolutional layer. It applies a small filter (also called a kernel) to the incoming data; the region of the input that the filter covers at each position is its receptive field. The filter is a small matrix of weights, frequently 3x3 or 5x5. The convolution operation slides this filter over the input and computes a dot product between the filter weights and each local patch of the input, which is what allows the layer to recognize local patterns.
Stride and Padding: Convolutional Layers can have parameters like "stride" and "padding" that control
how the filter moves over the input data. Stride determines how much the filter shifts with each
operation, while padding adds extra pixels around the input data to control the size of the output
feature maps.
CNN Layer
Pooling layer
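A minimal sketch of a small convolutional stack illustrating the filter size, stride, padding, and pooling ideas described above; the 28x28 grayscale input, layer sizes, and 10 output classes are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), strides=1, padding='same',
                  activation='relu', input_shape=(28, 28, 1)),   # 3x3 filters slide over the image
    layers.MaxPooling2D(pool_size=(2, 2)),                       # pooling downsamples the feature maps
    layers.Conv2D(64, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')                       # e.g. 10 output classes
])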
RNN
● Neural networks act like the human brain for AI, machine learning, and deep learning, helping
computers recognize patterns and solve problems.
● Recurrent Neural Networks (RNNs) are a type of neural network designed for sequence data,
like predicting the next word in a sentence. They work like our brain's memory.
● In regular neural networks, each input or output is independent. But for tasks where the order of
data matters, like in language, we need to remember previous data. That's where RNNs come
in.
● RNNs use a hidden layer to remember past information. The crucial part is the "Hidden State,"
which stores specific information about a sequence.
● RNNs have a "Memory" that keeps track of all calculations. It uses the same rules for each input,
making it handy for tasks where you need to consider what came before.
RNN ARCHITECTURE
One to One: A single input maps to a single output; traditional (feed-forward) neural networks use a one-to-one architecture.
One to Many: In a one-to-many network, a single input produces a sequence of outputs. For instance, music generation uses one-to-many networks.
Many To One: In this case, many inputs from various time steps are combined to produce a single output. Such networks are used for sentiment analysis and emotion recognition, where a sequence of words determines a single class label.
Many To Many: Both the input and the output are sequences, and their lengths may differ (for example, two inputs producing three outputs). Many-to-many networks are used by machine translation systems, such as those that translate from English to French or vice versa.
RNN
model = keras.Sequential()
model.add(layers.SimpleRNN(units=64, activation='relu', input_shape=(10, 3)))  # input shape: (timesteps, features)
BACKPROPAGATION IN RNN
Backpropagation through time (BPTT) is the term used when backpropagation is applied to an RNN whose input is time series data. In a typical RNN, one input is provided at each time step and one output is produced, but backpropagation must take into account both past and present inputs. Conceptually, the network is unrolled across the time steps of the sequence: once it has processed a sequence and produced its outputs, the errors are calculated and accumulated across all time steps, and the network is then rolled back up while the shared weights are updated to take those errors into account.
VANISHING AND EXPLODING GRADIENT
VANISHING GRADIENT:
Vanishing gradients occur when the gradients of the model's parameters become extremely small during training. This can happen when the derivatives of the loss function with respect to the parameters are small and keep shrinking as they are propagated backward through the layers.
EXPLODING GRADIENT:
Exploding gradients happen when the gradients of the model's parameters become extremely large during training. This can occur when those derivatives are large and are compounded through many layers.
For recurrent networks, consider using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures, which are designed to mitigate
vanishing gradient issues.
LSTM
Long Short-Term Memory (LSTM) networks are sequential deep learning networks that allow information to persist over time. They are a specific variety of recurrent neural network that addresses the vanishing gradient issue that RNNs encounter.
RNNs retain prior knowledge and use it when processing the input at hand, but because of the diminishing gradient they have the drawback of being unable to recall long-term dependencies. LSTMs are specifically designed to avoid this long-term dependency problem.
CELL STATE
The cell state, C(t), is the first component of the LSTM and runs through the entire LSTM unit. It is somewhat comparable to a conveyor belt.
The cell state is responsible for remembering and forgetting; what happens depends on the context of the input.
FORGET GATE
The LSTM network architecture consists of three parts, as shown in the image below, and each part performs an
individual function.
Imagine you have a special machine that's like a smart librarian. This librarian keeps track of information from the past
and uses it to make predictions about the future. This machine is called an LSTM.
FORGET GATE: The LSTM has a gate called the "Forget Gate." It decides whether old information from the past is important to remember or is no longer needed and can be forgotten. It is like deciding whether old library books are still useful.
A sigmoid layer makes this decision; it is called the "forget gate layer."
INPUT GATE
INPUT GATE: The LSTM has another gate called the "Input Gate." This gate helps it learn new information from
what's happening right now. It's like reading a new book and adding its knowledge to the library. The input gate
provides the LSTM with fresh data and makes the decision of whether or not to store that data in the cell state.
A sigmoid layer, called the "input gate layer," decides which values will be updated.
A layer with a tanh activation function generates a vector of candidate new values for the state, C̃(t).
The cell state is then updated using the product of these two outputs, i(t) * C̃(t).
The outputs of the forget and input gates are combined to create the new cell state: C(t) = f(t) * C(t-1) + i(t) * C̃(t).
OUTPUT GATE
OUTPUT GATE: The last gate is the "Output Gate." It takes all the remembered and newly learned information and passes it on to the future. It's like sharing the library's knowledge with others.
First, a sigmoid layer decides what parts of the cell state we're going to output. Then, a tanh layer is used on the cell state to squash the values between -1 and 1, which is finally multiplied by the sigmoid gate output.
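A minimal sketch of an LSTM model in Keras for a sequence task; the sequence length of 10, the 3 features per step, and the single regression output are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(units=64, input_shape=(10, 3)),   # 10 time steps, 3 features; the gates are handled internally
    layers.Dense(1)                               # e.g. predict the next value of the series
])
model.compile(optimizer='adam', loss='mse')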
GRU
A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture that is designed to address the
vanishing gradient problem while efficiently capturing dependencies in sequential data. It is similar to the more
complex Long Short-Term Memory (LSTM) network but has a simplified structure with fewer gating mechanisms.
Hidden State: Like an LSTM, a GRU maintains a hidden state, which acts as the network's memory and carries
information from one time step to the next. This hidden state can capture dependencies in the data.
Reset Gate: The reset gate in a GRU controls what information from the previous time step should be forgotten
or reset. It helps the network determine which past information to consider in the current time step.
Update Gate: The update gate determines what new information should be added to the current hidden state. It
influences how much of the new input data should be remembered and incorporated into the hidden state.
GRU
model = keras.Sequential()
model.add(layers.GRU(units=64, input_shape=(10, 3)))  # input shape: (sequence_length, number_of_features)
TRANSFER LEARNING
Transfer learning is a powerful technique in deep learning. By reusing existing models and the expertise they have acquired, transfer learning makes it possible to train deep neural networks for new tasks even with little data. This matters in practice because real applications frequently lack enough labeled data.
Through transfer learning, the expertise of a machine learning model that has already been trained is applied to a different but closely related problem. For instance, if you trained a straightforward classifier to determine whether a picture contains a backpack, you could reuse that knowledge to help recognize other objects, such as sunglasses.
Because training from scratch requires a massive amount of computational power, transfer learning is typically applied in computer vision and natural language processing tasks like sentiment analysis.
TRANSFER LEARNING
In transfer learning, you begin with a pre-trained model that has already been trained on a sizable dataset for a particular task, usually a generic or related one. These pre-trained models are typically deep neural networks, such as convolutional neural networks (CNNs) for image tasks or recurrent neural networks (RNNs) for text tasks.
FEATURE EXTRACTION:
The first stage of transfer learning uses the pre-trained model as a feature extractor. You remove the output layer, which produced the predictions for the original task, and keep the remaining layers. You are then left with a model that can extract useful features from data.
TRANSFER LEARNING
FINE TUNING:
Next, you add a fresh output layer to the pre-trained model. This new output layer is specific to the task you want to perform: if you are using a pre-trained image model for a new image classification task, the output layer contains the new class labels you wish to predict. This new layer is initialized randomly.
TRAINING:
The target dataset, which is typically smaller than the original dataset used to train the pre-trained model, is
then utilized to train the modified model. The theory behind this is that the attributes that the pre-trained
model acquired and found useful for the initial task may also prove effective for your current work. While
maintaining the pre-trained weights fixed or fine-tuning them with a slower learning rate, the training process
modifies the parameters of the new output layer.
TRANSFER LEARNING
KNOWLEDGE TRANSFER:
The pre-trained model applies its understanding of general aspects to your particular assignment during
training. If you have little data for the new task, this can considerably speed up training and enhance the
performance of your model.
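A minimal sketch of feature extraction plus a new output layer in Keras, using MobileNetV2 as an example pre-trained base; the input size and the 5 target classes are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.MobileNetV2(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pre-trained weights (feature extraction)

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation='softmax')     # new, randomly initialized output layer (5 classes assumed)
])
model.compile(optimizer=keras.optimizers.Adam(1e-4), loss='sparse_categorical_crossentropy')
# For fine-tuning, later set base.trainable = True and re-compile with an even lower learning rate.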
SSD (Single Shot Detector)
Dividing the Image: Imagine your image is divided into grids of different sizes, like squares on a checkerboard. Each square is called a "cell."
Predictions in Each Cell: In each cell, SSD tries to make predictions. It wants to figure out if there's an object in that cell and, if so, where the
object is and what it is. To do this, it predicts:
● The likelihood that there's an object in the cell.
● The coordinates of a box that tightly surrounds the object (bounding box).
● The class of the object (e.g., cat, dog, car).
Different Box Sizes: SSD doesn't just predict one size of the bounding box. It predicts multiple bounding box sizes in each cell. This helps it find
objects of different sizes.
Confidence Score: For each predicted bounding box, SSD also gives a "confidence score." This score tells us how sure it is that the box contains
an object.
Non-Maximum Suppression: SSD looks at all the predicted boxes across all the cells and removes those with low confidence scores. It keeps
only the boxes that are very likely to contain an object.
Final Output: The remaining bounding boxes, along with their associated class labels and confidence scores, are the final output of the SSD
algorithm. These are the objects it found in the image.
SSD
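A minimal sketch of the non-maximum suppression step described above, in plain NumPy; the box format, thresholds, and scores are illustrative assumptions rather than the output of a real SSD model.

import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5, score_thresh=0.4):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores."""
    keep = []
    idxs = np.argsort(scores)[::-1]                  # highest confidence first
    idxs = idxs[scores[idxs] > score_thresh]         # drop low-confidence boxes
    while len(idxs) > 0:
        current = idxs[0]
        keep.append(current)
        rest = idxs[1:]
        # intersection-over-union between the kept box and the remaining boxes
        x1 = np.maximum(boxes[current, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[current, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[current, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[current, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_c = (boxes[current, 2] - boxes[current, 0]) * (boxes[current, 3] - boxes[current, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_c + area_r - inter)
        idxs = rest[iou < iou_thresh]                # discard boxes that overlap the kept box too much
    return keep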
Autoencoder
Context Vector:
The context vector encodes the information from the whole input sequence and acts as the decoder's starting point.
It captures the context and meaning of the input sequence, enabling the decoder to produce a meaningful output.
Autoencoder
Decoder (based on LSTM):
Another LSTM network serves as the decoder, using the context vector to produce the output sequence (for example, a translation into a different language).
Like the encoder, it can include one or more LSTM layers.
It generates the output step by step, producing each token from its previous hidden state and the previously predicted token.
Training:
Training involves feeding the encoder the input sequence and training the decoder to generate the corresponding
output sequence.
The goal of training is to minimize the difference (the loss) between the predicted sequence and the target sequence.
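A minimal sketch of an LSTM encoder-decoder of the kind described above, using the Keras functional API; the vocabulary sizes and latent dimension are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 64
num_encoder_tokens, num_decoder_tokens = 100, 120     # assumed vocabulary sizes

# Encoder: its final hidden and cell states act as the context vector
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
context = [state_h, state_c]

# Decoder: starts from the context vector and emits the output sequence token by token
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_seq = layers.LSTM(latent_dim, return_sequences=True)(decoder_inputs, initial_state=context)
decoder_outputs = layers.Dense(num_decoder_tokens, activation='softmax')(decoder_seq)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')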
How to choose models for regression
Linear Regression:
● Use when there is a linear relationship between the independent and dependent variables.
● Suitable for simple regression tasks with continuous numerical data.
● Provides interpretable coefficients for each predictor.
Polynomial Regression:
● Appropriate when the relationship between variables is not strictly linear.
● It extends linear regression by including polynomial terms to capture curvature in the data.
Ridge and Lasso Regression:
● Useful for addressing multicollinearity (high correlations between predictors) and preventing overfitting.
● Ridge adds a penalty term to the linear regression loss function, while Lasso adds a penalty and performs variable selection.
Decision Trees and Random Forests:
● Suitable for non-linear relationships and when the relationship between variables is complex.
● Decision trees are easy to understand, while random forests are ensembles of decision trees that improve predictive
performance.
Support Vector Regression (SVR):
● Appropriate when there is non-linearity and you want to minimize the prediction error within a certain margin.
● Useful when you need to handle outliers effectively.
How to choose model for regression
Gradient Boosting (e.g., XGBoost, LightGBM):
● Effective for improving prediction accuracy through boosting techniques.
● Often used in Kaggle competitions and other data science competitions.
Neural Networks (Deep Learning):
● Suitable for complex, high-dimensional data with non-linear relationships.
● Particularly effective when dealing with unstructured data, like images, text, or time series.
Elastic Net Regression:
● Combines L1 (Lasso) and L2 (Ridge) regularization, providing a balance between feature selection and handling
multicollinearity.
Bayesian Regression:
● Useful when you have prior knowledge about the relationships between variables and want to incorporate that into the
modeling process.
K-Nearest Neighbors (KNN) Regression:
● Appropriate when you want to predict a continuous outcome based on the average of the values of its k-nearest neighbors.
The choice of a regression model should be based on a combination of factors, including the nature of your data, the relationships you expect, the complexity of the problem, and
the interpretability of the model. It's often a good practice to start with a simple model (e.g., linear regression) and then experiment with more complex models if needed.
Cross-validation and model evaluation are also important to determine which model performs best for your specific problem.
How to choose model for classification
Logistic Regression:
● Suitable for binary classification problems.
● Provides probabilities of class membership.
● Interpretable, easy to understand, and a good starting point.
K-Nearest Neighbors (KNN):
● Effective for simple and flexible classification.
● Good for small to medium-sized datasets.
● Works well when decision boundaries are nonlinear or not well-defined.
Decision Trees and Random Forests:
● Decision trees are simple and interpretable but can overfit.
● Random forests are ensembles of decision trees that reduce overfitting and improve performance.
● Useful when interpretability is important.
How to choose model for classification
Support Vector Machine (SVM):
● Effective for binary and multi-class classification.
● Works well with high-dimensional data.
● Good for well-separated classes and complex decision boundaries.
Naive Bayes:
● Particularly useful for text classification (e.g., spam detection, sentiment analysis).
● Assumes that features are conditionally independent within each class.
Gradient Boosting (e.g., XGBoost, LightGBM):
● Powerful ensemble techniques that perform well in many scenarios.
● Often used in data science competitions for their high accuracy.
Neural Networks (Deep Learning):
● Effective for complex, high-dimensional data and large datasets.
● Can handle image and text data, among others.
● Require substantial computational resources and data.
How to choose model for classification
Ensemble Methods:
● Combining multiple base models (e.g., bagging, boosting, stacking) often improves classification performance.
● Can be used with a variety of base classifiers.
Nearest Centroid Classifier:
● Simple classifier that assigns data points to the class with the nearest centroid.
● Suitable for situations where features have a Gaussian distribution.
Multi-Layer Perceptron (MLP):
● A type of neural network that can be used for both binary and multi-class classification.
● Allows for deep learning with multiple hidden layers.
The choice of a classification model should consider factors such as the nature of the data, the number of classes, the presence of class
imbalance, interpretability, and computational resources. It's often a good practice to start with simpler models like logistic regression and
decision trees and then explore more complex models as needed. Proper model evaluation, including cross-validation, is essential to determine which model performs best for your specific problem.
Metrics for regression
Explained Variance Score:
● This metric measures the proportion of the variance in the dependent variable that the model accounts for.
● It's similar to R² and can range from 0 to 1.
Max Error:
● Max Error measures the largest absolute error between predicted and actual values.
● It is useful for identifying the worst-case scenario in terms of prediction error.
Median Absolute Error:
● Median Absolute Error calculates the median of the absolute differences between predicted and actual
values.
● It is robust to outliers and can provide a more stable measure of central tendency than mean-based
metrics.
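A minimal sketch of computing these regression metrics with scikit-learn; y_true and y_pred are illustrative placeholder values.

from sklearn.metrics import explained_variance_score, max_error, median_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]      # actual values (illustrative)
y_pred = [2.8, 5.4, 2.9, 6.1]      # model predictions (illustrative)

print(explained_variance_score(y_true, y_pred))   # proportion of variance explained
print(max_error(y_true, y_pred))                  # worst-case absolute error
print(median_absolute_error(y_true, y_pred))      # robust to outliers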
Metrics for classification
Accuracy:
● Accuracy measures the proportion of correctly classified instances out of the total instances.
● Suitable for balanced datasets where all classes have similar importance.
● May not be suitable for imbalanced datasets, as high accuracy can be achieved by simply predicting the majority
class.
Precision:
● Precision measures the proportion of true positive predictions (correctly predicted positive instances) out of all
positive predictions (true positives + false positives).
● Useful when minimizing false positives is a priority (e.g., spam email detection).
Recall (Sensitivity or True Positive Rate):
● Recall measures the proportion of true positive predictions out of all actual positive instances (true positives +
false negatives).
● Relevant when minimizing false negatives is important (e.g., medical diagnosis).
F1-Score:
● The F1-Score is the harmonic mean of precision and recall and provides a balance between the two metrics.
● Particularly useful when precision and recall need to be considered together.
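A minimal sketch of computing these classification metrics with scikit-learn; the labels are illustrative placeholder values.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]        # actual class labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]        # predicted class labels (illustrative)

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall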
Metrics for classification