DL (Unit I)
Linear algebra is a branch of mathematics that deals with vector spaces, linear
transformations, and systems of linear equations. It plays a fundamental role in many areas of
science and engineering, including deep learning. Understanding linear algebra is essential
for comprehending the underlying concepts and operations involved in various deep learning
algorithms and models.
Here are some key concepts related to linear algebra in the context of deep learning:
Scalars: Scalars are single values that do not have any direction or orientation. In deep
learning, scalars are typically used to represent constants or individual data points. They are
denoted by lowercase letters (e.g., a, b, c).
Vectors: Vectors are one-dimensional arrays of values, and they have both magnitude and
direction. In deep learning, vectors are commonly used to represent features or data points.
Vectors are denoted by bold lowercase letters (e.g., v, x, y). When we need to explicitly
identify the elements of a vector, we write them as a column enclosed in square brackets:
x = [x1, x2, …, xn]
Matrices: Matrices are two-dimensional arrays of numbers arranged in rows and columns.
They are used to represent linear transformations and store data in many deep learning
algorithms. Matrices are denoted by bold uppercase letters (e.g., A, B, X).
Tensors: Tensors are generalizations of vectors and matrices and can have multiple
dimensions. Scalars are zero-dimensional tensors, vectors are one-dimensional tensors,
matrices are two-dimensional tensors, and so on. In deep learning, tensors are the primary
data structure used to represent and process data, including images, text, and audio. Tensors
are denoted by bold uppercase letters with subscripts as needed (e.g., T or T_{i,j,k}).
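As a quick illustration (using NumPy, which is not mentioned in these notes but is the usual tool), scalars, vectors, matrices, and higher-order tensors can all be represented as arrays that differ only in their number of dimensions:

import numpy as np

a = np.array(3.5)                       # scalar: 0-dimensional tensor
v = np.array([1.0, 2.0, 3.0])           # vector: 1-dimensional tensor
A = np.array([[1.0, 2.0], [3.0, 4.0]])  # matrix: 2-dimensional tensor
T = np.zeros((2, 3, 4))                 # 3-dimensional tensor (e.g., a tiny image batch)

print(a.ndim, v.ndim, A.ndim, T.ndim)   # 0 1 2 3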
Dot Product: The dot product is an operation that takes two vectors and produces a scalar. It
measures the similarity between the two vectors by taking the sum of the products of their
corresponding elements.
Matrix Multiplication: Matrix multiplication is a fundamental operation in deep learning
that involves combining rows and columns of matrices to produce new matrices. It is
essential in operations such as feedforward and backpropagation in neural networks.
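A minimal sketch of both operations in NumPy (the array values below are arbitrary examples):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(np.dot(v, w))          # dot product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # shape (2, 2)
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])   # shape (2, 2)
print(A @ B)                 # matrix product: rows of A combined with columns of B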
Eigenvalues and Eigenvectors: Eigenvalues and eigenvectors are important concepts in linear
algebra used in various algorithms, such as Principal Component Analysis (PCA) and
eigendecomposition.
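For example, NumPy's linear algebra routines can compute an eigendecomposition directly (the matrix below is just an illustrative example):

import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)   # [2. 3.]
print(eigenvectors)  # each column is an eigenvector of M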
Deep learning heavily relies on these concepts to build and train complex models that can
learn and generalize from vast amounts of data. Matrices and tensors are central to
representing the weights, biases, and activations of neural networks, and linear algebra
provides the mathematical foundation for optimizing and manipulating these structures
during the training process.
2. Probability Distributions:
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability mass function, and the reader must infer which PMF to use based on the identity of the random variable rather than on the name of the function: P(x) is usually not the same as P(y). The probability mass function maps a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes, to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x). Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.
An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring. A PMF must also satisfy
∑_{x∈x} P(x) = 1.
We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring. For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, that is, make each of its states equally likely, by setting its PMF to P(x = x_i) = 1/k for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that ∑_i P(x = x_i) = ∑_i 1/k = k/k = 1, so the distribution is properly normalized.
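A quick numerical check of this uniform PMF example (the value of k below is an arbitrary choice):

import numpy as np

k = 6                            # number of discrete states (assumed value)
pmf = np.full(k, 1.0 / k)        # P(x = x_i) = 1/k for every state
print(pmf.sum())                 # 1.0, so the distribution is normalized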
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:
o The domain of p must be the set of all possible states of x.
o For every x in the domain, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
o ∫ p(x) dx = 1.
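As a small sketch (the interval [0, 2] is an arbitrary choice), a uniform density p(x) = 1/(b − a) on [a, b] satisfies these properties, which can be checked numerically with a simple Riemann sum:

import numpy as np

a, b = 0.0, 2.0
xs = np.linspace(a, b, 10001)
p = np.full_like(xs, 1.0 / (b - a))   # uniform density: p(x) = 1/(b - a) on [a, b]
dx = xs[1] - xs[0]
print(np.sum(p) * dx)                 # approximately 1, confirming the integral property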
Gradient-based Optimization:
• Gradient-based Optimization
• Stationary points, Local minima
• Second Derivative
• Convex Optimization
• Lagrangian
Gradient-Based Optimization
f(x) = (1/2) ||Ax − b||²
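A minimal gradient-descent sketch for this objective (the matrix A, vector b, step size, and iteration count are all arbitrary illustrative choices; the gradient used is grad f(x) = A^T (Ax - b)):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0])
x = np.zeros(2)                 # starting point
lr = 0.1                        # step size (learning rate)

for _ in range(200):
    grad = A.T @ (A @ x - b)    # gradient of f(x) = 0.5 * ||Ax - b||^2
    x = x - lr * grad           # move against the gradient

print(x)                        # approaches the solution of Ax = b, i.e. [2, 3]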
Calculus in Optimization
Overfitting and Underfitting:
Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of machine learning models.
Before understanding overfitting and underfitting, let's understand some basic terms that will help in understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error introduced in the model due to oversimplifying the machine learning algorithm; it is the difference between the predicted values and the actual values.
o Variance: Variance measures how much the model's predictions change when different training data are used. If the model performs well on the training dataset but does not perform well on the test dataset, it has high variance.
Overfitting:
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model: the longer training continues, the higher the chance of producing an overfitted model.
Example: The concept of overfitting can be understood from the graph of a linear regression output in which the fitted curve passes through every point of the scatter plot. The model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of a regression model is to find the best-fit line; here we have not found the best fit, so the model will generate prediction errors on new data.
Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, so there are several ways by which we can reduce its occurrence in our model (an early-stopping sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
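As a rough sketch of one of these remedies, early stopping halts training once the validation error stops improving for a set number of epochs; the error values and the patience setting below are made up purely for illustration:

# Hypothetical early-stopping sketch: stop when validation error stops improving.
patience = 3                     # how many non-improving epochs to tolerate (assumed value)
best_val_error = float("inf")
epochs_without_improvement = 0

# Made-up validation errors per epoch, purely for illustration.
validation_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.54, 0.60]

for epoch, val_error in enumerate(validation_errors):
    if val_error < best_val_error:
        best_val_error = val_error
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Early stopping at epoch {epoch}")
        break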
Underfitting:
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it has reduced accuracy and produces unreliable predictions.
Example: Underfitting can be understood from the output of a linear regression model whose fitted line is too simple: the model is unable to capture the data points present in the plot.
Goodness of Fit:
The term "goodness of fit" is taken from statistics, and the goal of a machine learning model is to achieve a good fit. In statistical modeling, it describes how closely the predicted values match the true values of the dataset.
The model with a good fit lies between the underfitted and overfitted models; ideally, it makes predictions with zero error, but in practice this is difficult to achieve.
As we train our model for some time, the errors on the training data go down, and the same initially happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset then start increasing, so the point just before the test error begins to rise is the sweet spot, and we can stop training there to obtain a good model.
Hyperparameters:
Hyperparameters are defined as the parameters that are explicitly set by the user to control the learning process.
Here the prefix "hyper" suggests that the parameters are top-level parameters that are used in
controlling the learning process. The value of the Hyperparameter is selected and set by the
machine learning engineer before the learning algorithm begins training the model. Hence,
these are external to the model, and their values cannot be changed during the training
process.
Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model changes in response to the estimated error each time the model's weights are updated. It is one of the most important settings when building a neural network. Selecting a good learning rate is challenging: if the learning rate is very small, it may slow down the training process, while if the learning rate is too large, the optimizer may overshoot and fail to optimize the model properly.
Batch Size: To enhance the speed of the learning process, the training set is divided into smaller subsets known as batches; the batch size is the number of training examples in each subset.
Number of Epochs: An epoch can be defined as one complete pass of the entire training set through the learning algorithm.
Epochs represent the iterative nature of the learning process. The number of epochs varies from model to model, and most models are trained for more than one epoch. To determine the right number of epochs, the validation error is taken into account.
The number of epochs is increased as long as the validation error keeps decreasing. If the validation error shows no improvement over several consecutive epochs, that is a signal to stop increasing the number of epochs.
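The sketch below shows where these three hyperparameters typically appear in a training loop; the dataset, the simple linear model, and the specific values are all hypothetical placeholders:

import numpy as np

learning_rate = 0.01     # hyperparameter: step size for each weight update (assumed value)
batch_size = 32          # hyperparameter: examples per parameter update (assumed value)
num_epochs = 5           # hyperparameter: full passes over the training set (assumed value)

X = np.random.randn(320, 4)    # hypothetical training inputs
y = np.random.randn(320)       # hypothetical training targets
w = np.zeros(4)                # weights of a simple linear model

for epoch in range(num_epochs):                    # one epoch = one pass over the data
    for start in range(0, len(X), batch_size):     # iterate over batches
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = xb.T @ (xb @ w - yb) / len(xb)      # gradient of the mean squared error
        w -= learning_rate * grad                  # weight update scaled by the learning rate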
Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:
o Number of Hidden Units: Hidden units are part of neural networks; they refer to the components comprising the layers of processors between the input and output units of the network.
It is important to specify the number of hidden units as a hyperparameter for the neural network. A common rule of thumb is that it should lie between the size of the input layer and the size of the output layer; more specifically, roughly 2/3 of the size of the input layer plus the size of the output layer.
Validation sets:
In machine learning, a validation set is a separate dataset used to assess the performance of a
trained model during the training process. It plays a crucial role in model development and
helps prevent overfitting, which occurs when a model becomes too specific to the training
data and performs poorly on new, unseen data.
The typical process of using a validation set in machine learning involves the following steps:
Training Set: The original dataset is divided into two parts: the training set and the
validation set. The training set is used to train the model, i.e., the model learns the patterns
and relationships present in the data.
Validation Set: The validation set is used to evaluate the model's performance during
training. It contains examples that the model has not seen during training, simulating how the
model would perform on new, unseen data. This allows us to gauge how well the model
generalizes to new data and helps us tune hyperparameters.
Hyperparameter Tuning: During the training process, the model might have
hyperparameters that need to be optimized for best performance. These hyperparameters
cannot be learned directly from the data but rather affect the learning process.
Hyperparameter tuning involves trying different combinations of hyperparameters and
selecting the ones that give the best results on the validation set.
Model Selection: After training and hyperparameter tuning, various models with different
hyperparameters might be produced. The model that performs best on the validation set is
typically chosen as the final model.
Test Set: Once the final model is selected, it should be evaluated on a separate test set that
the model has never seen before. This provides an unbiased estimate of the model's
performance on completely new data. The test set should be different from both the training
and validation sets to ensure the evaluation is fair and unbiased.
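A common way to carve out these three splits, sketched with scikit-learn (the 60/20/20 proportions, the random seed, and the toy data are arbitrary choices, and scikit-learn itself is not mentioned in these notes):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100, 5)    # hypothetical features
y = np.random.randn(100)       # hypothetical targets

# First split off a held-out test set (20%), then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20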
By using a validation set, machine learning practitioners can make informed decisions about
their model's performance and improve its generalization capabilities. It helps in determining
the right balance between underfitting (where the model is too simple to capture the patterns)
and overfitting (where the model memorizes the training data and fails to generalize).
Estimators:
An estimator is an algorithm or model that learns from data in order to estimate (predict) a quantity of interest. Estimators are commonly used for various tasks, such as regression (predicting continuous values) and classification (predicting discrete class labels). The specific type of estimator used depends on the nature of the problem at hand.
Linear Regression: A simple regression algorithm that models the relationship between the
input features and the target variable as a linear equation. It aims to find the best-fit line that
minimizes the distance between predicted and actual values.
Decision Trees: A non-linear supervised learning algorithm used for both regression and
classification tasks. It creates a tree-like structure where each internal node represents a
decision based on a feature, and each leaf node corresponds to a class label or a continuous
value.
Random Forest: An ensemble learning method that combines multiple decision trees to
improve accuracy and robustness. It works by aggregating predictions from individual
decision trees.
Support Vector Machines (SVM): A powerful classification algorithm that finds the optimal
hyperplane in a high-dimensional feature space to separate different classes.
K-Nearest Neighbors (KNN): A simple non-parametric algorithm used for classification and
regression tasks. It makes predictions based on the majority class or average value of the k-
nearest data points in the feature space.
Neural Networks: Deep learning models that consist of multiple interconnected layers of
artificial neurons. They are used for various complex tasks, including image recognition,
natural language processing, and speech recognition.
Naive Bayes: A probabilistic algorithm based on Bayes' theorem, commonly used for
classification tasks, especially in text categorization and spam filtering.
Gradient Boosting Machines: An ensemble learning technique that builds multiple weak
learners (usually decision trees) sequentially, with each learner focusing on the mistakes of its
predecessor.
These are just a few examples of estimators used in machine learning. The choice of estimator depends on the problem domain, the nature of the data, and the desired performance of the model. It's essential to understand the characteristics of different estimators to select the most appropriate one for a particular task.
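As a minimal example of the estimator interface in scikit-learn (the tiny dataset below is fabricated for illustration), most estimators follow the same fit/predict pattern:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy inputs
y = np.array([2.0, 4.0, 6.0, 8.0])           # toy targets (y = 2x)

model = LinearRegression()     # an estimator
model.fit(X, y)                # learn parameters from the data
print(model.predict([[5.0]]))  # approximately [10.]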
Errors in Machine Learning:
In machine learning, prediction errors can be divided into two types:
o Reducible errors: These errors can be reduced to improve the model's accuracy. Such errors can further be classified into bias and variance.
o Irreducible errors: These errors will always be present in the model regardless of which algorithm has been used. The cause of these errors is unknown variables whose values can't be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error, or error due to bias. It can be defined as the inability of machine learning algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from assumptions in the model that make the target function simpler to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
Generally, a linear algorithm has high bias, which allows it to learn fast. The simpler the algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours, and Support Vector Machines. At the same time, algorithms with high bias include Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
High bias mainly occurs due to an overly simple model. It can be reduced by using a more complex model, adding more input features, or decreasing the regularization term.
What is Variance?
Variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well on the training dataset but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.
Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance include Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
There are four possible combinations of bias and variance:
1. Low-Bias, Low-Variance: The combination of low bias and low variance represents an ideal machine learning model. However, it is practically not achievable.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.
A high-bias (underfitted) model can typically be identified by a high training error, with the test error almost similar to the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need both low variance and low bias. But this is not fully achievable, because bias and variance are related to each other: decreasing bias tends to increase variance, and decreasing variance tends to increase bias.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance
between bias and variance errors.
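This trade-off is often summarized by the standard decomposition of the expected squared prediction error (the formula below is the textbook identity, added here for reference rather than taken from these notes):

\[
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \left(\mathrm{Bias}[\hat{f}(x)]\right)^2
  + \mathrm{Var}[\hat{f}(x)]
  + \sigma^2
\]

where \(\sigma^2\) is the irreducible error.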
Stochastic Gradient Descent (SGD):
One thing to note is that, since SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minimum because of the randomness in its descent. Even though it requires more iterations to reach the minimum than typical Gradient Descent, it is still computationally much less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
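A bare-bones SGD sketch for linear regression (the data, learning rate, and epoch count are all assumed illustrative values), updating the weights from one shuffled example at a time:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # hypothetical inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # targets with a little noise

w = np.zeros(3)
lr = 0.05                                     # learning rate

for epoch in range(20):
    for i in rng.permutation(len(X)):         # visit examples in random order
        error = X[i] @ w - y[i]               # prediction error for a single example
        w -= lr * error * X[i]                # update using only this one example

print(w)                                      # close to [1.0, -2.0, 0.5]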
Advantages of Stochastic Gradient Descent
Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient
Descent and Mini-Batch Gradient Descent since it uses only one example to update the
parameters.
Memory Efficiency: Since SGD updates the parameters for each training example one at a
time, it is memory-efficient and can handle large datasets that cannot fit into memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local minima and move toward better minima.
Disadvantages of Stochastic Gradient Descent:
Noisy updates: The updates in SGD are noisy and have a high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it
updates the parameters for each training example one at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since
using a high learning rate can cause the algorithm to overshoot the minimum, while a low
learning rate can make the algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.
Challenges Motivating Deep Learning:
Motivating deep learning can be challenging due to various factors. Some of the key challenges are:
Complexity and Abstraction: Deep learning models are highly complex, consisting of
multiple layers and millions of parameters. Understanding and reasoning about these
models can be daunting, making it difficult for individuals to stay motivated.
Data Requirements: Deep learning models thrive on vast amounts of data. Acquiring,
cleaning, and preprocessing such data can be time-consuming and challenging, especially in
domains with limited or messy data.
Overfitting and Underfitting: Finding the right balance between overfitting and
underfitting can be challenging. Without proper regularization techniques, models may fail
to generalize well to unseen data, affecting motivation to continue.
Research and Advancements: The field of deep learning is constantly evolving with new
research papers, algorithms, and architectures being published regularly. Keeping up with
the latest developments can be overwhelming for learners.
Uninterpretable Black-box Models: Deep learning models are often considered black
boxes due to their lack of interpretability. This can be demotivating for individuals who
prefer to understand the reasoning behind model decisions.
Long Training Times: Training complex deep learning models can take a substantial
amount of time, ranging from hours to days or even weeks. Waiting for results can be
frustrating and might lead to a loss of motivation.
Despite these challenges, many find deep learning rewarding because of its ability to
solve complex problems, process vast amounts of data, and achieve state-of-the-art results
in various domains. Overcoming these obstacles often requires perseverance, a strong
support network, and a genuine interest in the field, as deep learning can be an exciting and
transformative area of study.
Deep Networks:
Deep networks, also known as deep neural networks, are a class of artificial neural
networks that are characterized by multiple layers of interconnected nodes, or neurons.
These networks are designed to process and transform complex patterns in data, making
them well-suited for various machine learning tasks, including image recognition, natural
language processing, speech recognition, and more.
The term "deep" in deep networks refers to the depth of the network, which is
defined by the number of hidden layers between the input and output layers. Traditional
neural networks, with only one or two hidden layers, are shallow in comparison. Deep
networks have a significantly greater number of hidden layers, often ranging from a few
dozen to hundreds or even thousands of layers.
Input Layer: The input layer receives the raw data and passes it on to the subsequent
layers.
Hidden Layers: These are the intermediate layers between the input and output layers,
where most of the computations take place. Each neuron in a hidden layer receives input
from the previous layer and produces an output for the next layer.
Weights and Biases: Each connection between neurons in adjacent layers has an associated
weight, which determines the strength of the connection. Additionally, each neuron in a
hidden layer has an associated bias, which allows the network to learn more complex
patterns.
Activation Function: Each neuron typically applies an activation function to its weighted
sum of inputs and bias, introducing non-linearity into the model. Popular activation
functions include ReLU (Rectified Linear Unit), sigmoid, and tanh.
Output Layer: The final layer of the network produces the model's output based on the
transformed information from the hidden layers. The activation function used in the output
layer depends on the nature of the task, e.g., softmax for multiclass classification.
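A tiny forward pass through such a network, sketched in plain NumPy (the layer sizes, random weights, and the use of ReLU and softmax are illustrative assumptions, not a prescribed architecture):

import numpy as np

def relu(z):                       # activation function for the hidden layer
    return np.maximum(0.0, z)

def softmax(z):                    # output activation for multiclass classification
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])                       # input layer: 3 features
W1, b1 = np.random.randn(4, 3), np.zeros(4)          # hidden layer: weights and biases
W2, b2 = np.random.randn(2, 4), np.zeros(2)          # output layer: 2 classes

h = relu(W1 @ x + b1)              # hidden layer: weighted sum plus bias, then non-linearity
out = softmax(W2 @ h + b2)         # output layer: class probabilities
print(out, out.sum())              # probabilities summing to 1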
The depth of these networks brings both benefits and challenges:
Increased Expressiveness: The depth of the network allows it to represent highly intricate and nonlinear relationships in the data.
Vanishing and Exploding Gradients: During training, the gradients can become very large
or very small as they backpropagate through many layers, making it challenging to update
the weights effectively.
Overfitting: Deep networks can be prone to overfitting, especially when trained on limited
data, due to their high capacity to memorize patterns.
Despite these challenges, the development of deep networks has revolutionized the field of
artificial intelligence, leading to breakthroughs in various applications and driving the
advancement of machine learning technologies.