DL (Unit I)
Linear algebra is a branch of mathematics that deals with vector spaces, linear
transformations, and systems of linear equations. It plays a fundamental role in many areas of
science and engineering, including deep learning. Understanding linear algebra is essential
for comprehending the underlying concepts and operations involved in various deep learning
algorithms and models.
Here are some key concepts related to linear algebra in the context of deep learning:
Scalars: Scalars are single values that do not have any direction or orientation. In deep
learning, scalars are typically used to represent constants or individual data points. They are
denoted by lowercase letters (e.g., a, b, c).
Vectors: Vectors are one-dimensional arrays of values, and they have both magnitude and
direction. In deep learning, vectors are commonly used to represent features or data points.
Vectors are denoted by bold lowercase letters (e.g., v, x, y). When we need to explicitly
identify the elements of a vector, we write them as a column enclosed in square brackets:
x = [x1, x2, …, xn]
Matrices: Matrices are two-dimensional arrays of numbers arranged in rows and columns.
They are used to represent linear transformations and store data in many deep learning
algorithms. Matrices are denoted by bold uppercase letters (e.g., A, B, X).
Tensors: Tensors are generalizations of vectors and matrices and can have multiple
dimensions. Scalars are zero-dimensional tensors, vectors are one-dimensional tensors,
matrices are two-dimensional tensors, and so on. In deep learning, tensors are the primary
data structure used to represent and process data, including images, text, and audio. Tensors
are denoted by bold uppercase letters with subscripts as needed (e.g., T or T_{i,j,k}).
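As a quick illustration (using NumPy, which is not mentioned in these notes but is the usual tool), scalars, vectors, matrices, and higher-order tensors can all be represented as arrays that differ only in their number of dimensions:

import numpy as np

a = np.array(3.5)                       # scalar: 0-dimensional tensor
v = np.array([1.0, 2.0, 3.0])           # vector: 1-dimensional tensor
A = np.array([[1.0, 2.0], [3.0, 4.0]])  # matrix: 2-dimensional tensor
T = np.zeros((2, 3, 4))                 # 3-dimensional tensor (e.g., a tiny image batch)

print(a.ndim, v.ndim, A.ndim, T.ndim)   # 0 1 2 3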
Dot Product: The dot product is an operation that takes two vectors and produces a scalar. It
measures the similarity between the two vectors by taking the sum of the products of their
corresponding elements.
Matrix Multiplication: Matrix multiplication is a fundamental operation in deep learning
that involves combining rows and columns of matrices to produce new matrices. It is
essential in operations such as feedforward and backpropagation in neural networks.
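A minimal sketch of both operations in NumPy (the array values below are arbitrary examples):

import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(np.dot(v, w))          # dot product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # shape (2, 2)
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])   # shape (2, 2)
print(A @ B)                 # matrix product: rows of A combined with columns of B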
Eigenvalues and Eigenvectors: Eigenvalues and eigenvectors are important concepts in linear
algebra used in various algorithms, such as Principal Component Analysis (PCA) and
eigendecomposition.
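For example, NumPy's linear algebra routines can compute an eigendecomposition directly (the matrix below is just an illustrative example):

import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)   # [2. 3.]
print(eigenvectors)  # each column is an eigenvector of M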
Deep learning heavily relies on these concepts to build and train complex models that can
learn and generalize from vast amounts of data. Matrices and tensors are central to
representing the weights, biases, and activations of neural networks, and linear algebra
provides the mathematical foundation for optimizing and manipulating these structures
during the training process.
2. Probability Distributions:
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability mass function, and the reader must infer which PMF to use based on the identity of the random variable rather than on the name of the function: P(x) is usually not the same as P(y). The probability mass function maps a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes, to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x). Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.
An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring. A PMF must also satisfy
∑_{x∈x} P(x) = 1.
We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring. For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, that is, make each of its states equally likely, by setting its PMF to P(x = x_i) = 1/k for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that ∑_i P(x = x_i) = ∑_i 1/k = k/k = 1, so the distribution is properly normalized.
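A quick numerical check of this uniform PMF example (the value of k below is an arbitrary choice):

import numpy as np

k = 6                            # number of discrete states (assumed value)
pmf = np.full(k, 1.0 / k)        # P(x = x_i) = 1/k for every state
print(pmf.sum())                 # 1.0, so the distribution is normalized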
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:
o The domain of p must be the set of all possible states of x.
o For every x in the domain, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
o ∫ p(x) dx = 1.
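As a small sketch (the interval [0, 2] is an arbitrary choice), a uniform density p(x) = 1/(b − a) on [a, b] satisfies these properties, which can be checked numerically with a simple Riemann sum:

import numpy as np

a, b = 0.0, 2.0
xs = np.linspace(a, b, 10001)
p = np.full_like(xs, 1.0 / (b - a))   # uniform density: p(x) = 1/(b - a) on [a, b]
dx = xs[1] - xs[0]
print(np.sum(p) * dx)                 # approximately 1, confirming the integral property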
Gradient-based Optimization:
• Gradient-based Optimization
• Stationary points, Local minima
• Second Derivative
• Convex Optimization
• Lagrangian
Gradient-Based Optimization
f(x) = (1/2) ||Ax − b||²
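A minimal gradient-descent sketch for this objective (the matrix A, vector b, step size, and iteration count are all arbitrary illustrative choices; the gradient used is grad f(x) = A^T (Ax - b)):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0])
x = np.zeros(2)                 # starting point
lr = 0.1                        # step size (learning rate)

for _ in range(200):
    grad = A.T @ (A @ x - b)    # gradient of f(x) = 0.5 * ||Ax - b||^2
    x = x - lr * grad           # move against the gradient

print(x)                        # approaches the solution of Ax = b, i.e. [2, 3]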
Calculus in Optimization
Overfitting and Underfitting:
Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of machine learning models.
Before understanding overfitting and underfitting, let's understand some basic terms that will help in understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error introduced in the model due to oversimplifying the machine learning algorithm; it is the difference between the predicted values and the actual values.
o Variance: Variance measures how much the model's predictions change when different training data are used. If the model performs well on the training dataset but does not perform well on the test dataset, it has high variance.
Overfitting:
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model: the longer training continues, the higher the chance of producing an overfitted model.
Example: The concept of overfitting can be understood from the graph of a linear regression output in which the fitted curve passes through every point of the scatter plot. The model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of a regression model is to find the best-fit line; here we have not found the best fit, so the model will generate prediction errors on new data.
Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, so there are several ways by which we can reduce its occurrence in our model (an early-stopping sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
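As a rough sketch of one of these remedies, early stopping halts training once the validation error stops improving for a set number of epochs; the error values and the patience setting below are made up purely for illustration:

# Hypothetical early-stopping sketch: stop when validation error stops improving.
patience = 3                     # how many non-improving epochs to tolerate (assumed value)
best_val_error = float("inf")
epochs_without_improvement = 0

# Made-up validation errors per epoch, purely for illustration.
validation_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.54, 0.60]

for epoch, val_error in enumerate(validation_errors):
    if val_error < best_val_error:
        best_val_error = val_error
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Early stopping at epoch {epoch}")
        break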
Underfitting:
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it has reduced accuracy and produces unreliable predictions.
Example: Underfitting can be understood from the output of a linear regression model whose fitted line is too simple: the model is unable to capture the data points present in the plot.
Goodness of Fit:
The term "goodness of fit" is taken from statistics, and the goal of a machine learning model is to achieve a good fit. In statistical modeling, it describes how closely the predicted values match the true values of the dataset.
The model with a good fit lies between the underfitted and overfitted models; ideally, it makes predictions with zero error, but in practice this is difficult to achieve.
As we train our model for some time, the errors on the training data go down, and the same initially happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset then start increasing, so the point just before the test error begins to rise is the sweet spot, and we can stop training there to obtain a good model.
Hyperparameters:
Hyperparameters are defined as the parameters that are explicitly set by the user to control the learning process.
Here the prefix "hyper" suggests that the parameters are top-level parameters that are used in
controlling the learning process. The value of the Hyperparameter is selected and set by the
machine learning engineer before the learning algorithm begins training the model. Hence,
these are external to the model, and their values cannot be changed during the training
process.
Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model changes in response to the estimated error each time the model's weights are updated. It is one of the most important settings when building a neural network. Selecting a good learning rate is challenging: if the learning rate is very small, it may slow down the training process, while if the learning rate is too large, the optimizer may overshoot and fail to optimize the model properly.
Batch Size: To enhance the speed of the learning process, the training set is divided into smaller subsets known as batches; the batch size is the number of training examples in each subset.
Number of Epochs: An epoch can be defined as one complete pass of the entire training set through the learning algorithm.
Epochs represent the iterative nature of the learning process. The number of epochs varies from model to model, and most models are trained for more than one epoch. To determine the right number of epochs, the validation error is taken into account.
The number of epochs is increased as long as the validation error keeps decreasing. If the validation error shows no improvement over several consecutive epochs, that is a signal to stop increasing the number of epochs.
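The sketch below shows where these three hyperparameters typically appear in a training loop; the dataset, the simple linear model, and the specific values are all hypothetical placeholders:

import numpy as np

learning_rate = 0.01     # hyperparameter: step size for each weight update (assumed value)
batch_size = 32          # hyperparameter: examples per parameter update (assumed value)
num_epochs = 5           # hyperparameter: full passes over the training set (assumed value)

X = np.random.randn(320, 4)    # hypothetical training inputs
y = np.random.randn(320)       # hypothetical training targets
w = np.zeros(4)                # weights of a simple linear model

for epoch in range(num_epochs):                    # one epoch = one pass over the data
    for start in range(0, len(X), batch_size):     # iterate over batches
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = xb.T @ (xb @ w - yb) / len(xb)      # gradient of the mean squared error
        w -= learning_rate * grad                  # weight update scaled by the learning rate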
Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:
o Number of Hidden Units: Hidden units are part of neural networks; they refer to the components comprising the layers of processors between the input and output units of the network.
It is important to specify the number of hidden units as a hyperparameter for the neural network. A common rule of thumb is that it should lie between the size of the input layer and the size of the output layer; more specifically, roughly 2/3 of the size of the input layer plus the size of the output layer.
Validation sets:
In machine learning, a validation set is a separate dataset used to assess the performance of a
trained model during the training process. It plays a crucial role in model development and
helps prevent overfitting, which occurs when a model becomes too specific to the training
data and performs poorly on new, unseen data.
The typical process of using a validation set in machine learning involves the following steps:
Training Set: The original dataset is divided into two parts: the training set and the
validation set. The training set is used to train the model, i.e., the model learns the patterns
and relationships present in the data.
Validation Set: The validation set is used to evaluate the model's performance during
training. It contains examples that the model has not seen during training, simulating how the
model would perform on new, unseen data. This allows us to gauge how well the model
generalizes to new data and helps us tune hyperparameters.
Hyperparameter Tuning: During the training process, the model might have
hyperparameters that need to be optimized for best performance. These hyperparameters
cannot be learned directly from the data but rather affect the learning process.
Hyperparameter tuning involves trying different combinations of hyperparameters and
selecting the ones that give the best results on the validation set.
Model Selection: After training and hyperparameter tuning, various models with different
hyperparameters might be produced. The model that performs best on the validation set is
typically chosen as the final model.
Test Set: Once the final model is selected, it should be evaluated on a separate test set that
the model has never seen before. This provides an unbiased estimate of the model's
performance on completely new data. The test set should be different from both the training
and validation sets to ensure the evaluation is fair and unbiased.
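A common way to carve out these three splits, sketched with scikit-learn (the 60/20/20 proportions, the random seed, and the toy data are arbitrary choices, and scikit-learn itself is not mentioned in these notes):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100, 5)    # hypothetical features
y = np.random.randn(100)       # hypothetical targets

# First split off a held-out test set (20%), then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20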
By using a validation set, machine learning practitioners can make informed decisions about
their model's performance and improve its generalization capabilities. It helps in determining
the right balance between underfitting (where the model is too simple to capture the patterns)
and overfitting (where the model memorizes the training data and fails to generalize).
Estimators:
An estimator is an algorithm or model that learns from data in order to estimate (predict) a quantity of interest. Estimators are commonly used for various tasks, such as regression (predicting continuous values) and classification (predicting discrete class labels). The specific type of estimator used depends on the nature of the problem at hand.
Linear Regression: A simple regression algorithm that models the relationship between the
input features and the target variable as a linear equation. It aims to find the best-fit line that
minimizes the distance between predicted and actual values.
Decision Trees: A non-linear supervised learning algorithm used for both regression and
classification tasks. It creates a tree-like structure where each internal node represents a
decision based on a feature, and each leaf node corresponds to a class label or a continuous
value.
Random Forest: An ensemble learning method that combines multiple decision trees to
improve accuracy and robustness. It works by aggregating predictions from individual
decision trees.
Support Vector Machines (SVM): A powerful classification algorithm that finds the optimal
hyperplane in a high-dimensional feature space to separate different classes.
K-Nearest Neighbors (KNN): A simple non-parametric algorithm used for classification and
regression tasks. It makes predictions based on the majority class or average value of the k-
nearest data points in the feature space.
Neural Networks: Deep learning models that consist of multiple interconnected layers of
artificial neurons. They are used for various complex tasks, including image recognition,
natural language processing, and speech recognition.
Naive Bayes: A probabilistic algorithm based on Bayes' theorem, commonly used for
classification tasks, especially in text categorization and spam filtering.
Gradient Boosting Machines: An ensemble learning technique that builds multiple weak
learners (usually decision trees) sequentially, with each learner focusing on the mistakes of its
predecessor.
These are just a few examples of estimators used in machine learning. The choice of estimator depends on the problem domain, the nature of the data, and the desired performance of the model. It's essential to understand the characteristics of different estimators to select the most appropriate one for a particular task.
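As a minimal example of the estimator interface in scikit-learn (the tiny dataset below is fabricated for illustration), most estimators follow the same fit/predict pattern:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy inputs
y = np.array([2.0, 4.0, 6.0, 8.0])           # toy targets (y = 2x)

model = LinearRegression()     # an estimator
model.fit(X, y)                # learn parameters from the data
print(model.predict([[5.0]]))  # approximately [10.]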
Errors in Machine Learning:
In machine learning, prediction errors can be divided into two types:
o Reducible errors: These errors can be reduced to improve the model's accuracy. Such errors can further be classified into bias and variance.
o Irreducible errors: These errors will always be present in the model regardless of which algorithm has been used. The cause of these errors is unknown variables whose values can't be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error, or error due to bias. It can be defined as the inability of machine learning algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from assumptions in the model that make the target function simpler to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
Generally, a linear algorithm has high bias, which allows it to learn fast. The simpler the algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours, and Support Vector Machines. At the same time, algorithms with high bias include Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
High bias mainly occurs due to an overly simple model. It can be reduced by using a more complex model, adding more input features, or decreasing the regularization term.
What is Variance?
Variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with
changes in the training data set. At the same time, High variance shows a large variation in
the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well on the training dataset but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.
Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance include Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
There are four possible combinations of bias and variance:
1. Low-Bias, Low-Variance: The combination of low bias and low variance represents an ideal machine learning model. However, it is practically not achievable.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate on
average.
A high-bias (underfitted) model can typically be identified by a high training error, with the test error almost similar to the training error.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need both low variance and low bias. But this is not fully achievable, because bias and variance are related to each other: decreasing bias tends to increase variance, and decreasing variance tends to increase bias.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance
between bias and variance errors.
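This trade-off is often summarized by the standard decomposition of the expected squared prediction error (the formula below is the textbook identity, added here for reference rather than taken from these notes):

\[
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \left(\mathrm{Bias}[\hat{f}(x)]\right)^2
  + \mathrm{Var}[\hat{f}(x)]
  + \sigma^2
\]

where \(\sigma^2\) is the irreducible error.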
Stochastic Gradient Descent (SGD):
One thing to note is that, since SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minimum because of the randomness in its descent. Even though it requires more iterations to reach the minimum than typical Gradient Descent, it is still computationally much less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
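A bare-bones SGD sketch for linear regression (the data, learning rate, and epoch count are all assumed illustrative values), updating the weights from one shuffled example at a time:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # hypothetical inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # targets with a little noise

w = np.zeros(3)
lr = 0.05                                     # learning rate

for epoch in range(20):
    for i in rng.permutation(len(X)):         # visit examples in random order
        error = X[i] @ w - y[i]               # prediction error for a single example
        w -= lr * error * X[i]                # update using only this one example

print(w)                                      # close to [1.0, -2.0, 0.5]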
Advantages of Stochastic Gradient Descent
Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient
Descent and Mini-Batch Gradient Descent since it uses only one example to update the
parameters.
Memory Efficiency: Since SGD updates the parameters for each training example one at a
time, it is memory-efficient and can handle large datasets that cannot fit into memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local minima and move toward better minima.
Disadvantages of Stochastic Gradient Descent:
Noisy updates: The updates in SGD are noisy and have a high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it
updates the parameters for each training example one at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since
using a high learning rate can cause the algorithm to overshoot the minimum, while a low
learning rate can make the algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling and momentum-based updates.
Challenges Motivating Deep Learning:
Motivating deep learning can be challenging due to various factors. Some of the key challenges are:
Complexity and Abstraction: Deep learning models are highly complex, consisting of
multiple layers and millions of parameters. Understanding and reasoning about these
models can be daunting, making it difficult for individuals to stay motivated.
Data Requirements: Deep learning models thrive on vast amounts of data. Acquiring,
cleaning, and preprocessing such data can be time-consuming and challenging, especially in
domains with limited or messy data.
Overfitting and Underfitting: Finding the right balance between overfitting and
underfitting can be challenging. Without proper regularization techniques, models may fail
to generalize well to unseen data, affecting motivation to continue.
Research and Advancements: The field of deep learning is constantly evolving with new
research papers, algorithms, and architectures being published regularly. Keeping up with
the latest developments can be overwhelming for learners.
Uninterpretable Black-box Models: Deep learning models are often considered black
boxes due to their lack of interpretability. This can be demotivating for individuals who
prefer to understand the reasoning behind model decisions.
Long Training Times: Training complex deep learning models can take a substantial
amount of time, ranging from hours to days or even weeks. Waiting for results can be
frustrating and might lead to a loss of motivation.
Despite these challenges, many find deep learning rewarding because of its ability to
solve complex problems, process vast amounts of data, and achieve state-of-the-art results
in various domains. Overcoming these obstacles often requires perseverance, a strong
support network, and a genuine interest in the field, as deep learning can be an exciting and
transformative area of study.
Deep Networks:
Deep networks, also known as deep neural networks, are a class of artificial neural
networks that are characterized by multiple layers of interconnected nodes, or neurons.
These networks are designed to process and transform complex patterns in data, making
them well-suited for various machine learning tasks, including image recognition, natural
language processing, speech recognition, and more.
The term "deep" in deep networks refers to the depth of the network, which is
defined by the number of hidden layers between the input and output layers. Traditional
neural networks, with only one or two hidden layers, are shallow in comparison. Deep
networks have a significantly greater number of hidden layers, often ranging from a few
dozen to hundreds or even thousands of layers.
Input Layer: The input layer receives the raw data and passes it on to the subsequent
layers.
Hidden Layers: These are the intermediate layers between the input and output layers,
where most of the computations take place. Each neuron in a hidden layer receives input
from the previous layer and produces an output for the next layer.
Weights and Biases: Each connection between neurons in adjacent layers has an associated
weight, which determines the strength of the connection. Additionally, each neuron in a
hidden layer has an associated bias, which allows the network to learn more complex
patterns.
Activation Function: Each neuron typically applies an activation function to its weighted
sum of inputs and bias, introducing non-linearity into the model. Popular activation
functions include ReLU (Rectified Linear Unit), sigmoid, and tanh.
Output Layer: The final layer of the network produces the model's output based on the
transformed information from the hidden layers. The activation function used in the output
layer depends on the nature of the task, e.g., softmax for multiclass classification.
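A tiny forward pass through such a network, sketched in plain NumPy (the layer sizes, random weights, and the use of ReLU and softmax are illustrative assumptions, not a prescribed architecture):

import numpy as np

def relu(z):                       # activation function for the hidden layer
    return np.maximum(0.0, z)

def softmax(z):                    # output activation for multiclass classification
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])                       # input layer: 3 features
W1, b1 = np.random.randn(4, 3), np.zeros(4)          # hidden layer: weights and biases
W2, b2 = np.random.randn(2, 4), np.zeros(2)          # output layer: 2 classes

h = relu(W1 @ x + b1)              # hidden layer: weighted sum plus bias, then non-linearity
out = softmax(W2 @ h + b2)         # output layer: class probabilities
print(out, out.sum())              # probabilities summing to 1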
The depth of these networks brings both benefits and challenges:
Increased Expressiveness: The depth of the network allows it to represent highly intricate and nonlinear relationships in the data.
Vanishing and Exploding Gradients: During training, the gradients can become very large
or very small as they backpropagate through many layers, making it challenging to update
the weights effectively.
Overfitting: Deep networks can be prone to overfitting, especially when trained on limited
data, due to their high capacity to memorize patterns.
Despite these challenges, the development of deep networks has revolutionized the field of
artificial intelligence, leading to breakthroughs in various applications and driving the
advancement of machine learning technologies.