Unit 2
Recurrent Neural Networks (RNNs) were introduced in the 1980s by researchers David
Rumelhart, Geoffrey Hinton, and Ronald J. Williams.
Recurrent neural networks (RNNs) are a type of artificial neural network that can process
sequential data such as text, speech, time-series, and images.
RNNs have an internal memory that allows them to remember previous inputs and outputs, and
use them to influence the current computation.
RNNs are widely used for tasks that require understanding the context and meaning of data such
as natural language processing, speech recognition, machine translation, and image captioning.
RNNs consist of multiple recurrent layers, each of which performs a transformation on the
input and the hidden state, and produces an output and a new hidden state.
Recurrent neural networks are a class of neural networks that are used for sequence modeling.
They can be expressed as time-layered networks in which the weights are shared between
different layers. The input is of the form x1 . . . xn, where xt is a d-dimensional point
received at time-stamp t.
For example,
The vector xt might contain the d values at the tth tick of a multivariate time-series (with d
different series). In a text setting, the vector xt will contain the one-hot encoded word at the tth
time-stamp. In one-hot encoding, we have a vector of length equal to the lexicon size, and the
component for the relevant word has a value of 1. All other components are 0.
An important point about sequences is that successive words are dependent on one another.
Therefore, it is helpful to receive a particular input xt only after the earlier inputs have already
been received and converted into a hidden state. The traditional type of feed-forward network in
which all inputs feed into the first layer does not achieve this goal. Therefore, the recurrent
neural network allows the input xt to interact directly with the hidden state created from the
inputs at previous time-stamps. The basic architecture of the recurrent neural network is
illustrated in Figure (a).
The key point is that there is an input xt at each time-stamp, and a hidden state ht that changes at
each time-stamp as new data points arrive. Each time-stamp also has an output value yt. For
example, in a time-series setting, the output yt might be the forecasted prediction of xt+1. When
used in the text setting of predicting the next word, this approach is referred to as language
modeling. In some applications, we do not output yt at each time-stamp, but only at the end of the
sequence. For example, if one is trying to classify the sentiment of a sentence as "positive" or
"negative," the output will occur only at the final time-stamp. The hidden state at time t is given
by a function of the input vector at time t and the hidden vector at time (t − 1):

ht = f(ht−1, xt)
A separate function yt = g(ht) is used to learn the output probabilities from the hidden states.
Note that the functions f(·) and g(·) are the same at each time stamp. The implicit assumption is
that the time-series exhibits a certain level of stationarity; the underlying properties do not
change with time. Although this property is not exactly true in real settings, it is a good
assumption to use for regularization.
A key point here is the presence of the self-loop in Figure (a), which will cause the hidden state
of the neural network to change after the input of each xt. In practice, one only works with
sequences of finite length, and it makes sense to unfurl the loop into a "time-layered" network
that looks more like a feed-forward network. This network is shown in Figure (b). Note that in
this case, we have a different node for the hidden state at each time-stamp and the self-loop has
been unfurled into a feed-forward network. This representation is mathematically equivalent to
Figure (a), but is much easier to comprehend because of its similarity to a traditional network.
Note that unlike traditional feed-forward networks, inputs also occur at the intermediate layers of
this unfurled network.
Fig: A recurrent neural network and its time-layered representation
The weight matrices of the connections are shared by multiple connections in the time-layered
network to ensure that the same function is used at each time stamp. This sharing is the key to
the domain-specific insights that are learned by the network. The backpropagation algorithm
takes the sharing and temporal length into account when updating the weights during the learning
process. This special type of backpropagation algorithm is referred to as backpropagation
through time (BPTT). Because of the recursive nature of Equation, the recurrent network has the
ability to compute a function of variable-length inputs. In other words, one can expand the
recurrence of Equation to define the function for ht in terms of t inputs. For example, starting at
h0, which is typically fixed to some constant vector, we have h1 = f(h0, x1) and h2 = f(f(h0, x1),
x2). Note that h1 is a function of only x1, whereas h2 is a function of both x1 and x2. Since the
output yt is a function of ht, these properties are inherited by yt as well. In general, we can write
the following:

yt = Ft(x1, x2, . . . , xt)
Note that the function Ft(·) varies with the value of t. Such an approach is particularly useful for
variable-length inputs like text sentences. However, the amount of data and the size of the hidden
states required for longer sequences increase in a way that is not realistic. Furthermore, there are
practical issues in finding the optimum choices of parameters because of the vanishing and
exploding gradient problems. As a result, specialized variants of the recurrent neural network
architecture have been proposed, such as the use of long short-term memory.
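To make the recurrence concrete, the following is a minimal numpy sketch of an unrolled RNN forward pass. The update ht = tanh(Wxh·xt + Whh·ht−1 + b) is one common choice for the function f(·); the matrix names Wxh, Whh, Why are chosen here purely for illustration, and the key point is that the same weights are reused at every time-stamp:

import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, b, h0):
    # Unrolled RNN forward pass over a sequence xs of d-dimensional inputs.
    # The same weights (Wxh, Whh, Why) are shared across all time-stamps.
    h = h0
    outputs = []
    for x in xs:
        # ht = f(h_{t-1}, xt); tanh is one common choice for f
        h = np.tanh(Wxh @ x + Whh @ h + b)
        # yt = g(ht); here g is a simple linear readout
        outputs.append(Why @ h)
    return outputs, h

# Usage: d = 4 input dimensions, p = 8 hidden units, 5 time-stamps
rng = np.random.default_rng(0)
d, p, T = 4, 8, 5
xs = [rng.normal(size=d) for _ in range(T)]
Wxh = rng.normal(size=(p, d))
Whh = rng.normal(size=(p, p))
Why = rng.normal(size=(d, p))
outputs, hT = rnn_forward(xs, Wxh, Whh, Why, np.zeros(p), h0=np.zeros(p))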
The connections in a convolutional neural network are very sparse, because any activation in a
particular layer is a function of only a small spatial region in the previous layer. All layers other
than the final set of two or three layers maintain their spatial structure. Therefore, it is possible to
spatially visualize what parts of the image affect particular portions of the activations in a layer.
The features in lower-level layers capture lines or other primitive shapes, whereas the features in
higher-level layers capture more complex shapes like loops (which commonly occur in many
digits). Therefore, later layers can create digits by composing the shapes in these intuitive
features. This is a classical example of the way in which semantic insights about specific data
domains are used to design clever architectures.
In addition, a subsampling layer simply averages the values in the local regions of size 2×2 in
order to compress the spatial footprints of the layers by a factor of 2. An illustration of the
architecture of LeNet-5 is shown in Figure 1.18. In the early years, LeNet-5 was used by several
banks to recognize hand-written numbers on checks.
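As an illustration, the following is a minimal numpy sketch of the 2×2 average-subsampling operation described above, which halves each spatial dimension of a feature map:

import numpy as np

def subsample_2x2(feature_map):
    # Average each non-overlapping 2x2 block, halving both spatial dimensions
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(subsample_2x2(fm))  # 2x2 output: each entry is the mean of a 2x2 block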
Convolutional neural networks have historically been the most successful of all types of neural
networks. They are used widely for image recognition, object detection/localization, and even
text processing. The performance of these networks has recently exceeded that of humans in the
problem of image classification. Convolutional neural networks provide a very good example of
the fact that architectural design choices in a neural network should be performed with semantic
insight about the data domain at hand. In the particular case of the convolutional neural network,
this insight was obtained by observing the biological workings of a cat’s visual cortex, and
heavily using the spatial relationships among pixels.
This fact also provides some evidence that a further understanding of neuroscience might also be
helpful for the development of methods in artificial intelligence. Convolutional neural networks
pretrained on large publicly available data sets like ImageNet are often available for use in an off-
the-shelf manner for other applications and data sets. This is achieved by using most of the
pretrained weights in the convolutional network without any change except for the final
classification layer. The weights of the final classification layer are learned from the data set at
hand. The training of the final layer is necessary because the class labels in a particular setting
may be different from those of ImageNet.
Nevertheless, the weights in the early layers are still useful because they learn various types of
shapes in the images that can be useful for virtually any type of classification
application. Furthermore, the feature activations in the penultimate layer can even be used for
unsupervised applications. For example, one can create a multidimensional representation of an
arbitrary image data set by passing each image through the convolutional neural network and
extracting the activations of the penultimate layer. Subsequently, any type of indexing can be
applied to this representation for retrieving images that are similar to a specific target image.
Such an approach often provides surprisingly good results in image retrieval because of the
semantic nature of the features learned by the network. It is noteworthy that the use of pretrained
convolutional networks is so popular that training is rarely started from scratch.
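For instance, the following is a minimal sketch of penultimate-layer feature extraction, assuming PyTorch and torchvision are available; resnet18 is just one example of an ImageNet-pretrained network, and the input here is a random placeholder rather than a real preprocessed image:

import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained network and drop its final classification layer
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Identity()  # the penultimate activations become the output
model.eval()

with torch.no_grad():
    image_batch = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image
    features = model(image_batch)              # 512-dimensional representation

# `features` can now be indexed or compared for image retrieval
print(features.shape)  # torch.Size([1, 512])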
Many deep feed-forward architectures have multiple layers in which successive transformations
of the inputs from the previous layer lead to increasingly sophisticated representations of the
data. The values of each hidden layer for a particular input contain a
transformed representation of the input point, which becomes increasingly informative about the
target value we are trying to learn, as the layer gets closer to the output node. The appropriately
transformed feature representations are more amenable to simple types of predictions in the
output layer. This sophistication is a result of the nonlinear activations in intermediate layers.
Traditionally, the sigmoid and tanh activations were the most popular choices in the hidden
layers, but the ReLU activation has become increasingly popular in recent years because of the
desirable property that it is better at avoiding the vanishing and exploding gradient problems. The
final layer can be viewed as a relatively simple prediction layer, which contains a single linear
neuron in the case of regression and a sigmoid/sign function in the case of binary classification.
More complex outputs might require multiple nodes. One way of viewing
this division of labor between the hidden layers and final prediction layer is that the early layers
create a feature representation that is more amenable to the task at hand. The final layer then
leverages this learned feature representation. A key point is that the features learned in the
hidden layers are often (but not always) generalizable to other data sets and problem settings in
the same domain (e.g., text, images, and so on). This property can be leveraged in various ways
by simply replacing the output node(s) of a pretrained network with a different application-
specific output layer (e.g., linear regression layer instead of sigmoid classification layer) for the
data set and problem at hand. Subsequently, only the weights of the newly replaced output layer
may need to be learned for the new data set and application, whereas the weights of other layers
are fixed.
The output of each hidden layer is a transformed feature representation of the data, in which the
dimensionality of the representation is defined by the number of units in that layer. One can view
this process as a kind of hierarchical feature engineering in which the features in earlier layers
represent primitive characteristics of the data, whereas those in later layers represent complex
characteristics with semantic significance to the class labels. Data represented in terms of the
features of later layers are often more well behaved (e.g., linearly separable) because of the
semantic nature of the features learned by the transformation. This type of behavior is
particularly evident in a visually interpretable way in some domains like convolutional neural
networks for image data. In convolutional neural networks, the features in earlier layers capture
detailed but primitive shapes like lines or edges from the data set of images. On the other hand,
the features in later layers capture shapes of greater complexity like hexagons, honeycombs, and
so forth, depending on the type of images provided as training data. Note that such semantically
interpretable shapes often have closer correlations with class labels in the image domain. For
example, almost any image will contain lines or edges, but images belonging to particular classes
will be more likely to have hexagons or honeycombs. This property tends to make the
representations of later layers easier to classify with simple models like linear classifiers. This
process is illustrated in Figure 1.19. The features in earlier layers are used repeatedly as building
blocks to create more complex features. This general principle of “putting together” simple
features to create more complex features lies at the core of the successes achieved with neural
networks. As it turns out, this property is also useful in leveraging pretrained models in a
carefully calibrated way. The practice of using pretrained models is also referred to as transfer
learning. A particular type of transfer learning, which is used commonly in neural networks, is
that the data and structure available in a given data set are used to learn features for that entire
domain. A classical example of this setting is that of text or image data. In text data, the
representations of text words are created using standardized benchmark data sets like Wikipedia
and models like word2vec. These can be used in almost any text application, since the nature of
text data does not change very much with the application. A similar approach is often used for
image data, in which the ImageNet data set is used to pretrain convolutional neural networks,
and provide ready-to-use features. One can download a pretrained convolutional neural network
model and convert any image data set into a multidimensional representation by passing the
image through the pretrained network. Furthermore, if additional application-specific data is
available, one can regulate the level of transfer learning depending on the amount of available
data. This is achieved by fine-tuning a subset of the layers in the pretrained neural network with
this additional data. If a small amount of application-specific data is available, one can fix the
weights of the early layers to their pretrained values and fine-tune only the last few layers of the
neural network. The early layers often contain primitive features, which are more easily
generalizable to arbitrary applications. For example, in a convolutional neural network, the early
layers learn primitive features like edges, which are useful across diverse images like trucks or
carrots. On the other hand, the later layers contain complex features which might depend on the
image collection at hand (e.g., truck wheel versus carrot top). Fine-tuning only the weights of the
later layers makes sense in such cases. If a large amount of application-specific data is available,
one can fine-tune a larger number of layers. Therefore, deep networks provide significant
flexibility in terms of how transfer learning is done with pretrained neural network models.
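As a sketch of this regulated transfer learning, the snippet below (again assuming PyTorch/torchvision, with resnet18 as a stand-in pretrained model and num_classes as a hypothetical application-specific label count) freezes the early layers and trains only the replaced output layer:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 5  # hypothetical number of application-specific classes

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Fix the pretrained weights of the early layers
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer; only its weights will be learned for the new data set
model.fc = nn.Linear(model.fc.in_features, num_classes)

# An optimizer over just the trainable parameters would then be used, e.g.:
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)

With more application-specific data, one could instead leave the last few layers unfrozen and fine-tune them as well, which matches the discussion above.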
Recurrent neural networks can be hard to train, because they are prone to the vanishing and
exploding gradient problems. However, there are other ways of training more robust recurrent
networks. A particular example that has found favor is the use of the long short-term memory
network. This network uses a gentler update process of the hidden states in order to avoid the
vanishing and exploding gradient problems. Recurrent neural networks and their variants have
found use in many applications such as image captioning, token-level classification, and sentence
classification.
Datasets
The benchmarks used in the neural network literature are dominated by data from the domain of
computer vision. Although traditional machine learning data sets like the UCI repository can be
used for testing neural networks, the general trend is towards using data sets from perceptually
oriented data domains that can be visualized well. Although there are a variety of data sets drawn
from the text and image domains, two of them stand out because of their ubiquity in deep
learning papers. Although both are data sets drawn from computer vision, the first of them is
simple enough that it can also be used for testing generic applications beyond the field of vision.
In the following, we provide a brief overview of these two data sets.
The MNIST database, which stands for Modified National Institute of Standards and
Technology database, is a large database of handwritten digits. This data set was created by
modifying an original database of handwritten digits provided by NIST. The data set contains
60,000 training images and 10,000 testing images. Each image is a scan of a handwritten digit
from 0 to 9, and the differences between different images are a result of the differences in the
handwriting of different individuals. These individuals were American Census Bureau
employees and American high school students. The original black and white images from NIST
were size normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio and centered
in a 28 × 28 image by computing the center of mass of the pixels. The images were translated to
position this point at the center of the 28×28 field. Each of these 28×28 pixel values takes on a
value from 0 to 255, depending on where it lies in the grayscale spectrum. The labels associated
with the images correspond to the ten digit values. Examples of the digits in the MNIST database
are illustrated in Figure.
The size of the data set is rather small, and it contains only a simple object corresponding to a
digit. Therefore, one might argue that the MNIST database is a toy data set. However, its small
size and simplicity is also an advantage because it can be used as a laboratory for quick testing of
machine learning algorithms. Furthermore, the simplification of the data set by virtue of the fact
that the digits are (roughly) centered makes it easy to use it to test algorithms beyond computer
vision.
Although the matrix representation of each image is suited to a convolutional neural network,
one can also convert it into a multidimensional representation of 28 × 28 = 784 dimensions. This
conversion loses some of the spatial information in the image, but this loss is not debilitating (at
least in the case of the MNIST data set) because of its relative simplicity. In fact, the use of a
simple support vector machine on the 784-dimensional representation can provide an impressive
error rate of about 0.56%. A straightforward 2-layer neural network on the multidimensional
representation (without using the spatial structure in the image) generally does worse than the
support vector machine across a broad range of parameter choices! A deep neural network
without any special convolutional architecture can achieve an error rate of 0.35%. Deeper neural
networks and convolutional neural networks (that do use spatial structure) can reduce the error
rate to as low as 0.21% by using an ensemble of five convolutional networks. Therefore, even on
this simple data set, one can see that the relative performance of neural networks with respect to
traditional machine learning is sensitive to the specific architecture used in the former.
Finally, it should be noted that the 784-dimensional non-spatial representation of the MNIST
data is used for testing all types of neural network algorithms beyond the domain of computer
vision. Even though the use of the 784-dimensional (flattened) representation is not appropriate
for a vision task, it is still useful for testing the general effectiveness of non-vision oriented (i.e.,
generic) neural network algorithms. For example, the MNIST data is frequently used to test
generic autoencoders and not just convolutional ones. Even when the non-spatial representation
of an image is used to reconstruct it with an autoencoder, one can still visualize the results with
the original spatial positions of the reconstructed pixels to obtain a feel of what the algorithm is
doing with the data.
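As a minimal sketch of this flattening (using random stand-in arrays rather than the actual MNIST files), each 28 × 28 image becomes a 784-dimensional vector:

import numpy as np

# Stand-in for the MNIST training images: 60,000 grayscale 28x28 scans
images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional vector and scale to [0, 1]
flat = images.reshape(len(images), -1).astype(np.float32) / 255.0
print(flat.shape)  # (60000, 784)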
The ImageNet Database
The ImageNet database is a huge database of over 14 million images drawn from 1000 different
categories. Its class coverage is exhaustive enough that it covers most types of images that one
would encounter in everyday life. This database is organized according to a WordNet hierarchy
of nouns. The WordNet database is a data set containing the relationships among English words
using the notion of synsets. The WordNet hierarchy has been successfully used for machine
learning in the natural language domain, and therefore it is natural to design an image data set
around these relationships. The ImageNet database is famous for the fact that an annual
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is held using this dataset. This
competition has a very high profile in the vision community and receives entries from most
major research groups in computer vision. The entries to this competition have resulted in many
of the state-of-the-art image recognition architectures today, including the methods that have
surpassed human performance on some narrow tasks like image classification. Because of the
wide availability of known results on these data sets, it is a popular alternative for benchmarking.
Another important significance of the ImageNet data set is that it is large and diverse enough to
be representative of the key visual concepts within the image domain. As a result, convolutional
neural networks are often trained on this data set; the pretrained network can be used to extract
features from an arbitrary image. This image representation is defined by the hidden activations
in the penultimate layer of the neural network. Such an approach creates new multidimensional
representations of image data sets that are amenable for use with traditional machine learning
methods. One can view this approach as a kind of transfer learning in which the visual concepts
in the ImageNet data set are transferred to unseen data objects for other applications.
Machine Learning with shallow neural networks refers to the use of relatively simple neural
networks with few layers (typically one hidden layer) for tasks like classification, regression, and
pattern recognition. These types of networks are considered "shallow" because they lack the
deep, complex architecture of modern deep learning models that often involve many hidden
layers.
Limitations of Shallow Neural Networks
o Limited Capacity: Shallow neural networks might struggle to capture complex patterns in
data. For tasks involving intricate relationships, deep networks (with many layers) often
outperform shallow ones.
o Feature Engineering: Shallow networks may require more manual feature engineering, as
they cannot automatically learn hierarchical features in the same way deep networks can.
Use Cases of Shallow Neural Networks
1. Simple Classification Tasks: For example, binary classification or small multi-class
classification problems.
2. Regression: When the target variable is continuous, a shallow neural network can model
the relationship between input features and the output.
3. Dimensionality Reduction: Shallow networks like autoencoders (with one hidden layer)
can be used for unsupervised tasks like reducing the number of features.
Popular Shallow Neural Network Algorithms
o Single-layer Perceptron (SLP): A basic shallow neural network that performs linear
classification using a single layer of weights connecting the inputs directly to the output
node (there is no hidden layer).
o Multilayer Perceptron (MLP): A network with one or more hidden layers of neurons. An
MLP with a single hidden layer is still considered shallow in contrast to modern deep
networks. MLPs can be used for both classification and regression.
Shallow neural networks are still a vital tool in machine learning, especially for simpler tasks or
when computational efficiency is crucial.
In machine learning, binary classification involves predicting one of two possible classes (e.g., 0
or 1, true or false, etc.). Neural network architectures are often used to solve binary classification
problems, with different types of architectures providing varying levels of performance
depending on the complexity of the problem and data. Here are some commonly used neural
architectures for binary classification:
1. Single-Layer Perceptron (SLP)
The Single-Layer Perceptron is the most basic neural network architecture for binary
classification.
Structure:
Input Layer: Each neuron corresponds to one feature from the input data.
Output Layer: A single neuron that outputs the predicted class (0 or 1).
Activation Function: The output layer typically uses a sigmoid activation function to
squash the output between 0 and 1, representing the probability of belonging to class 1.
Training:
Loss Function: Binary Cross-Entropy (Log Loss) is used to compute the loss between
predicted probabilities and actual binary labels.
Optimization: The network is trained using gradient descent and backpropagation.
Example Use Case:
Simple problems where data is linearly separable, or relatively simple patterns need to be
learned.
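A minimal numpy sketch of this architecture follows: one layer of weights, a sigmoid output, binary cross-entropy loss, and plain gradient-descent updates (the data here is randomly generated purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)  # synthetic 0/1 labels

w = np.zeros(d)
b = 0.0
alpha = 0.1  # learning rate

for epoch in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions in (0, 1)
    # Gradient of binary cross-entropy with respect to w and b
    grad_w = X.T @ (p - y) / n
    grad_b = np.mean(p - y)
    w -= alpha * grad_w
    b -= alpha * grad_b

accuracy = np.mean((p > 0.5) == y)  # training accuracy of the learned classifier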
The choice of neural network architecture for binary classification depends on several factors
such as the complexity of the data, the nature of the problem (e.g., sequential, image, or tabular
data), and the computational resources available. For simple tasks, a Single-Layer Perceptron or
MLP might suffice, while for more complex tasks involving sequences, images, or intricate
relationships, architectures like CNNs, RNNs, and GANs may be more suitable.
Some basic architectures for machine learning models such as least-squares regression and
classification are discussed here.
Small changes in these neural architectures can result in models that are distinct from traditional
machine learning.
Consider a single-layer network with d input nodes and a single output node. The coefficients of
the connections from the d input nodes to the output node are denoted by W = (w1 . . . wd).
Furthermore, the bias will not be explicitly shown because it can be seamlessly modeled as the
coefficient of an additional dummy input with a constant value of 1.
Figure: An extended architecture of the perceptron with both discrete and continuous predictions
Revisiting the Perceptron
Let (Xi, yi) be a training instance, in which the observed value yi is predicted from the feature
variables Xi using the following relationship:
ŷi = sign(W · Xi)
Here, W is the d-dimensional coefficient vector learned by the perceptron. Note the circumflex
on top of ŷi to indicate that it is a predicted value rather than an observed value. In general, the
goal of training is to ensure that the prediction ŷi is as close as possible to the observed value yi.
The gradient-descent steps of the perceptron are focused on reducing the number of
misclassifications, and therefore the updates are proportional to the difference (yi − ŷi) between
the observed and predicted values. A gradient-descent update that is proportional to the
difference between the observed and predicted values is naturally caused by a squared loss
function such as (yi − ŷi)^2. Therefore, one possibility is to consider the squared loss between
the predicted and observed values as the loss function.
Least-Squares Regression
Widrow-Hoff Learning
Following the perceptron, the Widrow-Hoff learning rule was proposed in 1960. However, the
method was not a fundamentally new one, as it is a direct application of least-squares regression
to binary targets. Although the sign function is applied to the real-valued prediction of unseen
test instances to convert them to binary predictions, the error of training instances is computed
directly using real-valued predictions (unlike the perceptron). Therefore, it is also referred to as
least-squares classification or linear least-squares method. Remarkably, a seemingly unrelated
method proposed in 1936, known as the Fisher discriminant, also reduces to Widrow-Hoff
learning in the special case of binary targets.
The Fisher discriminant is formally defined as a direction W along which the ratio of inter-class
variance to the intra-class variance is maximized in the projected data. By choosing a scalar b in
order to define the hyperplane W · X = b, it is possible to model the separation between the two
classes. This hyperplane is used for classification. Although the definition of the Fisher
discriminant seems quite different from least-squares regression/ classification at first sight, a
remarkable result is that the Fisher discriminant for binary targets is identical to the least-squares
regression as applied to binary targets (i.e., least-squares classification). Both the data and the
targets need to be mean-centered, which allows the bias variable b to be set to 0.
Closed Form Solutions
The special case of least-squares regression and classification is solvable in closed form (without
gradient-descent) by using the pseudo-inverse of the n × d training data matrix D, whose rows
are X1 . . .Xn. Let the n-dimensional column vector of dependent variables be denoted by y = [y1
. . . yn]ᵀ. The pseudo-inverse of the matrix D is defined as follows:

D⁺ = (DᵀD)⁻¹Dᵀ

Then, the row-vector W is defined by the following relationship:

Wᵀ = D⁺y

If regularization is incorporated, the coefficient vector W is given by the following:

Wᵀ = (DᵀD + λI)⁻¹Dᵀy
Here, λ > 0 is the regularization parameter. However, inverting a matrix like (DᵀD + λI) is
typically done using numerical methods that require gradient descent anyway, and one rarely
inverts large matrices like DᵀD in practice. In fact, the Widrow-Hoff updates provide a very
efficient way of solving the problem without using the closed-form solution.
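The closed-form solutions above translate directly into numpy; the sketch below (with synthetic data) uses np.linalg.solve rather than an explicit matrix inverse, which reflects the standard numerical practice hinted at above:

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
D = rng.normal(size=(n, d))                     # training data matrix (rows X1 . . . Xn)
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = D @ true_w + 0.1 * rng.normal(size=n)       # dependent variables

# Unregularized closed form: W^T = (D^T D)^{-1} D^T y
w_ols = np.linalg.solve(D.T @ D, D.T @ y)

# Regularized (ridge) closed form: W^T = (D^T D + lambda*I)^{-1} D^T y
lam = 0.5
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ y)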
Logistic Regression
Logistic regression is a probabilistic model that classifies the instances in terms of probabilities.
Because the classification is probabilistic, a natural approach for optimizing the parameters is to
ensure that the predicted probability of the observed class for each training instance is as large as
possible. This goal is achieved by using the notion of maximum-likelihood estimation in order to
learn the parameters of the model. The likelihood of the training data is defined as the product of
the probabilities of the observed labels of each training instance. Clearly, larger values of this
objective function are better. By using the negative logarithm of this value, one obtains a loss
function in minimization form. Therefore, the output node uses the negative log-likelihood as a
loss function. This loss function replaces the squared error used in the Widrow-Hoff method. The
output layer can be formulated with the sigmoid activation function, which is very common in
neural network design.
Let (X1, y1), (X2, y2), . . . (Xn, yn) be a set of n training pairs in which Xi contains the d-
dimensional features and yi ∈ {−1, +1} is a binary class variable. As in the case of a perceptron,
a single-layer architecture with weights W = (w1 . . . wd) is used. Instead of using the hard sign
activation on W · Xi to predict yi, logistic regression applies the soft sigmoid function to W · Xi
in order to estimate the probability that yi is 1:

ŷi = P(yi = 1) = 1 / (1 + exp(−W · Xi))
For a test instance, it can be predicted to belong to the class whose predicted probability is
greater than 0.5. Note that P(yi = 1) is 0.5 when W · Xi = 0, in which case Xi lies on the
separating hyperplane.
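A minimal numpy sketch of maximum-likelihood training under this ±1 formulation follows (synthetic data; the per-instance update W ← W + α·yi·Xi/(1 + exp(yi·W·Xi)) is the stochastic gradient-ascent step on the log-likelihood):

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = np.where(X @ np.array([1.0, -1.0, 2.0]) > 0, 1.0, -1.0)  # synthetic labels in {-1, +1}

W = np.zeros(d)
alpha = 0.05

for epoch in range(50):
    for Xi, yi in zip(X, y):
        # Stochastic gradient ascent on the log-likelihood:
        # gradient = yi * Xi / (1 + exp(yi * W.Xi))
        W += alpha * yi * Xi / (1.0 + np.exp(yi * np.dot(W, Xi)))

prob_pos = 1.0 / (1.0 + np.exp(-(X @ W)))      # P(yi = +1) for each instance
predictions = np.where(prob_pos > 0.5, 1, -1)  # threshold at probability 0.5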
o Logistic regression is used for predicting the categorical dependent variable using a given
set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, and so on.
However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.
o Logistic regression is similar to linear regression except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The image below
shows the logistic function:
Note: Logistic regression uses the concept of predictive modeling as in regression, which is why
it is called logistic regression; however, because it is used to classify samples, it falls under the
classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + . . . + bnxn

o In logistic regression, y can be between 0 and 1 only, so let's divide the above equation by
(1 − y):

y / (1 − y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between −[infinity] and +[infinity]; taking the logarithm of the
equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + . . . + bnxn
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
Example: Suppose we see a strange cat that also has some features of dogs. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature. Since
the support vector machine creates a decision boundary between these two classes (cat and dog)
and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog. On the
basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by using a single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, then such data is termed non-linear data, and
the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if
there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if
there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance
between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane, and which affect the position of
the hyperplane, are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as
either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it to
2-d space with z = 1, then it becomes:
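A minimal numpy sketch of this idea follows: points on two concentric circles are not linearly separable in (x, y), but after adding the third dimension z = x² + y², a simple threshold on z separates them (the radii and threshold here are chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=100)

# Inner circle (class 0, radius 1) and outer ring (class 1, radius 3)
inner = np.column_stack([np.cos(theta[:50]), np.sin(theta[:50])])
outer = 3 * np.column_stack([np.cos(theta[50:]), np.sin(theta[50:])])
points = np.vstack([inner, outer])
labels = np.array([0] * 50 + [1] * 50)

# Add the third dimension z = x^2 + y^2
z = points[:, 0] ** 2 + points[:, 1] ** 2

# In the lifted space, the plane z = 4 separates the two classes perfectly
predictions = (z > 4).astype(int)
print(np.mean(predictions == labels))  # 1.0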
All the models discussed so far are designed for binary classification. In this section, we will
discuss how one can design multiway classification models by changing the architecture of the
perceptron slightly, and allowing multiple output nodes.
1. Multiclass Perceptron
Consider a setting with k different classes. Each training instance (Xi, c(i)) contains a
d-dimensional feature vector Xi and the index c(i) ∈ {1 . . . k} of its observed class. In such a
case, we would like to find k different linear separators W1 . . . Wk simultaneously so that the
value of Wc(i) · Xi is larger than Wr · Xi for each r ≠ c(i). This is because one always predicts a
data instance Xi to the class r with the largest value of Wr · Xi. Therefore, the loss function for
the ith training instance in the case of the multiclass perceptron is defined as follows:

Li = max_{r: r ≠ c(i)} max(Wr · Xi − Wc(i) · Xi, 0)
The multiclass perceptron is illustrated in Figure 2.5(a). As in all neural network models,
one can use gradient-descent in order to determine the updates. For a correctly classified
instance, the gradient is always 0, and there are no updates. For a misclassified instance,
the gradients are as follows:

∂Li/∂Wr = −Xi if r = c(i); Xi if r is the most misclassified incorrect class; 0 otherwise
Therefore, the stochastic gradient-descent method is applied as follows. Each training instance is
fed into the network. If the correct class r = c(i) receives the largest of output Wr · Xi, then no
update needs to be executed. Otherwise, the following update is made to each separator Wr for
learning rate α > 0:

Wr ⇐ Wr + αXi if r = c(i); Wr ⇐ Wr − αXi if r is the incorrect class with the largest
prediction Wr · Xi; Wr unchanged otherwise
Only two of the separators are always updated at a given time. In the special case that k = 2,
these gradient updates reduce to the perceptron because both the separators W1 and W2 will be
related as W1 = −W2 if the descent is started at W1 = W2 = 0. Another quirk that is specific to
the unregularized perceptron is that it is possible to use a learning rate of α = 1 without affecting
the learning because the value of α only has the effect of scaling the weight when starting with
Wj = 0. This property is, however, not true for other linear models in which the value of α does
affect the learning.
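A minimal numpy sketch of the multiclass perceptron update follows (synthetic data; each misclassified instance moves the true-class separator toward Xi and the top incorrect separator away from it, so exactly two separators change per update):

import numpy as np

rng = np.random.default_rng(4)
n, d, k = 300, 5, 3
X = rng.normal(size=(n, d))
c = rng.integers(0, k, size=n)          # observed class indices in {0, ..., k-1}

W = np.zeros((k, d))                    # one linear separator per class
alpha = 1.0                             # learning rate (scale-free for the perceptron)

for epoch in range(20):
    for Xi, ci in zip(X, c):
        scores = W @ Xi
        r = np.argmax(scores)           # predicted class
        if r != ci:                     # misclassified: update two separators
            W[ci] += alpha * Xi
            W[r] -= alpha * Xi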
2. Weston-Watkins SVM
The Weston-Watkins Support Vector Machine (SVM) is a formulation for multiclass
classification based on maximizing the margin.
The Weston-Watkins SVM [529] varies on the multiclass perceptron in two ways:
1. The multiclass perceptron only updates the linear separator of the class that is predicted
most incorrectly, along with the linear separator of the true class. On the other hand, the
Weston-Watkins SVM updates the separator of any class that is predicted more favorably than
the true class. In both cases, the separator of the observed class is updated by the same aggregate
amount as the incorrect classes (but in the opposite direction).
2. Not only does the Weston-Watkins SVM update the separator in the case of
misclassification, it updates the separators in cases where an incorrect class gets a prediction that
is “uncomfortably close” to the true class. This is based on the notion of margin. As in the case
of the multiclass perceptron, it is assumed that the ith training instance is denoted by (Xi, c(i)),
where Xi contains the d-dimensional feature variables, and c(i) contains the class index drawn
from {1, . . . , k}. One wants to learn d-dimensional coefficients W1 . . .Wk of the k linear
separators so that the class index r with the largest value of Wr ·Xi is predicted to be the correct
class c(i). The loss function Li for the ith training instance (Xi, c(i)) in the Weston-Watkins SVM
is as follows:

Li = Σ_{r: r ≠ c(i)} max(Wr · Xi − Wc(i) · Xi + 1, 0)
First, for each class r ≠ c(i), if the prediction Wr · Xi lags behind that of the true class by less
than a margin amount of 1, then a loss is incurred for that class. Furthermore, the losses over all
such classes r ≠ c(i) are added, rather than taking the maximum of the losses. These two
differences accomplish the two intuitive goals discussed above.
In order to determine the gradient-descent updates, one can find the gradient of the loss function
with respect to each Wr. In the event that the loss function Li is 0, the gradient of the loss
function is 0 as well. Therefore, no update is required when the training instance is classified
correctly with sufficient margin with respect to the second-best class. However, if the loss
function is non-zero we have either a misclassified or a “barely correct” prediction in which the
second-best and best class prediction are not sufficiently separated. In such cases, the gradient of
the loss is non-zero. The loss function is created by adding up the contributions of the (k−1)
separators belonging to the incorrect classes. Let δ(r, Xi) be a 0/1 indicator function, which is 1
when the rth class separator contributes positively to the loss function. In such a case, the
gradient of the loss function is as follows:

∂Li/∂Wr = −Xi [Σ_{j ≠ c(i)} δ(j, Xi)] if r = c(i); Xi δ(r, Xi) if r ≠ c(i)
This results in the following stochastic gradient-descent step for the rth separator Wr at
learning rate α (including the effect of regularization):

Wr ⇐ Wr(1 − αλ) + αXi [Σ_{j ≠ c(i)} δ(j, Xi)] if r = c(i); Wr ⇐ Wr(1 − αλ) − αXi δ(r, Xi)
if r ≠ c(i)

For training instances Xi in which the loss Li is zero, the above update can be shown to simplify
to a regularization update Wr ⇐ Wr(1 − αλ) of each hyperplane.
The regularization uses the parameter λ > 0. Regularization is considered essential to the proper
functioning of a support vector machine.
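A minimal numpy sketch of the Weston-Watkins loss and its subgradient for one training instance follows (the function name ww_loss_grad is illustrative; W holds one separator per row):

import numpy as np

def ww_loss_grad(W, Xi, ci, margin=1.0):
    # Weston-Watkins loss and subgradient for one instance (Xi, ci).
    # W has shape (k, d); row r is the separator of class r.
    scores = W @ Xi
    # delta(r, Xi) = 1 when class r's separator contributes to the loss
    delta = (scores - scores[ci] + margin > 0)
    delta[ci] = False
    loss = np.sum(scores[delta] - scores[ci] + margin)

    grad = np.zeros_like(W)
    grad[delta] = Xi                     # contributing incorrect classes pushed down
    grad[ci] = -delta.sum() * Xi         # true class pushed up by the same aggregate amount
    return loss, grad

# Usage with random numbers purely for illustration
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))
Xi, ci = rng.normal(size=4), 1
loss, grad = ww_loss_grad(W, Xi, ci)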
Multinomial Logistic Regression (Softmax Classifier)
1. Model Definition
For an input feature vector x, the softmax classifier computes the probability that x belongs
to class i using the following formula:

P(y = i | x) = exp(wi · x + bi) / Σ_{j=1}^{k} exp(wj · x + bj)

Where:
wi is the weight vector for class i,
bi is the bias term for class i,
k is the number of classes,
The denominator ensures the probabilities sum to 1.
2. Optimization
Multinomial logistic regression is trained using iterative optimization methods like:
Gradient Descent or Stochastic Gradient Descent (SGD): Update weights using the
gradient of the loss function.
Newton's Method or Quasi-Newton Methods: Use second-order derivatives for faster
convergence but at higher computational cost.
Batch Gradient Descent (common in large datasets).
Gradients:
The gradient of the cross-entropy loss with respect to the weights wi is:

∂L/∂wi = (P(y = i | x) − 1[y = i]) x

where 1[y = i] is 1 when the true class is i and 0 otherwise.
3. Applications
Image classification (e.g., CIFAR-10, ImageNet).
Natural Language Processing (e.g., part-of-speech tagging, sentiment analysis).
Multi-class medical diagnostics (e.g., disease classification).
Multi-class recommendation systems.
Also known as the softmax classifier, this method is widely used in neural networks for
multiclass classification.
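A minimal numpy sketch of the softmax classifier and its cross-entropy gradient steps follows (synthetic data; the gradient uses the (probability − one-hot) form given above):

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6)
n, d, k = 200, 4, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)          # class indices in {0, ..., k-1}
Y = np.eye(k)[y]                        # one-hot labels

W = np.zeros((d, k))                    # one weight vector per class (columns)
b = np.zeros(k)                         # one bias term per class
alpha = 0.1

for epoch in range(100):
    P = softmax(X @ W + b)              # predicted class probabilities
    # Cross-entropy gradient: (P - one_hot), averaged over the instances
    G = (P - Y) / n
    W -= alpha * X.T @ G
    b -= alpha * G.sum(axis=0)

predictions = np.argmax(softmax(X @ W + b), axis=1)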