Unit 2
Recurrent Neural Networks (RNNs) were introduced in the 1980s by researchers David
Rumelhart, Geoffrey Hinton, and Ronald J. Williams.
Recurrent neural networks (RNNs) are a type of artificial neural network that can process
sequential data such as text, speech, time-series, and images.
RNNs have an internal memory that allows them to remember previous inputs and outputs, and
use them to influence the current computation.
RNNs are widely used for tasks that require understanding the context and meaning of data such
as natural language processing, speech recognition, machine translation, and image captioning.
RNNs consist of multiple recurrent layers, each of which performs a transformation on the
input and the hidden state, and produces an output and a new hidden state.
Recurrent neural networks are a class of neural networks that are used for sequence modeling.
They can be expressed as time-layered networks in which the weights are shared between
different layers. The input is of the form x1 . . . xn, where xt is a d-dimensional point
received at time-stamp t.
For example,
The vector xt might contain the d values at the tth tick of a multivariate time-series (with d
different series). In a text setting, the vector xt will contain the one-hot encoded word at the tth
time-stamp. In one-hot encoding, we have a vector of length equal to the lexicon size, and the
component for the relevant word has a value of 1. All other components are 0.
An important point about sequences is that successive words are dependent on one another.
Therefore, it is helpful to receive a particular input xt only after the earlier inputs have already
been received and converted into a hidden state. The traditional type of feed-forward network in
which all inputs feed into the first layer does not achieve this goal. Therefore, the recurrent
neural network allows the input xt to interact directly with the hidden state created from the
inputs at previous time-stamps. The basic architecture of the recurrent neural network is
illustrated in Figure (a).
The key point is that there is an input xt at each time-stamp, and a hidden state ht that changes at
each time-stamp as new data points arrive. Each time-stamp also has an output value yt. For
example, in a time-series setting, the output yt might be the forecasted prediction of xt+1. When
used in the text setting of predicting the next word, this approach is referred to as language
modeling. In some applications, we do not output yt at each time-stamp, but only at the end of the
sequence. For example, if one is trying to classify the sentiment of a sentence as "positive" or
"negative," the output will occur only at the final time-stamp. The hidden state at time t is given
by a function of the input vector at time t and the hidden vector at time (t − 1):

ht = f(ht−1, xt)
A separate function yt = g(ht) is used to learn the output probabilities from the hidden states.
Note that the functions f(·) and g(·) are the same at each time stamp. The implicit assumption is
that the time-series exhibits a certain level of stationarity; the underlying properties do not
change with time. Although this property is not exactly true in real settings, it is a good
assumption to use for regularization.
A key point here is the presence of the self-loop in Figure (a), which will cause the hidden state
of the neural network to change after the input of each xt. In practice, one only works with
sequences of finite length, and it makes sense to unfurl the loop into a "time-layered" network
that looks more like a feed-forward network. This network is shown in Figure (b). Note that in
this case, we have a different node for the hidden state at each time-stamp and the self-loop has
been unfurled into a feed-forward network. This representation is mathematically equivalent to
Figure (a), but is much easier to comprehend because of its similarity to a traditional network.
Note that unlike traditional feed-forward networks, inputs also occur at the intermediate layers of
this unfurled network.
Fig: A recurrent neural network and its time-layered representation
The weight matrices of the connections are shared by multiple connections in the time-layered
network to ensure that the same function is used at each time stamp. This sharing is the key to
the domain-specific insights that are learned by the network. The backpropagation algorithm
takes the sharing and temporal length into account when updating the weights during the learning
process. This special type of backpropagation algorithm is referred to as backpropagation
through time (BPTT). Because of the recursive nature of Equation, the recurrent network has the
ability to compute a function of variable-length inputs. In other words, one can expand the
recurrence of Equation to define the function for ht in terms of t inputs. For example, starting at
h0, which is typically fixed to some constant vector, we have h1 = f(h0, x1) and h2 = f(f(h0, x1),
x2). Note that h1 is a function of only x1, whereas h2 is a function of both x1 and x2. Since the
output yt is a function of ht, these properties are inherited by yt as well. In general, we can write
the following:

yt = Ft(x1, x2, . . . , xt)
Note that the function Ft(·) varies with the value of t. Such an approach is particularly useful for
variable-length inputs like text sentences. However, the amount of data and the size of the hidden
states required for longer sequences increase in a way that is not realistic. Furthermore, there are
practical issues in finding the optimum choices of parameters because of the vanishing and
exploding gradient problems. As a result, specialized variants of the recurrent neural network
architecture have been proposed, such as the use of long short-term memory.
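To make the recurrence concrete, the following is a minimal numpy sketch of an unrolled RNN forward pass. The update ht = tanh(Wxh·xt + Whh·ht−1 + b) is one common choice for the function f(·); the matrix names Wxh, Whh, Why are chosen here purely for illustration, and the key point is that the same weights are reused at every time-stamp:

import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, b, h0):
    # Unrolled RNN forward pass over a sequence xs of d-dimensional inputs.
    # The same weights (Wxh, Whh, Why) are shared across all time-stamps.
    h = h0
    outputs = []
    for x in xs:
        # ht = f(h_{t-1}, xt); tanh is one common choice for f
        h = np.tanh(Wxh @ x + Whh @ h + b)
        # yt = g(ht); here g is a simple linear readout
        outputs.append(Why @ h)
    return outputs, h

# Usage: d = 4 input dimensions, p = 8 hidden units, 5 time-stamps
rng = np.random.default_rng(0)
d, p, T = 4, 8, 5
xs = [rng.normal(size=d) for _ in range(T)]
Wxh = rng.normal(size=(p, d))
Whh = rng.normal(size=(p, p))
Why = rng.normal(size=(d, p))
outputs, hT = rnn_forward(xs, Wxh, Whh, Why, np.zeros(p), h0=np.zeros(p))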
The connections in a convolutional neural network are very sparse, because any activation in a
particular layer is a function of only a small spatial region in the previous layer. All layers other
than the final set of two or three layers maintain their spatial structure. Therefore, it is possible to
spatially visualize what parts of the image affect particular portions of the activations in a layer.
The features in lower-level layers capture lines or other primitive shapes, whereas the features in
higher-level layers capture more complex shapes like loops (which commonly occur in many
digits). Therefore, later layers can create digits by composing the shapes in these intuitive
features. This is a classical example of the way in which semantic insights about specific data
domains are used to design clever architectures.
In addition, a subsampling layer simply averages the values in the local regions of size 2×2 in
order to compress the spatial footprints of the layers by a factor of 2. An illustration of the
architecture of LeNet-5 is shown in Figure 1.18. In the early years, LeNet-5 was used by several
banks to recognize hand-written numbers on checks.
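As an illustration, the following is a minimal numpy sketch of the 2×2 average-subsampling operation described above, which halves each spatial dimension of a feature map:

import numpy as np

def subsample_2x2(feature_map):
    # Average each non-overlapping 2x2 block, halving both spatial dimensions
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(subsample_2x2(fm))  # 2x2 output: each entry is the mean of a 2x2 block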
Convolutional neural networks have historically been the most successful of all types of neural
networks. They are used widely for image recognition, object detection/localization, and even
text processing. The performance of these networks has recently exceeded that of humans in the
problem of image classification. Convolutional neural networks provide a very good example of
the fact that architectural design choices in a neural network should be performed with semantic
insight about the data domain at hand. In the particular case of the convolutional neural network,
this insight was obtained by observing the biological workings of a cat’s visual cortex, and
heavily using the spatial relationships among pixels.
This fact also provides some evidence that a further understanding of neuroscience might also be
helpful for the development of methods in artificial intelligence. Convolutional neural networks
pretrained on large publicly available data sets like ImageNet are often available for use in an off-
the-shelf manner for other applications and data sets. This is achieved by using most of the
pretrained weights in the convolutional network without any change except for the final
classification layer. The weights of the final classification layer are learned from the data set at
hand. The training of the final layer is necessary because the class labels in a particular setting
may be different from those of ImageNet.
Nevertheless, the weights in the early layers are still useful because they learn various types of
shapes in the images that can be useful for virtually any type of classification
application. Furthermore, the feature activations in the penultimate layer can even be used for
unsupervised applications. For example, one can create a multidimensional representation of an
arbitrary image data set by passing each image through the convolutional neural network and
extracting the activations of the penultimate layer. Subsequently, any type of indexing can be
applied to this representation for retrieving images that are similar to a specific target image.
Such an approach often provides surprisingly good results in image retrieval because of the
semantic nature of the features learned by the network. It is noteworthy that the use of pretrained
convolutional networks is so popular that training is rarely started from scratch.
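For instance, the following is a minimal sketch of penultimate-layer feature extraction, assuming PyTorch and torchvision are available; resnet18 is just one example of an ImageNet-pretrained network, and the input here is a random placeholder rather than a real preprocessed image:

import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained network and drop its final classification layer
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Identity()  # the penultimate activations become the output
model.eval()

with torch.no_grad():
    image_batch = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image
    features = model(image_batch)              # 512-dimensional representation

# `features` can now be indexed or compared for image retrieval
print(features.shape)  # torch.Size([1, 512])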
Many deep feed-forward architectures have multiple layers in which successive transformations
of the inputs from the previous layer lead to increasingly sophisticated representations of the
data. The values of each hidden layer for a particular input contain a
transformed representation of the input point, which becomes increasingly informative about the
target value we are trying to learn, as the layer gets closer to the output node. The appropriately
transformed feature representations are more amenable to simple types of predictions in the
output layer. This sophistication is a result of the nonlinear activations in intermediate layers.
Traditionally, the sigmoid and tanh activations were the most popular choices in the hidden
layers, but the ReLU activation has become increasingly popular in recent years because of the
desirable property that it is better at avoiding the vanishing and exploding gradient problems. The
final layer can be viewed as a relatively simple prediction layer, which contains a single linear
neuron in the case of regression and a sigmoid/sign function in the case of binary classification.
More complex outputs might require multiple nodes. One way of viewing
this division of labor between the hidden layers and final prediction layer is that the early layers
create a feature representation that is more amenable to the task at hand. The final layer then
leverages this learned feature representation. A key point is that the features learned in the
hidden layers are often (but not always) generalizable to other data sets and problem settings in
the same domain (e.g., text, images, and so on). This property can be leveraged in various ways
by simply replacing the output node(s) of a pretrained network with a different application-
specific output layer (e.g., linear regression layer instead of sigmoid classification layer) for the
data set and problem at hand. Subsequently, only the weights of the newly replaced output layer
may need to be learned for the new data set and application, whereas the weights of other layers
are fixed.
The output of each hidden layer is a transformed feature representation of the data, in which the
dimensionality of the representation is defined by the number of units in that layer. One can view
this process as a kind of hierarchical feature engineering in which the features in earlier layers
represent primitive characteristics of the data, whereas those in later layers represent complex
characteristics with semantic significance to the class labels. Data represented in terms of the
features of later layers are often more well behaved (e.g., linearly separable) because of the
semantic nature of the features learned by the transformation. This type of behavior is
particularly evident in a visually interpretable way in some domains like convolutional neural
networks for image data. In convolutional neural networks, the features in earlier layers capture
detailed but primitive shapes like lines or edges from the data set of images. On the other hand,
the features in later layers capture shapes of greater complexity like hexagons, honeycombs, and
so forth, depending on the type of images provided as training data. Note that such semantically
interpretable shapes often have closer correlations with class labels in the image domain. For
example, almost any image will contain lines or edges, but images belonging to particular classes
will be more likely to have hexagons or honeycombs. This property tends to make the
representations of later layers easier to classify with simple models like linear classifiers. This
process is illustrated in Figure 1.19. The features in earlier layers are used repeatedly as building
blocks to create more complex features. This general principle of “putting together” simple
features to create more complex features lies at the core of the successes achieved with neural
networks. As it turns out, this property is also useful in leveraging pretrained models in a
carefully calibrated way. The practice of using pretrained models is also referred to as transfer
learning. A particular type of transfer learning, which is used commonly in neural networks, is
that the data and structure available in a given data set are used to learn features for that entire
domain. A classical example of this setting is that of text or image data. In text data, the
representations of text words are created using standardized benchmark data sets like Wikipedia
and models like word2vec. These can be used in almost any text application, since the nature of
text data does not change very much with the application. A similar approach is often used for
image data, in which the ImageNet data set is used to pretrain convolutional neural networks,
and provide ready-to-use features. One can download a pretrained convolutional neural network
model and convert any image data set into a multidimensional representation by passing the
image through the pretrained network. Furthermore, if additional application-specific data is
available, one can regulate the level of transfer learning depending on the amount of available
data. This is achieved by fine-tuning a subset of the layers in the pretrained neural network with
this additional data. If a small amount of application-specific data is available, one can fix the
weights of the early layers to their pretrained values and fine-tune only the last few layers of the
neural network. The early layers often contain primitive features, which are more easily
generalizable to arbitrary applications. For example, in a convolutional neural network, the early
layers learn primitive features like edges, which are useful across diverse images like trucks or
carrots. On the other hand, the later layers contain complex features which might depend on the
image collection at hand (e.g., truck wheel versus carrot top). Fine-tuning only the weights of the
later layers makes sense in such cases. If a large amount of application-specific data is available,
one can fine-tune a larger number of layers. Therefore, deep networks provide significant
flexibility in terms of how transfer learning is done with pretrained neural network models.
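As a sketch of this regulated transfer learning, the snippet below (again assuming PyTorch/torchvision, with resnet18 as a stand-in pretrained model and num_classes as a hypothetical application-specific label count) freezes the early layers and trains only the replaced output layer:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 5  # hypothetical number of application-specific classes

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Fix the pretrained weights of the early layers
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer; only its weights will be learned for the new data set
model.fc = nn.Linear(model.fc.in_features, num_classes)

# An optimizer over just the trainable parameters would then be used, e.g.:
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)

With more application-specific data, one could instead leave the last few layers unfrozen and fine-tune them as well, which matches the discussion above.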
Recurrent neural networks can be hard to train, because they are prone to the vanishing and
exploding gradient problems. However, there are other ways of training more robust recurrent
networks. A particular example that has found favor is the use of the long short-term memory
network. This network uses a gentler update process of the hidden states in order to avoid the
vanishing and exploding gradient problems. Recurrent neural networks and their variants have
found use in many applications such as image captioning, token-level classification, and sentence
classification.
Datasets
The benchmarks used in the neural network literature are dominated by data from the domain of
computer vision. Although traditional machine learning data sets like the UCI repository can be
used for testing neural networks, the general trend is towards using data sets from perceptually
oriented data domains that can be visualized well. Although there are a variety of data sets drawn
from the text and image domains, two of them stand out because of their ubiquity in deep
learning papers. Although both are data sets drawn from computer vision, the first of them is
simple enough that it can also be used for testing generic applications beyond the field of vision.
In the following, we provide a brief overview of these two data sets.
The MNIST database, which stands for Modified National Institute of Standards and
Technology database, is a large database of handwritten digits. This data set was created by
modifying an original database of handwritten digits provided by NIST. The data set contains
60,000 training images and 10,000 testing images. Each image is a scan of a handwritten digit
from 0 to 9, and the differences between different images are a result of the differences in the
handwriting of different individuals. These individuals were American Census Bureau
employees and American high school students. The original black and white images from NIST
were size normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio and centered
in a 28 × 28 image by computing the center of mass of the pixels. The images were translated to
position this point at the center of the 28×28 field. Each of these 28×28 pixel values takes on a
value from 0 to 255, depending on where it lies in the grayscale spectrum. The labels associated
with the images correspond to the ten digit values. Examples of the digits in the MNIST database
are illustrated in Figure.
The size of the data set is rather small, and it contains only a simple object corresponding to a
digit. Therefore, one might argue that the MNIST database is a toy data set. However, its small
size and simplicity is also an advantage because it can be used as a laboratory for quick testing of
machine learning algorithms. Furthermore, the simplification of the data set by virtue of the fact
that the digits are (roughly) centered makes it easy to use it to test algorithms beyond computer
vision.
Although the matrix representation of each image is suited to a convolutional neural network,
one can also convert it into a multidimensional representation of 28 × 28 = 784 dimensions. This
conversion loses some of the spatial information in the image, but this loss is not debilitating (at
least in the case of the MNIST data set) because of its relative simplicity. In fact, the use of a
simple support vector machine on the 784-dimensional representation can provide an impressive
error rate of about 0.56%. A straightforward 2-layer neural network on the multidimensional
representation (without using the spatial structure in the image) generally does worse than the
support vector machine across a broad range of parameter choices! A deep neural network
without any special convolutional architecture can achieve an error rate of 0.35%. Deeper neural
networks and convolutional neural networks (that do use spatial structure) can reduce the error
rate to as low as 0.21% by using an ensemble of five convolutional networks. Therefore, even on
this simple data set, one can see that the relative performance of neural networks with respect to
traditional machine learning is sensitive to the specific architecture used in the former.
Finally, it should be noted that the 784-dimensional non-spatial representation of the MNIST
data is used for testing all types of neural network algorithms beyond the domain of computer
vision. Even though the use of the 784-dimensional (flattened) representation is not appropriate
for a vision task, it is still useful for testing the general effectiveness of non-vision oriented (i.e.,
generic) neural network algorithms. For example, the MNIST data is frequently used to test
generic autoencoders and not just convolutional ones. Even when the non-spatial representation
of an image is used to reconstruct it with an autoencoder, one can still visualize the results with
the original spatial positions of the reconstructed pixels to obtain a feel of what the algorithm is
doing with the data.
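As a minimal sketch of this flattening (using random stand-in arrays rather than the actual MNIST files), each 28 × 28 image becomes a 784-dimensional vector:

import numpy as np

# Stand-in for the MNIST training images: 60,000 grayscale 28x28 scans
images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional vector and scale to [0, 1]
flat = images.reshape(len(images), -1).astype(np.float32) / 255.0
print(flat.shape)  # (60000, 784)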
The ImageNet Database
The ImageNet database is a huge database of over 14 million images drawn from 1000 different
categories. Its class coverage is exhaustive enough that it covers most types of images that one
would encounter in everyday life. This database is organized according to a WordNet hierarchy
of nouns. The WordNet database is a data set containing the relationships among English words
using the notion of synsets. The WordNet hierarchy has been successfully used for machine
learning in the natural language domain, and therefore it is natural to design an image data set
around these relationships. The ImageNet database is famous for the fact that an annual
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is held using this dataset. This
competition has a very high profile in the vision community and receives entries from most
major research groups in computer vision. The entries to this competition have resulted in many
of the state-of-the-art image recognition architectures today, including the methods that have
surpassed human performance on some narrow tasks like image classification. Because of the
wide availability of known results on these data sets, it is a popular alternative for benchmarking.
Another important significance of the ImageNet data set is that it is large and diverse enough to
be representative of the key visual concepts within the image domain. As a result, convolutional
neural networks are often trained on this data set; the pretrained network can be used to extract
features from an arbitrary image. This image representation is defined by the hidden activations
in the penultimate layer of the neural network. Such an approach creates new multidimensional
representations of image data sets that are amenable for use with traditional machine learning
methods. One can view this approach as a kind of transfer learning in which the visual concepts
in the ImageNet data set are transferred to unseen data objects for other applications.
Machine Learning with shallow neural networks refers to the use of relatively simple neural
networks with few layers (typically one hidden layer) for tasks like classification, regression, and
pattern recognition. These types of networks are considered "shallow" because they lack the
deep, complex architecture of modern deep learning models that often involve many hidden
layers.
Limitations of Shallow Neural Networks
o Limited Capacity: Shallow neural networks might struggle to capture complex patterns in
data. For tasks involving intricate relationships, deep networks (with many layers) often
outperform shallow ones.
o Feature Engineering: Shallow networks may require more manual feature engineering, as
they cannot automatically learn hierarchical features in the same way deep networks can.
Use Cases of Shallow Neural Networks
1. Simple Classification Tasks: For example, binary classification or small multi-class
classification problems.
2. Regression: When the target variable is continuous, a shallow neural network can model
the relationship between input features and the output.
3. Dimensionality Reduction: Shallow networks like autoencoders (with one hidden layer)
can be used for unsupervised tasks like reducing the number of features.
Popular Shallow Neural Network Algorithms
o Single-layer Perceptron (SLP): A basic shallow neural network that performs linear
classification using a single layer of weights connecting the inputs directly to the output
node (there is no hidden layer).
o Multilayer Perceptron (MLP): A network with one or more hidden layers of neurons. An
MLP with a single hidden layer is still considered shallow in contrast to modern deep
networks. MLPs can be used for both classification and regression.
Shallow neural networks are still a vital tool in machine learning, especially for simpler tasks or
when computational efficiency is crucial.
In machine learning, binary classification involves predicting one of two possible classes (e.g., 0
or 1, true or false, etc.). Neural network architectures are often used to solve binary classification
problems, with different types of architectures providing varying levels of performance
depending on the complexity of the problem and data. Here are some commonly used neural
architectures for binary classification:
1. Single-Layer Perceptron (SLP)
The Single-Layer Perceptron is the most basic neural network architecture for binary
classification.
Structure:
Input Layer: Each neuron corresponds to one feature from the input data.
Output Layer: A single neuron that outputs the predicted class (0 or 1).
Activation Function: The output layer typically uses a sigmoid activation function to
squash the output between 0 and 1, representing the probability of belonging to class 1.
Training:
Loss Function: Binary Cross-Entropy (Log Loss) is used to compute the loss between
predicted probabilities and actual binary labels.
Optimization: The network is trained using gradient descent and backpropagation.
Example Use Case:
Simple problems where data is linearly separable, or relatively simple patterns need to be
learned.
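A minimal numpy sketch of this architecture follows: one layer of weights, a sigmoid output, binary cross-entropy loss, and plain gradient-descent updates (the data here is randomly generated purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)  # synthetic 0/1 labels

w = np.zeros(d)
b = 0.0
alpha = 0.1  # learning rate

for epoch in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions in (0, 1)
    # Gradient of binary cross-entropy with respect to w and b
    grad_w = X.T @ (p - y) / n
    grad_b = np.mean(p - y)
    w -= alpha * grad_w
    b -= alpha * grad_b

accuracy = np.mean((p > 0.5) == y)  # training accuracy of the learned classifier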
The choice of neural network architecture for binary classification depends on several factors
such as the complexity of the data, the nature of the problem (e.g., sequential, image, or tabular
data), and the computational resources available. For simple tasks, a Single-Layer Perceptron or
MLP might suffice, while for more complex tasks involving sequences, images, or intricate
relationships, architectures like CNNs, RNNs, and GANs may be more suitable.
Some basic architectures for machine learning models such as least-squares regression and
classification are discussed here.
Small changes in these neural architectures can result in models that are distinct from traditional
machine learning.
Consider a single-layer network with d input nodes and a single output node. The coefficients of
the connections from the d input nodes to the output node are denoted by W = (w1 . . . wd).
Furthermore, the bias will not be explicitly shown because it can be seamlessly modeled as the
coefficient of an additional dummy input with a constant value of 1.
Figure: An extended architecture of the perceptron with both discrete and continuous predictions
Revisiting the Perceptron
Let (Xi, yi) be a training instance, in which the observed value yi is predicted from the feature
variables Xi using the following relationship:
ŷi = sign(W · Xi)
Here, W is the d-dimensional coefficient vector learned by the perceptron. Note the circumflex
on top of ŷi to indicate that it is a predicted value rather than an observed value. In general, the
goal of training is to ensure that the prediction ŷi is as close as possible to the observed value yi.
The gradient-descent steps of the perceptron are focused on reducing the number of
misclassifications, and therefore the updates are proportional to the difference (yi − ŷi) between
the observed and predicted values. A gradient-descent update that is proportional to the
difference between the observed and predicted values is naturally caused by a squared loss
function such as (yi − ŷi)^2. Therefore, one possibility is to consider the squared loss between
the predicted and observed values as the loss function.
Least-Squares Regression
Widrow-Hoff Learning
Following the perceptron, the Widrow-Hoff learning rule was proposed in 1960. However, the
method was not a fundamentally new one, as it is a direct application of least-squares regression
to binary targets. Although the sign function is applied to the real-valued prediction of unseen
test instances to convert them to binary predictions, the error of training instances is computed
directly using real-valued predictions (unlike the perceptron). Therefore, it is also referred to as
least-squares classification or linear least-squares method. Remarkably, a seemingly unrelated
method proposed in 1936, known as the Fisher discriminant, also reduces to Widrow-Hoff
learning in the special case of binary targets.
The Fisher discriminant is formally defined as a direction W along which the ratio of inter-class
variance to the intra-class variance is maximized in the projected data. By choosing a scalar b in
order to define the hyperplane W · X = b, it is possible to model the separation between the two
classes. This hyperplane is used for classification. Although the definition of the Fisher
discriminant seems quite different from least-squares regression/ classification at first sight, a
remarkable result is that the Fisher discriminant for binary targets is identical to the least-squares
regression as applied to binary targets (i.e., least-squares classification). Both the data and the
targets need to be mean-centered, which allows the bias variable b to be set to 0.
Closed Form Solutions
The special case of least-squares regression and classification is solvable in closed form (without
gradient-descent) by using the pseudo-inverse of the n × d training data matrix D, whose rows
are X1 . . .Xn. Let the n-dimensional column vector of dependent variables be denoted by y = [y1
. . . yn]ᵀ. The pseudo-inverse of the matrix D is defined as follows:

D⁺ = (DᵀD)⁻¹Dᵀ

Then, the row-vector W is defined by the following relationship:

Wᵀ = D⁺y

If regularization is incorporated, the coefficient vector W is given by the following:

Wᵀ = (DᵀD + λI)⁻¹Dᵀy
Here, λ > 0 is the regularization parameter. However, inverting a matrix like (DᵀD + λI) is
typically done using numerical methods that require gradient descent anyway, and one rarely
inverts large matrices like DᵀD in practice. In fact, the Widrow-Hoff updates provide a very
efficient way of solving the problem without using the closed-form solution.
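The closed-form solutions above translate directly into numpy; the sketch below (with synthetic data) uses np.linalg.solve rather than an explicit matrix inverse, which reflects the standard numerical practice hinted at above:

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
D = rng.normal(size=(n, d))                     # training data matrix (rows X1 . . . Xn)
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = D @ true_w + 0.1 * rng.normal(size=n)       # dependent variables

# Unregularized closed form: W^T = (D^T D)^{-1} D^T y
w_ols = np.linalg.solve(D.T @ D, D.T @ y)

# Regularized (ridge) closed form: W^T = (D^T D + lambda*I)^{-1} D^T y
lam = 0.5
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ y)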
Logistic Regression
Logistic regression is a probabilistic model that classifies the instances in terms of probabilities.
Because the classification is probabilistic, a natural approach for optimizing the parameters is to
ensure that the predicted probability of the observed class for each training instance is as large as
possible. This goal is achieved by using the notion of maximum-likelihood estimation in order to
learn the parameters of the model. The likelihood of the training data is defined as the product of
the probabilities of the observed labels of each training instance. Clearly, larger values of this
objective function are better. By using the negative logarithm of this value, one obtains a loss
function in minimization form. Therefore, the output node uses the negative log-likelihood as a
loss function. This loss function replaces the squared error used in the Widrow-Hoff method. The
output layer can be formulated with the sigmoid activation function, which is very common in
neural network design.
Let (X1, y1), (X2, y2), . . . (Xn, yn) be a set of n training pairs in which Xi contains the d-
dimensional features and yi ∈ {−1, +1} is a binary class variable. As in the case of a perceptron,
a single-layer architecture with weights W = (w1 . . . wd) is used. Instead of using the hard sign
activation on W · Xi to predict yi, logistic regression applies the soft sigmoid function to W · Xi
in order to estimate the probability that yi is 1:

ŷi = P(yi = 1) = 1 / (1 + exp(−W · Xi))
For a test instance, it can be predicted to belong to the class whose predicted probability is
greater than 0.5. Note that P(yi = 1) is 0.5 when W · Xi = 0, in which case Xi lies on the
separating hyperplane.
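A minimal numpy sketch of maximum-likelihood training under this ±1 formulation follows (synthetic data; the per-instance update W ← W + α·yi·Xi/(1 + exp(yi·W·Xi)) is the stochastic gradient-ascent step on the log-likelihood):

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = np.where(X @ np.array([1.0, -1.0, 2.0]) > 0, 1.0, -1.0)  # synthetic labels in {-1, +1}

W = np.zeros(d)
alpha = 0.05

for epoch in range(50):
    for Xi, yi in zip(X, y):
        # Stochastic gradient ascent on the log-likelihood:
        # gradient = yi * Xi / (1 + exp(yi * W.Xi))
        W += alpha * yi * Xi / (1.0 + np.exp(yi * np.dot(W, Xi)))

prob_pos = 1.0 / (1.0 + np.exp(-(X @ W)))      # P(yi = +1) for each instance
predictions = np.where(prob_pos > 0.5, 1, -1)  # threshold at probability 0.5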
o Logistic regression is used for predicting the categorical dependent variable using a given
set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, and so on.
However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.
o Logistic regression is similar to linear regression except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The image below
shows the logistic function:
Note: Logistic regression uses the concept of predictive modeling as in regression, which is why
it is called logistic regression; however, because it is used to classify samples, it falls under the
classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + . . . + bnxn

o In logistic regression, y can be between 0 and 1 only, so let's divide the above equation by
(1 − y):

y / (1 − y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between −[infinity] and +[infinity]; taking the logarithm of the
equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + . . . + bnxn
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
Example: Suppose we see a strange cat that also has some features of dogs. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature. Since
the support vector machine creates a decision boundary between these two classes (cat and dog)
and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog. On the
basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by using a single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, then such data is termed non-linear data, and
the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if
there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if
there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance
between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane, and which affect the position of
the hyperplane, are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as
either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it to
2-d space with z = 1, then it becomes:
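A minimal numpy sketch of this idea follows: points on two concentric circles are not linearly separable in (x, y), but after adding the third dimension z = x² + y², a simple threshold on z separates them (the radii and threshold here are chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=100)

# Inner circle (class 0, radius 1) and outer ring (class 1, radius 3)
inner = np.column_stack([np.cos(theta[:50]), np.sin(theta[:50])])
outer = 3 * np.column_stack([np.cos(theta[50:]), np.sin(theta[50:])])
points = np.vstack([inner, outer])
labels = np.array([0] * 50 + [1] * 50)

# Add the third dimension z = x^2 + y^2
z = points[:, 0] ** 2 + points[:, 1] ** 2

# In the lifted space, the plane z = 4 separates the two classes perfectly
predictions = (z > 4).astype(int)
print(np.mean(predictions == labels))  # 1.0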
All the models discussed so far are designed for binary classification. In this section, we will
discuss how one can design multiway classification models by changing the architecture of the
perceptron slightly, and allowing multiple output nodes.
1. Multiclass Perceptron
Consider a setting with k different classes. Each training instance (Xi, c(i)) contains a
d-dimensional feature vector Xi and the index c(i) ∈ {1 . . . k} of its observed class. In such a
case, we would like to find k different linear separators W1 . . . Wk simultaneously so that the
value of Wc(i) · Xi is larger than Wr · Xi for each r ≠ c(i). This is because one always predicts a
data instance Xi to the class r with the largest value of Wr · Xi. Therefore, the loss function for
the ith training instance in the case of the multiclass perceptron is defined as follows:

Li = max_{r: r ≠ c(i)} max(Wr · Xi − Wc(i) · Xi, 0)
The multiclass perceptron is illustrated in Figure 2.5(a). As in all neural network models,
one can use gradient-descent in order to determine the updates. For a correctly classified
instance, the gradient is always 0, and there are no updates. For a misclassified instance,
the gradients are as follows:

∂Li/∂Wr = −Xi if r = c(i); Xi if r is the most misclassified incorrect class; 0 otherwise
Therefore, the stochastic gradient-descent method is applied as follows. Each training instance is
fed into the network. If the correct class r = c(i) receives the largest of output Wr · Xi, then no
update needs to be executed. Otherwise, the following update is made to each separator Wr for
learning rate α > 0:

Wr ⇐ Wr + αXi if r = c(i); Wr ⇐ Wr − αXi if r is the incorrect class with the largest
prediction Wr · Xi; Wr unchanged otherwise
Only two of the separators are always updated at a given time. In the special case that k = 2,
these gradient updates reduce to the perceptron because both the separators W1 and W2 will be
related as W1 = −W2 if the descent is started at W1 = W2 = 0. Another quirk that is specific to
the unregularized perceptron is that it is possible to use a learning rate of α = 1 without affecting
the learning because the value of α only has the effect of scaling the weight when starting with
Wj = 0. This property is, however, not true for other linear models in which the value of α does
affect the learning.
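A minimal numpy sketch of the multiclass perceptron update follows (synthetic data; each misclassified instance moves the true-class separator toward Xi and the top incorrect separator away from it, so exactly two separators change per update):

import numpy as np

rng = np.random.default_rng(4)
n, d, k = 300, 5, 3
X = rng.normal(size=(n, d))
c = rng.integers(0, k, size=n)          # observed class indices in {0, ..., k-1}

W = np.zeros((k, d))                    # one linear separator per class
alpha = 1.0                             # learning rate (scale-free for the perceptron)

for epoch in range(20):
    for Xi, ci in zip(X, c):
        scores = W @ Xi
        r = np.argmax(scores)           # predicted class
        if r != ci:                     # misclassified: update two separators
            W[ci] += alpha * Xi
            W[r] -= alpha * Xi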
2. Weston-Watkins SVM
The Weston-Watkins Support Vector Machine (SVM) is a formulation for multiclass
classification based on maximizing the margin.
The Weston-Watkins SVM [529] varies on the multiclass perceptron in two ways:
1. The multiclass perceptron only updates the linear separator of the class that is predicted
most incorrectly, along with the linear separator of the true class. On the other hand, the
Weston-Watkins SVM updates the separator of any class that is predicted more favorably than
the true class. In both cases, the separator of the observed class is updated by the same aggregate
amount as the incorrect classes (but in the opposite direction).
2. Not only does the Weston-Watkins SVM update the separator in the case of
misclassification, it updates the separators in cases where an incorrect class gets a prediction that
is “uncomfortably close” to the true class. This is based on the notion of margin. As in the case
of the multiclass perceptron, it is assumed that the ith training instance is denoted by (Xi, c(i)),
where Xi contains the d-dimensional feature variables, and c(i) contains the class index drawn
from {1, . . . , k}. One wants to learn d-dimensional coefficients W1 . . .Wk of the k linear
separators so that the class index r with the largest value of Wr ·Xi is predicted to be the correct
class c(i). The loss function Li for the ith training instance (Xi, c(i)) in the Weston-Watkins SVM
is as follows:

Li = Σ_{r: r ≠ c(i)} max(Wr · Xi − Wc(i) · Xi + 1, 0)
First, for each class r ≠ c(i), if the prediction Wr · Xi lags behind that of the true class by less
than a margin amount of 1, then a loss is incurred for that class. Furthermore, the losses over all
such classes r ≠ c(i) are added, rather than taking the maximum of the losses. These two
differences accomplish the two intuitive goals discussed above.
In order to determine the gradient-descent updates, one can find the gradient of the loss function
with respect to each Wr. In the event that the loss function Li is 0, the gradient of the loss
function is 0 as well. Therefore, no update is required when the training instance is classified
correctly with sufficient margin with respect to the second-best class. However, if the loss
function is non-zero we have either a misclassified or a “barely correct” prediction in which the
second-best and best class prediction are not sufficiently separated. In such cases, the gradient of
the loss is non-zero. The loss function is created by adding up the contributions of the (k−1)
separators belonging to the incorrect classes. Let δ(r, Xi) be a 0/1 indicator function, which is 1
when the rth class separator contributes positively to the loss function. In such a case, the
gradient of the loss function is as follows:

∂Li/∂Wr = −Xi [Σ_{j ≠ c(i)} δ(j, Xi)] if r = c(i); Xi δ(r, Xi) if r ≠ c(i)
This results in the following stochastic gradient-descent step for the rth separator Wr at
learning rate α (including the effect of regularization):

Wr ⇐ Wr(1 − αλ) + αXi [Σ_{j ≠ c(i)} δ(j, Xi)] if r = c(i); Wr ⇐ Wr(1 − αλ) − αXi δ(r, Xi)
if r ≠ c(i)

For training instances Xi in which the loss Li is zero, the above update can be shown to simplify
to a regularization update Wr ⇐ Wr(1 − αλ) of each hyperplane.
The regularization uses the parameter λ > 0. Regularization is considered essential to the proper
functioning of a support vector machine.
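A minimal numpy sketch of the Weston-Watkins loss and its subgradient for one training instance follows (the function name ww_loss_grad is illustrative; W holds one separator per row):

import numpy as np

def ww_loss_grad(W, Xi, ci, margin=1.0):
    # Weston-Watkins loss and subgradient for one instance (Xi, ci).
    # W has shape (k, d); row r is the separator of class r.
    scores = W @ Xi
    # delta(r, Xi) = 1 when class r's separator contributes to the loss
    delta = (scores - scores[ci] + margin > 0)
    delta[ci] = False
    loss = np.sum(scores[delta] - scores[ci] + margin)

    grad = np.zeros_like(W)
    grad[delta] = Xi                     # contributing incorrect classes pushed down
    grad[ci] = -delta.sum() * Xi         # true class pushed up by the same aggregate amount
    return loss, grad

# Usage with random numbers purely for illustration
rng = np.random.default_rng(5)
W = rng.normal(size=(3, 4))
Xi, ci = rng.normal(size=4), 1
loss, grad = ww_loss_grad(W, Xi, ci)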
Multinomial Logistic Regression (Softmax Classifier)
1. Model Definition
For an input feature vector x, the softmax classifier computes the probability that x belongs
to class i using the following formula:

P(y = i | x) = exp(wi · x + bi) / Σ_{j=1}^{k} exp(wj · x + bj)

Where:
wi is the weight vector for class i,
bi is the bias term for class i,
k is the number of classes,
The denominator ensures the probabilities sum to 1.
2. Optimization
Multinomial logistic regression is trained using iterative optimization methods like:
Gradient Descent or Stochastic Gradient Descent (SGD): Update weights using the
gradient of the loss function.
Newton's Method or Quasi-Newton Methods: Use second-order derivatives for faster
convergence but at higher computational cost.
Batch Gradient Descent (common in large datasets).
Gradients:
The gradient of the cross-entropy loss with respect to the weights wi is:

∂L/∂wi = (P(y = i | x) − 1[y = i]) x

where 1[y = i] is 1 when the true class is i and 0 otherwise.
3. Applications
Image classification (e.g., CIFAR-10, ImageNet).
Natural Language Processing (e.g., part-of-speech tagging, sentiment analysis).
Multi-class medical diagnostics (e.g., disease classification).
Multi-class recommendation systems.
Also known as the softmax classifier, this method is widely used in neural networks for
multiclass classification.
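A minimal numpy sketch of the softmax classifier and its cross-entropy gradient steps follows (synthetic data; the gradient uses the (probability − one-hot) form given above):

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6)
n, d, k = 200, 4, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)          # class indices in {0, ..., k-1}
Y = np.eye(k)[y]                        # one-hot labels

W = np.zeros((d, k))                    # one weight vector per class (columns)
b = np.zeros(k)                         # one bias term per class
alpha = 0.1

for epoch in range(100):
    P = softmax(X @ W + b)              # predicted class probabilities
    # Cross-entropy gradient: (P - one_hot), averaged over the instances
    G = (P - Y) / n
    W -= alpha * X.T @ G
    b -= alpha * G.sum(axis=0)

predictions = np.argmax(softmax(X @ W + b), axis=1)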