
Unit 4 DEEP FEEDFORWARD NETWORKS

Syllabus
History of Deep Learning- A Probabilistic Theory of Deep Learning-
Gradient Learning – Chain Rule and Backpropagation - Regularization:
Dataset Augmentation – Noise Robustness -Early Stopping, Bagging and
Dropout - batch normalization- VC Dimension and Neural Nets.

Deep Learning – An Introduction


• Deep learning is a method in artificial intelligence (AI) that teaches
computers to process data in a way that is inspired by the human brain.
Deep learning models can recognize complex patterns in pictures, text,
sounds, and other data to produce accurate insights and predictions.
• It has become increasingly popular in recent years due to advances in
processing power and the availability of large datasets. It is based on
artificial neural networks (ANNs), also known as deep neural
networks (DNNs).
• These neural networks are inspired by the structure and function of the
human brain’s biological neurons, and they are designed to learn from
large amounts of data.
• The key characteristic of Deep Learning is the use of deep neural
networks, which have multiple layers of interconnected nodes. These
networks can learn complex representations of data by discovering
hierarchical patterns and features in the data.
• Deep Learning algorithms can automatically learn and improve from data
without the need for manual feature engineering.
Difference between Machine Learning and Deep Learning:

Machine Learning | Deep Learning
Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.
Can work with a smaller amount of data. | Requires a larger volume of data compared to machine learning.
Takes less time to train the model. | Takes more time to train the model.
A model is built from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images; it is an end-to-end learning process.
Can work on a CPU, or requires less computing power compared to deep learning. | Requires a high-performance computer with a GPU.

History of Deep Learning


Here is a brief history of some key developments in deep learning:
The history of deep learning can be traced back to 1943, when Walter Pitts and
Warren McCulloch created a computer model based on the neural networks of
the human brain.
They used a combination of algorithms and mathematics they called “threshold
logic” to mimic the thought process. Since that time, Deep Learning has evolved
steadily, with only two significant breaks in its development. Both were tied to
the infamous Artificial Intelligence winters.
The 1960s
Henry J. Kelley is given credit for developing the basics of a continuous Back
Propagation Model in 1960. In 1962, a simpler version based only on the chain
rule was developed by Stuart Dreyfus. While the concept of back propagation
(the backward propagation of errors for purposes of training) did exist in the
early 1960s, it was clumsy and inefficient, and would not become useful until
1985.
The earliest efforts in developing deep learning algorithms came from Alexey
Grigoryevich Ivakhnenko (developed the Group Method of Data Handling) and
Valentin Grigorʹevich Lapa (author of Cybernetics and Forecasting Techniques)
in 1965. They used models with polynomial (complicated equations) activation
functions, that were then analyzed statistically. From each layer, the best
statistically chosen features were then forwarded on to the next layer (a slow,
manual process).
The 1970s
During the 1970s, the first AI winter kicked in, the result of promises that
couldn’t be kept. The impact of this lack of funding limited both DL and AI
research. Fortunately, there were individuals who carried on the research
without funding.
The first “convolutional neural networks” were used by Kunihiko Fukushima.
Fukushima designed neural networks with multiple pooling and convolutional
layers. In 1979, he developed an artificial neural network, called Neocognitron,
which used a hierarchical, multilayered design. This design allowed the
computer to “learn” to recognize visual patterns. The networks resembled
modern versions but were trained with a reinforcement strategy of recurring
activation in multiple layers, which gained strength over time. Additionally,
Fukushima’s design allowed important features to be adjusted manually by
increasing the “weight” of certain connections. Many of the concepts of
Neocognitron continue to be used.
The use of top-down connections and new learning methods have allowed for a
variety of neural networks to be realized. When more than one pattern is
presented at the same time, the Selective Attention Model can separate and
recognize individual patterns by shifting its attention from one to the other.
(The same process many of us use when multitasking). A modern Neocognitron
can not only identify patterns with missing information (for example, an
incomplete number 5), but can also complete the image by adding the missing
information. This could be described as “inference.”
Back propagation, the use of errors in training deep learning models, evolved
significantly in 1970. This was when Seppo Linnainmaa wrote his master’s
thesis, including a FORTRAN code for back propagation.
Unfortunately, the concept was not applied to neural networks until 1985. This
was when Rumelhart, Williams, and Hinton demonstrated that back propagation in
a neural network could provide “interesting” distributed representations.
Philosophically, this discovery brought to light the question within cognitive
psychology of whether human understanding relies on symbolic logic
(computationalism) or distributed representations (connectionism).

The 1980s and 90s


In 1989, Yann LeCun provided the first practical demonstration of
backpropagation at Bell Labs. He combined convolutional neural networks with
back propagation to read “handwritten” digits. This system was eventually
used to read the numbers of handwritten checks.
This time is also when the second AI winter (1985-90s) kicked in, which also
affected research on neural networks and deep learning. Various overly-
optimistic individuals had exaggerated the “immediate” potential of Artificial
Intelligence, breaking expectations and angering investors. The anger was so
intense, the phrase Artificial Intelligence reached pseudoscience status.
Fortunately, some people continued to work on AI and DL, and some significant
advances were made. In 1995, Dana Cortes and Vladimir Vapnik developed the
support vector machine (a system for mapping and recognizing similar data).
LSTM (long short-term memory) for recurrent neural networks was developed
in 1997, by Sepp Hochreiter and Juergen Schmidhuber.
The next significant evolutionary step for deep learning took place in 1999,
when computers started becoming faster at processing data and GPU (graphics
processing units) were developed. Faster processing, with GPUs processing
pictures, increased computational speeds by 1000 times over a 10 year span.
During this time, neural networks began to compete with support vector
machines. While a neural network could be slow compared to a support vector
machine, neural networks offered better results using the same data. Neural
networks also have the advantage of continuing to improve as more training
data is added.

2000-2010
Around the year 2000, The Vanishing Gradient Problem appeared. It was
discovered “features” (lessons) formed in lower layers were not being learned
by the upper layers, because no learning signal reached these layers. This was
not a fundamental problem for all neural networks, just the ones with gradient-
based learning methods. The source of the problem turned out to be certain
activation functions. A number of activation functions condensed their input, in
turn reducing the output range in a somewhat chaotic fashion. This produced
large areas of input mapped over an extremely small range. In these areas of
input, a large change will be reduced to a small change in the output, resulting
in a vanishing gradient. Two solutions used to solve this problem were layer-
by-layer pre-training and the development of long short-term memory.
In 2001, a research report by META Group (now called Gartner) described the
challenges and opportunities of data growth as three-dimensional: the
increasing volume of data, the increasing speed of data, and the increasing
range of data sources and types. This was a call to prepare for the
onslaught of Big Data, which was just starting.
In 2009, Fei-Fei Li, an AI professor at Stanford, launched ImageNet, a free
database of more than 14 million labeled images. The Internet is, and was,
full of unlabeled images. Labeled images were needed to “train” neural nets.
Professor Li said, “Our vision was that big data would change the way machine
learning works. Data drives learning.”

2011-2020
By 2011, the speed of GPUs had increased significantly, making it possible to
train convolutional neural networks “without” the layer-by-layer pre-training.
With the increased computing speed, it became obvious deep learning had
significant advantages in terms of efficiency and speed. One example is AlexNet,
a convolutional neural network whose architecture won several international
competitions during 2011 and 2012. Rectified linear units were used to
enhance training speed, and dropout was used to reduce overfitting.
Also in 2012, Google Brain released the results of an unusual project known as
The Cat Experiment. The free-spirited project explored the difficulties of
“unsupervised learning.” Deep learning uses “supervised learning,” meaning the
convolutional neural net is trained using labeled data (think images from
ImageNet). Using unsupervised learning, a convolutional neural net is given
unlabeled data, and is then asked to seek out recurring patterns.
The Cat Experiment used a neural net spread over 1,000 computers. Ten million
“unlabeled” images were taken randomly from YouTube, shown to the system,
and then the training software was allowed to run. At the end of the training,
one neuron in the highest layer was found to respond strongly to the images of
cats. Andrew Ng, the project’s founder said, “We also found a neuron that
responded very strongly to human faces.” Unsupervised learning remains a
significant goal in the field of deep learning.
The Generative Adversarial Network (GAN) was introduced in 2014 by Ian
Goodfellow. With GAN, two neural networks play
against each other in a game. The goal of the game is for one network to imitate
a photo, and trick its opponent into believing it is real. The opponent is, of
course, looking for flaws. The game is played until the near perfect photo tricks
the opponent. GAN provides a way to perfect a product (and has also begun
being used by scammers).

Probabilistic Theory of Deep Learning


The Probabilistic Theory of Deep Learning (PTDL) is a framework aimed at
understanding and explaining the behavior of deep neural networks (DNNs)
through a probabilistic lens. It seeks to bridge the gap between traditional
machine learning and deep learning by integrating probabilistic models with
deep learning architectures.
Probabilistic neural networks are deep neural networks that utilize
probabilistic layers which can represent and process uncertainty; deep
probabilistic models are probabilistic models that incorporate deep neural
network components to capture complex non-linear stochastic
relationships between random variables.
The main advantages of probabilistic models are that these can capture the
uncertainties in most real-world applications and provide essential information
for decision making.
Standard deep learning models produce point predictions without indicating how
confident they are. Probabilistic deep learning aims to address this limitation
by incorporating uncertainty estimation into deep learning models. This can be
achieved through various approaches:

• Bayesian Neural Networks (BNNs): BNNs treat model parameters as random
variables with prior distributions. By inferring the posterior distribution of
these parameters given the data, BNNs can provide not only point estimates
but also uncertainty estimates for predictions.
• Variational Inference: Variational inference is a technique used to
approximate complex posterior distributions with simpler distributions. In
the context of deep learning, variational inference can be used to
approximate the posterior distribution of neural network weights, enabling
uncertainty estimation.
• Dropout as Bayesian Approximation: Dropout is a regularization technique
commonly used in deep learning to prevent overfitting. Interestingly, dropout
can also be interpreted as a form of approximate Bayesian inference, where
dropout during training can be seen as sampling from a distribution over
possible neural network architectures. This can be leveraged to estimate
uncertainty in predictions.
• Gaussian Processes (GPs): GPs are a powerful probabilistic modeling tool
that can model distributions over functions. By combining GPs with deep
neural networks, researchers have developed methods like Deep Gaussian
Processes (DGPs), which provide uncertainty estimates while leveraging the
representational power of deep learning architectures.
• Monte Carlo Dropout: Monte Carlo Dropout extends dropout to the testing
phase by performing multiple stochastic forward passes through the network
with dropout turned on. This allows for the estimation of predictive
uncertainty by observing the variance of predictions across these passes
(see the sketch after this list).
• Ensemble Methods: Ensemble methods involve training multiple neural
networks with different initializations or architectures and averaging their
predictions. Ensemble methods naturally provide uncertainty estimates
through the variance of predictions across the ensemble members.
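To make the Monte Carlo Dropout idea concrete, here is a minimal Python sketch
using PyTorch; the network architecture, dropout rate, and number of forward
passes are illustrative assumptions, not values prescribed by the text.

import torch
import torch.nn as nn

# A small regression network containing a dropout layer (sizes chosen for illustration).
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_passes=100):
    """Run several stochastic forward passes with dropout active and return
    the mean prediction and its standard deviation (a simple uncertainty estimate)."""
    model.train()  # keep the dropout layers active at prediction time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(5, 10)                      # a dummy batch of 5 inputs
mean, uncertainty = mc_dropout_predict(model, x)

The spread (standard deviation) across passes is larger for inputs the network is
less certain about, which is exactly the information that point predictions alone
cannot provide.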
(Figure: a deep learning framework, consisting of neural networks, convolutional
NNs, and recurrent NNs, combined with mechanisms for incorporating uncertainty.)

Applications of Probabilistic Theory of Deep Learning
• Medical Diagnosis
• Autonomous Driving
• Financial Modeling
• Robotics
• Natural Language Processing
• Uncertainty Quantification

Gradient Learning

"Gradient learning" typically refers to the process of updating the


parameters of a model, often a neural network, using gradient descent
optimization algorithms. Gradient descent is a fundamental optimization
technique used to minimize the loss function of a model by iteratively
adjusting its parameters in the direction of steepest descent of the loss
function.

Gradient learning is essential for training neural networks and is the


foundation of many deep learning algorithms.
Variants of gradient descent are commonly used in practice to improve
convergence speed and stability during training, such as:
1. Stochastic gradient descent (SGD)
2. Mini-batch gradient descent
3. Adaptive learning rate methods like Adam
Neural networks are usually trained by using iterative, gradient-based
optimizers. Gradient-based learning draws on the fact that it is generally
much easier to minimize a reasonably smooth, continuous function than
a discrete function.
• The loss function can be minimized by estimating the impact of
small variations of the parameter values on the loss function. Convex
optimization converges starting from any initial parameters.
• Stochastic gradient descent applied to non-convex loss functions
has no such convergence guarantee and is sensitive to the values of
the initial parameters.
• For feedforward neural networks, it is important to initialize all
weights to small random values. The biases may be initialized to
zero or to small positive values. These iterative, gradient-based
optimizers are used to train feedforward networks and almost all
other deep models (a minimal sketch of mini-batch SGD is given below).
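As a minimal Python sketch of mini-batch stochastic gradient descent, here
applied to a simple least-squares problem; the synthetic data, learning rate,
and batch size are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # synthetic inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)       # noisy targets

w = np.zeros(3)                                    # small (here zero) initialization
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ w
        grad = 2 * X[b].T @ (pred - y[b]) / len(b) # gradient of the mean squared error
        w -= lr * grad                             # step in the direction of steepest descent

print(w)                                           # should approach true_w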

Cost Function
An important aspect of the design of deep neural networks is the
cost function. Cost functions for neural networks are largely the same as
those for other parametric models, such as linear models. In most cases, the
parametric model defines a distribution p(y | x; θ) and we simply use the
principle of maximum likelihood, so the cost is the cross-entropy between the
training data and the model's predictions. Most modern neural networks are
trained using maximum likelihood.
The cost function is given by
J(θ) = −E_{x,y ∼ p̂_data} [ log p_model(y | x) ]
The advantage of this approach is that deriving the cost from maximum
likelihood removes the burden of designing cost functions for each model.
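As a small numerical illustration of this cost, shown here in Python; the
predicted probabilities and labels are made up purely for the example.

import numpy as np

# Model's predicted class probabilities for 3 examples (each row sums to 1).
p_model = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])                    # true class indices

# J(theta) = -E[log p_model(y | x)], estimated as an average over the examples.
nll = -np.mean(np.log(p_model[np.arange(len(y)), y]))
print(nll)                                 # the cross-entropy (negative log-likelihood) cost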

Desirable property of gradient:

• Gradient must be large and predictable enough to serve as a good guide


to the learning algorithm.

Cross entropy and regularization:

• A property of the cross-entropy cost used for MLE is that it usually does not
have a minimum value. For discrete output variables, most models cannot
represent a probability of exactly zero or one, but can come arbitrarily close;
logistic regression is an example.
• For real-valued output variables, it becomes possible to assign extremely
high density to the correct training set outputs, e.g., by learning the variance
parameter of a Gaussian output, in which case the cross-entropy
approaches negative infinity.

Learning conditional statistics:

• Instead of learning a full probability distribution, we often want to learn


just one conditional statistic of y given x.

Learning a function:

• If we have a sufficiently powerful neural network, we can think of it as


being powerful enough to determine any function "f". This function is
limited only by boundedness and continuity.
• From this point of view, the cost is a functional, a mapping from functions to
real numbers, rather than an ordinary function. We can think of learning as a task
of choosing a function rather than a set of parameters. We can design our
cost function to have its minimum occur at a specific function we desire.
For example, design the cost functional to have its minimum lie on the
function that maps x to the expected value of y given x.

Chain Rule and Backpropagation

• The chain rule and backpropagation are fundamental concepts in the


training of neural networks, especially in the context of gradient-based
optimization.
• Backpropagation is a training method used for a multi-layer neural
network. It is also called the generalized delta rule. It is a gradient descent
method, which minimizes the total squared error of the output computed
by the net.
• The backpropagation algorithm looks for the minimum value of the error
function in weight space using a technique called the delta rule or
gradient descent. The weights that minimize the error function are then
considered to be a solution to the learning problem.
• Backpropagation is a systematic method for training multilayer ANNs.
It is a generalization of the Widrow-Hoff error correction rule. Roughly 80% of
ANN applications use backpropagation.
• (Figure: a backpropagation network.)

Consider a simple neuron:
• A neuron has a summing junction and an activation function.
• Any nonlinear function which is differentiable everywhere and increases
everywhere with the sum can be used as the activation function.
• Examples: the logistic function, the arc tangent function, and the hyperbolic
tangent function.
These nonlinear activation functions are what give a multilayer network greater
representational power than a single-layer network; the extra layers help only
because non-linearity is introduced.
Need of hidden layers:
1. A network with only two layers (input and output) can only represent the
input with whatever representation already exists in the input data.
2. If the data is discontinuous or non-linearly separable, the innate
representation is inconsistent, and the mapping cannot be learned using two
layers (Input and Output).
3. Therefore, hidden layer(s) are used between input and output layers.
• Weights connect units (neurons) in one layer only to those in the next
higher layer. The output of the unit is scaled by the value of the connecting
weight, and it is fed forward to provide a portion of the activation for the
units in the next higher layer.
• Backpropagation can be applied to an artificial neural network with any
number of hidden layers. The training objective is to adjust the weights
so that the application of a set of inputs produces the desired outputs.

Training procedure:

The network is usually trained with a large number of input-output pairs.

Training Algorithm

1. Initialize the weights to small random values (both positive and negative) to
ensure that the network is not saturated by large weight values.
2. Choose a training pair from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the
desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 - 6 for each input-output pair in the training set until the
error for the entire system is acceptably low.

Forward pass and backward pass:

• Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input
to the output.
2. In the backward pass, the calculated error signals propagate backward
through the network, where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out, layer by layer,
in the forward direction. The output of one layer is the input to the next layer.

In the reverse pass,


a. The weights of the output neuron layer are adjusted first since the target
value of each output neuron is available to guide the adjustment of the
associated weights, using the delta rule.
b. Next, we adjust the weights of the middle (hidden) layers. Because the
middle-layer neurons have no target values, this makes the problem more complex.
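The following Python sketch ties the training steps and the two passes together
for a tiny two-layer network trained on the XOR problem. The layer sizes, sigmoid
activations, squared-error loss, and learning rate are assumptions chosen only
for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: the XOR problem (not linearly separable, so a hidden layer is needed).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize weights to small random values.
W1 = rng.normal(scale=0.5, size=(2, 4));  b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1));  b2 = np.zeros(1)
lr = 0.5

for epoch in range(10000):
    # Forward pass (steps 3-4): layer by layer, the output of one layer feeds the next.
    h = sigmoid(X @ W1 + b1)              # hidden-layer activations
    y_hat = sigmoid(h @ W2 + b2)          # network output

    # Step 5: error between the network output and the desired output.
    err = y_hat - Y

    # Backward pass (step 6): delta rule for the output layer first, then the
    # error is propagated back to the hidden layer via the chain rule.
    delta_out = err * y_hat * (1 - y_hat)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_hid;  b1 -= lr * delta_hid.sum(axis=0)

print(np.round(y_hat, 2))                 # should approach the XOR targets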
Regularization: Dataset Augmentation
Regularization techniques are essential for preventing overfitting in machine
learning models, including neural networks.
Dataset augmentation is one such technique used to enhance the generalization
ability of models by artificially increasing the size and diversity of the training
dataset.
Heuristic data augmentation schemes often rely on the composition of a set of
simple transformation functions (TFs) such as rotations and flips (see the list
of techniques below).
When chosen carefully, data augmentation schemes tuned by human experts
can improve model performance. However, such heuristic strategies in practice
can cause large variances in end model performance and may not produce
augmentations needed for state-of-the-art models.

Data augmentation can be defined as the technique used to improve the


diversity of the data by slightly modifying copies of already existing data or
by creating new synthetic data from the existing data. It is used to regularize the
data and it also helps to reduce overfitting. Some of the techniques used for data
augmentation are :
1. Rotation (Range 0-360 degrees)
2. flipping (true or false for horizontal flip and vertical flip)
3. Shear range (image is shifted along x-axis or y-axis)
4. Brightness or Contrast range (image is made lighter or darker)
5. Cropping (resize the image)
6. Scale (image is scaled outward or inward)
7. Saturation (depth or intensity of the image)
Here's how dataset augmentation works within the context of regularization:

Dataset Augmentation:

Dataset augmentation involves applying a variety of transformations to the


original training data to create new, slightly modified samples. These
transformations typically preserve the semantic content of the data while
introducing variability that can help the model learn more robust and invariant
features.

Common transformations include:

• Geometric transformations: Rotation, translation, scaling, cropping,


and flipping of images.
• Color transformations: Adjusting brightness, contrast, saturation, and
hue of images.
• Noise injection: Adding random noise to images or other data samples.
• Random cropping and padding: Extracting random crops or adding
random padding to images.

By applying these transformations to the training data, the dataset is effectively


expanded, providing the model with more diverse examples to learn from. This
helps prevent overfitting by exposing the model to a wider range of variations
in the data distribution.

Regularization Effect:

Dataset augmentation acts as a form of regularization by introducing noise and


variability into the training process. This helps to prevent the model from
memorizing the training examples and encourages it to learn more
generalizable features that are invariant to the transformations applied during
augmentation.

Additionally, dataset augmentation encourages the model to learn features that


are robust to variations commonly encountered in real-world scenarios.
For example, by augmenting images with random rotations and translations,
the model learns to recognize objects from different viewpoints and positions,
leading to improved generalization performance.

Implementation:

Dataset augmentation is typically applied during the training phase, where each
training sample is randomly transformed before being fed into the model for
training. The transformed samples are treated as additional training data,
effectively enlarging the training dataset.

Modern deep learning frameworks often provide built-in support for dataset
augmentation through data preprocessing pipelines or dedicated augmentation
modules. These frameworks allow users to easily specify the desired
transformations and apply them to the training data on-the-fly during training.
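As one possible illustration, assuming the torchvision library is available; the
specific transforms and parameter values below are example choices rather than
recommendations from the text.

from torchvision import transforms

# An on-the-fly augmentation pipeline: each training image is randomly
# transformed every time it is loaded.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=30),            # rotation
    transforms.RandomHorizontalFlip(p=0.5),           # flipping
    transforms.ColorJitter(brightness=0.2,            # brightness / contrast / saturation
                           contrast=0.2,
                           saturation=0.2),
    transforms.RandomResizedCrop(size=224,            # cropping and scaling
                                 scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Typically attached to a dataset, e.g. (path is a placeholder):
# dataset = torchvision.datasets.ImageFolder("data/train", transform=train_transforms)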
Applying the chain rule
Let’s use the chain rule to calculate the derivative of cost with respect to any
weight in the network. The chain rule will help us identify how much each
weight contributes to our overall error and the direction to update each weight
to reduce our error. Here are the equations we need to make a prediction and
calculate total error, or cost:

Given a network consisting of a single neuron, total cost could be calculated as:
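As a minimal reconstruction, assuming a single neuron with weight w, input x,
activation function f, and a squared-error cost:

z = w · x
ŷ = f(z)
C = (ŷ − y)²

Applying the chain rule to this single neuron gives

∂C/∂w = (∂C/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w) = 2(ŷ − y) · f′(z) · x

Backpropagation applies this same chain-rule decomposition, layer by layer, to
every weight in the network.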

Noise robustness

Noise robustness, in the context of machine learning and particularly deep
learning, refers to the
ability of a model to maintain its performance and make accurate predictions
even when presented with noisy or corrupted input data. Noise in data can arise
from various sources, including sensor errors, transmission errors,
environmental factors, or imperfections in data collection processes.
Here's how noise robustness is addressed in machine learning, particularly in
deep learning:
1. Data Preprocessing:
• Noise Removal: In some cases, it's possible to preprocess the data to
remove or reduce noise before feeding it into the model. Techniques such
as denoising filters, signal processing methods, or data cleaning
algorithms can be employed to mitigate noise in the data.

2. Model Architecture:

• Robust Architectures: Designing models with architectures that are


inherently robust to noise can help improve noise robustness. For
example, architectures with skip connections or residual connections
(e.g., ResNet) can help propagate information more effectively through
the network, making them more resilient to noise.
• Dropout: Dropout regularization, which randomly drops units (along
with their connections) during training, can act as a form of noise
injection. This helps prevent overfitting and encourages the model to
learn more robust features that are less sensitive to noise in the input
data.

3. Data Augmentation:

• Augmentation with Noise: As mentioned earlier, dataset augmentation
can help improve noise robustness by exposing the model to a wider
range of data variations, including noisy samples. Augmenting the
training data with artificially added noise can help the model learn to
ignore irrelevant noise while focusing on the relevant signal in the data
(a minimal noise-injection sketch is given at the end of this section).

4. Training Strategies:

• Adversarial Training: Adversarial training involves training the model


on adversarially perturbed examples generated by adding carefully
crafted noise to the input data. This helps the model learn to be robust
against adversarial attacks, which can be considered as a form of noise.

5. Uncertainty Estimation:
• Probabilistic Models: Probabilistic deep learning models, such as
Bayesian neural networks or ensemble methods, can provide uncertainty
estimates along with predictions. These uncertainty estimates can help
the model recognize when it's uncertain about its predictions, which is
particularly useful in the presence of noisy or ambiguous input data.
6. Transfer Learning:
• Pretrained Models: Transfer learning from pretrained models trained
on large datasets can help improve noise robustness. Pretrained models
have learned robust features from vast amounts of data, which can
generalize well even in the presence of noise in the target domain.
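For the noise-injection idea mentioned under Data Augmentation above, here is a
minimal Python sketch; the noise level and batch shape are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(batch, std=0.1):
    """Return a noisy copy of a batch of inputs (the noise level is an example choice)."""
    return batch + rng.normal(scale=std, size=batch.shape)

X_batch = rng.uniform(size=(32, 10))       # a dummy training mini-batch
X_noisy = add_gaussian_noise(X_batch)      # train on X_noisy with the original labels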

Early Stopping, Bagging and Dropout

Early Stopping:
Early stopping is a regularization technique used to prevent overfitting during
the training of machine learning models, including neural networks. The basic
idea is to monitor the performance of the model on a separate validation set
during training. Training is stopped early (i.e., before the model starts to overfit)
when the performance on the validation set starts to degrade.
Specifically, early stopping involves:

• Monitoring Validation Loss: During training, the performance of the


model is evaluated periodically on a validation set. The validation loss (or
other evaluation metric) is calculated to assess the generalization
performance of the model.
• Stopping Criteria: Training is stopped when the validation loss stops
improving or starts to increase for a certain number of epochs. This
prevents the model from overfitting to the training data.
Early stopping helps find the optimal point in the training process where the
model generalizes best to unseen data, thus improving its ability to make
accurate predictions on new samples.
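A minimal Python sketch of the early-stopping loop, using a simple scikit-learn
classifier as a stand-in for the model (assuming a recent version of scikit-learn;
the synthetic data, patience value, and improvement tolerance are illustrative
choices, not part of the original text).

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Synthetic binary-classification data, split into training and validation sets.
X = rng.normal(size=(1200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1200) > 0).astype(int)
X_train, y_train, X_val, y_val = X[:1000], y[:1000], X[1000:], y[1000:]

model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)

best_val_loss, patience, stale = float("inf"), 5, 0
for epoch in range(200):
    model.partial_fit(X_train, y_train, classes=[0, 1])      # one pass over the training data
    val_loss = log_loss(y_val, model.predict_proba(X_val))   # monitor the validation loss

    if val_loss < best_val_loss - 1e-4:
        best_val_loss, stale = val_loss, 0                   # validation improved: keep going
    else:
        stale += 1                                           # no improvement this epoch
        if stale >= patience:
            print(f"stopping early at epoch {epoch}, best val loss {best_val_loss:.4f}")
            break

In practice the model parameters from the best epoch are also saved and restored,
so that the final model is the one that generalized best.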

Bagging (Bootstrap Aggregating):


Bagging is an ensemble learning technique that aims to improve the
performance and robustness of machine learning models by combining
predictions from multiple base models. It involves training multiple instances
of the same base model on different subsets of the training data, typically using
bootstrapping (sampling with replacement).

The key steps in bagging are:

• Bootstrap Sampling: Randomly sample subsets of the training data with


replacement to create multiple training sets.
• Base Model Training: Train a base model (e.g., decision tree, neural
network) on each bootstrap sample independently.
• Combination of Predictions: Combine the predictions of the base
models by averaging (for regression) or voting (for classification) to
make the final prediction.
Bagging helps reduce variance and improve the stability of predictions by
leveraging the diversity of base models trained on different subsets of the data.

Pseudocode:
1. Given training data (x₁, y₁), ..., (x_m, y_m)
2. For t = 1, ..., T:
   a. Form a bootstrap replicate dataset S_t by selecting m random examples
      from the training set with replacement.
   b. Let h_t be the result of training the base learning algorithm on S_t.
Output combined classifier:
H(x) = Majority(h₁(x), ..., h_T(x))
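A minimal Python sketch of this procedure, using scikit-learn decision trees as
the base learners; the dataset and the number of rounds T are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

T, m = 25, len(X)
ensemble = []
for t in range(T):
    idx = rng.integers(0, m, size=m)                  # bootstrap sample (with replacement)
    h_t = DecisionTreeClassifier().fit(X[idx], y[idx])
    ensemble.append(h_t)

# Majority vote over the T base classifiers.
votes = np.stack([h.predict(X) for h in ensemble])    # shape (T, n_samples)
H = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (H == y).mean())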

Dropout:
Dropout is a regularization technique specifically designed for training neural
networks to prevent overfitting. It involves randomly "dropping out" (i.e.,
deactivating) a fraction of neurons during training.
The key aspects of dropout are:
• Random Deactivation: During each training iteration, a fraction of
neurons in the network is randomly set to zero with a probability p,
typically chosen between 0.2 and 0.5.
• Training and Inference: Dropout is only applied during training. During
inference (i.e., making predictions), all neurons are active, but their
outputs are scaled by the keep probability (1 − p) to maintain the expected
output magnitude.
• Ensemble Effect: Dropout can be interpreted as training an ensemble of
exponentially many subnetworks, which encourages the network to learn
more robust and generalizable features.
Dropout effectively prevents the co-adaptation of neurons and encourages the
network to learn more distributed representations, leading to improved
generalization performance.
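A minimal Python sketch of dropout applied to a layer's activations. Note that
this uses the "inverted dropout" formulation common in practice, which rescales
the surviving units during training instead of rescaling at inference time; the
layer shape and dropout rate are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training and
    rescale the survivors so the expected activation is unchanged; at inference
    time the activations pass through untouched."""
    if not training:
        return activations
    mask = rng.uniform(size=activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones((4, 8))                # dummy hidden-layer activations
print(dropout(h, p=0.5))           # roughly half the units are zeroed, the rest scaled by 2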

Note: These techniques—early stopping, bagging, and dropout—are powerful


tools for preventing overfitting and improving the generalization performance
of machine learning models, including neural networks. By incorporating these
techniques into the training process, models can become more robust and
reliable, making them better suited for real-world applications.
Batch Normalization
Batch normalization is a popular technique used in deep neural networks to
stabilize and accelerate the training process. It addresses the problem of
internal covariate shift, which refers to the change in the distribution of
network activations during training due to changes in the parameters of earlier
layers.
Here's how batch normalization works:

The normalization step is as follows:


1. Calculate the mean and variance of the activations for each feature in a mini-
batch.
2. Normalize the activations of each feature by subtracting the mini-batch mean
and dividing by the mini-batch standard deviation.
3. Scale and shift the normalized values using the learnable parameters gamma
and beta, which allow the network to undo the normalization if that is what the
learned behavior requires.
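A minimal Python sketch of the batch-normalization forward pass implementing the
three steps above; the mini-batch shape and parameter values are illustrative.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift with the
    learnable parameters gamma and beta."""
    mean = x.mean(axis=0)                        # step 1: per-feature mini-batch mean
    var = x.var(axis=0)                          # step 1: per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)      # step 2: normalize
    return gamma * x_hat + beta                  # step 3: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # a mini-batch of 32 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)              # identity scale and shift to start
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # each feature is now near 0 mean, 1 std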
Benefits of Batch Normalization
Batch normalization offers several benefits to the training process of deep
neural networks:
• Improved Optimization: It allows the use of higher learning rates,
speeding up the training process by reducing the careful tuning of
parameters.
• Regularization: It adds a slight noise to the activations, similar to
dropout. This can help to regularize the model and reduce overfitting.
• Reduced Sensitivity to Initialization: It makes the network less
sensitive to the initial starting weights.
• Allows Deeper Networks: By reducing internal covariate shift, batch
normalization allows for the training of deeper networks.

VC Dimension and Neural Nets


The Vapnik-Chervonenkis (VC) dimension is a concept from statistical learning
theory that provides a measure of the capacity or complexity of a hypothesis
space—the set of all possible functions that a learning algorithm can choose
from to fit the training data. In the context of neural networks, the VC dimension
plays an important role in understanding the expressiveness and generalization
ability of different network architectures.

Shattering set of examples:


Assume a binary classification problem with N examples in ℝ^D and consider the
set of 2^N possible dichotomies. For instance, with N = 3 examples, the set of all
possible dichotomies is {(000), (001), (010), (011), (100), (101), (110), (111)}.
A class of functions is said to shatter the dataset if, for every possible dichotomy,
there is a function f(α) that models it.
Consider as an example a finite concept class C = {c₁, ..., c₄} applied to three
instance vectors, with the results:

      x₁   x₂   x₃
c₁    1    1    1
c₂    0    1    1
c₃    1    0    0
c₄    0    0    0

Then:
π_C({x₁}) = {(0), (1)}
π_C({x₁, x₃}) = {(0,0), (0,1), (1,0), (1,1)}
π_C({x₂, x₃}) = {(0,0), (1,1)}
• The VC dimension VC(f) is the size of the largest dataset that can be shattered
by the set of functions f(α).
• If the VC dimension of f(α) is h, then there exists at least one set of h points
that can be shattered by f(α), but in general it will not be true that every
set of h points can be shattered.
• The VC dimension cannot be accurately estimated for non-linear models such
as neural networks; it may even be infinite, requiring an infinite
amount of training data.

VC Dimension for Neural Networks
