Unit IV
Syllabus
History of Deep Learning- A Probabilistic Theory of Deep Learning-
Gradient Learning – Chain Rule and Backpropagation - Regularization:
Dataset Augmentation – Noise Robustness -Early Stopping, Bagging and
Dropout - batch normalization- VC Dimension and Neural Nets.
2000-2010
Around the year 2000, the vanishing gradient problem appeared. It was
discovered that useful "features" (lessons) were not being formed in the lower
layers, because almost no learning signal (gradient) reached those layers
during training. This was not a fundamental problem for all neural networks,
just those trained with gradient-based learning methods. The source of the
problem turned out to be certain activation functions. A number of activation
functions squashed their input into a narrow output range, mapping large
regions of the input onto an extremely small output interval. In these regions,
a large change in the input produces only a small change in the output, so the
gradient vanishes as it is propagated backward through many layers. Two
solutions used to address this problem were layer-by-layer pre-training and
the development of long short-term memory (LSTM) networks.
In 2001, a research report by the META Group (now called Gartner) described the
challenges and opportunities of data growth as three-dimensional: increasing
volume of data, increasing speed (velocity) of data, and an increasing range
(variety) of data sources and types. This was a call to prepare for the
onslaught of Big Data, which was just starting.
In 2009, Fei-Fei Li, an AI professor at Stanford, launched ImageNet, a free
database that grew to more than 14 million labeled images. The Internet is, and
was, full of unlabeled images, but labeled images were needed to “train” neural
nets. Professor Li said, “Our vision was that big data would change the way
machine learning works. Data drives learning.”
2011-2020
By 2011, the speed of GPUs had increased significantly, making it possible to
train convolutional neural networks “without” the layer-by-layer pre-training.
With the increased computing speed, it became obvious that deep learning had
significant advantages in terms of efficiency and speed. One example is AlexNet,
a convolutional neural network whose architecture won several international
competitions during 2011 and 2012. It used rectified linear units (ReLU) to
speed up training and dropout to reduce overfitting.
Also in 2012, Google Brain released the results of an unusual project known as
The Cat Experiment. The free-spirited project explored the difficulties of
“unsupervised learning.” Deep learning typically uses “supervised learning,”
meaning the convolutional neural net is trained using labeled data (think
images from ImageNet). With unsupervised learning, a convolutional neural net
is given unlabeled data and is then asked to seek out recurring patterns on its own.
The Cat Experiment used a neural net spread over 1,000 computers. Ten million
“unlabeled” images were taken randomly from YouTube, shown to the system,
and then the training software was allowed to run. At the end of the training,
one neuron in the highest layer was found to respond strongly to the images of
cats. Andrew Ng, the project’s founder, said, “We also found a neuron that
responded very strongly to human faces.” Unsupervised learning remains a
significant goal in the field of deep learning.
The generative adversarial network (GAN) was introduced in 2014 by Ian
Goodfellow. In a GAN, two neural networks play against each other in a game.
The goal of the game is for one network, the generator, to imitate a photo and
trick its opponent, the discriminator, into believing the imitation is real.
The opponent is, of course, looking for flaws. The game is played until a
near-perfect photo fools the opponent. GANs provide a way to refine a product
(and have also begun being used by scammers).
Incorporating Uncertainty
Gradient Learning
Cost Function
An important aspect of the design of deep neural networks is the choice of
cost function. The cost functions are largely the same as those for other
parametric models, such as linear models. In most cases, the parametric model
defines a distribution p(y | x; θ) and we simply use the principle of maximum
likelihood. This means using the cross-entropy between the training data and
the model's predictions as the cost function. Most modern neural networks are
trained using maximum likelihood.
The cost function is given by
J(θ) = −E_{x, y ∼ p̂_data} log p_model(y | x)
The advantage of this approach to cost is that deriving cost from maximum
likelihood removes the burden of designing cost functions for each model.
• A property of the cross-entropy cost used for maximum likelihood is that it
  usually does not have a minimum value. For discrete output variables, most
  models cannot represent a probability of exactly zero or one, but they can
  come arbitrarily close; logistic regression is an example.
• For real-valued output variables, it becomes possible to assign extremely
  high density to the correct training set outputs, e.g., by learning the
  variance parameter of a Gaussian output distribution, and the resulting
  cross-entropy approaches negative infinity.
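As a concrete illustration of this cost, here is a minimal sketch (written for these notes; the array names and toy numbers are assumptions) that computes the negative log-likelihood of a batch of labels under a model's predicted class probabilities:

import numpy as np

def nll_cost(probs, labels):
    # Negative log-likelihood (cross-entropy) cost.
    # probs  : (batch, num_classes) predicted probabilities p_model(y | x)
    # labels : (batch,) integer class labels drawn from the training data
    eps = 1e-12                                      # avoid log(0)
    picked = probs[np.arange(len(labels)), labels]   # p_model(y_i | x_i)
    return -np.mean(np.log(picked + eps))            # estimate of J(theta)

# Toy batch: 3 examples, 2 classes
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(nll_cost(probs, labels))   # about 0.28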
Learning a function:
Training procedure:
Training Algorithm
Dataset Augmentation:
Regularization Effect:
Implementation:
Dataset augmentation is typically applied during the training phase, where each
training sample is randomly transformed before being fed into the model for
training. The transformed samples are treated as additional training data,
effectively enlarging the training dataset.
Modern deep learning frameworks often provide built-in support for dataset
augmentation through data preprocessing pipelines or dedicated augmentation
modules. These frameworks allow users to easily specify the desired
transformations and apply them to the training data on-the-fly during training.
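As a minimal sketch of such on-the-fly augmentation (the flip probability, padding, and crop size below are illustrative assumptions, not values from these notes):

import numpy as np

def augment(image, rng=np.random.default_rng()):
    # image: (H, W, C) array; returns a randomly transformed copy.
    # Random horizontal flip
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop back to the original size after 4-pixel reflection padding
    h, w, _ = image.shape
    padded = np.pad(image, ((4, 4), (4, 4), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 9, size=2)
    return padded[top:top + h, left:left + w, :]

# Applied to each sample as it is fed to the model, so every epoch sees a
# differently transformed copy, effectively enlarging the training set.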
Applying the chain rule
Let’s use the chain rule to calculate the derivative of cost with respect to any
weight in the network. The chain rule will help us identify how much each
weight contributes to our overall error and the direction to update each weight
to reduce our error. To make a prediction and calculate the total error, or
cost, we need a linear (net input) function, an activation function, and a
cost function that compares the prediction with the target.
Given a network consisting of a single neuron, the total cost can therefore be
written as a composition of these functions, and the chain rule decomposes the
derivative of the cost with respect to the weight into a product of the
derivatives of each function in that composition.
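As an illustrative sketch for a single neuron, assuming a sigmoid activation and a squared-error cost (both choices made only for this example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_gradient(x, y, w):
    # Forward pass: z = x*w, prediction = sigmoid(z), cost = (prediction - y)^2
    # Chain rule: dC/dw = dC/dpred * dpred/dz * dz/dw
    z = x * w
    pred = sigmoid(z)
    dC_dpred = 2.0 * (pred - y)        # derivative of the squared-error cost
    dpred_dz = pred * (1.0 - pred)     # derivative of the sigmoid activation
    dz_dw = x                          # derivative of the linear (net input) step
    return dC_dpred * dpred_dz * dz_dw

x, y, w = 1.5, 1.0, 0.2
grad = cost_gradient(x, y, w)          # how much this weight contributes to the error
w = w - 0.1 * grad                     # move the weight in the direction that reduces error
print(grad, w)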
Noise robustness
Noise robustness, in the context of machine learning and particularly deep
learning, refers to the ability of a model to maintain its performance and
make accurate predictions even when presented with noisy or corrupted input
data. Noise in data can arise from various sources, including sensor errors,
transmission errors, environmental factors, or imperfections in the data
collection process.
Here's how noise robustness is addressed in machine learning, particularly in
deep learning:
1. Data Preprocessing:
• Noise Removal: In some cases, it's possible to preprocess the data to
remove or reduce noise before feeding it into the model. Techniques such
as denoising filters, signal processing methods, or data cleaning
algorithms can be employed to mitigate noise in the data.
2. Model Architecture:
3. Data Augmentation:
4. Training Strategies:
5. Uncertainty Estimation:
• Probabilistic Models: Probabilistic deep learning models, such as
Bayesian neural networks or ensemble methods, can provide uncertainty
estimates along with predictions. These uncertainty estimates can help
the model recognize when it's uncertain about its predictions, which is
particularly useful in the presence of noisy or ambiguous input data.
6. Transfer Learning:
• Pretrained Models: Transfer learning from pretrained models trained
on large datasets can help improve noise robustness. Pretrained models
have learned robust features from vast amounts of data, which can
generalize well even in the presence of noise in the target domain.
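As a minimal sketch of one such approach (noise injection during training, which is closely related to dataset augmentation; the noise level and the commented-out training loop are assumptions for illustration):

import numpy as np

def add_input_noise(batch, sigma=0.1, rng=np.random.default_rng(0)):
    # Return a copy of the batch with zero-mean Gaussian noise added.
    # Training on slightly corrupted inputs makes the model less sensitive
    # to noisy inputs at test time.
    return batch + rng.normal(0.0, sigma, size=batch.shape)

# Hypothetical training loop (data_loader, model and train_step are assumed
# helpers, not defined in these notes):
# for inputs, targets in data_loader:
#     noisy_inputs = add_input_noise(inputs, sigma=0.1)
#     train_step(model, noisy_inputs, targets)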
Early Stopping:
Early stopping is a regularization technique used to prevent overfitting during
the training of machine learning models, including neural networks. The basic
idea is to monitor the performance of the model on a separate validation set
during training. Training is stopped early (i.e., before the model starts to overfit)
when the performance on the validation set starts to degrade.
Specifically, early stopping involves:
• Setting aside a validation set that is not used for gradient updates.
• Evaluating the model on the validation set at regular intervals (for example, after every epoch).
• Stopping training when the validation error has not improved for a chosen number of evaluations (the “patience”), and returning the parameters that achieved the lowest validation error.
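A minimal sketch of this procedure follows; train_one_epoch, validate, and model.copy() are hypothetical helpers assumed for illustration:

def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=100, patience=5):
    # Monitor validation loss and stop once it has not improved for
    # `patience` consecutive epochs, returning the best model seen.
    best_val = float("inf")
    best_model = None
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                  # one pass over the training set
        val_loss = validate(model)              # performance on the validation set
        if val_loss < best_val:                 # validation improved
            best_val = val_loss
            best_model = model.copy()           # hypothetical: snapshot the best parameters
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:   # performance has degraded long enough
                break
    return best_model if best_model is not None else model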
Bagging:
Bagging (bootstrap aggregating) trains several base classifiers on bootstrap
replicates of the training set and combines their predictions by majority vote.
Pseudocode:
1. Given training data (x_1, y_1), ..., (x_m, y_m)
2. For t = 1, ..., T:
   a. Form a bootstrap replicate dataset S_t by selecting m random examples
      from the training set with replacement.
   b. Let h_t be the result of training the base learning algorithm on S_t.
Output the combined classifier:
H(x) = Majority(h_1(x), ..., h_T(x))
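A minimal sketch of this procedure (train_base is a hypothetical caller-supplied function returning a fitted classifier with a .predict method; integer class labels are assumed):

import numpy as np

def bagging_fit(X, y, train_base, T=10, rng=np.random.default_rng(0)):
    # Train T base classifiers, each on a bootstrap replicate S_t.
    m = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)        # m examples drawn with replacement
        models.append(train_base(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Combined classifier H(x) = Majority(h_1(x), ..., h_T(x))
    votes = np.stack([h.predict(X) for h in models])   # shape (T, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)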
Dropout:
Dropout is a regularization technique specifically designed for training neural
networks to prevent overfitting. It involves randomly "dropping out" (i.e.,
deactivating) a fraction of neurons during training.
The key aspects of dropout are:
• Random Deactivation: During each training iteration, a fraction of
neurons in the network is randomly set to zero with a probability p,
typically chosen between 0.2 and 0.5.
• Training and Inference: Dropout is only applied during training. During
  inference (i.e., when making predictions), all neurons are active, and their
  outputs are scaled by the keep probability (1 − p) so that the expected
  output magnitude matches that seen during training. Equivalently, “inverted
  dropout” scales the surviving activations by 1/(1 − p) during training, so
  that no scaling is needed at inference.
• Ensemble Effect: Dropout can be interpreted as training an ensemble of
exponentially many subnetworks, which encourages the network to learn
more robust and generalizable features.
Dropout effectively prevents the co-adaptation of neurons and encourages the
network to learn more distributed representations, leading to improved
generalization performance.
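A minimal sketch of dropout applied to one layer's activations (this uses the “inverted dropout” variant mentioned above; the drop probability and the toy activations are illustrative assumptions):

import numpy as np

def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    # p is the drop probability; surviving units are scaled by 1/(1 - p)
    # during training, so no rescaling is needed at inference time.
    if not training or p == 0.0:
        return activations                       # all neurons active at inference
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, p=0.5, training=True))    # roughly half the units zeroed and rescaled
print(dropout(h, p=0.5, training=False))   # unchanged at inference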