
Computer Vision

UNIT-3 Image Classification


AIML, III-B. Tech-II-Sem

1
Introduction to Convolutional Neural Networks

2
Introduction to Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is a type of deep neural network designed for
processing structured grid data such as images, and it is the standard architecture for
image processing and computer vision tasks.
• A traditional Artificial Neural Network (ANN) connects every unit in one layer to every
unit in the preceding layer (fully/densely connected), which leads to a very high
computation requirement.
• A CNN organizes each layer of the input image into feature maps, which makes the
network sparsely (lightly) connected (explained later). Feature maps can be thought of
as parallel planes or channels, where different planes represent different features of
the image.
• CNNs address the computation problem by using trainable multi-layer convolutions
with filters (filters are feature detectors that are learned automatically during training).
3
Introduction to Convolutional Neural Networks
Architecture of CNN:
1. Input Layer: Takes raw image data as input.
2. Convolutional Layer: Applies filters to extract features such as edges, textures, and
patterns.
3. Activation Function (ReLU): Introduces non-linearity into the model to enhance
learning capacity.
4. Pooling Layer: Reduces spatial dimensions to decrease computation and prevent
overfitting.
5. Fully Connected Layer: Connects neurons from previous layers to determine final
predictions or classification.
6. Model Training: The network is trained using backpropagation and optimization
techniques such as gradient descent.
7. Output Layer: Produces final classifications or regressions.
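Below is a minimal PyTorch sketch of this layer sequence; the channel counts and layer sizes are illustrative assumptions, not values taken from these slides.

```python
# A minimal PyTorch sketch of the layer sequence above (sizes are
# illustrative assumptions, not taken from the slides).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 2. convolution: extract features
    nn.ReLU(),                                   # 3. non-linearity
    nn.MaxPool2d(2),                             # 4. pooling: reduce spatial size
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                  # 5. fully connected layer
)                                                # 7. output: 10 class scores

x = torch.randn(1, 1, 28, 28)                    # 1. input: one 28x28 grayscale image
print(model(x).shape)                            # torch.Size([1, 10])
# 6. training would use backpropagation with an optimizer such as SGD.
```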
4
Introduction to Convolutional Neural Networks

5
This operation is actually a correlation (not a convolution), but the term convolution is used for
simplicity.
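A tiny NumPy/SciPy check of this point: what CNN layers compute is cross-correlation, and it equals a true convolution only after the kernel is flipped. The image and kernel here are arbitrary examples.

```python
# Small sketch showing that the CNN "convolution" is mathematically a
# cross-correlation; a true convolution flips the kernel first.
import numpy as np
from scipy.signal import correlate2d, convolve2d

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1., 0.], [0., -1.]])

corr = correlate2d(image, kernel, mode="valid")          # what CNN layers compute
conv = convolve2d(image, np.flip(kernel), mode="valid")  # same result after flipping
print(np.allclose(corr, conv))                           # True
```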

6
Applications of CNN:
1. Image classification (e.g., facial recognition, object detection)
2. Medical diagnosis (e.g., tumour detection in MRI scans).
3. Autonomous driving (e.g., obstacle detection)
4. Natural language processing (e.g., text and speech recognition)

7
Advantages of CNN
• The beauty of a convolutional neural network is that it learns these filters automatically
as part of training. Because you supply thousands of training images, the CNN uses
backpropagation to figure out the values in each filter; that is part of the learning. As a
hyperparameter, you only specify how many filters you want and the size of each filter.
• High accuracy in image classification and pattern recognition tasks.
• Efficient at recognizing spatial hierarchies in images.

8
Benefits of Convolution, ReLU and Pooling

Sparse Connections: As shown in the figure above, we do not connect every neuron in one
layer to every neuron in the next layer as in an ANN. This reduces the number of connections
and therefore the number of computations. Here the input matrix (X) is the image, while the
weight matrix (W) acts as the filter.
Parameter Sharing: As shown in the figure above, we reuse the same 4 weights repeatedly
instead of learning a separate weight for every connection (as we do in a regular artificial
neural network). This is called parameter (weight) sharing, and it reduces the complexity of
the CNN. A small parameter-count comparison is sketched below.
9
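A rough parameter-count comparison (sizes chosen purely for illustration) shows why sparse connections and shared weights matter:

```python
# Illustrative parameter count: a dense (fully connected) layer versus a
# convolutional layer on the same 28x28 input. Sizes are assumptions.
import torch.nn as nn

dense = nn.Linear(28 * 28, 28 * 28)                # every pixel connected to every output
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # one shared 3x3 filter

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))  # 615440 weights and biases
print(count(conv))   # 10 (nine shared weights + one bias)
```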
Benefits of Convolution, ReLU and Pooling

10
Benefits of Convolution, ReLU and Pooling

11
Examples to illustrate CNN:
Example 1: Digit Recognition
Let's say you want the computer to recognize a handwritten digit. First we will
implement it using an Artificial Neural Network (ANN) and then move to a CNN to
overcome the disadvantages of the ANN.

• The way the computer looks at the digit is as a grid of numbers "-1 and 1".

• In reality it will use RGB values from 0 to 255.

12
Examples to illustrate CNN:
• The issue with this representation is that it is too hardcoded.

• That is, it cannot recognize other handwritten forms of the digit 9, as shown in the
next slide.

13
Examples to illustrate CNN:
If the digit 9 is shifted a little, as shown beside (here it is shifted to the left), it no
longer matches our original number grid from the previous slide.

14
Examples to illustrate CNN:

15
Examples to illustrate CNN:

16
Examples to illustrate CNN:
• We create a one-dimensional array by flattening the two-dimensional representation
of our handwritten digit, and
• then we build a neural network with one hidden layer and an output layer, as sketched
below.
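A minimal sketch of this flatten-plus-ANN setup in PyTorch; the hidden-layer size of 100 is an assumption for illustration.

```python
# Minimal sketch of the ANN approach described above: flatten the 2D digit
# image into a 1D vector and pass it through one hidden layer plus an
# output layer (layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

ann = nn.Sequential(
    nn.Flatten(),            # 28x28 grid -> vector of 784 values
    nn.Linear(28 * 28, 100), # hidden layer
    nn.ReLU(),
    nn.Linear(100, 10),      # output layer: one score per digit 0-9
)

digit = torch.randn(1, 1, 28, 28)
print(ann(digit).shape)      # torch.Size([1, 10])
```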

17
Examples to illustrate CNN:
• Now consider a bigger image, e.g. a cute little koala with an image size of 1920 by 1080
and 3 RGB channels. Flattening it gives 1920 × 1080 × 3 ≈ 6.2 million input values, and
every one of them needs a weight to every neuron in the first hidden layer.

• So the disadvantage of using an ANN (artificial neural network) for image classification
is too much computation.
18
Examples to illustrate CNN:

• If the pixels are moved around (i.e., if the koala appears at a different position), the
model should still be able to detect the object in the image.

19
Examples to illustrate CNN:
How Do Humans Recognize Images So Easily?

We humans recognize any image easily as follows:

• When we look at a koala's image, we notice little features such as the round eyes, the
prominent flat black nose, and the fluffy ears; our brain detects these features one by
one and connects them together to recognize the koala.

• The same applies to the handwritten digit 9.

We will illustrate these in the next two slides.

20
Examples to illustrate CNN:
Example2: Animal Koala-
Image Recognition

21
Examples to illustrate CNN:

22
Examples to illustrate CNN:

How Can We Make Computers Recognize These Tiny Features?

Computers use the concept of a filter.
In the case of the digit nine we have three filters:
1. The first one is the head, which is a loopy circle pattern.
2. In the middle you have a vertical line filter.
3. At the end you have a diagonal line filter.

So we take our original image and apply a convolution (filter) operation to it.

23
Examples to illustrate CNN:

24
Examples to illustrate CNN:
Here the loopy circle pattern (head) filter, shown in green beside, convolves over
different 3x3 grid positions of your original image, multiplying the individual pixel
numbers with the filter values.

Shown in the next slides.

Filter or Kernel
Image
25
Examples to illustrate CNN:
Weighted sum: Multiply all the weights in the filter with the pixel values in the image,
then take the sum and average, as shown below.
In CNNs, these weighted sums are only performed within a small local window, as
shown here.

Filter/Kernel: it contains the weights
Feature Map
Image
There are nine numbers in total, and whatever average you get, you put it into the grid
called the feature map.
26
Examples to illustrate CNN:

Filter or Kernel
Feature Map
Image
Let us take a stride of 1.
Stride: the step size by which a filter (kernel) moves across an image or feature map.
A higher stride results in a lower-resolution feature map.
Wherever you see a one, or a number close to one, in the feature map, it means the
loopy circle pattern has matched. Similarly, the koala's eyes and other features are
detected using filters specific to them. A small sliding-window sketch follows.
27
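A small NumPy sketch of this sliding-window weighted sum with a configurable stride (the 7x7 image and all-ones filter are toy stand-ins; real CNN layers usually keep the plain sum rather than the average):

```python
# Sliding-window operation described above: multiply the filter weights with
# the pixels under the window, average them, and move the window by `stride`.
import numpy as np

def feature_map(image, kernel, stride=1):
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.mean(window * kernel)   # weighted sum, then average
    return out

image = np.random.choice([-1.0, 1.0], size=(7, 7))   # toy -1/1 digit grid
kernel = np.ones((3, 3))                              # toy "loopy pattern" filter
print(feature_map(image, kernel, stride=1).shape)     # (5, 5)
```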
Examples to illustrate CNN:

Filter or Kernel

Image
28
Examples to illustrate CNN:

29
Examples to illustrate CNN:

30
Examples to illustrate CNN:

31
Examples to illustrate CNN:
Example 2: Let us see different filters to detect the koala.

If the eyes are at a different location, the network will still detect them because the filter is
moved throughout the image. The filters are location invariant: it doesn't matter where the eyes
are in the image, these filters will detect them.
32
Examples to illustrate CNN:

33
Examples to illustrate CNN:

34
Examples to illustrate CNN:

So we flatten the final 2D feature map into a 1D array, as shown above, and feed it to an
Artificial Neural Network (ANN) for the final classification.
35
Examples to illustrate CNN:

36
POOLING and UNPOOLING

37
ReLU and Pooling
There are two other components. They are:
1. The "ReLU" activation, which introduces non-linearity and keeps the output
computation cheap, and
2. "Pooling", which reduces the size of the feature map and therefore the
computation in the final neural network used for classification (explained in the
next slides).

38
ReLU
1. "ReLU" activation: ReLU(x) = max(0, x), i.e.
x if x > 0
0 if x <= 0
It replaces every negative value in the feature map with zero and leaves positive values unchanged.

39
Pooling( or Downsampling)
We still haven't addressed the issue of too much computation.

The image beside requires too many neurons, weights, and computations.

To avoid this, we introduce pooling in CNNs.

40
Pooling and Unpooling
2. Pooling (also called downsampling): Pooling is used to reduce the size of the
feature map.
There are mainly two types of pooling:
1. Max Pooling (generally used)
2. Average Pooling

Unpooling is the reverse of pooling: increasing the size of the reduced feature map
back to its original size.

41
Pooling
Max Pooling: here you take a 2x2 window and pick the maximum number from each
window. This reduces the feature map from 4x4 to 2x2, which reduces the computation
when we flatten this feature map and give it to the neural network.

The maximum number is taken, which indicates it is a "main feature".

Stride = 2 means that once we are done with one window we move by two positions,
as in the sketch below.
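A minimal sketch of 2x2 max pooling with stride 2 on a toy 4x4 feature map (the values are made up for illustration):

```python
# 2x2 max pooling with stride 2, shrinking a 4x4 feature map to 2x2.
import torch
import torch.nn as nn

fmap = torch.tensor([[1., 0., 2., 3.],
                     [4., 6., 6., 8.],
                     [3., 1., 1., 0.],
                     [1., 2., 2., 4.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(fmap).squeeze())   # tensor([[6., 8.], [3., 4.]])
```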

42
Pooling

43
Pooling
Max Pooling

44
Pooling
Max Pooling

45
Pooling
Max Pooling

46
Pooling

47
Pooling

48
Pooling

49
Pooling

50
Unpooling(or Upsampling)
• Sometimes we need to reverse the pooling process to reconstruct a higher-
resolution image. This is where unpooling and transposed convolution come
into play.

• Unpooling is the reverse process, where we restore the original size of the
image from the pooled representation.

• Since max pooling discards non-maximum values, Unpooling requires


remembering which value was selected during pooling.

• When backpropagation (the process of adjusting network parameters to


minimize errors) is applied, the error is only passed back to the selected
maximum value, ensuring accurate reconstruction.
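A small PyTorch sketch of this idea: max pooling can return the indices of the selected maxima, and MaxUnpool2d uses them to place the values back, filling everything else with zeros.

```python
# Unpooling: pooling remembers which position held the maximum, and
# MaxUnpool2d puts the value back there, filling the rest with zeros.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, indices = pool(x)          # 2x2 result plus the winning positions
restored = unpool(pooled, indices) # back to 4x4, zeros where values were discarded
print(restored.shape)              # torch.Size([1, 1, 4, 4])
```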
51
Unpooling
Transposed Convolution (Upsampling or unpooling with Learned Weights):

When we want to upsample an image more systematically, we use a technique called


transposed convolution (also known as deconvolution or backward convolution).
Unlike unpooling, which simply places values back into their original locations, transposed
convolution applies filters to reconstruct finer details.

In transposed convolution:
1. The input image is expanded by inserting rows and columns of zeros between existing
pixels.
2. A convolutional filter is then applied to this expanded image, filling in the gaps with
learned values.
3. The result is a larger, more detailed image (see the sketch after this list).
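A minimal PyTorch sketch of transposed convolution used for upsampling (channel counts and sizes are illustrative):

```python
# Transposed convolution for upsampling: with stride 2, a 4x4 input is
# expanded to roughly twice the size using learned filter weights.
import torch
import torch.nn as nn

upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                              kernel_size=2, stride=2)
x = torch.randn(1, 1, 4, 4)
print(upsample(x).shape)   # torch.Size([1, 1, 8, 8])
```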

52
Unpooling

Figure: Transposed convolution can be used to upsample (increase the size of) an
image. Before applying the convolution operator, (s − 1) extra rows and columns of
zeros are inserted between the input samples, where s is the upsampling stride.

53
Sample Architecture of CNN

Using ANN
54
Important Parameters of CNN
1. Padding
2. Stride (explained already)
3. Dilation
4. Grouping

55
Important Parameters of CNN
1. Padding:

• Determines how image boundaries are handled.

• Options include zero padding and pixel replication.

As shown in the figure on top, some regions are convolved fewer times (the box at
row 1, column 1 is convolved only once, whereas the box at row 3, column 3 is
convolved four times).
To remove this imbalance and make each pixel get convolved an almost uniform
number of times, we use padding, as shown in the bottom image (zero padding is
used here).
56
Important Parameters of CNN
2. Stride: (Explained already)
Defines the step size for the convolution operation.

57
Important Parameters of CNN
3. Dilation:
Introduces gaps between the sampled pixels during convolution, allowing for a larger
receptive field without extra parameters.

White-coloured grid: image
Green-coloured grid: filter/kernel
58
Important Parameters of CNN
4. Grouping: Splits the input and output channels into separate groups that are
convolved independently.

Special cases:
Depthwise Convolution: each input channel is convolved independently (the number
of groups equals the number of channels).

Regular Convolution: uses all input channels (a single group). The sketch below shows
these parameters in one convolution call.
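The four parameters above can all be passed to a single convolution layer. A PyTorch sketch with illustrative sizes:

```python
# One nn.Conv2d call exposing the four parameters discussed above; the
# channel counts and input size are illustrative assumptions.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3,
                 padding=1,    # zero padding so border pixels are covered evenly
                 stride=2,     # move the filter two pixels at a time
                 dilation=1,   # gaps between sampled pixels (1 = none)
                 groups=8)     # 8 groups of 1 channel each = depthwise convolution
x = torch.randn(1, 8, 32, 32)
print(conv(x).shape)           # torch.Size([1, 8, 16, 16])
```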

59
Special Convolution Techniques
These techniques are designed to improve the efficiency and capability of CNNs.
The techniques are:

1) 1 X 1 Convolution
2) Partial Convolution
3) Gated Convolution

60
Special Convolution Techniques:
1x1 Convolution
A special case where the filters are 1×1 in size (a single pixel).

Used in networks like GoogLeNet to:

Combine multiple channels efficiently.
Reduce the feature map depth (number of channels) without losing important details.
Save computation while keeping the key information.
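A minimal sketch of a 1x1 convolution reducing the number of channels (the channel counts are assumptions for illustration):

```python
# A 1x1 convolution mixes the 64 input channels down to 16 at every pixel,
# shrinking the feature map depth without touching its spatial size.
import torch
import torch.nn as nn

reduce = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1)
x = torch.randn(1, 64, 28, 28)
print(reduce(x).shape)   # torch.Size([1, 16, 28, 28])
```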

61
Special Convolution Techniques:
Partial Convolution
Partial convolutions are a technique designed for tasks like image inpainting, where the
goal is to fill in missing or damaged parts of an image (e.g., removing a scratch or
reconstructing a hole).
They work by masking out the missing pixels and normalizing the result based on the
available pixels.
They are great for tasks like restoring old photos or removing objects from images.
(Figure labels: Convolution, Unpooling or Deconvolution)
62
Special Convolution Techniques:
Gated Convolution
Gated convolutions are an advanced technique where the network dynamically
decides how important each pixel is for the task at hand.
It’s like giving the network a “filtering knob” to control what it pays attention to.
Gated convolution splits the process into two parts:
1. Feature Extraction: A regular convolution extracts features from the input.
2. Gating Mechanism: A second convolution (often followed by a sigmoid
activation) generates a “gate” value between 0 and 1 for each pixel. This gate
value decides how much of the feature to keep.

Used in tasks like image enhancement and text-based image modification.
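A simplified sketch of this two-branch idea, assuming the gate is produced by a second convolution followed by a sigmoid; it is an illustration, not a reference implementation:

```python
# Minimal gated convolution: one convolution extracts features, a second
# produces a 0-1 gate via sigmoid, and the two are multiplied elementwise.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        return self.feature(x) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 3, 64, 64)
print(GatedConv2d(3, 16)(x).shape)   # torch.Size([1, 16, 64, 64])
```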

63
APPLICATION: DIGIT CLASSIFICATION

64
APPLICATION: DIGIT CLASSIFICATION
• One of the most common applications of Convolutional Neural Networks
(CNNs) is digit classification, where a model is trained to recognize handwritten
digits from images.

• This is useful for tasks like recognizing numbers on bank checks, postal codes, and
handwritten forms.

65
APPLICATION: DIGIT CLASSIFICATION
How CNNs Help in Digit Classification
CNNs are a type of deep learning model designed to process visual data. They automatically
learn patterns from images using multiple layers, such as:

• Convolutional Layers: These detect edges, curves, and shapes in the image.
• Activation Functions: These introduce non-linearity to help the network learn complex
patterns. Common activation functions include ReLU (Rectified Linear Unit), which helps
avoid issues like vanishing gradients.
• Pooling Layers: These reduce the size of the feature maps while retaining important
information. Techniques like max pooling (taking the highest value from a region) or
average pooling (taking the mean) help improve efficiency.
• Fully Connected Layers: These process the extracted features and make predictions, such
as determining which digit is present in the image.
• SoftMax Output Layer: This converts the final outputs into probabilities, indicating the
likelihood of the image belonging to each digit category (0-9).
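Putting these layers together, a small LeNet-style digit classifier might look like the sketch below; the exact channel and layer sizes are illustrative, not the official LeNet-5 configuration.

```python
# A small LeNet-style digit classifier combining the layers listed above.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 10),          # one score per digit; softmax is applied
)                                # inside the cross-entropy loss during training

x = torch.randn(1, 1, 28, 28)    # one MNIST-sized grayscale image
print(classifier(x).shape)       # torch.Size([1, 10])
```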
66
APPLICATION: DIGIT CLASSIFICATION
Datasets for Training and Testing

To train a CNN for digit recognition, datasets containing labeled images of handwritten
digits are used. Some common datasets include:

• MNIST: A collection of 60,000 training images and 10,000 test images of handwritten
digits (0-9). This is a basic dataset for learning CNNs.
• CIFAR-10: A more challenging dataset containing 60,000 small images across 10 object
categories (including vehicles and animals), used for training CNNs beyond digit
classification.
• Fashion MNIST: A dataset with images of clothing items instead of digits, often used as
a tougher alternative to MNIST.

67
APPLICATION: DIGIT CLASSIFICATION
Real-World Applications of Digit classification:
CNN-based digit recognition is widely used in practical applications, such as:

• Automated Check Processing: Banks use CNNs to read handwritten amounts on
checks and verify them against manually entered values.
• Postal Code Recognition: Postal services use CNNs to automate sorting by reading
handwritten addresses.
• Form Processing: Government and corporate offices use digit recognition to extract
handwritten numerical data from documents.

Today, CNNs are a fundamental part of computer vision, and learning to build a
simple digit classifier is a common first step for students and professionals entering
the field of deep learning.
68
APPLICATION: DIGIT CLASSIFICATION

Figure: Architecture of LeNet-5, a convolutional neural network for digit recognition. This
network uses multiple channels in each layer and alternates multi-channel convolutions with
downsampling operations, followed by some fully connected layers that produce one
activation for each of the 10 digits being classified.
69
NETWORK ARCHITECTURES

70
Sample Architecture of CNN

Using ANN
71
CNN LENET-5 ARCHITECTURE FOR DIGIT CLASSIFICATION

Figure: Architecture of LeNet-5, a convolutional neural network for digit recognition. This
network uses multiple channels in each layer and alternates multi-channel convolutions with
downsampling operations, followed by some fully connected layers that produce one
activation for each of the 10 digits being classified.
72
CNN AlexNet ARCHITECTURE FOR IMAGE CLASSIFICATION
AlexNet is a deep Convolutional Neural Network (CNN) that significantly
improved image classification tasks. It won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) in 2012 by achieving a top-5 error rate of
15.3%, far outperforming previous models.

Figure: Architecture of the Supervision deep neural network (more commonly known
as "AlexNet"). The network consists of multiple convolutional layers with ReLU
activations, max pooling, some fully connected layers, and a SoftMax to produce the
final class probabilities.
73
Model Zoos
Explanation of model zoos, transfer learning,
and deep learning frameworks.

74
What are Model Zoos?
• A model zoo is a collection of pre-trained deep learning models used for
applications like image classification, object detection, and image
segmentation.
• These pre-trained models are typically trained on large datasets, such as
ImageNet.
• These can be fine-tuned for specific tasks without needing to train from
scratch.
• Popular model zoos include:
1. Torch Vision (a library of PyTorch)
2. TensorFlow Hub(Tensor Flow): For mobile-friendly models
3. TensorFlow Lite Model Zoo (Tensor Flow): For mobile-friendly
models
75
TorchVision (a library in PyTorch) is a model zoo that provides popular deep
learning architectures like:

1. AlexNet: Image classification (e.g., recognizing objects in images).


2. VGG: Object detection and feature extraction in computer vision.
3. GoogleNet: Image classification and object detection with efficient
computation.
4. Inception: Face recognition, medical image analysis, and autonomous driving.
5. ResNet: High-accuracy image recognition, medical diagnosis, and video
recognition.
6. DenseNet: Image classification, segmentation, and medical imaging
applications.
7. MobileNet: Real-time image processing on mobile and edge devices.
8. ShuffleNet: Fast image classification for low-power mobile and IoT devices.
76
Model Size and Efficiency
• Deep learning models vary in size and computational efficiency.
Model Size: Measured in number of parameters (weights and biases).
Computational Load: Measured in FLOPs (the number of floating-point
operations required for one forward pass).
• To reduce the computational burden, researchers have developed
optimization techniques such as:
1. Lower precision arithmetic (e.g., using fewer bits to store numbers).
2. Weight compression (reducing memory usage by storing fewer model
parameters).
3. Binary networks (e.g., XNOR-Net): weights and activations are
reduced to binary values (+1 or -1), significantly reducing computation
and memory requirements.
77
Ways to use these Pre-Trained Models
Pre-trained models can be used directly or fine-tuned to suit specific
applications.
• Fine-tuning: This means adjusting a pre-trained model by training it on
new data that is more relevant to the specific/target task.
• Replacing the Head: Modifying the head(final layers) while keeping the
backbone(initial layers) as it is.
• Backbone(initial layers): Extracts features from input data.
• Head(final layers): Final layers making predictions or
classification.
• Fine-tuning is usually applied to the head (final layers). Fine-tuning the initial
layers is also possible, but it requires a small learning rate to avoid destroying the
knowledge already learned during pre-training, as in the sketch below.
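A minimal PyTorch/torchvision sketch of replacing the head of a pre-trained backbone; the 5-class head is a hypothetical target task.

```python
# Sketch of "replacing the head": load a pre-trained torchvision backbone,
# freeze it, and swap the final fully connected layer for a new task.
# (Uses the torchvision >= 0.13 `weights=` API.)
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():     # freeze the backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)   # new head, trainable by default
# Only model.fc's parameters are now updated during fine-tuning.
```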
78
Transfer Learning

• Transfer learning is the process of using a model trained on one


dataset and applying it to a different but related task.
• Useful when the new dataset is small, since it is difficult to train a deep
model from scratch on a small dataset.
• Works best when the new task is similar to the original one (e.g., training a model on
cat images and fine-tuning it for dog images).
• Depending on the available data, the new head can be:
1. A simple linear model (e.g., SVM, logistic regression)
2. A deep learning model with fully connected layers.

79
Neural Architecture Search (NAS)
• Neural Architecture Search (NAS) is an automated approach to design
deep learning models.
• Instead of manually designing architectures through trial and error, NAS
algorithms explore different network structures to find the best-
performing model.
• Popular NAS-generated models:
1. EfficientNet
2. FBNet
3. RegNet
4. RandomNets.
• These models often achieve better accuracy while using fewer
parameters and computations than traditional architectures.
80
Deep Learning Software and Frameworks

Various software frameworks help build and train deep learning models.
Some of the most widely used are:
• Common deep learning frameworks:
1. PyTorch: Flexible for research and production.
2. TensorFlow: Powerful for deployment.
3. Keras: Simplifies deep learning development.
4. MXNet: Used in academia and industry.
• For visualization and debugging, tools like TensorBoard and Visdom
help monitor model training and performance.

81
Conclusion

Model zoos provide valuable resources for deep learning by:


• Using pre-trained models for tasks like classification and
detection.
• Fine-tuning models for specific tasks.
• Applying transfer learning to new domains.
• Optimizing models with NAS and compression techniques.
• Choosing appropriate deep learning frameworks for efficient
training and deployment.

82
Visualizing Weights and Activations
in Neural Networks
Understanding how neural networks process and interpret data
using visualization techniques.

83
Visualizing Weights and Activations in
Neural Networks
Introduction
• When working with computer vision and deep learning, understanding how
a neural network processes and interprets data is crucial.
• One effective way to do this is through visualization, which helps in
debugging, refining models, and developing an intuition for how the
network operates.

84
Visualizing Network Weights
• Each connection between neurons has an associated weight, which
determines the strength of influence one neuron has on another.
• Visualizing these weights and activations (responses of neurons) can help
us understand how the network makes decisions.

• Small networks: Line width and color indicate weight strength.


• Large networks: Boxes of different sizes and colors represent weight
strengths.

85
Displaying Neuron Activations
• Activations are used to show neuron responses to input.

• These activations can be visualized using different techniques:

Dimensionality reduction techniques:

• We use these when we need to display the activations of many neurons
across many inputs.
• Techniques such as t-SNE and UMAP reduce the high-dimensional activations
to two or three dimensions for easier visualization, as in the sketch below.
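A minimal scikit-learn sketch of this idea, where `activations` is a placeholder array standing in for activations collected from some layer:

```python
# Reduce high-dimensional neuron activations to 2D with t-SNE for plotting.
import numpy as np
from sklearn.manifold import TSNE

activations = np.random.randn(500, 256)        # placeholder (num_inputs, num_neurons)
embedded = TSNE(n_components=2).fit_transform(activations)
print(embedded.shape)                           # (500, 2) points ready to scatter-plot
```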

86
Understanding Neuron Responses
• First-layer neurons detect simple patterns like edges or textures. These
can be visualized by directly examining their weights.
• Neurons in deeper layers detect complex features, such as shapes or objects,
which can be visualized using deconvolution and guided backpropagation.
To understand their behavior, we can:
• Identify the input patches (regions of an image) that activate them the
most.
• Use deconvolution networks to trace activations back to the original
image, revealing what the neuron is focusing on.
• Apply guided backpropagation, which enhances contrast in these
visualizations for clearer insights.

87
Activation Mapping Techniques
Several advanced techniques help in identifying which parts of an image
contribute most to a network’s decision:

• Activation Maximization: enhances input patterns that maximize a
neuron's response.
• Saliency Maps: highlight the regions of an image most important to the prediction
(a minimal sketch follows this list).
• Grad-CAM (Gradient-weighted Class Activation Mapping): creates
heatmaps showing the most influential regions.
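A minimal sketch of a gradient-based saliency map; the stand-in model and image sizes are assumptions for illustration.

```python
# Saliency map: the gradient of the predicted class score with respect to
# the input pixels highlights which pixels influence the decision most.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
image = torch.randn(1, 3, 32, 32, requires_grad=True)

score = model(image).max()       # score of the most likely class
score.backward()                 # backpropagate to the input, not the weights
saliency = image.grad.abs().max(dim=1)[0]   # per-pixel importance map
print(saliency.shape)            # torch.Size([1, 32, 32])
```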

88
Interactive Tools for Neural Network
Visualization
To make neural network interpretation more accessible, several tools have
been developed:
• OpenAI's Microscope visualizes neuron behavior.
• Visualization of Pre-trained networks: Models like GoogLeNet help
analyze layer significance in processing images.

89
Conclusion
Visualization techniques help:
1. Refine models,
2. Diagnose issues, and
3. Improve neural network performance.

90
Adversarial Examples
Understanding Adversarial Examples
• Adversarial examples are specially crafted inputs designed to deceive
deep learning models into making incorrect predictions.
• These examples look normal to humans but have been subtly altered in a
way that confuses AI systems.
• The modifications are often unnoticeable to the human eye but
significantly impact the model's interpretation.
How Are Adversarial Examples
Created?
• To create an adversarial example, a
small amount of calculated
noise(perturbation) is added to an
input (such as an image).
• This noise is determined by
backpropagation, which is
typically used for training neural
networks.
• However, in this case, it is used to
adjust the input itself rather
than the model's internal weights.
Steps to Create Adversarial Examples
1. Gradient Calculation: The model’s response to an input is analyzed, and
its prediction scores (activations) for different categories are examined.

2. Target Manipulation: Instead of letting the model classify the input


naturally, an attacker forces the model to move its prediction towards an
incorrect category.

3. Input Adjustment: The input is slightly modified to make the wrong


category the strongest prediction while still appearing unchanged to a
human observer.
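One common concrete recipe following these steps is the fast gradient sign method (FGSM). A minimal PyTorch sketch with a stand-in model and placeholder inputs:

```python
# FGSM sketch: nudge the input in the direction that increases the loss,
# by an amount (epsilon) small enough to be barely visible.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in classifier
image = torch.rand(1, 1, 28, 28, requires_grad=True)
label = torch.tensor([7])                                      # placeholder true label

loss = F.cross_entropy(model(image), label)   # 1. analyze the model's response
loss.backward()                               # gradient w.r.t. the input pixels
epsilon = 0.05
adversarial = image + epsilon * image.grad.sign()   # 3. tiny, targeted change
adversarial = adversarial.clamp(0, 1)               # keep valid pixel values
```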
Types of Adversarial Attacks
1. White Box Attack: The attacker has full knowledge of the AI model,
including its structure and internal parameters (weights). This allows for
precise crafting of adversarial examples.

2. Black Box Attack: The attacker does not have direct access to the
model's internal details and instead relies on trial and error or general
knowledge of how similar models behave.
Why Are Adversarial Examples a
Concern?
Adversarial examples highlight vulnerabilities in AI systems, especially if the models
have not been trained against such examples. Areas of particular concern include:

1. Autonomous Vehicles: A manipulated stop sign image might be


misread as a speed limit sign.
2. Security Systems: Facial recognition can be tricked into
misidentifying individuals.
3. Medical Diagnosis: AI-based medical imaging could produce
incorrect diagnoses due to subtle distortions.
Robustness of Adversarial Examples
Research has shown that adversarial examples are transferable,
meaning an adversarial example crafted to deceive one model can often
deceive another model, even if it has a different architecture or training data.
Why Do Adversarial Attacks Work?
Adversarial examples work because small, calculated changes to an input
can exploit a neural network's sensitivity and push it across decision boundaries,
leading to incorrect outputs.
Adversarial examples exploit non-robust features—small patterns
in the data that the model learns but are not noticeable to humans.
These patterns are statistical correlations in the training data but are
not meaningful in a real-world sense.
Defenses Against Adversarial Attacks
1. Adversarial Training: Training the model with adversarial examples so it
learns to recognize and ignore them.

2. Detection Methods: Identifying whether an input has been altered before it
reaches the model.

3. Deflection Techniques: Deflection techniques modify adversarial images to


force them to resemble the correct class, making it harder for attackers to
deceive a neural network.
The Future of Adversarial Research
• Adversarial examples are an ongoing area of study, with researchers
working on improving AI robustness (the ability to withstand attacks)
and safety (ensuring reliable performance in real-world applications).
• One major challenge is dataset bias, where models learn unintended shortcuts
that leave them more exposed to adversarial attacks.
Conclusion
• Understanding and mitigating adversarial threats is crucial for making
AI systems more secure and reliable across various applications.
Self-Supervised Learning
Understanding Self-Supervised Learning
Understanding Self-Supervised
Learning
• Self-supervised learning is a method in AI where a model learns patterns from raw
data without requiring labeled data.
• Here the model generates its own labels from the input data, allowing it to learn in a
way similar to how humans infer knowledge from observations.
• This approach is useful when labeled data is scarce or expensive to obtain.
(Figure: from unlabelled data to labelled data with patterns)
Key Concepts
1. Backbone (Trunk) Network
2. Transfer Learning
3. Domain Adaptation
4. Pretext Tasks
Backbone (Trunk) Network
A backbone network is the core structure of a deep learning model that
extracts essential features from input data.
• Example: A network trained to classify images (e.g., recognizing cats vs.
dogs) can serve as a backbone for other tasks like object detection.
Transfer Learning
• The practice of training a model
on one task and then applying its
knowledge to a different but
related task.
• Example: A model trained to
recognize general objects can be
fine-tuned to identify specific
breeds of dogs.
Domain Adaptation
• Modifying a pre-trained model to suit a specific dataset or environment.

Example: A model trained on images from high-resolution cameras may


need adaptation to work with blurry security footage.
Pretext Tasks in Self-Supervised
Learning
• Pretext tasks are artificial problems designed to help a model learn
meaningful patterns in data automatically, without human supervision.
• The goal is to use these tasks for pre-training before applying the model
to real-world applications.

Examples of pretext tasks are:


1. Context Prediction
2. Jigsaw Puzzle Task
3. Context Encoders (Inpainting)
4. Colorization
5. Image Rotation Prediction
Examples of Pretext Tasks
1. Context Prediction (figure: context prediction example)

• The model is given different parts of an image and asked to predict their relative
positions.
• Helps the model understand spatial relationships.

2. Jigsaw Puzzle Task (figure: jigsaw puzzle example)

• The image is divided into several tiles, and the model must arrange them correctly.
• Helps the model learn spatial coherence.
Examples of Pretext Tasks
3. Context Encoders (Inpainting)
• The model is trained to fill in missing parts of an image.
• Improves the model's ability to understand structures and textures.

4. Colorization
• The model is trained to add realistic colors to black-and-white images.
• Helps it learn object features and textures.

5. Image Rotation Prediction


• The model is given images rotated at different angles (0°, 90°, 180°, 270°) and
must predict the correct orientation.
• Helps it understand object shapes and alignment.
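A small sketch of how training pairs for the rotation-prediction task can be generated; the batch of random images is a placeholder.

```python
# Rotation-prediction pretext task: each image is rotated by 0/90/180/270
# degrees and the rotation index becomes the (free) label to predict.
import torch

def rotation_batch(images):
    """images: tensor of shape (N, C, H, W) -> (4N rotated images, 4N labels)."""
    rotated, labels = [], []
    for k in range(4):                                  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

imgs = torch.randn(8, 3, 32, 32)                        # placeholder unlabelled images
x, y = rotation_batch(imgs)
print(x.shape, y.shape)     # torch.Size([32, 3, 32, 32]) torch.Size([32])
```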
Extending Self-Supervised Learning
to Videos
1. Temporal Prediction
• The model is trained to order shuffled frames of a video correctly.
• Helps in understanding motion and continuity.
2. Video Colorization
• The model learns to predict colors in a sequence of video frames, ensuring
consistency across frames.
• Helps in recognizing objects over time.
3. Audio-Visual Learning
• The model aligns sounds with corresponding video frames.
• Example: Learning to identify a barking dog by associating the sound with the visual
of a dog.
Popular Self-Supervised Learning
Approaches
1. Momentum Contrast (MoCo)
• Uses a queue of past encoded samples to improve learning efficiency.
• Helps in reducing the need for large labeled datasets.
2. SimCLR (Simple Contrastive Learning)
• Uses contrastive learning with extensive data augmentation (modifying
images to create variations).
• Enhances the model’s robustness.
3. Bootstrap Your Own Latent (BYOL)
• Removes the need for negative samples (contrasting examples drawn from other
images) by using a "momentum encoder."
• Avoids collapse (where the model gives the same output for all inputs).
4. Deep Clustering
• Groups similar data points together to improve learning representations.
• Used in applications like document clustering and image segmentation.
Generative Modeling in Self-
Supervised Learning
Instead of using contrastive learning, some approaches predict missing or
distorted parts of data, like:
1. Autoencoders reconstruct missing parts of images.
2. Masked Autoencoders predict missing words or pixels.
3. Generative Pre-trained Transformers (GPT-like models) predict
the next element in a sequence.
Student-Teacher Learning model
• A teacher model trains a student model by generating pseudo-labels for
unlabeled data.
• This method, originally used to compress models, now helps train
powerful AI models with minimal human supervision.
Future of Self-Supervised Learning
• Self-supervised learning has shown potential to outperform traditional
supervised learning in some cases.
• It is being explored in computer vision, NLP, and multimodal learning
(combining text, images, and audio).
• Large-scale datasets are making self-supervised models even more
powerful.
Conclusion
• Self-supervised learning is revolutionizing AI by enabling models to learn
from vast amounts of unlabelled data, of which far more exists than
labelled data.
• It reduces dependency on human-annotated datasets.
• Improves model performance across different tasks.
THANK YOU

117
