
Unit-III

Image Segmentation
Image segmentation involves partitioning a digital image into multiple segments (regions or objects) to simplify analysis by separating the image into meaningful components, which makes image processing more efficient by focusing on specific regions of interest. A typical image segmentation task goes through the following steps:
1. Groups pixels in an image based on shared characteristics like colour, intensity, or
texture.
2. Assigns a label to each pixel, indicating the segment or object it belongs to.
3. The resulting output is a segmented image, often visualized as a mask or overlay
highlighting the different segments.
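As a rough illustration of this output (not from the original notes; the 4x4 label map and class ids are made up), a segmentation map can be stored as a per-pixel array of class labels:

import numpy as np

# Hypothetical 4x4 label map: each pixel holds a class id
# (0 = background, 1 = road, 2 = car), as a segmentation model would output.
seg_map = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
    [0, 0, 1, 1],
])

car_mask = (seg_map == 2)              # boolean mask highlighting the "car" segment
print("car pixels:", car_mask.sum())   # -> 4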

Need for Image Segmentation


Image segmentation is crucial in computer vision tasks because it breaks down complex
images into manageable pieces. It's like separating ingredients in a dish. By isolating
objects (things) and backgrounds (stuff), image analysis becomes more efficient and
accurate. This is essential for tasks like self-driving cars identifying objects or medical
imaging analyzing tumours. Understanding the image's content at this granular level
unlocks a wider range of applications in computer vision.

Semantic Classes in Image Segmentation: Things and Stuff.


In semantic image segmentation, we categorize image pixels based on their semantic
meaning, not just their visual properties. This classification system often uses two main
categories: Things and Stuff.
• Things: "Things" refers to countable objects or distinct entities in an image with clear
boundaries, such as people, flowers, cars and animals. Segmentation of "things" aims to
label individual pixels with specific classes by delineating the boundaries of each
individual object within the image.
• Stuff: "Stuff" refers to amorphous regions of an image, such as the background or
repeating patterns of similar material, that cannot be counted as individual objects, for
example road, sky and grass. These regions may not have clear boundaries but play a
crucial role in understanding the overall context of an image. Segmentation of "stuff"
groups pixels into clearly identifiable regions based on common properties such as
colour, texture or context.
Semantic segmentation
Semantic segmentation is one of the types of image segmentation in which a class label is
assigned to image pixels using a deep learning (DL) algorithm. In semantic segmentation,
collections of pixels in an image are identified and classified by assigning a class label
based on their characteristics such as colour, texture and shape. This provides a pixel-wise
map of the image (a segmentation map), enabling more detailed and accurate image
analysis.
For example, all pixels belonging to a 'tree' would receive the same label without
distinguishing between individual trees. Similarly, a group of people in an image would be
labelled as a single 'person' class rather than as separate individuals.
Instance segmentation
Instance segmentation is a more sophisticated computer vision task that involves
identifying and delineating each individual object within an image. Instance segmentation
goes beyond just identifying objects: it also delineates the exact boundary of each
individual instance of an object. Its key focus is to differentiate between separate objects
of the same class. For example, if there are many cats in an image, instance segmentation
would identify and outline each specific cat. The segmentation map is created at the pixel
level, and separate labels are assigned to specific object instances, for example by using
differently coloured labels to represent the different cats in the group.
Instance segmentation is useful in autonomous vehicles to identify individual objects such
as pedestrians, other vehicles and obstacles along the navigation route. In medical
imaging, analysing scan images to detect specific abnormalities is useful for early
detection of cancer and other organ conditions.
Panoptic segmentation
Panoptic segmentation goes a step further in image segmentation of computer vision
tasks, by combining the features and processes of semantic and instance segmentation
techniques. So the panoptic segmentation algorithm creates a comprehensive image
analysis by simultaneously classifying every pixel and identifying distinct object instances
of the same class.
For example, in an image of a traffic scene with multiple cars and pedestrians, panoptic
segmentation would label all 'pedestrian' and 'car' pixels (semantic segmentation),
delineate each individual person and car (instance segmentation), and also classify the
surrounding scene elements such as road signs, traffic lights, buildings and background.
Panoptic segmentation therefore detects and interprets everything within a given image.
Panoptic segmentation leverages the strengths of fully convolutional networks (FCN) for
semantic context and Mask R-CNN for instance-specific details, which gives a combined
output for achieving a more holistic and nuanced understanding of visual data.

Traditional image segmentation techniques


The traditional image segmentation techniques, which formed the foundation of modern
deep learning based segmentation methods, include thresholding, edge detection,
region-based segmentation, clustering algorithms and watershed segmentation. These
techniques rely on principles of image processing, mathematical operations and heuristics
to separate an image into meaningful regions.
• Thresholding: This method involves selecting a threshold value and classifying image
pixels as foreground or background based on their intensity values (a minimal sketch
follows this list).
• Edge Detection: Edge detection methods identify abrupt changes in intensity or
discontinuities in the image, using algorithms such as the Sobel, Canny or Laplacian edge
detectors.
• Region-based segmentation: This method segments the image into smaller regions
and iteratively merges them based on predefined attributes in colour, intensity and
texture to handle noise and irregularities in the image.
• Clustering Algorithms: This method uses algorithms such as K-means or Gaussian
mixture models to group the pixels of an image into clusters based on similar features
such as colour or texture.
• Watershed Segmentation: Watershed segmentation treats the image like a
topographical map, where watershed lines are identified based on pixel intensity and
connectivity, like water flowing down different valleys.
These traditional methods offer basic techniques of image segmentation with limitations,
but provide foundation for more advanced methods.
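As a hedged illustration of the thresholding idea mentioned above (the image values and the threshold of 128 are made-up assumptions), a minimal NumPy sketch:

import numpy as np

# Synthetic 4x4 grayscale image with intensities in [0, 255].
image = np.array([
    [ 10,  20, 200, 210],
    [ 15, 180, 220, 205],
    [ 12, 190, 215, 200],
    [  8,  14,  16, 198],
], dtype=np.uint8)

threshold = 128                                # assumed threshold value
foreground = image > threshold                 # boolean mask: True = foreground pixel
segmented = foreground.astype(np.uint8) * 255  # binary segmentation map
print(segmented)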
Deep learning image segmentation models
Deep learning image segmentation models are powerful techniques that leverage neural
network architectures to automatically divide an image into different segments and
extract features from images for accurate analysis and segmentation tasks.
Below are some of the popular deep learning models used for image segmentation:
• U-Net: This model uses a U-shaped encoder-decoder network to efficiently segment
medical images. It works well with small amounts of data and provides precise
segmentation.
• Fully Convolutional Network (FCN): This model can process images of any size and
output spatial maps. This is achieved by replacing the fully connected layers of a
conventional CNN with convolutional layers, which allows the entire image to be
segmented pixel by pixel (see the usage sketch after this list).
• SegNet: This model uses an encoder-decoder network for tasks like scene
understanding and object recognition. The encoder captures the context of the image
and the decoder performs precise localization and segmentation of objects using that
context.
• DeepLab: The key feature of DeepLab is the use of atrous (dilated) convolutions to
capture multi-scale context with multiple parallel filters.
• Mask R-CNN: This model extends the Faster R-CNN object detection framework by
adding a branch that predicts segmentation masks alongside bounding box regression.
• Vision Transformer (ViT): A newer model that applies transformers to image
segmentation. The image is divided into patches, which are processed with
self-attention to capture the global context of the image.
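As a hedged usage sketch for the FCN bullet above (assuming torchvision >= 0.13 with pre-trained weights available; the input is a dummy tensor standing in for a real normalized image), segmentation with a pre-trained FCN-ResNet50 might look like:

import torch
from torchvision.models.segmentation import fcn_resnet50
from torchvision import transforms

model = fcn_resnet50(weights="DEFAULT").eval()  # pre-trained segmentation model

# Dummy RGB image in [0, 1]; replace with a real image tensor of shape (3, H, W).
image = torch.rand(3, 520, 520)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
batch = normalize(image).unsqueeze(0)           # shape: (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)["out"]                # shape: (1, num_classes, H, W)

seg_map = logits.argmax(dim=1)                  # per-pixel class labels
print(seg_map.shape)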
Applications of Image segmentation
Below is a list of different use cases of image segmentation in image processing:
• Autonomous Vehicles: Image segmentation helps autonomous vehicles in identifying
and segmenting objects like real time road lane detections, vehicles, pedestrians,
traffic signs for safe navigation.
• Medical Imaging Analysis: Image segmentation is used for segmenting organs, tumours
and other anatomical structures from medical images such as X-rays, MRIs and CT
scans, helping in diagnosis and treatment planning.
• Satellite Image Analysis: Used in analysing satellite images for landcover
classification, urban planning, and environmental changes.
• Object Detection and Tracking: Segmenting different objects in image or video for
different tasks like person detection, anomaly detection, and detecting different
activities in security systems.
• Content Moderation: Used in monitoring and segmenting inappropriate content from
images or videos for social media platforms.
• Smart Agriculture: Image segmentation methods are used by farmers and
agronomists for crop health monitoring, estimating yield and detecting plant diseases
from images and videos.
• Industrial Inspection: Image segmentation helps in manufacturing process for quality
control, detecting defects in products.

Object detection:

Object detection in deep learning is a computer vision task that involves identifying and locating
objects within an image or video. It goes beyond image classification by not only determining
what objects are present but also predicting their locations, typically using bounding boxes.
Object detection has applications in autonomous driving, surveillance, medical imaging, and
more.

Key Components of Object Detection

1. Classification: Identifying what objects are present in an image.


2. Localization: Determining the position of each object, often represented by bounding boxes.

Approaches in Object Detection

There are two main categories of object detection models:

1. Two-Stage Detectors

• Region Proposal Networks (RPN): These methods first generate region proposals where objects
might be located, then refine these proposals and classify them.
• Examples:
o R-CNN (Region-Based CNN): Extracts region proposals using selective search and applies
a CNN to each region.
o Fast R-CNN: Improves on R-CNN by sharing CNN computation.
o Faster R-CNN: Introduces RPNs to generate region proposals more efficiently.

2. Single-Stage Detectors

• These methods predict bounding boxes and class labels directly, without generating region
proposals.
• Faster but may sacrifice accuracy compared to two-stage detectors.
• Examples:
o YOLO (You Only Look Once): Splits the image into a grid and predicts bounding boxes
and class probabilities for each grid cell.
o SSD (Single Shot MultiBox Detector): Combines multi-scale feature maps and anchor
boxes for faster detection.
o EfficientDet: Balances accuracy and efficiency using compound scaling.

Key Concepts in Object Detection

1. Anchor Boxes: Predefined bounding boxes of different sizes and aspect ratios used as reference
points for predictions.
2. Intersection over Union (IoU): A metric that evaluates the overlap between predicted and
ground-truth bounding boxes (see the sketch after this list).
3. Loss Functions:
o Classification Loss: Measures the accuracy of predicted object classes.
o Localization Loss: Evaluates the accuracy of predicted bounding box coordinates (e.g.,
smooth L1 loss).
o Combined Loss: Sum of classification and localization losses.
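As a hedged illustration of the IoU metric listed above (boxes are assumed to be in (x1, y1, x2, y2) format; the example coordinates are made up), a minimal Python sketch:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)           # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # -> 25 / 175 ≈ 0.143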

Advanced Techniques

1. Feature Pyramid Networks (FPN): Enhances multi-scale feature detection by combining features
from different layers.
2. Attention Mechanisms: Improves detection by focusing on relevant regions.
3. Transformers (DETR): Uses attention-based mechanisms for object detection without relying on
anchor boxes.

Challenges

• Small Objects: Detecting tiny objects in large images.


• Occlusions: Handling overlapping objects.
• Real-Time Processing: Balancing speed and accuracy for applications like autonomous driving.
• Class Imbalance: Managing datasets where some object classes are underrepresented.

Automatic image captioning:

Automatic image captioning is the task of generating descriptive textual captions for given
images. It bridges computer vision and natural language processing (NLP), requiring the model
to understand the image content and express it in human language.
Key Components of Image Captioning

1. Image Understanding: Using computer vision to extract meaningful features from the image.
2. Language Generation: Employing NLP techniques to generate coherent and descriptive
sentences based on the extracted features.

Deep Learning Framework for Image Captioning

The typical architecture for automatic image captioning involves a combination of a
Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural
Network (RNN) (or Transformer) for language generation.

1. Image Feature Extraction:

• A pre-trained CNN (e.g., ResNet, Inception, VGG) processes the input image to extract a high-
level feature representation.
• The CNN’s output is either:
o A fixed-length vector (global features).
o Spatial features (feature maps) for more detailed representation.

2. Sequence Generation:

• An RNN, typically an LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), is used to
generate captions word-by-word.
• The extracted image features serve as the initial input to the RNN or as attention weights (see
below).
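As a hedged sketch of the feature-extraction step described above (assuming torchvision >= 0.13 with a pre-trained ResNet-50; the 2048-dimensional global feature and the dummy input are assumptions tied to that backbone), the CNN encoder could be used like this:

import torch
import torch.nn as nn
from torchvision.models import resnet50

# Pre-trained ResNet-50 with its classification head removed, used as an image encoder.
backbone = resnet50(weights="DEFAULT")
backbone.fc = nn.Identity()          # keep the 2048-d pooled feature instead of class logits
backbone.eval()

image = torch.rand(1, 3, 224, 224)   # dummy normalized RGB image; replace with real input
with torch.no_grad():
    features = backbone(image)       # shape: (1, 2048) global feature vector
print(features.shape)

This global feature vector (or a spatial feature map, if the pooling layer is also removed) is what initializes or conditions the RNN decoder.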

Attention Mechanism

To improve descriptive accuracy, attention mechanisms dynamically focus on specific parts
of the image while generating each word in the caption.

• Visual Attention:
o Allows the model to selectively emphasize different regions of the image.
o Common frameworks:
▪ Soft Attention: Assigns a probability distribution over all regions.
▪ Hard Attention: Selects one region at a time (stochastic).
• Self-Attention (Transformers):
o Models like Vision Transformers (ViT) and Transformers for Image Captioning (e.g.,
OSCAR, VinVL) use self-attention for both image understanding and language
generation.
Loss Functions

1. Cross-Entropy Loss: Minimizes the difference between the predicted and ground-truth word
probabilities during training.
2. Reinforcement Learning: Techniques like CIDEr optimization (via Reinforcement Learning) can
be used to directly optimize evaluation metrics like BLEU or CIDEr.

Popular Models for Image Captioning

1. Show and Tell (2015): Combines a CNN with an RNN for image captioning.
2. Show, Attend and Tell (2015): Introduces attention mechanisms to focus on specific image
regions during caption generation.
3. Neural Image Caption (NIC): The end-to-end encoder-decoder architecture underlying Show and Tell.
4. Transformers-Based Models: Recent models like OSCAR, VinVL, and ClipCap leverage
Transformers and multimodal embeddings for state-of-the-art performance.

Evaluation Metrics

Evaluating image captioning is challenging due to the subjective nature of language. Common
metrics include:

1. BLEU (Bilingual Evaluation Understudy): Measures the overlap of n-grams between predicted
and reference captions (see the sketch after this list).
2. METEOR: Focuses on semantic similarity by considering synonyms and stemming.
3. ROUGE: Evaluates recall of overlapping n-grams.
4. CIDEr (Consensus-based Image Description Evaluation): Designed specifically for image
captioning tasks, emphasizing human consensus.
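As a hedged example of computing BLEU (assuming the NLTK library is installed; the reference and candidate captions are made up):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions and a model-generated candidate caption.
references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on a field".split(),
]
candidate = "a dog is running on the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")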

Applications

1. Assistive Technology: Helping visually impaired individuals understand images.


2. Content Management: Automating caption generation for photos or videos in large-scale
datasets.
3. Autonomous Systems: Supporting navigation and object understanding.
4. Education: Creating descriptive annotations for educational content.
Image generation with Generative Adversarial Networks (GANs):

Image generation with Generative Adversarial Networks (GANs) involves using deep
learning models to create new, realistic images that mimic a given dataset. GANs are a class of
generative models introduced by Ian Goodfellow in 2014, consisting of two neural networks—a
generator and a discriminator—trained adversarially.

Key Components of GANs

1. Generator:
o A neural network that generates new images starting from a random noise vector
(latent space).
o Its goal is to produce images that are indistinguishable from real images to the
discriminator.
2. Discriminator:
o A neural network that classifies images as real (from the training data) or fake
(produced by the generator).
o Its goal is to correctly distinguish real images from generated ones.
3. Adversarial Training:
o The generator and discriminator play a two-player minimax game:
▪ The generator tries to maximize the discriminator's error.
▪ The discriminator tries to minimize its error.
o The loss function is typically based on binary cross-entropy.

Training Process

1. Sample random noise from a latent space and feed it to the generator.
2. Generate a fake image from the noise vector.
3. Pass both real images (from the dataset) and generated images to the discriminator.
4. Compute losses:
o Generator Loss: Encourages the generator to produce images that fool the
discriminator.
o Discriminator Loss: Ensures the discriminator can distinguish between real and fake
images.
5. Update the generator and discriminator alternately via backpropagation.

Variants of GANs for Image Generation

1. DCGAN (Deep Convolutional GAN):


o Introduces convolutional layers in both the generator and discriminator.
o Improves the stability of GAN training.
2. Conditional GAN (cGAN):
o Conditions the generation process on additional information, such as class labels or
textual descriptions.
o Example: Generating class-specific images.
3. Wasserstein GAN (WGAN):
o Replaces the standard GAN loss with the Wasserstein distance to improve training
stability.
o Often paired with gradient penalty (WGAN-GP).
4. Progressive Growing of GANs (ProGAN):
o Trains GANs progressively by starting with low-resolution images and gradually
increasing resolution.
o Improves stability and quality of high-resolution images.
5. StyleGAN:
o Introduces a style-based generator architecture that allows fine-grained control over
image generation.
o State-of-the-art for high-resolution image synthesis.
o Variants like StyleGAN2 improve quality and address artifacts.
6. CycleGAN:
o Used for image-to-image translation without paired datasets (e.g., converting photos to
paintings).

Applications of GAN-Based Image Generation

1. Art and Design: Generating realistic or stylized artwork.


2. Image-to-Image Translation:
o Example: Converting sketches to realistic images, enhancing photos.
3. Super-Resolution:
o Enhancing image resolution using GANs like SRGAN.
4. Data Augmentation:
o Creating synthetic data for training machine learning models.
5. Deepfake Creation:
o Generating realistic facial images or videos.
6. Medical Imaging:
o Generating synthetic medical scans for research and training.

Challenges in GANs

1. Training Instability:
o Balancing the generator and discriminator is difficult.
2. Mode Collapse:
o The generator produces limited varieties of images.
3. Evaluation:
o Metrics like Inception Score (IS) and Frechet Inception Distance (FID) are used but can
be inconsistent.

Example GAN Training Framework:

import torch
import torch.nn as nn
import torch.optim as optim

class Generator(nn.Module):
    def __init__(self, z_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, 784),
            nn.Tanh()  # Output normalized to [-1, 1]
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()  # Output between 0 and 1
        )

    def forward(self, img):
        return self.model(img)

# Hyperparameters
z_dim = 100
lr = 0.0002
batch_size = 64   # batch size for the (assumed) DataLoader
epochs = 100
eps = 1e-8        # small constant for numerical stability inside log()

# Initialize models and optimizers
generator = Generator(z_dim)
discriminator = Discriminator()
g_optimizer = optim.Adam(generator.parameters(), lr=lr)
d_optimizer = optim.Adam(discriminator.parameters(), lr=lr)

# Assumed: `dataloader` yields (images, labels) batches of 28x28 images scaled to [-1, 1],
# e.g. torchvision's MNIST wrapped in a DataLoader (not shown in these notes).

# Training loop skeleton
for epoch in range(epochs):
    for real_images, _ in dataloader:
        # Train discriminator
        real_images = real_images.view(-1, 784)              # flatten 28x28 images
        noise = torch.randn(real_images.size(0), z_dim)
        fake_images = generator(noise).detach()              # detach: do not update G here

        real_loss = torch.log(discriminator(real_images) + eps)
        fake_loss = torch.log(1 - discriminator(fake_images) + eps)
        d_loss = -(real_loss + fake_loss).mean()

        d_optimizer.zero_grad()
        d_loss.backward()
        d_optimizer.step()

        # Train generator
        fake_images = generator(torch.randn(real_images.size(0), z_dim))
        g_loss = -torch.log(discriminator(fake_images) + eps).mean()

        g_optimizer.zero_grad()
        g_loss.backward()
        g_optimizer.step()

    print(f"Epoch [{epoch}/{epochs}], D Loss: {d_loss.item():.4f}, G Loss: {g_loss.item():.4f}")


Video-to-Text generation with LSTM models:
Video-to-Text generation with LSTM models in deep learning is a task where a system
generates textual descriptions (captions) for video inputs. This involves extracting meaningful
features from video frames and using these features to generate coherent sentences. It combines
computer vision and natural language processing (NLP).

Key Steps in Video-to-Text with LSTMs

1. Video Feature Extraction:


o Videos are sequences of frames. Feature extraction typically involves:
▪ Sampling frames or clips from the video at regular intervals.
▪ Passing these frames through a pre-trained CNN (e.g., ResNet, VGG) to extract
feature vectors representing spatial information.
▪ Optionally, using models like 3D CNNs (e.g., C3D, I3D) or TimeSformer to
capture spatiotemporal features directly.
2. Sequence Modeling:
o The extracted features form a sequential input to a model like an LSTM or GRU, which
processes the temporal dependencies in the video.
o The LSTM generates the caption word by word based on the input features.
3. Caption Generation:
o A decoder network (often another LSTM) generates text descriptions word-by-word,
conditioned on the video features and previously generated words.
4. Training and Optimization:
o The model is trained to minimize the loss between the predicted caption and the ground
truth caption using techniques like cross-entropy loss.
o Attention mechanisms may be incorporated to improve the focus on relevant parts of
the video during caption generation.

Typical Architecture for Video-to-Text with LSTMs

1. Encoder-Decoder Framework:

• Encoder:
o Extracts spatial and temporal features from video frames.
o CNNs or 3D CNNs encode spatial features.
o LSTMs or GRUs encode temporal dependencies.
• Decoder:
o An LSTM or GRU generates captions by decoding the temporal features into a sequence
of words.
2. Attention Mechanisms:

• Introduced to focus on specific frames or regions while generating each word in the caption.
• Temporal attention focuses on relevant frames over time.
• Spatial-temporal attention integrates spatial and temporal aspects.

3. Word Embeddings:

• Pre-trained word embeddings (e.g., GloVe, Word2Vec) are often used to represent textual data
in a dense, semantic space.

Example Workflow for Video-to-Text with LSTMs

Step 1: Video Preprocessing

• Sample frames from the video at regular intervals.


• Resize frames and pass them through a pre-trained CNN (e.g., ResNet) to extract feature
vectors.
• Store features as a sequence.
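A hedged sketch of this preprocessing step (the "video" here is a stand-in tensor of random frames, and the every-8th-frame sampling rate and mean pooling are illustrative assumptions):

import torch
import torch.nn as nn
from torchvision.models import resnet50

# Pre-trained ResNet-50 used as a frame-level feature extractor (classifier removed).
cnn = resnet50(weights="DEFAULT")
cnn.fc = nn.Identity()
cnn.eval()

# Stand-in for a decoded video: 120 RGB frames of size 224x224 (random values here).
video = torch.rand(120, 3, 224, 224)
sampled = video[::8]                     # sample every 8th frame -> 15 frames

with torch.no_grad():
    frame_features = cnn(sampled)        # (15, 2048): one feature vector per sampled frame

video_feature = frame_features.mean(dim=0, keepdim=True)  # (1, 2048) pooled video feature
print(frame_features.shape, video_feature.shape)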

Step 2: Define the LSTM Model

import torch
import torch.nn as nn

class VideoCaptioningModel(nn.Module):
    def __init__(self, feature_dim, hidden_dim, vocab_size, embed_dim, num_layers=1):
        super(VideoCaptioningModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(feature_dim + embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # Embed captions for teacher forcing; drop the last (<end>) token
        embeddings = self.embedding(captions[:, :-1])                     # (B, T-1, embed_dim)

        # Repeat the video feature at every timestep and concatenate with word embeddings
        feats = features.unsqueeze(1).expand(-1, embeddings.size(1), -1)  # (B, T-1, feature_dim)
        inputs = torch.cat((feats, embeddings), dim=2)                    # (B, T-1, feature_dim + embed_dim)

        # Pass through LSTM
        lstm_out, _ = self.lstm(inputs)

        # Generate per-timestep word scores (logits over the vocabulary)
        outputs = self.fc(lstm_out)
        return outputs

Step 3: Training

• Use a dataset of videos with corresponding captions (e.g., MSR-VTT, YouCook2).


• Loss function: Cross-Entropy Loss.
• Optimizer: Adam or similar.

# Assumed to exist: model (a VideoCaptioningModel), dataloader yielding (features, captions)
# batches, vocab_size and num_epochs (names follow the snippet above).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    for features, captions in dataloader:
        outputs = model(features, captions)                    # (B, T-1, vocab_size)
        # Targets are the captions shifted by one token (predict the next word).
        loss = criterion(outputs.reshape(-1, vocab_size), captions[:, 1:].reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Evaluation Metrics

• BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between predicted and
ground truth captions.
• METEOR: Focuses on semantic similarity, considering synonyms and stemming.
• CIDEr: Specifically designed for image and video captioning tasks.
• ROUGE: Measures recall of overlapping n-grams.

Challenges in Video-to-Text with LSTMs

1. Temporal Dependencies:
o Capturing long-term dependencies in videos can be challenging.
o Solutions include attention mechanisms and hierarchical LSTMs.
2. Diversity in Captions:
o Videos can have multiple valid captions.
o Reinforcement learning (e.g., CIDEr optimization) can help.
3. Large Memory Requirements:
o Processing long video sequences requires significant computational resources.
4. Dataset Size:
o Requires large, annotated datasets for training (e.g., MSVD, ActivityNet Captions).

Advancements Beyond LSTMs

• Transformers: Models like ViT and BERT are increasingly used for video understanding and text
generation.
• Vision-Language Models: Multimodal transformers like CLIP and VideoBERT are state-of-the-art.

Attention Models for Computer Vision Tasks

Attention mechanisms have become a critical component in deep learning, significantly
improving the performance of models across various computer vision tasks. Inspired by human
visual attention, these models allow neural networks to focus on the most relevant regions of an
input image, enhancing feature extraction and improving task performance.

1. Attention

In the context of computer vision, attention mechanisms enable models to:


• Focus on important regions of an image instead of processing the entire image
uniformly.
• Weight features dynamically based on their relevance to the task at hand.
• Capture spatial relationships and enhance feature representation.

2. Types of Attention Mechanisms

Several attention mechanisms have been proposed for computer vision tasks:

a. Spatial Attention

Spatial attention focuses on the spatial locations in an image that are most relevant to a task.

• Key Idea: Assign different weights to different regions of an image.


• Example: In object detection, spatial attention helps the model focus on areas where
objects are likely to appear.

Formula:

$\text{Attention Map} = \sigma(\text{Conv}(F))$

where $F$ is the input feature map and $\sigma$ is a softmax or sigmoid activation function.

b. Channel Attention

Channel attention emphasizes the feature channels that are most informative for the task.

• Key Idea: Assign different weights to different channels of a feature map.


• Example: In image classification, channel attention helps the model focus on features
like color, texture, or edges.

Formula:

$\text{Attention Map} = \sigma(\text{MLP}(\text{GlobalAvgPool}(F)))$

where $F$ is the input feature map and MLP is a multi-layer perceptron.
c. Self-Attention (Non-local Attention)

Self-attention computes relationships between all pixels (or patches) in an image, allowing the
model to capture long-range dependencies.

• Key Idea: Each pixel attends to every other pixel in the image.
• Example: In image segmentation, self-attention helps capture the context of an entire
object rather than just local features.

Formula:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $Q$ (query), $K$ (key) and $V$ (value) are linear transformations of the input features,
and $d_k$ is the dimension of the key.
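As a hedged PyTorch sketch of the scaled dot-product attention formula above (single head; the batch, sequence and feature sizes are arbitrary):

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); returns (batch, seq_len, d_k).
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)             # attention weights over positions
    return weights @ V

# 2 images, 16 patches (or pixel positions) each, 64-dimensional projections.
Q = torch.rand(2, 16, 64)
K = torch.rand(2, 16, 64)
V = torch.rand(2, 16, 64)
print(scaled_dot_product_attention(Q, K, V).shape)      # -> torch.Size([2, 16, 64])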

d. Multi-Head Attention

Multi-head attention applies multiple self-attention mechanisms in parallel, allowing the model
to attend to different aspects of the image simultaneously.

• Key Idea: Learn different attention patterns by using multiple attention heads.
• Example: In Vision Transformers (ViTs), multi-head attention enables the model to
capture diverse contextual information.

Formula:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$

where each head is computed as

$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$
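A hedged usage sketch with PyTorch's built-in multi-head attention module (the embedding size, number of heads and tensor shapes are arbitrary choices):

import torch
import torch.nn as nn

# 8 attention heads over 64-dimensional patch embeddings; batch_first expects (B, seq, dim).
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.rand(2, 16, 64)                 # 2 images, 16 patch embeddings of size 64
out, attn_weights = mha(x, x, x)          # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)      # -> (2, 16, 64) and (2, 16, 16)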

3. Attention-Based Architectures for Vision

a. Vision Transformers (ViT)


Vision Transformers apply the transformer architecture to image data by splitting an image into
patches and processing them with self-attention.

• Input: Images split into fixed-size patches.


• Architecture: A stack of transformer encoder layers with multi-head self-attention.
• Advantages:
o Captures long-range dependencies.
o Scales well with large datasets.

Paper: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (2020)

b. Convolutional Block Attention Module (CBAM)

CBAM adds spatial and channel attention modules to convolutional layers, enhancing their
representational power.

• Key Components:
o Channel Attention: Focuses on important feature channels.
o Spatial Attention: Focuses on important spatial locations.

Formula:

$F' = \text{SpatialAttention}(\text{ChannelAttention}(F))$

Paper: "CBAM: Convolutional Block Attention Module"

c. SENet (Squeeze-and-Excitation Network)

SENet introduces a channel attention mechanism by adaptively recalibrating channel-wise
feature responses.

• Key Idea: Squeeze global information and excite important channels.


• Advantages: Lightweight and effective in improving CNN performance.

Paper: "Squeeze-and-Excitation Networks"

d. DETR (DEtection TRansformer)


DETR uses transformers for object detection by representing objects as a set of queries processed
with self-attention.

• Key Idea: Formulate object detection as a set prediction problem.


• Advantages: Removes the need for region proposals or anchors.

Paper: "End-to-End Object Detection with Transformers"

4. Applications of Attention in Computer Vision

1. Image Classification:
Focus on discriminative regions, improving accuracy.
2. Object Detection:
Enhance the detection of small or occluded objects by focusing on relevant regions.
3. Image Segmentation:
Capture context and boundaries more effectively by attending to relevant areas.
4. Image Captioning:
Generate captions by attending to relevant image regions during word generation.
5. Super-Resolution:
Improve the reconstruction of high-frequency details by attending to important features.

5. Advantages of Attention Models in Vision

• Improved Accuracy: By focusing on relevant features, attention models improve
performance across tasks.
• Long-Range Dependencies: Self-attention captures relationships between distant pixels
or regions.
• Flexibility: Attention mechanisms can be integrated into various architectures, from
CNNs to transformers.

6. Challenges

• Computational Complexity: Self-attention has quadratic complexity with respect to the
input size, making it computationally expensive for high-resolution images.
• Data Requirements: Attention-based models, especially Vision Transformers, often
require large datasets to perform well.
• Interpretability: While attention provides insights into what the model focuses on, it
does not guarantee interpretability.
