DL UNIT-III
Image Segmentation
Image segmentation involves partitioning a digital image into multiple segments (regions or objects) to simplify and analyze the image by separating it into meaningful components, which makes image processing more efficient by focusing on specific regions of interest. A typical image segmentation task goes through the following steps:
1. Groups pixels in an image based on shared characteristics like colour, intensity, or
texture.
2. Assigns a label to each pixel, indicating its belonging to a specific segment or object.
3. The resulting output is a segmented image, often visualized as a mask or overlay
highlighting the different segments.
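To make these steps concrete, the sketch below runs a pre-trained semantic segmentation model (DeepLabV3 with a ResNet-50 backbone) from torchvision and turns its output into a per-pixel label mask. The image path is hypothetical, and torchvision >= 0.13 is assumed for the weights argument.

import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained semantic segmentation model
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

# Standard ImageNet preprocessing; "street.jpg" is a hypothetical input image
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
image = Image.open("street.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)          # (1, 3, H, W)

with torch.no_grad():
    output = model(batch)["out"]                # (1, num_classes, H, W)

# Step 2 above: each pixel gets the label of its most likely class
mask = output.argmax(dim=1).squeeze(0)          # (H, W) segmentation mask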
Object Detection
Object detection in deep learning is a computer vision task that involves identifying and locating
objects within an image or video. It goes beyond image classification by not only determining
what objects are present but also predicting their locations, typically using bounding boxes.
Object detection has applications in autonomous driving, surveillance, medical imaging, and
more.
1. Two-Stage Detectors
• Region Proposal Networks (RPN): These methods first generate region proposals where objects
might be located, then refine these proposals and classify them.
• Examples:
o R-CNN (Region-Based CNN): Extracts region proposals using selective search and applies
a CNN to each region.
o Fast R-CNN: Improves on R-CNN by sharing CNN computation.
o Faster R-CNN: Introduces RPNs to generate region proposals more efficiently.
2. Single-Stage Detectors
• These methods predict bounding boxes and class labels directly, without generating region
proposals.
• Faster but may sacrifice accuracy compared to two-stage detectors.
• Examples:
o YOLO (You Only Look Once): Splits the image into a grid and predicts bounding boxes
and class probabilities for each grid cell.
o SSD (Single Shot MultiBox Detector): Combines multi-scale feature maps and anchor
boxes for faster detection.
o EfficientDet: Balances accuracy and efficiency using compound scaling.
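As a usage sketch for the detectors above, torchvision ships a pre-trained Faster R-CNN (a two-stage detector) that returns bounding boxes, class labels, and confidence scores per image. The image path is hypothetical, and torchvision >= 0.13 is assumed.

import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained two-stage detector: Faster R-CNN with a ResNet-50 FPN backbone
model = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street.jpg").convert("RGB")   # hypothetical input image
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    predictions = model([tensor])                 # one dict per input image

boxes = predictions[0]["boxes"]    # (N, 4) boxes in (x1, y1, x2, y2) format
labels = predictions[0]["labels"]  # (N,) predicted class indices
scores = predictions[0]["scores"]  # (N,) confidence scores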
Key Concepts
1. Anchor Boxes: Predefined bounding boxes of different sizes and aspect ratios used as reference points for predictions.
2. Intersection over Union (IoU): A metric to evaluate the overlap between predicted and ground-truth bounding boxes (see the sketch after this list).
3. Loss Functions:
o Classification Loss: Measures the accuracy of predicted object classes.
o Localization Loss: Evaluates the accuracy of predicted bounding box coordinates (e.g.,
smooth L1 loss).
o Combined Loss: Sum of classification and localization losses.
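A minimal sketch of the IoU metric from item 2, for axis-aligned boxes given as (x1, y1, x2, y2):

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14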
Advanced Techniques
1. Feature Pyramid Networks (FPN): Enhances multi-scale feature detection by combining features from different layers (a usage sketch follows this list).
2. Attention Mechanisms: Improves detection by focusing on relevant regions.
3. Transformers (DETR): Uses attention-based mechanisms for object detection without relying on
anchor boxes.
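A small sketch of the FPN idea (item 1) using torchvision's FeaturePyramidNetwork op; the backbone feature-map shapes below are made up for illustration.

from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Hypothetical feature maps from three backbone stages (coarser maps have more channels)
feats = OrderedDict([
    ("feat0", torch.rand(1, 64, 64, 64)),
    ("feat1", torch.rand(1, 128, 32, 32)),
    ("feat2", torch.rand(1, 256, 16, 16)),
])

# The FPN fuses them top-down into maps that all share 256 channels
fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 256], out_channels=256)
outputs = fpn(feats)
for name, tensor in outputs.items():
    print(name, tuple(tensor.shape))  # every output map now has 256 channels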
Challenges
Common challenges in object detection include small or heavily occluded objects, overlapping instances, class imbalance, and meeting real-time performance constraints.
Automatic Image Captioning
Automatic image captioning is the task of generating descriptive textual captions for given images. It bridges computer vision and natural language processing (NLP), requiring the model to understand the image content and express it in human language.
Key Components of Image Captioning
1. Image Understanding: Using computer vision to extract meaningful features from the image.
2. Language Generation: Employing NLP techniques to generate coherent and descriptive
sentences based on the extracted features.
Basic Architecture (Encoder-Decoder)
1. Feature Extraction:
• A pre-trained CNN (e.g., ResNet, Inception, VGG) processes the input image to extract a high-level feature representation.
• The CNN’s output is either:
o A fixed-length vector (global features).
o Spatial features (feature maps) for more detailed representation.
2. Sequence Generation:
• An RNN, typically an LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit), is used to
generate captions word-by-word.
• The extracted image features serve as the initial input to the RNN or as attention weights (see
below).
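Putting the two stages together, here is a minimal CNN-encoder + LSTM-decoder sketch in PyTorch (no attention); the embedding size, hidden size, and vocabulary size are hypothetical.

import torch
import torch.nn as nn
from torchvision import models

class CaptioningModel(nn.Module):
    def __init__(self, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        resnet = models.resnet50(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.feature_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Global image feature vector from the CNN encoder
        features = self.encoder(images).flatten(1)            # (B, 2048)
        features = self.feature_proj(features).unsqueeze(1)   # (B, 1, embed_dim)
        # Teacher forcing: the image feature is the first decoder input step
        embeddings = self.embedding(captions[:, :-1])
        inputs = torch.cat([features, embeddings], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                 # word logits per time step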
Attention Mechanism
• Visual Attention:
o Allows the model to selectively emphasize different regions of the image.
o Common frameworks:
▪ Soft Attention: Assigns a probability distribution over all regions.
▪ Hard Attention: Selects one region at a time (stochastic).
• Self-Attention (Transformers):
o Models like Vision Transformers (ViT) and Transformers for Image Captioning (e.g.,
OSCAR, VinVL) use self-attention for both image understanding and language
generation.
Loss Functions
1. Cross-Entropy Loss: Minimizes the difference between the predicted and ground-truth word
probabilities during training.
2. Reinforcement Learning: Reinforcement-learning techniques such as CIDEr optimization can be used to directly optimize evaluation metrics like BLEU or CIDEr.
Popular Models
1. Show and Tell (2015): Combines a CNN with an RNN for image captioning.
2. Show, Attend and Tell (2015): Introduces attention mechanisms to focus on specific image
regions during caption generation.
3. Neural Image Caption (NIC): An end-to-end neural network architecture.
4. Transformers-Based Models: Recent models like OSCAR, VinVL, and ClipCap leverage
Transformers and multimodal embeddings for state-of-the-art performance.
Evaluation Metrics
Evaluating image captioning is challenging due to the subjective nature of language. Common
metrics include:
1. BLEU (Bilingual Evaluation Understudy): Measures the overlap of n-grams between predicted
and reference captions.
2. METEOR: Focuses on semantic similarity by considering synonyms and stemming.
3. ROUGE: Evaluates recall of overlapping n-grams.
4. CIDEr (Consensus-based Image Description Evaluation): Designed specifically for image
captioning tasks, emphasizing human consensus.
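For instance, BLEU can be computed with NLTK's implementation; the reference and candidate captions below are made-up token lists.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "dog", "is", "running", "on", "the", "grass"]
candidate = ["a", "dog", "runs", "on", "the", "grass"]

# sentence_bleu takes a list of reference token lists and one candidate token list
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")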
Applications
Image captioning is used in accessibility tools that describe images for visually impaired users, in image search and retrieval, and in automatic content description.
Image Generation with GANs
Image generation with Generative Adversarial Networks (GANs) involves using deep learning models to create new, realistic images that mimic a given dataset. GANs are a class of generative models introduced by Ian Goodfellow in 2014, consisting of two neural networks, a generator and a discriminator, trained adversarially.
1. Generator:
o A neural network that generates new images starting from a random noise vector
(latent space).
o Its goal is to produce images that are indistinguishable from real images to the
discriminator.
2. Discriminator:
o A neural network that classifies images as real (from the training data) or fake
(produced by the generator).
o Its goal is to correctly distinguish real images from generated ones.
3. Adversarial Training:
o The generator and discriminator play a two-player minimax game:
▪ The generator tries to maximize the discriminator's error.
▪ The discriminator tries to minimize its error.
o The loss function is typically based on binary cross-entropy.
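In the original GAN formulation, this minimax game corresponds to the value function

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where D(x) is the discriminator's estimated probability that x is real and G(z) is an image generated from noise z.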
Training Process
1. Sample random noise from a latent space and feed it to the generator.
2. Generate a fake image from the noise vector.
3. Pass both real images (from the dataset) and generated images to the discriminator.
4. Compute losses:
o Generator Loss: Encourages the generator to produce images that fool the
discriminator.
o Discriminator Loss: Ensures the discriminator can distinguish between real and fake
images.
5. Update the generator and discriminator alternately via backpropagation.
Challenges in GANs
1. Training Instability:
o Balancing the generator and discriminator is difficult.
2. Mode Collapse:
o The generator produces limited varieties of images.
3. Evaluation:
o Metrics like Inception Score (IS) and Fréchet Inception Distance (FID) are used but can be inconsistent.
A minimal PyTorch sketch of the generator and discriminator for flattened 28x28 images (the output activations are typical choices, added here to make the snippet runnable):

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, 784),
            nn.Tanh(),  # flattened 28x28 output with pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # probability that the input image is real
        )

    def forward(self, img):
        return self.model(img)
# Hyperparameters
z_dim = 100
lr = 0.0002
batch_size = 64
epochs = 100

generator = Generator(z_dim)
discriminator = Discriminator()
g_optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=lr)

# One training step (in practice this runs in a loop over epochs and batches)
real_images = torch.randn(batch_size, 784)  # placeholder for a batch of real images
z = torch.randn(batch_size, z_dim)
fake_images = generator(z)

# Train discriminator: push D(real) towards 1 and D(fake) towards 0
real_loss = -torch.log(discriminator(real_images) + 1e-8).mean()
fake_loss = -torch.log(1 - discriminator(fake_images.detach()) + 1e-8).mean()
d_loss = real_loss + fake_loss
d_optimizer.zero_grad()
d_loss.backward()
d_optimizer.step()

# Train generator: push D(fake) towards 1 (fool the discriminator)
g_loss = -torch.log(discriminator(fake_images) + 1e-8).mean()
g_optimizer.zero_grad()
g_loss.backward()
g_optimizer.step()
Video Captioning
Video captioning extends image captioning from single images to video, generating a natural-language description of the video content. Key components include:
1. Encoder-Decoder Framework:
• Encoder:
o Extracts spatial and temporal features from video frames.
o CNNs or 3D CNNs encode spatial features.
o LSTMs or GRUs encode temporal dependencies.
• Decoder:
o An LSTM or GRU generates captions by decoding the temporal features into a sequence
of words.
2. Attention Mechanisms:
• Introduced to focus on specific frames or regions while generating each word in the caption.
• Temporal attention focuses on relevant frames over time.
• Spatial-temporal attention integrates spatial and temporal aspects.
3. Word Embeddings:
• Pre-trained word embeddings (e.g., GloVe, Word2Vec) are often used to represent textual data
in a dense, semantic space.
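A minimal sketch of plugging pre-trained word vectors into a PyTorch embedding layer; glove_matrix below is a random stand-in for real GloVe weights of shape (vocab_size, embed_dim).

import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 100
glove_matrix = torch.randn(vocab_size, embed_dim)  # placeholder for loaded GloVe vectors

# Initialise the embedding layer from the pre-trained matrix; freeze=False allows fine-tuning
embedding = nn.Embedding.from_pretrained(glove_matrix, freeze=False)

caption_ids = torch.tensor([[1, 42, 7, 2]])   # one tokenized caption
dense_caption = embedding(caption_ids)        # (1, 4, 100) dense word vectors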
A simplified PyTorch sketch of the decoder (the video feature vector is assumed to be pre-extracted by a CNN/LSTM encoder):

import torch
import torch.nn as nn

class VideoCaptioningModel(nn.Module):
    def __init__(self, embed_dim, hidden_dim, vocab_size):
        super(VideoCaptioningModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_features, captions):
        # video_features: (batch, embed_dim) encoder output; captions: (batch, seq_len) token ids
        # Embed captions without the final <end> token (teacher forcing)
        embeddings = self.embedding(captions[:, :-1])
        # The video feature vector acts as the first step of the decoder input
        inputs = torch.cat([video_features.unsqueeze(1), embeddings], dim=1)
        lstm_out, _ = self.lstm(inputs)
        outputs = self.fc(lstm_out)
        return outputs
Step 3: Training
# Assumes model = VideoCaptioningModel(...) and a batch (video_features, captions) from a DataLoader
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
outputs = model(video_features, captions)                     # (batch, seq_len, vocab_size)
loss = criterion(outputs.reshape(-1, outputs.size(-1)), captions.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
Evaluation Metrics
• BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between predicted and
ground truth captions.
• METEOR: Focuses on semantic similarity, considering synonyms and stemming.
• CIDEr: Specifically designed for image and video captioning tasks.
• ROUGE: Measures recall of overlapping n-grams.
Challenges in Video Captioning
1. Temporal Dependencies:
o Capturing long-term dependencies in videos can be challenging.
o Solutions include attention mechanisms and hierarchical LSTMs.
2. Diversity in Captions:
o Videos can have multiple valid captions.
o Reinforcement learning (e.g., CIDEr optimization) can help.
3. Large Memory Requirements:
o Processing long video sequences requires significant computational resources.
4. Dataset Size:
o Requires large, annotated datasets for training (e.g., MSVD, ActivityNet Captions).
Recent Trends
• Transformers: Models like ViT and BERT are increasingly used for video understanding and text generation.
generation.
• Vision-Language Models: Multimodal transformers like CLIP and VideoBERT are state-of-the-art.
Attention Mechanisms in Computer Vision
Several attention mechanisms have been proposed for computer vision tasks:
a. Spatial Attention
Spatial attention focuses on the spatial locations in an image that are most relevant to a task.
Formula (a common form, where the attention map reweights the input):
F' = \sigma(\text{Conv}(F)) \odot F
Where F is the input feature map, Conv is a small convolution that produces a spatial attention map, and \sigma is a softmax or sigmoid activation function.
b. Channel Attention
Channel attention emphasizes the feature channels that are most informative for the task.
Formula (a common squeeze-and-excitation style form):
F' = \sigma(\text{MLP}(\text{AvgPool}(F))) \odot F
Where F is the input feature map and MLP is a multi-layer perceptron applied to the pooled channel descriptor.
c. Self-Attention (Non-local Attention)
Self-attention computes relationships between all pixels (or patches) in an image, allowing the
model to capture long-range dependencies.
• Key Idea: Each pixel attends to every other pixel in the image.
• Example: In image segmentation, self-attention helps capture the context of an entire
object rather than just local features.
Formula:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
Where Q (query), K (key), and V (value) are linear transformations of the input features, and d_k is the dimension of the key.
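A direct implementation of this formula with plain tensor operations; in practice Q, K, and V come from learned linear projections of the input, but here the same random patch features are reused for all three to keep the sketch short.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V computed with basic tensor ops."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarity of queries and keys
    weights = torch.softmax(scores, dim=-1)            # attention distribution over positions
    return weights @ V

# 16 image patches, each a 64-dimensional feature (self-attention: Q = K = V)
patches = torch.randn(1, 16, 64)
out = scaled_dot_product_attention(patches, patches, patches)
print(out.shape)  # torch.Size([1, 16, 64])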
d. Multi-Head Attention
Multi-head attention applies multiple self-attention mechanisms in parallel, allowing the model
to attend to different aspects of the image simultaneously.
• Key Idea: Learn different attention patterns by using multiple attention heads.
• Example: In Vision Transformers (ViTs), multi-head attention enables the model to
capture diverse contextual information.
Formula:
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
where each \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V).
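PyTorch exposes this operation as nn.MultiheadAttention; a minimal self-attention sketch over hypothetical patch embeddings (batch_first requires PyTorch 1.9 or later).

import torch
import torch.nn as nn

# 8 attention heads over 64-dimensional patch embeddings (ViT-style)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

patches = torch.randn(1, 16, 64)                     # (batch, num_patches, embed_dim)
out, attn_weights = mha(patches, patches, patches)   # self-attention: query = key = value
print(out.shape, attn_weights.shape)                 # (1, 16, 64) and (1, 16, 16)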
Paper: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (2020)
Convolutional Block Attention Module (CBAM)
CBAM adds spatial and channel attention modules to convolutional layers, enhancing their representational power.
• Key Components:
o Channel Attention: Focuses on important feature channels.
o Spatial Attention: Focuses on important spatial locations.
Formula:
F' = \text{SpatialAttention}(\text{ChannelAttention}(F))
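A compact PyTorch sketch of this composition, loosely following the CBAM design; the reduction ratio and 7x7 kernel are typical choices rather than requirements.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))       # MLP over average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))        # MLP over max-pooled channel descriptor
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w                             # reweight the channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.conv(pooled))     # (B, 1, H, W) spatial attention map
        return x * w

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        # F' = SpatialAttention(ChannelAttention(F)), as in the formula above
        return self.spatial(self.channel(x))

cbam = CBAM(64)
refined = cbam(torch.randn(1, 64, 32, 32))       # same shape, attention-refined features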
Applications of Attention in Computer Vision
1. Image Classification:
Focus on discriminative regions, improving accuracy.
2. Object Detection:
Enhance the detection of small or occluded objects by focusing on relevant regions.
3. Image Segmentation:
Capture context and boundaries more effectively by attending to relevant areas.
4. Image Captioning:
Generate captions by attending to relevant image regions during word generation.
5. Super-Resolution:
Improve the reconstruction of high-frequency details by attending to important features.
Challenges
Attention adds computational cost: self-attention in particular scales quadratically with the number of pixels or patches, and transformer-based vision models typically require large training datasets to perform well.