Deep Learning: Modules 4 and 5
1. Introduction
An Autoencoder is a type of unsupervised artificial neural network used for learning efficient
codings (representations) of input data. Its primary goal is to learn a compressed, encoded
representation and then reconstruct the original input from that compressed data.
Autoencoders are especially useful in tasks such as dimensionality reduction, denoising, feature
extraction, and generative modeling.
2. Architecture of Autoencoder
1. Encoder: Compresses the input into a lower-dimensional latent representation (the code).
2. Bottleneck (Latent Space): The compressed code that captures the most important features of the input.
3. Decoder: Reconstructs the input from the latent code.
Goal: Minimize the reconstruction loss between the input x and the reconstructed output x̂.
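A minimal sketch of this encoder–decoder structure (assuming PyTorch; the layer sizes and the dummy batch are purely illustrative):

import torch
import torch.nn as nn

# Minimal autoencoder: the encoder compresses a 784-dim input to a 32-dim code,
# and the decoder reconstructs the input from that code (sizes are illustrative).
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)        # compressed representation
        return self.decoder(z)     # reconstruction x_hat

model = Autoencoder()
criterion = nn.MSELoss()           # reconstruction loss ||x - x_hat||^2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)            # dummy batch of flattened images
x_hat = model(x)
loss = criterion(x_hat, x)
loss.backward()
optimizer.step()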
5. Types of Autoencoders
• Variational Autoencoder (VAE): Learns a probabilistic latent distribution and is used in generative modeling.
6. Applications of Autoencoders
• Data Compression
7. Advantages
8. Disadvantages
This is widely used in medical imaging, document restoration, and satellite images.
10. Summary
• Autoencoders are neural networks designed for unsupervised learning and reconstruction
tasks.
• They serve in various domains like compression, denoising, anomaly detection, and
generative tasks.
• Their effectiveness depends on the architecture, loss function, and training method.
1. Introduction
Autoencoders in education help uncover latent patterns in how students learn, where they struggle,
and what learning interventions might help them.
• They can reveal hidden skills, concept mastery, or cognitive gaps that are not directly
observable.
The architecture is similar to a basic autoencoder, but the input features come from student data:
• Input Layer: Student interaction data (e.g., quiz attempts, video watch duration, number of
hints used)
Knowledge tracing is the task of modeling a student’s evolving knowledge over time. Educational
autoencoders can be used to:
• Decode and predict future performance (e.g., "Will the student answer the next question
correctly?").
Diagram
7. Example Scenario
Imagine an online learning platform like Coursera:
• Input: Data from a student’s video viewing time, quiz scores, number of retries, and time
spent per module.
• Autoencoder encodes this into a latent skill vector.
• Decoder reconstructs or predicts how likely the student is to succeed in the next quiz.
• Based on the results, the system recommends remedial videos or exercises.
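A minimal sketch of this idea (assuming PyTorch; the feature names, sizes, and values are hypothetical, not taken from any real platform):

import torch
import torch.nn as nn

# Hypothetical student-interaction features: quiz score, retries,
# video watch time, time per module, hints used (all normalized to [0, 1]).
student_features = torch.tensor([[0.8, 0.2, 0.9, 0.5, 0.1]])

encoder = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 2))
decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 5))

latent_skill_vector = encoder(student_features)   # compressed "skill" code
reconstruction = decoder(latent_skill_vector)     # reconstructed interaction profile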
9. Advantages
• Captures complex learning patterns
• Unsupervised – no need for labeled student skills
• Handles sparse and noisy educational data
• Improves personalization and adaptive learning
10. Disadvantages
• Hard to interpret latent features directly.
• May require large datasets for effective training.
• Risk of overfitting with small or biased datasets.
11. Summary
• Educational Autoencoders are neural models used to analyze and model student learning
data.
• They help uncover latent knowledge, track learning progress, and support personalized
education.
• Widely applied in EdTech systems, online courses, and adaptive tutoring systems.
1. Introduction
A Regularized Autoencoder is a variant of the standard autoencoder that introduces additional
constraints (regularization) on the latent space or network weights. These constraints help the
model learn more robust, generalizable, and sparse representations, especially when the goal is to
avoid overfitting or to encourage interpretability.
While basic autoencoders focus only on minimizing reconstruction error, regularized autoencoders
include penalty terms in their loss function to improve learning quality.
• Sparse Autoencoder: Applies sparsity constraints to force only a few neurons to activate.
• Denoising Autoencoder: Trains the model to reconstruct clean input from noisy input.
The Sparse Autoencoder adds a sparsity penalty on the latent activations to the reconstruction loss, e.g. L = ∥x − x̂∥² + λ Σ |z_i|.
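A minimal sketch of such a penalized loss (assuming PyTorch; λ, written lam below, is a tunable hyperparameter):

import torch

def sparse_ae_loss(x, x_hat, z, lam=1e-3):
    # Reconstruction term ||x - x_hat||^2 plus an L1 penalty on the
    # latent activations z, which drives most activations toward zero.
    reconstruction = torch.mean((x - x_hat) ** 2)
    sparsity_penalty = lam * torch.sum(torch.abs(z), dim=1).mean()
    return reconstruction + sparsity_penalty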
7. Applications
8. Scenario-Based Example
This approach is commonly used in cybersecurity, fraud detection, and sensor monitoring.
9. Advantages
10. Disadvantages
• Some regularization methods (e.g., the VAE) require more complex mathematics (e.g., sampling tricks).
11. Summary
• They include variants such as Sparse, Denoising, Contractive, and Variational Autoencoders.
1. Introduction
Variational Autoencoders (VAEs) are generative models based on the autoencoder architecture but
designed with probabilistic reasoning. Unlike traditional autoencoders, which encode an input to a
fixed vector, VAEs encode it into a distribution (usually Gaussian), allowing the model to generate
new, similar data.
Introduced by Kingma and Welling in 2013, VAEs are widely used in image generation,
representation learning, and anomaly detection.
2. Key Concept
In a standard autoencoder:
In a VAE:
This makes VAEs probabilistic and allows sampling of new data from the learned distribution.
3. Architecture of a VAE
Encoder (Inference Network):
Reparameterization Trick:
• Use:
z = μ + σ⋅ϵ, where ϵ ∼ N(0, 1)
1. Reconstruction Loss
Ensures the output resembles the input (e.g., Mean Squared Error or Binary Cross Entropy).
2. KL Divergence Loss
Measures how far the learned latent distribution q(z∣x) is from the standard normal prior p(z) = N(0, 1).
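A minimal sketch of the reparameterization trick and the combined VAE loss (assuming PyTorch and a Gaussian latent with diagonal covariance; this is a generic formulation, not tied to a specific dataset):

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)      # sigma
    eps = torch.randn_like(std)        # epsilon ~ N(0, 1)
    return mu + std * eps              # z = mu + sigma * epsilon

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term (here binary cross-entropy, for inputs in [0, 1])
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl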
6. Applications of VAEs
7. Scenario-Based Example
8. Advantages
9. Disadvantages
10. Summary
• VAEs are probabilistic autoencoders that learn a distribution over latent variables instead of
deterministic codes.
• They are used for generative tasks, feature extraction, and data reconstruction.
• VAEs offer a principled, probabilistic framework for generating diverse and meaningful data.
1. Introduction
A Denoising Autoencoder (DAE) is a type of autoencoder neural network that is trained to remove
noise from data. Instead of simply copying input to output, DAEs learn to reconstruct clean data
from a corrupted version.
It was introduced to make autoencoders more robust and to prevent them from simply memorizing
the input.
2. Motivation
Traditional autoencoders may overfit and fail to generalize to slightly altered or noisy inputs.
Denoising Autoencoders address this by:
• Forcing the model to learn useful representations that are stable under input corruption.
3. Working Principle
Training Strategy:
3. The decoder reconstructs a clean version x̂, not the noisy input.
• Salt and Pepper noise: Random black/white pixel drops (for images).
The typical loss function is Mean Squared Error (MSE) or Binary Cross-Entropy between the clean input x and the reconstructed output x̂:
L = ∥x − x̂∥²
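A minimal sketch of one denoising training step (assuming PyTorch and simple Gaussian corruption; model can be any autoencoder such as the one sketched earlier):

import torch
import torch.nn as nn

def denoising_step(model, x_clean, optimizer, noise_std=0.2):
    # Corrupt the input, but compute the loss against the CLEAN target.
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    x_hat = model(x_noisy)
    loss = nn.functional.mse_loss(x_hat, x_clean)   # L = ||x - x_hat||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()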
8. Scenario-Based Example
9. Advantages
10. Disadvantages
• May not perform well on highly corrupted or complex noise without enough data.
12. Summary
• Denoising Autoencoders are trained to reconstruct clean input from noisy versions.
• They bridge unsupervised learning with practical applications like image and audio cleaning.
1. Introduction
Autoencoders are a type of unsupervised neural network that learn to compress and reconstruct
data. Once trained, they can extract meaningful patterns and features. Their ability to learn efficient
representations has led to a wide range of applications in various domains.
1. Dimensionality Reduction
• Example: Reducing image dimensions before classification to improve speed and reduce
overfitting.
2. Image Denoising
4. Data Compression
5. Feature Extraction
• Latent vectors produced by the encoder contain important features of the input.
• VAEs learn probabilistic latent spaces and can generate new data samples.
• Autoencoders can be extended with RNNs for sequential data like text or speech.
12. Clustering
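A minimal sketch of using the trained encoder's latent vectors as features for clustering (assuming PyTorch and scikit-learn; the Linear layer below is a stand-in for a trained encoder):

import torch
from sklearn.cluster import KMeans

data = torch.rand(100, 784)               # dummy inputs, e.g. flattened images
encoder = torch.nn.Linear(784, 32)        # stand-in for a trained encoder module

with torch.no_grad():
    latent = encoder(data).numpy()        # extracted feature vectors

clusters = KMeans(n_clusters=5, n_init=10).fit_predict(latent)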
3. Scenario-Based Example
4. Advantages in Applications
6. Conclusion
Autoencoders are a powerful tool in deep learning, providing robust applications in representation
learning, anomaly detection, generation, and dimensionality reduction. Their flexibility across
domains makes them essential in both academic research and industrial deployment.
1. Introduction
Generative Adversarial Networks (GANs) are a class of deep generative models introduced by Ian
Goodfellow in 2014. GANs are used to generate new, synthetic data samples that resemble a given
dataset. They are especially powerful in image synthesis, video generation, data augmentation, and
more.
a) Generator (G)
• Tries to generate realistic data that looks like the training data.
b) Discriminator (D)
The generator and discriminator are trained in a minimax game where the generator tries to
maximize the probability of fooling the discriminator while the discriminator tries to minimize it.
5. Generator is trained to fool the discriminator (i.e., make D(G(z)) close to 1).
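For reference, this minimax objective from the original GAN formulation (Goodfellow et al., 2014) can be written as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

The discriminator D tries to maximize V (classify real vs. fake correctly), while the generator G tries to minimize it (make D(G(z)) look real).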
6. Applications of GANs
• Data Augmentation: Create synthetic training data for better model generalization.
7. Advantages
8. Disadvantages
• Training is unstable (non-convergence, mode collapse).
• Sensitive to hyperparameters.
9. Scenario-Based Example
11. Summary
• GANs are powerful generative models that learn by playing a game between two networks.
• They are widely used in realistic data generation, image editing, and creative AI
applications.
1. Introduction
The Transformer model, introduced in the paper “Attention is All You Need” by Vaswani et al. (2017),
revolutionized the field of Natural Language Processing (NLP) and became the foundation for models
like BERT, GPT, and many others. At the core of the Transformer architecture lies the Attention
Mechanism, which enables the model to focus on different parts of the input sequence when
producing output, regardless of their position.
2. Attention Mechanism
• In Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models,
information is passed sequentially from one step to the next, which makes it difficult to
capture long-range dependencies efficiently.
b) The Need for Attention
• Attention allows the model to focus on specific parts of the input sequence dynamically. For
example, in translation, the model may focus more on certain words in the source language
while generating the target language sentence.
The attention mechanism computes a weighted sum of the input sequence based on the importance
of each element. This is done by computing query (Q), key (K), and value (V) representations.
• Key (K): Represents the parts of the input that may be relevant to the query.
• Value (V): Represents the information that will be passed to the next step, weighted by the
attention scores.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
where d_k is the dimension of the key vectors. This results in a contextualized output where
each part of the input is weighted based on its relevance to the query.
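A minimal sketch of scaled dot-product attention (assuming NumPy; the shapes are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # weighted sum of values

Q = np.random.rand(4, 8)    # 4 query positions, d_k = 8
K = np.random.rand(6, 8)    # 6 key positions
V = np.random.rand(6, 16)   # values with d_v = 16
output = scaled_dot_product_attention(Q, K, V)                # shape (4, 16)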
4. Transformer Architecture
a) Encoder-Decoder Structure
• Decoder also has similar layers, but with an additional attention mechanism that attends to
the encoder's output.
a) Overview
BERT – Architecture
1. Bidirectional Contextualization:
o BERT learns context from both the left and right side of a word, as opposed to
unidirectional models like GPT.
o It is trained using a masked language model (MLM) objective, where random words
are masked, and the model is tasked with predicting them based on their context.
1. Masked Language Model (MLM): Randomly masks some percentage of the input
tokens and predicts them based on the context.
2. Next Sentence Prediction (NSP): Predicts whether one sentence follows another,
which is useful for tasks like question answering and natural language inference.
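A quick illustration of the MLM objective, assuming the Hugging Face transformers library and a pretrained BERT checkpoint are available (the example sentence is made up):

from transformers import pipeline

# Fill-mask pipeline with pretrained BERT: predicts the masked word
# from both its left and right context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("The capital of France is [MASK]."))
# Expected top prediction: "paris".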
d) Fine-Tuning BERT
• After pre-training, BERT can be fine-tuned with task-specific labels. This enables it to be
highly versatile for various NLP tasks.
a) Advantages of Transformers
• Parallelization: Unlike RNNs and LSTMs, which process data sequentially, Transformers can
process all tokens at once, making training faster and more efficient.
• Scalability: Transformers can be scaled to handle very large datasets, making them suitable
for large-scale language models.
b) Advantages of BERT
• Pre-training and Transfer Learning: BERT's pre-training on large datasets allows it to be fine-
tuned with smaller datasets, enabling it to perform well across various NLP tasks.
6. Applications of BERT
• Sentiment Analysis: Fine-tuned BERT can be used to classify the sentiment of text (positive,
negative, or neutral).
• Named Entity Recognition (NER): BERT can be fine-tuned for NER tasks to identify and
classify entities like people, organizations, and locations in text.
• Text Classification: BERT can classify text into predefined categories, such as spam detection
or news categorization.
• Machine Translation: Although BERT is not typically used for translation, its bidirectional
architecture can assist in translation tasks by understanding the context of words in the
source language.
• Computational Resources: Transformers like BERT are large models that require substantial
computational resources for both pre-training and fine-tuning.
• Fine-tuning Complexity: Although BERT is versatile, fine-tuning it for specific tasks can be
computationally expensive and time-consuming.
8. Summary Table
Directionality: The Transformer can be unidirectional (as in GPT) or bidirectional; BERT is bidirectional (uses both left and right context).
Primary Use: The Transformer is used for sequence-to-sequence tasks; BERT is used mainly for text understanding and classification.
9. Conclusion
• Transformers and BERT have transformed the field of NLP by enabling highly efficient and
accurate models for various tasks. The introduction of the attention mechanism has solved
many limitations of previous models like RNNs, allowing for better handling of long-range
dependencies and enabling parallelization. BERT, with its bidirectional nature and pre-
training/fine-tuning approach, has become the backbone of modern NLP applications,
setting a new standard for language understanding.
Module-5 - Deep Architectures for Heterogeneous Data Processing
GPT (Generative Pre-trained Transformer) is a type of language model based on the Transformer
architecture. It was developed by OpenAI and is known for its ability to generate coherent and
contextually appropriate text based on input prompts. GPT has undergone several iterations, with
GPT-1, GPT-2, GPT-3, and the more recent GPT-4 being the most notable.
Here is a detailed explanation of GPT, including its structure, working, and applications:
1. Introduction to GPT
• Language Modeling: It can predict the next word in a sequence based on prior context.
GPT is trained on a massive amount of text data from various sources, which enables it to generate
human-like text and perform various NLP tasks.
2. GPT Architecture
GPT is built from a stack of Transformer decoder blocks, each containing:
• Masked multi-head self-attention over the previously generated tokens.
• Feedforward neural networks to process and transform the input text at each layer.
Key Components:
1. Input Embeddings: Words are converted into vectors (embeddings) to be processed by the
model.
2. Positional Encoding: Since the Transformer does not inherently handle sequential data,
positional encodings are added to input embeddings to provide information about the
position of words in the sequence.
3. Multi-Head Attention: This mechanism allows the model to focus on different parts of the
sequence when generating the output, enabling it to capture long-range dependencies.
4. Feedforward Networks: After attention, the output is passed through fully connected layers
to further transform the data.
5. Output Layer: The model generates a probability distribution over the vocabulary for the
next word in the sequence.
1. Pre-training:
o The model is pre-trained on large datasets (e.g., books, articles, websites) using the
objective of predicting the next word in a sentence. The training process optimizes
the model to understand and generate text based on vast linguistic patterns.
o The pre-training task is typically called language modeling, where the model learns
the statistical properties of a language, like syntax, grammar, and general knowledge
about the world.
2. Fine-tuning:
GPT-1:
• GPT-1 introduced the core idea of using a Transformer-based architecture for language
modeling.
• It contained 117 million parameters and demonstrated the effectiveness of pre-training for
generative tasks.
GPT-2:
• GPT-2 significantly scaled up the model to 1.5 billion parameters, resulting in a large
improvement in text generation and understanding.
• It was initially not released publicly due to concerns about its potential for misuse (e.g.,
generating misleading or harmful text).
GPT-3:
• GPT-3 is a much larger model with 175 billion parameters and has demonstrated impressive
capabilities in generating coherent and contextually relevant text across a wide range of
domains.
GPT-4:
• GPT-4 is even larger than GPT-3, and it has improved capabilities, especially in understanding
context, reasoning, and generating highly coherent and context-aware text.
GPT operates in an autoregressive manner, which means it generates text one token at a time, using
the previously generated tokens as context for predicting the next one.
For example, when completing a sentence, it uses the previous words to predict the next one, and this process continues until a
stopping condition (e.g., maximum token length or special end token) is met.
6. Applications of GPT
1. Text Generation:
o GPT can generate creative writing, blog posts, essays, or even poetry given a prompt.
o Example: GPT can be used for content creation or idea generation for writers and
marketers.
2. Question Answering:
o GPT can answer questions based on provided context or its pre-trained knowledge.
3. Summarization:
o GPT can condense long documents or articles into shorter, meaningful summaries.
4. Translation:
o GPT can translate text from one language to another with reasonable accuracy.
5. Sentiment Analysis:
o By analyzing the context of the text, GPT can determine the sentiment (positive,
negative, neutral).
o Example: Used in social media monitoring, customer feedback analysis, and market
research.
6. Text Completion:
o GPT can be used for auto-completing text based on partial input, useful in coding,
email drafting, etc.
7. Chatbots and Conversational AI:
o GPT can engage in conversations, making it ideal for building chatbots and virtual
assistants like Siri or Alexa.
7. Limitations of GPT
• Contextual Understanding: While GPT performs exceptionally well in many tasks, it lacks
deep understanding of complex or nuanced topics and can generate incorrect or misleading
information if not properly fine-tuned or verified.
• Biases: Since GPT is trained on a massive dataset scraped from the web, it can sometimes
generate biased or offensive content based on the biases present in the data.
• Computational Resources: GPT models, especially GPT-3 and GPT-4, require significant
computational resources to train and deploy, making them costly to use at scale.
8. Conclusion
GPT models, especially with their large-scale architecture and transformer-based attention
mechanisms, have transformed natural language processing. They have demonstrated remarkable
capabilities in text generation, language understanding, and conversational AI. With further
research, future iterations of GPT are expected to continue improving in terms of understanding
context, reasoning abilities, and real-world application effectiveness.
1. Greedy Search
What it is:
Greedy Search is the simplest decoding strategy used in natural language generation tasks like text
generation. In this strategy, at each time step, the model selects the next word that has the highest
probability. It doesn’t consider future words or the overall sequence quality, making it fast but often
leading to repetitive and predictable text.
How it Works:
• Given a sequence (starting with a prompt), the model generates a probability distribution for
the next word at each step.
• The word with the highest probability is selected and added to the sequence.
• The process continues until the model generates an end token or reaches a specified length.
Pros:
• Fast and simple, since only a single candidate sequence is tracked at each step.
Cons:
• May get stuck in local optima (suboptimal solutions) because it always chooses the most
probable word, ignoring potentially better alternatives in future steps.
Example:
• At each step, the model predicts the next word with the highest probability, like "there" ->
"was" -> "a" -> "king", forming a predictable sentence.
2. Beam Search
What it is:
Beam Search is an advanced decoding strategy that overcomes some of the limitations of Greedy
Search. It considers multiple possible sequences (called "beams") at each time step and keeps track
of the top k sequences, where k is the beam width. It tries to balance between exploration and
exploitation.
How it Works:
• At each step, the model generates a probability distribution for the next word.
• Instead of picking just the word with the highest probability, Beam Search keeps track of the
top k sequences, where k is a predefined number (called beam width).
• For each of the k sequences, it evaluates the possible next words, and chooses the best ones
based on their cumulative probability across the sequence.
Pros:
• More exploration: It’s less likely to get stuck in repetitive or low-quality sequences compared
to Greedy Search.
Cons:
• Can still miss diverse outputs if k is too small, leading to a limited set of possible sequences.
Example:
• With a beam width of 3, instead of just selecting the most probable word, the model will
evaluate the top 3 possible continuations, generating a more diverse set of options (e.g.,
"there was a king", "there lived a queen", "there was a dragon").
Imagine you’re trying to predict the next word in the sentence: “I like to eat ___.” Now, you
could have many potential words to fill in the blank like “cake”, “apples”, “pasta”, etc. Let’s
dive into Beam Search and see how it tackles this.
Let’s say our beam width k is 2. This means that at each step, Beam Search will consider the
top 2 sequences (combinations of words) based on their probabilities.
1st Step: The model predicts the probabilities for the next word after “I”. Let’s say the
highest probabilities are for “like” and “am”. So, Beam Search keeps these two sequences:
1. “I like”
2. “I am”
2nd Step: Now, for each of the sequences, the model predicts the next word:
1. “I like to”
2. “I like eating”
3. “I am happy”
4. “I am eating”
From these, it picks the top 2 sequences based on their probabilities. Let’s say it chooses:
1. “I like to”
2. “I am happy”
3rd Step: Repeat the process. For “I like to”, it predicts “eat” and “play”. For “I am happy”, it
predicts “to” and “because”.
New sequences:
1. “I like to eat”
2. “I like to play”
3. “I am happy to”
4. “I am happy because”
From these, the top 2 sequences are again kept:
1. “I like to eat”
2. “I am happy because”
And this process continues until an end-of-sequence token is encountered or until a set
sequence length is reached.
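A minimal sketch of beam search over such toy probabilities (the vocabulary and next_token_probs are stand-ins; real systems sum log-probabilities, as done here, for numerical stability):

import numpy as np

vocab = ["<eos>", "I", "like", "am", "to", "eat", "happy"]

def next_token_probs(sequence):
    # Toy stand-in for a language model's next-word distribution.
    rng = np.random.default_rng(len(sequence) * 7 + 1)
    p = rng.random(len(vocab))
    return p / p.sum()

def beam_search(prompt, beam_width=2, max_len=5):
    beams = [(list(prompt), 0.0)]                       # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            probs = next_token_probs(seq)
            for i, p in enumerate(probs):
                candidates.append((seq + [vocab[i]], score + np.log(p)))
        # Keep only the top-k sequences by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [" ".join(seq) for seq, _ in beams]

print(beam_search(["I"]))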
3. Sampling-based Strategies
Sampling-based strategies introduce randomness into the generation process, allowing for more
diverse and creative outputs. Instead of always choosing the word with the highest probability, these
methods sample words from the probability distribution generated by the model.
How it Works:
• Random Sampling: The model generates a probability distribution for the next word. A word
is then randomly selected based on its probability. This leads to more diverse and creative
sequences but can result in less coherent outputs.
• Top-k Sampling: The model only considers the top k words with the highest probabilities and
then samples from them, reducing the chances of selecting low-probability words.
• Top-p Sampling (Nucleus Sampling): Instead of selecting from the top k words, the model
selects words from the smallest set whose cumulative probability exceeds a threshold p. This
ensures that only high-probability words are considered but still allows for more diversity.
Pros:
• Diverse outputs: These strategies prevent the model from always generating the same
sequence and can lead to more creative text.
• Flexibility: By adjusting parameters like k or p, you can control the degree of randomness
and diversity in the generated text.
Cons:
• Less coherent: Since randomness is introduced, the generated text may not always make sense
and can be disjointed or nonsensical.
• Harder to control: There’s no guarantee that the output will be relevant or meaningful,
especially when the randomness is too high.
Example:
• Random Sampling might result in different endings, like "there was a dragon" or "a forest
grew up".
• Top-k Sampling: If k=5, the model will only consider the top 5 most likely next words and
sample from them, producing more structured but still diverse outputs.
• Top-p Sampling: If p=0.9, the model considers the smallest set of words whose cumulative
probability exceeds 90%, allowing for more natural variations in the output.
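A minimal sketch of top-k and top-p filtering over a next-word distribution (assuming NumPy; the probability values are a toy example):

import numpy as np

def top_k_sample(probs, k, rng):
    # Keep only the k most probable words, renormalize, then sample.
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return rng.choice(top, p=p)

def top_p_sample(probs, p_threshold, rng):
    # Keep the smallest set of words whose cumulative probability
    # reaches p_threshold (nucleus sampling), renormalize, then sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p_threshold) + 1
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=p)

rng = np.random.default_rng(0)
probs = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.05])    # toy next-word distribution
print(top_k_sample(probs, k=3, rng=rng))                 # index sampled from the top 3
print(top_p_sample(probs, p_threshold=0.9, rng=rng))     # index sampled from the nucleus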
Summary of Decoding Strategies:
• Greedy Search is fast and simple, but often generates predictable and repetitive text.
• Beam Search keeps several candidate sequences in parallel and usually produces higher-quality output, at additional computational cost.
• Sampling-based strategies (including Top-k and Top-p Sampling) offer the highest diversity
and creativity by introducing randomness, but they may sacrifice coherence and control.
Each of these strategies has its place in text generation depending on the use case, desired output
quality, and computational resources. For example, Greedy Search might be used for real-time
applications needing quick results, while Beam Search or Sampling might be chosen when higher-
quality and more creative text generation is necessary.
Auto-Regressive Models
Introduction
An Auto-Regressive (AR) Model is a type of probabilistic model used to predict future data points by
relying on past values. In deep learning, auto-regressive models are primarily used in sequence
modeling tasks such as language modeling, text generation, speech synthesis, and time-series
forecasting.
Core Idea
Each output depends on the previously generated outputs, hence the term “auto-regressive”.
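In probabilistic terms, this corresponds to the standard chain-rule factorization of the sequence distribution:

p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

Each factor is the model's prediction for one generation step, conditioned on everything generated so far.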
Working Mechanism
• At each step, the output is fed back into the model to generate the next step.
• This is recursive and continues until a stop token or a desired length is reached.
Popular Auto-Regressive Models
Decoding Strategies
a. Greedy Search
Selects the highest probability token at each step. Fast but can produce repetitive or suboptimal text.
b. Beam Search
Keeps top-k sequences at each step and selects the best final output. Balances quality and diversity.
c. Sampling-based Methods
• Top-k Sampling: Samples from the top k most probable next words.
• Top-p Sampling (Nucleus Sampling): Samples from the smallest possible set of words whose
cumulative probability exceeds p.
Applications
Disadvantages
• Error accumulation: Mistakes early in the sequence can affect later predictions.
• Exposure bias: During training, the model sees true previous tokens; during inference, it sees
its own predictions.
Example:
• Given a prompt like: “The weather today is”, GPT continues with high-probability words like
“sunny”, “rainy”, etc., one token at a time.
Introduction
• Diffusion Models are a class of generative models that generate new data (such as images or
audio) by reversing a gradual noise process. They have become popular due to their ability to
generate high-quality, realistic samples, especially in computer vision.
• One of the most popular implementations is Stable Diffusion, developed by Stability AI,
which uses Latent Diffusion Models (LDMs) for efficient image generation.
Key Idea
• A diffusion model destroys structure in data by gradually adding Gaussian noise over many
steps, and then learns to reverse this noising process to recover or generate new data.
• A pretrained autoencoder first compresses images into a lower-dimensional latent space; the diffusion (noising and denoising) process is then performed in this latent space rather than in pixel space.
3. Text Encoder (e.g., CLIP or BERT): Adds text conditioning for text-to-image generation.
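For reference, the forward (noising) step used in standard DDPM-style diffusion models is usually written as follows, where β_t is a small noise-schedule parameter (this is the standard formulation, not something specific to these notes):

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)

The model then learns the reverse (denoising) process p_θ(x_{t−1} | x_t) to generate data from noise.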
Advantages
• Open source and flexible (can be used for inpainting, super-resolution, etc.).
Disadvantages
• Complex training.
Applications
• Stable Diffusion enhances them by working in a compressed latent space, making them
faster and more accessible.
• These models are widely used in modern AI for creative and scientific applications,
especially text-to-image synthesis.
Introduction
Vision and Language (V+L) applications combine computer vision (understanding images/videos)
with natural language processing (understanding text/language). These multimodal systems enable
machines to interpret and generate content involving both visual and textual modalities.
Motivation
• Humans naturally interpret both visual and linguistic inputs.
• Combining these modalities allows AI to solve more complex, human-like tasks such as
describing images, answering questions about scenes, or generating images from text.
Core Components
1. Visual Encoder – e.g., CNNs, Vision Transformers (ViTs) extract features from images.
2. Language Encoder – e.g., RNNs/LSTMs or Transformer models such as BERT extract features from the text.
3. Fusion Mechanism – Combines visual and textual features for a unified representation.
4. Decoder / Output Layer – Produces the final output (text, classification, etc.).
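A minimal sketch of a simple fusion mechanism that concatenates image and text features (assuming PyTorch; the feature sizes and number of output classes are illustrative):

import torch
import torch.nn as nn

image_features = torch.rand(1, 512)    # e.g., from a CNN / ViT encoder
text_features = torch.rand(1, 256)     # e.g., from an LSTM / BERT encoder

# Simple fusion: concatenate the two modalities and project them
# into a joint representation used by the output layer.
fusion = nn.Sequential(nn.Linear(512 + 256, 256), nn.ReLU())
classifier = nn.Linear(256, 10)        # e.g., 10 candidate answers

joint = fusion(torch.cat([image_features, text_features], dim=1))
logits = classifier(joint)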
1. Image Captioning
2. Visual Question Answering (VQA)
• Example: Image shows a train; Q: "What is the color of the train?" → "Blue".
3. Text-to-Image Generation
4. Image-Text Retrieval
Popular Models
• BLIP / BLIP-2 – Combines vision and language transformers for VQA and captioning.
• VisualBERT / ViLBERT – BERT-based multimodal models that take image + text as input.
Challenges
• Dataset bias: Most models are trained on Western-centric data (e.g., MSCOCO).
Applications
• Assistive Tech: Helping visually impaired users describe scenes or read signs.
Introduction
Goal
To automatically generate captions that accurately describe the objects, actions, and context in an
image.
Architecture Overview
• Image Encoder (e.g., a CNN): Extracts high-level features from the image (e.g., objects, spatial context).
• Attention Mechanism: Allows the decoder to focus on different parts of the image while generating each word.
Example Flow
Popular Models
• Show, Attend and Tell – Adds attention mechanism to focus on relevant image regions.
• Cross-Entropy Loss: Measures how well the predicted words match the ground truth.
• Reinforcement Learning (CIDEr optimization): Optimizes for metrics like BLEU, CIDEr,
METEOR using policy gradients.
Evaluation Metrics
Dataset Examples
Applications
Challenges
Conclusion
• Image Captioning bridges the gap between vision and language. Deep learning enables
models to understand and describe images, making them useful in real-world multimodal AI
applications. As models evolve, they are expected to produce more diverse, context-aware,
and accurate captions.
Introduction
Visual Question Answering (VQA) is a multimodal AI task where the model is given an image and a
natural language question about the image, and it must generate the correct textual answer. It
combines Computer Vision (CV) and Natural Language Processing (NLP).
Problem Statement
Given:
• An image (I)
• A natural language question (Q)
4. Answer Decoder
Attention Mechanism
• Helps the model focus on relevant parts of the image based on the question.
• Example: For “What is the man holding?”, attention focuses on the object near the man’s
hand.
Evaluation Metrics
• For open-ended VQA, multiple ground truth answers are considered (e.g., “red” and
“maroon” may both be valid).
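For example, the standard VQA-dataset accuracy metric counts an answer a as correct in proportion to how many human annotators gave it:

\text{Acc}(a) = \min\left(\frac{\#\,\text{humans that provided } a}{3},\ 1\right)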
Example
• Surveillance: Answering queries from security camera feeds (e.g., "How many people are
there?").
Challenges
• Reasoning: Complex questions may need reasoning or inference (e.g., “Why is the man
wet?”).
• Bias: Model may learn dataset biases (e.g., always answering “yes” for “Is there a cat?”).
Conclusion
• Visual Question Answering is a challenging and impactful multimodal task that lies at the
intersection of vision and language. Deep learning has enabled powerful VQA systems that
can perceive, comprehend, and reason with both visual and textual inputs. It plays a vital
role in advancing real-world intelligent systems.
Visual Dialog
Introduction
Visual Dialog is an advanced multimodal task where an AI model must answer a series of questions
about an image in a conversational context. It goes beyond single-turn Visual Question Answering
(VQA) by introducing dialog history, requiring the model to understand the current question, the
image, and previous conversation turns.
Problem Statement
Given:
• An image (I)
• A dialog history (H) of previous question-answer pairs
• The current question (Q_t)
• Generate a relevant answer (A_t) based on the image, question, and dialog history.
Example
Dialog History:
Architecture Overview
1. Image Encoder
2. Question Encoder
3. History Encoder
4. Fusion Module
• Fuses features from the image, current question, and dialog history.
• Often uses attention mechanisms and co-attention modules to align across modalities.
5. Answer Decoder
Evaluation Metrics
These are mostly used in discriminative models where answers are chosen from a candidate list.
Key Challenges
Applications
• Customer Support Bots: That understand product images and user queries.
Conclusion
• Visual Dialog is a complex but powerful task combining vision, language, and memory. It
models real-world human interaction more realistically than single-turn VQA. With
applications in assistive tech and conversational AI, it is a vital area in multimodal deep
learning research.
Introduction
Pixel RNN is a generative model used to generate images pixel by pixel. It models the conditional
distribution of each pixel given all previous pixels in a sequential (autoregressive) manner, typically
row-wise or pixel-wise. This was introduced by Google DeepMind in 2016.
Main Idea
• Here, x_i is a pixel value, and the model predicts each pixel conditioned on all previous
pixels in raster-scan order (left to right, top to bottom).
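For an image with n pixels in raster-scan order, this corresponds to the factorization:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})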
Architecture
1. Input Representation
• At generation time, only the previous pixels are visible to the model.
2. Masked Convolutions
• To ensure no "future" pixels are seen, masking is applied to the convolutional kernels.
3. Recurrent Layers
• Row LSTM and Diagonal BiLSTM layers are used:
o Row LSTM: Processes the image row by row, capturing dependencies from the rows above.
o Diagonal BiLSTM: Processes along the image diagonals, capturing a larger context than the Row LSTM.
4. Output Layer
Use Case: Pixel RNN is better for long-term spatial relations, while Pixel CNN is more efficient for large-scale datasets.
Advantages
Disadvantages
Applications
Conclusion
Pixel RNNs offer a principled way to generate images pixel-by-pixel using RNNs. Though
computationally expensive, they laid the foundation for autoregressive image models and inspired
improvements like PixelCNN and later diffusion models.
Introduction
CycleGAN is a type of Generative Adversarial Network (GAN) used for image-to-image translation
tasks where paired training data is not available. It was introduced by Jun-Yan Zhu et al. in 2017. The
core idea is to learn mappings between two image domains (say, horses ↔ zebras, summer ↔
winter) without paired examples.
Motivation
Traditional GANs or supervised translation models (like Pix2Pix) require paired images (e.g., a photo
and its sketch). However, in many real-world scenarios, paired data is not available. CycleGAN solves
this by using cycle-consistency loss.
Architecture
1. Two Generators:
2. Two Discriminators:
3. Cycle-Consistency Loss:
o Ensures that translating an image to the other domain and back results in the
original image:
L_cyc(G, F) = E_x[∥F(G(x)) − x∥₁] + E_y[∥G(F(y)) − y∥₁]
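A minimal sketch of this loss (assuming PyTorch; G maps domain X→Y, F maps Y→X, and lam is the usual weighting hyperparameter):

import torch

def cycle_consistency_loss(G, F, real_x, real_y, lam=10.0):
    # x -> G(x) -> F(G(x)) should return to x, and similarly for y.
    loss_x = torch.mean(torch.abs(F(G(real_x)) - real_x))   # L1 norm
    loss_y = torch.mean(torch.abs(G(F(real_y)) - real_y))
    return lam * (loss_x + loss_y)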
Applications
Advantages
• Works without paired data.
Disadvantages
Conclusion
CycleGANs are a powerful solution for unpaired image-to-image translation. They leverage dual
generators and discriminators with cycle-consistency loss to ensure meaningful transformations.
Their ability to work without paired data makes them widely applicable in art, medicine, and visual
editing tasks.
Introduction
Progressive GAN, introduced by Karras et al. in 2017 (NVIDIA), is a type of Generative Adversarial
Network that generates high-resolution images by progressively growing both the generator and
discriminator during training. It was designed to address the training instability and low-quality
output of earlier GANs when generating large images (e.g., 1024×1024).
Motivation
• Training instability.
Progressive GANs solve this by starting small (e.g., 4×4 images) and gradually adding new layers to
increase image resolution.
1. Progressive Training:
o New layers are gradually added to double the resolution: 8×8 → 16×16 → 32×32 …
up to 1024×1024.
o Each time a new layer is added, the GAN is trained to adapt to the new resolution.
2. Fade-in Layers:
o To avoid sudden jumps in learning, a fade-in mechanism is used where the output of
the new layers is blended with that of the old layers using a parameter α (from 0 to 1).
3. Loss Function:
o Uses Wasserstein GAN loss with gradient penalty (WGAN-GP) for stability.
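The fade-in blending mentioned in point 2 can be written as follows (a standard formulation; α increases from 0 to 1 during training):

x_{\text{out}} = (1 - \alpha) \cdot \text{upsample}(x_{\text{old}}) + \alpha \cdot x_{\text{new}}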
Diagram Reference
Advantages
Disadvantages
Applications
Conclusion
Progressive GANs significantly advanced the field of image generation by introducing gradual
learning of complexity. Their fade-in training strategy and resolution-by-resolution growth result in
stable, high-quality outputs, making them a key development in the GAN landscape.
Introduction
StackGAN is a Generative Adversarial Network (GAN) architecture introduced by Han Zhang et al. in
2017 that focuses on generating high-resolution images from text descriptions. Unlike traditional
GANs, StackGAN utilizes a two-stage process to generate images: first, a low-resolution image is
created, and then it is refined to high resolution. This architecture is particularly effective in text-to-
image generation, where a natural language description is used to synthesize corresponding images.
Motivation
Traditional GANs faced difficulties when generating high-quality images from text descriptions
because:
1. Generating high-resolution images in a single pass is challenging due to the vast range of
details and fine-grained structures.
StackGAN addresses these challenges by stacking two GANs: one for generating low-resolution
images and another for refining them to high resolution.
Architecture
Stage-I GAN:
• Input: A random noise vector z and a text embedding (using an RNN or LSTM to encode the
text).
• Output: Generates a low-resolution image (e.g., 64x64 pixels) that roughly matches the given
text description.
• The Stage-I GAN learns to capture the global structure and layout of the image, based on the
text input.
Stage-II GAN:
• Input: The low-resolution image generated by Stage-I and the same text embedding.
• The Stage-II GAN enhances the details, textures, and finer structures that were not captured
in Stage-I.
Training Process:
• Both stages are trained simultaneously using adversarial loss (from the respective
discriminators) and conditioning loss (to ensure the generated image aligns with the text
description).
• The Stage-I generator focuses on the overall shape and color of the object.
• The Stage-II generator enhances fine details such as textures, edges, and patterns.
Diagram Reference
Advantages
• Works well even with limited data compared to other GAN models.
Disadvantages
• Quality depends on the quality of the text embedding. If the text is not descriptive enough,
the generated image may lack relevant details.
Applications
1. Text-to-Image Synthesis: Given a text description, generate images of objects or scenes (e.g.,
"a red apple on a table").
3. Video Game Design: Automatically generate game assets like character designs or
environment textures from textual descriptions.
Conclusion
Introduction
Pix2Pix is a Generative Adversarial Network (GAN) architecture introduced by Isola et al. in 2017
that focuses on image-to-image translation. Unlike traditional image generation techniques, Pix2Pix
learns to convert one type of image into another, maintaining structure and coherence between the
input and output images. It is based on a Conditional GAN (cGAN), which conditions the generation
process on input images, making it suitable for tasks where both an input and an output image are
involved.
Pix2Pix is widely used for tasks like image generation, enhancement, and style transfer, and can
work with paired datasets, where each input image has a corresponding target image.
Motivation
The idea behind Pix2Pix was to improve image translation tasks, such as converting black and white
sketches into color images or day images into night scenes. Traditional GANs struggled with
maintaining meaningful relationships between input and output images. By conditioning the GAN on
the input image, Pix2Pix can generate realistic images that are both visually appealing and aligned
with the given input.
Architecture
Pix2Pix utilizes a Conditional GAN framework, where both the generator and discriminator are
conditioned on the input image. The architecture consists of two main components:
1. Generator:
• The generator is typically a U-Net-style encoder-decoder network.
• It takes the input image as well as noise (optional) and generates the corresponding output
image.
image.
• The architecture is designed to keep the spatial resolution intact while refining the output
through the skip connections that retain important features from earlier layers.
2. Discriminator:
• The discriminator is a PatchGAN network, which works by classifying small image patches
rather than the entire image.
• It determines whether each patch of the generated image (together with the corresponding
input) is real or fake, encouraging the generator to create more realistic outputs.
• It compares the real images (target images) with the generated images to compute the
adversarial loss.
Training Process:
• The generator aims to produce realistic images that match the target image.
• The generator is trained to fool the discriminator while ensuring that the generated output
aligns with the input image.
• Loss Function: A combination of adversarial loss and L1 loss (pixel-wise loss) is used to
ensure both realism and output accuracy.
Loss Function
1. Adversarial Loss: Ensures the generator creates realistic images that the discriminator
cannot distinguish from real ones.
2. L1 Loss (Pixel-wise Loss): Ensures that the output image is similar to the target image by
minimizing pixel differences.
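Combining the two, the overall Pix2Pix objective from the original paper is:

G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)

where λ weights the pixel-wise L1 term (the original paper uses λ = 100).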
Diagram Reference
Advantages
• Realistic Image Generation: Pix2Pix produces high-quality images that are realistic and
aligned with input images.
• Versatility: Can be used for various image-to-image translation tasks, such as image
colorization, style transfer, and photo enhancement.
• Conditional GANs provide fine control over the generation process, enabling high accuracy in
image transformation tasks.
• Skip Connections in the generator architecture preserve spatial information, helping with
generating fine-grained details.
Disadvantages
• Paired Data Requirement: Pix2Pix requires a paired dataset, meaning each input image must
have a corresponding target image. This limits its applicability in cases where paired data is
unavailable.
• Training Instability: Like other GANs, Pix2Pix may suffer from issues such as mode collapse
and training instability, especially with insufficient or noisy data.
Applications
1. Image-to-Image Translation: Converting images from one domain to another (e.g., black-
and-white to color, day-to-night).
3. Semantic Segmentation: Pix2Pix can be adapted for segmentation tasks where each pixel is
assigned a class label.
6. Art Style Transfer: Translating the style of one image (e.g., painting style) to another image
(e.g., a photograph).
Conclusion