Lecture 10 – Generative AI
Course Topics
2
Understanding Traditional AI …
Functionality: Recognizes and categorizes patterns in data. Performs specific, pre-defined tasks (e.g.,
object recognition, trend prediction).
Data Handling: Often utilizes supervised learning with labeled datasets. Models learn to map inputs
to known outputs.
Examples: Neural networks, decision trees, support vector machines, linear regression.
Key Point: Traditional AI focuses on interpreting and understanding data to make informed decisions
or classifications.
3
… to introduce Generative AI
Functionality: Learns the underlying data distribution to produce new content that is similar to, but not
identical to, the training data.
Examples: GANs for media creation, VAEs for data generation, GPT-3 for text.
Key Point: Generative AI is about creating novel content by learning from and mimicking the
properties of the input data.
4
Traditional and generative AI
5
Key Differences Summarized
Purpose: Traditional AI is geared toward decision making, Generative AI toward data creation.
6
Text Generation with Generative AI
7
Image Generation with Generative AI
8
Music and Audio Generation with Generative AI
9
Video and 3D Model Generation
• Video Generation:
• Applications: Movie clips, educational videos, advertising.
• Technologies: Deepfakes, GANs for video.
• Examples: Creating short films, generating educational content, advertisements.
• 3D Model Generation:
• Applications: Video games, virtual reality, architecture.
• Technologies: 3D GANs, neural radiance fields (NeRF).
• Examples: Designing game environments, creating virtual reality experiences,
architectural modeling.
10
Let’s dive into the models for Generative AI
• Autoencoder
• Variational Autoencoder
• Generative Adversarial Networks
• Long Short-Term Memory (LSTM) Networks
• Transformers
• Generative Pre-trained Transformers
• Multimodal Generative Transformers
11
The starting point of Generative AI: the autoencoders
• Basic Concept:
• Autoencoders are neural networks used for unsupervised learning of efficient
data encodings (i.e., dimensionality reduction).
• Architecture:
• the encoder (compresses input into a latent-space representation)
• the decoder (reconstructs input from latent space).
• Function and Applications:
• Learn a representation of data to reproduce the input, thus filtering out noise.
• Commonly used for feature extraction and learning generative models of data.
12
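To make the encoder/decoder split above concrete, here is a minimal sketch of a fully connected autoencoder in PyTorch; the layer and latent sizes are illustrative assumptions, not values from the slides.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder (sizes are illustrative, e.g. for 28x28 images)."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into a latent-space representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent representation
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent code (dimensionality reduction)
        return self.decoder(z)   # reconstruction of the input
```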
The architecture of Autoencoders
13
Encoder: the architecture and the goals
14
Decoder: the architecture and the goal
15
Transposed convolutional layer
https://www.geeksforgeeks.org/what-is-transposed-convolutional-layer/ 16
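A minimal sketch of what a transposed convolutional layer does, assuming PyTorch; the channel counts and input size are illustrative.

```python
import torch
import torch.nn as nn

# A transposed convolution with stride 2 roughly doubles the spatial resolution;
# decoders stack such layers to go from the latent space back to a full-size image.
upsample = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                              kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 16, 7, 7)   # (batch, channels, height, width)
y = upsample(x)
print(y.shape)                 # torch.Size([1, 8, 14, 14])
```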
Application scenarios of autoencoder
17
How to train an autoencoder?
18
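In short: train the network to reproduce its own input by minimizing a reconstruction loss. A minimal sketch, assuming the `Autoencoder` class from the earlier sketch and a `data_loader` of images (both are assumptions for illustration):

```python
import torch
import torch.nn as nn

model = Autoencoder()            # hypothetical: the class defined in the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()         # reconstruction loss

for epoch in range(10):
    for x, _ in data_loader:     # hypothetical DataLoader; labels are ignored (unsupervised)
        x = x.view(x.size(0), -1)      # flatten each image to a vector
        x_hat = model(x)               # reconstruction
        loss = criterion(x_hat, x)     # compare the reconstruction with the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```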
An example
19
Variational Autoencoders (VAEs) in Generative AI
22
The architecture of Variational Autoencoder
23
The architecture of Variational Autoencoder
[Figure: the encoder portion of the VAE, mapping the input through a hidden layer to its output]
24
The technical perspective of VAEs: the encoder
25
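A minimal sketch of a VAE encoder, assuming PyTorch: unlike a plain autoencoder, it outputs the mean and log-variance of a Gaussian over the latent space (layer sizes are illustrative).

```python
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)
```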
The architecture of Variational Autoencoder
The latent space
26
The technical perspective of VAEs: the latent space
• Sampling: a sample z is drawn from the distribution q(z∣x) = N(z; μ, σ²I)
• Use the reparameterization trick for differentiability: z = μ + σ ⊙ ϵ, where ϵ is drawn from a
standard normal distribution (see the sketch below)
27
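A minimal sketch of this sampling step with the reparameterization trick, assuming PyTorch and the (μ, log σ²) pair produced by an encoder like the one sketched above:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the step differentiable."""
    std = torch.exp(0.5 * logvar)    # sigma
    eps = torch.randn_like(std)      # noise drawn from a standard normal distribution
    return mu + std * eps
```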
The architecture of Variational Autoencoder
The decoder
28
The architecture of Variational Autoencoder: the decoder
The decoder
29
Training Variational Autoencoders
• Maximize the likelihood of data while regularizing the latent space distribution.
• Loss Function:
• Reconstruction Loss: Measures the difference between the input and its reconstruction.
Common choices include MSE for continuous data or binary cross-entropy for binary data.
• KL Divergence: Kullback-Leibler divergence between the learned latent distribution q(z∣x) and
the prior distribution p(z), typically assumed to be a standard normal distribution.
• Total Loss:
Loss = Reconstruction Loss + β × KL Divergence
where β is a hyperparameter that balances the two components (see the sketch below).
• Backpropagation:
• The network uses gradient descent to update weights by backpropagating the total loss.
• The reparameterization trick is essential for letting gradients flow through the sampling step
• Commonly used optimizers include Adam and RMSprop
30
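A minimal sketch of the total loss described above, assuming binary cross-entropy as the reconstruction term and a standard normal prior (the KL term below is the standard closed form for a diagonal Gaussian):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction loss (binary cross-entropy; use MSE for continuous data)
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2 I) and the prior p(z) = N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```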
Sampling the space: from AE to VAE
Autoencoder (AE) Variational Autoencoder (VAE)
31
Sampling the space: from AE to VAE
Variational Autoencoder (VAE)
32
Introduction to Generative Adversarial Networks (GANs)
33
How Do GANs Work?
• Generator:
• Creates images (or other data types) that attempt to be indistinguishable from real
data.
• Trained to deceive the discriminator by continually improving the quality of its
generated data.
• Discriminator:
• A neural network that assesses the authenticity of data, distinguishing between real
data and that generated by the generator.
• Trained to become increasingly skilled at identifying fake data.
• Competitive Training:
• Training a GAN is a “game” between the generator and discriminator.
• While the generator tries to produce more realistic data, the discriminator learns to
get better at distinguishing real data from the generated data.
• This competitive process leads to higher quality and more realistic synthetic data.
34
[Figure: random noise is fed to the Generator, which outputs a generated image; a real or generated image is fed to the Discriminator, which outputs a score indicating how likely the image is real]
35
Generator: Generate data that is indistinguishable from real data
36
Discriminator: the role and the architecture
• Purpose: To distinguish between real data from the training set and fake data
created by the generator.
• Input: Takes either real data or generated data.
• Architecture: Typically a convolutional neural network that downsamples its
input to a binary output indicating real or fake.
[Figure: a real or generated image is fed to the Discriminator, which outputs a score indicating how likely the image is real]
37
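A minimal sketch of the two networks, assuming PyTorch and simple fully connected layers for readability (real image GANs typically use convolutional and transposed convolutional layers; all sizes are illustrative):

```python
import torch.nn as nn

latent_dim, img_dim = 100, 28 * 28   # illustrative sizes

# Generator: maps random noise to a fake image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),
)

# Discriminator: maps an image to the probability that it is real
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)
```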
How to train a GAN?
38
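A minimal sketch of the adversarial training loop, assuming the `generator` and `discriminator` from the previous sketch and a `data_loader` of real images (all three are assumptions for illustration):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)      # generator: assumed defined above
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)  # discriminator: assumed defined above

for real, _ in data_loader:                  # hypothetical DataLoader of real images
    real = real.view(real.size(0), -1)
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Train the discriminator: real images -> 1, generated images -> 0
    fake = generator(torch.randn(b, 100))    # 100 = latent_dim from the previous sketch
    loss_d = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator: try to make the discriminator output 1 on fake images
    loss_g = bce(discriminator(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```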
How to (really) train a GAN?
• Optimization:
• Challenges: Training GANs is notoriously difficult due to issues like mode
collapse (where the generator produces limited varieties of outputs) and non-
convergence.
• Evaluation:
• Unlike other models, evaluating GANs can be subjective and is often based on
visual inspection of generated images.
39
LSTMs in Text Generation - Overview
• Understanding LSTMs:
• Specialized RNNs designed to handle long-term dependencies.
• Key features include cell state and various gates (input, output, forget) for
information regulation.
• Training for Text Generation:
• Begin with a large corpus of tokenized text.
• Train the model on text sequences to predict the next character or word.
• The LSTM learns the context and structure of the language in the dataset.
40
Working with text data
41
Tokenization: splitting text into individual units
• Word tokens:
• All text converted to lowercase
• A text vocabulary of words could be very large
• Words are stemmed (i.e., reduced to their simplest form) so that different
tenses of a verb are tokenized together
• Tokenize the punctuation
• The model will never predict words outside of the vocabulary
• Character tokens:
• Capital and/or lowercase letters
• The model may generate new words by combining characters
• Vocabulary is much smaller
42
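A minimal sketch contrasting word-level and character-level tokenization on a toy sentence (plain Python, no external tokenizer assumed):

```python
import re

text = "Preheat the oven. Mix the flour and the sugar."

# Word tokens: lowercase the text, split words and punctuation into separate tokens
word_tokens = re.findall(r"[a-z]+|[.,!?;]", text.lower())
# ['preheat', 'the', 'oven', '.', 'mix', 'the', 'flour', 'and', 'the', 'sugar', '.']

# Character tokens: every character is a token, so the vocabulary is much smaller
char_tokens = list(text.lower())

# Map each token to an integer id; the model only ever sees these ids
vocab = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
ids = [vocab[tok] for tok in word_tokens]
```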
An example: generating recipes with LSTM (1/4)
43
An example: generating recipes with LSTM (2/4)
44
The embedding layer: a lookup table converting a token into a vector
[Figure: the embedding lookup table has one row per token (vocabulary size rows), each row being a vector of length equal to the embedding size, e.g., token 0 → (-0.13, 0.45, …, 1.22, -0.43)]
• The embedding layer "embeds" each integer token into a continuous vector
that enables the model to learn a representation for each word
• Trained through backpropagation
• Higher flexibility than one-hot encoding (see the sketch below)
45
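A minimal sketch of the lookup behaviour described above, assuming PyTorch (vocabulary and embedding sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 100
embedding = nn.Embedding(vocab_size, embedding_dim)   # a trainable lookup table

tokens = torch.tensor([[3, 42, 7, 0]])   # a batch with one sequence of 4 token ids
vectors = embedding(tokens)              # shape (1, 4, 100): one vector per token
# The table is updated by backpropagation together with the rest of the model.
```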
An example of embedding
https://coderzcolumn.com/tutorials/artificial-intelligence/keras-word-embeddings-for-text-classification 46
How to model the text?
47
The vanishing and exploding gradient problems
48
Long short-term memory (LSTM)
Unlike standard RNNs, the LSTM cell has two components to its state:
• the hidden state, corresponding to the short-term memory
• the internal cell state, corresponding to the long-term memory
49
Propagating the internal and the hidden state in LSTMs
[Figure: LSTM cell showing the input, forget, and output gates acting on the cell state and the hidden state]
The input and forget gates together determine how much past information to retain in the current cell state
and how much of the current context to propagate forward to future time steps. A value of 0 in the forget
gate f_t means that nothing from the previous cell state is carried over.
50
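For reference, the standard LSTM update equations (σ is the sigmoid function, ⊙ element-wise multiplication) make the role of the gates explicit:

forget gate: f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
input gate: i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
output gate: o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
candidate: g_t = tanh(W_g·x_t + U_g·h_{t-1} + b_g)
internal cell state (long-term memory): c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
hidden state (short-term memory): h_t = o_t ⊙ tanh(c_t)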
Gated Recurrent Unit (a simplified version of the LSTM)
The GRU is popular thanks to its simplicity (it has fewer parameters than the LSTM) and its training efficiency.
51
How to use embedding in RNNs?
[Figure: a sequence of 200 token ids is mapped by the embedding layer to 200 vectors of embedding size 100 (x1 … x200); these are fed one per time step into the recurrent cells, which start from a randomly initialized hidden state h0 and produce hidden states h1, h2, …, hN as the outputs of the cells]
52
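A minimal sketch of the pipeline in the figure above, assuming PyTorch, with a sequence length of 200 and an embedding size of 100 as in the figure (the hidden size is an illustrative choice):

```python
import torch
import torch.nn as nn

vocab_size, seq_len, emb_size, hidden_size = 10_000, 200, 100, 128

embedding = nn.Embedding(vocab_size, emb_size)
lstm = nn.LSTM(input_size=emb_size, hidden_size=hidden_size, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, seq_len))   # one sequence of 200 token ids
x = embedding(tokens)                                  # (1, 200, 100): the vectors x1 ... x200
outputs, (h_n, c_n) = lstm(x)                          # outputs: (1, 200, 128), one hidden state per step
# PyTorch initializes the hidden state to zeros by default; pass an explicit (h0, c0)
# to reproduce the randomly initialized hidden state shown in the figure.
```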
Text generation in LSTM Networks
• Sequence Generation:
• Sequential Processing: The LSTM processes each token one after the other,
updating its internal state and predicting the next token in the sequence.
• Output Generation: The network outputs a probability distribution over possible
next tokens, from which the next token is sampled.
• Iterative Feedback: The predicted token is fed back into the network as the next
input, continuing the sequence generation process.
• Training:
• Loss Function: Typically uses cross-entropy loss to compare the predicted
probability distribution with the actual next token in the training data.
• Backpropagation Through Time (BPTT): Used to train the LSTM, accounting for
its sequential nature and updating weights across the sequence.
53
From the hidden vector to the next word…
• Hidden State Output: At each timestep during the sequence processing, the LSTM
updates its hidden state based on the current input (e.g., a word or character) and
the previous hidden state
• Logits Generation: The updated hidden state is then used to generate logits, the raw prediction
values for each possible next word in the vocabulary (the logits vector therefore has the same size
as the vocabulary)
• Softmax Layer: The logits are then passed through a softmax layer to convert
them into a probability distribution
54
Word Selection: how to select the next word?
The next word is selected based on this probability distribution. There are
several strategies for selecting the word:
• Greedy Selection: Choose the word with the highest probability. This
method is straightforward but can lead to repetitive or less diverse text.
• Random Sampling: Sample from the probability distribution, which allows
for more diversity in the generated text and can help explore more creative
or varied outputs.
• Top-k Sampling: Limit the sampling pool to the top k most likely next words
according to the distribution. This balances diversity with relevance.
• Temperature-based Sampling: Adjust the sharpness of the probability
distribution using a temperature parameter, as discussed previously. This
modifies the likelihoods of the words, influencing the diversity of the output.
55
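A minimal sketch of these selection strategies applied to a vector of logits, assuming PyTorch; it also shows the temperature scaling discussed on the next slide.

```python
import torch

def select_next_token(logits, strategy="greedy", temperature=1.0, k=10):
    """logits: 1-D tensor holding one raw score per word in the vocabulary."""
    if strategy == "greedy":
        return torch.argmax(logits).item()              # always the most probable word
    if strategy == "top_k":
        values, indices = torch.topk(logits, k)         # keep only the k most likely words
        probs = torch.softmax(values / temperature, dim=-1)
        return indices[torch.multinomial(probs, 1)].item()
    # random sampling with temperature scaling
    probs = torch.softmax(logits / temperature, dim=-1) # temperature < 1 sharpens, > 1 flattens
    return torch.multinomial(probs, 1).item()
```

Lower temperatures make the distribution sharper and the text more conservative; higher temperatures flatten it, increasing diversity and the risk of incoherent output.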
The role of the “temperature”
56
Generating Text with LSTM
57
An example: generating recipes with LSTM (3/4)
58
An example: generating recipes with LSTM (4/4)
59
Transformers
60
Technical Deep Dive into Transformers
• Architecture Overview:
• Transformers consist of two main components: the encoder and the decoder.
• The encoder processes the input data, and the decoder generates the output.
• Each consists of multiple layers (typically 6), where each layer has multi-head self-
attention and feed-forward neural networks.
• Self-Attention Mechanism:
• The key feature of transformers is the self-attention mechanism, which allows the
model to consider other words in the input sequence when processing a word.
• Self-attention computes three values for each word: Query, Key, and Value. The
relevance of each word to others in the sequence is determined by a compatibility
function of Query and Key.
• This mechanism allows the model to dynamically focus on different parts of the input
sequence as it processes data.
61
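A minimal sketch of the scaled dot-product attention at the heart of this mechanism, assuming PyTorch (a single head, no masking; sizes are illustrative):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # Query, Key, Value for every token
    scores = q @ k.T / math.sqrt(k.size(-1))      # compatibility of each Query with each Key
    weights = torch.softmax(scores, dim=-1)       # how much each token attends to the others
    return weights @ v                            # weighted sum of the Values

x = torch.randn(5, 64)                            # 5 tokens, model dimension 64
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # (5, 64)
```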
TheAiEdge.io 62
Technical Deep Dive into Transformers
• Positional Encoding:
• Since Transformers do not process data sequentially, positional encoding is added to
give the model information about the position of each word in the sequence.
• Positional encodings are vectors added to the input embeddings at the bottoms of
the encoder and decoder stacks.
• Multi-Head Attention:
• Transformers employ multi-head attention in their layers, allowing the model to jointly
attend to information from different representation subspaces at different positions.
• Training and Fine-tuning:
• Transformers are often pre-trained on large datasets and then fine-tuned for specific
tasks. This pre-training involves unsupervised learning on a vast corpus of data.
• Fine-tuning tailors the model to a particular task by continuing the training process on
a smaller, task-specific dataset.
63
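A minimal sketch of the sinusoidal positional encoding used in the original Transformer, assuming PyTorch; the encoding is simply added to the input embeddings at the bottom of the encoder and decoder stacks.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                      # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

embeddings = torch.randn(50, 512)               # 50 tokens, model dimension 512 (illustrative)
x = embeddings + positional_encoding(50, 512)   # position information is added element-wise
```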
Generative Pre-trained Transformer (GPT)
68
GPT: the overall architecture
69
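As an illustration of how a pre-trained GPT-style model is used in practice, here is a sketch based on the Hugging Face transformers library; the model name ("gpt2") and the sampling parameters are illustrative choices, not part of the slides.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Generative AI is", return_tensors="pt")
# Autoregressive generation: the model repeatedly predicts the next token and feeds it back in
output = model.generate(input_ids, max_length=40, do_sample=True,
                        top_k=50, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```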
From Unimodal to Multimodal Generative AI …
https://medium.com/@glegoux/history-of-the-generative-ai-aa1aa7c63f3c 70
Multimodal Generative AI
• Diverse Data Types: Multimodal AI systems process and interpret various data
forms like text, images, audio, and video.
• Data Fusion: The key challenge is integrating these different data types effectively.
71
Examples of Multimodal Generative AI systems
• DALL-E:
• Function: Specializes in generating images from textual descriptions.
• Technology: Utilizes a variant of the GPT-3 model adapted for image generation, combining NLP and computer vision
capabilities.
• CLIP (Contrastive Language–Image Pre-training):
• Function: Designed to understand and classify images based on natural language descriptions.
• Technology: Bridges the gap between NLP and computer vision, trained on a variety of image and text pairs.
• Multimodal Transformer:
• Function: A model designed for tasks that require understanding both text and images.
• Technology: Integrates transformer architecture to process and relate information across different modalities.
• BERT for Image Captioning:
• Function: Used for generating descriptive captions for images.
• Technology: Adapts the BERT model (widely used in NLP) for understanding the content of images and relating it to
text.
• Jukebox:
• Function: A model for generating music, including melody, rhythm, and even lyrics.
• Technology: Utilizes a deep neural network trained on a large dataset of music.
72
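As a concrete example of the text-image pairing described for CLIP above, a sketch using the Hugging Face transformers implementation; the checkpoint name, the image path, and the candidate captions are illustrative assumptions.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")   # hypothetical local image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)   # similarity of the image to each caption
print(dict(zip(captions, probs[0].tolist())))
```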