
Technologies for Artificial Intelligence

Prof. Manuel Roveri – manuel.roveri@polimi.it

Lecture 10 – Generative AI
Course Topics

1. Introduction to technological platforms for AI


2. Embedded and Edge AI
a. The technology
b. The Algorithms
c. Machine Learning for Embedded and Edge AI
d. Deep Learning for Embedded and Edge AI
3. Cloud computing and AI
a. Cloud computing and the "as-a-service" approach
b. Machine and Deep Learning as a service
c. Time-series: analysis and prediction
d. Generative AI

2
Understanding Traditional AI …

Objective: Decision-making (classification, prediction, analysis).

Functionality: Recognizes and categorizes patterns in data. Performs specific, pre-defined tasks (e.g.,
object recognition, trend prediction).

Data Handling: Often utilizes supervised learning with labeled datasets. Models learn to map inputs
to known outputs.

Examples: Neural networks, decision trees, support vector machines, linear regression.

Key Point: Traditional AI focuses on interpreting and understanding data to make informed decisions
or classifications.

3
… to introduce Generative AI

Objective: Creating new, plausible data (text, images, sounds).

Functionality: Understands underlying data distribution to produce new content similar to training
data but not identical.

Data Handling: Uses unsupervised learning or specialized forms of supervised learning.

Examples: GANs for media creation, VAEs for data generation, GPT-3 for text.

Key Point: Generative AI is about creating novel content by learning from and mimicking the
properties of the input data.

4
Traditional and generative AI

5
Key Differences Summarized

Traditional AI Generative AI
Purpose Decision making Data creation

Output Interpret and classify data Produce new instances


of data
Learning Approach Map inputs to outputs, Learn data distribution to
often with labeled data. generate new content.
Applications Analytical and predictive Pivotal in creative and
tasks synthetic tasks.

6
Text Generation with Generative AI

• Applications: Writing assistance, chatbots, content creation.


• Technologies: Transformer models like GPT (Generative Pre-trained
Transformer).
• Examples: Automated writing of articles, poetry, code, and
conversational responses.

7
Image Generation with Generative AI

• Applications: Art creation, game asset production, fashion design.


• Technologies: GANs (Generative Adversarial Networks), VAEs
(Variational Autoencoders).
• Examples: Creating new artworks, designing virtual clothing,
generating realistic human faces.

8
Music and Audio Generation with Generative AI

• Applications: Composition of music, sound effects, and voice synthesis.
• Technologies: LSTM networks, Transformer models adapted for
sequential data.
• Examples: Composing new music pieces, generating sound effects for
movies, synthesizing speech from text.

9
Video and 3D Model Generation

• Video Generation:
• Applications: Movie clips, educational videos, advertising.
• Technologies: Deepfakes, GANs for video.
• Examples: Creating short films, generating educational content, advertisements.
• 3D Model Generation:
• Applications: Video games, virtual reality, architecture.
• Technologies: 3D GANs, neural radiance fields (NeRF).
• Examples: Designing game environments, creating virtual reality experiences,
architectural modeling.

10
Let’s dive into the models for Generative AI

• Autoencoder
• Variational Autoencoder
• Generative Adversarial Networks
• Long Short-Term Memory (LSTM) networks
• Transformers
• Generative Pre-trained Transformers
• Multimodal Generative Transformers

11
The starting point of Generative AI: the autoencoders

• Basic Concept:
• Autoencoders are neural networks used for unsupervised learning of efficient
data encodings (i.e., dimensionality reduction).
• Architecture:
• the encoder (compresses input into a latent-space representation)
• the decoder (reconstructs input from latent space).
• Function and Applications:
• Learn a representation of data to reproduce the input, thus filtering out noise.
• Commonly used for feature extraction and learning generative models of data.

12
The architecture of Autoencoders

13
Encoder: the architecture and the goals

• Purpose: Compress the input into a latent-space representation
• Structure: Multiple layers (e.g.,
fully connected layers,
convolutional layers, or
recurrent layers)
• Each layer typically includes a
linear transformation followed
by a non-linear activation
function
• Dimensionality Reduction: Each
successive layer typically
reduces the dimensionality, up
to the latent representation

14
Decoder: the architecture and the goal

• Purpose: Reconstruct the input data from the latent representation as accurately as possible.
• Structure: reverse the encoder
(e.g., from convolutional to
deconvolutional layers)
• Output Layer: The final layer's
activation function is often chosen
based on the nature of the input
data

15
Transposed convolutional layer

https://www.geeksforgeeks.org/what-is-transposed-convolutional-layer/ 16
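As a quick illustration of the layer referenced above, here is a minimal sketch, assuming PyTorch (not prescribed by the slides): a transposed convolution with stride 2 doubles the spatial size of a feature map, which is how a decoder can undo the encoder's downsampling.

```python
import torch
import torch.nn as nn

# Transposed convolution with stride 2: output height/width = (H - 1) * 2 - 2 * 1 + 4 = 2H
upsample = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                              kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 16, 7, 7)   # one 7x7 feature map with 16 channels
y = upsample(x)
print(y.shape)                 # torch.Size([1, 8, 14, 14]) -> spatial size doubled
```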
Application scenarios of autoencoder

• Dimensionality Reduction: Similar to PCA but more powerful due to non-linear transformations.
• Feature Learning: Useful in pretraining for classification tasks.
• Anomaly Detection: Anything that differs significantly from the
reconstructed input can be considered an outlier.

17
How to train an autoencoder?

• Loss Function (MSE or Cross-entropy):
• The goal is to minimize the
difference between the input
and its reconstruction
• Optimization: Typically trained
using backpropagation and an
optimization algorithm like SGD,
Adam, or RMSprop.

18
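To make the training recipe above concrete, here is a minimal sketch, assuming PyTorch and a flattened 784-dimensional input (e.g., 28×28 images); the layer sizes are illustrative, not taken from the slides.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: successive layers reduce dimensionality down to the latent representation
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder: mirrors the encoder and reconstructs the input
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()          # reconstruction loss

x = torch.rand(64, 784)           # dummy batch standing in for real data scaled to [0, 1]
loss = criterion(model(x), x)     # minimize the difference between input and reconstruction
optimizer.zero_grad()
loss.backward()
optimizer.step()
```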
An example

19
Variational Autoencoders (VAEs) in Generative AI

• Evolution from Autoencoders:


• VAEs are advanced autoencoders offering a probabilistic approach to encoding data.
• Differences in architecture:
• The encoder in VAEs outputs parameters defining a probability distribution (mean and
variance).
• Uses the reparameterization trick for sampling from latent space.
• The decoder reconstructs data from this probabilistic latent space.
• Role in Generative AI:
• VAEs are instrumental in generating new data (like images) that resemble training
data.
• Useful in scenarios requiring not just data compression but also data generation.

22
The architecture of Variational Autoencoder

23
The architecture of Variational Autoencoder: the encoder
[Figure: the encoder, mapping the input through the hidden layers to its output (the parameters of the latent distribution)]

24
The technical perspective of VAEs: the encoder

• Function: Instead of encoding an input as a single point, VAEs encode


inputs as a distribution over the latent space.
• Components:
• Input Layer: accepts data (e.g., images, text).
• Hidden Layers: multiple layers (can be fully connected, convolutional) process
the data.
• Output Layers: typically two layers that represent the mean (μ) and the logarithm
of the variance (log σ²) of the latent space distribution.
• Mathematical Representation: The encoder outputs parameters to a
probability distribution assumed to be Gaussian.

25
The architecture of Variational Autoencoder
The latent space

26
The technical perspective of VAEs: the latent space

• Sampling: a sample z is drawn from the distribution q(z∣x) = N(z; μ, σ²I)
• Use the reparameterization trick for differentiability: z = μ + σ ⊙ ε, where ε is drawn from a standard normal distribution

27
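A minimal sketch of the reparameterization trick, assuming PyTorch: the randomness is isolated in ε, so gradients can still flow through μ and log σ² during training.

```python
import torch

def reparameterize(mu, log_var):
    sigma = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(sigma)      # eps drawn from a standard normal distribution
    return mu + sigma * eps            # z = mu + sigma * eps, distributed as N(mu, sigma^2 I)

mu = torch.zeros(4, 16)                # illustrative batch of latent means...
log_var = torch.zeros(4, 16)           # ...and log-variances (here: a standard normal)
z = reparameterize(mu, log_var)        # differentiable sample from q(z|x)
```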
The architecture of Variational Autoencoder
The decoder

28
The architecture of Variational Autoencoder: the decoder

• Function: takes the sampled latent variables z and reconstructs the input data.
• Structure: reverses the encoder's architecture
• Output: depending on the
data type, the output layer
may use different activation
functions (sigmoid, softmax).

29
Training Variational Autoencoders

• Maximize the likelihood of data while regularizing the latent space distribution.
• Loss Function:
• Reconstruction Loss: Measures the difference between the input and its reconstruction.
Common choices include MSE for continuous data or binary cross-entropy for binary data.
• KL Divergence: Kullback-Leibler divergence between the learned latent distribution q(z∣x) and
the prior distribution p(z), typically assumed to be a standard normal distribution.
• Total Loss:
Loss = Reconstruction Loss + β × KL Divergence
where β is a hyperparameter that balances the two components.
• Backpropagation:
• The network uses gradient descent to update weights by backpropagating the total loss.
• The reparameterization trick is essential for enabling gradient descent
• Commonly used optimizers include Adam and RMSprop

30
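A minimal sketch of the total loss above, assuming PyTorch and a Gaussian posterior with a standard normal prior (the KL term then has a closed form for diagonal Gaussians).

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var, beta=1.0):
    # Reconstruction loss: MSE here; binary cross-entropy is a common choice for binary data
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between q(z|x) = N(mu, sigma^2 I) and p(z) = N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl    # Loss = Reconstruction Loss + beta * KL Divergence
```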
Sampling the space: from AE to VAE
Autoencoder (AE) Variational Autoencoder (VAE)

31
Sampling the space: from AE to VAE
Variational Autoencoder (VAE)

32
Introduction to Generative Adversarial Networks (GANs)

• Overview: a GAN consists of two parts: the generator (creates data) and the discriminator (evaluates data).

33
How Do GANs Work?

• Generator:
• Creates images (or other data types) that attempt to be indistinguishable from real
data.
• Trained to deceive the discriminator by continually improving the quality of its
generated data.
• Discriminator:
• A neural network that assesses the authenticity of data, distinguishing between real
data and that generated by the generator.
• Trained to become increasingly skilled at identifying fake data.
• Competitive Training:
• Training a GAN is a “game” between the generator and discriminator.
• While the generator tries to produce more realistic data, the discriminator learns to
get better at distinguishing real data from the generated data.
• This competitive process leads to higher quality and more realistic synthetic data.

34
[Figure: random noise [0100…10] → Generator → generated image; image (real or fake) → Discriminator → score (how likely the image is real)]
35
Generator: Generate data that is indistinguishable from real data

[Figure: random noise [0100…10] → Generator → generated image]

• Input: it receives a random noise vector (the latent space) as input.


• Architecture: it consists of a series of layers that progressively upsample the
input to generate an output of the same dimensions as the real data (e.g., an
image).
• Key Feature: it often uses transposed convolutional layers to increase spatial
dimensions.

36
Discriminator: the role and the architecture

• Purpose: To distinguish between real data from the training set and fake data
created by the generator.
• Input: Takes either real data or generated data.
• Architecture: Typically a convolutional neural network that downsamples its
input to a binary output indicating real or fake.

[Figure: image (real or fake) → Discriminator → score (how likely the image is real)]

37
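A minimal sketch of the two networks, assuming PyTorch and flattened 28×28 images; the fully connected architectures are illustrative (image GANs typically use convolutional and transposed convolutional layers instead).

```python
import torch
import torch.nn as nn

generator = nn.Sequential(             # random noise vector -> generated image
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh())

discriminator = nn.Sequential(         # image (real or fake) -> score: how likely it is real
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, 100)               # batch of random noise vectors (the latent space)
fake_images = generator(z)             # shape: (16, 784)
scores = discriminator(fake_images)    # shape: (16, 1)
```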
How to train a GAN?

• Training Dynamics: generator and discriminator are trained alternately, in phases that balance each other's capabilities.
• Two loss functions (sketched in code after this slide):
• Discriminator Loss: binary cross-entropy loss, where the discriminator is penalized for incorrectly classifying real or generated data.
• Generator Loss: encourages the generator to produce data that the discriminator classifies as real.

[Figure: Generator → generated image → Discriminator → score (how likely the image is real)]

38
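A minimal sketch of the alternating training above, assuming PyTorch and the toy generator/discriminator from the previous sketch; `real_images` is a dummy stand-in for a batch from the training set.

```python
import torch
import torch.nn as nn

# Same toy networks as in the previous sketch
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

real_images = torch.rand(16, 28 * 28)      # dummy "real" batch
real_labels = torch.ones(16, 1)
fake_labels = torch.zeros(16, 1)

# 1) Discriminator phase: penalized for misclassifying real or generated data
fake_images = generator(torch.randn(16, 100)).detach()   # detach: do not update the generator here
d_loss = bce(discriminator(real_images), real_labels) + \
         bce(discriminator(fake_images), fake_labels)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator phase: rewarded when the discriminator classifies its output as real
g_loss = bce(discriminator(generator(torch.randn(16, 100))), real_labels)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```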
How to (really) train a GAN?

• Optimization:
• Challenges: Training GANs is notoriously difficult due to issues like mode
collapse (where the generator produces limited varieties of outputs) and non-
convergence.
• Evaluation:
• Unlike other models, evaluating GANs can be subjective and is often based on
visual inspection of generated images.

39
LSTMs in Text Generation - Overview

• Understanding LSTMs:
• Specialized RNNs designed to handle long-term dependencies.
• Key features include cell state and various gates (input, output, forget) for
information regulation.
• Training for Text Generation:
• Begin with a large corpus of tokenized text.
• Train the model on text sequences to predict the next character or word.
• The LSTM learns the context and structure of the language in the dataset.

40
Working with text data

• Text data comprises discrete chunks (words, characters, etc.), while images are points in a continuous color space
• Text has a “time” dimension:
• the concept of sequence of words/characters is relevant (e.g., no flipping of
words is feasible)
• “long-term” dependencies need to be captured
• Text has no 2D “space” dimension like images
• Highly sensitive to small changes (i.e., changing a word could
significantly affect the meaning)
• A grammatical structure exists

41
Tokenization: splitting text into individual units

• Word tokens:
• All text converted to lowercase
• A text vocabulary of words could be very large
• Words are stemmed (i.e., reduced to their simplest form) so that different
tenses of a verb are tokenized together
• Tokenize the punctuation
• The model will never predict words outside of the vocabulary
• Character tokens:
• Capital and/or lowercase letters
• The model may generate new words by combining characters
• Vocabulary is much smaller

42
An example: generating recipes with LSTM (1/4)

Original text Tokenised text

43
An example: generating recipes with LSTM (2/4)

[Figure: the tokenised text, the word mapping, and the text converted to integers]

44
The embedding layer: a lookup table converting a token into a vector

Token | Embedding (one row per token, one column per embedding dimension)
0     | -0.13  0.45  …  1.22  -0.43
1     |  0.43 -0.15  … -1.02  -0.14
…     |
9999  | -0.63  0.75  …  0.35   0.23
(number of rows = vocabulary size; number of columns = embedding size)

• The embedding layer "embeds" each integer token into a continuous vector that enables the model to learn a representation for each word
• Trained through backpropagation
• Higher flexibility than one-hot encoding

45
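A minimal sketch of the lookup table above, assuming PyTorch, a vocabulary of 10,000 tokens and an embedding size of 100.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=100)

tokens = torch.tensor([[0, 1, 9999]])   # one sequence of three integer tokens
vectors = embedding(tokens)             # shape: (1, 3, 100): one 100-dim vector per token
# The table entries are ordinary weights, learned through backpropagation together
# with the rest of the model (unlike a fixed one-hot encoding).
```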
An example of embedding

https://coderzcolumn.com/tutorials/artificial-intelligence/keras-word-embeddings-for-text-classification 46
How to model the text?

We need to introduce recurrence in our neural networks…

47
The vanishing and exploding gradient problems

• Simple RNN cells are not capable of carrying long-term dependencies to the future
• In case of long sequences:
• the backpropagated gradients tend to vanish and consequently weights are
not updated adequately
• the backpropagated gradients can explode over long sequences resulting in
unstable weight matrices

48
Long short-term memory (LSTM)

Unlike simple RNNs, the LSTM cell has two components to its state:
• the hidden state: corresponding to the short-term memory component
• the internal cell state: corresponding to the long-term memory component

LSTM avoids the vanishing and exploding gradient issue

49
Propagating the internal and the hidden state in LSTMs
[Figure: the LSTM update equations, annotated with the input, forget, and output gates, the internal cell state, and the hidden state]

The input and forget gates together determine how much of the past information to retain in the current cell state and how much of the current context to propagate forward to future time steps. A value of 0 in f_t implies that nothing from the previous step is carried over to the current cell state.

50
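The update equations appear on the slide only as a figure; as a reference, the standard LSTM cell updates (a reconstruction using the usual notation, with σ the sigmoid and ⊙ the element-wise product) are:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(internal cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```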
Gated Recurrent Unit (a simplified version of LSTM)

• Only one hidden state
• Only two gates: the update gate and the reset gate
• A candidate hidden state, blended into the hidden state through the update gate

The GRU is popular thanks to its simplicity (having fewer parameters than LSTM) and its training efficiency

51
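As with the LSTM slide, the GRU equations are shown only as a figure; a standard formulation (one common convention) is:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(candidate hidden state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```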
How to use embedding in RNNs?
[Figure: a sequence of 200 tokens is mapped by the embedding layer to 200 vectors of embedding size 100 (x1, x2, …, x200); these are fed one at a time into the recurrent cell, which starts from a randomly initialized hidden state h0 and produces the hidden states h1, h2, …, hN as the outputs of the cell]

52
Text generation in LSTM Networks

• Sequence Generation:
• Sequential Processing: The LSTM processes each token one after the other,
updating its internal state and predicting the next token in the sequence.
• Output Generation: The network outputs a probability distribution over possible
next tokens, from which the next token is sampled.
• Iterative Feedback: The predicted token is fed back into the network as the next
input, continuing the sequence generation process.
• Training:
• Loss Function: Typically uses cross-entropy loss to compare the predicted
probability distribution with the actual next token in the training data.
• Backpropagation Through Time (BPTT): Used to train the LSTM, accounting for
its sequential nature and updating weights across the sequence.

53
From the hidden vector to the next word…

• Hidden State Output: At each timestep during the sequence processing, the LSTM
updates its hidden state based on the current input (e.g., a word or character) and
the previous hidden state
• Logits Generation: The updated hidden state is then used to generate logits.
Logits are raw prediction values for each possible next word in the vocabulary
(the logits vector has the same size as the vocabulary)

• Softmax Layer: The logits are then passed through a softmax layer to convert
them into a probability distribution

54
Word Selection: how to select the next word?

The next word is selected based on this probability distribution. There are
several strategies for selecting the word:
• Greedy Selection: Choose the word with the highest probability. This
method is straightforward but can lead to repetitive or less diverse text.
• Random Sampling: Sample from the probability distribution, which allows
for more diversity in the generated text and can help explore more creative
or varied outputs.
• Top-k Sampling: Limit the sampling pool to the top k most likely next words
according to the distribution. This balances diversity with relevance.
• Temperature-based Sampling: Adjust the sharpness of the probability
distribution using a temperature parameter, as discussed previously. This
modifies the likelihoods of the words, influencing the diversity of the output.

55
The role of the “temperature”

• A hyperparameter used in the softmax function during text generation to control the randomness of predictions by scaling the logits before applying softmax:
• Low Temperature (<1.0) causes the model to produce more predictable and
conservative text.
• High Temperature (>1.0) increases the likelihood of selecting less probable
words, leading to more diverse and creative outputs
• Temperature = 1.0 represents a neutral point
• Modulating the temperature allows for fine-tuning the balance
between randomness and determinism in the text generation process,
affecting the style and readability of the generated content

56
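A minimal sketch of the selection strategies from the previous two slides, assuming PyTorch; the logits here are random placeholders for the output of a language model.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    logits = logits / temperature                    # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:                            # top-k: restrict to the k most likely tokens
        top_values, top_indices = torch.topk(logits, top_k)
        probs = torch.softmax(top_values, dim=-1)
        return top_indices[torch.multinomial(probs, num_samples=1)]
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # random sampling (greedy would be argmax)

logits = torch.randn(10_000)                         # one logit per word in the vocabulary
next_token = sample_next_token(logits, temperature=0.8, top_k=50)
```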
Generating Text with LSTM

• Start with seed text as an initial prompt.


• Employ strategies like temperature sampling for varied output.
• The generated text is fed back into the model iteratively.

57
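A minimal, self-contained sketch of the loop above, assuming PyTorch: a toy character-level LSTM (untrained here, so its output is gibberish) is primed with a seed string and then samples one character at a time, feeding each prediction back as the next input.

```python
import torch
import torch.nn as nn

vocab = list("abcdefghijklmnopqrstuvwxyz ")
stoi = {ch: i for i, ch in enumerate(vocab)}

embedding = nn.Embedding(len(vocab), 32)
lstm = nn.LSTM(32, 64, batch_first=True)
head = nn.Linear(64, len(vocab))                       # hidden state -> logits over the vocabulary

seed = "preheat the oven to "
tokens = [stoi[ch] for ch in seed]

out, state = lstm(embedding(torch.tensor([tokens])))   # process the seed to build the hidden state
for _ in range(50):                                    # generate 50 new characters
    logits = head(out[:, -1, :]) / 0.8                 # temperature = 0.8
    probs = torch.softmax(logits, dim=-1)
    nxt = int(torch.multinomial(probs, num_samples=1)) # sample the next character
    tokens.append(nxt)
    out, state = lstm(embedding(torch.tensor([[nxt]])), state)  # feed the prediction back

print("".join(vocab[t] for t in tokens))
```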
An example: generating recipes with LSTM (3/4)

58
An example: generating recipes with LSTM (4/4)

59
Transformers

• What Are Transformers?


• Transformers are a type of NN architecture introduced in
the paper "Attention is All You Need" in 2017.
• Unlike RNNs and LSTMs, they don't require sequential data
processing, allowing parallelization and faster training.
• Key Features:
• Self-attention mechanism, which allows the model to
weigh the significance of different parts of the input data.
• Transformers do not have a recurrent structure and
instead process entire sequences of data in parallel
• Advantages in AI:
• Ability to handle long-range dependencies in data.
• Efficiency in processing large datasets and scalability.

60
Technical Deep Dive into Transformers

• Architecture Overview:
• Transformers consist of two main components: the encoder and the decoder.
• The encoder processes the input data, and the decoder generates the output.
• Each consists of multiple layers (typically 6), where each layer has multi-head self-
attention and feed-forward neural networks.
• Self-Attention Mechanism:
• The key feature of transformers is the self-attention mechanism, which allows the
model to consider other words in the input sequence when processing a word.
• Self-attention computes three values for each word: Query, Key, and Value. The
relevance of each word to others in the sequence is determined by a compatibility
function of Query and Key.
• This mechanism allows the model to dynamically focus on different parts of the input
sequence as it processes data.

61
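A minimal sketch of scaled dot-product self-attention, assuming PyTorch; shapes are illustrative, and a real Transformer splits this computation across multiple heads.

```python
import math
import torch
import torch.nn as nn

seq_len, d_model = 10, 64
x = torch.randn(1, seq_len, d_model)                 # one sequence of 10 token embeddings

W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
Q, K, V = W_q(x), W_k(x), W_v(x)                     # Query, Key, Value for every position

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)  # compatibility of each Query with each Key
weights = torch.softmax(scores, dim=-1)                # attention weights, shape (1, 10, 10)
output = weights @ V                                   # each position is a weighted mix of the Values
```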
TheAiEdge.io 62
Technical Deep Dive into Transformers

• Positional Encoding:
• Since Transformers do not process data sequentially, positional encoding is added to
give the model information about the position of each word in the sequence.
• Positional encodings are vectors added to the input embeddings at the bottoms of
the encoder and decoder stacks.
• Multi-Head Attention:
• Transformers employ multi-head attention in their layers, allowing the model to jointly
attend to information from different representation subspaces at different positions.
• Training and Fine-tuning:
• Transformers are often pre-trained on large datasets and then fine-tuned for specific
tasks. This pre-training involves unsupervised learning on a vast corpus of data.
• Fine-tuning tailors the model to a particular task by continuing the training process on
a smaller, task-specific dataset.

63
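A minimal sketch of the sinusoidal positional encoding used in the original Transformer, assuming PyTorch: each position receives a fixed vector of sines and cosines at different frequencies, added to the input embeddings at the bottom of the stack.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    position = torch.arange(seq_len).unsqueeze(1)                      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dimensions
    return pe

embeddings = torch.randn(1, 50, 512)                    # (batch, sequence length, model size)
embeddings = embeddings + positional_encoding(50, 512)  # position information injected additively
```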
Generative Pre-trained Transformer (GPT)

• The core mechanisms: Transformer and Attention Mechanism


• Pre-training and Fine-tuning
• Pre-training: GPT models undergo extensive pre-training on large datasets of text.
During this phase, the model learns language patterns, grammar, and context. This is
done using unsupervised learning — the model predicts the next word in a sentence,
given the preceding words.
• Fine-tuning: After pre-training, GPT can be fine-tuned on specific datasets to perform
particular tasks like question-answering, translation, or content creation. This involves
supervised learning, where the model is trained on labeled data.
• Parameters: The size of GPT models is often described in terms of
parameters. For example, GPT-3 has 175 billion parameters.
• More parameters generally mean a more complex model with a greater
capacity to learn and understand language.

68
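A minimal sketch of the next-word (next-token) pre-training objective described above, assuming PyTorch; the toy `language_model` here is just an embedding plus a linear head standing in for a GPT-like stack of masked self-attention layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50_000
# Toy stand-in for a GPT-like network that maps token ids to logits over the vocabulary
language_model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 128))   # one sequence of 128 token ids (dummy data)
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 given the tokens up to t

logits = language_model(inputs)                   # shape: (1, 127, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # "unsupervised" in the sense that the targets
                                                  # come from the text itself, not from labels
```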
GPT: the overall architecture

69
From Unimodal to Multimodal Generative AI …

https://medium.com/@glegoux/history-of-the-generative-ai-aa1aa7c63f3c 70
Multimodal Generative AI

• Diverse Data Types: Multimodal AI systems process and interpret various data
forms like text, images, audio, and video.
• Data Fusion: The key challenge is integrating these different data types effectively.

71
Examples of Multimodal Generative AI systems

• DALL-E:
• Function: Specializes in generating images from textual descriptions.
• Technology: Utilizes a variant of the GPT-3 model adapted for image generation, combining NLP and computer vision
capabilities.
• CLIP (Contrastive Language–Image Pre-training):
• Function: Designed to understand and classify images based on natural language descriptions.
• Technology: Bridges the gap between NLP and computer vision, trained on a variety of image and text pairs.
• Multimodal Transformer:
• Function: A model designed for tasks that require understanding both text and images.
• Technology: Integrates transformer architecture to process and relate information across different modalities.
• BERT for Image Captioning:
• Function: Used for generating descriptive captions for images.
• Technology: Adapts the BERT model (widely used in NLP) for understanding the content of images and relating it to
text.
• Jukebox:
• Function: A model for generating music, including melody, rhythm, and even lyrics.
• Technology: Utilizes a deep neural network trained on a large dataset of music.

72
