
Autoencoders

Di He
Outline
• Basics

• Variational Autoencoder

• Denoising Autoencoder

• Vector Quantized VAE

2
What is autoencoder?

• An autoencoder is a feed-forward neural network whose job is to take an input x


and predict x.

3
What is autoencoder?

• An autoencoder is a feed-forward neural network whose job is to take an input x


and predict x.

• Trivial (short-cut) solutions exist: a neural network can learn the identity mapping 𝑥 = 𝑓(𝑥)

𝑥 𝑥

4
What is autoencoder?

• An autoencoder is a feed-forward neural network whose job is to take an input x


and predict x.

• Trivial (short-cut) solutions exist: a neural network can learn the identity mapping 𝑥 = 𝑓(𝑥)

• Bottleneck architecture

𝑥 𝑥

5
What is autoencoder?

• An autoencoder is a feed-forward neural network whose job is to take an input x


and predict x.

• Trivial (short-cut) solutions exist: a neural network can learn the identity mapping 𝑥 = 𝑓(𝑥)

• Bottleneck architecture

𝑥 𝑥

encoder

6
What is autoencoder?

• An autoencoder is a feed-forward neural network whose job is to take an input x


and predict x.

• Trivial (short-cut) solutions exist: a neural network can learn the identity mapping 𝑥 = 𝑓(𝑥)

• Bottleneck architecture

𝑥 𝑥

encoder decoder

7
Why autoencoder?

• Map high-dimensional data to two dimensions for visualization

𝑥 𝑥

encoder decoder

8
Why autoencoder?

• Map high-dimensional data to two dimensions for visualization

• Data compression (i.e. reducing communication cost)

𝑥 𝑥

encoder decoder

9
Why autoencoder?

• Map high-dimensional data to two dimensions for visualization

• Data compression (i.e. reducing communication cost)

• Unsupervised representation learning (i.e., pre-training)

𝑥 𝑥

encoder decoder

10
Why autoencoder?

• Map high-dimensional data to two dimensions for visualization

• Data compression (i.e. reducing communication cost)

• Unsupervised representation learning (i.e., pre-training)

• Generative modelling
𝑥 𝑥

encoder decoder

11
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
ℎ = 𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑥 ∈ 𝑅^{𝑑×1}

output = 𝑉ℎ,  𝑉 ∈ 𝑅^{𝑑×𝑘}

encoder decoder

12
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
ℎ = 𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑥 ∈ 𝑅^{𝑑×1}

output = 𝑉ℎ,  𝑉 ∈ 𝑅^{𝑑×𝑘}

output = 𝑉𝑈𝑥

encoder decoder

13
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
output = 𝑉𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑉 ∈ 𝑅^{𝑑×𝑘}
𝑥 U V

encoder decoder

14
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
output = 𝑉𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑉 ∈ 𝑅^{𝑑×𝑘}
𝑥 U V
• Note
• This network is linear
encoder decoder
• We usually set 𝑘 ≪ 𝑑 (if 𝑘 = 𝑑, we can make 𝑉𝑈 = 𝐼,
which is meaningless)

• when 𝑘 ≪ 𝑑, we are actually performing dimensionality reduction on the data 𝑥

15
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
output = 𝑉𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑉 ∈ 𝑅^{𝑑×𝑘}
𝑥 U V
• How to determine 𝑉 and 𝑈

minimize ‖𝑉𝑈𝑥 − 𝑥‖_𝑝,  where 𝑝 is usually set to 2

encoder decoder

16
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
output = 𝑉𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑉 ∈ 𝑅^{𝑑×𝑘}
𝑥 U V
• How to determine 𝑉 and 𝑈

minimize ‖𝑉𝑈𝑥 − 𝑥‖_𝑝,  where 𝑝 is usually set to 2

encoder decoder

minimize ‖𝑉𝑈𝑋 − 𝑋‖²,  where 𝑋 stacks all data points

17
The simplest autoencoder

• The simplest kind of autoencoder has one hidden layer with linear activations.
output = 𝑉𝑈𝑥,  𝑈 ∈ 𝑅^{𝑘×𝑑},  𝑉 ∈ 𝑅^{𝑑×𝑘}
𝑥 U V
• How to determine 𝑉 and 𝑈

minimize ‖𝑉𝑈𝑋 − 𝑋‖²
encoder decoder

• There are many optimal solutions: if 𝑈* and 𝑉* are a solution, then 2𝑈* and 𝑉*/2 are also a solution

• This problem is well known as Principal Component Analysis (PCA)

• You don’t need to solve this problem by gradient descent. There’s a closed-form solution

18
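As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the closed-form solution: the top-k right singular vectors of the centered data give an optimal encoder 𝑈 and decoder 𝑉 for the linear autoencoder. The random test data and array shapes are illustrative assumptions.

```python
import numpy as np

def linear_autoencoder(X, k):
    """Closed-form linear autoencoder (PCA). X: (n_samples, d)."""
    Xc = X - X.mean(axis=0)                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k]                               # encoder: (k, d), h = U x
    V = Vt[:k].T                             # decoder: (d, k), output = V h
    return U, V

# usage: reconstruction error of the rank-k linear autoencoder
X = np.random.randn(1000, 20)
U, V = linear_autoencoder(X, k=5)
X_hat = (X - X.mean(axis=0)) @ U.T @ V.T + X.mean(axis=0)
print(np.mean((X - X_hat) ** 2))
```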
More about autoencoder

• An autoencoder generally learns 𝑓(𝑔(𝑥)) ≈ 𝑥
• Function 𝑔 is the encoder
• Function 𝑓 is the decoder
• ℎ = 𝑔(𝑥) is called the code / the representation / the latent variable of 𝑥

encoder decoder

19
More about autoencoder

• An autoencoder generally learns 𝑓(𝑔(𝑥)) ≈ 𝑥
• Function 𝑔 is the encoder
• Function 𝑓 is the decoder
• ℎ = 𝑔(𝑥) is called the code / the representation / the latent variable of 𝑥

encoder decoder
• 𝑓 and 𝑔 shouldn't be too complex or powerful
• To avoid learning a trivial copy (encoder) and paste (decoder) solution
• To avoid overfitting

20
More about autoencoder

• An autoencoder generally learns 𝑓(𝑔(𝑥)) ≈ 𝑥
• Function 𝑔 is the encoder
• Function 𝑓 is the decoder
• ℎ = 𝑔(𝑥) is called the code / the representation / the latent variable of 𝑥

encoder decoder
• f and g can be shallow neural networks
• All the parameters should be trained by gradient descent

21
More about autoencoder

• An autoencoder generally learns 𝑓(𝑔(𝑥)) ≈ 𝑥
• Function 𝑔 is the encoder
• Function 𝑓 is the decoder
• ℎ = 𝑔(𝑥) is called the code / the representation / the latent variable of 𝑥

encoder decoder
• Autoencoders are data-specific and learned
• This is different from compression methods like MP3 or JPEG
• An autoencoder learned on “cat images” may fail on “dog images”

22
More about autoencoder

• An autoencoder generally learns 𝑓(𝑔(𝑥)) ≈ 𝑥
• Function 𝑔 is the encoder
• Function 𝑓 is the decoder
• ℎ = 𝑔(𝑥) is called the code / the representation / the latent variable of 𝑥

encoder decoder
• Autoencoders learn useful properties of data
• PCA learns principal components

23
Vanilla autoencoder is not a generative model

𝑥 encoder ℎ decoder 𝑥′

24
How to modify an autoencoder into a
generative model

𝑥 encoder ℎ decoder 𝑥′
• 𝑥 follows a distribution (the data distribution), but it is unknown

25
How to modify an autoencoder into a
generative model

𝑥 encoder ℎ decoder 𝑥′
• 𝑥 follows a distribution (the data distribution), but it is unknown

• If after training, ℎ follows a known distribution (e.g., standard Gaussian


distribution), it will be perfect!

26
How to modify an autoencoder into a
generative model

𝑥 encoder ℎ decoder 𝑥′

27
How to modify an autoencoder into a
generative model

𝑁(0, 𝐼) → ℎ → decoder → 𝑥′

When the encoder is replaced by random noise (i.e., ℎ is sampled directly from 𝑁(0, 𝐼)), the decoder becomes a generative model!

28
The remaining challenge

𝑥 encoder ℎ decoder 𝑥′

How to make 𝒉 follow a known distribution after training?

29
The first step

𝑥 encoder ℎ decoder 𝑥′

How to make 𝒉 follow a known distribution after training?

30
Stochastic latent representation

• Autoencoder
• Encoder: ℎ = 𝑔(𝑥)
• Decoder: 𝑥′ = 𝑓(ℎ)

𝑥 encoder ℎ decoder 𝑥′
• 𝑓 and 𝑔 are deterministic functions

31
Stochastic latent representation

• Autoencoder
• Encoder: ℎ = 𝑔(𝑥)
• Decoder: 𝑥′ = 𝑓(ℎ)
• 𝑓 and 𝑔 are deterministic functions

• Variational Autoencoder
• Encoder: ℎ ∼ 𝑔(𝑥)
• Decoder: 𝑥′ = 𝑓(ℎ)
• 𝑔 is a stochastic function, 𝑓 is a deterministic function
• ℎ is a random variable

𝑥 encoder ℎ decoder 𝑥′

32
Stochastic latent representation

• Variational Autoencoder
• Encoder: ℎ ∼ 𝑔(𝑥)
• Decoder: 𝑥′ = 𝑓(ℎ)
• 𝑔 is a stochastic function, 𝑓 is a deterministic function, ℎ is a random variable

• Pre-define a parametric distribution family
• e.g., Gaussian with mean and std
• 𝑔 outputs the distribution parameters, i.e., the mean and std

33
Examples

28×28

34
Examples

28×28

Input: 784 dimensions

35
Examples

… …

28×28

First hidden layer: 256 dimensions

36
Examples

… … …

28×28

Second hidden layer:


• 50 dimensions to predict mean
• 50 dimensions to predict variance

37
Examples

code

… … … …

28×28
Generate the code ℎ from a Gaussian distribution using the learned mean and variance

38
Examples

… … … …

28×28

encoder

39
Examples (code, incomplete)

… … … …

28×28

encoder

40
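The code shown on this slide did not survive extraction. Below is a minimal PyTorch sketch of the encoder described above (784 → 256 → 50-dimensional mean and 50-dimensional variance). The layer sizes follow the slides; the class name, activation, and log-variance parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=256, latent_dim=50):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)             # 784 -> 256
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # predicts the mean
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # predicts log-variance

    def forward(self, x):                                   # x: (batch, 784)
        h = torch.relu(self.fc(x))
        return self.fc_mu(h), self.fc_logvar(h)
```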
Examples

… … … … … …

28×28

encoder

42
Examples

… … … … … …

28×28

encoder decoder

43
Examples (code, incomplete)

… … … … … …

28×28

encoder decoder

45
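Again, the slide's code is missing from the extracted text. A minimal PyTorch sketch of a matching decoder (50 → 256 → 784); as the next slides note, the decoder need not mirror the encoder, so this symmetric MLP is just one illustrative choice.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, latent_dim=50, hidden_dim=256, out_dim=784):
        super().__init__()
        self.fc1 = nn.Linear(latent_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, h):                       # h: (batch, 50)
        z = torch.relu(self.fc1(h))
        return torch.sigmoid(self.fc2(z))       # pixel intensities in [0, 1]
```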
Examples

… … … … … …

28×28

encoder decoder
• The encoder and decoder are not necessarily symmetric

47
Examples

… … … … … …

28×28

encoder decoder
• The encoder and decoder are not necessarily MLP

48
Examples

• The encoder and decoder are not necessarily MLP


https://www.mathworks.com/help/deeplearning/ug/train-a-variational-autoencoder-vae-to-generate-images.html

49
The second step

… … … … … …

28×28

encoder decoder
How to make 𝒉 follow a known distribution after training?

50
Training VAE

• Variational Autoencoder: forward process

• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• ℎ ∼ Normal(𝜇, 𝜎²)
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

51
Training VAE

• Variational Autoencoder: forward process

• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• ℎ ∼ Normal(𝜇, 𝜎²)
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Loss design: evaluating the difference between 𝑥 and 𝑥′

52
Training VAE

• Variational Autoencoder: forward process

• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• ℎ ∼ Normal(𝜇, 𝜎²)
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Loss design: evaluating the difference between 𝑥 and 𝑥′


• Mean squared error: 𝐿 = ‖𝑥 − 𝑥′‖²
• Other loss functions can be applied

53
Training VAE

• Variational Autoencoder: forward process

• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• ℎ ∼ Normal(𝜇, 𝜎²)
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Loss design: evaluating the difference between 𝑥 and 𝑥′


• Mean squared error: 𝐿 = ‖𝑥 − 𝑥′‖²
• Other loss functions can be applied

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates

54
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑑𝑒𝑐 ?

55
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑑𝑒𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²

56
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑑𝑒𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²

∂𝐿/∂𝜃_dec = (∂𝐿/∂𝑓)·(∂𝑓/∂𝜃_dec)

57
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑑𝑒𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²

∂𝐿/∂𝜃_dec = (∂𝐿/∂𝑓)·(∂𝑓/∂𝜃_dec)

• 𝐿 is differentiable with respect to 𝑓

58
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑑𝑒𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²

∂𝐿/∂𝜃_dec = (∂𝐿/∂𝑓)·(∂𝑓/∂𝜃_dec)

• 𝐿 is differentiable with respect to 𝑓


• 𝑓 is differentiable (almost everywhere) with respect to 𝜃𝑑𝑒𝑐

59
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

61
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

62
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

• How to compute ∂𝐿/∂𝜃_enc?

63
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

• How to compute ∂𝐿/∂𝜃_enc?

• 𝐿 is differentiable with respect to ℎ

64
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

• How to compute ∂𝐿/∂𝜃_enc?

• 𝐿 is differentiable with respect to ℎ


• 𝑔 is differentiable with respect to 𝜃_enc

65
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

• How to compute ∂𝐿/∂𝜃_enc?

• 𝐿 is differentiable with respect to ℎ


• 𝑔 is differentiable with respect to 𝜃_enc
• But 𝒉 is NOT differentiable with respect to 𝒈: the sampling step breaks the gradient!

66
Training VAE

• Parameter (𝜃𝑒𝑛𝑐 and 𝜃𝑑𝑒𝑐 ) updates


• How to update parameter 𝜃𝑒𝑛𝑐 ?

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²,  where ℎ ∼ Normal(𝑔(𝑥, 𝜃_enc))

• How to compute ∂𝐿/∂𝜃_enc?
• 𝐿 is differentiable with respect to ℎ
• 𝑔 is differentiable with respect to 𝜃_enc
• But 𝒉 is NOT differentiable with respect to 𝒈: the sampling step breaks the gradient!
  ◆ A sample from N(0, 1) -> 0.32
  ◆ A sample from N(1e-6, 1) -> -0.17
  ◆ 𝑔 changes from 0 to 1e-6, while ℎ jumps from 0.32 to -0.17: not continuous

67
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: the encoder produces 𝜇(𝜃_enc) and 𝜎(𝜃_enc); a noise source N(0, I)]

68
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: rescale N(0, I) by 𝜇(𝜃_enc), 𝜎(𝜃_enc) to get N(𝜇(𝜃_enc), 𝜎(𝜃_enc))]

69
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: rescale N(0, I) to N(𝜇(𝜃_enc), 𝜎(𝜃_enc)), then sample ℎ from it]

70
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: the encoder produces 𝜇(𝜃_enc) and 𝜎(𝜃_enc); sample 𝜖 from N(0, I)]

72
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: sample 𝜖 from N(0, I), then rescale: 𝜇(𝜃_enc) + 𝜖·𝜎(𝜃_enc)]

73
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: sample 𝜖 from N(0, I), then rescale: 𝜇(𝜃_enc) + 𝜖·𝜎(𝜃_enc) ∼ N(𝜇(𝜃_enc), 𝜎(𝜃_enc))]

74
Key tech: Reparameterization trick(重参数化)

• Core technique: rearrange the order of rescaling and sampling

[diagram: sample 𝜖 from N(0, I), then rescale: 𝜇(𝜃_enc) + 𝜖·𝜎(𝜃_enc) ∼ N(𝜇(𝜃_enc), 𝜎(𝜃_enc)) — this reordering is the reparameterization]

75
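A minimal PyTorch sketch of the reparameterization trick above. Assuming the encoder outputs the log-variance (a common numerical convenience, not stated on the slides), the sample ℎ stays differentiable with respect to 𝜇 and 𝜎:

```python
import torch

def reparameterize(mu, logvar):
    """h = mu + eps * sigma with eps ~ N(0, I); gradients flow to mu and logvar."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)   # the only stochastic step; no gradient needed through it
    return mu + eps * sigma

mu = torch.zeros(4, 50, requires_grad=True)
logvar = torch.zeros(4, 50, requires_grad=True)
h = reparameterize(mu, logvar)
h.sum().backward()                  # both mu.grad and logvar.grad are populated
```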
VAE: implementation

• Variational Autoencoder
• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• 𝜖 ∼ Normal(0, 𝐼)
• ℎ = 𝜇 + 𝜖𝜎
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

𝐿 = ‖𝑥 − 𝑥′‖²
  = ‖𝑥 − 𝑓(ℎ, 𝜃_dec)‖²
  = ‖𝑥 − 𝑓(𝑔_𝜇(𝑥, 𝜃_enc) + 𝝐·𝑔_𝜎(𝑥, 𝜃_enc), 𝜃_dec)‖²

• All parameters are trained by gradient descent end-to-end

76
The third step

… … … … … …

28×28

encoder decoder
How to make 𝒉 follow a known distribution after training?

80
Why do we need a known distribution?

[diagram: Distribution 1, Distribution 2, … — the codes ℎ produced by the encoder follow unknown distributions]

encoder

81
Why do we need a known distribution?

… …

decoder

82
VAE: implementation

• Variational Autoencoder
• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• 𝜖 ∼ Normal(0, 𝐼)
• ℎ = 𝜇 + 𝜖𝜎
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Reconstruction loss 𝐿 = ‖𝑥 − 𝑥′‖²

83
VAE: implementation

• Variational Autoencoder
• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• 𝜖 ∼ Normal(0, 𝐼)
• ℎ = 𝜇 + 𝜖𝜎
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Reconstruction loss 𝐿 = ‖𝑥 − 𝑥′‖²
• Regularization loss 𝐿_r = KL(𝑁(𝜇(𝑥), 𝜎(𝑥)) ‖ 𝑁(0, 𝐼))

84
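For the diagonal-Gaussian encoder used here, the regularization term has a standard closed form (a known result, not written out on the slide):

```latex
\mathrm{KL}\!\left(N(\mu, \operatorname{diag}(\sigma^2)) \,\|\, N(0, I)\right)
  = \frac{1}{2} \sum_{j=1}^{k} \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right)
```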
VAE: implementation

• Variational Autoencoder
• Input: 𝑥
• [𝜇, 𝜎] = 𝑔(𝑥, 𝜃_enc)
• 𝜖 ∼ Normal(0, 𝐼)
• ℎ = 𝜇 + 𝜖𝜎
• Output: 𝑥′ = 𝑓(ℎ, 𝜃_dec)

• Reconstruction loss 𝐿 = ‖𝑥 − 𝑥′‖²
• Regularization loss 𝐿_r = KL(𝑁(𝜇(𝑥), 𝜎(𝑥)) ‖ 𝑁(0, 𝐼))
• Overall loss = 𝐿 + 𝜆𝐿_r

85
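Putting the pieces together, here is a minimal end-to-end PyTorch sketch of one VAE training step with the two loss terms. Layer sizes follow the slides (784 → 256 → 50); the MSE reconstruction loss, the λ weight, and the log-variance parameterization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=50):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        e = torch.relu(self.enc(x))
        mu, logvar = self.mu(e), self.logvar(e)
        h = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(h), mu, logvar

def vae_loss(x, x_rec, mu, logvar, lam=1.0):
    rec = F.mse_loss(x_rec, x, reduction="sum")                   # reconstruction loss L
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar) # KL(N(mu, sigma^2) || N(0, I))
    return rec + lam * kl

model = VAE()
x = torch.rand(32, 784)                 # a batch of flattened 28x28 images
x_rec, mu, logvar = model(x)
loss = vae_loss(x, x_rec, mu, logvar)
loss.backward()                         # encoder and decoder are trained end-to-end
```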
Summary

• Variational Autoencoder
• Bottleneck architecture
• Stochastic code
• Training
• Reparameterization trick
• Two training terms
• Inference
• Sample Gaussian noise
• Feed the noise into the decoder to generate images

86
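A small sketch of the inference procedure in the summary: sample Gaussian noise and feed it through the decoder. The stand-in decoder below is untrained and only uses the layer sizes from the slides; in practice you would use the trained decoder.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(50, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())   # stand-in for a trained decoder

with torch.no_grad():
    h = torch.randn(16, 50)                    # sample Gaussian noise
    images = decoder(h).view(16, 1, 28, 28)    # decode the noise into image-shaped outputs
```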
Probabilistic view of VAE

87
Probabilistic view of VAE

• Most works assume data (𝑥) is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”

88
Probabilistic view of VAE

• Most works assume data (𝑥) is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”

• Data generation process


• Assume there is a latent code 𝑧 that guides the generation of 𝑥

89
Probabilistic view of VAE

• Most works assume data (𝑥) is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”

• Data generation process


• Assume there is a latent code 𝑧 that guides the generation of 𝑥
• Assume 𝑧 follows a distribution 𝑝(𝑧), called the prior distribution

90
Probabilistic view of VAE

• Most works assume data (𝑥) is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”

• Data generation process


• Assume there is a latent code 𝑧 that guides the generation of 𝑥
• Assume 𝑧 follows a distribution 𝑝(𝑧), called the prior distribution

◼ Sample 𝑧𝑖 from 𝑝(𝑧)


◼ Sample 𝑥𝑖 from 𝑝(𝑥|𝑧𝑖 )
◼ Obtain dataset 𝐷 = {𝑥𝑖 }

91
Probabilistic view of VAE

• Most works assume data (𝑥) is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”

• Data generation process


• Assume there is a latent code 𝑧 that guides the generation of 𝑥
• Assume 𝑧 follows a distribution 𝑝(𝑧), called the prior distribution

◼ Sample 𝑧𝑖 from 𝑝(𝑧)  We have dataset 𝐷 = {𝑥𝑖 }


◼ Sample 𝑥𝑖 from 𝑝(𝑥|𝑧𝑖 )  How to learn 𝑝(𝑥|𝑧)?
◼ Obtain dataset 𝐷 = {𝑥𝑖 }

92
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧

◼ Sample 𝑧_i from 𝑝(𝑧)
◼ Sample 𝑥_i from 𝑝(𝑥|𝑧_i)
◼ Obtain dataset 𝐷 = {𝑥_i}

 We have dataset 𝐷 = {𝑥_i}
 How to learn 𝑝(𝑥|𝑧)?

93
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

◼ Sample 𝑧_i from 𝑝(𝑧)
◼ Sample 𝑥_i from 𝑝(𝑥|𝑧_i)
◼ Obtain dataset 𝐷 = {𝑥_i}

 We have dataset 𝐷 = {𝑥_i}
 How to learn 𝑝(𝑥|𝑧)?

94
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)
Usually computationally intractable (it involves a high-dimensional integral with non-linear functions)

◼ Sample 𝑧_i from 𝑝(𝑧)
◼ Sample 𝑥_i from 𝑝(𝑥|𝑧_i)
◼ Obtain dataset 𝐷 = {𝑥_i}

 We have dataset 𝐷 = {𝑥_i}
 How to learn 𝑝(𝑥|𝑧)?

95
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

• A known fact (assumption)

96
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

• A known fact (assumption)

• Given any 𝑧, only a small region of points in the domain 𝑋 have non-zero 𝑝(𝑥|𝑧)

97
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

• A known fact (assumption)

• Given any 𝑧, only a small region of points in the domain 𝑋 have non-zero 𝑝(𝑥|𝑧)

• The integral only has non-zero values in a small region of points

98
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

• A known fact (assumption)

• Given any 𝑧, only a small region of points in the domain 𝑋 have non-zero 𝑝(𝑥|𝑧)

• The integral only has non-zero values in a small region of points

• Assume we have another function/distribution 𝑞_θ₂(𝑧|𝑥) that can find the region

99
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [(𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)) · (𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥))] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥)] 𝑑𝑧

100
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [(𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)) · (𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥))] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥)] 𝑑𝑧

The second term is ≥ 0

106
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [(𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)) · (𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥))] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑞(𝑧|𝑥)/𝑝(𝑧|𝑥)] 𝑑𝑧

The first term is the evidence lower bound (ELBO); the second term is ≥ 0

107
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)𝑝(𝑥|𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧
         = ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

108
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

111
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

The first term is 𝐸_{𝑧∼𝑞(𝑧|𝑥)}[log 𝑝(𝑥|𝑧)]

112
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

The first term is 𝐸_{𝑧∼𝑞(𝑧|𝑥)}[log 𝑝(𝑥|𝑧)]
Function 𝑞 is the encoder and function 𝑝 is the decoder; this term is the reconstruction performance

113
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

The first term is 𝐸_{𝑧∼𝑞(𝑧|𝑥)}[log 𝑝(𝑥|𝑧)]; the second term is −KL(𝑞(𝑧|𝑥) ‖ 𝑝(𝑧))

115
Probabilistic view of VAE

• Maximum likelihood method


𝑃(𝑥) = ∫ 𝑝_θ₁(𝑥|𝑧) 𝑝(𝑧) 𝑑𝑧        maximize ∑ log 𝑃(𝑥_i)

log 𝑃(𝑥) = ∫_z 𝑞(𝑧|𝑥) log 𝑃(𝑥) 𝑑𝑧 = ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑥,𝑧)/𝑝(𝑧|𝑥)] 𝑑𝑧
         ≥ ∫_z 𝑞(𝑧|𝑥) log 𝑝(𝑥|𝑧) 𝑑𝑧 + ∫_z 𝑞(𝑧|𝑥) log [𝑝(𝑧)/𝑞(𝑧|𝑥)] 𝑑𝑧

The first term is 𝐸_{𝑧∼𝑞(𝑧|𝑥)}[log 𝑝(𝑥|𝑧)]; the second term is −KL(𝑞(𝑧|𝑥) ‖ 𝑝(𝑧))

The KL term is the regularization term

116
Summary

• Variational Autoencoder
• Bottleneck architecture
• Stochastic code
• Training
• Reparameterization trick
• Two training terms
• Inference
• Sample Gaussian noise
• Feed the noise into the decoder to generate images
• Neural-network view and probabilistic view of VAE

117
VAE theory and application

118
Problems in VAE

• VAE usually cannot go deep (check David Wipf’s work)

• The dimension of the latent code is sensitive (check David Wipf’s work)

• VAE cannot do density estimation, i.e., accurately calculate 𝑃(𝑥)

• VAE is usually used as a component of a system rather than as a standalone model

119
Extension: Denoising autoencoder
VAE injects noise in the representation

… … … … … …

28×28

encoder decoder

121
DAE injects noise in the input

… … … … …

28×28

122
Short-cut solutions exist?

… … … … …

28×28

• Vanilla autoencoder (clean input)
  • Identity mapping has zero loss
  • Needs a bottleneck architecture

• Denoising autoencoder (corrupted input)
  • Identity mapping has non-zero loss
  • Usually used with very deep models

123
How to inject noise?

• Mask as noise

Masked Autoencoders Are Scalable Vision Learners (MAE)

124
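A minimal sketch of mask-as-noise corruption for a denoising autoencoder, assuming MNIST-sized inputs; the patch size and mask ratio are illustrative, not values from the slides or the MAE paper.

```python
import torch

def mask_patches(images, patch=4, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches (mask-as-noise)."""
    b, c, h, w = images.shape
    keep = torch.rand(b, h // patch, w // patch) > mask_ratio     # True = keep this patch
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return images * mask.unsqueeze(1).float(), mask

x = torch.rand(8, 1, 28, 28)          # a batch of MNIST-sized images
corrupted, mask = mask_patches(x)     # DAE input is `corrupted`; the target is the original x
```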
How to inject noise?

• Mask as noise

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

125
Summary

• Denoising Autoencoder
• Known as one of the most efficient pre-training (self-supervised) methods

126
Summary

• Denoising Autoencoder
• Known as one of the most efficient pre-training (self-supervised) methods

• Masking is the standard choice and you can explore more (additive noise)

127
Summary

• Denoising Autoencoder
• Known as one of the most efficient pre-training (self-supervised) methods

• Masking is the standard choice and you can explore more (additive noise)

• Trained with very deep models and huge amounts of data

128
Summary

• Denoising Autoencoder
• Known as one of the most efficient pre-training (self-supervised) methods

• Masking is the standard choice and you can explore more (additive noise)

• Trained with very deep models and huge amounts of data

• Can be used with ANY data type

129
Summary

• Denoising Autoencoder
• Known as one of the most efficient pre-training (self-supervised) methods

• Masking is the standard choice and you can explore more (additive noise)

• Trained with very deep models and huge amounts of data

• Can be used with ANY data type

• Usually not considered a generative model, although it can generate the missing part of the data

130
Extension: Vector-quantized VAE
Background

• VQVAE is usually used inside a larger system rather than as a standalone model

• VQVAE is widely used (and important to a certain extent) in the development of


large-scale vision-language models

132
Motivation

• Language is formed by (a sequence of)


word tokens
• A small fixed vocabulary
• Short length
• Discrete
• Processed as vectors
• One-hot vector
• Real vector

133
Motivation

• Language is formed by (a sequence of) word tokens
  • A small fixed vocabulary
  • Short length
  • Discrete
  • Processed as vectors (one-hot or real vectors)

• An image is formed by (a sequence of) pixels
  • No vocabulary
  • Long length
  • Viewed as continuous
  • Processed as RGB vectors

134
Motivation

• Language is formed by (a sequence of) word tokens
  • A small fixed vocabulary
  • Short length
  • Discrete
  • Processed as vectors (one-hot or real vectors)

• An image is formed by (a sequence of) pixels
  • No vocabulary
  • Long length
  • Viewed as continuous
  • Processed as RGB vectors

The data-type difference makes it hard to


jointly train a vision-language model

135
Goal of VQVAE

• Process an image into a sequence of “word tokens”

136
Goal of VQVAE

• Process an image into a sequence of “word tokens”

• How to construct a small fixed vocabulary for an image

• How to define the length of the sequence

• How to transform RGBs into tokens?

137
VQVAE model

… … …

28×28

encoder

138
VQVAE model

… … …   𝑧_e(𝑥) ∈ 𝑅^𝑑

28×28

encoder

139
VQVAE model

… … …   𝑧_e(𝑥) ∈ 𝑅^𝑑

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

140
VQVAE model

… … … 𝑧𝑒 𝑥 𝑧𝑑 𝑥

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

141
VQVAE model

• 𝑧_d(𝑥) is the nearest neighbor of 𝑧_e(𝑥) in the codebook:
  𝑧_d(𝑥) = 𝑒_q,  where 𝑞 = argmin_k ‖𝑒_k − 𝑧_e(𝑥)‖

… … …   𝑧_e(𝑥)   𝑧_d(𝑥)

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

142
VQVAE model

… … … 𝑧𝑒 𝑥 𝑧𝑑 𝑥 … …

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

143
Optimization terms

… … … 𝑧𝑒 𝑥 𝑧𝑑 𝑥 … …

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

144
Gradient computation of encoder

𝑧_d(𝑥) = 𝑒_q,  where 𝑞 = argmin_k ‖𝑒_k − 𝑧_e(𝑥)‖

… … … 𝑧𝑒 𝑥 𝑧𝑑 𝑥 … …

28×28


encoder

codebook: 𝑒_1, 𝑒_2, …, 𝑒_𝐾 ∈ 𝑅^𝑑 (𝐾 entries)

145
Key problem

• If we use ‖𝑥 − 𝑔(𝑧_d)‖ as the loss, the encoder parameters cannot be updated

𝑧𝑒 𝑥 𝑧𝑑 𝑥

146
Key problem

• If we use ‖𝑥 − 𝑔(𝑧_d)‖ as the loss, the encoder parameters cannot be updated

• If we use ‖𝑥 − 𝑔(𝑧_e)‖ as the loss, the objective has changed and the gradient is not correct

𝑧𝑒 𝑥 𝑧𝑑 𝑥

147
Key problem

• If we use ‖𝑥 − 𝑔(𝑧_d)‖ as the loss, the encoder parameters cannot be updated

• If we use ‖𝑥 − 𝑔(𝑧_e)‖ as the loss, the objective has changed and the gradient is not correct

• Solution: set 𝑧_d = 𝑧_e + StopGrad(𝑧_d − 𝑧_e)

StopGrad(): gradients of the argument are never calculated

𝑧_e(𝑥)   𝑧_d(𝑥)

148
The straight-through estimator

• Loss term ‖𝑥 − 𝑔(𝑧_e + StopGrad(𝑧_d − 𝑧_e))‖

𝑧𝑒 𝑥 𝑧𝑑 𝑥

149
The straight-through estimator

• Loss term ‖𝑥 − 𝑔(𝑧_e + StopGrad(𝑧_d − 𝑧_e))‖

• Correct forward process

𝑧𝑒 𝑥 𝑧𝑑 𝑥

150
The straight-through estimator

• Loss term ‖𝑥 − 𝑔(𝑧_e + StopGrad(𝑧_d − 𝑧_e))‖

• Correct forward process

• Correct backward process of decoder parameters

𝑧𝑒 𝑥 𝑧𝑑 𝑥

151
The straight-through estimator

• Loss term ‖𝑥 − 𝑔(𝑧_e + StopGrad(𝑧_d − 𝑧_e))‖

• Correct forward process

• Correct backward process of decoder parameters

• Computable updates for encoder parameters

𝑧𝑒 𝑥 𝑧𝑑 𝑥

152
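A tiny PyTorch check of the straight-through trick above: the forward value equals 𝑧_d, yet the gradient of the loss reaches the encoder output 𝑧_e. The tensor shapes are arbitrary illustrative choices.

```python
import torch

z_e = torch.randn(4, 64, requires_grad=True)     # encoder output
z_d = torch.randn(4, 64)                         # nearest codebook entries (no gradient path)

z_st = z_e + (z_d - z_e).detach()                # forward value equals z_d
z_st.sum().backward()

print(torch.allclose(z_st, z_d))                 # True: forward pass uses the quantized value
print(z_e.grad.abs().sum() > 0)                  # True: the gradient flows back to z_e
```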
Update of the codebook

• We hope the codebook is meaningful

𝑧𝑒 𝑥

codebook
𝑧𝑒 𝑥 𝑧𝑑 𝑥

153
Update of the codebook

• We hope the codebook is meaningful

• Codebook loss

‖𝑧_e − StopGrad(𝑧_d)‖ + 𝛽‖𝑧_d − StopGrad(𝑧_e)‖

𝑧𝑒 𝑥 𝑧𝑑 𝑥

154
Update of the codebook

• We hope the codebook is meaningful

• Codebook loss

‖𝑧_e − StopGrad(𝑧_d)‖ + 𝛽‖𝑧_d − StopGrad(𝑧_e)‖

• VQVAE loss = reconstruction loss + codebook loss

𝑧𝑒 𝑥 𝑧𝑑 𝑥

155
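Combining the nearest-neighbor lookup, the straight-through estimator, and the codebook loss, here is a minimal PyTorch sketch of the quantization module. The loss weighting follows the slide's codebook loss; the codebook size, dimension, and β value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=8192, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # e_1, ..., e_K
        self.beta = beta

    def forward(self, z_e):                            # z_e: (batch, dim) encoder outputs
        dists = torch.cdist(z_e, self.codebook.weight) # distance to every codebook entry
        q = dists.argmin(dim=1)                        # nearest-neighbor indices
        z_d = self.codebook(q)                         # quantized vectors
        # codebook loss: ||z_e - sg(z_d)||^2 + beta * ||z_d - sg(z_e)||^2  (as on the slide)
        codebook_loss = F.mse_loss(z_e, z_d.detach()) + self.beta * F.mse_loss(z_d, z_e.detach())
        z_st = z_e + (z_d - z_e).detach()              # straight-through estimator
        return z_st, codebook_loss, q

vq = VectorQuantizer()
z_e = torch.randn(16, 64)
z_st, codebook_loss, indices = vq(z_e)   # feed z_st to the decoder; add codebook_loss to the total loss
```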
Improving capacity

• In practice we set K = 8192; does that mean there are only 8192 possible generation results?

156
Improving capacity

• In practice we set K = 8192; does that mean there are only 8192 possible generation results?

[diagram: a deep CNN / patch-level network produces a sequence of latents 𝑧_e^1(𝑥), …, 𝑧_e^𝐿(𝑥), each quantized to 𝑧_d^1(𝑥), …, 𝑧_d^𝐿(𝑥)]
No: with a sequence of 𝐿 quantized codes, there are 𝐾^𝐿 possible combinations.

157
Summary
• Variational Autoencoder

• Denoising Autoencoder

• Vector Quantized VAE

158
Thanks dihe@pku.edu.cn
