03 Autoencoders
Di He
Outline
• Basics
• Variational Autoencoder
• Denoising Autoencoder
2
What is an autoencoder?
• Trivial (short-cut) solutions exist: a neural network can simply learn the identity mapping x = f(x)
• Bottleneck architecture: make the hidden code much smaller than the input, so the network cannot simply copy it
[Figure: x → encoder → bottleneck code → decoder → x]
Why autoencoder?
• Generative modelling
[Figure: x → encoder → decoder → x]
The simplest autoencoder
• The simplest kind of autoencoder has one hidden layer with linear activations.
• Encoder: h = Ux, with U ∈ R^(k×d) and x ∈ R^(d×1)
• Output: VUx, where V ∈ R^(d×k) is the decoder
[Figure: x → U (encoder) → V (decoder) → output]
• Note
  • This network is linear
  • We usually set k ≪ d (if k = d, we can make VU = I, which is meaningless)
• How to determine U and V: minimize ‖VUX − X‖², where X stacks the training examples
• You don’t need to solve this problem by gradient descent; there is a closed-form solution (a sketch follows below)
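Since the objective ‖VUX − X‖² asks for the best rank-k approximation of the data matrix, the closed-form solution can be read off from a truncated SVD (this is also why the linear autoencoder is closely related to PCA). A minimal NumPy sketch; the function name and toy data are illustrative, not from the lecture:

```python
import numpy as np

def linear_autoencoder_closed_form(X, k):
    """Closed-form linear autoencoder for the columns of X (d x n).

    Returns encoder U (k x d) and decoder V (d x k) minimizing ||V U X - X||_F^2.
    """
    # By Eckart-Young, the optimal subspace is spanned by the top-k left
    # singular vectors of X.
    left, _, _ = np.linalg.svd(X, full_matrices=False)
    W = left[:, :k]        # d x k orthonormal basis
    return W.T, W          # U = W^T (encoder), V = W (decoder)

# Toy check: d = 20 features, n = 500 samples, bottleneck k = 5.
X = np.random.randn(20, 500)
U, V = linear_autoencoder_closed_form(X, 5)
print(np.linalg.norm(V @ U @ X - X) ** 2)
```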
More about autoencoder
[Figure: x → encoder → h → decoder → x]
• f (decoder) and g (encoder) shouldn’t be too complex or powerful
  • To avoid learning a copy (encoder) and paste (decoder) solution
  • To avoid overfitting
• f and g can be shallow neural networks; all the parameters are trained by gradient descent (a minimal sketch follows below)
• Autoencoders are data-specific and learned
  • This is different from generic compression methods like MP3 or JPEG
  • An autoencoder learned on “cat images” may fail on “dog images”
• Autoencoders learn useful properties of data (for example, PCA learns principal components)
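To make the “shallow neural networks trained by gradient descent” point concrete, here is a minimal nonlinear autoencoder sketch in PyTorch; the layer sizes, activation, and optimizer are illustrative assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Shallow encoder g and decoder f with a bottleneck of size k << d."""
    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))
        self.decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):
        h = self.encoder(x)          # bottleneck code
        return self.decoder(h)       # reconstruction

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)              # stand-in batch of flattened images
loss = ((model(x) - x) ** 2).mean()  # reconstruction loss, minimized by gradient descent
opt.zero_grad(); loss.backward(); opt.step()
```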
Vanilla autoencoder is not a generative model
𝑥 encoder ℎ decoder 𝑥′
24
How to modify an autoencoder into a generative model
x → encoder → h → decoder → x′
• x follows a distribution (the data distribution), but it is unknown
N(0, I) → h → decoder → x′
• When the encoder is replaced by random noise, the decoder becomes a generative model!
The remaining challenge
𝑥 encoder ℎ decoder 𝑥′
29
The first step
𝑥 encoder ℎ decoder 𝑥′
30
Stochastic latent representation
x → encoder → h → decoder → x′
• Autoencoder
  • Encoder: h = g(x)
  • Decoder: x′ = f(h)
  • f and g are deterministic functions
• Variational Autoencoder
  • Encoder: h ∼ g(x) (g is a stochastic function, so h is a random variable)
  • Decoder: x′ = f(h) (f is a deterministic function)
Examples
[Figure: a 28×28 input image is flattened and passed through an MLP encoder, which outputs the mean and variance of the code; the code is generated from a Gaussian distribution using the learned mean and variance, and an MLP decoder maps it back to a 28×28 image]
• The encoder and decoder are not necessarily symmetric
• The encoder and decoder are not necessarily MLPs
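The slides include partial code examples for this architecture; as a stand-in, here is a hedged sketch of an encoder that outputs the mean and variance of the code, together with a matching decoder. The layer widths, code size, and log-variance parameterization are assumptions, not from the lecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a flattened 28x28 image to the mean and log-variance of its code."""
    def __init__(self, d=784, hidden=400, k=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, k)
        self.logvar = nn.Linear(hidden, k)   # predict log(sigma^2) for numerical stability

    def forward(self, x):
        t = self.body(x)
        return self.mu(t), self.logvar(t)

class Decoder(nn.Module):
    """Maps a sampled code back to a flattened 28x28 image."""
    def __init__(self, k=20, hidden=400, d=784):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                  nn.Linear(hidden, d), nn.Sigmoid())

    def forward(self, h):
        return self.body(h)
```

The two networks need not mirror each other, matching the “not necessarily symmetric” remark above.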
The second step
[Figure: the 28×28 encoder–decoder with the sampled code]
How to make h follow a known distribution after training?
Training VAE
L = ‖x − x′‖² = ‖x − f(h, θ_dec)‖², where h ∼ Normal(g(x, θ_enc))
• For the decoder: ∂L/∂θ_dec = (∂L/∂f)(∂f/∂θ_dec), so gradient descent works as usual
• How to compute ∂L/∂θ_enc?
  • L is differentiable with respect to h
  • g is differentiable with respect to θ_enc
  • But h is NOT differentiable with respect to g: h is produced by sampling
    ◆ A sample from N(0, 1) → 0.32
    ◆ A sample from N(1e-6, 1) → −0.17
    ◆ g moves from 0 to 1e-6, yet the sample h jumps from 0.32 to −0.17: not even continuous
Key tech: Reparameterization trick
• Naïve approach: the encoder outputs μ(θ_enc) and σ(θ_enc), and h is sampled directly from N(μ(θ_enc), σ(θ_enc)); the sampling step blocks the gradient.
• Reparameterization: first sample ε ∼ N(0, I), then rescale: h = μ(θ_enc) + ε·σ(θ_enc) ∼ N(μ(θ_enc), σ(θ_enc)).
• All the randomness now lives in ε, so h is a differentiable function of θ_enc.
[Figure: the 28×28 encoder producing μ(θ_enc) and σ(θ_enc); ε is sampled from N(0, I) and rescaled into h]
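A minimal PyTorch sketch of the trick; the log-variance parameterization is an assumption (a common implementation choice), the rest follows the slide:

```python
import torch

def reparameterize(mu, logvar):
    """h = mu + eps * sigma with eps ~ N(0, I); gradients flow through mu and sigma."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)     # all randomness lives in eps
    return mu + eps * sigma

mu = torch.zeros(4, 20, requires_grad=True)
logvar = torch.zeros(4, 20, requires_grad=True)
h = reparameterize(mu, logvar)
h.sum().backward()                    # gradients w.r.t. mu and logvar now exist
print(mu.grad.shape, logvar.grad.shape)
```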
VAE: implementation
• Variational Autoencoder
  • Input: x
  • [μ, σ] = g(x, θ_enc)
  • ε ∼ Normal(0, I)
  • h = μ + εσ
  • Output: x′ = f(h, θ_dec)
L = ‖x − x′‖²
  = ‖x − f(h, θ_dec)‖²
  = ‖x − f(g_μ(x, θ_enc) + ε·g_σ(x, θ_enc), θ_dec)‖²
The third step
[Figure: the 28×28 encoder–decoder with the sampled code]
How to make h follow a known distribution after training?
Why do we need a known distribution?
[Figure: the encoder maps different inputs to different code distributions (Distribution 1, Distribution 2, …)]
[Figure: at generation time we must know which distribution to sample codes from before feeding them to the decoder]
VAE: implementation
• Variational Autoencoder
  • Input: x
  • [μ, σ] = g(x, θ_enc)
  • ε ∼ Normal(0, I)
  • h = μ + εσ
  • Output: x′ = f(h, θ_dec)
• Reconstruction loss L = ‖x − x′‖²
• Regularization loss L_r = KL(N(μ(x), σ(x)) ‖ N(0, I))
• Overall loss = L + λL_r
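Putting the pieces together, a hedged single training step; the network shapes, optimizer, λ value, and the log-variance head are illustrative assumptions, while the closed-form KL between N(μ, σ²) and N(0, I) is the standard expression:

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 20)                 # outputs [mu, logvar] stacked
dec = nn.Sequential(nn.Linear(20, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
lam = 1.0                                    # weight on the regularization term

x = torch.rand(64, 784)                      # stand-in batch of flattened 28x28 images
mu, logvar = enc(x).chunk(2, dim=-1)         # [mu, sigma] = g(x, theta_enc)
eps = torch.randn_like(mu)                   # eps ~ Normal(0, I)
h = mu + eps * torch.exp(0.5 * logvar)       # h = mu + eps * sigma (reparameterization)
x_rec = dec(h)                               # x' = f(h, theta_dec)

recon = ((x_rec - x) ** 2).sum(dim=-1).mean()
# Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian.
kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
loss = recon + lam * kl

opt.zero_grad()
loss.backward()
opt.step()
```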
Summary
• Variational Autoencoder
• Bottleneck architecture
• Stochastic code
• Training
• Reparameterization trick
• Two training terms
• Inference
• Sample Gaussian noise
• Feed the noise into the decoder to generate images
86
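At inference time the encoder is discarded entirely; a small sketch (the decoder here is freshly constructed for illustration, in practice you would reuse the trained decoder from the step above):

```python
import torch
import torch.nn as nn

dec = nn.Sequential(nn.Linear(20, 784), nn.Sigmoid())   # stand-in for a trained decoder

with torch.no_grad():
    z = torch.randn(16, 20)              # sample Gaussian noise in the code space
    images = dec(z).view(16, 28, 28)     # feed the noise into the decoder to generate images
```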
Probabilistic view of VAE
• Most works assume the data x is sampled from a fixed but unknown distribution
• Some works care about “how the data is generated”: a latent variable z is drawn from a prior p(z), and x is then generated from p(x|z)
• The exact posterior p(z|x) is intractable, so assume we have another function/distribution q_{θ₂}(z|x) that can find the region of the latent space responsible for a given x
Probabilistic view of VAE

log P(x) = ∫ q(z|x) log P(x) dz   (since ∫ q(z|x) dz = 1)
         = ∫ q(z|x) log [p(x, z) / p(z|x)] dz
         = ∫ q(z|x) log [(p(x, z) / q(z|x)) · (q(z|x) / p(z|x))] dz
         = ∫ q(z|x) log [p(x, z) / q(z|x)] dz + ∫ q(z|x) log [q(z|x) / p(z|x)] dz

The second integral is KL(q(z|x) ‖ p(z|x)) ≥ 0, so

log P(x) ≥ ∫ q(z|x) log [p(x, z) / q(z|x)] dz
         = ∫ q(z|x) log [p(z) p(x|z) / q(z|x)] dz
         = ∫ q(z|x) log p(x|z) dz + ∫ q(z|x) log [p(z) / q(z|x)] dz
         = E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z))

• E_{z∼q(z|x)}[log p(x|z)]: function q is the encoder and function p is the decoder; this term measures the reconstruction performance
• −KL(q(z|x) ‖ p(z)): keeps the code distribution close to the prior; this corresponds to the regularization loss
Summary
• Variational Autoencoder
• Bottleneck architecture
• Stochastic code
• Training
• Reparameterization trick
• Two training terms
• Inference
• Sample Gaussian noise
• Feed the noise into the decoder to generate images
• Neural-network view and probabilistic view of VAE
117
VAE theory and application
118
Problems in VAE
• VAE behavior is sensitive to the dimension of the latent code (check David Wipf’s work)
119
Extension: Denoising autoencoder
• VAE injects noise in the representation
[Figure: 28×28 input → encoder → noisy code → decoder → reconstruction]
• DAE injects noise in the input; the reconstruction target is still the clean input
[Figure: masked/corrupted 28×28 input → encoder → code → decoder → clean reconstruction]
Short-cut solutions exist?
• No: because the input is corrupted, simply copying the input no longer reproduces the clean target
How to inject noise?
• Mask as noise: randomly mask out part of the input and train the autoencoder to reconstruct the original (a minimal sketch follows below)
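A minimal sketch of mask-as-noise training; the mask ratio, pixel-level (rather than patch-level) masking, and network sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

def mask_input(x, mask_ratio=0.5):
    """Randomly zero out a fraction of the input entries (mask as noise)."""
    keep = (torch.rand_like(x) > mask_ratio).float()
    return x * keep

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                                  # clean batch (stand-in for 28x28 images)
x_noisy = mask_input(x)                                  # corrupt only the input
loss = ((decoder(encoder(x_noisy)) - x) ** 2).mean()     # reconstruct the CLEAN input
opt.zero_grad(); loss.backward(); opt.step()
```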
Summary
• Denoising Autoencoder
  • Known as one of the most efficient pre-training (self-supervised) methods
  • Masking is the standard choice, and you can explore more (e.g., additive noise)
Extension: Vector-quantized VAE
Background
Motivation
Goal of VQVAE
VQVAE model
[Figure: a 28×28 input → encoder → z_e(x) ∈ R^d → nearest codebook vector z_d(x) → decoder → reconstruction]
• Codebook: K embedding vectors e_1, e_2, …, e_K ∈ R^d
Optimization terms
[Figure: the same encoder–codebook–decoder diagram]
Gradient computation of encoder
z_d(x) = e_q, where q = argmin_k ‖e_k − z_e(x)‖
[Figure: the encoder output z_e(x) is replaced by its nearest codebook vector z_d(x) before the decoder]
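A sketch of the nearest-neighbour quantization step; the codebook size, code dimension, and batch size are illustrative:

```python
import torch

K, d = 512, 64
codebook = torch.randn(K, d)        # e_1, ..., e_K in R^d
z_e = torch.randn(32, d)            # encoder outputs z_e(x) for a batch

dists = torch.cdist(z_e, codebook)  # (32, K) pairwise distances
q = dists.argmin(dim=1)             # q = argmin_k ||e_k - z_e(x)||
z_d = codebook[q]                   # quantized codes z_d(x) fed to the decoder
```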
Key problem
• The quantization step z_e(x) → z_d(x) (an argmin) is not differentiable, so no gradient reaches the encoder
The straight-through estimator
• During the backward pass, copy the gradient at z_d(x) straight to z_e(x), as if the quantization step were the identity
[Figure: the forward pass goes z_e(x) → z_d(x); the backward pass skips the quantization]
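In an autograd framework the straight-through estimator is typically a one-line detach trick; a sketch (the loss is a stand-in for the real reconstruction loss through the decoder):

```python
import torch

z_e = torch.randn(32, 64, requires_grad=True)               # encoder output
codebook = torch.randn(512, 64)
z_d = codebook[torch.cdist(z_e, codebook).argmin(dim=1)]     # quantization (non-differentiable)

# Forward value equals z_d; backward gradient flows into z_e as if quantization were identity.
z_q = z_e + (z_d - z_e).detach()

loss = z_q.pow(2).mean()        # stand-in for the reconstruction loss
loss.backward()
print(z_e.grad is not None)     # True: the encoder is trained despite the argmin
```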
Update of the codebook
• The straight-through gradient never touches the codebook vectors, so they need their own loss term
• Codebook loss: ‖z_e − StopGrad(z_d)‖² + β‖z_d − StopGrad(z_e)‖²
  • The first term pulls the encoder output toward its codebook vector (commitment); the second moves the codebook vector toward the encoder output
[Figure: z_e(x) and its nearest codebook entry z_d(x) being pulled toward each other]
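A sketch of the loss terms, following the slide’s placement of β (note that the original VQ-VAE paper instead puts β on the commitment term ‖z_e − StopGrad(z_d)‖²); shapes and the β value are illustrative:

```python
import torch
import torch.nn.functional as F

beta = 0.25
z_e = torch.randn(32, 64, requires_grad=True)                # encoder output
codebook = torch.nn.Parameter(torch.randn(512, 64))          # learnable e_1, ..., e_K
z_d = codebook[torch.cdist(z_e, codebook.detach()).argmin(dim=1)]

z_q = z_e + (z_d - z_e).detach()     # straight-through path to the decoder
recon = z_q.pow(2).mean()            # stand-in for ||x - decoder(z_q)||^2

# ||z_e - sg(z_d)||^2 pulls the encoder toward the codebook (commitment);
# beta * ||z_d - sg(z_e)||^2 moves the codebook vectors toward the encoder output.
vq_loss = F.mse_loss(z_e, z_d.detach()) + beta * F.mse_loss(z_d, z_e.detach())

(recon + vq_loss).backward()         # both the encoder and the codebook receive gradients
```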
Improving capacity
• Practically we set K = 8192. Does this mean there are only 8192 possible generation results?
• No: a deep CNN encoder produces a patch-level grid of codes z_e¹(x), …, z_e^L(x), each quantized to its own codebook vector z_d¹(x), …, z_d^L(x), so an image corresponds to a sequence of L codebook indices (K^L combinations)
Summary
• Variational Autoencoder
• Denoising Autoencoder
158
Thanks dihe@pku.edu.cn