
Notes On GANs, Energy-Based Models, and Saddle Points

John Schulman joschu@openai.com

1 Introduction
These notes explore a family of methods related to generative adversarial networks (GANs)
[Goo+14]; these methods try to estimate a probability distribution from samples, using a min-
imax objective that involves a generator/sampler and discriminator/cost. First we’ll derive a
minimax expression for negative log-likelihood of energy-based models. Reviewing an idea from
[HE16], we’ll show that after adding a particular regularizer to the cost function, we can recover
the original GAN objective, drawing a close connection between energy-based models and GANs.
The objectives we consider are convex-concave functions of the cost and sampling density (but not
necessarily with respect to their parameters). Convex-concave problems are tractable (like con-
vex optimization problems) and we discuss some results from the theory that may provide useful
intuitions.

Q&A
• The connections between energy-based models and GANs have already been pointed out in
[KB16] and [Chr16]. How is this writeup different? Those papers show that the gradient
expressions for GANs are almost the same as the gradient expressions for energy-based
models. That’s a useful observation, but it provides an incomplete picture of what the
algorithms converge to after many updates. This writeup focuses on how the objective
functions are similar and different.

• We already know from [Goo+14] that GANs converge “in function space”—i.e., when we
assume that the discriminator is optimized over the space of all functions each iteration.
Given that your analysis relies on the convex-concave property in function space, what does
it add? The limitation of the convergence analysis in [Goo+14] is that the actual opti-
mization procedure for GANs doesn’t fully optimize the discriminator each iteration; it
simultaneously performs small updates to both the generator and discriminator. This pro-
cedure does not converge in general for minimax problems minx maxy f (x, y). However, it
turns out that for convex-concave problems (i.e., where f is convex in x and concave in y),
gradient-based updates do converge to an optimal point (called an equilibrium or saddle
point) where minx maxy f (x, y) = maxy minx f (x, y).

2 Energy-Based Models
Energy-based models define a probability distribution in terms of a cost (energy) function c:

p(x) = e−c(x) /Zc , (1)


where Zc = ∫ dx e−c(x)  (2)

Given a set of datapoints x1 , x2 , . . . , xN , the total negative log-likelihood is
total nll = −Σ_{n=1}^{N} log p(xn) = Σ_{n=1}^{N} (c(xn) + log(Zc))  (3)

It is more meaningful to consider the loss per datapoint, which is


nll per datapoint = (1/N)(total nll) = Edata[c(x)] + log(Zc).  (4)
where Edata [. . . ] indicates that we sample x uniformly from x1 , x2 , . . . , xN .
We can write down an importance sampling estimator for Zc , as an expectation under a distri-
bution q(x).

Zc = ∫ dx e−c(x) = ∫ dx q(x) e−c(x)/q(x) = Eq[e−c(x)/q(x)]  (5)

Thus,

nll per datapoint = Edata[c(x)] + log Eq[e−c(x)/q(x)]  (6)

Jensen’s inequality implies that log(E[y]) ≥ E[log(y)], with equality when y is constant. Thus,

nll per datapoint ≥ Edata[c(x)] + Eq[log(e−c(x)/q(x))]  (7)
                  = Edata[c(x)] − Eq[c(x)] − Eq[log q(x)]  (8)
                  = Edata[c(x)] − Eq[c(x)] + H(q)  (9)

Hence, the right-hand side expression is a lower bound on the negative log-likelihood. Equality
holds when q(x) = e−c(x)/Zc, hence we can write

nll per datapoint = max_q [ Edata[c(x)] − Eq[c(x)] + H(q) ]  (10)

We are minimizing negative log-likelihood, so our full optimization problem looks like
   
min_c [ nll per datapoint ] = min_c max_q [ Edata[c(x)] − Eq[c(x)] + H(q) ]  (11)

In summary, we wrote the likelihood maximization problem (wrt c) as a minimax problem,
involving a sampling distribution q. This could be turned into an algorithm which jointly learns
a sampling distribution q and the cost (energy) function c. In practice, when c and q are represented
by nonlinear function approximators, we will need to jointly optimize them by SGD, so q will
not have a chance to fully catch up with c. With this formulation, c can grow without bound, so
the minimax objective above may behave unstably.
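To make the minimax expression concrete, here is a small numerical sketch (my addition, not from the original notes; the space size and random seed are arbitrary) on a finite sample space, where Zc is an exact sum. It checks that maximizing the bracketed expression in Equation (10) over q recovers the exact negative log-likelihood of Equation (4), while other choices of q give strictly smaller values.

    import numpy as np

    # Minimal sketch: an energy-based model on a finite space {0, ..., K-1},
    # so the partition function is an exact sum rather than an integral.
    rng = np.random.default_rng(0)
    K = 6
    c = rng.normal(size=K)                      # cost (energy) values c(x)
    p_data = rng.dirichlet(np.ones(K))          # empirical data distribution

    Zc = np.exp(-c).sum()
    nll_exact = np.dot(p_data, c) + np.log(Zc)  # Equation (4)

    def lower_bound(q):
        """Edata[c] - Eq[c] + H(q), the bracketed expression in Equation (10)."""
        H = -np.dot(q, np.log(q))
        return np.dot(p_data, c) - np.dot(q, c) + H

    q_star = np.exp(-c) / Zc                    # the maximizer q(x) = e^{-c(x)}/Zc
    q_rand = rng.dirichlet(np.ones(K))          # any other distribution

    print("exact nll per datapoint:", nll_exact)
    print("bound at q = e^{-c}/Zc :", lower_bound(q_star))   # equals the exact nll
    print("bound at a random q    :", lower_bound(q_rand))   # strictly smaller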

3 Cost Regularization
Let’s introduce a regularization term ψ(c) which encourages c to be small. This change can be
interpreted as introducing a prior on c. Since c encodes a probability distribution, we’d prefer for
it to be smooth and simple, rather than putting a delta function at the data points. The objective
is redefined as follows, to be the nll plus a regularization term:

L(c) = Edata [c(x)] + log(Zc ) + ψ(c) (12)

Repeating the derivation of the previous section (but with ψ(c) added), we get
 
L(c) = max_q [ Edata[c(x)] − Eq[c(x)] + H(q) + ψ(c) ]  (13)
     = max_q L(c, q)  (14)

where the last line is the definition of L(c, q).

3.1 Deriving the GAN Objective


Now we’ll show that by reparameterizing c and choosing a particular regularizer ψ(c), we can
derive the original GAN objective, plus the entropy term H(q). Let c(x) = log σ(−f (x)), where σ
is the sigmoid function σ(z) = 1/(1 + e−z ), and f is a function approximator with a scalar output,
e.g., the output of a neural network, whose last layer has output size 1 and linear activation. (Note
that with this definition, large f ↔ low cost ↔ the sample looks like it came from pdata .) This
function is plotted below.

[Figure: plot of log σ(−z) for z ∈ [−3, 3].]

Previously we referred to ψ(c), but now c is defined in terms of f , so we’ll write ψ(f ) for the
regularization term. The objective becomes

L(f, q) = Edata [log σ(−f (x))] − Eq [log σ(−f (x))] + H(q) + ψ(f ) (15)

We’re going to define ψ(f ) as the following expectation over the data:

ψ(f ) = Edata [− log σ(f (x)) − log σ(−f (x))] (16)

We designed this term so it would cancel out the log σ(−f (x)) term and replace it with − log σ(f (x)).
The regularizer − log σ(z) − log σ(−z) is plotted below. It’s ≈ z²/4 + 2 log 2 around the origin but
becomes ≈ |z| as |z| → ∞.

[Figure: plot of the regularizer − log σ(z) − log σ(−z) for z ∈ [−3, 3].]
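A short snippet (my addition, assuming numpy and matplotlib are available) that reproduces the two curves shown in the figures above:

    import numpy as np
    import matplotlib.pyplot as plt

    z = np.linspace(-3, 3, 200)
    log_sigma = lambda t: -np.log1p(np.exp(-t))     # log σ(t), computed stably

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(z, log_sigma(-z))
    ax1.set_title("cost  c = log σ(−z)")
    ax2.plot(z, -log_sigma(z) - log_sigma(-z))
    ax2.set_title("regularizer  −log σ(z) − log σ(−z)")
    plt.tight_layout()
    plt.show()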

L(f, q) = Edata[log σ(−f(x))] − Eq[log σ(−f(x))] + H(q) + Edata[− log σ(f(x)) − log σ(−f(x))]
        = −Edata[log σ(f(x))] − Eq[log σ(−f(x))] + H(q)  (17)

The sigmoid function has the nice property that


σ(−z) = 1/(1 + e^z) = 1 − e^z/(1 + e^z) = 1 − σ(z).  (18)
Thus we get

L(f, q) = −Edata [log σ(f (x))] − Eq [log(1 − σ(f (x)))] + H(q) (19)
= −Edata [log(D(x))] − Eq [log(1 − D(x))] + H(q) (20)
defining D(x) = σ(f (x))

and our optimization problem is

min_f max_q L(f, q)  (21)
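As a quick numerical sanity check (my addition, on arbitrary synthetic values of f), the following verifies the algebra that took Equation (15) to Equations (19)–(20): the data term of (15) plus the regularizer (16) equals −Edata[log σ(f(x))], and the sigmoid identity (18) gives log σ(−f) = log(1 − σ(f)).

    import numpy as np

    rng = np.random.default_rng(1)
    f = rng.normal(size=1000)                      # discriminator outputs f(x) on "data" samples
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

    lhs = np.mean(np.log(sigma(-f))) + np.mean(-np.log(sigma(f)) - np.log(sigma(-f)))
    rhs = -np.mean(np.log(sigma(f)))
    print(np.allclose(lhs, rhs))                   # True: the regularizer swaps the data term

    # the sigmoid identity (18) used to go from (19) to (20)
    print(np.allclose(np.log(sigma(-f)), np.log(1.0 - sigma(f))))   # True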

There are two differences between the optimization problem we’ve derived in this section and
the original GAN formulation of [Goo+14].

1. The entropy regularization, H(q)

2. The min and the max are switched—in the original GAN formulation, the generator is on
the outside, but here, the generator is on the inside.

Why does it make sense to switch the min and max? When we add the entropy regular-
ization term, we can freely switch the min and the max: both orderings yield the same solution
(f ∗ , q ∗ ). That follows from the properties of convex-concave functions, which are discussed in
Section 4, and specialized to the GAN case in Section 5.

One ugly detail—normalization. There is one problem with the entropy-regularized formula-
tion. Since cost is parameterized as c(x) = log σ(−f (x)), we have that c(x) ≤ 0. If the domain of
x is infinite, then e−c(x) will have an infinite integral, i.e., we won’t have a finite partition function.
The underlying issue is that on an infinite space, the entropy regularization is too strong—the
“pressure” to spread out q is stronger than the pressure to stay near the low-cost regions of c(x).
This problem can be fixed by using a KL divergence penalty −KL(q ‖ q0) instead of the entropy
bonus, where q0 could be a Gaussian distribution covering the range of reasonable values for x.
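To spell out the suggested fix (my elaboration, under the assumption that the model is redefined as p(x) ∝ q0(x) e−c(x)): repeating the importance-sampling and Jensen steps of Section 2 with this model gives

log Z = log Eq[q0(x) e−c(x)/q(x)] ≥ Eq[log(q0(x) e−c(x)/q(x))] = −Eq[c(x)] − KL(q ‖ q0),

so the analogue of Equation (10) is nll per datapoint = −Edata[log q0(x)] + max_q [ Edata[c(x)] − Eq[c(x)] − KL(q ‖ q0) ], where the first term does not depend on c or q. Equality in the Jensen bound holds when q(x) ∝ q0(x) e−c(x), and since q0 is normalizable the partition function stays finite even though c(x) ≤ 0.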

4 Saddle Points
This section provides a general discussion of the optimization problems we’ve encountered above,
which involve a minimization over one set of variables and a maximization over the others. This
section will make it possible to answer questions such as “what happens if we switch the min
and max?” and “does gradient descent converge to the solution of a minimax problem?” We’ll
start out by defining three key concepts: minimax problems, saddle points, and convex-concave
problems.

1. Minimax problems are optimization problems that take the form minx maxy f (x, y). In
general, the min and the max do not commute: it’s not true in general that the value
minx maxy f (x, y) = maxy minx f (x, y), and it’s not true in general that a solution (x∗ , y ∗ ) to
one ordering will be a solution to the other. (When we say (x∗ , y ∗ ) is a solution for the order-
ing minx maxy f (x, y), we mean that x∗ minimizes maxy f (x, y), and y ∗ ∈ argmaxy f (x∗ , y).)

2. A saddle point is defined as a pair (x∗ , y ∗ ) satisfying x∗ ∈ argminx f (x, y ∗ ) and y ∗ ∈
argmaxy f (x∗ , y). That is, we can exchange the min and max.

3. A convex-concave function f (x, y) is convex in its first argument, and concave in its second
argument.

A basic theorem states that if f is convex-concave, then we can always exchange the min and
max and the value is unchanged: minx maxy f (x, y) = maxy minx f (x, y). Furthermore, for a convex-
concave function, any point where the gradient vanishes, ∇x f (x, y) = ∇y f (x, y) = 0, is a saddle
point.
Minimax problems have an interpretation as a two player game between Xander and Yasaman.
x is Xander’s move, and y is Yasaman’s move. Xander is trying to minimize f (x, y), and Yasaman
is trying to maximize it. The ordering minx maxy f (x, y) means that Xander goes first, whereas
maxy minx f (x, y) means that Yasaman goes first. The second player has an advantage, because
he or she can see the first player’s move and respond accordingly—that is, maxy minx f (x, y) ≤
minx maxy f (x, y).
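A tiny illustration of the second-mover advantage (my addition, using an arbitrary 2×2 payoff table over discrete moves rather than a continuous problem):

    import numpy as np

    # f(x, y) for two discrete moves each: rows are Xander's choices of x,
    # columns are Yasaman's choices of y.
    f = np.array([[0.0, 3.0],
                  [2.0, 1.0]])

    min_max = f.max(axis=1).min()   # Xander commits first; Yasaman responds: min_x max_y f
    max_min = f.min(axis=0).max()   # Yasaman commits first; Xander responds: max_y min_x f
    print(max_min, "<=", min_max)   # 1.0 <= 2.0, as the inequality predicts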
Finding saddle points of convex-concave functions is tractable, unlike solving general minimax
problems. In a way, finding these saddle points is on the same level of hardness as finding the
minimizers of convex functions. In fact, much of the theory for analyzing (stochastic) gradient
descent carries over from convex minimization to the problem of finding saddle points.
Saddle points play a key role in constrained convex optimization problems, where the solu-
tion corresponds to finding the saddle point of the Lagrangian L(x, λ, ν), which is convex in the
argument x and concave in the Lagrange multipliers (λ, ν). We can find the saddle point using
Newton’s method. (See [BV04], 10.3.)
One issue that makes saddle point problems harder to understand than minimization problems
is that it’s less straightforward to measure optimization progress. For minimization problems
minx f (x), we can trivially measure progress through the objective f , which should decrease. For
saddle point problems minx maxy f (x, y), we have two ways of measuring progress:

1. The Gap. Given a point (x, y), define

gap(x, y) = max_{y′} f(x, y′) − min_{x′} f(x′, y)  (22)

Recall that the saddle point (x∗ , y ∗ ) satisfies
f(x∗, y∗) = min_x max_y f(x, y) = max_y min_x f(x, y)  (23)

so gap(x∗ , y ∗ ) = 0. For arbitrary (x, y), gap(x, y) ≥ 0. Most of the convergence theory
of gradient descent methods for convex-concave problems relies on showing that the gap is
small after optimization.
2. Gradient Norm. Another measure of convergence is given by the norm of the gradient with
respect to x and y: k∇x f (x, y)k2 + k∇y f (x, y)k2 . The most effective algorithms for solving
constrained convex optimization problems are primal-dual methods, which perform Newton
steps on the Lagrangian. These methods typically perform line searches on this gradient
norm, called the primal-dual residual [BV04].
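A small illustration of the two progress measures (my addition, on a strongly convex-concave quadratic of my own choosing, f(x, y) = x² + xy − y², whose saddle point is (0, 0)); for this f both quantities have closed forms and vanish exactly at the saddle point:

    import numpy as np

    # f(x, y) = x^2 + x*y - y^2: strongly convex in x, strongly concave in y,
    # with its saddle point at (0, 0).
    def f(x, y):
        return x**2 + x*y - y**2

    def gap(x, y):
        # max_{y'} f(x, y') is attained at y' = x/2; min_{x'} f(x', y) at x' = -y/2.
        return f(x, x/2.0) - f(-y/2.0, y)

    def grad_norm_sq(x, y):
        gx, gy = 2*x + y, x - 2*y
        return gx**2 + gy**2

    for (x, y) in [(0.0, 0.0), (1.0, -1.0), (0.1, 0.2)]:
        print(f"(x, y) = {(x, y)}: gap = {gap(x, y):.4f}, |grad|^2 = {grad_norm_sq(x, y):.4f}")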
There are several different techniques used to prove and analyze the convergence of gradient
descent in convex-concave problems. Convergence for saddle point problems is less intuitively
clear than for convex minimization problems, so these proof techniques might provide some helpful
intuitions.
 
1. Show that the gradient norm is reduced along the step direction. Let z = (x, y), and we’ll write
   f(z) to mean f(x, y). Consider taking a small step z → z + a. Taking a first-order Taylor
   expansion of the gradient,

   f′(z + a) = f′(z) + f″(z)a + O(‖a‖²)  (24)

   (a) Second-order methods compute the step that solves f″(z)a = −f′(z), e.g., see the
       discussion of the infeasible-start Newton method in [BV04]. They perform a line search
       in this direction, which is guaranteed to reduce the gradient norm.

   (b) For large-scale applications, we’re more interested in first-order methods, which take a
       step in the gradient direction. Let a = −α (∂f/∂x, −∂f/∂y)ᵀ = −αSf′(z), where we define
       S = diag(I, −I). Substituting back into Equation (24),

       f′(z + a) ≈ f′(z) − αf″(z)Sf′(z) = (I − αHS)f′(z)  (25)

       where H = f″(z) is the Hessian of f. If we require that f is strongly convex wrt x and
       strongly concave wrt y, then HS is positive definite, and for small α, ‖(I − αHS)f′(z)‖ <
       ‖f′(z)‖, so the norm of the gradient strictly decreases. (A numerical check of this
       contraction appears after this list.)
2. Online gradient descent / online mirror descent. The standard analysis from online learning
mostly carries through. See [Bub].
3. Standard subgradient descent convergence analysis. [NO09] provide a convergence analysis
that looks like the standard convergence proofs for subgradient descent, which is also quite
similar to the online learning results.
4. Theory of monotone operators. One can show that the subgradient update is a contraction
using a general and elegant theory of monotone operators [RB16].
Unfortunately, it’s not the case that the gap monotonically decreases during gradient descent;
rather, the gradient norm decreases and the gap shrinks as a result.
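A quick numerical check of the claim in item 1(b) above (my addition, using a randomly generated quadratic that is strongly convex in x and strongly concave in y): the symmetric part of HS is positive definite, and for a small step size the map (I − αHS) shrinks the gradient.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3  # dimension of x and of y

    # Hessian of a quadratic f(x, y) = 0.5 x'Ax + x'By - 0.5 y'Cy,
    # with A, C positive definite (strong convexity in x, strong concavity in y).
    A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)
    C = rng.normal(size=(n, n)); C = C @ C.T + n * np.eye(n)
    B = rng.normal(size=(n, n))
    H = np.block([[A, B], [B.T, -C]])
    S = np.block([[np.eye(n), np.zeros((n, n))], [np.zeros((n, n)), -np.eye(n)]])

    HS = H @ S
    sym = 0.5 * (HS + HS.T)
    print("symmetric part of HS is positive definite:",
          np.all(np.linalg.eigvalsh(sym) > 0))

    alpha = 0.005                              # a small step size
    g = rng.normal(size=2 * n)                 # a generic gradient vector f'(z)
    g_new = (np.eye(2 * n) - alpha * HS) @ g   # Equation (25)
    print("gradient norm before:", np.linalg.norm(g))
    print("gradient norm after :", np.linalg.norm(g_new))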

Simple Examples
The following two-dimensional problem illustrates some of the properties of the GAN problem.

min_x max_y  x² − y(x − 1)  (26)

Here, y corresponds to the generator, and x to the discriminator. The problem is convex-concave
and has a saddle point at (1, 2), which one can see by solving for ∇x f (x, y) = ∇y f (x, y) = 0.
For each fixed value of y, there is a unique solution for x. But for each fixed value of x 6= 1, the
objective is unbounded in y, and for x = 1, all values y achieve the optimum. The objective above
is the Lagrangian of the problem

min_x  x²,  subject to x = 1  (27)

Now consider adding regularization to y:

min_x max_y  x² − y(x − 1) − εy²  (28)

This problem has a different saddle point: (1/(1 − 4ε), 2/(1 − 4ε)). With this objective, if we fix x
and optimize over y, there is always a unique solution—we can always recover y from x. However,
if ε is too large, the saddle point goes to infinity.
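The following sketch (my addition; the step size and iteration count are arbitrary) runs simultaneous gradient descent on x and ascent on y for the unregularized problem (26), and the iterates approach the saddle point (1, 2):

    import numpy as np

    # f(x, y) = x^2 - y*(x - 1): Equation (26).
    x, y = 0.0, 0.0
    alpha = 0.05
    for t in range(2000):
        grad_x = 2*x - y          # descend on x
        grad_y = -(x - 1.0)       # ascend on y
        x, y = x - alpha * grad_x, y + alpha * grad_y

    print((round(x, 3), round(y, 3)))                      # close to (1.0, 2.0)
    print("gradient norm:", np.hypot(2*x - y, -(x - 1.0))) # close to 0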

5 GANs and Saddles


Recall the GAN optimization problem, and the entropy-regularized version, which we derived as
a likelihood maximization problem.
 
max_q min_f  −Edata[log(σ(f(x)))] − Eq[log(1 − σ(f(x)))]  [+ H(q)]  (29)

The training procedure for GANs is to perform gradient descent on f and q simultaneously. Thus,
the training procedure is agnostic to the ordering of the min and max. The key questions are (1)
what should this training procedure converge to, and (2) under what conditions does it actually
converge? These questions are nontrivial even when optimizing in function space, e.g., with tabular
representations of f and q.
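To illustrate what “optimizing in function space” looks like (my addition, with arbitrary constants: a 6-point sample space, step size 0.05, and entropy weight 0.3), here is a tabular sketch that performs simultaneous updates, doing gradient descent on f and exponentiated-gradient (mirror) ascent on q:

    import numpy as np

    rng = np.random.default_rng(3)
    K, alpha, tau = 6, 0.05, 0.3            # tau = entropy weight (tau = 0: unregularized)
    p_data = rng.dirichlet(np.ones(K))
    f = np.zeros(K)                         # tabular discriminator, D = sigmoid(f)
    q = np.ones(K) / K                      # tabular generator distribution

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for t in range(50000):
        D = sigmoid(f)
        # L(f, q) = -E_data[log D] - E_q[log(1 - D)] + tau * H(q)
        grad_f = -p_data * (1.0 - D) + q * D                 # dL/df, descended
        dL_dq = -np.log(1.0 - D) - tau * (np.log(q) + 1.0)   # dL/dq, ascended
        f = f - alpha * grad_f
        q = q * np.exp(alpha * dL_dq)       # mirror ascent keeps q a distribution
        q = q / q.sum()

    print("p_data:", np.round(p_data, 3))
    print("q     :", np.round(q, 3))           # roughly p_data, pulled toward uniform by the entropy term
    print("D     :", np.round(sigmoid(f), 3))  # close to p_data/(p_data + q), the optimal discriminator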
Let’s consider the cases with and without entropy regularization.

Without Entropy Regularization


• Saddle. (f ∗ , q ∗ ) = (0, pdata ) is a saddle point, since it satisfies q ∗ ∈ argmaxq L(q, f ∗ ) and
f ∗ ∈ argminf L(q ∗ , f ).

• Discriminator on inside. If the discriminator is the inner optimization problem, i.e., if we solve
for maxq minf L(q, f ), then for every fixed generator, there’s a unique optimal solution for the
discriminator. If the discriminator ranges over all functions, the optimum is D(x) = σ(f(x)) =
pdata(x)/(pdata(x) + q(x)), i.e., f(x) is the log-odds log(pdata(x)/q(x)).

• Generator on inside. If we take the generator to be the inner optimization problem, then
for a fixed discriminator, the problem maxq L(f, q) may have multiple solutions. If f has a unique
global maximum, then argmaxq L(f, q) is a delta function at the global maximum of f . If f
is constant (as at the optimal solution), then all distributions q are maximizers.

Hence, while the ordering minf maxq does yield the same optimal value as maxq minf , this
ordering has some unwholesome properties. In particular, it doesn’t satisfy the recoverability
condition—given f , we can’t recover q. (But given q, we can recover f .)

With Entropy Regularization


• Saddle. The saddle exists (given that x is restricted to a finite space, due to the issue we
discussed under “One ugly detail”), but it is different from the saddle point of the unregularized
problem and can’t be computed in closed form.

• Discriminator on inside. The optimal discriminator is the same as in the unregularized case.

• Generator on inside. maxq L(f, q) now has a unique nontrivial solution: q(x) = e−c(x)/Zc =
e−log σ(−f(x))/Zc = 1/(σ(−f(x)) Zc) = 1/((1 − D(x)) Zc). Thus the recoverability property holds—given f,
we can recover q, and vice versa. (A numerical check of this formula appears below.)
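A quick check of the recoverability formula (my addition, on a finite space with arbitrary values): starting from an arbitrary tabular discriminator f, form q ∝ 1/(1 − D) and verify that it attains the largest value of the entropy-regularized objective among several candidate distributions.

    import numpy as np

    rng = np.random.default_rng(4)
    K = 6
    p_data = rng.dirichlet(np.ones(K))
    f = rng.normal(size=K)                      # an arbitrary tabular discriminator
    D = 1.0 / (1.0 + np.exp(-f))                # D = sigmoid(f)

    def L(q):
        # entropy-regularized objective of Equation (29), on a finite space
        return (-np.dot(p_data, np.log(D))
                - np.dot(q, np.log(1.0 - D))
                - np.dot(q, np.log(q)))         # last term is + H(q)

    q_rec = 1.0 / (1.0 - D); q_rec /= q_rec.sum()       # q(x) = 1 / ((1 - D(x)) Zc)
    others = [rng.dirichlet(np.ones(K)) for _ in range(5)] + [p_data]
    print(all(L(q_rec) >= L(q) for q in others))        # True: q_rec is the maximizer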

Does the Saddle Point Theory Have Practical Implications for GAN-like
Problems?
• It might be possible to use an approximation of the gap as a convergence diagnostic for GANs:
perform a small number of generator-only updates and a small number of discriminator-only
updates and measure the gap.

• Since we don’t have access to the density of the generator, it’s not straightforward to ap-
proximate its entropy. However, we may be able to devise more tractable regularizers that
make the problem convex-concave, and thus make the generator recoverable in terms of the
discriminator/cost.

• Convergence guarantees only hold under restrictive assumptions, for example, that the func-
tion is convex-concave. The GAN objective is convex-concave in the functions c and q, but
not in the parameters. However, it may still be possible to obtain a convergent algorithm
when optimizing in terms of parameters. Let’s suppose we have an algorithm that is guaran-
teed to converge to the saddle point of a convex-concave problem, and it works by solving a
series of subproblems, as with proximal methods and trust region methods. Then we can try
to mimic the behavior of this algorithm by solving these subproblems in terms of parameters.
Natural gradient algorithms can be derived this way.

6 Generalizations: φ-risks and f -divergences


There are a couple of interesting generalizations of GANs that have appeared recently. [NCT16]
show how the objective can be altered to optimize various f-divergences between the generator’s
distribution and the data distribution. [HE16] uncover a related generalization—that there is a
connection between classification risks (φ-risks) and f-divergences—GANs naturally emerge from
using a log-loss, but other divergences arise from other losses. Both [NCT16; HE16] build on
[NWJ09], where some key mathematical ideas originated. There is a close correspondence between
the regularizer ψ(c) (in Equation (13), for example) and the resulting f-divergence / φ-risk being
minimized.

f -divergences ⇔ φ-risks ⇔ cost regularizers ψ(c)

6.1 φ-risks
Section 3 showed how the difference-of-costs objective in Equation (13) can be converted into the
GAN-like objective in Equation (19), after choosing the appropriate regularizer ψ(c). Moreover,
minimization wrt the discriminator results in the Jensen-Shannon divergence between q and the
data distribution, minc L(c, q) = DJS(pdata, q). As shown in [HE16], we can generalize this con-
struction by using different regularizers ψ(c), and we end up with different divergence measures.
Specifically, let c(x) = φ(−h(x)), where h is some function approximator with real-valued out-
put. (The analysis above used φ(z) = log σ(z) to arrive at the GAN objective). Let’s define
ψ(c) = Edata [−φ(h(x)) − φ(−h(x))].

L(c, q) = Edata[c(x)] − Eq[c(x)] + H(q) + ψ(c)  (30)
        = Edata[φ(−h(x))] + Eq[−φ(−h(x))] + H(q) + Edata[−φ(h(x)) − φ(−h(x))]  (31)
        = −Edata[φ(h(x))] − Eq[φ(−h(x))] + H(q)  (32)

[NWJ09] show that when h is allowed to range over the space of all functions, then the sum of
expectations turns into an f -divergence:
 
max_h [ Edata[φ(h(x))] − Eq[φ(−h(x))] ] = Df(pdata, q)  (33)

where f(u) = max_h (−φ(−h) − φ(h)u)  (34)

Choosing φ(z) = log σ(z) results in the Jensen-Shannon divergence, whereas other choices give
different f -divergences; some possibilities are catalogued in [NWJ09].

6.2 f -GAN
Another approach for generalizing GANs and approximating f -divergences is in [NCT16]. That
approach is more general than the one above using φ-risks, as it allows one to approximate asym-
metric f-divergences; however, the derivation of the Jensen-Shannon divergence involves a less
natural set of choices.

7 Applications of GAN-like Methods


• Inverse reinforcement learning, as shown in [HE16]. [FLA16] also frame their IRL
approach using an energy-based model, and [Chr16] shows the close connections to GANs,
including some interesting points about estimating the partition function using a mixture of
pdata and q, rather than q alone.

• Semi-supervised learning, as shown in [Sal+16] and [Che+16].

• Better unsupervised learning via lossy compression. A natural method of formulating lossy
compression in a universal way yields a minimax problem, as I described in my “Noise Should be
Free” presentation. It may be possible to develop methods for density modeling that are better
able to identify the interesting aspects of data using these ideas.

• Model-based reinforcement learning. Ask Jonathan Ho for details.

• Sample-efficient reinforcement learning. Ask Peter Chen for details.

References
[Bub]     Sebastian Bubeck. ORF523 Course Notes. https://blogs.princeton.edu/imabandit/2013/04/18/orf523-mirror-descent-part-iiii/.
[BV04]    Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[Che+16]  Xi Chen et al. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets”. In: arXiv preprint arXiv:1606.03657 (2016).
[Chr16]   Paul Christiano. “Guided cost learning is generative adversarial modeling”. In: unpublished tech report (2016).
[FLA16]   Chelsea Finn, Sergey Levine, and Pieter Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”. In: arXiv preprint arXiv:1603.00448 (2016).
[Goo+14]  Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680.
[HE16]    Jonathan Ho and Stefano Ermon. “Generative Adversarial Imitation Learning”. In: arXiv preprint arXiv:1606.03476 (2016).
[KB16]    Taesup Kim and Yoshua Bengio. “Deep Directed Generative Models with Energy-Based Probability Estimation”. In: arXiv preprint arXiv:1606.03439 (2016).
[NCT16]   Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”. In: arXiv preprint arXiv:1606.00709 (2016).
[NO09]    Angelia Nedić and Asuman Ozdaglar. “Subgradient methods for saddle-point problems”. In: Journal of Optimization Theory and Applications 142.1 (2009), pp. 205–228.
[NWJ09]   XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. “On surrogate loss functions and f-divergences”. In: The Annals of Statistics (2009), pp. 876–904.
[RB16]    Ernest K. Ryu and Stephen Boyd. “Primer on monotone operator methods”. In: Appl. Comput. Math 15.1 (2016), pp. 3–43.
[Sal+16]  Tim Salimans et al. “Improved Techniques for Training GANs”. In: arXiv preprint arXiv:1606.03498 (2016).

