What Regularized Auto-Encoders Learn from the Data-Generating Distribution

Guillaume Alain and Yoshua Bengio
Abstract
What do auto-encoders learn about the underlying data-generating distribution? Recent
work suggests that some auto-encoder variants do a good job of capturing the local manifold
structure of data. This paper clarifies some of these previous observations by showing that
minimizing a particular form of regularized reconstruction error yields a reconstruction
function that locally characterizes the shape of the data-generating density. We show that
the auto-encoder captures the score (derivative of the log-density with respect to the input).
This contradicts previous interpretations of reconstruction error as an energy function. Unlike
previous results, the theorems provided here are completely generic and do not depend on
the parameterization of the auto-encoder: they show what the auto-encoder would tend
to if given enough capacity and examples. These results are for a contractive training
criterion we show to be similar to the denoising auto-encoder training criterion with small
corruption noise, but with contraction applied on the whole reconstruction function rather
than just the encoder. Similarly to score matching, one can consider the proposed training
criterion as a convenient alternative to maximum likelihood because it does not involve
a partition function. Finally, we show how an approximate Metropolis-Hastings MCMC
can be set up to recover samples from the estimated distribution, and this is confirmed in
sampling experiments.
Keywords: auto-encoders, denoising auto-encoders, score matching, unsupervised repre-
sentation learning, manifold learning, Markov chains, generative models
1. Introduction
Machine learning is about capturing aspects of the unknown distribution from which the
observed data are sampled (the data-generating distribution). For many learning algorithms
and in particular in manifold learning, the focus is on identifying the regions (sets of points)
in the space of examples where this distribution concentrates, i.e., which configurations of
the observed variables are plausible.
Unsupervised representation-learning algorithms try to characterize the data-generating
distribution through the discovery of a set of features or latent variables whose variations
capture most of the structure of the data-generating distribution. In recent years, a number
of unsupervised feature learning algorithms have been proposed that are based on minimiz-
ing some form of reconstruction error, such as auto-encoder and sparse coding variants (Ol-
shausen and Field, 1997; Bengio et al., 2007; Ranzato et al., 2007; Jain and Seung, 2008;
Ranzato et al., 2008; Vincent et al., 2008; Kavukcuoglu et al., 2009; Rifai et al., 2011b,a;
Gregor et al., 2011). An auto-encoder reconstructs the input through two stages, an encoder
function f , which outputs a learned representation h = f (x) of an example x, and a decoder
function g, such that g(f (x)) ≈ x for most x sampled from the data-generating distribu-
tion. These feature learning algorithms can be stacked to form deeper and more abstract
representations. Deep learning algorithms learn multiple levels of representation, where the
number of levels is data-dependent. There are theoretical arguments and much empiri-
cal evidence to suggest that when they are well-trained, deep learning algorithms (Hinton
et al., 2006; Bengio, 2009; Lee et al., 2009; Salakhutdinov and Hinton, 2009; Bengio and
Delalleau, 2011; Bengio et al., 2013b) can perform better than their shallow counterparts,
both in terms of learning features for the purpose of classification tasks and for generating
higher-quality samples.
Here we restrict ourselves to the case of continuous inputs x ∈ Rd with the data-
generating distribution being associated with an unknown target density function, denoted
p. Manifold learning algorithms assume that p is concentrated in regions of lower dimen-
sion (Cayton, 2005; Narayanan and Mitter, 2010), i.e., the training examples are by defini-
tion located very close to these high-density manifolds. In that context, the core objective
of manifold learning algorithms is to identify where the density concentrates.
Some important questions remain concerning many of the feature learning algorithms based
on reconstruction error. Most importantly, what is their training criterion learning about
the input density? Do these algorithms implicitly learn about the whole density or only some
aspect? If they capture the essence of the target density, then can we formalize that link
and in particular exploit it to sample from the model? The answers may help to establish
that these algorithms actually learn implicit density models, which only define a density
indirectly, e.g., through the estimation of statistics or through a generative procedure.
These are the questions to which this paper contributes.
The paper is divided into two main sections, along with detailed appendices with proofs
of the theorems. Section 2 makes a direct link between denoising auto-encoders (Vincent
et al., 2008) and contractive auto-encoders (Rifai et al., 2011b), justifying the interest
in the contractive training criterion studied in the rest of the paper. Section 3 is the
main contribution and regards the following question: when minimizing that criterion, what
does an auto-encoder learn about the data-generating density? The main answer is that it
estimates the score (first derivative of the log-density), i.e., the direction in which the density increases the most, which also corresponds to the direction towards the local mean, the expected value in a small ball around the current location. It also estimates the Hessian (second derivative of the log-density).
Finally, Section 4 shows how having access to an estimator of the score can be exploited
to estimate energy differences, and thus perform approximate MCMC sampling. This is
achieved using a Metropolis-Hastings MCMC in which the energy differences between the
proposal and the current state are approximated using the denoising auto-encoder. Ex-
periments on artificial data sets show that a denoising auto-encoder can recover a good
estimator of the data-generating distribution, when we compare the samples generated by
the model with the training samples, projected into various 2-D views for visualization.
Figure 1: Regularization forces the auto-encoder to become less sensitive to the input, but
minimizing reconstruction error forces it to remain sensitive to variations along
the manifold of high density. Hence the representation and reconstruction end
up capturing well variations on the manifold while mostly ignoring variations
orthogonal to it.
Regularized auto-encoders (see Bengio et al. 2012b for a review and a longer exposition)
capture the structure of the training distribution thanks to the productive opposition be-
tween reconstruction error and a regularizer. An auto-encoder maps inputs x to an internal
representation (or code) f (x) through the encoder function f , and then maps back f (x)
to the input space through a decoding function g. The composition of f and g is called
the reconstruction function r, with r(x) = g(f(x)), and a reconstruction loss function ℓ
penalizes the error made, with r(x) viewed as a prediction of x. When the auto-encoder
is regularized, e.g., via a sparsity regularizer, a contractive regularizer (detailed below), or
a denoising form of regularization (that we find below to be very similar to a contractive
regularizer), the regularizer basically attempts to make r (or f ) as simple as possible, i.e.,
as constant as possible, as unresponsive to x as possible. It means that f has to throw away
some information present in x, or at least represent it with less precision. On the other
hand, to make reconstruction error small on the training set, examples that are neighbors
on a high-density manifold must be represented with sufficiently different values of f (x) or
r(x). Otherwise, it would not be possible to distinguish and hence correctly reconstruct
these examples. It means that the derivatives of f (x) or r(x) in the x-directions along the
manifold must remain large, while the derivatives (of f or r) in the x-directions orthogonal
to the manifold can be made very small. This is illustrated in Figure 1. In the case of prin-
cipal components analysis, one constrains the derivative to be exactly 0 in the directions
orthogonal to the chosen projection directions, and around 1 in the chosen projection di-
rections. In regularized auto-encoders, f is non-linear, meaning that it is allowed to choose
different principal directions (those that are well represented, i.e., ideally the manifold tan-
gent directions) at different x’s, and this allows a regularized auto-encoder with non-linear
encoder to capture non-linear manifolds. Figure 2 illustrates the extreme case when the
regularization is very strong (r wants to be nearly constant where density is high) in the
special case where the distribution is highly concentrated at three points (three training ex-
amples). It shows the compromise between obtaining the identity function at the training
examples and having a flat r near the training examples, yielding a vector field r(x) − x
that points towards the high density points.
Here we show that the denoising auto-encoder (Vincent et al., 2008) with very small
Gaussian corruption and squared error loss is actually a particular kind of contractive auto-
encoder (Rifai et al., 2011b), contracting the whole auto-encoder reconstruction function
rather than just the encoder, whose contraction penalty coefficient is the magnitude of the perturbation. This was first suggested in Rifai et al. (2011b).

Figure 2: Illustration of the reconstruction function r(x) learned under strong regularization for a one-dimensional distribution concentrated at a few training points, showing the compromise between reconstructing the training examples and keeping r nearly constant elsewhere.
The contractive auto-encoder, or CAE (Rifai et al., 2011b), is a particular form of reg-
ularized auto-encoder which is trained to minimize the following regularized reconstruction
error:
" #
∂f (x) 2
LCAE = E `(x, r(x)) + λ (1)
∂x F
where r(x) = g(f(x)) and $\|A\|_F^2$ is the sum of the squares of the elements of A. Both the squared loss $\ell(x, r) = \|x - r\|^2$ and the cross-entropy loss $\ell(x, r) = -x \log r - (1 - x)\log(1 - r)$ have been used, but here we focus our analysis on the squared loss because of the easier
mathematical treatment it allows. Note that success in minimizing the CAE criterion
strongly depends on the parameterization of f and g and in particular on the tied weights
constraint used, with f(x) = sigmoid(Wx + b) and g(h) = sigmoid(Wᵀh + c). The above
regularizing term forces f (as well as g, because of the tied weights) to be contractive, i.e.,
to have singular values less than 1 (see footnote 1). Larger values of λ yield more contraction (smaller
singular values) where it hurts reconstruction error the least, i.e., in the local directions
where there are only little or no variations in the data. These typically are the directions
orthogonal to the manifold of high density concentration, as illustrated in Figure 2.
1. Note that an auto-encoder without any regularization would tend to find many leading singular values
near 1 in order to minimize reconstruction error, i.e., preserve input norm in all the directions of variation
present in the data.
The denoising auto-encoder, or DAE (Vincent et al., 2008), is trained to minimize the
following denoising criterion:
$$\mathcal{L}_{\mathrm{DAE}} = \mathbb{E}\left[ \ell\!\left(x, r(N(x))\right) \right] \qquad (2)$$
where N(x) is a stochastic corruption of x and the expectation is taken over the training distribution and the corruption noise source. Here we consider mostly the squared loss and Gaussian noise corruption, again because they are easier to handle mathematically. In many cases, the exact same proofs can be applied to any kind of additive noise, but Gaussian noise serves as a good frame of reference.
Theorem 1 Let p be the probability density function of the data. If we train a DAE using the expected quadratic loss and corruption noise N(x) = x + ε with ε ∼ N(0, σ²I), then the optimal reconstruction function r*_σ(x) is given by
$$r^*_\sigma(x) = \frac{\mathbb{E}_\varepsilon\left[ p(x - \varepsilon)\,(x - \varepsilon) \right]}{\mathbb{E}_\varepsilon\left[ p(x - \varepsilon) \right]} \qquad (3)$$
for values of x where p(x) ≠ 0. Moreover, its asymptotic behavior as the noise level shrinks is
$$r^*_\sigma(x) = x + \sigma^2 \frac{\partial \log p(x)}{\partial x} + o(\sigma^2) \quad \text{as } \sigma \to 0. \qquad (4)$$
When we look at the asymptotic behavior with Equation 4, the first thing to observe is
that the leading term in the expansion of rσ∗ (x) is x, and then the remainder goes to 0 as
σ → 0. When there is no noise left at all, it should be clear that the best reconstruction
target for any value x would be that x itself.
We get something even more interesting if we look at the second term of Equation 4
because it gives us an estimator of the score from
$$\frac{\partial \log p(x)}{\partial x} = \left( r^*_\sigma(x) - x \right) / \sigma^2 + o(1) \quad \text{as } \sigma \to 0. \qquad (5)$$
This result is at the core of our paper. It is what allows us to start from a trained DAE,
and then recover properties of the training density p(x) that can be used to sample from
p(x).
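To make the use of Equation 5 concrete, here is a minimal sketch (in Python with NumPy) of how the score could be recovered from any trained reconstruction function. The function `dae_reconstruction` below is a hypothetical stand-in for a trained DAE (here it is the near-optimal reconstruction for a standard Gaussian density, whose score is −x), and `sigma` is the small corruption level assumed to have been used during training.

```python
import numpy as np

def approximate_score(r, x, sigma):
    """Estimate the score d/dx log p(x) from a trained reconstruction function r.

    Based on Equation 5: score(x) ~ (r(x) - x) / sigma^2 for small sigma.
    """
    x = np.asarray(x, dtype=float)
    return (r(x) - x) / sigma**2

# Hypothetical stand-in for a trained DAE: for p(x) = N(0, I) the optimal
# reconstruction is approximately r(x) = x + sigma^2 * (-x), since the score is -x.
sigma = 0.1
dae_reconstruction = lambda x: x + sigma**2 * (-x)

x = np.array([0.5, -1.0])
print(approximate_score(dae_reconstruction, x, sigma))  # approximately -x = [-0.5, 1.0]
```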
Most of the asymptotic properties that we obtain by taking the limit as the Gaussian noise level σ goes to 0 could be derived for any family of noise distributions that approaches a point mass distribution in a relatively “nice” way.
Proposition 1 Let p be the probability density function of the data. Consider a DAE using the expected quadratic loss and corruption noise N(x) = x + ε, with ε ∼ N(0, σ²I). If we assume that the non-parametric solution r_σ(x) satisfies
$$r_\sigma(x) = x + o(1) \quad \text{as } \sigma \to 0,$$
then we have that
$$\mathcal{L}_{\mathrm{DAE}} = \mathbb{E}\left[ \|x - r(x)\|^2 + \sigma^2 \left\| \frac{\partial r(x)}{\partial x} \right\|_F^2 \right] + o(\sigma^2) \quad \text{as } \sigma \to 0. \qquad (6)$$

This is an analytic version of the denoising criterion with small noise σ², and it also corresponds to a contractive auto-encoder with contraction on both f and g, i.e., on r, which we refer to as the reconstruction contractive auto-encoder (RCAE).
Because of the similarity between the DAE and the RCAE when taking λ = σ², and because the semantics of σ² is clearer (as a squared distance in input space), we will use σ² to denote the penalty coefficient in situations involving the RCAE. For example, in the statement of Theorem 2, this σ² is just a positive constant; there is no notion of additive Gaussian noise, i.e., σ² does not explicitly refer to a variance, but using the notation σ² makes it easier to see intuitively the connection to the DAE setting.

2. In the CAE there is also a contractive effect on g(·) as a side effect of the parameterization with weights tied between f(·) and g(·).
The connection between the DAE and the RCAE established in Proposition 1 also serves as the basis for an alternative proof of Theorem 1, in which we study the asymptotic behavior of the RCAE solution. This result is contained in the following theorem.
Theorem 2 Let p be the probability density function of the data, assumed to be smooth and nonzero everywhere, and let r*_{σ²} denote the reconstruction function that minimizes the expected criterion
$$\mathcal{L}_{\sigma^2} = \mathbb{E}\left[ \|x - r(x)\|^2 + \sigma^2 \left\| \frac{\partial r(x)}{\partial x} \right\|_F^2 \right] \qquad (7)$$
for a given positive constant σ². Then
$$r^*_{\sigma^2}(x) = x + \sigma^2 \frac{\partial \log p(x)}{\partial x} + o(\sigma^2) \quad \text{as } \sigma^2 \to 0. \qquad (8)$$
Moreover, we also have the following expression for the derivative:
$$\frac{\partial r^*_{\sigma^2}(x)}{\partial x} = I + \sigma^2 \frac{\partial^2 \log p(x)}{\partial x^2} + o(\sigma^2) \quad \text{as } \sigma^2 \to 0. \qquad (9)$$
The proof is given in the appendix and uses the Euler-Lagrange equations from the
calculus of variations.
In practice, the auto-encoder is trained by minimizing the empirical loss
$$\hat{\mathcal{L}} = \frac{1}{N} \sum_{n=1}^{N} \left( \left\| r(x^{(n)}) - x^{(n)} \right\|^2 + \sigma^2 \left\| \left. \frac{\partial r(x)}{\partial x} \right|_{x = x^{(n)}} \right\|_F^2 \right)$$
based on a sample $\{x^{(n)}\}_{n=1}^{N}$ drawn from p(x).
Alternatively, the auto-encoder is trained online (by stochastic gradient updates) with a
stream of examples x(n) , which corresponds to performing stochastic gradient descent on the
expected loss (7). In both cases we obtain an auto-encoder that approximately minimizes
the expected loss.
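To illustrate what this empirical loss computes, the following sketch evaluates it for an arbitrary reconstruction function, approximating the Jacobian ∂r(x)/∂x by finite differences. The reconstruction function and the data below are placeholders chosen for the example, not the ones used in the paper.

```python
import numpy as np

def jacobian_fd(r, x, eps=1e-5):
    """Finite-difference approximation of the d x d Jacobian of r at x."""
    d = x.shape[0]
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        J[:, i] = (r(x + e) - r(x - e)) / (2 * eps)
    return J

def empirical_rcae_loss(r, X, sigma2):
    """Mean over the sample of ||r(x) - x||^2 + sigma^2 * ||dr/dx||_F^2."""
    total = 0.0
    for x in X:
        rec_err = np.sum((r(x) - x) ** 2)
        contraction = np.sum(jacobian_fd(r, x) ** 2)  # squared Frobenius norm
        total += rec_err + sigma2 * contraction
    return total / len(X)

# Placeholder example: a slightly contractive linear map r(x) = 0.9 x.
X = np.random.default_rng(0).normal(size=(100, 3))
print(empirical_rcae_loss(lambda x: 0.9 * x, X, sigma2=0.01))
```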
An interesting question is the following: what can we infer about the data-generating density when given an auto-encoder reconstruction function r(x)?
The premise is that this auto-encoder r(x) was trained to approximately minimize a
loss function that has exactly the form of (7) for some σ 2 > 0. This is assumed to have
been done through minimizing the empirical loss, with the distribution p only available indirectly through the samples $\{x^{(n)}\}_{n=1}^{N}$. We do not have access to p or to the samples.
In this section we train both a DAE and an RCAE non-parametrically by minimizing a discretized version of their losses defined by equations (2) and (6). The goal here is to show that, for either a DAE or an RCAE, the approximation of the score that we get through Equation 5 gets arbitrarily close to the actual score ∂ log p(x)/∂x as σ → 0.
The distribution p(x) studied was created to be simple enough to illustrate the mechanics involved. We plot p(x) in Figure 3 (left) along with the score of p(x) (right).

(a) p(x) = (1/Z) exp(−E(x))    (b) ∂ log p(x)/∂x = −∂E(x)/∂x

Figure 3: The density p(x) and its score for a simple one-dimensional example.
The model r̂(x) is fitted by dividing the interval [−1.5, 1.5] into M = 1000 partition
points x1 , . . . , xM evenly separated by a distance ∆. The discretized version of the RCAE
loss function is
$$\sum_{i=1}^{M} p(x_i)\,\Delta\, \left( \hat{r}(x_i) - x_i \right)^2 + \sigma^2 \sum_{i=1}^{M-1} p(x_i)\,\Delta \left( \frac{\hat{r}(x_{i+1}) - \hat{r}(x_i)}{\Delta} \right)^2. \qquad (11)$$
Every value r̂(xi ) for i = 1, . . . , M is treated as a free parameter. Setting to 0 the derivative
with respect to the r̂(xi ) yields a system of linear equations in M unknowns that we can
solve exactly. From that RCAE solution r̂ we get an approximation of the score of p at each
point xi . A similar thing can be done for the DAE by using a discrete version of the exact
solution (3) from Theorem 1. We now have two ways of approximating the score of p.
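The exact minimization described above is easy to reproduce. The sketch below builds the quadratic loss (11) on a grid, solves the resulting linear system for the values r̂(x_i), and compares the recovered score (r̂(x) − x)/σ² to the true score. The density used here is a simple two-component Gaussian mixture chosen for illustration; it is not the exact density of Figure 3, and the value of σ is arbitrary.

```python
import numpy as np

# Grid on [-1.5, 1.5] and a simple unnormalized density (illustrative stand-in).
M = 1000
xs = np.linspace(-1.5, 1.5, M)
delta = xs[1] - xs[0]
p = np.exp(-0.5 * ((xs + 0.6) / 0.2) ** 2) + np.exp(-0.5 * ((xs - 0.6) / 0.2) ** 2)

sigma = 0.1
w = p * delta            # the weights p(x_i) * Delta appearing in (11)
c = sigma**2 / delta**2  # coefficient of each squared finite difference

# Loss (11) is quadratic in the free parameters r_i; setting its gradient to zero
# gives the linear system A r = w * x.
A = np.diag(w)
for i in range(M - 1):
    A[i, i] += c * w[i]
    A[i + 1, i + 1] += c * w[i]
    A[i, i + 1] -= c * w[i]
    A[i + 1, i] -= c * w[i]
r_hat = np.linalg.solve(A, w * xs)

# Score estimate from the RCAE solution, compared with the true score of p.
score_est = (r_hat - xs) / sigma**2
score_true = np.gradient(np.log(p), xs)
print(np.mean(np.abs(score_est - score_true)))  # shrinks as sigma is made smaller
```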
In Figure 4 we compare the approximations to the actual score of p for decreasing values of σ ∈ {1.00, 0.31, 0.16, 0.06}.
Two-dimensional data points (x, y) were generated along a spiral according to the fol-
lowing equations:
A denoising auto-encoder was trained with Gaussian corruption noise σ = 0.01. The
encoder is f (x) = tanh(b + W x) and the decoder is g(h) = c + V h. The parameters
(b, c, V, W ) are optimized by BFGS to minimize the average squared error, using a fixed
training set of 10 000 samples (i.e., the same corruption noises were sampled once and for
all). We found better results with untied weights, and BFGS gave more accurate models
than stochastic gradient descent. We used 1000 hidden units and ran BFGS for 1000
iterations.
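For readers who want to reproduce a qualitatively similar experiment, the sketch below trains a small DAE of this form (tanh encoder, affine decoder, untied weights) on synthetic 2-D spiral data. The spiral construction, the number of hidden units, and the plain gradient-descent optimizer are illustrative assumptions; the exact generating equations, the 1000 hidden units, and the BFGS optimization used in the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D spiral data (a stand-in for the paper's construction).
t = rng.uniform(3.0, 12.0, size=10_000)
X = 0.1 * np.stack([t * np.sin(t), t * np.cos(t)], axis=1)

sigma = 0.01                               # corruption noise level
Xc = X + sigma * rng.normal(size=X.shape)  # corruption sampled once and for all

d, h = 2, 100
W = 0.1 * rng.normal(size=(h, d)); b = np.zeros(h)  # encoder f(x) = tanh(Wx + b)
V = 0.1 * rng.normal(size=(d, h)); c = np.zeros(d)  # decoder g(h) = c + Vh (untied)

lr = 0.1
for step in range(2000):          # plain full-batch gradient descent
    H = np.tanh(Xc @ W.T + b)     # hidden representation of the corrupted inputs
    R = H @ V.T + c               # reconstruction r(N(x))
    err = R - X                   # squared error measured against the clean x
    loss = np.mean(np.sum(err ** 2, axis=1))
    dR = 2 * err / len(X)         # backpropagation through decoder and encoder
    gV, gc = dR.T @ H, dR.sum(axis=0)
    dH = (dR @ V) * (1 - H ** 2)
    gW, gb = dH.T @ Xc, dH.sum(axis=0)
    W -= lr * gW; b -= lr * gb; V -= lr * gV; c -= lr * gc

def r(x):
    """Trained reconstruction function; r(x) - x approximates sigma^2 times the score."""
    return np.tanh(x @ W.T + b) @ V.T + c

print(loss)
```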
The non-convexity of the problem makes it such that the solution found depends on the
initialization parameters. The random corruption noise used can also influence the final
outcome. Moreover, the fact that we are using a finite training sample size with reasonably
small noise may allow for undesirable behavior of r in regions far away from the training
samples. For those reasons, we trained the model multiple times and selected two of the
most visually appealing outcomes. These are found in Figure 5 which features a more global
perspective along with a close-up view.
(a) r(x) − x vector field, acting as sink, zoomed out (b) r(x) − x vector field, close-up
Figure 5: The original 2-D data from the data-generating density p(x) is plotted along with the vector field defined by the values of r(x) − x for trained auto-encoders (corresponding to the estimation of the score ∂ log p(x)/∂x).
Figure 5 shows the data along with the learned score function (shown as a vector field).
We see that the vector field points towards the nearest high-density point on the data
manifold. The vector field is close to zero near the manifold (i.e., the reconstruction error is
close to zero), also corresponding to peaks of the implicitly estimated density. The points
on the manifolds play the role of sinks for the vector field. Other places where reconstruction
error may be low, but where the implicit density is not high, are sources of the vector field.
In Figure 5(b) we can see that we have that kind of behavior halfway between two sections
of the manifold. This shows that reconstruction error plays a very different role from what was previously hypothesized: whereas in Ranzato et al. (2008) the reconstruction error was viewed as an energy function, our analysis suggests that in regularized auto-encoders it is the norm of an approximate score, i.e., of the derivative of the energy w.r.t. the input. Note
that the norm of the score should be small near training examples (corresponding to local
maxima of density) but it could also be small at other places corresponding to local minima
of density. This is indeed what happens in the spiral example shown. It may happen whenever there are high-density regions separated by a low-density region: tracing paths from one high-density region to another should cross a “median” lower-dimensional region (a manifold) where the density has a local minimum along the path direction. The reason such a median region is needed is because at these points the vectors r(x) − x must change
sign: on one side of the median they point to one of the high-density regions while on the
other side they point to the other, as clearly visible in Figure 5(b) between the arms of the
spiral.
We believe that this analysis is valid not just for contractive and denoising auto-encoders,
but for regularized auto-encoders in general. The intuition behind this statement can be
firmed up by analyzing Figure 2: the score-like behavior of r(x) − x arises simply out of the
opposing forces of (a) trying to make r(x) = x at the training examples and (b) trying to
make r(x) as regularized as possible (as close to a constant as possible).
Note that previous work (Rifai et al., 2012; Bengio et al., 2013b) has already shown that
contractive auto-encoders (especially when they are stacked in a way similar to RBMs in
a deep belief net) learn good models of high-dimensional data (such as images), and that
these models can be used not only to obtain good representations for classification tasks but also to generate good-quality samples from the model, by a random walk near the manifold of high density. This was achieved by essentially following the vector field and adding noise along the way.
3.5 Missing σ²
When we are in the same setting as in Section 3.2 but the value of σ² is unknown, we can modify (10) a bit and avoid dividing by σ². That is, for a trained reconstruction function r(x) given to us, we just take the quantity r(x) − x and it should be approximately the score up to a multiplicative constant. We get that
$$r(x) - x \propto \frac{\partial \log p(x)}{\partial x}.$$
Equivalently, if one estimates the density via an energy function (the negative unnormalized log-density), then x − r(x) estimates the derivative of the energy function.
We still have to assume that σ² is small; otherwise, if the unknown σ² is too large, we might get a poor estimate of the score.
Conceptually, another way to see this is to argue that if such a function E0 (x) existed,
its second-order mixed derivatives should be equal. That is, we should have that
$$\frac{\partial^2 E_0(x)}{\partial x_i\, \partial x_j} = \frac{\partial^2 E_0(x)}{\partial x_j\, \partial x_i} \quad \forall\, i, j,$$
which is equivalent to
$$\frac{\partial r_i(x)}{\partial x_j} = \frac{\partial r_j(x)}{\partial x_i} \quad \forall\, i, j.$$
Again in the context of Section 3.3, with the parameterization used for that particular
kind of denoising auto-encoder, this would yield the constraint that Vᵀ = W. That is,
unless we are using tied weights, we know that no such potential E0 (x) exists, and yet when
running the experiments from Section 3.3 we obtained much better results with untied
weights. To make things worse, it can also be demonstrated that the energy function that
we get from tied weights leads to a distribution that is not normalizable (it has a divergent
integral over Rd ). In that sense, this suggests that we should not worry too much about
the exact parameterization of the denoising auto-encoder as long as it has the required
flexibility to approximate the optimal reconstruction function sufficiently well.
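Whether a given reconstruction function admits such a potential E0 can be probed numerically by checking the symmetry condition above with finite differences. The sketch below does this for two toy reconstruction functions, one that is the gradient of a potential and one built from "untied" linear maps that generally is not; these toy functions are illustrative and are not the parameterization of Section 3.3.

```python
import numpy as np

def jacobian_fd(r, x, eps=1e-5):
    """Finite-difference Jacobian J[i, j] = d r_i / d x_j at the point x."""
    d = x.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (r(x + e) - r(x - e)) / (2 * eps)
    return J

def max_asymmetry(r, x):
    """Largest violation of dr_i/dx_j = dr_j/dx_i at x."""
    J = jacobian_fd(r, x)
    return np.max(np.abs(J - J.T))

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 3))
W = rng.normal(size=(3, 3))  # untied: W is not V.T in general
x = rng.normal(size=3)

r_potential = lambda x: V.T @ np.tanh(V @ x)  # gradient of sum(log cosh(Vx)): symmetric Jacobian
r_untied = lambda x: V @ np.tanh(W @ x)       # generally admits no potential: asymmetric Jacobian

print(max_asymmetry(r_potential, x))  # ~0 up to finite-difference error
print(max_asymmetry(r_untied, x))     # clearly nonzero
```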
which is the denoising criterion. This says that when the reconstruction function r is pa-
rameterized so as to correspond to the score ψ of a model density (as per Equation 12, and
where ψ is a derivative of some log-density), the denoising criterion on r with Gaussian cor-
ruption noise is equivalent to score matching with respect to a smoothed version of the data-generating density, i.e., a regularized form of score matching. Note that this regularization appears
desirable, because matching the score of the empirical distribution (or an insufficiently
smoothed version of it) could yield undesirable results when the training set is finite. Since
score matching has been shown to be a consistent induction principle (Hyvärinen, 2005), it
means that this denoising score matching (Vincent, 2011; Kingma and LeCun, 2010; Swer-
sky et al., 2011) criterion recovers the underlying density, up to the smoothing induced by
the noise of variance σ 2 . By making σ 2 small, we can make the estimator arbitrarily good
(and we would expect to want to do that as the amount of training data increases). Note
the correspondence of this conclusion with the results presented here, which show (1) the
equivalence between the RCAE’s regularization coefficient and the DAE’s noise variance σ 2 ,
and (2) that minimizing the equivalent analytic criterion (based on a contraction penalty)
estimates the score when σ 2 is small. The difference is that our result holds even when r
is not parameterized as per Equation 12, i.e., is not forced to correspond with the score
function of a density.
Besides first and second derivatives of the density, other local properties of the density
are its local mean and local covariance, discussed in the Appendix, Section D.
4.2 Sampling
With Equation 13 from Section 4.1 we can perform approximate sampling from the es-
timated distribution, using the score estimator to approximate energy differences which
are needed in the Metropolis-Hastings accept/reject decision. Using a symmetric proposal
q(x∗ |x), the acceptance ratio is
$$\alpha = \frac{p(x^*)}{p(x)} = \exp\left( -E(x^*) + E(x) \right),$$
which can be computed with (13) or approximated with (14) as long as we trust that
our DAE/RCAE was trained properly and has enough capacity to be a sufficiently good
estimator of ∂E/∂x. An example of this process is shown in Figure 6 in which we sample from a
density concentrated around a 1-d manifold embedded in a space of dimension 10. For this
particular task, we have trained only DAEs and we are leaving RCAEs out of this exercise.
Given that the data is roughly contained in the range [−1.5, 1.5] along all dimensions, we
selected a training noise level σtrain = 0.1 so that the noise would have an appreciable
effect while still being relatively small. As required by Theorem 1, we have used isotropic Gaussian noise of variance σ²_train.
The Metropolis-Hastings proposal q(x*|x) = N(x, σ²_MH I) has a noise parameter σ_MH that needs to be set. In the situation shown in Figure 6, we used σ_MH = 0.1. After some hyperparameter tweaking and exploring various scales for σ_train and σ_MH, we found that setting both to 0.1 worked well.
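As a concrete illustration of this sampling procedure, the sketch below runs a random-walk Metropolis-Hastings chain whose energy differences are approximated to first order using the score evaluated at the midpoint of the proposed move; this midpoint form is an assumption of the sketch, one reasonable first-order choice rather than necessarily the exact expression (14). The `score` function is a stand-in for the DAE-based estimate (r(x) − x)/σ²_train; to keep the sketch self-contained it returns the exact score of a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    """Stand-in for the DAE-based score estimate (r(x) - x) / sigma_train**2.

    Here it is the exact score of a standard Gaussian, namely -x, so that the
    sketch runs on its own; in practice this comes from the trained DAE.
    """
    return -x

def mh_sample(score, x0, n_steps=20_000, sigma_mh=0.1):
    """Random-walk Metropolis-Hastings with proposal x* = x + N(0, sigma_mh^2 I).

    The energy difference E(x*) - E(x) is approximated to first order by
    -(x* - x) . score((x + x*) / 2), i.e., with the score at the midpoint
    (an assumed form of the first-order approximation discussed in the text).
    """
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_star = x + sigma_mh * rng.normal(size=x.shape)
        delta_E = -np.dot(x_star - x, score((x + x_star) / 2.0))
        if rng.random() < np.exp(-delta_E):   # accept with probability min(1, e^{-dE})
            x = x_star
        samples.append(x.copy())
    return np.array(samples)

samples = mh_sample(score, x0=np.zeros(10), sigma_mh=0.1)
print(samples.mean(axis=0), samples.var(axis=0))  # roughly 0 and 1 in each dimension
```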
When σ_train is too large, the trained DAE learns a “blurry” version of the density that fails to represent the details that we are interested in. The samples shown in Figure 6 are very convincing in terms of being drawn from a distribution that models the original density well. We have to keep in mind that Theorem 2 describes the behavior as σ_train → 0, so we would expect the estimator to become worse as σ_train takes on larger values. In this particular case with σ_train = 0.1, it seems that we are instead modeling something like the original density to which isotropic Gaussian noise of variance σ²_train has been added.
In the other extreme, when σtrain is too small, the DAE is not exposed to any training
example farther away from the density manifold. This can lead to various kinds of strange
behaviors when the sampling algorithm falls into those regions and then has no idea what
to do there and how to get back to the high-density regions. We come back to that topic
in Section 4.3.
It would certainly be possible to pick a very small value for both, say σ_train = σ_MH = 0.01, to avoid the spurious maxima problem illustrated in Section 4.3. However, this leads to the same kind of mixing problems that any MCMC algorithm has: smaller values of σ_MH lead to higher acceptance ratios but worse mixing properties.
5. Conclusion
Whereas auto-encoders have long been suspected of capturing information about the data-
generating density, this work has clarified what some of them are actually doing, showing that they can implicitly recover the data-generating density altogether. We have
shown that regularized auto-encoders such as the denoising auto-encoder and a form of con-
Figure 6: Samples drawn from the estimate of ∂E/∂x given by a DAE by the Metropolis-Hastings method presented in Section 4. By design, the data density distribu-
tion is concentrated along a 1-d manifold embedded in a space of dimension 10.
This data can be visualized in the plots above by plotting pairs of dimensions
(x0 , x1 ), . . . , (x8 , x9 ), (x9 , x0 ), going in reading order from left to right and then
line by line. For each pair of dimensions, we show side by side the original data
(left) with the samples drawn (right).
tractive auto-encoder are closely related to each other and estimate local properties of the
data-generating density: the first derivative (score) and second derivative of the log-density,
as well as the local mean. This contradicts the previous interpretation of reconstruction
error as being an energy function (Ranzato et al., 2008) but is consistent with our exper-
imental findings. Our results do not require the reconstruction function to correspond to
the derivative of an energy function as in Vincent (2011), but hold simply by virtue of
minimizing the regularized reconstruction error training criterion. This suggests that min-
imizing a regularized reconstruction error may be an alternative to maximum likelihood
for unsupervised learning, avoiding the need for MCMC in the inner loop of training, as
in RBMs and deep Boltzmann machines, analogously to score matching (Hyvärinen, 2005; Vincent, 2011).
(a) DAE misbehaving when away from manifold (b) sampling getting trapped into bad attractor
Figure 7: (a) On the left we show a r(x) − x vector field similar to that of the earlier
Figure 5. The density is concentrated along a spiral manifold and we should have
the reconstruction function r bringing us back towards the density. In this case,
it works well in the region close to the spiral (the magnitude of the vectors is so
small that the arrows are shown as dots). However, things are out of control in
the regions outside. This is because the level of noise used during training was so
small that not enough of the training examples were found in those regions.
(b) On the right we sketch what may happen when we follow a sampling procedure
as described in Section 4.2. We start in a region of high density (in purple) and we
illustrate in red the trajectory that our samples may take. In that situation, the
DAE/RCAE was not trained properly. The resulting vector field does not reflect
the density accurately because it should not have this attractor (i.e., stable fixed
point) outside of the manifold on which the density is concentrated. Conceptually,
the sampling procedure visits that spurious attractor because it assumes that it
corresponds to a region of high probability. In some cases, this effect is regrettable
but not catastrophic, but in other situations we may end up with completely
unusable samples. In the experiments, training with enough of the examples
involving sufficiently large corruption noise typically eliminates that problem.
Toy experiments have confirmed that a good estimator of the density can be
obtained when this criterion is non-parametrically minimized. The experiments have also confirmed that an MCMC chain can be set up to approximately sample from the estimated model, by estimating energy differences to first order (which only requires the score) in an approximate Metropolis-Hastings procedure.
Many questions remain open and deserve further study. A big question is how to gener-
alize these ideas to discrete data, since we have heavily relied on the notions of scores, i.e.,
of derivatives with respect to x. A natural extension of the notion of score that could be
applied to discrete data is the notion of relative energy, or energy difference between a point
x and a perturbation x̃ of x. This notion has already been successfully applied to obtain the
equivalent of score matching for discrete models, namely ratio matching (Hyvärinen, 2007).
More generally, we would like to generalize to any form of reconstruction error (for exam-
ple many implementations of auto-encoders use a Bernoulli cross-entropy as reconstruction
loss function) and any (reasonable) form of corruption noise (many implementations use
masking or salt-and-pepper noise, not just Gaussian noise). More fundamentally, the need
to rely on σ → 0 is troubling, and getting rid of this limitation would also be very useful. A
possible solution to this limitation, as well as adding the ability to handle both discrete and
continuous variables, has recently been proposed while this article was under review (Bengio
et al., 2013a).
It would also be interesting to generalize the results presented here to other regularized
auto-encoders besides the denoising and contractive types. In particular, the commonly
used sparse auto-encoders seem to fit the qualitative pattern illustrated in Figure 2, where a score-
like vector field arises out of the opposing forces of minimizing reconstruction error and
regularizing the auto-encoder.
We have mostly considered the harder case where the auto-encoder parameterization
does not guarantee the existence of an analytic formulation of an energy function. It would
be interesting to compare experimentally and study mathematically these two formulations
to assess how much is lost (because the score function may be somehow inconsistent) or
gained (because of the less constrained parameterization).
Acknowledgments
The authors thank Salah Rifai, Max Welling, Yutian Chen and Pascal Vincent for fruitful
discussions, and acknowledge the funding support from NSERC, Canada Research Chairs
and CIFAR.
which can be differentiated with respect to the quantity r(x̃) and set equal to 0. Denoting the optimum by r*(x̃), we get
$$r^*(\tilde{x}) = \frac{\mathbb{E}_\varepsilon\left[ p(\tilde{x} - \varepsilon)\,(\tilde{x} - \varepsilon) \right]}{\mathbb{E}_\varepsilon\left[ p(\tilde{x} - \varepsilon) \right]}, \qquad (19)$$
where the expectation is over ε ∼ N(0, σ²I).
Conceptually, this means that the optimal DAE reconstruction function at every point
x̃ ∈ Rd is given by a kind of convolution involving the density function p, or weighted average
from the points in the neighbourhood of x̃, depending on how we would like to view it. A
higher noise level σ means that a larger neighbourhood of x̃ is taken into account. Note
that the total quantity of “mass” being included in the weighted average of the numerator
of (19) is found again at the denominator.
where in the second line we used the independence of the noise from x and properties of the trace, while in the last line we used E[εεᵀ] = σ²I and E[ε] = 0 by the definition of ε.
to observe that the components r1 (x), . . . , rd (x) can each be optimized separately.
Writing f for the integrand of the loss functional and r_{x_i} for the partial derivative ∂r/∂x_i, the Euler-Lagrange computation gives
$$\frac{\partial f}{\partial r_{x_i}} = 2\sigma^2 p(x) \left[ \frac{\partial r_1}{\partial x_i}\;\; \frac{\partial r_2}{\partial x_i}\;\; \cdots\;\; \frac{\partial r_d}{\partial x_i} \right]^T$$
$$\frac{\partial}{\partial x_i} \frac{\partial f}{\partial r_{x_i}} = 2\sigma^2 \frac{\partial p(x)}{\partial x_i} \left[ \frac{\partial r_1}{\partial x_i}\;\; \cdots\;\; \frac{\partial r_d}{\partial x_i} \right]^T + 2\sigma^2 p(x) \left[ \frac{\partial^2 r_1}{\partial x_i^2}\;\; \cdots\;\; \frac{\partial^2 r_d}{\partial x_i^2} \right]^T$$
so that
$$(r(x) - x)\, p(x) = \sigma^2 \sum_{i=1}^{d} \begin{bmatrix} \frac{\partial p(x)}{\partial x_i} \frac{\partial r_1}{\partial x_i} + p(x) \frac{\partial^2 r_1}{\partial x_i^2} \\ \vdots \\ \frac{\partial p(x)}{\partial x_i} \frac{\partial r_d}{\partial x_i} + p(x) \frac{\partial^2 r_d}{\partial x_i^2} \end{bmatrix}. \qquad (22)$$
As Equation 21 hinted, the expression (22) can be decomposed into the different components r_k(x) : ℝ^d → ℝ that make up r. For k = 1, ..., d we get
$$(r_k(x) - x_k)\, p(x) = \sigma^2 \sum_{i=1}^{d} \left( \frac{\partial p(x)}{\partial x_i} \frac{\partial r_k(x)}{\partial x_i} + p(x) \frac{\partial^2 r_k(x)}{\partial x_i^2} \right).$$
As p(x) ≠ 0 by hypothesis, we can divide all the terms by p(x) and note that (∂p(x)/∂x_i)/p(x) = ∂ log p(x)/∂x_i.
We get
$$r_k(x) - x_k = \sigma^2 \sum_{i=1}^{d} \left( \frac{\partial \log p(x)}{\partial x_i} \frac{\partial r_k(x)}{\partial x_i} + \frac{\partial^2 r_k(x)}{\partial x_i^2} \right). \qquad (23)$$
The first thing to observe is that when σ² = 0 the solution is just r_k(x) = x_k, which translates into r(x) = x. This is not a surprise because it represents the perfect reconstruction value that we get when the penalty term vanishes in the loss function.
This linear partial differential Equation 23 can be used as a recursive relation for r_k(x) to obtain a Taylor series in σ². The goal is to obtain an expression of the form
$$r(x) = x + \sigma^2 h(x) + o(\sigma^2) \quad \text{as } \sigma^2 \to 0,$$
where we can solve for h(x) and for which we also have that
$$\frac{\partial r(x)}{\partial x} = I + \sigma^2 \frac{\partial h(x)}{\partial x} + o(\sigma^2) \quad \text{as } \sigma^2 \to 0.$$
We can substitute in the right-hand side of Equation 23 the value for r_k(x) that we get from Equation 23 itself. This substitution would be pointless in any other situation where we are not trying to get a power series in terms of σ² around 0.
$$\begin{aligned}
r_k(x) &= x_k + \sigma^2 \sum_{i=1}^{d} \left( \frac{\partial \log p(x)}{\partial x_i} \frac{\partial r_k(x)}{\partial x_i} + \frac{\partial^2 r_k(x)}{\partial x_i^2} \right) \\
&= x_k + \sigma^2 \sum_{i=1}^{d} \frac{\partial \log p(x)}{\partial x_i} \frac{\partial}{\partial x_i} \left( x_k + \sigma^2 \sum_{j=1}^{d} \left( \frac{\partial \log p(x)}{\partial x_j} \frac{\partial r_k(x)}{\partial x_j} + \frac{\partial^2 r_k(x)}{\partial x_j^2} \right) \right) + \sigma^2 \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2} \\
&= x_k + \sigma^2 \sum_{i=1}^{d} \frac{\partial \log p(x)}{\partial x_i}\, \mathbb{I}(i = k) + \sigma^2 \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2} + \sigma^4 \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{\partial \log p(x)}{\partial x_i} \frac{\partial}{\partial x_i} \left( \frac{\partial \log p(x)}{\partial x_j} \frac{\partial r_k(x)}{\partial x_j} + \frac{\partial^2 r_k(x)}{\partial x_j^2} \right) \\
r_k(x) &= x_k + \sigma^2 \frac{\partial \log p(x)}{\partial x_k} + \sigma^2 \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2} + \sigma^4 \rho(\sigma^2, x)
\end{aligned}$$
Now we would like to get rid of that $\sigma^2 \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2}$ term by showing that it involves only powers of σ⁴ or higher. We get this by differentiating the expression for r_k(x) above twice with respect to some x_l.
$$\frac{\partial r_k(x)}{\partial x_l} = \mathbb{I}(k = l) + \sigma^2 \frac{\partial^2 \log p(x)}{\partial x_l\, \partial x_k} + \sigma^2 \frac{\partial}{\partial x_l} \left( \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2} + \sigma^2 \rho(\sigma^2, x) \right)$$
$$\frac{\partial^2 r_k(x)}{\partial x_l^2} = \sigma^2 \frac{\partial^3 \log p(x)}{\partial x_l^2\, \partial x_k} + \sigma^2 \frac{\partial^2}{\partial x_l^2} \left( \sum_{i=1}^{d} \frac{\partial^2 r_k(x)}{\partial x_i^2} + \sigma^2 \rho(\sigma^2, x) \right)$$
Since σ² is a common factor in all the terms of the expression for ∂²r_k(x)/∂x_l², we get what we needed. That is,
$$r_k(x) = x_k + \sigma^2 \frac{\partial \log p(x)}{\partial x_k} + \sigma^4 \eta(\sigma^2, x).$$
For a small radius δ, the local mean $m_\delta(x_0) = \frac{1}{Z_\delta(x_0)} \int_{B_\delta(x_0)} x\, p(x)\, dx$ of the density in the ball $B_\delta(x_0)$ satisfies
$$m_\delta(x_0) = x_0 + \frac{1}{d+2}\, \delta^2 \left. \frac{\partial \log p(x)}{\partial x} \right|_{x_0} + o(\delta^3).$$
This links the local mean of a density with the score associated with that density.
Combining this theorem with Theorem 2, we obtain that the optimal reconstruction function
r∗ (·) also estimates the local mean:
$$m_\delta(x) - x = \frac{\delta^2}{\sigma^2 (d+2)} \left( r^*(x) - x \right) + A(\delta) + \delta^2 B(\sigma^2) \qquad (25)$$
where
$$A(\delta) \in o(\delta^3) \text{ as } \delta \to 0, \qquad B(\sigma^2) \in o(1) \text{ as } \sigma^2 \to 0.$$
This means that we can loosely estimate the direction to the local mean by the direction of the reconstruction:
$$m_\delta(x) - x \propto r^*(x) - x. \qquad (26)$$
Proposition 4 Let $Z_\delta(x_0) = \int_{B_\delta(x_0)} p(x)\, dx$ denote the probability mass of p in the ball $B_\delta(x_0)$ of radius δ around x_0. Then
$$Z_\delta(x_0) = \frac{\pi^{d/2}}{\Gamma(1 + d/2)}\, \delta^d \left( p(x_0) + \delta^2 \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3) \right)$$
where $H(x_0) = \left. \frac{\partial^2 p(x)}{\partial x^2} \right|_{x = x_0}$. Moreover, we have that
$$\frac{1}{Z_\delta(x_0)} = \delta^{-d}\, \frac{\Gamma(1 + d/2)}{\pi^{d/2}} \left( \frac{1}{p(x_0)} - \delta^2\, \frac{1}{p(x_0)^2}\, \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3) \right).$$
Proof
$$\begin{aligned}
Z_\delta(x_0) &= \int_{B_\delta(x_0)} \left[ p(x_0) + \left.\frac{\partial p(x)}{\partial x}\right|_{x_0} (x - x_0) + \frac{1}{2!} (x - x_0)^T H(x_0)(x - x_0) + \frac{1}{3!} D^{(3)} p(x_0)(x - x_0) + o(\delta^3) \right] dx \\
&= p(x_0) \int_{B_\delta(x_0)} dx + 0 + \frac{1}{2} \int_{B_\delta(x_0)} (x - x_0)^T H(x_0)(x - x_0)\, dx + 0 + o(\delta^{d+3}) \\
&= p(x_0)\, \delta^d \frac{\pi^{d/2}}{\Gamma(1 + d/2)} + \delta^{d+2} \frac{\pi^{d/2}}{4\,\Gamma(2 + d/2)} \operatorname{Tr}(H(x_0)) + o(\delta^{d+3}) \\
&= \frac{\pi^{d/2}}{\Gamma(1 + d/2)}\, \delta^d \left( p(x_0) + \delta^2 \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3) \right)
\end{aligned}$$
We use Proposition 10 to make the trace come out of the integral involving H(x_0). The expression for 1/Z_δ(x_0) comes from the fact that, for any a, b > 0, we have
$$\frac{1}{a + b\delta^2 + o(\delta^3)} = \frac{a^{-1}}{1 + \frac{b}{a}\delta^2 + o(\delta^3)} = \frac{1}{a}\left( 1 - \left( \frac{b}{a}\delta^2 + o(\delta^3) \right) + o(\delta^4) \right) = \frac{1}{a} - \frac{b}{a^2}\delta^2 + o(\delta^3) \quad \text{as } \delta \to 0,$$
by using the classic geometric-series result $\frac{1}{1+r} = 1 - r + r^2 - \cdots$ for $|r| < 1$. Now we just apply this to
$$\frac{1}{Z_\delta(x_0)} = \delta^{-d}\, \frac{\Gamma(1 + d/2)}{\pi^{d/2}}\, \frac{1}{p(x_0) + \delta^2 \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3)}.$$
$$m_\delta(x_0) = x_0 + \frac{1}{d+2}\, \delta^2 \left. \frac{\partial \log p(x)}{\partial x} \right|_{x_0} + o(\delta^3).$$
Proof
The leading term in the expression for m_δ(x_0) is obtained by rewriting the x in the integral as x_0 + (x − x_0), which makes the integral easier to evaluate:
$$m_\delta(x_0) = \frac{1}{Z_\delta(x_0)} \int_{B_\delta(x_0)} x\, p(x)\, dx = x_0 + \frac{1}{Z_\delta(x_0)} \int_{B_\delta(x_0)} (x - x_0)\, p(x)\, dx.$$
Z "
1 ∂p(x)
mδ (x0 ) = x0 + (x − x0 ) p(x0 ) + (x − x0 )
Zδ (x0 ) Bδ (x0 ) ∂x x0
#
1 ∂ 2 p(x)
+ (x − x0 )T (x − x0 ) + o(kx − x0 k2 ) dx.
2 ∂x2 x0
Remember that $\int_{B_\delta(x_0)} f(x)\, dx = 0$ whenever the function f is anti-symmetric (or “odd”) relative to the point x_0 (i.e., f(x_0 + u) = −f(x_0 − u)). This applies to the terms $(x - x_0)\, p(x_0)$ and $(x - x_0)\,(x - x_0)^T \left.\frac{\partial^2 p(x)}{\partial x^2}\right|_{x = x_0} (x - x_0)$. Hence we use Proposition 9 to get
Z " #
1 ∂p(x)
mδ (x0 ) = x0 + (x − x0 )T (x − x0 ) + o(kx − x0 k3 ) dx
Zδ (x0 ) Bδ (x0 ) ∂x x0
d
!
1 π 2 ∂p(x)
= x0 + δ d+2 d
+ o(δ 3 ).
Zδ (x0 ) 2Γ 2 + 2 ∂x x0
Now, looking at the coefficient in front of $\left.\frac{\partial p(x)}{\partial x}\right|_{x_0}$ in the first term, we can use Proposition 4 to rewrite it as
$$\begin{aligned}
\frac{1}{Z_\delta(x_0)}\, \delta^{d+2} \frac{\pi^{d/2}}{2\Gamma\!\left(2 + \frac{d}{2}\right)}
&= \delta^{-d} \frac{\Gamma(1 + d/2)}{\pi^{d/2}} \left( \frac{1}{p(x_0)} - \delta^2 \frac{1}{p(x_0)^2} \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3) \right) \delta^{d+2} \frac{\pi^{d/2}}{2\Gamma\!\left(2 + \frac{d}{2}\right)} \\
&= \delta^2 \frac{\Gamma\!\left(1 + \frac{d}{2}\right)}{2\Gamma\!\left(2 + \frac{d}{2}\right)} \left( \frac{1}{p(x_0)} - \delta^2 \frac{1}{p(x_0)^2} \frac{\operatorname{Tr}(H(x_0))}{2(d+2)} + o(\delta^3) \right) = \delta^2 \frac{1}{p(x_0)} \frac{1}{d+2} + o(\delta^3).
\end{aligned}$$
There is no reason to keep the term $-\delta^4 \frac{\Gamma(1 + d/2)}{2\Gamma(2 + d/2)} \frac{1}{p(x_0)^2} \frac{\operatorname{Tr}(H(x_0))}{2(d+2)}$ in the above expression, because the asymptotic error from the remainder term in the main expression is o(δ³). That error would swallow our exact expression for the δ⁴ term and make it useless.
We end up with
$$m_\delta(x_0) = x_0 + \frac{1}{d+2}\, \delta^2 \left. \frac{\partial \log p(x)}{\partial x} \right|_{x_0} + o(\delta^3).$$
Corollary 7 Let B = B_1(0) ⊂ ℝ^d be the unit ball around the origin. Then
$$\int_{B} \prod_{j=1}^{d} x_j^{a_j}\, dx = \begin{cases} \dfrac{\prod_{j=1}^{d} \Gamma\!\left(\frac{a_j+1}{2}\right)}{\Gamma\!\left(1 + \frac{d}{2} + \frac{1}{2}\sum_j a_j\right)} & \text{if all the } a_j \text{ are even integers} \\ 0 & \text{otherwise} \end{cases}$$
for any non-negative integers a_j ≥ 0. Note the absence of the absolute values on the $x_j^{a_j}$ terms.
Corollary 8 Let B_δ(0) ⊂ ℝ^d be the ball of radius δ around the origin. Then
$$\int_{B_\delta(0)} \prod_{j=1}^{d} x_j^{a_j}\, dx = \begin{cases} \delta^{d + \sum_j a_j}\, \dfrac{\prod_{j=1}^{d} \Gamma\!\left(\frac{a_j+1}{2}\right)}{\Gamma\!\left(1 + \frac{d}{2} + \frac{1}{2}\sum_j a_j\right)} & \text{if all the } a_j \text{ are even integers} \\ 0 & \text{otherwise} \end{cases}$$
for any non-negative integers a_j ≥ 0. Note the absence of the absolute values on the $x_j^{a_j}$ terms.
Proof
We take the theorem as given and concentrate here on justifying the two corollaries.
Note how in Corollary 7 we dropped the absolute values that were in the original Theorem 6. In situations where at least one a_j is odd, the function $f(x) = \prod_{j=1}^{d} x_j^{a_j}$ becomes odd in the sense that f(−x) = −f(x). Because of the symmetrical nature of the integration over the unit ball, we get that the integral is 0 as a result of cancellations.
For Corollary 8, we can rewrite the integral by changing the domain with yj = xj /δ so
that
$$\delta^{-\sum_j a_j} \int_{B_\delta(0)} \prod_{j=1}^{d} x_j^{a_j}\, dx = \int_{B_\delta(0)} \prod_{j=1}^{d} \left( \frac{x_j}{\delta} \right)^{a_j} dx = \int_{B_1(0)} \prod_{j=1}^{d} y_j^{a_j}\, \delta^d\, dy.$$
We pull out the δ^d that we got from the determinant of the Jacobian when changing from dx to dy, and Corollary 8 follows.
Proposition 9 Let v ∈ ℝ^d and let B_δ(0) ⊂ ℝ^d be the ball of radius δ around the origin. Then
$$\int_{B_\delta(0)} y \left\langle v, y \right\rangle dy = \delta^{d+2}\, \frac{\pi^{d/2}}{2\Gamma\!\left(2 + \frac{d}{2}\right)}\, v.$$
Proof
We have that
$$\int_{B_\delta(0)} y \left\langle v, y \right\rangle dy = \int_{B_\delta(0)} \begin{bmatrix} v_1 y_1^2 \\ \vdots \\ v_d y_d^2 \end{bmatrix} dy$$
(the cross terms $v_j y_j y_i$ with i ≠ j integrate to zero by symmetry), which is decomposable into d component-wise applications of Corollary 8. This yields the expected result, with the constant obtained from $\Gamma\!\left(\frac{3}{2}\right) = \frac{1}{2}\Gamma\!\left(\frac{1}{2}\right) = \frac{1}{2}\sqrt{\pi}$.
Proposition 10 Let H ∈ ℝ^{d×d} and let B_δ(x_0) ⊂ ℝ^d be the ball of radius δ around x_0 ∈ ℝ^d. Then
$$\int_{B_\delta(x_0)} (x - x_0)^T H (x - x_0)\, dx = \delta^{d+2}\, \frac{\pi^{d/2}}{2\Gamma(2 + d/2)}\, \operatorname{trace}(H).$$
Proof
First, by substituting y = (x − x_0)/δ we have that this is equivalent to showing that
$$\int_{B_1(0)} y^T H y\, dy = \frac{\pi^{d/2}}{2\Gamma(2 + d/2)}\, \operatorname{trace}(H).$$
This integral yields a real number which can be written as
$$\int_{B_1(0)} y^T H y\, dy = \int_{B_1(0)} \sum_{i,j} y_i H_{i,j} y_j\, dy = \sum_{i,j} \int_{B_1(0)} y_i y_j H_{i,j}\, dy.$$
Now we know from Corollary 8 that this integral is zero when i ≠ j. This gives
$$\sum_{i,j} H_{i,j} \int_{B_1(0)} y_i y_j\, dy = \sum_{i} H_{i,i} \int_{B_1(0)} y_i^2\, dy = \frac{\pi^{d/2}}{2\Gamma(2 + d/2)}\, \operatorname{trace}(H).$$
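The identity in Proposition 10 is easy to check numerically. The following sketch compares a Monte Carlo estimate of the left-hand side, obtained by sampling uniformly in the ball, against the closed-form right-hand side; the dimension, radius, and matrix H below are arbitrary choices made for the check.

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)

d, delta = 4, 0.7
H = np.arange(d * d, dtype=float).reshape(d, d) / 10.0  # arbitrary test matrix

# Uniform samples in the ball B_delta(0): uniform direction, radius = delta * U^(1/d).
n = 2_000_000
u = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
radii = delta * rng.random(n) ** (1.0 / d)
y = radii[:, None] * u                                   # y plays the role of x - x0

vol = pi ** (d / 2) / gamma(1 + d / 2) * delta ** d      # volume of the ball
mc = vol * np.mean(np.einsum('ni,ij,nj->n', y, H, y))    # Monte Carlo integral estimate

exact = delta ** (d + 2) * pi ** (d / 2) / (2 * gamma(2 + d / 2)) * np.trace(H)

print(mc, exact)  # the two values agree up to Monte Carlo error
```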
References
Y. Bengio. Learning deep architectures for AI. Foundations & Trends in Mach. Learn., 2
(1):1–127, 2009.
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep
networks. In NIPS’2006, 2007.
Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In
ALT’2011, 2011.
Yoshua Bengio, Guillaume Alain, and Salah Rifai. Implicit density estimation by local
moment matching to sample from auto-encoders. Technical report, arXiv:1207.0057,
2012a.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. Technical report, arXiv:1206.5538, 2012b.
Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. Technical Report arXiv:1305.6663, Université de Montréal, 2013a.
Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep
representations. In ICML’13, 2013b.
Lawrence Cayton. Algorithms for manifold learning. Technical Report CS2008-0923, UCSD,
2005.
Karol Gregor, Arthur Szlam, and Yann LeCun. Structured sparse coding via lateral inhi-
bition. In NIPS’2011, 2011.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets.
Neural Computation, 18:1527–1554, 2006.
Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Aapo Hyvärinen. Some extensions of score matching. Computational Statistics and Data Analysis, 51:2499–2512, 2007.
Viren Jain and Sebastian H. Seung. Natural image denoising with convolutional networks.
In NIPS’2008, 2008.
Diederik Kingma and Yann LeCun. Regularized estimation of image statistics by score
matching. In NIPS’2010, 2010.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML'2009, 2009.

Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In NIPS'2010, 2010.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy
employed by V1? Vision Research, 37:3311–3325, 1997.
M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks.
In NIPS’2007, 2008.
Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The man-
ifold tangent classifier. In NIPS’2011, 2011a.
Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive
auto-encoders: Explicit invariance during feature extraction. In ICML’2011, 2011b.
Salah Rifai, Yoshua Bengio, Yann Dauphin, and Pascal Vincent. A generative process for
sampling contractive auto-encoders. In ICML’2012, 2012.
Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin Marlin, and Nando de Freitas. On autoencoders and score matching for energy based models. In ICML'2011, 2011.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural
Computation, 23(7), 2011.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML'2008, 2008.