Fast Inverse Transform Sampling in One and Two Dimensions

Sheehan Olver · Alex Townsend
S. Olver: School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia (sheehan.olver@sydney.edu.au, http://www.maths.usyd.edu.au/u/olver/). Supported by the ARC grant DE130100333.

A. Townsend: Mathematical Institute, 24-29 St Giles’, Oxford, OX1 3LB, UK (townsend@maths.ox.ac.uk, http://people.maths.ox.ac.uk/townsend/). Supported by the EPSRC grant EP/P505666/1 and the ERC grant FP7/2007-2013 to Nick Trefethen.

1 Introduction
Conventional wisdom holds that inverse transform sampling is computationally expensive, and that one should instead use methods that are designed to exploit certain special properties: e.g., the ziggurat algorithm [10] (requires monotonicity) and adaptive rejection sampling [7, 8] (requires log-concavity).
In contradiction to conventional wisdom, we show that inverse transform sam-
pling is computationally efficient and robust when combined with an adaptive
Chebyshev polynomial approximation scheme. Approximation by Chebyshev poly-
nomials converges superalgebraically fast for smooth functions, and hence the CDF
can be numerically computed to very high accuracy, i.e., close to machine preci-
sion. Rather than forming the inverse CDF, which is often nearly singular, we invert the CDF pointwise using a simple bisection method. This approach is robust, since it is guaranteed to converge to high accuracy, and fast, due to the simple representation of the CDF.
Furthermore, we extend our approach to probability distributions of two vari-
ables by exploiting recent developments in low rank function approximation to
achieve a highly efficient inverse transform sampling scheme in 2D. Numerical ex-
periments confirm that this approach outperforms existing sampling routines for
many black box distributions; see Section 4.
Remark 1 We use black box distribution to mean a continuous probability distribu-
tion that can only be evaluated pointwise$^1$. The only other piece of information
we assume is an interval [a, b], or a rectangular domain [a, b] × [c, d], containing the
numerical support of the distribution.
A Matlab implementation is publicly available from [14]. As an example, the
following syntax generates 2,000 samples from a 2D distribution:
[X,Y] = sample(@(x,y) exp(-x.^2-2.*y.^2).*(x-y).^2,[-3,3],[-3,3],2000);
The algorithm scales the input function to integrate to one, and generates pseudo-
random samples from the resulting probability distribution.
2 Inverse transform sampling with Chebyshev approximation

Let $f(x)$ be a prescribed non-negative function supported on the interval $[a,b]$. If the support of $f$ is the whole real line and $f$ is rapidly decaying, an interval $[a,b]$ can be selected outside of which $f$ is negligible. The algorithm will draw samples that match the probability distribution
draw samples that match the probability distribution
\[ f_X(x) = \frac{f(x)}{\int_a^b f(s)\,ds} \]
up to a desired tolerance $\epsilon$ (typically close to machine precision, such as $2.2\times10^{-16}$). First, we replace $f$ by a polynomial approximant $\tilde f$ on $[a,b]$, which we numerically compute such that $\|f - \tilde f\|_\infty \le \epsilon\,\|f\|_\infty$. That is, we construct a polynomial approximant $\tilde f$ of $f$ in the form
\[ \tilde f(x) = \sum_{k=0}^{n} \alpha_k\, T_k\!\left(\frac{2(x-a)}{b-a} - 1\right), \qquad \alpha_k \in \mathbb{R}, \quad x \in [a,b], \tag{1} \]
$^1$ Strictly, we require the distribution to be continuous with bounded variation [20], and
in practice this is almost always satisfied by continuous functions. We do not require the
distribution to be prescaled to integrate to one, nor that the distribution decays smoothly to
zero.
Input: A non-negative function $f(x)$, a finite interval $[a,b]$, and an integer $N$.
Output: Random samples $X_1,\dots,X_N$ drawn from $f_X = f\big/\int_a^b f(s)\,ds$.

Construct an approximant $\tilde f$ of $f$ on $[a,b]$.
Compute $\tilde F(x) = \int_a^x \tilde f(s)\,ds$ using (2).
Scale to a CDF, $\tilde F_X(x) = \tilde F(x)/\tilde F(b)$, if necessary.
Generate random samples $U_1,\dots,U_N$ from the uniform distribution on $[0,1]$.
for $k = 1,\dots,N$
    Solve $\tilde F_X(X_k) = U_k$ for $X_k$.
end

Fig. 1: Pseudocode for the inverse transform sampling algorithm with polynomial approximation using Chebyshev polynomials and Chebyshev grids.
where $T_k(x) = \cos(k\cos^{-1}x)$ is the degree $k$ Chebyshev polynomial. While the polynomial approximants (1) could be represented in other polynomial bases, such as the monomial basis, the Chebyshev basis is a particularly good choice for numerical computation as the expansion coefficients $\alpha_0,\dots,\alpha_n$ can be stably calculated in $O(n\log n)$ operations [11, 20]. This is accomplished by applying the fast cosine transform to function evaluations of $f$ taken from a Chebyshev grid in $[a,b]$ of size $n+1 = 2^k+1$, i.e., at the set of points

\[ x_j = \frac{b-a}{2}\left(\cos\frac{j\pi}{n} + 1\right) + a, \qquad 0 \le j \le n. \]
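For illustration, the following MATLAB sketch computes the coefficients $\alpha_k$ from function samples via the FFT. It is our own minimal version, not the implementation of [14]; the density, interval, and fixed degree n are assumptions for the example (in practice n is increased adaptively until the trailing coefficients fall below $\epsilon$).

f = @(x) exp(-x.^2/2).*(1 + sin(3*x).^2);  % a hypothetical smooth density
a = -8; b = 8;
n = 512;                                   % degree; grid size n + 1 = 2^9 + 1
x = cos(pi*(0:n)'/n);                      % Chebyshev points on [-1,1]
v = f((b-a)/2*(x+1) + a);                  % samples of f on the mapped grid
g = real(fft([v; v(n:-1:2)]));             % FFT of the even (cosine) extension
alpha = g(1:n+1)/n;                        % discrete cosine transform of the samples
alpha([1, n+1]) = alpha([1, n+1])/2;       % halve the first and last coefficients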
The CDF of $\tilde f$ can then be computed via

\[ \tilde F(x) = \int_a^x \tilde f(s)\,ds = \frac{b-a}{2}\sum_{k=0}^{n} \alpha_k \int_{-1}^{\frac{2(x-a)}{b-a}-1} T_k(t)\,dt, \]

where the last equality comes from a change of variables with $t = \frac{2(s-a)}{b-a} - 1$.
Fortunately, we have the following relation satisfied by Chebyshev polynomials [11,
pp. 32–33]:
\[ \int^{s} T_k(t)\,dt = \begin{cases} \dfrac{1}{2}\left(\dfrac{T_{k+1}(s)}{k+1} - \dfrac{T_{|k-1|}(s)}{k-1}\right), & k \neq 1,\\[1ex] \dfrac{1}{4}\,T_2(s), & k = 1. \end{cases} \tag{2} \]
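In coefficient space, relation (2) turns integration into a short recurrence for the Chebyshev coefficients $\beta_k$ of the antiderivative. A minimal MATLAB sketch (ours), continuing with alpha, a, b, and n from the previous sketch:

a2 = [alpha(:); 0; 0];                     % pad so that a2(k+2) always exists
beta = zeros(n+2, 1);                      % the antiderivative has degree n + 1
beta(2) = a2(1) - a2(3)/2;                 % beta_1 = alpha_0 - alpha_2/2
for k = 2:n+1
    beta(k+1) = (a2(k) - a2(k+2))/(2*k);   % beta_k = (alpha_{k-1} - alpha_{k+1})/(2k)
end
beta = (b-a)/2*beta;                       % chain rule for the map from [-1,1] to [a,b]
beta(1) = (-1).^(0:n)*beta(2:end);         % choose beta_0 so that F(a) = 0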
Consequently, the CDF is itself a Chebyshev series of degree $n+1$,

\[ \tilde F(x) = \sum_{k=0}^{n+1} \beta_k\, T_k\!\left(\frac{2(x-a)}{b-a} - 1\right), \tag{3} \]

and, after normalizing so that $\tilde F_X(b) = 1$, generating the samples reduces to the rootfinding problems

\[ \tilde F_X(X_k) = U_k, \qquad k = 1,\dots,N, \tag{4} \]

where $U_1,\dots,U_N$ are pseudo-random samples from the uniform distribution on $[0,1]$. The rootfinding problems in (4) can be solved by various standard methods, such as computing the eigenvalues of a certain matrix [20], Newton's method, the bisection method, or Brent's method [2, Chapter 4].
To achieve robustness (guaranteed convergence with a bounded computational cost and a prescribed accuracy) we use the bisection method. The bisection method approximates the (unique) solution $X_k$ to $\tilde F_X(X_k) = U_k$ by a sequence of intervals $[a_j, b_j]$ that contain $X_k$, satisfying $\tilde F_X(a_j) < U_k < \tilde F_X(b_j)$. The initial interval is the whole domain $[a,b]$. Each stage finds the midpoint $m_j$ of $[a_j, b_j]$ and chooses the new interval based on whether $\tilde F_X(m_j)$ is greater or less than $U_k$. Convergence occurs when $b_j - a_j < \mathrm{tol}\,(b-a)$, where $b_j - a_j = 2^{-j}(b-a)$. For example, it takes precisely 47 iterations to converge when $\mathrm{tol} = 10^{-14}$, and a further five iterations to converge to machine precision. Since the CDF is represented as a Chebyshev series
(3) it can be efficiently evaluated using Clenshaw’s method, which is an extension
of Horner’s scheme to Chebyshev series [3].
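A minimal MATLAB sketch of one inversion (ours, not the code of [14]); it assumes the coefficient vector beta from above, and the helper clenshaw evaluates a Chebyshev series as described in [3]:

beta = beta/clenshaw(beta, 1);             % rescale so that the CDF equals 1 at x = b
U = rand;                                  % a uniform sample on [0,1]
lo = a; hi = b; tol = 1e-14;
while hi - lo > tol*(b - a)
    mid = (lo + hi)/2;
    if clenshaw(beta, 2*(mid - a)/(b - a) - 1) < U
        lo = mid;                          % the root lies in the right half
    else
        hi = mid;                          % the root lies in the left half
    end
end
X = (lo + hi)/2;                           % a pseudo-random sample from f_X

function y = clenshaw(c, t)
% Evaluate sum_k c(k+1)*T_k(t) by Clenshaw's recurrence.
b1 = 0; b2 = 0;
for k = length(c):-1:2
    [b1, b2] = deal(c(k) + 2*t*b1 - b2, b1);
end
y = c(1) + t*b1 - b2;
end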
The inverse transform sampling with Chebyshev approximation is very effi-
cient, as demonstrated in the numerical experiments of Section 4 using the Matlab
implementation [14].
3 Extension to two dimensions

We now extend the inverse transform sampling approach with Chebyshev approx-
imation to probability distributions of two variables. Figure 2 summarizes our al-
gorithm, which generates N pseudo-random samples (X1 , Y1 ), . . . , (XN , YN ) drawn
from the probability distribution $f_{X,Y}(x,y) = f(x,y)\big/\int_c^d\!\int_a^b f(s,t)\,ds\,dt$. The es-
sential idea is to replace f by a low rank approximant f˜ that can be manipulated in
a computationally efficient manner. Again, for convenience, if f is a non-negative
function that does not integrate to one then we automatically scale the function
to a probability distribution.
Input: A non-negative function $f(x,y)$, a finite domain $[a,b]\times[c,d]$, and an integer $N$.
Output: Random samples $(X_1,Y_1),\dots,(X_N,Y_N)$ from $f_{X,Y} = f\big/\int_c^d\!\int_a^b f(s,t)\,ds\,dt$.

Construct a low rank approximant $\tilde f$ of $f$ on $[a,b]\times[c,d]$.
Compute $\tilde f_1(x) = \int_c^d \tilde f(x,y)\,dy$.
Generate random samples $X_1,\dots,X_N$ from $\tilde f_X(x) = \tilde f_1(x)\big/\int_a^b \tilde f_1(s)\,ds$.
for $i = 1,\dots,N$
    Generate a random sample $Y_i$ from $\tilde f_{Y|X=X_i}(y) = \tilde f(X_i,y)/\tilde f_1(X_i)$.
end

Fig. 2: Pseudocode for the inverse transform sampling algorithm for a probability distribution of two variables denoted by $f_{X,Y}$.
We seek a low rank approximant of $f$ of the form

\[ \tilde f(x,y) = \sum_{j=1}^{k} \sigma_j\, c_j(y)\, r_j(x), \qquad \sigma_j \in \mathbb{R}, \tag{5} \]

where $k$ is adaptively chosen and $c_j$ and $r_j$ are polynomials of one variable represented in Chebyshev expansions of the form (1). For simplicity we assume that the $c_j$ and $r_j$ are polynomials of degree $m$ and $n$, respectively; a function is then considered to be of low rank if $k \ll \min(m,n)$.
While most functions are mathematically of infinite rank — such as cos(xy )
— they can typically be approximated by low rank functions to high accuracy,
especially if the function is smooth [19]. It is well-known that the singular value
decomposition can be used to compute optimal low rank approximations of matri-
ces, and it can easily be extended for the approximation of functions. However, the
Gaussian elimination approach from [18] is significantly faster and constructs near-
optimal low rank function approximations [1, 18]. This algorithm is conveniently
implemented in the recent software package Chebfun2 [19], which we utilize to
construct f˜. Furthermore, we emphasize that this low rank approximation process
only requires pointwise evaluations of f .
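For example, the low rank approximant of the butterfly density used in Section 4 can be constructed in a few lines (a sketch assuming Chebfun2 [19] is installed and on the MATLAB path):

f = chebfun2(@(x,y) exp(-x.^2-2*y.^2).*sech(10*x.*y).*(x-y).^2, [-3 3 -3 3]);
k = rank(f)                                % the numerical rank of the approximant
[C, D, R] = cdr(f);                        % columns, pivots, and rows of the low rank form

Here cdr returns the quantities appearing in (5): the diagonal of D holds the $\sigma_j$, and the columns of C and rows of R are the univariate functions $c_j$ and $r_j$.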
With the low rank approximation f˜, given in (5), we can efficiently perform
the steps of the 2D inverse transform sampling algorithm. First, we approximate
the marginal distribution of X by integrating f˜ over the second variable:
\[ \tilde f_1(x) = \int_c^d \tilde f(x,y)\,dy = \sum_{j=1}^{k} \sigma_j\, r_j(x) \int_c^d c_j(y)\,dy. \]
Hence, computing the marginal requires only the integration of $k$ polynomials of one variable (rather than anything inherently 2D). The resulting sum is a Chebyshev expansion (in $x$), and we use the algorithm described in Section 2 to generate $N$ pseudo-random samples $X_1,\dots,X_N$ from $\tilde f_X(x) = \tilde f_1(x)\big/\int_a^b \tilde f_1(s)\,ds$.
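Concretely, suppose (as a hypothetical data layout, not Chebfun2's internals) that the columns R(:,j) and C(:,j) hold the Chebyshev coefficients of $r_j$ and $c_j$, and sigma(j) the weights in (5). Since $\int_{-1}^1 T_k(t)\,dt = 2/(1-k^2)$ for even $k$ and $0$ for odd $k$, the marginal's coefficients reduce to one matrix-vector product:

m = size(C, 1) - 1;                        % degree of the column polynomials c_j
wk = zeros(1, m+1);
kk = 0:2:m;                                % the even degrees
wk(1:2:end) = 2./(1 - kk.^2);              % int_{-1}^{1} T_k(t) dt for even k
w = (d - c)/2*(wk*C);                      % w(j) = int_c^d c_j(y) dy
alpha1 = R*(sigma(:).*w(:));               % Chebyshev coefficients of f_1(x)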
Afterwards, we generate the corresponding Y1 , . . . , YN by using a numerical
approximation to the conditional distribution denoted by f˜Y |X . To construct f˜Y |X
for each $X_1,\dots,X_N$ we only require pointwise evaluation, i.e.,

\[ \tilde f_{Y|X=X_i}(y) = \frac{\tilde f(X_i,y)}{\tilde f_1(X_i)}, \qquad i = 1,2,\dots,N, \]
where $\tilde f_1(X_i) = \int_c^d \tilde f(X_i,t)\,dt$ is precisely the normalization constant. Fortunately,
evaluation of f˜(Xi , y ) is relatively cheap because of the low rank representation of
f˜. That is,
\[ \tilde f(X_i,y) = \sum_{j=1}^{k} \sigma_j\, r_j(X_i)\, c_j(y), \]
and thus we only require the evaluation of k polynomials of one variable (again,
nothing inherently 2D), which, as before, is accomplished using Clenshaw’s algo-
rithm [3]. The final task of generating the sample Yi has been reduced to the 1D
problem of drawing a sample from f˜Y |X =Xi , which was solved in Section 2.
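In the same hypothetical layout as above, the coefficients of the slice $\tilde f(X_i,\cdot)$ collapse to a single Chebyshev series in $y$ (a sketch, reusing the clenshaw helper from Section 2; Xi denotes one of the generated samples):

t = 2*(Xi - a)/(b - a) - 1;                % map X_i into [-1,1]
rXi = zeros(size(R, 2), 1);
for j = 1:size(R, 2)
    rXi(j) = clenshaw(R(:,j), t);          % evaluate r_j(X_i)
end
gamma = C*(sigma(:).*rXi);                 % coefficients of sum_j sigma_j r_j(X_i) c_j(y)

Sampling $Y_i$ is then one run of the 1D inversion applied to the coefficient vector gamma.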
In total we have generated $N$ pseudo-random samples $(X_1,Y_1),\dots,(X_N,Y_N)$ from the probability distribution $f_{X,Y}$ without an explicit expression for the CDF or its inverse. Low rank approximation was important for computational efficiency: each sample requires only the evaluation of univariate Chebyshev series, never a genuinely bivariate object, so the total computational cost for $N$ samples, including the low rank approximation of $f$, grows only linearly with $N$.
4 Numerical experiments
Table 1: Sampling times (in seconds) for our inverse transform sampling (ITS)
compared to slicesample in Matlab (SS), and rejection sampling (RS) using
the function’s maximum for 10,000 samples. The 2D inverse transform sampling
approach described in Section 3 is remarkably efficient when the probability dis-
tribution is approximated by a low rank function. (Note: slicesample only works
on decaying PDFs, hence fails on the third example.)
We compare the performance of these sampling approaches on the following densities of one variable:

1. Multimodal density: $e^{-x^2/2}(1 + \sin^2(3x))(1 + \cos^2(5x))$ on $[-8,8]$,
2. The spectral density of a $4\times4$ Gaussian Unitary Ensemble [5]: $e^{-4x^2}(9 + 72x^2 - 192x^4 + 512x^6)$ on $[-4,4]$,
3. Compactly supported oscillatory density: $2 + \cos(100x)$ on $[-1,1]$,
4. Concentrated sech density: $\mathrm{sech}(200x)$ on $[-1,1]$,

and the following bivariate distributions:

1. Bimodal distribution: $e^{-100(x-1)^2} + e^{-100(y+1)^2}(1 + \cos(20x))$ on $[-2,2]\times[-2,2]$,
2. Quartic unitary ensemble eigenvalues [5]: $e^{-x^4/2 - y^4/2}(x-y)^2$ on $[-7,7]\times[-7,7]$,
3. Concentrated 2D sech density: $e^{-x^2-2y^2}\,\mathrm{sech}(10xy)$ on $[-5,5]\times[-4,4]$,
4. Butterfly density: $e^{-x^2-2y^2}\,\mathrm{sech}(10xy)(x-y)^2$ on $[-3,3]\times[-3,3]$.
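For instance, the butterfly density (the fourth bivariate example) can be sampled with the interface of [14] shown in Section 1; the call below mirrors the experiment of Figure 3 (right):

[X,Y] = sample(@(x,y) exp(-x.^2-2*y.^2).*sech(10*x.*y).*(x-y).^2,[-3,3],[-3,3],10000);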
Figure 3 demonstrates our algorithm. Figure 3 (left) shows a histogram of one
hundred thousand pseudo-random samples generated from the multimodal density,
and Figure 3 (right) shows ten thousand pseudo-random samples from the butterfly
density.
We observe that the inverse transform sampling approach described in Section 2
significantly outperforms the Matlab implementation of slice sampling in 1D. In
the worst case, the computational cost is comparable to rejection sampling, and
in two of the examples it outperforms rejection sampling by a factor of 10.
Rejection sampling has practical limitations that inverse transform sampling avoids. Find-
ing the exact maximum of the distribution for the rejection sampling is computa-
tionally expensive, and replacing the exact maximum with a less accurate upper
bound causes more samples to be rejected, which reduces its efficiency. Further-
more, the computational cost of each sample is not bounded, and to emphasize
this we consider generating pseudo-random samples from sech(ωx) for varying ω
on the interval $[-8,8]$. For large $\omega$ the percentage of the rectangular hat function's area that lies under the probability distribution is small, and the rejection sampling
approach rejects a significant proportion of its samples, while for our approach
the dominating computational cost stays roughly constant. In Figure 4 (left) we
Fig. 3: Left: histogram of one hundred thousand samples from the multimodal
density computed in 2.28 seconds, and it can be seen that the generated pseudo-
random samples have the correct distribution. Right: ten thousand pseudo-random
samples from the butterfly density $f(x,y) = e^{-x^2-2y^2}\,\mathrm{sech}(10xy)(x-y)^2$. Only one sample lies outside the contour lines (the outermost contour is the level curve 0.0001), and this is consistent with the number of generated samples.
Fig. 4: Left: time (in seconds) of rejection sampling (solid) versus the inverse transform method (dashed) for 100 samples of $\mathrm{sech}(\omega x)$. Right: the number of function evaluations for 50 pseudo-random samples generated by rejection sampling (solid) versus the number of function evaluations for any number of inverse transform samples (dashed).
compare the computational cost of rejection sampling and the approach described
in Section 2, and we observe that the Chebyshev expansion outperforms rejection
sampling for ω ≥ 30.
Moreover, the number of function evaluations required by our approach is independent of the number of samples: once the polynomial approximation of the CDF has been calculated, the original distribution can be discarded, whereas rejection sampling evaluates the original distribution several times per sample. Thus, in
Figure 4, we compare the number of function evaluations required to generate 50
samples using rejection sampling versus the number of evaluations for our inverse
transform sampling approach. The number of function evaluations is comparable
at 50 samples, and hence if 500 pseudo-random samples were required then the
rejection sampling approach would require about ten times the number of function
evaluations.
Parallelization of the algorithm is important for extremely large sample sizes.
This is easily accomplished using a polynomial approximation since the parameters
used to represent the CDF can be stored in local memory and the task of gener-
ating N pseudo-random samples can be divided up among an arbitrary number
of processors or machines. However, the rejection sampling approach cannot be
parallelized with a black box distribution because evaluation would cause memory
locking.
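As a sketch of this (ours, not part of [14]): once the CDF coefficients beta are available, the loop below is embarrassingly parallel. Here invert_cdf is a hypothetical wrapper around the bisection step of Section 2, and parfor requires the Parallel Computing Toolbox.

N = 1e6;
X = zeros(N, 1);
parfor i = 1:N
    X(i) = invert_cdf(beta, rand, a, b);   % solve F_X(X_i) = U_i independently
end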
5 Conclusions
We have shown that inverse transform sampling is efficient when employed with
Chebyshev polynomial approximation in one dimension, and with low rank func-
tion approximation in two dimensions. This allows for an automated and robust
algorithm for generating pseudo-random samples from a large class of probability
distributions. Furthermore, the approach that we have described can be general-
ized to higher dimensions by exploiting low rank approximations of tensors.
In fact, the inverse transform sampling approach is efficient for any class of
probability distributions that can be numerically approximated, evaluated, and
integrated. Therefore, there are straightforward extensions to several other classes
of distributions of one variable:
1. Piecewise smooth probability distributions, utilizing piecewise polynomial ap-
proximation schemes and automatic edge detection [15],
2. Probability distribution functions with algebraic endpoint singularities,
3. Probability distributions with only algebraic decay at $\pm\infty$, by mapping to a bounded interval via $\frac{1-x}{1+x}$ (see the sketch after this list).
The software package Chebfun [21] implements each of these special cases.
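For the third case, a minimal sketch of the change of variables (ours; the density g and the routine sample1d, standing in for the Section 2 sampler, are hypothetical):

g = @(x) 1./(1 + x).^3;                    % density with algebraic decay on [0, inf)
h = @(t) g((1-t)./(1+t)).*2./(1+t).^2;     % transformed density on [-1,1]
T = sample1d(h, [-1, 1], 1000);            % draw samples of the mapped variable
X = (1 - T)./(1 + T);                      % samples distributed according to g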
In two dimensions and higher, approximating general piecewise smooth func-
tions is a more difficult problem, and many fundamental algorithmic issues remain.
However, if these constructive approximation questions are resolved the inverse
transform sampling approach will be immediately applicable.
References