
Distances, divergences, statistical divergences and diversities

Frank Nielsen
Sony Computer Science Laboratories Inc.
Tokyo, Japan

August 2021
(Last updated November 12, 2021)

This document is also available in PDF: Distance.pdf

This is a working document which will be (hopefully frequently) updated with materials concerning the discrepancies between two distributions/parameters or the diversities of a set of distributions/parameters. The literature uses many synonyms for measures of the difference between two objects: metrics, distances, discrepancies, deviations, deviances, dissimilarities, divergences, contrast functions, or yokes (on product manifolds), etc. Diversities generalize 2-point distances by measuring the dispersion of a set of n objects, usually with respect to a notion of centrality.

In mathematics, a distance is often understood as a metric distance on a metric space, i.e., a function d(x, y) satisfying the following properties: non-negativity d(x, y) >= 0, identity of indiscernibles d(x, y) = 0 iff x = y, symmetry d(x, y) = d(y, x), and the triangle inequality d(x, z) <= d(x, y) + d(y, z). There is some confusion in the literature, where "distance" is also used as a synonym for a dissimilarity measure.

In information theory and statistics, we measure the deviation between one probability measure and another probability measure using a statistical divergence.

In information geometry, a divergence is a smooth dissimilarity measure D(θ1 : θ2) which shall satisfy the following conditions: D(θ1 : θ2) >= 0 for all θ1, θ2, with equality if and only if θ1 = θ2, and the mixed second-order partial derivatives of D at θ1 = θ2 define a positive-definite bilinear form (inducing a Riemannian metric). Divergences were formerly called contrast functions; a dualistic structure of information geometry can be built from a divergence.

Contents

1 Jensen divergences and Bregman divergences
  1.1 Skewed Jensen and Bregman divergences
  1.2 Relationships with statistical distances between densities of an exponential family
  1.3 Generalizations of Bregman and Jensen divergences

2 Invariant f-divergences

3 Distances and means

4 Statistical distances between densities with computationally intractable normalizers

5 Statistical distances between empirical distributions and densities with computationally intractable normalizers

6 The Jensen-Shannon divergence and some generalizations
  6.1 Origins of the Jensen-Shannon divergence
  6.2 Some extensions of the Jensen-Shannon divergence

7 Statistical distances between mixtures
  7.1 Approximating and/or fast statistical distances between mixtures
  7.2 Bounding statistical distances between mixtures
  7.3 Newly designed statistical distances yielding closed-form formula for mixtures

Figure 1: Skewed Jensen divergences visualized as vertical convexity gaps.

1 Jensen divergences and Bregman divergences


1.1 Skewed Jensen and Bregman divergences
For a strictly convex and differentiable generator F(θ), the α-skewed Jensen divergence is the convexity gap

J_{F,α}(θ1 : θ2) := α F(θ1) + (1 − α) F(θ2) − F(α θ1 + (1 − α) θ2),

which is non-negative for α ∈ (0, 1) by Jensen's inequality, and the Bregman divergence is B_F(θ1 : θ2) := F(θ1) − F(θ2) − (θ1 − θ2)^T ∇F(θ2). The scaled skewed Jensen divergence sJ_{F,α}(θ1 : θ2) := J_{F,α}(θ1 : θ2)/(α(1 − α)) tends to the sided Bregman divergences in the limit cases:

lim_{α→0} sJ_{F,α}(θ1 : θ2) = B_F(θ1 : θ2),

lim_{α→1} sJ_{F,α}(θ1 : θ2) = B_F(θ2 : θ1).

If α < 0 or α > 1, the Jensen gap changes sign and we can measure F(α θ1 + (1 − α) θ2) − (α F(θ1) + (1 − α) F(θ2)) = −J_{F,α}(θ1 : θ2) >= 0 (see Figure 1). Since α(1 − α) < 0 in that case as well, the scaled α-skew Jensen divergence remains non-negative for any α ∈ R \ {0, 1}:

sJ_{F,α}(θ1 : θ2) = J_{F,α}(θ1 : θ2)/(α(1 − α)) >= 0.
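
As a quick numerical illustration (an added sketch, not part of the original text), the following Python snippet checks the two limits above for the 1D generator F(θ) = θ log θ − θ, chosen here for convenience:

import numpy as np

def F(theta):                       # convex generator F(theta) = theta*log(theta) - theta
    return theta * np.log(theta) - theta

def gradF(theta):                   # gradient of F
    return np.log(theta)

def bregman(t1, t2):                # B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) * F'(t2)
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def scaled_skew_jensen(t1, t2, alpha):
    gap = alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)
    return gap / (alpha * (1 - alpha))

t1, t2 = 2.0, 0.5
for alpha in (1e-2, 1e-4, 1e-6):
    print(scaled_skew_jensen(t1, t2, alpha), "->", bregman(t1, t2))      # alpha -> 0
    print(scaled_skew_jensen(t1, t2, 1 - alpha), "->", bregman(t2, t1))  # alpha -> 1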

1.2 Relationships with statistical distances between densities of an exponential family

1.3 Generalizations of Bregman and Jensen divergences

2 Invariant f-divergences

An f-divergence [26, 7, 1, 8] I_f[p : q] is a dissimilarity measure between probability distributions defined for a convex generator f(u) by:

I_f[p : q] = ∫ p(x) f(q(x)/p(x)) dµ(x).

Using Jensen's inequality, we have I_f[p : q] >= f(1). Thus we ask for generators satisfying f(1) = 0. Moreover, in order to have I_f[p : q] = 0 iff p = q (µ-almost everywhere), we require f(u) to be strictly convex at 1. The f-divergences include many well-known statistical distances, listed below with their generators:

f-divergence                Formula I_f[p : q]                                                            Generator f(u)

Total variation (metric)    (1/2) ∫ |p(x) − q(x)| dµ(x)                                                   (1/2) |u − 1|
Squared Hellinger           ∫ (√p(x) − √q(x))² dµ(x)                                                      (√u − 1)²
Pearson χ²_P                ∫ (q(x) − p(x))²/p(x) dµ(x)                                                   (u − 1)²
Neyman χ²_N                 ∫ (p(x) − q(x))²/q(x) dµ(x)                                                   (1 − u)²/u
Kullback-Leibler            ∫ p(x) log(p(x)/q(x)) dµ(x)                                                   − log u
reverse Kullback-Leibler    ∫ q(x) log(q(x)/p(x)) dµ(x)                                                   u log u
Jeffreys divergence         ∫ (p(x) − q(x)) log(p(x)/q(x)) dµ(x)                                          (u − 1) log u
α-divergence                (4/(1 − α²)) (1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dµ(x))                      (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon              (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dµ(x)     −(u + 1) log((1 + u)/2) + u log u
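
As a sanity check (added here as a sketch, not in the original), the generic definition I_f[p : q] = Σ_x p(x) f(q(x)/p(x)) can be instantiated on a finite alphabet for a few of the generators above and compared against the direct formulas:

import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def f_divergence(p, q, f):          # I_f[p:q] = sum_x p(x) f(q(x)/p(x))
    return np.sum(p * f(q / p))

generators = {
    "total variation":   lambda u: 0.5 * np.abs(u - 1),
    "squared Hellinger": lambda u: (np.sqrt(u) - 1) ** 2,
    "Pearson chi2":      lambda u: (u - 1) ** 2,
    "Kullback-Leibler":  lambda u: -np.log(u),
}
for name, f in generators.items():
    print(name, f_divergence(p, q, f))

print("KL (direct)", np.sum(p * np.log(p / q)))   # matches the Kullback-Leibler entry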

Two f-divergences I_{f1} and I_{f2} are equivalent iff f1(u) = f2(u) + λ(u − 1) for some λ ∈ R. A symmetric f-divergence is bounded (e.g., the Jensen-Shannon divergence or the total variation) iff f(0) < ∞; the Jeffreys divergence is an unbounded f-divergence. The dual f-divergence I_{f*}[p : q] := I_f[q : p] is an f-divergence for the dual (or conjugate) generator f*(u) = u f(1/u). Thus symmetric f-divergences (e.g., the Jeffreys divergence, the squared Hellinger divergence, or the Jensen-Shannon divergence) satisfy the functional equality f(u) = u f(1/u). The f-divergences are jointly convex and satisfy the information monotonicity property: I_f[p_{|Y} : q_{|Y}] <= I_f[p : q] for any partition Y of X (see lumping [9]). A statistical divergence is said to be separable iff it can be rewritten as D[p : q] = ∫ D1(p(x) : q(x)) dµ(x), where D1 is a scalar divergence. The f-divergences are the only divergences which are both separable and information monotone [3] (except for the “curious case” [17] of binary alphabets X). An f-divergence is said to be standard [3] when f''(1) = 1. The local Taylor expansion [6] of the f-divergence I_f[p_{θ1} : p_{θ2}] between two parametric densities is governed by the Fisher information matrix I(θ) = E_{p_θ}[∇ log p_θ(x) (∇ log p_θ(x))^T] as follows:

I_f[p_{θ1} : p_{θ2}] = (f''(1)/2) (θ2 − θ1)^T I(θ1) (θ2 − θ1) + o(‖θ2 − θ1‖²).

Thus we have I_f[p_θ : p_{θ+dθ}] = (1/2) dθ^T I(θ) dθ for a standard f-divergence (with f''(1) = 1).
The following metric distance D_Q is called the Mahalanobis distance [24] (Mahalanobis defined that distance for Q = Σ^{−1} ≻ 0, the inverse of a covariance matrix):

D_Q(θ1, θ2) = √((θ2 − θ1)^T Q (θ2 − θ1)).

The Mahalanobis distance generalizes the Euclidean distance (expressed in the Cartesian coordinate system), recovered with Q = I, the identity matrix. When Σ = diag(σ_1², . . . , σ_D²) is diagonal, we have

D_{Σ^{−1}}(θ1, θ2) = √(Σ_{i=1}^D (θ2^i − θ1^i)²/σ_i²).

Thus we have I_f[p_{θ1} : p_{θ2}] = (1/2) D²_{I(θ1)}(θ1, θ2) + o(‖θ2 − θ1‖²) for a standard f-divergence. For θ1 = θ and θ2 = θ + dθ, the half squared Mahalanobis distance can be interpreted as half the squared Riemannian infinitesimal length element ds²_θ:

I_f[p_θ : p_{θ+dθ}] = (1/2) dθ^T I(θ) dθ = (1/2) ds²_θ.
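
This quadratic approximation can be checked numerically; the added sketch below uses the standard Fisher information diag(1/σ², 2/σ²) of a univariate Gaussian in the (µ, σ) parameterization and compares the Kullback-Leibler divergence between two nearby Gaussians with half of the squared Mahalanobis length induced by the Fisher information:

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):     # closed-form KL between N(mu1, s1^2) and N(mu2, s2^2)
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, s = 0.3, 1.5
dmu, ds = 1e-2, -2e-2
fisher = np.diag([1 / s**2, 2 / s**2])      # Fisher information I(mu, sigma) of the Gaussian family
dtheta = np.array([dmu, ds])

print("KL                 :", kl_gauss(mu, s, mu + dmu, s + ds))
print("(1/2) dtheta^T I dtheta:", 0.5 * dtheta @ fisher @ dtheta)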

An f-divergence between any two densities p_{θ1} and p_{θ2} can be expressed as a Taylor series [40] when p_{θ2}(x)/p_{θ1}(x) < 1 + r_f, where r_f is the convergence radius of the analytic generator f ∈ C^ω, and the ratio p_{θ2}/p_{θ1} is upper bounded by a constant C:

I_f[p_{θ1} : p_{θ2}] = Σ_{n=2}^∞ a_n ∫_X (p_{θ2}(x)/p_{θ1}(x) − 1)^n p_{θ1}(x) dµ(x),

where a_n := f^{(n)}(1)/n! are the Taylor coefficients of the generator at u = 1. Otherwise, the Taylor series diverges.

By introducing the higher-order chi divergences [37]

D_{χ,n}[p : q] = ∫ (p(x) − q(x))^n / q(x)^{n−1} dµ(x),

which are proper divergences for even integer orders and only pseudo-distances for odd orders, we can rewrite the Taylor series of f-divergences as:

I_f[p_{θ1} : p_{θ2}] = Σ_{n=2}^∞ a_n D_{χ,n}[p_{θ1} : p_{θ2}].

The higher-order chi divergences [37] between densities of an exponential family are available in closed form provided that the natural parameter space is affine (e.g., the isotropic Gaussian family or the Poisson family).

To illustrate the Taylor series, consider the Poisson family, and let us express the Taylor series of the Jensen-Shannon divergence (generator f_JS(u) = −(u + 1) log((1 + u)/2) + u log u) between two Poisson distributions with parameters λ1 = 0.6 and λ2 = 0.3. The higher-order chi divergences between Poisson distributions are expressed in closed form as follows [37]:

D_{χ,k}[p_{λ1} : p_{λ2}] = Σ_{j=0}^k (k choose j) (−1)^{k−j} exp(λ1^{1−j} λ2^j − ((1 − j) λ1 + j λ2)).

Furthermore, the density ratio p_{λ2}(x)/p_{λ1}(x) = (λ2/λ1)^x e^{λ1−λ2} is upper bounded by C = e^{λ1−λ2} = e^{0.3} ≈ 1.35 since λ2 < λ1, so the ratio condition above is satisfied (a numerical sketch of the truncated series follows).
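
The following Python sketch (added for illustration; it relies on sympy to extract the Taylor coefficients a_n = f^{(n)}(1)/n! and uses the closed-form Poisson expression above) sums the truncated series Σ_{n=2}^N a_n D_{χ,n} and compares it with a direct evaluation of I_{f_JS}[p_{λ1} : p_{λ2}] by summation over the support:

from math import comb, exp, factorial
import sympy as sp

l1, l2 = 0.6, 0.3
u = sp.symbols('u', positive=True)
f_js = -(u + 1) * sp.log((1 + u) / 2) + u * sp.log(u)

def a(n):                                  # Taylor coefficient f^(n)(1)/n!
    return float(sp.diff(f_js, u, n).subs(u, 1)) / factorial(n)

def chi_poisson(n):                        # closed-form integral of (p_l2 - p_l1)^n / p_l1^(n-1)
    return sum(comb(n, j) * (-1) ** (n - j)
               * exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(n + 1))

def poisson_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

# direct evaluation of I_f[p_{l1} : p_{l2}] by summing over the support
direct = sum(poisson_pmf(x, l1) * float(f_js.subs(u, poisson_pmf(x, l2) / poisson_pmf(x, l1)))
             for x in range(50))

# truncated Taylor series sum_{n=2}^N a_n * chi_n, compared with the direct evaluation
for N in (2, 4, 8, 16):
    print(N, sum(a(n) * chi_poisson(n) for n in range(2, N + 1)), "vs", direct)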
When q is close to p, we have the second-order approximation

I_f[p : q] ≈ (f''(1)/2) D_{χ²}[p : q],

where D_{χ²}[p : q] = ∫ (p(x) − q(x))²/q(x) dµ(x) is the (Neyman) chi-squared divergence. Therefore, we have

D_KL[p : q] ≈ (1/2) D_{χ²}[p : q]. (1)
On the finite-dimensional probability simplex, the Kullback-Leibler divergence is the only statistical divergence which belongs to both the class of f-divergences and the class of Bregman divergences [2]. When extending the f-divergences to positive measures, the intersection of the f-divergences with the Bregman divergences is the class of α-divergences [2].
For the parametric family of Cauchy distributions, the f-divergences are always symmetric and can be expressed as a function of the chi-squared divergence [40]:

I_f[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}] = I_f[p^Cauchy_{l2,s2} : p^Cauchy_{l1,s1}] = h_f(D_{χ²}[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}]),

where

D_{χ²}[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}] = ((l2 − l1)² + (s2 − s1)²)/(2 s1 s2)

and

p^Cauchy_{l,s}(x) := 1/(π s (1 + ((x − l)/s)²)) = s/(π (s² + (x − l)²)).
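
A small numerical check (an added sketch; scipy is assumed available for the integration) of the closed-form chi-squared divergence between Cauchy densities and of the symmetry of the KLD on the Cauchy family:

import numpy as np
from scipy.integrate import quad

def cauchy(x, l, s):
    return s / (np.pi * (s**2 + (x - l)**2))

l1, s1, l2, s2 = 0.0, 1.0, 1.0, 2.0

chi2_closed = ((l2 - l1)**2 + (s2 - s1)**2) / (2 * s1 * s2)
chi2_num, _ = quad(lambda x: (cauchy(x, l1, s1) - cauchy(x, l2, s2))**2 / cauchy(x, l2, s2),
                   -np.inf, np.inf)
print(chi2_closed, chi2_num)        # closed form vs numerical integration

kl_12, _ = quad(lambda x: cauchy(x, l1, s1) * np.log(cauchy(x, l1, s1) / cauchy(x, l2, s2)),
                -np.inf, np.inf)
kl_21, _ = quad(lambda x: cauchy(x, l2, s2) * np.log(cauchy(x, l2, s2) / cauchy(x, l1, s1)),
                -np.inf, np.inf)
print(kl_12, kl_21)                 # equal: f-divergences are symmetric on the Cauchy family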

3 Distances and means

4 Statistical distances between densities with computationally intractable normalizers
Consider a density p(x) = p̃(x)/Z_p, where p̃(x) is an unnormalized computable density and Z_p = ∫ p̃(x) dµ(x) is the computationally intractable normalizer (also called in statistical physics the partition function or free energy). A statistical distance D[p1 : p2] between two densities p1(x) = p̃1(x)/Z_{p1} and p2(x) = p̃2(x)/Z_{p2} with computationally intractable normalizers Z_{p1} and Z_{p2} is said to be projective (or two-sided homogeneous) if and only if

D[λ1 p1 : λ2 p2] = D[p1 : p2] for all λ1 > 0, λ2 > 0.

In particular, letting λ1 = Z_{p1} and λ2 = Z_{p2}, we have

D[p1 : p2] = D[p̃1 : p̃2].

Notice that the right-hand side does not rely on the computationally intractable normalizers. These projective distances are useful in statistical inference based on minimum distance estimators [5] (see the next section).

Here are a few projective statistical distances:

• γ-divergences (γ > 0) [18, 13]:

D_γ[p : q] := log(∫ q(x)^{γ+1} dµ(x)) − (1 + 1/γ) log(∫ q(x)^γ p(x) dµ(x)) + (1/γ) log(∫ p(x)^{γ+1} dµ(x)), γ > 0.

When γ → 0, the γ-divergence D_γ[p : q] tends to D_KL[p : q], the Kullback-Leibler divergence (KLD) [13]. For example, we can estimate the KLD between two densities of an exponential-polynomial family by Monte Carlo stochastic integration of the γ-divergence for a small value of γ [38].

The γ-divergences (projective, of Bregman type: cross-entropy minus entropy) and the density power divergence [4] (non-projective, a Bregman-type divergence):

D^dpd_α[p : q] := ∫ q(x)^{α+1} dµ(x) − (1 + 1/α) ∫ q(x)^α p(x) dµ(x) + (1/α) ∫ p(x)^{α+1} dµ(x), α >= 0,

can be encapsulated into the family of Φ-power divergences [49] (the functional density power divergence class):

D_{φ,α}[p : q] := φ(∫ q(x)^{α+1} dµ(x)) − (1 + 1/α) φ(∫ q(x)^α p(x) dµ(x)) + (1/α) φ(∫ p(x)^{α+1} dµ(x)), α >= 0,

where φ(e^x) is convex and strictly increasing, and φ is continuous and twice continuously differentiable with finite second-order derivatives. We have D_{φ,0}[p : q] = φ'(1) ∫ p(x) log(p(x)/q(x)) dµ(x) = φ'(1) D_KL[p : q].

• Cauchy-Schwarz divergence [16] (CSD, projective; a numerical sketch of the projectivity is given after this list):

D_CS[p : q] = − log( ∫ p(x) q(x) dµ(x) / √(∫ p(x)² dµ(x) ∫ q(x)² dµ(x)) ) = D_CS[λ1 p : λ2 q] for all λ1 > 0, λ2 > 0,

and Hölder divergences [46] (HD, projective, which generalize the CSD):

D^Hölder_{α,γ}[p : q] = − log( ∫_X p(x)^{γ/α} q(x)^{γ/β} dx / ((∫_X p(x)^γ dx)^{1/α} (∫_X q(x)^γ dx)^{1/β}) ), with 1/α + 1/β = 1.

We have

D^Hölder_{α,γ}[λ1 p : λ2 q] = D^Hölder_{α,γ}[p : q] for all λ1 > 0, λ2 > 0,

and

D^Hölder_{2,2}[p : q] = D_CS[p : q].

The Hölder divergence between two densities p_{θp} and p_{θq} of an exponential family with cumulant function F(θ) is available in closed form [46]:

D^Hölder_{α,γ}[p : q] = (1/α) F(γ θp) + (1/β) F(γ θq) − F((γ/α) θp + (γ/β) θq).

The CSD is available in closed form between mixtures of an exponential family with a conic natural parameter space [28]; this includes the case of Gaussian mixture models [19].

• Hilbert distance [45] (projective): Consider two probability mass functions p = (p_1, . . . , p_d) and q = (q_1, . . . , q_d) of the d-dimensional probability simplex. Then the Hilbert distance is

D_Hilbert[p : q] = log( (max_{i ∈ {1,...,d}} p_i/q_i) / (min_{j ∈ {1,...,d}} p_j/q_j) ).

We have

D_Hilbert[λ1 p : λ2 q] = D_Hilbert[p : q] for all λ1 > 0, λ2 > 0.

The Hilbert projective simplex distance can be extended to the cone of positive-definite matrices [45] (and its subspace of correlation matrices called the elliptope) as follows:

D_Hilbert[P : Q] = log( λ_max(P Q^{−1}) / λ_min(P Q^{−1}) ),

where λ_max(X) and λ_min(X) denote the largest and smallest eigenvalues of matrix X, respectively.
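
As announced above, here is a small added sketch checking the projectivity (two-sided homogeneity) of the Cauchy-Schwarz divergence, with finite sums standing in for the integrals and random unnormalized vectors playing the role of the unnormalized densities:

import numpy as np

def cauchy_schwarz_div(p, q):
    return -np.log(np.dot(p, q) / np.sqrt(np.dot(p, p) * np.dot(q, q)))

rng = np.random.default_rng(0)
p_tilde = rng.random(5)            # unnormalized "densities"
q_tilde = rng.random(5)
p = p_tilde / p_tilde.sum()        # normalized counterparts
q = q_tilde / q_tilde.sum()

print(cauchy_schwarz_div(p, q))
print(cauchy_schwarz_div(p_tilde, q_tilde))              # same value: projective
print(cauchy_schwarz_div(3.7 * p_tilde, 0.2 * q_tilde))  # invariant to positive rescalings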

5 Statistical distances between empirical distributions and densities with computationally intractable normalizers

When estimating the parameter θ̂ of a parametric family of distributions {p_θ} from i.i.d. observations S = {x_1, . . . , x_n}, we can define a minimum distance estimator (MDE):

θ̂ = arg min_θ D[p_S : p_θ],

where p_S = (1/n) Σ_{i=1}^n δ_{x_i} is the (normalized) empirical distribution. Thus we only need a right-sided projective divergence to estimate models with computationally intractable normalizers. For example, the Maximum Likelihood Estimator (MLE) is an MDE with respect to the KLD:

θ̂_MLE = arg min_θ D_KL[p_S : p_θ].

It is thus interesting to study the impact of the choice of the distance D on the properties of the corresponding estimator (e.g., γ-divergences yield provably robust estimators [13]).

• Hyvärinen divergence [14] (also called the Fisher divergence or the Fisher relative information [47]):

D^Hyvärinen[p : p_θ] := (1/2) ∫ ‖∇_x log p(x) − ∇_x log p_θ(x)‖² p(x) dx.

The Hyvärinen divergence has been extended to order-α Hyvärinen divergences [32] (for α > 0):

D^Hyvärinen_α[p : q] := (1/2) ∫ p(x)^α ‖∇_x log p(x) − ∇_x log q(x)‖² dx, α > 0.

The Fisher divergence is related to the Kullback-Leibler divergence [53] as follows:

D_KL[p : q] = ∫_0^∞ D_Fisher[p ∗ N(0, λI) : q ∗ N(0, λI)] dλ,

where (f ∗ g)(x) = ∫ f(y) g(x − y) dy denotes the convolution of densities. Thus convergence with respect to the Fisher divergence is stronger than convergence with respect to the KLD.
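
Because D^Hyvärinen only involves the score ∇_x log p_θ, it is left unchanged when the model density is known only up to its normalizer. The added sketch below illustrates this for univariate Gaussians, computing the divergence once with the exact score and once with the score of an unnormalized density obtained by numerical differentiation (scipy is assumed available; integration is restricted to a finite range where the densities are numerically safe):

import numpy as np
from scipy.integrate import quad

mu1, s1, mu2, s2 = 0.0, 1.0, 0.8, 1.4

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))

def score_gauss(x, mu, s):                     # d/dx log N(x; mu, s^2) = -(x - mu)/s^2
    return -(x - mu) / s**2

def hyvarinen(score_q):                        # (1/2) E_p[(score_p(x) - score_q(x))^2]
    integrand = lambda x: 0.5 * (score_gauss(x, mu1, s1) - score_q(x))**2 * gauss(x, mu1, s1)
    return quad(integrand, -10, 10)[0]

def numerical_score(f, x, h=1e-5):             # d/dx log f(x) by central differences
    return (np.log(f(x + h)) - np.log(f(x - h))) / (2 * h)

q_tilde = lambda x: 17.3 * gauss(x, mu2, s2)   # model density known only up to a constant

print(hyvarinen(lambda x: score_gauss(x, mu2, s2)))       # exact score of q
print(hyvarinen(lambda x: numerical_score(q_tilde, x)))   # same value: the constant cancels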

6 The Jensen-Shannon divergence and some generalizations


6.1 Origins of the Jensen-Shannon divergence
Let (X, F, µ) be a measure space, and let (w_1, P_1), . . . , (w_n, P_n) be n weighted probability measures dominated by the measure µ (with w_i > 0 and Σ_i w_i = 1). Denote by P := {(w_1, p_1), . . . , (w_n, p_n)} the set of their weighted Radon-Nikodym densities p_i = dP_i/dµ with respect to µ.

A statistical divergence D[p : q] is a measure of dissimilarity between two densities p and q (i.e., a 2-point distance) such that D[p : q] >= 0 with equality if and only if p(x) = q(x) µ-almost everywhere. A statistical diversity index D(P) is a measure of the variation of the weighted densities in P related to a measure of centrality, i.e., an n-point distance which generalizes the notion of 2-point distance when P_2(p, q) := {(1/2, p), (1/2, q)}:

D[p : q] := D(P_2(p, q)).

The fundamental measure of dissimilarity in information theory is the I-divergence (also called
the Kullback-Leibler divergence, KLD, see Equation (2.5) page 5 of [20]):
D_KL[p : q] := ∫_X p(x) log(p(x)/q(x)) dµ(x).

The KLD is asymmetric (hence the delimiter notation “:” instead of “,”) but can be symmetrized by defining the Jeffreys J-divergence (Jeffreys divergence, denoted by I_2 in Equation (1) of the 1946 paper [15]):

D_J[p, q] := D_KL[p : q] + D_KL[q : p] = ∫_X (p(x) − q(x)) log(p(x)/q(x)) dµ(x).

Although symmetric, no positive power of the Jeffreys divergence satisfies the triangle inequality: that is, D_J^α is never a metric distance for any α > 0; furthermore, D_J^α cannot be upper bounded.
In 1991, Lin proposed the asymmetric K-divergence (Equation (3.2) in [22]):

D_K[p : q] := D_KL[p : (p + q)/2],

and defined the L-divergence by analogy to Jeffreys’s symmetrization of the KLD (Equation (3.4)
in [22]):
DL [p, q] = DK [p : q] + DK [q : p].
By noticing that

D_L[p, q] = 2 h[(p + q)/2] − (h[p] + h[q]),
where h denotes Shannon entropy (Equation (3.14) in [22]), Lin coined the (skewed) Jensen-
Shannon divergence between two weighted densities (1 − α, p) and (α, q) for α ∈ (0, 1) as follows
(Equation (4.1) in [22]):

DJS,α [p, q] = h[(1 − α)p + αq] − (1 − α)h[p] − αh[q]. (2)

Finally, Lin defined the generalized Jensen-Shannon divergence (Equation (5.1) in [22]) for a finite weighted set of densities:

D_JS[P] = h[Σ_i w_i p_i] − Σ_i w_i h[p_i].

This generalized Jensen-Shannon divergence is nowadays called the Jensen-Shannon diversity index.
In contrast with the Jeffreys divergence, the Jensen-Shannon divergence (JSD) D_JS := D_{JS,1/2} is upper bounded by log 2 (and does not require the densities to have the same support), and √D_JS is a metric distance [11, 12]. Lin cited precursor works [56, 23] yielding the definition of the Jensen-Shannon divergence: the Jensen-Shannon divergence of Eq. 2 is the so-called “increments of entropy” defined in Equations (19) and (20) of [56].
The Jensen-Shannon diversity index was also obtained very differently by Sibson in 1969 when
he defined the information radius [52] of order α using Rényi α-means and Rényi α-entropies [50].
In particular, the information radius IR1 of order 1 of a weighted set P of densities is a diversity
index obtained by solving the following variational optimization problem:
IR_1[P] := min_c Σ_{i=1}^n w_i D_KL[p_i : c]. (3)

Sibson solved a more general optimization problem, and obtained the following expression (term K_1 in Corollary 2.3 of [52]):

IR_1[P] = h[Σ_i w_i p_i] − Σ_i w_i h[p_i] = D_JS[P].

Thus Eq. 3 is a variational definition of the Jensen-Shannon divergence.
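
This variational characterization is easy to verify numerically; the added sketch below (using scipy's generic optimizer, with the center parameterized on the simplex by a softmax) checks on a small finite alphabet that the optimal center is the mixture Σ_i w_i p_i and that the optimal value equals the Jensen-Shannon diversity index:

import numpy as np
from scipy.optimize import minimize

def kl(p, q):
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])
w = np.array([0.2, 0.5, 0.3])

def objective(z):                    # center c on the simplex via a softmax of z
    c = np.exp(z) / np.exp(z).sum()
    return np.sum([wi * kl(pi, c) for wi, pi in zip(w, P)])

res = minimize(objective, np.zeros(3))
c_opt = np.exp(res.x) / np.exp(res.x).sum()
mixture = w @ P

print(c_opt, mixture)                # optimal center matches the mixture
print(res.fun, entropy(mixture) - w @ np.array([entropy(p) for p in P]))  # matches the JS diversity index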

6.2 Some extensions of the Jensen-Shannon divergence


• Skewing the JSD.
The K-divergence of Lin can be skewed with a scalar parameter α ∈ (0, 1) to give

D_{K,α}[p : q] := D_KL[p : (1 − α)p + αq]. (4)

The skewing parameter α was first studied in [21] (2001, see Table 2 of [21]). We proposed to unify the Jeffreys divergence with the Jensen-Shannon divergence as follows (Equation 19 in [27]):

D^J_{K,α}[p : q] := (D_{K,α}[p : q] + D_{K,α}[q : p])/2. (5)

When α = 1/2, we have D^J_{K,1/2} = D_JS, and when α = 1, we get D^J_{K,1} = (1/2) D_J.
2

Notice that

D^{α,β}_JS[p : q] := (1 − β) D_KL[p : (1 − α)p + αq] + β D_KL[q : (1 − α)p + αq]

amounts to calculating

h^×[(1 − β)p + βq : (1 − α)p + αq] − ((1 − β) h[p] + β h[q]),

where

h^×[p : q] := − ∫ p(x) log q(x) dµ(x)

denotes the cross-entropy. By choosing α = β, we have h^×[(1 − β)p + βq : (1 − α)p + αq] = h[(1 − α)p + αq], and we thus recover the skewed Jensen-Shannon divergence of Eq. 2.

In [31] (2020), we considered a positive skewing vector α ∈ [0, 1]^k and a unit positive weight vector w belonging to the standard simplex ∆_k, and defined the following vector-skewed Jensen-Shannon divergence:

D^{α,w}_JS[p : q] := Σ_{i=1}^k w_i D_KL[(1 − α_i)p + α_i q : (1 − ᾱ)p + ᾱq], (6)

= h[(1 − ᾱ)p + ᾱq] − Σ_{i=1}^k w_i h[(1 − α_i)p + α_i q], (7)

where ᾱ = Σ_{i=1}^k w_i α_i. The divergence D^{α,w}_JS generalizes the (scalar) skew Jensen-Shannon divergence when k = 1, and is an Ali-Silvey-Csiszár f-divergence upper bounded by log(1/(ᾱ(1 − ᾱ))) [31].

• A priori mid-density. The JSD can be interpreted as the total divergence of the densities to the mid-density p̄ = Σ_{i=1}^n w_i p_i, a statistical mixture:

D_JS[P] = Σ_{i=1}^n w_i D_KL[p_i : p̄] = h[p̄] − Σ_{i=1}^n w_i h[p_i].

Unfortunately, the JSD between two Gaussian densities is not known in closed form because of the definite integral of a log-sum term (i.e., the K-divergence between a density and the mixture density p̄). For the special case of the Cauchy family, a closed-form formula for the JSD between two Cauchy densities was obtained [41]. Thus we may choose a geometric mixture distribution [29] instead of the ordinary arithmetic mixture p̄. More generally, we can choose any weighted mean M_α (say, the geometric mean, the harmonic mean, or any other power mean) and define a generalization of the K-divergence of Equation 4:

D^{M_α}_K[p : q] := D_KL[p : (pq)_{M_α}], (8)

where

(pq)_{M_α}(x) := M_α(p(x), q(x)) / Z_{M_α}(p : q)

is a statistical M-mixture, with Z_{M_α}(p : q) denoting the normalizing coefficient

Z_{M_α}(p : q) = ∫ M_α(p(x), q(x)) dµ(x),

so that ∫ (pq)_{M_α}(x) dµ(x) = 1. These M-mixtures are well defined provided that the defining integrals converge.

Then we define a generalization of the JSD [29] termed the (M_α, N_β)-Jensen-Shannon divergence as follows:

D^{M_α,N_β}_JS[p : q] := N_β(D_KL[p : (pq)_{M_α}], D_KL[q : (pq)_{M_α}]), (9)

where N_β is yet another weighted mean used to average the two M_α-K-divergences. We have D^{A,A}_JS = D_JS, where A(a, b) = (a + b)/2 is the arithmetic mean. The geometric JSD yields a closed-form formula between two multivariate Gaussians, and has been used in deep learning [10] (a numerical sketch is given after this list). More generally, we may consider the Jensen-Shannon symmetrization of an arbitrary distance D as

D^JS_{M_α,N_β}[p : q] := N_β(D[p : (pq)_{M_α}], D[q : (pq)_{M_α}]). (10)

• A posteriori mid-density. We consider a generalization of Sibson's information radius [52]. Let S_w(a_1, . . . , a_n) denote a generic weighted mean of n positive scalars a_1, . . . , a_n, with weight vector w ∈ ∆_n. Then we define the S-variational Jensen-Shannon diversity index [34] as

D^{S_w}_vJS(P) := min_c S_w(D_KL[p_1 : c], . . . , D_KL[p_n : c]). (11)

When S_w = A_w (with A_w(a_1, . . . , a_n) = Σ_{i=1}^n w_i a_i the weighted arithmetic mean), we recover the ordinary Jensen-Shannon diversity index. More generally, we define the S-Jensen-Shannon index of an arbitrary distance D as

D^{S_w}_vJS(P) := min_c S_w(D[p_1 : c], . . . , D[p_n : c]). (12)

When n = 2, this yields a Jensen-Shannon symmetrization of the distance D.


The variational optimization defining the JSD can also be constrained to a (parametric) family of densities D, thus defining the (S, D)-relative Jensen-Shannon diversity index:

D^{S_w,D}_vJS(P) := min_{c ∈ D} S_w(D_KL[p_1 : c], . . . , D_KL[p_n : c]). (13)

The relative Jensen-Shannon divergences are useful for clustering applications: Let p_{θ1} and p_{θ2} be two densities of an exponential family E with cumulant function F(θ). Then the E-relative Jensen-Shannon divergence is the Bregman information of P_2(p, q) for the conjugate function F*(η) = −h[p_θ] (with η = ∇F(θ)). The E-relative JSD amounts to a Jensen divergence for F*:

D_vJS[p_{θ1}, p_{θ2}] = min_θ (1/2) {D_KL[p_{θ1} : p_θ] + D_KL[p_{θ2} : p_θ]}, (14)

= min_θ (1/2) {B_F[θ : θ1] + B_F[θ : θ2]}, (15)

= min_η (1/2) {B_{F*}[η1 : η] + B_{F*}[η2 : η]}, (16)

= (F*(η1) + F*(η2))/2 − F*(η*), (17)

=: J_{F*}(η1, η2), (18)

since the minimizer is η* := (η1 + η2)/2 (a right-sided Bregman centroid [36]).
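
To make the geometric case concrete, the following added sketch instantiates Eq. (9) with M_α the geometric mean (α = 1/2) and N_β the arithmetic mean for two univariate Gaussians. It relies on the fact, used here as a stated assumption rather than derived in the text, that the normalized pointwise geometric mean of two Gaussians is again a Gaussian (with averaged precisions and precision-weighted means); the last lines check this numerically:

import numpy as np
from scipy.integrate import quad

def kl_gauss(m1, s1, m2, s2):                 # closed-form KL between N(m1, s1^2) and N(m2, s2^2)
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def geometric_mixture_params(m1, s1, m2, s2): # parameters of the normalized geometric mean
    prec = 0.5 / s1**2 + 0.5 / s2**2
    mg = (0.5 * m1 / s1**2 + 0.5 * m2 / s2**2) / prec
    return mg, np.sqrt(1.0 / prec)

m1, s1, m2, s2 = 0.0, 1.0, 2.0, 0.5
mg, sg = geometric_mixture_params(m1, s1, m2, s2)
geo_jsd = 0.5 * (kl_gauss(m1, s1, mg, sg) + kl_gauss(m2, s2, mg, sg))
print("geometric JSD:", geo_jsd)

# sanity check: the normalized pointwise geometric mean equals N(mg, sg^2)
gauss = lambda x, m, s: np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
Z = quad(lambda x: np.sqrt(gauss(x, m1, s1) * gauss(x, m2, s2)), -np.inf, np.inf)[0]
x0 = 0.7
print(np.sqrt(gauss(x0, m1, s1) * gauss(x0, m2, s2)) / Z, gauss(x0, mg, sg))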

7 Statistical distances between mixtures


Pearson [48] first considered a unimodal Gaussian mixture of two components for modeling distributions of crab measurements in 1894. Statistical mixtures [25] like the Gaussian mixture models (GMMs) are often met in information sciences, and therefore it is important to assess their dissimilarities. Let m(x) = Σ_{i=1}^k w_i p_i(x) and m'(x) = Σ_{i=1}^{k'} w'_i p'_i(x) be two finite statistical mixtures. The KLD between two GMMs m and m' is not analytic [55] because of the log-sum terms:

D_KL[m : m'] = ∫ m(x) log(m(x)/m'(x)) dx.

However, the KLD between two GMMs with the same prescribed components p_i(x) = p'_i(x) = p_{µ_i,Σ_i}(x) (i.e., k = k', and only the normalized positive weights may differ) is provably a Bregman divergence [39] for the differential negentropy F(θ):

D_KL[m(θ) : m(θ')] = B_F(θ : θ'),

where m(θ) = Σ_{i=1}^{k−1} w_i p_i(x) + (1 − Σ_{i=1}^{k−1} w_i) p_k(x) and F(θ) = ∫ m(θ)(x) log m(θ)(x) dx. The family {m_θ : θ ∈ ∆°_{k−1}} is called a mixture family in information geometry, where ∆°_{k−1} denotes the (k − 1)-dimensional open standard simplex. However, F(θ) is usually not available in closed form because of the log-sum integral. In some special cases, like the mixture of two prescribed Cauchy distributions, we get closed-form formulas for the KLD, JSD, etc. [41, 35]. Thus when dealing with mixtures (like GMMs), we either need efficient techniques for approximating (§7.1) or bounding (§7.2) the KLD, or new distances (§7.3) that yield closed-form formulas between mixture densities.

7.1 Approximating and/or fast statistical distances between mixtures


• The Jeffreys divergence (JD) D_J[m, m'] = D_KL[m : m'] + D_KL[m' : m] between two (Gaussian) mixture models is not available in closed form, and can be estimated using Monte Carlo integration as

D̂^{S_s}_J[m, m'] := (1/s) Σ_{i=1}^s 2 ((m(x_i) − m'(x_i))/(m(x_i) + m'(x_i))) log(m(x_i)/m'(x_i)),

where S_s = {x_1, . . . , x_s} are s i.i.d. samples from the mid mixture m_{12}(x) := (1/2)(m(x) + m'(x)) (with lim_{s→∞} D̂^{S_s}_J[m, m'] = D_J[m, m']). In [33], the mixtures m and m' are converted into densities of an exponential-polynomial family. The JD between densities p_θ and p_{θ'} of an exponential family with cumulant function F is available in closed form:

D_J[p_θ, p_{θ'}] = (θ' − θ) · (η' − η),

with η = ∇F(θ) and θ = ∇F*(η), where F* denotes the convex conjugate of F. Any smooth density r (including a mixture r = m) is converted into nearby densities p_{θ_r^SME} and p_{η_r^MLE} of an exponential-polynomial family using extensions of the Maximum Likelihood Estimator (MLE) and of the Score Matching Estimator (SME). Then the JD between mixtures is approximated as follows:

D_J[m, m'] ≃ (θ'^SME − θ^SME) · (η'^MLE − η^MLE).

• Given a finite set of mixtures {m_i(x)} sharing the same components (e.g., points on a mixture family manifold), we can precompute the KLDs between pairwise components to obtain fast approximations of the KLD D_KL[m_i : m_j] between any two mixtures m_i and m_j; see [51].
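
The Monte Carlo estimator above is straightforward to implement; here is an added sketch for a hypothetical pair of univariate GMMs (the component parameters are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)

def gmm_pdf(x, weights, mus, sigmas):
    comps = np.exp(-0.5 * ((x[:, None] - mus) / sigmas)**2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ weights

# two made-up univariate Gaussian mixture models
w1, mu1, sg1 = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([0.7, 1.2])
w2, mu2, sg2 = np.array([0.5, 0.5]), np.array([0.0, 2.5]), np.array([1.0, 0.8])

def sample_mid(s):
    # draw from m12 = (m + m')/2: pick one of the two mixtures uniformly, then a component
    out = np.empty(s)
    for i in range(s):
        w, mu, sg = (w1, mu1, sg1) if rng.random() < 0.5 else (w2, mu2, sg2)
        j = rng.choice(len(w), p=w)
        out[i] = rng.normal(mu[j], sg[j])
    return out

x = sample_mid(100_000)
m, mp = gmm_pdf(x, w1, mu1, sg1), gmm_pdf(x, w2, mu2, sg2)
jeffreys_hat = np.mean(2 * (m - mp) / (m + mp) * np.log(m / mp))
print("MC estimate of the Jeffreys divergence:", jeffreys_hat)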

7.2 Bounding statistical distances between mixtures


• Log-Sum-Exp bounds: In [42, 43], we lower and upper bound the cross-entropy between mixtures using the fact that the log-sum term log m(x) can be interpreted as a Log-Sum-Exp (LSE) function. We then compute lower and upper envelopes of the density functions using techniques of computational geometry to report deterministic lower and upper bounds on the KLD and on α-divergences. These bounds are said to be combinatorial because we decompose the support into elementary intervals. Bounds on the Total Variation Distance (TVD) between univariate mixtures are reported in [44].

7.3 Newly designed statistical distances yielding closed-form formula for mixtures
• Statistical Minkowski distances [30]: Consider the Lebesgue space

L_α(µ) := { f ∈ F : ∫_X |f(x)|^α dµ(x) < ∞ }

for α >= 1, where F denotes the set of all real-valued measurable functions defined on the support X. Minkowski's inequality writes as ‖p + q‖_α <= ‖p‖_α + ‖q‖_α for α ∈ [1, ∞). The statistical Minkowski difference distance between p, q ∈ L_α(µ) is defined as

D^Minkowski_α[p, q] := ‖p‖_α + ‖q‖_α − ‖p + q‖_α >= 0. (19)

The statistical Minkowski log-ratio distance is defined by:

L^Minkowski_α[p, q] := − log( ‖p + q‖_α / (‖p‖_α + ‖q‖_α) ) >= 0. (20)

These statistical Minkowski distances are symmetric, and L^Minkowski_α[p, q] is scale-invariant. For even integers α >= 2, D^Minkowski_α[m : m'] is available in closed form between mixtures.

• We showed in [28] that the Cauchy-Schwarz divergence (CSD), the quadratic Jensen-Rényi divergence (JRD) [54], and the total square distance (TSD) between two GMMs, and more generally between two mixtures of exponential families, can be obtained in closed form.
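
For illustration, the added sketch below evaluates the two statistical Minkowski distances (19) and (20) between two made-up univariate GMMs by plain numerical integration (the closed form for even α from [30] is not used here):

import numpy as np
from scipy.integrate import quad

def gmm_pdf(x, weights, mus, sigmas):
    return sum(w * np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
               for w, m, s in zip(weights, mus, sigmas))

m  = lambda x: gmm_pdf(x, [0.3, 0.7], [-1.0, 1.5], [0.5, 1.0])
mp = lambda x: gmm_pdf(x, [0.6, 0.4], [0.0, 2.0], [0.8, 0.6])

def lp_norm(f, alpha):
    return quad(lambda x: abs(f(x))**alpha, -np.inf, np.inf)[0] ** (1.0 / alpha)

alpha = 2
n_m, n_mp, n_sum = lp_norm(m, alpha), lp_norm(mp, alpha), lp_norm(lambda x: m(x) + mp(x), alpha)
D_minkowski = n_m + n_mp - n_sum             # Eq. (19), non-negative by Minkowski's inequality
L_minkowski = -np.log(n_sum / (n_m + n_mp))  # Eq. (20)
print(D_minkowski, L_minkowski)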

Initially created 13th August 2021 (last updated November 12, 2021).

References
[1] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society: Series B (Methodological),
28(1):131–142, 1966.

[2] Shun-Ichi Amari. α-divergence is unique, belonging to both f -divergence and Bregman diver-
gence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.

[3] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences.
Springer Japan, 2016.

[4] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estima-
tion by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

[5] Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical inference: the minimum
distance approach. Chapman and Hall/CRC, 2019.

[6] JM Corcuera and Federica Giummolè. A characterization of monotone and regular divergences.
Annals of the Institute of Statistical Mathematics, 50(3):433–450, 1998.

Table 1: Examples of information radius measures as variational abstract mean divergences.

(M, D)-information: I_{M,D}[p, q] := min_c {M(D[p : c], D[q : c])}

Information radius of order 1 (aka Jensen-Shannon divergence):
  I_{A,KL}[p, q] = (1/2) D_KL[p : (p+q)/2] + (1/2) D_KL[q : (p+q)/2]
Information radius of order α (aka Sibson's information radius):
  I_{M^R_α,D^R_α}[p, q] = (α/(α−1)) log_2 ∫_X ((p(x)^α + q(x)^α)/2)^{1/α} dµ(x)
Bregman information:
  I_{A,B_F}(θ1, θ2) = (1/2) B_F(θ1 : (θ1+θ2)/2) + (1/2) B_F(θ2 : (θ1+θ2)/2)
Skewed Jensen-Bregman divergence:
  I_{A_β,B_F}(θ1 : θ2) = (1 − β) B_F(θ1 : (1−β)θ1 + βθ2) + β B_F(θ2 : (1−β)θ1 + βθ2)
Chernoff minimax discrimination:
  I_{max,D*_KL}[p, q] = min_c max{D_KL[c : p], D_KL[c : q]}
Amari's α-risk:
  I_{A,D_α}[p, q] = (1/2) D_α[p : (pq)^A_α] + (1/2) D_α[q : (pq)^A_α]
Annealing geometric paths:
  (pq)^G_β is the minimizer c of I_{G_β,D*_KL}[p : q]

Means used in the table:
  arithmetic mean: A(a, b) = (a + b)/2
  weighted arithmetic mean: A_β(a, b) = (1 − β) a + β b
  geometric mean: G_β(a, b) = a^{1−β} b^β
  maximum (mean): MAX(a, b) = max{a, b}
  Rényi's α-mean: M^R_α(a, b) = (1/(α−1)) log_2((2^{(α−1)a} + 2^{(α−1)b})/2)
  Amari's α-integration: (pq)^A_α(x) ∝ f_α^{−1}((1/2) f_α(p(x)) + (1/2) f_α(q(x))), with α-representation f_α(u) = (2/(1−α)) u^{(1−α)/2}
  M-mixture: (pq)^M_β(x) ∝ M_β(p(x), q(x))

Divergences used in the table:
  Kullback-Leibler divergence: D_KL[p : q] = ∫_X p(x) log(p(x)/q(x)) dµ(x)
  reverse Kullback-Leibler divergence: D*_KL[p : q] = ∫_X q(x) log(q(x)/p(x)) dµ(x) = D_KL[q : p]
  Rényi's α-divergence: D^R_α[p : q] = (1/(α−1)) log_2(∫_X p(x)^α q(x)^{1−α} dµ(x))
  Amari's α-divergence: D^A_α[p : q] = (4/(1−α²))(1 − ∫_X p(x)^{(1−α)/2} q(x)^{(1+α)/2} dµ(x))
[7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect
observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

[8] Imre Csiszár. A class of measures of informativity of observation channels. Periodica Mathe-
matica Hungarica, 2(1-4):191–213, 1972.

[9] Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Now Publishers
Inc, 2004.

[10] Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Constraining Variational Inference with Ge-
ometric Jensen-Shannon Divergence. In Advances in Neural Information Processing Systems,
2020.

[11] Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions.
IEEE Transactions on Information theory, 49(7):1858–1860, 2003.

[12] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding.
In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., page 31.
IEEE, 2004.

[13] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against
heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.

[14] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.

[15] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings
of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–
461, 1946.

[16] Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz
divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal
of the Franklin Institute, 343(6):614–629, 2006.

[17] Jiantao Jiao, Thomas A Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Infor-
mation measures: the curious case of the binary alphabet. IEEE Transactions on Information
Theory, 60(12):7616–7626, 2014.

[18] MC Jones, Nils Lid Hjort, Ian R Harris, and Ayanendranath Basu. A comparison of related
density-based minimum divergence estimators. Biometrika, 88(3):865–873, 2001.

[19] Kittipat Kampa, Erion Hasanbelliu, and Jose C Principe. Closed-form Cauchy-Schwarz PDF
divergence for mixture of Gaussians. In The 2011 International Joint Conference on Neural
Networks, pages 2578–2585. IEEE, 2011.

[20] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

[21] Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In
Artificial Intelligence and Statistics (AISTATS), pages 65–72, 2001.

[22] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on
Information theory, 37(1):145–151, 1991.

[23] Jianhua Lin and SKM Wong. Approximation of discrete probability distributions based on a
new divergence measure. Congressus Numerantium (Winnipeg), 61:75–80, 1988.

[24] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. National Institute
of Science of India, 1936.

[25] Geoffrey J McLachlan and Kaye E Basford. Mixture models: Inference and applications to
clustering, volume 38. M. Dekker New York, 1988.

[26] Tetsuzo Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of
Japan, 18(3):328–331, 1963.

[27] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality.
arXiv preprint arXiv:1009.4004, 2010.

[28] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Pro-
ceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 1723–1726.
IEEE, 2012.

[29] Frank Nielsen. On the Jensen-Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 2019.

[30] Frank Nielsen. The statistical Minkowski distances: Closed-form formula for Gaussian mixture
models. In International Conference on Geometric Science of Information, pages 359–367.
Springer, 2019.

[31] Frank Nielsen. On a Generalization of the Jensen-Shannon Divergence and the Jensen-Shannon Centroid. Entropy, 22(2), 2020.

[32] Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian
mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[33] Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[34] Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Dis-
tances Based on the Information Radius. Entropy, 23(4), 2021.

[35] Frank Nielsen. The dually flat information geometry of the mixture family of two prescribed
Cauchy components. arXiv preprint arXiv:2104.13801, 2021.

[36] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE transac-
tions on Information Theory, 55(6):2882–2904, 2009.

[37] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approx-
imating f -divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.

[38] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and
projective divergences. In International Conference on Similarity Search and Applications,
pages 109–116. Springer, 2016.

[39] Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2861–2865. IEEE, 2018.

[40] Frank Nielsen and Kazuki Okamura. On f -divergences between Cauchy distributions. arXiv
preprint arXiv:2101.12459, 2021.

[41] Frank Nielsen and Kazuki Okamura. On f -divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.

[42] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate
mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.

[43] Frank Nielsen and Ke Sun. Combinatorial bounds on the α-divergence of univariate mixture
models. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing,
ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 4476–4480. IEEE, 2017.

[44] Frank Nielsen and Ke Sun. Guaranteed Deterministic Bounds on the total variation distance
between univariate mixtures. In 28th IEEE International Workshop on Machine Learning for
Signal Processing, MLSP 2018, Aalborg, Denmark, September 17-20, 2018, pages 1–6. IEEE,
2018.

[45] Frank Nielsen and Ke Sun. Clustering in Hilbert’s projective geometry: The case studies of
the probability simplex and the elliptope of correlation matrices. In Geometric Structures of
Information, pages 297–331. Springer, 2019.

[46] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences.
Entropy, 19(3):122, 2017.

[47] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the
logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

[48] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transac-
tions of the Royal Society of London. A, 185:71–110, 1894.

[49] Souvik Ray, Subrata Pal, Sumit Kumar Kar, and Ayanendranath Basu. Characterizing the
functional density power divergence class. arXiv preprint arXiv:2105.06094, 2021.

[50] Alfréd Rényi et al. On measures of entropy and information. In Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to
the Theory of Statistics. The Regents of the University of California, 1961.

[51] Olivier Schwander, Stéphane Marchand-Maillet, and Frank Nielsen. Comix: Joint estimation
and lightspeed comparison of mixture models. In 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016,
pages 2449–2453. IEEE, 2016.

[52] Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte
Gebiete, 14(2):149–160, 1969.

[53] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Ku-
mar. Density estimation in infinite dimensional exponential families. Journal of Machine
Learning Research, 18, 2017.

[54] Fei Wang, Tanveer Syeda-Mahmood, Baba C Vemuri, David Beymer, and Anand Rangarajan.
Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise
shape registration. In International Conference on Medical Image Computing and Computer-
Assisted Intervention, pages 648–655. Springer, 2009.

[55] Sumio Watanabe, Keisuke Yamazaki, and Miki Aoyagi. Kullback information of normal mix-
ture is not an analytic function. IEICE technical report. Neurocomputing, 104(225):41–46,
2004.

[56] Andrew KC Wong and Manlai You. Entropy and distance of random graphs with applica-
tion to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, (5):599–609, 1985.
