
Distances, divergences, statistical divergences and diversities

Frank Nielsen
Sony Computer Science Laboratories Inc.
Tokyo, Japan

August 2021
(Last updated November 12, 2021)

This document is also available in PDF: Distance.pdf

This is a working document which will be (hopefully frequently) updated with materials concerning the discrepancies between two distributions/parameters or the diversities of a set of distributions/parameters. The literature uses many synonyms for measures of the difference between two objects: metrics, distances, discrepancies, deviations, deviances, dissimilarities, divergences, contrast functions, or yokes (on product manifolds), etc. Diversities generalize 2-point distances by measuring the dispersion of a set of n objects, usually with respect to a notion of centrality.

In mathematics, a distance is often understood as a metric distance on a metric space, i.e., a function d(x, y) satisfying the following properties: non-negativity d(x, y) >= 0, identity of indiscernibles d(x, y) = 0 iff x = y, symmetry d(x, y) = d(y, x), and the triangle inequality d(x, z) <= d(x, y) + d(y, z). There is some confusion in the literature, where "distance" is also used as a synonym for a dissimilarity measure.

In information theory and statistics, we measure the deviation between one probability measure and another probability measure using a statistical divergence.

In information geometry, a divergence is a smooth dissimilarity measure D(θ1 : θ2) which shall satisfy the following conditions: D(θ1 : θ2) >= 0 for all θ1, θ2, with equality if and only if θ1 = θ2, and the mixed second-order partial derivatives of D at θ1 = θ2 define a positive-definite bilinear form (inducing a Riemannian metric). Divergences were formerly called contrast functions; a dualistic structure of information geometry can be built from a divergence.

Contents

1 Jensen divergences and Bregman divergences
  1.1 Skewed Jensen and Bregman divergences
  1.2 Relationships with statistical distances between densities of an exponential family
  1.3 Generalizations of Bregman and Jensen divergences

2 Invariant f-divergences

3 Distances and means

4 Statistical distances between densities with computationally intractable normalizers

5 Statistical distances between empirical distributions and densities with computationally intractable normalizers

6 The Jensen-Shannon divergence and some generalizations
  6.1 Origins of the Jensen-Shannon divergence
  6.2 Some extensions of the Jensen-Shannon divergence

7 Statistical distances between mixtures
  7.1 Approximating and/or fast statistical distances between mixtures
  7.2 Bounding statistical distances between mixtures
  7.3 Newly designed statistical distances yielding closed-form formula for mixtures

Figure 1: Skewed Jensen divergences visualized as vertical convexity gaps.

1 Jensen divergences and Bregman divergences


1.1 Skewed Jensen and Bregman divergences
For a strictly convex and differentiable generator F(θ), the α-skewed Jensen divergence is the convexity gap

J_{F,α}(θ1 : θ2) := α F(θ1) + (1 − α) F(θ2) − F(α θ1 + (1 − α) θ2),

which is non-negative for α ∈ (0, 1) by Jensen's inequality, and the Bregman divergence is B_F(θ1 : θ2) := F(θ1) − F(θ2) − (θ1 − θ2)^T ∇F(θ2). The scaled skewed Jensen divergence sJ_{F,α}(θ1 : θ2) := J_{F,α}(θ1 : θ2)/(α(1 − α)) tends to the sided Bregman divergences in the limit cases:

lim_{α→0} sJ_{F,α}(θ1 : θ2) = B_F(θ1 : θ2),

lim_{α→1} sJ_{F,α}(θ1 : θ2) = B_F(θ2 : θ1).

If α < 0 or α > 1, the Jensen gap changes sign and we can measure F(α θ1 + (1 − α) θ2) − (α F(θ1) + (1 − α) F(θ2)) = −J_{F,α}(θ1 : θ2) >= 0 (see Figure 1). Since α(1 − α) < 0 in that case as well, the scaled α-skew Jensen divergence remains non-negative for any α ∈ R \ {0, 1}:

sJ_{F,α}(θ1 : θ2) = J_{F,α}(θ1 : θ2)/(α(1 − α)) >= 0.
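
As a quick numerical illustration (an added sketch, not part of the original text), the following Python snippet checks the two limits above for the 1D generator F(θ) = θ log θ − θ, chosen here for convenience:

import numpy as np

def F(theta):                       # convex generator F(theta) = theta*log(theta) - theta
    return theta * np.log(theta) - theta

def gradF(theta):                   # gradient of F
    return np.log(theta)

def bregman(t1, t2):                # B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) * F'(t2)
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def scaled_skew_jensen(t1, t2, alpha):
    gap = alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)
    return gap / (alpha * (1 - alpha))

t1, t2 = 2.0, 0.5
for alpha in (1e-2, 1e-4, 1e-6):
    print(scaled_skew_jensen(t1, t2, alpha), "->", bregman(t1, t2))      # alpha -> 0
    print(scaled_skew_jensen(t1, t2, 1 - alpha), "->", bregman(t2, t1))  # alpha -> 1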

1.2 Relationships with statistical distances between densities of an exponential family

1.3 Generalizations of Bregman and Jensen divergences

2 Invariant f-divergences

An f-divergence [26, 7, 1, 8] I_f[p : q] is a dissimilarity measure between probability distributions defined for a convex generator f(u) by:

I_f[p : q] = ∫ p(x) f(q(x)/p(x)) dµ(x).

Using Jensen's inequality, we have I_f[p : q] >= f(1). Thus we ask for generators satisfying f(1) = 0. Moreover, in order to have I_f[p : q] = 0 iff p = q (µ-almost everywhere), we require f(u) to be strictly convex at 1. The f-divergences include many well-known statistical distances, listed below with their generators:

f-divergence                Formula I_f[p : q]                                                            Generator f(u)

Total variation (metric)    (1/2) ∫ |p(x) − q(x)| dµ(x)                                                   (1/2) |u − 1|
Squared Hellinger           ∫ (√p(x) − √q(x))² dµ(x)                                                      (√u − 1)²
Pearson χ²_P                ∫ (q(x) − p(x))²/p(x) dµ(x)                                                   (u − 1)²
Neyman χ²_N                 ∫ (p(x) − q(x))²/q(x) dµ(x)                                                   (1 − u)²/u
Kullback-Leibler            ∫ p(x) log(p(x)/q(x)) dµ(x)                                                   − log u
reverse Kullback-Leibler    ∫ q(x) log(q(x)/p(x)) dµ(x)                                                   u log u
Jeffreys divergence         ∫ (p(x) − q(x)) log(p(x)/q(x)) dµ(x)                                          (u − 1) log u
α-divergence                (4/(1 − α²)) (1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dµ(x))                      (4/(1 − α²)) (1 − u^{(1+α)/2})
Jensen-Shannon              (1/2) ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dµ(x)     −(u + 1) log((1 + u)/2) + u log u
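
As a sanity check (added here as a sketch, not in the original), the generic definition I_f[p : q] = Σ_x p(x) f(q(x)/p(x)) can be instantiated on a finite alphabet for a few of the generators above and compared against the direct formulas:

import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def f_divergence(p, q, f):          # I_f[p:q] = sum_x p(x) f(q(x)/p(x))
    return np.sum(p * f(q / p))

generators = {
    "total variation":   lambda u: 0.5 * np.abs(u - 1),
    "squared Hellinger": lambda u: (np.sqrt(u) - 1) ** 2,
    "Pearson chi2":      lambda u: (u - 1) ** 2,
    "Kullback-Leibler":  lambda u: -np.log(u),
}
for name, f in generators.items():
    print(name, f_divergence(p, q, f))

print("KL (direct)", np.sum(p * np.log(p / q)))   # matches the Kullback-Leibler entry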

Two f-divergences I_{f1} and I_{f2} are equivalent iff f1(u) = f2(u) + λ(u − 1) for some λ ∈ R. A symmetric f-divergence is bounded (e.g., the Jensen-Shannon divergence or the total variation) iff f(0) < ∞; the Jeffreys divergence is an unbounded f-divergence. The dual f-divergence I_{f*}[p : q] := I_f[q : p] is an f-divergence for the dual (or conjugate) generator f*(u) = u f(1/u). Thus symmetric f-divergences (e.g., the Jeffreys divergence, the squared Hellinger divergence, or the Jensen-Shannon divergence) satisfy the functional equality f(u) = u f(1/u). The f-divergences are jointly convex and satisfy the information monotonicity property: I_f[p_{|Y} : q_{|Y}] <= I_f[p : q] for any partition Y of X (see lumping [9]). A statistical divergence is said to be separable iff it can be rewritten as D[p : q] = ∫ D1(p(x) : q(x)) dµ(x), where D1 is a scalar divergence. The f-divergences are the only divergences which are both separable and information monotone [3] (except for the “curious case” [17] of binary alphabets X). An f-divergence is said to be standard [3] when f''(1) = 1. The local Taylor expansion [6] of the f-divergence I_f[p_{θ1} : p_{θ2}] between two parametric densities is governed by the Fisher information matrix I(θ) = E_{p_θ}[∇ log p_θ(x) (∇ log p_θ(x))^T] as follows:

I_f[p_{θ1} : p_{θ2}] = (f''(1)/2) (θ2 − θ1)^T I(θ1) (θ2 − θ1) + o(‖θ2 − θ1‖²).

Thus we have I_f[p_θ : p_{θ+dθ}] = (1/2) dθ^T I(θ) dθ for a standard f-divergence (with f''(1) = 1).
The following metric distance D_Q is called the Mahalanobis distance [24] (Mahalanobis defined that distance for Q = Σ^{−1} ≻ 0, the inverse of a covariance matrix):

D_Q(θ1, θ2) = √((θ2 − θ1)^T Q (θ2 − θ1)).

The Mahalanobis distance generalizes the Euclidean distance (expressed in the Cartesian coordinate system), recovered with Q = I, the identity matrix. When Σ = diag(σ_1², . . . , σ_D²) is diagonal, we have

D_{Σ^{−1}}(θ1, θ2) = √(Σ_{i=1}^D (θ2^i − θ1^i)²/σ_i²).

Thus we have I_f[p_{θ1} : p_{θ2}] = (1/2) D²_{I(θ1)}(θ1, θ2) + o(‖θ2 − θ1‖²) for a standard f-divergence. For θ1 = θ and θ2 = θ + dθ, the half squared Mahalanobis distance can be interpreted as half the squared Riemannian infinitesimal length element ds²_θ:

I_f[p_θ : p_{θ+dθ}] = (1/2) dθ^T I(θ) dθ = (1/2) ds²_θ.
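
This quadratic approximation can be checked numerically; the added sketch below uses the standard Fisher information diag(1/σ², 2/σ²) of a univariate Gaussian in the (µ, σ) parameterization and compares the Kullback-Leibler divergence between two nearby Gaussians with half of the squared Mahalanobis length induced by the Fisher information:

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):     # closed-form KL between N(mu1, s1^2) and N(mu2, s2^2)
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, s = 0.3, 1.5
dmu, ds = 1e-2, -2e-2
fisher = np.diag([1 / s**2, 2 / s**2])      # Fisher information I(mu, sigma) of the Gaussian family
dtheta = np.array([dmu, ds])

print("KL                 :", kl_gauss(mu, s, mu + dmu, s + ds))
print("(1/2) dtheta^T I dtheta:", 0.5 * dtheta @ fisher @ dtheta)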

An f-divergence between any two densities p_{θ1} and p_{θ2} can be expressed as a Taylor series [40] when p_{θ2}(x)/p_{θ1}(x) < 1 + r_f, where r_f is the convergence radius of the analytic generator f ∈ C^ω, and the ratio p_{θ2}/p_{θ1} is upper bounded by a constant C:

I_f[p_{θ1} : p_{θ2}] = Σ_{n=2}^∞ a_n ∫_X (p_{θ2}(x)/p_{θ1}(x) − 1)^n p_{θ1}(x) dµ(x),

where a_n := f^{(n)}(1)/n! are the Taylor coefficients of the generator at u = 1. Otherwise, the Taylor series diverges.

By introducing the higher-order chi divergences [37]

D_{χ,n}[p : q] = ∫ (p(x) − q(x))^n / q(x)^{n−1} dµ(x),

which are proper divergences for even integer orders and only pseudo-distances for odd orders, we can rewrite the Taylor series of f-divergences as:

I_f[p_{θ1} : p_{θ2}] = Σ_{n=2}^∞ a_n D_{χ,n}[p_{θ1} : p_{θ2}].

The higher-order chi divergences [37] between densities of an exponential family are available in closed form provided that the natural parameter space is affine (e.g., the isotropic Gaussian family or the Poisson family).

To illustrate the Taylor series, consider the Poisson family, and let us express the Taylor series of the Jensen-Shannon divergence (generator f_JS(u) = −(u + 1) log((1 + u)/2) + u log u) between two Poisson distributions with parameters λ1 = 0.6 and λ2 = 0.3. The higher-order chi divergences between Poisson distributions are expressed in closed form as follows [37]:

D_{χ,k}[p_{λ1} : p_{λ2}] = Σ_{j=0}^k (k choose j) (−1)^{k−j} exp(λ1^{1−j} λ2^j − ((1 − j) λ1 + j λ2)).

Furthermore, the density ratio p_{λ2}(x)/p_{λ1}(x) = (λ2/λ1)^x e^{λ1−λ2} is upper bounded by C = e^{λ1−λ2} = e^{0.3} ≈ 1.35 since λ2 < λ1, so the ratio condition above is satisfied (a numerical sketch of the truncated series follows).
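
The following Python sketch (added for illustration; it relies on sympy to extract the Taylor coefficients a_n = f^{(n)}(1)/n! and uses the closed-form Poisson expression above) sums the truncated series Σ_{n=2}^N a_n D_{χ,n} and compares it with a direct evaluation of I_{f_JS}[p_{λ1} : p_{λ2}] by summation over the support:

from math import comb, exp, factorial
import sympy as sp

l1, l2 = 0.6, 0.3
u = sp.symbols('u', positive=True)
f_js = -(u + 1) * sp.log((1 + u) / 2) + u * sp.log(u)

def a(n):                                  # Taylor coefficient f^(n)(1)/n!
    return float(sp.diff(f_js, u, n).subs(u, 1)) / factorial(n)

def chi_poisson(n):                        # closed-form integral of (p_l2 - p_l1)^n / p_l1^(n-1)
    return sum(comb(n, j) * (-1) ** (n - j)
               * exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(n + 1))

def poisson_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

# direct evaluation of I_f[p_{l1} : p_{l2}] by summing over the support
direct = sum(poisson_pmf(x, l1) * float(f_js.subs(u, poisson_pmf(x, l2) / poisson_pmf(x, l1)))
             for x in range(50))

# truncated Taylor series sum_{n=2}^N a_n * chi_n, compared with the direct evaluation
for N in (2, 4, 8, 16):
    print(N, sum(a(n) * chi_poisson(n) for n in range(2, N + 1)), "vs", direct)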
When q is close to p, we have the second-order approximation

I_f[p : q] ≈ (f''(1)/2) D_{χ²}[p : q],

where D_{χ²}[p : q] = ∫ (p(x) − q(x))²/q(x) dµ(x) is the (Neyman) chi-squared divergence. Therefore, we have

D_KL[p : q] ≈ (1/2) D_{χ²}[p : q]. (1)
On the finite-dimensional probability simplex, the Kullback-Leibler divergence is the only statistical divergence which belongs to both the class of f-divergences and the class of Bregman divergences [2]. When extending the f-divergences to positive measures, the intersection of the f-divergences with the Bregman divergences is the class of α-divergences [2].
For the parametric family of Cauchy distributions, the f-divergences are always symmetric and can be expressed as a function of the chi-squared divergence [40]:

I_f[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}] = I_f[p^Cauchy_{l2,s2} : p^Cauchy_{l1,s1}] = h_f(D_{χ²}[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}]),

where

D_{χ²}[p^Cauchy_{l1,s1} : p^Cauchy_{l2,s2}] = ((l2 − l1)² + (s2 − s1)²)/(2 s1 s2)

and

p^Cauchy_{l,s}(x) := 1/(π s (1 + ((x − l)/s)²)) = s/(π (s² + (x − l)²)).
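
A small numerical check (an added sketch; scipy is assumed available for the integration) of the closed-form chi-squared divergence between Cauchy densities and of the symmetry of the KLD on the Cauchy family:

import numpy as np
from scipy.integrate import quad

def cauchy(x, l, s):
    return s / (np.pi * (s**2 + (x - l)**2))

l1, s1, l2, s2 = 0.0, 1.0, 1.0, 2.0

chi2_closed = ((l2 - l1)**2 + (s2 - s1)**2) / (2 * s1 * s2)
chi2_num, _ = quad(lambda x: (cauchy(x, l1, s1) - cauchy(x, l2, s2))**2 / cauchy(x, l2, s2),
                   -np.inf, np.inf)
print(chi2_closed, chi2_num)        # closed form vs numerical integration

kl_12, _ = quad(lambda x: cauchy(x, l1, s1) * np.log(cauchy(x, l1, s1) / cauchy(x, l2, s2)),
                -np.inf, np.inf)
kl_21, _ = quad(lambda x: cauchy(x, l2, s2) * np.log(cauchy(x, l2, s2) / cauchy(x, l1, s1)),
                -np.inf, np.inf)
print(kl_12, kl_21)                 # equal: f-divergences are symmetric on the Cauchy family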

3 Distances and means

4 Statistical distances between densities with computationally intractable normalizers
Consider a density p(x) = p̃(x)/Z_p, where p̃(x) is an unnormalized computable density and Z_p = ∫ p̃(x) dµ(x) is the computationally intractable normalizer (also called in statistical physics the partition function or free energy). A statistical distance D[p1 : p2] between two densities p1(x) = p̃1(x)/Z_{p1} and p2(x) = p̃2(x)/Z_{p2} with computationally intractable normalizers Z_{p1} and Z_{p2} is said to be projective (or two-sided homogeneous) if and only if

D[λ1 p1 : λ2 p2] = D[p1 : p2] for all λ1 > 0, λ2 > 0.

In particular, letting λ1 = Z_{p1} and λ2 = Z_{p2}, we have

D[p1 : p2] = D[p̃1 : p̃2].

Notice that the right-hand side does not rely on the computationally intractable normalizers. These projective distances are useful in statistical inference based on minimum distance estimators [5] (see the next section).

Here are a few projective statistical distances:

• γ-divergences (γ > 0) [18, 13]:

D_γ[p : q] := log(∫ q(x)^{γ+1} dµ(x)) − (1 + 1/γ) log(∫ q(x)^γ p(x) dµ(x)) + (1/γ) log(∫ p(x)^{γ+1} dµ(x)), γ > 0.

When γ → 0, the γ-divergence D_γ[p : q] tends to D_KL[p : q], the Kullback-Leibler divergence (KLD) [13]. For example, we can estimate the KLD between two densities of an exponential-polynomial family by Monte Carlo stochastic integration of the γ-divergence for a small value of γ [38].

The γ-divergences (projective, of Bregman type: cross-entropy minus entropy) and the density power divergence [4] (non-projective, a Bregman-type divergence):

D^dpd_α[p : q] := ∫ q(x)^{α+1} dµ(x) − (1 + 1/α) ∫ q(x)^α p(x) dµ(x) + (1/α) ∫ p(x)^{α+1} dµ(x), α >= 0,

can be encapsulated into the family of Φ-power divergences [49] (the functional density power divergence class):

D_{φ,α}[p : q] := φ(∫ q(x)^{α+1} dµ(x)) − (1 + 1/α) φ(∫ q(x)^α p(x) dµ(x)) + (1/α) φ(∫ p(x)^{α+1} dµ(x)), α >= 0,

where φ(e^x) is convex and strictly increasing, and φ is continuous and twice continuously differentiable with finite second-order derivatives. We have D_{φ,0}[p : q] = φ'(1) ∫ p(x) log(p(x)/q(x)) dµ(x) = φ'(1) D_KL[p : q].

• Cauchy-Schwarz divergence [16] (CSD, projective; a numerical sketch of the projectivity is given after this list):

D_CS[p : q] = − log( ∫ p(x) q(x) dµ(x) / √(∫ p(x)² dµ(x) ∫ q(x)² dµ(x)) ) = D_CS[λ1 p : λ2 q] for all λ1 > 0, λ2 > 0,

and Hölder divergences [46] (HD, projective, which generalize the CSD):

D^Hölder_{α,γ}[p : q] = − log( ∫_X p(x)^{γ/α} q(x)^{γ/β} dx / ((∫_X p(x)^γ dx)^{1/α} (∫_X q(x)^γ dx)^{1/β}) ), with 1/α + 1/β = 1.

We have

D^Hölder_{α,γ}[λ1 p : λ2 q] = D^Hölder_{α,γ}[p : q] for all λ1 > 0, λ2 > 0,

and

D^Hölder_{2,2}[p : q] = D_CS[p : q].

The Hölder divergence between two densities p_{θp} and p_{θq} of an exponential family with cumulant function F(θ) is available in closed form [46]:

D^Hölder_{α,γ}[p : q] = (1/α) F(γ θp) + (1/β) F(γ θq) − F((γ/α) θp + (γ/β) θq).

The CSD is available in closed form between mixtures of an exponential family with a conic natural parameter space [28]; this includes the case of Gaussian mixture models [19].

• Hilbert distance [45] (projective): Consider two probability mass functions p = (p_1, . . . , p_d) and q = (q_1, . . . , q_d) of the d-dimensional probability simplex. Then the Hilbert distance is

D_Hilbert[p : q] = log( (max_{i ∈ {1,...,d}} p_i/q_i) / (min_{j ∈ {1,...,d}} p_j/q_j) ).

We have

D_Hilbert[λ1 p : λ2 q] = D_Hilbert[p : q] for all λ1 > 0, λ2 > 0.

The Hilbert projective simplex distance can be extended to the cone of positive-definite matrices [45] (and its subspace of correlation matrices called the elliptope) as follows:

D_Hilbert[P : Q] = log( λ_max(P Q^{−1}) / λ_min(P Q^{−1}) ),

where λ_max(X) and λ_min(X) denote the largest and smallest eigenvalues of matrix X, respectively.
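
As announced above, here is a small added sketch checking the projectivity (two-sided homogeneity) of the Cauchy-Schwarz divergence, with finite sums standing in for the integrals and random unnormalized vectors playing the role of the unnormalized densities:

import numpy as np

def cauchy_schwarz_div(p, q):
    return -np.log(np.dot(p, q) / np.sqrt(np.dot(p, p) * np.dot(q, q)))

rng = np.random.default_rng(0)
p_tilde = rng.random(5)            # unnormalized "densities"
q_tilde = rng.random(5)
p = p_tilde / p_tilde.sum()        # normalized counterparts
q = q_tilde / q_tilde.sum()

print(cauchy_schwarz_div(p, q))
print(cauchy_schwarz_div(p_tilde, q_tilde))              # same value: projective
print(cauchy_schwarz_div(3.7 * p_tilde, 0.2 * q_tilde))  # invariant to positive rescalings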

5 Statistical distances between empirical distributions and densities with computationally intractable normalizers

When estimating the parameter θ̂ of a parametric family of distributions {p_θ} from i.i.d. observations S = {x_1, . . . , x_n}, we can define a minimum distance estimator (MDE):

θ̂ = arg min_θ D[p_S : p_θ],

where p_S = (1/n) Σ_{i=1}^n δ_{x_i} is the (normalized) empirical distribution. Thus we only need a right-sided projective divergence to estimate models with computationally intractable normalizers. For example, the Maximum Likelihood Estimator (MLE) is an MDE with respect to the KLD:

θ̂_MLE = arg min_θ D_KL[p_S : p_θ].

It is thus interesting to study the impact of the choice of the distance D on the properties of the corresponding estimator (e.g., γ-divergences yield provably robust estimators [13]).

• Hyvärinen divergence [14] (also called the Fisher divergence or the Fisher relative information [47]):

D^Hyvärinen[p : p_θ] := (1/2) ∫ ‖∇_x log p(x) − ∇_x log p_θ(x)‖² p(x) dx.

The Hyvärinen divergence has been extended to order-α Hyvärinen divergences [32] (for α > 0):

D^Hyvärinen_α[p : q] := (1/2) ∫ p(x)^α ‖∇_x log p(x) − ∇_x log q(x)‖² dx, α > 0.

The Fisher divergence is related to the Kullback-Leibler divergence [53] as follows:

D_KL[p : q] = ∫_0^∞ D_Fisher[p ∗ N(0, λI) : q ∗ N(0, λI)] dλ,

where (f ∗ g)(x) = ∫ f(y) g(x − y) dy denotes the convolution of densities. Thus convergence with respect to the Fisher divergence is stronger than convergence with respect to the KLD.
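
Because D^Hyvärinen only involves the score ∇_x log p_θ, it is left unchanged when the model density is known only up to its normalizer. The added sketch below illustrates this for univariate Gaussians, computing the divergence once with the exact score and once with the score of an unnormalized density obtained by numerical differentiation (scipy is assumed available; integration is restricted to a finite range where the densities are numerically safe):

import numpy as np
from scipy.integrate import quad

mu1, s1, mu2, s2 = 0.0, 1.0, 0.8, 1.4

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))

def score_gauss(x, mu, s):                     # d/dx log N(x; mu, s^2) = -(x - mu)/s^2
    return -(x - mu) / s**2

def hyvarinen(score_q):                        # (1/2) E_p[(score_p(x) - score_q(x))^2]
    integrand = lambda x: 0.5 * (score_gauss(x, mu1, s1) - score_q(x))**2 * gauss(x, mu1, s1)
    return quad(integrand, -10, 10)[0]

def numerical_score(f, x, h=1e-5):             # d/dx log f(x) by central differences
    return (np.log(f(x + h)) - np.log(f(x - h))) / (2 * h)

q_tilde = lambda x: 17.3 * gauss(x, mu2, s2)   # model density known only up to a constant

print(hyvarinen(lambda x: score_gauss(x, mu2, s2)))       # exact score of q
print(hyvarinen(lambda x: numerical_score(q_tilde, x)))   # same value: the constant cancels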

6 The Jensen-Shannon divergence and some generalizations


6.1 Origins of the Jensen-Shannon divergence
Let (X, F, µ) be a measure space, and let (w_1, P_1), . . . , (w_n, P_n) be n weighted probability measures dominated by the measure µ (with w_i > 0 and Σ_i w_i = 1). Denote by P := {(w_1, p_1), . . . , (w_n, p_n)} the set of their weighted Radon-Nikodym densities p_i = dP_i/dµ with respect to µ.

A statistical divergence D[p : q] is a measure of dissimilarity between two densities p and q (i.e., a 2-point distance) such that D[p : q] >= 0 with equality if and only if p(x) = q(x) µ-almost everywhere. A statistical diversity index D(P) is a measure of the variation of the weighted densities in P related to a measure of centrality, i.e., an n-point distance which generalizes the notion of 2-point distance when P_2(p, q) := {(1/2, p), (1/2, q)}:

D[p : q] := D(P_2(p, q)).

The fundamental measure of dissimilarity in information theory is the I-divergence (also called
the Kullback-Leibler divergence, KLD, see Equation (2.5) page 5 of [20]):
D_KL[p : q] := ∫_X p(x) log(p(x)/q(x)) dµ(x).

The KLD is asymmetric (hence the delimiter notation “:” instead of “,”) but can be symmetrized by defining the Jeffreys J-divergence (Jeffreys divergence, denoted by I_2 in Equation (1) of the 1946 paper [15]):

D_J[p, q] := D_KL[p : q] + D_KL[q : p] = ∫_X (p(x) − q(x)) log(p(x)/q(x)) dµ(x).

Although symmetric, no positive power of the Jeffreys divergence satisfies the triangle inequality: that is, D_J^α is never a metric distance for any α > 0; furthermore, D_J^α cannot be upper bounded.
In 1991, Lin proposed the asymmetric K-divergence (Equation (3.2) in [22]):

D_K[p : q] := D_KL[p : (p + q)/2],

and defined the L-divergence by analogy to Jeffreys’s symmetrization of the KLD (Equation (3.4)
in [22]):
DL [p, q] = DK [p : q] + DK [q : p].
By noticing that

D_L[p, q] = 2 h[(p + q)/2] − (h[p] + h[q]),
where h denotes Shannon entropy (Equation (3.14) in [22]), Lin coined the (skewed) Jensen-
Shannon divergence between two weighted densities (1 − α, p) and (α, q) for α ∈ (0, 1) as follows
(Equation (4.1) in [22]):

DJS,α [p, q] = h[(1 − α)p + αq] − (1 − α)h[p] − αh[q]. (2)

Finally, Lin defined the generalized Jensen-Shannon divergence (Equation (5.1) in [22]) for a finite weighted set of densities:

D_JS[P] = h[Σ_i w_i p_i] − Σ_i w_i h[p_i].

This generalized Jensen-Shannon divergence is nowadays called the Jensen-Shannon diversity index.
In contrast with the Jeffreys divergence, the Jensen-Shannon divergence (JSD) D_JS := D_{JS,1/2} is upper bounded by log 2 (and does not require the densities to have the same support), and √D_JS is a metric distance [11, 12]. Lin cited precursor works [56, 23] yielding the definition of the Jensen-Shannon divergence: the Jensen-Shannon divergence of Eq. 2 is the so-called “increments of entropy” defined in Equations (19) and (20) of [56].
The Jensen-Shannon diversity index was also obtained very differently by Sibson in 1969 when
he defined the information radius [52] of order α using Rényi α-means and Rényi α-entropies [50].
In particular, the information radius IR1 of order 1 of a weighted set P of densities is a diversity
index obtained by solving the following variational optimization problem:
IR_1[P] := min_c Σ_{i=1}^n w_i D_KL[p_i : c]. (3)

Sibson solved a more general optimization problem, and obtained the following expression (term K_1 in Corollary 2.3 of [52]):

IR_1[P] = h[Σ_i w_i p_i] − Σ_i w_i h[p_i] = D_JS[P].

Thus Eq. 3 is a variational definition of the Jensen-Shannon divergence.
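
This variational characterization is easy to verify numerically; the added sketch below (using scipy's generic optimizer, with the center parameterized on the simplex by a softmax) checks on a small finite alphabet that the optimal center is the mixture Σ_i w_i p_i and that the optimal value equals the Jensen-Shannon diversity index:

import numpy as np
from scipy.optimize import minimize

def kl(p, q):
    return np.sum(p * np.log(p / q))

def entropy(p):
    return -np.sum(p * np.log(p))

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])
w = np.array([0.2, 0.5, 0.3])

def objective(z):                    # center c on the simplex via a softmax of z
    c = np.exp(z) / np.exp(z).sum()
    return np.sum([wi * kl(pi, c) for wi, pi in zip(w, P)])

res = minimize(objective, np.zeros(3))
c_opt = np.exp(res.x) / np.exp(res.x).sum()
mixture = w @ P

print(c_opt, mixture)                # optimal center matches the mixture
print(res.fun, entropy(mixture) - w @ np.array([entropy(p) for p in P]))  # matches the JS diversity index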

6.2 Some extensions of the Jensen-Shannon divergence


• Skewing the JSD.
The K-divergence of Lin can be skewed with a scalar parameter α ∈ (0, 1) to give

D_{K,α}[p : q] := D_KL[p : (1 − α)p + αq]. (4)

The skewing parameter α was first studied in [21] (2001, see Table 2 of [21]). We proposed to unify the Jeffreys divergence with the Jensen-Shannon divergence as follows (Equation 19 in [27]):

D^J_{K,α}[p : q] := (D_{K,α}[p : q] + D_{K,α}[q : p])/2. (5)

When α = 1/2, we have D^J_{K,1/2} = D_JS, and when α = 1, we get D^J_{K,1} = (1/2) D_J.
2

Notice that

D^{α,β}_JS[p : q] := (1 − β) D_KL[p : (1 − α)p + αq] + β D_KL[q : (1 − α)p + αq]

amounts to calculating

h^×[(1 − β)p + βq : (1 − α)p + αq] − ((1 − β) h[p] + β h[q]),

where

h^×[p : q] := − ∫ p(x) log q(x) dµ(x)

denotes the cross-entropy. By choosing α = β, we have h^×[(1 − β)p + βq : (1 − α)p + αq] = h[(1 − α)p + αq], and we thus recover the skewed Jensen-Shannon divergence of Eq. 2.

In [31] (2020), we considered a positive skewing vector α ∈ [0, 1]^k and a unit positive weight vector w belonging to the standard simplex ∆_k, and defined the following vector-skewed Jensen-Shannon divergence:

D^{α,w}_JS[p : q] := Σ_{i=1}^k w_i D_KL[(1 − α_i)p + α_i q : (1 − ᾱ)p + ᾱq], (6)

= h[(1 − ᾱ)p + ᾱq] − Σ_{i=1}^k w_i h[(1 − α_i)p + α_i q], (7)

where ᾱ = Σ_{i=1}^k w_i α_i. The divergence D^{α,w}_JS generalizes the (scalar) skew Jensen-Shannon divergence when k = 1, and is an Ali-Silvey-Csiszár f-divergence upper bounded by log(1/(ᾱ(1 − ᾱ))) [31].

• A priori mid-density. The JSD can be interpreted as the total divergence of the densities to the mid-density p̄ = Σ_{i=1}^n w_i p_i, a statistical mixture:

D_JS[P] = Σ_{i=1}^n w_i D_KL[p_i : p̄] = h[p̄] − Σ_{i=1}^n w_i h[p_i].

Unfortunately, the JSD between two Gaussian densities is not known in closed form because of the definite integral of a log-sum term (i.e., the K-divergence between a density and the mixture density p̄). For the special case of the Cauchy family, a closed-form formula for the JSD between two Cauchy densities was obtained [41]. Thus we may choose a geometric mixture distribution [29] instead of the ordinary arithmetic mixture p̄. More generally, we can choose any weighted mean M_α (say, the geometric mean, the harmonic mean, or any other power mean) and define a generalization of the K-divergence of Equation 4:

D^{M_α}_K[p : q] := D_KL[p : (pq)_{M_α}], (8)

where

(pq)_{M_α}(x) := M_α(p(x), q(x)) / Z_{M_α}(p : q)

is a statistical M-mixture, with Z_{M_α}(p : q) denoting the normalizing coefficient

Z_{M_α}(p : q) = ∫ M_α(p(x), q(x)) dµ(x),

so that ∫ (pq)_{M_α}(x) dµ(x) = 1. These M-mixtures are well defined provided that the defining integrals converge.

Then we define a generalization of the JSD [29] termed the (M_α, N_β)-Jensen-Shannon divergence as follows:

D^{M_α,N_β}_JS[p : q] := N_β(D_KL[p : (pq)_{M_α}], D_KL[q : (pq)_{M_α}]), (9)

where N_β is yet another weighted mean used to average the two M_α-K-divergences. We have D^{A,A}_JS = D_JS, where A(a, b) = (a + b)/2 is the arithmetic mean. The geometric JSD yields a closed-form formula between two multivariate Gaussians, and has been used in deep learning [10] (a numerical sketch is given after this list). More generally, we may consider the Jensen-Shannon symmetrization of an arbitrary distance D as

D^JS_{M_α,N_β}[p : q] := N_β(D[p : (pq)_{M_α}], D[q : (pq)_{M_α}]). (10)

• A posteriori mid-density. We consider a generalization of Sibson's information radius [52]. Let S_w(a_1, . . . , a_n) denote a generic weighted mean of n positive scalars a_1, . . . , a_n, with weight vector w ∈ ∆_n. Then we define the S-variational Jensen-Shannon diversity index [34] as

D^{S_w}_vJS(P) := min_c S_w(D_KL[p_1 : c], . . . , D_KL[p_n : c]). (11)

When S_w = A_w (with A_w(a_1, . . . , a_n) = Σ_{i=1}^n w_i a_i the weighted arithmetic mean), we recover the ordinary Jensen-Shannon diversity index. More generally, we define the S-Jensen-Shannon index of an arbitrary distance D as

D^{S_w}_vJS(P) := min_c S_w(D[p_1 : c], . . . , D[p_n : c]). (12)

When n = 2, this yields a Jensen-Shannon symmetrization of the distance D.


The variational optimization defining the JSD can also be constrained to a (parametric) family of densities D, thus defining the (S, D)-relative Jensen-Shannon diversity index:

D^{S_w,D}_vJS(P) := min_{c ∈ D} S_w(D_KL[p_1 : c], . . . , D_KL[p_n : c]). (13)

The relative Jensen-Shannon divergences are useful for clustering applications: Let p_{θ1} and p_{θ2} be two densities of an exponential family E with cumulant function F(θ). Then the E-relative Jensen-Shannon divergence is the Bregman information of P_2(p, q) for the conjugate function F*(η) = −h[p_θ] (with η = ∇F(θ)). The E-relative JSD amounts to a Jensen divergence for F*:

D_vJS[p_{θ1}, p_{θ2}] = min_θ (1/2) {D_KL[p_{θ1} : p_θ] + D_KL[p_{θ2} : p_θ]}, (14)

= min_θ (1/2) {B_F[θ : θ1] + B_F[θ : θ2]}, (15)

= min_η (1/2) {B_{F*}[η1 : η] + B_{F*}[η2 : η]}, (16)

= (F*(η1) + F*(η2))/2 − F*(η*), (17)

=: J_{F*}(η1, η2), (18)

since the minimizer is η* := (η1 + η2)/2 (a right-sided Bregman centroid [36]).
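
To make the geometric case concrete, the following added sketch instantiates Eq. (9) with M_α the geometric mean (α = 1/2) and N_β the arithmetic mean for two univariate Gaussians. It relies on the fact, used here as a stated assumption rather than derived in the text, that the normalized pointwise geometric mean of two Gaussians is again a Gaussian (with averaged precisions and precision-weighted means); the last lines check this numerically:

import numpy as np
from scipy.integrate import quad

def kl_gauss(m1, s1, m2, s2):                 # closed-form KL between N(m1, s1^2) and N(m2, s2^2)
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def geometric_mixture_params(m1, s1, m2, s2): # parameters of the normalized geometric mean
    prec = 0.5 / s1**2 + 0.5 / s2**2
    mg = (0.5 * m1 / s1**2 + 0.5 * m2 / s2**2) / prec
    return mg, np.sqrt(1.0 / prec)

m1, s1, m2, s2 = 0.0, 1.0, 2.0, 0.5
mg, sg = geometric_mixture_params(m1, s1, m2, s2)
geo_jsd = 0.5 * (kl_gauss(m1, s1, mg, sg) + kl_gauss(m2, s2, mg, sg))
print("geometric JSD:", geo_jsd)

# sanity check: the normalized pointwise geometric mean equals N(mg, sg^2)
gauss = lambda x, m, s: np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
Z = quad(lambda x: np.sqrt(gauss(x, m1, s1) * gauss(x, m2, s2)), -np.inf, np.inf)[0]
x0 = 0.7
print(np.sqrt(gauss(x0, m1, s1) * gauss(x0, m2, s2)) / Z, gauss(x0, mg, sg))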

7 Statistical distances between mixtures


Pearson [48] first considered a unimodal Gaussian mixture of two components for modeling distributions of crab measurements in 1894. Statistical mixtures [25] like the Gaussian mixture models (GMMs) are often met in information sciences, and therefore it is important to assess their dissimilarities. Let m(x) = Σ_{i=1}^k w_i p_i(x) and m'(x) = Σ_{i=1}^{k'} w'_i p'_i(x) be two finite statistical mixtures. The KLD between two GMMs m and m' is not analytic [55] because of the log-sum terms:

D_KL[m : m'] = ∫ m(x) log(m(x)/m'(x)) dx.

However, the KLD between two GMMs with the same prescribed components p_i(x) = p'_i(x) = p_{µ_i,Σ_i}(x) (i.e., k = k', and only the normalized positive weights may differ) is provably a Bregman divergence [39] for the differential negentropy F(θ):

D_KL[m(θ) : m(θ')] = B_F(θ : θ'),

where m(θ) = Σ_{i=1}^{k−1} w_i p_i(x) + (1 − Σ_{i=1}^{k−1} w_i) p_k(x) and F(θ) = ∫ m(θ)(x) log m(θ)(x) dx. The family {m_θ : θ ∈ ∆°_{k−1}} is called a mixture family in information geometry, where ∆°_{k−1} denotes the (k − 1)-dimensional open standard simplex. However, F(θ) is usually not available in closed form because of the log-sum integral. In some special cases, like the mixture of two prescribed Cauchy distributions, we get closed-form formulas for the KLD, JSD, etc. [41, 35]. Thus when dealing with mixtures (like GMMs), we either need efficient techniques for approximating (§7.1) or bounding (§7.2) the KLD, or new distances (§7.3) that yield closed-form formulas between mixture densities.

7.1 Approximating and/or fast statistical distances between mixtures


• The Jeffreys divergence (JD) D_J[m, m'] = D_KL[m : m'] + D_KL[m' : m] between two (Gaussian) mixture models is not available in closed form, and can be estimated using Monte Carlo integration as

D̂^{S_s}_J[m, m'] := (1/s) Σ_{i=1}^s 2 ((m(x_i) − m'(x_i))/(m(x_i) + m'(x_i))) log(m(x_i)/m'(x_i)),

where S_s = {x_1, . . . , x_s} are s i.i.d. samples from the mid mixture m_{12}(x) := (1/2)(m(x) + m'(x)) (with lim_{s→∞} D̂^{S_s}_J[m, m'] = D_J[m, m']). In [33], the mixtures m and m' are converted into densities of an exponential-polynomial family. The JD between densities p_θ and p_{θ'} of an exponential family with cumulant function F is available in closed form:

D_J[p_θ, p_{θ'}] = (θ' − θ) · (η' − η),

with η = ∇F(θ) and θ = ∇F*(η), where F* denotes the convex conjugate of F. Any smooth density r (including a mixture r = m) is converted into nearby densities p_{θ_r^SME} and p_{η_r^MLE} of an exponential-polynomial family using extensions of the Maximum Likelihood Estimator (MLE) and of the Score Matching Estimator (SME). Then the JD between mixtures is approximated as follows:

D_J[m, m'] ≃ (θ'^SME − θ^SME) · (η'^MLE − η^MLE).

• Given a finite set of mixtures {m_i(x)} sharing the same components (e.g., points on a mixture family manifold), we can precompute the KLDs between pairwise components to obtain fast approximations of the KLD D_KL[m_i : m_j] between any two mixtures m_i and m_j; see [51].
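
The Monte Carlo estimator above is straightforward to implement; here is an added sketch for a hypothetical pair of univariate GMMs (the component parameters are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)

def gmm_pdf(x, weights, mus, sigmas):
    comps = np.exp(-0.5 * ((x[:, None] - mus) / sigmas)**2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ weights

# two made-up univariate Gaussian mixture models
w1, mu1, sg1 = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([0.7, 1.2])
w2, mu2, sg2 = np.array([0.5, 0.5]), np.array([0.0, 2.5]), np.array([1.0, 0.8])

def sample_mid(s):
    # draw from m12 = (m + m')/2: pick one of the two mixtures uniformly, then a component
    out = np.empty(s)
    for i in range(s):
        w, mu, sg = (w1, mu1, sg1) if rng.random() < 0.5 else (w2, mu2, sg2)
        j = rng.choice(len(w), p=w)
        out[i] = rng.normal(mu[j], sg[j])
    return out

x = sample_mid(100_000)
m, mp = gmm_pdf(x, w1, mu1, sg1), gmm_pdf(x, w2, mu2, sg2)
jeffreys_hat = np.mean(2 * (m - mp) / (m + mp) * np.log(m / mp))
print("MC estimate of the Jeffreys divergence:", jeffreys_hat)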

7.2 Bounding statistical distances between mixtures


• Log-Sum-Exp bounds: In [42, 43], we lower and upper bound the cross-entropy between mixtures using the fact that the log-sum term log m(x) can be interpreted as a Log-Sum-Exp (LSE) function. We then compute lower and upper envelopes of the density functions using techniques of computational geometry to report deterministic lower and upper bounds on the KLD and on α-divergences. These bounds are said to be combinatorial because we decompose the support into elementary intervals. Bounds on the Total Variation Distance (TVD) between univariate mixtures are reported in [44].

7.3 Newly designed statistical distances yielding closed-form formula for mixtures
• Statistical Minkowski distances [30]: Consider the Lebesgue space

L_α(µ) := { f ∈ F : ∫_X |f(x)|^α dµ(x) < ∞ }

for α >= 1, where F denotes the set of all real-valued measurable functions defined on the support X. Minkowski's inequality writes as ‖p + q‖_α <= ‖p‖_α + ‖q‖_α for α ∈ [1, ∞). The statistical Minkowski difference distance between p, q ∈ L_α(µ) is defined as

D^Minkowski_α[p, q] := ‖p‖_α + ‖q‖_α − ‖p + q‖_α >= 0. (19)

The statistical Minkowski log-ratio distance is defined by:

L^Minkowski_α[p, q] := − log( ‖p + q‖_α / (‖p‖_α + ‖q‖_α) ) >= 0. (20)

These statistical Minkowski distances are symmetric, and L^Minkowski_α[p, q] is scale-invariant. For even integers α >= 2, D^Minkowski_α[m : m'] is available in closed form between mixtures.

• We showed in [28] that the Cauchy-Schwarz divergence (CSD), the quadratic Jensen-Rényi divergence (JRD) [54], and the total square distance (TSD) between two GMMs, and more generally between two mixtures of exponential families, can be obtained in closed form.
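
For illustration, the added sketch below evaluates the two statistical Minkowski distances (19) and (20) between two made-up univariate GMMs by plain numerical integration (the closed form for even α from [30] is not used here):

import numpy as np
from scipy.integrate import quad

def gmm_pdf(x, weights, mus, sigmas):
    return sum(w * np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2 * np.pi))
               for w, m, s in zip(weights, mus, sigmas))

m  = lambda x: gmm_pdf(x, [0.3, 0.7], [-1.0, 1.5], [0.5, 1.0])
mp = lambda x: gmm_pdf(x, [0.6, 0.4], [0.0, 2.0], [0.8, 0.6])

def lp_norm(f, alpha):
    return quad(lambda x: abs(f(x))**alpha, -np.inf, np.inf)[0] ** (1.0 / alpha)

alpha = 2
n_m, n_mp, n_sum = lp_norm(m, alpha), lp_norm(mp, alpha), lp_norm(lambda x: m(x) + mp(x), alpha)
D_minkowski = n_m + n_mp - n_sum             # Eq. (19), non-negative by Minkowski's inequality
L_minkowski = -np.log(n_sum / (n_m + n_mp))  # Eq. (20)
print(D_minkowski, L_minkowski)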

Initially created 13th August 2021 (last updated November 12, 2021).

References
[1] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one
distribution from another. Journal of the Royal Statistical Society: Series B (Methodological),
28(1):131–142, 1966.

[2] Shun-Ichi Amari. α-divergence is unique, belonging to both f -divergence and Bregman diver-
gence classes. IEEE Transactions on Information Theory, 55(11):4925–4931, 2009.

[3] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences.
Springer Japan, 2016.

[4] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estima-
tion by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.

[5] Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical inference: the minimum
distance approach. Chapman and Hall/CRC, 2019.

[6] JM Corcuera and Federica Giummolè. A characterization of monotone and regular divergences.
Annals of the Institute of Statistical Mathematics, 50(3):433–450, 1998.

Table 1: Examples of information radius measures as variational abstract mean divergences.

(M, D)-information: I_{M,D}[p, q] := min_c {M(D[p : c], D[q : c])}

Information radius of order 1 (aka Jensen-Shannon divergence):
  I_{A,KL}[p, q] = (1/2) D_KL[p : (p+q)/2] + (1/2) D_KL[q : (p+q)/2]
Information radius of order α (aka Sibson's information radius):
  I_{M^R_α,D^R_α}[p, q] = (α/(α−1)) log_2 ∫_X ((p(x)^α + q(x)^α)/2)^{1/α} dµ(x)
Bregman information:
  I_{A,B_F}(θ1, θ2) = (1/2) B_F(θ1 : (θ1+θ2)/2) + (1/2) B_F(θ2 : (θ1+θ2)/2)
Skewed Jensen-Bregman divergence:
  I_{A_β,B_F}(θ1 : θ2) = (1 − β) B_F(θ1 : (1−β)θ1 + βθ2) + β B_F(θ2 : (1−β)θ1 + βθ2)
Chernoff minimax discrimination:
  I_{max,D*_KL}[p, q] = min_c max{D_KL[c : p], D_KL[c : q]}
Amari's α-risk:
  I_{A,D_α}[p, q] = (1/2) D_α[p : (pq)^A_α] + (1/2) D_α[q : (pq)^A_α]
Annealing geometric paths:
  (pq)^G_β is the minimizer c of I_{G_β,D*_KL}[p : q]

Means used in the table:
  arithmetic mean: A(a, b) = (a + b)/2
  weighted arithmetic mean: A_β(a, b) = (1 − β) a + β b
  geometric mean: G_β(a, b) = a^{1−β} b^β
  maximum (mean): MAX(a, b) = max{a, b}
  Rényi's α-mean: M^R_α(a, b) = (1/(α−1)) log_2((2^{(α−1)a} + 2^{(α−1)b})/2)
  Amari's α-integration: (pq)^A_α(x) ∝ f_α^{−1}((1/2) f_α(p(x)) + (1/2) f_α(q(x))), with α-representation f_α(u) = (2/(1−α)) u^{(1−α)/2}
  M-mixture: (pq)^M_β(x) ∝ M_β(p(x), q(x))

Divergences used in the table:
  Kullback-Leibler divergence: D_KL[p : q] = ∫_X p(x) log(p(x)/q(x)) dµ(x)
  reverse Kullback-Leibler divergence: D*_KL[p : q] = ∫_X q(x) log(q(x)/p(x)) dµ(x) = D_KL[q : p]
  Rényi's α-divergence: D^R_α[p : q] = (1/(α−1)) log_2(∫_X p(x)^α q(x)^{1−α} dµ(x))
  Amari's α-divergence: D^A_α[p : q] = (4/(1−α²))(1 − ∫_X p(x)^{(1−α)/2} q(x)^{(1+α)/2} dµ(x))
[7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect
observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

[8] Imre Csiszár. A class of measures of informativity of observation channels. Periodica Mathe-
matica Hungarica, 2(1-4):191–213, 1972.

[9] Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Now Publishers
Inc, 2004.

[10] Jacob Deasy, Nikola Simidjievski, and Pietro Liò. Constraining Variational Inference with Ge-
ometric Jensen-Shannon Divergence. In Advances in Neural Information Processing Systems,
2020.

[11] Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions.
IEEE Transactions on Information theory, 49(7):1858–1860, 2003.

[12] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding.
In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., page 31.
IEEE, 2004.

[13] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against
heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.

[14] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.

[15] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings
of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–
461, 1946.

[16] Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz
divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal
of the Franklin Institute, 343(6):614–629, 2006.

[17] Jiantao Jiao, Thomas A Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Infor-
mation measures: the curious case of the binary alphabet. IEEE Transactions on Information
Theory, 60(12):7616–7626, 2014.

[18] MC Jones, Nils Lid Hjort, Ian R Harris, and Ayanendranath Basu. A comparison of related
density-based minimum divergence estimators. Biometrika, 88(3):865–873, 2001.

[19] Kittipat Kampa, Erion Hasanbelliu, and Jose C Principe. Closed-form Cauchy-Schwarz PDF
divergence for mixture of Gaussians. In The 2011 International Joint Conference on Neural
Networks, pages 2578–2585. IEEE, 2011.

[20] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

[21] Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In
Artificial Intelligence and Statistics (AISTATS), pages 65–72, 2001.

[22] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on
Information theory, 37(1):145–151, 1991.

[23] Jianhua Lin and SKM Wong. Approximation of discrete probability distributions based on a
new divergence measure. Congressus Numerantium (Winnipeg), 61:75–80, 1988.

[24] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. National Institute
of Science of India, 1936.

[25] Geoffrey J McLachlan and Kaye E Basford. Mixture models: Inference and applications to
clustering, volume 38. M. Dekker New York, 1988.

[26] Tetsuzo Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of
Japan, 18(3):328–331, 1963.

[27] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality.
arXiv preprint arXiv:1009.4004, 2010.

[28] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Pro-
ceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 1723–1726.
IEEE, 2012.

[29] Frank Nielsen. On the Jensen-Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 2019.

[30] Frank Nielsen. The statistical Minkowski distances: Closed-form formula for Gaussian mixture
models. In International Conference on Geometric Science of Information, pages 359–367.
Springer, 2019.

[31] Frank Nielsen. On a Generalization of the Jensen-Shannon Divergence and the Jensen-Shannon Centroid. Entropy, 22(2), 2020.

[32] Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian
mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[33] Frank Nielsen. Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities. arXiv preprint arXiv:2107.05901, 2021.

[34] Frank Nielsen. On a Variational Definition for the Jensen-Shannon Symmetrization of Dis-
tances Based on the Information Radius. Entropy, 23(4), 2021.

[35] Frank Nielsen. The dually flat information geometry of the mixture family of two prescribed
Cauchy components. arXiv preprint arXiv:2104.13801, 2021.

[36] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE transac-
tions on Information Theory, 55(6):2882–2904, 2009.

[37] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approx-
imating f -divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.

[38] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and
projective divergences. In International Conference on Similarity Search and Applications,
pages 109–116. Springer, 2016.

[39] Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2861–2865. IEEE, 2018.

[40] Frank Nielsen and Kazuki Okamura. On f -divergences between Cauchy distributions. arXiv
preprint arXiv:2101.12459, 2021.

[41] Frank Nielsen and Kazuki Okamura. On f -divergences between Cauchy distributions. arXiv preprint arXiv:2101.12459, 2021.

[42] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate
mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.

[43] Frank Nielsen and Ke Sun. Combinatorial bounds on the α-divergence of univariate mixture
models. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing,
ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 4476–4480. IEEE, 2017.

[44] Frank Nielsen and Ke Sun. Guaranteed Deterministic Bounds on the total variation distance
between univariate mixtures. In 28th IEEE International Workshop on Machine Learning for
Signal Processing, MLSP 2018, Aalborg, Denmark, September 17-20, 2018, pages 1–6. IEEE,
2018.

[45] Frank Nielsen and Ke Sun. Clustering in Hilbert’s projective geometry: The case studies of
the probability simplex and the elliptope of correlation matrices. In Geometric Structures of
Information, pages 297–331. Springer, 2019.

[46] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences.
Entropy, 19(3):122, 2017.

[47] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the
logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

[48] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transac-
tions of the Royal Society of London. A, 185:71–110, 1894.

[49] Souvik Ray, Subrata Pal, Sumit Kumar Kar, and Ayanendranath Basu. Characterizing the
functional density power divergence class. arXiv preprint arXiv:2105.06094, 2021.

[50] Alfréd Rényi et al. On measures of entropy and information. In Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to
the Theory of Statistics. The Regents of the University of California, 1961.

[51] Olivier Schwander, Stéphane Marchand-Maillet, and Frank Nielsen. Comix: Joint estimation
and lightspeed comparison of mixture models. In 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016,
pages 2449–2453. IEEE, 2016.

[52] Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte
Gebiete, 14(2):149–160, 1969.

[53] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Ku-
mar. Density estimation in infinite dimensional exponential families. Journal of Machine
Learning Research, 18, 2017.

[54] Fei Wang, Tanveer Syeda-Mahmood, Baba C Vemuri, David Beymer, and Anand Rangarajan.
Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise
shape registration. In International Conference on Medical Image Computing and Computer-
Assisted Intervention, pages 648–655. Springer, 2009.

[55] Sumio Watanabe, Keisuke Yamazaki, and Miki Aoyagi. Kullback information of normal mix-
ture is not an analytic function. IEICE technical report. Neurocomputing, 104(225):41–46,
2004.

[56] Andrew KC Wong and Manlai You. Entropy and distance of random graphs with applica-
tion to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, (5):599–609, 1985.
