2. Dirichlet Polynomials and Bundles
This section is a brief summary of content from [1], repeated here for the convenience of the reader.
Definition 1. A Dirichlet polynomial $d$ in one variable $y$ is a function of the form $d(y) = a_n \cdot n^y + \cdots + a_1 \cdot 1^y + a_0 \cdot 0^y$ for some $n, a_0, \ldots, a_n \in \mathbb{N}$. The set of Dirichlet polynomials is closed under addition, and also under multiplication (using the distributive law along with the fact that $m^y \cdot n^y = (mn)^y$). In fact, it has the structure of a rig: a “ring without negatives” (or, to be pedantic, a monoid object in commutative monoids). We denote this rig by $\mathsf{Dir}$, where the additive unit is 0, and the multiplicative unit is $1^y$.
Note that we can embed $\mathbb{N}$ as a sub-rig of $\mathsf{Dir}$, by $a \mapsto a \cdot 1^y$; we often use this fact and simply write $a$ for $a \cdot 1^y$.
Following [1], we can think of Dirichlet polynomials as functors $\mathbf{Fin}^{\mathrm{op}} \to \mathbf{Fin}$, where $\mathbf{Fin}$ is the category of finite sets. Indeed, given a natural number $n$, the exponential $n^y$ can be thought of as the Yoneda embedding of the set with $n$ elements, i.e., $n^y = \mathbf{Fin}(-, \underline{n})$, where $\underline{n} = \{1, \ldots, n\}$. (For typographical convenience, we sometimes use the notation $n$ and $\underline{n}$ interchangeably. In particular, we write e.g., $d(n)$ instead of $d(\underline{n})$.) Then the addition of exponentials corresponds to the coproduct of the corresponding representable functors (and so multiplication by a natural number $a$ corresponds to the $a$-fold coproduct of the representable functor with itself). This means that evaluating a Dirichlet polynomial at some natural number $n$ corresponds to evaluating the corresponding functor on the finite set $\underline{n}$.
Note that $0^y$ is not the initial object 0, since $0^0 = 1$, i.e., $0^y(0) \cong 1$.
Example 1. The Dirichlet polynomialevaluated at 0 givesand, similarly, Note that, since , we can write .
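Evaluation at a natural number can be made concrete with a minimal sketch. The dictionary representation below (base mapped to coefficient) and the function name `evaluate` are assumptions made for illustration; they are not notation from [1]. The sample polynomial is hypothetical as well.

```python
def evaluate(d, n):
    """Evaluate a Dirichlet polynomial at the natural number n.

    d is a dict {base: coefficient} standing for the sum of
    coefficient * base**y (a representation assumed for this sketch).
    """
    # Python's integer power already uses the convention 0 ** 0 == 1.
    return sum(coeff * base ** n for base, coeff in d.items())


# Hypothetical example: d(y) = 4^y + 4 * 1^y.
d = {4: 1, 1: 4}
value_at_0 = evaluate(d, 0)  # counts the terms, with multiplicity
value_at_1 = evaluate(d, 1)  # sums the bases, with multiplicity

# The exponential 0^y is 1 at n = 0 and 0 for every n >= 1.
zero_exp = {0: 1}
zero_values = [evaluate(zero_exp, n) for n in range(4)]
```

In this sketch, evaluating at 0 counts the exponential terms with multiplicity, which is why the exponential with base 0 still contributes at 0.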
Definition 2. A morphism $d \to e$ of Dirichlet polynomials is a natural transformation of (contravariant) functors. Denote by $\mathsf{Dir}$ the category of Dirichlet polynomials (thought of as functors $\mathbf{Fin}^{\mathrm{op}} \to \mathbf{Fin}$), and by $\mathsf{Dir}(d, e)$ the set of all morphisms $d \to e$.
When we think of Dirichlet polynomials as functors $\mathbf{Fin}^{\mathrm{op}} \to \mathbf{Fin}$, addition is given by the coproduct (disjoint union of sets), and multiplication by the product (Cartesian product of sets). This means that working with Dirichlet polynomials in $\mathsf{Dir}$ really is like working with polynomials, in the sense that addition and multiplication are exactly “as expected”.
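The claim that addition and multiplication behave “as expected” can be checked pointwise with a short sketch. It assumes a Dirichlet polynomial is stored as a dict from base to coefficient; the helper names `add`, `multiply`, and `evaluate` are hypothetical. Multiplication uses only the distributive law and the rule that the exponentials with bases i and j multiply to the exponential with base ij.

```python
from collections import Counter


def evaluate(d, n):
    """Value of the polynomial {base: coefficient} at n (with 0 ** 0 == 1)."""
    return sum(coeff * base ** n for base, coeff in d.items())


def add(d, e):
    """Sum of two Dirichlet polynomials: coefficients of equal bases add."""
    out = Counter(d)
    out.update(e)  # Counter.update adds counts rather than replacing them
    return dict(out)


def multiply(d, e):
    """Product of two Dirichlet polynomials via the distributive law."""
    out = Counter()
    for i, a in d.items():
        for j, b in e.items():
            out[i * j] += a * b  # i^y * j^y = (i * j)^y
    return dict(out)


# Hypothetical inputs: d = 2^y + 0^y and e = 2 * 3^y + 1^y.
d = {2: 1, 0: 1}
e = {3: 2, 1: 1}
product_de = multiply(d, e)

# Multiplying by 0^y collapses every base to 0 and keeps only the
# total number of terms as the coefficient.
d_times_zero_exp = multiply(d, {0: 1})
```

Evaluating `product_de` at any n agrees with the product of the two separate evaluations, which is the sense in which the operations match ordinary pointwise arithmetic.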
Example 2. The only slightly confusing aspect of multiplication in is how behaves (since ): if is a Dirichlet polynomial, thenas follows from the aforementioned fact that is zero for , and 1 for . We can use this general fact for specific computations. For example, let Then(and ). There is a more geometric interpretation of objects of
$\mathsf{Dir}$ as set-theoretic bundles, i.e., (isomorphism classes of) functions $\pi\colon E \to B$, where $E$ and $B$ are finite sets, given as follows: to the Dirichlet polynomial $d$, we associate the function $d(1) \to d(0)$ induced by the unique function $0 \to 1$. (This vague statement can be upgraded to an equivalence of categories: see ([1], Theorem 4.6) for more details.)
For example, to the polynomial , we associate the bundle
Note that bundles also form a rig, where the sum is given by disjoint union of sets, and the product is given by the Cartesian product of sets, both on the base and the total space. Further, the equivalence between Dirichlet polynomials and bundles respects the rig structures (cf. [1], Theorem 4.6). Because of this, we often switch freely between thinking of Dirichlet polynomials as functors $\mathbf{Fin}^{\mathrm{op}} \to \mathbf{Fin}$, as bundles $d(1) \to d(0)$, and simply as functions of the form $\sum_i a_i \cdot i^y$.
Example 3. We can draw the bundle corresponding to as follows:
Lemma 1. Let be a Dirichlet polynomial. Then Proof. This follows from the fact that and , for all . □
Definition 3. Let . For , we definewhere is the bundle corresponding to d. Using the fact that the sum of bundles is given by the disjoint union of sets, we can use this above definition to write any Dirichlet polynomial
d as
(where ∑ is the coproduct in
).
Corollary 1. Let be a Dirichlet polynomial. Then Proof. This is, again, simply the fact that for all . □
Lemma 2. A morphism of Dirichlet polynomials is exactly a morphism of the corresponding bundles, i.e., functions and such that
commutes. Proof. This statement forms a specific part of ([1], Theorem 4.6), but the proof is simple enough that we give a direct version here. Writing
and
, we see that
(the first isomorphism is by the universal property of the coproduct; the second isomorphism is the Yoneda lemma). However, an element of this set is exactly a bundle morphism: we have, for all
, some
along with a function
;
is given by the disjoint union of all these
, and
is given by the choice of
j for each
i. □
Definition 4. Given Dirichlet polynomials such that , we denote by the set of morphisms such that .
Given the correspondence between Dirichlet polynomials and bundles, we might rightly ask why we should prefer to work with the former over the latter. For one possible answer to this, see Section 6.
3. Bundles as Empirical Distributions
The interpretation of Dirichlet polynomials as bundles helps us to understand how they relate to probability theory. Imagine flipping a coin eight times and observing five heads and three tails; we refer to “heads” and “tails” as outcomes, and each of the eight flips as draws; every draw has an associated outcome.
Consider some bundle $\pi\colon E \to B$. We can think of $B$ as the set of outcomes, and $E$ as the set of draws; the fibre $\pi^{-1}(x)$ over an outcome $x \in B$ corresponds to all the draws that lead to the outcome $x$, and so we obtain a probability distribution on $B$ by setting $P(x) = |\pi^{-1}(x)| / |E|$. Conversely, any rational distribution $P$ (i.e., a distribution such that all probabilities are rational numbers) on a finite set arises in this way: take the finite set as the set $B$ of outcomes; take the least common multiple of the denominators of all the probabilities as the cardinality of $E$; and then take $P(x) \cdot |E|$ many elements of $E$ to be in the fibre of $x$.
Example 4. Consider the set , endowed with the probability distribution such that Define the sets and , and define the function by Then the empirical probability distribution on the bundle agrees exactly with the given distribution on S. As a Dirichlet polynomial, this bundle is given (up to relabelling the outcomes) by .
Note that any multiple of d (for ) will correspond to the same probability distribution as d itself, but to a different empirical distribution, since it will have m times as many draws.
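The passage from a Dirichlet polynomial to its empirical distribution can be sketched as follows. The dict representation (base mapped to coefficient) and the names `empirical_distribution` and `scale_draws` are assumptions for illustration. The sketch also reads “multiple” as multiplication by an exponential with base m, which is an assumption here: that operation multiplies every fibre size by m, so the number of draws grows by a factor of m while the probabilities stay the same.

```python
from fractions import Fraction


def empirical_distribution(d):
    """List the outcome probabilities of the bundle associated to d.

    d is a dict {fibre_size: number_of_outcomes_with_that_fibre_size}.
    Each outcome with fibre size i gets probability i / (total draws).
    """
    draws = sum(coeff * base for base, coeff in d.items())
    if draws == 0:
        raise ValueError("no draws, so no probability distribution")
    probabilities = []
    for base, coeff in sorted(d.items(), reverse=True):
        probabilities.extend([Fraction(base, draws)] * coeff)
    return probabilities


def scale_draws(d, m):
    """Multiply d by the exponential with base m (assumed reading of 'multiple')."""
    return {m * base: coeff for base, coeff in d.items()}


# Hypothetical example: one outcome with 4 draws, four outcomes with 1 draw each.
d = {4: 1, 1: 4}
probs = empirical_distribution(d)
probs_scaled = empirical_distribution(scale_draws(d, 3))
```

Exact fractions are used so that the comparison between the two distributions does not depend on floating-point rounding.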
Under this interpretation of Dirichlet polynomials as empirical distributions, multiplication corresponds to taking the product distribution.
Remark 1. For any , and any , we can give a combinatorial interpretation: it is the number of ways of choosing n indistinguishable (in the sense that they have the same outcome) draws, i.e., the number of length-n lists of elements of for some .
To see this, note that (by Yoneda), and so is in bijection with the set of bundle morphisms , which are given exactly by choosing n (possibly repeated) elements of that all lie in the same fibre (namely the fibre above the point specified by ).
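The combinatorial reading of evaluation can be checked by brute force on small examples. The sketch below builds an explicit bundle from a dict of fibre sizes (a representation assumed for illustration; the names `fibres` and `count_same_fibre_lists` are hypothetical) and counts the length-n lists of draws that lie in a single fibre.

```python
from itertools import product


def evaluate(d, n):
    """Value of the polynomial {base: coefficient} at n (with 0 ** 0 == 1)."""
    return sum(coeff * base ** n for base, coeff in d.items())


def fibres(d):
    """Return one list of draws per outcome, with draws labelled 0, 1, 2, ..."""
    result = []
    next_draw = 0
    for base, coeff in sorted(d.items()):
        for _ in range(coeff):
            result.append(list(range(next_draw, next_draw + base)))
            next_draw += base
    return result


def count_same_fibre_lists(d, n):
    """Count pairs (outcome, length-n list of draws in the fibre over it)."""
    return sum(1 for fibre in fibres(d) for _ in product(fibre, repeat=n))


# Hypothetical example with an empty fibre: 3^y + 2 * 2^y + 0^y.
d = {3: 1, 2: 2, 0: 1}
counts = [count_same_fibre_lists(d, n) for n in range(4)]
values = [evaluate(d, n) for n in range(4)]
```

For n = 0 each outcome contributes exactly one (empty) list, including the outcome with an empty fibre, so the count at 0 is the number of outcomes.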
Remark 2. Although we deal only with finite sets and rational probability distributions here, it seems likely that one could follow the methods of [3] and consider colimits of these to obtain analogous results for arbitrary probability distributions on discrete measurable spaces.
4. Area and Width
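Definition 5. Define the rig $\mathsf{Rect}$ as follows. The underlying set is $\mathbb{R}_{\geq 0} \times \mathbb{R}_{\geq 0}$. The multiplicative structure has unit $(1, 1)$, and is given by component-wise multiplication: $(A_1, W_1) \cdot (A_2, W_2) = (A_1 A_2, W_1 W_2)$. The additive structure has unit $(0, 0)$, and is given by real-number addition in the first component, and by weighted geometric mean in the second component: $(A_1, W_1) + (A_2, W_2) = \left(A_1 + A_2, \left(W_1^{A_1} W_2^{A_2}\right)^{1/(A_1 + A_2)}\right)$. Given an element $(A, W)$ in $\mathsf{Rect}$, we call $A$ its area and $W$ its width.
The fact that $\mathsf{Rect}$ is indeed a rig follows from the fact that its multiplication distributes over its addition: $(A, W) \cdot \big((A_1, W_1) + (A_2, W_2)\big) = (A, W) \cdot (A_1, W_1) + (A, W) \cdot (A_2, W_2)$, since both sides are equal to $\left(A (A_1 + A_2), W \left(W_1^{A_1} W_2^{A_2}\right)^{1/(A_1 + A_2)}\right)$.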
Proposition 1. There exists a unique rig morphism $h\colon \mathsf{Dir} \to \mathsf{Rect}$ for which $h(n^y) = (n, n)$ for all $n \in \mathbb{N}$. Proof. Since every Dirichlet polynomial is just a sum of exponentials, a rig homomorphism is fully determined by its action on exponentials, since it must respect addition. So we just need to show that $h$ does indeed extend to a rig homomorphism, but this follows from the fact that $h(m^y \cdot n^y) = (mn, mn) = (m, m) \cdot (n, n)$. □
Definition 6. Given a Dirichlet polynomial $d$, we define its area $A(d)$ and its width $W(d)$ to be given by the components of $h(d) = (A(d), W(d))$.
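A minimal numerical sketch of the rectangle rig and of the homomorphism on sums of exponentials is given below. Pairs are stored as Python tuples (area, width), polynomials as dicts from base to coefficient, and the names `rect_add`, `rect_mul`, and `h` are chosen for illustration. Returning the zero pair when the total area is zero is a convention assumed for this sketch, and floating-point arithmetic makes all results approximate.

```python
def rect_mul(r, s):
    """Component-wise product of two (area, width) pairs."""
    return (r[0] * s[0], r[1] * s[1])


def rect_add(r, s):
    """Add areas; combine widths by the area-weighted geometric mean."""
    (a1, w1), (a2, w2) = r, s
    total = a1 + a2
    if total == 0:
        # Assumed convention: two zero-area elements sum to the zero pair.
        return (0.0, 0.0)
    # Python evaluates 0.0 ** 0.0 as 1.0, so a zero-area summand is neutral.
    return (total, (w1 ** a1 * w2 ** a2) ** (1.0 / total))


def h(d):
    """Send {base: coefficient} to (area, width) by summing (base, base) pairs."""
    acc = (0.0, 0.0)
    for base, coeff in d.items():
        for _ in range(coeff):
            acc = rect_add(acc, (float(base), float(base)))
    return acc


# Hypothetical example: 4^y + 4 * 1^y.
area, width = h({4: 1, 1: 4})

# Spot check of distributivity on arbitrary sample values.
r, s, t = (3.0, 2.0), (2.0, 5.0), (4.0, 0.5)
lhs = rect_mul(r, rect_add(s, t))
rhs = rect_add(rect_mul(r, s), rect_mul(r, t))
```

Folding one summand at a time keeps the code close to the definition; a closed form (the total area, and the corresponding root of the product of the widths raised to their areas) would be numerically more stable for large inputs.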
With this definition, along with Proposition 1, we see that
Lemma 3. Let and . Then
- 1.
;
- 2.
;
- 3.
.
Proof. Recall that addition (and thus scalar multiplication) in
involves the weighted geometric mean in the second component. Then
which proves (i). For (ii) and (iii), since
h is a rig homomorphism (and thus respects addition), it suffices to consider the case where
d is an exponential, say
. However, then
i.e.,
and
, as claimed. □
Proof. Write
. Using Lemma 3, along with Definition 5, we see that
By Lemma 1, the first component (i.e.,
) is equal to
; by the same lemma, we can also rewrite the second component (i.e.,
) as
so it simply remains to justify why this is equal to
. But a morphism in
is exactly the data of an endomorphism of each fibre of
; since there are
fibres of size
i, endomorphisms of these fibres are in bijection with the
-fold product of
, which is equal to
, whence the claim. □
Corollary 3. Let . Then the width is an algebraic number, i.e., the image of lies in the sub-rig whose underlying set is , where is the algebraic closure of .
Proof. By Corollary 2, both $A(d)$ and $W(d)^{A(d)}$ are equal to the cardinalities of some sets, and are thus integers; hence $W(d)$ is a root of an integer, and so algebraic. □
Example 5. Reassuringly, if we start with a “rectangle”, then the area and width are exactly what we might expect. More concretely: consider for some ; then, by Corollary 2,and, by direct calculation,Comparing this to the picture of , we can explain why we chose the terminology “width” and “area”: Indeed, the area is exactly the number of dots in the (upper) rectangle, and the width is its width.
However, this picture now leads us to consider the question of whether or not there is a good meaning we can give to the “length” of this rectangle (which, here, should be equal to a). Indeed, this has been our motivation all along; we will return to this question in Example 9.
Example 6. The fact that is “rectangular” in Example 5 makes the terminology look like a numerical coincidence, but we can try to hone our intuition of what this really “means” by considering another example.
Let’s consider , which has area . We can calculate its width by using the fact thatwhenceand so . How, then, does the rectangle with area 8 and width 2 relate to our Dirichlet polynomial ? That is, what is the process that takes us from d to ? Looking at the pictures of the bundles, we see that the width tells us how our bundle would look if we had the same set () of draws, but with different outcomes, now all equally likely:
Note that, in order to have equally sized fibres, we needed to have 4 outcomes, not 5 (since ). We make this idea more precise (as well as explain why the rectangle is of size instead of ) in Section 6.
Example 7. We have just seen that has and , but now let’s look at an example where the numbers don’t divide so neatly.
Let . Then , and, as in Example 6, we use the fact that Of course, now we can’t draw a nice rectangle representing the evenly distributed bundle as we did in Example 6 for , since we would have to have an outcome set of size elements, with fibres all of size , but this should come as no surprise, since 7 is prime. One might be tempted to solve this problem using groupoid cardinality (cf. [4]), but there are some technical issues here.
5. Length
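Remark 3. We write log to mean $\log_2$.
Definition 7. Given a Dirichlet polynomial $d$, we define its entropy $H(d)$ by $H(d) = -\sum_{x \in d(0)} p(x) \log p(x)$, where $p(x)$ is the probability of the outcome $x$ in the corresponding empirical distribution (with the usual convention $0 \log 0 = 0$). We then define its length (also called the perplexity) $L(d)$ by $L(d) = 2^{H(d)}$. Readers might recognise $H(d)$ as being the Shannon entropy of the corresponding probability distribution (cf. [5]). The convention for naming the sides of a rectangle is from [6] (Figure 1).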
Example 8. Consider for some .
Then and , and sowhence . In terms of distributions, this corresponds to the fact that the unique probability distribution on a single outcome has entropy equal to 0 (and so the same is true for any empirical distribution on a single outcome).
Example 9. Continuing on from Example 5, we can calculate the entropy of a uniform distribution on a many outcomes aswhence , exactly as desired.
Example 10. Continuing on from Example 7, recall that has area and width . We can further calculate thatwhence . As for , recall that its area is and its width is . Now, its entropy isand so its length is Note that, in the above example, even though both
the length and the width have non-integer values, the formula implied by our choice of nomenclature still holds: the area is equal to the length times the width. This leads us to our main theorem. It says that the Shannon entropy, which is only homomorphic in products of distributions, can be computed in terms of the width and area, which together are homomorphic in both sums and products of distributions. We will explain this in more detail in Section 6.
Theorem 1. For all $d \in \mathsf{Dir}$, we have the rectangle-area formula $A(d) = L(d) \, W(d)$.
Proof. In the following, we omit absolute value signs, writing e.g., instead of .
First, write
Now we can rewrite the length as
The numerator is then
since
, by Corollary 1, and we can then apply Corollary 2. The denominator is exactly
and so, by Corollary 2, we only need to justify why
is equal to
. However, this follows from the definition of an element of the latter set: a choice of map
for all
. □
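The rectangle-area formula can be checked numerically on small examples. The sketch below assumes a Dirichlet polynomial is stored as a dict from fibre size to number of outcomes, uses base-2 logarithms, and adopts the usual convention that outcomes with empty fibres contribute nothing to the entropy. The function names are chosen for illustration, and the closed forms for area and width are the ones obtained by summing the pairs (i, i) in the rectangle rig.

```python
import math


def area(d):
    """Total number of draws: the sum of all fibre sizes."""
    return sum(coeff * base for base, coeff in d.items())


def width(d):
    """Geometric mean of the fibre sizes, weighted by the fibre sizes."""
    a = area(d)
    if a == 0:
        raise ValueError("width is not computed here for zero area")
    log_w = sum(c * b * math.log2(b) for b, c in d.items() if b > 0) / a
    return 2.0 ** log_w


def entropy(d):
    """Shannon entropy (in bits) of the empirical distribution of d."""
    a = area(d)
    if a == 0:
        raise ValueError("no draws, so no probability distribution")
    # Outcomes with empty fibres have probability 0 and are skipped.
    return -sum(c * (b / a) * math.log2(b / a) for b, c in d.items() if b > 0)


def length(d):
    """Perplexity: two to the power of the entropy."""
    return 2.0 ** entropy(d)


# Hypothetical examples: 4^y + 4 * 1^y, and 5^y + 2^y (area 7).
d = {4: 1, 1: 4}
e = {5: 1, 2: 1}
gap_d = abs(area(d) - length(d) * width(d))
gap_e = abs(area(e) - length(e) * width(e))
```

For the first example this gives area 8, width 2, entropy 2 bits, and length 4; for the second, the length and width are not integers, but their product still agrees with the area up to rounding.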
6. Interpreting Area, Length, and Width
We have mentioned many times that Dirichlet polynomials are equivalent to set-theoretic bundles, so the natural question to ask is “why, then, should we work with the former instead of the latter?”. One answer to this question is the fact that entropy does not respect bundle morphisms: we cannot functorially assign a morphism between entropies to morphisms, since we are working with arbitrary morphisms of bundles. (If, however, we restrict to only morphisms given by pushforward, then [7] tells us (via Faddeev’s theorem) that the only possible functorial definition of entropy is given by the relative entropy, i.e., the difference of the entropies of the source and the target.) This makes it seem rather bad to work with a category (such as that of bundles) instead of simply a rig (such as that of Dirichlet polynomials). Of course, this isn’t an entirely satisfactory answer, since we do care about the notion of morphisms for Dirichlet polynomials (for example, Corollary 2 tells us that the width can be expressed in terms of the number of certain morphisms). In light of Theorem 1, however, we might consider the following possibility: both area and length can be expressed in terms of
,
, and
(for
), and we could
define the width by
.
A better answer to this question might be the following: the rig homomorphism is incredibly simple, since it just maps to ; from this computationally simple homomorphism, however, we can recover entropy (as ), without making any reference to the classical equation that defines it (“negative the sum of probabilities of the log of the probabilities”), but instead relying on the fact that encodes the weighted geometric mean. That is, is only homomorphic in the product of distributions, whereas the pair is homomorphic in both the product and the sum.
Now, the entropy $H(d)$ can be understood (via Huffman coding, cf. [8]) as the average number of bits needed to code a single outcome (over a long enough message). What is also true, however, is that the width (which is obtained purely “algebraically”, i.e., from the rig homomorphism $h$) gives similar information: by Theorem 1, combined with the previous sentence, $\log W(d) = \log A(d) - H(d)$ is the average number of bits needed to code the draw, given an outcome (in the same Huffman coding as before). This answers the question of “what is special about the bundle defined by the width and length” with “it describes the optimal encoding of draws, given outcomes”.
As for the picture in Example 6, we can now understand the hand-wavy explanation a bit better (but still just as hand-wavy-ly): we take our original “half-filled” rectangle and pour its contents into a new rectangle, of length , and then “slosh the contents around” until they lie flat, and then put a lid on it; the rectangle will be perfectly filled up, and the placement of the lid will be given by .
We also mentioned, in Corollary 3, that the width of any Dirichlet polynomial d is an algebraic number, but the actual result is slightly more interesting than this: Corollary 2 tells us that is equal to the cardinality of the set . We already know how to understand endomorphisms of d that fix as endomorphisms of that fix the outcome; we can understand as maps from to ; roughly speaking, such a map determines the remaining ambiguity in determining a draw, given its outcome.
7. Cross Entropy
Everything above can be viewed as a specific example of the analogous cross notions. That is, given two Dirichlet polynomials, we can define their cross area, cross width, etc. as follows.
Definition 8. Let be Dirichlet polynomials such that . Then we define thecross entropy
byand the cross area, cross width,
and cross length
by(respectively). By definition, for . That is, just as cross entropy is a generalisation of entropy, the notions of cross width, etc., generalise the notions of width, etc.
Remark 4. Note that we can recover the notion of relative entropy (also known as Kullback–Leibler divergence), as studied in [9], from cross entropy: (which can also be seen to justify the fact that ).
Remark 5. Although we have some idea of how to understand these cross notions (e.g., cross area can be understood as the number of “actual” draws, when we think of d as being a potentially inaccurate model for e), the definitions in Definition 8 were chosen simply so that
- 1. we recover the “uncrossed” notions when we take $d = e$, and
- 2. Theorem 2 holds.
Theorem 2. For all , we have the cross rectangle-area formula
Proof. This proof follows exactly the same argument as the proof of Theorem 1. □