Week 2
Convexity and Concavity of Information Measures
Often we need to maximize or minimize information measures in different contexts.
As we know, convexity and concavity guarantee that a local minimum/maximum is also a global one.
Concavity of H(X)
We have already seen this for a Bernoulli RV, but not for an arbitrary pmf. To show it, we take two random variables X1 and X2 with pmfs p1 and p2 defined over the same sample space.
Since H(X1) is a function of the pmf p1, we may represent it as H(p1). We need to show that
H(λp1 + (1 − λ)p2) ≥ λH(p1) + (1 − λ)H(p2) for λ ∈ [0, 1]
Let us define Z = Xθ, where θ = 1 with probability λ and θ = 2 with probability 1 − λ. Then
P[Z = z] = Σθ P[Z = z, θ] = P[Z = X1 = z | θ = 1] P[θ = 1] + P[Z = X2 = z | θ = 2] P[θ = 2] = λp1(z) + (1 − λ)p2(z).
As H(Z) ≥ H(Z | θ), it follows that H(λp1 + (1 − λ)p2) ≥ λH(p1) + (1 − λ)H(p2).
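A minimal numerical sanity check of this concavity, assuming two arbitrary example pmfs p1 and p2 (not taken from the slides):

import numpy as np

def entropy(p):
    # Shannon entropy in bits; terms with p = 0 contribute 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

p1 = np.array([0.5, 0.3, 0.2])   # assumed example pmf
p2 = np.array([0.1, 0.1, 0.8])   # assumed example pmf

for lam in np.linspace(0.0, 1.0, 11):
    mix = lam * p1 + (1 - lam) * p2
    # concavity: entropy of the mixture dominates the mixture of the entropies
    assert entropy(mix) >= lam * entropy(p1) + (1 - lam) * entropy(p2) - 1e-12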
Concavity of MI
Later, we will see that the capacity of a communication channel defined by p(y|x) is given as:
C = max_{p(x)} I(X; Y)
What is the guarantee that we will be able to obtain such a p(x)?
Lemma: Convexity/concavity is preserved under affine transformations.
Affine transformations are of the form Ax + b, where x and b are vectors. Show that if f(x) is convex, so is f(Ax + b).
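One way to see this: for λ ∈ [0, 1],
f(A(λx + (1 − λ)y) + b) = f(λ(Ax + b) + (1 − λ)(Ay + b)) ≤ λ f(Ax + b) + (1 − λ) f(Ay + b),
so g(x) = f(Ax + b) is convex; the concave case follows with the inequality reversed.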
Concavity of I(X; Y) for fixed p(y|x)
I(X; Y) = H(Y) − H(Y|X) = −Σy p(y) log p(y) − Σx p(x) H(Y|X = x)
p(y|x) is fixed, so we only need to worry about how I(X; Y) behaves w.r.t. p(x).
p(y) = Σx p(x) p(y|x). As p(y|x) is fixed, p(y) is a linear function of p(x). H(Y) is a concave function of p(y), which is a linear function of p(x), and hence H(Y) is a concave function of p(x). For fixed p(y|x), each H(Y|X = x) is a constant, so the second term is a linear function of p(x). A linear function is both concave and convex (it satisfies both defining inequalities with equality). So Concave − Convex(linear) = Concave + Concave = Concave. (Why?)
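As a concrete sketch (assuming a binary symmetric channel with crossover probability 0.1 as the fixed p(y|x); the example is not from the slides), one can check concavity of I(X; Y) in p(x) numerically and see that the maximizing input is uniform:

import numpy as np

def mutual_information(px, pyx):
    # I(X;Y) in bits for input pmf px and channel matrix pyx[x, y] = p(y|x)
    pxy = px[:, None] * pyx            # joint p(x, y)
    py = pxy.sum(axis=0)               # output pmf p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask]))

pyx = np.array([[0.9, 0.1],            # p(y|x = 0)
                [0.1, 0.9]])           # p(y|x = 1)

def I(a):                              # I(X;Y) as a function of p(x) = (a, 1 - a)
    return mutual_information(np.array([a, 1 - a]), pyx)

a1, a2, lam = 0.2, 0.7, 0.4
# concavity along the segment between the two input pmfs
assert I(lam * a1 + (1 - lam) * a2) >= lam * I(a1) + (1 - lam) * I(a2) - 1e-12
print(I(0.5))                          # ≈ 0.531 bits = 1 − H(0.1), the capacity of this channel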
Data Processing Inequality
Markov Chain
A sequence of random variables is said to be a Markov chain if any RV in the chain depends only on the previous one. Graphically, let X, Y, Z be 3 RVs; X → Y → Z is a Markov chain if
P(Z|X, Y) = P(Z|Y) ⇒ P(X, Y, Z) = P(X)P(Y, Z|X) = P(X)P(Y|X)P(Z|Y, X) = P(X)P(Y|X)P(Z|Y)
We can also say that, given Y, X and Z are independent, as:
P(X, Z|Y) = P(X|Y)P(Z|X, Y) = P(X|Y)P(Z|Y)
Hence, I(X; Z|Y) = 0
I(X; Y) ≥ I(X; Z)
(Prove this by expanding I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y).)
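A minimal numerical check of the data processing inequality on an assumed Markov chain X → Y → Z with binary alphabets (the kernels below are arbitrary examples, not from the slides):

import numpy as np

def mi(pab):
    # I(A;B) in bits from a joint pmf pab[a, b]
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return np.sum(pab[mask] * np.log2(pab[mask] / (pa * pb)[mask]))

px  = np.array([0.3, 0.7])                     # p(x)
pyx = np.array([[0.8, 0.2], [0.25, 0.75]])     # p(y|x)
pzy = np.array([[0.9, 0.1], [0.4, 0.6]])       # p(z|y)

pxy = px[:, None] * pyx                        # p(x, y)
pxz = pxy @ pzy                                # p(x, z) = Σ_y p(x, y) p(z|y)

assert mi(pxy) >= mi(pxz) - 1e-12              # I(X;Y) ≥ I(X;Z)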
Fano’s Inequality
We may think of a communication system as X → Y → g(Y) = X̂.
We are interested in Pe = P[X̂ ̸= X] and its bounds.
Let E = 1 if X̂ ̸= X (w.p. Pe) and E = 0 otherwise (w.p. 1 − Pe).
H(E) = H(Pe) = −[Pe log Pe + (1 − Pe) log (1 − Pe)]
H(E, X | X̂) = H(X | X̂) + H(E | X, X̂), where H(E | X, X̂) = 0
            = H(E | X̂) + H(X | E, X̂), where H(E | X̂) ≤ H(E) = H(Pe) and H(X | E, X̂) ≤ Pe log |X|
Comparing the two expansions gives Fano's inequality: H(X | X̂) ≤ H(Pe) + Pe log |X|.
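A minimal numerical check of the bound H(X | X̂) ≤ H(Pe) + Pe log |X|, using an assumed joint pmf over (X, X̂) with |X| = 3 (the numbers are an arbitrary example, not from the slides):

import numpy as np

def h2(p):
    # binary entropy in bits (0·log 0 := 0)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

pxxhat = np.array([[0.30, 0.03, 0.02],   # rows: x, columns: xhat
                   [0.04, 0.25, 0.01],
                   [0.02, 0.03, 0.30]])

pe = 1.0 - np.trace(pxxhat)              # P[Xhat != X]
pxhat = pxxhat.sum(axis=0)               # p(xhat)

# H(X | Xhat) = -Σ_{x, xhat} p(x, xhat) log p(x | xhat)
cond = pxxhat / pxhat[None, :]
mask = pxxhat > 0
H_X_given_Xhat = -np.sum(pxxhat[mask] * np.log2(cond[mask]))

assert H_X_given_Xhat <= h2(pe) + pe * np.log2(3) + 1e-12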
Data Compression: Source Coding
X    Singular    Non-singular, but not    Uniquely decodable,      Instantaneous
                 uniquely decodable       but not instantaneous
1    0           0                        10                       0
2    0           010                      00                       10
3    0           01                       11                       110
4    0           10                       110                      111
Code Classes
Singular: Many-to-one mapping
Non-singular: One-to-one mapping
Uniquely Decodable: The extension of the code is non-singular (may need the following bits to decode)
Instantaneous/Prefix: Does not require the next bits to decode (self-punctuating/self-decoding/prefix-free)
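As a small sketch, one can verify the prefix property and the Kraft sums (with D = 2) for the four codes in the table above; the code below is an illustration, not from the slides:

def kraft_sum(code, D=2):
    # Σ_i D^(-l_i) over the codeword lengths
    return sum(D ** (-len(w)) for w in code.values())

def is_prefix_free(code):
    # True iff no codeword is a prefix of another codeword
    words = list(code.values())
    return not any(i != j and words[j].startswith(words[i])
                   for i in range(len(words)) for j in range(len(words)))

codes = {
    "singular":           {1: "0",  2: "0",   3: "0",   4: "0"},
    "non-singular":       {1: "0",  2: "010", 3: "01",  4: "10"},
    "uniquely decodable": {1: "10", 2: "00",  3: "11",  4: "110"},
    "instantaneous":      {1: "0",  2: "10",  3: "110", 4: "111"},
}

for name, code in codes.items():
    print(name, kraft_sum(code), is_prefix_free(code))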
The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy HD(X): L ≥ HD(X), with equality iff D^(−li) = pi.
Proof: Let L = Σi pi li be the expected length, and let ri = D^(−li) / Σj D^(−lj).
L − HD(X) = Σi pi li + Σi pi logD pi
          = −Σi pi logD D^(−li) + Σi pi logD pi
          = Σi pi logD (pi / ri) − logD Σi D^(−li)
          ≥ 0   [∵ Gibbs' inequality and Kraft's inequality]
Equality occurs iff D(p||r) = 0 and Σi D^(−li) = 1, which happens when pi = D^(−li) ⇒ li = −logD pi ∈ Z+ ∀i
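As a quick worked check (assuming, for illustration, the dyadic pmf p = (1/2, 1/4, 1/8, 1/8), which is not given in the slides): the instantaneous code in the table has lengths li = (1, 2, 3, 3), so L = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits, while H2(X) = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits. Here pi = 2^(−li) for every i, so L = H2(X), exactly the equality case of the bound.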
Upper Bound on Length
As we see, li = logD(1/pi) may not be an integer. The next best choice is li = ⌈logD(1/pi)⌉.
If we choose this li, we first check whether it satisfies Kraft's inequality.
Of course,
logD(1/pi) ≤ ⌈logD(1/pi)⌉ < logD(1/pi) + 1
D^(−li) = D^(−⌈logD(1/pi)⌉) ≤ D^(−logD(1/pi)) = pi
and hence, Σi D^(−⌈logD(1/pi)⌉) ≤ Σi D^(−logD(1/pi)) = Σi pi = 1
So, this choice of li satisfies Kraft's inequality, and by taking expectations over the first chain of inequalities, we have
E[logD(1/pi)] ≤ E[⌈logD(1/pi)⌉] < E[logD(1/pi)] + 1
∴ HD(X) ≤ L < HD(X) + 1
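A minimal sketch of this construction (the pmf below is an assumed example, not from the slides): compute li = ⌈log2(1/pi)⌉, then verify the Kraft inequality and the bound H ≤ L < H + 1:

import math

p = [0.45, 0.25, 0.20, 0.10]                      # assumed example pmf, D = 2
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]

kraft = sum(2 ** (-li) for li in lengths)
L = sum(pi * li for pi, li in zip(p, lengths))
H = -sum(pi * math.log2(pi) for pi in p)

assert kraft <= 1.0                               # Kraft's inequality holds
assert H <= L < H + 1                             # H(X) ≤ L < H(X) + 1
print(lengths, round(L, 3), round(H, 3))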
Huffman Codes
We know li* = −logD pi is the optimal prefix code length, which might not be realizable since it need not be an integer. The code obtained by taking the ceiling has bounded expected length but is of course not optimal, so a question arises: is there an optimal realizable prefix code?
Huffman Codes
Given an alphabet, Huffman came up with a simple algorithm that generates a prefix code that is optimal, in the sense that no other prefix code over the same alphabet has a smaller expected length.
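A compact sketch of the binary (D = 2) Huffman construction using a heap; the pmf is an assumed example, not from the slides:

import heapq

def huffman_code(pmf):
    # returns a dict symbol -> binary codeword for a pmf given as {symbol: prob}
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)            # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

pmf = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}     # assumed example
code = huffman_code(pmf)
L = sum(pmf[s] * len(w) for s, w in code.items())  # expected length
print(code, L)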
Optimality Proof
The complete proof is provided in the book; the ideas are presented briefly here.
There are many optimal codes; the Huffman code is one of them.
WLOG the pmf p is ordered so that p1 ≥ p2 ≥ ··· ≥ pm. A code is optimal if Σi pi li is minimal.
Figure (Expansion): L*(p) → L(p′) via HR [(a) → (c)], and L*(p′) → L(p) via Expansion.