Week 2

Convexity and Concavity of Information Measures
Often we need to maximize or minimize information measures in different contexts.
As we know, convexity and concavity guarantee that a local minimum/maximum is also global.
Concavity of H(X)
We have already seen this for a Bernoulli RV but not for an arbitrary pmf. To show it, take two random variables X1 and X2 with pmfs p1 and p2 defined over the same sample space.
H(X1) is a function of the pmf p1 alone, so we may write it as H(p1). We need to show that
H(λp1 + (1 − λ)p2) ≥ λH(p1) + (1 − λ)H(p2) for λ ∈ [0, 1]
Define Z = X_θ, where θ = 1 with probability λ and θ = 2 with probability 1 − λ. Then
P[Z = z] = Σ_θ P[Z = z, θ] = P[Z = X1 = z | θ = 1]P[θ = 1] + P[Z = X2 = z | θ = 2]P[θ = 2] = λp1(z) + (1 − λ)p2(z)
Since conditioning cannot increase entropy, H(Z) ≥ H(Z | θ), i.e. H(λp1 + (1 − λ)p2) ≥ λH(p1) + (1 − λ)H(p2).
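A quick numerical sanity check of this concavity (a minimal sketch; the two pmfs below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Two example pmfs over the same 4-symbol alphabet (arbitrary choices).
p1 = np.array([0.5, 0.25, 0.125, 0.125])
p2 = np.array([0.1, 0.2, 0.3, 0.4])

for lam in np.linspace(0.0, 1.0, 11):
    mix = lam * p1 + (1 - lam) * p2
    lhs = entropy(mix)                                   # H(lambda*p1 + (1 - lambda)*p2)
    rhs = lam * entropy(p1) + (1 - lam) * entropy(p2)    # lambda*H(p1) + (1 - lambda)*H(p2)
    assert lhs >= rhs - 1e-12                            # concavity of H
```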
Concavity of MI
Later, we will see that the capacity of a communication channel defined by p(y|x) is given by
C = max_{p(x)} I(X; Y)
What is the guarantee that we will be able to obtain such a p(x)?
Lemma: Convexity/concavity is preserved under affine transforms. Affine transforms are of the form Ax + b, where x and b are vectors. Show that if f(x) is convex, so is f(Ax + b).
Concavity of I(X; Y) for fixed p(y|x)
I(X; Y) = H(Y) − H(Y|X) = −Σ_y p(y) log p(y) − Σ_x p(x) H(Y|X = x)
p(y|x) is fixed, so we only need to worry about how I(X; Y) behaves w.r.t. p(x).
p(y) = Σ_x p(x) p(y|x). As p(y|x) is fixed, p(y) is a linear function of p(x). H(Y) is a concave function of p(y), which is a linear function of p(x), and hence H(Y) is a concave function of p(x). For fixed p(y|x), each H(Y|X = x) is a constant, so H(Y|X) = Σ_x p(x) H(Y|X = x) is a linear function of p(x). A linear function is both concave and convex (the defining inequalities hold with equality). So, concave − convex(linear) = concave + concave = concave (why?).
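The claim can also be checked numerically: fix a channel matrix p(y|x) and verify the midpoint condition of concavity in p(x) for random pairs of input distributions (a rough sketch; the 2×2 channel is an arbitrary example):

```python
import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits for input pmf px and channel matrix P, where P[x, y] = p(y|x)."""
    py = px @ P                                    # p(y) = sum_x p(x) p(y|x): linear in p(x)
    hy = -np.sum(py[py > 0] * np.log2(py[py > 0]))  # H(Y), concave in p(y)
    hyx = -np.sum(px[:, None] * P * np.log2(P))     # H(Y|X), linear in p(x)
    return hy - hyx

P = np.array([[0.9, 0.1],                          # example channel p(y|x), rows indexed by x
              [0.2, 0.8]])

rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.dirichlet([1.0, 1.0]), rng.dirichlet([1.0, 1.0])
    mid = 0.5 * (p + q)
    # Concavity in p(x): I at the averaged input is at least the average of the two I's.
    assert mutual_information(mid, P) >= 0.5 * (mutual_information(p, P) + mutual_information(q, P)) - 1e-9
```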
Data Processing Inequality
Markov Chain
A sequence of random variables is said to be a Markov chain if each RV in the chain depends only on the previous one. Let X, Y, Z be three RVs; X → Y → Z is a Markov chain if
P(Z|X, Y) = P(Z|Y) ⇒ P(X, Y, Z) = P(X)P(Y, Z|X) = P(X)P(Y|X)P(Z|Y, X) = P(X)P(Y|X)P(Z|Y)
We can also say that, given Y, X and Z are independent:
P(X, Z|Y) = P(X|Y)P(Z|X, Y) = P(X|Y)P(Z|Y)
Hence, I(X; Z|Y) = 0

Data Processing Inequality (DPI)

It states that no matter how you obtain Z from Y, if X → Y → Z, then the information about X obtained from Z cannot be more than that obtained from Y:

I(X; Y) ≥ I(X; Z)

Prove this by expanding I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y), then use I(X; Z|Y) = 0 and I(X; Y|Z) ≥ 0.
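A small numerical sanity check of the DPI: build a chain X → Y → Z from two cascaded channels and compare I(X; Y) with I(X; Z) (a sketch; the input pmf and both channel matrices are arbitrary examples):

```python
import numpy as np

def mi_from_joint(Pxy):
    """I(X;Y) in bits from a joint pmf matrix Pxy."""
    px, py = Pxy.sum(axis=1), Pxy.sum(axis=0)
    mask = Pxy > 0
    return np.sum(Pxy[mask] * np.log2(Pxy[mask] / np.outer(px, py)[mask]))

px = np.array([0.3, 0.7])                     # example input pmf
Pyx = np.array([[0.8, 0.2], [0.1, 0.9]])      # first channel p(y|x)
Pzy = np.array([[0.6, 0.4], [0.3, 0.7]])      # processing p(z|y), independent of x given y

Pxy = px[:, None] * Pyx                       # joint p(x, y)
Pxz = Pxy @ Pzy                               # joint p(x, z) = sum_y p(x, y) p(z|y)

assert mi_from_joint(Pxy) >= mi_from_joint(Pxz) - 1e-12   # I(X;Y) >= I(X;Z)
```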
Fano’s Inequality
We may think of a communication system as X → Y → g(Y) = X̂.
We are interested in Pe = P[X̂ ≠ X] and its bounds.
Let E = 1 if X̂ ≠ X (which happens w.p. Pe), and E = 0 otherwise (w.p. 1 − Pe).
H(E) = H(Pe) = −[Pe log Pe + (1 − Pe) log(1 − Pe)]
Expanding H(E, X|X̂) in two ways:
H(E, X|X̂) = H(X|X̂) + H(E|X, X̂), and H(E|X, X̂) = 0 since E is a function of X and X̂
H(E, X|X̂) = H(E|X̂) + H(X|E, X̂), where H(E|X̂) ≤ H(E) = H(Pe) and H(X|E, X̂) ≤ Pe log|X|, because
H(X|E, X̂) = P[E = 0] H(X|X̂, E = 0) + P[E = 1] H(X|X̂, E = 1) ≤ 0 + Pe log|X|
Combining, H(Pe) + Pe log|X| ≥ H(X|X̂) ≥ H(X|Y) [∵ I(X; Y) ≥ I(X; X̂) by the DPI]
The log|X| term may be improved to log(|X| − 1) if it is guaranteed that X̂ ∈ X, because given E = 1 and X̂, X can take only |X| − 1 values.
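A quick check of the resulting bound H(Pe) + Pe log(|X| − 1) ≥ H(X|Y) on a toy joint pmf, using the MAP estimator as g(Y) (a sketch; the joint distribution is an arbitrary example):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Example joint pmf p(x, y), rows indexed by x, columns by y (|X| = |Y| = 3).
Pxy = np.array([[0.25, 0.05, 0.05],
                [0.05, 0.30, 0.05],
                [0.02, 0.03, 0.20]])

py = Pxy.sum(axis=0)
Px_given_y = Pxy / py                          # column y holds p(x | y)

g = Px_given_y.argmax(axis=0)                  # MAP estimator: x_hat = g(y)
Pe = 1.0 - sum(Pxy[g[y], y] for y in range(Pxy.shape[1]))

HXgY = -np.sum(Pxy * np.log2(Px_given_y))      # H(X|Y) = -sum p(x,y) log p(x|y)

card = Pxy.shape[0]                            # |X|
assert h2(Pe) + Pe * np.log2(card - 1) >= HXgY - 1e-12    # Fano's inequality
```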
Information Theory in a Nutshell

Shannon's work, and the subject itself, revolves around three main theorems:

Source Coding Theorem: The average code length for a source is greater than or equal to its entropy.
Channel Coding Theorem: There exists a coding scheme that allows one to send data at rates up to a limit, termed the capacity of the channel, with arbitrarily low probability of error. The catch is that such a code exists, but Shannon did not show which code achieves this! Such proofs are tricky as they are non-constructive.
Source-channel separation theorem: Since source coding (compressing) and channel coding (expanding) are, in a sense, opposite operations, it might appear that they are interdependent (one is governed by entropy, the other by mutual information, from the previous theorems). However, the theorem states that they can be treated separately, which is the reason for having two separate blocks in a communication system.

Data Compression: Source Coding
X | Singular | Non-singular, but not uniquely decodable | Uniquely decodable, but not instantaneous | Instantaneous
1 | 0 | 0   | 10  | 0
2 | 0 | 010 | 00  | 10
3 | 0 | 01  | 11  | 110
4 | 0 | 10  | 110 | 111

Code Classes
Singular: many-to-one mapping
Non-singular: one-to-one mapping
Uniquely decodable: every extension of the code is non-singular (may need the following bits to decode)
Instantaneous/Prefix: does not require the next bits to decode (self-punctuating / prefix-free)

All codes ⊃ Non-singular ⊃ Uniquely decodable ⊃ Instantaneous (prefix)
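As a small illustration, the prefix condition can be checked mechanically against the columns of the table above (a minimal sketch):

```python
def is_prefix_free(codebook):
    """True iff no codeword is a prefix of another (i.e., the code is instantaneous)."""
    words = sorted(codebook)   # if w is a prefix of any word, it is a prefix of its sorted successor
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

print(is_prefix_free(['0', '10', '110', '111']))   # instantaneous column of the table -> True
print(is_prefix_free(['10', '00', '11', '110']))   # uniquely decodable column -> False
```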
Kraft Inequality
For any instantaneous code over an alphabet of size D, the codeword lengths l_1, l_2, ..., l_m must satisfy
Σ_i D^(−l_i) ≤ 1

Conversely, given a set of codeword lengths that satisfies this inequality, there exists an instantaneous code with these word lengths.
Proof: The proof uses the following arguments:
Consider a D-ary tree in which each node has D children; any node (except the root) can represent a codeword.
If a node is chosen as a codeword in the prefix tree, then its descendants cannot be codewords (this would violate the prefix condition).
Let l_max be the length of the longest codeword.
Each node at level l_max is either a codeword, a descendant of a codeword from an earlier level, or neither.
Kraft Inequality (Contd)

Based on these arguments, the total number of nodes at level l_max is D^l_max. The number of descendants at level l_max of a codeword at level l_i is D^(l_max − l_i).[1] The sets of descendants of distinct codewords are disjoint, and hence we must have
Σ_i D^(l_max − l_i) ≤ D^l_max ⇒ Σ_i D^(−l_i) ≤ 1
Figure: Tree for D = 2
The converse is also evident: given lengths l_1 ≤ l_2 ≤ · · · ≤ l_m (sorted) that satisfy the inequality, assign the 1st codeword to a node at depth l_1 and remove all its descendants, assign the 2nd codeword at depth l_2 and remove its descendants, and so on. The Kraft sum being at most 1 guarantees we never run out of nodes, and the result is a tree whose codewords satisfy the prefix condition.
[1] Take a codeword (node) at level l_i; this node produces D (here 2, for D = 2) new children at every level after l_i. The number of such levels is l_max − l_i, and hence this node has 2^(l_max − l_i) descendants at level l_max.
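The converse is constructive in nature, and the construction is easy to sketch in code: check the Kraft sum and, if it holds, assign codewords in order of increasing length, skipping each assigned node's subtree (a minimal binary-case sketch):

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality, sum_i D^(-l_i)."""
    return sum(D ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code for lengths satisfying the Kraft inequality."""
    assert kraft_sum(lengths) <= 1
    code, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)            # descend to depth l in the binary tree
        code.append(format(next_val, f'0{l}b'))
        next_val += 1                          # skip this node's entire subtree
        prev_len = l
    return code

# Matches the instantaneous column of the earlier table: ['0', '10', '110', '111']
print(prefix_code_from_lengths([1, 2, 3, 3]))
```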
Optimal Codes
We want codes with minimum average length that are still instantaneous/prefix.
So, this is a constrained optimization problem:
min L = Σ_i p_i l_i
s.t. Σ_i D^(−l_i) ≤ 1

Evidently, if l_i is reduced, D^(−l_i) increases, so the objective and the constraint counter each other, and we may use an equality constraint instead of the inequality. Using the method of Lagrange multipliers:
J = Σ_i p_i l_i + λ(Σ_i D^(−l_i) − 1). Differentiating w.r.t. l_i:
∂J/∂l_i = p_i − λ D^(−l_i) ln D = 0 ⇒ D^(−l_i) = p_i / (λ ln D). Substituting into Σ_i D^(−l_i) = 1 gives λ = 1/ln D, so l_i* = −log_D p_i.
Hence, L* = −Σ_i p_i log_D p_i = H_D(X).
However, it might not be possible to set l_i = −log_D p_i, because we ignored the constraint that the l_i must be positive integers. Hence, the objective is to get as close as possible to the entropy.
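When the pmf is dyadic, the ideal lengths l_i = −log_2 p_i are integers and the expected length meets the entropy exactly (a minimal sketch with an example dyadic pmf):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]                    # dyadic example pmf: -log2(p_i) is an integer
lengths = [int(-math.log2(pi)) for pi in p]      # ideal lengths l_i* = -log2(p_i) = 1, 2, 3, 3

L = sum(pi * li for pi, li in zip(p, lengths))   # expected length
H = -sum(pi * math.log2(pi) for pi in p)         # entropy H_2(X)

assert sum(2 ** (-l) for l in lengths) == 1.0    # Kraft holds with equality
assert abs(L - H) < 1e-12                        # L* equals the entropy for a dyadic pmf
```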
Source Coding Theorem

The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy H_D(X): L ≥ H_D(X), with equality iff D^(−l_i) = p_i.
Proof: Let L = Σ_i p_i l_i be the expected length, and let r_i = D^(−l_i) / Σ_j D^(−l_j). Then
L − H_D(X) = Σ_i p_i l_i + Σ_i p_i log_D p_i
           = −Σ_i p_i log_D D^(−l_i) + Σ_i p_i log_D p_i
           = Σ_i p_i log_D (p_i / r_i) − log_D (Σ_j D^(−l_j))
           ≥ 0 [∵ Gibbs' inequality (relative entropy ≥ 0) and the Kraft inequality (the sum is ≤ 1)]
Equality occurs iff D(p‖r) = 0 and Σ_i D^(−l_i) = 1, which happens when p_i = D^(−l_i), i.e. l_i = −log_D p_i ∈ Z+ ∀i.

Upper Bound on Length
As we saw, l_i = log_D(1/p_i) may not be an integer. The next best choice is l_i = ⌈log_D(1/p_i)⌉.
If we choose this l_i, we first check that it satisfies the Kraft inequality. Of course,
log_D(1/p_i) ≤ ⌈log_D(1/p_i)⌉ < log_D(1/p_i) + 1
so D^(−l_i) = D^(−⌈log_D(1/p_i)⌉) ≤ D^(−log_D(1/p_i)) = p_i
and hence Σ_i D^(−⌈log_D(1/p_i)⌉) ≤ Σ_i p_i = 1.
So this choice of l_i satisfies the Kraft inequality, and taking expectations over the first chain of inequalities,
E[log_D(1/p_i)] ≤ E[⌈log_D(1/p_i)⌉] < E[log_D(1/p_i)] + 1
∴ H_D(X) ≤ L < H_D(X) + 1
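These bounds are easy to verify numerically for the choice l_i = ⌈log_2(1/p_i)⌉ (a minimal sketch with an arbitrary example pmf and D = 2):

```python
import math

p = [0.4, 0.3, 0.2, 0.1]                              # example pmf
lengths = [math.ceil(-math.log2(pi)) for pi in p]     # l_i = ceil(log2(1/p_i))

H = -sum(pi * math.log2(pi) for pi in p)              # entropy H_2(X)
L = sum(pi * li for pi, li in zip(p, lengths))        # expected code length

assert sum(2 ** (-l) for l in lengths) <= 1           # Kraft inequality holds
assert H <= L < H + 1                                 # H(X) <= L < H(X) + 1
```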

Huffman Codes
We know l_i* = −log_D p_i is the optimal prefix code length, which might not be realizable. The code obtained by taking the ceiling has bounded expected length but is of course not optimal, and so a question arises: is there an optimal realizable prefix code?
Huffman Codes
Given a source alphabet and its probabilities, Huffman came up with a simple algorithm that generates a prefix code that is optimal, in the sense that no other code over the same alphabet can have a lower expected length.
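The binary Huffman procedure itself is short: repeatedly merge the two least probable groups and prepend a bit to each side's codewords. A minimal sketch of the binary version (the symbols and probabilities are arbitrary examples):

```python
import heapq
import itertools

def huffman_code(probs):
    """Binary Huffman code: probs maps symbol -> probability; returns symbol -> codeword."""
    counter = itertools.count()               # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}      # prepend 0 to one side,
        merged.update({s: '1' + w for s, w in c2.items()})  # 1 to the other
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

code = huffman_code({'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1})
print(code)   # e.g. {'a': '0', 'b': '10', 'c': '111', 'd': '110'} (ties may assign differently)
```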

Optimality Proof
The complete proof is provided in the book; the ideas are presented in brief here.
There are many optimal codes; the Huffman code is one of them.
WLOG the pmf is ordered so that p1 ≥ p2 ≥ · · · ≥ pm. A code is optimal if Σ_i p_i l_i is minimal.

Properties of an Optimal Instantaneous/Prefix Code
The lengths are ordered inversely with the probabilities
The two longest codewords have the same length
The two longest codewords differ only in the last bit and correspond to the two least likely symbols
The first property is proved using a swapping argument. Let Cm be an optimal code with pj > pk, and let C′m be obtained from Cm by swapping codewords j and k, so that symbols j and k now have lengths lk and lj respectively. Computing L(C′m) − L(Cm) = (pj − pk)(lk − lj) ≥ 0 (by optimality of Cm) shows that lj ≤ lk.
Optimality of Huffman Code - Arguments

The 2nd follows by trimming. If the codeword of p5 is longer, as shown in (a), we can trim its last bit without disturbing the prefix condition to obtain (b).
The last follows by rearrangement. Even if p4 and p5 are not siblings, as in (c), we can make them siblings, since they have the same length, and the other codewords can be swapped accordingly to get (d). Siblings differ only in the last bit.
Figure: p1 ≥ p2 ≥ · · · ≥ p5
Thus, if p1 ≥ p2 ≥ · · · ≥ pm, ∃ an optimal code with l1 ≤ l2 ≤ · · · ≤ lm−1 = lm, and codewords C(xm−1) and C(xm) that differ only in the last bit. Such codes are called canonical codes.
Induction step (Canonical ↔ Huffman)
To show that the Huffman code is optimal, we define a Huffman reduction (HR) for an alphabet of size m, p = (p1, p2, ..., pm) with p1 ≥ p2 ≥ · · · ≥ pm, as p′ = (p1, p2, ..., pm−1 + pm) over an alphabet of size m − 1. Let L∗(p) and L∗(p′) denote the expected lengths of optimal canonical codes for p and p′.

HR: merging the two longest codewords of an optimal code for p gives a code for p′: L∗(p) → L(p′) [(a) → (c)]
Expansion: splitting the codeword of the merged symbol pm−1 + pm into two siblings gives a code for p: L∗(p′) → L(p)

Figure: Expansion

L(p) = L∗(p′) + pm−1 + pm and L(p′) = L∗(p) − pm−1 − pm ⇒
(L(p) − L∗(p)) + (L(p′) − L∗(p′)) = 0
Since both terms are non-negative (L∗ is optimal), each must be zero, so L(p) = L∗(p): the code obtained by expanding an optimal code for p′, which is exactly what the Huffman procedure does recursively, is optimal.
