Lecture 1
Basak Guler
1 / 136
Background
• Information theory studies the fundamental limits of how to represent,
compress, and transfer information.
2 / 136
Introduction
Information theory answers two fundamental questions:
• What is the ultimate data compression? (Answer: the entropy H.)
• What is the ultimate transmission rate of reliable communication? (Answer: the channel capacity C.)
4 / 136
Remark
• Although information theory originated from dealing with communications, its principles and impact go well beyond the field of communications.
5 / 136
Notation
• We will assume that a discrete random variable (r.v.) X has an alphabet \mathcal{X}, meaning that X takes values from the set \mathcal{X}, with a probability mass function (PMF):
p_X(x) = P[X = x], for x ∈ \mathcal{X}.
• With some abuse of notation, we will use p(x) to denote the PMF of X, instead of p_X(x). That is, we will define p(x) ≜ p_X(x).
6 / 136
Remark
• Information theory relies on a set of mathematical tools.
• In particular, there are a few key definitions that facilitate the main
results.
• The most important notions are Entropy (H) and Mutual Information (I).
• Let’s start!
7 / 136
Entropy
• Definition 1. Entropy of a discrete random variable X is:
H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}   (2)
     = -\sum_{x \in \mathcal{X}} p(x) \log p(x)   (3)
8 / 136
Entropy
• Recall that for a function g(X ) of X ,
E[g(X)] = \sum_{x \in \mathcal{X}} p(x) g(x)   (4)
• In particular, choosing
g(x) = \log \frac{1}{p(x)}, \quad \forall x \in \mathcal{X},   (5)
the entropy is the expectation of this function:
H(X) = E[g(X)] = E\left[\log \frac{1}{p(X)}\right].   (6)
9 / 136
Example 1
• Consider a r.v. X that takes values from X = {1, 2, . . . , 8} with equal
probability.
H(X) = -\sum_{x=1}^{8} \frac{1}{8} \log \frac{1}{8}   (7)
     = 8 \times \frac{1}{8} \times \log 8   (8)
     = 3 bits   (9)
10 / 136
Example 2
• Consider a r.v. X that takes values from \mathcal{X} = {1, 2, . . . , 8}. Suppose the PMF of X is (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
H(X) = -\sum_{x=1}^{8} p(x) \log p(x)   (10)
     = \sum_{x=1}^{8} p(x) \log \frac{1}{p(x)}   (11)
     = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{16}\log 16 + 4 \cdot \frac{1}{64}\log 64   (12)
     = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{2}{8} + \frac{3}{8}   (13)
     = 2 bits   (14)
11 / 136
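A short numerical check of Examples 1 and 2 (a minimal sketch in Python; the helper name `entropy_bits` is just for illustration):

```python
from math import log2

def entropy_bits(pmf):
    """Entropy H(X) in bits, using the convention 0*log(0) = 0."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# Example 1: uniform distribution over 8 outcomes
print(entropy_bits([1/8] * 8))                           # 3.0 bits

# Example 2: the non-uniform PMF (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
print(entropy_bits([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))  # 2.0 bits
```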
Observations
• Entropy is non-negative.
• Comparing Examples 1 and 2, the uniform distribution has higher entropy:
H(1/8, . . . , 1/8) > H(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)   (15)
12 / 136
Properties of Entropy
• Lemma 1. Entropy is always non-negative, i.e., H(X ) ≥ 0.
• Proof. Since 0 ≤ p(x) ≤ 1, every term p(x) \log \frac{1}{p(x)} is non-negative, where we use the convention
0 \log 0 = 0   (16)
13 / 136
Properties of Entropy
• Lemma 2. Let Ha (X ) denote the entropy of X with the logarithm taken
with respect to base a, i.e.,
H_a(X) = \sum_{x \in \mathcal{X}} p(x) \log_a \frac{1}{p(x)}   (17)
Then,
H_b(X) = (\log_b a) H_a(X)   (18)
14 / 136
Properties of Entropy
• Proof. Note that,
\log_a p(x) = \frac{\log_b p(x)}{\log_b a}   (19)
or equivalently, \log_b p(x) = (\log_b a) \log_a p(x). Then,
H_b(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x)   (20)
       = -\sum_{x \in \mathcal{X}} p(x) (\log_b a) \log_a p(x)   (21)
       = (\log_b a) \left( -\sum_{x \in \mathcal{X}} p(x) \log_a p(x) \right)   (22)
       = (\log_b a) H_a(X)   (23)
15 / 136
Example 3- Binary Entropy Function
• Consider a binary r.v. X :
X = 1 with probability p, and X = 0 with probability 1 - p.   (24)
• Its entropy is the binary entropy function:
H(X) = H(p) = -p \log p - (1 - p) \log (1 - p)
16 / 136
Example 3- Binary Entropy Function
• Properties of the binary entropy function:
• H(p) ≥ 0, with H(0) = H(1) = 0
• H(p) is symmetric: H(p) = H(1 - p)
• H(p) is concave in p
• The maximum value is 1 bit, attained at p = 1/2
18 / 136
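A minimal sketch for evaluating the binary entropy function (the name `binary_entropy` is illustrative):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))   # maximum of 1 bit at p = 0.5, symmetric around 0.5
```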
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
This strategy would look like:
[Decision tree of yes/no questions, asking about the most likely remaining outcome first]
19 / 136
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
• Then the expected number of questions is:
20 / 136
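As a sketch of the counting, assuming the distribution of Example 2 and the simple strategy that asks about one outcome at a time in order of decreasing probability (the exact question tree used in the original example may differ):

```python
from math import log2

# PMF of Example 2, sorted from most to least likely
pmf = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Ask "Is X = x?" for outcomes in order of decreasing probability.
# Outcome i (0-indexed) is identified after i+1 questions, except the last
# outcome, which is known once all other questions were answered "no".
questions = [i + 1 for i in range(len(pmf))]
questions[-1] = len(pmf) - 1

expected_questions = sum(p * q for p, q in zip(pmf, questions))
entropy = -sum(p * log2(p) for p in pmf)
print(expected_questions, entropy)   # about 2.02 questions vs. H(X) = 2.0 bits
```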
Recap
• Entropy is a measure of uncertainty, randomness, amount of
self-information.
• Less entropy means
• less randomness, less self-information
• more compression: a smaller average number of bits needed to represent the outcomes
• In the future chapters, we will study these concepts in detail.
• So far we have covered Sections 1 and 2.1 from the book Elements of
Information Theory, Cover-Thomas.
• Next, we will cover Sections 2.2 and 2.3.
21 / 136
Joint Entropy
• We can extend the notion of entropy to a pair of random variables.
• Definition 2. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with joint PMF p(x, y) is:
H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) = E\left[\log \frac{1}{p(X, Y)}\right]
22 / 136
Conditional Entropy
• Conditional entropy H(Y |X ) quantifies the amount of uncertainty
remaining in Y when we know X .
• Definition 3. The conditional entropy H(Y |X ) is defined as:
H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)   (34)
       = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)}   (35)
       = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(y|x) p(x) \log \frac{1}{p(y|x)}   (36)
       = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)}   (37)
       = E\left[\log \frac{1}{p(Y|X)}\right]   (38)
where (37) uses p(y|x) p(x) = p(x, y).
• The expectation is over X, Y, i.e., E[\log \frac{1}{p(Y|X)}] = E_{X,Y}[\log \frac{1}{p(Y|X)}]
23 / 136
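A small numerical sketch of Definition 3, computing H(Y|X) from a joint PMF given as a dictionary (the function and variable names are illustrative):

```python
from math import log2
from collections import defaultdict

def conditional_entropy(joint):
    """H(Y|X) = sum_{x,y} p(x,y) log(1/p(y|x)) for a joint PMF {(x, y): prob}."""
    px = defaultdict(float)
    for (x, _), p in joint.items():
        px[x] += p                      # marginal p(x)
    h = 0.0
    for (x, y), pxy in joint.items():
        if pxy > 0:
            h -= pxy * log2(pxy / px[x])   # p(y|x) = p(x,y)/p(x)
    return h

# joint PMF of Example 5 (which appears later in the lecture), as an illustration
joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}
print(conditional_entropy(joint))   # H(Y|X) ≈ 0.6887 bits
```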
Chain Rule of Entropy
• Theorem 1. The chain rule of entropy:
H(X, Y) = H(X) + H(Y|X)   (39)
24 / 136
Chain Rule of Entropy - Proof
H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)   (40)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log (p(y|x) p(x))   (41)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)   (42)
        = -\sum_{x \in \mathcal{X}} \Big( \underbrace{\sum_{y \in \mathcal{Y}} p(x, y)}_{= p(x)} \Big) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)   (43)
        = -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)   (44)
        = H(X) + H(Y|X)
25 / 136
Chain Rule of Entropy - Alternative Proof
• Note: The proof can also be carried out by noting
\log \frac{1}{p(x, y)} = \log \frac{1}{p(x)} + \log \frac{1}{p(y|x)}   (46)
and taking the expectation of both sides over (X, Y):
E_{X,Y}\left[\log \frac{1}{p(X, Y)}\right] = E_{X,Y}\left[\log \frac{1}{p(X)} + \log \frac{1}{p(Y|X)}\right]   (47)
= E_{X,Y}\left[\log \frac{1}{p(X)}\right] + E_{X,Y}\left[\log \frac{1}{p(Y|X)}\right]   (48)
= E_{X}\left[\log \frac{1}{p(X)}\right] + E_{X,Y}\left[\log \frac{1}{p(Y|X)}\right]   (49)
= H(X) + H(Y|X)   (50)
26 / 136
Chain Rule of Entropy
• Also, we have (by symmetry):
H(X, Y) = H(Y) + H(X|Y)
27 / 136
Chain Rule for Many Random Variables
• Theorem 2. The chain rule for n random variables:
H(X_1, . . . , X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, . . . , X_1)   (53)
28 / 136
Chain Rule for Many Random Variables
• Proof. From the chain rule for conditional probabilities,
p(x_1, . . . , x_n) = p(x_1) p(x_2|x_1) \cdots p(x_n|x_{n-1}, . . . , x_1)   (54)
then
H(X_1, . . . , X_n)
= -\sum_{x_1 \in \mathcal{X}_1, . . . , x_n \in \mathcal{X}_n} p(x_1, . . . , x_n) \log p(x_1, . . . , x_n)   (55)
= -\sum_{x_1, . . . , x_n} p(x_1, . . . , x_n) \log (p(x_1) \cdots p(x_n|x_{n-1}, . . . , x_1))   (56)
= -\sum_{x_1, . . . , x_n} p(x_1, . . . , x_n) (\log p(x_1) + . . . + \log p(x_n|x_{n-1}, . . . , x_1))
= -\sum_{x_1, . . . , x_n} p(x_1, . . . , x_n) \log p(x_1) - . . . - \sum_{x_1, . . . , x_n} p(x_1, . . . , x_n) \log p(x_n|x_{n-1}, . . . , x_1)
= H(X_1) + H(X_2|X_1) + . . . + H(X_n|X_{n-1}, . . . , X_1)
29 / 136
Chain Rule for Many Random Variables
• Simpler proof. From the chain rule for conditional probabilities, p(x_1, . . . , x_n) = p(x_1) p(x_2|x_1) \cdots p(x_n|x_{n-1}, . . . , x_1), so
\log \frac{1}{p(x_1, . . . , x_n)} = \log \frac{1}{p(x_1)} + \log \frac{1}{p(x_2|x_1)} + . . . + \log \frac{1}{p(x_n|x_{n-1}, . . . , x_1)}   (59)
Take the expectation of both sides with respect to (X_1, . . . , X_n):
E_{X_1,...,X_n}\left[\log \frac{1}{p(X_1, . . . , X_n)}\right]
= E_{X_1,...,X_n}\left[\log \frac{1}{p(X_1)} + . . . + \log \frac{1}{p(X_n|X_{n-1}, . . . , X_1)}\right]   (60)
= E_{X_1,...,X_n}\left[\log \frac{1}{p(X_1)}\right] + . . . + E_{X_1,...,X_n}\left[\log \frac{1}{p(X_n|X_{n-1}, . . . , X_1)}\right]   (61)
= H(X_1) + H(X_2|X_1) + . . . + H(X_n|X_{n-1}, . . . , X_1)   (62)
30 / 136
Corollary
• Corollary 1. For three random variables X, Y, Z:
H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
31 / 136
Corollary
• Proof. From the chain rule, p(x, y|z) = p(x|z) p(y|x, z). Then,
H(X, Y|Z)
= \sum_{z \in \mathcal{Z}} p(z) H(X, Y|Z = z)
= -\sum_{z \in \mathcal{Z}} p(z) \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y|z) \log \underbrace{p(x, y|z)}_{p(x|z) p(y|x,z)}
= -\sum_{z \in \mathcal{Z}} p(z) \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y|z) \log p(x|z) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y|z) \log p(y|x, z) \right)
= -\sum_{z \in \mathcal{Z}} p(z) \left( \sum_{x \in \mathcal{X}} p(x|z) \log p(x|z) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y|z) \log p(y|x, z) \right)
= -\sum_{z \in \mathcal{Z}} \sum_{x \in \mathcal{X}} p(x, z) \log p(x|z) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} p(x, y, z) \log p(y|x, z)
= H(X|Z) + H(Y|X, Z)
32 / 136
Example 5
• Consider a pair of r.v.s X , Y with a joint PMF p(x, y ):
X \ Y |  0  |  1
  0   | 1/2 | 1/4
  1   |  0  | 1/4
(64)
33 / 136
Example 5 - Solution
1) Find H(X ).
• From the definition of marginal probabilities:
p(x) = P[X = x] = \sum_{y \in \mathcal{Y}} P[X = x, Y = y] = \sum_{y \in \mathcal{Y}} p(x, y)
P[X = 0] = 3/4,  P[X = 1] = 1/4  ⇒  p(x) = (3/4, 1/4)
H(X) = -\frac{3}{4} \log \frac{3}{4} - \frac{1}{4} \log \frac{1}{4} = 0.8113 bits
34 / 136
Example 5 - Solution
2) Find H(Y ).
• The marginal PMF of Y , p(y) is:
P[Y = 0] = 1/2,  P[Y = 1] = 1/2  ⇒  p(y) = (1/2, 1/2)
H(Y) = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \log \frac{1}{2} = 1 bit
35 / 136
Example 5 - Solution
3) Find the conditional entropy H(X |Y ).
• For this, we first need to find conditional PMF p(x|y ).
• From the definition of conditional probabilities:
p(x|y) = P[X = x|Y = y] = \frac{P[X = x, Y = y]}{P[Y = y]} = \frac{p(x, y)}{p(y)}
• Then,
P[X = 0|Y = 0] = \frac{P[X = 0, Y = 0]}{P[Y = 0]} = \frac{1/2}{1/2} = 1,  P[X = 1|Y = 0] = 0
P[X = 0|Y = 1] = \frac{1/4}{1/2} = \frac{1}{2},  P[X = 1|Y = 1] = \frac{1}{2}
36 / 136
Example 5 - Solution
• Then,
H(X|Y) = \sum_{y \in \{0,1\}} p(y) H(X|Y = y)
• Note that:
H(X|Y = 0) = -1 \log 1 - 0 \log 0 = 0
• whereas
H(X|Y = 1) = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \log \frac{1}{2} = 1
• Then,
H(X|Y) = \sum_{y \in \{0,1\}} p(y) H(X|Y = y) = \frac{1}{2} \times 0 + \frac{1}{2} \times 1 = 1/2
37 / 136
Example 5 - Solution
4) Find the conditional entropy H(Y |X ).
• For this, we first need to find conditional PMF p(y |x).
• From the definition of conditional probabilities:
p(y|x) = P[Y = y|X = x] = \frac{P[Y = y, X = x]}{P[X = x]} = \frac{p(y, x)}{p(x)}
• Then,
P[Y = 0|X = 0] = \frac{P[Y = 0, X = 0]}{P[X = 0]} = \frac{1/2}{3/4} = 2/3,  P[Y = 1|X = 0] = 1/3
P[Y = 0|X = 1] = \frac{P[Y = 0, X = 1]}{P[X = 1]} = \frac{0}{1/4} = 0,  P[Y = 1|X = 1] = 1 - 0 = 1
38 / 136
Example 5 - Solution
• Then,
H(Y|X) = \sum_{x \in \{0,1\}} p(x) H(Y|X = x)
where
H(Y|X = 0) = -\frac{2}{3} \log \frac{2}{3} - \frac{1}{3} \log \frac{1}{3} = 0.9183
and
H(Y|X = 1) = -0 \log 0 - 1 \log 1 = 0
Then,
H(Y|X) = \sum_{x \in \{0,1\}} p(x) H(Y|X = x) = \frac{3}{4} \times 0.9183 + \frac{1}{4} \times 0 = 0.6887
39 / 136
Example 5 - Solution
• 5) Find the joint entropy H(X , Y ).
H(X, Y) = -\frac{1}{2} \log \frac{1}{2} - 0 \log 0 - \frac{1}{4} \log \frac{1}{4} - \frac{1}{4} \log \frac{1}{4} = 1.5 bits
• As a consistency check, the chain rule gives the same answer from both directions:
H(X, Y) = H(X, Y)
H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X) - H(X|Y) = H(Y) - H(Y|X)
0.8113 - 1/2 = 1 - 0.6887
0.3113 = 0.3113
40 / 136
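The numbers in Example 5 can also be checked numerically; a sketch (function and variable names are illustrative):

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}   # p(x, y) of Example 5

def H(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]   # (3/4, 1/4)
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]   # (1/2, 1/2)

HX = H(px)                                  # 0.8113
HY = H(py)                                  # 1.0
HXY = H(list(joint.values()))               # 1.5
print(HX, HY, HXY)
print(HXY - HX, "= H(Y|X)")                 # 0.6887, via the chain rule
print(HXY - HY, "= H(X|Y)")                 # 0.5
print(HX - (HXY - HY), HY - (HXY - HX))     # both equal 0.3113
```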
Relative Entropy - KL-distance
• Definition 4. The relative entropy or Kullback-Leibler (KL) distance
between two PMFs p(x) and q(x) (that are defined on the same
alphabet) is:
D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
        = E\left[\log \frac{p(X)}{q(X)}\right]
• The relative entropy is a measure of distance between two
distributions (although it is actually not a true distance measure
because it is not symmetric and does not satisfy the triangle
inequality!). However, we will see that it is always ≥ 0 and = 0 if and
only if p = q.
• If there is any symbol x ∈ X for which p(x) > 0 and q(x) = 0, then
D(p||q) = ∞.
• This notion is also called KL divergence or information divergence.
41 / 136
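A minimal sketch of Definition 4 (the function name `kl_divergence` is illustrative); it returns infinity when p(x) > 0 but q(x) = 0:

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)), in bits."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return inf
            d += pi * log2(pi / qi)
    return d

p = [3/4, 1/4]
q = [1/2, 1/2]
print(kl_divergence(p, q), kl_divergence(q, p))   # ≈ 0.189 and ≈ 0.208: not symmetric
```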
Mutual Information
• Definition 5. Consider two random variables X and Y with a joint PMF p(x, y) and marginal PMFs p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product of the marginals:
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = D(p(x, y) || p(x) p(y))
• Also note that I(X; Y) = E\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right]
42 / 136
Mutual Information
• Theorem 3. Relationship between entropy and mutual information:
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
43 / 136
Relationship Between Entropy and Mutual Information
• Proof.
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x|y)}{p(x)}
        = \underbrace{\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x)}}_{H(X)} - \underbrace{\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x|y)}}_{H(X|Y)}
        = H(X) - H(X|Y)
44 / 136
Relationship Between Entropy and Mutual Information
• Alternative Proof (by using expectations).
I(X; Y) = E_{X,Y}\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right]
        = E_{X,Y}\left[\log \frac{p(X|Y)}{p(X)}\right]
        = E_{X,Y}\left[\log \frac{1}{p(X)} - \log \frac{1}{p(X|Y)}\right]
        = E_{X}\left[\log \frac{1}{p(X)}\right] - E_{X,Y}\left[\log \frac{1}{p(X|Y)}\right]
        = H(X) - H(X|Y)
45 / 136
How to Interpret Mutual Information?
• We have seen that,
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)
• That is, mutual information is the reduction in the uncertainty of X due to the knowledge of Y (and, symmetrically, the reduction in the uncertainty of Y due to the knowledge of X).
46 / 136
Observations
• We can also write:
I(X; Y) = E\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right]
        = E\left[\log \frac{1}{p(X)}\right] + E\left[\log \frac{1}{p(Y)}\right] - E\left[\log \frac{1}{p(X, Y)}\right]
        = H(X) + H(Y) - H(X, Y)
48 / 136
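A numerical sketch comparing the direct definition of I(X; Y) with I(X; Y) = H(X) + H(Y) − H(X, Y), using the joint PMF of Example 5 (names are illustrative):

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}   # joint PMF of Example 5
px = {0: 3/4, 1: 1/4}
py = {0: 1/2, 1: 1/2}

def H(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

# Direct definition: I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x)p(y)) )
I_direct = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# Via entropies: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_entropy = H(px.values()) + H(py.values()) - H(joint.values())

print(I_direct, I_entropy)   # both ≈ 0.3113 bits
```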
Observations
• The following diagram shows these relationships:
[Venn diagram relating H(X), H(Y), and H(X, Y)]
50 / 136
Conditional Mutual Information
• Definition 6. Conditional mutual information:
I(X; Y|Z) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}, z \in \mathcal{Z}} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z) p(y|z)}
          = E_{X,Y,Z}\left[\log \frac{p(X, Y|Z)}{p(X|Z) p(Y|Z)}\right]
          = H(X|Z) - H(X|Y, Z)
51 / 136
Chain Rule of Mutual Information
• Recall the chain rule of entropy:
H(X_1, . . . , X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, . . . , X_1)
• Mutual information also has a chain rule!
• Theorem 4. Chain rule of mutual information:
I(X_1, . . . , X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, . . . , X_1)
• Proof.
I(X_1, . . . , X_n; Y) = H(X_1, . . . , X_n) - H(X_1, . . . , X_n|Y)
= \sum_{i=1}^{n} H(X_i | X_{i-1}, . . . , X_1) - \sum_{i=1}^{n} H(X_i | X_{i-1}, . . . , X_1, Y)
= \sum_{i=1}^{n} \left( H(X_i | X_{i-1}, . . . , X_1) - H(X_i | X_{i-1}, . . . , X_1, Y) \right)
= \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, . . . , X_1)
52 / 136
Conditional Relative Entropy
• Definition 7. For two joint PMFs p(x, y ) and q(x, y), the conditional
relative entropy is defined as:
D(p(y|x)||q(y|x)) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)}
                  = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(y|x)}{q(y|x)}
                  = E_{X,Y}\left[\log \frac{p(Y|X)}{q(Y|X)}\right]
53 / 136
Chain Rule of Relative Entropy
• Relative entropy also has a chain rule:
D(p(x, y)||q(x, y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
• Proof.
D(p(x, y)||q(x, y)) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{q(x, y)}
= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(y|x) p(x)}{q(y|x) q(x)}
= \underbrace{\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(y|x)}{q(y|x)}}_{D(p(y|x)||q(y|x))} + \underbrace{\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x)}{q(x)}}_{D(p(x)||q(x))}
54 / 136
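A small numerical check of the chain rule of relative entropy, using two arbitrarily chosen joint PMFs (a sketch; the specific numbers are only for illustration):

```python
from math import log2

X, Y = (0, 1), (0, 1)
p = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}       # p(x, y)
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}   # q(x, y)

def marginal_x(joint):
    return {x: sum(joint[(x, y)] for y in Y) for x in X}

px, qx = marginal_x(p), marginal_x(q)

D_joint = sum(p[xy] * log2(p[xy] / q[xy]) for xy in p if p[xy] > 0)
D_marg = sum(px[x] * log2(px[x] / qx[x]) for x in X if px[x] > 0)
# conditional relative entropy D(p(y|x)||q(y|x)) = sum_{x,y} p(x,y) log( p(y|x)/q(y|x) )
D_cond = sum(p[(x, y)] * log2((p[(x, y)] / px[x]) / (q[(x, y)] / qx[x]))
             for x in X for y in Y if p[(x, y)] > 0)

print(D_joint, D_marg + D_cond)   # the two quantities agree
```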
Convex Functions
• We will now briefly review the basic definitions of convexity and
present one of the most widely used inequalities in information theory.
55 / 136
Convex Functions
• Definition 8 (Convex function). A function f(x) is convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1:
f(λ x_1 + (1 - λ) x_2) ≤ λ f(x_1) + (1 - λ) f(x_2)
• The function is strictly convex if, for x_1 ≠ x_2, the above holds with equality only when λ = 0 or λ = 1.
[Figure: a convex function f(x); the chord λ f(x_1) + (1 - λ) f(x_2) lies above f(λ x_1 + (1 - λ) x_2)]
56 / 136
Example 6
• Example 6. f(x) = x^2 where x ∈ \mathbb{R}
[Plot of f(x) = x^2 over x ∈ [-10, 10]]
• is a convex function
58 / 136
Example 7
• Example 7. f (x) = − log x where x > 0
[Plot of f(x) = -\log x over x ∈ (0, 10]]
• is a convex function
60 / 136
Example 8
• Example 8. f(x) = e^x where x ∈ \mathbb{R}
[Plot of f(x) = e^x over x ∈ [-5, 5]]
• is a convex function
62 / 136
Example 9
• Example 9. f (x) = ax + b where x ∈ R
[Plot of f(x) = ax + b]
• is both convex and concave (a linear function satisfies the convexity definition with equality)
64 / 136
Example 10
• Example 10. f (x) = x log x where x ≥ 0
[Plot of f(x) = x \log x over x ∈ [0, 5]]
• is a convex function
65 / 136
Concave Functions
• Definition 8 (Concave function). A function f (·) is concave over
(a, b) if -f(x) is convex, i.e., for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1:
f(λ x_1 + (1 - λ) x_2) ≥ λ f(x_1) + (1 - λ) f(x_2)
• The function is strictly concave if, for x_1 ≠ x_2, the above holds with equality only when λ = 0 or λ = 1.
[Figure: a concave function f(x); the chord λ f(x_1) + (1 - λ) f(x_2) lies below f(λ x_1 + (1 - λ) x_2)]
66 / 136
Example 11
• Example 11. f(x) = \sqrt{x} where x ≥ 0
[Plot of f(x) = \sqrt{x} over x ∈ [0, 5]]
• is a concave function
68 / 136
Example 12
• Example 12. f (x) = log x where x > 0
[Plot of f(x) = \log x over x ∈ (0, 5]]
• is a concave function
70 / 136
How do we know if a function is convex (or concave)?
• If f(x) is twice differentiable,
f''(x) ≥ 0 → convex
f''(x) ≤ 0 → concave
• If you would like to learn more on convex functions, read Chapter 3 of:
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
71 / 136
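Convexity can also be sanity-checked numerically with a finite-difference approximation of f''(x); a rough sketch (step size and test points chosen arbitrarily):

```python
from math import log2, exp, sqrt

def second_difference(f, x, h=1e-4):
    """Finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

candidates = {
    "x^2 (convex)":      (lambda x: x * x,    [-2.0, 0.5, 3.0]),
    "-log2(x) (convex)": (lambda x: -log2(x), [0.3, 1.0, 4.0]),
    "e^x (convex)":      (lambda x: exp(x),   [-1.0, 0.0, 2.0]),
    "sqrt(x) (concave)": (lambda x: sqrt(x),  [0.5, 1.0, 4.0]),
}

for name, (f, points) in candidates.items():
    signs = [second_difference(f, x) for x in points]
    print(name, ["%+.3f" % s for s in signs])   # positive for convex, negative for concave
```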
Example 13
• Example 13. Is f(x) = x^2 for x ∈ \mathbb{R} convex?
f'(x) = 2x   (65)
f''(x) = 2 > 0 → convex   (66)
72 / 136
Example 14
• Example 14. Is f (x) = log x for x > 0 convex?
f'(x) = \frac{1}{x \ln 2}   (67)
f''(x) = -\frac{1}{x^2 \ln 2} < 0 → concave   (68)
73 / 136
Recap
• So far we have covered sections 2.3, 2.4, 2.5 from the textbook. Next,
we will cover 2.6, 2.7, 2.8.
74 / 136
Important Properties of Convex Functions
• Theorem 5. Let p_1, . . . , p_n ≥ 0 such that \sum_{i=1}^{n} p_i = 1. If f(x) is convex, then for any x_1, . . . , x_n,
f\left(\sum_{i=1}^{n} p_i x_i\right) ≤ \sum_{i=1}^{n} p_i f(x_i)
75 / 136
Important Properties of Convex Functions
• Proof. Can be proved by induction.
• Step 1. For n = 2, the claim is exactly the definition of convexity.
76 / 136
Important Properties of Convex Functions
• Step 2. Assume that the claim holds for n - 1. Then, for n:
\sum_{i=1}^{n} p_i f(x_i) = p_n f(x_n) + (1 - p_n) \sum_{i=1}^{n-1} \frac{p_i}{1 - p_n} f(x_i)
Now, set q_i = \frac{p_i}{1 - p_n} for i = 1, . . . , n - 1. Note that q_i ≥ 0 and \sum_{i=1}^{n-1} q_i = 1. Since we assumed that the hypothesis is true for n - 1,
\sum_{i=1}^{n} p_i f(x_i) = p_n f(x_n) + (1 - p_n) \sum_{i=1}^{n-1} q_i f(x_i)
≥ p_n f(x_n) + (1 - p_n) f\Big(\underbrace{\sum_{i=1}^{n-1} q_i x_i}_{\bar{x}}\Big)   (hypothesis true for n - 1)
≥ f(p_n x_n + (1 - p_n) \bar{x})   (hypothesis true for n = 2)
= f\left(p_n x_n + (1 - p_n) \sum_{i=1}^{n-1} \frac{p_i}{1 - p_n} x_i\right)   (substitute back \bar{x})
= f\left(\sum_{i=1}^{n} p_i x_i\right)
77 / 136
Jensen’s Inequality
• We will now state an important inequality.
• Theorem 6 (Jensen's Inequality). If f is a convex function and X is a random variable, then
E[f(X)] ≥ f(E[X])
78 / 136
Jensen’s Inequality
• Corollary 2. If f is concave,
E[f (X )] ≤ f (E[X ])
• Next, we will use Jensen’s inequality to prove some important
properties of the measures we have defined so far.
79 / 136
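A quick numerical illustration of Theorem 6 and Corollary 2 for a simple discrete random variable (the PMF is arbitrary):

```python
from math import log2

pmf = {1: 0.2, 2: 0.5, 10: 0.3}          # arbitrary PMF over the values of X

def expectation(g, pmf):
    return sum(p * g(x) for x, p in pmf.items())

EX = expectation(lambda x: x, pmf)       # E[X] = 4.2

# convex f: E[f(X)] >= f(E[X])
f = lambda x: x * x
print(expectation(f, pmf), ">=", f(EX))  # 32.2 >= 17.64

# concave g: E[g(X)] <= g(E[X])
g = lambda x: log2(x)
print(expectation(g, pmf), "<=", g(EX))  # ≈ 1.497 <= ≈ 2.07
```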
KL-distance is Non-negative
• Theorem 7 (Information Inequality). For two probability mass
functions (PMFs) p(x) and q(x) over an alphabet x ∈ X , we have:
D(p||q) ≥ 0
where equality holds if and only if p(x) = q(x) for all x.
• Proof. Define a set A = {x : p(x) > 0}. Then,
-D(p||q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
         = \sum_{x \in A} p(x) \left( -\log \frac{p(x)}{q(x)} \right)
         = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
         = E\left[\log \frac{q(X)}{p(X)}\right]   (expectation is taken over p(x) > 0)
80 / 136
KL-distance is Non-negative
• Proof continued. Recall that the function f(y) = \log y is concave. Therefore, \log \frac{q(x)}{p(x)} is concave in \frac{q(x)}{p(x)}. Then,
E\left[\log \frac{q(X)}{p(X)}\right] ≤ \log E\left[\frac{q(X)}{p(X)}\right]   (Jensen's inequality - Corollary 2)   (69)
= \log \left( \sum_{x \in A} p(x) \frac{q(x)}{p(x)} \right)   (70)
= \log \left( \sum_{x \in A} q(x) \right)   (71)
≤ \log \underbrace{\left( \sum_{x \in \mathcal{X}} q(x) \right)}_{=1}   (\log y is strictly increasing in y)   (72)
= \log 1   (probability of the entire sample space is 1)
= 0
• Therefore, D(p||q) ≥ 0
81 / 136
When is D(p||q) = 0?
• Note that f(y) = \log y is a strictly concave function of y. Then, from Jensen's inequality, (69) holds with equality, i.e., E[f(Y)] = f(E[Y]), if and only if Y = \frac{q(X)}{p(X)} is constant, i.e., q(x) = c \cdot p(x) for all x ∈ A and some constant c.
82 / 136
When is D(p||q) = 0?
• Finally, (72) becomes an equality if and only if \sum_{x \in A} q(x) = \sum_{x \in \mathcal{X}} q(x) = 1.
• Together, these give c = \sum_{x \in A} q(x) = 1, i.e., p(x) = q(x) for all x ∈ \mathcal{X}.
83 / 136
Mutual Information is Non-negative
• Corollary 3. Mutual information is non-negative:
I(X; Y) ≥ 0
with equality if and only if X and Y are independent.
• Proof. I(X; Y) = D(p(x, y)||p(x)p(y)) ≥ 0, by Theorem 7.
84 / 136
Corollaries
• Corollary 4. Conditional KL-distance is non-negative:
D(p(y|x)||q(y|x)) ≥ 0
• Similarly, conditional mutual information is non-negative:
I(X; Y|Z) ≥ 0
85 / 136
Upper Bound on Entropy
• Theorem 8. For any random variable X defined over an alphabet X ,
H(X ) ≤ log |X |
86 / 136
Upper Bound on Entropy
• Proof. We will use Theorem 7. Specifically, let p(x) denote the PMF of the random variable X and let u(x) = \frac{1}{|\mathcal{X}|} be the PMF of a uniform random variable over \mathcal{X}.
• Then,
D(p||u) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{u(x)}   (75)
        = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{u(x)} - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}   (76)
        = \sum_{x \in \mathcal{X}} p(x) \log |\mathcal{X}| - \underbrace{\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}}_{H(X) \text{ by definition}}   (77)
        = \log |\mathcal{X}| - H(X)
• Since D(p||u) ≥ 0 by Theorem 7, we conclude
H(X) ≤ \log |\mathcal{X}|   (80)
87 / 136
Uniform Distribution Maximizes Entropy
• Corollary 8. The uniform random variable has the largest entropy.
• Proof. Let X be a uniform random variable over a set of elements \mathcal{X}. Denote the PMF of X by u(x) = \frac{1}{|\mathcal{X}|} for all x ∈ \mathcal{X}. Then, the entropy of X is given by:
H(X) = -\sum_{x \in \mathcal{X}} u(x) \log u(x)   (81)
     = -\sum_{x \in \mathcal{X}} \frac{1}{|\mathcal{X}|} \log \frac{1}{|\mathcal{X}|}   (82)
     = \sum_{x \in \mathcal{X}} \frac{1}{|\mathcal{X}|} \log |\mathcal{X}|   (83)
     = \left( \sum_{x \in \mathcal{X}} \frac{1}{|\mathcal{X}|} \right) \log |\mathcal{X}|   (84)
     = \log |\mathcal{X}|   (85)
• By Theorem 8, no random variable over \mathcal{X} can have entropy larger than \log |\mathcal{X}|, so the uniform distribution achieves the maximum.
88 / 136
Conditioning Reduces Entropy
• Theorem 9. Conditioning cannot increase entropy:
H(X|Y) ≤ H(X)
with equality if and only if X and Y are independent.
• Proof. H(X) - H(X|Y) = I(X; Y) ≥ 0, by Corollary 3.
89 / 136
Example 15
• Let's consider a pair of r.v.s X, Y with a joint PMF p(x, y):
X \ Y |  0  |  1
  0   | 1/3 | 1/3
  1   |  0  | 1/3
(87)
• PMF of X:
P[X = 0] = 2/3,  P[X = 1] = 1/3,  p(x) = (2/3, 1/3)
• Then, the entropy of X is:
H(X) = -\frac{2}{3} \log \frac{2}{3} - \frac{1}{3} \log \frac{1}{3} = 0.918 bits
90 / 136
Example 15
• PMF of Y:
P[Y = 0] = \frac{1}{3},  P[Y = 1] = \frac{2}{3},  p(y) = (1/3, 2/3)
• Conditional PMF p(x|y):
P[X = 0|Y = 0] = \frac{P[X = 0, Y = 0]}{P[Y = 0]} = 1
P[X = 1|Y = 0] = 0
P[X = 0|Y = 1] = \frac{1/3}{2/3} = \frac{1}{2}
P[X = 1|Y = 1] = \frac{1}{2}
• Then,
p(x|y) |  0  |  1
   0   |  1  | 1/2
   1   |  0  | 1/2
(88)
91 / 136
Example 15
• Then,
H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) H(X|Y = y)
• Note that H(X|Y = 0) = 0 and H(X|Y = 1) = 1, so:
H(X|Y) = \frac{1}{3} \times 0 + \frac{2}{3} \times 1 = \frac{2}{3} = 0.667 bits < H(X)
as expected.
92 / 136
Independence Bound on Entropy
• Theorem 10 For any set of n random variables X1 , . . . , Xn , their joint
entropy can be upper bounded by the sum of the individual entropies.
H(X_1, X_2, . . . , X_n) ≤ \sum_{i=1}^{n} H(X_i)
with equality if and only if Xi are all independent from each other.
93 / 136
Recap
• So far, we have seen Jensen’s inequality and used it to prove some
important results and observations.
94 / 136
LOG-SUM inequality
• Theorem 11 (LOG-SUM Inequality). For non-negative numbers
a1 , . . . , an and b1 , . . . , bn , the following holds:
\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} ≥ \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}
95 / 136
LOG-SUM inequality
• Proof. We will use the conventions 0 \log 0 = 0, a \log \frac{a}{0} = ∞ (for a > 0), and 0 \log \frac{0}{0} = 0. Then, without loss of generality, we can assume a_i, b_i > 0 for all i.
• Define,
p(x_i) = \frac{a_i}{\sum_{j=1}^{n} a_j}
and
q(x_i) = \frac{b_i}{\sum_{j=1}^{n} b_j}
• Since p(x_i), q(x_i) ≥ 0, and
\sum_{i=1}^{n} p(x_i) = \sum_{i=1}^{n} q(x_i) = 1,
p and q are valid PMFs.
96 / 136
LOG-SUM inequality
• Next, consider the KL-distance between p and q, and recall that
KL-distance is non-negative D(p||q) ≥ 0. In other words,
D(p||q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} ≥ 0
⇒ \sum_{i=1}^{n} \frac{a_i}{\sum_{j=1}^{n} a_j} \left( \log \frac{a_i}{b_i} - \log \frac{\sum_{j=1}^{n} a_j}{\sum_{j=1}^{n} b_j} \right) ≥ 0
⇒ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} ≥ \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{j=1}^{n} a_j}{\sum_{j=1}^{n} b_j}
97 / 136
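A quick numerical check of Theorem 11 with arbitrary non-negative numbers (a sketch):

```python
from math import log2

a = [0.5, 1.0, 2.5]
b = [1.0, 0.2, 1.8]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))
print(lhs, ">=", rhs)   # the log-sum inequality holds (≈ 3.01 >= ≈ 1.66)
```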
Relative entropy is convex
• Theorem 12. The relative entropy (KL-distance) is convex in the pair of distributions (p, q). That is, for any PMF pairs (p_1, q_1) and (p_2, q_2) and any 0 ≤ λ ≤ 1:
D(λ p_1 + (1 - λ) p_2 || λ q_1 + (1 - λ) q_2) ≤ λ D(p_1||q_1) + (1 - λ) D(p_2||q_2)
98 / 136
Relative entropy is convex
• Now, the inequality from Theorem 12 says that:
\sum_{x \in \mathcal{X}} (λ p_1(x) + (1 - λ) p_2(x)) \log \frac{λ p_1(x) + (1 - λ) p_2(x)}{λ q_1(x) + (1 - λ) q_2(x)}
≤ λ \sum_{x \in \mathcal{X}} p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1 - λ) \sum_{x \in \mathcal{X}} p_2(x) \log \frac{p_2(x)}{q_2(x)}   (89)
• Let's have a look at this for one term with a specific x on both sides. Can we prove the following inequality?
(λ p_1(x) + (1 - λ) p_2(x)) \log \frac{λ p_1(x) + (1 - λ) p_2(x)}{λ q_1(x) + (1 - λ) q_2(x)} ≤ λ p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1 - λ) p_2(x) \log \frac{p_2(x)}{q_2(x)}   (90)
99 / 136
Relative entropy is convex
• Log-sum inequality says that:
\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} ≥ \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}
• By setting n = 2, we have:
a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2} ≥ (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2}   (91)
• Let
a_1 = λ p_1(x),  b_1 = λ q_1(x)
a_2 = (1 - λ) p_2(x),  b_2 = (1 - λ) q_2(x)
• Then, (91) becomes,
λ p_1(x) \log \frac{λ p_1(x)}{λ q_1(x)} + (1 - λ) p_2(x) \log \frac{(1 - λ) p_2(x)}{(1 - λ) q_2(x)} ≥ (λ p_1(x) + (1 - λ) p_2(x)) \log \frac{λ p_1(x) + (1 - λ) p_2(x)}{λ q_1(x) + (1 - λ) q_2(x)}
which means we proved (90).
• Now, if we take the sum over all x ∈ \mathcal{X} of both sides, we have (89), which proves Theorem 12.
100 / 136
Relative entropy is convex
• Corollary 7. Relative entropy is convex in p for any fixed q.
• Proof. Choose q_1 = q_2 = q in Theorem 12:
D(λ p_1 + (1 - λ) p_2 || q) ≤ λ D(p_1||q) + (1 - λ) D(p_2||q)
101 / 136
Entropy is concave
• Theorem 13. (Concavity of entropy) Let H(p) denote the entropy of
X , with p representing the PMF of X (that is, if X = {x1 , . . . , xn },
p = (p(x1 ), p(x2 ), . . . , p(xn ))). Then, H(p) is a concave function of p.
• Proof. We will use Corollary 7.
• Let q(x) = \frac{1}{|\mathcal{X}|} for all x ∈ \mathcal{X}, i.e., q(x) = u(x) (recall that u(x) is the PMF of the discrete uniform random variable). Then,
D(p||u) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|} = \log |\mathcal{X}| - H(X)
• Therefore, H(p) = \log |\mathcal{X}| - D(p||u). Since D(p||u) is convex in p (Corollary 7) and \log |\mathcal{X}| is a constant, H(p) is a concave function of p.
103 / 136
Mutual Information is concave in p(x) for fixed p(y |x)
• Theorem 14. Mutual Information I(X ; Y ) is a concave function of p(x)
for fixed p(y |x).
• Proof. Recall that:
I(X; Y) = H(Y) - H(Y|X)
• (A): H(Y) is concave in p(x) for fixed p(y|x):
• Note that for fixed p(y|x), p(y) is a linear function of p(x) (because p(y) = \sum_{x} p(y|x) p(x)).
• We know by Theorem 13 (concavity of entropy) that H(Y) is concave in p(y).
• Fact 1. Let y be a linear function of x. Then, a function f is concave in x if and only if f is concave in y.
• Thus, H(Y) for fixed p(y|x) is concave in p(x).
103 / 136
Mutual Information is concave in p(x) for fixed p(y |x)
• (B): -H(Y|X) is a linear function of p(x) for fixed p(y|x):
-H(Y|X) = -\sum_{x} \underbrace{H(Y|X = x)}_{\text{fixed since } p(y|x) \text{ is fixed}} p(x)
• Therefore, I(X; Y) = H(Y) - H(Y|X) is the sum of a concave function and a linear function of p(x), hence concave in p(x).
104 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• Theorem 14’. Mutual Information I(X ; Y ) is a convex function of
p(y|x) for fixed p(x).
• Proof. For a fixed p(x), let's define a function f(·) of p(y|x):
f(p(y|x)) = I(X; Y) = D(p(x) p(y|x) || p(x) p(y)),  where p(y) = \sum_{x} p(y|x) p(x)
105 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• In order to show that f(p(y|x)) = I(X; Y) is convex in p(y|x), we need to show that (by definition of convexity):
D(p(x) p_λ(y|x) || p(x) p_λ(y))
≤ λ D(p(x) p_1(y|x) || p(x) p_1(y)) + (1 - λ) D(p(x) p_2(y|x) || p(x) p_2(y))   (96)
where p_λ(y|x) ≜ λ p_1(y|x) + (1 - λ) p_2(y|x), p_1(y) ≜ \sum_{x} p_1(y|x) p(x), and p_2(y) ≜ \sum_{x} p_2(y|x) p(x).
• Note that the first argument on the left-hand side is a convex combination of joint PMFs,
p(x) p_λ(y|x) = λ p(x) p_1(y|x) + (1 - λ) p(x) p_2(y|x),
and so is the second argument,
p(x) p_λ(y) = p(x) \sum_{x'} p(x') (λ p_1(y|x') + (1 - λ) p_2(y|x')) = λ p(x) p_1(y) + (1 - λ) p(x) p_2(y)   (98)
• Since D(p||q) is convex in the pair (p, q) (Theorem 12), the inequality (96) holds and the proof is completed.
107 / 136
Entropy of a function of a random variable
• Let g(X ) be some known function of X . Then,
H(g(X )) ≤ H(X )
108 / 136
Entropy of a function of a random variable
• Proof. We know that,
H(X, g(X)) = H(X) + H(g(X)|X) = H(X),
since g(X) is fully determined by X, so H(g(X)|X) = 0. On the other hand,
H(X, g(X)) = H(g(X)) + H(X|g(X)) ≥ H(g(X)),
which means
H(g(X)) ≤ H(X),
with equality if and only if H(X|g(X)) = 0, i.e., g is one-to-one on the support of X.
110 / 136
Example 16
• Example 16. (Entropy of a sum) Let X and Y be two random variables
taking values in x1 , . . . , xn and y1 , . . . , yn , respectively. Let Z = X + Y .
(a) Show that H(Z |X ) = H(Y |X ) and if X and Y are independent,
then H(Z ) ≥ H(X ) and H(Z ) ≥ H(Y ).
• Solution:
H(Z|X) = \sum_{x} p(x) H(Z|X = x)
       = \sum_{x} p(x) \sum_{z} P[Z = z|X = x] \log \frac{1}{P[Z = z|X = x]}
       = \sum_{x} p(x) \sum_{z} P[Y = z - x|X = x] \log \frac{1}{P[Y = z - x|X = x]}
       = \sum_{x} p(x) \underbrace{\sum_{y} P[Y = y|X = x] \log \frac{1}{P[Y = y|X = x]}}_{H(Y|X = x)}
       = H(Y|X)   (105)
110 / 136
Example 16
• If X and Y are independent, then H(Y|X) = H(Y) and H(X|Y) = H(X). Since conditioning cannot increase entropy,
H(Z) ≥ H(Z|X) = H(Y|X) = H(Y),
and similarly,
H(Z) ≥ H(Z|Y) = H(X|Y) = H(X).
111 / 136
Example 16
• (b) Give an example of (necessarily dependent) random variables in
which H(X ) > H(Z ) and H(Y ) > H(Z ) where Z = X + Y .
• Solution.
Let
X = 1 w.p. (with probability) 1/2, and X = 0 w.p. 1/2
and
Y = -X,  Z = X + Y
• Then, Z = 0 with probability 1, so H(Z) = 0, whereas H(X) = H(Y) = 1 bit.
112 / 136
Example 16
• (c) Under what conditions does
H(Z) = H(X) + H(Y)
hold?
113 / 136
Example 16
• Solution. Recall that for any function g(X ) of X , we showed that
H(g(X )) ≤ H(X ), with equality if and only if the function is 1-to-1.
• Let Z be a function of (X, Y). Then,
H(Z) ≤ H(X, Y) ≤ H(X) + H(Y).
• Hence H(Z) = H(X) + H(Y) holds if and only if both inequalities are tight: X and Y must be independent, and the map (x, y) → x + y must be one-to-one, so that (X, Y) can be recovered from Z.
115 / 136
Markov Chains
• Let Xn be a discrete-time, discrete state random process.
• X_n is called a Markov chain iff
P[X_{n+1} = x_{n+1} | X_n = x_n, X_{n-1} = x_{n-1}, . . . , X_1 = x_1] = P[X_{n+1} = x_{n+1} | X_n = x_n]
i.e., given the present state, the future is independent of the past.
[State transition diagram with states i, j, k, l]
116 / 136
Markov Chains
• Definition 10. Three random variables X , Y , Z form a Markov chain
in that order, shown as X → Y → Z if the following equivalent
conditions are satisfied:
1. p(z|y, x) = p(z|y )
2. p(x, y , z) = p(z|y)p(y |x)p(x)
3. X and Z are conditionally independent given Y, i.e., p(x, z|y) = p(x|y) p(z|y)
• Theorem (Data Processing Inequality). If X → Y → Z forms a Markov chain, then
I(X; Y) ≥ I(X; Z)
118 / 136
Data Processing Inequality (DPI)
• Proof. Recall the Chain rule of mutual information:
I(X_1, . . . , X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, . . . , X_1)
• So, we have,
I(X; Y, Z) = I(X; Y|Z) + I(X; Z)
and similarly,
I(X; Y, Z) = I(X; Z|Y) + I(X; Y)
so that:
I(X; Y|Z) + I(X; Z) = I(X; Z|Y) + I(X; Y)   (107)
• If X → Y → Z, then, conditioned on Y, random variables X and Z are independent (i.e., p(x, z|y) = p(x|y) p(z|y)). Therefore the mutual information between X and Z conditioned on Y is 0, i.e.,
I(X; Z|Y) = 0
119 / 136
Data Processing Inequality (DPI)
• Then, (107) becomes,
I(X; Y) = I(X; Z) + I(X; Y|Z) ≥ I(X; Z),
since I(X; Y|Z) ≥ 0. This proves the data processing inequality.
120 / 136
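A small sketch of the data processing inequality for a concrete Markov chain X → Y → Z, where Y is X passed through a binary symmetric channel and Z is Y passed through a second one (the crossover probabilities are arbitrary choices for illustration):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) computed from a joint PMF {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

p_x = {0: 0.5, 1: 0.5}
eps1, eps2 = 0.1, 0.2          # crossover probabilities of the two channels

# X -> Y: binary symmetric channel with crossover eps1
p_xy = {(x, y): p_x[x] * (1 - eps1 if y == x else eps1) for x in (0, 1) for y in (0, 1)}

# X -> Z: Z is Y passed through a second BSC with crossover eps2
p_xz = {}
for (x, y), pxy in p_xy.items():
    for z in (0, 1):
        p_xz[(x, z)] = p_xz.get((x, z), 0) + pxy * (1 - eps2 if z == y else eps2)

print(mutual_information(p_xy), ">=", mutual_information(p_xz))   # DPI: I(X;Y) >= I(X;Z)
```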
Data Processing Inequality (DPI)
• Corollary 8. If X → Y → Z forms a Markov chain, then
I(X; Y|Z) ≤ I(X; Y)
i.e., conditioning on a "downstream" observation Z cannot increase the dependence between X and Y.
121 / 136
Data Processing Inequality (DPI)
• Corollary 9. If Z = g(Y), then I(X; Y) ≥ I(X; g(Y)).
• D.P.I says that we can never get more information (about a random
variable) by further processing (that random variable)!
122 / 136
Example 17
• Problem. Show that if H(X |Y ) = 0, then there exists a function g(Y )
such that X = g(Y ). In other words, X is a function of Y .
123 / 136
Example 17
• Solution. Let’s start by the definition of H(X |Y ):
H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) H(X|Y = y)
       = \sum_{y \in \mathcal{Y}} p(y) \underbrace{\sum_{x \in \mathcal{X}} P[X = x|Y = y] \log \frac{1}{P[X = x|Y = y]}}_{≥ 0}
• Since H(X|Y) = 0 and every term is non-negative, we must have H(X|Y = y) = 0 for every y with p(y) > 0. This means each term P[X = x|Y = y] \log \frac{1}{P[X = x|Y = y]} equals 0, so P[X = x|Y = y] is either 0 or 1 for every x. Hence,
P[X = x_0|Y = y] = 1   (112)
for some x_0 ∈ \mathcal{X}.
124 / 136
Example 17
• We can re-write (112) as: for each y with p(y) > 0, there exists some x_0 with p(x_0|y) = 1 and p(x|y) = 0 for all x ≠ x_0, which means there is only one possible value of x, hence x = g(y).
• For p(y) = 0, we can assign x = g(y) to an arbitrary value in \mathcal{X}.
• Then, for each Y, X takes only one value with probability 1, i.e., X = g(Y).
125 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, H(X|Y) = 0 if and only if X is a function of Y.
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.
• We will now see an important inequality, called Fano’s inequality,
which extends this argument to arbitrary X and Y .
• It says that if we want to estimate a random variable X from observations of Y with a small probability of error, then H(X|Y) must be small.
126 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).
• We observe Y and want to guess the value of X. Let X̂ = g(Y) denote our estimate of X.
• The probability of error is
P_e = P(X̂ ≠ X)
127 / 136
Fano’s Inequality
• Theorem 13. (Fano's Inequality) For any estimator X̂ = g(Y) (this implies X → Y → X̂),
H(X|Y) ≤ H(P_e) + P_e \log |\mathcal{X}|
which can be weakened to (by using the fact that for the binary entropy function H(P_e) ≤ 1),
H(X|Y) ≤ 1 + P_e \log |\mathcal{X}|,  i.e.,  P_e ≥ \frac{H(X|Y) - 1}{\log |\mathcal{X}|}
128 / 136
Fano’s Inequality
• Proof. Define a random variable:
E = 1 if X̂ ≠ X (wrong decision), and E = 0 if X̂ = X (correct decision)   (116)
so that P[E = 1] = P_e and H(E) = H(P_e).
• Expanding H(E, X|Y) with the chain rule in two ways,
H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(X|Y)   (since E is determined by X and X̂ = g(Y))
H(E, X|Y) = H(E|Y) + H(X|E, Y) ≤ H(P_e) + H(X|E, Y)
129 / 136
Fano’s Inequality
• But
H(X|E, Y) = \sum_{e=0}^{1} P[E = e] H(X|Y, E = e)
          = \underbrace{P[E = 0]}_{1 - P_e} \underbrace{H(X|Y, E = 0)}_{0} + \underbrace{P[E = 1]}_{P_e} \underbrace{H(X|Y, E = 1)}_{≤ \log |\mathcal{X}|}
          ≤ P_e \log |\mathcal{X}|
(given E = 0, X = X̂ = g(Y) is determined by Y; given E = 1, the entropy of X is at most \log |\mathcal{X}|). Combining the two expansions,
H(X|Y) ≤ H(P_e) + P_e \log |\mathcal{X}| ≤ 1 + P_e \log |\mathcal{X}|
131 / 136
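A numerical sanity check of the weakened Fano bound H(X|Y) ≤ 1 + P_e log|X| for the MAP estimator on a small joint PMF (the PMF below is an arbitrary illustration):

```python
from math import log2

# arbitrary joint PMF p(x, y) with |X| = 3, |Y| = 2
joint = {(0, 0): 0.3, (1, 0): 0.1, (2, 0): 0.1,
         (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.1}

Xs, Ys = (0, 1, 2), (0, 1)
py = {y: sum(joint[(x, y)] for x in Xs) for y in Ys}

# conditional entropy H(X|Y)
HXgY = -sum(p * log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

# MAP estimator x_hat(y) = argmax_x p(x|y), and its error probability Pe
x_hat = {y: max(Xs, key=lambda x: joint[(x, y)]) for y in Ys}
Pe = sum(p for (x, y), p in joint.items() if x != x_hat[y])

print(HXgY, "<=", 1 + Pe * log2(len(Xs)))   # the weakened Fano inequality holds
```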
Another Useful Inequality
• Here is another inequality that relates probability of error and entropy.
• Theorem 18. Let X and X' be two independent identically distributed (i.i.d.) random variables with entropy H(X). Then,
P(X = X') ≥ 2^{-H(X)}
131 / 136
Another Useful Inequality
• Then
2^{-H(X)} = 2^{\sum_{x} p(x) \log p(x)}
          ≤ \sum_{x} p(x) 2^{\log p(x)}   (from Jensen's inequality, since 2^y is convex)   (118)
          = \sum_{x} p(x)^2
          = P(X = X')
132 / 136
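A quick check of Theorem 18, P(X = X') ≥ 2^{−H(X)}, for an arbitrary PMF:

```python
from math import log2

pmf = [0.5, 0.25, 0.15, 0.1]

H = -sum(p * log2(p) for p in pmf)
collision_prob = sum(p * p for p in pmf)   # P(X = X') for i.i.d. X, X'

print(2 ** (-H), "<=", collision_prob)     # ≈ 0.299 <= 0.345
```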
Summary
• We have now finished the “toolbox” lectures, i.e., Chapter 2 from the
book.
• Next, we will start Chapter 3, the “Asymptotic Equipartition Property”.
133 / 136