Advanced Digital Communication

This document summarizes a lecture on information theory. It reviews information content, entropy, and joint/conditional entropies, then covers the decomposability of entropy, relative entropy (KL divergence), and mutual information. Mutual information is defined as the reduction in uncertainty of one random variable due to knowledge of another, and a worked example illustrates how entropy decomposes into separate entropies for binary variables.

COMP2610/COMP6261 - Information Theory

Lecture 7: Relative Entropy and Mutual Information

Mark Reid and Aditya Menon

Research School of Computer Science


The Australian National University

August 12th, 2014

Last time

Information content and entropy: definition and computation


Entropy and average code length
Entropy and minimum expected number of binary questions
Joint and conditional entropies, chain rule

Information Content: Review

Let X be a random variable with outcomes in X

Let p(x) denote the probability of the outcome x ∈ X

The (Shannon) information content of outcome x is

h(x) = log2(1/p(x))

As p(x) → 0, h(x) → +∞ (rare outcomes are more informative)

Entropy: Review

The entropy is the average information content of all outcomes:


H(X) = Σ_x p(x) log2(1/p(x))

Entropy is minimised if p is peaked, and maximised if p is uniform:

0 ≤ H(X) ≤ log |X|

Entropy is related to the minimal number of bits needed to describe a random variable

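
To make the definitions above concrete, here is a minimal Python sketch (not part of the original slides; the function names `information_content` and `entropy` are mine) that computes h(x) and H(X) and checks the bounds 0 ≤ H(X) ≤ log |X| on a peaked and a uniform distribution.

```python
import math

def information_content(p_x, base=2):
    """Shannon information content h(x) = log(1/p(x)) of an outcome with probability p_x."""
    return math.log(1.0 / p_x, base)

def entropy(p, base=2):
    """Entropy H(X) = sum_x p(x) log(1/p(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(1.0 / px, base) for px in p if px > 0)

peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain

print(information_content(0.25))        # 2.0 bits: a 1-in-4 outcome carries 2 bits
print(entropy(peaked))                  # close to 0
print(entropy(uniform), math.log2(4))   # 2.0 == log2 |X|
```
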
This time

The decomposability property of entropy


Relative entropy and divergences
Mutual information

Outline

1 Decomposability of Entropy

2 Relative Entropy / KL Divergence

3 Mutual Information
Definition
Joint and Conditional Mutual Information

4 Wrapping up

Decomposability of Entropy
Example 1 (MacKay, 2003)

Let X ∈ {0, 1, 2} be a r.v. created by the following process:

1 Flip a fair coin to determine whether X = 0
2 If X ≠ 0, flip another fair coin to determine whether X = 1 or X = 2

The probability distribution of X is given by:

p(X = 0) = 1/2
p(X = 1) = 1/4
p(X = 2) = 1/4

Decomposability of Entropy
Example 1 (MacKay, 2003) — Cont’d

By definition,

H(X) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.

But imagine learning the value of X gradually:

1 First we learn whether X = 0:
  - Binary variable with p(1) = (1/2, 1/2)
  - Hence H(1/2, 1/2) = log2 2 = 1 bit.
2 If X ≠ 0, we learn the value of the second coin flip:
  - Also a binary variable with p(2) = (1/2, 1/2)
  - Therefore H(1/2, 1/2) = 1 bit.

However, the second revelation only happens half of the time:

H(X) = H(1/2, 1/2) + (1/2) H(1/2, 1/2) = 1.5 bits.

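
As a quick numerical sanity check of this decomposition (my own sketch, not from the slides), the direct computation of H(1/2, 1/4, 1/4) and the two-stage computation give the same 1.5 bits:

```python
import math

def H(*p):
    """Entropy in bits of a distribution given as positive probabilities."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

direct = H(1/2, 1/4, 1/4)                    # 1.5 bits
staged = H(1/2, 1/2) + 0.5 * H(1/2, 1/2)     # 1 bit, plus half the time another bit
print(direct, staged)                         # both 1.5
```
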
Decomposability of Entropy
Generalization

For a r.v. with probability distribution p = (p1, …, p|X|):

H(p) = H(p1, 1 − p1) + (1 − p1) · H( p2 / (1 − p1), …, p|X| / (1 − p1) )

H(p1, 1 − p1) = entropy for a random variable corresponding to “Is X = x0?”

H( p2 / (1 − p1), …, p|X| / (1 − p1) ) = entropy for a random variable corresponding to the outcomes when X ≠ x0

(1 − p1) = probability of X ≠ x0

Decomposability of Entropy
Generalization

In general, we have that for any m:

H(p) = H( Σ_{i=1..m} pi , Σ_{i=m+1..|X|} pi )
       + ( Σ_{i=1..m} pi ) · H( p1 / Σ_{i=1..m} pi , …, pm / Σ_{i=1..m} pi )
       + ( Σ_{i=m+1..|X|} pi ) · H( pm+1 / Σ_{i=m+1..|X|} pi , …, p|X| / Σ_{i=m+1..|X|} pi )

Apply this formula with m = 1, |X| = 3, p = (p1, p2, p3) = (1/2, 1/4, 1/4)

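
The general formula can be verified numerically. The sketch below (the helper names `H` and `decomposed_entropy` are mine) splits a distribution at an arbitrary index m and checks that the decomposed entropy equals the direct entropy, including the suggested case m = 1, p = (1/2, 1/4, 1/4):

```python
import math

def H(p):
    """Entropy in bits of a probability vector; zero entries contribute 0."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def decomposed_entropy(p, m):
    """H(p) computed via the split into the first m outcomes and the rest."""
    a, b = sum(p[:m]), sum(p[m:])
    total = H([a, b])
    if a > 0:
        total += a * H([pi / a for pi in p[:m]])
    if b > 0:
        total += b * H([pi / b for pi in p[m:]])
    return total

p = [0.5, 0.25, 0.25]
print(H(p), decomposed_entropy(p, m=1))   # both 1.5

q = [0.1, 0.2, 0.3, 0.4]
print(H(q), decomposed_entropy(q, m=2))   # equal for any valid split point m
```
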
Entropy in Information Theory

If a random variable has distribution p, there exists an encoding with an average length of

H(p) bits

and this is the “best” possible encoding.

What happens if we use a “wrong” encoding?

e.g. because we make an incorrect assumption about the probability distribution

If the true distribution is p, but we assume it is q, it turns out we will need to use

H(p) + DKL(p‖q) bits

where DKL(p‖q) is some measure of “distance” between p and q

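
This claim can be illustrated numerically: the expected code length when codeword lengths are chosen for q but outcomes follow p (the cross-entropy) equals H(p) + DKL(p‖q). A minimal sketch, with arbitrarily chosen p and q of my own:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Average code length when outcomes follow p but codeword lengths log2(1/q) assume q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(cross_entropy(p, q))       # bits actually used with the "wrong" code
print(entropy(p) + kl(p, q))     # identical: H(p) + DKL(p||q)
```
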
Relative Entropy

Definition
The relative entropy or Kullback-Leibler (KL) divergence between two probability distributions p(X) and q(X) is defined as:

DKL(p‖q) = Σ_{x∈X} p(x) log( p(x) / q(x) ) = E_{p(X)}[ log( p(X) / q(X) ) ]

Note:
- Both p(X) and q(X) are defined over the same alphabet X
- Conventions:

0 log(0/0) := 0      0 log(0/q) := 0      p log(p/0) := ∞

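
A direct implementation of this definition might look as follows (a sketch, not from the slides), with the conventions handled by skipping zero-probability terms of p and returning infinity when q(x) = 0 while p(x) > 0:

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), for distributions over the same alphabet."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue                      # 0 log(0/q) = 0 by convention
        if qx == 0:
            return math.inf               # p log(p/0) = infinity
        total += px * math.log(px / qx, base)
    return total

print(kl_divergence([0.5, 0.5], [0.25, 0.75]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))     # 0, since p = q
```
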
Relative Entropy
Properties

DKL(p‖q) ≥ 0
DKL(p‖q) = 0 ⇔ p = q
DKL(p‖q) ≠ DKL(q‖p)
- Not a true distance, since it is not symmetric and does not satisfy the triangle inequality
- Hence, “KL divergence” rather than “KL distance”

Relative Entropy
Uniform q

Let q correspond to a uniform distribution: q(x) = 1/|X|

Relative entropy between p and q:

DKL(p‖q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
         = Σ_{x∈X} p(x) · (log p(x) + log |X|)
         = −H(X) + Σ_{x∈X} p(x) · log |X|
         = −H(X) + log |X|.

This matches the intuition that using the wrong encoding incurs a penalty in the number of bits.

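
A small numerical check of this identity (my own sketch): for an arbitrary p, the divergence from the uniform distribution equals log |X| − H(X).

```python
import math

def entropy(p):
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def kl(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
uniform = [1 / len(p)] * len(p)
print(kl(p, uniform))                      # 0.25
print(math.log2(len(p)) - entropy(p))      # same value: log|X| - H(X)
```
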
Relative Entropy
Example (from Cover & Thomas, 2006)

Let X ∈ {0, 1} and consider the distributions p(X) and q(X) such that:

p(X = 1) = θp    p(X = 0) = 1 − θp
q(X = 1) = θq    q(X = 0) = 1 − θq

What distributions are these?

Compute DKL(p‖q) and DKL(q‖p) with θp = 1/2 and θq = 1/4

Relative Entropy
Example (from Cover & Thomas, 2006) — Cont’d

DKL(p‖q) = θp log( θp / θq ) + (1 − θp) log( (1 − θp) / (1 − θq) )
         = (1/2) log( (1/2) / (1/4) ) + (1/2) log( (1/2) / (3/4) )
         = 1 − (1/2) log 3 ≈ 0.2075 bits

DKL(q‖p) = θq log( θq / θp ) + (1 − θq) log( (1 − θq) / (1 − θp) )
         = (1/4) log( (1/4) / (1/2) ) + (3/4) log( (3/4) / (1/2) )
         = −1 + (3/4) log 3 ≈ 0.1887 bits

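
The asymmetry is easy to reproduce; this sketch (the helper name `kl_bernoulli` is mine) computes both divergences for the two Bernoulli distributions above:

```python
import math

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in bits, for parameters strictly between 0 and 1."""
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))

theta_p, theta_q = 1/2, 1/4
print(kl_bernoulli(theta_p, theta_q))   # ~0.2075 bits
print(kl_bernoulli(theta_q, theta_p))   # ~0.1887 bits: D_KL(p||q) != D_KL(q||p)
```
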
Mutual Information
Definition

Let X, Y be two r.v. with joint distribution p(X, Y) and marginals p(X) and p(Y):

Definition
The mutual information I(X; Y) is the relative entropy between the joint distribution p(X, Y) and the product distribution p(X)p(Y):

I(X; Y) = DKL( p(X, Y) ‖ p(X)p(Y) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )

Intuitively, how much information, on average, X conveys about Y.

Relationship between Entropy and Mutual Information

We can re-write the definition of mutual information as:

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x|y) / p(x) )
        = − Σ_{x∈X} log p(x) Σ_{y∈Y} p(x, y) − ( − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y) )
        = H(X) − H(X|Y)

The average reduction in uncertainty of X due to the knowledge of Y.

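
These identities can be verified on a small joint distribution. The sketch below uses an arbitrary joint table of my own (not from the slides), computes I(X; Y) from the definition, and checks that it equals H(X) + H(Y) − H(X, Y) and H(X) − H(X|Y):

```python
import math

# Joint distribution p(x, y) as a dictionary; an arbitrary illustrative example.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 gives p(x), axis 1 gives p(y))."""
    m = {}
    for (x, y), p in joint.items():
        k = x if axis == 0 else y
        m[k] = m.get(k, 0.0) + p
    return m

def entropy(dist):
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

px, py = marginal(joint, 0), marginal(joint, 1)

# Mutual information from the definition: KL between the joint and the product of marginals.
I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(joint)
print(I)
print(H_X + H_Y - H_XY)       # same value: I(X;Y) = H(X) + H(Y) - H(X,Y)
print(H_X - (H_XY - H_Y))     # same value: I(X;Y) = H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)
```
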
Mutual Information
Properties

Mutual Information is non-negative:

I(X; Y) ≥ 0    why?

We have seen that H(Y) − H(Y|X) = H(X) − H(X|Y), therefore:

I(X; Y) = I(Y; X)

Since H(X, Y) = H(X) + H(Y|X) we have that:

I(X; Y) = H(X) + H(Y) − H(X, Y)

Finally:

I(X; X) = H(X) − H(X|X) = H(X)

Sometimes the entropy is referred to as self-information

Breakdown of Joint Entropy

[Diagram: the joint entropy H(X, Y) splits into H(X|Y), I(X; Y) and H(Y|X); H(X) covers H(X|Y) and I(X; Y), while H(Y) covers I(X; Y) and H(Y|X).]

Mutual Information
Example 1 (from MacKay, 2003)

Let X, Y, Z be r.v. with X, Y ∈ {0, 1}, X ⊥⊥ Y and:

p(X = 0) = p    p(X = 1) = 1 − p
p(Y = 0) = q    p(Y = 1) = 1 − q
Z = (X + Y) mod 2

(a) If q = 1/2, what is P(Z = 0)? P(Z = 1)? I(Z; X)?
(b) For general p and q, what is P(Z = 0)? P(Z = 1)? I(Z; X)?

Mutual Information
Example 1 (from MacKay, 2003) — Solution (a)

(a) As X ⊥⊥ Y and q = 1/2, the noise will flip the input with probability q = 0.5 regardless of the original input distribution. Therefore:

p(Z = 1) = E[Z] = 1/2    p(Z = 0) = 1/2

Hence:

I(X; Z) = H(Z) − H(Z|X) = 1 − 1 = 0

Indeed, for q = 1/2 we see that Z ⊥⊥ X

Mutual Information
Example 1 (from MacKay, 2003) — Solution (b)

(b)

ℓ := p(Z = 0) = p(X = 0) × p(no flip) + p(X = 1) × p(flip)
             = pq + (1 − p)(1 − q)
             = 1 + 2pq − q − p

Similarly:

p(Z = 1) = p(X = 1) × p(no flip) + p(X = 0) × p(flip)
         = (1 − p)q + p(1 − q)
         = q + p − 2pq

and:

I(Z; X) = H(Z) − H(Z|X)
        = H(ℓ, 1 − ℓ) − H(q, 1 − q)    why?

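
A numerical sketch of this result (my own code, not from the slides): computing I(Z; X) = H(ℓ, 1 − ℓ) − H(q, 1 − q) for a couple of values of p and q shows that it vanishes at q = 1/2, consistent with part (a).

```python
import math

def H2(t):
    """Binary entropy H(t, 1 - t) in bits."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def I_ZX(p, q):
    """I(Z; X) = H(Z) - H(Z|X) for Z = (X + Y) mod 2 with X, Y independent."""
    ell = p * q + (1 - p) * (1 - q)   # p(Z = 0)
    return H2(ell) - H2(q)            # H(Z|X) = H(q, 1 - q): given X, Z is determined by Y

print(I_ZX(p=0.8, q=0.5))   # 0.0: a fair "noise" coin makes Z independent of X
print(I_ZX(p=0.8, q=0.1))   # > 0: with less noise, Z reveals information about X
```
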
Joint Mutual Information

Recall that for random variables X, Y,

I(X; Y) = H(X) − H(X|Y)

Reduction in uncertainty in X due to knowledge of Y

More generally, for random variables X1, …, Xn, Y1, …, Ym,

I(X1, …, Xn; Y1, …, Ym) = H(X1, …, Xn) − H(X1, …, Xn | Y1, …, Ym)

Reduction in uncertainty in X1, …, Xn due to knowledge of Y1, …, Ym

Symmetry also generalises:

I(X1, …, Xn; Y1, …, Ym) = I(Y1, …, Ym; X1, …, Xn)

Conditional Mutual Information

The conditional mutual information between X and Y given Z = zk:

I(X; Y | Z = zk) = H(X | Z = zk) − H(X | Y, Z = zk).

Averaging over Z we obtain:

The conditional mutual information between X and Y given Z:

I(X; Y | Z) = H(X | Z) − H(X | Y, Z)
            = E_{p(X,Y,Z)}[ log( p(X, Y | Z) / (p(X | Z) p(Y | Z)) ) ]

The reduction in the uncertainty of X due to the knowledge of Y when Z is given.

Note that I(X; Y; Z) and I(X | Y; Z) are illegal terms, while e.g. I(A, B; C, D | E, F) is legal.

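
A minimal sketch of computing I(X; Y|Z) from a joint table, using the expectation form above; the three-variable distribution here is an arbitrary toy example of my own, not from the slides.

```python
import math

# Joint distribution p(x, y, z) over binary variables; an arbitrary illustrative example.
joint = {
    (0, 0, 0): 0.20, (0, 1, 0): 0.05, (1, 0, 0): 0.05, (1, 1, 0): 0.20,
    (0, 0, 1): 0.10, (0, 1, 1): 0.15, (1, 0, 1): 0.15, (1, 1, 1): 0.10,
}

def marg(joint, keep):
    """Marginalize onto the coordinates listed in `keep` (e.g. keep=(0, 2) gives p(x, z))."""
    out = {}
    for xyz, p in joint.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_xz, p_yz, p_z = marg(joint, (0, 2)), marg(joint, (1, 2)), marg(joint, (2,))

# I(X; Y | Z) = E_{p(x,y,z)} [ log p(x,y|z) / (p(x|z) p(y|z)) ]
I = 0.0
for (x, y, z), p in joint.items():
    if p > 0:
        ratio = (p / p_z[(z,)]) / ((p_xz[(x, z)] / p_z[(z,)]) * (p_yz[(y, z)] / p_z[(z,)]))
        I += p * math.log2(ratio)
print(I)   # conditional mutual information in bits
```
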
Summary

Decomposability of entropy
Relative entropy
Mutual information
Reading: MacKay §2.5, Ch. 8; Cover & Thomas §2.3 to §2.5

Next time

Mutual information chain rule

Jensen’s inequality

“Information cannot hurt”

Data processing inequality

