
Source modeling and source coding

Louis Wehenkel

Institut Montefiore, University of Liège, Belgium

ELEN060-2
Information and coding theory
February 2021

1 / 48
Outline

• Stochastic processes and models for information sources

• First Shannon theorem : data compression limit

• (Overview of state of the art in data compression)

• (Relations between automatic learning and data compression)

2 / 48
The objectives of this part of the course are the following :
• Understand the relationship between discrete information sources and stochastic processes
• Define the notions of entropy for stochastic processes (infinitely long sequences of not necessarily
independent symbols)
• Talk a little more about an important class of stochastic processes (Markov processes/chains)
• How can we compress sequences of symbols in a reversible way
• What are the ultimate limits of data compression
• Review state of the art data compression algorithms
• Look a little bit at irreversible data compression (analog signals, sound, images. . . )
Do not expect to become a specialist in data compression in one day. . .

3 / 48
1. Introduction

How and why can we compress information?


Simple example :
A source S which emits messages as sequences of the 3 symbols (a, b, c).
P(a) = 1/2, P(b) = 1/4, P(c) = 1/4 (i.i.d.)
Entropy: H(S) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 Shannon/source symbol
Note that there is redundancy: H(S) < log 3 = 1.585.
How to code “optimally” using a binary code?
1. Associate to each symbol of the source a binary word.
2. Make sure that the code is unambiguous (can decode correctly).
3. Optimality : minimal average encoded message length.

4 / 48
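To make the numbers above concrete, here is a minimal Python sketch (an added illustration, not part of the original slides) that evaluates H(S) for the distribution P(a) = 1/2, P(b) = 1/4, P(c) = 1/4:

```python
from math import log2

# Source distribution from the slide: P(a)=1/2, P(b)=1/4, P(c)=1/4
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# Entropy in Shannon (bits) per source symbol: H(S) = -sum p log2 p
H = -sum(p * log2(p) for p in P.values())

print(H)          # 1.5
print(log2(3))    # 1.585 -> maximum entropy of a 3-symbol source
```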
Solution(s)
Since there are 3 symbols : not enough binary words of length 1. ⇒ Let’s use words of
length 2, for example
a → 00
b → 01
c → 10
Average length: P(a)ℓ(a) + P(b)ℓ(b) + P(c)ℓ(c) = 2 (bits/source symbol)
Can we do better ?
Idea : let’s associate one bit to a (most frequent symbol) and two bits to b and c, for
example
a → 0
b → 01
c → 10
BAD LUCK: the two source messages ba and ac both result in 010 once encoded.
There is not a unique way to decode : code is not uniquely decodable.

5 / 48
Here, the codeword of a is a prefix of the codeword of b.
Another idea: let’s try a code without such prefixes:

a → 0
b → 10
c → 11

Does it work out ?


Indeed : code is uniquely decodable.
Average length: P(a)ℓ(a) + P(b)ℓ(b) + P(c)ℓ(c) = 1·(1/2) + 2·(1/4) + 2·(1/4) = 1.5 bit/symbol = H(S)
Conclusion.
By using short codewords for the most probable source symbols we were able to reach
an average length equal to H(S) (source entropy).

6 / 48
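The prefix-free code above is easy to exercise in code. The following sketch (an added illustration; the helper names encode/decode are arbitrary) encodes and decodes with a → 0, b → 10, c → 11 and checks that the average length equals H(S):

```python
from math import log2

code = {"a": "0", "b": "10", "c": "11"}   # prefix-free code from the slide
P = {"a": 0.5, "b": 0.25, "c": 0.25}

def encode(msg, code):
    return "".join(code[s] for s in msg)

def decode(bits, code):
    # Prefix-free => decode greedily, symbol by symbol ("instantaneous")
    inv = {w: s for s, w in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inv:
            out.append(inv[word])
            word = ""
    assert word == "", "trailing bits that do not form a codeword"
    return "".join(out)

msg = "abacab"
assert decode(encode(msg, code), code) == msg

avg_len = sum(P[s] * len(code[s]) for s in P)   # 1.5 bits/symbol
H = -sum(p * log2(p) for p in P.values())       # 1.5 Shannon/symbol
print(avg_len, H)
```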
Questions

Does this idea work in general ?


What if P(a) = 4/5, P(b) = 1/10, P(c) = 1/10?
What if successive symbols are not independent ?
Would it be possible to do better in terms of average length ?
Relation between the “prefix-less” condition and unique decodability
How to build a good code automatically?...

Before attempting to answer all these questions, we first need to see how we can model
real sources in a realistic way ⇒ two real examples:
Telecopy (Fax) :
- Sequences of 0 and 1 resulting from a black and white scanner.
Telegraph :
- Sequences of letters of the roman alphabet (say 26 letters + the space symbol).

7 / 48
FAX: sequences of 0 and 1

A. Most simple model : (order 0)


Let’s scan a few hundred fax messages with the appropriate resolution and quantize the
grey levels into two levels 0 (white), 1 (black). Let’s count the proportion of 0 and 1 in
the resulting file, say: p0 = 0.9 and p1 = 0.1
Let’s model the source as a sequence of independent and identically distributed (i.i.d.)
random variables according to this law.
Source entropy : H(S) = 0.469 Shannon/symbol
B. Slightly better model : (order 1)
Instead of assuming that successive symbols are independent, we suppose that for a
sequence of symbols s1 , s2 , . . . , sn , we have

P (sn |sn−1 , sn−2 , . . . , s1 ) = P (sn |sn−1 )

which we suppose is independent of n. This model is characterized by 6 numbers: initial
probabilities (first symbol) and transition probabilities p(0|0), p(1|0), p(0|1), p(1|1).

8 / 48
Estimates of the source model probabilities :
Initial probs. : again p0 = 0.9 and p1 = 0.1
and p(0|0) = 0.93, p(1|0) = 0.07, p(0|1) = 0.63 and p(1|1) = 0.37
With respect to the zero-order model, we have reinforced the probability of having
sequences of identical symbols ⇒ plausible.
This model : Two state time-invariant Markov process (chain)
[Figure: two-state transition diagram with states 0 and 1, self-loops p(0|0) and p(1|1), and transitions p(1|0) and p(0|1).]
This model is sufficiently general for our needs : we will study it in “detail”

9 / 48
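The following sketch (an added illustration, assuming the probabilities estimated above) computes the order-0 entropy H(0.9, 0.1) and the conditional entropy of the order-1 chain weighted by (p0, p1) = (0.9, 0.1), which happens to be the stationary distribution of this chain:

```python
from math import log2

def H(probs):
    """Entropy in Shannon of a discrete distribution (list of probabilities)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Order-0 model: i.i.d. bits with p0 = 0.9, p1 = 0.1
p = [0.9, 0.1]
print(H(p))                      # ~0.469 Shannon/symbol

# Order-1 model: transition probabilities p(.|0) and p(.|1)
T = {0: [0.93, 0.07], 1: [0.63, 0.37]}

# Conditional entropy H(X_n | X_{n-1}) = sum_i p_i * H(p(.|i))
H_cond = sum(p[i] * H(T[i]) for i in (0, 1))
print(H_cond)                    # ~0.42, lower than the order-0 entropy
```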
Telegraph : sequences of letters and spaces (English)

Order 0: simply statistics of letters and spaces in English text


H(S) ≤ log 27 = 4.76 Shannon/symbol
H(S) ≈ 4.03 Shannon/symbol (taking into account statistics of English text)
Higher orders
Markov process of order 1 : H(S) ≈ 3.42 Shannon/symbol
Markov process of order 15 : H(S) ≈ 2.1 Shannon/symbol
Markov process of order 100 : H(S) ≈ 1.3 Shannon/symbol
Compare the ratio 4.76/1.3 ≈ 3.7 to “gzip” compression rates.
Alternative : model English grammar... and possibilities of misspellings
NB: Markov himself first applied his chains to the statistical analysis of literary texts.

10 / 48
What we have just introduced intuitively are classical models of discrete time and discrete state
stochastic processes.
A discrete time stochastic process is a sequence of (time indexed) random variables. Typically (as in our
context) time starts at 0 and increases indefinitely by steps of 1. To each time index i corresponds a
random variable Si , and all the random variables share the same set of possible values which are called
the states of the process. Because we are mainly concerned about digital sources of information, we will
mainly look at the case where the state space is finite (i.e. discrete). Such models are appropriate tools
to represent sources which send messages of finite but unspecified length, as is the case in practice.
The most simple case of such a stochastic process corresponds to the situation where all the random
variables are independent and identically distributed. This is also the so-called random sampling model
used in statistics to represent the collection of data from a random experiment.
We will pursue our introduction to data compression by considering this example for the time being. As
we will see, the general theoretical results of information theory are valid only asymptotically, i.e. in
the limit when we study very long messages sent by a source. The asymptotic equipartition property is
the main result which characterizes in terms of source entropy the nature of the set of long messages
produced by a source. We will formulate and study this property in the case of the i.i.d. source model,
but the result holds for a much larger class of stochastic processes. However, we will have to study
these models and in particular provide a sound definition of source entropy.
Exercise.
Consider the zero-order fax model. What is the joint entropy of the first two symbols? What is the
joint entropy of the second and third symbols? What is the entropy of the first n symbols?
What is the entropy rate (entropy per symbol) of the first n symbols?

11 / 48
Examples of English language models

Uniform distribution of letters : DM QASCJDGFOZYNX ZSDZLXIKUD...


Zero-order model : OR L RW NILI E NNSBATEI...
First-order MARKOV model : OUCTIE IN ARE AMYST TE TUSE SOBE CTUSE...
Second-order MARKOV model :
HE AREAT BEIS HEDE THAT WISHBOUT SEED DAY OFTE AND HE IS FOR
THAT MINUMB LOOTS WILL AND GIRLS A DOLL WILL IS FRIECE ABOARICE
STRED SAYS...
Third-order French model : DU PARUT SE NE VIENNER PERDENT LA TET...
Second-order WORD model :
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE
CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE
LETTERS THAT THE TIME OF WHOEVER TOLD THE PROBLEM FOR AN
UNEXPECTED...

12 / 48
Asymptotic Equipartition Property (the clue of information theory)

Objective : study the structure of the set of long messages produced by discrete sources
Answer to the question: What if P(a) = 4/5, P(b) = 1/10, P(c) = 1/10?
Possibility to reach the limit H(S) for data compression.
First :
We consider a sequence of discrete r.vars. X1 , X2 , . . . independent and identically
distributed according to the distribution P (X ) and look at the distribution of long
messages.
This corresponds to the simplest non-trivial source model we can imagine.
Afterwards :
We will discuss the generalization (anticipating on the sequel)

13 / 48
In information theory, the Asymptotic Equipartition Property is the analog of the law of large numbers.
• The (weak) law of large numbers states that for i.i.d. real random variables, (1/n) Σ_{i=1}^n Xi is close
to its expected value for large values of n.
• The AEP states that (1/n) log (1/P(X1, . . . , Xn)) is close to the entropy H(X).

Thus the probability of observing a particular sequence X1, . . . , Xn must be close to 2^{−nH}.


This has dramatic consequences on the structure of so-called typical messages of a source.

14 / 48
Reminder : the (weak) law of large numbers (wLLN)

If Xi, ∀i = 1, . . . , n are independent with mean µi and finite variance σi², then:

If (1/n) Σ_{i=1}^n µi → µ and (1/n²) Σ_{i=1}^n σi² → 0, then X̄n = (1/n) Σ_{i=1}^n Xi is such that

X̄n →_P µ.

NB. The sequence (Xn) of r.v. is said to converge in probability towards a r.v. a
(notation (Xn) →_P a), if ∀ε and η (arbitrarily small), ∃n0 such that n > n0 implies

P(|Xn − a| > ε) < η.

Particular case: (which we will use below)

The r.v. Xi are i.i.d. with mean µ and variance σ². Then (1/n) Σ_{i=1}^n µi = µ and (1/n²) Σ_{i=1}^n σi² = σ²/n.

15 / 48
The AEP

If the X1 , X2 , . . . are i.i.d. according to P (X ), then


−(1/n) log P(X1, X2, . . . , Xn) →_P H(X),   (1)

Proof : (almost) trivial application of the law of large numbers


Indeed. First notice that
−(1/n) log P(X1, X2, . . . , Xn) = −(1/n) Σ_{i=1}^n log P(Xi),   (2)

⇒ a sum of i.i.d. random variables. The wLLN can be applied and

−(1/n) Σ_{i=1}^n log P(Xi) →_P −E{log P(X)} = H(X),

which proves the AEP. 

16 / 48
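A small simulation (an added illustration) makes the AEP tangible: draw i.i.d. symbols from the 3-symbol source used earlier and watch the per-symbol sample entropy −(1/n) log P(X1, . . . , Xn) approach H(X) = 1.5 as n grows:

```python
import random
from math import log2

random.seed(0)
P = {"a": 0.5, "b": 0.25, "c": 0.25}
H = -sum(p * log2(p) for p in P.values())   # 1.5 Shannon/symbol

def sample(n):
    return random.choices(list(P), weights=list(P.values()), k=n)

for n in (10, 100, 1000, 10000):
    msg = sample(n)
    # per-symbol sample entropy: -(1/n) log2 P(X1,...,Xn) = -(1/n) sum_i log2 P(Xi)
    sample_entropy = -sum(log2(P[s]) for s in msg) / n
    print(n, round(sample_entropy, 3), "-> H =", H)
```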
Interpretation:
Let X1, . . . , X|X| be the |X| possible values of each Xi. For any message of length n
we can decompose the right-hand side of (2) as follows:

−(1/n) Σ_{i=1}^n log P(Xi) = −(1/n) Σ_{j=1}^{|X|} nj log P(Xj) = − Σ_{j=1}^{|X|} (nj/n) log P(Xj),   (3)

where nj denotes the number of occurrences of the j-th value of X in the message.
The meaning of the AEP is that almost surely the source message will be such that the
right-hand side converges towards

H(X) = − Σ_{j=1}^{|X|} P(Xj) log P(Xj),

which is a simple consequence of the well-known statistical fact that nj/n →_P P(Xj)
(relative frequencies provide “good” estimates of probabilities).

17 / 48
The quantity − log P (X1 , X2 , . . . , Xn ) is the self-information provided by a message of length n of our
source. This is a random variable because the message itself is random.
However, the AEP states that
−(1/n) log P(X1, X2, . . . , Xn),
which we can call the “per symbol sample entropy”, will converge towards the source entropy (in
probability).
Thus, we can divide the set of all possible messages of length n into two subsets : those for which the
sample entropy is close (up to ε) to the source entropy (we will call them typical messages) and the
other messages (the atypical ones).
Hence, long messages will almost surely be typical: if we need to prove a property we only need to
focus on typical messages.
The second important consequence is that we can derive an upper bound on the number of typical
messages. Because the probability of these messages is lower bounded by 2^{−n(H+ε)}, their number is
upper bounded by 2^{n(H+ε)}. This number is often much smaller than the number of possible messages
(equal to 2^{n log |X|}).
In turn this latter fact has a direct consequence in terms of data compression rates which are achievable
with long sequences of source messages.
The AEP was first stated by Shannon in his original 1948 paper, where he proved the result for i.i.d.
processes and stated the result for stationary ergodic processes. McMillan (1953), Breiman (1957),
Chung (1961), Barron (1985), Algoet and Cover (1988) provided proofs of generalized versions of the
theorem, including continuous ergodic processes.

18 / 48
Typical sets A_ε^{(n)} (of ε-typical messages of length n)

X^n = X × · · · × X ≡ set of all possible messages of length n.

A_ε^{(n)} w.r.t. P(X^n) is defined as the set of messages (X1, X2, . . . , Xn) of X^n which satisfy:

2^{−n(H(X)+ε)} ≤ P(X1, X2, . . . , Xn) ≤ 2^{−n(H(X)−ε)}.   (4)

It is thus the subset of messages whose probability of occurrence is constrained to be close
to these two limits. The following properties hold for this set:
1. (X1, X2, . . . , Xn) ∈ A_ε^{(n)} ⇒ H(X) − ε ≤ −(1/n) log P(X1, X2, . . . , Xn) ≤ H(X) + ε, ∀n.
2. P(A_ε^{(n)}) > 1 − ε, for sufficiently large n.
3. |A_ε^{(n)}| ≤ 2^{n(H(X)+ε)}, ∀n. (|A| denotes the number of elements of the set A)
4. |A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X)−ε)}, for sufficiently large n.

19 / 48
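These properties can be checked by brute force for small n. The sketch below (an added illustration; the values of n and ε are arbitrary) enumerates all messages of length n from the earlier 3-symbol source and collects those satisfying condition (4):

```python
from itertools import product
from math import log2

P = {"a": 0.5, "b": 0.25, "c": 0.25}
H = -sum(p * log2(p) for p in P.values())   # 1.5
eps, n = 0.3, 8

typical, prob_typical = [], 0.0
for msg in product(P, repeat=n):            # all |X|^n messages of length n
    p_msg = 1.0
    for s in msg:
        p_msg *= P[s]
    if 2 ** (-n * (H + eps)) <= p_msg <= 2 ** (-n * (H - eps)):
        typical.append(msg)
        prob_typical += p_msg

print(len(typical), "typical messages out of", 3 ** n)
print("upper bound 2^{n(H+eps)} =", 2 ** (n * (H + eps)))
# even with this small n, P(A) is already ~0.93 > 1 - eps = 0.7
print("P(A_eps^(n)) =", round(prob_typical, 3))
```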
Comments
Typical set A_ε^{(n)}:
1. In probability close to 1.
2. In number close to 2^{nH}.
Set of all possible messages X^n:
1. In probability = 1.
2. In number = |X|^n = 2^{n log |X|}

If P (X ) uniform :
Typical messages = all messages. Why ?

Otherwise, the relative size of the two sets ↓ exponentially, when n ↑.

20 / 48
21 / 48
Data compression

We can exploit the AEP and typical sets to encode source messages efficiently.
Let us fix a value of ε and then n sufficiently large, so that P(A_ε^{(n)}) > 1 − ε.
Let us “construct” A_ε^{(n)}: it contains at most 2^{n(H(X)+ε)} different messages.
Let us also construct its complement ¬A_ε^{(n)}: this set is included in X^n. Its number of
messages is therefore upper bounded by |X|^n.
Binary code for a set containing M messages
It is possible to do it with at most 1 + log M bits (actually ⌈log M⌉).
(extra bit because log M is not necessarily an integer)
⇒ for A_ε^{(n)}: 1 + log(2^{n(H(X)+ε)}) = n(H(X) + ε) + 1 bits
⇒ for ¬A_ε^{(n)}: 1 + n log |X| bits.

22 / 48
To distinguish the two kinds of messages we need to add an extra bit:
0 if the message is in A_ε^{(n)}, 1 if in ¬A_ε^{(n)}: uniquely decodable code.
Let us compute the expected message length (average number of bits):

E{ℓ(X^n)} = Σ_{X^n} P(X^n) ℓ(X^n)
          = Σ_{X^n ∈ A_ε^{(n)}} P(X^n) ℓ(X^n) + Σ_{X^n ∉ A_ε^{(n)}} P(X^n) ℓ(X^n)
          ≤ Σ_{X^n ∈ A_ε^{(n)}} P(X^n) (n(H(X) + ε) + 2) + Σ_{X^n ∉ A_ε^{(n)}} P(X^n) (n log |X| + 2)
          = P(A_ε^{(n)}) (n(H(X) + ε) + 2) + (1 − P(A_ε^{(n)})) (n log |X| + 2)
          ≤ n(H(X) + ε) + εn log |X| + 2 (P(A_ε^{(n)}) + (1 − P(A_ε^{(n)})))
          = n(H + ε′),

where ε′ = ε + ε log |X| + 2/n.

23 / 48
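Continuing the brute-force illustration (an added sketch, same toy source and arbitrary n, ε as before), one can measure the expected length of this two-part code and compare it with H(X); for such a small n the overhead ε′ is still clearly visible:

```python
from itertools import product
from math import log2, ceil

P = {"a": 0.5, "b": 0.25, "c": 0.25}
H = -sum(p * log2(p) for p in P.values())
eps, n = 0.3, 8

msgs = list(product(P, repeat=n))

def prob(m):
    out = 1.0
    for s in m:
        out *= P[s]
    return out

A = [m for m in msgs if 2**(-n*(H+eps)) <= prob(m) <= 2**(-n*(H-eps))]
typical = set(A)

len_typ = 1 + ceil(log2(len(A)))           # flag bit + index within A
len_atyp = 1 + ceil(n * log2(len(P)))      # flag bit + raw index within X^n

E_len = sum(prob(m) * (len_typ if m in typical else len_atyp) for m in msgs)
print(E_len / n, "bits/symbol vs H =", H)  # approaches H only for large n
```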
Conclusion

If we take n sufficiently large, it is possible to make ε′ arbitrarily small.

We have thus shown that it is possible, by encoding long source messages (possibly very
long messages), to build a simple code of variable word length, which requires on
average (close to) H(X) bits per source symbol.

Questions : (i) generalization; (ii) can we do better? (iii) practical feasibility


24 / 48
2. Entropy rates of a stochastic process

General model for discrete sources


Indexed sequence of r.vars. Xi , i = 1, 2 . . ., with identical (and finite) sets of possible
values Xi = X (i represents in general time, or position in a string.)
Let’s say that X = {1, . . . , q} : finite q-ary alphabet.
The process is characterized by the joint distributions : P (X1 , . . . , Xn ) defined on X n ,
∀n = 1, 2, . . ..
Values : P (X1 = i1 , . . . , Xn = in ), ∀n, ∀ij ∈ {1, . . . , q}.
These distributions allow us to compute all other joint and conditional distributions of
any finite combination of r.v. of the process.
For example P (Xn , . . . , Xn+k ), ∀n > 0, k ≥ 0 can be computed by marginalization.

NB: w.r.t. Bayesian networks we extend here to an infinite number of variables...

25 / 48
Stationarity of a stochastic process

A stochastic process is said to be stationary if the joint distribution of any subset of the
sequence of random variables is invariant with respect to shifts in the time index, i.e.,
∀n, ℓ > 0, ∀i1, . . . , in ∈ X:

P(X1 = i1, . . . , Xn = in) = P(X_{1+ℓ} = i1, . . . , X_{n+ℓ} = in)

Example : memoryless stationary source. . .


Ergodicity of a stationary stochastic process
Informally :
A stationary process is said to be ergodic if temporal statistics along a trajectory
converge to the ensemble probabilities.
Example : memoryless stationary source. . .

26 / 48
Markov chains

A discrete stochastic process is said to be a (first-order) Markov chain or Markov


process, if ∀n = 1, 2, . . . and ∀in+1, in, . . . , i1 ∈ X

P (Xn+1 = in+1 |Xn = in , . . . , X1 = i1 ) = P (Xn+1 = in+1 |Xn = in ), (5)

In this case the joint probability distribution may be written as

P(X1, X2, . . . , Xn) = P(X1) P(X2|X1) · · · P(Xn|Xn−1).   (6)

States : values {1, . . . , q}


Time invariance :
The Markov chain is said to be time invariant if ∀n = 1, 2, . . . ,

P (Xn+1 = a|Xn = b) = P (X2 = a|X1 = b), ∀a, b ∈ X . (7)

(≠ stationarity)

27 / 48
Time invariant Markov chains
Characterized by: π and Π : πi = P (X1 = i) and Πi,j = P (Xk+1 = j|Xk = i).
Irreducibility : all states “communicate” (finite number of steps, two directions)
Periodic states : possible return times are multiple of an integer > 1.
Recurrent states : we always come back again to such a state
Transient states : with prob. =1, we transit only a finite number of times through it
Stationary distribution: (p1, . . . , pq) which satisfies pj = Σ_{i=1}^q Πij pi
Steady-state behavior: if lim_{n→∞} P(Xn = i) = pi exists for every i = 1, . . . , q.
If this limit exists, it satisfies pj = Σ_{i=1}^q Πij pi

If the chain is initialized with a stationary distribution, it forms a stationary process.

28 / 48
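As an added sketch (not from the slides), the stationary distribution of a small time-invariant chain can be approximated by iterating p ← pΠ, which also illustrates the steady-state behaviour of an irreducible aperiodic chain; the two-state fax chain from earlier serves as the example:

```python
# Transition matrix Pi[i][j] = P(X_{k+1}=j | X_k=i) for the two-state fax chain
Pi = [[0.93, 0.07],
      [0.63, 0.37]]

def stationary(Pi, iters=200):
    """Approximate the stationary distribution by power iteration p <- p Pi."""
    q = len(Pi)
    p = [1.0 / q] * q                        # start from the uniform distribution
    for _ in range(iters):
        p = [sum(p[i] * Pi[i][j] for i in range(q)) for j in range(q)]
    return p

p = stationary(Pi)
print(p)                                     # ~[0.9, 0.1]

# Check the defining equation p_j = sum_i Pi[i][j] * p_i
print(all(abs(p[j] - sum(p[i] * Pi[i][j] for i in range(2))) < 1e-9 for j in range(2)))
```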
Illustration : communication classes, irreducibility. . .

[Figure: two 6-state transition diagrams — a. a reducible chain, b. an irreducible chain — with the transition probabilities labelled on the edges.]

29 / 48
Illustration : periodicity vs aperiodicity

[Figure: transition diagrams of three small chains with states of period 1, 1 and 2, illustrating aperiodic vs periodic behaviour.]

Existence and uniqueness of the stationary distribution:

Guaranteed for all irreducible and aperiodic chains.

Ergodicity:
Guaranteed for any irreducible and aperiodic chain, if initialized with the stationary distribution.

30 / 48
Entropy rates of stochastic processes

Entropy of first n states : H(X1 , . . . , Xn ), may grow indefinitely, oscillate, have any
kind of behavior with respect to n.
Definition: (“Average entropy per symbol”)
The entropy rate of a process (denoted by H(S)) is defined by

H(S) = lim_{n→∞} H(X1, . . . , Xn) / n,
whenever this limit exists.
Examples :
Monkey on the typewriter.
Stationary memoryless source.
Sequence of independent symbols, but not identically distributed
Alternative plausible definition: H′(S) = lim_{n→∞} H(Xn|Xn−1, . . . , X1).
⇒ new information/symbol
(NB: if memoryless stationary source: H′(S) = H(S))

31 / 48
Theorem: relation between H(S) and H′(S)

If H′(S) exists then H(S) exists and H(S) = H′(S)

Proof :
We use the “Cesàro mean” theorem, which states:
If an → a and bn = (1/n) Σ_{i=1}^n ai, then bn → a.

Let an = H(Xn |Xn−1 , . . . , X1 ).


We thus have an → H′(S), which implies that

(1/n) Σ_{i=1}^n H(Xi|Xi−1, . . . , X1) → H′(S)

But, ∀n: H(X1, . . . , Xn) = Σ_{i=1}^n H(Xi|Xi−1, . . . , X1)
(entropy chain rule).

32 / 48
Theorem : entropy rate of stationary processes

For any stationary process, H′(S) and hence H(S) exist.

Indeed, in general we have

H(Xn+1 |Xn , . . . , X1 ) ≤ H(Xn+1 |Xn , . . . , X2 ).

Stationarity implies also

H(Xn+1 |Xn , . . . , X2 ) = H(Xn |Xn−1 , . . . , X1 ).

Conclusion : for a stationary process : H(Xn+1 |Xn , . . . , X1 ) decreases in n.


Since this sequence is also lower bounded (entropies can not be negative) it must
converge.
Thus H′(S) exists and therefore also H(S) (because of the preceding theorem).

33 / 48
Bounds and monotonicity

1. ∀n : H(S) ≤ H(Xn |Xn−1 , . . . , X1 ). (follows immediately from what precedes)


2. H(S) ≤ n^{−1} H(Xn, Xn−1, . . . , X1). (from 1 and the chain rule)
3. H(Xn, Xn−1, . . . , X1)/n is a decreasing sequence.
Indeed: H(Xn, Xn−1, . . . , X1) = Σ_{i=1}^n H(Xi|Xi−1, . . . , X1), and H(Xi|Xi−1, . . . , X1)
decreases, thus the average must decrease.
AEP for general stochastic processes
For a stationary ergodic process,
−(1/n) log P(X1, . . . , Xn) −→ H(S)
with probability 1.

34 / 48
Entropy rate of a Markov chain.

For a stationary Markov chain, the entropy rate is given by H(X2 |X1 ).
Indeed :

H(S) = H′(S) = lim_{n→∞} H(Xn|Xn−1, . . . , X1)
             = lim_{n→∞} H(Xn|Xn−1) = H(X2|X1).   (8)

NB: to compute H(X2|X1) we use a stationary distribution for P(X1) and Π.
Question:
What if the Markov chain is not initialized with the stationary distribution?

35 / 48
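A short added sketch computes the entropy rate H(X2|X1) of a stationary time-invariant chain from Π and its stationary distribution; applied to the fax chain it gives roughly 0.42 Shannon/symbol, consistent with the earlier order-1 computation:

```python
from math import log2

def entropy_rate(Pi, p):
    """H(X2|X1) = sum_i p_i * sum_j Pi[i][j] * log2(1/Pi[i][j])."""
    return sum(p[i] * sum(-Pi[i][j] * log2(Pi[i][j])
                          for j in range(len(Pi)) if Pi[i][j] > 0)
               for i in range(len(Pi)))

Pi = [[0.93, 0.07],
      [0.63, 0.37]]
p = [0.9, 0.1]                  # stationary distribution of this chain

print(entropy_rate(Pi, p))      # ~0.42 Shannon/symbol
```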
3. Data compression theory

Problem statement (source coding)


Given :
A source S using Q-ary alphabet (e.g. S = {a, b, c, . . . , z})
Model of S as a stochastic process (probabilities)
A q-ary code alphabet (e.g. {0, 1})
Objectives :
Build a code : a mapping of source messages into sequences of code symbols.
Minimise encoded message length.
Be able to reconstruct the source message from the coded sequence.
NB : = compact representation of stochastic process trajectories.

36 / 48
Let us particularize.
A code : rule which associates to each source symbol si of S a word mi , composed of a
certain number (say ni ) code symbols.
Notation: si −→ mi (under code C)
Extension of the source:
Nothing prevents us from coding blocks of n successive source symbols.
We merely apply the preceding ideas to the extended source S^n.
Code for S^n: si1 · · · sin −→ m_{i1,...,in}
Extension of a code:
Let C be a code for S.
Order-n extension of C: si1 · · · sin −→ mi1 · · · min (the code C^n)
⇒ a special kind of code for S n .

37 / 48
Properties (desired) of codes

Focus on decodability.
Regular codes. mi ≠ mj (or non-singularity)
Uniquely decodable codes (≡ non ambiguous)
E.g. constant word length. Code-words with separators. Prefix-less code.
Instantaneous. possibility to decode “on-line”.
Relations :
Uniquely decodable codes ⇔ all extensions are regular.
Prefix free ⇒ uniquely decodable (but ⇍)
Prefix free ⇔ instantaneous.
Proofs : almost trivial...
Example of uniquely decodable code, but not prefix-free ?

38 / 48
39 / 48
Definition : A code is prefix-free (or prefix code, or instantaneous code) if no codeword
is a prefix of any other codeword.
Example
si | Singular | Non-singular, but not uniquely decodable | Uniquely decodable, but not instantaneous | Instantaneous
s1 | 0 | 0   | 10  | 0
s2 | 0 | 010 | 00  | 10
s3 | 0 | 01  | 11  | 110
s4 | 0 | 10  | 110 | 111

40 / 48
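A small helper (an added illustration) checks the prefix condition for the four codes of the table; only the last column passes:

```python
def is_prefix_free(words):
    """True if no codeword is a prefix of another (=> instantaneous code)."""
    words = list(words)
    for i, w in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(w):
                return False
    return True

codes = {
    "singular":                ["0", "0", "0", "0"],
    "non-singular, not UD":    ["0", "010", "01", "10"],
    "UD, not instantaneous":   ["10", "00", "11", "110"],
    "instantaneous":           ["0", "10", "110", "111"],
}
for name, words in codes.items():
    print(name, "-> prefix-free:", is_prefix_free(words))
```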
Kraft inequality : the FUNDAMENTAL result for codes

What can we say about the existence of a code, given word lengths ni and alphabet
sizes Q and q ?
McMillan condition. If n1, . . . , nQ denote candidate word lengths, then

Σ_{i=1}^Q q^{−ni} ≤ 1   (9)

⇔ there exists a uniquely decodable code having such word lengths.


Instantaneous codes. The Kraft inequality is also a necessary and sufficient condition
of existence of an instantaneous code.
Conclusion.
If there is a uniquely decodable code using some word lengths, there is also an
instantaneous code using the same word lengths.

41 / 48
The necessary condition means that :
• If a code is uniquely decodable (and hence a fortiori if it is instantaneous), the word lengths must
satisfy (9).
• Means also that if the condition is not satisfied by a given code, it can not be uniquely decodable
(and certainly not instantaneous).
The sufficient condition means (more subtly) that : if we specify a set of word lengths which satisfy
(9), then it is possible to define a uniquely decodable (and even instantaneous) code using these lengths.
Usefulness of the result ?
Tells us that the word lengths ni are sufficient to decide about the existence of a decodable code. We can now
focus on instantaneous codes without losing anything in terms of data compression possibilities.
Proofs
We will not provide the proofs of these results. See e.g. Cover and Thomas for the proofs.
Remarks
(Grouping the codewords by length, with rk denoting the number of codewords of length k and s the maximal length, the Kraft sum can be written Σ_{k=1}^s rk q^{−k}.)

If Σ_{k=1}^s rk q^{−k} < 1 it is possible to add one more word (or several) to the code
(choose n_{i+1} sufficiently large).

If Σ_{k=1}^s rk q^{−k} = 1 it is impossible to complete the code in such a way.
One says that the code is complete.

42 / 48
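The Kraft/McMillan condition is straightforward to check numerically. The added sketch below computes Σ q^{−ni} for a few candidate sets of word lengths:

```python
def kraft_sum(lengths, q=2):
    """Kraft sum: sum over codewords of q^(-n_i)."""
    return sum(q ** (-n) for n in lengths)

print(kraft_sum([1, 2, 2]))        # a->0, b->10, c->11 : 1.0  -> complete code
print(kraft_sum([1, 2, 2, 3]))     # 1.125 > 1 -> no uniquely decodable code exists
print(kraft_sum([2, 2, 2]))        # 0.75 < 1 -> room to add more codewords
```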
First Shannon theorem

Rem.: average length/symbol of messages of a memoryless stationary S: n̄ = Σ_{i=1}^Q P(si) ni
= expected length of the codeword of the first symbol of S.
A. Memoryless stationary source, coded symbol per symbol:
1. Lower bound on average length: n̄ = Σ_{i=1}^Q P(si) ni ≥ H(S)/log q
2. It is possible to reach n̄ < H(S)/log q + 1.
B. Order-k extension of this source
S^k = memoryless source of entropy kH(S):
Optimal code must satisfy: kH(S)/log q ≤ n̄k < kH(S)/log q + 1
Per symbol of the original source S: divide by k and let k → ∞.

43 / 48
C. General stationary source
For any stationary source, there exists a uniquely decodable code such that the average
length n̄ per symbol is arbitrarily close to its lower bound H(S)/log q.
In fact : take a stationary source, and messages of length s of this source and an
optimal code for these;
Then : we have
Hs(S)/log q ≤ n̄s < Hs(S)/log q + 1,
where Hs (S) = H(S1 , . . . , Ss ) denotes the joint entropy of the first s source symbols.
This implies

lim_{s→∞} n̄s/s = H(S)/log q.

44 / 48
Proof

1. Lower bound: H(S)/log q

Take a uniquely decodable code ⇒ satisfies Kraft inequality.


Thus 0 < Q = Σ_i q^{−ni} ≤ 1.
Let qi = q^{−ni}/Q : a kind of probability distribution.
Gibbs ⇒ Σ_i P(si) log (qi/P(si)) ≤ 0
⇔ − Σ_i P(si) log P(si) ≤ − Σ_i P(si) log (q^{−ni}/Q)
But: − Σ_i P(si) log (q^{−ni}/Q) = log q · Σ_i P(si) ni + log Q
Since Q ≤ 1, log Q ≤ 0.
What about equality ?

45 / 48
2. Is it possible to reach n̄ < H(S)/log q + 1?
We only need to find one code for which it works. Ideas?
Shannon code: word lengths ni = ⌈− log_q P(si)⌉
⇒ − log P(si)/log q ≤ ni = ⌈− log_q P(si)⌉ < − log P(si)/log q + 1
Average length: multiply by P(si) and sum over i:
H(S)/log q ≤ n̄ < H(S)/log q + 1.
But can we prove existence of such a code?
Yes, since the ni satisfy the Kraft inequality.
Indeed: − log P(si)/log q ≤ ni ⇔ q^{−ni} ≤ P(si)
Sum over i: Σ_{i=1}^Q q^{−ni} ≤ Σ_{i=1}^Q P(si) = 1

46 / 48
Further reading

• D. MacKay, Information theory, inference, and learning algorithms


• Chapters 4, 5

47 / 48
Frequently asked questions

• Define the notion of Markov chain over three discrete random variables. State,
prove, discuss and illustrate the data processing inequality. Give an example where
the data processing inequality can not be applied and where it is also not satisfied.
State and discuss the corollaries of the data processing inequality.
• Define the notion of convergence in probability, formulate the AEP, and explain the
principle of its proof. Define the set of ε-typical messages of length n (A_ε^{(n)}), and
state exactly the 4 main properties of this set. Explain the use of this notion in the
context of data compression.
• Define the notion of (discrete) stationary data source. State the 2 definitions of the
entropy rate of a discrete stationary source (H(S) and H′(S)). Show
mathematically why the existence of H′(S) implies that of H(S). Show
mathematically that H′(S) exists for a stationary source.
• State the first Shannon theorem and prove it mathematically. Give two examples of
reversible data compression methods and explain their advantages and drawbacks.

48 / 48
