Lec 2
In this chapter, we look at the “source encoder” part of the system. This
part removes redundancy from the message stream or sequence. We will
focus only on binary source coding.
2.1. The material in this chapter is based on [C & T Ch 2, 4, and 5].
2.5. It is estimated that we may only need about one bit per character in
English text.
Definition 2.6. Discrete Memoryless Sources (DMS): Let us be more
specific about the information source.
• The message that the information source produces can be represented
by a vector of characters X1 , X2 , . . . , Xn .
◦ A perpetual message source would produce a never-ending sequence
of characters X1 , X2 , . . ..
• These Xk ’s are random variables (at least from the perspective of the
decoder; otherwise, there would be no need for communication).
• For simplicity, we will assume our source to be discrete and memoryless.
◦ Assuming a discrete source means that the random variables are
all discrete; that is, they have supports which are countable.
∗ Recall that “countable” means “finite” or “countably infinite”.
∗ We will further assume that they all share the same support
and that the support is finite.
· This support is called the source alphabet.
· See Example 2.7 for some examples.
◦ Assuming a memoryless source means that there is no depen-
dency among the characters in the sequence.
∗ More specifically,
    pX1,X2,...,Xn(x1, x2, . . . , xn) = pX1(x1) × pX2(x2) × · · · × pXn(xn).    (1)
∗ We will further assume that all of the random variables share
the same probability mass function (pmf)5 . We denote this
shared pmf by pX (x).
    pX1,X2,...,Xn(x1, x2, . . . , xn) = pX(x1) × pX(x2) × · · · × pX(xn).    (2)
5
We often use the term “distribution” interchangeably with pmf and pdf; that is, instead of saying “pmf
of X”, we may say “distribution of X”.
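To make the memoryless model concrete, here is a minimal MATLAB sketch (all variable names and the pmf below are our own, not from the notes) that draws an i.i.d. message of length n from a pmf pX on the alphabet {1, 2, 3, 4}. The memoryless assumption is exactly what allows each character to be drawn independently from the same pX.

    pX = [0.5 0.25 0.125 0.125];   % an assumed pmf on the source alphabet {1,2,3,4}
    n  = 20;                       % message length
    u  = rand(1, n);               % i.i.d. Uniform(0,1) draws
    X  = zeros(1, n);
    for k = 1:n
        X(k) = find(u(k) < cumsum(pX), 1);   % inverse-cdf sampling of one character
    end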
Definition 2.8. An encoder c(·) is a function that maps each character
in the source alphabet into a corresponding (binary) codeword.
• In particular, the codeword corresponding to a source character x is
denoted by c(x) ∈ {0, 1}∗, where
    {0, 1}∗ = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, . . .}
is the set of all finite-length binary strings (ε is the empty string).
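As a minimal MATLAB sketch of this definition (the codebook c below is an assumed example, not one from the notes): the encoder is simply a lookup table from source characters to binary strings, and a source string is encoded by concatenating the codewords.

    c   = {'0', '10', '110', '111'};   % c{x} is the (binary) codeword for source character x
    msg = [1 3 2 1 4];                 % a sample source string
    encoded = [c{msg}]                 % concatenate the codewords -> '0110100111'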
Example 2.10. Suppose the message is a sequence of basic English words
which occur with the probabilities provided in the table below.
Definition 2.11. The expected length of a code c(·) for (a DMS which
is characterized by) a random variable X with probability mass function
pX (x) is given by
    E [ℓ(X)] = Σ_{x∈SX} pX(x) ℓ(x),
where ℓ(x) is the length (number of bits) of the codeword c(x).
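A quick numeric illustration (the pmf and codebook below are assumed examples, not from the notes): cellfun gives the codeword lengths, and the expected length is the pmf-weighted sum.

    pX  = [0.4 0.3 0.2 0.1];           % assumed pmf on {1,2,3,4}
    c   = {'10', '110', '0', '111'};   % assumed codewords c(x)
    ell = cellfun(@length, c);         % codeword lengths ell(x) = [2 3 1 3]
    EL  = pX * ell.'                   % E[ell(X)] = sum_x pX(x) ell(x)  ->  2.2 bits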
• Short sequences represent frequent letters (e.g., a single dot represents
E) and long sequences represent infrequent letters (e.g., Q is represented
by “dash,dash,dot,dash”).
Example 2.14. Thought experiment: Let’s consider the following code
x    p(x)    Codeword c(x)    ℓ(x)
1    4%      0                1
2    3%      1                1
3    90%     0                1
4    3%      1                1
This code is bad because we have ambiguity at the decoder. When a
codeword “0” is received, we don’t know whether to decode it as the source
symbol “1” or the source symbol “3”. If we want to have lossless source
coding, this ambiguity is not allowed.
Definition 2.15. A code is nonsingular if every source symbol in the
source alphabet has a different codeword.
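Nonsingularity is easy to check by machine: every source character must receive a distinct codeword. A one-line MATLAB check for the code of Example 2.14:

    c = {'0', '1', '0', '1'};                      % the codewords from Example 2.14
    nonsingular = numel(unique(c)) == numel(c)     % -> false (the code is singular)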
(a) use a fixed-length code (as in Example 2.10), or
(b) use a variable-length code and
(i) add a special symbol (a “comma” or a “space”) between any two
codewords, or
(ii) use uniquely decodable codes.
Definition 2.18. A code is called uniquely decodable (UD) if any en-
coded string has only one possible source string producing it.
Example 2.19. The code used in Example 2.16 is not uniquely decodable
because source string “2”, source string “34”, and source string “13” share
the same code string “010”.
2.20. It may not be easy to check unique decodability of a code. (See
Example 2.28.) Also, even when a code is uniquely decodable, one may
have to look at the entire string to determine even the first symbol in the
corresponding source string. Therefore, we focus on a subset of uniquely
decodable codes called prefix codes.
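For reference (this is not covered in these notes), unique decodability can also be tested mechanically with the Sardinas-Patterson procedure. Below is a minimal MATLAB sketch; the function name isUD and all variable names are ours, and it assumes a nonsingular codebook given as a cell array of binary strings (save as isUD.m).

    function ud = isUD(C)
    % ISUD  Sardinas-Patterson test (sketch): true if the nonsingular code C,
    % given as a cell array of codeword strings, is uniquely decodable.
    S = dangling(C, C);                 % dangling suffixes within the code itself
    seen = cell(1, 0);                  % suffixes already examined
    while ~isempty(S)
        if any(ismember(S, C))          % a dangling suffix is itself a codeword
            ud = false; return
        end
        seen = [seen, S];
        S = [dangling(C, S), dangling(S, C)];
        if ~isempty(S)
            S = S(~ismember(S, seen));  % keep only suffixes not examined before
        end
    end
    ud = true;
    end

    function W = dangling(A, B)
    % Nonempty suffixes w such that b = [a w] for some a in A and b in B.
    W = cell(1, 0);
    for i = 1:numel(A)
        for j = 1:numel(B)
            a = A{i}; b = B{j};
            if length(b) > length(a) && strncmp(a, b, length(a))
                W{end+1} = b(length(a)+1:end); %#ok<AGROW>
            end
        end
    end
    end

For instance, isUD({'1','10','100','1000'}) (the code of Example 2.27) and isUD({'10','00','11','110'}) (the code of Example 2.28) both return true, while isUD({'0','010','01','10'}), a nonsingular code that is not uniquely decodable, returns false.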
Definition 2.21. A code is called a prefix code if no codeword is a prefix7
of any other codeword.
• Equivalently, a code is called a prefix code if you can put all the
codewords into a binary tree where all of them are leaves.
• A more appropriate name would be “prefix-free” code.
• The codeword corresponding to a symbol is the string of labels on the
path from the root to the corresponding leaf.
Example 2.22.
x Codeword c(x)
1 10
2 110
3 0
4 111
7
String s1 is a prefix of string s2 if there exists a string s3 , possibly empty, such that s2 = s1 s3 .
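To see the instantaneous decoding in action, here is a minimal MATLAB sketch using the code of Example 2.22 (variable names are ours). Because no codeword is a prefix of another, at most one codeword can match the front of the remaining string, so each symbol is emitted without any look-ahead.

    c = {'10', '110', '0', '111'};    % the prefix code of Example 2.22
    s = '0101100111';                 % encoding of the source string 3 1 2 3 4
    decoded = [];
    while ~isempty(s)
        matched = false;
        for x = 1:numel(c)
            if strncmp(c{x}, s, length(c{x}))   % codeword c{x} matches the front of s
                decoded(end+1) = x;  %#ok<AGROW>
                s = s(length(c{x})+1:end);      % consume it and continue
                matched = true;
                break
            end
        end
        if ~matched, error('not a valid encoding under c'); end
    end
    decoded                           % -> 3 1 2 3 4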
Example 2.23. The code used in Example 2.12 is a prefix code.
x Codeword c(x)
1 01
2 001
3 1
4 0001
[Figure: nested classes of codes: prefix codes ⊂ UD codes ⊂ nonsingular codes ⊂ all codes]
Example 2.27.
x Codeword c(x)
1 1
2 10
3 100
4 1000
Try to decode 10010001110100111
Example 2.28. [5, p 106–107]
x Codeword c(x)
1 10
2 00
3 11
4 110
This code is not a prefix code because codeword “11” is a prefix of code-
word “110”.
This code is uniquely decodable. To see that it is uniquely decodable,
take any code string and start from the beginning.
• If the first two bits are 00 or 10, they can be decoded immediately.
• If the first two bits are 11, we must look at the following bit(s).
◦ If the next bit is a 1, the first source symbol is a 3.
◦ If the next bit is a 0, we need to count how many 0s there are
before the next 1 appears (or the string ends).
∗ If the length of this run of 0s immediately following the 11
is even, the first source symbol is a 3.
∗ If the length of this run of 0s is odd, the first codeword must
be 110 and the first source symbol must be 4.
By repeating this argument, we can see that this code is uniquely decodable.
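The first-symbol rule described above can be written directly in MATLAB (a sketch; the function name is ours; it assumes the input is a complete, valid encoding under this code; save as firstSymbol.m).

    function x1 = firstSymbol(s)
    % FIRSTSYMBOL  First source symbol of a valid encoding s under the code
    % {10, 00, 11, 110} of Example 2.28, using the rule described above.
    if strncmp(s, '10', 2), x1 = 1; return; end
    if strncmp(s, '00', 2), x1 = 2; return; end
    % s starts with 11: count the run of 0s that immediately follows it
    r = s(3:end);
    z = find(r == '1', 1) - 1;           % number of 0s before the next 1
    if isempty(z), z = length(r); end    % no further 1: the run extends to the end
    if mod(z, 2) == 0
        x1 = 3;                          % even run: 11 followed by pairs of 00
    else
        x1 = 4;                          % odd run: the first codeword must be 110
    end
    end

For instance, firstSymbol('110010') returns 3 (parse 11, 00, 10), while firstSymbol('110110') returns 4 (parse 110, 110).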
2.29. For our present purposes, a better code is one that is uniquely de-
codable and has a shorter expected length than other uniquely decodable
codes. We do not consider other issues of encoding/decoding complexity or
of the relative advantages of block codes or variable length codes. [6, p 57]
9
The class was the first ever in the area of information theory and was taught by Robert Fano at MIT
in 1951.
◦ Huffman wrote a term paper in lieu of taking a final examination.
◦ It should be noted that in the late 1940s, Fano himself (and independently, also Claude Shannon)
had developed a similar, but suboptimal, algorithm known today as the Shannon-Fano method. The
difference between the two algorithms is that the Shannon-Fano code tree is built from the top down,
while the Huffman code tree is constructed from the bottom up.
• By construction, a Huffman code is a prefix code.
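The bottom-up construction can also be checked in MATLAB. Below is a minimal sketch (the function name huffmanCode and all variable names are ours; save as huffmanCode.m); ties may be broken differently from an in-class construction, but any resulting code is optimal (see 2.35 and 2.37). It can be used to verify the tables in Examples 2.31-2.34.

    function code = huffmanCode(p)
    % HUFFMANCODE  Minimal sketch of binary Huffman coding for the pmf p.
    % code{i} is the codeword assigned to source symbol i.
    n      = numel(p);
    code   = repmat({''}, 1, n);     % codeword of each symbol, grown by prepending bits
    prob   = p(:).';                 % probabilities of the remaining tree nodes
    groups = num2cell(1:n);          % the source symbols contained in each node
    while numel(prob) > 1
        [prob, order] = sort(prob);  % bring the two least probable nodes to the front
        groups = groups(order);
        for k = groups{1}, code{k} = ['0' code{k}]; end   % label one branch 0 ...
        for k = groups{2}, code{k} = ['1' code{k}]; end   % ... and the other branch 1
        prob   = [prob(1) + prob(2), prob(3:end)];        % merge the two nodes
        groups = [{[groups{1}, groups{2}]}, groups(3:end)];
    end
    end

For the pmf of Example 2.31, huffmanCode([0.5 0.25 0.125 0.125]) returns codewords of lengths 1, 2, 3, 3 (the particular 0/1 labels depend on tie-breaking), so E [ℓ(X)] = 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits per symbol.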
Example 2.31.
x    pX(x)    Codeword c(x)    ℓ(x)
A 0.5
B 0.25
C 0.125
D 0.125
E [ℓ(X)] =

Note that for this particular example, the values of 2^ℓ(x) from the Huffman
encoding are inversely proportional to pX (x):
    pX(x) = 1 / 2^ℓ(x) .
In other words,
    ℓ(x) = log2 (1 / pX(x)) = − log2 (pX (x)).
Therefore,
    E [ℓ(X)] = Σ_x pX(x) ℓ(x) =
Example 2.32.
x      pX(x)    Codeword c(x)    ℓ(x)
‘a’    0.4
‘b’    0.3
‘c’    0.1
‘d’    0.1
‘e’    0.06
‘f’    0.04

E [ℓ(X)] =
Example 2.33.
x    pX(x)    Codeword c(x)    ℓ(x)
1    0.25
2    0.25
3    0.2
4    0.15
5    0.15

E [ℓ(X)] =
Example 2.34.
x    pX(x)    Codeword c(x)    ℓ(x)
     1/3
     1/3
     1/4
     1/12

E [ℓ(X)] =

E [ℓ(X)] =
2.35. The set of codeword lengths for Huffman encoding is not unique.
There may be more than one set of lengths but all of them will give the
same value of expected length.
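For instance, two different Huffman trees for the pmf of Example 2.34 (p = [1/3 1/3 1/4 1/12]) give the codeword-length sets {1, 2, 3, 3} and {2, 2, 2, 2}; a quick MATLAB check that both give the same expected length:

    p = [1/3 1/3 1/4 1/12];
    p * [1 2 3 3].'      % -> 2 (bits per symbol)
    p * [2 2 2 2].'      % -> 2 (bits per symbol)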
Definition 2.36. A code is optimal for a given source (with known pmf) if
it is uniquely decodable and its corresponding expected length is the shortest
among all possible uniquely decodable codes for that source.
2.37. The Huffman code is optimal.
2.3 Source Extension (Extension Coding)
2.38. One can usually (not always) do better in terms of expected length
(per source symbol) by encoding blocks of several source symbols.
Definition 2.39. In n-th extension coding, n successive source symbols
are grouped into blocks and the encoder operates on the blocks rather
than on individual symbols. [4, p. 777]
Example 2.40.
x        pX(x)    Codeword c(x)    ℓ(x)
Y(es)    0.9
N(o)     0.1

E [ℓ(X)] =

YNNYYYNYYNNN...

E [ℓ(X1 , X2 )] =

E [ℓ(X1 , X2 , X3 )] =
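A MATLAB sketch that carries out this thought experiment numerically, reusing the huffmanCode sketch from earlier: kron builds the joint pmf of a block of n independent characters, the block is Huffman-coded, and the expected length is normalized by n.

    pX = [0.9 0.1];                     % the pmf of Example 2.40
    for n = 1:3
        pBlock = 1;
        for k = 1:n, pBlock = kron(pBlock, pX); end    % joint pmf of an i.i.d. block
        L = cellfun(@length, huffmanCode(pBlock));     % Huffman codeword lengths for the blocks
        fprintf('n = %d: %.4f bits per source symbol\n', n, (pBlock * L.') / n);
    end
    % prints 1.0000, 0.6450, and 0.5327 bits per source symbol,
    % already approaching the entropy H(X) of about 0.469 bits (see Section 2.4).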
2.4 (Shannon) Entropy for Discrete Random Variables
Entropy is a measure of uncertainty of a random variable [5, p 13]. For a
discrete random variable X with support SX and pmf pX (x), it is defined by
    H (X) = − Σ_{x∈SX} pX(x) log2 pX(x).
• The log is to the base 2 and entropy is expressed in bits (per symbol).
◦ The base of the logarithm used in defining H can be chosen to be
any convenient real number b > 1, but if b ≠ 2 the unit will not be
in bits.
◦ If the base of the logarithm is e, the entropy is measured in nats.
◦ Unless otherwise specified, base 2 is our default base.
• Based on continuity arguments, we shall assume that 0 ln 0 = 0.
Example 2.42. The entropy of the random variable X in Example 2.31 is
1.75 bits (per symbol).
Example 2.43. The entropy of a fair coin toss is 1 bit (per toss).
HX = -pX*(log2(pX))’.
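The expression above becomes runnable once the pmf is defined; a minimal sketch for the pmf of Example 2.31, with the 0 log 0 = 0 convention handled explicitly in case some probabilities are zero:

    pX = [0.5 0.25 0.125 0.125];   % the pmf of Example 2.31
    HX = -pX*(log2(pX))'           % -> 1.75 (bits per symbol), as in Example 2.42
    % If some pX(x) = 0, apply the convention 0 log 0 = 0 explicitly:
    t = pX .* log2(pX);  t(pX == 0) = 0;
    HX = -sum(t)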
Definition 2.47. Binary Entropy Function : We define hb (p), h (p) or
H(p) to be −p log2 p − (1 − p) log2 (1 − p), whose plot is shown in Figure 3.
[Figure 3: Binary Entropy Function]
• Logarithmic bounds: (ln p)(ln q) ≤ H(p)/ log2 e ≤ (ln p)(ln q)/ ln 2, where q = 1 − p.
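A short MATLAB sketch that reproduces the plot of Figure 3 (variable names are ours):

    p  = linspace(1e-6, 1 - 1e-6, 1000);            % avoid the endpoints, where 0 log 0 = 0
    hb = -p .* log2(p) - (1 - p) .* log2(1 - p);    % binary entropy function
    plot(p, hb), grid on
    xlabel('p'), ylabel('H(p)')                     % maximum value 1 bit at p = 1/2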
2.48. Two important facts about entropy:
(a) H (X) ≤ log2 |SX | with equality if and only if X is a uniform random
variable.
(b) H (X) ≥ 0 with equality if and only if X is not random.
In summary,
    0 ≤ H (X) ≤ log2 |SX | ,
where the left equality holds for a deterministic X and the right equality holds for a uniform X.
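A short numeric check of these two facts for the pmf of Example 2.31:

    pX = [0.5 0.25 0.125 0.125];
    HX = -sum(pX .* log2(pX));                    % 1.75 bits
    [0 <= HX, HX <= log2(numel(pX))]              % -> [1 1] (both facts hold; log2(4) = 2)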
Theorem 2.49. The expected length E [ℓ(X)] of any uniquely decodable
binary code for a random variable X is greater than or equal to the entropy
H (X); that is,
    E [ℓ(X)] ≥ H (X).
2.51. Given a random variable X, let cHuffman be the Huffman code for this
X. Then, from the optimality of the Huffman code mentioned in 2.37,
    L∗ (X) = L(cHuffman , X),
where L(c, X) denotes the expected length of a code c for X and L∗ (X)
denotes the minimum expected length among all uniquely decodable codes for X.
Theorem 2.52. The optimal code for a random variable X has an expected
length less than H(X) + 1:
L∗ (X) < H(X) + 1.
2.53. Combining Theorem 2.49 and Theorem 2.52, we have
H(X) ≤ L∗ (X) < H(X) + 1. (3)
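A quick numeric check of (3), reusing the huffmanCode sketch from earlier on an assumed pmf (not one from the examples):

    pX = [0.4 0.3 0.2 0.1];                           % assumed pmf
    HX = -sum(pX .* log2(pX));                        % entropy, about 1.8464 bits
    L  = pX * cellfun(@length, huffmanCode(pX)).';    % optimal expected length, 1.9 bits
    [HX <= L, L < HX + 1]                             % -> [1 1]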
Definition 2.54. Let L∗n (X) be the minimum expected codeword length
per symbol when the random variable X is encoded with n-th extension
uniquely decodable coding. Of course, this can be achieved by using n-th
extension Huffman coding.
2.55. An extension of (3):
    H(X) ≤ L∗n (X) < H(X) + 1/n.    (4)
In particular,
    lim_{n→∞} L∗n (X) = H(X).
In other words, by using a large block length, we can achieve an expected
length per source symbol that is arbitrarily close to the value of the entropy.
2.56. Operational meaning of entropy: Entropy of a random variable is the
average length of its shortest description.
2.57. References
• Section 16.1 in Carlson and Crilly [4]
• Chapters 2 and 5 in Cover and Thomas [5]
• Chapter 4 in Fine [6]
• Chapter 14 in Johnson, Sethares, and Klein [8]
• Section 11.2 in Ziemer and Tranter [18]