
RATE DISTORTION THEORY AND DATA COMPRESSION

TOBY BERGER
School of Electrical Engineering
Cornell University, Ithaca, New York

T. Berger et al., Advances in Source Coding


© Springer-Verlag Wien 1975
PREFACE

I am grateful to CISM and to Prof. Giuseppe Longo for the opportunity


to lecture about rate distortion theory research during the 1973 CISM summer
school session. Rate distortion theory is currently the most vital area of probabilistic
information theory, affording significantly increased insight and understanding
about data compression mechanisms in both man-made and natural information
processing systems. The lectures that follow are intended to convey an appreciation
for the important results that have been obtained to date and the challenging
problems that remain to be resolved in this rapidly evolving branch of information
theory. The international composition of the CISM summer school attendees affords
a unique opportunity to attempt to stimulate subsequent contributions to the
theory from many nations. It is my hope that the lectures which follow capitalize
successfully on this promising opportunity.

T. Berger
LECTURE 1

Rate Distortion Theory: An Introduction


In this introductory lecture we present the rudiments of rate distortion
theory, the branch of information theory that treats data compression problems.
The rate distortion function is defined and a powerful iterative algorithm for
calculating it is described. Shannon's source coding theorems are stated and
heuristically discussed.
Shannon's celebrated coding theorem states that a source of entropy
rate H can be transmitted reliably over any channel of capacity C > H. This theorem
also has a converse devoted to the frequently encountered situation in which the
entropy rate of the source exceeds the capacity of the channel. Said converse states
that not all the source data can be recovered reliably when H > C [1].
Since it is always desirable to recover all the data correctly, one can
reduce H either by
(a) slowing down the rate of production of source letters, or
(b) encoding the letters more slowly than they are being produced.
However, (a) often is impossible because the source rate is not under
the communication system designer's control, and (b) often is impossible both
because the data becomes stale, or perhaps even useless, because of long coding
delays and because extensive buffering memory is needed.
If H cannot be lowered, then the only other remedy is to increase C.
This, however, is usually very expensive in practice. We shall assume in all that
follows that H already has been lowered and that C already has been raised as much
as is practically possible but the situation H >C nonetheless continues to prevail.
(This is always true in the important case of analog sources because their absolute
entropy H is infinite whereas the C of any physical channel is finite). In such a
situation it is reasonable to attempt to preserve those aspects of the source data that
are the most crucial to the application at hand. That is, the communication system
should be configured to reconstruct the source data at the channel output with
minimum possible distortion subject to the requirement that the rate of transmission
of information cannot exceed the channel capacity. Obviously, there is a tradeoff
between the rate at which information is provided about the output of a data source
and the fidelity with which said output can be reconstructed on the basis of this
information. The mathematical discipline that treats this tradeoff from the

viewpoint of information theory is known as rate distortion theory.


A graphical sketch of a typical rate-distortion tradeoff is shown in
Figure 1.

Figure 1. A Typical Rate-Distortion Tradeoff

If the rate R at which information can be provided about the source output exceeds H, then no distortion
need result. As R decreases from H towards 0, the minimum attainable distortion
steadily increases from 0 to the value D_max associated with the best guess one can
make in the total absence of any information about the source outputs. Curves that
quantify such tradeoffs aptly have been termed rate distortion functions by
Shannon [2].
The crucial property that a satisfactorily defined rate distortion
function R(D) possesses is the following:
"It is possible to compress the data rate from H down to any R > R(D)
and still be able to recover the original source outputs with an average
( *) distortion not exceeding D. Conversely, if the compressed data rate R
satisfies R < R(D), then it is not possible to recover the original source
data from the compressed version with an average distortion of D or
less".
An R(D) curve that posssesses the above property clearly functions as
an extension, or generalization, of the concept of entropy. Just as His the minimum
data rate (channel capacity) needed to transmit the source data with zero average
distortion, R(D) is the minimum data rate (channel capacity) needed to transmit the
data with average distortion D.
Omnis rate distortion theory in tres partes divisa est. (All of rate distortion theory is divided into three parts.)
(i) Definition, calculation and bounding of R(D) curves for various data sources
and distortion measures.
(ii) Proving of coding theorems which establish that said R(D) curves do indeed
specify the absolute limit on the rate vs. distortion tradeoff in the sense of
(*).
(iii) Designing and analyzing practical communication systems whose performances
closely approach the ideal limit set by R(D).


We shall begin by defining R(D) for the case of a memoryless
source and a context-free distortion measure. This is the simplest case from the
mathematical viewpoint, but unfortunately it is the least interesting case from
the viewpoint of practical applications. The memoryless nature of the source
limits the extent to which the data rate can be compressed without severe
distortion being incurred. Moreover, because of the assumption that even
isolated errors cannot be corrected from contextual considerations, the ability
to compress the data rate substantially without incurring intolerable distortion
is further curtailed. Accordingly, we subsequently shall extend the
development to situations in which the source has memory and/or the
fidelity criterion has context-dependence.
A discrete memoryless source generates a sequence {X_1, X_2, ...} of
independent identically distributed random variables (i.i.d. r.v.'s). Each X_i
assumes a value in the finite set A = {a_1, ..., a_M} called the source alphabet.
The probability that X_i = a_j will be denoted by P(a_j). This probability does
not depend on i because the source outputs are identically distributed. Let
X = (X_1, ..., X_n) denote a random vector of n successive source outputs, and
let x = (x_1, ..., x_n) ∈ A^n denote a value that X can assume. Then the probability
P_n(x) that X assumes the value x is given by

$$P_n(\mathbf{x}) = \prod_{t=1}^{n} P(x_t)$$

because the source has been assumed to be memoryless.


A communication system is to be built that will attempt to
convey the sequence of source outputs to an interested user, as depicted in
Figure 2.


Figure 2. A Communication System Linking Source to User

Let Y_1, Y_2, ... denote the sequence of letters received by the user. In
general, the Y_i assume values in an alphabet B = {b_1, ..., b_N} of letters that may differ
both in value and in cardinality from the source alphabet A, although A = B in most

applications.
In order to quantify the rate-distortion tradeoff, we must have a means
of specifying the distortion that results when X_i = a_j and Y_i = b_k. We shall assume
that there is given for this purpose a so-called distortion measure ρ: A × B → [0, ∞].
That is, ρ(a_j, b_k) is the penalty, loss, cost, or distortion that results when the source
produces a_j and the system delivers b_k to the user. Moreover, we shall assume that
the distortion ρ_n(x, y) that results when a vector x ∈ A^n of n successive source letters
is represented to the user as y ∈ B^n is of the form

$$\rho_n(\mathbf{x},\mathbf{y}) = n^{-1} \sum_{t=1}^{n} \rho(x_t, y_t).$$

A family {ρ_n, 1 ≤ n < ∞} of such vector distortion measures will be
referred to as a single-letter, or memoryless, fidelity criterion because the penalty
charged for each error the system makes does not depend on the overall context in
which the error occurs.

Example:

A = B = {0,1}
ρ(0,0) = ρ(1,1) = 0
ρ(0,1) = α,  ρ(1,0) = β

Then

$$\rho_3(010, 001) = \frac{1}{3}(0 + \alpha + \beta) = \frac{\alpha + \beta}{3}.$$
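To make the single-letter fidelity criterion concrete, here is a minimal Python sketch of ρ_n. The distortion table mirrors the example above; the numerical values chosen for α and β are purely illustrative assumptions.

```python
def rho_n(x, y, rho):
    """Per-letter average distortion rho_n(x, y) = n^{-1} * sum_t rho(x_t, y_t)."""
    assert len(x) == len(y)
    return sum(rho[(a, b)] for a, b in zip(x, y)) / len(x)

alpha, beta = 1.0, 2.0   # illustrative numerical values for the costs alpha and beta
rho = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): alpha, (1, 0): beta}

# rho_3(010, 001) = (0 + alpha + beta)/3, as in the example above
print(rho_n([0, 1, 0], [0, 0, 1], rho))   # 1.0  (= (1.0 + 2.0)/3)
```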
Each communication system linking source to user in Figure 2 may be
fully described for statistical purposes by specifying for each j ∈ {1, ..., M} and
k ∈ {1, ..., N} the probability Q_{k/j} that a typical system output Y will be equal to
b_k given that the corresponding input X equals a_j. Contracting notation from P(a_j)
to P_j and from ρ(a_j, b_k) to ρ_jk, we may associate with each system Q = (Q_{k/j})
two functions of extreme importance, namely

$$I(Q) = \sum_{j=1}^{M} \sum_{k=1}^{N} P_j Q_{k/j} \log \frac{Q_{k/j}}{Q_k},$$

where

$$Q_k = \sum_{j=1}^{M} P_j Q_{k/j},$$

and

$$d(Q) = \sum_{j,k} P_j Q_{k/j} \rho_{jk}.$$
I(Q) is the average mutual information between source and user for the
system Q, and d(Q) is the average distortion with which the source outputs are
reproduced for the user by system Q. For situations in which both the source and
the fidelity criterion are memoryless, Shannon [2] has defined the rate distortion
function R(D) as follows

$$R(D) = \min_{Q:\, d(Q) \le D} I(Q).$$

That is, we restrict our attention to those systems Q whose average distortion d(Q)
does not exceed a specified level D of tolerable average distortion, and then we
search among this set of Q's for the one with the minimum value of I(Q). We
minimize rather than maximize I(Q) because we wish to supply as little information
as possible about the source outputs provided, of course, that we preserve the data
with the required fidelity D. In physical terms, we wish to compress the rate of
transmission of information to the lowest value consistent with the requirement that
the average distortion may not exceed D.
Remark: An equivalent definition of the critical curve in the (R,D)-
plane in which R is the independent and D the dependent variable is

$$D(R) = \min_{Q:\, I(Q) \le R} d(Q).$$

This so-called "distortion rate function" concept is more intuitively
satisfying because average distortion gets minimized rather than average mutual
information. Moreover, the constraint I(Q) ≤ R has the physical significance that the
compressed data rate cannot exceed a specified value which in practice would be the
capacity C of the channel in the communication link of Figure 2. It is probably
unfortunate that Shannon chose to employ the R(D) rather than the D(R) approach.
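As an illustration of the two functionals just defined, the following Python sketch evaluates I(Q) and d(Q) for an arbitrary finite system Q. The binary test case at the end (noiseless system, error-frequency distortion) is an assumed example, not one taken from the text.

```python
import numpy as np

def info_and_distortion(P, Q, rho):
    """Evaluate I(Q) (in nats) and d(Q) for a system Q.

    P   : source probabilities P_j, shape (M,)
    Q   : transition probabilities Q_{k/j}, shape (M, N), each row summing to 1
    rho : distortion matrix rho_{jk}, shape (M, N)
    """
    Qk = P @ Q                         # output marginal Q_k = sum_j P_j Q_{k/j}
    joint = P[:, None] * Q             # P_j Q_{k/j}
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(joint > 0, np.log(Q / Qk), 0.0)
    return float((joint * logs).sum()), float((joint * rho).sum())

# Assumed test case: equiprobable binary source, noiseless system, error-frequency distortion.
P = np.array([0.5, 0.5])
Q = np.eye(2)
rho = np.array([[0.0, 1.0], [1.0, 0.0]])
print(info_and_distortion(P, Q, rho))   # (log 2 = 0.693..., 0.0)
```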
LECTURE 2.

Computation of R(D)

The definition of R(D) that we have given thus far is not imbued with
any physical significance because we have not yet proved the crucial source-coding
theorem and converse for R(D) so defined. We shall temporarily ignore this fact and
concentrate instead on procedures for calculating R(D) curves in practical examples.
Until recently, it was possible to find R(D) only in special examples involving either
small M and N or {P_j} and (ρ_jk) with special symmetry properties. A recent
advance by Blahut, however, permits rapid and accurate calculation of R(D) in the
general case [3].
The problem of computing R(D) is one in convex programming. The
mutual information functional I(Q) is convex ∪ in Q (Homework). It must be
minimized subject to the linear equality and inequality constraints

$$\sum_{k=1}^{N} Q_{k/j} = 1,$$

$$\sum_{j,k} P_j Q_{k/j} \rho_{jk} \le D,$$

and Q_{k/j} ≥ 0. Accordingly, the Kuhn-Tucker theorem can be applied to determine
necessary and sufficient conditions that characterize the optimum transition
probability assignment. Said conditions assume the following form:
Theorem 1. - Fix s ∈ [-∞, 0]. Then there is a line of slope s tangent to the R(D)
curve at the point (d(Q), I(Q)) if and only if there exists a probability vector
{Q_k, 1 ≤ k ≤ N} such that

(i) $Q_{k/j} = \lambda_j Q_k e^{s\rho_{jk}}$, where $\lambda_j = \left( \sum_k Q_k e^{s\rho_{jk}} \right)^{-1}$,

and

(ii) $c_k \triangleq \sum_j \lambda_j P_j e^{s\rho_{jk}} \begin{cases} = 1 & \text{for all } k \text{ such that } Q_k > 0 \\ \le 1 & \text{for all } k \text{ such that } Q_k = 0. \end{cases}$
Proof: Straightforward application of the Kuhn-Tucker theorem.
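The conditions of Theorem 1 are also easy to test numerically. The sketch below is a hypothetical helper, not part of the text: it computes c_k for a candidate output distribution {Q_k} at a given slope s and checks condition (ii); condition (i) then defines the corresponding Q_{k/j}.

```python
import numpy as np

def kuhn_tucker_check(P, rho, Qk, s, tol=1e-8):
    """Test conditions (i)-(ii) of Theorem 1 for a candidate output vector {Q_k}.

    Returns the vector c_k = sum_j lambda_j P_j exp(s*rho_jk) together with a flag
    that is True when c_k = 1 wherever Q_k > 0 and c_k <= 1 wherever Q_k = 0.
    """
    A = np.exp(s * rho)                      # A[j, k] = e^{s rho_jk}
    lam = 1.0 / (A @ Qk)                     # lambda_j of condition (i)
    c = (lam * P) @ A                        # c_k of condition (ii)
    ok = bool(np.all(np.where(Qk > 0, np.abs(c - 1.0) < tol, c <= 1.0 + tol)))
    return c, ok
```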


The theorem above shows that the R(D) curve can be generated
parametrically by choosing different values of s ∈ [-∞, 0] and then determining the
corresponding optimum {Q_k} vectors.


The following theorem describes Blahut's iterative algorithm which
converges to the optimum {Q_k} vector, and hence yields the optimum system of
transition probabilities Q_{k/j} associated with a particular value s of the slope
parameter.
Theorem 2. - Given s ∈ [-∞, 0], choose any probability vector Q_k^0, 1 ≤ k ≤ N, with
strictly positive components Q_k^0 > 0. Define

$$Q_k^{r+1} = c_k^r Q_k^r,$$

where

$$c_k^r = \sum_j \lambda_j^r P_j e^{s\rho_{jk}} \quad \text{and} \quad \lambda_j^r = \left( \sum_k Q_k^r e^{s\rho_{jk}} \right)^{-1}.$$

Then {Q_k^r} converges to the {Q_k} that satisfies (i) and (ii) of Theorem 1 as r → ∞.

Proof: Lemma: log x ≤ x − 1 (natural logarithm).

Proof of Lemma: Since log x is convex ∩ and both log x and
x − 1 equal zero and have slope 1 at x = 1, the
tangent line x − 1 must lie above the curve log x for all x. This completes the proof of
the lemma. (log x ≤ x − 1 is often called the fundamental inequality).

Corollary of Lemma: log y ≥ 1 − 1/y. (Proof omitted.)
To prove the theorem we consider the functional V(Q) = I(Q) − sd(Q).
The graphical significance of V(Q) is shown in Figure 3.

Figure 3. Graphical Description of V(Q)

It is the R-axis intercept of a
line of slope s through the point (d(Q), I(Q)) in the (R,D)-plane. For any Q the
point (d(Q), I(Q)) necessarily lies on or above the R(D) curve by definition of the
latter. Since R(D) is convex ∪ (Homework), it follows that V(Q) ≥ V_s, the R-axis
intercept of the line of slope s that is tangent to R(D).
The Blahut iterative formula Q_k^{r+1} = c_k^r Q_k^r can be thought of as the
composition of two successive steps; namely, starting from {Q_k^r}, we have

Step 1. $\quad Q_{k/j}^{r+1} = \lambda_j^r Q_k^r e^{s\rho_{jk}}$

Step 2. $\quad Q_k^{r+1} = \sum_j P_j Q_{k/j}^{r+1} = \sum_j \lambda_j^r P_j Q_k^r e^{s\rho_{jk}} = c_k^r Q_k^r.$

We proceed to show that V(Q^r) → V_s as r → ∞, which in turn clearly
implies from Figure 3 that (d(Q^r), I(Q^r)) converges to the point (D_s, R(D_s)) at
which the slope of R(·) is s. Let us contract notation from V(Q^r) to V^r. Then

$$V^{r+1} = \sum_{j,k} P_j Q^{r+1}_{k/j} \log \frac{Q^{r+1}_{k/j}}{Q^{r+1}_k} - s \sum_{j,k} P_j Q^{r+1}_{k/j} \rho_{jk}$$

$$= \sum_{j,k} P_j Q^{r+1}_{k/j} \log \left( \frac{\lambda^r_j Q^r_k e^{s\rho_{jk}}}{Q^{r+1}_k} \right) - \sum_{j,k} P_j Q^{r+1}_{k/j} \log \left( e^{s\rho_{jk}} \right)$$

$$= \sum_j P_j \log \lambda^r_j + \sum_k Q^{r+1}_k \log \frac{Q^r_k}{Q^{r+1}_k}.$$

Hence, with the aid of the lemma we get

$$V^r - V^{r+1} = \sum_{j,k} P_j Q^r_{k/j} \log \left( \frac{Q^r_{k/j}\, e^{-s\rho_{jk}}}{Q^r_k \lambda^r_j} \right) + \sum_k Q^{r+1}_k \log \frac{Q^{r+1}_k}{Q^r_k}$$

$$\ge \sum_{j,k} P_j Q^r_{k/j} \left[ 1 - \frac{\lambda^r_j Q^r_k e^{s\rho_{jk}}}{Q^r_{k/j}} \right] + \sum_k Q^{r+1}_k \left( 1 - \frac{Q^r_k}{Q^{r+1}_k} \right)$$

$$= 1 - \sum_k Q^r_k \sum_j \lambda^r_j P_j e^{s\rho_{jk}} + 1 - \sum_k Q^r_k = 1 - \sum_k Q^r_k c^r_k + 1 - 1 = 1 - \sum_k Q^{r+1}_k = 1 - 1 = 0.$$

This establishes that V^r is monotonic nonincreasing* in r and hence
converges because it is bounded from below by V_s. It only remains to show that the
value to which V^r converges is V_s rather than some higher value. This is argued by
noting that equality must hold in the above argument at the convergence point, which
in turn requires that Q^{r+1}_k / Q^r_k = c^r_k → 1 for all k such that lim_{r→∞} Q^r_k > 0. That is, the first
of the Kuhn-Tucker conditions must be satisfied at the convergence point. But the
second one also must be satisfied there since, if lim_{r→∞} c^r_k > 1 for any k, then
convergence cannot occur because 0 < Q^{r+1}_k = c^r_k Q^r_k. This completes the proof of
the theorem and also completes Lecture 2.

* In fact, V^r is strictly decreasing whenever V^r > V_s. To see this, note that the inequality used to show the
nonincreasing nature of V^r is strict unless Q^{r+1}_k = Q^r_k for all k, i.e. unless c^r_k = 1 for all k. But c^r_k = 1 for all k implies
that {Q^r_k} is optimum and hence that V^r = V_s.
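For readers who want to experiment, here is a minimal Python sketch of the iteration of Theorem 2. The function name and the binary test source are assumptions made for illustration; the update Q_k^{r+1} = c_k^r Q_k^r and the recovery of (D, R) via Theorem 1 (i) follow the lecture.

```python
import numpy as np

def blahut_rd_point(P, rho, s, n_iter=500):
    """One parametric point of the R(D) curve via the iteration of Theorem 2.

    P    : source probabilities P_j, shape (M,)
    rho  : distortion matrix rho_{jk}, shape (M, N)
    s    : slope parameter, s <= 0
    Returns (D, R) in nats at the point where R(D) has slope s.
    """
    A = np.exp(s * rho)                              # A[j, k] = e^{s rho_jk}
    Qk = np.full(rho.shape[1], 1.0 / rho.shape[1])   # Q_k^0, strictly positive
    for _ in range(n_iter):
        lam = 1.0 / (A @ Qk)        # lambda_j^r = (sum_k Q_k^r e^{s rho_jk})^{-1}
        c = (lam * P) @ A           # c_k^r = sum_j lambda_j^r P_j e^{s rho_jk}
        Qk = c * Qk                 # Q_k^{r+1} = c_k^r Q_k^r
    lam = 1.0 / (A @ Qk)
    Q = lam[:, None] * Qk[None, :] * A   # Q_{k/j} = lambda_j Q_k e^{s rho_jk}, Theorem 1 (i)
    joint = P[:, None] * Q
    D = float((joint * rho).sum())
    R = float((joint * np.log(Q / Qk)).sum())
    return D, R

# Assumed test case: equiprobable binary source with error-frequency distortion.
P = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])
for s in (-4.0, -2.0, -1.0):
    D, R = blahut_rd_point(P, rho, s)
    print(f"s = {s:4.1f}   D = {D:.4f}   R = {R / np.log(2):.4f} bits")  # cf. 1 - H_b(D)
```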
LECTURE 3

The Lower Bound Theorem and Error Estimates for the Blahut Algorithm

We have just seen how Blahut's iterative algorithm provides a powerful
tool for the computation of R(D) curves. The following rather remarkable theorem
is most interesting in its own right and also provides a means for ascertaining when
the Blahut iterations have progressed to the point that (d(Q^r), I(Q^r)) is within some
specified ε of the R(D) curve.

Theorem 3.

$$R(D) = \max_{s \le 0,\ \lambda \in \Lambda_s} \left( sD + \sum_j P_j \log \lambda_j \right),$$

where

$$\Lambda_s = \left\{ \lambda = (\lambda_1, \dots, \lambda_M) : \lambda_j \ge 0,\ \sum_j \lambda_j P_j e^{s\rho_{jk}} \le 1 \ \text{for all } k \right\}.$$

Remark: This theorem gives a representation of R(D) in the form of a maximum of


a different function in a different convex region. Hence, it permits lower bounds to
R(D) to be generated easily just as the original definition permitted generation of
upper bounds in the form of I(Q) for any D-admissible Q.
Proof: Choose any s ≤ 0, any λ ∈ Λ_s, and any Q such that d(Q) ≤ D. Then

$$I(Q) - sD - \sum_j P_j \log \lambda_j$$

$$\overset{(1)}{\ge} I(Q) - sd(Q) - \sum_j P_j \log \lambda_j = \sum_{j,k} P_j Q_{k/j} \log \left( \frac{Q_{k/j}}{\lambda_j Q_k e^{s\rho_{jk}}} \right)$$

$$\overset{(2)}{\ge} \sum_{j,k} P_j Q_{k/j} \left( 1 - \frac{\lambda_j Q_k e^{s\rho_{jk}}}{Q_{k/j}} \right) = 1 - \sum_k Q_k \sum_j \lambda_j P_j e^{s\rho_{jk}}$$

$$\overset{(3)}{\ge} 1 - \sum_k Q_k = 1 - 1 = 0.$$

[(1) results from the conditions s ≤ 0 and d(Q) ≤ D;
(2) is the fundamental inequality;
(3) results from the fact that λ ∈ Λ_s.]

We have just shown that (d(Q) ≤ D) ⟹ (I(Q) ≥ sD + Σ_j P_j log λ_j for any s ≤ 0
and any λ ∈ Λ_s). It follows that (d(Q) ≤ D) ⟹ (I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} sD + Σ_j P_j log λ_j).

Accordingly,

$$R(D) = \min_{Q:\, d(Q) \le D} I(Q) \ \ge \ \max_{s \le 0,\ \lambda \in \Lambda_s} \left( sD + \sum_j P_j \log \lambda_j \right),$$

which establishes the advertised lower bound. That the reverse inequality also holds,
and hence that the theorem is true, is established by recalling that the Q that solves
the original R(D) problem must be of the form

$$Q_{k/j} = \lambda_j Q_k e^{s\rho_{jk}},$$

so that for this Q we have

$$R(D) = I(Q) = \sum_{j,k} P_j Q_{k/j} \log \left( \frac{Q_{k/j}}{Q_k} \right) = \sum_{j,k} P_j Q_{k/j} \log \left( \lambda_j e^{s\rho_{jk}} \right)$$

$$= s \sum_{j,k} P_j Q_{k/j} \rho_{jk} + \sum_j P_j \log \lambda_j$$

$$= sd(Q) + \sum_j P_j \log \lambda_j = sD + \sum_j P_j \log \lambda_j,$$

since d(Q) = D for the Q that solves the R(D) problem. Also, we know that c_k ≤ 1
for this Q, so λ ∈ Λ_s. Thus R(D) is of the form sD + Σ_j P_j log λ_j for some s ≤ 0
and some λ ∈ Λ_s. Therefore

$$R(D) \le \max_{s \le 0,\ \lambda \in \Lambda_s} \left( sD + \sum_j P_j \log \lambda_j \right),$$

and the theorem is proved.


Let us return to step (r + 1) of the Blahut algorithm in which we
defined

$$Q^{r+1}_{k/j} = \lambda^r_j Q^r_k e^{s\rho_{jk}} \quad \text{and} \quad Q^{r+1}_k = c^r_k Q^r_k.$$

Let d(Q^{r+1}) = D. It then follows that

$$R(D) \le I(Q^{r+1}) = \sum_{j,k} P_j Q^{r+1}_{k/j} \log \frac{Q^{r+1}_{k/j}}{Q^{r+1}_k} = \sum_{j,k} P_j Q^{r+1}_{k/j} \log \frac{\lambda^r_j e^{s\rho_{jk}}}{c^r_k}$$

$$= s \sum_{j,k} P_j Q^{r+1}_{k/j} \rho_{jk} + \sum_j P_j \log \lambda^r_j - \sum_k Q^r_k c^r_k \log c^r_k$$

$$= s\, d(Q^{r+1}) + \sum_j P_j \log \lambda^r_j - \sum_k Q^r_k c^r_k \log c^r_k \ \triangleq \ \bar{R}^r(D),$$

which is an upper bound to R(D) at the value of distortion associated with iteration
r + 1. We can obtain a lower bound to R(D) at the same value of D with the help of
Theorem 3 by defining

$$c^r_{\max} = \max_k c^r_k = \max_k \sum_j \lambda^r_j P_j e^{s\rho_{jk}}$$

and then setting λ'_j = λ^r_j / c^r_max. It follows that

$$\sum_j \lambda'_j P_j e^{s\rho_{jk}} = \frac{1}{c^r_{\max}} \sum_j \lambda^r_j P_j e^{s\rho_{jk}} = \frac{c^r_k}{c^r_{\max}} \le 1$$

for all k. Thus λ' ∈ Λ_s, so Theorem 3 gives

$$R(D) \ge sD + \sum_j P_j \log \lambda'_j, \quad \text{or}$$

$$R(D) \ge sD + \sum_j P_j \log \lambda^r_j - \log c^r_{\max} \ \triangleq \ \underline{R}^r(D),$$

which is the desired lower bound. Comparing the two bounds, we see that I(Q^{r+1})
differs from R(d(Q^{r+1})) = R(D) by less than

$$\bar{R}^r(D) - \underline{R}^r(D) = \log c^r_{\max} - \sum_k Q^r_k c^r_k \log c^r_k.$$

Since lim_{r→∞} c^r_k ≤ 1 with equality whenever lim_{r→∞} Q^r_k > 0, we see that lim_{r→∞} [R̄^r(D) −
R̲^r(D)] = 0. The upper and lower bounds therefore tend to coincidence in the limit of
large iteration number. If we need an estimate of R(D) with an error of ε or less, we
need merely iterate on r until log c^r_max − Σ_k Q^r_k c^r_k log c^r_k ≤ ε. Note that both the
iterations and the test to see if we are within ε of R(D) can be carried out without
having to calculate either I(Q^r) or d(Q^r) at step r, let alone their gradients. This is the
reason why the iterations proceed so rapidly in practice.
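The bound-gap stopping rule just derived is simple to add to the iteration sketched at the end of Lecture 2. The Python fragment below is an assumed illustration (natural logarithms throughout): it stops as soon as log c^r_max − Σ_k Q_k^r c_k^r log c_k^r ≤ ε and returns the distortion together with the lower and upper bounds that then bracket R(D).

```python
import numpy as np

def blahut_with_bounds(P, rho, s, eps=1e-6, max_iter=100000):
    """Blahut iteration with the bound-gap stopping test derived above.

    Stops when log(c_max) - sum_k Q_k c_k log(c_k) <= eps, so that the returned
    lower and upper bounds bracket R(D) to within eps (nats) at D = d(Q^{r+1}).
    """
    A = np.exp(s * rho)
    Qk = np.full(rho.shape[1], 1.0 / rho.shape[1])
    for _ in range(max_iter):
        lam = 1.0 / (A @ Qk)                                  # lambda_j^r
        c = (lam * P) @ A                                     # c_k^r
        gap = np.log(c.max()) - np.sum(Qk * c * np.log(c))    # upper bound - lower bound
        if gap <= eps:
            break
        Qk = c * Qk
    D = float(np.sum((lam * P)[:, None] * Qk[None, :] * A * rho))   # d(Q^{r+1})
    base = s * D + float(np.sum(P * np.log(lam)))
    R_lower = base - np.log(c.max())
    R_upper = base - float(np.sum(Qk * c * np.log(c)))
    return D, R_lower, R_upper
```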
LECTURE 4

Extension to Sources and Distortion Measures with Memory


We shall now indicate how to extend the theory developed above to
cases in which the source and/or the distortion measure have memory. Let P_n(x)
denote the probability distribution governing the random source vector X = (X_1, ...,
X_n), and let ρ_n(x, y) denote the distortion incurred when the source vector x is
reproduced as the vector y. Let Q_n(y|x) denote a hypothetical transition probability
assignment, or "system", for transmission of successive blocks of length n. Then
define

$$d(Q_n) = \sum_{\mathbf{x}} \sum_{\mathbf{y}} P_n(\mathbf{x})\, Q_n(\mathbf{y}|\mathbf{x})\, \rho_n(\mathbf{x},\mathbf{y}),$$

$$I(Q_n) = n^{-1} \sum_{\mathbf{x}} \sum_{\mathbf{y}} P_n(\mathbf{x})\, Q_n(\mathbf{y}|\mathbf{x}) \log \frac{Q_n(\mathbf{y}|\mathbf{x})}{Q_n(\mathbf{y})},$$

where

$$Q_n(\mathbf{y}) = \sum_{\mathbf{x}} P_n(\mathbf{x})\, Q_n(\mathbf{y}|\mathbf{x}),$$

and finally

$$R_n(D) = \min_{Q_n:\, d(Q_n) \le D} I(Q_n).$$

It is clear that R_n(D) is the rate distortion function for a source that
produces successive n-vectors independently according to P_n(x) when the distortion
in the reproduction of a sequence of such n-vectors is measured by the arithmetic
average of ρ_n over the successive vectors that comprise the sequence. Although the
actual source, if stationary, will produce successive n-vectors that are identically
distributed according to P_n(x), these vectors will not be independent. Hence, R_n(D)
will be an overestimate of the rate needed to achieve average distortion D because it
does not reflect the fact that the statistical dependence between successive source
letters can be exploited to further reduce the required information rate. However,
one expects that this dependence will be useful only near the beginning and end of
each block and hence may be ignored for large n. This tempted Shannon [2] to
define

$$R(D) = \liminf_{n \to \infty} R_n(D).$$

Source coding theorems to the effect that the above prescription for
calculating R(D) does indeed describe the absolute tradeoff between rate and
distortion have been proven under increasingly general conditions by a variety of
authors [2, 4-9].
A sufficient but by no means necessary set of conditions is the
following:
(i) The source is strictly stationary.
(ii) The source is ergodic.
(iii) ∃ g < ∞ and ρ_g: A^g × B^g → [0, ∞] such that

$$\rho_n(\mathbf{x},\mathbf{y}) = \frac{1}{n-g+1} \sum_{t=1}^{n-g+1} \rho_g(x_t, x_{t+1}, \dots, x_{t+g-1}, y_t, \dots, y_{t+g-1}).$$

(This is called a distortion measure of span g and can be used to reflect context-
dependencies when assigning distortions).

(iv) $\min_{\mathbf{y} \in B^g} E_{\mathbf{X}}\, \rho_g(\mathbf{X}, \mathbf{y}) < \infty$,

where E_X denotes expectation over a vector X = (X_1, ..., X_g) of g successive random
source outputs.
Comments. Gray and Davisson [9] claim recently to have been able to remove the
need for condition (ii). Condition (iv) is automatically satisfied for finite-alphabet
sources; it is needed only when the theory is extended to the important case of
sources with continuous, possibly unbounded, source alphabets.
Under conditions (i) and (iii) the R_n(D) curves will be monotonic
nonincreasing for n ≥ g. They have the physical significance that no system that
operates independently on successive n-vectors from the source can perform below
R_n(D). However, the corresponding positive source coding theorem that systems can
be found that operate arbitrarily close to R_n(D) is true only in the limit n → ∞.
We now give some explicit examples of rate distortion functions.

Example 1. Binary Memoryless Source and Error Frequency Criterion [Shannon,
1948, 1959; Goblick, 1962]

A = B = {0,1},   ρ_jk = 1 − δ_jk = { 0 if j = k,  1 if j ≠ k }.

Assume that zeros are produced with probability 1 − p ≥ 1/2 and ones
with probability p ≤ 1/2.
A simple computation shows that

$$R(D) = \begin{cases} H_b(p) - H_b(D), & 0 \le D \le D_{\max} = p \\ 0, & D \ge p, \end{cases}$$

where H_b(·) is the binary entropy function,

$$H_b(p) = -p \log p - (1-p) \log(1-p).$$

In the equiprobable case p = 1/2, we get

$$R(D) = 1 + D \log_2 D + (1-D) \log_2 (1-D) \ \text{bits/letter}.$$

This formula equals the Gilbert bound, which is a lower bound on the
exponential rate of growth of the maximum number of vectors that can be packed
into binary n-space such that no two of them disagree in fewer than nD places. Rate
distortion theory yields a different interpretation of this curve in terms of covering
theory. Namely, R(D) = 1 + D log_2 D + (1−D) log_2(1−D) is the exponential rate of
growth of the minimum number of vectors such that every point in binary n-space
differs from at least one of these vectors in fewer than nD places. Note that in the
covering problem the result is exact, whereas in the corresponding packing problem
the tightest upper bounds known (Elias, Levenshtein) are strictly greater than the
Gilbert bound.* In this sense, at least, covering problems are simpler than packing
problems.

* It is widely believed that the Gilbert bound is tight in the packing problem, too, but no proof has been found
despite more than 20 years of attempts.
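A small Python sketch of the formula of Example 1 follows; the helper names are of course not part of the text. Evaluating it at D = 0.110 reproduces the rate-1/2 point quoted in the tree-coding discussion of Lecture 6.

```python
from math import log2

def Hb(x):
    """Binary entropy function H_b(x) in bits, with H_b(0) = H_b(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1.0 - x) * log2(1.0 - x)

def binary_rd(D, p=0.5):
    """R(D) = H_b(p) - H_b(D) for 0 <= D <= p, and 0 for D >= p (bits per letter)."""
    return max(Hb(p) - Hb(D), 0.0) if D < p else 0.0

print(binary_rd(0.110))   # about 0.50 bits/letter; cf. the tree-code example of Lecture 6
```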
Example 2. Gaussian Source and Mean Squared Error (MSE) [Shannon, 1948 and 1959]

A = B = ℝ, the real line;  ρ(x,y) = (x − y)².

The source letters are governed by the Gaussian probability density

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R}.$$

The R(D) curve given by Shannon [1, 2] is

$$R(D) = \frac{1}{2} \log(\sigma^2/D), \qquad 0 \le D \le D_{\max} = \sigma^2.$$

Example 3. Stationary Gaussian Random Sequence and MSE [Kolmogorov, 1956]

A = B = ℝ,  ρ(x,y) = (x − y)²,

E X_i = 0,  E(X_i X_{i+k}) = φ_k.

The answer is most easily expressed in terms of the discrete-time
spectral density

$$\Phi(\omega) = \sum_{k=-\infty}^{\infty} \varphi_k e^{-ik\omega}, \qquad i^2 = -1.$$

The R and D coordinates are then given parametrically in terms of
θ ∈ [0, sup_ω Φ(ω)] as follows:

$$D_\theta = \frac{1}{2\pi} \int_{-\pi}^{\pi} \min[\theta, \Phi(\omega)]\, d\omega,$$

$$R(D_\theta) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \max[0,\ \log(\Phi(\omega)/\theta)]\, d\omega.$$

The cross-hatched region in the accompanying sketch is the area under
the so-called "error spectrum", min[θ, Φ(ω)], associated with the parameter θ.
The optimum system for rate R(D_θ) will yield a reproduction sequence {Y_i} which
is Gaussian and stationarily related to {X_i}, with the error sequence {X_i − Y_i} having
time-discrete spectral density min[θ, Φ(ω)], |ω| ≤ π.
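The parametric pair (D_θ, R(D_θ)) is easy to evaluate numerically for a given spectral density. The sketch below approximates the two integrals on a frequency grid; the first-order autoregressive spectral density used at the end is an assumed illustration, not an example from the text.

```python
import numpy as np

def gaussian_rd_point(Phi, theta, n_grid=4096):
    """Evaluate (D_theta, R(D_theta)) from the parametric formulas of Example 3.

    Phi   : callable spectral density Phi(omega), omega in [-pi, pi]
    theta : water-filling parameter, 0 < theta <= sup Phi
    Returns distortion D_theta and rate R(D_theta) in nats per source letter.
    """
    w = np.linspace(-np.pi, np.pi, n_grid)
    S = Phi(w)
    D = np.trapz(np.minimum(theta, S), w) / (2 * np.pi)
    R = np.trapz(np.maximum(0.0, np.log(S / theta)), w) / (4 * np.pi)
    return D, R

# Assumed spectral density of a first-order autoregressive Gaussian sequence:
a, sigma2 = 0.9, 1.0
Phi = lambda w: sigma2 / (1 - 2 * a * np.cos(w) + a**2)
for theta in (0.05, 0.2, 1.0):
    print(gaussian_rd_point(Phi, theta))
```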

Example 4. Stationary Gaussian Random Process and MSE [Kolmogorov, 1956]

Let
x = {x(t), 0 ≤ t ≤ T},  y = {y(t), 0 ≤ t ≤ T},
and

$$\rho_T(\mathbf{x},\mathbf{y}) = \frac{1}{T} \int_0^T [x(t) - y(t)]^2\, dt.$$

The formulas for D_θ and R(D_θ) from the time-discrete case of Example
3 continue to apply except that the limits on the integrals become ±∞ instead of ±π
and the spectral density Φ(ω) is defined by

$$\Phi(\omega) = \int_{-\infty}^{\infty} \varphi(\tau)\, e^{-i\omega\tau}\, d\tau,$$

where
φ(τ) = E(X(t) X(t+τ)).
In the special case of an ideal bandlimited process

$$\Phi(\omega) = \begin{cases} \Phi_0, & |\omega| \le 2\pi W \\ 0, & |\omega| > 2\pi W, \end{cases}$$

the results specialize to D_θ = 2Wθ and R(D_θ) = W log(Φ_0/θ). Eliminating θ yields the
explicit result

$$R(D) = W \log \left( \frac{2W\Phi_0}{D} \right).$$

Now the instantaneous signal power is

$$S = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Phi(\omega)\, d\omega = \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} \Phi_0\, d\omega = 2W\Phi_0,$$

and since the mean squared error D in the reconstruction of the signal can be considered
as an effective additive noise power N, the above result often is written as

$$R(D) = W \log \left( \frac{S}{N} \right),$$

which is the form originally given by Shannon [1].


LECTURE 5.

Algebraic Source Encoding


We shall now prove the source coding theorem for the special case of
the binary equiprobable memoryless source and error frequency fidelity criterion
(Example 1 with p = 1/2). Moreover, the proof will show that a sequence of linear
codes of increasing blocklength n can be found whose (D,R) performances converge
to any specified point on the rate distortion function, R(D) = 1 − H_b(D) bits/letter.
The fact that the codes in question are linear is very important from the practical
standpoint because the associated structure considerably simplifies the encoding
procedure.
Let x = (x_1, ..., x_n) ∈ {0,1}^n and y = (y_1, ..., y_n) ∈ {0,1}^n
denote a typical source vector and reproduction vector, respectively. Define

$$\rho_n(\mathbf{x}, \mathbf{y}) = n^{-1} d_H(\mathbf{x}, \mathbf{y}),$$

where d_H is the Hamming distance function.


Given any set y 1 , y 2 , ... , y. of not necessarily t linearly independent re-
- - _J
production vectors, let

denote the [n, j] linear code with g~nerators 1J , ••• ,x_j (If the generatorsarenot
linearly independent, then not all 2 J of the y i € Bj are distinct). For completeness,
let B 0 = {_Q}. Define
F.
J

and let N. = II F. II denote the cardinality of FJ· . It should be clear that Nj is the
J J
number of source vectors .!. that can be encoded by at least one y EBj with error
frequency '1.jn or less. Accordingly, the probability Qj that a random vector X
produced by the source cannot be encoded by Bj with average distortion '1.j n ~
less is
0; = 1- 2-n~

because all 2npossible values of ~ are equally likely.


Now consider the addition of another generator y_{j+1}, resulting in a
code B_{j+1}. Then

$$F_{j+1} = F_j \cup F_j^*,$$

where

$$F_j^* = \bigcup_{\mathbf{y} \in B_j} S_\ell(\mathbf{y} + \mathbf{y}_{j+1}) = \{ \mathbf{v} + \mathbf{y}_{j+1} : \mathbf{v} \in F_j \},$$

S_ℓ(y) denoting the set of words within Hamming distance ℓ of y.

The set F_j^* clearly is geometrically isomorphic to F_j and therefore contains N_j
elements, too. It follows that

$$N_{j+1} = 2N_j - \| F_j \cap F_j^* \|.$$

A good choice of y_{j+1} is one that makes ||F_j ∩ F_j^*|| small. We proceed to derive an
upper bound that we are assured will exceed ||F_j ∩ F_j^*|| for some choice of y_{j+1}.
This is done by averaging ||F_j ∩ F_j^*|| over all 2^n possible choices of y_{j+1} and then
concluding that there must be at least one y_{j+1} such that ||F_j ∩ F_j^*|| does not ex-
ceed this average. The average is calculated by observing that, for fixed v ∈ F_j, the
vector v + y_{j+1} ∈ F_j^* also belongs to F_j iff y_{j+1} = v + u for some u ∈ F_j. Hence, there
are exactly N_j choices of y_{j+1} such that v + y_{j+1} ∈ F_j ∩ F_j^*. Now letting v vary over
the N_j points of F_j shows that there are a total of N_j · N_j = N_j² pairs (v, y_{j+1}), v ∈ F_j,
such that v + y_{j+1} ∈ F_j ∩ F_j^*. It follows that the average value of ||F_j ∩ F_j^*|| is
2^{-n} N_j². Therefore, there exists at least one way to choose y_{j+1} such that

$$N_{j+1} \ge 2N_j - 2^{-n} N_j^2.$$

This implies that



$$\Omega_{j+1} = 1 - 2^{-n} N_{j+1} \le 1 - 2^{-n+1} N_j + \left( 2^{-n} N_j \right)^2,$$

$$\Omega_{j+1} \le \left( 1 - 2^{-n} N_j \right)^2 = \Omega_j^2.$$

It follows by recursion that

$$\Omega_k \le \Omega_0^{2^k} = \left( 1 - 2^{-n} N_0 \right)^{2^k}.$$

Since

$$N_0 = \| F_0 \| = \sum_{i=0}^{\ell} \binom{n}{i} \ge \binom{n}{\ell},$$

we have

$$\Omega_k \le \left[ 1 - 2^{-n} \binom{n}{\ell} \right]^{2^k} \le \exp \left[ -2^k\, 2^{-n} \binom{n}{\ell} \right],$$

where we have used the fundamental inequality in the last step. Now let n → ∞ and
k → ∞ in such a way that the code rate k/n → R, and let ℓ → ∞ such that ℓ/n → D.
It then follows that the probability Ω_k that the source produces a word that cannot
be encoded with distortion D or less will tend to zero provided

$$\lim_{\substack{n \to \infty \\ k/n \to R \\ \ell/n \to D}} 2^k\, 2^{-n} \binom{n}{\ell} = \infty.$$

Since by Stirling's approximation we know that

$$\binom{n}{\ell} = 2^{\,n H_b(D) \pm O(\log n)},$$

we conclude that Ω_k → 0 provided nR − n + nH_b(D) → ∞, i.e., provided

$$R > 1 - H_b(D) = R(D).$$

Hence, we have established the following theorem.



Theorem 4. [ Goblick, 1962 ]


There exist D-admissible linear codes of rate R for the binary
equiprobable memoryless source and error frequency criterion for any
R > R(D) = 1 − H_b(D) bits/letter.
The encoding, or compression, of binary data by means of a linear code
proceeds in exact analogy to the procedure for decoding the word received at the
channel output in the channel coding application. Specifically, when the source
produces x one computes the syndrome

$$\mathbf{s} = H \mathbf{x}^T$$

and then searches for a minimum weight solution z of the equation

$$\mathbf{s} = H \mathbf{z}^T.$$

Such a z is called the leader of the coset C(s) = {v : H v^T = s}. Once z has been
found, which is the hardest part of the procedure, one then encodes (approximates)
x by y = x + z. Said y is a code word since H y^T = H x^T + H z^T = s + s = 0. Hence,
y is expressible as a linear combination (mod 2) of the k generators. This means that
a string of k binary digits suffices for specification of y. In this way the n source
digits are compressed to only k digits that need to be transmitted. The resulting
error frequency is n^{-1} wt(x + y) = n^{-1} wt(z). Accordingly, the expected error
frequency D is the average weight of the coset leaders,

$$D = n^{-1}\, 2^{-(n-k)} \sum_{i=1}^{2^{n-k}} w_i,$$

where w_i is the weight of the leader z_i of the i-th coset, 1 ≤ i ≤ 2^{n-k}. It
follows that for the source encoding application the crucial parameter of an
algebraic code is the average weight of its coset leaders rather than the minimum
distance between any pair of codewords.
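A toy Python sketch of this syndrome encoder follows. The [7,4] Hamming code used as the test case is an assumed illustration (any parity check matrix H works), and the coset leaders are found by brute force rather than by an efficient complete decoder.

```python
import numpy as np
from itertools import product

def coset_leaders(H):
    """Minimum-weight solution z of H z^T = s for every syndrome s (brute force)."""
    r, n = H.shape
    leaders = {}
    # Enumerate words in order of increasing weight, so the first hit per syndrome is a leader.
    for z in sorted(product((0, 1), repeat=n), key=sum):
        s = tuple(H @ np.array(z) % 2)
        leaders.setdefault(s, np.array(z))
    return leaders

def source_encode(x, H, leaders):
    """Encode x as y = x + z (nearest codeword); return y and the error frequency wt(z)/n."""
    s = tuple(H @ x % 2)
    z = leaders[s]
    y = (x + z) % 2
    return y, z.sum() / len(x)

# Hypothetical illustration with the [7,4] Hamming code (k = 4, n = 7):
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
leaders = coset_leaders(H)
x = np.array([1, 1, 0, 1, 0, 0, 1])
y, d = source_encode(x, H, leaders)
print(y, d)   # nearest codeword and the resulting error frequency (at most 1/7 here)
```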
LECTURE 6

Tree Codes for Sources


There are two major difficulties with algebraic source encoding. The
first one is that most sources do not naturally produce outputs in GF(q) for some
prime power q. Of course, the source outputs can be losslessly encoded into a string
of elements from GF(q), but then it is unlikely that minimization of Hamming or
Lee distance will correspond to maximization of fidelity. The second difficulty is
that, even if A = GF(q) and the Hamming or Lee metric is appropriate, the source
produces output vectors that are distributed over A^n in a manner that does not
concentrate probability in the immediate vicinity of the code words. This differs
markedly from the situation that prevails in channel decoding where the probability
distribution on the space of channel output vectors is concentrated in peaks
centered at the code words (provided the code rate is not extremely close to the
channel capacity). As a result it suffices

Figure 4. Sketch of Probability Distribution Over the Space of Channel Output Words

in most instances to limit decoding to the nonoverlapping spheres of
radius t ≤ d/2 centered about the codewords. The majority of algebraic decoding
procedures that have been devised to date simply abort whenever the received word
does not lie inside one of the spheres of radius t centered at the code words. Such
decoding procedures are essentially useless for source encoding because the source
word density is just as great between these spheres as it is inside them. In fact, if one
insists that the limiting rate of a sequence of binary codes be bounded away from 0
and 1 as n → ∞, then a vanishingly small fraction of the space will lie within the
nonoverlapping t-spheres.
It follows from the above discussion that a good algebraic source
encoding algorithm must be complete in the sense that it finds the closest (or at least
a close) code word for every possible received word. At present complete decoding
algorithms are known only for the Hamming codes [Hamming, 1950], t = 2 BCH
codes [Berlekamp, 1968], certain t = 3 BCH codes [Berger and Van der Horst, 1973],

and first-order Reed-Muller codes [Posner, 1968]. Unfortunately, the asymptotic
rate is 1 for these Hamming and BCH codes, and 0 for these Reed-Muller codes, so no
long, good, decodable algebraic source codes of nontrivial rate are known at present.
Another way to introduce structure into a source code that overcomes
most of the shortcomings of the algebraic approach is to employ tree codes. The
concept of tree encoding of sources is most readily introduced by means of an
example. The tree code depicted in Figure 5 provides a means for encoding the
binary source of Example 1 of Lecture 4.

Figure 5. A Binary Tree Code

    path map    code word
    000         000000
    001         000001
    010         000110
    011         000111
    100         111000
    101         111001
    110         111110
    111         111111
The code consists of the 2³ = 8 different words of length n = 6 that
can be formed by concatenating the three pairs of binary digits encountered along a
path from the root to a leaf. The code word corresponding to each such path is
indicated in Figure 5 directly to the right of the leaf at which the path terminates.
Since there are 2⁶ = 64 different words of length 6 that might be produced by the
source, there usually will not be a code word that matches the source word exactly.
Accordingly, we approximate the source word x by whatever y in the code
minimizes ρ₆(x, y). For example, if x = 010110, then y = 000110 is closest in the
sense of the Hamming metric. When traversing the path from the root to the leaf
y = 000110, we branch first up, then down, and then up at the successive nodes
encountered. Hence, this path is specified by the binary path map sequence 010,
where 0 represents "up" and 1 represents "down". In this way we have compressed
a source sequence of length 6 into a path map of length 3. Despite the fact that the
data rate has been compressed by a factor of two, most of the original source digits
can be recovered correctly. The reader may wish to verify that in the equiprobable
case p = 1/2, the average fraction of the digits that are reproduced incorrectly is
3/16 = 0.1875. The best that could be done in this instance by any code of rate
R = 1/2 of any blocklength is D(R = 1/2) = 0.110.
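For concreteness, the following Python sketch encodes a source word by exhaustive search over the eight leaf words transcribed in Figure 5 above; it reproduces the encoding of 010110 described in the text. The helper names are assumptions made for illustration.

```python
# Path map -> code word, read off the leaves of the Figure 5 tree ("0" = up, "1" = down).
TREE_CODE = {
    "000": "000000", "001": "000001", "010": "000110", "011": "000111",
    "100": "111000", "101": "111001", "110": "111110", "111": "111111",
}

def hamming(a, b):
    """Number of positions in which the binary strings a and b differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def tree_encode(x):
    """Return (path map, code word, error frequency) for the leaf closest to x."""
    path, y = min(TREE_CODE.items(), key=lambda item: hamming(x, item[1]))
    return path, y, hamming(x, y) / len(x)

print(tree_encode("010110"))   # ('010', '000110', 0.1666...), as in the text
```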
It should be clear that tree codes can be designed with any desired rate.
If there are b branches per node and ℓ letters per branch, then the rate is

$$R = \ell^{-1} \log b.$$

Also, the letters on the branches can be chosen from any reproducing alphabet
whatsoever, including a continuous alphabet, so tree coding is not restricted to
GF(q).
Tree encoding of sources was first proposed by Goblick (1962). Jelinek
(1969) proved that tree codes can be found that perform arbitrarily close to the
R(D) curve for any memoryless source and fidelity criterion; this result has been
extended by Berger (1971) to a limited class of sources with memory. In order for
tree coding to become practical, however, an efficient search algorithm must be
devised for finding a good path through the tree. (Note that, unlike in sequential
decoding for channels, one need not necessarily find the best path through the tree;
a good path will suffice). This problem has been attacked with some degree of
success by Anderson and Jelinek (1971, 1973), by Gallager (1973), by Viterbi and
Omura (1973), and by Berger, Dick and Jelinek (1973).
In conclusion, it should be mentioned that delta modulation and more
general differential PCM schemes are special examples of tree encoding procedures.
However, even the adaptive versions of such schemes are necessarily sub-optimal
because they make their branching decisions instantaneously on the basis solely of
past and present source outputs. Performance could be improved by introducing
some coding delay in order to take future source outputs into consideration, too.
LECTURE 7

Theory vs Practice and Some Open Problems


In the final lecture of this overview of rate distortion theory, I shall
begin by attempting to convey some feeling for the current status of the comparison
of theory and practice in data compression. I shall couch the discussion in terms of
the i.i.d. Gaussian source and MSE criterion of Example 2 of Lecture 4 because this
example has been treated the most thoroughly in the literature. However, the
overall flavor of my comments applies to more general sources and distortion
measures, too.
Recall that the rate distortion function for the situation in question is

$$R(D) = \frac{1}{2} \log(\sigma^2/D).$$

When this is plotted with a logarithmic D-axis, it appears as a straight line with
negative slope as sketched in Figure 6.

Figure 6. Performances for Gauss-MSE Problem
(plot legend: Lloyd-Max quantizers, uncoded; Lloyd-Max quantizers, coded; best tree code; best trellis code)

Parallel to R(D), but approximately 1/4 bit higher, lies the performance curve of the best entropy-
coded quantizers. The small separation indicates that there is very little to be gained
by more sophisticated encoding techniques. This is not especially surprising given the
memoryless nature of both the source and the fidelity criterion. It must be
emphasized, however, that entropy coding necessitates the use of a buffer to
implement the conversion to variable-length codewords. Moreover, since the
optimum quantizer has nearly uniformly spaced levels, some of these levels become
many times more probable than others, which leads to difficult buffering problems.
Furthermore, when the buffer overflows it is usually because of an inordinately high
local density of large-magnitude source outputs. This means that the per-letter MSE
incurred when buffer overflows occur tends to be even bigger than σ².

As a result, the performance of coded quantizers with buffer overflow
properly taken into account may be considerably poorer than that shown in Figure
6, especially at high rates.
The buffering problem can be circumvented by not applying entropy
coding to the quantizer outputs. However, the performance curves of uncoded
quantizers diverge from R(D) at high R, as indicated by the locus of uncoded
Lloyd-Max quantizers in Figure 6.
Another scheme studied by Berger, Jelinek and Wolf (1972) uses the
permutation codes of Slepian (1965) in reverse for compression of source data. The
performance curve of optimum permutation codes of fixed blocklength and varying
rate essentially coincides with the optimum coded quantizer curve for low rates but
diverges from it at a rate that increases with blocklength. Permutation codes offer
the advantage of synchronous operation, but they are characterized by long coding
delays and the need to partially order the components of the source vector. Berger
(1972) has shown that optimum quantizers and optimum permutation codes
perform identically in the respective limits of perfect entropy coding and of infinite
blocklength. Both are strictly bounded away from R(D).
An extensive study of tree coding techniques for the Gauss-MSE
problem was undertaken by Berger, Dick and Jelinek (1973). They studied source
encoding adaptations of both the Jelinek-Zigangirov stack algorithm for sequential
decoding of channel tree codes and the Viterbi algorithm for trellis codes. The best
performances obtained, determined by extensive simulation, were strictly better
than that of the best coded quantizer, as indicated by the points marked in Figure 6.
The best stack algorithm run had D = 1.243 D(R) and the Viterbi algorithm achieved
D = 1.308 D(R), whereas coded quantizers achieve at best D = 1.415 D(R).
However, it was necessary to search 512 nodes per datum in the Viterbi runs and an
average of 727 nodes per datum in the best of the stack runs.
Hence, real-time tree encoding of practical analog sources at high rates is not instru-
mentable at present.
We close with a discussion of several open problems and research areas.

A. Algebraic Source Encoding


1. Derive bounds on the average weights of the coset leaders of families of linear
codes.
2. Are long BCH, Justesen, and/or Goppa source codes good?
3. Find complete decoding algorithms for families of codes with nondegenerate
asymptotic rates.
4. Prove that the ensemble performance of codes with generators chosen in-
dependently at random approaches R(D) as n → ∞.
5. Extend the theory satisfactorily to nonequiprobable sources.

B. Tree Coding
1. Show that there are good convolutional tree codes for sources. (For contin-
uous amplitude sources the tapped shift register is replaced by a feed-forward
digital filter.)
2. Find better algorithms for finding satisfactory paths through the tree (or
trellis).

C. Information-Singularity [Berger (1973) ]


Characterize the class of all sources whose MSE rate distortion functions vanish
for all D > 0. This class is surprisingly broad, and its study promises to provide insight
into the fundamental structure of information production mechanisms.

D. Biochemical Data Compression


The equations in chemical thermodynamics that describe the approach to
multiphase chemical equilibrium are mathematically analogous to those which
must be solved in order to calculate a point on an R(D) curve. It has been
postulated [Berger (1971)] that this is not purely coincidental; indeed, the
interaction of a living system with its environment can be modeled by multiphase
chemical equilibrium. By "solving" this multiphase chemical equilibrium
problem, the system efficiently extracts those aspects of the environmental data
that it wishes to record accurately and either rejects or only coarsely encodes the
remainder.
The provocative mathematical analogy with rate distortion theory arises as follows.
If n_j molecules of substance j, 1 ≤ j ≤ M, are injected into a system that
possesses N thermodynamically homogeneous phases, then the number n_jk of mole-
cules of substance j that reside in phase k at equilibrium is found by minimizing the
Gibbs free energy functional, which has the form

$$F = \sum_{j,k} n_{jk} \left[ C_{jk} + \log \frac{n_{jk}}{n_k} \right],$$

where n_k = Σ_j n_jk and the C_jk are so-called "free energy constants" that can be
experimentally measured. The minimization naturally is subject to the mass balance
conditions Σ_k n_jk = n_j and the constraints n_jk ≥ 0. Letting n = Σ_j n_j and making the
obvious associations

n_j ↔ n P_j,
n_jk ↔ n P_j Q_{k/j},
C_jk ↔ −s ρ_jk,

one finds that F is of the form F = I(Q) − sd(Q), which we know from earlier work to
be the quantity that one must minimize to find the point on R(D) at which the
slope is s.
Investigation of the realm of applicability of rate distortion theory to
environmental encoding by biochemical systems along the lines of the above
discussion of multiphase chemical equilibrium appears to be a very worthwhile area
for future research. It may well turn out that the principal usefulness of rate distortion
theory will prove to be in applications to biology rather than to communication
theory.
REFERENCES

[1] SHANNON, C.E., (1948) "A Mathematical Theory of Communication", BSTJ, 27, 379-423, 623-656.

[2] SHANNON, C.E., (1959) "Coding Theorems for a Discrete Source with a Fidelity Criterion", IRE Nat'l. Conv. Rec., Part 4, 142-163.

[3] BLAHUT, R.E., (1972) "Computation of Channel Capacity and Rate-Distortion Functions", Trans. IEEE, IT-18, 460-473.

[4] PINSKER, M.S., (1963) "Sources of Messages", Problemy Peredachi Informatsii, 14, 5-20.

[5] GALLAGER, R.G., (1968) "Information Theory and Reliable Communication", Wiley, New York.

[6] BERGER, T., (1968) "Rate Distortion Theory for Sources with Abstract Alphabets and Memory", Information and Control, 13, 254-273.

[7] GOBLICK, T.J., Jr. (1969) "A Coding Theorem for Time-Discrete Analog Data Sources", Trans. IEEE, IT-15, 401-407.

[8] BERGER, T., (1971) "Rate Distortion Theory: A Mathematical Basis for Data Compression", Prentice-Hall, Englewood Cliffs, N.J.

[9] GRAY, R.M., and L.D. DAVISSON (1973) "Source Coding Without Ergodicity", Presented at 1973 IEEE Intern. Symp. on Inform. Theory, Ashkelon, Israel.

[10] GOBLICK, T.J., Jr. (1962) "Coding for a Discrete Information Source with a Distortion Measure", Ph.D. Dissertation, Elec. Eng. Dept., M.I.T., Cambridge, Mass.

[11] KOLMOGOROV, A.N., (1956) "On the Shannon Theory of Information Transmission in the Case of Continuous Signals", Trans. IEEE, IT-2, 102-108.

[12] HAMMING, R.W., (1950) "Error Detecting and Error Correcting Codes", BSTJ, 29, 147-160.

[13] BERLEKAMP, E.R., (1968) "Algebraic Coding Theory", McGraw-Hill, N.Y.

[14] BERGER, T., and J.A. VAN DER HORST (1973) "BCH Source Codes", Submitted to IEEE Trans. on Information Theory.

[15] POSNER, E.C., (1968) in Mann, H.B. (ed.), "Error Correcting Codes", Wiley, N.Y., Chapter 2.

[16] JELINEK, F., (1969) "Tree Encoding of Memoryless Time-Discrete Sources with a Fidelity Criterion", Trans. IEEE, IT-15, 584-590.

[17] JELINEK, F., and J.B. ANDERSON (1971) "Instrumentable Tree Encoding of Information Sources", Trans. IEEE, IT-17, 118-119.

[18] ANDERSON, J.B., and F. JELINEK (1973) "A Two-Cycle Algorithm for Source Coding with a Fidelity Criterion", Trans. IEEE, IT-19, 77-92.

[19] GALLAGER, R.G., (1973) "Tree Encoding for Symmetric Sources with a Distortion Measure", Presented at 1973 IEEE Int'l. Symp. on Information Theory, Ashkelon, Israel.

[20] VITERBI, A.J., and J.K. OMURA (1974) "Trellis Encoding of Memoryless Discrete-Time Sources with a Fidelity Criterion", Trans. IEEE, IT-20, 325-332.

[21] BERGER, T., R.J. DICK and F. JELINEK (1974) "Tree Encoding of Gaussian Sources", Trans. IEEE, IT-20, 332-336.

[22] BERGER, T., F. JELINEK and J.K. WOLF (1972) "Permutation Codes for Sources", Trans. IEEE, IT-18, 160-169.

[23] SLEPIAN, D., (1965) "Permutation Modulation", Proc. IEEE, 53, 228-236.

[24] BERGER, T., (1972) "Optimum Quantizers and Permutation Codes", Trans. IEEE, IT-18, 759-765.

[25] BERGER, T., (1973) "Information-Singular Random Processes", Presented at Third International Symposium on Information Theory, Tallinn, Estonia, USSR.