RATE DISTORTION THEORY AND DATA COMPRESSION

TOBY BERGER
School of Electrical Engineering
Cornell University, Ithaca, New York

LECTURE 1
Rate Distortion Theory: An Introduction
If the rate R at which information can be provided about the source output exceeds H, then no distortion need result. As R decreases from H towards 0, the minimum attainable distortion steadily increases from 0 to the value D_max associated with the best guess one can make in the total absence of any information about the source outputs. Curves that quantify such tradeoffs have aptly been termed rate distortion functions by Shannon [2].
The crucial property that a satisfactorily defined rate distortion function R(D) possesses is the following:

(*) "It is possible to compress the data rate from H down to any R > R(D) and still be able to recover the original source outputs with an average distortion not exceeding D. Conversely, if the compressed data rate R satisfies R < R(D), then it is not possible to recover the original source data from the compressed version with an average distortion of D or less."
An R(D) curve that possesses the above property clearly functions as an extension, or generalization, of the concept of entropy. Just as H is the minimum data rate (channel capacity) needed to transmit the source data with zero average distortion, R(D) is the minimum data rate (channel capacity) needed to transmit the data with average distortion D.
Omnis rate distortion theory in tres partes divisa est.
(i) Definition, calculation and bounding of R(D) curves for various data sources
and distortion measures.
(ii) Proving of coding theorems which establish that said R(D) curves do indeed specify the absolute limit on the rate vs. distortion tradeoff in the sense of (*).
(iii) Designing and analyzing practical communication systems whose performances closely approach the R(D) limit in various applications.

[Figure 2. A communication system linking the source to the user.]
In order to quantify the rate-distortion tradeoff, we must have a means of specifying the distortion that results when X_i = a_j and Y_i = b_k. We shall assume that there is given for this purpose a so-called distortion measure ρ: A × B → [0, ∞]. That is, ρ(a_j, b_k) is the penalty, loss, cost, or distortion that results when the source produces a_j and the system delivers b_k to the user. Moreover, we shall assume that the distortion ρ_n(x, y) that results when a vector x ∈ A^n of n successive source letters is represented to the user as y ∈ B^n is of the form

    ρ_n(x, y) = n^{-1} Σ_{t=1}^{n} ρ(x_t, y_t).
Example:
A = B = {0, 1}
ρ(0,0) = ρ(1,1) = 0,  ρ(1,0) = α,  ρ(0,1) = β.
Then

    ρ_3(010, 001) = (1/3)(0 + α + β) = (α + β)/3.
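To make the computation concrete, here is a tiny sketch (Python; the numerical values assigned to α and β are illustrative stand-ins, not from the text) that reproduces ρ_3(010, 001) = (α + β)/3.

```python
# Per-letter average distortion rho_n(x, y) = (1/n) * sum_t rho(x_t, y_t) for the
# binary example above: rho(0,0) = rho(1,1) = 0, rho(1,0) = alpha, rho(0,1) = beta.
alpha, beta = 0.7, 0.3   # illustrative numerical stand-ins for the symbols alpha, beta

def rho(a, b):
    """Single-letter distortion measure of the binary example."""
    if a == b:
        return 0.0
    return alpha if (a, b) == (1, 0) else beta

def rho_n(x, y):
    """Arithmetic average of the single-letter distortions over the block."""
    assert len(x) == len(y)
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)

print(rho_n([0, 1, 0], [0, 0, 1]))   # (0 + alpha + beta) / 3
print((alpha + beta) / 3)            # the same value
```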
Each communication system linking source to user in Figure 2 may be fully described for statistical purposes by specifying, for each j ∈ {1, ..., M} and k ∈ {1, ..., N}, the probability Q_{k|j} that a typical system output Y will be equal to b_k given that the corresponding input X equals a_j. Contracting notation from P(a_j) to P_j and from ρ(a_j, b_k) to ρ_jk, we may associate with each system Q = (Q_{k|j}) two functions of extreme importance, namely
    I(Q) = Σ_{j=1}^{M} Σ_{k=1}^{N} P_j Q_{k|j} log( Q_{k|j} / Q_k ),

where

    Q_k = Σ_{j=1}^{M} P_j Q_{k|j},

and

    d(Q) = Σ_{j,k} P_j Q_{k|j} ρ_jk.
I(Q) is the average mutual information between source and user for the system Q, and d(Q) is the average distortion with which the source outputs are reproduced for the user by system Q. For situations in which both the source and the fidelity criterion are memoryless, Shannon [2] has defined the rate distortion function R(D) as follows:

    R(D) = min_{Q: d(Q) ≤ D} I(Q).
That is, we restrict our attention to those systems Q whose average distortion d(Q) does not exceed a specified level D of tolerable average distortion, and then we search among this set of Q's for the one with the minimum value of I(Q). We minimize rather than maximize I(Q) because we wish to supply as little information as possible about the source outputs provided, of course, that we preserve the data with the required fidelity D. In physical terms, we wish to compress the rate of transmission of information to the lowest value consistent with the requirement that the average distortion may not exceed D.
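As an aside, the two functionals are straightforward to evaluate numerically. The following sketch (Python with NumPy; the source probabilities and the test channel are arbitrary illustrative choices) computes I(Q) in nats and d(Q) for a given system Q.

```python
import numpy as np

def mutual_information(P, Q):
    """I(Q) = sum_{j,k} P_j Q_{k|j} log(Q_{k|j} / Q_k), with Q_k = sum_j P_j Q_{k|j}.
    P is the source distribution (length M); Q is the M-by-N transition matrix."""
    Qk = P @ Q                                    # output distribution Q_k
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P[:, None] * Q * np.log(Q / Qk)   # 0 * log 0 terms become NaN ...
    return float(np.nansum(terms))                # ... and are dropped here

def average_distortion(P, Q, rho):
    """d(Q) = sum_{j,k} P_j Q_{k|j} rho_{jk}."""
    return float(np.sum(P[:, None] * Q * rho))

# Binary example: P(0) = 0.6, P(1) = 0.4, error-frequency (Hamming) distortion,
# and an arbitrary, deliberately suboptimal test channel Q.
P = np.array([0.6, 0.4])
rho = np.array([[0.0, 1.0],
                [1.0, 0.0]])
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(mutual_information(P, Q), average_distortion(P, Q, rho))
```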
Remark: An equivalent definition of the critical curve in the (R,D)-plane, in which R is the independent and D the dependent variable, is

    D(R) = min_{Q: I(Q) ≤ R} d(Q).
Computation of R(D)
The definition of R(D) that we have given thus far is not imbued with any physical significance because we have not yet proved the crucial source-coding theorem and converse for R(D) so defined. We shall temporarily ignore this fact and concentrate instead on procedures for calculating R(D) curves in practical examples. Until recently, it was possible to find R(D) only in special examples involving either small M and N or {P_j} and (ρ_jk) with special symmetry properties. A recent advance by Blahut, however, permits rapid and accurate calculation of R(D) in the general case [3].
The problem of computing R(D) is one in convex programming. The mutual information functional I(Q) is convex ∪ in Q (Homework). It must be minimized subject to the linear equality and inequality constraints

    Σ_{k=1}^{N} Q_{k|j} = 1 for each j,

    Σ_{j,k} P_j Q_{k|j} ρ_jk ≤ D,

and Q_{k|j} ≥ 0. Accordingly, the Kuhn-Tucker theorem can be applied to determine necessary and sufficient conditions that characterize the optimum transition probability assignment. Said conditions assume the following form:
Theorem 1. Fix s ∈ [-∞, 0]. Then there is a line of slope s tangent to the R(D) curve at the point (d(Q), I(Q)) if and only if there exists a probability vector {Q_k, 1 ≤ k ≤ N} such that

(i) Q_{k|j} = λ_j Q_k e^{sρ_jk}, where λ_j = ( Σ_k Q_k e^{sρ_jk} )^{-1},

and

(ii) c_k ≤ 1 for every k, with equality whenever Q_k > 0, where c_k = Σ_j P_j λ_j e^{sρ_jk}.

Theorem 2 (Blahut). Fix s ∈ [-∞, 0] and choose any initial probability vector {Q_k^0} with all components positive. Iterate according to

    Q_k^{r+1} = c_k^r Q_k^r,

where

    c_k^r = Σ_j P_j λ_j^r e^{sρ_jk}   and   λ_j^r = ( Σ_k Q_k^r e^{sρ_jk} )^{-1}.

Then {Q_k^r} converges to the {Q_k} that satisfies (i) and (ii) of Theorem 1 as r → ∞.
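Before turning to the proof, here is a minimal sketch of the iteration just stated (Python with NumPy; the slope s, the initial output distribution and the fixed iteration count are illustrative choices, not prescribed by the text). It returns one point (d(Q), I(Q)) near the R(D) curve for the chosen s.

```python
import numpy as np

def blahut_step(P, rho, s, Qk):
    """One iteration Q_k <- c_k Q_k at fixed slope s <= 0; returns (Qk_new, c, lam)."""
    A = np.exp(s * rho)              # A[j, k] = exp(s * rho_jk)
    lam = 1.0 / (A @ Qk)             # lambda_j = 1 / sum_k Q_k exp(s * rho_jk)
    c = (P * lam) @ A                # c_k = sum_j P_j lambda_j exp(s * rho_jk)
    return c * Qk, c, lam

def blahut_point(P, rho, s, iters=200):
    """Iterate from the uniform output distribution and return the point (d(Q), I(Q))."""
    Qk = np.full(rho.shape[1], 1.0 / rho.shape[1])
    for _ in range(iters):
        Qk, _, _ = blahut_step(P, rho, s, Qk)
    A = np.exp(s * rho)
    lam = 1.0 / (A @ Qk)
    Q = lam[:, None] * Qk[None, :] * A            # transition probabilities of Theorem 1(i)
    d = float(np.sum(P[:, None] * Q * rho))
    I = float(np.nansum(P[:, None] * Q * np.log(Q / Qk)))
    return d, I

P = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])          # binary source, error-frequency distortion
print(blahut_point(P, rho, s=-2.0))               # one (D, R) point, R in nats
```

Sweeping s over a range of negative values traces out the whole R(D) curve, one tangent point per slope.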
Lemma. log x ≤ x - 1 for all x > 0, with equality if and only if x = 1.

Proof: log x is concave and its tangent line at x = 1 is x - 1; hence the tangent line x - 1 must lie above the curve log x for all x. This completes the proof of the lemma. (log x ≤ x - 1 is often called the fundamental inequality.)

Corollary of Lemma: log y ≥ 1 - 1/y. (Proof omitted; apply the lemma with x = 1/y.)
To prove the theorem we consider the functional V(Q) = I(Q) - s d(Q). The graphical significance of V(Q) is shown in Figure 3: it is the R-axis intercept of a line of slope s through the point (d(Q), I(Q)) in the (R,D)-plane. For any Q the point (d(Q), I(Q)) necessarily lies on or above the R(D) curve by definition of the latter. Since R(D) is convex ∪ (Homework), it follows that V(Q) ≥ V_s, the R-axis intercept of the line of slope s that is tangent to R(D).

[Figure 3. The line of slope s through (d(Q), I(Q)), whose R-axis intercept is V(Q), drawn relative to the R(D) curve.]
The Blahut iterative formula Q_k^{r+1} = c_k^r Q_k^r can be thought of as the composition of two successive steps; namely, starting from {Q_k^r}, we have

Step 1.
    Q_{k|j}^{r+1} = λ_j^r Q_k^r e^{sρ_jk}

Step 2.
    Q_k^{r+1} = Σ_j P_j Q_{k|j}^{r+1} = Σ_j λ_j^r P_j Q_k^r e^{sρ_jk} = c_k^r Q_k^r.

That is, Step 1 computes the transition probabilities that would be optimum for the output distribution {Q_k^r}, and Step 2 computes the output distribution that these transition probabilities actually induce.
For each r, let V^r denote the value of V(Q) for the system whose transition probabilities are {Q_{k|j}^r} and whose output distribution is the induced {Q_k^r}. The sequence {V^r} is nonincreasing.* To see this, note first that for any transition probability assignment W = (W_{k|j}),

    Σ_{j,k} P_j W_{k|j} [ log( W_{k|j} / Q_k^r ) - sρ_jk ] - Σ_j P_j log λ_j^r
        = Σ_{j,k} P_j W_{k|j} log( W_{k|j} / Q_{k|j}^{r+1} )
        ≥ Σ_{j,k} P_j W_{k|j} ( 1 - Q_{k|j}^{r+1} / W_{k|j} ) = 1 - 1 = 0

by the corollary to the fundamental inequality. Taking W = (Q_{k|j}^r), whose induced output distribution is precisely {Q_k^r}, the first sum becomes V^r, so

    V^r ≥ Σ_j P_j log λ_j^r.

Second, a direct computation using Q_{k|j}^{r+1} = λ_j^r Q_k^r e^{sρ_jk} gives

    V^{r+1} = Σ_j P_j log λ_j^r + Σ_k Q_k^{r+1} log( Q_k^r / Q_k^{r+1} )
            ≤ Σ_j P_j log λ_j^r + Σ_k Q_k^{r+1} ( Q_k^r / Q_k^{r+1} - 1 )
            = Σ_j P_j log λ_j^r + 1 - 1 = Σ_j P_j log λ_j^r,

where the fundamental inequality has been used once more. Combining the two displays yields V^{r+1} ≤ V^r. Since the nonincreasing sequence {V^r} is bounded below by V_s, it converges, and at the convergence point the first of the Kuhn-Tucker conditions must be satisfied. But the second one also must be satisfied there since, if lim_{r→∞} c_k^r > 1 for any k, then convergence cannot occur because Q_k^{r+1} = c_k^r Q_k^r. This completes the proof of the theorem and also completes Lecture 2.
* In fact, V^r is strictly decreasing whenever V^r > V_s. To see this, note that the inequality used to show the nonincreasing nature of V^r is strict unless Q_k^{r+1} = Q_k^r for every k, i.e. unless c_k^r = 1 for every k. But c_k^r = 1 for every k implies that {Q_k^r} is optimum and hence that V^r = V_s.
LECTURE 3
The Lower Bound Theorem and Error Estimates for the Blahut Algorithm

We have just seen how Blahut's iterative algorithm provides a powerful tool for the computation of R(D) curves. The following rather remarkable theorem is most interesting in its own right and also provides a means for ascertaining when the Blahut iterations have progressed to the point that (d(Q^r), I(Q^r)) is within some specified ε of the R(D) curve.
Theorem 3.

    R(D) = max ( sD + Σ_j P_j log λ_j ),

where the maximum is over all s ≤ 0 and all λ ∈ Λ_s, with

    Λ_s = { λ = (λ_1, ..., λ_M) : λ_j ≥ 0 and Σ_j P_j λ_j e^{sρ_jk} ≤ 1 for every k }.

Proof. For any Q with d(Q) ≤ D, any s ≤ 0 and any λ ∈ Λ_s,

    I(Q) - sD - Σ_j P_j log λ_j

(1)     ≥ I(Q) - s d(Q) - Σ_j P_j log λ_j = Σ_{j,k} P_j Q_{k|j} log( Q_{k|j} / (λ_j Q_k e^{sρ_jk}) )
(2)     ≥ Σ_{j,k} P_j Q_{k|j} ( 1 - λ_j Q_k e^{sρ_jk} / Q_{k|j} ) = 1 - Σ_k Q_k Σ_j P_j λ_j e^{sρ_jk}

(3)     ≥ 1 - Σ_k Q_k = 1 - 1 = 0.

[(1) results from the conditions s ≤ 0 and d(Q) ≤ D; (2) is the fundamental inequality; (3) results from the fact that λ ∈ Λ_s.]

We have just shown that (d(Q) ≤ D) ⇒ (I(Q) ≥ sD + Σ_j P_j log λ_j for any s ≤ 0 and any λ ∈ Λ_s). It follows that (d(Q) ≤ D) ⇒ (I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} (sD + Σ_j P_j log λ_j)).
Accordingly,

    R(D) = min_{Q: d(Q) ≤ D} I(Q) ≥ max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ),
which establishes the advertised lower bound. That the reverse inequality also holds, and hence that the theorem is true, is established by recalling that the Q that solves the original R(D) problem must be of the form

    Q_{k|j} = λ_j Q_k e^{sρ_jk},

so that

    I(Q) = Σ_{j,k} P_j Q_{k|j} log( Q_{k|j} / Q_k ) = s Σ_{j,k} P_j Q_{k|j} ρ_jk + Σ_j P_j log λ_j = sD + Σ_j P_j log λ_j,

since d(Q) = D for the Q that solves the R(D) problem. Also, we know that c_k ≤ 1 for this Q, so λ ∈ Λ_s. Thus R(D) is of the form sD + Σ_j P_j log λ_j for some s ≤ 0 and some λ ∈ Λ_s. Therefore R(D) ≤ max_{s ≤ 0, λ ∈ Λ_s} ( sD + Σ_j P_j log λ_j ), which completes the proof.
We now use Theorem 3 to obtain error estimates for the Blahut algorithm. At iteration r + 1 we have

    R(D) ≤ I(Q^{r+1}) = Σ_{j,k} P_j Q_{k|j}^{r+1} log( Q_{k|j}^{r+1} / Q_k^{r+1} )
                      = Σ_{j,k} P_j Q_{k|j}^{r+1} log( λ_j^r e^{sρ_jk} / c_k^r ) ≜ R_u^r(D),

which is an upper bound to R(D) at the value of distortion associated with iteration r + 1. We can obtain a lower bound to R(D) at the same value of D with the help of Theorem 3 by defining

    λ̃_j = λ_j^r / c_max^r,    where c_max^r = max_k c_k^r.

It follows that
    Σ_j P_j λ̃_j e^{sρ_jk} = ( 1 / c_max^r ) Σ_j P_j λ_j^r e^{sρ_jk} = c_k^r / c_max^r ≤ 1

for all k. Thus λ̃ ∈ Λ_s, so Theorem 3 gives

    R(D) ≥ s d(Q^{r+1}) + Σ_j P_j log λ̃_j = s d(Q^{r+1}) + Σ_j P_j log λ_j^r - log c_max^r ≜ R_l^r(D),

which is the desired lower bound. Comparing the two bounds, we see that I(Q^{r+1}) differs from R(d(Q^{r+1})) = R(D) by less than

    R_u^r(D) - R_l^r(D) = log c_max^r - Σ_k Q_k^{r+1} log c_k^r.
Since 1 ≥ lim_{r→∞} c_k^r, with equality whenever lim_{r→∞} Q_k^r > 0, we see that lim_{r→∞} [ R_u^r(D) - R_l^r(D) ] = 0. The upper and lower bounds therefore tend to coincidence in the limit of large iteration number. If we need an estimate of R(D) with an error of ε or less, we need merely iterate on r until

    log c_max^r - Σ_k Q_k^r c_k^r log c_k^r ≤ ε.

Note that both the iterations and the test to see if we are within ε of R(D) can be carried out without having to calculate either I(Q^r) or d(Q^r) at step r, let alone their gradients. This is the reason why the iterations proceed so rapidly in practice.
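The stopping rule is easy to graft onto the iteration sketched in Lecture 2 (again Python with NumPy; the tolerance ε and the slope are illustrative). The test quantity is built entirely from numbers the iteration already produces, so the accuracy check costs essentially nothing extra.

```python
import numpy as np

def blahut_within_eps(P, rho, s, eps=1e-8, max_iters=100000):
    """Iterate Q_k <- c_k Q_k until log(max_k c_k) - sum_k Q_k c_k log c_k <= eps,
    at which point (d(Q), I(Q)) is guaranteed to lie within eps of the R(D) curve."""
    A = np.exp(s * rho)
    Qk = np.full(rho.shape[1], 1.0 / rho.shape[1])
    for _ in range(max_iters):
        lam = 1.0 / (A @ Qk)                      # lambda_j^r
        c = (P * lam) @ A                         # c_k^r
        gap = np.log(c.max()) - np.sum(Qk * c * np.log(c))
        Qk = c * Qk                               # the Blahut update
        if gap <= eps:
            break
    return Qk

P = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0], [1.0, 0.0]])
print(blahut_within_eps(P, rho, s=-2.0))
```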
LECTURE 4
Extension to Sources and Distortion Measures with Memory

For a source with memory we consider blocks of n successive source letters. Let P_n(x) denote the probability that the source produces the n-vector x ∈ A^n, and define

    R_n(D) = n^{-1} min_{Q: d(Q) ≤ D} I(Q),

where the minimum now is over transition probability assignments Q from A^n to B^n, and I(Q) and d(Q) are computed as before with P_n and ρ_n in place of P and ρ.
It is clear that R_n(D) is the rate distortion function for a source that produces successive n-vectors independently according to P_n(x) when the distortion in the reproduction of a sequence of such n-vectors is measured by the arithmetic average of ρ_n over the successive vectors that comprise the sequence. Although the actual source, if stationary, will produce successive n-vectors that are identically distributed according to P_n(x), these vectors will not be independent. Hence, R_n(D) will be an overestimate of the rate needed to achieve average distortion D because it does not reflect the fact that the statistical dependence between successive source letters can be exploited to further reduce the required information rate. However, one expects that this dependence will be useful only near the beginning and end of each block and hence may be ignored for large n. This tempted Shannon [2] to define

    R(D) = lim_{n→∞} R_n(D).
Source coding theorems to the effect that the above prescription for calculating R(D) does indeed describe the absolute tradeoff between rate and distortion have been proven under increasingly general conditions by a variety of authors [2, 4-9].
A sufficient but by no means necessary set of conditions is the following:
(i) The source is strictly stationary.
(ii) The source is ergodic.
(iii) There exist g < ∞ and ρ_g : A^g × B^g → [0, ∞] such that

    ρ_n(x, y) = (n - g + 1)^{-1} Σ_{t=1}^{n-g+1} ρ_g(x_t, x_{t+1}, ..., x_{t+g-1}; y_t, ..., y_{t+g-1}).

(This is called a distortion measure of span g and can be used to reflect context dependencies when assigning distortions.)
(iv) E_X ρ_g(X_1, ..., X_g; y) < ∞ for some y ∈ B^g, where E_X denotes expectation over a vector X = (X_1, ..., X_g) of g successive random source outputs.
Comments. Gray and Davisson [9] recently claim to have been able to remove the need for condition (ii). Condition (iv) is automatically satisfied for finite-alphabet sources; it is needed only when the theory is extended to the important case of sources with continuous, possibly unbounded, alphabets.

Under conditions (i) and (iii) the R_n(D) curves will be monotonically nonincreasing in n for n ≥ g. They have the physical significance that no system that operates independently on successive n-vectors from the source can perform below R_n(D). However, the corresponding positive source coding theorem, that systems can be found that operate arbitrarily close to R_n(D), is true only in the limit n → ∞.
We now give some explicit examples of rate distortion functions.
Example 1. Binary Source and Error-Frequency Distortion

    A = B = {0, 1},    ρ_jk = 1 - δ_jk = 0 if j = k, 1 if j ≠ k.

Assume that zeros are produced with probability 1 - p ≥ 1/2 and ones with probability p ≤ 1/2. A simple computation shows that

    R(D) = h(p) - h(D),    0 ≤ D ≤ p,

where h(x) = -x log₂x - (1 - x) log₂(1 - x) is the binary entropy function; in particular, for p = 1/2,

    R(D) = 1 + D log₂D + (1 - D) log₂(1 - D).
This formula equals the Gilbert bound, which is a lower bound on the exponential rate of growth of the maximum number of vectors that can be packed into binary n-space such that no two of them disagree in fewer than nD places. Rate distortion theory yields a different interpretation of this curve in terms of covering theory. Namely, R(D) = 1 + D log₂D + (1 - D) log₂(1 - D) is the exponential rate of growth of the minimum number of vectors such that every point in binary n-space differs from at least one of these vectors in fewer than nD places. Note that in the covering problem the result is exact, whereas in the corresponding packing problem the tightest upper bounds known (Elias, Levenshtein [11]) are strictly greater than the Gilbert bound*. In this sense at least, covering problems are simpler than packing problems.
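A quick numerical rendering of this example (Python with NumPy; the probabilities and the grid of D values are illustrative) evaluates R(D) = h(p) - h(D) in bits, which for p = 1/2 reduces to the covering exponent 1 + D log₂D + (1 - D) log₂(1 - D).

```python
import numpy as np

def h2(x):
    """Binary entropy in bits, with the convention h2(0) = h2(1) = 0."""
    x = np.clip(x, 1e-300, 1.0)
    return -(x * np.log2(x) + (1 - x) * np.log2(np.clip(1 - x, 1e-300, 1.0)))

def R_binary(D, p=0.5):
    """R(D) = h(p) - h(D) bits for 0 <= D <= p, and 0 for D >= p."""
    return np.maximum(h2(p) - h2(np.minimum(D, p)), 0.0)

D = np.linspace(0.0, 0.5, 6)
print(R_binary(D, p=0.5))    # equals 1 + D log2 D + (1 - D) log2(1 - D) on [0, 1/2]
print(R_binary(D, p=0.25))   # for a biased source, zero beyond D = p = 0.25
```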
Example 2. Gaussian Source and Mean Squared Error (MSE) [Shannon, 1948,
and 1959]
* It is widely believed that the Gilbert bound is tight in the packing problem, too, but no proof has been found despite more than 20 years of attempts.
A = B = ℝ, ρ(x, y) = (x - y)², and the source letters are independent Gaussian random variables with E X_i = 0 and E X_i² = σ². Then

    R(D) = (1/2) log( σ² / D ),    0 ≤ D ≤ D_max = σ².

Example 3. Time-Discrete Stationary Gaussian Source with Memory and MSE

If {X_i} is a zero-mean stationary Gaussian sequence with spectral density Φ(ω), |ω| ≤ π, then R(D) is given parametrically in terms of θ by

    D_θ = (1/2π) ∫_{-π}^{π} min[ θ, Φ(ω) ] dω
and

    R(D_θ) = (1/4π) ∫_{-π}^{π} max[ 0, log( Φ(ω)/θ ) ] dω.
The cross-hatched region in the accompanying sketch is the area under the so-called "error spectrum", min[θ, Φ(ω)], associated with the parameter θ. The optimum system for rate R(D_θ) will yield a reproduction sequence {Y_i} which is Gaussian and stationarily related to {X_i}, with the error sequence {X_i - Y_i} having time-discrete spectral density min[θ, Φ(ω)], |ω| ≤ π.
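The parametric prescription is easy to evaluate numerically. The sketch below (Python with NumPy; the first-order autoregressive spectrum and the list of θ values are my own illustrative assumptions) averages min[θ, Φ(ω)] and max[0, log(Φ(ω)/θ)] over a frequency grid to produce (D_θ, R(D_θ)) pairs.

```python
import numpy as np

def gaussian_rd_curve(Phi, thetas, n_grid=4096):
    """Parametric (D_theta, R(D_theta)) points for a stationary Gaussian sequence with
    spectral density Phi(w), |w| <= pi, under mean squared error (R in nats per letter)."""
    w = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    S = Phi(w)
    points = []
    for th in thetas:
        D = np.mean(np.minimum(th, S))                       # (1/2pi) integral of min[theta, Phi]
        R = 0.5 * np.mean(np.maximum(0.0, np.log(S / th)))   # (1/4pi) integral of max[0, log(Phi/theta)]
        points.append((D, R))
    return points

# Illustrative spectrum of a unit-variance first-order autoregressive source, phi(k) = a^|k|:
a = 0.9
Phi = lambda w: (1 - a**2) / (1 - 2 * a * np.cos(w) + a**2)

for D, R in gaussian_rd_curve(Phi, thetas=[0.01, 0.05, 0.2, 1.0]):
    print(f"D = {D:.4f}   R = {R:.4f} nats")
```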
Let

    x = { x(t), 0 ≤ t ≤ T },    y = { y(t), 0 ≤ t ≤ T }

and

    ρ_T(x, y) = (1/T) ∫_0^T [ x(t) - y(t) ]² dt.
The formulas for D_θ and R(D_θ) from the time-discrete case of Example 3 continue to apply except that the limits on the integrals become ±∞ instead of ±π and the spectral density Φ(ω) is defined by

    Φ(ω) = ∫_{-∞}^{∞} φ(τ) e^{-jωτ} dτ,

where

    φ(τ) = E( X(t) X(t + τ) ).
In the special case of an ideal bandlimited process,

    Φ(ω) = Φ_0 for |ω| ≤ 2πW,    Φ(ω) = 0 for |ω| > 2πW,

the results specialize to D_θ = 2Wθ and R(D_θ) = W log(Φ_0/θ). Eliminating θ yields the explicit result

    R(D) = W log( 2WΦ_0 / D ).
Since 2WΦ_0 is the power S of the source signal and the mean squared error D in the reconstruction of the signal can be considered as an effective additive noise power N, the above result often is written as

    R(D) = W log( S/N ).
Let B_j denote the [n, j] linear code with generators y_1, ..., y_j (if the generators are not linearly independent, then not all 2^j of the y ∈ B_j are distinct). For completeness, let B_0 = {0}. Define

    F_j = ∪_{y ∈ B_j} S_ℓ(y),

where S_ℓ(y) denotes the set of binary n-vectors that differ from y in at most ℓ places, and let N_j = ||F_j|| denote the cardinality of F_j. It should be clear that N_j is the number of source vectors x that can be encoded by at least one y ∈ B_j with error frequency ℓ/n or less. Accordingly, the probability Q_j that a random vector X produced by the (equiprobable binary) source cannot be encoded by B_j with average distortion ℓ/n or less is

    Q_j = 1 - 2^{-n} N_j.

Now B_{j+1} = B_j ∪ (B_j + y_{j+1}), so that F_{j+1} = F_j ∪ F_j', where

    F_j' = ∪_{y ∈ B_j} S_ℓ( y + y_{j+1} ) = { v + y_{j+1} : v ∈ F_j }.

Since the average over the 2^n possible choices of y_{j+1} of ||F_j ∩ F_j'|| is 2^{-n} N_j², the generator y_{j+1} can always be chosen so that

    N_{j+1} ≥ 2 N_j - 2^{-n} N_j²,

whence

    Q_{j+1} = 1 - 2^{-n} N_{j+1} ≤ 1 - 2·2^{-n} N_j + ( 2^{-n} N_j )² = ( 1 - 2^{-n} N_j )² = Q_j².
Since Q_0 = 1 - 2^{-n} N_0, repeated application of this recursion yields

    Q_k ≤ Q_0^{2^k} = ( 1 - 2^{-n} N_0 )^{2^k} ≤ exp( -2^{k-n} N_0 ),

where we have used the fundamental inequality in the last step. Now let n → ∞ and k → ∞ in such a way that the code rate k/n → R, and let ℓ → ∞ such that ℓ/n → D. It then follows that the probability Q_k that the source produces a word that cannot be encoded with distortion D or less will tend to zero provided

    lim 2^{k-n} N_0 = ∞    ( n → ∞, k/n → R, ℓ/n → D ).

Since by Stirling's approximation we know that 2^{-n} N_0 ≐ 2^{-n[ 1 - h(D) ]}, this condition is satisfied whenever R > 1 - h(D) = R(D).
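The existence argument above can be mimicked computationally for very small parameters. The sketch below (Python with NumPy; the greedy selection of each new generator from a random sample of candidates, and the particular n, ℓ and number of generators, are illustrative simplifications, whereas the argument in the text considers all 2^n candidate generators) tracks the uncovered fraction Q_j as generators are added.

```python
import itertools
import numpy as np

def greedy_covering(n=10, ell=2, k=5, candidates=64, seed=0):
    """Add generators one at a time, each time keeping the sampled candidate that
    maximizes N_j, the number of source words within Hamming distance ell of the code.
    Returns the uncovered fractions Q_j = 1 - 2^{-n} N_j after each generator."""
    rng = np.random.default_rng(seed)
    words = np.array(list(itertools.product((0, 1), repeat=n)), dtype=np.int8)
    code = np.zeros((1, n), dtype=np.int8)        # B_0 = {0}
    history = []
    for _ in range(k):
        best = None
        for cand in words[rng.choice(len(words), candidates, replace=False)]:
            trial = np.vstack([code, (code + cand) % 2])
            dmin = np.min(np.count_nonzero(words[:, None, :] != trial[None, :, :], axis=2), axis=1)
            Nj = int(np.sum(dmin <= ell))
            if best is None or Nj > best[0]:
                best = (Nj, trial)
        code = best[1]
        history.append(1.0 - best[0] / 2 ** n)
    return history

print(greedy_covering())   # the uncovered fraction Q_j drops rapidly as generators are added
```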
To encode a source word x, one first computes its syndrome

    s = H x^T

with respect to a parity-check matrix H of the code, and then finds a minimum-weight vector z having the same syndrome,

    s = H z^T.

Such a z is called the leader of the coset C_s = { v : H v^T = s }. Once z has been found, which is the hardest part of the procedure, one then encodes (approximates) x by y = x + z. Said y is a code word since H y^T = H x^T + H z^T = s + s = 0 (mod 2). Hence, y is expressible as a linear combination (mod 2) of the k generators. This means that a string of k binary digits suffices for specification of y. In this way the n source digits are compressed to only k digits that need to be transmitted. The resulting error frequency is n^{-1} wt(x + y) = n^{-1} wt(z). Accordingly, the expected error frequency D is the average weight of the coset leaders,

    D = n^{-1} 2^{-(n-k)} Σ_{i=1}^{2^{n-k}} w_i,

where w_i denotes the weight of the leader of the i-th coset.
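As a concrete illustration (Python; the choice of the [7,4] Hamming code is mine and is not an example worked in the text), one can enumerate the 2^{n-k} cosets of a short code, take a minimum-weight leader from each, and average the leader weights to obtain the expected error frequency D of the resulting source code.

```python
import itertools
import numpy as np

# Parity-check matrix of the [7,4] binary Hamming code (columns are 1, ..., 7 in binary).
H = np.array([[0, 0, 0, 1, 1, 1, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1]])
n, n_minus_k = H.shape[1], H.shape[0]

# Find the weight of a minimum-weight leader for every syndrome s = H x^T (mod 2).
leader_weight = {}
for x in itertools.product((0, 1), repeat=n):
    x = np.array(x)
    s = tuple(H @ x % 2)
    w = int(x.sum())
    if s not in leader_weight or w < leader_weight[s]:
        leader_weight[s] = w

# Expected error frequency of the source code, for an equiprobable binary source:
# D = n^{-1} 2^{-(n-k)} * (sum of the coset-leader weights).
D = sum(leader_weight.values()) / (n * 2 ** n_minus_k)
print(D)   # 7 / (7 * 8) = 0.125: every nonzero syndrome has a weight-1 leader
```

For comparison, the distortion-rate limit at rate k/n = 4/7 is roughly D ≈ 0.09, so this very short code falls well short of the theoretical tradeoff, as one expects at such a small block length.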
[Figure 5. A Binary Tree Code: each sequence of three binary path digits (000 through 111) specifies a reproducing word of length six, so the code has rate R = 1/2.]
The average per-letter distortion achieved by this code is 3/16 = 0.1875. The best that could be done in this instance by any code of rate R = 1/2 of any block length is D(R = 1/2) ≈ 0.110.
It should be clear that tree codes can be designed with any desired rate. If there are b branches per node and ℓ letters per branch, then the rate is

    R = ℓ^{-1} log b.

Also, the letters on the branches can be chosen from any reproducing alphabet whatsoever, including a continuous alphabet, so tree coding is not restricted to GF(q).
Tree encoding of sources was first proposed by Goblick (1962). Jelinek (1969) proved that tree codes can be found that perform arbitrarily close to the R(D) curve for any memoryless source and fidelity criterion; this result has been extended by Berger (1971) to a limited class of sources with memory. In order for tree coding to become practical, however, an efficient search algorithm must be devised for finding a good path through the tree. (Note that, unlike in sequential decoding for channels, one need not necessarily find the best path through the tree; a good path will suffice.) This problem has been attacked with some degree of success by Anderson and Jelinek (1971, 1973), by Gallager (1973), by Viterbi and Omura (1973), and by Berger, Dick and Jelinek (1973).
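To make the search problem tangible, here is a small sketch (Python; the random tree code, the depth, and the brute-force search are my own illustrative simplifications; a practical scheme would use one of the search algorithms cited above rather than examine every path).

```python
import itertools
import random

def random_tree_code(depth, letters_per_branch=2, seed=0):
    """A binary tree code of rate 1/letters_per_branch: every node has two branches,
    and each branch carries letters_per_branch randomly chosen reproducing letters."""
    rng = random.Random(seed)
    return {prefix + (b,): tuple(rng.randint(0, 1) for _ in range(letters_per_branch))
            for d in range(depth)
            for prefix in itertools.product((0, 1), repeat=d)
            for b in (0, 1)}

def best_path(tree, source_word, depth, letters_per_branch=2):
    """Exhaustive search (feasible only for small depth) for the path whose reproducing
    word is closest in Hamming distance to the source word."""
    best = None
    for path in itertools.product((0, 1), repeat=depth):
        word = [letter for d in range(depth) for letter in tree[path[:d + 1]]]
        dist = sum(a != b for a, b in zip(word, source_word))
        if best is None or dist < best[0]:
            best = (dist, path)
    return best

depth, L = 8, 2
tree = random_tree_code(depth, L)
rng = random.Random(1)
source = [rng.randint(0, 1) for _ in range(depth * L)]
dist, path = best_path(tree, source, depth, L)
print(path, dist / (depth * L))   # the chosen path and its per-letter error frequency
```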
In conclusion, it should be mentioned that delta modulation and more
general differential PCM schemes are special examples of tree encoding procedures.
However, even the adaptive versions of such schemes are necessarily sub-optimal
because they make their branching decisions instantaneously on the basis solely of
past and present source outputs. Performance could be improved by introducing
some coding delay in order to take future source outputs into consideration, too.
LECTURE 7
When this is plotted with a logarithmic D-axis, it appears as a straight line with negative slope, as sketched in Figure 6.

[Figure 6. R(D) in bits (logarithmic D-axis), together with the performance of Lloyd-Max quantizers, uncoded and entropy coded.]

Parallel to R(D) but approximately 1/4 bit higher lies the performance curve of the best entropy-coded quantizers. The small separation indicates that there is very little to be gained by more sophisticated encoding techniques. This is not especially surprising given the memoryless nature of both the source and the fidelity criterion. It must be emphasized, however, that entropy coding necessitates the use of a buffer to implement the conversion to variable-length codewords. Moreover, since the optimum quantizer has nearly uniformly spaced levels, some of these levels become many times more probable than others, which leads to difficult buffering problems. Furthermore, when the buffer overflows it is usually because of an inordinately high local density of large-magnitude source outputs. This means that the per-letter MSE incurred when buffer overflows occur tends to be even bigger than σ².
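The comparison described here can be reproduced approximately with a short experiment (Python with NumPy; the sample-based Lloyd iteration, the number of levels and the sample size are my own illustrative choices): design a Lloyd-Max quantizer for the unit-variance memoryless Gaussian source, then compare its output entropy and MSE with R(D) = (1/2) log₂(σ²/D).

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)        # memoryless, unit-variance Gaussian source

def lloyd_max(samples, n_levels=8, iters=50):
    """Sample-based Lloyd iteration: alternate nearest-level partitioning and
    conditional-mean (centroid) updates of the output levels."""
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels, idx

levels, idx = lloyd_max(samples, n_levels=8)
mse = float(np.mean((samples - levels[idx]) ** 2))
p = np.bincount(idx, minlength=len(levels)) / len(idx)
entropy_bits = -np.sum(p[p > 0] * np.log2(p[p > 0]))
rd_bits = 0.5 * np.log2(1.0 / mse)            # R(D) in bits for the unit-variance Gaussian source
print(f"MSE = {mse:.4f}, quantizer entropy = {entropy_bits:.3f} bits, R(D) = {rd_bits:.3f} bits")
```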
A. Block Coding
... asymptotic rates.
4. Prove that the ensemble performance of codes with generators chosen independently at random approaches R(D) as n → ∞.
5. Extend the theory satisfactorily to nonequiprobable sources.
B. Tree Coding
1. Show that there are good convolutional tree codes for sources. (For continuous-amplitude sources the tapped shift register is replaced by a feed-forward digital filter.)
2. Find better algorithms for finding satisfactory paths through the tree (or trellis).
where n_k = Σ_j n_jk and the C_jk are so-called "free energy constants" that can be experimentally measured. The minimization naturally is subject to the mass balance conditions Σ_k n_jk = n_j and the constraints n_jk ≥ 0. Letting n = Σ_j n_j and making the obvious associations

    n_j ↔ n P_j
[6] BERGER, T., (1968) "Rate Distortion Theory for Sources with Abstract Alphabets and Memory", Information and Control, 13, 254-273.
[7] GOBLICK, T.J., Jr. (1969) "A Coding Theorem for Time-Discrete Analog Data Sources", Trans. IEEE, IT-15, 401-407.
[8] BERGER, T., (1971) "Rate Distortion Theory: A Mathematical Basis for Data Compression", Prentice-Hall, Englewood Cliffs, N.J.
[9] GRAY, R.M., and L.D. DAVISSON (1973) "Source Coding Without Ergodicity", Presented at 1973 IEEE Intern. Symp. on Inform. Theory, Ashkelon, Israel.
[10] GOBLICK, T.J., Jr. (1962) "Coding for a Discrete Information Source with a Distortion Measure", Ph.D. Dissertation, Elec. Eng. Dept., M.I.T., Cambridge, Mass.
[12] HAMMING, R.W., (1950) "Error Detecting and Error Correcting Codes", BSTJ, 29, 147-160.
[14] BERGER, T., and J.A. VAN DER HORST (1973) "BCH Source Codes", Submitted to IEEE Trans. on Information Theory.
[15] POSNER, E.C., (1968) in Mann, H.B. (ed.), "Error Correcting Codes", Wiley, N.Y., Chapter 2.
[17] JELINEK, F., and J.B. ANDERSON (1971) "Instrumentable Tree Encoding of Information Sources", Trans. IEEE, IT-17, 118-119.
[18] ANDERSON, J.B., and F. JELINEK (1973) "A Two-Cycle Algorithm for
Source Coding with a Fidelity Criterion", Trans. IEEE, IT-19, 77-92.
[19] GALLAGER, R.G., (1973) "Tree Encoding for Symmetric Sources with a Distortion Measure", Presented at 1973 IEEE Int'l. Symp. on Information Theory, Ashkelon, Israel.
[20] VITERBI, A.J., and J.K. OMURA (1974) "Trellis Encoding of Memoryless
Discrete-Time Sources with a Fidelity Criterion", Trans. IEEE,
IT-20, 325-332.
[21] BERGER, T., R.J. DICK and F. JELINEK (1974) "Tree Encoding of
Gaussian Sources", Trans. IEEE, IT-20, 332-336.
[22] BERGER, T., F. JELINEK and J.K. WOLF (1972) "Permutation Codes
for Sources", Trans. IEEE, IT-18, 160-169.
[23] SLEPIAN, D., (1965) "Permutation Modulation", Proc. IEEE, 53, 228-236.
[24] BERGER, T., (1972) "Optimum Quantizers and Permutation Codes", Trans.
IEEE, IT-18, 759-765.