DSP Algorithm Architectures for Telecommunications
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Digital communication . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Digital communication systems 4
1.1.2 Modulation 5
1.2 The JPEG and MPEG standards . . . . . . . . . . . . . . . 6
1.3 The DMT transmission technique . . . . . . . . . . . . . . 8
1.3.1 DMT modulation 9
1.3.2 Frequency allocation 11
1.3.3 The DMT symbol 12
1.3.4 The splitter 12
1.4 Scope of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Variable length codes . . . . . . . . . . . . . . . . . . . . . . 35
3.2 The VLC decoding process . . . . . . . . . . . . . . . . . . 36
3.2.1 Tree based decoding 36
3.2.2 Symbol parallel decoding 37
3.3 VLC decoder with simplified length decoder . . . . 38
3.4 VLC decoder with pipelined length decoder . . . . 39
3.5 VLC decoder with symbol decoder partitioning . . 40
3.6 Length decoder implementation . . . . . . . . . . . . . . 40
3.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Paper 1
Paper 2
4 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Paper 3
Paper 4
3 HARDWARE PARTITIONING . . . . . . . . . . . . . . . . 111
3.1. Interface design 112
4 HARDWARE/SOFTWARE TRADE-OFFS . . . . . . 113
4.1. Huffman processor 113
4.2. IDCT processor 114
5 CONCLUSIONS AND FURTHER WORK . . . . . . . 114
6 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Paper 5
Paper 6
Paper 7
3 CORRECTION OF OFFSET IN DMT MODEMS . 146
3.1. DMT based communication system 146
3.2. Correction of offset before connection 148
3.3. Correction of offset during initialization 148
3.3.1. Activation 148
3.3.2. Modem training 149
3.4. Correction of offset during transmission 149
4 SIMULATION RESULTS . . . . . . . . . . . . . . . . . . . . 150
5 HARDWARE ARCHITECTURE . . . . . . . . . . . . . . 151
6 ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . 151
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Paper 8
6 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Paper 9
Paper 10
4 A 4-BIT CONVERTER EXAMPLE . . . . . . . . . . . . 188
5 SIMULATION RESULTS . . . . . . . . . . . . . . . . . . . . 190
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Abbreviations and Acronyms
ADC Analog-to-digital converter
ADSL Asymmetric digital subscriber line
ASDSP Application specific digital signal processor/processing
CO Central office
CPE Customer premises equipment
DAC Digital-to-analog converter
DCT Discrete cosine transform
DEM Dynamic element matching
DMT Discrete multi tone
DNL Differential nonlinearity
DSL Digital subscriber line
DSP Digital signal processing
EC Echo cancelling
EXU Execution unit
GSM Global system for mobile communications
IDCT Inverse discrete cosine transform
IFFT Inverse fast Fourier transform
INL Integral nonlinearity
FDM Frequency division multiplex
FEQ Frequency domain equalizer
FFT Fast Fourier transform
FIFO First in first out
FIR filter Finite impulse response filter
HDSL High bit-rate digital subscriber line
JPEG Joint Photographic Experts Group
MDSP Modular digital signal processor
MPEG Moving Picture Experts Group
OFDM Orthogonal frequency division multiplexing
PAR Peak to average ratio
POTS Plain old telephone service
SFDR Spurious free dynamic range
SFG Signal flow graph
SHDSL Symmetric high bit-rate digital subscriber line
SNR Signal to noise ratio
SNDR Signal to noise and distortion ratio
TDM Time division multiplexing
TEQ Time domain equalizer
TIADC Time-interleaved analog-to-digital converter
VDSL Very high speed digital subscriber line
VLC Variable length code
QAM Quadrature amplitude modulation
1 Introduction
This thesis consists of two parts. Part one provides a background to the
applications of interest and the problems relevant for this thesis, while part
two consists of a selection of publications. The research has been carried out
in the period 1995 to 2001. The publications consider hardware implementation
of signal processing in telecommunication systems, ranging from coding of
images to transmission over wideband digital subscriber lines.
Since the channel capacity is limited, there is a need for techniques that can
reduce the required channel capacity for a given service. Three important areas
where compression techniques are widely used for better utilization of the
channel capacity are transmission of speech, images, and video. In a mobile
phone system, voice data is compressed from 64 kbit/s down to 11.4 kbit/s
(GSM, half-rate) while keeping an acceptable speech quality [2].
For image and video transmission the JPEG [3] and MPEG [4] standards are
widely used. It is interesting to note that even though the available bandwidth
keeps increasing, compression of image and video signals will be crucial for
many years to come. Transmitting standard resolution video with acceptable
quality requires 1.5-2.5 Mbit/s with compression. Transmitting uncompressed
video is not even an option today, since this would require data rates above
50 Mbit/s.
[Figure 1.1: A digital communication system. The transmit path consists of a
source encoder, a channel encoder, and a modulator; the signal then passes
through a noisy channel; the receive path consists of a detector, a channel
decoder, and a source decoder.]
1.1.2 Modulation
Modulation is the way information is mapped onto a signal. The transmitted
information is divided into symbols, where one symbol has a finite duration.
The information content is encoded into the shape of the waveform during the
symbol period. Common ways to encode the information are to put it into the
amplitude and/or phase of the waveform. In this thesis we will mainly consider
the quadrature amplitude modulation (QAM) technique and its relatives.
In QAM the information is mapped onto a carrier, which often is a sinusoid,
using different phases and amplitudes. The transmitted signal s_i(t) is a
sinusoid with four possible phases ϕ_i. These phases are created by varying
a and b in Eq. 1.2, where the sin and cos terms are the basis functions and
E is a constant related to the transmitted energy [6].
[Figure 1.2: 4-QAM constellation in the (a, b) plane with the four points
labeled (00), (01), (10), and (11).]
The detector decides how to interpret the received symbol. When using 4-QAM
the decision is based on which of the four possible constellation points is
closest to the received symbol. The decision boundaries are outlined as shaded
areas in Fig. 1.2. The distance between the
received constellation point and the ideal position is a measure of the noise
level in the channel. If the noise level is too high the detector may not be
able to correctly detect which symbol was originally sent from the transmitter,
and a bit error results. In order to reduce the probability of bit errors it is
common to introduce coding, where redundancy is added to the signal in a
controlled way so that some bit errors can be corrected.
In the general case we can allow more than four points in the constellation,
which is here referred to as M-ary QAM. A typical case with 16 possible points
in the constellation is shown in Fig. 1.3. When more than four points are used
in a QAM constellation, both the amplitude and the phase are used as signal
carriers.
[Figure 1.3: 16-QAM constellation in the (a, b) plane with four bits per
point, e.g. (0000), (0001), (0011), (0010) in the top row and (1100), (1101),
(1111), (1110) in the bottom row.]
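The nearest-point decision described above can be sketched as follows. This is an illustrative model only, not code from the thesis: the bit-to-quadrant assignment and unit coordinates are assumptions following Fig. 1.2, and the sqrt(E) scaling and pulse shaping are omitted.

```python
# Hypothetical 4-QAM constellation: bit pair -> (a, b) coordinates.
# The quadrant assignment is an assumption, not taken from the thesis.
CONSTELLATION = {
    (0, 0): (-1.0, +1.0),
    (0, 1): (+1.0, +1.0),
    (1, 0): (-1.0, -1.0),
    (1, 1): (+1.0, -1.0),
}

def modulate(bits):
    """Map a pair of bits onto a 4-QAM constellation point (a, b)."""
    return CONSTELLATION[bits]

def detect(received):
    """Decide on the constellation point closest to the received symbol."""
    def dist2(point):
        return (point[0] - received[0]) ** 2 + (point[1] - received[1]) ** 2
    return min(CONSTELLATION, key=lambda bits: dist2(CONSTELLATION[bits]))
```

A noisy received symbol at (0.7, 0.9), for instance, falls in the decision region of the point labeled (01), so `detect((0.7, 0.9))` returns `(0, 1)`.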
tion. The quantization has been optimized for how sensitive humans are to
different frequencies in the images; the human eye is less sensitive to noise
at higher frequencies than at lower ones.
Further data compression is achieved using Run-Length-Zero (RLZ) coding and
variable length coding. Neither of these methods removes information; they
only find a more efficient representation. In RLZ coding, long sequences of
zeros are replaced with the number of zeros in a row followed by the next
nonzero value. For example, the sequence {0,0,0,0,3} is replaced with {4,3}.
In variable length coding, the frequency of the RLZ coded data determines the
number of bits used for representing a value. For instance, we may assign a
shorter representation to the RLZ coded value {0,1} than to {4,1}, which is
less frequent. More about variable length codes and decoding of variable
length coded data can be found in Chapter 3.
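The RLZ step can be sketched as a pair of functions. This is an illustrative model, not the coder from the standards: real codecs also need an end-of-block symbol for trailing zeros, which is omitted here.

```python
def rlz_encode(seq):
    """Replace each run of zeros followed by a nonzero value with a
    (zero_count, value) pair, e.g. [0, 0, 0, 0, 3] -> [(4, 3)].
    Trailing zeros are dropped (a real codec uses an end-of-block symbol)."""
    out, zeros = [], 0
    for v in seq:
        if v == 0:
            zeros += 1
        else:
            out.append((zeros, v))
            zeros = 0
    return out

def rlz_decode(pairs):
    """Expand (zero_count, value) pairs back into the original sequence."""
    out = []
    for zeros, v in pairs:
        out.extend([0] * zeros)
        out.append(v)
    return out
```

Encoding the example from the text, `rlz_encode([0, 0, 0, 0, 3])` yields `[(4, 3)]`, and decoding restores the original sequence.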
A common format for compressed video is the MPEG-2 Video standard [4]. There
are many similarities between the JPEG and MPEG standards. Both standards use
DCT, RLZ, and VLC for compression of images. The main difference is that the
MPEG-2 Video standard not only compresses the digital images one by one but
also considers similarities between adjacent images in the video stream. To
accomplish this, a motion estimation unit is needed in the video encoder. The
motion estimation searches for similarities between images in the video
sequence and is the most resource-demanding algorithm in the MPEG encoder.
Instead of transmitting the image data, only the difference between images may
be transmitted in those cases where this is more efficient. While the MPEG
encoding does not have to be made in real time, the decoding
task does, since the decoding is made while the video stream is watched.
Real-time MPEG decoding is therefore more important than real-time encoding.
An outline of an MPEG-2 decoder is shown in Fig. 1.5.
[Figure 1.5: Outline of an MPEG-2 decoder: the input stream passes through an
input buffer and parser, a VLC decoder (VLD), an RLZ decoder, the inverse
quantizer (IQ), the IDCT, motion compensation, and picture reordering,
producing the decoded video.]
The first step in the decoder is to extract the control information, which
contains information about what type of coding has been used, the image size,
and so on. The VLC and RLZ decoders reverse the operations of the VLC and RLZ
encoders. The Inverse Quantizer (IQ) multiplies the coefficients by the
quantization coefficients used in the quantizer, which restores the signal
levels at each frequency. The Inverse Discrete Cosine Transform (IDCT)
transforms the image back from the frequency domain to the spatial domain. If
only the difference between two images has been transmitted, the image data is
restored by adding the previously transmitted image to the received difference
image. Finally, the images may have to be re-ordered, since the encoder
performs a picture re-ordering to better exploit similarities between adjacent
images in the video stream.
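As a textbook illustration of the IDCT step (not the hardware algorithm used in the papers), a naive one-dimensional DCT/IDCT pair can be written as below; JPEG and MPEG apply the transform row- and column-wise on 8x8 blocks, and the usual normalization factors are folded into the inverse here.

```python
import math

def dct(x):
    """Naive 1-D DCT-II of a length-N block (unnormalized)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
            for k in range(N)]

def idct(X):
    """Naive 1-D inverse of dct() above (DCT-III with the 2/N scaling
    and the halved DC term), restoring the original block."""
    N = len(X)
    return [(2.0 / N) * (X[0] / 2.0 +
                         sum(X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                             for k in range(1, N)))
            for n in range(N)]
```

Applying `idct(dct(x))` to an 8-sample block recovers `x` up to rounding, which is the round-trip property the IQ/IDCT stages rely on.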
The data rates in ADSL are up to 9 Mb/s from the CO to the CPE side, and up to
1 Mb/s in the other direction. The reason for providing higher bit rates in
the downstream direction is that the need for high data rates is assumed to be
higher in this direction.
The very high speed DSL (VDSL) standard will provide data rates up to 50 Mb/s.
The standardization of VDSL has, however, been delayed, much due to problems
with agreeing on which modulation method is most suitable. Much of the work in
this thesis has been based on a VDSL technique proposed by Ericsson, which is
based on the Discrete Multi Tone (DMT) modulation scheme that is also used in
the ADSL standard [8].
[Figure 1.6: An ADSL access network: ADSL modems at the central office (CO)
connect services such as the internet and TV, over the subscriber line, to an
ADSL modem and a PC on the customer side.]
by using the numerically equivalent fast transforms IFFT and FFT. The
constellation size on each carrier is dynamically adapted to a varying noise
level by using a "bit swapping" algorithm [9].
The main blocks in a DMT modem are outlined in Fig. 1.7. In addition to the
outlined blocks we also need blocks for clock recovery and symbol
synchronization, as well as serial/parallel converters, etc. These blocks have
been excluded to simplify the explanation of the basic idea behind DMT
communication.
The information is put into frames and symbols in the block called the framer.
Redundancy information is added in the forward error correction (FEC) block,
which makes it possible to detect and correct some transmission errors. The
Reed-Solomon decoder (RS-decoder) is used for correction of transmission
errors. There are two transmission paths, one with an interleaver and one
without. The interleaver/deinterleaver pair spreads transmission errors in
time, which increases the error correction performance of the RS-decoder.
Unfortunately, the delay through the system is also increased, which causes
problems for instance in two-way communication. Therefore there is also a
non-interleaved transmission path that can be used for delay-sensitive
applications.
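The error-spreading idea can be illustrated with a simple row/column block interleaver. This is a sketch under assumed dimensions; the actual interleaver type and parameters used in the modem are not specified here.

```python
def interleave(data, rows, cols):
    """Write the stream row by row into a rows x cols array and read it
    out column by column (len(data) must equal rows * cols)."""
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(data, rows, cols):
    """Inverse permutation: read back what interleave() wrote."""
    return [data[c * rows + r] for r in range(rows) for c in range(cols)]
```

A burst of consecutive errors on the line hits samples that sit `rows` positions apart after deinterleaving, so the Reed-Solomon decoder sees them as scattered single errors rather than one long burst.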
EC stands for echo cancelling, which is needed if the data transmitted in the
upstream and downstream directions share the same frequency space. In this
case the received signal will contain some of the transmitted signal, which
must be removed in order not to disturb the decoder.
TEQ is the time domain equalizer, and FEQ is the frequency domain equalizer.
The task of the equalizers is to work as an "inverse filter" to the channel
impulse response so that the original signal is restored, giving a signal as
close to the transmitted signal as possible. By using two equalizers, the
total complexity of implementing the equalization is reduced compared with
using only a TEQ.
The analog frontend contains analog filters and a line driver. Sometimes the
digital-to-analog (DAC) and analog-to-digital (ADC) converters, as well as the
digital interpolation and decimation filters, are also counted as parts of the
analog frontend.
VDSL uses a wider range of frequencies than ADSL. In this work we have aimed
at frequencies up to around 11 MHz, which may change slightly when the
standard is set. From the beginning, a time-division multiplexing scheme was
proposed, where transmission took place in only one direction at a time. The
current proposal does, however, state that different frequencies are used
instead (frequency division multiplexing, FDM). The frequency plan has not
been completely finalized yet, but it seems clear that there will be several
downstream as well as upstream bands in the final standard.
[Figure: The DMT symbol with its cyclic prefix, which is created by copying
the last part of the symbol to its beginning.]
In addition to the user data there are extra fields in the symbol that contain
information used by the two modems to exchange system parameters and other
control information.
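The cyclic prefix construction can be sketched directly on sample lists (an illustrative model; symbol and prefix lengths are arbitrary here):

```python
def add_cyclic_prefix(symbol, prefix_len):
    """Copy the last prefix_len samples of the DMT symbol to its front."""
    return symbol[-prefix_len:] + symbol

def remove_cyclic_prefix(rx, prefix_len):
    """Discard the prefix samples at the receiver before the FFT."""
    return rx[prefix_len:]
```

As long as the channel impulse response is shorter than the prefix, discarding the prefix at the receiver turns the linear convolution with the channel into a circular one, which is what makes the per-carrier (FEQ) equalization possible.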
[Figure: The splitter separates the POTS band from the ADSL band on the
twisted pair, with a high-pass (HP) filter towards the ADSL modem.]
The reason for keeping the POTS installation, instead of running all
communication over the ADSL modem, is that it has been considered very
important to have a connection that works even during a power failure. A POTS
system gets its power from the twisted pair cable, but today an ADSL modem
cannot be powered from the twisted pair, and therefore the POTS system is kept
as a lifeline. More information about the DMT technique can be found in
[10,11,12].
Another area that has been studied in this work is how DSP algorithms can be
used to improve the performance of A/D and D/A converters. By identifying
errors and then trying to correct them, or by spectrally moving the
distortion, the data converter performance can be increased.
In the publications [15,16,17], architectures for fast decoders of variable
length codes (VLCs) are proposed. Variable length codes are not used directly
in digital communication, but they are often used in the data streams that are
transmitted over the communication channel. In both digital audio and video,
VLCs are used to reduce the amount of data that must be transmitted; the MPEG
and JPEG standards, which are used to compress images and video sequences, are
therefore important examples. Much of this work has also been reported in
[18], and some additional discussion is given in Chapter 3.
The design process for efficiently designing application specific processors
was studied in the papers [19,20,21]. This work is a continuation of the work
made by K.G. Andersson [22,23], but with improvements that include a better
ability to reuse old designs and an efficient way to synthesize the
architectures. We present two case studies, where the first [19] is an ASDSP
for decoding JPEG images and the second [20] is an ASDSP for the Fast Fourier
Transform (FFT). A synthesis tool for making the design path more efficient is
reported in [21]. The design process is further discussed in Chapter 2.
The last four papers cover distortion reduction techniques in D/A and A/D
converters. Signal processing algorithms have been developed that can be used
to increase the performance of the data converters. In [24] we propose a
method that can cancel offset errors in a time interleaved A/D converter by
utilizing the receiver, which in this case is a digital modem. This method has
also been the subject of a patent application [25]. A method to cancel gain
and skew mismatch in an A/D converter is proposed in [26].
In [27,28,29], architectures are proposed that make it possible to trade
between glitches and mismatch in the weights of a current-source D/A
converter. Data converters are discussed in Chapter 4.
Related to this thesis and the publications [25-29] is the tutorial "A/D and
D/A Converters for Telecom Applications" that was held at ICECS'2001 [30]. In
this tutorial we tried to relate the distortion reduction methods both to each
other and to applications.
Most of the work has been carried out within an industrial research project
called VIBRA at Ericsson Microelectronics AB. The aim of VIBRA was to develop
analog and digital building blocks for DSL based systems. VIBRA has had strong
connections to other research projects within Ericsson studying algorithms and
hardware for DSL systems. For secrecy reasons, the complete picture of how
this work relates to work within other parts of Ericsson cannot be presented
in this thesis.
2 Digital Signal Processing Architectures
[Figure: Signal flow graph of a first-order recursive filter, an example of
hard real-time processing, with input x(n), output y(n), a delay element, and
the coefficient 0.5.]

y(n) = x(n) - (1/2)·y(n-1) . (2.1)
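Eq. 2.1 can be executed sample by sample (a simple behavioral Python model, not a hardware description); note that each output depends on the previous output, which is the recursion that limits the attainable sample rate:

```python
def first_order_recursive(x):
    """Compute y(n) = x(n) - 0.5 * y(n-1) from Eq. 2.1, with y(-1) = 0."""
    y, prev = [], 0.0
    for xn in x:
        yn = xn - 0.5 * prev  # the recursive loop: needs the previous output
        y.append(yn)
        prev = yn
    return y
```

For a unit impulse the output is the decaying, sign-alternating sequence 1, -0.5, 0.25, -0.125, ...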
T_min = max_i ( T_OPi / N_i ) (2.2)

where T_OPi is the total operation latency in the recursive loop i, and N_i is
the number of delay elements found in the loop [32]. The critical loop is the
loop that limits the sample rate. There are several ways of improving the
sample rate by various algorithm transformations, for instance moving
operations out of critical loops [14].
N_EXUk = ( N_k · T_k ) / T_min (2.3)

where T_k is the time required for an operation of type k and N_k is the
number of operations.
It is important to schedule the operations properly in order to reach a high degree
of utilization of the EXUs. The scheduling should also consider the dataflow
between the blocks in the architecture. Reducing the interconnect will also
reduce the parasitic load from the wires, and hence also the power consumption.
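Eqs. 2.2 and 2.3 can be evaluated as follows; the loop latencies and operation counts below are made-up example numbers, not figures from the thesis:

```python
import math

def iteration_bound(loops):
    """T_min = max over loops of (total operation latency / number of
    delay elements in the loop), Eq. 2.2. `loops` is a list of
    (T_OPi, N_i) pairs."""
    return max(t_op / n_delays for t_op, n_delays in loops)

def exu_count(n_ops, t_op, t_min):
    """Minimum number of type-k EXUs from Eq. 2.3, rounded up to an
    integer number of units."""
    return math.ceil(n_ops * t_op / t_min)
```

With two hypothetical loops, one of latency 10 through 2 delays and one of latency 9 through 3 delays, the first loop is critical and T_min = 5; scheduling 32 unit-latency operations of one type within that bound then needs ceil(32/5) = 7 EXUs.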
where P_switching is the power that is consumed every time a signal node
changes state, α is the average switching activity for all nodes in the
circuit, and C_L is the switched capacitance. The signal levels are assumed to
be 0 and V, with a power supply of V_DD.
P_sc is the power due to the short-circuit current that occurs when the NMOS
and PMOS transistors are active simultaneously, which may happen during
switching, giving a short-circuit current from V_DD to ground.
P_leak is due to the leakage current that arises from sub-threshold effects.
The relative contribution from P_leak is increasing because of the scaling of
threshold voltages in new process technologies: a reduction of the transistor
threshold voltages increases the leakage current, I_leak [34]. Future CMOS
processes will enable an increased amount of on-chip memory, which will give a
significant contribution to the total leakage current.
The static current I_static in a purely digital circuit mainly originates from
logic gates whose inputs have reduced swing. When using full-swing static
logic the static current will be low.
The power consumption in the different functional parts of a DSP system can be
partitioned into three components.
P_calc is the power consumed in the functional units, i.e. where the actual
algorithm is executed. P_calc grows approximately linearly with the number of
operations. To decrease this part, the computational complexity of the
implemented function should be decreased. This can be done by choosing another
algorithm or by trying to simplify the original one [33].
P_store is the power consumed when storing internal signal values during the
execution of the algorithm. The amount of storage needed mainly depends on
a) how many samples are needed to compute one output for a given algorithm,
and b) the architecture used for executing the algorithm. It is important to
reduce data movement between different memory elements in order to decrease
P_store. One way of doing this is, for instance, to implement a first-in
first-out (FIFO) buffer using a memory and a memory pointer instead of using a
shift register. The positioning of the storage elements is also important;
local storage may be less expensive than global memories. Low computational
complexity does not necessarily imply few load and store operations, and
P_calc and P_store should therefore be co-optimized [33].
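The FIFO-versus-shift-register point can be illustrated with a behavioral sketch: each push or pop updates one pointer and touches one memory location, whereas a shift register would move every stored sample on each cycle.

```python
class RingFifo:
    """FIFO built from a memory plus read/write pointers. Only the
    pointers advance; the stored data never moves, which is the
    power-saving idea compared with a shift register."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.depth = depth
        self.rd = self.wr = self.count = 0

    def push(self, value):
        assert self.count < self.depth, "FIFO overflow"
        self.mem[self.wr] = value          # one memory write
        self.wr = (self.wr + 1) % self.depth   # one pointer update
        self.count += 1

    def pop(self):
        assert self.count > 0, "FIFO underflow"
        value = self.mem[self.rd]          # one memory read
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return value
```

Values come out in the order they were pushed, exactly as a shift register would deliver them, but without the per-cycle movement of every sample.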
P_ctrl is the power consumed in the control unit, which among other things
controls the dataflow between the storage elements and the functional units.
The complexity of the controller depends on the datapath architecture, the
scheduling of operations, and the algorithm.
[Figure: A time-shared architecture: control units CU 1..K issue operation
control to the execution units EXU 1..N and storage/dataflow control to the
storage units STU 1..M, while status flags are fed back to the control units.]
It is also possible to mix the two strategies, isomorphic mapping and
time-sharing, by implementing efficient EXUs using isomorphic mapping and then
time-sharing these EXUs. For example, in the FFT algorithm the inner loop
contains a butterfly operation, which is often implemented using an isomorphic
mapping and then time-shared between the different butterflies in the FFT
[35,36].
The time-shared architecture adds complexity to the interconnect, the control
units, and possibly to the execution units as well. This extra control
overhead will increase the power consumption, and it is therefore essential to
keep the overhead as low as possible if the total power consumption is an
important design parameter.
one that can multiply with one pre-defined coefficient only. The interconnect
may need extensions that remove restrictions on the dataflow. The control unit
may also need to support more advanced data flows, for instance nested loops
and conditional jumps.
Adding more flexibility in the datapath and programmable control units will
increase the complexity of the architecture as well as the power consumption.
If the programmable architecture is to be used for a wide range of
applications, the instruction set will become more extensive. Consequently, it
is an advantage from an efficiency point of view if the DSP architecture can
be targeted towards a small range of algorithms, since this reduces the
instruction set and therefore increases the efficiency.
A programmable DSP architecture has the advantage of being easier to reuse for
several applications. One way of providing some flexibility, without having to
go all the way to a DSP processor, is to have a set of user controlled
parameters that affect the algorithm in some predefined way, i.e.
parametrization. The length of an FFT, or the number of taps in a finite
impulse response (FIR) filter, can for instance be made a parameter of the
block. In this way it is also possible to make architectures that can be used
in many applications but still can be synthesized efficiently if the
parameters are fixed before the synthesis stage. For instance, a programmable
filter can be turned into a filter with fixed coefficients, making it possible
to simplify, for instance, the multipliers.
To summarize, we have the following types of DSP architectures, with varying
degrees of efficiency and flexibility:
• Fixed-function architectures that can only execute one pre-determined
algorithm, where the operations can be either time-shared or isomorphically
mapped to the EXUs. In this thesis this class is represented by the presented
work dealing with variable length codes, see Chapter 3.
• Parametrized architectures that can only execute pre-determined
algorithms, but with a possibility to control some parameters, for instance
the filter length. This class of architectures is not explicitly treated in
this thesis, but some parametrization is used in the case studies for
programmable DSP architectures.
• Programmable architectures that are controlled by a microprogram and that
can be used for replacing the algorithms with new ones without having to
change the hardware. This architecture is used in the case studies presented
later in this chapter.
• Reconfigurable architectures that are realized using reconfigurable logic,
for instance FPGAs. These architectures are not discussed in this thesis.
execution”. The µC-model is both cycle accurate and bit accurate which makes it
possible to do bit and cycle accurate simulations early in the design process. An
example of a µC-model that describes an FIR filter is given in Fig. 2.5.
[Figure: The design flow: a specification and a function library lead to the
µC-model, at which point hardware and software are fixed; the µC-model is
translated into an RTL model, and the two are checked by formal verification.]
After verification, the µC-model is translated into a VHDL architecture that
supports the instructions needed for the execution of the given algorithm, and
is then synthesized using conventional tools. An example of an architecture
that is compatible with the µC-code in Fig. 2.5 is shown in Fig. 2.6. Note
that the VHDL architecture may support a larger instruction set than what is
required by the µC-model. Finally, a microcode is extracted from the µC-model
together with the VHDL architecture and a library of building blocks such as
registers, memories, and ALUs. The instructions used in the µC-code must have
a corresponding building block in the building block library. The library is
easy to extend with new functions when needed.
// Declaration part
MDSP fir
{
  INPUT inp(14, PARALLEL);          // input port, 14 bits
  OUTPUT outp(14, PARALLEL);        // output port, 14 bits
  REG acc(30), i(6), ca(5), da(5);  // different registers
  RAM d(32,16);                     // RAM with 32 16 bit words
  ROM c(32,16, "rom.data");         // ROM

  PROCEDURE compfir();              // procedure declaration
}

// Code part

PROCEDURE main()
{
  for(;;){                     // loop forever
    do {;} while(!inpF);       // while no input on the input
                               // port inp, do nothing
    inpF=0, d[da]=inp;         // reset input by setting inpF=0,
                               // store inp in RAM; "," means that
                               // this is done in parallel

    compfir();                 // call procedure compfir
    outp=acc;                  // place the value of acc on outp port
  }
}

PROCEDURE compfir()            // compute fir
{
  acc=0, ca=0;

  i=30;
  do {
    acc+=d[da++]*c[ca++],
    i--;
  } while (i>0)
  acc+=d[da]*c[ca++];
  return;
}
ing and Huffman decoding. The second core is dedicated to processing the
Inverse Discrete Cosine Transform (IDCT), which represents a high
computational workload. Due to the partitioning of the algorithms, only image
data needs to be passed to the IDCT processor core; the parameters can be kept
entirely in the Huffman processor core.
[Figure 2.6: A datapath compatible with the µC-code in Fig. 2.5: the input inp
is written to RAM d, addressed by register da with circular address update
(circ_add); a ROM c, addressed by register ca, supplies coefficients to a
multiplier whose products are accumulated in acc (+,pass) and output on outp;
the loop counter i (+,-,pass) is compared (>) and the result is sent to the
control unit.]
The experience from this case study is twofold. The methodology worked well
for defining programmable architectures for a special application, and with
some modifications it was possible to design a high efficiency core with high
utilization using the design methodology. The modifications we needed were
hardware support for loops, where the jumps do not cost any extra clock cycle,
and a possibility to describe finite state machines to handle the I/O of the
IDCT processor. As it turned out, the IDCT core architecture is in most
aspects similar to the architecture that would have been obtained if
conventional design methods had been used to design a fixed-function ASIC.
Hence, even if the methodology is intended for programmable architectures with
one control unit and a datapath, it is possible to describe complex
architectures, such as finite state machines that work in parallel with the
main control unit.
the FFT algorithm, the best choice is normally to implement an FFT in a
dedicated architecture. The special feature of this case study was that both a
high degree of programmability and a high throughput were required, due to
uncertainties in the proposed standard. Therefore our choice was to implement
the FFT in a programmable architecture that easily could be adapted to changes
in the standard.
It is possible to derive FFT algorithms with different radix, which implies
different types of butterflies. If the FFT algorithm contains only butterflies
with two inputs it is a radix-2 algorithm, with four inputs it is a radix-4
algorithm, and so on. In our implementation of the FFT algorithm we chose to
support both radix-4 and radix-2 butterfly operations. Radix-4 butterflies
require fewer memory accesses than radix-2 butterflies, but with radix-4
butterflies alone only FFT sizes that are a power of four can be calculated.
By supporting both radix-2 and radix-4 butterflies, FFTs with a length that is
a power of two can be calculated.
The FFT algorithm used in the DMT technique is derived from the normal FFT
algorithm with some modifications, due to the fact that the signal sent to the
line is real valued. In this case it is possible to calculate a 2N point FFT
using an N point FFT and an additional calculation step. The algorithm is
based on the fact that the Fourier transform of a real valued input sequence
is conjugate-symmetric [40], i.e.

X(e^jω) = X*(e^-jω) . (2.7)
y(l) = x(2l) + j·x(2l+1) , l = 0, 1, ..., N-1 . (2.8)

Compute the N point DFT of y(l),

Y(k) = Σ_{l=0}^{N-1} y(l) · e^{-j2πlk/N} , k ∈ [0, N-1] (2.9)

and
X(k) = (1/2)·( Y(k) + Y*(N-k) ) - (j/2)·e^{-j2πk/(2N)}·( Y(k) - Y*(N-k) ) ,
k ∈ [0, 2N-1] . (2.11)

Only values in the range k ∈ [0, N-1] need to be calculated, since the output
is symmetric according to Eq. 2.7.
For the transmit path (IFFT) the corresponding relations are

y(n) = (1/N)·Σ_{l=0}^{N-1} Y(l) · e^{j2πln/N} (2.13)

x(2n) = Re( y(n) )
x(2n+1) = Im( y(n) ) . (2.14)
The calculation stages in the FFT/IFFT operations used in the DMT technique
are also illustrated in Fig. 2.8.
[Figure 2.8: In the transmit path, 2N->N pre-processing is followed by an
N point IFFT; in the receive path, an N point FFT is followed by N->2N
post-processing.]
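The recombination in Eqs. 2.8-2.11 can be checked with a small numerical model. This is an illustrative sketch, not the processor's implementation: a naive O(N²) DFT stands in for the FFT, and a single formula valid for all k is used rather than the per-half split.

```python
import cmath

def dft(seq):
    """Naive N-point DFT, X(k) = sum_l x(l) * exp(-j*2*pi*l*k/N)."""
    N = len(seq)
    return [sum(seq[l] * cmath.exp(-2j * cmath.pi * l * k / N)
                for l in range(N))
            for k in range(N)]

def real_fft_via_half_size(x):
    """2N-point DFT of a real sequence from one N-point complex DFT:
    pack even/odd samples into a complex sequence (Eq. 2.8), transform,
    then recombine the conjugate-symmetric halves (Eq. 2.11)."""
    N = len(x) // 2
    y = [complex(x[2 * l], x[2 * l + 1]) for l in range(N)]  # Eq. 2.8
    Y = dft(y)
    X = []
    for k in range(2 * N):
        Yk = Y[k % N]
        Yc = Y[(N - k) % N].conjugate()
        even = 0.5 * (Yk + Yc)        # DFT of the even-indexed samples
        odd = (Yk - Yc) / 2j          # DFT of the odd-indexed samples
        X.append(even + cmath.exp(-2j * cmath.pi * k / (2 * N)) * odd)
    return X
```

Comparing the result against a direct 2N-point DFT of the real sequence confirms the recombination, while only an N-point transform had to be computed.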
Functional description
The FFT processor can handle between 128 and 1024 carriers at a data rate that
corresponds to 25 MHz sample rate. We chose to use two parallel processing
cores, where each core can handle one direction, or alternatively the two cores
can be used in the same direction with an increased data rate. An outline of the
top level of the FFT architecture is shown in Fig. 2.9.
There are two I/O blocks that handle the two data streams (upstream and
downstream). The two I/O blocks communicate with six memory sets, each capable
of keeping one complete symbol in memory. To keep the memory bandwidth high
enough in the FFT, a segmented bus structure is used where each memory set has
access to three buses, connecting to the two I/O units and one of the FFT
cores. Each memory set contains two physical memories, and it is possible to
do one read and one write to the memory set each clock cycle, as long as these
do not access the same physical memory. The on-chip buses have been designed
such that it is possible to do both a read and a write over the same bus in
the same clock cycle.
Since we had to support both different types of time division multiplex and fre-
quency duplex modulation on the line without major changes in the external con-
trol logic the memory buffering scheme was put inside the FFT processor. A
possibility to add and remove cyclic prefix from the symbols is included in the I/
O units. This saves a buffer stage in the VDSL modem. One of the I/O units has
been supplied with a complex multiplier in order to be able to integrate a fre-
quency domain equalizer with the addition of some external control logic.
The FFT DSP core is optimized for processing of complex valued data, and
therefore instructions like complex multiplication, complex addition, etc. were
chosen. One complex multiplier, two ALUs for additions and subtractions, and
one combined scaling and rounding unit are the available resources for the main
FFT calculation. There are also three address generation blocks: two for the read
and write addresses used to access the data, and one coefficient generation block
for the twiddle factors in the FFT algorithm.
The memory buffering scheme, as well as the FFT length and the length of the
cyclic prefix, is software controlled.
(Figure 2.9: top level of the FFT architecture: a memory system with six memory
sets A0–A2 and B0–B2 connected to two I/O units (IO A with ports INA/OUTA,
IO B with ports INB/OUTB) and two DSP cores (DSP A and DSP B).)
from a field in the instruction word. Exceptions from this rule are when the
constant is zero or one. The rule can be overridden by specifying an option to
the tool or by changing the µC code.
Our experience is that the tool works well, but in some cases the designer is
required to change the µC model to work around some problems.
It is difficult to compare our solution with other solutions, but there exist some
other systems that use a C-like language for hardware modelling. Several systems
use a C derivative for general HW modelling [44,45], and there is also an
initiative called the Open SystemC Initiative, in which a hardware design
language based on C is proposed [46].
Many prototype systems for hardware synthesis have been proposed. In some of
these systems an algorithm model is fed into a synthesis program that performs
automatic resource allocation and scheduling of operations [47-51]. The disad-
vantage of doing behavioral synthesis with the algorithm as the starting point is
that the synthesized instruction set will be limited to what is necessary for the
implemented algorithms. If extra flexibility, i.e. more instructions, is needed in
the architecture, this is difficult to incorporate, and even when possible, it is
difficult to re-program the control unit for a modified algorithm.
An advantage of our tool is the high degree of control over the resulting architec-
ture. No optimization stages are included in the tool. One argument for this is
that we want the designer to create an efficient DSP architecture with an
instruction set well suited for the application. When the architecture and the most
important algorithms are in place, the rest of the design can be made in software
using a standard C compiler targeted at the chosen instruction set. This requires,
however, that the tool can generate an instruction description file that fits the
chosen compiler, a function that has not been implemented yet.
The need for a tool with a high degree of interaction and a possibility for
reprogramming the synthesized architecture has also been identified and
incorporated in a design environment called AMICAL [52,53].
3 Variable Length Decoding
(Figure 3.1: a binary code tree with branches labelled 0 and 1; symbol a is a leaf
at level 1, symbols d and e are leaves at level 3, and symbols b, c, f, and g are
leaves at level 4.)
Figure 3.1 Example of tree representation of a variable length code.
T_min = (L_ave × T_Q+) / N_bits   (3.1)

where L_ave is the average code length, T_Q+ is the time required for updating
the state, and N_bits is the number of new bits that are decoded every cycle.
Since Eq. 3.1 is a fundamental limit on this architecture, the only feasible way to
increase the throughput is to increase N_bits. Unfortunately, increasing N_bits
will also increase the time required for updating the state, T_Q+, giving an
optimum at some point.
(Figure 3.2: VLC logic block taking N_bits input bits and a state register Q/Q+
in a feedback loop; the outputs are the decoded symbol and a symbol-ready flag.)
Figure 3.2 Tree-based VLC decoding SFG.
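Using the example code of Fig. 3.1, the tree-based decoding process can be modelled in a few lines. This is a behavioural Python sketch with one state transition per input bit, not a model of the hardware; the code table is the example one (a=0, d=101, e=110, b=1000, c=1001, f=1110, g=1111).

```python
# Example VLC table from Fig. 3.1 / Table 3.1.
CODES = {"a": "0", "b": "1000", "c": "1001", "d": "101",
         "e": "110", "f": "1110", "g": "1111"}

def build_tree(codes):
    """Build the code tree as nested dicts; leaves hold the decoded symbol."""
    root = {}
    for sym, bits in codes.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = sym
    return root

def decode(bitstream, tree):
    """Walk one tree level per input bit; emit a symbol at each leaf."""
    out, node = [], tree
    for b in bitstream:
        node = node[b]
        if isinstance(node, str):   # leaf reached: one symbol decoded
            out.append(node)
            node = tree             # restart at the root
    return out
```

Decoding the concatenated stream "0" + "101" + "1111" + "1000" yields the symbols a, d, g, b, one leaf visit per code word.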
(Figure: symbol-parallel VLC decoder: an input buffer feeds VLC logic that
outputs the decoded symbol and the code_length, which is fed back to the
buffer.)
The symbol decoding process can be pipelined, while the length decoding result
must be fed back to the buffer before the next N_symb symbols can be decoded.
Hence, the critical path is found in the length decoder as

T_min = (T_Ldec + T_buf) / N_symb   (3.2)

where T_Ldec is the time it takes to decode the length of N_symb symbols, and
T_buf is the time it takes for the buffer to throw away the used bits. The through-
put can be increased by decoding more symbols every cycle, but this will also
increase T_Ldec and T_buf.
There are variants of this architecture where N_symb varies with the symbol
length. In [64] an architecture is presented that in some special cases can decode
several VLC codes in parallel. When one of the code words is a short code, it is
handled in parallel with the decoding of the following code.
T_min = max_{∀i} ( (T_L_i + T_mux) / i )   (3.3)

where T_L_i is the time to find out if the code has the length of i bits, and T_mux
is the multiplexer delay.
(Figure 3.4: a shift register feeds a register and the length decoder logic L1–LM;
the length outputs drive a counter that determines the number of consumed bits
shifted out; control signals include load, reset, and new_symb.)
Figure 3.4 VLC decoder with varying rate length decoder.
The output of the length decoder can be used for synchronization of the symbol
decoder, or, if speculative decoding is used in the symbol decoder as well, to
indicate when a valid output exists.
To equalize the latency in the decoder, the delay through the length decoder is
different for different code lengths. The delay for a code length i is restricted to
be equal to i ⋅ T, where T is the clock period. That is, the pipeline depth is set
equal to the code length.
(Figure 3.5: a shift register feeds a pipelined varying rate length decoder, whose
outputs L1–LM drive a counter generating the consumed-bits and new_symb
signals, and a symbol decoder producing the symbol and a symbol-ready flag.)
Figure 3.5 VLC decoder with pipelined varying rate length decoder.
(Figure 3.6: a shift register feeds a pipelined varying rate length decoder and two
symbol decoders: one, after a delay, for codes of 1 to N−1 bits (start_short) and
one for codes of N to LM bits (start_long); a counter and a control block
generate the consumed-bits and new_symb signals, and each symbol decoder
outputs a symbol and a symbol-ready flag.)
Figure 3.6 VLC decoder with partitioned symbol decoder.
sible to use “don’t care” in many of the positions in the truth table for the length
decoder. In Table 3.1 the truth table of a length decoder for the example code of
Fig. 3.1 is shown. The simplified boolean equations are given in Eqs. 3.5-3.8.
VLC code (C1 C2 C3 C4)    L1 L2 L3 L4    Symbol
0                         1  X  X  X     a
101                       0  0  1  X     d
110                       0  0  1  X     e
1000                      0  0  0  X     b
1001                      0  0  0  1     c
1110                      0  0  0  1     f
1111                      0  0  0  1     g

Table 3.1. Truth table for length decoder.
L1 = not(C1)   (3.5)
L2 = 0   (3.6)
L3 = C2 ⊕ C3   (3.7)
L4 = 1   (3.8)
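Eqs. 3.5-3.8 can be exercised in software. The sketch below, an illustrative Python model rather than the hardware, applies them with the priority L1 before L3 before L4 (L2 is constantly zero) to cut a bitstream of the example code into code words.

```python
def decode_length(window):
    """Evaluate Eqs. 3.5-3.8 on the first bits C1..C4 of the window
    and resolve the outputs with priority L1 > L3 > L4 (L2 = 0)."""
    c = [int(b) for b in window[:4].ljust(4, "0")]
    c1, c2, c3 = c[0], c[1], c[2]
    l1 = 1 - c1        # Eq. 3.5
    l3 = c2 ^ c3       # Eq. 3.7
    if l1:
        return 1
    if l3:
        return 3
    return 4           # Eq. 3.8: L4 is constantly one

def split_stream(bits):
    """Use the length decoder to cut a bitstream into code words."""
    words, pos = [], 0
    while pos < len(bits):
        n = decode_length(bits[pos:pos + 4])
        words.append(bits[pos:pos + n])
        pos += n
    return words
```

Applied to the stream "0" + "101" + "1111" + "1000", the length decoder yields the lengths 1, 3, 4, 4, matching Table 3.1.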
3.7 Remarks
The proposed architectures are mainly suitable for fixed-function VLC decoders.
A prototype chip implementing the static MPEG-2 Video VLCs has been
reported in [16]. The MPEG-2 standard uses fixed VLCs, while this is not the
case for the JPEG standard. The need for a fast VLC decoder is usually higher
for decoding of video than for still images, which makes it relevant to study
VLC decoders with fixed VLCs.
It is difficult, but not impossible, to make a good programmable solution using
the proposed architectures. The difficulty is the length decoder, which must be
made very parallel and fast. A possible solution that may be worth examining
further is to use programmable logic to realize the length decoder.
4 Data Converters in
Communication Systems
Analog-to-Digital (ADC) and Digital-to-Analog (DAC) converters are critical
components in many communication systems. The current trend is to move more
and more of the functionality of a communication system into the digital domain
in order to provide increased flexibility and reduce cost. To accomplish this, the
requirements on the data converters increase in terms of both higher accuracy
and larger bandwidth.
In order to push data converter performance even further, there is a need to
handle problems caused by the processing of the chips. The variations in
transistor parameters, especially for analog circuits, cause a degradation in
performance [66]. To increase the performance of data converters we believe that
more attention must be paid to optimizing the data converters for the target
application. There exist many analog and digital calibration techniques that aim
at reducing the matching error problems, but few methods take full advantage of
the properties of the target application. We stress that in order to get the most out
of digital calibration and error correction, all available information about the
process, the application, and the data converter architecture should be utilized as
far as possible. In our case the target application is DSL based communication
systems. In this chapter we propose methods that can be used to correct some of
the problems in ADCs and DACs.
The second step quantizes the amplitude-continuous signal values. In Fig. 4.1 a
model of the analog-to-digital conversion is shown. The quantization is usually
modelled with an additive zero-mean Gaussian distributed noise source with
variance σ².

(Figure 4.1: model of analog-to-digital conversion: sampling at t = nT followed
by an additive noise source with variance σ².)
The quantization noise term depends on the number of bits that are used to repre-
sent the digital signal. This is usually referred to as the resolution of the ADC. In
the ideal ADC the maximum quantization error q(n) = s_q(n) − s(n) is in the
range [±∆/2], where ∆ is defined as

∆ = FS / 2^N   (4.1)

where FS refers to the full scale input range and N is the resolution.
Assuming a random input signal, the noise will be equally distributed in the
range ±∆/2 and the variance, σ², will be

σ² = E[q²(n)] = ∫_{−∆/2}^{∆/2} q²(n) · (1/∆) dq = ∆²/12   (4.2)
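Eqs. 4.1 and 4.2 can be checked with a small Monte Carlo sketch. The quantizer model, the full scale of 2 V, and the 8-bit resolution below are illustrative assumptions, not taken from the thesis.

```python
import random

def quantize(s, full_scale, n_bits):
    """Ideal mid-tread quantizer with step delta = FS / 2^N (Eq. 4.1)."""
    delta = full_scale / 2 ** n_bits
    return delta * round(s / delta)

random.seed(1)
FS, N = 2.0, 8
delta = FS / 2 ** N
errors = []
for _ in range(200_000):
    s = random.uniform(-1.0, 1.0)           # random input exercising all codes
    errors.append(quantize(s, FS, N) - s)   # q(n) = s_q(n) - s(n)
noise_var = sum(e * e for e in errors) / len(errors)
# noise_var should come out close to delta**2 / 12, as predicted by Eq. 4.2
```

The measured error never exceeds ∆/2 and its variance agrees with ∆²/12 to within the statistical accuracy of the simulation.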
There are also many other types of noise sources and imperfections that degrade
the performance. Here we differentiate between two types of error sources:
1) static errors, which are not frequency dependent, and 2) dynamic errors, which
normally increase with frequency. In order to measure the performance of the
ADC a number of measures have been defined. Some of them are listed below.
DNL_i = X_{i+1} − X_i − ∆,  i ∈ [0, 2^N − 1]   (4.3)

DNL_i = (X_{i+1} − X_i − ∆) / ∆,  i ∈ [0, 2^N − 1]   (4.4)
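Eq. 4.4 translates directly into code. The sketch below assumes the code transition levels X_i have already been measured; the example values are illustrative.

```python
def dnl(transition_levels, delta):
    """Per-code DNL, normalized by the ideal step delta (Eq. 4.4).

    transition_levels[i] is the measured input level X_i at which the
    output code changes from i to i + 1.
    """
    return [(transition_levels[i + 1] - transition_levels[i] - delta) / delta
            for i in range(len(transition_levels) - 1)]
```

For ideal, equally spaced transition levels every DNL value is zero; shifting one transition level by 0.1·∆ makes one code 0.1 LSB narrow and its neighbour 0.1 LSB wide.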
PAR = peak amplitude / rms value   (4.9)
mation the conversion is made by a binary search strategy, applying one refer-
ence voltage at a time. Another type of ADC is the sigma-delta ADC, which
works with oversampling. Since the principle relies on oversampling, this archi-
tecture is less interesting when a high conversion speed is required.
(Figure: ADC principles: a flash converter with a resistor-ladder reference
generator, a bank of comparators, and decode logic, and a successive approxi-
mation structure that compares s(nT) against one reference voltage at a time.)
Due to deviations from the ideal values in the components used for creating the
reference voltages, the voltages will contain errors, which will result in DNL and
INL errors. These errors are independent of frequency and are therefore referred
to as static errors.
Other important error sources are the offset and gain errors. Analog circuits may
have a DC offset, which will result in an output signal even for a zero input
signal. The gain variations are normally caused by amplifiers or capacitors in the
ADC, or in the sample-and-hold circuit. If the gain and offset errors are assumed
to be frequency independent, these errors can be modelled as

x_e(n) = g · x(n) + o   (4.10)

where g is the gain and o is the DC offset. Both the gain error and the offset
error may reduce the maximum voltage swing in the ADC.
At high input frequencies the dynamics of the circuitry become important. For
example, the sample-and-hold circuit may not be fast enough to track the input
signal, and the reference voltage generation may settle too slowly. These fre-
quency dependent errors will at some frequency become dominant and will limit
the bandwidth of the ADC.
(Figure: time-interleaved ADC: M ADCs sample x(t) in turn, one every T, and
their outputs x_0(n) … x_{M−1}(n) are multiplexed into x_TIADC(n).)
It is important that the differences between the ADCs in a TIADC are small,
since these differences will result in distortion. With offset mismatch, the output
samples are

x_tiadc(n) = { x(T) + o_1, …, x(MT) + o_M }   (4.11)

and in the frequency domain

X_tiadc(e^{jω}) = (1/T) · ∑_{k=−∞}^{∞} X(ω − k · 2π/(MT)) + O(e^{jω})   (4.12)
There are analog offset cancellation techniques that can be used to reduce the
offset differences in the analog circuitry [68]. An advantage of removing the
offset in the analog domain is that the offset will not reduce the available input
range. A disadvantage is that the analog offset cancellation increases the
complexity, which may reduce the performance of the analog circuitry.
In [69] a mixed digital and analog technique is proposed where most of the work
is done in the digital domain, in addition to some minor analog circuits. The
input samples are multiplied by a random sequence c(n) = {1, −1} using a
modified sample-and-hold circuit. The samples observed at the output of one of
the ADCs in the TIADC will be

x_c(n) = c(n) · x_i(n) + o_i   (4.13)

where o_i is the offset added by the ADC. By choosing c(n) so that its mean
value is close to zero, the mean value of Eq. 4.13 will approach o_i. A calibration
unit continuously computes the mean value over a large number of samples,
which is used as an estimate of o_i. The original signal is then recreated by a
digital multiplication with the same sequence c(n) as used in the sampling
process.
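A sketch of the estimation in [69] can be run numerically. The parameter values are illustrative, and the modified sample-and-hold is modelled as an ideal multiplication by c(n).

```python
import random

random.seed(2)
o_true = 0.07                                    # unknown channel offset o_i
n = 100_000
c = [random.choice((1, -1)) for _ in range(n)]   # random chopping sequence c(n)
x = [random.gauss(0.0, 0.5) for _ in range(n)]   # input samples seen by the ADC
xc = [c[i] * x[i] + o_true for i in range(n)]    # observed samples, Eq. 4.13

o_est = sum(xc) / n          # mean of c(n)*x(n) is near zero, so the mean
                             # of the observed samples approaches o_i
x_rec = [c[i] * (xc[i] - o_est) for i in range(n)]   # undo the chopping
```

After subtraction of the estimate and multiplication by the same sequence c(n), each recovered sample differs from the original only by the residual estimation error.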
(Figure 4.6: constellation diagram (Re/Im axes) with received constellation
points displaced from the ideal points by the offset error.)
Figure 4.6 Effects of offset errors in a DMT modem when using a TIADC.
The offset signal is additive and independent of the input signal, as shown in
Eq. 4.12. This additive error will cause an offset in the constellation diagram at
the frequencies k · f_s/M, where k varies between 1 and M − 1. An example of
how the offset error affects the received constellation points at the disturbed
carriers is shown in Fig. 4.6. In [24] we show that the offset error can be identi-
fied and reduced if the magnitude of the error is reasonably small. The main
result in the paper is that the error between the decoded information and the
received signal can be used for offset estimation. Taking the average value of the
error between the detected signal and the received signal will identify the offset
O(e^{jω}), assuming that the mean value of the noise is zero (E[N(e^{jω})] = 0),
see Eq. 4.15.

E[N(e^{jω}) + O(e^{jω})] = E[N(e^{jω})] + E[O(e^{jω})] = E[O(e^{jω})] =
E[S_rec(e^{jω}) − S_dec(e^{jω})]   (4.15)
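A minimal numerical sketch of the averaging in Eq. 4.15, assuming a simple QPSK slicer as the decision device; the offset value, noise level, and function names are illustrative, not from [24].

```python
import random

random.seed(0)

def decide(z):
    """Nearest QPSK constellation point (stand-in for the DMT decoder)."""
    return complex(1.0 if z.real >= 0 else -1.0,
                   1.0 if z.imag >= 0 else -1.0)

offset = 0.21 - 0.13j          # unknown additive offset on a disturbed carrier
n = 50_000
acc = 0j
for _ in range(n):
    sent = complex(random.choice((-1.0, 1.0)), random.choice((-1.0, 1.0)))
    noise = complex(random.gauss(0.0, 0.1), random.gauss(0.0, 0.1))
    received = sent + offset + noise
    acc += received - decide(received)     # S_rec - S_dec, Eq. 4.15
offset_est = acc / n
```

As long as the decisions are correct, each error term equals offset plus zero-mean noise, so the running average converges to the offset.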
There is an error in the simulations shown in [24] which accidentally left about
10% of the offset uncorrected. Later simulations have shown that the offset
estimation can be made much better, with an error well below one per cent.
The offset error will decrease the SNDR at the receiver's end, but since the
ADSL standard has been specified to adapt to a large range of different signal
qualities, it will still be possible to transmit data. As the offset estimate becomes
more accurate, the increased SNDR can be utilized to increase the bit rate.
(Figure: sampling of x(t) with timing skew: the actual sample instants deviate
from the uniform grid T, 2T, 3T, 4T by the relative errors r_0, r_1, r_2, r_3,
giving instants of the form T(1 + r_0), and so on.)
When using a single ADC it is important to have a sample clock generator with
low jitter, but in a TIADC it is also important to achieve a similar delay from the
clock source to all sample-and-hold units in the TIADC to avoid nonuniform
sampling with a period of M cycles.
Considering a TIADC with M channels with gain and timing mismatch, see Fig.
4.8, the output from the TIADC will be
x_tiadc(n) = { g_1 · x(T(1 + r_1)), …, g_M · x(MT(1 + r_M)) }   (4.16)

X_tiadc(e^{jω}) = (1/T) · ∑_{k=−∞}^{∞} A_k(e^{jω}) · X(ω − k · 2π/(MT))   (4.17)
where A_k(e^{jω}) is described by

A_k(ω) = (1/M) · ∑_{m=0}^{M−1} g_m · e^{−j(ω − k·2π/(MT)) r_m T} · e^{−jkm·2π/M}   (4.18)

where g_m is the gain error in ADC number m and r_m is the relative sampling
error for each ADC.
(Figure 4.8: each channel m of the time interleaved ADC samples x(t) with a
skewed period T(1 + r_m) and is affected by a gain g_m, an offset o_m, and a
noise source with variance σ²_m before the channel outputs x_0(n) … x_{M−1}(n)
are multiplexed into x_TIADC(n).)
Figure 4.8 Time interleaved ADC error sources.
where g is the average gain, σ_g is its standard deviation, and M is the number
of ADCs used in the TIADC. For a resolution of 10 bits, Eq. 4.19 shows that
σ_g should be kept smaller than 0.1%.
The gain error was approximated in [71] to
where σ_t is the standard deviation of the timing skew and f_in is the input
signal frequency. For a 10-bit resolution and a 20 MHz input signal, σ_t must be
smaller than 8 ps.
(Figure: correction of timing mismatch in a TIADC: the channels ADC 0 …
ADC M−1 sample x(t) at the skewed instants, and a polynomial interpolator,
driven by the estimated timing mismatch, resamples the channel outputs to form
x_TIADC(n).)
an estimate of the timing mismatch is found. The main limitation of this method
is that, in order for the algorithm to work well, most of the signal energy should
be concentrated below f_s/6.
In [26] we propose a method for estimating and correcting both timing and gain
mismatch. The proposed method takes the application into consideration, and
uses the decoder in a DMT or OFDM modem for extracting the noise on each
frequency. An adaptive algorithm estimates the mismatch distortion, which is
then used to increase the SNDR in the modem.
The distortion described by Eq. 4.17 and Eq. 4.18 can be treated as information
leakage from one carrier frequency to another. Since the different carriers in the
DMT modem can be considered independent of each other, correlation between
two carriers is caused by gain and/or timing mismatch (see also Section 1.3,
where the DMT technique is described). The correlation between two carriers is
identified using the Least Mean Square (LMS) algorithm, and most of the
distortion can be cancelled.
The static DAC output can be written as the weighted sum

A_out(nT) = ∑_{m=1}^{M} b_m(nT) · w_m

where A_out(nT) is the settled output amplitude at the time instants nT, M is the
number of bits in the input word, which contains the bits b_m(nT), and w_m are
the internal DAC weights. b_M is referred to as the most significant bit (MSB)
and b_1 is the least significant bit (LSB). For a binary offset input word, we have
that M = N and w_m = 2^{m−1}. For a thermometer code input, we have that
M = 2^N − 1 and w_m = 1. An example of a current-steering binary offset coded
DAC and a thermometer coded DAC is shown in Fig. 4.10.
(Figure 4.10: (a) a binary weighted DAC with current sources 2^{N−1}·I_0, …,
2·I_0, I_0 switched by the bits b_N … b_1, and (b) a thermometer coded DAC
with 2^N − 1 equal current sources I_0 switched by the bits b_{2^N−1} … b_1,
both summing into I_out.)
Figure 4.10 Example of a) a binary weighted and b) a thermometer coded
current-steering DAC.
tion from one sample to the next. That is, either a number of bits toggle from
zero to one, or a number of bits toggle from one to zero, but never both in the
same transition.
Normally only the most significant bits in the binary offset coded input to the
DAC are converted to thermometer code, since this is an expensive operation. A
common configuration is to use thermometer code for the five to six most
significant bits, and binary offset coding for the remaining bits [67], see
Fig. 4.11. This hybrid solution is called an M-bit segmented DAC, where M
refers to the number of binary bits that have been translated to thermometer
code.
(Figure 4.11: the M MSBs of the input X are converted by a thermometer
encoder to 2^M − 1 lines driving a thermometer coded DAC, while the remaining
N − M bits are delayed and drive a binary-weighted DAC; the two DAC outputs
are summed into A_out.)
Figure 4.11 Multi-segmented DAC structure with the M MSBs thermometer
coded.
Mismatches in the sizes of the current sources will, as in the case of the reference
voltage mismatch in the ADC, cause DNL errors.
4.4.2 Scrambling
It is difficult to measure the output from a DAC without using an ADC with
even better performance, and therefore it is also difficult to use purely digital
methods to calibrate a DAC. In order to improve the SFDR it has instead been
suggested to use scrambling, so that the direct relation from an input value to
the size of the DNL error and the glitch is removed. A scrambler selects which
current sources to use for a given input value in a random way; the size of the
error is therefore less correlated with a given input value and is spread in the
frequency domain. This method is commonly referred to as Dynamic Element
Matching (DEM) and was originally a mixed analog-digital method [79], but
today it is most common to use the digital DEM technique [80-84]. A com-
parison between different DEM methods is made in [85], see Fig. 4.12.
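A toy model of the idea can make the effect concrete. Assumed here: a 15-unit thermometer DAC with 2% unit mismatch, and fully random element selection each sample rather than any particular scrambler network.

```python
import random

random.seed(3)
N_UNITS = 15                                       # 2^4 - 1 thermometer units
units = [1.0 + random.gauss(0.0, 0.02) for _ in range(N_UNITS)]  # mismatch

def dac_static(x):
    """No DEM: a given code always selects the same x current sources,
    so its error is a fixed value, fully correlated with the input."""
    return sum(units[:x])

def dac_dem(x):
    """Random DEM: select x of the sources at random each sample, turning
    the fixed mismatch error into a noise-like, decorrelated error."""
    return sum(random.sample(units, x))

code = 7
static_vals = {dac_static(code) for _ in range(1000)}
dem_vals = [dac_dem(code) for _ in range(20_000)]
dem_mean = sum(dem_vals) / len(dem_vals)
```

Without DEM the same code always produces the same (erroneous) output, creating input-correlated distortion tones; with DEM the output varies around an unbiased mean, so the error energy is spread as noise instead.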
(Figure 4.12: digital encoder for a DEM DAC: the N-bit input x(n) passes
through a thermometer encoder and a scrambler, whose output lines x_1(n),
x_2(n), … each drive a 1-bit DAC; the 1-bit DAC outputs y_1(n), y_2(n), …
sum to y(n).)
(Figure: simulated output power spectral density, PSD [dB/Hz], versus normal-
ized frequency: (a) without DEM, where distinct distortion terms are visible, and
(b) with DEM, where the distortion is spread over the frequency band.)
(Figure 4.14: a multistage network of swap cells (p) routes the seven thermo-
meter input lines t0–t6 to the DAC elements in a data-independent, randomized
way.)
Figure 4.14 Scrambler for a 3-bit DAC with thermometer encoded input.
One application where it has turned out to be interesting to use restricted scram-
bling is radio architectures where the first up-conversion stage is done in the
digital domain, Fig. 4.15, [86]. The relatively narrow signal band is located at
high frequencies, while there is a large frequency band without signal into which
the distortion can be spread.

(Figure 4.15: transmitter with a digital mixer: the I and Q channels are inter-
polated (M, H(z)), mixed with cos(ω_if t) and sin(ω_if t), summed, and converted
by an IF-DAC; the analog signal is filtered (H(s)), up-converted with sin(ω_rf t)
in the RF frontend, and amplified by the PA.)
5 Author's Contribution to Published Work

In this section the Author's contribution to the published work is clarified for
each publication.
Pub. 4. Design of A JPEG DSP Using the Modular Digital Signal Pro-
cessor Methodology [19]
This work was made within a cooperation project between Linköping University
and Ericsson Microelectronics AB. A new design methodology was to be
evaluated in a case study. The Author made the initial hardware partitioning
together with K-G Andersson. The Author was also responsible for the design of
one of the two processor cores that were designed (the IDCT core).
6 Conclusions
Signal processing is used in all electronic communication systems, and it is
therefore important to have architectures that are efficient for implementing DSP
algorithms. It is also important to have an efficient design flow for implementing
the algorithms.
In three papers we propose architectures suitable for VLC decoding [15,16,17].
We have shown how the critical loop in the VLC decoder can be broken up,
which in theory increases the achievable decoding rate. We have also shown how
to partition the symbol decoder into parallel units to reduce the data rate through
each decoder, which makes them easier to implement. One implementation has
been made to verify the ideas.
A hardware-software co-design methodology aimed at application specific DSPs
has been verified and improved. A JPEG decoder DSP has been designed, which
shows how to combine programmability and performance [19]. An FFT DSP for
the VDSL application has been designed, implemented, and verified [20]. A tool
for improving the hardware-software co-design methodology has also been
implemented [21]. The tool supports the designer with hardware generation by
trying to capture the designer's intentions through a simple set of rules that can
easily be overridden by the designer. The DSP work presented in this thesis
should be seen as another piece of the larger puzzle of creating a more efficient
DSP design methodology.
Digital methods to improve the performance of data converters have also been
proposed. Increasing data converter performance using digital methods is attrac-
tive since modern CMOS technology allows high processing capability. It has
been shown how it is possible to co-optimize an A/D converter with the applica-
tion, giving an efficient way to cancel errors in a time-interleaved A/D converter.
A method to identify and cancel offset differences in the time-interleaved A/D
converter is proposed in [24], and a method for cancelling time and gain mis-
match is proposed in [26]. Both proposed methods are targeted at systems using
DMT modulation, but can also be used in OFDM based systems. Using the
proposed methods for correcting mismatch in time-interleaved ADCs, it is possi-
ble to consider more wideband receiver architectures, since the effective sample
rate can be increased without the usual performance degradation of a time-
interleaved ADC. New receiver architectures can provide greater flexibility since
more functionality can be placed in the digital domain, which in turn will require
more efficient programmable DSP architectures.
Another purely digital method that is proposed is the restricted DEM method,
where glitch performance and weight mismatch can be balanced against each
other [27,28]. It has been shown how current sources can be dynamically
matched while preserving a low glitch energy. We believe that the restricted
DEM technique is very well suited for high frequency DACs aimed at radio
applications.
The presented ADC and DAC work are examples of how it is possible to
increase the performance of data converters using purely digital techniques. As
data converter requirements continue to increase and process technologies limit
the performance, techniques such as the ones presented in this thesis may be the
way forward.
References
[1] C.E. Shannon, “Communication in the Presence of Noise,” Proc. IRE, Vol.
37, pp. 10-21, Jan. 1949.
[2] ETSI, Group Speciale Mobile or Global System of Mobile Communication
(GSM) Recommendation, 1988, France.
[3] ISO/IEC 10918-1: Digital Compression and Coding of Continuous-Tone
Still Images (JPEG), Feb. 1994.
[4] ISO/IEC DIS 13818-2: Generic Coding of Moving Pictures and Associated
Audio Information, part 2: Video, (MPEG-2), June 1994.
[5] S. Haykin, Digital Communications, John Wiley and Sons, 1988.
[6] J. Gibson, The Mobile Communications Handbook, CRC Press, 1996.
[7] ANSI T1.413-1998, “Network and Customer Installation Interfaces:
Asymmetrical Digital Subscriber Line (ADSL) Metallic Interface,”
American National Standards Institute.
[8] “VDSL Coalition Technical Draft Specification (Version 5),” Tech. Rep.
983t8, ETSI TM6, Luleå, Sweden, June 1998.
[9] T. Starr, J. M. Cioffi, and J. Silverman, Understanding Digital Subscriber
Line Technology, Prentice-Hall, 1999.
[10] W. Y. Chen, DSL Simulation Techniques and Standards Development for
Digital Subscriber Line Systems, Macmillan technical publishing, 1998.
[11] D. J. Rauschmayer, ADSL/VDSL Principles, Macmillan Technical
Publishing, 1999.
[12] F. Sjöberg, The Zipper Duplex Method in Very High-Speed Digital
Subscriber Lines, Luleå University of Technology, 2000.
[13] K. K. Parhi, VLSI Digital Signal Processing Systems - Design and
Implementation, Wiley, 1999.
[14] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[15] M. K. Rudberg and L. Wanhammar, “New Approaches to High Speed
Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems,
ISCAS'96, Vol. 2, pp. 149-52, Atlanta, USA, May 1996.
[16] M. K. Rudberg and L. Wanhammar, “Implementation of a Fast MPEG-2
Compliant Huffman Decoder,” Proc. of European Signal Processing Conf.,
EUSIPCO'96, Trieste, Italy, Sept. 1996.
[17] M. K. Rudberg and L. Wanhammar, “High Speed Pipelined Parallel
Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems,
ISCAS'97, Vol. 3, pp. 2080-83, Hong Kong, June 1997.
[18] M. K. Rudberg, System Design of Image Decoder Hardware, LiU-Tek-Lic-
1997:657, Department of Electrical Engineering, Linköping University, Dec.
1997.
[19] K-G Andersson, M. K. Rudberg, and A. Wass, “Design of A JPEG DSP
Using the Modular Digital Signal Processor Methodology,” Proc. of Intern.
Conf. on Signal Processing Applications and Technology, ICSPAT`97, Vol.
1, pp. 764-68, San Diego, CA, USA, Sep. 1997.
[20] M. K. Rudberg, M. Sandberg, and K. Ekholm, “Design and Implementation
of an FFT Processor for VDSL,” Proc. of IEEE Asia-Pacific Conference on
Circuits and Systems, APCCAS `98, pp. 611-14, Chiangmai, Thailand, Nov.
1998.
[21] M. K. Rudberg and M. Hjelm, ”Application driven DSP Hardware
Synthesis,” Proc. of IEEE Nordic Signal Processing Symp. (NORSIG2000),
Kolmården, Sweden, June 2000.
[22] K-G Andersson, A. Wass and K. Parmar, “A Methodology for
Implementation of Modular Digital Signal Processors,” Proc. of Intern.
Conf. On Signal Proc. Applications & Technology, ICSPAT ’96, Boston,
MA, Oct. 1996.
[23] K-G Andersson, Implementation and Modeling of Modular Digital Signal
Processors, LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[24] M. K. Rudberg, “ADC Offset Identification and Correction in DMT
Modems,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'00,
Vol 4, pp. 677-80, Geneva, May 2000.
[25] M. K. Rudberg, “A/D omvandlare,” Swedish patent number 9901888-9, 25
May 1999.
[26] M. K. Rudberg, “Correction of Mismatch in Time Interleaved ADCs“, Proc.
of IEEE Intern. Conf. on Electronics, Circuits & Systems, Malta, Sept. 2001.
[27] M. K. Rudberg, M. Vesterbacka, N. Andersson, and J.J. Wikner, “Glitch
Minimization and Dynamic Element Matching in D/A Converters,” Proc. of
IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.
[28] M. Vesterbacka, M. K. Rudberg, J.J. Wikner, and N. Andersson, “Dynamic
Element Matching in D/A Converters with Restricted Scrambling,” Proc. of
IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.
[29] M. K. Rudberg, M. Vesterbacka, N. U. Andersson, and J. J. Wikner, “A
scrambler and a method to scramble data words,” Swedish patent appl.
0001917-4, 23 May 2000.
[30] M. K. Rudberg, J. J. Wikner, J.-E. Eklund, F. Gustavsson, and J. Elbornsson,
“A/D and D/A Converters for Telecom. Applications,”
http://www.es.isy.liu.se/staff/mikaelr/downloads/adda_tut_icecs2001.pdf,
tutorial held at IEEE Intern. Conf. on Electronics, Circuits & Systems, Sept.
2001.
[31] K. Palmkvist, Studies on the Design and Implementation of Digital Filters,
Diss. No. 583, Linköping University, Sweden, 1999.
[32] M. Renfors and Y. Neuvo, “The Maximum Sampling Rate of Digital Filters
Under Hardware Speed Constraints,” IEEE Trans. on Circuits and Systems,
Vol. CAS-28, No. 3, pp. 196-202, March 1981.
[33] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design,
Kluwer Academic Publishers, 1995.
[34] A. Bellaouar and M. Elmasry, Low-Power VLSI Design - Circuits and
Systems, Kluwer Academic Publishers, 1995.
[35] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science
and Technology, Thesis No. 618, 1997.
[36] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping
Studies in Science and Technology, Thesis No. 619, 1997.
[37] E. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall,
1988.
[38] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation
of Complex Fourier Series,” Math Computers, Vol. 19, pp. 297-301, April
1965.
[39] W. M. Gentleman and G. Sande, “Fast Fourier Transform for Fun and
Profit,” Proc. 1966 Fall Joint Computer Conf., AFIPS’66, Vol.29, pp. 563-
678, Washington DC, USA, Nov. 1966.
[40] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing,
Prentice Hall, 1989.
[41] Proakis and Manolakis, Digital Signal Processing - Principles, Algorithms
and Applications, 2nd ed., Macmillian, 1992.
[42] Ericsson Internal document, ETX/XA/NB-97:006.
[43] M. Hjelm, Architectural Synthesis From a Time Discrete Behavioural
Language, LiTH-ISY-EX-2000, Linköping, Sweden, Sept. 1998.
69
[44] P. Schaumont, S. Vernalde, L. Rijnders, M. Engels, and I. Bolsens, “A
Programming Environment for the Design of Complex High Speed ASICs,”
Proc. of Design Autom. Conf., pp. 915-20, 1998.
[45] K. Wakabayashi, “C-based Synthesis Experiences with a Behavior
Synthesizer, “Cyber” ,” Design Automation and Test in Europe Conf. and
Exhibition, DATE’99, pp. 390-99, 1999.
[46] http://www.SystemC.org
[47] H. D. Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen,
“CATHEDRAL-II - A Computer-Aided Synthesis System for Digital Signal
Processing VLSI Systems,” Computer-Aided Engineering Journal, pp. 55-
66, April 1988.
[48] J.M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, “Fast Prototyping of
Datapath-Intensive Architectures,” IEEE Design and Test of Computers,
Vol. 8, Iss. 2, pp. 40-51, June 1991.
[49] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe, “GAUT: An
Architectural Synthesis Tool for Dedicated Signal Processors,” Proc. of
European Design Autom. Conf, pp. 14-19, Feb. 1993.
[50] L. Guerra, M. Potkonjak, and J. Rabaey, “A Methodology for Guided
Behavioral-Level Optimization,” Proc. of Design Automation Conf.,
DAC’98, pp. 309-14, USA, June 1998.
[51] S. Ramanathan, V. Visvanathan, and S. K. Nandy, “Synthesis of
Configurable Architectures for DSP Algorithms,” Proc. of 12th Intern. Conf.
on VLSI Design, pp. 350-57, Jan. 1999.
[52] A.A. Jerraya, I. Park, and K. O’Brien, “AMICAL: An Interactive High Level
Synthesis Environment,” Proc. of European Design Autom. Conf, pp. 58-62,
Feb. 1993.
[53] M. Benmohammed and A. Rahmoune, “Automatic generation of
reprogrammable microcoded controllers within a high-level synthesis
environment,” IEE Proc. Comput. Digit. Tech., Vol. 145, No. 3, pp. 155-60,
May 1998.
[54] D.A. Huffman, “A method for the construction of minimum redundancy
codes,” Proc. IRE, Vol. 40, No. 10, pp. 1098-1101, Sept. 1952.
[55] S. F. Chang and D. G. Messerschmitt, “Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures,” IEEE Trans. on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[56] H. D. Lin and D. G. Messerschmitt, “Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods,” IEEE Trans. on Circuits and
Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.
70
[57] S. Ho and P. Law, “Efficient Hardware Decoding Method for Modified
Huffman Code,” Electronics Letters, Vol. 27, No 10, pp. 855-856, May
1991.
[58] S. B. Choi and M. H. Lee, “High Speed Pattern Matching for a Fast Huffman
Decoder,” IEEE Transactions on Consumer Electronics, Vol. 41, No 1, pp.
97-103,Feb. 1995.
[59] R. Hashemian, “High Speed Search and Memory Efficient Huffman
Coding,” Proc. IEEE Intern. Symp. on Circuits and Systems., ISCAS ‘93,
Vol. 1, pp. 287-290, 1993.
[60] R. Hashemian, “Design and Hardware Implementation of a Memory
Efficient Huffman Decoding,” IEEE Trans. on Consumer Electronics, Vol.
40, No. 3, pp. 345-352, Aug. 1994.
[61] H. Park and V. Prasanna, “Area Efficient VLSI Architectures for Huffman
Coding,” IEEE Trans. on Circuits and Systems - II Analog and Digital Signal
Processing, Vol. 40, No. 9, pp. 568-575, Sept. 1993.
[62] K. Parhi, “High-Speed Architectures for Huffman and Viterbi Decoders,”
IEEE Trans. on Circuits and Systems - II, Analog and Digital Signal
Processing, Vol. 39, No. 6, pp. 385-391, June 1992.
[63] E. Komoto and M. Seguchi, “A 110 MHz MPEG2 Variable Length Decoder
LSI,” 1994 Symp. on VLSI Circuits, Digest of Technical Papers, pp. 71-72,
1994.
[64] D.-S. Ma, J.-F. Yang, and J.-Y. Lee, “Programmable and Parallel Variable-
Length Decoder for Video Systems,” IEEE Trans. on Consumer Electronics,
pp. 448-454, Vol. 39, No, 3, Aug. 1993.
[65] Y.-S. Lee, B.-J. Shieh, and C.-Y. Lee, “A Generalized Prediction Method for
Modified Memory-Based High Throughput VLC Decoder Design,” IEEE
Trans. on Circuits and Systems - II Analog and Digital Signal Processing, pp.
742-754, Vol. 46, No. 6, June 1999.
[66] M. J. M Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching
Properties of MOS Transistors,” IEEE J. of Solid-State Circuits, Vol. 24, No.
5, pp. 1433-9, Oct. 1989.
[67] M. Gustavsson, J. J. Wikner, and N. N. Tan, CMOS Data Converters for
Communications, Kluwer Academic Publishers, 2000.
[68] K.-S. Tan, et.al., ”Error Correction Techniques for High-Performance
Differential A/D Converters,” IEEE J. of Solid-State Circuits, Vol. 25, No.
6, pp. 1318-27, Dec. 1990.
[69] J.-E. Eklund, and F. Gustafsson, “Digital Offset Compensation of Time-
Interleaved ADC Using Random Chopper Sampling,” Proc. IEEE Intern.
Symp. on Circuits and Systems, ISCAS’00, Vol. 3, pp. 447-50, Geneva,
May, 2000.
71
[70] M. Gustavsson, CMOS A/D Converters for Telecommunications, Diss. No.
552, Linköping Unversity, Sweden, 1998.
[71] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals:
Fundamentals and High-Speed Waveform Digitizers,” IEEE Trans. on
Instrumentation and Measurement, Vol. 37, No. 2, pp. 245-251, June, 1988.
[72] H. Johansson and P. Löwenborg, “Reconstruction of Nonuniformly Sampled
Bandlimited Signals Using Digital Filter Banks,” Proc. of IEEE Intern.
Symp. on Circuits and Systems, ISCAS'01, Sydney, 2001.
[73] H. Jin and E. Lee, “A Digital Technique for Reducing Clock Jitter Effects in
Time-Interleaved A/D Converter,” Proc. of IEEE Intern. Symp. on Circuits
and Systems, ISCAS'99, Vol. 2, pp. 330-33, 1999.
[74] H. Jin and E. Lee, “A Digital-Background Calibration Technique for
Minimizing Timing-Error Effects in Time-Interleaved ADC’s,” IEEE Trans.
on Circuit and Systems - II: Analog and Digital Signal Processing, Vol. 47,
No. 7, pp. 603-13, July 2000.
[75] Y.-C. Jenq, “Perfect Reconstruction of Digital Spectrum from Nonuniformly
Sampled Signals,” IEEE Trans. on Instrumentation and Measurement, Vol.
46, No. 7, pp. 649-52, Dec. 1997.
[76] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: A Robust
Sampling Time Offset Estimation Algorithm for Ultra High-Speed
Waveform Digitizeers Using Interleaving,” IEEE Trans. on Instrumentation
and Measurement, Vol. 39, No. 1, pp. 71-75, Feb. 1990.
[77] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Theories
and Applications - Measuring Clock/Aperture Jitter of an A/D System,”
IEEE Trans. on Instrumentation and Measurement, Vol. 39, No. 6, pp. 969-
71, Dec. 1990.
[78] J. Elbornsson and J.-E. Eklund, “Blind Estimation of Timing Errors in
Interleaved AD Converters,” IEEE Intern. Conf. on Acoustics, Speech, and
Signal Processing, May 2001.
[79] R. J. van de Plassche, “Dynamic Element Matching for high-accuracy
monolithic D/A converters,” IEEE J. Solid-State Circuits, Vol. SC-11, pp.
795-800, Dec. 1976.
[80] P. Carbone and I. Galton, “Conversion error in D/A converters employing
dynamic element matching,” Proc. of ISCAS‘94, Vol. 2, pp. 13-16, 1994.
[81] L.R. Carley, “A noise-shaping coder topology for 15+ bit converters,” IEEE
J. of Solid-State Circuits, Vol. 24, no. 2 , pp. 267-273, April 1989.
[82] H.T. Jensen and I. Galton, “An analysis of the partial randomization dynamic
element matching technique,” IEEE Trans. of Circuits and Systems II, Vol.
45. No. 12, pp. 1538-1549, Dec. 1998.
72
[83] I. Galton, “Spectral Shaping if Circuit Errors in Digital-to-Analog
Converters,” IEEE Transaction of Circuits and Systems II, Vol. 44. No. 10,
pp. 808-817, Oct. 1997.
[84] L. Hernández, “A Model of Mismatch-Shaping D/A Conversion for
Linearized DAC Architectures,” IEEE Trans. of Circuits and Systems I, Vol.
45, No. 10, pp. 1068-76, Oct. 1998.
[85] N. U. Andersson and J.J.Wikner, “Comparison of Different Dynamic
Element Matching Techniques for Wideband CMOS DACs,” Proc. of
NORCHIP, Oslo, Norway, Nov. 1999.
[86] M. Helfenstein and G. S. Moschytz, Circuits and Sysems for Wireless
Communications, Kluwer Academic Publishers, 2000.
73
74
Part 2: Publications
Paper 1 - New Approaches to High Speed Huffman Decoding
ABSTRACT
This paper presents two novel structures for fast Huffman decoding. The
solutions are suited for decoding of symbols at rates up to several hundred
Mbit/s. The structures are built using the principle of pipelining, which
when applied to the length decoder unit makes it possible to remove the only
recursive loop in the basic structure. In this way a structure with a high the-
oretical speed is obtained. Another attractive property of the solutions is the
simplicity of the structures and control logic.
1. INTRODUCTION
The Huffman coding technique is a lossless coding method that assigns short
codewords to frequently used symbols and longer words to less frequently used
symbols. If the codebook is good enough this will lead to a near entropy optimal
result. Huffman coding is part of several important image coding standards,
for instance the JPEG [1] and MPEG [2] standards.
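The assignment of short codewords to frequent symbols can be illustrated with the classic Huffman construction; the sketch below is a minimal software model, and the symbol frequencies are made-up illustrative values, not taken from any standard codebook.

```python
import heapq

def build_huffman(freqs):
    """Build a Huffman codebook from {symbol: frequency}.

    Returns {symbol: codeword-string}; frequent symbols get short codes.
    """
    # Each heap entry: (frequency, tiebreak, tree), tree = symbol or (left, right)
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = build_huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
# The most frequent symbol receives the shortest code:
assert len(codes["a"]) == min(len(c) for c in codes.values())
```

By construction the resulting codebook is prefix-free, which is what makes the sequential tree-based decoding discussed below possible.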
Since the coded data consists of codewords of different sizes, it is difficult to perform the
decoding in parallel. This may not be a problem when dealing with still images,
but moving images place entirely different requirements on the decoding process.
The MPEG-2 standard requires the data to be decoded at 100 Mbit/s and above.
In this paper we introduce a new principle for fast Huffman decoding. The pre-
sented algorithm is a hybrid between a constant input, variable output decoder,
and a variable input, constant output decoder. In section 2 an overview of previ-
ous work is given. In section 3 we discuss modifications of the algorithm in
order to speed up the decoding. Finally two new structures are presented with
slightly different properties.
2. PREVIOUS WORK
There are two main approaches for hardwired Huffman decoders with fixed
codebooks. If one or several bits at a time are decoded at a constant rate, the result
is a sequential solution which traverses the Huffman tree until a leaf is
reached and then outputs the symbol (Fig. 1).
This type of decoder has a constant input rate and a variable output rate. If large
codebooks are used, the constant input rate solution tends to give very large state
machines which limit the speed. Some ways to get around this problem are given
in [3] and [4], but most solutions lead to complicated control logic.
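The sequential, constant-input-rate approach can be sketched as a bit-at-a-time walk through the code tree; the codebook below is a made-up example, and the loop body models one clock cycle consuming one input bit.

```python
def decode_sequential(bits, codes):
    """Decode a bit string one bit per 'cycle' by walking the code tree.

    codes: {symbol: codeword-string}, assumed prefix-free.
    """
    # Build the (unbalanced) binary tree as nested dicts.
    root = {}
    for sym, code in codes.items():
        node = root
        for b in code[:-1]:
            node = node.setdefault(b, {})
        node[code[-1]] = sym
    out, node = [], root
    for b in bits:                        # one bit consumed per cycle
        node = node[b]
        if not isinstance(node, dict):    # reached a leaf
            out.append(node)              # variable output rate: a symbol
            node = root                   # only when a leaf is reached
    return out

codes = {"x": "0", "y": "10", "z": "11"}
assert decode_sequential("0100110", codes) == ["x", "y", "x", "z", "x"]
```

The constant input rate and variable output rate of this decoder class are visible directly in the loop: every iteration consumes exactly one bit, but a symbol is emitted only when a leaf is reached.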
[Figure 1: sequential decoder with input buffer (K bit/cycle), decoding logic with state register, symbol indicator, and output symbol buffer.]
The other approach is to decode one codeword in each cycle, hence it will deliver
one symbol every cycle (Fig. 2). However, since the codewords have different
lengths the input rate will be variable. This solution consists of two main blocks.
The first block finds the length of the next codeword. This is necessary since the
different codewords must be kept apart to be able to feed the symbol decoder
with correct data. The symbol decoder finds the corresponding symbol accord-
ing to the codeword. This pattern matching can be done in several ways. Usually some kind of PLA structure is used to perform both the length decoding and
symbol decoding. In some solutions [5] sophisticated memory partition methods
are used to get access to the symbol and its length in an effective way.
The structure of the basic Huffman decoder is shown in Fig. 2. The critical path is
from the input shifting buffer through the length decoder.
[Figure 2: basic parallel Huffman decoder with an input shifting buffer (M bit/cycle), length decoder, and symbol decoder.]
The decoder cannot run faster than the time it takes for the length decoder to
find the length of the codeword. The symbol decoder can be designed in several
ways and can always be pipelined to reach sufficient speed. Hence, we will focus
on the length decoder and the input register.
We have here utilized the fact that we can allow the lengths of longer codewords to
be decoded at a slower rate than the shorter ones. Notice that the constant output
rate decoder now has a constant input rate. Instead, the symbol decoder will
no longer get a new codeword every cycle, and hence it will have a variable output
rate.
The critical path will be from the length decoder register through the length
decoder to the load signal. But the only thing that must be found in one cycle is
if the length is equal to one. It is often possible to further reduce the critical delay
by placing some of the length decoder logic between the shift register and the
register.
Comparing this modified decoder with the basic decoder, there are a few
important differences to note. This new structure decodes short
codewords very fast but will be slower for longer codewords. Since the basic
decoder that we started from decodes symbols at a constant output rate it will
probably be more effective for long codewords. Fortunately, the nature of Huff-
man coding makes it more likely that short codewords will dominate.
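This trade-off can be illustrated with a simple cycle-count model; the code-length distribution and the clock-rate speedup below are made-up figures for illustration, not measurements from any implementation.

```python
def avg_cycles_per_symbol(lengths, probs):
    """In the relaxed decoder a codeword of length L takes L cycles,
    so the mean cycle count equals the average code length."""
    return sum(L * p for L, p in zip(lengths, probs))

# Hypothetical codebook: code lengths and their usage probabilities.
lengths = [1, 2, 3, 4]
probs   = [0.5, 0.25, 0.15, 0.10]
avg = avg_cycles_per_symbol(lengths, probs)   # = 1.85 cycles/symbol

# If the shorter critical path allows, say, a 3x higher clock rate than the
# basic constant-output-rate decoder (1 cycle/symbol), the modified decoder
# has higher throughput whenever avg cycles/symbol < clock speedup:
speedup = 3.0
assert avg < speedup
```

Since short codewords dominate in Huffman coded data, the average code length, and hence the average cycle count, stays small, which is exactly why the modified decoder tends to win in practice.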
Figure 3. Huffman decoder with relaxed evaluation time for the length decoding unit.
Further, we can add D flip-flops after the length decoder, as long as the same is done
before the symbol decoder (Fig. 4). All the flip-flops can be propagated
into the multiplexer and the length decoder logic. In this way the delay through the
decoder logic is reduced to Tcritical/N, where Tcritical is the critical, not maximum,
delay through the length decoder logic and the multiplexer and N is the number
of added flip-flops.
The resulting structure is shown in Fig. 5 below. This structure tries to evaluate
the length of the codeword at the input every cycle, instead of only when a
codeword is actually present at the input, as in the first solution. Since there
are no limitations on how much the structure is pipelined, the length decoder will
no longer be the time critical part of the design and the speed can be increased
significantly. The theoretical speed limit is now set by the delay from a flip-
flop through one logic gate to the following flip-flop.
4. CONCLUSIONS
We have presented two new structures for Huffman decoders. Both structures
are based on a simple constant output rate decoder with a length decoder and a
symbol decoder. Since the speed limiting unit in this structure is the length
decoder we have suggested how it can be modified to reach higher speed.
Our first structure contains a length decoder with relaxed evaluation time that
makes it possible to significantly reduce the critical path delay and in this way
design faster Huffman decoders. We have simulated a standard cell implementa-
tion of the MPEG-2 Huffman tables at 120 MHz using a 0.8 µm CMOS process.
In the pipelined structure we have shown how the time limiting recursive loop in
the length decoder can be completely eliminated. This structure should be suit-
able for Huffman decoders with very high decoding rates, for example in future
wideband transmission systems, and HDTV.
5. REFERENCES
[1] ISO/IEC 10918-1 Digital compression and coding of continuous-tone still
images (JPEG), Feb. 1994.
[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated
audio information, part 2: Video, (MPEG-2), June 1994.
[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures, IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June
1992.
[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods, IEEE Transactions on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.
[5] S. B. Choi and M. H. Lee, High Speed Pattern Matching for a Fast Huffman
Decoder, IEEE Transactions on Consumer Electronics, Vol. 41, No 1, pp.
97-103, Feb. 1995.
Figure 4. Huffman decoder with delay elements in the length decoder unit.
[Figure 5: Huffman decoder with pipelined length decoder — input shift register (M bits), ND pipeline registers, length decoder logic with length=1 to length=M outputs, symbol decoder, and output symbol buffer.]
Paper 2 - Implementation of a Fast MPEG-2 Compliant Huffman Decoder
ABSTRACT
In this paper a 100 Mbit/s Huffman decoder implementation is presented. A
novel approach has been used, in which parallel decoding of data is combined
with a serial input. The critical path has been reduced and a significant increase
in throughput is achieved. The decoder is aimed at the MPEG-2 Video
decoding standard and has therefore been designed to meet the required
performance.
1. INTRODUCTION
Huffman coding is a lossless compression technique often used in combination
with other lossy compression methods, in for instance digital video and audio
applications. The Huffman coding method uses codes with different lengths,
where symbols with high probability are assigned shorter codes than symbols
with lower probability. The problem is that since the coded symbols have
unequal lengths it is impossible to know the boundaries of the symbols without
first decoding them. Therefore it is difficult to parallelize the decoding process.
When dealing with compressed video data this will become a problem since high
data rates are necessary.
The architecture of the Huffman decoder presented in this paper is based on a
novel hardware structure [1] that allows high speed decoding.
The decoder can handle all Huffman tables required for decoding MPEG-2 Video
at the Main Profile, Main Level resolutions [2]. The design is completely
MPEG-2 adapted with automatic handling of the MPEG-2 specific escape and
end of block codes. In total our decoder supports 11 code tables with more than
600 different code words. Since the code books are static in the MPEG-2 stan-
dard the Huffman decoder has been optimized for these specific MPEG-2 codes.
A decoding rate of 100 Mbit/s is required and also achieved in our implementa-
tion.
2. HUFFMAN DECODER
Huffman decoding can be performed in numerous ways. One common princi-
ple is to decode the incoming bit stream in parallel [3, 4]. The simplified decod-
ing process is described below:
1. Feed a symbol decoder and a length decoder with M bits, where M is the
length of the longest code word.
2. The symbol decoder maps the input vector to the corresponding symbol.
A length decoder will at the same time find the length of the input vector.
3. The information from the length decoder is used in the input buffer to fill
up the buffer again (with between one and M bits, Fig. 1).
The problem with this solution is the long critical path through the length
decoder to the buffer that shifts in new data (Fig. 1).
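The three steps above can be sketched as a table-lookup loop over an M-bit window; the codebook and window width below are toy values, and each loop iteration models one decode cycle of the parallel decoder.

```python
def decode_parallel(bits, symbol_table, M):
    """Constant-output-rate decoding: each iteration ('cycle') inspects an
    M-bit window, emits one symbol, and shifts out the consumed bits.

    symbol_table maps each codeword to (symbol, length); codes are prefix-free.
    """
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + M]
        # Symbol decoder and length decoder act on the same window:
        for code, (sym, length) in symbol_table.items():
            if window.startswith(code):
                out.append(sym)
                pos += length     # the length decoder tells the buffer
                break             # how many bits to shift in again
        else:
            raise ValueError("invalid code")
    return out

table = {"0": ("x", 1), "10": ("y", 2), "11": ("z", 2)}
assert decode_parallel("01011", table, M=2) == ["x", "y", "z"]
```

The feedback from the length lookup to `pos` is the software analogue of the critical path from the length decoder back to the shifting buffer.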
In our decoder the shifting buffer is realized with a shift register that continu-
ously shifts new data into the decoder (Fig. 2). The length decoder and symbol
decoder are supplied from registers that are loaded every time a new code word is
present at the input. The decoding process is described below:
1. Load the input registers of the length and symbol decoder.
2. If the coded data has a length of one go back to point 1.
3. If the coded data has a length of two go back to point 1.
and so on with codes of length three and four up to M.
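The per-cycle behaviour of these steps can be sketched as follows; this is a behavioural model with a toy codebook, not the register-level structure, and one bit enters the decoder every cycle.

```python
def decode_serial(bits, length_of, symbols):
    """Behavioural sketch of the serial-input decoder.

    One bit is shifted in per cycle; k cycles after a load the decoder checks
    whether the pending code has length k.  On a match the symbol is emitted
    and the input registers are reloaded.  length_of maps codeword -> length,
    symbols maps codeword -> symbol.
    """
    out, reg = [], ""
    for bit in bits:
        reg += bit                          # shift register: one bit per cycle
        k = len(reg)                        # cycles since the last load
        if length_of.get(reg) == k:         # the 'length = k?' check of cycle k
            out.append((symbols[reg], k))   # (symbol, cycles used)
            reg = ""                        # reload the input registers
    return out

length_of = {"0": 1, "10": 2, "11": 2}
symbols   = {"0": "x", "10": "y", "11": "z"}
# One-bit codes decode in one cycle, two-bit codes in two cycles:
assert decode_serial("01011", length_of, symbols) == [("x", 1), ("y", 2), ("z", 2)]
```

The cycle counts returned alongside each symbol make the variable output rate of this structure explicit: a length-L code occupies the length decoder for L cycles.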
[Figure 1: parallel decoder with an input shifting buffer (M bit/cycle), length decoder logic, symbol decoder logic, and output buffer.]
[Figure 2: proposed decoder with a serial-input shift register, load registers of M and M-1 bits, length decoder logic checking length=1 to length=M, force load, and output buffer.]
This structure allows longer evaluation times for longer code words. The delay in
the critical path is reduced to the time it takes for evaluating the length of code
words with a length of one or two bits. Codes with other lengths are allowed to be
evaluated in several cycles, i.e. code words with lengths of three must be evalu-
ated in two cycles and so on.
Comparing this algorithm with the previous one we note the following:
• The input rate of our new structure is constant while the original has a vari-
able input rate.
• The new structure evaluates short code words in a few cycles but requires
more cycles for longer words. The original structure has a constant evaluation
time for all code words.
91
• The new structure allows a higher clock rate since the critical path is reduced.
But this also means that the symbol decoder must be faster since it in the
worst case will receive new data every clock cycle.
• The new structure has a variable output rate while the original one has a con-
stant output rate.
The new structure requires a higher clock rate to perform the same amount of work.
But, if the average code length is short enough the new structure will have a
higher speed due to the significantly higher clock rates that can be achieved. Nor-
mally the shorter code words will dominate in Huffman coded data and therefore
the new decoder is faster during normal circumstances.
The mb_escape marker is also important. After this symbol the following data is
of fixed length. This marker is also detected in the length decoder, and results in
the following data being passed through the symbol decoder unchanged (Fig. 3).
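The bypass behaviour can be sketched as follows; the escape code value and the width of the fixed-length field are made-up parameters for illustration, not the actual MPEG-2 values.

```python
def decode_with_escape(bits, symbol_table, escape_code, fixed_len):
    """Decode, passing data after an escape marker through unchanged.

    escape_code and fixed_len are illustrative: after the escape marker,
    the next fixed_len bits are emitted as raw data instead of being
    symbol-decoded.
    """
    out, pos = [], 0
    while pos < len(bits):
        if bits.startswith(escape_code, pos):
            pos += len(escape_code)
            out.append(("raw", bits[pos:pos + fixed_len]))  # bypass decoder
            pos += fixed_len
            continue
        for code, sym in symbol_table.items():
            if bits.startswith(code, pos):    # normal Huffman decoding
                out.append(("sym", sym))
                pos += len(code)
                break
        else:
            raise ValueError("invalid code")
    return out

table = {"0": "x", "10": "y"}
assert decode_with_escape("0110100", table, escape_code="11", fixed_len=3) \
       == [("sym", "x"), ("raw", "010"), ("sym", "x")]
```

In the hardware, the same effect is obtained by letting the length decoder detect the marker and force the fixed-length bits through the symbol decoder unmodified.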
3. IMPLEMENTATION
The MPEG-2 standard requires that the input data must be decoded at a rate of
about 100 Mbit/s. During the implementation special care had to be taken during
the partitioning of the symbol decoder and a few critical paths had to be opti-
mized manually. A few modifications of the new decoding algorithm had to be
made to make it possible to achieve the targeted performance.
[Figure: load signals pipelined with additional D flip-flops to break up the critical loop.]
Note that this way of breaking up the critical loops can be generalized to remov-
ing all critical loops in this structure, see [1].
3.3. Interface
The interface of the Huffman decoder consists of an eight bit, parallel input port
for coded data. A signal indicates when a new input vector can be applied. The
decoded data is delivered with a maximum of 50 Msymbol/s. The 'symbol
present' signal (Fig. 5) indicates when data is valid at the output.
The shift register at the input of the decoder (Fig. 2) can be read and controlled
externally. This is necessary since the Huffman coded data is interleaved with
other information.
[Figure 5: output post-processing — current table register, post processing logic, output register, and the 'symbol present' signal.]
3.4. Synthesis
The decoder has been described in VHDL and then transformed to a circuit using
synthesis tools mapping to a 0.8 µm CMOS standard cell library. Some post
processing had to be done after the synthesis step to achieve the necessary perfor-
mance. The main problem was to get the symbol decoder to work fast enough.
Therefore the symbol decoding has been split into five separate units. The core
area is about 8.4 mm² and the total area is 14.5 mm² (3.9 × 3.75 mm). About two
thirds of the area is occupied by the symbol decoder (Fig. 6). The power supply is
5 V and the transistor count is 26900.
4. CONCLUSIONS
In this paper an implementation of a novel Huffman decoder architecture has
been presented. We have shown that the new structure can be used for fast Huff-
man decoding while still keeping a simple architecture. The throughput has been
increased by using a serial input combined with a serial/parallel length evalua-
tion. Since the current implementation uses standard cells it is reasonable to
believe that a full custom version of the same circuit can reach significantly
higher speed.
5. REFERENCES
[1] M. K. Rudberg and L. Wanhammar, New Approaches to High Speed
Huffman Decoding, IEEE Proc. ISCAS '96, May 1996.
[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated
audio information, part 2: Video, (MPEG-2), June 1994.
[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part I – Concurrent VLSI Architectures, IEEE Trans. on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part II – Parallel Decoding Methods, IEEE Trans. on Circuits and
Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.
[Figure 6: layout — shift register, control unit, symbol decoder, length decoder, and clock buffer.]
Paper 3 - High Speed Pipelined Parallel Huffman Decoding
ABSTRACT
This paper introduces a new class of Huffman decoders which is a develop-
ment of the parallel Huffman decoder model. With pipelining and partition-
ing, a regular architecture with an arbitrary degree of pipelining is devel-
oped. The proposed architecture dramatically reduces the symbol decoder
requirements compared to previous results, even though the actual implemen-
tation of the symbol decoder is not treated. The proposed architectures also
have the potential of realizing high speed, low power Huffman decoders.
1. INTRODUCTION
The Huffman coding method is a method for lossless data compression. The
method is used in a variety of fields as for instance in the JPEG image coding
standard and the MPEG Video coding standards. With the introduction of High
Definition digital television (HDTV), the throughput requirements on the Huff-
man decoder will increase by several orders of magnitude. Unfortunately the
Huffman decoding process is difficult to parallelize since the symbols are of
unequal length. It is not possible to know where the symbol boundaries are
before actually decoding them in sequence.
The Huffman code uses variable-length code words to compress its input data.
Frequently used symbols are represented with a short code while less often used
symbols have longer representation. The Huffman codebook forms an unbal-
anced binary tree with the symbols at the leaves. The Huffman decoding process
starts at the root node in the binary tree and stops at a leaf.
In this paper we extend previous work reported in [1] and [2], where architec-
tures for high speed Huffman decoders are described. We generalize
the concept of pipelined Huffman decoders and discuss the theoretical potential
of this class of decoders. An improvement that dramatically decreases the sym-
bol decoding speed requirements is also presented.
The parallel decoder consists of three units: a symbol decoder that maps
a bit-vector containing a coded symbol to a fixed-length representation, a length decoder
that calculates the length of the current code, and a shifting buffer that uses this
length to determine how many bits have been consumed and to fill the buffer again. The
parallel decoder has a varying input rate of 1 to Wcode,max bits/cycle depending
on the length of the latest decoded symbol. Wcode is the length of the present code
and Wcode,max is the length of the longest code in the codebook. The output rate
is constant with a fixed delay for all symbols. The critical loop in the parallel
decoder is through the length decoder to the shifting buffer. Before a new symbol
can be decoded the length of the previous code has to be found and the consumed
bits must be thrown away.
This paper will show that the parallel decoder has the potential of reaching a high
decoding rate. In Fig. 1 the two discussed models are shown. In the remainder of
this paper we will focus on the parallel Huffman decoder model.
One drawback with the parallel Huffman decoder in Fig. 1 is that the symbol
decoder and the length decoder operate in parallel on the same code. Therefore
the length of the code is not available when the symbol decoder starts the decod-
ing, which makes the symbol decoding more difficult than necessary. This prob-
lem can however be solved by inserting a buffer in front of the symbol decoder as
shown in Fig. 2. Since the length and symbol decoders here operate on different
codes, the symbol decoder can take advantage of the fact that the length of the
code is known.
[Figure 2: parallel decoder with a buffer inserted between the shifting buffer/length decoder and the symbol decoder; coded data in, length and symbol out.]
The decoder in Fig. 3 operates as follows: The shift register continuously shifts
the coded data from left to right. The codelength is evaluated in the pipelined
length decoder unit and is represented with one separate signal for every length,
i.e. Wcode,max signals. In every cycle one codelength is checked. In the first cycle
it is checked if the code is a one bit code, in the second cycle it is checked if it is
a two bit code and so on until a matching length is found. At this time the code
has been shifted out from the shift register and stored in a register feeding the
symbol decoder. The symbol decoder starts and the length decoder starts to
examine if the next code is a one bit code and so on. Note that the feed-back loop
from the length decoder to the shifting buffer is not needed any longer, but is
replaced by a synchronous reset signal to a counter.
A major disadvantage with this structure is that the symbol decoder must be
designed for a worst case sampling rate of fs,max = fclk to be able to handle suc-
ceeding one bit codes. This yields a low utilization degree of the symbol decoder
since the sampling rate is lower when longer codes are decoded (utilization
η = 1/Wcode,ave, where Wcode,ave is the average codelength).
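The utilization figure can be computed directly from the code-length distribution; the distribution below is a made-up example, not one of the standard codebooks.

```python
def utilization(lengths, probs):
    """Symbol-decoder utilization eta = 1 / Wcode,ave for a worst-case
    design point of one symbol per clock cycle."""
    w_ave = sum(w * p for w, p in zip(lengths, probs))
    return 1.0 / w_ave

# Hypothetical code-length distribution: Wcode,ave = 1.85 bits,
# so the symbol decoder is busy only about 54% of the cycles.
eta = utilization([1, 2, 3, 4], [0.5, 0.25, 0.15, 0.1])
assert abs(eta - 1 / 1.85) < 1e-12
```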
[Figure 3: loop-free decoder — input shift register (Wcode,max bits), k-stage pipelined length decoder with outputs Lcode=1 to Lcode=Wcode,max, equalizing delay, counter with synchronous reset, pipeline registers, and symbol decoder producing the decoded symbols.]
Another solution is to stop the length decoder and the shift register when fs,max is
exceeded [2]. This can for instance be done by halting the length decoder and the
shift register for a number of cycles as soon as a code with a length of less than M
bits is found, where fs,max in the symbol decoder is fs,max = fclk/M. The penalty
for this is that no symbol will be decoded in less than M cycles, i.e. the decoder
will be less effective on short codes, which also are the most frequent ones. How-
ever, this can in some cases be accepted, since if the average codelength is low the
average throughput will be high anyway. Unfortunately, halting the shift register
will result in the loss of the constant input data rate property.
In the next section we propose another method for reducing the requirements on
the symbol decoder without any loss in efficiency. This is accomplished by tak-
ing advantage of the fact that the lengths of the codes are available, and using this
to partition the symbol decoder.
The partitioning can be repeated, splitting the symbol decoder into K partitions. If K is chosen equal to Wcode,max there will be one dedicated symbol decoder for every code length, and every symbol decoder operates at a sampling frequency of at most fs,max = fclk/Wcode,j, where Wcode,j is the length of the code that symbol decoder j is optimized for. The resulting architecture can be seen as a sorter that sorts the codes according to their length, followed by a simplified symbol decoding step. In Fig. 5 an architecture with the maximally partitioned symbol decoder is shown. The architecture consists of a length decoder with a k-stage pipeline, a buffer with a depth of n, a sorter for sorting the codes, and a set of symbol decoders. The size of the buffer can be as low as zero. The control is carried out by counters and logic blocks that check the start conditions for the symbol decoders.
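The sorting behaviour can be sketched in software (our illustration; the paper describes hardware, and the function and variable names below are our own). Each decoded code of length j is dispatched to partition j, so a given partitioned decoder is started at most once every j clock cycles:

```python
# Sketch (assumed behaviour, not the paper's RTL): dispatching decoded
# code lengths to partitioned symbol decoders. With maximum
# partitioning (K = Wcode,max) decoder j only ever sees codes of
# length j, so it is started at most once every j clock cycles,
# i.e. fs,max(j) = fclk / j.

def dispatch(code_lengths, k_max):
    """Group a stream of code lengths into per-partition start times."""
    partitions = {j: [] for j in range(1, k_max + 1)}
    t = 0  # clock cycle at which each code leaves the length decoder
    for length in code_lengths:
        partitions[length].append(t)
        t += length  # the next code starts 'length' cycles later
    return partitions

def min_start_interval(starts):
    """Smallest gap between consecutive starts of one decoder."""
    return min((b - a for a, b in zip(starts, starts[1:])), default=None)

lengths = [1, 1, 3, 1, 2, 3, 1]
parts = dispatch(lengths, 3)
# Decoder j is never restarted faster than once per j cycles.
```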
[Figure 5: Architecture with a partitioned symbol decoder: shift register (Wcode,max bits), length decoder with k pipeline stages and equalizing delay, counters with reset logic, and per-partition symbol decoders, e.g. one for codes of 1 to N-1 bits running at fs = fclk and one for codes of N to Wcode,max bits running at fs = fclk/N.]
4. DISCUSSION
In this section the advantages and drawbacks of the proposed methods are discussed. The biggest advantage of the loop-free pipelined parallel decoder with partitioned symbol decoding is the potential for very fast Huffman decoding at relatively low power consumption. Fast, because the critical length decoder can be pipelined to reach almost arbitrary speed; low power, because of the partitioned symbol decoding. Symbol decoders that are not used can be put in an idle state, which saves a considerable amount of power if the partitioning is well balanced. Note that using many partitions does not lead to much growth in the control structure, which would otherwise consume power. The reduced maximum sampling rates in the symbol decoders also save power, since a lower clock frequency can be used and since more power-efficient but slower symbol decoders can be used. Unfortunately, a heavily pipelined length decoder will consume some power, but the length decoding unit is significantly smaller than the symbol decoder unit [2] and therefore consumes a minor part of the total power.
There are two types of codebooks in common use. In the MPEG standards the codebook is fixed and can therefore be hardwired into the decoder logic. It is more difficult when the codebook changes from time to time, as is the case in the JPEG image coding standard. However, in this paper we have not discussed the actual realization of either the length decoder or the symbol decoders (even though the length decoder must conform to the pipelined model). It should be possible to successfully implement both fixed and dynamic codebooks using the proposed architectures.
5. CONCLUSIONS
In this paper we have discussed different Huffman decoder models and their speed potential. The pipelined parallel decoder model is transformed into a fast loop-free architecture by using a shift register as a replacement for the normally used shifting buffer. Further, we have developed an architecture that enables a highly partitioned symbol decoder, which can be used to combine high speed decoding with a power-efficient solution. The proposed architectures do not imply that there must be a fixed codebook or that the symbol decoders must be realized in a particular way. Different solutions can be chosen depending on the sampling rate and the size of the codebook.
6. REFERENCES
[1] M. K. Rudberg and L. Wanhammar, "New Approaches to High Speed
Huffman Decoding", Proc. IEEE ISCAS '96, Atlanta, USA, May 1996.
[2] M. K. Rudberg and L. Wanhammar, "Implementation of a Fast MPEG-2
Compliant Huffman Decoder", Proc. EUSIPCO '96, Trieste, Italy,
September 1996.
[3] S. F. Chang and D. G. Messerschmitt, "Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures", IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June
1992.
[4] H. D. Lin and D. G. Messerschmitt, "Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods", IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June
1992.
[5] S. Ho and P. Law, "Efficient Hardware Decoding Method for Modified
Huffman Code", Electronics Letters, Vol. 27, No. 10, pp. 855-856, May 1991.
[Figure continuation: maximally partitioned symbol decoder with one symbol decoder per code length, e.g. Wcode = 1 bit started at fs = fclk and Wcode = 2 bits at fs = fclk/2, fed from an (n+k)-bit shift register.]
Paper 4 - Design of a JPEG DSP using the Modular Digital Signal Processor Methodology
Paper 4
Abstract
In this paper we present the design of a JPEG decoder using the Modular
DSP Methodology (MDSP). It is shown that the MDSP methodology is a
powerful tool for doing hardware-software co-design. The hardware
resources have been chosen to match the frequently used operations in the
JPEG standard to increase performance. The JPEG decoder has been real-
ized using a dual core solution where irregular and static algorithms have
been separated.
1. INTRODUCTION
The Modular DSP (MDSP) Methodology is a method for modelling Application Specific DSPs (ASDSPs). The MDSP methodology aims at tackling some of the most important issues in bridging the gap from algorithms down to silicon and moving the two levels closer [1,2,3].
This paper discusses how the MDSP Methodology was used during the design of a JPEG decoder.
Common to all wideband communication and storage systems is the need for compression of speech, image, data, audio, and video. International organizations, such as CCITT and ISO/IEC JPEG (Joint Photographic Experts Group) [4], have standardized compression algorithms and formats for images. Reprogrammability is important for the adaptation to different applications and markets. A JPEG DSP should contain the arithmetic functions needed for the specific algorithm and should be designed with appropriate wordlengths in the different parts of the architecture. The memory requirements (size, wordlength) and the partitioning of the memory structure have to be considered as well. The JPEG decoder has been modelled to fulfill the CCIR 601 requirements.
The JPEG algorithm consists of four stages: data is transformed to the frequency domain using the Discrete Cosine Transform (DCT), quantized to remove frequencies in the picture that are of minor interest, run-zero encoded to replace sequences of zeros with a shorter representation, and finally Huffman encoded, which results in a variable length code. The decoding is in principle a reversal of the operations in the encoder. The frame to be encoded is split into blocks of 8x8 pixels that are then individually coded.
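The run-zero stage can be sketched as follows (our illustration; the (run, value) pair format and the end-of-block marker below are simplifications, not the exact JPEG symbol format, which combines run/size values with the Huffman coding stage):

```python
# Sketch of run-zero (run-length) coding of quantized coefficients:
# each nonzero value is emitted together with the number of zeros
# preceding it; a trailing run of zeros is replaced by an assumed
# (0, 0) end-of-block marker.

def run_zero_encode(coeffs):
    out, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            out.append((run, c))  # (zeros before value, value)
            run = 0
    if run:
        out.append((0, 0))  # end-of-block marker (our simplification)
    return out

def run_zero_decode(pairs, length):
    out = []
    for run, value in pairs:
        if (run, value) == (0, 0):
            break  # rest of the block is zeros
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))
    return out
```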
2. METHODOLOGY
Why do we see the need for a new methodology, and what problems do we solve with the MDSP methodology?
First of all, we see a rapidly growing need for early design trade-offs and performance estimations. For that we must have a powerful modelling methodology where different algorithmic and architectural solutions can be quickly evaluated. We also want an environment where the designer's experience is captured, i.e. the environment must provide a high degree of interactivity instead of leaving important tasks such as scheduling and resource allocation entirely to the tools.
Future consumer electronics put requirements on the hardware that today can be hard to fulfill: high speed, high complexity, low power and low cost. To meet these architecture goals it is obvious that the level of integration must be increased, the hardware must be matched to the algorithms, and the design process must be shortened. We believe that using the MDSP Methodology for Application Specific DSPs matched to the algorithms is the way to handle increased complexity and reduce power consumption.
3. HARDWARE PARTITIONING
There are two types of algorithms: data dependent and static. Data dependent algorithms are characterized by having many data- or parameter-dependent processing branches. A typical data dependent algorithm is the parsing and control of a JPEG-coded data stream. There are several types of data blocks that require different kinds of decoding. Most parameters are located at the beginning of the data stream, which must be parsed and then used to select the appropriate decoding algorithm. An example of a static algorithm, on the other hand, is the Inverse Discrete Cosine Transform (IDCT), which is a part of the JPEG standard.
[Figure 1: The MDSP design flow, with scheduling and assignment linking the VHDL model and the simulator model.]
eters in the display device, the parameters are decoded in the Huffman core and are then accessible through a DMA port when the parameter memory is not used internally. See Fig. 2.
The internal interface between the cores consists of a parallel data port, Outp_RZ, that outputs run-zero coded data. A synchronization signal, DC, that is activated at the beginning of every block is provided in order to synchronize the two cores. A stop signal halts the Huffman processor when the IDCT core cannot receive data at the required rate.
[Figure 2: Interface between the Huffman processor core and the IDCT processor core: Outp_RZ/INP_RZ run-zero data port with Data_Ready handshaking, DC block synchronization and stop signal, external stream inputs (EI_St, INP_St), pixel output (Outp_Pix), display parameter access (Ci_dim, Release_dim, Ci_addr_dim), reset, and a 66 MHz MCLK.]
4. HARDWARE/SOFTWARE TRADE-OFFS
It is important to use the right kind of hardware resources in an architecture. The performance can be significantly reduced if the architecture is register limited, so that variables have to be stored in a memory and then read back into the datapath repeatedly. It is also performance limiting to do multiplication using a shift-add approach, if this is done often. A trade-off must therefore be made between adding dedicated resources to the hardware and solving a problem using the already available hardware and software. It might for instance be more efficient to add an adder and a few registers instead of a multiplier-accumulator in the datapath if the multiply-accumulate operation is seldom used.
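The cost of the shift-add approach can be illustrated with a sketch (ours, not from the paper): a shift-add multiplication costs roughly one add/shift cycle per multiplier bit, whereas a dedicated multiplier-accumulator would do the same work in a single, possibly pipelined, cycle:

```python
# Illustration (ours): software multiplication by repeated shift-add.
# The cycle count returned is the number of add/shift iterations, one
# per significant bit of the multiplier.

def shift_add_multiply(a, b):
    """Unsigned shift-add multiplication; returns (product, cycles)."""
    product, cycles = 0, 0
    while b:
        if b & 1:
            product += a  # conditionally add the shifted multiplicand
        a <<= 1
        b >>= 1
        cycles += 1
    return product, cycles

p, cycles = shift_add_multiply(200, 113)
# p == 200 * 113 after 7 iterations (113 is a 7-bit number)
```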
ries, one for the storage of the Huffman code book (Run/Size and code length) in 29-bit words. The smaller memory is used for the storage of quantization tables, temporary data and various parameters used by the JPEG algorithm.
In the Huffman core we have chosen to use dedicated hardware to detect a special marker byte (FF) in the data stream. This is done since this marker byte can occur anywhere in the data stream; instead of a time consuming software test, hardware generates a trap that forces a jump to a software routine that can handle the marker byte. To make header decoding efficient we use comparators and special masking hardware. A barrel shifter is also used. The Huffman decoding is programmed in software using a table look-up technique.
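The marker-byte handling can be sketched as follows (our illustration: in JPEG entropy-coded data a stuffed 0x00 after 0xFF denotes a literal 0xFF data byte, while any other following byte indicates a real marker; the exception below stands in for the hardware trap described above):

```python
# Sketch of FF-marker detection in a JPEG entropy-coded stream.
# The trap is modelled as an exception; the paper's hardware instead
# raises a trap line that forces a jump to a handler routine.

class MarkerTrap(Exception):
    def __init__(self, marker):
        self.marker = marker  # marker code byte following 0xFF

def unstuff(data):
    """Yield entropy-coded data bytes, trapping on real markers."""
    it = iter(data)
    for byte in it:
        if byte != 0xFF:
            yield byte
            continue
        nxt = next(it, None)
        if nxt == 0x00:
            yield 0xFF             # stuffed byte: literal 0xFF data
        else:
            raise MarkerTrap(nxt)  # real marker, e.g. 0xD9 = EOI
```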
The Huffman core delivers run-zero coded data in zig-zag order. The data is expanded and written into a memory. The datapath consists of three multiply/accumulate blocks (macc) and five adders (see Fig. 3). The IDCT on an 8x8 block is performed by doing a 1-dimensional IDCT on each column, followed by an IDCT on all eight rows. An offset of 128 is added during the read-out stage.
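The row-column decomposition can be sketched as below (our illustration using a direct O(n^2) 1-D IDCT; the hardware instead uses the three macc blocks and five adders described above):

```python
import math

# Sketch of the separable 2-D IDCT: a 1-D IDCT on each column, then on
# each row, then the +128 level shift applied at read-out.

def idct_1d(coeffs):
    n = len(coeffs)
    out = []
    for x in range(n):
        s = 0.0
        for u, c in enumerate(coeffs):
            cu = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            s += cu * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(s)
    return out

def idct_2d(block):
    """block: 8x8 list of lists of DCT coefficients -> pixel values."""
    cols = [idct_1d([row[j] for row in block]) for j in range(8)]
    rows = [idct_1d([cols[j][i] for j in range(8)]) for i in range(8)]
    return [[v + 128 for v in row] for row in rows]
```

For a block containing only a DC coefficient d, every output pixel becomes d/8 + 128, which is a convenient sanity check.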
[Figure 3: IDCT core datapath: run-zero input (INP_RZ), read/write memories, three multiply/accumulate (macc) blocks and five adders arranged in three pipelined stages, and an output register adding the offset of 128 before Outp_Pix.]
6. REFERENCES
[1] K-G Andersson, A. Wass and K. Parmar, "A Methodology for
Implementation of Modular Digital Signal Processors", Proc. ICSPAT '96,
Boston, MA, Oct. 7-10, 1996.
[2] K-G Andersson, "A Design Environment for Modular-DSP Architectures",
Electronic Design Automation Conf., Kista, Stockholm, March 15, 1994.
[3] K-G Andersson, "Implementation and Modeling of Modular Digital Signal
Processors", LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[4] ISO/IEC 10918-1: Digital compression and coding of continuous-tone still
images, 1994-02-15.
[5] C. Liem, T. May and P. Paulin, "Instruction-Set Matching and Selection
for DSP and ASIP Code Generation", Proc. of the European Design and Test
Conference, February 1994.
[6] G. Goossens, J. Rabaey, J. Vandewalle and H. De Man, "An Efficient
Microcode Compiler for Application Specific DSP Processors", IEEE
Transactions on Computer-Aided Design, Vol. 9, No. 9, September 1990.
[7] Z. Wang, "Fast Algorithms for the Discrete Cosine Transform and for the
Discrete Fourier Transform", IEEE Transactions on ASSP, Vol. ASSP-32,
No. 4, pp. 803-816, Aug. 1984.
[Figure: Huffman processor core datapath: barrel shifter (BSH), code register (Code_reg), comparison against the Huffman table (Huff_tab, Code_len, Mask_len), ALU with macc and flags, constant unit, JPEG memory (Jpeg_mem, 320 x 8 plus quantization tables), parameter registers, FF-marker detection with trap to the control unit (CU), and DC prediction register (Pred).]
Paper 5 - Design and Implementation of an FFT Processor for VDSL
Paper 5
Abstract
In this paper we present an implementation of an FFT processor for VDSL applications. Since no standard is yet available for VDSL, high requirements on flexibility were put on the design. A concurrent hardware and software design methodology made it possible to trade between hardware and software realizations in order to get an efficient and flexible architecture.
1. INTRODUCTION
The Fast Fourier Transform (FFT) is an efficient way of calculating the discrete Fourier transform and is often used in multicarrier communication systems. The FFT processor presented in this paper is aimed at VDSL (Very high speed Digital Subscriber Line) applications, one of the candidates for providing wideband communication capabilities to consumers, Fig. 1. VDSL systems use the already installed base of twisted pair copper cables for the last few hundred meters to the homes. The VDSL system is a multicarrier system, which is attractive for wideband transmission because of its capability to adapt to different channel characteristics.
Today there is no standard for VDSL, and there are several candidates that put different requirements on the FFT processing. Currently the number of carriers is unspecified, and it is still uncertain whether data shall be time multiplexed or frequency multiplexed over the channel. This uncertainty made it important to have a programmable FFT processor. The computational requirements excluded a solution with standard DSPs. Therefore the FFT has been realized as an application specific signal processor (ASSP) targeting FFT processing.
The worst case processing requirement that can be handled is two streams of continuous 50 MHz real input data with simultaneous processing of both FFTs and IFFTs with lengths up to 1024 points. One of the output ports is equipped with a multiplier that can be used as a frequency equalizer. A cyclic prefix of arbitrary length can be added at the output of the IFFT and is automatically discarded at the input of the FFT.
2. ALGORITHM
The implemented algorithm is a well known decimation in frequency radix-4
FFT algorithm [1]. The primitive operation is the radix-4 butterfly shown in Fig.
2.
Since the input data to the FFT and the output data from the IFFT are real valued and the FFT is a complex transform, it is possible to calculate a 2048-point FFT by first doing a 1024-point complex FFT and then performing a separation pass, ending up with the same result as if a full 2048-point FFT had been calculated [2]. This extra separation has a structure close, but not identical, to a radix-2 butterfly. To be able to support FFT lengths other than 4^n, radix-2 butterflies also have to be supported.
[Figure 2: Radix-4 decimation-in-frequency butterfly: X(0) = (x(0)+x(2)) + (x(1)+x(3)); X(1) = [(x(0)-x(2)) - j(x(1)-x(3))]·W^p; X(2) = [(x(0)+x(2)) - (x(1)+x(3))]·W^2p; X(3) = [(x(0)-x(2)) + j(x(1)-x(3))]·W^3p.]
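Our reconstruction of the butterfly in Fig. 2 as code, with w = e^(-j2πp/N); for p = 0 (w = 1) the butterfly directly computes a 4-point DFT:

```python
# Radix-4 decimation-in-frequency butterfly (our reconstruction of
# Fig. 2). The four outputs get twiddle factors 1, w, w^2 and w^3.

def radix4_dif_butterfly(x0, x1, x2, x3, w):
    a, b = x0 + x2, x1 + x3            # sums
    c, d = x0 - x2, -1j * (x1 - x3)    # differences, -j rotation
    return (a + b,                     # X(0)
            (c + d) * w,               # X(1), twiddle W^p
            (a - b) * w * w,           # X(2), twiddle W^2p
            (c - d) * w ** 3)          # X(3), twiddle W^3p
```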
3. DESIGN FLOW
The FFT project is the first project where Ericsson's Modular DSP methodology (MDSP) has been fully used throughout the entire design. The design methodology is aimed at programmable ASSPs and has previously been reported in [3]. A case study that resulted in a JPEG decoder architecture (which was never fabricated) is reported in [4], and we have also studied other algorithms.
The methodology encourages the designer to do trade-offs between hardware and software realizations by offering a unified design environment and modeling language for both the hardware and the software.
[Figure 3: The MDSP design flow: specification and library → µC model → hardware path (RTL, formal verification, ASIC implementation) and software path (µ-code generation).]
The architecture and the application program are concurrently evolved from the requirements that are put on the application. The description language, µC, is derived from the C language with some modifications. An RTL description of the hardware is manually or automatically derived from the application program. The software can be refined after the hardware extraction, but to assure that it is still possible to execute the application program on the architecture, a formal verification tool is available. See Fig. 3 for the design flow. An example of the design language is given in Fig. 4 below.
The RTL description is taken to a traditional ASIC flow, and the microcode that shall run on the processor is generated by a compiler.
It is important to note that translating the µC model to RTL and microcode is a mapping process. Information about resource allocation and scheduling is found in the model. An advantage of this approach is that the designer has full control of both the architecture and the scheduling, and can therefore get maximal performance from the design.
The key benefits with the design methodology are:
• An effective design language which enables short design time.
• Concurrent modeling of hardware and software.
• Fast simulation compared with Verilog and VHDL.
• Results in a programmable DSP architecture that can be re-programmed
after processing.
INPUT in(10);
OUTPUT out(10);
REG acc(10);
REG cnt;
RAM mem(16,10); // 16 words, 10 bits wide

void main()
{
    // init
    acc = 0;
    // fill memory
    for(cnt = 0; cnt < 16; cnt++) mem[cnt] = in;
    // calculate squares
    for(cnt = 0; cnt < 16; cnt++)
        out = mem[cnt] * mem[cnt];
}
5. ARCHITECTURE
The FFT processor is divided into five cores: two FFT datapaths, two IO blocks and one memory system, see Fig. 5. The IO blocks handle the input and output of data and are not programmable, but are parametrized to be able to handle different cyclic prefix and FFT lengths. The memory system contains six sets of memories, where each memory set contains 1024 complex words. The datapath blocks perform the actual FFT calculations and the control functionality. The two datapaths are identical and operate independently. All communication between the blocks is from register to register.
The internal clock is generated on chip and operates at two or four times the external clock. The maximum internal clock rate is 100 MHz. The application program is loaded at boot time through a bit serial port using a separate program loading clock.
The wordlength used is 18 bits for data and 16 bits for the coefficients (W^p in Fig. 2).
[Figure 5: Block partitioning: the memory system with ports A and B connecting the datapath and IO blocks.]
5.1. IO
The IO consists of an address generator and a complex multiplier, which was included as part of an equalizer that was needed in the application. The IO core is controlled from the datapath block.
5.2. Memory
The memory block has four read ports and four write ports. There are six memory sets, each with two physical memories, enabling concurrent read and write accesses using single port memories. The use of six memory sets is forced by the proposed time multiplexed transmission method, which requires buffering.
5.3. Datapath
The datapath block consists of a datapath for the calculations, three address gen-
erators, a control block and a control unit, see also Fig. 6 where the calculation
unit of the datapath block is outlined.
6. IMPLEMENTATION
The implementation has been made in a 0.35 µm process. The design consists mainly of standard cells, memories, and a PLL. The complex multipliers had to be internally pipelined in two stages to reach sufficient speed.
Necessary for the success of this project was the availability of a good timing driven place and route tool. In a 0.35 µm process the wire capacitance contributes too much to the delays for a good correlation between the delays estimated by the synthesis tool and the actual layout. A photo of the final chip is given in Fig. 7.
[Figure 6: Calculation unit of the datapath block: add/subtract units and register files fed from the memory read ports, a complex multiplier with coefficient ROM, address calculation, and a controller.]
7. CONCLUSIONS
In this paper the design and implementation of a high performance FFT processor has been described. A new design methodology with concurrent hardware and software development has been proven to work. It has been shown possible to design and implement an ASSP, starting without a specification, in a short time period using the MDSP design flow.
8. REFERENCES
[1] W. M. Gentleman and G. Sande, "Fast Fourier Transforms - for Fun and
Profit", Proc. 1966 Fall Joint Computer Conf. (AFIPS), Vol. 29, pp. 563-578,
Washington DC, Spartan, Nov. 1966.
[2] E. O. Brigham, "The Fast Fourier Transform and its Applications",
Prentice Hall, 1988.
[3] K-G Andersson, "Implementation and Modeling of Modular Digital Signal
Processors", LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[4] K-G Andersson, M. K. Rudberg and A. Wass, "Design of a JPEG DSP using
the Modular Digital Signal Processor Methodology", Proc. ICSPAT '97,
San Diego, CA, USA, Sept. 14-17, 1997.
[Figure 7: Chip photo; the PLL, data memory and program memory are indicated.]
Paper 6 - Application Driven DSP Hardware Synthesis
Paper 6
ABSTRACT
In this paper we present a synthesis tool aimed at application specific DSP processors. The purpose of the presented work has been to develop a tool where it is easy for a designer to try different approaches in order to achieve a well balanced architecture. In the paper we discuss the algorithms in the tool and show, by example, the intended way of operation.
1. INTRODUCTION
DSP processing in modern communication systems is today normally carried out either in programmable DSP processors or in dedicated ASICs with little or no programmability.
An ASIC solution offers high performance in terms of processing power and power consumption. The ASIC is targeted at one or a few tasks and can therefore be optimized to meet the desired computational requirements, memory bandwidth, etc.
The DSP processor is made to support a wider range of applications and must therefore have an extensive instruction set and more on-board memory. In this paper we focus on applications that require a high degree of flexibility, but for a given application. Examples of such applications include FFT processing, Viterbi and Reed-Solomon decoding. Each one of these examples exists in several variants, working with different block sizes etc.
For these applications we want to be able to evaluate various instruction sets as well as different degrees of parallelism in a hardware-software co-design process. Instead of an automatic synthesis tool we need an interactive environment that gives the designer the opportunity to describe different architectures in an efficient way and then quickly get the resulting netlist.
In this paper we show a solution that makes it possible to synthesize a DSP processor from an executable, cycle-true model of the processor and the application. This is done using a synthesis tool where most of the design choices are made by the designer.
2. RELATED WORK
Synthesis of DSP processors has been studied by several groups. The main difference between our approach and previously reported synthesis systems, such as [1,2], is that instead of having advanced algorithms in the tool, we leave most of the design choices to the designer. That is, the designer is used as the intelligent component in the system, and the synthesis tool just performs the hard work.
3. SYNTHESIS FRAMEWORK
The synthesis tool has been designed to fit into the MDSP design flow, a design methodology that allows the designer to use a C-like description language called µC for defining the DSP [3,4]. The synthesis tool takes the cycle-true µC model as input and gives as output an architecture that is able to execute the algorithms described in the model. The program memory image is then created using other tools in the framework. The generated architecture is later passed to a VHDL compiler in order to generate a netlist suitable for the layout tool, Fig. 1.
[Figure 1: Synthesis flow: the µC model and a library are input to the synthesis tool, which outputs VHDL to a standard ASIC design flow (producing the DSP netlist) and feeds the program image generator (producing the program).]
[Figure: Target architecture template: a control unit with program memory, instruction decoding and immediate operands, and a datapath with register file, RAM and I/O, exchanging control signals (op1, op2, ...) and status signals.]
4.3. Synthesis
The synthesis process is divided into a number of stages that analyze the resource needs and then create an architecture that is matched to the algorithms to be implemented.
In the first stage the µC model is analyzed to find out which hardware is explicitly declared, i.e. all memories and registers. Secondly, the tool analyzes which operations are performed in the program flow. The source and destination registers for each instruction are also stored.
In the third stage the operations are mapped to ALUs. This can basically be done in two ways, realizing either a minimal architecture with as few ALUs as possible, or a maximal architecture where little or no resource sharing is made. A minimal architecture will require ALUs supporting many instructions, while a maximal architecture gives many, but simple, ALUs.
This tool creates an architecture where each destination register gets a dedicated ALU. The ALU is chosen from the synthesis library by finding the ALU that supports all operations that have the given register as target register. Hence, the result will be an architecture with one ALU per destination register.
The degree of interactivity during the design process is intended to be high. The main goal has been to provide a tool that makes it easy for the designer to get the intended architecture. The tool therefore contains little inherent intelligence, but is easy to control through flags fed to the synthesis tool, by modifying the synthesis library, and/or by rewriting the model.
Synthesis of:
acc1 = reg_a + reg_b;
acc2 = reg_a - reg_b;
becomes either two dedicated ALUs (one + and one -) or a single shared +/- ALU.
The type of ALU that is chosen for a given register may not be the one the designer wants. The type of ALU can therefore be explicitly assigned using a configuration file as input to the synthesis tool. In this way it is possible to add a more powerful ALU that supports more instructions than required by the present application.
5. EXAMPLE
In this section an example of how to use our synthesis tool in the design flow is given.
In Fig. 4 an example of µC code for a 32-tap FIR filter is given. Passing this description through the synthesis tool without any ALUs declared gives the architecture shown in Fig. 5 (control unit excluded). The tool creates an architecture that can execute the given task, and nothing more. In order to achieve an implementation that is easier to reuse, for instance if we want to support any filter length up to 32 taps, the instruction set has to be extended. This has to be done such that it becomes possible to realize an addressing scheme other than modulo 32.
To realize this we may for instance include circular buffers for the calculation of data and coefficient addresses. Since a circular buffer may be useful in the future, we decide to add a circular buffer ALU to our synthesis library and then instantiate it in the µC model. In Fig. 6 it is shown how to change the µC model and what to add to the synthesis library. The new, more general datapath is shown in Fig. 7.
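The intended behaviour of circ_add can be sketched as follows (our illustration; the argument order circ_add(step, addr, length) follows the µC fragment in Fig. 6, and the helper function is our own):

```python
# Sketch of the circ_add operation: add a step to an address and wrap
# at the programmed filter length firl, so any filter length up to the
# buffer size is supported (instead of the fixed modulo-32 wrap of the
# 5-bit address registers).

def circ_add(step, addr, firl):
    """Circular address update: (addr + step) wrapped at firl."""
    addr += step
    return addr - firl if addr >= firl else addr

def fir_addresses(start, firl, taps):
    """Coefficient address sequence produced by repeated circ_add."""
    seq, addr = [], start
    for _ in range(taps):
        seq.append(addr)
        addr = circ_add(1, addr, firl)
    return seq
```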
6. FUTURE WORK
The implemented heuristic, with ALU selection based on target registers, leads to an architecture that normally works well for dedicated DSPs. In modern general purpose DSPs there is normally a number of parallel ALUs connected to one register file. This is an architecture that cannot be supported in the present version of the synthesis tool. One of the problems with the synthesis of such an architecture is that it is difficult to decide which instruction to put in which ALU. In order to decide how many ALUs to attach to a register file, the parallelism within the register file has to be analyzed.
The parallel ALU problem can today be worked around by explicitly instantiating the ALUs in the µC model, but a smoother way, making it easier to elaborate on different solutions, would be preferred.
// Declaration part
MDSP fir
{
    INPUT inp(14, PARALLEL);  // input port, 14 bits
    OUTPUT outp(14, PARALLEL); // output port, 14 bits
    REG acc(30), i(6), ca(5), da(5); // different registers
    RAM d(32,16); // RAM with 32 16-bit words
    ROM c(32,16, "rom.data"); // ROM

    PROCEDURE compfir(); // procedure declaration
}

// Code part

PROCEDURE main()
{
    for(;;){ // loop forever
        do {;} while(!inpF); // while no input on the input
                             // port inp, do nothing
        inpF=0, d[da]=inp; // reset input by setting inpF=0,
                           // store inp in RAM; "," means that
                           // this is done in parallel

        compfir(); // call procedure compfir
        outp=acc; // place the value of acc on the outp port
    }
}

PROCEDURE compfir() // compute fir
{
    acc=0, ca=0;

    i=30;
    do {
        acc+=d[da++]*c[ca++],
        i--;
    } while (i>0)
    acc+=d[da]*c[ca++];
    return;
}
[Figure 5: Synthesized FIR datapath: RAM d addressed by da, ROM c addressed by ca, a multiplier feeding the accumulator acc through +/pass ALUs, +,- address ALUs for da and ca, the output port outp, and a loop counter i with a +,-,pass ALU and a > comparator feeding the control unit.]
7. CONCLUSIONS
In this paper we have demonstrated a synthesis tool where a µC model is translated to a DSP processor. The strength of the tool is not its optimization routines, since they do not contain anything advanced. Instead we have shown a design flow, using the synthesis tool, where it becomes easy for a designer to evaluate different architectures. We have, by an example, shown how an FIR filter can be synthesized and redesigned to support a wider range of applications without too much work. There are things that can be improved, such as the user interface. The tool is so far just a prototype demonstrating that it is possible to work as described.
138
Paper 6 - Application Driven DSP Hardware Synthesis
and
acc+=d[da]*c[ca++];
is replaced by:
acc+=d[da]*c[ca],
ca=circ_add(1,ca,firl),
and the following entry is added to the synthesis library:
ALU_circ_add:
operations: circ_add
weight: power=100, size=100, delay=100, default=100
[Figure 7: Generalized FIR datapath with circ_add ALUs for circular addressing of the data RAM (da) and coefficient ROM (ca), controlled by the filter length register firl, together with the multiplier, accumulator acc, loop counter i and output port outp.]
8. REFERENCES
[1] T. Hollstein, J. Becker, A. Kirschbaum and M. Glesner, "HiPART: A New
Hierarchical Semi-Interactive HW-/SW Partitioning Approach with Fast
Debugging for Real-Time Embedded Applications", Proc. of Workshop on
Hardware/Software Codesign, CODES/CASHE'98, March 1998.
[2] P. Duncan, et al., "HI-PASS: A Computer-aided Synthesis System for
Maximally Parallel Digital Signal Processing ASICs", Proc. of IEEE Intern.
Conf. on Acoustics, Speech and Signal Processing, ICASSP'92, March 1992.
[3] K-G Andersson, "Implementation and Modeling of Modular Digital Signal
Processors", LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[4] K-G Andersson, M. K. Rudberg and A. Wass, "Design of a JPEG DSP using
the Modular Digital Signal Processor Methodology", Proc. of ICSPAT '97,
San Diego, CA, USA, Sept. 14-17, 1997.
Paper 7 - ADC Offset Identification and Correction in DMT Modems
Paper 7
ABSTRACT
In this paper the possibility to identify and correct DC offset errors in time interleaved ADCs is investigated. It is shown how the offset introduced by the ADC can be identified and corrected by utilizing knowledge about the target application. The ADSL standard has been used as the target application. It is shown that an offset error from a time interleaved ADC can be handled efficiently in a wideband communication system such as ADSL.
1. INTRODUCTION
With the increasing demands on bandwidth in communication systems, the demands on A/D converters (ADCs) also increase. One way of increasing the sampling rate is to use time interleaved parallel ADCs [1]. A time interleaved ADC consists of N ADCs, where each ADC only samples every Nth value. For instance, with two time interleaved ADCs the first sample is taken by ADC1, the second by ADC2, the third by ADC1, and so on. In this case the effective sample rate for each ADC is reduced to fs/N, while the total sample rate remains fs. In Fig. 1 the principle of time interleaved AD conversion is shown.
[Figure 1: principle of time interleaved A/D conversion — ADC 1 … ADC N sample s(t) at instants N·T, (N+1)·T, …, (2N-1)·T, and their outputs s1(n) … sN(n) are merged into sdist(n).]
s(1) + o_1, s(2) + o_2, s(3) + o_3, s(4) + o_4, s(5) + o_1, …   (1)
The offset signal o(n) depends only on which ADC channel is used, not on the input signal. Hence the offset signal is a periodic signal with period N, where N is the number of ADC channels. In the frequency domain the offset will cause tones located at m·fs/N, where m is an integer in the range [0, N-1].
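As a quick illustration of the periodicity argument above, the following sketch (not from the paper; the channel count and offset values are made up) builds the offset signal for N = 4 channels and confirms that its spectrum only contains tones at multiples of fs/N:

```python
import numpy as np

# Illustrative sketch: an N-channel time interleaved ADC adds a periodic
# offset o(n) with period N, so its spectrum only contains tones at
# multiples of fs/N. Offsets below are made-up example values.
N = 4
offsets = np.array([0.02, -0.05, 0.01, 0.03])   # per-channel DC offsets
n_samples = 1024

o = np.tile(offsets, n_samples // N)            # periodic offset signal
spectrum = np.abs(np.fft.rfft(o)) / n_samples

# Bins with significant energy are exactly the multiples of n_samples/N
tone_bins = np.nonzero(spectrum > 1e-9)[0]
print(tone_bins)                                # multiples of 1024/4 = 256
```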
The SNDR of a signal affected by an offset error can be expressed as the ratio of the energies of the input signal, s(n), and the offset signal, o(n) [2].

SNDR = E[s^2(n)] / E[o^2(n)]   (2)
Assuming that the offsets can be regarded as normally distributed random variables with zero mean and variance σ^2, and that the input is a sinusoid with amplitude A, the SNDR can be expressed as

SNDR_dB = 10·log10( A^2 / (2σ^2) )   (3)
Using a 16 channel time interleaved ADC, the offset error has been measured to be in the range of 30 codes, with a variance around 50, which corresponds to almost three bits of performance degradation in a 12 bit ADC.
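Eq. 3 can be evaluated directly. In the sketch below the full-scale sinusoid assumption (amplitude A = 2^11 codes for a 12 bit converter) is ours, not stated explicitly in the paper:

```python
import math

# Sketch of Eq. (3): SNDR of a sinusoid with amplitude A disturbed by
# channel offsets with variance sigma^2. The full-scale amplitude for a
# 12 bit ADC (2^11 codes) is an assumption made for illustration.
def offset_limited_sndr_db(amplitude, offset_variance):
    return 10 * math.log10(amplitude**2 / (2 * offset_variance))

A = 2**11                         # assumed full-scale amplitude, in codes
sndr = offset_limited_sndr_db(A, 50.0)
print(round(sndr, 1))             # -> 46.2, far below an ideal 12 bit ADC
```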
2. IDENTIFICATION OF OFFSET
2.1. Communication system
A simple digital communication system can be viewed as in Fig. 2 [3]. A piece of information (a symbol) is passed through an encoder that creates a signal that is sent over a channel. At the receiver the signal is converted to the digital domain and the transmitted information is recreated in the decoder. The task of the decoder is mainly to find the information that was most probably transmitted. Somewhat simplified, this can be described as comparing the difference between the received signal and all possible symbols; the symbol that minimizes this difference is the most probable one. To further increase the performance, filters and error correction algorithms are used in the decoder.
A commonly used line coding is Quadrature Amplitude Modulation (QAM). A QAM coded signal consists of a sine and a cosine wave, each of which can take a number of different phases and amplitudes. Every code has one combination of amplitude and phase, making it possible to detect the transmitted information at the receiver.
In the complex plane the received data can be shown as in Fig. 3. When the SNR allows, the number of bits transmitted in one QAM constellation is increased, resulting in more points in the constellation diagram shown in Fig. 3.
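The minimum-distance detection described above can be sketched as follows; the 4-QAM constellation and the bit mapping are illustrative assumptions, not taken from the standard:

```python
# Minimal sketch of minimum-distance QAM detection: pick the constellation
# point closest to the received complex sample. Constellation and bit
# mapping below are illustrative assumptions.
constellation = {
    (0, 0): 1 + 1j, (0, 1): -1 + 1j,
    (1, 1): -1 - 1j, (1, 0): 1 - 1j,
}

def decode(received):
    """Return the bit pair whose constellation point is closest."""
    return min(constellation, key=lambda bits: abs(received - constellation[bits]))

print(decode(0.9 + 1.2j))    # -> (0, 0)
print(decode(-0.7 - 0.4j))   # -> (1, 1)
```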
[Figure: data to transmit → ENCODER → DAC → analog front-end → line; ADC → DECODER → received data.]
Figure 2. Simple communication system.
[Figure 3: 4-QAM constellation in the complex plane (Re/Im axes) with the points 00, 01, 10, 11.]
[Figure: a denser QAM constellation in the complex plane.]
[Figure: DMT based communication system — data to transmit → ENCODER → IFFT → DAC → analog front-end → line; ADC → EC → TEQ → FFT → FEQ → DECODER → received data.]
Since the data from the ADC is used in pairs, the disturbed tones will be (2m/N)·Ntones, where m is an integer in the range [0, N/2-1], N is the number of ADC channels, and Ntones is the number of tones used in an ADSL modem.
3.2. Correction of offset before connection
Before two DMT modems are connected, only noise is present at the input of the modem. If the input is assumed to be a normally distributed noise signal e(n) with zero mean, the offset is found by simply averaging the received signal, Eq. 6.
ô(n) = E[s(n)] = E[e(n) + o(n)] = E[e(n)] + E[o(n)] = E[o(n)]   (6)
As described in Eq. 5, this operation can just as well be performed in the frequency domain, Eq. 7.

Ô(e^jω) = E[E(e^jω) + O(e^jω)] = E[E(e^jω)] + E[O(e^jω)] = E[O(e^jω)]   (7)
One problem with this method is that an input signal with a frequency that is a multiple of fs/N will be cancelled. However, the only situation where there is a risk that a wanted signal is removed is when the remote modem tries to get the attention of the receiving modem. This problem is discussed in the following section.
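A minimal sketch of Eq. 6: with only zero-mean noise at the input, averaging the received samples channel by channel recovers the offsets. The noise level here is assumed; the offset values are the ones used in the simulation example of Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of Eq. (6): before connection only zero-mean noise e(n) is present,
# so the per-channel average of the received signal converges to the offset.
N = 8                                            # ADC channels (example)
offsets = np.array([-2, -11, 3, 5, 3, -8, 1, 14], dtype=float)
n_frames = 20000

e = rng.normal(0.0, 5.0, size=(n_frames, N))     # noise, sigma assumed
received = e + offsets                           # s(n) = e(n) + o(n)

o_hat = received.mean(axis=0)                    # E[s(n)] per channel
print(np.round(o_hat, 1))                        # close to the true offsets
```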
3.3.1. Activation
The first stage during initialization is to activate the remote modem by sending an activation signal consisting of a single tone. The remote modem answers by replying with another tone. The tones last for 32 ms and might be mistaken for an offset error if the offset error generates a tone with the same frequency as one of the activation frequencies. The tones used in this phase are tones 44, 48, 52 and 60 in the downstream direction, and tones 8, 10 and 14 in the upstream direction. A collision occurs when the transmitted tone has a frequency such that there is a positive integer m that fulfills Eq. 8.
2m = N · Fsig / Ntones   (8)
Fsig is the tone carrying an activation signal, Ntones is the total number of tones, and N is the number of ADC channels. The smallest value of N that may give rise to an offset error in a tone used during this phase is N = 32. Fortunately, 32 ADC channels are more than needed for the sampling rates used in ADSL (the bandwidth is 1.1 MHz).
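Eq. 8 can be checked numerically. The sketch below assumes Ntones = 256 and power-of-two channel counts, and searches for the smallest N whose offset tones collide with one of the activation tones listed above:

```python
# Sketch of Eq. (8): an offset tone lands on an activation tone when
# 2m = N * F_sig / N_tones for a positive integer m. We assume
# N_tones = 256 and power-of-two channel counts N.
N_TONES = 256
ACTIVATION_TONES = [44, 48, 52, 60, 8, 10, 14]   # downstream and upstream

def collides(n_channels, tone):
    value = n_channels * tone / N_TONES          # this is 2m in Eq. (8)
    return value > 0 and value == int(value) and int(value) % 2 == 0

smallest = next(
    n for n in (2, 4, 8, 16, 32, 64)
    if any(collides(n, t) for t in ACTIVATION_TONES)
)
print(smallest)   # -> 32 (tone 48 gives 2m = 6, i.e. m = 3)
```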
3.3.2. Modem training
A DMT modem must be trained in order to adjust its gain, echo canceller and equalization filters. Since the offset error is an additive signal, which from the training point of view will be regarded as noise, the adaptive training algorithms remain useful, but it might result in a longer adaptation time.
In several of the training sequences, the same symbol is continuously repeated. This causes a correlation between the received repetitive symbol and the offset error; hence it is not possible to identify what is the actual signal and what is the offset error. One of the training sequences is used for estimating the SNR on each tone. This sequence has a length of 16384 symbols and, unlike the other ones, is not repetitive. Since it consists of a known pseudo random sequence it can also be used for offset estimation. The offset is found by taking the difference between the received data and the expected data.
4. SIMULATION RESULTS
In order to verify the ideas of how to identify and correct offset errors caused by an interleaved ADC architecture, the different cases have been simulated using ADSL as the application. A 12 bit time interleaved ADC consisting of eight channels with the DC offsets {-2, -11, 3, 5, 3, -8, 1, 14}, giving a variance of 54, has been used. The offset will in this case affect the tones {0, 64, 128, 192}.
A suitable method to update the offset estimate is to use a running average as in Eq. 9 below. The size of λ controls the adaptation rate and is chosen close to one.

Ô_{i+1}(e^jω) = λ·Ô_i(e^jω) + (1 - λ)·E(e^jω)   (9)
Fig. 6 shows how the adaptation proceeds during the SNR measurement sequence. Since the input signal is known, only the noise disturbs the adaptation. λ = 0.999 has been used, and the simulation shows how much of the offset error remains at the disturbed tones. 7000 symbols correspond to about 1.6 seconds of real adaptation time. Around 5% of the error remains after this period, i.e. a 26 dB decrease of the offset errors at the disturbed tones. A simulation with only noise as input gives the same result, since the known signal is removed before Eq. 9 is applied.
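The running average of Eq. 9 can be sketched for a single disturbed tone as below; the offset value, noise level and number of symbols are illustrative, so the residual error will not match the 5% figure exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the running average in Eq. (9) for one disturbed tone. The
# known signal is assumed already removed, so the estimator sees the
# offset tone plus noise. Offset and noise values are made up.
lam = 0.999                  # adaptation constant, close to one
true_offset = 1.0 + 0.5j     # offset contribution at the tone (illustrative)
o_hat = 0.0 + 0.0j

for _ in range(7000):        # about 1.6 s of ADSL symbols
    observed = true_offset + rng.normal(0, 0.1) + 1j * rng.normal(0, 0.1)
    o_hat = lam * o_hat + (1 - lam) * observed

print(abs(o_hat - true_offset) / abs(true_offset))   # small residual error
```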
[Figure 6: offset adaptation during the SNR measurement sequence — relative offset error (-0.01 to 0.04) versus symbol number (0 to 18000).]
5. HARDWARE ARCHITECTURE
Fig. 7 outlines how the offset identification and correction can be integrated in an ADSL modem. The offset correction unit realizes Eq. 9, with the error coming either from the decoder or from the noise at the FFT output. Equation 9 contains two multiplications and one accumulation. Only the tones that may be disturbed by the offset need to be taken into account; hence one multiplier is enough, since the disturbed tones are separated by 2N, where N is the number of ADC channels. The offset estimate for each tone can be kept in a register file, since the tones are quite few. The offset estimates stored in the register file are subtracted from the data coming from the FFT. The complexity of the compensation unit can be kept low since only a few tones are affected.
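The compensation step can be sketched as follows; the tone indices follow the eight-channel example in Section 4, and the estimate values are made up:

```python
import numpy as np

# Sketch of the compensation described above: stored estimates are
# subtracted only on the tones that can be disturbed by the offset.
# Estimate values below are illustrative.
disturbed_tones = [0, 64, 128, 192]
offset_estimates = dict(zip(disturbed_tones, [0.1, -0.2 + 0.1j, 0.05j, 0.3]))

def compensate(fft_out):
    out = fft_out.copy()
    for tone, est in offset_estimates.items():
        out[tone] -= est              # only the few disturbed tones are touched
    return out

fft_out = np.zeros(256, dtype=complex)
fft_out[64] = 1.0 - 0.2 + 0.1j        # wanted value 1.0 plus the offset error
print(compensate(fft_out)[64])        # ≈ (1+0j)
```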
6. ACKNOWLEDGEMENTS
I would like to thank Jan-Erik Eklund at Microelectronics Research Center, Eric-
sson Components AB, for the help with finding typical values of the DC offset
error in a time interleaved ADC.
7. CONCLUSIONS
In this paper it has been shown how an offset error in a wideband data transmission system such as ADSL can be identified and corrected. By treating the ADC as a system component that can be optimized together with the rest of the system, and by utilizing what is known about the target application, we have shown how the offset error can be handled in all the important phases during modem initialization and data transmission in the ADSL modem.
Our methods should be applicable in other communication systems as well, since it is common to have various types of training sequences that can be utilized for offset identification.
8. REFERENCES
[1] J. Yuan and C. Svensson, “A 10-bit 5 MS/s Successive Approximation Cell used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State Circuits, vol. 29, no. 8, pp. 866-872, Aug. 1994.
[2] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D. thesis, Diss. No. 552, Linköping University, Sweden, Dec. 1998.
[3] S. Haykin, Digital Communications, Wiley, 1988.
[4] ANSI T1.413-1998, “Network and Customer Installation Interfaces: Asymmetric Digital Subscriber Line (ADSL) Metallic Interface”, American National Standards Institute.
[5] T. Starr, J. M. Cioffi, and P. J. Silverman, Understanding Digital Subscriber Line Technology, Prentice-Hall, 1999.
[6] M. Karlsson Rudberg, “A/D omvandlare”, pending Swedish patent no. 9901888-9.
[Figure 7: offset correction unit — the error from the decoder or the noise from the FFT is scaled by (1-λ) and accumulated with λ times the stored estimate in a register file; the estimates are subtracted from the FFT output for the disturbed tones only.]
Paper 8 - Calibration of Mismatch Errors in Time Interleaved ADCs
Paper 8
ABSTRACT
An efficient way of increasing the sample rate of an A/D converter (ADC) is to use a time-interleaved structure. The effective sample rate can be increased without increasing the sample rate of the individual ADCs. There are, however, problems with this architecture caused by differences in gain between the ADCs as well as timing mismatch in the sample-and-hold circuits. These mismatch errors degrade the performance of the time interleaved ADC. In this paper we propose algorithms for both on-line identification of the mismatch errors and cancellation of the distortion. The proposed algorithms are suitable for applications that use the Discrete Multi-Tone modulation (DMT) or the Orthogonal Frequency Division Multiplex (OFDM) technique.
1. INTRODUCTION
Fast and accurate analog-to-digital converters (ADCs) are key components in present and future communication systems. An increasing demand for bandwidth and an increased use of digital signal processing both put higher demands on the ADCs.
One way of increasing the sample rate that has been proposed is to use several ADCs in a time interleaved way [1]. A time interleaved ADC (TIADC) consists of M ADCs, where each ADC only samples every Mth value. The effective sample rate for each ADC is reduced from fs to fs/M, while the total sample rate remains unchanged. In Fig. 1 the principle of time interleaved A/D conversion is shown.
[Figure 1: principle of time interleaved A/D conversion — ADC 0 … ADC M-1 sample x(t) at instants MT, (M+1)·T, …, (2M-1)·T, and their outputs x0(n) … xM-1(n) are merged into xTIADC(n).]
[Figure: sample-and-hold — the input xin(t) is sampled at the nominal instants nT, producing xs(nT).]

t_m = mT - r_m·T.   (1)

[Figure: nonuniform sampling of x(t) — the ideal instants T, 2T, 3T, 4T are displaced to T(1+r0), 2T(1+r1), 3T(1+r2), 4T(1+r3).]
Considering the two effects, gain mismatch and nonuniform sampling, the distortion with a band limited input signal can be modeled as [3]

X_tiadc(e^{jωT}) = (1/T) · Σ_{k=-∞}^{∞} A_k(e^{jωT}) · X(ω - k·2π/(MT))   (2)
where A_k(e^{jωT}) is

A_k(e^{jωT}) = (1/M) · Σ_{m=0}^{M-1} G_m(ω - k·2π/(MT)) · e^{-j(ω - k·2π/(MT))·r_m·T} · e^{-jkm·2π/M}   (3)
In the summation in Eq. 2 only M terms are non-zero if the input is band limited to fs/2, which will be assumed in this paper.
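The effect modeled by Eq. 2 can be illustrated numerically: with M interleaved channels, gain and timing mismatch create images of an input tone spaced by multiples of fs/M. All mismatch values below are made up:

```python
import numpy as np

# Sketch of the distortion of Eq. (2): an M-channel TIADC with gain and
# timing mismatch images an input tone at offsets of fs/M. Values invented.
M, n = 4, 1024
f_bin = 100                                       # input tone at FFT bin 100
gains = 1 + np.array([0.0, 0.02, -0.01, 0.015])   # per-channel gain mismatch
timing = np.array([0.0, 0.03, -0.02, 0.01])       # timing errors r_m (in T)

t = np.arange(n, dtype=float)
t_actual = t - timing[np.arange(n) % M]           # nonuniform sampling instants
x = gains[np.arange(n) % M] * np.sin(2 * np.pi * f_bin * t_actual / n)

spec = np.abs(np.fft.rfft(x)) / n
spurs = sorted(int(b) for b in np.argsort(spec)[::-1][:4])
print(spurs)   # -> [100, 156, 356, 412]: the tone plus images spaced by n/M
```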
To keep the distortion low, both the gain and the timing errors must be kept at a low level. In [3,4] approximations of the effects on SNDR by gain and timing mismatch have been derived. Assuming a nominal gain of g with a standard deviation of σ_g, the SNDR in a TIADC with gain mismatch can be approximated by
Using a test signal is usually not desired for in circuit calibration since a high
accuracy test signal must be generated, and it is also necessary to interrupt the
communication while performing the calibration.
In this paper we will focus on applications that use the Discrete Multi-Tone modulation technique (DMT) or the Orthogonal Frequency Division Multiplex (OFDM) technique. The DMT technique is used for digital subscriber lines, e.g. ADSL [8]. The OFDM technique is similar to DMT, with the main difference that OFDM is proposed for radio transmission.
[Figure: DMT modem — transmit path: encoder → IFFT → DAC → analog front-end → line; receive path: ADC → EC → TEQ → FFT → FEQ → decoder.]
The blocks of importance for this paper are found in the receive path. The EC block is an echo canceller and the TEQ block is the time domain equalizer. Both these blocks, together with the frequency domain equalizer FEQ, will here be referred to as a filter with the frequency response H_eq(e^{jωT}).
The output from the decoder is an estimate of which symbol was received. The equalized input to the decoder is denoted X_eq(e^{jωT}), and the estimated symbol is called S(e^{jω_n T}), where n is the index of one of the carriers and may vary between 0 and N - 1.
The filtered signal received by the decoder is

X_eq(e^{jωT}) = H_eq(e^{jωT}) · X_tiadc(e^{jωT}).   (6)
The information on each carrier is coded using M-ary QAM. That is, the bits are mapped onto a two-dimensional plane where the positions represent the transmitted bits, Fig. 5.
[Figure 5: QAM constellation in the complex plane with the points (00), (01), (10), (11), the estimated symbol S(e^{jωT}), and the minimum distance D between points.]
In order to take full benefit from in-circuit calibration there must be a possibility to utilize an increased SNR to increase the amount of transmitted data. The DMT technique as realized in the ADSL standard supports this feature.
3. IDENTIFICATION OF ERRORS
In the ideal case, when no distortion is present, the information received on each carrier is independent of the others. When distortion is present, some interference between the frequencies appears. Each carrier is interfered by at most M - 1 other carriers, described by A_k(e^{jωT}) with k ≠ 0 in Eq. 2.
An error in the transmission on carrier m may occur when the noise plus the total distortion becomes larger than D/2. That is,

N(e^{jω_m T}) + Σ_{k≠0} S(e^{jω_l T}) · A_k(e^{jω_m T}) · X_eq(e^{jω_m T}) > D/2   (7)

where D is the minimum distance between two points in the constellation diagram in Fig. 5, and N(e^{jω_m T}) is the noise contribution.
The way the distortion makes the signal on one carrier leak into another carrier is similar to what happens when a transmitted signal is echoed into the received signal in, for instance, an ADSL system. It is therefore possible to remove the distortion with a method similar to the one used for echo cancelling. The most well known method for adapting an echo canceller is the
Least Mean Square (LMS) method, which uses the gradient of the error in the received signal to update the coefficients C_i in an adaptive filter according to [7]

C_{i,k+1} = C_{i,k} + µ · e_k · x_{k-i}   (8)

where e_k is the error between the wanted signal and the one actually received, x_{k-i}. µ is a parameter that controls the adaptation rate.
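A minimal LMS sketch of Eq. 8; the unknown system, the step size and the input signal are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal LMS sketch of Eq. (8): adapt filter coefficients from the error
# between the wanted signal and the filter output. System and step size
# below are made-up example values.
unknown = np.array([0.5, -0.3, 0.1])     # system to identify
c = np.zeros(3)                          # adaptive coefficients C_i
mu = 0.05                                # adaptation rate

x = rng.normal(size=5000)
for k in range(3, len(x)):
    xk = x[k-2:k+1][::-1]                # x_k, x_{k-1}, x_{k-2}
    wanted = unknown @ xk
    e = wanted - c @ xk                  # error e_k
    c += mu * e * xk                     # C_{i,k+1} = C_{i,k} + mu*e_k*x_{k-i}
print(np.round(c, 3))                    # close to the unknown system
```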
Ĉ_k(e^{jω_m T}) = Ĉ_k(e^{jω_m T}) + µ · Û_{k,rem}(e^{jω_m T}) · S*(e^{jω_l T})   (13)
Û_k(e^{jω_m T}) = X_eq(e^{jω_m T}) · Ĉ_k(e^{jω_m T}).   (14)

X_eq(e^{jω_m T}) contains some distortion, but is the best possible estimate of S(e^{jω_m T}) available without performing the symbol decoding.

X_eq2(e^{jω_m T}) = X_eq(e^{jω_m T}) - Σ_{k≠0} Û_k(e^{jω_m T}).   (15)
This removes the contribution from all distortion terms that leak into the current carrier. The proposed method requires no more than that the input signal is band limited to fs/2. Alternative methods for timing error correction in a TIADC usually work less well when the signal bandwidth gets close to fs/2 [5,9]. The method presented in [6] performs perfect reconstruction of the signal spectrum, but the use of a special DFT makes the algorithm computationally heavy.
4. SIMULATIONS
A TIADC with four ADCs has been simulated with a 256 carrier DMT signal as input. No quantization effects have been considered. The timing mismatch has been randomly selected with a standard deviation of 8%, and the gain mismatch with a standard deviation of 2%. In Fig. 6 the adaptation process is shown, considering the distortion that leaks into carrier 96. The simulation is made using 10^5 symbols.
In Fig. 7 it is shown what the received QAM-encoded constellation points look like before and after cancellation of the noise on carrier 96. The improvement in SNDR is about 13 dB.
5. CONCLUSIONS
In this paper we have proposed a method to both identify and correct mismatch errors caused by gain and timing errors between the ADCs in a TIADC. The method can be applied to the OFDM and DMT transmission techniques, which are used in for instance ADSL and VDSL. The method works all the way up to the Nyquist frequency, and can handle gain mismatch that is frequency dependent as long as the gain can be considered linear.
[Figure 6: adaptation of the distortion that leaks into carrier 96, with curves for carriers 34, 162 and 224, over 5·10^4 symbols.]
Figure 7. Received constellation before and after cancellation of gain and timing mismatch.
6. REFERENCES
[1] J. Yuan and C. Svensson, “A 10-bit 5 MS/s Successive Approximation Cell used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State Circuits, vol. 29, no. 8, pp. 866-872, Aug. 1994.
[2] M. K. Rudberg, “ADC Offset Identification and Correction in DMT Modems”, Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS’00, Geneva, May 2000.
[3] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Fundamentals and High-Speed Waveform Digitizers”, IEEE Trans. Instrum. Meas., vol. 37, pp. 245-251, June 1988.
[4] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D. thesis, Diss. No. 552, Linköping University, Sweden, Dec. 1998.
[5] H. Jin and E. Lee, “A Digital-Background Calibration Technique for Minimizing Timing-Error Effects in Time-Interleaved ADCs”, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 7, July 2000.
Paper 9 - Glitch Minimization and Dynamic Element Matching in D/A Converters
Paper 9
ABSTRACT
In this paper we present a novel method for combining thermometer coding and dynamic element matching (DEM) in a digital-to-analog converter (DAC). The proposed method combines DEM with minimization of glitch power, which in a DEM solution may give a significant contribution to the total noise power. The switch based solution provides a structure where it is possible to implement only parts of the method, which reduces the area required for implementation.
1. INTRODUCTION
The requirements in terms of accuracy of digital-to-analog converters (DACs) are increasing with the introduction of wide-band access services such as ADSL. In order to increase the accuracy we want to reduce the influence of both static and dynamic errors. Considering the static case, a DAC will in general perform the following operation
[Figure: (a) a binary weighted DAC with current sources 2^{N-1}·I0, …, 2·I0, I0 controlled by bits bN … b1, and (b) a thermometer coded DAC with unit sources I0 controlled by bits b1 … b_{2^N-1}; both sum into Iout.]
[Figure: digital encoder — the N-bit input x(n) passes a thermometer encoder and a scrambler driving 1-bit DACs with outputs y1(n), y2(n), …, summed into y(n); the scrambler is a network of switches t0 … t6 arranged in layers, each switch controlled by a random bit p.]
previous state of the randomization must be remembered in order to find out how
to randomize the new sample. An example of how to randomize thermometer
coded data with glitch minimization is shown in Tab. 1.
1.3. Scrambler
Realizing a scrambler using a net of switches requires a switch that can remember its previous state in order not to randomize positions that should be preserved. In Tab. 2 a truth table for a switch that can be used in a glitch minimizing scrambler is shown. a_{i+1} and b_{i+1} are the inputs to the switch, and a_i and b_i are the inputs from the previous sample. Bits are to be randomized only if a new zero or one occurs at the input. A logic realization of the truth table requires three flip-flops, since both the inputs from the previous sample and the previous setting of the switch must be saved.
Since flip-flops are expensive logic elements, area can be saved if the number of flip-flops can be reduced. Since thermometer coded data is used, the situation where <ai,bi> = <0,1> (or <1,0>) becomes <ai+1,bi+1> = <1,0> (<0,1>) in the next sample never occurs. Therefore it is possible to set don't care in Tab. 2 at the positions marked (*). Another thing to notice is that since a transition directly from <1,0> (<0,1>) to <0,1> (<1,0>) never occurs at the input of a switch, a value of <1,1> or <0,0> will be present at the input for at least one sample between the two cases <1,0> and <0,1>. It is therefore enough to randomly set the switch when the input data is <1,1> or <0,0> to keep the same degree of randomization.
Table 2. Truth table for a switch in a glitch minimizing scrambler.
ai+1 bi+1   ai bi   switch setting
0    0      X  X    don't care
0    1      0  0    random
0    1      0  1    keep previous
0    1      1  0    inverse of previous (*)
0    1      1  1    random
1    0      0  0    random
1    0      0  1    inverse of previous (*)
1    0      1  0    keep previous
1    0      1  1    random
1    1      X  X    don't care
A simplified truth table is shown in Tab. 3. Notice that the setting of the switch is no longer dependent on the input value from the previous sample period (<ai,bi>). Hence, only one flip-flop, which saves the state of the switch, is needed. A possible realization of the switch is shown in Fig. 4.
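The behaviour of the simplified switch can be sketched as below: the state is re-randomized only when the two inputs are equal, so an unchanged input pair is always routed the same way and causes no glitch. The class is our illustration, not the logic realization in Fig. 4:

```python
import random

# Sketch of the simplified switch in Tab. 3: randomize the state only when
# both inputs are equal (<0,0> or <1,1>); otherwise keep the previous state
# so that bits that stay constant are not moved to another output.
class Switch:
    def __init__(self):
        self.state = 0                       # 0 = straight, 1 = crossed

    def route(self, a, b):
        if a == b:                           # <0,0> or <1,1>: safe to randomize
            self.state = random.getrandbits(1)
        return (b, a) if self.state else (a, b)

sw = Switch()
sw.route(1, 1)            # state is re-randomized here
first = sw.route(1, 0)    # <1,0>: keeps whatever state was chosen
second = sw.route(1, 0)   # unchanged input, identical routing -> no glitch
print(first == second)    # -> True
```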
the binary offset coded input data, a group of 2^i bits is added to the unordered thermometer code. All the added bits in a group always have the same value, and no switches are needed when only bits within this group are scrambled (i.e. the shaded switches in Fig. 6 are unnecessary). Using a radix-2 butterfly architecture of the scrambler requires at least k switch layers to guarantee that all paths in group 2^k cross at least one path in each of the groups {2^j, j < k}. If this condition is fulfilled the output will be glitch minimized. Switch layers placed after layer k may be needed to increase the randomization, but since the output from layer k is glitch minimized, the simpler switch shown in Fig. 4 can be used for these layers.
Table 3. Simplified truth table for the switch.
ai+1 bi+1   ai bi   switch setting
0    0      X  X    random
0    1      X  X    keep previous
1    0      X  X    keep previous
1    1      X  X    random
The added logic for converting <1,0> to <0,1> can be seen as a two-bit unordered-to-ordered thermometer encoder. If the switches are kept fixed (i.e. p is fixed), the proposed architecture works as a normal thermometer encoder. Hence, the architecture is a thermometer encoder with built-in glitch minimized scrambling.
[Figure 4: possible realization of the simplified switch, with a state flip-flop D, random control bit p, an XOR gate, inputs a and b, and output x.]
2. SIMULATIONS
In Tab. 5 the relative glitch power has been estimated for four different DAC architectures. As input signal, a multicarrier ADSL signal with 256 carriers has been used. As can be expected, the proposed glitch minimized thermometer coding technique performs just as well as plain thermometer coding, while randomization of the thermometer code is about as bad as binary offset coding from the glitch power aspect. In Fig. 7 a) and b) the effect of mismatch on distortion is compared between thermometer coding and thermometer coding with glitch minimization. Considering only the matching error, the simulation shows a 13 dB improvement of the SFDR compared with normal thermometer coding. In the simulations a 6 bit DAC with a random matching error of σ = 0.02 has been used. Note that all harmonics disappear using the proposed method.
It is important to be aware that a fast varying input signal becomes more randomized than a slowly varying one, because only the difference between two samples is randomized.
Table 4. Input/output function of the switch with <1,0> converted to <0,1>.
a b   x y   switch setting
0 0   0 0   random
0 1   0 1   keep previous
1 0   0 1   keep previous
1 1   1 1   random
[Figure: logic realization of the switch with output ordering (state flip-flop D, control bit p, XOR, AND and OR gates, inputs a and b, outputs x and y), and the scrambler network with bit groups 2^0, 2^1, 2^2 and layers of switches controlled by p.]
3. CONCLUSIONS
In this paper we have presented a novel method where dynamic element matching is combined with glitch minimization. We have presented an architecture similar to the commonly used scrambler with a number of switch layers, with the difference that we use a modified switch that remembers the old path through the scrambler in order to minimize glitches. Simulations have shown that the proposed method both reduces the number of glitches and de-correlates the mismatch in the current sources from the signal.
Table 5. Normalized SNDR for different types of coding.
Type of coding                                            Normalized SNDR (dB)
thermometer code                                          11
thermometer code + randomization                          0
thermometer code + randomization + glitch minimization    11
[Two PSD plots: PSD (dB/Hz, -80 to 40) versus normalized frequency (0 to 0.5).]
Figure 7. Simulation of thermometer coded DAC (a) without and (b) with glitch minimization.
Paper 10 - Dynamic Element Matching in D/A Converters with Restricted Scrambling
Paper 10
ABSTRACT
Inaccurate matching of the analog sources in a D/A converter causes a signal-dependent error in the output. This distortion can be transformed into noise by assigning the digital control to the analog sources randomly, a technique referred to as dynamic element matching. In this paper, we present a dynamic element matching technique where the scrambling is restricted such that the glitches in the converter are minimized. By this, both the distortion due to glitches is reduced and the signal-dependent error due to matching is suppressed. A hardware structure that implements the approach is proposed, and its operation is described. Simulation results indicate that the method has the potential of yielding as good a reduction of glitches as the optimal thermometer-coded converter, with a signal-dependent error level that is almost as low as achieved with prior dynamic element matching techniques.
1. INTRODUCTION
A major problem in design of high-resolution communication D/A converters is
the inaccuracy in the fabrication process. This imperfection introduces mismatch
among the sources to the analog output, resulting in non-linear behavior of the
converter [1, 2]. To overcome this problem, a technique referred to as dynamic
element matching (DEM) has been suggested where digital signal processing is
used to control the switching of the analog sources so that the distortion is trans-
formed into noise [1, 3, 4, 5]. Hence, signal-dependent errors are suppressed, and
if we combine this technique with oversampling, we can reduce the error caused
by the noise by low-pass filtering the output [3].
However, converters in many modern communication applications need to oper-
ate at high speed. At high speed, glitches caused by delay variations in different
paths will have a significant impact on the achievable resolution of a converter.
To reduce the glitches, thermometer code can be used, which yields a minimal
amount of glitches compared with other codes, but requires complex hardware.
In practice, a segmented converter structure is used for high resolution converters
where the least significant source weights are binary scaled and the most signifi-
cant weights are thermometer-coded. Hence, the thermometer-coding used in the
presented DEM encoders applies to segmented converters as well.
The use of a thermometer encoder suits the DEM techniques well. A problem,
however, is that current DEM techniques use a type of scrambling that ruins the
good glitch property that can be achieved with thermometer code. In this paper
we present an approach to scrambling thermometer code so that the glitch energy
associated with a code transition is minimized, while the low sensitivity to
matching errors in a converter is maintained. In the following, we also suggest
a hardware structure that implements the presented approach and explain its
function with a simple example, using a 4-bit converter for the sake of
simplicity.
2. A DEM APPROACH
The operation of an N-bit thermometer-coded flash converter is characterized by
    A = ∑_{k=1}^{n} w_k · ref        (1)
Paper 10 - Dynamic Element Matching in D/A Converters with Restricted Scrambling
where A is the analog output, ref is a reference quantity (e.g., current,
voltage, or charge) that is added to the output, n = 2^N − 1 denotes the number
of reference sources, and w_1…w_n is a bit vector encoded from a digital input
D that controls which sources to add [1]. The name thermometer code implies
that a contiguous range of bits w_1…w_i is one, while the remaining bits are
zero. However, by relaxing the last constraint and allowing any w_k to be one
as long as the output is correct, we obtain a redundant code with many possible
representations for most numbers. This redundancy makes the code suitable for
use in DEM techniques, where we randomize which code to use. By restricting the
randomization to include only codes that produce small glitches, it is possible
to improve the glitch performance compared with a conventional DEM technique,
where a code is selected randomly from the full set of codes. In this work we
present an approach that aims at solving this problem.
The key idea in our approach is to construct a subset of codes containing only
the codes that cause a minimal number of bits to be altered in a code
transition. In this way we minimize the glitches, since they depend to a
significant extent on this parameter. The codes in a subset are identified from
an investigation of the two cases presented in the following.
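As an illustration of Eq. (1) and the code redundancy discussed above, the
conventional (non-redundant) thermometer encoding and the output sum can be
sketched in C. This is a minimal sketch, not the paper's hardware; the function
names and the 4-bit width are our own illustrative choices.

```c
#define N 4                    /* converter resolution in bits */
#define NSRC ((1 << N) - 1)    /* n = 2^N - 1 unit sources */

/* Conventional thermometer encoding: bits w[0..D-1] are one, the rest zero. */
static void thermometer_encode(int D, int w[NSRC]) {
    for (int k = 0; k < NSRC; k++)
        w[k] = (k < D) ? 1 : 0;
}

/* Analog output of Eq. (1): A is the sum of ref over all enabled sources. */
static double output(const int w[NSRC], double ref) {
    double A = 0.0;
    for (int k = 0; k < NSRC; k++)
        A += w[k] * ref;
    return A;
}
```

Note that any code word with D ones yields the same output A regardless of
which bits are set; this is exactly the redundancy that the DEM techniques
exploit when randomizing the code.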
One approach to implement this idea is illustrated in Fig. 1, where an N-bit
D/A converter is shown. Compared to the conventional approach, we have added a
register of 2^N − 1 D flip-flops to the output of the DEM encoder. The use of
this register is two-fold. First, the control signals w_k become independent of
delay variations in the encoder, which improves the glitch situation. Second,
the register stores the current state, which can be used in the encoder to
construct the proper subsets. The cost of this solution is an increased
complexity of the DEM encoder and, of course, the hardware for the additional
(2^N − 1)-bit wide register.
In a second paper, also presented at ICECS’00, we present another
implementation approach that instead uses a tree structure [6].
[Figure 1 (block diagram): the digital input D feeds the DEM encoder; its
outputs w_1, w_2, …, w_{2^N−1} are stored in D flip-flops that switch the unit
reference sources ref, which are summed into the analog output A.]
Figure 1. An N-bit D/A converter with a DEM encoder and a register for storing
the thermometer-coded state.
corners are the additional operations needed to handle the somewhat more com-
plex code selection case B. Now we will describe the operations needed to handle
case A.
[Figure 2 (block diagram of the DEM encoder): the input D and the (2^N−1)-bit
state W feed the blocks ‘2^N−1:N counter’, ‘Invert’, ‘Subtractor’ (with
overflow output c), ‘Negate’, ‘Thermometer encoder’, ‘M-bit scrambler’, ‘Zero
distributor’, and a final ‘Invert’, with intermediate signals B1–B4 and T1–T4,
producing the next state W'.]
3.2. Operations in case B
Obviously, the presented scheme is not designed to handle case B, where we need
to clear ones instead of setting zeros. However, this can easily be achieved by
modifying the described structure slightly. We detect case B, e.g., as an
overflow c in the ‘Subtractor’. When this case is detected, we can reuse the
hardware to clear ones instead of setting zeros by inverting both the input W
and the output W'. This is accomplished by the blocks ‘Invert’ producing T2 and
W' in Fig. 2, which invert a signal depending on the control input.
Some other modifications are also needed to handle case B. The block ‘Negate’
corrects the output B2 when there is an overflow from the ‘Subtractor’, i.e.,
it computes the number of ones to clear; the effective operation is B4 = |B2|.
Another modification concerns the block ‘M-bit scrambler’, whose input B3 in
case A is the number of zeros, indicating the number of bits to be scrambled.
In case B we instead need to scramble a number of bits corresponding to the
number of ones B1 in the current state. Since B3 is calculated as the inverted
B1, we simply make the inversion conditional on case A, as indicated in Fig. 2.
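The reuse of the case-A datapath for case B can be sketched in C. This is a
behavioral sketch under our own assumptions, not the actual hardware: the names
`set_zeros` and `transition` are illustrative, and the deterministic choice of
which zeros to set stands in for the scrambled selection.

```c
#define NSRC 15   /* 2^4 - 1 sources, matching the paper's 4-bit example */

/* Placeholder for the case-A datapath: set 'num' of the zero bits in the
   state w to one. A real encoder scrambles which zeros are chosen; here
   the first zeros found are used for simplicity. */
static void set_zeros(int w[NSRC], int num) {
    for (int k = 0; k < NSRC && num > 0; k++)
        if (w[k] == 0) { w[k] = 1; num--; }
}

/* Transition from the current state w to a state with D ones. Case B
   (fewer ones) reuses the case-A datapath by inverting the state before
   and after, mirroring the 'Invert' blocks, with B4 = |B2|. */
static void transition(int w[NSRC], int D) {
    int ones = 0;
    for (int k = 0; k < NSRC; k++)
        ones += w[k];
    int diff = D - ones;                /* subtractor output B2 */
    if (diff >= 0) {                    /* case A: set zeros */
        set_zeros(w, diff);
    } else {                            /* case B: clear ones via inversion */
        for (int k = 0; k < NSRC; k++) w[k] ^= 1;
        set_zeros(w, -diff);            /* B4 = |B2| ones to clear */
        for (int k = 0; k < NSRC; k++) w[k] ^= 1;
    }
}
```

Because only zeros are set (or, via inversion, only ones are cleared), exactly
|B2| bits change in any transition, which is the minimal-switching property the
approach relies on.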
T2: 0110-----------
- distribution to zeros
T3: 101011101011111
- changes
T4: 101111111011111
The arrows going from the four scrambled bits indicate the zero bits to be
replaced in the current state, and the remaining two arrows indicate which of
the two zeros is actually set. The next state becomes
W' = 101111111011111
which is output to the register.
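The zero-distribution step in the example above can be sketched in C as
follows. This is our own illustrative sketch: `distribute_zeros` is a
hypothetical name, and `rand()` stands in for the paper's ‘M-bit scrambler’.

```c
#include <stdlib.h>

#define NSRC 15   /* 2^4 - 1 sources, as in the 4-bit example */

/* Scrambled zero distribution: set 'num' randomly selected zero bits of w
   to one. Only zero bits are altered, so no source is switched off, which
   keeps the number of switching sources minimal. The caller must ensure
   'num' does not exceed the number of zeros in w. */
static void distribute_zeros(int w[NSRC], int num) {
    while (num > 0) {
        int k = rand() % NSRC;
        if (w[k] == 0) { w[k] = 1; num--; }
    }
}
```

Randomizing which zeros are set is what decorrelates the source usage from the
signal, while restricting the changes to zero bits keeps the glitch energy of
the transition minimal.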
5. SIMULATION RESULTS
The function of the proposed hardware was verified by a C program that simu-
lates the hardware for an N-bit converter, where N is defined at compilation time.
To estimate the performance of the presented approach, we modeled four 6-bit
converters in Matlab, assuming that the glitch power is proportional to the
number of switching sources. The modeled D/A converters were three conventional
converters (binary-scaled, thermometer-coded, and thermometer-coded with
conventional DEM) plus a thermometer-coded converter with the presented DEM
approach. As a measure of glitch performance we use the ratio
between simulated glitch power and signal power. In Table 1, power ratios
obtained from simulation with a multi-tone input are listed. The input contained
256 tones with equidistant frequency spacing, distributed over the entire Nyquist
frequency range. The power ratios have been normalized with respect to the
binary-scaled converter. In the table, we see that there is an improvement from
using a thermometer-coded converter over a binary-scaled. However, this gain in
performance is lost when we introduce conventional DEM. The presented DEM
approach is able to regain the glitch performance to the level of the thermometer-
coded converter.
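The glitch model stated above can be made concrete with a short C sketch; the
function name is our own, but the model (glitch power proportional to the
number of switching sources) is the one used in the simulations.

```c
/* Glitch model used in the simulations: the glitch power at a code
   transition is taken to be proportional to the number of switching
   sources, i.e., the Hamming distance between consecutive control words
   of length n. */
static int switching_sources(const int w_old[], const int w_new[], int n) {
    int count = 0;
    for (int k = 0; k < n; k++)
        count += (w_old[k] != w_new[k]);
    return count;
}
```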
To investigate the performance in terms of matching errors, we applied
independent, Gaussian-distributed relative matching errors with a standard
deviation of 2% to each weighted source in all converter structures. In
Table 2, the estimated SFDR from the simulations is given. We see that both the
converter with conventional DEM and the converter with DEM using restricted
scrambling improve the SFDR by 13 dB over the other structures.
These results indicate that our DEM technique is able to reclaim the gain in
SNDR that is lost with conventional DEM techniques, while the performance in
terms of matching is maintained.
6-bit converter       Normalized power ratio [dB]
Binary-scaled           0
Thermometer-coded     -11
Conventional DEM        0
Restricted DEM        -11
6. CONCLUSION
A DEM approach was presented that aims at reducing the additional glitch
energy introduced by other DEM techniques. This is achieved by restricting the
scrambling in DEM to only include codes that do not increase the glitch energy.
Further, a hardware structure was proposed that implement this approach. The
hardware is realized from two cases depending on the state of the analog output.
In the first case the number of bits that are one increases, and in the second case
the number of ones decreases. We start by describing how the first case can be
implemented, and then we reuse the hardware in the second case by introducing
some additional hardware that is activated when the second case is detected. This
can be achieved since there is a simple relation between the two cases that
enables a simple transformation of the input and output state.
The functionality of the hardware was verified with a C program that simulates
the hardware for an N-bit converter, where N is a generic parameter. To
estimate the performance, four 6-bit converters were also modeled in Matlab,
using a simple model for the glitches. The simulation results indicated that
the proposed implementation can suppress glitches as well as the optimal
thermometer-coded converter, while yielding a distortion level almost as low as
that of conventional DEM implementations.
7. REFERENCES
[1] R.J. van de Plassche, Integrated Analog-to-Digital and Digital-to-Analog
Converters, Kluwer Academic Publishers, Boston, 1994.
[2] M. Gustavsson, J.J. Wikner, and N. Tan, CMOS Data Converters for
Communications, Kluwer Academic Publishers, 2000.
[3] P. Carbone and I. Galton, “Conversion error in D/A converters employing
dynamic element matching”, Proc. 1994 IEEE Int. Symp. on Circuits and
Systems, vol. 2, 1994, pp. 13-16.
[4] H.T. Jensen and I. Galton, “A low-complexity dynamic element matching
DAC for direct digital synthesis”, IEEE Trans. on Circuits and Systems II,
vol. 45, no. 1, Jan. 1998, pp. 13-27.
[5] L.R. Carley and J. Kenney, “A 16-bit 4th-order noise-shaping D/A
converter”, in Proc. of Custom Integrated Circuits Conference, 1998,
pp. 21.7/1-21.7/4.
[6] M. Rudberg, M. Vesterbacka, N.U. Andersson, and J.J. Wikner, “Glitch
minimization and dynamic element matching in D/A converters”, to appear
in Proc. of the 7th IEEE Int. Conf. on Electronics, Circuits, and Systems,
Beirut, Lebanon, Dec. 17-20, 2000.
Dissertations
Division of Electronics Systems
Department of Electrical Engineering
Linköpings universitet
Sweden