
Linköping Studies in Science and Technology

Dissertation No. 705

DSP ALGORITHMS AND


ARCHITECTURES FOR
TELECOMMUNICATION

Mikael Karlsson Rudberg

Department of Electrical Engineering


Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping 2001
DSP Algorithms and Architectures for
Telecommunication

Copyright © 2001 Mikael Karlsson Rudberg

Department of Electrical Engineering


Linköpings universitet,
SE-581 83 Linköping

ISBN 91-7373-069-6 ISSN 0345-7524


Printed in Sweden by UniTryck, Linköping, 2001
Abstract
Techniques for providing users with high quality, high capacity digital transmission links have been a research focus in recent years. Both academia and industry are trying to develop methods that can provide consumers with high capacity transmission links at a low price. Utilizing the twisted-pair copper wires that exist in almost every home for wideband data transmission is one of the most promising technologies for providing wideband communication capacity to the consumer.
In this thesis we present algorithms and architectures suitable for the signal processing needed in the Asymmetrical Digital Subscriber Line (ADSL) and the Very High Speed Digital Subscriber Line (VDSL) standards. The FFT is one of the key blocks in both the ADSL and the VDSL standard. In this thesis we present an implementation of an FFT processor for these applications. The implementation was made using a new design methodology suitable for programmable signal processors that are optimized towards one or a few algorithms. The design methodology is presented, together with an improved version in which a tool converts a combined instruction and algorithm description into a dedicated, programmable DSP processor.
In many applications, for instance video streaming, the required channel capacity far exceeds what is possible today. In such applications data must be compressed using techniques that reduce the required channel capacity to a feasible level. In this thesis, architectures for image and video decompression are presented.
Keeping the cost of ADSL and VDSL equipment low requires the use of low cost technologies. One way, proposed in this thesis, is to accept errors in the A/D and D/A converters and to correct these errors using digital signal processing and the properties of a known application. Methods for cancellation of errors found in time-interleaved A/D converters are proposed.
Acknowledgment
I would like to thank my supervisor, Prof. Lars Wanhammar, for his support and guidance, and Gunnar Björklund at the Microelectronics Research Center, Ericsson Microelectronics AB, for the support that made it possible to finish my Ph.D. as part of my work at Ericsson Microelectronics AB.
I also want to thank everyone I have worked with in the VIBRA research project, the project that has financed much of the research in this thesis. Thanks also to Mikael Hjelm for valuable discussions on DMT algorithms and the synthesis tool.
Finally, I want to thank everyone at Electronics Systems, Linköping University, and at the Microelectronics Research Center for valuable discussions, help, and inspiration.
Table of Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Digital communication . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Digital communication systems 4
1.1.2 Modulation 5
1.2 The JPEG and MPEG standards . . . . . . . . . . . . . . . 6
1.3 The DMT transmission technique . . . . . . . . . . . . . . 8
1.3.1 DMT modulation 9
1.3.2 Frequency allocation 11
1.3.3 The DMT symbol 12
1.3.4 The splitter 12
1.4 Scope of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Digital Signal Processing Architectures . . . 17


2.1 DSP algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.1 Sample period bound 19
2.1.2 Mapping of algorithms to hardware 19
2.1.3 Power consumption 19
2.2 DSP architectures . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Fixed-function architectures 21
2.2.2 Programmable architectures 22
2.3 DSP architectures with programmability and high efficiency . . . 24
2.4 Design methodology for ASDSP . . . . . . . . . . . . . 24
2.4.1 Modelling of a JPEG DSP 26
2.4.2 Design and implementation of an FFT processor 27
2.5 ASDSP design methodology . . . . . . . . . . . . . . . . 32
2.5.1 Architecture synthesis from mC 33

3 Variable Length Decoding . . . . . . . . . . . . . . 35

3.1 Variable length codes . . . . . . . . . . . . . . . . . . . . . . 35
3.2 The VLC decoding process . . . . . . . . . . . . . . . . . . 36
3.2.1 Tree based decoding 36
3.2.2 Symbol parallel decoding 37
3.3 VLC decoder with simplified length decoder . . . . 38
3.4 VLC decoder with pipelined length decoder . . . . 39
3.5 VLC decoder with symbol decoder partitioning . . 40
3.6 Length decoder implementation . . . . . . . . . . . . . . 40
3.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Data Converters in Communication Systems . . . 43


4.1 Analog-to-digital conversion . . . . . . . . . . . . . . . . 43
4.2 ADC errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Time-interleaved ADC . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Offset in TIADCs 48
4.3.2 Gain and sample timing mismatch 51
4.3.3 Gain and timing mismatch effects on SNDR 52
4.3.4 Gain and timing mismatch cancellation 54
4.4 Digital-to-analog conversion . . . . . . . . . . . . . . . . 55
4.4.1 Error sources 56
4.4.2 Scrambling 57

5 Author's Contribution to Published Work . . . . . . . . . . . . . . 63

6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Paper 1

New Approaches to High Speed Huffman Decoding . . . . . . . . . 77
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2 PREVIOUS WORK . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3 TWO NEW FAST HUFFMAN DECODER STRUCTURES . . . . . 81
3.1. The basic Huffman decoder 81
3.2. Huffman length decoder with relaxed evaluation time 82
3.3. Pipelined Huffman length decoder 83
3.4. Symbol decoder 84
4 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Paper 2

Implementation of a Fast MPEG-2 Compliant Huffman Decoder . . . 87
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2 HUFFMAN DECODER . . . . . . . . . . . . . . . . . . . . . . 90
2.1. Handling of special markers 92
3 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.1. Improvements of the length decoder 93
3.2. Symbol decoder 93
3.3. Synthesis 94
3.4. Symbol tables 94

4 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Paper 3

High Speed Pipelined Parallel Huffman Decoding . . . . . . . . . . 97
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2 HUFFMAN DECODER MODELS . . . . . . . . . . . . . 100
3 PIPELINED PARALLEL HUFFMAN DECODING . . . . . . . . 101
3.1. Reducing symbol decoder requirements 102
3.2. Symbol decoder partitioning 103
4 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Paper 4

Design of a JPEG DSP using the Modular Digital Signal Processor Methodology . . . 107
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2.1. Modelling with the MDSP methodology 111

3 HARDWARE PARTITIONING . . . . . . . . . . . . . . . . 111
3.1. Interface design 112
4 HARDWARE/SOFTWARE TRADE-OFFS . . . . . . 113
4.1. Huffman processor 113
4.2. IDCT processor 114
5 CONCLUSIONS AND FURTHER WORK . . . . . . . 114
6 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Paper 5

Design and Implementation of an FFT Processor for VDSL . . . . . 117
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
2 ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3 DESIGN FLOW . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4 DESIGN SPACE EXPLORATION . . . . . . . . . . . . . 122
5 ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.1. IO 124
5.2. Memory 124
5.3. Datapath 124
6 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . 125
6.1. Key data 125
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Paper 6

Application Driven DSP Hardware Synthesis . . . . . . . . . . . . 129
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3 SYNTHESIS FRAMEWORK . . . . . . . . . . . . . . . . . 132
4 THE DSP SYNTHESIS TOOL . . . . . . . . . . . . . . . . . 133
4.1. Target architecture 133
4.2. Synthesis library 133
4.3. Synthesis 134
4.3.1. User control 135
5 EXAMPLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Paper 7

ADC Offset Identification and Correction in DMT Modems . . . . . 141
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
1.1. Mismatch between ADC channels 144
2 IDENTIFICATION OF OFFSET . . . . . . . . . . . . . . . 145
2.1. Communication system 145

3 CORRECTION OF OFFSET IN DMT MODEMS . 146
3.1. DMT based communication system 146
3.2. Correction of offset before connection 148
3.3. Correction of offset during initialization 148
3.3.1. Activation 148
3.3.2. Modem training 149
3.4. Correction of offset during transmission 149
4 SIMULATION RESULTS . . . . . . . . . . . . . . . . . . . . 150
5 HARDWARE ARCHITECTURE . . . . . . . . . . . . . . 151
6 ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . 151
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Paper 8

Calibration of Mismatch Errors in Time Interleaved ADCs . . . . . 153
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
1.1. Error sources in a TIADC 156
1.2. Gain Mismatch 156
1.3. Timing Mismatch 157
1.4. Methods to cancel gain and timing mismatch 158
2 THE DMT MODEM . . . . . . . . . . . . . . . . . . . . . . . . . 159
3 IDENTIFICATION OF ERRORS . . . . . . . . . . . . . . 160
3.1. Error identification 161
3.2. Signal reconstruction 161
3.3. Implementation aspects 162
4 SIMULATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Paper 9

Glitch Minimization and Dynamic Element Matching in D/A Converters . . . 167
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
1.1. Reducing glitches 170
1.2. Reducing influence from matching errors 171
1.3. Scrambler 173
1.4. Scrambler with unordered thermometer code 174
2 SIMULATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Paper 10

Dynamic Element Matching in D/A Converters with Restricted Scrambling . . . 181
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
2 A DEM APPROACH . . . . . . . . . . . . . . . . . . . . . . . . 184
2.1. Code selection case A: Bit increase 185
2.2. Code selection case B: Bit decrease 185
2.3. DEM approach 185
3 REALIZATION OF A DEM ENCODER . . . . . . . . 186
3.1. Description of the operations 187
3.2. Operations in case B 188

4 A 4-BIT CONVERTER EXAMPLE . . . . . . . . . . . . 188
5 SIMULATION RESULTS . . . . . . . . . . . . . . . . . . . . 190
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

Abbreviations
and Acronyms
ADC Analog-to-digital converter
ADSL Asymmetrical digital subscriber line
ASDSP Application specific digital signal processor/processing
CO Central office
CPE Customer premises equipment
DAC Digital-to-analog converter
DCT Discrete cosine transform
DEM Dynamic element matching
DMT Discrete multi tone
DNL Differential nonlinearity
DSL Digital subscriber line
DSP Digital signal processing
EC Echo cancelling
EXU Execution unit
GSM Global system for mobile communications
IDCT Inverse discrete cosine transform
IFFT Inverse fast Fourier transform
INL Integral nonlinearity
FDM Frequency division multiplex
FEQ Frequency domain equalizer
FFT Fast Fourier transform
FIFO First in first out
FIR filter Finite impulse response filter
HDSL High speed digital subscriber line
JPEG Joint Photographic Experts Group
MDSP Modular digital signal processor

MPEG Moving Picture Experts Group
OFDM Orthogonal frequency division multiplexing
PAR Peak to average ratio
POTS Plain old telephone system
SFDR Spurious free dynamic range
SFG Signal flow graph
SHDSL Symmetric high bit-rate digital subscriber line
SNR Signal to noise ratio
SNDR Signal to noise and distortion ratio
TDM Time division multiplexing
TEQ Time domain equalizer
TIADC Time-interleaved analog-to-digital converter
VDSL Very high speed digital subscriber line
VLC Variable length code
QAM Quadrature amplitude modulation

1 Introduction
This thesis consists of two parts: part one provides a background to the applications of interest and the problems relevant to this thesis, while part two consists of a selection of publications. The research has been carried out in the period 1995 to 2001. The publications consider hardware implementation of signal processing in telecommunication systems, ranging from coding of images to transmission over wideband digital subscriber lines.

1.1 Digital communication


Digital communication is today used in a range of products, from mobile phones to computer networks. The use of digital communication provides increased performance compared to the previously used analog communication methods. One important factor behind the success of both the Internet and mobile phones is the advances in process technology. New process generations make it cheaper to implement advanced digital signal processing, which is the enabler of digital communication. Digital signal processing makes it possible to implement communication methods with more complex modulation schemes, adaptive receivers, and error correction. It is today possible to achieve transmission capacities close to the channel capacity theorem stated by Shannon [1]. The theorem describes the theoretical capacity limit of a communication channel disturbed by additive white Gaussian noise with power spectral density N0/2, a channel bandwidth B, and an average power level P. The capacity is then given by

C = B · log2(1 + P / (N0 · B))  bits/s    (1.1)


Since the channel capacity is limited, there is a need for techniques that can reduce the required channel capacity for a given service. Three important areas where compression techniques are widely used for better utilization of the channel capacity are the transmission of speech, images, and video. In a mobile phone system, voice data is compressed from 64 kbit/s down to 11.4 kbit/s (GSM, half-rate) while keeping an acceptable speech quality [2].
For image and video transmission the JPEG [3] and MPEG [4] standards are widely used. It is interesting to note that even if the available bandwidth keeps increasing, compression of image and video signals will be crucial for many years to come. Transmitting standard resolution video with acceptable quality requires 1.5-2.5 Mbit/s with compression. Transmitting uncompressed video is not even an option today, since this would require data rates above 50 Mbit/s.

1.1.1 Digital communication systems


A digital communication system can be outlined as shown in Fig. 1.1 [5]. The signal is created in a digital source, which for instance can be digital data generated in a computer, digitized speech, or digital video. The source encoder provides a one-to-one mapping from the input signal to a new representation suitable for transmission. The objective is to eliminate or reduce redundancy, i.e., to give the signal a more efficient representation. The source decoder re-creates the original signal. The channel encoder and decoder are used to provide a reliable transmission link by introducing a controlled redundancy that is used for detection and correction of transmission errors. In the modulator the information is modulated, which gives a signal suitable for transmission in the desired frequency band. The task of the detector at the receiver end is to detect which signal was transmitted. Sometimes the detector and the channel decoder are combined into one block, which in this thesis is referred to as the decoder.

[Block diagram: transmit path: digital source → source encoder → channel encoder → modulator → channel (with noise); receive path: channel → detector → channel decoder → source decoder → user]

Figure 1.1 Digital communication system.


1.1.2 Modulation
Modulation is the way information is mapped onto a signal. The transmitted information is divided into symbols, where one symbol has a finite duration. The information content is encoded into the shape of the waveform during the symbol period. Common ways to encode the information are to put it into the amplitude and/or phase of the waveform. In this thesis we will mainly consider the quadrature amplitude modulation (QAM) technique and its relatives.
In QAM the information is mapped onto a carrier, which often is a sinusoid, using different phases and amplitudes. The transmitted signal s_i(t) is a sinusoid with four possible phases φ_i. These phases are created by varying a and b in Eq. 1.2, where the sin and cos terms are the basis functions. E is a constant related to the transmitted energy [6].

s_i(t) = E · (a · cos(ωt) + b · sin(ωt))    (1.2)

This encoding scheme can be illustrated using a constellation diagram, where a is on one axis and b is on the other. In Fig. 1.2 an example is shown where 4-QAM is used. In this case only the phase of the signal is of interest. The four possible points allow two binary bits to be transmitted per symbol. The encoding of the two bits is normally made so that the error probability is minimized, which in this case means using a Gray encoding scheme with as small a difference as possible between the bits in adjacent constellation points.
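The Gray-coded mapping described above can be sketched as follows. This is an illustrative sketch: the unit-amplitude points and the table/function names are our assumptions, but the layout follows Fig. 1.2, so horizontally or vertically adjacent constellation points differ in exactly one bit.

```python
# Gray-coded 4-QAM mapper: bit pair -> constellation point (a, b),
# assuming unit-amplitude points a, b in {+1, -1} (layout as in Fig. 1.2:
# (00) top-left, (01) top-right, (10) bottom-left, (11) bottom-right).
GRAY_4QAM = {
    (0, 0): (-1.0, +1.0),
    (0, 1): (+1.0, +1.0),
    (1, 0): (-1.0, -1.0),
    (1, 1): (+1.0, -1.0),
}

def qam4_modulate(bits):
    """Map an even-length bit sequence to a list of (a, b) pairs."""
    assert len(bits) % 2 == 0
    return [GRAY_4QAM[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]
```

Because neighbours differ in only one bit, a noise-induced decision error into an adjacent point costs a single bit error rather than two.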

[Constellation diagram in the a-b plane: four points, one per quadrant, labelled (00) and (01) in the upper half, (10) and (11) in the lower half; shaded quadrants mark the decision regions]

Figure 1.2 QAM constellation diagram.

The detector decides how to interpret the received symbol. When using 4-QAM the decision is taken based on which one of the four possible constellation points is located closest to the received symbol. The decision boundaries are outlined as shaded areas in Fig. 1.2. The distance between the received constellation point and the ideal position is a measure of the noise level in the channel. If the noise level is too high the detector may not be able to correctly detect which symbol was originally sent from the transmitter, and there will be a bit error. In order to reduce the probability of bit errors it is common to introduce coding, where redundancy is added to the signal in a controlled way so that some bit errors can be corrected.
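The minimum-distance decision rule just described can be sketched in a few lines. The constellation list and helper names are our illustrative assumptions; the rule itself is exactly the "closest point wins" decision from the text.

```python
import math

# Minimum-distance detection for 4-QAM: pick the constellation point
# closest (in Euclidean distance) to the received (a, b) sample.
CONSTELLATION = [(-1, 1), (1, 1), (-1, -1), (1, -1)]

def detect(received):
    """Return the constellation point nearest to the received sample."""
    return min(CONSTELLATION,
               key=lambda p: math.hypot(received[0] - p[0], received[1] - p[1]))

# A sample distorted by moderate noise still falls in the correct
# decision region; with heavy noise it may cross a boundary -> bit error.
assert detect((0.8, 1.3)) == (1, 1)
```

The residual distance between the received sample and the chosen ideal point is exactly the per-symbol noise measure mentioned above.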
In the general case we can allow more than four points in the constellation, which is here referred to as M-ary QAM. A typical case with 16 possible points in the constellation is shown in Fig. 1.3. When more than four points are used in a QAM constellation, both the amplitude and the phase are used as signal carriers.

[Constellation diagram in the a-b plane: 16 Gray-coded points (0000) through (0110) arranged in a 4 x 4 grid]

Figure 1.3 16-QAM.

1.2 The JPEG and MPEG standards


Compression of images and video is important for transmitting high quality images and video over a transmission channel with limited channel capacity. The JPEG image compression standard from 1994 is one of the most common standards for transmission of images over the Internet [3]. The JPEG standard is a generic standard suitable for continuous-tone digital images. The compression scheme implemented in the JPEG standard is a combination of algorithms that can exactly recreate the original information and algorithms where information is removed from the images and hence cannot be exactly recreated. The compression algorithms that remove information are referred to as lossy algorithms in this thesis.
In Fig. 1.4 the key algorithms in a JPEG coder are outlined. The lossy compression is made using the Discrete Cosine Transform (DCT), which converts the image representation to the frequency domain. In the frequency domain, data is quantized so that fewer bits are needed for representing high frequency information. The quantization has been optimized with respect to how sensitive humans are to different frequencies in images. The human eye is less sensitive to noise at higher frequencies than at lower ones.
Further data compression is achieved using Run-Length-Zero (RLZ) coding and variable length coding. Neither of these methods removes information; they just find a more efficient representation. In RLZ coding, long sequences of zeros are replaced with the number of zeros in a row and the next nonzero value. For example, the sequence {0,0,0,0,3} is replaced with {4,3}. In variable length coding, the frequency of RLZ coded data determines the number of bits used for representing a value. For instance, we may assign a shorter representation to the RLZ coded value {0,1} than to {4,1}, which is less frequent. More about variable length codes and decoding of variable length coded data can be found in Chapter 3.

[Block diagram: image data → input buffer → DCT → quantize → RLZ → VLC → JPEG coded images]

Figure 1.4 JPEG coding.

A common format for compressed video is the MPEG-2 Video standard [4]. There are many similarities between the JPEG and MPEG standards. Both standards use DCT, RLZ, and VLC for compression of images. The main difference between the standards is that the MPEG-2 Video standard not only compresses the digital images one by one, but also considers similarities between adjacent images in the video stream. To accomplish this, a motion estimation unit is needed in the video encoder. The motion estimation searches for similarities between images in the video sequence and is the most resource-demanding algorithm in the MPEG encoder. Instead of transmitting the image data, only the difference between images may be transmitted when this is more efficient. While the MPEG encoding does not have to be made in real time, the decoding does, since the decoding is made while the video stream is watched. Real-time MPEG decoding is therefore more important than real-time encoding. An outline of an MPEG-2 decoder is shown in Fig. 1.5.

[Block diagram: input stream → input buffer → parser → VLC decoder (VLD) → RLZ decoder → IQ → IDCT → motion compensation → picture reorder → decoded video]

Figure 1.5 MPEG-2 Video decoder.

The first step in the decoder is to extract the control information, which contains information about which type of coding has been used, the image size, and so on. The VLC and RLZ decoders reverse the operations of the VLC and RLZ encoders. The Inverse Quantizer (IQ) multiplies the coefficients with the quantization coefficients used in the quantizer, which restores the signal levels at each frequency. The Inverse Discrete Cosine Transform (IDCT) transforms the image back from the frequency domain to the spatial domain. If only the difference between two images has been transmitted, the image data is restored by adding the previously transmitted image to the received difference image. Finally, the images may have to be re-ordered, since the encoder performs a picture re-ordering to better exploit similarities between adjacent images in the video stream.

1.3 The DMT transmission technique


Twisted pair copper wires, which today are mainly used for telephony, can also be used for transmission of data at quite high speeds. There are several competing standards and techniques that can be used for increasing the data rates on twisted pair cables. The family of standards for this kind of communication is often referred to as xDSL, where DSL stands for Digital Subscriber Line. Some of the standards belonging to this family are ADSL, ADSL.lite, VDSL, HDSL, and SHDSL.
The Asymmetrical DSL (ADSL) standard is the technique that today dominates high speed communication on twisted pair cables [7]. The ADSL technique handles the last few kilometers from the so-called central office (CO) to the homes. The equipment in our homes is usually referred to as the customer premises equipment, CPE, see Fig. 1.6. ADSL is suitable when the distance from the CO is less than around 5 km. The reach of VDSL is even shorter, since higher, more attenuated, frequencies are used.


The data rates in ADSL are up to 9 Mb/s from the CO to the CPE side, and up to 1 Mb/s in the other direction. The reason for providing higher bit rates in the downstream direction is that the need for high data rates is assumed to be greater in this direction.
The very high speed DSL (VDSL) standard will provide data rates up to 50 Mb/s. The standardization of VDSL has, however, been much delayed due to problems with agreeing on which modulation method is most suitable. Much of the work in this thesis has been based on a VDSL technique proposed by Ericsson, which is based on the Discrete Multi Tone (DMT) modulation scheme that is also used in the ADSL standard [8].

[Illustration: several ADSL lines leave the central office (CO), pass a cross-connecting point, and reach the customer premises, where the ADSL modem serves e.g. a TV, a PC, and Internet access]

Figure 1.6 The ADSL scenario.

1.3.1 DMT modulation


In the DMT technique the information is encoded on a large number of signal carriers. The signal conditions on a twisted pair cable may vary with frequency, and the independence between the carriers in the DMT technique provides a possibility to optimize the amount of information to send on each carrier. In DMT the multi-carrier signal is created by using the Inverse Discrete Fourier Transform (IDFT) with the input a + j·b, where a and b are basically the same as in Eq. 1.2. The effective encoding technique on each carrier will be M-ary QAM, where M varies depending on the channel capacity of each carrier. The decoding is done by first splitting the multi-carrier signal into its components by applying the Discrete Fourier Transform (DFT) to the received signal, and then decoding each carrier individually. In practice the IDFT and DFT are calculated using the numerically equivalent fast transforms IFFT and FFT. The constellation size on each carrier is dynamically adapted to a varying noise level by using a "bit swapping" algorithm [9].
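The IDFT/DFT modulation pair can be sketched directly from its definition. This is a deliberately slow direct transform, standing in for the IFFT/FFT used in practice; the conjugate-symmetric carrier vector that yields a real-valued line signal is an assumption we make explicit here, and all names are ours.

```python
import cmath

def idft(X):
    """Inverse DFT, x[n] = (1/N) * sum_k X[k] * exp(j*2*pi*k*n/N).
    Builds the time-domain multi-carrier signal from per-carrier values a + jb."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(x):
    """Forward DFT: splits the received signal back into its carriers."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Two carriers loaded with QAM values a + jb; the mirrored complex
# conjugates make the transmitted time-domain samples real-valued,
# as required on the copper line.
X = [0, 1 + 1j, -1 + 1j, 0, 0, 0, (-1 + 1j).conjugate(), (1 + 1j).conjugate()]
x = idft(X)   # time-domain DMT symbol (real-valued)
Y = dft(x)    # the receiver recovers the per-carrier values
```

Because the carriers are orthogonal over the symbol, each QAM value is recovered independently at the receiver, which is what allows per-carrier bit loading.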
The main blocks in a DMT modem are outlined in Fig. 1.7. In addition to the outlined blocks we also need blocks for clock recovery and symbol synchronization, as well as serial/parallel converters, etc. These blocks have been excluded to simplify the explanation of the basic idea behind DMT communication.

[Block diagram: transmit path: framer → FEC encoder → (interleaver) → encoder → IFFT → DAC → analog frontend → line; receive path: ADC → EC → TEQ → FFT → FEQ → decoder → RS-decoder → (deinterleaver) → deframer]

Figure 1.7 Block diagram of a DMT modem.

The information is put into frames and symbols in the block called the framer. Redundancy information is added in the forward error correction block, FEC, which makes it possible to detect and correct some transmission errors. The Reed-Solomon decoder (RS-decoder) is used for correction of transmission errors. There are two transmission paths, one with an interleaver and one without. The interleaver/deinterleaver pair spreads transmission errors in time, which increases the error correction performance of the RS-decoder. Unfortunately, the delay through the system is also increased, which causes problems for instance in two-way communication. Therefore there is also a non-interleaved transmission path that can be used for delay sensitive applications.
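The error-spreading idea behind the interleaver can be sketched with a simple block interleaver (ADSL in fact uses a convolutional interleaver, so this is only an illustration of the principle; the names are ours). Data is written row by row and read column by column, so a burst of consecutive channel errors lands on widely separated positions after deinterleaving.

```python
# Simple block interleaver: write a rows x cols grid row by row,
# read it out column by column.
def interleave(data, rows, cols):
    assert len(data) == rows * cols
    return [data[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(data, rows, cols):
    # Inverse mapping: element (r, c) was emitted at position c*rows + r.
    return [data[c * rows + r] for r in range(rows) for c in range(cols)]
```

With a 3 x 4 grid, three consecutive interleaved samples come from original positions 0, 4, and 8, so a burst hitting them is spread across three different codewords for the RS-decoder to correct, at the cost of the extra latency mentioned above.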
EC stands for echo cancelling, which is needed if the data transmitted in the upstream and downstream directions share the same frequency space. In this case the received signal will contain some of the transmitted signal, which must be removed so as not to disturb the decoder.
TEQ is the time domain equalizer, and FEQ is the frequency domain equalizer. The task of the equalizers is to work as an "inverse filter" to the channel impulse response so that the original signal is restored, giving a signal as close to the transmitted signal as possible. By using two equalizers, the total complexity of implementing the equalizers is reduced compared with using only a TEQ.


The analog frontend contains analog filters and a line driver. Sometimes the digi-
tal-to-analog (DAC) and the analog-to-digital (ADC) converters as well as digital
interpolation and decimation filters are also counted as parts of the analog fron-
tend.

1.3.2 Frequency allocation


ADSL uses 256 carriers in the downstream and 32 carriers in the upstream
direction, see Fig. 1.8. The communication takes place in both directions at the
same time, and since the upstream and downstream bands overlap, an echo
canceller is required to cancel the signal sent in the other direction on the same
frequencies. The standard does, however, allow non-overlapping frequency bands
(frequency division multiplexing, FDM), which is cheaper since less complex
hardware is needed. Today the normal case is not to use overlapping frequencies,
since the consumer market is very cost sensitive.
The 256 downstream carriers and the 32 upstream carriers cover the frequencies
from 0 Hz up to 1.104 MHz and 138 kHz, respectively. The number of carriers
actually used is lower, since it is not possible to use the frequencies occupied by
other systems like the plain old telephone system (POTS).
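The numbers above fix the carrier spacing, which is the same in both directions; a quick check in Python (band edges taken from the text):

```python
# Carrier spacing implied by the ADSL figures quoted above.
DOWNSTREAM_CARRIERS = 256
UPSTREAM_CARRIERS = 32
DOWNSTREAM_EDGE_HZ = 1_104_000  # 1.104 MHz
UPSTREAM_EDGE_HZ = 138_000      # 138 kHz

down_spacing = DOWNSTREAM_EDGE_HZ / DOWNSTREAM_CARRIERS  # Hz per carrier
up_spacing = UPSTREAM_EDGE_HZ / UPSTREAM_CARRIERS        # Hz per carrier
```

Both evaluate to the ADSL carrier spacing of 4.3125 kHz, which is why the upstream band occupies exactly the first 32 of the 256 carrier positions.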

[Figure: two frequency plans drawn as power level versus frequency. (a) ADSL with overlapping frequencies: the POTS band below 4 kHz, the upstream band from about 30 kHz to 138 kHz, and the downstream band extending up to 1.104 MHz, overlapping the upstream band. (b) ADSL without overlapping frequencies: the same bands, but with the downstream band placed above 138 kHz.]

Figure 1.8 ADSL frequency plan.


VDSL uses a wider range of frequencies than ADSL. In this work we have aimed
at frequencies up to around 11 MHz, which may change slightly when the
standard is set. From the beginning a time-division multiplexing scheme was
proposed, where transmission only took place in one direction at a time. The
current proposal, however, uses different frequencies for the two directions
instead (frequency division multiplexing, FDM). The frequency plan has not been
completely finalized yet, but it seems clear that there will be several downstream
as well as upstream bands in the final standard.

1.3.3 The DMT symbol


A DMT symbol consists of 2N samples that are mapped onto up to N carriers
using the IFFT. Additionally, a cyclic prefix (CP) is added before the 2N
samples. Hence the total symbol length becomes 2N + CP samples. The CP is a
copy of the last part of the 2N samples and is used for reducing the transients
that occur between two symbols. The length of the CP is set in the standard and
has been selected as a compromise between how long a CP can be afforded and
the length needed for the transients to fade away before the information arrives,
see Fig. 1.9.

[Figure: a DMT symbol with its cyclic prefix, which is a copy of the last part of the symbol placed in front of it. The transients between two symbols start at the beginning of the cyclic prefix and have died out before the information part of the symbol arrives.]

Figure 1.9 Cyclic prefix avoids transients.
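The prefix operation itself is only a copy. A Python sketch with hypothetical helper names, treating a symbol as a plain list of samples:

```python
def add_cyclic_prefix(symbol, cp_len):
    """Prepend a copy of the last cp_len samples of the symbol."""
    return symbol[-cp_len:] + symbol

def strip_cyclic_prefix(rx_symbol, cp_len):
    """Receiver side: discard the prefix, where the inter-symbol transients live."""
    return rx_symbol[cp_len:]

symbol = list(range(16))            # a toy 2N = 16 sample DMT symbol
tx = add_cyclic_prefix(symbol, 4)   # total length 2N + CP = 20 samples
rx = strip_cyclic_prefix(tx, 4)     # back to the original 16 samples
```

The receiver simply discards the first CP samples, so any transient confined to the prefix never reaches the FFT.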

In addition to the user data there are extra fields in the symbol that contain
information used by the two modems to exchange system parameters and other
control information.

1.3.4 The splitter


The ADSL modem is thought of as an add-on to the subscriber line, and it is
therefore necessary to keep the POTS well separated from the new ADSL system.
POTS and ADSL use different frequencies and can therefore be separated using a
low pass filter in the POTS receive path and a high pass filter in the ADSL
receive path, see Fig. 1.10.


[Figure: the splitter. The twisted pair cable is connected to the POTS through a low pass (LP) filter and to the ADSL modem through a high pass (HP) filter.]

Figure 1.10 The splitter.

The reason for keeping the POTS installation, instead of running all
communication over the ADSL modem, is that it has been considered very
important to have a connection that works even during a power failure. A POTS
system gets its power from the twisted pair cable, but today an ADSL modem
cannot be powered from the twisted pair, and therefore the POTS system is kept
as a lifeline. More information about the DMT technique can be found in
[10,11,12].

1.4 Scope of the thesis


The choice of algorithms in an application has a direct impact on the
performance of that application. The complexity of the algorithms has an impact
both on the power consumption and on how the application should be
implemented.
The implementation of a digital signal processing (DSP) algorithm is a well
studied area [13,14], but with changing process technologies the focus is shifting
from the area required for an implementation, once the most important
parameter, towards reducing the power consumption and providing an efficient
design flow. A clear trend is that DSP hardware needs to provide increased
programmability, which makes it possible to reuse hardware for several
applications and to have a more parallel design flow where the algorithms do not
have to be stable before the hardware implementation can start.
The main target application for this thesis is the digital subscriber line (DSL)
modem. In addition, work has also been performed in the area of image
decoding, where the JPEG and MPEG standards have been the targets. Efficient
DSP implementations dedicated to FFT processing, variable length decoding,
and JPEG decoding are presented. Application Specific Digital Signal Processing
(ASDSP), with the ability to combine efficient processing with
reprogrammability, is discussed.


Another area studied in this work is how DSP algorithms can be used to improve
the performance of A/D and D/A converters. By identifying errors and then
trying to correct them, or by spectrally moving distortion, the data converter
performance can be increased.
In the publications [15,16,17] architectures for fast decoders of variable length
codes (VLCs) are proposed. Variable length codes are not used directly in digital
communication, but they are often used in the data streams that are transmitted
over the communication channel. In both digital audio and video, VLCs are used
for reducing the amount of data that must be transmitted; the MPEG and JPEG
standards, which are used to compress video sequences and images, are therefore
important examples. Much of this work has also been reported in [18], but some
additional discussion is given in Chapter 3.
The design process for efficiently designing application specific processors was
studied in the papers [19,20,21]. This work is a continuation of the work made by
K.G. Andersson [22,23], but with improvements that include a better ability to
reuse old designs and an efficient way to synthesize the architectures. We
present two case studies, where the first [19] is an ASDSP for decoding JPEG
images and the second [20] is an ASDSP for the Fast Fourier Transform (FFT). A
synthesis tool that makes the design path more efficient is reported in [21]. The
design process is further discussed in Chapter 2.
The last four papers cover distortion reduction techniques in D/A and A/D
converters. Signal processing algorithms have been developed that can be used to
increase the performance of the data converters. In [24] we propose a method that
can cancel offset errors in a time interleaved A/D converter by utilizing the
receiver, which in this case is a digital modem. This method has also been the
subject of a patent application [25]. A method to cancel gain and skew mismatch
in an A/D converter is proposed in [26].
In [27,28,29] architectures are proposed that make it possible to trade between
glitches and mismatch in the weights of a current-source D/A converter. Data
converters are discussed in Chapter 4.
Related to this thesis and the publications [25-29] is the tutorial “A/D and D/A
Converters for Telecom Applications” that was held at ICECS 2001 [30]. In this
tutorial we tried to relate the distortion reduction methods both to each other and
to applications.


Most of the work has been carried out within an industrial research project called
VIBRA at Ericsson Microelectronics AB. The aim of VIBRA was to develop
analog and digital building blocks for DSL based systems. VIBRA has had
strong connections to other research projects within Ericsson studying algorithms
and hardware for DSL systems. For secrecy reasons, the complete picture of how
this work relates to work within other parts of Ericsson cannot be presented in
this thesis.



2 Digital Signal Processing


Architectures
A digital signal processing (DSP) system typically consists of a set of algorithms
which are implemented in a combination of hardware and software. Normally
there are different types of DSP algorithms that interact with each other in the
DSP system, see Fig. 2.1.
There are certain algorithms that must be executed continuously within a given
time frame. Data are continuously processed when arriving, and must be
processed before the next data arrive, i.e. in real-time. Examples of algorithms
that fall into this category are filters, Fast Fourier Transforms (FFTs), and speech
encoders.
There are also adaptive algorithms, usually used for optimizing filters to the
present transmission channel. The adaptive algorithms are normally executed
during initialization of the DSP system, when for instance the receiving filters
and the echo cancellers are trained for optimal performance. After initialization,
the adaptive algorithms are used to monitor changes in the communication
channel, component drift due to temperature variations, etc. The adaptive
algorithms can often be executed at a slower rate than the data rate, especially
when the algorithms are used for monitoring changes in the channel.
Further, there are also control algorithms that supervise the transmission, detect
carriers, and control the initialization, see Fig. 2.1. The control process handles
the scheduling of different activities during the initialization stage, and the start
and stop of filter adaptation.


[Figure: the tasks in a DSP system, drawn as layers. Supervision and control at the top, adaptive algorithms in the middle, and hard real-time processing at the bottom, operating on the input and producing the output data stream.]

Figure 2.1 DSP system tasks.

A DSP algorithm either operates on a block of samples or computes a new output
value for every new input sample, i.e. it is stream based. Speech coders and
image and video coding algorithms usually work with blocks of samples, while
digital filters typically process the data stream continuously.

2.1 DSP algorithms


A common way of representing a DSP algorithm is by using a Signal Flow Graph
(SFG) [14]. An example of an SFG of a simple filter is given in Fig. 2.2. The box
marked with T is a delay element that delays data from one sample to the next.
The SFG has no connection to how the algorithm is implemented, but may be
used to derive a suitable architecture, calculate the amount of resources needed,
and schedule the operations onto an architecture [31].

[Figure: SFG of a first order recursive filter. The output y(n) is formed by an adder that takes x(n) and subtracts the previous output, taken from a delay element T and scaled by 0.5.]

Figure 2.2 SFG of a simple filter.

An equivalent way of representing the algorithm is by using difference
equations. The algorithm in Fig. 2.2 can for instance be described as

y(n) = x(n) − (1/2)·y(n − 1).   (2.1)
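The recursion can be executed directly, one sample at a time; the single state variable plays the role of the delay element T in Fig. 2.2. A minimal Python sketch:

```python
def simple_filter(x):
    """Executes y(n) = x(n) - 0.5*y(n-1), the recursion in Eq. 2.1."""
    y_prev = 0.0        # content of the delay element T, zero initial state
    out = []
    for sample in x:
        y = sample - 0.5 * y_prev
        out.append(y)
        y_prev = y      # update the delay element for the next sample
    return out

y = simple_filter([1.0, 0.0, 0.0])   # impulse response: 1, -0.5, 0.25, ...
```

Feeding an impulse shows the alternating, decaying response expected from the negative feedback through the 0.5 coefficient.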


2.1.1 Sample period bound


The data rate of an implementation of an algorithm is bounded by the recursive
loops in the algorithm. The minimum sample period of a recursive algorithm is
given by

T_min = max_i ( T_OPi / N_i )   (2.2)

where T_OPi is the total operation latency in the recursive loop i, and N_i is the
number of delay elements found in the loop [32]. The critical loop is the loop
that limits the sample rate. There are several ways of improving the sample rate
by various algorithm transformations, for instance by moving operations out of
critical loops [14].
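Given a list of recursive loops, Eq. 2.2 can be evaluated directly. A Python sketch (loop latencies and delay counts invented for illustration):

```python
def minimum_sample_period(loops):
    """Eq. 2.2: T_min = max_i(T_OPi / N_i) over all recursive loops.

    Each loop is given as (total_operation_latency, number_of_delay_elements).
    """
    return max(t_op / n_delays for t_op, n_delays in loops)

# Example: loop 1 has 6 ns latency and 2 delay elements (bound 3 ns),
# loop 2 has 4 ns latency and 1 delay element (bound 4 ns) -> critical loop.
t_min = minimum_sample_period([(6.0, 2), (4.0, 1)])
```

The loop with the largest latency-to-delay ratio is the critical loop, here the second one.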

2.1.2 Mapping of algorithms to hardware


When designing an architecture aimed at the implementation of a given
algorithm, the following must be considered. The required data rate and the
feasible clock rate set a lower bound on the number of resources needed. The
minimum number of execution units (EXUs) of a given type needed to execute
an algorithm at a given data rate is given by

N_EXUk = N_k · T_k / T_min   (2.3)

where T_k is the time required for an operation of type k and N_k is the number
of operations of that type.
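Since the number of units is an integer, the bound in Eq. 2.3 is rounded upwards in practice. A sketch with invented operation counts and latencies:

```python
import math

def min_execution_units(n_ops, t_op, t_min):
    """Eq. 2.3, rounded up to the nearest integer number of EXUs."""
    return math.ceil(n_ops * t_op / t_min)

# Example: 32 multiplications of 10 ns each must fit in a 100 ns sample period,
# so at least ceil(3.2) = 4 multipliers are needed.
n_mult = min_execution_units(32, 10.0, 100.0)
```

The same expression applies per operation type, so adders, multipliers, and memory ports are each bounded separately.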
It is important to schedule the operations properly in order to reach a high degree
of utilization of the EXUs. The scheduling should also consider the dataflow
between the blocks in the architecture: reducing the interconnect reduces the
parasitic load from the wires, and hence the power consumption.

2.1.3 Power consumption


There are four sources that contribute to the total power consumption in a digital
CMOS circuit [33]:

P_avg = P_switching + P_sc + P_leak + P_static
      = α·C_L·f_clk·V·V_DD + I_sc·V_DD + I_leak·V_DD + I_static·V_DD   (2.4)


where P_switching is the power that is consumed every time a signal node
changes state. α is the average switching activity for all nodes in the circuit, and
C_L is the switched capacitance. The signal levels are assumed to be 0 and V
with a power supply of V_DD.
P_sc is due to the short-circuit current that occurs when the NMOS and PMOS
transistors are active simultaneously, which may happen during switching, giving
a short-circuit current from V_DD to ground.
P_leak is due to the leakage current that arises from sub-threshold effects. The
relative contribution from P_leak is increasing because of the scaling of
threshold voltages in new process technologies. A reduction of the threshold
voltages of the transistors increases the leakage current I_leak [34]. Future
CMOS processes will enable an increased amount of on-chip memory, which
will give a significant contribution to the total leakage current.
The static current I_static in a purely digital circuit mainly originates from logic
gates whose inputs have reduced swing. When using full swing static logic the
static current will be low.
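For the dominant switching term a back-of-the-envelope evaluation is often enough. The sketch below evaluates the first term of Eq. 2.4 with invented numbers, assuming full swing so that V = V_DD:

```python
def switching_power(alpha, c_load, f_clk, vdd, v_swing=None):
    """First term of Eq. 2.4: P_switching = alpha * C_L * f_clk * V * V_DD.

    With full swing static logic the signal swing V equals V_DD.
    """
    v = vdd if v_swing is None else v_swing
    return alpha * c_load * f_clk * v * vdd

# 0.1 average activity, 1 nF total switched capacitance, 25 MHz clock, 1.8 V.
p_sw = switching_power(0.1, 1e-9, 25e6, 1.8)  # watts
```

The quadratic dependence on the supply voltage is what makes a low clock rate, and hence a low V_DD, so attractive for low power designs.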
The power consumption in different functional parts of a DSP system can be
partitioned into three components:

P_avg = P_calc + P_store + P_ctrl.   (2.5)

P_calc is the power consumed in the functional units, i.e. where the actual
algorithm is executed. P_calc grows approximately linearly with the number of
operations. To decrease this part, the computational complexity of the
implemented function should be reduced. This can be done by choosing another
algorithm or by simplifying the original one [33].
P_store is the power consumed when storing internal signal values during the
execution of the algorithm. The amount of storage needed mainly depends on
a) how many samples are needed to compute one output value for a given
algorithm, and b) the architecture used for executing the algorithm. It is
important to reduce data movement between different memory elements to
decrease P_store. One way of doing this is to implement a first-in-first-out
(FIFO) buffer using a memory and a memory pointer instead of a shift register.
The positioning of the storage elements is also important; local storage may be
less expensive than global memories. Low computational complexity does not
necessarily imply few load and store operations, and P_calc and P_store should
therefore be co-optimized [33].
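The FIFO remark can be made concrete: a shift register moves every stored word each sample, while a memory with a pointer writes a single word and updates the pointer. A Python sketch of the pointer-based buffer:

```python
class RingFifo:
    """FIFO built from a memory and a pointer: one word is written per
    sample, instead of moving every word as a shift register would."""

    def __init__(self, depth):
        self.mem = [0] * depth   # the memory, initialised to zero
        self.ptr = 0             # the memory pointer

    def push(self, sample):
        """Overwrite the oldest word and return it (the delayed output)."""
        oldest = self.mem[self.ptr]
        self.mem[self.ptr] = sample
        self.ptr = (self.ptr + 1) % len(self.mem)
        return oldest

fifo = RingFifo(3)
out = [fifo.push(x) for x in [1, 2, 3, 4, 5]]   # 3-sample delay line
```

Only one memory write toggles per sample regardless of the buffer depth, which is exactly why this structure lowers P_store compared with a shift register of the same length.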


P_ctrl is the power consumed in the control unit that, among other things,
controls the dataflow between the storage elements and the functional units. The
complexity of the controller depends on the datapath architecture, the scheduling
of the operations, and the algorithm.

2.2 DSP architectures


The implementation strategy for a DSP algorithm depends on the required data
rate, the acceptable power consumption, the maximum chip area, and the
complexity of the algorithm, but also on parameters like the available building
blocks and the required flexibility of the final system. The two main
implementation strategies are to use either a fixed-function architecture, without
the possibility to change its functionality afterwards, or a programmable
architecture with larger flexibility. It is sometimes advantageous to choose an
architecture that offers a limited programmability, since this increases the
possibility to make changes in the algorithms after the processing of the chip.

2.2.1 Fixed-function architectures


A fixed-function architecture can be obtained either by using isomorphic
mapping, where the SFG of the algorithm is directly mapped to an architecture
[14], or by using time-sharing of execution units. When an isomorphic mapping
is used, the number of EXUs equals the number of operations in the SFG. This
kind of architecture is best suited for algorithms that contain few operations and
require a high data rate, since the area otherwise becomes large. The benefit of
an isomorphically mapped architecture is that the overhead can be kept low,
since little control of the dataflow is needed. Since the isomorphically mapped
architecture requires a minimal amount of control overhead, and the clock rate
often is low, which allows the use of a low power supply voltage, the total power
consumption can be low.
In the time-shared architecture the operations in the SFG are mapped onto one or
a few EXUs. In order to control the dataflow between the EXUs and the storage
elements (STUs) there must be one or several control units (CUs) that control
operations, dataflow, and memory accesses. The control unit controls the
dataflow by applying signals that affect the way data is transported or computed
in the architecture, see Fig. 2.3.


[Figure: a time-shared architecture. One or several control units (CU 1..K) issue operation control to the execution units (EXU 1..N), storage control to the storage units (STU 1..M), and dataflow control to the interconnect, while status flags are fed back from the datapath to the control units.]

Figure 2.3 Time-shared architecture.

It is also possible to mix the two strategies, isomorphic mapping and
time-sharing, by implementing efficient EXUs using isomorphic mapping and
then time-sharing the EXUs. For example, the inner loop of the FFT algorithm
contains a butterfly operation, which is often implemented using an isomorphic
mapping and then time-shared between the butterflies in the FFT [35,36].
The time-shared architecture adds complexity to the interconnect, the control
units, and possibly to the execution units as well. This extra control overhead
will increase the power consumption, and it is therefore essential to keep the
overhead as low as possible if the total power consumption is an important
design parameter.

2.2.2 Programmable architectures


In a programmable architecture the control unit must be programmable, which in
the simplest case is done by placing the sequence of control signals in a memory.
The different operations that can be controlled by control signals will be referred
to as instructions, and the instruction set is the set of instructions supported by a
given architecture.
To benefit from the programmable architecture, the instruction set must be
modified and extended compared to the simple time-shared architecture in order
to introduce some flexibility. The execution units may need to be more general,
e.g. a multiplier that can use more than a few selected coefficients is more useful than


one that can multiply with one pre-defined coefficient only. The interconnect
may need extensions that remove restrictions on the dataflow. The control unit
may also need to support more advanced control flow, for instance nested loops
and conditional jumps.
The added flexibility in the datapath and the programmable control units will
increase the complexity of the architecture as well as the power consumption. If
the programmable architecture is to be used for a wide range of applications, the
instruction set becomes more extensive. Consequently, it is an advantage from an
efficiency point of view if the DSP architecture can be targeted towards a small
range of algorithms, since this reduces the instruction set and therefore increases
the efficiency.
A programmable DSP architecture has the advantage of being easier to reuse for
several applications. One way of providing some flexibility, without having to go
all the way to a DSP processor, is to have a set of user controlled parameters that
affect the algorithm in some predefined way, i.e. parametrization. The length of
an FFT, or the number of taps in a Finite Impulse Response (FIR) filter, can for
instance be made a parameter of the block. In this way it is also possible to
make architectures that can be used in many applications but still can be
synthesized efficiently if the parameters are fixed before the synthesis stage. For
instance, a programmable filter can be turned into a filter with fixed coefficients,
making it possible to simplify the multipliers.
To summarize, we have the following types of DSP architectures with various
degrees of efficiency and flexibility:
• Fixed-function architectures that can only execute one pre-determined algo-
rithm, where the operations can be either time-shared or isomorphically mapped
to the EXUs. In this thesis this class is represented by the presented work
dealing with variable length codes, see Chapter 3.
• Parametrized architectures that can only execute pre-determined algo-
rithms, but with a possibility to control some parameters, for instance the filter
length. This class of architectures is not explicitly treated in this thesis, but
some parametrization is used in the case studies for programmable DSP archi-
tectures.
• Programmable architectures that are controlled by a microprogram and
where the algorithms can be replaced with new ones without having to
change the hardware. This architecture is used in the case studies presented
later in this chapter.
• Reconfigurable architectures that are realized using reconfigurable logic,
for instance FPGAs. These architectures are not discussed in this thesis.


2.3 DSP architectures with programmability and high efficiency


In the work presented in [19,20,21] we have aimed at finding a method to
combine high performance with little overhead and high efficiency in the
architecture. Using a programmable application specific DSP processor makes it
possible to closely match an architecture to the target application, while still
providing enough programmability to be able to incorporate changes in the
algorithms late in the design flow. During the design of an ASIC the DSP
algorithms may change many times, both during the design process and after the
release of the product. The application specific DSP can hopefully incorporate
many of the changes in the algorithms without architecture modifications, while
this is difficult to do in a fixed-function architecture.

2.4 Design methodology for ASDSP


To be able to match the architecture with the algorithm it is necessary to have a
good understanding of both DSP architectures and the algorithms to be
implemented. The process of finding a good ASDSP architecture involves the
task of modelling the algorithm on the target architecture. Since the architecture
is to be closely matched with the algorithm, this task may include several
iterations where the architecture is step-wise refined until a good architecture has
been found. To support the step-wise refinement we need a design methodology
that can model both ASDSP architectures and algorithms efficiently. In our work
we have been using the design methodology reported in [23], which supports a
concurrent design flow with simultaneous modelling of the algorithm, the
instruction set, and the DSP architecture, see Fig. 2.4. This design methodology,
which is called the Modular DSP Methodology (MDSP), has been evaluated in
two case studies (see 2.4.1 and 2.4.2).
The MDSP methodology uses a C-like language called µC to describe both the
algorithm and the instruction set needed to implement the algorithm, using one
unified model. The µC-model relies on an underlying control unit model with an
instruction set that supports instructions like “conditional jump” and “sequential


execution”. The µC-model is both cycle accurate and bit accurate, which makes
it possible to do bit and cycle accurate simulations early in the design process.
An example of a µC-model that describes an FIR filter is given in Fig. 2.5.

[Figure: the MDSP design flow. The specification is captured in a µC model, supported by a function library. When the hardware is fixed, an RTL model is generated from the µC model and passed to the ASIC design flow, with formal verification between the two models. When the software is fixed, the microcode is generated from the µC model.]

Figure 2.4 MDSP design flow.

After verification the µC-model is translated into a VHDL architecture that
supports the instructions needed for the execution of the given algorithm, and is
then synthesized using conventional tools. An example of an architecture that is
compatible with the µC-code in Fig. 2.5 is shown in Fig. 2.6. Note that the
VHDL architecture may support a larger instruction set than what is required by
the µC-model. A microcode is finally extracted from the µC-model together with
the VHDL architecture and a library with building blocks like registers,
memories, and ALUs. The instructions used in the µC-code must have a
corresponding building block in the building block library. The library is easy to
extend with new functions when needed.


// Declaration part
MDSP fir
{
    INPUT inp(14, PARALLEL);    // input port, 14 bits
    OUTPUT outp(14, PARALLEL);  // output port, 14 bits
    REG acc(30), i(6), ca(5), da(5); // registers
    RAM d(32,16);               // RAM with 32 16 bit words
    ROM c(32,16, "rom.data");   // ROM
    PROCEDURE compfir();        // procedure declaration
}

// Code part
PROCEDURE main()
{
    for(;;){                    // loop forever
        do {;} while(!inpF);    // while no input on the input port inp, do nothing
        inpF=0, d[da]=inp;      // reset input by setting inpF=0 and store inp in
                                // RAM; "," means that this is made in parallel
        compfir();              // call procedure compfir
        outp=acc;               // place the value of acc on the outp port
    }
}

PROCEDURE compfir()             // compute fir
{
    acc=0, ca=0;
    i=31;                       // 31 loop iterations plus the final tap = 32 MACs
    do {
        acc+=d[da++]*c[ca++],
        i--;
    } while (i>0)
    acc+=d[da]*c[ca++];
    return;
}

Figure 2.5 µC code of a 32 tap FIR filter.

2.4.1 Modelling of a JPEG DSP


The first case study was the modelling of an ASDSP dedicated to decompressing
images that have been coded according to the JPEG image coding standard [3].
This work was reported in [19]. The JPEG standard consists of a mix of different
algorithms, and it is therefore difficult to find one architecture that is well
adapted to all of them. Instead the resulting architecture is a compromise where
the instruction set has been optimized for the critical algorithms. The result is a
two-core solution where one core handles protocol processing and Huffman
decoding. The second core is dedicated to the processing of the Inverse Discrete
Cosine Transform (IDCT), which represents a high computational work load.
Due to the partitioning of the algorithms, only image data needs to be passed to
the IDCT processor core; the parameters can be kept entirely in the Huffman
processor core.

[Figure: the 32-tap FIR filter architecture corresponding to the µC code in Fig. 2.5. The data RAM d is addressed by the register da through a circular address adder, the coefficient ROM c is addressed by the register ca, a multiplier and an adder/accumulator form acc which drives the output port outp, and the loop counter i feeds a comparison flag to the control unit.]

Figure 2.6 32-tap FIR filter architecture.

The experience from this case study is twofold. The methodology worked well
for defining programmable architectures for a special application. With some
modifications it was possible to design a high efficiency core with high
utilization using the design methodology. The modifications we needed were
hardware support for loops, where the jumps do not cost any extra clock cycles,
and a possibility to describe finite state machines to handle the I/O of the IDCT
processor. As it turned out, the IDCT core architecture is in most aspects similar
to the architecture that would have been obtained if conventional design methods
had been used for designing a fixed-function ASIC. Hence, even if the
methodology is intended for programmable architectures with one control unit
and a datapath, it is possible to describe complex architectures such as finite
state machines that work in parallel with the main control unit.

2.4.2 Design and implementation of an FFT processor


The second case study was the task of implementing a core dedicated to the FFT,
meeting the requirements of the VDSL standard proposal. The FFT is an
algorithm with a simple structure, which basically consists of repeated
calculations of a butterfly operation [37]. Because of the regular, data
independent structure of the FFT algorithm, the best choice is normally to
implement the FFT in a dedicated architecture. The special feature of this case
study was that both a high degree of programmability and a high throughput
were required, due to uncertainties in the proposed standard. Therefore our
choice was to implement the FFT in a programmable architecture that easily
could be adapted to changes in the standard.

The FFT algorithm


There exist many different algorithms for computing the discrete Fourier
transform more efficiently than by direct calculation of the DFT sum

X(k) = sum_{l=0}^{N−1} x(l)·e^(−j2πlk/N),   k = 0, 1, ..., N − 1   (2.6)

where N is the transform length.
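Direct evaluation of the sum makes the O(N²) cost explicit: N output bins, each a sum of N terms. A Python sketch:

```python
import cmath

def direct_dft(x):
    """Direct evaluation of Eq. 2.6: O(N^2) complex multiply-adds."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]

X = direct_dft([1, 1, 1, 1])   # constant input: all energy in the DC bin
```

The FFT algorithms discussed below compute the same N values with O(N·log N) operations instead.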


In 1965 the first algorithm for calculating the DFT more efficiently was
presented by J. Cooley and J. Tukey [38]. The presented algorithm reduced the
arithmetic complexity from O(N²) to O(N·log(N)). The FFT algorithm presented
by Cooley and Tukey is usually referred to as the Cooley-Tukey FFT or the
decimation in time (DIT) FFT algorithm.
In this implementation a decimation in frequency (DIF) FFT algorithm is used,
which was first published in [39].
Both the DIT and DIF FFT algorithms are attractive from an implementation
point of view because of the regular structure of the SFG, with a number of
columns in which the same butterfly operation is repeated, see Fig. 2.7.

[Figure: SFG of an 8-point radix-2 DIF FFT. The inputs x(0)..x(7) enter in natural order and pass through three columns (column 0, 1, 2) of two-input butterflies with twiddle factors W^p; the outputs X(0)..X(7) appear in bit-reversed order.]

Figure 2.7 SFG of an 8-point radix-2 DIF FFT.
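The regular structure is visible in code as well: log2(N) butterfly columns, each applying the same two-input butterfly N/2 times. The sketch below is a plain radix-2 DIF FFT (an illustration, not the thesis implementation); as in Fig. 2.7 it leaves the outputs in bit-reversed order:

```python
import cmath

def dif_fft(x):
    """Radix-2 DIF FFT; returns the spectrum in bit-reversed order,
    matching the output column of Fig. 2.7."""
    x = list(x)
    n = len(x)
    span = n // 2
    while span >= 1:                       # one iteration per butterfly column
        for start in range(0, n, 2 * span):
            for i in range(span):
                a, b = x[start + i], x[start + i + span]
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                x[start + i] = a + b               # upper butterfly output
                x[start + i + span] = (a - b) * w  # lower output, twiddled
        span //= 2
    return x

def bit_reverse(k, bits):
    """Index mapping needed to unscramble the DIF output order."""
    return int(format(k, f"0{bits}b")[::-1], 2)
```

For x = [1, 2, 3, 4] the natural-order DFT is [10, −2+2j, −2, −2−2j]; `dif_fft` returns these values permuted by `bit_reverse`.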


It is possible to derive FFT algorithms with different radix, which implies
different types of butterflies. If the FFT algorithm contains only butterflies with
two inputs it is a radix-2 algorithm, with four inputs it is a radix-4 algorithm, and
so on. In our implementation of the FFT algorithm we chose to support both
radix-4 and radix-2 butterfly operations. Radix-4 butterflies require fewer
memory accesses than radix-2 butterflies, but only FFT sizes that are a power of
four can be calculated using radix-4 butterflies alone. By supporting both radix-2
and radix-4 butterflies, FFTs with a length that is a power of two can be
calculated.
The FFT algorithm used in the DMT technique is derived from the normal FFT
algorithm with some modifications due to the fact that the signal sent to the line
is real valued. In this case it is possible to calculate a 2N point FFT using an N
point FFT and an additional calculation step. The algorithm is based on the fact
that the Fourier transform of a real valued input sequence is conjugate-symmetric
[40], i.e.

X(e^jω) = X*(e^−jω).   (2.7)

This kind of symmetry in the Fourier transform is commonly called Hermitian
symmetry. The steps of calculating a 2N point FFT with real input values using
an N point FFT [37,41] are given below.

Assume the input sequence x(n) is real, with n = 0, 1, ..., 2N − 1.

Form a new complex sequence

y(l) = x(2l) + jx(2l + 1),   l = 0, 1, ..., N − 1.   (2.8)
Compute the N point DFT of y(l):

Y(k) = sum_{l=0}^{N−1} y(l)·e^(−j2πlk/N)   (2.9)

Create the DFT of x(n) by the following computation:

X(k) = (1/2)·(Y(k) + Y*(N − k)) + (1/2j)·e^(−j2πk/2N)·(Y(k) − Y*(N − k)),   (2.10)

k ∈ [0, N − 1]

and


X(k) = (1/2)·(Y(k) + Y*(N − k)) − (1/2j)·e^(−j2πk/2N)·(Y(k) − Y*(N − k)),   (2.11)

k ∈ [N, 2N − 1]
Only values in the range k ∈ [0, N − 1] need to be calculated, since the output is
symmetric according to Eq. 2.7.
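The three steps can be checked against a full-length transform. The Python sketch below follows Eqs. 2.8-2.10 literally, with the index of Y(N−k) taken modulo N and a direct O(N²) DFT standing in for the N point FFT:

```python
import cmath

def dft(x):
    """Direct N point DFT, used here in place of an FFT for clarity."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]

def real_fft_via_half(x):
    """Eqs. 2.8-2.10: the 2N point DFT of a real sequence from one N point DFT.
    Returns X(0)..X(N-1); the rest follows from Hermitian symmetry (Eq. 2.7)."""
    n2 = len(x)
    n = n2 // 2
    y = [complex(x[2 * l], x[2 * l + 1]) for l in range(n)]      # Eq. 2.8
    Y = dft(y)                                                   # Eq. 2.9
    X = []
    for k in range(n):                                           # Eq. 2.10
        ysym = Y[(n - k) % n].conjugate()                        # Y*(N-k)
        w = cmath.exp(-2j * cmath.pi * k / n2)
        X.append(0.5 * (Y[k] + ysym) + (1 / 2j) * w * (Y[k] - ysym))
    return X
```

For any real input, the N values returned match the first half of the full 2N point DFT, at roughly half the transform cost.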

In a similar way it is possible to calculate a real valued output sequence with a
length of 2N points by calculating only an N point IDFT, when the output is
known to be real valued [42]. The stages in this calculation are given below.

Y(k) = (1/2)·(X(k) + X*(N − k)) + (1/2j)·e^(−j2πk/2N)·(X*(k) − X(N − k)),   (2.12)

k ∈ [0, N − 1]
Perform an N point IDFT on the sequence Y(k):

y(n) = (1/N)·sum_{k=0}^{N−1} Y(k)·e^(j2πkn/N)   (2.13)

Perform a de-interleaving stage to create the 2N output values:

x(2n) = Re(y(n))
x(2n + 1) = Im(y(n))   (2.14)
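The receive-direction steps can be sketched in the same way. The pre-processing term below is written in the form obtained by solving Eq. 2.10 for Y(k) (an assumption about the exact conjugations, which the round-trip test verifies); the input is taken as bins 0..N of the Hermitian spectrum:

```python
import cmath

def idft(X):
    """Direct N point IDFT, Eq. 2.13."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * l / n) for k in range(n)) / n
            for l in range(n)]

def real_ifft_via_half(X):
    """Recover 2N real samples from bins X(0)..X(N) of a Hermitian spectrum,
    using one N point IDFT (the packing of Eqs. 2.8-2.10 run in reverse)."""
    n = len(X) - 1                                   # X holds bins 0..N inclusive
    n2 = 2 * n
    Y = []
    for k in range(n):                               # pre-processing stage
        xsym = X[n - k].conjugate()                  # X*(N-k)
        w = cmath.exp(2j * cmath.pi * k / n2)
        Y.append(0.5 * (X[k] + xsym) + (1j / 2) * w * (X[k] - xsym))
    y = idft(Y)                                      # Eq. 2.13
    out = []
    for v in y:                                      # Eq. 2.14: de-interleave
        out.extend([v.real, v.imag])
    return out
```

Running the spectrum of a known real sequence through the routine reproduces that sequence exactly, confirming that the pre-processing undoes the transmit-side packing.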


The calculation stages in the FFT/IFFT operations used in the DMT technique are
also illustrated in Fig. 2.8.

[Figure: in the transmit path a 2N->N pre-processing stage is followed by an
N-point IFFT; in the receive path an N-point FFT is followed by an N->2N
post-processing stage.]
Figure 2.8 FFT/IFFT calculation exploiting Hermitian symmetry.

Functional description
The FFT processor can handle between 128 and 1024 carriers at a data rate that
corresponds to 25 MHz sample rate. We chose to use two parallel processing
cores, where each core can handle one direction, or alternatively the two cores
can be used in the same direction with an increased data rate. An outline of the
top level of the FFT architecture is shown in Fig. 2.9.
There are two I/O blocks that handle the two data streams (upstream and
downstream). The two I/O blocks communicate with six sets of memories, each one
capable of keeping one complete symbol in memory. To keep the memory bandwidth
high enough in the FFT, a segmented bus structure is used where each memory set
has access to three buses, i.e. the two I/O units and one of the FFT cores. Each
memory set contains two physical memories, and it is possible to do one read and
one write to the memory set each clock cycle as long as these are not made to
the same physical memory. The on-chip busses have been designed such that it is
possible to do both read and write over the same bus in the same clock cycle.
Since we had to support different types of time division multiplex and frequency
duplex modulation on the line without major changes in the external control
logic, the memory buffering scheme was put inside the FFT processor. The ability
to add and remove the cyclic prefix from the symbols is included in the I/O
units. This saves a buffer stage in the VDSL modem. One of the I/O units has
been supplied with a complex multiplier in order to be able to integrate a
frequency domain equalizer with the addition of some external control logic.


The FFT DSP core is optimized for processing of complex valued data, and
therefore instructions like complex multiplication, complex addition, etc. were
chosen. One complex multiplier, two ALUs for additions and subtractions, and one
combined scaling and rounding unit are the available resources for the main FFT
calculation. There are also three address generation blocks, two for the read and
write addresses to access the data, and one coefficient generation block for the
twiddle factors in the FFT algorithm.
The memory buffering scheme, as well as the FFT length and the length of the
cyclic prefix, is software controlled.

[Figure: two I/O units (IO A with ports INA/OUTA, IO B with ports INB/OUTB) and
two DSP cores (DSP A, DSP B) connected through segmented buses to six memory
sets A0-A2 and B0-B2.]
Figure 2.9 FFT processor architecture.

2.5 ASDSP design methodology


The goal with the MDSP methodology has been to develop an improved design
methodology which can be used for efficient design of ASDSPs containing a mix
of fixed-function hardware, efficient programmable cores with specialized
instruction sets, and software using a hardware-software co-design approach.
Often the instruction set tends to become large, and there is a need to be able to
stop designers from adding new instructions at some point. The possibility to eas-
ily add new instructions is nice at the initial stage of the architecture design, but
when the architecture evolves the introduction of new instructions must be more
restricted. Our solution is to have an instruction definition file in parallel with the


µC model. The instruction definition file makes it possible to have dedicated


designers that are allowed to introduce new instructions while others just imple-
ment parts of the software using the existing ones.
The resulting architecture may also become too limited if special attention is
not put on what kind of flexibility is needed in the architecture. Addressing modes
should be general enough to allow alternative addressing schemes and the data
flow between the EXUs should be allowed to be different from the one used in
the implemented algorithm.
A good compromise between having an efficient architecture with few specialized
instructions and using a very extensive instruction set is to use parameter
controlled EXUs. For instance, the size of a circular buffer can be set by
storing the size in a register, the rounding type can be controlled by setting
some bits in a register, and so on.
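As a toy illustration of a parameter controlled EXU (hypothetical code, not
taken from the MDSP tool set), a circular-buffer address generator can read its
wrap-around point from a register instead of having it baked into the
instruction encoding:

```python
class CircularAddressGen:
    """Address generator EXU whose wrap-around point is a runtime parameter."""

    def __init__(self, base, size_register):
        self.base = base              # start address of the buffer
        self.size = size_register     # parameter register: buffer length
        self.offset = 0

    def next_address(self):
        addr = self.base + self.offset
        self.offset = (self.offset + 1) % self.size  # wrap set by the register
        return addr
```

Changing the register value re-sizes the buffer without adding any new
instruction.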

2.5.1 Architecture synthesis from µC


An important experience from the work with the MDSP methodology was that there
was a need for a tool that can help the designer with translating the
instruction description in µC to an architecture suitable for the algorithm. To
achieve a more efficient way to create a good DSP architecture from a µC model,
a synthesis tool was created.
We wanted a tool that would be deterministic in the sense that small modifica-
tions to the µC model should only lead to small changes in the synthesized archi-
tecture. We also wanted a tool that was easy to understand so that the designer
could easily learn how to write good µC code. That is, the effort should not be on
the optimization of the architecture, but instead on capturing the designer's
intentions.
A prototype tool based on the goals mentioned above has been developed and is
reported in [43,21]. The tool works according to a few simple rules, with a
possibility to easily override a rule when it yields a poor architecture.
• All operations whose target register is the same are collected into an
arithmetic logic unit (ALU).
• Each target register has its own ALU connected to it. It may appear to be a
strange rule, but the designer is supposed to override the rule by identifying
register files. A register file will have one ALU connected to it. A register
file is in this context a number of registers that all can be used in an
identical way in all instructions.
• A constant expression used in an operation or assigned to a register or other
storage elements is implemented as an immediate operand which comes


from a field in the instruction word. An exception to the rule is when the
constant is zero or one. This rule can be overridden by specifying an option to
the tool or by changing the µC code.
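The first two rules can be mimicked in a few lines of Python. The sketch below
is illustrative only (it is not the actual synthesis tool): operations are
grouped by target register, and registers that the designer has declared as a
register file share one ALU:

```python
def allocate_alus(operations, register_files=()):
    # operations: (target_register, operator) pairs taken from the uC model.
    # register_files: tuples of registers the designer declares equivalent.
    reg_to_file = {r: rf for rf in register_files for r in rf}
    alus = {}
    for target, op in operations:
        key = reg_to_file.get(target, target)  # one ALU per register or file
        alus.setdefault(key, set()).add(op)    # collect the operations it needs
    return alus
```

Declaring ("r0", "r1") as a register file merges what would otherwise be two
separate ALUs into one, exactly as the override rule intends.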

Our experience is that the tool works well, but in some cases the designer is
required to change the µC model to work around some problems.
It is difficult to compare our solution with other solutions, but there exist
some systems which use a C-like language for hardware modelling. Several
systems use a C derivative for general HW modelling [44,45], and there is also
the Open SystemC Initiative, where a hardware design language based on C is
proposed [46].
Many prototype systems for hardware synthesis have been proposed. In some of
the systems an algorithm model is fed into a synthesis program that performs an
automatic resource allocation and scheduling of operations [47-51]. The
disadvantage of doing behavioral synthesis with the algorithm as starting point
is that the synthesized instruction set will be limited to what is necessary in
the implemented algorithms. If extra flexibility, i.e. more instructions, is
needed in the architecture this is difficult to incorporate, and even if
possible, it is difficult to re-program the control unit for a modified
algorithm.
An advantage with our tool is the high degree of control of the resulting architec-
ture. No optimization stages are included in the tool. One argument for that is that
we want the designer to create an efficient DSP architecture with an instruction
set well suited for the application. When the architecture and the most important
algorithms are in place, the rest of the design can be made in software using a
standard C compiler targeted at the chosen instruction set. This does, however,
require that the tool can generate an instruction description file that fits the
chosen compiler. This function has not been implemented yet.
The need for a tool with a high degree of interaction and a possibility for repro-
gramming the synthesized architecture has also been identified and incorporated
in a design environment called AMICAL [52,53].


3 Variable Length Decoding


Variable length coding is an important method that is implemented in several of
the standards used for saving bandwidth when transmitting images or video. As
described in chapter 1, section 1.2 variable length coding is included in both the
JPEG image coding standard [3] and the MPEG-2 video coding standard [4].

3.1 Variable length codes


Variable length codes (VLC) are used for reducing the redundancy in transmitted
information in communication systems. In a variable length code, commonly
used symbols are assigned shorter code words in order to minimize the total
amount of bits used to represent the information. The best known type of VLC is
the Huffman code, which achieves the smallest average code size among the VLCs
since it is constructed from the statistics of the symbol usage [54]. A VLC can
be represented as an unbalanced binary decision tree, whose leaves represent
the symbols and the paths from the root node to the leaves represent the VLC
codes, Fig. 3.1. For example, the symbol e will be represented by the code 110
when using the VLC code in Fig. 3.1.

[Figure: unbalanced binary decision tree with the root at level 0. Symbol a is
the leaf for code 0 at level 1; d and e are leaves at level 3 (codes 101 and
110); b, c, f and g are leaves at level 4 (codes 1000, 1001, 1110 and 1111).]
Figure 3.1 Example of tree representation of a variable length code.
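Reading the codes off the tree gives the table used in the small sketch below
(illustrative Python, not one of the thesis implementations), which shows that
the code is uniquely decodable by tree traversal:

```python
# Code table read off the tree in Fig. 3.1.
CODES = {"a": "0", "d": "101", "e": "110",
         "b": "1000", "c": "1001", "f": "1110", "g": "1111"}

def encode(symbols):
    return "".join(CODES[s] for s in symbols)

def decode(bits):
    # Tree traversal: extend the current path one bit at a time until a leaf.
    leaves = {code: sym for sym, code in CODES.items()}
    out, path = [], ""
    for b in bits:
        path += b
        if path in leaves:        # a leaf of the decision tree was reached
            out.append(leaves[path])
            path = ""
    return out
```

Because the code is prefix free, the decoder never needs to look ahead: the
first time the accumulated path matches a leaf, a symbol is emitted.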


3.2 The VLC decoding process


There are two basic architectures that have been used for VLC decoding. In the
tree traversal method, one or a few bits at a time are used for traversing the binary
decision tree that represents the VLC code. An overview of architectures imple-
menting the tree traversal method can be found in [55,56]. The second architec-
ture type is symbol parallel in the sense that one symbol is decoded at a time
normally by using a table look-up approach. This type of architecture is more
common than the tree traversal method. Some examples are found in [57-64].

3.2.1 Tree based decoding


The tree based decoding method is basically a large state machine that, given
the current state and the next N_bits incoming bits, determines the next state
(Q+), and if a leaf in the decision tree is reached the symbol has been found,
which is signalled with a symbol_ready signal, see Fig. 3.2. The throughput of
the decoder depends on the average code length of the input data, and the
critical path of the tree traversal method is the time to find the next state,
T_{Q+}. The output data rate of this VLC decoder architecture will be variable.
The maximum decoding rate will be limited by

T_{min} = \frac{L_{ave} \cdot T_{Q+}}{N_{bits}}   (3.1)

where L_ave is the average code length, and N_bits is the number of new bits
that are decoded every cycle.
Since Eq. 3.1 is a fundamental limit on this architecture, the only feasible way
to increase the throughput is to increase N_bits. Unfortunately, increasing
N_bits will increase the time required for updating the state, T_{Q+}, giving
an optimum at some point.

[Figure: state machine where the VLC logic takes N_bits input bits and the
current state Q from a state register, and produces the next state Q+, the
decoded symbol and a symbol ready flag.]
Figure 3.2 Tree-based VLC decoding SFG.


In [56] a pipelined tree-based coding architecture is presented. This
architecture is pipelined such that the next state, Q+, is fed forward to the
following pipeline stage. This makes it possible to reduce T_{Q+}. The proposed
architecture is, however, only suitable when multiple independent bit streams
exist.

3.2.2 Symbol parallel decoding


The most commonly used architecture is the parallel decoding architecture where
one or several symbols are decoded every cycle. A bit-vector of L_max × N_symb
bits is fed to the VLC decoder, where L_max is the length of the longest VLC
code and N_symb is the number of symbols to decode in parallel. The decoder
decodes the symbols, usually by using a table look-up technique [58-60,65].
Since the number of bits consumed for each symbol decoding varies with input
data, the length of the decoded symbols must be calculated. The code length is
fed back to the input buffer, which will throw away the used bits in the input vec-
tor, see Fig. 3.3.

[Figure: an input buffer feeds a bit-vector to the VLC logic, which outputs the
decoded symbol and feeds the code_length back to the buffer.]
Figure 3.3 Symbol parallel VLC decoding SFG.

The symbol decoding process can be pipelined, while the length decoding result
must be fed back to the buffer before the next N symb symbols can be decoded.
Hence, the critical path is found in the length decoder as

T_{min} = \frac{T_{Ldec} + T_{buf}}{N_{symb}}   (3.2)

where T_{Ldec} is the time it takes to decode the length of N_symb symbols, and
T_{buf} is the time it takes for the buffer to throw away the used bits. The
throughput can be increased by decoding more symbols every cycle, but this will
also increase T_{Ldec} and T_{buf}.
There are variants on this architecture where N symb varies with symbol length.
In [64] an architecture is presented that in some special cases can decode several
VLC codes in parallel. When one of the code words is a short code this is handled
in parallel with the decoding of the following code.
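A behavioural sketch of the table look-up approach (hypothetical Python, one
symbol per loop iteration, using the code of Fig. 3.1): an L_max-bit window is
matched against the table and the decoded code length advances the buffer
position, mirroring the feedback path of Fig. 3.3:

```python
CODE_TABLE = {"0": "a", "101": "d", "110": "e",
              "1000": "b", "1001": "c", "1110": "f", "1111": "g"}
L_MAX = 4  # length of the longest VLC code

def decode_lookup(bitstring):
    out, pos = [], 0
    while pos < len(bitstring):
        window = bitstring[pos:pos + L_MAX]        # input bit-vector
        for length in range(1, len(window) + 1):   # table look-up
            symbol = CODE_TABLE.get(window[:length])
            if symbol is not None:
                out.append(symbol)
                pos += length   # code_length fed back to the input buffer
                break
        else:
            raise ValueError("invalid VLC code")
    return out
```

The sequential dependence is visible in the code: pos cannot advance until the
length of the current symbol is known, which is exactly the critical path of
Eq. 3.2.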


3.3 VLC decoder with simplified length decoder


Both the tree based and symbol parallel decoding methods have a fundamental
speed limit given by T min for the respective architectures.
In [15,16] we propose an architecture where features from the tree-based and
symbol parallel VLC-decoder are combined, see Fig. 3.4. The proposed architec-
ture uses a bit-serial input fed through a shift register that works as a serial to par-
allel converter. The parallel output from the shift register is connected to a length
decoder. The time for the length decoder to decode the length of a symbol is
thereby varying with the symbol length. For example, a one bit code is decoded
in one clock cycle, a two bit code is decoded in two clock cycles and so on, just
as in tree based decoding. Each level in the binary decision tree has its own out-
put from the length decoder, L i , and is later combined using a multiplexer for
choosing the right output. The output to choose is based on the number of clock
cycles since the last decoded symbol. The first clock cycle after the last decoded
symbol the output corresponding to a code length of one L 1 is selected by a mul-
tiplexer, the next cycle the output corresponding to a code length of two is cho-
sen, L 2 , and so on until a new symbol has been found. While the tree-based
decoder only use the previous state and the next incoming bit (or bits) to find the
next state, the proposed architecture have a parallel input making it possible to
decode the symbol in parallel. Since the input is bit-serial, while the decoding is
in parallel, the decoder have M cycles available to decode an M bit code. This is
an advantage if we assume that short codes are easier to decode fast than long
codes.
The critical part of this architecture is the length decoder. The symbol decoder
can be pipelined to reach sufficient speed, and also make use of information from
the length decoder.

The critical path in this architecture is

T_{min} = \max_{\forall i} \left( \frac{T_{L_i} + T_{mux}}{i} \right)   (3.3)

where T_{L_i} is the time to find out if the code has the length of i bits, and
T_{mux} is the multiplexer delay.


[Figure: the input bits are shifted into a shift register; its parallel output
is loaded into a register feeding the varying rate length decoder and the
symbol decoder. The length outputs L_1 ... L_M are selected by a multiplexer
controlled by a counter that is reset each time a new symbol is found
(new_symb); the symbol decoder outputs the symbol and a symbol ready flag.]
Figure 3.4 VLC decoder with varying rate length decoder.

3.4 VLC decoder with pipelined length decoder


In [15,16] we propose an architecture where the length decoder is pipelined, see
Fig. 3.5. Just as the architecture shown in Fig. 3.4 this architecture use a bit-serial
shift register at the input. But instead of using the output of the length decoder to
find out when to start to decode the next symbol, the length decoder starts to
decode a new symbol every clock cycle. The length decoding is started assuming
that the first bit in the bit-serial shift register at the input is the first bit in the next
symbol. In most cases this is not the case, and the result of the decoding is auto-
matically discarded because of the way the counter, that controls the multiplexer
at the output of the length decoder, works. This way of running the length
decoder without having the input aligned to the symbols is denoted speculative
decoding.
Since the length is not needed for restarting the length decoder, but only for
indicating to the symbol decoder when to start decoding a new symbol, the
critical path in this architecture runs from the input of the multiplexer that
selects one of the L_i outputs to the reset of the counter that controls the
multiplexer, i.e.

T_{min} = T_{cntreset} + T_{mux}.   (3.4)

The output of the length decoder can be used for synchronization of the symbol
decoder, or if speculative decoding is used in the symbol decoder as well, to indi-
cate when a valid output exist.


To equalize the latency in the decoder the delay through the length decoder is
different for different code lengths. The delay for a code length i is restricted to be
equal to i ⋅ T where T is the clock period. That is, the pipeline depth is set equal
to the code length.

[Figure: as in Fig. 3.4, but the length decoder is pipelined and restarts
speculatively every clock cycle; the counter selects among the pipelined length
outputs L_1 ... L_M and generates new_symb for the symbol decoder.]
Figure 3.5 VLC decoder with pipelined varying rate length decoder.

3.5 VLC decoder with symbol decoder partitioning


In the architecture proposed in [17] the length decoder is used for sorting the
input data into groups, where each group only consists of code words of certain
lengths. In the same way as the time allowed for decoding can be made propor-
tional to the code length, the symbol decoding time can also be made propor-
tional to the code length. For instance, a symbol with a code length of M -bits
will not occur more often than at most every M th clock cycle, giving a decoder
specialized for symbols with a length of M or more bits, M clock cycles to com-
plete the task. In the proposed architecture the length decoder executes first, and
the symbol decoder starts when a new code word is available. Fig. 3.6 shows a
solution with two symbol decoders, one fast for short VLC codes and one slower
for longer VLC codes.

3.6 Length decoder implementation


Since there is one output for each level in the binary decision tree, the
implementation of the length decoder can be made simple. The output L_i from
the length decoder must become one every time there is a code of length i at
the input. If the actual length is shorter than i the value of L_i is
irrelevant. This makes it possible to use “don’t care” in many of the positions
in the truth table for the length decoder. In Table 3.1 the truth table of a
length decoder for the example given in Fig. 3.1 is shown. The simplified
boolean equations are given in Eq. 3.5 - 3.8.

[Figure: partitioned decoder. The shift register feeds the pipelined varying
rate length decoder and, via a delay, two symbol decoders: a fast one for codes
of 1 to N-1 bits and a slower one for codes of N to L_M bits. A counter and
control block derive the start_short, start_long and new_symb signals from the
length outputs L_1 ... L_M.]
Figure 3.6 VLC decoder with partitioned symbol decoder.

VLC code (C1 C2 C3 C4)   L1 L2 L3 L4   Symbol
0                        1  X  X  X    a
101                      0  0  1  X    d
110                      0  0  1  X    e
1000                     0  0  0  1    b
1001                     0  0  0  1    c
1110                     0  0  0  1    f
1111                     0  0  0  1    g

Table 3.1. Truth table for length decoder.

L_1 = not(C_1)   (3.5)

L_2 = 0   (3.6)

L_3 = C_2 ⊕ C_3   (3.7)

L_4 = 1   (3.8)
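The equations can be checked exhaustively against the code table. The self-test
below (an illustrative sketch) pads the don't-care positions of short codes
with both all-zeros and all-ones:

```python
def length_outputs(bits):
    # bits: the first four bits seen by the length decoder (C1 C2 C3 C4).
    C1, C2, C3, C4 = (int(b) for b in bits)
    return [1 - C1,        # Eq. 3.5: L1 = not(C1)
            0,             # Eq. 3.6: L2 = 0 (no codes of length two)
            C2 ^ C3,       # Eq. 3.7: L3 = C2 xor C3
            1]             # Eq. 3.8: L4 = 1

CODES = {"0": 1, "101": 3, "110": 3,
         "1000": 4, "1001": 4, "1110": 4, "1111": 4}

for code, length in CODES.items():
    for pad in ("0000", "1111"):              # fill don't-care positions
        L = length_outputs((code + pad)[:4])
        assert L[length - 1] == 1             # L_i fires at the true length
        assert all(L[i] == 0 for i in range(length - 1))  # no early hit
```

Each L_i output is one exactly at the true code length and zero at all earlier
levels, regardless of the trailing don't-care bits.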


The complexity of calculating each output L_i is mainly decided by the number
of codes with length i. This is because of the large number of don't care
positions in the truth table. If the truth table is sorted with increasing code
length as in Table 3.1, the output L_i will always have all don't care
positions on the first rows in the table, followed by the codes where the
output has to be one, and then the remaining rows will have the output zero.

3.7 Remarks
The proposed architectures are mainly suitable for fixed-function VLC decoders.
A prototype chip implementing the static MPEG-2 Video VLCs has been
reported in [16]. The MPEG-2 standard uses fixed VLCs, while this is not the
case for the JPEG standard. The need for a fast VLC decoder is usually higher
for decoding of video than for still images, which makes it relevant to study
VLC decoders with fixed VLCs.
It is difficult, but not impossible, to make a good programmable solution using
the proposed architectures. The difficulty is the length decoder which must be
made very parallel and fast. A possible solution that may be worth examining
further is to use programmable logic to realize the length decoder.


4 Data Converters in
Communication Systems
Analog-to-Digital (ADC) and Digital-to-Analog (DAC) converters are critical
components in many communication systems. The current trend is to move more
and more of the functionality of a communication system into the digital domain
in order to provide increased flexibility and reduce cost. To accomplish this,
the requirements on the data converters increase both in terms of higher
accuracy and larger bandwidth.
In order to continue to push the data converter performance even further there is
a need to handle problems caused by the processing of the chips. The variations
in transistor parameters, especially in analog circuits, cause a degradation in the
performance [66]. To increase the performance of data converters we believe that
more attention must be put on optimizing the data converters against the target
application. There exists many analog and digital calibration techniques that aim
at reducing the matching error problems, but few methods take full advantage of
the properties of the target application. We stress that in order to get the most out
of digital calibration and error correction all available information about the pro-
cess, the application, and the data converter architecture should be utilized as far
as possible. In our case the target application is DSL based communication sys-
tems. In this chapter we propose methods that can be used to correct some of the
problems in ADCs and DACs.

4.1 Analog-to-digital conversion


An ideal ADC is normally defined as a block that converts a continuous time sig-
nal to a discrete time signal with discrete amplitude, i.e. a digital signal. The ana-
log-to-digital conversion is often split into two steps, where the first step converts
the continuous time signal to a discrete time signal, i.e. a sample-and-hold step.

43
4 Data Converters in Communication Systems

The second step quantizes the amplitude-continuous signal values. In Fig. 4.1 a
model of the analog-to-digital conversion is shown. The quantization is usually
modelled with an additive zero-mean noise source with variance σ².

[Figure: top, the continuous time signal x(t) is sampled at t = nT and
quantized (Q) to give x(n); bottom, the equivalent model where the quantizer is
replaced by an additive noise source with variance σ².]
Figure 4.1 The analog-to-digital conversion process.

The quantization noise term depends on the number of bits that are used to
represent the digital signal. This is usually referred to as the resolution of
the ADC. In the ideal ADC the maximum quantization error q(n) = s_q(n) - s(n)
is in the range [±Δ/2] where Δ is defined as

Δ = \frac{FS}{2^N}   (4.1)

where FS refers to the full scale input range and N is the resolution.
Assuming a random input signal, the noise will be uniformly distributed in the
range ±Δ/2 and the variance, σ², will be

σ² = E[q²(n)] = \int_{-Δ/2}^{Δ/2} q² \cdot \frac{1}{Δ} \, dq = \frac{Δ²}{12}   (4.2)
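A quick numerical check of Eq. 4.2 (plain Python sketch with a mid-tread
quantizer and a uniformly distributed input; the parameter values are
arbitrary):

```python
import random

def quantize(x, full_scale, bits):
    delta = full_scale / 2 ** bits     # step size, Eq. 4.1
    return delta * round(x / delta)    # mid-tread uniform quantizer

random.seed(1)
FS, N = 2.0, 10
delta = FS / 2 ** N
samples = [random.uniform(-FS / 2, FS / 2) for _ in range(200000)]
errors = [quantize(s, FS, N) - s for s in samples]
variance = sum(e * e for e in errors) / len(errors)
# 'variance' comes out close to delta**2 / 12, as predicted by Eq. 4.2.
```

The measured error never exceeds ±Δ/2, and its empirical variance agrees with
Δ²/12 to within a fraction of a per cent for this many samples.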

There are also many other types of noise sources and imperfections that degrade
the performance. Here we differentiate between two types of error sources: 1)
the static errors, which are not frequency dependent, and 2) the dynamic
errors, which normally increase with frequency. In order to measure the
performance of the ADC a number of measures have been defined. Some of the
measures are listed below.


Differential nonlinearity (DNL)


The differential nonlinearity is defined as the deviation from the ideal step
size Δ between two adjacent codes in the ADC [67], see Fig. 4.2.

DNL_i = X_{i+1} - X_i - Δ,   i ∈ [0, 2^N - 1]   (4.3)

Sometimes a normalized definition is used instead:

DNL_i = \frac{X_{i+1} - X_i - Δ}{Δ},   i ∈ [0, 2^N - 1]   (4.4)

[Figure: staircase transfer function of a 2-bit ADC versus analog input; the
deviation of each step width from the ideal step Δ is the DNL.]
Figure 4.2 Non-ideal transfer function for a 2-bit ADC.

Integral nonlinearity (INL)


The integral nonlinearity is defined as the total deviation from the ideal value and
can be expressed in terms of DNL by
INL_i = \sum_{k=0}^{i} DNL_k   (4.5)
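Given a list of measured transition levels X_i, both measures are
straightforward to compute. The helper below is a hypothetical sketch following
the normalized DNL of Eq. 4.4, so both outputs are expressed in LSB:

```python
def dnl_inl(transitions, delta):
    # transitions: measured code transition levels X_0, X_1, ... of the ADC.
    dnl = [(transitions[i + 1] - transitions[i] - delta) / delta   # Eq. 4.4
           for i in range(len(transitions) - 1)]
    inl, running = [], 0.0
    for d in dnl:          # Eq. 4.5: INL_i is the running sum of the DNL
        running += d
        inl.append(running)
    return dnl, inl
```

For example, transition levels [0.0, 1.1, 1.9, 3.0] with Δ = 1 give DNL values
of 0.1, -0.2 and 0.1 LSB, and the INL returns to zero at the last transition.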

Spurious free dynamic range (SFDR)


The spurious free dynamic range is defined as the ratio between the power of
the input signal and the power of the largest spurious within the frequency
band. The SFDR expressed in dBc is

SFDR_{dBc} = 10 \log \left( \frac{Signal\ Power}{Largest\ Spurious\ Power} \right)   (4.6)

Signal-to-noise ratio (SNR)


The signal-to-noise ratio (SNR) is the ratio between the signal power and the
total noise power within a certain frequency band, excluding the harmonic com-
ponents.

SNR_{dB} = 10 \log \left( \frac{Signal\ Power}{Noise\ Power} \right)   (4.7)

Signal-to-noise and distortion ratio (SNDR)


The signal-to-noise and distortion ratio (SNDR) is the ratio between the signal
power and the total noise power within a certain frequency band, including the
harmonic components.

SNDR_{dB} = 10 \log \left( \frac{Signal\ Power}{Noise\ and\ Distortion\ Power} \right)   (4.8)

Peak-to-average ratio (PAR)


The peak-to-average ratio of a signal gives information on how the signal is dis-
tributed over the amplitude range. A low PAR indicates a more uniform distribu-
tion of the amplitudes in the input signal. A high PAR indicates that high
amplitudes may occur at the input which affects the dynamic range that must be
handled by the data converter.
The PAR is defined as

PAR = \frac{peak\ amplitude}{rms\ value}   (4.9)
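The PAR is what makes DMT signals demanding for the data converters: many
carriers can add up in phase. The sketch below (illustrative Python) compares a
single sine with a 64-carrier multitone whose carriers are all phase aligned,
the worst case:

```python
import math

def par(samples):
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return peak / rms          # Eq. 4.9

n = 4096
sine = [math.sin(2 * math.pi * t / n) for t in range(n)]
multitone = [sum(math.cos(2 * math.pi * k * t / n) for k in range(1, 65))
             for t in range(n)]
# A sine has PAR = sqrt(2) ~ 1.4; the 64-tone signal peaks at 64 while its rms
# is only sqrt(64/2), giving PAR = sqrt(2 * 64) ~ 11.3.
```

In a real DMT system the carrier phases are data dependent, so such extreme
peaks are rare but not impossible, and the converters must leave headroom for
them.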

4.2 ADC errors


In most ADC architectures the conversion process consists of a) sampling of the
input signal and b) comparing the sampled signal against a set of reference voltages.
Depending on how the reference voltages are created and how the comparison is
made we obtain different architectures. In a pipelined architecture one or a few
bits are converted at each stage in a pipeline. In a time-interleaved ADC
(TIADC) several ADCs are used in a time interleaved way. Both the pipelined
and the time-interleaved ADC may be based on a simpler flash or a successive
approximation conversion scheme. In a flash converter the sampled data is
directly compared against all reference voltages. When using successive
approximation the conversion is made by a binary search strategy applying one reference
voltage at a time. Another type of ADC is the sigma-delta ADC which works
with oversampling. Since the principle relies on oversampling this architecture is
less interesting when a high conversion speed is required.

[Figure: left, a flash ADC where the sampled input s(nT) is compared in
parallel against all reference levels of a resistor ladder and the comparator
outputs are decoded; right, a successive approximation ADC where a reference
generator applies one reference x at a time and a single comparator tests
s > x.]
Figure 4.3 Example of flash and successive approximation ADC principle.

Due to deviations from the ideal values in the components used for creating the
reference voltages, the voltages will contain errors which will result in DNL and
INL errors. These errors are independent of the frequency and are therefore
referred to as static errors.
Other important error sources are the offset and gain errors. Analog circuits may
have a DC offset which will result in an output signal even for zero input signal.
The gain variations are normally caused in amplifiers or capacitors in the ADC,
or in the sample-and-hold circuit. If the gain and offset errors are assumed to be
frequency independent these errors can be modelled as

xe ( n ) = g ⋅ x ( n ) + o (4.10)

where g is the gain, and o is the DC offset. Both the gain error and the offset
error may reduce the maximum voltage swing in the ADC.


At high input frequencies the dynamics of the circuitry becomes important. For
example the sample-and-hold circuit may not be fast enough to track the input
signal, and the reference voltage generation may settle too slow. These frequency
dependent errors will at some frequency become dominating and will limit the
bandwidth of the ADC.

4.3 Time-interleaved ADC


One way to increase the bandwidth of an ADC is to use several ADCs in parallel
and to sample data in a time interleaved fashion, see Fig. 4.4. The conversion rate
in each individual ADC is reduced to f s ⁄ M while the overall sample rate is kept
at f s , where M is the number of ADCs that are used in parallel.
[Figure: the input x(t) is sampled by M ADCs in round-robin fashion, ADC 0 at
time MT, ADC 1 at (M+1)T, ..., ADC M-1 at (2M-1)T; the outputs x_0(n) ...
x_{M-1}(n) are multiplexed into x_TIADC(n).]
Figure 4.4 Time interleaved ADC.

It is important that the differences between the ADCs in a TIADC are small since
these differences will result in distortion.

4.3.1 Offset in TIADCs


A difference in offset between the ADCs in the TIADC will appear as a periodic
output signal with a period of M samples, Eq. 4.11.

x_{tiadc}(n) = \{ x(T) + o_1, \ldots, x(MT) + o_M \}   (4.11)

or expressed in the frequency domain


X_{tiadc}(e^{j\omega}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} X\left(\omega - k \cdot \frac{2\pi}{MT}\right) + O(e^{j\omega})   (4.12)
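The effect of the periodic offset is easy to reproduce. In the sketch below
(made-up offset values, M = 4, a 256-point DFT and an input tone on bin 10),
offset energy appears only at multiples of f_s/M, i.e. bins 0, 64, 128 and 192:

```python
import cmath, math

M, n = 4, 256
offsets = [0.010, -0.020, 0.015, 0.005]        # per-ADC offsets (hypothetical)
x = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
y = [x[t] + offsets[t % M] for t in range(n)]  # Eq. 4.11: periodic offset

mag = [abs(sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
       for k in range(n)]
# Energy appears only at the signal bins (10 and 246) and at k * n / M.
```

All other bins stay at the numerical noise floor, which is the spectral
signature used to recognise (and correct) TIADC offset mismatch.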


There are analog offset cancellation techniques that can be used to reduce the off-
set differences in the analog circuitry [68]. An advantage of removing the offset
in the analog domain is that the offset will not reduce the available input range. A
disadvantage is that the analog offset cancellation increases the complexity
which may reduce the performance of the analog circuitry.
In [69] a mixed digital and analog technique is proposed where most of the work
is made in the digital domain, in addition to some minor analog circuits. The
input samples are multiplied with a random sequence of c ( n ) = { 1, – 1 } using a
modified sample-and-hold circuit. The samples observed at the output of one of
the ADCs in the TIADC will be

x_c(n) = c(n) \cdot x_i(n) + o_i   (4.13)

where o_i is the offset added by the ADC. By choosing c(n) so that its mean
value is close to zero, the mean value of Eq. 4.13 will approach o_i. A
calibration unit continuously computes the mean value using a large number of
samples, which is used as an estimate ô_i of o_i. The original signal is then
recreated by a digital multiplication using the same sequence c(n) as used in
the sampling process, i.e.

x_{corr}(n) = c(n) \cdot (c(n) \cdot x_i(n) + o_i - ô_i) = x_i(n) + c(n) \cdot (o_i - ô_i)   (4.14)

In Eq. 4.14 the fact that c ( n ) ⋅ c ( n ) = 1 has been used.
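A behavioural simulation of the scheme (hypothetical numbers, plain Python)
shows the mean of the chopped samples converging to the offset and the
correction of Eq. 4.14 restoring the signal up to the small estimation error:

```python
import random

random.seed(7)
o_true = 0.05                                        # unknown ADC offset o_i
c = [random.choice((1, -1)) for _ in range(100000)]  # chopping sequence
x = [random.uniform(-1.0, 1.0) for _ in c]           # input samples x_i(n)
observed = [c[k] * x[k] + o_true for k in range(len(c))]          # Eq. 4.13

o_hat = sum(observed) / len(observed)   # mean of Eq. 4.13 approaches o_i
corrected = [c[k] * (observed[k] - o_hat) for k in range(len(c))]  # Eq. 4.14
# Residual error per sample is c(n) * (o_i - o_hat): the estimation error is
# spread as a small pseudo-random dither rather than a fixed-pattern tone.
```

With 100000 samples the estimate is typically within a few thousandths of the
true offset, and the residual on every corrected sample equals |o_i - ô_i|.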


In [24] we propose a purely digital method that takes advantage of the fact
that the properties of the application are known. The application studied is
the DMT modem and the offset is identified by using the symbol decoder. The FFT
in the DMT receiver makes it possible to do the offset estimation in the
frequency domain. There is no need to identify the individual offset of each
ADC; instead it is sufficient if the total offset contribution O(e^{jω}) can be
identified and removed
before the decoder. See Fig. 4.5 where the DMT modem outline from chapter 1 is
repeated. The offset will move the received constellation points from their ideal
position and make it more difficult to detect the transmitted information, see Fig.
4.6.


[Figure 4.5: block diagram of the DMT modem. Transmit path: framer, FEC
encoder, interleaver, encoder, IFFT, DAC, analog frontend, line. Receive path:
analog frontend, ADC, EC, TEQ, FFT, FEQ, decoder, deinterleaver, RS-decoder,
deframer.]

Figure 4.5 DMT modem.

[Figure 4.6: constellation diagram (Re/Im axes) with the received points
displaced from their ideal positions by the offset error.]

Figure 4.6 Effects of offset errors in a DMT modem when using a TIADC.

The offset signal is additive and independent of the input signal as shown in
Eq. 4.17. This additive error will cause an offset in the constellation diagram
at the frequencies k ⋅ f_s ⁄ M, where k varies between 1 and M − 1. An example
of how the offset error affects the received constellation points at the
disturbed carriers is shown in Fig. 4.6. In [24] we show that the offset error
can be identified and reduced if the magnitude of the error is reasonably
small. The main result in the paper is that the error between the decoded
information and the received signal can be used for offset estimation. Taking
the average value of the error between the detected signal and the received
signal will identify the offset O(e^jω), assuming that the mean value of the
noise is zero (E[N(e^jω)] = 0), see Eq. 4.15.

E[S_rec(e^jω) − S_dec(e^jω)] = E[N(e^jω) + O(e^jω)] =
E[N(e^jω)] + E[O(e^jω)] = E[O(e^jω)]    (4.15)
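The decision-directed averaging of Eq. 4.15 can be sketched for a single
carrier as follows. The QPSK alphabet, noise level, and offset value are
illustrative assumptions, and the slicer decisions are assumed correct.

```python
import random

def estimate_carrier_offset(received, decided):
    """Eq. 4.15: averaging (received - decided) over many DMT symbols
    cancels the zero-mean noise and leaves the offset term O(e^jw)."""
    return sum(r - d for r, d in zip(received, decided)) / len(received)

# toy model at one disturbed carrier (assumed values)
rng = random.Random(7)
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
offset = 0.15 - 0.1j                      # assumed leakage at this carrier
decided = [rng.choice(qpsk) for _ in range(5000)]
received = [d + offset + complex(rng.gauss(0, 0.05), rng.gauss(0, 0.05))
            for d in decided]
o_hat = estimate_carrier_offset(received, decided)
corrected = [r - o_hat for r in received]
```

With 5000 symbols the estimate converges close to the true offset, and
subtracting it recentres the constellation.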


There is an error in the simulations shown in [24] which accidentally left
about 10% of the offset. Later simulations have shown that the offset
estimation can be made much more accurate, with an error well below one
percent.
The offset error will decrease the SNDR at the receiver's end, but since the
ADSL standard has been specified to adapt to a large range of different signal
qualities it will still be possible to transmit data. As the offset estimate
becomes more accurate, the increased SNDR can be utilized to increase the bit
rate.

4.3.2 Gain and sample timing mismatch


The timing of the sample clock will become more important in a TIADC than in
other types of ADCs since an inexact timing will lead to unequal time intervals
between the samples obtained by the different ADCs, Fig. 4.7.

[Figure 4.7: amplitude versus time plot of x(t) with nominal sample instants
T, 2T, 3T, 4T displaced to the actual instants T(1+r_0), T(1+r_1), T(1+r_2),
T(1+r_3).]

Figure 4.7 Sample timing mismatch.

When using a single ADC it is important to have a sample clock generator with
low jitter, but in a TIADC it is also important to achieve a similar delay from the
clock source to all sample-and-hold units in the TIADC to avoid nonuniform
sampling with a period of M cycles.
Considering a TIADC with M channels with gain and timing mismatch, see Fig.
4.8, the output from the TIADC will be

x_tiadc(n) = { g_1 ⋅ x(T(1 + r_1)), …, g_m ⋅ x(TM(1 + r_m)) }    (4.16)

and in the frequency domain

                       ∞
X_tiadc(e^jω) = (1/T)  Σ  A_k(e^jω) ⋅ X(ω − k ⋅ 2π ⁄ (MT))    (4.17)
                     k = −∞


where A_k(e^jω) is described by

               M−1
A_k(ω) = (1/M)  Σ  g_m ⋅ e^(−j(ω − k ⋅ 2π ⁄ (MT)) ⋅ r_m ⋅ T) ⋅ e^(−jkm ⋅ 2π ⁄ M)    (4.18)
               m=0

g_m is the gain error in ADC number m, and r_m is the relative sampling error
for each ADC.
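A behavioural model of the error sources of Fig. 4.8 (the time-domain form of
Eq. 4.16) can be sketched as below; the two-channel gain and skew numbers are
illustrative assumptions.

```python
import math

def tiadc_samples(x, M, T, gains, skews, offsets, n_samples):
    """Time-domain TIADC model: sample n is taken by channel m = n mod M
    and becomes g_m * x((n + r_m) * T) + o_m, where r_m is the relative
    timing error of that channel (cf. Eq. 4.16)."""
    out = []
    for n in range(n_samples):
        m = n % M
        out.append(gains[m] * x((n + skews[m]) * T) + offsets[m])
    return out

# assumed example: ideal channel 0; channel 1 with 1 % gain error and
# 0.5 % relative timing skew
x = lambda t: math.sin(2 * math.pi * 0.1 * t)
y = tiadc_samples(x, M=2, T=1.0, gains=[1.0, 1.01],
                  skews=[0.0, 0.005], offsets=[0.0, 0.0], n_samples=8)
```

Every second sample is exact while the interleaved channel is scaled and
shifted in time, which in the frequency domain produces the modulated images
of Eq. 4.17.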

[Figure 4.8: model of an M-channel TIADC where channel m samples x(t) with
period MT(1 + r_m), followed by gain g_m, offset o_m, and noise s²_m; the
channel outputs x_0(n) … x_{M−1}(n) are multiplexed into x_TIADC(n).]

Figure 4.8 Time interleaved ADC error sources.

4.3.3 Gain and timing mismatch effects on SNDR


The effects of gain and skew errors can be fatal for the performance of a TIADC.
In [70] and [71] expressions for gain and skew errors have been derived.
The SNDR for a TIADC with only gain error was estimated in [70] to

SNDR = 20 log(g ⁄ σ_g) − 10 log(1 − 1 ⁄ M)    (4.19)

where g is the average gain and σ_g is its standard deviation. M is the number
of ADCs used in the TIADC. For a resolution of 10 bits, Eq. 4.19 shows that
σ_g should be kept smaller than 0.1%.
The SNDR with only timing error was approximated in [71] to

SNDR = 20 log(1 ⁄ (σ_t ⋅ 2π ⋅ f_in)) − 10 log(1 − 1 ⁄ M)    (4.20)


where σ_t is the standard deviation of the timing skew and f_in is the input
signal frequency. For 10 bits of resolution and a 20 MHz input signal, σ_t
must be smaller than 8 ps.
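Plugging the numbers quoted above into Eq. 4.19 and Eq. 4.20 reproduces
roughly the ~62 dB needed for 10 bits. Base-10 logarithms and, for
illustration, M = 4 channels are assumed.

```python
import math

def sndr_gain(g, sigma_g, M):
    """Eq. 4.19: SNDR limited by gain mismatch only."""
    return 20 * math.log10(g / sigma_g) - 10 * math.log10(1 - 1 / M)

def sndr_timing(sigma_t, f_in, M):
    """Eq. 4.20: SNDR limited by timing skew only."""
    return (20 * math.log10(1 / (sigma_t * 2 * math.pi * f_in))
            - 10 * math.log10(1 - 1 / M))

sndr_g = sndr_gain(1.0, 0.001, M=4)       # sigma_g = 0.1 %
sndr_t = sndr_timing(8e-12, 20e6, M=4)    # sigma_t = 8 ps, f_in = 20 MHz
```

Both cases land slightly above 61 dB, close to the ideal SNR of a 10 bit
converter.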


4.3.4 Gain and timing mismatch cancellation


Gain and timing mismatch cause the same type of distortion as described in Eq.
4.17 and Eq. 4.18.
The skew problem is, however, more difficult to handle and it is shown in [72]
that it is not possible to achieve perfect reconstruction in a conventional TIADC
by only using linear filtering.
In [73,74] the skew errors are corrected using polynomial interpolation, which
is a well-known method for estimating signal values at intermediate time
instants, see Fig. 4.9. To measure the skew the authors propose to use a ramp
as training signal, but this can be difficult to do in a communication
environment. In [74] it is shown that the degree of the polynomial becomes
high if the input frequency is close to the Nyquist frequency. For instance,
for an SFDR of 100 dB an interpolation degree of 32 is needed when the input
signal is oversampled 1.5 times.

[Figure 4.9: M ADCs sampling x(t) with skewed clocks MT(1+r_0) …
(2M−1)⋅T(1+r_{M−1}), followed by a polynomial interpolator that uses the
estimated timing mismatch to produce x_TIADC(n).]

Figure 4.9 Timing mismatch correction using polynomial interpolation.

In [75] a method for perfect reconstruction in the presence of timing mismatch
is presented. The method uses a modified DFT, which makes the distortion terms
in Eq. 4.18 frequency independent, so that the correct spectrum can be
computed. The method requires a modified and computationally heavy DFT to
correct the samples. Also here a training signal is used for identification of
the timing mismatch. See also [76], where a timing estimation algorithm based
on a training signal and a DFT is given, and [77], where a method for
determining the standard deviation of the timing error is discussed.
A statistical method is presented in [78]. The method is based on the fact
that, in a statistical sense, the change in amplitude is approximately
proportional to the distance between the samples. By calculating the mean
square of the amplitude difference between two adjacent ADCs and comparing
this with the other ADCs

an estimate of the timing mismatch is found. The main limitation for this method
is that in order for the algorithm to work well most of the signal energy should be
concentrated below f s ⁄ 6 .
In [26] we propose a method for estimating and correcting both timing and gain
mismatch. The proposed method takes the application into consideration, and
uses the decoder in a DMT or OFDM modem for extracting the noise on each
frequency. An adaptive algorithm estimates the mismatch distortion, which is
then used for increasing the SNDR in the modem.
The distortion described by Eq. 4.17 and Eq. 4.18 can be treated as
information leakage from one carrier frequency to another. Since the different
carriers in the DMT modem can be considered independent of each other, any
correlation between two carriers is caused by gain and/or timing mismatch (see
also 1.3 where the DMT technique is described). The correlation between two
carriers is identified using the Least Mean Square (LMS) algorithm, and most
of the distortion can be cancelled.
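A one-tap sketch of this decision-directed LMS idea is given below. The
leakage coefficient, the symbol model, and the single-tap structure are
illustrative assumptions, not the actual algorithm of [26]; the slicer
decisions are assumed correct.

```python
import random

def lms_cancel(victim, aggressor, decided, mu=0.01):
    """Adapt one complex tap w so that w * aggressor models the
    mismatch-induced leakage into the victim carrier; the decision
    error drives the LMS update."""
    w = 0j
    for v, a, d in zip(victim, aggressor, decided):
        e = (v - w * a) - d          # residual after cancellation
        w += mu * e * a.conjugate()  # LMS update
    return w

rng = random.Random(5)
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
decided = [rng.choice(qpsk) for _ in range(2000)]
aggressor = [rng.choice(qpsk) for _ in range(2000)]
leak = 0.2 + 0.05j                    # assumed leakage coefficient
victim = [d + leak * a for d, a in zip(decided, aggressor)]
w_hat = lms_cancel(victim, aggressor, decided)
```

Because the carriers are independent, the tap converges to the leakage
coefficient and the residual decision error approaches zero.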

4.4 Digital-to-analog conversion


An ideal digital-to-analog converter transforms a digital representation of
the signal into an analog representation, normally a current or a voltage
level. A commonly used DAC architecture is the so-called current-steering DAC
[67], where an analog current is generated by a sum of current sources
controlled by the digital input. This operation can in the static case be
described as

             M
A_out(nT) =  Σ  b_m(nT) ⋅ w_m    (4.21)
            m=1

where A_out(nT) is the settled output amplitude at the time instants nT, M is
the number of bits in the input word, which contains the bits b_m(nT), and
w_m are the internal DAC weights. b_M is referred to as the most significant
bit (MSB) and b_1 is the least significant bit (LSB). For a binary offset
input word, we have M = N and w_m = 2^(m−1). For a thermometer coded input,
we have M = 2^N − 1 and w_m = 1. An example of a current-steering binary
offset coded DAC and a thermometer coded DAC is shown in Fig. 4.10.


[Figure 4.10: (a) binary weighted current sources 2^(N−1)⋅I_0, …, 2⋅I_0, I_0
switched by bits b_N … b_1, and (b) 2^N − 1 unit current sources I_0 switched
by thermometer bits b_1 … b_{2^N−1}, both summed into I_out.]

Figure 4.10 Example of a) a binary weighted and b) a thermometer coded
current-steering DAC.

decimal representation    thermometer code representation
         0                            000
         1                            001
         2                            011
         3                            111

Table 4.1. Example of thermometer code.
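The two codings and the weighted sum of Eq. 4.21 can be illustrated with a
minimal sketch; the bit lists are LSB first, which is an implementation choice
of this example.

```python
def dac_output(bits, weights):
    """Static DAC output, Eq. 4.21: A_out = sum of b_m * w_m."""
    return sum(b * w for b, w in zip(bits, weights))

def binary_weights(N):
    """Binary offset coding: M = N sources with w_m = 2^(m-1)."""
    return [2 ** (m - 1) for m in range(1, N + 1)]

def thermometer_bits(value, N):
    """Thermometer coding (Table 4.1): M = 2^N - 1 unit-weight sources,
    the lowest `value` bits set to one."""
    return [1 if m < value else 0 for m in range(2 ** N - 1)]

# both codings of the value 5 with N = 3 input bits give the same output
N, value = 3, 5
bin_bits = [(value >> (m - 1)) & 1 for m in range(1, N + 1)]  # LSB first
out_binary = dac_output(bin_bits, binary_weights(N))
out_thermo = dac_output(thermometer_bits(value, N), [1] * (2 ** N - 1))
```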

4.4.1 Error sources


There are two error sources in a DAC structure that are considered in this
work: glitches and mismatch of the current sources. A glitch occurs when the
output codeword temporarily is wrong during the transition from one sample to
the next. For instance, the binary offset coded word <00111> may become
<11111> for a short time when toggling to <11000>. A common solution to this
problem is to use thermometer coding at the input of the DAC. A thermometer
code is characterized by all current sources having equal weights and by all
bits that have the value one being concentrated to a continuous part of the
code, see Table 4.1. A transition from one sample to the next cannot cause
intermediate values to occur at the output since there is only one type of
transition between samples: either a number of bits toggle from zero to one,
or a number of bits toggle from one to zero, but never both in the same
transition.
Normally only the most significant bits in the binary offset coded input to the
DAC are converted to thermometer code, since this is an expensive operation. A
common configuration is to use thermometer code for 5-6 of the most significant
bits, and binary offset coding for the remaining bits [67], see Fig. 4.11. This
hybrid solution is called an M-bit segmented DAC, where M refers to the number
of binary bits that have been translated to thermometer code.

[Figure 4.11: the M MSBs of the N-bit input X are converted to 2^M − 1
thermometer bits driving a thermometer coded DAC, while the remaining N − M
bits are delayed and fed to a binary-weighted DAC; the two outputs are summed
into A_out.]

Figure 4.11 Multi-segmented DAC structure with the M MSBs thermometer coded.

Mismatches in the sizes of the current sources will, as in the case of the reference
voltage mismatch in the ADC, cause DNL errors.

4.4.2 Scrambling
It is difficult to measure the output from a DAC without using an ADC with
even better performance, and therefore it is also difficult to use purely
digital methods to calibrate a DAC. In order to improve the SFDR it has
instead been suggested to use scrambling, so that the direct relation between
an input value and the size of the DNL error and the glitch is removed. A
scrambler selects which current sources to use for a given input value in a
random way; the size of the error therefore becomes less correlated with a
given input value and is spread in the frequency domain. This method is
commonly referred to as Dynamic Element Matching (DEM) and was originally a
mixed analog-digital method [79], but today it is most common to use the
digital DEM technique [80-84]. A comparison between different DEM methods is
made in [85], see Fig. 4.12.
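The randomised element selection at the heart of DEM can be sketched as
follows; the 1% mismatch values are illustrative assumptions.

```python
import random

def dem_select(value, mismatch, rng):
    """DEM element selection: switch on `value` of the unit current
    sources chosen at random, so the mismatch error is decorrelated
    from the input value."""
    chosen = rng.sample(range(len(mismatch)), value)
    return sum(1 + mismatch[i] for i in chosen)

rng = random.Random(3)
mismatch = [rng.uniform(-0.01, 0.01) for _ in range(7)]  # assumed 1 % errors
# the same input code now maps to slightly different analog values,
# turning the static DNL error into a noise-like error
outputs = {dem_select(5, mismatch, rng) for _ in range(200)}
```

Every output stays within the mismatch bound of the ideal value, but the exact
error varies from sample to sample instead of being fixed by the code.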


[Figure 4.12: digital encoder consisting of a thermometer encoder followed by
a scrambler; the M one-bit outputs x_1(n) … x_M(n) drive M 1-bit DACs whose
outputs y_1(n) … y_M(n) are summed into y(n).]

Figure 4.12 DAC with scrambler.

In Fig. 4.13 a simulation of a 12 bit DAC with a mismatch of 1% in the weights
is shown. The simulation shows how the distortion is spread when using a DEM
technique. Note that DEM does not reduce the total noise power caused by the
matching errors; instead the noise is spectrally moved and can be either
filtered away or "hidden" in the quantization noise floor.
An efficient way of realizing a scrambler is by using a set of switches with
two inputs and two outputs. In Fig. 4.14 an example with a 3 bit thermometer
encoded DAC is shown. An extra zero is put into one of the switches, making
the net of switches more symmetric. Each switch is controlled by a signal that
determines which of the inputs (a, b) is fed to which output (x, y). If the
control signal p is chosen in a random way the matching error will be
decorrelated from the signal value.
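A sketch of such a switch network for four lines (three thermometer bits plus
the extra zero) is shown below. The two-stage wiring is an illustrative
choice, not the exact net of Fig. 4.14; the key property is that for any
control bits p the outputs are a permutation of the inputs, so the converted
value is preserved.

```python
from itertools import product

def switch(a, b, p):
    """Two-input/two-output switch: p = 0 passes straight, p = 1 crosses."""
    return (a, b) if p == 0 else (b, a)

def scramble4(t, p):
    """Two stages of switches over 4 lines (assumed wiring):
    stage 1 pairs lines (0,1) and (2,3), stage 2 pairs (0,2) and (1,3)."""
    a0, a1 = switch(t[0], t[1], p[0])
    a2, a3 = switch(t[2], t[3], p[1])
    b0, b2 = switch(a0, a2, p[2])
    b1, b3 = switch(a1, a3, p[3])
    return [b0, b1, b2, b3]

# thermometer code for the value 2, padded with the extra zero
t = [1, 1, 0, 0]
sums = {sum(scramble4(t, p)) for p in product((0, 1), repeat=4)}
```

All 16 control settings route the two ones to different unit sources while the
sum, i.e. the nominal DAC output, stays at 2.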


[Figure 4.13: simulated output PSD (dB/Hz) versus normalized frequency for
the 12 bit DAC with 1% weight mismatch: (a) without DEM, distinct distortion
terms rise above the noise floor; (b) with DEM, the distortion is spread over
the noise floor.]

Figure 4.13 Simulation a) without and b) with DEM.

Another solution is to choose p such that noise shaping is achieved. In
[81,83] noise shaping DAC architectures are shown, and in [84] a more general
discussion on how to choose p for various shaping techniques is given.


A problem arises when trying to combine scrambling and glitch reduction. If
thermometer coded data is scrambled in a random way the glitch power will
increase, and the main advantage of using thermometer code has disappeared. To
combine scrambling with glitch reduction the scrambling must be restricted so
that the glitch power does not become too large. One possibility is to reduce
the rate at which the scrambling is done, for instance by only selecting a new
p every second time, which will decrease the glitch power compared with p
toggling every time. But the randomization effect will also decrease, since
the matching error becomes less random.

[Figure 4.14: network of p-controlled two-input/two-output switches routing
the thermometer bits t_0 … t_6 plus an extra zero input; each switch feeds its
inputs (a, b) to its outputs (x, y) either straight or crossed.]

Figure 4.14 Scrambler for a 3-bit DAC with thermometer encoded input.

In [27,28] we propose two architectures for scrambling of data where 1)
scrambling of the distortion is achieved and 2) the glitches are kept at a
minimum. The key idea is to only scramble the difference between two adjacent
samples. As many as possible of the current sources are kept in their old
state, and the current sources that have to be turned off (or on) are randomly
selected among all possible ones. The advantage of the method is that glitches
are kept at a low level, but there is also a drawback caused by the fact that
the degree of randomization is dependent on the input signal. The error from a
slowly varying signal also changes slowly, while a fast varying signal will
randomize the distortion better.
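The restricted scrambling idea can be sketched as follows. This is a
simplified set-based model, not the exact architecture of [27,28].

```python
import random

def restricted_scramble(prev_on, value, M, rng):
    """Keep as many sources as possible in their previous state; only
    the sources that must be switched on (or off) to reach the new
    value are picked at random among the candidates."""
    on = set(prev_on)
    if value > len(on):
        off = [i for i in range(M) if i not in on]
        on |= set(rng.sample(off, value - len(on)))
    elif value < len(on):
        on -= set(rng.sample(sorted(on), len(on) - value))
    return on

rng = random.Random(11)
state = restricted_scramble(set(), 5, 7, rng)   # value 5: five sources on
nxt = restricted_scramble(state, 6, 7, rng)     # value 6: only one switches
```

Going from 5 to 6 toggles exactly one source, so the glitch energy stays
minimal, while the choice of which source to toggle is still randomised.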
The method was originally developed aiming at improving the performance of
DACs targeted at the VDSL application. But since the DMT signal consists of a
large number of carriers, the total distortion at a given frequency is the sum
of a large number of distortion terms. Many of the distortion terms can be
considered independent of each other, which makes the distortion look much
like additive Gaussian noise. This makes the proposed method less suitable for
DSL applications, since the method converts distortion that already looks like
noise to noise again.


One application where it has turned out to be interesting to use restricted
scrambling is radio architectures where the first up-conversion stage is done
in the digital domain, Fig. 4.15, [86]. The relatively narrow signal band is
located at high frequencies, while there is a large signal-free frequency
band into which the distortion can be spread.

[Figure 4.15: radio transmitter where the I and Q branches are interpolated
(↑M, H(z)), mixed with a digital IF carrier (cos(ω_if t), sin(ω_if t)),
summed, converted by an IF-DAC, filtered by H(s), and fed to the RF frontend
(mixing with sin(ω_rf t)) and the PA.]

Figure 4.15 Radio transmitter with digital IF mixer.


5 Author's Contribution to Published Work

In this section the Author's contribution to the published work is clarified
for each publication.

Pub. 1. New Approaches to High Speed Huffman Decoding [15]

The publication presents an architecture idea that originates from the Author.

Pub. 2. Implementation of a Fast MPEG-2 Compliant Huffman Decoder [16]

This publication presents an implementation of the architecture idea shown in
Pub. 1. The Author was responsible for all simulation and implementation work.

Pub. 3. High Speed Pipelined Parallel Huffman Decoding [17]

In this publication the high speed Huffman architectures are further developed
by the Author. All main contributions to this work originate from the Author.

Pub. 4. Design of A JPEG DSP Using the Modular Digital Signal Processor
Methodology [19]

This work was made within a cooperation project between Linköping University
and Ericsson Microelectronics AB, where a new design methodology was to be
evaluated in a case study. The Author made the initial hardware partitioning
together with K-G Andersson. The Author was also responsible for the design of
one of the two processor cores that were designed (the IDCT core).


Pub. 5. Design and Implementation of an FFT Processor for VDSL [20]

The FFT processor was designed by the Author, with help from people at
Ericsson Microelectronics AB with the design flow and design environment.
Anders Wass et al. supported the design environment, which made it possible to
include new features in the MDSP methodology during the project. All authors
participated in the implementation process.

Pub. 6. Application driven DSP Hardware Synthesis [21]

The idea of how to make a synthesis tool for the MDSP methodology came from
the Author, while the implementation was made by Mikael Hjelm. Mikael Hjelm
also contributed valuable ideas on how to solve some of the problems that
arose during the work.

Pub. 7. ADC Offset Identification and Correction in DMT Modems [24]

All work, from idea to simulations, was carried out by the Author.

Pub. 8. Correction of Mismatch in Time Interleaved ADCs [26]

All work, from idea to implementation, was carried out by the Author.

Pub. 9. Glitch Minimization and Dynamic Element Matching in D/A
Converters [27]

The idea to reduce glitches when doing scrambling came from J. Jacob Wikner
et al., while the Author came up with the architecture presented in this
publication. The analysis and simulations were made by the Author.

Pub. 10. Dynamic Element Matching in D/A Converters with Restricted
Scrambling [28]

This publication presents an alternative architecture for reducing glitches
when performing scrambling in DACs. Mark Vesterbacka came up with the
architecture and made the simulations verifying the ideas. The Author
contributed to this work with the DMT simulations and discussions on how to
use the method in a system.


6 Conclusions
Signal processing is used in all electronic communication systems, and it is
therefore important to have architectures that are efficient for implementing
DSP algorithms. It is also important to have an efficient design flow for
implementing the algorithms.
In three papers we propose architectures suitable for VLC decoding [15,16,17].
We have shown how the critical loop in the VLC decoder can be broken up, which
in theory increases the achievable decoding rate. We have also shown how to
parallelize the symbol decoders to reduce the data rate through each decoder,
which makes them easier to implement. One implementation has been made to
verify the ideas.
A hardware-software co-design methodology aimed at application specific DSPs
has been verified and improved. A JPEG decoder DSP has been designed which
shows how to combine programmability and performance [19]. An FFT DSP for the
VDSL application has been designed, implemented, and verified [20]. A tool for
improving the hardware-software co-design methodology has also been
implemented [21]. The tool supports the designer with hardware generation by
trying to capture the designer's intentions using a simple set of rules that
easily can be overridden by the designer. The DSP work presented in this
thesis should be seen as an additional piece of the puzzle of creating a more
efficient DSP design methodology.
Digital methods to improve the performance of data converters have also been
proposed. Increasing data converter performance using digital methods is
attractive since modern CMOS technology allows high processing capability. It
has been shown how it is possible to co-optimize an A/D converter with the
application, giving an efficient way to cancel errors in a time-interleaved
A/D converter. A method to identify and cancel offset differences in the
time-interleaved A/D converter is proposed in [24], and a method for
cancelling timing and gain mismatch is proposed in [26]. Both proposed methods
are targeted at systems using DMT modulation, but can also be used in OFDM
based systems. Using the proposed methods for correcting mismatch in
time-interleaved ADCs it is possible to consider more wideband receiver
architectures, since the effective sample rate can be increased without the
usual performance degradation of a time-interleaved ADC. New receiver
architectures can provide greater flexibility since more functionality can be
placed in the digital domain, which in turn will require more efficient
programmable DSP architectures.
Another purely digital method that is proposed is the restricted DEM method
where glitch performance and weight mismatch can be balanced against each
other [27,28]. It has been shown how current sources can be dynamically
matched while preserving a low glitch energy. We believe that the restricted
DEM technique is very well suited for high frequency DACs aimed at radio
applications.
The presented ADC and DAC work are examples of how it is possible to increase
the performance of data converters using purely digital techniques. As data
converter requirements continue to increase and process technologies limit the
performance, techniques such as the ones presented in this thesis may be the
way forward.

References
[1] C.E. Shannon, “Communication in the Presence of Noise,” Proc. IRE, Vol.
37, pp. 10-21, Jan. 1949.
[2] ETSI, Group Speciale Mobile or Global System of Mobile Communication
(GSM) Recommendation, 1988, France.
[3] ISO/IEC 10918-1: Digital Compression and Coding of Continuous-Tone
Still Images (JPEG), Feb. 1994.
[4] ISO/IEC DIS 13818-2: Generic Coding of Moving Pictures and Associated
Audio Information, part 2: Video, (MPEG-2), June 1994.
[5] S. Haykin, Digital Communications, John Wiley and Sons, 1988.
[6] J. Gibson, The Mobile Communications Handbook, CRC Press, 1996.
[7] ANSI T1.413-1998, “Network and Customer Installation Interfaces:
Asymmetrical Digital Subscriber Line (ADSL) Metallic Interface,”
American National Standards Institute.
[8] “VDSL Coalition Technical Draft Specification (Version 5),” Tech. Rep.
983t8, ETSI TM6, Luleå, Sweden, June 1998.
[9] T. Starr, J. M. Cioffi, and J. Silverman, Understanding Digital Subscriber
Line Technology, Prentice-Hall, 1999.
[10] W. Y. Chen, DSL Simulation Techniques and Standards Development for
Digital Subscriber Line Systems, Macmillan technical publishing, 1998.
[11] D. J. Rauschmayer, ADSL/VDSL Principles, Macmillan Technical
Publishing, 1999.
[12] F. Sjöberg, The Zipper Duplex Method in Very High-Speed Digital
Subscriber Lines, Luleå University of Technology, 2000.
[13] K. K. Parhi, VLSI Digital Signal Processing Systems - Design and
Implementation, Wiley, 1999.
[14] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

[15] M. K. Rudberg and L. Wanhammar, “New Approaches to High Speed
Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems,
ISCAS'96, Vol. 2, pp. 149-52, Atlanta, USA, May 1996.
[16] M. K. Rudberg and L. Wanhammar, “Implementation of a Fast MPEG-2
Compliant Huffman Decoder,” Proc. of European Signal Processing Conf.,
EUSIPCO'96, Trieste, Italy, Sept. 1996.
[17] M. K. Rudberg and L. Wanhammar, “High Speed Pipelined Parallel
Huffman Decoding,” Proc. of IEEE Intern. Symp. on Circuits and Systems,
ISCAS'97, Vol. 3, pp. 2080-83, Hong Kong, June 1997.
[18] M. K. Rudberg, System Design of Image Decoder Hardware, LiU-Tek-Lic-
1997:657, Department of Electrical Engineering, Linköping University, Dec.
1997.
[19] K-G Andersson, M. K. Rudberg, and A. Wass, “Design of A JPEG DSP
Using the Modular Digital Signal Processor Methodology,” Proc. of Intern.
Conf. on Signal Processing Applications and Technology, ICSPAT`97, Vol.
1, pp. 764-68, San Diego, CA, USA, Sep. 1997.
[20] M. K. Rudberg, M. Sandberg, and K. Ekholm, “Design and Implementation
of an FFT Processor for VDSL,” Proc. of IEEE Asia-Pacific Conference on
Circuits and Systems, APCCAS `98, pp. 611-14, Chiangmai, Thailand, Nov.
1998.
[21] M. K. Rudberg and M. Hjelm, ”Application driven DSP Hardware
Synthesis,” Proc. of IEEE Nordic Signal Processing Symp. (NORSIG2000),
Kolmården, Sweden, June 2000.
[22] K-G Andersson, A. Wass and K. Parmar, “A Methodology for
Implementation of Modular Digital Signal Processors,” Proc. of Intern.
Conf. On Signal Proc. Applications & Technology, ICSPAT ’96, Boston,
MA, Oct. 1996.
[23] K-G Andersson, Implementation and Modeling of Modular Digital Signal
Processors, LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[24] M. K. Rudberg, “ADC Offset Identification and Correction in DMT
Modems,” Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'00,
Vol 4, pp. 677-80, Geneva, May 2000.
[25] M. K. Rudberg, “A/D omvandlare,” Swedish patent number 9901888-9, 25
May 1999.
[26] M. K. Rudberg, “Correction of Mismatch in Time Interleaved ADCs“, Proc.
of IEEE Intern. Conf. on Electronics, Circuits & Systems, Malta, Sept. 2001.
[27] M. K. Rudberg, M. Vesterbacka, N. Andersson, and J.J. Wikner, “Glitch
Minimization and Dynamic Element Matching in D/A Converters,” Proc. of
IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.

[28] M. Vesterbacka, M. K. Rudberg, J.J. Wikner, and N. Andersson, “Dynamic
Element Matching in D/A Converters with Restricted Scrambling,” Proc. of
IEEE Intern. Conf. on Electronics, Circuits & Systems, Lebanon, Dec. 2000.
[29] M. K. Rudberg, M. Vesterbacka, N. U. Andersson, and J. J. Wikner, “A
scrambler and a method to scramble data words,” Swedish patent appl.
0001917-4, 23 May 2000.
[30] M. K. Rudberg, J. J. Wikner, J.-E. Eklund, F. Gustavsson, and J. Elbornsson,
“A/D and D/A Converters for Telecom. Applications,”
http://www.es.isy.liu.se/staff/mikaelr/downloads/adda_tut_icecs2001.pdf,
tutorial held at IEEE Intern. Conf. on Electronics, Circuits & Systems, Sept.
2001.
[31] K. Palmkvist, Studies on the Design and Implementation of Digital Filters,
Diss. No. 583, Linköping University, Sweden, 1999.
[32] M. Renfors and Y. Neuvo, “The Maximum Sampling Rate of Digital Filters
Under Hardware Speed Constraints,” IEEE Trans. on Circuits and Systems,
Vol. CAS-28, No. 3, pp. 196-202, March 1981.
[33] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design,
Kluwer Academic Publishers, 1995.
[34] A. Bellaouar and M. Elmasry, Low-Power VLSI Design - Circuits and
Systems, Kluwer Academic Publishers, 1995.
[35] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science
and Technology, Thesis No. 618, 1997.
[36] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping
Studies in Science and Technology, Thesis No. 619, 1997.
[37] E. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall,
1988.
[38] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation
of Complex Fourier Series,” Math Computers, Vol. 19, pp. 297-301, April
1965.
[39] W. M. Gentleman and G. Sande, “Fast Fourier Transform for Fun and
Profit,” Proc. 1966 Fall Joint Computer Conf., AFIPS’66, Vol.29, pp. 563-
678, Washington DC, USA, Nov. 1966.
[40] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing,
Prentice Hall, 1989.
[41] Proakis and Manolakis, Digital Signal Processing - Principles, Algorithms
and Applications, 2nd ed., Macmillian, 1992.
[42] Ericsson Internal document, ETX/XA/NB-97:006.
[43] M. Hjelm, Architectural Synthesis From a Time Discrete Behavioural
Language, LiTH-ISY-EX-2000, Linköping, Sweden, Sept. 1998.

[44] P. Schaumont, S. Vernalde, L. Rijnders, M. Engels, and I. Bolsens, “A
Programming Environment for the Design of Complex High Speed ASICs,”
Proc. of Design Autom. Conf., pp. 915-20, 1998.
[45] K. Wakabayashi, “C-based Synthesis Experiences with a Behavior
Synthesizer, “Cyber” ,” Design Automation and Test in Europe Conf. and
Exhibition, DATE’99, pp. 390-99, 1999.
[46] http://www.SystemC.org
[47] H. D. Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen,
“CATHEDRAL-II - A Computer-Aided Synthesis System for Digital Signal
Processing VLSI Systems,” Computer-Aided Engineering Journal, pp. 55-
66, April 1988.
[48] J.M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, “Fast Prototyping of
Datapath-Intensive Architectures,” IEEE Design and Test of Computers,
Vol. 8, Iss. 2, pp. 40-51, June 1991.
[49] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe, “GAUT: An
Architectural Synthesis Tool for Dedicated Signal Processors,” Proc. of
European Design Autom. Conf, pp. 14-19, Feb. 1993.
[50] L. Guerra, M. Potkonjak, and J. Rabaey, “A Methodology for Guided
Behavioral-Level Optimization,” Proc. of Design Automation Conf.,
DAC’98, pp. 309-14, USA, June 1998.
[51] S. Ramanathan, V. Visvanathan, and S. K. Nandy, “Synthesis of
Configurable Architectures for DSP Algorithms,” Proc. of 12th Intern. Conf.
on VLSI Design, pp. 350-57, Jan. 1999.
[52] A.A. Jerraya, I. Park, and K. O’Brien, “AMICAL: An Interactive High Level
Synthesis Environment,” Proc. of European Design Autom. Conf, pp. 58-62,
Feb. 1993.
[53] M. Benmohammed and A. Rahmoune, “Automatic generation of
reprogrammable microcoded controllers within a high-level synthesis
environment,” IEE Proc. Comput. Digit. Tech., Vol. 145, No. 3, pp. 155-60,
May 1998.
[54] D.A. Huffman, “A method for the construction of minimum redundancy
codes,” Proc. IRE, Vol. 40, No. 10, pp. 1098-1101, Sept. 1952.
[55] S. F. Chang and D. G. Messerschmitt, “Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures,” IEEE Trans. on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[56] H. D. Lin and D. G. Messerschmitt, “Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods,” IEEE Trans. on Circuits and
Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

[57] S. Ho and P. Law, “Efficient Hardware Decoding Method for Modified
Huffman Code,” Electronics Letters, Vol. 27, No 10, pp. 855-856, May
1991.
[58] S. B. Choi and M. H. Lee, “High Speed Pattern Matching for a Fast Huffman
Decoder,” IEEE Transactions on Consumer Electronics, Vol. 41, No 1, pp.
97-103, Feb. 1995.
[59] R. Hashemian, “High Speed Search and Memory Efficient Huffman
Coding,” Proc. IEEE Intern. Symp. on Circuits and Systems., ISCAS ‘93,
Vol. 1, pp. 287-290, 1993.
[60] R. Hashemian, “Design and Hardware Implementation of a Memory
Efficient Huffman Decoding,” IEEE Trans. on Consumer Electronics, Vol.
40, No. 3, pp. 345-352, Aug. 1994.
[61] H. Park and V. Prasanna, “Area Efficient VLSI Architectures for Huffman
Coding,” IEEE Trans. on Circuits and Systems - II Analog and Digital Signal
Processing, Vol. 40, No. 9, pp. 568-575, Sept. 1993.
[62] K. Parhi, “High-Speed Architectures for Huffman and Viterbi Decoders,”
IEEE Trans. on Circuits and Systems - II, Analog and Digital Signal
Processing, Vol. 39, No. 6, pp. 385-391, June 1992.
[63] E. Komoto and M. Seguchi, “A 110 MHz MPEG2 Variable Length Decoder
LSI,” 1994 Symp. on VLSI Circuits, Digest of Technical Papers, pp. 71-72,
1994.
[64] D.-S. Ma, J.-F. Yang, and J.-Y. Lee, “Programmable and Parallel Variable-
Length Decoder for Video Systems,” IEEE Trans. on Consumer Electronics,
pp. 448-454, Vol. 39, No, 3, Aug. 1993.
[65] Y.-S. Lee, B.-J. Shieh, and C.-Y. Lee, “A Generalized Prediction Method for
Modified Memory-Based High Throughput VLC Decoder Design,” IEEE
Trans. on Circuits and Systems - II Analog and Digital Signal Processing, pp.
742-754, Vol. 46, No. 6, June 1999.
[66] M. J. M Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching
Properties of MOS Transistors,” IEEE J. of Solid-State Circuits, Vol. 24, No.
5, pp. 1433-9, Oct. 1989.
[67] M. Gustavsson, J. J. Wikner, and N. N. Tan, CMOS Data Converters for
Communications, Kluwer Academic Publishers, 2000.
[68] K.-S. Tan, et al., “Error Correction Techniques for High-Performance
Differential A/D Converters,” IEEE J. of Solid-State Circuits, Vol. 25, No.
6, pp. 1318-27, Dec. 1990.
[69] J.-E. Eklund, and F. Gustafsson, “Digital Offset Compensation of Time-
Interleaved ADC Using Random Chopper Sampling,” Proc. IEEE Intern.
Symp. on Circuits and Systems, ISCAS’00, Vol. 3, pp. 447-50, Geneva,
May, 2000.

[70] M. Gustavsson, CMOS A/D Converters for Telecommunications, Diss. No.
552, Linköping University, Sweden, 1998.
[71] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals:
Fundamentals and High-Speed Waveform Digitizers,” IEEE Trans. on
Instrumentation and Measurement, Vol. 37, No. 2, pp. 245-251, June, 1988.
[72] H. Johansson and P. Löwenborg, “Reconstruction of Nonuniformly Sampled
Bandlimited Signals Using Digital Filter Banks,” Proc. of IEEE Intern.
Symp. on Circuits and Systems, ISCAS'01, Sydney, 2001.
[73] H. Jin and E. Lee, “A Digital Technique for Reducing Clock Jitter Effects in
Time-Interleaved A/D Converter,” Proc. of IEEE Intern. Symp. on Circuits
and Systems, ISCAS'99, Vol. 2, pp. 330-33, 1999.
[74] H. Jin and E. Lee, “A Digital-Background Calibration Technique for
Minimizing Timing-Error Effects in Time-Interleaved ADC’s,” IEEE Trans.
on Circuit and Systems - II: Analog and Digital Signal Processing, Vol. 47,
No. 7, pp. 603-13, July 2000.
[75] Y.-C. Jenq, “Perfect Reconstruction of Digital Spectrum from Nonuniformly
Sampled Signals,” IEEE Trans. on Instrumentation and Measurement, Vol.
46, No. 7, pp. 649-52, Dec. 1997.
[76] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: A Robust
Sampling Time Offset Estimation Algorithm for Ultra High-Speed
Waveform Digitizers Using Interleaving,” IEEE Trans. on Instrumentation
and Measurement, Vol. 39, No. 1, pp. 71-75, Feb. 1990.
[77] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals: Theories
and Applications - Measuring Clock/Aperture Jitter of an A/D System,”
IEEE Trans. on Instrumentation and Measurement, Vol. 39, No. 6, pp. 969-
71, Dec. 1990.
[78] J. Elbornsson and J.-E. Eklund, “Blind Estimation of Timing Errors in
Interleaved AD Converters,” IEEE Intern. Conf. on Acoustics, Speech, and
Signal Processing, May 2001.
[79] R. J. van de Plassche, “Dynamic Element Matching for high-accuracy
monolithic D/A converters,” IEEE J. Solid-State Circuits, Vol. SC-11, pp.
795-800, Dec. 1976.
[80] P. Carbone and I. Galton, “Conversion error in D/A converters employing
dynamic element matching,” Proc. of ISCAS‘94, Vol. 2, pp. 13-16, 1994.
[81] L.R. Carley, “A noise-shaping coder topology for 15+ bit converters,” IEEE
J. of Solid-State Circuits, Vol. 24, no. 2 , pp. 267-273, April 1989.
[82] H.T. Jensen and I. Galton, “An analysis of the partial randomization dynamic
element matching technique,” IEEE Trans. on Circuits and Systems II, Vol.
45, No. 12, pp. 1538-1549, Dec. 1998.

[83] I. Galton, “Spectral Shaping of Circuit Errors in Digital-to-Analog
Converters,” IEEE Trans. on Circuits and Systems II, Vol. 44, No. 10,
pp. 808-817, Oct. 1997.
[84] L. Hernández, “A Model of Mismatch-Shaping D/A Conversion for
Linearized DAC Architectures,” IEEE Trans. on Circuits and Systems I, Vol.
45, No. 10, pp. 1068-76, Oct. 1998.
[85] N. U. Andersson and J. J. Wikner, “Comparison of Different Dynamic
Element Matching Techniques for Wideband CMOS DACs,” Proc. of
NORCHIP, Oslo, Norway, Nov. 1999.
[86] M. Helfenstein and G. S. Moschytz, Circuits and Systems for Wireless
Communications, Kluwer Academic Publishers, 2000.

Part 2: Publications

Paper 1 - New Approaches to High Speed Huffman Decoding

PAPER 1

New Approaches to High Speed Huffman Decoding

Mikael Karlsson Rudberg and Lars Wanhammar

Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’96, Atlanta, USA, May 1996.

Paper 1 - New Approaches to High Speed Huffman Decoding

New Approaches to High Speed Huffman Decoding

Mikael Karlsson Rudberg and Lars Wanhammar


Department of Electrical Engineering, Linköping University,
SE-581 83 Linköping, Sweden
mikaelr@isy.liu.se larsw@isy.liu.se

ABSTRACT
This paper presents two novel structures for fast Huffman decoding. The
solutions are suited for decoding of symbols at rates up to several hundred
Mbit/s. The structures are built using the principle of pipelining, which
when applied to the length decoder unit makes it possible to remove the only
recursive loop in the basic structure. In this way a structure with a high the-
oretical speed is obtained. Another attractive property of the solutions is the
simplicity of the structures and control logic.

1. INTRODUCTION
The Huffman coding technique is a lossless coding method that assigns short
codewords to frequently used symbols and longer words to less frequently used
symbols. If the codebook is good enough this will lead to a near entropy optimal
result. Huffman coding is a part of several important image coding standards,
for instance the JPEG [1] and MPEG [2] standards.
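To make this concrete, the short Python sketch below (our illustration; the symbol set and frequencies are invented for the example, and `huffman_code` is a hypothetical helper, not part of any standard) builds a Huffman codebook and shows that the most frequent symbol gets the shortest codeword:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman codebook, mapping each symbol to a bitstring."""
    tick = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tick), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)  # the two least frequent subtrees ...
        f1, _, c1 = heapq.heappop(heap)
        # ... are merged: prefix one subtree's codes with 0, the other's with 1
        merged = {sym: "0" + code for sym, code in c0.items()}
        merged.update({sym: "1" + code for sym, code in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tick), merged))
    return heap[0][2]

book = huffman_code({"a": 0.50, "b": 0.25, "c": 0.15, "d": 0.10})
```

For these frequencies the codeword lengths become 1, 2, 3 and 3 bits, giving an average code length of 1.75 bits, close to the source entropy of about 1.74 bits.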

Since the coded data consists of codewords of different lengths it is difficult to
perform the decoding in parallel. This may not be a problem when dealing with still
images, but moving images place entirely different requirements on the decoding process.
The MPEG-2 standard requires the data to be decoded at 100 Mbit/s and above.
In this paper we introduce a new principle for fast Huffman decoding. The pre-
sented algorithm is a hybrid between a constant input, variable output decoder,
and a variable input, constant output decoder. In section 2 an overview of previ-
ous work is given. In section 3 we discuss modifications of the algorithm in
order to speed up the decoding. Finally two new structures are presented with
slightly different properties.

2. PREVIOUS WORK
There are two main approaches for hardwired Huffman decoders with fixed
codebooks. If one or several bits at a time are decoded at a constant rate it will
result in a sequential solution which traverses the Huffman tree until a leaf is
reached and then outputs the symbol (Fig. 1).
This type of decoder has a constant input rate and a variable output rate. If large
codebooks are used, the constant input rate solution tends to give very large state
machines which limit the speed. Some ways to get around this problem are given
in [3] and [4], but most solutions lead to complicated control logic.

Figure 1. A constant input rate Huffman decoder.

The other approach is to decode one codeword in each cycle, hence it will deliver
one symbol every cycle (Fig. 2). However, since the codewords have different
lengths the input rate will be variable. This solution consists of two main blocks.
The first block finds the length of the next codeword. This is necessary since the
different codewords must be kept apart to be able to feed the symbol decoder
with correct data. The symbol decoder finds the corresponding symbol accord-
ing to the codeword. This pattern matching can be done in several ways. Usu-


ally some kind of PLA structure is used to perform both the length decoding and
symbol decoding. In some solutions [5] sophisticated memory partition methods
are used to get access to the symbol and its length in an effective way.

3. TWO NEW FAST HUFFMAN DECODER STRUCTURES


In our solution we modify the length decoder and shifting buffer in the constant
output rate decoder shown in Fig. 2. We then get a decoder with a structure simi-
lar to the constant output decoder but with a variable output rate and a constant
input rate.

3.1. The basic Huffman decoder


The algorithm for a constant output rate Huffman decoder is described below.
1. Feed the symbol decoder with a coded vector from the input register. The
length of this vector must be equal to the length of the longest possible code-
word to assure that the vector contains at least one codeword. At the same
time feed the length decoder with the same vector as the symbol decoder.
2. The length of the decoded word that is found by the length decoder is used for
finding out how many new bits must be shifted into the input register.
3. Repeat from 1.
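The three steps above can be sketched behaviorally (a Python model of the algorithm, not of the hardware; the codebook is a toy example of our own):

```python
def decode_parallel(bits, codebook):
    """Constant output rate model: every loop iteration is one clock cycle,
    consuming one whole codeword and emitting one symbol."""
    M = max(len(code) for code in codebook.values())    # longest codeword
    inv = {code: sym for sym, code in codebook.items()}
    pos, out = 0, []
    while pos < len(bits):
        window = bits[pos:pos + M]                      # step 1: M-bit vector
        # step 2: the length decoder finds how many bits the codeword uses
        length = next(n for n in range(1, M + 1) if window[:n] in inv)
        out.append(inv[window[:length]])                # symbol decoder output
        pos += length                                   # step 3: shift in new bits
    return out

symbols = decode_parallel("010110", {"a": "0", "b": "10", "c": "11"})
```

The input rate varies between 1 and M bits per cycle, while exactly one symbol leaves the decoder every cycle.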

The structure of the basic Huffman decoder is shown in Fig. 2. The critical path is
from the input shifting buffer through the length decoder.

Figure 2. A constant output rate Huffman decoder.

The decoder cannot run faster than the time it takes for the length decoder to
find the length of the codeword. The symbol decoder can be designed in several
ways and can always be pipelined to reach sufficient speed. Hence, we will focus
on the length decoder and the input register.

3.2. Huffman length decoder with relaxed evaluation time


The basic algorithm can easily be modified to not perform the length decoding
and symbol decoding at the same time. The length decoder can find the length of
codeword i at the same time that the symbol decoder decodes codeword i-1.
Since the codewords have different lengths it is also reasonable to assume that it
is usually more time consuming to evaluate the length of long codewords than
shorter ones. These two observations can be utilized to design a more effective
length decoder.
The basic circuit is modified by changing the input shifting buffer to a shift
register and adding a register with a load signal between the length decoding logic
and the shift register (Fig. 3).
Then let the length decoder indicate the length of the codeword by one-hot
coding (i.e. one dedicated signal for every possible length). The algorithm will now
look like this:
1. Shift data into the shift register until it is full.
2. Copy all the data, except the rightmost bit, from the shift register to the
length decoder register (with the load signal).
3. The length decoder shall now, in one cycle, determine if the symbol is of
length one and feed this to the control unit. Symbols of length two must be
found in no more than two cycles, and so on with lengths of three and four up
to the maximum code length M. When the length signal indicates that the
length is found the shift register has passed the next coded symbol to the sym-
bol decoder and shifted in the next codeword. Thus, it is possible to continue
from 2.
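The modified behavior can be modelled as follows (again a Python sketch with a toy codebook, not the circuit; note that one bit is shifted in per cycle and a length-k codeword is recognized k cycles after its first bit arrived):

```python
def decode_relaxed(bits, codebook):
    """Relaxed-time model: returns (symbol, cycle) pairs, one input bit per cycle."""
    inv = {code: sym for sym, code in codebook.items()}
    out, buf = [], ""
    for cycle, bit in enumerate(bits, start=1):
        buf += bit                 # the shift register advances every cycle
        if buf in inv:             # the one-hot 'length = len(buf)?' signal fires
            out.append((inv[buf], cycle))
            buf = ""               # start accumulating the next codeword
    return out

trace = decode_relaxed("010110", {"a": "0", "b": "10", "c": "11"})
```

For the input 010110 the symbols appear at cycles 1, 3, 5 and 6: the input rate is constant but the output rate varies, exactly as stated above.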

We have here utilized the fact that the lengths of longer codewords can be allowed
to be decoded at a slower rate than those of shorter ones. Notice that the constant
output rate decoder now has a constant input rate. Instead, the symbol decoder will
no longer get a new codeword every cycle and hence it will have a variable output
rate.


The critical path will be from the length decoder register through the length
decoder to the load signal. However, the only decision that must be made in one
cycle is whether the length is equal to one. It is often possible to further reduce the critical delay
by placing some of the length decoder logic between the shift register and the
register.
Comparing this modified decoder with the basic decoder, there are a few important
differences to note. This new structure decodes short
codewords very fast but will be slower for longer codewords. Since the basic
decoder that we started from decodes symbols at a constant output rate it will
probably be more effective for long codewords. Fortunately, the nature of Huff-
man coding makes it more likely that short codewords will dominate.

Figure 3. Huffman decoder with relaxed evaluation time for the length decoding unit.

3.3. Pipelined Huffman length decoder


In this version of the Huffman decoder all recursive loops are removed and then,
in principle, the maximum clock rate is limited only by the delay of a
single logic gate. Hence, decoding rates of several hundred MHz are feasible. To
obtain this pipelined structure the structure in Fig. 3 is modified as described
below.
First we remove the loadable register. As a consequence, it must for a moment
be assumed that all outputs from the length decoder are evaluated in one cycle.
Since only one bit of the length vector is considered every cycle, D flip-flops
must be added before the multiplexer to equalize the delay (Fig. 4). For the
'length = 2?' signal one D flip-flop is needed, for the signal 'length = 3?' two D
flip-flops are needed, and so on.

Further we can add D flip-flops after the length decoder as long as it is done
before the symbol decoder as well (Fig. 4). All the flip-flops can be propagated
into the multiplexer and the length decoder logic. By this the delay through the
decoder logic is reduced to Tcritical/N where Tcritical is the critical, not maximum,
delay through the length decoder logic and the multiplexer and N is the number
of added flip-flops.
The resulting structure is shown in Fig. 5 below. This structure tries to evaluate
the length of the codeword at the input vector every cycle instead of only when a
codeword actually is present at the input, as in the first solution. Since there
are no limitations on how much the structure is pipelined, the length decoder will
no longer be the time critical part of the design and the speed can be increased
significantly. The theoretical speed limit is now set by the delay from a flip-
flop through one logic gate to the following flip-flop.

3.4. Symbol decoder


The actual implementation of the symbol decoder is not discussed in this paper.
However, some notes on the data input interface are in order. In our decoder
structures the symbol decoder is always fed with serial data, but we want the
symbol decoder input to be bit parallel to avoid time critical recursive loops. The serial
to parallel conversion can easily be done by having a register at the input of the
symbol decoder with a load signal for every individual bit. The first bit can then
be stored at position one, the next at position two and so on. When the length is
found by the length decoder all necessary bits are stored in the input register of
the symbol decoder and the symbol decoding can start.
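The per-bit load principle can be sketched as follows (a hypothetical helper of our own, not the paper's circuit):

```python
def serial_to_parallel(stream, length):
    """Model of the symbol decoder input register: bit i of the serial stream
    is loaded at position i, so when the length decoder reports 'length' bits,
    the parallel word is already complete."""
    reg = [0] * length
    for i in range(length):        # each position has its own load signal
        reg[i] = int(next(stream))
    return reg

word = serial_to_parallel(iter("1011"), 4)
```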
Since it is more complicated to calculate the symbol than to find its length, one
might want this part to run at a lower speed. This can in our structures easily be
realized with a FIFO buffer inserted between the length and symbol decoders.

4. CONCLUSIONS
We have presented two new structures for Huffman decoders. Both structures
are based on a simple constant output rate decoder with a length decoder and a
symbol decoder. Since the speed limiting unit in this structure is the length
decoder we have suggested how it can be modified to reach higher speed.
Our first structure contains a length decoder with relaxed evaluation time that
makes it possible to significantly reduce the critical path delay and in this way
design faster Huffman decoders. We have simulated a standard cell implementation
of the MPEG-2 Huffman tables at 120 MHz using a 0.8 µm CMOS process.


In the pipelined structure we have shown how the time limiting recursive loop in
the length decoder can be completely eliminated. This structure should be suit-
able for Huffman decoders with very high decoding rates, for example in future
wideband transmission systems and HDTV.

5. REFERENCES
[1] ISO/IEC 10918-1 Digital compression and coding of continuous-tone still
images (JPEG), Feb. 1994.
[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated
audio information, part 2: Video, (MPEG-2), June 1994.
[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures, IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June
1992.
[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods, IEEE Transactions on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.
[5] S. B. Choi and M. H. Lee, High Speed Pattern Matching for a Fast Huffman
Decoder, IEEE Transactions on Consumer Electronics, Vol. 41, No 1, pp.
97-103, Feb. 1995.

Figure 4. Huffman decoder with delay elements in the length decoder unit.

Figure 5. Huffman decoder with pipelined length decoder unit.

Paper 2 - Implementation of a Fast MPEG-2 Compliant Huffman Decoder

Paper 2

Implementation of a Fast MPEG-2 Compliant Huffman Decoder

Mikael Karlsson Rudberg and Lars Wanhammar

Proceedings of European Signal Processing Conference, EUSIPCO’96, Trieste, Italy, Sept. 1996.


IMPLEMENTATION OF A FAST MPEG-2 COMPLIANT HUFFMAN DECODER

Mikael Karlsson Rudberg (mikaelr@isy.liu.se)


and Lars Wanhammar (larsw@isy.liu.se)
Department of Electrical Engineering, Linköping University, S-581 83
Linköping, Sweden
Tel: +46 13 284059; fax: +46 13 139282

ABSTRACT
In this paper a 100 Mbit/s Huffman decoder implementation is presented. A
novel approach has been used, where parallel decoding of data is combined with a
serial input. The critical path has been reduced and a significant increase
in throughput is achieved. The decoder is aimed at the MPEG-2 Video
decoding standard and has therefore been designed to meet the required
performance.

1. INTRODUCTION
Huffman coding is a lossless compression technique often used in combination
with other lossy compression methods, in for instance digital video and audio
applications. The Huffman coding method uses codes with different lengths,
where symbols with high probability are assigned shorter codes than symbols
with lower probability. The problem is that since the coded symbols have

unequal lengths it is impossible to know the boundaries of the symbols without
first decoding them. Therefore it is difficult to parallelize the decoding process.
When dealing with compressed video data this will become a problem since high
data rates are necessary.
The architecture of the Huffman decoder presented in this paper is based on a
novel hardware structure [1] that allows high speed decoding.
The decoder can handle all Huffman tables required for decoding MPEG-2 Video
at the Main Profile, Main Level resolutions [2]. The design is completely
MPEG-2 adapted with automatic handling of the MPEG-2 specific escape and
end of block codes. In total our decoder supports 11 code tables with more than
600 different code words. Since the code books are static in the MPEG-2 stan-
dard the Huffman decoder has been optimized for these specific MPEG-2 codes.
A decoding rate of 100 Mbit/s is required and also achieved in our implementa-
tion.

2. HUFFMAN DECODER
Huffman decoding can be performed in numerous ways. One common principle
is to decode the incoming bit stream in parallel [3, 4]. The simplified decoding
process is described below:
1. Feed a symbol decoder and a length decoder with M bits, where M is the
length of the longest code word.
2. The symbol decoder maps the input vector to the corresponding symbol.
A length decoder will at the same time find the length of the input vector.
3. The information from the length decoder is used in the input buffer to fill
up the buffer again (with between one and M bits, Fig. 1).
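Steps 1 and 2 are commonly realized as a direct mapping from the M-bit input vector to the pair (symbol, length), for instance with a PLA or ROM. The Python sketch below (a ROM-style model of our own with a toy codebook, not the implementation described in this paper) builds such a table:

```python
from itertools import product

def build_decode_table(codebook):
    """Map every possible M-bit window to (symbol, length of leading codeword)."""
    M = max(len(code) for code in codebook.values())
    table = {}
    for combo in product("01", repeat=M):
        window = "".join(combo)
        for sym, code in codebook.items():
            if window.startswith(code):        # prefix-free: at most one match
                table[window] = (sym, len(code))
                break
    return table

table = build_decode_table({"a": "0", "b": "10", "c": "11"})
```

The table has 2^M entries, which illustrates why large codebooks lead to large decoding logic.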

The problem with this solution is the long critical path through the length
decoder to the buffer that shifts in new data (Fig. 1).
In our decoder the shifting buffer is realized with a shift register that continu-
ously shifts new data into the decoder (Fig. 2). The length decoder and symbol
decoder are supplied from registers that are loaded every time a new code word is
present at the input. The decoding process is described below:
1. Load the input registers of the length and symbol decoder.
2. If the coded data has a length of one go back to point 1.
3. If the coded data has a length of two go back to point 1.
and so on with codes of length three and four up to M.


Figure 1. A constant output rate Huffman decoder.


Figure 2. Huffman decoder with relaxed length evaluation time.

This structure allows longer evaluation times for longer code words. The delay in
the critical path is reduced to the time it takes for evaluating the length of code
words with a length of one or two bits. Codes with other lengths are allowed to be
evaluated in several cycles, i.e. code words with lengths of three must be evalu-
ated in two cycles and so on.
Comparing this algorithm with the previous one we note the following:
• The input rate of our new structure is constant while the original has a vari-
able input rate.
• The new structure evaluates short code words in a few cycles but requires
more cycles for longer words. The original structure has a constant evaluation
time for all code words.

• The new structure allows a higher clock rate since the critical path is reduced.
But this also means that the symbol decoder must be faster since it in the
worst case will receive new data every clock cycle.
• The new structure has a variable output rate while the original one has a con-
stant output rate.

The new structure requires a higher clock rate to perform the same amount of work.
But if the average code length is short enough, the new structure will have a
higher speed due to the significantly higher clock rates that can be achieved.
Normally the shorter code words dominate in Huffman coded data and therefore
the new decoder is faster under normal circumstances.
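This trade-off can be quantified with a back-of-envelope model: the serial-input structure spends one cycle per code bit, i.e. on average L cycles per symbol where L is the average code length, while the original structure delivers one symbol per cycle. The clock frequencies below are illustrative assumptions, not measured figures:

```python
def symbols_per_second(freqs, lengths, f_clk, serial_input=True):
    """Throughput model: a serial-input decoder needs len(code) cycles per
    symbol, the original parallel decoder one cycle per symbol."""
    avg_len = sum(freqs[s] * lengths[s] for s in freqs)
    return f_clk / avg_len if serial_input else f_clk

freqs   = {"a": 0.50, "b": 0.25, "c": 0.25}
lengths = {"a": 1, "b": 2, "c": 2}            # average code length = 1.5 bits
new = symbols_per_second(freqs, lengths, f_clk=120e6)                     # serial input
old = symbols_per_second(freqs, lengths, f_clk=50e6, serial_input=False)  # original
```

With an average code length of 1.5 bits, the 120 MHz serial-input decoder delivers 80 Msymbol/s and beats a 50 MHz constant-output-rate decoder; with very long average codes the comparison tips the other way.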

2.1. Handling of special markers


Special markers are placed in the data stream to indicate for example end of block
(eob) at the end of a coded block of data. After this marker other types of data
like uncompressed stream information will follow (Fig. 3). In the Huffman
decoder it is essential to detect the presence of this marker to be able to stop the
decoding process and let other units process the data that will follow. This
decoder detects the eob marker in the length decoder and halts the decoding pro-
cess until a new start signal is applied.

Huffman codes | eob | header data | Huffman codes
Huffman codes | mb_escape | fix length data | Huffman codes

Figure 3. Markers in the MPEG-2 stream requiring special decoding.

The mb_escape marker is also important. After this symbol the following data is
of fixed length. This marker is also detected in the length decoder, with the result
that the following data is passed through the symbol decoder unchanged (Fig. 3).

3. IMPLEMENTATION
The MPEG-2 standard requires the input data to be decoded at a rate of
about 100 Mbit/s. During the implementation, special care had to be taken in
the partitioning of the symbol decoder, and a few critical paths had to be
optimized manually. A few modifications of the new decoding algorithm had to be
made to make it possible to achieve the targeted performance.


3.1. Improvements of the length decoder


The length decoder turned out to be too slow when evaluating codes with lengths
of one or two bits ('length = 1?' and 'length = 2?' in Fig. 2). These paths had to be
broken up. How this was done is shown in Fig. 4 below. The evaluation of the
'length = 1?' signal is done by taking data one step earlier (i.e. from position i-1
instead of i) from the shift register and adding a flip-flop after the evaluation. For the
'length = 2?' signal the register was moved to after the evaluation logic.

(left: before optimization; right: after optimization)

Figure 4. Optimization of critical paths in the length decoder.

Note that this way of breaking up the critical loops can be generalized to remove
all critical loops in this structure; see [1].

3.2. Symbol decoder


The symbol decoding task is more complicated than the length decoding. The
symbol decoder could not be designed to receive data at 100 Mbit/s. Code words
with a length of one bit are rare in MPEG coded data. The most frequently used
code tables only contain codes with more than two bits. Therefore the symbol
decoder is fed with data no more often than every second clock cycle. The input
shift register is halted one clock period every time a one bit code is found, and
hence, the symbol decoder only needs to process 50 Mbit/s without a significant
loss in performance. However, this modification causes the input rate to vary.
The symbol decoder was split into five separate units, each taking care of its
own part of the code tables (Fig. 5). Every unit consists of an input register that
holds the data and a combinatorial block that maps the input vector to the
symbol. The output of one of the five units is chosen and passed to the output.
Data is in two's complement after the mb_escape marker while other data is
decoded to signed magnitude format. A post processing stage converts the two's
complement data to signed magnitude representation.

3.3. Interface
The interface of the Huffman decoder consists of an eight bit, parallel input port
for coded data. A signal indicates when a new input vector can be applied. The
decoded data is delivered at a maximum of 50 Msymbol/s. The 'symbol
present' signal (Fig. 5) indicates when data is valid at the output.
The shift register at the input of the decoder (Fig. 2) can be read and controlled
externally. This is necessary since the Huffman coded data is interleaved with
other information.

Figure 5. Realization of symbol decoder.

3.4. Synthesis
The decoder has been described in VHDL and then transformed to a circuit using
synthesis tools mapping to a 0.8 µm CMOS standard cell library. Some post
processing had to be done after the synthesis step to achieve the necessary
performance. The main problem was to get the symbol decoder to work fast enough.
Therefore the symbol decoding has been split into five separate units. The core
area is about 8.4 mm2 and the total area is 14.5 mm2 (3.9 x 3.75 mm2). About two
thirds of the area is occupied by the symbol decoder (Fig. 6). The power supply is
5 V and the transistor count is 26900.

3.5. Symbol tables


To get all symbol tables correct, the VHDL code for the symbol decoding as well
as for the length decoding has been generated from a thoroughly verified
template file. However, this method also yielded a sub-optimal symbol decoding that


can be further optimized, but with an increased probability of introducing design
errors in the code tables. This was not done in this implementation due to lack of
time, and because it was considered more important to get a functionally correct
implementation.

4. CONCLUSIONS
In this paper an implementation of a novel Huffman decoder architecture has
been presented. We have shown that the new structure can be used for fast Huff-
man decoding while still keeping a simple architecture. The throughput has been
increased by using a serial input combined with a serial/parallel length evalua-
tion. Since the current implementation uses standard cells it is reasonable to
believe that a full custom version of the same circuit can reach significantly
higher speed.

5. REFERENCES
[1] M. K. Rudberg and L. Wanhammar, New Approaches to High Speed
Huffman Decoding, IEEE Proc. ISCAS ´96, May 1996.
[2] ISO/IEC DIS 13818-2 Generic coding of moving pictures and associated
audio information, part 2: Video, (MPEG-2), June 1994.
[3] S. F. Chang and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part I – Concurrent VLSI Architectures, IEEE Trans. on Circuits
and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[4] H. D. Lin and D. G. Messerschmitt, Designing High-Throughput VLC
Decoder Part II – Parallel Decoding Methods, IEEE Trans. on Circuits and
Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.

Figure 6. Layout of the Huffman decoder.

Paper 3 - High Speed Pipelined Parallel Huffman Decoding

Paper 3

High Speed Pipelined Parallel Huffman Decoding

Mikael Karlsson Rudberg and Lars Wanhammar

Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’97, Hong Kong, June 1997.


High Speed Pipelined Parallel Huffman Decoding

Mikael Karlsson Rudberg and Lars Wanhammar


Department of Electrical Engineering, Linköping University, S-581 83
Linköping, Sweden
email: mikaelr@isy.liu.se, larsw@isy.liu.se

ABSTRACT
This paper introduces a new class of Huffman decoders which is a development
of the parallel Huffman decoder model. With pipelining and partitioning, a
regular architecture with an arbitrary degree of pipelining is developed. The
proposed architecture dramatically reduces the symbol decoder requirements
compared to previous results, even though the actual implementation of the
symbol decoder is not treated. The proposed architectures also have the
potential of realizing high speed, low power Huffman decoders.

1. INTRODUCTION
Huffman coding is a method for lossless data compression. The method is used
in a variety of fields, for instance in the JPEG image coding standard and the
MPEG video coding standards. With the introduction of High Definition digital
television (HDTV) the throughput requirements of the Huffman decoder will
increase by several orders of magnitude. Unfortunately the
Huffman decoding process is difficult to parallelize since the symbols are of
unequal length. It is not possible to know where the symbol boundaries are
before actually decoding them in sequence.
The Huffman code uses variable-length code words to compress its input data.
Frequently used symbols are represented with a short code while less often used
symbols have longer representation. The Huffman codebook forms an unbal-
anced binary tree with the symbols at the leaves. The Huffman decoding process
starts at the root node in the binary tree and stops at a leaf.
In this paper we extend previous work reported in [1] and [2], where architectures for high speed Huffman decoders are described. We generalize the concept of pipelined Huffman decoders and discuss the theoretical potential of this class of decoders. An improvement that dramatically decreases the symbol decoding speed requirements is also presented.

2. HUFFMAN DECODER MODELS


There are two main classes of Huffman decoders, the parallel decoder and the sequential decoder [3, 4], Fig. 1. The sequential decoder has a constant input data rate with a width of just a few bits (normally one or two bits). The code-tree is represented as a state machine that traverses the code-tree until a symbol is found. A common realization that belongs to this class of decoders is based on lookup tables stored in a memory [5]. This type of solution can be made very memory efficient, but since a state machine representation always has several feedback loops and an input data rate of just a few bits per cycle, the potential for high speed decoding is limited for this type of decoder.

Figure 1. Fundamental Huffman decoder models.

The parallel decoder consists of three different units: a symbol decoder that maps a bit-vector containing a coded symbol to a fixed length representation, a length decoder that calculates the length of the current code so that the shifting buffer knows


how many bits have been consumed and is able to refill the buffer. The parallel decoder has a varying input rate of 1 to Wcode,max bits/cycle depending on the length of the latest decoded symbol. Wcode is the length of the present code and Wcode,max is the length of the longest code in the codebook. The output rate is constant with a fixed delay for all symbols. The critical loop in the parallel decoder runs through the length decoder to the shifting buffer: before a new symbol can be decoded, the length of the previous code has to be found and the consumed bits must be discarded.
This paper will show that the parallel decoder has the potential of reaching a high decoding rate. In Fig. 1 the two discussed models are shown. In the remainder of this paper we will focus on the parallel Huffman decoder model.
One drawback with the parallel Huffman decoder in Fig. 1 is that the symbol decoder and the length decoder operate in parallel on the same code. Therefore the length of the code is not available when the symbol decoder starts the decoding, which makes the symbol decoding more difficult than necessary. This problem can, however, be solved by inserting a buffer in front of the symbol decoder as shown in Fig. 2. Since the length and symbol decoders here operate on different codes, the symbol decoder can take advantage of the fact that the length of the code is known.


Figure 2. Pipelined parallel decoder model.

3. PIPELINED PARALLEL HUFFMAN DECODING


In [1] we have shown that it is possible to completely remove the critical loop in
the parallel decoder. This is done by replacing the shifting buffer with a shift reg-
ister, and by using a pipelined length decoder. The resulting architecture will
have a structure similar to the parallel pipelined decoder but a behavior more like
the sequential decoder with a constant input data rate and a varying output rate
(Fig. 3).

The decoder in Fig. 3 operates as follows. The shift register continuously shifts the coded data from left to right. The codelength is evaluated in the pipelined length decoder unit and is represented with one separate signal for every length, i.e. Wcode,max signals. In every cycle one codelength is checked: in the first cycle it is checked whether the code is a one bit code, in the second cycle whether it is a two bit code, and so on until a matching length is found. At this time the code has been shifted out from the shift register and stored in a register feeding the symbol decoder. The symbol decoder starts, and the length decoder starts to examine whether the next code is a one bit code, and so on. Note that the feedback loop from the length decoder to the shifting buffer is no longer needed; it is replaced by a synchronous reset signal to a counter.
A major disadvantage of this structure is that the symbol decoder must be designed for a worst case sampling rate of fs,max = fclk to be able to handle successive one bit codes. This yields a low utilization of the symbol decoder, since the sampling rate is lower when longer codes are decoded (utilization η = 1/Wcode,ave, where Wcode,ave is the average codelength).
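The timing above, and the resulting utilization η = 1/Wcode,ave, can be illustrated with a small simulation (Python; the code lengths are hypothetical):

```python
# Sketch of the loop-free decoder's timing: the length decoder tests
# one candidate length per clock cycle, so a code of length W is
# resolved, and the symbol decoder started, W cycles after the
# previous start. Code lengths below are hypothetical.

def start_cycles(code_lengths):
    """Return the cycle at which each symbol decode is started."""
    starts, cycle = [], 0
    for w in code_lengths:
        cycle += w                 # w cycles to match a length-w code
        starts.append(cycle)
    return starts

lengths = [1, 3, 1, 2]             # lengths of successive codes
starts = start_cycles(lengths)
# Symbol decoder utilization = symbols / cycles = 1 / (average length):
utilization = len(lengths) / starts[-1]
```

With the lengths above the utilization is 4/7, i.e. the reciprocal of the average codelength 7/4.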

Figure 3. Loop free pipelined parallel decoder.

3.1. Reducing symbol decoder requirements


One way of increasing the utilization of the symbol decoder is to insert a buffer between the length decoder and the symbol decoder, and then use a slower symbol decoder. However, one can never guarantee that a buffer overflow will not occur when long sequences of codes with short codelengths arrive.


Another solution is to stop the length decoder and the shift register when fs,max is exceeded [2]. This can for instance be done by halting the length decoder and the shift register for a number of cycles as soon as a code with a length of less than M bits is found, where fs,max in the symbol decoder is fs,max = fclk/M. The penalty for this is that no symbol will be decoded in less than M cycles, i.e. the decoder will be less effective on short codes, which are also the most frequent ones. However, this can in some cases be accepted, since if the average codelength is low the average throughput will be high anyway. Unfortunately, halting the shift register results in the loss of the constant input data rate property.
In the next section we propose another method for reducing the requirements on the symbol decoder without any loss in efficiency. This is accomplished by taking advantage of the fact that the lengths of the codes are available and using this to partition the symbol decoder.

3.2. Symbol decoder partitioning


Only when the code stream contains a one bit code will the symbol decoder in
Fig. 3 be fed with a new code in two successive clock cycles (i.e. a code with
length k is followed by a code with length 1). If the one bit codes can be sorted
out before the symbol decoder, the maximum necessary sampling rate can be
halved. This can be done by switching to another decoder during one clock cycle as soon as a new code is fed into the decoder.
If there is no one bit code following, the switch is restored, making it possible for the original symbol decoder to receive a new code with a length of 2 or more bits. The one bit symbol decoder can obviously be made very simple since there is only one possible one bit code. However, the one bit decoder must have a maximum sampling frequency of fs,max = fclk. The original symbol decoder only needs to consider codes with a length of two bits or more.
In a similar way it is possible to partition the symbol decoders so that one decoder takes care of all codes in the range of 1 to N-1 bits and the other one of all codes in the range of N to Wcode,max bits. The first symbol decoder has a maximum sampling frequency of fs,max = fclk and the second decoder has a maximum sampling frequency of fs,max = fclk/N. Each of the two decoders can be optimized to handle only a sub-set of the complete codebook. If N is chosen reasonably small it is possible to have one fast but simple symbol decoder and one more complicated but also slower symbol decoder. Different decoding methods can be chosen for the decoders, a fast method for the small decoder and an area efficient solution for the large decoder. In Fig. 4 an architecture with a two symbol decoder solution is shown.
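The routing implied by this split can be sketched as follows (Python; the threshold N and the code lengths are hypothetical). Each code goes to one of the two decoders based on its already-known length, and the worst-case rates stated above follow from the minimum spacing between codes sent to each decoder:

```python
# Sketch of the two-way symbol decoder partitioning: codes shorter
# than N bits go to the fast/simple decoder, the rest to the slower,
# area-efficient one. Threshold and lengths are hypothetical.

def route(code_lengths, N):
    short, long_ = [], []
    for w in code_lengths:
        (short if w < N else long_).append(w)
    return short, long_

# Worst-case sampling rates relative to f_clk:
# the short decoder may see a new code every cycle (f_s <= f_clk),
# the long decoder at most every N cycles (f_s <= f_clk / N).
short, long_ = route([1, 2, 5, 1, 7, 3], N=4)
```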

The partitioning can be repeated, splitting the symbol decoder into K partitions.
If K is chosen to be equal to Wcode,max there will be one dedicated symbol
decoder for every codelength, and every symbol decoder operates with a maximum sampling frequency of fs,max = fclk/Wcode,j, where Wcode,j is the length of the code that symbol decoder j is optimized for. The resulting architecture can be
seen as a sorter that sorts the codes according to their length followed by a sim-
plified symbol decoding step. In Fig. 5 an architecture with the maximum parti-
tioned symbol decoder is shown. The architecture consists of a length decoder
with a k stages pipeline, a buffer with a depth of n, a sorter for sorting the codes
and a set of symbol decoders. The size of the buffer can be as low as zero. The
control is carried out by counters and logic blocks that check for the start conditions of the symbol decoders.


Figure 4. Partitioned symbol decoders with reduced requirements.

4. DISCUSSION
In this section the advantages and drawbacks of the proposed methods are discussed. The biggest advantage of the loop free pipelined parallel decoder with partitioned symbol decoding is the potential for very fast Huffman decoding at relatively low power consumption: fast because the critical length decoder can be pipelined to reach almost arbitrary speed, and low power because of the partitioned symbol decoding. Symbol decoders that are not used can be put in an idle state, which saves a considerable amount of power if the partitioning is well balanced. Note that using many partitions does not lead to much growth in the control structure, which would otherwise consume power. The reduced maximum sampling rates in the symbol decoders also save power since a lower clock frequency can be used, and because more power efficient but slower symbol decoders can be used. Unfortunately, a heavily pipelined length decoder will consume some power, but the length decoding unit is significantly smaller than the symbol decoder unit [2] and therefore accounts for a minor part of the total power.
There are two types of codebooks that are commonly used. In the MPEG stan-
dards the codebook is fixed and can therefore be hardwired into the decoder
logic. It is more difficult when the codebook is changed from time to time, as is the case in the JPEG image coding standard. However, in this paper we have not discussed the actual realization of either the length decoder or the symbol decoders (even though the length decoder must conform to the pipelined model). It
should be possible to successfully implement both fixed and dynamic codebooks
using the proposed architectures.

5. CONCLUSIONS
In this paper we have discussed different Huffman decoder models and their speed potential. The pipelined parallel decoder model is transformed into a fast loop free architecture by using a shift register as a replacement for the normally used shifting buffer. Further, we have developed an architecture that enables a highly partitioned symbol decoder, which can be used to combine high speed decoding with a power efficient solution. The proposed architectures do not require a fixed codebook or that the symbol decoders be realized in a particular way. Different solutions can be chosen depending on the sampling rate and the size of the codebook.

6. REFERENCES
[1] M. K. Rudberg and L. Wanhammar, "New Approaches to High Speed Huffman Decoding", IEEE Proc. ISCAS '96, Atlanta, USA, May 1996.
[2] M. K. Rudberg and L. Wanhammar, "Implementation of a Fast MPEG-2 Compliant Huffman Decoder", Proc. EUSIPCO '96, Trieste, Italy, September 1996.
[3] S. F. Chang and D. G. Messerschmitt, "Designing High-Throughput VLC
Decoder Part I - Concurrent VLSI Architectures", IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June
1992.

[4] H. D. Lin and D. G. Messerschmitt, "Designing High-Throughput VLC
Decoder Part II - Parallel Decoding Methods", IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June
1992.
[5] S. Ho and P. Law, "Efficient Hardware Decoding Method for Modified Huffman Code", Electronics Letters, Vol. 27, No. 10, pp. 855-856, May 1991.


Figure 5. Fully partitioned loop free pipelined parallel Huffman decoder.

Paper 4 - Design of a JPEG DSP using the Modular Digital Signal Processor Methodology

Paper 4

Design of a JPEG DSP using the Modular Digital Signal Processor Methodology

K.-G. Andersson, Mikael Karlsson Rudberg, and Anders Wass

Proceedings of International Conference on Signal Processing Applications & Technology, ICSPAT'97, San Diego, USA, Sept. 1997.


Design of a JPEG DSP using the Modular Digital Signal Processor Methodology

K-G Andersson 1), Mikael Karlsson Rudberg 1,2) (ekamiru@eka.ericsson.se), Anders Wass 3) (Anders.Wass@eka.ericsson.se)
1) Ericsson Components, Microelectronics Research Center, Stockholm, Sweden.
2) Department of Electrical Engineering, Linköping University, Linköping, Sweden.
3) Ericsson Components, Microelectronics Division, ASIC & ASSP Sector, Stockholm, Sweden.

Abstract
In this paper we present the design of a JPEG decoder using the Modular
DSP Methodology (MDSP). It is shown that the MDSP methodology is a
powerful tool for doing hardware-software co-design. The hardware
resources have been chosen to match the frequently used operations in the
JPEG standard to increase performance. The JPEG decoder has been real-
ized using a dual core solution where irregular and static algorithms have
been separated.

1. INTRODUCTION
The Modular DSP (MDSP) Methodology is a method for modelling Application Specific DSPs (ASDSPs). The MDSP methodology aims at tackling some of the most important issues in bridging the gap from algorithms down to silicon and at moving the two levels closer [1,2,3].
This paper discusses how the MDSP Methodology was used during the design of a JPEG decoder.
Common to all wideband communication and storage systems is the need for compression of speech, image, data, audio, and video. International organizations, such as CCITT and ISO/IEC JPEG (Joint Photographic Experts Group) [4], have standardized compression algorithms and formats for images. Reprogrammability is important for the adaptation to different applications and markets. A JPEG DSP should contain the arithmetic functions needed for the specific algorithm and should be designed with the appropriate wordlengths in the different parts of the architecture. The memory requirements (size, wordlength) and the partitioning of the memory structure have to be considered as well. The JPEG decoder has been modelled to fulfill the CCIR601 requirements.
The JPEG algorithm consists of four stages: data is transformed to the frequency domain using the Discrete Cosine Transform (DCT), quantized to remove frequencies in the picture that are of minor interest, run-zero encoded to replace sequences of zeroes with a shorter representation, and finally Huffman encoded, which results in a variable length code. The decoding is in principle a reversal of the operations in the encoder. The frame to be encoded is split into blocks of 8x8 pixels that are then individually coded.

2. METHODOLOGY
Why do we see the need for a new methodology, and what problems do we solve with the MDSP methodology?
First of all, we see a rapidly growing need to do early design trade-offs and performance estimations. To do that we must have a powerful modelling methodology where different algorithmic and architectural solutions can be quickly evaluated. We also want an environment where the designer's experience is captured, i.e. the environment must provide a high degree of interactivity instead of leaving important tasks such as scheduling and resource allocation entirely to the tools.

Future consumer electronics put requirements on the hardware that can be hard to fulfill today: high speed, high complexity, low power and low cost. To be able to meet these architecture goals it is obvious that the level of integration must be increased, the hardware must be matched to the algorithms and the design process must be shortened. We believe that using the MDSP Methodology for Application Specific DSPs matched to the algorithms is the way to handle increased complexity and reduce power consumption.

2.1. Modelling with the MDSP methodology


An MDSP model provides a bit-true and cycle-true model using a hardware
description language called µC. The language is mainly a subset of the C lan-
guage extended with some features. There are four types of storage elements
defined: input ports, output ports, memories and registers. Parallelism has
been introduced by redefining the ‘,’-operator in C to mean parallel operations in
µC. The statement delimiter ‘;’ in C is redefined to delimit clock cycles in µC.
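As an illustration of these semantics, a hypothetical µC-style fragment (the exact µC syntax is not given in this paper, so the snippet below is an assumption based only on the ',' and ';' rules just described):

```
/* Hypothetical µC-style fragment (illustrative only).
   ',' joins operations executed in parallel in the same clock cycle;
   each ';' marks a clock-cycle boundary. */
acc = acc + mem[i], i = i + 1;   /* cycle 1: accumulate and step address */
out = acc, acc = 0;              /* cycle 2: emit result, clear accumulator */
```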
The architecture and algorithm can be concurrently developed since the µC-model contains the algorithm, the hardware resources (memories and registers) and also, implicitly, the ALUs and the control unit. When additional resources are needed they are simply added to the model. The µC-model defines a virtual DSP that supports the set of operations actually performed in the µC-code. The final hardware implementation can then have a different architecture as long as it supports the operations contained in the code, i.e. the µC-model defines a minimal architecture.
The tool environment consists of a compiler that generates a simulation model
from the µC-code and checks the code against the target architecture to assure
that it can execute the µC program. Furthermore there is a simulator with the
capability to cosimulate several µC cores.

3. HARDWARE PARTITIONING
There are two types of algorithms, data dependent and static. Data dependent algorithms are characterized by having many data or parameter dependent processing branches. A typical data dependent algorithm is the parsing and control of a JPEG-coded data-stream. There are several types of data blocks that require different kinds of decoding. Most parameters are located at the beginning of the datastream, which must be parsed and then used to select the appropriate decoding algorithm. A static algorithm, on the other hand, is the Inverse Discrete Cosine Transform (IDCT), which is a part of the JPEG standard.
Figure 1. The MDSP Methodology design flow


These two types of algorithms require different types of architectures. A data dependent algorithm requires hardware that is well supplied with control registers and control operators, such as comparisons with registers and conditional branches, and possibly a stack to enable function calls. The performance of hardware executing a data dependent algorithm can to some degree be measured by its ability to effectively implement compare-and-branch operations. The program memory is often large for the data dependent algorithm due to the many alternative processing steps.
The static algorithm is less control oriented. Normally, a static algorithm repeat-
edly executes a relatively small number of lines. The performance is mainly lim-
ited by the degree of parallelism in the hardware.
The JPEG decoder is partitioned into two cores, one optimized for the data dependent parameter parsing and Huffman decoding. This core also performs most of the control tasks. The other core is the IDCT processor, which implements the static IDCT algorithm in a pipelined, hardware intensive datapath. The partitioning is natural given the discussion above, since static and data dependent algorithms require different types of architectures.

3.1. Interface design


Two I/O modes are defined in the MDSP Methodology, parallel and serial. The I/O uses a handshake protocol to simplify synchronization.
The coded datastream is fed byte by byte into the Huffman processor. The decoded pixels are found at the output from the IDCT core. The JPEG datastream contains a quite extensive set of parameters that are of interest when displaying the decoded images. In order to avoid a separate decoding of these parameters in the display device, the parameters are decoded in the Huffman core and are then accessible through a DMA port when the parameter memory is not used internally. See Fig. 2.
The internal interface between the cores consists of a parallel data port, Outp_RZ,
that outputs run-zero coded data. A synchronization signal, DC, that is activated
in the beginning of every block is provided in order to synchronize the two cores.
A stop signal halts the Huffman processor when the IDCT core can not receive
data at the required rate.

Figure 2. The JPEG MDSP dual core processor.

4. HARDWARE/SOFTWARE TRADE-OFFS
It is important to use the right kind of hardware resources in an architecture. The
performance can be significantly reduced if the architecture is register limited so
variables have to be stored in a memory and then read back into the datapath
repeatedly. It is also performance limiting to do multiplication using a shift-add
approach, if this is done often. A trade-off must be made between when to add dedicated resources to the hardware and when to solve a problem using the already available hardware and software. It might for instance be more efficient to add an
adder and a few registers instead of a multiplier-accumulator in the datapath if
the multiply-accumulate operation is seldom used.

4.1. Huffman processor


The Huffman processor core consists of one datapath, two memories and one
address processor. The data and address paths are built from MDSP templates.
The different parts of the processor can be seen in Fig. 4. There are two memories, one storing the Huffman codebook (Run/Size and code length) in 29-bit words. The smaller memory is used for the storage of quantization tables, temporary data and various parameters used by the JPEG algorithm.
In the Huffman core we have chosen to use dedicated hardware to detect a special marker byte (FF) in the datastream. This is done since this marker byte can occur anywhere in the datastream; instead of a time consuming software test, there is hardware that generates a trap that forces a jump to a software routine that can handle the marker byte. To make header decoding efficient we use comparators and special masking hardware. A barrel shifter is also used. The Huffman decoding is programmed in software using a table look-up technique.

4.2. IDCT processor


The IDCT processor core consists of two memories, where an input process stores the input in one memory while the other one is used for calculation. The algorithm used in the IDCT requires 11 multiplications and 29 additions per 1-dimensional IDCT (1D IDCT) [7]. The 2D IDCT is calculated by doing 8+8=16 1D IDCTs.
The IDCT core, which executes a static algorithm, has been allocated enough resources to continuously start the processing of a new block every 160th cycle. Three multiply-accumulate elements and four adders are used in the main datapath, and all are fully utilized. The register usage has been optimized for a calculation in three stages with separate register files.

The Huffman core delivers run-zero coded data in zig-zag order. The data is expanded and written into a memory. The datapath consists of three multiply/accumulate blocks (macc) and five adders (see Fig. 3). The IDCT on an 8x8 block is performed by doing a 1-dimensional IDCT on each column followed by a 1-dimensional IDCT on all eight rows. An offset of 128 is added during the read out stage.
5. CONCLUSIONS AND FURTHER WORK


In this paper we have presented the MDSP modelling methodology for Application Specific DSPs. The methodology has been found to be an efficient way of performing hardware-software co-design since the hardware and software are developed in the same model and consequently also simulated and verified in the same design environment. A JPEG decoder with two processor cores, modelled


Figure 3. IDCT processor datapath.


using the MDSP Methodology has been introduced, showing the strength of the MDSP concept. The capability to achieve efficient architectures even with complex, irregular algorithms such as the JPEG standard has been demonstrated.
The MDSP Methodology is today used in the regular design flow at Ericsson
Components. The design environment is continuously evolving. Features are
added to the modelling language and the methodology when needed by new
design projects.

6. REFERENCES
[1] K-G Andersson, Anders Wass, Karam Parmar: A Methodology for
Implementation of Modular digital Signal Processors, ICSPAT ’96, Boston,
MA, Oct. 7-10, 1996.
[2] K-G Andersson: A Design Environment for Modular-dsp Architectures,
Electronic Design Autom. Conf., Kista, Stockholm, March 15, 1994.
[3] K-G Andersson, Implementation and Modeling of Modular Digital Signal
Processors, LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.

[4] ISO/IEC 10918-1: Digital compression and coding of continuous-tone still
images, 1994-02-15.
[5] C. Liem, T. May, P. Paulin: Instruction-Set Matching and Selection for DSP and ASIP Code Generation, Proceedings of the European Design and Test Conference, February 1994.
[6] Gert Goossens, Jan Rabaey, Joos Vandewalle, Hugo De Man: An Efficient Microcode Compiler for Application Specific DSP Processors, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 9, September 1990.
[7] Z. Wang, "Fast Algorithms for the Discrete Cosine Transform and for the
Discrete Fourier Transform", IEEE Transactions on ASSP, Vol ASSP-32,
No.4, pp. 803-816, Aug. 1984.


Figure 4. Huffman processor datapath.

Paper 5 - Design and Implementation of an FFT Processor for VDSL

Paper 5

Design and Implementation of an FFT Processor for VDSL

Mikael Karlsson Rudberg, Martin Sandberg, and Kent Ekholm

Proceedings of IEEE Asia-Pacific Conference on Circuits and Systems, APCCAS'98, Chiangmai, Thailand, Nov. 1998.


Design and Implementation of an FFT Processor for VDSL

Mikael Karlsson Rudberg (Mikael.Rudberg@eka.ericsson.se),


Martin Sandberg (Martin.Sandberg@eka.ericsson.se),
Kent Ekholm (Kent.Ekholm@ericsson.com)
Ericsson Components AB, 164 81 Kista, SWEDEN.
Phone +46 8 757 5295, Fax +46 8 757 5032
e-mail Mikael.Rudberg@eka.ericsson.se

Abstract
In this paper we present an implementation of an FFT processor for VDSL applications. Since no standard is yet available for VDSL, high requirements on flexibility were put on the design. A concurrent hardware and software design methodology made it possible to trade between hardware and software realizations in order to get an effective and flexible architecture.

1. INTRODUCTION
The Fast Fourier Transform (FFT) is an efficient way of calculating the discrete Fourier transform and is often used in multicarrier communication systems. The FFT processor presented in this paper is aimed at VDSL (Very high speed Digital Subscriber Line) applications, one of the candidates for providing wideband communication capabilities to the consumers, Fig. 1. VDSL systems use the already installed base of twisted pair copper cables for the last few hundred meters to the homes. The VDSL system is a multicarrier system, which is attractive for wideband transmission because of its capability to adapt to different channel characteristics.
Today there is no standard for VDSL and there are several candidates that put different requirements on the FFT processing. Currently the number of carriers is unspecified, and it is still uncertain whether data shall be transmitted time multiplexed or frequency multiplexed over the channel. This uncertainty made it important to have a programmable FFT processor. The computational requirements excluded a solution with standard DSPs. Therefore the FFT has been realized as an application specific signal processor (ASSP) targeting FFT processing.
The worst case processing requirement that can be handled is two streams of continuous 50 MHz real input data with simultaneous processing of both FFTs and IFFTs with lengths up to 1024 points. One of the output ports is equipped with a multiplier that can be used as a frequency equalizer. A cyclic prefix of arbitrary length can be added at the output of the IFFT and automatically discarded at the input of the FFT.
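Cyclic prefix insertion and removal are simple slicing operations around the transforms; a NumPy sketch (the prefix length L is arbitrary here):

```python
import numpy as np

# Sketch of cyclic prefix handling in a multicarrier symbol: the
# transmitter prepends the last L time-domain samples (after the
# IFFT); the receiver discards the first L samples (before the FFT).

def add_cyclic_prefix(symbol, L):
    return np.concatenate([symbol[-L:], symbol])

def remove_cyclic_prefix(rx, L):
    return rx[L:]
```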

Figure 1. VDSL transmission system.

2. ALGORITHM
The implemented algorithm is a well known decimation in frequency radix-4
FFT algorithm [1]. The primitive operation is the radix-4 butterfly shown in Fig.
2.
Since the input data to the FFT and the output data from the IFFT are real valued, and the FFT is a complex transform, it is possible to calculate a 2048-point FFT by first doing a 1024-point complex FFT and then performing a separation pass, ending up with the same result as if a full 2048-point FFT had been calculated [2]. This extra separation pass has a structure close, but not identical, to a radix-2 butterfly. To be able to support other FFT lengths than 4^n, radix-2 butterflies also have to be supported.
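The separation idea can be illustrated with a small Python sketch (a naive DFT stands in for the radix-4 pipeline, and the function names are ours):

```python
import cmath

def dft(z):
    """Naive N-point complex DFT, standing in for the radix-4 FFT."""
    N = len(z)
    return [sum(z[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def real_fft_via_complex(x):
    """2N-point FFT of real data x via one N-point complex FFT plus a
    separation pass; returns bins 0..N (the rest follow by symmetry)."""
    N = len(x) // 2
    # Pack even samples into the real part, odd samples into the imaginary part.
    Z = dft([complex(x[2 * n], x[2 * n + 1]) for n in range(N)])
    X = []
    for k in range(N + 1):
        Zk, ZNk = Z[k % N], Z[-k % N].conjugate()
        Xe = 0.5 * (Zk + ZNk)     # FFT of the even samples
        Xo = -0.5j * (Zk - ZNk)   # FFT of the odd samples
        X.append(Xe + cmath.exp(-1j * cmath.pi * k / N) * Xo)
    return X
```

The per-bin recombination is the separation pass: one complex add/subtract pair and a twiddle multiply, close to (but not identical to) a radix-2 butterfly.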

Figure 2. Radix-4 decimation in frequency butterfly.
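In Python, the butterfly of Fig. 2 can be written as follows (a behavioural sketch; w1, w2, w3 are the twiddle factors W^p, W^2p, W^3p):

```python
def radix4_dif_butterfly(x0, x1, x2, x3, w1, w2, w3):
    """Radix-4 decimation-in-frequency butterfly as in Fig. 2."""
    a, b = x0 + x2, x0 - x2      # first add/subtract stage
    c, d = x1 + x3, x1 - x3
    return (a + c,               # X(0), no twiddle
            (a - c) * w2,        # X(2), twiddle W^2p
            (b - 1j * d) * w1,   # X(1), the -j rotation then twiddle W^p
            (b + 1j * d) * w3)   # X(3), twiddle W^3p
```

With unity twiddles the four outputs are exactly the 4-point DFT, in the digit-reversed order X(0), X(2), X(1), X(3) shown in the figure.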

3. DESIGN FLOW
The FFT project is the first project where Ericsson’s Modular DSP methodology (MDSP) has been used throughout the entire design. The design methodology is aimed at programmable ASSPs and has previously been reported in [3]. A case study resulting in a JPEG decoder architecture, which never was implemented, was made in [4], and we have also studied other algorithms.
The methodology encourages the designer to do trade-offs between hardware and software realizations by offering a unified design environment and modeling language for both the hardware and the software.

Figure 3. Design flow.

The architecture and the application program are concurrently evolved from the requirements that are put on the application. The description language, µC, is derived from the C language with some modifications. An RTL description of the hardware is manually or automatically derived from the application program. The software can be refined after the hardware extraction, but to assure that it still is possible to execute the application program on the architecture, a formal verification tool is available. See Fig. 3 for the design flow. An example of the design language is given in Fig. 4 below.
The RTL description is taken through a traditional ASIC flow, and the microcode that shall run on the processor is generated by a compiler.
It is important to note that translating the µC model to RTL and microcode is a mapping process: information about resource allocations and scheduling is found in the model. The advantage of this approach is that the designer has full control of both the architecture and the scheduling, and can therefore get maximal performance from the design.
The key benefits with the design methodology are:
• An effective design language which enables short design time.
• Concurrent modeling of hardware and software.
• Fast simulation compared with Verilog and VHDL.
• Results in a programmable DSP architecture that can be re-programmed
after processing.

4. DESIGN SPACE EXPLORATION


The specification for the FFT processor has been changed several times during the project. The reasons for this are the lack of a standard for VDSL and that the implementation work was made concurrently with the design of the entire modem. It was essential to have a methodology that made it possible to quickly try different design alternatives. Essential for this are an effective modeling language and a fast simulator.
One of the solutions evaluated was, for instance, where to handle the cyclic prefix. From being realized in pure hardware, it was moved to software in the DSP core. This reduced the complexity and increased the flexibility, though it requires a reboot to change the cyclic prefix. The memory mapping functionality that resided in the DSP hardware and software turned out to be more efficient to map directly to hardware in the memory system.


INPUT in(10);
OUTPUT out(10);
REG acc(10);
REG cnt;
RAM mem(16,10); // 16 words, 10 bit wide

void main()
{
  // init
  acc=0;
  // fill memory
  for(cnt=0; cnt<16; cnt++) mem[cnt]=in;

  // calculate square
  for(cnt=0; cnt<16; cnt++)
    out=mem[cnt]*mem[cnt];
}

Figure 4. µC description of simple squaring DSP with on chip memory.


After an FFT calculation a rearrangement of the memory must take place, since data will be stored in bit-reversed address order (i.e., output data number 01011 (binary) is stored at location 11010). This is normally handled either by making a rearrangement pass in the DSP or by performing I/O in bit-reversed order. In this case there is a continuous stream of both input and output data. Reading data in bit-reversed order means that the input data is stored in bit-reversed order. To hide this from the address generators in the DSP, there are two addressing modes, normal and bit reversed. In bit-reversed mode the actual address fed to the memory is the bit-reversed address.
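The bit-reversed addressing mode corresponds to the following operation (a Python sketch; the hardware of course does this by rewiring the address bits rather than looping):

```python
def bit_reverse(addr, bits):
    """Return addr with its 'bits' least significant bits reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (addr & 1)  # shift out LSB, shift into result
        addr >>= 1
    return r
```

bit_reverse(0b01011, 5) gives 0b11010, matching the example above; applying it twice returns the original address, which is why one addressing mode per direction suffices.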

5. ARCHITECTURE
The FFT processor is divided into five cores: two FFT datapaths, two IO blocks and one memory system, see Fig. 5. The IO blocks handle the input and output of data and are not programmable, but parametrized to be able to handle different cyclic prefix and FFT lengths. The memory system contains six sets of memories, where each memory set contains 1024 complex words. The datapath blocks perform the actual FFT calculations and the control functionality. The two datapaths are identical and operate independently. All communication between the blocks is from register to register.
The internal clock is generated on chip and operates at two or four times the external clock. The maximum internal clock rate is 100 MHz. The application program is loaded at boot time through a bit-serial port using a separate program loading clock.
The wordlength used is 18 bits for data and 16 bits for the coefficients (W^p in Fig. 2).

Figure 5. Block partitioning.

5.1. IO
The IO block consists of an address generator and a complex multiplier, which was included as part of an equalizer that was needed in the application. The IO core is controlled from the datapath block.

5.2. Memory
The memory block has four read ports and four write ports. There are six memory sets, each with two physical memories, enabling concurrent read and write accesses using single-port memories. The use of six memory sets is forced by the proposed time-multiplexed transmission method, which requires buffering.

5.3. Datapath
The datapath block consists of a datapath for the calculations, three address generators, a control block and a control unit; see also Fig. 6, where the calculation unit of the datapath block is outlined.


The datapath is a complex datapath where all instructions operate on complex words. The instructions are of Very Long Instruction Word (VLIW) type, i.e., all control signals are stored in the program memory and no instruction decoding is necessary. The benefit is that it is possible to fully utilize the parallelism in the hardware, which allows a high degree of hardware utilization.
The drawback of a VLIW architecture is the inefficient memory usage, since even seldom used control signals have their own position in the instruction word and valuable memory space is wasted. In this project we introduced a hybrid solution where some instruction decoding was made for seldom used instructions, i.e., control-type instructions. The DSP also has different modes, which enabled us to multiplex some control signals. These tricks with the control signals gave a 30-50% decrease in the instruction word width, which finally ended up at 144 bits. The ASSP approach resulted in an architecture with a resource utilization of 90% in the address generators and the actual datapath.
The critical loop in an FFT calculation is the calculation of a radix-4 butterfly. In our architecture this can be done in four clock cycles. To achieve this, a hardwired loop controller is included, as well as parameter registers that control some operations in the datapath during execution of the critical loops (e.g. offset registers for address calculation).

6. IMPLEMENTATION
The implementation has been made in a 0.35 µm process. The design consists mainly of standard cells, memories, and a PLL. The complex multipliers had to be internally pipelined in two stages to reach sufficient speed.
Necessary for the success of this project was the availability of a good timing-driven place and route tool. In a 0.35 µm process the wire capacitance contributes too much to the delays to get a good correlation between the estimated delays from the synthesis tool and the actual layout. A photo of the final chip is given in Fig. 7.

6.1. Key data


In Table 1 below some key data of the FFT processor are summarized. The chip has been successfully tested.

Figure 6. Datapath outline.

Power supply: 3.3 V
Power consumption: 3 W
Chip size, active area: 46 mm²
Chip size, total area: 56 mm²
Process: UMC 0.35 µm
On-chip memory: 360 Kbit
Number of gates (memory excluded): ~150 000
Maximum clock rate: 100 MHz
Computation time for a 2048-point FFT with real input data (using one of the two datapaths): 80 µs

Table 1. FFT processor data.


7. CONCLUSIONS
In this paper the design and implementation of a high performance FFT proces-
sor has been described. A new design methodology with concurrent hardware
and software development have been proven to work. It has been shown possible
to design and implement an ASSP starting without specification in a short time
period using the MDSP design flow.

8. REFERENCES
[1] Gentleman W. M. and Sande G., “Fast Fourier Transform for Fun and Profit”, Proc. 1966 Fall Joint Computer Conf. (AFIPS), Vol. 29, pp. 563-578, Washington DC, Spartan, Nov. 1966.
[2] Brigham, “The Fast Fourier Transform and its Applications”, Prentice Hall,
1988.
[3] K-G Andersson, “Implementation and Modeling of Modular Digital Signal
Processors”, LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[4] K-G Andersson, Mikael Karlsson Rudberg, and Anders Wass. “Design of a
JPEG DSP using the Modular Digital Signal Processor Methodology“,
ICSPAT ’97, San Diego, CA, USA, Sept. 14-17, 1997.

(Chip photo with the PLL, data memory, standard cell area, and program memory regions marked.)
Figure 7. FFT chip.


Paper 6

Application Driven DSP Hardware Synthesis

Mikael Karlsson Rudberg and Mikael Hjelm

Proceedings of IEEE Nordic Signal Processing Symposium, NORSIG’00, Kolmården, Sweden, June 2000.


Application Driven DSP Hardware Synthesis

Mikael Karlsson Rudberg (1,2) and Mikael Hjelm (1)

1) Ericsson Microelectronics AB, 164 81 Kista, SWEDEN
2) Department of Electrical Engineering, Linköping University,
S-581 83 Linköping, SWEDEN
Mikael.Rudberg@mic.ericsson.se, Mikael.Hjelm@mic.ericsson.se

ABSTRACT
In this paper we present a synthesis tool aimed at application-specific DSP processors. The purpose of the presented work has been to develop a tool where it is easy for a designer to try different approaches in order to achieve a well-balanced architecture. In the paper we discuss the algorithms in the tool and show, by example, the intended way of operation.

1. INTRODUCTION
DSP processing in modern communication systems is today normally carried out either in programmable DSP processors or in dedicated ASICs with little or no programmability.
An ASIC solution offers high performance in terms of processing power and power consumption. The ASIC is targeted at one or a few tasks and can therefore be optimized to meet the desired computational requirements, memory bandwidth, etc.

The DSP processor is made to support a wider range of applications and must therefore have an extensive instruction set and more on-board memory. In this paper we focus on applications that require a high degree of flexibility, but for a given application. Examples of such applications include FFT processing, Viterbi and Reed-Solomon decoding. Each one of these examples exists in various variants, working with different block sizes, etc.
For these applications we want to be able to evaluate various instruction sets as well as different degrees of parallelism in a hardware-software co-design process. Instead of an automatic synthesis tool we need an interactive environment that gives the designer the opportunity to describe different architectures in an efficient way and then quickly get the resulting netlist.
In this paper we show a solution that makes it possible to synthesize a DSP processor from an executable, cycle-true model of the processor and the application. This is done using a synthesis tool where most of the design choices are made by the designer.

2. RELATED WORK
Synthesis of DSP processors has been studied by several groups around the world. The main difference between our approach and previously reported synthesis systems, as for instance [1,2], is that instead of having advanced algorithms in the tool, we leave most of the design choices to the designer. That is, the designer is used as the intelligent component in the system and the synthesis tool just performs the hard work.

3. SYNTHESIS FRAMEWORK
The synthesis tool has been designed to fit into the MDSP design flow, a design methodology that allows the designer to use a C-like description language called µC for defining the DSP [3,4]. The synthesis tool takes as input the cycle-true µC model and gives as output an architecture that is able to execute the algorithms described in the model. The program memory image is then created using other tools in the framework. The generated architecture is later on passed to a VHDL compiler in order to generate a netlist suitable for the layout tool, Fig. 1.


Figure 1. Synthesis design flow.

4. THE DSP SYNTHESIS TOOL


The synthesis tool takes as input the simulation model written in µC, which contains all information about the desired instruction set as well as the desired parallelism. What is not given in the description is how many, and which types of, ALUs are wanted. The description also lacks explicit information about how the resources should be connected together.

4.1. Target architecture


The DSP synthesis tool has a target architecture consisting of a number of ALUs, register files, memories, I/Os, busses, a control unit and a program memory, Fig. 2. One difference between a general purpose DSP architecture and a dedicated one is how registers are used. A general purpose DSP has large register banks with general purpose register files, and ALUs that can be used for everything from addressing to normal data processing. In a dedicated architecture it is possible to have dedicated registers and ALUs for addressing and for different types of data. Therefore we have chosen to target an architecture with dedicated resources for different types of tasks in the DSP.

4.2. Synthesis library


The synthesis tool maps the µC description to structural VHDL containing primitives found in a synthesis library. The library consists of registers, memories, various types of I/O blocks and arithmetic logic units (ALUs).
The I/Os are either plain registers, asynchronous ports that communicate using handshaking, or user-defined I/O. The RAMs have separate read and write buses, which is common in many on-chip RAMs.
Figure 2. Target architecture.


There is also a control unit available in the library that supports a set of instructions such as jump, conditional jump and sequential execution. All control signals needed in the datapath are taken directly from the control unit. The instruction decoding is supposed to be made inside the control unit and is not done by the synthesis tool.
Any kind of functional block can easily be included in the synthesis library by describing the block in VHDL and adding a description where the supported instructions are listed.

4.3. Synthesis
The synthesis process is divided into a number of stages that analyze the resource needs and then create an architecture that is matched against the algorithms to implement.
In the first stage the µC model is analyzed to find out which hardware is explicitly declared, i.e. all memories and registers. Secondly, the tool analyzes which operations are performed in the program flow. The source and destination registers for each instruction are also stored.
In the third stage the operations are mapped to ALUs. This can basically be done in two ways, realizing either a minimal architecture with as few ALUs as possible, or a maximal architecture where little or no resource sharing is made. A minimal architecture will require ALUs supporting many instructions, while a maximal architecture gives many, but simple, ALUs.
The tool creates an architecture where each destination register gets a dedicated ALU. The ALU is chosen from the synthesis library by finding the ALU that supports all operations that have the given register as target register. Hence, the resulting architecture will be an architecture with one ALU attached to each register. In order to optimize the architecture, the number of ALUs must be reduced. Typically, each ALU should have a number of registers attached to it, i.e. a register file. Therefore it is possible to define register files, telling the synthesis tool to attach the same ALU to each register in the register file, Fig. 3.
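The per-register ALU selection can be sketched as a small search over the library; the data structures below are our own illustration, not the tool's actual format:

```python
# Hypothetical library: ALU name -> (supported operations, relative cost)
ALU_LIB = {
    "add":    ({"+"},           1.0),
    "addsub": ({"+", "-"},      1.5),
    "mac":    ({"+", "-", "*"}, 4.0),
}

def select_alus(ops_per_register, library):
    """For each destination register, pick the cheapest ALU supporting
    every operation that has the register as target."""
    chosen = {}
    for reg, ops in ops_per_register.items():
        candidates = [(cost, name) for name, (supported, cost)
                      in library.items() if ops <= supported]
        if not candidates:
            raise ValueError("no ALU in the library supports " + str(ops))
        chosen[reg] = min(candidates)[1]  # lowest-cost matching ALU
    return chosen
```

Registers grouped into a register file would simply have their operation sets merged before the lookup, yielding one shared ALU, as in Fig. 3.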
Normally the selected ALU is the one that most closely matches the needed instruction set; however, it is also possible to control the tool such that the most power- or area-efficient, or even the fastest, ALU is chosen. This is accomplished by storing relative power, area and speed weights in the synthesis library. Finally, the interconnections are created by analyzing the model in order to see which blocks have to be able to communicate. Only necessary communication paths are created. The communication paths contain only multiplexers and wires, i.e. no tri-state buses are used.

4.3.1. User control

The degree of interactivity during the design process is intended to be high. The main goal has been to provide a tool that makes it easy for the designer to get the intended architecture. Therefore the tool contains little inherent intelligence, but is easy to control by flags fed to the synthesis tool, by modifying the synthesis library and/or by rewriting the model.

Synthesis of:
acc1=reg_a+reg_b;
acc2=reg_a-reg_b;

With normal synthesis, acc1 and acc2 each get a dedicated ALU (one performing + and one performing -); with acc1 and acc2 declared as a register file, they share a single +/- ALU.
Figure 3. Synthesis of ALUs for register and register files.

The type of ALU chosen for a given register may not be the one the designer wants to have. The type of ALU can therefore be explicitly assigned using a configuration file as input to the synthesis tool. In this way it is possible to add a more powerful ALU that supports more instructions than required by the present application.

5. EXAMPLE
In this section an example of how to use our synthesis tool in the design flow is given.
In Fig. 4 an example of µC code for a 32-tap FIR filter is given. Passing this description through the synthesis tool without any ALUs declared gives the architecture shown in Fig. 5 (control unit excluded). The tool creates an architecture that can execute the given task, and nothing more. In order to achieve an implementation that is easier to reuse, for instance if we want to support any filter length up to 32 taps, the instruction set has to be extended. This has to be made such that it becomes possible to realize an addressing scheme other than modulo 32.
To realize this we may for instance include circular buffers for the calculation of data and coefficient addresses. Since a circular buffer may be useful in the future, we decide to add a circular buffer ALU to our synthesis library and then instantiate it in the µC model. In Fig. 6 it is shown how to change the µC model and what to add to the synthesis library. The new, more general datapath is shown in Fig. 7.
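The assumed semantics of circ_add — advance an address and wrap modulo the buffer length held in firl — can be modelled as follows (this is our assumption about the wrap behaviour, not a documented definition of the ALU):

```python
def circ_add(step, addr, length):
    """Advance addr by step inside a circular buffer of 'length' entries
    (assumed semantics of the circ_add ALU)."""
    return (addr + step) % length
```

With this ALU the data and coefficient address registers wrap at the programmed filter length instead of at the fixed modulo-32 boundary.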

6. FUTURE WORK
The implemented heuristic, with ALU selection based on target registers, leads to an architecture that normally works well for dedicated DSPs. In modern general purpose DSPs there is normally a number of parallel ALUs connected to one register file. This is an architecture that cannot be supported in the present version of the synthesis tool. One of the problems with the synthesis of such an architecture is that it is difficult to decide which instruction to put in which ALU. In order to decide how many ALUs to attach to a register file, the parallelism within the register file has to be analyzed.
The parallel ALU problem can today be worked around by explicitly instantiating it in the µC model, but a smoother way, making it easier to elaborate on different solutions, would be preferred.


// Declaration part
MDSP fir
{
  INPUT inp(14, PARALLEL);         // input port, 14 bits
  OUTPUT outp(14, PARALLEL);       // output port, 14 bits
  REG acc(30), i(6), ca(5), da(5); // different registers
  RAM d(32,16);                    // RAM with 32 16-bit words
  ROM c(32,16, "rom.data");        // ROM

  PROCEDURE compfir();             // procedure declaration
}

// Code part

PROCEDURE main()
{
  for(;;){               // loop forever
    do {;} while(!inpF); // while no input on the input port inp, do nothing
    inpF=0, d[da]=inp;   // reset input by setting inpF=0, store inp in RAM;
                         // "," means that this is made in parallel

    compfir();           // call procedure compfir
    outp=acc;            // place the value of acc on the outp port
  }
}

PROCEDURE compfir() // compute fir
{
  acc=0, ca=0;

  i=30;
  do {
    acc+=d[da++]*c[ca++],
    i--;
  } while (i>0)
  acc+=d[da]*c[ca++];
  return;
}
Figure 4. µC model of 32 tap FIR filter.
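For reference, the computation the µC program above carries out is a plain FIR convolution; a Python sketch of the same arithmetic:

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: y(n) = sum over k of c(k) * x(n-k)."""
    out = []
    for n in range(len(samples)):
        acc = sum(coeffs[k] * samples[n - k]
                  for k in range(len(coeffs)) if n - k >= 0)
        out.append(acc)
    return out
```

The µC version performs one multiply-accumulate per instruction, with the modulo-32 address registers da and ca playing the role of the index arithmetic here.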


The instruction coding is today put into the control unit, which is just instantiated by the synthesis tool. A future extension would be to include an instruction coding stage in the tool in order to further reduce the design effort.

Figure 5. Datapath for 32 tap FIR filter.

7. CONCLUSIONS
In this paper we have demonstrated a synthesis tool where a µC model is trans-
lated to a DSP processor. The nice thing with the tool is not the optimization rou-
tines in the tool since they do not contain anything advanced. Instead we have
shown a design flow, using the synthesis tool, where it become easy for a
designer to evaluate different architecture. We have, by an example, shown how
an FIR filter can be synthesized and redesigned to support a wider application
without too much work.There are things that can be improved, such as the user
interface. The tool is this far just at a prototype showing the possibility to work as
described.


The µC model is changed such that

acc+=d[da++]*c[ca++]

is replaced by:

acc+=d[da]*c[ca],
da=circ_add(1,da,firl),
ca=circ_add(1,ca,firl);

and

acc+=d[da]*c[ca++];

is replaced by:

acc+=d[da]*c[ca],
ca=circ_add(1,ca,firl);

A register called firl that holds the wanted filter length is created, and the line

firl=12;

is added to the model.

The VHDL description of the circ_add block is stored in the synthesis library, and the library description file is extended with:

ALU_circ_add:
  operations: circ_add
  weight: power=100, size=100, delay=100, default=100
Figure 6. Changes to support arbitrary circular addressing.

Figure 7. Datapath for a programmable FIR filter.

8. REFERENCES
[1] T. Hollstein, J. Becker, A. Kirschbaum, M. Glesner, “HiPART: A New Hierarchical Semi-Interactive HW-/SW Partitioning Approach with Fast Debugging for Real-Time Embedded Applications”, Proc. of Workshop on Hardware/Software Codesign, CODES/CASHE’98, March 1998.
[2] P. Duncan, et al., “HI-PASS: A Computer-aided Synthesis System for
Maximally Parallel Digital Signal Processing ASICs”, Proc. of IEEE Intern.
Conf. on Acoustics, Speech and Signal Processing, ICASSP’92, March,
1992.
[3] K-G Andersson, “Implementation and Modeling of Modular Digital Signal
Processors”, LiU-Tek-Lic-1997:09, Department of Electrical Engineering,
Linköping University, March 1997.
[4] K-G Andersson, Mikael Karlsson Rudberg, and Anders Wass. “Design of a
JPEG DSP using the Modular Digital Signal Processor Methodology“, Proc.
of ICSPAT ’97, San Diego, CA, USA, Sept. 14-17, 1997.


Paper 7

ADC Offset Identification and Correction in DMT Modems

Mikael Karlsson Rudberg

Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS’00, Geneva, Switzerland, May 2000.


ADC Offset Identification and Correction in DMT Modems

Mikael Karlsson Rudberg


Ericsson Components AB, 164 81 Kista, SWEDEN.
Tel: +46 13 28 1676 Fax: +46 13 13 9282 E-mail: mikaelr@isy.liu.se

ABSTRACT
In this paper the possibility to identify and correct DC offset errors in time-interleaved ADCs is investigated. It is shown how the offset introduced by the ADC can be identified and corrected by utilizing knowledge about the target application. As the target application the ADSL standard has been used. It is shown that an offset error from a time-interleaved ADC can be handled efficiently in a wideband communication system such as ADSL.

1. INTRODUCTION
With the increasing demands on bandwidth in communication systems, the demands on AD converters (ADCs) also increase. One way of increasing the sampling rate is to use time-interleaved parallel ADCs [1]. A time-interleaved ADC consists of N ADCs, where each ADC only samples every N-th value. For instance, with two time-interleaved ADCs the first sample will be taken by ADC1, the second by ADC2, the third by ADC1, and so on. In this case the effective sample rate for each ADC is reduced to fs/N, while the total sample rate remains fs. In Fig. 1 the principle of time-interleaved AD conversion is shown.

Figure 1. Time interleaved parallel ADC.

1.1. Mismatch between ADC channels


Due to process variations, the ADCs in a time-interleaved ADC have small differences in gain and DC offset. This mismatch causes the overall ADC performance to be worse than the performance of each individual ADC channel. In this paper we only consider the effects of the DC offset mismatch; hence the gain mismatch is assumed to be zero.
Assume that the sampled signal is a sequence of values {s(1), s(2), s(3), ...}. If the signal is passed through a time-interleaved ADC with four channels, the result will become a sequence with a DC offset contribution o_i from each ADC channel. Together with quantization noise, the resulting sequence will become

s(1) + o_1, s(2) + o_2, s(3) + o_3, s(4) + o_4, s(5) + o_1, ...   (1)
The offset signal o(n) depends only on which ADC channel is used, not on the input signal. Hence the offset signal is periodic with period N, where N is the number of ADC channels. In the frequency domain the offset will cause tones located at m·fs/N, where m is an integer in the range [0, N-1].
The SNDR of a signal affected by an offset error can be expressed as the ratio of the energy of the input signal, s(n), and the offset signal, o(n), [2]:

SNDR = E[s²(n)] / E[o²(n)]   (2)
Assuming that the offsets can be regarded as normally distributed random variables with zero mean and variance σ², and a sinusoidal input with amplitude A, the SNDR can be expressed as

SNDR_dB = 10 · log10( A² / (2σ²) )   (3)
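Eq. 3 is straightforward to evaluate numerically; for example, a full-scale sinusoid in a 12-bit converter (A = 2^11) against an offset variance of 50 gives roughly 46 dB (arithmetic only; how this maps to effective bits also depends on the quantization noise):

```python
import math

def sndr_db(amplitude, offset_variance):
    """Eq. 3: SNDR of a sinusoid of the given amplitude over the
    channel-offset noise with the given variance."""
    return 10 * math.log10(amplitude ** 2 / (2 * offset_variance))
```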


Using a 16-channel time-interleaved ADC, the offset error has been measured to be in the range of 30 codes, with a variance around 50, which corresponds to almost three bits of performance degradation in a 12-bit ADC.

2. IDENTIFICATION OF OFFSET
2.1. Communication system
A simple digital communication system can be viewed as in Fig. 2 [3]. A piece of information (a symbol) is passed through an encoder that creates a signal that is sent over a channel. At the receiver the signal is converted to the digital domain and the transmitted information is recreated in the decoder. The task of the decoder is mainly to find which information most probably was transmitted. This can, somewhat simplified, be described as comparing the difference between the received signal and all possible symbols. The symbol that minimizes the difference between possible symbols and the received signal is the most probable symbol. In order to further increase the performance, filters and error correction algorithms are used in the decoder.
A commonly used line coding is Quadrature Amplitude Modulation (QAM). A QAM coded signal consists of a sine and a cosine wave, where each one can have a number of different phases and amplitudes. Every code has one combination of amplitude and phase, making it possible to detect the transmitted information at the receiver.
In the complex plane the received data can be shown as in Fig. 3. When the SNR allows, the number of bits transmitted in one QAM constellation is increased, resulting in more points in the constellation diagram shown in Fig. 3.

Figure 2. Simple communication system.

Figure 3. QAM constellation.


In the presence of an additive offset the QAM constellations will be moved in some direction, e.g. as in Fig. 4. A large offset error will cause too large a displacement of the constellation, causing the decoder to make the wrong guess about which symbol was sent. Even with a small offset error, the displacement of the constellation causes an increased error probability, making it necessary to identify and correct the offset.

3. CORRECTION OF OFFSET IN DMT MODEMS


Today there exist several standards for data communication based on discrete multitone modulation (DMT). The most well known of these standards is the ADSL standard [4,5], which may provide the user with downstream data rates up to about 8 Mbit/s over a twisted-pair copper cable.

3.1. DMT based communication system


The difference between a single carrier QAM based modem and one based on discrete multitone modulation (DMT) is that many carriers, or tones, are used simultaneously. This is done in an efficient way using the inverse discrete Fourier transform (IDFT), or its fast variant the inverse fast Fourier transform (IFFT), in the transmitter, Fig. 5. Each carrier is optimized to carry as much information as possible; thus the carriers have varying constellation sizes. To optimize the performance, an echo canceller (EC), a frequency domain equalizer (FEQ) and a time domain equalizer (TEQ) are used.
The symbol decoding is made by calculating the DFT of a batch of samples and then decoding the carriers individually. Each carrier is coded using QAM encoding as described in Section 2.1. Normally the received signal consists of a mix of the signal that was sent from the modem at the other end of the line, an echo signal caused by echo from the data sent by the receiving modem, and noise. The echo signal can be removed using echo cancellation, or separation filters


Figure 4. QAM constellation with additive offset.

(when frequency multiplexing is used). The received data is then reconstructed
by subtracting an estimate of the echo, followed by a filter that compensates for
the channel impulse response, Eq. 4.

S_info(e^jω) = H(S_rec(e^jω) − S_echo(e^jω))    (4)


It does not matter whether the offset signal is removed in the time or the
frequency domain. Hence, when an offset is present, Eq. 4 becomes Eq. 5.

S_info(e^jω) = H(S_rec(e^jω) − S_echo(e^jω)) − H(O(e^jω))    (5)


Obviously, H(O(e^jω)) can be identified instead of the individual contribution to
the offset from each individual ADC, which makes the identification easier. The
signal level of the received data may, due to attenuation in the transmission, be
20 dB below the echo signal. Hence, the wanted signal has a signal level similar
to the offset and will therefore be difficult to detect. It is however possible to
detect and correct the offset error by using knowledge about the application and
utilizing adaptive techniques.

Figure 5. Outline of a DMT modem (transmit path: encoder, IFFT, DAC, analog
front end; receive path: ADC, EC, TEQ, FFT, FEQ, decoder).

Since the data from the ADC is used in pairs, the disturbed tones will be
(2m/N)·N_tones, where m is an integer in the range [0, N/2 − 1], N is the number
of ADC channels, and N_tones is the number of tones used in an ADSL modem.

3.2. Correction of offset before connection
Before two DMT modems are connected, only noise is present at the input to the
modem. If the input is assumed to be a normally distributed noise signal e(n)
with zero mean, the offset is found by simply averaging the received signal,
Eq. 6.

ô(n) = E[s(n)] = E[e(n) + o(n)] = E[e(n)] + E[o(n)] = E[o(n)]    (6)
As described in Eq. 5 this operation can just as well be performed in the
frequency domain, Eq. 7.

Ô(e^jω) = E[E(e^jω) + O(e^jω)] = E[E(e^jω)] + E[O(e^jω)] = E[O(e^jω)]    (7)
One problem with this method is that an input signal with a frequency that is a
multiple of fs/N will be cancelled. However, the only situation where a wanted
signal risks being removed is when the remote modem tries to get the attention
of the receiving modem. This problem is discussed in the following section.

3.3. Correction of offset during initialization


During the start-up phase of a modem connection a number of known
initialization sequences are sent. These sequences are used for measuring the
channel quality and for training the adaptive algorithms inherent in a DMT
modem. Since the transmitted data is known in this phase, it can be used for
training.

3.3.1. Activation

The first stage of initialization is to activate the remote modem by sending an
activation signal consisting of a single tone. The remote modem answers by
replying with another tone. The tones last for 32 ms and might be mistaken for
an offset error if the offset error generates a tone with the same frequency as one
of the activation frequencies. The tones used in this phase in the downstream
direction are tones 44, 48, 52 and 60. In the upstream direction tones 8, 10 and
14 are used. A collision occurs when the transmitted tone has a frequency such
that it is possible to find a positive integer m that fulfills Eq. 8.

2m = N · F_sig / N_tones    (8)


F_sig is the tone index of the activation signal, N_tones is the total number of
tones, and N is the number of ADC channels. The smallest value of N that may
give rise to an offset error in a tone that is used during this phase is N = 32.
Fortunately, 32 ADC channels are more than needed for the sampling rates used
in ADSL (the bandwidth is 1.1 MHz).

3.3.2. Modem training

A DMT modem must be trained in order to adjust the gain, the echo canceller
and the equalization filters. Since the offset error is an additive signal, which
from the training point of view is regarded as noise, the adaptive training
algorithms are still useful, but the adaptation time may become longer.
In several of the training sequences the same symbol is continuously repeated.
This causes a correlation between the received repetitive symbol and the offset
error. Hence, it is not possible to separate the actual signal from the offset error.
One of the training sequences is used for estimating the SNR on each tone. This
sequence has a length of 16384 symbols and is, unlike the other ones, not
repetitive. Since it consists of a known pseudo random sequence it can also be
used for offset estimation. The offset is found by taking the difference between
the received data and the expected data.

3.4. Correction of offset during transmission


When the modems are connected the offset has hopefully been identified and can
therefore be subtracted. However, there are always small changes in the offset,
caused by changes in temperature. In order to maintain good quality there is
therefore a need to continuously measure the offset while the modems are
connected.
Since only some of the channels are affected by an offset error, and the offset
error is an additive signal, a change in the offset error will change the average
error between the received data and the actual symbol. According to Eq. 7 it is
possible to identify the offset if the signal is removed. Hence, if the information
first is removed, for instance by the decoder, the remaining signal consists of
noise and offset, and Eq. 7 can be used.
If the offset identification is not fully adapted before the modem traffic is
started, some of the tones will have degraded performance, using a smaller
constellation size than what actually is possible. However, the ADSL standard
supports on-line change of the constellation size (i.e. the number of bits
transmitted on each tone). Hence, when the offset identification is fully adapted
and the SNR for a disturbed tone has increased, the modem can increase the bit
rate on this tone in order to fully exploit the channel capacity.

4. SIMULATION RESULTS
In order to verify the ideas of how to identify and correct offset errors caused by
an interleaved ADC architecture, the different cases have been simulated using
ADSL as the application. A 12 bit time interleaved ADC consisting of eight
channels with the DC offsets {−2, −11, 3, 5, 3, −8, 1, 14}, giving a variance of
54, has been used. The offset will in this case affect the tones {0, 64, 128, 192}.
A suitable method to update the offset estimate is to use a running average as in
Eq. 9 below. The size of λ controls the adaptation rate and is chosen close to
one.

Ô_{i+1}(e^jω) = λ·Ô_i(e^jω) + (1 − λ)·E(e^jω)    (9)
Fig. 6 shows how the adaptation proceeds during the SNR measurement
sequence. Since the input signal is known, only the noise will disturb the
adaptation. λ = 0.999 has been used and the simulation shows how much of the
offset error that remains at the disturbed tones after 7000 symbols, which
corresponds to about 1.6 seconds of real adaptation time. Around 5% of the error
remains after this period, i.e. a 26 dB decrease of the offset error at the disturbed
tones. A simulation with only noise as input gives the same result, since the
known signal is removed before Eq. 9 is applied.
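The running average of Eq. 9 is easy to sketch in Python. λ = 0.999 and the 7000-symbol horizon mirror the setup above, but the offset value and noise level at the tone are illustrative assumptions:

```python
import numpy as np

def update_offset_estimate(o_hat, e_spectrum, lam=0.999):
    """One running-average update per Eq. 9: O <- lam*O + (1 - lam)*E."""
    return lam * o_hat + (1.0 - lam) * e_spectrum

rng = np.random.default_rng(1)
true_offset = 1.0 + 0.5j          # offset contribution at one disturbed tone (made up)
o_hat = 0.0 + 0.0j
for _ in range(7000):             # roughly 1.6 s worth of ADSL symbols
    noise = rng.normal(0.0, 0.5) + 1j * rng.normal(0.0, 0.5)
    o_hat = update_offset_estimate(o_hat, true_offset + noise)
print(abs(o_hat - true_offset) / abs(true_offset))
```

After 7000 updates the deterministic part of the estimation error has decayed by a factor λ^7000 ≈ 10^-3, so the residual is dominated by the noise floor of the averager.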
Figure 6. DC offset identification during SNR measurement sequence (relative
offset error vs. symbol number).


5. HARDWARE ARCHITECTURE
Fig. 7 below outlines how the offset identification and correction can be put into
an ADSL modem. The offset correction unit realizes Eq. 9 with either the error
coming from the decoder or the noise coming from the FFT. Equation 9 contains
two multiplications and one accumulation. Only the tones that may be disturbed
by the offset need to be taken into account; hence one multiplier is enough since
the disturbed tones are separated by 2N, where N is the number of ADC
channels. The offsets for the disturbed tones can be kept in a register file since
they are quite few. The offset estimate stored in the register file is subtracted
from the data coming from the FFT. The complexity of the compensation unit
can be kept low since only a few tones are affected.
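A behavioural sketch of such a correction unit might look as follows; the class structure and names are my own, not taken from the paper:

```python
import numpy as np

class OffsetCorrector:
    """Behavioural model of the correction unit in Fig. 7.

    A small register file holds one complex offset estimate per disturbed
    tone; a multiply-accumulate step realizes Eq. 9, and the estimate is
    subtracted from the FFT output at those tones only.
    """

    def __init__(self, tones, lam=0.999):
        self.lam = lam
        self.regfile = {t: 0.0 + 0.0j for t in tones}   # register file

    def update(self, tone, error):
        # acc = lam * old + (1 - lam) * error  (Eq. 9)
        self.regfile[tone] = self.lam * self.regfile[tone] + (1 - self.lam) * error

    def correct(self, fft_out):
        out = np.array(fft_out, dtype=complex)
        for t, o_hat in self.regfile.items():           # only disturbed tones touched
            out[t] -= o_hat
        return out

unit = OffsetCorrector([0, 64, 128, 192])               # 8 channels, 256 tones
for _ in range(5000):
    unit.update(64, 2.0 + 0.0j)                         # constant measured error
spectrum = unit.correct(np.ones(256, dtype=complex))
print(round(abs(unit.regfile[64]), 2))
```

The register-file dictionary stands in for the small memory in Fig. 7; undisturbed tones pass through untouched.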

6. ACKNOWLEDGEMENTS
I would like to thank Jan-Erik Eklund at Microelectronics Research Center, Eric-
sson Components AB, for the help with finding typical values of the DC offset
error in a time interleaved ADC.

7. CONCLUSIONS
In this paper it has been shown how an offset error in a wideband data
transmission system such as ADSL can be identified and corrected. By treating
the ADC as a system component that can be optimized together with the rest of
the system, and by utilizing what is known about the target application, we have
shown how the offset error can be handled in all the important phases during
modem initialization and data transmission in the ADSL modem.
Our methods should be applicable to other communication systems as well,
since it is common to have various types of training sequences that can be
utilized for offset identification.

8. REFERENCES
[1] J. Yuan and C. Svensson, ”A 10-bit 5 MS/s Successive Approximation Cell
used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State
Circuits, vol. 29, no. 8, pp. 866-872, Aug. 1994.
[2] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D.
thesis, Diss. No 552, Linköping University, Sweden, Dec. 1998.
[3] S. Haykin, Digital Communications, Wiley, 1988.

[4] ANSI T1.413-1998, “Network and Customer Installation Interfaces:
Asymmetric Digital Subscriber Line (ADSL) Metallic Interface”, American
National Standards Institute.
[5] T. Starr, J. M. Cioffi, and P. J. Silverman, Understanding Digital Subscriber
Line Technology, Prentice-Hall, 1999.
[6] M. Karlsson Rudberg, “A/D omvandlare”, pending Swedish patent no.
9901888-9.

Figure 7. Offset correction architecture in an ADSL modem (a multiply-
accumulate unit with the coefficients λ and 1 − λ realizing Eq. 9, fed by the
error from the decoder or the noise; a register file holding the offset estimates,
which are subtracted from the FFT output for the disturbed tones only before
the decoder).

Paper 8 - Calibration of Mismatch Errors in Time Interleaved ADCs

Paper 8

Calibration of Mismatch Errors in Time Interleaved ADCs

Mikael Karlsson Rudberg

Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Malta, Sept. 2001.

Calibration of Mismatch Errors in Time Interleaved ADCs

Mikael Karlsson Rudberg

Microelectronics Research Center, Ericsson Microelectronics AB, SE-581 17 Linköping, Sweden
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
Tel: +46 13 32 2523 Fax: +46 13 13 9282 E-mail: mikaelr@isy.liu.se

ABSTRACT
An efficient way of increasing the sample rate of an A/D converter (ADC) is
to use a time-interleaved structure. The effective sample rate can be
increased without increasing the sample rate of the individual ADCs. There
are however problems with this architecture, caused by differences in gain
between the ADCs as well as timing mismatch in the sample-and-hold
circuits. These mismatch errors will degrade the performance of the time
interleaved ADC. In this paper we propose algorithms for both on-line
identification of the mismatch errors and cancellation of the distortion. The
proposed algorithms are suitable for applications that use the Discrete
Multi-Tone modulation (DMT) or the Orthogonal Frequency Division
Multiplex (OFDM) technique.

1. INTRODUCTION
Fast and accurate analog-to-digital converters (ADCs) are key components in
present and future communication systems. An increasing demand for
bandwidth and an increased use of digital signal processing both put higher
demands on the ADCs.
One way of increasing the sample rate that has been proposed is to use several
ADCs in a time interleaved fashion [1]. A time interleaved ADC (TIADC)
consists of M ADCs, where each ADC only converts every M-th sample. The
effective sample rate for each ADC is reduced from f_s to f_s/M, while the total
sample rate remains unchanged. The principle of time interleaved A/D
conversion is shown in Fig. 1.
Figure 1. Time interleaved ADC (M ADCs in parallel, each clocked with
period MT and offset by one sample period T relative to its neighbour).

1.1. Error sources in a TIADC


Even if there are advantages of using a TIADC when it comes to conversion
speed, there are also problems if the accuracy at the same time must be kept
high. All differences between the ADCs that form the TIADC will turn up as
distortion in the signal spectrum. The two mismatches of interest in this paper
are gain differences between the ADCs in the TIADC, and unequal delays of the
sample clock to the individual sample circuits. Another important problem, not
within the scope of this paper, is offset mismatch between the ADCs [2].

1.2. Gain Mismatch


A gain mismatch can arise from differences in the reference voltages of the
ADCs, or from gain differences in the sample-and-hold circuits, which in the
simplest case can be modeled as an RC link and a switch, Fig. 2.
The gain of ADC m will in this paper be denoted G_m(ω), where m may vary
between 0 and M − 1.


Figure 2. Sample-and-hold circuit.

1.3. Timing Mismatch


Keeping the delay from the sample clock to each sample-and-hold circuit equal
is difficult due to variations in path length and process parameters. This
mismatch in timing causes the sampling to be non-uniform with a period of M
samples, Fig. 3.
The timing error is in this paper modeled relative to the average sampling period
T, so that the sampling times of ADC m are described by

t_m = mT − r_m·T.    (1)

Figure 3. Non-uniform sampling.

Considering the two effects, gain mismatch and non-uniform sampling, the
distortion with a band limited input signal can be modeled as [3]

X_tiadc(e^jωT) = (1/T) · Σ_{k=−∞}^{∞} A_k(e^jωT) · X(ω − k·2π/MT)    (2)

where A_k(e^jωT) is

A_k(e^jωT) = (1/M) · Σ_{m=0}^{M−1} G_m(ω − k·2π/MT) · e^{−j(ω − k·2π/MT)·r_m·T} · e^{−jkm·2π/M}.    (3)

In the summation in Eq. 2 only M terms are non-zero if the input is band limited
to f_s/2, which will be assumed in this paper.
To keep the distortion low, both the gain and the timing errors must be kept at a
low level. In [3,4] approximations of the effects of gain and timing mismatch on
the SNDR have been derived. Assuming a nominal gain of g with a standard
deviation of σ_g, the SNDR of a TIADC with gain mismatch can be
approximated by

SNDR ≈ 20·log(g/σ_g) − 10·log(1 − 1/M).    (4)
A timing mismatch with a standard deviation σ_r will give the SNDR

SNDR ≈ 20·log(1/(2π·f_in·σ_r)) − 10·log(1 − 1/M).    (5)
Assuming that 12 bits of accuracy is needed, that M = 4, and that the input
frequency is 11 MHz (the VDSL standard, which uses the DMT technique, has
11 MHz bandwidth), σ_r must be kept smaller than 9 ps, which may be difficult
to achieve in a CMOS process. The gain error σ_g must be kept smaller than
0.03% for the same resolution.
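Eqs. 4 and 5 are easy to evaluate for candidate error budgets. A small Python sketch, using the spreads quoted above as example inputs:

```python
import math

def sndr_gain_db(g, sigma_g, M):
    """SNDR limit from gain mismatch, Eq. 4 (dB)."""
    return 20 * math.log10(g / sigma_g) - 10 * math.log10(1 - 1 / M)

def sndr_timing_db(f_in, sigma_t, M):
    """SNDR limit from timing mismatch, Eq. 5 (dB; sigma_t in seconds)."""
    return (20 * math.log10(1 / (2 * math.pi * f_in * sigma_t))
            - 10 * math.log10(1 - 1 / M))

print(round(sndr_gain_db(1.0, 3e-4, 4), 1))      # 0.03 % gain spread, M = 4
print(round(sndr_timing_db(11e6, 9e-12, 4), 1))  # 9 ps spread at 11 MHz input
```

Both functions return the mismatch-limited SNDR in dB; note that the result depends directly on how σ_r is normalized (absolute seconds are assumed here).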

1.4. Methods to cancel gain and timing mismatch


If A_k(e^jωT) can be identified, it has been shown in [3] that the r_m values can
be calculated. In [5] a method based on polynomial interpolation to correct the
spectrum is proposed, and in [6] a method based on a variant of the discrete
Fourier transform is used. In both papers a test signal is used for identifying the
errors, and neither method considers a frequency dependent gain mismatch.


Using a test signal is usually not desired for in-circuit calibration, since a high
accuracy test signal must be generated, and it is also necessary to interrupt the
communication while performing the calibration.
In this paper we focus on applications that use the Discrete Multi-Tone
modulation (DMT) or the Orthogonal Frequency Division Multiplex (OFDM)
technique. The DMT technique is used for digital subscriber lines, e.g. ADSL
[8]. The OFDM technique is similar to DMT, with the main difference that
OFDM is proposed for radio transmission.

2. THE DMT MODEM


In a DMT modem the signal is modulated using the Inverse Fast Fourier Trans-
form (IFFT) to form a signal with N carriers. In the receiver the Fast Fourier
Transform (FFT) is used which separates the information on the different carrier
frequencies. An outline of a DMT modem is shown in Fig. 4.

Figure 4. DMT modem outline (transmit path: encoder, IFFT, DAC, analog
front end; receive path: ADC, EC, TEQ, FFT, FEQ, decoder).

The blocks of importance for this paper are found in the receive path. The EC
block is an echo canceller and the TEQ block is the time domain equalizer.
These blocks, together with the frequency domain equalizer FEQ, will here be
referred to as a filter with the frequency response H_eq(e^jωT).
The output from the decoder is an estimate of which symbol was received. The
equalized input to the decoder is denoted X_eq(e^jωT), and the estimated symbol
is called S(e^jω_n·T), where n is the index of one of the carriers and may vary
between 0 and N − 1.
The filtered signal received by the decoder will be

X_eq(e^jωT) = H_eq(e^jωT) · X_tiadc(e^jωT).    (6)

The information on each carrier is coded using M-ary QAM. That is, the bits are
mapped in a two-dimensional plane where the positions represent the
transmitted bits, Fig. 5.

Figure 5. QAM mapped information (constellation points separated by the
distance D in the Im/Re plane).

In order to take full benefit of in-circuit calibration there must be a possibility to
utilize an increased SNR to increase the amount of transmitted data. The DMT
technique as realized in the ADSL standard supports this feature.

3. IDENTIFICATION OF ERRORS
In the ideal case, when no distortion is present, the information received on the
carriers is independent between carriers. When distortion is present there is
some interference between the frequencies. Each carrier is interfered by at most
M − 1 other carriers, which are described by A_k(e^jωT) for k ≠ 0 in Eq. 2.
An error in the transmission on carrier m may occur when the noise plus the
total distortion becomes larger than D/2. That is

N(e^jω_m·T) + Σ_{k≠0} S(e^jω_l·T) · A_k(e^jω_m·T) · H_eq(e^jω_m·T) > D/2    (7)

where D is the minimum distance between two points in the constellation
diagram in Fig. 5, and N(e^jω_m·T) is the noise contribution.

The way the signal on one carrier leaks into another carrier is similar to what
happens when a transmitted signal is echoed into the received signal in, for
instance, an ADSL system. It is therefore possible to use a method similar to
echo cancelling for removing the distortion. The most well known method for
adapting an echo canceller is the Least Mean Square (LMS) method, which uses
the gradient of the error in the received signal to update the coefficients C_i in an
adaptive filter according to [7]

C_{i,k+1} = C_{i,k} + µ · e_k · x_{k−i}    (8)

where e_k is the error between the wanted signal and the one actually received,
x_{k−i}, and µ is a parameter that controls the adaptation rate.
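A generic LMS identification per Eq. 8 can be sketched as follows; the 2-tap echo path, step size and sample count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
h = np.array([0.8, -0.3])        # unknown 2-tap echo path (made up)
c = np.zeros(2)                  # adaptive coefficients C_i
mu = 0.01                        # adaptation rate
x = rng.normal(size=20_000)      # white training input

for k in range(1, x.size):
    xv = np.array([x[k], x[k - 1]])   # x_k, x_{k-1}
    e = h @ xv - c @ xv               # error e_k against the wanted signal
    c = c + mu * e * xv               # coefficient update, Eq. 8

print(np.round(c, 2))
```

In this noise-free setup the coefficients converge essentially exactly to the echo path [0.8, -0.3].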

3.1. Error identification

The distortion that leaks from carrier ω_l = ω_m − k·2π/MT into carrier ω_m, for
some k, is estimated as

Û_k(e^jω_m·T) = S(e^jω_l·T) · Ĉ_k(e^jω_m·T)    (9)

where Ĉ_k(e^jω_m·T) is the estimate of

C_k(e^jω_m·T) = H_eq(e^jω_m·T) · A_k(e^jω_m·T).    (10)

The distortion estimate is subtracted from the received signal

X_eq2(e^jω_m·T) = X_eq(e^jω_m·T) − Û_k(e^jω_m·T).    (11)

As an estimate of the remaining distortion from carrier l into carrier m we use

Û_{k,rem}(e^jω_m·T) = X_eq2(e^jω_m·T) − S(e^jω_m·T).    (12)

The estimate of the remaining distortion is used for updating the estimated
leakage coefficients Ĉ_k(e^jω_m·T):

Ĉ_k(e^jω_m·T) = Ĉ_k(e^jω_m·T) + µ · Û_{k,rem}(e^jω_m·T) · S*(e^jω_l·T)    (13)

3.2. Signal reconstruction

The proposed algorithm cannot in the presented form be used as it is to correct
the distorted signal, since the decoded symbol is used in the equations that
estimate the distortion terms subtracted from the received signal (Eq. 9-12). The
signal reconstruction should be done before the decoding stage, since it is the
probability that the decoder makes the right decision that we want to improve.
Therefore Eq. 9 is modified to

Û_k(e^jω_m·T) = X_eq(e^jω_m·T) · Ĉ_k(e^jω_m·T).    (14)

X_eq(e^jω_m·T) contains some distortion, but is the best estimate of S(e^jω_m·T)
available without performing the symbol decoding.
The mismatch cancelled signal sent to the decoder will be

X_eq2(e^jω_m·T) = X_eq(e^jω_m·T) − Σ_{k≠0} Û_k(e^jω_m·T)    (15)

which removes the contribution from all distortion terms that leak into the
current carrier. The proposed method requires no more than that the input signal
is band limited to f_s/2. Alternative methods for timing error correction in a
TIADC usually work less well when the signal bandwidth gets close to f_s/2
[5,9]. The method presented in [6] performs perfect reconstruction of the signal
spectrum, but the use of a special DFT makes the algorithm computationally
heavy.
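The identification loop of Eqs. 11-13 can be sketched for a single (m, l) carrier pair as follows. The leakage coefficient, step size and the assumption that the decoder decision for carrier m is correct are all mine:

```python
import numpy as np

rng = np.random.default_rng(3)
qam = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # unit-energy 4-QAM
c_true = 0.10 - 0.05j    # true leakage coefficient C_k for one (m, l) pair (made up)
c_hat = 0.0 + 0.0j       # its estimate
mu = 0.05

for _ in range(5000):
    s_l = qam[rng.integers(4)]             # symbol on the interfering carrier l
    s_m = qam[rng.integers(4)]             # symbol on carrier m
    x_eq = s_m + c_true * s_l              # received signal with leakage
    x_eq2 = x_eq - c_hat * s_l             # cancel with the current estimate (Eq. 11)
    u_rem = x_eq2 - s_m                    # remaining distortion (Eq. 12)
    c_hat += mu * u_rem * np.conj(s_l)     # coefficient update (Eq. 13)

print(abs(c_hat - c_true) < 1e-3)
```

Since |s_l| = 1 here, the estimation error contracts by (1 − µ) per symbol, so the estimate converges geometrically to C_k in this noise-free sketch.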

3.3. Implementation aspects

The complexity of the proposed algorithm depends on the number of signal
carriers used, N_carr, and the number of ADCs in the TIADC, M.
The number of complex multiplications in the coefficient update loop is

N_cmult = 2·N_carr·(M − 1)    (16)

and the total number of additions and subtractions is

N_add/sub = 3·N_carr·(M − 1).    (17)

The corresponding values for the distortion correction are

N_cmult = N_carr·(M − 1)    (18)

and

N_add/sub = N_carr·(M − 1).    (19)
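As a quick sanity check, Eqs. 16-19 can be evaluated for the simulation setup of section 4 (256 carriers, M = 4):

```python
def update_ops(n_carr, M):
    """Complex multiplications and add/subs in the update loop (Eqs. 16-17)."""
    return 2 * n_carr * (M - 1), 3 * n_carr * (M - 1)

def correction_ops(n_carr, M):
    """Complex multiplications and add/subs in the correction (Eqs. 18-19)."""
    return n_carr * (M - 1), n_carr * (M - 1)

print(update_ops(256, 4), correction_ops(256, 4))   # per DMT symbol
```

For 256 carriers and four ADCs this gives 1536 complex multiplications and 2304 additions/subtractions for the update, and 768 of each for the correction, per symbol.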


4. SIMULATIONS
A TIADC with four ADCs has been simulated with a 256 carrier DMT signal as
input. No quantization effects have been considered. The timing mismatch has
been randomly selected with a standard deviation of 8%. The gain mismatch has
a standard deviation of 2%. Fig. 6 shows the adaptation process for the
distortion that leaks into carrier 96. The simulation is made using 10^5 symbols.
Fig. 7 shows what the received QAM-encoded constellation points look like
before and after cancellation of the noise on carrier 96. The improvement in
SNDR is about 13 dB.

5. CONCLUSIONS
In this paper we have proposed a method to both identify and correct mismatch
errors caused by gain and timing differences between the ADCs in a TIADC.
The method can be applied to the OFDM and DMT transmission techniques,
which are used in for instance ADSL and VDSL. The method works all the way
up to the Nyquist frequency, and can handle a frequency dependent gain
mismatch as long as the gain can be considered linear.

Figure 6. Adaptation of Ĉ_k(ω) for carriers 34, 162 and 224.


Figure 7. Received constellation before and after cancellation of gain and
timing mismatch.

6. REFERENCES
[1] J. Yuan and C. Svensson, ”A 10-bit 5 MS/s Successive Approximation Cell
used in a 70 MS/s ADC Array in 1.2 µm CMOS”, IEEE Journal of Solid-State
Circuits, Vol. 29, No. 8, pp. 866-872, Aug. 1994.
[2] M. K. Rudberg, "ADC Offset Identification and Correction in DMT
Modems", Proc. of IEEE Intern. Symp. on Circuits and Systems, ISCAS'00,
Geneva, May, 2000.
[3] Y.-C. Jenq, “Digital Spectra of Nonuniformly Sampled Signals:
Fundamentals and High-Speed Waveform Digitizers”, IEEE Trans. Instrum.
Meas., Vol. 37, pp. 245-251, June 1988.
[4] M. Gustavsson, “CMOS A/D Converters for Telecommunications”, Ph.D.
thesis, Diss. No. 552, Linköping University, Sweden, Dec. 1998.
[5] H. Jin, and E. Lee, “A Digital-Background Calibration Technique for
Minimizing Timing-Error Effects in Time-Interleaved ADC’s, IEEE Trans.
on Circuit and Systems - II: Analog and Digital Signal Processing, Vol. 47,
No. 7, July 2000.


[6] Y.-C. Jenq, “Perfect Reconstruction of Digital Spectrum from Nonuniformly
Sampled Signals”, IEEE Trans. Instrum. Meas., Vol. 37, pp. 245-251, June
1988.
[7] T. Starr, J. M. Cioffi, and P. J. Silverman, Understanding Digital Subscriber
Line Technology, Prentice-Hall, 1999.
[8] ANSI T1.413-1998, “Network and Customer Installation Interfaces:
Asymmetric Digital Subscriber Line (ADSL) Metallic Interface”, American
National Standards Institute, 1998.
[9] J. Elbornsson and J.-E. Eklund, "Blind estimation of timing errors in
interleaved AD converters", Submitted to International Conference on
Acoustics, Speech, and Signal Processing 2001.

Paper 9 - Glitch Minimization and Dynamic Element Matching in D/A Converters

Paper 9

Glitch Minimization and Dynamic Element Matching in D/A Converters

Mikael Karlsson Rudberg, Mark Vesterbacka, Niklas Andersson, and J. Jacob Wikner

Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Lebanon, Dec. 2000.

Glitch Minimization and Dynamic Element Matching in D/A Converters

Mikael Karlsson Rudberg1,2, Mark Vesterbacka2, Niklas Andersson1,2, and J. Jacob Wikner1,2

1) Microelectronics Research Center, Ericsson Microelectronics AB, SE-581 17 Linköping, Sweden
2) Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
{mikaelr, markv, niklasa, jacobw}@isy.liu.se, Phone: +46 708 488 418

ABSTRACT
In this paper we present a novel method for combining thermometer coding
and dynamic element matching (DEM) in a digital-to-analog converter
(DAC). The proposed method combines DEM with minimization of the
glitch power, which in a DEM solution may give a significant contribution
to the total noise power. The switch based solution provides a structure
where it is possible to implement only parts of the method, which reduces
the area required for implementation.

1. INTRODUCTION
The requirements on accuracy in digital-to-analog converters (DACs) are
increasing with the introduction of wide-band access services such as ADSL. In
order to increase the accuracy we want to reduce the influence of both static and
dynamic errors. Considering the static case, a DAC will in general perform the
following operation

A_out(nT) = Σ_{m=1}^{M} b_m(nT) · w_m    (1)

where A_out(nT) is the settled output amplitude at the time instants nT, M is the
number of bits in the input word containing the bits b_m(nT), and w_m are the
internal DAC weights. b_M is referred to as the most significant bit (MSB) and
b_1 as the least significant bit (LSB). For a binary offset input word, we have
M = N and w_m = 2^(m−1). For a thermometer code input, we have
M = 2^N − 1 and w_m = 1.
In a current-steering DAC the internal DAC weights can be implemented using a
number of weighted current sources. The switches that determine which current
sources should be connected to the output are controlled by the input bits b_m. A
binary weighted and a thermometer coded current-steering DAC are illustrated
in Fig. 1 a) and b) respectively.

1.1. Reducing glitches

One problem with a binary weighted DAC is that the glitch power between two
adjacent input samples may be high. This is caused by the fact that the transition
between two samples may contain an intermediate output value that differs from
the final value. If the output is to be changed from the binary word {011} to the
word {100}, the output may become {111} for a short time before settling to the
final value {100}. The intermediate value adds unwanted glitch power to the
output.
The problem with intermediate values is avoided by using thermometer coded
data, since in a thermometer coded signal there are only transitions from zero to
one, or vice versa, and never both types of transitions in the same sample.

Figure 1. Example of a) a binary weighted and b) thermometer coded
current-steering DAC.

1.2. Reducing the influence of matching errors

A current-steering DAC will suffer from matching errors caused by the non-
ideal manufacturing process of the circuits. The matching error can be
represented as a deviation in the weights of the DAC. Considering the matching
error, Eq. 1 becomes

A_out(nT) = Σ_{m=1}^{M} b_m(nT) · (w_m + ε_m)    (2)

where ε_m is the matching error. Since the error caused by the matching error is
dependent on the input signal, this error occurs as a static signal distortion. It is
possible to reduce the influence of the matching error if it for every input word
is possible to combine the weights in different ways to form the same output
value. Each combination will contain an error term of the size

A_out(nT) = Σ_{m=1}^{M} b_m(nT) · ε_m    (3)

But by randomly choosing, from time to time, which combination of b_m to use,
the size of the error term will be uncorrelated with the output value. If for
instance thermometer coded input data is used (w_m = 1) and the combination of
w_m to use for a given code C_i is chosen in a random way, the average
representation of code C_i will approach the mean value C_i · w̄_m, and the error
term will be uncorrelated with the signal.
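The averaging argument can be illustrated with a small Monte Carlo sketch; the element count, mismatch spread and sample counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 7                                    # unit elements, thermometer weights w_m = 1
eps = rng.normal(0.0, 0.02, size=M)      # matching errors eps_m
w = 1.0 + eps

def dac_fixed(code):
    """Always use elements 0..code-1: the error is a fixed function of the code."""
    return w[:code].sum()

def dac_dem(code):
    """Randomly choose which 'code' elements to use (dynamic element matching)."""
    return w[rng.choice(M, size=code, replace=False)].sum()

code = 3
avg = np.mean([dac_dem(code) for _ in range(200_000)])
print(abs(avg - code * w.mean()))        # averaged output approaches C_i * mean weight
```

The fixed selection turns the mismatch into static distortion, while the randomized selection converts it into noise whose average matches the ideal code times the mean weight.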

Figure 2. DAC architecture with randomization (digital encoder consisting of
a thermometer encoder followed by a scrambler, driving M 1-bit DACs).

Figure 3. Example of scrambler with seven thermometer encoded bits as input
(a switch matrix controlled by the pseudo random signal p).

In Fig. 2 a block diagram realizing the described randomization is given. The
random selection of current sources is done in the scrambler, which can be
realized as a net of switches whose settings are determined by the value of a
pseudo random signal, p, Fig. 3. A switch either passes data straight through, or
exchanges the two signal lines, depending on the setting of p. Using a net of
switches arranged in a matrix makes it easy to vary the degree of randomization
by modifying the number of columns that are used. There are many possible
ways to connect the different columns in the switch matrix; the one used in
Fig. 3 uses a radix-2 butterfly interconnect style. This way of randomizing data
to improve integral linearity is called dynamic element matching (DEM) [1].
Different aspects of DEM are also discussed in for instance [2,3]. An alternative
solution to the glitch minimization problem is shown in [4].
Combining randomization and glitch minimization requires that a) bits in the
randomized sample toggle from zero to one or vice versa, but not both in the
same sample, and b) if there are more ones in the current sample than in the
previous one, the new ones shall have random positions. The same also applies
to zeros if there are more zeros in the current sample than in the previous one.
Hence, the
172
Paper 9 - Glitch Minimization and Dynamic Element Matching in D/A Converters

previous state of the randomization must be remembered in order to find out how
to randomize the new sample. An example of how to randomize thermometer
coded data with glitch minimization is shown in Tab. 1.

previous randomized sample | new sample (thermometer encoded) | randomized sample
00000                      | 00001                            | 00001, 00100, ...
00100                      | 00111                            | 11100, 10101, ...
01110                      | 00011                            | 00110, 01100, ...

Table 1. Example of glitch minimized randomization.
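The update rule behind Tab. 1 can be sketched behaviorally (the function name and list representation are ours; the random choice stands in for the scrambler):

```python
import random

def glitch_min_randomize(prev, n_ones):
    """Build the next randomized sample from the previous one as in Tab. 1:
    only set random zeros (or clear random ones), never both (sketch)."""
    out = list(prev)
    ones = [i for i, b in enumerate(out) if b == 1]
    zeros = [i for i, b in enumerate(out) if b == 0]
    diff = n_ones - len(ones)
    if diff > 0:
        for i in random.sample(zeros, diff):   # new ones at random positions
            out[i] = 1
    elif diff < 0:
        for i in random.sample(ones, -diff):   # clear random existing ones
            out[i] = 0
    return out

new = glitch_min_randomize([0, 0, 1, 0, 0], 3)   # second row of Tab. 1
assert sum(new) == 3 and new[2] == 1             # the old one keeps its position
```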

1.3. Scrambler
To realize a scrambler using a net of switches requires a switch that can remember its previous state in order not to randomize positions that should be preserved. In Tab. 2 a truth table for a switch that can be used in a glitch minimizing scrambler is shown. ai+1 and bi+1 are the inputs to the switch, while ai and bi are the inputs from the previous sample. Bits are to be randomized only if a new zero or one occurs at the input. A logic realization of the truth table requires three flip-flops, since both the inputs from the previous sample and the previous setting of the switch must be saved.
Since flip-flops are expensive logic elements, area can be saved if the number of flip-flops can be reduced. Since thermometer coded data is used, the situation where <ai,bi>=<0,1> (or <1,0>) becomes <ai+1,bi+1>=<1,0> (<0,1>) in the next sample never occurs. Therefore it is possible to set don't care at the positions marked (*) in Tab. 2. Another thing to notice is that since the transition directly from <1,0> (<0,1>) to <0,1> (<1,0>) never occurs at the input of a switch, a value of <1,1> or <0,0> will be present at the input for at least one sample between the two cases <1,0> and <0,1>. It is therefore enough to randomly set the switch when the input data is <1,1> or <0,0> to keep the same degree of randomization.

ai+1 | bi+1 | ai | bi | switch setting
  0  |  0   | X  | X  | don't care
  0  |  1   | 0  | 0  | random
  0  |  1   | 0  | 1  | keep previous
  0  |  1   | 1  | 0  | inverse of previous (*)
  0  |  1   | 1  | 1  | random
  1  |  0   | 0  | 0  | random
  1  |  0   | 0  | 1  | inverse of previous (*)
  1  |  0   | 1  | 0  | keep previous
  1  |  0   | 1  | 1  | random
  1  |  1   | X  | X  | don't care

Table 2. Truth table for glitch minimization.

A simplified truth table is shown in Tab. 3. Notice that the setting of the switch is no longer dependent on the input value from the previous sample period (<ai,bi>). Hence, only one flip-flop, saving the state of the switch, is needed. A possible realization of the switch is shown in Fig. 4.
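Behaviorally, the simplified switch of Tab. 3 can be modeled as follows (a sketch; the class name is ours, and `random.randint` stands in for the pseudo-random signal p):

```python
import random

class GlitchMinSwitch:
    """2x2 switch implementing the simplified truth table of Tab. 3:
    re-randomize the pass/exchange state only when the inputs are equal."""
    def __init__(self):
        self.state = 0                    # one flip-flop: 0 = pass, 1 = exchange

    def step(self, a, b):
        if a == b:                        # <0,0> or <1,1>: safe to randomize
            self.state = random.randint(0, 1)
        # <0,1> or <1,0>: keep the previous state so the set bit keeps its path
        return (b, a) if self.state else (a, b)

sw = GlitchMinSwitch()
assert sw.step(1, 1) == (1, 1)                 # equal inputs pass either way
assert sw.step(0, 1) in [(0, 1), (1, 0)]       # routed by the stored state
```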

1.4. Scrambler with unordered thermometer code


Instead of using the thermometer code described earlier, it is possible to use an unordered thermometer code where the bits are simply replicated according to their weight in the binary offset code (e.g., the binary offset code {101} becomes the unordered thermometer code {1111001}). The advantage of this code is that the conversion from binary offset code to unordered thermometer code is trivial. The disadvantage is that the simplifications in the truth table of the switch described in the previous section are no longer possible, since a transition at the input of a switch from <0,1> (<1,0>) to <1,0> (<0,1>) may occur between two samples.
The proposed solution is to convert <1,0> to <0,1> with some extra logic in front of the main switch. In Tab. 4 and Fig. 5, a'i+1 and b'i+1 are the inputs before conversion, while ai+1 and bi+1 are the inputs after the conversion from <1,0> to <0,1>. All other codes are passed unchanged from <a'i+1,b'i+1> to <ai+1,bi+1>. To guarantee a minimal number of glitches when using unordered thermometer encoding, one must make sure that each bit at the input of the scrambler has a path that is crossed by the paths of all other bits. It is interesting to note that for each extra bit, i, in the binary offset coded input data, a group of 2^i bits is added to the unordered thermometer code. All these added bits always have the same value, and no switches are needed when only bits within this group are scrambled (i.e., the shaded switches in Fig. 6 are unnecessary). Using a radix-2 butterfly architecture for the scrambler requires at least k switch layers to guarantee that all paths in group 2^k cross at least one path in each of the groups {2^j, j < k}. If this condition is fulfilled the output will be glitch minimized. Switch layers placed after layer k may be needed to increase the randomization, but since the output from layer k is already glitch minimized, the simpler switch shown in Fig. 4 can be used for these layers.
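The trivial conversion from binary offset code to unordered thermometer code can be sketched as follows (function name ours):

```python
def unordered_thermometer(d, n_bits):
    """Binary offset code -> unordered thermometer code: each bit i is
    simply replicated 2^i times according to its weight."""
    out = []
    for i in reversed(range(n_bits)):     # group for the MSB comes first
        bit = (d >> i) & 1
        out.extend([bit] * (1 << i))      # group of 2^i identical bits
    return out

# the example from the text: {101} becomes {1111001}
assert unordered_thermometer(0b101, 3) == [1, 1, 1, 1, 0, 0, 1]
```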

ai+1 | bi+1 | ai | bi | switch setting
  0  |  0   | X  | X  | random
  0  |  1   | X  | X  | keep previous
  1  |  0   | X  | X  | keep previous
  1  |  1   | X  | X  | random

Table 3. Simplified truth table for glitch minimization.

The added logic for converting <1,0> to <0,1> can be seen as a two-bit unordered-to-ordered thermometer encoder. If the switches are kept fixed (i.e., p is fixed), the proposed architecture works as a normal thermometer encoder. Hence, the architecture is a thermometer encoder with built-in glitch minimized scrambling.
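Assuming the AND/OR gate interpretation of Fig. 5, the presorting logic of Tab. 4 reduces to two gates (a sketch; the function name is ours):

```python
def presort(a, b):
    """Presorting logic of Tab. 4: an AND and an OR gate map <1,0> to
    <0,1> and pass all other two-bit codes unchanged."""
    return a & b, a | b

assert presort(1, 0) == (0, 1)    # the only code that is converted
assert presort(0, 1) == (0, 1)
assert presort(0, 0) == (0, 0)
assert presort(1, 1) == (1, 1)
```

Note that the AND/OR pair sorts the two bits, which is exactly a two-bit unordered-to-ordered thermometer encoding.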

Figure 4. Switch logic. (Schematic: a D flip-flop controlled by p holds the switch state, and XOR gates pass or exchange inputs a and b to output x.)

2. SIMULATIONS
In Tab. 5 the relative glitch power has been estimated for four different DAC architectures. As input signal, a multicarrier ADSL signal with 256 carriers has been used. As can be expected, the proposed glitch minimized thermometer coding technique performs just as well as plain thermometer coding. Randomization of thermometer code is about as bad as binary offset coding from the glitch power aspect. In Fig. 7 a) and b) the effect of mismatch on distortion is compared between thermometer coding and thermometer coding with glitch minimization. The simulation shows an improvement of the SFDR of 13 dB, when only considering the matching error, compared with normal thermometer coding. In the simulations a 6-bit DAC with a random matching error of σ = 0.02 has been used. Note that all harmonics disappear using the proposed method.
It is important to be aware that a fast varying input signal becomes more randomized than a slowly varying signal, because only the difference between two samples is randomized.

a'i+1 | b'i+1 | ai+1 | bi+1 | switch setting
  0   |  0    |  0   |  0   | random
  0   |  1    |  0   |  1   | keep previous
  1   |  0    |  0   |  1   | keep previous
  1   |  1    |  1   |  1   | random

Table 4. Truth table for presorting logic.

Figure 5. Switch logic with presorter. (Schematic: AND and OR gates convert <1,0> to <0,1> in front of the switch of Fig. 4; inputs a, b, outputs x, y.)

Figure 6. Simplified switch matrix with presorter switch. (Butterfly switch matrix for the unordered thermometer code with bit groups 2^0, 2^1, 2^2; switches within a group are unnecessary.)

3. CONCLUSIONS
In this paper we have presented a novel method where dynamic element matching is combined with glitch minimization. We have presented an architecture similar to the commonly used scrambler with a number of switch layers, with the difference that we use a modified switch that remembers the old path through the scrambler in order to minimize glitches. Simulations have shown that the proposed method both reduces the number of glitches and de-correlates the mismatch in the current sources from the signal.

4. REFERENCES
[1] L.R. Carley and J. Kenney, "A 16-bit 4th-order noise-shaping D/A converter," in Proc. 1988 Custom Integrated Circuits Conf., USA, May 1988.
[2] H.T. Jensen and I. Galton, "An analysis of the partial randomization dynamic element matching technique," IEEE Trans. Circuits and Systems II, vol. 45, no. 12, pp. 1538-1549, Dec. 1998.
[3] N.U. Andersson and J.J. Wikner, "Comparison of different dynamic element matching techniques for wideband CMOS DACs," in Proc. NorChip Conf., Oslo, Norway, Nov. 8-9, 1999.
[4] M. Vesterbacka, M.K. Rudberg, J.J. Wikner, and N. Andersson, "Dynamic element matching in D/A converters with restricted scrambling," in Proc. ICECS'00, Beirut, Lebanon, Dec. 2000.

Type of coding                                         | Normalized SNDR (dB)
offset binary code                                     | 0
thermometer code                                       | 11
thermometer code + randomization                       | 0
thermometer code + randomization + glitch minimization | 11

Table 5. Relative glitch power for different DAC structures.


Figure 7. Simulation of thermometer coded DAC (a) without and (b) with glitch minimization. (Both panels show the PSD [dB/Hz] versus normalized frequency, 0 to 0.5.)


Paper 10

Dynamic Element Matching in D/A Converters with Restricted Scrambling

Mark Vesterbacka, Mikael Karlsson Rudberg, J. Jacob Wikner, and Niklas Andersson

Proceedings of IEEE International Conference on Electronics, Circuits and Systems, Lebanon, Dec. 2000.


Dynamic Element Matching in D/A Converters with Restricted Scrambling

Mark Vesterbacka¹, Mikael Rudberg¹,², J. Jacob Wikner¹,² and Niklas U. Andersson¹,²

¹Department of Electrical Engineering, Linköping University, 581 83 Linköping, Sweden
²Microelectronics Research Center, Ericsson Microelectronics AB, Box 1885, 581 17 Linköping, Sweden
E-mail: {markv, mikaelr, jacobw, niklasa}@isy.liu.se

ABSTRACT
Inaccurate matching of the analog sources in a D/A converter causes a sig-
nal-dependent error in the output. This distortion can be transformed into
noise by assigning the digital control to the analog sources randomly, which
is a technique referred to as dynamic element matching. In this paper, we
present a dynamic element matching technique where the scrambling is
restricted such that the glitches in the converter are minimized. By this, both
the distortion due to glitches is reduced, and the signal-dependent error due
to matching is suppressed. A hardware structure is proposed that imple-
ments the approach, and the operation of the hardware is described. Simula-
tion results indicate that the method has a potential of yielding as good
reduction of glitches as the optimal thermometer-coded converter and a sig-
nal-dependent error level that is almost as low as achieved with prior
dynamic element matching techniques.

1. INTRODUCTION
A major problem in design of high-resolution communication D/A converters is
the inaccuracy in the fabrication process. This imperfection introduces mismatch
among the sources to the analog output, resulting in non-linear behavior of the
converter [1, 2]. To overcome this problem, a technique referred to as dynamic
element matching (DEM) has been suggested where digital signal processing is
used to control the switching of the analog sources so that the distortion is trans-
formed into noise [1, 3, 4, 5]. Hence, signal-dependent errors are suppressed, and
if we combine this technique with oversampling, we can reduce the error caused
by the noise by low-pass filtering the output [3].
However, converters in many modern communication applications need to oper-
ate at high speed. At high speed, glitches caused by delay variations in different
paths will have a significant impact on the achievable resolution of a converter.
To reduce the glitches, thermometer code can be used, which yields a minimal
amount of glitches compared with other codes, but requires complex hardware.
In practice, a segmented converter structure is used for high resolution converters
where the least significant source weights are binary scaled and the most signifi-
cant weights are thermometer-coded. Hence, the thermometer-coding used in the
presented DEM encoders applies to segmented converters as well.
The use of a thermometer encoder suits the DEM techniques well. However, a
problem is that the current DEM techniques use a type of scrambling that ruins
the good glitch property that can be achieved with thermometer code. In this
paper we present an approach to scramble thermometer code so that the glitch
energy associated with a code transition is minimized, while we maintain the
property of having a low sensitivity to matching errors in a converter. In the following, we will also suggest a hardware structure that implements the presented approach, and explain the function of the hardware with a simple example, where a 4-bit converter is used for the sake of simplicity.

2. A DEM APPROACH
The operation of an N-bit thermometer-coded flash converter is characterized by
A = ∑_{k=1}^{n} w_k ⋅ ref        (1)


where A is the analog output, ref is a reference quantity of, e.g., current, voltage
or charge that should be added to the output, n = 2N–1 denotes the number of
sources of reference units to add, and w1…wn is a bit vector encoded from a dig-
ital input D used to control which sources to add [1]. The name thermometer
code implies that a continuous range of bits w1…wi should be one, while the
remaining bits are zero. However, by relaxing the last constraint and allowing
any wk to be one as long as the output is correct, we achieve a redundant code
with many possible representations for most numbers. This redundant property
makes it suitable for use in DEM techniques where we randomize what code to
use. By restricting the randomization to only include codes that produce small
glitches it is possible to improve the glitch performance compared with using a
conventional DEM technique where a code is selected randomly from the full set
of codes. In this work we present an approach that aims at solving this problem.
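To see how much redundancy the relaxed code offers, note that any word with v ones is a valid code for the value v, so v has C(n, v) representations (a quick numerical check, not from the paper):

```python
from math import comb

# With n = 2**N - 1 unit sources, any word with v ones encodes the value v,
# so value v has C(n, v) redundant representations.
n = 15                       # N = 4
assert comb(n, 7) == 6435    # mid-scale values have the most codes to pick from
assert comb(n, 15) == 1      # the full-scale code is unique
```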
The key idea in our approach is to construct a subset of codes containing only the
codes that cause a minimal number of bits to be altered in a code transition. By
this we will minimize the glitches, since they to a significant extent depend on
this parameter. The codes in a subset are identified from an investigation of the
two cases presented in the following.

2.1. Code selection case A: Bit increase


In case A, the output from the converter is increasing. This case implies that the
number of bits for which w k = 1 must increase in a thermometer code. For this
case, a code transition with a minimum number of altered bits is achieved if we
select the new code so that we only set bits.

2.2. Code selection case B: Bit decrease


In case B, the output from the converter is decreasing. This complementary case
implies that the number of bits for which w k = 0 must increase. For this case, a
code transition with a minimum number of altered bits is achieved if we select
the new code so that we only clear bits.

2.3. DEM approach


Since all bits in a thermometer-coded converter control unit sources, the identified subset of codes in either of the two cases introduces the same minimal amount of glitch energy. Using any other code outside the subset, while still yielding the correct output, would require more bits to be switched, with the number of cleared bits equal to the number of set bits.

One approach to implement this idea is illustrated in Fig. 1 where an N-bit D/A converter is shown. Compared to the conventional approach, we have added a register to the output of the DEM encoder that contains 2^N – 1 D flip-flops. The use of this register is two-fold. First, the control signals wk become independent of delay variations in the encoder, improving the glitch situation. Second, the register stores the current state, which can be used in the encoder to construct the proper subsets. The cost of this solution is an increased complexity of the DEM encoder and, of course, the hardware for the additional (2^N – 1)-bit wide register.
In a second paper, also presented at ICECS'00, we present another implementation approach that instead uses a tree structure [6].

Figure 1. An N-bit D/A converter with a DEM encoder and a register for storing the thermometer-coded state.

3. REALIZATION OF A DEM ENCODER


In the following description of the DEM encoder proposed in the previous sec-
tion, we initially consider code selection case A, since the modifications needed
to handle the complementary case B are minor. Consider the N-bit converter in
Fig. 1. Compared to a conventional converter, the state of the thermometer-code
is stored in a register. This state will be updated by the DEM encoder according
to the following approach.
In Fig. 2 the suggested realization of the DEM encoder is shown. The input W =
w1…wn is the input from the register in Fig. 1, and D is an offset binary input that
should be used to encode a new state W'. In the figure, white arrows have been
used to indicate binary data, and gray arrows have been used to indicate ther-
mometer-coded data. There is also one control bit c indicated with a line that sig-
nals whether the current code selection case is A or B. The boxes with round

186
Paper 10 - Dynamic Element Matching in D/A Converters with Restricted Scrambling

corners are the additional operations needed to handle the somewhat more com-
plex code selection case B. Now we will describe the operations needed to handle
case A.
Figure 2. Implementation of the DEM encoder. (Block diagram: the inputs D and W feed a 2^N–1:N counter (B1), a subtractor (B2) with control bit c, conditional invert and negate blocks (B3, B4), a thermometer encoder (T1), an M-bit scrambler (T2), a zero distributor (T3, T4), and a final invert producing W'.)

3.1. Description of the operations


At the top of Fig. 2, there is a block labeled '2^N–1:N counter'. The purpose of this block is to count the number of ones in the current state W, i.e., a thermometer-to-binary encoding. The count, denoted B1, is subtracted from the data input D in the block 'Subtractor', yielding a difference B2 corresponding to the number of zeros that should be changed to ones in the next state. This additional number of ones is literally created in the block 'Thermometer encoder', which converts the binary count B2 to the thermometer code T1 with a continuous range of ones.
The block 'M-bit scrambler' produces T2 by scrambling the position of M bits, including all ones created in the preceding block. The number M should be equal to the number of zeros in the current state W, which is calculated in the block 'Invert' producing B3. This block inverts the bits of the count B1, yielding the wanted number of zeros, since B3 = 2^N – 1 – B1 = ~B1, assuming two's complement arithmetic.
Finally, the block labeled 'Zero distributor' distributes the scrambled bits to the bits that are zero in the current state W. The bits that are one are unaffected during the distribution. The result of this block is output as the next state W'.

3.2. Operations in case B
Obviously, the presented scheme is not designed to handle case B, where we need to clear ones instead of setting zeros. However, this can easily be achieved by modifying the described structure slightly. We detect case B, e.g., as an overflow c in the 'Subtractor'. When this case is detected, we can reuse the hardware to clear ones instead of setting zeros by inverting both the input W and the output W'. This is accomplished by the blocks 'Invert' producing T2 and W' in Fig. 2, which invert their signal depending on the control input.
Some other modifications are also needed in order to handle case B. The block 'Negate' is needed to correct the output B2 when we have an overflow from 'Subtractor', i.e., we calculate the number of ones to clear. The effective operation will be B4 = |B2|. Another modification is needed in the block 'M-bit scrambler', where the input B3 in case A is the number of zeros, indicating the number of bits to be scrambled. In case B we instead need to scramble a number of bits corresponding to the number of ones B1 in the current state. Since B3 is calculated as the inverted B1, we simply make the inversion operation conditional on case A, as indicated in Fig. 2.
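Functionally, the encoder amounts to the following behavioral sketch covering both cases (the scrambler and zero distributor are collapsed into a random choice of bits to flip; names follow the block labels, but the code is ours):

```python
import random

def dem_encode(W, D):
    """One update of the restricted-scrambling DEM encoder (behavioral):
    count the ones in state W, then set (case A) or clear (case B)
    randomly chosen bits so that the next state encodes D."""
    B1 = sum(W)                       # '2^N-1:N counter'
    B2 = D - B1                       # 'Subtractor'; B2 < 0 signals case B
    if B2 >= 0:                       # case A: set B2 of the zeros
        candidates = [i for i, w in enumerate(W) if w == 0]
        flips = set(random.sample(candidates, B2))
        return [1 if i in flips else w for i, w in enumerate(W)]
    else:                             # case B: clear |B2| of the ones
        candidates = [i for i, w in enumerate(W) if w == 1]
        flips = set(random.sample(candidates, -B2))
        return [0 if i in flips else w for i, w in enumerate(W)]

W = [int(c) for c in "101011101011111"]       # state from the 4-bit example
W_next = dem_encode(W, 13)
assert sum(W_next) == 13                      # new state encodes D = 13
assert all(w <= wn for w, wn in zip(W, W_next))   # case A: bits only set
```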

4. A 4-BIT CONVERTER EXAMPLE


To illustrate the operation of the presented DEM approach further, we will give a numerical example of the operation of a 4-bit converter. Let us assume an arbitrary initial state of
W = 101011101011111
which corresponds to the decimal value 11. The first operation we perform is to count the number of ones in W with the block '2^N–1:N counter', yielding
B1 = 11₁₀.
We will also assume an arbitrarily chosen digital input to the converter of
D = 13₁₀
which primarily is used to see how many zeros we need to set in the current state W to achieve the next state W'. This is achieved in the block 'Subtractor' by the operation
B2 = D – B1 = 13₁₀ – 11₁₀ = 2₁₀.


In this case we have no overflow, yielding c = 0. Hence the conditional operation 'Negate' produces the count B4 = B2. To set the two ones in the current state W we first create two ones literally in the block 'Thermometer encoder', which converts the number of additional ones B2 into thermometer code, i.e.
T1 = 110000000000000.
Now the strategy is to select as many bits (including all ones in T1) as there are zero bits in the current state W. To obtain this we need to calculate the number of zeros in W. This is a straightforward operation since we already have counted the number of ones as B1, and know the total number of bits in the state to be 2^N – 1 = 15. The block 'Invert' performs exactly this operation, since inversion of all bits in two's complement arithmetic corresponds to the operation
B3 = 15 – B1 = 15 – 11 = 4.
This count is used in the block 'M-bit scrambler' to scramble the position of the corresponding number of bits. We illustrate this operation S by assuming that the randomization process happens to yield
T2 = S(0011-----------) = 0110-----------
where we have indicated the bits not included in the scrambling (15 – 4 = 11 bits) with '-':s. Finally the block 'Zero distributor' distributes the scrambled bits T2 to the zeros in the current state W. Any bit marked with a '-' above is left unchanged. The distribution of the bits is illustrated below:

T2: 0110-----------   (scrambled bits)
T3: 101011101011111   (current state W; the scrambled bits are distributed to its zeros)
T4: 101111111011111   (next state after the changes)

The four scrambled bits map to the four zero bits in the current state, and the two ones among them indicate which of the zeros that actually are set. The next state becomes
W' = 101111111011111
which is output to the register.
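The arithmetic of the worked example can be checked directly (a self-contained sketch; variable names follow the block outputs):

```python
# Checking the counts of the worked 4-bit example.
W = "101011101011111"
B1 = W.count("1")                       # '2^N-1:N counter'
assert B1 == 11
D = 13
B2 = D - B1                             # 'Subtractor': zeros to set
assert B2 == 2
B3 = 15 - B1                            # 'Invert': number of zeros in W
assert B3 == 4
W_next = "101111111011111"
assert W_next.count("1") == D           # the next state encodes D
# restricted transition: no bit of W was cleared, two zeros were set
assert all(not (w == "1" and n == "0") for w, n in zip(W, W_next))
```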

5. SIMULATION RESULTS
The function of the proposed hardware was verified by a C program that simu-
lates the hardware for an N-bit converter, where N is defined at compilation time.
To estimate the performance of the presented approach, we modeled four 6-bit
converters in Matlab, assuming that the glitch power is proportional to the num-
ber of switching sources. The modeled D/A converters were three conventional
converters, a binary-scaled, a thermometer-coded, and a thermometer-coded con-
verter with conventional DEM, plus a thermometer-coded converter with the pre-
sented DEM approach. As a measure of glitch performance we use the ratio
between simulated glitch power and signal power. In Table 1, power ratios
obtained from simulation with a multi-tone input are listed. The input contained
256 tones with equidistant frequency spacing, distributed over the entire Nyquist
frequency range. The power ratios have been normalized with respect to the
binary-scaled converter. In the table, we see that there is an improvement from
using a thermometer-coded converter over a binary-scaled. However, this gain in
performance is lost when we introduce conventional DEM. The presented DEM
approach is able to regain the glitch performance to the level of the thermometer-
coded converter.
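The glitch model used here, glitch power proportional to the number of switching sources, can be illustrated for a worst-case mid-scale step (a sketch; function names are ours):

```python
def toggles(a, b):
    """Glitch model used in the simulations: glitch power taken to be
    proportional to the number of switching sources."""
    return sum(x != y for x, y in zip(a, b))

def binary_word(v, n):                   # binary-scaled: n weighted sources
    return [(v >> i) & 1 for i in range(n)]

def thermo_word(v, n):                   # thermometer-coded: 2^n - 1 unit sources
    return [1] * v + [0] * ((1 << n) - 1 - v)

# worst-case mid-scale step 7 -> 8 of a 4-bit converter
assert toggles(binary_word(7, 4), binary_word(8, 4)) == 4   # all sources switch
assert toggles(thermo_word(7, 4), thermo_word(8, 4)) == 1   # one source switches
```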
To investigate the performance in terms of matching errors, we apply a Gaussian independently distributed relative matching error with a standard deviation of 2% to each weighted source in all converter structures. In Table 2, the estimated SFDR from the simulations is given. In the table, we see that both the converter with conventional DEM and the converter with DEM that uses restricted scrambling are able to improve the SFDR by 13 dB over the other structures.
These results indicate that our DEM technique is able to reclaim the gain in
SNDR that is lost with conventional DEM techniques, while the performance in
terms of matching is maintained.

6-bit converter   | Normalized power ratio [dB]
Binary-scaled     | 0
Thermometer-coded | -11
Conventional DEM  | 0
Restricted DEM    | -11

Table 1. Relative glitch performance for different 6-bit converter structures.


6-bit converter   | SFDR [dB]
Binary-scaled     | 54
Thermometer-coded | 55
Conventional DEM  | 68
Restricted DEM    | 68

Table 2. Matching in different 6-bit converter structures.

6. CONCLUSION
A DEM approach was presented that aims at reducing the additional glitch
energy introduced by other DEM techniques. This is achieved by restricting the
scrambling in DEM to only include codes that do not increase the glitch energy.
Further, a hardware structure was proposed that implements this approach. The
hardware is realized from two cases depending on the state of the analog output.
In the first case the number of bits that are one increases, and in the second case
the number of ones decreases. We start by describing how the first case can be
implemented, and then we reuse the hardware in the second case by introducing
some additional hardware that is activated when the second case is detected. This
can be achieved since there is a simple relation between the two cases that
enables a simple transformation of the input and output state.
The functionality of the hardware was verified with a C program that simulated
the hardware for an N-bit converter, where N is a generic parameter. For the pur-
pose of estimating the performance, four 6-bit converters were also modeled in
Matlab, using a simple model for the glitches. The simulation results indicated
that the proposed implementation has the potential of suppressing the glitches as
well as the optimal thermometer-coded converter, while yielding a distortion
level that is almost as low as conventional DEM implementations.

7. REFERENCES
[1] R.J. van de Plassche, Integrated Analog-to-Digital and Digital-to-Analog Converters, Kluwer Academic Publishers, Boston, 1994.
[2] M. Gustavsson, J.J. Wikner, and N. Tan, CMOS Data Converters for Communications, Kluwer Academic Publishers, 2000.
[3] P. Carbone and I. Galton, "Conversion error in D/A converters employing dynamic element matching," in Proc. 1994 IEEE Int. Symp. on Circuits and Systems, vol. 2, 1994, pp. 13-16.
[4] H.T. Jensen and I. Galton, "A low-complexity dynamic element matching DAC for direct digital synthesis," IEEE Trans. Circuits and Systems II, vol. 45, no. 1, Jan. 1998, pp. 13-27.
[5] L.R. Carley and J. Kenney, "A 16-bit 4th-order noise-shaping D/A converter," in Proc. Custom Integrated Circuits Conf., 1988, pp. 21.7/1-21.7/4.
[6] M. Rudberg, M. Vesterbacka, N.U. Andersson, and J.J. Wikner, "Glitch minimization and dynamic element matching in D/A converters," in Proc. 7th IEEE Int. Conf. on Electronics, Circuits, and Systems, Beirut, Lebanon, Dec. 17-20, 2000.

Dissertations
Division of Electronics Systems
Department of Electrical Engineering
Linköpings universitet
Sweden

Vesterbacka, M.: On Implementation of Maximally Fast Wave Digital Filters,


Linköping Studies in Science and Technology, Diss. No. 487, Linköpings Uni-
versitet, Sweden, June 1997.
Johansson, H.: Synthesis and Realization of High-Speed Recursive Digital Fil-
ters, Linköping Studies in Science and Technology, Diss. No. 534, Linköpings
Universitet, Sweden, May 1998.
Gustavsson, M.: CMOS A/D Converters for Telecommunications, Linköping
Studies in Science and Technology, Diss. No. 552, Linköpings Universitet, Swe-
den, Dec 1998.
Palmkvist, K.: Studies on the Design and Implementation of Digital Filters, Linköping Studies in Science and Technology, Diss. No. 583, Linköpings Universitet, Sweden, June 1999.
Wikner, J. J.: Studies on CMOS Digital-to-Analog Converters, Linköping Studies in Science and Technology, Diss. No. 667, Linköpings Universitet, Sweden, April 2001.
