Future Wireless Communication System

NEW DSPs FOR NEXT GENERATION MOBILE COMMUNICATIONS *

Ulrich Walther†, Falk Tischer and Gerhard P. Fettweis

Mannesmann Mobile Communication Systems, Dresden University of Technology
Mommsenstr. 13, D-01062 Dresden, Germany

* This work was supported in part by Deutsche Forschungsgemeinschaft contract SFB 358-A6 and Infineon Technologies AG.
† e-mail: walther@ifn.et.tu-dresden.de

Global Telecommunications Conference - Globecom'99, page 2615

Abstract - The implementation of new wireless communication standards often requires the design of new hardware capable of processing special algorithms. One approach to tackling the problem is to use dedicated hardware, optimized towards the corresponding algorithm, together with a DSP. However, this may cause overhead in data transfers (DSP <-> ASIC) and requires additional control hardware and memories. The full design process, including simulation and debugging of the whole system, can be very time consuming. In this paper we avoid such problems by using a new concept for application tailored DSPs. The architecture supports scaling, and its inherent flexibility allows for the adaptation to new algorithms. Examples such as equalization and the future wireless W-CDMA standard UMTS have been used to prove the applicability of the structure. In addition, the time-to-market factor can be significantly reduced.

1 Introduction

The nature of the mobile communication environment requires the connection of very different signal processing algorithms, operating at different clock speeds and with different demands in terms of precision and real-time capability. The classic approach of a system on a chip (SOC) comprises an embedded DSP core and dedicated parallel hardware on one die together with memories and controllers. This results in additional data transfers between the DSP core and the surrounding hardware. Supporting the introduction of a new wireless standard with increased signal processing complexity requires a new design of the dedicated hardware, e.g. the equalizer structure or the channel-(de)coding scheme, with new parallel hardware, separate memories and controllers.

In this paper we solve the problem by using a DSP architecture [1] which supports parallelism and is both scalable and extendable with dedicated datapaths. An efficient Instruction Set Architecture (ISA) allows for such customizations without changing large parts of the chip. Two different examples have been investigated as case studies for the new concept:

1) A classical single-carrier transmission system using a reduced-state-decision-feedback MAP equalizer combined with a convolutional decoder (which can also be used for Viterbi or MAP decoding) and
2) The W-CDMA RAKE receiver for the UMTS standard including the turbo decoding process.

First the DSP concept is briefly explained. Sections 3 and 4 give a short description of the algorithm problem and consider the impact upon the datapath design.

2 Concept for Application Tailored Signal Processors (CATS)

The concept was first presented in [1] and [2]. It is a research project at the Mannesmann Mobilfunk Chair. The above mentioned problems lead to the main goals:

• One processor family as the base for designing general purpose processors as well as application tailored (domain specific) processor types
• One processor development platform
• One generic software development platform, i.e. assembler, compiler, debugger, which is independent of the tailoring
• A real-time development/debugging environment for the customer
• A uniform instruction set architecture which satisfies the statements above -> "TVLIW" [3]

In Figure 1 a simplified block schematic is shown. The dark gray marked blocks do not depend on the application, while the datapaths change in number and functionality according to the specific application. The data memory (bandwidth) can be subject to customization as explained later on. The challenge of the concept is the reuse of as many functional units as possible while allowing for extendability, scalability and tailoring. This requires knowledge about the interactions of hardware/architecture and software/algorithm respectively.

Figure 1: CATS approach

One fundamental paradigm follows: The functionality of a DSP can be orthogonalized into data transfer and data manipulation. Data transfer includes tasks such as data moves/loads and all processing done in the program control unit, since transfers to/from the program counter are performed. In contrast, the actual algorithm execution is referred to as data manipulation. While the latter changes with the application, all units belonging to data transfer should be fixed. In order to achieve full performance, the orthogonalization must be represented in the Instruction Set Architecture (ISA) as well. A DSP's ISA is a key issue since it fulfills the important task of bridging between software (flexibility) and hardware (complexity).

A new approach referred to as Tagged Very Long Instruction Word architecture (TVLIW) was developed [3]. It supports parallelism while keeping the code size at a reasonable amount. In this technique in-line and in-loop code is treated differently by dynamic assembly of the very long instruction word (VLIW). The overhead generated by composing the VLIW is most often hidden in the data flow pipeline. This is especially the case for dual-bus architectures, i.e. when the memory bandwidth is limited to 2 accesses per cycle. This is explained in more detail below.

Turning back to the algorithm level, we distinguish between matrix-vector operations such as shuffling algorithms (FFT, DCT, etc.) and vector-scalar operations (e.g. FIR, correlation), since they require different memory bandwidth when parallelism is applied. The latter can be considered as a class of sliding window algorithms, where a coefficient window slides over the data vector.

Coefficient sharing over multiple multiply-accumulate datapath units (DPUs) leads to a constant memory bandwidth even for highly parallel computation. An example is the block FIR where $y_k = \sum_i x_{k-i} a_i$ is calculated in parallel to $y_{k+1} = \sum_i x_{k+1-i} a_i$ and so forth. The coefficients $a_i$ are the same in each datapath, while datapath unit $n$ (DPUn) uses the x-value from datapath unit $n-1$ delayed by one cycle. The technique is also referred to as Zurich Zip. So it will take $n$ cycles to feed $n$ DPUs with data and the same time to update the corresponding sections of the VLIW.

In this paper we assume the dual-access paradigm, i.e. only two busses broadcast the data to all units (see Fig. 2). The counterpart, the wide data memory approach, was chosen in a different project for a Hiperlan Modem DSP [4], where FFT-like operations are heavily utilized.

Figure 2: Dual-bus architecture

Two examples demonstrate the applicability of the architecture for new algorithms.

3 Combined Equalization and Decoding

The multipath propagation property of the mobile communication channel causes severe intersymbol interference (ISI), which comes along with fast time variations due to fading. Thus, data transmission in the presence of ISI asks for receivers which perform both equalization and channel decoding. Different solutions have been proposed for soft-output equalizer techniques (e.g. SOVA [6]), which are known to be superior to hard-decision techniques. An interesting approach was presented by Koch and Baier [7], which is used throughout this section. We will not reproduce the whole derivation; however, the fundamental assumptions and equations will be given.

Equalizer Structure & Implementation

Starting with a simplified channel model (shift register process with memory $L$), an algorithm was derived which is a symbol-by-symbol based maximum probability estimator. It is similar to the one referred to as BCJR [8] or MAP (maximum-a-posteriori probability). Instead of probabilities, their negative logarithm will be used. This simplifies the computation and enables fixed-point implementation. Furthermore, an approximation is introduced which avoids the calculation of logarithm and exponential, leading to a group of equations also known from the MAX-Log MAP [9]. The forward and backward state metrics $\Lambda_f(S_k)$ and $\Lambda_b(S_k)$ of state $S$ at time $k$ in the trellis are recursively calculated.
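As a rough illustration of such a recursion, the following Python sketch runs a generic min-sum (Max-Log-MAP style) forward and backward pass over negative-log metrics for a binary shift-register channel. It is not the authors' implementation: the function names, the channel taps `h`, the squared-distance branch metric, the all-equal initialization at both ends and the sign convention of the soft output are assumptions made for this example only.

```python
from itertools import product


def branch_metric(z, new_sym, state, h):
    """Squared distance |z - sum_i h[i] * x_{k-i}|^2 (assumed metric).

    state holds the previous symbols (x_{k-1}, ..., x_{k-L}).
    """
    symbols = (new_sym,) + state
    estimate = sum(hi * xi for hi, xi in zip(h, symbols))
    return abs(z - estimate) ** 2


def max_log_map(z, h):
    """Min-sum forward/backward recursion over negative-log metrics.

    Returns one soft value per symbol. In this sketch a negative value
    means that +1 is the more likely symbol.
    """
    L = len(h) - 1
    states = list(product((+1, -1), repeat=L))
    n = len(z)
    inf = float("inf")

    # Forward recursion, all start states equally likely (metric 0).
    fwd = [{s: 0.0 for s in states}]
    for k in range(n):
        nxt = {s: inf for s in states}
        for s in states:
            for x in (+1, -1):
                t = ((x,) + s)[:L]  # shift-register state update
                m = fwd[k][s] + branch_metric(z[k], x, s, h)
                nxt[t] = min(nxt[t], m)
        fwd.append(nxt)

    # Backward recursion, all end states equally likely (metric 0).
    bwd = [None] * (n + 1)
    bwd[n] = {s: 0.0 for s in states}
    for k in range(n - 1, -1, -1):
        bwd[k] = {
            s: min(
                branch_metric(z[k], x, s, h) + bwd[k + 1][((x,) + s)[:L]]
                for x in (+1, -1)
            )
            for s in states
        }

    # Soft output: best path metric with x_k = +1 minus best with x_k = -1.
    soft = []
    for k in range(n):
        best = {+1: inf, -1: inf}
        for s in states:
            for x in (+1, -1):
                t = ((x,) + s)[:L]
                m = fwd[k][s] + branch_metric(z[k], x, s, h) + bwd[k + 1][t]
                best[x] = min(best[x], m)
        soft.append(best[+1] - best[-1])
    return soft
```

The sketch stores all backward metrics, which is exactly the memory cost that the text later avoids by folding a short backward recursion into the forward pass.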

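Returning briefly to the coefficient sharing of Section 2: the block FIR can be illustrated with a small cycle-level model. The sketch below is only an assumed reading of the scheme. In every cycle one coefficient is broadcast to all DPUs and a single new data sample is loaded, while each DPU takes over the x-value its neighbour used one cycle earlier. The function name `block_fir` and the list-based registers are hypothetical.

```python
def block_fir(x, a, k, n_dpu):
    """Compute y[k] .. y[k + n_dpu - 1] with y[m] = sum_i a[i] * x[m - i].

    Models n_dpu datapath units that share one broadcast coefficient per
    cycle and pass their data value on to the next unit.
    """
    if k < len(a) - 1 or k + n_dpu > len(x):
        raise ValueError("not enough samples around index k")

    # Fill phase: n_dpu loads, DPU j starts with x[k + j].
    regs = [x[k + j] for j in range(n_dpu)]
    acc = [0] * n_dpu

    for i, coeff in enumerate(a):
        # One coefficient read, used by every DPU in the same cycle.
        for j in range(n_dpu):
            acc[j] += coeff * regs[j]
        # One data read: DPU 0 loads the next older sample, the others
        # take the value of their left neighbour (one cycle delay).
        if i + 1 < len(a):
            regs = [x[k - i - 1]] + regs[:-1]
    return acc
```

In this model the inner loop needs two memory reads per cycle regardless of `n_dpu`, which is the constant-bandwidth property the text describes.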

[Equations (1) and (2): not recoverable]

... defined by $S_k$ and $x_{k-L} = +1$ or $-1$ ¹. For the backward case (2), $x_k = +1$ or $-1$ defines the selection. The branch metrics for the transition from state $S_{k-1}$ to $S_k$ are given by the distance calculation

$$\lambda(S_{k-1}, S_k) = \left| z_k - \sum_{i=0}^{L} x_{k-i} h_i \right|^2 \qquad (3)$$

The soft-output value for bit $x_{k-L}$ is now defined by

[Equation (4): not recoverable]

Fig. 3 shows the contributions to the min-operation in both terms in equation (4) for $L = 2$.

Figure 3: Trellis: path contributions for L = 2

¹ All calculations assume binary symbols $x_k \in \{+1, -1\}$ for simplicity of presentation only.

Due to the soft-decision output, the Koch & Baier equalizer outperforms its Viterbi counterpart by about 2 dB. However, this benefit is dearly paid for by an increase in computational complexity and a tripling of the memory requirements due to the necessary storage of the backward metrics. It was found that the influence of the backward recursion is rather small. Assuming an influence depth of the $\Lambda_b(k)$ of only $L$ stages and that all $\Lambda_b(k+L)$ are equally probable (typical initialization for the backward recursion), the backward recursion can be implemented within the forward recursion [10]. Koch and Baier already took advantage of this fact, which can be seen from equation (4), where the decision is made for $x_{k-L}$ using trellis states up to $S_k$; hence input values up to $z_k$ are used. Now the number of operations is of the same order as for the Viterbi Algorithm (VA).

In the previous section the trellis had $2^L$ states, thus the computation grows exponentially with the increase in channel influence length $L$. Reducing the number of states can significantly reduce the computational complexity. The performance loss is limited by a state-dependent feedback of hard decisions. The resulting derivative will be called Reduced-State-Decision-Feedback (Bahl) Equalizer. The overall performance of such an approach is adequate for most of the channel types, as simulations have shown [7].

The basic operations of this equalizer can be mapped to the functional units known from the classical VA. Hence we can distinguish between transition metric unit (TMU), add-compare-select unit (ACSU) and survivor path unit (SPU). The most complex operation takes place in the ACSU, for which acceleration techniques are known. However, the computation of the transition metrics is now much more complex. It involves an FIR-like computation (see equation (3)) and depends upon the decisions made in the ACSU and stored in the SPU. Since we need a survivor path for each state for the reduced-state decision feedback, the back trace algorithm used in some DSPs is not applicable in our case. Instead we use a register exchange approach. The implementation is shown in Fig. 4. The SPU hardware actually consists of 2N 16-bit registers which can be divided into low and high byte, each half representing one bank of the register file. In normal mode the registers are available to the program control unit of the DSP for scratch-pad purposes. The additional hardware reduces to the controller and one-bit shift circuitry.

Figure 4: Survivor Path Unit

The datapath of the DSP itself is customized in such a way that the ALU not only supports a 'SPLIT' mode, allowing for parallel computations of two 16-bit values instead of one double precision 32-bit value. It also comprises an additional comparison unit in series, i.e. the Add-Compare-Select operation is performed in one machine cycle. Another feature accelerates the transition metric update. Since the data for the FIR filter computation now corresponds to the survivors in the SPU and $x_k \in \{+1, -1\}$, the multiplication reduces to a conditional accumulation (+ or -) of the complex channel coefficients. Thus the ALU has special inputs which conditionally control the internal adder (see Fig. 5).

Figure 5: Datapath for equalization & coding

The new hardware results in more than 60% savings for the equalization part compared to standard DSPs. In processors with Viterbi accelerators (e.g. TI 'Lead') the transition metric update together with the survivor path exchange is still a rather complex calculation. In particular, the decision feedback of the reduced-state Bahl equalizer nearly neutralizes any performance gain of the C54x or CARMEL, which support only the standard VA with traceback functionality. The proposed architecture outperforms such approaches by a factor of 2.4 (C54x) and 1.6 (CARMEL), respectively. Figure 7 shows the benchmark results with a standard single MAC DSP as 100% reference. The figure also indicates the performance increase gained by an extension to 4 datapath units (DPU).

4 The IMT2000 Modem

The new Universal Mobile Telecommunications System (UMTS) is currently subject to standardization. In this paper we use the specifications from 3GPP and IMT2000, which are characterized by:

• Wideband CDMA with chiprate of 3.84 Mcps
• RAKE receiver with 8 fingers
• Symbol rates between 3840 and 7.5 ksps
• Dual mode channel coding: convolutional (K=9) and turbo coding, rate 1/2 or 1/3

A simple block diagram of a possible receiver is shown in Fig. 6. The remaining parts of this section focus on the RAKE receiver only (gray shaded area), which consists of 8 RAKE fingers, each running at a different delay and/or with a different code. Each finger performs despreading with a special PN code and early/late computation for synchronization purposes. The despread symbols are weighted with a precalculated channel coefficient (FIR filter operation). An algorithm investigation revealed that the despreading operation itself represents more than 60 % of the overall computation, while the weighting filter requires another 17 %.

Figure 6: UMTS receiver block diagram

Since despreading is the most complex part, it could be implemented in dedicated hardware running at 4 MHz. The actual FIR part has a much lower frequency and can be accomplished in a tiny standard DSP. However, a 4 MHz clock for submicron silicon is far below the feasible rate and decreases the silicon efficiency. Additional I/O transfers between the DSP and the despreader hardware lead to a further loss in efficiency. This makes the problem an attractive test of the capability of our DSP concept. The following requirements for a new DSP were extracted from the analysis of the despreading task:

• enough data memory to store a whole time slot
• support for 8 bit arithmetic/storage
• 4 input registers for the UMTS datapath with integration of PN-code generators into the DPU

The basic operation in the despreader itself is the correlation of a complex data vector $s$ with the corresponding complex codeword $c$ of length $S$. Since the code has a range of $c_i \in \{\pm 1 \pm j\}$, the multiply/accumulate of the correlator reduces to a code-dependent addition/subtraction of the input data to the accumulation value. This fact has been taken into account in the design of the UMTS datapath (see Fig. 8).

5 Results

The new UMTS-specific datapath has been described in VHDL, simulated and synthesized with standard-cell libraries. It contains a modified 40-bit ALU, a PN-code generator (basically an 18-bit shift register with set/hold functions) and a few additional multiplexers for data storage (de)compression, i.e. a complex value consisting of two 16-bit parts will be compressed into a single 16-bit memory word before memory storage and will be decompressed within the DPU after memory read.

Interestingly, the ALU modification matches the features added for the equalizer from Section 3. Furthermore, the special support of the BCJR algorithm enables turbo decoding, which is part of the FEC in the standard. The required soft-input/soft-output modules can be easily implemented within the DSP; however, the extensive memory needs of the interleaver block have to be taken into account.
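The code-dependent addition/subtraction that the UMTS datapath exploits can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the datapath itself: it assumes that correlation multiplies each sample by the complex conjugate of the chip, and the name `despread` and the tuple representation of complex integers are hypothetical.

```python
def despread(samples, code):
    """Correlate complex samples with chips of the form (+-1) + j(+-1).

    samples: list of (re, im) integer pairs
    code:    list of (cr, ci) pairs with cr, ci in {+1, -1}

    Only additions and subtractions are used, based on
    s * conj(c) = (sr*cr + si*ci) + j(si*cr - sr*ci).
    """
    acc_re = 0
    acc_im = 0
    for (sr, si), (cr, ci) in zip(samples, code):
        acc_re += sr if cr > 0 else -sr
        acc_re += si if ci > 0 else -si
        acc_im += si if cr > 0 else -si
        acc_im -= sr if ci > 0 else -sr
    return acc_re, acc_im
```

Each chip costs four conditional add/subtract operations and no multiplier, which is why an ALU with code-controlled adder inputs is sufficient for this task.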

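The storage (de)compression mentioned in the results can be pictured as packing both components of a complex value into one memory word. The paper does not specify the packing format, so the following Python sketch assumes the simplest variant: each component already fits into signed 8 bits, the real part goes into the high byte and the imaginary part into the low byte. The names `pack_complex` and `unpack_complex` are hypothetical.

```python
def pack_complex(re, im):
    """Pack two signed 8-bit values into one unsigned 16-bit word.

    Assumed layout: real part in the high byte, imaginary part in the
    low byte, both in two's complement.
    """
    if not (-128 <= re <= 127 and -128 <= im <= 127):
        raise ValueError("components must fit into signed 8 bits")
    return ((re & 0xFF) << 8) | (im & 0xFF)


def unpack_complex(word):
    """Inverse of pack_complex: sign-extend both bytes again."""
    def sign_extend(byte):
        return byte - 256 if byte & 0x80 else byte

    return sign_extend((word >> 8) & 0xFF), sign_extend(word & 0xFF)
```

With such a layout one dual-bus memory access delivers a full complex sample, which fits the 8 bit arithmetic/storage requirement listed for the despreader.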

Figure 7: Benchmarks for reduced state equalization [bar chart, reference = 100%; bars: TI 'C54x (1 DPU), Carmel (2 DPU), our DSPs (1 DPU, 4 DPU)]

Figure 8: Structure of UMTS datapath

The comparison to a standard ALU shows an area increase of only 7% for the logic part, while providing the extended functionality. As a result, the RAKE receiver part could be implemented within a 4-datapath architecture (each DPU consists of the proposed ALU and a MAC). This corresponds to a capability of 1200 pure DPU-MOPS at 100 MHz and leads to the benchmarks given in Fig. 9. The total DSP core size is about 4 mm² in 0.25 µm 4-layer-metal technology.

Figure 9: Benchmarks of UMTS rake receiver [bar chart of cycle counts; bars: TI 'C54x (1 MAC/ALU), Carmel (2 MAC+ALU), our DSP (4 UMTS-DP)]

6 Conclusions

A new architecture for high performance DSPs has been investigated in terms of its capacity to handle algorithms of future mobile communication standards. The modular system provides support for very complex tasks and can also be downscaled or extended according to the target application. Thus it is flexible enough to fulfil the tight requirements of today's communications sector by providing a uniform development platform on both the hardware and the software side.

References

[1] G. Fettweis, "DSP Cores for Mobile Communications: Where are we going?," in Proc. ICASSP'97, 1997, pp. 279-282.
[2] M. Weiss, U. Walther, and G. Fettweis, "A Structural Approach for designing Performance Enhanced DSPs: 1-MIPS GSM-FR case study," in Proc. ICASSP'97, 1997, pp. 4085-4088.
[3] M. Weiss and G. Fettweis, "Dynamic Codewidth Reduction for VLIW Instruction Set Architectures in Digital Signal Processors," in Proc. IWISP'96, 1996, pp. 517-520.
[4] G. Fettweis et al., "Breaking new grounds over 3000M MAC/s," in Proc. ICSPAT'98, 1998, vol. 11, pp. 543-547.
[5] G. D. Forney, "The Viterbi Algorithm," Proceedings of the IEEE, vol. 61, no. 03, pp. 268-278, March 1973.
[6] J. Hagenauer and P. Hoeher, "A Viterbi Algorithm with Soft-Decision Outputs and its Applications," in GLOBECOM'89, 1989, pp. 47.1.1-47.1.7.
[7] W. Koch and A. Baier, "Optimum and Sub-optimum Detection of Coded Data disturbed by Time-varying Intersymbol Interference," in Proc. GLOBECOM'90, 1990, vol. 11, pp. 1679-1684.
[8] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Transactions on Information Theory, no. 03, pp. 284-287, March 1974.
[9] P. Robertson, E. Villebrun, and P. Hoeher, "A Comparison of Optimal and Sub-optimal MAP Decoding Algorithms Operating in the Log-Domain," Proceedings of the IEEE, vol. 61, no. 02, pp. 1009-1013, February 1995.
[10] M. Schmidt and G. Fettweis, "On Memory Redundancy in the BCJR Algorithm for Shift Register Processes," submitted to IEEE Transactions on Information Theory, 1998.
