Walther U GLOBECOM 99
Abstract - The implementation of new wireless communication standards often requires the design of new hardware capable of processing special algorithms. One approach to tackling the problem is the use of dedicated hardware, optimized for the corresponding algorithm, together with a DSP. However, this may cause overhead in data transfers (DSP <-> ASIC) and requires additional control hardware and memories. The full design process, including simulation and debugging of the whole system, can be very time consuming. In this paper we avoid such problems by using a new concept for application tailored DSPs. The architecture supports scaling, and its inherent flexibility allows for adaptation to new algorithms. Examples such as equalization and the future wireless W-CDMA standard UMTS have been used to prove the applicability of the structure. In addition, the time-to-market factor can be significantly reduced.

1 Introduction

The nature of the mobile communication environment requires the connection of very different signal processing algorithms, operating at different clock speeds and with different demands in terms of precision and real-time capability. The classic approach of a system on a chip (SOC) comprises an embedded DSP core and dedicated parallel hardware on one die together with memories and controllers. This results in additional data transfers between the DSP core and the surrounding hardware. Supporting the introduction of a new wireless standard with increased signal processing complexity requires a new design of dedicated hardware, e.g. the equalizer structure or the channel (de)coding scheme, with new parallel hardware, separate memories and controllers.

In this paper we solve the problem by using a DSP architecture [1] which supports parallelism and is both scalable and extendable with dedicated datapaths. An efficient Instruction Set Architecture (ISA) allows for such customizations without changing large parts of the chip.

Two different examples have been investigated as case studies for the new concept:

1) A classical single-carrier transmission system using a reduced-state decision-feedback MAP equalizer combined with a convolutional decoder (which can also be used for Viterbi or MAP decoding) and

2) The W-CDMA RAKE receiver for the UMTS standard including the turbo decoding process.

First the DSP concept is briefly explained. Sections 3 and 4 give a short description of the algorithm problem and consider the impact upon the datapath design.

2 Concept for Application Tailored Signal Processors (CATS)

The concept was first presented in [1] and [2]. It is a research project at the Mannesmann Mobilfunk Chair. The above-mentioned problems lead to the main goals:

• One processor family as the base for designing general purpose processors as well as application tailored (domain specific) processor types

• One processor development platform

• One generic software development platform, i.e. assembler, compiler, debugger, which is independent of the tailoring

• A real-time development/debugging environment for the customer

• A uniform instruction set architecture which satisfies the statements above -> "TVLIW" [3]

In Figure 1 a simplified block schematic is shown. The
Figure 3: Trellis: path contributions for L = 2 (trellis steps k-1 -> k)

defined by $S_k$ and $x_{k-L} = +1$ or $-1$ ¹. For the backward case (2) the $x_k = +1$ or $-1$ defines the selection. The branch metrics for the transition from state $S_{k-1}$ to $S_k$ are given by the distance calculation

$$\lambda(S_{k-1}, S_k) = \Bigl| z_k - \sum_{i=0}^{L} x_{k-i}\, h_i \Bigr|^2 \qquad (3)$$

The soft-output value for bit $x_{k-L}$ is now defined by equation (4).

Fig. 3 shows the contributions to the min-operation in both terms in equation (4) for L = 2.

Due to the soft-decision output the Koch & Baier equalizer outperforms its Viterbi counterpart by about 2 dB. However, this benefit is paid for dearly by an increase in computational complexity and a tripling of the memory requirements due to the necessary storage of the backward metrics. It was found that the influence of the backward recursion is rather small. Assuming an influence depth of the $\Lambda_b(k)$ of only L stages and that all $\Lambda_b(k+L)$ are equally probable (the typical initialization for the backward recursion), the backward recursion can be implemented within the forward recursion [10]. Koch and Baier already took advantage of this fact, which can be seen from equation (4), where the decision is made for $x_{k-L}$ using trellis states up to $S_k$, hence input values up to $z_k$ are used. Now the number of operations is of the same order as for the Viterbi Algorithm (VA).

In the previous section the trellis had $2^L$ states, thus the computation grows exponentially with the increase in channel influence length L. Reducing the number of states can significantly reduce the computational complexity. The performance loss is limited by a state-dependent feedback of hard decisions. The resulting derivative will be called Reduced-State-Decision-Feedback (Bahl) Equalizer. The overall performance of such an approach is adequate for most of the channel types, as simulations have shown [7].

The basic operations of this equalizer can be mapped to the functional units known from the classical VA. Hence we can distinguish between the transition metric unit (TMU), the add-compare-select unit (ACSU) and the survivor path unit (SPU). The most complex operation takes place in the ACSU, for which acceleration techniques are known. However, the computation of the transition metrics is now much more complex. It involves an FIR-like computation (see equation (3)) and depends upon the decisions made in the ACSU and stored in the SPU. Since we need a survivor path for each state for the reduced-state decision feedback, the back trace algorithm used in some DSPs is not applicable in our case. Instead we use a register exchange approach. The implementation is shown in Fig. 4. The SPU hardware actually consists of 2N 16-bit registers which can be divided into low and high byte, each half representing one bank of the register file. In normal mode the registers are available to the program control unit of the DSP for scratch-pad purposes. The additional hardware reduces to the controller and one-bit shift circuitry.

The datapath of the DSP itself is customized in such a way that the ALU not only supports a 'SPLIT' mode, allowing for parallel computations on two 16-bit values instead of one double precision 32-bit value. It also comprises an additional comparison unit in series, i.e. the Add-Compare-Select operation is performed in one machine cycle. Another feature accelerates the transition metric update. Since the data for the FIR filter computation now corresponds to the survivors in the SPU and $x_k \in \{+1, -1\}$, the multiplication reduces to a conditional accumulation (+ or -) of the complex channel coefficients. Thus the ALU has special inputs which conditionally control the internal adder (see Fig. 5).

The new hardware results in more than 60% savings for

¹ All calculations assume binary symbols $x_k \in \{+1, -1\}$ for simplicity of presentation only.
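The interplay of conditional accumulation, add-compare-select and register exchange can be illustrated with a minimal Python sketch. It is not the CATS implementation: the function names, the two-state trellis (state = most recent symbol), the tap ordering (newest symbol first) and the list-based survivor registers are all assumptions made for illustration. It performs hard-decision sequence estimation only and omits the soft-output computation.

```python
def conditional_accumulate(symbols, taps):
    """FIR sum for symbols in {+1, -1}: each tap is added or subtracted, no multiply."""
    acc = 0j
    for x, h in zip(symbols, taps):
        acc = acc + h if x > 0 else acc - h
    return acc


def branch_metric(z, symbols, taps):
    """Squared distance between received sample z and the hypothesised channel output."""
    return abs(z - conditional_accumulate(symbols, taps)) ** 2


def acs_register_exchange(metrics, survivors, z, taps):
    """One trellis step of a reduced 2-state trellis (hypothetical helper).

    metrics[s]   : accumulated path metric of state s (0 -> symbol -1, 1 -> symbol +1)
    survivors[s] : past symbol decisions of state s, newest first
    Symbols older than the state itself are taken from the survivor of the
    predecessor state (state-dependent decision feedback).
    """
    sym = (-1, +1)
    new_metrics, new_survivors = [], []
    for s_new in (0, 1):
        candidates = []
        for s_old in (0, 1):
            # hypothesised symbol vector, newest first, completed from the survivor
            hyp = [sym[s_new]] + survivors[s_old][: len(taps) - 1]
            candidates.append((metrics[s_old] + branch_metric(z, hyp, taps), s_old))
        best_metric, best_old = min(candidates)  # add-compare-select
        new_metrics.append(best_metric)
        # register exchange: copy the winning survivor and shift in the new symbol
        new_survivors.append([sym[s_new]] + survivors[best_old])
    return new_metrics, new_survivors
```

Because every state carries its complete survivor, the feedback symbols are available immediately after each step, which is the reason a register exchange is preferred over a back trace in this setting.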
more than 60 % of overall computation, while the weighting filter requires another 17 %.

The actual FIR part has a much lower frequency and can be accomplished in a tiny standard DSP. However, a 4 MHz clock for submicron silicon is far below the feasible rate and decreases the silicon efficiency. Additional I/O transfers between the DSP and the despreader hardware lead to a further loss in efficiency. The problem therefore seems attractive for proving the capability of our DSP concept. The following requirements for a new DSP were extracted from the analysis of the despreading task.

Figure 7: Benchmarks for reduced state equalization [bar chart; axis: cycle count in percent, reference = 100 % (e.g. Oak)]

Figure 9: Benchmarks of UMTS rake receiver [bar chart; axis: MIPS]
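To make the despreading task and the subsequent weighting more concrete, the following Python sketch shows one possible formulation. It is a simplified illustration and not the datapath of the paper: the function names, the real-valued ±1 code that repeats once per symbol, and the conjugate-weight combining are assumptions. It shows that correlation with a ±1 code needs only conditional additions and subtractions at chip rate, while the weighting runs at the much lower symbol rate.

```python
def despread(chips, code, spreading_factor):
    """Correlate complex chips with a +/-1 code; one symbol per spreading_factor chips."""
    symbols = []
    for start in range(0, len(chips) - spreading_factor + 1, spreading_factor):
        acc = 0j
        for n in range(spreading_factor):
            c = code[n % len(code)]
            chip = chips[start + n]
            # multiplication by +/-1 reduces to a conditional add or subtract
            acc = acc + chip if c > 0 else acc - chip
        symbols.append(acc / spreading_factor)
    return symbols


def rake_combine(finger_symbols, weights):
    """Weight each finger's symbols with the conjugate of its (assumed) channel estimate and sum."""
    n = min(len(s) for s in finger_symbols)
    return [
        sum(w.conjugate() * s[k] for s, w in zip(finger_symbols, weights))
        for k in range(n)
    ]
```

In this formulation the inner loop of `despread` runs once per chip and finger, whereas `rake_combine` needs one complex multiplication per finger and symbol.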
Figure 8: Structure of UMTS datapath [block diagram; labels: A-Bus, B-Bus]

The comparison to a standard ALU shows an area increase of only 7% for the logic part, while providing the extended functionality. As a result, the RAKE receiver part could be implemented within a 4-datapath architecture (each DPU consists of the proposed ALU and a MAC). This corresponds to a capability of 1200 pure DPU-MOPS at 100 MHz and leads to the benchmarks given in Fig. 9. The total DSP core size is about 4 mm² in 0.25 µm 4-layer-metal technology.

6 Conclusions

A new architecture for high performance DSPs has been investigated in terms of its capacity to handle algorithms of future mobile communication standards. The modular system provides support for very complex tasks and can also be downscaled or extended according to the target application. Thus it is flexible enough to fulfil the tight requirements of today's communications sector by providing a uniform development platform on both the hardware and the software side.

[2] M. Weiss, U. Walther, and G. Fettweis, "A Structural Approach for designing Performance Enhanced DSPs: 1-MIPS GSM-FR case study," in Proc. ICASSP'97, 1997, pp. 4085-4088.

[3] M. Weiss and G. Fettweis, "Dynamic Codewidth Reduction for VLIW Instruction Set Architectures in Digital Signal Processors," in Proc. IWISP'96, 1996, pp. 517-520.

[4] G. Fettweis et al., "Breaking new grounds over 3000M MAC/s," in Proc. ICSPAT'98, 1998, vol. 11, pp. 543-547.

[5] G. D. Forney, "The Viterbi Algorithm," Proceedings of the IEEE, vol. 61, no. 03, pp. 268-278, March 1973.

[6] J. Hagenauer and P. Hoeher, "A Viterbi Algorithm with Soft-Decision Outputs and its Applications," in GLOBECOM'89, 1989, pp. 47.1.1-47.1.7.

[7] W. Koch and A. Baier, "Optimum and Sub-optimum Detection of Coded Data disturbed by Time-varying Intersymbol Interference," in Proc. GLOBECOM'90, 1990, vol. 11, pp. 1679-1684.

[8] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Transactions on Information Theory, no. 03, pp. 284-287, March 1974.

[9] P. Robertson, E. Villebrun, and P. Hoeher, "A Comparison of Optimal and Sub-optimal MAP Decoding Algorithms Operating in the Log-Domain," Proceedings of the IEEE, vol. 61, no. 02, pp. 1009-1013, February 1995.