A High Performance Floating Point Coprocessor
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-19, NO. 5, OCTOBER 1984
INTRODUCTION
MICROARCHITECTURE
Fig. 2. Block diagram of fraction data path. (Blocks: ROM (constants), B register (multiplicand/divisor), partial product/remainder register, shifter (L2 to R5), Q shifter (L1, L2, R3, 0), data bus.)
[Figure: stutter detection circuit over propagates P0 to P59. Labels: group propagates also used for carry lookahead, ALLOW STUTTER, "STUTTER" to clock circuit, minimum stutter = 10.]
with long carry propagation delays. The FPA, by comparison, achieves a fast (100 ns) microcycle time, including a 60 bit ALU operation, using a new technique well suited to VLSI applications as well as to designs using standard parts.

A simple carry length detection scheme is used to produce a stutter signal that stretches the final phase of the EU clock if a long carry propagation path exists. The method takes advantage of the fact that, for most ALU operations, the maximum carry length is much less than the width of the ALU. By detecting a long carry and providing additional time for the ALU to complete for a small percentage of operations, a data path can be run at a fast rate for most (>95 percent) ALU cycles.

Fig. 3 shows the stutter circuit for the 60 bit wide ALU used in the J-11 floating point accelerator chip. In Fig. 3 the propagates produced in bit positions 54 to 5 of the ALU's PG (propagate generate) logic are gated with a minimum of logic to produce a detection signal, stutter, for all operations which have a carry length of 19 or greater. Two levels of AND gating are used because the first level of gating is already present for the minimal 5 bit group carry lookahead logic. The ANDing of two group propagate signals indicates whether or not all of the propagates in that group of 10 bits are asserted.

Single precision processing uses only the upper half of the fraction data path. In order to avoid unnecessary stutter cycles caused by data in the lower half of the data path, an additional enable signal is included in the detection gates covering bits 34 to 5. The stutter signal may set for as few as 10 consecutive propagates, but might not set for as many as 18.

A propagate is a necessary but not sufficient condition to imply an actual carry. For this reason, an allow stutter signal is used to gate the stutter signal to the clock circuitry. The allow stutter signal is not set for ALU operations in which all bit positions will produce a generate. Unnecessary stutter cycles are therefore prevented for ALU operations in which carry propagation is not a factor (e.g., A + A generates a carry at all bit positions).
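The detection logic described above can be sketched in a few lines. This is a minimal model, not the chip's circuit: it assumes the five 10 bit detection gates sit on bits 5 to 14, 15 to 24, 25 to 34, 35 to 44, and 45 to 54 (a layout chosen because it reproduces the stated minimum of 10 and the guaranteed detection at 19), and the names `GATE_STARTS`, `run_mask`, `group_propagate`, and `stutter` are hypothetical.

```python
GROUP_BITS = 5                       # width of one carry lookahead group
GATE_STARTS = (5, 15, 25, 35, 45)    # assumed low bit of each 10 bit detection gate
SINGLE_PRECISION_LIMIT = 34          # gates lying within bits 34..5 carry the enable


def run_mask(start: int, length: int) -> int:
    """Propagate vector with `length` consecutive ones starting at bit `start`."""
    return ((1 << length) - 1) << start


def group_propagate(propagates: int, low_bit: int) -> bool:
    """First AND level: all 5 propagates of one lookahead group are asserted."""
    mask = run_mask(low_bit, GROUP_BITS)
    return (propagates & mask) == mask


def stutter(propagates: int, allow_stutter: bool = True,
            double_precision: bool = True) -> bool:
    """Second AND level over pairs of groups, gated by enable and allow stutter."""
    for low_bit in GATE_STARTS:
        high_bit = low_bit + 2 * GROUP_BITS - 1
        if high_bit <= SINGLE_PRECISION_LIMIT and not double_precision:
            continue  # lower-half gates are disabled for single precision
        if (group_propagate(propagates, low_bit)
                and group_propagate(propagates, low_bit + GROUP_BITS)):
            return allow_stutter
    return False


if __name__ == "__main__":
    print(stutter(run_mask(5, 10)))   # 10 propagates aligned with a gate
    print(stutter(run_mask(6, 18)))   # 18 propagates straddling two gates
    print(stutter(run_mask(6, 19)))   # 19 propagates always cover one gate
```

Under this assumed layout, a run of propagates escapes detection only while it fails to cover a whole gate, which is why 18 is the longest run that can go undetected.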
The optimal width of the AND function used to produce the group propagate signals will depend on the technology in which the ALU is to be implemented. Fig. 4 shows an example of the stutter circuit where the width of the group propagate is 4 bits. If the individual propagate bits are not available, as in standard part ALU slices, then the ANDing of the group propagates can be used directly. The stutter circuit is very inexpensive to implement, due to the fact [...] consecutive propagates (Fig. 6).

Fig. 4. Stutter detect for 32 bit ALU with 4 bit lookahead groupings. (Labels: MIN STALL = 8, MAX NOT STALL = 14.)

For the circuit described in Fig. 3, the probability of requiring a stutter on random data is
\[ P = \frac{w - m}{n} \cdot \frac{2^{\,w - n}}{2^{\,w}} \]

where

w = width of the ALU
m = total bits not included in any detection gate
n = width of the detection gates

(This equation is an upper bound, since data which sets more than one detection gate is counted more than once.)

double precision: w = 60, m = 10, n = 10, P = 5/1024
single precision: w = 30, m = 10, n = 10, P = 2/1024.

The 60 bit ALU/shifter cycle of the J-11 FPA can be completed in 100 ns for all operations in which the maximum consecutive carry is less than 19 bits.

[Figure: stutter detection circuits over propagates P0 to P59. Labels: ENABLE DOUBLE PRECISION, ALLOW STUTTER, STUTTER 1, STUTTER 2, MIN STALL = 10, MIN STALL = 20.]

ALGORITHMS

The FPA executes four data path assisted microinstructions which serve as the basis for executing the multiplication, division, alignment, and normalization algorithms.

The FPA uses a fixed 3 bit shift algorithm to perform multiplication. The algorithm requires the generation of multiples 0, 2, 4, 6, and 8 times the multiplicand. The multiples are added to or subtracted from the partial product and the result is shifted to account for the three multiplier bits retired. The required shift and ALU function for each cycle are determined by examining the multiplier bits from the LSB's of the Q register (bits 38:35 for F, bits 5:2 for D). Since the main data path has only one shifter, the total shift required for each cycle combines the shift necessary to align the binary point for the present multiple and the post shift required to complete the 3 bit retirement from the previous cycle. The previous group of multiplier bits is held in a delay register, which is initially cleared, allowing the algorithm to begin without any shifts being owed from a previous cycle.

Prior to the start of the multiplication, 3/4 times the multiplicand is calculated and placed in the scratch register for use in generating the multiple of 6. If the multiples 2, 4, or 8 are required, the normal multiplicand register is accessed and the partial product is appropriately shifted. When a multiple of 6 times the multiplicand is required, the contents of the scratch register are used instead of the multiplicand and added to the partial product shifted right 3 times. A special microinstruction is used to establish an initial partial product of either zero, if the LSB of the multiplier is zero, or minus one times the multiplicand, if the multiplier LSB is a one. Table II details the single shifter 3 bit multiplication algorithm implemented in the FPA.

The FPA uses a normalizing nonrestoring division algorithm which produces a quotient at a rate of 1.5 bits/cycle. If the partial remainder will be normalized for division (-1 < R < -1/2, 1/2 < R < 1) by a left shift of one, then one new quotient bit is determined; when the partial remainder requires more than a single left shift to become normalized, two quotient bits are determined. Table III describes the next shift, ALU operation, and quotient bits derived as a function of the MSB's of the partial remainder. If the 4 MSB's of the partial remainder equal all ones or all zeros, then a left shift of 2 will not normalize the present partial remainder and the next cycle ALU operation is A → A, with no add or subtract of the divisor. This ensures that the next partial remainder will remain in the range -1/2 < R < 1/2. If the present partial remainder can be normalized by a left shift of one or two, the next ALU operation adds or subtracts the divisor, depending on the sign of the remainder, in order to drive it toward zero. The quotient bits are inserted at the guard bit and LSB positions of the Q register (bits 35:34 for F, bits 3:2 for D) and the Q register shifts either 1 or 2 bits left as required. When Q57 = 1, a Q shift of left one is forced and only one more quotient bit is accepted. When Q58 = 1, the division is completed and the normalized quotient is in the Q register. The normalized quotient can be in the range 1/2 < Q < 1 or 1 < Q < 2, depending on the ratio of the initial dividend and divisor. If the initial subtraction of the divisor from the dividend is positive, then 1 < quotient < 2 and the final exponent is incremented.

The important feature of the FPA alignment and normalization algorithms is that, although the main shifter has limited range, the shift probability data for floating point addition and subtraction (Table IV) show this range to be all that is required for most operations. The FPA performs 78 percent (up to 5 bits of exponent difference) of the
[Figure: stutter detection circuit over propagates P0 to P59. Labels: STUTTER, MIN STALL = 10.]
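The random data stutter bound quoted for the Fig. 3 circuit can be checked with exact arithmetic. The sketch below reads the bound as P = ((w - m)/n) * 2^(w - n) / 2^w, which reproduces the 5/1024 and 2/1024 figures given in the text. The exact value is computed under an added assumption that is not stated in the paper: each propagate bit is independently one with probability 1/2 and the detection gates do not overlap. The function names are hypothetical.

```python
from fractions import Fraction


def stutter_bound(w: int, m: int, n: int) -> Fraction:
    """Upper bound: number of detection gates times the chance one gate is all ones."""
    gates = Fraction(w - m, n)
    return gates * Fraction(2 ** (w - n), 2 ** w)


def stutter_exact(w: int, m: int, n: int) -> Fraction:
    """Exact chance that at least one gate fires (assumes independent, disjoint gates)."""
    gates = (w - m) // n
    miss_one_gate = 1 - Fraction(1, 2 ** n)
    return 1 - miss_one_gate ** gates


if __name__ == "__main__":
    for label, w in (("double", 60), ("single", 30)):
        bound = stutter_bound(w, 10, 10)
        exact = stutter_exact(w, 10, 10)
        print(label, bound, float(bound), float(exact))
```

The gap between the two values is the double counting noted in the text: data that sets more than one detection gate is counted once per gate in the bound.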
TABLE II
FPA 3 BIT/CYCLE MULTIPLICATION ALGORITHM
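The 3 bit per cycle multiplication described in the text can be illustrated with a small integer model. This is a sketch of one recoding that is consistent with the description (four multiplier bits examined per cycle, multiples 0, 2, 4, 6, 8 added or subtracted, initial partial product of zero or minus the multiplicand); it is not a reconstruction of Table II. Two simplifications are assumptions: the multiples are shifted left instead of shifting the partial product right, and the scratch value is held as 3 times the multiplicand rather than 3/4 times it. The names `recode_digit` and `multiply` are hypothetical.

```python
def recode_digit(window: int) -> int:
    """Map four multiplier bits (b3 b2 b1 b0, b0 = LSB) to a signed even multiple."""
    b0 = window & 1
    b1 = (window >> 1) & 1
    b2 = (window >> 2) & 1
    b3 = (window >> 3) & 1
    # b3 is the overlap bit: its negative weight is repaid as b0 of the next window.
    return 2 * (-4 * b3 + 2 * b2 + b1 + b0)


def multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Multiply by an unsigned `bits`-wide multiplier, retiring 3 bits per cycle."""
    assert 0 <= multiplier < (1 << bits)
    scratch = 3 * multiplicand  # stands in for the scratch register used for 6x
    multiples = {
        0: 0,
        2: multiplicand << 1,
        4: multiplicand << 2,
        6: scratch << 1,
        8: multiplicand << 3,
    }
    # The LSB is counted twice by the first window, so start at -1x if it is set.
    product = -multiplicand if multiplier & 1 else 0
    cycles = (bits + 2) // 3
    for cycle in range(cycles):
        window = (multiplier >> (3 * cycle)) & 0xF
        digit = recode_digit(window)
        term = multiples[abs(digit)]
        product += (term if digit >= 0 else -term) << (3 * cycle)
    return product


if __name__ == "__main__":
    print(multiply(37, 0b101101, 6), 37 * 0b101101)
```

The model makes the role of the special initial microinstruction visible: every window contributes an even multiple, so an odd multiplier can only be represented if one copy of the multiplicand is removed at the start.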
TABLE III
FPA 1.5 BIT/CYCLE DIVISION ALGORITHM
Note: Quotient bit(s) are inserted at bit positions Q35 and Q34 for single
precision operation instead of Q3 and Q2.
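The normalizing nonrestoring division described in the text can be modeled with exact fractions. The sketch below is an illustration under stated assumptions and not a reconstruction of Table III: the shift decision is made by comparing the magnitude of the partial remainder with 1/4 and 1/8 instead of decoding its MSB's, the quotient is accumulated as signed digits rather than inserted as bits into the Q register, and the forced final shift of one is omitted, so the loop may produce one bit more than requested. The name `divide` and its return values are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
QUARTER = Fraction(1, 4)
EIGHTH = Fraction(1, 8)


def divide(dividend: Fraction, divisor: Fraction, quotient_bits: int):
    """Return (quotient, remainder, bits, cycles) for operands in [1/2, 1)."""
    assert HALF <= dividend < 1 and HALF <= divisor < 1
    remainder = dividend - divisor      # initial subtraction, |remainder| < 1/2
    quotient = Fraction(1)
    bits = 0
    cycles = 0
    while bits < quotient_bits:
        cycles += 1
        magnitude = abs(remainder)
        # A left shift of one normalizes when |R| >= 1/4; otherwise shift by two.
        shift = 1 if magnitude >= QUARTER else 2
        remainder *= 2 ** shift
        bits += shift
        if magnitude >= EIGHTH:
            # Shifted remainder is normalized: add or subtract the divisor
            # according to the sign, driving the remainder toward zero.
            digit = 1 if remainder > 0 else -1
            remainder -= digit * divisor
            quotient += Fraction(digit, 2 ** bits)
        # Otherwise the shifted remainder passes through unchanged.
    return quotient, remainder, bits, cycles


if __name__ == "__main__":
    q, r, bits, cycles = divide(Fraction(3, 4), Fraction(5, 8), 24)
    print(float(q), bits, cycles)
```

In every branch the new remainder stays inside -1/2 < R < 1/2, which is the range the text says the algorithm preserves, and the relation dividend = divisor * quotient + remainder / 2^bits holds after each cycle.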
TABLE IV
WEIGHTED DATA ON ALIGNMENT AND NORMALIZATION
TABLE V
J-11 FPA TYPICAL REGISTER-TO-REGISTER EXECUTION TIMES
...creasing the fraction and exponent data paths to 67 and 13 bits, respectively. The three designs achieve similar performance. The carry length detection method and the 3 bit multiplication are especially beneficial in applications requiring wide data paths, as evidenced by the performance of these three floating point processor chips.

Larry Harada received the B.S. degree in electrical engineering in 1980, and the M.E. degree in 1981, both from Cornell University, Ithaca, NY.
He joined Digital Equipment Corporation, Hudson, MA, in July 1981. He is currently working for the Digital LSI Manufacturing Group in Hudson.

[...] degrees in electrical engineering from Massachusetts Institute of Technology, Cambridge, MA, in 1980.
Prior to joining Digital Equipment Corporation, Hudson, MA, in 1982, he was with Integrated Circuit Systems, Incorporated, Westboro, MA.

Gil Wolrich received the B.S. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1971, and the M.S. in electrical engineering from Northeastern University, Boston, MA, in 1978.
He joined the Digital Equipment Corporation, Hudson, MA, in April 1979. He is currently a Principal Engineer with the Semiconductor Engineering Group in Hudson, MA.