Computer Architecture:

Complexity and Correctness

Silvia M. Mueller, IBM Germany Development, Boeblingen, Germany
Wolfgang J. Paul, University of Saarland, Saarbruecken, Germany

Preface

IN THIS BOOK we develop at the gate level the complete design of a
pipelined RISC processor with delayed branch, forwarding, hardware
interlock, precise maskable nested interrupts, caches, and a fully
IEEE-compliant floating point unit. The design is completely modular. This
permits us to give rigorous correctness proofs for almost every part of the
design. Also, because we can compute gate counts and gate delays, we can
formally analyze the cost effectiveness of all parts of the design.

 

This book owes much to the work of the following students and postdocs:
P. Dell, G. Even, N. Gerteis, C. Jacobi, D. Knuth, D. Kroening, H. Leister,
P.-M. Seidel.

March 2000
Silvia M. Mueller
Wolfgang J. Paul
Contents

1 Introduction

2 Basics
2.1 Hardware Model
2.1.1 Components
2.1.2 Cycle Times
2.1.3 Hierarchical Designs
2.1.4 Notations for Delay Formulae
2.2 Number Representations and Basic Circuits
2.2.1 Natural Numbers
2.2.2 Integers
2.3 Basic Circuits
2.3.1 Trivial Constructions
2.3.2 Testing for Zero or Equality
2.3.3 Decoders
2.3.4 Leading Zero Counter
2.4 Arithmetic Circuits
2.4.1 Carry Chain Adders
2.4.2 Conditional Sum Adders
2.4.3 Parallel Prefix Computation
2.4.4 Carry Lookahead Adders
2.4.5 Arithmetic Units
2.4.6 Shifter
2.5 Multipliers
2.5.1 School Method
2.5.2 Carry Save Adders
2.5.3 Multiplication Arrays
2.5.4 4/2-Trees
2.5.5 Multipliers with Booth Recoding
2.5.6 Cost and Delay of the Booth Multiplier
2.6 Control Automata
2.6.1 Finite State Transducers
2.6.2 Coding the State
2.6.3 Generating the Outputs
2.6.4 Computing the Next State
2.6.5 Moore Automata
2.6.6 Precomputing the Control Signals
2.6.7 Mealy Automata
2.6.8 Interaction with the Data Paths
2.7 Selected References and Further Reading
2.8 Exercises

3 A Sequential DLX Design
3.1 Instruction Set Architecture
3.1.1 Instruction Formats
3.1.2 Instruction Set Coding
3.1.3 Memory Organization
3.2 High Level Data Paths
3.3 Environments
3.3.1 General Purpose Register File
3.3.2 Instruction Register Environment
3.3.3 PC Environment
3.3.4 ALU Environment
3.3.5 Memory Environment
3.3.6 Shifter Environment SHenv
3.3.7 Shifter Environment SH4Lenv
3.4 Sequential Control
3.4.1 Sequential Control without Stalling
3.4.2 Parameters of the Control Automaton
3.4.3 A Simple Stall Engine
3.5 Hardware Cost and Cycle Time
3.5.1 Hardware Cost
3.5.2 Cycle Time
3.6 Selected References and Further Reading

   
4 Basic Pipelining
4.1 Delayed Branch and Delayed PC
4.2 Prepared Sequential Machines
4.2.1 Prepared DLX Data Paths
4.2.2 FSD for the Prepared Data Paths
4.2.3 Precomputed Control
4.2.4 A Basic Observation
4.3 Pipelining as a Transformation
4.3.1 Correctness
4.3.2 Hardware Cost and Cycle Time
4.4 Result Forwarding
4.4.1 Valid Flags
4.4.2 3-Stage Forwarding
4.4.3 Correctness
4.5 Hardware Interlock
4.5.1 Stall Engine
4.5.2 Scheduling Function
4.5.3 Simulation Theorem
4.6 Cost Performance Analysis
4.6.1 Hardware Cost and Cycle Time
4.6.2 Performance Model
4.6.3 Delay Slots of Branch/Jump Instructions
4.6.4 CPI Ratio of the DLX Designs
4.6.5 Design Evaluation
4.7 Selected References and Further Reading
4.8 Exercises

5 Interrupt Handling
5.1 Attempting a Rigorous Treatment of Interrupts
5.2 Extended Instruction Set Architecture
5.3 Interrupt Service Routines For Nested Interrupts
5.4 Admissible Interrupt Service Routines
5.4.1 Set of Constraints
5.4.2 Bracket Structures
5.4.3 Properties of Admissible Interrupt Service Routines
5.5 Interrupt Hardware
5.5.1 Environment PCenv
5.5.2 Circuit Daddr
5.5.3 Register File Environment RFenv
5.5.4 Modified Data Paths
5.5.5 Cause Environment CAenv
5.5.6 Control Unit
5.6 Pipelined Interrupt Hardware
5.6.1 PC Environment
5.6.2 Forwarding and Interlocking
5.6.3 Stall Engine
5.6.4 Cost and Delay of the DLXΠ Hardware
5.7 Correctness of the Interrupt Hardware
5.8 Selected References and Further Reading
5.9 Exercises

6 Memory System Design
6.1 A Monolithic Memory Design
6.1.1 The Limits of On-chip RAM
6.1.2 A Synchronous Bus Protocol
6.1.3 Sequential DLX with Off-Chip Main Memory
6.2 The Memory Hierarchy
6.2.1 The Principle of Locality
6.2.2 The Principles of Caches
6.2.3 Execution of Memory Transactions
6.3 A Cache Design
6.3.1 Design of a Direct Mapped Cache
6.3.2 Design of a Set Associative Cache
6.3.3 Design of a Cache Interface
6.4 Sequential DLX with Cache Memory
6.4.1 Changes in the DLX Design
6.4.2 Variations of the Cache Design
6.5 Pipelined DLX with Cache Memory
6.5.1 Changes in the DLX Data Paths
6.5.2 Memory Control
6.5.3 Design Evaluation
6.6 Selected References and Further Reading
6.7 Exercises

7 IEEE Floating Point Standard and Theory of Rounding
7.1 Number Formats
7.1.1 Binary Fractions
7.1.2 Two’s Complement Fractions
7.1.3 Biased Integer Format
7.1.4 IEEE Floating Point Numbers
7.1.5 Geometry of Representable Numbers
7.1.6 Convention on Notation
7.2 Rounding
7.2.1 Rounding Modes
7.2.2 Two Central Concepts
7.2.3 Factorings and Normalization Shifts
7.2.4 Algebra of Rounding and Sticky Bits
7.2.5 Rounding with Unlimited Exponent Range
7.2.6 Decomposition Theorem for Rounding
7.2.7 Rounding Algorithms
7.3 Exceptions
7.3.1 Overflow
7.3.2 Underflow
7.3.3 Wrapped Exponents
7.3.4 Inexact Result
7.4 Arithmetic on Special Operands
7.4.1 Operations with NaNs
7.4.2 Addition and Subtraction
7.4.3 Multiplication
7.4.4 Division
7.4.5 Comparison
7.4.6 Format Conversions
7.5 Selected References and Further Reading
7.6 Exercises

8 Floating Point Algorithms and Data Paths
8.1 Unpacking
8.2 Addition and Subtraction
8.2.1 Addition Algorithm
8.2.2 Adder Circuitry
8.3 Multiplication and Division
8.3.1 Newton-Raphson Iteration
8.3.2 Initial Approximation
8.3.3 Newton-Raphson Iteration with Finite Precision
8.3.4 Table Size versus Number of Iterations
8.3.5 Computing the Representative of the Quotient
8.3.6 Multiplier and Divider Circuits
8.4 Floating Point Rounder
8.4.1 Specification and Overview
8.4.2 Normalization Shift
8.4.3 Selection of the Representative
8.4.4 Significand Rounding
8.4.5 Post Normalization
8.4.6 Exponent Adjustment
8.4.7 Exponent Rounding
8.4.8 Circuit SpecFPrnd
8.5 Circuit FCon
8.5.1 Floating Point Condition Test
8.5.2 Absolute Value and Negation
8.5.3 IEEE Floating Point Exceptions
8.6 Format Conversion
8.6.1 Specification of the Conversions
8.6.2 Implementation of the Conversions
8.7 Evaluation of the FPU Design
8.8 Selected References and Further Reading
8.9 Exercises

9 Pipelined DLX Machine with Floating Point Core
9.1 Extended Instruction Set Architecture
9.1.1 FPU Register Set
9.1.2 Interrupt Causes
9.1.3 FPU Instruction Set
9.2 Data Paths without Forwarding
9.2.1 Instruction Decode
9.2.2 Memory Stage
9.2.3 Write Back Stage
9.2.4 Execute Stage
9.3 Control of the Prepared Sequential Design
9.3.1 Precomputed Control without Division
9.3.2 Supporting Divisions
9.4 Pipelined DLX Design with FPU
9.4.1 PC Environment
9.4.2 Forwarding and Interlocking
9.4.3 Stall Engine
9.4.4 Cost and Delay of the Control
9.4.5 Simulation Theorem
9.5 Evaluation
9.5.1 Hardware Cost and Cycle Time
9.5.2 Variation of the Cache Size
9.6 Exercises

A DLX Instruction Set Architecture
A.1 DLX Fixed-Point Core: FXU
A.1.1 Instruction Formats
A.1.2 Instruction Set Coding
A.2 Floating-Point Extension
A.2.1 FPU Register Set
A.2.2 FPU Instruction Set

   
B Specification of the FDLX Design
B.1 RTL Instructions of the FDLX
B.1.1 Stage IF
B.1.2 Stage ID
B.1.3 Stage EX
B.1.4 Stage M
B.1.5 Stage WB
B.2 Control Automata of the FDLX Design
B.2.1 Automaton Controlling Stage ID
B.2.2 Precomputed Control

Bibliography

Index


Chapter 1

Introduction

In this book we develop at the gate level the complete design of a pipelined
RISC processor with delayed branch, forwarding, hardware interlock, pre-
cise maskable nested interrupts, caches, and a fully IEEE-compliant float-
ing point unit.
The educated reader should immediately ask “So what? Such designs
obviously existed in industry several years back. What is the point of
spreading out all kinds of details?”
The point is: the complete design presented here is modular and clean.
It is certainly clean enough to be presented and explained to students. This
opens the way to covering the following topics, both in this text and in the
classroom.

• To begin with the obvious: we determine cost and cycle times of
  designs. Whenever a new technique is introduced, we can evaluate
  its effects and side effects on the cycle count, the hardware cost,
  and the cycle time of the whole machine. We can study tradeoffs
  between these very real complexity measures.

• As the design is modular, we can give for each module a clean and
  precise specification of what the module is supposed to do.

• Following the design for a module, we give a complete explanation
  as to why the design meets the specification. By far the fastest way to
  give such an explanation is by a rigorous mathematical correctness
  proof.¹

• From known modules, whose behavior is well defined, we
  hierarchically construct new modules, and we show that the new
  complicated modules constructed in this way meet their specifications by
  referring only to the specifications of the old modules and to the
  construction of the new modules. We follow this route up to the
  construction of entire machines, where we show that the hardware
  of the machines interprets the instruction set and that interrupts are
  serviced in a precise way.

Because at all stages of the design we use modules with well defined
behavior, the process of putting them all together is in this text completely
precise.

     


We see three ways to use this book:

• Again, we begin with the obvious: one can try to learn the material
  by reading the book alone. Because the book is completely
  self-contained, this works. A basic understanding of programming,
  knowledge of high school math, and some familiarity with proofs by
  induction suffice to understand and verify (or falsify!) each and every
  statement in this book.

• The material of this book can be covered in university classes during
  two semesters. For a class in “computer structures” followed by
  “computer architecture 1” the material is somewhat heavy. But our
  experience is that students of the classes “computer architecture 1
  and 2” deal well with the entire material. Many advanced topics like
  superscalar processors, out-of-order execution, paging, and parallel
  processing, which are not covered in this book, can be treated very well
  in a seminar parallel to the class “computer architecture 2”. Students
  who have worked through the first part of this book usually present
  and discuss advanced material in seminars with remarkable maturity.

  Sections 2.1 to 2.5, chapter 7 and chapter 8 present a self-contained
  construction of the data paths of an IEEE-compliant floating point
  unit. This material can be covered during one semester in a class on
  computer arithmetic.

• The book can be used as supplementary reading in more traditional
  architecture classes or as a reference for professionals.

¹ Whether mathematical correctness proofs are to be trusted is a sore issue
which we will address shortly.



Computer architects tend not to like proofs. It is almost as if computer
architects do not believe in mathematics. Even mathematical formulae are
conspicuously rare in most textbooks on computer architecture, in contrast
to most other engineering disciplines. The reason for this is simple:

• Correctness proofs are incredibly error prone. When it comes to the
  verification of computer systems, it is very difficult to tell a correct
  proof from a proof which is almost but not quite correct. The proofs
  in this book are no exception.

• Shipping hardware which is believed to be correct and which turns
  out to be faulty later can cost a computer manufacturer a GREAT deal
  of money.

Thus, do we expect our readers to buy the correctness of all designs
presented here based solely on the written proofs? Would we – the authors
– be willing to gamble our fortune on the correctness of the designs? The
only sane answer is: no. On the contrary, in spite of our best efforts and
our considerable experience we consider it quite likely that one or more
proofs in the second half of the book will receive a nontrivial fix over the
next two years or so.
Keeping the above stated limitations of written correctness proofs firmly
in mind, we see nevertheless three very strong points in favor of using
mathematics in a text book about computer architecture.

• The foremost reason is speed. If one invests in the development
  of appropriate mathematical formalism, then one can express one’s
  thoughts much more clearly and succinctly than without formalism.
  This in turn permits one to progress more rapidly.

  Think of the famous formula stating that the square of the sum of
  a first number and a second number equals the sum of the square of
  the first number, two times the product of the first number with the
  second number, and the square of the second number. The line

  $$(a + b)^2 = a^2 + 2ab + b^2$$

  says the very same, but it is much easier to understand. Learning the
  formalism of algebra is an investment one makes in high school and
  which costs time. It pays off if the time saved during calculations
  with the formalism exceeds the time spent learning the formalism.

  In this book we use mathematical formalism in exactly this way. It
  is the very reason why we can cover so much material so quickly.

• We have already stated it above: at the very least the reader can
  take the correctness proofs in this book as a highly structured and
  formalized explanation as to why the authors think the designs work.

• But this is not all. Over the last years much effort has been invested
  in the development of computer systems which allow the formulation
  of theorems and proofs in such a precise way that proofs can
  actually be verified by the computer. By now proofs like the ones
  in this book can be entered into computer-aided proof systems with
  almost reasonable effort.

  Indeed, at the time of this writing (February 2000) the correctness
  of a machine closely related to the machine from chapter 4 (with a
  slightly different, more general forwarding mechanism) has been
  verified using the system PVS [CRSS94, KPM00]. This also includes
  the verification of all designs from chapter 2 used in chapter 4.
  Verification of more parts of the book, including the floating point unit
  of chapter 8, is under way and progressing smoothly (so far).

There are three key concepts which permit us to develop the material of
this book very quickly and at the same time in a completely precise way.

1. We distinguish rigorously between numbers and their representation.
   The simple formalism for this is summarized in chapter 2. This will
   immediately help to reduce the correctness proofs of many auxiliary
   circuits to easy exercises. More importantly, this formalism
   maintains order – and the sanity of the reader – in the construction of
   floating point units which happen to manipulate numbers in 7
   different formats.²

2. The details of pipelining are very tricky. As a tool for better
   understanding them, we introduce in chapter 4 prepared sequential
   machines. These are machines which have the data path of a pipelined
   machine but which are operated sequentially. They are very easy to
   understand.

   Pipelined machines have to simulate prepared sequential machines
   in a fairly straightforward formal sense. In this way we can at least
   easily formulate what pipelined machines are supposed to do.
   Showing that they indeed do what they are supposed to do will
   occasionally involve some short but subtle arguments about the
   scheduling of instructions in pipelines.

3. In chapter 7 we describe the algebra of rounding from [EP97]. This
   permits us to formulate very concise assertions about the behavior
   of floating point circuits. It will allow us to develop the schematics
   of the floating point unit in a completely structured way.

² Packed single and double precision, unpacked single and double precision,
binary numbers, two’s complement numbers, and biased integers.

  
We conclude the introduction by highlighting some results from the chap-
ters of this book. In chapter 2 we develop many auxiliary circuits for later
use: various counters, shifters, decoders, adders including carry lookahead
adders, and multipliers with Booth recoding. To a large extent we will
specify the control of machines by finite state diagrams. We describe a
simple translation of such state diagrams into hardware.
In chapter 3 we specify a sequential DLX machine much in the spirit
of [PH94] and prove that it works. The proof is mainly bookkeeping. We
have to go through the exercise because later we establish the correctness
of pipelined machines by showing that they simulate sequential machines
whose correctness is already established.
In chapter 4 we deal with pipelining, delayed branch, result forwarding,
and hardware interlock. We show that the delayed branch mechanism can
be replaced by a mechanism which we call “delayed PC” and which delays all
instruction fetches, not just branches.³ We partition machines into data
paths, control automaton, forwarding engine, and stall engine. Pipelined
machines are obtained from the prepared machines mentioned above by an
almost straightforward transformation.
Chapter 5 deals with a subject that is considered tricky and which has not
been treated much in the literature: interrupts. Even formally specifying
what an interrupt mechanism should do turns out to be not so easy. The
reason is that an interrupt is a kind of procedure call; procedure calls in
turn are a high level language concept at an abstraction level way above
the level of hardware specifications.
Achieving preciseness turns out to be not so bad. After all, preciseness
is trivial for sequential machines, and we generate pipelined machines by
transformation of prepared sequential machines. But the interplay of
interrupt hardware and forwarding circuits is nontrivial, in particular when
it comes to the forwarding of special purpose registers like, e.g., the
register which contains the masks of the interrupts.
³ We are much more comfortable with the proof since it has been verified in PVS.

Chapter 6 deals with caches. In particular we specify a bus protocol by
which data are exchanged between CPU, caches, and main memory, and
we specify automata, which (hopefully) realize the protocol. We explain
the automata, but we do not prove that the automata realize the protocol.
Model checking [HQR98] is much better suited to verify a statement of
that nature.
Chapter 7 contains no designs at all. Only the IEEE floating point
standard is rephrased in mathematical language and theorems about rounding
are proven. The whole chapter is theory. It is an investment into
chapter 8, where we design an entire fully IEEE-compatible floating point
unit with denormals and exceptions, dual precision adder, multiplier,
iterative division, format conversion, and rounding. All this on only 120
pages.
In chapter 9 we integrate the pipelined floating point unit into the DLX
machine. As one would expect, the control becomes more complicated,
both because instructions have variable latency and because the iterative
division is not fully pipelined. We invest much effort into a very com-
fortable forwarding mechanism. In particular, this mechanism will permit
the rounding mode of floating point operations to be forwarded. This, in
turn, permits interval arithmetic to be realized while maintaining pipelined
operation of the machine.

Chapter 2

Basics


STUDYING COMPUTER architecture without counting the cost of hardware
and the length of critical paths is great fun. It is like going shopping
without looking at price tags at all. In this book, we specify and
analyze hardware in the model from [MP95]. This is a model at the gate
level which gives at least rough price tags.

2.1 Hardware Model

2.1.1 Components

In the model there are five types of basic components, namely: gates,
flipflops, tristate drivers, RAMs and ROMs. Cost and delay of the basic
components are listed in table 2.1. They are normalized relative to the cost
and delay of a 1-bit inverter. For the basic components we use the symbols
from figure 2.1.
Clock enable signals ce of flipflops and registers, output enable signals
oe of tristate drivers and write signals w of RAMs are always active high.
RAMs have separate data input and data output ports. All flipflops are
assumed to be clocked in each cycle; thus there is no need to draw clock
inputs.
A RAM with $A$ addresses and $d$-bit data has cost

$$C_{ram}(A, d) = C_{RAMcell} \cdot (A + 3) \cdot (d + \log\log d)$$


Table 2.1: Cost [g] (gate equivalents) and delay [d] of the basic components

component        cost   delay
not              1      1
nand, nor        2      1
and, or          2      1
xor, xnor        4      2
mux              3      2
flipflop         8      4
3-state driver   5      2
RAM cell         2      –
ROM cell         0.25   –

[Figure 2.1: Symbols of the basic components: inverter, AND, NOR,
multiplexer, XOR, NAND, tristate driver, A×d RAM, XNOR, OR, flipflop,
A×d ROM]

[Figure 2.2: Circuit of a full adder FA with inputs a, b, c and outputs s, c’]

Table 2.2: Read and write times of registers and RAMs; $d_{ram}$ denotes the
access time of the RAM.

         register                RAM
read     0                       d_ram
write    ∆ = D_ff + δ            d_ram + δ

and delay

$$D_{ram}(A, d) = \begin{cases} \log d + A/4 & ; \; A \le 64 \\ 3 \cdot \log A + 10 & ; \; A > 64 \end{cases}$$

For the construction of register files, we use 3-port RAMs capable of
performing two reads and one write in a single cycle. If in one cycle a read
and a write to the same address are performed, then the output data of the
read operation are left undefined.
Cost and delay of these multi-port RAMs are

$$C_{ram3}(A, d) = 1.6 \cdot C_{ram}(A, d)$$

$$D_{ram3}(A, d) = 1.5 \cdot D_{ram}(A, d)$$

The circuit in figure 2.2 has cost $C_{FA}$ and delay $D_{FA}$, with

$$C_{FA} = 2 \cdot C_{xor} + 2 \cdot C_{and} + C_{or}$$

$$D_{FA} = D_{xor} + \max\{D_{xor},\; D_{and} + D_{or}\}$$

2.1.2 Cycle Times

In the computation of cycle times, we charge for reads and writes in regis-
ters and RAMs the times specified in table 2.2. Note that we start and end
counting cycles at the point in time, when the outputs of registers have new
values. The constant δ accounts for setup and hold times; we use δ = 1.

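Suppose circuit $S$ has delay $d_S$ and RAM $R$ has access time $d_{ram}$. The
four schematics in figure 2.3 then have cycle times

$$\tau = \begin{cases} d_S + \Delta & \text{in case a)} \\ d_{ram} + d_S + \Delta & \text{in case b)} \\ d_S + d_{ram} + \delta & \text{in case c)} \\ d_S + 2 \cdot d_{ram} + \delta & \text{in case d)} \end{cases}$$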

2.1.3 Hierarchical Designs

It is common practice to specify designs in a hierarchical or even recursive
manner. It is also no problem to describe the cost or delay of hierarchi-
cal designs by systems of equations. For recursive designs one obtains
recursive systems of difference equations. Section 2.3 of this chapter will
contain numerous examples.
Solving such systems of equations in closed form is routine work in the
analysis of algorithms if the systems are small. Designs of entire proces-
sors contain dozens of sheets of schematics. We will not even attempt
to solve the associated systems of equations in closed form. Instead, we
translate the equations in a straightforward way into C programs and let
the computer do the work.
Running a computer program is a particular form of experiment. Scien-
tific experiments should be reproducible as easily as possible. Therefore,
all C programs associated with the designs in this book are accessible at our
web site1 . The reader can easily check the analysis of the designs, analyze
modified designs, or reevaluate the designs with a new set of component
costs and delays.
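
As a minimal sketch of such a translation, the full adder formulas for $C_{FA}$ and $D_{FA}$ from section 2.1.1 can be written as two C functions. The struct `tech_t` and the function names are assumptions made for this illustration, not the programs from the web site; the authoritative gate parameters are those of table 2.1.

```c
#include <assert.h>

/* Hypothetical container for the gate parameters of one technology. */
typedef struct {
    int c_and, c_or, c_xor;   /* gate costs  */
    int d_and, d_or, d_xor;   /* gate delays */
} tech_t;

static int max_int(int x, int y) { return x > y ? x : y; }

/* C_FA = 2*C_xor + 2*C_and + C_or */
int cost_fa(const tech_t *t) {
    assert(t != 0);
    return 2 * t->c_xor + 2 * t->c_and + t->c_or;
}

/* D_FA = D_xor + max(D_xor, D_and + D_or) */
int delay_fa(const tech_t *t) {
    assert(t != 0);
    return t->d_xor + max_int(t->d_xor, t->d_and + t->d_or);
}
```

Reevaluating the design with a different set of component costs and delays then only requires passing a different `tech_t`.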

2.1.4 Notations for Delay Formulae

Let $S$ be a circuit with inputs $I$ and outputs $O$ as shown in figure 2.4. It is often desirable to analyze the delay $D_S(I'; O')$ from a certain subset $I'$ of the inputs to a certain subset $O'$ of the outputs. This is the maximum delay of a path $p$ from an input in $I'$ to an output in $O'$. We use the abbreviations

\begin{align*}
D_S(I'; O) &= D_S(I') \\
D_S(I; O') &= D_S(O') \\
D_S &= D_S(I; O).
\end{align*}

Circuits $S$ do not exist in isolation; their inputs and outputs are connected to registers or RAMs, possibly via long paths. We denote by $A_S(I'; O')$ the maximum delay of a path which starts in a register or RAM, enters $S$ via $I'$ and leaves $S$ via $O'$. We call $A_S(I'; O')$ an accumulated delay. If all inputs $I'$ are directly connected to registers, we have

\[ A_S(I'; O') = D_S(I'; O'). \]

1      



  
Figure 2.3: The four types of transfer between registers and RAMs: a) register to register, b) RAM to register, c) register to RAM, d) RAM to RAM.

Figure 2.4: Paths through a circuit $S$. $I'$ is a subset of its inputs $I$, and $O'$ is a subset of its outputs $O$.

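Similarly we denote by $T_S(I'; O')$ the maximum cycle time required by cycles through $I'$ and $O'$. If $I' = I$ or $O' = O$ we abbreviate as defined above.

The schematic $S_c$ of figure 2.5 comprises three cycles:

• leaving circuit $S_1$ via output $d_3$,
• entering circuit $S_2$ via input $d_1$,
• entering circuit $S_2$ via input $d_2$.

Thus, the cycle time of $S_c$ can be expressed as

\[ T_{S_c} = \max\{T_{S_1}(d_3),\, T_{S_2}(d_1),\, T_{S_2}(d_2)\} \]

with

\begin{align*}
T_{S_1}(d_3) &= A_{S_1}(d_3) + \Delta = D_{S_1}(d_3) + D_{ff} + \delta \\
T_{S_2}(d_1) &= A_{S_2}(d_1) + \Delta = D_{S_2}(d_1) + D_{ff} + \delta \\
T_{S_2}(d_2) &= A_{S_2}(d_2) + \Delta = A_{S_1}(d_2) + D_{S_2}(d_2) + D_{ff} + \delta.
\end{align*}

Figure 2.5: Schematic $S_c$ (circuits $S_1$ and $S_2$, registers, and signals $d_1$, $d_2$, $d_3$, $d_4$).

2.2 Number Representations and Basic Circuits

2.2.1 Natural Numbers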

For bits $x \in \{0,1\}$ and natural numbers $n$, we denote by $x^n$ the string consisting of $n$ copies of $x$. For example, $0^3 = 000$ and $1^5 = 11111$. We usually index the bits of strings $a \in \{0,1\}^n$ from right to left with the numbers from 0 to $n-1$. Thus, we write

\[ a = (a_{n-1}, \ldots, a_0) \quad \text{or} \quad a = a[n-1:0]. \]

For strings $a = (a_{n-1}, \ldots, a_0) \in \{0,1\}^n$, we denote by

\[ \langle a \rangle = \sum_{i=0}^{n-1} a_i \cdot 2^i \]

the natural number with binary representation $a$. Obviously we have

\[ \langle a \rangle \in \{0, \ldots, 2^n - 1\}. \]

We denote by $B_n = \{0, \ldots, 2^n - 1\}$ the range of numbers which have a binary representation of length $n$. For $x \in B_n$ and $a \in \{0,1\}^n$ with $x = \langle a \rangle$, we denote by

\[ bin_n(x) = a \]

the $n$-bit binary representation of $x$. A binary number is a string which is interpreted as a binary representation of a number. We have for example

\begin{align*}
\langle 10^n \rangle &= 2^n \\
\langle 1^n \rangle &= 2^n - 1.
\end{align*}

Table 2.3: Computing the binary representation $c's$ of the sum of the bits $a, b, c$.

a  b  c    c'  s
0  0  0    0   0
0  0  1    0   1
0  1  0    0   1
0  1  1    1   0
1  0  0    0   1
1  0  1    1   0
1  1  0    1   0
1  1  1    1   1

From the definition one immediately concludes for any $j \in \{0, \ldots, n-1\}$

\[ \langle a[n-1:0] \rangle = \langle a[n-1:j] \rangle \cdot 2^j + \langle a[j-1:0] \rangle. \qquad (2.1) \]
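
The definitions of $\langle a \rangle$ and $bin_n(x)$ can be sketched in C as follows. Bit strings are stored as arrays with `a[i]` holding bit $a_i$; the function names `bin_value` and `bin_repr` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* <a[hi:lo]>: value of the bits a[lo..hi] read as a binary number.
   An empty range (hi < lo) yields 0. */
uint64_t bin_value(const int *a, int hi, int lo) {
    uint64_t v = 0;
    for (int i = hi; i >= lo; i--) {
        assert(a[i] == 0 || a[i] == 1);
        v = 2 * v + (uint64_t)a[i];
    }
    return v;
}

/* bin_n(x): store the n-bit binary representation of x in a[n-1:0]. */
void bin_repr(uint64_t x, int n, int *a) {
    assert(n >= 1 && n < 64 && x < ((uint64_t)1 << n));
    for (int i = 0; i < n; i++)
        a[i] = (int)((x >> i) & 1);
}
```

Splitting a string at any position $j$ and recombining the two values as in equation (2.1) reproduces `bin_value(a, n - 1, 0)`.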

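Addition  The entries in table 2.3 obviously satisfy

\begin{align*}
s &= a \oplus b \oplus c \\
c' = 1 &\Leftrightarrow a + b + c \geq 2 \\
\langle c's \rangle &= a + b + c.
\end{align*}

This is the standard algorithm for computing the binary representation of the sum of three bits. For the addition of two $n$-bit numbers $a[n-1:0]$ and $b[n-1:0]$, one first observes that

\[ \langle a[n-1:0] \rangle + \langle b[n-1:0] \rangle \in \{0, \ldots, 2^{n+1} - 2\}. \]

Thus, even the sum $+1$ can be represented with $n+1$ bits. The standard algorithm for adding the binary numbers $a[n-1:0]$ and $b[n-1:0]$ as well as a carry in $c_{in}$ is inductively defined by

\begin{align*}
c_{-1} &= c_{in} \\
\langle c_i s_i \rangle &= c_{i-1} + a_i + b_i \qquad (2.2) \\
s_n &= c_{n-1}
\end{align*}

for $i \in \{0, \ldots, n-1\}$. Bit $s_i$ is called the sum bit at position $i$, and $c_i$ is called the carry bit from position $i$ to position $i+1$. The following theorem asserts the correctness of the algorithm.

Theorem 2.1  $\langle a[n-1:0] \rangle + \langle b[n-1:0] \rangle + c_{in} = \langle c_{n-1}\, s[n-1:0] \rangle$.

Proof by induction on $n$. For $n = 0$, this follows directly from the definition of the algorithm. From $n$ to $n+1$ one concludes with equation (2.1) and the induction hypothesis:

\begin{align*}
\langle a[n:0] \rangle + \langle b[n:0] \rangle + c_{in}
&= (a_n + b_n) \cdot 2^n + \langle a[n-1:0] \rangle + \langle b[n-1:0] \rangle + c_{in} \\
&= (a_n + b_n) \cdot 2^n + \langle c_{n-1}\, s[n-1:0] \rangle \\
&= (a_n + b_n + c_{n-1}) \cdot 2^n + \langle s[n-1:0] \rangle \\
&= \langle c_n s_n \rangle \cdot 2^n + \langle s[n-1:0] \rangle \\
&= \langle c_n\, s[n:0] \rangle.
\end{align*}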
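
The addition algorithm (2.2) can be sketched directly in C. The array layout (`a[i]` holds bit $a_i$) and the function names are assumptions made for this illustration.

```c
#include <assert.h>

/* One step of the algorithm: <c' s> = a + b + c. */
static void full_add(int a, int b, int c, int *c_out, int *s) {
    *s = a ^ b ^ c;
    *c_out = (a + b + c) >= 2;
}

/* Computes s[n:0] from a[n-1:0], b[n-1:0] and the carry in cin. */
void add_bits(const int *a, const int *b, int cin, int n, int *s) {
    assert(n >= 0 && (cin == 0 || cin == 1));
    int c = cin;                              /* c_{-1} = cin */
    for (int i = 0; i < n; i++)
        full_add(a[i], b[i], c, &c, &s[i]);   /* <c_i s_i> = c_{i-1} + a_i + b_i */
    s[n] = c;                                 /* s_n = c_{n-1} */
}

/* <s[len-1:0]> */
unsigned bits_value(const int *s, int len) {
    unsigned v = 0;
    for (int i = len - 1; i >= 0; i--)
        v = 2 * v + (unsigned)s[i];
    return v;
}
```

For every choice of operands, `bits_value(s, n + 1)` equals the sum of the two operand values and `cin`, which is the statement of the theorem.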
 

2.2.2 Integers

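For strings $a[n-1:0]$, we use the notation $\overline{a} = (\overline{a_{n-1}}, \ldots, \overline{a_0})$, e.g., $\overline{10^4} = 01111$, and we denote by

\[ [a] = -a_{n-1} \cdot 2^{n-1} + \langle a[n-2:0] \rangle \]

the integer with two's complement representation $a$. Obviously, we have

\[ [a] \in \{-2^{n-1}, \ldots, 2^{n-1} - 1\}. \]

We denote by $T_n = \{-2^{n-1}, \ldots, 2^{n-1} - 1\}$ the range of numbers which have a two's complement representation of length $n$. For $x \in T_n$ and $a \in \{0,1\}^n$ with $x = [a]$, we denote by

\[ two_n(x) = a \]

the $n$-bit two's complement representation of $x$. A two's complement number is a string which is interpreted as a two's complement representation of a number. Obviously,

\[ [a] < 0 \Leftrightarrow a_{n-1} = 1. \]

The leading bit of a two's complement number is therefore called its sign bit. The basic properties of two's complement numbers are summarized in

Theorem 2.2  Let $a = a[n-1:0]$, then

\begin{align*}
[0a] &= \langle a \rangle \\
[a] &\equiv \langle a[n-2:0] \rangle \bmod 2^{n-1} \\
[a] &\equiv \langle a \rangle \bmod 2^n \\
[a_{n-1} a] &= [a] \quad \text{(sign extension)} \\
-[a] &= [\overline{a}] + 1.
\end{align*}

Proof. The first two equations are obvious. An easy calculation shows that

\[ \langle a \rangle - [a] = a_{n-1} \cdot 2^n; \]

this shows the third equation.

\begin{align*}
[a_{n-1} a] &= -a_{n-1} \cdot 2^n + \langle a[n-1:0] \rangle \\
&= -a_{n-1} \cdot 2^n + a_{n-1} \cdot 2^{n-1} + \langle a[n-2:0] \rangle \\
&= [a].
\end{align*}

This proves the fourth equation.

\begin{align*}
[\overline{a_{n-1}} \ldots \overline{a_0}]
&= -\overline{a_{n-1}} \cdot 2^{n-1} + \sum_{i=0}^{n-2} \overline{a_i} \cdot 2^i \\
&= -(1 - a_{n-1}) \cdot 2^{n-1} + \sum_{i=0}^{n-2} (1 - a_i) \cdot 2^i \\
&= -2^{n-1} + a_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} 2^i - \sum_{i=0}^{n-2} a_i \cdot 2^i \\
&= -2^{n-1} + a_{n-1} \cdot 2^{n-1} + 2^{n-1} - 1 - \langle a[n-2:0] \rangle \\
&= -[a[n-1:0]] - 1.
\end{align*}

This proves the last equation.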
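
The definitions of $[a]$ and $two_n(x)$ can be sketched in C as follows; the array layout (`a[i]` holds bit $a_i$) and the function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* [a]: value of a[n-1:0] read as a two's complement number. */
int64_t twoc_value(const int *a, int n) {
    assert(n >= 1 && n < 63);
    int64_t v = 0;
    for (int i = n - 2; i >= 0; i--)
        v = 2 * v + a[i];                              /* <a[n-2:0]> */
    return v - (int64_t)a[n - 1] * ((int64_t)1 << (n - 1));
}

/* two_n(x): store the n-bit two's complement representation of x in a[n-1:0]. */
void twoc_repr(int64_t x, int n, int *a) {
    assert(n >= 1 && n < 63);
    assert(x >= -((int64_t)1 << (n - 1)) && x < ((int64_t)1 << (n - 1)));
    uint64_t u = (uint64_t)x;                          /* conversion is modulo 2^64 */
    for (int i = 0; i < n; i++)
        a[i] = (int)((u >> i) & 1);
}
```

Sign extension and the negation rule of the theorem can be checked with these two functions by duplicating the leading bit or by inverting all bits before calling `twoc_value`.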

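Subtraction  The basic subtraction algorithm for $n$-bit binary numbers $a$ and $b$ in the case where the result is nonnegative works as follows:

1. Add the binary numbers $a$, $\overline{b}$ and 1.

2. Throw away the leading bit of the result.

We want to perform the subtraction $1100 - 0101$, i.e., $12 - 5 = 7$. We compute

1. $1100 - 0101 \rightarrow 1100 + 1010 + 1 = 1100 + 1011 = 10111$
2. We discard the leading bit and state that the result is $0111 = 7$.

This is reassuring but it does not prove anything. In order to see why the algorithm works, observe that $\langle a \rangle - \langle b \rangle \geq 0$ implies $\langle a \rangle - \langle b \rangle \in \{0, \ldots, 2^n - 1\}$. Thus, it suffices to compute the result modulo $2^n$, i.e., throwing away the leading bit does not hurt. The correctness of the algorithm now immediately follows from

Lemma 2.3  Let $a = a[n-1:0]$ and $b = b[n-1:0]$, then

\[ \langle a \rangle - \langle b \rangle \equiv \langle a \rangle + \langle \overline{b} \rangle + 1 \bmod 2^n. \]

Proof.

\begin{align*}
\langle a \rangle - \langle b \rangle &= \langle a \rangle - [0b] \\
&= \langle a \rangle + [1\overline{b}] + 1 \\
&\equiv \langle a \rangle + \langle \overline{b} \rangle + 1 \bmod 2^n.
\end{align*}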
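
The two steps of the subtraction algorithm can be sketched in C on machine words. The function name `sub_bits` and the use of `uint32_t` to hold $n$-bit strings are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Subtracts the n-bit binary numbers a and b by adding a, the bitwise
   complement of b, and 1, and then dropping the leading bit.
   The result equals <a> - <b> modulo 2^n. */
uint32_t sub_bits(uint32_t a, uint32_t b, int n) {
    assert(n >= 1 && n < 32);
    uint32_t mask = ((uint32_t)1 << n) - 1;
    assert(a <= mask && b <= mask);
    uint32_t not_b = ~b & mask;        /* n-bit complement of b        */
    uint32_t sum = a + not_b + 1;      /* (n+1)-bit result of step 1   */
    return sum & mask;                 /* step 2: drop the leading bit */
}
```

For the example above, `sub_bits(12, 5, 4)` returns 7.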

 
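The salient point about two's complement numbers is that addition algorithms for the addition of $n$-bit binary numbers work just fine for $n$-bit two's complement numbers as long as the result of the addition stays in the range $T_n$. This is not completely surprising, because the last $n-1$ bits of $n$-bit two's complement numbers are interpreted exactly as binary numbers. The following theorem makes this precise.

Theorem 2.4  Let $a = a[n-1:0]$, $b = b[n-1:0]$ and let $c_{in} \in \{0,1\}$. Let $\langle s[n:0] \rangle = \langle a[n-1:0] \rangle + \langle b[n-1:0] \rangle + c_{in}$ and let the bits $c_i$ and $s_i$ be defined as in the basic addition algorithm for binary numbers. Then

\[ [a] + [b] + c_{in} \in T_n \Leftrightarrow c_{n-1} = c_{n-2}. \]

If $[a] + [b] + c_{in} \in T_n$, then $[a] + [b] + c_{in} = [s[n-1:0]]$.

Proof.

\begin{align*}
[a] + [b] + c_{in}
&= -2^{n-1} (a_{n-1} + b_{n-1}) + \langle a[n-2:0] \rangle + \langle b[n-2:0] \rangle + c_{in} \\
&= -2^{n-1} (a_{n-1} + b_{n-1}) + \langle c_{n-2}\, s[n-2:0] \rangle \\
&= -2^{n-1} (a_{n-1} + b_{n-1} + c_{n-2} - 2 \cdot c_{n-2}) + \langle s[n-2:0] \rangle \\
&= -2^{n-1} (\langle c_{n-1} s_{n-1} \rangle - 2 \cdot c_{n-2}) + \langle s[n-2:0] \rangle \\
&= -2^n (c_{n-1} - c_{n-2}) + [s[n-1:0]].
\end{align*}

One immediately verifies

\[ -2^n (c_{n-1} - c_{n-2}) + [s[n-1:0]] \in T_n \Leftrightarrow c_{n-1} = c_{n-2} \]

and the theorem follows.

Observe that for $a = a[n-1:0]$ and $b = b[n-1:0]$ we have

\[ [a] + [b] + c_{in} \in T_{n+1}. \]

Thus, if we perform the binary addition

\[ \langle a_{n-1} a \rangle + \langle b_{n-1} b \rangle + c_{in} = \langle s[n+1:0] \rangle \]

then we always get

\[ [a] + [b] + c_{in} = [s[n:0]]. \]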
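
The overflow criterion $c_{n-1} \neq c_{n-2}$ of the theorem can be sketched in C by running the basic addition algorithm and remembering the last two carries. The function name `add_twoc` and the array layout (`a[i]` holds bit $a_i$) are assumptions made for this illustration.

```c
#include <assert.h>

/* Adds a[n-1:0], b[n-1:0] and cin with the basic addition algorithm.
   Stores the sum bits s[n-1:0] and returns the overflow flag
   c_{n-1} xor c_{n-2}. */
int add_twoc(const int *a, const int *b, int cin, int n, int *s) {
    assert(n >= 1 && (cin == 0 || cin == 1));
    int c = cin;        /* current carry, starting with c_{-1} = cin */
    int c_prev = cin;   /* carry of the previous position            */
    for (int i = 0; i < n; i++) {
        int sum = a[i] + b[i] + c;
        c_prev = c;     /* becomes c_{i-1} */
        s[i] = sum & 1;
        c = sum >> 1;   /* becomes c_i     */
    }
    return c ^ c_prev;
}
```

Whenever the returned flag is 0, the bits `s[n-1:0]` read as a two's complement number equal the exact sum of the operands and `cin`.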

   

2.3 Basic Circuits

In this section a number of basic building blocks for processors are constructed.

2.3.1 Trivial Constructions

One calls $n$ multiplexers with a common select line $sl$ an $n$-bit multiplexer or $n$-bit mux. Similarly, $n$ flipflops with common clock enable line $ce$ are called an $n$-bit register, and $n$ tristate drivers with a common output enable line $oe$ are called an $n$-bit driver.
For $x = x[n-1:0]$, we defined $\overline{x} = (\overline{x_{n-1}}, \ldots, \overline{x_0})$. For $a = a[n-1:0]$, $b = b[n-1:0]$ and $\circ \in \{$AND, OR, NAND, NOR, XOR, XNOR$\}$, we define

\[ a \circ b = (a_{n-1} \circ b_{n-1}, \ldots, a_0 \circ b_0). \]

The circuit in figure 2.6 (a) has inputs $a[n-1:0]$ and outputs $b[n-1:0] = \overline{a}$. It is called an $n$-bit inverter. The circuit in figure 2.6 (b) has inputs $a[n-1:0]$, $b[n-1:0]$ and outputs $c[n-1:0] = a \circ b$. It is called an $n$-bit $\circ$-gate.
For $a \in \{0,1\}$, $b = b[n-1:0]$ and $\circ \in \{$AND, OR, NAND, NOR, XOR, XNOR$\}$, we define

\[ a \circ b = a^n \circ b = (a \circ b_{n-1}, \ldots, a \circ b_0). \]

The circuit in figure 2.6 (c) has inputs $a$, $b[n-1:0]$ and outputs $c = a \circ b$. The circuit consists of an $n$-bit $\circ$-gate where all inputs $a_i$ are tied to the same bit $a$.
For $\circ \in \{$AND, OR$\}$, a balanced tree of $n-1$ many $\circ$-gates has inputs $a[n-1:0]$ and output $b = a_{n-1} \circ \ldots \circ a_0$. It is called an $n$-input $\circ$-tree.
The cost and the delay of the above trivial constructions are summarized in table 2.4. The symbols of these constructions are depicted in figure 2.7.

Figure 2.6: Circuits of an $n$-bit inverter (a) and of an $n$-bit $\circ$-gate (b). The circuit (c) computes $a \circ b[n-1:0]$.

Figure 2.7: Symbols of an $n$-bit register (a), an $n$-bit mux (b), an $n$-bit tristate driver (c), an $n$-bit $\circ$-gate (d, e), and an $n$-input $\circ$-tree (f). In (e), all the inputs $a_i$ are tied to one bit $a$.

Table 2.4: Cost and delay of the basic $n$-bit components listed in figure 2.7.

         n-bit register   n-bit mux          n-bit driver        n-bit ∘-gate       n-input ∘-tree
cost     $n \cdot C_{ff}$   $n \cdot C_{mux}$   $n \cdot C_{driv}$   $n \cdot C_\circ$   $(n-1) \cdot C_\circ$
delay    $D_{ff}$           $D_{mux}$           $D_{driv}$           $D_\circ$           $\lceil \log n \rceil \cdot D_\circ$

2.3.2 Testing for Zero or Equality

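An $n$-zero tester is a circuit with input $a[n-1:0]$ and output

\[ b = \overline{a_{n-1} \vee \ldots \vee a_0}. \]

The obvious realization is an $n$-bit OR-tree, where the output gate is replaced by a NOR gate. Thus,

\begin{align*}
C_{zero}(n) &= (n-2) \cdot C_{or} + C_{nor} \\
D_{zero}(n) &= (\lceil \log n \rceil - 1) \cdot D_{or} + D_{nor}.
\end{align*}

An $n$-equality tester is a circuit with inputs $a[n-1:0]$ and $b[n-1:0]$ and output $c$ such that

\[ c = 1 \Leftrightarrow a[n-1:0] = b[n-1:0]. \]

Since $a_i = b_i$ is equivalent to $a_i \oplus b_i = 0$, the equality test can also be expressed as

\[ c = 1 \Leftrightarrow a[n-1:0] \oplus b[n-1:0] = 0^n. \]

Thus, the obvious realization is to combine the two operands bitwise by XOR and to pass the result through an $n$-zero tester:

\begin{align*}
C_{equal}(n) &= n \cdot C_{xor} + C_{zero}(n) \\
D_{equal}(n) &= D_{xor} + D_{zero}(n).
\end{align*}

2.3.3 Decoders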

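An $n$-decoder is a circuit with inputs $x[n-1:0]$ and outputs $Y[2^n-1:0]$ such that for all $i$

\[ Y_i = 1 \Leftrightarrow \langle x \rangle = i. \]

A recursive construction with delay logarithmic in $n$ is depicted in figure 2.8. Let $k = \lceil n/2 \rceil$ and $l = \lfloor n/2 \rfloor$. The correctness of the construction is shown by induction on $n$. For the induction step one argues

\begin{align*}
Y[2^k \cdot i + j] = 1 &\Leftrightarrow V[i] = 1 \wedge U[j] = 1 \\
&\Leftrightarrow \langle x[n-1:k] \rangle = i \wedge \langle x[k-1:0] \rangle = j \\
&\Leftrightarrow \langle x[n-1:k]\, x[k-1:0] \rangle = 2^k \cdot i + j.
\end{align*}

Figure 2.8: Recursive definition of an $n$-decoder circuit. For $n > 1$, a decoder dec($k$) computes $U[K-1:0]$ from $x[k-1:0]$ with $K = 2^k$, a decoder dec($l$) computes $V[L-1:0]$ from $x[n-1:k]$ with $L = 2^l$, and $Y[K \cdot i + j] = V[i] \wedge U[j]$.

The cost and delay of this decoder circuit run at

\begin{align*}
C_{dec}(1) &= C_{inv} \\
C_{dec}(n) &= C_{dec}(\lceil n/2 \rceil) + C_{dec}(\lfloor n/2 \rfloor) + 2^n \cdot C_{and} \\
D_{dec}(1) &= D_{inv} \\
D_{dec}(n) &= D_{dec}(\lceil n/2 \rceil) + D_{and}.
\end{align*}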
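
The recursive decoder and its cost and delay recurrences can be sketched in C as follows. The gate parameters in the `enum`, the bound `MAX_N` and all function names are assumptions made for this illustration; the authoritative parameters are those of table 2.1.

```c
#include <assert.h>

/* Assumed gate parameters, for illustration only. */
enum { C_INV = 1, C_AND = 2, D_INV = 1, D_AND = 2 };

#define MAX_N 10                               /* largest supported n   */
#define MAX_HALF (1 << ((MAX_N + 1) / 2))      /* outputs of a half-size decoder */

/* C_dec(n) = C_dec(ceil(n/2)) + C_dec(floor(n/2)) + 2^n * C_and */
long long cost_dec(int n) {
    assert(n >= 1 && n <= 20);
    if (n == 1) return C_INV;
    return cost_dec((n + 1) / 2) + cost_dec(n / 2) + (1LL << n) * C_AND;
}

/* D_dec(n) = D_dec(ceil(n/2)) + D_and */
int delay_dec(int n) {
    assert(n >= 1);
    if (n == 1) return D_INV;
    return delay_dec((n + 1) / 2) + D_AND;
}

/* Functional model: y[2^n-1:0] with y[i] = 1 iff <x[n-1:0]> = i. */
void decode(const int *x, int n, int *y) {
    assert(n >= 1 && n <= MAX_N);
    if (n == 1) { y[0] = !x[0]; y[1] = x[0]; return; }
    int k = (n + 1) / 2, l = n / 2;
    int u[MAX_HALF], v[MAX_HALF];
    decode(x, k, u);                           /* U from x[k-1:0] */
    decode(x + k, l, v);                       /* V from x[n-1:k] */
    for (int i = 0; i < (1 << l); i++)
        for (int j = 0; j < (1 << k); j++)
            y[(i << k) + j] = v[i] & u[j];     /* Y[2^k * i + j]  */
}
```

With the assumed parameters, `cost_dec(4)` evaluates to 52 and `delay_dec(4)` to 5.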

Half Decoder  An $n$-half decoder is a circuit with inputs $x[n-1:0]$ and outputs $Y[2^n-1:0]$ such that

\[ Y[2^n-1:0] = 0^{2^n - \langle x \rangle}\, 1^{\langle x \rangle}. \]

Thus, input $x$ turns on the $\langle x \rangle$ low order bits of the output of the half decoder.
Let $L$ denote the lower half and $H$ the upper half of the index range $[2^n-1:0]$:

\[ L = [2^{n-1}-1 : 0], \qquad H = [2^n-1 : 2^{n-1}]. \]

With these abbreviations, figure 2.9 shows a recursive construction of a half decoder. The cost and the delay are

\begin{align*}
C_{hdec}(1) &= 0 \\
C_{hdec}(n) &= C_{hdec}(n-1) + 2^{n-1} \cdot (C_{and} + C_{or}) \\
D_{hdec}(1) &= 0 \\
D_{hdec}(n) &= D_{hdec}(n-1) + \max\{D_{and}, D_{or}\}.
\end{align*}

Figure 2.9: Recursive definition of an $n$-half decoder circuit.

In the induction step of the correctness proof the last $\langle x[n-2:0] \rangle$ bits of $U$ are set to one by induction hypothesis. If $x_{n-1} = 0$, then

\begin{align*}
\langle x \rangle &= \langle x[n-2:0] \rangle, \\
Y[H] &= 0^{2^{n-1}} \quad \text{and} \\
Y[L] &= U.
\end{align*}

If $x_{n-1} = 1$, then

\begin{align*}
\langle x \rangle &= 2^{n-1} + \langle x[n-2:0] \rangle, \\
Y[H] &= U \quad \text{and} \\
Y[L] &= 1^{2^{n-1}}.
\end{align*}

Thus, in both cases the last $\langle x \rangle$ bits of $Y$ are one.

2.3.4 Leading Zero Counter

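For strings $x$, we denote by $lz(x)$ the number of leading zeros of $x$. Let $n = 2^m$ be a power of two. An $n$-leading zero counter is a circuit with inputs $x[n-1:0]$ and outputs $y[m:0]$ satisfying $\langle y \rangle = lz(x)$.
Figure 2.10 shows a recursive construction for $n$-leading zero counters. For the induction step of the correctness proof we use the abbreviations

\begin{align*}
H &= [n-1 : n/2] \\
L &= [n/2-1 : 0] \\
\langle y_H \rangle &= lz(x[H]) \quad \text{and} \\
\langle y_L \rangle &= lz(x[L]).
\end{align*}

Figure 2.10: Recursive definition of an $n$-leading zero counter.

Thus,

\begin{align*}
lz(x[H]\,x[L]) &= \begin{cases} lz(x[H]) & \text{if } lz(x[H]) < 2^{m-1} \\ 2^{m-1} + lz(x[L]) & \text{if } lz(x[H]) = 2^{m-1} \end{cases} \\
&= \begin{cases} \langle 0\, y_H[m-1:0] \rangle & \text{if } y_H[m-1] = 0 \\ \langle z \rangle & \text{if } y_H[m-1] = 1 \end{cases}
\end{align*}

where

\begin{align*}
\langle z \rangle &= \langle 10^{m-1} \rangle + \langle y_L[m-1:0] \rangle \\
&= \begin{cases} \langle 01\, y_L[m-2:0] \rangle & \text{if } y_L[m-1] = 0 \\ \langle 10\, y_L[m-2:0] \rangle & \text{if } y_L[m-1] = 1 \end{cases} \\
&= \langle y_L[m-1]\; \overline{y_L[m-1]}\; y_L[m-2:0] \rangle.
\end{align*}

Cost and delay of this circuit are

\begin{align*}
C_{lz}(1) &= C_{inv} \\
C_{lz}(n) &= 2 \cdot C_{lz}(n/2) + C_{mux}(m-1) + C_{inv} \\
D_{lz}(1) &= D_{inv} \\
D_{lz}(n) &= D_{lz}(n/2) + D_{inv} + D_{mux}.
\end{align*}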
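
The case distinction of the correctness proof can be sketched as a recursive C function. This is a functional model only: the name `lz_count` and the array layout (`x[i]` holds bit $x_i$) are assumptions made for this illustration, and unlike the circuit, the code evaluates the lower half only when it is needed.

```c
#include <assert.h>

/* Number of leading zeros of x[n-1:0], where n is a power of two. */
int lz_count(const int *x, int n) {
    assert(n >= 1 && (n & (n - 1)) == 0);
    if (n == 1) return !x[0];
    int half = n / 2;
    int y_h = lz_count(x + half, half);   /* lz(x[H]), H = [n-1 : n/2]   */
    if (y_h < half) return y_h;           /* a one occurs in the upper half */
    return half + lz_count(x, half);      /* 2^(m-1) + lz(x[L])          */
}
```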

   

2.4 Arithmetic Circuits

We use three varieties of adders: carry chain adders, conditional sum adders, and carry lookahead adders.

2.4.1 Carry Chain Adders

A full adder is a circuit with inputs $a, b, c$ and outputs $c', s$ satisfying

\[ \langle c's \rangle = a + b + c. \]

Figure 2.11: Circuit of the $n$-bit carry chain adder CCA.

Table 2.5: Functionality of a half adder.

a  c    c'  s
0  0    0   0
0  1    0   1
1  0    0   1
1  1    1   0

Full adders implement one step of the basic addition algorithm for binary numbers as illustrated in table 2.3 of section 2.2. The circuit in figure 2.2 of section 2.1 happens to be a full adder with the following cost and delay

\begin{align*}
C_{FA} &= 2 \cdot C_{xor} + 2 \cdot C_{and} + C_{or} \\
D_{FA} &= D_{xor} + \max\{D_{xor},\, D_{and} + D_{or}\}.
\end{align*}

An $n$-adder is a circuit with inputs $a[n-1:0]$, $b[n-1:0]$, $c_{in}$ and outputs $s[n:0]$ satisfying

\[ \langle a \rangle + \langle b \rangle + c_{in} = \langle s \rangle. \]

The most obvious adder construction implements directly the basic addition algorithm: by cascading $n$ full adders as depicted in figure 2.11, one obtains a carry chain adder. Such adders are cheap but slow, and we therefore do not use them.
A half adder is a circuit with inputs $a, c$ and outputs $c', s$ satisfying

\[ \langle c's \rangle = a + c. \]

The behavior of half adders is illustrated in table 2.5. As we have

\[ s = a \oplus c \quad \text{and} \quad c' = a \wedge c, \]

the obvious realization of half adders consists of one AND gate and one XOR gate, as depicted in figure 2.12.

Figure 2.12: Circuit of a half adder HA.

Figure 2.13: Circuit of an $n$-carry chain incrementer CCI.

An $n$-incrementer is a circuit with inputs $a[n-1:0]$, $c_{in}$ and outputs $s[n:0]$ satisfying

\[ \langle a \rangle + c_{in} = \langle s \rangle. \]

By cascading $n$ half adders as depicted in figure 2.13 (b), one obtains a carry chain incrementer with the following cost and delay:

\begin{align*}
C_{CCI}(n) &= n \cdot (C_{xor} + C_{and}) \\
D_{CCI}(n) &= (n-1) \cdot D_{and} + \max\{D_{xor}, D_{and}\}.
\end{align*}

The correctness proof for this construction follows exactly the lines of the correctness proof for the basic addition algorithm.

2.4.2 Conditional Sum Adders

The simplest construction for conditional sum adders is shown in figure 2.14. Let $m = \lceil n/2 \rceil$ and $k = \lfloor n/2 \rfloor$ and write

\[ s[n:0] = s[n:m]\; s[m-1:0], \]

then

\begin{align*}
\langle s[n:m] \rangle &= \langle a[n-1:m] \rangle + \langle b[n-1:m] \rangle + c_{m-1} \\
&= \begin{cases} \langle a[n-1:m] \rangle + \langle b[n-1:m] \rangle & \text{if } c_{m-1} = 0 \\ \langle a[n-1:m] \rangle + \langle b[n-1:m] \rangle + 1 & \text{if } c_{m-1} = 1. \end{cases}
\end{align*}

Figure 2.14: Simple version of an $n$-bit conditional sum adder; $m = \lceil n/2 \rceil$ and $k = \lfloor n/2 \rfloor$.

Thus, the high order sum bits in figure 2.14 are computed twice: the sum bits $s^0[n:m]$ are for the case $c_{m-1} = 0$ and bits $s^1[n:m]$ are for the case $c_{m-1} = 1$. The final selection is done once $c_{m-1}$ is known.
This construction should not be repeated recursively because halving the problem size requires 3 copies of hardware for the half sized problem and the muxes. Ignoring the muxes and assuming $n = 2^\nu$ is a power of two, one obtains for the cost $c(n)$ of an $n$-adder constructed in this manner the estimate

\begin{align*}
c(n) &\approx 3 \cdot c(n/2) \\
&\approx 3^\nu \cdot c(1) = 2^{\nu \log 3} \cdot c(1) \\
&= n^{\log 3} \cdot c(1) \approx n^{1.58} \cdot c(1).
\end{align*}

This is too expensive.
For incrementers things look better. The high order sum bits of incrementers are

\begin{align*}
\langle s[n:m] \rangle &= \langle a[n-1:m] \rangle + c_{m-1} \\
&= \begin{cases} \langle a[n-1:m] \rangle & \text{if } c_{m-1} = 0 \\ \langle a[n-1:m] \rangle + 1 & \text{if } c_{m-1} = 1. \end{cases}
\end{align*}

This leads to the very simple construction of figure 2.15. Our incrementer of choice will be constructed in this way using carry chain incrementers for solving the subproblems of size $k$ and $m$. Such an $n$-incrementer then has the following cost and delay

\begin{align*}
C_{inc}(n) &= C_{CCI}(m) + C_{CCI}(k) + C_{mux}(k+1) \\
D_{inc}(n) &= D_{CCI}(m) + D_{mux}.
\end{align*}

Figure 2.15: An $n$-bit conditional sum incrementer; $m = \lceil n/2 \rceil$ and $k = \lfloor n/2 \rfloor$.

Note that in figure 2.15, the original problem is reduced to only two problems of half the size of the original problem. Thus, this construction could be applied recursively with reasonable cost (see exercise 2.1). One then obtains a very fast conditional sum incrementer CSI.
Indeed, a recursive construction of simple conditional sum adders turns out to be so expensive because disjoint circuits are used for the computation of the candidate high order sum bits $s^0[n:m]$ and $s^1[n:m]$. This flaw can be remedied if one constructs adders which compute both the sum and the sum $+1$ of the operands $a$ and $b$.
An $n$-compound adder is a circuit with inputs $a[n-1:0]$, $b[n-1:0]$ and outputs $s^0[n:0]$, $s^1[n:0]$ satisfying

\begin{align*}
\langle s^0 \rangle &= \langle a \rangle + \langle b \rangle \\
\langle s^1 \rangle &= \langle a \rangle + \langle b \rangle + 1.
\end{align*}

A recursive construction of the $n$-compound adders is shown in figure 2.16. It will turn out to be useful in the rounders of floating point units. Note that only two copies of hardware for the half sized problem are used. Cost and delay of the construction are

\begin{align*}
C_{add2}(1) &= C_{xor} + C_{xnor} + C_{and} + C_{or} \\
C_{add2}(n) &= C_{add2}(k) + C_{add2}(m) + 2 \cdot C_{mux}(k+1) \\
D_{add2}(1) &= \max\{D_{xor}, D_{xnor}, D_{and}, D_{or}\} \\
D_{add2}(n) &= D_{add2}(m) + D_{mux}(k+1).
\end{align*}
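
The recursion behind the compound adder can be sketched in C as follows. The bound `MAX_N`, the array layout (`a[i]` holds bit $a_i$) and the function name `add2` as a C identifier are assumptions made for this illustration; the base case uses the four gates that appear in $C_{add2}(1)$.

```c
#include <assert.h>

#define MAX_N 32   /* largest supported operand width (assumption) */

/* Computes s0[n:0] = <a> + <b> and s1[n:0] = <a> + <b> + 1. */
void add2(const int *a, const int *b, int n, int *s0, int *s1) {
    assert(n >= 1 && n <= MAX_N);
    if (n == 1) {
        s0[0] = a[0] ^ b[0];        /* XOR  */
        s0[1] = a[0] & b[0];        /* AND  */
        s1[0] = !(a[0] ^ b[0]);     /* XNOR */
        s1[1] = a[0] | b[0];        /* OR   */
        return;
    }
    int m = (n + 1) / 2, k = n / 2;
    int lo0[MAX_N + 1], lo1[MAX_N + 1], hi0[MAX_N + 1], hi1[MAX_N + 1];
    add2(a, b, m, lo0, lo1);             /* low part; carry out at index m */
    add2(a + m, b + m, k, hi0, hi1);     /* high part, k+1 result bits     */
    for (int i = 0; i < m; i++) { s0[i] = lo0[i]; s1[i] = lo1[i]; }
    for (int i = 0; i <= k; i++) {       /* two (k+1)-bit muxes            */
        s0[m + i] = lo0[m] ? hi1[i] : hi0[i];
        s1[m + i] = lo1[m] ? hi1[i] : hi0[i];
    }
}
```

Only two half-size instances are evaluated per level, because both candidate high order sums come from the same recursive call.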

Figure 2.16: An $n$-compound adder add2($n$); $m = \lceil n/2 \rceil$, $k = \lfloor n/2 \rfloor$.

Figure 2.17: The recursive specification of an $n$-fold parallel prefix circuit of the function $\circ$ for an even $n$.

2.4.3 Parallel Prefix Computation

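Let $\circ : M \times M \rightarrow M$ be an associative, dyadic function. Its $n$-fold parallel prefix function $PP_\circ(n) : M^n \rightarrow M^n$ maps $n$ inputs $x_1, \ldots, x_n$ into $n$ results $y_1, \ldots, y_n$ with $y_i = x_1 \circ \ldots \circ x_i$.
A recursive construction of efficient parallel prefix circuits based on $\circ$-gates is shown in figure 2.17 for the case that $n$ is even. If $n$ is odd, then one realizes $PP_\circ(n-1)$ by the construction in figure 2.17 and one computes $Y_{n-1} = X_{n-1} \circ Y_{n-2}$ in a straightforward way using one extra $\circ$-gate.
The correctness of the construction can be easily seen. From

\[ X'_i = X_{2i+1} \circ X_{2i} \]

it follows

\[ Y'_i = X'_i \circ \ldots \circ X'_0 = X_{2i+1} \circ \ldots \circ X_0 = Y_{2i+1}. \]

The computation of the outputs

\[ Y_{2i} = X_{2i} \circ Y_{2i-1} \]

is straightforward. For cost and delay, we get

\begin{align*}
C_{PP_\circ}(1) &= 0 \\
C_{PP_\circ}(n) &= C_{PP_\circ}(\lfloor n/2 \rfloor) + (n-1) \cdot C_\circ \\
D_{PP_\circ}(1) &= 0 \\
D_{PP_\circ}(n) &= D_{PP_\circ}(\lfloor n/2 \rfloor) + 2 \cdot D_\circ.
\end{align*}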
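
The recursive parallel prefix construction can be sketched in C as follows. Elements of $M$ are modelled as `int`, the operation is passed as a function pointer whose first argument is the operand with the higher index, and `MAX_N`, `op_t`, `op_concat` and `parallel_prefix` are names assumed for this illustration. The sample operation `op_concat` (decimal concatenation of positive integers) is associative but not commutative, so it makes the operand order visible.

```c
#include <assert.h>

#define MAX_N 64   /* largest supported n (assumption) */

/* Associative operation; hi is the operand with the higher index. */
typedef int (*op_t)(int hi, int lo);

/* Sample operation: decimal concatenation, e.g. (43, 21) -> 4321. */
static int op_concat(int hi, int lo) {
    int p = 1;
    while (p <= lo) p *= 10;
    return hi * p + lo;
}

/* Computes y[i] = x[i] o ... o x[0] for i = 0, ..., n-1. */
void parallel_prefix(const int *x, int n, int *y, op_t op) {
    assert(n >= 1 && n <= MAX_N);
    if (n == 1) { y[0] = x[0]; return; }
    int h = n / 2;
    int xp[MAX_N / 2], yp[MAX_N / 2];
    for (int i = 0; i < h; i++)
        xp[i] = op(x[2 * i + 1], x[2 * i]);      /* X'_i = X_{2i+1} o X_{2i} */
    parallel_prefix(xp, h, yp, op);              /* Y'_i = Y_{2i+1}          */
    y[0] = x[0];
    for (int i = 0; i < h; i++)
        y[2 * i + 1] = yp[i];
    for (int i = 1; i < h; i++)
        y[2 * i] = op(x[2 * i], y[2 * i - 1]);   /* Y_{2i} = X_{2i} o Y_{2i-1} */
    if (n % 2)
        y[n - 1] = op(x[n - 1], y[n - 2]);       /* odd n: one extra gate    */
}
```

Each level applies the operation $n-1$ times, which matches the term $(n-1) \cdot C_\circ$ in the cost recurrence.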

2.4.4 Carry Lookahead Adders

For $a[n-1:0]$, $b[n-1:0]$ and indices $i, j$ with $i \leq j$ one defines

\[ p_{i,j}(a,b) = 1 \Leftrightarrow \langle a[j:i] \rangle + \langle b[j:i] \rangle = \langle 1^{j-i+1} \rangle. \]

This is the case if $c_j = c_{i-1}$, in other words if carry $c_{i-1}$ is propagated by positions $i$ to $j$ of the operands to position $j$. Similarly, one defines for $0 < i \leq j$:

\[ g_{i,j}(a,b) = 1 \Leftrightarrow \langle a[j:i] \rangle + \langle b[j:i] \rangle \geq \langle 10^{j-i+1} \rangle, \]

i.e., if positions $i$ to $j$ of the operands generate a carry $c_j$ independent of $c_{i-1}$. For $i = 0$ one has to account for $c_{in}$, thus one defines

\[ g_{0,j}(a,b,c_{in}) = 1 \Leftrightarrow \langle a[j:0] \rangle + \langle b[j:0] \rangle + c_{in} \geq \langle 10^{j+1} \rangle. \]

In the following calculations, we simply write $g_{i,j}$ and $p_{i,j}$, respectively.

Obviously, we have

\begin{align*}
p_{i,i} &= a_i \oplus b_i \\
g_{i,i} &= a_i \wedge b_i \quad \text{for } i > 0 \\
g_{0,0} &= (a_0 \wedge b_0) \vee (c_{in} \wedge (a_0 \vee b_0)).
\end{align*}

Suppose one has already computed the generate and propagate signals for the adjacent intervals of indices $[i:j]$ and $[j+1:k]$, where $i \leq j < k$. The signals for the combined interval $[i:k]$ can then be computed as

\begin{align*}
p_{i,k} &= p_{i,j} \wedge p_{j+1,k} \\
g_{i,k} &= g_{j+1,k} \vee (g_{i,j} \wedge p_{j+1,k}).
\end{align*}

This computation can obviously be performed by the circuit in figure 2.18 which takes inputs $(g_1, p_1)$ and $(g_2, p_2)$ from $M = \{0,1\}^2$ to output

\begin{align*}
(g, p) &= (g_2, p_2) \circ (g_1, p_1) \\
&= (g_2 \vee (g_1 \wedge p_2),\; p_1 \wedge p_2) \in M.
\end{align*}

Figure 2.18: Circuit $\circ$, to be used in the carry lookahead adder.

Figure 2.19: Circuit of an $n$-bit carry lookahead adder.

A simple exercise shows that the operation $\circ$ defined in this way is associative (for details see, e.g., [KP95]).
Hence, figure 2.18 can be substituted as a $\circ$-gate in the parallel prefix circuits of the previous subsection. The point of this construction is that the $i$-th output of the parallel prefix circuit computes

\begin{align*}
(G_i, P_i) &= (g_i, p_i) \circ \ldots \circ (g_0, p_0) = (g_{i,0},\, p_{i,0}) \\
&= (c_i,\, p_{i,0}).
\end{align*}

It follows that the circuit in figure 2.19 is an adder. It is called a carry lookahead adder.
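The circuit in figure 2.18 has cost 6 and delay 4. We change the computation of output $g$ using

\[ g = g_2 \vee (g_1 \wedge p_2) = \overline{\overline{g_2} \wedge \overline{(g_1 \wedge p_2)}}. \]

For the cost and the delay of operation $\circ$ this gives

\begin{align*}
C_\circ &= C_{and} + C_{nand} + C_{nor} + C_{inv} = 7 \\
D_\circ &= \max\{D_{and},\, D_{nand} + \max\{D_{nor}, D_{inv}\}\} = 2.
\end{align*}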
Figure 2.20: Circuit of an $n$-bit arithmetic unit AU.

The cost and the delay of the whole CLA adder are

\begin{align*}
C_{CLA}(n) &= C_{PP_\circ}(n) + 2n \cdot C_{xor} + (n+1) \cdot C_{and} + C_{or} \\
D_{CLA}(n) &= D_{PP_\circ}(n) + 2 \cdot D_{xor} + D_{and} + D_{or}.
\end{align*}
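
The operation $\circ$ on generate and propagate pairs, and the adder built on it, can be sketched in C as follows. The type `gp_t` and the function names are assumptions made for this illustration. For brevity the prefixes $(G_i, P_i)$ are accumulated from right to left in a simple loop; in the circuit they are produced by the parallel prefix construction, which yields the same values because $\circ$ is associative.

```c
#include <assert.h>

typedef struct { int g, p; } gp_t;   /* generate and propagate signal */

/* (g, p) = (g2, p2) o (g1, p1), with hi = (g2, p2) and lo = (g1, p1). */
static gp_t gp_op(gp_t hi, gp_t lo) {
    gp_t r;
    r.g = hi.g | (lo.g & hi.p);
    r.p = lo.p & hi.p;
    return r;
}

/* Computes s[n:0] from a[n-1:0], b[n-1:0] and cin. */
void cla_add(const int *a, const int *b, int cin, int n, int *s) {
    assert(n >= 1 && (cin == 0 || cin == 1));
    gp_t acc = {0, 0};                           /* (G_i, P_i)            */
    int c_prev = cin;                            /* c_{i-1}               */
    for (int i = 0; i < n; i++) {
        gp_t x;
        x.g = a[i] & b[i];
        x.p = a[i] ^ b[i];
        if (i == 0)
            x.g |= cin & (a[0] | b[0]);          /* g_{0,0} accounts for cin */
        acc = (i == 0) ? x : gp_op(x, acc);
        s[i] = x.p ^ c_prev;                     /* s_i = p_i xor c_{i-1} */
        c_prev = acc.g;                          /* G_i = c_i             */
    }
    s[n] = c_prev;                               /* s_n = c_{n-1}         */
}
```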

2.4.5 Arithmetic Units

An $n$-bit arithmetic unit is a circuit with inputs $a[n-1:0]$, $b[n-1:0]$, $sub$ and outputs $s[n:0]$, $neg$, $ovf$. It performs operation

\[ op = \begin{cases} + & \text{if } sub = 0 \\ - & \text{if } sub = 1. \end{cases} \]

The sum outputs $s$ satisfy

\[ [s] = [a] \; op \; [b] \quad \text{if } [a] \; op \; [b] \in T_n. \]

The flag $ovf$ indicates that $[a] \; op \; [b] \notin T_n$, whereas flag $neg$ indicates that $[a] \; op \; [b] < 0$. This flag has to be correct even in the presence of an overflow. With the help of this flag one implements for instance instructions of the form "branch if $a > b$". In this case one wants to know the sign of $a - b$ even if $a - b$ is not representable with $n$ bits.
Figure 2.20 shows an implementation of an $n$-bit arithmetic unit. The equation

\[ -[b] = [\overline{b}] + 1 \]

translates into

\[ b' = b \oplus sub \quad \text{and} \quad c_{in} = sub. \]

The flag $neg$ is the sign bit of the sum $[a] \; op \; [b] \in T_{n+1}$. By the argument at the end of section 2.2, the desired flag is the sum bit $s_n$ of the addition

\[ \langle a_{n-1} a \rangle + \langle b'_{n-1} b' \rangle + c_{in} = \langle s[n+1:0] \rangle. \]

It can be computed as

\[ neg = s_n = c_{n-1} \oplus a_{n-1} \oplus b'_{n-1} = c_{n-1} \oplus p_{n-1}. \]

By theorem 2.4, we have $ovf = c_{n-1} \oplus c_{n-2}$. In the carry lookahead adder, all the carry bits are available, whereas the conditional sum adder only provides the final carry bit $c_{n-1}$. Since the most significant sum bit equals $s_{n-1} = p_{n-1} \oplus c_{n-2}$, an overflow can be checked by

\begin{align*}
ovf &= (c_{n-1} \oplus p_{n-1}) \oplus (c_{n-2} \oplus p_{n-1}) \\
&= s_{n-1} \oplus neg.
\end{align*}

Let add denote the binary adder of choice; the cost and the delay of the arithmetic unit can then be expressed as

\begin{align*}
C_{AU}(n) &= (n+2) \cdot C_{xor} + C_{add}(n) \\
D_{AU}(n) &= 3 \cdot D_{xor} + D_{add}(n).
\end{align*}
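
The computation of the sum and of the two flags can be sketched in C on machine words. The type `au_result_t`, the function name `au` and the use of `uint32_t` to hold $n$-bit strings are assumptions made for this illustration; only the low $n$ sum bits are returned.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t s; int neg; int ovf; } au_result_t;

/* n-bit arithmetic unit on two's complement operands a and b. */
au_result_t au(uint32_t a, uint32_t b, int sub, int n) {
    assert(n >= 2 && n <= 31 && (sub == 0 || sub == 1));
    uint32_t mask = ((uint32_t)1 << n) - 1;
    assert(a <= mask && b <= mask);
    uint32_t bx = sub ? (~b & mask) : b;               /* b xor sub, bitwise */
    uint32_t sum = a + bx + (uint32_t)sub;             /* cin = sub          */
    int c_top = (int)((sum >> n) & 1);                 /* c_{n-1}            */
    int p_top = (int)(((a ^ bx) >> (n - 1)) & 1);      /* p_{n-1}            */
    int s_top = (int)((sum >> (n - 1)) & 1);           /* s_{n-1}            */
    au_result_t r;
    r.s = sum & mask;
    r.neg = c_top ^ p_top;                             /* neg = c_{n-1} xor p_{n-1} */
    r.ovf = s_top ^ r.neg;                             /* ovf = s_{n-1} xor neg     */
    return r;
}
```

Note that `neg` reports the sign of the exact result even when `ovf` is set, which is what a comparison instruction needs.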

2.4.6 Shifter

For strings $a[n-1:0]$ and natural numbers $i \in \{0, \ldots, n-1\}$ we consider the functions

\begin{align*}
cls(a, i) &= a[n-i-1:0]\; a[n-1:n-i] \\
crs(a, i) &= a[i-1:0]\; a[n-1:i] \\
lrs(a, i) &= 0^i\; a[n-1:i].
\end{align*}

The function $cls$ is called a cyclic left shift, the function $crs$ is called a cyclic right shift, and the function $lrs$ is called logic right shift. We obviously have

\[ crs(a, i) = cls(a,\, n - i \bmod n). \]

Cyclic Left Shifter  An $(n,i)$-cyclic left shifter is a circuit with inputs $a[n-1:0]$, select input $s \in \{0,1\}$ and outputs $r[n-1:0]$ satisfying

\[ r = \begin{cases} cls(a, i) & \text{if } s = 1 \\ a & \text{otherwise.} \end{cases} \]

As shown in figure 2.21 such shifters can be built from $n$ muxes.
Let $n = 2^m$ be a power of two. An $n$-cyclic left shifter is a circuit with inputs $a[n-1:0]$, select inputs $b[m-1:0]$ and outputs $r[n-1:0]$ satisfying

\[ r = cls(a, \langle b \rangle). \]

Figure 2.21: $(n,i)$-cyclic left shifter.

Figure 2.22: Circuit of an $n$-cyclic left shifter CLS($n$).

Figure 2.23: Circuit of an $n$-cyclic right shifter CRS($n$).

Figure 2.24: $(n,i)$-logic right shifter.

Such shifters can be built by cascading $(n,i)$-cyclic left shifters for $i \in \{1, 2, 4, \ldots, 2^{m-1}\}$ as shown in figure 2.22.
By induction on $i$, one easily shows that the output $r^i$ of the $(n, 2^i)$-cyclic left shifter in figure 2.22 satisfies

\[ r^i = cls(a, \langle b[i:0] \rangle). \]
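
The cascade of $(n, 2^t)$-cyclic left shifters can be sketched in C as follows. The bound `MAX_N`, the array layout (`a[j]` holds bit $a_j$) and the function names are assumptions made for this illustration.

```c
#include <assert.h>

#define MAX_N 64   /* largest supported n (assumption) */

/* (n,i)-cyclic left shifter: n muxes with common select input s. */
static void cls_stage(const int *a, int n, int i, int s, int *r) {
    for (int j = 0; j < n; j++)
        r[j] = s ? a[(j - i + n) % n] : a[j];
}

/* n-cyclic left shifter for n = 2^m: r = cls(a, <b[m-1:0]>). */
void cls_shift(const int *a, int n, int m, const int *b, int *r) {
    assert(m >= 0 && n == (1 << m) && n <= MAX_N);
    int cur[MAX_N], next[MAX_N];
    for (int j = 0; j < n; j++) cur[j] = a[j];
    for (int t = 0; t < m; t++) {                 /* stage t shifts by 2^t if b[t] = 1 */
        cls_stage(cur, n, 1 << t, b[t], next);
        for (int j = 0; j < n; j++) cur[j] = next[j];
    }
    for (int j = 0; j < n; j++) r[j] = cur[j];
}
```

After stage $t$ the intermediate array holds $cls(a, \langle b[t:0] \rangle)$, which is the invariant used in the induction above.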

Cyclic Right Shifter  An $n$-cyclic right shifter is a circuit with inputs $a[n-1:0]$, select inputs $b[m-1:0]$ and outputs $r[n-1:0]$ satisfying

\[ r = crs(a, \langle b \rangle). \]

It can be built from an $n$-cyclic left shifter by the construction shown in figure 2.23. This works, because

\begin{align*}
n - \langle b \rangle &= n - [0b] = n + [1\overline{b}] + 1 \\
&\equiv \langle \overline{b} \rangle + 1 \bmod n.
\end{align*}

Logic Right Shifter  An $(n,i)$-logic right shifter is a circuit with inputs $a[n-1:0]$, select input $s \in \{0,1\}$ and outputs $r[n-1:0]$ satisfying

\[ r = \begin{cases} lrs(a, i) & \text{if } s = 1 \\ a & \text{otherwise.} \end{cases} \]

It can be built from $n$ muxes, as depicted in figure 2.24.
Let $n = 2^m$ be a power of two. An $n$-logic right shifter is a circuit with inputs $a[n-1:0]$, select inputs $b[m-1:0]$ and outputs $r[n-1:0]$ satisfying

\[ r = lrs(a, \langle b \rangle). \]

In analogy to the cyclic left shifter, the $n$-logic right shifter can be built by cascading the $(n,i)$-logic right shifters for $i \in \{1, 2, 4, \ldots, 2^{m-1}\}$.

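2.5 Multipliers

Let $a = a[n-1:0]$ and $b = b[m-1:0]$, then

\[ \langle a \rangle \cdot \langle b \rangle \leq (2^n - 1) \cdot (2^m - 1) \leq 2^{n+m} - 1. \qquad (2.3) \]

Thus, the product can be represented with $n + m$ bits.
An $(n,m)$-multiplier is a circuit with an $n$-bit input $a = a[n-1:0]$, an $m$-bit input $b = b[m-1:0]$, and an $(n+m)$-bit output $p = p[n+m-1:0]$ such that $\langle a \rangle \cdot \langle b \rangle = \langle p \rangle$ holds.

2.5.1 School Method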

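Obviously, one can write the product $\langle a \rangle \cdot \langle b \rangle$ as a sum of partial products

\[ \langle a \rangle \cdot \langle b \rangle = \sum_{t=0}^{m-1} \langle a \rangle \cdot b_t \cdot 2^t, \]

with

\[ \langle a \rangle \cdot b_t \cdot 2^t = \langle (a_{n-1} \wedge b_t), \ldots, (a_0 \wedge b_t),\, 0^t \rangle. \]

Thus, all partial products can be computed with cost $n \cdot m \cdot C_{and}$ and delay $D_{and}$. We denote by

\begin{align*}
S_{j,k} &= \sum_{t=j}^{j+k-1} \langle a \rangle \cdot b_t \cdot 2^t \\
&= \langle a \rangle \cdot \langle b[j+k-1:j] \rangle \cdot 2^j < 2^{n+k+j} \qquad (2.4)
\end{align*}

the sum of the $k$ partial products from position $j$ to position $j+k-1$. Because $S_{j,k}$ is a multiple of $2^j$ it has a binary representation with $j$ trailing zeros. Because $S_{j,k}$ is smaller than $2^{j+n+k}$ it has a binary representation of length $n + j + k$ (see figure 2.25). We have

\begin{align*}
S_{j,1} &= \langle a \rangle \cdot b_j \cdot 2^j \\
\langle a \rangle \cdot \langle b \rangle &= S_{0,m} \\
S_{j,k+h} &= S_{j,k} + S_{j+k,h} \\
S_{0,t} &= S_{0,t-1} + S_{t-1,1}.
\end{align*}

The last line suggests an obvious construction of multipliers comprising $m-1$ many $n$-adders. This construction corresponds to the school method for the multiplication of natural numbers. If one realizes the adders as carry chain adders, then cost and delay of this multiplier construction can be shown (see exercise 2.2) to be bounded by

\begin{align*}
C_{mul}(n,m) &\leq m \cdot n \cdot (C_{and} + C_{FA}) \\
D_{mul}(n,m) &\leq D_{and} + (m + n) \cdot D_{FA}.
\end{align*}

Figure 2.25: $S_{j,k}$, the sum of the $k$ partial products starting from position $j$. The representation has length $n + k + j$ and ends with $j$ zeros.

2.5.2 Carry Save Adders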

Let $x$ be a natural number and suppose the two binary numbers $s[n-1:0]$ and $t[n-1:0]$ satisfy

\[ \langle s \rangle + \langle t \rangle = x. \]

We then call $s, t$ a carry save representation of $x$ with length $n$.
A crucial building block for speeding up the summation of partial products are $n$-carry save adders. These are circuits with inputs $a[n-1:0]$, $b[n-1:0]$, $c[n-1:0]$ and outputs $s[n-1:0]$, $t[n:0]$ satisfying

\[ \langle a \rangle + \langle b \rangle + \langle c \rangle = \langle s \rangle + \langle t \rangle, \]

i.e., the outputs $s$ and $t$ are a carry save representation of the sum of the numbers represented at the inputs. As carry save adders compress the sum of three numbers to two numbers, which have the same sum, they are also called $n$-3/2-adders. Such adders are realized, as shown in figure 2.26, simply by putting $n$ full adders in parallel. This works, because

\begin{align*}
\langle a \rangle + \langle b \rangle + \langle c \rangle &= \sum_{i=0}^{n-1} (a_i + b_i + c_i) \cdot 2^i \\
&= \sum_{i=0}^{n-1} \langle t_{i+1} s_i \rangle \cdot 2^i \\
&= \sum_{i=0}^{n-1} (2 \cdot t_{i+1} + s_i) \cdot 2^i = \langle s \rangle + \langle t \rangle.
\end{align*}

The cost and the delay of such a carry save adder are

\begin{align*}
C_{3/2add}(n) &= n \cdot C_{FA} \\
D_{3/2add}(n) &= D_{FA}.
\end{align*}

The point of the construction is, of course, that the delay of carry save adders is independent of $n$.
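
A carry save adder can be sketched in C on machine words, where every bit position acts as one full adder. The function names and the use of `uint64_t` are assumptions made for this illustration. The second function is a hypothetical usage example: it sums the partial products $S_{j,1}$ of a 32 by 32 bit multiplication with a chain of carry save steps and performs a single ordinary addition at the end.

```c
#include <assert.h>
#include <stdint.h>

/* 3/2-adder: a + b + c = s + t, with t_0 = 0. */
void carry_save_add(uint64_t a, uint64_t b, uint64_t c, uint64_t *s, uint64_t *t) {
    assert(s != 0 && t != 0);
    *s = a ^ b ^ c;                               /* sum bits s_i                    */
    *t = ((a & b) | (a & c) | (b & c)) << 1;      /* carry bits t_{i+1}, one position up */
}

/* Product of a and b from partial products kept in carry save form. */
uint64_t mul_csa(uint32_t a, uint32_t b) {
    uint64_t s = 0, t = 0;
    for (int j = 0; j < 32; j++) {
        uint64_t pp = ((b >> j) & 1) ? ((uint64_t)a << j) : 0;   /* S_{j,1} */
        carry_save_add(s, t, pp, &s, &t);
    }
    return s + t;                                 /* one carry propagating addition */
}
```

No carry is propagated inside the loop; the only addition with a carry chain is the final `s + t`.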
Figure 2.26: Circuit of an $n$-bit carry save adder, i.e., of an $n$-3/2-adder.

Figure 2.27: Circuit of an $(n,m)$-multiplier: the partial products are fed into an $m$-operand addition tree, whose outputs $t$ and $s$ are summed by add($n+m$) to give $p[n+m-1:0]$.

2.5.3 Multiplication Arrays

An addition tree with $m$ operands is a circuit which takes as inputs $m$ binary numbers and which outputs a carry save representation of their sum. Using addition trees, one can construct $(n,m)$-multipliers as suggested in figure 2.27. First, one generates binary representations of the $m$ partial products $S_{t,1}$. These are fed into an addition tree with $m$ operands. The output of the tree is a carry save representation of the desired product. An ordinary adder then produces from the carry save representation the binary representation of the product.
We proceed to construct a particularly simple family of addition trees. In figure 2.28 (a) representations of the partial sums $S_{0,1}$, $S_{1,1}$ and $S_{2,1}$ are fed into an $n$-carry save adder. The result is a carry save representation of $S_{0,3}$ with length $n + 3$. In figure 2.28 (b) the representation of $S_{t-1,1}$ and a carry save representation of $S_{0,t-1}$ are fed into an $n$-carry save adder. The result is a carry save representation of $S_{0,t}$ with length $n + t$. By cascading $m - 2$ many $n$-carry save adders as suggested above, one obtains an addition tree which is also called a multiplication array because of its regular structure.
If the final addition is performed by an $(n+m)$-carry lookahead adder, one obtains an $(n,m)$-multiplier with the following cost and delay

\begin{align*}
C_{MULarray}(n,m) &= n \cdot m \cdot C_{and} + (m-2) \cdot C_{3/2add}(n) + C_{CLA}(n+m) \\
D_{MULarray}(n,m) &= D_{and} + (m-2) \cdot D_{3/2add}(n) + D_{CLA}(n+m).
\end{align*}

Figure 2.28: Generating a carry save representation of the partial sums $S_{0,3}$ (a) and $S_{0,t}$ (b).

2.5.4 4/2-Trees

The delay of multiplication arrays is proportional to $m$. The obvious next step is to balance the addition trees, hereby reducing the delay to $O(\log m)$. We use here a construction which is particularly regular and easy to analyze.
An $n$-4/2-adder is a circuit with inputs $a[n-1:0]$, $b[n-1:0]$, $c[n-1:0]$, $d[n-1:0]$ and outputs $s[n-1:0]$, $t[n-1:0]$ satisfying

\[ \langle a \rangle + \langle b \rangle + \langle c \rangle + \langle d \rangle \equiv \langle s \rangle + \langle t \rangle \bmod 2^n. \]

The obvious construction of $n$-4/2-adders from two $n$-3/2-adders is shown in figure 2.29. Its cost and delay are

\begin{align*}
C_{4/2add}(n) &= 2 \cdot C_{3/2add}(n) = 2 \cdot n \cdot C_{FA} \\
D_{4/2add}(n) &= 2 \cdot D_{3/2add}(n) = 2 \cdot D_{FA}.
\end{align*}

      


With the help of 4/2-adders, one constructs complete balanced addition trees by the recursive construction suggested in figure 2.30. Note that we do not specify the width of the operands yet. Thus, figure 2.30 is not yet a complete definition. Let $K \geq 2$ be a power of two. By this construction, one obtains addition trees $T(K)$ with $2K$ inputs. Such a 4/2-tree $T(K)$ has the delay

\[ D_{T(K)} = \log K \cdot 2 \cdot D_{FA}. \]

Figure 2.29: Circuit of an $n$-4/2-adder.

Figure 2.30: Complete balanced addition tree $T(K)$ with $2K$ operands $S_0, \ldots, S_{2K-1}$, where $K$ is a power of two; a) tree $T(2)$, b) tree $T(K)$.

.      


For the construction of IEEE-compliant floating point units we will have to construct $(n,m)$-multipliers where $m$ is not a power of two. Let

\[ M = 2^{\lceil \log m \rceil} \quad \text{and} \quad \mu = \log(M/4). \]

Thus, $M$ is the smallest power of two with $M \geq m$. As a consequence of the IEEE floating point standard [Ins85] and of the division algorithm used in the floating point unit, the length $m \in \{27, 58\}$ of the operand $b[m-1:0]$, and hence the number of operands of the addition tree, will satisfy the condition

\[ 3/4 \cdot M \leq m \leq M. \]

In this case, we construct an addition tree $T(m)$ with $m$ operands as suggested in figure 2.31.² The tree $T(m)$ has depth $\mu$. The bottom portion of the tree is a completely regular and balanced 4/2-tree $T(M/4)$ with $M/4$ many pairs of inputs and $M/8$ many 4/2-adders as leaves. In the top level, we have $a$ many 4/2-adders and $M/4 - a$ many 3/2-adders. Here, $a$ is the solution of the equation

\[ 4a + 3 \cdot (M/4 - a) = m, \]

hence

\[ a = m - 3M/4. \]

Figure 2.31: Construction of a 4/2-adder tree $T(m)$ adding inputs $S_0, \ldots, S_{m-1}$.

Note that for $i = 0, 1, \ldots$, the partial products $S_{i,1}$ are entered into the tree from right to left and that in the top level of the tree the 3/2-adders are arranged left of the 4/2-adders. For the delay of a multiplier constructed with such trees one immediately sees

\[ D_{4/2mul}(n,m) = D_{and} + 2 \cdot (\mu + 1) \cdot D_{FA} + D_{CLA}(n+m). \]

² For the complementary case see exercise 2.3.

      
Estimating the cost of the trees is more complicated. It requires estimating
the cost of all 3/2-adders and 4/2-adders of the construction. For this
estimate, we view the addition trees as complete binary trees $T$ in the graph
theoretic sense. Each 3/2-adder or 4/2-adder of the construction is a node
$v$ of the tree. The 3/2-adders and 4/2-adders at the top level of the addition
tree are then the leaves of $T$.
The cost of the leaves is easily determined. Carry save representations of
the sums $S_{i,3}$ are computed by n-3/2-adders in a way completely analogous
to figure 2.28 (a). The length of the representation is $n + 3$. Figure 2.32
shows that carry save representations of sums $S_{i,4}$ can be computed by two
n-3/2-adders (footnote 3). The length of the representation is $n + 4$. Thus,
we have

\[ c(v) \le 2 \cdot n \cdot C_{FA} \]

for all leaves $v$ of $T$.

[Figure 2.32: Partial compression of $S_{i,4}$. A first carry save adder(n)
compresses $S_{i,1}$, $S_{i+1,1}$, $S_{i+2,1}$ into $S_{i,3}$; a second carry save
adder(n) adds $S_{i+3,1}$ and yields $S_{i,4}$.]


Let $v$ be an interior node of the tree with left son $L(v)$ and right son $R(v)$.
Node $v$ then computes a carry save representation of the sum

\[ S_{i,k+h} = S_{i,k} + S_{i+k,h}, \]

where $R(v)$ provides a carry save representation of $S_{i,k}$ and $L(v)$ provides a
carry save representation of $S_{i+k,h}$. If the lengths of the representations
are $i + k$ and $i + k + h$, by Equation (2.4) we are then in the situation of
figure 2.33. Hence node $v$ consists of $2n + 2h$ full adders.
If all 3/2-adders in the tree had exactly $n$ full adders, and if all
4/2-adders had $2n$ full adders, the cost of the tree would be
$n \cdot (m - 2)$. Thus, it remains to estimate the number of excess full adders
in the tree.

Footnote 3: Formally, figure 2.32 can be viewed as a simplified
$(n + 3)$-4/2-adder.

[Figure 2.33: Partial compression of $S_{i,k+h}$ by a 4/2-adder(n+h).]

Lemma. Let $T$ be a complete binary tree with depth $\mu$. We number the levels
$\ell$ from the leaves to the root from 0 to $\mu$. Each leaf $u$ has weight $W(u)$.
For some natural number $k$, we have $W(u) \in \{k, k+1\}$ for all leaves, and the
weights are nondecreasing from left to right. Let $m$ be the sum of the
weights of the leaves. For $\mu = 4$, $m = 53$, and $k = 3$, the leaves would, for
example, have the weights 3333333333344444. For each subtree $t$ of $T$, we
define

\[ W(t) = \sum_{u \text{ leaf of } t} W(u), \]

where $u$ ranges over all leaves of $t$. For each interior node $v$ of $T$ we define
$L(v)$ and $R(v)$ as the weight of the subtree rooted in the left or right son of
$v$, respectively. We are interested in the sums

\[ H_\ell = \sum_{\text{level } \ell} L(v) \quad\text{and}\quad H = \sum_{\ell=1}^{\mu} H_\ell, \]

where $v$ ranges over all nodes of level $\ell$. The cost $H$ then obeys

\[ \mu \cdot m/2 - 2^{\mu-1} \le H \le \mu \cdot m/2. \]

Proof. By induction on the levels of $T$ one shows that in each level weights
are nondecreasing from left to right, and their sum is $m$. Hence,

\[ 2 H_\ell \le \sum_{\text{level } \ell} L(v) + \sum_{\text{level } \ell} R(v) = \sum_{\text{level } \ell} W(v) = m. \]

This proves the upper bound.

In the proof of the upper bound we have replaced each weight $L(v)$ by
the arithmetic mean of $L(v)$ and $R(v)$, overestimating $L(v)$ by

\[ h(v) = (L(v) + R(v))/2 - L(v) = (R(v) - L(v))/2. \]

Observe that all nodes in level $\ell$, except possibly one, have weights in
$\{k \cdot 2^\ell, (k+1) \cdot 2^\ell\}$. Thus, in each level $\ell$ there is at most one
node $v_\ell$ with $R(v_\ell) \ne L(v_\ell)$. For this node we have

\[ h(v_\ell) \le ((k+1) \cdot 2^{\ell-1} - k \cdot 2^{\ell-1})/2 = 2^{\ell-2}. \]

Hence, the error in the upper bound is at most

\[ \sum_{\ell=1}^{\mu} 2^{\ell-2} \le 2^{\mu-1}. \]
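The lemma can be checked numerically on the example from its statement. The sketch below is only an illustration; the helper name `left_weight_sum` is an assumption, and the number of leaves must be a power of two. It sums, level by level, the weights of all left sons.

```python
def left_weight_sum(leaf_weights):
    """H = sum over all interior nodes v of the weight L(v) of the left subtree."""
    level = list(leaf_weights)
    total = 0
    while len(level) > 1:
        total += sum(level[0::2])                      # weights of the left sons
        # weights of the nodes one level closer to the root
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return total


mu, m = 4, 53
leaves = [3] * 11 + [4] * 5                            # weights 3333333333344444
H = left_weight_sum(leaves)
upper = mu * m / 2                                     # 106
lower = upper - 2 ** (mu - 1)                          # 98
```

For this example the levels contribute 26 + 26 + 25 + 24, so `H` is 101, which lies between the lower bound 98 and the upper bound 106.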
 
We now use this lemma in order to estimate the number of excess full
adders in the adder tree $T(m)$ of figure 2.31. For that purpose, we label
every leaf $u$ of $T(m)$ with the number $W(u)$ of partial products that it
sums, i.e., in case of a 3/2-adder, $u$ is labeled with $W(u) = 3$, and in case
of a 4/2-adder it is labeled with $W(u) = 4$. In figure 2.31, we then have

\[ h = W(L(u)), \]

and the number $E$ of excess full adders can be estimated as

\[ E = 2 \cdot H \le \mu \cdot m. \]

The error in this bound is at most

\[ 2 \cdot 2^{\mu-1} = 2 \cdot 0.5 \cdot M/4 \le m/3. \]

Thus, the upper bound is quite tight. A very good upper bound for the cost
of 4/2-trees is therefore

\[ C_{4/2tree}(n, m) = (n \cdot (m - 2) + E) \cdot C_{FA} \le (n \cdot (m - 2) + \mu \cdot m) \cdot C_{FA}. \]

A good upper bound for the cost of multipliers built with 4/2-trees is

\[ C_{4/2mul}(n, m) = n \cdot m \cdot C_{and} + C_{4/2tree}(n, m) + C_{CLA}(n + m). \]

2.5.5 Multipliers with Booth Recoding

Booth recoding is a method which reduces the number of partial products
to be summed in the addition tree. This makes the addition tree smaller,
cheaper and faster. On the other hand, the generation of partial products
becomes more expensive and slower. One therefore has to show that the
savings in the addition tree outweigh the penalty in the partial product
generation.

[Figure 2.34: Structure of an $(n, m)$-Booth multiplier with addition tree.
Circuit Bgen(n, m') produces the Booth partial products
$S'_{0,2}, S'_{2,2}, \ldots, S'_{2m'-2,2}$ from $a[n-1:0]$ and $b[m-1:0]$; the m'-Booth
addition tree compresses them to $t$, $s$; add(n+m) yields $p[n+m-1:0]$.]

Figure 2.34 depicts the structure of a 4/2-tree multiplier with Booth
recoding. Circuit Bgen generates the $m'$ Booth recoded partial products
$S'_{2j,2}$, which are then fed into a Booth addition tree. Finally, an ordinary
adder produces from the carry save result of the tree the binary
representation of the product. Thus, the cost and delay of an
$(n, m)$-multiplier with 4/2-tree and Booth recoding can be expressed as

\[ C_{4/2Bmul}(n, m) = C_{Bgen}(n, m') + C_{4/2Btree}(n, m') + C_{CLA}(n + m) \]
\[ D_{4/2Bmul}(n, m) = D_{Bgen}(n, m') + D_{4/2Btree}(n, m') + D_{CLA}(n + m). \]

Booth Recoding
In the simplest form (called Booth-2) the multiplier $b$ is recoded as
suggested in figure 2.35. With $b_{m+1} = b_m = b_{-1} = 0$ and
$m' = \lceil (m+1)/2 \rceil$, one writes

\[ \langle b \rangle = 2 \langle b \rangle - \langle b \rangle = \sum_{j=0}^{m'-1} B_{2j} \cdot 4^j, \]

where

\[ B_{2j} = 2 b_{2j} + b_{2j-1} - 2 b_{2j+1} - b_{2j} = -2 b_{2j+1} + b_{2j} + b_{2j-1}. \]

The numbers $B_{2j} \in \{-2, -1, 0, 1, 2\}$ are called Booth digits, and we define
their sign bits $s_{2j}$ by

\[ s_{2j} = \begin{cases} 0 & \text{if } B_{2j} \ge 0 \\ 1 & \text{if } B_{2j} < 0. \end{cases} \]

[Figure 2.35: Booth digits $B_{2j}$, obtained by subtracting $\langle b \rangle$ from
$2 \langle b \rangle$ pairwise in positions $2j+1$ and $2j$.]

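With

\[ C_{2j} = \langle a \rangle \cdot B_{2j} \in \{-2^{n+1} + 2, \ldots, 2^{n+1} - 2\} \]
\[ D_{2j} = \langle a \rangle \cdot |B_{2j}| \in \{0, \ldots, 2^{n+1} - 2\} \]
\[ d_{2j} = bin_{n+1}(D_{2j}) \]

the product can be computed from the sums

\[ \langle a \rangle \cdot \langle b \rangle = \sum_{j=0}^{m'-1} \langle a \rangle \cdot B_{2j} \cdot 4^j = \sum_{j=0}^{m'-1} C_{2j} \cdot 4^j = \sum_{j=0}^{m'-1} (-1)^{s_{2j}} \cdot D_{2j} \cdot 4^j. \]

In order to avoid negative numbers $C_{2j}$, one sums the positive $E_{2j}$
instead:

\[ E_{2j} = C_{2j} + 3 \cdot 2^{n+1}, \qquad E_0 = C_0 + 4 \cdot 2^{n+1}, \]
\[ e_{2j} = bin_{n+3}(E_{2j}), \qquad e_0 = bin_{n+4}(E_0). \]

This is illustrated in figure 2.36. The additional terms sum to

\[ 2^{n+1} \cdot \Bigl(1 + 3 \cdot \sum_{j=0}^{m'-1} 4^j\Bigr) = 2^{n+1} \cdot \Bigl(1 + 3 \cdot \frac{4^{m'} - 1}{3}\Bigr) = 2^{n+1+2m'}. \]

Because $2 \cdot m' > m$ these terms are congruent to zero modulo $2^{n+m}$. Thus,

\[ \langle a \rangle \cdot \langle b \rangle \equiv \sum_{j=0}^{m'-1} E_{2j} \cdot 4^j \mod 2^{n+m}. \]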
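The recoding and the congruence above can be checked on the level of integers. This is a minimal sketch under the definitions just given; the function names `booth_digits` and `product_via_E` are illustrative assumptions, and the operands are passed as Python integers rather than bit vectors.

```python
def booth_digits(b, m):
    """Booth-2 digits B_0, B_2, ..., B_{2m'-2} of the m-bit number b."""
    m_prime = (m + 2) // 2                     # m' = ceil((m + 1) / 2)

    def bit(i):
        # b[-1] = b[m] = b[m+1] = 0 by convention
        return (b >> i) & 1 if 0 <= i < m else 0

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(m_prime)]


def product_via_E(a, b, n, m):
    """Sum the non-negative terms E_2j * 4^j and reduce modulo 2^(n+m)."""
    total = 0
    for j, B in enumerate(booth_digits(b, m)):
        C = a * B                              # may be negative
        E = C + (4 if j == 0 else 3) * 2 ** (n + 1)
        assert E >= 0                          # the offset removes the sign
        total += E * 4 ** j
    return total % 2 ** (n + m)


n = m = 5
digits_ok = all(
    sum(B * 4 ** j for j, B in enumerate(booth_digits(b, m))) == b
    for b in range(2 ** m)
)
products_ok = all(
    product_via_E(a, b, n, m) == a * b
    for a in range(2 ** n) for b in range(2 ** m)
)
```

Both flags evaluate to `True` for all 5-bit operands: the digits reproduce $\langle b \rangle$, and the offsets $3 \cdot 2^{n+1}$ and $4 \cdot 2^{n+1}$ vanish modulo $2^{n+m}$.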

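Lemma 2.6. The binary representation $e_{2j}$ of $E_{2j}$ can be computed by

\[ \langle e_{2j} \rangle = \langle 1 \, \overline{s_{2j}} \, (d_{2j} \oplus s_{2j}) \rangle + s_{2j} \]
\[ \langle e_0 \rangle = \langle \overline{s_0} \, s_0 \, s_0 \, (d_0 \oplus s_0) \rangle + s_0. \]

[Figure 2.36: Summation of the $E_{2j}$. Each term $E_{2j}$ consists of the
offset pattern $110^{n+1}$ ($1000^{n+1}$ for $E_0$) plus or minus
$d_{2j} = \langle a \rangle \cdot |B_{2j}|$, shifted by $2j$ positions.]

Proof. For $j > 0$ and $s_{2j} = 0$, we have

\[ \langle e_{2j} \rangle = \langle 11 \, 0^{n+1} \rangle + \langle 00 \, d_{2j} \rangle = \langle 11 \, d_{2j} \rangle = \langle 1 \, \overline{s_{2j}} \, (d_{2j} \oplus s_{2j}) \rangle + s_{2j}. \]

For $j > 0$ and $s_{2j} = 1$, we have

\[ \langle e_{2j} \rangle \equiv \langle 11 \, 0^{n+1} \rangle + \langle 11 \, \overline{d_{2j}} \rangle + 1 \mod 2^{n+3} \]
\[ \equiv \langle 10 \, \overline{d_{2j}} \rangle + 1 \mod 2^{n+3} \]
\[ = \langle 1 \, \overline{s_{2j}} \, (d_{2j} \oplus s_{2j}) \rangle + s_{2j}. \]

For $j = 0$, one shows along the same lines that

\[ \langle e_0 \rangle = \langle \overline{s_0} \, s_0 \, s_0 \, (d_0 \oplus s_0) \rangle + s_0. \]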
 
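By lemma 2.6, the computation of the numbers

\[ F_{2j} = E_{2j} - s_{2j}, \qquad f_{2j} = bin_{n+3}(F_{2j}), \qquad f_0 = bin_{n+4}(F_0) \]

is easy, namely

\[ f_{2j} = 1 \, \overline{s_{2j}} \, (d_{2j} \oplus s_{2j}) \]
\[ f_0 = \overline{s_0} \, s_0 \, s_0 \, (d_0 \oplus s_0). \]

[Figure 2.37: Construction of the partial products $S'_{2j,2}$. Row $g_{2j}$
consists of $1$, $\overline{s_{2j}}$, $d_{2j} \oplus s_{2j}$, $0$ and the sign bit $s_{2j-2}$ of
the previous row; row $g_0$ consists of $\overline{s_0}$, $s_0$, $s_0$, $d_0 \oplus s_0$, $0$, $0$.]

Instead of adding the sign bits $s_{2j}$ to the numbers $F_{2j}$, one incorporates
them at the proper position into the representation of $F_{2j+2}$, as suggested
in figure 2.37. The last sign bit does not create a problem, because
$B_{2m'-2}$ is always positive. Formally, let

\[ g_{2j} = f_{2j} \, 0 \, s_{2j-2} \in \{0, 1\}^{n+5} \]
\[ g_0 = f_0 \, 00 \in \{0, 1\}^{n+6}; \]

then

\[ \langle g_{2j} \rangle = 4 \cdot \langle f_{2j} \rangle + s_{2j-2} = 4 \cdot F_{2j} + s_{2j-2}, \]

and with $s_{-2} = s_{2m'-2} = 0$, the product can also be written as

\[ \langle a \rangle \cdot \langle b \rangle \equiv \sum_{j=0}^{m'-1} E_{2j} \cdot 4^j \mod 2^{n+m} \]
\[ = \sum_{j=0}^{m'-1} 4 \cdot (F_{2j} + s_{2j}) \cdot 4^{j-1} \]
\[ = \sum_{j=0}^{m'-1} (4 \cdot F_{2j} + s_{2j-2}) \cdot 4^{j-1} = \sum_{j=0}^{m'-1} \langle g_{2j} \rangle \cdot 4^{j-1}. \]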
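The bit-level construction of the $g_{2j}$ can be exercised exhaustively for small operand lengths. The sketch below is an illustration, not the circuit of the text: bit strings are modelled as integers, and the name `partial_products` is an assumption. It assembles each $f_{2j}$ from the fields of lemma 2.6, appends the sign bit of the previous digit, and checks that the weighted sum equals the product modulo $2^{n+m}$.

```python
def partial_products(a, b, n, m):
    """Return the values <g_0>, <g_2>, ..., <g_{2m'-2}> as integers."""
    m_prime = (m + 2) // 2                     # m' = ceil((m + 1) / 2)

    def bit(i):
        return (b >> i) & 1 if 0 <= i < m else 0

    gs = []
    s_prev = 0                                 # s_{-2} = 0
    for j in range(m_prime):
        B = -2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
        s = 1 if B < 0 else 0                  # sign bit s_2j
        d = a * abs(B)                         # <d_2j>, fits into n+1 bits
        x = d ^ (s * (2 ** (n + 1) - 1))       # d_2j XOR s_2j on n+1 bits
        if j == 0:                             # f_0 = /s s s (d xor s), n+4 bits
            f = ((1 - s) << (n + 3)) | (s << (n + 2)) | (s << (n + 1)) | x
        else:                                  # f_2j = 1 /s (d xor s), n+3 bits
            f = (1 << (n + 2)) | ((1 - s) << (n + 1)) | x
        gs.append(4 * f + s_prev)              # g_2j = f_2j 0 s_{2j-2}
        s_prev = s
    return gs


def booth_product(a, b, n, m):
    """Sum of <g_2j> * 4^(j-1), reduced modulo 2^(n+m)."""
    total_times_4 = sum(g * 4 ** j for j, g in enumerate(partial_products(a, b, n, m)))
    return (total_times_4 // 4) % 2 ** (n + m) # exact: g_0 is a multiple of 4


n, m = 4, 5
all_ok = all(booth_product(a, b, n, m) == a * b
             for a in range(2 ** n) for b in range(2 ** m))
```

For $n = 4$, $m = 5$, $a = 5$ and $b = 6$ the sketch yields the three values 468, 425 and 384, whose weighted sum reduces to 30.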

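We define

\[ S'_{2j,2k} = \sum_{t=j}^{j+k-1} \langle g_{2t} \rangle \cdot 4^{t-1}; \]

then

\[ S'_{2j,2} = \langle g_{2j} \rangle \cdot 4^{j-1} = \langle f_{2j} \, 0 \, s_{2j-2} \rangle \cdot 4^{j-1} \]
\[ S'_{2j,2(k+h)} = S'_{2j,2k} + S'_{2j+2k,2h}, \]

and it holds:

Lemma. $S'_{2j,2k}$ is a multiple of $2^{2j-2}$ bounded by
$S'_{2j,2k} < 2^{n+2j+2k+2}$. Therefore, at most $n + 4 + 2k$ non-zero positions are
necessary to represent $S'_{2j,2k}$ in both carry-save or binary form.

Proof by induction over $k$. For $k = 1$:

\[ S'_{2j,2} \le \langle 1^{n+6} \rangle \cdot 4^{j-1} < 2^{n+6} \cdot 2^{2j-2} = 2^{n+2j+2 \cdot 1+2}. \]

For $k > 1$: It is known from the assumption that
$S'_{2j,2(k-1)} < 2^{n+2j+2k}$. Thus,

\[ S'_{2j,2k} = S'_{2j,2(k-1)} + \langle g_{2(j+k-1)} \rangle \cdot 4^{j+k-2} \]
\[ < 2^{n+2j+2k} + 2^{n+5} \cdot 2^{2(j+k-2)} \]
\[ = (1 + 2) \cdot 2^{n+2j+2k} < 2^{n+2j+2k+2}. \]

2.5.6 Cost and Delay of the Booth Multiplier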

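Partial Product Generation. The binary representation of the numbers
$S'_{2j,2}$ must be computed (see figure 2.37). These are

\[ g_{2j} = 1 \, \overline{s_{2j}} \, (d_{2j} \oplus s_{2j}) \, 0 \, s_{2j-2} \]
\[ g_0 = \overline{s_0} \, s_0 \, s_0 \, (d_0 \oplus s_0) \, 00 \]

shifted by $2j - 2$ bit positions. The $d_{2j} = bin_{n+1}(\langle a \rangle \cdot |B_{2j}|)$ are
easily determined from $B_{2j}$ and $a$ by

\[ d_{2j} = \begin{cases} (0, \ldots, 0) & \text{if } B_{2j} = 0 \\ (0, a) & \text{if } |B_{2j}| = 1 \\ (a, 0) & \text{if } |B_{2j}| = 2. \end{cases} \]

For this computation, two signals indicating $|B_{2j}| = 1$ and $|B_{2j}| = 2$ are
necessary. We denote these signals by

\[ b1_{2j} = \begin{cases} 1 & \text{if } |B_{2j}| = 1 \\ 0 & \text{otherwise} \end{cases} \qquad b2_{2j} = \begin{cases} 1 & \text{if } |B_{2j}| = 2 \\ 0 & \text{otherwise} \end{cases} \]

and calculate them by the Booth decoder logic BD of figure 2.38 (a). The
decoder logic BD can be derived from table 2.6 in a straightforward way.
It has the following cost and delay:

\[ C_{BD} = C_{xor} + C_{xnor} + C_{nor} + C_{inv} \]
\[ D_{BD} = \max\{D_{xor}, D_{xnor}\} + D_{nor}. \]

The selection logic BSL of figure 2.38 (b) directs either bit $a[i]$, bit
$a[i+1]$, or 0 to position $i + 1$. The inversion depending on the sign bit
$s_{2j}$ then yields bit $g_{2j}[i+3]$. The select logic BSL has the following cost
and delay:

\[ C_{BSL} = 3 \cdot C_{nand} + C_{xor} \]
\[ D_{BSL} = 2 \cdot D_{nand} + D_{xor}. \]

Table 2.6: Representation of the Booth digits

  b[2j+1 : 2j-1]   B_2j      b[2j+1 : 2j-1]   B_2j
  000               0        100              -2
  001               1        101              -1
  010               1        110              -1
  011               2        111              -0

[Figure 2.38: The Booth decoder BD (a) and the Booth selection logic BSL (b).
BD has inputs $b_{2j+1}$, $b_{2j}$, $b_{2j-1}$ and outputs $s_{2j}$, $\overline{s_{2j}}$, $b2_{2j}$,
$b1_{2j}$; BSL has inputs $b1_{2j}$, $a[i+1]$, $a[i]$, $b2_{2j}$, $s_{2j}$ and produces
$d_{2j}[i+1]$ and $g_{2j}[i+3]$.]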
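Table 2.6 can be turned into a small executable check. The gate assignment below is an assumption: the schematic of figure 2.38 is not reproduced here, so the sketch merely uses one XOR, one XNOR, one NOR and the sign bit taken from $b_{2j+1}$, which matches the gate count of $C_{BD}$ and the table. The names `booth_decoder` and `select_multiple` are illustrative.

```python
def booth_decoder(b_hi, b_mid, b_lo):
    """Decode the triple b[2j+1], b[2j], b[2j-1] into (b1, b2, s)."""
    b1 = b_mid ^ b_lo                  # XOR:  |B_2j| = 1
    same = 1 - (b_hi ^ b_mid)          # XNOR of the two upper bits
    b2 = 1 - (b1 | same)               # NOR:  |B_2j| = 2
    s = b_hi                           # sign bit; the triple 111 is decoded as -0
    return b1, b2, s


def select_multiple(a, b1, b2):
    """Value <d_2j>: a if b1 is set, 2a (a shifted left) if b2 is set, else 0."""
    return (a if b1 else 0) | ((a << 1) if b2 else 0)


table = {}
for triple in range(8):
    hi, mid, lo = (triple >> 2) & 1, (triple >> 1) & 1, triple & 1
    b1, b2, s = booth_decoder(hi, mid, lo)
    table[(hi, mid, lo)] = (-1) ** s * (b1 + 2 * b2)   # decoded Booth digit
```

The dictionary `table` reproduces the digits of table 2.6, for example the digit 2 for the triple 011 and the digit -2 for the triple 100.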

The select logic BSL is only used for selecting the bits $g_{2j}[n+2:2]$; the
remaining bits $g_{2j}[1:0]$ and $g_{2j}[n+5:n+3]$ are fixed. For these bits, the
selection logic is replaced by the simple signal of a sign bit, its inverse, a
zero, or a one. Thus, for each of the $m'$ partial products $n + 1$ many select
circuits BSL are required. Together with the $m'$ Booth decoders, the cost of
the Booth preprocessing runs at

\[ C_{Bpre}(n, m') = m' \cdot (C_{BD} + (n + 1) \cdot C_{BSL}). \]

Redundant Partial Product Addition. Let $M' = 2^{\lceil \log m' \rceil}$ be the smallest
power of two which is greater than or equal to $m' = \lceil (m+1)/2 \rceil$, and let
$\mu' = \log(M'/4)$. For $m \in \{27, 58\}$, it holds that

\[ 3/4 \cdot M' \le m' \le M'. \]

For the construction of the Booth 4/2-adder tree $T'(m')$, we proceed as in
section 2.5.4, but we just focus on trees which satisfy the above condition.
The standard length of the 3/2-adders and the 4/2-adders is now
$n' = n + 5$ bits; longer operands require excess full adders. Let $E'$ be the
number of excess full adders. Considering the sums $S'$ instead of the sums
$S$, one shows that the top level of the tree has no excess full adders. Let
$H'$ be the sum of the labels of the left sons in the resulting tree. With
Booth recoding, successive partial products are shifted by 2 positions.
Thus, we now have

\[ E' = 4 \cdot H'. \]

Since $H'$ is bounded by $\mu' \cdot m'/2$, we get

\[ E' \le 2 \cdot \mu' \cdot m'. \]

Thus, the delay and the cost of the 4/2-tree multiplier with Booth-2
recoding can be expressed as

\[ C_{4/2Btree} = (n' \cdot (m' - 2) + E') \cdot C_{FA} \le (n' \cdot (m' - 2) + 2 \cdot \mu' \cdot m') \cdot C_{FA} \]
\[ D_{4/2Btree} = 2 \cdot (\mu' + 1) \cdot D_{FA}. \]

Let $C'$ and $D'$ denote the cost and delay of the Booth multiplier but
without the $(n + m)$-bit CLA adder, and let $C$ and $D$ denote the
corresponding cost and delay of the multiplier without Booth recoding:

\[ C' = C_{4/2Bmul}(n, m) - C_{CLA}(n + m) \]
\[ D' = D_{4/2Bmul}(n, m) - D_{CLA}(n + m) \]
\[ C = C_{4/2mul}(n, m) - C_{CLA}(n + m) \]
\[ D = D_{4/2mul}(n, m) - D_{CLA}(n + m). \]

For $n = m = 58$, we then get

  C'/C = 45246/55448 = 81.6%
  D'/D = 55/62 = 88.7%

and for $n = m = 27$, we get

  C'/C = 10234/12042 = 84.9%
  D'/D = 43/50 = 86.0%

Asymptotically, $C'/C$ tends to $12/16 = 0.75$. Unless $m$ is a power of two,
we have $\mu = \mu' + 1$ and $D - D' = 7$. Hence $D'/D$ tends to one as $n$ grows
large.
When taking wire delays in a VLSI layout into account, it can be shown
that Booth recoding also saves a constant fraction of time (independent of
$n$) in multipliers built with 4/2-trees [PS98].

2.6 Control Automata

2.6.1 Finite State Transducers

Finite state transducers are finite automata which produce an output in
every step. Formally, a finite state transducer is specified by a 6-tuple
$(Z, In, Out, z_0, \delta, \eta)$, where $Z$ is a finite set of states; $z_0 \in Z$ is called
the initial state. $In$ is a finite set of input symbols, $Out$ is a finite set of
output symbols,

\[ \delta : Z \times In \to Z \]

is the transition function, and

\[ \eta : Z \times In \to Out \]

is the output function.

Such an automaton works step by step according to the following rules:

• The automaton is started in state $z_0$.

• If the automaton is in state $z$ and reads input symbol $in$, it then
  outputs symbol $\eta(z, in)$ and goes to state $\delta(z, in)$.

If the output function does not depend on the input $in$, i.e., if it can be
written as

\[ \eta : Z \to Out, \]

then the automaton is called a Moore automaton. Otherwise, it is called a
Mealy automaton.
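The two rules translate directly into a few lines of code. The sketch below is illustrative only; the class name `Transducer` and the two toy automata (a parity automaton and an edge detector) are assumptions that do not occur in the text. It shows that a Moore automaton is simply a transducer whose output function ignores its second argument.

```python
class Transducer:
    """Finite state transducer given by initial state z0, delta and eta."""

    def __init__(self, z0, delta, eta):
        self.z0 = z0
        self.delta = delta      # transition function: (state, input) -> state
        self.eta = eta          # output function:     (state, input) -> output

    def run(self, inputs):
        z = self.z0             # the automaton is started in state z0
        outputs = []
        for symbol in inputs:
            outputs.append(self.eta(z, symbol))   # output eta(z, in) ...
            z = self.delta(z, symbol)             # ... then go to delta(z, in)
        return outputs


# Moore automaton (hypothetical example): the state is the parity of the
# inputs read so far, and the output depends on the state only.
parity = Transducer(0, lambda z, x: z ^ x, lambda z, x: z)

# Mealy automaton (hypothetical example): the state stores the previous
# input, and the output signals that the current input differs from it.
edge = Transducer(0, lambda z, x: x, lambda z, x: z ^ x)

moore_out = parity.run([1, 1, 0, 1])
mealy_out = edge.run([0, 1, 1, 0])
```

Note that the Moore output in a step reflects only the inputs of earlier steps, whereas the Mealy output reacts to the input of the same step.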
Obviously, the input of an automaton which controls parts of a computer
will come from a certain number $\sigma$ of input lines $in[\sigma-1:0]$, and
it will produce outputs on a certain number $\gamma$ of output lines
$out[\gamma-1:0]$. Formally, we have

\[ In = \{0, 1\}^\sigma \quad\text{and}\quad Out = \{0, 1\}^\gamma. \]

It is common practice to visualize automata as in figure 2.39 which
shows a Moore automaton with 3 states $z_0$, $z_1$, and $z_2$, with the set of input
symbols $In = \{0, 1\}^2$ and with the set of output symbols $Out = \{0, 1\}^3$.
The automaton is represented as a directed graph $(V, E)$ with labeled edges
and nodes.
The set of nodes $V$ of the graph are the states of the automaton. We draw
them as rectangles; the initial state is marked by a double border. For any
pair of states $(z, z')$, there is an edge from $z$ to $z'$ in the set $E$ of edges if
$\delta(z, in) = z'$ for some input symbol $in$, i.e., if a transition from state $z$ to
state $z'$ is possible. The edge $(z, z')$ is labeled with all input symbols $in$
that take the automaton from state $z$ to state $z'$. For Moore automata, we
write into the rectangle depicting state $z$ the output signals which are
active in state $z$.

[Figure 2.39: Representation of a Moore automaton with three states, the set
$In = \{0, 1\}^2$ of input symbols and the set $Out = \{0, 1\}^3$ of output symbols.
State labels: $z_0$: out = (010), $z_1$: out = (101), $z_2$: out = (011). Edge
labels: $z_0 \to z_1$: {00}; $z_0 \to z_2$: {01, 10, 11}; $z_1 \to z_1$: {00, 01};
$z_1 \to z_2$: {11}; $z_2 \to z_0$: $\{0, 1\}^2$.]

Transducers play an important role in the control of computers.
Therefore, we specify two particular implementations; their cost and delay
can easily be determined if the automaton is drawn as a graph. For a more
general discussion see, e.g., [MP95].

2.6.2 Coding the State

Let $k = \#Z$ be the number of states of the automaton. Then the states can
be renamed with the numbers from 0 to $k - 1$:

\[ Z = \{0, \ldots, k - 1\}. \]

We always code the current state $z$ in a register with outputs $S[k-1:0]$
satisfying

\[ S[i] = \begin{cases} 1 & \text{if } z = i \\ 0 & \text{otherwise} \end{cases} \]

for all $i$. This means that, if the automaton is in state $i$, bit $S[i]$ is turned
on and all other bits $S[j]$ with $j \ne i$ are turned off. The initial state always
gets number 0. The cost of storing the state is obviously $k \cdot C_{ff}$.

2.6.3 Generating the Outputs

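For each output signal $out_j$, we define the set of states

\[ Z_j = \{z \in Z \mid out_j \text{ is active in state } z\} \]

in which it is active. Signal $out_j$ can then obviously be computed as

\[ out_j = \bigvee_{z \in Z_j} S[z]. \]

We often refer to the cardinality

\[ \nu_j = \#Z_j \]

as the frequency of the output signal $out_j$; $\nu_{max}$ and $\nu_{sum}$ denote the
maximum and the sum of all cardinalities $\nu_j$:

\[ \nu_{max} = \max\{\nu_j \mid 0 \le j < \gamma\}, \qquad \nu_{sum} = \sum_{j=0}^{\gamma-1} \nu_j. \]

In the above example, the subsets are $Z_0 = \{z_1, z_2\}$, $Z_1 = \{z_0, z_2\}$,
$Z_2 = \{z_1\}$, and the two parameters have the values $\nu_{max} = 2$ and
$\nu_{sum} = 5$.
Each output signal $out_j$ can be generated by a $\nu_j$-input OR-tree at the cost
of $(\nu_j - 1) \cdot C_{or}$ and the delay of $\lceil \log \nu_j \rceil \cdot D_{or}$. Thus, a circuit $O$
generating all output signals has the following cost and delay:

\[ C_O = \sum_{i=0}^{\gamma-1} (\nu_i - 1) \cdot C_{or} = (\nu_{sum} - \gamma) \cdot C_{or} \]
\[ D_O = \max\{\lceil \log \nu_i \rceil \mid 0 \le i < \gamma\} \cdot D_{or} = \lceil \log \nu_{max} \rceil \cdot D_{or}. \]

2.6.4 Computing the Next State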

For each edge $(z, z') \in E$, we derive from the transition function $\delta$ the
boolean function $\delta_{z,z'}$ specifying under which inputs the transition from
state $z$ to state $z'$ is taken:

\[ \delta_{z,z'}(in[\sigma-1:0]) = 1 \iff \delta(z, in[\sigma-1:0]) = z'. \]

Let $D(z, z')$ be a disjunctive normal form of $\delta_{z,z'}$. If the transition from
$z$ to $z'$ occurs for all inputs $in$, then $\delta_{z,z'} \equiv 1$ and the disjunctive normal
form of $\delta_{z,z'}$ consists only of the trivial monomial $m = 1$. The automaton
of figure 2.39 comprises the following disjunctive normal forms:

\[ D(z_0, z_1) = \overline{in_1} \wedge \overline{in_0} \qquad D(z_0, z_2) = in_1 \vee in_0 \qquad D(z_2, z_0) = 1 \]
\[ D(z_1, z_1) = \overline{in_1} \qquad D(z_1, z_2) = in_1 \wedge in_0. \]

Let $M(z, z')$ be the set of monomials in $D(z, z')$ and let

\[ M = \bigcup_{(z,z') \in E} M(z, z') \setminus \{1\} \]

be the set of all nontrivial monomials occurring in the disjunctive forms
$D(z, z')$. The next state vector $N[k-1:1]$ can then be computed in three
steps:

1. compute $\overline{in_j}$ for each input signal $in_j$,

2. compute all monomials $m \in M$,

3. and then compute for all $z$ between 1 and $k - 1$ the bit

\[ N[z] = \bigvee_{(z',z) \in E} \; \bigvee_{m \in M(z',z)} S[z'] \wedge m = \bigvee_{(z',z) \in E} \Bigl( S[z'] \wedge \bigvee_{m \in M(z',z)} m \Bigr). \qquad (2.5) \]

Note that we do not compute $N[0]$ yet. For each monomial $m$, its length
$l(m)$ denotes the number of literals in $m$; $l_{max}$ and $l_{sum}$ denote the maximum
and the sum of all $l(m)$:

\[ l_{max} = \max\{l(m) \mid m \in M\}, \qquad l_{sum} = \sum_{m \in M} l(m). \]

The computation of the monomials then adds the following cost and delay:

\[ C_{CM} = \sigma \cdot C_{inv} + (l_{sum} - \#M) \cdot C_{and} \]
\[ D_{CM} = D_{inv} + \lceil \log l_{max} \rceil \cdot D_{and}. \]

For each node $z$, let

\[ fanin(z) = \sum_{(z',z) \in E} \#M(z', z) \]
\[ fanin_{max} = \max\{fanin(z) \mid 1 \le z \le k - 1\} \]
\[ fanin_{sum} = \sum_{z=1}^{k-1} fanin(z); \]

the next state signals $N[k-1:1]$ can then be generated at the following
cost and delay:

\[ C_{CN} = fanin_{sum} \cdot (C_{and} + C_{or}) - (k - 1) \cdot C_{or} \]
\[ D_{CN} = \lceil \log fanin_{max} \rceil \cdot D_{or} + D_{and}. \]

Thus, a circuit NS computing these next state signals along these lines from
the state bits $S[k-1:0]$ and the input $in[\sigma-1:0]$ has cost and delay

\[ C_{NS} = C_{CM} + C_{CN} \]
\[ D_{NS} = D_{CM} + D_{CN}. \]
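The parameters entering these formulae can be read off mechanically from the graph of the automaton. The sketch below does this for the example automaton; the disjunctive normal forms are taken from the edge labels of figure 2.39, while the encoding of literals as `(index, polarity)` pairs and all variable names are assumptions made for illustration. Gate costs are reported as gate counts rather than in the cost units of the text.

```python
# A literal is (input index, polarity); a monomial is a tuple of literals.
NOT1, NOT0, IN1, IN0 = (1, False), (0, False), (1, True), (0, True)

# Disjunctive normal forms D(z, z') of the example automaton.
DNF = {
    (0, 1): [(NOT1, NOT0)],       # input 00
    (0, 2): [(IN1,), (IN0,)],     # inputs 01, 10, 11
    (1, 1): [(NOT1,)],            # inputs 00, 01
    (1, 2): [(IN1, IN0)],         # input 11
    (2, 0): [()],                 # trivial monomial 1
}
Z_OUT = {0: {1, 2}, 1: {0, 2}, 2: {1}}     # sets Z_j of the outputs out_0..out_2
sigma, gamma, k = 2, 3, 3

M = {mono for monos in DNF.values() for mono in monos if mono}   # nontrivial only
l_sum = sum(len(mono) for mono in M)
l_max = max(len(mono) for mono in M)
fanin = {z: sum(len(monos) for (_, dst), monos in DNF.items() if dst == z)
         for z in range(1, k)}
fanin_sum, fanin_max = sum(fanin.values()), max(fanin.values())
nu = {j: len(states) for j, states in Z_OUT.items()}
nu_sum, nu_max = sum(nu.values()), max(nu.values())

gates = {
    "inv": sigma,                                       # circuit CM
    "and": (l_sum - len(M)) + fanin_sum,                # circuits CM and CN
    "or": (fanin_sum - (k - 1)) + (nu_sum - gamma),     # circuits CN and O
}
```

For the example this gives $\#M = 5$, $l_{sum} = 7$, $l_{max} = 2$, $fanin_{sum} = 5$, $fanin_{max} = 3$, $\nu_{sum} = 5$ and $\nu_{max} = 2$, i.e. 2 inverters, 7 AND gates and 5 OR gates for the circuits CM, CN and O together.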
Table 2.7 summarizes all the parameters which must be determined from
the specification of the automaton in order to determine cost and delay of
the two circuits O and NS.

Table 2.7: Parameters of a Moore control automaton

  Parameter         Meaning
  σ                 # inputs in_j of the automaton
  γ                 # output signals out_j of the automaton
  k                 # states of the automaton
  νmax, νsum        maximal / accumulated frequency of all outputs
  #M                # monomials m ∈ M (nontrivial)
  lmax, lsum        maximal / accumulated length of all m ∈ M
  fanmax, fansum    maximal / accumulated fanin of states z ≠ z0

[Figure 2.40: Realization of a Moore automaton. Circuit NS (consisting of CM
and CN) computes N[k-1:1] from in and S; a zero tester zero(k-1) provides
N[0]; a multiplexer controlled by clr selects between N and the pattern
$0^{k-1}1$; register S is clocked by ce or clr; circuit O generates out.]

2.6.5 Moore Automata

Figure 2.40 shows a straightforward realization of Moore automata with
clock enable signal $ce$ and clear signal $clr$. The automaton is clocked
when at least one of the signals $ce$ and $clr$ is active. At the start of the
computation, the clear signal $clr$ is active. This forces the pattern
$0^{k-1}1$, i.e., the code of the initial state 0, into register $S$. As long as the
clear signal is inactive, the next state $N$ is computed by circuit NS and a
zero tester. If none of the next state signals $N[k-1:1]$ is active, then the
output of the zero tester becomes active and next state signal $N[0]$ is
turned on.
This construction has the great advantage that it works even if the
transition function is not completely specified, i.e., if $\delta(z, in)$ is undefined
for some state $z$ and input $in$. This happens, for instance, if the input $in$
codes an instruction and a computer controlled by the automaton tries to
execute an undefined (called illegal) instruction.
In the Moore automaton of figure 2.39, the transition function is not
specified for $z_1$ and $in = 10$. We now consider the case that the automaton
is in state $z_1$ and reads input 10. If all next state signals including
signal $N[0]$ are computed according to equation (2.5) of the previous
subsection, then the next state becomes $0^k$. Thus, the automaton hangs and
can only be re-started by activating the clear signal $clr$. However, in the
construction presented here, the automaton falls gracefully back into its
initial state. Thus, the transition $\delta(z_1, 10) = z_0$ is specified implicitly in
this automaton.

[Figure 2.41: Realization of a Moore automaton with precomputed outputs.
Compared with figure 2.40, circuit O is fed by the multiplexer output, and
its result is clocked into a separate register Rout.]
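The fallback into the initial state can be observed in a small simulation of the one-hot coded example automaton. This is a sketch under the disjunctive normal forms read from figure 2.39; the function name `next_state` and the list representation of register $S$ are assumptions, and the clear and clock enable signals are not modelled.

```python
def next_state(S, in1, in0):
    """One step of the one-hot coded example automaton (clr inactive, ce active).

    S is the list [S[0], S[1], S[2]]; N[1] and N[2] follow equation (2.5),
    N[0] is produced by the zero tester.
    """
    n1 = (S[0] & (1 - in1) & (1 - in0)) | (S[1] & (1 - in1))   # edges into z1
    n2 = (S[0] & (in1 | in0)) | (S[1] & in1 & in0)             # edges into z2
    n0 = 1 if (n1 == 0 and n2 == 0) else 0                     # zero tester
    return [n0, n1, n2]


Z0, Z1, Z2 = [1, 0, 0], [0, 1, 0], [0, 0, 1]

specified = [
    next_state(Z0, 0, 0),    # input 00 in z0
    next_state(Z1, 1, 1),    # input 11 in z1
    next_state(Z2, 0, 1),    # any input in z2
]
illegal = next_state(Z1, 1, 0)   # input 10 in z1: transition not specified
```

Here `specified` equals `[Z1, Z2, Z0]` and `illegal` equals `Z0`. The transition from $z_2$ back to $z_0$ is also realized by the zero tester alone, since no signal $N[0]$ is derived from $D(z_2, z_0)$.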
Let $A(in)$ and $A(clr, ce)$ denote the accumulated delay of the input
signals $in$ and of the signals $clr$ and $ce$. The cost, the delay and the cycle
time of this realization can then be expressed as

\[ C_{Moore} = C_{ff}(k) + C_O + C_{NS} + C_{zero}(k - 1) + C_{mux}(k) + C_{or} \]
\[ A(out) = D_O \]
\[ T_{Moore} = \max\{A(clr, ce) + D_{or}, \; A(in) + D_{NS} + D_{zero}(k - 1)\} + D_{mux} + \Delta. \]

2.6.6 Precomputing the Control Signals

In the previous construction of the Moore automaton it takes time $D_O$ from
the time the registers are clocked until the output signals are valid. In the
construction of figure 2.41, the control signals are therefore precomputed
and clocked into a separate register $R_{out}$. This increases the cycle time of
the automaton by $D_O$, but the output signals $out[\gamma-1:0]$ are valid without
further delay. The cost, the delay, and the cycle time of the automaton then
run at

\[ C_{pMoore} = C_{ff}(k) + C_O + C_{NS} + C_{zero}(k - 1) + C_{mux}(k) + C_{ff}(\gamma) + C_{or} \]
\[ A(out) = 0 \]
\[ T_{pMoore} = \max\{A(clr, ce) + D_{or}, \; A(in) + D_{NS} + D_{zero}(k - 1)\} + D_{mux} + D_O + \Delta. \]

This will be our construction of choice for Moore automata. The choice
is not completely obvious. For a formal evaluation of various realizations
of control automata see [MP95].

[Figure 2.42: Representation of a Mealy automaton with three states, inputs
$in[1:0]$, and outputs $out[3:0]$; $out[3]$ is the only Mealy component of the
output. The graph is that of figure 2.39, with the additional state labels
"out[3] if in[0]" and "out[3] if /in[1]".]

2.6.7 Mealy Automata

Consider a Mealy automaton with input signals $in[\sigma-1:0]$ and output
signals $out[\gamma-1:0]$. In general, not all components $out[j]$ of the output
will depend on both the current state and the input signals $in[\sigma-1:0]$.
We call $out[j]$ a Mealy component if it depends on the current state and the
current input; otherwise we call $out[j]$ a Moore component.
Let $out_j$ be a Mealy component of the output. For every state $z$ in which
$out_j$ can be activated, there is a boolean function $f_{z,j}$ such that

\[ out_j \text{ is active in state } z \iff f_{z,j}(in) = 1. \]

If the Mealy output $out_j$ is never activated in state $z$, then $f_{z,j} \equiv 0$. If the
Mealy output $out_j$ is always turned on in state $z$, then $f_{z,j} \equiv 1$. For any
Mealy output $out_j$ we define the set of states

\[ Z'_j = \{z \mid f_{z,j} \not\equiv 0\} \]

where $out_j$ can possibly be turned on.

Let $F(z, j)$ be a disjunctive normal form of $f_{z,j}$. With the help of $F(z, j)$
we can visualize Mealy outputs $out_j$ in the following way. Let $z$ be a state
(visualized as a rectangle) in $Z'_j$; we then write inside the rectangle:

    out_j if F(z, j)

In figure 2.42, we have augmented the example automaton by a new Mealy
output $out[3]$.
Let $MF(z, j)$ be the set of monomials in $F(z, j)$, and let

\[ MF = \bigcup_{j=0}^{\gamma-1} \; \bigcup_{z \in Z'_j} MF(z, j) \setminus \{1\} \]

be the set of all nontrivial monomials occurring in the disjunctive normal
forms $F(z, j)$. The Mealy outputs $out_j$ can then be computed in two steps:

1. compute all monomials $m \in M \cup MF$ in a circuit CM (the monomials
   of $M$ are used for the next state computation),

2. and for any Mealy output $out_j$, circuit O computes the bit

\[ out_j = \bigvee_{z \in Z'_j} \; \bigvee_{m \in MF(z,j)} S[z] \wedge m = \bigvee_{z \in Z'_j} \Bigl( S[z] \wedge \bigvee_{m \in MF(z,j)} m \Bigr). \]

Circuit CM computes the monomials $m \in M \cup MF$ in the same way as
the next state circuit of section 2.6.4, i.e., it first inverts all the inputs $in_i$
and then computes each monomial $m$ by a balanced tree of AND-gates.
Let $lf_{max}$ and $l_{max}$ denote the maximal length of all monomials in $MF$ and
$M$, respectively, and let $l_{sum}$ denote the accumulated length of all nontrivial
monomials. Circuit CM can then generate the monomials of
$M' = MF \cup M$ at the following cost and delay:

\[ C_{CM} = \sigma \cdot C_{inv} + (l_{sum} - \#M') \cdot C_{and} \]
\[ D_{CM}(MF) = D_{inv} + \lceil \log lf_{max} \rceil \cdot D_{and} \]
\[ D_{CM}(M) = D_{inv} + \lceil \log l_{max} \rceil \cdot D_{and}. \]

Circuit CM is part of the circuit NS which implements the transition
function of the automaton.
Since the Moore components of the output are still computed as in the
Moore automaton, it holds

\[ out_j = \begin{cases} \bigvee_{z \in Z_j} S[z] & \text{for a Moore component} \\ \bigvee_{z \in Z'_j} \bigvee_{m \in MF(z,j)} S[z] \wedge m & \text{for a Mealy component.} \end{cases} \]

The number of monomials required for the computation of a Mealy output
$out_j$ equals

\[ \nu_j = \sum_{z \in Z'_j} \#MF(z, j). \]

In analogy to the frequency of a Moore output, we often refer to $\nu_j$ as the
frequency of the Mealy output $out_j$. Let $\nu_{max}$ and $\nu_{sum}$ denote the maximal
and the accumulated frequency of all outputs $out[\gamma-1:0]$. The cost and
the delay of circuit O which generates the outputs from the signals
$m \in MF$ and $S[k-1:0]$ can be estimated as:

\[ C_O = \nu_{sum} \cdot (C_{and} + C_{or}) - \gamma \cdot C_{or} \]
\[ D_O = D_{and} + \lceil \log \nu_{max} \rceil \cdot D_{or}. \]

A Mealy automaton computes the next state in the same way as a Moore
automaton, i.e., as outlined in section 2.6.4. The only difference is that in
the Mealy automaton, circuit CM generates the monomials required by the
transition function as well as those required by the output function. Thus,
the cost and delay of the next state circuit NS can now be expressed as

\[ C_{NS} = C_{CM} + C_{CN} \]
\[ D_{NS} = D_{CM}(M) + D_{CN}. \]

[Figure 2.43: Realization of a Mealy automaton. Circuit CM provides the
monomials $M$ to circuit CN and the monomials $MF$ to circuit O; the zero
tester, the multiplexer controlled by clr and the register S are as in
figure 2.40.]

Let $A(in)$ and $A(clr, ce)$ denote the accumulated delay of the input
signals $in$ and of the signals $clr$ and $ce$. A Mealy automaton (figure 2.43)
can then be realized at the following cost and delay:

\[ C_{Mealy} = C_{ff}(k) + C_O + C_{NS} + C_{zero}(k - 1) + C_{mux}(k) + C_{or} \]
\[ A(out) = A(in) + D_{CM}(MF) + D_O \]
\[ T_{Mealy} = \max\{A(clr, ce) + D_{or}, \; A(in) + D_{NS} + D_{zero}(k - 1)\} + D_{mux} + \Delta. \]

2.6.8 Interaction with the Data Paths

All the processor designs of this monograph consist of control automata
and the so called data paths (i.e., the rest of the hardware). In such a
scenario, the inputs $in$ of an automaton usually code the operation to be
executed; they are provided by the data paths DP. The outputs $out$ of the
automaton govern the data paths; these control signals at least comprise
all clock enable signals, output enable signals, and write signals of the
components in DP.
The interface between the control automaton and the data paths must be
treated with care for the following two reasons:

1. Not all of the possible outputs $out \in \{0, 1\}^\gamma$ are admissible; for some
   values $out$, the functionality of the data paths and of the whole
   hardware may be undefined. For example, if several tristate drivers are
   connected to the same bus, at most one of these drivers should be
   enabled at a time in order to prevent bus contentions.

2. The signals $in$ provided by the data paths usually depend on the
   current control signals $out$, and on the other hand, the output $out$ of the
   automaton may depend on the current input $in$. Thus, after clocking,
   the hardware does not necessarily get into a stable state again, i.e.,
   some control signals may not stabilize. However, stable control signals
   are crucial for the deterministic behavior of designs.

      


The functionality of combinatorial circuits is well defined [Weg87].
However, the structure of the processor hardware $H$ is more complicated; its
schematics also include flipflops, registers, RAMs, and tristate drivers.
These components require control signals which are provided by a control
automaton.
The control signals define for every value $out \in Out$ a modified hardware
$H(out)$. In the modified hardware, a tristate driver with active enable signal
$en = 1$ is treated like a gate which forwards its input data signals to its
outputs. A tristate driver with inactive enable signal $en = 0$ is treated as
if there were no connection between its inputs and its outputs. As a
consequence, components and combinatorial circuits of $H(out)$ can have
open inputs with an undefined input value.
A value $out \in Out$ of the control signals is called admissible if the
following conditions hold:

• Tristate drivers are used in the data paths but not in the control
  automata.

• In the modified hardware $H(out)$, combinatorial circuits and basic
  components may have open inputs with an undefined value. Despite
  these open inputs, each input of a register with active clock enable
  signal or of a RAM with active write signal has a value in $\{0, 1\}$.

• In the modified hardware $H(out)$, any transfer between registers and
  RAMs is of one of the four types depicted in figure 2.3.

Note that for all of our processor designs, it must be checked that the
control automata only generate admissible control signals.

     


In order to keep the functionality of the whole hardware (control
automaton and data paths DP) well defined, the data paths and the circuit O
are partitioned into $p$ parts each, $DP(1), \ldots, DP(p)$ and
$O(1), \ldots, O(p)$, such that

• Circuit $O(i)$ gets the inputs $in(i) \subseteq \{in_0, \ldots, in_{\sigma-1}\}$; these inputs
  are directly taken from registers or they are provided by circuits of
  $DP(j)$, with $j < i$.

• Circuit $O(i)$ generates the output signals
  $out(i) \subseteq \{out_0, \ldots, out_{\gamma-1}\}$. These signals only govern the data
  paths $DP(i)$.

Circuit NS can receive inputs from any part of the data paths and from any
output circuit $O(i)$.
The whole hardware which uses a common clock signal works in cycles.
Let the control automaton only generate admissible control signals $out$. It
then simply follows by induction over $p$ that the control signals $out(i)$
stabilize again after clocking the hardware, and that the functionality of
the hardware is well defined (for every clock cycle). The control signals
$out(i)$ have an accumulated delay of

\[ A_O(i) = D_O(1) + \sum_{j=2}^{i} (D_{DP}(j) + D_O(j)). \]

For Moore automata, such a partitioning of the data paths and of the
automaton is unnecessary, since the control signals do not depend on the
current input. However, in a Mealy automaton, the partitioning is essential.
The signals $out(1)$ then form the Moore component of the output.

"     5% -   
The output signals of a Mealy automaton tend to be on the time critical
path. Thus, it is essential for a good performance estimate to provide the
accumulated delay of every output circuit $O(i)$ or, equivalently, the
accumulated delay of every subset $out(i)$ of output signals:

\[ A_O(i) = A(out(i)). \]

Let $\nu_{max}(i)$ denote the maximal frequency of all output signals in $out(i)$,
and let $lf_{max}(i)$ denote the maximal length of the monomials in the
disjunctive normal forms $F(z, j)$, with $out_j \in out(i)$. Thus:

\[ A(out(i)) = A(in(i)) + D_{CM}(MF(i)) + D_O(i) \]
\[ D_O(i) = D_{and} + \lceil \log \nu_{max}(i) \rceil \cdot D_{or} \]
\[ D_{CM}(MF(i)) = D_{inv} + \lceil \log lf_{max}(i) \rceil \cdot D_{and}. \]

Table 2.8 summarizes all the parameters which must be determined from
the specification of the automaton in order to determine the cost and the
delay of a Mealy automaton.

Table 2.8: Parameters of a Mealy control automaton with p output levels
O(1), ..., O(p)

  Symbol        Meaning
  σ             # input signals in_j of the automaton
  γ             # output signals out_j of the automaton
  k             # states of the automaton
  fansum        accumulated fanin of all states z ≠ z0
  fanmax        maximal fanin of all states z ≠ z0
  #M'           # monomials m ∈ M' = MF ∪ M of the automaton
  lsum          accumulated length of all monomials m ∈ M'
  lfmax(i),     maximal length of the monomials of output level O(i)
  lmax          and of the monomials m ∈ M of the next state circuit
  νsum          accumulated frequency of all control signals
  νmax(i)       maximal frequency of the signals out(i) of level O(i)
  A(clr, ce)    accumulated delay of the clear and clock signals
  A(in(i)),     accumulated delay of the inputs in(i) of circuit O(i)
  A(in)         and of the inputs in of the next state circuit

2.7 Selected References and Further Reading

HE FORMAL hardware model used in this book is from [MP95]. The


 extensive use of recursive definitions in the construction of switching
circuits is very common in the field of Complexity of Boolean Functions;
a standard textbook is [Weg87]. The description of Booth recoding at an
appropriate level of detail is from [AT97], and the analysis of Booth re-
coding is from [PS98]. Standard textbooks on computer arithmetic are
[Kor93, Omo94]. An early text on computer arithmetic with complete cor-
rectness proofs is [Spa76].

$ %&

  Let m  n2 for any n  1. The high order sum bits of
an n-bit incrementer with inputs an  1 : 0, cin and output sn : 0 can be
expressed as
sn : m  an  1 : m  cm 1


an  1 : m if cm 1  0


an  1 : m  1 if cm 1  1


(
 
where cm 1 denotes the carry from position m  1 to position m. This

BASICS suggests for the circuit of an incrementer the simple construction of fig-
ure 2.15 (page 26); the original problem is reduced to only two half-sized
problems. Apply this construction recursively and derive formulae for the
cost and the delay of the resulting incrementer circuit CSI.

  Derive formulae for the cost and the delay of an n m-mul-
tiplier which is constructed according to the school method, using carry
chain adders as building blocks.

  In section 2.5.4, we constructed and analyzed addition trees


T m for n m-multipliers without Booth recoding. The design was re-
stricted to m satisfying the condition

34  M  m  M with M 

2 log m


This exercise deals with the construction of the tree T m for the remain-
ing cases, i.e., for M 2  m  34  M. The bottom portion of the tree is
still a completely regular and balanced 4/2-tree T M 4 with M 4 many
pairs of inputs and M 8 many 4/2-adders as leaves. In the top level, we
now have a many 3/2-adders and M 4  a many pairs of inputs which are
directly fed to the 4/2-tree T M 4. Here, a is the solution of the equation

3a  2  M 4  a  m

hence
a  m  M 2
For i  0 1   , the partial products Si 1 are entered into the tree from right


to left and that in the top level of the tree the 3/2-adders are placed at the
right-hand side.

1. Determine the number of excess adders in the tree T m and derive


formulae for its cost and delay.
2. The Booth multiplier of section 2.5.5 used a modified addition tree
T m in order to sum the Booth recoded partial products. Extend
the cost and delay formulae for the case that M 2  m  34  M .

Chapter 3

A Sequential DLX Design

In the remainder of this book we develop a pipelined DLX machine with precise interrupts, caches and an IEEE-compliant floating point unit. The starting point of our designs is a sequential DLX machine without interrupt processing, caches and floating point unit. The cost effectiveness of later designs will be compared with the cost effectiveness of this basic machine.

We will be able to reuse almost all designs from this chapter. The design process will be – almost – strictly top down.

3.1 Instruction Set Architecture

We specify the DLX instruction set without floating point instructions and without interrupt handling. DLX is a RISC architecture with only three instruction formats. It uses 32 general purpose registers $GPR[j][31:0]$ for $j = 0, \ldots, 31$. Register $GPR[0]$ is always 0.

Load and store operations move data between the general purpose registers and the memory M. There is a single addressing mode: the effective address ea is the sum of a register and an immediate constant. Except for shifts, immediate constants are always sign extended.

Format   Fields (width in bits), listed from the most significant bit
I-type   opcode (6), RS1 (5), RD (5), immediate (16)
R-type   opcode (6), RS1 (5), RS2 (5), RD (5), SA (5), function (6)
J-type   opcode (6), PC offset (26)

Figure 3.1 The three instruction formats of the DLX fixed point core. RS1 and RS2 are source registers; RD is the destination register. SA specifies a special purpose register or an immediate shift amount; function is an additional 6-bit opcode.

3.1.1 Instruction Formats

All three instruction formats (figure 3.1) have a 6-bit primary opcode and
specify up to three explicit operands. The I-type (Immediate) format spec-
ifies two registers and a 16-bit constant. That is the standard layout for
instructions with an immediate operand. The J-type (Jump) format is used
for control instructions. They require no explicit register operand and profit
from a larger 26-bit immediate operand. The third format, R-type (Regis-
ter) format, provides an additional 6-bit opcode (function). The remaining
20 bits specify three general purpose registers and a field SA which spec-
ifies a 5-bit constant or a special purpose register. A 5-bit constant, for
example, is sufficient for a shift amount.
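The field layout can be made concrete with a small Python sketch that slices a 32-bit instruction word. It assumes that the fields are packed from the most significant bit downward in the order shown in figure 3.1; the function name `decode` and the dictionary keys are chosen for this illustration only.

```python
def decode(ir):
    """Split a 32-bit instruction word into the fields of the three formats."""
    return {
        "opcode":   (ir >> 26) & 0x3F,     # 6-bit primary opcode
        "RS1":      (ir >> 21) & 0x1F,     # first source register
        "RS2":      (ir >> 16) & 0x1F,     # R-type: second source; I-type: RD
        "RD":       (ir >> 11) & 0x1F,     # R-type destination register
        "SA":       (ir >> 6) & 0x1F,      # 5-bit constant or special register
        "function": ir & 0x3F,             # 6-bit secondary opcode (R-type)
        "imm16":    ir & 0xFFFF,           # I-type immediate
        "offset26": ir & 0x3FFFFFF,        # J-type PC offset
    }

# An I-type word with opcode hx09, RS1 = 1, RD = 2 and immediate hxffff.
word = (0x09 << 26) | (1 << 21) | (2 << 16) | 0xFFFF
fields = decode(word)
```

Which of the overlapping fields is meaningful depends on the instruction format, so a real decoder would select among them based on the opcode.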

3.1.2 Instruction Set Coding

Since the DLX description in [HP90] does not specify the coding of the instruction set, we adapt the coding of the MIPS R2000 machine ([PH94, KH92]) to the DLX instruction set. Tables 3.1 through 3.3 list for each DLX instruction its effect and its coding; the prefix “hx” indicates that the number is represented as hexadecimal. Taken alone, the tables are almost but not quite a mathematical definition of the semantics of the DLX machine language. Recall that mathematical definitions have to make sense if taken literally.
So, let us try to take the effect

RD = (RS1 > imm ? 1 : 0)
Table 3.1 I-type instruction layout. All instructions except the control instructions also increment the PC by four; sxt(a) is the sign-extended version of a. The effective address of memory accesses equals ea = GPR[RS1] + sxt(imm), where imm is the 16-bit immediate. The width of the memory access in bytes is indicated by d. Thus, the memory operand equals m = M[ea + d − 1], ..., M[ea].

IR[31:26]  Mnemonic  d  Effect

Data Transfer
hx20       lb        1  RD = sxt(m)
hx21       lh        2  RD = sxt(m)
hx23       lw        4  RD = m
hx24       lbu       1  RD = 0^24 m
hx25       lhu       2  RD = 0^16 m
hx28       sb        1  m = RD[7:0]
hx29       sh        2  m = RD[15:0]
hx2b       sw        4  m = RD
Arithmetic, Logical Operation
hx08       addi         RD = RS1 + imm
hx09       addi         RD = RS1 + imm
hx0a       subi         RD = RS1 - imm
hx0b       subi         RD = RS1 - imm
hx0c       andi         RD = RS1 ∧ sxt(imm)
hx0d       ori          RD = RS1 ∨ sxt(imm)
hx0e       xori         RD = RS1 ⊕ sxt(imm)
hx0f       lhgi         RD = imm 0^16
Test Set Operation
hx18       clri         RD = (false ? 1 : 0)
hx19       sgri         RD = (RS1 > imm ? 1 : 0)
hx1a       seqi         RD = (RS1 = imm ? 1 : 0)
hx1b       sgei         RD = (RS1 ≥ imm ? 1 : 0)
hx1c       slsi         RD = (RS1 < imm ? 1 : 0)
hx1d       snei         RD = (RS1 ≠ imm ? 1 : 0)
hx1e       slei         RD = (RS1 ≤ imm ? 1 : 0)
hx1f       seti         RD = (true ? 1 : 0)
Control Operation
hx04       beqz         PC = PC + 4 + (RS1 = 0 ? imm : 0)
hx05       bnez         PC = PC + 4 + (RS1 ≠ 0 ? imm : 0)
hx16       jr           PC = RS1
hx17       jalr         R31 = PC + 4; PC = RS1

Table 3.2 R-type instruction layout. All instructions execute PC += 4. SA denotes the 5-bit immediate shift amount specified by the bits IR[10:6].

IR[31:26]  IR[5:0]  Mnemonic  Effect

Shift Operation
hx00       hx00     slli      RD = sll(RS1, SA)
hx00       hx02     srli      RD = srl(RS1, SA)
hx00       hx03     srai      RD = sra(RS1, SA)
hx00       hx04     sll       RD = sll(RS1, RS2[4:0])
hx00       hx06     srl       RD = srl(RS1, RS2[4:0])
hx00       hx07     sra       RD = sra(RS1, RS2[4:0])
Arithmetic, Logical Operation
hx00       hx20     add       RD = RS1 + RS2
hx00       hx21     add       RD = RS1 + RS2
hx00       hx22     sub       RD = RS1 - RS2
hx00       hx23     sub       RD = RS1 - RS2
hx00       hx24     and       RD = RS1 ∧ RS2
hx00       hx25     or        RD = RS1 ∨ RS2
hx00       hx26     xor       RD = RS1 ⊕ RS2
hx00       hx27     lhg       RD = RS2[15:0] 0^16
Test Set Operation
hx00       hx28     clr       RD = (false ? 1 : 0)
hx00       hx29     sgr       RD = (RS1 > RS2 ? 1 : 0)
hx00       hx2a     seq       RD = (RS1 = RS2 ? 1 : 0)
hx00       hx2b     sge       RD = (RS1 ≥ RS2 ? 1 : 0)
hx00       hx2c     sls       RD = (RS1 < RS2 ? 1 : 0)
hx00       hx2d     sne       RD = (RS1 ≠ RS2 ? 1 : 0)
hx00       hx2e     sle       RD = (RS1 ≤ RS2 ? 1 : 0)
hx00       hx2f     set       RD = (true ? 1 : 0)

Table 3.3 J-type instruction layout. sxt(imm) is the sign-extended version of the 26-bit immediate called PC offset.

IR[31:26]  Mnemonic  Effect

Control Operation
hx02       j         PC = PC + 4 + sxt(imm)
hx03       jal       R31 = PC + 4; PC = PC + 4 + sxt(imm)

of instruction sgri in table 3.1 literally: the 5-bit string RS1 is compared with the 16-bit string imm using a comparison “>” which is not defined for such pairs of strings. The 1-bit result of the comparison is assigned to the 5-bit string RD.
This insanity can be fixed by providing the following rules specifying the abbreviations and conventions which are used everywhere in the tables.
1. RD is a shorthand for $GPR[RD]$. Strictly speaking, it is actually a shorthand for $GPR[\langle RD \rangle]$. The same holds for RS1 and RS2.

2. Except in logical operations, immediate constants imm are always two’s complement numbers.

3. In arithmetic operations and in test set operations, the equations refer to two’s complement numbers.

4. All integer arithmetic is modulo $2^{32}$. This includes all address calculations and, in particular, all computations involving the PC.
By lemma 2.2 we know that $\langle a \rangle \equiv [a] \bmod 2^{32}$ for 32-bit addresses a. Thus, the last convention implies that it does not matter whether we interpret addresses as two’s complement numbers or as binary numbers.
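The purpose of abbreviations and conventions is to turn long descriptions into short descriptions. In the tables 3.1 through 3.3, this has been done quite successfully. For three of the DLX instructions, we now list the almost unabbreviated semantics, where sxt(imm) denotes the 32-bit sign extended version of imm.
1. Arithmetic instruction addi:
$$[GPR[\langle RD \rangle]] = [GPR[\langle RS1 \rangle]] + [imm] \bmod 2^{32}$$
$$= [GPR[\langle RS1 \rangle]] + [sxt(imm)]$$

2. Test set instruction sgri:

$$GPR[\langle RD \rangle] = ([GPR[\langle RS1 \rangle]] > [imm] \;?\; 1 : 0)$$

or, equivalently

$$GPR[\langle RD \rangle] = 0^{31}\,([GPR[\langle RS1 \rangle]] > [sxt(imm)] \;?\; 1 : 0)$$

3. Branch instruction beqz:

$$\langle PC \rangle = \langle PC \rangle + 4 + (GPR[\langle RS1 \rangle] = 0 \;?\; [imm] : 0) \bmod 2^{32}$$
$$= \langle PC \rangle + 4 + (GPR[\langle RS1 \rangle] = 0 \;?\; [sxt(imm)] : 0) \bmod 2^{32}$$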
Observe that in the more detailed equations many hints for the implementation of the instructions become visible: immediate constants should be sign extended, and the 1-bit result of tests should be extended by 31 zeros.
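The conventions can be tried out in a few lines of Python. This is a minimal sketch under the stated conventions (two's complement operands, arithmetic modulo 2^32); the helper names `sxt` and `twoc` and the representation of the register file as a list of integers are assumptions of this example, and the special role of GPR[0] is not modelled.

```python
MASK32 = 0xFFFFFFFF

def sxt(value, width=16):
    """Sign-extend a width-bit string (held in a non-negative int) to 32 bits."""
    sign = 1 << (width - 1)
    return ((value ^ sign) - sign) & MASK32

def twoc(x):
    """Two's complement value of a 32-bit string."""
    return x - (1 << 32) if x & 0x80000000 else x

def addi(gpr, rs1, rd, imm):
    gpr[rd] = (gpr[rs1] + sxt(imm)) & MASK32           # arithmetic mod 2^32

def sgri(gpr, rs1, rd, imm):
    gpr[rd] = 1 if twoc(gpr[rs1]) > twoc(sxt(imm)) else 0   # 0^31 followed by the test bit

def beqz(pc, gpr, rs1, imm):
    return (pc + 4 + (sxt(imm) if gpr[rs1] == 0 else 0)) & MASK32

gpr = [0] * 32
gpr[1] = 5
addi(gpr, 1, 2, 0xFFFF)          # 5 + (-1)
sgri(gpr, 1, 3, 0xFFFF)          # 5 > -1
new_pc = beqz(0x100, gpr, 0, 0xFFFC)   # taken branch with offset -4
```

Sign extension is what makes the 16-bit immediate hxffff act as the number -1 in all three instructions.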

3.1.3 Memory Organization

The memory is byte addressable, i.e., each memory address j specifies a memory location $M[j]$ capable of storing a single byte. The memory performs byte, half word, and word accesses. All instructions are coded in four bytes. In memory, data and instructions are aligned in the following way:

• half words must have even (byte) addresses. A half word h with address e is stored in memory such that
$$h[15:0] = M[e+1 : e]$$

• words or instructions must have (byte) addresses divisible by four. These addresses are called word boundaries. A word or instruction w with address e is stored in memory such that
$$w[31:0] = M[e+3 : e]$$

The crucial property of this storage scheme is that half words, words and instructions stored in memory never cross word boundaries (see figure 3.2). For word boundaries e, we define the memory word with address e as

$$Mword[e] = M[e+3 : e]$$

Moreover, we number the bytes of words $w[31:0]$ in little endian order (figure 3.3), i.e.:
$$byte_j(w) = w[8j+7 : 8j]$$
$$byte_{i:j}(w) = byte_i(w) \ldots byte_j(w)$$
The definitions immediately imply the following lemma:

Lemma 3.1 Let $a[31:0]$ be a memory address, and let e be the word boundary $e = \langle a[31:2]\,00 \rangle$. Then
1. the byte with address $\langle a \rangle$ is stored in byte $\langle a[1:0] \rangle$ of the memory word with address e:
$$M[\langle a \rangle] = byte_{\langle a[1:0] \rangle}(Mword[\langle a[31:2]\,00 \rangle])$$

Figure 3.2 Storage scheme in a 1-bank memory system (a) and in a 4-bank memory system (b). A bank is always one byte wide. $a' = a[31:2]\,00$ and $e = \langle a' \rangle$.

Figure 3.3 Ordering of the bytes within a word $w[31:0]$ in little endian order: byte3 holds bits 31 to 24, byte2 bits 23 to 16, byte1 bits 15 to 8, and byte0 bits 7 to 0.

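2. The piece of data which is d bytes wide and has address $\langle a \rangle$ is stored in the bytes $\langle a[1:0] \rangle$ to $\langle a[1:0] \rangle + d - 1$ of the memory word with address e:

$$byte_{\langle a[1:0] \rangle + d - 1 \,:\, \langle a[1:0] \rangle}(Mword[\langle a[31:2]\,00 \rangle])$$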
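The lemma can be checked on a toy memory. The following Python sketch models the byte-addressable memory as a list of byte values; the function names `mword`, `byte_range` and `load` are assumptions of this illustration.

```python
def mword(mem, e):
    """Memory word at word boundary e, with byte e as least significant byte."""
    assert e % 4 == 0
    return sum(mem[e + k] << (8 * k) for k in range(4))

def byte_range(w, hi, lo):
    """byte_{hi:lo}(w): the bytes lo..hi of word w as one integer."""
    return (w >> (8 * lo)) & ((1 << (8 * (hi - lo + 1))) - 1)

def load(mem, a, d):
    """Read the aligned d-byte datum at address a via its memory word."""
    assert a % d == 0                  # alignment: data never cross a word boundary
    e, offset = a & ~3, a & 3          # e = <a[31:2] 00>, offset = <a[1:0]>
    return byte_range(mword(mem, e), offset + d - 1, offset)

mem = list(range(16))                  # byte at address j holds the value j
half = load(mem, 6, 2)
```

The alignment assertion is what guarantees that `offset + d - 1` never exceeds 3, so one memory word always suffices.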

3.2 High Level Data Paths

Figure 3.4 presents a high level view of the data paths of the machine. It shows busses, drivers, registers, a zero tester, a multiplexer, and the environments. Environments are named after some major unit or a register. They contain that unit or register plus some glue logic that is needed to adapt that unit or register to the coding of the instruction set. Table 3.4 gives a short description of the units used in figure 3.4. The reader should copy the table or better learn it by heart.
Figure 3.4 High level view of the sequential DLX data paths

We use the following naming conventions:

1. Clock enable signals for register R are called Rce. Thus, IRce is the
clock enable signal of the instruction register.

2. A driver from X to bus Y is called XYd, its output enable signal is called XYdoe. Thus, SH4LDdoe is the output enable signal of the driver from the shifter for loads to the internal data bus.

3. A mux from anywhere to Y is called Ymux. Its select signal is called Ymuxsel.

We complete the design of the machine and we provide a rigorous proof


that it works in a completely structured way. This involves the following
three steps:

1. For each environment we specify its behavior and we then design it


to meet the specifications.

2. We specify a Moore automaton which controls the data paths.

3. We show that the machine interprets the instruction set, i.e., that the
hardware works correctly.
Table 3.4 Units and busses of the sequential DLX data paths
Large Units, Environments
GPRenv environment of the general purpose register file GPR
ALUenv environment of the arithmetic logic unit ALU
SHenv environment of the shifter SH
SH4Lenv environment of the shifter for loads SH4L
PCenv environment of the program counter PC
IRenv environment of the instruction register IR
Menv environment of the memory M
Registers
A, B output registers of GPR
MAR memory address register
MDRw memory data register for data to be written to M
MDRr memory data register for data read from M
Busses
A’, B’ input of register A and register B
a, b left/right source operand of the ALU and the SH
D internal data bus of the CPU
MA memory address
MDin Input data of the memory M
MDout Output data of the memory M
Inputs for the control
AEQZ indicates that the current content of register A equals zero
IR[31:26] primary opcode
IR[5:0] secondary opcode

Theoretically, we could postpone the design of the environments to the end.


The design process would then be strictly top down – but the specification
of seven environments in a row would be somewhat tedious to read.

3.3 Environments

3.3.1 General Purpose Register File

The general purpose register file environment contains a 32-word 3-port register file with registers $GPR[i][31:0]$ for $i = 0, \ldots, 31$. It is controlled by three control signals, namely

• the write signal GPRw of the register file GPR,
• signal Rtype indicating an R-type instruction, and
• signal Jlink indicating a jump and link instruction (jal, jalr).

In each cycle, the behavior of the environment is completely specified by very few equations. The first equations specify that the registers with addresses RS1 and RS2 are always read and provided as inputs to registers A and B. Reading from address 0, however, should force the output of the register file environment to zero.

$$A' = \begin{cases} GPR[RS1] & \text{if } RS1 \neq 0 \\ 0 & \text{if } RS1 = 0 \end{cases}$$

$$B' = \begin{cases} GPR[RS2] & \text{if } RS2 \neq 0 \\ 0 & \text{if } RS2 = 0 \end{cases}$$

Let Cad be the address to which register C is written. This address is usually specified by RD. In case of jump and link instructions (Jlink = 1), however, the PC must be saved into register 31. Writing should only occur if the signal GPRw is active:

$$Cad = \begin{cases} RD & \text{if } Jlink = 0 \\ 31 & \text{if } Jlink = 1 \end{cases}$$

$$GPR[Cad] := C' \quad \text{if } GPRw = 1$$

The remaining equations simply specify the positions of the fields RS1, RS2 and RD; only the position of RD depends on the type of the instruction:

$$RS1 = IR[25:21]$$
$$RS2 = IR[20:16]$$
$$RD = \begin{cases} IR[20:16] & \text{if } Rtype = 0 \\ IR[15:11] & \text{if } Rtype = 1 \end{cases}$$

This completes the specification of the GPR environment.
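The specification is small enough to be written down as an executable model. The sketch below is a behavioural Python model, not the gate-level design; the class name `GPREnv` and its method names are assumptions of this illustration.

```python
class GPREnv:
    """Behavioural model of the GPR environment specification."""

    def __init__(self):
        self.gpr = [0] * 32

    def read(self, ir):
        """Return (A', B'); reading address 0 is forced to zero."""
        rs1 = (ir >> 21) & 0x1F
        rs2 = (ir >> 16) & 0x1F
        a = self.gpr[rs1] if rs1 != 0 else 0
        b = self.gpr[rs2] if rs2 != 0 else 0
        return a, b

    def write(self, ir, c, gprw, rtype, jlink):
        """Write C' to address Cad if GPRw is active."""
        rd = (ir >> 11) & 0x1F if rtype else (ir >> 16) & 0x1F
        cad = 31 if jlink else rd
        if gprw:
            self.gpr[cad] = c & 0xFFFFFFFF

env = GPREnv()
r_type = (3 << 21) | (4 << 16) | (5 << 11)     # RS1 = 3, RS2 = 4, RD = 5
env.write(r_type, 42, gprw=1, rtype=1, jlink=0)
env.write(r_type, 0x104, gprw=1, rtype=1, jlink=1)
```

Note that the model, like the specification, does not prevent writes to address 0; it only forces the value read from address 0 to zero.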
Circuit CAddr of figure 3.5 generates the destination address at the following cost and delay:

$$C_{CAddr} = 2 \cdot C_{mux}(5)$$
$$D_{CAddr} = 2 \cdot D_{mux}(5)$$

The design in figure 3.5 is a straightforward implementation of the GPR environment with the cost:

$$C_{GPRenv} = C_{ram3}(32, 32) + C_{CAddr} + 2 \cdot (C_{zero}(5) + C_{inv} + C_{and}(32))$$


Figure 3.5 Implementation of the GPR environment

The register file performs two types of accesses; it provides data A' and B', or it writes data C' back. The read access accounts for the delay

$$D_{GPR,read} = D_{GPRenv}(IR, GPRw; A', B') = \max\{D_{ram3}(32, 32),\; D_{zero}(5) + D_{inv}\} + D_{and}$$

whereas the write access takes time

$$D_{GPR,write} = D_{CAddr} + D_{ram3}(32, 32)$$

3.3.2 Instruction Register Environment

This environment is controlled by the three control signals

• Jjump indicating a J-type jump instruction,
• shiftI indicating a shift instruction with an immediate operand, and
• the clock enable signal IRce of the instruction register IR.

The environment contains the instruction register, which is loaded from the bus MDout. Thus,

$$IR := MDout \quad \text{if } IRce = 1$$

The environment IRenv outputs the 32-bit constant

$$co[31:0] = \begin{cases} 0^{27}\,SA & \text{if } shiftI = 1 \\ sxt(imm) & \text{if } shiftI = 0 \end{cases}$$
Figure 3.6 Implementation of the IR environment

where sxt(a) denotes the 32-bit, sign extended representation of a. The position of the shift amount SA and of the immediate constant imm in the instruction word is specified by

$$SA = IR[10:6]$$
$$imm = \begin{cases} IR[15:0] & \text{if } Jjump = 0 \\ IR[25:0] & \text{if } Jjump = 1 \end{cases}$$

This completes the specification of the environment IRenv. The design in figure 3.6 is a straightforward implementation. Its cost and the delay of output co are:

$$C_{IRenv} = C_{ff}(32) + C_{mux}(15)$$
$$D_{IRenv}(co) = D_{mux}(15)$$
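A behavioural sketch of the constant output may help to see the three cases side by side. The Python function below follows the specification of co; its name and argument order are assumptions of this example, and it says nothing about the mux structure of figure 3.6.

```python
def co(ir, jjump, shifti):
    """32-bit constant produced by IRenv for instruction word ir."""
    if shifti:
        return (ir >> 6) & 0x1F                  # 0^27 followed by SA = IR[10:6]
    width = 26 if jjump else 16                  # imm = IR[25:0] or IR[15:0]
    imm = ir & ((1 << width) - 1)
    sign = 1 << (width - 1)
    return ((imm ^ sign) - sign) & 0xFFFFFFFF    # sign extension to 32 bits

negative_16 = co(0x0000FFFF, jjump=0, shifti=0)
negative_26 = co(0x02000000, jjump=1, shifti=0)
shift_amount = co(9 << 6, jjump=0, shifti=1)
```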

3.3.3 PC Environment

This environment is controlled by the reset signal and the clock enable signal PCce of the PC. If the reset signal is active, then the start address $0^{32}$ of the boot routine is clocked into the PC register:

$$PC := \begin{cases} D & \text{if } PCce \land \overline{reset} \\ 0^{32} & \text{if } reset \end{cases}$$

This completes the specification of the PC environment. The design in figure 3.7 implements PCenv in a straightforward manner. Let $D_{PCenv}(In; PC)$ denote the delay which environment PCenv adds to the delay of the inputs of register PC. Thus:

$$C_{PCenv} = C_{ff}(32) + C_{mux}(32) + C_{or}$$
$$D_{PCenv}(In; PC) = \max\{D_{mux}(32),\; D_{or}\}$$

Figure 3.7 Implementation of the PC environment

3.3.4 ALU Environment

This environment is controlled by the three control signals

• Rtype indicating an R-type instruction,
• add forcing the ALU to add, and
• test forcing the ALU to perform a test and set operation.

The ALU is used for arithmetic/logic operations and for test operations. The type of the ALU operation is specified by three bits which we call $f[2:0]$. These bits are the last three bits of the primary or secondary opcode, depending on the type of instruction:

$$f[2:0] = \begin{cases} IR[28:26] & \text{if } Rtype = 0 \\ IR[2:0] & \text{if } Rtype = 1 \end{cases}$$

In case a test operation is performed, the result $t \in \{0, 1\}$ is specified by table 3.5. In case of an arithmetic/logic operation, the result al is specified by table 3.6. Observe that in this table $al = a + b$ is a shorthand for $[al] = [a] + [b] \bmod 2^{32}$; the meaning of $a - b$ is defined similarly. For later use, we define the notation

$$al = a \;op\; b$$

The flag ovf of the arithmetic unit AU indicates an overflow, i.e., it indicates that the value $[a] \;op\; [b]$ does not lie in the range $T_{32}$ of a 32-bit two’s complement number.
If signal add is activated, the ALU performs plain binary addition modulo $2^{32}$. The final output alu of the ALU is selected under control of the signals test and add in an obvious way such that

$$alu = \begin{cases} 0^{31}\,t & \text{if } test = 1 \\ al & \text{if } test = 0 \land add = 0 \end{cases}$$

$$\langle alu \rangle = \langle a \rangle + \langle b \rangle \bmod 2^{32} \quad \text{if } test = 0 \land add = 1$$

This completes the specification of the ALU.
Table 3.5 Specification of the test condition

cond.  false  a > b  a = b  a ≥ b  a < b  a ≠ b  a ≤ b  true
f2     0      0      0      0      1      1      1      1
f1     0      0      1      1      0      0      1      1
f0     0      1      0      1      0      1      0      1

Table 3.6 Coding of the arithmetic/logical ALU operations

       a + b  a − b  a ∧ b  a ∨ b  a ⊕ b  b[15:0] 0^(n−16)
f2     0      0      1      1      1      1
f1     0      1      0      0      1      1
f0     *      *      0      1      0      1
   
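The coding of conditions from table 3.5 is frequently used. The obvious implementation proceeds in two steps. First, one computes the auxiliary signals l, e, g (less, equal, greater) with

$$l = 1 \iff [a] < [b] \iff [a] - [b] < 0$$
$$e = 1 \iff [a] = [b] \iff [a] - [b] = 0$$
$$g = 1 \iff [a] > [b] \iff [a] - [b] > 0$$

and then, one generates

$$t(a, b, f) = f_2 \land l \;\lor\; f_1 \land e \;\lor\; f_0 \land g$$

Figure 3.8 depicts a realization along these lines using an arithmetic unit from section 2.4. Assuming that the subtraction signal sub is active, it holds

$$l = neg$$
$$e = 1 \iff s[31:0] = 0^{32}$$
$$g = \overline{e} \land \overline{l}$$

The cost and the delay of a 32-bit comparator are

$$C_{comp}(32) = C_{zero}(32) + 2 \cdot C_{inv} + 4 \cdot C_{and} + 2 \cdot C_{or}$$
$$D_{comp}(32) = \max\{D_{inv} + D_{and},\; D_{zero}(32) + D_{or},\; D_{zero}(32) + D_{inv}\} + D_{and} + D_{or}$$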

Figure 3.8 Arithmetic unit supplemented by the comparator circuit

Figure 3.9 Implementation of the ALU comprising an arithmetic unit AU, a logic unit LU and a comparator

 3  
The coding of the arithmetic/logic functions in table 3.6 translates in a straightforward way into figure 3.9. Thus, the cost and the delay of the logic unit LU and of this ALU run at

$$C_{LU}(32) = C_{and}(32) + C_{or}(32) + C_{xor}(32) + 3 \cdot C_{mux}(32)$$
$$D_{LU}(32) = \max\{D_{and},\; D_{or},\; D_{xor}\} + 2 \cdot D_{mux}$$
$$C_{ALU} = C_{AU}(32) + C_{LU}(32) + C_{comp}(32) + 2 \cdot C_{mux}(32)$$
$$D_{ALU} = \max\{D_{AU}(32) + D_{comp}(32),\; D_{AU}(32) + D_{mux},\; D_{LU}(32) + D_{mux}\} + D_{mux}$$

Figure 3.10 Glue logic of the ALU environment

 9- 3 
Figure 3.10 suggests how to generate the signals sub and $f[2:0]$ from control signals add and Rtype. The mux controlled by signal Rtype selects between primary and secondary opcode. The mux controlled by add can force $f[2:0]$ to 000, that is the code for addition.
The arithmetic unit is only used for tests and arithmetic operations. In case of an arithmetic ALU operation, the operation of the AU is an addition (add or addi) if $f_1 = 0$ and it is a subtraction (sub or subi) if $f_1 = 1$. Hence, the subtraction signal can be generated as

$$sub = test \lor f_1$$
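A short Python sketch of the glue logic shows how the opcodes of tables 3.1 and 3.2 map to f and sub. It assumes that the bit f1 entering the OR gate is taken after the mux controlled by add; the function name `alu_glue` is an assumption of this example.

```python
def alu_glue(ir, rtype, add, test):
    """Return (f, sub) as generated by the ALU glue logic."""
    f = (ir & 0x7) if rtype else ((ir >> 26) & 0x7)   # IR[2:0] or IR[28:26]
    if add:
        f = 0                                         # force the code for addition
    sub = test | ((f >> 1) & 1)                       # sub = test OR f1
    return f, sub

subi_glue = alu_glue(0x0B << 26, rtype=0, add=0, test=0)   # subi, primary opcode hx0b
addi_glue = alu_glue(0x09 << 26, rtype=0, add=0, test=0)   # addi, primary opcode hx09
sub_glue = alu_glue(0x22, rtype=1, add=0, test=0)          # sub, secondary opcode hx22
```

Forcing f to 000 with add is what lets the control reuse the ALU for address computations regardless of the opcode bits.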

The environment ALUenv consists of the ALU circuit and the ALU glue logic. Thus, for the entire ALU environment, we get the following cost and delay:

$$C_{ALUglue} = C_{or} + 2 \cdot C_{mux}(3)$$
$$D_{ALUglue} = 2 \cdot D_{mux}(3) + D_{or}$$
$$C_{ALUenv} = C_{ALU} + C_{ALUglue}$$
$$D_{ALUenv} = D_{ALUglue} + D_{ALU}$$

3.3.5 Memory Environment

The memory environment Menv is controlled by three signals

• mr indicating a memory read access,
• mw indicating a memory write access, and
• fetch indicating an instruction fetch.

On instruction fetch (i.e., fetch = 1), the memory write signal must be inactive, i.e., mw = 0. The address of a memory access is always specified by the value on the memory address bus $MA[31:0]$.
Table 3.7 Coding the width of a memory write access

IR[27:26]  d  MAR[1:0]  mbw[3:0]
00         1  00        0001
              01        0010
              10        0100
              11        1000
01         2  00        0011
              10        1100
11         4  00        1111

Recall that the memory M is byte addressable. Half words are aligned at even (byte) addresses; instructions and words are aligned at word boundaries, i.e., at (byte) addresses divisible by 4. Due to the alignment, memory data never cross word boundaries. We therefore organize the memory in such a way that for every word boundary e the memory word

$$Mword[e] = M[e+3 : e]$$

can be accessed in parallel. Thus, a single access suffices in order to load or store every byte, half word, word or instruction.
If mr = 1, the memory environment Menv performs a read operation, i.e., a load operation or an instruction fetch. Menv then provides on the bus MDout the word

$$MDout[31:0] = Mword[\langle MA[31:2]\,00 \rangle]$$

If the read operation accesses the d-byte data X, by lemma 3.1, X is then the subword

$$X = byte_{\langle MA[1:0] \rangle + d - 1 \,:\, \langle MA[1:0] \rangle}(MDout)$$

of the memory bus MDout.

On mw = 1, the fetch signal is inactive (fetch = 0). Thus, a store operation is executed, and the memory environment performs a write operation. During a store operation, the bits $IR[27:26]$ of the primary opcode specify the number of bytes d to be stored in memory (table 3.7). The address of the store is specified by the memory address register MAR. If the d-byte data X are to be stored, then the memory environment expects them as the subword

$$X = byte_{\langle MAR[1:0] \rangle + d - 1 \,:\, \langle MAR[1:0] \rangle}(MDin)$$

Figure 3.11 Connecting the memory banks to the data and address busses

of the memory bus MDin and performs the write operation

$$M[e + d - 1 : e] := X$$

The data on the memory bus MDin are provided by register MDRw. For later use, we introduce for this the notation

$$m = bytes(MDRw)$$

Since memory accesses sometimes require multiple clock cycles, we need a signal mbusy indicating that the current memory access will not be completed during the current clock cycle. This signal is an input of the control unit; it can only be active on a memory access, i.e., if mr = 1 or mw = 1. We expect signal mbusy to be valid $d_{mstat}$ time units after the start of each clock cycle.
This completes the specification of the memory environment Menv. Its realization is fairly straightforward. We use four memory banks $MB[j]$ with $j = 0, \ldots, 3$. Each bank $MB[j]$ is one byte wide and has its own write signal $mbw[j]$. Figure 3.11 depicts how the four banks are connected to the 32-bit data and address busses.

Memory Control

The bank write signals $mbw[3:0]$ are generated as follows: Feeding the address bits $MAR[1:0]$ into a 2-decoder gives four signals $B[3:0]$ satisfying

$$B[j] = 1 \iff \langle MAR[1:0] \rangle = j$$

for all j. From the last two bits of the opcode, we decode the width of the current access according to table 3.7 by

$$B = \overline{IR[26]}$$
$$H = \overline{IR[27]} \land IR[26]$$
$$W = IR[27] \land IR[26]$$
Figure 3.12 Memory control MC. Circuit GenMbw generates the bank write signals according to Equation 3.1

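The bank write signals are then generated in a brute force way by

$$mbw[0] = mw \land B[0]$$
$$mbw[1] = mw \land (W \land B[0] \;\lor\; H \land B[0] \;\lor\; B \land B[1])$$
$$mbw[2] = mw \land (W \land B[0] \;\lor\; H \land B[2] \;\lor\; B \land B[2]) \qquad (3.1)$$
$$mbw[3] = mw \land (W \land B[0] \;\lor\; H \land B[2] \;\lor\; B \land B[3])$$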
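The equations can be replayed against table 3.7 with a few lines of Python. This is a bit-level sketch; the function name `bank_write` and the packing of the result into a 4-bit integer are assumptions of this example.

```python
def bank_write(ir27_26, mar10, mw):
    """Bank write signals mbw[3:0], returned as a 4-bit integer."""
    dec = [int(mar10 == j) for j in range(4)]      # 2-decoder outputs B[3:0]
    ir27, ir26 = (ir27_26 >> 1) & 1, ir27_26 & 1
    byte = 1 - ir26                                # B = NOT IR[26]
    half = (1 - ir27) & ir26                       # H = NOT IR[27] AND IR[26]
    word = ir27 & ir26                             # W = IR[27] AND IR[26]
    m0 = mw & dec[0]
    m1 = mw & (word & dec[0] | half & dec[0] | byte & dec[1])
    m2 = mw & (word & dec[0] | half & dec[2] | byte & dec[2])
    m3 = mw & (word & dec[0] | half & dec[2] | byte & dec[3])
    return (m3 << 3) | (m2 << 2) | (m1 << 1) | m0

half_word_upper = bank_write(0b01, 0b10, mw=1)
```

Bank 0 needs no width term because every aligned access that starts at offset 0 writes byte 0.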

When reusing common subexpressions, the cost and the delay of the memory control MC (figure 3.12) run at

$$C_{MC} = C_{dec}(2) + 2 \cdot C_{inv} + 12 \cdot C_{and} + 5 \cdot C_{or}$$
$$D_{MC} = \max\{D_{dec}(2),\; D_{inv} + D_{and}\} + 2 \cdot D_{and} + 2 \cdot D_{or}$$

Let $d_{mem}$ be the access time of the memory banks. The memory environment then delays the data MDout by

$$D_{Menv}(MDout) = D_{MC} + d_{mem}$$

We do not elaborate on the generation of the mbusy signal. This will only be possible when we build cache controllers.

3.3.6 Shifter Environment SHenv

The shifter environment SHenv is used for two purposes: first, for the execution of the explicit shift operations sll (shift left logical), srl (shift right logical) and sra (shift right arithmetic), and second, for the execution of implicit shifts. An implicit shift is only used during the store operations sb and sh in order to align the data to be stored in memory. The environment SHenv is controlled by a single control signal

• shift4s, denoting a shift for a store operation.
Table 3.8 Coding of the explicit shifts

IR[1:0]  00   10   11
type     sll  srl  sra

Explicit Shifts
We formally define the three explicit shifts. Obviously, left shifts and right shifts differ by the shift direction. Logic shifts and arithmetic shifts differ by the fill bit. This bit fills the positions which are not covered by the shifted operand any more. We define the explicit shifts of operand $a[n-1:0]$ by distance $\langle b[m-1:0] \rangle$ in the following way:

$$sll(a, b) = a[n - \langle b \rangle - 1] \ldots a[0]\; fill^{\langle b \rangle}$$
$$srl(a, b) = fill^{\langle b \rangle}\; a[n-1] \ldots a[\langle b \rangle]$$
$$sra(a, b) = fill^{\langle b \rangle}\; a[n-1] \ldots a[\langle b \rangle]$$

where

$$fill = \begin{cases} 0 & \text{for logic shifts} \\ a[n-1] & \text{for arithmetic shifts} \end{cases}$$

Thus, arithmetic shifts extend the sign bit of the shifted operand. They probably have their name from the equality

$$[sra(a, b)] = \lfloor [a] / 2^{\langle b \rangle} \rfloor$$

which can be exploited in division algorithms for 2’s complement numbers.

In case of an explicit shift operation, the last two bits $IR[1:0]$ of the secondary opcode select among the three explicit shifts according to table 3.8. By $shift(a, b, IR[1:0])$, we denote the result of the shift specified by $IR[1:0]$ with operands a and b.
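The three definitions and the floor-division equality for arithmetic shifts can be checked with a small Python sketch for n = 32. Operands are held as non-negative integers; the helper name `twoc` for the two's complement value is an assumption of this example.

```python
N = 32
MASK = (1 << N) - 1

def sll(a, b):
    return (a << b) & MASK                 # low b positions are filled with 0

def srl(a, b):
    return a >> b                          # high b positions are filled with 0

def sra(a, b):
    fill = a >> (N - 1)                    # sign bit a[n-1]
    high = (((1 << b) - 1) << (N - b)) if fill else 0
    return (a >> b) | high                 # high b positions get the sign bit

def twoc(x):
    """Two's complement value of an N-bit string."""
    return x - (1 << N) if x >> (N - 1) else x

minus_four = sra(0xFFFFFFF0, 2)            # -16 shifted right arithmetically by 2
```

The equality uses floor division, so negative odd values round towards minus infinity, not towards zero.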

Implicit Shifts

Implicit left shifts for store operations are necessary if a byte or half word – which is aligned at the right end of $a[31:0]$ – is to be stored at a byte address which is not divisible by 4. The byte address is provided by the memory address register MAR. Measured in bits, the shift distance (motivated by lemma 3.1) in this case equals

$$8 \cdot \langle MAR[1:0] \rangle = \langle MAR[1:0]\,000 \rangle$$

The operand a is shifted cyclically by this distance. Thus, the output sh of the shifter environment SHenv is

$$sh = \begin{cases} shift(a, b, IR[1:0]) & \text{if } shift4s = 0 \\ cls(a, MAR[1:0]\,000) & \text{if } shift4s = 1 \end{cases}$$
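For a concrete feel of the implicit shift, the following Python sketch aligns a right-aligned byte or half word for a store. The function names `cls` and `align_for_store` are assumptions of this example; only the case shift4s = 1 is modelled.

```python
MASK = 0xFFFFFFFF

def cls(a, dist):
    """Cyclic left shift of a 32-bit string by dist positions."""
    dist %= 32
    return ((a << dist) | (a >> (32 - dist))) & MASK

def align_for_store(a, mar):
    """Output sh for shift4s = 1: shift by 8 * <MAR[1:0]> bits."""
    return cls(a, 8 * (mar & 3))

byte_at_2 = align_for_store(0x000000AB, mar=2)
half_at_2 = align_for_store(0x0000BEEF, mar=2)
```

After the shift, the datum sits in exactly the byte lanes whose bank write signals are active for that address.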
Figure 3.13 Top level of the shifter environment SHenv

This completes the specification of the shifter environment. Figure 3.13 depicts a very general design for shifters from [MP95]. A 32-bit cyclic left shifter CLS shifts operand $a[31:0]$ by a distance $dist[4:0]$ provided by the distance circuit Dist. The result $r[31:0]$ of the shift is corrected by circuit Scor as a function of the fill bit fill and a replacement mask $mask[31:0]$ which are provided by the corresponding subcircuits.

   


For every bit position i, circuit Scor replaces bit $r_i$ of the intermediate result by the fill bit in case that the mask bit $mask_i$ is active. Thus,

$$sh_i = \begin{cases} fill & \text{if } mask_i = 1 \\ r_i & \text{if } mask_i = 0 \end{cases}$$

Figure 3.14 depicts a straightforward realization of the correction circuit. For the whole shifter environment SHenv, one obtains the following cost and delay:

$$C_{SHenv} = C_{CLS}(32) + C_{Dist} + C_{Fill} + C_{Mask} + 32 \cdot C_{mux}$$
$$D_{SHenv} = \max\{D_{Dist} + D_{CLS}(32),\; D_{Fill},\; D_{Mask}\} + D_{mux}$$

  + 


According to the shifters of section 2.4.6, an n-cyclic right shift can also be expressed as an n-cyclic left shift:

$$crs(a, b) = cls(a, \langle \overline{b} \rangle + 1 \bmod n)$$
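The identity is easy to confirm numerically for n = 32 and a 5-bit distance. In the Python sketch below, `crs` and `cls` are written directly on integers and `crs_via_cls` complements the distance and increments it; all three names are assumptions of this example.

```python
MASK = 0xFFFFFFFF

def cls(a, dist):
    """Cyclic left shift of a 32-bit string."""
    dist %= 32
    return ((a << dist) | (a >> (32 - dist))) & MASK

def crs(a, dist):
    """Cyclic right shift of a 32-bit string."""
    dist %= 32
    return ((a >> dist) | (a << (32 - dist))) & MASK

def crs_via_cls(a, b):
    """Right shift by <b> as a left shift by (<NOT b> + 1) mod 32."""
    dist = ((~b & 0x1F) + 1) % 32
    return cls(a, dist)

rotated = crs_via_cls(0x00000001, 1)
```

Complementing a 5-bit distance b gives 31 - b, so adding 1 yields 32 - b, which is the matching left shift distance modulo 32.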


Figure 3.14 The shift-correction circuit Scor

Figure 3.15 Circuit Dist selects the shift distance of shifter SH

Thus, in the distance circuit Dist of figure 3.15, the mux controlled by signal right selects the proper left shift distance of the explicit shift. According to table 3.8, bit $IR[1]$ can be used to distinguish between explicit left shifts and explicit right shifts. Thus, we can set

$$right = IR[1]$$

The additional mux controlled by signal shift4s can force the shift distance to $MAR[1:0]\,000$, i.e., the left shift distance specified for stores. The cost and the delay of the distance circuit Dist are

$$C_{Dist} = C_{inv}(5) + C_{inc}(5) + 2 \cdot C_{mux}(5)$$
$$D_{Dist}(b) = D_{inv}(5) + D_{inc}(5) + 2 \cdot D_{mux}(5)$$
$$D_{Dist}(MAR) = D_{mux}(5)$$

Fill Bit
The fill bit is only different from 0 in case of an arithmetic shift, which is coded by $IR[1:0] = 11$ (table 3.8). In this case, the fill bit equals the sign bit $a[31]$ of operand a, and therefore

$$fill = IR[1] \land IR[0] \land a[31]$$

The cost and the delay of the fill bit computation run at

$$C_{Fill} = 2 \cdot C_{and}$$
$$D_{Fill} = 2 \cdot D_{and}$$
Figure 3.16 Circuit Mask generating the mask for the shifter SH.

Replacement Mask
During an explicit left shift, the least significant $\langle b \rangle$ bits of the intermediate result r have to be replaced. In figure 3.16, a half decoder generates from b the corresponding mask $0^{32 - \langle b \rangle}\,1^{\langle b \rangle}$. During an explicit right shift, the most significant $\langle b \rangle$ bits of the intermediate result r have to be replaced. The corresponding mask is simply obtained by flipping the left shift mask. Note that no gates are needed for this. Thus, in the gate model used here, flipping the mask does not contribute to the cost and the delay. On shifts for store, the mask is forced to $0^{32}$, and the intermediate result r is not corrected at all. The cost and the delay of the mask circuit are

$$C_{Mask} = C_{hdec}(5) + 2 \cdot C_{mux}(32)$$
$$D_{Mask} = D_{hdec}(5) + 2 \cdot D_{mux}(32)$$

For later use, we introduce the notation

$$sh = shift(a, dist)$$

Observe that in this shorthand, lots of parameters are hidden.
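To see how distance, fill bit, mask and correction work together, the following Python sketch composes them around a cyclic left shifter and compares the result with directly computed shifts. It is a functional model of the structure described above, not the gate-level circuit; the function names and the integer encoding of IR[1:0] are assumptions of this example.

```python
MASK = 0xFFFFFFFF

def cls(a, dist):
    dist %= 32
    return ((a << dist) | (a >> (32 - dist))) & MASK

def hdec(b):
    """Half decoder output 0^(32-b) 1^b."""
    return (1 << b) - 1

def flip(m):
    """Reverse the order of the 32 mask bits."""
    return int(format(m, "032b")[::-1], 2)

def shifter(a, b, ir10, shift4s=0, mar=0):
    right = (ir10 >> 1) & 1                       # right = IR[1]
    if shift4s:
        dist = 8 * (mar & 3)                      # distance for stores
    elif right:
        dist = ((~b & 0x1F) + 1) % 32             # right shift as left shift
    else:
        dist = b
    r = cls(a, dist)
    fill = int(ir10 == 0b11) & (a >> 31)          # IR[1] AND IR[0] AND a[31]
    if shift4s:
        mask = 0
    else:
        mask = flip(hdec(b)) if right else hdec(b)
    fill_word = MASK if fill else 0
    return ((r & ~mask) | (fill_word & mask)) & MASK   # correction circuit Scor

def reference(a, b, ir10):
    """Directly computed sll (00), srl (10) and sra (11)."""
    if ir10 == 0b00:
        return (a << b) & MASK
    if ir10 == 0b10:
        return a >> b
    signed = a - (1 << 32) if a >> 31 else a
    return (signed >> b) & MASK

example = shifter(0x80000001, 5, 0b11)
```

The cyclic shifter does all the data movement; the mask only decides which positions are overwritten by the fill bit afterwards.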

3.3.7 Shifter Environment SH4Lenv

This environment consists of the shifter for loads SH4L and a mux; it is controlled by a single control signal

• shift4l denoting a shift for load operation.

If signal shift4l is active, the result R of the shifter SH4L is provided to the output C' of the environment, and otherwise, input C is passed to C':

$$C' := \begin{cases} R & \text{if } shift4l = 1 \\ C & \text{if } shift4l = 0 \end{cases}$$

Figure 3.17 depicts the top level schematics of the shifter environment SH4Lenv; its cost and delay can be expressed as

$$C_{SH4Lenv} = C_{SH4L} + C_{mux}(32)$$
$$D_{SH4Lenv} = D_{SH4L} + D_{mux}(32)$$
Figure 3.17: Top level schematics of the shifter environment SH4Lenv.

The shifter SH4L is only used in load operations. The last three bits
IR[28:26] of the primary opcode specify the type of the load operation
(table 3.9). The byte address of the data, which is read from memory on
a load operation, is stored in the memory address register MAR. If a byte
or half word is loaded from a byte address which is not divisible by 4, the
loaded data MDRr has to be shifted to the right such that it is aligned at the
right end of the data bus D[31:0]. A cyclic right shift by MAR[1:0]000
bits (the distance is motivated by lemma 3.1) will produce an intermediate
result

$$ r = crs(MDRr, MAR[1:0]000) $$

where the loaded data is already aligned at the right end. Note that this
also covers the case of a load word operation, because words are stored at
addresses with MAR[1:0] = 00. After the loaded data has been aligned,
the portion of the output R not belonging to the loaded data is replaced
with a fill bit:

$$ R[31:0] = \begin{cases} fill^{24}\, r[7:0] & \text{for lb, lbu} \\ fill^{16}\, r[15:0] & \text{for lh, lhu} \\ r[31:0] & \text{for lw} \end{cases} $$

In an unsigned load operation, the fill bit equals 0, whereas in signed
load operations, the fill bit is the sign bit of the shifted operand. This is
summarized in table 3.9 which completes the specification of the shifter
SH4L.
Figure 3.18 depicts a straightforward realization of the shifter SH4L.
The shift distance is always a multiple of 8. Thus, the cyclic right shifter
only comprises two stages for the shift distances 8 and 16. Recall that for
32 bit data, a cyclic right shift by 8 (16) bits equals a cyclic left shift by 24
(16) bits.
The first half word r[31:16] of the intermediate result is replaced by
the fill bit in case that a byte or half word is loaded. During loads, this
is recognized by IR[27] = 0. Byte r[15:8] is only replaced when loading
a single byte. During loads, this is recognized by IR[27:26] = 00. This
explains the multiplexer construction of figure 3.18.
Table 3.9: Fill bit of the shifts for load

IR[28]  IR[27:26]  Type                 MAR[1:0]  fill
0       00         byte, signed         00        MDRr[7]
                                        01        MDRr[15]
                                        10        MDRr[23]
                                        11        MDRr[31]
0       01         halfword, signed     00        MDRr[15]
                                        10        MDRr[31]
0       11         word                 *         *
1       00         byte, unsigned       *         0
1       01         halfword, unsigned   *         0

Figure 3.18: The shifter SH4L for load instructions.

The circuit LFILL of figure 3.19 is a brute force realization of the fill bit
function specified in table 3.9. The cost and the delay of the shifter SH4L
and of circuit LFILL are

$$ C_{SH4L} = 2 \cdot C_{mux}(32) + C_{mux}(24) + C_{nand} + C_{LFILL} $$
$$ D_{SH4L} = \max\{2 \cdot D_{mux}(32),\; D_{nand},\; D_{LFILL}\} + D_{mux}(24) $$
$$ C_{LFILL} = 5 \cdot C_{mux} + C_{and} + C_{inv} $$
$$ D_{LFILL} = \max\{3 \cdot D_{mux},\; D_{inv}\} + D_{and} $$

For later use, we introduce the notation

$$ R = sh4l(MDRr, MAR[1:0]000) $$
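The specification of the shifter SH4L can be summarized in a short behavioural sketch: a cyclic right shift by 8 times MAR[1:0] bits, followed by replacing the upper bits with the fill bit of table 3.9. The function signatures and the encoding of IR[28:26] as a 3-bit integer are assumptions made for this illustration; the sketch does not reflect the multiplexer structure of figure 3.18.

```python
MASK32 = 0xFFFFFFFF


def crs(x: int, dist: int) -> int:
    """Cyclic right shift of a 32-bit value by dist bits."""
    dist %= 32
    return ((x >> dist) | (x << (32 - dist))) & MASK32


def sh4l(mdr: int, mar10: int, ir28_26: int) -> int:
    """Behavioural model of R = sh4l(MDRr, MAR[1:0]000).

    mar10   : the two low address bits MAR[1:0]
    ir28_26 : opcode bits IR[28:26]; IR[28] = 1 means unsigned,
              IR[27:26] = 00 byte, 01 halfword, 11 word (table 3.9)
    """
    unsigned = (ir28_26 >> 2) & 1
    size = ir28_26 & 0b11
    r = crs(mdr, 8 * mar10)            # align loaded data at the right end
    if size == 0b11:                   # word: nothing is replaced
        return r
    width = 8 if size == 0b00 else 16
    data = r & ((1 << width) - 1)
    # signed loads use the sign bit of the aligned data, unsigned loads use 0
    fill = 0 if unsigned else (data >> (width - 1)) & 1
    upper = (MASK32 ^ ((1 << width) - 1)) if fill else 0
    return upper | data


if __name__ == "__main__":
    # byte 0x80 at byte offset 1 of the memory word 0x12348056
    print(hex(sh4l(0x12348056, 1, 0b000)))  # signed byte load
    print(hex(sh4l(0x12348056, 1, 0b100)))  # unsigned byte load
```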

Figure 3.19: Circuit LFILL computes the fill bit for the shifter SH4L.

3.4 Sequential Control

It is now amazingly easy to specify the control of the sequential machine
and to show that the whole design is correct. In a first design, we will
assume that memory accesses can be performed in a single cycle. Later
on, this is easily corrected by a simple stalling mechanism.

3.4.1 Sequential Control without Stalling

Figure 3.20 depicts the graph of a finite state diagram. Only the names
of the states and the edges between them are presently of interest. In order
to complete the design, one has to specify the functions $\delta_{z,z'}$ for all states z
with more than one successor state. Moreover, one has to specify for each
state z the set of control signals active in state z.

We begin with an intermediate step and specify for each state z a set of
register transfer language (RTL) instructions rtl(z) to be executed in that
state (table 3.10). The abbreviations and the conventions are those of the
tables 3.1 to 3.3. In addition, we use M(PC) as a shorthand for $M_{word}(PC)$.
Also note that the functions op, shift, sh4l and rel have hidden parameters.

We also specify for each type t of DLX instruction the intended path
path(t) through the diagram. All such paths begin with the states fetch and
decode. The succeeding states on the path depend on the type t as indicated
in table 3.11. One immediately obtains
Table 3.10: RTL instructions of the FSD of figure 3.20

State    RTL instruction
fetch    IR = M(PC)
decode   A = RS1,
         B = RD if I-type instruction, RS2 if R-type instruction,
         co = 0^27 SA if shift immediate, sxt(imm) otherwise,
         PC = PC + 4
alu      C = A op B
test     C = (A rel B ? 1 : 0)
shift    C = shift(A, B[4:0])
aluI     C = A op co
testI    C = (A rel co ? 1 : 0)
shiftI   C = shift(A, co[4:0])
wbR      RD = C (R-type)
wbI      RD = C (I-type)
addr     MAR = A + co
load     MDRr = Mword[MAR[31:2]00]
sh4l     RD = sh4l(MDRr, MAR[1:0]000)
sh4s     MDRw = cls(B, MAR[1:0]000)
store    m = bytes(MDRw)
branch
btaken   PC = PC + co
jimm     PC = PC + co
jreg     PC = A
savePC   C = PC
jalR     PC = A
jalI     PC = PC + co
wbL      GPR[31] = C

Figure 3.20: Finite state diagram (FSD) of the DLX machine.

Lemma 3.2. If the design is completed such that

1. for each type of instruction t, the path path(t) is taken, and that

2. for each state s, the set of RTL instructions rtl(s) is executed,

then the machine is correct, i.e., it interprets the instruction set.

The proof is a simple exercise in bookkeeping. For each type of instruction
t one executes the RTL instructions on the path path(t). The effect of
this on the visible DLX registers has to be as prescribed by the instruction
set. We work out the details in some typical cases.

    . - 


Suppose t = addi. By table 3.11, the sequence of states executed is

path(t) = (fetch, decode, aluI, wbI)

and by table 3.10, the sequence of RTL instructions on this path is:

state s   rtl(s)
fetch     IR = M(PC)
decode    A = GPR[RS1], B = GPR[RS2], PC = PC + 4 mod 2^32
aluI      C = A + imm, if this is in T_32
wbI       GPR[RD] = C
Table 3.11: Paths path(t) through the FSD for each type t of DLX instruction

DLX instruction type           path through the FSD
arithmetic/logical, I-type     fetch, decode, aluI, wbI
arithmetic/logical, R-type     fetch, decode, alu, wbR
test set, I-type               fetch, decode, testI, wbI
test set, R-type               fetch, decode, test, wbR
shift immediate                fetch, decode, shiftI, wbR
shift register                 fetch, decode, shift, wbR
load                           fetch, decode, addr, load, sh4l
store                          fetch, decode, addr, sh4s, store
jump register                  fetch, decode, jreg
jump immediate                 fetch, decode, jimm
jump & link register           fetch, decode, savePC, jalR, wbL
jump & link immediate          fetch, decode, savePC, jalI, wbL
taken branch                   fetch, decode, branch, btaken
untaken branch                 fetch, decode, branch

The combined effect of this on the visible registers is, as it should be,

$$ GPR[RD] = C = A + imm = GPR[RS1] + imm \quad \text{if this is in } T_{32} $$
$$ PC = PC + 4 $$

It is that easy and boring. Keep in mind however, that with literal appli-
cation of the abridged semantics, this simple exercise would end in com-
plex and exciting insanity. Except for loads and stores, the proofs for all
cases follow exactly the above pattern.

  . - 


Suppose instruction M(PC) has type t = store and the operand X to be
stored is d bytes wide

$$ X = byte_{[d-1:0]}(GPR[RD]) $$

then path(t) = (fetch, decode, addr, sh4s, store). The effect of the RTL
instructions of the last three states is

state s   rtl(s)
addr      MAR = A + imm mod 2^32
sh4s      MDRw = cls(B, MAR[1:0]000)
store     M[MAR + d - 1 : MAR] = byte_[MAR[1:0]+d-1 : MAR[1:0]](MDRw)

Thus, the combined effect of all states on MAR is

$$ MAR = A + imm \bmod 2^{32} = GPR[RS1] + imm \bmod 2^{32} = ea $$

and the combined effect of all states on MDRw is

$$ MDRw = cls(B, MAR[1:0]000) = cls(GPR[RD], MAR[1:0]000) $$

Hence,

$$ X = byte_{[MAR[1:0]+d-1 \,:\, MAR[1:0]]}(MDRw) $$

and the effect of the store operation is

$$ M[ea + d - 1 : ea] = X $$

 3 . - 


Suppose M(PC) has type t = load, and the operand X to be loaded into
register GPR[RD] is d bytes wide. Thus, path(t) = (fetch, decode, addr,
load, sh4l), and the RTL instructions of the last three states are:

state s   rtl(s)
addr      MAR = A + imm mod 2^32
load      MDRr = Mword[MAR[31:2]00]
sh4l      GPR[RD] = sh4l(MDRr, MAR[1:0]000)

As in the previous case, MAR = ea. By lemma 3.1, it follows

$$ X = byte_{[MAR[1:0]+d-1 \,:\, MAR[1:0]]}(M_{word}[MAR[31:2]00]) = byte_{[MAR[1:0]+d-1 \,:\, MAR[1:0]]}(MDRr) $$

With the fill bit fill defined as in table 3.9, one concludes

$$ GPR[RD] = sh4l(MDRr, MAR[1:0]000) = fill^{32-8d}\, X = \begin{cases} sxt(m) & \text{for load (signed)} \\ 0^{32-8d}\, m & \text{for load unsigned} \end{cases} $$
The design is now easily completed. Table 3.12 is an extension of table
3.10. It lists for each state s not only the RTL instructions rtl(s) but also
the control signals activated in that state. One immediately obtains

Lemma 3.3. For all states s, the RTL instructions rtl(s) are executed in state s.

For all states except addr and btaken, this follows immediately from the
specification of the environments. In state s = addr, the ALU environment
performs the address computation

$$ MAR = A + sxt(imm) \bmod 2^{32} = A + imm[15]^{16}\, imm \bmod 2^{32} = A + imm $$

The branch target computation of state s = btaken is handled in a completely
analogous way.

It only remains to specify the disjunctive normal forms $D_i$ for figure 3.20
such that it holds:

Lemma 3.4. For each instruction type t, the sequence path(t) of states specified by
table 3.11 is followed.

Each $D_i$ has to test for certain patterns in the primary and secondary
opcodes IR[31:26, 5:0], and it possibly has to test signal AEQZ as well.
These patterns are listed in table 3.13. They have simply been copied from
the tables 3.1 to 3.3. Disjunctive form $D_8$, for instance, tests if the actual
instruction is a jump register instruction (jr) coded by

IR[31:26] = hx16 = 010110

It can be realized by the single monomial (with / denoting negation)

D8 = /IR[31] ∧ IR[30] ∧ /IR[29] ∧ IR[28] ∧ IR[27] ∧ /IR[26]

In general, testing for a single pattern with k zeros and ones can be done
with a monomial of length k. This completes the specification of the whole
machine. Lemmas 3.2 to 3.4 imply

Theorem. The design correctly implements the instruction set.
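As a small illustration of how an opcode pattern translates into a monomial, the following sketch builds a tester from a pattern string over 0, 1 and * (don't care). The helper name `monomial` and the string encoding of IR[31:26] are assumptions made for this example; each fixed position contributes one literal, so the length of the monomial equals the number of zeros and ones in the pattern.

```python
def monomial(pattern: str):
    """Build a tester for an opcode pattern and return (test, length).

    pattern: string over '0', '1', '*', most significant bit first.
    A '1' yields a positive literal, a '0' a negated literal, and '*'
    contributes no literal at all.
    """
    literals = [(i, c == "1") for i, c in enumerate(pattern) if c != "*"]

    def test(bits: str) -> bool:
        # the monomial is the AND of all its literals
        return all((bits[i] == "1") == want for i, want in literals)

    return test, len(literals)


if __name__ == "__main__":
    d8, length = monomial("010110")      # pattern of D8 (jreg)
    print(length, d8("010110"), d8("010111"))
```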

Table 3.12: RTL instructions and their active control signals

state    RTL instruction                        active control signals
fetch    IR = M(PC)                             fetch, mr, IRce
decode   A = RS1,                               Ace, Bce, PCce,
         B = RD if I-type, RS2 if R-type,       PCadoe, 4bdoe, add, ALUDdoe
         PC = PC + 4,
         co = 0^27 SA if shiftI,
              sxt(imm) otherwise
alu      C = A op B                             Aadoe, Bbdoe, ALUDdoe, Cce, Rtype
test     C = (A rel B ? 1 : 0)                  like alu, test
shift    C = shift(A, B[4:0])                   Aadoe, Bbdoe, SHDdoe, Cce, Rtype
aluI     C = A op co                            Aadoe, cobdoe, ALUDdoe, Cce
testI    C = (A rel co ? 1 : 0)                 like aluI, test
shiftI   C = shift(A, co[4:0])                  Aadoe, cobdoe, SHDdoe, Cce, shiftI, Rtype
wbR      RD = C (R-type)                        GPRw, Rtype
wbI      RD = C (I-type)                        GPRw
addr     MAR = A + co                           Aadoe, cobdoe, ALUDdoe, add, MARce
load     MDRr = Mword[MAR[31:2]00]              mr, MDRrce
sh4l     RD = sh4l(MDRr, MAR[1:0]000)           shift4l, GPRw
sh4s     MDRw = cls(B, MAR[1:0]000)             Badoe, SHDdoe, shift4s, MDRwce
store    m = bytes(MDRw)                        mw
branch
btaken   PC = PC + co                           PCadoe, cobdoe, add, ALUDdoe, PCce
jimm     PC = PC + co                           like btaken, Jjump
jreg     PC = A                                 Aadoe, 0bdoe, add, ALUDdoe, PCce
savePC   C = PC                                 PCadoe, 0bdoe, add, ALUDdoe, Cce
jalR     PC = A                                 like jreg
jalI     PC = PC + co                           like jimm
wbL      GPR[31] = C                            GPRw, Jlink
Table 3.13: Nontrivial disjunctive normal forms (DNF) of the DLX finite state
diagram and the corresponding monomials

Nontrivial   Target    Monomial m ∈ M            Length
DNF          state     IR[31:26]   IR[5:0]       l(m)
D1           shift     000000      0001*0        11
                       000000      00011*        11
D2           alu       000000      100***        9
D3           test      000000      101***        9
D4           shiftI    000000      0000*0        11
                       000000      00001*        11
D5           aluI      001***      ******        3
D6           testI     011***      ******        3
D7           addr      100*0*      ******        4
                       10*0*1      ******        4
                       10*00*      ******        4
D8           jreg      010110      ******        6
D9           jalR      010111      ******        6
D10          jalI      000011      ******        6
D9 ∨ D10     savePC    like D9 and D10
D11          jimm      000010      ******        6
D12          branch    00010*      ******        5
D13          sh4s      **1***      ******        1
/D13         load      **0***      ******        1
bt           btaken    AEQZ ∧ /IR[26]            2
                       /AEQZ ∧ IR[26]            2

Accumulated length of m ∈ M: Σ_{m ∈ M} l(m) = 115
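The parameters that table 3.13 contributes to the control automaton can be recomputed mechanically. The sketch below is only a bookkeeping aid: it lists the opcode patterns of table 3.13 and counts the fixed positions of each. The two monomials of bt are encoded as 2-character patterns over (AEQZ, IR[26]), which is an encoding chosen for this example only.

```python
# (DNF name, pattern over IR[31:26] followed by IR[5:0]); '*' is don't care
PATTERNS = [
    ("D1", "000000" "0001*0"), ("D1", "000000" "00011*"),
    ("D2", "000000" "100***"), ("D3", "000000" "101***"),
    ("D4", "000000" "0000*0"), ("D4", "000000" "00001*"),
    ("D5", "001***" "******"), ("D6", "011***" "******"),
    ("D7", "100*0*" "******"), ("D7", "10*0*1" "******"),
    ("D7", "10*00*" "******"),
    ("D8", "010110" "******"), ("D9", "010111" "******"),
    ("D10", "000011" "******"), ("D11", "000010" "******"),
    ("D12", "00010*" "******"),
    ("D13", "**1***" "******"), ("/D13", "**0***" "******"),
    # bt: patterns over (AEQZ, IR[26]); encoding assumed for this sketch
    ("bt", "10"), ("bt", "01"),
]


def length(pattern: str) -> int:
    """Length of the monomial: number of positions fixed to 0 or 1."""
    return sum(c != "*" for c in pattern)


lengths = [length(p) for _, p in PATTERNS]
num_monomials = len(lengths)   # #M
l_max = max(lengths)           # length of the longest monomial
l_sum = sum(lengths)           # accumulated length

if __name__ == "__main__":
    print(num_monomials, l_max, l_sum)
```

The three values agree with the entries #M, l_max and l_sum listed for the Moore automaton in table 3.14.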

3.4.2 Parameters of the Control Automaton

In the previous subsection, we have specified the control of the sequential
DLX architecture without stalling. Its output function, i.e., the value of the
control signals, depends on the current state of the control automaton but
not on its current inputs. Thus, the sequential control can be implemented
as a Moore automaton with precomputed control signals.
In this scenario, the automaton is clocked in every cycle, i.e., its clock
signal is ce = CONce = 1. Signal reset serves as the clear signal clr of
the Moore automaton in order to initialize the control on reset. Except for
signal AEQZ, all the inputs of the control automaton are directly provided
by the instruction register IR at zero delay. Thus, the input signals of the
automaton have the accumulated delay:

$$ A(in) = A(AEQZ) = D_{zero}(32) $$
$$ A(clr, ce) = A(reset) $$

Table 3.14: Parameters of the Moore control automaton

Parameter   Meaning                                          Value
k           # states of the automaton                        23
σ           # input signals in_j                             13
γ           # output signals out_j                           29
ν_max       maximal frequency of the outputs                 12
ν_sum       accumulated frequency of the outputs             94
#M          # monomials m ∈ M (nontrivial)                   20
l_max       length of longest monomial m ∈ M                 11
l_sum       accumulated length of all monomials m ∈ M        115
fanin_max   maximal fanin of nodes (≠ fetch) in the FSD      4
fanin_sum   accumulated fanin                                33

According to section 2.6, the cost and the delay of such a Moore automaton
only depend on a few parameters (table 3.14). Except for the fanin of
the states/nodes and the frequency of the control signals, these parameters
can directly be read off the finite state diagram (figure 3.20) and table 3.13.
State fetch serves as the initial state $z_0$ of the automaton. Recall that our
realization of a Moore automaton has the following peculiarity: whenever
the next state is not specified explicitly, a zero tester forces the automaton
into its initial state. Thus, in the next state circuit NS, transitions to state
fetch can be ignored.

,   ! 
For each edge (z, z') ∈ E with z' ≠ fetch, we refer to the number #M(z, z')
of monomials in D(z, z') as the weight of the edge. For edges with nontrivial
monomials, the weight can be read off table 3.13; all the other edges
have weight 1. The fanin of a node z equals the sum of the weights of all
edges ending in z. Thus, state wbR has the highest fanin of all states different
from fetch, namely, fanin_max = 4, and all the states together have an
accumulated fanin of fanin_sum = 31.
Table 3.15: Control signals of the DLX architecture and their frequency. Signals
printed in italics are used in several environments.

            control signals
Top level   PCadoe, Aadoe, Badoe, Bbdoe, 0bdoe, SHDdoe, coBdoe, 4bdoe, ALUDdoe,
            Ace, Bce, Cce, MARce, MDRrce, MDRwce, fetch
IRenv       Jjump, shiftI, IRce
GPRenv      GPRw, Jlink, Rtype
PCenv       PCce
ALUenv      add, test, Rtype
Menv        mr, mw, fetch
SHenv       shift4s
SH4Lenv     shift4l

outputs out_j with a frequency ν_j > 1

Cce      7     PCce      6     GPRw     5     mr       2
PCadoe   5     Aadoe     9     Bbdoe    3     cobdoe   7
0bdoe    3     ALUDdoe   12    SHDdoe   3     Rtype    5
Jlink    2     Jjump     2     add      9     test     2

,2-%      


The first part of table 3.15 summarizes the control signals used in the top
level schematics of the DLX architecture and in its environments. For each
control signal out_j, its frequency can be derived from table 3.12 by simply
counting the states in which out_j is active. These values are listed in the
second part of table 3.15; signals with a frequency of 1 are omitted. Thus,
the automaton generates γ = 29 control signals; the signals have a maximal
frequency of ν_max = 12 and an accumulated frequency of ν_sum = 93.

3.4.3 A Simple Stall Engine

So far, we have assumed that a memory access can be performed in a


single cycle, but that is not always the case. In order to account for those
multi-cycle accesses, it is necessary to stall the DLX data paths and the
main control, i.e., the update of registers and RAMs must be stopped. For
that purpose, we introduce a stall engine which provides an update enable
signal ue for each register or RAM.

   1  
A register R is now controlled by two signals, the signal Rce which requests
the update and the update enable signal Rue which enables the requested
update (figure 3.21). The register is only updated if both signals are active,
i.e., Rce = Rue = 1. Thus, the actual clock enable signal of register R,
which is denoted by Rce', equals

$$ Rce' = Rce \wedge Rue $$

Figure 3.21: Controlling the update of registers and RAMs. The control automaton
provides the request signals Rce, Kw; the stall engine provides the enable signals
Rue, Kue.

The clock request signal Rce is usually provided by the control automaton,
whereas signals Rue and Rce' are generated by a stall engine.
In analogy, the update of a RAM R is requested by signal Rw and enabled
by signal Rue. Both signals are combined to the actual write signal

$$ Rw' = Rw \wedge Rue $$

  5- 7% 5 %  


A memory access sometimes requires multiple clock cycles. The memory
system M therefore provides a status signal mbusy indicating that the access
will not be completed in the current cycle. Thus, on mbusy = 1, the
DLX hardware is unable to run the RTL instructions of the current state to
completion. In this situation, the correct interpretation of the instruction is
achieved as follows:

While mbusy is active, the memory system M proceeds with its access,
but the data paths and the control are stalled. This means that the
Moore control automaton still requests the register and RAM updates
according to the RTL instructions of its current state z, but the
stall engine disables these updates. Thus, the hardware executes a
NOP (no-operation), and the control automaton remains in its current
state.

In the cycle in which mbusy becomes inactive, the memory system
completes its access, the stall engine enables the requested updates,
and the data paths and the control execute the RTL instructions of
the current state z.

Since the data paths and the control automaton are stalled simultaneously,
the stall engine only provides a single update enable signal UE, which is
inactive during an ongoing memory access (mbusy = 1). However, the update
must be enabled during reset in order to ensure that the DLX machine
can be restarted:

$$ UE = \overline{mbusy} \vee reset $$

This signal enables the update of all the registers and RAMs in the data
paths and in the control automaton. Thus, the write signal of the general
purpose register file GPR and the clock signal CONce' of the Moore automaton,
for instance, are then obtained as

$$ GPRw' = GPRw \wedge GPRue = GPRw \wedge UE $$
$$ CONce' = CONce \wedge CONue = CONce \wedge UE $$

Note that the read and write signals Mr and Mw of the memory M are not
masked by signal UE.
According to table 3.15, the control automaton provides 8 clock request
signals and 1 write request signal. Together with the clock of the Moore
automaton, the stall engine has to manipulate 10 clock and write signals.
Thus, the cost and the delay of this simple stall engine run at

$$ C_{stall} = C_{inv} + C_{or} + 10 \cdot C_{and} $$
$$ D_{stall} = D_{inv} + D_{or} + D_{and} $$
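The behaviour of this stall engine can be sketched in a few lines. The function names and the dictionary of request signals are assumptions made for this illustration; the logic itself is the single inverter and OR gate for UE, followed by one AND gate per clock or write request.

```python
def update_enable(mbusy: bool, reset: bool) -> bool:
    """UE = /mbusy OR reset."""
    return (not mbusy) or reset


def gate_requests(requests: dict, ue: bool) -> dict:
    """Combine every clock/write request with UE (one AND gate each)."""
    return {name: req and ue for name, req in requests.items()}


if __name__ == "__main__":
    # hypothetical requests issued by the control automaton in state 'load'
    requests = {"MDRrce": True, "GPRw": False, "CONce": True}
    for mbusy in (True, False):
        ue = update_enable(mbusy, reset=False)
        print(mbusy, gate_requests(requests, ue))
```

While mbusy is active, all gated signals are 0, so neither the registers nor the automaton change state; in the cycle in which mbusy becomes inactive, the requested updates pass through unchanged.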

     -  .

In the previous sections, we derived formulae which estimate the cost
and the delay of the data paths environments and of the control automaton.
Based on these formulae, we now determine the cost and the cycle
time of the whole DLX hardware. Note that all the adders in our DLX
designs are carry lookahead adders, if not stated otherwise.

3.5.1 Hardware Cost

The hardware consists of the data paths and of the sequential control. If
not stated otherwise, we do not consider the memory M itself to be part of
the DLX hardware.
The data paths DP (figure 3.4) of the sequential DLX fixed-point core
consist of six registers, nine tristate drivers, a multiplexer and six environments:
the arithmetic logic unit ALUenv, the shifters SHenv and SH4Lenv,
and the environments of the instruction register IR, of the general purpose
registers GPR and of the program counter PC. Thus, the cost of the 32-bit
data paths equals

$$ C_{DP} = 6 \cdot C_{ff}(32) + 9 \cdot C_{driv}(32) + C_{mux}(32) + C_{ALUenv} + C_{SHenv} + C_{SH4Lenv} + C_{IRenv} + C_{GPR} + C_{PCenv} $$

The sequential control consists of a Moore automaton, of the memory
control MC, and of the stall engine. The automaton precomputes its outputs
and has the parameters of table 3.14. Thus, the control unit has cost

$$ C_{CON} = C_{pMoore} + C_{MC} + C_{stall} $$

Table 3.16 lists the cost of the sequential DLX hardware and of all its
environments. The register file is the single most expensive environment;
its cost accounts for 37% of the cost of the data paths. Of course, this
fraction depends on the size of the register file. The control only accounts
for 9% of the whole hardware cost.

Table 3.16: Cost of the DLX fixed-point core and of all its environments

          cost              cost             cost
ALUenv    1691    IRenv     301     DP       10846
SHenv     952     GPRenv    4096    CON      1105
SH4Lenv   380     PCenv     354     DLX      11951
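The totals of table 3.16 can be cross-checked with a few lines of arithmetic. The per-bit gate costs used below (8 for a flipflop, 5 for a tristate driver, 3 for a multiplexer) are assumptions for this sketch; they are not stated in this section and were chosen because they reproduce the data path total of the table. The environment costs are copied from table 3.16.

```python
# Assumed per-bit costs of the basic components (not given in this section)
C_FF, C_DRIV, C_MUX = 8, 5, 3

# Environment costs as listed in table 3.16
ENV_COST = {
    "ALUenv": 1691, "SHenv": 952, "SH4Lenv": 380,
    "IRenv": 301, "GPRenv": 4096, "PCenv": 354,
}
C_CON = 1105


def width_cost(per_bit: int, n: int = 32) -> int:
    """Cost of an n-bit wide register, driver or multiplexer."""
    return per_bit * n


# C_DP = 6*C_ff(32) + 9*C_driv(32) + C_mux(32) + cost of the six environments
c_dp = (6 * width_cost(C_FF) + 9 * width_cost(C_DRIV)
        + width_cost(C_MUX) + sum(ENV_COST.values()))
c_dlx = c_dp + C_CON

if __name__ == "__main__":
    print("C_DP  =", c_dp)
    print("C_DLX =", c_dlx)
    print("control share of C_DLX:", round(C_CON / c_dlx, 3))
```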

3.5.2 Cycle Time

For the cycle time, we have to consider the four types of transfers illustrated
in figure 2.3 (page 11). This requires determining the delay of each
path which starts in a register and ends in a register, in a RAM, or in the
memory. In this regard, the sequential DLX design comprises the following
types of paths:
1. the paths which only pass through the data paths DP and the Moore
control automaton,
2. the paths of a memory read or write access, and
3. the paths through the stall engine.
These paths are now discussed in detail. For the paths of type 1 and 2, the
impact of the global update enable signal UE is ignored.

  #'
"  -  +"   5 -   
All these paths are governed exclusively by the output signals of the Moore
automaton; these standard control signals, denoted by Csig, have zero delay:

$$ A(Csig) = A_{pMoore}(out) = 0 $$

One type of paths is responsible for the update of the Moore automaton.
A second type of paths is used for reading from or writing into the register
file GPR. All the remaining paths pass through the ALU or the shifter SH.

     -   
The time $T_{pMoore}$ denotes the cycle time of the Moore control automaton,
as far as the computation of the next state and of the outputs is concerned.
According to section 2.6, this cycle time only depends on the parameters
of table 3.14 and on the accumulated delays A(in) and A(clr, ce) of its input,
clear and clock signals.

8   ,  
For the timing, we distinguish between read and write accesses. During
a read access, the two addresses come directly from the instruction word
IR. The data A and B are written into the registers A and B. The control
signals Csig switch the register file into read mode and provide the clock
signals Ace and Bce. The read cycle therefore requires time:

$$ T_{GPRr} = A(Csig) + D_{GPRr} + \Delta $$

During write back, the value C', which is provided by the shifter environment
SH4Lenv, is written into the multiport RAM of the GPR register
file. Both environments are governed by the standard control signals Csig.
Since the register file has a write delay of $D_{GPRw}$, the write back cycle takes

$$ T_{GPRw} = A(Csig) + D_{SH4Lenv} + D_{GPRw} + \delta $$

This already includes the time overhead for clocking.

"  -  3  


The ALU and the shifter SH get their operands from the busses a and b.
Except for value co which is provided by environment IRenv, the operands
are either hardwired constants or register values. Thus, the data on the two
operand busses are stable $A_{BUSab}$ delays after the start of a cycle:

$$ A_{BUSab} = A(Csig) + D_{IRenv}(co) + D_{driv} $$

As soon as the operands become valid, they are processed in the ALU and
the shifter SHenv. From the data bus D, the result is then clocked into
a register (MAR, MDRw or C) or it is passed through environment PCenv
which adds delay $D_{PCenv}(IN; PC)$. Thus, the ALU and shift cycles require
a cycle time of

$$ T_{ALU/SH} = A_{BUSab} + \max\{D_{ALUenv},\; D_{SHenv}\} + D_{driv} + D_{PCenv}(IN; PC) + \Delta $$

5 % 8  ;   


The memory environment performs read and write accesses. The memory
M also provides a status flag mbusy which indicates whether the access can
be completed in the current cycle or not. The actual data access has a delay
of $d_{mem}$, whereas the status flag has a delay of $d_{mstat}$.
According to figure 3.4, bus MA provides the memory address, and register
MDRw provides the data to be written into memory. Based on the
address MA and some standard control signals, the memory control MC
(section 3.3.5) generates the bank write signals mbw[3:0]. Thus,

$$ A_{MC} = A(Csig) + D_{mux} + D_{MC} $$

delays after the start of each cycle, all the inputs of the memory system
are valid, and the memory access can be started. The status flag mbusy
therefore has an accumulated delay of

$$ A_{Menv}(mbusy) = A_{MC} + d_{mstat} $$

and a write access requires a cycle time of

$$ T_{Mwrite} = A_{MC} + d_{mem} + \delta $$

On a read access, the memory data arrive on the bus MDout $d_{mem}$ delays
after the inputs of the memory are stable, and then, the data are clocked
into the instruction register IR or into register MDRr. Thus, the read cycle
time is

$$ T_M = T_{Mread} = A_{MC} + d_{mem} + \Delta $$

A read access takes slightly longer than a write access.

"  -     1 


Based on the status flag mbusy, the stall engine generates the update enable
signal UE which enables the update of all registers and RAMs in the DLX
hardware. The stall engine then combines flag UE with the write and clock
request signals provided by the Moore control automaton.
Since mbusy has a much longer delay than the standard control signals of
the Moore automaton, the stall engine provides the write and clock enable
signals at an accumulated delay of

$$ A_{stall} = A_{Menv}(mbusy) + D_{stall} $$

Clocking a register adds delay $D_{ff} + \delta$, whereas the update of the 3-port
RAM in environment GPRenv adds delay $D_{ram3}(32 \times 32) + \delta$. Thus, the
paths through the stall engine require a cycle time of

$$ T_{stall} = A_{stall} + \max\{D_{ram3}(32 \times 32),\; D_{ff}\} + \delta $$

Table 3.17: Cycle time of the sequential DLX design

T_DP                            T_CON                      T_M
T_GPRr   T_GPRw   T_ALU/SH      T_pMoore   T_stall
27       37       70            42         37 + d_mstat    16 + d_mem

1-    %  


Table 3.17 lists the cycle times of the DLX data paths, of the control and
of the memory system. In the data paths, the cycles through the functional
units are most time critical; the register file itself could tolerate a clock
which is twice as fast. The DLX data paths require a minimal cycle time
of $T_{DP} = 70$ gate delays.
The control does not dominate the cycle time of the sequential DLX
design, as long as the memory status time $d_{mstat}$ stays under 44% of $T_{DP}$.
The cycle time of the memory system only becomes time critical, if the
actual access time $d_{mem}$ is at least 74% of $T_{DP}$.
The cycle time $\tau_{DLX}$ of the sequential DLX design is usually the maximum
of the cycle times required by the data paths and the control:

$$ \tau_{DLX} = \max\{T_{DP},\; T_{CON}\} $$

The cycle time $T_M$ of the memory environment only has an indirect impact
on the cycle time $\tau_{DLX}$. If the memory cycle time is less than $\tau_{DLX}$,
memory accesses can be performed in a single machine cycle. In the other
case, $T_M > \tau_{DLX}$, the cycle time of the machine must be increased to $T_M$ or
memory accesses require $\lceil T_M / \tau_{DLX} \rceil$ cycles. Our designs use the second
approach.
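The interplay of the three cycle times can be made concrete with a short sketch based on the values of table 3.17. Two points are assumptions of this example rather than statements of the text: $T_{CON}$ is taken as the maximum of $T_{pMoore}$ and $T_{stall}$, and the function and parameter names are invented for illustration.

```python
import math

T_DP = 70        # data paths, dominated by T_ALU/SH (table 3.17)
T_PMOORE = 42    # Moore control automaton (table 3.17)


def t_con(d_mstat: int) -> int:
    """Control cycle time, assumed to be max(T_pMoore, T_stall)."""
    return max(T_PMOORE, 37 + d_mstat)


def t_mem(d_mem: int) -> int:
    """Cycle time T_M of a memory read access."""
    return 16 + d_mem


def tau_dlx(d_mstat: int) -> int:
    """Cycle time of the machine: max of data path and control cycle time."""
    return max(T_DP, t_con(d_mstat))


def memory_cycles(d_mem: int, d_mstat: int) -> int:
    """Number of machine cycles per memory access (second approach)."""
    return math.ceil(t_mem(d_mem) / tau_dlx(d_mstat))


if __name__ == "__main__":
    for d_mem in (40, 100, 200):
        print(d_mem, tau_dlx(d_mstat=10), memory_cycles(d_mem, d_mstat=10))
```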
3.6 Selected References and Further Reading

The DLX instruction set is from the classical textbook [HP90]. The
design presented here is partly based on designs from [HP90, PH94,
KP95, MP95]. A formal verification of a sequential processor is reported
in [Win95].

Chapter 4

Basic Pipelining

In the CPU constructed in the previous chapter, DLX instructions are
processed sequentially; this means that the processing of an instruction
starts only after the processing of the previous instruction is completed.
The processing of an instruction takes between 3 and 5 cycles. Most of
the hardware of the CPU is idle most of the time. One therefore tries to
re-schedule the use of the hardware resources such that several instructions
can be processed simultaneously. Obviously, the following conditions
should be fulfilled:

1. No structural hazards exist, i.e., at no time, any hardware resource
is used by two instructions simultaneously.

2. The machine is correct, i.e., the hardware interprets the instruction
set.

The simplest such schedule is basic pipelining: the processing of each
instruction is partitioned into the five stages k listed in table 4.1. Stages
IF and ID correspond directly to the states fetch and decode of the FSD
in figure 3.20. In stage M, the memory accesses of load and store instructions
are performed. In stage WB, results are written back into the general
purpose registers. Roughly speaking, everything else is done in stage EX.
Figure 4.1 depicts a possible partition of the states of the FSD into these
five stages.

Table 4.1: Stages of the pipelined instruction execution

k   shorthand   name
0   IF          instruction fetch
1   ID          instruction decode
2   EX          execute
3   M           memory
4   WB          write back

Figure 4.1: Partitioning of the FSD of the sequential DLX design into the five
stages of table 4.1.

We consider the execution of a sequence I = I_0, I_1, ... of DLX instructions,
where instruction I_0 is preceded by a reset. For the cycles T = 0, 1, ..., the
stages k and the instructions I_i, we use

$$ I(k, T) = i $$

as a shorthand for the statement that instruction I_i is in stage k during cycle
T. The execution starts in cycle T = 0 with I(0, 0) = 0.
Ideally, we would like to fetch a new instruction in every cycle, and each
instruction should progress by one stage in every cycle, i.e.,

if I(0, T) = i then I(0, T + 1) = i + 1, and

if I(k, T) = i and k < 4 then I(k + 1, T + 1) = i.

For all stages k and cycles T we therefore have

$$ I(k, T) = i \iff T = k + i $$

Figure 4.2: Pipelined execution of the instruction sequence I_0, I_1, I_2, I_3, I_4.

This ideal schedule is illustrated in figure 4.2. Obviously, two instructions
are never in the same stage simultaneously. If we can allocate each hardware
resource to a stage k such that the resource is only used by instruction
I_i while I_i is in stage k, then no hardware resource is ever used by two
instructions simultaneously, and thus, structural hazards are avoided.
For the machine constructed so far this cannot be done for the following
two reasons:

1. The adder is used in stage decode for incrementing the PC, and in
stage execute it is either used for ALU operations or for branch target
computations. The jump and link instructions (jal and jalr) even use the
adder twice in the execute stage, namely for the target computation and for
passing the PC to the register file. Thus, we at least have to provide
an extra incrementer for incrementing the PC during decode and an
ALU bypass path for saving the PC.

2. The memory is used in stages fetch and memory. Thus, an extra
instruction memory IM has to be provided.
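The ideal schedule I(k, T) = i with T = k + i can be tabulated with a short sketch. The function names and the textual layout are choices made for this illustration; the sketch only enumerates the schedule and does not model any hardware.

```python
STAGES = ["IF", "ID", "EX", "M", "WB"]


def instr_in_stage(k: int, T: int, n: int):
    """Index i with I(k, T) = i in the ideal schedule, or None if stage k
    is idle in cycle T. Uses T = k + i for a sequence of n instructions."""
    i = T - k
    return i if 0 <= i < n else None


def schedule(n: int):
    """One row per stage, one column per cycle, for n instructions."""
    cycles = n + len(STAGES) - 1
    return [[instr_in_stage(k, T, n) for T in range(cycles)]
            for k in range(len(STAGES))]


if __name__ == "__main__":
    for name, row in zip(STAGES, schedule(5)):
        cells = ["I%d" % i if i is not None else "--" for i in row]
        print("%-2s  %s" % (name, " ".join(cells)))
```

In every column, each instruction index occurs at most once, which is the property that rules out structural hazards once every resource is assigned to exactly one stage.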

 * -   * - +

T IS still impossible to fetch an instruction in every cycle. Before we


explain the simple reason, we introduce more notation.
For a register R and an instruction Ii , we denote by Ri the content of regis-
ter R after the (sequential) execution of instruction Ii . Note that instruction
Ii is fetched from instruction memory location PCi 1. The notation can


be extended to fields of registers. For instance, immi denotes the content


of the immediate field of the instruction register IRi . The notation can be
extended further to expressions depending on registers.
/
  &
Recall that the control operations are the DLX instructions # %#
BASIC P IPELINING "# "# " and ". The branch target btargeti of a control operation Ii is
defined in the obvious way by
RS1i 1
 for "# "
btargeti 
PCi 1  4  immi for # %# "# "


We say that a branch or jump is taken in $I_i$ (short $bjtaken_i = 1$) if

  • $I_i$ has the type j, jal, jr or jalr, or

  • $I_i$ is a branch beqz and $RS1_{i-1} = 0$, or

  • $I_i$ is a branch bnez and $RS1_{i-1} \neq 0$.


Now suppose that instruction $I_i$ is a control operation which is fetched in cycle T, where

$$I_i = IM[PC_{i-1}].$$

The next instruction $I_{i+1}$ then has to be fetched from location $PC_i$ with

$$PC_i = \begin{cases} btarget_i & \text{if } bjtaken_i = 1 \\ PC_{i-1} + 4 & \text{otherwise,} \end{cases}$$

but instruction $I_i$ is not in the instruction register before cycle T + 1. Thus, even if we provide an extra adder for the branch target computation in stage decode, $PC_i$ cannot be computed before cycle T + 1. Hence, instruction $I_{i+1}$ can only be fetched in cycle T + 2.

Delayed Branch
The way out of this difficulty is by very brute force: one changes the semantics of the branch instruction by two rules, which say:

1. A branch taken in instruction $I_i$ affects only the PC computed in the following instruction, i.e., $PC_{i+1}$. This mechanism is called delayed branch.

2. If $I_i$ is a control operation, then the instruction $I_{i+1}$ following $I_i$ is called the instruction in the delay slot of $I_i$. No control operations are allowed in delay slots.

A formal inductive definition of the delayed branch mechanism is

$$PC_{-1} = 0, \qquad bjtaken_{-1} = 0, \qquad PC_{i+1} = \begin{cases} btarget_i & \text{if } bjtaken_i = 1 \\ PC_i + 4 & \text{otherwise.} \end{cases}$$
Observe that the definition of branch targets PC + 4 + imm instead of the much more obvious branch targets PC + imm is motivated by the delayed branch mechanism. After a control operation $I_i$, one always executes the instruction $IM[PC_{i-1} + 4]$ in the delay slot of $I_i$ (because $I_i$ does not occupy a delay slot and hence $bjtaken_{i-1} = 0$). Since $PC_i = PC_{i-1} + 4$ in this situation, a branch target PC + imm would require the computation

$$PC_{i+1} = PC_i + imm_i - 4$$

instead of

$$PC_{i+1} = PC_i + imm_i.$$

The delayed branch semantics is, for example, used in the MIPS [KH92], the SPARC [SPA92] and the PA-RISC [Hew94] instruction set.

Delayed PC
Instead of delaying the effect of taken branches, one could opt for delaying the effect of all PC calculations. A program counter PC' is updated according to the trivial sequential semantics

$$PC'_i = \begin{cases} PC'_{i-1} + imm_i & \text{if } bjtaken_i = 1 \wedge I_i \in \{\text{beqz, bnez, j, jal}\} \\ RS1_{i-1} & \text{if } bjtaken_i = 1 \wedge I_i \in \{\text{jr, jalr}\} \\ PC'_{i-1} + 4 & \text{otherwise.} \end{cases}$$

The result is simply clocked into a delayed program counter DPC:

$$DPC_{i+1} = PC'_i.$$

The delayed program counter DPC is used for fetching instructions from IM, namely $I_i = IM[DPC_{i-1}]$. Computations are started with

$$PC'_{-1} = 4, \qquad DPC_{-1} = 0.$$

We call this uniform and easy to implement mechanism delayed PC. The two mechanisms will later turn out to be completely equivalent.

Jump and Link Instructions


We continue our discussion with a subtle observation concerning the semantics of the jump and link instructions (jal, jalr), which are usually used for procedure calls. Their semantics changes by the delayed branch mechanism as well! Saving PC + 4 into GPR[31] results in a return to the delay slot of the jump and link instruction. Of course, the return should be to the instruction after the delay slot (e.g., see the MIPS architecture manual [KH92]). Formally, if $I_i = IM[PC_{i-1}]$ is a jump and link instruction, then

$$PC_i = PC_{i-1} + 4$$

because $I_i$ is not in a delay slot, and instruction

$$I_{i+1} = IM[PC_i]$$

is the instruction in the delay slot of $I_i$. The jump and link instruction $I_i$ should therefore save

$$GPR[31]_i = PC_i + 4 = PC_{i-1} + 8.$$

In the simpler delayed PC mechanism, one simply saves

$$GPR[31]_i = PC'_{i-1} + 4.$$


Equivalence of Delayed Branch and Delayed PC


Theorem 4.1. Suppose a machine with delayed branch and a machine with delayed PC are started with the same program (without control operations in delay slots) and with the same input data. The two machines then perform exactly the same sequence $I_0, I_1, \ldots$ of instructions.

Proof. This is actually a simulation theorem. By induction on i, we will show two things, namely

1. $(PC_i, PC_{i+1}) = (DPC_i, PC'_i)$,

2. and if $I_i$ is a jump and link instruction, then the value $GPR[31]_i$ saved into register 31 during instruction $I_i$ is identical for both machines.

Since $bjtaken_{-1} = 0$, it follows that $PC_0 = 4$. Thus

$$(PC_{-1}, PC_0) = (0, 4) = (DPC_{-1}, PC'_{-1})$$

and part one of the induction hypothesis holds for i = −1.


In the induction step, we conclude from i − 1 to i, based on the induction hypothesis $(PC_{i-1}, PC_i) = (DPC_{i-1}, PC'_{i-1})$. Since

$$\begin{aligned} DPC_i &= PC'_{i-1} && \text{by the definition of DPC} \\ &= PC_i && \text{by the induction hypothesis,} \end{aligned}$$

it only remains to show that

$$PC_{i+1} = PC'_i.$$

Since $DPC_{i-1} = PC_{i-1}$, the same instruction $I_i$ is fetched with delayed branch and delayed PC, and in both cases, the variable $bjtaken_i$ has the same value.
If $bjtaken_i = 0$, it follows

$$\begin{aligned} PC'_i &= PC'_{i-1} + 4 \\ &= PC_i + 4 && \text{by the induction hypothesis} \\ &= PC_{i+1} && \text{by the definition of delayed branch.} \end{aligned}$$

If $bjtaken_i = 1$, then instruction $I_i$ cannot occupy a delay slot, and therefore $bjtaken_{i-1} = 0$. If $I_i$ is of type beqz, bnez, j or jal, then

$$\begin{aligned} PC'_i &= PC'_{i-1} + imm_i \\ &= PC'_{i-2} + 4 + imm_i && \text{because } bjtaken_{i-1} = 0 \\ &= PC_{i-1} + 4 + imm_i && \text{by the induction hypothesis for } i - 2 \\ &= btarget_i \\ &= PC_{i+1} && \text{because } bjtaken_i = 1. \end{aligned}$$

If $I_i$ is of type jr or jalr, then

$$\begin{aligned} PC'_i &= RS1_{i-1} \\ &= btarget_i \\ &= PC_{i+1} && \text{because } bjtaken_i = 1, \end{aligned}$$

and part one of the induction hypothesis follows.
For the second part, suppose $I_i$ is a jump and link instruction. With delayed branch, $PC_{i-1} + 8$ is then saved. Because $I_i$ is not in a delay slot, we have

$$\begin{aligned} PC_{i-1} + 8 &= PC_i + 4 \\ &= DPC_i + 4 && \text{by the induction hypothesis} \\ &= PC'_{i-1} + 4 && \text{by the definition of delayed PC.} \end{aligned}$$

This is exactly the value saved in the delayed PC version. □
Table 4.2 illustrates for both mechanisms, delayed branch and delayed PC, how the PCs are updated in case of a jump and link instruction.
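The equivalence can also be checked experimentally on small examples. The following Python sketch is only an illustration under simplifying assumptions: the instruction encoding (`"seq"`, `"rel"`, `"jal"`, `"jr"`), the function names and the sample program are hypothetical and not part of the DLX design; every control operation in the sketch is treated as taken, and the register operand of an absolute jump is given directly as a number.

```python
# Hypothetical abstract instructions for this sketch:
#   ("seq",)        no control transfer
#   ("rel", imm)    taken PC-relative control operation (taken branch or j)
#   ("jal", imm)    PC-relative jump and link
#   ("jr", target)  absolute jump; target plays the role of RS1
NOP = ("seq",)

def run_delayed_branch(prog, steps):
    """Delayed branch: PC_{i+1} = btarget_i if bjtaken_i else PC_i + 4."""
    pc, bjtaken, btarget = 0, False, 0            # PC_{-1} = 0, bjtaken_{-1} = 0
    fetched, links = [], []
    for _ in range(steps):
        instr = prog.get(pc, NOP)                 # I_i = IM[PC_{i-1}]
        fetched.append(pc)
        next_pc = btarget if bjtaken else pc + 4  # PC_i, decided by I_{i-1}
        kind = instr[0]
        bjtaken = kind != "seq"                   # bjtaken_i
        if kind in ("rel", "jal"):
            btarget = pc + 4 + instr[1]           # PC_{i-1} + 4 + imm_i
        elif kind == "jr":
            btarget = instr[1]                    # RS1_{i-1}
        if kind == "jal":
            links.append(pc + 8)                  # GPR[31]_i = PC_{i-1} + 8
        pc = next_pc
    return fetched, links

def run_delayed_pc(prog, steps):
    """Delayed PC: DPC_i = PC'_{i-1}, PC'_i by sequential semantics."""
    dpc, pc_prime = 0, 4                          # DPC_{-1} = 0, PC'_{-1} = 4
    fetched, links = [], []
    for _ in range(steps):
        instr = prog.get(dpc, NOP)                # I_i = IM[DPC_{i-1}]
        fetched.append(dpc)
        kind = instr[0]
        if kind in ("rel", "jal"):
            new_pc_prime = pc_prime + instr[1]    # PC'_{i-1} + imm_i
        elif kind == "jr":
            new_pc_prime = instr[1]               # RS1_{i-1}
        else:
            new_pc_prime = pc_prime + 4
        if kind == "jal":
            links.append(pc_prime + 4)            # GPR[31]_i = PC'_{i-1} + 4
        dpc, pc_prime = pc_prime, new_pc_prime
    return fetched, links

# Sample program without control operations in delay slots.
PROGRAM = {4: ("jal", 16), 28: ("jr", 12)}
```

Running both functions on `PROGRAM` for a few steps yields the same list of fetch addresses and the same saved link value, as the theorem predicts for programs without control operations in delay slots.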

4.2 Prepared Sequential Machines

In this section we construct a machine DLXσ with the following properties:

1. The machine consists of data paths, a control as well as a stall engine for the clock generation.

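Table 4.2: The impact of a jump and link instruction $I_i$ on the PCs under the delayed branch and the delayed PC regime

| after | delayed branch: PC | delayed branch: GPR[31] | delayed PC: DPC | delayed PC: PC' | delayed PC: GPR[31] |
| $I_{i-1}$ | $PC_{i-1}$ | | $PC_{i-1}$ | $PC_i = PC_{i-1} + 4$ | |
| $I_i$ | $PC_{i-1} + 4$ | $PC_{i-1} + 8$ | $PC'_{i-1} = PC_i$ | $PC_{i+1} = btarget_i$ | $PC'_{i-1} + 4$ |
| $I_{i+1}$ | $btarget_i$ | | $PC'_i = btarget_i$ | $PC_{i+2}$ | |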

2. The data paths and the control of the machine are arranged in a 5-stage pipeline, but

3. Only one stage at a time is clocked in a round robin fashion. Thus, machine DLXσ will be sequential; its correctness is easily proved using the techniques from the previous chapter.

4. The machine can be turned into a pipelined machine DLXπ by a very simple transformation concerning only the PC environment and the stall engine. Correctness is then shown by a simulation theorem stating, under certain hypotheses, that machine DLXπ simulates machine DLXσ.

We call machine DLXσ a prepared sequential machine. The overall structure of the data paths is depicted in figure 4.3. There are 5 stages of registers and RAM cells. Note that we have arranged all registers and RAM cells at the bottom of the stage where they are computed.
For each stage k (with the numbers or names of table 4.1) we denote by out(k) the set of registers and RAM cells computed in stage k. Similarly, we denote by in(k) the set of registers and RAM cells which are inputs of stage k. These sets are listed in table 4.3 for all k. R.k denotes that R is an output register of stage k − 1, i.e., $R.k \in out(k-1)$.
The cost of the data paths is

$$\begin{aligned} C_{DP} = {} & C_{PCenv} + C_{IMenv} + C_{IRenv} + C_{EXenv} + C_{DMenv} \\ & + C_{SH4Lenv} + C_{GPRenv} + C_{CAddr} + 7 \cdot C_{ff}(32) + 3 \cdot C_{ff}(5 + 12). \end{aligned}$$

Most of the environments can literally be taken from the sequential DLX designs. Only two environments undergo nontrivial changes: the PC environment and the execute environment EXenv. The PC environment has to be adapted for the delayed PC mechanism. For store instructions, the address calculation of state addr and the operand shift of state sh4s now have to be performed in a single cycle. This will not significantly slow down the cycle time, because only the last two bits of the address influence the shift distance, and these bits are known early in the cycle. Trivially, the memory M is split into an instruction memory IM and a data memory DM.

Figure 4.3: High level view of the prepared sequential DLX data paths

Table 4.3: Inputs and outputs of each stage k of the prepared DLX data paths

| stage | in(k) | out(k) |
| 0 IF | DPC, IM | IR |
| 1 ID | GPR, PC', IR | A, B, PC', link, DPC, co, Cad.2 |
| 2 EX | A, B, link, co, Cad.2, IR | MAR, MDRw, Cad.3 |
| 3 M | MAR, MDRw, DM, Cad.3, IR | DM, C, MDRr, Cad.4 |
| 4 WB | C, MDRr, Cad.4, IR | GPR |
There is, however, a simple but fundamental change in the way we clock the output registers of the stages. Instead of a single update enable signal UE (section 3.4.3), we introduce for every stage k a distinct update enable signal ue.k. An output register R of stage k is updated iff its clock request signal Rce and the update enable signal of stage k are both active (figure 4.4). Thus, the clock enable signal Rce' of such a register R is obtained as

$$Rce' = Rce \wedge ue.k.$$

As before, the read and write signals of the main memory M are not masked by the update enable signal ue.3 but by the full bit full.3 of the memory stage.

Figure 4.4: Controlling the update of the output registers of stage k

4.2.1 Prepared DLX Data Paths

Environment IRenv of the instruction register is still controlled by the signals Jjump (J-type jump), shiftI and the clock signal IRce. The functionality is virtually the same as before. On IRce = 1, the output IMout of the instruction memory is clocked into the instruction register

$$IR = IMout$$

and the 32-bit constant co is generated as in the sequential design, namely as

$$co = constant(IR) = \begin{cases} sext(PCoffset) & \text{if } Jjump = 1 \\ 0^{27}\, SA & \text{if } shiftI = 1 \\ sext(imm) & \text{otherwise,} \end{cases}$$

where sext denotes the sign extension to 32 bits. The cost and the delay of environment IRenv remain the same.
For the use in later pipeline stages, the two opcodes IR[31:26] and IR[5:0] are buffered in three registers IR.k, each of which is 12 bits wide.
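The generation of the constant co can be pictured with a few lines of Python. This is a minimal sketch, not the circuit of the book: it assumes the usual DLX field positions (PCoffset = IR[25:0], imm = IR[15:0], SA = IR[10:6]), and the helper names `sext` and `constant` as well as the keyword arguments are chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF

def sext(value, bits):
    """Sign-extend a `bits`-wide field to a 32-bit two's complement word."""
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK32

def constant(ir, jjump=False, shift_i=False):
    """Sketch of co = constant(IR); field positions are assumptions."""
    if jjump:                        # J-type: 26-bit PC offset, sign-extended
        return sext(ir & 0x3FFFFFF, 26)
    if shift_i:                      # shift immediate: 27 zeros followed by SA
        return (ir >> 6) & 0x1F
    return sext(ir & 0xFFFF, 16)     # I-type: 16-bit immediate, sign-extended
```

For example, an I-type instruction word with immediate field 0xFFFC (that is, −4) yields the 32-bit constant 0xFFFFFFFC.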

Environment SH4Lenv is controlled by signal shift4l, which requests a shift in case of a load instruction. The only modification in this environment is that the memory address is now provided by register C and not by register MAR. This has an impact on the functionality of environment SH4Lenv but not on its cost and delay.
Let sh4l(a, dist) denote the function computed by the shifter SH4L as it was defined in section 3.3.7. The modified SH4Lenv environment then provides the result

$$C' = \begin{cases} sh4l(MDRr,\ C[1:0]000) & \text{if } shift4l = 1 \\ C & \text{if } shift4l = 0. \end{cases}$$

Environments CAddr and GPRenv
As in the sequential design, circuit CAddr generates the address Cad of the destination register based on the control signals Jlink (jump and link) and Itype. However, the address Cad is now precomputed in stage ID and is then passed down stage by stage to the register file environment GPRenv. For later use, we introduce the notation

$$Cad = CAddr(IR).$$

Environment GPRenv (figure 4.5) itself still has the same functionality. It provides the two register operands

$$A' = \begin{cases} GPR[RS1] = GPR[IR[25:21]] & \text{if } RS1 \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

$$B' = \begin{cases} GPR[RS2] = GPR[IR[20:16]] & \text{if } RS2 \neq 0 \\ 0 & \text{otherwise} \end{cases}$$

and updates the register file under the control of the write signal GPRw:

$$GPR[Cad.4] := C' \quad \text{if } GPRw = 1.$$
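The behavior of the register file environment specified above can be sketched as follows. The class name `GPRenv` and its method names are chosen for this illustration only; the model ignores timing and simply forces the value 0 whenever register 0 is read.

```python
class GPRenv:
    """Behavioral sketch of the register file environment (not gate level)."""

    def __init__(self):
        self.gpr = [0] * 32                      # 32 registers of 32 bits

    def read(self, ir):
        """Return the operands (A', B') for instruction word ir."""
        rs1 = (ir >> 21) & 0x1F                  # IR[25:21]
        rs2 = (ir >> 16) & 0x1F                  # IR[20:16]
        a = self.gpr[rs1] if rs1 != 0 else 0     # register 0 reads as zero
        b = self.gpr[rs2] if rs2 != 0 else 0
        return a, b

    def write(self, cad4, c_prime, gprw):
        """GPR[Cad.4] := C' if GPRw = 1."""
        if gprw:
            self.gpr[cad4] = c_prime & 0xFFFFFFFF
```

Note that the destination address is passed in as an argument, mirroring the fact that Cad.4 is precomputed and no longer derived from IR inside the environment.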

Since circuit CAddr is now an environment of its own, the cost of the register file environment GPRenv runs at

$$C_{GPRenv} = C_{ram3}(32, 32) + 2 \cdot (C_{zero}(5) + C_{inv} + C_{and}(32)).$$

Due to the precomputed destination address Cad.4, the update of the register file becomes faster. Environment GPRenv now only delays the write access by

$$D_{GPR,write} = D_{ram3}(32, 32).$$

Let $A_{CON}(csWB)$ denote the accumulated delay of the control signals which govern stage WB; the cycle time of the write back stage then runs at

$$\begin{aligned} A_{SH4Lenv} &= A_{CON}(csWB) + D_{SH4Lenv} \\ T_{WB} = T_{GPRw} &= A_{SH4Lenv} + D_{GPR,write} + \Delta. \end{aligned}$$

The delay $D_{GPR,read}$ of a read access, however, remains unchanged; it adds to the cycle time of stage IF and of the control unit.


Figure 4.5: Environment GPRenv of the DLXσ design

Memory Environments
The DLX design which is prepared for pipelined execution comprises two memories, one for instructions and one for the actual data accesses.

Environment IMenv of the instruction memory is controlled by a single signal fetch which activates the read signal Imr. The address of an instruction memory access is specified by register DPC. Thus, on fetch = 1, the environment IMenv performs a read operation providing the memory word

$$IMout = IM_{word}[DPC[31:2]00].$$

Since memory IM performs no write accesses, its write signal Imw is always inactive.1 The control IMC of the instruction memory is trivial and has zero cost and delay. Let $d_{Imem}$ denote the access time of the banks of memory IM. Since the address is directly taken from a register, environment IMenv delays the instruction fetch by

$$D_{IMenv}(IR) = D_{IMC} + d_{Imem} = d_{Imem}.$$

The instruction memory also provides a signal ibusy indicating that the access cannot be finished in the current clock cycle. We expect this signal to be valid $d_{Istat}$ time units after the start of an IM memory access.

Environment DMenv of the data memory DM performs the memory accesses of load and store instructions. To a large extent, DMenv is identical to the memory environment of the sequential design, but the address is now always provided by register MAR.
Environment DMenv is controlled by the two signals Dmr and Dmw which request a memory read or write access, respectively. Since memory DM is byte addressable, the control DMC generates four bank write signals Dmbw[3:0] based on the address and the width of the write access as in the sequential design. The cost and delay of the memory control remain the same.
The data memory DM has an access time of $d_{Dmem}$ and provides a flag dbusy with a delay of $d_{Dstat}$. Signal dbusy indicates that the current access cannot be finished in the current clock cycle. Let $A_{CON}(csM)$ denote the accumulated delay of the signals Dmr and Dmw, then

$$\begin{aligned} T_M = T_{DMenv}(read) &= A_{CON}(csM) + D_{DMC} + d_{Dmem} + \Delta \\ A_{DMenv}(dbusy) &= A_{CON}(csM) + D_{DMC} + d_{Dstat}. \end{aligned}$$

1 This is of course an abstraction. In chapter 6 we treat instruction caches, which of course can be written.

PC Environment
The environment PCenv of figure 4.6 is governed by seven control signals, namely:

  • reset, which initializes the registers PC' and DPC,

  • the clock signals PCce and linkce,

  • jump, which denotes one of the four jump instructions j, jal, jr and jalr,

  • jumpR, which denotes an absolute jump instruction (jr, jalr),

  • branch, which denotes a branch instruction beqz, bnez, and

  • bzero, which is active on beqz and inactive on bnez.

Based on these signals, its glue logic PCglue generates the clock signal of the registers PC' and DPC. They are clocked simultaneously when signal PCce is active or on reset, i.e., they are clocked by

$$PCce \vee reset.$$

In addition, PCglue tests operand A' for zero

$$AEQZ = 1 \iff A'[31:0] = 0$$

and generates signal bjtaken according to the specifications of section 4.1. Thus, bjtaken is set on any jump or on a taken branch:

$$bjtaken = jump \vee branch \wedge (bzero \ \text{XNOR} \ AEQZ).$$
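The equation for bjtaken can be checked against the definition of taken branches from section 4.1 with a small Python sketch. The function name and the use of Python booleans for the control signals are assumptions of this illustration.

```python
def bjtaken(jump, branch, bzero, a_prime):
    """bjtaken = jump OR (branch AND (bzero XNOR AEQZ))."""
    aeqz = (a_prime & 0xFFFFFFFF) == 0      # zero test of operand A'
    xnor = (bzero == aeqz)                  # XNOR of two boolean signals
    return bool(jump or (branch and xnor))
```

With bzero = 1 (beqz) the branch is taken exactly if A' is zero, and with bzero = 0 (bnez) exactly if A' is nonzero, which is what the XNOR expresses.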


Figure 4.6: Environment PCenv implementing the delayed PC

Let $A_{CON}(csID)$ denote the accumulated delay of the control signals which govern stage ID. The cost of the glue logic and the delay of the signals AEQZ and bjtaken then run at

$$\begin{aligned} C_{PCglue} &= 2 \cdot C_{or} + C_{and} + C_{xnor} + C_{zero}(32) \\ D_{PCglue} &= D_{or} + D_{and} + D_{xnor} \\ A(AEQZ) &= A_{GPRenv}(A') + D_{zero}(32) \\ A(bjtaken) &= \max\{A_{CON}(csID),\ A(AEQZ)\} + D_{PCglue}. \end{aligned}$$

The environment PCenv implements the delayed PC mechanism of section 4.1 in a straightforward way. On an active clock signal, the two PCs are set to

$$(DPC, PC') := \begin{cases} (0, 4) & \text{if } reset \\ (PC', pc') & \text{otherwise,} \end{cases}$$

where the value pc' = nextPC(PC', A', co) of the instruction I, which is held in register IR, is computed as

$$pc' = \begin{cases} PC' + co & \text{if } bjtaken \wedge I \in \{\text{beqz, bnez, j, jal}\} \\ A' & \text{if } I \in \{\text{jr, jalr}\} \\ PC' + 4 & \text{otherwise.} \end{cases}$$

PCenv also provides a register link which is updated under the control of signal linkce. On linkce = 1, it is set to

$$link = PC' + 4,$$

that is the PC to be saved in case of a jump and link instruction.
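One clock step of the PC environment can be sketched behaviorally in Python. The function names `next_pc` and `pcenv_step` and the argument order are hypothetical; the sketch assumes that all clock enables (PCce, linkce) are active and models only the register transfer, not the multiplexer structure of figure 4.6.

```python
MASK32 = 0xFFFFFFFF

def next_pc(pc_prime, a_prime, co, bjtaken, jump_r):
    """pc' = nextPC(PC', A', co) for the instruction held in IR."""
    if jump_r:                               # jr, jalr: absolute target from A'
        return a_prime & MASK32
    if bjtaken:                              # taken beqz, bnez, j, jal
        return (pc_prime + co) & MASK32
    return (pc_prime + 4) & MASK32           # sequential flow

def pcenv_step(dpc, pc_prime, a_prime, co, bjtaken, jump_r, reset=False):
    """Return the new (DPC, PC', link); dpc is the old DPC, which is overwritten."""
    link = (pc_prime + 4) & MASK32           # link = PC' + 4 (old PC')
    if reset:
        return 0, 4, link
    return pc_prime, next_pc(pc_prime, a_prime, co, bjtaken, jump_r), link
```

The old value of DPC does not influence the result; it appears as a parameter only to make the register pair explicit.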


Figure 4.7: Execute environment of the prepared DLX

In order to update the two program counters, environment PCenv requires some operands from other parts of the data paths. The register operand A' is provided by the register file environment GPRenv, whereas the immediate operand co is provided by environment IRenv. The cost and the cycle time of environment PCenv can be estimated as

$$\begin{aligned} C_{PCenv} &= 3 \cdot C_{ff}(32) + 4 \cdot C_{mux}(32) + C_{add}(32) + C_{inc}(30) + C_{PCglue} \\ T_{PCenv} &= \max\{D_{inc}(30),\ A_{IRenv}(co) + D_{add}(32),\ A_{GPRenv}(A'),\ A(bjtaken)\} + 3 \cdot D_{mux}(32) + \Delta. \end{aligned}$$

Execute Environment
The execute environment EXenv of figure 4.7 comprises the ALU environment and the shifter SHenv and connects them to the operand and result busses. Since on a store instruction, the address computation and the operand shift are performed in parallel, three operand and two result busses are needed.
Register A always provides the operand a. The control signals bmuxsel and a'muxsel select the data to be put on the busses b and a':

$$b = \begin{cases} B & \text{if } bmuxsel = 1 \\ co & \text{otherwise,} \end{cases} \qquad a' = \begin{cases} B & \text{if } a'muxsel = 1 \\ A & \text{otherwise.} \end{cases}$$

The data on the result bus D is selected among the register link and the results of the ALU and the shifter. This selection is governed by three output enable signals

$$D = \begin{cases} link & \text{if } linkDdoe = 1 \\ alu & \text{if } ALUDdoe = 1 \\ sh & \text{if } SHDdoe = 1. \end{cases}$$

Note that at most one of these signals should be active at a time.
ALU Environment. Environment ALUenv is governed by the same control signals as in the sequential design, and the specification of its results alu and ovf remains unchanged. However, it now provides two additional bits s[1:0] which are fed directly to the shifter. These are the two least significant bits of the result of the arithmetic unit AU(32). Depending on signal sub, which is provided by the ALU glue logic, the AU computes the sum or the difference of the operands a and b modulo $2^{32}$:

$$s = a + (-1)^{sub} \cdot b \bmod 2^{32} = \begin{cases} a + b \bmod 2^{32} & \text{if } sub = 0 \\ a - b \bmod 2^{32} & \text{if } sub = 1. \end{cases}$$

The cost of the ALU environment and its total delay $D_{ALUenv}$ remain the same, but the bits s[1:0] have a much shorter delay. For all the adders introduced in chapter 2, the delay of these bits can be estimated based on the delay of a 2-bit AU

$$D_{ALUenv}(s[1:0]) = D_{ALUglue} + D_{AU}(2)$$

as it is shown in exercise 4.1.
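The result of the arithmetic unit can be reproduced with ordinary integer arithmetic. The sketch below uses the common two's complement realization (invert b and feed sub as carry-in); this particular realization and the function name `au32` are assumptions of the illustration, since the text above only specifies the result s.

```python
MASK32 = (1 << 32) - 1

def au32(a, b, sub):
    """Return (s, s[1:0]) with s = a + (-1)^sub * b mod 2^32; sub is 0 or 1."""
    b_in = (b ^ MASK32) if sub else b        # invert b on subtraction
    s = (a + b_in + sub) & MASK32            # sub doubles as carry-in
    return s, s & 0b11                       # full sum and its two low bits
```

The two low-order bits depend only on the two low-order bits of a and b and on sub, which is the reason why they are available much earlier than the complete sum.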

Shifter Environment. The shifter environment SHenv is still controlled by signal shift4s which requests an implicit shift in case of a store operation, but its operands are different. On an explicit shift, the operands are provided by the busses a and b, whereas on an implicit shift, they are provided by bus a' and by the result s[1:0] of the ALU environment. Thus, the output sh of SHenv is now specified as

$$sh = \begin{cases} shift(a, b, IR[1:0]) & \text{if } shift4s = 0 \\ cls(a', s[1:0]000) & \text{if } shift4s = 1. \end{cases}$$

However, this modification has no impact on the cost and the delay of the environment. Assuming a delay of $A_{CON}(csEX)$ for the control signals of stage EX, the cost and the cycle time of the whole execute environment EXenv run at

$$\begin{aligned} C_{EXenv} &= C_{ALUenv} + C_{SHenv} + 2 \cdot C_{mux}(32) + 3 \cdot C_{driv}(32) \\ A_{EXenv} &= \max\{D_{ALUenv},\ D_{ALUenv}(s[1:0]) + D_{SHenv}\} + D_{mux} + D_{driv} \\ T_{EX} = T_{EXenv} &= A_{EXenv} + \Delta. \end{aligned}$$
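The implicit shift on a store can be illustrated as follows. The sketch assumes that cls denotes a cyclic left shift of a 32-bit word and that the shift distance s[1:0]000 is the byte offset of the address multiplied by 8; the function names are chosen for this example.

```python
MASK32 = (1 << 32) - 1

def cls(x, dist):
    """Cyclic left shift of the 32-bit word x by dist bit positions."""
    x &= MASK32
    dist %= 32
    return ((x << dist) | (x >> (32 - dist))) & MASK32

def store_shift(a_prime, s10):
    """sh = cls(a', s[1:0]000): shift by 8 * (address mod 4) bits."""
    return cls(a_prime, (s10 & 0b11) << 3)
```

For a byte store to an address with offset 1, for instance, the low-order byte of the operand is moved to bits 15:8 of the word written to memory.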

4.2.2 FSD for the Prepared Data Paths

Figure 4.8 depicts an FSD for the prepared data paths; the tables 4.4 to 4.6 list the corresponding RTL instructions and their active control signals.

Figure 4.8: The FSD of the prepared sequential DLX design

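Table 4.4: RTL instructions of stages IF and ID

| stage | RTL instruction | type of I | control signals |
| IF | IR = IM[DPC] | | fetch, IRce |
| ID | A = A' = RS1, AEQZ = zero(A') | | Ace |
| | B = RS2, link = PC' + 4 | | Bce, linkce |
| | DPC = reset ? 0 : PC' | | PCce |
| | PC' = reset ? 4 : pc' | | PCce |
| | pc' = nextPC(PC', A', co) | j, jal | jump |
| | | jr, jalr | jumpR, jump |
| | | beqz | branch, bzero |
| | | bnez | branch |
| | | otherwise | |
| | co = constant(IR) | j, jal | Jjump |
| | | slli, srli, srai | shiftI |
| | | otherwise | |
| | Cad = CAddr(IR) | jal, jalr | Jlink |
| | | R-type | Rtype |
| | | otherwise | |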

The nontrivial DNFs are listed in table 4.7. Except for the clocks Ace, Bce, PCce and linkce, all the control signals used in the decode stage ID are Mealy signals. Following the pattern of section 3.4, one shows

Theorem 4.2. Let the DLX design be completed such that

1. for each type of instruction, the path specified in table 4.8 is taken,

2. and for each state s, the set of RTL instructions rtl(s) is executed.

If every memory access takes only one cycle, then the machine interprets the DLX instruction set with delayed PC semantics.

The correctness of all pipelined machines in this chapter will follow from this theorem. Adding the stall engine from section 3.4.3 takes care of memory accesses which require more than one cycle.

Table 4.5: RTL instructions of stage EX

| state | RTL instruction | active control signals |
| alu | MAR = A op B, Cad.3 = Cad.2 | bmuxsel, ALUDdoe, MARce, Rtype, Cad3ce |
| test | MAR = (A rel B ? 1 : 0), Cad.3 = Cad.2 | bmuxsel, ALUDdoe, MARce, test, Rtype, Cad3ce |
| shift | MAR = shift(A, B[4:0]), Cad.3 = Cad.2 | bmuxsel, SHDdoe, MARce, Rtype, Cad3ce |
| aluI | MAR = A op co, Cad.3 = Cad.2 | ALUDdoe, MARce, Cad3ce |
| testI | MAR = (A rel co ? 1 : 0), Cad.3 = Cad.2 | ALUDdoe, MARce, test, Cad3ce |
| shiftI | MAR = shift(A, co[4:0]), Cad.3 = Cad.2 | SHDdoe, MARce, shiftI, Rtype, Cad3ce |
| savePC | MAR = link, Cad.3 = Cad.2 | linkDdoe, MARce, Cad3ce |
| addrL | MAR = A + co, Cad.3 = Cad.2 | ALUDdoe, add, MARce, Cad3ce |
| addrS | MAR = A + co, MDRw = cls(B, MAR[1:0]000) | ALUDdoe, add, MARce, amuxsel, shift4s, MDRwce |
| noEX | | |

4.2.3 Precomputed Control

We derive from the above FSD and the trivial stall engine a new control
and stall engine with exactly the same behavior. This will complete the
design of the prepared sequential machine DLXσ .

Table 4.6: RTL instructions of the memory and write back stage

| stage | state | RTL instruction | control signals |
| M | passC | C = MAR, Cad.4 = Cad.3 | Cce, Cad4ce |
| | load | MDRr = Mword[MAR[31:2]00], C = MAR, Cad.4 = Cad.3 | Dmr, MDRrce, Cce, Cad4ce |
| | store | m = bytes(MDRw) | Dmw |
| | noM | | |
| WB | sh4l | GPR[Cad.4] = sh4l(MDRr, MAR[1:0]000) | shift4l, GPRw |
| | wb | GPR[Cad.4] = C | GPRw |
| | noWB | (no update) | |

We begin with a stall engine which clocks all stages in a round robin fashion. It has a 5-bit register full[4:0], where for all stages i, signal

$$ue_i = full_i \wedge \overline{busy}$$

enables the update of the registers in out(i). Since memory accesses can take several cycles, the update of the data memory DM is enabled by $full_3$ and not by $ue_3$. Register full is updated by

$$full[4:0] := \begin{cases} 00001 & \text{if } reset \\ full[4:0] & \text{if } busy \wedge \overline{reset} \\ cls(full) & \text{otherwise.} \end{cases}$$

Since the design comprises two memories, we compute the busy signal by

$$\overline{busy} = ibusy \ \text{NOR} \ dbusy.$$

With signals full defined in this way, we obviously can keep track of the stage which processes the instruction, namely: the instruction is in stage i iff $full_i = 1$. In particular, the instruction is in stage IF iff $full_0 = 1$, and it is in stage ID iff $full_1 = 1$.
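The round robin behavior of this stall engine can be sketched in Python. The sketch assumes that busy is the OR of ibusy and dbusy and that an update is enabled only while busy is inactive; the list representation (index i for stage i) and the function name `stall_step` are choices of this illustration.

```python
def stall_step(full, reset, ibusy, dbusy):
    """One cycle of the round robin stall engine.

    full is a list of five bits, full[i] belonging to stage i.
    Returns (ue, next_full).
    """
    busy = ibusy or dbusy                        # either memory may stall
    ue = [int(bool(f) and not busy) for f in full]
    if reset:
        next_full = [1, 0, 0, 0, 0]              # full[4:0] = 00001
    elif busy:
        next_full = list(full)                   # hold the current stage
    else:
        next_full = [full[4]] + list(full[:4])   # cyclic shift: stage i -> i+1
    return ue, next_full
```

Starting from the reset state, five consecutive cycles without busy move the single full bit through the stages IF, ID, EX, M and WB and back to IF.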
We proceed to transform the FSD by the following four changes:

1. The control signals activated in state IF are now always activated. In cycles with $full_0 = 1$, these signals then have the right value. In other cycles, they do not matter because IR is not clocked.

2. Moore signals activated in state ID are now always activated. They only matter in cycles with $full_1 = 1$.
Table 4.7: Nontrivial disjunctive normal forms (DNF) of the FSD corresponding to the prepared data paths

Nontrivial DNF, Target State, Monomial m ∈ M (IR[31:26], IR[5:0]), Length l(m)
D1 alu 000000 100*** 9
D2 aluI 001*** ****** 3
D3 shift 000000 0001*0 11
000000 00011* 11
D4 shiftI 000000 0000*0 11
000000 00001* 11
D5 test 000000 101*** 9
D6 testI 011*** ****** 3
D7 savePC 010111 ****** 6
000011 ****** 6
D8 addrS 10100* ****** 5
1010*1 ****** 5
D9 addrL 100*0* ****** 4
1000*1 ****** 5
10000* ****** 5
DNF Mealy Signals
D10 Rtype 000000 ****** 6
D4 shiftI 000000 0000*0 (10)
000000 00001* (10)
D7 Jlink 010111 ****** (6)
000011 ****** (6)
D11 jumpR 01011* ****** 5
D12 Jjump 00001* ****** 5
D13 jump D11 OR D12
D14 branch 00010* ****** 5
D15 bzero *****0 ****** 1
Accumulated length of m ∈ M: $\sum_{m \in M} l(m) = 126$

Table 4.8: Paths path(t) through the FSD for each type t of DLX instruction

| DLX Instruction Type | Path through FSD |
| arithmetic/logic instructions with immediate operand | fetch, decode, aluI, passC, wb |
| arithmetic/logic instructions with register operands | fetch, decode, alu, passC, wb |
| test and set instructions with immediate operand | fetch, decode, testI, passC, wb |
| test and set instructions with register operands | fetch, decode, test, passC, wb |
| slli, srli, srai | fetch, decode, shiftI, passC, wb |
| sll, srl, sra | fetch, decode, shift, passC, wb |
| lb, lh, lw, lbu, lhu | fetch, decode, addrL, load, sh4l |
| sb, sh, sw | fetch, decode, addrS, store, noWB |
| jal, jalr | fetch, decode, savePC, passC, wb |
| others | fetch, decode, noEX, noM, noWB |

3. Mealy signals activated in state ID are now activated in every cycle; they too matter only when $full_1 = 1$. Thus, the Mealy signals only depend on the inputs IR but not on the current state.

4. Finally observe that in figure 4.8 only state decode has a fanout greater than one. In stage ID, we can therefore precompute the control signals of all stages that follow and clock them into a register $R.2 \in out(1)$. Table 4.9 lists for each state the signals to be clocked into that register. The inputs of register R.2 are computed in every cycle, but they are only clocked into register R.2 when

$$ue_1 = full_1 \wedge \overline{busy} = 1.$$

Register R.2 contains three classes of signals:

(a) signals x to be used in the next cycle only; they control stage EX,
(b) signals y to be used in the next two cycles; they control the stages EX and M, and
(c) signals z to be used in the next three cycles; they control the stages EX, M and WB.

The control signals y of stage M are delayed by one additional register $R.3 \in out(2)$, whereas the signals of stage WB are delayed by the registers R.3 and $R.4 \in out(3)$ as depicted in figure 4.9.
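Table 4.9: Control signals to be precomputed during stage ID for each of the 10 execute states. The signals of the first table are all of type x, i.e., they only control stage EX.

| state | ALUDdoe | SHDdoe | linkDdoe | add | test | Rtype |
| shift | | 1 | | | | 1 |
| shiftI | | 1 | | | | 1 |
| alu | 1 | | | | | 1 |
| aluI | 1 | | | | | |
| test | 1 | | | | 1 | 1 |
| testI | 1 | | | | 1 | |
| addrL | 1 | | | 1 | | |
| addrS | 1 | | | 1 | | |
| savePC | | | 1 | | | |
| noEX | | | | | | |

| EX | MARce | bmuxsel | MDRwce, amuxsel, shift4s | Cad3ce | |
| M | | | Dmw | Cad4ce, Cce | Dmr, MDRrce |
| WB | | | | GPRw | shift4l |
| Type | x | x | y | z | z |
| shift | 1 | 1 | | 1 | |
| shiftI | 1 | | | 1 | |
| alu | 1 | 1 | | 1 | |
| aluI | 1 | | | 1 | |
| test | 1 | 1 | | 1 | |
| testI | 1 | | | 1 | |
| addrL | 1 | | | 1 | 1 |
| addrS | 1 | | 1 | | |
| savePC | 1 | | | 1 | |
| noEX | | | | | |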

Figure 4.9: Buffering of the precomputed control signals

Figure 4.10: Structure of the data paths DP and the precomputed control CON of the DLXσ machine

Table 4.10: Parameters of the two control automata; one precomputes the Moore signals (ex) and the other generates the Mealy signals (id).

| | # states k | # inputs σ | # outputs γ | output frequency ν_sum | ν_max |
| ex | 10 | 12 | 11 | 39 | 9 |
| id | 1 | 12 | 9 | 11 | 2 |

| | fanin of the states: fan_sum | fan_max | # monomials #M | length l_sum | l_max |
| ex | 15 | 3 | 15 | 104 | 11 |
| id | – | – | 5 | 22 | 10 |

The new precomputed control generates the same control signals as the FSD, but the machine now has a very regular structure: control signals coming from register $R.k \in out(k-1)$ control stage k of the data paths for all k > 1. Indeed, if we define R.0 and R.1 as dummy registers of length 0, the same claim holds for all k. The structure of the data paths and the precomputed control of machine DLXσ is illustrated in figure 4.10.
The hardware generating the inputs of register R.2 is a Moore automaton with the 10 EX states, precomputed control signals and the parameters ex from table 4.10. The state noEX serves as the initial state of the automaton. The next state only depends on the input IR but not on the current state. Including the registers R.3 and R.4, the control signals of the stages EX to WB can be precomputed at the following cost and cycle time:

$$\begin{aligned} C_{CON}(moore) &= C_{pMoore}(ex) + (3 + 2) \cdot C_{ff} \\ T_{CON}(moore) &= T_{pMoore}(ex). \end{aligned}$$

The Mealy signals which govern stage ID are generated by a Mealy automaton with a single state and the parameters id of table 4.10. All its inputs are provided by register IR at zero delay. According to section 2.6, the cost of this automaton and the delay of the Mealy signals can be estimated as

CCON mealy  CMealy id   3  2  C f f


ACON mealy  AO id 

We do not bother to analyze cost and cycle time of the stall engine.

4.2.4 A Basic Observation

Later on, we will establish the correctness of pipelined machines by showing that they simulate machine DLXσ. This will require an inductive proof on a cycle by cycle basis. We will always argue about a fixed but arbitrary sequence

$$I = I_0, I_1, \ldots$$

of instructions which is preceded by reset and which is itself not interrupted by reset.
If during a cycle the busy signal is active, then the state of the machine does not change at the end of that cycle. We therefore only number the cycles during which the busy signal is inactive with

$$T = 0, 1, \ldots$$

For such cycles T and signals R, we denote by $R^T$ the value of R during cycle T; R can also be the output of a register. We abbreviate with

$$I_\sigma(k, T) = i$$

the fact that instruction $I_i$ is in stage k of machine DLXσ during cycle T. Formally, this can be defined as

  • the execution starts in cycle T = 0, i.e., $I_\sigma(0, 0) = 0$,

  • if $I_\sigma(k, T) = i$ and k < 4, then $I_\sigma(k + 1, T + 1) = i$, and

  • if $I_\sigma(4, T) = i$, then $I_\sigma(0, T + 1) = i + 1$.

For any other combination of T and k, the scheduling function $I_\sigma(k, T)$ is undefined. Hence,

$$I_\sigma(k, T) = i \iff T = 5i + k \iff i = \lfloor T/5 \rfloor \text{ and } k = T \bmod 5;$$

and for any cycle T ≥ 0, stage k is full ($full_k^T = 1$) iff $I_\sigma(k, T)$ is defined.
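The closed form of the scheduling function can be compared with its inductive definition by a short Python sketch. The function names `i_sigma` and `schedule` are hypothetical; `None` stands for "undefined".

```python
def i_sigma(k, T):
    """Closed form: I_sigma(k, T) = i iff T = 5i + k; None if undefined."""
    if 0 <= k <= 4 and T >= 0 and T % 5 == k:
        return T // 5
    return None

def schedule(cycles):
    """Inductive definition: follow one instruction at a time through
    the stages 0..4, then start the next instruction in stage 0."""
    table = {}
    k, i = 0, 0                      # I_sigma(0, 0) = 0
    for T in range(cycles):
        table[(k, T)] = i
        if k < 4:
            k += 1                   # same instruction, next stage
        else:
            k, i = 0, i + 1          # next instruction enters stage 0
    return table
```

Both descriptions agree on every pair (k, T); in each cycle exactly one stage holds an instruction, which reflects the sequential nature of DLXσ.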
Recall that we denote by $R_i$ the value of R after execution of instruction $I_i$. By $R_{-1}$ we denote the initial value of R, i.e., the value of R just after reset. A basic observation about the cycle by cycle progress of machine DLXσ is formulated in the following lemma.

Lemma 4.3 (Dateline Lemma). Let $I_\sigma(k, T) = i$, and let $R \in out(t)$, then

$$R^T = \begin{cases} R_{i-1} & \text{if } t \geq k \\ R_i & \text{if } t < k. \end{cases}$$
This is illustrated in figure 4.11. During cycle T, registers above stage k already have the new value $R_i$, whereas registers below stage k still have the old value $R_{i-1}$. In other words, on downward arrows of figure 4.3 machine DLXσ reads values $R_i$ from the current instruction, whereas on upward arrows the machine reads values $R_{i-1}$ from the previous instruction.
This very intuitive formulation of the lemma is the reason why in figure 4.3 we have drawn the general purpose register file at the bottom of the pipeline and not, as is usual, in the middle of stage ID. A formal proof uses the fact that

$$R_{i-1} = R^{5i}$$

and proceeds for T = 5i + k by induction on k. We leave the simple details as exercise 4.2.
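The statement of the lemma can be made concrete with a very coarse model. The Python sketch below is an illustration under simplifying assumptions: each stage t owns a single abstract register which records only the index of the instruction that wrote it last, every instruction is assumed to write the registers of every stage, and the function name `check_dateline` is hypothetical.

```python
def check_dateline(num_instr):
    """Check R^T = R_i for t < k and R^T = R_{i-1} for t >= k."""
    last_writer = [-1] * 5                 # all registers hold R_{-1} after reset
    for T in range(5 * num_instr):
        i, k = divmod(T, 5)                # I_sigma(k, T) = i
        for t in range(5):
            expected = i if t < k else i - 1
            if last_writer[t] != expected:
                return False
        last_writer[k] = i                 # out(k) is clocked at the end of cycle T
    return True
```

The check passes for any number of instructions, since stage k writes its registers only at the end of the cycle in which it holds instruction $I_i$.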
Another very intuitive way to state this lemma is the following. Imagine that wires between pipeline stages are so long that we can wrap the machine around the equator (with stage k east of stage k + 1 mod 5). Now imagine that we process one instruction per day and that we clock the pipeline stages at the dateline, i.e., the border between today and yesterday. Then the lemma states that east of the dateline we already have today's data whereas west of the dateline we still have yesterday's data.
Let $I_\sigma(4, T) = i$, then $I_\sigma(0, T + 1) = i + 1$, and the dateline lemma applies for all R:

$$R^{T+1} = R_{(i+1)-1} = R_i.$$

Figure 4.11: Illustration of lemma 4.3. In the current cycle, $I_i$ is in stage k.


4.3 Pipelining as a Transformation

With two very simple changes, we transform the prepared sequential machine DLXσ from the previous section into a pipelined machine DLXπ:

1. Register DPC, i.e., the delayed PC, is discarded. The instruction memory IM is now directly addressed by PC'. At reset, the instruction memory is addressed with address 0, and PC' is initialized with 4. This is illustrated in figure 4.12. Register PC' is still clocked by

$$PCce \vee reset.$$

2. The stall engine from figure 4.13 is used. For all i, signal $ue_i$ enables the update of registers and RAM cells in out(i). The update of the data memory DM is now enabled by

$$full_3 \wedge \overline{reset}.$$
#
  &#
IF
00111100 dpc IM
P IPELINING AS A
[31:2] [1:0] co 0 1 reset

0011
T RANSFORMATION
Inc(30) Add(32)
A’ 0

01
jumpR 1 0
ID
0 1 bjtaken
4
0 1 reset
nextPC

link PC’

   PC environment of the DLXπ design, implementing a delayed PC

CE ue.0

CE 1 full.1
reset
ue.1
CE full.2

ue.2
CE full.3

ue.3
CE full.4

ue.4

   The stall engine of the DLXπ design

At reset, the signals ue4 : 0 are initialized with 00001. When only
counting cycles T with an inactive busy signal, the update enable sig-
nals ue become active successively as indicated in table 4.11. Note
that we now assume the reset signal to be active during cycle T  0.
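The start-up behavior listed in table 4.11 can be replayed with a few lines of Python. This is only a sketch under assumptions: it takes the update rule of the stall engine to be $ue_0 = CE$, $ue_1 = CE \land \overline{reset}$, $ue_k = CE \land \overline{reset} \land full.k$ and $full.k := ue_{k-1}$ for $k \ge 2$, it assumes $CE = 1$ in every cycle (no busy memory), and the function name `stall_engine_trace` is made up for illustration.

```python
def stall_engine_trace(cycles, initial_full=(1, 1, 1)):
    """Trace ue[0..4] and full[2..4]; reset is active in cycle 0 only.

    Assumption: CE = 1 in every cycle, i.e. no memory is ever busy.
    initial_full models the arbitrary power-up contents of full[2..4].
    """
    full = {2: initial_full[0], 3: initial_full[1], 4: initial_full[2]}
    rows = []
    for t in range(cycles):
        reset = 1 if t == 0 else 0
        # ue[0] is always enabled; all other stages are disabled during reset.
        ue = [1, 1 - reset] + [(1 - reset) & full[k] for k in (2, 3, 4)]
        rows.append((t, reset, tuple(ue), (full[2], full[3], full[4])))
        # full.k is loaded with the update enable of the stage above.
        full = {k: ue[k - 1] for k in (2, 3, 4)}
    return rows


for row in stall_engine_trace(6):
    print(row)
```

Whatever values the full flags hold at power-up, they are cleared at the end of the reset cycle, and the stages then fill one per cycle.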

4.3.1 Correctness

We want to argue that, under certain hypotheses, the pipelined machine
DLXπ simulates the prepared sequential machine DLXσ. We will have to
argue simultaneously about registers $R$ occurring in machine DLXπ and
their counterparts in machine DLXσ. Therefore, we introduce the notation
$R_\pi$ to denote a register in machine DLXπ; we denote by $R_\sigma$ the
corresponding register in machine DLXσ. The notation $R_i$ (the content of
register $R$ at the end of instruction $I_i$) will only be used for the
sequential machine DLXσ.

Table 4.11: Activation of the update enable signals ue[4:0] after reset

  T     reset  ue[0]  ue[1]  ue[2]  ue[3]  ue[4]  full[2]  full[3]  full[4]
  0     1      1      0      0      0      0      *        *        *
  1     0      1      1      0      0      0      0        0        0
  2     0      1      1      1      0      0      1        0        0
  3     0      1      1      1      1      0      1        1        0
  4     0      1      1      1      1      1      1        1        1
  ≥ 5   0      1      1      1      1      1      1        1        1

+-   8 
We generally assume that the reset signal is active long enough to permit
an instruction memory access.

.   


The registers visible to the programmer are the general purpose registers
GPR, the RAM cells in IM and DM, and the program counters. The remaining
registers are called invisible.
We assume that during reset, the simulated machine (here DLXσ) and
the simulating machine (here DLXπ) have the same contents of the memories
and the register file. In the sequential execution, reset is given in cycle
$T = -1$, whereas in the pipelined execution, reset is given in cycle $T = 0$.
By construction, both machines do not update general purpose registers
or memory cells during reset. Thus, in the two DLX designs any register
$R \in \{GPR, PC', DPC\}$ and any memory cell $M$ of DM and IM must satisfy

    $R_{-1} = R_\sigma^0 = R_\pi^1$
    $M_{-1} = M_\sigma^0 = M_\pi^1$.

Note that we make no such assumption for the remaining registers. This
will be crucial when we treat interrupts. The mechanism realizing the jump
to the interrupt service routine (JISR) will be almost identical to the present
reset mechanism.
The schedule for the execution of instructions $I_i$ by machine DLXπ is
defined by

    $I_\pi(k, T) = i \iff T = k + i$.

The strategy of the correctness proof is now easily described. We consider
cycle $T$ for machine DLXπ and the corresponding cycle $T'$ for machine
DLXσ, when the same instruction, say $I_i$, is in the same stage, say $k$.
Formally,

    $I_\pi(k, T) = i = I_\sigma(k, T')$.

We then want to conclude by induction on $T$ that stage $k$ of the simulating
machine has during cycle $T$ the same inputs as stage $k$ of the simulated
machine during cycle $T'$. Since the stages are identical, we want to
conclude for all signals $S$ inside the stages

    $S_\pi^T = S_\sigma^{T'}$.    (4.1)

This should hold in particular for the signals which are clocked into the
output registers $R \in out(k)$ of stage $k$ at the end of cycle $T$. This would
permit us to conclude for these registers

    $R_\pi^{T+1} = R_\sigma^{T'+1} = R_i$.    (4.2)

This almost works. Indeed it turns out that equations 4.1 and 4.2 hold
after every invisible register has been updated at least once. Until this has
happened, the invisible registers in the two machines can have different
values because they can be initialized with different values.
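The pairing of cycles $T$ and $T'$ used in this strategy can be made concrete with a small Python sketch. The pipelined schedule $T = k + i$ is the one defined above; for the sequential machine the sketch assumes that instruction $I_i$ occupies stage $k$ in cycle $5i + k$, as suggested by the proof of the dateline lemma. The function names are hypothetical.

```python
STAGES = 5  # IF, ID, EX, M, WB


def sched_pipelined(k, T):
    """I_pi(k, T): index of the instruction in stage k during cycle T, or None."""
    i = T - k
    return i if i >= 0 else None


def sched_sequential(k, T):
    """I_sigma(k, T), assuming instruction i is in stage k during cycle 5*i + k."""
    if T < 0:
        return None
    i, stage = divmod(T, STAGES)
    return i if stage == k else None


def corresponding_cycle(k, T):
    """Cycle T' of the sequential machine with I_sigma(k, T') = I_pi(k, T)."""
    i = sched_pipelined(k, T)
    return None if i is None else STAGES * i + k


# Instruction 2 is in stage 3 during pipelined cycle 5 and sequential cycle 13.
print(sched_pipelined(3, 5), corresponding_cycle(3, 5), sched_sequential(3, 13))
```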
Thus, we have to formulate a weaker version of equations 4.1 and 4.2.
We exploit the fact, that invisible registers are only used to hold intermedi-
ate results (that is why they can be hidden from the programmer). Indeed,
if the invisible register R is an input register of stage k, then the pipelined
machine uses this register in cycle T only if it was updated at the end of
the previous cycle. More formally, we have

Lemma 4.4. Let $I_\pi(k, T) = i$, and let $R$ be an invisible input register of
stage $k$ that was not updated at the end of cycle $T - 1$. Then:

1. The set of output registers $R'$ of stage $k$ which are updated at the end
   of cycle $T$ is independent of $R^T$.

2. Let $R'$ be an output register of stage $k$ that is updated at the end of
   cycle $T$, and let $S$ be an input signal for $R'$; then $S^T$ is independent
   of $R^T$.

This can be verified by inspection of the tables 4.4 to 4.6 and 4.8.

Therefore, it will suffice to prove equation 4.2 for all visible registers as
well as for all invisible registers which are clocked at the end of cycle $T$. It
will also suffice to prove equation 4.1 for the input signals $S$ of all registers
which are clocked at the end of cycle $T$.
Under the above assumptions and with a hypothesis about data dependencies
in the program executed, we are now able to prove that the machines
DLXπ and DLXσ produce the same sequence of memory accesses.
Thus, the CPUs simulate each other in the sense that they have the same
input/output behavior on the memory. The hypotheses about data dependencies
will be removed later, when we introduce forwarding logic and a
hardware interlock.

Theorem 4.5. Suppose that for all $i \ge 0$ and for all $r \ne 0$, the instructions
$I_{i-3}, \ldots, I_{i-1}$ do not write register $GPR[r]$, where $GPR[r]$ is a source
operand of instruction $I_i$. The following two claims then hold for all cycles
$T$ and $T'$, for all stages $k$, and for all instructions $I_i$ with

    $I_\pi(k, T) = i = I_\sigma(k, T')$:

1. For all signals $S$ in stage $k$ which are inputs to a register $R \in out(k)$
   that is updated at the end of cycle $T$:

       $S_\pi^T = S_\sigma^{T'}$.

2. For all registers $R \in out(k)$ which are visible or updated at the
   end of cycle $T$:

       $R_\pi^{T+1} = R_i$.
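The hypothesis of the theorem is a purely syntactic condition on the program, so it can be checked mechanically. The following Python sketch assumes a made-up encoding in which each instruction is a pair `(dest, sources)` of register numbers, with `dest = None` for instructions that write no register; this encoding and the function name are illustrative only and are not part of the DLX specification.

```python
def hypothesis_violations(program, window=3):
    """List all (i, j, r) such that I_i reads GPR[r], r != 0, and I_j with
    i - window <= j < i writes GPR[r].

    program: list of (dest, sources); dest is a register number or None,
    sources is a tuple of register numbers (hypothetical encoding).
    """
    violations = []
    for i, (_, sources) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest = program[j][0]
            # Register 0 never creates a dependency.
            if dest is not None and dest != 0 and dest in sources:
                violations.append((i, j, dest))
    return violations


program = [
    (1, (2, 3)),   # I0 writes r1
    (4, (1,)),     # I1 reads r1 one instruction later: violation
    (5, (6,)),
    (7, (6,)),
    (8, (1,)),     # I4 reads r1 four instructions later: allowed
]
print(hypothesis_violations(program))
```

A program for which the list is empty satisfies the hypothesis of the theorem; in practice a compiler or assembler would have to insert independent instructions or NOPs to reach that state.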
Proof by induction on the cycles $T$ of the pipelined execution. Let $T = 0$.
We have $I_\pi(0, 0) = 0 = I_\sigma(0, 0)$, i.e., instruction $I_0$ is in stage 0 of
machine DLXπ during cycle $T = 0$ and in stage 0 of machine DLXσ during
cycle $T' = 0$. The only input of stage 0 is the address for the instruction
memory. This address is the output of register DPC for machine DLXσ and
signal $dpc$ for machine DLXπ. By construction, both signals have in the
corresponding cycles $T = 0$ and $T' = 0$ the same value, namely

    $DPC_\sigma^0 = dpc_\pi^0 = 0$.

As stages 0 are identical for both machines, we have

    $S_\pi^0 = S_\sigma^0$

for all internal signals $S$ of stage 0, and claim 1 follows. In particular, in
both machines $IM[0]$ is clocked into the instruction register at the end of
cycle $T = T' = 0$. Hence, claim 2 follows because of

    $IR_\pi^1 = IR_\sigma^1 = IR_0$.
Table 4.12: Illustration of the scheduling function $I_\pi$ for the stages $k-1$ and $k$.

  stage s   I_π(s, T)   I_π(s, T−1)
  k−1                   i
  k         i           i−1

In the induction step we conclude from $T - 1$ to $T$. Thus, we have to
show claim 1 for signals $S$ in cycle $T$, and we have to show claim 2 for
registers $R$ in cycle $T + 1$. According to figure 4.14, which illustrates the
data flow between the stages of the DLXσ design, there are the following
four cases:

1. $k = 2$ (execute) or $k = 4$ (write back). This is the easy case. In figures
   4.3 and 4.14, all edges into stage $k$ come from output registers

       $R \in out(k - 1)$

   of stage $k - 1$. From the scheduling functions it can be concluded that

       $I_\pi(k - 1, T - 1) = I_\pi(k, T) = I_\sigma(k, T') = i$.

   This is illustrated in table 4.12. Let $R$ be an input register of stage
   $k$ which is visible or which was updated at the end of cycle $T - 1$.
   Using lemma 4.3 with $t = k - 1$ we conclude

       $R_\pi^T = R_i$              by induction hypothesis
       $\phantom{R_\pi^T} = R_\sigma^{T'}$     by lemma 4.3.

   Hence, except for invisible input registers $R$ which were not updated
   after cycle $T - 1$, stage $k$ of machine DLXπ has in cycle $T$ the same
   inputs as stage $k$ of machine DLXσ in cycle $T'$. Stage $k$ is identical in
   both machines (this is the point of the construction of the prepared
   machine!). By lemma 4.4, the set of output registers $R'$ of stage $k$
   which are updated after cycle $T$ or $T'$, respectively, is identical for
   both machines, and the input signals $S$ of such registers have the
   same value:

       $S_\pi^T = S_\sigma^{T'}$.

   It follows that at the end of these cycles $T$ and $T'$ identical values
   are clocked into $R'$:

       $R_\pi'^{\,T+1} = R_\sigma'^{\,T'+1} = R'_i$     by lemma 4.3.
Figure 4.14: Data flow between the pipeline stages of the DLXσ design

2. $k = 3$ (memory). The inputs of this stage comprise registers from
   $out(2)$ and the memory DM, which belongs to $out(3)$. For input
   registers $R \in out(2)$, one concludes as above that

       $R_\pi^T = R_i = R_\sigma^{T'}$.

   For $i > 0$, one concludes from the scheduling function (table 4.12)

       $I_\pi(3, T - 1) = i - 1$.

   We have $M \in out(3)$, i.e., every memory cell is an output register of
   stage 3. Using lemma 4.3 with $t = k = 3$ and the induction hypothesis,
   we can conclude

       $M_\pi^T = M_{i-1} = M_\sigma^{T'}$.

   For $i = 0$, the scheduling function implies

       $I_\pi(3, T) = i \Rightarrow T = 3$.

   In the DLXπ design, the data memory is only updated if

       $\overline{reset} \land full.3 = 1$.

   According to table 4.11, memory cell $M$ is not updated during cycles
   $t \in \{1, 2\}$, because $full.3_\pi^t = 0$. Since the DLXπ design is started
   with contents $M_\pi^1 = M_{-1}$, it follows

       $M_\pi^T = M_\pi^2 = M_\pi^1 = M_{-1}$.

   In the DLXσ design, the update of the data memory DM is enabled by
   the flag $full.3$. Thus, DM might be updated during reset, but then
   the update is disabled until $I_0$ reaches stage 3, since $full.3_\sigma^t = 0$ for
   $t \in \{0, 1, 2\}$. Therefore,

       $M_{-1} = M_\sigma^0 = M_\sigma^{T'}$.

   Now the argument is completed as in the first case.

Table 4.13: Illustration of the scheduling function $I_\pi$ for the stages 0 and 1.

  stage s   I_π(s, T)   I_π(s, T−1)
  0         i
  1         i−1         i−2


3. $k = 0$ (fetch). Here we have to justify that the delayed PC can be
   discarded. In the pipelined design, PC′ is the only input register of
   stage IF, whereas in the sequential design, the input register is DPC.
   Both registers are outputs of stage $s = 1$.
   For $i \ge 2$ one concludes from the scheduling functions (table 4.13)

       $I_\pi(1, T - 1) = I_\pi(0, T - 1) - 1 = i - 2$.

   The induction hypothesis implies (for $T > 1$)

       $PC'^{\,T}_\pi = PC'_{i-2}$.

   For $i = 1$ (and $T = 1$) we have by construction

       $PC'_{-1} = 4 = PC'^{\,1}_\pi$.

   Using lemma 4.3 with $t = 1$, we conclude for $T \ge 1$

       $PC'^{\,T}_\pi = PC'_{i-2} = DPC_{i-1}$     by construction
       $\phantom{PC'^{\,T}_\pi} = DPC_\sigma^{T'}$              by lemma 4.3.

   Now the argument is completed as in the first case.
Table 4.14: Illustration of the scheduling function $I_\pi$ for the stages 1 to 4.

  stage s   I_π(s, T)   I_π(s, T−1)
  1         i
  2         i−1
  3         i−2
  4         i−3         i−4

4. $k = 1$ (decode). In either design, the decode stage has the input
   registers $IR \in out(0)$ and $GPR \in out(4)$. One shows
   $IR_\pi^T = IR_\sigma^{T'}$ as above. If instruction $I_i$ does not read a register
   $GPR[r]$ with $r \ne 0$, we are done, because the outputs of stage 4 are
   not used. In the other case, only the value $GPR[r]^T$ can be used. The
   scheduling function implies (table 4.14)

       $I_\pi(4, T - 1) = i - 4$.

   For $i \ge 4$, we conclude using lemma 4.3 with $s = 4$ that

       $GPR[r]_\pi^T = GPR[r]_{i-4}$     by induction hypothesis
       $GPR[r]_{i-1} = GPR[r]_\sigma^{T'}$.

   According to the hypothesis of the theorem, instructions $I_{i-3}$ to
   $I_{i-1}$ do not write register $GPR[r]$. Hence

       $GPR[r]_{i-1} = GPR[r]_{i-4}$.

   For $i \le 3$, the update of the register file GPR is enabled by signal
   $ue_4$. The stall engine (table 4.11) therefore ensures that the register
   file is not updated during cycles $t \in \{1, 2, 3\}$. Thus,

       $GPR_{-1} = GPR_\pi^1 = \ldots = GPR_\pi^4$.

   The hypothesis of the theorem implies that instructions $I_j$ with
   $0 \le j < i$ do not write register $GPR[r]$. Hence,

       $GPR[r]_{-1} = \ldots = GPR[r]_{i-1}$.

   By lemma 4.3 with $s = 4$, we conclude

       $GPR[r]_\pi^4 = GPR[r]_{i-1} = GPR[r]_\sigma^{T'}$.

   The argument is completed as before.


4.3.2 Cost and Cycle Time

In the following, we determine the cost and the cycle time of the DLXπ
design. Except for the PC environment and the stall engine, the pipelined
design DLXπ and the prepared sequential design DLXσ are the same. Since
the environments of the DLXσ design are described in detail in section 4.2,
we can focus on the PC environment and the stall engine.

The PC environment PCenv (figure 4.12) is governed by the same control
signals as in the DLXσ design; the glue logic PCglue also remains the same.
The only modification in PCenv is that the register DPC of the delayed PC
is discarded. The instruction memory IM is now addressed by

    $dpc = PC'$  if $reset = 0$
    $dpc = 0$    if $reset = 1$.

Nevertheless, the PC environment still implements the delayed PC
mechanism, where $dpc$ takes the place of DPC.
Due to the modification, the PC environment becomes cheaper by one
32-bit register. The new cost is

    $C_{PCenv} = 2 \cdot C_{ff}(32) + 4 \cdot C_{mux}(32) + C_{add}(32) + C_{inc}(30) + C_{PCglue}$.

The cycle time $T_{PCenv}$ of the PC environment remains unchanged, but the
address $dpc$ of the instruction memory now has a longer delay. Assuming
that signal $reset$ has zero delay, the address is valid at

    $A_{PCenv}(dpc) = D_{mux}(32)$.

This delay adds to the cycle time of the stage IF and to the accumulated
delay of signal $ibusy$ of the instruction memory:

    $T_{IF} = A_{PCenv}(dpc) + D_{IMenv}(IR) + \Delta$
    $A_{IMenv}(ibusy) = A_{PCenv}(dpc) + d_{Istat}$.

The stall engine determines for each stage $i$ the update enable signal
$ue_i$ according to figure 4.13. The registers $full[4:2]$ are clocked by $CE$
when neither the instruction memory nor the data memory is busy, or
during reset:

    $CE = \overline{busy} \lor (reset \land \overline{ibusy}) = \overline{busy} \lor (\overline{reset} \text{ NOR } ibusy)$
    $\overline{busy} = ibusy \text{ NOR } dbusy$.


During reset, the update is delayed until the instruction fetch is completed.
Since signal $reset$ has zero delay, the clock $CE$ can be generated at an
accumulated delay of

    $A_{stall}(CE) = \max\{A_{IMenv}(ibusy), A_{DMenv}(dbusy)\} + D_{nor} + D_{or}$.

For each register $R \in out(i)$ and memory $M \in out(i)$, the stall engine
then combines the clock/write request signal and the update signal and
turns them into the clock/write signal:

    $Rce' = Rce \land ue_i$,    $Mw' = Mw \land ue_i$.

The update of the data memory DM is only enabled if stage 3 is full and
if there is no reset:

    $Dmw' = Dmw \land full.3 \land \overline{reset}$.

The Moore control automaton provides 7 clock/write request signals and
signal $Dmw$ (table 4.9). Together with two AND gates for the clocks of the
stages IF and ID, the stall engine has cost

    $C_{stall} = 3 \cdot C_{ff} + 4 \cdot C_{and} + C_{inv} + 2 \cdot C_{nor} + C_{or} + (7 + 2 + 2) \cdot C_{and}$.

As in the sequential design, the clocking of a register adds delay
$D_{ff} + \delta$, whereas the update of the register file adds delay
$D_{ram3}(32, 32) + \delta$. Thus, the stall engine requires a cycle time of

    $T_{stall} = A_{stall}(CE) + 3 \cdot D_{and} + \max\{D_{ram3}(32, 32), D_{ff}\} + \delta$.

The write signal $Dmw$ of the data memory now has a slightly larger
accumulated delay. However, an inspection of the data memory control DMC
(page 81) indicates that signal $Dmw$ is still not time critical, and that the
accumulated delay of DMC remains unchanged.

  
For the DLXπ design and the DLXσ design, the top level schematics of the
data paths DP are the same (figure 4.3), and so is the formula for the cost
$C_{DP}$.
The control unit CON comprises the stall engine, the two memory
controllers IMC and DMC, and the two control automata of section 4.2.3. The
cost $C_{CON}(moore)$ already includes the cost for buffering the Moore signals
up to the write back stage. The cost of the control and of the whole
DLXπ core therefore sum up to

    $C_{CON} = C_{IMC} + C_{DMC} + C_{stall} + C_{CON}(moore) + C_{CON}(mealy)$
    $C_{DLX\pi} = C_{DP} + C_{CON}$.
Table 4.15: Cost of the DLX data paths and all its environments for the
sequential DLX core (1) and for the pipelined design DLXπ (2). The last row
lists the cost of the DLXπ relative to that of the sequential design.

         EX     SH4L   GPR         IR    PC     DP      CON    DLX
  1      4083   380    4096        301   416    10846   1105   11951
  2      3315   380    4066 / 30   301   1906   12198   756    12954
  2/1    0.81   1                        4.58   1.12    0.68   1.08

Table 4.15 lists the cost of the DLX core and of its environments for
the sequential design (chapter 3) and for the pipelined design. The execute
environment of the sequential design consists of the environments ALUenv
and SHenv and of the 9 drivers connecting them to the operand and result
busses. In the DLXπ design, the busses are more specialized so that EXenv
only requires three drivers and two muxes and therefore becomes 20%
cheaper.
In order to resolve structural hazards, the DLXπ design requires an
extended PC environment with adder and conditional sum incrementer. That
accounts for the 358% cost increase of PCenv and for the 12% cost increase
of the whole data paths.
Under the assumption that the data and control hazards are resolved in
software, the control becomes significantly cheaper. Due to the precompu-
tation and buffering of the control signals, the automata generate 19 instead
of 29 signals. In addition, the execution scheme is optimized, cutting the
total frequency νsum of the control signals by half. The constant, for exam-
ple, is only extracted once in stage ID, and not in every state of the execute
stage.
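The relative costs quoted above follow directly from the two cost rows of table 4.15. The short Python check below recomputes them; it uses only numbers from the table, and the dictionary names are chosen for illustration.

```python
# Cost rows of table 4.15 (gate counts); only the columns for which the
# table also lists a ratio are used here.
sequential = {"EX": 4083, "PC": 416, "DP": 10846, "CON": 1105, "DLX": 11951}
pipelined = {"EX": 3315, "PC": 1906, "DP": 12198, "CON": 756, "DLX": 12954}

# Cost of the pipelined design relative to the sequential design.
ratios = {name: round(pipelined[name] / sequential[name], 2) for name in sequential}
print(ratios)

# The core cost is the sum of data paths and control in both designs.
for design in (sequential, pipelined):
    assert design["DP"] + design["CON"] == design["DLX"]
```

The ratio of 4.58 for the PC environment corresponds to the 358% increase mentioned in the text, and the ratio of 0.68 for the control to a saving of roughly 30%.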

Cycle Time

In order to determine the cycle time of the DLX design, we distinguish
three types of paths: those through the control, through the memory system,
and through the data paths.

Control Unit CON  The automata of the control unit generate Mealy
and Moore control signals. The Mealy signals only govern the stage ID;
they have an accumulated delay of 13 gate delays. The Moore signals are
precomputed and therefore have zero delay:

    $A_{CON}(csID) = A_{CON}(mealy) = 13$
    $A_{CON}(csEX) = A_{CON}(csM) = A_{CON}(csWB) = 0$.

The cycle time of the control unit is the maximum of the times required by
the stall engine and by the automata:

    $T_{CON} = \max\{T_{stall}, T_{auto}\}$.

Compared to the sequential design, the automata are smaller. The maximal
frequency of the control signals and the maximal fanin of the states are cut
by 25%, reducing time $T_{auto}$ by 24% (table 4.16). The cycle time of the
whole control unit, however, is slightly increased due to the stall engine.

Table 4.16: Cycle time of the DLX fixed-point core for the sequential (1) and
for the pipelined (2) design. In the pipelined design, $d_{mem}$ denotes the
maximum of the two access times $d_{Imem}$ and $d_{Dmem}$; $d_{mstat}$ denotes the
maximum of the two status times $d_{Istat}$ and $d_{Dstat}$.

        ID             EX        WB     IF, M         CON
        GPRr   PC      ALU/SH    GPRw   memory        auto   stall
  1     27     70      70        37     16 + d_mem    42     37 + d_mstat
  2     27     54      66        33     16 + d_mem    32     41 + d_mstat

Memory Environments  The cycle time $T_M$ models the read and write
time of the memory environments IMenv and DMenv. Pipelining has no
impact on the time $T_M$, which depends on the memory access times $d_{Imem}$
and $d_{Dmem}$:

    $T_M = \max\{T_{IMenv}, T_{DMenv}\}$.

Data Paths DP  The cycle time $T_{DP}$ is the maximal time of all cycles in
the data paths except those through the memories. This involves the stages
decode, execute and write back:

    $T_{DP} = \max\{T_{ID}, T_{EX}, T_{WB}\}$.

During decode, the DLX design updates the PC environment ($T_{PCenv}$),
reads the register operands ($T_{GPRr}$), extracts the constant, and determines
the destination address. Thus,

    $T_{ID} = \max\{T_{PCenv}, T_{GPRr}, A_{CON}(csID) + \max\{D_{IRenv}, D_{CAddr}\} + \Delta\}$.

Table 4.16 lists all these cycle times for the sequential and the pipelined
DLX design. The DLXπ design already determines the constant and the
destination address during decode. That saves 4 gate delays in the execute
and write back cycle and improves the total cycle time by 6%.
The cycle time of stage ID is dominated by the updating of the PC. In the
sequential design, the ALU environment is used for incrementing the PC
and for the branch target computation. Since environment PCenv now has
its own adder and incrementer, the updating of the PC becomes 20% faster.
Pipelining has the following impact on the cost and the cycle time of the
DLX fixed-point core, assuming that the remaining data and control hazards
can be resolved in software:

  • The data paths are about 12% more expensive, but the control becomes
    cheaper by roughly 30%. Since the control accounts for 5% of the
    total cost, pipelining increases the cost of the core by about 8%.

  • The cycle time is reduced by 6%.

In order to analyze the impact which pipelining has on the quality of
the DLX fixed-point core, we have to quantify the performance of the
two designs. For the sequential design, this was done in [MP95]. For
the pipelined design, the performance strongly depends on how well the
data and control hazards can be resolved. This is analyzed in section 4.6.

4.4 Result Forwarding

In this section, we describe a rather simple extension of the hardware of
machine DLXπ which permits us to considerably weaken the hypothesis of
theorem 4.5. For the new machine, we will indeed show theorem 4.5, but
with the following hypothesis: if instruction $I_i$ reads register $GPR[r]$, then
the instructions $I_{i-1}, I_{i-2}$ are not load operations with destination $GPR[r]$.

Theorem 4.7. Suppose that for all $i \ge 0$ and $r \ne 0$, the instructions
$I_{i-1}, I_{i-2}$ are not load operations with destination $GPR[r]$, where
$GPR[r]$ is a source operand of instruction $I_i$. The following two claims
then hold for all cycles $T$ and $T'$, for all stages $k$ and for all
instructions $I_i$ with

    $I_\sigma(k, T') = I_\pi(k, T) = i$:

1. For all signals $S$ in stage $k$ which are inputs to a register $R \in out(k)$
   that is updated at the end of cycle $T$:

       $S_\pi^T = S_\sigma^{T'}$.

2. For all registers $R \in out(k)$ which are visible or updated at the
   end of cycle $T$:

       $R_\pi^{T+1} = R_i$.
4.4.1 Valid Flags

We first introduce three new precomputed control signals $v[4:2]$ for the
prepared sequential machine DLXσ. The valid signal $v[j]$ indicates that the
data which will be written into the register file at stage 4 (write back) is
already available in the circuitry of stage $j$. For an instruction $I_i$, the valid
signals are defined by

    $v[4] = 1$;    $v[3] = v[2] = \overline{Dmr_i}$ = 0 if instruction $I_i$ is a load, 1 otherwise,

where $Dmr_i$ is the read signal of the data memory for $I_i$. Together with the
write signal $GPRw$ of the register file and some other precomputed control
signals, the signals $v[4:2]$ are pipelined in registers R.2, R.3 and R.4 as
indicated in figure 4.15. For any stage $k \in \{2, 3, 4\}$, the signals $GPRw.k$
and $v[k].k$ are available in stage $k$. At the end of stage $k$, the following
signals $C'.k$ are available as well:

  • $C'.2$, which is the input of register MAR,

  • $C'.3$, which is the input of register C, and

  • $C'.4$, which is the data to be written into the register file GPR.

Observe that the signals $C'.k$ are inputs of output registers of stage $k$.
Therefore, one can apply part 1 of the theorem to these signals in certain
cycles. This is crucial for the correctness proof of the forwarding logic.
Obviously, the following statements hold:

Lemma 4.8. For all $i$, for any stage $k \ge 2$, and for any cycle $T$ with
$I_\sigma(k, T) = i$, it holds:

1. $I_i$ writes the register $GPR[r]$ iff after the sequential execution of $I_i$,
   the address $r$, which is different from 0, is kept in the registers $Cad.k$
   and the write signals $GPRw.k$ are turned on, i.e.:

       $I_i$ writes $GPR[r]$ $\iff$ $\langle Cad.k_i \rangle = r \land r \ne 0 \land GPRw.k_i = 1$.

2. If $I_i$ writes a register $GPR[r]$, and if after its sequential execution
   the valid flag $v[k]$ is turned on, then the value of signal $C'.k$ during
   cycle $T$ equals the value written by $I_i$, i.e.:

       $I_i$ writes $GPR[r]$ $\land$ $v[k]_i = 1$ $\implies$ $C'.k^T = GPR[r]_i$.

   Moreover, $C'.k$ is clocked into an output register of stage $k$ at the end
   of cycle $T$.
Figure 4.15: Structure of the data paths DP and of the precomputed control
CON of the extended DLXπ machine

In the decode stage, the valid signals are derived from the memory read
signal $Dmr$, which is precomputed by the control automata. The generation
and buffering of the valid signals therefore requires the following cost and
cycle time:

    $C_{VALID} = (3 + 2 + 1) \cdot C_{ff} + C_{inv}$
    $T_{VALID} = T_{auto} + D_{inv}$.

This extension affects the cost and cycle time of the precomputed control
of the pipelined DLX design.

4.4.2 3-Stage Forwarding

We describe a circuit Forw capable of forwarding data from the three
stages $j \in \{2, 3, 4\}$ into stage 1. It has the following inputs:

1. $Cad.j$, $C'.j$, $GPRw.j$ as described above,

2. an address $ad$ to be matched with $Cad.j$, and

3. data $Din$ from a data output port of the register file,

and it has an output $Dout$ feeding data into stage 1. The data $Din$ are fed
into stage 1 whenever forwarding is impossible.

Figure 4.16: Block diagram of circuit Forw(3) (a) and the forwarding engine (b)

The data paths of the pipelined machine DLXπ will be augmented by
a forwarding engine consisting of two circuits Forw(3), as depicted in
figure 4.16. One of the circuits forwards data into register A, the other
forwards data into register B. In general, forwarding engines will take
care of all data transport from high stages to low stages, except for the
instruction memory address. Thus, in the top level data path schematics
(figure 4.17), there will be no more upward edges between the stages 1 to
4.
We proceed to specify circuit Forw(3), give a simple realization, and
then prove theorem 4.7.

- ,
For the stages $j \in \{2, 3, 4\}$, we specify the following signals:

    $hit[j] = full.j \land GPRw.j \land (ad \ne 0) \land (ad = Cad.j)$.

Signal $hit[j]$ is supposed to indicate that the register accessed by the
instruction in stage 1 is modified by the instruction in stage $j$. Except for
the first four clock cycles $T = 0, \ldots, 3$, all pipeline stages are full
(table 4.13), i.e., they process regular instructions. However, during the
initial cycles, an empty stage is prevented from signaling a hit by its full
flag. Signal

    $top[j] = hit[j] \land \bigwedge_{x=2}^{j-1} \overline{hit[x]}$

indicates moreover that there occurs no hit in stages above stage $j$. The
data output $Dout$ is then chosen as

    $Dout = C'.j$   if $top[j] = 1$ for some $j$
    $Dout = Din$    otherwise.
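The specification of the forwarding circuit translates almost literally into executable form. The Python sketch below is a behavioral model only, not a gate-level realization; the function name `forw3` and the tuple layout `(full, gprw, cad, c)` per stage are assumptions made for this example.

```python
def forw3(ad, din, stages):
    """Behavioral model of a 3-stage forwarding circuit.

    ad:     register address read by the instruction in stage 1
    din:    data from the register file output port
    stages: dict j -> (full, gprw, cad, c) for j in 2, 3, 4
            (hypothetical layout: full flag, write signal, destination
            address Cad.j, and forwarded data C'.j)
    Returns (dout, hit, top).
    """
    # hit[j]: a full stage j writes the register that stage 1 reads.
    hit = {
        j: bool(full and gprw and ad != 0 and ad == cad)
        for j, (full, gprw, cad, _) in stages.items()
    }
    # top[j]: hit in stage j and no hit in any stage above it.
    top = {j: hit[j] and not any(hit[x] for x in range(2, j)) for j in (2, 3, 4)}
    for j in (2, 3, 4):
        if top[j]:
            return stages[j][3], hit, top
    return din, hit, top


stages = {2: (1, 1, 5, 222), 3: (1, 1, 5, 333), 4: (1, 1, 7, 444)}
print(forw3(5, 999, stages)[0])  # youngest writer of register 5 is in stage 2
print(forw3(7, 999, stages)[0])  # register 7 is forwarded from stage 4
print(forw3(9, 999, stages)[0])  # no hit: the register file output is used
```

Because at most one $top[j]$ can be active, the selection order in the loop does not matter; in hardware the same priority is obtained from the order of the multiplexers.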
Figure 4.17: Top level schematics of the DLXπ data paths with forwarding. For
clarity's sake, the address and control inputs of the stall engine are dropped.

8:   - ,


An example realization is shown in figure 4.18. The circuitry to the left
generates the three hit signals $hit[4:2]$, whereas the actual data selection is
performed by the three multiplexers. The signals $top[j]$ are implicit in the
order of the multiplexers. The signals $top[j]$, which will be needed by the
stall engine, can be generated by two inverters and three AND gates. The
cost of this realization of circuit Forw then runs at

    $C_{Forw}(3) = 3 \cdot (C_{equal}(5) + 3 \cdot C_{and} + C_{mux}(32)) + C_{ortree}(5) + 2 \cdot C_{inv} + 3 \cdot C_{and}$.

This forwarding engine provides the output $Dout$ and the signals $top[j]$ at
the following delays:

    $D_{Forw}(Dout; 3) = D_{equal}(5) + 2 \cdot D_{and} + 3 \cdot D_{mux}$
    $D_{Forw}(top; 3) = D_{equal}(5) + 4 \cdot D_{and} + 3 \cdot D_{mux}$;

the delay is largely due to the address check. The actual data $Din$ and $C'.j$
are delayed by no more than

    $D_{Forw}(Data; 3) = 3 \cdot D_{mux}$.

Figure 4.18: A realization of the 3-stage forwarding circuit Forw(3)

Let $A(C', Din)$ denote the accumulated delay of the data inputs $C'.i$ and
$Din$. Since the addresses are directly taken from registers, the forwarding
engine can provide the operands $Ain$ and $Bin$ at the accumulated delay
$A(Ain, Bin)$; this delay only impacts the cycle time of stage ID.

    $A(C', Din) = \max\{A_{EXenv}, A_{SH4Lenv}, D_{GPR,read}\}$
    $A(Ain, Bin) = \max\{A(C', Din) + D_{Forw}(Data; 3), D_{Forw}(Dout; 3)\}$.

The construction obviously generalizes to $s$-stage forwarding with $s > 3$,
but then the delay is proportional to $s$. Based on parallel prefix circuits one
can construct forwarding circuits Forw($s$) with delay $O(\log s)$ (see exercise
4.3).

4.4.3 Correctness

We now proceed to prove theorem 4.7. We start with a simple observation
about valid bits in a situation where instruction $I_i$ reads register $GPR[r]$
and one of the three preceding instructions $I_{i-\alpha}$ (with $\alpha \in \{1, 2, 3\}$)
writes to register $GPR[r]$. In the pipelined machine, the read occurs in a
cycle $T$ when instruction $I_i$ is in stage 1, i.e., when

    $I_\pi(1, T) = i$.

During this cycle, instruction $I_{i-\alpha}$ is in stage $1 + \alpha$:

    $I_\pi(1 + \alpha, T) = i - \alpha$.

We consider the time $T'$ when instruction $I_{i-\alpha}$ is in stage $1 + \alpha$ of the
prepared sequential machine:

    $I_\sigma(1 + \alpha, T') = i - \alpha$.

In cycle $T'$, the prepared sequential machine has not yet updated register
$GPR[r]$. The following lemma states where we can find a precomputed
version of $GPR[r]_{i-\alpha}$ in the sequential machine.

Lemma 4.9. Suppose the hypothesis of theorem 4.7 holds, $I_i$ reads $GPR[r]$,
instruction $I_{i-\alpha}$ writes $GPR[r]$, and $I_\sigma(1 + \alpha, T') = i - \alpha$; then

    $C'.(1 + \alpha)_\sigma^{T'} = GPR[r]_{i-\alpha}$.

If $I_{i-\alpha}$ is a load instruction, then by the hypothesis of the theorem we
have $\alpha = 3$. In this case, the valid bits are generated such that

    $v[4]_{i-\alpha} = v[1 + \alpha]_{i-\alpha} = 1$.

In any other case, the valid signals for any $j \ge 2$ equal

    $v[j]_{i-\alpha} = 1$.

The claim now follows directly from lemma 4.8.


Proof of Theorem 4.7  The proof proceeds along the same lines as the
proof of theorem 4.5 by induction on $T$, where $T$ denotes a cycle in the
pipelined execution with $I_\pi(k, T) = i$. Since only the inputs of stage 1 were
changed, the proof for the case $T = 0$ and the induction step for $k \ne 1$
stay literally the same. Moreover, in the induction step, when we conclude
from $T - 1$ to $T$ for $k = 1$, we can already assume the theorem for $T$ and
$k \ne 1$. We only need to show the claim for those input signals of stage 1
which depend on the results of later stages, i.e., the signals $Ain$ and $Bin$.
For all the other signals and output registers of stage 1, the claim can then
be concluded as in the proof of theorem 4.5.
A read from $GPR[r]$ can be into register A or into register B. In the
induction step, we only treat the case where instruction $I_i$ reads $GPR[r]$
into register A. Reading into register B is treated in the same way with the
obvious adjustments of notation.
There are two cases. In the interesting case, the hypothesis of theorem
4.5 does not hold for instruction $I_i$, i.e., there is an $\alpha \in \{1, 2, 3\}$ such
that instruction $I_{i-\alpha}$ writes $GPR[r]$. By the hypothesis of the theorem,
this instruction is not a load instruction. For the valid bits this implies

    $v[j]_{i-\alpha} = 1$
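for all stages $j$. Application of the induction hypothesis to the instruction
register gives $IR_i = IR_\pi^T$. Since $I_i$ reads $GPR[r]$, it follows for signal $Aad$:

    $r = \langle Aad_i \rangle = \langle Aad_\pi^T \rangle$.

Since $I_{i-\alpha}$ writes register $GPR[r]$, it follows by lemma 4.8 for any stage
$j \ge 2$ that

    $GPRw.j_{i-\alpha} \land (\langle Cad.j_{i-\alpha} \rangle = r) \land (r \ne 0)$.

For stage $j = 1 + \alpha$, the pipelining schedule implies (table 4.14, page 138)

    $I_\pi(j, T) = I_\pi(1 + \alpha, T) = i - \alpha$.

Note that none of the stages 0 to $1 + \alpha$ is empty. By the induction
hypothesis it therefore follows that

    $hit[1 + \alpha]_\pi^T = full.(1 + \alpha)_\pi^T \land GPRw.(1 + \alpha)_\pi^T \land (r \ne 0) \land (\langle Cad.(1 + \alpha)_\pi^T \rangle = r)$
    $\phantom{hit[1 + \alpha]_\pi^T} = 1 \land GPRw.(1 + \alpha)_{i-\alpha} \land (r \ne 0) \land (\langle Cad.(1 + \alpha)_{i-\alpha} \rangle = r)$
    $\phantom{hit[1 + \alpha]_\pi^T} = 1$.

Let $I_{i-\alpha}$ be the last instruction before $I_i$ which writes $GPR[r]$. Then no
instruction between $I_i$ and $I_{i-\alpha}$ writes $GPR[r]$, and we have

    $hit[l]_\pi^T = 0$

for any stage $l$ with $1 < l < 1 + \alpha$, and hence

    $top[1 + \alpha]_\pi^T = 1$.

Let $T'$ denote the cycle in the sequential execution with

    $I_\sigma(1 + \alpha, T') = I_\pi(1 + \alpha, T) = i - \alpha$.

The forwarding logic delivers the output

    $Dout_\pi^T = C'.(1 + \alpha)_\pi^T$
    $\phantom{Dout_\pi^T} = C'.(1 + \alpha)_\sigma^{T'}$     by lemma 4.8 and by the theorem for $T$ and $k = 1 + \alpha$
    $\phantom{Dout_\pi^T} = GPR[r]_{i-\alpha}$        by lemma 4.9
    $\phantom{Dout_\pi^T} = GPR[r]_{i-1}$.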

In the simple second case, the stronger hypothesis of theorem 4.5 holds
for $I_i$. For any $i \ge 4$, this means that none of the instructions
$I_{i-1}, I_{i-2}, I_{i-3}$ writes $GPR[r]$. As above, one concludes that

    $hit[j]_\pi^T = 0$

for all $j$. Hence, the forwarding logic behaves like the old connection
between the data output $GPRoutA$ of the GPR environment and the input
$Ain$ of the decode stage, delivering

    $Dout_\pi^T = Din_\pi^T = GPR[r]_{i-4} = GPR[r]_{i-1}$.

For $i \le 3$, the DLXπ pipeline is getting filled. During these initial cycles
($T \le 3$), either stage $k > 1$ is empty or instruction $I_j$ with
$I_\pi(k, T) = j \le 2$ does not update register $GPR[r]$. As above, one concludes
that for any $j$

    $hit[j]_\pi^T = 0$

and that

    $Dout_\pi^T = Din_\pi^T = GPR[r]_{-1}$.

4.5 Hardware Interlock

4.5.1 Stall Engine

In this section, we construct a nontrivial stall engine called hardware
interlock. This engine stalls the upper two stages of the pipeline in a situation
called a data hazard, i.e., when the forwarding engine cannot deliver valid
data on time. Recall that this occurs if

1. an instruction $I_i$ which reads from a register $r \ne 0$ is in stage 1,

2. one of the instructions $I_j$ with $j \in \{i - 1, i - 2\}$ is a load with
   destination $r$,

3. and $I_j$ is the last instruction before $I_i$ with destination $r$.

This must be checked for both operands A and B. In the existing machine,
we could characterize this situation by the activation of the signal $dhaz$:

    $dhaz = dhazA \lor dhazB$
    $dhazA = (topA[2] \land \overline{v[2].2}) \lor (topA[3] \land \overline{v[3].3})$
    $dhazB = (topB[2] \land \overline{v[2].2}) \lor (topB[3] \land \overline{v[3].3})$.

Based on this signal, we define the two clocks, the clock CE1 of the stages 0 and 1, and the clock CE2 of the stages 2 to 4:
$$\begin{aligned}
CE2 &= (\overline{ibusy} \wedge \overline{dbusy}) \vee (reset \wedge \overline{ibusy}) \\
CE1 &= (\overline{ibusy} \wedge \overline{dbusy} \wedge \overline{dhaz}) \vee (reset \wedge \overline{ibusy})
\end{aligned}$$
Thus, CE2 corresponds to the old clock signal CE, whereas CE1 is also inactive in the presence of a data hazard.
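The hazard signal and the two clocks are plain Boolean functions, so they can be tried out in a few lines of Python. This is only a sketch: it assumes that the negations lost in the formulas above fall on the valid flags and on the busy and hazard signals, and the function names and the dictionary encoding of the top and valid bits are hypothetical.

```python
def dhaz(top_a, top_b, valid):
    """Data hazard signal for operands A and B.

    top_a, top_b: dicts mapping stage (2, 3) to the top bit of that operand.
    valid: dict mapping stage (2, 3) to the valid flag of that stage.
    """
    dhaz_a = any(top_a[j] and not valid[j] for j in (2, 3))
    dhaz_b = any(top_b[j] and not valid[j] for j in (2, 3))
    return dhaz_a or dhaz_b


def clocks(ibusy, dbusy, reset, hazard):
    """Return (CE1, CE2) under the assumed reading of the clock equations."""
    ce2 = (not ibusy and not dbusy) or (reset and not ibusy)
    ce1 = (not ibusy and not dbusy and not hazard) or (reset and not ibusy)
    return ce1, ce2


if __name__ == "__main__":
    # A hazard stalls stages 0 and 1 (CE1 = 0) while stages 2 to 4 keep running.
    print(clocks(ibusy=False, dbusy=False, reset=False, hazard=True))
```

In this model a data hazard only ever removes clock edges from CE1; CE2 is unaffected by it.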
Whenever the lower stages of the pipeline are clocked while the upper
stages are stalled, a dummy instruction (i.e., an instruction which should
not be there) enters stage 2 and trickles down the pipe in subsequent cycles.
We have to ensure that dummy instructions cannot update the machine.
One method is to force a NOP instruction into stage 2 whenever $CE2 \wedge \overline{CE1} = 1$. This method unfortunately depends on the particular instruction set and its encoding. When stalling a different pipeline, the corresponding part of the hardware has to be modified. A much more uniform method is the following:

1. Track true instructions and dummy instructions in stage $k$ by a single bit $full.k$, where $full.k = 1$ signals a true instruction and $full.k = 0$ signals a dummy instruction.

2. In CE2 cycles with $full.k = 0$, do not update stage $k$ and advance the dummy instruction to stage $k + 1$ if $k + 1 \le 4$.

The following equations define a stall engine which uses this mechanism. It is clocked by CE2. A hardware realization is shown in figure 4.19. For $k \ge 2$,
$$\begin{aligned}
ue.0 &= CE1 \\
full.1 &= 1 \\
ue.1 &= CE1 \wedge \overline{reset} \\
ue.k &= CE2 \wedge \overline{reset} \wedge full.k \\
full.k &:= ue.(k-1)
\end{aligned}$$

This is an almost trivial set of equations. However, enabling the hit signals $hit[j]$ by the corresponding full flags is a subtle and crucial part of the mechanism. It ensures that dummy instructions can activate neither a hit signal $hit[j]$ nor the data hazard signal (exercise 4.4).
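To see how a dummy instruction trickles down the pipe, the stall engine equations can be stepped cycle by cycle. The following Python sketch assumes that reset enters the update enable signals negated, that the full flags are registers clocked by CE2, and that CE2 is active in every cycle; the function names and the trace encoding are hypothetical.

```python
def stall_engine_step(full, ce1, ce2, reset):
    """One cycle of the stall engine.

    full: dict {2, 3, 4 -> bool} holding the full flags (full.1 is constantly 1).
    Returns the update enable signals ue.0..ue.4 and the next full flags.
    """
    ue = {0: ce1, 1: ce1 and not reset}
    for k in (2, 3, 4):
        ue[k] = ce2 and not reset and full[k]
    # The full flags are clocked by CE2: full.k := ue.(k-1)
    nxt = {k: ue[k - 1] for k in (2, 3, 4)} if ce2 else dict(full)
    return ue, nxt


def run(ce1_trace):
    """Feed a sequence of CE1 values; cycle 0 is the reset cycle, CE2 is always 1."""
    full = {2: False, 3: False, 4: False}
    history = []
    for t, ce1 in enumerate(ce1_trace):
        ue, full = stall_engine_step(full, ce1, True, reset=(t == 0))
        history.append(tuple(ue[k] for k in range(5)))
    return history


if __name__ == "__main__":
    # CE1 is switched off in cycle 3; the resulting bubble reaches stage 2 in
    # cycle 4 and stage 3 in cycle 5.
    for t, ue in enumerate(run([True, True, True, False, True, True])):
        print(t, ue)
```

The stalled cycle leaves a gap in ue.2 one cycle later and in ue.3 two cycles later, which is exactly the dummy instruction that must not update the machine.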
In order to prevent dummy instructions from generating a dbusy signal and from updating the data memory, the read and write signals Dmr and Dmw of the data memory DM are also enabled by the full flag:
$$\begin{aligned}
Dmr &= Dmr' \wedge full.3 \\
Dmw &= Dmw' \wedge full.3 \wedge \overline{reset}
\end{aligned}$$
where $Dmr'$ and $Dmw'$ are the read and write request signals provided by the precomputed control.
Figure 4.19: Hardware interlock engine of the DLXπ design (inputs CE1, CE2 and reset; flags full.1 to full.4; outputs ue.0 to ue.4)

Cost and Delay


The modifications presented above only affect the stall engine. The stall engine of figure 4.19 determines the update enable signals $ue.i$ based on the clocks CE1 and CE2. These clock signals can be expressed as
$$\begin{aligned}
CE2 &= \overline{busy} \vee (\overline{reset} \text{ NOR } ibusy) \\
CE1 &= (\overline{busy} \wedge \overline{dhaz}) \vee (\overline{reset} \text{ NOR } ibusy) \\
\overline{busy} &= ibusy \text{ NOR } dbusy
\end{aligned}$$
The clocks now also depend on the data hazard signal dhaz, which can be provided at the following cost and delay:
$$\begin{aligned}
C_{dhaz} &= 2 \cdot C_{inv} + 4 \cdot C_{and} + 3 \cdot C_{or} \\
A_{stall}(dhaz) &= A_{Forw}(top; 3) + D_{and} + 2 \cdot D_{or}
\end{aligned}$$
Since signal reset has zero delay, the clocks can be generated at an accumulated delay of
$$\begin{aligned}
A_{stall}(busy) &= \max\{A_{IMenv}(ibusy), A_{DMenv}(dbusy)\} + D_{nor} \\
A_{stall}(CE) &= \max\{A_{stall}(busy), A_{stall}(dhaz)\} + D_{and} + D_{or}
\end{aligned}$$
For each register and memory, the stall engine turns the clock/write request signal into a clock/write signal. Due to the signals Dmr and Dmw, that now requires 11 AND gates. Altogether, the cost of the stall and interlock engine then runs at
$$C_{stall} = 3 \cdot C_{ff} + C_{inv} + (5 + 11 + 1) \cdot C_{and} + 2 \cdot C_{nor} + 2 \cdot C_{or} + C_{dhaz}.$$
Since the structure of the stall engine remains the same, its cycle time can be expressed as before:
$$T_{stall} = A_{stall}(CE) + 3 \cdot D_{and} + \max\{D_{ram3}(32, 32), D_{ff}\} + \delta.$$

4.5.2 Scheduling Function

With the forwarding engine and the hardware interlock, it should be pos-
sible to prove a counterpart of theorem 4.7 with no hypothesis whatsoever
about the sequence of instructions.
Before stating the theorem, we formalize the new scheduling function $I_\pi(k, T)$. The cycles $T$ under consideration will be CE2 cycles. Intuitively, the definition says that a new instruction is inserted in every CE1 cycle into the pipe, and that subsequently it trickles down the pipe together with its $full.k$ signals. We assume that cycle 0 is the last cycle in which the reset signal is active.
The execution still starts in cycle 0 with $I_\pi(0, 0) = 0$. The instructions are always fetched in program order, i.e.,
$$I_\pi(0, T) = i \;\Rightarrow\; I_\pi(0, T+1) = \begin{cases} i & \text{if } ue.0^T = 0 \\ i + 1 & \text{if } ue.0^T = 1 \end{cases} \qquad (4.3)$$
Any instruction makes a progress of at most one stage per cycle, i.e., if $I_\pi(k, T) = i$, then
$$i = \begin{cases} I_\pi(k, T+1) & \text{if } ue.k^T = 0 \\ I_\pi(k+1, T+1) & \text{if } ue.k^T = 1 \text{ and } k + 1 \le 4 \end{cases} \qquad (4.4)$$

We assume that the reset signal is active long enough to permit an access with address 0 to the instruction memory. With this assumption, activation of the reset signal has the following effects:
$$\begin{aligned}
CE2 &= 1 \\
ue.0 &= CE1 \\
ue.1 &= ue.2 = ue.3 = ue.4 = 0
\end{aligned}$$
After at most one cycle, the full flags are initialized to
$$full.1 = 1, \quad full.2 = full.3 = full.4 = 0,$$
read accesses to the data memory are disabled ($DMr = 0$), and thus,
$$busy = dhaz = 0.$$
When the first access to IM is completed, the instruction register holds
$$IR = IM[0].$$
This is the situation in cycle $T = 0$. From the next cycle on, the reset signal is turned off, and a new instruction is then fed into stage 0 in every CE1 cycle. Moreover, we have
$$ue.0^T = ue.1^T = CE1^T \quad \text{for all } T \ge 1,$$
i.e., after cycle $T = 0$, stages 0 and 1 are always clocked simultaneously, namely in every CE1 cycle. A simple induction on $T$ gives for any $i \ge 1$
$$I_\pi(0, T) = i \;\Rightarrow\; I_\pi(1, T) = i - 1. \qquad (4.5)$$
This means that the instructions wander in lockstep through the stages 0 and 1. For $T \ge 1$ and $1 \le k \le 3$, it holds that
$$ue.k^T = 1 \;\Rightarrow\; ue.(k+1)^{T+1} = 1.$$
Once an instruction is clocked into stage 2, it passes one stage in each CE2 clock cycle. Thus, an instruction cannot be stalled after being clocked into stage 2, i.e., it holds for $k \in \{2, 3\}$
$$I_\pi(k, T) = i \;\Rightarrow\; I_\pi(k+1, T+1) = i. \qquad (4.6)$$

Lemma 4.10 The stall engine ensures the following two features:

1. An instruction $I_i$ can never overtake the preceding instruction $I_{i-1}$.

2. For any stage $k \ge 1$ and any cycle $T \ge 1$, the value $I_\pi(k, T)$ of the scheduling function is defined iff the flag $full.k$ is active during cycle $T$, i.e., $full.k^T = 1$.

1) Since the instructions are always fetched in order (equation 4.3), instruction $I_i$ enters stage 0 after instruction $I_{i-1}$. Due to the lockstep behavior of the first two stages (equation 4.5), there exists a cycle $T$ with
$$I_\pi(0, T) = i \;\wedge\; I_\pi(1, T) = i - 1.$$
Let $T' \ge T$ be the next cycle with an active CE1 clock. The stages 0 and 1 are both clocked at the end of cycle $T'$; by equation 4.4 it then follows that both instructions move to the next stage:
$$I_\pi(1, T'+1) = i \;\wedge\; I_\pi(2, T'+1) = i - 1.$$
Instruction $I_{i-1}$ now proceeds at full speed (equation 4.6), i.e., it holds for $a \in \{1, 2\}$ that
$$I_\pi(2 + a, T' + 1 + a) = i - 1.$$
Instruction $I_i$ can pass at most one stage per cycle (equation 4.4), and up to cycle $T' + 1 + a$ it therefore did not move beyond stage $1 + a$. Thus, $I_i$ cannot overtake $I_{i-1}$. This proves the first statement.

2) The second statement can be proven by a simple induction on $T$; we leave the details as an exercise (see exercise 4.5).

Deadlock Free Execution


Finally, we have to argue that the stall mechanism cannot produce deadlocks. Let both clocks be active during cycle $T - 1$, i.e.,
$$CE1^{T-1} = CE2^{T-1} = 1,$$
let the instructions $I$, $I'$ and $I''$ be in the stages 1 to 3 during cycle $T$. $I'$ and $I''$ are possibly dummy instructions. Furthermore, let $CE1^T = 0$. Thus, the hazard flag must be raised ($dhaz^T = 1$), and one of the instructions $I'$ and $I''$ must be a load which updates a source register of $I$.

1. Assuming that instruction $I'$ in stage 2 is such a load, then
$$v[2]^T = v[3]^{T+1} = 0 \quad \text{and} \quad v[4]^{T+2} = 1.$$
Instruction $I$ produces a data hazard during cycles $T$ and $T + 1$. During these cycles, only dummy instructions which cannot activate the dhaz signal enter the lower stages, and therefore
$$dhaz^{T+2} = 0 \quad \text{and} \quad CE1^{T+2} = 1.$$

2. Assuming that instruction $I''$ in stage 3 is the last load which updates a source register of $I$, then
$$v[2]^T = v[3]^{T+1} = 1, \quad v[3]^T = 0 \quad \text{and} \quad v[4]^{T+1} = 1.$$
Instruction $I$ produces a data hazard during cycle $T$, and a dummy instruction enters stage 2 at the end of the cycle. In the following CE2 cycle, there exists no data hazard, and the whole pipeline is clocked:
$$dhaz^{T+1} = 0 \quad \text{and} \quad CE1^{T+1} = 1.$$

Thus, the clock CE1 is disabled ($CE1 = 0$) during at most two consecutive CE2 cycles, and all instructions therefore reach all stages of the pipeline. Note that none of the above arguments hinges on the fact that the pipelined machine simulates the prepared sequential machine.
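The two-cycle bound can also be observed on a small behavioral model. The Python sketch below is not the gate-level design: it assumes that every memory access completes in one cycle, that results of non-load instructions can be forwarded from stage 2 on, and that load results are only valid in stage 4. The program encoding as (destination, sources, is_load) triples and all names are hypothetical.

```python
def simulate(program):
    """Run a list of (dest, sources, is_load) triples through stages 1..4.

    Returns one Boolean per cycle telling whether the interlock stalled stage 1.
    """
    stages = {1: None, 2: None, 3: None, 4: None}  # None models a dummy instruction
    next_fetch, retired, stalls = 0, 0, []
    while retired < len(program):
        hazard = False
        i = stages[1]
        if i is not None:
            for r in program[i][1]:
                if r == 0:
                    continue  # register 0 never causes a hazard
                for l in (2, 3, 4):
                    j = stages[l]
                    if j is not None and program[j][0] == r:
                        # Nearest writer found; a load is only valid in stage 4.
                        hazard = hazard or (program[j][2] and l < 4)
                        break
        if stages[4] is not None:
            retired += 1
        stages[4], stages[3] = stages[3], stages[2]
        if hazard:
            stages[2] = None  # a dummy instruction enters stage 2
        else:
            stages[2] = stages[1]
            if next_fetch < len(program):
                stages[1] = next_fetch
                next_fetch += 1
            else:
                stages[1] = None
        stalls.append(hazard)
    return stalls


def longest_stall(stalls):
    """Length of the longest run of consecutive stall cycles."""
    best = run = 0
    for s in stalls:
        run = run + 1 if s else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    load_use = [(1, (2,), True), (3, (1, 1), False)]  # load r1; use r1 immediately
    print(sum(simulate(load_use)), longest_stall(simulate(load_use)))
```

In this model a load followed directly by a dependent instruction costs two stall cycles, and one independent instruction in between reduces that to one.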
4.5.3 Simulation Theorem
We can now show the simulation theorem for arbitrary sequences of instructions:

For all $i$, $k$, $T$, $T'$ such that $I_\pi(k, T) = I_\sigma(k, T') = i$ and $ue.k_\pi^T = 1$, the following two claims hold:

1. for all signals $S$ in stage $k$ which are inputs to a register $R \in out(k)$ that is updated at the end of cycle $T$:
$$S_\pi^T = S_\sigma^{T'}$$

2. for all registers $R \in out(k)$ which are visible or updated at the end of cycle $T$:
$$R_\pi^{T+1} = R_i.$$

We have argued above that IM[0] is clocked into register IR at the end of CE2 cycle 0, and that the PC is initialized properly. Thus, the theorem holds for $T = 0$. For the induction step, we distinguish four cases:

1. $k = 0$. Stage 0 only gets inputs from the stages 0 and 1. Without reset, these two stages are clocked simultaneously. Thus, the inputs of stage 0 only change on an active CE1 clock. Arguing about CE1 cycles instead of CE2 cycles, one can repeat the argument from theorem 4.7.

2. $k \in \{2, 4\}$. In the data paths, there exist only downward edges into stage $k$, and the instructions pass the stages 2 to 4 at full speed. The reasoning therefore remains unchanged.

3. $k = 3$. From $I_\pi(3, T) = i$ one cannot conclude $I_\pi(3, T-1) = i - 1$ anymore. Instead, one can conclude
$$I_\pi(3, t) = i - 1$$
for the last cycle $t < T$ such that $I_\pi(3, t)$ is defined, i.e., such that a non-dummy instruction was in stage 3 during cycle $t$. Since dummy instructions do not update the data memory cell $M$, it then follows that
$$\begin{aligned}
M_\pi^{t+1} &= M_{i-1} && \text{by induction hypothesis} \\
&= M_\pi^{T}.
\end{aligned}$$
4. $k = 1$. For $I_\pi(1, T) = i$ and $ue.1^T = 1$ we necessarily have $dhaz^T = 0$. If $I_i$ has no register operand GPR[r], then only downward edges are used, and the claim follows as before. Thus, let $I_i$ read a register GPR[r] with $r \ne 0$. The read can be for operand A or B. We only treat the reading of operand A; the reading of B is treated in the same way with the obvious adjustments of notation.
If the instructions $I_0, \ldots, I_{i-1}$ do not update register GPR[r], it follows for any $k > 1$ that
$$GPRw.k_\pi^T = 0 \;\vee\; Cad.k_\pi^T \ne r$$
or that stage $k$ processes a dummy instruction, i.e., $full.k^T = 0$. Thus, hit signal $hit[k]^T$ is inactive, and the reasoning of theorem 4.7 can be repeated.
If register GPR[r] is updated by an instruction preceding $I_i$, we define $last(i, r)$ as the index of the last instruction before $I_i$ which updates register GPR[r], i.e.,
$$last(i, r) = \max\{\, j < i \mid I_j \text{ updates register GPR}[r] \,\}.$$
Instruction $I = I_{last(i,r)}$ is either still being processed, or it has already left the pipeline.
If instruction $I$ is still in process, then there exists a stage $l \ge 2$ with
$$I_\pi(l, T) = last(i, r).$$
From lemma 4.10 and the definition of $last(i, r)$, it follows that
$$hitA[l]_\pi^T = 1$$
and that any stage between stage 1 and $l$ is either empty or processes an instruction with a destination address different from $r$. By the construction of circuit Forw, it then follows that
$$topA[l]_\pi^T = 1.$$
Since $dhaz_\pi^T = 0$, the hazard signal of operand A is also inactive, $dhazA_\pi^T = 0$. By the definition of this signal and by the simulation theorem for $l \ge 2$ it follows that
$$v[l]_\pi^T = 1 = v[l]_{last(i,r)}.$$
The decode stage $k = 1$ then reads the proper operand A of instruction $I_i$,
$$\begin{aligned}
Ain_\pi^T &= C.l_\pi^T && \text{design of the forwarding engine} \\
&= GPR[r]_{last(i,r)} && \text{theorem for stages 2 to 4} \\
&= GPR[r]_{i-1} && \text{definition of } last(i, r)
\end{aligned}$$
and the claim follows for stage $k = 1$.
If instruction $I$ already ran to completion, then there exists no stage $l \ge 2$ with
$$I_\pi(l, T) = last(i, r).$$
With reasoning similar to the one of the previous case it then follows that
$$Ain_\pi^T = GPR[r]_\pi^T = GPR[r]_{last(i,r)} = GPR[r]_{i-1},$$
and thus, $I_i$ gets the proper operand A.

Table 4.17: Cost of the sequential DLX core and of the pipelined DLX designs

Design                  DP      CON    DLX
sequential              10846   1105   11951
basic pipeline          12198    756   12954
pipeline + forwarding   12998    805   13803
pipeline + interlock    13010    830   13840

4.6 Cost Performance Analysis

In previous sections we have described several variants of a pipelined DLX core and have derived formulae for their cost and cycle time. In the following, we will evaluate the pipelined and sequential DLX designs based on their cost, cycle time, and performance-cost ratio. The SPEC integer benchmark suite SPECint92 [Hil95, Sta] serves as workload.

4.6.1 Hardware Cost and Cycle Time

Table 4.17 lists the cost of the different DLX designs. Compared to the
sequential design of chapter 3, the basic pipeline increases the total gate
count by 8%, and result forwarding adds another 7%. The hardware inter-
lock engine, however, has virtually no impact on the cost. Thus, the DLXπ
design with hardware interlock just requires 16% more hardware than the
sequential design.
Note that pipelining only increases the cost of the data paths; the control
becomes even less expensive. This even holds for the pipelined design
with forwarding and interlocking, despite the more complex stall engine.

Table 4.18: Cycle time of the DLX core for the sequential and the pipelined designs. The cycle time of CON is the maximum of the two listed times.

Design              A/B   PC     EX   IF, M       CON
sequential          27    70     70   18 + dmem   40, 39 + dmstat
basic pipe          27    54     66   16 + dmem   32, 41 + dmstat
pipe + forwarding   72    93 a   66   16 + dmem   34, 41 + dmstat
pipe + interlock    72    93 a   66   16 + dmem   57, 43 + dmstat

a This time can be reduced to 89 by using a fast zero tester for AEQZ.

According to table 4.18, the result forwarding slows down the PC envi-
ronment and the register operand fetch dramatically, increasing the cycle
time of the DLX core by 40%. The other cycle times stay virtually the
same. The hardware interlocks make the stall engine more complicated and increase the cycle time of the control, but the time-critical paths remain the same.
The significant slowdown caused by result forwarding is not surprising. In the design with a basic pipeline, the computation of the ALU and the update of the PC are time critical. With forwarding, the result of the ALU is forwarded to stage ID and is clocked into the operand registers A1 and B1. That accounts for the slow operand fetch. The forwarded result is also tested for zero, and the signal AEQZ is then fed into the glue logic PCglue of the PC environment. PCglue provides the signal bjtaken which
governs the selection of the new program counter. Thus, the time critical
path is slowed down by the forwarding engine (6d), by the zero tester (9d),
by circuit PCglue (6d), and by the selection of the PC (6d).
With the fast zero tester of exercise 4.6, the cycle time can be reduced
by 4 gate delays at no additional cost. The cycle time (89d) is still 35%
higher than the one of the basic pipeline. However, without forwarding
and interlocking, all the data hazards must be resolved at compile time by
rearranging the code or by insertion of NOP instructions. The following
sections therefore analyze the impact of pipelining and forwarding on the instruction throughput and on the performance-cost ratio.

4.6.2 Performance Model

The performance is modeled by the reciprocal of the benchmark's execution time. For a given architecture $A$, this execution time is the product of the design's cycle time $\tau_A$ and its cycle count $CC_A$:
$$T_A = \tau_A \cdot CC_A.$$

Cycle Count of a Sequential Design
In a sequential design, the cycle count is usually expressed as the product of the total instruction count $IC$ and the average number of cycles $CPI$ which are required per instruction:
$$CC = IC \cdot CPI. \qquad (4.7)$$
The CPI ratio depends on the workload and on the hardware design. The execution scheme of the instruction set $Is$ defines how many cycles $CPI_I$ an instruction $I$ requires on average. On the other hand, the workload together with the compiler defines an instruction count $IC_I$ for each machine instruction, and so the CPI value can be expressed by
$$CPI = \sum_{I \in Is} \frac{IC_I}{IC} \cdot CPI_I = \sum_{I \in Is} \nu_I \cdot CPI_I, \qquad (4.8)$$
where $\nu_I$ denotes the relative frequency of instruction $I$ in the given workload.

Cycle Count of a Pipelined Design


Pipelining does not speed up the execution time of a single instruction,
but it rather improves the instruction throughput, due to the interleaved
execution. Thus, it is difficult to directly apply the formulae (4.7) and (4.8)
to a pipelined design.
In case of perfect pipelining, it takes $k - 1$ cycles to fill a $k$-stage pipeline. After that, an instruction is finished per cycle. In this case, the cycle count equals
$$CC = k - 1 + IC \approx IC.$$
For very long workloads, the cycle count virtually equals the instruction count. However, perfect pipelining is unrealistic; the pipeline must be stalled occasionally in order to resolve hazards. Note that the stalling is either due to hardware interlocks or due to NOPs inserted by the compiler.
Let $\nu_h$ denote the relative frequency of a hazard $h$ in the given workload, and let $CPH_h$ denote the average number of stall cycles caused by this hazard. The cycle count of the pipelined design can then be expressed as
$$CC = IC + \sum_{\text{hazard } h} IC \cdot \nu_h \cdot CPH_h = IC \cdot \Big(1 + \sum_{h} \nu_h \cdot CPH_h\Big).$$
In analogy to formula (4.8), the following term is treated as the CPI ratio of the pipelined design:
$$CPI = 1 + \sum_{\text{hazard } h} \nu_h \cdot CPH_h. \qquad (4.9)$$
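The cycle count model for pipelined designs is easy to evaluate numerically. The following Python sketch implements the perfect-pipelining cycle count and formula (4.9); the function names are hypothetical, and the hazard list in the example is purely illustrative.

```python
def cycle_count_perfect(instruction_count, stages):
    """Cycle count of a perfectly pipelined k-stage machine: k - 1 fill cycles plus IC."""
    return stages - 1 + instruction_count


def cpi_with_hazards(hazards):
    """CPI ratio of a pipelined design according to formula (4.9).

    hazards: iterable of (nu_h, cph_h) pairs, i.e. the relative frequency of a
    hazard and the average number of stall cycles it causes.
    """
    return 1.0 + sum(nu * cph for nu, cph in hazards)


if __name__ == "__main__":
    # Illustrative numbers only: two kinds of hazards, one stall cycle each.
    print(cycle_count_perfect(1000, 5))
    print(cpi_with_hazards([(0.10, 1), (0.16, 1)]))
```

Without any hazards the sum is empty and the CPI ratio is exactly 1, which is the perfect-pipelining limit for long workloads.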

4.6.3 Delay Slots of Branch/Jump Instructions

It is the task of an optimizing compiler to make good use of the branch/jump delay slots. In the most trivial case, the compiler just fills
the delay slots with NOP instructions, but the compiler can do a much
better job (table 4.19, [HP96]). It tries to fill the delay slots with useful in-
structions. There are basically three code blocks to choose the instructions
from, namely:
1. The code block which immediately precedes the branch/jump. The
delay slot can be filled with a non-branch instruction from this block,
if the branch does not depend on the re-scheduled instruction, and if
the data dependences to other instructions permit the re-scheduling.
This always improves the performance over using a NOP.
2. The code from the branch/jump target. The re-scheduled instruc-
tion must not overwrite data which is still needed in the case that
the branch is not taken. This optimization only improves the perfor-
mance, if the branch is taken; the work of the delay slot is wasted
otherwise.
3. The code from the fall through of a conditional branch. In analogy
to the second case, the re-scheduled instruction must not overwrite
data needed if the branch is taken. This optimization only improves
the performance if the branch is not taken.
Strategy 1) is the first choice. The other two strategies are only used when
the first one is not applicable. How well the delay slot can be filled also
depends on the type of the branch/jump instruction:

• An unconditional, PC relative branch/jump is always taken and has a fixed target address. Thus, if the first strategy does not work, the target instruction can be used to fill the delay slot.

• An unconditional, absolute jump is always taken, but the target address may change. This type of jump usually occurs on procedure call or on return from procedure. In this case, there are plenty of independent instructions which can be scheduled in the delay slot, e.g., the instructions for passing a parameter/result.

• A conditional branch. If the branch results from an if-then-else construct, it is very difficult to predict the branch behavior at compile time. Thus, if the first strategy does not work, the delay slot can hardly be filled with a useful instruction. For loops the branch prediction is much easier because the body of a loop is usually executed several times.

Thus, the delay slot of an unconditional branch/jump can always be filled; only conditional branches cause problems. For these branches, the compiler can only fill about 40% of the delay slots (table 4.19).

Table 4.19: Percentage of conditional branches in the SPECint92 benchmarks and how well their delay slot (DS) can be filled. AV denotes the arithmetic mean over the five benchmarks.

             compress   eqntott   espresso   gcc    li     AV
% branch     17.4       24.0      15.2       11.6   14.8   16.6
empty DS     49%        74%       48%        49%    75%    59%

Table 4.20: Instruction mix (in %) of the SPECint92 programs, normalized to 100%.

instructions       compress   eqntott   espresso   gcc    li     AV
load               19.9       30.7      21.1       23.0   31.6   25.3
store               5.6        0.6       5.1       14.4   16.9    8.5
compute            55.4       42.8      57.2       47.1   28.3   46.2
call (jal, jalr)    0.1        0.5       0.4        1.1    3.1    1.0
jump                1.6        1.4       1.0        2.8    5.3    2.4
branch, taken      12.7       17.0       9.1        7.0    7.0   10.6
branch, untaken     4.7        7.0       6.1        4.6    7.8    6.0

4.6.4 CPI Ratio of the DLX Designs

For our analysis, we assume an average SPECint92 workload. Table 4.20 lists the frequencies of the DLX machine instructions on such a workload. The table is taken from [HP96], but we have normalized the numbers to 100%.

Table 4.21: Number of CPU cycles and memory accesses per DLX instruction.

instructions       CPU cycles   memory accesses   CPI_I
load, store        3            2                 5 + 2 · WS
compute            3            1                 4 + WS
call (jal, jalr)   4            1                 5 + WS
jump               2            1                 3 + WS
branch, taken      3            1                 4 + WS
branch, untaken    2            1                 3 + WS

Sequential Design
For the sequential DLX design, table 4.21 specifies the number of CPU cycles and the number of memory accesses required by any machine instruction $I$. This table is derived from the finite state diagram of figure 3.20 (page 90). Let a memory access require $WS$ wait states, on average. The $CPI_I$ value of an instruction $I$ then equals the number of its CPU cycles plus $(WS + 1)$ times the number of memory accesses. When combined with the instruction frequencies from table 4.20, that yields the following CPI ratio for the sequential DLX design:
$$CPI_{DLXs} = 4.26 + 1.34 \cdot WS.$$
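The sequential CPI ratio can be recomputed from the two tables. The Python sketch below takes the average instruction mix (column AV of table 4.20) and the per-instruction cycle counts of table 4.21, and evaluates formula (4.8); the dictionary layout and the function name are hypothetical.

```python
# Relative instruction frequencies (column AV of table 4.20).
MIX = {
    "load": 0.253, "store": 0.085, "compute": 0.462, "call": 0.010,
    "jump": 0.024, "branch_taken": 0.106, "branch_untaken": 0.060,
}

# (CPU cycles, memory accesses) per instruction class (table 4.21).
COST = {
    "load": (3, 2), "store": (3, 2), "compute": (3, 1), "call": (4, 1),
    "jump": (2, 1), "branch_taken": (3, 1), "branch_untaken": (2, 1),
}


def cpi_sequential(ws):
    """CPI of the sequential design: sum of nu_I * (CPU cycles + (WS + 1) * accesses)."""
    return sum(MIX[i] * (cpu + (ws + 1) * mem) for i, (cpu, mem) in COST.items())


if __name__ == "__main__":
    base = cpi_sequential(0.0)
    slope = cpi_sequential(1.0) - base
    print(round(base, 2), round(slope, 2))
```

The constant term and the wait state coefficient come out at about 4.26 and 1.34, in agreement with the formula above.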

Pipelined Design with Interlock
Even with result forwarding, the pipelined DLX design can be slowed
down by three types of hazards, namely by empty branch delay slots, by
hardware interlocks due to loads, and by slow memory accesses.

Branch Delay Slots The compiler tries to fill the delay slot of a branch with useful instructions, but about 59% of the delay slots cannot be filled (table 4.19). In comparison to perfect pipelining, such an empty delay slot stalls the pipeline for $CPH_{NopB} = 1$ cycle. This hazard has the following frequency:
$$\nu_{NopB} = \nu_{branch} \cdot 0.59 = 0.166 \cdot 0.59 \approx 0.1.$$

Since these control hazards are resolved in software, every empty delay
slot also causes an additional instruction fetch.

Hardware Interlock Since the result of a load can only be forwarded from stage WB, the forwarding engine cannot always deliver the operands on time. On such a data hazard, the hardware interlock engine inserts up to two dummy instructions. The compiler reduces these data hazards by scheduling independent instructions after a load wherever that is possible.
According to [HP96], both interlocks can be avoided for 63% of the loads, and for another 11% at least one of the interlocks can be avoided. Thus, two interlocks occur only for 26% of all loads. Each interlock increases the cycle count by $CPH_{NopL} = 1$ cycle. On the workload under consideration, this hazard has a frequency of
$$\nu_{NopL} = \nu_{load} \cdot (2 \cdot 0.26 + 0.11) = 0.253 \cdot 0.63 \approx 0.16.$$

Slow Memory Accesses In a hierarchical memory system, most of the accesses can be completed in a single cycle, but there are also slow accesses which require some wait states. Let every memory access require an average of $CPH_{slowM} = WS$ wait states. The frequency of a slow memory access then equals the number of loads, stores and instruction fetches:
$$\nu_{slowM} = \nu_{load} + \nu_{store} + \nu_{fetch}.$$
Since the branch hazards are resolved in software by inserting a NOP, they cause $\nu_{NopB}$ additional instruction fetches. Load hazards are resolved by a hardware interlock and cause no additional fetches. Thus, the frequency of instruction fetches equals
$$\nu_{fetch} = 1 + \nu_{NopB} = 1.1.$$
Summing up the stall cycles of all the hazards yields the following CPI ratio for the pipelined DLX design with forwarding:
$$\begin{aligned}
CPI_{DLX\pi} &= 1 + \nu_{NopB} \cdot 1 + \nu_{NopL} \cdot 1 + \nu_{slowM} \cdot CPH_{slowM} \\
&= 1.26 + 1.44 \cdot WS.
\end{aligned}$$

Pipelined Design without Forwarding


The design DLXπb with the basic pipeline resolves the hazards in software; if necessary, the compiler must insert NOPs. This design faces the same problems as the DLXπ design, but in addition, it must manage without result forwarding. Whenever the DLXπ pipeline would forward a result, the compiler must re-arrange the code or insert a NOP. According to simulations [Del97], these forwarding hazards stall the basic pipeline for $CPH_{forw} = 1$ cycle each, and they have a frequency of
$$\nu_{forw} = 0.39.$$
The simulation assumed that the additional hazards are resolved by inserting a NOP. Thus, every branch, load or forwarding hazard causes an additional instruction fetch. The frequency of fetches then runs at
$$\nu_{fetch} = 1 + \nu_{NopB} + \nu_{NopL} + \nu_{forw} = 1 + 0.1 + 0.16 + 0.39 = 1.65$$
and slow memory accesses have a frequency of
$$\nu_{slowM} = \nu_{load} + \nu_{store} + \nu_{fetch} = 0.253 + 0.085 + 1.65 \approx 2.0.$$
Thus, the CPI ratio of the pipelined DLX design without forwarding is:
$$\begin{aligned}
CPI_{DLX\pi b} &= 1 + (\nu_{NopB} + \nu_{NopL} + \nu_{forw}) \cdot 1 + \nu_{slowM} \cdot CPH_{slowM} \\
&= 1.65 + 2.0 \cdot WS.
\end{aligned}$$

Table 4.22: Hardware cost, cycle time and CPI ratio of the DLX designs (sequential, basic pipeline, pipeline with interlock)

Design   Gate count      Cycle time     CPI ratio
         abs.    rel.    abs.   rel.    formula             WS = 0.3   WS = 1
DLXs     11951   1.0     70     1.0     4.26 + 1.34 · WS    4.66       5.60
DLXπb    12949   1.08    66     0.94    1.65 + 2.0 · WS     2.25       3.65
DLXπ     13833   1.16    89     1.27    1.26 + 1.44 · WS    1.70       2.70

4.6.5 Design Evaluation

Performance Study
According to table 4.22, pipelining and result forwarding improve the CPI
ratio, but forwarding also increases the cycle time significantly. The CPI
ratio of the three designs grows with the number of memory wait states.
Thus, the speedup caused by pipelining and forwarding also depends on
the speed of the memory system (figure 4.20).
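The CPI formulas of the two pipelined designs in table 4.22 can be re-derived from the hazard frequencies quoted in the text. The Python sketch below does this without intermediate rounding, so its coefficients differ from the rounded table entries in the second decimal place; the constant and function names are hypothetical.

```python
NU_LOAD, NU_STORE, NU_BRANCH = 0.253, 0.085, 0.166  # instruction mix, table 4.20 (AV)
EMPTY_DELAY_SLOTS = 0.59                            # share of unfilled delay slots
NU_FORW = 0.39                                      # forwarding hazards [Del97]


def hazard_frequencies():
    """Frequencies of empty delay slots and of load interlocks."""
    nu_nopb = NU_BRANCH * EMPTY_DELAY_SLOTS
    nu_nopl = NU_LOAD * (2 * 0.26 + 0.11)
    return nu_nopb, nu_nopl


def cpi_pipelined(ws, forwarding=True):
    """CPI of the pipelined design with (DLXpi) or without (DLXpi_b) forwarding."""
    nu_nopb, nu_nopl = hazard_frequencies()
    if forwarding:
        stalls = nu_nopb + nu_nopl
        nu_fetch = 1 + nu_nopb          # only NOPs in delay slots are fetched
    else:
        stalls = nu_nopb + nu_nopl + NU_FORW
        nu_fetch = 1 + stalls           # every hazard is resolved by a fetched NOP
    nu_slow = NU_LOAD + NU_STORE + nu_fetch
    return 1 + stalls + nu_slow * ws


if __name__ == "__main__":
    for ws in (0.0, 0.3, 1.0):
        print(ws, round(cpi_pipelined(ws), 2), round(cpi_pipelined(ws, False), 2))
```

Each additional wait state costs the basic pipeline more than the pipeline with interlock, because every NOP inserted by the compiler is an extra instruction fetch.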
Result forwarding and interlocking have only a minor impact (3%) on the performance of the pipelined DLX design, due to the slower cycle time. However, both concepts relieve the compiler significantly because the hardware takes care of the data hazards itself.
The speedup due to pipelining increases dramatically with the speed of the memory system. In combination with an ideal memory system ($WS = 0$), pipelining yields a speedup of 2.7, whereas for $WS \ge 5.5$, the sequential DLX design becomes even faster than the pipelined designs. Thus, pipelining calls for a low-latency memory system.

Figure 4.20: Speedup of pipelining and forwarding as a function of the memory latency, plotted over the average number of wait states WS for the ratios DLXs/DLXpb, DLXs/DLXp and DLXpb/DLXp (DLXs: sequential, DLXpb: basic pipeline, DLXp: pipeline with interlock)

Powerful, cache-based memory systems, like that of the DEC Alpha 21064 [HP96], require about $WS = 0.25$ wait states per memory access, and even with a small 2KB on-chip cache, a memory speed of $WS = 0.5$ is still feasible (chapter 6). In the following, we therefore assume $WS = 0.3$. Under this assumption, pipelining speeds the DLX design up by a factor of 2.2.
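The speedup figures follow directly from the performance model: the execution time per instruction is the cycle time multiplied by the CPI ratio. The Python sketch below uses the cycle times and CPI formulas of table 4.22; the names DESIGNS, speedup and break_even_ws are hypothetical.

```python
# (cycle time in gate delays, CPI constant, CPI coefficient of WS), from table 4.22.
DESIGNS = {
    "DLXs": (70, 4.26, 1.34),    # sequential
    "DLXpb": (66, 1.65, 2.0),    # basic pipeline
    "DLXp": (89, 1.26, 1.44),    # pipeline with forwarding and interlock
}


def time_per_instruction(design, ws):
    """Cycle time multiplied by the CPI ratio at WS wait states."""
    tau, base, slope = DESIGNS[design]
    return tau * (base + slope * ws)


def speedup(old, new, ws):
    """How many times faster design `new` is than design `old`."""
    return time_per_instruction(old, ws) / time_per_instruction(new, ws)


def break_even_ws(old, new):
    """Number of wait states at which both designs are equally fast."""
    t0, b0, s0 = DESIGNS[old]
    t1, b1, s1 = DESIGNS[new]
    return (t0 * b0 - t1 * b1) / (t1 * s1 - t0 * s0)


if __name__ == "__main__":
    print(round(speedup("DLXs", "DLXp", 0.0), 1))
    print(round(speedup("DLXs", "DLXp", 0.3), 1))
    print(round(break_even_ws("DLXs", "DLXp"), 1))
```

With these inputs the sequential design catches up with the pipelined designs only at roughly five wait states per access, far beyond the memory speeds considered realistic here.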

Impact on the Quality of the DLX


Quality Metric The quality is the weighted geometric mean of the performance $P$ and the reciprocal of the cost $C$:
$$Q = \frac{P^{1-q}}{C^{q}}. \qquad (4.10)$$
The weighting parameter $q \in [0, 1]$ determines whether cost or performance has a greater impact on the quality. Therefore, we denote $q$ as quality parameter. Commonly used values are:

• $q = 0$: Only performance counts, $Q = P$.

• $q = 0.5$: The resulting quality metric $Q = (P/C)^{0.5}$ models the cost-performance ratio.

• $q = 1/3$: The resulting quality metric is $Q = (P^2/C)^{1/3}$. This means that a design A which is twice as fast as design B has the same quality as B if it is four times as expensive.

Figure 4.21: Quality ratio of the pipelined designs relative to the sequential design for WS = 0.3, plotted over the quality parameter q (DLXs: sequential, DLXpb: basic pipeline, DLXp: pipeline with interlock)

For a realistic quality metric, the quality parameter should be in the range $[0.2, 0.5]$: Usually, more emphasis is put on the performance than on the cost, thus $q \le 0.5$. For $q = 0.2$, doubling the performance already allows for a cost ratio of 16; a higher cost ratio would rarely be accepted.
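The quality metric and the trade-offs quoted for the individual values of q can be checked numerically. The Python sketch below implements formula (4.10) and, as an additional helper that is not part of the text, solves for the value of q at which a faster but more expensive design has the same quality as the reference design; both function names are hypothetical.

```python
import math


def quality(performance, cost, q):
    """Quality metric Q = P^(1-q) / C^q of formula (4.10)."""
    return performance ** (1.0 - q) / cost ** q


def break_even_q(perf_ratio, cost_ratio):
    """q at which a design that is perf_ratio times faster and cost_ratio times
    more expensive has the same quality as the reference design.

    Follows from (1 - q) * ln(perf_ratio) = q * ln(cost_ratio).
    """
    return math.log(perf_ratio) / (math.log(perf_ratio) + math.log(cost_ratio))


if __name__ == "__main__":
    print(quality(2.0, 4.0, 1 / 3))    # twice as fast, four times as expensive
    print(quality(2.0, 16.0, 0.2))     # twice as fast, sixteen times as expensive
    # Speedup 2.2 and relative cost 1.16 (table 4.22, WS = 0.3).
    print(break_even_q(2.2, 1.16))
```

For the speedup of 2.2 and the relative cost of 1.16 from table 4.22, the break-even point lies between q = 0.8 and q = 0.9.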

Evaluation Pipelining and result forwarding improve the performance of the DLX architecture significantly, but they also increase the cost of the fixed-point core. Figure 4.21 quantifies this tradeoff between cost and performance.
In combination with a fast memory system ($WS = 0.3$), pipelining and result forwarding improve the quality of the DLX fixed-point core, at least under the realistic quality metric. In case that the cost is more emphasized than the performance, pipelining becomes unprofitable for $q > 0.8$.

4.7 Selected References and Further Reading

The design presented here is partly based on designs from [PH94, HP96, Knu96]. The concept of delayed PC and pipelining as a transformation is from [KMP99a]. The formal verification of pipeline control without delayed branch is reported in [BS90, SGGH91, BD94, BM96, LO96, HQR98].

4.8 Exercises

Exercise 4.1 In chapter 2, we have introduced a conditional sum adder and a carry look-ahead adder, and extended them to an arithmetical unit AU. In addition to the n-bit sum/difference, the n-bit AU provides two flags indicating an overflow and a negative result. Let $D_{AU}(n)$ denote the maximal delay of the n-bit AU, and let $D_{AU}(s[1:0]; n)$ denote the delay of the two least significant sum bits.
Show that for both AU designs and for any $n \ge 2$ the delay of these two sum bits can be estimated as
$$D_{AU}(s[1:0]; n) \le D_{AU}(2).$$

Exercise 4.2 Prove the dateline lemma 4.3 by induction on $T$.

Exercise 4.3 Fast s-stage Forwarding Engine. In section 4.4.2, we have presented a forwarding engine capable of forwarding data from 3 stages. The construction obviously generalizes to s-stage forwarding, with $s > 3$. The actual data selection (figure 4.18) is then performed by $s$ cascaded multiplexers. Thus, the delay of this realization of an s-stage forwarding engine is proportional to $s$.
However, these $s$ multiplexers can also be arranged as a balanced binary tree of depth $\lceil \log s \rceil$. Signal $top[j]$ (as defined in section 4.4.2) indicates that stage $j$ provides the current data of the requested operand. These signals $top[j]$ can be used in order to govern the multiplexer tree.

1. Construct a circuit TOP which generates the signals $top[j]$ using a parallel prefix circuit.

2. Construct an s-stage forwarding engine based on the multiplexer tree and circuit TOP. Show that this realization has a delay of $O(\log s)$.

3. How can the delay of the forwarding engine be improved even further?

Exercise 4.4 In case of a data hazard, the interlock engine of section 4.5 stalls the stages IF and ID. The forwarding circuit Forw signals a hit of stage $j \in \{2, 3, 4\}$ by
$$hit[j] = full.j \wedge GPRw.j \wedge (ad \ne 0) \wedge (ad = Cad.j).$$
These hit signals are used in order to generate the data hazard signal dhaz. The check whether stage $j$ is full (i.e., $full.j = 1$) is essential for the correctness of the interlock mechanism.
Show that, when simplifying the hit signals to
$$hit[j] = GPRw.j \wedge (ad \ne 0) \wedge (ad = Cad.j),$$
dummy instructions could also activate the hazard flag, and that the interlock engine could run into a deadlock.

Exercise 4.5 Prove for the interlock engine of section 4.5 and the corresponding scheduling function the claim 2 of lemma 4.10: for any stage $k$ and any cycle $T > 0$, the value $I_\pi(k, T)$ is defined iff $full.k^T = 1$.

Exercise 4.6 Fast Zero Tester. The n-zero tester, introduced in section 2.3, uses an OR-tree as its core. In the technology of table 2.1, NAND/NOR gates are faster than OR gates. Based on the equality
$$a \vee b \vee c \vee d = \overline{\overline{(a \vee b)} \wedge \overline{(c \vee d)}} = (a \text{ NOR } b) \text{ NAND } (c \text{ NOR } d),$$
the delay of the zero tester can therefore roughly be halved.
Construct such a fast zero tester and provide formulae for its cost and delay.
Chapter 5

Interrupt Handling

5.1 Attempting a Rigorous Treatment of Interrupts

Interrupts are events which change the flow of control of a program
by means other than a branch instruction. They are triggered by the
activation of event signals, which we denote by $ev[j]$, $j = 0, 1, \ldots$. Here,
we will consider the interrupts shown in table 5.1.
Loosely speaking, the activation of an event signal $ev[j]$ should result
in a procedure call of a routine $H(j)$. This routine is called the exception
handler for interrupt j and should take care of the problem signaled by
the activation of $ev[j]$. The exception handler for a page fault, for instance,
should move the missing page from secondary memory into primary
memory.

Table 5.1: Interrupts handled by our DLX design

index j   name                        symbol
0         reset                       reset
1         illegal instruction         ill
2         misaligned memory access    mal
3         page fault on fetch         pff
4         page fault on load/store    pfls
5         trap                        trap
6         arithmetic overflow         ovf
6 + i     external I/O                exi

Interrupts can be classified in various ways:

• They can be internal, i.e., generated by the CPU or the memory
  system, or external.

• They can be maskable, i.e., they can be ignored under software
  control, or non maskable.

• After an interrupt of instruction I, the program execution can be
  resumed in three ways:

  – repeat instruction I,
  – continue with the instruction $I'$ which would follow I in the
    uninterrupted execution of the program,
  – abort the program.

Table 5.2 classifies the interrupts considered here.


Finally, the interrupts have priorities defined by the indices j. Activation
of $ev[j]$ can only interrupt handler $H(j')$ if $j < j'$. Moreover, if $ev[j]$ and
$ev[j']$ become active simultaneously and $j < j'$, then handler $H(j')$ should
not be called. Thus, small indices correspond to high priorities.¹
If we want to design an interrupt mechanism and prove that it works, we
would like to do the usual three things:

1. define what an interrupt mechanism is supposed to do,

2. design the mechanism, and

3. show that it meets the specification.

The first step turns out to be not so easy. Recall that interrupts are a
kind of procedure calls, and that procedure call is a high level language
concept. On the other hand, our highest abstraction level so far is the as-
sembler/machine language level. This is the right level for stating what the
hardware is supposed to do. In particular, it permits defining the meaning
of the jump-and-link instructions (jal, jalr), which support procedure calls.
However, the meaning of the call and return of an entire procedure cannot be
defined like the meaning of an assembler instruction.
There are various ways to define the semantics of procedure call and
return in high level languages [LMW86, Win93]. The most elementary way,
called operational semantics, defines the meaning of a procedure by
prescribing how a certain abstract machine should interpret calls and
returns. One uses a stack of procedure frames. A call pushes a new frame
with parameters and return address on the stack and then jumps to the body
of the procedure. A return pops a frame from the stack and jumps to the
return address.

¹ Priority 1 is urgent, priority 31 is not.

Table 5.2: Classifications of the interrupts

index j   symbol   external   maskable   resume
0         reset    yes        no         abort
1         ill      no         no         abort
2         mal      no         no         abort
3         pff      no         no         repeat
4         pfls     no         no         repeat
5         trap     no         no         continue
6         ovf      no         yes        continue/abort
6 + i     exi      yes        yes        continue

The obvious choice of the ‘abstract machine’ is the abstract DLX ma-
chine with delayed branch/delayed PC semantics defined by the DLXσ in-
struction set and its semantics. The machine has, however, to be enriched.
There must be a place where interrupt masks are stored, and there must be
a mechanism capable of changing the PC as a reaction to event signals. We
will also add mechanisms for collecting return addresses and parameters,
that are visible at the assembler language level.
We will use a single interrupt service routine ISR which will branch
under software control to the various exception handlers H j. We denote
by SISR the start address of the interrupt service routine.
We are finally able to map out the rest of the chapter. In section 5.2, we
will define at the abstraction level of the assembler language

1. an extension of the DLX machine language,

2. a mechanism collecting return addresses and parameters, and

3. a mechanism capable of forcing the pair of addresses (SISR, SISR + 4)
   into (DPC, PC) as reaction to the activation of event signals.

In section 5.3, we define a software protocol for the interrupt service
routine which closely parallels the usual definition of procedure call and
return in operational semantics. This completes the definition of the
interrupt mechanism.
In compiled programs, the body of procedures is generated by the
compiler. Thus, the compiler can guarantee that the body of the procedure
is in a certain sense well behaved, for instance, that it does not overwrite
the return address. In our situation, the compiled procedure body is
replaced by the exception handler, which, among other things, obviously
can overwrite return addresses on a procedure frame. Handlers can also
generate interrupts in many ways. Indeed, the attempt to execute an exception
handler for page faults which does not reside in memory will immediately
generate another page fault interrupt, and so on.
In section 5.4, we therefore present a set of conditions for the
exception handlers and show: if the exception handlers satisfy the conditions,
then interrupts behave like a kind of procedure call. The proof turns out
to be nontrivial, mainly due to the fact that instructions which change the
interrupt masks can themselves be interrupted.
Given the machinery developed so far, the rest is straightforward. In
section 5.5, we design the interrupt hardware for a prepared sequential
machine according to the specifications of section 5.2. In section 5.6, we
pipeline the machine and show that the pipelined machine simulates the
prepared sequential machine in some sense. The main technical issue there
will be a more powerful forwarding mechanism.

5.2 Extended Instruction Set Architecture

Page fault and misalignment interrupts are obviously generated by
the memory system. Illegal instruction interrupts are detected by the control
automaton in the decode stage. Overflow interrupts can be generated by the
two new R-type instructions addo and subo and the two new I-type
instructions addio and subio specified in table 5.4. They generate the (maskable)
overflow event signal $ev[6]$ (index 6 in table 5.1) if the result of the
computation is not representable as a 32-bit two's complement number. Trap
interrupts are generated by the new J-type instruction trap (table 5.4).
External interrupts are generated by external devices; for these interrupts
we apply the following

. -  @


The active event line $ev[j]$ of an external I/O interrupt j is only turned off
once interrupt j has received service. Interrupt j receives service as soon as
the ISR is started with interrupt level $j'$, where $j' = j$, or where $j' < j$ and
interrupt $j'$ is of type abort. A formal definition of the concept of interrupt
level will be given shortly.

Table 5.3: Special purpose registers used for exception handling

address   name    meaning
0         SR      status register
1         ESR     exception status register
2         ECA     exception cause register
3         EPC     the exception PC
4         EDPC    the exception delayed PC
5         Edata   exception data register

The DLX architecture is extended by 7 new registers; 6 of them are visible
to the assembler programmer. They form the registers SPR[0] to SPR[5] of
the new special purpose register file SPR. Names and addresses of the SPR
registers are listed in table 5.3; their function is explained later.
Register contents can be copied between the general purpose register
file GPR and the special purpose register file SPR by means of the special
move instructions movi2s (move integer to special) and movs2i (move
special to integer). Both moves are R-type instructions (table 5.4). The binary
representation of the special register address is specified in field SA.
The cause register CA is the new non visible register. It catches event
signals $ev[j]$ which become active during the execution of instructions $I_i$ in
the following sense:

• If j is an internal interrupt, it is caught in the same instruction, i.e.,
  $CA[j]_i = 1$.

• If j is external, it is caught in the current or in the next instruction;
  $CA[j]_i = 1$ or $CA[j]_{i+1} = 1$. Once the bit $CA[j]$ is active, it remains
  active till interrupt j receives service.

• In any other situation, we have $CA[j]_i = 0$.


The interrupt masks are stored in the status register SR. For a maskable
interrupt j, bit $SR[j]$ stores the mask of interrupt j. Masking means that
interrupt j is disabled (masked) if $SR[j] = 0$, and it is unmasked otherwise.
The masked cause MCA is derived from the cause register and the status
register. For instruction $I_i$, the masked cause equals

\[
MCA[j]_i =
\begin{cases}
CA[j]_i & \text{if interrupt } j \text{ is not maskable} \\
CA[j]_i \wedge SR[j]_{i-1} & \text{if interrupt } j \text{ is maskable.}
\end{cases}
\]

Note that this is a nontrivial equation. It states that for instruction $I_i$, causes
are masked with the masks valid after instruction $I_{i-1}$. Thus, if $I_i$ happens
to be a movi2s instruction with destination SR, the new masks have no
effect on the MCA computation of $I_i$.
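
The masking rule can be written as a small bit-vector computation. The following Python sketch is only an illustration: it represents CA and SR as 32-bit integers and assumes, following table 5.2, that exactly the interrupts with index 6 and above are maskable. The names `MASKABLE` and `masked_cause` are not part of the design.

```python
WIDTH = 32
FIRST_MASKABLE = 6   # assumption from table 5.2: ovf (6) and exi (6 + i) are maskable

# bit j of MASKABLE is 1 iff interrupt j is maskable
MASKABLE = ((1 << WIDTH) - 1) & ~((1 << FIRST_MASKABLE) - 1)


def masked_cause(ca_i, sr_prev):
    """Return MCA_i, given CA_i and the masks SR_{i-1} valid before I_i."""
    non_maskable_part = ca_i & ~MASKABLE
    maskable_part = ca_i & sr_prev & MASKABLE
    return (non_maskable_part | maskable_part) & ((1 << WIDTH) - 1)
```

For example, an overflow caught while the old masks are all zero yields `masked_cause(1 << 6, 0) == 0`, even if the interrupted instruction itself writes new masks to SR.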
<-    .8
From the masked cause MCA, the signal JISR (jump to interrupt service
routine) is derived by

\[ JISR_i = \bigvee_{j=0}^{31} MCA[j]_i. \]

Activation of signal JISR triggers the jump to the interrupt service routine.
Formally we can treat this jump either as a new instruction $I_{i+1}$ or as a part
of instruction $I_i$. We chose the second alternative because this reflects more
closely how the hardware will work. However, for interrupted instructions
$I_i$ and registers or signals X, we now have to distinguish between

• $X_i$, which denotes the value of X after the (interrupted) execution of
  instruction $I_i$, i.e., after JISR, and

• $X_i^u$, which denotes the value of X after the uninterrupted execution
  of instruction $I_i$.
We proceed to specify the effect of JISR for instruction $I_i$. The interrupt
level il of the interrupt is

\[ il_i = \min \{ j \mid MCA[j]_i = 1 \}. \]

Interrupt il has the highest priority among all those interrupts which were
not masked during $I_i$ and whose event signals $ev[j]$ were caught. Interrupt
il can be of type continue, repeat or abort. If it is of type repeat, no register
file and no memory location X should be updated, except for the special
purpose registers. For any register or memory location X, we therefore
define

\[
X_i =
\begin{cases}
X_{i-1} & \text{if } il_i \text{ is of type repeat} \\
X_i^u & \text{otherwise.}
\end{cases}
\]
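
The two definitions, JISR as the OR over all masked cause bits and il as the smallest active index, translate directly into bit operations. This is a minimal Python sketch with the masked cause given as an integer; the function names are illustrative, and returning `None` when no interrupt is pending is a choice made here, not part of the specification.

```python
def jisr(mca):
    """JISR = OR over all bits MCA[j]."""
    return int(mca != 0)


def interrupt_level(mca):
    """il = min{ j | MCA[j] = 1 }, or None if no masked cause bit is set."""
    if mca == 0:
        return None
    lowest_set_bit = mca & -mca          # isolates the least significant 1
    return lowest_set_bit.bit_length() - 1
```

With a page fault on load/store (index 4) and an overflow (index 6) both pending, the level is 4, so the page fault is served first.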
By SISR, we denote the start address of the interrupt service routine. The
jump to ISR is then realized by

\[ (DPC, PC)_i = (SISR, SISR + 4). \]

The return addresses for the interrupt service routine are saved as

\[
(EDPC, EPC)_i =
\begin{cases}
(DPC, PC)_{i-1} & \text{if } il_i \text{ is of type repeat} \\
(DPC, PC)_i^u & \text{if } il_i \text{ is of type continue} \\
\text{arbitrary} & \text{if } il_i \text{ is of type abort,}
\end{cases}
\]

i.e., on an interrupt of type abort, the return addresses do not matter. The
exception data register stores a parameter for the exception handler. For
traps this is the immediate constant of the trap instruction. For page fault
and misalignment during load/store this is the memory address of the faulty
access:

\[
EDATA_i =
\begin{cases}
sext(imm_i) & \text{for trap interrupts} \\
ea_i & \text{for page fault or misalignment during load/store.}
\end{cases}
\]

For page faults during fetch, the address of the faulty instruction memory
access is $DPC_{i-1}$, which is saved already. Thus, there is no need to save it
twice.
The exception cause register ECA stores the masked interrupt cause

\[ ECA_i = MCA_i, \]

all maskable interrupts are masked by

\[ SR = 0, \]

and the old masks are saved as

\[
ESR_i =
\begin{cases}
SR_{i-1} & \text{if } il_i \text{ is of type repeat} \\
SR_i^u & \text{if } il_i \text{ is of type continue} \\
\text{arbitrary} & \text{if } il_i \text{ is of type abort.}
\end{cases}
\]

Thus, if the interrupted instruction sets new masks and it is interrupted by
an interrupt of type continue, then the new masks are saved. This
completes the description of the semantics of JISR at the instruction level.
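
The effect of JISR can be summarized as a state transition. The sketch below models machine states as Python dictionaries and is a reading aid only: the address `SISR = 0x1000`, the dictionary keys, the helper names, and the decision to treat overflow as type continue are assumptions of this illustration. The resume types are taken from table 5.2.

```python
SISR = 0x1000   # hypothetical start address of the interrupt service routine


def resume_type(il):
    """Resume type per table 5.2 (overflow is treated as 'continue' here)."""
    if il <= 2:
        return "abort"
    if il <= 4:
        return "repeat"
    return "continue"


def jisr_step(prev, unint, mca, edata=0):
    """State after an interrupted instruction I_i.

    prev  : state after I_{i-1}
    unint : state after the uninterrupted execution of I_i
    mca   : masked cause MCA_i, must be nonzero
    """
    il = (mca & -mca).bit_length() - 1
    kind = resume_type(il)
    source = prev if kind == "repeat" else unint   # X_i = X_{i-1} or X_i^u
    new = dict(source)
    if kind != "abort":                            # on abort the saved values do not matter
        new["EDPC"], new["EPC"] = source["DPC"], source["PC"]
        new["ESR"] = source["SR"]
    new.update(ECA=mca, EDATA=edata, SR=0, DPC=SISR, PC=SISR + 4)
    return il, new
```

For a repeat interrupt the ordinary registers keep their values from before the instruction, while for a continue interrupt they carry the results of the completed instruction.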
The restoration of the saved parameters is achieved by a new J-type
instruction rfe (return from exception) specified in table 5.4.

5.3 Interrupt Service Routines For Nested Interrupts

Nested interrupts are handled by a software protocol. The protocol
maintains an interrupt stack IS. The stack consists of frames. Each
frame can hold copies of all general purpose registers and all special
registers. Thus, with the present design we have a frame size of $32 + 6 = 38$
words.
We denote by IS.TOP the top frame of the interrupt stack. Its base
address is maintained in the interrupt stack pointer ISP. For this pointer,
we reserve one of the general purpose registers, namely

\[ ISP = GPR[30]. \]
Table 5.4: Extensions to the DLX instruction set. Except for rfe and trap, all
instructions also increment the PC by four. SA is a shorthand for the special
purpose register SPR[SA]; sxt(imm) is the sign-extended version of the immediate.

IR[31:26]   IR[5:0]   mnemonic   effect

Arithmetic Operation (I-type)
hx08                  addio      RD = RS1 + imm; ovf signaled
hx0a                  subio      RD = RS1 - imm; ovf signaled
Arithmetic Operation (R-type)
hx00        hx20      addo       RD = RS1 + RS2; ovf signaled
hx00        hx22      subo       RD = RS1 - RS2; ovf signaled
Special Move Instructions (R-type)
hx00        hx10      movs2i     RD = SA
hx00        hx11      movi2s     SA = RS1
Control Instructions (J-type)
hx3e                  trap       trap = 1; Edata = sxt(imm)
hx3f                  rfe        SR = ESR; PC = EPC; DPC = EDPC

We call the sequence of registers

\[ EHR = (ESR, ECA, EDPC, EPC, EDATA) \]

the exception handling registers. For each frame F of the interrupt stack
and for any register R, we denote by F.R the portion of F reserved for
register R. We denote by F.EHR the portion of the frame reserved for copies
of the exception handling registers. We denote by IS.EHR the portions
of all frames of the stack reserved for copies of the exception handling
registers.
The interrupt service routine, which is started after a JISR, has three
phases:
1. SAVE (save status):

   (a) The current interrupt level

       \[ il = \min \{ j \mid ECA[j] = 1 \} \]

       is determined. For this computation, ECA has to be copied into
       some general purpose register GPR[x]. This register in turn has
       first to be saved to some reserved location in the memory. This
       write operation in turn had better not generate a page fault
       interrupt.

   (b) If il is of type abort, an empty interrupt stack is initialized, and
       otherwise a new frame is pushed on the stack by the computation

       \[ ISP = ISP + frame\_size. \]

   (c) The exception handling registers are saved:

       \[ IS.TOP.EHR = EHR. \]

   (d) All maskable interrupts $j < il$ are unmasked:

       \[ SR = 0^{32-il}\,1^{il}. \]

       This mask is precomputed and then assigned to SR in a single
       special move instruction. After this instruction, the interrupt
       service routine can be interrupted again by certain maskable
       interrupts.

2. Exception Handler H(il): The interrupt service routine branches
   to the start of the proper routine for interrupt il. This routine will
   usually need some general purpose registers. It will save the
   corresponding registers to IS.TOP. After the proper work for interrupt il
   is done, the general purpose registers which were saved are restored.
   Observe that all this can be interrupted by (maskable) interrupts of
   higher priority. Finally the handler masks all maskable interrupts by
   a single special move instruction:

   \[ SR = GPR[0]. \]

3. RESTORE (restore status): the following registers are restored from
   the stack:

   \[ EDPC = IS.TOP.EDPC, \quad EPC = IS.TOP.EPC, \quad ESR = IS.TOP.ESR. \]

   The top frame is popped from the stack:

   \[ ISP = ISP - frame\_size. \]

   The interrupt service routine ends with an rfe instruction.
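
The three phases can be mimicked by a small software model. The Python sketch below is illustrative only: it keeps the EHR copies in a dictionary indexed by frame base address, assumes that the stack grows towards higher addresses starting at base 0, and folds the final rfe into the restore step. The class and function names are hypothetical.

```python
FRAME_SIZE = 38          # 32 general purpose + 6 special purpose registers
EHR = ("ESR", "ECA", "EDPC", "EPC", "EDATA")


class Machine:
    def __init__(self):
        self.isp = 0     # interrupt stack pointer, kept in GPR[30]
        self.frames = {} # frame base address -> copies of the EHR registers
        self.spr = {"SR": 0, "ESR": 0, "ECA": 0, "EDPC": 0, "EPC": 0, "EDATA": 0}


def save(m):
    """Phase SAVE; expects ECA != 0, as set by JISR."""
    eca = m.spr["ECA"]
    il = (eca & -eca).bit_length() - 1     # il = min{ j | ECA[j] = 1 }
    if il <= 2:                            # type abort: start with an empty stack
        m.frames.clear()
        m.isp = 0
    else:
        m.isp += FRAME_SIZE                # push a new frame
    m.frames[m.isp] = {r: m.spr[r] for r in EHR}
    m.spr["SR"] = (1 << il) - 1            # SR = 0^(32-il) 1^il
    return il


def restore_and_rfe(m):
    """Phase RESTORE followed by rfe; returns the resumed (DPC, PC)."""
    top = m.frames.pop(m.isp)
    for r in ("EDPC", "EPC", "ESR"):
        m.spr[r] = top[r]
    m.isp -= FRAME_SIZE                    # pop the top frame
    m.spr["SR"] = m.spr["ESR"]             # rfe: SR = ESR
    return m.spr["EDPC"], m.spr["EPC"]     # rfe: (DPC, PC) = (EDPC, EPC)
```

In this model a nested interrupt simply calls `save` again before the outer `restore_and_rfe`; each level then finds its own return addresses and masks in its own frame.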

/*
  '
   '  )  
5.4 Admissible Interrupt Service Routines

We intend interrupts to behave like procedure calls. The mechanism
of the previous section defines the corresponding call and return
mechanism. Handlers unfortunately are not generated by compilers, and
thus the programmer has many possibilities for hacks which make the
mechanism not at all behave like procedure calls. The obvious point of
attack are the fields IS.EHR. Manipulation of IS.TOP.EDPC obviously
allows jumping anywhere.
If the interrupt stack is not on a permanent memory page, each interrupt,
including page fault interrupts, can lead to a page fault interrupt, and so
on. One can list many more such pitfalls. The interesting question then
obviously is: have we overlooked one?
In this section we therefore define an interrupt service routine to be
admissible if it satisfies a certain set of conditions (i.e., if it does not make
use of certain hacks). We then prove that with admissible interrupt service
routines the mechanism behaves like a procedure call and return.

'&    

An interrupt service routine is called admissible if it complies with the
following set of constraints:
1. The data structures of the interrupt mechanism must be used in a
   restricted manner:
   (a) The interrupt stack pointer ISP is only written by SAVE and
       RESTORE.
   (b) The segments of an IS frame which are reserved for the EHR
       registers are only updated by SAVE.
2. The ISR must be written according to the following constraints:
   (a) Instruction rfe is only used as the last instruction of the ISR.
   (b) The code segments SAVE and RESTORE avoid any non-maskable
       internal interrupt; in the current DLX architecture, these are
       the interrupts j with $0 < j < 6$.
   (c) Every handler H(j) avoids any non-maskable internal interrupt
       i with a priority $i \ge j$.
   (d) If handler H(j) uses a special move with source register R in
       order to update the status register SR, then the bit $R[i] = 0$ for
       any $i \ge j$.
   Among other things, the conditions (b) and (c) require that page faults
   are avoided in certain handlers. That can only be ensured if the
   interrupt stack IS and the codes SAVE and RESTORE are held on
   permanent pages, i.e., on pages which cannot be swapped out of
   main memory. Let $j_p$ denote the priority level of the page fault pff.
   For any $j \le j_p$, the handler H(j) and all the data accessed by H(j)
   must also be held on permanent pages.
   We will have to show that the interrupt mechanism can manage with
   a limited number of permanent pages, i.e., that the interrupt stack IS
   is of finite size.
3. The interrupt priorities are assigned such that
   (a) Non-maskable external interrupts are of type abort and have
       highest priority $j = 0$.
   (b) Maskable external interrupts are of type continue and have a
       lower priority than any internal interrupt.
   (c) If an instruction can cause several internal interrupts at the
       same time, the one with the highest priority among the caused
       interrupts must be of type repeat or abort.
The assignment of the interrupt priorities used by our DLX design
(table 5.2) complies with these constraints.
The conditions 1 and 2 must hold whether the handler H(j) is interrupted
or not. This is hard to achieve because the ISR of another interrupt could
corrupt the data structures and the registers used by H(j). As a
consequence, H(j) could cause a misaligned memory access or overwrite an
EHR field on the interrupt stack.
The following approach could, for instance, protect the stack IS against
illegal updates. Besides the EHR registers, a frame of stack IS also backs up
data which are less critical, e.g., the general purpose registers. It is
therefore suitable to use two stacks, one for the EHR registers and one for the
remaining data. The EHR stack can then be placed on a special memory
page which, except for the code SAVE, is read-only.

'&   - -

The code segments S AVE and R ESTORE can be interpreted as left and right
brackets, respectively. Before we can establish that admissible interrupt
service routines behave in some sense like procedures we have to review
some facts concerning bracket structures.
)
  '
For sequences $S = S_1 \ldots S_t$ of brackets '(' and ')' we define

  l(S) = the number of left brackets in S
  r(S) = the number of right brackets in S.

Sequence S is called a bracket structure if

\[ l(S) = r(S) \quad \text{and} \quad l(Q) \ge r(Q) \ \text{for all prefixes } Q \text{ of } S, \qquad (5.1) \]

i.e., the number of left brackets equals the number of right brackets, and in
prefixes of S there are never more right brackets than left brackets.
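
Condition (5.1) can be checked in a single left-to-right pass by tracking the difference l(Q) - r(Q) for the current prefix Q. A minimal Python sketch, with a hypothetical function name, assuming the sequence is given as a string of '(' and ')':

```python
def is_bracket_structure(s):
    """True iff l(S) = r(S) and l(Q) >= r(Q) for every prefix Q of S."""
    depth = 0                          # l(Q) - r(Q) for the current prefix Q
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            raise ValueError("only '(' and ')' are allowed")
        if depth < 0:                  # a prefix with more right than left brackets
            return False
    return depth == 0
```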
Obviously, if S and T are bracket structures, then (S) and ST are bracket
structures as well. In bracket structures S one can pair brackets with the
following algorithm:

  For all right brackets R from left to right do:
    pair R with the left bracket L immediately left of R;
    cancel R and L from S;

The above algorithm proceeds in rounds $k = 1, 2, \ldots$. Let $R(k)$ and $L(k)$
be the right and left bracket paired in round k, and let $S(k)$ be the string S
before round k. We have $S(1) = S$. By induction on k one shows

Lemma 5.1
1. $R(k)$ is the leftmost right bracket in $S(k)$,

2. $L(k)$ exists, and

3. the portion Q of S from $L(k)$ to $R(k)$ is a bracket structure.

The proof is left as an exercise. Observe that up to round k, the above
algorithm only works with the prefix $S_1 \ldots R(k)$ of S.
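
The pairing algorithm can be sketched with a stack of positions of left brackets that have not been cancelled yet; the left bracket "immediately left of R" is then the top of that stack. The Python function below is an illustration with a hypothetical name; it returns the pairs $(L(k), R(k))$ as index pairs in the order of the rounds and also accepts initial segments of bracket structures.

```python
def pair_brackets(s):
    """Pair each right bracket with the nearest uncancelled left bracket."""
    open_positions = []                # left brackets not yet cancelled
    pairs = []                         # (position of L(k), position of R(k))
    for pos, ch in enumerate(s):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if not open_positions:
                raise ValueError("right bracket without partner at %d" % pos)
            pairs.append((open_positions.pop(), pos))
        else:
            raise ValueError("only '(' and ')' are allowed")
    return pairs
```

Since round k only looks at the input up to the k-th right bracket, the sketch reflects the observation that the algorithm works on a prefix of S.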

'&# "      . -  8- 

We begin with some definitions. First, we define the interrupt level il in
situations where SAVE sequences are not interrupted:²

\[
il =
\begin{cases}
\min \{ j \mid MCA[j] = 1 \} & \text{during SAVE} \\
\min \{ j \mid IS.TOP.MCA[j] = 1 \} & \text{outside of SAVE, if it exists} \\
32 & \text{otherwise.}
\end{cases}
\]

² We show later that this is always the case.
A sequence of instructions SAVE H RESTORE is called an instance of
ISR(j) if during H the interrupt level equals

\[ il = j. \]

It is called a non aborting execution of ISR(j) if the interrupt level obeys

\[ il = j \ \text{during SAVE and RESTORE}, \qquad il \le j \ \text{during } H. \]

Thus, during executions of ISR(j) the handler H(j) can be interrupted. We
do not consider infinite executions.
Assume that H does not end with a RESTORE sequence of interrupt level
j; then

\[ SAVE_1 \; H \; SAVE_2 \; H' \; RESTORE \]

is called an aborting execution of ISR(j) if

\[ il = j \ \text{during } SAVE_1, \qquad il \le j \ \text{during } H, \qquad il \le 2 \ \text{during } SAVE_2,\ H' \text{ and RESTORE}. \]

We call the execution of an interrupt service routine properly nested or
simply nested, if
1. no code segment SAVE or RESTORE is interrupted,
2. the sequence of code segments SAVE and RESTORE forms an initial
   segment of a proper bracket structure, and if
3. paired brackets belong to an instance of some ISR(j) in the following
   sense: Let L and R be paired SAVE and RESTORE sequences.
   Let H consist of the instructions between L and R
   (a) which do not belong to SAVE and RESTORE sequences, and
   (b) which are not included by paired brackets inside L and R.
   Then L H R is an instance of some ISR(j).
We call an execution perfectly nested if it is properly nested and the
sequence of SAVEs and RESTOREs forms a proper bracket structure. In the
following proofs we will establish among other things

Theorem 5.2  Executions of admissible interrupt service routines are properly nested.
We will first establish properties of perfectly nested executions of
interrupt service routines in lemma 5.3. In lemma 5.4 we will prove by
induction the existence of the bracket structure. In the induction step, we
will apply lemma 5.3 to portions of the bracket structure whose existence
is already guaranteed by the induction hypothesis. In particular, we will
need some effort to argue that RESTOREs are never interrupted.
The theorem then follows directly from the lemmas 5.3 and 5.4.

Lemma 5.3  Let the interrupt mechanism obey software constraints 1 to 3. Consider a
perfectly nested execution of ISR(j). The sequence of instructions executed
has the form

\[ \underbrace{I_a \ldots I_b}_{SAVE} \ \underbrace{\ldots}_{H(j)} \ \underbrace{I_c \ldots I_d}_{RESTORE}. \]

We then have:

1. If the execution of ISR(j) is not aborted, then the interrupt stack IS
   holds the same number of frames before and after ISR(j), and the
   segments of IS reserved for the EHR registers remain unchanged,
   i.e.,

   \[ ISP_{a-1} = ISP_d \quad \text{and} \quad IS.EHR_{a-1} = IS.EHR_d. \]

2. Preciseness. If ISR(j) is not aborted, the execution is resumed at

   \[
   (DPC_d, PC_d) =
   \begin{cases}
   (DPC_{a-2}, PC_{a-2}) & \text{if } j \text{ is a repeat interrupt} \\
   (DPC_{a-1}^u, PC_{a-1}^u) & \text{if } j \text{ is a continue interrupt}
   \end{cases}
   \]

   with the masks

   \[
   SR_d =
   \begin{cases}
   SR_{a-2} & \text{if } j \text{ is a repeat interrupt} \\
   SR_{a-1}^u & \text{if } j \text{ is a continue interrupt.}
   \end{cases}
   \]

 Proof by induction on the number n of interrupts which interrupt the exe-
cution of an ISR j.
$n = 0$. The execution of ISR(j) is uninterrupted. Since interrupt j is
not aborting, SAVE allocates a new frame on the stack IS, and RESTORE
removes one frame. The handler H(j) itself does not update the stack
pointer (constraint 1), and thus

\[ ISP_{a-1} = ISP_d. \]

According to constraint 1, the EHR fields on the interrupt stack IS are only
written by SAVE. However, SAVE just modifies the top frame of IS, which
is removed by RESTORE. Thus

\[ IS.EHR_{a-1} = IS.EHR_d \]

and claim 1 follows. With respect to claim 2, we only show the preciseness
of the masks SR; the preciseness of the PCs can be shown in the same way.

\[
\begin{aligned}
SR_d &= ESR_{d-1} && \text{by definition of rfe} \\
     &= IS.TOP.ESR_{c-1} && \text{by definition of RESTORE,}
\end{aligned}
\]

where IS.TOP denotes the top frame of the stack IS. Since the handler
itself does not update the stack pointer ISP nor the EHR fields on the stack
IS (constraint 1), it follows

\[
\begin{aligned}
IS.TOP.ESR_{c-1} &= IS.TOP.ESR_b \\
                 &= ESR_{a-1} && \text{by definition of SAVE,}
\end{aligned}
\]

and by the definition of the impact of JISR it then follows that

\[
SR_d = ESR_{a-1} =
\begin{cases}
SR_{a-2} & \text{if } j \text{ is a repeat interrupt} \\
SR_{a-1}^u & \text{if } j \text{ is a continue interrupt.}
\end{cases}
\]

In the induction step, we conclude from n to $n + 1$. The execution of
ISR(j) is interrupted by $n + 1$ interrupts, and the codes SAVE and RESTORE
of the corresponding instances of the ISR form a proper bracket structure.
Since SAVE and RESTORE are uninterrupted, there are m top level pairs
of brackets in the instruction stream of the handler H(j); each pair
corresponds to an instance $ISR(j_r)$:

\[
\underbrace{I_a \ldots I_b}_{SAVE} \
\underbrace{\ldots \overbrace{I_{a_1} \ldots I_{d_1}}^{ISR(j_1)} \ldots \overbrace{I_{a_2} \ldots I_{d_2}}^{ISR(j_2)} \ldots \overbrace{I_{a_m} \ldots I_{d_m}}^{ISR(j_m)} \ldots}_{H(j)} \
\underbrace{I_c \ldots I_d}_{RESTORE}
\]

Each of the $ISR(j_r)$ is interrupted at most n times, and due to the induction
hypothesis, they return the pointer ISP and the EHR fields on the stack
unchanged:

\[ ISP_{a_r - 1} = ISP_{d_r} \quad \text{and} \quad IS.EHR_{a_r - 1} = IS.EHR_{d_r}. \]

Since the instructions of the handler H(j) do not update these data, it
follows for the pointer ISP that

\[ ISP_b = ISP_{a_1 - 1} = ISP_{d_1} = \ldots = ISP_{a_m - 1} = ISP_{d_m} = ISP_{c-1}. \]

The same holds for the EHR fields of the interrupt stack:

\[ IS.EHR_b = IS.EHR_{a_1 - 1} = \ldots = IS.EHR_{d_m} = IS.EHR_{c-1}. \qquad (5.2) \]

Since RESTORE removes the frame added by SAVE, and since SAVE only
updates the EHR fields of the top frame, the claim 1 follows for $n + 1$. The
preciseness of the ISR(j) can be concluded like in the case $n = 0$, except
for the equality

\[ IS.TOP.EHR_b = IS.TOP.EHR_{c-1}. \]

However, this equality holds because of equation 5.2.

Lemma 5.4  Let the interrupt mechanism obey the software constraints. Then, non
aborting executions of the interrupt service routine are properly nested.

 We proceed in three steps:


1. SAVE is never interrupted: According to the software constraint 2, the
codes SAVE and RESTORE avoid any non-maskable internal interrupt.
Reset is the only non-maskable external interrupt, but we are only interested
in a non aborted execution. Thus, SAVE and RESTORE can only be
interrupted by a maskable interrupt.
If an instruction $I_i$ causes an interrupt, all masks are cleared, i.e., $SR_i = 0$,
and a jump to the ISR is initiated: $JISR_i = 1$. In the code SAVE, the masks
are only updated by the last instruction. Since new masks apply to later
instructions, SAVE cannot be interrupted by maskable interrupts either.
2. The code RESTORE avoids non-maskable interrupts, and only its last
instruction updates the status register. Thus, RESTORE cannot be
interrupted if it is started with $SR = 0$. The last instruction of any non-aborting
interrupt handler is a special move

\[ SR := GPR[0] = 0. \]

If this special move is not interrupted, then RESTORE is not interrupted
either.
3. Let the code RESTORE comprise the instructions $R_1 \ldots R_s$. Note that by
the construction of interrupt service routines every instance of ISR starts
with a SAVE and, in case it is not aborted, it produces later exactly one
first instruction $R_1$ of its RESTORE sequence. Therefore, in executions
of the interrupt service routine the sequence of SAVEs (which are never
interrupted) and instructions $R_1$ form an initial segment of a proper bracket
structure.
In a non aborting execution, we denote by $R_1^n$ the nth occurrence of $R_1$.
We prove by induction on n that until $R_1^n$

• the code segment RESTORE is always started with $SR = 0$ (hence it
  is not interrupted),
• the code segments SAVE and RESTORE form a start sequence of a
  proper bracket structure, and
• paired brackets belong to an execution of some ISR(j).

For $n = 1$ there must be a SAVE to the left of the first $R_1$. Consider the
first such SAVE to the left of $R_1^1$. Then, this SAVE and $R_1^1$ belong to an
uninterrupted instance of an ISR(j). Thus, $R_1^1$ is started with $SR = 0$ and
the first RESTORE is not interrupted.
For the induction step, consider $R_1^{n+1}$. There are n instructions $R_1^i$ to
its left. By induction hypothesis the code segments SAVE and RESTORE
up to $R_1^n$ form a start sequence of a proper bracket structure with paired
brackets belonging to executions of some ISR(j). By lemma 5.3, these
executions are precise. Since the sequence of SAVEs and $R_1$s forms an
initial segment of a bracket structure, we can pair $R_1^{n+1}$ with a preceding
SAVE code sequence L. Let H be the sequence of instructions between
L and $R_1^{n+1}$. Construct $H'$ from H by canceling all executions of some
ISR(j). Because these executions are precise, we have during $H'$ a constant
interrupt level

\[ il = i; \]

thus, handler H(i) is executed during $H'$.
Let $ISR(j_n)$ denote the instance of the ISR which belongs to $R_1^n$.
Instruction $R_1^{n+1}$ is then either directly preceded
(a) by the special move I with $SR := 0$, or
(b) by the special move I followed by $ISR(j_n)$.
The first case is trivial (see $n = 1$). In the second case, $ISR(j_n)$ interrupts
the special move, and interrupt $j_n$ is of type continue. Due to the
preciseness of $ISR(j_n)$, $R_1^{n+1}$ is started with the masks $SR_m^u = 0$, and the
$(n+1)$st RESTORE block is not interrupted.
Lemma 5.5  Priority Criterion. For admissible interrupt service routines, it holds:
1. During the execution of ISR(j), maskable interrupts i with $i \ge j$ are
   masked all the time.
2. ISR(j) can only be interrupted by an interrupt $i < j$ of higher priority.

Proof. According to lemma 5.4, the codes SAVE and RESTORE can only be
interrupted by reset. Thus, we focus on the interrupt handlers. For any
non-maskable interrupt $j < 6$, claim two follows directly by constraint 2. For
the maskable interrupts $j \ge 6$, we prove the claims by induction on the
number n of interrupts which interrupt the handler H(j).
$n = 0$: The ISR is always started with $SR = 0$, due to signal JISR.
The ISR only updates the masks by a special move movi2s or by an
rfe instruction. Since rfe is only used as the last instruction of an ISR
(constraint 2), it has no impact on the masks used by the ISR itself.
In case of a special move $SR := R$, the bit $R[i]$ must be zero for any
$i \ge j$. Thus, the maskable interrupts are masked properly. Due to the
definition of the masked interrupt cause of instruction $I_l$

\[
MCA[j']_l =
\begin{cases}
CA[j']_l \wedge SR[j']_{l-1} & \text{if interrupt } j' \text{ is maskable} \\
CA[j']_l & \text{otherwise}
\end{cases}
\]

and the definition of the interrupt level

\[ il_l = \min \{ j' \mid MCA[j']_l = 1 \}, \]

ISR(j) cannot be interrupted by a maskable interrupt $j' \ge j$, and the
claim follows.

$n > 0$: The handler H(j) is interrupted n times, and the codes SAVE
and RESTORE form a proper bracket structure. Thus, the instruction
sequence of ISR(j) has the following form

\[ SAVE \ \ldots \ ISR(j_1) \ \ldots \ ISR(j_m) \ \ldots \ RESTORE \]

for an $m \le n$. The instructions which belong to the code of the
handler H(j) do not unmask interrupts $j'$ with $j' \ge j$. Due to the
preciseness of the ISR, any $ISR(j_r)$ returns the masks SR delivered to it
by register ESR. By induction on m it then follows that interrupt $j_r$
has a higher priority than j, i.e., $j_r < j$.
Since any $ISR(j_r)$ is interrupted at most $n - 1$ times, the induction
hypothesis applies. $ISR(j_r)$ keeps all interrupts $j'$ with $j' \ge j_r$
masked, and especially those with $j' \ge j$.

Theorem 5.2 and lemma 5.5 imply:

Theorem 5.6  Non aborting executions of admissible interrupt service routines are
perfectly nested.

Proof. Let LHR be a non aborting execution of ISR(j), where L is a save sequence
and R is a restore sequence. By theorem 5.2, the sequence of SAVEs and
RESTOREs in LHR is an initial segment of a bracket structure. If the
brackets L and R are paired, then the SAVE and RESTORE sequences in H form
a bracket structure. Hence, the brackets in LHR form a bracket structure
and LHR is perfectly nested.
Assume R is paired with a left bracket $L'$ right of L:

\[ L \ \ldots \ \underbrace{L' \ \ldots \ R}_{ISR(j')} \]

Then by lemma 5.5, the interrupt level immediately before $L'$ is greater
than j, and LHR is not a non aborting execution.

According to lemma 5.5, the ISR of an interrupt $j > 0$ can only be
interrupted by an interrupt of higher priority. Thus, there can be at most
one frame on the stack IS for each interrupt level $j > 0$. Reset can even
interrupt ISR(0). However, on reset, the ISR does not allocate a new frame;
the stack IS is cleared instead. The size of the interrupt stack IS is
therefore limited; the ISR uses at most 32 frames.
As for many software protocols, fairness seems to be desirable for the
interrupt mechanism. In this context, fairness means that every interrupt
eventually gets service. Due to the pure priority scheme, that cannot
always be achieved. Consider the case that the event signals of two
external interrupts ev[15] and ev[17] become active at the same time, and
that the external interrupt ev[16] occurs whenever ISR(15) is left and
vice versa. Under these conditions, interrupt 17 is starved by the
interrupts 15 and 16. Thus, fairness and a pure priority scheme do not go
together. Nevertheless, one would at least like to guarantee that no
internal interrupt gets lost.
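
The starvation scenario can be replayed with a few lines of Python. This is
only an illustrative sketch under simplifying assumptions that are not part
of the design: every ISR runs to completion in a single step, all three
interrupts are unmasked, and the names interrupt_level and simulate are
hypothetical helpers.

```python
def interrupt_level(pending):
    """Pure priority scheme: the smallest pending index has highest priority."""
    return min(pending) if pending else None


def simulate(steps):
    """Replay the scenario: ev[15] and ev[17] become active together;
    ev[16] occurs whenever ISR(15) is left, and ev[15] whenever ISR(16) is left."""
    pending = {15, 17}
    serviced = []
    for _ in range(steps):
        level = interrupt_level(pending)
        serviced.append(level)
        pending.discard(level)      # ISR(level) services this interrupt
        if level == 15:
            pending.add(16)         # ev[16] occurs on leaving ISR(15)
        elif level == 16:
            pending.add(15)         # ... and vice versa
    return serviced, pending


serviced, pending = simulate(6)
print(serviced)           # [15, 16, 15, 16, 15, 16]
print(17 in pending)      # True: interrupt 17 is still waiting
```

However many steps are simulated, level 17 never becomes the minimum of the
pending set, which is exactly the starvation described above.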

Lemma 5.7 (Completeness). Let the interrupt mechanism obey the software
constraints. Every internal interrupt $j$ which occurs in instruction $I_i$
and which is not masked receives service in instruction $I_{i+1}$, or
instruction $I_i$ is repeated after the ISR which starts with instruction
$I_{i+1}$.

Proof. Let instruction $I_i$ trigger the internal interrupt $j$, i.e.,
$ev[j]_i = 1$. The cause bit $CA[j]_i$ is then activated as well. Under the
assumption of the lemma, $j$ is either non-maskable or it is unmasked
($SR[j]_{i-1} = 1$). In either case, the corresponding bit of MCA is
raised, and a jump to the ISR is initiated.
Thus, $I_{i+1}$ is the first instruction of routine ISR($k$), where
$k = il_i$ denotes the interrupt level after $I_i$. Due to the definition
of the interrupt level, $k \le j$. For $k = j$, the claim follows
immediately. For $k < j$, interrupt $k$ is either external or internal. In
case of an external interrupt, $k$ must be a reset (constraint 3) which
aborts the execution, servicing any pending interrupt. If $k$ is an
internal interrupt, it is of type abort or repeat due to constraint 3.
Thus, ISR($k$) either services any pending interrupt by aborting the
execution, or after ISR($k$), the execution is resumed at instruction $I_i$.

If constraint 3 is relaxed, the completeness of the interrupt mechanism
in the sense of lemma 5.7 cannot be guaranteed. Assume that instruction
$I_i$ causes two internal interrupts $j$ and $j'$, and that $j < j'$. If
$j$ is of type continue, ISR($j$) just services $j$ and resumes the
execution at the instruction which would follow $I_i$ in case of
$JISR_i = 0$. Thus, interrupt $j'$ would get lost. If interrupt $j$ is of
type repeat, ISR($j$) does not service interrupt $j'$ either. However,
instruction $I_i$ is repeated after the ISR, and the fault which
corresponds to interrupt $j'$ can occur again.

5.5 Interrupt Hardware

IN THIS section, we design the interrupt hardware of the prepared
sequential architecture DLXΣ according to the specifications of section 5.2.
The instruction set architecture (ISA) is extended by

• the special purpose register file SPR,

• a register S which buffers data read from SPR,[3]

• the circuitry for collecting the interrupt events,

• the actual ISR call mechanism which in case of an active interrupt
  event forces the interrupt parameters into the SPR register file and
  the pair of addresses (SISR, SISR + 4) into the registers DPC and
  PC', and by

• control realizing the instructions from table 5.4.

The enhanced ISA requires changes in the data paths and in the control
(section 5.5.6). The data paths get an additional environment CAenv which
collects the interrupt event signals and determines the interrupt cause
(section 5.5.5). Except for the PC environment, the register file
environment RFenv and circuit Daddr, the remaining data paths undergo only
minor changes (section 5.5.4). Figure 5.1 depicts the top level schematics
of the enhanced DLX data paths. Their cost can be expressed as

\[
\begin{aligned}
C_{DP} = {} & C_{PCenv} + C_{IMenv} + C_{IRenv} + C_{EXenv} + C_{DMenv} + C_{SH4Lenv} \\
 & + C_{RFenv} + C_{Daddr} + C_{CAenv} + C_{buffer} + 8 \cdot C_{ff}(32).
\end{aligned}
\]

Figure 5.1: Data paths of the prepared sequential designs with interrupt support

Note that without interrupt hardware, reset basically performs two tasks:
it brings the hardware into a well defined state (hardware initialization)
and restarts the instruction execution. In the DLXΣ design with interrupt
hardware, the reset signal itself initializes the control and triggers an
interrupt. The interrupt mechanism then takes care of the restart, i.e.,
with respect to restart, signal JISR takes the place of signal reset.

[3] Registers A and B play this role for register file GPR.

5.5.1 Environment PCenv

The environment PCenv of figure 5.2 still implements the delayed PC
mechanism, but it now provides an additional register DDPC (delayed
delayed PC) which buffers the PC of the current instruction $I_i$:

\[
DDPC_i = DPC_{i-1}
\]

The functionality of the environment also needs to be extended in order
to account for the new control instruction rfe and to support a jump to
the ISR. Without interrupt handling, the PCs are initialized on reset. Now,
reset is treated like any other interrupt, and therefore, the PCs are
initialized on JISR, instead:

\[
(DPC_i,\, PC'_i) = \begin{cases}
(SISR,\, SISR + 4) & \text{if } JISR_i = 1 \\
(DPC^u_i,\, PC'^u_i) & \text{otherwise.}
\end{cases}
\]

Figure 5.2: Environment PCenv with interrupt support

Except for an rfe instruction, the values $PC'^u_i$ and $DPC^u_i$ are
computed as before:

\[
PC'^u_i = \begin{cases}
EPC_{i-1} & \text{if } I_i = \text{rfe} \\
PC'_{i-1} + imm_i & \text{if } bjtaken_i = 1 \wedge I_i \in \{\text{beqz}, \text{bnez}, \text{j}, \text{jal}\} \\
RS1_{i-1} & \text{if } bjtaken_i = 1 \wedge I_i \in \{\text{jr}, \text{jalr}\} \\
PC'_{i-1} + 4 & \text{otherwise}
\end{cases}
\]

\[
DPC^u_i = \begin{cases}
EDPC_{i-1} & \text{if } I_i = \text{rfe} \\
PC'_{i-1} & \text{otherwise}
\end{cases}
\]

Thus, the new PC computation just requires two additional muxes
controlled by signal rfe. The two registers link and DDPC are only updated
on an active clock signal PCce, whereas PC' and DPC are also updated on
a jump to the ISR:

\[
DPCce = PC'ce = PCce \vee PCinit.
\]

These modifications have no impact on register link nor on the glue logic
PCglue which generates signal bjtaken. The cost of the environment is now

\[
C_{PCenv} = 4 \cdot C_{ff}(32) + 6 \cdot C_{mux}(32) + C_{add}(32) + C_{inc}(30) + C_{PCglue}.
\]
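
The case distinction of the PC environment can be written down as a small
behavioural model. The following Python sketch is an illustration only, not
the gate-level circuit: the function name next_pcs, its argument order and
the use of mnemonic strings are assumptions made for this example, and the
mnemonics are the standard DLX names.

```python
MASK32 = 0xFFFFFFFF


def next_pcs(instr, bjtaken, pc_prev, imm, rs1, epc, edpc, jisr, sisr):
    """Return (DPC_i, PC'_i) from the values after instruction I_{i-1}.

    instr is a mnemonic string, pc_prev stands for PC'_{i-1}."""
    if jisr:                                   # jump to the ISR overrides everything
        return sisr & MASK32, (sisr + 4) & MASK32
    if instr == "rfe":                         # return from exception
        return edpc & MASK32, epc & MASK32
    if bjtaken and instr in ("beqz", "bnez", "j", "jal"):
        pc_u = (pc_prev + imm) & MASK32        # PC-relative target
    elif bjtaken and instr in ("jr", "jalr"):
        pc_u = rs1 & MASK32                    # target taken from register RS1
    else:
        pc_u = (pc_prev + 4) & MASK32          # sequential execution
    return pc_prev & MASK32, pc_u              # delayed PC: DPC^u_i = PC'_{i-1}


print(next_pcs("add", False, 0x100, 0, 0, 0, 0, False, 0))
```

The two early returns correspond to the two additional multiplexer levels:
one controlled by rfe, the other by JISR.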

The two exception PCs are provided by environment RFenv. Let csID
denote the control signals which govern stage ID, including signal JISR,
and let $A_{CON}(csID)$ denote their accumulated delay. Environment PCenv
then requires a cycle time of

\[
\begin{aligned}
T_{PCenv} = {} & \max\{ D_{inc}(30),\; A_{IRenv}(co) + D_{add}(32),\; A_{GPRenv}(Ain), \\
 & \qquad A_{RFenv}(EPCs),\; A(bjtaken),\; A_{CON}(csID) \} \\
 & + 3 \cdot D_{mux}(32) + \Delta.
\end{aligned}
\]

5.5.2 Circuit Daddr

Figure 5.3: Circuit Daddr

Circuit Daddr consists of the two subcircuits Caddr and Saddr. As before,
circuit Caddr generates the destination address Cad of the general purpose
register file GPR. Circuit Saddr (figure 5.3) provides the source address
Sas and the destination address Sad of the special purpose register file
SPR.
The two addresses of the register file SPR are usually specified by the
bits SA = IR[10:6]. However, on an rfe instruction, the exception status
ESR is copied into the status register SR. According to table 5.3, the
registers ESR and SR have address 1 and 0, respectively. Thus, circuit
Saddr selects the source address and the destination address of the
register file SPR as

\[
(Sas,\, Sad) = \begin{cases}
(SA,\, SA) & \text{if } rfe = 0 \\
(00001,\, 00000) & \text{if } rfe = 1.
\end{cases}
\]

Circuit Daddr provides the three addresses Cad, Sas and Sad at the
following cost and delay:

\[
\begin{aligned}
C_{Daddr} &= C_{Caddr} + C_{Saddr} \\
C_{Saddr} &= 2 \cdot C_{mux}(5) \\
D_{Daddr} &= \max\{ D_{Caddr},\, D_{mux}(5) \} = D_{Caddr}.
\end{aligned}
\]
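
The address selection of circuit Saddr amounts to two 5-bit multiplexers.
As a quick illustration, here is a Python sketch; the function name saddr
and the representation of the instruction word as an integer are
assumptions of this example.

```python
def saddr(ir, rfe):
    """Return (Sas, Sad) for a 32-bit instruction word ir."""
    sa = (ir >> 6) & 0x1F            # SA = IR[10:6]
    if rfe:
        return 0b00001, 0b00000      # read ESR (address 1), write SR (address 0)
    return sa, sa


print(saddr(0b00101 << 6, rfe=False))   # (5, 5)
print(saddr(0b00101 << 6, rfe=True))    # (1, 0)
```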
5.5.3 Register File Environment RFenv
The DLX architecture now comprises two register files, one for the general
purpose registers GPR and one for the special purpose registers SPR. Both
register files form the environment RFenv.

\[
C_{RFenv} = C_{GPRenv} + C_{SPRenv}
\]

The environment GPRenv of the general purpose register file has the same
functionality as before. The additional SPR registers are held in a register
file with an extended access mode. The special move instructions movi2s
and movs2i access these registers as a regular register file which permits
one read and one write operation simultaneously. However, on JISR all
registers are read and updated in parallel. Before describing the
environment SPRenv in detail, we first introduce a special register file
with such an extended access mode.

Special Register File
A $K \times n$ special register file SF comprises $K$ registers, each of
which is $n$ bits wide. The file SF can be accessed like a regular two-port
register file:

• the flag w specifies whether a write operation should be performed,

• the addresses adr and adw specify the read and write address of the
  register file, and

• Din and Dout specify the data input and output of the register file.

In addition, the special register file SF provides a distinct write and read
port for each of its registers. For any register SF[r],

• Do[r] specifies the output of its distinct read port, and

• Di[r] specifies the data to be written into register SF[r] on an active
  write flag w[r].

In case of an address conflict, such a special write takes precedence over
the regular write access specified by address adw. Thus, the data d[r] to be
written into SF[r] equals

\[
d[r] = \begin{cases}
Di[r] & \text{if } w[r] = 1 \\
Din & \text{otherwise.}
\end{cases}
\]

The register is updated in case of w[r] = 1 and in case of a regular write
to address r:

\[
ce[r] = w[r] \vee w \wedge (adw = r) \qquad (5.3)
\]

Figure 5.4: Special register file SF of size (K × n)

Figure 5.5: Address decoder AdDec of an SF register file

We do not specify the output Dout of the special purpose register file if a
register is updated and read simultaneously.

Realization
Figure 5.4 depicts an example realization of a special register file SF of
size ($K \times n$). The multiplexer in front of register SF[r] selects the
proper input depending on the special write flag w[r].
The address decoder circuit AdDec in figure 5.5 contains two $k$-bit
decoders ($k = \lceil \log K \rceil$). The read address adr is decoded into
the select bits sl[K−1:0]. Based on this decoded address, the select
circuit DataSel selects the proper value of the standard data output Dout.
For that purpose, the data Do[r] are masked by the select bit sl[r]. The
masked data are then combined by $n$ OR-trees in a bit sliced manner:

\[
Dout[j] = \bigvee_{r=0}^{K-1} Do[r][j] \wedge sl[r]
\]

The write address adw is decoded into $K$ select bits. The clock signals of
the $K$ registers are generated from these signals according to equation
5.3. Thus, the cost of the whole register file SF runs at

\[
\begin{aligned}
C_{SF}(K, n) = {} & K \cdot (C_{ff}(n) + C_{mux}(n)) + C_{AdDec}(K) \\
 & + n \cdot C_{or} \cdot C_{tree}(K) + K \cdot C_{and}(n) \\
C_{AdDec}(K) = {} & 2 \cdot C_{dec}(\lceil \log K \rceil) + C_{and}(K) + C_{or}(K).
\end{aligned}
\]

Figure 5.6: Environment SPRenv of the DLXΣ design

The distinct read ports have a zero delay, whereas the standard output Dout
is delayed by the address decoder and the select circuit:

\[
\begin{aligned}
D_{SF}(Do[r]) &= 0 \\
D_{SF}(Dout) &= D_{dec}(\lceil \log K \rceil) + D_{and} + D_{or} \cdot D_{tree}(K).
\end{aligned}
\]

On a write access, the special register file has an access time of
$D_{SFw}$, and the write signals w and w[r] delay the clock signals by
$D_{SF}(w; ce)$:

\[
\begin{aligned}
D_{SFw} &= \max\{ D_{mux}(n),\; D_{dec}(\lceil \log K \rceil) + D_{and} + D_{or} \} + D_{ff} \\
D_{SF}(w; ce) &= D_{and} + D_{or}.
\end{aligned}
\]
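
The behaviour of the special register file is easy to get wrong when a
regular and a special write collide, so a small executable model may help.
The Python class below is a behavioural sketch only; the class and method
names are assumptions of this example, and registers are modelled as plain
integers rather than as gates and flipflops.

```python
class SpecialRegisterFile:
    """Behavioural model of a K x n special register file SF."""

    def __init__(self, K, n):
        self.K = K
        self.mask = (1 << n) - 1
        self.regs = [0] * K

    def read(self, adr):
        """Standard read port Dout: AND-OR selection over the one-hot bits sl[r]."""
        sl = [r == adr for r in range(self.K)]
        out = 0
        for r in range(self.K):
            out |= self.regs[r] if sl[r] else 0
        return out

    def distinct_outputs(self):
        """Distinct read ports Do[K-1:0]."""
        return list(self.regs)

    def clock(self, w, adw, din, w_special, di):
        """One clock edge; w_special[r] and di[r] model the distinct write ports."""
        for r in range(self.K):
            ce = w_special[r] or (w and adw == r)      # equation (5.3)
            if ce:
                d = di[r] if w_special[r] else din     # special write takes precedence
                self.regs[r] = d & self.mask


sf = SpecialRegisterFile(6, 32)
sf.clock(w=True, adw=2, din=7, w_special=[False] * 6, di=[0] * 6)
print(sf.read(2))    # 7
```

A regular write to register 2 that coincides with a special write to the
same register leaves the special data Di[2] in the register, as required by
the precedence rule.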

Environment SPRenv
The core of the special purpose register environment SPRenv (figure 5.6) is
a special register file of size 6 × 32. The names of these registers
SPR[5:0] are listed in table 5.3. The environment is controlled by the
write signals SPRw and SPRw[5:0], and by the signals JISR, repeat, and sel.
The standard write and read ports are only used on the special move
instructions movs2i and movi2s and on an rfe instruction. The standard
data output of the register file equals

\[
Sout_i = SPR[Sas_i]_{i-1},
\]

and in case of a write request SPRw = 1, the register file is updated as

\[
SPR[Sad_i]^u_i := C4_i.
\]
According to the specification of section 5.2, the SPR registers must also
be updated on a trap instruction and on a jump to the ISR. These updates
are performed via the six distinct write ports of the special register file.
Since a trap instruction always triggers an interrupt, i.e., $trap_i = 1$
implies $JISR_i = 1$, the SPR registers only require a special write on
JISR. The write signals are therefore set to

\[
SPRw[r] = JISR.
\]

On JISR, the status register SR is cleared. Register ECA buffers the
masked cause MCA, and register Edata gets the content of C.4. On a trap,
C.4 provides the trap constant, and on a load or store, it provides the
effective memory address:

\[
(Di[0],\, Di[2],\, Di[5]) = (0,\, MCA,\, C.4).
\]

The selection of input Di[1] is more complicated. If instruction $I_i$ is
interrupted, the new value of ESR depends on the type of the interrupt and
on the type of $I_i$:

\[
Di[1]_i = \begin{cases}
SR_{i-1} & \text{if } il_i \text{ is of type repeat} \\
SR^u_i & \text{if } il_i \text{ is of type continue} \\
\ast \ (\text{arbitrary}) & \text{if } il_i \text{ is of type abort,}
\end{cases}
\]

where

\[
SR^u_i = \begin{cases}
C4_i & \text{if } SPRw_i \wedge (Sad_i = 0) \\
SR_{i-1} & \text{otherwise.}
\end{cases}
\]

The environment SPRenv selects the proper input

\[
Di[1]_i = \begin{cases}
C4_i & \text{if } sel_i = 1 \\
SR_{i-1} & \text{otherwise,}
\end{cases}
\]

with

\[
sel = \overline{repeat} \wedge SPRw \wedge (Sad = 0).
\]

According to the specification of JISR, if instruction $I_i$ is
interrupted, the two exception PCs have to be set to

\[
(EPC,\, EDPC)_i = \begin{cases}
(PC',\, DPC)_{i-1} & \text{if } il_i \text{ is of type repeat} \\
(PC',\, DPC)^u_i & \text{if } il_i \text{ is of type continue,}
\end{cases}
\]

whereas on an abort interrupt, the values of the exception PCs do not
matter. Environment PCenv generates the values $PC'^u_i$, $DPC^u_i$, and

\[
DDPC^u_i = DPC_{i-1},
\]

which are then passed down the pipeline together with instruction $I_i$.
Except on an rfe instruction,

\[
DPC^u_i = PC'_{i-1},
\]

but due to the software constraints, rfe can only be interrupted by reset
which aborts the execution. Thus, the inputs of the two exception PCs can
be selected as

\[
(Di[3],\, Di[4]) = \begin{cases}
(PC.4,\, DPC.4) & \text{if } repeat = 0 \\
(DPC.4,\, DDPC.4) & \text{if } repeat = 1.
\end{cases}
\]
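
Taken together, the six special write inputs on JISR can be summarized in
one function. The following Python sketch mirrors circuit SPRsel and the
glue logic for sel; the function name and the argument list are assumptions
of this example, and the register order SR, ESR, ECA, EPC, EDPC, Edata is
the one shown in figure 5.6.

```python
def spr_special_inputs(repeat, sprw, sad, sr, mca, c4, pc4, dpc4, ddpc4):
    """Return [Di[0], ..., Di[5]] for the registers SR, ESR, ECA, EPC, EDPC, Edata."""
    # sel = /repeat AND SPRw AND (Sad = 0): the interrupted instruction
    # itself writes SR and the interrupt is not of type repeat.
    sel = (not repeat) and sprw and sad == 0
    di1 = c4 if sel else sr                 # new ESR: updated or old status
    if repeat:
        di3, di4 = dpc4, ddpc4              # PCs of the instruction to be repeated
    else:
        di3, di4 = pc4, dpc4                # PCs of the following instruction
    return [0, di1, mca, di3, di4, c4]


print(spr_special_inputs(False, True, 0, sr=0xFF, mca=0x20, c4=0x0F,
                         pc4=0x108, dpc4=0x104, ddpc4=0x100))
```

On an abort interrupt the values delivered for ESR, EPC and EDPC are simply
whatever the repeat/continue selection happens to produce.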
Environment SPRenv consists of a special register file, of circuit SPRsel
which selects the inputs of the distinct write ports, and of the glue logic
which generates signal sel. Thus, the cost runs at

\[
\begin{aligned}
C_{SPRenv} &= C_{SF}(6, 32) + C_{SPRsel} + C_{SPRglue} \\
C_{SPRsel} &= 3 \cdot C_{mux}(32) \\
C_{SPRglue} &= C_{zero}(3) + 2 \cdot C_{and} + C_{inv}.
\end{aligned}
\]

All the data inputs are directly provided by registers at zero delay. Let
the control inputs of the environment have a delay of $A_{CON}(csSPR)$. The
output Sin and the inputs Di then have an accumulated delay of

\[
\begin{aligned}
A_{SPRenv}(Sin) &= D_{SF}(Dout) \\
A_{SPRenv}(Di) &= \max\{ A_{CON}(csSPR),\; D_{zero}(3) + 2 \cdot D_{and} \} + D_{mux},
\end{aligned}
\]

and the write access requires a cycle time of at most

\[
T_{SPRenv} = A_{SPRenv}(Di) + D_{SFw} + \delta.
\]

5.5.4 Modified Data Paths

The decode stage ID gets the new output register S. The two opcodes
IR[31:26] and IR[5:0] and the destination address Cad of the general
purpose register file are provided by stage ID, but they are also used by
later stages. As before, these data are therefore passed down the pipeline,
and they are buffered in each stage. Due to the interrupt handling, stage WB
now also requires the three PCs and the address Sad of the register file
SPR. Like the opcodes and the address Cad, these data wander down the
pipeline together with the instruction. That requires additional buffering
(figure 5.7); its cost runs at

\[
C_{buffer} = C_{ff}(22) + 2 \cdot C_{ff}(22 + 3 \cdot 32).
\]

Figure 5.7: Buffering

The interrupt handling has no impact on the instruction register
environment IRenv, which extracts the immediate operand co, nor on the
shifter environment SH4Lenv.

Execute Environment EXenv
The execute environment EXenv of figure 5.8 still comprises the ALU
environment and the shifter SHenv and connects them to the operand and
result busses. The three operand busses are controlled as before, and the
outputs sh and ovf also remain the same.
The only modification is that the result D is now selected among six
values. Besides the register value link and the results of the ALU and the
shifter, environment EXenv can also put the constant co or the operands S
or A on the result bus:

\[
D = \begin{cases}
link & \text{if } linkDdoe = 1 \\
alu & \text{if } ALUDdoe = 1 \\
sh & \text{if } SHDdoe = 1 \\
co & \text{if } coDdoe = 1 \\
A & \text{if } ADdoe = 1 \\
S & \text{if } SDdoe = 1.
\end{cases}
\]

The result D = co is used in order to pass the trap constant down the
pipeline, whereas the result D = A is used on the special move instruction
movi2s. D = S is used on rfe and movs2i.
The selection of D now requires two additional tristate drivers, but that
has no impact on the delay of the environment. The cost of EXenv is

\[
C_{EXenv} = C_{ALUenv} + C_{SHenv} + 2 \cdot C_{mux}(32) + 6 \cdot C_{driv}(32).
\]

Figure 5.8: Execute environment EXenv with interrupt support

Instruction Memory Environment IMenv


The environment IMenv of the instruction memory is controlled by a single
control signal Imr. The address is still specified by register DPC, but the
memory IM has a slightly extended functionality. In addition to the data
output IMout and the busy flag ibusy, IM provides a second status flag ipf.
The flag ipf indicates that the memory is unable to perform the requested
access due to a page fault. The flag ibusy indicates that the memory
requires at least one more cycle in order to complete the requested access.
Both flags are inactive if the memory IM does not perform an access. In
case of a successful access (ibusy = ipf = 0), the instruction memory IM
provides the requested memory word at the data output IMout, and otherwise
it provides an arbitrary but fixed binary value IMdefault:

\[
IMout = \begin{cases}
IMword[\langle DPC[31:2]00 \rangle] & \text{if } Imr \wedge \overline{ibusy} \wedge \overline{ipf} \\
IMdefault & \text{otherwise.}
\end{cases}
\]

The instruction memory control IMC checks for a misaligned access.
The 4-byte instruction fetch is misaligned if the address is not a multiple
of four:

\[
imal = DPC[0] \vee DPC[1].
\]

Let $d_{Istat}$ denote the status time of the instruction memory. Since the
address is directly taken from a register, the status flags imal, ibusy and
ipf are provided at the following cost and accumulated delay:

\[
\begin{aligned}
C_{IMC} &= C_{or} \\
A_{IMenv}(flags) &= \max\{ D_{or},\, d_{Istat} \}.
\end{aligned}
\]

Data Memory Environment DMenv


The environment DMenv still consists of the data memory DM and the
memory controller DMC. The memory DM performs the actual load or
store access, whereas the controller DMC generates the four bank write
signals Dmbw[3:0] and checks for misalignment.
Except for the data output DMout and an additional flag dpf, the
functionality of the data memory DM itself remains the same. The flag dpf
indicates that the memory is unable to perform the requested access due to
a page fault. If the memory DM detects a page fault (dpf = 1), it cancels
the ongoing access. Thus, the memory itself ensures that it is not updated
by a store instruction which causes a page fault. The flags dpf and dbusy
are inactive if the memory performs no access (Dmr.3 = Dmw.3 = 0). On a
successful read access, the data memory DM provides the requested memory
word, and otherwise it provides a fixed value DMdefault:

\[
DMout = \begin{cases}
DMword[\langle MAR[31:2]00 \rangle] & \text{if } Dmr \wedge \overline{dbusy} \wedge \overline{dpf} \\
DMdefault & \text{otherwise.}
\end{cases}
\]

Memory Control DMC. In addition to the bank write signals, the memory
controller DMC now provides signal dmal which indicates a misaligned
access.
The bank write signals Dmbw[3:0] are generated as before (page 81). In
addition, this circuit DMbw provides the signals B (byte), H (half word),
and W (word) which indicate the width of the memory access, and the
signals B[3:0] satisfying

\[
B[j] = 1 \iff \langle s[1:0] \rangle = j.
\]

A byte access is always properly aligned. A word access is only aligned
if it starts in byte 0, i.e., if B[0] = 1. A half word access is misaligned
if it starts in byte 1 or 3. Flag dmal signals that an access to the data
memory is requested, and that this access is misaligned (malAc = 1):

\[
\begin{aligned}
dmal &= (Dmr.3 \vee Dmw.3) \wedge malAc \\
malAc &= W \wedge \overline{B[0]} \;\vee\; H \wedge (B[1] \vee B[3]).
\end{aligned}
\]
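
The alignment rules can be checked exhaustively with a few lines of code.
The Python function below is a sketch of the two equations above; the
function name, the encoding of the access width as a one-letter string and
the integer s for the byte offset s[1:0] are assumptions of this example.

```python
def dmal(dmr, dmw, width, s):
    """Misalignment flag; width is "B", "H" or "W", s is the byte offset 0..3."""
    b = [s == j for j in range(4)]                     # one-hot signals B[3:0]
    W, H = width == "W", width == "H"
    mal_ac = (W and not b[0]) or (H and (b[1] or b[3]))
    return bool((dmr or dmw) and mal_ac)


for width in "BHW":
    print(width, [dmal(True, False, width, s) for s in range(4)])
```

The loop prints one row per access width; bytes are never misaligned, half
words are misaligned at the odd offsets, and words at every offset except 0.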
The cost $C_{DMC}$ of the memory controller is increased by some gates, but
the delay $D_{DMC}$ of the controller remains unchanged:

\[
\begin{aligned}
C_{DMC} &= C_{DMbw} + C_{inv} + 3 \cdot C_{and} + 3 \cdot C_{or} \\
 &= C_{dec}(2) + 3 \cdot C_{inv} + 15 \cdot C_{and} + 8 \cdot C_{or}.
\end{aligned}
\]

Let $A_{CON}(csM)$ denote the accumulated delay of the signals Dmr and
Dmw. The cycle time of the data memory environment and the delay of
its flags can then be expressed as

\[
\begin{aligned}
T_M &= A_{CON}(csM) + D_{DMC} + d_{Dmem} + \Delta \\
A_{DMenv}(flags) &= A_{CON}(csM) + D_{DMC} + d_{Dstat}.
\end{aligned}
\]

Figure 5.9: Schematics of the cause environment CAenv

5.5.5 Cause Environment CAenv

The cause environment CAenv (figure 5.9) performs two major tasks:

• Its circuit CAcol collects the interrupt events and clocks them into
  the cause register.

• It processes the caught interrupt events and initiates the jump to the
  ISR. This cause processing circuit CApro generates the flags jisr
  and repeat, and provides the masked interrupt cause MCA.

Cause Collection

The internal interrupt events are generated by the data paths and the
control unit, but the stage in which a particular event is detected depends
on the event itself (table 5.5).
The instruction memory and its controller IMC provide the flags ipf and
imal which indicate a page fault or misaligned access on fetch. The flag
dmal, generated by the controller DMC, signals a misaligned data memory
access. The flags dmal and imal are combined to the event flag mal. In
the memory stage, the flag dpf of the data memory signals a page fault on
load/store.
A trap and an illegal instruction ill are detected by the control unit. This
will be done in stage EX in order to keep the automaton simple (see page
208). The ALU provides the overflow flag ovf, but an arithmetical overflow
should only be reported for an add or subtract instruction with overflow
detection (states aluo and aluIo of the control automaton). Such an
instruction is indicated by the control signal ovf? which activates the
overflow check.

Table 5.5: Assignment of internal interrupt events. It is listed in which
stage an event signal is generated and by which unit.

event   signal       stage   unit
ill     ill          EX      control unit
mal     imal         IF      instruction memory control IMC
        dmal         M       data memory control DMC
pff     ipf          IF      instruction memory environment IMenv
pfls    dpf          M       data memory environment DMenv
trap    trap         EX      control unit
ovf     ovf ∧ ovf?   EX      ALU environment, control unit

Since the interrupt event signals are provided by several pipeline stages,
the cause register CA cannot be assigned to a single stage. Register CA is
therefore pipelined: CA.i collects the events which an instruction triggers
up to stage i. That takes care of internal events. External events could be
caught at any stage, but for a shorter response time, they are assigned to
the memory stage.
The control signals of the stage EX are precomputed. The cycle time of
the cause collection CAcol, the accumulated delay of its output CA.3', and
its cost can be expressed as:

\[
\begin{aligned}
T_{CAcol} &= \max\{ A_{IMenv}(flags),\; A_{ALUenv}(ovf) + D_{and} \} + \Delta \\
A_{CAcol}(CA.3') &= \max\{ A_{DMenv}(flags),\; A_{DMC} + D_{or} \} \\
C_{CAcol} &= C_{and} + C_{or} + 9 \cdot C_{ff}.
\end{aligned}
\]

Cause Processing
The masked cause mca (figure 5.10) is obtained by masking the maskable
interrupt events CA.3' with the corresponding bits of the status register
SR. The flag jisr is raised if mca is different from zero, i.e., if at
least one bit mca[i] equals one.

\[
mca[i] = \begin{cases}
CA.3'[i] \wedge SR[i] & \text{if } i \ge 6 \\
CA.3'[i] & \text{otherwise}
\end{cases}
\qquad\qquad
jisr = \bigvee_{i=0}^{31} mca[i]
\]

A repeat interrupt is signaled if one of the page faults is the event of
highest priority among all the interrupt events $j$ with mca[j] = 1:

\[
repeat = \overline{mca[0]} \wedge \overline{mca[1]} \wedge \overline{mca[2]} \wedge (mca[3] \vee mca[4]).
\]

Figure 5.10: Cause processing circuit CApro

Circuit CAtype generates flag repeat according to this equation. At the
end of the cycle, the masked cause and the two flags jisr and repeat are
clocked into registers. The cost and cycle time of the cause processing
circuit CApro can be estimated as

\[
\begin{aligned}
C_{CApro} &= C_{and}(26) + C_{tree}(32) \cdot C_{or} + C_{ff}(34) + C_{CAtype} \\
C_{CAtype} &= 3 \cdot C_{or} + C_{and} + C_{inv} \\
D_{CApro} &= D_{and} + D_{tree}(32) \cdot D_{or} \\
T_{CApro} &= A_{CAcol}(CA.3') + D_{CApro} + \Delta.
\end{aligned}
\]

The cost and cycle time of the whole cause environment CAenv run at

\[
\begin{aligned}
C_{CAenv} &= C_{CAcol} + C_{CApro} \\
T_{CAenv} &= \max\{ T_{CAcol},\, T_{CApro} \}.
\end{aligned}
\]
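
The cause processing can be modelled on 32-bit integers, with bit j standing
for interrupt j. The Python sketch below is illustrative only. It assumes
that the two page faults pff and pfls are the events 3 and 4, as suggested
by the wiring of figure 5.9, and that the interrupts 6 to 31 are the
maskable ones; the function name cause_processing is hypothetical.

```python
MASKABLE = 0xFFFFFFC0          # interrupts 31..6 are maskable (assumption, see text)
ALL_BITS = 0xFFFFFFFF


def cause_processing(ca3, sr):
    """Return (mca, jisr, repeat) for cause bits ca3 and status register sr."""
    non_maskable = ca3 & ~MASKABLE & ALL_BITS     # bits 5..0 pass unchanged
    maskable = ca3 & sr & MASKABLE                # bits 31..6 are gated by SR
    mca = non_maskable | maskable
    jisr = mca != 0

    def bit(j):
        return (mca >> j) & 1

    # repeat: a page fault (assumed events 3 and 4) has the highest priority
    repeat = (not (bit(0) or bit(1) or bit(2))) and bool(bit(3) or bit(4))
    return mca, jisr, repeat


print(cause_processing(1 << 6, 0))          # masked overflow: no jump to the ISR
print(cause_processing(1 << 6, 1 << 6))     # unmasked overflow: jisr, not repeat
print(cause_processing((1 << 3) | (1 << 5), 0))   # page fault beats trap: repeat
```

The third call shows why repeat depends only on the five lowest bits: a trap
that coincides with a page fault on fetch has lower priority and does not
change the type of the interrupt.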

5.5.6 Control Unit

As in the previous designs, the control unit basically comprises two
circuits:

• The control automaton generates the control signals of the data paths
  based on an FSD. These signals include the clock and write request
  signals of the registers and RAMs.

• The stall engine schedules the instruction execution. It determines
  the stage which currently executes the instruction and enables the
  update of its registers and RAMs.

The control automaton must be adapted to the extended instruction set, but
the new instructions have no impact on the stall engine. Nevertheless, the
DLXΣ design requires a new stall engine, due to the ISR call mechanism.

Figure 5.11: Stall engine of the sequential DLX design with interrupt hardware

Stall Engine of the DLXΣ Design
There is still one central clock CE for the whole DLXΣ design. The stall
engine (figure 5.11) clocks the stages in a round robin fashion based on the
vector full[4:0]. This vector is initialized on reset and shifted cyclically
on every clock CE. However, in the first cycle after reset, the execution is
now started in stage WB:

\[
full[4:0] := \begin{cases}
10000 & \text{if } reset \\
cls(full) & \text{if } CE \wedge \overline{reset} \\
full & \text{otherwise.}
\end{cases}
\]

The update enable bit ue.i enables the update of the output registers
of stage i. During reset, all the update enable flags are inactive:

\[
ue[4:0] = full[4:0] \wedge CE \wedge \overline{reset}.
\]

A jump to the interrupt service routine is only initiated if the flag jisr.4
is raised and if the write back stage is full:

\[
JISR = jisr.4 \wedge full.4.
\]

Thus, a dummy instruction can never initiate a jump to the ISR.
On reset, the flags CA.3'[0] and jisr are raised. However, a jump to the
ISR can only be initiated in the following cycle if the global clock CE and
the clock CA4ce of the cause processing circuit are also active on reset:

\[
\begin{aligned}
CE &= (Ibusy \;\mathrm{NOR}\; Dbusy) \vee reset \\
CA4ce &= ue.3 \vee reset.
\end{aligned}
\]

As before, the clock CE is stalled if one of the memories is busy. In order
to avoid unnecessary stalls, the busy flags are only considered in case of
a successful memory access. Since the memories never raise their flags
when they are idle, the flags Ibusy and Dbusy can be generated as

\[
\begin{aligned}
Ibusy &= ibusy \wedge full.0 \wedge (imal \;\mathrm{NOR}\; ipf) \\
Dbusy &= dbusy \wedge full.3 \wedge (dmal \;\mathrm{NOR}\; dpf).
\end{aligned}
\]

The interrupt mechanism requires that the standard write to a register file
or to the memory is canceled on a repeat interrupt. Since the register files
GPR and SPR belong to stage WB, their protection is easy. Thus, the write
signals of the two register files are set to

\[
\begin{aligned}
GPRw' &= GPRw \wedge ue.4 \wedge (JISR \;\mathrm{NAND}\; repeat) \\
SPRw' &= SPRw \wedge ue.4 \wedge (JISR \;\mathrm{NAND}\; repeat).
\end{aligned}
\]

For the data memory, the protection is more complicated because the
memory DM is accessed prior to the cause processing. There are only two
kinds of repeat interrupts, namely the two page faults pff and pfls; both
interrupts are non-maskable. Since the interrupt event pfls is provided by
the memory DM, the memory system DM itself must cancel the update if
it detects a page fault. The other type of page fault (ev[2] = pff) is
already detected during fetch. We therefore redefine the write signal Dmw as

\[
Dmw.3 := Dmw.2 \wedge \overline{CA.2[2]}.
\]

As before, the memory update is disabled if the memory stage is empty:

\[
Dmw'.3 = Dmw.3 \wedge full.3.
\]

Signal Dmw'.3 is used by the memory controller DMC in order to generate
the bank write signals.
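
The interplay of the full vector, the clock CE and the update enable bits is
easiest to follow in a cycle-by-cycle model. The Python sketch below is a
behavioural illustration, not the circuit of figure 5.11: the function names
are assumptions of this example, and the vector full[4:0] is stored as a
list whose index is the stage number.

```python
def cls(full):
    """Cyclic left shift of full[4:0]: the bit of stage i moves to stage i+1 mod 5."""
    return [full[4]] + full[:4]


def stall_engine_step(full, reset, ibusy, dbusy, imal, ipf, dmal, dpf):
    """Return (next_full, ue, CE) for one cycle; full[i] belongs to stage i."""
    Ibusy = ibusy and full[0] and not (imal or ipf)
    Dbusy = dbusy and full[3] and not (dmal or dpf)
    CE = (not (Ibusy or Dbusy)) or reset
    ue = [f and CE and not reset for f in full]
    if reset:
        nxt = [False, False, False, False, True]     # full[4:0] = 10000
    elif CE:
        nxt = cls(full)
    else:
        nxt = list(full)
    return nxt, ue, CE


def protected_write(w, ue4, jisr, repeat):
    """GPRw' and SPRw': the standard write is canceled on a repeat interrupt."""
    return w and ue4 and not (jisr and repeat)


idle = dict(ibusy=False, dbusy=False, imal=False, ipf=False, dmal=False, dpf=False)
full, ue, ce = stall_engine_step([False] * 5, reset=True, **idle)
print(full)    # only stage WB (index 4) is full after reset
full, ue, ce = stall_engine_step(full, reset=False, **idle)
print(full)    # one clock later the token has moved on to stage IF (index 0)
```

A busy instruction memory with full[0] set keeps CE low, so the full vector
and all registers hold their values until the access completes.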
The remaining clock and write signals are enabled as before. With this
stall engine, a reset brings up the DLXΣ design no matter in which state the
hardware has been before:

Lemma. Let $T$ be the last machine cycle in which the reset signal is
active. In the next machine cycle, the DLXΣ design then signals a reset
interrupt and performs a jump to the ISR:

\[
reset^T = 1 \;\wedge\; reset^{T+1} = 0 \;\Longrightarrow\; JISR^{T+1} = 1 \text{ and } MCA[0]^{T+1} = 1.
\]

Proof. Since the global clock is generated as

\[
CE = (Ibusy \;\mathrm{NOR}\; Dbusy) \vee reset,
\]

the DLXΣ design is clocked whenever the reset signal is active, and
especially in cycle $T$. Due to reset, the flags full[4:0] get initialized

\[
full[4:0]^{T+1} = 10000,
\]

and the clock enable signal for the output registers of CApro is

\[
CA4ce^T = ue.3^T \vee reset^T = 1.
\]

Hence, the output registers of the cause processing circuit are updated at
the end of cycle $T$ with the values

\[
\begin{aligned}
MCA[0]^{T+1} &= mca[0]^T = reset^T \\
jisr.4^{T+1} &= \bigvee_{j=0}^{31} mca[j]^T = 1.
\end{aligned}
\]

Consequently,

\[
JISR^{T+1} = jisr.4^{T+1} \wedge full.4^{T+1} = 1,
\]

and ISR(0) is invoked in cycle $T + 1$.

Control Automaton
The control automaton is constructed as for the DLXσ design without in-
terrupt handling (section 4.2.3). The automaton is modeled by a sequential
FSD which is then transformed into precomputed control:

The control signals of stage IF and the Moore signals of ID are al-
ways active, whereas the Mealy signals of stage ID are computed in
every cycle.
/
  '
The control signals of the remaining stages are precomputed during
I NTERRUPT ID. This is possible because all their states have an outdegree of one.
H ANDLING There are three types of signals: signals x are only used in stage EX,
signals y are used in stage EX and M, and signals z are used in all
three stages.

However, there are three modifications. The automaton must account for
the 8 new instructions (table 5.4). It must check for an illegal opcode,
i.e., whether the instruction word codes a DLX instruction or not. Unlike
the DLXσ design, all the data paths registers invisible to the assembler
programmer (i.e., all the registers except for PC’, DPC, and the two register
files) are now updated by every instruction. For all these registers, the
automaton just provides the trivial clock request signal 1.
The invisible registers of the execute stage comprise the data registers
MAR and MDRw and the buffers IR.3, Cad.3, Sad.3, PC.3, DPC.3, and
DDPC.3. By default, these registers are updated as

IR3 Cad 3 Sad 3 : IR2 Cad 2 Sad 2


PC3 DPC3 DDPC3 : PC2 DPC2 DDPC2
MAR MDRw : A shi f t A co4 : 0

Besides the buffers, the invisible registers of the memory stage comprise
the data registers C.4 and MDRr. Their default update is the following:

IR4 Cad 4 Sad 4 : IR3 Cad 3 Sad 3


PC4 DPC4 DDPC4 : PC3 DPC3 DDPC3
C4 MDRr : MAR DMde f ault 

The automaton is modeled by the FSD of figure 5.12. Tables 5.6
and 5.7 list the RTL instructions; the update of the invisible registers is
only listed if it differs from the default. Note that in the stages M and
WB, an rfe is processed like a special move movi2s. Table 5.8 lists the
nontrivial disjunctive normal forms, and table 5.10 lists the parameters of
the automaton.
In stage ID, only the selection of the program counters and of the
constant is extended. This computation requires two additional Mealy
signals rfe.1 and Jimm. In stage EX, the automaton now also has to check
for illegal instructions; in case of an undefined opcode, the automaton gets
into state Ill. Since this state has the largest indegree, Ill serves as the new
initial state. State noEX is used for all legal instructions which already
finish their actual execution in stage ID, i.e., the branches beqz and bnez
and the two jumps j and jr.
Figure 5.12 FSD of the DLXΣ design with interrupt handling. States by stage:
IF: fetch
ID: decode
EX: addrL, alu, aluI, shift, test, savePC, mi2s, Ill, trap, ms2i, aluo, aluIo, shiftI, testI, rfe, addrS, noEX
M: load, passC, mi2sM, ms2iM, store, noM
WB: sh4l, wb, mi2sW, noWB

Table 5.6 RTL instructions of the stages IF and ID

stage  RTL instruction                            type of I          signals
IF     IR.1 = IM[DPC]                                                fetch, IRce
ID     A = A' = RS1, AEQZ = zero(A'),                                Ace, Bce, PC'ce, DPCce
       B = RS2, PC' = reset ? 4 : pc',
       DPC = reset ? 0 : dpc
       S = SPR[Sas]                                                  Sce
       link = PC' + 4, DDPC = DPC,                                   PCce
       IR.2 = IR.1, Sad.2 = Sad
       co = constant(IR.1)                        j, jal, trap       Jimm
                                                  slli, srli, srai   shiftI
                                                  otherwise
       (pc', dpc) = nextPC(PC', A', co, EPCs)     rfe                rfe.1
                                                  j, jal             jump
                                                  jr, jalr           jumpR, jump
                                                  beqz               branch, bzero
                                                  bnez               branch
                                                  otherwise
       Cad = Caddr(IR.1)                          jal, jalr          Jlink
                                                  R-type             Rtype
                                                  otherwise
       (Sas, Sad) = Saddr(IR.1)                   rfe                rfe.1
                                                  otherwise

Table 5.7 RTL instructions of the stages EX, M, and WB. The update of the
invisible registers is only listed if it differs from the default.

stage  state   RTL instruction                                     control signals
EX     alu     MAR = A op B, MDRw = shift(A, B[4:0])               ALUDdoe, Rtype, bmuxsel
       aluo    MAR = A op B, overflow?, MDRw = shift(A, B[4:0])    ALUDdoe, Rtype, ovf?, bmuxsel
       test    MAR = (A rel B ? 1 : 0), MDRw = shift(A, B[4:0])    ALUDdoe, test, Rtype, bmuxsel
       shift   MAR = MDRw = shift(A, B[4:0])                       SHDdoe, Rtype, bmuxsel
       aluI    MAR = A op co                                       ALUDdoe
       aluIo   MAR = A op co, overflow?                            ALUDdoe, ovf?
       testI   MAR = (A rel co ? 1 : 0)                            ALUDdoe, test
       shiftI  MAR = shift(A, co[4:0])                             SHDdoe, shiftI, Rtype
       savePC  MAR = link                                          linkDdoe
       addrL   MAR = A + co                                        ALUDdoe, add
       addrS   MAR = A + co, MDRw = cls(B, MAR[1:0]000)            ALUDdoe, add, amuxsel, shift4s
       trap    MAR = co, trap = 1                                  coDdoe, trap
       Ill     MAR = A, ill = 1                                    ADdoe, ill
       rfe     MAR = S                                             SDdoe
       ms2i    MAR = S                                             SDdoe
       mi2s    default updates                                     ADdoe
       noEX
M      load    MDRr = Mword[MAR[31:2]00]                           Dmr
       store   m = bytes(MDRw)                                     Dmw
       others  default updates
WB     sh4l    GPR[Cad.4] = sh4l(MDRr, MAR[1:0]000)                shift4l, GPRw
       wb      GPR[Cad.4] = C.4                                    GPRw
       mi2sW   SPR[Sad.4] = C.4                                    SPRw
       noWB    no update


Table 5.8 Nontrivial disjunctive normal forms of the DLXΣ control automaton

stage DNF state/signal IR[31:26] IR[5:0] length
EX D1 alu 000000 1001** 10
000000 100**1 10
D2 aluo 000000 1000*0 11
D3 aluI 0011** ****** 4
001**1 ****** 4
D4 aluIo 0010*0 ****** 5
D5 shift 000000 0001*0 11
000000 00011* 11
D6 shiftI 000000 0000*0 11
000000 00001* 11
D7 test 000000 101*** 9
D8 testI 011*** ****** 3
D9 savePC 010111 ****** 6
000011 ****** 6
D10 addrS 10100* ****** 5
1010*1 ****** 5
D11 addrL 100*0* ****** 4
1000*1 ****** 5
10000* ****** 5
D12 mi2s 000000 010001 12
D13 ms2i 000000 010000 12
D14 trap 111110 ****** 6
D15 rfe 111111 ****** 6
D16 noEX 00010* ****** 5
000010 ****** 6
010110 ****** 6
ID D17 Rtype 000000 ****** 6
D6 shiftI 000000 0000*0 (10)
000000 00001* (10)
D9 Jlink 010111 ****** (6)
000011 ****** (6)
D18 jumpR 01011* ****** 5
D19 jump 00001* ****** 5
01011* ****** (5)
D20 branch 00010* ****** (5)
D21 bzero *****0 ****** 1
D15 rfe.1 111111 ****** (6)
D22 Jimm 00001* ****** (5)
111110 ****** (6)
Accumulated length of all nontrivial monomials 206 
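
The monomials of table 5.8 can be read as bit patterns with wildcards. The following sketch is only an illustration of that reading and not part of the design: the helper names `monomial_length`, `matches` and `dnf_active` are hypothetical, and the instruction fields are passed as bit strings. The length of a monomial is the number of positions that are not `*`.

```python
# Illustrative reading of table 5.8: a monomial is a pattern over
# IR[31:26] and IR[5:0], where '*' marks a bit that does not occur.
# All function names here are hypothetical helpers, not from the text.

def monomial_length(opcode_pat, func_pat):
    """Number of literals in the monomial (positions that are not '*')."""
    return sum(ch != "*" for ch in opcode_pat + func_pat)

def matches(opcode_pat, func_pat, ir_opcode, ir_func):
    """True if the bit strings IR[31:26] and IR[5:0] satisfy the monomial."""
    pattern = opcode_pat + func_pat
    bits = ir_opcode + ir_func
    return all(p == "*" or p == b for p, b in zip(pattern, bits))

def dnf_active(dnf, ir_opcode, ir_func):
    """A DNF is active if at least one of its monomials matches."""
    return any(matches(o, f, ir_opcode, ir_func) for o, f in dnf)

# D1 (state alu) and D2 (state aluo), copied from table 5.8
D1_ALU = [("000000", "1001**"), ("000000", "100**1")]
D2_ALUO = [("000000", "1000*0")]
```

With these definitions, `monomial_length("000000", "1001**")` evaluates to 10, which agrees with the length column of the table.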
Table 5.9 Control signals to be precomputed during stage ID

EX M WB type x signals (stage EX only)
y shift4s, Dmw trap, ADdoe ovf?
amuxsel coDdoe SDdoe add?
z Dmr shift4l linkDdoe Rtype ill
SPRw ALUDdoe bmuxsel test
GPRw SHDdoe

ill add test Rtype ovf? bmuxsel Dmw Dmr SHDdoe


shift 1 1 1
shiftI 1 1
alu 1 1
aluo 1 1 1
aluIo 1
test 1 1 1
testI 1
addrL 1 1
addrS 1 1
Ill 1
inactive in states: aluI, savePC, trap, mi2s, rfe ms2i, noEX
ALUDdoe linkDdoe trap ADdoe SDdoe SPRw GPRw
shift 1
shiftI 1
alu 1 1
aluo 1 1
aluI 1 1
aluIo 1 1
test 1 1
testI 1 1
addrL 1 1
addrS 1
savePC 1 1
trap 1
mi2s 1 1
ms2i 1 1
rfe 1 1
Ill 1
noEX 1


Table 5.10 Parameters of the two control automata; one precomputes the Moore
signals (ex) and the other generates the Mealy signals (id).

      # states   # inputs   # and frequency of outputs
      k          σ          γ     ν_sum   ν_max
ex    17         12         16    48      11
id    1          12         9     13      2

      fanin of the states     # and length of monomials
      fan_sum    fan_max      #M    l_sum   l_max
ex    26         3            26    189     12
id    –          –            4     17      10

The stages EX, M and WB are only controlled by Moore signals, which
are precomputed during decode. All their states have an outdegree of one.
It therefore suffices to consider the states of stage EX in order to generate
all these control signals. For each of these signals, table 5.9 lists its type
(i.e., x, y, or z) and the EX states in which it becomes active.

    +  
Along the lines of section 3.4 it can be shown that the DLXΣ design interprets
the extended DLX instruction set of section 5.2 with delayed PC semantics.
In the sequential DLX design without interrupt handling, any instruction
which has passed a stage k only updates output registers of stages $k' \ge k$
(lemma 4.3). In the DLXΣ design, this dateline criterion only applies to
the uninterrupted execution. If an instruction $I_i$ gets interrupted, the two
program counters PC' and DPC are also updated when $I_i$ is in the write
back stage. Furthermore, in case of a repeat interrupt, the update of the data
memory is suppressed. Thus, for the DLXΣ design, we can just formulate
a weak version of the dateline criterion:

Lemma 5.9. Let $I_\Sigma(k, T') = i$. For any memory cell or register $R \in out(t)$
different from PC' and DPC, we have

$R^{T'} = \begin{cases} R_{i-1} & \text{if } t \ge k \\ R_i & \text{if } t < k \end{cases}$

If $R \in \{PC', DPC\}$, then R is an output register of stage $t = 1$ and

$R^{T'} = \begin{cases} R_{i-1} & \text{if } k \in \{0, 1\} \\ R^u_i & \text{if } k \ge 2 \end{cases}$

If the execution of instruction $I_i$ is not interrupted, i.e., if $JISR^{T'} = 0$ with
$I_\Sigma(4, T') = i$, then $R_i = R^u_i$ for any register R.

If $I_\Sigma(4, T') = i$, then $I_\Sigma(0, T'+1) = i+1$, and lemma 5.9 implies for all R

$R^{T'+1} = R_i$

Pipelined Interrupt Hardware

As in the basic DLX design (chapter 4), the same three modifications
are sufficient in order to transform the prepared sequential design
DLXΣ into the pipelined design DLXΠ. Except for
• a modified PC environment,
• extensive hardware for result forwarding and hazard detection, and
• a different stall engine,
the DLXΣ hardware can be used without changes. Figure 5.13 depicts the
top-level schematics of the DLXΠ data paths. The modified environments
are now described in detail.

'( " 1 

Figure 5.14 depicts the PC environment of the DLXΠ design. The only
modification over the DLXΣ design is the address provided to the
instruction memory IM. As for the transformation of chapter 4, memory IM is
now addressed by the input dpc of register DPC and not by its output:

$dpc = \begin{cases} SISR & \text{if } JISR = 1 \\ EDPC & \text{if } JISR = 0 \wedge rfe.1 = 1 \\ PC' & \text{otherwise} \end{cases}$

However, the delayed program counter must be buffered for later use, and
thus, register DPC cannot be discarded.
The cost of environment PCenv and most of its delays remain the same.
The two exception PCs are now provided by the forwarding circuit FORW.
Thus,

$A_{PCenv}(dpc) = \max\{A_{JISR},\ A_{FORW}(EDPC)\} + 2 \cdot D_{mux}(32)$
$T_{PCenv} = \max\{D_{inc}(30),\ A_{IRenv}(co) + D_{add}(32),\ A_{GPRenv}(Ain),\ A_{FORW}(EPCs),\ A(bjtaken),\ A_{CON}(csID)\} + 3 \cdot D_{mux}(32) + \Delta$
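
The selection of dpc described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the function names `select_dpc` and `a_pcenv_dpc` are hypothetical, all signals are modeled as Python integers, and the delay values are supplied by the caller rather than taken from the text.

```python
# Behavioral sketch of the dpc selection in PCenv (names are hypothetical).

def select_dpc(jisr, rfe1, sisr, edpc, pc_prime):
    """Address presented to the instruction memory IM in the current cycle."""
    if jisr == 1:
        return sisr        # jump to the interrupt service routine
    if rfe1 == 1:
        return edpc        # rfe in decode: restore the delayed PC
    return pc_prime        # ordinary case

def a_pcenv_dpc(a_jisr, a_forw_edpc, d_mux32):
    """Accumulated delay of dpc: two 32-bit muxes behind the later input."""
    return max(a_jisr, a_forw_edpc) + 2 * d_mux32
```

Note that JISR takes priority over rfe.1, which mirrors the case distinction in the equation for dpc.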
Figure 5.13 Data paths of the pipelined design DLXΠ with interrupt support

Figure 5.14 Environment PCenv of the DLXΠ design

The modified PC environment also impacts the functionality and delay
of the instruction memory environment. On a successful read access, the
instruction memory now provides the memory word

$IMout = IMword[dpc[31:2]00]$

The cycle time of IMenv and the accumulated delay of its flags are

$T_{IMenv} = A_{PCenv}(dpc) + d_{Imem} + \Delta$
$A_{IMenv}(flags) = A_{PCenv}(dpc) + \max\{D_{or},\ d_{Istat}\}$

'( ,   . 

The data paths comprise two register files, GPR and SPR. Both are up-
dated during write back. Since their data are read by earlier stages, result
forwarding and interlocking is required. The two register files are treated
separately.

9 "-  8  
During movs2i instructions, data are copied from register file SPR via
register S and the C.k registers into the register file GPR. The forwarding
circuits to S have to guarantee that the uninterrupted execution of $I_i$, i.e.,

$I_\Pi(2, T) = I_\Pi(3, T+1) = I_\Pi(4, T+2) = i$

implies $S^T = S_{i-1}$. During stages EX, M and WB the data then wander
down the C.k registers like the result of an ordinary fixed point operation.
Thus we do not modify the forwarding circuits for registers A and B at all.

 "-  8  
Data from the special purpose registers are used in three places, namely
• on a movs2i instruction, SPR[Sas] is read into register S during decode,
• the cause environment reads the interrupt masks SR in the memory
  stage, and
• on an rfe instruction, the two exception PCs are read during decode.

Updates of the SPR registers are performed in three situations:
• On a movi2s instruction, value C.4 is written into register SPR[Sad].
• Register SR is updated by rfe. Recall that in stages 2 to 4, we have
  implemented this update like a regular write into SPR with write
  address Sad = 0.
• All special purpose registers are updated by JISR. Forwarding the
  effect of this looks like a nightmare. Fortunately, all instructions
  which could use forwarded versions of values forced into SPR by
  JISR get evicted from the pipe by the very same occurrence of JISR.

Therefore, one only needs to forward data from the inputs of the C.k
registers with destinations in SPR specified by Sad.

Figure 5.15 Forwarding of SPR into register S

Forwarding of S. Forwarding data with destination SPR into register S
is exactly like forwarding data with destination GPR into A or B, except
that for address ad = 0 the data are now forwarded as well. Thus,
connecting the three-stage forwarding circuit SFor(3) as depicted in figure
5.15 handles the forwarding into register S. Note that no data hazards are
introduced.

Circuit SFor. Figure 5.16 depicts a realization of the 3-stage forwarding
circuit SFor. It is derived from the circuit Forw of figure 4.18 in the
obvious way. Let $D_{SFor}(Data; 3)$ denote the delay which the data inputs
require to pass circuit SFor(3). For an n-bit address ad, the cost and delay
of SFor(3) can be modeled as

$C_{SFor}(3) = 3 \cdot C_{mux}(32) + 6 \cdot C_{and} + 3 \cdot C_{equal}(n)$
$D_{SFor}(hit) = D_{equal}(n) + D_{and}$
$D_{SFor}(Dout; 3) = D_{SFor}(hit) + 3 \cdot D_{mux}(32)$
$D_{SFor}(Data; 3) = 3 \cdot D_{mux}(32)$

Circuit SFor is slightly faster than the forwarding circuit Forw for the GPR
operands.
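
The behavior of the forwarding circuit can be sketched in a few lines. This is a behavioral model only, under the assumption (read off the mux chain of figure 5.16) that a hit in stage 2 overrides a hit in stage 3, which in turn overrides a hit in stage 4. The function name `sfor` and the tuple layout are hypothetical.

```python
# Behavioral sketch of SFor (names and data layout are hypothetical).
# Each stage j in 2..4 contributes a tuple (full.j, SPRw.j, Sad.j, C'.j).

def sfor(ad, din, stages):
    """Return Dout for read address ad.

    stages is ordered from stage 2 to stage 4. Walking the list backwards
    lets the youngest matching result (the stage closest to decode) win,
    as in a chain of muxes with Din entering at the far end.
    """
    dout = din
    for full, sprw, sad, c in reversed(stages):
        hit = full and sprw and sad == ad   # hit[j] = full.j AND SPRw.j AND (Sad.j = ad)
        if hit:
            dout = c
    return dout
```

Unlike in the GPR case, address 0 gets no special treatment here, so a pending write to SPR[0] (register SR) is forwarded like any other.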
Figure 5.16 3-stage forwarding circuit SFor(3) for an SPR register

Figure 5.17 Forwarding of EPC into register PC' (a) and of register SR into the
memory stage (b)

Forwarding of EPC. The forwarding of EPC into the program counter
PC' during rfe instructions is done by a circuit SFor(3) which is connected
as depicted in figure 5.17 (a). Note that the address input ad of the
forwarding circuit has now been tied to the fixed address 3 of the register
EPC. No data hazards are introduced.

Forwarding of SR. The forwarding of register SR into the memory
environment requires forwarding over a single stage with a circuit SFor(1)
connected as depicted in figure 5.17 (b). This circuit is obtained from
circuit SFor(3) by the obvious simplifications. It has cost and delay

$C_{SFor}(1) = C_{mux}(32) + 2 \cdot C_{and} + C_{equal}(3)$
$D_{SFor}(Dout; 1) = D_{SFor}(hit) + D_{mux}(32)$
$D_{SFor}(Data; 1) = D_{mux}(32)$

Again, no data hazards are introduced.

Forwarding of EDPC. The forwarding of EDPC during rfe instructions
to signal dpc in the PC environment would work along the same lines,
but this would increase the instruction fetch time. Therefore, forwarding
of EDPC to dpc is omitted. The data hazards caused by this can always
be avoided if the RESTORE sequence of the interrupt service routine
updates register EDPC before register EPC.
If this precaution is not taken by the programmer, then a data hazard
signal

$dhaz(EDPC) = hit[2] \vee hit[3] \vee hit[4]$

is generated by the circuit in figure 5.18. Note that this circuit is obtained
from circuit SFor(3) by the obvious simplifications. Such a data hazard is
only of interest if the decode stage processes an rfe instruction. That is the
only case in which an SPR register requests an interlock:

$dhazS = dhaz(EDPC) \wedge rfe.1$

Figure 5.18 Data hazard detection for EDPC

Cost and Delay. The hazard signal dhazS is generated at the following
cost and delay:

$C_{dhazS} = 3 \cdot C_{equal}(3) + 7 \cdot C_{and} + 2 \cdot C_{or}$
$A_{dhazS} = D_{equal}(3) + 2 \cdot D_{and} + 2 \cdot D_{or}$

The address and control inputs of the forwarding circuits SFor are directly
taken from registers. The input data are provided by the environment
EXenv, by register C.4 and by the special read ports of the SPR register file.
Thus,

$A_{FORW}(S, EPC) = \max\{D_{SFor}(Dout; 3),\ D_{SFor}(Data; 3) + A_{EXenv}\}$
$A_{FORW}(SR) = \max\{D_{SFor}(Dout; 1),\ D_{SFor}(Data; 1) + A_{SH4Lenv}\}$
$A_{FORW}(EDPC) = 0$

The forwarding of the SPR operands is performed by one 1-stage and two
3-stage forwarding circuits:

$C_{SFORW} = C_{SFor}(1) + 2 \cdot C_{SFor}(3)$
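
The cost and delay formulas for the SPR forwarding hardware can be evaluated mechanically. The sketch below is only an illustration: the function names are hypothetical, and the gate costs and delays used in the usage note are placeholder values, not the technology parameters of the book.

```python
# Evaluating the SFor cost/delay formulas (all names hypothetical).
# The caller supplies the basic costs and delays.

def c_sfor(stages, c_mux32, c_and, c_equal):
    """Cost of a forwarding circuit: one mux, two ANDs and one
    equality tester per forwarded stage."""
    return stages * c_mux32 + 2 * stages * c_and + stages * c_equal

def d_sfor_dout(stages, d_mux32, d_and, d_equal):
    """Delay from the address input to Dout: hit signal plus mux chain."""
    return (d_equal + d_and) + stages * d_mux32

def c_sforw(c_mux32, c_and, c_equal):
    """One 1-stage and two 3-stage forwarding circuits."""
    return c_sfor(1, c_mux32, c_and, c_equal) + 2 * c_sfor(3, c_mux32, c_and, c_equal)

def c_dhazs(c_equal3, c_and, c_or):
    """Cost of the hazard signal dhazS."""
    return 3 * c_equal3 + 7 * c_and + 2 * c_or
```

For instance, with the placeholder values `c_mux32=96`, `c_and=2` and `c_equal=10`, the call `c_sfor(3, 96, 2, 10)` yields 330 and `c_sforw(96, 2, 10)` yields 770.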

'(#   1 

The stall engine of the DLXΠ design is very similar to the interlock engine
of section 4.5 except for two aspects: the initialization is different and there
are additional data hazards to be checked for. On a data hazard, the upper
two stages of the pipeline are stalled, whereas the remaining three stages
proceed. The upper two stages are clocked by signal CE1, the other stages
are clocked by signal CE2.
A data hazard can now be caused by one of the general purpose operands
A and B or by a special purpose register operand. Such a hazard is signaled
by the activation of the flag

$dhaz = dhazA \vee dhazB \vee dhazS$

     ,- > 
The full vector is initialized on reset and on every jump to the ISR. As in
the DLXΣ design, a jump to the ISR is only initiated if the write back stage
is not empty:

$JISR = jisr.4 \wedge full.4$

On JISR, the write back stage is updated and stage IF already fetches the
first instruction of the ISR. The update enable signals ue.4 and ue.0 must
therefore be active. The instructions processed in stages 1 to 3 are canceled
on a jump to the ISR; signal JISR disables the update enable signals
ue[3:1]. In the cycle after JISR, only stages 0 and 1 hold a valid instruction;
the other stages are empty, i.e., they process dummy instructions.
Like in the DLXΣ design, an active reset signal is caught immediately
and is clocked into register MCA even if the memory stage is empty. In
order to ensure that in the next cycle a jump to the ISR is initiated, the reset
signal forces the full bit full.4 of the write back stage to one.
The following equations define such a stall engine. A hardware
realization is depicted in figure 5.19.

$ue.0 = CE1$
$ue.1 = CE1 \wedge \overline{JISR}, \qquad full.1 = 1$
$ue.2 = CE2 \wedge \overline{JISR} \wedge full.2, \qquad full.2 := ue.1$
$ue.3 = CE2 \wedge \overline{JISR} \wedge full.3, \qquad full.3 := ue.2$
$ue.4 = CE2 \wedge full.4, \qquad full.4 := ue.3 \vee reset$

Figure 5.19 Stall engine of the DLXΠ design with interrupt support

  
Like in the pipelined design DLXπ without interrupt handling, there are two
clock signals. Signal CE1 governs the upper two stages of the pipeline, and
signal CE2 governs the remaining stages.

$CE1 = (\overline{busy} \wedge \overline{dhaz}) \vee (JISR \wedge \overline{Ibusy}) = (\overline{busy} \wedge \overline{dhaz}) \vee (\overline{JISR}\ \mathrm{NOR}\ Ibusy)$
$CE2 = \overline{busy} \vee (\overline{JISR}\ \mathrm{NOR}\ Ibusy) \vee reset$

Both clocks are inactive if one of the memories is busy; CE1 is also inactive
on a data hazard. However, on JISR both clocks become active once the
instruction memory is not busy. In order to catch an active reset signal
immediately, the clock CE2 and the clock CA4ce of the cause processing
circuit must be active on reset:

$CA4ce = ue.3 \vee reset$

In order to avoid unnecessary stalls, the busy flags are only considered in
case of a successful memory access. Since the memories never raise their
flags when they are idle, the busy flags are generated as

$Ibusy = ibusy \wedge (imal\ \mathrm{NOR}\ ipf)$
$Dbusy = dbusy \wedge (dmal\ \mathrm{NOR}\ dpf)$
$\overline{busy} = Ibusy\ \mathrm{NOR}\ Dbusy$
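
The clock equations can be checked with a small bit-level model. The sketch below is only an illustration: the function names `nor` and `clocks` are hypothetical, and all signals are modeled as the integers 0 and 1.

```python
# Bit-level sketch of the clock generation (names are hypothetical).

def nor(a, b):
    """Two-input NOR on bits 0/1."""
    return 1 - (a | b)

def clocks(ibusy, imal, ipf, dbusy, dmal, dpf, dhaz, jisr, reset):
    """Return (CE1, CE2) for one cycle."""
    Ibusy = ibusy & nor(imal, ipf)      # busy counts only on a successful access
    Dbusy = dbusy & nor(dmal, dpf)
    nbusy = nor(Ibusy, Dbusy)           # inverted busy flag
    njisr = 1 - jisr                    # inverted JISR
    jisr_go = nor(njisr, Ibusy)         # equals JISR AND NOT Ibusy
    ce1 = (nbusy & (1 - dhaz)) | jisr_go
    ce2 = nbusy | jisr_go | reset
    return ce1, ce2
```

The NOR form `nor(njisr, Ibusy)` is 1 exactly when JISR is 1 and Ibusy is 0, so the two forms of CE1 given above agree.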
The interrupt mechanism requires that the standard write to a register
file or memory is canceled on a repeat interrupt. The register files GPR
and SPR are protected as in the sequential design. A special write to the
SPR register file is enabled by signal ue.4. The write signals of the register
files are therefore generated as

$GPRw' = GPRw \wedge ue.4 \wedge (JISR\ \mathrm{NAND}\ repeat)$
$SPRw' = SPRw \wedge ue.4 \wedge (JISR\ \mathrm{NAND}\ repeat)$
$SPRw'[5:0] = SPRw[5:0] \wedge ue.4$

For the data memory, the protection becomes more complicated. Like in
the sequential design DLXΣ, the memory system DM itself cancels the
update if it detects a page fault, and in case of a page fault on fetch, the
write request signal is disabled during execute:

$Dmw.3 := Dmw.2 \wedge \overline{CA.2[2]}$

However, the access must also be disabled on JISR and on reset. Thus,
signal Dmw'.3, which is used by the memory controller DMC in order to
generate the bank write signals, is set to

$Dmw'.3 = Dmw.3 \wedge full.3 \wedge (JISR\ \mathrm{NOR}\ reset)$

The remaining clock and write signals are enabled as in the pipelined
design DLXπ without interrupt handling: the data memory read request is
granted if stage M is full,

$Dmr'.3 = Dmr.3 \wedge full.3$

and the update of a register $R \in out(i)$ is enabled by ue.i:

$Rce' = Rce \wedge ue.i$

Like for the DLXΣ design (lemma 5.8), it follows immediately that with this
stall engine, an active reset signal brings up the DLXΠ design, no matter in
which state the hardware has been before:

Lemma 5.10. Let T be the last machine cycle in which the reset signal is active.
In the next machine cycle, the DLXΠ design then signals a reset interrupt and
performs a jump to the ISR:

$reset^T = 1 \wedge reset^{T+1} = 0 \ \Longrightarrow\ JISR^{T+1} = 1 \text{ and } MCA[0]^{T+1} = 1$


Table 5.11 Start of the execution after reset under the assumption that no data
hazards occur. A blank entry indicates that the value is undefined.

T reset JISR ue.0 ue.1 ue.2 ue.3 ue.4 full.2 full.3 full.4 IF
-1 1 1
0 0 1 1 0 0 0 1 1 I0
1 0 0 1 1 0 0 0 0 0 0 I1
2 0 0 1 1 1 0 0 1 0 0 I2
3 0 0 1 1 1 1 0 1 1 0 I3
4 0 0 1 1 1 1 1 1 1 1 I4
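
The start-up behavior listed in table 5.11 can be replayed with a small model of the stall engine equations. This is a sketch under stated assumptions: no memory is busy and no data hazard occurs (so CE1 = CE2 = 1), the full bits are updated in every cycle, the undefined full bits before reset are arbitrarily set to 0, and signal jisr.4 is raised by hand in cycle 0. The function names are hypothetical.

```python
# Replaying table 5.11 with the stall engine equations (names hypothetical).

def stall_engine_step(full, reset, jisr4, ce1=1, ce2=1):
    """One cycle of the stall engine.

    full = (full.2, full.3, full.4); full.1 is constantly 1.
    Returns (ue, next_full, JISR) with ue = (ue.0, ..., ue.4).
    """
    full2, full3, full4 = full
    jisr = jisr4 & full4
    ue0 = ce1
    ue1 = ce1 & (1 - jisr)
    ue2 = ce2 & (1 - jisr) & full2
    ue3 = ce2 & (1 - jisr) & full3
    ue4 = ce2 & full4
    next_full = (ue1, ue2, ue3 | reset)   # reset forces full.4 to one
    return (ue0, ue1, ue2, ue3, ue4), next_full, jisr

def run_after_reset(cycles=5):
    """Cycle -1 is the last reset cycle; cycle 0 performs the JISR."""
    full = (0, 0, 0)                      # arbitrary: undefined before reset
    _, full, _ = stall_engine_step(full, reset=1, jisr4=0)
    trace = []
    for t in range(cycles):
        ue, nxt, jisr = stall_engine_step(full, reset=0, jisr4=1 if t == 0 else 0)
        trace.append((t, jisr, ue, full))
        full = nxt
    return trace
```

Under these assumptions the update enable vectors of cycles 0 to 4 come out as 10001, 11000, 11100, 11110 and 11111, matching the defined entries of the table.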

 - ,- 


The scheduling functions of the pipelined DLX designs with and without
interrupt handling are very much alike. The execution starts in cycle T = 0,
which is the first cycle after reset (table 5.11). According to lemma 5.10,
the first instruction $I_0$ of the ISR is fetched in cycle T = 0, and

$I_\Pi(0, 0) = 0$

The instructions are still fetched in program order and wander in
lock-step through the stages 0 and 1:

$I_\Pi(0, T) = i \ \Longrightarrow\ I_\Pi(0, T+1) = \begin{cases} i & \text{if } ue.0^T = 0 \\ i+1 & \text{if } ue.0^T = 1 \end{cases}$

$I_\Pi(1, T) = i \ \Longleftrightarrow\ I_\Pi(0, T) = i+1$

Any instruction makes a progress of at most one stage per cycle, and it
cannot be stalled once it is clocked into stage 2. However, on an active
JISR signal, the instructions processed in stages 1 to 3 are evicted from the
pipeline. Thus, $I_\Pi(k, T) = i \wedge JISR^T = 0 \wedge k \ne 0$ implies

$i = \begin{cases} I_\Pi(k, T+1) & \text{if } ue.k^T = 0 \\ I_\Pi(k+1, T+1) & \text{if } ue.k^T = 1 \text{ and } k+1 \le 4 \end{cases}$

and for $k \ge 2$, the instructions proceed at full speed:

$I_\Pi(k, T) = i \wedge JISR^T = 0 \ \Longrightarrow\ I_\Pi(k+1, T+1) = i$

Note that on JISR = 1, the update enable signals of the stages 0 and 4 are
active whereas the ones of the remaining stages are inactive.
  +%
The computation of the inverted hazard signal $\overline{dhaz}$ requires the data
hazard signals of the two GPR operands A and B and the data hazard signal
dhazS of the SPR operands:

$\overline{dhaz} = (dhazA \vee dhazB)\ \mathrm{NOR}\ dhazS$

Since for the two GPR operands, the hazard detection is virtually the same,
the cost and delay of signal $\overline{dhaz}$ can be modeled as

$C_{dhaz} = 2 \cdot C_{dhazA} + C_{dhazS} + C_{or} + C_{nor}$
$A_{dhaz} = \max\{A_{dhazA} + D_{or},\ A_{dhazS}\} + D_{nor}$

The inverted flag $\overline{busy}$, which combines the two signals Dbusy and
Ibusy, depends on the flags of the memory environments. Its cost and
delay can be modeled as

$C_{busy} = 2 \cdot C_{and} + 3 \cdot C_{nor}$
$A_{busy} = \max\{A_{IMenv}(flags),\ A_{DMenv}(flags)\} + D_{and} + 2 \cdot D_{nor}$

The two clock signals CE1 and CE2 depend on the busy flag, the data
hazard flag dhaz, and the JISR flags

$JISR = jisr.4 \wedge full.4, \qquad \overline{JISR} = jisr.4\ \mathrm{NAND}\ full.4$

We assume that the reset signal has zero delay. The two clocks can then be
generated at the following cost and delay:

$C_{CE} = 3 \cdot C_{or} + C_{nor} + C_{and} + C_{dhaz} + C_{busy} + C_{and} + C_{nand}$
$A_{JISR} = \max\{D_{and},\ D_{nand}\}$
$A_{CE} = \max\{A_{dhaz},\ A_{JISR},\ A_{busy}\} + D_{and} + D_{or}$

The core of the stall engine is the circuit of figure 5.19. In addition,
the stall engine generates the clock signals and enables the update of the
registers and memories. Only the data memory, the two register files, the
output registers of environment CApro, and the registers PC' and DPC
have non-trivial update request signals. All the other data paths registers
$R \in out(i)$ are clocked by ue.i. The cost and the cycle time of the whole
stall engine can therefore be modeled as

$C_{stall} = 3 \cdot C_{ff} + C_{or} + 5 \cdot C_{and} + C_{CE} + C_{nand} + C_{nor} + C_{or} + C_{inv} + 9 \cdot C_{and}$
$T_{stall} = A_{CE} + 3 \cdot D_{and} + \delta + \max\{D_{SF}(w, ce; 6, 32) + D_{ff},\ D_{ram3}(32, 32)\}$
Table 5.12 Cost of the data paths of the pipelined DLX designs with/without
interrupt hardware

environment EX RF PC CA buffer FORW DP
DLXπ 3315 4066 1906 – 408 812 13010
DLXΠ 3795 7257 2610 471 2064 1624 20610
increase 14% 78% 37% – 406% 100% 58%
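
The percentages in the last row of table 5.12 follow directly from the two cost rows. The short check below is only an illustration; the helper name `increase_percent` is hypothetical, and the cost figures are copied from the table.

```python
# Recomputing the "increase" row of table 5.12 (helper name is hypothetical).

def increase_percent(old, new):
    """Relative cost increase in percent, rounded to the nearest integer."""
    return round(100 * (new - old) / old)

# Cost figures copied from table 5.12; CA is omitted because the design
# without interrupt hardware has no cause environment.
COST_WITHOUT = {"EX": 3315, "RF": 4066, "PC": 1906, "buffer": 408, "FORW": 812, "DP": 13010}
COST_WITH = {"EX": 3795, "RF": 7257, "PC": 2610, "buffer": 2064, "FORW": 1624, "DP": 20610}

INCREASE = {name: increase_percent(COST_WITHOUT[name], COST_WITH[name])
            for name in COST_WITHOUT}
```

The buffering stands out: its cost grows from 408 to 2064 gate equivalents, i.e., by a factor of about five.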

'(&   +%   DLXΠ  

In the following, we determine the cost and the cycle time of the DLXΠ
design and compare these values to those of the pipelined design DLXπ
without interrupt handling.

   +  " 
Except for the forwarding circuit FORW, the top level schematics of the
data paths of the two DLX designs with interrupt support are the same. The
cost of the DLXΠ data paths DP (figure 5.13) can therefore be expressed as

$C_{DP} = C_{IMenv} + C_{IRenv} + C_{PCenv} + C_{Daddr} + C_{EXenv} + C_{DMenv} + C_{SH4Lenv} + C_{RFenv} + C_{buffer} + C_{CAenv} + C_{FORW} + 8 \cdot C_{ff}(32)$

Table 5.12 lists the cost of the data paths and its environments for the
two pipelined DLX designs. Environments which are not affected by the
interrupt mechanism are omitted. The interrupt mechanism increases the
cost of the data paths by 58%. This increase is largely caused by the
register files, the forwarding hardware, and by the buffering. The other data
paths environments become about 20% more expensive.
Without interrupt hardware, each of the stages ID, EX and M requires
17 buffers for the two opcodes and one destination address. In the DLXΠ
design, each of these stages buffers now two addresses and three 32-bit
PCs. Thus, the amount of buffering is increased by a factor of 4.
The environment RFenv now consists of two register files GPR and SPR.
Although there are only 6 SPR registers, they almost double the cost of
environment RFenv. That is because the GPR is implemented by a RAM,
whereas the SPR is implemented by single registers. Note that a 1-bit
register is four times more expensive than a RAM cell. The register
implementation is necessary in order to support the extended access mode:
all 6 SPR registers can be accessed in parallel.
Table 5.13 Cost of the control of the two pipelined DLX designs

environment stall MC automata buffer CON DLX
DLXπ 77 48 609 89 830 13840
DLXΠ 165 61 952 105 1283 21893
increase 114% 27% 56% 18% 44% 58%

    
According to the schematics of the precomputed control (figure 4.15), the
control unit CON buffers the valid flags and the precomputed control
signals. For the GPR result, 6 valid flags are needed, i.e., v[4:2].2, v[4:3].3
and v[4].4. Due to the extended ISA, there is also an SPR result. Since
this result always becomes valid in the execute stage, there is no need for
additional valid flags.
Since the control automata already provide one stage of buffering,
precomputed control signals of type x need no explicit buffering. Type y
signals require one additional stage of buffers, whereas type z signals require
two stages of buffers. According to table 5.9, there are three control signals
of type z and one of type y. Thus, the control requires

$6 + 2 \cdot 3 + 1 \cdot 1 = 13$

flipflops instead of 11. One inverter is used in order to generate the valid
signal of the GPR result. In addition, the control unit CON comprises the
stall engine, the two memory controllers IMC and DMC, and two control
automata (table 5.10). Thus, the cost of unit CON can be modeled as

$C_{CON} = C_{IMC} + C_{DMC} + C_{stall} + C_{CON}(moore) + C_{CON}(mealy) + 13 \cdot C_{ff} + C_{inv}$

Table 5.13 lists the cost of the control unit, of all its environments, and
of the whole DLX hardware. The interrupt mechanism increases the cost
of the pipelined control by 44%. The cost of the stall engine increases
by an above-average amount (114%).

%  
According to table 5.14, the interrupt support has virtually no impact on the
cycle time of the pipelined DLX design. The cycle times of the data paths
environments remain unchanged, only the control becomes slightly slower.
However, as long as the memory status time stays below 43 gate delays,
the cycle time of the DLXΠ design is dominated by the PC environment.
Table 5.14 Cycle times of the two pipelined DLX designs; d_mem denotes the
maximum of the two access times d_Imem and d_Dmem, and d_mstat denotes the
maximum of the two status times d_Istat and d_Dstat.

        ID           EX    WB    DP    IF, M         CON / stall
        A/B    PC                                    max( , )
DLXπ    72     89    66    33    89    16 + d_mem    57    43 + d_mstat
DLXΠ    72     89    66    33    89    16 + d_mem    57    46 + d_mstat
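
The dependence of the cycle time on the memory parameters can be made explicit with a one-line model. This is a sketch only: the function name is hypothetical, the constants are copied from the DLXΠ row of table 5.14, and the cycle time is taken as the maximum over all columns of that row.

```python
# Cycle time of the DLX-Pi design as a function of the memory parameters,
# using the constants of table 5.14 (function name is hypothetical).

def cycle_time_dlx_pi(d_mem, d_mstat):
    """Maximum over all entries of the DLX-Pi row of table 5.14."""
    return max(72, 89, 66, 33,      # ID (A/B, PC), EX, WB
               89,                  # data paths DP
               16 + d_mem,          # stages IF and M
               57, 46 + d_mstat)    # control and stall engine
```

For a fast memory, `cycle_time_dlx_pi(10, 43)` still evaluates to 89, the time of the PC environment, whereas one more gate delay of status time gives `cycle_time_dlx_pi(10, 44) == 90`.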

Correctness of the Interrupt Hardware

In this section, we will prove that the pipelined hardware DLXΠ
together with an admissible ISR processes nested interrupts in a precise
manner. For a sequential design, the preciseness of the interrupt processing
is well understood. We therefore reduce the preciseness of the pipelined
interrupt mechanism to the one of the sequential mechanism by showing
that the DLXΠ design simulates the DLXΣ design on any non-aborted
instruction sequence.
In a first step, we consider an uninterrupted instruction sequence
$I_0, \ldots, I_p$, where $I_0$ is preceded by JISR, and where $I_p$ initiates a JISR. In a
second step, it is shown that the simulation still works when concatenating
several of these sequences. With respect to these simulations, canceled
instructions and external interrupt events are a problem.

Canceled instructions. Between the fetching of instruction $I_p$, which
initiates a jump to the ISR, and the actual JISR, the DLXΠ design starts
further instructions $I_{p+1}, \ldots, I_{p+\delta}$. However, these instructions are
canceled by JISR before they reach the write back stage. Thus, with respect
to the simulation, we consider sequence $P = I_0, \ldots, I_p, \ldots, I_{p+\delta}$ for the
pipelined design, and sequence $P' = I_0, \ldots, I_p$ for the sequential design.

External interrupt events are asynchronous to the instruction execution
and can occur at any time. Due to the pipelined execution, an instruction
sequence P is usually processed faster on the DLXΠ design than on the
DLXΣ design. For the simulation, it is therefore insufficient to assign a
given external event to a fixed cycle. Instead, the instruction sequences P
and P' are extended by a sequence of external events. For any external
interrupt ev[j], we use the following assignment, which is illustrated in
table 5.15:
Table 5.15 Assignment of external interrupt events for an uninterrupted
instruction sequence P

cycle ev[j] JISR full.3 full.4 M WB
T-1 0 0
T 1 0 0 –
T+1 1 0 0 0 – –
... 1 0 0 0 – –
t-1 1 0 1 0 I_i –
t 1 0 1 I_i

Let the external interrupt event ev[j] be raised during cycle T of the
pipelined execution of P,

$ev[j]_\Pi^{T-1} = 0$ and $ev[j]_\Pi^{T} = 1$,

let t be the first cycle after T for which the write back stage is full, and let
$T'+1$ be the cycle in the sequential execution of P' corresponding to cycle
t, i.e.,

$I_\Pi(4, t) = i = I_\Sigma(4, T'+1)$

In the sequential execution of P', event ev[j] is then assigned to cycle T':

$ev[j]_\Sigma^{T'} = 1$

Since the external events are collected in stage 3, it is tempting to argue
about the first cycle $\hat{t} \ge T$ in which stage 3 is full, i.e., $i = I_\Pi(3, \hat{t})$. For
a single uninterrupted instruction sequence P that makes no difference
because the instruction processed in stage 3 is always passed to stage 4 at the
end of the cycle. Thus,

$I_\Pi(3, \hat{t}) = I_\Pi(4, \hat{t}+1) = I_\Pi(4, t)$

However, when concatenating two sequences $P = I_0, \ldots, I_{p+\delta}$ and
$Q = J_0, J_1, \ldots$, the instruction processed in stage 3 can be canceled by JISR.
Therefore, it is essential to argue about the instruction executed in stage 4. In
the example of table 5.16, the external event ev[j] is signaled while the
DLXΠ design performs a jump to the ISR. When arguing about stage 3, the
external event is assigned to instruction $I_{p+1}$, which has no counterpart in
the sequential execution, whereas when arguing about stage 4, the event is
assigned to the first instruction of sequence Q.
Table 5.16 Assignment of external interrupt events when concatenating two
instruction sequences P and Q

cycle ev[j] JISR full.3 full.4 M WB
T-1 0 0 1 I_p
T = t̂ 1 1 1 1 I_{p+1} I_p
T+1 1 0 0 0 – –
... 1 0 0 0 – –
t-1 1 0 1 0 J_0 –
t 1 0 1 J_0

The proofs dealing with the admissibility of the ISR (section 5.4) only
argue about signal JISR and the values of the registers and memories
visible to the assembler programmer, i.e., the general and special purpose
register files, the two PCs and the two memories IM and DM:

$C = (GPR[0], \ldots, GPR[31],\ SPR[0], \ldots, SPR[5],\ PC',\ DPC,\ DM,\ IM)$

For the simulation, signal JISR and the contents of storage C are therefore
of special interest.

Theorem 5.11. Let $P = I_0, \ldots, I_p, \ldots, I_{p+\delta}$ and $P' = I_0, \ldots, I_p$ be two
instruction sequences extended by a sequence of external events, as defined
above. Sequence P is processed by the pipelined design DLXΠ and P' by the
sequential design DLXΣ. Let instruction $I_0$ be preceded by JISR,

$JISR_\Sigma^{-1} = 1$ and $JISR_\Pi^{0} = 1$,

and let both designs start in the same configuration, i.e.,

$\forall R \in C: \quad R_\Sigma^{0} = R_\Pi^{1}$

Let $T_p$ and $T'_p$ denote the cycles in which $I_p$ is processed in the write back
stage:

$I_\Pi(4, T_p) = I_\Sigma(4, T'_p) = p, \qquad ue.4_\Pi^{T_p} = 1$

The initial PCs then have values $PC'^{\,0}_\Sigma = SISR + 4$ and $DPC_\Sigma^{0} = SISR$. For
any instruction $I_i \in P'$, any stage k, and any two cycles T, T' with

$I_\Pi(k, T) = I_\Sigma(k, T') = i, \qquad ue.k_\Pi^{T} = 1$

the following two claims hold:

Figure 5.20 Pairs (k, T) of the pipelined execution. Box 0 is covered by the
hypothesis of the simulation theorem, the boxes 1 and 2 correspond to the claims
1 and 2.

1. (a) for all signals S in stage k which are inputs to a register
$R \in out(k)$ that is updated at the end of cycle T:

$S_\Pi^{T} = S_\Sigma^{T'}$

(b) for all registers $R \in out(k)$ which are visible or updated at the
end of cycle T:

RTΠ1  RΣT 1 
¼ Rui if T  Tp
Ri if T  Tp

(c) for any cell M of the data memory DM and k = 3:

$M_\Pi^{T+1} = M_\Sigma^{T'+1} = M_i$

2. and for any $R \in C$ and $T = T_p$:

$R_\Pi^{T+1} = R_\Sigma^{T'+1} = R_p$

With respect to the pipelined execution, there are three types of pairs
(k, T) for which the values $S^T$ and $R^{T+1}$ of the signals S and output
registers R of stage k are of interest (figure 5.20):

• For the first cycle, the theorem makes an assumption about the
  contents of all registers and memories $R \in C$ independent of the stage
  they belong to (box 0).
• Claim 1 covers all the pairs (k, T) for which $I_\Pi(k, T)$ is defined and
  lies between 0 and p (box 1).
• For the final cycle $T_p$, claim 2 covers all the registers and memories
  $R \in C$ independent of the stage they belong to (box 2).

Table 5.17 Start of the execution after reset or JISR respectively

DLXσ DLXΣ
T’ reset ue[0:4] full[0:4] reset JISR ue[0:4] full[0:4]
-2 1 * * *
-1 1 * * 0 1 00001 00001
0 0 10000 10000 0 0 10000 10000
1 0 01000 01000 0 0 01000 01000
2 0 00100 00100 0 0 00100 00100
3 0 00010 00010 0 0 00010 00010
4 0 00001 00001 0 0 00001 00001
5 0 10000 10000 0 0 10000 10000
DLXπ DLXΠ
T reset ue[0:4] full[2:4] reset JISR ue[0:4] full[2:4]
-1 1 * * *
0 1 10000 * 0 1 10001 **1
1 0 11000 000 0 0 11000 000
2 0 11100 100 0 0 11100 100
3 0 11110 110 0 0 11110 110
4 0 11111 111 0 0 11111 111
The above theorem and the simulation theorem 4.11 of the DLX design
without interrupt handling are very similar. Thus, it should be possible to
largely reuse the proof of theorem 4.11. Signal JISR of the designs DLXΣ
and DLXΠ is the counterpart of signal reset in the designs DLXσ and DLXπ.
These signals are used to initialize the PC environment, and they mark
the start of the execution. In the sequential designs, the execution is started
in cycle −1, whereas in the pipelined designs, it is started in cycle 0:

$reset_\sigma^{-1} = JISR_\Sigma^{-1} = JISR_\Pi^{0} = reset_\pi^{0} = 1$

Proof of Theorem 5.11 


Claim 1 is proven by induction on the cycles T of the pipelined execution,
but we only present the arguments which are different from those used in
the proof of theorem 4.11. The original proof strongly relies on the dateline
lemma 4.3 and on the stall engines (the scheduling functions).
Except for the initial cycle, the stall engines of the two sequential designs produce identical outputs (table 5.17). The same is true for the stall engines of the two pipelined designs. For the initial cycle $T = 0$, the pipelined scheduling function is only defined for stage $k = 0$:

$I_\Pi(0, 0) = I_\Sigma(0, 1) = 0$

Stage 0 has the instruction memory and its address as inputs. In the pipelined design, IM is addressed by dpc, whereas in the sequential design it is addressed by register DPC. Since

$DPCce = PCce = ue.1 \lor JISR$

it follows from the hypothesis of the theorem and the update enable flags that

$dpc_\Pi^{0} = DPC_\Pi^{1} = DPC_\Sigma^{0} = DPC_\Sigma^{1}$

The memory IM is read-only and therefore keeps its initial contents. Thus, on design DLXΠ in cycle $T = 0$, stage 0 has the same inputs as on design DLXΣ in cycle $T' = 1$.
Note that the stages k of the designs DLXΣ and DLXΠ generate the same signals S and update their output registers in the same way, given that they get identical inputs. This also applies to the data memory DM and its write request signal Dmw.3, which in either design is disabled if the instruction encounters a page fault on fetch. Thus, with the new dateline lemma 5.9, the induction proof of claim 1 can be completed as before.

Claim 2 is new and therefore requires a full proof. For the output registers of stage 4, claim 1 already implies claim 2. Furthermore, in the designs DLXΣ and DLXΠ, the instruction memory is never updated. Thus, claim 2 only needs to be proven for the two program counters PC' and DPC, and for the data memory DM.
The instruction sequence P of the sequential design was constructed such that instruction $I_p$ causes an interrupt. Since signal JISR is generated in stage 4, claim 1 implies

$JISR_\Sigma^{T_p'} = JISR_\Pi^{T_p} = 1$

In either design, the two PCs are initialized on an active JISR signal, and therefore

$DPC_\Sigma^{T_p'+1} = SISR = DPC_\Pi^{T_p+1}$
$PC'^{\,T_p'+1}_\Sigma = SISR + 4 = PC'^{\,T_p+1}_\Pi$

The data memory DM belongs to the set out(3). For stage 3, the two scheduling functions imply

$I_\Pi(3, T_p - 1) = I_\Sigma(3, T_p' - 1) = p$

In the sequential design, the data memory is only updated when the instruction is in stage 3, i.e., when full.3 = 1. Claim 1 then implies that

$DM_\Pi^{T_p} = DM_\Sigma^{T_p'} = DM_p$

JISR is only signaled if full.4 = 1. For cycle $T_p'$, the sequential stall engine then implies that

$full.3_\Sigma^{T_p'} \neq 1$ and $Dmw_\Sigma^{T_p'} = 0$

Thus, the data memory is not updated during JISR, and therefore

$DM_\Sigma^{T_p'} = DM_\Sigma^{T_p'+1}$

In the pipelined design, the write enable signal of the data memory is generated as

$Dmw3 = Dmw.3 \land full.3 \land (JISR \text{ NOR } reset)$

Since signal Dmw3 is disabled on an active JISR signal, the data memory is not updated during cycle $T_p$, and therefore,

$DM_\Pi^{T_p+1} = DM_\Pi^{T_p}$

That completes the proof of claim 2.


We will now consider an arbitrary instruction sequence Q, which is processed by the pipelined DLX design, and which is interrupted by several non-aborting interrupts. Such a sequence Q can be broken down into several uninterrupted subsequences

$P_i = I_{(i,0)}, \ldots, I_{(i,p_i)}, \ldots, I_{(i,p_i+\delta_i)}$

This means that for any sequence $P_i$, instruction $I_{(i,0)}$ is preceded by JISR, $I_{(i,p_i)}$ is the only instruction of $P_i$ which causes an interrupt, and instruction $I_{(i,p_i+\delta_i)}$ is the last instruction fetched before the jump to the ISR. For the sequential execution, we consider the instruction sequence $Q' = P_1' P_2' \ldots$ which is derived from sequence Q by dropping the instructions evicted by JISR, i.e.,

$P_i' = I_{(i,0)}, \ldots, I_{(i,p_i)}$

The external interrupt events are assigned as before. The scheduling functions are extended in an obvious way. For the designs DLXΣ and DLXΠ,

$I_\Sigma(k, T) = (i, j)$ and $I_\Pi(k, T) = (i, j)$

denote that in cycle T pipeline stage k processes instruction $I_{(i,j)}$.

Like the two DLX designs without interrupt hardware, the designs DLXΣ and DLXΠ are started by reset and not by JISR. Lemmas 5.8 and 5.10 imply that after reset, both designs come up gracefully; one cycle after reset, JISR = 1 and the designs initiate a jump to ISR(0). Thus, we can now formulate the general simulation theorem for the designs DLXΣ and DLXΠ:

Let $Q = P_1 P_2 \ldots$ and $Q' = P_1' P_2' \ldots$ be two instruction sequences extended by a sequence of external events, as defined above. Sequence Q is processed by the pipelined design DLXΠ and Q' by the sequential design DLXΣ. In the sequential execution, reset is given in cycle −2, whereas in the pipelined execution, reset is given in cycle −1:

$reset_\Sigma^{-2} = 1 = reset_\Pi^{-1}$

Let both designs be started with identical contents, i.e., any register and memory R of the data paths satisfies

$R_\Sigma^{-1} = R_\Pi^{0}$   (5.4)

and let the first instruction $I_{(1,0)}$ be preceded by JISR

$JISR_\Sigma^{-1} = 1 = JISR_\Pi^{0}$

For every pair $(P_i, P_i')$ of subsequences, the DLXΠ design processing $P_i$ then simulates the DLXΣ design on $P_i'$ in the sense of theorem 5.11.

As shown in the proof of theorem 5.11, claim 2, both designs initialize the PCs on JISR in the same way, thus

$(PC', DPC)_\Sigma^{0} = (PC', DPC)_\Pi^{1}$

The instruction memory is read-only, and the update of the data memory is disabled on ue.3 = 0. Table 5.17 and equation 5.4 therefore imply

$(IM, DM)_\Sigma^{0} = (IM, DM)_\Pi^{1}$

In either design, the output registers of stage 4 are updated during JISR. Since stage 4 gets identical inputs, it also produces identical outputs. Thus,

$\forall R \in C: \quad R_\Sigma^{0} = R_\Pi^{1}$

and for the subsequences $P_1$ and $P_1'$ simulation theorem 5.11 is applicable. Since instruction $I_{(1,p_1)}$ causes an interrupt, claim 2 of theorem 5.11 implies that during the cycles $T_1$ and $T_1'$ with

$I_\Pi(4, T_1) = (1, p_1)$ and $I_\Sigma(4, T_1') = (1, p_1)$

the two designs are in the same configuration, i.e.,

$\forall R \in C: \quad R_\Sigma^{T_1'+1} = R_\Pi^{T_1+1}$   (5.5)

In the sequential execution, the next subsequence is started one cycle after JISR, i.e.,

$I_\Sigma(4, T_i') = (i, p_i) \land JISR_\Sigma^{T_i'} = 1 \;\Rightarrow\; I_\Sigma(0, T_i' + 1) = (i + 1, 0)$

whereas in the pipelined execution, the next subsequence is already started during JISR, i.e.,

$I_\Pi(4, T_i) = (i, p_i) \land JISR_\Pi^{T_i} = 1 \;\Rightarrow\; I_\Pi(0, T_i) = (i + 1, 0)$

For the first two subsequences, figure 5.21 illustrates this scheduling behavior.

Figure 5.21: Scheduling of the first two subsequences $P_1$, $P_2$ for the pipelined execution of sequence Q

Thus, cycle $T_1' + 1$ corresponds to cycle 0 of the sequential execution of $P_2'$, and cycle $T_1 + 1$ corresponds to cycle 1 of the pipelined execution of $P_2$. Equation 5.5 then implies that the subsequences $P_2$ and $P_2'$ are started in the same configuration, and that theorem 5.11 can be applied.
With the same arguments, the theorem follows by induction on the subsequences of Q and Q'.

Selected References and Further Reading

Interrupt service routines which are not nested are, for example, described in [PH94]. Mechanisms for nested interrupts are treated in [MP95] for sequential machines and in [Knu96] for pipelined machines.
Exercises

Exercise 5.1 Let $t_1$ and $t_2$ be cycles of machine DLXΠ, and let $t_1 < t_2$. Suppose external interrupts i and j are both enabled, interrupt i becomes active in cycle $t_1$, interrupt j becomes active in cycle $t_2$, and no other interrupts are serviced or pending in cycle $t_2$.

1. Show that it is possible that interrupt j is serviced before interrupt i.
2. Why does this not constitute a counterexample for the correctness proof?

Exercise 5.2 Invalid address exception. Two addresses are stored in special purpose registers UP and LOW. A maskable exception of type abort has to be signalled if a memory location below LOW or above UP is accessed.

1. Design the hardware for this exception.
2. Design the forwarding mechanism for the registers UP and LOW.
3. Determine the effect on the cost and the cycle time.

Exercise 5.3 Protected mode. We want to run the machine in two modes, namely protected mode and user mode. Only the operating system should run in protected mode.

1. Design an interrupt mechanism for a mode exception which is activated if a change of the following values is attempted in user mode: mode, UP, LOW, the mask bits for the mode exception and the invalid address exception.
2. Is it possible to merge the invalid address exception and the mode exception into a single exception?
3. What should be the priorities of the new exception(s)?
4. How is the correctness proof affected?

Exercise 5.4 Protection of the interrupt stack.

1. Design an interrupt mechanism where the interrupt stack can only be accessed by the operating system; the code segments SAVE and RESTORE are part of the operating system.
2. What requirements for the interrupt service routine from the correctness proof can be guaranteed by the operating system?
3. What requirements for the interrupt service routine cannot be guaranteed by the operating system alone?
Exercise 5.5 Suppose we want to make the misaligned exception of type repeat.

1. Sketch an exception handler which fetches the required data.
2. What should be the priority of such an exception?
3. How is the correctness proof affected?

Chapter 6

Memory System Design

One way to improve the performance of an architecture is to increase the instruction throughput, for example by pipelining, but that calls for a fast memory system, as the analysis of section 4.6.5 has shown.
Users would like to have a very large (or even unlimited) amount of fast and cheap memory, but that is unrealistic. In general, only small RAM is fast, and fast RAMs are more expensive than slower ones. In this chapter we therefore study the key concepts for designing a memory system with high bandwidth, low latency, high capacity, and reasonable cost.
The pipelined DLX design requires two memory ports, one port for instruction fetch and the second port for data accesses. Since the sequential DLX design can manage with just one memory port, we first develop a fast memory system based on the sequential DLX architecture. In section 6.5, we then integrate the memory system into the pipelined DLX design.

6.1 A Monolithic Memory Design

In the simplest case, the memory system is monolithic, i.e., it just comprises a single level. This memory block can be realized on-chip or off-chip, in static RAM (SRAM) or in dynamic RAM (DRAM). DRAM is about 4 to 10 times cheaper and slower than SRAM and can have a 2 to 4 times higher storage capacity [Ng92]. We therefore model the cost and delay of DRAM as

$C_{DRAM}(A, d) = C_{SRAM}(A, d) / \alpha$
$D_{DRAM}(A, d) = \alpha \cdot D_{SRAM}(A, d)$

with $\alpha \in \{4, 8, 16\}$. Thus, on-chip SRAM yields the fastest memory system, but that solution has special drawbacks, as will be shown now.

6.1.1 The Limits of On-chip RAM

Chapter 3 describes the sequential design of a DLX fixed point core. The main memory is treated as a black box which basically has the functionality of a RAM; its temporal behavior is modeled by two parameters, the (minimal) memory access time $d_{mem}$ and the memory status time $d_{mstat}$.
All CPU internal actions of this DLX design require a cycle time of $\tau_{CPU} = 70$ gate delays, whereas the memory access takes $T_M = 18 + d_{mem}$ delays. If a memory access is performed in $1 + W$ cycles, then the whole DLX fixed point unit can run at a cycle time of

$\tau_{DLX} = \max\{\tau_{CPU},\ T_M / (W + 1)\}$   (6.1)

The parameter W denotes the number of wait states. From a performance point of view, it is desirable to run the memory without wait states and at the speed of the CPU, i.e.,

$T_M = 18 + d_{mem} \le \tau_{CPU} = 70$   (6.2)

Under these constraints, the memory access time $d_{mem}$ can be at most 52 gate delays. On-chip SRAM is the fastest memory available. According to our hardware model, such an SRAM with A entries of d bits each has the following cost and access time

$C_{SRAM}(A, d) = 2 \cdot (A + 3) \cdot (d + \log\log d)$
$D_{SRAM}(A, d) = 3 \cdot \log A + 10 \quad \text{if } A > 64$

The main memory of the DLX is organized in four banks, each of which is one byte wide. If each bank is realized as an SRAM, then equation (6.2) limits the size of the memory to

$4A \le 4 \cdot 2^{(52 - 10)/3} = 2^{16}$ bytes

That is much too small for main memory. Nevertheless, these 64 kilobytes of memory already require 1.3 million gates. That is roughly 110 times the cost of the whole DLX fixed point core, $C_{DLX} = 11951$. Thus, a large, monolithic memory system must be implemented off-chip, and a memory access then definitely takes several CPU cycles. The access time of the main memory depends on many factors, like the memory address and the preceding requests. In case of DRAMs, the memory also requires some time for internal administration, the so-called refresh cycles. Thus, the main memory has a non-uniform access time, and in general, the processor cannot foresee how many cycles a particular access will take. Processor and main memory therefore communicate via a bus.

Table 6.1: Signals of the bus protocol

 signal  meaning                    group        type            CPU          memory
 MDat    data of the memory access               bidirectional   write/read   read/write
 MAd     memory address                          unidirectional  write        read
 burst   burst transfer             status flag  unidirectional  write        read
 w/r     write/read flag            status flag  unidirectional  write        read
 BE      byte enable flags          status flag  unidirectional  write        read
 req     request access             handshake    unidirectional  write        read
 reqp    request pending            handshake    unidirectional  read         write
 Brdy    bus ready                  handshake    unidirectional  read         write
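Returning to the on-chip SRAM bound of this section, the arithmetic can be replayed with a short script. This is only a sketch: the function names are hypothetical, logarithms are assumed to be base 2, and the access-time formula is used only in its large-RAM form.

```python
import math

# Cost and delay formulas of the hardware model as quoted in the text
# (assumption: log means log base 2; names are illustrative only).
def c_sram(a, d):
    return 2 * (a + 3) * (d + math.log2(math.log2(d)))

def d_sram(a):
    return 3 * math.log2(a) + 10  # form for large RAMs

TAU_CPU = 70            # CPU cycle time in gate delays
OVERHEAD = 18           # T_M = 18 + d_mem
d_mem_max = TAU_CPU - OVERHEAD

# Largest power-of-two bank size whose access time fits into d_mem_max.
log_a = (d_mem_max - 10) // 3
entries_per_bank = 2 ** log_a
total_bytes = 4 * entries_per_bank          # four banks, one byte wide
total_cost = 4 * c_sram(entries_per_bank, 8)
C_DLX = 11951

print(d_mem_max, total_bytes, round(total_cost), round(total_cost / C_DLX))
```

Under these assumptions the script yields a 64 KB memory costing about 1.26 million gates, which is in line with the rounded figures quoted in the text.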

6.1.2 A Synchronous Bus Protocol

There exist plenty of bus protocols; some are synchronous, others are asynchronous. In a synchronous protocol, memory and processor have a common clock. That simplifies matters considerably. Our memory design therefore uses a synchronous bus protocol similar to the pipelined protocol of the INTEL Pentium processor [Int95].
The bus signals comprise the address MAd and the data MDat of the memory access, the status flags specifying the type of the access, and the handshake signals coordinating the transfer. The data lines MDat are bidirectional, i.e., they can be read and written by both devices, the processor and the memory system. The remaining bus lines are unidirectional; they are written by one device and read by the other (table 6.1). The protocol uses the three handshake signals request (req), request pending (reqp), and bus ready (Brdy) with the following meaning:

Request is generated by the processor. This signal indicates that a new transfer should be started. The type of the access is specified by some status flags.

Request Pending reqp is generated by the main memory. An active signal reqp = 1 indicates that the main memory is currently busy performing an access and cannot accept a new request.

Bus Ready is also generated by the main memory. On a read access, an active bus ready signal Brdy = 1 indicates that there are valid data on the bus MDat. On a write access, an active bus ready signal indicates that the main memory no longer needs the data MDat.

The main memory provides its handshake signals reqp and Brdy one cycle ahead. That leaves the processor more time for the administration of the bus. During the refresh cycles, the main memory does not need the bus. Thus, the processor can already start a new request, but the main memory will not respond (reqp = 1, Brdy = 0) until the refresh is finished.

-  
The data unit to be transferred on the bus is called bus word. In our memory design, the bus word corresponds to the amount of data which the processor can handle in a single cycle. In this monograph, the bus width is either 32 bits or 64 bits. The memory system should be able to update subwords (e.g., a single byte) and not just a whole bus word. On a write access, each byte i of the bus word is therefore accompanied by an enable bit BE[i].
On a burst transfer, which is indicated by an active burst flag, MAd specifies the address of the first bus word. The following bus words are referenced at consecutive addresses. The bus word count bwc specifies the number of bus words to be transferred. Our protocol supports burst reads and burst writes. All the bursts have a fixed length, i.e., they all transfer the same amount of data. Thus, the bwc bits can be omitted; the status flags of the bus protocol comprise the write/read flag w/r, the burst flag, and the byte enable flags BE.

Read Transfer

Figure 6.1 depicts the idealized timing of the bus protocol on a single-word read transfer (burst = 0) followed by a fast burst read (burst = 1). The two transfers are overlapped by one cycle.

Figure 6.1: A single-word read transfer followed by a fast x-word burst read. On a fast read transfer, the cycle marked with * is omitted.

In order to initiate a read access, the processor raises the request signal req for one cycle and pulls the write/read signal to zero, w/r = 0. In the same cycle, the processor provides the address MAd and the burst flag to the memory. The width bwc of the access can be derived from the burst flag. The memory announces the data by an active bus ready signal Brdy = 1, one cycle ahead of time. After a request, it can take several cycles till the data is put on the bus. During this time, the memory signals with reqp = 1 that it is performing an access. This signal is raised one cycle after the request and stays active (reqp = 1) till one cycle before a new request is allowed. The processor turns the address and the status signals off one cycle after req = 0. A new read access can be started one cycle after req = 0 and reqp = 0.
On a burst read, any of the bus words can be delayed by some cycles, not just the first one. In this case, the Brdy line toggles between 0 and 1. The burst transfer of figure 6.2 has a 4-2-3-1 access pattern; the first bus word arrives in the fourth cycle, the second word arrives two cycles later, and so on. The fastest read access supported by this protocol takes 2 + bwc bus cycles. The first word already arrives two cycles after the request.

Figure 6.2: A 4-2-3-1 read burst transfer
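The access-pattern notation used above can be made concrete with a small sketch. It assumes that the request cycle is counted as cycle 1 and that each number of a pattern is the distance to the previous bus word; the helper names are hypothetical.

```python
from itertools import accumulate

def arrival_cycles(pattern):
    """Cycle in which each bus word arrives, for a pattern such as 4-2-3-1.

    Assumption: the request is issued in cycle 1, and every entry of the
    pattern is the number of cycles since the previous bus word.
    """
    return list(accumulate(pattern))

def fastest_read_pattern(bwc):
    """Fastest read: first word two cycles after the request, then one per cycle."""
    return [3] + [1] * (bwc - 1)

print(arrival_cycles([4, 2, 3, 1]))
print(arrival_cycles(fastest_read_pattern(4)))
```

In this reading, the last entry of the second list equals 2 + bwc, the length of the fastest read access stated in the text.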
Figure 6.3: Fast read transfer followed by a write transfer and another read.

Write Transfer

Figure 6.3 depicts the idealized timing of the bus protocol on a fast read followed by a write and another fast read. The write transfer starts in the fourth cycle.
In order to initiate a write transfer, the processor raises the request line req for one cycle, raises the write/read line, and puts the address MAd, the burst flag burst and the byte enable flags BE on the bus. In the second cycle, the (first) bus word is transferred. The memory signals with Brdy = 1 that it needs the current data MDat for just one more cycle.
As on a read access, signal reqp is turned on one cycle after the request if the transfer takes more than 3 cycles. One cycle before the memory can accept a new access, it turns signal reqp off. One cycle later, the processor turns the address and the status signals off.
On a write burst transfer, each of the bus words can be delayed by some cycles. The burst write of figure 6.4 performs a 4-2-1-1 transfer. The fastest write transfer supported by this protocol takes bwc + 2 bus cycles.

    


The bus protocol allows two succeeding transfers to be overlapped by one cycle. However, when switching between reads and writes, the data bus MDat must be disabled for at least one cycle in order to prevent bus contention. On a write transfer, the processor uses the MDat bus from the second to the last cycle, whereas on a read transfer, the bus MDat is used in the third cycle at the earliest. Thus, a read transfer can be overlapped with any preceding transfer, but a write transfer can only be overlapped with a preceding write. At best, the processor can start a new read transfer one cycle after

$req = 0 \land reqp = 0$

and it can start a new write transfer one cycle after

$req = 0 \land reqp = 0 \land (w/r = 0 \Rightarrow Brdy = 0)$

Figure 6.4: Fast single-word write transfer followed by a 4-2-1-1 burst write.

6.1.3 Sequential DLX with Off-Chip Main Memory

In this section, we connect the sequential DLX design of chapter 3 to a 64 MB (megabyte) off-chip memory, using the bus protocol of section 6.1.2. This modification of the memory system only impacts the implementation of the memory environment and its control. The other environments of the DLX design remain unchanged.
Moreover, the global functionality of the memory system and its interaction with the data paths and main control of the design also remain the same. The memory system is still controlled by the read and write signals mr and mw, and by the opcode bits IR[27:26] which specify the width of the access. On a read, the memory system provides the memory word

$MDout[31:0] = M_{word}[\langle MA[31:2]00 \rangle]$

whereas on a d-byte write access with address $e = \langle MA[31:0] \rangle$ and offset $o = \langle MA[1:0] \rangle$, the memory system performs the update

$M[e + d - 1 : e] := byte_{(o+d-1\,:\,o)}(MDin)$

With mbusy = 1, the memory system signals the DLX core that it cannot complete the access in the current cycle.

.       5 % % 


So far, the memory environment, i.e., the data paths of the memory system, consists of four memory banks, each of which is a RAM of one byte width (figure 3.11). Based on the control signals and the offset, the memory control MC generates the bank write signals mbw[3:0] which enable the update of the memory (figure 3.12).
Now the memory system (figure 6.5) consists of the off-chip main memory M, the memory interface Mif, the memory interface control MifC, and the original memory control MC. The memory interface connects the memory M to the data paths.

Figure 6.5: Memory system of the DLX with off-chip memory

The memory of the 32-bit DLX architecture is byte addressable, but all reads and the majority of the writes are four bytes (one word) wide. Thus, the data bus MDat between the processor and the main memory can be made one to four bytes wide. On a one-byte bus, half word and word accesses require a burst access and take at least one to three cycles longer than on a four-byte bus. In order to make the common case fast, we use a four-byte data bus. On a write transfer, the 32-bit data are accompanied by four byte enable flags BE[3:0]. Since bursts are not needed, we restrict the bus protocol to single-word transfers.

Memory Interface The memory interface Mif which connects the data paths to the memory bus and the external memory uses 32-bit address and data lines. Interface Mif forwards the data from the memory bus MDat to the data output MDout. On MAdoe = 1, the interface puts the address MA on the address bus MAd, and on MDindoe = 1 it puts the data MDin on the bus MDat.
Except for the memory interface Mif, the data paths of environment Menv are off-chip and are therefore not captured in the cost model. Thus,

$C_{Menv} = C_{Mif} = 2 \cdot C_{driv}(32)$

Off-Chip Memory The off-chip memory M obeys the protocol of section 6.1.2. On a read access requested by req = 1 and w/r = 0, it provides the data

$MDat[31:0] = M_{word}[\langle MAd[31:2]00 \rangle]$

On a write access (w/r = 1), the memory performs the update

$M_{word}[\langle MAd[31:2]00 \rangle] := X_3 X_2 X_1 X_0$

where for $i = 0, \ldots, 3$ the byte $X_i$ is obtained as

$X_i = \begin{cases} M[\langle MAd[31:2]00 \rangle + i] & \text{if } BE[i] = 0 \\ byte_i(MDat) & \text{if } BE[i] = 1 \end{cases}$
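The byte-enable update above can be illustrated by a minimal behavioral model. It is a sketch only: the memory is a Python dictionary from byte addresses to byte values, unwritten bytes are assumed to read as 0, and the function names are hypothetical.

```python
def write_word(mem, mad, mdat, be):
    """Update the aligned word at address mad; only bytes with BE[i] = 1 change."""
    base = mad & ~0x3                  # <MAd[31:2]00>: word-aligned address
    for i in range(4):
        if (be >> i) & 1:              # BE[i] = 1: take byte i of MDat
            mem[base + i] = (mdat >> (8 * i)) & 0xFF
        # BE[i] = 0: byte M[base + i] keeps its old value

def read_word(mem, mad):
    """Return the aligned word X3 X2 X1 X0 containing address mad."""
    base = mad & ~0x3
    return sum(mem.get(base + i, 0) << (8 * i) for i in range(4))

mem = {}
write_word(mem, 0x102, 0xAABBCCDD, 0b0101)   # enable bytes 0 and 2 only
print(hex(read_word(mem, 0x100)))
```

Only the enabled bytes 0 and 2 are taken from the bus word; the other two bytes of the memory word stay untouched.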

Memory Interface Control The memory interface control MifC controls the tristate drivers of the memory interface and generates the handshake signal req and the signals w/r and mbusy according to the bus protocol of section 6.1.2. In the sequential DLX design, transfers are never overlapped, and the address on bus MAd is always provided by the same source MA. The bus protocol can therefore be simplified; the address MA and signal w/r are put on the bus during the whole transfer.
In the FSD of the main control (figure 3.20) there are three states performing a memory access, namely fetch, load and store. The only combination of accesses which are performed directly one after another is store - fetch. In the first two cycles of a read access, bus MDat is not used. Thus, the memory interface can already put the data on the bus MDat during the first cycle of the write transfer without risking any bus contention. The control MifC can therefore generate the enable signals and the write/read signal as

$w/r = mw$
$MAdoe = mem = mr \lor mw$
$MDindoe = mw$

The handshake signals are more complicated. Signal req is only active during the first cycle of the access, and signal mbusy is always active except during the last cycle of the transfer. Thus, for the control MifC a single transfer is performed in three steps. In the first step, MifC starts the off-chip transfer as soon as reqp = 0. In the second step, which usually takes several cycles, MifC waits till the memory signals Brdy = 1. In the third step, the transfer is terminated. In addition, MifC has to ensure that a new request is only started if in the previous cycle the signals reqp and Brdy were inactive. Since the accesses are not overlapped, this condition is satisfied even without special precautions.
The signals req and mbusy are generated by a Mealy automaton which is modeled by the FSD of figure 6.6 and table 6.2. According to section 2.6, cost and delay of the automaton depend on the parameters listed in table 6.3 and on the accumulated delay of its inputs Brdy and mem. Let $C_{Mealy}(MifC)$ denote the cost of the automaton, then

$C_{MifC} = C_{Mealy}(MifC) + C_{or}$

Figure 6.6: FSD underlying the Mealy automaton MifC; the initial state is start. The arcs are start → wait (D1), wait → wait (D2), wait → finish (D3), and finish → start (else).

Table 6.2: Disjunctive normal forms DNF of the Mealy automaton of MifC

 DNF   source state   target state   monomial m ∈ M   length l(m)
 D1    start          wait           mem              1
 D2    wait           wait           /Brdy            1
 D3    wait           finish         Brdy             1

 DNF   Mealy signal   state          m ∈ M            l(m)
 D1    req            start          mem              (1)
 D4    mbusy          start          mem              (1)
                      wait           1                0

The input Brdy only affects the next state of the automaton, but input mem also affects the computation of the Mealy outputs. Let the main memory provide the handshake signals with an accumulated delay of $A_M(Brdy)$, and let the bus have a delay of $d_{bus}$. The inputs and outputs of the automaton then have the following delays:

$A_{in}(MifC) = A_M(Brdy) + d_{bus}$   (next state)
$A_{in1}(MifC) = A_{CON}(mw, mr) + D_{or}$   (outputs only)
$A_{MifC} = A_{Mealy}(out)$
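The behavior of the automaton described by the FSD and its DNFs can be traced with a small simulation. This is a sketch under two assumptions: the state finish returns to start unconditionally (the "else" arc), and mem stays active for the whole transfer. The function names are hypothetical.

```python
def mifc_step(state, mem, brdy):
    """One cycle of the MifC Mealy automaton; returns (next_state, req, mbusy)."""
    if state == "start":
        req = mbusy = mem                        # D1 / D4: active iff mem = 1
        nxt = "wait" if mem else "start"
    elif state == "wait":
        req, mbusy = 0, 1                        # mbusy holds the DLX core
        nxt = "finish" if brdy else "wait"       # D3 on Brdy, D2 on /Brdy
    else:  # finish: last cycle of the transfer
        req, mbusy = 0, 0
        nxt = "start"
    return nxt, req, mbusy

def run(brdy_seq, mem=1):
    """Trace (state, req, mbusy) per cycle for a given sequence of Brdy values."""
    state, trace = "start", []
    for brdy in brdy_seq:
        nxt, req, mbusy = mifc_step(state, mem, brdy)
        trace.append((state, req, mbusy))
        state = nxt
    return trace

for row in run([0, 0, 1, 0]):
    print(row)
```

In the trace, req is raised only in the first cycle and mbusy drops only in the final cycle, which matches the description of the handshake signals given above.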

  +%
Table 6.4 lists the cost of the DLX design and of the environments affected by the change of the memory interface. The new memory interface is fairly cheap and therefore has only a minor impact on the cost of the whole DLX design.

Table 6.3: Parameters of the Mealy automaton used in the control MifC

 # states k                 3
 # inputs σ                 2
 # outputs γ                2
 frequency of outputs       ν_sum = 3, ν_max = 2
 fanin of the states        fan_max = 2, fan_sum = 3
 # monomials #M             3
 length of monomials        l_sum = 3, l_max = 1

Table 6.4: Cost of the memory interface Mif, of the data paths DP, of the control CON and of the whole DLX for the two memory interfaces.

                          Mif     DP      CON     DLX
 old memory interface     -       10846   1105    11951
 new memory interface     320     11166   1170    12336
 increase [%]                     +3      +6      +3

Cycle Time The cycle time $\tau_{DLX}$ of the DLX design is the maximum of three times, namely: the cycle time $T_{CON}$ required by the control unit, the time $T_M$ of a memory access, and the time $T_{DP}$ for all CPU internal cycles.

$\tau_{DLX} = \max\{T_{CON}, T_M, T_{DP}\}$

The connection of the DLX to an off-chip memory system only affects the memory environment and the memory control. Thus, the formulae of $T_{CON}$ and $T_M$ need to be adapted, whereas the formula of $T_{DP}$ remains unchanged.
So far, time $T_{CON}$ accounted for the update of the main control automaton ($T_{auto}$) and for the cycle time $T_{stall}$ of the stall engine. The handling of the bus protocol requires a Mealy automaton, which needs to be updated as well; that takes $T_{Mealy}(MifC)$ delays. In addition, the new automaton provides signal mbusy to the stall engine. Therefore,

$T_{stall} = A_{MifC} + D_{stall} + \max\{D_{ram3}(32, 32), D_{ff}\} + \delta$

and the control unit now requires a minimal cycle time of

$T_{CON} = \max\{T_{auto}, T_{stall}, T_{Mealy}(MifC)\}$

&*
  (
Timing of Memory Accesses The delay formula of a memory access changes in a major way. For the timing, we assume that the off-chip memory is controlled by an automaton which precomputes its outputs. We further assume that the control inputs which the off-chip memory receives through the memory bus add $d_{Mhsh}$ (memory handshake) delays to the cycle time of its automaton.
The memory interface starts the transfer by sending the address and the request signal req to the off-chip memory. The handshake signals of the DLX processor are valid $A_{MifC}$ delays after the start of the cycle. Forwarding signal req and address MA to the memory bus and off-chip takes another $D_{driv} + d_{bus}$ delays, and the processing of the handshake signals adds $d_{Mhsh}$ delays. Thus, the transfer request takes

$T_{Mreq} = A_{MifC} + D_{driv} + d_{bus} + d_{Mhsh} + \Delta$

After the request, the memory performs the actual access. On a read access, the memory reads the memory word, which on a 64 MB memory takes $D_{MM}(64MB)$ gate delays. The memory then puts the data on the bus through a tristate driver. The memory interface receives the data and forwards them to the data paths where they are clocked into registers. The read cycle therefore takes at least

$T_{Mread} = D_{MM}(64MB) + D_{driv} + d_{bus} + \Delta$

In case of a write access, the memory interface first requests a transfer. In the following cycles, the memory interface sends the data MDin and the byte enable bits. Once the off-chip memory receives these data, it performs the access. The cycle time of the actual write access (without the memory request) can be estimated as

$T_{Mwrite} = \max\{A_{MC},\ A_{MifC} + D_{driv}\} + d_{bus} + D_{MM}(64MB) + \delta$
$T_{Maccess} = \max\{T_{Mwrite}, T_{Mread}\} = T_{Mwrite}$

Table 6.5 lists the cycle times of the data paths and control, as well as the access and request time of the memory system, assuming a bus delay of $d_{bus} = 15$ and $d_{Mhsh} = 10$. The access time of the memory depends on the version of the DRAM used.
The control and the memory transfer time are less time critical. They can tolerate a bus delay and handshake delay of $d_{bus} + d_{Mhsh} = 56$ before they slow down the DLX processor. However, the actual memory access takes much longer than the other cycles, even with the fastest DRAM ($\alpha = 4$). In order to achieve a reasonable processor cycle time, the actual memory access is performed in W cycles; the whole transfer takes W + 1 cycles.
Table 6.5: Cycle time of the DLX design and of its main parts, which are the data paths DP, the control unit CON and the memory system MM.

 T_DP   T_CON = max(A, B)          T_Mreq                    T_Maccess
        A      B = 13 + d_bus      14 + d_bus + d_Mhsh       α = 4    α = 8    α = 16
 70     42     28                  39                        355      683      1339

The DLX design with a direct connection to the off-chip memory can then be operated at a cycle time of

$\tau_{DLX}(W) = \max\{T_{DP}, T_{CON}, T_M(W)\}$
$T_M(W) = \max\{T_{Mreq},\ \lceil T_{Maccess} / W \rceil\}$

Increasing the number W of wait states improves the cycle time of the DLX design, at least till $W = \lceil T_{Maccess} / T_{DP} \rceil$. For larger W, the main memory is no longer time critical, and a further increase of the wait states has no impact on the cycle time.
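The dependence of the cycle time on the number of wait states can be evaluated directly from the numbers of table 6.5. The following is a sketch; the constant and function names are hypothetical, and the ceiling in $T_M(W)$ is assumed because it reproduces the cycle times listed in table 6.7.

```python
# Values taken from table 6.5 (d_bus = 15, d_Mhsh = 10).
T_DP, T_CON, T_MREQ = 70, 42, 39
T_MACCESS = {4: 355, 8: 683, 16: 1339}     # indexed by the DRAM factor alpha

def ceil_div(a, b):
    return -(-a // b)                      # integer ceiling, avoids float rounding

def t_m(w, alpha):
    """Memory cycle time when the access is spread over w cycles."""
    return max(T_MREQ, ceil_div(T_MACCESS[alpha], w))

def tau_dlx(w, alpha):
    """Cycle time of the DLX design with w wait states."""
    return max(T_DP, T_CON, t_m(w, alpha))

def saturation_point(alpha):
    """Smallest w beyond which more wait states no longer help."""
    return ceil_div(T_MACCESS[alpha], T_DP)

for alpha in (4, 8, 16):
    w = saturation_point(alpha)
    print(alpha, w, tau_dlx(w - 1, alpha), tau_dlx(w, alpha))
```

For each DRAM version the cycle time reaches the data path limit of 70 gate delays exactly at the saturation point and stays there for larger W.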
According to section 4.6, the performance is modeled by the reciprocal of a benchmark's execution time, and on a sequential DLX design, the run time of a benchmark Be is the product of the instruction count IC(Be) of the benchmark, of the average cycles per instruction CPI, and of the cycle time $\tau_{DLX}$:

$T_{DLX}(Be) = IC(Be) \cdot CPI(Be, W) \cdot \tau_{DLX}(W)$

Increasing the number of wait states improves the cycle time, but it also increases the CPI ratio. Thus, there is a trade-off between cycle time and cycle count which we now quantify based on SPECint92 benchmark workloads. Table 6.6 lists the DLX instruction mix of these workloads and the number of cycles required per instruction. According to formula (4.8) from section 4.6, the benchmarks compress and li and the average SPECint92 workload, for example, achieve the following CPI ratios:

$CPI(compress) = 4.19 + 1.25 \cdot W$
$CPI(li) = 4.38 + 1.49 \cdot W$
$CPI(SPECint) = 4.26 + 1.34 \cdot W$

Table 6.6 DLX instruction mix (in %) for the SPECint92 programs and for the
average workload. CPI_I denotes the average number of cycles required by
instruction I.

instruction  CPI_I     compress  eqntott  espresso  gcc    li     AV
load         5 + 2·W   19.9      30.7     21.1      23.0   31.6   25.3
store        5 + 2·W    5.6       0.6      5.1      14.4   16.9    8.5
compute      4 + 1·W   55.4      42.8     57.2      47.1   28.3   46.2
call         5 + 1·W    0.1       0.5      0.4       1.1    3.1    1.0
jump         3 + 1·W    1.6       1.4      1.0       2.8    5.3    2.4
taken        4 + 1·W   12.7      17.0      9.1       7.0    7.0   10.6
untaken      3 + 1·W    4.7       7.0      6.1       4.6    7.8    6.0

We increase the number of wait states $W$ from 1 to $\lceil T_{Maccess} / T_{DP} \rceil$ and
study the impact on the cycle time, on the CPI ratio and on the run time of
the benchmarks. Since the instruction count IC of the benchmarks remains
the same, table 6.7 lists the average time required per instruction

$$TPI = \frac{T}{IC} = CPI \cdot \tau_{DLX}$$

instead of the run time. The CPI ratio and the TPI ratio vary with the
workload, but the optimal number of wait states is the same for all the
benchmarks of the SPECint92 suite.

Table 6.7 Performance of the DLX core on the compress and li benchmarks and
on the average SPECint92 workload. Parameter α denotes the factor by which
off-chip DRAM is slower than standard SRAM.

DRAM              compress         li               SPEC aver.
α     W    τDLX   CPI    TPI       CPI    TPI       CPI    TPI
4      1   355     5.4   1934.0     5.9   2083.8     5.6   1988.7
4      2   178     6.7   1193.1     7.4   1309.2     6.9   1235.3
4      3   119     8.0    947.0     8.8   1052.0     8.3    985.1
4      4    89     9.2    820.0    10.3    918.9     9.6    855.8
4      5    71    10.5    743.2    11.8    838.5    11.0    777.7
4      6    70    11.7    820.6    13.3    930.6    12.3    860.4
8      9    76    15.5   1177.1    17.8   1349.0    16.3   1239.3
8     10    70    16.7   1172.0    19.2   1346.5    17.6   1235.1
16    19    71    28.0   1990.7    32.6   2314.6    29.7   2107.7
16    20    70    29.3   2050.5    34.1   2386.0    31.0   2171.7
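The CPI and TPI figures can be recomputed from the instruction mix and the cycle counts per instruction. The sketch below does this for three of the workloads; it assumes the mix percentages and the timing values quoted in tables 6.5 and 6.6, and all identifiers are ours.

```python
# Trade-off between cycle time and CPI for the sequential DLX with W wait
# states. Numbers are copied from tables 6.5 and 6.6; names are illustrative.
T_DP, T_CON, T_MREQ = 70, 42, 39
T_MACCESS = {4: 355, 8: 683, 16: 1339}

# instruction class -> (constant part, factor of W), i.e. CPI_I = const + factor * W
CYCLES = {"load": (5, 2), "store": (5, 2), "compute": (4, 1), "call": (5, 1),
          "jump": (3, 1), "taken": (4, 1), "untaken": (3, 1)}

# instruction mix in percent
MIX = {
    "compress": {"load": 19.9, "store": 5.6, "compute": 55.4, "call": 0.1,
                 "jump": 1.6, "taken": 12.7, "untaken": 4.7},
    "li": {"load": 31.6, "store": 16.9, "compute": 28.3, "call": 3.1,
           "jump": 5.3, "taken": 7.0, "untaken": 7.8},
    "SPECint": {"load": 25.3, "store": 8.5, "compute": 46.2, "call": 1.0,
                "jump": 2.4, "taken": 10.6, "untaken": 6.0},
}


def tau_dlx(alpha: int, w: int) -> int:
    """Cycle time for DRAM factor alpha and w wait states."""
    return max(T_DP, T_CON, T_MREQ, -(-T_MACCESS[alpha] // w))


def cpi(mix: dict, w: int) -> float:
    """Average cycles per instruction of a workload."""
    return sum(p / 100 * (CYCLES[i][0] + CYCLES[i][1] * w) for i, p in mix.items())


def tpi(mix: dict, alpha: int, w: int) -> float:
    """Average time per instruction, TPI = CPI * tau_DLX."""
    return cpi(mix, w) * tau_dlx(alpha, w)


def wait_share(mix: dict, w: int) -> float:
    """Fraction of all cycles that are wait states."""
    stall = sum(p / 100 * CYCLES[i][1] * w for i, p in mix.items())
    return stall / cpi(mix, w)


def best_wait_states(mix: dict, alpha: int) -> int:
    """Number of wait states with the smallest TPI."""
    w_max = -(-T_MACCESS[alpha] // T_DP)
    return min(range(1, w_max + 1), key=lambda w: tpi(mix, alpha, w))


if __name__ == "__main__":
    for alpha in (4, 8, 16):
        w = best_wait_states(MIX["SPECint"], alpha)
        print(alpha, w, round(tpi(MIX["SPECint"], alpha, w), 1),
              round(100 * wait_share(MIX["SPECint"], w)))
```

Because the mix is given with one decimal digit, the recomputed values agree with the table only up to rounding.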
On fast DRAM ($\alpha = 4$), the best performance is achieved on a memory
system with five wait states. The DLX system then spends about 61%
($1.34 \cdot 5 / 11.0$) of the run time waiting for the off-chip memory. On the
slower DRAM with $\alpha = 8$ (16), the memory is operated with 10 (19) wait
states, and the DLX even waits 76% (86%) of the time.

Table 6.8 Typical memory hierarchy of a large workstation in 1995

level         size      location  technology                    access time
register      < 1 KB    on-chip   custom memory, CMOS / BiCMOS  2–5 ns
L1 cache      < 64 KB   on-chip   CMOS SRAM                     3–10 ns
L2 cache      < 4 MB    off-chip  CMOS SRAM                     3–10 ns
main memory   < 4 GB    off-chip  CMOS DRAM                     80–400 ns
disk storage  > 1 GB    off-chip  Magnetic disk                 5 ms
Thus, a large, monolithic memory is necessarily slow, and even in a
sequential processor design, it causes the processor to wait most of the time.
Pipelining can increase the performance of a processor significantly, but
only if the average latency of the memory system is short ($W < 2$). Thus,
the monolithic memory is too slow to make pipelining worthwhile, and
the restriction to a single memory port makes things even worse. In the
next section, we therefore analyze whether a hierarchical memory system
is better suited.

6.2 The Memory Hierarchy

Presently (2000), a low to mid range desktop machine has about 64
to 128 MB of main memory. In order to provide that much memory at
reasonable cost and high speed, all commercial designs use a memory
hierarchy. Between the on-chip register files and the off-chip main memory,
several levels of memory are placed (table 6.8, taken from [HP96]).
The levels close to the CPU are called cache.
As one goes down the hierarchy, the cost per bit decreases and the stor-
age capacity and the access time increase. This is achieved by changing
the type of memory and the technology. With respect to the memory type,
one switches from fast on-chip SRAM (static random access memory) to
off-chip SRAM, to DRAM (dynamic RAM) and then to disks and tapes.
On a memory access, the processor first accesses the first level (L1)
cache. When the requested data is in the L1 cache, a hit occurs and the
data is accessed at the speed of the L1 cache. When the data is not in this
memory level, a miss occurs and the hardware itself forwards the request
to the next level in the memory hierarchy till the data is finally found.
A well designed multi-level memory system gives the user the illusion
that the whole main memory runs roughly at the speed of the L1 cache. The
key to this temporal behavior is the locality of memory references (section
6.2.1). In addition, the levels of the memory hierarchy are transparent, i.e.,
invisible to the user. For the levels between the CPU and the main memory,
this is achieved by caching (section 6.2.2).
In a hierarchical memory system, special attention has to be paid to the
following aspects:
• The identification of a memory reference, i.e., how a memory
  reference can be found in the memory hierarchy.
• The placement policy determines where the data is placed in a
  particular memory level.
• If a particular level of the memory hierarchy is full, new data can
  only be brought into this level if another entry is evicted. The
  replacement policy determines which one to replace.
• The allocation policy determines under which circumstances data is
  transferred to the next higher level of the hierarchy.
• The write policy determines which levels of the memory hierarchy
  are updated on a write access.
• The initialization of the cache after power-up.
The transfer between two neighboring levels of RAM memory always
follows the same scheme.¹ For simplicity, we therefore focus on a two-level
memory system, i.e., an L1 cache backed by the main memory.

6.2.1 The Principle of Locality

The key to the nice temporal behavior of multi-level memory is a principle
known as locality of reference [Den68]. This principle states that the
memory references, both for instructions and data, tend to cluster. These
clusters change over time, but over a short time period, the processor
primarily works on a few clusters of references. Locality in references comes
in two flavors:

• Temporal Locality: After referencing a sequence S of memory locations,
  it is very likely that the following memory accesses will also
  reference locations of sequence S.

• Spatial Locality: After an access to a particular memory location s,
  it is very likely that within the next several references an access is
  made to location s or a neighboring location.

¹ Additional considerations come into play when one level is not a random
access memory, like disks or tapes.

For the instruction fetches, the clustering of the references is plausible
for the following two reasons: First, the flow of control is only changed by
control instructions (e.g., branch, trap, and call) and interrupts, but these
instructions are only a small fraction of all executed instructions. In the
SPEC benchmarks, for example, the control instructions account for 15%
of all instructions, on average [HP96]. Second, most iterative constructs,
like loops and recursive procedures, consist of a relatively small number of
instructions which are repeated many times. Thus, in the SPEC benchmarks,
90% of the execution time is spent in 10 to 15% of the code [HP96].
For the data accesses, the clustering is harder to understand, but has for
example been observed in [Den80, CO76]. The clustering occurs because
much of the computation involves data structures, such as arrays or se-
quences of records. In many cases, successive references to these data
structures will be to closely located data items.
Hierarchical memory designs benefit from the locality of references in
the following two ways: Starting a memory transfer requires more time
than the actual transfer itself. Thus, fetching larger blocks from the next
level of the memory hierarchy saves time, if the additional data is also
used later on. Due to spatial locality, this will often be the case. Temporal
locality states that once a memory item is brought into the fast memory,
this item is likely to be used several times before it is evicted. Thus, the
initial slow access is amortized by the fast accesses which follow.

6.2.2 The Principles of Caches

All our designs use byte addressable memory. Let the main memory size be
$2^m$ bytes, and let the cache size be $2^c$ bytes. The cache is much smaller than
the main memory: $2^c \ll 2^m$. The unit of data (bytes) transferred between
the cache and the main memory is called block or cache line. In order
to make use of spatial locality, the cache line usually comprises several
memory data; the line size specifies how many. The cache size therefore
equals

$$2^c = \#\,\mathrm{lines} \cdot \mathrm{line\ size}.$$

The cache lines are organized in one of three ways, namely: direct mapped,
set associative, or fully associative.

Direct Mapped Cache

For every memory address $a = a[m-1:0]$, the placement policy
specifies a set of cache locations. When the data with memory address a is
brought into the cache, it is stored at one of these locations. In the simplest
case, all the sets have cardinality one, and the memory address a is mapped
to cache address

$$ca = ca[c-1:0] = a \bmod 2^c = a[m-1:0] \bmod 2^c = a[c-1:0],$$

i.e., the memory address is taken modulo the cache size.


A cache which implements this placement policy is called direct mapped
cache. The replacement policy of such a cache is trivial, because there is
only one possible cache location per memory address. Thus, the requested
cache line is either empty, or the old entry must be evicted.
Since the cache is much smaller than the main memory, several memory
locations are mapped to the same cache entry. At any given time, one needs
to know whether a cache entry with address ca holds valid memory data,
but that is not enough. If the entry is valid ($valid(ca) = 1$), one also needs
to know the corresponding memory address $madr(ca)$. The cache data C
with address ca then stores the following memory data:

$$C(ca) = M(madr(ca)).$$

Since the cache is direct mapped, the c least significant bits of the two
addresses ca and $a = madr(ca)$ are the same, and one only needs to store
the leading $m - c$ bits of the memory address as tag:

$$tag(ca) = a[m-1:c]$$
$$madr(ca) = a[m-1:0] = tag(ca)\; ca[c-1:0].$$

A cache line therefore comprises three fields, the valid flag, the address
tag, and the data (figure 6.7). Valid flag and tag are also called the directory
information of the cache line. Note that each of the $2^l$ cache lines holds
line-size many memory data, but the cache only provides a single tag and
valid bit per line. Let the cache address ca be a line boundary, i.e., ca is
divisible by the line size $2^o$, then

$$valid(ca) = valid(ca + 1) = \dots = valid(ca + 2^o - 1)$$
$$tag(ca) = tag(ca + 1) = \dots = tag(ca + 2^o - 1).$$

Thus, all the bytes of a cache line must belong to consecutive memory
addresses:

$$Cline(ca) = C[ca + 2^o - 1 : ca] = M[madr(ca) + 2^o - 1 : madr(ca)].$$

[Figure 6.7: Organization of a byte addressable, direct mapped cache. The
memory address is split into tag (t bits), line address (l bits) and line offset
(o bits); line address and line offset form the cache address. The cache
directory holds a valid flag and a tag per line. The cache comprises $2^l$ lines,
each of which is $2^o$ bytes wide.]

Such a data structure makes it straightforward to detect whether the
memory data with address a is in the cache or not. If the data is in the
direct mapped cache, it must be stored at cache address $ca = a[c-1:0]$.
The cache access is a hit, if the entry is valid and if the tag matches the
high-order bits of the memory address, i.e.,

$$hit = valid(ca) \wedge (tag(ca) = a[m-1:c]).$$

On a read access with address ca, the cache provides the valid flag
$v = valid(ca)$, the tag $t = tag(ca)$ and the data

$$d = Cline(ca[c-1:o]\; 0^o).$$

Each field of the cache line, i.e., valid flag, tag and data, can be updated
separately. A write access to the cache data can update as little as a single
byte but no more than the whole line.
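The address decomposition and the hit condition of a direct mapped cache can be illustrated by a small software model. This is only a sketch of the directory (valid flags and tags) under the definitions above; the class and method names are ours, and the cache data are not modeled.

```python
# Minimal model of the directory of a direct mapped cache with 2**l lines
# of 2**o bytes each (cache size 2**c with c = l + o). Names are illustrative.
class DirectMappedCache:
    def __init__(self, l: int, o: int):
        self.l, self.o = l, o
        self.valid = [False] * (1 << l)   # all lines invalid after initialization
        self.tag = [0] * (1 << l)

    def split(self, a: int):
        """Split memory address a into (tag, line address, line offset)."""
        offset = a & ((1 << self.o) - 1)
        line = (a >> self.o) & ((1 << self.l) - 1)
        tag = a >> (self.o + self.l)          # the bits a[m-1:c]
        return tag, line, offset

    def hit(self, a: int) -> bool:
        """hit = valid(ca) and tag(ca) == a[m-1:c]."""
        tag, line, _ = self.split(a)
        return self.valid[line] and self.tag[line] == tag

    def fill(self, a: int) -> None:
        """Bring the line containing a into the cache; the old entry is evicted."""
        tag, line, _ = self.split(a)
        self.valid[line], self.tag[line] = True, tag


if __name__ == "__main__":
    cache = DirectMappedCache(l=2, o=4)       # 4 lines of 16 bytes
    cache.fill(0x123)
    print(cache.hit(0x12F), cache.hit(0x163))
```

In the usage example, address 0x12F lies in the same line as 0x123 and hits, whereas 0x163 maps to the same cache line with a different tag and therefore misses.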

  


Sectored Cache

The cache line can be very wide, because it holds several memory data. In
order to reduce the width of the cache data RAM, the line is broken into
several ($2^s$) sectors which are stored in consecutive cells of the cache data
RAM. However, all the sectors of a cache line still have the same tag and
valid flag.² The line offset in the memory address is split accordingly into
an s-bit sector address and a b-bit sector offset, where $o = s + b$. Figure
6.8 depicts the organization of such a direct mapped cache.

[Figure 6.8: Organization of a byte addressable, direct mapped cache with
sectors. The memory address is split into tag (t bits), line address (l bits),
sector address (s bits) and sector offset (b bits); sector address and sector
offset form the line offset. Each cache line comprises $2^s$ sectors, each of
which is $2^b$ bytes wide.]

With sectoring, the largest amount of cache data to be accessed in parallel
is a sector, not a whole line. Thus, on a read access with address ca the
sectored cache provides the data

$$d = Csector(ca[c-1:b]\; 0^b) = C[ca[c-1:b]\; 0^b + 2^b - 1 : ca[c-1:b]\; 0^b].$$

k-way Set Associative Cache


A k-way set associative cache provides a set of k possible cache locations
for a memory address. Such an associative cache comprises k ways (figure
6.9), which are referenced with the same cache address. Each way is a
direct mapped cache with directory and cache data providing exactly one
cache position per cache address ca. These k positions form the set of ca.

[Figure 6.9: Organization of a byte addressable, k-way set associative cache.
The memory address is split into tag (t bits), line address (l bits) and line
offset (o bits). The cache comprises k ways (direct mapped caches). Each way
holds $2^l$ lines which are $2^o$ bytes wide; $v^i$, $t^i$ and $d^i$ denote the valid flag,
the tag and the data of way i.]

There are two special cases of k-way set associative caches:
• For k = 1, the cache comprises exactly one way; the cache is direct
  mapped.
• If there is only one set, i.e., each way holds a single line, then each
  cache entry is held in a separate way. Such a cache is called fully
  associative.
The associativity of a set associative, first level cache is typically 2 or 4.
Occasionally a higher associativity is used. For example, the PowerPC
uses an 8-way cache [WS94] and the SuperSPARC uses a 5-way instruction
cache [Sun92]. Of course, the cache line of a set associative cache can
be sectored like a line in a direct mapped cache. For simplicity's sake, we
describe a non-sectored, set associative cache. We leave the extension of
the specification to sectored caches as an exercise (see exercise 6.1).

² Some cache designs allow each sector to have its own valid flag, in order to
fetch only some sectors of a line.

 :
Let l denote the width of the line address, and let o denote the width of
the line offset. Each way then comprises $2^l$ lines, and the whole cache
comprises $2^l$ sets. Since in a byte addressable cache, the lines are still $2^o$
bytes wide, each way has a storage capacity of

$$size(way) = 2^{c'} = 2^l \cdot 2^o$$

bytes. The size (in bytes) of the whole k-way set associative cache equals

$$k \cdot size(way) = k \cdot 2^l \cdot 2^o.$$

Since in a k-way set associative cache there are several possible cache
positions for a memory address a, it becomes more complicated to find
the proper entry, and the placement and replacement policies are no longer
trivial. However, the placement is such that at any given time, a memory
address is mapped to at most one cache position.

.  4     1 %


The set of a set associative cache corresponds to the line of a direct mapped
cache (way). Thus, the cache address ca is computed as the memory address
a modulo the size $2^{c'}$ of a cache way:

$$ca[c'-1:0] = a[c'-1:0].$$

For this address ca, every way provides data $d^i$, a valid flag $v^i$, and a tag $t^i$:

$$v^i = valid^i(ca), \qquad t^i = tag^i(ca), \qquad d^i = Cline^i(ca[c'-1:o]\; 0^o).$$

A local hit signal $h^i$ indicates whether the requested data is held in way i
or not. This local hit signal can be generated as

$$h^i = v^i \wedge (t^i = a[m-1:m-t]).$$

In a set associative cache, a hit occurs if one of the k ways encounters a hit,
i.e.,

$$hit = h^0 \vee h^1 \vee \dots \vee h^{k-1}.$$

On a cache hit, exactly one local hit signal $h^j$ is active, and the
corresponding way j holds the requested data d. On a miss, the cache provides
an arbitrary value, e.g., $d = 0$. Thus,

$$d = \begin{cases} d^j & \text{if } hit = 1 \text{ and } h^j = 1 \\ 0 & \text{if } hit = 0 \end{cases} \;=\; \bigvee_{i=0}^{k-1} (d^i \wedge h^i).$$

Line Replacement
In case of a miss, the requested data is not in the cache, and a new line
must be brought in. The replacement policy specifies which way gets the
new line. The selection is usually done as follows:

1. As long as there are vacant lines in the set, the replacement circuit
   picks one of them, for example, the way with the smallest address.

2. If the set is full, a line must be evicted; the replacement policy
   suggests which one. The two most common policies are the following:

   • LRU replacement picks the line which was least recently used.
     For each set, additional history flags are required which store
     the current ordering of the k ways. This cache history must be
     updated on every cache access, i.e., on a cache hit and on a line
     replacement.
   • Random replacement picks a random line of the set and therefore
     manages without cache history.
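The interplay of hit detection and LRU replacement can be sketched in software. The model below is a simplification and an assumption of ours: instead of fixed ways with valid flags and separate history flags, it keeps for each set the list of resident tags ordered from least to most recently used, which yields the same hit and eviction behavior.

```python
# Behavioral sketch of a k-way set associative cache with LRU replacement.
# Each set is a list of tags ordered from least to most recently used; the
# list order plays the role of the cache history. Names are illustrative.
class SetAssociativeCache:
    def __init__(self, k: int, l: int, o: int):
        self.k, self.l, self.o = k, l, o
        self.sets = [[] for _ in range(1 << l)]

    def access(self, a: int) -> bool:
        """Access address a; return True on a hit. A miss allocates the line."""
        index = (a >> self.o) & ((1 << self.l) - 1)   # line (set) address
        tag = a >> (self.o + self.l)
        ways = self.sets[index]
        hit = tag in ways                 # OR over the local hit signals
        if hit:
            ways.remove(tag)              # re-appended below as most recent
        elif len(ways) == self.k:
            ways.pop(0)                   # set is full: evict the LRU line
        ways.append(tag)                  # history update on every access
        return hit


if __name__ == "__main__":
    cache = SetAssociativeCache(k=2, l=0, o=2)    # one set, two ways
    print([cache.access(a) for a in (0, 4, 0, 8, 0, 4)])
```

In the usage example, the access to address 8 evicts the line of address 4, because the line of address 0 was used more recently.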

Allocation and Write Policy


The allocation policy distinguishes between read and write accesses. New
data is only brought into the cache on a miss, i.e., if the referenced data
is not in the cache. Besides the requested data, the whole line which cor-
responds to that data is fetched from the next level of the memory hierar-
chy. The cache allocation operation can either be combined with the actual
cache access (forwarding), or the cache access must be re-started after the
allocation.
In case the miss occurs on a read access, i.e., on an instruction fetch or
on a load operation, the requested data is always brought into the cache.
For write accesses, three different types of allocation policies are possible:

1. Read Allocate: A write hit always updates the data RAM of the
cache. On a write miss, the requested data and the corresponding
line will not be transferred into the cache. Thus, new data is only
brought in on a read miss.

2. Write Allocate: A write always updates the data RAM of the cache.
In case of a write miss, the referenced line is first transferred from
the memory into the cache, and then the cache line is updated. This
policy allocates new lines on every cache miss.

3. Write Invalidate: A write never updates the data RAM of the cache.
On the contrary, in case of a write hit, the write even invalidates the
cache line. This allocation policy is less frequently used.

A particular piece of data can be stored in several levels of the hierarchical
memory system. In order to keep the memories consistent, a write
access must update all the instances of the data. For our two-level memory
system, this means that on a write hit, cache and main memory are
updated. However, the main memory is rather slow. From a performance
point of view, it is therefore desirable to hide the main memory updates or
even avoid some of them. The latter results in a weak memory consistency.
The write policy specifies which of the two consistency models should be
used:

1. Write Through supports the strong consistency model. A write always
   updates the main memory. Write buffers between cache and
   main memory allow the processor to go on, while the main memory
   performs the update. Thus, the slow memory updates can largely
   be hidden. The update of the cache depends on the allocation policy.
   Write through can be combined with any of the three allocation
   policies.

2. Write Back applies the weak consistency model. A write hit only
   updates the cache. A dirty flag indicates that a particular line has
   been updated in the cache but not in the main memory. The main
   memory keeps the old data till the whole line is copied back. This
   either occurs when a dirty cache line is evicted or on a special update
   request. This write policy can be combined with read allocate and
   write allocate but not with write invalidate (exercises in section 6.7).

Table 6.9 lists the possible combinations of the allocation and write policies.

Table 6.9 Combinations of write and allocation policies for caches.

               Write Allocate  Read Allocate  Write Invalidate
Write Through  +               +              +
Write Back     +               +              –

. :   .  


After power-up, all the cache RAMs hold binary but arbitrary values, and
the information stored in a cache line is invalid even if the corresponding
valid flag is raised. Thus, the valid flags must be cleared under hardware
control, before starting the actual program execution. In case that
the replacement policy relies on a cache history, the history RAM must be
initialized as well. This initialization depends on the type of the history
information.
Besides reads and writes, the cache usually supports a third type of access,
namely the line invalidation. If the line invalidation access
is a hit, the corresponding cache line is evicted, i.e., its valid flag is cleared,
and the history is updated as well. In case of a miss, the invalidation access
has no impact on the cache. Line invalidation is necessary if a particular
level of the memory system comprises more than one cache, as will be
the case in our pipelined DLX design (section 6.5). In that situation, line
invalidation is used in order to ensure that a particular memory word is
stored in at most one of those parallel caches.

[Figure 6.10: Cache accesses of the memory transactions read, write and line
invalidate on a sectored, write through cache with write allocation. The flow
chart comprises the states Cache Read, Fill Request, Line Fill, Last Sector,
Cache Update, Update M and Invalidate.]

6.2.3 Execution of Memory Transactions

The cache as part of the memory hierarchy has to support four types of
memory transactions which are reading ($mr = 1$) or writing ($mw = 1$) a
memory data, invalidating a cache line ($linv = 1$), and initializing the whole
cache. Except for the initialization, any of the memory transactions is
performed as a sequence of the following basic cache accesses:

• reading a cache sector including cache data, tag and valid flag,
• updating the cache directory, and
• updating a sector of the cache data.

The sequences of the memory transactions depend on the allocation and
write policies. The flow chart of figure 6.10 depicts the sequences for a
sectored, write through cache with write allocation and read forwarding.
Each transaction starts in state Cache Read. A cache line is S sectors wide.
Let the sector boundary a denote the memory address of the transaction,
and let ca denote the corresponding cache address.

Read Transaction


The read transaction starts with a Cache Read access. The cache generates
the hit signal $hit(ca)$ and updates the cache history. On a hit ($hit(ca) = 1$),
the cache determines the address way of the cache way which holds the
requested data, and it provides these data

$$d = Csector^{way}(ca) = Msector(a).$$

That already completes the read transfer.
In case of a miss, address way specifies the cache way chosen by the
replacement policy. In subsequent cycles, the cache performs a line fill,
i.e., it fetches the memory line $Mline(a')$ with $a' = a[m-1:o]\; 0^o$ sector
by sector and writes it at cache line address $ca' = ca[c-1:o]\; 0^o$.
This line fill starts with a Fill Request. The cache line is invalidated,
i.e., $valid^{way}(ca) = 0$. This ensures that a valid cache line always holds
consistent data, even if the line fill is interrupted in between. The cache
also requests the line from memory and clears a sector counter scnt.
In each of the next $S - 1$ Line Fill cycles, one sector of the line is written
into the cache and the sector counter is incremented:

$$Csector^{way}(ca' + scnt) := Msector(a' + scnt)$$
$$scnt := scnt + 1.$$

In the cycle Last Sector, the cache fetches the last sector of the line. Due to
forwarding, the requested data are provided at the data output of the cache.
In addition, the directory is updated, i.e., the new tag is stored in the tag
RAM and the valid flag is turned back on:

$$tag^{way}(ca) := a[m-1:m-t], \qquad valid^{way}(ca) := 1.$$

This is the last cycle of a read transaction which does not hit the cache.

Write Transaction


Like a read transaction, the write transaction starts with a Cache Read
access, which in case of a miss is followed by a line fill. The write transaction
then proceeds with a Cache Update cycle, in which a memory update is
requested and the cache sector is updated as

$$Csector^{way}(ca) := X_{B-1} \dots X_0$$

with

$$X_i = \begin{cases} byte_i(Csector^{way}(ca)) & \text{if } CDw[i] = 0 \\ byte_i(Din) & \text{if } CDw[i] = 1. \end{cases}$$

The transaction ends with an Update M cycle, in which the memory
performs the requested write update.
Line Invalidation Transaction
This transaction also starts with a Cache Read access, in order to check
whether the requested line is in the cache. In case of a miss, the line is not
in the cache, and the transaction ends after the Cache Read access. In case
of a hit, the line is invalidated in the next cycle (Invalidate):

$$valid^{way}(ca) := 0.$$
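The three transactions can be summarized by the sequence of states they pass through. The following sketch lists these sequences for a write through cache with write allocation, using the state names of figure 6.10; the function names and the representation of a sector as a byte string are assumptions made for illustration only.

```python
# State sequences of the memory transactions on a sectored, write through
# cache with write allocation (states as in figure 6.10). Illustrative names.
def transaction_states(kind: str, hit: bool, sectors: int) -> list:
    """Return the states visited by one transaction on lines of `sectors` sectors."""
    line_fill = ["Fill Request"] + ["Line Fill"] * (sectors - 1) + ["Last Sector"]
    states = ["Cache Read"]               # every transaction starts with a read
    if kind == "read":
        if not hit:
            states += line_fill
    elif kind == "write":
        if not hit:
            states += line_fill           # write allocate: fetch the line first
        states += ["Cache Update", "Update M"]
    elif kind == "invalidate":
        if hit:
            states.append("Invalidate")
    else:
        raise ValueError(f"unknown transaction: {kind}")
    return states


def merge_bytes(old: bytes, din: bytes, cdw: list) -> bytes:
    """Cache Update: byte i is taken from din if cdw[i] is set, else kept."""
    return bytes(d if w else o for o, d, w in zip(old, din, cdw))


if __name__ == "__main__":
    print(transaction_states("read", hit=False, sectors=4))
    print(merge_bytes(b"\x00\x11\x22\x33", b"\xaa\xbb\xcc\xdd",
                      [True, False, False, True]).hex())
```

A read miss on a line of S sectors thus takes S + 2 cache cycles, and a write miss takes two more.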

6.3 A Cache Design

A k-way set associative cache comprises k cache ways, each of which
is identical with a direct mapped cache. In a first step, we therefore
design a byte addressable, sectored, direct mapped cache. We then extend
the design to a k-way associative cache (section 6.3.2). Finally, the cache
is integrated into a cache interface which implements the write allocation,
write through policy.
The cache design must support all the cache accesses which according
to section 6.2.3 are required to perform the standard memory transactions.
In order to split the update of the directory and of the cache data, the tags,
valid flags and cache data are stored in separate RAMs. The cache is
controlled by the following signals:
• the \$rd flag which indicates a cache read access,
• the clear flag which clears the whole cache,
• the write signals Vw and Tw which enable the update of the cache
  directory (valid and tag), and
• the $B = 2^b$ bank write signals $CDw[B-1:0]$ which specify the bytes
  of the cache sector to be updated.
The cache gets a memory address $a = a[m-1:0]$, a valid flag, and a
B-byte data Di and provides a flag hit and data Do. The flag hit indicates
whether the requested memory data are held in the cache or not. As
depicted in figure 6.8, the memory address a is interpreted as
tag a_tag, line address a_line, sector address a_sector, and sector offset a_byte:

$$a\_tag = a[m-1 : l+s+b]$$
$$a\_line = a[l+s+b-1 : s+b]$$
$$a\_sector = a[s+b-1 : b]$$
$$a\_byte = a[b-1 : 0].$$
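The four address fields are plain bit slices of the memory address. A minimal sketch with shifts and masks follows; the function name and the example parameters are ours.

```python
# Splitting a memory address into the fields of a sectored cache.
# l, s and b are the widths of line address, sector address and sector offset.
def split_address(a: int, l: int, s: int, b: int):
    """Return (a_tag, a_line, a_sector, a_byte) of memory address a."""
    a_byte = a & ((1 << b) - 1)                 # a[b-1:0]
    a_sector = (a >> b) & ((1 << s) - 1)        # a[s+b-1:b]
    a_line = (a >> (s + b)) & ((1 << l) - 1)    # a[l+s+b-1:s+b]
    a_tag = a >> (l + s + b)                    # a[m-1:l+s+b]
    return a_tag, a_line, a_sector, a_byte


if __name__ == "__main__":
    # 256 lines, 4 sectors per line, 8 bytes per sector
    print(split_address(0x12345, l=8, s=2, b=3))
```

Concatenating the four fields again yields the original address, which is a convenient sanity check for the chosen field widths.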
[Figure 6.11: Byte addressable, direct mapped cache with $L = 2^l$ lines. The
cache line is organized in $S = 2^s$ sectors, each of which is $B = 2^b$ bytes wide.
The data paths comprise the valid RAM ($L \times 1$), the tag RAM ($L \times t$), the
data RAM (B banks of size $L \cdot S \times 8$), and the equality tester EQ(t+1).]

According to the FSD of figure 6.10, all the memory transactions start
with a cache read access (\$rd = 1); updates of the directory and of the
cache data only occur in later cycles. The design of the k-way set associative
cache will rely on this feature.

6.3.1 Design of a Direct Mapped Cache

Figure 6.11 depicts the data paths of a sectored, byte addressable, direct
mapped cache with $L = 2^l$ cache lines. The cache consists of valid, tag and
data RAMs and an equality tester. The valid RAM V and the t bits wide
tag RAM T form the cache directory.
Since all sectors of a line share the same tag and valid flag, they are only
stored once; the valid and tag RAM are of size $L \times 1$ and $L \times t$. Both RAMs
are referenced with the line address a_line. The write signals Vw and Tw
control the update of the directory. On $Tw = 0$ the tag RAM provides the
tag

$$tag = T[a\_line],$$

and on $Tw = 1$, the tag a_tag is written into the tag RAM:

$$T[a\_line] := a\_tag.$$

The valid RAM V is a special type of RAM which can be cleared in just a
few cycles.³ That allows for a fast initialization on reset. The RAM V is
cleared by activating signal clear. On $Vw = clear = 0$, it provides the flag

$$v = V[a\_line],$$

whereas on a write access, requested by $Vw = 1$ and $clear = 0$, the RAM
V performs the update

$$V[a\_line] := valid.$$

On every cache access, the equality tester EQ checks whether the line
entry is valid and whether the tag provided by the tag RAM matches the
tag a_tag. If that is the case, a hit is signaled:

$$hit = 1 \iff v \wedge (tag = a\_tag) \iff (v,\, tag) = (1,\, a\_tag).$$

The data portion of the cache line is organized in $S = 2^s$ sectors, each of
which is $B = 2^b$ bytes wide. The data RAM of the cache therefore holds
a total of $2^{l+s}$ sectors and is addressed with the line and sector addresses
a_line and a_sector. The cache is byte addressable, i.e., a single write can
update as little as a single byte but no more than a whole sector. In order to
account for the different widths of the writes, the data RAM is organized
in B banks. Each bank is a RAM of size $L \cdot S \times 8$, and is controlled by a
bank write signal $CDw[i]$.
On a read access, the B bank write signals are zero. In case of a hit, the
cache then provides the whole sector to the output Do:

$$Do = data[a\_line\; a\_sector + B - 1 : a\_line\; a\_sector].$$

If $CDw[B-1:0] \neq 0^B$ and if the access is a hit, the data RAMs are updated.
For every i with $CDw[i] = 1$, bank i performs the update

$$data[a\_line\; a\_sector + i] := byte_i(Di).$$

The cost of this direct mapped cache (1-way cache, \$1) runs at:

$$C_{\$1}(t, l, s, b) = C_{SRAM}(2^l, 1) + C_{SRAM}(2^l, t) + C_{EQ}(t+1) + 2^b \cdot C_{SRAM}(2^{l+s}, 8).$$

The cache itself delays the read/write access to its data RAMs and directory
and the detection of a hit by the following amount:

$$D_{\$1}(data) = D_{SRAM}(L \cdot S, 8)$$
$$D_{\$1}(dir) = \max\{D_{SRAM}(L, 1),\; D_{SRAM}(L, t)\}$$
$$D_{\$1}(hit) = D_{\$1}(dir) + D_{EQ}(t+1).$$

³ The IDT71B74 RAM, which is used in the cache system of the Intel i486 [Han93],
can be cleared in two to three cycles [Int96].
[Figure 6.12: Byte addressable, k-way set associative cache. The sectors of a
cache line are $B = 2^b$ bytes wide. The design comprises the k-way write signal
adapter Wadapt, the k direct mapped cache ways (way 0 to way k-1), the
replacement circuit Repl, and the k-way data select circuit Sel.]

6.3.2 Design of a Set Associative Cache

The core of a set associative cache (figure 6.12) are k sectored, direct
mapped caches with L lines each. The k cache ways provide the local
hit signals $h^i$, the valid flags $v^i$, and the local data $d^i$. Based on these
signals, the select circuit Sel generates the global hit signal and selects the
data output Do. An access only updates a single way. The write signal
adapter Wadapt therefore forwards the write signals Tw, Vw, and CDw to
this active cache way.
The replacement circuit Repl determines the address way of the active
cache way; the address is coded in unary. Since the active cache way
remains the same during the whole memory transaction, address way is
only computed during the first cycle of the transaction and is then buffered
in a register. This first cycle is always a cache read (\$rd). Altogether, the
cost of the k-way cache is:

$$C_{\$k}(t, l, s, b) = k \cdot C_{\$1}(t, l, s, b) + C_{Sel} + C_{Wadapt} + C_{Repl} + C_{ff}(k).$$

Data Select Circuit
Each cache way provides a local hit signal $h^i$, a valid flag $v^i$, and the local
data $d^i$. An access is a cache hit, if one of the k ways encounters a hit:

$$hit = h^0 \vee h^1 \vee \dots \vee h^{k-1}.$$

On a cache hit, exactly one local hit signal $h^i$ is active, and the corresponding
way i holds the requested data Do. Thus,

$$Do = \bigvee_{j=0}^{k-1} (d^j \wedge h^j).$$

When arranging these OR gates as a binary tree, the output Do and the hit
signal can be selected at the following cost and delay:

$$C_{Sel} = C_{tree}(k) \cdot C_{or} + 8B \cdot (C_{and} \cdot k + C_{tree}(k) \cdot C_{or})$$
$$D_{Sel} = D_{and} + D_{tree}(k) \cdot D_{or}.$$

Write Signal Adapter
Circuit Wadapt gets the write signals Tw, Vw and $CDw[B-1:0]$ which
request the update of the tag RAM, the valid RAM and the B data RAMs.
However, in a set associative cache, an access only updates the active cache
way. Therefore, the write signal adapter forwards the write signals to the
active way, and for the remaining $k - 1$ ways, it disables the write signals.
Register way provides the address of the active cache way coded in
unary. Thus, the write signals of way i are obtained by masking the signals
Tw, Vw and $CDw[B-1:0]$ with signal bit $way[i]$, e.g.,

$$Vw^i = \begin{cases} Vw & \text{if } way[i] = 1 \\ 0 & \text{if } way[i] = 0 \end{cases} \;=\; Vw \wedge way[i].$$

The original $B + 2$ write signals can then be adapted to the needs of the set
associative cache at the following cost and delay:

$$C_{Wadapt} = k \cdot C_{and} \cdot (B + 2)$$
$$D_{Wadapt} = D_{and}.$$

Replacement Circuit Repl

The replacement circuit Repl performs two major tasks. On every cache read access, it determines the address way of the active cache way and updates the cache history. On a cache miss, circuit Repl determines the eviction address ev; this is the address of the way which gets the new data.
The circuit Repl of figure 6.13 keeps a K-bit history vector for each set, where K = k · log k. The history is stored in an L × K RAM which is updated on a cache read access ($rd = 1) and by an active clear signal:
\[ Hw = \$rd \lor clear \]
On clear = 1, all the history vectors are initialized with the value Hid. Since the same value is written to all the RAM words, we assume that this initialization can be done in just a few cycles, as is the case for the valid RAM. Circuit LRUup determines the new history vector H' and the eviction address ev; circuit active selects the address way.

[Figure 6.13: Circuit Repl of a k-way set associative cache with LRU replacement]

Updating the cache history involves two consecutive RAM accesses, a read of the cache history followed by a write to the history RAM. In order to reduce the cycle time, the new history vector H' and the address are buffered in registers. The cache history is updated during the next cache read access. Since the cache history is read and written in parallel, the history RAM is dual ported, and a multiplexer forwards the new history vector H', if necessary. On clear = 1, register H' is initialized as well. The cost of circuit Repl can be expressed as:
\[ C_{Repl} = C_{SRAM2}(L,K) + C_{ff}(K) + 3 \cdot C_{mux}(K) + C_{ff}(l) + C_{EQ}(l) + C_{or} + C_{LRUup} + C_{active} \]
We now describe the circuits active and LRUup in detail.

Circuit active

On a cache hit, the active cache way is the way which holds the requested data; this is also the cache way which provides an active hit signal h_i. On a miss, the active way is specified by the eviction address ev which is provided by the cache history. Since ev is coded in binary, it is first decoded, providing value EV. Thus,
\[ way = \begin{cases} EV & \text{if } hit = 0 \\ h[k-1:0] = (h_{k-1}, \dots, h_0) & \text{if } hit = 1 \end{cases} \]
is the address of the active way coded in unary. The circuit active (figure 6.13) determines the address of the active way at the following cost and delay:
\[ \begin{aligned}
C_{active} &= C_{dec}(\log k) + C_{mux}(k) \\
D_{active}(ev) &= D_{dec}(\log k) + D_{mux}(k) \\
D_{active}(hit) &= D_{mux}(k)
\end{aligned} \]

Cache History

For each set l, circuit Repl keeps a history vector
\[ H_l = (H_l^0, \dots, H_l^{k-1}), \qquad H_l^i \in \{0, \dots, k-1\} \]
H_l is a permutation of the addresses 0, ..., k-1 of the cache ways; it provides an ordering of the k ways of set l. The elements of the vector H_l are arranged such that the data of way H_l^i was used more recently than the data of way H_l^{i+1}. Thus, H_l^0 (H_l^{k-1}) points to the data of set l which was most (least) recently used. In case of a miss, the cache history suggests the candidate for the line replacement. Due to LRU replacement, the least recently used entry is replaced; the eviction address ev equals H_l^{k-1}.

On power-up, the whole cache is invalidated, i.e., all the valid flags in the k direct mapped caches are cleared. The cache history holds binary but arbitrary values, and the history vectors H_l are usually not a permutation of the addresses 0, ..., k-1. In order to ensure that the cache comes up properly, all the history vectors must be initialized, e.g., by storing the identity permutation. Thus,
\[ Hid = (H^0, \dots, H^{k-1}), \qquad H^i = i \]

Updating the Cache History

The cache history must be updated on every cache read access, whether the access is a hit or a miss. The update of the history also depends on the type of memory transaction. Read and write accesses are treated alike; line invalidation is treated differently.
Let a read or write access hit the way H_l^i. This way is at position i in vector H_l. In the updated vector R, the way H_l^i is at the first position, the elements H_l^0, ..., H_l^{i-1} are shifted one position to the right, and all the other elements remain the same:

    H = ( H_l^0, ... , H_l^{i-1}, H_l^i,     H_l^{i+1}, ... , H_l^{k-1} )
    R = ( H_l^i, H_l^0, ... ,     H_l^{i-1}, H_l^{i+1}, ... , H_l^{k-1} )

    x = ( 0      ...    0         1          0          ...   0 )
    y = ( *      0      ...       0          1          ...   1 )

The meaning of the vectors x and y will be described shortly.


In case of a read/write miss, the line of way ev = H_l^{k-1} is replaced. Thus, all the elements of the history vector H_l are shifted one position to the right and ev is added at the first position:
\[ R = (ev, H_l^0, H_l^1, \dots, H_l^{k-2}) = (H_l^{k-1}, H_l^0, H_l^1, \dots, H_l^{k-2}) \]

If an invalidation access hits the way H_l^i, the cache line corresponding to way H_l^i is evicted and should be used at the next line fill. In the updated vector I, the way H_l^i is therefore placed at the last position, the elements H_l^{i+1}, ..., H_l^{k-1} are shifted one position to the left, and the other elements remain the same:

    H = ( H_l^0, ... , H_l^{i-1}, H_l^i,     H_l^{i+1}, ... , H_l^{k-1} )
    I = ( H_l^0, ... , H_l^{i-1}, H_l^{i+1}, ... , H_l^{k-1}, H_l^i )

If the invalidation access causes a cache miss, the requested line is not in the cache, and the history remains unchanged: I = H_l. Note that the vector I can be obtained by cyclically shifting vector R one position to the left:
\[ I = (R_1, \dots, R_{k-1}, R_0) \qquad (6.3) \]
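The three update rules and equation 6.3 can be checked with a short behavioral sketch. The function names `update_rw`, `update_inv` and `rotate_left` are hypothetical, and the history vector is modeled as a Python list with the most recently used way first.

```python
def update_rw(H, hit_pos=None):
    """Vector R: history after a read/write access.

    hit_pos is the position i of the hit way in H, or None on a miss.
    """
    # On a miss the least recently used way (last position) is recycled.
    i = len(H) - 1 if hit_pos is None else hit_pos
    return [H[i]] + H[:i] + H[i + 1:]

def update_inv(H, hit_pos=None):
    """Vector I: history after a line invalidation."""
    if hit_pos is None:          # invalidation miss: history unchanged
        return list(H)
    return H[:hit_pos] + H[hit_pos + 1:] + [H[hit_pos]]

def rotate_left(v):
    """Cyclic shift by one position to the left."""
    return v[1:] + v[:1]

H = [2, 0, 3, 1]                 # way 2 is most, way 1 least recently used
R_hit = update_rw(H, 2)          # access hits way 3 at position 2
R_miss = update_rw(H)            # miss: way 1 is evicted and becomes MRU
I_hit = update_inv(H, 1)         # invalidation hits way 0 at position 1
```

For every hit position and for the miss case, `update_inv(H, p)` equals `rotate_left(update_rw(H, p))`, which is the statement of equation 6.3.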

Circuit LRUup

Circuit LRUup (figure 6.14) performs the update of the history vector. On a cache hit, the binary address J of the cache way with h_J = 1 is obtained by passing the local hit signals h[k-1:0] through an encoder. The flag
\[ x_i = 1 \iff (H_l^i = J) \land (hit = 1) \]
indicates whether the active cache way is at position i+1 of the history vector H_l, i.e., whether it equals the element H_l^i. Circuit LRUup obtains these flags in the obvious way. A parallel prefix OR circuit then computes the signals
\[ y_i = \bigvee_{n=0}^{i-1} x_n, \qquad i \in \{1, \dots, k-1\} \]
where y_i = 0 indicates that the active cache way is not among the first i positions of the history vector H_l. Thus, the first element of the updated history vector R can be expressed as
\[ R_0 = \begin{cases} J & \text{if } hit = 1 \\ H_l^{k-1} & \text{if } hit = 0 \end{cases} \]
and for any i ≥ 1
\[ R_i = \begin{cases} H_l^{i-1} & \text{if } y_i = 0 \\ H_l^i & \text{if } y_i = 1 \end{cases} \]

[Figure 6.14: Circuit LRUup which updates the cache history]

According to equation 6.3, the new history vector H' can be obtained as
\[ H' = \begin{cases} R & \text{if } linv = 0 \\ I & \text{if } linv = 1 \end{cases} = \begin{cases} (R_0, R_1, \dots, R_{k-1}) & \text{if } linv = 0 \\ (R_1, \dots, R_{k-1}, R_0) & \text{if } linv = 1 \end{cases} \]
Circuit Hsel implements these selections in a straightforward manner at the following cost and delay:
\[ \begin{aligned}
C_{Hsel} &= 2k \cdot C_{mux}(\log k) \\
D_{Hsel} &= 2 \cdot D_{mux}(\log k)
\end{aligned} \]
The cost of the whole history update circuit LRUup runs at:
\[ C_{LRUup} = C_{enc}(\log k) + k \cdot (C_{EQ}(\log k) + C_{and}) + C_{PP}(k) \cdot C_{or} + C_{Hsel} \]
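To see how the flags x_i and y_i steer the selection, the following sketch mimics the data flow of LRUup in software. It is a behavioral model only; the name `lru_up` and the list based signal representation are assumptions for illustration.

```python
def lru_up(H, h, linv=0):
    """Behavioral model of LRUup.

    H: history vector of one set (list of way addresses, MRU first)
    h: local hit bits h[0] .. h[k-1]
    Returns the new history vector H' and the eviction address ev.
    """
    k = len(H)
    hit = int(any(h))
    J = h.index(1) if hit else None            # encoder output
    # x[i] = 1 iff the hit way is stored in element i of H
    x = [int(hit == 1 and H[i] == J) for i in range(k)]
    # prefix OR: y[i] = x[0] | ... | x[i-1]; y[0] is not used
    y = [0] * k
    for i in range(1, k):
        y[i] = y[i - 1] | x[i - 1]
    R0 = J if hit else H[k - 1]
    R = [R0] + [H[i] if y[i] else H[i - 1] for i in range(1, k)]
    # Hsel: a line invalidation rotates R one position to the left
    H_new = R[1:] + R[:1] if linv else R
    return H_new, H[k - 1]

new_H, ev = lru_up([2, 0, 3, 1], [0, 0, 0, 1])   # read hit in way 3
```

On a miss all x_i are zero, so every y_i is zero and the whole vector is shifted right by one position, with the evicted way entering at the front.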

Delay of the Replacement Circuit

The circuit Repl gets the address a and the hit signals h_i and hit. Based on these inputs, it updates the history RAM and the registers H' and way. The update of the RAM is obviously much faster than the update of the registers. We therefore focus on the amount of time by which circuit Repl itself delays the update of its registers. For any particular input signal, the propagation delay from the input to the registers of Repl can be expressed as:
\[ \begin{aligned}
D_{Repl}(hit) &= D_{and} + D_{PP}(k) \cdot D_{or} + D_{Hsel} + D_{mux} + D_{ff} \\
D_{Repl}(h_i) &= D_{enc}(\log k) + D_{EQ}(\log k) + D_{Repl}(hit) \\
D_{Repl}(a) &= \max\{D_{SRAM2}(L,K),\ D_{EQ}(l)\} + D_{mux}(K) \\
&\quad + \max\{D_{EQ}(\log k) + D_{Repl}(hit),\ D_{active}(ev) + D_{ff}\}
\end{aligned} \]
where K = k · log k. Note that these delays already include the propagation delay of the register. Thus, clocking just adds the setup time δ.

[Figure 6.15: Circuit Hsel selects the history vector H'. Flag linv signals a line invalidation access.]

Table 6.10: Updating the LRU history of set l in a 2-way cache.

          inputs         |       read/write        |    line invalidation
    H_l^1 H_l^0  h1  h0  | H'_l^1 H'_l^0 way1 way0 | H'_l^1 H'_l^0 way1 way0
      0     *    0   0   |   1      0     0    1   |   0      1     *    *
      1     *    0   0   |   0      1     1    0   |   1      0     *    *
      *     *    0   1   |   1      0     0    1   |   0      1     0    1
      *     *    1   0   |   0      1     1    0   |   1      0     1    0

LRU Replacement in a 2-Way Cache

When accessing a set l in a 2-way cache, one of the two ways is the active cache way and the other way becomes least recently used. In case of a line invalidation (linv = 1), the active way becomes least recently used. According to table 6.10, the elements of the new history vector H'_l and the address of the active way obey
\[ way[1] = \begin{cases} H_l^1 & \text{on a miss} \\ h_1 & \text{on a hit} \end{cases} \qquad way[0] = \overline{way[1]} \]
\[ H'^{\,1}_l = way[1] \ \mathrm{XNOR}\ linv \qquad H'^{\,0}_l = \overline{H'^{\,1}_l} \]
Thus, it suffices to keep one history bit per set, e.g., H_l^1. That simplifies the LRU replacement circuit significantly (figure 6.16), and the initialization after power-up can be dropped.

[Figure 6.16: LRU replacement circuit Repl of a 2-way set associative cache.]

Since an inverter is not slower than an XNOR gate, the cost and delay of circuit Repl can then be estimated as
\[ \begin{aligned}
C_{Repl} &= C_{SRAM2}(L,1) + C_{ff}(l) + C_{EQ}(l) + 2 \cdot C_{mux} + C_{inv} + C_{xnor} + C_{ff} \\
D_{Repl}(h_i) &= D_{Repl}(hit) = D_{mux} + D_{xnor} + D_{ff} \\
D_{Repl}(a) &= \max\{D_{SRAM2}(L,1),\ D_{EQ}(l)\} + 2 \cdot D_{mux} + D_{xnor} + D_{ff}
\end{aligned} \]
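The one-bit scheme can be checked against table 6.10 with a few lines of code. The sketch below is a behavioral model under the conventions of that table; the function name `repl_2way` is hypothetical.

```python
def repl_2way(H1, h1, h0, linv):
    """One-bit LRU replacement for a 2-way cache.

    H1 is the stored history bit H_l^1, i.e., the least recently used way.
    Returns (new history bit H'^1, way[1], way[0]).
    """
    hit = h1 | h0
    way1 = h1 if hit else H1        # hit way, or the LRU way on a miss
    way0 = 1 - way1
    H1_new = 1 - (way1 ^ linv)      # XNOR of way[1] and linv
    return H1_new, way1, way0

# read/write miss with way 0 least recently used: way 0 becomes active,
# hence way 1 is the new least recently used way
result = repl_2way(H1=0, h1=0, h0=0, linv=0)
```

On an invalidation miss the outputs way[1] and way[0] are don't cares in table 6.10; the model still returns values for them, but only the history bit is meaningful in that case.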

Delay of the k-Way Cache

The k-way set associative cache receives the address a, the data Di and the control signals clear, $rd, Vw, Tw and CDw[B-1:0]. In the delay formulae, we usually distinguish between these three types of inputs; by cs$ we denote all the control inputs of the cache design.
The cache design of figure 6.12 delays its data Do and its hit signal by the following amount:
\[ \begin{aligned}
D_{\$k}(Do) &= D_{\$1}(Do) + D_{Sel} \\
D_{\$k}(hit) &= D_{\$1}(hit) + D_{Sel}
\end{aligned} \]
The cache also updates the directory, the data RAMs and the cache history. The update of the cache history H is delayed by
\[ D_{\$k}(\,;H) = \max\{D_{\$1}(hit) + D_{Repl}(h_i),\ D_{\$k}(hit) + D_{Repl}(hit),\ D_{Repl}(a)\} \]
Thus, the propagation delay from a particular input to the storage of the k-way cache can be expressed as:
\[ \begin{aligned}
D_{\$k}(a;\$k) &= \max\{D_{\$k}(\,;H),\ D_{\$1}(\,;data, dir)\} \\
D_{\$k}(cs\$;\$k) &= \max\{D_{\$k}(\,;H),\ D_{Wadapt} + D_{\$1}(\,;data, dir)\} \\
D_{\$k}(Di;\$k) &= D_{\$1}(\,;data)
\end{aligned} \]
Table 6.11: Active cache interface control signals for each state of the standard memory transactions, i.e., for each state of the FSD of figure 6.10.

    state          cache control signals                 CDw[B-1:0]
    Cache Read     $rd, (linv)                           0^B
    Fill Request   scntclr, scntce, Vw, lfill            0^B
    Line Fill      scntce, lfill, Sw                     1^B
    Last Sector    scntce, valid, Vw, Tw, lfill, Sw      1^B
    Cache Update   $w                                    MBW[B-1:0]
    Update M       -                                     0^B
    Invalidate     Vw                                    0^B

6.3.3 Design of a Cache Interface

In the following, a sectored cache is integrated into a cache interface $if which implements the write allocate and write through policies. The cache interface also supports forwarding, i.e., while a cache line is fetched from main memory, the requested sector is directly taken from the memory bus and is forwarded to the data output of $if.
Section 6.2.3 describes how such a cache interface performs the standard memory transactions as a sequence of basic cache accesses. That already specifies the functionality of $if. Those sequences, which are depicted in the FSD of figure 6.10, consist of four types of cache accesses, namely a read access ($rd), a write hit access ($w), a line invalidation (linv) and a line fill (lfill). The line fill requires several cycles.
The cache interface is controlled by the following signals:

- the signals $rd, $w, linv and lfill specifying the type of the cache access,
- the write signals Vw and Tw of the valid and tag RAM,
- the memory bank write signals MBW[B-1:0],
- the write signal Sw (sector write) requesting the update of a whole cache sector, and
- the clock and clear signal of the sector counter scnt.

Table 6.11 lists the active control signals for each state of the standard memory transactions.
[Figure 6.17: Cache interface $if with forwarding capability]

The cache interface receives the address a and the data Din and MDat. Since all cache and memory accesses affect a whole sector, address a is a sector boundary:
\[ a\_byte = 0 \]
and the cache and memory ignore the offset bits a_byte of the address. The interface $if provides a hit signal, the data Dout, the memory address MAd, a cache address, and the input data Di of the cache. On a line fill, Di is taken from the memory data bus MDat, whereas on a write hit access, the data is taken from Din:
\[ Di = \begin{cases} MDat & \text{if } lfill = 1 \\ Din & \text{if } lfill = 0 \end{cases} \qquad (6.4) \]
Figure 6.17 depicts an implementation of such a cache interface. The core of the interface is a sectored k-way cache, where k may be one. The width of a sector (B = 2^b bytes) equals the width of the data bus between the main memory and the cache. Each line comprises S = 2^s sectors. A multiplexer selects the input data Di of the cache according to equation 6.4. The address generator circuit AdG generates the addresses and bank write signals CDw for the accesses. Circuit $forw forwards the memory data in case of a read miss. The cost of the cache interface runs at
\[ \begin{aligned}
C_{\$if}(t,l,s,b) &= C_{\$k}(t,l,s,b) + C_{mux}(B \cdot 8) + C_{AdG} + C_{\$forw} \\
C_{\$forw} &= 2 \cdot C_{mux}(B \cdot 8) + C_{ff}(B \cdot 8)
\end{aligned} \]

Address Generator AdG

The address generator (figure 6.18) generates the write signals CDw and the low order address bits of the cache and main memory address.
According to section 6.2.3, a standard access (lfill = 0) affects a single sector of a cache line. On such an access, the low order bits of the main memory and of the cache address equal the sector address a_sector. On a line fill (lfill = 1), the whole line must be fetched from main memory. The memory requires the start address of the cache line:
\[ MAd[31:b] = a\_tag,\ a\_line,\ ma \]
with
\[ ma = \begin{cases} a\_sector & \text{if } lfill = 0 \\ 0^s & \text{if } lfill = 1 \end{cases} \]
Thus, the address generator clears ma on a line fill.

[Figure 6.18: Address generation for the line fill of a sectored cache. The outputs ca and ma are the low order bits of the cache and memory address. Signal rs indicates that the current sector equals the requested sector.]

On a line fill, the cache line is updated sector by sector. The address generator therefore generates all the sector addresses 0, ..., 2^s - 1 for the cache, using an s-bit counter scnt. The counter is cleared on scntclr = 1. The sector bits of the cache address equal
\[ ca = \begin{cases} a\_sector & \text{if } lfill = 0 \\ scnt & \text{if } lfill = 1 \end{cases} \]
In addition, circuit AdG provides a signal rs (requested sector) which indicates that the current sector with address scnt equals the requested sector:
\[ rs = 1 \iff a\_sector = scnt \]
This flag is obtained by an s-bit equality tester.
The address generator also generates the bank write signals CDw[B-1:0] for the data RAM of the cache. Because of write allocate, the data RAM is updated on a line fill and on a write hit (table 6.11). On a line fill, signal Sw requests the update of the whole cache sector (CDw[B-1:0] = 1^B), whereas on a write hit ($w = 1), the bank write signals of the memory determine which cache banks have to be updated. Thus, for 0 ≤ i < B, the bank write signal CDw_i is generated as
\[ CDw_i = Sw \lor (MBW_i \land \$w) \]
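The combinational outputs of the address generator can be summarized in a small behavioral sketch. The function name `adg` and the argument `w` (standing for the signal $w, which is not a valid Python identifier) are assumptions for illustration; the sector counter itself is passed in as a plain integer.

```python
def adg(a_sector, scnt, lfill, Sw, w, MBW):
    """One combinational evaluation of the address generator AdG.

    MBW is the list of the B memory bank write bits; w models signal $w.
    Returns (ma, ca, rs, CDw).
    """
    ma = 0 if lfill else a_sector      # a line fill starts at sector 0
    ca = scnt if lfill else a_sector   # the cache is filled sector by sector
    rs = int(a_sector == scnt)         # s-bit equality test
    CDw = [Sw | (mbw & w) for mbw in MBW]
    return ma, ca, rs, CDw

# line fill of the first sector while sector 2 was requested (B = 8)
fill = adg(a_sector=2, scnt=0, lfill=1, Sw=1, w=0, MBW=[0] * 8)
# write hit which updates only byte 0 of sector 2
write = adg(a_sector=2, scnt=3, lfill=0, Sw=0, w=1, MBW=[1] + [0] * 7)
```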
By cs$if, we denote all the control inputs of the cache interface. These signals are provided by the control unit CON. The data paths provide the address a. Let A_CON(cs$if) and A_DP(a) denote the accumulated delay of these inputs. The cost and the cycle time of circuit AdG and the delay of its outputs can then be expressed as
\[ \begin{aligned}
C_{AdG} &= C_{ff}(s) + C_{inc}(s) + C_{EQ}(s) + 3 \cdot C_{mux}(s) + C_{and}(B) + C_{or}(B) \\
A_{AdG}(ma) &= A_{AdG}(ca) = \max\{A_{DP}(a),\ A_{CON}(cs\$if)\} + D_{mux} \\
A_{AdG}(rs) &= A_{DP}(a) + D_{EQ}(s) \\
A_{AdG}(CDw) &= A_{CON}(cs\$if) + D_{and} + D_{or} \\
T_{AdG} &= \max\{A_{CON}(cs\$if),\ D_{inc}(s)\} + D_{mux} + D_{ff} + \delta
\end{aligned} \]

Forwarding Circuit $forw

Circuit $forw of figure 6.17 performs the read forwarding. On a read hit, the output data Dout are provided directly by the cache. On a cache miss, the line is fetched from main memory. During the line fill access, the requested sector, i.e., the sector with address ca = a_sector, is clocked into a register as soon as it is provided on the MDat bus. This event is signaled by rs = 1. In the last line fill cycle, circuit $forw provides the requested sector to the output Dout, bypassing the cache. If a_sector = S - 1, the requested sector lies on the bus MDat during the last fill cycle and has not yet been clocked into the register. Thus, the forwarding circuit selects the data output as
\[ Dout = \begin{cases} Do & \text{if } lfill = 0 \\ sector & \text{if } lfill = 1 \land rs = 0 \\ MDat & \text{if } lfill = 1 \land rs = 1 \end{cases} \]
at the following delay:
\[ D_{\$forw} = 2 \cdot D_{mux}(8B) \]
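A cycle by cycle sketch of a line fill shows why the third case of the selection is needed. The function `line_fill_dout` is a hypothetical model which assumes that the S sectors arrive on MDat in the order 0, ..., S-1 and that the register is clocked at the end of the cycle in which rs = 1.

```python
def line_fill_dout(line, a_sector):
    """Return the value of Dout in the last cycle of a line fill.

    line: the S sectors as they appear on MDat, in order 0 .. S-1
    a_sector: address of the requested sector
    """
    sector_reg = None                  # register 'sector' of circuit $forw
    dout = None
    for scnt, MDat in enumerate(line):
        rs = int(scnt == a_sector)
        # lfill = 1 during the whole fill: select MDat or the register
        dout = MDat if rs else sector_reg
        if rs:
            sector_reg = MDat          # visible from the next cycle on
    return dout

first = line_fill_dout([10, 11, 12, 13], 0)   # taken from the register
last = line_fill_dout([10, 11, 12, 13], 3)    # taken directly from MDat
```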

Delay of the Cache Interface

Based on the cache address ca, the cache itself provides the hit signal and the data Do. These two outputs therefore have an accumulated delay of:
\[ \begin{aligned}
A_{\$if}(hit) &= A_{AdG}(ca) + D_{\$k}(hit) \\
A_{\$if}(Do) &= A_{AdG}(ca) + D_{\$k}(Do)
\end{aligned} \]
As for the whole DLX design, we distinguish between cycles which involve the off-chip memory and those which are only processed on-chip. The memory address MAd and the data MDat are used in the first kind of cycles. The cache interface provides address MAd at an accumulated delay of
\[ A_{\$if}(MAd) = A_{AdG}(ma) \]
The propagation of the data MDat to the output Dout and to the registers and RAMs of the interface adds the following delays:
\[ \begin{aligned}
D_{\$if}(MDat; Dout) &= D_{\$forw} \\
D_{\$if}(MDat; \$if) &= D_{mux}(8B) + D_{\$k}(Di;\$k)
\end{aligned} \]
With respect to the on-chip cycles, the output Dout and the input data Di of the cache have the following accumulated delays:
\[ \begin{aligned}
A_{\$if}(Dout) &= \max\{A_{\$if}(Do),\ A_{AdG}(rs)\} + D_{\$forw} \\
A_{\$if}(Di) &= \max\{A_{CON}(cs\$if),\ A_{DP}(Din)\} + D_{mux}(8B)
\end{aligned} \]
The k-way cache comprises RAMs and registers, which have to be updated. The actual updating of a register includes the delay D_ff of the register and the setup time δ, whereas the updating of a RAM only includes the setup time. The additional delay D_ff for the registers is already incorporated in the delay of the k-way cache. In addition to the cache address ca, the cache also needs the input data Di and the write signals in order to update its directory and cache data. The minimal cycle time of the cache interface can therefore be expressed as:
\[ \begin{aligned}
T_{\$if} = \max\{ & T_{AdG},\ A_{AdG}(CDw) + D_{\$k}(cs\$;\$k) + \delta, \\
& A_{AdG}(ca) + D_{\$k}(a;\$k) + \delta,\ A_{\$if}(Di) + D_{\$k}(Di;\$k) + \delta \}
\end{aligned} \]

 ,  *(1    -

6.4 Sequential DLX with Cache Memory

In section 6.1.3, it turned out that the sequential DLX core, which is directly connected to the slow external memory, spends most of its run time waiting for the memory system. We now analyze whether a fast cache between the processor core and the external memory can reduce this waiting time. Adding the cache only affects the memory environment Menv and the memory control. As before, the global functionality of the memory system and its interaction with the data paths and main control of the DLX design remain the same.
[Figure 6.19: Memory environment of the sequential DLX with cache memory]

Memory Environment
Figure 6.19 depicts the memory environment Menv. The cache interface $if of section 6.3 is placed between the memory interface Mif and the data paths interface Dif. The cache interface implements the write through, write allocate policy. Since there is only a single cache in the DLX design, line invalidation will not be supported. The cache is initialized/cleared on reset. The off-chip data bus MDat and the cache sectors are B = 2^b = 8 bytes wide.

Memory Interface Mif  The memory interface still forwards data and addresses between the off-chip memory and the memory environment. However, the memory address MAd is now provided by the cache interface, and the data from the memory data bus are forwarded to the data input MDat of the cache interface.

Interface Dif  The cache interface is connected to the data paths through a 32-bit address port MA and two data ports MDin and MDout. In the memory environment, the data busses are 64 bits wide, whereas in the data paths they are only 32 bits wide. Thus, the data ports must be patched together. On the input port MDin, circuit Dif duplicates the data MDRw:
\[ MDin[63:32] = MDin[31:0] = MDRw[31:0] \]
On the output port MDout, a multiplexer selects the requested 32-bit word within the double-word Dout based on the address bit MA[2]:
\[ MDout = \begin{cases} Dout[31:0] & \text{if } MA[2] = 0 \\ Dout[63:32] & \text{if } MA[2] = 1 \end{cases} \]
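The patching of the 32-bit and 64-bit ports amounts to a duplication on the way in and a word select on the way out. A minimal sketch, with the hypothetical function names `dif_in` and `dif_out` and with busses modeled as Python integers:

```python
MASK32 = 0xFFFFFFFF

def dif_in(MDRw):
    """MDin[63:32] = MDin[31:0] = MDRw[31:0]."""
    word = MDRw & MASK32
    return (word << 32) | word

def dif_out(Dout, MA):
    """Select the word addressed by MA[2] from the double-word Dout."""
    if (MA >> 2) & 1:
        return (Dout >> 32) & MASK32
    return Dout & MASK32

mdin = dif_in(0xDEADBEEF)
low = dif_out(0x1111111122222222, 0x100)    # MA[2] = 0
high = dif_out(0x1111111122222222, 0x104)   # MA[2] = 1
```

Duplicating the store data is sufficient because the bank write signals decide which half of the double-word is actually written.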

Let the sectored cache comprise 2^l lines, each of which is split in S = 2^s sectors. The cost of the memory environment and of the interfaces Mif and Dif then runs at
\[ \begin{aligned}
C_{Mif} &= C_{driv}(32) + C_{driv}(64) \\
C_{Dif} &= C_{mux}(32) \\
C_{Menv} &= C_{Mif} + C_{Dif} + C_{\$if}(29 - l - s,\ l,\ s,\ 3)
\end{aligned} \]

[Figure 6.20: Block diagram of the memory control]

Memory Control

As in the sequential DLX design which is directly connected to the off-chip memory (section 6.1.3), the memory system is governed by the memory control circuit MC and the memory interface control MifC (figure 6.20).

Memory Controller MC  The memory controller generates the memory bank write signals. Since the memory system now operates on double-words, twice as many write signals Mbw[7:0] are required. The original four signals mbw[3:0] still select within a word, and the leading offset bit of the write address MA[2] selects the word within the sector. Thus, the new bank write signals can be obtained as
\[ Mbw[4 \cdot i + j] = \begin{cases} mbw[j] \land MA[2] & ;\ i = 1 \\ mbw[j] \land \overline{MA[2]} & ;\ i = 0 \end{cases} \qquad j \in \{0, \dots, 3\} \]
Stores always take several cycles, and the bank write signals are used in the second cycle, at the earliest. The memory control therefore buffers the signals Mbw in a register before feeding them to the cache interface and to the byte enable lines BE of the memory bus. Register MBW is clocked during the first cycle of a memory transaction, i.e., on $rd = 1:
\[ MBW[7:0] := \begin{cases} Mbw[7:0] & \text{if } \$rd = 1 \\ MBW[7:0] & \text{if } \$rd = 0 \end{cases} \]
Thus, circuit MC provides the signal MBW at zero delay:
\[ A_{MC}(MBW) = 0 \]
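The expansion from four to eight bank write signals can be sketched as follows. The function name `bank_write` is hypothetical; mbw is modeled as a list with mbw[j] at index j, and the result lists Mbw[0], ..., Mbw[7] in that order.

```python
def bank_write(mbw, MA):
    """Expand the word-level bank write bits mbw[0..3] to Mbw[0..7].

    Mbw[4*i + j] = mbw[j] AND MA[2]       for i = 1
    Mbw[4*i + j] = mbw[j] AND NOT MA[2]   for i = 0
    """
    ma2 = (MA >> 2) & 1
    return [mbw[j] & (ma2 if i == 1 else 1 - ma2)
            for i in (0, 1) for j in range(4)]

upper = bank_write([1, 0, 0, 0], 0x104)   # byte store into the upper word
lower = bank_write([1, 1, 0, 0], 0x100)   # halfword store into the lower word
```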
The cost and cycle time of the memory control MC run at
\[ \begin{aligned}
C_{MC} &= C_{MC}(mbw) + C_{and}(8) + C_{inv} + C_{ff}(8) \\
T_{MC} &= A_{MC}(mbw) + D_{and}(8) + \Delta
\end{aligned} \]

Memory Interface Control  As in the DLX of section 6.1.3, the memory interface control MifC controls the tristate drivers of the memory interface and generates the handshake signal req and the bus status signals burst, w/r and mbusy according to the bus protocol. In addition, control MifC provides the control signals of the cache interface.
The FSD of figure 6.10 together with table 6.11 specifies the cache operations for the different memory transactions. However, the line fill and the write hit also access the off-chip memory. On such a memory access, the bus protocol of section 6.1.2 must be obeyed. Thus, the FSD must be extended by the bus operations. Figure 6.21 depicts the extended FSD. Note that on a burst read (line fill), the memory turns signal reqp off two cycles before sending the last sector. Thus, signal reqp = 0 can be used in order to detect the end of the line fill. Table 6.12 lists the active control signals for each state of the FSD.
Circuit MifC uses a Mealy automaton which generates the control signals as modeled by the FSD. Table 6.13 lists the parameters of the automaton. There are only two Mealy signals, namely mbusy and $rd. Both signals are just used for clocking. According to section 2.6.8, their accumulated delay can be expressed as
\[ A_{CON}(mbusy, \$rd) = A_{MifC}(Mealy) = A_{out2}(MifC) \]
The remaining MifC control signals are Moore signals. Since the automaton precomputes its Moore outputs, these control signals are provided at zero delay:
\[ A_{MifC} = A_{MifC}(Moore) = 0 \]
The MifC automaton receives the inputs mw and mr from the main control, the hit signal from the cache interface, and the handshake signals Brdy and reqp from the memory. These inputs have an accumulated delay of
\[ A_{in}(MifC) = \max\{A_{CON}(mw, mr),\ A_{\$if}(hit),\ A_M(Brdy, reqp) + d_{bus}\} \]

Cycle Time of the Memory System

As in the DLX design without cache, we assume that the off-chip memory is controlled by an automaton which precomputes its outputs and that the control inputs which the off-chip memory receives through the memory bus add d_Mhsh delays to the cycle time of its automaton. With a cache, the off-chip memory only performs a burst read access or a single write access. Both accesses start with a request cycle.

[Figure 6.21: FSD of the MifC control automaton; $RD is the initial state.]

Table 6.12: Active control signals for the FSD modeling the MifC control. Signals $rd and mbusy are Mealy signals, the remaining signals are Moore signals.

    state       signals for $if                      additional signals
    $RD         $rd = mr ∨ mw                        /mbusy = (hit ∧ mr) ∨ (/mr ∧ /mw)
    fill req    scntclr, scntce, Vw, lfill           req, burst, MAddoe
    fill        scntce, lfill, Sw                    burst, MAddoe
    wait        lfill                                burst, MAddoe
    last wait   lfill                                burst, MAddoe
    last fill   scntce, valid, Vw, Tw, lfill, Sw     MAddoe, /mbusy = mr
    $write      $w                                   w/r, req, MAddoe, MDindoe
    write M                                          w/r, MAddoe, MDindoe
    last M                                           MDindoe, /mbusy

The memory interface starts the memory access by sending the address and the request signal req to the off-chip memory, but the address is now provided by the cache interface. That is the only change. Forwarding signal req and address MAd to the memory bus and off-chip still takes D_driv + d_bus delays, and the processing of the handshake signals adds d_Mhsh delays. Thus, the memory request takes
\[ T_{Mreq} = \max\{A_{MifC},\ A_{\$if}(MAd)\} + D_{driv} + d_{bus} + d_{Mhsh} + \Delta \]
After the request, the memory performs the actual access. The timing of the single write access is modeled as in the design without cache. The memory interface sends the data MDin and the byte enable bits. Once the off-chip memory receives these data, it performs the access:
\[ T_{Mwrite} = \max\{A_{MC}(MBW),\ A_{MifC}\} + D_{driv} + d_{bus} + D_{MM}(64MB) + \delta \]

Table 6.13: Parameters of the MifC Mealy automaton; index (1) corresponds to the Moore signals and index (2) to the Mealy signals.

    # states   # inputs   # and frequency of the outputs
       k          σ         γ     νsum   νmax1   νmax2
       9          5         15    40     7       4

    fanin of the states   #, length, frequency of the monomials
    fansum   fanmax         #M    lsum   lmax    lmax2
      18       3            14    24     2       2

On a burst read transfer, we distinguish between the access of the first sector and the access of the later sectors. A 64 MB memory provides the first sector with a delay of D_MM(64MB). Sending it to the memory interface adds another D_driv + d_bus delays. The sector is then written into the cache. Thus, reading the first sector takes at least
\[ T_{Mread} = D_{MM}(64MB) + D_{driv} + d_{bus} + D_{\$if}(MDat; \$if) + \delta \]
We assume that for the remaining sectors, the actual memory access time can be hidden. Thus, the cache interface receives the next sector with a delay of D_driv + d_bus. Circuit $if writes the sector into the cache and forwards the sector to the data paths where the data are multiplexed and clocked into a register:
\[ T_{Mrburst} = D_{driv} + d_{bus} + \delta + \max\{D_{\$if}(MDat; \$if),\ D_{\$if}(MDat; Dout) + D_{mux} + D_{ff}\} \]
Due to the memory access time, the write access and the reading of the first sector take much longer than the CPU internal cycles. Therefore, they are performed in W CPU cycles.
If a read access hits the cache, the off-chip memory is not accessed at all. The cache interface provides the requested data with a delay of A_$if(Dout). After selecting the appropriate word, data MDout is clocked into a register:
\[ T_{\$read} = A_{\$if}(Dout) + D_{mux} + D_{ff} \]


Table 6.14: Cost of the DLX design which is connected to the off-chip DRAM, either directly or through a 16 KB, direct mapped cache.

    L1 cache          Menv      DP       CON     DLX
    no                320       11166    1170    12336
    16KB              375178    386024   1534    387558
    increase factor   1170      35       1.3     31

Updating the cache interface on a read or write access takes T_$if. Thus, the memory environment of the DLX design requires a CPU cycle time of at least
\[ \begin{aligned}
T_M(W) &= \max\{T_{\$read},\ T_{\$if},\ T_{Mreq},\ T_{Mrburst},\ T_{Maccess} / W\} \\
T_{Maccess} &= \max\{T_{Mwrite},\ T_{Mread}\}
\end{aligned} \]

Cost and Cycle Time

Presently (2000), large workstations have a first level cache of 32KB to 64KB (table 6.8), but the early RISC processors (e.g., MIPS R2000/3000) started out with as little as 4KB to 8KB of cache. We consider a cache size of 16KB for our DLX design. This sectored, direct mapped cache is organized in 1024 lines. A cache line comprises S = 2 sectors, each of which is B = 8 bytes wide. The cache size and other parameters will be optimized later on.
According to table 6.14, the 16KB cache dramatically increases the cost of the memory environment Menv (factor 1200) and of the DLX processor (factor 31), but the cost of the control stays roughly the same. Adding a first level cache makes the memory controller MC more complicated; its automaton requires 9 instead of 3 states. However, this automaton is still fairly small, and thus, the whole DLX control is only 30% more expensive.
Table 6.15 lists the cycle times of the data paths, the control, and the memory system. The stall engine generates the clock and write signals based on signal mbusy. Due to the slow hit signal, signal mbusy has a much longer delay. That more than doubles the cycle time of the control, which now becomes time critical. The cycle time τ_DLX of the DLX core is increased by a factor of 1.27.
A memory request, a cache update, and a cache read hit can be performed in a single processor cycle. The time T_Mrburst is also not time critical. Reading the first word from the off-chip memory requires several processor cycles; the same is true for the write access (T_Maccess). Since the memory data is written into a register and into the cache, such a read access even takes 36 delays longer than without the cache.

Table 6.15: Cycle time of the DLX design with and without cache memory.

    cache   A_hit   A_mbusy   T_MifC   T_stall   T_CON   T_DP
    no      -       7         28       33        42      70
    16KB    55      64        79       89        89      70

                                                   T_Maccess
    cache   T_$if   T_$read   T_Mreq   T_Mrburst   α = 4   α = 8   α = 16
    no      -       -         39       -           355     683     1339
    16KB    48      57        36       63          391     719     1375

The DLX design with first level cache can be operated at a cycle time of
\[ \tau_{DLX}(W) = \max\{T_{DP},\ T_{CON},\ T_M(W)\} \]
Increasing the number W of wait states improves the cycle time, but it also increases the CPI ratio. There is a trade-off between cycle time and cycle count.
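The dependence of the cycle time on W can be explored numerically with the values of table 6.15. The sketch below assumes that T_M(W) spreads the off-chip access time T_Maccess evenly over W cycles and rounds up to whole gate delays; the rounding and the function name `tau_dlx` are assumptions for illustration.

```python
from math import ceil

def tau_dlx(W, T_DP, T_CON, T_on_chip, T_Maccess):
    """tau_DLX(W) = max{T_DP, T_CON, T_M(W)}.

    T_on_chip: list of the memory system times which must fit into a
               single cycle (T_$read, T_$if, T_Mreq, T_Mrburst)
    T_Maccess: off-chip access time, performed in W cycles
    """
    T_M = max(max(T_on_chip), ceil(T_Maccess / W))
    return max(T_DP, T_CON, T_M)

# 16KB cache, values of table 6.15
cached = dict(T_DP=70, T_CON=89, T_on_chip=[57, 48, 36, 63])
tau_a4 = tau_dlx(5, T_Maccess=391, **cached)
tau_a8 = tau_dlx(8, T_Maccess=719, **cached)
tau_a16 = tau_dlx(16, T_Maccess=1375, **cached)
```

With the cache, the control (T_CON = 89) bounds the cycle time from below, so adding wait states beyond the point where T_Maccess / W drops under 89 gate delays only increases the cycle count.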

Performance Impact of the Cache

In order to make the cache worthwhile, it had better improve the performance of the DLX substantially. The memory system has no impact on the instruction count. However, the cache can improve the CPI ratio and the TPI ratio by speeding up the average memory access:
\[ TPI = \frac{T}{IC} = CPI \cdot \tau_{DLX} \]

Number of Memory Cycles  In the DLX design without cache, a memory access always takes 1 + W cycles. After adding the cache, the time of a read or write access is no longer fixed. The access can be a cache hit or miss. In case of a miss, the whole cache line (S = 2 sectors) must be fetched from the external memory. Such a line fill takes W + S cycles. Thus, the read access can be performed in a single cycle if the requested data is in the cache, and otherwise, the read access takes 1 + W + S cycles due to the line fill.
A store first checks the cache before it performs the write access. Due to the write through, write allocate policy, the write always updates the cache and the external memory. As in the system without cache, the update of the memory takes 1 + W cycles, and together with the checking of the cache, a write hit takes 2 + W cycles. A cache miss adds another W + S cycles (table 6.16).

Table 6.16: Number of processor cycles required for a memory access.

                    read hit   read miss   write hit   write miss
    with cache      1          1 + S + W   2 + W       2 + W + S + W
    without cache        1 + W                   1 + W
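The entries of table 6.16 follow from two ingredients: the cost of the write through and the cost of a line fill. The sketch below assumes that a line fill takes W + S cycles, which is the reading consistent with the numerical CPI formulae of this section; the function name `access_cycles` is hypothetical.

```python
def access_cycles(kind, hit, W, S=2, cache=True):
    """Processor cycles of one memory access.

    kind: 'read' or 'write'; hit: whether the access hits the cache
    W: number of wait states; S: sectors per cache line
    """
    if not cache:
        return 1 + W                   # every access goes off-chip
    fill = 0 if hit else W + S         # line fill on a miss
    if kind == "read":
        return 1 + fill
    return 2 + W + fill                # cache check plus write through

W = 5
cycles = {
    "read hit": access_cycles("read", True, W),
    "read miss": access_cycles("read", False, W),
    "write hit": access_cycles("write", True, W),
    "write miss": access_cycles("write", False, W),
}
```

For W = 5 and S = 2, only the read hit is faster than the 1 + W = 6 cycles of the design without cache; the benefit of the cache therefore rests entirely on a high hit ratio for loads and instruction fetches.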

CPI Ratio For a given benchmark, the hit ratio \(p_h\) is the fraction
of all memory accesses which are cache hits, and the miss ratio
\(p_m = 1 - p_h\) is the fraction of the accesses which are cache misses
and therefore require a line fill.
Let \(CPI_{ideal}\) denote the CPI ratio of the DLX design with an ideal
memory, i.e., with a memory which performs every access in a single cycle.
In analogy to the CPI ratio of a pipelined design (section 4.6), the cache
misses and memory updates can be treated as hazards. Thus, the CPI ratio
of the DLX design with L1 cache can be expressed as:

\[ CPI_{L1} = CPI_{ideal} + \nu_{store} \cdot (1 + W) + \nu_{miss} \cdot (W + S) \]
\[ \nu_{miss} = p_m \cdot (1 + \nu_{load} + \nu_{store}) \]

The CPI ratio of the DLX design with ideal memory can be derived from
the instruction mix of table 6.6 in the same manner as the CPI ratio of the
DLX without cache. That table also provides the frequency of the loads
and stores. According to cache simulations [Kro97, GHPS93], the 16KB
direct mapped cache of the DLX achieves a miss ratio of 3.3% on the
SPECint92 workload. On the compress benchmark, the cache performs
slightly better (\(p_m = 3.1\%\)). Thus, the DLX with 16KB cache yields on
these two workloads a CPI ratio of

\[ CPI_{L1,compr} = 4.19 + 0.056 \cdot (1 + W) + 0.031 \cdot 1.255 \cdot (S + W) = 4.32 + 0.09 \cdot W \]
\[ CPI_{L1,SPEC} = 4.26 + 0.085 \cdot (1 + W) + 0.033 \cdot 1.338 \cdot (S + W) = 4.43 + 0.13 \cdot W \]

Based on these formulae, the optimal cycle time and optimal number of
wait states can be determined as before. Although the CPI and TPI ratios
vary with the workload, the optimal cycle time is the same for all the
SPECint92 benchmarks; it only depends on the speed of the main memory
(table 6.17). Depending on the speed of the main memory, the cache
increases the optimal cycle time by 25% or 30%, but for slow memories it
reduces the number of wait states.

Table 6.17: Optimal cycle time τ and number W of wait states.

L1       α = 4      α = 8      α = 16
cache    W    τ     W    τ     W    τ
no       5    71    10   70    19   71
16KB     5    89    8    90    16   89

Table 6.18: CPI and TPI ratios of the two DLX designs on the compress
benchmark and on the average SPECint92 workload.

                      compress (p_m = 3.1%)       SPECint (p_m = 3.3%)
DRAM: α               4        8        16        4        8        16
CPI_noL1              10.5     16.7     28.0      11.0     17.6     29.7
CPI_L1                4.8      5.0      5.8       5.1      5.5      6.5
CPI_noL1 / CPI_L1     2.2      3.3      4.8       2.1      3.2      4.6
TPI_noL1              753.7    1172.0   1990.7    788.7    1235.1   2107.7
TPI_L1                424.5    453.6    512.6     452.1    492.3    579.4
TPI_noL1 / TPI_L1     1.8      2.6      3.9       1.7      2.5      3.6
Break even: eq        0.14     0.21     0.28      0.14     0.21     0.28
According to table 6.18, the cache improves the CPI ratio roughly by
a factor of 2 to 5. Due to the slower cycle time, the TPI ratio and thus
the performance of the DLX processor is only improved by a factor of about
2 to 4. The cache achieves a good speedup especially in combination with
a very slow external memory (α = 16). Since the cache also adds to the
cost of the design, there is a trade-off between cost and performance.
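
The CPI and TPI figures of table 6.18 can be approximated directly from
the CPI formula. The sketch below is an illustration under stated
assumptions: the dictionary and function names are introduced here, the
workload parameters are the ones quoted in the text, and the pairs (W, τ)
are taken from table 6.17. The table appears to have been computed from
the rounded linear forms, so the unrounded formula may deviate slightly
in the last digit.

```python
# Parameters quoted in the text: CPI with ideal memory, store frequency,
# miss ratio, and memory references per instruction (1 + loads + stores).
WORKLOADS = {
    "compress":  {"cpi_ideal": 4.19, "nu_store": 0.056, "p_miss": 0.031, "refs": 1.255},
    "SPECint92": {"cpi_ideal": 4.26, "nu_store": 0.085, "p_miss": 0.033, "refs": 1.338},
}

# Optimal (W, tau) of the design with 16KB cache for each alpha (table 6.17).
OPTIMUM_16KB = {4: (5, 89), 8: (8, 90), 16: (16, 89)}


def cpi_l1(workload, W, S=2):
    """CPI of the DLX with L1 cache for W wait states and S sectors per line."""
    p = WORKLOADS[workload]
    nu_miss = p["p_miss"] * p["refs"]          # cache misses per instruction
    return p["cpi_ideal"] + p["nu_store"] * (1 + W) + nu_miss * (W + S)


def tpi_l1(workload, W, tau, S=2):
    """Time per instruction in gate delays: CPI times cycle time."""
    return cpi_l1(workload, W, S) * tau


if __name__ == "__main__":
    for name in WORKLOADS:
        for alpha, (W, tau) in OPTIMUM_16KB.items():
            print(f"{name:10s} alpha={alpha:2d}  CPI={cpi_l1(name, W):.2f}  "
                  f"TPI={tpi_l1(name, W, tau):.1f}")
```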

Cost Performance Trade-Off For any two variants A and B of the DLX
design, the parameter eq specifies the quality parameter q for which both
variants are of the same quality:

\[ \frac{1}{C_A^{\,q} \cdot TPI_A^{\,1-q}} = \frac{1}{C_B^{\,q} \cdot TPI_B^{\,1-q}} \]
For quality parameters q < eq, the faster of the two variants is better,
and for q > eq, the cheaper one is better. For a realistic quality metric,
the quality parameter q lies in the range [0.2, 0.5].
Depending on the speed of the off-chip memory, the break even point lies
between 0.14 and 0.28 (table 6.18). The DLX with cache is the faster of the
two designs. Thus, the 16KB cache improves the quality of the sequential
DLX design only as long as the performance is much more important than
the cost.
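
The break even point follows from the equality of the two quality terms:
taking logarithms of (C_A / C_B)^q = (TPI_B / TPI_A)^(1-q) gives a closed
form for q. The sketch below is only an illustration; the function name
is an assumption introduced here, the cost factor of 31 is the one stated
in the text for the 16KB cache, and the speedups are the TPI ratios of
the compress columns of table 6.18.

```python
import math


def break_even_q(cost_ratio, speedup):
    """Quality parameter q at which both designs have equal quality.

    Solves cost_ratio**q == speedup**(1 - q) for q, where
    cost_ratio = C_A / C_B and speedup = TPI_B / TPI_A for the faster,
    more expensive design A.
    """
    if cost_ratio <= 1 or speedup <= 1:
        raise ValueError("expects a design that is both faster and more expensive")
    return math.log(speedup) / (math.log(cost_ratio) + math.log(speedup))


if __name__ == "__main__":
    for speedup in (1.8, 2.6, 3.9):            # compress columns of table 6.18
        print(speedup, round(break_even_q(31, speedup), 3))
```

The results lie within rounding distance of the break even values 0.14,
0.21 and 0.28 listed in table 6.18.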
Altogether, it is worthwhile to add a 16KB, direct mapped cache to the
DLX fixed point core, especially in combination with a very slow external
memory. The cache increases the cost of the design by a factor of 31, but
it also improves the performance by a factor of 1.8 to 3.7. However, the
DLX still spends 13% to 30% of its run time waiting for the main memory,
due to cache misses and write through accesses.

6.4.2 Variations of the Cache Design

Every cache design has many parameters, like the cache size, the line size,
the associativity, and the cache policies. This section studies the impact
of these parameters on the performance and cost/performance ratio of the
cache design.

.    3 :


As already pointed out in section 6.2.1, the memory accesses tend to clus-
ter, i.e., at least over a short period of time, the processor only works on
a few clusters of references. Caches profit from the temporal and spatial
locality.

Temporal Locality Once a data item is brought into the cache, it is
likely to be used several times before it is evicted. Thus, the slow initial
access is amortized by the fast accesses which follow. If the cache is too
small, it cannot accommodate all the clusters required, and data will be
evicted although they are needed shortly thereafter. Large caches can
reduce these evictions, but cache misses cannot vanish completely, because
the addressed clusters change over time, and the first access to a new
cluster is always a miss. According to table 6.19 ([Kro97]), doubling the
cache size cuts the miss ratio by about one third.

Spatial Locality The cache also makes use of the spatial locality, i.e.,
whenever the processor accesses a data item, it is very likely that it soon
accesses another item which is stored close by. Starting a memory transfer
requires W cycles, and then the actual transfer delivers 8 bytes per cycle.
Thus, fetching larger cache lines saves time, but only if most of the
fetched data are used later on. However, there is only a limited amount of
spatial locality in the programs.

Table 6.19: Miss ratio of a direct mapped cache depending on the cache
size [K byte] and the line size [byte] for the average SPECint92 workload;
[Kro97].

cache     line size [byte]
size      8          16         32         64         128
1 KB      0.227616   0.164298   0.135689   0.132518   0.150158
2 KB      0.162032   0.112752   0.088494   0.081526   0.088244
4 KB      0.109876   0.077141   0.061725   0.057109   0.059580
8 KB      0.075198   0.052612   0.039738   0.034763   0.034685
16 KB     0.047911   0.032600   0.024378   0.020493   0.020643
32 KB     0.030686   0.020297   0.015234   0.012713   0.012962
64 KB     0.020660   0.012493   0.008174   0.005989   0.005461
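
The rule of thumb that doubling the cache size cuts the miss ratio by
about one third can be checked against table 6.19. The sketch below does
this for the 32-byte column only; the names are assumptions introduced
here for illustration.

```python
# 32-byte column of table 6.19 (direct mapped cache, SPECint92),
# indexed by the cache size in KB.
MISS_RATIO_32B = {
    1: 0.135689, 2: 0.088494, 4: 0.061725, 8: 0.039738,
    16: 0.024378, 32: 0.015234, 64: 0.008174,
}


def doubling_factors(miss_by_size):
    """Ratio miss(2X) / miss(X) for every pair of neighbouring cache sizes."""
    sizes = sorted(miss_by_size)
    return [miss_by_size[b] / miss_by_size[a] for a, b in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    factors = doubling_factors(MISS_RATIO_32B)
    mean_reduction = 1 - sum(factors) / len(factors)
    print([round(f, 2) for f in factors], round(mean_reduction, 2))
```

Every doubling keeps between one half and three quarters of the misses,
which is consistent with an average reduction of roughly one third.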
According to table 6.19, larger line sizes reduce the miss ratio
significantly up to a line size of 32 bytes. Beyond 64 bytes, there is
virtually no improvement, and in some cases the miss ratio even increases.
When analyzing the CPI ratio (table 6.20), it becomes even more obvious
that 32-byte lines are optimal. Thus, it is not a pure coincidence that
commercial processors like the Pentium [AA93] or the DEC Alpha [ERP95] use
L1 caches with 32-byte cache lines.
However, 32 bytes is not a random number. In the SPECint92 integer
workload, about 15% of all the instructions change the flow of control
(e.g., branch, jump, and call). On average, the instruction stream switches
to another cluster of references after every sixth instruction. Thus, fetching
more than 8 instructions (32 bytes) rarely pays off, especially since the
instructions account for 75% of the memory references.

Impact on Cost and Cycle Time Doubling the cache size cuts the miss
ratio by about one third and improves the cycle count, but it also impacts
the cost and cycle time of the DLX design (table 6.21). If a cache of 8KB
or more is used, the fixed point core with its 12 kilo gates accounts for less
than 10% of the total cost, and doubling the cache size roughly doubles the
cost of the design.
For a fixed cache size, doubling the line size implies that the number of
cache lines is cut in half. Therefore, the cache directory only requires
half as many entries as before, and the directory shrinks by half. Thus,
doubling the line size reduces the cost of the cache and the cost of the
whole DLX design. Increasing the line size from 8 to 16 bytes reduces the
cost of the DLX design by 7-10%. Doubling the line size to 32 bytes saves
another 5% of the cost. Beyond 32 bytes, an increase of the line size has
virtually no impact on the cost.

Table 6.20: CPI ratio of the DLX with direct mapped cache on the SPECint92
workload. Taken from [Kro97].

DRAM   cache     line size [byte]
α      size      8      16     32     64     128
4      1 KB      6.60   6.31   6.40   7.08   8.99
4      2 KB      6.07   5.83   5.84   6.19   7.25
4      4 KB      5.65   5.49   5.51   5.76   6.44
4      8 KB      5.37   5.26   5.25   5.37   5.74
4      16 KB     5.15   5.08   5.06   5.13   5.35
4      32 KB     5.02   4.96   4.95   4.99   5.13
4      64 KB     4.94   4.89   4.87   4.87   4.92
8      1 KB      7.77   7.22   7.20   7.86   9.85
8      2 KB      6.98   6.53   6.45   6.77   7.86
8      4 KB      6.35   6.06   6.02   6.25   6.94
8      8 KB      5.93   5.73   5.66   5.77   6.14
8      16 KB     5.60   5.46   5.42   5.46   5.69
8      32 KB     5.39   5.30   5.27   5.30   5.44
8      64 KB     5.27   5.19   5.16   5.15   5.20
Table 6.21 also lists the cycle time imposed by the data paths, the control
and the cache interface:

\[ T_{DLX} = \max\{T_{DP},\ T_{CON},\ T_{\$if},\ T_{\$read}\} \]

The cache influences this cycle time in three ways: \(T_{\$if}\) and
\(T_{\$read}\) account for the actual update of the cache and the time of
a cache read hit. The cache directory also provides the hit signal, which
is used by the control in order to generate the clock and write enable
signals (\(T_{CON}\)). This usually takes longer than the cache update
itself, and for large caches it even becomes time critical. Doubling the
line size then reduces the cycle time by 3 gate delays due to the smaller
directory.
Table 6.21: Cost and cycle time of the DLX design with a direct mapped
cache.

cache    cost C_DLX [kilo gates]             cycle time T_DLX
size     line size [B]                       line size [B]
[KB]     8      16     32     64     128     8    16   32   64   128
1        42     39     37     36     36      80   70   70   70   70
2        69     62     59     57     57      83   80   70   70   70
4        121    109    103    100    98      86   83   80   70   70
8        226    202    190    185    182     89   86   83   80   70
16       433    388    365    354    348     92   89   86   83   80
32       842    756    713    692    681     95   92   89   86   83
64       1637   1481   1403   1364   1345    98   95   92   89   86

.      %


Caches are much smaller than the main memory, and thus, many memory
addresses must be mapped to the same set of cache locations. In the direct
mapped cache, there is exactly one possible cache location per memory
address. Thus, when fetching a new memory data, the cache line is either
empty, or the old entry must be evicted. That can cause severe thrashing:
Two or more clusters of references (e.g., instruction and data) share the
same cache line. When accessing these clusters by turns, all the accesses
are cache misses and the line must be replaced every time. Thus, the slow
line fills cannot be amortized by fast cache hits, and the cache can even
deteriorate the performance of the memory system.
Using a larger cache would help, but that is very expensive. A standard
way out is to increase the associativity of the cache. The associativity of a
first level cache is typically two or four. In the following, we analyze the
impact of associativity on the cost and performance of the cache and DLX
design.

Impact on the Miss Ratio Table 6.22 lists the miss ratio of an asso-
ciative cache with random or LRU replacement policy on a SPECint92
workload. This table is taken from [Kro97], but similar results are given in
[GHPS93]. LRU replacement is more complicated than random replace-
ment because it requires a cache history, but it also results in a significantly
better miss ratio. Even with twice the degree of associativity, a cache with
random replacement performs worse than a cache with LRU replacement.
Thus, we only consider the LRU replacement.
In combination with LRU replacement, 2-way and 4-way associativity
improve the miss ratio of the cache. For moderate cache sizes, a 2-way
cache achieves roughly the same miss ratio as a direct mapped cache of
twice the size.

Table 6.22: Miss ratio [%] of the SPECint92 workload on a DLX cache system
with 32-byte lines and write allocation; [Kro97].

cache     direct    2-way              4-way
size      mapped    LRU      random    LRU      random
1 KB      13.57     10.72    19.65     9.41     12.30
2 KB      8.85      7.02     13.34     6.53     8.40
4 KB      6.17      4.54     8.82      4.09     5.41
8 KB      3.97      2.52     6.16      2.04     3.05
16 KB     2.44      1.39     3.97      1.00     1.52
32 KB     1.52      0.73     2.44      0.58     0.83
64 KB     0.82      0.52     1.52      0.44     0.56

Table 6.23: Cost and CPU cycle time of the DLX design with a k-way set
associative cache (32-byte lines).

cache    cost C_DLX [kilo gates]                   T_DLX
size     absolute               increase
[KB]     k = 1   k = 2   k = 4  1→2      2→4       k = 1   k = 2   k = 4
1        37      38      41     3.9 %    7.2 %     70      70      70
2        59      61      62     2.6 %    5.0 %     70      70      70
4        103     105     108    1.7 %    3.4 %     80      74      70
8        190     193     197    1.2 %    2.4 %     83      84      76
16       365     368     375    0.9 %    1.8 %     86      87      86
32       713     718     729    0.7 %    1.5 %     89      90      89
64       1403    1416    1436   0.8 %    1.4 %     92      93      92

Impact on the Cost Like for a direct mapped cache, the cost of the cache
interface with a set associative cache roughly doubles when doubling the
cache size. The cache interface accounts for over 90% of the cost, if the
cache size is 8KB or larger (table 6.23). 2-way and 4-way associativity
increase the total cost by at most 4% and 11%, respectively. The relative
cost overhead of associative caches gets smaller for larger cache sizes.
When switching from 2-way to 4-way associativity, the cost overhead
is about twice the overhead of the 2-way cache, for the following
reasons: In addition to the cache directory and the cache data RAMs, a
set associative cache with LRU replacement also requires a cache history
and some selection circuits. In a 2-way cache, the history holds one bit per
sector, and in a 4-way cache, it holds 8 bits per sector; that is less than
0.5% of the total storage capacity of the cache. The significant cost
increase results from the selection circuits, which are the same for all
cache sizes. In the 2-way cache, those circuits account for about 900 gate
equivalents. The overhead of the 4-way cache is about three times as large,
due to the more complicated replacement circuit.

Impact on the Cycle Time The cache provides the hit signal which is
used by the control in order to generate the clock signals. Except for small
caches (1KB and 2KB), the control even dominates the cycle time
\(T_{DLX}\), which covers all CPU internal cycles (table 6.23). Doubling
the cache size then increases the cycle time by 3 gate delays due to the
larger RAM.
In a 32-bit design, the tags of a direct mapped cache of X bytes are

\[ t_1 = 32 - \log X \]

bits wide according to figure 6.7. Thus, doubling the cache size reduces the
tag width by one. In a set associative cache, the cache lines are distributed
equally over the k cache ways, and each way only holds a fraction (1/k) of
the lines. For a line size of 32 bytes, we have

\[ L_k = L_1 / k = X / (32 \cdot k) \]
\[ t_k = 32 - \log X + \log k = t_1 + \log k \]

The cache tags are therefore log k bits wider than the tags of an equally
sized direct mapped cache.
In each cache way, the local hit signal \(h_i\) is generated by an equality
tester which checks the \(t_k\)-bit tag and the valid flag:

\[ D_{\$k}(h_i) = D_{RAM}(L_k, t_k) + D_{EQ}(t_k + 1) \]

The core of the tester is a \((t_k + 1)\)-bit OR-tree. For a cache size of
1KB to 64KB and an associativity of \(k \le 4\), we have

\[ 32 - \log(64K) + \log 1 \le t_k \le 32 - \log(1K) + \log 4 \]
\[ 17 \le t_k + 1 \le 25 \]

and the equality tester in the hit check circuit of the k-way cache has a
fixed depth. However, the access of the cache data and the directory is
3 log k delays faster due to the smaller RAMs:

\[ D_{\$k}(h_i) = D_{\$1}(h_i) - 3 \log k \]


Table 6.24: Optimal cycle time τ_DLX and number of wait states for the DLX
design with caches and two types of main memory (α ∈ {4, 8}).

cache    α = 4                            α = 8
size     k = 1     k = 2     k = 4        k = 1     k = 2     k = 4
         W   τ     W   τ     W   τ        W   τ     W   τ     W   τ
1 KB     6   70    6   70    5   72       10  71    10  70    10  70
2 KB     6   70    6   70    6   70       10  71    10  71    10  70
4 KB     5   80    6   74    6   70       9   80    10  74    10  71
8 KB     5   83    5   84    5   77       9   83    9   84    10  76
16 KB    5   86    5   87    5   86       9   86    9   87    9   86
32 KB    5   89    5   90    5   89       9   89    8   90    8   90
64 KB    5   92    5   93    5   92       8   92    8   93    8   92

The local hit signals of the k cache ways are combined to a global hit
signal using an AND gate and a k-bit OR-tree. For \(k \ge 2\), we have

\[ D_{\$k}(hit) = D_{\$k}(h_i) + D_{and} + D_{ORtree}(k)
                = D_{\$1}(h_i) - 3 \log k + 2 + 2 \log k \]

Thus, for a moderate cache size, the 2-way cache is one gate delay slower
than the other two cache designs.
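
The tag width and the delay offset of the hit signal can be evaluated
with a few lines of Python. This is only a sketch of the formulas above;
the function names are assumptions introduced here, and the cache size X
is given in bytes.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def tag_width(cache_bytes, k=1, address_bits=32):
    """t_k = 32 - log X + log k for a k-way cache of X bytes."""
    return address_bits - log2_exact(cache_bytes) + log2_exact(k)


def hit_delay_offset(k):
    """Gate delays of the global hit signal relative to the direct mapped cache."""
    if k == 1:
        return 0                          # no combination of local hit signals
    lg = log2_exact(k)
    return -3 * lg + 2 + 2 * lg           # smaller RAMs, AND gate, k-bit OR-tree


if __name__ == "__main__":
    for k in (1, 2, 4):
        widths = [tag_width(kb * 1024, k) for kb in (1, 64)]
        print(f"k={k}: tag width {widths[1]}..{widths[0]} bits, "
              f"hit delay offset {hit_delay_offset(k):+d}")
```

For k = 2 the offset is +1 and for k = 4 it is 0, which matches the
statement that the 2-way cache is one gate delay slower than the other two.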

Impact on the Performance Table 6.24 lists the optimal cycle time of
the DLX design using an off-chip memory with parameter α ∈ {4, 8}, and
table 6.25 lists the CPI and TPI ratio of these designs. In comparison to a
direct mapped cache, associative caches improve the miss ratio, and they
also improve the CPI ratio of the DLX design. For small caches, 2-way
associativity improves the TPI ratio by 4-11%, and 4-way associativity
improves it by 5-17%. However, beyond a cache size of 4KB, the slower
cycle time of the associative caches reduces the advantage of the improved
miss ratio. The 64KB associative caches even perform worse than the
direct mapped cache of the same size.
Doubling the cache size improves the miss ratio and the CPI, but it also
increases the cycle time. Thus, beyond a cache size of 4KB, the 4-way
cache dominates the cycle time \(T_{DLX}\), and the larger cycle time even
outweighs the profit of the better miss ratio. Thus, the 4KB, 4-way cache
yields the best performance, at least within our model. Since larger caches
increase cost and TPI ratio, they cannot compete with the 4KB cache.
In combination with a fast off-chip memory (α = 4), this cache speeds
the DLX design up by a factor of 2.09 at 8.8 times the cost. For a memory
system with α = 8, the cache even yields a speedup of 2.9. According to
table 6.26, the speedups of the 1KB and 2KB caches are at most 6% to 15%
worse than that of the 4KB cache at a significantly better cost ratio. Thus,
there is a trade-off between cost and performance, and the best cache size
is not so obvious. Figure 6.22 depicts the quality of the DLX designs with
a 4-way cache of size 1KB to 4KB relative to the quality of the design
without cache. The quality is the weighted geometric mean of cost and
TPI ratio:

\[ Q = C^{-q} \cdot TPI^{\,q-1} \]

Table 6.25: CPI and TPI ratio of the DLX design with cache. The third table
lists the CPI and TPI reduction of the set associative cache over the direct
mapped cache (32-byte lines).

CPI ratio
cache    α = 4                      α = 8
size     k = 1    k = 2    k = 4    k = 1    k = 2    k = 4
1 KB     6.67     6.29     5.90     7.74     7.20     6.96
2 KB     6.04     5.79     5.73     6.85     6.51     6.42
4 KB     5.51     5.46     5.40     6.18     6.05     5.96
8 KB     5.25     5.07     5.02     5.80     5.55     5.58
16 KB    5.06     4.94     4.89     5.53     5.35     5.28
32 KB    4.95     4.86     4.84     5.37     5.14     5.12
64 KB    4.87     4.83     4.82     5.16     5.11     5.10

TPI ratio
cache    α = 4                      α = 8
size     k = 1    k = 2    k = 4    k = 1    k = 2    k = 4
1 KB     466.9    440.3    425.0    549.3    504.2    487.0
2 KB     422.7    405.6    401.1    486.5    462.2    449.3
4 KB     441.1    404.2    378.2    494.7    447.4    423.2
8 KB     435.6    426.2    386.2    481.5    466.1    423.9
16 KB    435.5    429.6    420.6    475.9    465.7    454.5
32 KB    440.9    437.2    430.7    478.4    462.8    460.6
64 KB    447.9    449.4    443.7    474.4    475.0    468.8

         CPI reduction [%]                  TPI reduction [%]
cache    α = 4           α = 8              α = 4           α = 8
size     k = 2   k = 4   k = 2   k = 4      k = 2   k = 4   k = 2   k = 4
1 KB     6.1     13.0    7.4     11.2       6.1     9.9     8.9     12.8
2 KB     4.2     5.4     5.3     6.8        4.2     5.4     5.3     8.3
4 KB     0.9     2.1     2.3     3.7        9.1     16.6    10.6    16.9
8 KB     3.4     4.6     4.5     4.0        2.2     12.8    3.3     13.6
16 KB    2.5     3.5     3.4     4.7        1.4     3.5     2.2     4.7
32 KB    2.0     2.4     4.5     5.0        0.8     2.4     3.4     3.9
64 KB    0.7     0.9     0.9     1.2        -0.3    0.9     -0.1    1.2

Table 6.26: Speedup and cost increase of the DLX with 4-way cache over the
design without cache.

cache size               1KB     2KB     4KB
speedup: α = 4           1.86    1.97    2.09
         α = 8           2.54    2.75    2.92
relative cost increase   3.34    5.16    8.78

Figure 6.22: Quality ratio of the designs with 4-way cache (1KB, 2KB, 4KB)
relative to the design without cache for two types of off-chip memory
(α = 4 and α = 8), plotted over the quality parameter q from 0 to 0.6.

As long as more emphasis is put on the performance than on the cost,
the caches are worthwhile. In combination with fast off-chip memory
(α = 4), the design with a 1KB cache is best over the quality range of
q ∈ [0.1, 0.34]. For slower memory, the 1KB cache even wins up to a
quality parameter of q = 0.55 (at q = 0.5, cost and performance are equally
important). Only for q < 0.13, the 4KB cache wins over the 1KB cache,
but these quality parameters are not realistic (page 167). Thus, when
optimizing the DLX design for a reasonable performance and quality, a cache
size of 1 KB is most appropriate.
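
The quality ratio relative to the design without cache can be recomputed
from table 6.26: with Q defined as the weighted geometric mean of cost and
TPI, the ratio is (cost increase)^(-q) times speedup^(1-q). The sketch
below uses the α = 4 row; the names are assumptions introduced here, and
since the table entries are rounded, the crossover points are only
approximate.

```python
def quality_ratio(cost_increase, speedup, q):
    """Q_cache / Q_nocache for Q = C**(-q) * TPI**(q - 1)."""
    return cost_increase ** (-q) * speedup ** (1 - q)


# Table 6.26, alpha = 4: (relative cost increase, speedup) per cache size.
DESIGNS_ALPHA4 = {"1KB": (3.34, 1.86), "2KB": (5.16, 1.97), "4KB": (8.78, 2.09)}


def best_design(q, designs=DESIGNS_ALPHA4):
    """Name of the design with the highest quality ratio; 'no cache' has ratio 1."""
    ratios = {name: quality_ratio(c, s, q) for name, (c, s) in designs.items()}
    ratios["no cache"] = 1.0
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    for q in (0.05, 0.2, 0.3, 0.4, 0.5):
        print(q, best_design(q))
```

For very small q the 4KB cache comes out best, for moderate q the 1KB
cache, and near q = 0.34 the 1KB design and the design without cache are
of equal quality.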

Artifact The performance optimization suggests a 4 times larger cache


than the quality metric. This is due to the very inexpensive DLX core
which so far only comprises a simple fixed point core without multiplier
and divider. Adding a floating point unit (chapter 9) will increase the cost
of the DLX core dramatically, and then optimizing the DLX design for
a good performance or for a good quality will result in roughly the same
cache size.

6.5 Pipelined DLX with Cache Memory

In order to avoid structural hazards, the pipelined DLX core of
chapters 4 and 5 requires an instruction memory IM and a data memory DM.
The cache system described in this section implements this split memory
by a separate instruction and data cache. Both caches are backed by the
unified main memory, which holds data and instructions.
The split cache system causes two additional problems:

• The arbitration of the memory bus. Our main memory can only
  handle one access at a time. However, an instruction cache miss can
  occur together with a write through access or a data cache miss. In
  such a case, the data cache will be granted access, and the instruction
  cache must wait until the main memory allows for a new access.

• The data consistency of the two caches. As long as a memory word is
  placed in the instruction cache, the instruction cache must be aware
  of the changes done to that memory word. Since in our DLX design,
  all the memory writes go through the data cache, it is only the
  instruction cache which must be protected against data inconsistency.

Although we do not allow for self modifying code, data consistency is


still a problem for the following reason. During compile time, the program
code generated by the compiler is treated as data and is therefore held in the
data cache. When running the program, the code is treated as instructions
and must be placed in the instruction cache. After re-compiling a program,
the instruction cache may still hold some lines of the old, obsolete code.
Thus, it must be ensured that the processor fetches the new code from the
main memory instead of the obsolete code held in the instruction cache.
As usual, one can leave the consistency to the user and the operating
system or can support it in hardware. In case of a software solution, the
instruction cache must be flushed (i.e., all entries are invalidated) whenever
changing existing code or adding new code. It is also feasible to flush the
instruction cache whenever starting a new program.
In our design, we will go for a hardware solution. The two caches snoop
on the memory bus. On a data cache miss, the requested line is loaded
into the data cache, as usual. If the instruction cache holds this line, its
corresponding cache entry is invalidated. In analogy, a line of the data
cache which holds the instructions requested by the instruction cache is
also invalidated on an instruction cache miss. At any time a particular
memory line is in at most one cache.
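
The effect of the snooping scheme can be illustrated with a toy model in
which each cache is reduced to the set of line addresses it currently
holds. The class and method names are assumptions introduced here; the
model ignores evictions, timing and the bus protocol and only demonstrates
the exclusiveness invariant.

```python
class SplitCaches:
    """Toy model of the snooping Icache/Dcache pair (line addresses only)."""

    def __init__(self):
        self.icache = set()
        self.dcache = set()

    def data_miss(self, line):
        """Line fill of the Dcache; a snoop hit invalidates the Icache copy."""
        self.dcache.add(line)
        self.icache.discard(line)

    def instruction_miss(self, line):
        """Line fill of the Icache; a snoop hit invalidates the Dcache copy."""
        self.icache.add(line)
        self.dcache.discard(line)

    def exclusive(self):
        """True if no memory line is held by both caches."""
        return self.icache.isdisjoint(self.dcache)


if __name__ == "__main__":
    caches = SplitCaches()
    caches.data_miss(0x40)           # the compiler writes code as data
    caches.instruction_miss(0x40)    # the program is executed
    caches.data_miss(0x40)           # the program is re-compiled
    print(sorted(caches.icache), sorted(caches.dcache), caches.exclusive())
```

After the re-compilation step, the obsolete line is no longer in the
instruction cache, so the next fetch has to go to the main memory.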

6.5.1 Changes in the DLX Data Paths

As in the sequential DLX design (section 6.4), the caches only impact
the memory environments and the memory control circuits. This section
describes how to fit the instruction and data cache into the memory en-
vironments IMenv and DMenv of the pipelined design DLXΠ supporting
interrupts, and how the memory interface Mif connects these two environ-
ments to the external main memory. The new memory control is described
in section 6.5.2.

1    +  5 %
The core of the data environment DMenv (figure 6.23) is the cache interface
D$if as it was introduced in section 6.3. The data cache (Dcache) is a
sectored, write through, write allocate cache with a 64-bit word size. In
addition to the standard control signals of a cache interface, the memory
environment is governed by the reset signal and by signal Dlinv, which
requests a line invalidation access due to an instruction cache miss.

Figure 6.23: Data memory environment DMenv with cache
As in the sequential design (section 6.4), the cache interface D$if is
connected to the data paths through a 32-bit address port and two data ports
MDin and DMout. Due to the 64-bit cache word size, the data ports must
be patched together. On the input port MDin, data MDRw is still duplicated:

\[ MDin[63:32] = MDin[31:0] = MDRw[31:0] \]

On the output port Dout, a multiplexer selects the requested 32-bit word
within the double-word based on the address bit MAR[2]:

\[ DMout = \begin{cases} Dout[31:0] & \text{if } MAR[2] = 0 \\
                         Dout[63:32] & \text{if } MAR[2] = 1 \end{cases} \]

On an instruction cache miss the data cache is checked for the requested
line (Dlinv = 1). In case of a snoop hit, the corresponding Dcache entry
is invalidated. For the snoop access and the line invalidation, the Dcache
interface uses the address dpc of the instruction memory instead of address
MAR:

\[ a = \begin{cases} MAR & \text{if } Dlinv = 0 \\
                     dpc & \text{if } Dlinv = 1 \end{cases} \]

A multiplexer selects between these two addresses. Since the Dcache is
only flushed on reset, the clear input of the Dcache interface D$if is
connected to the reset signal. The hit signal Dhit is provided to the memory
control.
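
The data path glue around the Dcache interface amounts to one duplication
and two multiplexers. The following bit-level sketch models them on Python
integers; the function names are assumptions introduced here, and the
cache interface itself is not modeled.

```python
MASK32 = 0xFFFF_FFFF


def dmenv_mdin(mdrw):
    """MDin[63:32] = MDin[31:0] = MDRw[31:0]: duplicate the 32-bit word."""
    word = mdrw & MASK32
    return (word << 32) | word


def dmenv_dmout(dout64, mar):
    """Select the 32-bit word of the double-word Dout by address bit MAR[2]."""
    if (mar >> 2) & 1:
        return (dout64 >> 32) & MASK32    # MAR[2] = 1: upper word
    return dout64 & MASK32                # MAR[2] = 0: lower word


def dmenv_address(mar, dpc, dlinv):
    """Cache address a: dpc on a snoop/line invalidation, MAR otherwise."""
    return dpc if dlinv else mar


if __name__ == "__main__":
    dout = 0x1111_2222_3333_4444
    print(hex(dmenv_mdin(0xDEADBEEF)))
    print(hex(dmenv_dmout(dout, mar=0x100)), hex(dmenv_dmout(dout, mar=0x104)))
    print(hex(dmenv_address(mar=0x100, dpc=0x2000, dlinv=1)))
```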
The data memory environment communicates with the memory interface
Mif and the external memory via the address port D$a and the data ports
MDin and MDat. The Dcache interface provides the memory address

\[ D\$a = Mad \]

Let the sectored cache comprise \(2^{ld}\) lines, each of which is split in
\(S = 2^{sd}\) sectors. The data memory environment then has cost

\[ C_{DMenv} = C_{\$if}(29 - ld - sd,\ ld,\ sd,\ 3) + 2 \cdot C_{mux}(32) \]

Assuming that control signal Dlinv is precomputed, address a and data Din
have the following accumulated delay:

\[ A_{D\$if}(a) = A_{PCenv}(dpc) + D_{mux} \]
\[ A_{D\$if}(Din) = 0 \]

Figure 6.24: Instruction memory environment IMenv with cache

1    . -  5 %


Figure 6.24 depicts how to fit a first level instruction cache (Icache) into
the instruction memory environment IMenv. The instructions in the Icache
are only read by the DLX core but never written. Thus, the Icache interface
I$if could actually be simplified. Nevertheless, we use the standard cache
interface. However, the input data port Din is not connected.
The address port a and the data port Dout of the cache interface I$if
are connected to the data paths like in the environment DMenv. However,
the program counter dpc now serves as standard cache address, whereas
address MAR is only used in case of a snoop access or a line invalidation.
Flag Ilinv signals such an access:

\[ IMout = \begin{cases} Dout[31:0] & \text{if } dpc[2] = 0 \\
                         Dout[63:32] & \text{if } dpc[2] = 1 \end{cases} \]

\[ a = \begin{cases} MAR & \text{if } Ilinv = 1 \\
                     dpc & \text{if } Ilinv = 0 \end{cases} \]
Figure 6.25: Interface Mif connecting IMenv and DMenv to the external
memory

The Icache is, like the Dcache, flushed on reset; its hit signal Ihit is
provided to the memory control. The environment IMenv communicates
with memory interface Mif and the external memory via the address port
I$a and the data port MDat. The Icache interface provides the memory
address I$a = Mad.
Let the instruction cache comprise \(2^{li}\) lines with \(2^{si}\) sectors
per line and \(2^b = 8\) bytes per sector; the cost of environment IMenv
can be expressed as

\[ C_{IMenv} = C_{\$if}(29 - li - si,\ li,\ si,\ 3) + 2 \cdot C_{mux}(32) \]

The Icache address a has the same accumulated delay as the Dcache address.

.    5 5 %


The memory interface Mif (figure 6.25) connects the two memory
environments of the pipelined DLX design to the external memory. The
environments DMenv and IMenv communicate with the external memory via
the 32-bit address bus MAd and the 64-bit data bus MDat. The memory
interface is controlled by the signals MDindoe and Igrant.
On Igrant = 1, the Icache interface is granted access to the external
memory; the memory interface forwards address I$a to the address bus.
On Igrant = 0, the Dcache interface can access the external memory and
circuit Mif forwards address D$a:

\[ MAd = \begin{cases} I\$a & \text{if } Igrant = 1 \\
                       D\$a & \text{if } Igrant = 0 \end{cases} \]

On MDindoe = 1, the memory interface forwards the data MDin of the
data memory environment to the data bus MDat.
6.5.2 Memory Control

In analogy to the sequential DLX design with cache, the memory system
is governed by the memory control circuits DMC and IMC and by the
memory interface control MifC.

. -  5 %   .5


The control IMC of the instruction memory is exactly the same as the one
used in the pipelined design DLXΠ of chapter 5. Circuit IMC signals a
misaligned instruction fetch by imal = 1. Since the DLX core never writes
to the instruction memory, the bank write signals are always inactive and
can be tied to zero:

\[ Imbw[7:0] = 0^8 \]

+  5 %   +5


As in the pipelined design DLXΠ without caches, the data memory control
DMC generates the bank write signals of the data memory and checks for
a misaligned access. However, twice as many write signals DMbw[7:0]
are required because the memory system now operates on double-words.
The original four signals Dmbw[3:0] select within a word, and the address
bit MAR[2] selects the word within the sector. Thus, the new bank write
signals are obtained as

\[ DMbw[3:0] = Dmbw[3:0] \wedge \overline{MAR[2]} \]
\[ DMbw[7:4] = Dmbw[3:0] \wedge MAR[2] \]

As in the sequential design with cache, the control DMC buffers these
bank write signals in a register before feeding them to the Dcache interface
and to the byte enable lines BE of the memory bus. Register DMBw is
clocked during the first cycle of a data memory transaction, signaled by
D$rd = 1:

\[ DMBw[7:0] := DMbw[7:0] \quad \text{if } D\$rd = 1 \]

Circuit DMC detects a misaligned access like in the DLXΠ design. Flag
dmal = 1 signals that an access to the data memory is requested, and that
this access is misaligned (i.e., malAc = 1):

\[ dmal = (Dmr3 \vee Dmw3) \wedge malAc \]

In addition, it now also masks the memory read and write signals Dmr and
Dmw with the flag dmal:

\[ Dmra = Dmr3 \wedge \overline{dmal} = \overline{Dmr3}\ \text{NOR}\ malAc \]
\[ Dmwa = Dmw3 \wedge \overline{dmal} = \overline{Dmw3}\ \text{NOR}\ malAc \]

Let dmc denote the data memory control of the pipelined design without
cache. The cost, delay and cycle time of the extended memory control
DMC can then be expressed as

\[ C_{DMC} = C_{dmc} + C_{and}(8) + 3 \cdot C_{inv} + C_{ff}(8) + 2 \cdot C_{nor} \]
\[ A_{DMC}(DMBw) = 0 \]
\[ A_{DMC}(dmal) = A_{DMC}(Dmra, Dmwa) = A_{dmc} \]
\[ T_{DMC} = A_{dmc} + D_{and} + \Delta \]
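
The combinational part of the extended control DMC can be modeled at the
bit level. The sketch below is an illustration under stated assumptions:
the function names are introduced here, the 4-bit signal Dmbw[3:0] and
the flags Dmr3, Dmw3 and malAc are taken as given inputs, and the register
DMBw is not modeled.

```python
def dmc_bank_writes(dmbw4, mar2):
    """8-bit DMbw[7:0]: Dmbw[3:0] steered to the lower or upper word by MAR[2]."""
    dmbw4 &= 0xF
    return (dmbw4 << 4) if mar2 else dmbw4


def dmc_masks(dmr3, dmw3, mal_ac):
    """Return (dmal, Dmra, Dmwa) for the given request and misalignment flags."""
    dmal = bool((dmr3 or dmw3) and mal_ac)
    dmra = bool(dmr3 and not dmal)        # equals (not Dmr3) NOR malAc
    dmwa = bool(dmw3 and not dmal)        # equals (not Dmw3) NOR malAc
    return dmal, dmra, dmwa


if __name__ == "__main__":
    print(format(dmc_bank_writes(0b0011, mar2=0), "08b"))
    print(format(dmc_bank_writes(0b0011, mar2=1), "08b"))
    print(dmc_masks(dmr3=1, dmw3=0, mal_ac=1))
```

A misaligned load thus raises dmal and suppresses Dmra, so that no memory
transaction is started for the faulty access.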

5 % .   


Like in the sequential design, the memory interface control MifC controls
the cache interface and the access to the external memory bus. Since there
are two caches in the pipelined DLX design, the control MifC consists
of two automata I$ifC and D$ifC. Each automaton generates a busy flag
(ibusy, dbusy), a set of cache control signals, a set of handshake and control
signals for the memory bus, and some signals for the synchronization. The
cache control signals (i.e.: $rd, Vw, Tw, Sw, lfill, valid, scntce, scntclr,
linv, $w) are forwarded to the corresponding cache interface I$if and D$if.
The D$ifC control provides the following synchronization signals:

• Dinit, indicating that D$ifC is in its initial state,

• Igrant, granting the Icache access to the memory bus, and

• isnoop, requesting the Icache to perform a snoop access.

The I$ifC signal iaccess indicates an ongoing transaction between Icache
and memory.
For the memory bus, control D$ifC provides the request signal Dreq,
the flags Dwr, Dburst and the enable signal MDindoe. Since the Icache
interface only uses the bus for fetching a new cache line, its burst and r/w
flag have a fixed value. Based on flag Igrant, control MifC selects the bus
signals as

\[ (req, wr, burst) = \begin{cases} (Dreq, Dwr, Dburst) & \text{if } Igrant = 0 \\
                                    (Ireq, 0, 1) & \text{if } Igrant = 1 \end{cases} \]

using a 3-bit multiplexer. Thus, the cost of circuit MifC can be expressed
as

\[ C_{MifC} = C_{mux}(3) + C_{I\$ifC} + C_{D\$ifC} \]
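
The arbitration by the flag Igrant can be written down as two small
selection functions, one for the address bus and one for the bus control
signals. This is only a sketch; the function names are assumptions
introduced here, and the automata which generate Igrant, Ireq and Dreq
are not modeled.

```python
def mif_address(igrant, icache_addr, dcache_addr):
    """Address bus MAd: I$a if the Icache is granted the bus, D$a otherwise."""
    return icache_addr if igrant else dcache_addr


def mif_bus_signals(igrant, ireq, dreq, dwr, dburst):
    """Bus signals (req, wr, burst) selected by the 3-bit multiplexer."""
    if igrant:
        return (ireq, 0, 1)               # the Icache only reads whole lines
    return (dreq, dwr, dburst)


if __name__ == "__main__":
    print(hex(mif_address(1, 0x2000, 0x8000)), mif_bus_signals(1, 1, 1, 1, 0))
    print(hex(mif_address(0, 0x2000, 0x8000)), mif_bus_signals(0, 1, 1, 1, 0))
```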
The two automata I$ifC and D$ifC are very much like the Mealy automaton
of the sequential MifC control, except that they provide some new
signals, and that they need two additional states for the snoop access. In
the I$ifC automaton, the states for the memory write access are dropped.
Figure 6.26 depicts the FSDs modeling the Mealy automata of the D$ifC
and I$ifC control. Table 6.27 lists the active control signals for each state;
table 6.28 lists the parameters of the two automata, assuming that the
automata share the monomials.

Figure 6.26: FSDs modeling the Mealy automata of the controls D$ifC and
I$ifC. States of D$ifC: D$RD, DFreq, Dfill, Dwait, DLwait, DLfill, Dsnoop,
Dlinv, Mwrite, Mlast, D$w. States of I$ifC: I$RD, IFreq, Ifill, Iwait,
ILwait, ILfill, Isnoop, Ilinv.
The inputs of the two automata have the following accumulated delay:

\[
\begin{aligned}
A_{in}(I\$ifC) &= \max\{A_{IMC},\ A_{IMenv}(Ihit),\ A_M(reqp, Brdy) + d_{bus}\} \\
A_{in}(D\$ifC) &= \max\{A_{dmc},\ A_{DMenv}(Dhit),\ A_M(reqp, Brdy) + d_{bus}\}
\end{aligned}
\]

The two busy signals and the signals D$rd and I$rd are the only Mealy
control signals. As in the sequential design, these signals are just used for
clocking. The remaining cache control signals (cs$if) and the bus control
signals are of type Moore and can be precomputed. They have delay

\[
\begin{aligned}
A_{MifC}(cs\$if) &= 0 \\
A_{MifC}(req, wr, burst) &= D_{mux}.
\end{aligned}
\]

Since the automata only raise the flags ibusy and dbusy in case of a
non-faulty memory access, the clock circuit of the stall engine can now
simply obtain the busy signal as

\[ busy = ibusy \vee dbusy \]

at an accumulated delay of

\[ A_{ce}(busy) = \max\{A_{out2}(I\$ifC),\ A_{out2}(D\$ifC)\} + D_{or}. \]

Table 6.27: Active control signals for the FSDs modeling the MifC control; X
denotes the data (D) or the instruction (I) cache.

| state  | $if control                      | D$ifC only                                                              | I$ifC only                                                |
|--------|----------------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------|
| XFreq  | scntclr, scntce, Vw, lfill       | Dreq, Dburst, isnoop                                                    | Ireq, iaccess                                             |
| Xfill  | scntce, lfill, Sw                | Dburst                                                                  | iaccess                                                   |
| Xwait  | lfill                            | Dburst                                                                  | iaccess                                                   |
| XLwait | lfill                            | Dburst                                                                  | iaccess                                                   |
| XLFill | scntce, valid, Vw, Tw, lfill, Sw | /dbusy = Dmra                                                           | /ibusy, iaccess                                           |
| Xsnoop |                                  | D$rd, Dlinv, Igrant                                                     | I$rd, Ilinv                                               |
| Xlinv  | Vw                               | Dlinv, Igrant                                                           | Ilinv                                                     |
| I$RD   |                                  |                                                                         | I$rd = /imal, /ibusy = (imal ∧ /isnoop) ∨ (Ihit ∧ /isnoop) |
| D$w    | $w                               | Dw/r, Dreq, MDindoe                                                     |                                                           |
| Mwrite |                                  | Dw/r, MDindoe                                                           |                                                           |
| Mlast  |                                  | MDindoe, /mbusy                                                         |                                                           |
| D$RD   |                                  | Igrant, Dinit, D$rd = Dmra ∨ Dmwa, /dbusy = (dmal ∧ /Ireq) ∨ (Dhit ∧ /Ireq) |                                                           |

Table 6.28: Parameters of the Mealy automata used in the memory interface
control MifC

|       | # states k | # inputs σ | # outputs γ | νsum | νmax1 | νmax2 |
|-------|------------|------------|-------------|------|-------|-------|
| D$ifC | 11         | 8          | 18          | 42   | 5     | 3     |
| I$ifC | 8          | 6          | 11          | 29   | 5     | 3     |

|       | fansum | fanmax | #M | lsum | lmax | lmax2 |
|-------|--------|--------|----|------|------|-------|
| D$ifC | 20     | 3      | 18 | 35   | 3    | 2     |
| I$ifC | 16     | 3      | 10 | 17   | 3    | 2     |

(νsum, νmax1, νmax2: frequency of the outputs; fansum, fanmax: fanin of the
states; #M, lsum, lmax, lmax2: number and length of the nontrivial monomials.)

This is the tricky part. Let us call D$ifC the D-automaton, and let us call
I$ifC the I-automaton. We would like to show the following properties:

Lemma.

1. Memory accesses of the D-automaton and of the I-automaton do not
   overlap,

2. memory accesses run to completion once they are started, and

3. a cache miss in DM (IM) always generates a snoop access in IM (DM).

Before we can prove the lemma, we first have to define formally in which
cycles a memory access takes place. We refer to the bus protocol and count
an access from the first cycle, when the first address is on the bus, until the
last cycle, when the last data are on the bus.

 Proof of the lemma:


After power up, the automata are in their initial state and no access is taking
place.
The D-automaton controls the bus via signal Igrant. It grants the bus to
the I-automaton (Igrant = 1) only during states D$RD, Dsnoop and Dlinv,
and it owns the bus (Igrant = 0) during the remaining states. Therefore,
accesses of the D-automaton can only last from state DFreq to DLfill or from
state D$w to Mlast. During these states, the I-automaton does not have the
bus. Thus, accesses do not overlap, and accesses of the D-automaton run
to completion once they are started.
The I-automaton attempts to start accesses in state IFreq, but it may not
have the bus. Thus, accesses of the I-automaton can only last from state
IFreq with Igrant = 1 until state ILfill. In each of these states we have
iaccess = 1.
Suppose state IFreq is entered with Igrant = 0. Then, the access starts
in the cycle when the D-automaton is back in its initial state D$RD. In this
cycle we have

\[ Igrant = Dinit = iaccess = 1. \]

Thus, the access of the I-automaton starts, the I-automaton leaves state
IFreq, and the active signal iaccess prevents the D-automaton from
entering states DFreq or D$w before the access of the I-automaton runs to
completion.
If state IFreq is entered with Igrant = 1, the access starts immediately,
and the D-automaton returns to its initial state within 0, 1 or 2 cycles. From
then on, things proceed as in the previous case.
In state DFreq, signal isnoop is active, which sends the I-automaton from
its initial state into state Isnoop. Similarly, in state IFreq, signal Ireq is
active, which sends the D-automaton from its initial state into state Dsnoop.

6.5.3 Design Evaluation

For the sequential DLX design (section 6.4), which is connected to a 64 MB
main memory, it has turned out that a 4 KB cache with 32 byte lines yields
a reasonable performance and cost performance ratio. Thus, our pipelined
DLX design will also implement 4 KB of first level cache; the data and the
instruction cache comprise 2 KB each.

Cycle Time of the Memory System
As for the sequential DLX design with cache, the temporal behavior of the
memory system is modeled by the request cycle time \(T_{Mreq}\), the burst read
time \(T_{Mrburst}\), the read/write access time \(T_{Maccess}\) to off-chip memory, the
cache read access time \(T_{\$read}\), and the cycle time \(T_{\$if}\) of the caches (see
page 283).
In the pipelined DLX design, the Icache and the Dcache have the same
size, and their inputs have the same accumulated delay, thus

\[ T_{\$if} = T_{I\$if} = T_{D\$if} \quad\text{and}\quad T_{\$read} = T_{I\$read} = T_{D\$read}. \]

The formulae of the other three memory cycle times remain unchanged.
The cycle time \(T_{DLX}\) of all internal cycles and the cycle time \(\tau_{DLX}\) of the
whole system are still modeled as

\[
\begin{aligned}
T_{DLX} &= \max\{T_{DP},\ T_{CON},\ T_{\$read},\ T_{\$if},\ T_{Mreq},\ T_{Mrburst}\} \\
\tau_{DLX} &= \max\{T_{DLX},\ T_{Maccess} / W\}.
\end{aligned}
\]

Cost and Cycle Time
According to table 6.29, the 4KB cache memory increases the cost of the
pipelined design by a factor of 5.4. In the sequential design this increase
factor is significantly larger (8.8) due to the cheaper data paths.
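Table 6.29: Cost of the DLXΠ design without cache and with 2KB, 2-way Icache
and Dcache

|             | Menv  | DP     | CON  | DLX    |
|-------------|-------|--------|------|--------|
| no cache    | –     | 20610  | 1283 | 21893  |
| with caches | 96088 | 116698 | 2165 | 118863 |

Table 6.30: Cycle time of the design DLXΠ with 2KB, 2-way Icache and Dcache

| MifC | stall | DP | $read | $if | Mreq | Mrburst | Maccess (α = 4) | Maccess (α = 8) |
|------|-------|----|-------|-----|------|---------|-----------------|-----------------|
| 65   | 79    | 89 | 55    | 47  | 42   | 51      | 379             | 707             |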

The two caches and the connection to the external memory account for
81% of the total cost of the pipelined design. The memory interface
control now comprises two Mealy automata, one for each cache. It therefore
increases the cost of the control by 69%, which is about twice the increase
encountered in the sequential design.
Table 6.30 lists the cycle time of the DLXΠ design and of its memory
system, assuming a bus and handshake delay of \(d_{bus} = 15\) and
\(d_{Mhsh} = 10\). The data paths dominate the cycle time \(T_{DLX}\) of the
processor core. The caches themselves and the control are not time critical.
The memory request and the burst read can be performed in a single cycle;
they can tolerate a bus delay of up to \(d_{bus} = 53\).

Impact of the Cache Size


The pipelined DLX design implements a split cache system, i.e., it uses a
separate instruction cache and data cache. The cost of this cache system is
roughly linear in the total cache size (table 6.31). Compared to the unified
cache system of the sequential DLXΣ design, the split system implements
the cache interface twice, and it therefore encounters a bigger overhead.
Using 2-way set associative caches, the split system with a total cache size
of 1KB is 15% more expensive than the unified cache system. For a larger
cache size of 4KB (32 KB), the overhead drops to 4% (1%).
The split cache can also be seen as a special associative cache, where
half the cache ways are reserved for instructions or data, respectively. The
cost of the split and unified cache system are then virtually the same; the
difference is at most 2%.
As in the sequential design, the cycle time of the control increases with
the cache size, due to the computation of the hit signal. However, if the size
of a single cache way is at most 2KB, the control is not time critical. In
spite of the more complex cache system, this is the same cache size bound
as in the sequential DLX design. That is because the stall engine and main
control of the pipelined design are also more complicated than those used
in the sequential design.

Table 6.31: Cost of the memory environments and the cycle time of the pipelined
DLX design depending on the total cache size and the associativity.
\(C^{\Sigma}_{Menv}\) denotes the cost of the unified cache in the sequential DLX
design. The cost is given in kilo gates.

| total cache size | \(C^{\Sigma}_{Menv}\), 2-way | \(C^{\Sigma}_{Menv}\), 4-way | \(C_{Menv}\), 1-way | \(C_{Menv}\), 2-way | \(T_{CON}\), 1-way | \(T_{CON}\), 2-way | \(T_{DP}\), 1- and 2-way |
|-------|-----|-----|-----|-----|----|-----|----|
| 1 KB  | 26  | 29  | 27  | 30  | 71 | 73  | 89 |
| 2 KB  | 48  | 51  | 49  | 52  | 75 | 75  | 89 |
| 4 KB  | 92  | 96  | 93  | 96  | 83 | 79  | 89 |
| 8 KB  | 178 | 185 | 181 | 184 | 93 | 87  | 89 |
| 16 KB | 353 | 363 | 356 | 360 | 96 | 97  | 89 |
| 32 KB | 701 | 717 | 705 | 711 | 99 | 100 | 89 |

Impact on the Performance


CPI Ratio. In section 4.6.4, we have derived the CPI ratio of the pipelined
design DLXΠ on a SPECint92 workload as

\[ CPI_{DLX\Pi} = 1.26 + (\nu_{fetch} + \nu_{load} + \nu_{store}) \cdot CPH_{slowM}. \]

The workload comprises 25.3% loads and 8.5% stores. Due to some empty
delay slots of branches, the pipelined DLX design must fetch 10% additional
instructions, so that \(\nu_{fetch} = 1.1\).
As in the sequential DLX design with cache interface, the memory access
time is not uniform (table 6.16, page 288). A read hit can be performed
in just a single cycle. A standard read/write access to the external
memory (\(T_{Maccess}\)) requires W processor cycles. Due to the write through
policy, a write hit then takes 2 + W cycles. For a cache line with S sectors,
a cache miss adds another S + W cycles. Let \(p_{Im}\) and \(p_{Dm}\) denote the miss
ratio of the instruction and data cache. Since on a cache miss, the whole
pipeline is usually stalled, the CPI ratio of the pipelined design with cache
interface can be expressed as

\[
\begin{aligned}
CPI_{L1p} &= 1.26 + \nu_{store} \cdot (1 + W)
             + (\nu_{fetch} \cdot p_{Im} + \nu_{load+store} \cdot p_{Dm}) \cdot (W + S) \\
          &= 1.35 + 0.085 \cdot W + (1.1 \cdot p_{Im} + 0.34 \cdot p_{Dm}) \cdot (W + S).
\end{aligned}
\qquad (6.5)
\]

Table 6.32: Miss ratios (in percent) of a split and a unified cache system on
the SPECint92 workload depending on the total cache size and the associativity.

| total cache size | Icache, 1-way | Icache, 2-way | Dcache, 1-way | Dcache, 2-way | Effective, 1-way | Effective, 2-way | Unified, 1-way | Unified, 2-way | Unified, 4-way |
|-------|-----|-----|------|------|------|-----|------|------|-----|
| 1 KB  | 8.9 | 8.2 | 22.8 | 15.2 | 12.4 | 9.9 | 13.6 | 10.8 | 9.4 |
| 2 KB  | 6.6 | 5.9 | 14.1 | 9.4  | 8.5  | 6.8 | 8.9  | 7.0  | 6.5 |
| 4 KB  | 4.7 | 4.4 | 9.4  | 5.5  | 5.9  | 4.7 | 6.2  | 4.5  | 4.1 |
| 8 KB  | 3.0 | 2.4 | 6.8  | 3.5  | 4.0  | 2.7 | 4.0  | 2.5  | 2.0 |
| 16 KB | 2.0 | 1.1 | 3.5  | 2.6  | 2.4  | 1.5 | 2.4  | 1.5  | 1.0 |
| 32 KB | 1.1 | 0.4 | 2.6  | 1.8  | 1.5  | 0.8 | 1.5  | 0.7  | 0.6 |
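As a plausibility check, the CPI model can be evaluated numerically. The sketch below assumes a miss penalty of W + S cycles and S = 4 sectors per line; both are assumptions in the sense that they are the reading that reproduces the CPI values reported for the split 2-way cache, not values stated explicitly at this point of the text. The function name cpi_l1p is hypothetical.

```python
# Workload parameters taken from the text (SPECint92):
NU_FETCH = 1.1        # 10% additional fetches due to empty delay slots
NU_STORE = 0.085      # 8.5% stores
NU_LOAD_STORE = 0.34  # 25.3% loads + 8.5% stores, rounded as in (6.5)
S = 4                 # sectors per cache line (assumption)

def cpi_l1p(w, p_im, p_dm, s=S):
    """CPI of the pipelined design with cache interface.

    w     -- number of wait states W
    p_im  -- miss ratio of the instruction cache (fraction, not percent)
    p_dm  -- miss ratio of the data cache (fraction, not percent)
    """
    write_through = NU_STORE * (1 + w)
    misses = (NU_FETCH * p_im + NU_LOAD_STORE * p_dm) * (w + s)
    return 1.26 + write_through + misses

# Total cache size 4 KB, 2-way: miss ratios 4.4% and 5.5%, W = 4
print(round(cpi_l1p(4, 0.044, 0.055), 2))
```

For the 1 KB and 4 KB configurations with W = 4, this model yields values within rounding distance of the tabulated CPI ratios 2.82 and 2.22.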

Effective Miss Ratio. According to table 6.32, the instruction cache has a
much better miss ratio than the data cache of the same size. That is not
surprising, because instruction accesses are more regular than data accesses.
For both caches, the miss ratio improves significantly with the cache size.
The pipelined DLX design strongly relies on the split first level cache,
whereas the first level cache of the sequential DLX design and any higher
level cache can either be split or unified. We have already seen that a split
cache system is more expensive, but it may achieve a better performance.
For an easy comparison of the two cache designs, we introduce the
effective miss ratio of the split cache as:

\[
p_{miss,eff}
 = \frac{\#\text{miss on fetch} + \#\text{miss on load/store}}
        {\#\text{fetch} + \#\text{load/store}}
 = \frac{\nu_{fetch} \cdot p_{Im} + \nu_{load+store} \cdot p_{Dm}}
        {\nu_{fetch} + \nu_{load+store}}.
\]
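The effective miss ratio defined above is a weighted average of the two miss ratios and is easy to compute. The following sketch uses the workload frequencies given in the text (1.1 fetches and 0.253 + 0.085 loads and stores per instruction); the function name effective_miss_ratio is an assumption for illustration.

```python
NU_FETCH = 1.1                  # fetches per instruction
NU_LOAD_STORE = 0.253 + 0.085   # loads plus stores per instruction

def effective_miss_ratio(p_im, p_dm):
    """Effective miss ratio of a split cache system.

    p_im, p_dm -- miss ratios of the instruction and data cache,
                  given as fractions (0.082 for 8.2%).
    """
    misses = NU_FETCH * p_im + NU_LOAD_STORE * p_dm
    accesses = NU_FETCH + NU_LOAD_STORE
    return misses / accesses

# Split 2-way cache with a total size of 1 KB (8.2% and 15.2%)
print(round(100 * effective_miss_ratio(0.082, 0.152), 1))
```

Because fetches are more than three times as frequent as loads and stores, the effective miss ratio lies much closer to the Icache miss ratio than to the Dcache miss ratio.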


This effective miss ratio directly corresponds to the miss ratio of a unified
cache. According to table 6.32, a split direct mapped cache has a smaller
miss ratio than a unified direct mapped cache; that is because instructions
and data will not thrash each other. For associative caches, the advantage
of a split system is not so clear, because two cache ways already avoid
most of the thrashing. In addition, the unified cache space can be used
more freely, e.g., more than 50% of the space can be used for data. Thus,
for a 2-way cache, the split approach only wins for small caches (< 4KB).
On the other hand, the split cache can also be seen as a special
associative cache, where half the cache ways are reserved for instructions or
data, respectively. Since the unified cache space can be used more freely,
the unified 2-way (4-way) cache has a better miss ratio than the split direct
mapped (2-way) cache. Commercial computer systems use large, set
associative second and third level caches, and these caches are usually unified,
as the above results suggest.

Table 6.33: Optimal cycle time τ, number of wait states W, CPI and TPI ratio of
the pipelined DLX design with split 2-way cache.

| total cache size | α = 4: W | α = 4: τ | α = 4: CPI | α = 4: TPI | α = 8: W | α = 8: τ | α = 8: CPI | α = 8: TPI |
|-------|---|-----|------|-------|---|-----|------|-------|
| 1 KB  | 4 | 90  | 2.82 | 253.5 | 8 | 89  | 3.72 | 331.2 |
| 2 KB  | 4 | 92  | 2.46 | 226.3 | 8 | 89  | 3.19 | 283.7 |
| 4 KB  | 4 | 95  | 2.22 | 211.1 | 8 | 89  | 2.83 | 251.9 |
| 8 KB  | 5 | 89  | 2.12 | 188.4 | 8 | 89  | 2.49 | 221.4 |
| 16 KB | 4 | 97  | 1.85 | 179.2 | 8 | 97  | 2.27 | 220.1 |
| 32 KB | 4 | 100 | 1.77 | 177.3 | 7 | 103 | 2.06 | 212.3 |

Performance Impact. Table 6.33 lists the optimal number of wait states
and cycle time of the pipelined DLX design as well as the CPI and TPI
ratios for two versions of main memory. The CPI ratio improves
significantly with the cache size, due to the better miss ratio. Despite the higher
cycle time, increasing the cache size also improves the performance of the
pipelined design by 30 to 36%. In the sequential DLX design, the cache
size improved the performance by at most 12% (table 6.25). Thus, the
speedup of the pipelined design over the sequential design increases with
the cache size.
Compared to the sequential design with 4-way cache, the pipelined
design with a split 2-way cache yields a 1.5 to 2.5 times higher performance
(table 6.34). The cache is by far the most expensive part of the design; a small
1KB cache already accounts for 60% of the total cost. Since the pipelined
and sequential cache interfaces have roughly the same cost, the overhead
of pipelining decreases with the cache size. The pipelined DLX design is at
most 27% more expensive, and the cost increase is smaller than the
performance improvement. In combination with caches, pipelining is definitely
worthwhile.

Table 6.34: Speedup and cost increase of the pipelined design with split 2-way
cache relative to the sequential design with unified 4-way cache.

| total cache size | cost DLXΣ [kilo gates] | cost DLXΠ [kilo gates] | cost increase | speedup, α = 4 | speedup, α = 8 |
|-------|-----|-----|-----|------|------|
| 1 KB  | 41  | 52  | 27% | 1.68 | 1.47 |
| 2 KB  | 62  | 74  | 19% | 1.77 | 1.58 |
| 4 KB  | 108 | 118 | 9%  | 1.79 | 1.68 |
| 8 KB  | 197 | 207 | 5%  | 2.05 | 1.91 |
| 16 KB | 375 | 383 | 2%  | 2.35 | 2.06 |
| 32 KB | 729 | 734 | 1%  | 2.43 | 2.17 |
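The trade-off between the number of wait states W and the cycle time τ can be explored with a short search. This is a sketch under several assumptions: the miss penalty is taken as W + S with S = 4, the cycle time is rounded up to an integer number of gate delays, and the core cycle time of 89 and the access times 379 (α = 4) and 707 (α = 8) are the values reported for the 4 KB configuration. All function names are hypothetical.

```python
import math

NU_FETCH, NU_STORE, NU_LOAD_STORE = 1.1, 0.085, 0.34  # workload, as in (6.5)
S = 4          # sectors per line (assumption)
T_DLX = 89     # cycle time of the processor core for a 4 KB cache

def cpi(w, p_im, p_dm):
    """CPI ratio for W = w wait states and the given miss ratios."""
    return (1.26 + NU_STORE * (1 + w)
            + (NU_FETCH * p_im + NU_LOAD_STORE * p_dm) * (w + S))

def tau(w, t_maccess):
    """System cycle time: an off-chip access must fit into w cycles."""
    return max(T_DLX, math.ceil(t_maccess / w))

def tpi(w, t_maccess, p_im, p_dm):
    """Time per instruction in gate delays."""
    return cpi(w, p_im, p_dm) * tau(w, t_maccess)

def best_wait_states(t_maccess, p_im, p_dm, w_max=16):
    """Number of wait states that minimizes the time per instruction."""
    return min(range(1, w_max + 1),
               key=lambda w: tpi(w, t_maccess, p_im, p_dm))

# 4 KB split 2-way cache: miss ratios 4.4% (Icache) and 5.5% (Dcache)
for t_maccess in (379, 707):
    w = best_wait_states(t_maccess, 0.044, 0.055)
    print(t_maccess, w, tau(w, t_maccess))
```

More wait states shorten the cycle time only until the processor core becomes the bottleneck; beyond that point every further wait state just increases the CPI ratio.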

6.6 Selected References and Further Reading

Two textbooks on cache design are [Prz90, Han93]. A detailed
analysis of cache designs can also be found in Hill's Thesis [Hil87].

6.7 Exercises

Exercise 6.1  In section 6.2.2, we specified a sectored, direct mapped cache
and a non-sectored, set associative cache. Extend these specifications to a
sectored, set associative cache. As before, a cache line comprises S sectors.

Exercise 6.2  This and the following exercises deal with the design of a
write back cache and its integration into the sequential DLX design. Such
a cache applies the weak consistency model. A write hit only updates the
cache but not the external memory. A dirty flag for each line indicates
that the particular line has been updated in the cache but not in the main
memory. If such a dirty line is evicted from the cache, the whole line
must be copied back before starting the line fill. Figure 6.27 depicts the
operations of a write back cache for the memory transactions read and
write.
Modify the design of the k-way cache and of the cache interface in order
to support the write back policy and update the cost and delay formulae.
Special attention has to be paid to the following aspects:

A cache line is only considered to be dirty if the dirty flag is raised
and if the line holds valid data.

The memory environment now performs two types of burst accesses,
the line fill and the write back of a dirty cache line. The data RAMs
of the cache are updated on a line fill but not on the write back.

Exercise 6.3  Integrate the write back cache interface into the sequential
DLX design and modify the cost and delay formulae of the memory
system. The memory environment and the memory interface control have to
be changed. Note that the FSD of figure 6.27 must be extended by the bus
operations.

Exercise 6.4  A write back cache basically performs four types of accesses,
namely a cache read access (read hit), a cache update (write hit), a line fill,
and a write back of a dirty line. Let a cache line comprise S sectors. The
read hit then takes one cycle, the write hit two cycles, and the line fill and
the write back take W + S cycles each.
Show that the write back cache achieves a better CPI ratio than the write
through cache if the number of dirty misses and the number of writes
(stores) obey:

\[ \frac{W - 1}{W + S} > \frac{\#\,\text{dirty misses}}{\#\,\text{writes}} \]

Exercise 6.5  Analyze the impact of the write back policy on the cost,
performance, and quality of the sequential DLX design. Table 6.35 lists the
ratio of dirty misses to writes for a SPECint92 workload [Kro97].

[Figure 6.27: Cache operations of the memory transactions read and write.
States: cache read, cache write, fill request, fill line, fill last sector,
wback request, wback line, wback last sector.]

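Table 6.35: Ratio of dirty misses to write accesses on the SPECint92 workload.

| cache size | line size 8 byte | 16 byte | 32 byte | 64 byte | 128 byte |
|-------|-------|-------|-------|-------|-------|
| 1 KB  | 0.414 | 0.347 | 0.328 | 0.337 | 0.402 |
| 2 KB  | 0.315 | 0.243 | 0.224 | 0.223 | 0.262 |
| 4 KB  | 0.256 | 0.190 | 0.174 | 0.169 | 0.183 |
| 8 KB  | 0.197 | 0.141 | 0.107 | 0.093 | 0.098 |
| 16 KB | 0.140 | 0.097 | 0.073 | 0.061 | 0.060 |
| 32 KB | 0.107 | 0.072 | 0.053 | 0.044 | 0.042 |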
Chapter 7

IEEE Floating Point Standard and Theory of Rounding

In this chapter, we introduce the algebra needed to talk concisely about
floating point circuits and to argue about their correctness. In this
formalism, we specify parts of the IEEE floating point standard [Ins85], and
we derive basic properties of IEEE-compliant floating point algorithms.
Two issues will be of central interest: the number representation and the
rounding.

7.1 Number Formats

7.1.1 Binary Fractions

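Let \(a = a[n-1:0] \in \{0,1\}^n\) and \(f = f[1:p-1] \in \{0,1\}^{p-1}\) be strings. We
then call the string \(a[n-1:0].f[1:p-1]\) a binary fraction. An example
is 110.01. The value of the fraction is defined in the obvious way

\[
\langle a[n-1:0].f[1:p-1] \rangle
  = \sum_{i=0}^{n-1} a_i \cdot 2^{i} + \sum_{i=1}^{p-1} f_i \cdot 2^{-i}.
\]

In the above example, we have \(\langle 110.01 \rangle = 6 + 0.25 = 6.25\). We also
permit an empty fractional part and an empty integer part by defining

\[
\langle a. \rangle = \langle a.0 \rangle = \langle a \rangle
\qquad\text{and}\qquad
\langle .f \rangle = \langle 0.f \rangle.
\]

Thus, binary fractions generalize in a natural way the concept of binary
numbers, and we can use the same notation to denote their values. Some
obvious identities are

\[
\begin{aligned}
\langle 0a.f \rangle &= \langle a.f \rangle = \langle a.f0 \rangle \\
\langle a.f \rangle &= \langle af \rangle \cdot 2^{-(p-1)}.
\end{aligned}
\]

As in the decimal system, this permits the use of fixed point algorithms to
perform arithmetic on binary fractions. Suppose, for instance, we want to
add the binary fractions \(a[n-1:0].f[1:p-1]\) and \(b[m-1:0].g[1:q-1]\),
where \(m \ge n\) and \(p \ge q\). For some result \(s[m:0].t[1:p-1]\) of an ordinary
binary addition we then have

\[
\begin{aligned}
\langle a.f \rangle + \langle b.g \rangle
 &= \langle 0^{m-n} a.f \rangle + \langle b.g\,0^{p-q} \rangle \\
 &= \left( \langle 0^{m-n} a f \rangle + \langle b g\, 0^{p-q} \rangle \right) \cdot 2^{-(p-1)} \\
 &= \langle s[m:0]\, t[1:p-1] \rangle \cdot 2^{-(p-1)} \\
 &= \langle s.t \rangle.
\end{aligned}
\]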
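The reduction of fraction addition to an ordinary integer addition can be tried out with exact rational arithmetic. The sketch below represents a binary fraction by two bit strings (integer part and fractional part); this representation and the function names frac_value and add_fractions are assumptions made for illustration.

```python
from fractions import Fraction

def frac_value(a: str, f: str) -> Fraction:
    """Value <a.f> of the binary fraction with integer bits a and
    fraction bits f, computed as <af> * 2^-(number of fraction bits)."""
    digits = (a + f) or "0"
    return Fraction(int(digits, 2), 2 ** len(f))

def add_fractions(a: str, f: str, b: str, g: str) -> Fraction:
    """Add <a.f> and <b.g> using a single integer addition.

    Both operands are padded to a common format: leading zeros in the
    integer part, trailing zeros in the fractional part.
    """
    int_bits = max(len(a), len(b))
    frac_bits = max(len(f), len(g))
    x = int((a.rjust(int_bits, "0") + f.ljust(frac_bits, "0")) or "0", 2)
    y = int((b.rjust(int_bits, "0") + g.ljust(frac_bits, "0")) or "0", 2)
    # The integer sum is re-interpreted as a fraction with frac_bits
    # bits after the binary point.
    return Fraction(x + y, 2 ** frac_bits)

print(frac_value("110", "01"))               # the example 110.01
print(add_fractions("110", "01", "1", "1"))  # 110.01 + 1.1
```

Padding does not change the values of the operands, which is exactly what the identities for leading and trailing zeros state.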

7.1.2 Two's Complement Fractions

Of course, two's complement arithmetic can also be extended to fractions.
One can interpret a string \(a[n-1:0].f[1:p-1]\) as

\[
[a[n-1:0].f[1:p-1]] = -a_{n-1} \cdot 2^{n-1} + \langle a[n-2:0].f[1:p-1] \rangle.
\]

We call string a.f interpreted in this way a two's complement fraction.
Using

\[
[a[n-1:0].f[1:p-1]] = [a[n-1:0]\, f[1:p-1]] \cdot 2^{-(p-1)}
\]

one immediately translates algorithms for two's complement numbers into
algorithms for two's complement fractions.
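The scaling identity above can be used directly to evaluate a two's complement fraction. A minimal sketch, assuming the fraction is given as two bit strings with a non-empty integer part; the function name twoc_frac_value is hypothetical.

```python
from fractions import Fraction

def twoc_frac_value(a: str, f: str) -> Fraction:
    """Value [a.f] of a two's complement fraction.

    The concatenated string af is read as a two's complement integer
    and then scaled by 2^-(number of fraction bits).
    """
    bits = a + f
    width = len(bits)
    # Two's complement integer: subtract 2^width if the sign bit is set.
    as_int = int(bits, 2) - (int(bits[0]) << width)
    return Fraction(as_int, 2 ** len(f))

print(twoc_frac_value("110", "01"))   # -4 + <10.01> = -1.75
```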

7.1.3 Biased Integer Format

The IEEE floating point standard makes use of a rather particular integer
format called the biased integer format. In this format, a string

\[ e[n-1:0] \in \{0,1\}^n \setminus \{0^n, 1^n\} \]

represents the number

\[ [e[n-1:0]]_{bias} = \langle e[n-1:0] \rangle - bias_n \]

where

\[ bias_n = 2^{n-1} - 1. \]

Strings interpreted in this way will be called biased integers. Biased
integers with n bits lie in a range \([e_{min} : e_{max}]\), where

\[
\begin{aligned}
e_{min} &= 1 - (2^{n-1} - 1) = -2^{n-1} + 2 \\
e_{max} &= 2^{n} - 2 - (2^{n-1} - 1) = 2^{n-1} - 1.
\end{aligned}
\]

Instead of designing new adders and subtractors for biased integers, we
will convert biased integers to two's complement numbers, perform all
arithmetic operations in ordinary two's complement format, and convert
the final result back. Recall that for n-bit two's complement numbers, we
have

\[ [x[n-1:0]] = -x_{n-1} \cdot 2^{n-1} + \langle x[n-2:0] \rangle \]

and therefore

\[ [x[n-1:0]] \in \{-2^{n-1}, \ldots, 2^{n-1} - 1\}. \]

Thus, the two numbers excluded in the biased format are at the bottom of
the range of representable numbers. Converting a biased integer \(x[n-1:0]\)
to a two's complement number \(y[n-1:0]\) requires solving the following
equation for y:

\[
\begin{aligned}
[x]_{bias} &= [y] \\
\langle x \rangle - (2^{n-1} - 1) &= -y_{n-1} \cdot 2^{n-1} + \langle y[n-2:0] \rangle \\
\langle x \rangle + 1 &= 2^{n-1} \cdot (1 - y_{n-1}) + \langle y[n-2:0] \rangle \\
 &= \langle \overline{y_{n-1}}\; y[n-2:0] \rangle.
\end{aligned}
\]

This immediately gives the conversion algorithm, namely:

1. Interpret x as a binary number and add 1. No overflow will occur.

2. Invert the leading bit of the result.

Conversely, if we would like to convert a two's complement number
\(y[n-1:0]\) with \([y] \in \{-2^{n-1} + 2, \ldots, 2^{n-1} - 1\}\) into biased
representation, the above equation must be solved for x. This is equivalent to

\[
\langle x \rangle = [y] + 2^{n-1} - 1
 = [y] + \langle 1^{n-1} \rangle
 \equiv [y] + [0\,1^{n-1}] \pmod{2^n}.
\]

It suffices to perform the computation modulo \(2^n\) because the result lies
between 1 and \(2^n - 2\).
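The two conversion directions can be sketched on bit strings. This is an illustrative model, not the circuit itself; the function names biased_to_twoc, twoc_to_biased and twoc_value are assumptions. The exhaustive loop checks all 8-bit biased integers, i.e., the exponent format of single precision.

```python
def twoc_value(y: str) -> int:
    """Value [y] of an n-bit two's complement number."""
    return int(y, 2) - (int(y[0]) << len(y))

def biased_to_twoc(x: str) -> str:
    """Convert a biased integer x[n-1:0] to two's complement."""
    n = len(x)
    assert x not in ("0" * n, "1" * n), "0^n and 1^n are not biased integers"
    s = format(int(x, 2) + 1, f"0{n}b")             # step 1: add 1 (no overflow)
    return ("1" if s[0] == "0" else "0") + s[1:]    # step 2: invert leading bit

def twoc_to_biased(y: str) -> str:
    """Convert a two's complement number y[n-1:0] to biased format."""
    n = len(y)
    x = (twoc_value(y) + 2 ** (n - 1) - 1) % 2 ** n  # computed modulo 2^n
    return format(x, f"0{n}b")

# Check against the definition [x]_bias = <x> - bias_n for n = 8.
n, bias = 8, 2 ** 7 - 1
for v in range(1, 2 ** n - 1):
    x = format(v, f"0{n}b")
    y = biased_to_twoc(x)
    assert twoc_value(y) == v - bias
    assert twoc_to_biased(y) == x
```

Adding 1 and inverting the leading bit needs only an incrementer and an inverter, which is why no dedicated biased arithmetic is required.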


#*
  /
IEEE F LOATING   Components of an IEEE floating point number
P OINT S TANDARD
normal denormal
AND T HEORY OF
exponent ebias emin
ROUNDING
significand 1  f  0 f 
hidden bit 1 0

/& .111 ,  " !- 

An IEEE floating point number is a triple s en  1 : 0 f 1 : p  1, where


s  0 1 is called the sign bit, e  en  1 : 0 represents the exponent,
and f  f 1 : p  1 almost represents the significand of the number (if it
would represent the significand, we would call it f ). The most common
parameters for n and p are

8 24 for single precision


n p 
11 53 for double precision

Obviously, single precision numbers fit into one machine word and double
precision numbers into two words.
IEEE floating point numbers can represent certain rational numbers as
well as the symbols ∞, ∞ and NaN. The symbol NaN represents ‘not a
number’, e.g., the result of computing 00. Let s e f  be a floating point
number, then the value represented by s e f  is defined by

1s  2ebias  1 f  if e 
 0 1 
n n
 1s  2emin  0 f  if e  0n
s e f  
1s  ∞ if e  1n and f  0p 1



NaN if e  1n and f 0p


 1

The IEEE floating point number s e f  is called

normal if e 
 0 1  and
n n

denormal (denormalized) if e  0n .

For normal or denormal IEEE floating point numbers, exponent, signifi-


cand and hidden bit are defined by table 7.1. Observe that the exponent
emin has two representations, namely e  0n 1 1 for normal numbers and


e  0n for denormal numbers. Observe also, that string f alone does not
determine the significand, because the exponent is required to determine
the hidden bit. If we call the hidden bit f 0, then the significand obviously
# 
  /
2emin - (p-1) 2emin - (p-1)
N UMBER F ORMATS

0 Xmin 2emin 2emin +1

2z - (p-1)

2z 2z+1

2emax - (p-1)
Xmax

2emax 2emax +1

   Geometry of the non-negative representable numbers

equals  f 0 f 1 : p  1. The binary fraction f  f 0 f 1 : p  1 then is


a proper representation of the significand. It is called normal, if f 0  1
and denormal if f 0  0. We have

f 0   f   f 0   f 1 : p  1  2  p 1


 

 f 0  2 p 1

 1  2  p 1
 

 f 0  1  2  p 1
 

Thus, we have
1   f   2  2  p 1  

for normal significands and

0   f   1  2  p 1  

for denormal significands.
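The value definition of an IEEE floating point number translates directly into a small decoder. The sketch below takes the sign bit as an integer and the exponent and fraction strings as bit strings, and derives n and p from their lengths; the function name ieee_value and this interface are assumptions made for illustration.

```python
from fractions import Fraction
import math

def ieee_value(s: int, e: str, f: str):
    """Value of the IEEE floating point number (s, e, f).

    e has n bits, f has p-1 bits. Returns an exact Fraction for
    normal and denormal numbers, a float infinity or NaN otherwise.
    """
    n, p = len(e), len(f) + 1
    bias = 2 ** (n - 1) - 1
    emin = 1 - bias
    sign = -1 if s else 1
    frac = Fraction(int(f, 2), 2 ** (p - 1))   # <0.f>
    if e == "1" * n:
        return sign * math.inf if int(f, 2) == 0 else math.nan
    if e == "0" * n:
        # denormal: hidden bit 0, exponent emin
        return sign * Fraction(2) ** emin * frac
    # normal: hidden bit 1, exponent [e]_bias
    return sign * Fraction(2) ** (int(e, 2) - bias) * (1 + frac)

# Single precision (n = 8, p = 24)
print(ieee_value(0, "01111111", "0" * 23))          # 1
print(ieee_value(1, "10000000", "1" + "0" * 22))    # -3
```

The two exponent strings 0^(n-1)1 and 0^n both decode to the exponent emin; only the hidden bit differs.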

7.1.5 Geometry of Representable Numbers

A rational number x is called representable if \(x = [\![ s, e, f' ]\!]\) for some IEEE
floating point number. The number x is called normal if \((s, e, f')\) is normal.
It is called denormal if \((s, e, f')\) is denormal.
Normal numbers have a significand in the range \([1, 2 - 2^{-(p-1)}] \subset [1, 2)\).
Denormal numbers have a significand in the range \([0, 1 - 2^{-(p-1)}] \subset [0, 1)\).
Figure 7.1 depicts the non-negative representable numbers; the picture for
the negative representable numbers is symmetric. The following properties
characterize the representable numbers:

1. For every exponent value \(z \in \{e_{min}, \ldots, e_{max}\}\), there are two
   intervals containing normal representable numbers, namely \([2^z, 2^{z+1})\) and
   \((-2^{z+1}, -2^z]\). Each interval contains exactly \(2^{p-1}\) numbers. The
   gap between consecutive representable numbers in these intervals is
   \(2^{z-(p-1)}\).

2. As the exponent value increases, the length of the interval doubles.

3. Denormal floating point numbers lie in the two intervals \([0, 2^{e_{min}})\)
   and \((-2^{e_{min}}, 0]\). The gap between two consecutive denormal
   numbers equals \(2^{e_{min}-(p-1)}\). This is the same gap as in the intervals
   \([2^{e_{min}}, 2^{e_{min}+1})\) and \((-2^{e_{min}+1}, -2^{e_{min}}]\). The property that the gap
   between the numbers \(-2^{e_{min}}\) and \(2^{e_{min}}\) is filled with the denormal
   numbers is called gradual underflow.

Note that the smallest and largest positive representable numbers are

\[
\begin{aligned}
X_{min} &= 2^{e_{min}} \cdot 2^{-(p-1)} \\
X_{max} &= 2^{e_{max}} \cdot (2 - 2^{-(p-1)}).
\end{aligned}
\]

The number x = 0 has two representations, one for each of the two
possible sign bits. All other representable numbers have exactly one
representation. A representable number \(x = [\![ s, e, f' ]\!]\) is called even if \(f'[p-1] = 0\),
and it is called odd if \(f'[p-1] = 1\). Note that even and odd numbers
alternate through the whole range of representable numbers. This is trivial to
see for numbers with the same exponent. Consecutive numbers with
different exponent have significands \(1.0^{p-1}\), which is even, and \(1.1^{p-1}\), which
is odd.
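The geometric properties are easy to check by enumerating a toy format. The sketch below lists all non-negative representable numbers for a hypothetical format with n = 3 exponent bits and precision p = 3; these parameters and the function name nonneg_representable are chosen only to keep the enumeration small.

```python
from fractions import Fraction

def nonneg_representable(n: int, p: int):
    """All non-negative representable numbers in ascending order,
    together with emin and emax."""
    bias = 2 ** (n - 1) - 1
    emin, emax = 1 - bias, bias
    steps = 2 ** (p - 1)
    # Denormal numbers: significand 0.f, exponent emin.
    nums = [Fraction(k, steps) * Fraction(2) ** emin for k in range(steps)]
    # Normal numbers: significand 1.f, exponents emin .. emax.
    for z in range(emin, emax + 1):
        nums += [(1 + Fraction(k, steps)) * Fraction(2) ** z
                 for k in range(steps)]
    return nums, emin, emax

n, p = 3, 3
nums, emin, emax = nonneg_representable(n, p)
x_min, x_max = nums[1], nums[-1]
print(x_min, x_max)

# Gap between consecutive numbers in the interval [2^z, 2^(z+1)).
z = 0
in_interval = [x for x in nums if 2 ** z <= x < 2 ** (z + 1)]
gaps = {b - a for a, b in zip(in_interval, in_interval[1:])}
print(len(in_interval), gaps)
```

In this toy format the denormal numbers 0, 1/16, 2/16, 3/16 continue with the same spacing into the first normal interval, which is the gradual underflow property.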

7.1.6 Convention on Notation

One should always work on as high an abstraction level as possible, but
not on a higher level. In what follows, we will be able to argue for very
long periods about numbers instead of their representations.
So far, we have used the letters e and f for the representations
\(e = e[n-1:0]\) and \(f = f[0].f[1:p-1]\) of the exponent and of the significand. Since
there is a constant shortage of letters in mathematical texts, we will use
single letters like e and f also for the values of exponents and significands,
respectively. Obviously, we could use \([e]_{bias}\) and \(\langle f[0].f' \rangle\) instead, but that
would mess up the formulae in later calculations. Using the same notation
for two things without proper warning can be the source of very serious
confusion. On the other hand, confusion can be avoided, as long as

we are aware that the letters e and f are used with two meanings
depending on context, and

the context indicates whether we are talking about values or
representations.

But what do we do if we want to talk about values and representations
in the same context? In such a case, single letters are used exclusively for
values. Thus, we would, for instance, write

\[ e = [e[n-1:0]]_{bias} \quad\text{and}\quad f = \langle f[0].f[1:p-1] \rangle \]

but we would not write

\[ e = [e]_{bias} \quad\text{nor}\quad f = \langle f \rangle. \]

7.2 Rounding

7.2.1 Rounding Modes

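We denote by \(\mathcal{R}\) the set of representable numbers and by

\[ \mathcal{R}_{\infty} = \mathcal{R} \cup \{+\infty, -\infty\}. \]

Since \(\mathcal{R}\) is not closed under the arithmetic operations, one rounds the result
of an arithmetic operation to a representable number or to plus infinity or
minus infinity. Thus, a rounding is a function

\[ r : \mathbb{R} \to \mathcal{R}_{\infty} \]

mapping real numbers x to rounded values r(x). The IEEE standard defines
four rounding modes, which are

\(r_u\) round up,

\(r_d\) round down,

\(r_z\) round to zero, and

\(r_{ne}\) round to nearest even.

The first three modes have the obvious meaning

\[
\begin{aligned}
r_u(x) &= \min\{y \in \mathcal{R}_{\infty} \mid x \le y\} \\
r_d(x) &= \max\{y \in \mathcal{R}_{\infty} \mid x \ge y\} \\
r_z(x) &=
\begin{cases}
r_d(x) & \text{if } x \ge 0 \\
r_u(x) & \text{if } x < 0.
\end{cases}
\end{aligned}
\]

The fourth rounding mode is more complicated to define. For any x with
\(-X_{max} \le x \le X_{max}\), one defines \(r_{ne}(x)\) as a representable number y closest
to x. If there are two such numbers y, one chooses the number with even
significand. Let

\[ X^{*}_{max} = 2^{e_{max}} \cdot (2 - 2^{-p}) \]

(see figure 7.2). The number \(X_{max}\) is odd, and thus, \(X^{*}_{max}\) is the smallest
number that would be rounded by the above rules to \(2^{e_{max}+1}\) if that were a
representable number. For \(x \notin [-X_{max}, X_{max}]\), one defines

\[
r_{ne}(x) =
\begin{cases}
+\infty & \text{if } X^{*}_{max} \le x \\
X_{max} & \text{if } X_{max} < x < X^{*}_{max} \\
-X_{max} & \text{if } -X^{*}_{max} < x < -X_{max} \\
-\infty & \text{if } x \le -X^{*}_{max}.
\end{cases}
\]

[Figure 7.2: Geometry of \(X^{*}_{max}\). The figure marks \(X_{max}\), \(X^{*}_{max}\) and
\(2^{e_{max}+1}\); the distance between \(X_{max}\) and \(X^{*}_{max}\) is \(2^{e_{max}-p}\).]

The above definition can be simplified to

For \(-X^{*}_{max} < x < X^{*}_{max}\), one defines \(r_{ne}(x)\) as a representable number
y closest to x. If there are two such numbers y, one chooses the
number with even significand.

For the remaining x, one defines

\[
r_{ne}(x) =
\begin{cases}
+\infty & \text{if } X^{*}_{max} \le x \\
-\infty & \text{if } x \le -X^{*}_{max}.
\end{cases}
\]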
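The four rounding modes can be prototyped with exact rational arithmetic. This is a behavioral sketch for experimentation, not a description of the hardware; the function name round_ieee, the mode strings and the toy parameters n = 3, p = 3 used in the example are assumptions. Infinite results are returned as Python floats.

```python
from fractions import Fraction
import math

def round_ieee(x: Fraction, mode: str, n: int, p: int):
    """Round x in mode 'ru', 'rd', 'rz' or 'rne' for exponent width n
    and precision p."""
    bias = 2 ** (n - 1) - 1
    emin, emax = 1 - bias, bias
    xmax = Fraction(2) ** emax * (2 - Fraction(1, 2 ** (p - 1)))
    xmax_star = Fraction(2) ** emax * (2 - Fraction(1, 2 ** p))
    if mode == "rz":                      # round to zero
        mode = "rd" if x >= 0 else "ru"
    if mode == "rne":
        if abs(x) >= xmax_star:
            return math.inf if x > 0 else -math.inf
    elif abs(x) > xmax:                   # directed modes beyond Xmax
        if x > 0:
            return math.inf if mode == "ru" else xmax
        return -math.inf if mode == "rd" else -xmax
    # Spacing of the representable numbers around x.
    e = emin
    while abs(x) >= Fraction(2) ** (e + 1) and e < emax:
        e += 1
    gap = Fraction(2) ** (e - (p - 1))
    q = x / gap
    lo = math.floor(q)
    if mode == "rd" or q == lo:
        k = lo
    elif mode == "ru":
        k = lo + 1
    else:                                 # rne: nearest, ties to even
        rem = q - lo
        up = rem > Fraction(1, 2) or (rem == Fraction(1, 2) and lo % 2 == 1)
        k = lo + 1 if up else lo
    return k * gap

# Toy format with n = 3, p = 3: Xmax = 14, X*max = 15
print(round_ieee(Fraction(9, 8), "rne", 3, 3))    # tie, goes to even: 1
print(round_ieee(Fraction(29, 2), "rne", 3, 3))   # 14.5 rounds to Xmax
print(round_ieee(Fraction(15), "rne", 3, 3))      # X*max rounds to infinity
```

The multiple k of the gap is even exactly when the last significand bit is 0, so testing the parity of lo implements the "even significand" rule.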

7.2.2 Floating Point Operations

Let

\[ r : \mathbb{R} \to \mathcal{R}_{\infty} \]

be one of the four rounding functions defined above, and let

\[ \circ : \mathbb{R}^2 \to \mathbb{R} \]

be an arithmetic operation. Then, the corresponding operation

\[ \circ_I : \mathcal{R}^2 \to \mathcal{R}_{\infty} \]

in IEEE arithmetic is (almost) defined by

\[ x \circ_I y = r(x \circ y). \]

The result has to be represented in IEEE format. The definition will be
completed in the section on exceptions.
If we follow the above rule literally, we first compute an exact result, and
then we round. The computation of exact results might require very long
intermediate results (imagine the computation of \(X_{max} - X_{min}\)). In the case
of divisions the final result will, in general, not even have a significand of
finite length, e.g., think of 1/3. Therefore, one often replaces the two exact
operands x and y by appropriate inexact (and in general shorter) operands
\(x'\) and \(y'\) such that the following basic identity holds

\[ r(x \circ y) = r(x' \circ y'). \qquad (7.1) \]

This means that no harm is done by working with inexact operands,
because after rounding the result is the same as if the exact operands had
been used. Identities like (7.1) need, of course, proof. Large parts of
this section are therefore devoted to the development of an algebra which
permits us to formulate such proofs in a natural and concise way.
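The rule "compute exactly, then round once" can be observed on an existing IEEE implementation. The sketch below assumes that Python floats are IEEE double precision numbers operated in round to nearest even mode, which holds on common platforms but is an assumption about the environment; the function name ieee_op_via_exact is hypothetical, and exceptions such as overflow are ignored.

```python
from fractions import Fraction
import operator

def ieee_op_via_exact(x: float, y: float, op) -> float:
    """Compute x op_I y literally: exact result first, then one rounding."""
    exact = op(Fraction(x), Fraction(y))   # Fraction(float) is exact
    return float(exact)                    # a single, correctly rounded conversion

# The hardware result agrees with rounding the exact result.
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    assert ieee_op_via_exact(0.1, 0.3, op) == op(0.1, 0.3)

print(Fraction(0.1) + Fraction(0.2))   # the long exact intermediate result
print(0.1 + 0.2)                       # its rounded value
```

The exact sum already needs a denominator of more than 50 bits here, which hints at why hardware avoids exact intermediate results.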

7.2.3 Factorings and Normalization Shifts

Factorings are an abstract version of IEEE floating point numbers. In
factorings, the representations of exponents and significands are simply
replaced by values. This turns out to be the right level of abstraction for the
arguments that follow. Formally, a factoring is a triple (s, e, f) where

1. \(s \in \{0, 1\}\) is called the sign bit,

2. e is an integer, it is called the exponent, and

3. f is a non-negative real number, it is called the significand.

We say that f is normal if \(f \in [1, 2)\) and that f is denormal if \(f \in [0, 1)\).
We say that a factoring is normal if f is normal and that a factoring is
denormal if \(e = e_{min}\) and f is denormal. Note that \(f \notin [0, 2)\) is possible.
In this case, the factoring is neither normal nor denormal. The value of a
factoring is defined as

\[ [\![ s, e, f ]\!] = (-1)^s \cdot 2^e \cdot f. \]

For real numbers x, we say that (s, e, f) is a factoring of x if

\[ x = [\![ s, e, f ]\!], \]

i.e., if the value of the factoring is x. For \(x = \infty\) and \(x = -\infty\) we provide
the special factorings \((s, \infty, 0)\) with

\[ [\![ s, \infty, 0 ]\!] = (-1)^s \cdot \infty. \]

We consider the special factorings both normal and IEEE-normal.
Obviously, there are infinitely many factorings for any number x, but
only one of them is normal. The function \(\hat{\eta}\) which maps every non-zero
\(x \in \mathbb{R} \cup \{\infty, -\infty\}\) to the unique normal factoring \((s, \hat{e}, \hat{f})\) of x is called
normalization shift. Note that arbitrary real numbers can only be factored if
the exponent range is neither bounded from above nor from below.
A factoring (s, e, f) of x is called IEEE-normal if

\[
(s, e, f) \text{ is }
\begin{cases}
\text{normal} & \text{if } |x| \ge 2^{e_{min}} \\
\text{denormal} & \text{if } |x| < 2^{e_{min}}.
\end{cases}
\]

The function \(\eta\) which maps every value \(x \in \mathbb{R} \cup \{\infty, -\infty\}\) to the unique
IEEE-normal factoring of x is called IEEE normalization shift. The
IEEE-normal factoring of zero is unique except for the sign. Note that
arbitrary real numbers can only be IEEE factored if the exponent range is not
bounded from above. Finally observe that

\[ \hat{\eta}(x) = \eta(x) \quad\text{if } |x| \ge 2^{e_{min}}. \]
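Both normalization shifts can be sketched for rational inputs. The function names normal_factoring and ieee_normal_factoring are assumptions; infinities are left out, and zero is given the sign bit 0 although its sign is not determined by its value.

```python
from fractions import Fraction

def normal_factoring(x: Fraction):
    """Normalization shift: the unique normal factoring (s, e, f) of a
    non-zero x, i.e. x = (-1)^s * 2^e * f with 1 <= f < 2."""
    assert x != 0, "zero has no normal factoring"
    s, f, e = (1 if x < 0 else 0), abs(x), 0
    while f >= 2:            # shift right until f < 2
        f, e = f / 2, e + 1
    while f < 1:             # shift left until f >= 1
        f, e = f * 2, e - 1
    return s, e, f

def ieee_normal_factoring(x: Fraction, emin: int):
    """IEEE normalization shift: normal if |x| >= 2^emin, otherwise
    denormal with exponent emin."""
    if abs(x) >= Fraction(2) ** emin:
        return normal_factoring(x)
    return (1 if x < 0 else 0), emin, abs(x) / Fraction(2) ** emin

print(normal_factoring(Fraction(-5)))                # (1, 2, 5/4)
print(ieee_normal_factoring(Fraction(3, 64), -2))    # denormal: (0, -2, 3/16)
```

The exponent of the normal factoring is unbounded in both directions, whereas the IEEE-normal factoring never uses an exponent below emin.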

7.2.4 Algebra of Rounding and Sticky Bits

We define a family of equivalence relations on the real numbers which will help us identify real numbers that are rounded to the same value.

Let $\alpha$ be an integer. Let $q$ range over all integers, then the open intervals $(q \cdot 2^{-\alpha}, (q+1) \cdot 2^{-\alpha})$ and the singletons $\{q \cdot 2^{-\alpha}\}$ form a partition of the real numbers (see figure 7.3). Note that 0 is always an endpoint of two intervals.

Figure 7.3: Partitioning of the real numbers

Two real numbers $x$ and $y$ are called $\alpha$-equivalent if according to this partition they are in the same equivalence class, i.e., if they either lie in the same open interval or both coincide with the same endpoint of an interval. We use for this the notation $x =_\alpha y$. Thus, for some integer $q$ we have

$$x =_\alpha y \iff x, y \in (q \cdot 2^{-\alpha}, (q+1) \cdot 2^{-\alpha}) \ \text{ or } \ x = y = q \cdot 2^{-\alpha}.$$

From each equivalence class, we pick a representative. For singleton sets there is no choice, and from each open interval we pick the midpoint. This defines for each real number $x$ the $\alpha$-representative of $x$:

$$[x]_\alpha = \begin{cases} (q + 0.5) \cdot 2^{-\alpha} & \text{if } x \in (q \cdot 2^{-\alpha}, (q+1) \cdot 2^{-\alpha}) \\ x & \text{if } x = q \cdot 2^{-\alpha} \end{cases}$$

for some integer $q$.


Observe that an $\alpha$-representative is always the value of a binary fraction with $\alpha + 1$ bits after the binary point. We list a few simple rules for computations with $\alpha$-equivalences and $\alpha$-representatives.

Let $x =_\alpha x'$. By mirroring intervals at the origin, we see

$$-x =_\alpha -x' \quad \text{and} \quad [-x]_\alpha = -[x]_\alpha.$$

Stretching intervals by a factor of two gives

$$2x =_{\alpha-1} 2x' \quad \text{and} \quad [2x]_{\alpha-1} = 2[x]_\alpha,$$

and shrinking intervals by a factor of two gives

$$x/2 =_{\alpha+1} x'/2 \quad \text{and} \quad [x/2]_{\alpha+1} = [x]_\alpha / 2.$$

Induction gives for arbitrary integers $e$

$$2^e \cdot x =_{\alpha-e} 2^e \cdot x' \quad \text{and} \quad [2^e \cdot x]_{\alpha-e} = 2^e \cdot [x]_\alpha.$$

Let $y$ be a multiple of $2^{-\alpha}$. Translation of intervals by $y$ yields

$$x + y =_\alpha x' + y.$$

Let $\beta < \alpha$, then the equivalence classes of $=_\alpha$ are a refinement of the equivalence classes of $=_\beta$, and one can conclude

$$x =_\beta x'.$$
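The rules above are easy to check numerically. The following sketch computes $\alpha$-representatives with exact rationals; the helper names `representative` and `equivalent` are hypothetical and not taken from the text.

```python
from fractions import Fraction
import math

def representative(x, alpha):
    """alpha-representative of x: x itself on the grid q * 2^-alpha,
    otherwise the midpoint of the open interval containing x."""
    x = Fraction(x)
    unit = Fraction(2) ** (-alpha)
    q = math.floor(x / unit)
    if q * unit == x:
        return x
    return (q + Fraction(1, 2)) * unit

def equivalent(x, y, alpha):
    """x and y are alpha-equivalent iff they share the same representative."""
    return representative(x, alpha) == representative(y, alpha)
```

With $\alpha = 2$ the grid has spacing 1/4, so 5/16 and 7/16 lie in the same open interval and share the representative 3/8, while the grid point 1/4 is only equivalent to itself.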

The salient properties of the above definition are that under certain circumstances rounding $x$ and its representative leads to the same result, and that representatives are very easy to compute. This is made precise in the following lemmas.

Lemma 7.1. Let $\eta(x) = (s, e, f)$, and let $r$ be an IEEE rounding mode, then

1. $r(x) = r([x]_{p-e})$

2. $\eta([x]_{p-e}) = (s, e, [f]_p)$

3. if $x' =_{p-e} x$, then $r(x) = r(x')$.

Proof. For the absolute value of $x$, we have

$$|x| \in \begin{cases} [2^e, 2^{e+1}) & \text{if } f \text{ is normal} \\ [0, 2^{e_{min}}) & \text{if } f \text{ is denormal.} \end{cases}$$

In this interval, representable numbers have a distance of

$$d = 2^{e-(p-1)}.$$

Thus, $|x|$ is sandwiched between two numbers

$$y = q \cdot 2^{e-(p-1)}, \qquad z = (q+1) \cdot 2^{e-(p-1)}$$

as depicted in figure 7.4. Obviously, $|x| \in [y, z]$ can only be rounded to $y$, to $y + 2^{e-p}$, or to $z$. For any rounding mode it suffices to know $[x]_{p-e}$ in order to make this decision. This proves part one.

Figure 7.4: Geometry of the values $y$, $y + 2^{e-p}$, and $z$

Since

$$[x]_{p-e} = [(-1)^s \cdot 2^e \cdot f]_{p-e} = (-1)^s \cdot [2^e \cdot f]_{p-e} = (-1)^s \cdot 2^e \cdot [f]_p,$$

we know that $(s, e, [f]_p)$ is a factoring of $[x]_{p-e}$. This factoring is IEEE-normal because

$(s, e, f)$ is IEEE-normal,

$|x| \ge 2^{e_{min}} \iff |[x]_{p-e}| \ge 2^{e_{min}}$, and

$f$ is normal iff $[f]_p$ is normal.

This proves part 2. Part 3 follows immediately from part 1, because

$$r(x) = r([x]_{p-e}) = r([x']_{p-e}) = r(x').$$
The next lemma states how to get $p$-representatives of the value of a binary fraction by a so-called sticky bit computation. Such a computation simply replaces all bits $f[p+1 : v]$ by the OR of these bits.

Lemma 7.2. Let $f = f[-u : 0].f[1 : v]$ be a binary fraction. Let

$$g = f[-u : 0].f[1 : p]$$

and let

$$s = \bigvee_{i=p+1}^{v} f[i]$$

be the sticky bit of $f$ for position $p$ (see figure 7.5), then

$$[\langle f \rangle]_p = \langle g\,s \rangle.$$

Proof. If $s = 0$ then $\langle f \rangle = \langle g\,s \rangle$, and there is nothing to show. In the other case, we have

$$\langle g \rangle < \langle f \rangle = \langle g \rangle + \sum_{i=p+1}^{v} f[i] \cdot 2^{-i} < \langle g \rangle + 2^{-p}.$$

Thus,

$$[\langle f \rangle]_p = \langle g \rangle + 2^{-(p+1)} = \langle g\,1 \rangle = \langle g\,s \rangle.$$

Figure 7.5: Sticky bit $s$ of $f$ for position $p$ (the bits $f[p+1 : v]$ feed an OR-tree that produces $s$)
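As an illustration of the sticky bit lemma, the following sketch works on bit lists (most significant bit first) and checks that appending the sticky bit to the first $p$ fraction bits gives the $p$-representative. All names are hypothetical, and the bit-list encoding is an assumption made for this example only.

```python
from fractions import Fraction
import math

def fraction_value(int_bits, frac_bits):
    """Value of the unsigned binary fraction int_bits . frac_bits."""
    v = Fraction(0)
    for b in int_bits:
        v = 2 * v + b
    for i, b in enumerate(frac_bits, start=1):
        v += Fraction(b, 2 ** i)
    return v

def with_sticky(frac_bits, p):
    """Keep the first p fraction bits and append the OR of the remaining bits."""
    sticky = int(any(frac_bits[p:]))
    return frac_bits[:p] + [sticky]

def representative(x, alpha):
    """alpha-representative of x, used here only to check the lemma."""
    unit = Fraction(2) ** (-alpha)
    q = math.floor(Fraction(x) / unit)
    return Fraction(x) if q * unit == x else (q + Fraction(1, 2)) * unit
```

For the fraction 1.0110011 and $p = 3$, the bits after position 3 are not all zero, so the sticky bit is 1 and the shortened fraction 1.0111 has the value 23/16, which is the 3-representative of the original value.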

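7.2.5 Rounding with Unlimited Exponent Range

We define the set $\hat{\mathcal{R}}$ of real numbers that would be representable if the exponent range were unlimited. This is simply the set of numbers

$$(-1)^s \cdot 2^e \cdot \langle 1.f[1 : p-1] \rangle$$

where $e$ is an arbitrary integer. Moreover, we include $0 \in \hat{\mathcal{R}}$.

For every rounding mode $r$, we can define a corresponding rounding mode

$$\hat r : \mathbb{R} \to \hat{\mathcal{R}}.$$

For the rounding modes $\hat r_u, \hat r_d, \hat r_z$ one simply replaces $\mathcal{R}$ by $\hat{\mathcal{R}}$ in the definition of the rounding mode. One defines $\hat r_{ne}(x)$ as a number in $\hat{\mathcal{R}}$ closest to $x$. In case of a tie, the one with an even significand is chosen. Observe that

$$r(x) = \hat r(x) \quad \text{if } 2^{e_{min}} \le |x| \le X_{max}.$$

Let $\hat\eta(x) = (s, \hat e, \hat f)$. Along the lines of the proof of lemma 7.1, one shows the following lemma:

Lemma 7.3. Let $x \ne 0$, let $\hat\eta(x) = (s, \hat e, \hat f)$, and let $r$ be an IEEE rounding mode, then

1. $\hat r(x) = \hat r([x]_{p-\hat e})$

2. $\hat\eta([x]_{p-\hat e}) = (s, \hat e, [\hat f]_p)$

3. if $x' =_{p-\hat e} x$, then $\hat r(x) = \hat r(x')$.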

7.2.6 Decomposition Theorem for Rounding

Let $r$ be any rounding mode. We would like to break the problem of computing $r(x)$ into the following four steps:

1. IEEE normalization shift. This step computes the IEEE-normal factoring of $x$:
$$\eta(x) = (s, e, f).$$

2. Significand round. This step computes the rounded significand
$$f_1 = sigrd(s, f).$$
The function $sigrd$ will be defined below separately for each rounding mode. It will produce results $f_1$ in the range $[0, 2]$.

3. Post normalization. This step normalizes the result in the case $f_1 = 2$:
$$(e_2, f_2) = post(e, f_1) = \begin{cases} (e + 1, f_1/2) & \text{if } f_1 = 2 \\ (e, f_1) & \text{otherwise.} \end{cases}$$

4. Exponent round. This step takes care of cases where the intermediate result $(-1)^s \cdot 2^{e_2} \cdot f_2$ lies outside of $\mathcal{R}$. It computes
$$(e_3, f_3) = exprd(s, e_2, f_2).$$
The function $exprd$ will be defined below separately for each rounding mode.

We will have to define four functions $sigrd$ and four functions $exprd$ such that we can prove

Theorem 7.4. For all rounding modes $r$, it holds that

$$(s, e_3, f_3) = \eta(r(x)).$$

This means that the result $(s, e_3, f_3)$ of the rounding algorithm is an IEEE-normal factoring of the correctly rounded result.
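Let $\eta(x) = (s, e, f)$, then $f \in [0, 2)$. Significand rounding rounds $f$ to an element in the set

$$F = \{ \langle g[0].g[1 : p-1] \rangle \mid g[i] \in \{0, 1\} \text{ for all } i \} \cup \{2\}.$$

For any $f$, the binary fractions

$$y = \langle y[-1 : 0].y[1 : p-1] \rangle = \max\{ y \in F \mid y \le f \}$$
$$y' = \langle y'[-1 : 0].y'[1 : p-1] \rangle = \min\{ y' \in F \mid f \le y' \}$$

satisfy $y \le f \le y'$. The definitions for the four rounding modes are

$$sigrd_u(s, f) = \begin{cases} y' & \text{if } s = 0 \\ y & \text{if } s = 1 \end{cases}$$

$$sigrd_d(s, f) = \begin{cases} y & \text{if } s = 0 \\ y' & \text{if } s = 1 \end{cases}$$

$$sigrd_z(s, f) = y.$$

In case of round to nearest even, $sigrd_{ne}(s, f)$ is a binary fraction $g$ closest to $f$ with $\langle g[-1 : 0].g[1 : p-1] \rangle \in F$. In case of a tie one chooses the one with $g[p-1] = 0$. Let $f' = \langle f[0].f[1 : p-1]\,1 \rangle$, then

$$sigrd_{ne}(s, f) = \begin{cases} y & \text{if } f < f' \ \text{ or } \ (f = f' \wedge f[p-1] = 0) \\ y' & \text{if } (f = f' \wedge f[p-1] = 1) \ \text{ or } \ f > f'. \end{cases} \qquad (7.2)$$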

We define

$$x_1 = [\![s, e, f_1]\!] = (-1)^s \cdot 2^e \cdot f_1.$$

The following lemma summarizes the properties of the significand rounding:

Lemma 7.5.

$$x_1 = \begin{cases} r(x) & \text{if } |x| \le X_{max} \\ \hat r(x) & \text{if } |x| > X_{max}. \end{cases}$$

Proof. For $f \in [1, 2)$, $x$ lies in the interval $[2^e, 2^{e+1})$ if $s = 0$, and it lies in $(-2^{e+1}, -2^e]$ if $s = 1$. Mirroring this interval at the origin in case of $s = 1$ and scaling it by $2^{-e}$ translates exactly from rounding with $\hat r$ to significand rounding in the interval $[1, 2)$. Mirroring if $s = 1$ and scaling by $2^e$ translates in the other direction.

If $f \in [0, 1)$ then $e = e_{min}$. Mirroring if $s = 1$ and scaling by $2^{-e_{min}}$ translates from rounding with $r$ into significand rounding in the interval $[0, 1)$. Mirroring if $s = 1$ and scaling by $2^{e_{min}}$ translates in the other direction.

Finally observe that $r(x) = \hat r(x)$ if $|x| \le X_{max}$ and $f$ is normal.

The following lemma summarizes the properties of the post normalization:

Lemma 7.6. $(s, e_2, f_2) = \eta(x_1)$.

Proof. Post normalization obviously preserves value:

$$x_1 = x_2 := [\![s, e_2, f_2]\!] = (-1)^s \cdot 2^{e_2} \cdot f_2.$$

Thus, we only have to show that $(s, e_2, f_2)$ is IEEE-normal. We started out with $\eta(x) = (s, e, f)$ which is IEEE-normal. Thus,

1. $f$ is normal if $|x| \ge 2^{e_{min}}$, and

2. $f$ is denormal and $e = e_{min}$ if $|x| < 2^{e_{min}}$.

If $|x| \ge 2^{e_{min}}$, then $|x_1| \ge 2^{e_{min}}$ and $f_1 \in [1, 2]$. If $f_1 \in [1, 2)$, then $f_2 = f_1$ is normal, and if $f_1 = 2$, then $f_2 = 1$ is normal as well.

If $|x| < 2^{e_{min}}$, then $|x_1| < 2^{e_{min}}$ or $|x_1| = 2^{e_{min}}$, and $e_2 = e = e_{min}$. In the first case, $f_2 = f_1 \in [0, 1)$ is denormal. In the second case, $f_2 = f_1 = 1$ is normal.
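We proceed to specify the four functions $exprd$.

$$exprd_u(s, e_2, f_2) = \begin{cases} (\infty, 0) & \text{if } e_2 > e_{max} \text{ and } s = 0 \\ (e_{max}, 2 - 2^{-(p-1)}) & \text{if } e_2 > e_{max} \text{ and } s = 1 \\ (e_2, f_2) & \text{if } e_2 \le e_{max} \end{cases}$$

$$exprd_d(s, e_2, f_2) = \begin{cases} (\infty, 0) & \text{if } e_2 > e_{max} \text{ and } s = 1 \\ (e_{max}, 2 - 2^{-(p-1)}) & \text{if } e_2 > e_{max} \text{ and } s = 0 \\ (e_2, f_2) & \text{if } e_2 \le e_{max} \end{cases}$$

$$exprd_z(s, e_2, f_2) = \begin{cases} (e_{max}, 2 - 2^{-(p-1)}) & \text{if } e_2 > e_{max} \\ (e_2, f_2) & \text{if } e_2 \le e_{max} \end{cases}$$

$$exprd_{ne}(s, e_2, f_2) = \begin{cases} (\infty, 0) & \text{if } e_2 > e_{max} \\ (e_2, f_2) & \text{if } e_2 \le e_{max} \end{cases}$$

Let

$$x_3 = [\![s, e_3, f_3]\!] = (-1)^s \cdot 2^{e_3} \cdot f_3.$$

We can proceed to prove the statement

$$(s, e_3, f_3) = \eta(r(x))$$

of the theorem.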

Proof of Theorem 7.4

If $(e_3, f_3) = (e_2, f_2)$, then $(s, e_3, f_3)$ is IEEE-normal by lemma 7.6. In all other cases, the factoring $(s, e_3, f_3)$ is obviously IEEE-normal taking into account the convention that the special factorings are IEEE-normal. Thus, it remains to show that

$$x_3 = r(x).$$

If $|x| \le X_{max}$, then lemma 7.5 implies that

$$x_2 = x_1 = r(x).$$

According to lemma 7.6, $(s, e_2, f_2)$ is an IEEE-normal factoring of $x_1$, and therefore

$$e_2 \le e_{max}.$$

Thus, we can conclude that

$$(e_3, f_3) = (e_2, f_2) \quad \text{and} \quad x_3 = x_2 = r(x).$$

Now let $|x| > X_{max}$. One then easily verifies for all rounding modes $r$:

$$\hat r(x) \ne r(x) \iff |\hat r(x)| > X_{max} \iff |\hat r(x)| \ge 2^{e_{max}+1} \iff e_2 > e_{max} \quad \text{by lemmas 7.5 and 7.6.} \qquad (7.3)$$

Recall that in the definition of $r_{ne}$, the threshold $X_{max}$ was chosen such that this holds. We now can complete the proof of the theorem. For $r = r_u$, we have

$$x_3 = [\![s, e_3, f_3]\!] = \begin{cases} \infty & \text{if } e_2 > e_{max} \text{ and } s = 0 \\ -X_{max} & \text{if } e_2 > e_{max} \text{ and } s = 1 \\ x_2 & \text{if } e_2 \le e_{max} \end{cases}$$

$$= \begin{cases} \infty & \text{if } \hat r(x) \ne r(x) \text{ and } s = 0 \\ -X_{max} & \text{if } \hat r(x) \ne r(x) \text{ and } s = 1 \\ \hat r(x) & \text{if } \hat r(x) = r(x) \end{cases}$$

$$= r(x)$$

because $x_2 = \hat r(x)$ by lemma 7.5.

The proof for the other three rounding modes is completely analogous.
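We summarize the results of this subsection: Let $\eta(x) = (s, e, f)$, it then holds

$$\eta(r(x)) = (s, exprd(s, post(e, sigrd(s, f)))). \qquad (7.4)$$

Exactly along the same lines, one shows for $x \ne 0$ and $\hat\eta(x) = (s, \hat e, \hat f)$ that

$$\hat r(x) = [\![s, \hat e, sigrd(s, \hat f)]\!]$$

and then

$$\hat\eta(\hat r(x)) = (s, post(\hat e, sigrd(s, \hat f))). \qquad (7.5)$$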
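Equation (7.4) can be exercised directly in software. The sketch below follows the four steps on exact rationals; it is only an illustration, not the hardware developed later. The mode names `"u"`, `"d"`, `"z"`, `"ne"`, the marker `INF`, and all function names are assumptions made for this example, and the parameters `p`, `emin`, `emax` must be supplied by the caller.

```python
from fractions import Fraction
import math

INF = "inf"  # exponent marker of the special factoring (s, inf, 0)

def eta(x, emin):
    """Step 1: IEEE-normal factoring (s, e, f) of a finite rational x."""
    x = Fraction(x)
    s = 1 if x < 0 else 0
    f, e = abs(x), 0
    if f < Fraction(2) ** emin:                 # denormal: exponent fixed to emin
        return s, emin, f / Fraction(2) ** emin
    while f >= 2:
        f, e = f / 2, e + 1
    while f < 1:
        f, e = f * 2, e - 1
    return s, e, f

def sigrd(mode, s, f, p):
    """Step 2: round f to a multiple of 2^-(p-1) (an element of F)."""
    unit = Fraction(1, 2 ** (p - 1))
    lo = math.floor(f / unit) * unit            # y  = max{y in F | y <= f}
    hi = lo if lo == f else lo + unit           # y' = min{y' in F | f <= y'}
    if mode == "z":
        return lo
    if mode == "u":
        return hi if s == 0 else lo
    if mode == "d":
        return lo if s == 0 else hi
    mid = lo + unit / 2                         # round to nearest even
    if f != mid:
        return lo if f < mid else hi
    return lo if (lo / unit) % 2 == 0 else hi   # tie: last bit must be 0

def post(e, f1):
    """Step 3: post normalization for the case f1 = 2."""
    return (e + 1, Fraction(1)) if f1 == 2 else (e, f1)

def exprd(mode, s, e2, f2, p, emax):
    """Step 4: exponent rounding for results beyond emax."""
    if e2 <= emax:
        return e2, f2
    to_inf = mode == "ne" or (mode == "u" and s == 0) or (mode == "d" and s == 1)
    return (INF, Fraction(0)) if to_inf else (emax, 2 - Fraction(1, 2 ** (p - 1)))

def round_ieee(mode, x, p, emin, emax):
    """Factoring (s, e3, f3) of the rounded result, following equation (7.4)."""
    s, e, f = eta(x, emin)
    e2, f2 = post(e, sigrd(mode, s, f, p))
    e3, f3 = exprd(mode, s, e2, f2, p, emax)
    return s, e3, f3
```

For a toy format with `p = 3`, `emin = -2`, `emax = 2`, the value 7.9 overflows to infinity under round to nearest, while round to zero delivers the largest finite significand 7/4 at exponent 2. With single precision parameters (`p = 24`, `emin = -126`, `emax = 127`), rounding 1/10 to nearest gives the familiar single precision value 13421773 / 2^27.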
7.2.7 Rounding Algorithms

By the lemmas 7.1 and 7.2, we can substitute in the above algorithms $f$ and $\hat f$ by their $p$-representatives. This gives the following rounding algorithms:

For limited exponent range: let $\eta(x) = (s, e, f)$, then

$$\eta(r(x)) = (s, exprd(s, post(e, sigrd(s, [f]_p)))). \qquad (7.6)$$

For unlimited exponent range: let $x \ne 0$ and $\hat\eta(x) = (s, \hat e, \hat f)$, then

$$\hat\eta(\hat r(x)) = (s, post(\hat e, sigrd(s, [\hat f]_p))). \qquad (7.7)$$

7.3 Exceptions

The IEEE floating point standard defines the five exceptions of table 7.2. These exceptions activate event signals of maskable interrupts. The mask bits for these interrupts are also called enable bits. Here, we will be concerned with the enable bits OVFen and UNFen for overflow and underflow.

Table 7.2: IEEE floating point exceptions

symbol   meaning
INV      invalid operation
DBZ      division by 0
OVF      overflow
UNF      underflow
INX      inexact result

Implementation of the first two exceptions will turn out to be easy. They can only occur if at least one operand is from the set $\{0, \infty, -\infty, NaN\}$. For each operation, these two exceptions therefore just require a straightforward bookkeeping on the type of the operands (section 7.4).
According to the standard, arithmetic on infinity and NaN is always exact and therefore signals no exceptions, except for invalid operations. Thus, the last three exceptions can only occur if both operands are finite numbers. These exceptions depend on the exact result of the arithmetic operation but not on the operation itself. Therefore, we will now concentrate on situations where a finite but not necessarily representable number $x \in \mathbb{R}$ is the exact result of an operation

$$x = a \circ b \quad \text{where } a, b \in \mathcal{R}.$$

In this section, we will also complete the definition of the result of an arithmetic IEEE operation, given that both operands are finite, non-zero, representable numbers. The arithmetic on infinity, zero, and NaN will be defined in section 7.4.

7.3.1 Overflow

An overflow occurs if the absolute value of $\hat r(x)$ exceeds $X_{max}$, i.e.,

$$OVF(x) \iff |\hat r(x)| > X_{max}.$$

Let $x' =_{p-\hat e} x$. Since $\hat r(x) = \hat r([x]_{p-\hat e})$, it follows that

$$OVF(x) \iff OVF([x]_{p-\hat e}) \iff OVF(x').$$

Only results $x$ with $|x| > X_{max}$ can cause overflows, and for these results, we have $\eta(x) = \hat\eta(x)$. Let

$$\eta(x) = \hat\eta(x) = (s, e, f).$$

By lemma 7.5, we then have

$$OVF(x) \iff 2^e \cdot sigrd(s, f) > X_{max} \iff e > e_{max} \ \text{ or } \ (e = e_{max} \text{ and } sigrd(s, f) = 2). \qquad (7.8)$$

The first case is called overflow before rounding, the second case overflow after rounding.

7.3.2 Underflow

Informally speaking, an underflow occurs if two conditions are fulfilled, namely

1. tininess: the result is below $2^{e_{min}}$, and

2. loss of accuracy: accuracy is lost when the result is represented as a denormalized floating point number.

The IEEE standard gives two definitions for each of these conditions. Thus, the standard gives four definitions of underflow. It is, however, required that the same definition of underflow is used for all operations.

The two definitions for tininess are tiny-after-rounding

$$TINY_a(x) \iff 0 < |\hat r(x)| < 2^{e_{min}}$$

and tiny-before-rounding

$$TINY_b(x) \iff 0 < |x| < 2^{e_{min}}.$$

In the four rounding modes, we have

$$TINY_a(x) \iff \begin{cases} 0 < |x| < 2^{e_{min}} \cdot (1 - 2^{-(p+1)}) & \text{if } r_{ne} \\ 0 < |x| < 2^{e_{min}} & \text{if } r_z \\ -2^{e_{min}} < x \le 2^{e_{min}} \cdot (1 - 2^{-p}),\ x \ne 0 & \text{if } r_u \\ -2^{e_{min}} \cdot (1 - 2^{-p}) \le x < 2^{e_{min}},\ x \ne 0 & \text{if } r_d. \end{cases}$$

For all rounding modes, one easily verifies that tiny-after-rounding implies tiny-before-rounding:

$$TINY_a(x) \Rightarrow TINY_b(x).$$

Let $x \ne 0$ and $\hat\eta(x) = (s, \hat e, \hat f)$, it immediately follows that

$$TINY_b(x) \iff TINY_b([x]_{p-\hat e}).$$

As $\hat r(x) = \hat r([x]_{p-\hat e})$, we can also conclude that

$$TINY_a(x) \iff TINY_a([x]_{p-\hat e}).$$

Loss of Accuracy

The two definitions for loss of accuracy are denormalization loss:

$$LOSS_a(x) \iff r(x) \ne \hat r(x)$$

and inexact result

$$LOSS_b(x) \iff r(x) \ne x.$$

An example for denormalization loss is $x = \langle 0.0^p 1 \rangle$ because

$$r_{ne}(x) = 0 \quad \text{and} \quad \hat r(x) = x.$$

Lemma 7.7. A denormalization loss implies an inexact result, i.e.,

$$LOSS_a(x) \Rightarrow LOSS_b(x).$$

Proof. The lemma is proven by contradiction. Assume $r(x) = x$, then $x \in \mathcal{R} \subseteq \hat{\mathcal{R}}$ and it follows that

$$\hat r(x) = x = r(x).$$

Let $\hat\eta(x) = (s, \hat e, \hat f)$ and $\eta(x) = (s, e, f)$. By definition,

$$[x]_{p-\hat e} =_{p-\hat e} x.$$

Since $\hat e \le e$, we have

$$[x]_{p-\hat e} =_{p-e} x$$

and hence,

$$r([x]_{p-\hat e}) = r(x).$$

This shows that

$$LOSS_b(x) \iff LOSS_b([x]_{p-\hat e}).$$

As $\hat r(x) = \hat r([x]_{p-\hat e})$, we can conclude

$$LOSS_a(x) \iff LOSS_a([x]_{p-\hat e}).$$

Hence, for any definition of LOSS and TINY, we have

$$LOSS(x) \iff LOSS([x]_{p-\hat e})$$
$$TINY(x) \iff TINY([x]_{p-\hat e})$$

and therefore, the conditions can always be checked with the representative $[x]_{p-\hat e}$ instead of with $x$.

Detecting $LOSS_b(x)$ is particularly simple. If $\eta(x) = (s, e, f)$ and $|x| \le X_{max}$, then exponent rounding does not take place and

$$r(x) \ne x \iff sigrd(s, f) \ne f \iff sigrd(s, [f]_p) \ne [f]_p.$$

Whether the underflow exception UNF should be signaled at all depends in the following way on the underflow enable flag UNFen:

$$UNF \iff \begin{cases} TINY \wedge LOSS & \text{if } UNFen = 0 \\ TINY & \text{if } UNFen = 1. \end{cases}$$

7.3.3 Wrapped Exponents

In this subsection we complete the definition of the result of an IEEE floating point operation. Let

$$\alpha = 3 \cdot 2^{n-2},$$

let $a, b \in \mathcal{R}$ be representable numbers, and for $\circ \in \{+, -, \times, /\}$, let

$$x = a \circ b$$

be the exact result. The proper definition of the result of the IEEE operation is then

$$a \circ_I b = r(y)$$

where

$$y = \begin{cases} x \cdot 2^{-\alpha} & \text{if } OVF(x) \wedge OVFen \\ x \cdot 2^{\alpha} & \text{if } UNF(x) \wedge UNFen \\ x & \text{otherwise.} \end{cases}$$

Thus, whenever non-masked overflows or underflows occur, the exponent of the result is adjusted. For some reason, this is called wrapping the exponent. The rounded adjusted result is then given to the interrupt service routine. In such cases one would of course hope that $r(y)$ itself is a normal representable number. This is asserted in the following lemma:

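Lemma 7.8. The adjusted result lies strictly between $2^{e_{min}}$ and $X_{max}$:

1. $OVF(x) \Rightarrow 2^{e_{min}} < |x \cdot 2^{-\alpha}| < X_{max}$

2. $UNF(x) \Rightarrow 2^{e_{min}} < |x \cdot 2^{\alpha}| < X_{max}$

Proof. We only show the lemma for multiplication in the case of overflow. The remaining cases are handled in a completely analogous way.

The largest possible product of two representable numbers is

$$|x| = X_{max}^2 < (2^{e_{max}+1})^2 = 2^{2 e_{max} + 2}.$$

For the exponent, it therefore holds

$$2 \cdot e_{max} + 2 - \alpha = 2 \cdot (2^{n-1} - 1) + 2 - 3 \cdot 2^{n-2} = 4 \cdot 2^{n-2} - 3 \cdot 2^{n-2} = 2^{n-2} < e_{max}$$

and thus, $|x \cdot 2^{-\alpha}| < X_{max}$.

There cannot be an overflow unless

$$|x| > X_{max} > 2^{e_{max}}.$$

For the exponents, we conclude that

$$e_{max} - \alpha = 2^{n-1} - 1 - 3 \cdot 2^{n-2} = -2^{n-2} - 1 > -2^{n-1} + 2 = e_{min}.$$

Thus, it also holds that $|x \cdot 2^{-\alpha}| > 2^{e_{min}}$.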
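The two exponent inequalities used in the proof can be checked numerically for the exponent widths of the standard formats. The sketch below assumes the usual relations $e_{max} = 2^{n-1} - 1$ and $e_{min} = 2 - 2^{n-1}$; the function name `wrap_parameters` is hypothetical.

```python
def wrap_parameters(n):
    """Return (emax, emin, alpha) for an n-bit exponent field."""
    emax = 2 ** (n - 1) - 1
    emin = 2 - 2 ** (n - 1)
    alpha = 3 * 2 ** (n - 2)
    return emax, emin, alpha

def overflow_wrap_is_safe(n):
    """Check both exponent inequalities from the overflow case of the proof."""
    emax, emin, alpha = wrap_parameters(n)
    upper_ok = 2 * emax + 2 - alpha < emax   # wrapped product stays below Xmax
    lower_ok = emax - alpha > emin           # wrapped result stays above 2^emin
    return upper_ok and lower_ok
```

For single precision (`n = 8`) this gives an adjustment of 192, and for double precision (`n = 11`) an adjustment of 1536; both pass the check.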

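The following lemma shows how to obtain a factoring of $r(y)$ from a factoring of $x$.

Lemma 7.9. Let $\hat\eta(\hat r(x)) = (s, u, v)$, then

1. $OVF(x) \Rightarrow \eta(r(x \cdot 2^{-\alpha})) = (s, u - \alpha, v)$

2. $UNF(x) \Rightarrow \eta(r(x \cdot 2^{\alpha})) = (s, u + \alpha, v)$

Proof. We only show part 1; the proof of part 2 is completely analogous. Let

$$\hat\eta(x) = (s, \hat e, \hat f),$$

then

$$\hat\eta(x \cdot 2^{-\alpha}) = (s, \hat e - \alpha, \hat f).$$

Define $f_1$ and $(u, v)$ as

$$f_1 = sigrd(s, [\hat f]_p)$$
$$(u, v) = post(\hat e, f_1).$$

The definition of post normalization implies

$$(u - \alpha, v) = post(\hat e - \alpha, f_1).$$

Applying the rounding algorithm for unlimited exponent range (equation 7.7) gives:

$$\hat\eta(\hat r(x)) = (s, post(\hat e, f_1)) = (s, u, v)$$

and

$$\hat\eta(\hat r(x \cdot 2^{-\alpha})) = (s, post(\hat e - \alpha, f_1)) = (s, u - \alpha, v).$$

Lemma 7.8 implies

$$2^{e_{min}} < |y| = |x \cdot 2^{-\alpha}| < X_{max}.$$

For such numbers, we have

$$2^{e_{min}} \le |r(y)| = |\hat r(y)| \le X_{max}.$$

It follows that

$$\eta(r(y)) = \hat\eta(\hat r(y))$$

and part 1 of the lemma is proven.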

7.3.4 Inexact Result

Let

$$y = \begin{cases} x \cdot 2^{-\alpha} & \text{if } OVF(x) \wedge OVFen \\ x \cdot 2^{\alpha} & \text{if } UNF(x) \wedge UNFen \\ x & \text{otherwise} \end{cases}$$

be the exact result of an IEEE operation, where the exponent is wrapped in case an enabled overflow or underflow occurs. The IEEE standard defines the occurrence of an inexact result by

$$INX(y) \iff (r(y) \ne y) \vee (OVF(y) \wedge \overline{OVFen}).$$

So far, we have only considered finite results $y$. For such results, $OVF(y)$ always implies $r(y) \ne y$ and the second condition is redundant. Hence, we have for finite $y$

$$INX(y) \iff r(y) \ne y.$$

When dealing with special operands $\infty$, $-\infty$ and NaN, computations like $\infty + \infty = \infty$ with $r(\infty) = \infty$ will be permitted. However, the IEEE standard defines the arithmetic on infinity and NaN to be always exact. Thus, the exceptions INX, OVF and UNF never occur when special operands are involved.

Let $\eta(x) = (s, e, f)$ and $\hat\eta(x) = (s, \hat e, \hat f)$. If

$$(OVF(x) \wedge OVFen) \vee (UNF(x) \wedge UNFen)$$

holds, then exponent rounding does not take place, and significand rounding is the only source of inaccuracy. Thus, we have in this case

$$INX(y) \iff sigrd(s, \hat f) \ne \hat f \iff sigrd(s, [\hat f]_p) \ne [\hat f]_p.$$

In all other cases we have

$$INX(y) \iff sigrd(s, f) \ne f \vee OVF(x) \iff sigrd(s, [f]_p) \ne [f]_p \vee OVF([x]_{p-e}).$$

7.4 Arithmetic on Special Operands

In the IEEE floating point standard [Ins85], the infinity arithmetic and the arithmetic with zeros and NaNs are treated as special cases. This special arithmetic is considered to be always exact. Nevertheless, there are situations in which an invalid operation exception INV or a division by zero exception DBZ can occur.

In the following subsections, we specify this special arithmetic and the possible exceptions for any IEEE operation. The factorings of the numbers $a$ and $b$ are denoted by $(s_a, e_a, f_a)$ and $(s_b, e_b, f_b)$ respectively.

7.4.1 Operations with NaNs

There are two different kinds of not a number, signaling NaN and quiet NaN. Let $e = e[n-1 : 0]$ and $f = f[1 : p-1]$. The value represented by the floating point number $(s, e, f)$ is a NaN if $e = 1^n$ and $f \ne 0^{p-1}$. We chose $f[1] = 1$ for the quiet and $f[1] = 0$ for the signaling variety of NaN (footnote 1):

$$[\![s, e, f]\!] = \begin{cases} \text{quiet NaN} & \text{if } e = 1^n \wedge f[1] = 1 \\ \text{signaling NaN} & \text{if } e = 1^n \wedge f[1] = 0 \wedge f \ne 0^{p-1}. \end{cases}$$

A signaling NaN signals an invalid operation exception INV whenever used as an operand. However, copying a signaling NaN without a change of format does not signal INV. This also applies to operations which only modify the sign, e.g., the absolute value and reversed sign (footnote 2) operations.

If an arithmetic operation involves one or two input NaNs, none of them signaling, the delivered result must be one of the input NaNs. In the specifications of the arithmetic operations, we therefore distinguish between three types of NaNs:

qNAN denotes an arbitrary quiet NaN,

sNAN denotes an arbitrary signaling NaN, and

qNAN indicates that the result must be one of the quiet input NaNs.

For the absolute value and reversed sign operations, this restriction does not apply. These two operations modify the sign bit independent of the type of the operand.
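The coding chosen above can be illustrated for single precision ($n = 8$, $p = 24$), where the exponent occupies bits 30:23 and $f[1]$ is bit 22 of the packed word. The function name `classify_single` and the returned strings are hypothetical labels for this sketch.

```python
def classify_single(bits):
    """Classify a 32-bit single precision pattern: e = 1^n with f != 0 is a NaN,
    and f[1] (the most significant fraction bit) selects quiet vs. signaling."""
    e = (bits >> 23) & 0xFF          # e[7:0]
    f = bits & 0x7FFFFF              # f[1:23]
    if e != 0xFF:
        return "finite"
    if f == 0:
        return "infinity"
    return "quiet NaN" if (f >> 22) & 1 else "signaling NaN"
```

Under this coding, the pattern `0x7FC00000` is a quiet NaN, while `0x7F800001` has $f[1] = 0$ with a non-zero fraction and is therefore a signaling NaN.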

1 The IEEE standard only specifies that the exponent $e[n-1 : 0] = 1^n$ is reserved for infinity and NaN; further details of the coding are left to the implementation. For infinity and the two types of NaNs we therefore chose the coding used in the Intel Pentium Processor [Int95].
2 x : x

7.4.2 Addition and Subtraction

The subtraction of two representable numbers $a$ and $b$ can be reduced to the addition of the two numbers $a$ and $c$, where $c$ has the factoring

$$(s_c, e_c, f_c) = (\overline{s_b}, e_b, f_b).$$

In the following, we therefore just focus on the addition of two numbers. Table 7.3 lists the result for the different types of operands. There are just a few cases in which floating point exceptions do or might occur:

An INV exception does occur whenever
– one of the operands $a$, $b$ is a signaling NaN, or
– when performing the operation '$\infty - \infty$' or '$-\infty + \infty$'.

The exceptions OVF, UNF and INX can only occur when adding two finite non-zero numbers. However, it depends on the value of the exact result whether one of these interrupts occurs or not (section 7.3).

Table 7.3: Result of the addition $a + b$; $x$ and $y$ denote finite numbers.

a \ b    y          -∞      ∞       qNAN    sNAN
x        r(x + y)   -∞      ∞       qNAN    qNAN
-∞       -∞         -∞      qNAN    qNAN    qNAN
∞        ∞          qNAN    ∞       qNAN    qNAN
qNAN     qNAN       qNAN    qNAN    qNAN    qNAN
sNAN     qNAN       qNAN    qNAN    qNAN    qNAN

Since zero has two representations, i.e., $+0$ and $-0$, special attention must be paid to the sign of a zero result $a + b$. In case of a subtraction, the sign of a zero result depends on the rounding mode

$$x - x = x + (-x) = \begin{cases} +0 & \text{if } r_u, r_{ne}, r_z \\ -0 & \text{if } r_d. \end{cases}$$

When adding two zero numbers with like signs, the sum retains the sign of the first operand, i.e., for $x \in \{+0, -0\}$,

$$x + x = x - (-x) = x.$$
7.4.3 Multiplication

Table 7.4 lists the result of the multiplication $a \cdot b$ for the different types of operands. If the result of the multiplication is a NaN, the sign does not matter. In any other case, the sign of the result $c = a \cdot b$ is the exclusive or of the operands' signs:

$$s_c = s_a \oplus s_b.$$

Table 7.4: Result of the multiplication $a \cdot b$; $x$ and $y$ denote finite non-zero numbers.

a \ b    y          ±0      ±∞      qNAN    sNAN
x        r(x · y)   ±0      ±∞      qNAN    qNAN
±0       ±0         ±0      qNAN    qNAN    qNAN
±∞       ±∞         qNAN    ±∞      qNAN    qNAN
qNAN     qNAN       qNAN    qNAN    qNAN    qNAN
sNAN     qNAN       qNAN    qNAN    qNAN    qNAN

There are just a few cases in which floating point exceptions do or might occur:

An INV exception does occur whenever
– one of the operands $a$, $b$ is a signaling NaN, or
– when multiplying a zero and an infinity number, i.e., '$0 \cdot \infty$' or '$\infty \cdot 0$'.

The exceptions OVF, UNF and INX depend on the value of the exact result (section 7.3); they can only occur when both operands are finite non-zero numbers.

7.4.4 Division

Table 7.5 lists the result of the division $a / b$ for the different types of operands. The sign of the result is determined as for the multiplication. This means that except for a NaN, the sign of the result $c$ is the exclusive or of the operands' signs: $s_c = s_a \oplus s_b$.

Table 7.5: Result of the division $a / b$; $x$ and $y$ denote finite non-zero numbers.

a \ b    y          ±0      ±∞      qNAN    sNAN
x        r(x / y)   ±∞      ±0      qNAN    qNAN
±0       ±0         qNAN    ±0      qNAN    qNAN
±∞       ±∞         ±∞      qNAN    qNAN    qNAN
qNAN     qNAN       qNAN    qNAN    qNAN    qNAN
sNAN     qNAN       qNAN    qNAN    qNAN    qNAN

In the following cases, the division signals a floating point exception:

An INV exception does occur whenever
– one of the operands $a$, $b$ is a signaling NaN, or
– when performing the operation '$0 / 0$' or '$\infty / \infty$'.

A DBZ (division by zero) exception is signaled whenever dividing a finite non-zero number by zero.

The exceptions OVF, UNF and INX depend on the value of the exact result (section 7.3); they can only occur when both operands are finite non-zero numbers.

7.4.5 Comparison

The comparison operation is based on the four basic relations greater than, less than, equal and unordered. These relations are defined over the set $\mathcal{R}_{\infty NaN}$ consisting of all representable numbers, the two infinities, and NaN:

$$\mathcal{R}_{\infty NaN} = \mathcal{R} \cup \{-\infty, \infty, NaN\}.$$

Let the binary relation $\circ \in \{<, =, >\}$ be defined over the real numbers $\mathbb{R}$; the corresponding IEEE floating point relation is denoted by $\circ_I$. For any representable number $x \in \mathcal{R}_\infty$, none of the pairs $(x, NaN)$, $(NaN, x)$ and $(NaN, NaN)$ is an element of $\circ_I$. Thus, the relation $\circ_I$ is a subset of $\mathcal{R}_\infty^2$.

IEEE floating point relations ignore the sign of zero, i.e., $+0 =_I -0$. Thus, over the set of representable numbers, the relations $\circ$ and $\circ_I$ are the same:

$$\forall x, y \in \mathcal{R} \subset \mathbb{R} : \quad x \circ_I y \iff x \circ y.$$
The two infinities ($-\infty$ and $\infty$) are interpreted in the usual way. For any finite representable $x \in \mathcal{R}$, we have

$$-\infty <_I x <_I \infty.$$

NaN compares unordered with every representable number and with NaN. Thus, for every $x \in \mathcal{R}_{\infty NaN}$, the pairs $(x, NaN)$ and $(NaN, x)$ are elements of the relation 'unordered', and these are the only elements. Let this relation be denoted by the symbol ?, then

$$? = \{ (x, NaN), (NaN, x) \mid x \in \mathcal{R}_{\infty NaN} \}.$$

The comparison of two operands $x$ and $y$ delivers the value $\circ(x, y)$ of a specific binary predicate

$$\circ : \mathcal{R}_{\infty NaN} \times \mathcal{R}_{\infty NaN} \to \{0, 1\}.$$

Table 7.6 lists all the predicates in question and how they can be obtained from the four basic relations. The predicates OLT and UGE, for example, can be expressed as

$$OLT(x, y) = \overline{UGE(x, y)} \iff x <_I y \iff \neg (x >_I y \vee x =_I y \vee x \,?\, y).$$

Table 7.6: Floating point predicates. The value 1 (0) denotes that the relation is true (false). Predicates marked with are not indigenous to the IEEE standard.

predicate        greater  less  equal  unordered   INV if
true    false    >        <     =      ?           unordered
F       T        0        0     0      0           No
UN      OR       0        0     0      1           No
EQ      NEQ      0        0     1      0           No
UEQ     OGL      0        0     1      1           No
OLT     UGE      0        1     0      0           No
ULT     OGE      0        1     0      1           No
OLE     UGT      0        1     1      0           No
ULE     OGT      0        1     1      1           No
SF      ST       0        0     0      0           Yes
NGLE    GLE      0        0     0      1           Yes
SEQ     SNE      0        0     1      0           Yes
NGL     GL       0        0     1      1           Yes
LT      NLT      0        1     0      0           Yes
NGE     GE       0        1     0      1           Yes
LE      NLE      0        1     1      0           Yes
NGT     GT       0        1     1      1           Yes

Note that for every predicate the implementation must also provide its negation.

In addition to the boolean value $\circ(x, y)$, the comparison also signals an invalid operation. With respect to the flag INV, the predicates fall into one of two classes. The first 16 predicates only signal INV when comparing a signaling NaN, whereas the remaining 16 predicates also signal INV when the operands are unordered.

Comparisons are always exact and never overflow or underflow. Thus, INV is the only IEEE floating point exception signaled by a comparison, and the flags of the remaining exceptions are all inactive:

$$INX = OVF = UNF = DBZ = 0.$$
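The way a predicate is assembled from the four basic relations can be sketched as a mask over the tuple (greater, less, equal, unordered). The sketch below covers only the first eight "true" predicates of table 7.6, uses Python floats as stand-ins for the operands, and does not model the INV flag; the names `basic_relations`, `PREDICATES` and `predicate` are hypothetical.

```python
import math

def basic_relations(x, y):
    """Return (greater, less, equal, unordered) for two floats."""
    if math.isnan(x) or math.isnan(y):
        return (0, 0, 0, 1)
    return (int(x > y), int(x < y), int(x == y), 0)

# Mask bits (greater, less, equal, unordered) copied from table 7.6.
PREDICATES = {
    "F":   (0, 0, 0, 0), "UN":  (0, 0, 0, 1),
    "EQ":  (0, 0, 1, 0), "UEQ": (0, 0, 1, 1),
    "OLT": (0, 1, 0, 0), "ULT": (0, 1, 0, 1),
    "OLE": (0, 1, 1, 0), "ULE": (0, 1, 1, 1),
}

def predicate(name, x, y):
    """A predicate is true iff one of its selected basic relations holds."""
    relations = basic_relations(x, y)
    return int(any(m and r for m, r in zip(PREDICATES[name], relations)))

def negated(name, x, y):
    """The 'false' column of the table is the negation of the 'true' predicate."""
    return 1 - predicate(name, x, y)
```

For instance, `predicate("OLT", nan, 2.0)` is 0 while `predicate("ULT", nan, 2.0)` is 1, and `negated("OLT", x, y)` plays the role of UGE.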

7.4.6 Format Conversions

Conversions have to be possible between the two floating point formats and the integer format. Integers are represented as 32-bit two's complement numbers and lie in the set

$$INT = T_{32} = \{-2^{31}, \ldots, 2^{31} - 1\}.$$

Floating point numbers are represented with an $n$-bit exponent and a $p$-bit significand. The range of finite, representable numbers is bounded by $-X_{max}$ and $X_{max}$, where $X_{max} = (1 - 2^{-p}) \cdot 2^{2^{n-1}}$. For single precision $n = 8$, $p = 24$ and the finite, representable numbers lie in the range

$$\mathcal{R}_s = [-(1 - 2^{-24}) \cdot 2^{128},\ (1 - 2^{-24}) \cdot 2^{128}],$$

whereas for double precision $n = 11$, $p = 53$ and

$$\mathcal{R}_d = [-(1 - 2^{-53}) \cdot 2^{1024},\ (1 - 2^{-53}) \cdot 2^{1024}].$$

Table 7.7 lists the floating point exceptions which can be caused by the different format conversions. The result of the conversion is rounded as specified in section 7.2, even if the result is an integer. All four rounding modes must be supported.
Table 7.7: Floating point exceptions which can be caused by format conversions (d: double precision floating point, s: single precision floating point, i: 32-bit two's complement integer)

conversion   INV   DBZ   OVF   UNF   INX
d → s        +           +     +     +
s → d        +
i → s                                +
i → d
s → i        +                       +
d → i        +                       +

Floating Point Format Conversions

Double precision covers a wider range of numbers than single precision, and the numbers are represented with a larger precision. Thus, a conversion from single to double precision is always exact and never overflows or underflows, but that is not the case for a conversion from double to single precision.

The conversion signals an invalid operation exception iff the operand is a signaling NaN. Unlike the arithmetical operations, a quiet input NaN cannot pass the conversion unchanged. Thus, in case of an input NaN, the result of the conversion is always an arbitrary, quiet NaN.

Integer to Floating Point Conversions

For either floating point format, we have

$$-X_{max} < -2^{31} \quad \text{and} \quad 2^{31} < X_{max}.$$

Thus, any 32-bit integer $x$ can be represented as a single or double precision floating point number. In case of double precision, the conversion is performed without loss of precision, whereas the single precision result might be inexact due to the 24-bit significand. Other floating point exceptions cannot occur.

Floating Point to Integer Conversions

When converting a floating point number into an integer, the result is usually inexact. The conversion signals an invalid operation if the input is a NaN or infinity, or if the finite floating point input $x$ exceeds the integer range, i.e.,

$$x < -2^{31} \quad \text{or} \quad x \ge 2^{31}.$$

In the latter case, a floating point overflow OVF is not signaled because the result of the conversion is an integer.
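The conversion rules can be made concrete with two small checks. The first decides whether a 32-bit integer converts to single precision without an inexact result, the second whether a finite floating point input is outside the integer range and therefore signals INV. Both function names are hypothetical, and the sketch ignores rounding of the input before the range check.

```python
def int_to_single_is_exact(i):
    """A 32-bit integer is exactly representable in single precision iff,
    after dropping trailing zero bits, its magnitude fits into 24 bits."""
    if not -2 ** 31 <= i < 2 ** 31:
        raise ValueError("not a 32-bit two's complement integer")
    m = abs(i)
    while m and m % 2 == 0:
        m //= 2
    return m.bit_length() <= 24

def float_to_int_signals_inv(x):
    """INV is signaled for a finite input x outside [-2^31, 2^31)."""
    return x < -2 ** 31 or x >= 2 ** 31
```

For example, $2^{24} + 1$ needs 25 significant bits and is therefore inexact in single precision, whereas $-2^{31}$ converts exactly.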
7.5 Selected References and Further Reading

The translation of the IEEE standard 754 [Ins85] into mathematical language and the theory of rounding presented in this chapter is based on [EP97].

7.6 Exercises

Exercise 7.1 Prove or disprove: For all rounding modes, rounding to single precision can be performed in two steps:

a) round to double precision, then

b) round the double precision result to single precision.

Exercise 7.2 Complete the following proofs:

1. the proof of lemma 7.3
2. the proof of theorem 7.4 for rounding mode $r_{ne}$
3. the proof of lemma 7.9 part 2

Exercise 7.3 Let $x$ be the unrounded result of the addition of two representable numbers. Show:

1. T INYa x T INYb x
2. LOSSa x  LOSSb x  FALSE

Exercise 7.4 Let $x = 2^e \cdot f$, where $e$ is represented as a 14-bit two's complement number $e = [e[13 : 0]]$ and the significand $f$ is represented as a 57-bit binary fraction $f = \langle f[0].f[1 : 56] \rangle$. Design circuits which compute for double precision:

1. $LOSS_a(x)$
2. $LOSS_b(x)$

Compare the cost and delay of the two circuits.

Chapter 8

Floating Point Algorithms and Data Paths

In this chapter the data paths of an IEEE-compatible floating point unit FPU are developed. The unit is depicted in figure 8.2. It is capable of handling single and double precision numbers under control of signals like db, dbs, dbr, ... (double). This requires embedding conventions for embedding single precision numbers into 64-bit data.

The data inputs of the unit are (packed) IEEE floating point numbers with values

$$a = [\![s_A, e_A[n-1 : 0], f_A[1 : p-1]]\!]$$
$$b = [\![s_B, e_B[n-1 : 0], f_B[1 : p-1]]\!]$$

where

$$(n, p) = \begin{cases} (11, 53) & \text{if } db = 1 \\ (8, 24) & \text{if } db = 0. \end{cases}$$

As shown in figure 8.1, single precision inputs are fed into the unit as the left subwords of FA2[63 : 0] and FB2[63 : 0]. Thus,

$$(s_A, e_A[n-1 : 0], f_A[1 : p-1]) = \begin{cases} (FA2[63], FA2[62 : 55], FA2[54 : 32]) & \text{if } \overline{db} \\ (FA2[63], FA2[62 : 52], FA2[51 : 0]) & \text{if } db. \end{cases}$$

The b operand is embedded in the same way.

Figure 8.1: Embedding a single precision floating point data $x_s$ into a 64-bit word (bits 63:32 hold $x_s$, bits 31:0 hold $z$); $z$ is an arbitrary bit string. In our implementation $z = x_s$.

The unpacking unit detects special inputs $\pm 0$, $\pm\infty$, NaN, sNaN and signals them with the flags fla and flb. For normal or denormal inputs, the hidden bit is unpacked, the exponent is converted to two's complement representation, and single precision numbers are internally converted into double precision numbers. Under control of signal normal, denormal significands are normalized, and the shift distances lza[5 : 0] and lzb[5 : 0] of this normalization shift are signaled.

Figure 8.2: Top level schematics of the floating point unit. The outputs Fc, Fx and Fp consist of a 64-bit data and the floating point exception flags.

Thus, the a-outputs of the unpacker satisfy for both single and double precision numbers

$$a = \begin{cases} (-1)^{s_a} \cdot 2^{[e_a[10:0]]} \cdot \langle f_a[0].f_a[1 : 52] \rangle & \text{if } normal = 0 \\ (-1)^{s_a} \cdot 2^{[e_a[10:0]] - \langle lz_a[5:0] \rangle} \cdot \langle f_a[0].f_a[1 : 52] \rangle & \text{if } normal = 1. \end{cases}$$

The b-outputs of the unpacker satisfy an analogous equation. The normalization shift activated by $normal = 1$ is performed for multiplications and divisions but not for additions and subtractions.
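The embedding convention can be illustrated by extracting the packed fields from a 64-bit operand word. This is a software sketch of the field selection only (no hidden bit, no exponent conversion); the function names `unpack_fields` and `embed_single` are hypothetical, and `embed_single` mirrors the choice $z = x_s$ mentioned for the implementation.

```python
def unpack_fields(fa2, db):
    """Return the packed fields (s, e, f) of a 64-bit operand word FA2.
    db = 1 selects double precision, db = 0 the single precision left subword."""
    s = (fa2 >> 63) & 1
    if db:
        e = (fa2 >> 52) & 0x7FF              # FA2[62:52]
        f = fa2 & ((1 << 52) - 1)            # FA2[51:0]
    else:
        e = (fa2 >> 55) & 0xFF               # FA2[62:55]
        f = (fa2 >> 32) & ((1 << 23) - 1)    # FA2[54:32]
    return s, e, f

def embed_single(xs):
    """Embed a 32-bit single precision word into 64 bits, duplicating it (z = xs)."""
    return (xs << 32) | xs
```

For example, the single precision pattern `0x3F800000` (the number 1.0) embedded this way unpacks to sign 0, biased exponent 127 and fraction 0.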
Let

$$x = a \circ b$$

be the exact result of an arithmetic operation, and let

$$\hat\eta(x) = (s, \hat e, \hat f).$$

In the absence of special cases the converter, the multiply/divide unit and the add/subtract unit deliver as inputs to the rounder the data $s_r$, $e_r[12 : 0]$, $f_r[-1 : 55]$ satisfying

$$x =_{p-\hat e} (-1)^{s_r} \cdot 2^{[e_r[12:0]]} \cdot \langle f_r[-1 : 0].f_r[1 : 55] \rangle$$

and

$$f_r[-1 : 0] = 00 \Rightarrow OVF(x) = 0.$$

Note that $\hat\eta(x)$ is undefined for $x = 0$. Thus, a result $x = 0$ is always handled as a special case. Let

$$y = \begin{cases} x \cdot 2^{-\alpha} & \text{if } OVF(x) \wedge OVFen \\ x \cdot 2^{\alpha} & \text{if } UNF(x) \wedge UNFen \\ x & \text{otherwise.} \end{cases}$$

The rounder then has to output $r(y)$ coded as a (packed) IEEE floating point number. The coding of the rounding modes is listed in table 8.1.

Table 8.1: Coding of the IEEE rounding modes

RM[1:0]   symbol   rounding mode
00        r_z      round to zero
01        r_ne     round to nearest even
10        r_u      round up
11        r_d      round down

  +%
The cost of the floating point unit depicted in figure 8.2 can be expressed
as
$$C_{FPU} = C_{FCon} + C_{FPunp} + C_{FXunp} + C_{Cvt} + C_{MulDiv} + C_{AddSub} + C_{FXrnd} + C_{FPrnd} + C_{ff}(129) + 4 \cdot C_{driv}(129).$$
We assume that all inputs of the FPU are taken from registers and therefore
have zero delay. The outputs Fx, Fp, Fc and fcc then have the following
accumulated delay:
$$A_{FPU} = \max\{A_{FCon},\; A_{FXrnd},\; A_{FPrnd}\}.$$
Figure 8.3: Top level schematics of the unpacker FPUNP

Note that $A_{FCon}$ includes the delay of the inputs fla and flb. In our
implementation, the multiply/divide unit, the add/subtract unit and the two
rounders FPRND and FXRND have an additional register stage. Thus, the
FPU requires a minimal cycle time of

$$T_{FPU} = \max\{T_{MulDiv},\; T_{AddSub},\; T_{FPrnd},\; T_{FXrnd},\; A_{FPU}(Fr) + \Delta\}$$
$$A_{FPU}(Fr) = \max\{A_{FPunp} + D_{Cvt},\; A_{FXunp},\; A_{MulDiv},\; A_{AddSub}\} + D_{driv}.$$

8.1 Unpacking

FIGURE 8.3 depicts the schematics of an unpacking unit FPUNP which
unpacks two operands FA2 and FB2. For either operand, the unpack
unit comprises some registers (for pipelining), a circuit UNPACK and a
circuit SPECUNP. In addition, there is a circuit NANSELECT which determines
the coding of an output NaN.

Circuit UNPACK

The circuit UNPACK (figure 8.4) has the following control inputs:

dbs, which indicates that a double precision source operand is processed,

and normal, which requests a normalization of the significand.
Figure 8.4: Schematics of the circuit UNPACK

The data inputs are F2[63:0]. Single precision numbers are fed into the
unpacking circuit as the left subword of F2[63:0] (figure 8.1). Input data
are always interpreted as IEEE floating point numbers, i.e.,

$$(s,\; e_{in}[n-1:0],\; f_{in}[1:p-1]) = \begin{cases}
(F2[63],\; F2[62:52],\; F2[51:0]) & \text{if } dbs = 1 \\
(F2[63],\; F2[62:55],\; F2[54:32]) & \text{if } dbs = 0
\end{cases}$$

We now explain the computation of the outputs. The flag

$$einf = 1 \iff e_{in} = 1^n$$

signals that the exponent is that of infinity or NaN. The signals ezd and
ezs indicate a denormal double or single precision input. The flag

$$ez = 1 \iff e_{in} = 0^n$$

signals that the input is denormal.


If the (double or single precision) input is normal, then the corresponding
flag ezd or ezs is 0, the bits $e_{in}[n-1:0]$ are fed into an incrementer, and
the leading bit of the result is inverted. This converts the exponent from biased
to two’s complement format. Sign extension produces an 11-bit two’s
complement number. For normal inputs we therefore have

$$[e[10:0]] = \langle e_{in}\rangle - bias.$$

For denormal inputs the last bit of $e_{in}$ is forced to 1, and the biased
representation of

$$e_{min} = \langle 0^{n-1}1\rangle - bias$$

is fed into the incrementer. We conclude for denormal inputs

$$[e[10:0]] = e_{min}.$$
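The increment-and-invert trick can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names `biased_to_twos` and `twos_value` are hypothetical, exponent fields are modeled as plain integers, and the all-ones exponent (infinity/NaN) is left out because it is flagged separately by einf.

```python
def biased_to_twos(e_in: int, n: int = 11) -> int:
    """Return the n-bit two's complement pattern of <e_in> - bias, bias = 2**(n-1) - 1."""
    mask = (1 << n) - 1
    ez = int(e_in == 0)            # all-zero exponent field: denormal input
    forced = e_in | ez             # force the last bit to 1 -> biased code of e_min
    inc = (forced + 1) & mask      # n-bit incrementer
    return inc ^ (1 << (n - 1))    # invert the leading bit


def twos_value(bits: int, n: int = 11) -> int:
    """Value of an n-bit two's complement pattern."""
    return bits - (1 << n) if bits >> (n - 1) else bits
```

Incrementing adds 1 and inverting the leading bit subtracts 2^(n-1) modulo 2^n, so together the two steps subtract the bias 2^(n-1) - 1 without a full subtractor.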

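The inverted flag $h[0] = \overline{ez}$ satisfies

$$h[0] = \begin{cases} 1 & \text{for normal inputs} \\ 0 & \text{for denormal inputs} \end{cases}$$

Thus, h[0] is the hidden bit of the significand. Padding single precision
significands by 29 trailing zeros extends them to the length of double precision
significands

$$h[1:52] = \begin{cases} F2[51:0] & \text{if } dbs = 1 \\ F2[54:32]\,0^{29} & \text{if } dbs = 0 \end{cases}$$

and we have
$$\langle 0.h[1:52]\rangle = \langle 0.f_{in}[1:p-1]\rangle.$$
Hence, for normal or denormal inputs the binary fraction h[0].h[1:52]
represents the significand and

$$[\![s, e_{in}, f_{in}]\!] = (-1)^s \cdot 2^{[e]} \cdot \langle h\rangle.$$

Let lz be the number of leading zeros of the string h[0:52], then

$$lz = \langle lz[5:0]\rangle.$$

In case of normal = 1 and a non-zero significand, the cyclic left shifter
CLS(53) produces a representation f[0].f[1:52] of a normal significand
satisfying
$$\langle h\rangle = \langle f\rangle \cdot 2^{-lz}.$$

For normal or denormal inputs we can summarize

$$[\![s, e_{in}, f_{in}]\!] = \begin{cases}
(-1)^s \cdot 2^{[e] - lz} \cdot \langle f\rangle & \text{if } normal = 1 \\
(-1)^s \cdot 2^{[e]} \cdot \langle f\rangle & \text{if } normal = 0
\end{cases}$$

Flag fz signals that $f_{in}[1:p-1]$ consists of all zeros:

$$fz = 1 \iff f_{in}[1:p-1] = 0^{p-1}.$$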
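The summary equation can be illustrated by a small software model of the double precision path. The sketch below is not the circuit: `unpack_double`, `value` and `bits_of` are hypothetical helper names, infinities and NaNs are ignored, and the cyclic shifter is modeled by an ordinary left shift, which gives the same result here because only leading zeros are shifted out.

```python
import struct
from fractions import Fraction

BIAS = 1023


def unpack_double(bits: int, normal: bool):
    """Return (s, e, lz, f) for a finite double; f is f[0:52] as a 53-bit integer."""
    s = bits >> 63
    e_in = (bits >> 52) & 0x7FF
    f_in = bits & ((1 << 52) - 1)
    ez = e_in == 0                          # denormal (or zero) input
    e = (1 if ez else e_in) - BIAS          # value [e[10:0]]; e_min for denormals
    h = (int(not ez) << 52) | f_in          # hidden bit h[0] followed by h[1:52]
    lz = 53 - h.bit_length()                # number of leading zeros of h[0:52]
    f = h << lz if normal else h            # normalization shift only if requested
    return s, e, lz, f


def value(s: int, e: int, lz: int, f: int, normal: bool) -> Fraction:
    """Evaluate (-1)^s * 2^(e - lz) * <f> (normal = 1) or (-1)^s * 2^e * <f> (normal = 0)."""
    exponent = e - lz if normal else e
    return (-1) ** s * Fraction(2) ** exponent * Fraction(f, 1 << 52)


def bits_of(x: float) -> int:
    """64-bit pattern of a Python float (IEEE double)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]
```

For the smallest denormal, for example, the model yields lz = 52 and a normalized significand 1.0, while the represented value stays unchanged.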


Signal h[1] is used to distinguish the two varieties of NaN. We chose
h[1] = 0 for the signaling and h[1] = 1 for the quiet variety of NaN (section 7.4.1).
Inputs which are signaling NaNs produce an invalid operation
exception (INV).
The cost of circuit UNPACK can be expressed as

$$C_{Unpack} = 2 \cdot C_{zero}(11) + 2 \cdot C_{zero}(8) + C_{zero}(52) + C_{lz}(53) + C_{inc}(11) + C_{inc}(8) + C_{CLS}(53) + 22 \cdot C_{inv} + C_{mux}(13) + C_{mux}(53) + C_{mux}(52) + 2 \cdot C_{or}.$$

With respect to the delay of circuit UNPACK, we distinguish two sets of
outputs. The outputs reg = (e, lz, f) are directly clocked into a register,
whereas the remaining outputs flag = (s, einf, fz, ez, h) are fed to circuits
SPECUNP and NANSELECT:

$$D_{Unpack}(reg) = D_{zero}(11) + D_{inv} + D_{mux} + \max\{D_{inc}(11) + D_{or},\; D_{lz}(53) + D_{CLS}(53) + D_{mux}\}$$
$$D_{Unpack}(flag) = D_{mux} + \max\{D_{zero}(11) + D_{inv},\; D_{zero}(52)\}.$$

  
From the flags einf, h[1], fz and ez one detects whether the input codes
zero, plus or minus infinity, a quiet or a signaling NaN in an obvious way:

$$ZERO = ez \wedge fz$$
$$INF = einf \wedge fz$$
$$NAN = einf \wedge h[1]$$
$$SNAN = einf \wedge \overline{h[1]} \wedge \overline{fz} = einf \wedge (h[1] \;\mathrm{NOR}\; fz)$$

This computation is performed by the circuit SPECUNP depicted in figure
8.5. This circuit has the following cost and delay:

$$C_{SpecUnp} = 4 \cdot C_{and} + C_{nor}$$
$$D_{SpecUnp} = D_{and} + D_{nor}.$$
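The four equations can be tried out on 64-bit patterns. The following sketch assumes double precision inputs given as integers; the function name `spec_unp` and the dictionary result are illustrative choices, not part of the design.

```python
def spec_unp(bits: int) -> dict:
    """Classify a 64-bit double pattern with the flags einf, ez, fz and h[1]."""
    e_in = (bits >> 52) & 0x7FF
    f_in = bits & ((1 << 52) - 1)
    einf = e_in == 0x7FF                 # exponent of infinity or NaN
    ez = e_in == 0                       # exponent of zero or a denormal number
    fz = f_in == 0                       # fraction field is all zeros
    h1 = bool((f_in >> 51) & 1)          # leading fraction bit: 1 = quiet NaN
    return {
        "ZERO": ez and fz,
        "INF": einf and fz,
        "NAN": einf and h1,
        "SNAN": einf and not (h1 or fz),  # einf AND (h[1] NOR fz)
    }
```

A pattern with all-ones exponent, h[1] = 0 and a non-zero fraction is thus the only one reported as a signaling NaN.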

Circuit NANSELECT

This circuit determines the representation $(s_{nan}, e_{nan}, f_{nan})$ of the output
NaN. According to the specifications of section 7.4, the output NaN provided
by an arithmetic operation is of the quiet variety. Thus,

$$e_{nan} = 1^n \quad\text{and}\quad f_{nan}[1] = 1.$$

Figure 8.5: Circuit SPECUNP

Quiet NaNs propagate through almost every arithmetic operation, i.e., if
one or two input NaNs are involved, none of them signaling, the delivered
result must be one of the input NaNs. If both operands are quiet NaNs, the
a operand is selected. However, in case of an invalid operation INV = 1,
an arbitrary quiet NaN can be chosen. Thus, the circuit NANSELECT
determines the sign and significand of the output NaN as

$$(s_{nan},\; f_{nan}[1:52]) = \begin{cases}
(s_a,\; 1\,h_a[2:52]) & \text{if } NANa = 1 \\
(s_b,\; 1\,h_b[2:52]) & \text{if } NANa = 0
\end{cases}$$

This just requires a 53-bit multiplexer. Thus,

$$C_{NaNselect} = C_{mux}(53)$$
$$D_{NaNselect} = D_{mux}.$$

Cost and Delay
The floating point unpacker FPUNP of figure 8.3 has cost

$$C_{FPunp} = 2 \cdot (C_{Unpack} + C_{SpecUnp} + C_{ff}(75)) + C_{NaNselect} + C_{ff}(53).$$

With fla’ and flb’ we denote the inputs of the registers buffering the flags fla
and flb. These signals are forwarded to the converter CVT and to circuit
FCON; they have delay

$$A_{FPunp}(fla', flb') = D_{Unpack}(flag) + D_{SpecUnp}.$$

Assuming that all inputs of the FPU are provided by registers, the outputs
of the unpacker then have an accumulated delay of

$$A_{FPunp} = \max\{D_{Unpack}(reg),\; A_{FPunp}(fla', flb') + D_{NaNselect}\}.$$


8.2 Addition and Subtraction

8.2.1 Addition Algorithm

Suppose we want to add the representable numbers a and b with IEEE-normal
factorings $(s_a, e_a, f_a)$ and $(s_b, e_b, f_b)$. Without loss of generality we
can assume that
$$\delta = e_a - e_b \ge 0;$$
otherwise we exchange a and b. The sum S can then be written as

$$S = [\![s_a, e_a, f_a]\!] + [\![s_b, e_b, f_b]\!]$$
$$\;\; = (-1)^{s_a} \cdot 2^{e_a} \cdot f_a + (-1)^{s_b} \cdot 2^{e_b} \cdot f_b$$
$$\;\; = 2^{e_a} \cdot ((-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot 2^{-\delta} \cdot f_b).$$

This suggests a so called alignment shift of significand $f_b$ by δ positions
to the right. As δ can become as large as $e_{max} - e_{min}$ this would require
very large shifters. In this situation one replaces the possibly very long
aligned significand $2^{-\delta} \cdot f_b$ by its (p + 1)-representative

$$f' = [2^{-\delta} \cdot f_b]_{p+1}$$

which can be represented as a binary fraction with only p + 2 bits behind
the binary point. Thus, the length of significand $f_b$ is increased by only 3
extra bits. The following theorem implies that the rounded result of the
addition is not affected by this:

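Theorem 8.1. For a non-zero sum S ≠ 0 let $\hat\eta(S) = (s, \hat e, \hat f)$, then

$$S =_{p-\hat e} 2^{e_a} \cdot ((-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot f').$$

Proof. If δ < 3, then
$$f' = 2^{-\delta} \cdot f_b$$
and there is nothing to prove. If δ ≥ 2 then
$$2^{-\delta} \cdot f_b < 2^{-2} \cdot 2 = 1/2.$$

Since $(s_a, e_a, f_a)$ and $(s_b, e_b, f_b)$ are IEEE factorings, neither exponent can
be less than $e_{min}$,
$$e_a \ge e_{min} \quad\text{and}\quad e_b \ge e_{min},$$
and a denormal significand implies that the exponent equals $e_{min}$. Due to
the assumption that $e_a \ge e_b$, we have

$$e_a = e_b + \delta \ge e_{min} + \delta.$$

Thus, for δ ≥ 2, the significand $f_a$ and the factoring $(s_a, e_a, f_a)$ are normal, and hence,
$$|(-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot 2^{-\delta} \cdot f_b| > 1 - 1/2 = 1/2.$$
It follows that

$$\hat e \ge e_a - 1 \ge e_{min} \quad\text{and}\quad p - \hat e \le p + 1 - e_a.$$

Since
$$f' =_{p+1} 2^{-\delta} \cdot f_b$$
and $f_a$ is a multiple of $2^{-(p-1)}$, one concludes

$$(-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot 2^{-\delta} \cdot f_b =_{p+1} (-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot f'$$
$$S =_{p+1-e_a} 2^{e_a} \cdot ((-1)^{s_a} \cdot f_a + (-1)^{s_b} \cdot f').$$

The theorem follows because $p + 1 - e_a \ge p - \hat e$.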
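The statement of theorem 8.1 can be checked exhaustively for a small precision. The sketch below rests on two assumptions that are not spelled out in this section: the α-representative `rep(x, alpha)` is modeled as "keep α bits behind the binary point and append a sticky bit" (the reading suggested by lemma 7.2), and the exponent ê is computed without a lower bound e_min. The exponent e_a is set to 0, and all function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def rep(x: Fraction, alpha: int) -> Fraction:
    """Assumed alpha-representative: chop to alpha fraction bits, add a sticky half unit."""
    unit = Fraction(1, 2 ** alpha)
    q = floor(x / unit)
    return q * unit if q * unit == x else q * unit + unit / 2


def exponent(x: Fraction) -> int:
    """Exponent e with 1 <= |x| * 2^-e < 2 (no e_min bound in this sketch)."""
    e, x = 0, abs(x)
    while x >= 2:
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    return e


def check_theorem(p: int, max_delta: int) -> int:
    """Compare exact and sticky-shortened sums for all p-bit significands; return #cases."""
    checked = 0
    for fa_int in range(2 ** (p - 1), 2 ** p):        # normal f_a in [1, 2)
        fa = Fraction(fa_int, 2 ** (p - 1))
        for fb_int in range(1, 2 ** p):               # normal or denormal f_b
            fb = Fraction(fb_int, 2 ** (p - 1))
            for delta in range(max_delta + 1):
                shifted = fb / 2 ** delta             # exact aligned significand
                f_prime = rep(shifted, p + 1)         # its (p+1)-representative
                for sign_a in (1, -1):
                    for sign_b in (1, -1):
                        exact = sign_a * fa + sign_b * shifted
                        if exact == 0:
                            continue
                        alpha = p - exponent(exact)
                        assert rep(exact, alpha) == rep(sign_a * fa + sign_b * f_prime, alpha)
                        checked += 1
    return checked
```

Calling `check_theorem(5, 10)` walks through all 5-bit significands and shift distances up to 10; an assertion failure would indicate a counterexample to the equivalence.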

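Subtraction

Let a, b and b’ be three representable numbers with factorings $(s_a, e_a, f_a)$,
$(s_b, e_b, f_b)$ and $(\overline{s_b}, e_b, f_b)$. The subtraction of the two numbers a and b can
then be reduced to the addition of the numbers a and b’:

$$a - b = a - (-1)^{s_b} \cdot 2^{e_b} \cdot f_b = a + (-1)^{\overline{s_b}} \cdot 2^{e_b} \cdot f_b = a + b'.$$

8.2.2 Adder Circuitry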

Figure 8.6 depicts an add/subtract unit which is divided into two pipeline
stages. The essential inputs are the following:

the factorings of two operands

$$a = [\![s_a, e_a, f_a]\!], \qquad b = [\![s_b, e_b, f_b]\!]$$

where for (n, p) = (11, 53) the exponents are given as n-bit two’s
complement numbers

$$e_a = [e_a[n-1:0]], \qquad e_b = [e_b[n-1:0]]$$

and the significands are given as binary fractions

$$f_a = \langle f_a[0].f_a[1:p-1]\rangle, \qquad f_b = \langle f_b[0].f_b[1:p-1]\rangle,$$


Figure 8.6: Top level schematics of the add/subtract unit

the flags fla and flb of the two operands,

the rounding mode RM, which is needed for the sign computation,
and

the flag sub which indicates that a subtraction is to be performed.

In case of a subtraction sub = 1, the second operand is multiplied by −1,
i.e., its sign bit gets inverted. Thus, the operand

$$b' = (-1)^{sub} \cdot b$$

has the following factoring

$$(s_b', e_b, f_b) = (s_b \oplus sub,\; e_b,\; f_b).$$

The unit produces a factoring $(s_s, e_s, f_s)$ which, in general, is not a
representation of the exact sum

$$S = a + b'$$

but if $\hat\eta(S) = (s, \hat e, \hat f)$, the output of the unit satisfies

$$S =_{p-\hat e} (-1)^{s_s} \cdot 2^{e_s} \cdot f_s.$$

Thus, the output is rounded to the same result as S. If S is zero, infinite
or a NaN, the result of the add/subtract unit is of course exact.
In the first stage special cases are handled, the operands are possibly exchanged
such that the a-operand has the larger exponent, and an alignment
shift with bounded shift distance is performed. Let

$$\delta = |e_a - e_b|.$$

The first stage outputs sign bits $s_{a2}$, $s_{b2}$, an exponent $e_s$, and significands
$f_{a2}$, $f_{b3}$ satisfying

$$e_s = \max\{e_a, e_b\}$$
$$S =_{p-\hat e} 2^{e_s} \cdot ((-1)^{s_{a2}} \cdot f_{a2} + (-1)^{s_{b2}} \cdot f_{b3}).$$

The second stage adds the significands and performs the sign computation.
This produces the sign bit $s_s$ and the significand $f_s$.

Cost and Delay
Let the rounding mode RM be provided with delay $A_{RM}$. Let the circuit
SIGADD delay the significand $f_s$ by $D_{SigAdd}(fs)$ and the flags fszero and
ss1 by $D_{SigAdd}(flag)$. The cost and cycle time of the add/subtract circuit
and the accumulated delay $A_{AddSub}$ of its outputs can then be expressed as

$$C_{AddSub} = C_{SpecAS} + C_{AlignShift} + C_{xor} + C_{ff}(182) + C_{SigAdd} + C_{SignSelect}$$
$$T_{AddSub} = D_{xor} + \max\{D_{SpecAS},\; D_{AlignShift}\} + \Delta$$
$$A_{AddSub} = \max\{D_{SigAdd}(fs),\; D_{SigAdd}(flag) + D_{SignSelect},\; A_{RM} + D_{SignSelect}\}.$$

   
The circuit ALIGNSHIFT depicted in figure 8.7 is somewhat tricky. Subcircuit
EXPSUB depicted in figure 8.8 performs a straightforward subtraction
of n-bit two’s complement numbers. It delivers an (n + 1)-bit two’s complement
number as[n:0]. We abbreviate

$$as = [as[n:0]]$$

then

$$as = e_a - e_b$$
$$e_a < e_b \iff as < 0 \iff as[n] = 1.$$

This justifies the use of result bit as[n] as the signal ‘eb_gt_ea’ ($e_b$ greater
than $e_a$), and we have
$$e_s = \max\{e_a, e_b\}.$$
Figure 8.7: Circuit ALIGNSHIFT; circuit LRS is a logical right shifter.

Figure 8.8: Circuit EXPSUB

Figure 8.9: Circuit LIMIT which approximates and limits the shift distance

Cost and delay of circuit EXPSUB run at

$$C_{ExpSub} = C_{inv}(11) + C_{add}(12)$$
$$D_{ExpSub} = D_{inv} + D_{add}(12).$$

     + 


The shift distance of an unlimited alignment shift would be

$$\delta = |as|.$$

The obvious way to compute this distance is to complement and then increment
as[n:0] in case as is negative. Because this computation lies on
the critical path of this stage, it makes sense to spend some effort in order
to save the incrementer.
Therefore, circuit LIMIT depicted in figure 8.9 first computes an approximation
$as1[n-1:0]$ of this distance by

$$as1[n-1:0] = \begin{cases} as[n-1:0] & \text{if } as \ge 0 \\ \overline{as[n-1:0]} & \text{if } as < 0 \end{cases}$$

If as ≥ 0, then as[n] = 0 and

$$\langle as1\rangle = \langle as[10:0]\rangle = \delta,$$

i.e., no error is made. If as ≤ −1, then

$$\delta - 1 = -[as[n:0]] - 1 = [\overline{as[n:0]}].$$

Since
$$0 \le \delta - 1 < 2^n - 1$$
we have
$$\langle as1[n-1:0]\rangle = [\overline{as[n:0]}] = \delta - 1.$$
Thus,
$$\langle as1\rangle = \begin{cases} \delta & \text{if } e_a \ge e_b \\ \delta - 1 & \text{if } e_a < e_b \end{cases}$$
Circuit LIMIT of figure 8.9 has the following cost and delay

$$C_{Limit} = C_{inv}(11) + C_{mux}(11) + C_{or}(6) + C_{ORtree}(7)$$
$$D_{Limit} = D_{inv} + D_{mux} + D_{or} + D_{ORtree}(7).$$
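A software model makes the "complement without increment" approximation easy to test. The sketch below assumes n = 11 and a saturation bound B = 63 (the six-bit limit used for the alignment shifter); `exp_sub` and `limit` are hypothetical names, and bit vectors are modeled as Python integers.

```python
N = 11                      # exponent width n
B_BITS = 6                  # width of the limited shift distance as2[5:0]
B = (1 << B_BITS) - 1       # saturation value 63


def exp_sub(ea: int, eb: int) -> int:
    """(n+1)-bit two's complement pattern as[n:0] of ea - eb."""
    return (ea - eb) & ((1 << (N + 1)) - 1)


def limit(as_bits: int):
    """Return (eb_gt_ea, as1, as2) for the difference pattern as[n:0]."""
    eb_gt_ea = as_bits >> N                               # sign bit as[n]
    low = as_bits & ((1 << N) - 1)                        # as[n-1:0]
    as1 = low ^ ((1 << N) - 1) if eb_gt_ea else low       # complement, no increment
    as2 = B if as1 >> B_BITS else as1                     # OR over as1[n-1:6] saturates
    return eb_gt_ea, as1, as2
```

For e_a = 3 and e_b = 5, for instance, the model reports eb_gt_ea = 1 and as1 = 1, i.e. one less than the true distance 2, which is exactly the error that a one-position preshift of the swapped operand can absorb.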

Figure 8.10: Circuit SWAP which swaps the two operands in case of $e_a < e_b$

 
Circuit SWAP in figure 8.10 swaps the two operands in case $e_a < e_b$. In this
case, the representation of significand $f_a$ will be shifted in the alignment
shifter by a shift distance δ − 1 which is smaller by 1 than it should be. In
this situation, the left mux in figure 8.10 preshifts the representation of $f_a$
by 1 position to the right. Hence,

$$(f_{a2}, f_{b2}) = \begin{cases} (f_a,\; f_b) & \text{if } e_a \ge e_b \\ (f_b,\; f_a/2) & \text{if } e_a < e_b \end{cases}$$

It follows that

$$2^{-\langle as1\rangle} \cdot f_{b2} = \begin{cases} 2^{-\delta} \cdot f_b & \text{if } e_a \ge e_b \\ 2^{-\delta} \cdot f_a & \text{if } e_a < e_b \end{cases}$$

Note that operand $f_{b2}$ is padded by a trailing zero and now has 54 bits after
the binary point. The swapping of the operands is done at the following
cost and delay

$$C_{Swap} = C_{mux}(54) + C_{mux}(55)$$
$$D_{Swap} = D_{mux}.$$

3     + 
The right part of circuit LIMIT limits the shift distance of the alignment
shift. Motivated by theorem 8.1, we replace significand $2^{-\langle as1\rangle} \cdot f_{b2}$
by its possibly much shorter (p + 1)–representative

$$f_{b3} = [2^{-\langle as1\rangle} \cdot f_{b2}]_{p+1}.$$

By lemma 7.2, a (p + 1)–representative is computed by a sticky bit which
ORs together all bits starting at position p + 2 behind the binary point.
However, once we have shifted $f_{b2}$ by p + 2 bits to the right, all nonzero
bits of $f_{b2}$ already contribute to the sticky bit computation and further shifting
changes nothing. Hence, the shift distance can be limited to p + 2 = 55.
Figure 8.11: Circuit STICKY which performs the sticky bit computation

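We limit instead the distance to a power of two minus 1. Thus, let

$$b = \lceil \log(p + 3) \rceil = 6$$

and
$$B = 2^b - 1 = \langle 1^b\rangle \ge p + 2,$$
then
$$\langle as1\rangle > B \iff \bigvee_{i=b}^{n-1} as1[i] = 1$$
and
$$\langle as2\rangle = \begin{cases} B & \text{if } \langle as1\rangle > B \\ \langle as1\rangle & \text{otherwise} \end{cases}$$
The alignment shift computation is completed by a 55-bit logical right
shifter (circuit LRS in figure 8.7) and the sticky bit computation depicted in figure 8.11.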

 %   -  
Consider figure 8.12. If $f_{b2}[0:p+1]$ is shifted by $\langle as2\rangle$ bits to the right,
then for each position i bit $f_{b2}[i]$ is moved to position $i + \langle as2\rangle$. The sticky
bit computation must OR together all bits of the shifted operand starting at
position p + 2. The position i such that bit $f_{b2}[i]$ is moved to position p + 2
is the solution of the equation

$$i + \langle as2\rangle = p + 2, \quad\text{i.e.,}\quad i = p + 2 - \langle as2\rangle.$$

The sticky bit then equals

$$sticky = \bigvee_{j \ge p+2-\langle as2\rangle}^{p+1} f_{b2}[j].$$

Figure 8.12: Shifting operand $f_{b2}[0:p+1]$ x bits to the right

This means that the last $\langle as2\rangle$ bits of $f_{b2}[0:p+1]$ must be ORed together.
The last p + 2 outputs of the half decoder in figure 8.11 produce the mask
$$0^{p+2-\langle as2\rangle}\, 1^{\langle as2\rangle}.$$

ANDing the mask bitwise with $f_{b2}$ and ORing the results together produces
the desired sticky bit. Cost and delay of circuit STICKY run at
$$C_{Sticky} = C_{hdec}(6) + 55 \cdot C_{and} + C_{ORtree}(55)$$
$$D_{Sticky} = D_{hdec}(6) + D_{and} + D_{ORtree}(55).$$
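The mask-based sticky bit can be modeled on integers. In the sketch below the operand fb2[0:p+1] is an integer of p + 2 = 55 bits whose least significant bit is position p + 1; the names `half_decoder_mask`, `sticky`, `align` and `reference` are hypothetical, and the reference value uses exact rational arithmetic rather than the circuit.

```python
from fractions import Fraction
from math import floor

P = 53
WIDTH = P + 2                     # bits fb2[0 : p+1]


def half_decoder_mask(as2: int) -> int:
    """Mask 0^(p+2-as2) 1^as2 over positions 0..p+1 (all ones once as2 >= p+2)."""
    return (1 << min(as2, WIDTH)) - 1


def sticky(fb2: int, as2: int) -> int:
    """OR of the last as2 bits of fb2, i.e. of all bits shifted beyond position p+1."""
    return int((fb2 & half_decoder_mask(as2)) != 0)


def align(fb2: int, as2: int) -> int:
    """fb3[0:p+2] as an integer: logical right shift with the sticky bit appended."""
    return ((fb2 >> as2) << 1) | sticky(fb2, as2)


def reference(fb2: int, as2: int) -> Fraction:
    """Exact shifted value chopped to p+1 fraction bits plus a sticky half unit."""
    x = Fraction(fb2, 2 ** (P + 1)) / 2 ** as2
    q = floor(x * 2 ** (P + 1))
    kept = Fraction(q, 2 ** (P + 1))
    return kept if kept == x else kept + Fraction(1, 2 ** (P + 2))
```

Comparing `Fraction(align(fb2, as2), 2 ** (P + 2))` with `reference(fb2, as2)` shows that shifting and appending the sticky bit produces the chopped value with one extra bit, for every distance up to the limit 63.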

 
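The correctness of the first stage now follows from the theorem 8.1 because

$$S =_{p-\hat e} \begin{cases}
2^{e_a} \cdot ((-1)^{s_a} \cdot f_a + (-1)^{s_b'} \cdot [2^{-\delta} \cdot f_b]_{p+1}) & \text{if } e_a \ge e_b \\
2^{e_b} \cdot ((-1)^{s_b'} \cdot f_b + (-1)^{s_a} \cdot [2^{-\delta} \cdot f_a]_{p+1}) & \text{if } e_a < e_b
\end{cases}$$
$$\;\; = 2^{e_s} \cdot ((-1)^{s_{a2}} \cdot f_{a2} + (-1)^{s_{b2}} \cdot [2^{-\langle as1\rangle} \cdot f_{b2}]_{p+1})$$
$$\;\; = 2^{e_s} \cdot ((-1)^{s_{a2}} \cdot f_{a2} + (-1)^{s_{b2}} \cdot f_{b3}). \qquad (8.1)$$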

  +%       


Figure 8.7 depicts the circuit ALIGNSHIFT of the alignment shifter. Since
circuit LIMIT has a much longer delay than circuit SWAP, the cost and the
delay of the alignment shifter can be expressed as
$$C_{AlignShift} = C_{ExpSub} + C_{Limit} + C_{Swap} + C_{Sticky} + C_{LRS}(55) + C_{xor} + C_{mux}(11)$$
$$D_{AlignShift} = D_{ExpSub} + D_{Limit} + \max\{D_{Sticky},\; D_{LRS}(55)\}.$$

 4 
Figure 8.13 depicts the addition/subtraction of the significands $f_{a2}$ and $f_{b3}$.
Let
$$s_x = s_a \oplus s_b' = \begin{cases} 0 & \text{if } s_a = s_b' \\ 1 & \text{if } s_a \ne s_b' \end{cases}$$
the circuit computes

$$sum = f_{a2} + (-1)^{s_x} \cdot f_{b3}.$$

The absolute value of the result is bounded by

$$|sum| \le f_{a2} + f_{b3} \le 2 - 2^{-(p-1)} + 2 - 2^{-(p+2)} < 4.$$

Therefore, both the sum and its absolute value can be represented by a
two’s complement fraction with 3 bits before and p + 2 bits behind the
binary point.

Figure 8.13: Circuit SIGADD which depending on the flag $s_x$ adds or subtracts the
significands $f_{a2}$ and $f_{b3}$
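Converting binary fractions to two’s complement fractions and extending
signs, the circuit SIGADD computes

$$sum = f_{a2} + (-1)^{s_x} \cdot f_{b3}$$
$$\;\; = \langle f_{a2}[0].f_{a2}[1:p-1]\rangle + (-1)^{s_x} \cdot \langle f_{b3}[0].f_{b3}[1:p+2]\rangle$$
$$\;\; = [0\,f_{a2}[0].f_{a2}[1:p-1]\,0^3] + [s_x\,(f_{b3}[0] \oplus s_x).(f_{b3}[1:p+2] \oplus s_x)] + s_x \cdot 2^{-(p+2)}$$
$$\;\; = [0^2\,f_{a2}[0].f_{a2}[1:p-1]\,0^3] + [s_x^2\,(f_{b3}[0] \oplus s_x).(f_{b3}[1:p+2] \oplus s_x)] + s_x \cdot 2^{-(p+2)}$$
$$\;\; = [sum[-2:0].sum[1:p+2]].$$

Figure 8.14 depicts a straightforward computation of

$$|sum| = \langle f_s\rangle = \langle f_s[-1:0].f_s[1:p+2]\rangle.$$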

Figure 8.14: Circuit ABS computes the absolute value of an n-bit two’s complement
number

A zero significand can be detected as

$$fszero = 1 \iff f_s[-1:p+2] = 0 \iff sum[-2:p+2] = 0.$$

Let
$$neg = sum[-2]$$
be the sign bit of the two’s complement fraction $sum[-2:0].sum[1:p+2]$.
Table 8.2 lists for the six possible combinations of $s_a$, $s_b'$ and neg the
resulting sign bit ss1 such that

$$(-1)^{ss1} \cdot f_s = (-1)^{s_a} \cdot f_{a2} + (-1)^{s_b'} \cdot f_{b3} \qquad (8.2)$$

holds. In a brute force way, the sign bit ss1 can be expressed as

ss1  sa  sb  neg  sa  sb  neg  sa  sb  neg


 sb  neg  sa  sb NAND neg

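For the factoring $(ss1, e_s, f_s)$ it then follows from the Equations 8.1 and 8.2
that

$$S = [\![s_a, e_a, f_a]\!] + [\![s_b', e_b, f_b]\!]$$
$$\;\; =_{p-\hat e} 2^{e_s} \cdot ((-1)^{s_{a2}} \cdot f_{a2} + (-1)^{s_{b2}} \cdot f_{b3})$$
$$\;\; = 2^{e_s} \cdot (-1)^{ss1} \cdot f_s.$$

Cost and Delay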
Circuit SIGN generates the sign bit ss1 in a straightforward manner at the
following cost and delay:

$$C_{Sign} = 2 \cdot C_{and} + C_{or} + C_{nand}$$
$$D_{Sign} = D_{and} + D_{or} + D_{nand}.$$
Table 8.2: Possible combinations of the four sign bits $s_a$, $s_b'$, neg and ss1

result            s_a   s_b'   neg   ss1
 f_a2 + f_b3       0     0      0     0
 impossible        0     0      1     *
 f_a2 − f_b3       0     1      0     0
 f_a2 − f_b3       0     1      1     1
−f_a2 + f_b3       1     0      0     1
−f_a2 + f_b3       1     0      1     0
 impossible        1     1      0     *
−f_a2 − f_b3       1     1      1     1

Circuit ABS of figure 8.14 computes the absolute value of an n-bit two’s
complement number. It has cost and delay

$$C_{Abs}(n) = C_{inv}(n-1) + C_{inc}(n-1) + C_{mux}(n-1)$$
$$D_{Abs}(n) = D_{inv} + D_{inc}(n-1) + D_{mux}.$$

For the delay of the significand add circuit SIGADD, we distinguish between
the flags and the significand $f_s$. Thus,

$$C_{SigAdd} = C_{xor}(58) + C_{add}(58) + C_{zero}(58) + C_{Abs}(58) + C_{Sign}$$
$$D_{SigAdd}(flag) = D_{xor} + D_{add}(58) + \max\{D_{zero}(58),\; D_{Sign}\}$$
$$D_{SigAdd}(fs) = D_{xor} + D_{add}(58) + D_{Abs}(58).$$

  
The circuit SPECAS checks whether the operation involves special numbers,
and checks for an invalid operation. Further floating point exceptions
– overflow, underflow and inexact result – will be detected in the rounder.
Circuit SPECAS generates the following three flags:

INFs signals an infinite result,

NANs signals that the result is a quiet NaN, and

INV signals an invalid addition or subtraction.

The circuit gets 8 input flags, four for either operand. For operand a the
inputs comprise the sign bit $s_a$, the flag INFa indicating that a ∈ {+∞, −∞},
and the flags NANa and SNANa. The latter two flags indicate that a is a
quiet NaN or a signaling NaN, respectively. The flags $s_b'$, INFb, NANb,
and SNANb belong to the operand b and have a similar meaning.
According to the specifications of section 7.4.2, an invalid operation
must be signaled in one of two cases: if an operand is a signaling NaN,
or when adding two infinite values with opposite signs. Thus,

$$INV = SNANa \vee SNANb \vee (INFa \wedge INFb \wedge (s_a \oplus s_b')).$$

The result is a quiet NaN whenever one of the operands is a NaN, and in
case of an invalid operation:
$$NANs = INV \vee NANa \vee NANb.$$
According to table 7.3, an infinite result implies that at least
one of the operands is infinite; and in case of an infinite operand, the result
is either infinite or a NaN. Thus, an infinite result can be detected as
$$INFs = (INFa \vee INFb) \wedge \overline{NANs}.$$
Circuit SPECAS generates the three flags along these lines at
$$C_{SpecAS} = 5 \cdot C_{or} + 3 \cdot C_{and} + C_{xor} + C_{inv}$$
$$D_{SpecAS} = D_{xor} + 2 \cdot D_{or} + 2 \cdot D_{and} + D_{inv}.$$
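The three flag equations can be compared against the behaviour of ordinary IEEE doubles. The sketch below is an illustration only: `spec_as` and `flags_of` are hypothetical names, the second sign argument is already the sign after a possible inversion by sub, and Python's NaN is treated as quiet, so the signaling inputs are always false here.

```python
import math


def spec_as(sa, INFa, NANa, SNANa, sb, INFb, NANb, SNANb):
    """Return (INV, NANs, INFs) for an addition of operands with the given flags."""
    INV = SNANa or SNANb or (INFa and INFb and (sa != sb))
    NANs = INV or NANa or NANb
    INFs = (INFa or INFb) and not NANs
    return INV, NANs, INFs


def flags_of(x: float):
    """Sign bit, INF, quiet-NaN and signaling-NaN flags of a Python float."""
    sign = int(math.copysign(1.0, x) < 0)
    return sign, math.isinf(x), math.isnan(x), False
```

Feeding `flags_of(math.inf)` and `flags_of(-math.inf)` into `spec_as` raises INV and NANs, which matches the fact that `math.inf + (-math.inf)` evaluates to a NaN.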

   -  
If the result is a finite non-zero number, circuit SIGADD already provides
the correct sign ss1. However, in case of a zero or infinite result, special
rules must be applied (section 7.4.2). For NaNs, the sign does not matter.
In case of an infinite result, at least one operand is infinite, and the result
retains the same sign. If both operands are infinite, their signs must be
alike. Thus, an infinite result has the following sign
$$ss3 = \begin{cases} s_a & \text{if } INFa \\ s_b' & \text{if } INFb \wedge \overline{INFa} \end{cases}$$
In case of an effective subtraction $s_x = s_a \oplus s_b' = 1$, a zero result is
always positive, except for the rounding mode rd (round down) which is
coded by RM[1:0] = 11. In case of $s_x = 0$, the result retains the same sign
as the a operand. Thus, the sign of a zero result equals

$$ss2 = \begin{cases}
0 & \text{if } s_x \wedge (RM[1] \;\mathrm{NAND}\; RM[0]) \\
1 & \text{if } s_x \wedge RM[1] \wedge RM[0] \\
s_a & \text{if } \overline{s_x}
\end{cases}$$

Depending on the type of the result, its sign $s_s$ can be expressed as

$$s_s = \begin{cases}
ss3 & \text{if } INFs \\
ss2 & \text{if } \overline{INFs} \wedge f_s = 0 \\
ss1 & \text{if } \overline{INFs} \wedge f_s \ne 0
\end{cases}$$
Figure 8.15: Circuit SIGNSELECT selects the appropriate sign $s_s$

The circuit SIGNSELECT of figure 8.15 implements this selection in a
straightforward manner. It also provides a flag ZEROs which indicates
that the sum is zero. This is the case if the result is neither infinite nor a
NaN, and if its significand is zero (fszero = 1). Thus,

$$ZEROs = fszero \wedge (INFs \;\mathrm{NOR}\; NANs).$$

The cost and the maximal delay of circuit SIGNSELECT can be expressed as

$$C_{SignSelect} = 4 \cdot C_{mux} + 2 \cdot C_{and} + C_{nor}$$
$$D_{SignSelect} = D_{and} + \max\{3 \cdot D_{mux},\; D_{nor}\}.$$
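The sign rules for zero and infinite results can be written as one small selection function. This is a sketch, not the multiplexer structure of the circuit: `sign_select` and `sign_of` are hypothetical names, the argument `sb` stands for the sign of the second operand after a possible inversion by sub, and the rounding mode is passed as the two-bit code RM[1:0] with 0b11 meaning round down.

```python
import math


def sign_select(ss1, sa, sb, INFa, INFs, fszero, RM):
    """Select the result sign from ss1 (finite non-zero), ss2 (zero) and ss3 (infinite)."""
    sx = sa ^ sb                                   # effective subtraction
    ss3 = sa if INFa else sb                       # sign of an infinite result
    ss2 = (1 if RM == 0b11 else 0) if sx else sa   # sign of a zero result
    if INFs:
        return ss3
    return ss2 if fszero else ss1


def sign_of(x: float) -> int:
    """Sign bit of a Python float, distinguishing -0.0 from +0.0."""
    return int(math.copysign(1.0, x) < 0)
```

With the round-to-nearest code 0b01 the function reproduces the signs that Python's own double arithmetic gives for sums such as `1.0 + (-1.0)` or `(-0.0) + (-0.0)`.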

8.3 Multiplication and Division

THE UNPACKER delivers unpacked normalized floating point numbers
to the multiply/divide unit. The multiplication of normalized numbers
is straightforward. Specifying and explaining the corresponding circuits
will take very little effort.
Division is more complicated. Let a and b be finite, non-zero, representable
floating point numbers with normal factorings

$$\hat\eta(a) = (s_a,\; e_a - lz_a,\; f_a)$$
$$\hat\eta(b) = (s_b,\; e_b - lz_b,\; f_b).$$

Thus, $f_a, f_b \in [1, 2)$. We will compute the rounded quotient r(a/b) in the
following way:
1. Let $s_q$, $e_q$ and q be defined as

   $$s_q = s_a \oplus s_b$$
   $$e_q = e_a - e_b$$
   $$q = f_a / f_b \in (1/2, 2)$$

   then
   $$a/b = [\![s_q, e_q, q]\!]$$
   and the exponent e’ of the rounded result satisfies

   $$e' \ge e_q - 1.$$

   For $f_d = [q]_{p+1}$, we then have

   $$2^{e_q} \cdot f_d =_{p+1-e_q} 2^{e_q} \cdot q$$

   and hence
   $$2^{e_q} \cdot f_d =_{p-e'} 2^{e_q} \cdot q.$$
   Thus, it suffices to determine $f_d$ and then feed $(s_q, e_q, f_d)$ into the
   rounding unit.

2. In a lookup table, an initial approximation $x_0$ of $1/f_b$ is determined.

3. With an appropriate number i of iterations of the Newton-Raphson
   method a much better approximation $x_i$ of $1/f_b$ is computed. The
   analysis will have to take into account that computations can only be
   performed with finite precision.

4. The value $q' = f_a \cdot x_i$ is an approximation of the quotient $f_a / f_b$. The
   correct representative $f_d$ is determined by comparing the product
   $q' \cdot f_b$ with $f_a$ in a slightly nontrivial way.

8.3.1 Newton-Raphson Iteration

Newton-Raphson iteration is a numerical method for determining a zero of
a real valued function f(x). Consider figure 8.16. One starts with an initial
approximation $x_0$ and then determines iteratively for each i ≥ 0 from $x_i$ a
(hopefully) better approximation $x_{i+1}$. This is repeated until the desired
accuracy is obtained.
In the approximation step of the Newton-Raphson method, one constructs
the tangent to f(x) through the point $(x_i, f(x_i))$ and one defines
$x_{i+1}$ as the zero of the tangent.

Figure 8.16: Newton iteration for finding the zero $\bar x$ of the mapping f(x), i.e.,
$f(\bar x) = 0$. The figure plots the curve of f(x) and its tangents at $f(x_i)$ for i = 0, 1, 2.

From figure 8.16 it immediately follows that
$$f'(x_i) = \frac{f(x_i) - 0}{x_i - x_{i+1}}.$$
Solving this for $x_{i+1}$ gives
$$x_{i+1} = x_i - f(x_i)/f'(x_i).$$
Determining the inverse of a real number $f_b$ is obviously equivalent to
finding the zero of the function
$$f(x) = 1/x - f_b.$$
The iteration step then translates into
$$x_{i+1} = x_i + (1/x_i - f_b) \cdot x_i^2 = x_i \cdot (2 - f_b \cdot x_i).$$
Let $\delta_i = 1/f_b - x_i$ be the approximation error after iteration i, then
$$\delta_{i+1} = 1/f_b - x_{i+1}$$
$$\;\; = 1/f_b - 2 x_i + f_b \cdot x_i^2$$
$$\;\; = f_b \cdot (1/f_b - x_i)^2$$
$$\;\; = f_b \cdot \delta_i^2 < 2 \cdot \delta_i^2.$$
Observe that $\delta_i \ge 0$ for i ≥ 1.
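The quadratic error recurrence is easy to observe with exact rational arithmetic. The sketch below uses Python's `fractions` module so that no rounding interferes; `newton_reciprocal` and `errors` are hypothetical names, and the starting value in the usage note is chosen freely for illustration.

```python
from fractions import Fraction


def newton_reciprocal(fb: Fraction, x0: Fraction, steps: int):
    """Return [x_0, x_1, ..., x_steps] for the iteration x_{i+1} = x_i * (2 - fb * x_i)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x * (2 - fb * x))
    return xs


def errors(fb: Fraction, xs):
    """Approximation errors delta_i = 1/fb - x_i."""
    return [1 / fb - x for x in xs]
```

For fb = 3/2 and x0 = 1/2 the errors are 1/6, 1/24, 1/384, and so on; each is fb times the square of its predecessor, so the number of correct bits roughly doubles per step.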
For later use we summarize the classical argument above in a somewhat
peculiar form:

Lemma 8.2. Let
$$x_{i+1} = x_i \cdot (2 - f_b \cdot x_i),$$
$$\delta_i = 1/f_b - x_i \quad\text{and}$$
$$\delta_{i+1} = 1/f_b - x_{i+1};$$

the approximation error is then bounded by

$$\delta_{i+1} < 2 \cdot \delta_i^2.$$

8.3.2 Initial Approximation

The unpacker delivers a representation $1.f_b[1:p-1]$ of $f_b$ satisfying

$$f_b = \langle 1.f_b[1:p-1]\rangle \in [1, 2).$$

The interval [1, 2) is partitioned into $2^\gamma$ half open intervals of the form

$$[1 + t \cdot 2^{-\gamma},\; 1 + (t + 1) \cdot 2^{-\gamma}).$$

The midpoint of the interval containing $f_b$ is $f_b' = \langle 1.f_b[1:\gamma]1\rangle$. Let
$x' = 1/f_b'$ be the exact inverse of $f_b'$. The initial approximation $x_0$ of $1/f_b$ is
determined by rounding x’ to the nearest multiple of $2^{-(\gamma+1)}$. In case two
multiples are equally near, one rounds up.

Lemma 8.3 below implies that $x_0$ lies in the interval [1/2, 1). Hence $x_0$
can be represented in the form

$$x_0 = \langle 0.x_0[1:\gamma+1]\rangle$$

and, because the bit $x_0[1]$ is always 1, the initial approximation can be stored
in a $2^\gamma \times \gamma$-ROM. The crucial
properties of the initial approximation are summarized in the following
lemma:

Lemma 8.3. The approximation error $\delta_0 = 1/f_b - x_0$ of the initial approximation obeys

$$0 < |\delta_0| = |1/f_b - x_0| < 1.5 \cdot 2^{-(\gamma+1)}.$$

Proof. We first show the upper bound. Consider the mapping f(x) = 1/x as
depicted in figure 8.17. Let u, v ∈ [1, 2] and let u < v, then

$$|f(u) - f(v)| < |v - u| \cdot |f'(u)| \le |v - u|.$$

Figure 8.17: The mapping $g(x) = f(u) + f'(u) \cdot (x - u)$ is the tangent to f(x) = 1/x
at x = u.

Since $|f_b - f_b'| \le 2^{-(\gamma+1)}$, we immediately conclude $|1/f_b - x'| < 2^{-(\gamma+1)}$.
Rounding changes x’ by at most $2^{-(\gamma+2)}$ and the upper bound follows.

For the lower bound we first show that the product of two representable
numbers u and v cannot be 1 unless both numbers are powers of 2. Let $u_i$
and $v_j$ be the least significant nonzero bits of (the representations of) u and
v. The product of u and v then has the form

$$u \cdot v = 2^{-(i+j)} + A \cdot 2^{-(i+j-1)}$$

for some integer A. Thus, the product can only be 1 if A = 0, in which case
the representations of u and v both have a 1 in the single position i or j,
respectively.
Thus, for representable $f_b \ne 1$ any finite precision approximation of $1/f_b$
is inexact, and the lower bound follows for all $f_b \ne 1$.
For $f_b = 1$ we have $f_b' = 1 + 2^{-(\gamma+1)}$. Consider again figure 8.17. The
mapping f(x) = 1/x is convex and lies in the interval (1, 2) entirely under
the line through the points (1, 1) and (2, 1/2). The line has slope −1/2.
Thus,
$$\frac{1}{1+t} = f(1 + t) < 1 - t/2$$
for all t ∈ (0, 1). For $t = 2^{-(\gamma+1)}$ we get
$$x' = f(f_b') < 1 - 2^{-(\gamma+2)}.$$

Thus, x’ cannot be rounded to a number $x_0$ bigger than $1 - 2^{-(\gamma+2)}$.
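The construction of the lookup table can be replayed in exact arithmetic. The sketch below assumes the reading used above: the entry for interval t is the inverse of the interval midpoint, rounded to the nearest multiple of 2^-(γ+1) with ties rounded up. The function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def initial_approximation(t: int, gamma: int) -> Fraction:
    """Table entry x_0 for the interval [1 + t*2^-gamma, 1 + (t+1)*2^-gamma)."""
    midpoint = 1 + Fraction(2 * t + 1, 2 ** (gamma + 1))
    grid = 2 ** (gamma + 1)
    # round 1/midpoint to the nearest multiple of 2^-(gamma+1), ties rounded up
    return Fraction(floor(grid / midpoint + Fraction(1, 2)), grid)


def build_rom(gamma: int):
    """All 2^gamma table entries, indexed by the leading gamma fraction bits of f_b."""
    return [initial_approximation(t, gamma) for t in range(2 ** gamma)]
```

For γ = 8 the first entry is 511/512 and the last one is 1/2, so every entry has a leading fraction bit 1 and only the remaining γ bits would need to be stored.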


 

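8.3.3 Newton-Raphson Iteration with Finite Precision

We establish some notation for arguments about finite precision calculations
where rounding is done by chopping all bits after position σ. For real
numbers f and nonnegative integers σ we define

$$\lfloor f \rfloor_\sigma = \lfloor f \cdot 2^\sigma \rfloor / 2^\sigma$$

then
$$\lfloor f \rfloor_0 = \lfloor f \rfloor.$$
Moreover, if $f = \langle f[i:0].f[1:s]\rangle$ and s ≥ σ, then

$$\lfloor f \rfloor_\sigma = \langle f[i:0].f[1:\sigma]\rangle.$$

Newton-Raphson iteration with precision σ can then be formulated by
the formula
$$x_{i+1} = \lfloor x_i \cdot \lfloor 2 - f_b \cdot x_i \rfloor_\sigma \rfloor_\sigma.$$
Let
$$z = f_b \cdot x_i.$$
Assume z ∈ [0, 2) and let z[0].z[1:s] be a representation of z, i.e.,

$$z = \langle z[0].z[1:s]\rangle.$$

The subtraction of z would require the complementation of z and an
increment in the last position. As computations are imprecise anyway one
would hope that little harm is done – and time is saved – if the increment
is omitted. This is confirmed in

Lemma 8.4. Let z ∈ [0, 2), then

$$0 < 2 - z - \langle \overline{z[0]}.\overline{z[1:\sigma]}\rangle \le 2^{-\sigma}.$$

Proof.
$$2 - z = \langle 10.0^s\rangle - \langle 0z[0].z[1:s]\rangle$$
$$\;\; \equiv \langle 10.0^s\rangle + \langle 1\overline{z[0]}.\overline{z[1:s]}\rangle + 2^{-s} \pmod 4$$
$$\;\; = \langle \overline{z[0]}.\overline{z[1:s]}\rangle + 2^{-s}$$
$$\;\; = \langle \overline{z[0]}.\overline{z[1:\sigma]}\rangle + \sum_{i=\sigma+1}^{s} \overline{z[i]} \cdot 2^{-i} + 2^{-s}$$
$$\;\; \le \langle \overline{z[0]}.\overline{z[1:\sigma]}\rangle + 2^{-\sigma}.$$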
 
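The simplified finite precision Newton-Raphson iteration is summarized
as

$$z_i = f_b \cdot x_i$$
$$A_i = \langle \overline{z_i[0]}.\overline{z_i[1:\sigma]}\rangle$$
$$x_{i+1} = \lfloor x_i \cdot A_i \rfloor_\sigma$$
$$\delta_i = 1/f_b - x_i.$$

For later use we introduce the notation

$$A_i = appr(2 - f_b \cdot x_i).$$

The convergence of this method is analyzed in a technical lemma:

Lemma 8.5. Let σ ≥ 4, let $x_0 \in [1/2, 1)$ and let $0 < |\delta_0| < 1/8$. Then

$$x_{i+1} \in (0, 1) \quad\text{and}$$
$$0 < \delta_{i+1} < 2 \cdot \delta_i^2 + 2^{-(\sigma-1)} \le 1/4$$
for all i ≥ 0.

Proof.
$$\delta_{i+1} = \Delta_1 + \Delta_2 + \Delta_3$$
where

$$\Delta_1 = 1/f_b - x_i \cdot (2 - z_i)$$
$$\Delta_2 = x_i \cdot (2 - z_i) - x_i \cdot A_i$$
$$\Delta_3 = x_i \cdot A_i - \lfloor x_i \cdot A_i \rfloor_\sigma.$$

By the classical analysis in lemma 8.2 we have

$$0 \le \Delta_1 < 2 \cdot \delta_i^2.$$

Because $x_i$ lies in the interval (0, 1), we have

$$0 < z_i = f_b \cdot x_i < 2.$$

Lemma 8.4 implies

$$0 < \Delta_2 = x_i \cdot (2 - z_i - A_i) \le x_i \cdot 2^{-\sigma} < 2^{-\sigma}.$$

Obviously, we have
$$0 \le \Delta_3 < 2^{-\sigma}$$
and the first two inequalities of the lemma follow. By induction we get

$$\delta_{i+1} < 2 \cdot \delta_i^2 + 2^{-(\sigma-1)} \le 1/8 + 1/8 = 1/4.$$

Finally $0 < \delta_{i+1} = 1/f_b - x_{i+1} < 1/4$ implies

$$1/4 < 1/f_b - 1/4 < x_{i+1} < 1/f_b \le 1.$$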

8.3.4 Table Size versus Number of Iterations

The following lemma bounds the number of iterations necessary to reach
p + 2 bits of precision if we truncate intermediate results after σ = 57 bits
and if we start with a table, where γ = 8.

Lemma 8.6. Let σ = 57, let γ = 8 and let

$$i = \begin{cases} 2 & \text{if } p = 24 \\ 3 & \text{if } p = 53 \end{cases} \qquad\text{then}\qquad \delta_i < 2^{-(p+2)}.$$

Proof. By the lemmas 8.3 and 8.5 we have

$$|\delta_0| < 1.5 \cdot 2^{-9}$$
$$\delta_1 < 2 \cdot 1.5^2 \cdot 2^{-18} + 2^{-56} < 4.6 \cdot 2^{-18}$$
$$\delta_2 < 42.32 \cdot 2^{-36} + 2^{-56} < 42.33 \cdot 2^{-36} < 2^{-30}.$$

Thus, i = 2 iterations suffice for single precision.

$$\delta_3 < 3583.7 \cdot 2^{-72} + 2^{-56} < 3.5 \cdot 2^{-62} + 2^{-56} < 2^{-55}.$$

Thus, i = 3 iterations suffice for double precision.

By similar arguments one shows that one iteration less suffices, if one
starts with γ = 15, and one iteration more is needed if one starts with γ = 5
(exercise 8.2). The number of iterations and the corresponding table size
and cost are summarized in table 8.3. We will later use γ = 8.
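The bounds of lemmas 8.4 and 8.6 can be observed on concrete operands with exact rational arithmetic. The sketch below assumes σ = 57 and γ = 8, models the lookup table by recomputing each entry on demand (inverse of the interval midpoint rounded to the nearest multiple of 2^-(γ+1), ties up), and models `appr` as the bitwise complement of z[0].z[1:σ]; all function names are hypothetical.

```python
from fractions import Fraction
from math import floor

SIGMA, GAMMA = 57, 8


def chop(x: Fraction, sigma: int = SIGMA) -> Fraction:
    """Truncate x after sigma bits behind the binary point."""
    return Fraction(floor(x * 2 ** sigma), 2 ** sigma)


def lookup(fb: Fraction) -> Fraction:
    """Initial approximation x_0 selected by the leading GAMMA fraction bits of fb."""
    t = floor((fb - 1) * 2 ** GAMMA)
    midpoint = 1 + Fraction(2 * t + 1, 2 ** (GAMMA + 1))
    grid = 2 ** (GAMMA + 1)
    return Fraction(floor(grid / midpoint + Fraction(1, 2)), grid)


def appr(z: Fraction) -> Fraction:
    """Value of the complemented bits of z[0].z[1:sigma]: 2 - z without the increment."""
    return (2 - Fraction(1, 2 ** SIGMA)) - chop(z)


def reciprocal(fb: Fraction, iterations: int) -> Fraction:
    """Simplified finite precision Newton-Raphson iteration starting from the table."""
    x = lookup(fb)
    for _ in range(iterations):
        x = chop(x * appr(fb * x))
    return x
```

For fb = 1 the iterates are 511/512, then 1 − 2^-18 − 2^-57, and after three steps 1 − 2^-56, so the error stays positive and ends up below 2^-55 as the lemma predicts.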
Table 8.3: Size and cost of the $2^\gamma \times \gamma$ lookup ROM depending on the number of
iterations i, assuming that the cost of a ROM is one eighth the cost of an equally
sized RAM.

                lookup ROM
i    γ     size [K bit]    gate count
1    15    480             139277
2    8     2               647
3    5     0.16            61

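8.3.5 Computing the Representative of the Quotient

By lemma 8.6 we have

$$0 \le 1/f_b - x_i < 2^{-(p+2)}$$
$$x_i \le 1/f_b < x_i + 2^{-(p+2)}$$
$$f_a \cdot x_i \le f_a / f_b = q < f_a \cdot x_i + 2^{-(p+1)}.$$

Thus,
$$\lfloor f_a \cdot x_i \rfloor_{p+1} \le f_a \cdot x_i \le q < f_a \cdot x_i + 2^{-(p+1)} < \lfloor f_a \cdot x_i \rfloor_{p+1} + 2^{-p}.$$

In other words,
$$E = \lfloor f_a \cdot x_i \rfloor_{p+1}$$
is an approximation of q, and the exact quotient lies in the open interval
$(E, E + 2^{-p})$. Moreover, we have

$$f_a / f_b =_{p+1} \begin{cases}
E + 2^{-(p+2)} & \text{if } f_a / f_b < E + 2^{-(p+1)} \\
E + 2^{-(p+1)} & \text{if } f_a / f_b = E + 2^{-(p+1)} \\
E + 3 \cdot 2^{-(p+2)} & \text{if } f_a / f_b > E + 2^{-(p+1)}
\end{cases}$$

In the first case one appends 1 to the representation of E, in the second
case one increments E, and in the third case one increments and appends
1.
For any relation ∘ ∈ {<, =, >} we have
$$f_a / f_b \circ E + 2^{-(p+1)} \iff f_a \circ f_b \cdot (E + 2^{-(p+1)}).$$

Thus, comparison of $f_a$ with the product

$$G = f_b \cdot (E + 2^{-(p+1)})$$

determines which one of the three cases applies, and whether the result is
exact.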
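The three-way case distinction can be exercised for a small precision. In the sketch below the Newton-Raphson result is replaced by a stand-in: `approx_inverse` returns a value x with 0 < 1/fb − x < 2^-(p+2), which is all the selection step relies on. The (p+1)-representative is again modeled as "chop to p + 1 fraction bits and append a sticky bit"; both this reading and all function names are assumptions of the sketch.

```python
from fractions import Fraction
from math import floor


def chop(x: Fraction, bits: int) -> Fraction:
    """Truncate x after the given number of bits behind the binary point."""
    return Fraction(floor(x * 2 ** bits), 2 ** bits)


def approx_inverse(fb: Fraction, p: int) -> Fraction:
    """Stand-in for x_i: strictly below 1/fb, error less than 2^-(p+2)."""
    return chop(1 / fb, p + 3) - Fraction(1, 2 ** (p + 3))


def select_representative(fa: Fraction, fb: Fraction, xi: Fraction, p: int) -> Fraction:
    """Determine f_d by comparing f_a with G = f_b * (E + 2^-(p+1))."""
    ulp = Fraction(1, 2 ** (p + 1))
    E = chop(fa * xi, p + 1)
    G = fb * (E + ulp)
    if fa < G:
        return E + ulp / 2          # append a 1 to E
    if fa == G:
        return E + ulp              # increment E: the quotient is exact
    return E + 3 * ulp / 2          # increment E and append a 1


def representative(q: Fraction, alpha: int) -> Fraction:
    """Assumed alpha-representative of q: chopped value plus a sticky half unit."""
    kept = chop(q, alpha)
    return kept if kept == q else kept + Fraction(1, 2 ** (alpha + 1))
```

For f_a = f_b the comparison finds f_a = G and returns exactly 1, the only case of the three in which the quotient itself is representable with p + 1 fraction bits.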
Figure 8.18: Top level schematics of the multiply/divide unit

8.3.6 Multiplier and Divider Circuits

The multiply/divide unit depicted in figure 8.18 is partitioned in a natural way into units
1. SIGN/EXPMD producing the sign sq and the exponent eq,
2. SIGFMD producing the significand fq and
3. SPECMD handling special cases.

The essential inputs for the unit are the sign bit, the exponent, the significand, and the number of leading zeros for two operands a and b satisfying
\[ a = (-1)^{s_a} \cdot 2^{e_a - lz_a} \cdot f_a, \qquad b = (-1)^{s_b} \cdot 2^{e_b - lz_b} \cdot f_b, \]
where for $(n, p) = (11, 53)$ the exponents are given as n-bit two's complement numbers
\[ e_a = [e_a[n-1:0]], \qquad e_b = [e_b[n-1:0]], \]
the significands are given as binary fractions
\[ f_a = \langle f_a[0].f_a[1:p-1] \rangle, \qquad f_b = \langle f_b[0].f_b[1:p-1] \rangle, \]
and for
\[ r = \lceil \log p \rceil \]
the numbers of leading zeros are given as r-bit binary numbers
\[ lz_a = \langle lz_a[r-1:0] \rangle, \qquad lz_b = \langle lz_b[r-1:0] \rangle. \]
In the absence of special cases the factorings are normalized, and thus
\[ f_a, f_b \in [1, 2). \]
For operations $\circ \in \{\cdot, /\}$, let
\[ x = a \circ b \]
be the exact result of the operation performed, and let
\[ \hat\eta(x) = (s, \hat e, \hat f). \]
In the absence of special cases, the unit has to produce a factoring $(s_q, e_q, f_q)$ satisfying
\[ [\![ s_q, e_q, f_q ]\!] =_{p - \hat e} a \circ b. \]
Cost and Delay
Circuit SIGFMD which produces the significand fq has an internal register stage. Thus, the cost and the cycle time of the multiply/divide circuit and the accumulated delay $A_{MulDiv}$ of its outputs can be expressed as
\[ C_{MulDiv} = C_{SigfMD} + C_{SignExpMD} + C_{SpecMD} + C_{ff}(72) \]
\[ T_{MulDiv} = \max\{ D_{SpecMD} + \Delta,\ D_{SignExpMD} + \Delta,\ T_{SigfMD} \} \]
\[ A_{MulDiv} = A_{SigfMD}. \]

   1  -  
Figure 8.19 depicts the circuit SIGN/EXPMD for the computation of the sign and the exponent. The computation of the sign
\[ s_q = s_a \oplus s_b \]
is trivial. The computation of the exponent is controlled by signal fdiv which distinguishes between multiplications and divisions. The exponent is computed as
\[ e_q = \begin{cases} e_a - lz_a + e_b - lz_b & \text{if } \overline{fdiv} \text{ (multiply)} \\ e_a - lz_a - e_b + lz_b & \text{if } fdiv \text{ (divide).} \end{cases} \]

We can estimate eq by

2n  2  2  emax eq 2  emin  p  2n1 

Therefore, the computation is performed with (n + 2)-bit two's complement numbers. Circuit SIGN/EXPMD has the following cost and delay:
\[ C_{SignExpMD} = C_{xor} + 23 \cdot C_{inv} + C_{mux}(11) + C_{mux}(13) + C_{4/2add}(13) + C_{add}(13) \]
\[ D_{SignExpMD} = D_{inv} + D_{mux} + D_{4/2add}(13) + D_{add}(13). \]
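As a behavioural illustration of the sign and exponent path, the sketch below works on Python integers instead of bit vectors. The function names are hypothetical; `to_twos_complement` only shows how a value would appear on a 13-bit bus such as eq[12:0].

```python
def sign_exp_md(sa, ea, lza, sb, eb, lzb, fdiv):
    """Sign and exponent of a product (fdiv = 0) or quotient (fdiv = 1)."""
    sq = sa ^ sb                               # sign is the XOR of the operand signs
    if fdiv:
        eq = ea - lza - eb + lzb               # divide
    else:
        eq = ea - lza + eb - lzb               # multiply
    return sq, eq

def to_twos_complement(value, width):
    """Bit string of `value` as a two's complement number of the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")

sq, eq = sign_exp_md(0, 3, 0, 1, -2, 1, fdiv=1)
print(sq, eq, to_twos_complement(eq, 13))
```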
Figure 8.19: Circuit SIGN/EXPMD. Inputs: sa, sb, ea[10:0], lza[5:0], eb[10:0], lzb[5:0], fdiv. The operands are combined in a 4/2-adder(13) followed by an adder(13). Outputs: sq, eq[12:0].

Significand Multiplication


Let a and b be the two operands of the floating point multiplication. In case that the operand a is a finite non-zero number, its significand $f_a$ is normalized. The same holds for the significand $f_b$ of operand b. Hence
\[ f_a \cdot f_b \in [1, 4). \]
Let
\[ x = a \cdot b \quad \text{and} \quad \hat\eta(x) = (s_q, \hat e, \hat f), \]
then
\[ \hat e \ge e_a - lz_a + e_b - lz_b = e_q. \]
Unit SIGFMD depicted in figure 8.20 performs the significand computation of the multiply/divide unit. The multiplication algorithm shares a 58-bit multiplier with the division algorithm. Therefore, the significands are extended by 5 trailing zeros to length 58. Wallace tree, adder and sticky bit computation produce a 54-representative $f_m$ of the product:
\[ \langle f_m[-1:55] \rangle = [\, \langle f_a[0].f_a[1:52]0^5 \rangle \cdot \langle f_b[0].f_b[1:52]0^5 \rangle \,]_{54}. \]
Hence
\[ [\![ s_q, e_q, f_m ]\!] = (-1)^{s_q} \cdot 2^{e_q} \cdot [f_a \cdot f_b]_{54} =_{54 - e_q} (-1)^{s_q} \cdot 2^{e_q} \cdot f_a \cdot f_b =_{54 - \hat e} (-1)^{s_q} \cdot 2^{e_q} \cdot f_a \cdot f_b. \]
For both single and double precision computations we have $p \le 54$ and therefore
\[ [\![ s_q, e_q, f_m ]\!] =_{p - \hat e} (-1)^{s_q} \cdot 2^{e_q} \cdot f_a \cdot f_b. \]

Figure 8.20: Circuit SIGFMD performing the division and multiplication of the significands. Main components: drivers faadoe, fbbdoe, Aadoe, xadoe, xbdoe, Eadoe onto the operand busses opa[0:57] and opb[0:57]; a 256 x 8 lookup table; 4/2mulTree(58, 58); adder(116); an OR tree for the sticky bit; registers x, A, E, Da, Db, Eb; circuit Select fd; and a multiplexer controlled by fdiv which selects fq[-1:55] from fd[-1:55] and fm[-1:55].

Significand Division


Significand division is performed by unit SIGFMD (figure 8.20) under the control of the counter Dcnt depicted in figure 8.21 and the FSD of figure 8.22. The corresponding RTL instructions in table 8.4 summarize the steps of the iterative division as it was outlined above. A Newton-Raphson iteration step comprises two multiplications, each of which takes two cycles. Thus, a single iteration takes four cycles; the corresponding states are denoted by Newton 1 to Newton 4. The counter Dcnt counts the number of iterations. During the table lookup, the counter is set to the number of iterations required (i.e., 2 for single and 3 for double precision), and during each iteration, Dcnt is counted down. After state lookup we have
\[ x = x_0 \quad \text{and} \quad Dcnt = dcnt_0 = (db\ ?\ 3 : 2). \]
After the i-th execution of state Newton 4 we have
\[ A = A_{i-1}, \qquad x = x_i, \qquad Dcnt = dcnt_0 - i. \]
The loop is left after $i = dcnt_0$ iterations. For this i, we have after state quotient 4
\[ E = \lfloor f_a \cdot x_i \rfloor_{p+1}, \qquad Eb = E \cdot f_b. \]
After state quotient 2 we already have
\[ f_a = Da \quad \text{and} \quad f_b = Db. \]
Note that for single precision, E is truncated after position p + 1 = 25.

Figure 8.21: Iteration counter Dcnt.

Figure 8.22: FSD underlying the iterative division (unpack, lookup, Newton 1 to Newton 4, quotient 1 to quotient 4, select fd, round 1, round 2). The states Newton 1 to Newton 4 represent one Newton-Raphson iteration; after Newton 4 the loop is repeated if Dcnt > 0 and left if Dcnt = 0. Dcnt counts the number of iterations; it is counted down.
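The control flow of the iterative division can be mimicked with exact rationals. The sketch below rests on stated assumptions and is not the hardware algorithm itself: `lookup` is a hypothetical table (reciprocal of the interval midpoint rounded to $\gamma + 1$ fraction bits), and `appr(2 - x*b)` is modelled as a bitwise complement of the truncated product, i.e. $2 - z - 2^{-57}$. Under these assumptions the final E satisfies $E < f_a/f_b < E + 2^{-p}$.

```python
from fractions import Fraction
from math import floor

SIGMA = 57   # fraction bits kept in intermediate results
GAMMA = 8    # fraction bits of fb addressing the (hypothetical) table

def trunc(x, bits):
    """Chop x after `bits` fraction bits."""
    return Fraction(floor(x * 2 ** bits), 2 ** bits)

def lookup(fb):
    """Assumed table entry: 1/midpoint of fb's interval, rounded to GAMMA+1 bits."""
    mid = trunc(fb, GAMMA) + Fraction(1, 2 ** (GAMMA + 1))
    return Fraction(round(2 ** (GAMMA + 1) / mid), 2 ** (GAMMA + 1))

def divide_significands(fa, fb, db):
    """Return (E, Eb) as produced after state quotient 4."""
    p = 53 if db else 24
    x = lookup(fb)                              # state lookup
    dcnt = 3 if db else 2                       # Dcnt = db ? 3 : 2
    while dcnt > 0:
        dcnt -= 1                               # Newton 1/2: count down
        z = trunc(x * fb, SIGMA)
        A = 2 - z - Fraction(1, 2 ** SIGMA)     # assumed model of appr(2 - x*b)
        x = trunc(A * x, SIGMA)                 # Newton 3/4
    E = trunc(fa * x, p + 1)                    # quotient 1/2
    Eb = E * fb                                 # quotient 3/4
    return E, Eb

E, Eb = divide_significands(Fraction(3, 2), Fraction(5, 4), db=1)
print(float(E), float(Fraction(3, 2) / Fraction(5, 4)))
```

Truncating downwards in every step keeps each $x_i$ strictly below $1/f_b$, which is what places the quotient in the open interval above E.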

Circuit SELECT FD

Figure 8.23 depicts the circuit selecting the (p + 1)-representative fd of the quotient q according to the RTL instructions of state select fd. Since
\[ E' = E + 2^{-(p+1)}, \]
its computation depends on the precision. For double precision (p = 53), we have
\[ \langle E'[0].E'[1:54] \rangle = \langle E[0].E[1:54] \rangle + 2^{-54}. \]
For single precision (p = 24), E was truncated after position p + 1 = 25. Thus,
\[ \langle E'[0].E'[1:25] \rangle = \langle E[0].E[1:25] \rangle + 2^{-25} = \langle E[0].E[1:25] \rangle + \sum_{i=26}^{54} 2^{-i} + 2^{-54} = \langle E[0].E[1:25]\,1^{29} \rangle + 2^{-54}. \]

Table 8.4: RTL instructions of the iterative division (significand only). A multiplication always takes two cycles.

  state          RTL instruction                                control signals
  unpack         normalize FA, FB
  lookup         x = table(fb)                                  xce, tlu, fbbdoe
                 Dcnt = db ? 3 : 2                              Dcntce
  Newton 1/2     Dcnt = Dcnt - 1                                Dcntce, xadoe, fbbdoe
                 A = appr(2 - x * b)_57                         Ace
  Newton 3/4     x = floor(A * x)_57                            Aadoe, xbdoe, sce, cce
                                                                xce
  quotient 1/2   E = floor(a * x)_{p+1}                         faadoe, xbdoe, sce, cce
                 Da = fa, Db = fb                               faadoe, fbbdoe, Dce, Ece
  quotient 3/4   Eb = E * fb                                    Eadoe, fbbdoe, sce, cce
                                                                Ebce
  select fd      E' = E + 2^-(p+1)
                 β = fa - Eb - 2^-(p+1) * fb
                 fd = E + 2^-(p+2)    if β < 0
                 fd = E'              if β = 0
                 fd = E' + 2^-(p+2)   if β > 0
  round 1/2      round (sq, eq, fq)

Figure 8.23: Circuit SELECT FD which selects the representative of the exact q. Inputs: Da[0:57], Db[0:57], E[0:25], E[26:54], Eb[0:114], db. Main components: inc(55) producing E'[-1:54]; a shifter selecting fsb[25:111]; 3/2 adder(116), adder(117) and zero(117) producing neg and beta; multiplexers selecting r[-1:54]. Outputs: fd[-1:25], fd[26], fd[27:54], fd[55].


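The computation of value β also depends on the precision p. Operand $f_b$, which is taken from register Db, is first shifted p + 1 positions to the right:
\[ \langle 0.0^{24}\, fsb[25:111] \rangle = \begin{cases} \langle 0.0^{24}\, 0^{29}\, Db[0:57] \rangle & \text{if } db \\ \langle 0.0^{24}\, Db[0:57]\, 0^{29} \rangle & \text{if } \overline{db} \end{cases} \;=\; 2^{-(p+1)} \cdot f_b. \]
Now β can be computed as
\[ \beta = f_a - Eb - 2^{-(p+1)} \cdot f_b \]
\[ = [0\, Da[0].Da[1:57]\, 0^{57}] - [0\, Eb[0].Eb[1:114]] - [0\, 0.0^{24}\, fsb[25:111]\, 0^3] \]
\[ = [0\, Da[0].Da[1:57]\, 0^{56} 1] + [1\, \overline{Eb[0]}.\overline{Eb[1:114]}] + [1\, 1.1^{24}\, \overline{fsb[25:111]}\, 1^3] + 2^{-114}. \]
The output significand fd is computed in the following way: let
\[ r = \begin{cases} E & \text{if } \beta < 0 \\ E' & \text{if } \beta \ge 0, \end{cases} \]
then
\[ fd = \begin{cases} r & \text{if } \beta = 0 \\ r + 2^{-(p+2)} & \text{if } \beta \ne 0. \end{cases} \]
Thus, in case β ≠ 0 one has to force bit fd[p + 2] to 1.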

Cost and Delay
Figure 8.23 depicts circuit SELECT FD which selects the representative of the quotient. The cost and the delay of this circuit run at
\[ C_{SelectFd} = C_{inc}(55) + C_{mux}(29) + C_{mux}(56) + C_{mux} + C_{mux}(87) + C_{3/2add}(116) + C_{add}(117) + C_{zero}(117) + 203 \cdot C_{inv} + C_{and} \]
\[ D_{SelectFd} = 2 \cdot D_{mux} + \max\{ D_{inc}(55) + D_{mux},\ 2 \cdot D_{inv} + D_{3/2add}(116) + D_{add}(117) + D_{zero}(117) \}. \]

Circuit SELECT FD is part of the circuit which performs the division and multiplication of the significands. The data paths of circuit SIGFMD have the following cost
\[ C_{SigfMD} = 6 \cdot C_{driv}(58) + 5 \cdot C_{ff}(58) + 3 \cdot C_{ff}(116) + C_{ROM}(256 \times 8) + C_{mux}(58) + C_{4/2mulTree}(58, 58) + C_{add}(116) + C_{inv}(58) + C_{and}(29) + C_{ORtree}(60) + C_{mux}(57) + C_{SelectFd}. \]
The counter Dcnt and the control automaton modeled by figure 8.22 have been ignored. The accumulated delay of output fq and the cycle time of circuit SIGFMD can be expressed as:
\[ A_{SigfMD} = \max\{ D_{SelectFd},\ D_{add}(116) + D_{ORtree}(60) \} + D_{mux} \]
\[ T_{SigfMD} = \max\{ D_{driv} + D_{ROM}(256 \times 8) + D_{mux},\ D_{driv} + D_{4/2mulTree}(58, 58),\ D_{add}(116) + D_{mux} \} + \Delta. \]

1     


The circuit SPECMD checks whether special operands are involved, i.e., whether an operand is zero, infinite or a NaN. In such a case, the result cannot be a finite, non-zero number. The circuit signals the type of such a special result by the three flags ZEROq, INFq and NANq according to the tables 7.4 and 7.5.
The circuit also detects an invalid operation (INV) and a division by zero (DBZ). These two IEEE floating point exceptions can only occur when special operands are involved, whereas for the remaining floating point exceptions (overflow, underflow and inexact result) both operands must be finite, non-zero numbers. Thus, OVF, UNF and INX will be detected by a different circuit during rounding (section 8.4).
For each of the two operands, the circuit SPECMD gets four input flags which indicate its type (ZERO, INF, NAN, and SNAN). Most of the output flags are generated in two steps. First, two sets of flags are generated, one for the multiplication and one for the division. The final set of flags is then selected based on the control signal fdiv which distinguishes between multiplication and division.
1  ,
According to section 7.4, the flag DBZ (division by zero) is only activated when a finite, non-zero number is divided by zero. Thus,
\[ DBZ = fdiv \wedge ZEROb \wedge \overline{ZEROa \vee INFa \vee NANa \vee SNANa}. \]
The flag INVm signals an invalid multiplication. According to the specification of section 7.4.3, it is raised when an operand is a signaling NaN or when multiplying a zero with an infinite number:
\[ INVm = (INFa \wedge ZEROb) \vee (ZEROa \wedge INFb) \vee SNANa \vee SNANb. \]
The flag INVd which indicates an invalid division is signaled in the following three cases (section 7.4.4): when an operand is a signaling NaN, when both operands are zero, or when both operands are infinite. Thus,
\[ INVd = (ZEROa \wedge ZEROb) \vee (INFa \wedge INFb) \vee SNANa \vee SNANb. \]
The IEEE exception flag INV is selected based on the type of the operation
\[ INV = \begin{cases} INVm & \text{if } \overline{fdiv} \\ INVd & \text{if } fdiv. \end{cases} \]

 8 -
The flags NANq, INFq and ZEROq which indicate the type of a special result are generated according to the tables 7.4 and 7.5.
The result is a quiet NaN whenever one of the operands is a NaN, and in case of an invalid operation; this is the same for multiplications and divisions. Since signaling NaNs are already covered by INV, the flag NANq can be generated as
\[ NANq = INV \vee NANa \vee NANb. \]
The result of a multiplication can only be infinite if at least one of the operands is infinite. However, if the other operand is a zero or a NaN, the result is a NaN. Thus, the flag INFm signaling an infinite product can be computed as
\[ INFm = (INFa \vee INFb) \wedge \overline{NANq}. \]
The result of a division can only be infinite when an infinite numerator or a zero denominator is involved. In case of DBZ, the result is always infinite, whereas in case of an infinite numerator, the result can also be a NaN. Thus,
\[ INFd = (INFa \wedge \overline{NANq}) \vee DBZ. \]
The flag INFq is then selected as
\[ INFq = \begin{cases} INFm & \text{if } \overline{fdiv} \\ INFd & \text{if } fdiv. \end{cases} \]
The flags ZEROm and ZEROd which indicate a zero product or quotient are derived from the tables 7.4 and 7.5 along the same lines. In case of a zero product, at least one of the operands must be zero. A zero quotient requires a zero numerator or an infinite denominator. Thus,
\[ ZEROm = (ZEROa \vee ZEROb) \wedge \overline{NANq} \]
\[ ZEROd = (ZEROa \vee INFb) \wedge \overline{NANq} \]
\[ ZEROq = \begin{cases} ZEROm & \text{if } \overline{fdiv} \\ ZEROd & \text{if } fdiv. \end{cases} \]
The circuit SPECMD generates all these flags along these lines. It has the following cost and delay:
\[ C_{SpecMD} = 10 \cdot C_{and} + 12 \cdot C_{or} + C_{nor} + C_{inv} + 3 \cdot C_{mux} \]
\[ D_{SpecMD} = 2 \cdot D_{and} + 4 \cdot D_{or} + D_{inv} + 2 \cdot D_{mux}. \]
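The flag equations translate almost literally into Boolean code. The sketch below is a behavioural model only; the dictionary encoding of the operand flags and the helper `operand` are assumptions made for the illustration, and an SNAN operand is modelled as setting only its SNAN flag.

```python
def operand(kind):
    """Flag set of an operand; kind is 'ZERO', 'INF', 'NAN', 'SNAN' or 'FINITE'."""
    return {name: name == kind for name in ("ZERO", "INF", "NAN", "SNAN")}

def spec_md(fdiv, a, b):
    """Special-case flags of the multiply/divide unit for operand flag sets a and b."""
    dbz = fdiv and b["ZERO"] and not (a["ZERO"] or a["INF"] or a["NAN"] or a["SNAN"])
    snan = a["SNAN"] or b["SNAN"]
    inv_m = (a["INF"] and b["ZERO"]) or (a["ZERO"] and b["INF"]) or snan
    inv_d = (a["ZERO"] and b["ZERO"]) or (a["INF"] and b["INF"]) or snan
    inv = inv_d if fdiv else inv_m
    nan_q = inv or a["NAN"] or b["NAN"]
    inf_m = (a["INF"] or b["INF"]) and not nan_q
    inf_d = (a["INF"] and not nan_q) or dbz
    zero_m = (a["ZERO"] or b["ZERO"]) and not nan_q
    zero_d = (a["ZERO"] or b["INF"]) and not nan_q
    return {
        "DBZ": dbz,
        "INV": inv,
        "NANq": nan_q,
        "INFq": inf_d if fdiv else inf_m,
        "ZEROq": zero_d if fdiv else zero_m,
    }

print(spec_md(True, operand("FINITE"), operand("ZERO")))   # finite / 0
print(spec_md(False, operand("INF"), operand("ZERO")))     # inf * 0
```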

8.4 Floating Point Rounder

The floating point rounder FPRND of figure 8.24 implements 'tiny before rounding' and the 'type b' loss of accuracy (i.e., inexact result). The rounder FPRND consists of two parts
• circuit RND which performs the rounding of a finite, non-zero result x specified by the input factoring $(s, e_r, f_r)$, and
• circuit SPECRND which handles the special inputs zero, infinity, and NaN. Such an input is signaled by the flags flr. This circuit also checks for IEEE floating point exceptions.

Cost and Delay
All the inputs of the floating point rounder have zero delay since they are taken from registers. Thus, the cost and cycle time of the rounder FPRND and the accumulated delay $A_{FPrnd}$ of its outputs run at
\[ C_{FPrnd} = C_{NormShift} + C_{REPp} + C_{ff}(140) + C_{SigRnd} + C_{PostNorm} + C_{AdjustExp} + C_{ExpRnd} + C_{SpecFPrnd} \]
\[ T_{FPrnd} = A_{NormShift} + D_{REPp} + \Delta \]
\[ A_{FPrnd} = A_{SigRnd} + D_{PostNorm} + D_{AdjustExp} + D_{ExpRnd} + D_{SpecFPrnd}. \]
Figure 8.24: Schematics of the floating point rounder FPRND. Inputs: UNF/OVFen, fr, er, s, flr, RM. Circuit RND consists of the stages NormShift, REPp, SigRnd, PostNorm, AdjustExp and ExpRnd, with intermediate signals fn, en, eni, f1[0:54], f2, e2, f3, e3 and flags TINY, OVF1, SIGovf, OVF, SIGinx; its outputs eout, fout feed circuit SpecFPrnd, which produces IEEEp and Fp[63:0].

8.4.1 Specification

Let $x \in \mathbb{R} \setminus \{0\}$ be the exact, finite result of an operation, and let
\[ y = \begin{cases} x \cdot 2^{-\alpha} & \text{if } OVF \wedge OVFen \\ x \cdot 2^{\alpha} & \text{if } UNF \wedge UNFen \\ x & \text{otherwise} \end{cases} \qquad (8.3) \]
\[ \hat\eta(x) = (s, \hat e, \hat f) \]
\[ \eta(y) = (s, e, f). \]
The purpose of circuit RND (figure 8.24) is to compute the normalized, packed output factoring $(s, e_{out}, f_{out})$ such that $[\![ s, e_{out}, f_{out} ]\!] = r(y)$, i.e.,
\[ (s, e_{out}, f_{out}) = (s, exprd(s, post(e, sigrd(s, f)))). \qquad (8.4) \]
Moreover the circuit produces the flags TINY, OVF and SIGinx. The exponent in the output factoring is in biased format. The inputs to the circuit are
• the mask bits UNFen and OVFen (underflow / overflow enable)
• the rounding mode RM[1:0]
• the signal dbr (double precision result) which defines
\[ (n, p) = \begin{cases} (11, 53) & \text{if } dbr = 1 \\ (8, 24) & \text{otherwise} \end{cases} \]
• the factoring $(s, e_r, f_r)$, where $e_r[12:0]$ is a 13-bit two's complement number and $f_r[-1:55]$ has two bits left of the binary point.

The input factoring has only to satisfy the following two conditions:
• the input factoring approximates x well enough, i.e.,
\[ [\![ s, e_r, f_r ]\!] =_{p - \hat e} x \qquad (8.5) \]
• $f_r[-1:0] = 00$ implies OVF = 0. Thus, if x is large then $f_r \in [1, 4)$.

By far the most tricky part of the rounding unit is the normalization shifter NORMSHIFT. It produces an approximated overflow signal
\[ OVF1 \iff 2^{e_r} \cdot f_r \ge 2^{e_{max}+1} \]
which can be computed before significand rounding takes place. The resulting error is characterized by

Lemma. Let $OVF2 = OVF \wedge \overline{OVF1}$, then OVF2 implies
\[ \hat e = e_{max} \quad \text{and} \quad sigrd(s, \hat f) = 2. \]

Proof. By definition (section 7.3.1), a result x causes an overflow if
\[ |\hat r(x)| > X_{max}. \]
According to equation 7.8, such an overflow can be classified as an overflow before or after rounding:
\[ OVF(x) \iff \hat e > e_{max} \ \text{ or } \ (\hat e = e_{max} \text{ and } sigrd(s, \hat f) = 2). \]
Since $\overline{OVF1}$ implies
\[ |x| =_{p - \hat e} 2^{e_r} \cdot f_r < 2^{e_{max}+1}, \]
we have $\hat e \le e_{max}$, and the lemma follows.


Thus, the flag OVF1 signals an overflow before rounding, whereas the flag OVF2 signals an overflow after rounding.
The outputs of the normalization shifter NORMSHIFT will satisfy
\[ e_n = \begin{cases} e + \alpha & \text{if } OVF2 \wedge OVFen \\ e & \text{otherwise} \end{cases} \qquad (8.6) \]
\[ f_n =_p f. \]
The normalization shifter also produces output $e_{ni} = e_n + 1$. Both exponents will be in biased format.
The effect of the circuits REPP, SIGRND and POSTNORM is specified by the equations:
\[ f_1 = [f_n]_p \]
\[ f_2 = sigrd(s, f_1) \qquad (8.7) \]
\[ (e_2, f_3) = post(e_n, f_2). \]
Circuit SIGRND also provides the flag SIGinx indicating that the rounded significand $f_2$ is not exact:
\[ SIGinx = 1 \iff f_2 \ne f_1. \]
After post normalization, the correct overflow signal is known and the error produced by the approximated overflow signal OVF1 can be corrected in circuit ADJUSTEXP. Finally, the exponent is rounded in circuit EXPRND.
\[ (e_3, f_3) = \begin{cases} (e_{max} + 1 - \alpha,\ 1) & \text{if } OVF2 \wedge OVFen \\ (e_2, f_3) & \text{otherwise} \end{cases} \qquad (8.8) \]
\[ (e_{out}, f_{out}) = exprd(s, e_3, f_3). \]
In addition, circuit EXPRND converts the result into the packed IEEE format, i.e., bit $f_{out}[0]$ is hidden, and $e_{min}$ is represented by $0^n$ in case of a denormal result.
With the above specifications of the subcircuits in place, we can show in a straightforward way:

Theorem. If the subcircuits satisfy the above specifications, then equation (8.4) holds, i.e., the rounder RND works correctly for a finite, non-zero x.

Proof. By equations (8.6) we have
\[ e_n = \begin{cases} e + \alpha & \text{if } OVF2 \wedge OVFen \\ e & \text{otherwise} \end{cases} \]
\[ f_n =_p f. \]
Equations (8.7) then imply
\[ f_1 = [f]_p \]
\[ f_2 = sigrd(s, f) \]
\[ (e_2, f_3) = \begin{cases} post(e + \alpha,\ sigrd(s, f)) & \text{if } OVF2 \wedge OVFen \\ post(e,\ sigrd(s, f)) & \text{otherwise} \end{cases} \]
and equations (8.8) finally yield
\[ (e_3, f_3) = post(e, sigrd(s, f)) \]
\[ (s, e_{out}, f_{out}) = (s, exprd(s, post(e, sigrd(s, f)))). \]

8.4.2 Normalization Shift


Let lz be the number of leading zeros of $f_r[-1:55]$. In general, the normalization shifter has to shift the first 1 in $f_r$ to the left of the binary point and to compensate for this in the exponent. If the final result is a denormal number, then x must be represented as
\[ 2^{e_r} \cdot f_r = 2^{e_{min}} \cdot 2^{e_r - e_{min}} \cdot f_r. \]
This requires a left shift by $e_r - e_{min}$ which in many cases will be a right shift by $e_{min} - e_r$ (see exercise 8.3). Finally, for a wrapped exponent one might have to add or subtract α in the exponent. The normalization shifter in figure 8.25 works along these lines.
First in circuit FLAGS the signals TINY, OVF1 and the binary representation lz[5:0] of the number lz are computed. Then, the exponent $e_n$ and the (left) shift distance σ are computed in circuits EXPNORM and SHIFTDIST.
We derive formulae for $e_n$ and σ such that equations (8.6) hold. From equations (8.3) and
\[ UNF \wedge UNFen = TINY \wedge UNFen \]
we conclude
\[ y = \begin{cases} (-1)^s \cdot 2^{\hat e - \alpha} \cdot \hat f & \text{if } OVF \wedge OVFen \\ (-1)^s \cdot 2^{\hat e + \alpha} \cdot \hat f & \text{if } TINY \wedge UNFen \\ (-1)^s \cdot 2^{\hat e} \cdot \hat f & \text{otherwise} \end{cases} \]
\[ \hat\eta(y) = \begin{cases} (s, \hat e - \alpha, \hat f) & \text{if } OVF \wedge OVFen \\ (s, \hat e + \alpha, \hat f) & \text{if } TINY \wedge UNFen \\ (s, \hat e, \hat f) & \text{otherwise.} \end{cases} \]
Figure 8.25: Circuit NORMSHIFT of the normalization shift. Inputs: fr[-1:55], er[12:0], OVFen, UNFen. Circuit FLAGS produces lz[5:0], TINY and OVF1; circuits ShiftDist and ExpNorm produce sh[12:0], eni[10:0] and en[10:0]; circuit SigNormShift produces fn[0:127].

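The two factorings $\hat\eta(y)$ and $\eta(y)$ are the same except if y is denormal, i.e., if $TINY \wedge \overline{UNFen}$. In this case,
\[ x = (-1)^s \cdot 2^{\hat e} \cdot \hat f = (-1)^s \cdot 2^{e_{min}} \cdot (2^{\hat e - e_{min}} \cdot \hat f) \]
and
\[ \eta(y) = \begin{cases} (s, \hat e - \alpha, \hat f) & \text{if } OVF \wedge OVFen \\ (s, \hat e + \alpha, \hat f) & \text{if } TINY \wedge UNFen \\ (s, e_{min}, 2^{\hat e - e_{min}} \cdot \hat f) & \text{if } TINY \wedge \overline{UNFen} \\ (s, \hat e, \hat f) & \text{otherwise} \end{cases} \]
and therefore,
\[ e = \begin{cases} \hat e - \alpha & \text{if } OVF \wedge OVFen \\ \hat e + \alpha & \text{if } TINY \wedge UNFen \\ e_{min} & \text{if } TINY \wedge \overline{UNFen} \\ \hat e & \text{otherwise} \end{cases} \qquad (8.9) \]
\[ f = \begin{cases} 2^{\hat e - e_{min}} \cdot \hat f & \text{if } TINY \wedge \overline{UNFen} \\ \hat f & \text{otherwise.} \end{cases} \]
Let $f' = f_r / 2$. Thus
\[ f'[0:56] = f_r[-1:55], \]
i.e., the representation $f'[0:56]$ is simply obtained by moving the binary point in representation $f_r[-1:55]$ one position to the left. In the following we will compute shift distances for $f'[0:56]$.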
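Let lz be the number of leading zeros in $f_r[-1:55]$ or in $f'[0:56]$, respectively. Finally, let
\[ \beta = e_r - lz + 1. \]
From equation (8.5), we conclude
\[ 2^{\hat e} \cdot \hat f = |x| =_{p - \hat e} 2^{e_r} \cdot f_r = 2^{e_r + 1} \cdot f' = 2^{\beta} \cdot (2^{lz} \cdot f'). \]
Since $2^{lz} \cdot f' \in [1, 2)$, it follows that
\[ \beta = \hat e \quad \text{and} \quad 2^{lz} \cdot f' =_p \hat f. \]
This immediately gives
\[ e = \begin{cases} \beta - \alpha & \text{if } OVF \wedge OVFen \\ \beta + \alpha & \text{if } TINY \wedge UNFen \\ e_{min} & \text{if } TINY \wedge \overline{UNFen} \\ \beta & \text{otherwise} \end{cases} \]
and
\[ \sigma = lz \quad \text{unless } TINY \wedge \overline{UNFen}. \]
If $TINY \wedge \overline{UNFen}$ holds, then x = y and $\hat e < e_{min}$. From equations (8.9) and (8.5) we know
\[ f = 2^{\hat e - e_{min}} \cdot \hat f \]
\[ 2^{\hat e} \cdot \hat f =_{p - \hat e} 2^{e_r} \cdot f_r. \]
Multiplying the second equation by $2^{-e_{min}}$ implies that
\[ 2^{\hat e - e_{min}} \cdot \hat f =_{p - \hat e + e_{min}} 2^{e_r - e_{min}} \cdot f_r \]
\[ f =_{p - \hat e + e_{min}} 2^{e_r - e_{min}} \cdot f_r = 2^{e_r - e_{min} + 1} \cdot f'. \]
Since $\hat e < e_{min}$, it also holds that
\[ f =_p 2^{e_r - e_{min} + 1} \cdot f'. \]
Thus, we have
\[ \sigma = e_r - e_{min} + 1 \quad \text{if } TINY \wedge \overline{UNFen}. \]
Up to issues of number format the outputs of circuits SHIFTDIST and EXPNORM are specified by the above calculations. Circuit SIGNORMSHIFT will not produce a representation of $f' \cdot 2^{\sigma}$, because in the case of right shifts such a representation might be very long (exercise 8.4). Instead, it will produce a representation fn[0:63] such that
\[ \langle f_n[0].f_n[1:63] \rangle = f_n =_p f' \cdot 2^{\sigma} \]
holds.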
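With the above specifications of the subcircuits of the normalization shifter in place (up to issues of number format), we can immediately conclude

Lemma. Let
\[ e_n = \begin{cases} \beta - \alpha & \text{if } OVF1 \wedge OVFen \\ \beta + \alpha & \text{if } TINY \wedge UNFen \\ e_{min} & \text{if } TINY \wedge \overline{UNFen} \\ \beta & \text{otherwise} \end{cases} \]
\[ \sigma = \begin{cases} e_r - e_{min} + 1 & \text{if } TINY \wedge \overline{UNFen} \\ lz & \text{otherwise} \end{cases} \qquad (8.10) \]
\[ f_n =_p f' \cdot 2^{\sigma}. \]
Then equations (8.4) hold, i.e., the normalization shifter works correctly.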
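Equations (8.10) can be written down as a small behavioural function on integers. This is a sketch under the assumption that the flags TINY and OVF1 are supplied by the caller; the name `norm_shift_spec` is hypothetical, and number formats (biased exponents, bus widths) are ignored.

```python
def norm_shift_spec(er, lz, tiny, ovf1, unf_en, ovf_en, n=11):
    """Exponent en and left shift distance sigma according to equations (8.10)."""
    emax = 2 ** (n - 1) - 1
    emin = 1 - emax
    alpha = 3 * 2 ** (n - 2)          # exponent wrap-around constant
    beta = er - lz + 1
    if ovf1 and ovf_en:
        en = beta - alpha             # wrapped exponent on trapped overflow
    elif tiny and unf_en:
        en = beta + alpha             # wrapped exponent on trapped underflow
    elif tiny:
        en = emin                     # denormal result
    else:
        en = beta
    sigma = er - emin + 1 if (tiny and not unf_en) else lz
    return en, sigma

# tiny result without underflow trap: a right shift by 7 positions
print(norm_shift_spec(er=-1030, lz=1, tiny=True, ovf1=False, unf_en=False, ovf_en=False))
```

A negative σ stands for a right shift, which is how the denormal case shows up in this model.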

,
Figure 8.26 depicts circuit FLAGS which determines the number lz of leading zeros and the flags TINY and OVF1. The computation of lz[5:0] is completely straightforward. Because no overflow occurs if $f_r[-1:0] = 00$, we have
\[ OVF1 = (e_r > e_{max}) \vee ((e_r = e_{max}) \wedge f_r[-1]). \]
Now recall that $bias = e_{max} = 2^{n-1} - 1 = \langle 1^{n-1} \rangle$ and that n either equals 11 or 8. For the two's complement number $e_r$ we have
\[ [e_r[12:0]] > \langle 1^{n-1} \rangle \iff \overline{e_r[12]} \wedge \bigvee_{i=n-1}^{11} e_r[i]. \]
This explains the computation of the OVF1 flag.
Since $f' = f_r / 2$, we have to consider two cases for the TINY flag.
\[ TINY \iff \begin{cases} e_r + 1 < e_{min} & \text{if } f' \in [1, 2) \\ e_r + 1 - lz < e_{min} & \text{if } f' \in [0, 1) \end{cases} \iff e_r + 1 - lz - e_{min} < 0, \]
because lz = 0 for an f' in the interval [1, 2). Thus, the TINY flag can be computed as the sign bit of the sum of the above 4 operands. Recall that $e_{min} = 1 - bias$. Thus
\[ -e_{min} + 1 = bias - 1 + 1 = \langle 1^{n-1} \rangle \]
\[ bias - lz = \langle 1^{n-1} \rangle + 1 + [1^7\, \overline{lz[5:0]}] \equiv \langle 1 0^{n-1} \rangle + [1^7\, \overline{lz[5:0]}] = \begin{cases} [0^6\, 1\, \overline{lz[5:0]}] & \text{if } n = 8 \\ [0^3\, 1^4\, \overline{lz[5:0]}] & \text{if } n = 11, \end{cases} \]
where the sums are taken modulo $2^{13}$.
This explains the computation of the TINY flag.

Figure 8.26: Circuit FLAGS. Depending on the precision, the exponent $e_{max}$ equals $\langle 0^3\, dbr^3\, 1^7 \rangle$.


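Circuit FLAGS gets its inputs directly from registers. Thus, its cost and the accumulated delay of its outputs run at
\[ C_{FLAGS} = C_{lz}(64) + C_{add}(13) + C_{EQ}(13) + 8 \cdot C_{inv} + 5 \cdot C_{or} + 3 \cdot C_{and} \]
\[ A_{FLAGS} = \max\{ D_{lz}(64) + D_{inv} + D_{add}(13),\ D_{EQ}(13) + D_{and} + D_{or},\ 4 \cdot D_{or} + 2 \cdot D_{and} \}. \]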
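A behavioural model of the three outputs of circuit FLAGS may help to check the formulas on concrete values. The sketch below is an illustration with assumed conventions: `fr_bits` is the 57-character bit string fr[-1:55], `er` is a plain integer, and the function name `flags` is hypothetical.

```python
def leading_zeros(bits):
    """Number of leading '0' characters of a bit string."""
    return len(bits) - len(bits.lstrip("0"))

def flags(er, fr_bits, n=11):
    """Return (lz, TINY, OVF1) for exponent er and significand string fr[-1:55]."""
    emax = 2 ** (n - 1) - 1
    emin = 1 - emax
    lz = leading_zeros(fr_bits)
    ovf1 = er > emax or (er == emax and fr_bits[0] == "1")   # fr_bits[0] is fr[-1]
    tiny = er + 1 - lz - emin < 0                            # sign of the 4-operand sum
    return lz, tiny, ovf1

one = "01" + "0" * 55        # fr = 1.0
two = "10" + "0" * 55        # fr = 2.0
print(flags(-1023, one), flags(1023, two))
```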

Exponent Normalization


The circuit in figure 8.27 implements the exponent $e_n$ of the equations (8.10) in a fairly straightforward way. Along the way it also converts from two's complement to biased representation.
The case $TINY \wedge \overline{UNFen}$ is handled by two multiplexers which can force the outputs directly to the biased representations $0^{10}1$ and $0^9 10$ of $e_{min}$ or $e_{min} + 1$, respectively. For the remaining three cases, the top portion of the circuit computes biased representations of $e_n$ and $e_n + 1$, or equivalently, the two's complement representations of $e_n + bias$ and $e_n + 1 + bias$.
In particular, let
\[ \gamma = \begin{cases} -\alpha & \text{if } OVF1 \wedge OVFen \\ \alpha & \text{if } TINY \wedge UNFen \\ 0 & \text{otherwise,} \end{cases} \]
then the circuit computes the following sums sum and sum + 1:
\[ sum = e_r - lz + 1 + \gamma + bias \]
\[ = e_r + 1 + [1\, \overline{lz[5:0]}] + 1 + \gamma + bias \]
\[ = e_r + 1 + [1\, \overline{lz[5:0]}] + \delta \]
where
\[ \delta = \begin{cases} bias - \alpha + 1 & \text{if } OVF1 \wedge OVFen \\ bias + \alpha + 1 & \text{if } TINY \wedge UNFen \\ bias + 1 & \text{otherwise.} \end{cases} \]
Recall that $\alpha = 3 \cdot 2^{n-2} = \langle 110^{n-2} \rangle$ and $bias = 2^{n-1} - 1 = \langle 1^{n-1} \rangle$. Hence
\[ bias + 1 = \langle 10^{n-1} \rangle = [00100^{n-2}] \]
\[ bias + \alpha + 1 = \langle 110^{n-2} \rangle + \langle 100^{n-2} \rangle = \langle 1010^{n-2} \rangle = [01010^{n-2}] \]
\[ -\alpha = [1001^{n-2}] + 1 = [1010^{n-2}] \]
\[ bias + 1 - \alpha = [1110^{n-2}] = [11110^{n-2}]. \]
In single precision we have n = 8 and the above equations define two's complement numbers with only 10 bits. By sign extension they are extended to 13 bits at the last multiplexer above the 3/2-adder. Like in the computation of flag TINY, the value lz[5:0] can be included in the constant δ, and then, the 3/2-adder in the circuit of figure 8.27 can be dropped (see exercise 8.5).
Without this optimization, circuit EXPNORM provides the two exponents $e_n$ and $e_{ni}$ at the following cost and accumulated delay
\[ C_{ExpNorm} = C_{3/2add}(11) + C_{add2}(11) + 2 \cdot C_{mux}(2) + C_{mux}(5) + C_{inv}(6) + 2 \cdot C_{mux}(11) + 3 \cdot C_{and} + C_{inv} \]
\[ A_{ExpNorm} = \max\{ A_{Flags},\ A_{UNF/OVFen} \} + D_{and} + 4 \cdot D_{mux} + D_{3/2add}(11) + D_{add2}(11). \]

Figure 8.27: Circuit EXPNORM of the exponent normalization shift; the exponents $e_{min}$ and $e_{min} + 1$ are represented as $0^{10}1$ and $0^9 10$. In case of single precision, only the bits [7:0] of the two exponents en and eni are used.

Figure 8.28: Circuit SHIFTDIST provides the shift distance of the normalization shift. Depending on the precision, constant $1 - e_{min}$ equals $\langle 0^3\, dbr^3\, 1^7 \rangle$.

Shift Distance

The circuit in figure 8.28 implements the shift distance σ of the equations (8.10) in a straightforward way. Recall that $e_{min} = 1 - bias = -2^{n-1} + 2$. Thus
\[ 1 - e_{min} = 1 + 2^{n-1} - 2 = 2^{n-1} - 1 = \langle 1^{n-1} \rangle. \]
It follows that
\[ [sh[12:0]] = \sigma. \]
The shift is a right shift if sh[12] = 1.
Circuit SHIFTDIST generates the shift distance sh in the obvious way. Since the inputs of the adder have zero delay, the cost and the accumulated delay of the shift distance can be expressed as
\[ C_{ShiftDist} = C_{add}(13) + C_{mux}(13) + C_{and} + C_{inv} \]
\[ A_{ShiftDist} = \max\{ D_{add}(13),\ A_{FLAGS} + D_{and},\ A_{UNFen} + D_{inv} + D_{and} \} + D_{mux}. \]

Significand Normalization Shift


This is slightly more tricky. As shown in figure 8.29 the circuit which performs the normalization shift of the significand has three parts:
1. A cyclic 64 bit left shifter whose shift distance is controlled by the 6 low order bits of sh. This takes the computation of the shift limitation in the MASK circuit off the critical path.
2. A mask circuit producing a 128 bit mask v[0:63]w[0:63].
3. Let fs[0:63] be the output of the cyclic left shifter. Then fn is computed by the bitwise AND of fs fs and v w.
We begin with the discussion of the cyclic left shifter. For strings $f \in \{0, 1\}^N$ and non-negative shift distances d, we denote by cls(f, d) the string obtained by shifting f by d bits cyclically to the left. Similarly, we denote by crs(f, d) the result of shifting f by d bits cyclically to the right. Then obviously
\[ crs(f, d) = cls(f, N - d \bmod N) = cls(f, -d \bmod N). \]
Let $\sigma' = \sigma \bmod 64$. For both, positive and negative, shift distances we then have
\[ [sh] = -sh[12] \cdot 2^{12} + \langle sh[11:0] \rangle \equiv \langle sh[5:0] \rangle \pmod{64}, \qquad \langle sh[5:0] \rangle = \sigma'. \]
We now can show
Figure 8.29: Circuit SIGNORMSHIFT. The input fr[-1:55] is extended by $0^7$ and shifted by CLS(64) under control of sh[5:0], producing fs[0:63]; circuit MASK derives v[0:63] and w[0:63] from sh[12:0]; two and(64) circuits produce fn[0:63] and fn[64:127].

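Lemma. Let $f' = f_r / 2$. The output fs of the cyclic left shifter satisfies
\[ fs = \begin{cases} cls(f', \sigma) & \text{if } \sigma \ge 0 \\ crs(f', |\sigma|) & \text{otherwise.} \end{cases} \]

Proof. For non-negative shift distances the claim follows immediately. For negative shift distance σ it follows that
\[ crs(f', -\sigma) = cls(f', \sigma \bmod 64) = cls(f', \langle sh[5:0] \rangle). \]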
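The identity behind the lemma is easy to check on bit strings. In the sketch below, `cls`, `crs` and `cyclic_shifter` are illustrative Python helpers; the only modelling assumption is that the six low order bits of a two's complement shift distance equal σ mod 64, which is also what Python's `%` returns for negative σ.

```python
def cls(f, d):
    """Cyclic left shift of bit string f by d positions."""
    d %= len(f)
    return f[d:] + f[:d]

def crs(f, d):
    """Cyclic right shift of bit string f by d positions."""
    d %= len(f)
    return f[-d:] + f[:-d] if d else f

def cyclic_shifter(f, sigma):
    """Model of the 64-bit cyclic left shifter driven by sh[5:0]."""
    sh_low = sigma % 64          # value of the 6 low order bits of sh
    return cls(f, sh_low)

f = format(0x0123456789ABCDEF, "064b")
print(cyclic_shifter(f, -3) == crs(f, 3))
```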

 

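We proceed to explain the generation of the masks as depicted in figure 8.30. We obviously have
\[ \langle t \rangle = \begin{cases} \sigma & \text{if } \sigma \ge 0 \\ -\sigma - 1 & \text{otherwise.} \end{cases} \]
Next, the distance in the mask circuit is limited to 63: the output of the OR-tree equals 1 iff $\langle t \rangle > 63$, hence
\[ \langle sh' \rangle = \begin{cases} \langle t \rangle & \text{if } \langle t \rangle \le 63 \\ 63 & \text{otherwise} \end{cases} = \begin{cases} \sigma & \text{if } 0 \le \sigma \le 63 \\ -\sigma - 1 & \text{if } -63 \le \sigma \le -1 \\ 63 & \text{otherwise.} \end{cases} \]
We show that

Lemma. The distance of the left shift in the significand normalization shift is bounded by 56, i.e., σ ≤ 56.

Figure 8.30: The MASK for the significand normalization shift. Inputs: sh[11:0], sh[12]. Main components: a multiplexer producing t[11:0], an OR tree over the 6 high order bits of t, a multiplexer producing sh'[5:0], the half decoder hdec(6) producing h[63:0], a flip stage producing u[0:63], and an and(64) stage producing v[0:63] and w[0:63].

Proof. Left shifts have distance lz ≤ 56 or $e_r - e_{min} + 1$. The second case only occurs if $TINY \wedge \overline{UNFen}$. In this case we have $e = e_{min}$ and $f_r \ne 0$. Assume $e_r > e_{min} + 55$. Since
\[ |x| =_{p - \hat e} 2^{e_r} \cdot f_r \]
and since $\hat e < e_{min}$, it follows that
\[ |x| =_{p - e_{min}} 2^{e_r} \cdot f_r > 2^{e_{min} + 55} \cdot 2^{-55} = 2^{e_{min}}. \]
This contradicts the tininess of x.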

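The half decoder produces from sh' the mask
\[ h[63:0] = 0^{64 - \langle sh' \rangle}\, 1^{\langle sh' \rangle}. \]
In case the shift distance is negative a 1 is appended at the right end and the string is flipped. Thus, for mask u we have
\[ u[0:63] = \begin{cases} 0^{64 - \sigma}\, 1^{\sigma} & \text{if } 0 \le \sigma \\ 1^{|\sigma|}\, 0^{64 - |\sigma|} & \text{if } -63 \le \sigma \le -1 \\ 1^{64} & \text{if } \sigma \le -64. \end{cases} \]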
Figure 8.31: Relation between the strings fs fs and the masks v w in the three cases a) 0 ≤ σ, b) −63 ≤ σ ≤ −1, and c) σ ≤ −64.

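For the masks v w it follows that
\[ v[0:63]\, w[0:63] = \begin{cases} 1^{64 - \sigma}\, 0^{64 + \sigma} & \text{if } 0 \le \sigma \\ 0^{|\sigma|}\, 1^{64}\, 0^{64 - |\sigma|} & \text{if } -63 \le \sigma \le -1 \\ 0^{64}\, 1^{64} & \text{if } \sigma \le -64. \end{cases} \]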
The relation between the string f s f s and the masks v w is illus-
trated in figure 8.31. Let fl be the result of shifting f logically by σ bits
to the left. From figure 8.31 we immediately read off

 fl 064 ; if 0σ
fn 1 : 126 fl 064 σ  63  σ  1


 ; if
0 cls f σ  ; if
64 σ  64
In all cases we obviously have
fn  p fl 
Circuit MASK depicted in figure 8.30 generates the masks v and w at the following cost and delay.
\[ C_{MASK} = C_{inv}(12) + C_{mux}(12) + C_{ORtree}(6) + C_{mux}(6) + C_{hdec}(6) + C_{mux}(64) + C_{inv}(64) + C_{and}(64) \]
\[ D_{MASK} = D_{inv} + 3 \cdot D_{mux} + D_{ORtree}(6) + D_{hdec}(6) + D_{and}. \]
Figure 8.32: Circuit REPP computes the p-representative of fn. The flag st_db (st_sg) denotes the sticky bit in case of a double (single) precision result. Inputs: fn[0:24], fn[25:53], fn[54:127], dbr. Outputs: f1[0:24], f1[25:54].

Cost and delay of the significand normalization shifter SIGNORMSHIFT run at
\[ C_{SigNormShift} = C_{CLS}(64) + C_{MASK} + 2 \cdot C_{and}(64) \]
\[ D_{SigNormShift} = \max\{ D_{CLS}(64),\ D_{MASK} \} + D_{and} \]
and the whole normalization shifter NORMSHIFT has cost and delay
\[ C_{NormShift} = C_{Flags} + C_{ExpNorm} + C_{ShiftDist} + C_{SigNormShift} \]
\[ A_{NormShift} = \max\{ A_{ExpNorm},\ A_{ShiftDist} + D_{SigNormShift} \}. \]

8.4.3 Selection of the Representative

This is a straightforward sticky bit computation of the form
\[ st = \bigvee_{i \ge p+1} f_n[i] \]
where p depends on the precision and is either 24 or 53. Circuit REPP of figure 8.32 selects the p-representative of fn as
\[ f_1[1:p+1] = f_n[1:p]\ st. \]
This circuit has the following cost and delay
\[ C_{REPp} = C_{ORtree}(29) + C_{ORtree}(74) + C_{or} + C_{mux}(30) \]
\[ D_{REPp} = D_{ORtree}(74) + D_{or} + D_{mux}. \]
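On bit strings, the selection of the representative is a slice plus an OR over the discarded tail. The sketch below follows the port widths of figure 8.32 (fn[0:127] in, f1[0:54] out); the function name `rep_p` and the string encoding are assumptions made for the illustration.

```python
def rep_p(fn_bits, dbr):
    """p-representative f1[0:54] of the 128-character bit string fn[0:127]."""
    st_db = "1" in fn_bits[54:]                  # OR tree over fn[54:127]
    st_sg = st_db or "1" in fn_bits[25:54]       # OR tree over fn[25:53], then OR
    if dbr:
        return fn_bits[:54] + ("1" if st_db else "0")
    # single precision: sticky bit at position 25, remaining positions forced to 0
    return fn_bits[:25] + ("1" if st_sg else "0") + "0" * 29

fn = "1" + "0" * 126 + "1"
print(rep_p(fn, dbr=True))
print(rep_p(fn, dbr=False))
```

Sharing the lower OR tree between both precisions mirrors the way the two sticky bits are derived in the figure.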
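Figure 8.33: Circuit SIGRND. Depending on dbr, either f1[0:23] padded with $1^{29}$ or f1[0:52] is fed into inc(53), and either f1[23:25] or f1[52:54] is selected as l, r, st for circuit Rounding Decision together with s and RM[1:0]. The flag inc selects between the chopped and the incremented significand. Outputs: f2[-1:52], SIGinx.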

8.4.4 Significand Rounding

In figure 8.33, the least significant bit l, round bit r and sticky bit st are selected depending on the precision and fed into circuit ROUNDING DECISION. The rounded significand is exact iff the bits r and st are both zero:
\[ SIGinx = r \vee st. \]
Depending on the rounding decision, $f_1[0:p-1]$ is chopped or incremented at position p − 1. More formally
\[ f_2 = \begin{cases} \langle f_1[0].f_1[1:p-1] \rangle & \text{if } inc = 0 \text{ (chop)} \\ \langle f_1[0].f_1[1:p-1] \rangle + 2^{-(p-1)} & \text{if } inc = 1 \text{ (increment).} \end{cases} \]
The rounding decision is made according to table 8.5 which was constructed such that
\[ f_2 = sigrd(s, f_1) \]
holds for every rounding mode.
Note that in mode $r_{ne}$ (nearest even), the rounding decision depends on bits l, r and st but not on the sign bit s. In modes $r_u$, $r_d$, the decision depends on bits r, st and the sign bit s but not on l. In mode $r_z$, the significand is always chopped, i.e., inc = 0. From table 8.5, one reads off
\[ inc = \begin{cases} r \wedge (l \vee st) & \text{if } r_{ne} \\ \overline{s} \wedge (r \vee st) & \text{if } r_u \\ s \wedge (r \vee st) & \text{if } r_d. \end{cases} \]
With the coding of the rounding modes from table 8.1, this is implemented in a straightforward way by the circuit of figure 8.34.
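The rounding decision and the resulting significand can be modelled in a few lines. The sketch below is illustrative only: the mode names "rne", "ru", "rd", "rz" are labels chosen here and not the RM[1:0] coding of table 8.1, and `f1_bits` is assumed to be the bit string f1[0:p+1].

```python
from fractions import Fraction

def rounding_increment(mode, s, l, r, st):
    """Flag inc for sign s, least significant bit l, round bit r and sticky bit st."""
    if mode == "rne":
        return r & (l | st)
    if mode == "ru":
        return (1 - s) & (r | st)
    if mode == "rd":
        return s & (r | st)
    return 0                                   # "rz": always chop

def sig_rnd(f1_bits, p, mode, s):
    """Return (f2, SIGinx) for the bit string f1[0:p+1]."""
    l, r, st = (int(b) for b in f1_bits[p - 1:p + 2])
    inc = rounding_increment(mode, s, l, r, st)
    chopped = Fraction(int(f1_bits[:p], 2), 2 ** (p - 1))
    return chopped + inc * Fraction(1, 2 ** (p - 1)), r | st

# toy precision p = 4: 1.111 with round bit 1 rounds up to 2 in mode rne
print(sig_rnd("111110", 4, "rne", 0))
```

The example deliberately produces the value 2, the significand overflow that the post normalization has to handle.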
Table 8.5: Rounding decision of the significand rounding. The tables list the value of the flag inc which indicates that the significand needs to be incremented. On round to zero ($r_z$), the flag equals 0.

  l  r  st | r_ne        s  r  st | r_u  r_d
  0  0  0  |  0          0  0  0  |  0    0
  0  0  1  |  0          0  0  1  |  1    0
  0  1  0  |  0          0  1  0  |  1    0
  0  1  1  |  1          0  1  1  |  1    0
  1  0  0  |  0          1  0  0  |  0    0
  1  0  1  |  0          1  0  1  |  0    1
  1  1  0  |  1          1  1  0  |  0    1
  1  1  1  |  1          1  1  1  |  0    1

Figure 8.34: Circuit of the rounding decision. Inputs: l, r, st, s, RM[1:0]. Output: inc.

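The cost and delay of circuit SIGRND which performs the significand rounding can be estimated as
\[ C_{SigRnd} = C_{mux}(53) + C_{mux}(54) + C_{inc}(53) + C_{or} + C_{mux}(3) + 3 \cdot C_{and} + 2 \cdot C_{or} + C_{xor} + C_{mux} \]
\[ A_{SigRnd} = 2 \cdot D_{mux} + \max\{ D_{inc}(53),\ \max\{ A_{RM},\ D_{mux} \} + D_{xor} + D_{and} \}. \]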

8.4.5 Post Normalization

The rounded significand $f_2$ lies in the interval [1, 2]. In case of $f_2 = 2$, circuit POSTNORM (figure 8.35) normalizes the factoring and signals an overflow of the significand by SIGovf = 1. This overflow flag is generated as
\[ SIGovf = f_2[-1]. \]
In addition, circuit POSTNORM has to compute
\[ (e_2, f_3) = \begin{cases} (e_n + 1,\ 1) & \text{if } f_2 = 2 \\ (e_n,\ f_2) & \text{otherwise.} \end{cases} \]
Since the normalization shifter NORMSHIFT provides $e_n$ and $e_{ni} = e_n + 1$, the exponent $e_2$ can just be selected based on the flag SIGovf. With a single OR gate, one computes
\[ f_3[0].f_3[1:52] = (f_2[-1] \vee f_2[0]).f_2[1:52] = \begin{cases} 1.0^{52} & \text{if } f_2 = 2 \\ f_2[0].f_2[1:52] & \text{otherwise.} \end{cases} \]
Thus, the cost and the delay of the post normalization circuit POSTNORM are
\[ C_{PostNorm} = C_{or} + C_{mux}(11) \]
\[ D_{PostNorm} = \max\{ D_{or},\ D_{mux} \}. \]

Figure 8.35: Post normalization circuit POSTNORM. Inputs: en[10:0], eni[10:0], f2[-1:0], f2[1:52]. Outputs: e2[10:0], SIGovf, f3[0:52].

)&( 1  D- 

The circuit shown in figure 8.36 corrects the error produced by the OVF1 signal in the most obvious way. The error situation OVF2 is recognized by an active SIGovf signal and $e_2 = [e_2[10:0]]_{bias} = e_{max} + 1$. Since $[x]_{bias} = \langle x \rangle - bias$, we have

\[ e_{max} = \langle 1^{n-1} \rangle = bias \]
\[ e_{max} + 1 + bias = \langle 1^n \rangle \]
\[ e_{max} + 1 = [1^n]_{bias}. \]

Thus, the test whether $e_2 = e_{max} + 1$ holds is simply performed by an AND-tree. If $OVF2 \land OVFen$ holds, exponent $e_2$ is replaced by the wrapped exponent $e_{max} + 1 - \alpha$ in biased format. Note that

\[ e_{max} + 1 + bias = \langle 1^n \rangle \quad \text{and} \quad \alpha = \langle 1\,1\,0^{n-2} \rangle \]

imply

\[ e_{max} + 1 - \alpha + bias = \langle 0\,0\,1^{n-2} \rangle \]
\[ e_{max} + 1 - \alpha = [0\,0\,1^{n-2}]_{bias}. \]

Figure 8.36: Circuit ADJUSTEXP which adjusts the exponent. Depending on the precision, the constant $e_{max} + 1 - \alpha$ equals $0^2\, dbr^3\, 1^3$.

Circuit ADJUSTEXP of figure 8.36 adjusts the exponent at the following cost and delay:

\[ C_{AdjustExp} = C_{mux}(11) + C_{mux}(3) + C_{ANDtree}(11) + 3 \cdot C_{and} + C_{inv} \]
\[ D_{AdjustExp} = 2 \cdot D_{mux} + D_{ANDtree}(11) + D_{and}. \]

8.4.7 Exponent Rounding

The circuit EXPRND in figure 8.37 computes the function exprd. Moreover, it converts the result into packed IEEE format. This involves

• hiding bit $f_{out}[0]$, and
• representing $e_{min}$ by $0^n$ in case of a denormal result.

In the case $OVF \land \overline{OVFen}$, the absolute value of the result is rounded to $X_{max}$ or $\infty$ depending on signal inf. The decision is made according to table 8.6. Circuit INFINITY DECISION implements this in a straightforward way as

\[ inf = \begin{cases} RM[0] & \text{if } RM[1] = 0 \\ RM[0] \ \text{XNOR}\ s & \text{if } RM[1] = 1. \end{cases} \]

Figure 8.37: Circuit EXPRND. Depending on the precision, the constant $X_{max}$ can be expressed as $(dbr^3\,1^7\,0,\ 1^{23}\,dbr^{29})$ and infinity can be expressed as $(dbr^3\,1^8,\ 0^{52})$.

Table 8.6: Infinity decision of the exponent rounding. The tables list the value of the flag inf which indicates that the exponent must be set to infinity.

    RM[1:0]  mode  s = 0  s = 1
    ---------------------------
    00       rz      0      0
    01       rne     1      1
    10       ru      1      0
    11       rd      0      1
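As a quick cross-check of the infinity decision against table 8.6, the equation can be evaluated for all eight input combinations. The sketch below is illustrative only; the function name and the dictionary layout are assumptions made for this example.

```python
def infinity_decision(rm1, rm0, s):
    """Flag inf: 1 if an overflowing result is rounded to infinity, 0 for Xmax."""
    if rm1 == 0:
        return rm0                  # rz -> 0, rne -> 1, independent of the sign
    return 1 - (rm0 ^ s)            # RM[0] XNOR s for the directed modes ru, rd


# Table 8.6: (RM[1], RM[0]) -> (inf for s = 0, inf for s = 1)
TABLE_8_6 = {
    (0, 0): (0, 0),   # rz
    (0, 1): (1, 1),   # rne
    (1, 0): (1, 0),   # ru
    (1, 1): (0, 1),   # rd
}
```

The XNOR captures the fact that the directed modes round to infinity exactly when the rounding direction agrees with the sign of the result.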

Denormal significands can only occur in the case $TINY \land \overline{UNFen}$. In that case, we have $e = e_{min}$ and the result is denormal iff $f_3[0] = 0$, i.e., if the significand $f_3$ is denormal.

Circuit EXPRND which performs the exponent rounding has the following cost and delay:

\[ C_{ExpRnd} = 2 \cdot C_{mux}(63) + 2 \cdot C_{and} + C_{inv} + C_{mux} + C_{xnor} \]
\[ D_{ExpRnd} = 3 \cdot D_{mux} + D_{xnor}. \]

8.4.8 Circuit SPECFPRND

This circuit (figure 8.38) covers the special cases and detects the IEEE floating point exceptions overflow, underflow and inexact result. In case a is a finite, non-zero number,

\[ x = [\![ s, e_r, f_r ]\!] =_{p-\hat{e}} a \]

and circuit RND already provides the packed IEEE factoring of x:

\[ (s, e_{out}, f_{out}) = rd(s, e_r, f_r). \]

Figure 8.38: Circuit SPECFPRND

In case of a special operand a (zero, infinity or NaN), the flags $fl_r$ = (ZEROr, NANr, INFr, nan, DBZ, INV) code the type of the operand and provide the coding of the NaN

\[ nan = (s_{nan}, f_{nan}[1:52]). \]

Thus, circuit SPECSELECT of figure 8.39 computes

\[ (s_P, e_P, f_P) = \begin{cases} (s_{nan},\ 1^{11},\ f_{nan}[1:52]) & \text{if } NANr = 1 \\ (s,\ 1^{11},\ 0^{52}) & \text{if } INFr = 1 \\ (s,\ 0^{11},\ 0^{52}) & \text{if } ZEROr = 1 \\ (s,\ e_{out},\ f_{out}) & \text{if } spec = 0 \end{cases} \]

where signal spec indicates a special operand:

\[ spec = NANr \lor INFr \lor ZEROr. \]

Depending on the flag dbr, the output factoring is either in single or double precision. The single precision result is embedded in the 64-bit word Fp according to figure 8.1. Thus,

\[ Fp[63:0] = \begin{cases} (s_P,\ e_P[10:0],\ f_P[1:52]) & \text{if } dbr \\ (s_P,\ e_P[7:0],\ f_P[1:23],\ s_P,\ e_P[7:0],\ f_P[1:23]) & \text{if } \overline{dbr}. \end{cases} \]

The circuit PRECISION implements this selection in the obvious way with a single 64-bit multiplexer.
In addition, circuit SPECFPRND detects the floating point exceptions OVF, UNF and INX according to the specifications of section 7.3. These exceptions can only occur if a is a finite, non-zero number, i.e., if spec = 0.

Figure 8.39: Circuit SPECSELECT

Since the rounder design implements $LOSS_b$, the loss of accuracy equals INX. Thus,

\[ OVFP = \overline{spec} \land OVF \]
\[ UNFP = \overline{spec} \land TINY \land (UNFen \lor LOSS_b) = \overline{spec} \land TINY \land (UNFen \lor INX) \]
\[ INXP = \overline{spec} \land INX. \]

Since an overflow and an underflow never occur together, signal INX can be expressed as

\[ INX = \begin{cases} SIGinx & \text{if } (OVF \land OVFen) \lor (UNF \land UNFen) \\ SIGinx \lor OVF & \text{otherwise} \end{cases} \;=\; SIGinx \lor (OVF \land \overline{OVFen}). \]

Circuit RNDEXCEPTIONS generates the exception flags along these equations. It also generates the flag spec indicating a special operand a.
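The exception equations are small enough to be mirrored directly in software. The following sketch assumes that all signals are given as 0/1 integers; the function and parameter names are chosen for this illustration and do not appear in the circuit.

```python
def rnd_exceptions(spec, ovf, tiny, sig_inx, ovf_en, unf_en):
    """Return (OVFP, UNFP, INXP) for one rounding result; all arguments are 0 or 1."""
    not_spec = 1 - spec
    # INX = SIGinx OR (OVF AND NOT OVFen): an untrapped overflow is always inexact.
    inx = sig_inx | (ovf & (1 - ovf_en))
    ovf_p = not_spec & ovf
    # Underflow: TINY with the trap enabled, or TINY together with a loss of accuracy.
    unf_p = not_spec & tiny & (unf_en | inx)
    inx_p = not_spec & inx
    return ovf_p, unf_p, inx_p
```

For example, a tiny result that is rounded exactly raises UNFP only if the underflow trap is enabled, while any special operand suppresses all three flags.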
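The whole circuit SPECFPRND dealing with special cases and exception flags has the following cost and delay:

\[ C_{SpecFPrnd} = 2 \cdot C_{mux}(52) + C_{mux}(11) + C_{mux} + C_{inv} + C_{mux}(64) + 5 \cdot C_{and} + 4 \cdot C_{or} + 2 \cdot C_{inv} \]
\[ D_{SpecFPrnd} = \max\{ 3 \cdot D_{mux},\ 2 \cdot D_{mux} + D_{or},\ D_{inv} + 3 \cdot D_{and} + D_{or} \}. \]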

8.5 Circuit FCON

FIGURE 8.40 depicts the schematics of circuit FCON. The left subcircuit compares the two operands FA2 and FB2, whereas the right subcircuit either computes the absolute value of operand FA2 or reverses its sign.

Figure 8.40: Circuit FCON; the left subcircuit performs the condition test, whereas the right subcircuit implements the absolute value and negate operations.

Thus, circuit FCON provides the following outputs:

• the condition flag fcc,
• the packed floating point result FC[63:0], and
• the floating point exception flags

\[ IEEEf[4:0] = (INX, UNF, OVF, DBZ, INV). \]

Its data inputs are the two packed IEEE floating point operands

\[ a = (s_a,\ e_A[n-1:0],\ f_A[1:p-1]) \]
\[ b = (s_b,\ e_B[n-1:0],\ f_B[1:p-1]) \]

and the flags $fl_a$ and $fl_b$ which signal that the corresponding operand has a special value. The circuit is controlled by

• flag ftest which requests a floating point condition test,
• the coding Fcon[3:0] of the predicate to be tested, and
• flag abs which distinguishes between the absolute value operation and the sign negation operation.

Except for the flags $fl_a$ and $fl_b$ which are provided by the unpacker FPUNP, all inputs have zero delay. Thus, the cost of circuit FCON and the accumulated delay of its outputs can be expressed as

\[ C_{FCon} = C_{EQ}(64) + C_{add}(64) + C_{inv}(63) + C_{FPtest} + C_{and} + C_{nor} \]
\[ A_{FCon} = \max\{ D_{EQ}(64),\ D_{inv} + D_{add}(64),\ A_{FPunp}(fl_a, fl_b) \} + D_{FPtest} + D_{and}. \]

Table 8.7: Coding of the floating point test condition

    predicate       coding      less  equal  unordered  INV if
    true    false   Fcon[3:0]   (<)   (=)    (?)        unordered
    --------------------------------------------------------------
    F       T       0000        0     0      0          No
    UN      OR      0001        0     0      1          No
    EQ      NEQ     0010        0     1      0          No
    UEQ     OGL     0011        0     1      1          No
    OLT     UGE     0100        1     0      0          No
    ULT     OGE     0101        1     0      1          No
    OLE     UGT     0110        1     1      0          No
    ULE     OGT     0111        1     1      1          No
    SF      ST      1000        0     0      0          Yes
    NGLE    GLE     1001        0     0      1          Yes
    SEQ     SNE     1010        0     1      0          Yes
    NGL     GL      1011        0     1      1          Yes
    LT      NLT     1100        1     0      0          Yes
    NGE     GE      1101        1     0      1          Yes
    LE      NLE     1110        1     1      0          Yes
    NGT     GT      1111        1     1      1          Yes

8.5.1 Floating Point Condition Test

Table 8.7 lists the coding of the predicates to be tested. The implementation proceeds in two steps. First, the basic predicates unordered, equal and less than are generated according to the specifications of section 7.4.5, and then the condition flag fcc and the invalid operation flag inv are derived as

\[ fcc = Fcon[0] \land unordered \ \lor\ Fcon[1] \land equal \ \lor\ Fcon[2] \land less \]
\[ inv = Fcon[3] \land unordered. \]
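To see how the four coding bits select among the basic predicates, the two equations can be evaluated in software. This is a minimal sketch; the function name and the integer encoding of Fcon are assumptions made for the example.

```python
def fp_test(fcon, unordered, equal, less):
    """Derive (fcc, inv) from the 4-bit coding Fcon[3:0] and the basic predicates."""
    fcon0 = fcon & 1
    fcon1 = (fcon >> 1) & 1
    fcon2 = (fcon >> 2) & 1
    fcon3 = (fcon >> 3) & 1
    # fcc is the OR of the basic predicates enabled by Fcon[2:0].
    fcc = (fcon0 & unordered) | (fcon1 & equal) | (fcon2 & less)
    # Predicates with Fcon[3] = 1 signal an invalid operation on unordered operands.
    inv = fcon3 & unordered
    return fcc, inv
```

With Fcon = 0110 (OLE), for instance, the flag fcc is set for less or equal operands but stays 0 for unordered ones, and no invalid operation is signaled because Fcon[3] = 0.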
Predicate Unordered

The operands a and b compare unordered if and only if at least one of them is a NaN. It does not matter whether the NaNs are signaling or not. Thus, the value of the predicate unordered equals:

\[ unordered = NANa \lor NANb \lor SNANa \lor SNANb. \]

Predicate Equality


The flag e indicates that the packed representations of the numbers a and b are identical, i.e.,

\[ e = 1 \iff FA2[63:0] = FB2[63:0]. \]

Note that for the condition test the sign of zero is ignored (i.e., $+0 = -0$), and that NaNs never compare equal. Thus, the result of the predicate equal can be expressed as

\[ equal = \begin{cases} 1 & \text{if } a, b \in \{+0, -0\} \\ 0 & \text{if } a \in \{NaN, sNaN\} \\ 0 & \text{if } b \in \{NaN, sNaN\} \\ e & \text{otherwise} \end{cases} \;=\; (ZEROa \land ZEROb \lor e) \land \overline{unordered}. \]

Predicate Less Than
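According to section 7.4.5, the relation $<$ is a true subset of $\mathbb{R}_\infty^2$. Thus, the value of the predicate less can be expressed as

\[ less = l \land \overline{unordered} \]

where for any two numbers $a, b \in \mathbb{R}_\infty$ the auxiliary flag l indicates that

\[ l = 1 \iff a < b. \]

The following lemma reduces the comparison of packed floating point numbers to the comparison of binary numbers:

Lemma 8.12. For any two numbers $a, b \in \mathbb{R}_\infty$ with the packed representations $(s_A, e_A, f_A)$ and $(s_B, e_B, f_B)$ holds

\[ |a| < |b| \iff \langle e_A f_A \rangle < \langle e_B f_B \rangle. \]

Table 8.8: Reducing the test $a < b$ to $|a| < |b|$

    sa  sb  range        a < b if
    ------------------------------------------------------
    0   0   0 ≤ a, b     [s] < 0, i.e., sign = 1
    0   1   b ≤ 0 ≤ a    never
    1   0   a ≤ 0 ≤ b    except for a = 0 ∧ b = 0
    1   1   a, b ≤ 0     [s] > 0, i.e., sign = 0 ∧ a ≠ b

Thus, let $sign = s[n+p]$ denote the sign bit of the difference

\[ [s[n+p:0]] = \langle 0\, e_A[n-1:0]\, f_A[1:p-1] \rangle - \langle 0\, e_B[n-1:0]\, f_B[1:p-1] \rangle; \]

we then have

\[ |a| < |b| \iff [s[n+p:0]] < 0 \iff sign = 1, \]

and according to table 8.8, the auxiliary flag l can be generated as

\[ l = \overline{s_a} \land \overline{s_b} \land sign \ \lor\ s_a \land \overline{s_b} \land (ZEROa \ \text{NAND}\ ZEROb) \ \lor\ s_a \land s_b \land (sign \ \text{NOR}\ e). \]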
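The reduction of lemma 8.12 can be tried out on double precision numbers by comparing their packed bit patterns as integers. The sketch below is not the adder-based circuit; it assumes Python's `struct` module to obtain the packed representation, the helper names are hypothetical, and NaNs are excluded, as in the lemma.

```python
import struct


def packed(x):
    """Packed 64-bit IEEE double representation of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def less_flag(a, b):
    """Auxiliary flag l (a < b) for non-NaN doubles, computed from the bit patterns."""
    fa, fb = packed(a), packed(b)
    sa, sb = fa >> 63, fb >> 63
    mag_a, mag_b = fa & (2**63 - 1), fb & (2**63 - 1)   # <eA fA> and <eB fB>
    sign = int(mag_a - mag_b < 0)                       # sign bit of the difference
    e = int(fa == fb)                                   # identical representations
    zero_a, zero_b = int(mag_a == 0), int(mag_b == 0)
    if sa == 0 and sb == 0:
        return sign
    if sa == 1 and sb == 0:
        return 1 - (zero_a & zero_b)                    # ZEROa NAND ZEROb
    if sa == 1 and sb == 1:
        return 1 - (sign | e)                           # sign NOR e
    return 0                                            # sa = 0, sb = 1: never
```

Comparing `less_flag(a, b)` with the built-in `a < b` over a set of normal, denormal, zero and infinite values exercises every row of table 8.8.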

 Proof of Lemma 8.12


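The numbers a and b can be finite normal numbers, finite denormal numbers or infinity. If a is a finite normal number, we have

\[ |a| = [\![ 0, e_A, f_A ]\!] = 2^{\langle e_A \rangle - bias} \cdot \langle f_A \rangle \]

with $\langle f_A \rangle \in [1, 2)$ and $0 < \langle e_A \rangle < 2^n$, whereas in case of a denormal significand we have

\[ |a| = [\![ 0, e_A, f_A ]\!] = 2^{\langle e_A \rangle + 1 - bias} \cdot \langle f_A \rangle \]

with $\langle f_A \rangle \in [0, 1)$ and $\langle e_A \rangle = 0$.

If both numbers a and b have normal (denormal) significands, the claim can easily be verified. Thus, let a be a denormal number and b be a normal number, then

\[ |a| < 2^{e_{min}} \le |b| \]

and the claim follows because

\[ \langle e_A f_A \rangle \le \langle 0^n 1^{p-1} \rangle < 2^{p-1} \le \langle e_B \rangle \cdot 2^{p-1} = \langle e_B\, 0^{p-1} \rangle \le \langle e_B f_B \rangle. \]

Let a be a finite number and let $b = \infty$, then $e_B = 1^n$ and $f_B = 0^{p-1}$. Since $|a| < \infty$ and

\[ \langle e_A f_A \rangle \le \langle 1^{n-1}\,0\,1^{p-1} \rangle < \langle 1^{n-1}\,1\,0^{p-1} \rangle = \langle e_B f_B \rangle, \]

the claim also holds for the pairs $(a, \infty)$.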

Realization

In circuit FCON of figure 8.40, a 64-bit equality tester provides the auxiliary flag e, and the output neg of a 64-bit adder provides the bit sign = s[64]. These bits are fed into circuit FPTEST which then generates the outputs fcc and inv and the flags of the three basic predicates as described above. The cost and delay of circuit FPTEST can be expressed as

\[ C_{FPtest} = 13 \cdot C_{and} + 8 \cdot C_{or} + 3 \cdot C_{inv} + C_{nor} + C_{nand} \]
\[ D_{FPtest} = 3 \cdot D_{and} + 3 \cdot D_{or} + \max\{ D_{inv} + D_{and},\ D_{nor},\ D_{nand} \}. \]

8.5.2 Absolute Value and Negation

For the packed floating point operand $a \in \mathbb{R}_\infty$ with $a = [\![ s_A, e_A, f_A ]\!]$, the absolute value $|a|$ satisfies

\[ |a| = [\![ 0, e_A, f_A ]\!]. \]

Thus, the packed representation of the value $|a|$ can simply be obtained by clearing the sign bit of operand FA2. The value $-a$ satisfies

\[ -a = [\![ \overline{s_A}, e_A, f_A ]\!] \]

and just requires the negation of the sign bit.

Thus, both operations only modify the sign bit. Unlike any other arithmetic operation, this modification of the sign bit is also performed for any type of NaN. However, the exponent and the significand of the NaN still pass the unit unchanged. Since

\[ s_C = s_A \ \text{NOR}\ abs = \begin{cases} 0 & \text{if } abs = 1 \text{ (absolute value)} \\ \overline{s_A} & \text{if } abs = 0 \text{ (reversed sign),} \end{cases} \]

the subcircuit at the right hand side of figure 8.40 generates the packed representation of the value

\[ c = [\![ s_C, e_C, f_C ]\!] = \begin{cases} |a| & \text{if } abs = 1 \\ -a & \text{if } abs = 0. \end{cases} \]

Depending on the flag abs, it either implements the absolute value or the reversed sign operation.
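Because only the sign bit changes, both operations are one-liners on the packed word. The sketch below works on 64-bit integers; the helper names are hypothetical and Python's `struct` module is used only to move between floats and bit patterns.

```python
import struct

MASK63 = (1 << 63) - 1   # exponent and significand bits FA2[62:0]


def pack_bits(x):
    """Packed 64-bit IEEE double representation of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def unpack_bits(w):
    """Inverse of pack_bits."""
    return struct.unpack(">d", struct.pack(">Q", w))[0]


def abs_or_neg(fa2, abs_flag):
    """Return the packed result: |a| if abs_flag = 1, -a if abs_flag = 0."""
    s_a = fa2 >> 63
    s_c = 1 - (s_a | abs_flag)            # sC = sA NOR abs
    return (s_c << 63) | (fa2 & MASK63)   # exponent and significand pass unchanged
```

Applying `abs_or_neg` to a NaN pattern changes the sign bit but leaves its payload untouched, which matches the remark that the operation is also performed for any type of NaN.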

8.5.3 IEEE Floating Point Exceptions

In the IEEE floating point standard, the two operations absolute value and sign reverse are considered to be special copy operations, and therefore, they never signal a floating point exception.

The floating point condition test is always exact and never overflows nor underflows. Thus, it only signals an invalid operation; the remaining exception flags are always inactive.

Depending on the control signal ftest which requests a floating point condition test, circuit FCON selects the appropriate set of exception flags:

\[ INV = inv \land ftest = \begin{cases} inv & \text{if } ftest = 1 \\ 0 & \text{if } ftest = 0 \end{cases} \]
\[ (INX, UNF, OVF, DBZ) = (0, 0, 0, 0). \]

8.6 Format Conversion

CONVERSIONS HAVE to be possible between the two packed floating point formats (i.e., single and double precision) and the integer format. For each of the six conversions, the four IEEE rounding modes must be supported.

Integers are represented as 32-bit two's complement numbers x[31:0] and lie in the set $T_{32}$:

\[ x = [x[31:0]] \in T_{32} = \{ -2^{31}, \ldots, 2^{31} - 1 \}. \]

A floating point number y is represented by a sign bit s, an n-bit exponent e and a p-bit significand. The parameters (n, p) are (8, 24) for single precision and (11, 53) for double precision. The exponent is represented in biased integer format

\[ e = [e[n-1:0]]_{bias} = \langle e[n-1:0] \rangle - bias_n = \langle e[n-1:0] \rangle - 2^{n-1} + 1 \]

and the significand is given as binary fraction

\[ \langle f \rangle = \langle f[0].f[1:p-1] \rangle. \]

In the packed format, bit f[0] is hidden, i.e., it must be extracted from the exponent. Thus,

\[ y = [\![ s, e[n-1:0], f[1:p-1] ]\!] = (-1)^s \cdot 2^e \cdot \langle f \rangle. \]

Each type of conversion is easy but none is completely trivial. In the following, we specify the six conversions in detail. Section 8.6.2 then describes the implementation of the conversions.

8.6.1 Specification of the Conversions

The two parameter sets $(n, p) = (11, 53)$ and $(n', p') = (8, 24)$ denote the width of the exponent and significand for double and single precision, respectively. As we are dealing with two floating point precisions, we also have to deal with two rounding functions, one for single and one for double precision. The same is true for functions like the IEEE normalization shift $\eta$ and the overflow check OVF. If necessary, the indices 's' and 'd' are used to distinguish between the two versions (e.g., $rd_s$ denotes the single precision rounding function). Since the rounding functions are only defined for a finite, representable operand, the special operands NaN, infinity and zero have to be considered separately.

Double Precision to Single Precision

Converting a packed, double precision floating point number a with representation

\[ \eta_d(a) = (s_A,\ e_A[n-1:0],\ f_A[1:p-1]) \]

to single precision gives a packed representation $(s_P, e_P[n'-1:0], f_P[1:p'-1])$ which satisfies the following conditions:

• If a is a finite, non-zero number, then

\[ (s_P,\ e_P[n'-1:0],\ f_P[1:p'-1]) = rd_s(x) \]

where

\[ x = \begin{cases} a \cdot 2^{-\alpha} & \text{if } OVF_s(a) \land OVFen \\ a \cdot 2^{\alpha} & \text{if } UNF_s(a) \land UNFen \\ a & \text{otherwise.} \end{cases} \]

• If a is a zero, infinity or NaN, then

\[ (s_P, e_P, f_P) = \begin{cases} (s_A,\ 0^{n'},\ 0^{p'-1}) & \text{if } a = (-1)^{s_A} \cdot 0 \\ (s_A,\ 1^{n'},\ 0^{p'-1}) & \text{if } a = (-1)^{s_A} \cdot \infty \\ (s_A,\ 1^{n'},\ 1\,0^{p'-2}) & \text{if } a = NaN. \end{cases} \]

According to section 7.4.6, the conversion signals an invalid operation exception INV = 1 iff a is a signaling NaN.

Single Precision to Double Precision

Converting a packed, single precision floating point number a with representation

\[ \eta_s(a) = (s_A,\ e_A[n'-1:0],\ f_A[1:p'-1]) \]

to double precision gives a representation $(s_P, e_P[n-1:0], f_P[1:p-1])$ which satisfies the following conditions:

• If a is a finite, non-zero number, then

\[ (s_P,\ e_P[n-1:0],\ f_P[1:p-1]) = \eta_d(rd_d(a)). \]

Due to the larger range of representable numbers, a never overflows nor underflows, i.e., $OVF_d(a) = 0$ and $UNF_d(a) = 0$. In addition, the rounded result is always exact (INX = 0).

• If a is a zero, infinity or NaN, then

\[ (s_P, e_P, f_P) = \begin{cases} (s_A,\ 0^{n},\ 0^{p-1}) & \text{if } a = (-1)^{s_A} \cdot 0 \\ (s_A,\ 1^{n},\ 0^{p-1}) & \text{if } a = (-1)^{s_A} \cdot \infty \\ (s_A,\ 1^{n},\ 1\,0^{p-2}) & \text{if } a = NaN. \end{cases} \]

Although each single precision number is representable in double precision, rounding cannot be avoided because all denormal single precision numbers are normal in double precision. An invalid operation exception INV = 1 is signaled iff a is a signaling NaN.

Integer to Floating Point

Let $x \in T_{32}$ be a non-zero integer coded as 32-bit two's complement number x[31:0]. Its absolute value $|x|$, which lies in the set $\{1, \ldots, 2^{31}\}$, can be represented as 32-bit binary number y[31:0] which usually has some leading zeros:

\[ |x| = \langle y[31:0] \rangle = \langle 0.y[31:0] \rangle \cdot 2^{32} = \langle f[0].f[1:32] \rangle \cdot 2^{32} \]

with $f[0:32] = 0\,y[31:0]$. The value of the binary fraction f lies in the interval $[0, 1)$. Rounding the factoring $(x[31], 32, f)$ gives the desired result. The exceptions overflow, underflow and inexact result cannot occur. Thus, in case of single precision, the conversion delivers

\[ (s_P, e_P, f_P) = \begin{cases} \eta_s(rd_s(x[31],\ 32,\ f[0:32])) & \text{if } x \ne 0 \\ (0,\ 0^{n'},\ 0^{p'-1}) & \text{if } x = 0. \end{cases} \]

In case of a double precision result, the rounding can be omitted due to the p = 53 bit significand. However, a normalization is still required. Thus, converting a two's complement integer x into a double precision floating point number provides the packed factoring

\[ (s_P, e_P, f_P) = \begin{cases} \eta_d(x[31],\ 32,\ f[1:32]) & \text{if } x \ne 0 \\ (0,\ 0^{n},\ 0^{p-1}) & \text{if } x = 0. \end{cases} \]

Floating Point to Integer

Let $(n, p) \in \{(8, 24), (11, 53)\}$ be the length of the exponent and significand in single or double precision format, respectively. When converting a representable number $a = [\![ s, e, f ]\!]$ into an integer, one has to perform the following three steps:

1. The value a is rounded to an integer value $x = (-1)^s \cdot y$; every one of the four rounding modes is possible.

2. It is checked whether the value x lies in the representable range $T_{32}$ of integers. In the comparison we have to consider the sign bit, because the set $T_{32}$ is not symmetric around zero.

3. If x is representable, its representation is converted from sign magnitude to two's complement, and otherwise, an invalid operation is signaled.

Rounding. Let F be the set of all binary numbers representable with at most $e_{max} + 1$ bits:

\[ F = \{ \langle y[e_{max}:0] \rangle \mid y_i \in \{0, 1\} \text{ for all } i \}. \]

For every representable a, there exists an integer $z \in F$ with

\[ z = \max\{ y \in F \mid y \le |a| \} \quad \text{and} \quad z \le |a| < z + 1. \]

The rounding of the floating point number a to an integer x is then defined in complete analogy to the significand rounding of a floating point number. For round to nearest-even, for example, one obtains the rule

\[ rd_{int}(a) = \begin{cases} (s,\ bin(z)) & \text{if } |a| < z + 0.5 \ \text{ or } \ |a| = z + 0.5 \land z[0] = 0 \\ (s,\ bin(z + 1)) & \text{if } |a| > z + 0.5 \ \text{ or } \ |a| = z + 0.5 \land z[0] = 1. \end{cases} \]
Of course, one can obtain this rule by substituting in equation 7.2 (page
332) the number a for f and setting p  1. It follows that

rdint a  rdint a1  and rdint a  a rdint a1   a

The same argument can be made for all four rounding modes.

Range Check. Let $x = (-1)^s \cdot y$ be an integer obtained by rounding a floating point number as described above. Its absolute value $|x| = y$ can be as large as

\[ 2 \cdot 2^{e_{max}} = 2 \cdot 2^{2^{10} - 1} = 2^{2^{10}}, \]

and thus, an implementation of function $rd_{int}$ would have to provide a 1025-bit binary number y. However,

\[ |a| \ge 2^{32} \implies y \ge 2^{32} \implies x \notin T_{32}, \]

and in this case, the conversion only needs to signal an invalid operation, but the rounding itself can be omitted. Such an overflow is signaled by $Iovf_1$.

In case of $Iovf_1 = 0$, the absolute value y is at most $2^{32}$. Thus, y can be represented as 33-bit binary number y[32:0] and $-y$ as 33-bit two's complement number

\[ -y = [z[32:0]] = [\overline{y[32:0]}] + 1. \]

Let

\[ x[32:0] = \begin{cases} y[32:0] & \text{if } s = 0 \\ z[32:0] & \text{if } s = 1 \end{cases} \;=\; (y[32:0] \oplus s) + s; \]

if the integer x lies in the set $T_{32}$, it then has the two's complement representation x[31:0]:

\[ x = (-1)^s \cdot y = [x[31:0]]. \]

The conversion overflows if x cannot be represented as a 32-bit two's complement number, i.e.,

\[ Iovf = Iovf_1 \lor Iovf_2 = 1, \]

where due to sign extension

\[ Iovf_2 = 1 \iff [x[32:0]] \notin T_{32} \iff x[32] \ne x[31]. \]

An invalid operation exception INV is signaled if a is not a finite number or if Iovf = 1:

\[ INV = 1 \iff a \in \{NaN, \infty, -\infty\} \lor Iovf = 1. \]
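The three steps (rounding, range check, conversion to two's complement) can be prototyped for round to nearest-even as follows. This is a software sketch of the specification, not of the hardware: it assumes Python floats as the operand format, uses exact rational arithmetic from `fractions`, and the function name as well as the convention of returning `None` for an invalid conversion are choices made for this example.

```python
import math
from fractions import Fraction


def float_to_int32(a):
    """Return (Fx, INV): the 32-bit two's complement pattern and the invalid flag."""
    if math.isnan(a) or math.isinf(a):
        return None, 1                      # a is not a finite number
    s = 1 if math.copysign(1.0, a) < 0 else 0
    mag = Fraction(abs(a))                  # exact value of |a|
    if mag >= 2**32:
        return None, 1                      # Iovf1: rounding can be omitted
    z = math.floor(mag)                     # z <= |a| < z + 1
    frac = mag - z
    half = Fraction(1, 2)
    # Round to nearest, ties go to the even neighbour (z[0] = 0).
    y = z + 1 if frac > half or (frac == half and z % 2 == 1) else z
    x = -y if s else y                      # sign magnitude -> signed integer
    if not -2**31 <= x <= 2**31 - 1:
        return None, 1                      # Iovf2: x is not in T32
    return x & 0xFFFFFFFF, 0                # two's complement pattern x[31:0]
```

The asymmetry of $T_{32}$ shows up at the boundaries: 2147483647.5 rounds to $2^{31}$ and is invalid, whereas -2147483648.5 rounds to $-2^{31}$ and is still representable.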

8.6.2 Implementation of the Conversions

One could provide a separate circuit for each type of conversion. However, the arithmetic operations already require a general floating point unpacker and a floating point rounder which convert from a packed floating point format to an internal floating point format and vice versa. In order to reuse this hardware, every conversion is performed in two steps:

• An unpacker converts the input FA2[63:0] into an internal floating point format. Depending on the type of the conversion, the input FA2 is interpreted as 32-bit two's complement integer

\[ x = [x[31:0]] \quad \text{with} \quad x[31:0] = FA2[63:32] \]

or as single or double precision floating point number with packed factoring

\[ (s_A, e_A, f_A) = \begin{cases} (FA2[63],\ FA2[62:52],\ FA2[51:0]) & \text{if } dbs = 1 \\ (FA2[63],\ FA2[62:55],\ FA2[54:32]) & \text{if } dbs = 0. \end{cases} \]

• A rounder then converts the number $(s_r, e_r, f_r, fl_r)$ represented in an internal floating point format into a 32-bit two's complement integer Fx[31:0] or into a packed floating point representation $(s_P, e_P, f_P)$. In case of double precision (dbr = 1), the floating point output is obtained as

\[ Fp[63:0] = (s_P,\ e_P[10:0],\ f_P[1:52]), \]

whereas for single precision (dbr = 0), output Fp is obtained as

\[ Fp[63:32] = Fp[31:0] = (s_P,\ e_P[7:0],\ f_P[1:23]). \]

In addition to the unpacker FPUNP and the rounder FPRND, the conversions then require a fixed point unpacker FXUNP, a fixed point rounder FXRND, and a circuit CVT which adapts the output of FPUNP to the input format of FPRND (figure 8.2).

The conversion is controlled by the following signals:

• signals dbs and dbr indicate a double precision floating point source operand and result,
• signal normal, which is only active in case of a floating point to integer conversion, requests the normalization of the source operand, and
• the two enable signals which select between the results of the circuits CVT and FXUNP.

Floating Point Format Conversion

Unpacking. The floating point unpacker FPUNP (section 8.1) gets the operand FA2 and provides as output a factoring $(s_a, e_a[10:0], f_a[0:52])$ and the flags $fl_a$. The exponent $e_a$ is a two's complement number. The flags $fl_a$ comprising the bits ZEROa, INFa, NANa and SNANa signal that FA2 is a special operand.

For a non-zero, finite operand a, the output factoring satisfies

\[ a = [\![ s_A, e_A, f_A ]\!] = (-1)^{s_a} \cdot 2^{[e_a[10:0]]} \cdot \langle f_a[0].f_a[1:52] \rangle. \]

Since normal = 0, the output factoring is IEEE-normal, i.e., $f_a[0] = 0$ implies $|a| < 2^{e_{min}}$.

Circuit CVT. Circuit CVT gets the factoring $(s_a, e_a, f_a)$ and the flags $fl_a$ from the unpacker. It checks for an invalid conversion operation and extends the exponent and significand by some bits:

\[ (s_v,\ e_v[12:0],\ f_v[-1:55]) = (s_a,\ e_a[10]^3\, e_a[9:0],\ 0\, f_a[0:52]\, 0^3) \]
\[ (ZERO, INF, NAN) = (ZEROa,\ INFa,\ NANa \lor SNANa) \]
\[ (INV, DBZ) = (SNANa,\ 0) \]
\[ nan = (s_{nan},\ f_{nan}[1:52]) = (s_a,\ 1\,0^{51}). \]

For a finite, non-zero operand a, we obviously have

\[ a = (-1)^{s_a} \cdot 2^{[e_a[10:0]]} \cdot \langle f_a[0].f_a[1:52] \rangle = (-1)^{s_v} \cdot 2^{[e_v[12:0]]} \cdot \langle f_v[-1:0].f_v[1:55] \rangle \qquad (8.11) \]

and the factoring is still IEEE-normal. The implementation of CVT is straightforward and just requires a single OR gate:

\[ C_{Cvt} = C_{or}, \qquad D_{Cvt} = D_{or}. \]
Rounding. The output $(s_v, e_v, f_v, fl_v)$ of circuit CVT is fed to the floating point rounder FPRND. In order to meet the input specification of FPRND, $f_v \in [0, 1)$ must imply that OVF = 0. Since the factoring is IEEE-normal, that is obviously the case:

\[ f_v[-1:0] = 00 \implies f_a[0] = 0 \implies |a| < 2^{e_{min}}. \]

Let

\[ y = \begin{cases} a \cdot 2^{-\alpha} & \text{if } OVF(a) \land OVFen \\ a \cdot 2^{\alpha} & \text{if } UNF(a) \land UNFen \\ a & \text{otherwise.} \end{cases} \]

Depending on the flags $fl_v$, circuit FPRND (section 8.4) then provides the packed factoring

\[ (s_P, e_P, f_P) = \begin{cases} (s_v,\ 0^n,\ 0^{p-1}) & \text{if } ZEROv \\ (s_v,\ 1^n,\ 0^{p-1}) & \text{if } INFv \\ (s_{nan},\ 1^n,\ f_{nan}[1:p-1]) & \text{if } NANv \\ \eta(rd(y)) & \text{otherwise.} \end{cases} \]

In case of dbr = 1, the factoring is given in double precision and otherwise, it is given in single precision. The correctness of the conversion follows immediately from the definition of the flags $fl_v$.

Integer to Floating Point

These conversions are also performed in two steps. First, the fixed point unpacker FXUNP converts the two's complement integer x[31:0] into the internal floating point format. The floating point rounder FPRND then converts this factoring $(s_u, e_u, f_u, fl_u)$ into the packed floating point format. Depending on the control signal dbr, the output factoring $(s_P, e_P, f_P)$ has either single or double precision.

Unpacking. The unpacker FXUNP converts the two's complement integer x[31:0] into the internal floating point format. This representation consists of the flags $fl_u$ and the factoring $(s_u, e_u[12:0], f_u[-1:55])$ which is determined according to the specification from page 420.

The flags $fl_u$ indicate a special operand (i.e., zero, infinity and NaN) and signal the exceptions INV and DBZ. Since the two's complement integer x[31:0] is always a finite number, a zero input is the only possible special case:

\[ ZEROu = 1 \iff [x[31:0]] = 0 \iff x[31:0] = 0^{32}. \]

Figure 8.41: Circuit FXUNP converting a 32-bit integer x[31:0] into the internal floating point format; flags denotes the bits INFu, NANu, INV, and DBZ.

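The remaining flags are inactive and nan can be chosen arbitrarily:

\[ INV = DBZ = INF = NAN = 0 \]
\[ nan = (s_{nan},\ f_{nan}[1:52]) = (0,\ 1\,0^{51}). \]

Since $|x| \in \{0, \ldots, 2^{31}\}$, the absolute value of x can be represented as 32-bit binary number y[31:0]:

\[ |x| = \langle y[31:0] \rangle = \begin{cases} \langle x[31:0] \rangle & \text{if } x[31] = 0 \\ (\langle \overline{x[31:0]} \rangle + 1) \bmod 2^{32} & \text{if } x[31] = 1. \end{cases} \]

Thus, the factoring

\[ s_u = x[31] = \begin{cases} 1 & \text{if } x < 0 \\ 0 & \text{if } x \ge 0 \end{cases} \]
\[ e_u[12:0] = 0^7\, 1\, 0^5 \]
\[ f_u[-1:55] = 0^2\, y[31:0]\, 0^{23} \]

with a 13-bit two's complement exponent satisfies

\[ x = (-1)^{s_u} \cdot 2^{32} \cdot \langle 0.y[31:0] \rangle = (-1)^{s_u} \cdot 2^{[e_u[12:0]]} \cdot \langle f_u[-1:0].f_u[1:55] \rangle. \]

The circuit of figure 8.41 implements the fixed point unpacker FXUNP in a straightforward manner at the following cost and delay:

\[ C_{FXunp} = C_{inv}(32) + C_{inc}(32) + C_{zero}(32) + C_{mux}(32) \]
\[ D_{FXunp} = \max\{ D_{inv} + D_{inc}(32) + D_{mux}(32),\ D_{zero}(32) \}. \]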
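The factoring produced by the fixed point unpacker can be reproduced with ordinary integer arithmetic. The sketch below is illustrative: it takes the 32-bit pattern as a non-negative Python integer, returns the magnitude y instead of the padded bit string for $f_u$, and all names are assumptions made for this example.

```python
from fractions import Fraction

MASK32 = 0xFFFFFFFF


def fx_unp(x_bits):
    """Unpack the pattern x[31:0]; returns (su, eu, y, ZEROu) with |x| = y."""
    x31 = (x_bits >> 31) & 1
    if x31:
        y = ((x_bits ^ MASK32) + 1) & MASK32   # invert, increment, mod 2^32
    else:
        y = x_bits
    su = x31                                   # sign bit of the factoring
    eu = 32                                    # exponent 0^7 1 0^5
    zero_u = int(x_bits == 0)
    return su, eu, y, zero_u


def fx_value(su, eu, y):
    """Value (-1)^su * 2^eu * <0.y[31:0]> of the factoring."""
    return (-1) ** su * 2 ** eu * Fraction(y, 2 ** 32)
```

Note that the most negative integer $-2^{31}$ needs no special treatment: inverting and incrementing its pattern yields $y = 2^{31}$, which still fits into 32 bits.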
Rounding. Since an integer to floating point conversion never overflows, the representation $(s_u, e_u, f_u, fl_u)$ meets the requirements of the rounder FPRND. Thus, the correctness of the floating point rounder FPRND implies the correctness of this integer to floating point converter.

Floating Point to Integer

Like any other conversion, this conversion is split into an unpacking and a rounding step. The unpacking is performed by the floating point unpacker FPUNP which delivers the flags $fl_a$ indicating a special operand. In case that a is a non-zero, finite number, circuit FPUNP also provides a factoring $(s_a, e_a, f_a)$ of a:

\[ a = (-1)^{s_a} \cdot 2^{[e_a[10:0]] - \langle lz_a[5:0] \rangle} \cdot \langle f_a[0].f_a[1:52] \rangle = (-1)^{s_a} \cdot 2^{e'_a} \cdot f_a. \]

Due to normal = 1, the significand $f_a$ is normal, and for any $|a| \ge 2^{e_{min}}$, the number $lz_a$ is zero.

This representation is provided to the fixed point rounder FXRND which generates the data Fx[63:32] = Fx[31:0] and the floating point exception flag INV. For a finite number a, let $rd_{int}(a) = (s_a, y)$ and

\[ x = (-1)^{s_a} \cdot y = (-1)^{s_a} \cdot 2^0 \cdot y. \]

For $x \in T_{32}$, the conversion is valid (INV = 0) and x has the two's complement representation Fx[31:0]. If a is not finite or if $x \notin T_{32}$, the conversion is invalid, i.e., INV = 1, and Fx[31:0] is chosen arbitrarily.

Fixed Point Rounder FXRND

In section 8.4, the floating point rounder FPRND is described in detail. Instead of developing the fixed point rounder FXRND from scratch, we rather derive it from circuit FPRND.

Implementation Concept. Note that the result of rounder FXRND always has a fixed exponent. For the floating point rounder, that is only the case if the result is denormal. Let $(s, e_r, f_r)$ be a denormal floating point operand; the floating point rounder FPRND then provides the output factoring $(s, e_{out}, f_{out})$ such that

\[ (s, e_{out}, f_{out}) = \eta(rd(s, e_r, f_r)) = (s,\ exprd(s,\ post(e_r,\ sigrd(s, f_r)))). \]

If the result is denormal, the post-normalization and the exponent rounding can be omitted:

\[ (s, e_{out}, f_{out}) = \eta(rd(s, e_r, f_r)) = (s,\ e_r,\ sigrd(s, f_r)). \]

The major differences between this denormal floating point rounding $rd_{dn}$ and the rounding $rd_{int}$ are the rounding position and the width of the significand. In case of $rd_{dn}$, the rounding position is p bits to the right of the binary point, whereas in case of $rd_{int}$, the significand is rounded at the binary point (p = 1). However, this is not a problem; let $e'_a = [e_a] - \langle lz_a \rangle$, then

\[ a = (-1)^{s_a} \cdot 2^{e'_a} \cdot f_a = (-1)^{s_a} \cdot 2^{p-1} \cdot (2^{e'_a - (p-1)} \cdot f_a) = (-1)^{s_a} \cdot 2^{p-1} \cdot f_r. \]

The equation suggests to make the exponent 1 and shift the significand $e'_a$ positions left, then shift $p - 1$ positions right; this moves the bit with weight 1 into the position $p - 1$ right of the binary point.

The significand $f_{out}$ provided by $rd_{dn}$ has at most two bits to the left of the binary point, whereas y has up to 32 bits to the left of the binary point. However, the rounding $rd_{int}$ is only applied if $f_a \in [0, 2)$ and $2^{e'_a} \le 2^{31}$. For p = 32 it then follows that

\[ f_r = 2^{e'_a - (p-1)} \cdot f_a \le 2^{31 - 31} \cdot f_a < 2. \]

Thus, the significand $sigrnd(s_a, f_r)$ has at most 2 bits to the left of the binary point. The significand y can then be obtained by shifting the output significand $f_{out}$ $e_{min}$ positions to the left:

\[ y[32:0] = y[p:0] = f_{out}[-1:p-1]. \]

Circuit FXRND. Figure 8.42 depicts the top level schematics of the fixed point rounder FXRND. Circuits NORMSHIFTX, REPPX, and SIGRNDX from the floating point rounder are adapted as follows:

• only denormal results are considered (i.e., $UNF \land \overline{UNFen}$),
• only one precision p = 32 is supported,
• and $e_{min}$ is set to $-(p - 1) = -31$.

Circuit SPECFX is new; it performs the range check, signals exceptions and implements the special case a = 0. All the inputs of the rounders have zero delay, thus

\[ C_{FXrnd} = C_{NormShiftX} + C_{REPpX} + C_{ff}(40) + C_{SigRndX} + C_{SpecFX} \]
\[ T_{FXrnd} = D_{NormShiftX} + D_{REPpX} + \Delta \]
\[ A_{FXrnd} = A_{SigRndX} + D_{SpecFX}. \]
Figure 8.42: Schematics of the fixed point rounder FXRND; IEEEx denotes the floating point exception flags.

Figure 8.43: Normalization shifter NORMSHIFTX

Circuit NORMSHIFTX. The circuit depicted in figure 8.43 performs the normalization shift. In analogy to the floating point rounder, its outputs satisfy

\[ Iovf_1 = 1 \iff 2^{e'_a} \cdot f_a \ge 2^{32} \]
\[ e_n = e_{min} = -31 \]
\[ f_n =_p f = 2^{e'_a - e_{min}} \cdot f_a. \]

Circuit SIGNORMSHIFT which performs the normalization shift is identical to that of the floating point rounder. Since the significand $f_a$ is normal, and since for large operands $lz_a = 0$, we have

\[ 2^{e'_a} \cdot f_a \ge 2^{32} \iff e'_a = [e_a] - \langle lz_a \rangle \ge 32 \iff [e_a[10:0]] \ge 32. \]

Thus, circuit FXFLAGS signals the overflow $Iovf_1$ by

\[ Iovf_1 = \overline{e_a[10]} \land \bigvee_{i=5}^{9} e_a[i]. \]
Figure 8.44: Circuit SHIFTDISTFX of the fixed point rounder

Circuit SHIFTDISTFX (figure 8.44) provides the distance $\sigma$ of the normalization shift as 13-bit two's complement number. Since f is defined as $2^{e'_a - e_{min}} \cdot f_a$, the shift distance equals

\[ \sigma = [sh[12:0]] = e'_a - e_{min} = [e_a[10:0]] - \langle lz_a[5:0] \rangle + 31 \]
\[ \phantom{\sigma} = \left( [e_a[10]^3\, e_a[9:0]] + [1^7\, \overline{lz_a[5:0]}] + 32 \right) \bmod 2^{13}. \]

The cost and delay of the normalization shifter NORMSHIFTX and of its modified subcircuits can be expressed as

\[ C_{NormShiftX} = C_{FXflags} + C_{ShiftDistX} + C_{SigNormShift} \]
\[ C_{FXflags} = C_{ORtree}(5) + C_{and} + C_{inv} \]
\[ C_{ShiftDistX} = C_{inv}(6) + C_{3/2add}(13) + C_{add}(13) \]
\[ D_{NormShiftX} = \max\{ D_{FXflags},\ D_{ShiftDistX} + D_{SigNormShift} \} \]
\[ D_{FXflags} = D_{ORtree}(5) + D_{and} \]
\[ D_{ShiftDistX} = D_{inv} + D_{3/2add}(13) + D_{add}(13). \]

Circuit REPPX. The circuit of figure 8.45 performs the sticky bit computation in order to provide a p-representative of $f_n$:

\[ f_1 = [f_n]_p, \qquad f_1[0:33] = f_n[0:32]\ st, \qquad st = \bigvee_{i > p} f_n[i]. \]

Since we now have only a single precision, this circuit becomes almost trivial:

\[ C_{REPpX} = C_{ORtree}(95), \qquad D_{REPpX} = D_{ORtree}(95). \]
fn[0:32] fn[33:127]
F ORMAT
ORtree
C ONVERSION
st
f1[0:32] f1[33]

   Circuit REP P X

sa 0, f1[0:31] sa RM[1:0] f1[31:33]

l, r, st
incf(33) Rounding Decision
33 inc
0 1 sa
f3[-1:31] SIGinx

   Circuit S IG R ND X of the fixed point rounder

Circuit SIGRNDX  Circuit SIGRNDX (figure 8.46) performs the significand rounding and converts the rounded significand $f_2[-1:31]$ into a two's complement fraction $f_3[-1:31]$. Given that the range is not exceeded, we have
\[ f_2[-1:31] = sigrnd(s_a, f_1[0:33]) \]
\[ [f_3[-1:0].f_3[1:31]] = (-1)^{s_a} \cdot \langle f_2[-1:0].f_2[1:31] \rangle. \]
As in the rounder FPRND, the binary fraction $f_1$ is either chopped or incremented at position $p-1$, depending on the rounding decision:
\[ \langle f_2 \rangle = \langle f_1[0].f_1[1:p-1] \rangle + inc \cdot 2^{-(p-1)} \]
\[ -\langle f_2 \rangle = [1\, \overline{f_1[0]}.\overline{f_1[1:p-1]}] + (1 - inc) \cdot 2^{-(p-1)} \]
\[ [f_3] = [s_a\, (f_1[0].f_1[1:p-1] \oplus s_a)] + (inc \oplus s_a) \cdot 2^{-(p-1)}. \]
The rounded significand is inexact ($SIGinx = 1$) iff $f_1[32:33] \ne 00$.


Like in the floating point rounder, the rounding decision flag inc is generated by the circuit of figure 8.34. Circuit SIGRNDX therefore has cost and delay
\[ C_{SigRndX} = C_{mux}(33) + C_{inc}(33) + C_{or} + 2 \cdot C_{xor} + C_{mux} + 3 \cdot C_{and} + 2 \cdot C_{or} + C_{xor} \]
\[ A_{SigRndX} = \max\{D_{inc}(33),\; A_{RM} + D_{xor} + D_{and} + D_{mux}\} + D_{mux} + D_{xor}. \]

Circuit SPECFX  This circuit supports the special cases and signals floating point exceptions. If the rounded result x is representable as a 32-bit two's complement number, we have
\[ x = 2^{31} \cdot [f_3[-1:0].f_3[1:31]] \quad\text{and}\quad x[32:0] = f_3[-1:0]\, f_3[1:31]. \]
In case of a zero operand $a = 0$, which is signaled by the flag ZEROr, the result Fx must be pulled to zero. Thus:
\[ Fx[31:0] = \begin{cases} 0^{32} & \text{if } ZEROr = 1 \\ x[31:0] & \text{if } ZEROr = 0 \end{cases} \;=\; x[31:0] \wedge \overline{ZEROr} \]
\[ Fx[63:32] = Fx[31:0]. \]
According to the specifications from page 422, the overflow of the conversion and the invalid operation exception can be detected as
\[ Iovf = Iovf1 \vee (x[32] \oplus s_a) \vee (x[31] \oplus s_a) \]
\[ INV = Iovf \vee NANr \vee INFr. \]
The conversion is inexact if the rounded significand is inexact ($SIGinx = 1$) and if a is a finite non-zero number:
\[ INX = SIGinx \wedge \overline{NANr \vee INFr \vee ZEROr}. \]
Further floating point exceptions cannot occur. Circuit SPECFX implements this in a straightforward way at
\[ C_{SpecFX} = C_{xor}(32) + 2 \cdot C_{xor} + C_{and} + C_{nor} + 5 \cdot C_{or} \]
\[ D_{SpecFX} = \max\{D_{xor} + D_{or},\; D_{and} + D_{or}\} + D_{or}. \]

8.7 Evaluation of the FPU Design

In the previous subsections, we have designed an IEEE-compliant floating point unit. We now analyze the cost and the delay of the FPU and the accumulated delay of its outputs. We assume that the rounding mode RM and the flags UNFen and OVFen have zero delay.

Table 8.9: Accumulated delay of the units feeding bus FR and cycle time of the FPU and of its units

accumulated delay
bus FR   ADD/SUB   MUL/DIV   FXUNP   FPUNP/CVT
93       91        64        35      45

cycle time
FPU   bus FR   ADD/SUB   MUL/DIV   FPRND   FXRND
98    98       63        69        98      76

Cycle Time of the FPU
In the top level schematics of the FPU (figure 8.2), there are two register stages: the output of the unpacker FPUNP and the intermediate result on bus FR are clocked into registers. Result FR is provided by the unpacker FXUNP, the converter CVT, the add/subtract unit, or the multiply/divide unit. Thus,
\[ A_{FPU}(Fr) = \max\{A_{FPunp} + D_{Cvt},\; A_{FXunp},\; A_{MulDiv},\; A_{AddSub}\} + D_{driv} \]
\[ T_{FPU}(Fr) = A_{FPU}(Fr) + \Delta. \]
According to table 8.9, these results have an accumulated delay of at most 93 and therefore require a minimal cycle time of 98 gate delays. This time is dominated by the add/subtract unit.
In addition, some units of the FPU have an internal register stage and therefore impose a bound on the cycle time of the FPU as well. These units are the two rounders FPRND and FXRND, the add/subtract unit, and the multiply/divide unit:
\[ T_{FPU} = \max\{T_{MulDiv},\; T_{AddSub},\; T_{FPrnd},\; T_{FXrnd},\; T_{FPU}(Fr)\}. \]
These units require a minimal cycle time of 98 gate delays, like the update of register FR. The floating point rounder FPRND is 30% slower than the other three units.
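The two maxima above can be replayed with the numbers of table 8.9. The sketch below is only an illustration; the values $D_{driv} = 2$ and $\Delta = 5$ are assumptions inferred from the differences 93 - 91 and 98 - 93 in the table, not constants quoted from the hardware model.

```python
# Accumulated delays (in gate delays) of the units feeding bus FR, table 8.9.
feeders = {"ADD/SUB": 91, "MUL/DIV": 64, "FXUNP": 35, "FPUNP/CVT": 45}

D_DRIV = 2  # assumption: inferred from 93 - 91
DELTA = 5   # assumption: inferred from 98 - 93

a_fr = max(feeders.values()) + D_DRIV   # accumulated delay of bus FR
t_fr = a_fr + DELTA                     # cycle time imposed by register FR

# Cycle times of the units with an internal register stage, table 8.9.
unit_cycle_times = {"ADD/SUB": 63, "MUL/DIV": 69, "FPRND": 98, "FXRND": 76}
t_fpu = max(max(unit_cycle_times.values()), t_fr)

critical_feeder = max(feeders, key=feeders.get)
print(a_fr, t_fr, t_fpu, critical_feeder)
```

With these inputs, both the update of register FR and the rounder FPRND bound the cycle time, and the add/subtract unit is the slowest unit feeding bus FR.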

- -  +%   - -


The outputs of the floating point unit are provided by the two rounders FPRND and FXRND and by unit FCON:
\[ A_{FPU} = \max\{A_{FCon},\; A_{FXrnd},\; A_{FPrnd}\}. \]

Table 8.10: Accumulated delay of the outputs of the FPU. Circuit SIGRND of rounder FPRND uses a standard incrementer (1) or a fast CSI incrementer (2).

version   FPU   FCON   FXRND   FPRND   SIGRND
(1)       91    50     44      91      58
(2)       50    50     44      49      16

According to table 8.10, they have an accumulated delay of 91. Compared to the cycle time of the FPU, a delay of 91 leaves just enough time to select the result and to clock it into a register. However, in a pipelined design (chapter 9), the outputs of the FPU become time critical due to result forwarding.
The floating point rounder FP RND is about 50% slower than the other
two units. Its delay is largely due to the 53-bit incrementer of the signif-
icand round circuit S IG R ND. The delay of a standard n-bit incrementer
is linear in n. However, when applying the conditional sum principle re-
cursively, its delay becomes logarithmic in n (see exercise 2.1 of chapter
2). Using such a CSI incrementer speeds up the rounder significantly. The
outputs of the FPU then have an accumulated delay of 50 gate delays. That
now leaves plenty of time for result forwarding.
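The conditional sum principle mentioned above can be illustrated in software. The following Python sketch is not the CSI circuit of the design; it only mimics the recursion under the assumption that both candidates for the upper half (unchanged and incremented) are available in parallel and that one multiplexer level selects between them. All names are hypothetical.

```python
def csi_increment(bits):
    """Increment a bit vector (least significant bit first).

    Returns (result_bits, carry_out, mux_levels), where mux_levels counts the
    multiplexer stages on the longest path of the recursion.
    """
    n = len(bits)
    if n == 1:
        return [bits[0] ^ 1], bits[0], 0
    k = n // 2
    low, carry_low, depth_low = csi_increment(bits[:k])
    high_inc, carry_high, depth_high = csi_increment(bits[k:])
    # Both versions of the upper half exist; the carry of the lower half selects one.
    high = high_inc if carry_low else bits[k:]
    return low + high, carry_low & carry_high, max(depth_low, depth_high) + 1


def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


result, carry, levels = csi_increment(to_bits(2**40 - 1, 53))
print(from_bits(result) == 2**40, carry, levels)
```

For a 53-bit operand the recursion has six multiplexer levels, whereas a ripple style incrementer passes the carry through all 53 positions. This is the logarithmic versus linear behaviour referred to in the text.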
The FPU receives the underflow enable bit UNFen, the overflow enable bit OVFen and the rounding mode at an accumulated delay of $A_{UNF/OVFen}$. The FPU design can tolerate an accumulated delay of $A_{UNF/OVFen} \le 40$ before the input signals UNFen and OVFen dominate the cycle time $T_{FPU}$. The accumulated delay of the rounding mode RM is more time critical. Already for $A_{RM} = 9$, the rounding mode dominates the delay $A_{FPU}$, i.e., it slows down the computation of the FPU outputs.

Cost of the FPU
Table 8.11 lists the cost of the floating point unit FPU and of its major components. Circuit SIGRND of the floating point rounder FPRND either uses a standard 53-bit incrementer or a fast 53-bit CSI incrementer. Switching to the fast incrementer increases the cost of the rounder FPRND by 3%, but it has virtually no impact on the total cost (0.2%). On the other hand, the CSI incrementer improves the accumulated delay of the FPU considerably. Therefore, from now on we only use the FPU design version with the CSI incrementer.
The multiply/divide unit is by far the most expensive part of the floating point unit; it accounts for 70% of the total cost. According to table 8.12, the cost of the multiply/divide unit is almost solely caused by circuit SIGFMD, which processes the significands. Its 58 × 58-bit multiplier tree accounts for 76% of the cost of the multiply/divide unit and for 53% of the cost of the whole FPU. The table lookup ROM has only a minor impact on the cost.

Table 8.11: Cost of the FPU and its sub-units. Circuit SIGRND of rounder FPRND either uses a standard incrementer or a fast CSI incrementer.

unit         cost
ADD/SUB      5975
MUL/DIV      73303
FCON         1982
FPUNP        6411
FXUNP        420
FPRND        7224 / 7422
FXRND        3605
CVT          2
rest         4902
total: FPU   103824 / 104022

Table 8.12: Cost of the significand multiply/divide circuit SIGFMD with a 256 × 8 lookup ROM. The last row lists the cost relative to the cost of the multiply/divide unit MUL/DIV.

SIGFMD   SELECTFD   4/2mulTree   ROM    CLA(116)   rest
71941    5712       55448        647    2711       8785
98%      7.8%       75.6%        0.9%   3.7%       12%
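The percentages quoted in this subsection follow directly from tables 8.11 and 8.12. The short Python sketch below only redoes this arithmetic; the dictionary keys are labels chosen for the example.

```python
# Gate counts from table 8.11 (all units except the rounder FPRND).
components = {
    "ADD/SUB": 5975, "MUL/DIV": 73303, "FCON": 1982, "FPUNP": 6411,
    "FXUNP": 420, "FXRND": 3605, "CVT": 2, "rest": 4902,
}
fprnd = {"standard": 7224, "csi": 7422}   # rounder with either incrementer
mul_tree = 55448                          # 4/2mulTree from table 8.12

totals = {name: sum(components.values()) + cost for name, cost in fprnd.items()}
extra = fprnd["csi"] - fprnd["standard"]

rounder_increase = extra / fprnd["standard"]            # about 3%
total_increase = extra / totals["standard"]             # about 0.2%
muldiv_share = components["MUL/DIV"] / totals["csi"]    # about 70%
tree_share_muldiv = mul_tree / components["MUL/DIV"]    # about 76%
tree_share_fpu = mul_tree / totals["csi"]               # about 53%

print(totals, round(100 * rounder_increase, 1), round(100 * total_increase, 1))
```

The component costs add up to the two totals of table 8.11, which is a useful consistency check on the table itself.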

8.8 Selected References and Further Reading

More or less complete designs of floating point units can be found in [AEGP67] and [WF82]. The designs presented here are based on constructions from [Spa91, EP97, Lei99, Sei00]. Our analysis of the division algorithm uses techniques from [FS89]. A formal correctness proof of IEEE-compliant algorithms for multiplication, division and square root with normal operands and a normal result can be found in [Rus].
8.9 Exercises

Exercise 8.1  A trivial (n, i)-right shifter is a circuit with n inputs $a[n-1:0]$, select input $s \in \{0, 1\}$ and $n + i$ outputs $r[n-1:-i]$ satisfying
\[ r = \begin{cases} 0^i a & \text{if } s = 1 \\ a\, 0^i & \text{otherwise.} \end{cases} \]
Thus, in trivial (n, i)-right shifters, the i bits which are shifted out are the last i bits of the result.
One can realize the alignment shift and sticky bit computation of the floating point adder by a stack of trivial shifters. The sticky bit is computed by simply ORing together the bits which are shifted out.

1. Determine the cost and the delay of this construction.

2. In the stack of trivial shifters, perform large shifts first. Then carefully arrange the OR-tree which computes the sticky bit. How much does this improve the delay?

Exercise 8.2  In section 8.3, we have designed a multiply/divide unit which performs a division based on the Newton-Raphson method. The iteration starts out with an initial approximation $x_0$ which is obtained from a $2^\gamma \times \gamma$ lookup table. The intermediate results are truncated after $\sigma = 57$ bits. The number i of iterations necessary to reach $p + 2$ bits of precision (i.e., $\delta_i < 2^{-(p+2)}$) is then bounded by
\[ i \le \begin{cases} 1 & \text{if } p = 24 \wedge \gamma = 16 \\ 2 & \text{if } p = 24 \wedge \gamma = 8 \text{ or } p = 53 \wedge \gamma = 16 \\ 3 & \text{if } p = 24 \wedge \gamma = 5 \text{ or } p = 53 \wedge \gamma = 8 \\ 4 & \text{if } p = 53 \wedge \gamma = 5. \end{cases} \]
For $\gamma = 8$, this bound was already shown in section 8.3.4. Repeat the arguments for the remaining cases.
Determine the cost of the FPU for $\gamma = 16$ and $\gamma = 5$.

Exercise 8.3  The next three exercises deal with the normalization shifter NORMSHIFT used by the floating point rounder FPRND. The functionality of the shifter is specified by Equation 8.6 (page 393); its implementation is described in section 8.4.2.
The shifter NORMSHIFT gets as input a factoring $(s, e_r, f_r)$; the significand $f_r[-1:55]$ has two bits to the left of the binary point. The final rounded result may be a normal or denormal number, and $f_r$ may have leading zeros or not.
Determine the maximal shift distance σ for each of these four cases. Which of these cases require a right shift?

Exercise 8.4  The normalization shifter NORMSHIFT (figure 8.25, page 395) computes a shift distance σ, and its subcircuit SIGNORMSHIFT then shifts the significand f. However, in case of a right shift, the representation of $f \cdot 2^\sigma$ can be very long. Circuit SIGNORMSHIFT therefore only provides a p-representative $f_n$:
\[ \langle f_n[0:63] \rangle = f_n =_p f \cdot 2^\sigma. \]
Determine the maximal length of the representation of $f \cdot 2^\sigma$.

Give an example (i.e., f and σ) for which $f_n \ne f \cdot 2^\sigma$.

Exercise 8.5  The exponent normalization circuit EXPNORM of the floating point rounder FPRND computes the following sums sum and sum + 1:
\[ sum = e_r + 1 + [1\, \overline{lz[5:0]}] + \delta, \]
where δ is a constant.
The implementation of EXPNORM depicted in figure 8.27 (page 400) uses a 3/2-adder and a compound adder ADD2 to perform this task. Like in the computation of flag TINY, the value lz[5:0] can be included in the constant δ', and then the 3/2-adder in the circuit EXPNORM can be dropped.

Derive the new constant δ'.

How does this modification impact the cost and the delay of the floating point rounder?
Chapter 9

Pipelined DLX Machine with Floating Point Core

In this chapter, the floating point unit from the previous chapter is integrated into the pipelined DLX machine with precise interrupts constructed in chapter 5. Obviously, the existing design has to be modified in several places, but most of the changes are quite straightforward.
In section 9.1, the instruction set is extended by floating point instruc-
tions. For the greatest part the extension is straightforward, but two new
concepts are introduced.

1. The floating point register file consists of 32 registers for single pre-
cision numbers, which can also be addressed as 16 registers for dou-
ble precision floating point numbers. This aliasing of addressing will
mildly complicate both the address computation and the forwarding
engine.

2. The IEEE standard requires interrupt event signals to be accumu-


lated in a special purpose register. This will lead to a simple extra
construction in the special purpose register environment.

In section 9.2 we construct the data path of a (prepared sequential or


pipelined) DLX machine with a floating point unit integrated into the ex-
ecute environment and a floating point register file integrated into the reg-
ister file environment. This has some completely obvious and simple con-
sequences: e.g., some parts of the data paths are now 64 bits wide and
addresses for the floating point register file must now be buffered as well.
There are two more notable consequences:

1. We have to obey throughout the machine an embedding convention which regulates how 32-bit data share 64-bit data paths.

2. Except during divisions, the execute stage can be fully pipelined, but it has variable latency (table 9.1). This makes the use of so-called result shift registers in the CA-pipe and in the buffers-pipe necessary.

Table 9.1: Latency of the IEEE floating point instructions; fc denotes a compare and cvt a format conversion.

precision   fneg   fabs   fc   cvt   fadd   fmul   fdiv
single      1      1      1    3     5      5      17
double      1      1      1    3     5      5      21

In section 9.3, we construct the control of a prepared sequential machine


FDLXΣ. The difficulties arise obviously in the execute stage:

1. For instructions which can be fully pipelined, i.e., for all instruc-
tions except divisions, two result shift registers in the precomputed
control and in the stall engine take care of the variable latencies of
instructions.

2. In section 8.3.6, we controlled the 17 or 21 cycles of divisions by a


finite state diagram. This FSD has to be combined with the precom-
puted control. We extend the result shift register of the stall engine
(and only this result shift register) to length 21. The full bits of this
result shift register then code the state of the FSD.

For the prepared machine FDLXΣ constructed in this way, we are able to
prove the counterpart of the (dateline) lemma 5.9.
In section 9.4, the machine is finally pipelined. As in previous construc-
tions, pipelining is achieved by the introduction of a forwarding engine
and by modification of the stall engine alone. Because single precision
values are embedded in double precision data paths, one has to forward
the 32 low order bits and the 32 high order bits separately. Stalls have to
be introduced in two new situations:

1. if an instruction with short latency threatens to overtake an instruc-


tion with long latency in the pipeline (see table 9.2); and

2. if pipelining of the execute stage is impossible because a division is in one of its first 13 or 17 cycles, respectively.

Table 9.2: Scheduling of two data independent instructions I1 and I2. In the first case, I2 overtakes I1; the second case depicts an in-order execution.

instruction   cycles of the execution
I1            F1 D1 E1 E1 E1 E1 E1 M1 W1
I2               F2 D2 E2 E2 E2 M2 W2
I1            F1 D1 E1 E1 E1 E1 E1 M1 W1
I2               F2 D2 stall E2 E2 E2 M2 W2

A simple lemma will show for this FDLXΠ design that the execution of instructions stays in order and that no two instructions are ever simultaneously in the same substage of the execute stage.
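The first stall condition can be made concrete with a small scheduling model. The Python sketch below is a deliberate simplification and not the stall engine constructed later: it assumes one instruction is fetched per cycle, that an instruction with latency k spends k cycles in the execute stage, and that an instruction is held back just long enough to reach the memory stage after its predecessor. Divisions and data dependencies are ignored, and the function name is hypothetical.

```python
# Latencies from table 9.1 (divisions excluded, since they are not pipelined).
LATENCY = {"fneg": 1, "fabs": 1, "fc": 1, "cvt": 3, "fadd": 5, "fmul": 5}


def schedule(program):
    """Return (mnemonic, fetch_cycle, stall_cycles, memory_cycle) per instruction."""
    rows, fetch, prev_mem = [], 1, 0
    for op in program:
        start = fetch + 2                        # first execute cycle without stalls
        stalls = max(0, prev_mem + 1 - (start + LATENCY[op]))
        mem = start + stalls + LATENCY[op]       # cycle in which the memory stage is reached
        rows.append((op, fetch, stalls, mem))
        prev_mem = mem
        fetch += 1 + stalls                      # a stalled instruction also delays its successor
    return rows


for row in schedule(["fadd", "cvt", "fneg"]):
    print(row)
```

In this model, a cvt issued directly behind an fadd would reach the memory stage one cycle before the fadd; holding it back restores the program order, which is the situation depicted in table 9.2.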

9.1 Extended Instruction Set Architecture

Before going into the details of the implementation, we first describe the extension of the DLX instruction set architecture. That includes the register set, the exception causes, the instruction format and the instruction set.

9.1.1 FPU Register Set

The FPU provides 32 floating point general purpose registers FPR, each of which is 32 bits wide. In order to store double precision values, the registers can be addressed as 64-bit floating point registers FDR. Each of the 16 FDRs is formed by concatenating two adjacent FPRs (table 9.3). Only the even numbers 0, 2, ..., 30 are used to address the floating point registers FDR; the least significant address bit is ignored.

Embedding Convention
In the design, it is sometimes necessary to store a single precision value xs
in a 64-bit register, i.e., the 32-bit representation must be extended to 64
bits. This embedding will be done according to the convention illustrated
in figure 9.1, i.e., the data is duplicated.
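The aliasing of the two register views and the embedding convention can be modelled in a few lines. The following Python sketch is only an illustration of the register map and of figure 9.1; the class and function names are hypothetical.

```python
class FloatRegisterFile:
    """32 single precision registers FPR which alias 16 double precision registers FDR."""

    def __init__(self):
        self.fpr = [0] * 32

    def read_fdr(self, addr):
        even = addr & ~1                          # least significant address bit is ignored
        return (self.fpr[even + 1] << 32) | self.fpr[even]

    def write_fdr(self, addr, value):
        even = addr & ~1
        self.fpr[even] = value & 0xFFFFFFFF               # FDR[31:0] lives in the even FPR
        self.fpr[even + 1] = (value >> 32) & 0xFFFFFFFF   # FDR[63:32] lives in the odd FPR


def embed_single(x):
    """Embedding convention: a 32-bit value is duplicated in a 64-bit word."""
    x &= 0xFFFFFFFF
    return (x << 32) | x


rf = FloatRegisterFile()
rf.write_fdr(2, 0x1122334455667788)
print(hex(rf.fpr[3]), hex(rf.fpr[2]), hex(embed_single(0x3F800000)))
```

Because a double precision write changes two single precision registers, any mechanism that tracks register addresses has to consider both halves.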

,"   8  


Table 9.3: Register map of the general purpose floating point registers

single precision (32-bit)   double precision (64-bit)
FPR31[31:0]                 FDR30[63:32]
FPR30[31:0]                 FDR30[31:0]
:                           :
FPR3[31:0]                  FDR2[63:32]
FPR2[31:0]                  FDR2[31:0]
FPR1[31:0]                  FDR0[63:32]
FPR0[31:0]                  FDR0[31:0]

Figure 9.1: Embedding convention of single precision floating point data (the single precision value x.s occupies both bits [63:32] and bits [31:0] of the 64-bit word)

In addition, the FPU core also provides some special purpose registers. The floating point control registers FCR comprise the registers FCC, RM, and IEEEf. The registers can be read and written by special move instructions. Register FCC is one bit wide and holds the floating point condition code. FCC is set on a floating point comparison, and it is tested on a floating point branch instruction. Register RM specifies which of the four IEEE rounding modes is used (table 9.4).
Table 9.4: Coding of the rounding mode RM

RM[1:0]   symbol   rounding mode
00        rz       round to zero
01        rne      round to nearest even
10        ru       round up
11        rd       round down

Register IEEEf (table 9.5) holds the IEEE interrupt flags, which are overflow OVF, underflow UNF, inexact result INX, division by zero DBZ, and invalid operation INV. These flags are sticky, i.e., they can only be reset at the user's request. Such a flag is set whenever the corresponding exception is triggered. The IEEE floating point standard 754 only requires that such an interrupt flag is set whenever the corresponding exception is triggered while being masked (disabled). If the exception is enabled (not masked), the value of the corresponding IEEE interrupt flag is left to the implementation/interrupt handler.

Table 9.5: Coding of the interrupt flags IEEEf

bit        symbol   meaning
IEEEf[0]   OVF      overflow
IEEEf[1]   UNF      underflow
IEEEf[2]   INX      inexact result
IEEEf[3]   DBZ      division by zero
IEEEf[4]   INV      invalid operation

Table 9.6: Coding of the special purpose registers SPR

group      fxSPR                                  FCR
register   SR   ESR   ECA   EPC   EDPC   Edata    RM   IEEEf   FCC
Sad        0    1     2     3     4      5        6    7       8
The special purpose registers SPR now comprise the original six spe-
cial purpose registers fxSPR of the fixed point core and the FPU control
registers FCR. Table 9.6 lists the coding of the registers SPR.

9.1.2 Interrupt Causes

The FPU adds six internal interrupts, namely the five interrupts requested by the IEEE Standard 754 plus the unimplemented floating point operation interrupt uFOP (table 9.7). If the FPU only implements a subset of the DLX floating point operations in hardware, the uFOP interrupt triggers the software emulation of an unimplemented floating point operation. The uFOP interrupt is non-maskable and of type continue.
The IEEE Standard 754 strongly recommends that users be allowed to specify an interrupt handler for any of the five standard floating point exceptions overflow, underflow, inexact result, division by zero, and invalid operation. Such a handler can generate a substitute for the result of the exceptional floating point instruction. Thus, the IEEE floating point interrupts are maskable and of type continue. However, in the absence of such a user-specific interrupt handler, the execution is usually aborted.

Table 9.7: Interrupts handled by the DLX architecture with FPU

interrupt               symbol   priority   resume           mask   external
reset                   reset    0          abort            no     yes
illegal instruction     ill      1          abort            no     no
misaligned access       mal      2          abort            no     no
page fault IM           Ipf      3          repeat           no     no
page fault DM           Dpf      4          repeat           no     no
trap                    trap     5          continue         no     no
FXU overflow            ovf      6          abort            yes    no
FPU overflow            fOVF     7          abort/continue   yes    no
FPU underflow           fUNF     8          abort/continue   yes    no
FPU inexact result      fINX     9          abort/continue   yes    no
FPU division by zero    fDBZ     10         abort/continue   yes    no
FPU invalid operation   fINV     11         abort/continue   yes    no
FPU unimplemented       uFOP     12         continue         no     no
external I/O            ex_j     12 + j     continue         yes    yes

FI-type:   Opcode (6) | Rx (5) | FD (5) | Immediate (16)
FR-type:   Opcode (6) | FS1 (5) | FS2 / Rx (5) | FD (5) | 00 | Fmt (3) | Function (6)

Figure 9.2: Floating point instruction formats of the DLX. Depending on the precision, FS1, FS2 and FD specify 32-bit or 64-bit floating point registers. Rx specifies a general purpose register of the FXU. Function is an additional 6-bit opcode. Fmt specifies a number format.

9.1.3 FPU Instruction Set

The DLX machine uses two formats (figure 9.2) for the floating point instructions; one corresponds to the I-type and the other to the R-type of the fixed point core FXU.
The FI-format is used for moving data between the FPU and the memory. Register Rx of the FXU together with the 16-bit immediate specifies the memory address. This format is also used for conditional branches on the condition code flag FCC of the FPU. The immediate then specifies the branch distance. The coding of these instructions is given in table 9.8.

Table 9.8: FI-type instruction layout. All instructions except the branches also increment the PC by four. The effective address of memory accesses equals ea = GPR[Rx] + sxt(imm), where sxt(imm) denotes the sign extended version of the 16-bit immediate imm. The width of the memory access in bytes is indicated by d. Thus, the memory operand equals m = M[ea + d - 1 : ea].

IR[31:26]   mnemonic   d   effect
Load, Store
hx31        load.s     4   FD[31:0] = m
hx35        load.d     8   FD[63:0] = m
hx39        store.s    4   m = FD[31:0]
hx3d        store.d    8   m = FD[63:0]
Control Operation
hx06        fbeqz          PC = PC + 4 + (FCC = 0 ? imm : 0)
hx07        fbnez          PC = PC + 4 + (FCC ≠ 0 ? imm : 0)

The FR-format is used for the remaining FPU instructions (table 9.9). It specifies a primary and a secondary opcode (Opcode, Function), a number format Fmt, and up to three floating point registers. For instructions which move data between the FPU and the fixed point unit FXU, the field Rx specifies the address of a general purpose register Rx in the FXU.
Since the FPU of the DLX machine can handle floating point numbers with single or double precision, all floating point operations come in two versions; the field Fmt in the instruction word specifies the precision used (table 9.10). In the mnemonics, we identify the precision by adding the corresponding suffix, e.g., suffix '.s' indicates a single precision floating point number.
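Splitting an instruction word into the fields of the two formats is a matter of bit slicing. The Python sketch below is only an illustration. The bit positions are assumptions derived from the field widths of figure 9.2, read from the most significant bit downwards (in particular Fmt = IR[8:6] and the FI-type destination FD = IR[20:16]); the type and function names are hypothetical.

```python
from collections import namedtuple

FRType = namedtuple("FRType", "opcode fs1 fs2_rx fd fmt function")
FIType = namedtuple("FIType", "opcode rx fd imm")

FMT_SUFFIX = {0b000: ".s", 0b001: ".d", 0b100: ".i"}   # number formats of the Fmt field


def bits(word, hi, lo):
    """Return the bit field word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fr(ir):
    return FRType(bits(ir, 31, 26), bits(ir, 25, 21), bits(ir, 20, 16),
                  bits(ir, 15, 11), bits(ir, 8, 6), bits(ir, 5, 0))


def decode_fi(ir):
    imm = bits(ir, 15, 0)
    if imm & 0x8000:            # sign extension of the 16-bit immediate
        imm -= 0x10000
    return FIType(bits(ir, 31, 26), bits(ir, 25, 21), bits(ir, 20, 16), imm)


# fadd.d with FS1 = 2, FS2 = 6, FD = 4: primary opcode hx11, function hx00, Fmt = 001.
word = (0x11 << 26) | (2 << 21) | (6 << 16) | (4 << 11) | (0b001 << 6) | 0x00
fields = decode_fr(word)
print(fields, FMT_SUFFIX[fields.fmt])
```

A decoder of this kind only separates the fields; whether a field names a 32-bit or a 64-bit register still depends on the precision of the instruction.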

9.2 Data Paths without Forwarding

In this section we extend the pipelined data paths of the DLX machine by an IEEE floating point unit. The extensions mainly occur within the environments. The top level schematics of the data paths (figure 9.3) remain virtually the same, except for some additional staging registers and the environment FPemb which aligns the floating point operands.
Table 9.9: FR-type instruction layout. All instructions also increment the PC by four. The functions sqrt(), abs() and rem() denote the square root, the absolute value and the remainder of a division according to the IEEE 754 standard. The opcode bits c[3:0] specify the floating point test condition con according to table 8.7. Function cvt() converts from one format into another. In our implementation, instructions fsqt and frem are only supported in software.

IR[31:26]   IR[5:0]     Fmt   mnemonic           effect
Arithmetic and compare operations
hx11        hx00              fadd [.s, .d]      FD = FS1 + FS2
hx11        hx01              fsub [.s, .d]      FD = FS1 - FS2
hx11        hx02              fmul [.s, .d]      FD = FS1 * FS2
hx11        hx03              fdiv [.s, .d]      FD = FS1 / FS2
hx11        hx04              fneg [.s, .d]      FD = - FS1
hx11        hx05              fabs [.s, .d]      FD = abs(FS1)
hx11        hx06              fsqt [.s, .d]      FD = sqrt(FS1)
hx11        hx07              frem [.s, .d]      FD = rem(FS1, FS2)
hx11        11 c[3:0]         fc.con [.s, .d]    FCC = (FS1 con FS2)
Data transfer
hx11        hx08        000   fmov.s             FD[31:0] = FS1[31:0]
hx11        hx08        001   fmov.d             FD[63:0] = FS1[63:0]
hx11        hx09              mf2i               Rx = FS1[31:0]
hx11        hx0a              mi2f               FD[31:0] = Rx
Format conversion
hx11        hx20        001   cvt.s.d            FD = cvt(FS1, s, d)
hx11        hx20        100   cvt.s.i            FD = cvt(FS1, s, i)
hx11        hx21        000   cvt.d.s            FD = cvt(FS1, d, s)
hx11        hx21        100   cvt.d.i            FD = cvt(FS1, d, i)
hx11        hx24        000   cvt.i.s            FD = cvt(FS1, i, s)
hx11        hx24        001   cvt.i.d            FD = cvt(FS1, i, d)

Table 9.10: Coding of the number format Fmt.

Fmt[2:0]   suffix   number format
000        .s       single precision floating point
001        .d       double precision floating point
100        .i       32-bit fixed point

Figure 9.3: Data paths of the DLX design with a floating point core (environments IMenv, IRenv, PCenv, Daddr, FPemb, CAenv, EXenv, DMenv, SH4Lenv and RFenv; the buffers IR.j, PCs.j, Cad.j, Fad.j and Sad.j; the registers Ffl.3, MAR, MDRw, Ffl.4, C.4, MDRr and FC.4)

The register file environment RFenv now also provides two 64-bit floating point operands FA and FB, and it gets a 64-bit result FC and three additional addresses. The registers Ffl.3 and Ffl.4 buffer the five IEEE exception flags. In order to support double precision loads and stores, the data registers MDRw and MDRr associated with the data memory are now 64 bits wide. Thus, the cost of the enhanced DLX data paths can be expressed as
\[ C_{DP} = C_{IMenv} + C_{PCenv} + C_{IRenv} + C_{Daddr} + C_{FPemb} + C_{EXenv} \]
\[ \phantom{C_{DP} =} + C_{DMenv} + C_{SH4Lenv} + C_{RFenv} + C_{CAenv} + C_{buffer} \]
\[ \phantom{C_{DP} =} + 6 \cdot C_{ff}(32) + 5 \cdot C_{ff}(64) + 2 \cdot C_{ff}(5). \]
The extended instruction set architecture has no impact on the instruc-


tion fetch. Thus, the instruction memory environment IMenv remains
the same. The other four pipeline stages undergo more or less extensive
changes which are now described stage by stage.
&&/
  *
9.2.1 Instruction Decode

The data paths of the decode stage ID comprise the environments IRenv, PCenv, and FPemb and the circuit Daddr, which selects the address of the destination register.
Environment IRenv

So far, the environment IRenv of the instruction register selects the immediate operand imm being passed to the PC environment and the 32-bit immediate operand co. In addition, IRenv provides the addresses of the register operands and two opcodes.
The extension of the instruction set has no impact on the immediate operands or on the source addresses of the register file GPR. However, environment IRenv now also has to provide the addresses of the two floating point operands FA and FB. These source addresses FS1 and FS2 can directly be read off the instruction word and equal the source addresses Aad and Bad of the fixed point register file GPR:
\[ FS1 = Aad = IR[25:21] \]
\[ FS2 = Bad = IR[20:16]. \]
Thus, the cost and delay of environment IRenv remain unchanged.

Circuit Daddr

Circuit Daddr generates the destination addresses Cad and Fad of the general purpose register files GPR and FPR. In addition, it provides the source address Sas and the destination address Sad of the special purpose register file SPR.
Address Cad of the fixed point destination is generated by circuit Caddr as before. The selection of the floating point destination Fad is controlled by a signal FRtype which indicates an FR-type instruction:
\[ Fad[4:0] = \begin{cases} IR[15:11] & \text{if } FRtype = 1 \\ IR[20:16] & \text{if } FRtype = 0. \end{cases} \]
The SPR source address Sas is generated as in the DLX design. It is usually specified by the bits SA = IR[10:6], but on an RFE instruction it equals the address of register ESR. Except for an RFE instruction or a floating point condition test (fc = 1), the SPR destination address Sad is specified by SA. On RFE, ESR is copied into the status register SPR[0], and on fc = 1, the condition flag fcc is saved into register SPR[8]. Thus,
\[ Sas = \begin{cases} SA & \text{if } rfe.1 = 0 \\ 00001 & \text{if } rfe.1 = 1 \end{cases} \qquad Sad = \begin{cases} 00000 & \text{if } rfe.1 = 1 \\ 01000 & \text{if } fc.1 = 1 \\ SA & \text{otherwise.} \end{cases} \]

Figure 9.4: Circuit Daddr which selects the destination addresses

The circuit of figure 9.4 provides these four addresses at the following cost and delay:
\[ C_{Daddr} = C_{Caddr} + 4 \cdot C_{mux}(5) \]
\[ D_{Daddr} = \max\{D_{Caddr},\; 2 \cdot D_{mux}(5)\}. \]
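The address selection of circuit Daddr amounts to a handful of multiplexers. The Python sketch below is only an illustration of the selection rules for Fad, Sas and Sad; circuit Caddr is left out because it is unchanged, and the function names are hypothetical.

```python
def bits(word, hi, lo):
    """Return the bit field word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def daddr(ir, frtype, rfe, fc):
    """Return (Fad, Sas, Sad) for the instruction word ir and the given control signals."""
    sa = bits(ir, 10, 6)
    fad = bits(ir, 15, 11) if frtype else bits(ir, 20, 16)
    sas = 0b00001 if rfe else sa          # RFE reads register ESR (address 1)
    if rfe:
        sad = 0b00000                     # RFE writes the status register SR (address 0)
    elif fc:
        sad = 0b01000                     # a condition test writes FCC (address 8)
    else:
        sad = sa
    return fad, sas, sad


word = (9 << 16) | (7 << 11) | (5 << 6)
print(daddr(word, frtype=1, rfe=0, fc=0), daddr(word, frtype=0, rfe=1, fc=0))
```

The constants 1, 0 and 8 are the addresses of ESR, SR and FCC in the coding of the special purpose registers.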

" 1 
Due to the extended ISA, the PC environment has to support two additional control instructions, namely the floating point branches fbeqz and fbnez. However, except for the value $PC^u_i$, the environment PCenv still has the functionality described in chapter 5.
Let signal bjtaken, as before, indicate a jump or taken branch. On instruction $I_i$, the PC environment now computes the value
\[ PC^u_i = \begin{cases} EPC_{i-1} & \text{if } I_i = \text{rfe} \\ PC_{i-1} + imm_i & \text{if } bjtaken_i \wedge I_i \in \{\text{beqz}, \text{bnez}, \text{j}, \text{jal}\} \\ PC_{i-1} + imm_i & \text{if } bjtaken_i \wedge I_i \in \{\text{fbeqz}, \text{fbnez}\} \\ RS1_{i-1} & \text{if } bjtaken_i \wedge I_i \in \{\text{jr}, \text{jalr}\} \\ PC_{i-1} + 4 & \text{otherwise.} \end{cases} \]
This extension has a direct impact on the glue logic PCglue, which generates signal bjtaken, but the data paths of PCenv including circuit nextPC remain unchanged.
Signal bjtaken must now also be activated in case of a taken floating point branch. Let the additional control signal fbranch denote a floating point branch. According to table 9.11, signal bjtaken is now generated as
\[ bjtaken = branch \wedge (bzero \text{ XNOR } AEQZ) \;\vee\; fbranch \wedge (bzero \text{ XOR } FCC) \;\vee\; jump. \]
This increases the cost of the glue logic by an OR, AND, and XOR gate:
\[ C_{PCglue} = 2 \cdot C_{or} + C_{and} + C_{xnor} + C_{zero}(32) + C_{or} + C_{and} + C_{xor}. \]

Table 9.11: Value of signal bjtaken for the different branch instructions

instruction   bzero   AEQZ   FCC   bjtaken
beqz          1       0      *     0
beqz          1       1      *     1
bnez          0       0      *     1
bnez          0       1      *     0
fbeqz         1       *      1     0
fbeqz         1       *      0     1
fbnez         0       *      1     1
fbnez         0       *      0     0

Both operands A' and FCC are provided by the register file environment, but A' is passed through a zero tester in order to obtain signal AEQZ. Thus, FCC has a much shorter delay than AEQZ, and the delay of signal bjtaken remains unchanged.
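The glue logic for bjtaken is small enough to be evaluated exhaustively. The Python sketch below transcribes the Boolean formula given above and evaluates it for the four branch instructions; the function name and the tuple layout are chosen for this example only.

```python
def bjtaken(branch, fbranch, jump, bzero, aeqz, fcc):
    """Jump or taken branch: fixed point branches test AEQZ, floating point branches test FCC."""
    xnor = 1 - (bzero ^ aeqz)
    return (branch & xnor) | (fbranch & (bzero ^ fcc)) | jump


# (instruction, branch, fbranch, bzero) for the four conditional branches.
BRANCHES = [("beqz", 1, 0, 1), ("bnez", 1, 0, 0), ("fbeqz", 0, 1, 1), ("fbnez", 0, 1, 0)]

for name, branch, fbranch, bzero in BRANCHES:
    for flag in (0, 1):
        # The flag is applied to AEQZ and FCC alike; only one of them is relevant.
        taken = bjtaken(branch, fbranch, 0, bzero, aeqz=flag, fcc=flag)
        print(name, "flag =", flag, "bjtaken =", taken)
```

Evaluating the formula shows that fbeqz is taken exactly for FCC = 0 and fbnez exactly for FCC = 1, mirroring the behaviour of beqz and bnez with respect to a zero operand.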

Environment FPemb

Environment FPemb of figure 9.5 selects the two floating point source operands and implements the embedding convention of figure 9.1. It is controlled by three signals: the flag dbs.1 requesting double precision source operands, the least significant address bit FS1[0] of operand FA, and the least significant address bit FS2[0] of operand FB.
Circuit FPemb reads the two double words fA[63:0] and fB[63:0] and provides the two operands FA1 and FB1, each of which is 64 bits wide. Since the selection and data extension of the two source operands go along the same lines, we just focus on operand FA1. Let the high order word and the low order word of input fA be denoted by
\[ fAh = fA[63:32] \quad\text{and}\quad fAl = fA[31:0]. \]
On a double precision access (dbs.1 = 1), the high and the low order word are just concatenated, i.e., FA1 = fAh fAl. On a single precision access, one of the two words is selected and duplicated; the word fAl is chosen on an even address and the word fAh on an odd address. Thus,
\[ FA1[63:0] = \begin{cases} fAh\, fAl & \text{if } dbs.1 = 1 \\ fAh\, fAh & \text{if } dbs.1 = 0 \wedge FS1[0] = 1 \\ fAl\, fAl & \text{if } dbs.1 = 0 \wedge FS1[0] = 0. \end{cases} \]

Figure 9.5: Schematics of environment FPemb (a) and of circuit Fsel (b).

Circuit Fsel (figure 9.5) implements this selection in a straightforward manner. Environment FPemb comprises two of these circuits. Since the data inputs have a much longer delay than the three control signals, the cost and delay of environment FPemb can be expressed as
\[ C_{FPemb} = 2 \cdot C_{Fsel} = 2 \cdot (2 \cdot C_{or} + C_{inv} + 2 \cdot C_{mux}(32)) \]
\[ D_{FPemb} = D_{mux}(32). \]
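The case distinction for FA1 translates directly into code. The following Python sketch models the selection on 64-bit integers; it is only an illustration of the functionality of Fsel and FPemb, not of their gate level structure, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fsel(f, db, a):
    """Select a 64-bit operand from the double word f.

    db = 1 requests double precision; a is the least significant address bit.
    """
    fh, fl = (f >> 32) & MASK32, f & MASK32
    if db:
        return (fh << 32) | fl            # double precision: pass both words
    word = fh if a else fl                # odd address: high word, even address: low word
    return (word << 32) | word            # embedding convention: duplicate the word


def fpemb(fa, fb, dbs, fs1_lsb, fs2_lsb):
    """Both operands are handled by the same selection circuit."""
    return fsel(fa, dbs, fs1_lsb), fsel(fb, dbs, fs2_lsb)


fa1, fb1 = fpemb(0x1111111122222222, 0x3333333344444444, dbs=0, fs1_lsb=1, fs2_lsb=0)
print(hex(fa1), hex(fb1))
```

In single precision mode both halves of the output carry the same word, so later stages can use either half without knowing the register address.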

9.2.2 Memory Stage

In every cycle, the memory stage passes the address MAR, the 64-bit data MDRw and the floating point flags Ffl.3 to the write back stage:
\[ C.4 := MAR \]
\[ FC.4 := MDRw \]
\[ Ffl.4 := Ffl.3. \]
In case of a load or store instruction, the environment DMenv of the data memory and the memory control perform the data memory access. In order to load and store double precision floating point numbers, the memory access can now be up to 64 bits wide.

5 %  : 


As before, the data memory DM is byte addressable, but in addition to byte, half word and word accesses, it now also supports double word accesses. Therefore, the data memory is organized in 8 memory banks.
In the memory DM, 8-bit, 16-bit and 32-bit data are aligned in the same way as before (section 3.1.3), whereas 64-bit data are aligned at double word boundaries, i.e., their byte addresses are divisible by eight. For a double word boundary e we define the memory double word of memory DM with address e as
\[ DMdword(e) = DM[e+7 : e]. \]
The bytes within the double word w[63:0] are numbered in little endian order:
\[ byte_j(w) = w[8j+7 : 8j] \]
\[ byte_{i:j}(w) = byte_i(w) \ldots byte_j(w). \]
On a read access with address a[31:0], the data memory DM provides the requested double word, assuming that the memory is not busy and that the access causes no page fault. In any other case, the memory DM provides a default value. Thus, for the double word boundary $e = \langle a[31:3]000 \rangle$, we get
\[ DMout[63:0] = \begin{cases} DMdword(e) & \text{if } Dmr.3 \wedge \overline{dbusy} \wedge \overline{dpf} \\ DMdefault & \text{otherwise.} \end{cases} \]
A write access only updates the data memory if the access is perfectly aligned (dmal = 0) and if the access causes no page fault (dpf = 0). On such a d-byte write access with byte address $a = \langle a[31:0] \rangle$ and offset $o = \langle a[2:0] \rangle$, the data memory performs the update
\[ DM[a+d-1 : a] := byte_{o+d-1:o}(MDin[63:0]). \]
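The byte numbering and the write update can be tried out on a small byte array. The Python sketch below is only an illustration of the definitions above: it models the memory as a `bytearray`, ignores the busy and page fault signals, and treats a misaligned write as a no-op. The function names are hypothetical.

```python
def dword_boundary(a):
    """Double word boundary e = a[31:3]000 of the byte address a."""
    return a & ~0b111


def byte_range(i, j, w):
    """byte_{i:j}(w): bytes i down to j of the double word w, as one integer."""
    return (w >> (8 * j)) & ((1 << (8 * (i - j + 1))) - 1)


def read_dword(dm, a):
    e = dword_boundary(a)
    return int.from_bytes(dm[e:e + 8], "little")


def write(dm, a, d, mdin):
    """d-byte write of the lanes o+d-1 .. o of the 64-bit input mdin."""
    if a % d != 0:                       # misaligned access: the memory is not updated
        return
    o = a & 0b111
    dm[a:a + d] = byte_range(o + d - 1, o, mdin).to_bytes(d, "little")


dm = bytearray(16)
write(dm, 12, 4, 0x1122334455667788)     # word write into the upper half of a double word
print(hex(read_dword(dm, 12)))
```

A word write to an address with offset 4 takes its data from the upper four byte lanes of the input, which is why a 32-bit value has to be present in both halves of the 64-bit data register.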

5 % 1  +5


Figure 9.6 depicts the new data memory environment DMenv. As in the
pipelined DLX design of section 6.5, the core of DMenv is the data cache
interface D$if with a sectored cache. A cache sector is still S = 8 bytes
wide.
The cache is connected to the data paths through a 32-bit address port
a and the two 64-bit data ports MDin and DMout. The memory interface
Mif connects the data cache to the off-chip memory. Even without FPU,
the cache and the off-chip memory already operate on 8-byte data. Thus,
the interfaces Mif and D$if remain unchanged.
Without the FPU, the 64-bit data ports of the cache and memory interface
had to be patched to the 32-bit ports of the data paths. On the input
port MDin, the 32-bit data MDRw was duplicated. On the output port, a
multiplexer selected the requested word within the double word.
Since the registers MDRw and MDRr are now 64 bits wide, the patches
become obsolete. That saves a 32-bit multiplexer and reduces the cache
read time T$read by the delay Dmux and possibly the burst read time
TMrburst as well. However, these two cycle times were not time critical.

\[
\begin{aligned}
T_{\$read} &= A_{\$if}(Dout) + D_{ff} \\
T_{Mrburst} &= D_{driv} + d_{bus} + \delta
  + \max\{D_{\$if}(MDat;\,\$if),\; D_{\$if}(MDat;\,Dout) + D_{ff}\} \\
C_{DMenv} &= C_{D\$if} + C_{mux}(32)
\end{aligned}
\]

[Figure 9.6: Data memory environment of the DLX design with FPU]

Data Memory Control


As in the pipelined design of section 6.5, the data cache interface D$if
and the interface Mif to the off-chip memory are governed by the memory
interface control MifC. Even without FPU, the interfaces D$if and
Mif already supported 8-byte accesses. Thus, the new floating point load
and store instructions (load.s, load.d, store.s, store.d) have no impact
on the control MifC.
In addition to control MifC, the data memory environment DMenv is
governed by the data memory control DMC. As before, circuit DMC
generates the bank write signals DMbw[7:0], which on a cache read access
(D$rd = 1) are clocked into register DMBw. Circuit DMC also checks for a
misaligned access, signaled by dmal = 1, and masks the memory read and
write signals with the flag dmal. Since the bank write signals and flag
dmal depend on the width of the access, circuit DMC must also account for
the new load and store instructions.
The bank write signals DMbw[7:0] are generated along the same lines
as before (pages 81 and 201): Feeding address MAR[2:0] into a 3-decoder
gives 8 signals B[7:0] satisfying for all j

\[ B[j] = 1 \iff \langle MAR[2:0] \rangle = j \]

From the primary opcode IR.3, the width of the current access is decoded
according to table 9.12 by

\[
\begin{aligned}
B &= (IR.3[30] \;\mathrm{NOR}\; IR.3[27]) \wedge \overline{IR.3[26]} \\
H &= (IR.3[30] \;\mathrm{NOR}\; IR.3[27]) \wedge IR.3[26] \\
W &= \overline{IR.3[30]} \wedge IR.3[27] \wedge IR.3[26]
     \;\vee\; IR.3[30] \wedge \overline{IR.3[28]} \\
D &= IR.3[30] \wedge IR.3[28]
\end{aligned}
\]

Table 9.12: Coding the width of a data memory access

width          d   IR.3[30,28:26]   instructions
byte B         1   0*00             lb, lbu, sb
half word H    2   0*01             lh, lhu, sh
word W         4   0*11             lw, sw
                   1001             load.s, store.s
double word D  8   1101             load.d, store.d

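According to table 9.13, the bank write signals are then generated in a
brute force way by

\[
\begin{aligned}
DMbw[0] &= Dmw.3 \wedge B[0] \\
DMbw[1] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[0] \vee H \wedge B[0] \vee B \wedge B[1]) \\
DMbw[2] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[0] \vee H \wedge B[2] \vee B \wedge B[2]) \\
DMbw[3] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[0] \vee H \wedge B[2] \vee B \wedge B[3]) \\
DMbw[4] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[4] \vee H \wedge B[4] \vee B \wedge B[4]) \\
DMbw[5] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[4] \vee H \wedge B[4] \vee B \wedge B[5]) \\
DMbw[6] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[4] \vee H \wedge B[6] \vee B \wedge B[6]) \\
DMbw[7] &= Dmw.3 \wedge (D \wedge B[0] \vee W \wedge B[4] \vee H \wedge B[6] \vee B \wedge B[7])
\end{aligned}
\]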

The memory control DMC also checks for a misaligned access. A byte
access is always properly aligned. A double word access is only aligned
if it starts at byte 0, i.e., if B[0] = 1. A word access is aligned if it
starts at byte 0 or 4, and a half word access is aligned if it starts at
an even byte. Thus, the misalignment can be detected by

\[
malAc = D \wedge \overline{B[0]} \;\vee\; W \wedge (B[0] \;\mathrm{NOR}\; B[4])
        \;\vee\; H \wedge MAR[0]
\]

Circuit DMC checks for a misaligned access (signaled by dmal) on every
load and store instruction. In order to protect the data memory, it masks
the memory read and write signals Dmr and Dmw with flag dmal. Thus

\[
\begin{aligned}
dmal &= (Dmr.3 \vee Dmw.3) \wedge malAc \\
Dmra &= Dmr.3 \wedge \overline{malAc} = \overline{Dmr.3} \;\mathrm{NOR}\; malAc \\
Dmwa &= Dmw.3 \wedge \overline{malAc} = \overline{Dmw.3} \;\mathrm{NOR}\; malAc
\end{aligned}
\]

Table 9.13: Memory bank write signals DMbw[7:0] as a function of the
address MAR[2:0] and the width (B, H, W, D) of the access

address     width of the access
MAR[2:0]    D           W           H           B
000         1111 1111   0000 1111   0000 0011   0000 0001
001         0000 0000   0000 0000   0000 0000   0000 0010
010         0000 0000   0000 0000   0000 1100   0000 0100
011         0000 0000   0000 0000   0000 0000   0000 1000
100         0000 0000   1111 0000   0011 0000   0001 0000
101         0000 0000   0000 0000   0000 0000   0010 0000
110         0000 0000   0000 0000   1100 0000   0100 0000
111         0000 0000   0000 0000   0000 0000   1000 0000
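As a plausibility check, the bank write equations and the misalignment
test can be evaluated in a short Python sketch and compared against
table 9.13. The function name dmc and the encoding of the width as one
of the letters B, H, W, D are assumptions made for this example only.

```python
WIDTH = {"B": 1, "H": 2, "W": 4, "D": 8}   # access width d in bytes


def dmc(mar, width, dmw=1, dmr=0):
    """Return (DMbw[7:0] as an integer, malAc, dmal) for MAR[2:0]."""
    b = [int((mar & 7) == j) for j in range(8)]   # 3-decoder outputs B[7:0]
    B, H, W, D = (int(width == k) for k in "BHWD")
    bw = [0] * 8
    bw[0] = b[0]
    bw[1] = D & b[0] | W & b[0] | H & b[0] | B & b[1]
    bw[2] = D & b[0] | W & b[0] | H & b[2] | B & b[2]
    bw[3] = D & b[0] | W & b[0] | H & b[2] | B & b[3]
    bw[4] = D & b[0] | W & b[4] | H & b[4] | B & b[4]
    bw[5] = D & b[0] | W & b[4] | H & b[4] | B & b[5]
    bw[6] = D & b[0] | W & b[4] | H & b[6] | B & b[6]
    bw[7] = D & b[0] | W & b[4] | H & b[6] | B & b[7]
    dmbw = sum((dmw & bit) << j for j, bit in enumerate(bw))
    # malAc = D /\ not B[0]  \/  W /\ (B[0] NOR B[4])  \/  H /\ MAR[0]
    mal = D & (1 - b[0]) | W & (1 - (b[0] | b[4])) | H & (mar & 1)
    dmal = (dmr | dmw) & mal
    return dmbw, mal, dmal
```

For an aligned d-byte access at offset a, the result is a block of d
ones starting at bit a; for a misaligned access, all bank write signals
stay inactive and malAc is raised.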

When reusing common subexpressions, the memory control DMC has
the cost

\[
C_{DMC} = C_{dec}(3) + 6 \cdot C_{inv} + 32 \cdot C_{and} + 20 \cdot C_{or}
          + 4 \cdot C_{nor} + C_{ff}(8)
\]

This includes the 8-bit register DMBw which buffers the bank write
signals. Signals DMBw are still provided at zero delay. The accumulated
delay ADMC of the remaining outputs and the cycle time of circuit DMC
run at

\[
\begin{aligned}
A_{DMC} &= \max\{A_{CON}(Dmr, Dmw) + D_{or} + D_{and},\;
           \max\{D_{dec}(3),\, D_{or} + 2 \cdot D_{and}\} + 2 \cdot D_{and} + 2 \cdot D_{or}\} \\
T_{DMC} &= A_{DMC} + \Delta
\end{aligned}
\]

9.2.3 Write Back Stage

The DLX architecture now comprises three register files, one for the fixed
point registers GPR, one for the special purpose registers SPR, and one for
the floating point registers FPR. These three register files form the
environment RFenv:

\[ C_{RFenv} = C_{GPRenv} + C_{SPRenv} + C_{FPRenv} \]

The data paths of the write back stage consist of the environment RFenv
and of the shifter environment SH4Lenv. Environment GPRenv is the only
environment which remains unchanged.

[Figure 9.7: Shift for load environment SH4Lenv]

Environment SH4Lenv
In addition to the fixed point result C', the environment SH4Lenv now also
provides a 64-bit floating point result FC'. The environment is controlled
by two signals,

• signal load.4 indicating a load instruction and
• signal dbr.4 indicating a double precision result.

The fixed point result C' is computed almost as before, but the memory
now provides a double word MDRr. The shifter SH4L still requires a 32-bit
input data MDs. Depending on the address bit C.4[2], MDs equals either
the high or the low order word of MDRr:

\[
MDs =
\begin{cases}
MDRr[63:32] & \text{if } C.4[2] = 1 \text{ (high order word)} \\
MDRr[31:0] & \text{if } C.4[2] = 0 \text{ (low order word)}
\end{cases}
\]

Let sh4l(a, dist) denote the function computed by the shifter SH4L. The
fixed point result C' is then selected as

\[
C' =
\begin{cases}
sh4l(MDs,\; C.4[1:0]000) & \text{if } load.4 = 1 \\
C.4 & \text{if } load.4 = 0
\end{cases}
\]
Depending on the type of the instruction, the output FC' is selected
among the two 64-bit inputs FC.4 and MDRr and the 32-bit word MDs,
which is extended according to the embedding convention. On a load
instruction, the environment passes the memory operand, which equals
MDRr in case of double precision and the embedded word MDs otherwise.
On any other instruction, the environment forwards the FPU result FC.4
to the output FC'. Thus,

\[
FC'[63:0] =
\begin{cases}
MDRr & \text{if } load.4 = 1 \wedge dbr.4 = 1 \\
MDs\; MDs & \text{if } load.4 = 1 \wedge dbr.4 = 0 \\
FC.4 & \text{otherwise}
\end{cases}
\]

[Figure 9.8: Environment SPRenv of the special purpose register file]

The circuit of figure 9.7 implements environment SH4Lenv in the obvious
way. Its cost and delay can be expressed as

\[
\begin{aligned}
C_{SH4Lenv} &= C_{SH4L} + 2 \cdot C_{mux}(32) + 2 \cdot C_{mux}(64) \\
D_{SH4Lenv} &= \max\{D_{SH4L},\, D_{mux}\} + 2 \cdot D_{mux}
\end{aligned}
\]
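The selection of the two outputs of SH4Lenv can be sketched in Python as
follows. This is only an illustration of the case distinctions: the
shifter function sh4l is defined elsewhere in the design, so it is passed
in as a parameter here, and the function name sh4lenv is an assumption of
this example.

```python
M32 = 0xFFFFFFFF


def sh4lenv(mdrr, c4, fc4, load, dbr, sh4l):
    """Return (C', FC') of the shift for load environment.

    sh4l(a, dist) stands for the function of the shifter SH4L and
    must be supplied by the caller (hypothetical stand-in).
    """
    # MDs: high order word if C.4[2] = 1, low order word otherwise
    mds = (mdrr >> 32) & M32 if (c4 >> 2) & 1 else mdrr & M32
    # C': shifted memory word on loads, C.4 otherwise;
    # the shift distance is C.4[1:0]000, i.e. the byte offset times 8
    c_out = sh4l(mds, (c4 & 0b11) << 3) if load else c4
    # FC': double word, embedded single word, or FPU result
    if load and dbr:
        fc_out = mdrr
    elif load:
        fc_out = (mds << 32) | mds      # embedding convention: MDs MDs
    else:
        fc_out = fc4
    return c_out, fc_out
```

A simple right shift can serve as a placeholder for sh4l when trying the
sketch; the real shifter also handles sign extension and access width.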

Environment of the Special Purpose Register File


Figure 9.8 depicts the environment SPRenv of the special purpose register
file. Due to the FPU, the register file SPR comprises the original six
special purpose registers fxSPR of the fixed point core and the FPU
control registers FCR (table 9.6).
The core of SPRenv is a special register file of size 9 × 32. The
circuits fxSPRsel and FCRsel provide the inputs Di[s] of the distinct
write ports. As before, circuit SPRcon generates the write signals
SPRw[8:0] and signal sel which is used by fxSPRsel. The environment is
controlled by

• the interrupt signals JISR and repeat,
• the write signal SPRw, and
• signal fop.4 denoting an arithmetic floating point instruction, a
  conversion cvt, or a test fc.

As before, the special purpose registers are held in a register file with an
extended access mode. Any register SPR[s] can be accessed through the
regular read/write port and through a distinct read port and a distinct
write port. In case of a conflict, a special write takes precedence over
the write access specified by address Sad. Thus, for any s ∈ {0, ..., 8},
register SPR[s] is updated as

\[
SPR[s] :=
\begin{cases}
Di[s] & \text{if } SPRw[s] = 1 \\
C.4 & \text{if } SPRw[s] = 0 \wedge SPRw = 1 \wedge s = \langle Sad \rangle
\end{cases}
\]

The distinct read port of register SPR[s] provides the data

\[ Do[s] = SPR[s] \]

and the standard data output of the register file equals

\[ Sout = SPR[\langle Sas \rangle] \]
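The precedence rule of the extended access mode can be made concrete with
a minimal Python sketch of one register file update. The function name
spr_step and the list based representation of the nine registers are
assumptions of this example, not part of the design.

```python
def spr_step(spr, di, sprw_vec, sprw, sad, c4):
    """One update of the 9 x 32 special register file (sketch).

    spr      : current register contents SPR[0..8]
    di       : inputs Di[0..8] of the distinct write ports
    sprw_vec : write signals SPRw[0..8] of the distinct write ports
    sprw, sad, c4 : regular write signal, write address and data C.4
    """
    nxt = list(spr)
    for s in range(9):
        if sprw_vec[s]:                 # special write takes precedence
            nxt[s] = di[s]
        elif sprw and s == sad:         # regular write through address Sad
            nxt[s] = c4
    return nxt
```

If both a special write and a regular write target the same register,
the register receives Di[s]; registers targeted by neither keep their
value.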

Registers fxSPR The registers fxSPR still have the original functionality.
The write signals of their distinct write ports and signal sel are
generated as before:

\[
\begin{aligned}
sel &= \overline{repeat} \wedge SPRw \wedge (Sad.4 = 0) \\
SPRw[s] &= JISR
\end{aligned}
\]

Circuit fxSPRsel which selects the inputs Di[s] of these write ports can be
taken from the DLX design of section 5 (figure 5.6).

Registers FCR Although the rounding mode RM, the IEEE flags and the
condition flag FCC only require a few bits, they are held in 32-bit
registers. The data are padded with leading zeros.
The condition flag FCC can be updated by a special move movi2s or by
a floating point condition test. Since in either case the result is
provided by register C.4, the distinct write port of register FCC is not
used. Thus,

\[ Di[8] = 0 \quad\text{and}\quad SPRw[8] = 0 \]

The rounding mode RM can only be updated by a movi2s instruction.
Thus,

\[ Di[6] = 0 \quad\text{and}\quad SPRw[6] = 0 \]

Except for the data transfers, any floating point instruction provides
flags which signal the five floating point exceptions (overflow,
underflow, inexact result, division by zero, and invalid operation). The
IEEE standard requires that these exception flags are accumulated, i.e.,
that the new flags Ffl.4 are ORed to the corresponding bits of register
IEEEf:

\[
Di[7] = 0^{27}\,(Ffl.4[4:0] \vee IEEEf[4:0]) \quad\text{and}\quad SPRw[7] = fop.4
\]

Cost and Delay The new select circuit FCRsel just requires a 5-bit OR
gate. Due to the 4-bit address Sad, circuit SPRcon now uses a 4-bit zero
tester; SPRcon can provide the additional write signals SPRw[8:6] at no
cost. Thus, the cost of the extended environment SPRenv runs at

\[
\begin{aligned}
C_{SPRenv} &= C_{SF}(9, 32) + C_{fxSPRsel} + C_{FCRsel} + C_{SPRcon} \\
C_{SPRcon} &= 2 \cdot C_{and} + C_{inv} + C_{zero}(4) \\
C_{FCRsel} &= 5 \cdot C_{or}
\end{aligned}
\]

Except for the width of address Sad, the formulae which express the delay
of the outputs and the cycle time of environment SPRenv remain unchanged.

Environment of the Floating Point Register File


The extended DLX instruction set requires 32 single precision floating
point registers and 16 double precision registers. These two sets of floating
point registers have to be mapped into the same register file FPR (section
9.1). In each cycle, the environment FPRenv of the floating point register
file performs two double precision read accesses and one write access.

Read Access The register file environment FPRenv provides the two
source operands fA and fB. Since both operands have double precision,
they can be specified by 4-bit addresses FS1[4:1] and FS2[4:1]:

\[
\begin{aligned}
fA[63:0] &= (FPR[FS1[4:1]\,1],\; FPR[FS1[4:1]\,0]) \\
fB[63:0] &= (FPR[FS2[4:1]\,1],\; FPR[FS2[4:1]\,0])
\end{aligned}
\]

For the high order word the least significant address bit is set to 1 and
for the low order word it is set to 0.

Write Access The 64-bit input FC' or its low order word FC'[31:0] is
written into the register file. The write access is governed by the write
signal FPRw and the flag dbr.4 which specifies the width of the access.
In case of single precision, the single precision result is kept in the
high and the low order word of FC', due to the embedding convention. Thus,
on FPRw = 1 and dbr.4 = 0, the register with address Fad4 is updated to

\[ FPR[Fad4[4:0]] := FC'[63:32] = FC'[31:0] \]

On FPRw = 1 and dbr.4 = 1, the environment FPRenv performs a double
precision write access updating two consecutive registers:

\[
\begin{aligned}
FPR[Fad4[4:1]\,1] &:= FC'[63:32] \\
FPR[Fad4[4:1]\,0] &:= FC'[31:0]
\end{aligned}
\]

[Figure 9.9: Environment FPRenv of the floating point register file]

Implementation In order to support single as well as double precision
accesses, the floating point register file is split in two banks, each of
which provides 16 single precision registers (figure 9.9). One bank holds
the registers with even addresses, the other bank holds the registers with
odd addresses.
The high order word of a double precision result is written into the odd
bank of the register file and its low order word is written into the even
bank. In case of single precision, the high and low order word of input
FC' are identical. Thus, FC'[63:32] always serves as the input Dod of the
odd bank, and FC'[31:0] always serves as the input Dev of the even bank
(table 9.14).
Each bank of the register file FPR is implemented as a 3-port RAM of
size 16 × 32 addressed by FS1[4:1], FS2[4:1] and Fad4[4:1]. Including
circuit FPRcon which generates the two bank write signals wev and wod,
the cost and delay of the register file environment FPRenv run at

\[
\begin{aligned}
C_{FPRenv} &= 2 \cdot C_{ram3}(16, 32) + C_{FPRcon} \\
D_{FPR,read} &= D_{ram3}(16, 32) \\
D_{FPR,write} &= D_{FPRcon} + D_{ram3}(16, 32)
\end{aligned}
\]

In case of a double precision write access (dbr.4 = 1), both banks of the
register file are updated, whereas on a single precision access, only one
of the banks is updated, namely the one specified by the least significant
address bit Fad4[0] (table 9.14). Of course, the register file FPR is only
updated if requested by an active write signal FPRw = 1. Thus, the two
bank write signals are

\[
\begin{aligned}
wod &= FPRw \wedge (dbr.4 \vee Fad4[0]) \\
wev &= FPRw \wedge (dbr.4 \vee \overline{Fad4[0]})
\end{aligned}
\]

Table 9.14: The input data Dev and Dod of the two FPR banks and their
write signals wev and wod

FPRw   dbr.4   Fad4[0]   wod   wev   Dod          Dev
1      0       0         0     1     *            FC'[31:0]
1      0       1         1     0     FC'[63:32]   *
1      1       *         1     1     FC'[63:32]   FC'[31:0]
0      *       *         0     0     *            *

The control FPRcon of the FPR register file can generate these two write
signals at the following cost and delay:

\[
\begin{aligned}
C_{FPRcon} &= 2 \cdot C_{and} + 2 \cdot C_{or} + C_{inv} \\
D_{FPRcon} &= D_{inv} + D_{or} + D_{and}
\end{aligned}
\]
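The two-bank organization can be illustrated by a small behavioral model
in Python. The class name FPR and its method names are assumptions made
for this sketch; the model only mirrors the read and write rules and the
bank write signals wod and wev given above.

```python
M32 = 0xFFFFFFFF


class FPR:
    """Behavioral sketch of the floating point register file (two banks)."""

    def __init__(self):
        self.even = [0] * 16    # registers with even addresses
        self.odd = [0] * 16     # registers with odd addresses

    def read(self, fs):
        """Double precision read with the 4-bit address FS[4:1]."""
        return (self.odd[fs] << 32) | self.even[fs]

    def write(self, fad, fc, fprw, dbr):
        """Write FC'[63:0] under control of FPRw, dbr.4 and Fad4[4:0]."""
        lsb = fad & 1
        wod = fprw & (dbr | lsb)          # wod = FPRw /\ (dbr.4 \/ Fad4[0])
        wev = fprw & (dbr | (1 - lsb))    # wev = FPRw /\ (dbr.4 \/ not Fad4[0])
        if wod:
            self.odd[fad >> 1] = (fc >> 32) & M32
        if wev:
            self.even[fad >> 1] = fc & M32
```

A single precision write to an odd register only touches the odd bank,
while a double precision write updates the same row in both banks.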

9.2.4 Execute Stage

The execute environment EXenv is the core of the execute stage (figure
9.10). Parts of the buffer environment and of the cause environment CAenv
also belong to the execute stage. The buffers pass the PCs, the destination
addresses and the instruction opcodes down the pipeline. Environment
CAenv collects the interrupt causes and then processes them in the memory
stage.

Execute Environment
Environment EXenv comprises the 32-bit fixed point unit FXU, the 64-bit
floating point unit FPU of chapter 8, and the exchange unit FPXtr. It gets
the same fixed point operands as before (A, B, S, co, link) and the two
floating point operands FA2 and FB2.

Fixed Point Unit FXU The FXU equals the execute environment of the
DLX architecture from section 5.5.4. The functionality, the cost and delay
of this environment remain unchanged. The FXU still provides the two
fixed point results D and sh and is controlled by the same signals:

• bmuxsel and a muxsel which select the operands,
• AluDdoe, SHDdoe, linkDdoe, ADdoe, SDdoe, and coDdoe which
  select output D,
• Rtype, add and test which govern the ALU, and
• shift4s which governs the shifter SH.

[Figure 9.10: Data paths of the execute stage]

[Figure 9.11: Exchange unit FPXtr]

Exchange Unit FPXtr The FPXtr unit transfers data between the fixed
point and the floating point core or within the floating point core. It is
controlled by

• signal store.2 indicating any store instruction,
• signal fstore.2 indicating a floating point store instruction, and
• signal fmov.2 indicating a floating point move fmov.

The operands B[31:0], FA[63:0] and FB[63:0] are directly taken from
registers; operand sh[31:0] is provided by the shifter of the fixed point
unit. Circuit FPXtr selects a 69-bit result tfp and a 32-bit result tfx.
The bits tfp[63:0] code either a floating point or a fixed point value,
whereas the bits tfp[68:64] hold the floating point exception flags.
According to the IEEE floating point standard [Ins85], data move
instructions never cause a floating point exception. This applies to
stores and the special moves mi2f and mf2i. Thus, the exchange unit
selects the results as

\[
\begin{aligned}
tfx[31:0] &= FA[31:0] \\
tfp[63:0] &=
\begin{cases}
FB[63:0] & \text{if } fstore.2 \\
sh[31:0]\; sh[31:0] & \text{if } store.2 \wedge \overline{fstore.2} \\
FA[63:0] & \text{if } fmov.2 \\
B[31:0]\; B[31:0] & \text{otherwise}
\end{cases} \\
tfp[68:64] &= 00000
\end{aligned}
\]

The circuit of figure 9.11 implements the exchange unit in the obvious
way. Assuming that the control signals of the execute stage are
precomputed, cost and accumulated delay of environment FPXtr run at

\[
\begin{aligned}
C_{FPXtr} &= 3 \cdot C_{mux}(64) \\
A_{FPXtr} &= A_{FXU}(sh) + 2 \cdot D_{mux}(64)
\end{aligned}
\]
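The case distinction of the exchange unit can be written down as a short
Python sketch. The function name fpxtr is an assumption of this example;
the order of the tests follows the case list above, and the exception
flags tfp[68:64] are zero, so the returned tfp value is simply the 64-bit
data word.

```python
M32 = 0xFFFFFFFF


def fpxtr(fa, fb, b, sh, store, fstore, fmov):
    """Return (tfx[31:0], tfp[68:0]) of the exchange unit (sketch)."""
    dup = lambda w: ((w & M32) << 32) | (w & M32)   # 32-bit word twice
    tfx = fa & M32
    if fstore:
        tfp = fb            # floating point store
    elif store:
        tfp = dup(sh)       # fixed point store
    elif fmov:
        tfp = fa            # floating point move
    else:
        tfp = dup(b)        # move from the fixed point core
    return tfx, tfp         # tfp[68:64] = 00000
```

Duplicating the 32-bit operands matches the embedding convention used for
single precision data elsewhere in the design.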
Functionality of EXenv Environment EXenv generates two results, the
fixed point value D' and the 69-bit result R. R[63:0] is either a fixed
point or a floating point value; the bits R[68:64] provide the floating
point exception flags. Circuit EXenv selects output D' among the result D
of the FXU, the result tfx of the exchange unit, and the condition flag
fcc of the FPU. This selection is governed by the signals mf2i and fc
which denote a special move instruction mf2i or a floating point compare
instruction, respectively:

\[
D'[31:0] =
\begin{cases}
D[31:0] & \text{if } mf2i = 0 \wedge fc = 0 \\
tfx[31:0] & \text{if } mf2i = 1 \wedge fc = 0 \\
0^{31}\, fcc & \text{if } mf2i = 0 \wedge fc = 1
\end{cases}
\]

The selection of result R is controlled by the four enable signals FcRdoe,
FpRdoe, FxRdoe and tfpRdoe. At most one of these signals is active at a
time. Thus,

\[
R[68:0] =
\begin{cases}
Fc[68:0] & \text{if } FcRdoe = 1 \\
Fp[68:0] & \text{if } FpRdoe = 1 \\
Fx[68:0] & \text{if } FxRdoe = 1 \\
tfp[68:0] & \text{if } tfpRdoe = 1
\end{cases}
\]

Cost and Cycle Time Adding an FPU has no impact on the accumulated
delay AFXU of the results of the fixed point core FXU. The FPU itself
comprises five pipeline stages. Its cycle time is modeled by TFPU and the
accumulated delay of its outputs is modeled by AFPU (chapter 8). Thus,
cost and cycle time of the whole execute environment EXenv can be
estimated as

\[
\begin{aligned}
C_{EXenv} &= C_{FXU} + C_{FPU} + C_{FPXtr} + 2 \cdot C_{driv}(32) + 4 \cdot C_{driv}(69) \\
A_{EXenv} &= \max\{A_{FXU},\; \max\{A_{FPXtr},\, A_{FPU}\} + D_{driv}\} \\
T_{EXenv} &= \max\{T_{FPU},\; A_{EXenv} + \Delta\}
\end{aligned}
\]

 -   1-    


In the previous designs, the execute stage always had a single cycle
latency, but now its latency is not even fixed. The FXU and the exchange
unit still generate their results within a single cycle. However, the
latency of the FPU depends on the operation and precision (table 9.1); it
varies between 1 and 21 cycles.

Table 9.15: Cycles required for the actual execution depending on the
type of the instruction (i.e., stages EX and M)

stage    fdiv.d     fdiv.s     fmul     fadd,    cvt      fc, fabs,  fmov, mi2f,  rest
                                        fsub              fneg       mf2i
2.0      unpack     unpack     unpack   unpack   unpack   FCon       FPXtr        FXU
2.0.1    lookup
2.0.2    newton1
2.0.3    newton2
2.0.4    newton3
2.0.5    newton4    lookup
2.0.6    newton1    newton1
2.0.7    newton2    newton2
2.0.8    newton3    newton3
2.0.9    newton4    newton4
2.0.10   newton1    newton1
2.0.11   newton2    newton2
2.0.12   newton3    newton3
2.0.13   newton4    newton4
2.0.14   quotient1  quotient1
2.0.15   quotient2  quotient2
2.0.16   quotient3  quotient3
2.1      quotient4  quotient4  mul1     add1
2.2      select fd  select fd  mul2     add2
2.3      round 1    round 1    round 1  round 1  round 1
2.4      round 2    round 2    round 2  round 2  round 2
3        stage M    stage M    stage M  stage M  stage M  stage M    stage M      stage M

Due to the iterative nature of the Newton-Raphson algorithm, a division
passes the multiply/divide unit of the FPU several times. All the other
instructions pass the units of the execute stage just once. Since
divisions complicate the scheduling of the execute stage considerably,
they are handled separately.

1-   


Except on divisions, the execute stage has a latency of 1 to 5 cycles.
Thus, the data paths of environment EXenv are divided into 5 substages,
numbered 2.0 to 2.4. The DLX instructions have different latencies and use
these stages as indicated in table 9.15.
Every instruction starts its execution in stage 2.0. Except for divisions,
the instructions leave stage 2.0 after a single cycle. They may bypass some
of the substages:

• Floating point additions, subtractions and multiplications continue
  in stage 2.1 and are then processed in stages 2.2 to 3.
• Format conversions cvt continue in stages 2.3, 2.4 and 3.
• All the remaining instructions leave the execute stage after substage
  2.0 and continue in the memory stage 3.

After the unpacking, a division is kept in stage 2.0 for another 12 or 16
cycles, depending on the precision. During these cycles, it is processed
in the multiply/divide unit, which is assigned to stages 2.1 and 2.2. Once
the division has left stage 2.0, it passes through stages 2.1 to 4 almost
like a multiplication.
An instruction and its precomputed data must pass through the pipeline
stages at the same speed. Thus, a mechanism is needed which lets the
interrupt causes, the buffered data and the precomputed control signals
fall through some stages as well. The Result Shift Register RSR is such a
mechanism.

Result Shift Register
An n-bit result shift register RSR is a kind of queue with f entries
R_1, ..., R_f, each of which is n bits wide. In order to account for the
different latencies, the RSR can be entered at any stage, not just at the
first stage. The RSR (figure 9.12) is controlled by

• a distinct clock signal ce_i for each of the f registers R_i,
• a common clear signal clr, and
• a distinct write signal w_i for each of the f registers R_i.

The whole RSR is cleared on an active clear signal. Let T and T + 1
denote successive clock cycles. For any 1 ≤ i ≤ f, an active signal
clr^T = 1 implies

\[ R_i^{T+1} = 0 \]

On an inactive clear signal clr^T = 0, the entries of the RSR are shifted
one stage ahead, and the input Din is written into the stage i with
w_i^T = 1, provided the corresponding register is clocked:

\[
R_i^{T+1} =
\begin{cases}
Din^T & \text{if } ce_i^T = 1 \wedge w_i^T = 1 \\
R_{i-1}^T & \text{if } ce_i^T = 1 \wedge w_i^T = 0 \wedge i > 1 \\
0^n & \text{if } ce_i^T = 1 \wedge w_i^T = 0 \wedge i = 1
\end{cases}
\]

[Figure 9.12: Schematics of an n-bit result shift register RSR with f entries]

[Figure 9.13: Realization of an n-bit result shift register RSR with f entries]

The following lemma states that data Din which are clocked into stage i in
cycle T are passed down the RSR, provided the clear signal stays inactive,
the corresponding registers are clocked at the right time and they are not
overwritten.

Lemma. Let Din enter register R_i at cycle T, i.e., w_i^T = 1,
ce_i^T = 1 and clr^T = 0. For all t ∈ {1, ..., f − i} let

\[ w_{i+t}^{T+t} = 0, \qquad ce_{i+t}^{T+t} = 1 \qquad\text{and}\qquad clr^{T+t} = 0 \]

then

\[ Din^T = R_i^{T+1} = R_{i+t}^{T+t+1} = R_f^{T+f-i+1} \]


The result shift register can be operated in a particularly simple way, if
all clock enable signals ce_i are tied to a common clock enable signal ce.
If the RSR is operated in this way, and if ce-cycles T are considered, then
the hypothesis ce_{i+t}^{T+t} = 1 is automatically fulfilled.
Figure 9.13 depicts an obvious realization of an n-bit RSR with f entries.
Its cost can be expressed as

\[
C_{RSR}(f, n) = f \cdot (C_{ff}(n) + C_{and}(n) + C_{mux}(n) + C_{or}) + C_{inv}
\]

[Figure 9.14: Buffer environment of the design with FPU]

The outputs R of the RSR have zero delay; the inputs r of its registers are
delayed by a multiplexer and an AND gate:

\[
\begin{aligned}
D_{RSR}(r) &= D_{and} + D_{mux} \\
D_{RSR}(R) &= 0
\end{aligned}
\]
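The behavior of the RSR and the statement of the lemma can be tried out
with a small cycle based Python model. The class name RSR and the use of
lists indexed 1..f for the signals w and ce are assumptions of this
sketch; index 0 is unused for the signals and holds the constant 0^n for
the register contents.

```python
class RSR:
    """Cycle based sketch of a result shift register with f entries."""

    def __init__(self, f):
        self.f = f
        self.R = [0] * (f + 1)      # R[1..f]; R[0] models the constant 0^n

    def step(self, din, w, ce, clr=0):
        """Advance one clock cycle; w[i], ce[i] are used for i = 1..f."""
        old = list(self.R)
        for i in range(1, self.f + 1):
            if clr:
                self.R[i] = 0                       # clear takes priority
            elif ce[i]:
                # write Din into stage i or shift the previous entry ahead
                self.R[i] = din if w[i] else old[i - 1]
```

Entering data at stage i = 3 of a 5-stage RSR and keeping all registers
clocked without further writes moves the data to R_4 and then R_5, as the
lemma predicts for f − i = 2 further cycles.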

Buffer Environment


The buffer environment (figure 9.14) buffers the opcodes, the PCs, and the
destination addresses of the instructions processed in the stages 2.0 to 4.
Due to the FPU, the environment now buffers an additional destination
address, namely the address Fad for the floating point register file. In
order to account for the different latencies of the FPU, a 5-stage RSR is
added between the execute substage 2.0 and the write back stage in the
obvious way. The RSR is cleared on reset and clocked with the update enable
signals ue.2.[0:4] provided by the stall engine.
The buffer environment still provides its outputs at zero delay. The cost
and cycle time now run at

\[
\begin{aligned}
C_{buffers} &= C_{RSR}(5, 123) + C_{ff}(123) + C_{ff}(27) \\
T_{buffers} &= \max\{A_{Daddr},\; A_{CON}(ue, RSRw)\} + D_{RSR}(r) + \Delta
\end{aligned}
\]

Cause Environment


As described in section 5.5.5, the cause environment of figure 9.15
consists of two subcircuits. Circuit CAcol collects the interrupt causes,
and circuit CApro processes them. Adding an FPU impacts the cause
environment in two ways:

• Due to the different latencies of the execute environment, a 5-stage
  RSR is added in the collection circuit.
• The floating point unit adds 6 new internal interrupts, which are
  assigned to the interrupt levels 7 to 12 (table 9.7).

[Figure 9.15: Schematics of the cause environment CAenv]

Cause Collection The interrupt events of the fetch and decode stage are
collected in the registers CA.1 and CA.2, as before. These data are then
passed through a 5-stage RSR.
An illegal instruction, a trap and a fixed point overflow are still
detected in the execute stage and clocked into register CA.3. Since these
events cannot be triggered by a legal floating point instruction, the
corresponding instruction always passes from stage 2.0 directly to stage 3.
The floating point exceptions are also detected in the execute stage.
These events can only be triggered by a floating point instruction, which
is signaled by fop? = 1. Circuit CAcol therefore masks the events with
flag fop?. The 'unimplemented floating point operation' interrupt uFOP
is signaled by the control in stage ID. The remaining floating point
events correspond to the IEEE flags provided by the FPU. Environment CAcol
gets these flags from the result bus R[68:64].
Let T'_{CAcol} denote the cycle time of circuit CAcol used in the design
without FPU. Cost and cycle time of the extended cause collection circuit
can then be expressed as

\[
\begin{aligned}
C_{CAcol} &= 6 \cdot C_{and} + C_{or} + 13 \cdot C_{ff} + C_{RSR}(5, 3) \\
T_{CAcol} &= \max\{T'_{CAcol},\; A_{CON}(uFOP) + \Delta,\; A_{FPU} + D_{driv} + \Delta\}
\end{aligned}
\]

Cause Processing Without FPU, the interrupt levels 7 to 12 are assigned
to external interrupts which are maskable. Now, these interrupt levels are
used for the FPU interrupts. Except for the interrupt uFOP, which is
assigned to level 12, these interrupts are maskable. Compared to the
original circuit of figure 5.10, one just saves the AND gate for masking
event CA.4[12]. Thus,

\[
C_{CApro} = C_{and}(25) + C_{tree}(32) \cdot C_{or} + C_{ff}(34) + C_{CAtype}
\]

9.3 Control of the Prepared Sequential Design

As in previous DLX designs (chapters 4 and 5), the control of the
prepared sequential data paths is derived in two steps. We start out
with a sequential control automaton which is then turned into precomputed
control.
Figures 9.16 to 9.18 depict the FSD underlying the sequential control
automaton. To a large extent, specifying the RTL instructions and active
control signals for each state of the FSD is routine. The complete specifi-
cation can be found in appendix B.
The portion of the FSD modeling the execution of the fixed point in-
structions remains the same. Thus, it can be copied from the design of
chapter 5 (figure 5.12). In section 8.3.6, we have specified an automaton
which controls the multiply/divide unit. Depending on the precision, the
underlying FSD is unrolled two to three times and is then integrated in the
FSD of the sequential DLX control automaton.
Beyond the decode stage, the FSD has an outdegree of one. Thus, the
control signals of the execute, memory and write back stage can be
precomputed. However, the nonuniform latency of the floating point
instructions complicates the precomputed control in two respects:

• The execute stage consists of 5 substages. Fast instructions bypass
  some of these substages.
• Due to the iterative nature of the division algorithm, the execution of
  divisions is not fully pipelined. A division passes the multiply/divide
  unit several times. That requires a patch of the precomputed control
  (section 9.3.2).

Thus, we first construct a precomputed control ignoring divisions.

[Figure 9.16: FSD underlying the control of the DLX architecture with FPU.
The portions modeling the execution of the fixed point instructions and of
the floating point arithmetic are depicted in figures 5.12, 9.17 and 9.18.]

[Figure 9.17: FSD modeling the execution of arithmetical floating point
operations with double precision results]

[Figure 9.18: FSD modeling the execution of arithmetical floating point
operations with single precision results]

[Figure 9.19: Precomputed control of the FDLX design without divider]

9.3.1 Precomputed Control without Division

Like in previous designs (e.g., chapter 4), the control signals for the ex-
ecute, memory and write back stages are precomputed during ID. The
signals are then passed down the pipeline together with the instruction.
However, fast instructions bypass some of the execute stages. In order to
keep up with the instruction, the precomputed control signals are, like the
interrupt causes, passed through a 5-stage RSR (figure 9.19).

   88


Depending on the type of the instruction, the latency of the execute stage
varies between 1 and 5 cycles. However, the latency is already known in
the states of stage 2.0:

Floating point multiplication, addition and subtraction all have a 5-


cycle latency. This is signaled by an active flag lat5  1. The cor-
responding states of stage 2.0 are ,$ # ,$# ,  # , #
,$ and ,$.
&/&
  *#
  Classification of the precomputed control signals C ONTROL OF THE
P REPARED
type x.0 x.1 x.2 x.3 x.4 y z
S EQUENTIAL
number 31 7 3 0 3 3 6
D ESIGN

Format conversions have a 3-cycle latency (lat3  1). Their execu-


tion starts in the states +& # +& # +&# +& # +& and
+& .
The remaining instructions have a single cycle latency, signaled by
lat1  1.

When leaving stage 2.0, an instruction with single cycle latency continues
in stage 3. Instructions with a latency of 3 or 5 cycles continue in stage
2.3 or 2.1, respectively. The write signals of the RSRs can therefore be
generated as

\[
RSRw[1:5] =
\begin{cases}
10000 & \text{if } lat5 = 1 \\
00100 & \text{if } lat3 = 1 \\
00001 & \text{if } lat1 = 1
\end{cases}
\;=\; lat5\ 0\ lat3\ 0\ lat1
\qquad (9.1)
\]
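
The one-hot coding of equation 9.1 can be illustrated with a small
sketch. It is not part of the design itself; the function names and the
list RSR_STAGES are made up for this illustration, and exactly one active
latency flag per instruction is assumed.

```python
# Stages fed by the RSR registers 1..5 (naming assumed from the text).
RSR_STAGES = ["2.1", "2.2", "2.3", "2.4", "3"]

def rsr_write_signals(lat1, lat3, lat5):
    """Return RSRw[1:5] as a bit string: lat5 0 lat3 0 lat1."""
    # Assumption: exactly one latency flag is active per instruction.
    assert lat1 + lat3 + lat5 == 1
    return "{}0{}0{}".format(lat5, lat3, lat1)

def stage_after_2_0(lat1, lat3, lat5):
    """Stage that an instruction enters when it leaves stage 2.0."""
    w = rsr_write_signals(lat1, lat3, lat5)
    return RSR_STAGES[w.index("1")]
```

For a 5-cycle instruction the write signal selects the first RSR
register (stage 2.1), for a single cycle instruction the last one
(stage 3).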

 - -   88


Without an FPU, there are three types of precomputed control signals:

type x signals just control the stage EX,

type y signals control stages EX and M, and

type z signals control the stages EX, M and WB.

The execute stage now consists of five substages. Thus, the signals of type
x are split into five groups x.0, ..., x.4 with the obvious meaning.
Tables B.12 and B.14 (appendix B) list all the precomputed control
signals sorted according to their type. The signals x.0 comprise all the
x-type signals of the DLX design without FPU. In addition, this type
includes the signals specifying the latency of the instruction and the
signals controlling the exchange unit FPXtr and the first stage of the FPU.
The stages 2.1 up to 4 are governed by 22 control signals (table 9.16).
These signals could be passed through a standard 5-stage RSR which is 22
bits wide. However, signals of type x.i are only needed up to stage 2.i.
We therefore reduce the width of the RSR registers accordingly. The cost
of the RSR and of the precomputed control can then be estimated as

\[ C_{ConRSR} = C_{inv} + 5 \cdot C_{or} + (22 + 15 + 2 \cdot 12 + 9) \cdot (C_{and} + C_{mux} + C_{ff}) \]
\[ C_{preCon} = C_{ConRSR} + C_{ff}(53) + C_{ff}(6) \]

Thus, the RSR only buffers a total of 70 bits instead of 110 bits. Compared
to a standard 22-bit RSR, that cuts the cost by about one third.
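
The bit counts quoted above can be rechecked from the signal counts of
table 9.16. The following sketch is only a sanity check; the names
SIGNALS and rsr_widths are chosen for this illustration.

```python
# Precomputed control signals per type (table 9.16), without type x.0.
SIGNALS = {"x.1": 7, "x.2": 3, "x.3": 0, "x.4": 3, "y": 3, "z": 6}

def rsr_widths():
    """Widths of the RSR registers feeding stages 2.1, 2.2, 2.3, 2.4, 3."""
    width = sum(SIGNALS.values())   # all 22 signals enter stage 2.1
    widths = []
    for t in ["x.1", "x.2", "x.3", "x.4"]:
        widths.append(width)
        width -= SIGNALS[t]         # type x.i is not needed after stage 2.i
    widths.append(width)            # stage 3 only needs types y and z
    return widths
```

With these counts, rsr_widths() yields the widths 22, 15, 12, 12 and 9,
which add up to the 70 buffered bits; five full registers of 22 bits
would need 110 bits.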

Stall Engine
The stages k of the pipeline are ordered lexicographically, i.e.,

\[ 1 < 2.0 < 2.1 < 2.2 < 2.3 < 2.4 < 3 \]

Except for the execute stage, the scheduling functions of the designs DLXΣ
and FDLXΣ are alike. One cycle after reset, the execution starts in the write
back stage with a jump to the ISR. For k ∈ {0, 1, 3}, instruction I_i passes
from stage k to k + 1:

\[ I_\Sigma(k, T) = i \;\rightarrow\; I_\Sigma(k+1, T+1) = i \]

Once I_i reaches stage k = 4, the execution continues in stage 0 with the
next instruction:

\[ I_\Sigma(4, T) = i \;\rightarrow\; I_\Sigma(0, T+1) = i + 1 \]

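In the FDLXΣ design, the execute stage comprises 5 substages. Fast
instructions bypass some of these substages, which complicates the
scheduling. For any execute stage 2.k with k > 0, the instruction is just
passed to the next stage, thus

\[
I_\Sigma(2.k, T) = i \;\rightarrow\; i =
\begin{cases}
I_\Sigma(3, T+1) & \text{if } k = 4 \\
I_\Sigma(2.(k+1), T+1) & \text{if } k \le 3
\end{cases}
\]

In case of stage k = 2.0, however, it depends on the latency of instruction
I_i whether the execution continues in stage 2.1, 2.3 or 3:

\[
I_\Sigma(2.0, T) = i \;\rightarrow\; i =
\begin{cases}
I_\Sigma(3, T+1) & \text{if } lat1 = 1 \\
I_\Sigma(2.3, T+1) & \text{if } lat3 = 1 \\
I_\Sigma(2.1, T+1) & \text{if } lat5 = 1
\end{cases}
\]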
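
As a small illustration of this schedule, the following sketch lists the
stages an instruction visits between stage 2.0 and stage 3 for the three
latencies considered so far. The function name execute_path is made up
for this example; divisions are not covered.

```python
def execute_path(latency):
    """Stages visited from stage 2.0 up to and including stage 3."""
    # Stage entered directly after 2.0, depending on lat1 / lat3 / lat5.
    entry = {1: "3", 3: "2.3", 5: "2.1"}[latency]
    order = ["2.1", "2.2", "2.3", "2.4", "3"]
    # After the entry stage, the instruction is passed on stage by stage.
    return ["2.0"] + order[order.index(entry):]
```

The number of execute substages on the path equals the latency: a
3-cycle instruction, for instance, visits 2.0, 2.3 and 2.4 before it
reaches the memory stage 3.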

The stall engine of figure 9.20 implements the new schedule in an obvious
way. As in the sequential design of section 5.5.6, there is one central
clock CE for the whole FDLXΣ design. During reset, all the update enable
flags ue.k are inactive, and the full vector is initialized. In order to
let an instruction bypass some execute stages, the full flags of stages 2.1
to 2.4 and of the memory stage 3 are held in an RSR. This RSR is, like any
other RSR of the sequential DLX design, controlled by the write signals
RSRw of equation 9.1. The RSR of the stall engine is operated in a
particularly simple way, because all its clock enable signals are tied to
the common clock enable CE.

[Figure 9.20: Stall engine of the FDLXΣ design without support for
divisions]
Figure 9.21 illustrates how the precomputed control, the stall engine and
the data paths of the execute environment fit together. As before, the
precomputed control provides the clock request signals RCe, which are
combined (AND) with the appropriate update enable flags to obtain the
actual clock signals.
However, special attention must be paid to the clock signals of the
registers MDRw and Ffl.3. According to the specification in appendix B,
these two registers are clocked simultaneously. They either get their data
input from stage 2.0 or from stage 2.4, depending on the latency of the
instruction. Thus, the clock signal is obtained as

\[ MDRwce = ue.2.4 \wedge (MDRwce.2.0 \wedge lat1.2.0 \;\vee\; MDRwce.2.4) \]


[Figure 9.21: Integration of the precomputed control, of the stall engine
and of the execute environment]

The remaining registers of the FDLX data paths receive their data inputs
from just one stage. Register MAR, for example, is only updated by
instructions with a 1-cycle execute latency; therefore

\[ MARce = ue.2.4 \wedge MARce.2.0 \]

    +  
Along the lines of section 3.4 it can be shown that this FDLXΣ design
interprets the extended DLX instruction set of section 9.1 with delayed PC
semantics but without floating point divisions. The crucial part is to show
that the instruction and its data pass through the pipeline stages at the
same speed. More formally:

Lemma 9.2. Let I_Σ(2.0, T) = i, and let X be a register whose content is
passed through one of the RSRs, i.e., X ∈ {IR, Cad, Fad, Sad, PC, DPC,
CA[3:2]}. For any stage k ∈ {2.1, ..., 3} with I_Σ(k, T') = i, we have

\[ X.2.0^{T} = X.k^{T'} \quad \text{and} \quad full.k^{T'} = 1 \]

This follows from the definition of the write signals RSRw (equation 9.1)
and from lemma 9.1. Observe that the hypothesis of lemma 9.1 about the
clock enable signals is trivially fulfilled for the RSR in the stall engine.
The construction of the stall engine ensures that the hypothesis about the
clock enable signals is also fulfilled for the remaining RSRs in the data
paths and in the control.
Outside the stall engine we update the registers of result shift registers
with separate update enable signals. Thus, during the sequential execution
of a single instruction it is still the case that no stage k ∈ {0, 1, 3, 4}
or substage 2.j is clocked twice. Not all instructions enter all substages,
but the dateline lemma 5.9 stays literally the same.

9.3.2 Supporting Divisions

The execution of a division takes 17 or 21 cycles, depending on the
precision. During the unpacking and the four final cycles, a division is
processed like any other arithmetical floating point operation. However, in
the remaining 12 to 16 cycles, it iterates in circuit SIGFMD of the
multiply/divide unit. These steps are numbered 2.0.1, ..., 2.0.16 (table
9.15). We use the following strategy for handling divisions:

In the first cycle (stage 2.0), the operands of the division are
unpacked. This is governed by the standard precomputed control.

During the steps 2.0.1 to 2.0.16, the division iterates in the
multiply/divide unit. The execution is controlled by the FSD specified
in section 8.3.6 (figure 8.22). The data, the cause bits and the
precomputed control signals of the division are frozen in the RSRs
of stage 2.0.

In the final four steps, the division passes through the stages 2.1, ...,
2.4. This is again controlled by the precomputed control.

[Figure 9.22: Main control for the stages 2.0 to 3 of the full FDLXΣ
design]

Thus, the main control (figure 9.22) of the floating point DLX design
consists of the stall engine, the precomputed control with its 5-stage RSR,
and the 'division automaton'. Except for circuit SIGFMD, the data paths are
governed by the precomputed control, whereas the stall engine controls the
update of the registers and RAMs.

Precomputed Control


The precomputed control of section 9.3.1 is just extended by two signals
lat21 and lat17 of type x.0. These signals indicate that the instruction has
a latency of 21 or 17 cycles, respectively. They correspond to the double
and single precision division states of the FSD of figure 9.16.
Extension of the Stall Engine

In order to account for a double precision division, which has a 21-cycle
execute latency, the RSR of the stall engine is extended to length 21.
Except for the longer RSR, the stall engine remains unchanged. The RSR
provides the full flags full.k and the update enable flags ue.k for the
stages k ∈ {2.0.1, ..., 2.0.16, 2.1, ..., 2.4, 3}. These 21 full bits code
the state of the division FSD in unary as specified in table 9.15.
An instruction, depending on its execute latency, enters the RSR of the
stall engine either in stage 2.0.1, 2.0.5, 2.1, 2.3 or 3. The write signals
of the RSR are therefore generated as

\[ Stallw[1:21] = lat21\ 0^{3}\ lat17\ 0^{11}\ lat5\ 0\ lat3\ 0\ lat1 \qquad (9.2) \]

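For the scheduling function I_Σ, this implies

\[
I_\Sigma(2.0, T) = i \;\rightarrow\; i =
\begin{cases}
I_\Sigma(3, T+1) & \text{if } lat1 = 1 \\
I_\Sigma(2.3, T+1) & \text{if } lat3 = 1 \\
I_\Sigma(2.1, T+1) & \text{if } lat5 = 1 \\
I_\Sigma(2.0.5, T+1) & \text{if } lat17 = 1 \\
I_\Sigma(2.0.1, T+1) & \text{if } lat21 = 1
\end{cases}
\]

and for every substage 2.0.j with j ≥ 1 we have

\[
I_\Sigma(2.0.j, T) = i \;\rightarrow\; i =
\begin{cases}
I_\Sigma(2.0.(j+1), T+1) & \text{if } 0 < j < 16 \\
I_\Sigma(2.1, T+1) & \text{if } j = 16
\end{cases}
\]

In the remaining pipeline stages, the division is processed like any
instruction with a 5-cycle execute latency. Thus, the scheduling function
requires no further modification.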
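
The extended write signals of equation 9.2 and the resulting execute
latencies can be cross-checked with a short sketch. The names STAGES,
stall_write_signals and cycles_in_execute are assumptions of this
example, not signal names of the design.

```python
# Stages held in the 21-stage RSR of the stall engine, in pipeline order.
STAGES = ["2.0.%d" % j for j in range(1, 17)] + ["2.1", "2.2", "2.3", "2.4", "3"]

def stall_write_signals(latency):
    """Stallw[1:21] for an execute latency in {1, 3, 5, 17, 21}."""
    # RSR position (1-based) that the instruction is written into.
    entry = {21: 1, 17: 5, 5: 17, 3: 19, 1: 21}[latency]
    return [1 if pos == entry else 0 for pos in range(1, 22)]

def cycles_in_execute(latency):
    """Cycles spent in stage 2.0 and in the substages before stage 3."""
    entry = stall_write_signals(latency).index(1)
    # One cycle in stage 2.0, then one cycle per RSR stage up to stage 2.4.
    return 1 + (len(STAGES) - 1 - entry)
```

Under this model, an instruction entering at stage 2.0.5 spends 17
cycles in the execute stage and one entering at 2.0.1 spends 21, which
matches the latencies named by the flags lat17 and lat21.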
Unlike the stall engine, the cause environment, the buffer environment
and the precomputed control still use a 5-stage RSR. Up to step 2.0.16, a
division is frozen in stage 2.0 and then enters the first stage of these
RSRs. Thus, the write signals RSRw[1:5] of the RSRs in the data paths and
in the precomputed control are generated as

\[ RSRw[1] = lat5 \;\vee\; full.2.0.16 \wedge fdiv.2.0 \]
\[ RSRw[3] = lat3 \]
\[ RSRw[5] = lat1 \]
\[ RSRw[2] = RSRw[4] = 0 \]

Control of Circuit SIGFMD


Clock Request Signals The registers A, E, Eb, Da, Db and x of circuit
SIGFMD (figure 8.20) are only used by divisions. Thus, they are updated
solely under the control of the division automaton.

Table 9.17: Clock request signals of the multiply/divide unit

clocks      stages of the stall engine
xce         2.0.1, 2.0.5, 2.0.9, 2.0.13
sce, cce    2.0.2, 2.0.4, 2.0.6, 2.0.8, 2.0.10, 2.0.12, 2.0.14, 2.0.16, 2.1
Ace         2.0.3, 2.0.7, 2.0.11
Dce, Ece    2.0.15
Ebce        2.1

The output registers s and c of the multiplication tree are also used by
multiplications (stage 2.1). A division uses these registers up to stage
2.1. Thus, the registers s and c can be updated at the end of step 2.1
without any harm, even in case of a division.
Table 9.17 lists for each register the stages in which its clock signal must
be active. A particular clock request signal is then obtained by ORing the
update enable flags of the listed stages, e.g.:

\[
cce = ue.2.0.2 \vee ue.2.0.4 \vee ue.2.0.6 \vee ue.2.0.8 \vee ue.2.0.10
      \vee ue.2.0.12 \vee ue.2.0.14 \vee ue.2.0.16 \vee ue.2.1
\]
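
The ORing of update enable flags can be written down directly from
table 9.17. The following sketch is only an illustration; the dictionary
CLOCK_STAGES and the function clock_requests are names chosen here, and
the set of active update enable flags is passed in as stage names.

```python
# Stages in which each clock request must be active (table 9.17).
EVEN_STEPS = ["2.0.%d" % j for j in range(2, 17, 2)] + ["2.1"]
CLOCK_STAGES = {
    "xce": ["2.0.1", "2.0.5", "2.0.9", "2.0.13"],
    "sce": EVEN_STEPS,
    "cce": EVEN_STEPS,
    "Ace": ["2.0.3", "2.0.7", "2.0.11"],
    "Dce": ["2.0.15"],
    "Ece": ["2.0.15"],
    "Ebce": ["2.1"],
}

def clock_requests(active_ue):
    """OR the update enable flags of the listed stages for every clock."""
    return {clk: any(stage in active_ue for stage in stages)
            for clk, stages in CLOCK_STAGES.items()}
```

For example, if only ue.2.0.4 is active, the sketch requests the clocks
sce and cce and no others.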

Control Signals The multiply/divide unit is governed by the following
signals:

flag db, which signals a double precision operation,

flag fdiv, which distinguishes between division and multiplication,

flag tlu, which activates a table lookup, and

the enable signals for the operand busses opa and opb

\[ opaoe[3:0] = (faadoe,\ Eadoe,\ Aadoe,\ xadoe) \]
\[ opboe[1:0] = (fbbdoe,\ xbdoe) \]

The signals f div and db are fixed for the whole execution of an instruction.
Therefore, they can directly be taken from the RSR of the precomputed
control.
The flag tlu selects the input of register x. Since this register is only used
by divisions, the flag tlu has no impact on a multiplication or addition.
Thus, flag tlu is directly provided by the division automaton.
The operand busses opa and opb are controlled by both the precomputed
control and the division automaton. Both control units precompute
their control signals. The flag divhaz selects between the two sets of
control signals before they are clocked into the register Con.2.1. Let
opaoe' and opboe' denote the set of enable signals generated by the
division automaton; this set is selected on divhaz = 1. The operand busses
are then controlled by

\[
(opaoe, opboe) :=
\begin{cases}
(opaoe.2.0,\ opboe.2.0) & \text{if } divhaz = 0 \\
(opaoe',\ opboe') & \text{if } divhaz = 1
\end{cases}
\]

An active signal divhaz grants the division automaton access to the operand
busses during stages 2.0.1 to 2.0.16. Since the enable signals are
precomputed, signal divhaz must also be given one cycle ahead:

\[ divhaz = \bigvee_{k=1}^{15} full.2.0.k \;\vee\; full.2.0 \wedge fdiv.2.0 \]

The Division Automaton controls the multiply/divide unit according to
the FSD of figure 8.22. The full bits provided by the RSR of the stall
engine code the states of the division FSD in unary. Based on these flags,
the automaton precomputes the signal tlu and the enable signals for the
operand busses opa and opb. For each of these signals, table 9.18 lists
the states in which the signal is active and the index of the preceding
state. As in a standard Moore automaton (section 2.6), each control signal
is generated by an OR tree which combines the corresponding full flags,
e.g.:

\[ xbdoe = \bigvee_{k \in \{3, 7, 11, 13\}} full.2.0.k \]
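
The selection between the two sets of enable signals can be sketched as
follows. The sketch assumes that the full flags of the substages 2.0.1
to 2.0.16 are given as a dictionary indexed by the substage number; the
function names are made up for this illustration.

```python
def divhaz(full_sub, full_2_0, fdiv_2_0):
    """Active one cycle ahead of the steps 2.0.1 to 2.0.16."""
    iterating = any(full_sub.get(k, False) for k in range(1, 16))
    return iterating or (full_2_0 and fdiv_2_0)

def xbdoe(full_sub):
    """OR tree over the full flags of the preceding states 3, 7, 11, 13."""
    return any(full_sub.get(k, False) for k in (3, 7, 11, 13))

def select_enables(divhaz_flag, precomputed, automaton):
    """Mux in front of register Con.2.1: automaton set on divhaz = 1."""
    return automaton if divhaz_flag else precomputed
```

Note that a division in step 2.0.16 no longer activates divhaz in this
sketch, because the enable signals for the following stage 2.1 come from
the precomputed control again.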

The 5 clock request signals and the 7 enable signals together have an
accumulated frequency of νsum = 30 and a maximal frequency of νmax = 9.
Thus, the control for circuit SIGFMD requires the following cost and cycle
time:

\[ C_{DivCon} = C_{ff}(7) + C_{mux}(6) + C_{and} + C_{ORtree}(16) + C_{or} \cdot (\nu_{sum} - 11) \]
\[ T_{DivCon} = D_{and} + D_{ORtree}(16) + D_{mux} + \Delta \]

The division automaton delays the clock signals of circuit SIGFMD by the
following amount

\[ D_{DivCon}(ce) = D_{ORtree}(\nu_{max}) \]


Table 9.18: Control signals for the steps 2.0.1 to 2.0.16 of a division.

FSD state    previous stage of the stall engine    active signals
lookup       2.0.0                                 tlu, fbbdoe
newton1      2.0.1, 2.0.5, 2.0.9                   xadoe, fbbdoe
newton3      2.0.3, 2.0.7, 2.0.11                  Aadoe, xbdoe
quotient1    2.0.13                                faadoe, xbdoe
quotient2    2.0.14                                faadoe, fbbdoe
quotient3    2.0.15                                Eadoe, fbbdoe

+  3 
With respect to the dateline lemma we are facing two additional problems:

Some registers are updated by more than one stage. Registers c and
s of the multiply/divide unit, for instance, are updated after stage
2.0.16 during divisions and after stage 2.1 during multiplications.
Thus, classifying the registers by the stage which updates them is
no longer possible.

During the iterations of the division algorithm, some registers are
clocked several times. Thus, the dateline lemma cannot possibly
hold while the registers have intermediate values.

We coarsely classify the stages into two classes: the class PP of stages
which are operated in a pipelined fashion, and the class SQ of stages which
are operated in a sequential manner:

\[ PP = \{0,\ 1,\ 2.0,\ 2.1,\ \ldots,\ 2.4,\ 3,\ 4\} \]
\[ SQ = \{2.0.x \mid 1 \le x \le 16\} \]

Different stages in PP update different registers. Thus, for every register
R we have R ∈ out(t) for at most one t ∈ PP, and every stage in PP is
updated at most once during the sequential execution of an instruction.
The dateline lemma still holds while instructions are in stages PP and for
registers R which are output registers of stages PP.

Lemma 9.3. Let k, t ∈ PP and let I_Σ(k, T) = i. For every register and
memory cell R ∈ out(t) the statements of lemma 5.9 apply.

The values of the output registers of stage 2.0.16 at the end of the
iterations for a division operation depend only on the values of the output
registers of stage 2.0 before the iterations:

Lemma 9.4. Let I_i be a division operation, let

\[ I_\Sigma(2.0, U) = I_\Sigma(2.1, T) = i \]

and let V be an output register of stage 2.0.16. Then V_T depends only on
the values Q_{U+1} of the output registers Q of stage 2.0 which were updated
after cycle U.

9.4 Pipelined DLX Design with FPU

As before, transforming the prepared sequential design into a pipelined
design requires extensive forwarding and interlock hardware
and modifications in the PC environment and in the stall engine. Figure
9.23 depicts the data paths of the pipelined design FDLXΠ. Compared to
the sequential data paths, its top level schematics just got extended by the
forwarding hardware:

\[
\begin{aligned}
C_{DP} ={} & C_{IMenv} + C_{PCenv} + C_{IRenv} + C_{Daddr} + C_{FPemb} + C_{EXenv} \\
 & + C_{DMenv} + C_{SH4Lenv} + C_{RFenv} + C_{CAenv} + C_{buffer} \\
 & + 5 \cdot C_{ff}(32) + 5 \cdot C_{ff}(64) + 2 \cdot C_{ff}(5) + C_{FORW}
\end{aligned}
\]

9.4.1 PC Environment

According to section 5.6.1, switching from the prepared sequential design
to the pipelined design has only a minor impact on the PC environment.
The instruction memory IM is addressed by the input dpc of register DPC
and not by its output. The circuit nextPC, which computes the new values
of the program counters, however remains unchanged.
On the other hand, adding support for floating point impacts the glue
logic PCglue but not the data paths of environment PCenv. Thus, the
FDLXΠ design uses the PC environment of the pipelined DLXΠ design
(figure 5.13) but with the glue logic of the sequential FDLXΣ design.
[Figure 9.23: Data paths of the pipelined FDLX design with result
forwarding]

9.4.2 Forwarding and Interlocking

Like in the pipelined designs DLXπ and DLXΠ, the register files GPR, SPR
and FPR are updated in the write back stage. Since they are read by ear-
lier stages, the pipelined floating point design FDLXΠ also requires result
forwarding and interlocking. For the largest part, the extension of the for-
warding and interlock engine is straightforward, but there are two notable
complications:

The execute stage has a variable depth, which depending on the in-
struction varies between one and five stages. Thus, the forwarding
and interlock engine has to inspect up to four additional stages.

Since the floating point operands and results have single or double
precision, a 64-bit register of the FPR register file either serves as
one double precision register or as two single precision registers.
The forwarding hardware has to account for this address aliasing.
General Purpose Register Operands

The move from the floating point to the fixed point register file is the
only floating point instruction which updates the fixed point register file
GPR. This move is processed in the exchange unit FPXtr, which has a single
cycle latency like the fixed point unit.
Thus, any instruction which updates the GPR enters the execute stage
in stage 2.0 and then directly proceeds to the memory stage 3. Since the
additional stages 2.1 to 2.4 never provide a fixed point result, the operands
A and B can still be forwarded by circuit Forw(3) of figure 4.16. However,
the extended instruction set has an impact on the computation of the valid
flags v[4:2] and of the data hazard flag.

Valid Flags The flag v[j] indicates that the result to be written into the
GPR register file is already available in the circuitry of stage j, given
that the instruction updates the GPR at all. The result of the new move
instruction is already valid after stage 2.0 and can always be forwarded.
Thus, the valid flags of instruction I_i are generated as before:

\[ v[4] = 1; \qquad v[3] = v[2] = \overline{Dmr} \]

Data Hazard Detection The flags dhazA and dhazB signal that the operand
specified by the instruction bits RS1 and RS2 causes a data hazard, i.e.,
that the forwarding engine cannot deliver the requested operands on time.
These flags are generated as before.
In the fixed point DLX design, every instruction I is checked for a data
hazard even if I requires no fixed point operands:

\[ dhazFX = dhazA \vee dhazB \]

This can cause unnecessary stalls. However, since in the fixed point design
almost every instruction requires at least one register operand, there is
virtually no performance degradation.
In the FDLX design, this is no longer the case. Except for the move to the
fixed point register file, the floating point instructions have no fixed
point operands and should not signal a fixed point data hazard dhazFX. The
flags opA and opB therefore indicate whether an instruction requires the
fixed point operands A and B. The FDLX design uses these flags to enable
the data hazard check

\[ dhazFX = dhazA \wedge opA \;\vee\; dhazB \wedge opB \]

The data hazard signals dhazA and dhazB are generated along the same
lines. Thus, the cost and delay of signal dhazFX can be expressed as

\[ C_{dhazFX} = 2 \cdot C_{dhazA} + 2 \cdot C_{and} + C_{or} \]
\[ A_{dhazFX} = A_{dhazA} + D_{and} + D_{or} \]
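
The effect of the operand flags on the hazard check can be illustrated
with a minimal sketch. The function name dhaz_fx and its default
arguments are assumptions of this example; with both operand flags set,
it degenerates to the unconditional check of the fixed point design.

```python
def dhaz_fx(dhazA, dhazB, opA=True, opB=True):
    """Fixed point data hazard, enabled only for operands actually used."""
    return (dhazA and opA) or (dhazB and opB)
```

An instruction without fixed point operands (opA = opB = 0) then never
raises dhazFX, even if the register addresses in its RS1 and RS2 fields
happen to match a pending write.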
Special Purpose Register Operands

Due to the FPU, the special purpose registers SPR are updated in five
situations:

1. All special purpose registers are updated by JISR. As in the DLXΠ
design, there is no need to forward these values. All instructions
which could use forwarded versions of values forced into SPR by
JISR get evicted from the pipe by the very same occurrence of JISR.

2. On a movi2s instruction, value C.4 is written into register SPR[Sad].

3. Register SR is updated by rfe. In stages 2 to 4, this update is
implemented like a regular write into SPR with address Sad = 0.

4. Register FCC is updated by a floating point compare. In stages 2 to 4,
this update is implemented like a regular write into SPR with address
Sad = 8.

5. On an arithmetical floating point instruction, which is signaled by
fop = 1, the floating point exception flags Ffl.4 are ORed into the
register IEEEf.

In case 5, which only applies to register IEEEf, the result is passed down
the pipeline in the Ffl.k registers. During write back, the flags Ffl.4 are
then ORed to the old value of IEEEf. In the uninterrupted execution of I_i,
we have

\[ IEEEf_i = IEEEf_{i-1} \vee Ffl_i \]

That complicates the result forwarding considerably (see exercise 9.9.1).


In order to keep the design simple, we omit the forwarding of the flags
Ffl. Instead, we generate in appropriate situations a data hazard signal
dhaz(IEEEf) and stall the instruction decode until the hazard is resolved.
In case 1, the forwarding does not matter. In the remaining cases 2 to 4,
the instruction has a 1-cycle latency. Thus, one only needs to forward data
from the stages 2.0, 3 and 4, and the result is already available in stage
2.0. With respect to the update of the SPR register file, the special move
movi2s, rfe and the floating point compare are treated alike. Thus, the SPR
operands can be forwarded by the standard SFor circuit used in the DLXΠ
design, and except for an operand IEEEf, no additional data hazard is
introduced.
In the FDLX design, data from the SPR registers are used in the following
seven places, each of which is treated separately:

1. on a movs2i instruction, register SPR[Sas] is read into S during
decode,

2. on an rfe instruction, the two exception PCs are read during decode,

3. the cause environment reads the interrupt masks SR in the memory
stage,

4. the rounders of the FPU read SR in the execute stage 2.3,

5. the rounding mode RM is read in stage 2.2 by the floating point
adder and in stage 2.3 by the two rounders,

6. on a floating point branch, the condition flag FCC is read during
decode,

7. and on an arithmetical floating point operation (fop = 1), the IEEE
exception flags IEEEf are read during write back.

Forwarding of the Exception PCs Since the new floating point instructions
do not access the two exception PCs, the forwarding hardware of
EPC and EDPC remains unchanged. EPC is forwarded by the circuit
SFor(3) depicted in figure 5.17. The forwarding of EDPC is still omitted,
and the data hazard signal dhaz(EDPC) is generated as before.

Forwarding of Operand S On a special move instruction movs2i, the
operand S is fetched during decode. As in the DLXΠ design, operand S is
forwarded by the circuit SFor(3) depicted in figure 5.15. However, in case
of an operand IEEEf, one has to check for a data hazard due to the update
of an arithmetical floating point instruction (case 5). Such a hazard occurs
if

the decode stage processes a movs2i instruction (ms2i.1 = 1),

the source address Sas.1 equals 7,

a stage k ≥ 2.0 processes an arithmetical FPU instruction (i.e.,
full.k ∧ fop.k = 1), and

no stage j between 1 and k processes a movi2s which updates IEEEf
(i.e., hit.j = 0).

If a special move is in stage 2.0, the remaining execute stages must be
empty, due to its single cycle execute latency. Thus,

\[
\begin{aligned}
dhaz(IEEEf) ={} & ms2i.1 \wedge (Sas.1 = 7) \wedge \Big( \bigvee_{2.0 \le k \le 2.4} (fop.k \wedge full.k) \\
 & \vee\; \overline{hit.2} \wedge fop.3 \wedge full.3
   \;\vee\; (hit.2 \ \mathrm{NOR}\ hit.3) \wedge fop.4 \wedge full.4 \Big)
\end{aligned}
\]
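
The hazard condition for an operand IEEEf can be spelled out in a small
sketch. It follows the four conditions listed above; the function name,
the argument names and the use of dictionaries indexed by stage names
are assumptions made for this illustration.

```python
EXECUTE_SUBSTAGES = ("2.0", "2.1", "2.2", "2.3", "2.4")

def dhaz_ieeef(ms2i_1, sas_1, fop, full, hit2, hit3):
    """Data hazard on IEEEf; fop and full map stage names to flags."""
    # An arithmetical FPU instruction somewhere in the execute stage.
    in_execute = any(fop[k] and full[k] for k in EXECUTE_SUBSTAGES)
    # One in the memory stage, not overridden by a special move in stage 2.
    in_memory = (not hit2) and fop["3"] and full["3"]
    # One in write back, not overridden by a special move in stage 2 or 3.
    in_write_back = (not (hit2 or hit3)) and fop["4"] and full["4"]
    return bool(ms2i_1 and sas_1 == 7
                and (in_execute or in_memory or in_write_back))
```

A special move that reads a register other than IEEEf (Sas.1 different
from 7) never triggers this hazard, regardless of the pipeline contents.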

[Figure 9.24: Computation of data hazard signal dhaz(IEEEf)]

The circuit of figure 9.24 generates the flag in the obvious way. The hit
signals are provided by circuit SFor(3). Thus,

\[ C_{dhaz}(IEEEf) = C_{EQ}(4) + C_{ORtree}(5) + 11 \cdot C_{and} + 2 \cdot C_{or} + C_{nor} + C_{inv} \]
\[ D_{dhaz}(IEEEf) = \max\{ D_{EQ}(4),\ D_{or} + D_{ORtree}(5),\ D_{SFor}(hit) + 2 \cdot D_{or} + D_{nor} \} + 2 \cdot D_{and} \]

Forwarding of Register IEEEf The arithmetical floating point instructions
generate IEEE exception flags Ffl which are accumulated in register
IEEEf. Such an instruction I_i updates register IEEEf by a
read-modify-write access; these special read and write accesses are
performed during write back. For the uninterrupted execution of I_i with
I_Π(4, T) = i we have

\[ IEEEf_\Pi^{T+1} = IEEEf_{i-1} \vee Ffl.4_\Pi^{T} \]

Since the instructions are processed in program order,

\[ IEEEf_\Pi^{T} = IEEEf_{i-1} \]

and no result forwarding is required.

Forwarding of Register FCC On a floating point branch, the condition
flag FCC is requested by the PC environment during decode. The flag FCC
is updated by a special move movi2s and by a floating point compare
instruction. Both instructions have a single cycle execute latency and
bypass the substages 2.1 to 2.4. Thus, the value of FCC can be forwarded by
the 3-stage forwarding circuit SFor of figure 5.16 with Din = SPR(8) = FCC
and ad = 1000.
The special move and the test instruction update register FCC via the
standard write port. Since their result is already available in stage 2.0,
the forwarding is always possible and register FCC never causes a data
hazard.
Forwarding of Register RM The rounding mode RM is needed in stage
2.2 by the floating point adder and in stage 2.4 by the rounders FPRND
and FXRND. Register RM can only be updated by a special move movi2s
which has a single cycle execute latency. Since the result of the special
move is already valid in stage 2.0, forwarding is always possible; no data
hazard is introduced.
A standard 2-stage forwarding circuit SFor(2) can forward RM from
stages 3 and 4 to the execute stages. However, the following lemma states
that the forwarding of register RM can be omitted if the instructions always
remain in program order. The scheduler of the pipelined design FDLXΠ
ensures such an in-order execution (section 9.4.3). Thus, the SPR register
file can directly provide the rounding mode RM to the adder and the
rounders at zero delay.

Lemma 9.5. Let instruction I_i read the rounding mode RM in stage 2.2 or
2.4. Furthermore, let I_j be an instruction preceding I_i which updates
register RM. Assuming that the instructions pass the pipeline stages
strictly in program order, I_j updates register RM before I_i reads RM.

Proof. Let the execution of instruction I_i be started in cycle T,

\[ I_\Pi(2.0, T) = i \]

1) Any instruction which passes the rounder FPRND or FXRND has an execute
latency of at least 3 cycles. Thus, the rounder of stage 2.4 processes
I_i in cycle T + 2, at the earliest:

\[ I_\Pi(2.4, T') = i \quad \text{with } T' \ge T + 2 \]

2) If I_i is a floating point addition or subtraction, it already reads the
rounding mode RM in stage 2.2. Instruction I_i has a 5-cycle execute
latency, thus

\[ i = I_\Pi(2.1, T+1) = I_\Pi(2.2, T+2) \]

In either case, I_i reads the rounding mode in cycle T + 2 at the earliest.
The rounding mode RM is only updated by special moves movi2s which
have a single cycle execute latency. For such a move instruction I_j this
implies

\[ j = I_\Pi(2.0, t) = I_\Pi(3, t+1) = I_\Pi(4, t+2) \]

Since the instructions remain in program order, I_j must pass stage 2.0
before instruction I_i. Thus,

\[ t < T \quad \text{and} \quad t + 2 < T + 2 \]

and I_j updates register RM at least one cycle before I_i reads RM.


Forwarding of Register SR The status register SR is updated by special
moves and by rfe instructions. In either case, register SR is updated by a
regular write to SPR with address 0. Since the result is already available
in stage 2.0, the forwarding of SR is always feasible.
The cause environment CAenv uses SR for masking the interrupt events
in stage 3. As before, a 1-stage forwarding circuit SFor(1) provides the
masks SR to the cause environment.
In the FDLX design, register SR also holds the masks for the IEEE floating
point exceptions. The rounders FPRND and FXRND require these mask
bits during stage 2.3. In analogy to lemma 9.5, one shows

Lemma 9.6. Let the instructions pass the pipeline of the FDLXΠ design
strictly in program order. Let instruction I_i read the status register SR
in stage 2.3 during cycle T. Any preceding rfe or movi2s instruction I_j
then updates register SR in cycle T or earlier.

Thus, it suffices to forward the masks SR from the write back stage to the
rounders. This forwarding can be performed by the circuit SFor(1) which
already provides SR to the cause environment. As in the DLX design,
the masks SR never cause a data hazard.

Forwarding Circuit SFOR Altogether, the forwarding of the SPR operands
can be performed by one circuit SFor(1) for operand SR and by three
circuits SFor(3) for the operands EPC, S and FCC. Thus, the forwarding
engine SFOR has the cost

\[ C_{SFOR} = 3 \cdot C_{SFor}(3) + C_{SFor}(1) \]

The operands S, EPC and SR still have the same accumulated delay as in
the DLXΠ design. The accumulated delay of the FCC flag equals that of
the S operand

\[ A_{SFOR}(FCC) = A_{SFOR}(S) \]

The remaining SPR operands are provided at zero delay.
The flag dhazS signals that a special purpose register causes a data
hazard. EDPC and IEEEf are the only SPR registers which can cause such a
data hazard. Thus, signal dhazS can be obtained as

\[ dhazS = dhaz(IEEEf) \vee dhaz(EDPC) \]

The selection of the source address Sas and the forwarding are both
governed by control signals of stage ID, therefore

\[ C_{dhazS} = C_{dhaz}(IEEEf) + C_{dhaz}(EDPC) + C_{or} \]
\[ A_{dhazS} = A_{CON}(csID) + D_{Daddr} + \max\{ D_{dhaz}(IEEEf),\ D_{dhaz}(EDPC) \} + D_{or} \]
[Figure 9.25: Block diagrams of circuit Ffor (a) and forwarding engine
FFOR (b)]

Floating Point Register Operands

While an instruction (division) is processed in the stages 2.0.0 to 2.0.15,
the signal divhaz is active. Since the fetch and decode stage are stalled on
divhaz = 1, it suffices to forward the floating point results from the
stages k ∈ PP with k ≥ 2.0. In the following, stage 2.0 is considered to be
full if one of its 17 substages 2.0.0 to 2.0.16 is full, i.e.,

\[ full.2.0 = \bigvee_{0 \le j \le 16} full.2.0.j \]

Depending on the flag dbs, the floating point operands either have single
or double precision. Nevertheless, the floating point register file always
delivers 64-bit values fa and fb. Circuit FPemb of stage ID then selects
the requested data and aligns them according to the embedding convention.
However, the forwarding engine, which now feeds circuit FPemb, takes the
width of the operands into account. That avoids unnecessary interlocks.
The floating point forwarding hardware FFOR (figure 9.25) consists of
two circuits Ffor. One forwards operand FA, the other operand FB. In
addition, circuit Ffor signals by fhaz = 1 that the requested operand cannot
be provided in the current cycle. Circuit Ffor gets the following inputs:

the source address ad and the precision dbs,

the 64-bit data Din from a data port of register file FPR, and

for each stage k ∈ PP with k ≥ 2.0 the destination address Fad.k, the
precision dbr.k, the write signal FPRw.k and an appropriately defined
intermediate result FC'.k.

Like in the fixed point core, the forwarding is controlled by valid flags fv which indicate whether a floating point result is already available in one of the stages 2.0, 2.1 to 4. After defining the valid flags fv, we specify the forwarding circuit Ffor and give a simple realization.
The flags opFA and opFB indicate whether an instruction requires the floating point operands FA and FB. These flags are used to enable the check for a floating point data hazard:
\[ \mathit{dhazFP} = (\mathit{fhazA} \land \mathit{opFA}) \lor (\mathit{fhazB} \land \mathit{opFB}) \]
Forwarding engine FFOR provides this flag at the following cost and delay:
\[ \begin{aligned}
C_{FFOR} &= 2 \cdot C_{Ffor} \\
C_{dhazFP} &= 2 \cdot C_{and} + C_{or} \\
A_{dhazFP} &= A_{CON}(\mathit{csID}) + D_{Ffor}(\mathit{fhaz}) + D_{and} + D_{or}.
\end{aligned} \]

Valid Flags Like for the results of the GPR and SPR register files, we introduce valid flags fv for the floating point result FC. Flag fv.k indicates that the result FC is already available in the circuitry of stage k. The control precomputes these valid flags for the five execute substages 2.0, 2.1, ..., 2.4 and for the stages 3 and 4.
In case of a load instruction ($\mathit{Dmr} = 1$), the result only becomes available during write back. For any other floating point operation with 1-cycle execute latency, the result is already available in stage 2.0. For the remaining floating point operations, the result becomes available in stage 2.4, independent of their latency. The floating point valid flags therefore equal
\[ \begin{aligned}
\mathit{fv}.2.0 &= \mathit{lat1} \land \overline{\mathit{Dmr}} \\
\mathit{fv}.2.1 &= \mathit{fv}.2.2 = \mathit{fv}.2.3 = 0 \\
\mathit{fv}.2.4 &= \mathit{fv}.3 = \overline{\mathit{Dmr}} \\
\mathit{fv}.4 &= 1
\end{aligned} \]
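The definition of the valid flags can be illustrated by a minimal Python sketch. The function name `fp_valid_flags` and the dictionary layout are assumptions made for this illustration only; they are not part of the design.

```python
def fp_valid_flags(lat1: bool, dmr: bool) -> dict:
    """Return the floating point valid flags fv.k for the stages 2.0 to 4.

    lat1: the instruction has a 1-cycle execute latency
    dmr:  the instruction is a load (data memory read)
    """
    not_load = not dmr
    return {
        "2.0": lat1 and not_load,  # 1-cycle operation other than a load
        "2.1": False,              # results are never valid in 2.1 to 2.3
        "2.2": False,
        "2.3": False,
        "2.4": not_load,           # every non-load result is valid from 2.4 on
        "3": not_load,
        "4": True,                 # in write back the result is always valid
    }
```

For a load, only the flag of stage 4 is active, which matches the statement that the loaded data only become available during write back.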

Since the flags fv.k for stage $k \in \{2.1, \ldots, 2.3, 4\}$ have a fixed value, there is no need to buffer them. The remaining three valid flags are passed through the RSR of the precomputed control together with the write signal FPRw.
In any stage $k \ge 2.0$, the write signal FPRw.k, the valid flag fv.k and the floating point destination address Fad.k are available. For some of these stages, the result FC'.k is available as well:

• FC'.4 is the result to be written into register file FPR,

• FC'.3 is the input of the staging register FC.4, and

• FC'.2 is the result R to be written into register MDRw. Depending on the latency, R is either provided by stage 2.0 or by stage 2.4.
Lemma 4.8, which deals with the forwarding of the fixed point result, can also be applied to the floating point result. However, some modifications are necessary since the result either has single or double precision. Note that in case of single precision, the high and low order word of the results FC'.k are identical, due to the embedding convention (figure 9.1). Thus, we have:

For any instruction $I_i$, address $r = r[4:0]$, stage $k \in PP$ with $k \ge 2.0$, and for any cycle $T$ with $I_\Sigma(k, T) = i$ we have:

1. $I_i$ writes the register FPR[r] iff after the sequential execution of $I_i$, the address r[4:1] is kept in the register Fad.k[4:1] and the write signal FPRw.k is turned on. In case of a single precision access, the bit Fad.k[0] must equal r[0]. Thus, $I_i$ writes register FPR[r] iff
\[ \mathit{FPRw}.k_i = 1 \;\land\; \mathit{Fad}.k_i[4:1] = r[4:1] \;\land\; (\mathit{Fad}.k_i[0] = r[0] \lor \mathit{dbr}.k_i = 1) \]

2. If $I_i$ writes a register, and if after its sequential execution the valid flag fv.k is turned on, then the value of signal FC'.k during cycle $T$ equals the value written by $I_i$. Thus, $I_i$ writes FPR[r] and $\mathit{fv}.k_i = 1$ imply
\[ \mathit{FPR}[r]_i = \begin{cases} \mathit{FC}'.k^T[31:0] & \text{if } r[0] = 0 \\ \mathit{FC}'.k^T[63:32] & \text{if } r[0] = 1 \end{cases} \]

Floating Point Forwarding Circuit Ffor Circuit Ffor forwards 64-bit floating point data. In order to account for 32-bit operands and results, the high order word Do[63:32] and the low order word Do[31:0] are handled separately.
For any stage $k \ge 2.0$, circuit Ffor provides two hit signals hitH.k and hitL.k and an auxiliary flag match.k. Flag hitH.k indicates that the instruction I of stage 1 requests the high order word, and that the instruction of stage k is going to update that data. Flag hitL.k corresponds to the low order word and has a similar meaning. The auxiliary flag match.k signals that the instruction of stage k generates a floating point result, and that its destination address matches the source address ad, possibly except for bit 0:
\[ \mathit{match}.k = \mathit{full}.k \land \mathit{FPRw}.k \land (\mathit{Fad}.k[4:1] = \mathit{ad}[4:1]) \]
Lemma 9.7 implies that instruction I requests the high (low) order word if the operand has double precision or an odd (even) address. Due to the embedding convention (figure 9.1), a single precision result is always duplicated, i.e., the high and low order word of a result FC'.k are the same.
Table 9.19: Floating point hit signals for stage $k \in \{2.0, \ldots, 3\}$, assuming that the instruction in stage k produces a floating point result ($\mathit{FPRw}.k \land \mathit{full}.k = 1$) and that the high order address bits match, Fad.k[4:1] = ad[4:1].

      destination           source
    dbr.k   Fad.k[0]    dbs.1   ad[0]    hitH.k   hitL.k
      0        0          0       0         0        1
      0        0          0       1         0        0
      0        0          1       *         0        1
      0        1          0       0         0        0
      0        1          0       1         1        0
      0        1          1       *         1        0
      1        *          0       0         0        1
      1        *          0       1         1        0
      1        *          1       *         1        1

The two hit signals of stage k therefore have the values listed in table 9.19; they can be expressed as
\[ \begin{aligned}
\mathit{hitH}.k &= \mathit{match}.k \land (\mathit{dbr}.k \lor \mathit{Fad}.k[0]) \land (\mathit{dbs}.1 \lor \mathit{ad}[0]) \\
\mathit{hitL}.k &= \mathit{match}.k \land (\mathit{dbr}.k \lor \overline{\mathit{Fad}.k[0]}) \land (\mathit{dbs}.1 \lor \overline{\mathit{ad}[0]})
\end{aligned} \]
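The hit equations can be checked against table 9.19 with a short Python sketch. The function name `hit_signals`, the argument names and the encoding of the table as a list are assumptions made for this illustration; addresses are modelled as 5-bit integers.

```python
def hit_signals(full_k, fprw_k, fad_k, dbr_k, ad, dbs):
    """Return (hitH.k, hitL.k) for one stage k.

    fad_k, ad: 5-bit destination and source register addresses
    dbr_k, dbs: True for double precision result / operand
    """
    # match.k: stage k is full, writes the FPR, and bits [4:1] agree
    match = bool(full_k and fprw_k and (fad_k >> 1) == (ad >> 1))
    fad0 = bool(fad_k & 1)
    ad0 = bool(ad & 1)
    hit_h = match and bool(dbr_k or fad0) and bool(dbs or ad0)
    hit_l = match and bool(dbr_k or not fad0) and bool(dbs or not ad0)
    return hit_h, hit_l


# Rows of table 9.19: (dbr.k, Fad.k[0], dbs.1, ad[0], hitH.k, hitL.k);
# None stands for a don't-care entry.
TABLE_9_19 = [
    (0, 0, 0, 0, 0, 1), (0, 0, 0, 1, 0, 0), (0, 0, 1, None, 0, 1),
    (0, 1, 0, 0, 0, 0), (0, 1, 0, 1, 1, 0), (0, 1, 1, None, 1, 0),
    (1, None, 0, 0, 0, 1), (1, None, 0, 1, 1, 0), (1, None, 1, None, 1, 1),
]


def table_agrees():
    """Check the equations against every row, expanding the don't-cares."""
    for dbr, fad0, dbs, ad0, hit_h, hit_l in TABLE_9_19:
        for f in ([0, 1] if fad0 is None else [fad0]):
            for a in ([0, 1] if ad0 is None else [ad0]):
                got = hit_signals(True, True, 4 | f, bool(dbr), 4 | a, bool(dbs))
                if got != (bool(hit_h), bool(hit_l)):
                    return False
    return True
```

Calling `table_agrees()` expands the don't-care entries and compares each row with the two equations.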

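Moreover, flag topH.k signals for the high order word that there occurs a hit in stage k but not in the stages above:
\[ \mathit{topH}.k = \mathit{hitH}.k \;\land \bigwedge_{2.0 \le x < k,\; x \in PP} \overline{\mathit{hitH}.x} \]
The flags topL.k of the low order word have a similar meaning. In case of $\mathit{topH}.k = 1$ and $\mathit{topL}.j = 1$, the instructions in stages k and j generate data to be forwarded to output Do. If these data are not valid, a data hazard fhaz is signaled. Since $\mathit{fv}.4 = 1$, we have
\[ \mathit{fhaz} = \bigvee_{k \in \{2.0,\, 2.1,\, \ldots,\, 3\}} (\mathit{topH}.k \lor \mathit{topL}.k) \land \overline{\mathit{fv}.k} \]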
 

While an instruction is in the stages 2.1 to 2.3, its result is not valid yet. Furthermore, the execute stages 2.0 and 2.4 share the result bus R which provides value FC'.2. Thus, circuit Ffor only has to consider three results for forwarding. The high order word of output Do, for example, can therefore be selected as
\[ \mathit{Do}[63:32] = \begin{cases}
\mathit{FC}'.2[63:32] & \text{if } \mathit{topH}.2.0 \lor \mathit{topH}.2.4 \\
\mathit{FC}'.3[63:32] & \text{if } \mathit{topH}.3 \\
\mathit{FC}'.4[63:32] & \text{if } \mathit{topH}.4 \\
\mathit{Din}[63:32] & \text{otherwise}
\end{cases} \]

Figure 9.26: A realization of the selection circuit FforSel
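The interplay of the top signals, the hazard flag and the word selection can be sketched in Python as follows. This is a behavioural sketch under the definitions above, not the gate level realization; the names `top_signals`, `select_word` and `fhaz` as functions, and the use of dictionaries indexed by stage names, are assumptions of this illustration.

```python
STAGES = ["2.0", "2.1", "2.2", "2.3", "2.4", "3", "4"]


def top_signals(hit):
    """top.k = hit.k and no hit in any stage x with 2.0 <= x < k."""
    top, seen = {}, False
    for k in STAGES:
        top[k] = hit[k] and not seen
        seen = seen or hit[k]
    return top


def select_word(top, fc2, fc3, fc4, din):
    """Select one 32-bit word of Do from the three forwardable results."""
    if top["2.0"] or top["2.4"]:
        return fc2          # stages 2.0 and 2.4 share the result bus R
    if top["3"]:
        return fc3
    if top["4"]:
        return fc4
    return din              # no forwarding (or a hazard is signaled anyway)


def fhaz(top_h, top_l, fv):
    """Data hazard: the topmost hit of either word is not valid yet."""
    return any((top_h[k] or top_l[k]) and not fv[k] for k in STAGES[:-1])
```

A top hit in one of the stages 2.1 to 2.3 never selects a result; since the valid flags of these stages are 0, such a hit always raises fhaz instead.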

Realization of Circuit Ffor Circuit Ffor consists of two subcircuits: FforC controls the forwarding and FforSel selects operand Do.
In the circuit FforSel of figure 9.26, the high and low order word of the operand Do require three multiplexers each. Like in the fixed point forwarding circuit Forw, the multiplexers are controlled by the hit signals. Since the stages 2.0 and 2.4 share the result FC'.2, the hit signals of the two stages are combined by an OR gate. Thus,
\[ \begin{aligned}
C_{FforSel} &= 2 \cdot (3 \cdot C_{mux}(32) + C_{or}) \\
D_{FforSel} &= 3 \cdot D_{mux}
\end{aligned} \]
The control circuit FforC generates the 14 hit and top signals as outlined above and checks for a data hazard fhaz. The hit signals can be generated at the following cost and delay:
\[ \begin{aligned}
C_{FforHit} &= 2 \cdot C_{or} + C_{inv} + 7 \cdot (C_{EQ}(4) + 6 \cdot C_{and} + 2 \cdot C_{or} + C_{inv}) \\
D_{FforHit} &= \max\{D_{EQ}(4),\, D_{inv} + D_{or}\} + 2 \cdot D_{and}
\end{aligned} \]
After inverting the hit signals, the signals topH.k and topL.k can be obtained by two parallel prefix AND circuits and some additional AND gates. These signals are then combined using an OR tree. Thus,
\[ \begin{aligned}
C_{FforC} &= C_{FforHit} + 2 \cdot (7 \cdot C_{and} + 6 \cdot C_{inv} + C_{PP}(6) \cdot C_{and}) \\
&\quad + 6 \cdot (C_{or} + C_{inv} + C_{and}) + C_{ORtree}(6) \\
C_{Ffor} &= C_{FforSel} + C_{FforC}.
\end{aligned} \]
The forwarding circuit Ffor provides the output Do and the flag fhaz at the following delays:
\[ \begin{aligned}
D_{Ffor}(\mathit{Do}) &= D_{FforHit} + D_{FforSel} \\
D_{Ffor}(\mathit{fhaz}) &= D_{FforHit} + D_{inv} + D_{PP}(6) + 2 \cdot D_{and} + D_{or} + D_{ORtree}(6)
\end{aligned} \]
The delay of Do is largely due to the address check. The actual data Din and FC'.j are delayed by no more than
\[ D_{Ffor}(\mathit{Data}) = D_{FforSel}. \]
The data to be forwarded by circuit Ffor have the following accumulated delay:
\[ A(\mathit{FC}', \mathit{Din}) = \max\{A_{EXenv},\, A_{SH4Lenv},\, D_{FPR,read}\}. \]
All the address and control inputs of circuit FFOR are directly taken from registers. FFOR therefore provides the operands FA1 and FB1 with an accumulated delay of
\[ A_{FFOR}(\mathit{FA1}, \mathit{FB1}) = \max\{A(\mathit{FC}', \mathit{Din}) + D_{Ffor}(\mathit{Data}),\, D_{Ffor}(\mathit{Do})\}. \]
Before the operands are clocked into the registers FA and FB, circuit FPemb aligns them according to the embedding convention. Thus, fetching the two floating point operands requires a minimal cycle time of
\[ T_{Fread} = A_{FFOR}(\mathit{FA1}, \mathit{FB1}) + D_{FPemb} + \Delta. \]

*&#   1 

Since the divider is only partially pipelined, the division complicates the
scheduling considerably. Like for the sequential design, we therefore first
ignore divisions. In a second step, we then extend the simplified scheduler
in order to support divisions.

 4  -


The execute stage still has a nonuniform latency which varies between 1 and 5 cycles. The intermediate results, the precomputed control signals and the full flags must keep up with the instruction. Like in the sequential FDLXΣ design, these data are therefore passed through 5-stage RSRs.
In the pipelined design FDLXΠ, several instructions are processed at a time. The nonuniform latency causes two additional problems, which are illustrated in table 9.20. When processed at full speed,

1. several instructions can reach a stage k at the same time like the instructions I1 and I3 do, and

2. instructions can pass one another like the instructions I1 and I2.

Table 9.20: Pipelined schedule for instruction sequence I1, I2, I3, ignoring structural and data hazards

    instruction   cycles of the execution
    I1:           IF    ID    EX.0  EX.1  EX.2  EX.3  EX.4  M     WB
    I2:                 IF    ID    EX.0  M     WB
    I3:                       IF    ID    EX.0  EX.3  EX.4  M     WB

Every pipeline stage of the FDLXΠ design is only capable of processing one instruction at a time. Thus, in the scenario of case 1 the instructions compete for the hardware resources. The scheduler must avoid such a structural hazard, i.e., for any stage k and cycle T it must be guaranteed that
\[ I_\Pi(k, T) = i \ \text{ and } \ I_\Pi(k, T) = i' \ \Rightarrow\ i = i'. \qquad (9.3) \]
Hardware schedulers like the Tomasulo scheduler [Tom67, KMP99b] and the Scoreboard [Tho70, MP96] allow instructions to overtake one another, but such an out-of-order execution complicates the precise processing of interrupts [Lei99, SP88]. In the pipelined execution, instructions are therefore processed strictly in program order (in-order execution). Thus, for any two instructions $I_i$ and $I_{i'}$ with $i' < i$ and any stage k which is requested by both instructions, the scheduler must ensure that $I_i$ is processed after $I_{i'}$:
\[ i' < i \ \text{ and } \ I_\Pi(k, T) = i \ \text{ and } \ I_\Pi(k, T') = i' \ \Rightarrow\ T' < T. \qquad (9.4) \]

Notation So far, the registers of the RSR are numbered like the pipeline stages, e.g., for entry R we have R.2.0, ..., R.2.4, R.3. The execute latency l of an instruction specifies how long the instruction remains in the RSR. Therefore, it is useful to number the entries also according to their height, i.e., according to their distance from the write back stage (table 9.21). An instruction with latency l then enters the RSR at height l.
In the following, we denote by full(d) the full flag of the stage with height d, e.g.:
\[ \mathit{full}.2.1 = \mathit{full}(5), \qquad \mathit{full}.2.3 = \mathit{full}(3), \qquad \mathit{full}.3 = \mathit{full}(1). \]

Table 9.21: Height of the pipeline stages

    stage    2.0   2.1   2.2   2.3   2.4   3   4
    height   6     5     4     3     2     1   0

Structural Hazards According to lemma 9.1, the entries of the RSR are passed down the pipeline one stage per cycle, if the RSR is not cleared and if the data are not overwritten. Thus, for any stage with height $d \in \{2, \ldots, 5\}$ we have
\[ \mathit{full}(d)^T = 1 \ \Rightarrow\ \mathit{full}(d-1)^{T+1} = 1. \]
This means that an instruction, once it has entered the RSR, proceeds at full speed. On the other hand, let instruction $I_i$ with latency $l_i$ be processed in stage 2.0 during cycle $T$. The scheduler then tries to assign $I_i$ to height $l_i$ for cycle $T + 1$. However, this would cause a structural hazard if the stage with height $l_i + 1$ is occupied during cycle $T$. In such a situation, the scheduler signals an RSR structural hazard
\[ \mathit{RSRstr}^T = \mathit{full}(l_i + 1)^T = 1 \]
and it stalls instruction $I_i$ in stage 2.0. Thus, structural hazards within the RSR are resolved.

In-order Execution Let instruction $I_i$ and cycle $T$ be chosen as in the previous case. The instructions which in cycle $T$ are processed in the stages 2.1 to 4 precede $I_i$. This especially holds for an instruction $I_j$ processed during cycle $T$ at height $d > l_i + 1$. Since structural hazards are resolved, lemma 9.1 implies that $I_j$ reaches height $l_i$ in cycle $T + d - l_i$ with
\[ T + d - l_i > T + 1. \]
Since $j < i$, the instructions would not be executed in order (i.e., the condition of equation 9.4 would be violated) if $I_i$ left stage 2.0 at the end of cycle $T$.
For $d \le l_i$, we have $T + d - l_i \le T$, i.e., instruction $I_j$ reaches height $l_i$ before instruction $I_i$. Thus, in order to ensure in-order execution, $I_i$ must be stalled in stage 2.0 if
\[ \mathit{RSRorder}^T = \bigvee_{d = l_i + 2}^{5} \mathit{full}(d)^T = 1. \]
The flag RSRhaz signals a structural hazard or a potential out-of-order execution:
\[ \mathit{RSRhaz} = \mathit{RSRstr} \lor \mathit{RSRorder} = \bigvee_{d = l_i + 1}^{5} \mathit{full}(d)^T. \]
Note that an instruction with a latency of $l \ge 5$ never causes an RSR hazard. Thus, depending on the latency of the instruction, the structural hazard RSRhaz can be detected as
\[ \mathit{RSRhaz} = \Big(\mathit{lat1}.2.0 \land \bigvee_{j=1}^{4} \mathit{full}.2.j\Big) \lor \Big(\mathit{lat3}.2.0 \land \bigvee_{j=1}^{2} \mathit{full}.2.j\Big) \]
at the following cost and delay:
\[ \begin{aligned}
C_{RSRhaz} &= 2 \cdot C_{and} + 4 \cdot C_{or} \\
A_{RSRhaz} &= D_{and} + 3 \cdot D_{or}.
\end{aligned} \]
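That the gate level expression for RSRhaz agrees with the definition by heights can be checked exhaustively for the latencies 1, 3 and 5. The following Python sketch does so; the function names and the dictionary encoding of the full flags are assumptions made for this illustration.

```python
from itertools import product


def rsr_hazard(latency, full_by_height):
    """Return (RSRstr, RSRorder, RSRhaz) from the definition by heights.

    full_by_height maps a height d in {2, ..., 5} to the flag full(d).
    """
    rsr_str = full_by_height.get(latency + 1, False)
    rsr_order = any(full_by_height.get(d, False) for d in range(latency + 2, 6))
    return rsr_str, rsr_order, rsr_str or rsr_order


def rsr_hazard_gates(lat1, lat3, full_2):
    """Gate level form; full_2 maps j in {1, ..., 4} to full.2.j."""
    return (lat1 and any(full_2[j] for j in (1, 2, 3, 4))) or \
           (lat3 and any(full_2[j] for j in (1, 2)))


def forms_agree():
    """Compare both forms for all latencies and all full flag patterns."""
    for latency in (1, 3, 5):
        for bits in product([False, True], repeat=4):
            full_2 = dict(zip((1, 2, 3, 4), bits))
            # stage 2.j has height 6 - j (table 9.21)
            by_height = {6 - j: f for j, f in full_2.items()}
            expected = rsr_hazard(latency, by_height)[2]
            got = rsr_hazard_gates(latency == 1, latency == 3, full_2)
            if expected != got:
                return False
    return True
```

The conversion between stage 2.j and height 6 - j follows table 9.21; for latency 5 both forms are constantly 0.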

The stall engine of the FDLXΠ design stalls the instruction in stage 2.0 if $\mathit{RSRhaz} = 1$. Of course, the preceding stages 0 and 1 are stalled as well.

Hardware Realization Figure 9.27 depicts the stall engine of the design FDLXΠ. It is an obvious extension of the stall engine from the DLXΠ design (figure 5.19). Like in the sequential design with FPU, the full flags of the stages 2.0 to 3 are kept in a 5-stage RSR.
A more notable modification is that we now use 3 instead of 2 clocks. This is due to the RSR hazards. As before, clock CE1 controls the stages fetch and decode. The new clock CE2 just controls stage 2.0. Clock CE3 controls the remaining stages; it is still generated as
\[ \mathit{CE3} = \overline{\mathit{busy}} \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}) \lor \mathit{reset}. \]
Clock CE2 is the same as clock CE3 except that it is also disabled on an RSR hazard:
\[ \mathit{CE2} = (\overline{\mathit{RSRhaz}} \land \overline{\mathit{busy}}) \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}) \lor \mathit{reset}. \]
Clock CE1 is generated as before, except that it is now disabled in three situations, namely if the memories are busy, on a data hazard and on an RSR hazard:
\[ \mathit{CE1} = (\overline{\mathit{RSRhaz}} \land \overline{\mathit{busy}} \land \overline{\mathit{dhaz}}) \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}). \]
Figure 9.27: Stall engine of the FDLXΠ design without support for divisions

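Scheduling Function Except for the execute stages, the FPU has no impact on the scheduling function of the pipelined DLX design. The instructions are still fetched in program order and pass the stages 0 and 1 in lock step mode:
\[ I_\Pi(0, T) = i \ \Rightarrow\ I_\Pi(0, T+1) = \begin{cases} i & \text{if } \mathit{ue}.0^T = 0 \\ i + 1 & \text{if } \mathit{ue}.0^T = 1 \end{cases} \]
\[ I_\Pi(1, T) = i \ \Rightarrow\ I_\Pi(0, T) = i + 1. \]
Except for stage 2.0, an instruction makes a progress of at most one stage per cycle, given that no jump to the ISR occurs. Thus, $I_\Pi(k, T) = i$ with $k \ne 2.0$ and $\mathit{JISR}^T = 0$ implies
\[ i = \begin{cases}
I_\Pi(k, T+1) & \text{if } \mathit{ue}.k^T = 0 \\
I_\Pi(k+1, T+1) & \text{if } \mathit{ue}.k^T = 1 \land k \in \{0, 1, 3\} \\
I_\Pi(2.(j+1), T+1) & \text{if } \mathit{ue}.k^T = 1 \land k = 2.j \in \{2.1, 2.2, 2.3\} \\
I_\Pi(3, T+1) & \text{if } \mathit{ue}.k^T = 1 \land k = 2.4
\end{cases} \]
With respect to stage 2.0, the pipelined and the sequential scheduling function are alike, except that the instruction remains in stage 2.0 in case of an RSR hazard. In case of $\mathit{JISR} = 0$, an active flag RSRhaz disables the update of stage 2.0, i.e., signal ue.2.0 is inactive. Thus, for $I_\Pi(2.0, T) = i$ and $\mathit{JISR}^T = 0$, we have
\[ i = \begin{cases}
I_\Pi(2.0, T+1) & \text{if } \mathit{ue}.2.0^T = 0 \\
I_\Pi(2.1, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 5 \\
I_\Pi(2.3, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 3 \\
I_\Pi(3, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 1
\end{cases} \]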

.     + 


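The division is integrated in the same way as in the FDLXΣ design (section 9.3.2). The RSR of the stall engine is extended to length 21. While a division $I_i$ passes the stages 2.0.1 to 2.0.16 of the stall engine, the data of $I_i$ held in the remaining RSRs are locked in stage 2.0.
During such a division hazard, stage 2.0 is controlled by the division automaton, and otherwise it is controlled by the precomputed control. The division hazard is signaled by flag divhaz one cycle ahead of time. While $\mathit{divhaz} = 1$, the stages 0 and 1 are stalled, whereas the instructions in the stages $k \ge 2.2$ do proceed. Thus, only the clock CE1 of stages 0 and 1 must be modified to
\[ \mathit{CE1} = (\overline{\mathit{divhaz}} \land \overline{\mathit{RSRhaz}} \land \overline{\mathit{busy}} \land \overline{\mathit{dhaz}}) \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}). \]
Like in the sequential design, the support for divisions only impacts the scheduling function of the execute substages. For $I_\Pi(2.0, T) = i$ and $\mathit{JISR}^T = 0$, we have
\[ i = \begin{cases}
I_\Pi(2.0, T+1) & \text{if } \mathit{ue}.2.0^T = 0 \\
I_\Pi(2.0.1, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 21 \\
I_\Pi(2.0.5, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 17 \\
I_\Pi(2.1, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 5 \\
I_\Pi(2.3, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 3 \\
I_\Pi(3, T+1) & \text{if } \mathit{ue}.2.0^T = 1 \land l_i = 1
\end{cases} \]
and for every substage 2.0.j, $I_\Pi(2.0.j, T) = i$ and $\mathit{JISR}^T = 0$ imply
\[ i = \begin{cases}
I_\Pi(2.0.(j+1), T+1) & \text{if } \mathit{ue}.2.0.j^T = 1 \land 1 \le j < 16 \\
I_\Pi(2.1, T+1) & \text{if } \mathit{ue}.2.0.j^T = 1 \land j = 16
\end{cases} \]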
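The scheduling function with divisions can be read as a successor function on stage names. The Python sketch below follows that reading for the case without a jump to the ISR; the names `next_stage` and `trace` and the string encoding of the stages are assumptions of this illustration.

```python
# Successor of every stage other than 2.0 and the division substages.
SUCCESSOR = {"0": "1", "1": "2.0", "2.1": "2.2", "2.2": "2.3",
             "2.3": "2.4", "2.4": "3", "3": "4", "4": None}

# Stage entered from stage 2.0, indexed by the execute latency l_i.
ENTRY = {21: "2.0.1", 17: "2.0.5", 5: "2.1", 3: "2.3", 1: "3"}


def next_stage(stage, ue, latency):
    """Stage of the instruction in the next cycle, assuming JISR = 0."""
    if not ue:
        return stage                      # stage is stalled
    if stage == "2.0":
        return ENTRY[latency]             # jump according to the latency
    if stage.startswith("2.0."):
        j = int(stage.split(".")[2])
        return f"2.0.{j + 1}" if j < 16 else "2.1"
    return SUCCESSOR[stage]               # None: instruction leaves stage 4


def trace(latency):
    """Stages visited by an instruction that is never stalled."""
    stage, path = "0", []
    while stage is not None:
        path.append(stage)
        stage = next_stage(stage, True, latency)
    return path
```

For an instruction that is never stalled, the number of visited execute stages (names starting with "2") equals its latency, e.g. `trace(3)` visits 2.0, 2.3 and 2.4.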

*&&   +%    

Like in the pipelined design without FPU, the control comprises the memory controllers IMC and DMC, the memory interface control MifC, a circuit CE which generates the global clock signals, the stall engine, the precomputed control, a Mealy automaton for stage ID, and a Moore automaton for the stages EX to WB. The parameters of these two automata are listed in table B.16. Thus, the cost of the whole FDLX control can be expressed as
\[ C_{CON} = C_{IMC} + C_{DMC} + C_{MifC} + C_{CE} + C_{stall} + C_{preCon} + C_{CON}(\mathit{mealy}) + C_{CON}(\mathit{moore}). \]

Table 9.22: Classification of the precomputed control signals

    type              x.0   x.1   x.2   x.3   x.4   y   z
    control signals   31    7     3     0     3     3   6
    valid flags       2                             2   2

" -   


The control signals which govern the stages 2.0 to 4 are precomputed in stage ID. Like in the sequential design FDLXΣ, they are then passed down the pipeline using a five stage RSR and some registers (figure 9.19). In addition, the control of the pipelined design FDLXΠ also buffers some valid flags, namely

• the flags v[4:2] for the fixed point result and

• the flags fv.2.0, fv.2.4 and fv.3 for the floating point result.

The valid flags increase the signals of type x.0, y and z by two signals each (table 9.22). The RSR of the precomputed control now starts with 26 signals in stage 2.1 and ends with 13 signals in stage 3. The control signals are precomputed by a Moore automaton which already provides the buffering for stage 2.0. This does not include the valid flags; they require 6 buffers in stage 2.0. In addition, an inverter and an AND gate are used to generate the valid flags.
Since divisions iterate in the multiply divide circuit SIGFMD, the precomputed control is extended by circuit DivCon, like in the sequential design (figure 9.22). The cost and delay of control DivCon remain the same.
Without the automaton, the cost of the RSR and of the (extended) precomputed control can then be expressed as
\[ \begin{aligned}
C_{ConRSR} &= C_{inv} + 5 \cdot C_{or} + (26 + 19 + 2 \cdot 16 + 13) \cdot (C_{and} + C_{mux} + C_{ff}) \\
C_{preCon} &= C_{ConRSR} + C_{ff}(8) + C_{ff}(6) + C_{inv} + C_{and} + C_{DivCon}.
\end{aligned} \]

 -     
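The pipelined FDLX design uses three clock signals CE1 to CE3. These clocks depend on flag JISR, on the hazard flags dhaz and RSRhaz, and on the busy flags busy and Ibusy:
\[ \begin{aligned}
\mathit{CE1} &= (\overline{\mathit{RSRhaz}} \land \overline{\mathit{busy}} \land \overline{\mathit{dhaz}}) \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}) \\
\mathit{CE2} &= (\overline{\mathit{RSRhaz}} \land \overline{\mathit{busy}}) \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}) \lor \mathit{reset} \\
\mathit{CE3} &= \overline{\mathit{busy}} \lor (\overline{\mathit{JISR}} \ \mathrm{NOR}\ \mathit{Ibusy}) \lor \mathit{reset}.
\end{aligned} \]
The forwarding circuitry provides three data hazard flags: flag dhazFX for the GPR operands, flag dhazS for the SPR operands and flag dhazFP for the FPR operands. A data hazard occurs if at least one of these hazard flags is active. Since the clock CE1 needs the inverted flag, the control generates
\[ \overline{\mathit{dhaz}} = (\mathit{dhazFX} \lor \mathit{dhazS}) \ \mathrm{NOR}\ \mathit{dhazFP}. \]
This flag can be obtained at the following cost and accumulated delay:
\[ \begin{aligned}
C_{dhaz} &= C_{dhazFX} + C_{dhazS} + C_{dhazFP} + C_{or} + C_{nor} \\
A_{dhaz} &= \max\{A_{dhazFX},\, A_{dhazS},\, A_{dhazFP}\} + D_{or} + D_{nor}.
\end{aligned} \]
The FPU has no impact on the busy flags. They are generated like in the pipelined design DLXΠ, at cost $C_{busy}$ and with delay $A_{busy}$. The JISR flags are obtained as
\[ \mathit{JISR} = \mathit{jisr}.4 \land \mathit{full}.4, \qquad \overline{\mathit{JISR}} = \mathit{jisr}.4 \ \mathrm{NAND}\ \mathit{full}.4. \]
The three clock signals are then generated at the following cost and delay:
\[ \begin{aligned}
C_{CE} &= C_{dhaz} + C_{RSRhaz} + C_{busy} + 4 \cdot C_{or} + C_{nor} + C_{inv} + C_{nand} + 3 \cdot C_{and} \\
A_{CE} &= \max\{A_{dhaz},\, A_{RSRhaz},\, A_{busy}\} + D_{inv} + 2 \cdot D_{and} + D_{or}.
\end{aligned} \]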
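The behaviour of the three clocks can be summarized in a small Python sketch. It assumes one particular reading of the clock equations: hazards and busy memories disable a clock, whereas a jump to the ISR with an idle instruction memory (the term written with NOR) or a reset enables it. The function name `clock_enables` and this reading are assumptions of the illustration, not a statement about the gate level circuit.

```python
def clock_enables(rsrhaz, busy, dhaz, jisr, ibusy, reset):
    """Return (CE1, CE2, CE3) under the reading described above."""
    # (/JISR NOR Ibusy) is active iff JISR = 1 and Ibusy = 0
    jump = (not (not jisr)) and (not ibusy)
    ce3 = (not busy) or jump or reset
    ce2 = ((not rsrhaz) and (not busy)) or jump or reset
    ce1 = ((not rsrhaz) and (not busy) and (not dhaz)) or jump
    return ce1, ce2, ce3
```

Under this reading a data hazard only stops the stages 0 and 1, an RSR hazard additionally stops stage 2.0, and a busy memory stops all stages unless a jump to the ISR or a reset occurs.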

  1 
The core of the stall engine is the circuit depicted in figure 9.27 but with a 21-stage RSR. In addition, the stall engine enables the update of the registers and memories based on the update enable vector ue.
According to equation 9.2, the write signals Stallw of the 21-stage RSR are directly taken from the precomputed control of stage 2.0. The core of the stall engine therefore provides the update enable flags at the following cost and delay:
\[ \begin{aligned}
C_{stall}(\mathit{core}) &= C_{RSR}(21, 1) + (21 + 4) \cdot C_{and} + C_{or} + 2 \cdot C_{ff} \\
A_{stall}(\mathit{ue}) &= A_{CE} + D_{RSR}(r) + D_{and}.
\end{aligned} \]
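The write signals of the register files are generated as before, except that there is now one additional register file.
\[ \begin{aligned}
\mathit{GPRw}' &= \mathit{GPRw} \land \mathit{ue}.4 \land (\mathit{JISR} \ \mathrm{NAND}\ \mathit{repeat}) \\
\mathit{FPRw}' &= \mathit{FPRw} \land \mathit{ue}.4 \land (\mathit{JISR} \ \mathrm{NAND}\ \mathit{repeat}) \\
\mathit{SPRw}' &= \mathit{SPRw} \land \mathit{ue}.4 \land (\mathit{JISR} \ \mathrm{NAND}\ \mathit{repeat}) \\
\mathit{SPRw}'[5:0] &= \mathit{SPRw}[5:0] \land \mathit{ue}.4
\end{aligned} \]
The read and write signals of the data memory also remain unchanged. In stage 2.0, the write request signal is disabled in case of a page fault during instruction fetch.
\[ \begin{aligned}
\mathit{Dmw}.3 &:= \mathit{Dmw}.2 \land \overline{\mathit{CA}.2[2]} \\
\mathit{Dmw}'.3 &= \mathit{Dmw}.3 \land \mathit{full}.3 \land (\mathit{JISR} \ \mathrm{NOR}\ \mathit{reset}) \\
\mathit{Dmr}'.3 &= \mathit{Dmr}.3 \land \mathit{full}.3
\end{aligned} \]
The same is true for the clock signals of stage ID and of the cause environment.
\[ \begin{aligned}
\mathit{CA4ce} &= \mathit{ue}.3 \lor \mathit{reset} \\
\mathit{DPCce} &= \mathit{PCce} = \mathit{ue}.1 \lor \mathit{JISR}
\end{aligned} \]
However, the output registers of the stages EX and M of the data paths are clocked differently. In the design without FPU, all these registers have a trivial clock request signal which equals one. That is no longer the case. For the registers $R \in \{\mathit{MDRr}, \mathit{C4}, \mathit{FC4}\}$ and for register MAR, the clocks are now obtained as
\[ \begin{aligned}
\mathit{Rce} &= \mathit{ue}.3 \land \mathit{Rce}.3 \\
\mathit{MARce} &= \mathit{ue}.2.4 \land \mathit{MARce}.2.0
\end{aligned} \]
For register MDRw the clocking is a bit more complicated. As already mentioned earlier, MDRw either gets its data from stage 2.0 or from stage 2.4, depending on the latency of the instruction. Thus,
\[ \mathit{MDRwce} = \mathit{ue}.2.4 \land (\mathit{MDRwce}.2.0 \land \mathit{lat1}.2.0 \lor \mathit{MDRwce}.2.4). \]
The write signals of the RSRs of the data paths are directly taken from the precomputed control CON.2.0 except for the write signal of the first entry
\[ \mathit{RSRw}_1 = \mathit{lat5}.2.0 \lor (\mathit{full}.2.0.16 \land \mathit{fdiv}.2.0). \]
The cost and cycle time of the stall engine can then be expressed as
\[ \begin{aligned}
C_{stall} &= C_{stall}(\mathit{core}) + 14 \cdot C_{and} + 4 \cdot C_{or} + C_{inv} + C_{nand} + C_{nor} \\
T_{stall} &= \max\{D_{RSR}(r) + D_{and} + D_{or} + \Delta,\ A_{stall}(\mathit{ue}) + \delta + 2 \cdot D_{and} \\
&\qquad + \max\{D_{ram3}(32 \times 32),\ D_{SF}(w, ce; 9 \times 32) + D_{ff},\ D_{FPR,write}\}\}.
\end{aligned} \]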


*&'  -  
It suffices to show the simulation theorem for cycles when instructions are in stages $k \in PP$.

Like theorem 5.11 but with hypothesis
\[ I_\Pi(k, T) = I_\Sigma(k, T') = i \quad \text{and} \quad \mathit{ue}.k_i^T = 1 \]
for $k \in PP$ and statements 1 (a) and (b) for signals S and output registers R of stages $k \in PP$.

The arguments from the induction step of theorems 4.5, 4.7 and 4.11 have to be extended for the execute environment. Two new situations must be treated:

1. jumping over substages by means of the result shift registers and

2. inputs to stage 2.1 produced by the sequential portion of the division algorithm.

For the first case, let $I_i$ be an instruction which jumps from stage 2.0 to stage x with $x \in \{2.1, \ldots, 2.4, 3\}$, and let
\[ \begin{aligned}
i &= I_\Pi(x, T) = I_\Sigma(x, T') \\
&= I_\Pi(2.0, T-1) = I_\Sigma(2.0, T'-1).
\end{aligned} \]
Let $Q \in \mathit{out}(2.0)$ be an output register of stage 2.0 which was updated during cycle $T - 1$. The induction hypothesis and the dateline lemma imply
\[ Q_\Pi^{T} = Q_i = Q_\Sigma^{T'}. \]
Let S be a signal in stage x, which is an input to an output register of stage x which is updated at the end of cycle T. By construction of the machine, the value $S_\Sigma^{T'}$ then depends only on

• the values $Q_\Sigma^{T'}$ of registers Q considered above and

• the values of the special purpose registers $\mathit{RM}_{i-1}$ and $\mathit{SR}_{i-1}$.

As in the proof of theorem 4.7, one argues that values $\mathit{RM}_{i-1}$ and $\mathit{SR}_{i-1}$ are forwarded to stage x of machine DLXΠ in cycle T. It follows that
\[ S_\Pi^{T} = S_\Sigma^{T'}. \]

For the second case, let $I_i$ be a division instruction and let
\[ \begin{aligned}
i &= I_\Pi(2.1, T) = I_\Sigma(2.1, T') \\
&= I_\Pi(2.0.16, T-1) = I_\Sigma(2.0.16, T'-1) \\
&= I_\Pi(2.0, U) = I_\Sigma(2.0, U').
\end{aligned} \]
By induction hypothesis we have for all output registers Q of stage 2.0, which were updated after cycle U:
\[ Q_\Pi^{U+1} = Q_\Sigma^{U'+1}. \]
During the cycles $U'+1, \ldots, T'-1$ of machine DLXΣ and during the cycles $U+1, \ldots, T-1$ of machine DLXΠ both machines work sequentially. The outputs clocked into the output registers of stage 2.0.16 after cycles $T - 1$ and $T' - 1$, respectively, depend by lemma 9.4 only on the values of the registers Q considered above. For output registers V of stage 2.0.16 it follows that
\[ V_\Pi^{T} = V_\Sigma^{T'} = V_i. \]
From this one concludes for all inputs S of output registers of stage 2.1 which are clocked after cycle T:
\[ S_\Pi^{T} = S_\Sigma^{T'} \]
exactly as in the first case.

0 %)  

In this section, we analyze the impact of the floating point unit on the cost and the performance of the pipelined DLX design. We also analyze how the FPU impacts the optimal cache size (section 9.5.2).

*'     %  

In the following, we compare the cost and the cycle time of the designs DLXΠ and FDLXΠ. Both designs use a split 4KB cache. The Icache and the Dcache are of equal size, i.e., 2KB each. They are two-way set associative with LRU replacement and implement the write allocate, write through policy. With respect to the timing, we assume that the memory interface has a bus delay of $d_{bus} = 15$ and a handshake delay of $d_{Mhsh} = 10$ gate delays.
Table 9.23: Cost of the pipelined DLX data paths. DPM denotes the data paths without the memory environments.

    environment   IR     PC      DAddr   EX       SH4L   RF
    DLXΠ          301    2610    60      3795     380    7257
    FDLXΠ         301    2618    90      110093   860    11532
    increase             0.3%    50%     2800%    126%   59%

    environment   CA     buffer  FORW    IM, DM   DPM      DP
    DLXΠ          471    2064    1624    96088    20610    116698
    FDLXΠ         717    9206    3904    95992    143635   239627
    increase      52%    346%    140%    -0.1%    597%     105%

   +  " 
Except for the environments IRenv and IMenv, all parts of the data paths and of the control had to be adapted to the floating point instruction set. Significant changes occurred in the execute stage, in the register file environment, in the forwarding hardware, and in the control (table 9.23).
The floating point unit itself is very expensive; its cost runs at 104 kilo gates (section 8.7). Compared to the FPU, the FXU is fairly inexpensive. Thus, in the FDLX design, the execute environment is 28 times more expensive than in the DLX design. The FPU accounts for about 95% of the cost of EXenv.
There is also a significant cost increase in the forwarding hardware, in the buffers and in the register file environment. This increase is due to the deeper pipeline and to the additional floating point operands. The remaining environments contribute at most 1kG (kilo gate) to the cost increase. The memory environments even become slightly cheaper, due to the simpler data memory interface. The data ports of the Dcache and of environment DMenv now have the same width (64 bits); the patch of the data ports therefore becomes obsolete.
In the DLXΠ design, the 4KB split cache is by far the single most expensive unit; it accounts for 82% of the cost. The FPU is about 9% more expensive than the 4KB cache. Thus, in the FDLXΠ design, the 4KB cache only contributes 40% to the cost of the data paths; environment EXenv contributes another 46%. Adding the FPU roughly doubles the cost of the pipelined data paths (factor 2.05). Without the caches, the FPU has an even stronger cost impact: according to the DPM column of table 9.23, it increases the cost of the data paths by 597%, i.e., roughly by a factor of 7.
Table 9.24: Cost of the control of the pipelined DLX designs without and with FPU.

                MifC   stall, CE   preCon   automata   CON    DLX
    DLXΠ        943    165         202      952        2262   118960
    FDLXΠ       1006   623         1440     2829       5898   245514
    increase    6.7%   278%        613%     197%       161%   106%

    
Table 9.24 lists the cost of the different control environments and of the
whole DLX designs. Adding the FPU increases the cost of the control
by 160%. The cost of the memory interface control remains virtually the
same. Due to the deeper pipeline, the stall engine becomes about 4 times
as expensive.
The control automata become about three times as expensive. This is
largely due to the Moore automaton which precomputes the control signals
of the stages EX to WB. It now requires 44 instead of 17 states, and it
generates 48 instead of 16 control signals. The Moore control signals have
a 7 times higher accumulated frequency νsum (342 instead of 48).
The larger number of control signals also impacts the cost of the pre-
computed control, which passes these signals down the pipeline. Since
the pipeline is also much deeper, the precomputed control is 7 times as
expensive as before.

%  
Table 9.25 lists the cycle time for each stage of the data paths. The cycle time of the write back stage remains the same, despite the additional register file. The FPR register file consists of two RAM banks, each of which only has half the size of the RAM used in the GPR register file. Thus, time $T_{WB}$ is still dominated by the delay of the shifter SH4L and the GPR register file.
Due to the aliasing of single and double precision registers, each word
of a floating point operand must be forwarded separately. Since all the
operands are fetched and forwarded in parallel, the floating point extension
has only a minor impact on the operand fetch time. The cycle time of stage
ID is still dominated by the PC environment.
The FPU is much more complex than the FXU. Thus, the cycle time of
the execute stage is increased by about 50%; the execute stage becomes
time critical. The cycle time of the control is also increased significantly
(16%). This is due to the non-uniform latency of the execute stage, which
requires the use of an RSR.
Table 9.25: Cycle times of the data paths of the designs DLXΠ and FDLXΠ with 2KB, 2-way Icache and Dcache.

             ID (operands)   ID (PC)   EX   WB   DP   CON / stall
    DLXΠ     72              89        66   33   89   max(79, 46 + dbus)
    FDLXΠ    74              89        98   33   98   max(92, 48 + dbus)

Table 9.26: Memory cycle times of the DLX designs with 2KB, 2-way Icache and Dcache, assuming a bus and handshake delay of $d_{bus} = 15$ and $d_{Mhsh} = 10$.

             $read   $if   Mreq   Mrburst   Maccess (α = 4)   Maccess (α = 8)
    DLXΠ     55      47    42     51        379               707
    FDLXΠ    53      47    42     51        379               707

The memory system remains virtually the same, except for one multi-
plexer which is saved in the Dcache interface and a modification of the
bank write signals. The latter has no impact on the delay of the memory
control. Thus, except for the cache read time T$read , the two DLX designs
with and without FPU have identical memory cycle times (table 9.26).

*' >     :

Like in sections 6.4.2 and 6.5.3, we now optimize the cache size of the
FDLXΠ design for performance and for a good performance cost ratio.
The optimization is based on a floating point workload.

  +%
Table 9.27 lists the cost, the cycle time TFDLX of the CPU, and the memory
access times for the pipelined FDLX design. The total cache size varies
between 0KB and 32KB. The 64MB main memory uses DRAMs which
are 4 (8) times slower and denser than SRAM.
As before, doubling the cache size roughly doubles the cost of the mem-
ory environment. However, due to the expensive floating point unit, a
cache system of 1KB to 4KB only causes a moderate (25 - 65%) increase
of the total hardware cost. In combination with small caches, the FPU
'
  *
P IPELINED DLX   Cost, CPU cycle time and memory access time of the FDLX Π design
M ACHINE WITH
total CM CFDLX
F LOATING P OINT TFDLX TM 4 TM 8
cache [kG] [kG] [%]
C ORE
0KB 0 149 100 98 355 683
1KB 30 179 120 98 359 687
2KB 52 201 135 98 367 695
4KB 96 246 165 98 379 707
8KB 184 334 224 98 382 710
16KB 360 510 342 104 385 713
32KB 711 861 578 107 388 716

dominates the CPU cycle time. Beyond a total cache size of 16KB, the
detection of a cache hit becomes time critical.
The memory access time grows with the cache size; it is significantly
larger than the CPU cycle time. As before, the actual memory access is
therefore performed in W cycles with a cycle time of

$\tau_M = \lceil T_M / W \rceil$

The cycle time of the FDLX design then equals

$\tau = \max\{\tau_M, T_{FDLX}\}$

Up to $W = T_M / T_{FDLX}$, increasing the number W of memory cycles
reduces the cycle time τ, but it also increases the cycle count. Thus,
there is a trade-off between cycle time and cycle count. The optimal
parameter W strongly depends on the memory system and on the workload.
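The trade-off can be explored numerically. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that the memory cycle time is T_M / W rounded up to whole gate delays, which is consistent with the cycle times listed in table 9.30. The sample values are taken from the 1KB row of table 9.27.

```python
def memory_cycle_time(t_m: int, w: int) -> int:
    """Cycle time tau_M needed to finish an access of t_m gate delays in w cycles."""
    if w < 1:
        raise ValueError("W must be at least 1")
    return (t_m + w - 1) // w  # integer ceiling of t_m / w


def design_cycle_time(t_m: int, w: int, t_cpu: int) -> int:
    """Cycle time tau of the whole design: the slower of memory and CPU."""
    return max(memory_cycle_time(t_m, w), t_cpu)


# 1KB cache system of table 9.27: T_FDLX = 98, T_M(4) = 359, T_M(8) = 687.
T_FDLX = 98
for alpha, t_m, w in [(4, 359, 4), (8, 687, 7)]:
    tau = design_cycle_time(t_m, w, T_FDLX)
    print(f"alpha={alpha}: W={w}, tau={tau}")
```

Increasing W beyond the point where tau_M drops below T_FDLX no longer shortens the cycle time; it only adds cycles to every slow memory access.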

" 
In addition to the integer benchmarks of table 4.20, the SPEC92 suite also
comprises 14 floating point benchmarks (for details see [Sta, HP96]). On
average, this floating point workload SPECfp92 uses the instruction mix
listed in table 9.28; this table is derived from [Del97].
The non-uniform latency of the execute stage makes it very difficult (or
even impossible) to derive the CPI ratio of the pipelined FDLX design
in an analytic manner. In [Del97], the CPI ratio is therefore determined
by a trace based simulation. Assuming an ideal memory which performs
every access in a single cycle, the FDLX design achieves on the SPECfp92
workload a CPI ratio of

$CPI_{ideal}^{fp} = 1.759$
Table 9.28: Instruction mix of the average SPECfp92 floating point
workload

instruction      FXU     load    store   jump   branch
frequency [%]   39.12   20.88   10.22    2.32   10.42

instruction      fadd    fmul    fdiv    cvt    1 cycle
frequency [%]    5.24    5.78    1.17    2.13    2.72

Table 9.29: Memory access time of the FDLX design with cache memory
(given in CPU cycles)

read hit    read miss    write hit    write miss
   1        1 + S + W      2 + W      2 + W + S + W

The split cache system of the FDLX design has a non-uniform access
time which depends on the type of the access (table 9.29). Thus, a read
miss takes 1 + S + W cycles. In the FDLX design each cache line has
S = 4 sectors. The parameter W depends on the speed of the memory
system; in this framework, it varies between 3 and 16 cycles.
The whole pipeline is stalled in case of a slow data memory access.
On an instruction fetch miss, only the fetch and decode stage are stalled,
the remaining stages still proceed. However, these stages get eventually
drained since the decode stage provides no new instructions. Thus, an
instruction fetch miss will also cause a CPI penalty.
In order to keep the performance model simple, we assume that the
whole pipeline is stalled on every slow memory access. That gives us a
lower bound for the performance of the pipelined FDLX design. In anal-
ogy to equation 6.5 (page 312), the CPI ratio of the FDLXΠ design with
cache memory can then be modeled as
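
$CPI_{fp} = CPI_{ideal}^{fp} + \nu_{store} \cdot (1 + W) + (\nu_{fetch} \cdot p_{Im} + \nu_{load+store} \cdot p_{Dm}) \cdot (W + S)$

$\phantom{CPI_{fp}} = 1.861 + 0.102 \cdot W + (p_{Im} + 0.311 \cdot p_{Dm}) \cdot (W + S)$
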
where pIm and pDm denote the miss ratios of the instruction cache and data
cache, respectively. Table 9.30 lists the miss ratios of the instruction and
data cache. In addition, it lists the optimal cycle time, CPI and TPI (time
per instruction) ratio for the different memory systems.
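The CPI model is easy to evaluate for any combination of miss ratios and memory speed. The Python sketch below is an illustration under stated assumptions: the function and parameter names are hypothetical, the instruction frequencies are those of table 9.28, and the miss penalty is taken to be W + S cycles, a reading of the formula that reproduces the CPI values of table 9.30.

```python
def cpi_fp(w, p_im, p_dm, s=4, cpi_ideal=1.759,
           nu_load=0.2088, nu_store=0.1022, nu_fetch=1.0):
    """CPI of the pipelined FDLX design with split cache.

    w    : number of cycles W per memory access
    p_im : miss ratio of the instruction cache (as a fraction, not percent)
    p_dm : miss ratio of the data cache (as a fraction, not percent)
    s    : number of sectors S per cache line
    """
    store_penalty = nu_store * (1 + w)
    miss_penalty = (nu_fetch * p_im + (nu_load + nu_store) * p_dm) * (w + s)
    return cpi_ideal + store_penalty + miss_penalty


def tpi(tau, cpi):
    """Time per instruction: cycle time multiplied by cycles per instruction."""
    return tau * cpi


# 1KB cache with fast DRAM (alpha = 4): W = 4, tau = 98, miss ratios from table 9.30.
cpi = cpi_fp(w=4, p_im=0.0540, p_dm=0.107)
print(f"CPI = {cpi:.2f}, TPI = {tpi(98, cpi):.1f}")
```

Because the model stalls the whole pipeline on every slow access, the values it produces are pessimistic estimates, in line with the lower bound on performance discussed above.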
Table 9.30: Miss ratio, cycle time, CPI and TPI ratio of the FDLXΠ
design. For α = 4 (8), a memory access is performed in W = 4 (7) cycles.

total    miss ratio [%]      DRAM α = 4             DRAM α = 8
cache     I$      D$        τ     CPI    TPI       τ     CPI    TPI
1KB      5.40    10.7       98    2.97   290.8     99    3.54   350.1
2KB      1.98    7.69       98    2.62   256.7    100    3.06   305.6
4KB      1.04    6.08       98    2.50   245.3    101    2.90   292.6
8KB      0.70    5.31       98    2.46   240.8    102    2.83   289.1
16KB     0.46    4.56      104    2.42   251.6    104    2.78   289.3
32KB     0.23    2.33      107    2.35   250.9    107    2.68   286.7

Doubling the total cache size cuts the miss ratio of the Icache roughly
by half, whereas up to 16KB, the miss ratio of the Dcache is only reduced
by about 30%. This suggests that the data accesses require a larger
working set than the instruction fetches, and that the instruction
fetches have a better locality. A larger cache improves the CPI ratio
but with diminishing returns. Since a larger cache also increases the
cycle time, the 16KB cache system even yields a worse performance than
the 8KB system. Thus, with respect to performance, a total of 8KB cache
is optimal.
Without caches, every memory access takes 1 + W cycles, and the
pipelined FDLX design then has a CPI ratio of

$CPI_{no\$} = 1.759 + W \cdot 1.311$

In combination with fast DRAM (α = 4), the design runs with W = 3 at a
cycle time of τ = 119 and achieves a TPI ratio of 677.4. According to table
9.31, the split cache gains a speedup of 2.3 to 2.8 over the design without
caches. In combination with slower DRAM (α = 8), the FDLX design
without caches runs with W = 7 at τ = 98 and has a TPI ratio of 1071.7.
The split cache system then even yields a speedup of 3.1 to 3.7.
Even for the 8KB cache system, the speedup is in either case signifi-
cantly larger than the cost increase. Thus, the cache is definitely worth-
while.
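The speedup figures can be retraced with a few lines of Python. This is a sketch with hypothetical function names; the constant 1.311 is the number of memory accesses per instruction (one fetch plus the load and store frequencies of table 9.28), and the TPI values of the cached designs are taken from table 9.30.

```python
def cpi_no_cache(w, cpi_ideal=1.759, accesses_per_instruction=1.311):
    """CPI without caches: every fetch, load and store costs W extra cycles."""
    return cpi_ideal + w * accesses_per_instruction


def speedup(tpi_without_cache, tpi_with_cache):
    """Performance gain of the cached design over the design without caches."""
    return tpi_without_cache / tpi_with_cache


# Design without caches: (W, tau) = (3, 119) for alpha = 4 and (7, 98) for alpha = 8.
tpi_fast = 119 * cpi_no_cache(3)
tpi_slow = 98 * cpi_no_cache(7)

# TPI of the 8KB split cache from table 9.30: 240.8 (alpha = 4), 289.1 (alpha = 8).
print(f"alpha=4: TPI without cache {tpi_fast:.1f}, speedup {speedup(tpi_fast, 240.8):.2f}")
print(f"alpha=8: TPI without cache {tpi_slow:.1f}, speedup {speedup(tpi_slow, 289.1):.2f}")
```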
The diagrams of figure 9.28 depict the quality ratio of the FDLX
designs with split cache over that without cache. Note that the quality
is the weighted geometric mean of the cost and the TPI ratio:
$Q = C^{-q} \cdot TPI^{q-1}$.
For a realistic quality measure, the parameter q lies in the range [0.2, 0.5].
Within this range, the design with a total cache size of 4KB is best. The
8KB system only wins if much more emphasis is put on the performance
than on the cost.
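The quality ratio of two designs follows directly from the quality measure. The sketch below uses hypothetical function names and assumes the quality measure Q = C^(-q) * TPI^(q-1); the ratio of a cached design to the design without cache then depends only on the cost increase factor and the speedup, both of which are listed in table 9.31.

```python
def quality(cost, tpi, q):
    """Quality Q = C^(-q) * TPI^(q-1); q weights cost against performance."""
    return cost ** (-q) * tpi ** (q - 1)


def quality_ratio(cost_increase, speedup, q):
    """Quality of a design relative to a reference design.

    cost_increase : cost of the design divided by the cost of the reference
    speedup       : TPI of the reference divided by the TPI of the design
    """
    return speedup ** (1 - q) / cost_increase ** q


# Split caches with alpha = 4; (cost increase, speedup) pairs from table 9.31.
designs = {"1KB": (1.24, 2.33), "2KB": (1.39, 2.64),
           "4KB": (1.71, 2.76), "8KB": (2.32, 2.81)}
for q in (0.2, 0.35, 0.5):
    ratios = {name: round(quality_ratio(c, s, q), 2) for name, (c, s) in designs.items()}
    print(f"q = {q}: {ratios}")
```

For q = 0 the ratio equals the speedup (only performance counts), and for q = 1 it equals the inverse of the cost increase (only cost counts).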
Table 9.31: Speedup and cost increase of the FDLXΠ with a split 2-way
cache over the design without cache

total cache size        1KB     2KB     4KB     8KB
speedup: α = 4          2.33    2.64    2.76    2.81
         α = 8          3.06    3.51    3.66    3.71
cost increase factor    1.24    1.39    1.71    2.32

[Figure 9.28: two diagrams, one for α = 4 and one for α = 8, plotting the
quality ratio against the quality parameter q (0 to 0.6) for the 1KB,
2KB, 4KB and 8KB split caches and for the design without cache.]

Figure 9.28: Quality ratio of the design with a split 2-way cache
relative to the design without cache for two types of off-chip memory.

Exercises

Exercise 9.1 An arithmetical FPU instruction I_i updates the SPR register
IEEEf by a read-modify-write access:

$IEEEf_i = IEEEf_{i-1} \lor Ffl_i$

Unlike any other instruction updating the SPR register file, the input of
this write access is provided via the special write port Di6 and not via
the standard write port Din. That complicates the forwarding of the SPR
operand S. In order to keep the forwarding engine (section 9.4.2)
lean, the forwarding of the IEEEf flags generated by an arithmetical FPU
operation was omitted.

1. The result to be written onto register IEEEf is always available in
   the circuitry of stage WB. Extend the forwarding engine (and the
   interlock engine) such that IEEEf is forwarded from stage WB even
   in case of an arithmetical FPU instruction.
2. Flags Ffl provided by the FPU become available in stage 2.4. When
   combined with the forwarded IEEEf value from stages 3 and 4,
   register IEEEf can also be forwarded from the stages 2.4, 3 and 4.
   Construct a forwarding engine which supports this type of forwarding.
3. How do the modifications of 1) and 2) impact the cost and cycle time?

Exercise 9.2 Construct a sequence of k instructions, such that data from
the first k - 1 instructions have to be forwarded to the k'th instruction.
How large can k be?

Exercise 9.3 In many contemporary machines (year 2000) a change of the
rounding mode slows programs down much more than an additional
floating point instruction (this makes interval arithmetic extremely
slow). What part of the hardware of the machine constructed here has to
be deleted in order to produce this behavior?

Exercise 9.4 Suppose in the division algorithm we use an initial lookup
table with γ = 5 or γ = 16.

1. Which parts of the machine have to be changed? Specify the changes.
2. How is the cost of the machine affected?

Exercise 9.5 Sketch the changes of the design required if we want to make
division fully pipelined (conceptually, this makes the machine much
simpler). Estimate the extra cost.

Exercise 9.6 Evaluate the quality of the machines from exercises 9.4 and
9.5. Assume that the cycle time is not affected. For the machine from
exercise 9.5 use your estimate for the cost. Compare with the machine
constructed in the text.

Appendix A
DLX Instruction Set Architecture

THE DLX is a 32-bit RISC architecture which manages with only three
instruction formats. The core of the architecture is the fixed point unit
FXU, but there also exists a floating point extension.

A.1 DLX Fixed-Point Core: FXU

THE DLX fixed point unit uses 32 general purpose registers R0 to R31,
each of which is 32 bits wide. Register R0 always has the value
0. The FXU also has a few 32-bit special purpose registers mainly used
for handling interrupts. Table A.1 lists these registers as well as a brief
description of their usage. For more details see chapter 5. Special move
instructions transfer data between general and special purpose registers.
Load and store operations move data between the general purpose reg-
isters and the memory. There is a single addressing mode: the effective
memory address ea is the sum of a register and an immediate constant.
Except for shifts, immediate constants are always sign-extended to 32-bits.
The memory is byte addressable and performs byte, half-word or word
accesses. All instructions are coded in four bytes. In memory, data and
instructions must be aligned in the following way: Half words must start
at even byte addresses. Words and instructions must start at addresses
divisible by 4. These addresses are called word boundaries.
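The addressing mode and the alignment rule can be summarized in a short Python sketch. The function names are hypothetical, and the wrap-around of the address to 32 bits is an assumption made for the illustration; the text only states that ea is the sum of a register and a sign-extended immediate.

```python
def sxt16(imm: int) -> int:
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(register_value: int, imm: int) -> int:
    """ea = register + sxt(imm), truncated to 32 bits (assumed wrap-around)."""
    return (register_value + sxt16(imm)) & 0xFFFFFFFF


def is_aligned(ea: int, width: int) -> bool:
    """Alignment rule: an access of 1, 2 or 4 bytes must start at a multiple of its width."""
    if width not in (1, 2, 4):
        raise ValueError("access width must be 1, 2 or 4 bytes")
    return ea % width == 0


ea = effective_address(0x1000, 0xFFFC)  # immediate encodes -4
print(hex(ea), is_aligned(ea, 4), is_aligned(ea + 2, 4), is_aligned(ea + 2, 2))
```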
  
Table A.1: Special purpose registers of the DLX fixed point core

register      usage
PC            program counter; points to the next instruction
SR            status register; holds interrupt masks (among others)
CA            cause register; records pending interrupts
EPC, ESR,     exception registers; on a jump to the interrupt service
ECA, EMAR     routine they backup the current value of PC, SR, CA
              respectively the current memory address

            6        5      5      16
I-type    opcode    RS1    RD     immediate

            6        5      5      5     5     6
R-type    opcode    RS1    RS2    RD    SA    function

            6        26
J-type    opcode    PC offset

Figure A.1: The three instruction formats of the DLX design. The fields
RS1 and RS2 specify the source registers, and the field RD specifies the
destination register. Field SA specifies a special purpose register or an
immediate shift amount. The function field is an additional 6-bit opcode.

 . -  , 

All three instruction formats (figure A.1) have a 6-bit primary opcode and
specify up to three explicit operands. The I-type (Immediate) format spec-
ifies two registers and a 16-bit constant. That is the standard layout for
instructions with an immediate operand. The J-type (Jump) format is used
for control instructions. They require no explicit register operand and profit
from the larger 26-bit immediate operand. The third format, R-type (Regis-
ter) format, provides an additional 6-bit opcode (function). The remaining
20 bits specify three general purpose registers and a field SA which spec-
ifies a 5-bit constant or a special purpose register. A 5-bit constant, for
example, is sufficient as shift amount.
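The three formats can be decoded with plain bit slicing. The following Python sketch is an illustration with hypothetical function names; it assumes that the fields are packed in the order shown in figure A.1, starting at the most significant bit, which agrees with the positions IR[31:26], IR[10:6] and IR[5:0] used in the coding tables.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return the bit field word[hi:lo] of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_i_type(ir: int) -> dict:
    return {"opcode": bits(ir, 31, 26), "rs1": bits(ir, 25, 21),
            "rd": bits(ir, 20, 16), "immediate": bits(ir, 15, 0)}


def decode_r_type(ir: int) -> dict:
    return {"opcode": bits(ir, 31, 26), "rs1": bits(ir, 25, 21),
            "rs2": bits(ir, 20, 16), "rd": bits(ir, 15, 11),
            "sa": bits(ir, 10, 6), "function": bits(ir, 5, 0)}


def decode_j_type(ir: int) -> dict:
    return {"opcode": bits(ir, 31, 26), "pc_offset": bits(ir, 25, 0)}


# An R-type word with opcode hx00 and function hx21 (add), RS1 = 1, RS2 = 2, RD = 3.
word = (0x00 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x21
print(decode_r_type(word))
```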
Table A.2: J-type instruction layout; sxt(imm) is the sign-extended
version of the 26-bit immediate called PC Offset.

IR[31:26]   mnemonic   effect
Control Operation
hx02        j          PC = PC + 4 + sxt(imm)
hx03        jal        R31 = PC + 4; PC = PC + 4 + sxt(imm)
hx3e        trap       trap = 1; Edata = sxt(imm)
hx3f        rfe        SR = ESR; PC = EPC; DPC = EDPC

 . -    

Since the DLX description in [HP90] does not specify the coding of the
instruction set, we adapt the coding of the MIPS R2000 machine ([PH94,
KH92]) to the DLX instruction set. Tables A.2 through A.4 specify the
instruction set and list the coding; the prefix “hx” indicates that the number
is represented as hexadecimal. The effects of the instructions are specified
in a register transfer language.

 " #4+  %& 

BESIDES THE fixed point unit, the DLX architecture also comprises a
floating point unit FPU, which can handle floating point numbers in
single precision (32 bits) or in double precision (64 bits). For both
precisions, the FPU fully conforms to the requirements of the ANSI/IEEE
standard 754 [Ins85].

  ," 8   

The FPU provides 32 floating point general purpose registers FPRs, each
of which is 32 bits wide. In order to store double precision values, the
registers can be addressed as 64-bit floating point registers FDRs. Each of
the 16 FDRs is formed by concatenating two adjacent FPRs (table A.5).
Only even numbers 0, 2, ..., 30 are used to address the 64-bit floating
point registers FDR; the least significant address bit is ignored. In
addition, the FPU
provides three floating point control registers: a 1-bit register FCC for the
floating point condition code, a 5-bit register IEEEf for the IEEE exception
flags and a 2-bit register RM specifying the IEEE rounding mode.
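The pairing of FPRs into FDRs can be written down in a few lines. The Python sketch below uses hypothetical names and follows the register map of table A.5: the FPR with the even address holds bits [31:0] of the double precision register, and the next higher FPR holds bits [63:32].

```python
def fdr_to_fprs(address: int) -> tuple:
    """Return the FPR addresses (high word, low word) forming the addressed FDR.

    The least significant address bit is ignored, so addresses 2k and 2k+1
    select the same 64-bit register.
    """
    even = address & 0b11110
    return even + 1, even


def read_fdr(fpr: list, address: int) -> int:
    """Concatenate two adjacent 32-bit FPRs into one 64-bit value."""
    hi, lo = fdr_to_fprs(address)
    return (fpr[hi] << 32) | fpr[lo]


fpr = [0] * 32
fpr[31], fpr[30] = 0x3FF00000, 0x00000000  # bit pattern of the double 1.0
print(fdr_to_fprs(30), hex(read_fdr(fpr, 30)))
```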
Table A.3: R-type instruction layout. All instructions increment the PC
by four. SA is a shorthand for the special purpose register SPR[SA]; sa
denotes the 5-bit immediate shift amount specified by the bits IR[10:6].

IR[31:26]  IR[5:0]  mnemonic  effect
Shift Operation
hx00       hx00     slli      RD = sll(RS1, sa)
hx00       hx02     srli      RD = srl(RS1, sa)
hx00       hx03     srai      RD = sra(RS1, sa)
hx00       hx04     sll       RD = sll(RS1, RS2[4:0])
hx00       hx06     srl       RD = srl(RS1, RS2[4:0])
hx00       hx07     sra       RD = sra(RS1, RS2[4:0])
Arithmetic, Logical Operation
hx00       hx20     addo      RD = RS1 + RS2; ovf signaled
hx00       hx21     add       RD = RS1 + RS2; no ovf signaled
hx00       hx22     subo      RD = RS1 - RS2; ovf signaled
hx00       hx23     sub       RD = RS1 - RS2; no ovf signaled
hx00       hx24     and       RD = RS1 ∧ RS2
hx00       hx25     or        RD = RS1 ∨ RS2
hx00       hx26     xor       RD = RS1 ⊕ RS2
hx00       hx27     lhg       RD = RS2[15:0] 0^16
Test Set Operation
hx00       hx28     clr       RD = (false ? 1 : 0)
hx00       hx29     sgr       RD = (RS1 > RS2 ? 1 : 0)
hx00       hx2a     seq       RD = (RS1 = RS2 ? 1 : 0)
hx00       hx2b     sge       RD = (RS1 ≥ RS2 ? 1 : 0)
hx00       hx2c     sls       RD = (RS1 < RS2 ? 1 : 0)
hx00       hx2d     sne       RD = (RS1 ≠ RS2 ? 1 : 0)
hx00       hx2e     sle       RD = (RS1 ≤ RS2 ? 1 : 0)
hx00       hx2f     set       RD = (true ? 1 : 0)
Special Move Instructions
hx00       hx10     movs2i    RD = SA
hx00       hx11     movi2s    SA = RS1

  ," . -  

The DLX machine uses two formats (figure A.2) for the floating point
instructions; one corresponds to the I-type and the other to the R-type of
the fixed point core. The FI-format is used for loading data from memory
Table A.4: I-type instruction layout. All instructions except the control
instructions also increment the PC by four; sxt(a) is the sign-extended
version of a. The effective address of memory accesses equals
ea = GPR[RS1] + sxt(imm), where imm is the 16-bit immediate. The width of
the memory access in bytes is indicated by d. Thus, the memory operand
equals m = M[ea + d - 1 : ea].

IR[31:26]  mnemonic  d  effect
Data Transfer
hx20       lb        1  RD = sxt(m)
hx21       lh        2  RD = sxt(m)
hx23       lw        4  RD = m
hx24       lbu       1  RD = 0^24 m
hx25       lhu       2  RD = 0^16 m
hx28       sb        1  m = RD[7:0]
hx29       sh        2  m = RD[15:0]
hx2b       sw        4  m = RD
Arithmetic, Logical Operation
hx08       addio        RD = RS1 + imm; ovf signaled
hx09       addi         RD = RS1 + imm; no ovf signaled
hx0a       subio        RD = RS1 - imm; ovf signaled
hx0b       subi         RD = RS1 - imm; no ovf signaled
hx0c       andi         RD = RS1 ∧ sxt(imm)
hx0d       ori          RD = RS1 ∨ sxt(imm)
hx0e       xori         RD = RS1 ⊕ sxt(imm)
hx0f       lhgi         RD = imm 0^16
Test Set Operation
hx18       clri         RD = (false ? 1 : 0)
hx19       sgri         RD = (RS1 > imm ? 1 : 0)
hx1a       seqi         RD = (RS1 = imm ? 1 : 0)
hx1b       sgei         RD = (RS1 ≥ imm ? 1 : 0)
hx1c       slsi         RD = (RS1 < imm ? 1 : 0)
hx1d       snei         RD = (RS1 ≠ imm ? 1 : 0)
hx1e       slei         RD = (RS1 ≤ imm ? 1 : 0)
hx1f       seti         RD = (true ? 1 : 0)
Control Operation
hx04       beqz         PC = PC + 4 + (RS1 = 0 ? imm : 0)
hx05       bnez         PC = PC + 4 + (RS1 ≠ 0 ? imm : 0)
hx16       jr           PC = RS1
hx17       jalr         R31 = PC + 4; PC = RS1

Table A.5: Register map of the general purpose floating point registers

floating point general purpose registers     floating point registers
single precision (32-bit)                     double precision (64-bit)

FPR31[31:0]     FDR30[63:32]
                                              FDR30[63:0]
FPR30[31:0]     FDR30[31:0]
    :               :
FPR3[31:0]      FDR2[63:32]
                                              FDR2[63:0]
FPR2[31:0]      FDR2[31:0]

FPR1[31:0]      FDR0[63:32]
                                              FDR0[63:0]
FPR0[31:0]      FDR0[31:0]

             6        5      5           16
FI-type    Opcode    Rx     FD          Immediate

             6        5      5           5           3      6
FR-type    Opcode    FS1    FS2 / Rx    FD    00    Fmt    Function

Figure A.2: Floating point instruction formats of the DLX. Depending on
the precision, FS1, FS2 and FD specify 32-bit or 64-bit floating point
registers. RS specifies a general purpose register of the FXU. Function is
an additional 6-bit opcode. Fmt specifies a number format.

into the FPU respectively for storing data from the FPU into memory. This
format is also used for conditional branches on the condition code flag
FCC of the FPU. The coding of those instructions is given in table A.6.
The FR-format is used for the remaining FPU instructions (table A.8). It
specifies a primary and a secondary opcode (Opcode, Function), a number
format Fmt, and up to three floating point (general purpose) registers. For
instructions which move data between the floating point unit FPU and the
fixed point unit FXU, field FS2 specifies the address of a general purpose
register RS in the FXU.
Since the FPU of the DLX machine can handle floating point numbers
with single or double precision, all floating point operations come in two
versions; the field Fmt in the instruction word specifies the precision
used. In the mnemonics, we identify the precision by adding the suffix
‘.s’ (single) or ‘.d’ (double).

Table A.6: FI-type instruction layout. All instructions except the
branches also increment the PC, PC += 4; sxt(a) is the sign extended
version of a. The effective address of memory accesses equals
ea = RS + sxt(imm), where imm is the 16-bit offset. The width of the
memory access in bytes is indicated by d. Thus, the memory operand equals
m = M[ea + d - 1 : ea].

IR[31:26]  mnemonic  d  effect
Load, Store
hx31       load.s    4  FD[31:0] = m
hx35       load.d    8  FD[63:0] = m
hx39       store.s   4  m = FD[31:0]
hx3d       store.d   8  m = FD[63:0]
Control Operation
hx06       fbeqz        PC = PC + 4 + (FCC = 0 ? sxt(imm) : 0)
hx07       fbnez        PC = PC + 4 + (FCC ≠ 0 ? sxt(imm) : 0)

Table A.7: Floating-Point Relational Operators. The value 1 (0) denotes
that the relation is true (false).

        condition             relations                       invalid
code    mnemonic         greater  less  equal  unordered     if
        true    false       >      <      =        ?         unordered
 0      F       T           0      0      0        0         no
 1      UN      OR          0      0      0        1         no
 2      EQ      NEQ         0      0      1        0         no
 3      UEQ     OGL         0      0      1        1         no
 4      OLT     UGE         0      1      0        0         no
 5      ULT     OGE         0      1      0        1         no
 6      OLE     UGT         0      1      1        0         no
 7      ULE     OGT         0      1      1        1         no
 8      SF      ST          0      0      0        0         yes
 9      NGLE    GLE         0      0      0        1         yes
10      SEQ     SNE         0      0      1        0         yes
11      NGL     GL          0      0      1        1         yes
12      LT      NLT         0      1      0        0         yes
13      NGE     GE          0      1      0        1         yes
14      LE      NLE         0      1      1        0         yes
15      NGT     GT          0      1      1        1         yes
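The table of relational operators can be read as a small decision procedure. The Python sketch below is an illustration with a hypothetical function name: it reads the three low-order bits of the condition code as selectors for the relations less, equal and unordered, and bit 3 as the "invalid if unordered" marker, exactly as the columns of the table suggest. Any further exception rules of the IEEE standard (for example for signaling NaNs) are not modeled.

```python
import math


def fp_condition(code: int, a: float, b: float) -> tuple:
    """Evaluate the 'true' mnemonic of a 4-bit condition code on operands a, b.

    Returns (result, invalid): result is the value of the condition,
    invalid tells whether the comparison would signal an invalid operation.
    """
    unordered = math.isnan(a) or math.isnan(b)
    less = (not unordered) and a < b
    equal = (not unordered) and a == b

    test_unordered = bool(code & 1)
    test_equal = bool(code & 2)
    test_less = bool(code & 4)
    invalid_if_unordered = bool(code & 8)

    result = ((test_less and less) or (test_equal and equal)
              or (test_unordered and unordered))
    return result, invalid_if_unordered and unordered


nan = float("nan")
print(fp_condition(6, 1.0, 2.0))   # OLE: ordered and less or equal
print(fp_condition(12, nan, 1.0))  # LT on an unordered pair
```

The "false" mnemonic of each code is simply the negation of the returned result.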

Table A.8: FR-type instruction layout. All instructions execute PC += 4.
The format bits Fmt = IR[8:6] specify the number format used. Fmt = 000
denotes single precision and corresponds to the suffix ‘.s’ in the
mnemonics; Fmt = 001 denotes double precision and corresponds to the
suffix ‘.d’. FCC denotes the 1-bit register for the floating point
condition code. The functions sqrt(), abs() and rem() denote the square
root, the absolute value and the remainder of a division according to the
IEEE 754 standard. Instructions marked with * will not be implemented in
our FPU design. The opcode bits c[3:0] specify a relation “con” according
to table A.7. Function cvt() converts the value of a register from one
format into another. For that purpose, Fmt = 100 (i) denotes fixed point
format (integer) and corresponds to suffix ‘.i’.

IR[31:26]  IR[5:0]    Fmt  mnemonic          effect
Arithmetic and Compare Operations
hx11       hx00            fadd [.s, .d]     FD = FS1 + FS2
hx11       hx01            fsub [.s, .d]     FD = FS1 - FS2
hx11       hx02            fmul [.s, .d]     FD = FS1 * FS2
hx11       hx03            fdiv [.s, .d]     FD = FS1 / FS2
hx11       hx04            fneg [.s, .d]     FD = - FS1
hx11       hx05            fabs [.s, .d]     FD = abs(FS1)
hx11       hx06            fsqt* [.s, .d]    FD = sqrt(FS1)
hx11       hx07            frem* [.s, .d]    FD = rem(FS1, FS2)
hx11       11 c[3:0]       fc.con [.s, .d]   FCC = (FS1 con FS2)
Data Transfer
hx11       hx08       000  fmov.s            FD[31:0] = FS1[31:0]
hx11       hx08       001  fmov.d            FD[63:0] = FS1[63:0]
hx11       hx09            mf2i              RS = FS1[31:0]
hx11       hx0a            mi2f              FD[31:0] = RS
Conversion
hx11       hx20       001  cvt.s.d           FD = cvt(FS1, s, d)
hx11       hx20       100  cvt.s.i           FD = cvt(FS1, s, i)
hx11       hx21       000  cvt.d.s           FD = cvt(FS1, d, s)
hx11       hx21       100  cvt.d.i           FD = cvt(FS1, d, i)
hx11       hx24       000  cvt.i.s           FD = cvt(FS1, i, s)
hx11       hx24       001  cvt.i.d           FD = cvt(FS1, i, d)

Appendix B
Specification of the FDLX Design

FIGURES 9.16, 9.17 and 9.18 depict the FSD of the FDLX design. In
section B.1, we specify for each state of the FSD the RTL instructions
and their active control signals. In section B.2 we then specify the control
automata of the FDLX design.

B.1 RTL Instructions of the FDLX

    .,

In stage IF, the FDLX design fetches the next instruction I into the
instruction register (table B.1). This is done under the control of flag
fetch and of clock request signal IRce. Both signals are always active.

    .+

The actions which the FDLX design performs during instruction decode
depend on the instruction I held in register IR (table B.2). As for stage IF,
the clock request signals are active in every clock cycle. The remaining
control signals of stage ID are generated by a Mealy control automaton.
 
Table B.1: RTL instructions of the stage IF

RTL instruction      control signals
IR.1 = IM[DPC]       fetch, IRce

 ) RTL instructions of stage ID;  * denotes any arithmetical floating
point instruction with double precision.

RTL instruction type of I control signals


A  A  RS1 AEQZ  zero A  Ace,
B  RS2 PC  reset ? 4 : pc  Bce, PC’ce,
DPC  reset ? 0 : d pc DPCce,
link  PC  4 DDPC  DPC PCce
IR2  IR1 Sad 2  Sad
FA FB  FPemb f a f b ,&  # , # dbs.1
+& #   
otherwise
co  constant IR1 "# "# & Jimm
# #  shiftI
otherwise
pc d pc  , rfe.1
nextPC PC A co EPCs "# " jumpR, jump
 branch, bzero
% branch
, fbranch, bzero
,% fbranch
otherwise
Cad  CAddr IR1 "# " Jlink
R-type Rtype
otherwise
Sas Sad Fad   DAddr IR1 , rfe.1
, fc.1, FRtype
FR-type (no ,) FRtype
otherwise
CA212  1 ,&# , uFOP

' )
 
#    1=
The execute stage has a non-uniform latency which varies between 1 and
21 cycles. The execute stage consists of the five substages 2.0, 2.1 to 2.4.
For the iterative execution of divisions, stage 2.0 itself consists of 17
substages 2.0.0 to 2.0.16. In the following, we describe the RTL
instructions for each substage of the execute stage.

     -A
In stage 2.0, the update of the buffers depends on the latency of the
instruction I. Let

$k = \begin{cases} 3 & \text{if } I \text{ has latency of } l = 1 \\ 2.3 & \text{if } I \text{ has latency of } l = 3 \\ 2.1 & \text{if } I \text{ has latency of } l \ge 5 \end{cases}$

stage 2.0 then updates the buffers as

(IR.k, Cad.k, Sad.k, Fad.k) := (IR.2, Cad.2, Sad.2, Fad.2)
(PC.k, DPC.k, DDPC.k) := (PC, DPC, DDPC)

If I is a division, this update is postponed to stage 2.0.16.

For any stage k ∈ {2.1, ..., 2.4}, let k' be defined as

$k' = \begin{cases} 3 & \text{if } k = 2.4 \\ 2.(j+1) & \text{if } k = 2.j < 2.4 \end{cases}$

In stage k the buffers are then updated as

(IR.k', Cad.k', Sad.k', Fad.k') := (IR.k, Cad.k, Sad.k, Fad.k)
(PC.k', DPC.k', DDPC.k') := (PC.k, DPC.k, DDPC.k)
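The route of an instruction through the execute substages follows from these two definitions. The Python sketch below is an illustration with hypothetical function names; it assumes that an instruction of latency 1 proceeds from stage 2.0 directly to stage 3, one of latency 3 enters at stage 2.3, and one of latency 5 or more enters at stage 2.1. The iteration of divisions inside stage 2.0 (substages 2.0.0 to 2.0.16) is not expanded.

```python
SUBSTAGES = ["2.1", "2.2", "2.3", "2.4"]


def first_successor(latency: int) -> str:
    """Stage k that receives the buffers from stage 2.0."""
    if latency == 1:
        return "3"
    if latency == 3:
        return "2.3"
    if latency >= 5:
        return "2.1"
    raise ValueError(f"unsupported execute latency: {latency}")


def next_stage(k: str) -> str:
    """Successor k' of a substage k in {2.1, ..., 2.4}."""
    if k == "2.4":
        return "3"
    return SUBSTAGES[SUBSTAGES.index(k) + 1]


def execute_path(latency: int) -> list:
    """Sequence of stages visited from stage 2.0 up to the memory stage 3."""
    path = ["2.0"]
    k = first_successor(latency)
    while k != "3":
        path.append(k)
        k = next_stage(k)
    path.append("3")
    return path


for latency in (1, 3, 5):
    print(latency, execute_path(latency))
```

For latencies 1, 3 and 5 the number of execute substages on the path equals the latency, which is a useful sanity check on the reading of the two case distinctions.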

-   
Tables B.3 and B.4 list the RTL instructions for the fixed point instructions
and for the floating point instructions with 1-cycle execute latency. From
stage 2.0, these instructions directly proceed to stage 3.
The operand FB is only needed in case of a floating point test operation
fc. By fcc and Fc we denote the results of the floating point condition
test circuit FCON as defined in section 8.5

(fcc, Fc[68:0]) = FCon(FA, FB)

Tables B.5 and B.6 list the RTL instructions which stage 2.0 performs
for instructions with an execute latency of more than one cycle.
Table B.3: RTL instructions of the execute stages for the fixed point
instructions.

state RTL instruction control signals
alu MAR  A op B ALUDdoe, Rtype, bmuxsel
opA, opB, MARce, lat1
aluo MAR  A op B, overflow? like alu, ovf?
aluI MAR  A op co ALUDdoe, opA, MARce, lat1
aluIo MAR  A op co overflow? like aluI, ovf?
testI MAR  A rel co ? 1 : 0 ALUDdoe, test, opA, MARce,
lat1
test MAR  A rel B ? 1 : 0 like testI, Rtype, bmuxsel, opB
shiftI MAR  shift A co4 : 0 SHDdoe, shiftI, Rtype,
opA, MARce, lat1
shift MAR  shift A B4 : 0 like shiftI, bmuxsel, opB
savePC MAR  link linkDdoe, MARce, lat1
trap MAR  co trap  1 coDdoe, trap, MARce, lat1
Ill MAR  A ill  1 ADdoe, ill, opA, MARce, lat1
ms2i MAR  S SDdoe, MARce, lat1
rfe
mi2s MAR  A ADdoe, opA, MARce, lat1
noEX
addrL MAR  A  co ALUDdoe, add, opA, MARce,
lat1
addrS MAR  A  co F f l 3  0 ALUDdoe, add, amuxsel, opA,
MDRw  opB, store.2, MARce, MDRce,
cls B MAR1 : 0000 Ffl3ce, lat1, tfpRdoe

-     
The execute substages 2.1 and 2.2 are only used by the arithmetic
instructions fadd, fsub, fmul and fdiv. The RTL instructions for the
divisions are listed in table B.6 and for the other three types of
operations they are listed in table B.7.

-   #  &
In these two stages the FPU performs the rounding and packing of the
result (table B.8). In order to keep the description simple, we introduce
the following abbreviations: By FPrdR and FXrdR, we denote the out-
put registers of the first stage of the rounders FP RD and FX RD, respec-
tively. The two stages of the floating point rounder FP RD compute the
Table B.4: RTL instructions of the execute stages for floating point
instructions with a single cycle latency.

state RTL instruction control signals
addrL.s MAR  A  co ALUDdoe, add, opA, MARce, lat1
addrL.d
addrSf MAR  A  co ALUDdoe, add, opA, MARce,
MDRw  FB store.2, fstore.2, tfpRdoe,
Ff l 3  0 MDRwce, Ffl3ce, lat1, (amuxsel)
mf2i MAR  FA31 : 0 opFA, tfxDdoe, MARce, lat1
mi2f MDRw  B B opB, tfpRdoe, MDRwce,
Ff l 3  0 Ffl3ce, lat1
fmov.s MDRw  FA opFA, fmov, tfpRdoe,
fmov.d Ff l 3  0 MDRwce, Ffl3ce, lat1
fneg.s MDRw  Fc63 : 0 opFA, FcRdoe, MDRwce,
fneg.d Ff l 3  Fc68 : 64 Ffl3ce, lat1
fabs.s MDRw  Fc63 : 0 opFA, FcRdoe, MDRwce, abs
fabs.d Ff l 3  Fc68 : 64 Ffl3ce, lat1
fc.s, MAR  031 f cc opFA, opFB, ftest, fccDdoe, MARce
fc.d Ff l 3 MDRw  Fc FcRdoe, MDRwce, Ffl3ce, lat1

Table B.5: RTL instructions of the execute substage 2.0 for instructions
with a latency of at least 3 cycles.

state RTL instruction control signals


fdiv.s Fa21 Fb21 nan21 lat17, normal
fdiv.d  FPunp FA FB lat21, normal, dbs
fmul.s lat5, normal
fmul.d lat5, normal, dbs
fadd.s lat5
fadd.d lat5, dbs
fsub.s lat5, sub
fsub.d lat5, sub, dbs
cvt.s.d Fr  lat3, FvFrdoe, Frce
cvt.s.i Cvt FPunp FA FB lat3, FvFrdoe, Frce, normal
cvt.d.s lat3, FvFrdoe, Frce, dbs
cvt.d.i lat3, FvFrdoe, Frce, dbs, normal
cvt.i.s Fr  FXunp FA FB lat3, FuFrdoe, Frce
cvt.i.d lat3, FuFrdoe, Frce

Table B.6: RTL instructions of the iterative division for stages 2.0.1 to
2.2 (single precision). In case of double precision (suffix ‘.d’), an
additional control signal dbr is required in each state. A multiplication
always takes two cycles. Since the intermediate result is always held in
registers s and c, we only list the effect of the multiplication as a whole.

state RTL instruction control signals


lookup x  table fb  xce, tlu, fbbdoe
newton1.s A  appr 2  x  b 57 xadoe, fbbdoe
newton2.s Ace
newton3.s x  A  x57 Aadoe, xbdoe,
sce, cce
newton4.s xce
quotient1.s E  a  xp1 faadoe, xbdoe,
sce, cce
quotient2.s Da  fa Db  fb Dce, faadoe,
fbbdoe, Ece
quotient3.s Eb  E  fb Eadoe, fbbdoe,
sce, cce
quotient4.s sq eq   SigExpMD Fa21 Fb21 sqce, eqce,
f lq  SpecMD Fa21 Fb21 nan21 ebce, flqce
select fd.s E  E  2  p1,


β  fa Eb  2  p1  fb


 E  2  p2 ; if β  0


fd  E
 E  2  p2 ;; ifif ββ 

0
0 fdiv,

Fr  f lq sq eq fd  FqFrdoe, Frce

functions FPrd1() and FPrd2() as specified in section 8.4. The fixed
point rounder FXRD (page 427) also consists of two stages. They compute
the functions denoted by FXrd1() and FXrd2().

&    5

Table B.9 lists the RTL instructions which the FDLX design performs in
stage M. In addition, stage M updates the buffers as follows:

IR4 Cad 4 Sad 4 Fad 4 : IR3 Cad 3 Sad 3 Fad 3
PC4 DPC4 DDPC4 : PC3 DPC3 DDPC3
Table B.7: RTL instructions of the substages 2.1 and 2.2, except for the
divisions.

state RTL instruction control signals
Mul1.s sq eq  SigExpMD Fa21 Fb21 sqce, eqce,
Mul1.d f lq  SpecMD Fa21 Fb21 nan21 flqce, sce, cce,
s c  mul1 Fa21 Fb21 faadoe, fbbdoe
Add1.s ASr  AS1 Fa21 Fb21 nan21 ASrce
Add1.d
Sub1.s ASr  AS1 Fa21 Fb21 nan21 ASrce, sub
Sub1.d
Mul2.s f q  mul2 s c
Mul2.d Fr  f lq sq eq f q FqFrdoe, Frce
SigAdd.s Fr  AS2 ASr FsFrdoe, Frce
SigAdd.d

Table B.8: RTL instructions of the substages 2.3 and 2.4


state RTL instruction control signals
rd1.s FPrdR  FPrd1 Fr FPrdRce
rd1.d FPrdRce, dbr
rd1.i FXrdR  FXrd1 Fr FXrdRce
rd2.s F f l 3 MDRw  FRrd2 FPrdR FpRdoe, MDRwce, Ffl3ce
rd2.d like rd2.s, dbr
rd2.i F f l 3 MDRw  FXrd2 FXrdR FxRdoe, MDRwce, Ffl3ce

Table B.9: RTL instructions of the memory stage M.


state RTL instruction control signals
load, load.s MDRr  DMdword MAR31 : 3000 Dmr, DMRrce
load.d C4  MAR C4ce
store m  bytes MDRw C4  MAR Dmw, C4ce
ms2iM, noM, C4  MAR C4ce
mi2iM, passC
Marith.[s, d], FC4  MDRw FC4ce,
Mmv.[s, d] F f l 4  F f l 3 Ffl4ce
fcM FC4  MDRw C4  MAR FC4ce, C4ce,
F f l 4  F f l 3 Ffl4ce

Table B.10: RTL instructions of the write back stage WB

state RTL instruction control signals
sh4l GPRCad 4  GPRw, load.4
sh4l MDs MAR1 : 0000
sh4l.s FPRFad 4  MDs FPRw, load.4
sh4l.d FDRFad 4  MDRr FPRw, load.4, dbr.4
wb GPRCad 4  C4 GPRw
mi2sW SPRSad 4  C4 SPRw
fcWB like mi2sW, SPRw,
IEEE f  IEEE f  F f l 4 fop.4
WBs FPR Fad 4  FC 31 : 0 FPRw
flagWBs like WBs, FPRw,
IEEE f  IEEE f  F f l 4 fop.4
WBd FDR Fad 4  FC FPRw, dbr.4
flagWBd like WBd, FPRw,
IEEE f  IEEE f  F f l 4 fop.4
noWB (no update)

'    ;

Table B.10 lists the RTL instructions which the FDLX design processes in
stage WB, given that no unmasked interrupt occurred. In case of a JISR,
the FDLX design performs the same actions as the DLXΠ design
(chapter 5).

B.2 Control Automata of the FDLX Design

THE CONTROL automaton is constructed as in the fixed point DLX
designs. The control is modeled by an FSD which is then turned into
precomputed control.

The control signals of stage IF are always active.

The control signals of stage ID are generated in every cycle; they
only depend on the current instruction word.

The control signals of the remaining stages are precomputed during
ID by a Moore automaton.
Table B.11: Disjunctive normal forms of the Mealy automaton of stage ID

signal    IR[31:26]   IR[5:0]   Fmt   length   comment
Rtype     000000      ******    ***      6
shiftI    000000      0000*0    ***     11
          000000      00001*    ***     11
Jlink     010111      ******    ***      6
          000011      ******    ***      6
jumpR     01011*      ******    ***      5
jump      00001*      ******    ***      5
          01011*      ******    ***      5
rfe.1     111111      ******    ***      6
Jimm      00001*      ******    ***      5
          111110      ******    ***      6
branch    00010*      ******    ***      5
bzero     *****0      ******    ***      1
fbranch   00011*      ******    ***      5
fc        010001      11****    ***      8
FRtype    010001      11****    001     11     fc.d
          010001      000***    001     12     farith.d
          010001      001000    001     15     fmov.d
          010001      100001    ***     12     cvt.d
          111101      ******    ***      6     store.d
uFOP      010001      00011*    ***     11     fsqt, frem
accumulated length of the monomials    147

  -         .+

According to table B.2, the clock request signals of stage ID are indepen-
dent of the instruction. Like in stage IF, they are always active. Thus,
the control automaton of stage ID only needs to generate the remaining
13 control signals. Since they depend on the current instruction word, a
Mealy automaton is used.

Table B.11 lists the disjunctive normal form for each of these signals.
The parameters of the ID control automaton are listed in table B.16 on
page 539.
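A disjunctive normal form of this kind can be evaluated by simple pattern matching. The following Python sketch is an illustration with hypothetical names: each monomial is written as in table B.11 (opcode bits IR[31:26], function bits IR[5:0], format bits Fmt, with `*` for a don't-care), a signal is active if at least one of its monomials matches, and the length of a monomial is its number of literals. Only a few of the signals are included.

```python
# Monomials copied from table B.11: "IR[31:26] IR[5:0] Fmt", '*' = don't care.
DNF = {
    "Rtype": ["000000 ****** ***"],
    "shiftI": ["000000 0000*0 ***", "000000 00001* ***"],
    "jumpR": ["01011* ****** ***"],
    "fc": ["010001 11**** ***"],
}


def monomial_length(monomial: str) -> int:
    """Number of literals, i.e. of bit positions that are not don't-cares."""
    return sum(ch in "01" for ch in monomial)


def monomial_matches(monomial: str, opcode: int, function: int, fmt: int) -> bool:
    """Check whether the instruction bits satisfy one monomial."""
    pattern = monomial.replace(" ", "")
    word = f"{opcode:06b}{function:06b}{fmt:03b}"
    return all(p == "*" or p == b for p, b in zip(pattern, word))


def signal_active(signal: str, opcode: int, function: int = 0, fmt: int = 0) -> bool:
    """A signal is active if any monomial of its DNF matches."""
    return any(monomial_matches(m, opcode, function, fmt) for m in DNF[signal])


# srai (opcode hx00, function hx03) is an immediate shift; sll (function hx04) is not.
print(signal_active("shiftI", 0x00, 0x03), signal_active("shiftI", 0x00, 0x04))
print([monomial_length(m) for m in DNF["shiftI"]])
```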
Table B.12: Type x0 control signals to be precomputed during stage ID
(part 1)

signals states of stage 2.0
lat1 alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
trap, mi2s, noEX, ill, ms2i, rfe, addrL, addrS, addrL.s,
addrL.d, addrSf, mf2i, mi2f, fmov.s, fmov.d, fneg.s,
fneg.d, fabs.s, fabs.d, fc.s, fc.d
lat3 cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s, cvt.i.d
lat5 fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d
lat17 fdiv.s
lat21 fdiv.d
opA alu, aluo, aluI, aluIo, test, testI, shift, shiftI, mi2s,
noEX, ill, addrL, addrS, addrL.s, addrL.d, addrSf
opB alu, aluo, test, shift, addrS, mi2f
opFA fmov.s, fmov.d, fneg.s, fneg.d, fabs.s, fabs.d, fc.s,
fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s, cvt.i.d,
fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d, fdiv.s,
fdiv.d
opFB addrSf, fc.s, fc.d, fmul.s, fmul.d, fadd.s, fadd.d,
fsub.s, fsub.d, fdiv.s, fdiv.d

Precomputed Control

As in the previous designs, only state decode has an outdegree greater
than one. Thus, the control signals of all the stages that follow can be
precomputed during decode using a Moore control automaton. The signals
are then buffered in an RSR; the RSR passes the signals down the pipeline
together with the instruction. Each stage consumes some of these control
signals. Therefore, the signals are classified according to the last stage in
which they are used. A signal of type x3, for example, is only used up to
stage 2.3, whereas a signal of type z is needed up to stage 4.
Tables B.12 to B.14 list, for each control signal, the states of stage
2.0 in which the signal must be active. There are some signals which are
always activated together, e.g., the signals Dmw, amuxsel and store. The
automaton only needs to generate one signal for each such group of signals.
According to table B.15, the majority of the precomputed control signals
is of type x0.
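The classification by last stage of use determines how long each precomputed signal must be carried along with its instruction. The sketch below illustrates this under a simplifying assumption: the stages are treated as a single linear sequence x.0, x.1, ..., x.4, y, z, and the variable latency of the RSR is ignored. The function name `live_signals` and the small example dictionary are hypothetical; the signal types are taken from tables B.12 to B.14, written in the x.0 to z notation of table B.15.

```python
# Signal types ordered by the last stage in which a signal is used.
STAGE_ORDER = ["x.0", "x.1", "x.2", "x.3", "x.4", "y", "z"]


def live_signals(signal_types, stage):
    """Signals that still have to be buffered when the instruction is in `stage`."""
    idx = STAGE_ORDER.index(stage)
    return {s for s, t in signal_types.items() if STAGE_ORDER.index(t) >= idx}


# One representative signal per type, as listed in tables B.12 to B.14.
example = {
    "bmuxsel": "x.0",
    "sub": "x.1",
    "fdiv": "x.2",
    "FxRdoe": "x.4",
    "store": "y",
    "GPRw": "z",
}

for stage in STAGE_ORDER:
    print(stage, sorted(live_signals(example, stage)))
```

The set shrinks monotonically from stage to stage, which is why the buffers of later stages can be narrower than those of stage 2.0.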
In circuit SIGFMD of the multiply divide unit, there is a total of six
tristate drivers connected to the operand busses opa and opb. The access
Table B.13 Type x0 control signals to be precomputed during stage ID (part 2)

signals states of stage 2.0
ALUDdoe alu, aluo, aluI, aluIo, test, testI, addrL, addrS, addrL.s,
addrL.d, addrSf
ADdoe mi2s, noEX, ill
SDdoe ms2i, rfe
SHDdoe shift, shiftI
linkDdoe savePC
coDdoe, trap trap
ftest, fccDdoe fc.s, fc.d
tfxDdoe mf2i
FcRdoe fabs.s, fabs.d, fneg.s, fneg.d, fc.s, fc.d
FuFrdoe cvt.s.i, cvt.s.d, cvt.d.s, cvt.d.i
FvFrdoe cvt.i.s, cvt.i.d
test test, testI
ovf? aluo, aluIo
add addrL, addrS, addrL.s, addrL.d, addrSf
bmuxsel alu, aluo, test, shift
Rtype alu, aluo, test, shift, shiftI
Ill ill
fstore addrSf
fmov fmov.s, fmov.d
abs fabs.s, fabs.d
normal fmul.s, fmul.d, fdiv.s, fdiv.d, cvt.s.i, cvt.d.i
dbs fmov.d, fneg.d, fabs.d, fc.d, cvt.d.s, cvt.d.i, fmul.d,
fadd.d, fsub.d, fdiv.d

to these busses is granted by the control signals

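$$\mathit{opaoe}[3:0] = (\mathit{faadoe},\ \mathit{Eadoe},\ \mathit{Aadoe},\ \mathit{xadoe})$$

$$\mathit{opboe}[1:0] = (\mathit{fbbdoe},\ \mathit{xbdoe})$$

Although multiplications only use two of these tristate drivers, the precomputed
control provides six enable signals:

$$\mathit{faadoe} = \mathit{fbbdoe} = \begin{cases} 1 & \text{if } I \in \{\text{fmul.s}, \text{fmul.d}\} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathit{Eadoe} = \mathit{Aadoe} = \mathit{xadoe} = \mathit{xbdoe} = 0$$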
Table B.14 Control signals of type x1 to z to be precomputed during stage ID

type  signals              states of stage 2.0
x.1   sub                  fsub.s, fsub.d
      faadoe, fbbdoe       fmul.s, fmul.d
x.2   fdiv                 fdiv.s, fdiv.d
      FqFrdoe              fmul.s, fmul.d, fdiv.s, fdiv.d
      FsFrdoe              fadd.s, fadd.d, fsub.s, fsub.d
x.4   Ffl3ce, MDRwce       addrS, addrSf, mi2f, fmov.s, fmov.d, fneg.s, fneg.d,
                           fabs.s, fabs.d, fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s,
                           cvt.d.i, cvt.i.s, cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d,
                           fsub.s, fsub.d, fdiv.s, fdiv.d
      FpRdoe               cvt.d.s, cvt.i.s, cvt.s.d, cvt.i.d, fadd.s, fadd.d, fsub.s,
                           fsub.d, fmul.s, fmul.d, fdiv.s, fdiv.d
      FxRdoe               cvt.s.i, cvt.d.i
y     amuxsel, Dmw, store  addrS, addrSf
      MARce, C4ce          alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
                           trap, mi2s, noEX, ill, ms2i, rfe, addrL, addrS, addrL.s,
                           addrL.d, addrSf, mf2i
      FC4ce, Ffl4ce        mi2f, fmov.s, fmov.d, fneg.s, fneg.d, fabs.s, fabs.d,
                           fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s,
                           cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d,
                           fdiv.s, fdiv.d
z     DMRrce, Dmr, load    addrL, addrL.s, addrL.d
      fop                  fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s,
                           cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d,
                           fdiv.s, fdiv.d
      dbr                  fmov.d, fneg.d, fabs.d, cvt.s.d, cvt.i.d, fmul.d,
                           fadd.d, fsub.d, fdiv.d
      SPRw                 mi2s, rfe, fc.s, fc.d
      GPRw                 alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
                           ms2i, addrL, addrL.s, addrL.d, mf2i
      FPRw                 addrL.s, addrL.d, mi2f, fmov.s, fmov.d, fneg.s,
                           fneg.d, fabs.s, fabs.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i,
                           cvt.i.s, cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s,
                           fsub.d, fdiv.s, fdiv.d

Table B.15 Types of the precomputed control signals

type    x.0  x.1  x.2  x.3  x.4  y  z
number   31    7    3    0    3  3  6

Table B.16 Parameters of the two control automata which govern the FDLXΠ
design. Automaton id generates the Mealy signals for stage ID; automaton ex
precomputes the Moore signals of the stages EX to WB.

      # states  # inputs  # and frequency of outputs
      k         σ         γ    νsum  νmax
id    1         15        13   21    5
ex    44        15        48   342   30

      fanin of the states  # and length of monomials
      fansum  fanmax       #M   lsum  lmax
id    –       –            21   147   15
ex    53      3            53   374   15

Except for divisions, the busses opa and opb are only used in stage 2.1.
Thus, together with signal sub (floating point subtraction), the FDLX design
requires 7 type x1 control signals.
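The precomputed values of the six bus enable signals can be written as a small function of the state of stage 2.0. This is only a sketch of the values stated above and in table B.14 (faadoe and fbbdoe are active exactly in states fmul.s and fmul.d, the other four enables are precomputed as 0); the function name `operand_bus_enables` is hypothetical, and how the remaining drivers are enabled during divisions is not modelled here.

```python
def operand_bus_enables(state):
    """Precomputed enables of the tristate drivers on opa and opb.

    Returns (opaoe, opboe) with
      opaoe = (faadoe, Eadoe, Aadoe, xadoe)
      opboe = (fbbdoe, xbdoe)
    """
    mul = int(state in {"fmul.s", "fmul.d"})
    opaoe = (mul, 0, 0, 0)
    opboe = (mul, 0)
    return opaoe, opboe


# Six enable signals plus signal sub give the seven type x1 signals.
opaoe, opboe = operand_bus_enables("fmul.d")
x1_signals = len(opaoe) + len(opboe) + 1
print(opaoe, opboe, x1_signals)
```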
Tables B.17 and B.18 list the disjunctive normal forms for the automaton
which controls the stages EX to WB. The parameters of this Moore
automaton are summarized in table B.16.

Table B.17 Disjunctive normal forms of the precomputed control which governs
stages EX to WB (part 1)

state IR[31:26] IR[5:0] Fmt length
alu 000000 1001** *** 10
000000 100**1 *** 10
aluo 000000 1000*0 *** 11
aluI 0011** ****** *** 4
001**1 ****** *** 4
aluIo 0010*0 ****** *** 5
shift 000000 0001*0 *** 11
000000 00011* *** 11
shiftI 000000 0000*0 *** 11
000000 00001* *** 11
test 000000 101*** *** 9
testI 011*** ****** *** 3
savePC 010111 ****** *** 6
000011 ****** *** 6
addrS 10100* ****** *** 5
1010*1 ****** *** 5
addrL 100*0* ****** *** 4
1000*1 ****** *** 5
10000* ****** *** 5
mi2s 000000 010001 *** 12
ms2i 000000 010000 *** 12
trap 111110 ****** *** 6
rfe 111111 ****** *** 6
noEX 0001** ****** *** 4
000010 ****** *** 6
010110 ****** *** 6
accumulated length of the monomials 178

Table B.18 Disjunctive normal forms used by the precomputed control (part 2)

state IR[31:26] IR[5:0] Fmt length
addrL.s 110001 ****** *** 6
addrL.d 110101 ****** *** 6
addrSf 111*01 ****** *** 5
fc.s 010001 11**** 000 11
fc.d 010001 11**** 001 11
mf2i 010001 001001 *** 12
mi2f 010001 001010 *** 12
fmov.s 010001 001000 000 15
fmov.d 010001 001000 001 15
fadd.s 010001 000000 000 15
fadd.d 010001 000000 001 15
fsub.s 010001 000001 000 15
fsub.d 010001 000001 001 15
fmul.s 010001 000010 000 15
fmul.d 010001 000010 001 15
fdiv.s 010001 000011 000 15
fdiv.d 010001 000011 001 15
fneg.s 010001 000100 000 15
fneg.d 010001 000100 001 15
fabs.s 010001 000101 000 15
fabs.d 010001 000101 001 15
cvt.s.d 010001 010000 001 15
cvt.s.i 010001 010000 100 15
cvt.d.s 010001 010001 000 15
cvt.d.i 010001 010001 100 15
cvt.i.s 010001 010100 000 15
cvt.i.d 010001 010100 001 15
accumulated length of the monomials 196
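Read as a decoder, the disjunctive normal forms of the precomputed control map an instruction word to a state of stage 2.0: a state is selected if at least one of its monomials matches the fields IR[31:26], IR[5:0] and Fmt. The following sketch illustrates this with a hand-picked subset of the rows above; the names `DNF`, `matches` and `decode` are hypothetical, and a complete decoder would have to include every row of both tables.

```python
# Subset of the monomials: state -> list of (IR[31:26], IR[5:0], Fmt).
DNF = {
    "alu": [("000000", "1001**", "***"), ("000000", "100**1", "***")],
    "aluo": [("000000", "1000*0", "***")],
    "addrL.s": [("110001", "******", "***")],
    "fadd.s": [("010001", "000000", "000")],
    "fadd.d": [("010001", "000000", "001")],
    "cvt.i.d": [("010001", "010100", "001")],
}


def matches(pattern, bits):
    """True if the bit string satisfies the pattern ('*' is a don't care)."""
    return len(pattern) == len(bits) and all(
        p in ("*", b) for p, b in zip(pattern, bits)
    )


def decode(opcode, funct, fmt):
    """Return all states whose DNF contains a monomial matching the fields."""
    return sorted(
        state
        for state, monomials in DNF.items()
        if any(
            matches(p_op, opcode) and matches(p_fn, funct) and matches(p_fmt, fmt)
            for p_op, p_fn, p_fmt in monomials
        )
    )


print(decode("010001", "000000", "001"))
print(decode("000000", "100000", "000"))
```

For a well-formed specification each instruction word selects at most one state, so the returned list has at most one element.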

Bibliography

[AA93] D. Alpert and D. Avnon. Architecture of the Pentium microprocessor.
IEEE Micro, 13(3):11–21, 1993.
[AEGP67] S.F. Anderson, J.G. Earle, R.E. Goldschmidt, and D.M. Powers. The
IBM system 360 model 91: Floating-point unit. IBM Journal of Research
and Development, 11:34–53, January 1967.
[AT97] H. Al-Twaijry. Area and Performance Optimized CMOS Multipliers.
PhD thesis, Stanford University, August 1997.
[BD94] J.R. Burch and D.L. Dill. Automatic verification of pipelined micropro-
cessor control. In Proc. International Conference on Computer Aided
Verification, 1994.
[BM96] E. Börger and S. Mazzanti. A practical method for rigorously con-
trollable hardware design. In J.P. Bowen, Hinchey M.B., and D. Till,
editors, ZUM’97: The Z Formal Specification Notation, volume 1212
of LNCS, pages 151–187. Springer, 1996.
[BS90] M. Bickford and M. Srivas. Verification of a pipelined microproces-
sor using Clio. In M. Leeser and G. Brown, editors, Proc. Mathemat-
ical Sciences Institute Workshop on Hardware Specification, Verifica-
tion and Synthesis: Mathematical Aspects, volume 408 of LNCS, pages
307–332. Springer, 1990.
[CO76] W. Chu and H. Opderbeck. Program behaviour and the page-fault-
frequency replacement algorithm. Computer, 9, 1976.
[CRSS94] D. Cyrluk, S. Rajan, N. Shankar, and M. K. Srivas. Effective theorem
proving for hardware verification. In 2nd International Conference on
Theorem Provers in Circuit Design, 1994.
[Del97] P. Dell. Run time simulation of a DLX processor with stall engine. Lab-
oratory Project, University of Saarland, Computer Science Department,
Germany, 1997.
[Den68] P.J. Denning. The working set model for program behavior. Communi-
cations of the ACM, 11(5):323–333, 1968.
[Den80] P.J. Denning. Working sets past and present. IEEE Transactions on
Software Engineering, 6(1):64–84, 1980.
[EP97] G. Even and W.J. Paul. On the design of IEEE compliant floating point
units. In Proc. 13th IEEE Symposium on Computer Arithmetic, pages
54–63. IEEE Computer Society, 1997.
[ERP95] J.H. Edmondson, P. Rubinfeld, and R. Preston. Superscalar instruction
execution in the 21164 Alpha microprocessor. IEEE Micro, 15(2):33–
43, 1995.
[FS89] D.L. Fowler and J.E. Smith. An accurate, high speed implementation
of division by reciprocal approximation. In Proc. 9th Symposium on
Computer Arithmetic, pages 60–67, 1989.
[GHPS93] J.D. Gee, M.D. Hill, D.N. Pnevmatikatos, and A.J. Smith. Cache
performance of the SPEC92 benchmark suite. IEEE Micro, 13(4):17–27, 1993.
[Han93] J. Handy. The Cache Memory Book. Academic Press, Inc., 1993.
[Hew94] Hewlett Packard. PA-RISC 1.1 Architecture Reference Manual, 1994.
[Hil87] M.D. Hill. Aspects of Cache Memory and Instruction Buffer Performance.
PhD thesis, Computer Science Division (EECS), UC Berkeley,
CA 94720, 1987.
[Hil95] M. Hill. SPEC92 Traces for MIPS R2000/3000. University of Wiscon-
sin, ftp://ftp.cs.newcastle.edu.au/pub/r3000-traces/din, 1995.
[HP90] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quanti-
tative Approach. Morgan Kaufmann Publishers, INC., San Mateo, CA,
1990.
[HP96] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quanti-
tative Approach. Morgan Kaufmann Publishers, INC., San Mateo, CA,
2nd edition, 1996.
[HQR98] T.A. Henzinger, S. Qadeer, and S.K. Rajamani. You assume, we guar-
antee: Methodology and case studies. In Proc. 10th International Con-
ference on Computer-aided Verification (CAV), 1998.
[Ins85] Institute of Electrical and Electronics Engineers. ANSI/IEEE standard
754–1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
for a readable account see the article by W.J. Cody et al. in the IEEE
MICRO Journal, Aug. 1984, 84–100.
[Int95] Intel Corporation. Pentium Processor Family Developer’s Manual, Vol.
1-3, 1995.
[Int96] Integrated Device Technology, Inc. IDT71B74: BiCMOS Static RAM
64K (8K x 8-Bit) Cache-Tag RAM, Data Sheet, August 1996.
[KH92] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
[KMP99a] D. Kroening, S.M. Mueller, and W.J. Paul. Proving the correctness of
processors with delayed branch using delayed PC. In Numbers, Infor-
mation and Complexity. Kluwer, 1999.
[KMP99b] D. Kroening, S.M. Mueller, and W.J. Paul. A rigorous correctness
proof of the Tomasulo scheduling algorithm with precise interrupts. In
Proc. SCI’99/ISAS’99 International Conference, 1999.
[Knu96] R. Knuth. Quantitative Analysis of Pipelined DLX Architectures (in
German). PhD thesis, University of Saarland, Computer Science De-
partment, Germany, 1996.
[Kor93] I. Koren. Computer Arithmetic Algorithms. Prentice-Hall International,
1993.
[KP95] J. Keller and W.J. Paul. Hardware Design, volume 15 of Teubner-Texte
zur Informatik. Teubner, 1995.
[KPM00] Daniel Kroening, Wolfgang J. Paul, and Silvia M. Mueller. Prov-
ing the correctness of pipelined micro-architectures. In ITG/GI/GMM-
Workshop Methoden und Beschreibungssprachen zur Modellierung und
Verifikation von Schaltungen und Systemen, to appear, 2000.
[Kro97] D. Kroening. Cache simulation for a 32-bit DLX processor on a SPEC
workload. Laboratory Project, University of Saarland, Computer Sci-
ence Department, Germany, 1997.
[Lei99] H. Leister. Quantitative Analysis of Precise Interrupt Mechanisms for
Processors with Out-Of-Order Execution. PhD thesis, University of
Saarland, Computer Science Department, Germany, 1999.
[LMW86] J. Loeckx, K. Mehlhorn, and R. Wilhelm. Grundlagen der Program-
miersprachen. Teubner Verlag, 1986.
[LO96] J. Levitt and K. Olukotun. A scalable formal verification methodol-
ogy for pipelined microprocessors. In 33rd Design Automation Confer-
ence (DAC’96), pages 558–563. Association for Computing Machinery,
1996.
[MP95] S.M. Mueller and W.J. Paul. The Complexity of Simple Computer Ar-
chitectures. Lecture Notes in Computer Science 995. Springer, 1995.
[MP96] S.M. Mueller and W.J. Paul. Making the original scoreboard mechanism
deadlock free. In Proc. 4th Israel Symposium on Theory of Computing
and Systems (ISTCS), pages 92–99. IEEE Computer Society, 1996.
[Ng92] R. Ng. Fast computer memories. IEEE Spectrum, pages 36–39, Oct
1992.
[Omo94] A.R. Omondi. Computer Arithmetic Systems; Algorithms, Architec-
ture and Implementations. Series in Computer Science. Prentice-Hall
International, 1994.
[PH94] D.A. Patterson and J.L. Hennessy. Computer Organization and Design:
The Hardware/Software Interface. Morgan Kaufmann Publishers, INC., San Mateo, CA, 1994.
[Prz90] S.A. Przybylski. Cache and Memory Hierarchy Design. Morgan Kaufmann
Publishers, Inc., 1990.
[PS98] W.J. Paul and P.-M. Seidel. On the complexity of Booth recoding. In
Proc. 3rd Conference on Real Numbers and Computers (RNC3), pages
199–218, 1998.
[Rus] D. Russinoff. A mechanically checked proof of IEEE com-
pliance of a register-transfer-level specification of the AMD K7
floating-point division and square root instructions. Available at
http://www.onr.com/user/russ/david/k7-div-sqrt.html.
[Sei00] P.-M. Seidel. The Design of IEEE Compliant Floating-point Units and
their Quantitative Analysis. PhD thesis, University of Saarland, Com-
puter Science Department, Germany, 2000.
[SGGH91] J.B. Saxe, S.J. Garland, J.V. Guttag, and J.J. Horning. Using trans-
formations and verification in circuit design. Technical report, Digital
Systems Research Center, 1991.
[SP88] J.E. Smith and A.R. Pleszkun. Implementing precise interrupts in
pipelined processors. IEEE Transactions on Computers, 37(5):562–
573, 1988.
[Spa76] O. Spaniol. Arithmetik in Rechenanlagen. Teubner, 1976.
[Spa91] U. Sparmann. Structure Based Test Methods for Arithmetic Circuits
(in German). PhD thesis, University of Saarland, Computer Science
Department, 1991.
[SPA92] SPARC International Inc. The SPARC Architecture Manual. Prentice
Hall, 1992.
[Sta] Standard Performance Evaluation Corporation. SPEC Benchmark
Suite. http://www.specbench.org/.
[Sun92] Sun Microsystems Computer Corporation, Mountain View, CA. The
SuperSPARC Microprocessor: Technical White Paper, 1992.
[Tho70] J.E. Thornton. Design of a Computer: The Control Data 6600. Scott
Foresman, Glenview, Ill, 1970.
[Tom67] R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic
units. In IBM Journal of Research and Development, volume 11
(1), pages 25–33. IBM, 1967.
[Weg87] I. Wegener. The Complexity of Boolean Functions. John Wiley & Sons,
1987.
[WF82] S. Waser and M.J. Flynn. Introduction to Arithmetic for Digital Systems
Designers. CBS College Publishing, 1982.
[Win93] G. Winskel. The Formal Semantics of Programming Languages; An
Introduction. Foundations of Computing Series. MIT Press, 1993.
[Win95] P.J. Windley. Formal modeling and verification of microprocessors.
IEEE Transactions on Computers, 44(1):54–72, 1995.
[WS94] S. Weiss and J.E. Smith. Power and PowerPC. Morgan Kaufmann
Publishers, Inc., 1994.

