L07 BS1 Motivation 2 Up
Everything
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
time/spins
Mixed Signal Problem 11%
Too Much Power 11%
Has Path(s) Too Slow 10%
Has Path(s) Too Fast 10%
IR Drop Issues 7%
Firmware Error 4%
Other 3%
[Chart: escalating project costs by design phase (Architecture, Verification, Physical) versus Silicon Feature Dimension (0.18µm, 0.13µm, 90nm); vertical axis 0 to 15. Source: IBM/IBS, Inc.]
Common quotes
“Design is not a problem; design is easy”
“Verification is a problem”
“Timing closure is a problem”
“Physical design is a problem”
… less than world class
Why is traditional RTL too low-level?
[Figure: a FIFO in the hardware. Signals: enq with n-bit DATA_IN, ENAB, and RDY (not full); deq with RDY (not empty); DATA_OUT with RDY (not empty).]
February 22, 2005 http://csg.csail.mit.edu/6.884/ L07-8
Requirements for correct use
Requirement 1: deq ENAB only when RDY (not empty)
Requirement 2: first DATA_OUT only when RDY (not empty)
Requirement 3: enq ENAB simultaneously with DATA_IN
Requirement 4: enq ENAB only when RDY (not full)
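The four requirements above can be modeled in software as guards on the FIFO's methods. The following is a minimal Python sketch, not hardware and not the lecture's code: the class name `GuardedFifo`, the exception `ProtocolError`, and the method names are assumptions made for illustration. Each method checks its RDY condition and refuses the call otherwise, which is the discipline a client must follow by hand in traditional RTL.

```python
from collections import deque


class ProtocolError(Exception):
    """Raised when a client breaks one of the FIFO usage requirements."""


class GuardedFifo:
    """Software model of a bounded FIFO with RDY-guarded methods (hypothetical names)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def enq_rdy(self):
        # RDY for enq: not full
        return len(self.items) < self.depth

    def deq_rdy(self):
        # RDY for deq and first: not empty
        return len(self.items) > 0

    def enq(self, data):
        # Requirement 3: the data is an argument of the call, so the enable
        # and DATA_IN always arrive together.
        # Requirement 4: enq only when RDY (not full).
        if not self.enq_rdy():
            raise ProtocolError("enq while full")
        self.items.append(data)

    def first(self):
        # Requirement 2: DATA_OUT is meaningful only when RDY (not empty).
        if not self.deq_rdy():
            raise ProtocolError("first while empty")
        return self.items[0]

    def deq(self):
        # Requirement 1: deq only when RDY (not empty).
        if not self.deq_rdy():
            raise ProtocolError("deq while empty")
        self.items.popleft()
```

A client that calls `enq` on a full `GuardedFifo` gets a `ProtocolError` instead of silently corrupting state; in hand-written RTL nothing enforces that check.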
[Figure: clients attached to the FIFO ports. enq: n-bit DATA_IN, ENAB, RDY (not full). deq: ENAB, RDY (not empty). first: n-bit DATA_OUT, RDY (not empty).]
[Figure: the same FIFO ports shared by client 1 and client 2, with control logic between the clients and the FIFO.]
Concurrent uses of a FIFO
enq ENAB ok if deq ENAB, even if not RDY ??
[Figure: client 1 and client 2 attached to the FIFO. enq: n-bit DATA_IN, ENAB, RDY (not full). deq: ENAB, RDY (not empty). first: n-bit DATA_OUT, RDY (not empty).]
[Figure: FIFO block with ports data_in, data_out, push_req_n, pop_req_n, full, empty, clk, rstn]
These constraints are taken from several paragraphs of documentation, spread over many pages, interspersed with other text
A High-Bandwidth Credit-based Communication Interface
Credit based interface:
[Figure: two I/F Control blocks, one holding Credit = C1 and the other Credit = C2, exchanging the message "You can have X credits"]
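The credit idea can be sketched in software as follows. This is an illustrative Python model under stated assumptions, not the interface from the lecture: the names `Sender`, `Receiver`, `try_send`, `drain`, and `add_credits` are hypothetical, and one credit is assumed to stand for one free buffer slot at the receiver.

```python
from collections import deque


class Receiver:
    """Owns a bounded buffer; each free slot corresponds to one credit."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def accept(self, item):
        # A well-behaved sender never sends without a credit,
        # so the buffer can never overflow.
        assert len(self.buffer) < self.slots, "sender overran the buffer"
        self.buffer.append(item)

    def drain(self, count):
        """Consume up to `count` items and return the number of freed slots."""
        freed = 0
        while self.buffer and freed < count:
            self.buffer.popleft()
            freed += 1
        return freed


class Sender:
    """Sends only while it holds credits granted by the receiver."""

    def __init__(self, credits):
        self.credits = credits

    def try_send(self, receiver, item):
        if self.credits == 0:
            return False          # no credit: must wait
        self.credits -= 1
        receiver.accept(item)
        return True

    def add_credits(self, count):
        # Models the "You can have X credits" message.
        self.credits += count
```

In use, the receiver grants its buffer size as the initial credit count, and every `drain` result is passed back through `add_credits`. The sender never needs to see the receiver's buffer occupancy directly.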
In Bluespec SystemVerilog (BSV) …
Power to express complex static
structures and constraints
Checked by the compiler
“Micro-protocols” are managed by the
compiler
The compiler generates the necessary
hardware (muxing and control)
Micro-protocols need less or no verification
Easier to make changes while
preserving correctness
Bluespec Tool flow
[Figure: tool flow from Bluespec SystemVerilog source through the Bluespec Compiler, with Debussy for visualization. Legend: files, Bluespec tools, 3rd party tools.]
Programming with rules: A simple example
Euclid’s algorithm for computing the
Greatest Common Divisor (GCD):
15   6
 9   6   subtract
 3   6   subtract
 6   3   swap
 3   3   subtract
 0   3   subtract
answer: 3
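The trace above can be reproduced with two guarded rules, subtract and swap, applied until neither is enabled. This is a small Python sketch of that idea, not the BSV module: the function name `gcd_by_rules` and the orientation (reduce the first value, swap when it is smaller than the second) are chosen to match the trace, and positive inputs are assumed.

```python
def gcd_by_rules(a, b):
    """Run the subtract/swap rules and return (answer, trace).

    Each trace entry is (first value, second value, rule that produced it).
    """
    if a <= 0 or b <= 0:
        raise ValueError("this sketch assumes positive inputs")
    trace = [(a, b, "start")]
    while a != 0:                 # no rule is enabled once a reaches 0
        if a >= b:                # guard of the subtract rule
            a = a - b
            trace.append((a, b, "subtract"))
        else:                     # guard of the swap rule: a < b
            a, b = b, a
            trace.append((a, b, "swap"))
    return b, trace


if __name__ == "__main__":
    answer, trace = gcd_by_rules(15, 6)
    for first, second, rule in trace:
        print(first, second, rule)
    print("answer:", answer)      # answer: 3
```

The two guards are mutually exclusive, so at most one rule can fire in any state, and the loop simply stands in for a scheduler that keeps firing whichever rule is enabled.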
GCD in BSV
module mkGCD (ArithIO#(int));
   Reg#(int) x <- mkRegU;      // State
   Reg#(int) y <- mkReg(0);    // State
GCD Hardware Module
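[Figure: GCD module. Method start takes two inputs of type t and an enab signal, with rdy = (y == 0). Method result returns a value of type t, with rdy = (y == 0). The rdy signals are the implicit conditions.]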
Exploring microarchitectures
IP Lookup Module
Exit functions
Sparse tree representation
[Figure: sparse tree in which each level is indexed by one byte (0 to 255) of the IP address; an entry holds either a result (A to F) or a pointer to the next level]

Prefix          Result
7.14.*.*        A
7.14.7.3        B
10.18.200.*     C
10.18.200.5     D
5.*.*.*         E
*               F

IP address      Result   M Ref
7.13.7.3        F        2
10.18.201.5     F        3
7.14.7.2        A        4
5.13.7.2        E        1
10.18.200.7     C        4

Real-world lookup algorithms are more complex but all make a sequence of dependent memory references.
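The lookups in the table can be checked with a small software model of the sparse tree. This is a Python sketch under assumptions, not the lecture's hardware design: the names `Node`, `build`, and `lookup` are hypothetical, each tree level is indexed by one byte of the address, and every entry that is not stored explicitly inherits the result of the longest shorter prefix. Each visited level counts as one memory reference.

```python
class Node:
    """One tree level: explicit entries plus an inherited default result."""

    def __init__(self, default):
        self.default = default
        self.entries = {}         # byte -> result string or child Node


def parse_prefix(text):
    """'7.14.*.*' -> [7, 14]; '*' -> []"""
    return [int(part) for part in text.split(".") if part != "*"]


def build(table):
    root = Node(default=None)
    # Insert shorter prefixes first so longer ones inherit their results.
    for prefix in sorted(table, key=lambda p: len(parse_prefix(p))):
        result = table[prefix]
        octets = parse_prefix(prefix)
        if not octets:
            root.default = result
            continue
        node = root
        for octet in octets[:-1]:
            entry = node.entries.get(octet, node.default)
            if not isinstance(entry, Node):
                entry = Node(default=entry)   # push the old result down a level
                node.entries[octet] = entry
            node = entry
        node.entries[octets[-1]] = result
    return root


def lookup(root, address):
    """Return (result, number of dependent memory references)."""
    node, refs = root, 0
    for octet in (int(part) for part in address.split(".")):
        refs += 1
        entry = node.entries.get(octet, node.default)
        if not isinstance(entry, Node):
            return entry, refs
        node = entry
    return node.default, refs


TABLE = {
    "7.14.*.*": "A", "7.14.7.3": "B", "10.18.200.*": "C",
    "10.18.200.5": "D", "5.*.*.*": "E", "*": "F",
}
```

With `root = build(TABLE)`, `lookup(root, "7.14.7.2")` returns `("A", 4)`, matching the table row. Each reference depends on the result of the previous one, which is what makes pipelining the lookup interesting.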
Longest Prefix Match for IP lookup: 3 possible implementation architectures
Synthesis results
LPM versions    Code size (lines)   Best Area (gates)    Best Speed (ns)     Mem. util. (random workload)
Static V        220                 2271                 3.56                63.5%
Static BSV      179                 2391 (5% larger)     3.32 (7% faster)    63.5%
Circular BSV    257                 8170 (1% larger)     3.67 (2% slower)    99.9%
Implementations of the same arch - Static pipeline: Two designers, two results
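LPM versions             Best Area (gates)    Best Speed (ns)
Static V (Replicated)    8898                 3.60

[Figure: two implementations of the static pipeline. Replicated: the IP addr passes through a MUX to several FSMs, each packet is processed by one FSM, and a MUX / De-MUX sits between the FSMs and the RAM before the result is returned. BEST: IP addr and result pass through a MUX, with a Shared FSM, a Counter, and a MUX / De-MUX in front of the RAM.]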
Reorder Buffer
Verification-centric design
Example from CPU design
Speculative, out-of-order
Many, many concurrent activities
[Figure: Fetch, Decode, Re-Order Buffer (ROB), Register File, ALU Unit, MEM Unit, Branch, Instruction Memory, and Data Memory, connected by FIFOs]
Nirav Dave, MEMOCODE, 2004
ROB actions
ROB entry states: Empty (E), Waiting (W), Dispatched (Di), Killed (K), Done (Do)

[Figure: the Re-Order Buffer and the units acting on it. Register File: get operands for instr, writeback results. ALU Unit(s): get a ready ALU instr. MEM Unit(s): get a ready MEM instr, put MEM instr results in ROB. Resolve branches.]

Re-Order Buffer
        State   Instruction   Operand 1   Operand 2   Result
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
Head    W       Instr A       V 0         V 0         -
        W       Instr B       V 0         V 0         -
        W       Instr C       V 0         V 0         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
        E       Instr -       V -         V -         -
But, what about all the potential race conditions?
Reading from the register file at the same
time a separate instruction is writing back to
the same location
Which value to read?
An instruction is being inserted into the ROB
simultaneously with a dependent upstream
instruction’s result coming back from an ALU
Put a tag or the value in the operand slot?
An instruction is being inserted into the ROB
simultaneously with a branch mis-prediction
Must kill the mis-predicted instructions and
restore a “consistent state” across many
modules
Rule Atomicity
Lets you code each operation in isolation
Eliminates the nightmare of race conditions
(“inconsistent state”) under such complex
concurrency conditions
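The "tag or value?" question from the race-condition list can be used to illustrate this. The sketch below is a toy Python model, not the ROB from the lecture: the state layout, the rule names `insert_rule` and `writeback_rule`, the tags "A" and "B", and the value 42 are all assumptions for illustration. Each rule runs to completion on the shared state, so the two "simultaneous" events are explained by one of two sequential orders, and both orders leave the consumer's operand slot holding the value.

```python
def initial_state():
    return {
        "regfile": {"r1": 10},
        "pending": {"r1": "A"},   # r1 will be written by in-flight instruction A
        "rob": {},                # consumer tag -> ("tag", t) or ("val", v)
    }


def insert_rule(state, tag, src):
    """Atomic rule: insert an instruction that reads register `src`."""
    if src in state["pending"]:
        # Producer still in flight: record its tag in the operand slot.
        state["rob"][tag] = ("tag", state["pending"][src])
    else:
        state["rob"][tag] = ("val", state["regfile"][src])


def writeback_rule(state, tag, dest, value):
    """Atomic rule: the result of producer `tag` comes back from the ALU."""
    state["regfile"][dest] = value
    if state["pending"].get(dest) == tag:
        del state["pending"][dest]
    # Forward the value to every operand slot waiting on this tag.
    for slot in list(state["rob"]):
        if state["rob"][slot] == ("tag", tag):
            state["rob"][slot] = ("val", value)


def run(order):
    """Fire the two rules in the given order; return consumer B's operand."""
    state = initial_state()
    rules = {
        "insert": lambda: insert_rule(state, "B", "r1"),
        "writeback": lambda: writeback_rule(state, "A", "r1", 42),
    }
    for name in order:
        rules[name]()
    return state["rob"]["B"]
```

Both `run(["insert", "writeback"])` and `run(["writeback", "insert"])` return `("val", 42)`. Neither rule had to be written with the other in mind, which is the point of coding each operation in isolation.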
All behaviors are explainable as a sequence of atomic actions on the state

Insert Instr in ROB
• Put instruction in first available slot
• Increment tail pointer
• Get source operands - RF <or> prev instr
Dispatch Instr
• Mark instruction dispatched
• Forward to appropriate unit
Write Back Results to ROB
• Write back results to instr result
• Write back to all waiting tags
• Set to done
Commit Instr
• Write results to register file (or allow memory write for store)
• Set to Empty
• Increment head pointer
Branch Resolution
• …
• …
• …
Synthesizable model of IA64
CMU-Intel collaboration
Develop an Itanium µarch model that is
concise and malleable
executable and synthesizable
FPGA Prototyping
XC2V6000 FPGA interfaced to P6 memory bus
Executes binaries natively against a real PC
environment (i.e., memory & I/O devices)
An evaluation vehicle for:
Functionality and performance: a fast µarchitecture
emulator to run real software
Implementation: a synthesizable description to
assess feasibility, design complexity and
implementation cost
Roland Wunderlich & James Hoe @ CMU
Steve Hynal(SCL) & Shih-Lien Liu(MRL)
[Figure: pipeline diagram with Branch Pred., Instr. Cache, and Bypass. Integer×3 pipe: Stack, Read, Execute, Write. Memory pipe: Stack, Read, Execute, Memory, Write.]
Platform Capabilities
few months by one student! Large FPGA resources, the current design
occupies less than 30% of the FPGA resources