DR - Chao Tan, Carnegie Mellon University: Computer Organization Computer Architecture
• Characteristics of Multiprocessors
• Interconnection Structures
• Interprocessor Arbitration
• Interprocessor Communication/Synchronization
• Cache Coherence
• Computer Registers
• Computer Instructions
• Instruction Cycle
INTRODUCTION
• Every different processor type has its own design (different
registers, buses, microoperations, machine instructions, etc)
• A modern processor is a very complex device
• It contains
– Many registers
– Multiple arithmetic units, for both integer and floating point calculations
– The ability to pipeline several consecutive instructions to speed execution
– Etc.
• However, to understand how processors work, we will start with
a simplified processor model
• This is similar to what real processors were like ~25 years ago
• M. Morris Mano introduces a simple processor model he calls
the Basic Computer
• We will use this to introduce processor organization and the
relationship of the RTL model to the higher level computer
processor
[Figure: CPU connected to RAM. Memory words are 16 bits wide (bits 15 to 0), at addresses 0 to 4095.]
INSTRUCTIONS
• Program
– A sequence of (machine) instructions
• (Machine) Instruction
– A group of bits that tells the computer to perform a specific operation
(a sequence of micro-operations)
• The instructions of a program, along with any needed data
are stored in memory
• The CPU reads the next instruction from memory
• It is placed in an Instruction Register (IR)
• Control circuitry in control unit then translates the
instruction into the sequence of microoperations
necessary to implement it
INSTRUCTION FORMAT
• A computer instruction is often divided into two parts
– An opcode (Operation Code) that specifies the operation for that
instruction
– An address that specifies the registers and/or locations in memory to
use for that operation
• In the Basic Computer, since the memory contains 4096 (=
2^12) words, we need 12 bits to specify which memory
address this instruction will use
• In the Basic Computer, bit 15 of the instruction specifies
the addressing mode (0: direct addressing, 1: indirect
addressing)
• Since the memory words, and hence the instructions, are
16 bits long, that leaves 3 bits for the instruction’s opcode
Instruction Format
Bit 15   Bits 14-12   Bits 11-0
I        Opcode       Address
(I = addressing mode)
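The field layout above can be checked with a few shifts and masks. This is a minimal sketch, not part of the original slides; the function name `decode_instruction` and the sample word 0x9457 are chosen here purely for illustration.

```python
def decode_instruction(word: int) -> dict:
    """Split a 16-bit instruction word into I, opcode, and address."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    return {
        "I": (word >> 15) & 0x1,       # bit 15: 0 = direct, 1 = indirect
        "opcode": (word >> 12) & 0x7,  # bits 14-12: operation code
        "address": word & 0xFFF,       # bits 11-0: memory address
    }


if __name__ == "__main__":
    # 0x9457 = 1 001 0100 0101 0111: indirect, opcode 1, address 0x457
    print(decode_instruction(0x9457))
```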
ADDRESSING MODES
• The address field of an instruction can represent either
– Direct address: the address in memory of the data to use (the address of the
operand), or
– Indirect address: the address in memory of the address in memory of the data to
use
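Direct addressing: the instruction at location 22 is 0 ADD 457 (I = 0). The operand is at location 457 and is added to AC.
Indirect addressing: the instruction at location 35 is 1 ADD 300 (I = 1). Location 300 holds 1350, so the operand is at location 1350 and is added to AC.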
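The two cases differ only in how many memory reads are needed to reach the operand. The sketch below models memory as a Python dictionary; the operand values 11 and 22 are made up for illustration and do not come from the slides.

```python
def fetch_operand(memory: dict, address: int, indirect: bool) -> int:
    """Return the operand named by an address field."""
    if indirect:
        # Indirect: the addressed word holds the address of the operand.
        effective = memory[address] & 0xFFF
    else:
        # Direct: the address field is already the operand's address.
        effective = address
    return memory[effective]


if __name__ == "__main__":
    memory = {457: 11, 300: 1350, 1350: 22}  # hypothetical contents
    print(fetch_operand(memory, 457, indirect=False))  # one memory read
    print(fetch_operand(memory, 300, indirect=True))   # two memory reads
```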
PROCESSOR REGISTERS
• A processor has many registers to hold instructions,
addresses, data, etc
• The processor has a register, the Program Counter (PC) that
holds the memory address of the next instruction to get
– Since the memory in the Basic Computer only has 4096 locations, the PC
only needs 12 bits
• In direct or indirect addressing, the processor needs to keep
track of which location in memory it is addressing: the
Address Register (AR) is used for this
– The AR is a 12 bit register in the Basic Computer
• When an operand is found, using either direct or indirect
addressing, it is placed in the Data Register (DR). The
processor then uses this value as data for its operation
• The Basic Computer has a single general purpose register –
the Accumulator (AC)
PROCESSOR REGISTERS
• The significance of a general purpose register is that it can be
referred to in instructions
– e.g. load AC with the contents of a specific memory location; store the
contents of AC into a specified memory location
• Often a processor will need a scratch register to store
intermediate results or other temporary data; in the Basic
Computer this is the Temporary Register (TR)
• The Basic Computer uses a very simple model of input/output
(I/O) operations
– Input devices are considered to send 8 bits of character data to the processor
– The processor can send 8 bits of character data to output devices
• The Input Register (INPR) holds an 8-bit character received from an
input device
• The Output Register (OUTR) holds an 8-bit character to be sent
to an output device
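BASIC COMPUTER REGISTERS
Registers in the Basic Computer
[Figure: The CPU holds PC (bits 11-0), AR (bits 11-0), IR (bits 15-0), TR (bits 15-0), DR (bits 15-0), AC (bits 15-0), OUTR (bits 7-0), and INPR (bits 7-0), alongside a 4096 x 16 memory.]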
List of BC Registers
Symbol  Bits  Register name         Function
DR      16    Data Register         Holds memory operand
AR      12    Address Register      Holds address for memory
AC      16    Accumulator           Processor register
IR      16    Instruction Register  Holds instruction code
PC      12    Program Counter       Holds address of instruction
TR      16    Temporary Register    Holds temporary data
INPR    8     Input Register        Holds input character
OUTR    8     Output Register       Holds output character
Computer Organization Computer Architecture
Basic Computer Organization & Design 16 Registers
COMMON BUS SYSTEM
[Figure: Basic Computer registers connected to a 16-bit common bus. The select lines S2, S1, S0 choose the bus source: AR = 1, PC = 2, DR = 3, AC = 4, IR = 5, TR = 6, memory unit (4096 x 16) = 7. AR, PC, DR, AC, and TR have LD, INR, and CLR inputs; IR and OUTR have LD only. AC is loaded through the ALU, which also takes INPR and sets E. The memory unit has Read and Write inputs and takes its address from AR. All registers share one clock.]
• Three control lines, S2, S1, and S0 control which register the
bus selects as its input
S2 S1 S0 Register
0 0 0 x
0 0 1 AR
0 1 0 PC
0 1 1 DR
1 0 0 AC
1 0 1 IR
1 1 0 TR
1 1 1 Memory
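• Example: one-address instructions evaluating X = (A+B)-(C+D)
• LOAD C     AC ← M[C]
• ADD D      AC ← AC + M[D]
• STORE T    M[T] ← AC
• LOAD A     AC ← M[A]
• ADD B      AC ← AC + M[B]
• SUB T      AC ← AC - M[T]
• STORE X    M[X] ← AC
• C + D is computed first because SUB T subtracts M[T] from AC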
BASIC COMPUTER INSTRUCTIONS
• Basic Computer Instruction Format
Symbol  Hex Code (I=0)  Hex Code (I=1)  Description
AND     0xxx            8xxx            AND memory word to AC
ADD     1xxx            9xxx            Add memory word to AC
LDA     2xxx            Axxx            Load AC from memory
STA     3xxx            Bxxx            Store content of AC into memory
BUN     4xxx            Cxxx            Branch unconditionally
BSA     5xxx            Dxxx            Branch and save return address
ISZ     6xxx            Exxx            Increment and skip if zero
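The hex codes in the table follow directly from the instruction format: the first hex digit is the I bit followed by the 3-bit opcode, and the remaining three digits are the address. The sketch below shows the encoding; the function name `encode` and the `OPCODES` dictionary are illustrative names, not from the slides.

```python
# Opcode values for the seven memory-reference instructions in the table.
OPCODES = {"AND": 0, "ADD": 1, "LDA": 2, "STA": 3, "BUN": 4, "BSA": 5, "ISZ": 6}


def encode(symbol: str, address: int, indirect: bool = False) -> int:
    """Build the 16-bit code for a memory-reference instruction."""
    if not 0 <= address <= 0xFFF:
        raise ValueError("address must fit in 12 bits")
    return (int(indirect) << 15) | (OPCODES[symbol] << 12) | address


if __name__ == "__main__":
    print(f"{encode('ADD', 0x457):04X}")                 # direct form, 1xxx
    print(f"{encode('ADD', 0x457, indirect=True):04X}")  # indirect form, 9xxx
```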
INSTRUCTION SET COMPLETENESS
A computer should have a set of instructions so that the user can
construct machine language programs to evaluate any function
that is known to be computable.
• Instruction Types
Functional Instructions
- Arithmetic, logic, and shift instructions
- ADD, CMA, INC, CIR, CIL, AND, CLA
Transfer Instructions
- Data transfers between the main memory
and the processor registers
- LDA, STA
Control Instructions
- Program sequencing and control
- BUN, BSA, ISZ
Input/Output Instructions
- Input and output
- INP, OUT
CONTROL UNIT
• Control unit (CU) of a processor translates from machine
instructions to the control signals for the microoperations
that implement them
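TIMING AND CONTROL
Control unit of Basic Computer
[Figure: The opcode bits of IR feed a 3x8 decoder that produces D0 to D7, and bit 15 of IR is held in flip-flop I. A 4x16 decoder on the sequence counter produces timing signals T0 to T15. D0 to D7, I, and T0 to T15 drive the combinational control logic, which generates the control signals.]
TIMING SIGNALS
- Generated by a 4-bit sequence counter (SC) and a 4×16 decoder
- The SC can be incremented or cleared.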
INSTRUCTION CYCLE: FETCH AND DECODE
• Fetch and Decode
T0: AR ← PC (S2S1S0 = 010, T0 = 1)
T1: IR ← M[AR], PC ← PC + 1 (S2S1S0 = 111, T1 = 1)
T2: D0, ..., D7 ← Decode IR(12-14), AR ← IR(0-11), I ← IR(15)
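[Figure: Register transfers for the fetch phase. At T0 the bus selects PC (S2S1S0 = 010) and AR is loaded. At T1 the memory unit is read onto the bus (S2S1S0 = 111), IR is loaded, and PC is incremented.]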
Flowchart for the instruction cycle:
T0: AR ← PC
T1: IR ← M[AR], PC ← PC + 1
T2: Decode opcode in IR(12-14), AR ← IR(0-11), I ← IR(15)
T3: one of four cases, selected by D7 and I
D7'IT3: AR ← M[AR]
D7'I'T3: Nothing
D7I'T3: Execute a register-reference instruction, SC ← 0
D7IT3: Execute an input-output instruction, SC ← 0
T4 (memory-reference only, D7 = 0): Execute the memory-reference instruction, SC ← 0
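The four T3 cases depend only on D7 (opcode = 111) and the I bit. A small sketch of that decision follows; the function name `classify` and the returned strings are illustrative labels, not terminology from the slides.

```python
def classify(ir: int) -> str:
    """Classify a 16-bit instruction the way the control unit does at T3."""
    d7 = ((ir >> 12) & 0x7) == 0x7  # D7 is active when opcode bits are 111
    i = (ir >> 15) & 0x1
    if d7:
        # D7 = 1: the address bits select an operation, not a memory word.
        return "input-output" if i else "register-reference"
    # D7 = 0: memory reference; I = 1 needs the extra read AR <- M[AR].
    return "memory-reference (indirect)" if i else "memory-reference (direct)"


if __name__ == "__main__":
    for word in (0x7800, 0xF800, 0x1457, 0x9457):
        print(f"{word:04X}: {classify(word)}")
```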
REGISTER REFERENCE INSTRUCTIONS
Register Reference Instructions are identified when
- D7 = 1, I = 0
- Register Ref. Instr. is specified in b0 ~ b11 of IR
- Execution starts with timing signal T3
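REGISTER REFERENCE INSTRUCTIONS
• Exactly one of the bits B11, B10, ..., B0 of IR is set to select the operation
• 0111 1000 0000 0000 (7800): B11 = 1, CLA
• 0111 0100 0000 0000 (7400): B10 = 1, CLE
• 0111 0010 0000 0000 (7200): B9 = 1, CMA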
MEMORY REFERENCE INSTRUCTIONS
Symbol  Operation Decoder  Symbolic Description
AND D0 AC ← AC ∧ M[AR]
ADD D1 AC ← AC + M[AR], E ← Cout
LDA D2 AC ← M[AR]
STA D3 M[AR] ← AC
BUN D4 PC ← AR
BSA D5 M[AR] ← PC, PC ← AR + 1
ISZ D6 M[AR] ← M[AR] + 1, if M[AR] + 1 = 0 then PC ← PC+1
- The effective address of the instruction is in AR and was placed there during
timing signal T2 when I = 0, or during timing signal T3 when I = 1
- Memory cycle is assumed to be short enough to complete in a CPU cycle
- The execution of MR instruction starts with T4
AND to AC
D0T4: DR ← M[AR]                         Read operand
D0T5: AC ← AC ∧ DR, SC ← 0               AND with AC
ADD to AC
D1T4: DR ← M[AR]                         Read operand
D1T5: AC ← AC + DR, E ← Cout, SC ← 0     Add to AC and store carry in E
[Figure: BSA example with AR = 135, showing memory before and after execution; the return address 21 is stored at location 135.]
BSA:
D5T4: M[AR] ← PC, AR ← AR + 1
D5T5: PC ← AR, SC ← 0
Flowchart for memory-reference instructions:
AND  D0T4: DR ← M[AR]
     D0T5: AC ← AC ∧ DR, SC ← 0
ADD  D1T4: DR ← M[AR]
     D1T5: AC ← AC + DR, E ← Cout, SC ← 0
LDA  D2T4: DR ← M[AR]
     D2T5: AC ← DR, SC ← 0
STA  D3T4: M[AR] ← AC, SC ← 0
BUN  D4T4: PC ← AR, SC ← 0
BSA  D5T4: M[AR] ← PC, AR ← AR + 1
     D5T5: PC ← AR, SC ← 0
ISZ  D6T4: DR ← M[AR]
     D6T5: DR ← DR + 1
     D6T6: M[AR] ← DR, if (DR = 0) then (PC ← PC + 1), SC ← 0
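The net effect of each memory-reference instruction can be sketched in a few lines. This model collapses the T4 to T6 steps into one function call and keeps the CPU state in a dictionary; the name `execute_mri` and the state layout are assumptions made for this illustration, not a cycle-accurate simulator.

```python
MASK16 = 0xFFFF
MASK12 = 0xFFF


def execute_mri(opcode: int, cpu: dict, mem: dict) -> None:
    """Apply one memory-reference instruction; AR already holds the effective address."""
    ar = cpu["AR"]
    if opcode == 0:    # AND: AC <- AC AND M[AR]
        cpu["AC"] &= mem[ar]
    elif opcode == 1:  # ADD: AC <- AC + M[AR], E <- carry out
        total = cpu["AC"] + mem[ar]
        cpu["AC"], cpu["E"] = total & MASK16, total >> 16
    elif opcode == 2:  # LDA: AC <- M[AR]
        cpu["AC"] = mem[ar]
    elif opcode == 3:  # STA: M[AR] <- AC
        mem[ar] = cpu["AC"]
    elif opcode == 4:  # BUN: PC <- AR
        cpu["PC"] = ar
    elif opcode == 5:  # BSA: M[AR] <- PC, PC <- AR + 1
        mem[ar] = cpu["PC"]
        cpu["PC"] = (ar + 1) & MASK12
    elif opcode == 6:  # ISZ: increment M[AR], skip next instruction if it became 0
        mem[ar] = (mem[ar] + 1) & MASK16
        if mem[ar] == 0:
            cpu["PC"] = (cpu["PC"] + 1) & MASK12
    else:
        raise ValueError("opcode 7 is not a memory-reference instruction")


if __name__ == "__main__":
    cpu = {"AC": 0, "E": 0, "PC": 21, "AR": 135}
    mem = {135: 0}
    execute_mri(5, cpu, mem)  # BSA with AR = 135 and return address 21
    print(mem[135], cpu["PC"])
```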
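[Figure: Input-output configuration. The keyboard interface feeds INPR (flag FGI) and OUTR (flag FGO) feeds the transmitter interface over serial communication paths; INPR and OUTR exchange data with AC over parallel communication paths.]
INPR  Input register - 8 bits
OUTR  Output register - 8 bits
FGI   Input flag - 1 bit
FGO   Output flag - 1 bit
IEN   Interrupt enable - 1 bit
[Flowcharts: CPU input and output by flag checking. Input: FGI ← 0, wait while FGI = 0, then AC ← INPR. Output: AC ← Data, wait while FGO = 0, then OUTR ← AC.]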
INPUT-OUTPUT INSTRUCTIONS
D7IT3 = p
IR(i) = Bi, i = 6, …, 11
p: SC ← 0 Clear SC
INP pB11: AC(0-7) ← INPR, FGI ← 0 Input char. to AC
OUT pB10: OUTR ← AC(0-7), FGO ← 0 Output char. from AC
SKI pB9: if(FGI = 1) then (PC ← PC + 1) Skip on input flag
SKO pB8: if(FGO = 1) then (PC ← PC + 1) Skip on output flag
ION pB7: IEN ← 1 Interrupt enable on
IOF pB6: IEN ← 0 Interrupt enable off
PROGRAM-CONTROLLED
INPUT/OUTPUT
• Program-controlled I/O
- Continuous CPU involvement
I/O takes valuable CPU time
- CPU slowed down to I/O speed
- Simple
- Least hardware
Input
Output
LOOP, LDA DATA
LOP, SKO DEV
BUN LOP
OUT DEV
- The I/O interface, instead of the CPU, monitors the I/O device.
- When the interface finds that the I/O device is ready for data transfer,
it generates an interrupt request to the CPU
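[Flowchart: Interrupt cycle. While R = 0 the CPU executes instructions; if IEN = 1 and either FGI = 1 or FGO = 1, then R ← 1. When R = 1 the interrupt cycle runs: branch to location 1 (PC ← 1), IEN ← 0, R ← 0.]
[Figure: Memory before and after the interrupt. Before: location 1 holds 0 BUN 1120, the main program is executing at 255, PC = 256, and the I/O program at 1120 ends with 1 BUN 0. After: location 0 holds the return address 256 and PC = 1.]
[Flowchart: Complete instruction cycle. D7IT3: execute I/O instruction; D7I'T3: execute register-reference (RR) instruction; D7'IT3: AR ← M[AR]; D7'I'T3: idle; D7'T4: execute memory-reference (MR) instruction.]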
Register-Reference
D7I′T3 = r (Common to all register-reference instr)
IR(i) = Bi (i = 0,1,2, ..., 11)
r: SC ← 0
CLA rB11: AC ← 0
CLE rB10: E←0
CMA rB9: AC ← AC′
CME rB8: E ← E′
CIR rB7: AC ← shr AC, AC(15) ← E, E ← AC(0)
CIL rB6: AC ← shl AC, AC(0) ← E, E ← AC(15)
INC rB5: AC ← AC + 1
SPA rB4: If(AC(15) =0) then (PC ← PC + 1)
SNA rB3: If(AC(15) =1) then (PC ← PC + 1)
SZA rB2: If(AC = 0) then (PC ← PC + 1)
SZE rB1: If(E=0) then (PC ← PC + 1)
HLT rB0: S←0
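CIR and CIL treat E and AC together as a 17-bit circular register. The sketch below mirrors the two register-transfer statements above on Python integers; the function names `cir` and `cil` are used here only because they match the mnemonics.

```python
def cir(ac: int, e: int) -> tuple:
    """Circulate right: AC <- shr AC, AC(15) <- E, E <- AC(0)."""
    new_e = ac & 0x1
    new_ac = (ac >> 1) | (e << 15)
    return new_ac, new_e


def cil(ac: int, e: int) -> tuple:
    """Circulate left: AC <- shl AC, AC(0) <- E, E <- AC(15)."""
    new_e = (ac >> 15) & 0x1
    new_ac = ((ac << 1) & 0xFFFF) | e
    return new_ac, new_e


if __name__ == "__main__":
    ac, e = cir(0x000B, 0)  # the low bit of AC moves into E
    print(f"AC={ac:04X} E={e}")
```

Because the two operations are exact inverses, applying CIL after CIR restores both AC and E.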
DESIGN OF BASIC COMPUTER (BC)
Hardware Components of BC
A memory unit: 4096 x 16.
Registers:
AR, PC, DR, AC, IR, TR, OUTR, INPR, and SC
Flip-Flops(Status):
I, S, E, R, IEN, FGI, and FGO
Decoders: a 3x8 Opcode decoder
a 4x16 timing decoder
Common bus: 16 bits
Control logic gates:
Adder and Logic circuit: Connected to AC
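[Figure: Control of the address register AR (12 bits in from the bus, 12 bits out to the bus). Gates combining D7', I, T3, T2, R, T0, a decoder output D, and T4 drive the LD, INR, and CLR inputs of AR.]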
CONTROL OF FLAGS
IEN: Interrupt Enable Flag
pB7: IEN ← 1 (I/O Instruction)
pB6: IEN ← 0 (I/O Instruction)
RT2: IEN ← 0 (Interrupt)
[Figure: IEN is a JK flip-flop. p is formed from D7, I, and T3; J is driven by pB7, and K is driven by pB6 or RT2.]
x1 x2 x3 x4 x5 x6 x7 S2 S1 S0 Selected register
0 0 0 0 0 0 0 0 0 0 none
1 0 0 0 0 0 0 0 0 1 AR
0 1 0 0 0 0 0 0 1 0 PC
0 0 1 0 0 0 0 0 1 1 DR
0 0 0 1 0 0 0 1 0 0 AC
0 0 0 0 1 0 0 1 0 1 IR
0 0 0 0 0 1 0 1 1 0 TR
0 0 0 0 0 0 1 1 1 1 Memory
For AR (x1 = 1 places AR on the bus):
D4T4: PC ← AR
D5T5: PC ← AR
x1 = D4T4 + D5T5
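The table above is the truth table of an 8-to-3 encoder with at most one active input. Reading the columns gives S0 = x1 + x3 + x5 + x7, S1 = x2 + x3 + x6 + x7, and S2 = x4 + x5 + x6 + x7. The sketch below evaluates those three OR expressions; the function name `bus_select` is an illustrative choice.

```python
def bus_select(x: list) -> tuple:
    """Encode the inputs [x1, ..., x7] (at most one is 1) into (S2, S1, S0)."""
    if len(x) != 7 or sum(x) > 1:
        raise ValueError("expected seven inputs with at most one active")
    x1, x2, x3, x4, x5, x6, x7 = x
    s0 = x1 | x3 | x5 | x7
    s1 = x2 | x3 | x6 | x7
    s2 = x4 | x5 | x6 | x7
    return s2, s1, s0


if __name__ == "__main__":
    names = ["AR", "PC", "DR", "AC", "IR", "TR", "Memory"]
    for k, name in enumerate(names):
        inputs = [1 if j == k else 0 for j in range(7)]
        print(name, bus_select(inputs))
```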
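DESIGN OF ACCUMULATOR LOGIC
Circuits associated with AC
[Figure: The adder and logic circuit takes 16 bits from DR, 8 bits from INPR, and the 16-bit output of AC, and drives the 16-bit input of AC. The AC output also goes to the bus. Control gates drive the LD, INR, and CLR inputs of AC.]
CONTROL OF AC REGISTER
Gate structures for controlling
the LD, INR, and CLR of AC
D0T5: AND  (AND with DR)
D1T5: ADD  (add DR)
D2T5: DR   (transfer from DR)
pB11: INPR (transfer from INPR)
rB9:  COM  (complement)
rB7:  SHR  (shift right)
rB6:  SHL  (shift left)
rB5:  INC  (increment)
rB11: CLR  (clear)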
[Figure: One stage of the adder and logic circuit. For bit i, gates for AND, ADD (a full adder FA with carry in Ci and carry out Ci+1), DR, INPR (from INPR bit i), COM, SHR (from AC(i+1)), and SHL (from AC(i-1)) feed the J and K inputs of the AC(i) flip-flop, enabled by LD.]
Introduction
Machine Language
Assembly Language
Assembler
Program Loops
Subroutines
Input-Output Programming
INTRODUCTION
Those concerned with computer architecture should
have a knowledge of both hardware and software
because the two branches influence each other.
MACHINE LANGUAGE
• Program
A list of instructions or statements for directing
the computer to perform a required data
processing task
• Machine-language
- Binary code
- Octal or hexadecimal code
• Assembly-language (Assembler)
- Symbolic code
• Fortran Program
INTEGER A, B, C
DATA A /83/, B /-23/
C=A+B
END
ASSEMBLY LANGUAGE
Syntax of the BC assembly language
Each line is arranged in three columns called fields
Label field
- May be empty or may specify a symbolic
address consisting of up to 3 characters
- Terminated by a comma
Instruction field
- Specifies a machine or a pseudo instruction
- May specify one of
* Memory reference instr. (MRI)
MRI consists of two or three symbols separated by spaces.
ADD OPR (direct address MRI)
ADD PTR I (indirect address MRI)
* Register reference or input-output instr.
Non-MRI does not have an address part
* Pseudo instr. with or without an operand
Symbolic address used in the instruction field must be
defined somewhere as a label
Comment field
- May be empty or may include a comment
PSEUDO-INSTRUCTIONS
ORG N
Hexadecimal number N is the memory loc.
for the instruction or operand listed in the following line
END
Denotes the end of symbolic program
DEC N
Signed decimal number N to be converted to binary
HEX N
Hexadecimal number N to be converted to binary
TRANSLATION TO BINARY
Hexadecimal Code
Location Content Symbolic Program
ORG 100
100 2107 LDA SUB
101 7200 CMA
102 7020 INC
103 1106 ADD MIN
104 3108 STA DIF
105 7001 HLT
106 0053 MIN, DEC 83
107 FFE9 SUB, DEC -23
108 0000 DIF, HEX 0
END
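First pass
1. LC ← 0
2. Scan next line of code
3. If the line has a label: store the symbol in the address-symbol table together with the value of LC, increment LC, and go to step 2
4. If the line has no label: for ORG, set LC and go to step 2; for END, go to the second pass; otherwise increment LC and go to step 2
Second pass
1. LC ← 0
2. Scan next line of code
3. Pseudo-instruction? For ORG, set LC and go to step 2; for END, done; for DEC or HEX, convert the operand to binary and store it in the location given by LC
4. MRI? Get the operation code and set bits 2-4; search the address-symbol table for the binary equivalent of the symbolic address and set bits 5-16; set the first bit to 1 if I is present and to 0 otherwise; assemble all parts of the binary instruction and store it in the location given by LC
5. Valid non-MRI instruction? Store the binary equivalent of the instruction in the location given by LC; if not valid, report an error in the line of code
6. Increment LC and go to step 2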
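The two passes can be sketched in Python. This is a simplified illustration under stated assumptions, not the assembler from the slides: it assumes one statement per line, labels ending in a comma, comments starting with `/`, and it stores the result in a dictionary keyed by location. The function names `parse` and `assemble` are made up for the example.

```python
MRI = {"AND": 0, "ADD": 1, "LDA": 2, "STA": 3, "BUN": 4, "BSA": 5, "ISZ": 6}
NON_MRI = {
    "CLA": 0x7800, "CLE": 0x7400, "CMA": 0x7200, "CME": 0x7100,
    "CIR": 0x7080, "CIL": 0x7040, "INC": 0x7020, "SPA": 0x7010,
    "SNA": 0x7008, "SZA": 0x7004, "SZE": 0x7002, "HLT": 0x7001,
    "INP": 0xF800, "OUT": 0xF400, "SKI": 0xF200, "SKO": 0xF100,
    "ION": 0xF080, "IOF": 0xF040,
}


def parse(line: str) -> tuple:
    """Return (label or None, instruction-field tokens); the comment is dropped."""
    code = line.split("/", 1)[0]
    label = None
    if "," in code:
        label, code = code.split(",", 1)
        label = label.strip()
    return label, code.split()


def assemble(source: str) -> dict:
    lines = [parse(l) for l in source.splitlines()]
    lines = [(label, tokens) for label, tokens in lines if tokens]

    # First pass: build the address-symbol table.
    symbols, lc = {}, 0
    for label, tokens in lines:
        if tokens[0] == "ORG":
            lc = int(tokens[1], 16)
            continue
        if tokens[0] == "END":
            break
        if label:
            symbols[label] = lc
        lc += 1

    # Second pass: translate each line to a 16-bit word.
    image, lc = {}, 0
    for label, tokens in lines:
        op = tokens[0]
        if op == "ORG":
            lc = int(tokens[1], 16)
            continue
        if op == "END":
            break
        if op == "DEC":
            word = int(tokens[1]) & 0xFFFF       # two's complement for negatives
        elif op == "HEX":
            word = int(tokens[1], 16) & 0xFFFF
        elif op in MRI:
            indirect = len(tokens) == 3 and tokens[2] == "I"
            word = (int(indirect) << 15) | (MRI[op] << 12) | symbols[tokens[1]]
        elif op in NON_MRI:
            word = NON_MRI[op]
        else:
            raise ValueError(f"error in line of code: {' '.join(tokens)}")
        image[lc] = word
        lc += 1
    return image


PROGRAM = """
     ORG 100
     LDA SUB
     CMA
     INC
     ADD MIN
     STA DIF
     HLT
MIN, DEC 83
SUB, DEC -23
DIF, HEX 0
     END
"""

if __name__ == "__main__":
    for location, word in sorted(assemble(PROGRAM).items()):
        print(f"{location:03X} {word:04X}")
```

Run on the subtraction program from the translation table, the sketch reproduces the listed hexadecimal contents for locations 100 through 108.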
PROGRAM LOOPS
Loop: A sequence of instructions that are executed many times,
each with a different set of data
Fortran program to add 100 numbers:
DIMENSION A(100)
INTEGER SUM, A
SUM = 0
DO 3 J = 1, 100
3 SUM = SUM + A(J)
Assembly-language program to add 100 numbers:
- Hardware Implementation
- Implementation of an operation in a computer
with one machine instruction
* Multiplication
- For simplicity, unsigned positive numbers
- 8-bit numbers -> 16-bit product
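Example: X = 0000 1111 (multiplicand), Y = 0000 1011 (multiplier)
Shifted X added    P
(initial)          0000 0000
0000 1111          0000 1111
0001 1110          0010 1101
0000 0000          0010 1101
0111 1000          1010 0101
Product = 1010 0101
[Flowchart: CTR ← -8, P ← 0. Loop: E ← 0; AC ← Y; cir EAC; Y ← AC. If E = 1 then P ← P + X. Then E ← 0; AC ← X; cil EAC; X ← AC; CTR ← CTR + 1. If CTR ≠ 0 repeat the loop; if CTR = 0 stop.]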
ORG 100
LOP, CLE / Clear E
LDA Y / Load multiplier
CIR / Transfer multiplier bit to E
STA Y / Store shifted multiplier
SZE / Check if bit is zero
BUN ONE / Bit is one; goto ONE
BUN ZRO / Bit is zero; goto ZRO
ONE, LDA X / Load multiplicand
ADD P / Add to partial product
STA P / Store partial product
CLE / Clear E
ZRO, LDA X / Load multiplicand
CIL / Shift left
STA X / Store shifted multiplicand
ISZ CTR / Increment counter
BUN LOP / Counter not zero; repeat loop
HLT / Counter is zero; halt
CTR, DEC -8 / This location serves as a counter
X, HEX 000F / Multiplicand stored here
Y, HEX 000B / Multiplier stored here
P, HEX 0 / Product formed here
END
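The same shift-and-add loop can be written in Python to check the result of the assembly program. This is a sketch of the algorithm, not a simulation of the instructions themselves; the function name `multiply` and the `bits` parameter are illustrative.

```python
def multiply(x: int, y: int, bits: int = 8) -> int:
    """Shift-and-add multiplication of two unsigned numbers."""
    p = 0
    for _ in range(bits):   # CTR counts the eight multiplier bits
        e = y & 0x1         # CIR moves the low multiplier bit into E
        y >>= 1
        if e:               # ONE: add the multiplicand to the partial product
            p += x
        x <<= 1             # ZRO: CIL shifts the multiplicand left
    return p & 0xFFFF       # the product of two 8-bit numbers fits in 16 bits


if __name__ == "__main__":
    print(f"{multiply(0x000F, 0x000B):016b}")
```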
CLE / Clear E to 0
SPA / Skip if AC is positive
CME / AC is negative
CIR / Circulate E and AC
SUBROUTINES
Subroutine
Example
CHARACTER MANIPULATION
PROGRAM INTERRUPT
Tasks of Interrupt Service Routine
- Save the Status of CPU
Contents of processor registers and Flags
MICROPROGRAMMED
CONTROL
• Control Memory
• Sequencing Microinstructions
• Microprogram Example
• Microinstruction Format
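[Figure: Microprogrammed control unit. The instruction in IR (fetched from memory) and the status flip-flops feed the next address generation logic, which loads the control storage address register (CSAR). The control storage (μ-program memory) is read into the control storage data register (CSDR), whose decoded outputs supply control data to the CPU.]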
TERMINOLOGY
Microprogram
- Program stored in memory that generates all the control signals required
to execute the instruction set correctly
- Consists of microinstructions
Microinstruction
- Contains a control word and a sequencing word
Control Word - All the control information required for one clock cycle
Sequencing Word - Information needed to decide
the next microinstruction address
- Vocabulary to write a microprogram
Dynamic Microprogramming
- Computer system whose control unit is implemented with
a microprogram in WCS (writable control storage)
- Microprogram can be changed by a systems programmer or a user
Computer Organization Computer Architecture
Microprogrammed Control 82
TERMINOLOGY
- In-line Sequencing
- Branch
- Conditional Branch
- Subroutine
- Loop
- Instruction OP-code mapping
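MICROINSTRUCTION SEQUENCING
[Figure: Selection of the address for control memory. The instruction code passes through mapping logic. The control address register (CAR) is loaded from the incrementer, the branch address, or the mapping logic. Control memory outputs the microoperations, the field that selects a status bit, and the branch address.]
CONDITIONAL BRANCH
[Figure: A MUX selects one of the status bits (condition). The selected bit chooses between loading the control address register with the next address field from control memory and incrementing it.]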
Conditional Branch
If Condition is true, then Branch (address from
the next address field of the current microinstruction)
else Fall Through
Conditions to Test: O(overflow), N(negative),
Z(zero), C(carry), etc.
Unconditional Branch
Fixing the value of one status bit at the input of the multiplexer to 1
MAPPING OF INSTRUCTIONS
Direct Mapping
Instruction  OP-code  Control storage address
ADD          0000     0000  ADD Routine
AND          0001     0001  AND Routine
LDA          0010     0010  LDA Routine
STA          0011     0011  STA Routine
BUN          0100     0100  BUN Routine
Mapping with bits 10 xxxx 010
Address 10 0000 010: ADD Routine
Mapping from the OP-code of an instruction to the address of the microinstruction
Machine instruction:       OP-code 1 0 1 1, followed by the address field
Mapping bits:              0 x x x x 0 0
Microinstruction address:  0 1 0 1 1 0 0
Mapping function implemented by ROM or PLA
OP-code → Mapping memory (ROM or PLA) → Control address register → Control Memory
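The 0xxxx00 mapping is a two-bit left shift of the 4-bit OP-code into a 7-bit address, which leaves room for four microinstructions per routine. A minimal sketch follows; the function name `map_opcode` is an illustrative choice.

```python
def map_opcode(opcode: int) -> int:
    """Map a 4-bit OP-code xxxx to the 7-bit control memory address 0xxxx00."""
    if not 0 <= opcode <= 0xF:
        raise ValueError("OP-code must fit in 4 bits")
    return opcode << 2


if __name__ == "__main__":
    print(f"{map_opcode(0b1011):07b}")
```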
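MICROPROGRAM EXAMPLE
Computer Configuration
[Figure: Memory 2048 x 16, addressed by AR (bits 10-0). AR and PC (bits 10-0) are loaded through multiplexers; DR is 16 bits (bits 15-0). The control unit holds SBR (bits 6-0) and CAR (bits 6-0).]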
Microinstruction Format
Field:  F1  F2  F3  CD  BR  AD
Bits:   3   3   3   2   2   7
F3 Microoperation Symbol
000 None NOP
001 AC ← AC ⊕ DR XOR
010 AC ← AC’ COM
011 AC ← shl AC SHL
100 AC ← shr AC SHR
101 PC ← PC + 1 INCPC
110 PC ← AR ARTPC
111 Reserved
BR Symbol Function
00 JMP CAR ← AD if condition = 1
CAR ← CAR + 1 if condition = 0
01 CALL CAR ← AD, SBR ← CAR + 1 if condition = 1
CAR ← CAR + 1 if condition = 0
10 RET CAR ← SBR (Return from subroutine)
11 MAP CAR(2-5) ← DR(11-14), CAR(0,1,6) ← 0
SYMBOLIC MICROINSTRUCTIONS
• Symbols are used in microinstructions as in assembly language
• A symbolic microprogram can be translated into its binary equivalent
by a microprogram assembler.
Sample Format
five fields: label; micro-ops; CD; BR; AD
SYMBOLIC MICROPROGRAM
• Control Storage: 128 20-bit words
• The first 64 words: Routines for the 16 machine instructions
• The last 64 words: Used for other purpose (e.g., fetch routine and other subroutines)
• Mapping: OP-code XXXX into 0XXXX00; the first addresses of the 16 routines are
0 (0 0000 00), 4 (0 0001 00), 8, 12, 16, 20, ..., 60
ORG 4
BRANCH: NOP S JMP OVER
NOP U JMP FETCH
OVER: NOP I CALL INDRCT
ARTPC U JMP FETCH
ORG 8
STORE: NOP I CALL INDRCT
ACTDR U JMP NEXT
WRITE U JMP FETCH
ORG 12
EXCHANGE: NOP I CALL INDRCT
READ U JMP NEXT
ACTDR, DRTAC U JMP NEXT
WRITE U JMP FETCH
ORG 64
FETCH: PCTAR U JMP NEXT
READ, INCPC U JMP NEXT
DRTAR U MAP
INDRCT: READ U JMP NEXT
DRTAR U RET
BINARY MICROPROGRAM
Micro Routine  Address            Binary Microinstruction
               Decimal  Binary    F1   F2   F3   CD  BR  AD
ADD            0        0000000   000  000  000  01  01  1000011
               1        0000001   000  100  000  00  00  0000010
               2        0000010   001  000  000  00  00  1000000
               3        0000011   000  000  000  00  00  1000000
BRANCH         4        0000100   000  000  000  10  00  0000110
               5        0000101   000  000  000  00  00  1000000
               6        0000110   000  000  000  01  01  1000011
               7        0000111   000  000  110  00  00  1000000
STORE          8        0001000   000  000  000  01  01  1000011
               9        0001001   000  101  000  00  00  0001010
               10       0001010   111  000  000  00  00  1000000
               11       0001011   000  000  000  00  00  1000000
EXCHANGE       12       0001100   000  000  000  01  01  1000011
               13       0001101   001  000  000  00  00  0001110
               14       0001110   100  101  000  00  00  0001111
               15       0001111   111  000  000  00  00  1000000
This microprogram can be implemented using ROM
[Figure: Decoding of the microoperation fields F1, F2, and F3. Decoded outputs such as AND, ADD, and DRTAC control the arithmetic logic and shift unit that loads AC from DR. PCTAR and DRTAR drive the select and load inputs of the multiplexers that load AR from PC or from DR(0-10).]
MUX-1 selects an address from one of four sources and routes it into the CAR
Input Logic
I1 I0 T   Meaning   Source of Address   S1 S0   L
S1 = I1
S0 = I1I0 + I1'T
L = I1'I0T
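The three input-logic equations can be evaluated for every combination of I1, I0, and T to regenerate the truth table. In this sketch I1 and I0 are assumed to be the two bits of the BR field and T the selected test condition; the function name `input_logic` is illustrative.

```python
from itertools import product


def input_logic(i1: int, i0: int, t: int) -> tuple:
    """Evaluate S1 = I1, S0 = I1*I0 + I1'*T, L = I1'*I0*T."""
    not_i1 = 1 - i1
    s1 = i1
    s0 = (i1 & i0) | (not_i1 & t)
    load_sbr = not_i1 & i0 & t
    return s1, s0, load_sbr


if __name__ == "__main__":
    print("I1 I0 T | S1 S0 L")
    for i1, i0, t in product((0, 1), repeat=3):
        s1, s0, l = input_logic(i1, i0, t)
        print(f" {i1}  {i0} {t} |  {s1}  {s0} {l}")
```

L is 1 only for the combination I1 = 0, I0 = 1, T = 1, which matches the CALL entry of the BR table: the return address is saved in SBR only when the call is actually taken.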
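MICROPROGRAM SEQUENCER
[Figure: Microprogram sequencer for a control memory. MUX1 (select inputs S1, S0) chooses among four inputs numbered 3 to 0, including the external (MAP) address and SBR, and feeds the CAR. The input logic takes I1, I0, and T and produces S1, S0, and L (load SBR). MUX2 selects the test bit T from 1, I, S, and Z. An incrementer supplies CAR + 1. The control memory outputs the microops and the CD, BR, and AD fields.]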
MICROINSTRUCTION FORMAT
Information in a Microinstruction
- Control Information
- Sequencing Information
- Constant
Information which is useful when feeding into the system
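Field Encoding
[Figure: Two encodings of a pair of fields. In the first, Field A (2 bits) drives a 2x4 decoder (1 of 4) and Field B (3 bits) drives a 3x8 decoder (1 of 8). In the second, Field A (2 bits) drives a 2x4 decoder and Field B (6 bits) drives a 6x64 decoder, and both feed decoder and selection logic.]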
Two-level microprogram
First level
- Vertical format Microprogram
Second level
- Horizontal format Nanoprogram
- Interprets the microinstruction fields, thus converting a vertical
microinstruction format into a horizontal
nanoinstruction format.
[Figure: An 11-bit address selects one word of the 2048 x 8 control memory. The 8-bit microinstruction is the nanomemory address, which selects one word of the 256 x 200 nanomemory.]
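MDR HAS TWO INPUTS AND TWO OUTPUTS
[Figure: Datapath. Register Ri is gated by Riin and Riout; Yin loads register Y. A MUX controlled by Select chooses between the constant 4 and Y for ALU input A; ALU input B comes from the bus. The ALU result is loaded into Z by Zin and placed on the bus by Zout.]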
Figure 7.3. Input and output gating for one register bit.
Computer Organization Computer Architecture
Central Processing Unit 11
0
Performing an Arithmetic or Logic
Operation
• The ALU is a combinational circuit that has no
internal storage.
• ALU gets the two operands from MUX and bus.
The result is temporarily stored in register Z.
• What is the sequence of operations to add the
contents of register R1 to those of R2 and store
the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
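The three control steps can be traced with a very small single-bus model. This sketch is illustrative only: it models one bus, treats every signal ending in `out` as the bus source and every signal ending in `in` as a destination, and implements only the Add function of the ALU. The function name `run` and the register values are assumptions.

```python
def run(steps: list, regs: dict) -> dict:
    """Apply each control step (a list of signal names) to the register set."""
    for signals in steps:
        # Exactly one register drives the bus in each step.
        source = next(s[:-3] for s in signals if s.endswith("out"))
        bus = regs[source]
        for s in signals:
            if not s.endswith("in"):
                continue
            dest = s[:-2]
            if dest == "Z":
                # ALU input A is Y when SelectY is active, else the constant 4.
                a = regs["Y"] if "SelectY" in signals else 4
                regs["Z"] = a + bus  # only the Add function is modelled
            else:
                regs[dest] = bus
    return regs


STEPS = [
    ["R1out", "Yin"],
    ["R2out", "SelectY", "Add", "Zin"],
    ["Zout", "R3in"],
]

if __name__ == "__main__":
    print(run(STEPS, {"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0}))
```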
MAR ← [R1]
Assume MAR
is always available
on the address lines
of the memory bus.
R2 ← [MDR]
Step  Action
3     MDRoutB, R=B, IRin
4     R4outA, R5outB, SelectA, Add, R6in, End
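[Figure: Control unit organization. IR, the external inputs, and the condition codes feed the decoder/encoder, which generates the control signals.]
• Zin = T1 + T6 • ADD + T4 • BR + …
[Figure: Gate network in which T1, T6 ANDed with Add, and T4 ANDed with Branch are ORed to form Zin.]
Figure 7.12. Generation of the Zin control signal for the processor in
Figure 7.1.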
Generating End
• Control store
One function
cannot be carried
out by this simple
organization.
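[Figure: Basic organization of a microprogrammed control unit. The clock drives the μPC, which addresses the control store; the control store outputs the control word (CW).]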
- Bit-ORing
- Wide-Branch Addressing
- WMFC
Address (octal)  Microinstruction
000  PCout, MARin, Read, Select4, Add, Zin
001  Zout, PCin, Yin, WMFC
002  MDRout, IRin
003  μBranch {μPC ← 101 (from Instruction decoder); μPC5,4 ← [IR10,9]; μPC3 ← [IR10]' ⋅ [IR9]' ⋅ [IR8]}
121  Rsrcout, MARin, Read, Select4, Add, Zin
122  Zout, Rsrcin
123  μBranch {μPC ← 170; μPC0 ← [IR8]'}, WMFC
170  MDRout, MARin, Read, WMFC
171  MDRout, Yin
172  Rdstout, SelectY, Add, Zin
173  Zout, Rdstin, End
(A prime denotes the complement of the bit.)
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• RISC Pipeline
• Vector Processing
• Array Processors
PARALLEL PROCESSING
- Inter-Instruction level
- Intra-Instruction level
PARALLEL COMPUTERS
Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and Data
Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor
VLIW
MISD Nonexistence
Systolic arrays
Dataflow Associative processors
Message-passing multicomputers
Hypercube
Mesh
Reconfigurable
SISD COMPUTER SYSTEMS
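[Figure: The Control Unit sends instructions to the Processor Unit; the data stream flows between the Processor Unit and Memory; the instruction stream flows from Memory to the Control Unit.]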
Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time
Limitations
Von Neumann bottleneck
SISD PERFORMANCE IMPROVEMENTS
• Multiprogramming
• Spooling
• Multifunction processor
• Pipelining
• Exploiting instruction-level parallelism
- Superscalar
- Superpipelining
- VLIW (Very Long Instruction Word)
MISD COMPUTER SYSTEMS
[Figure: Several memory (M), control unit (CU), and processor (P) triples, each with its own instruction stream, operating on a single data stream from Memory.]
Characteristics
- There is no computer at present that can be
classified as MISD
SIMD COMPUTER SYSTEMS
[Figure: Memory connects over the data bus to the Control Unit, which broadcasts one instruction stream; the data streams pass through an alignment network.]
Characteristics
- Only one copy of the program exists
- A single controller executes one instruction at a time
TYPES OF SIMD COMPUTERS
Array Processors
- The control unit broadcasts instructions to all PEs,
and all active PEs execute the same instructions
- ILLIAC IV, GF-11, Connection Machine, DAP, MPP
Systolic Arrays
- Regular arrangement of a large number of
very simple processors constructed on
VLSI circuits
- CMU Warp, Purdue CHiP
Associative Processors
- Content addressing
- Data transformation operations over many sets
of arguments with a single instruction
- STARAN, PEPE
Computer Organization Computer Architecture
Pipelining and Vector Processing 15 Parallel Processing
MIMD COMPUTER SYSTEMS
P M P M ••• P M
Interconnection Network
Shared Memory
Characteristics
- Multiple processing units
- Message-passing multicomputers
SHARED MEMORY MULTIPROCESSORS
M M ••• M
Buses,
Interconnection Network(IN) Multistage IN,
Crossbar Switch
P P ••• P
Characteristics
All processors have equally direct access to
one large memory address space
Example systems
Bus and cache-based systems
- Sequent Balance, Encore Multimax
Multistage IN-based systems
- Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems
- C.mmp, Alliant FX/8
Limitations
Memory access latency
Hot spot problem
MESSAGE-PASSING MULTICOMPUTER
Message-Passing Network Point-to-point connections
P P ••• P
M M ••• M
Characteristics
- Interconnected computers
- Each processor has its own memory, and
communicate via message-passing
Example systems
- Tree structure: Teradata, DADO
- Mesh-connected: Rediflow, Series 2010, J-Machine
- Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Limitations
- Communication overhead
- Hard to program
PIPELINING
A technique of decomposing a sequential process
into suboperations, with each subprocess being
executed in a special dedicated segment that
operates concurrently with all other segments.
Example: compute Ai * Bi + Ci for i = 1, 2, 3, ... , 7
Segment 1: R1 ← Ai, R2 ← Bi (load Ai and Bi from memory)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and load Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
OPERATIONS IN EACH PIPELINE STAGE
Clock
Pulse Segment 1 Segment 2 Segment 3
Number R1 R2 R3 R4 R5
1 A1 B1
2 A2 B2 A1 * B1 C1
3 A3 B3 A2 * B2 C2 A1 * B1 + C1
4 A4 B4 A3 * B3 C3 A2 * B2 + C2
5 A5 B5 A4 * B4 C4 A3 * B3 + C3
6 A6 B6 A5 * B5 C5 A4 * B4 + C4
7 A7 B7 A6 * B6 C6 A5 * B5 + C5
8 A7 * B7 C7 A6 * B6 + C6
9 A7 * B7 + C7
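The table can be reproduced by clocking the five registers once per pulse. In the sketch below the segments are updated from the back to the front so that each one reads the values latched on the previous pulse; the function name `run_pipeline` and the sample data are illustrative.

```python
def run_pipeline(a: list, b: list, c: list) -> list:
    """Compute a[i] * b[i] + c[i] with the three-segment pipeline."""
    n = len(a)
    r1 = r2 = r3 = r4 = None
    results = []
    for clock in range(n + 2):            # k + n - 1 pulses with k = 3
        # Segment 3: add the product and Ci latched on the previous pulse.
        r5 = r3 + r4 if r3 is not None else None
        # Segment 2: multiply the operands latched on the previous pulse.
        if r1 is not None:
            r3, r4 = r1 * r2, c[clock - 1]
        else:
            r3 = r4 = None
        # Segment 1: load the next pair of operands, if any remain.
        if clock < n:
            r1, r2 = a[clock], b[clock]
        else:
            r1 = r2 = None
        if r5 is not None:
            results.append(r5)
    return results


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7]
    b = [7, 6, 5, 4, 3, 2, 1]
    c = [1, 1, 1, 1, 1, 1, 1]
    print(run_pipeline(a, b, c))
```

With seven operand sets the loop runs for nine pulses, and the first result appears on the third pulse, as in the table.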
GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
Clock
Input S1 R1 S2 R2 S3 R3 S4 R4
Space-Time Diagram
Clock cycles:  1  2  3  4  5  6  7  8  9
Segment 1      T1 T2 T3 T4 T5 T6
Segment 2         T1 T2 T3 T4 T5 T6
Segment 3            T1 T2 T3 T4 T5 T6
Segment 4               T1 T2 T3 T4 T5 T6
PIPELINE SPEEDUP
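n: Number of tasks to be performed
k: Number of pipeline segments
tp: Clock period of the pipeline (time per segment)
tn: Time to complete one task in the non-pipelined system
Sk: Speedup
$$S_k = \frac{n\,t_n}{(k + n - 1)\,t_p}$$
$$\lim_{n \to \infty} S_k = \frac{t_n}{t_p} \quad (= k \text{ if } t_n = k\,t_p)$$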
PIPELINE AND MULTIPLE FUNCTION UNITS
Example
- 4-stage pipeline
- suboperation in each stage; tp = 20nS
- 100 tasks to be executed
- 1 task in non-pipelined system; 20*4 = 80nS
Pipelined System
(k + n - 1)*tp = (4 + 99) * 20 = 2060nS
Non-Pipelined System
n*k*tp = 100 * 80 = 8000nS
Speedup
Sk = 8000 / 2060 = 3.88
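The arithmetic of this example is easy to reproduce and to vary. The sketch below assumes, as the example does, that a non-pipelined task takes k * tp unless a different tn is supplied; the function name `pipeline_speedup` is illustrative.

```python
def pipeline_speedup(n: int, k: int, tp: float, tn: float = None) -> float:
    """Speedup of a k-segment pipeline over a non-pipelined unit for n tasks."""
    if tn is None:
        tn = k * tp                       # assumption: tn = k * tp
    pipelined_time = (k + n - 1) * tp     # k cycles to fill, then one task per cycle
    non_pipelined_time = n * tn
    return non_pipelined_time / pipelined_time


if __name__ == "__main__":
    print(round(pipeline_speedup(n=100, k=4, tp=20), 2))
    # As n grows, the speedup approaches k = 4.
    print(round(pipeline_speedup(n=1_000_000, k=4, tp=20), 2))
```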
ARITHMETIC PIPELINE
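Floating-point adder
X = A × 2^a
Y = B × 2^b
[Figure: Pipeline for floating-point addition. The inputs are the exponents a, b and the mantissas A, B; registers (R) separate the segments.]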
4-STAGE FLOATING POINT ADDER
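A = a × 2^p, B = b × 2^q
Stages:
S1: The exponent subtractor compares p and q, giving r = max(p,q) and t = |p - q|. The fraction selector picks the fraction with min(p,q), and the right shifter shifts it by t; the other fraction passes through unchanged.
S2: The fraction adder adds the two aligned fractions, giving c.
S3: The leading zero counter and the left shifter normalize c, giving d.
S4: The exponent adder corrects r by the shift count, giving s.
C = A + B = c × 2^r = d × 2^s
(r = max (p,q), 0.5 ≤ d < 1)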
INSTRUCTION CYCLE
Six Phases* in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
INSTRUCTION PIPELINE
Conventional
i     FI DA FO EX
i+1               FI DA FO EX
i+2                           FI DA FO EX
Pipelined
i     FI DA FO EX
i+1      FI DA FO EX
i+2         FI DA FO EX
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE
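[Flowchart: four-segment instruction pipeline]
Segment 1: Fetch instruction from memory (FI)
Segment 2: Decode instruction and calculate effective address (DA); then test: Branch?
Segment 3: Fetch operand from memory (FO)
Segment 4: Execute instruction (EX); then test: Interrupt? If yes, go to interrupt handling
On a branch or an interrupt: update PC and empty the pipe
Timing with a branch at instruction 3 (a dash marks an idle step):
Step:          1  2  3  4  5  6  7  8  9  10 11 12 13
Instruction 1  FI DA FO EX
            2     FI DA FO EX
   (Branch) 3        FI DA FO EX
            4           FI -  -  FI DA FO EX
            5              -  -  -  FI DA FO EX
            6                          FI DA FO EX
            7                             FI DA FO EX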
MAJOR HAZARDS IN PIPELINED EXECUTION
Structural hazards(Resource Conflicts)
Hardware Resources required by the instructions in
simultaneous overlapped execution cannot be met
Data hazards (Data Dependency Conflicts)
An instruction scheduled to be executed in the pipeline requires the
result of a previous instruction, which is not yet available
Control hazards
Branches and other instructions that change the PC
delay the fetch of the next instruction
[Figure: Branch address dependency. A JMP determines the new PC in its ID stage, so the next instruction (IF ID OF OE OS) starts only after a bubble.]
STRUCTURAL HAZARDS
Structural Hazards
Occur when some resource has not been
duplicated enough to allow all combinations
of instructions in the pipeline to execute
DATA HAZARDS
Data Hazards
Interlock
- hardware detects the data dependencies and delays the scheduling
of the dependent instruction by stalling enough clock cycles
Forwarding (bypassing, short-circuiting)
- Accomplished by a data path that routes a value from a source
(usually an ALU) to a user, bypassing a designated register. This
allows the produced value to be used at an earlier stage in the
pipeline than would otherwise be possible
Software Technique
Instruction Scheduling(compiler) for delayed load
FORWARDING HARDWARE
Example:
Register
file
ADD R1, R2, R3
SUB R4, R1, R5
ADD I A E
INSTRUCTION SCHEDULING
a = b + c;
d = e - f;
Delayed Load
A load requiring that the following instruction not use its result
CONTROL HAZARDS
Branch Instructions
Next
Instruction FI DA FO EX
CONTROL HAZARDS
Prefetch Target Instruction
– Fetch instructions in both streams, branch not taken and branch taken
– Both are saved until the branch is executed. Then, select the right
instruction stream and discard the wrong stream
Branch Target Buffer (BTB; Associative Memory)
– Entry: Address of previously executed branches; target instruction
and the next few instructions
– When fetching an instruction, search the BTB.
– If found, fetch the instruction stream in the BTB;
– If not, a new stream is fetched and the BTB is updated
Loop Buffer (High Speed Register file)
– Stores an entire loop so that the loop can execute without accessing
memory
Branch Prediction
– Guess the branch condition and fetch an instruction stream based
on the guess. A correct guess eliminates the branch penalty
Delayed Branch
– Compiler detects the branch and rearranges the instruction sequence
by inserting useful instructions that keep the pipeline busy
in the presence of a branch instruction
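The branch target buffer behavior listed above (search on fetch, use the stored stream on a hit, fetch and update on a miss) can be sketched as follows. This is a toy model under stated assumptions: the class name `BranchTargetBuffer`, the `stream_length` parameter, and the list used as instruction memory are all hypothetical, and a Python dictionary stands in for the associative memory.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch address to the instructions stored for its target."""

    def __init__(self, stream_length=2):
        self.entries = {}                  # branch address -> stored target stream
        self.stream_length = stream_length

    def lookup(self, branch_addr):
        """Return the stored target stream, or None on a miss."""
        return self.entries.get(branch_addr)

    def update(self, branch_addr, target_addr, memory):
        """After a miss, store the target instruction and the next few."""
        stream = memory[target_addr:target_addr + self.stream_length]
        self.entries[branch_addr] = stream
        return stream


def fetch_after_branch(btb, branch_addr, target_addr, memory):
    """Return (hit, stream) for a taken branch located at branch_addr."""
    stream = btb.lookup(branch_addr)
    if stream is not None:
        return True, stream                # hit: stream comes from the BTB
    return False, btb.update(branch_addr, target_addr, memory)


# Hypothetical instruction memory; address 2 holds a branch to address 5.
memory = ["i0", "i1", "JMP 5", "i3", "i4", "t5", "t6", "t7"]
btb = BranchTargetBuffer(stream_length=2)
first = fetch_after_branch(btb, 2, 5, memory)    # miss, BTB is updated
second = fetch_after_branch(btb, 2, 5, memory)   # hit on the stored stream
print(first, second)
```

The first execution of the branch misses and fills the buffer; the second finds the target stream without going to memory.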
RISC PIPELINE
RISC
- Machine with a very fast clock cycle that
executes at the rate of one instruction per cycle
<- Simple Instruction Set
<- Fixed-Length Instruction Format
<- Register-to-Register Operations
DELAYED LOAD
LOAD: R1 ← M[address 1]
LOAD: R2 ← M[address 2]
ADD: R3 ← R1 + R2
STORE: M[address 3] ← R3
Three-segment pipeline timing

Pipeline timing with data conflict
clock cycle    1  2  3  4  5  6
Load R1        I  A  E
Load R2           I  A  E
Add R1+R2            I  A  E
Store R3                I  A  E

The data dependency is taken care of by the compiler rather than the hardware
clock cycle    1  2  3  4  5  6  7
Load R1        I  A  E
Load R2           I  A  E
NOP                  I  A  E
Add R1+R2               I  A  E
Store R3                   I  A  E
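The compiler's role in a delayed load can be sketched as a pass that inserts a NOP after any load whose result the very next instruction reads. This is a minimal sketch, assuming a three-segment pipeline with one instruction issued per cycle; the tuple format (opcode, destination, sources) and the function name `insert_delayed_load_nops` are hypothetical.

```python
NOP = ("NOP", None, ())


def insert_delayed_load_nops(program):
    """Insert a NOP after any LOAD whose result the next instruction reads."""
    scheduled = []
    for index, instr in enumerate(program):
        scheduled.append(instr)
        op, dest, _sources = instr
        if op == "LOAD" and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if dest in next_sources:
                scheduled.append(NOP)
    return scheduled


# Hypothetical encoding of the four-instruction example: (opcode, dest, sources)
program = [
    ("LOAD", "R1", ("M1",)),
    ("LOAD", "R2", ("M2",)),
    ("ADD", "R3", ("R1", "R2")),
    ("STORE", "M3", ("R3",)),
]
scheduled = insert_delayed_load_nops(program)
ops = [instr[0] for instr in scheduled]

# With three segments and one instruction issued per cycle,
# k instructions finish in k + 2 clock cycles.
total_cycles = len(scheduled) + 2
print(ops, total_cycles)
```

The pass lengthens the program from four to five instructions, so the run takes seven clock cycles instead of six. A real compiler would try to place a useful instruction in the slot rather than a NOP.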
DELAYED BRANCH
Compiler analyzes the instructions before and after
the branch and rearranges the program sequence by
inserting useful instructions in the delay steps
VECTOR PROCESSING
Vector Processing Applications
• Problems that can be efficiently formulated in terms of vectors
– Long-range weather forecasting
– Petroleum explorations
– Seismic data analysis
– Medical diagnosis
– Aerodynamics and space flight simulations
– Artificial intelligence and expert systems
– Mapping the human genome
– Image processing
VECTOR PROGRAMMING
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Conventional computer
Initialize I = 1
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I ≤ 100 go to 20
Vector computer
C(1:100) = A(1:100) + B(1:100)
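The contrast between the two programming styles can be sketched in Python. This only illustrates the difference in how the program is expressed: plain Python runs both versions element by element, and the function names `scalar_add` and `vector_add` are hypothetical.

```python
def scalar_add(a, b):
    """Conventional style: explicit index, one element per loop iteration."""
    c = [0] * len(a)
    i = 0
    while i < len(a):
        c[i] = a[i] + b[i]
        i += 1
    return c


def vector_add(a, b):
    """Vector style: one whole-array operation, no visible loop control."""
    return [x + y for x, y in zip(a, b)]


a = list(range(1, 101))          # A(1) ... A(100)
b = [10 * x for x in a]          # B(1) ... B(100)
print(scalar_add(a, b) == vector_add(a, b))
```

On a vector computer the second form maps to a single vector instruction, which removes the per-element overhead of fetching and decoding the loop-control instructions.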
VECTOR INSTRUCTIONS
f1: V → V
f2: V → S
f3: V x V → V      V: Vector operand
f4: V x S → V      S: Scalar operand
VECTOR INSTRUCTION FORMAT
MULTIPLE MEMORY MODULE AND INTERLEAVING
[Figure: four memory modules, each with its own address register (AR) and data register (DR), sharing a data bus]
Address Interleaving
• Characteristics of Multiprocessors
• Interconnection Structures
• Interprocessor Arbitration
• Interprocessor Communication and Synchronization
• Cache Coherence
Distributed Computing
Concurrent Computing
Pipelining
Breaking a task into steps performed by different units, with multiple
inputs streaming through the units; the next input starts in a unit as soon as
the previous input is done with that unit, though not necessarily done with the task
Vector Computing
Use of vector processors, where an operation such as multiply is
broken into several steps and applied to a stream of operands
(“vectors”). The most common special case of pipelining
Systolic
Similar to pipelining, but the units are not necessarily arranged linearly;
steps are typically smaller and more numerous, and are performed in lockstep
fashion. Often used in special-purpose hardware such as image or signal
processors
Multiprocessors 18 Characteristics of Multiprocessors
SPEEDUP AND EFFICIENCY
A: Given problem
[Figure: speedup plotted against the number of processors, 1 to 10]
Speedup should be between 0 and p, and
efficiency should be between 0 and 1.
Speedup is linear if there is a constant c > 0
such that speedup is always at least cp.
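The bounds above can be checked with a small numeric example. The definitions used here are the usual ones and are an assumption, since the defining formulas do not appear in the text: speedup is the one-processor time divided by the p-processor time, and efficiency is speedup divided by p. The timing values are made up for illustration.

```python
def speedup(t_serial, t_parallel):
    """Assumed definition: time on one processor / time on p processors."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """Assumed definition: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical timings: 100 time units on 1 processor, 12.5 on 10 processors.
p = 10
s = speedup(100.0, 12.5)
e = efficiency(100.0, 12.5, p)
print(s, e)
```

Here the speedup lies between 0 and p, and the efficiency lies between 0 and 1, as stated.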
AMDAHL’S LAW
Given a program
f : Fraction of time that represents operations
that must be performed serially
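With f defined this way, the commonly stated form of Amdahl's law bounds the speedup on p processors by 1 / (f + (1 - f) / p). The formula itself is not given in the text above, so it is stated here as an assumption; the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup with serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)


# With 10% serial work, 10 processors give well under a 10x speedup.
print(amdahl_speedup(0.1, 10))
# With no serial work the bound equals p.
print(amdahl_speedup(0.0, 10))
# With many processors the bound approaches 1 / f.
print(amdahl_speedup(0.1, 10**6))
```

Even a small serial fraction dominates once p is large, because the bound can never exceed 1 / f.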
Medium-grain
Fine-grain
SHARED MEMORY
[Figure: processors (P P ... P) connected through an interconnection network to memory]
Interconnection Network: Buses, Multistage IN, Crossbar Switch
Characteristics
DISTRIBUTED MEMORY
[Figure: processor/memory pairs (P/M ... P/M) connected through a network]
Characteristics
- Interconnected computers
- Each processor has its own memory, and processors
communicate via message-passing
Example systems
Limitations
Bus
All processors (and memory) are connected to a
common bus or busses
- Memory access is fairly uniform, but not very scalable
[Figure: local buses attached to a common system bus]
Advantages
- Multiple paths -> high transfer rate
Disadvantages
- Memory control logic
- Large number of cables and
connections
[Figure: CPU 1 to CPU 4 connected to memory modules MM 1 to MM 4]
[Figure: multiplexers and arbitration logic select among the data, address, and control lines from CPU 1 to CPU 4 and drive the data, address, R/W, and memory enable inputs of a memory module]
Interstage Switch
[Figure: 2x2 switch with inputs A, B and outputs 0, 1, shown in its four settings]
- A connected to 0
- A connected to 1
- B connected to 0
- B connected to 1
8x8 Omega Switching Network
[Figure: 8x8 omega network; ports 0 to 7 are labeled with the binary addresses 000 to 111]
- p = 2^n
- processors are conceptually on the corners of an
n-dimensional hypercube, and each is directly
connected to the n neighboring nodes
- Degree = n
[Figure: hypercubes of dimension 1, 2, and 3, with nodes labeled 0 and 1; 00 to 11; and 000 to 111]
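With the binary node labels used for the hypercube, the n neighbors of a node are the labels that differ from it in exactly one bit. A minimal sketch, where the function name `hypercube_neighbors` is hypothetical:

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-dimensional hypercube: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(n)]


n = 3
for node in (0b000, 0b101):
    labels = [format(neighbor, f"0{n}b") for neighbor in hypercube_neighbors(node, n)]
    print(format(node, f"0{n}b"), "->", labels)
```

Each node has exactly n neighbors, which matches Degree = n.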
Asynchronous Bus
Miscellaneous control
  Master clock                  CCLK
  System initialization         INIT
  Byte high enable              BHEN
  Memory inhibit (2 lines)      INH1 - INH2
  Bus lock                      LOCK
Bus arbitration
  Bus request                   BREQ
  Common bus request            CBRQ
  Bus busy                      BUSY
  Bus clock                     BCLK
  Bus priority in               BPRN
  Bus priority out              BPRO
Power and ground (20 lines)
[Figure: arbitration logic built from a 4x2 priority encoder and a 2x4 decoder]
Time Slice
A fixed-length time slice is given sequentially to
each processor in round-robin fashion
Polling
Unit address polling - Bus controller advances
the address to identify the requesting unit
LRU
FIFO
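Of the policies listed above, the time-slice scheme is the simplest to sketch: the bus is handed to each processor in turn for one slice, whether or not it has a request pending. The function name `time_slice_schedule` and the processor labels are hypothetical.

```python
from itertools import cycle


def time_slice_schedule(processors, slices):
    """Return which processor owns the bus in each of the next `slices` slices."""
    order = cycle(processors)          # round-robin order, repeating forever
    return [next(order) for _ in range(slices)]


schedule = time_slice_schedule(["P1", "P2", "P3"], 7)
print(schedule)
```

Because the order is fixed, no processor can be starved, but slices given to idle processors are wasted.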
[Figure: interprocessor communication through shared memory. Labels: sending processor, instruction, interrupt, communication area (mark, receiver(s), message), receiving processors]
Hardware Implementation
Semaphore
- A binary variable
- 1: A processor is executing a critical section,
which is not available to other processors
0: Available to any requesting processor
- Software-controlled flag stored in
memory that all processors can access
These steps are performed while locked, so that other processors cannot test
and set while the current processor is executing these instructions
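The test-and-set use of a semaphore can be sketched with Python threads standing in for processors. This is an illustration under stated assumptions: the class name `FlagSemaphore` is hypothetical, and a `threading.Lock` plays the role of the hardware lock that makes the test and the set indivisible.

```python
import threading
import time


class FlagSemaphore:
    """Binary flag: 1 = critical section in use, 0 = available."""

    def __init__(self):
        self.flag = 0
        self._hw_lock = threading.Lock()   # stands in for the hardware lock

    def test_and_set(self):
        """Indivisibly return the old flag value and set the flag to 1."""
        with self._hw_lock:
            old = self.flag
            self.flag = 1
            return old

    def clear(self):
        self.flag = 0                      # release the critical section


def worker(sem, counter, rounds):
    for _ in range(rounds):
        while sem.test_and_set() == 1:     # flag was already 1: keep trying
            time.sleep(0)                  # yield so the holder can finish
        counter[0] += 1                    # critical section
        sem.clear()


sem = FlagSemaphore()
counter = [0]
threads = [threading.Thread(target=worker, args=(sem, counter, 200)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])
```

A thread enters the critical section only when `test_and_set` returns 0, so the four workers never update the shared counter at the same time.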
[Figure: caches of processors P1, P2, P3 on a common bus, each holding X = 52]
[Figure: two cases in which the cache of P1 holds X = 120 while the caches of P2 and P3 still hold X = 52]
Software Approaches
* Read-Only Data are Cacheable
- Private Cache is for Read-Only data
- Shared Writable Data are not cacheable
- Compiler tags data as cacheable or noncacheable
- Performance degrades due to software overhead
Hardware Approaches
* Snoopy Cache Controller
- Cache Controllers monitor all the bus requests from CPUs and IOPs
- All caches attached to the bus monitor the write operations
- When a word in a cache is written, memory is also updated (write through)
- Local snoopy controllers in all other caches check their memory to determine whether they have
a copy of that word; if they do, that location is marked invalid (a future reference to
this location causes a cache miss)
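The write-through snoopy scheme described above can be sketched as a small simulation. The class names `SnoopyCache` and `Bus` are hypothetical, each cache is modeled as a dictionary, and an invalidated location is modeled by deleting its entry so that the next reference misses.

```python
class SnoopyCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                    # address -> value; absent = invalid

    def snoop_write(self, address):
        """Another cache wrote this address: invalidate any local copy."""
        self.lines.pop(address, None)


class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        if address not in cache.lines:     # miss: fetch from main memory
            cache.lines[address] = self.memory[address]
        return cache.lines[address]

    def write(self, cache, address, value):
        cache.lines[address] = value
        self.memory[address] = value       # write through to main memory
        for other in self.caches:          # every other controller snoops
            if other is not cache:
                other.snoop_write(address)


bus = Bus({"X": 52})
p1, p2, p3 = SnoopyCache("P1"), SnoopyCache("P2"), SnoopyCache("P3")
for c in (p1, p2, p3):
    bus.attach(c)
    bus.read(c, "X")                       # all three caches now hold X = 52

bus.write(p1, "X", 120)                    # P1 updates X
stale_copy_present = "X" in p2.lines       # P2's copy was invalidated
value_seen_by_p2 = bus.read(p2, "X")       # miss, refetched from memory
print(stale_copy_present, value_seen_by_p2)
```

After P1 writes X = 120, the copies in P2 and P3 are invalidated, so P2's next read misses and picks up the new value from memory.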
Minsky’s Conjecture
Amdahl’s Law
There exist some parallel algorithms with almost no sequential operations. As the problem size (n)
increases, f becomes smaller (f -> 0 as n -> ∞). In this case, S -> p as n -> ∞.
Software Inertia
Multistage Interconnect
Switch Processor
Bus
Direct Connection
Interconnection Network
A graph G(V,E)
V: a set of processors (nodes)
E: a set of wires (edges)
Ring
• 2-Mesh
[Figure: m x m mesh of processors, m^2 = p]
- Degree = 4
- Diameter = 2(m - 1)
- In general, an n-dimensional mesh has
diameter = n(p^(1/n) - 1)
- Diameter can be halved by having wrap-around
connections (-> Torus)
- Ring is a 1-dimensional mesh with wrap-around
connection
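The mesh diameter can be checked by brute force on small cases. This sketch assumes the general formula is n(p^(1/n) - 1) for an n-dimensional mesh with m = p^(1/n) nodes per side and no wrap-around links, so the shortest path between two nodes is their Manhattan distance. The function names are hypothetical.

```python
from itertools import product


def mesh_diameter_formula(m, n):
    """n-dimensional mesh with m nodes per side (p = m**n): n * (m - 1)."""
    return n * (m - 1)


def mesh_diameter_brute_force(m, n):
    """Largest Manhattan distance between any two nodes of the mesh."""
    nodes = list(product(range(m), repeat=n))
    return max(
        sum(abs(a - b) for a, b in zip(u, v))
        for u in nodes
        for v in nodes
    )


for m, n in [(4, 2), (3, 3), (5, 1)]:
    print(m, n, mesh_diameter_formula(m, n), mesh_diameter_brute_force(m, n))
```

For the 2-dimensional case the formula reduces to 2(m - 1), the distance between opposite corners.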
Multiprocessors 21 Interconnection Structure
INTERCONNECTION NETWORK
Binary Tree
- Degree = 3
- Diameter = 2 log2((p + 1)/2)
MIN
Banyan network (= unique path network)
• Baseline [Wu80]
• Flip [Batcher76]
• Indirect binary n-cube [Peas77]
• Omega [Lawrie75]
• Regular SW banyan [Goke73]
Multiple Path Network
• Data Manipulator [Feng74]
• Augmented DM [Siegel78]
• Inverse ADM [Siegel79]
• Gamma [Parker84]
Permutation/Sorting Network (N!)
• Clos network [53]
• Benes network [62]
• Batcher sorting network [68]