
5-Stage Pipeline CPU Hardware

The document discusses the 5-stage pipeline CPU hardware and pipeline hazards including data hazards, control hazards, and structural hazards. It then discusses the ARM architecture, including that it uses a pipelined RISC CPU with reduced instruction size, offering high code density and low power. While mostly RISC, ARM has some deviations like variable cycle instructions and a barrel shifter. It discusses ARM modes, registers, data processing instructions, immediate operands, and conditional execution.

Uploaded by

rosestrikes
5-stage Pipeline CPU hardware

Pipeline hazards:
[1] Data hazards
[2] Control hazards
 A control hazard occurs whenever the normal sequential flow of the program changes (caused by a branch/jump, subroutine call, interrupt, return from interrupt, etc.).
[3] Structural hazards
 [1] A multiply instruction holds the Ex stage for two or more clock cycles.
 [2] Two or more instructions in the pipeline try to read/write the register file at the same time. Since there is only one read/write port, only one instruction at a time is allowed to access the register file.
ARM Architecture
 ARM core:
 Pipelined RISC CPU with a reduced number of fixed-size instructions
 Offers high code density, small size and low power
 Applications include cell phones, handheld PDAs and cameras
 A few deviations from pure RISC (to gain some advantages):
 Variable-cycle execution for certain instructions, to support multiple-register load and store
 Inline barrel shifter, leading to a few complex instructions; preprocessing one operand enhances computational power
 Thumb state (16-bit instruction set) to improve code density
 Conditional execution of instructions for smooth pipeline operation
 Performance is quoted as:
 Speed => MIPS @ clock frequency, DMIPS @ clock frequency, CoreMark score @ clock frequency
 Power => mW @ (voltage, clock frequency, technology)
DMIPS
 Dhrystone is a synthetic benchmark program for system programming. DMIPS therefore does not count instructions per second; it indicates how long a processor takes overall to execute the benchmark program, relative to a reference machine.
 The industry has adopted the VAX 11/780 as the reference 1 MIPS machine. The VAX 11/780 achieves 1757 Dhrystones per second.
 The DMIPS figure of a given processor is calculated by measuring the number of Dhrystones executed per second and dividing that by 1757. So if a processor executes 140560 Dhrystones per second, its rating is 140560/1757 = 80 DMIPS.
 To compare two computing systems that run at different clock frequencies, DMIPS is normalized to the clock frequency, e.g. 60 DMIPS @ 40 MHz = 1.5 DMIPS/MHz.
 A newer benchmark for embedded processors => CoreMark
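The DMIPS arithmetic above can be checked with a few lines of Python. This is only a sketch of the two formulas; the function names `dmips` and `dmips_per_mhz` are illustrative and do not come from any benchmark tool.

```python
# Reference score of the VAX 11/780, the nominal "1 MIPS" machine.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second):
    """DMIPS rating = measured Dhrystones per second / 1757."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dmips_rating, clock_mhz):
    """Normalize a DMIPS rating to the clock frequency in MHz."""
    return dmips_rating / clock_mhz


print(dmips(140560))          # 80.0
print(dmips_per_mhz(60, 40))  # 1.5
```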

 Two source registers (Rn and Rm) and one result register (Rd)
 Sign extend -> converts a signed 8/16-bit value to a signed 32-bit value
 Barrel shifter => preprocesses Rm before it enters the ALU; it performs shift and rotate operations
 MAC unit => for multiply and accumulate operations

ARM Architecture
 The ARM core under study is the ARM7TDMI (32-bit RISC CPU, 3-stage pipeline)
 ARM state => instructions are 32 bits wide and addresses are word aligned
 Thumb state => instructions are 16 bits wide and addresses are half-word aligned
ARM Modes:
 Different modes of the ARM processor are defined for specific purposes
 Non-exception modes => User, System
 Exception modes => Supervisor, IRQ, FIQ, Abort, Undefined
 User mode => runs application programs; most application software runs in this mode
 Supervisor mode => runs embedded operating system routines
 IRQ & FIQ modes => handle hardware interrupts
 Abort mode => handles memory access violations
 Undefined mode => handles undefined instructions
ARM Architecture
CPSR (Current Program Status Register):
 32-bit register with condition flags, control bits, and status and extension fields
 Only privileged modes have full write access to the CPSR
 N = 1 if the MSB of the ALU result is 1
 Z = 1 if the ALU result is zero
 C = 1 if the ALU operation produces a carry out (for a subtraction, C is cleared when a borrow occurs, i.e. when the unsigned result would be negative)
 V = 1 if the ALU operation overflowed (meaningful for signed numbers only)
 For data processing instructions, the flags are updated only if the suffix ‘S’ is added to the instruction
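The N, Z, C and V rules can be illustrated with a small Python model of a 32-bit add and subtract. This is a sketch for study purposes, not a description of the ARM hardware; the function names `add_flags` and `sub_flags` are made up for this example. Subtraction is modelled as a + NOT(b) + 1, which is why C = 1 means "no borrow".

```python
MASK32 = 0xFFFFFFFF


def add_flags(a, b):
    """Return (result, flags) for a 32-bit a + b, as ADDS would set them."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": result >> 31,                  # MSB of the result
        "Z": int(result == 0),              # zero result
        "C": int(full > MASK32),            # carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": (((a ^ result) & (b ^ result)) >> 31) & 1,
    }
    return result, flags


def sub_flags(a, b):
    """Return (result, flags) for a 32-bit a - b, as SUBS or CMP would set them."""
    a &= MASK32
    b &= MASK32
    full = a + ((~b) & MASK32) + 1          # a + NOT(b) + 1
    result = full & MASK32
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(full > MASK32),            # 1 = no borrow, 0 = borrow
        # Signed overflow: operands differ in sign and result sign differs from a.
        "V": (((a ^ b) & (a ^ result)) >> 31) & 1,
    }
    return result, flags


print(sub_flags(3, 5))            # negative result: N=1, C=0
print(add_flags(0x7FFFFFFF, 1))   # signed overflow: N=1, V=1
```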

Banked Registers:

ARM Architecture

 Total 37 registers = 30 general purpose + 6 status + 1 PC
 A different set of registers is visible in each mode of operation
 User and System modes use the same set of registers
 Shaded registers (banked registers) are hidden from User/System mode and are available only in exception modes
 R13 = stack pointer (SP)
 R14 = link register (LR) -> holds the return address of a subroutine when it is called with the BL instruction:
BL<cc> subroutine_label (LR automatically stores the return address)
 Each exception mode has its own SP and LR
 The return can be done in two ways:
 MOV PC, LR or
 BX LR

ARM Family and Cores

| ARM family | Core | Features | ARM ISA version | Thumb version |
|---|---|---|---|---|
| ARM7 | ARM7TDMI | 3-stage pipeline, Thumb state | ARMv4T | v1 |
| ARM7 | ARM720T | As ARM7TDMI, plus cache | ARMv4T | |
| ARM7 | ARM740T | As ARM7TDMI, plus cache | ARMv4T | |
| ARM9 | ARM920T | 5-stage pipeline, Thumb, data and instruction caches, MMU | ARMv4T | |
| ARM9 | ARM922T | 5-stage pipeline, Thumb, data and instruction caches, MMU | ARMv4T | |
| ARM9 | ARM946E | 5-stage pipeline, Thumb, enhanced DSP instructions, caches, MPU | ARMv5TE | |
| ARM9 | ARM926EJ | 5-stage pipeline, Thumb, Jazelle DBX, enhanced DSP instructions, caches, MMU | ARMv5TEJ | |
| ARM11 | ARM1156T2(F) | 8-stage pipeline, SIMD, Thumb-2, VFP, enhanced DSP instructions | ARMv6T2 | v2 |

Latest => ARM Cortex series: Profile A, Profile R, Profile M



ARM Data Processing
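 Syntax: <opcode>{<cc>}{S} Rd, Rn, op2
 ‘op2’ normally comes from the barrel shifter and can be an immediate value, a register (Rm), or a register shifted/rotated by an immediate amount or by another register (Rs)
 Rm and Rs should not be the PC (r15) when ‘op2’ uses the shift/rotate-by-register form
 Shifts and rotates affect the N, Z and C flags (when the ‘S’ suffix is used)
 The immediate (#) shift/rotate amount is a 5-bit unsigned integer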

ARM - The Barrel Shifter
 LSL (Logical Shift Left): CF <- Destination <- 0. Multiplication by a power of 2.
 LSR (Logical Shift Right): 0 -> Destination -> CF. Division by a power of 2.
 ASR (Arithmetic Shift Right): Destination -> CF, with the sign bit copied into the vacated positions. Division by a power of 2, preserving the sign bit.
 ROR (Rotate Right): Destination -> CF. Bit rotate with wrap-around from LSB to MSB.
 RRX (Rotate Right Extended): Destination -> CF. Single-bit rotate with wrap-around from CF to MSB.
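The five barrel shifter operations can be modelled in Python as follows. This is a minimal sketch that assumes shift amounts between 1 and 31 (the special encodings for 0 and 32 are deliberately left out); each function returns the shifted value together with the carry-out bit (CF). The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left by n (1..31): zeros enter at the LSB."""
    x &= MASK32
    carry = (x >> (32 - n)) & 1          # last bit shifted out of the MSB
    return (x << n) & MASK32, carry


def lsr(x, n):
    """Logical shift right by n (1..31): zeros enter at the MSB."""
    x &= MASK32
    carry = (x >> (n - 1)) & 1           # last bit shifted out of the LSB
    return x >> n, carry


def asr(x, n):
    """Arithmetic shift right by n (1..31): the sign bit is replicated."""
    x &= MASK32
    carry = (x >> (n - 1)) & 1
    sign_fill = (MASK32 << (32 - n)) & MASK32 if x >> 31 else 0
    return (x >> n) | sign_fill, carry


def ror(x, n):
    """Rotate right by n (1..31): bits leaving the LSB re-enter at the MSB."""
    x &= MASK32
    result = ((x >> n) | (x << (32 - n))) & MASK32
    return result, result >> 31          # carry = last bit rotated round


def rrx(x, carry_in):
    """Rotate right by one bit through the carry flag."""
    x &= MASK32
    return (carry_in << 31) | (x >> 1), x & 1


print(lsl(5, 3))                 # (40, 0): 5 * 8
print(hex(asr(0xFFFFFFF8, 2)[0]))  # 0xfffffffe: -8 / 4 = -2
```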

ARM Data Processing Instructions

 CMP, CMN, TST and TEQ always update the flags (even without the ‘S’ suffix) and do not alter any register. They use only Rn and ‘op2’.
 MOV and MVN use only two operands, i.e. Rd and ‘op2’.

ARM Immediate Operand
Immediate Operand (32-bit):
 ARM cannot generate all 32-bit constants (32-bit immediate data)
 The instruction code contains only 12 bits to specify a 32-bit constant
 Valid 32-bit constants are obtained by rotating an 8-bit constant right by an even number of positions, i.e. 0, 2, 4, ..., 30

 32-bit constants from a given 8-bit value (Imm) and 4-bit rotate code (Rotate):
 if Imm=0x40, Rotate=0xD => rotate right by 26 => 32-bit constant = #4096 (0x00001000)
 if Imm=0xFF, Rotate=0x8 => rotate right by 16 => 32-bit constant = #0x00FF0000
 The amount of rotation is twice the value of the 4-bit “rotate” field

ARM Immediate Operand
Range of 32-bit constants for even rotations, e.g. #0, #2, ..., #30

 Valid 32-bit constants: 0x000000FF, 0x00000104, 0x0000FF00, 0xF000000F
 Valid through MVN only (the bitwise complement 0xF000000F is encodable): 0x0FFFFFF0
 Invalid 32-bit constants: 0x00000102, 0x0000FF04
 Examples: (i) MOV R1, #0x00000104 (ii) MVN R2, #0xFF000000 (iii) MVN R3, #0xFC000000
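The encoding rule (an 8-bit value rotated right by twice a 4-bit field) is easy to test in Python. The sketch below uses illustrative helper names (`decode_immediate`, `encode_immediate`); it simply tries all 16 rotations and is not meant to mirror how any particular assembler searches.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def decode_immediate(imm8, rotate4):
    """Constant represented by an 8-bit value and a 4-bit rotate field."""
    return ror32(imm8, 2 * rotate4)


def encode_immediate(value):
    """Return (imm8, rotate4) if value is encodable, otherwise None."""
    value &= MASK32
    for rotate4 in range(16):
        # Undo the rotate-right by rotating left by the same amount.
        imm8 = ror32(value, 32 - 2 * rotate4)
        if imm8 <= 0xFF:
            return imm8, rotate4
    return None


print(decode_immediate(0x40, 0xD))          # 4096
print(hex(decode_immediate(0xFF, 0x8)))     # 0xff0000
print(encode_immediate(0x00000104))         # (65, 15), i.e. 0x41 rotated right by 30
print(encode_immediate(0x00000102))         # None: not encodable
```

A constant such as 0x0FFFFFF0 returns None here, but its complement 0xF000000F is encodable, which is the case where MVN is used instead of MOV.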
Data processing:
 ADD R9, R5, R5, LSL #3 ; R9 = R5 + (R5*8) = 9*R5
 RSB R9, R5, R5, LSR #3 ; R9 = (R5/8) - R5
 MOV R12, R4, ROR R3 ; R12 = R4 rotated right by the value in R3
 CMP R7, R5 ; update flags after (R7 - R5)

Conditional Execution:
 ARM instructions can be made to execute conditionally by postfixing them with the appropriate condition code (e.g. MOVEQ R0, R1)
 The condition checks the status of the appropriate flags
 If the condition is true, the instruction executes normally; otherwise it is not executed
 Advantage => fewer branches, hence better pipeline performance and higher code density, leading to higher instruction throughput

ARM Conditional Execution
 Set the flags, and then use various condition codes (here r0 = a, r1 = x):
        CMP   r0, #0          ; if (a==0) x=0;
        MOVEQ r1, #0
        MOVGT r1, #1          ; if (a>0) x=1;
 A sequence of conditional compare instructions:
        CMP   r0, #4          ; if (a==4 or a==10) x=0;
        CMPNE r0, #10
        MOVEQ r1, #0
 Reduces the number of instructions. Example (here r1 = a, r2 = b):
        while (a != b) {
            if (a > b) a = a - b; else b = b - a;
        }
Without conditional execution:
loop:     CMP   r1, r2
          BEQ   finish
          BLT   lessthan
          SUB   r1, r1, r2
          B     loop
lessthan: SUB   r2, r2, r1
          B     loop
finish:
With conditional execution:
loop1:    CMP   r1, r2
          SUBGT r1, r1, r2
          SUBLT r2, r2, r1
          BNE   loop1
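The conditional-execution version of the loop can be traced with a small Python model. This sketch assumes positive 32-bit inputs (with zero or negative values the subtraction loop does not terminate); `cmp_flags`, `CONDITIONS` and `gcd_conditional` are names invented for the illustration. It evaluates the EQ, NE, GT and LT conditions from the N, Z and V flags in the way the condition codes are defined.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags produced by CMP a, b (a - b on 32-bit values)."""
    a &= MASK32
    b &= MASK32
    full = a + ((~b) & MASK32) + 1
    result = full & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(full > MASK32),
        "V": (((a ^ b) & (a ^ result)) >> 31) & 1,
    }


# Condition codes expressed as tests on the flags.
CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
}


def gcd_conditional(a, b):
    """Trace of: CMP r1,r2 / SUBGT r1,r1,r2 / SUBLT r2,r2,r1 / BNE loop1."""
    r1, r2 = a, b
    while True:
        flags = cmp_flags(r1, r2)              # CMP r1, r2
        if CONDITIONS["GT"](flags):            # SUBGT r1, r1, r2
            r1 = (r1 - r2) & MASK32
        if CONDITIONS["LT"](flags):            # SUBLT r2, r2, r1
            r2 = (r2 - r1) & MASK32
        if not CONDITIONS["NE"](flags):        # BNE loop1 falls through on EQ
            return r1


print(gcd_conditional(12, 18))  # 6
```

Note that SUBGT and SUBLT carry no ‘S’ suffix, so both test the flags left by the single CMP; at most one of them executes per iteration.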

ARM Branch Instructions
 B<cc> label : branch to label
(MOV LR, PC can be used before this instruction to store a return address)
 BL<cc> subroutine_label : branch with link (LR automatically stores the return address)
The 24-bit offset field of the instruction is shifted left by 2 to give a 26-bit effective byte offset (i.e. a total range of 2^26 bytes)
 ±32 Mbyte range
 How to perform longer branches? (use BX Rm)

 BX Rm : branch with exchange
 If the LSB of Rm is 1, the processor switches to Thumb state; otherwise it remains in ARM state. PC = Rm & 0xFFFFFFFE
 Useful for interworking between ARM and Thumb code
 BLX Rm : similar to BX Rm, but additionally stores the return address in LR
 BLX label :
 Branches within a ±32 Mbyte range, with LR storing the return address
 Sets T=1 and enters Thumb state
 The T bit must not be changed by writing directly to the CPSR in order to change the state of the CPU
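The offset arithmetic for B and BL can be sketched in Python. The helper names are hypothetical; the sketch relies on the fact that, in ARM state, the PC reads as the address of the branch instruction plus 8, so the 24-bit field holds (target - (instruction address + 8)) / 4 as a signed number.

```python
def encode_branch_offset(instr_addr, target_addr):
    """24-bit offset field for a B/BL at instr_addr that jumps to target_addr."""
    byte_offset = target_addr - (instr_addr + 8)   # PC is two instructions ahead
    if byte_offset % 4 != 0:
        raise ValueError("target must be word aligned")
    word_offset = byte_offset >> 2
    if not -(1 << 23) <= word_offset < (1 << 23):
        raise ValueError("target is outside the +/-32 Mbyte range")
    return word_offset & 0xFFFFFF                  # two's complement, 24 bits


def decode_branch_target(instr_addr, field24):
    """Target address for a B/BL at instr_addr with the given 24-bit field."""
    if field24 & 0x800000:                         # sign-extend the field
        field24 -= 1 << 24
    return (instr_addr + 8 + (field24 << 2)) & 0xFFFFFFFF


print(hex(encode_branch_offset(0x8000, 0x8000)))   # 0xfffffe: branch to itself
print(hex(decode_branch_target(0x8000, 0x000002))) # 0x8010
```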
ARM Multiply
 Normal (32-bit result) and long (64-bit result) multiplication
 Syntax:
 MUL{<cc>}{S} Rd, Rm, Rs ; Rd = Rm * Rs
 MLA{<cc>}{S} Rd, Rm, Rs, Rn ; Rd = (Rm * Rs) + Rn
 [U|S]MULL{<cc>}{S} RdLo, RdHi, Rm, Rs ; RdHi:RdLo = Rm * Rs
 [U|S]MLAL{<cc>}{S} RdLo, RdHi, Rm, Rs ; RdHi:RdLo = (Rm * Rs) + RdHi:RdLo

 MUL and MLA truncate the result to the least significant 32 bits
 Rd must be a different register from Rm
 Rs and Rm can be swapped (multiplication is commutative), which helps to satisfy the restriction above
 N and Z flags are affected (only if the ‘S’ suffix is used)
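The difference between the truncating and the long multiplies can be seen in a short Python model. This is an illustrative sketch only; the function names mirror the mnemonics but are not part of any library, and the long forms return the pair (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a signed two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(rm, rs):
    """MUL: product truncated to the least significant 32 bits."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """MLA: multiply-accumulate, truncated to 32 bits."""
    return (rm * rs + rn) & MASK32


def umull(rm, rs):
    """UMULL: unsigned 64-bit product as (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def smull(rm, rs):
    """SMULL: signed 64-bit product as (RdLo, RdHi)."""
    product = (to_signed32(rm) * to_signed32(rs)) & MASK64
    return product & MASK32, product >> 32


print(mul(0x10000, 0x10000))                          # 0: upper bits are lost
print([hex(v) for v in umull(0xFFFFFFFF, 0xFFFFFFFF)])  # ['0x1', '0xfffffffe']
print([hex(v) for v in smull(0xFFFFFFFF, 0xFFFFFFFF)])  # ['0x1', '0x0']
```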

ARM Load & Store Instructions
 Data movement between registers and memory
 Instruction format: opcode<cc><size> Rd, <address>
 Opcodes:
LDR STR ; 32-bit word load & store
LDRB STRB ; Byte load & store
LDRH STRH ; 16-bit halfword load & store
LDRSB ; Signed byte load
LDRSH ; Signed halfword load
 LDRB and LDRH copy 8-bit and 16-bit quantities from memory to the destination register and force the higher bits of the destination register to zero. For LDRSB and LDRSH the higher bits of the destination register are filled with the sign bit.

 Address:
 Formed from a base register (Rn) and an offset
 The base register can be any general purpose register, including the PC
 The offset can be (for 32-bit word and unsigned byte):
 an immediate (# 12-bit value, added or subtracted)
 a register, or
 a scaled register (Rm with shift/rotate by # immediate only)
 Offset for H, SH and SB: an immediate value (# 8-bit) or a register
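The zero-extension of LDRB/LDRH versus the sign-extension of LDRSB/LDRSH can be demonstrated in Python with `int.from_bytes`. This is a sketch under the assumption of little-endian memory modelled as a `bytes` object; `load_unsigned` and `load_signed` are names chosen for the example.

```python
def load_unsigned(memory, address, size):
    """LDRB (size=1) / LDRH (size=2): upper bits of the register become zero."""
    return int.from_bytes(memory[address:address + size], "little")


def load_signed(memory, address, size):
    """LDRSB (size=1) / LDRSH (size=2): upper bits are copies of the sign bit."""
    value = int.from_bytes(memory[address:address + size], "little", signed=True)
    return value & 0xFFFFFFFF            # view the result as a 32-bit register


memory = bytes([0x80, 0xFF, 0x7F, 0x00])

print(hex(load_unsigned(memory, 0, 1)))  # 0x80        (LDRB)
print(hex(load_signed(memory, 0, 1)))    # 0xffffff80  (LDRSB)
print(hex(load_unsigned(memory, 0, 2)))  # 0xff80      (LDRH)
print(hex(load_signed(memory, 0, 2)))    # 0xffffff80  (LDRSH)
```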

Load & Store Instructions
 Choice of indexing: pre-index, pre-index with write-back, and post-index addressing
 Post-index and pre-index with write-back modify the base register value.

Examples:
 LDR R8, [R3, #-3] ; Load R8 from address R3-3 (pre-index); R3 remains unchanged
 LDR R3, [R9], #4 ; Load R3 from address R9, then R9 = R9+4 (post-index)
 STRB R7, [R6, #-1]! ; Store byte from R7 at R6-1, then R6 = R6-1 (pre-index with write-back)
 LDREQB R0, [PC, -R2] ; Load byte into R0 from PC-R2 if the EQ condition is true
 LDR R11, [R3, R5, LSL #2] ; Load R11 from R3 + R5*4

Note: By default, we assume ‘little endian’ format, where the least significant byte of a word is stored at the lowest address. In ‘big endian’ format the least significant byte of a word is stored at the highest address.


ARM Pre & Post indexing
In the examples below, r0 = 0x5 is the source register for STR and r1 = 0x200 is the base register.
 Pre-indexed: STR r0, [r1, #12]
The value 0x5 is stored at address 0x200 + 12 = 0x20c; r1 remains 0x200.
 Pre-indexed with write-back: STR r0, [r1, #12]! => same store, and r1 = 0x20c after the instruction
 Post-indexed: STR r0, [r1], #12
The value 0x5 is stored at the original base address 0x200; r1 is then updated to 0x200 + 12 = 0x20c.
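The three indexing modes differ only in which address is used and whether the base register is updated. The Python sketch below models memory as a dictionary keyed by address and registers as a dictionary keyed by name; `store_word` and the mode strings are hypothetical names used only for this illustration.

```python
def store_word(memory, regs, src, base, offset, mode):
    """Model STR src, [base, #offset] in one of three indexing modes.

    mode is "pre" (no write-back), "pre_wb" (pre-index with write-back)
    or "post" (post-index). Returns the address that was written.
    """
    if mode == "post":
        address = regs[base]              # use the original base address
    else:
        address = regs[base] + offset     # offset applied before the access
    memory[address] = regs[src]
    if mode in ("pre_wb", "post"):
        regs[base] = regs[base] + offset  # base register is updated
    return address


for mode in ("pre", "pre_wb", "post"):
    regs = {"r0": 0x5, "r1": 0x200}
    memory = {}
    address = store_word(memory, regs, "r0", "r1", 12, mode)
    print(mode, hex(address), hex(regs["r1"]))
```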

ARM Load/Store Multiple
 Multiple register load and store with a single instruction
 Syntax:
 LDM<cc><add_mode> Rn{!}, {registers}
 STM<cc><add_mode> Rn{!}, {registers}
where add_mode is IA | IB | DA | DB (increment/decrement, after/before)
Rn (base address) must not be the PC, and must not appear in the register list if ! (write-back) is specified
 Block memory copy: R9 -> points to the start of the source, R4 -> total number of words to be copied, R10 -> points to the start of the destination
 We first transfer data in bunches (say 8 words) using LDM/STM and the register set R0-R7
 If the last bunch has fewer than 8 words, the remaining words are transferred using LDR and STR (one word at a time)

ARM Load/Store Multiple
        MOV   R11, R4            ; copy the word count from R4 into R11
loop1:  CMP   R11, #8            ; fewer than 8 words left?
        BLO   skip               ; yes: copy the remainder word by word
        LDMIA R9!, {R0-R7}       ; load eight 32-bit words, advance R9
        STMIA R10!, {R0-R7}      ; store them, advance R10
        SUBS  R11, R11, #8       ; eight fewer words to copy
        B     loop1
skip:   CMP   R11, #0            ; is R11 zero?
        BEQ   halt               ; end if no words remain
loop2:  LDR   R0, [R9], #4       ; word-by-word transfer (post-index)
        STR   R0, [R10], #4
        SUBS  R11, R11, #1
        BNE   loop2
halt:   B     halt               ; stay here when the copy is done
        END
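The control flow of the block copy can be followed in a Python model. This is a sketch, assuming memory is a dictionary keyed by byte address; `block_copy` and its return value (the number of 8-word bursts and of single-word transfers) are invented for the illustration.

```python
def block_copy(memory, r9, r10, r4):
    """Copy r4 words from address r9 to address r10, eight at a time.

    Returns (bursts, singles): how many 8-word LDM/STM transfers and how
    many single LDR/STR transfers were needed.
    """
    r11 = r4                                          # MOV R11, R4
    bursts = singles = 0
    while r11 >= 8:                                   # CMP R11, #8 / BLO skip
        words = [memory[r9 + 4 * i] for i in range(8)]   # LDMIA R9!, {R0-R7}
        r9 += 32
        for i, word in enumerate(words):              # STMIA R10!, {R0-R7}
            memory[r10 + 4 * i] = word
        r10 += 32
        r11 -= 8                                      # SUBS R11, R11, #8
        bursts += 1
    while r11 != 0:                                   # CMP R11, #0 / BEQ halt
        memory[r10] = memory[r9]                      # LDR / STR, post-index #4
        r9 += 4
        r10 += 4
        r11 -= 1                                      # SUBS R11, R11, #1
        singles += 1
    return bursts, singles


memory = {0x1000 + 4 * i: i for i in range(19)}       # 19 source words
print(block_copy(memory, 0x1000, 0x2000, 19))         # (2, 3)
```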
ARM Stack Operations
Stack Operations:
SP replaces Rn; the add_mode values for stacks are FD | FA | ED | EA
 F and E signify whether SP points to a location that is full or empty
 The stack is either ascending (growing towards higher memory addresses) or descending (growing towards lower memory addresses)
 One of the following pairs is used in an interrupt routine or handler: STMFD/LDMFD, STMFA/LDMFA, STMED/LDMED, STMEA/LDMEA

Example: Let R1=0x00000002, R4=0x00000003, SP=0x00000814
 STMFD sp!, {R1,R4} ; full descending stack write
After the instruction: SP=0x0000080c, mem[0x810]=R4, mem[0x80c]=R1
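The STMFD example can be reproduced with a short Python model of a full descending stack. This is only a sketch: memory and registers are dictionaries, register numbers stand for R0..R15, and the function names follow the mnemonics for readability. In a register list the lowest-numbered register always ends up at the lowest address.

```python
def stmfd(memory, regs, sp, reg_list):
    """Push registers on a full descending stack; returns the new SP."""
    for reg in sorted(reg_list, reverse=True):   # highest register first
        sp -= 4                                  # decrement before the store
        memory[sp] = regs[reg]
    return sp


def ldmfd(memory, regs, sp, reg_list):
    """Pop registers from a full descending stack; returns the new SP."""
    for reg in sorted(reg_list):                 # lowest register first
        regs[reg] = memory[sp]
        sp += 4                                  # increment after the load
    return sp


regs = {1: 0x00000002, 4: 0x00000003}
memory = {}
sp = stmfd(memory, regs, 0x814, [1, 4])
print(hex(sp), {hex(a): v for a, v in sorted(memory.items())})
# 0x80c {'0x80c': 2, '0x810': 3}
```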

ARM Miscellaneous Instr.
 SWP<cc> Rd, Rm, [Rn]
 Swaps a word between memory and registers
 tmp = mem32[Rn], mem32[Rn] = Rm and Rd = tmp (Rd and Rm may be the same register)
 SWPB<cc> Rd, Rm, [Rn] => swaps a byte
The swap instruction is atomic: its read and write of the memory location cannot be separated by another bus access. Useful in implementing semaphores and mutual exclusion.

 Count leading zeros: CLZ<cc> Rd, Rm

CPSR instructions:
 MRS{<cc>} Rd, <CPSR | SPSR> ; copy from PSR to Rd
 MSR{<cc>} <CPSR | SPSR>, Rm ; copy from Rm to PSR

Suffixes f, s, x and c can be used to modify the respective field (flags, status, extension, control) of the CPSR/SPSR
 MSR cpsr_c, R0 ; update only the control byte of the CPSR
 MSR cpsr_fsc, R0 ; update the flags, status and control bytes of the CPSR


Assembler Pseudo Instructions:
 LDR Rd, =constant
If the constant can be constructed with MOV or MVN, then that instruction is actually generated. Otherwise the assembler generates a PC-relative LDR instruction that reads the constant from the literal pool.
You must ensure that there is a literal pool within a ±4KB range.
 LDR Rd, =label
Stores the address of the label in the literal pool; upon execution of the instruction, Rd is loaded with that address.
