● WAR and WAW are "false" name dependences that arise only because the ISA provides a limited number of registers.
● RAW is a "true" data dependency because the reader needs the result of the writer.
● Let us consider an OOO MIPS pipeline as follows:
  X0: ALU execute stage (1 cycle)
  M0, M1: 2-stage memory unit (2 cycles)
  Y0, Y1, Y2, Y3: 4-stage multiplier (4 cycles)
Register Renaming: Introduction
● Consider a program sequence with two mul and two add immediate operations (a concrete sequence is sketched after the legend below):
  o i: Indicates the issue stage is in progress but might be delayed or waiting due to hazards (e.g., data dependencies or structural hazards).
  o I: Indicates the instruction has successfully issued and execution can proceed.
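For concreteness, a sequence of this kind (the register numbers are assumed purely for illustration) could be:

  0: MUL  R2, R0, R1
  1: ADDI R4, R2, 4    (RAW on R2 with instruction 0)
  2: MUL  R6, R4, R5   (RAW on R4 with instruction 1)
  3: ADDI R4, R7, 8    (WAW with instruction 1 and WAR with instruction 2, both on R4)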
Register Renaming
● Instructions 0 and 1, and 1 and 2, show RAW hazards: these true dependences cannot be avoided, hence stall cycles are introduced.
Stall cycles
The "r" in the pipeline diagram typically denotes a stall or bypass condition: the instruction occupies the stage but cannot make progress until the hazard is resolved.
Register Renaming: Introduction
● Instructions 1 and 3 show a WAW hazard, and 2 and 3 show a WAR hazard.
● Let's say this is executing on the in-order fetch, out-of-order issue, out-of-order execute, out-of-order writeback, in-order commit pipe.
● If instruction 3 executes and writes R4 before instruction 1 has produced it, instruction 2 reads the wrong value.
● Hence, stall cycles are added so that instruction 3 writes R4 and commits in order.
Stall cycles
Register Renaming: Introduction
● Adding more registers removes the false dependences, but the architectural name space is limited. If R8 is added to the register space, instruction 3 can write R8 instead of R4.
  – Registers: A larger namespace requires more bits in the instruction encoding.
    • 32 registers = 5 bits, 128 registers = 7 bits.
Stall cycles can be removed
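With the extra register name available, instruction 3 can be renamed to write R8 instead of R4 (continuing the illustrative sequence above):

  0: MUL  R2, R0, R1
  1: ADDI R4, R2, 4
  2: MUL  R6, R4, R5
  3: ADDI R8, R7, 8    (the WAW with 1 and the WAR with 2 disappear, so 3 need not stall)

Any later instruction that should read the value produced by instruction 3 must now read R8 instead of R4, which is exactly the bookkeeping that hardware register renaming automates.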
Register Renaming: Introduction
● Register Renaming: Change the naming of registers in hardware to eliminate WAW
and WAR hazards.
● 2 Schemes:
  i. Pointers in the Instruction Queue (IQ) / ReOrder Buffer (ROB):
     This approach to register renaming uses pointers to track where data resides in temporary storage structures such as the Instruction Queue or the ReOrder Buffer.
  ii. Values in the Instruction Queue (IQ) / ReOrder Buffer (ROB):
     This approach to register renaming stores values directly in temporary structures such as the Instruction Queue or ReOrder Buffer during instruction execution.
Note: IO2I uses pointers in the IQ and ROB.
IO2I: Register Renaming with Pointers in IQ and ROB
[Datapath diagram: F, D, IQ, issue, functional units X0 / L0-L1 / S0 / Y0-Y3, writeback W, commit C, with the RT, FL, SB, PRF, ARF, ROB, and FSB structures attached.]
• All data structures same as in the base IO2I pipeline, except:
  – Add two fields to the ROB
  – Add a Rename Table (RT) and a Free List (FL) of registers
  – Increase the size of the PRF to provide more register "names"
Roles of SB and IQ
The scoreboard (SB)
• It maintains a table with the following information for every instruction in the pipeline:
  • Instruction Status: tracks the instructions (e.g., issued, executed, completed).
  • Functional Unit Status: keeps track of which functional units (e.g., ALU, FPU) are busy and their current operations.
  • Register Status: tracks which registers are being read or written and whether they are ready for use.
Instruction Queue (IQ)
• The IQ allows instructions to be fetched and decoded ahead of time while they wait for execution resources to become available.
• Instructions wait in the IQ until they are ready for execution (all operands are available and no structural hazards exist).
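As a rough sketch of this bookkeeping (class and field names are illustrative, not taken from any particular design), the scoreboard can be thought of as a table of per-instruction records, and the IQ as a list of such records waiting to become ready:

    from dataclasses import dataclass

    @dataclass
    class ScoreboardEntry:
        """One scoreboard row, kept for every in-flight instruction."""
        opcode: str
        dest: str                  # register being written (if any)
        sources: tuple             # registers being read
        functional_unit: str = ""  # e.g. "ALU" or "FPU", assigned at issue
        status: str = "issued"     # "issued" -> "executing" -> "completed"

    def ready_to_execute(entry, busy_units, ready_regs):
        """An IQ entry may leave the queue once all operands are available
        and its functional unit is free (no structural hazard)."""
        return (all(r in ready_regs for r in entry.sources)
                and entry.functional_unit not in busy_units)

    entry = ScoreboardEntry("ADD", dest="R1", sources=("R2", "R3"), functional_unit="ALU")
    print(ready_to_execute(entry, busy_units={"FPU"}, ready_regs={"R2", "R3"}))  # True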
Purpose of ARF
ARF (Architectural Register File)
• The ARF contains the logical or architectural registers specified by the instruction set architecture (ISA).
• These registers hold the committed state of the program, which reflects the values visible to the programmer or software.
• The ARF is updated only when instructions are committed (written back in program order).
Example Workflow Involving ARF
Instruction: ADD R1, R2, R3
1. Decode Stage: Logical registers (R1, R2, R3) are mapped to physical registers (e.g., P5, P6, P7) via the Rename Table.
2. Execution Stage: The operation uses the values in P6 and P7 and stores the result in P5.
3. Commit Stage: When the instruction is committed, the content of P5 is written back to R1 in the ARF.
Roles of FSB and ROB
FSB (Free List Buffer)
• Purpose: The FSB holds a list of available physical registers in the Physical Register File (PRF).
• When an instruction needs to write to a destination register, the register renaming unit allocates a new physical register from the FSB.
• After the instruction is committed (written back), the physical register it used can be returned to the FSB for reuse.
Reorder Buffer (ROB)
• The ROB tracks instructions in the pipeline and ensures they are retired (committed) in order.
• When an instruction is committed, the ROB signals that the old physical register (if any) can be freed and returned to the FSB.
Roles of RT and Free List (FL)
The Rename Table (RT)
• It holds the mapping between the logical (architectural) registers used in the instruction set architecture (ISA) and the physical registers in the processor. Register renaming is a technique used to eliminate false data dependencies (write-after-read and write-after-write hazards) and allow more instructions to execute in parallel.
• Purpose of RT: The Rename Table ensures that each instruction can receive a unique physical register, and thus allows instructions to proceed out of order without conflicts over register usage. The RT helps to dynamically allocate physical registers for the temporary storage of values that are produced during execution.
Free List (FL)
• The Free List (FL) keeps track of available physical registers in the system that have not been assigned to any instruction. This is essential for register renaming, as the processor needs to maintain a list of registers that are free to be used for the next instruction.
• Purpose of FL: The Free List ensures that the processor does not run out of physical registers while renaming. When a new instruction needs a physical register, the Free List is checked for availability. When an instruction completes, the physical register is returned to the Free List.
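Pulling these structures together, here is a minimal Python sketch of how the RT, FL, PRF, and ARF could cooperate for the earlier ADD R1, R2, R3 example. The physical register name P5 and the operand values are assumed, and sources with no in-flight producer simply read the committed ARF value; this is a simplification, not a description of any specific machine:

    # Architectural state visible to software (values are made up).
    ARF = {"R1": 0, "R2": 7, "R3": 3}

    PRF = {}                              # physical register file
    free_list = ["P5", "P6", "P7", "P8"]  # Free List of unassigned physical registers
    rename_table = {}                     # Rename Table: architectural -> physical

    def rename_dest(reg):
        # A destination gets a fresh physical register from the Free List.
        phys = free_list.pop(0)
        old = rename_table.get(reg)   # remember the previous mapping, if any
        rename_table[reg] = phys
        return phys, old

    # --- ADD R1, R2, R3 --------------------------------------------------
    # Rename: the destination gets a new physical register (e.g. R1 -> P5);
    # sources with no in-flight producer just read the committed ARF values.
    dest, prev = rename_dest("R1")
    val2, val3 = ARF["R2"], ARF["R3"]

    # Execute: write the result into the physical register, not the ARF.
    PRF[dest] = val2 + val3

    # Commit (in program order): copy the result to the ARF and return the
    # old physical register for R1, if there was one, to the Free List.
    ARF["R1"] = PRF[dest]
    if prev is not None:
        free_list.append(prev)

    print(ARF)           # {'R1': 10, 'R2': 7, 'R3': 3}
    print(rename_table)  # {'R1': 'P5'}

The key point is that the destination never overwrites an architectural register directly; the old mapping is recycled only at commit, which is what makes WAW and WAR hazards disappear.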
Pointers in the Instruction Queue (IQ) / ReOrder Buffer (ROB):
• In this approach, pointers are used to track the location of data in
temporary storage structures, such as the IQ or ROB.
• Key Points:
• Logical registers are mapped to physical registers or ROB entries through
pointers.
• The pointers provide indirection, allowing the pipeline to resolve
dependencies by accessing the appropriate structure without directly
modifying the architectural register file (ARF).
• Data is not directly stored in these structures but accessed via these pointers.
• Advantages:
• Reduced complexity of managing data within the structures.
• Efficient handling of large amounts of speculative data.
• Challenges:
• Extra indirection adds minor overhead to dependency resolution.
Example
In this scheme, pointers are used to refer to where the data is stored.
Consider an example:
• Instructions:
  1. I1: R1 = R2 + R3
  2. I2: R4 = R1 + R5
• Working:
  • I1 is decoded and assigned a pointer in the ROB (e.g., ROB[0]).
  • R1 is mapped to ROB[0], indicating that the result of I1 will be stored there once available.
  • I2 depends on the value of R1. Instead of stalling, it uses the pointer to ROB[0] to track the value of R1.
  • When the execution of I1 completes, the result is written back to ROB[0]. I2 can then access the value of R1 through the pointer.
• Reduces storage requirements because only pointers are stored, not the actual data.
• Ideal for architectures with speculative execution, as data dependencies are tracked dynamically.
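A small Python sketch of the pointer-based bookkeeping for this example. The tag names P0/P1 and the operand values are made up; in this sketch the values live in a separate PRF (matching the datapath above, which includes a PRF) and the rename table and operands carry only pointers that must be dereferenced:

    ARF = {"R2": 7, "R3": 3, "R5": 5}
    PRF = {}                           # physical register file: tag -> value
    free_tags = ["P0", "P1", "P2", "P3"]
    rename_table = {}                  # architectural reg -> physical tag (a pointer)
    ROB = []                           # records destination tags for in-order commit

    def dispatch(dest, srcs):
        # Capture sources first: a pointer (tag) if a producer is in flight,
        # otherwise the committed value read from the ARF.
        operands = [("ptr", rename_table[s]) if s in rename_table else ("val", ARF[s])
                    for s in srcs]
        tag = free_tags.pop(0)
        rename_table[dest] = tag
        ROB.append({"dest": dest, "tag": tag})   # used at commit (not shown)
        return tag, operands

    def read_operand(op):
        # The extra indirection of this scheme: dereference the pointer into the PRF.
        kind, x = op
        return PRF[x] if kind == "ptr" else x

    # I1: R1 = R2 + R3   -> R1 is renamed to tag P0
    t1, ops1 = dispatch("R1", ["R2", "R3"])
    # I2: R4 = R1 + R5   -> carries the pointer P0 for R1 instead of a value
    t2, ops2 = dispatch("R4", ["R1", "R5"])

    PRF[t1] = sum(read_operand(o) for o in ops1)   # I1 executes: P0 = 10
    PRF[t2] = sum(read_operand(o) for o in ops2)   # I2 dereferences P0: P1 = 15

    print(PRF)   # {'P0': 10, 'P1': 15}

The dereference inside read_operand is the "additional step" that the comparison with the value-based scheme refers to later.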
2. IO2I: Register Renaming with Values in IQ and ROB
[Datapath diagram: same as the previous pipeline, but with no PRF and no FL; values are held in the ROB.]
• All data structures same as previous, except:
  – Modified ROB (values instead of register specifiers)
  – Modified RT
  – Modified IQ
  – No FL
  – No PRF; values merged into the ROB
Modifications
• ReOrder Buffer (ROB):
  • Previously: stored register specifiers and pointers.
  • Now: directly stores values instead of register specifiers.
• Rename Table (RT):
  • Modified to adapt to the changes in the ROB.
  • Tracks which architectural registers map to the values stored in the ROB.
• Instruction Queue (IQ):
  • Modified to accommodate the new structure of the ROB.
  • Likely updated to refer directly to ROB entries for both operand fetching and instruction dispatch.
• No Free List (FL):
  • The free list is eliminated, meaning there is no need to manage physical register allocation explicitly.
• No Physical Register File (PRF):
  • The PRF is removed, and its functionality is merged into the ROB. The ROB now serves as the primary storage for instruction results and temporary values.
Example
In this scheme, actual values are stored directly in the ROB or IQ.
Example Scenario:
• Instructions:
  1. I1: R1 = R2 + R3
  2. I2: R4 = R1 + R5
• Pipeline Execution:
  • I1 is decoded, and its operands (R2, R3) are fetched from the register file.
  • During execution, the result of I1 (e.g., R1 = 10) is directly written into ROB[0].
  • I2 depends on R1. It directly reads the value 10 from ROB[0] rather than waiting for it to be written back to the register file.
Key Benefits:
• Faster execution because the dependent instruction (I2) can directly access the result in the ROB without additional indirection.
• Reduces latency in out-of-order execution pipelines.
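A matching Python sketch for the value-based scheme (same made-up operand values as before; note there is no PRF and no free list here, the ROB entry itself is the storage for the result):

    ARF = {"R2": 7, "R3": 3, "R5": 5}
    ROB = []              # each entry holds the destination register AND the value itself
    rename_table = {}     # architectural reg -> ROB index

    def dispatch(dest, srcs):
        # Sources whose producer is still in flight name the ROB entry directly.
        operands = [("rob", rename_table[s]) if s in rename_table else ("val", ARF[s])
                    for s in srcs]
        ROB.append({"dest": dest, "value": None})
        rename_table[dest] = len(ROB) - 1
        return len(ROB) - 1, operands

    def read_operand(op):
        kind, x = op
        return ROB[x]["value"] if kind == "rob" else x   # value read straight out of the ROB

    # I1: R1 = R2 + R3   -> its result will live in ROB[0]
    i1, ops1 = dispatch("R1", ["R2", "R3"])
    # I2: R4 = R1 + R5   -> reads ROB[0] directly once I1 has written it
    i2, ops2 = dispatch("R4", ["R1", "R5"])

    ROB[i1]["value"] = sum(read_operand(o) for o in ops1)   # R1 = 10
    ROB[i2]["value"] = sum(read_operand(o) for o in ops2)   # R4 = 15

    # Commit in program order: copy each value from the ROB into the ARF.
    for entry in ROB:
        ARF[entry["dest"]] = entry["value"]

    print(ARF)   # {'R2': 7, 'R3': 3, 'R5': 5, 'R1': 10, 'R4': 15}

Compared with the pointer sketch earlier, the only structural change is where the result lives; the cost is that every ROB entry must be wide enough to hold a full value.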
Advantages and Drawbacks
• Advantages
• Simplified Dependency Management:
• With values stored in the ROB, dependent instructions can directly fetch operands without indirection.
• Reduced Complexity:
• No need to manage physical registers explicitly (no FL or PRF).
• Unified Storage:
• ROB becomes the central point for all in-flight instruction data, reducing redundancy.
• Drawbacks
• ROB Size Limitation:
• ROB size may become a bottleneck since it now stores values instead of pointers.
• Increased ROB Complexity:
• Merging PRF functionality into the ROB increases its complexity and access latency.
• Scalability:
• The architecture may face challenges with scalability due to the centralized nature of the ROB.
Feature | Pointers in IQ/ROB | Values in IQ/ROB
Storage requirement | Lower, as only pointers are stored. | Higher, as actual values are stored.
Data access | Indirect; requires an additional step to access data. | Direct, as values are immediately available.
Structure size | Smaller IQ/ROB, reducing overhead. | Larger IQ/ROB to accommodate data.
Complexity of dependency tracking | Requires dereferencing pointers. | Simplified, as data is directly available.
Best use case | Systems prioritizing lower storage and simpler structures. | Systems requiring faster data access and minimal latency.
1. Explain how register renaming resolves Write-After-Write (WAW) and Write-After-Read (WAR) hazards in out-of-order execution. Provide an example with an instruction sequence.
2. Consider the following instruction sequence:
   I1: R1 = R2 + R3
   I2: R4 = R1 + R5
   I3: R1 = R6 + R7
   b. Illustrate the execution timeline of these instructions assuming the following conditions:
      - Execution latencies: ADD and SUB = 1 cycle, MUL = 2 cycles, DIV = 4 cycles, ADDI = 1 cycle.
      - No structural hazards exist, but data hazards and register renaming are applied.
      - Up to 4 instructions can be fetched, decoded, and issued per cycle, and up to 2 instructions can execute in parallel.
   c. Discuss how the use of register renaming and out-of-order execution improves the performance of this sequence compared to an in-order execution model. Highlight specific examples from the given instruction sequence.
Multithreaded architectures
• Hardware multithreading is a technique used in modern processors
to improve the utilization of computational resources and enhance
overall performance.
• In a single-threaded processor, when an instruction encounters a stall
(e.g., due to memory latency or pipeline hazards), the processor
remains idle until the stall is resolved.
• Hardware multithreading addresses this issue by enabling multiple
threads to share the same processor core, allowing the processor to
execute instructions from a different thread during stalls.
The ILP Wall
Multithreading
Hardware Multithreading is a technique used in modern processors to improve their efficiency and performance by
allowing them to execute multiple threads (smaller units of a program) simultaneously.
Processors are fast, but they often have to wait (e.g., for data from memory).
During this waiting time, they aren't doing useful work.
Hardware multithreading helps by keeping the processor busy with another thread while one thread is waiting.
Instruction-level parallelism vs Thread-level parallelism
Instruction-level parallelism exploits very fine-grain independent instructions.
Thread-level parallelism is explicitly represented in the program by the use of multiple threads of execution that are inherently parallel.
Goal: use multiple instruction streams to improve either (or both)
  1. Throughput of computers that run many programs
  2. Execution time of multi-threaded programs
Thread-level parallelism potentially allows huge speedups:
  – There is no complex superscalar architecture that scales poorly.
  – There is no particular requirement for very complex compilers, as for VLIW.
  – Instead, the burden of identifying and exploiting the parallelism falls mostly on the programmer:
    • Programming with multiple threads is much more difficult than sequential programming.
    • Debugging parallel programs is incredibly challenging.
  – But it's pretty easy to build a big, fast parallel computer.
  – However, the main reason we have not had widely-used parallel computers in the past is that they are too difficult (expensive), time consuming (expensive) and error prone (expensive) to program.
Costs of multithreading
Multithreading in a processor core
Find a way to “hide” true data dependency stalls, cache miss stalls,
and branch stalls by finding instructions (from other process threads)
that are independent of those stalling instructions
Multithreading – increase the utilization of resources on a chip by
allowing multiple processes (threads) to share the functional units of a
single processor
Processor must duplicate the state hardware for each thread – a separate
register file, PC, instruction buffer, and store buffer for each thread
The caches, TLBs, branch predictors can be shared (although the miss rates
may increase if they are not sized accordingly)
The memory can be shared through virtual memory mechanisms
Hardware must support efficient thread context switching
Thread scheduling policy
  – Decides which software thread is selected for which hardware thread context.
  – Each hardware thread context is mapped to a software thread.
  – Fixed mapping: one software thread runs continuously on one hardware thread context.
  – Round robin: the hardware switches among thread contexts in turn (a toy sketch of this policy follows).
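As an illustration of the duplicated per-thread state and the round-robin policy, here is a toy Python sketch (the class and field names are invented, and real hardware performs this selection in the issue logic, not in software):

    from dataclasses import dataclass, field

    @dataclass
    class HardwareThread:
        # Per-thread state the core must duplicate: PC, registers, etc.
        pc: int = 0
        regs: list = field(default_factory=lambda: [0] * 32)
        stalled: bool = False          # e.g. waiting on a cache miss

    def pick_next_thread(threads, last):
        """Round robin: starting after the last-issued thread, pick the first
        context that is not stalled; return None if every thread is stalled."""
        n = len(threads)
        for i in range(1, n + 1):
            cand = (last + i) % n
            if not threads[cand].stalled:
                return cand
        return None

    threads = [HardwareThread(), HardwareThread(stalled=True), HardwareThread()]
    current = 0
    for cycle in range(4):
        current = pick_next_thread(threads, current)
        print(f"cycle {cycle}: issue from thread {current}")
    # Thread 1 is skipped while stalled; issue alternates between threads 2 and 0.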
Types of Multithreading
Coarse-grain – switches threads only on costly stalls (e.g., L3 cache misses)
  Advantages – thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - The pipeline must be flushed and refilled on thread switches
[Timeline figure: coarse-grain execution alternating between Thread C and Thread D over time.]
Multithreaded Example: Sun's Niagara (UltraSPARC T1)
Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction).

Feature | Ultra III | Niagara
Data width | 64-b | 64-b
Clock rate | 1.2 GHz | 1.0 GHz
Cache (I/D/L2) | 32K/64K/(8M external) | 16K/8K/3M
Issue rate | 4 issue | 1 issue
Pipe stages | 14 stages | 6 stages
BHT entries | 16K x 2-b | None
TLB entries | 128I/512D | 64I/64D
Memory BW | 2.4 GB/s | ~20 GB/s
Transistors | 29 million | 200 million
Power (max) | 53 W | <60 W

[Niagara block diagram: eight MT SPARC pipes connected through a crossbar to a 4-way banked L2$, memory controllers, and shared I/O functions.]
Multicore Xbox 360 – "Xenon" processor
Aim is to provide game developers with a balanced and powerful platform
  – Three SMT processors, 32KB L1 D-cache & I-cache, 1MB unified L2 cache
  – Two SMT threads per core
  – 165M transistors total
  – 3.2 GHz, near-PowerPC ISA
  – 2-issue, 21-stage pipeline, with 128 128-bit registers
  – Weak branch prediction – supported by software hinting
  – In-order instruction execution
  – Narrow cores – 2 INT units, 2 128-bit VMX SIMD units, 1 of anything else
  – An ATI-designed 500 MHz GPU, 512MB of DDR3 DRAM
  – 337M transistors, 10MB framebuffer
  – 48 pixel shader cores, each with 4 ALUs
Xenon Diagram
[Block diagram: Core 0, Core 1, and Core 2 (each with L1D and L1I) share a 1MB unified L2; a BIU/IO interface connects to the GPU (3D core, MC1, 512MB DRAM), the XMA decoder, SMC system control, and I/O: DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, and flash.]