
7TH - Unit 3-21ec74h6 - Ca

The document outlines the syllabus for a Computer Architecture course focusing on Register Renaming and Thread Level Parallelism. It discusses the concepts of data dependencies, specifically RAW, WAW, and WAR hazards, and how register renaming can mitigate these issues in out-of-order execution. The document also details various architectural components like the Instruction Queue (IQ), Reorder Buffer (ROB), and the significance of physical and architectural registers in managing instruction execution.


COMPUTER ARCHITECTURE

PROFESSIONAL CORE ELECTIVE

Group H, 2021 Scheme
21EC74H6 (Unit 3)
Dr. Jayanthi P N
Asst. Professor, Dept. of ECE, RVCE
Syllabus

• Register Renaming & Thread Level Parallelism: Register Renaming Introduction, Register Renaming with pointers to IQ & ROB, Register Renaming with values in IQ and ROB, Introduction to hardware multithreading, Multithreading motivation, fine-grain multithreading, coarse-grain multithreading, simultaneous multithreading
Register Renaming: Introduction

● WAW and WAR are not "true" data dependencies: these name dependencies exist only because we have a limited number of "names" (a limited number of registers).
● RAW is a "true" data dependency, because the reader needs the result of the writer.
● Let us consider an OOO MIPS pipeline as follows:
X0: ALU execute stage (1 cycle)
M0, M1: 2-stage memory (2 cycles)
Y0, Y1, Y2, Y3: 4-stage multiply (4 cycles)
Register Renaming: Introduction
● Consider a program sequence with two mul and two add-immediate operations:
● Instructions 0 and 1, and instructions 1 and 2, show a RAW hazard.
● Instructions 1 and 3 show a WAW hazard.
● Instructions 2 and 3 show a WAR hazard.
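The three hazard rules above can be captured in a few lines of Python. This is an illustrative sketch only (not part of the course material); each instruction is written as a `(destination, sources)` pair, and the sequence used is the I1/I2/I3 example that appears in the exercises later in this unit:

```python
def hazards(earlier, later):
    """Return the hazards from `earlier` to `later`, each given as
    (dest_reg, [src_regs])."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:   # later reads what earlier writes: true dependency
        found.add("RAW")
    if e_dest == l_dest:   # both write the same name: name dependency
        found.add("WAW")
    if l_dest in e_srcs:   # later overwrites what earlier reads: name dependency
        found.add("WAR")
    return found

# I1: R1 = R2 + R3 ; I2: R4 = R1 + R5 ; I3: R1 = R6 + R7
I1 = ("R1", ["R2", "R3"])
I2 = ("R4", ["R1", "R5"])
I3 = ("R1", ["R6", "R7"])

assert hazards(I1, I2) == {"RAW"}   # true dependency, cannot be renamed away
assert hazards(I1, I3) == {"WAW"}   # name dependency, removable by renaming
assert hazards(I2, I3) == {"WAR"}   # name dependency, removable by renaming
```

Only the RAW result survives renaming; the WAW and WAR sets are exactly the hazards that a larger name space would eliminate.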
Recall
• F: Fetch
• The instruction is fetched from memory or the instruction cache.
• D: Decode
• The instruction is decoded to understand what it does and to prepare for execution.
• I: Issue
• The instruction is sent to the appropriate functional unit for execution.
• Y0, Y1, Y2, Y3: Execution Stages
• These represent the execution stages of the pipeline (possibly a multi-cycle operation, e.g., for MUL).
• X0: Execution (for simpler operations like ADDIU).
• W: Write-back
• The result of the instruction is written back to the register file.
• C: Commit
• The instruction’s result is committed, meaning it officially updates the processor state.
• Differences Between i and I:

o i: Indicates the issue stage is in progress but might be delayed or waiting due to hazards (e.g.,
data dependencies or structural hazards).
o I: Indicates the instruction has successfully issued and execution can proceed.
Register Renaming
● Instructions 0 and 1, and 1 and 2, show a RAW hazard: these true dependences cannot be avoided, hence stall cycles are introduced.
Stall cycles

The "r" in the pipeline diagram typically denotes a stall or bypass condition in the pipeline.
Register Renaming: Introduction
● Instructions 1 and 3 show a WAW hazard, and 2 and 3 show a WAR hazard.
● Say this is executing on an in-order-fetch, out-of-order-issue, out-of-order-execute, out-of-order-write-back, in-order-commit pipeline.
● If instruction 3 executes and commits its write to R4 before instruction 1, instruction 2 reads the wrong value.
● Hence, stall cycles are added for instruction 3 so that it commits in order.

Stall cycles
Register Renaming: Introduction
● Adding more registers removes the name dependence, but the architectural name space is limited. If R8 is added to the register space, the stall cycles can be removed.
• Registers: a larger namespace requires more bits in the instruction encoding.
• 32 registers = 5 bits, 128 registers = 7 bits.
Register Renaming: Introduction
● Register renaming: change the naming of registers in hardware to eliminate WAW and WAR hazards.
● 2 schemes:
i. Pointers in the Instruction Queue (IQ) / Reorder Buffer (ROB):
This approach to register renaming uses pointers to track where data resides in temporary storage structures like the Instruction Queue or the Reorder Buffer.
ii. Values in the Instruction Queue (IQ) / Reorder Buffer (ROB):
This approach to register renaming involves directly storing values in temporary structures like the Instruction Queue or Reorder Buffer during instruction execution.
Note: IO2I uses pointers in the IQ and ROB.
IO2I: Register Renaming with Pointers in IQ and ROB

[Pipeline diagram: F → D → IQ → issue → {X0; L0–L1; Y0–Y3} → W → ROB → C, with Rename Table (RT), Free List (FL), Scoreboard (SB), FSB, Physical Register File (PRF), and Architectural Register File (ARF)]
• All data structures are the same as before, except:
– Two fields are added to the ROB
– A Rename Table (RT) and a Free List (FL) of registers are added
• The size of the PRF is increased to provide more register "names".
Roles of SB and IQ
The Scoreboard (SB):
• It maintains a table with the following information for every instruction in the pipeline:
• Instruction status: to track the instructions (e.g., issued, executed, completed).
• Functional unit status: keeps track of which functional units (e.g., ALU, FPU) are busy and their current operations.
• Register status: tracks which registers are being read or written and whether they are ready for use.
The Instruction Queue (IQ):
• The IQ allows instructions to be fetched and decoded ahead of time while waiting for execution resources to become available.
• Instructions wait in the IQ until they are ready for execution (all operands are available and no structural hazards exist).
Purpose of the ARF
ARF (Architectural Register File):
• The ARF contains the logical or architectural registers specified by the instruction set architecture (ISA).
• These registers hold the committed state of the program, which reflects the values visible to the programmer or software.
• The ARF is updated only when instructions are committed (written back in program order).
Example workflow involving the ARF, for the instruction ADD R1, R2, R3:
1. Decode stage: logical registers (R1, R2, R3) are mapped to physical registers (e.g., P5, P6, P7) via the Rename Table.
2. Execution stage: the operation uses the values in P6 and P7 and stores the result in P5.
3. Commit stage: when the instruction is committed, the content of P5 is written back to R1 in the ARF.
Roles of FSB and ROB
FSB (Free List Buffer):
• The FSB holds a list of available physical registers in the Physical Register File (PRF).
• When an instruction needs to write to a destination register, the register renaming unit allocates a new physical register from the FSB.
• After the instruction is committed (written back), the physical register it used can be returned to the FSB for reuse.
Reorder Buffer (ROB):
• The ROB tracks instructions in the pipeline and ensures they are retired (committed) in order.
• When an instruction is committed, the ROB signals that the old physical register (if any) can be freed and returned to the FSB.
Roles of RT and FL
The Rename Table (RT):
• It holds the mapping between the logical (architectural) registers used in the instruction set architecture (ISA) and the physical registers in the processor. Register renaming is a technique used to eliminate false data dependencies (write-after-write and write-after-read hazards) and allow more instructions to execute in parallel.
• Purpose of the RT: the Rename Table ensures that each instruction can receive a unique physical register, and thus allows instructions to proceed out of order without conflicts over register usage. The RT helps to dynamically allocate physical registers for the temporary storage of values that are produced during execution.
The Free List (FL):
• The Free List keeps track of available physical registers in the system that have not been assigned to any instruction. This is essential for register renaming, as the processor needs to maintain a list of registers that are free to be used for the next instruction.
• Purpose of the FL: the Free List ensures that the processor does not run out of physical registers while renaming. When a new instruction needs a physical register, the Free List is checked for availability. When an instruction completes, the physical register is returned to the Free List.
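The interaction between the Rename Table and the Free List can be sketched in Python. This is an illustrative sketch under assumed names (a free list of eight hypothetical physical registers P0–P7); it renames the I1/I2/I3 sequence used in the exercises and shows the WAW on R1 disappearing while the true RAW is preserved:

```python
free_list = ["P0", "P1", "P2", "P3", "P4", "P5", "P6", "P7"]
rename_table = {}   # architectural register -> current physical register

def rename(dest, srcs):
    """Rename one instruction: read current source mappings, then map the
    destination to a fresh physical register from the free list.
    Returns (phys_dest, phys_srcs, old_mapping_to_free_at_commit)."""
    phys_srcs = [rename_table.get(s, s) for s in srcs]  # unmapped regs keep their name
    old = rename_table.get(dest)     # previous mapping is freed when this commits
    new = free_list.pop(0)           # allocate a fresh physical register
    rename_table[dest] = new
    return new, phys_srcs, old

# I1: R1 = R2 + R3 ; I2: R4 = R1 + R5 ; I3: R1 = R6 + R7
d1, s1, _   = rename("R1", ["R2", "R3"])
d2, s2, _   = rename("R4", ["R1", "R5"])
d3, s3, old = rename("R1", ["R6", "R7"])

assert d1 != d3       # the two writes to R1 use different physical names: no WAW
assert s2[0] == d1    # I2 reads I1's physical register: the true RAW is preserved
assert old == d1      # d1 is returned to the free list once I3 commits
```

Because I3 writes a different physical register than I1, it can now execute and complete without waiting for I2 to read the old value of R1.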
Pointers in the Instruction Queue (IQ) / ReOrder Buffer (ROB):
• In this approach, pointers are used to track the location of data in
temporary storage structures, such as the IQ or ROB.
• Key Points:
• Logical registers are mapped to physical registers or ROB entries through
pointers.
• The pointers provide indirection, allowing the pipeline to resolve
dependencies by accessing the appropriate structure without directly
modifying the architectural register file (ARF).
• Data is not directly stored in these structures but accessed via these pointers.
• Advantages:
• Reduced complexity of managing data within the structures.
• Efficient handling of large amounts of speculative data.
• Challenges:
• Extra indirection adds minor overhead to dependency resolution.
Example
In this scheme, pointers are used to refer to where the data is stored. Consider an example:
• Instructions:
1. I1: R1 = R2 + R3
2. I2: R4 = R1 + R5
• Working:
• I1 is decoded and assigned a pointer in the ROB (e.g., ROB[0]).
• R1 is mapped to ROB[0], indicating that the result of I1 will be stored there once available.
• I2 depends on the value of R1. Instead of stalling, it uses the pointer to ROB[0] to track the value of R1.
• When the execution of I1 completes, the result is written back to ROB[0]. I2 can then access the value of R1 through the pointer.

• This reduces storage requirements because only pointers are stored, not the actual data.
• It is ideal for architectures with speculative execution, as data dependencies are tracked dynamically.
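The pointer scheme's one level of indirection can be made concrete with a short sketch. This is illustrative only: the ROB entry holds a pointer (a physical register specifier, here the hypothetical name P5), while the value itself lives in the physical register file:

```python
prf = {"P5": None}                          # physical register allocated for I1's result
rob = [{"dest_ptr": "P5", "done": False}]   # ROB[0] stores only a pointer, not the value
reg_map = {"R1": "P5"}                      # R1 renamed to P5

# I2 decodes: instead of stalling on R1, it records the pointer P5 as its source
i2_src = reg_map["R1"]

# I1 completes: the value goes to the PRF; the ROB entry just marks completion
prf["P5"] = 10
rob[0]["done"] = True

# I2 reads R1 through one level of indirection (pointer -> PRF entry)
assert rob[0]["done"]
assert prf[i2_src] == 10
```

The extra dereference (`prf[i2_src]`) is the "minor overhead to dependency resolution" noted above; in exchange, the ROB entries stay small.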
2. IO2I: Register Renaming with Values in IQ and ROB

[Pipeline diagram: F → D → IQ → issue → {X0; L0–L1; Y0–Y3} → W → ROB → C, with RT, SB, FSB, and ARF; no PRF or FL]

• All data structures are the same as before, except:
– Modified ROB (values instead of register specifiers)
– Modified RT
– Modified IQ
– No FL
– No PRF; values are merged into the ROB
Modifications
• Reorder Buffer (ROB):
• Previously: stored register specifiers and pointers.
• Now: directly stores values instead of register specifiers.
• Rename Table (RT):
• Modified to adapt to the changes in the ROB.
• Tracks which architectural registers map to the values stored in the ROB.
• Instruction Queue (IQ):
• Modified to accommodate the new structure of the ROB.
• Refers directly to ROB entries for both operand fetching and instruction dispatch.
• No Free List (FL):
• The free list is eliminated; there is no need to manage physical register allocation explicitly.
• No Physical Register File (PRF):
• The PRF is removed, and its functionality is merged into the ROB. The ROB now serves as the primary storage for instruction results and temporary values.
Example
In this scheme, actual values are stored directly in the ROB or IQ.
Example Scenario:
•Instructions:
1.I1: R1 = R2 + R3
2.I2: R4 = R1 + R5
•Pipeline Execution:
•I1 is decoded, and its operands (R2, R3) are fetched from the register file.
•During execution, the result of I1 (e.g., R1 = 10) is directly written into ROB[0].
•I2 depends on R1. It directly reads the value 10 from ROB[0] rather than waiting
for it to be written back to the register file.
Key Benefits:
•Faster execution because the dependent instruction (I2) can directly access the result
in the ROB without additional indirection.
•Reduces latency in out-of-order execution pipelines.
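For contrast with the pointer scheme, the values-in-ROB workflow above can be sketched as well. This is illustrative only: there is no PRF and no free list, the ROB entry itself holds the result, and the rename table points at ROB indices:

```python
rob = [{"value": None, "ready": False}]  # ROB[0] allocated for I1
rename_table = {"R1": 0}                 # R1 -> ROB index 0 (not a physical register)

# I1 executes: R1 = R2 + R3 = 10; the result is written directly into ROB[0]
rob[0]["value"] = 10
rob[0]["ready"] = True

# I2 (R4 = R1 + R5) reads R1 straight out of ROB[0]: no indirection to a PRF
entry = rob[rename_table["R1"]]
assert entry["ready"] and entry["value"] == 10
```

The dependent read is a single lookup into the ROB, which is exactly the latency advantage this scheme trades against larger ROB entries.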
Advantages and Drawbacks
• Advantages
• Simplified Dependency Management:
• With values stored in the ROB, dependent instructions can directly fetch operands without indirection.
• Reduced Complexity:
• No need to manage physical registers explicitly (no FL or PRF).
• Unified Storage:
• ROB becomes the central point for all in-flight instruction data, reducing redundancy.
• Drawbacks
• ROB Size Limitation:
• ROB size may become a bottleneck since it now stores values instead of pointers.
• Increased ROB Complexity:
• Merging PRF functionality into the ROB increases its complexity and access latency.
• Scalability:
• The architecture may face challenges with scalability due to the centralized nature of the ROB.
Feature comparison: Pointers in IQ/ROB vs Values in IQ/ROB
• Storage requirement: lower with pointers, as only pointers are stored; higher with values, as actual values are stored.
• Data access: indirect with pointers, requiring an additional step to access data; direct with values, as they are immediately available.
• Structure size: smaller IQ/ROB with pointers, reducing overhead; larger IQ/ROB with values, to accommodate the data.
• Complexity of dependency tracking: pointers require dereferencing; with values it is simplified, as data is directly available.
• Performance impact: pointers are slightly slower due to indirection overhead; values give faster access but higher complexity in managing the ROB/IQ.
• Best use case: pointers suit systems prioritizing lower storage and simpler structures; values suit systems requiring faster data access and minimal latency.
1. Explain how register renaming resolves Write-After-Write (WAW) and Write-After-Read (WAR) hazards in out-of-order execution. Provide an example with an instruction sequence.
2. Consider the following instruction sequence:
I1: R1 = R2 + R3
I2: R4 = R1 + R5
I3: R1 = R6 + R7
• Identify the hazards without register renaming.
• Propose a renaming solution and show the renamed sequence.
3. If a processor uses pointers in the ROB instead of values, what could be the impact on:
• Dependency resolution latency?
• Memory bandwidth usage?
• Consider a two-way superscalar processor with the following characteristics:
• It fetches and decodes up to 2 instructions per cycle.
• Out-of-order execution is supported with a reservation station and a reorder buffer (ROB).
• Functional unit latencies: ADD/SUB = 1 cycle, MUL = 3 cycles, DIV = 5 cycles.
• There are 4 general-purpose registers (R1–R4) and a physical register file with 8 entries (P1–
P8).
• Register renaming is applied to resolve hazards.
• I1: ADD R1, R2, R3
• I2: MUL R4, R1, R2
• I3: SUB R3, R4, R1
• I4: DIV R1, R3, R4
• I5: ADD R2, R1, R4
• Explain the potential data hazards in the instruction sequence and how register renaming
resolves these hazards. Provide the renamed register mapping.
• Draw the execution timeline of these instructions assuming the processor executes
instructions out of order. Clearly show the reservation station and ROB status during the
execution. Highlight how parallelism is achieved.
Consider a four-way superscalar processor with the following features:
•Pipeline Stages: Fetch (F), Decode (D), Issue (I), Execute (E), Write Back (W).
•The processor uses out-of-order execution with register renaming and a reorder buffer (ROB) for maintaining in-order retirement.
•The processor can fetch, decode, and issue up to 4 instructions per cycle.
•I1: ADD R1, R2, R3
•I2: MUL R4, R1, R5
•I3: SUB R6, R4, R7
•I4: DIV R8, R6, R9
•I5: ADDI R10, R11, 5
•I6: MUL R12, R10, R13
a. Explain the types of hazards present in the instruction sequence. Classify them as RAW, WAR, or WAW, and identify where they
occur.

b. Illustrate the execution timeline of these instructions assuming the following conditions:
Execution latencies: ADD and SUB = 1 cycle, MUL = 2 cycles, DIV = 4 cycles, ADDI = 1 cycle.
No structural hazards exist, but data hazards and register renaming are applied.
Up to 4 instructions can be fetched, decoded, and issued per cycle, and up to 2 instructions can execute in parallel.

c. Discuss how the use of register renaming and out-of-order execution improves the performance of this sequence compared to
an in-order execution model. Highlight specific examples from the given instruction sequence.
Multithreaded architectures
• Hardware multithreading is a technique used in modern processors
to improve the utilization of computational resources and enhance
overall performance.
• In a single-threaded processor, when an instruction encounters a stall
(e.g., due to memory latency or pipeline hazards), the processor
remains idle until the stall is resolved.
• Hardware multithreading addresses this issue by enabling multiple
threads to share the same processor core, allowing the processor to
execute instructions from a different thread during stalls.
The ILP Wall
Multi threading
Hardware Multithreading is a technique used in modern processors to improve their efficiency and performance by
allowing them to execute multiple threads (smaller units of a program) simultaneously.
Processors are fast, but they often have to wait (e.g., for data from memory).
During this waiting time, they aren't doing useful work.
Hardware multithreading helps by keeping the processor busy with another thread while one thread is waiting.
Instruction-Level Parallelism vs Thread-Level Parallelism
• Instruction-level parallelism exploits very fine-grain independent instructions.
• Thread-level parallelism is explicitly represented in the program by the use of multiple threads of execution that are inherently parallel.
• Goal: use multiple instruction streams to improve either (or both):
1. Throughput of computers that run many programs
2. Execution time of multi-threaded programs
 Thread-level parallelism potentially allows huge speedups:
 There is no complex superscalar architecture that scales poorly
 There is no particular requirement for very complex compilers, as there is for VLIW
 Instead, the burden of identifying and exploiting the parallelism falls mostly on the programmer
 Programming with multiple threads is much more difficult than sequential programming
 Debugging parallel programs is incredibly challenging
 But it's pretty easy to build a big, fast parallel computer
 However, the main reason we have not had widely used parallel computers in the past is that they are too difficult (expensive), time-consuming (expensive), and error-prone (expensive) to program.
Costs of multithreading
Multithreading in a processor core
 Find a way to “hide” true data dependency stalls, cache miss stalls,
and branch stalls by finding instructions (from other process threads)
that are independent of those stalling instructions
 Multithreading – increase the utilization of resources on a chip by
allowing multiple processes (threads) to share the functional units of a
single processor
 Processor must duplicate the state hardware for each thread – a separate
register file, PC, instruction buffer, and store buffer for each thread
 The caches, TLBs, branch predictors can be shared (although the miss rates
may increase if they are not sized accordingly)
 The memory can be shared through virtual memory mechanisms
 Hardware must support efficient thread context switching
Thread scheduling policy
Determines which software thread is selected for which hardware thread context. Each hardware thread context is mapped to a software thread, either with one software thread bound continuously to one hardware thread, or with threads scheduled in round-robin fashion.
Types of Multithreading
 Coarse-grain – switches threads only on costly stalls (e.g., L3
cache misses)
 Advantages – thread switching doesn’t have to be essentially free and
much less likely to slow down the execution of an individual thread
 Disadvantage – limited, due to pipeline start-up costs, in its ability to
overcome throughput loss
- Pipeline must be flushed and refilled on thread switches

 Fine-grain – switch threads on every instruction issue


 Round-robin thread interleaving (skipping stalled threads)
 Processor must be able to switch threads on every clock cycle
 Advantage – can hide throughput losses that come from both short and
long stalls
 Disadvantage – slows down the execution of an individual thread since a
thread that is ready to execute without stalls is delayed by instructions
from other threads
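The fine-grain policy above can be simulated in a few lines. This is an illustrative sketch with hypothetical thread traces: each cycle the issue slot rotates round-robin over the threads, skipping any thread that is stalled, so one thread's miss latency is hidden by the other's instructions; `"MISS"` marks an instruction that stalls its thread for two extra cycles:

```python
def fine_grain(threads, stall_penalty=2):
    """Single-issue fine-grain multithreading: one instruction per cycle,
    round-robin over ready threads, skipping stalled ones."""
    stalled_until = [0] * len(threads)   # cycle at which each thread is ready again
    pc = [0] * len(threads)              # next instruction index per thread
    timeline = []
    cycle = 0
    while any(p < len(t) for p, t in zip(pc, threads)):
        issued = False
        for offset in range(len(threads)):           # rotate starting thread
            tid = (cycle + offset) % len(threads)
            if pc[tid] < len(threads[tid]) and cycle >= stalled_until[tid]:
                instr = threads[tid][pc[tid]]
                timeline.append((cycle, tid, instr))
                if instr == "MISS":                   # this thread stalls...
                    stalled_until[tid] = cycle + 1 + stall_penalty
                pc[tid] += 1
                issued = True
                break
        if not issued:                                # no ready thread: bubble
            timeline.append((cycle, None, "bubble"))
        cycle += 1
    return timeline

tl = fine_grain([["A0", "MISS", "A2"], ["B0", "B1", "B2"]])
# While thread 0 waits on its miss, thread 1's instructions fill every slot
assert all(instr != "bubble" for _, _, instr in tl)
```

With a single thread the same miss would leave bubble cycles in the pipeline; with two threads interleaved, every cycle issues useful work, which is exactly the throughput advantage claimed above.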
 Simultaneous multithreading – described next
Simultaneous Multithreading (SMT)
 A variation on multithreading that uses the resources of a
multiple-issue, dynamically scheduled processor
(superscalar) to exploit both program ILP and thread-
level parallelism (TLP)
 Most Superscalar processors have more machine level
parallelism than most programs can effectively use (i.e., than
have ILP)
 With register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without
regard to dependencies among them
- Need separate rename tables (reorder buffers) for each thread
- Need the capability to commit from multiple threads (i.e., from
multiple reorder buffers) in one cycle

 Intel’s recent desktop and laptop processors mostly use


SMT
Simultaneous Multithreading (SMT)
 The hard part of building a processor with SMT is not
designing the SMT hardware
 SMT hardware relies on parallel instruction execution on
out-of-order processors
 It’s very simple
 If two instructions belong to different threads then there is no
dependency
 The hard part of building SMT processors is
 Designing and building the underlying out-of-order superscalar
processor architecture
 Testing and debugging the processor in SMT mode
 Parallelism is so fine grain that it is hard to investigate, and there
can be any ordering of execution of instructions from different
threads
Threading on a 4-way Superscalar Example

[Figure: issue slots (horizontal) vs time (vertical) for coarse MT, fine MT, and SMT, with instructions from threads A, B, C, and D filling the slots]
Multithreaded Example: Sun’s Niagara (UltraSPARC T1)
 Eight fine-grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

Ultra III vs Niagara:
• Data width: 64-b vs 64-b
• Clock rate: 1.2 GHz vs 1.0 GHz
• Cache (I/D/L2): 32K/64K (8M external) vs 16K/8K/3M
• Issue rate: 4-issue vs 1-issue
• Pipe stages: 14 vs 6
• BHT entries: 16K x 2-b vs none
• TLB entries: 128I/512D vs 64I/64D
• Memory BW: 2.4 GB/s vs ~20 GB/s
• Transistors: 29 million vs 200 million
• Power (max): 53 W vs <60 W

[Block diagram: eight MT SPARC pipes connected through a shared crossbar to a 4-way banked L2$, I/O, shared functional units, and memory controllers]
Multicore Xbox360 – “Xenon” processor
 Aim is to provide game developers with a balanced and
powerful platform
 Three SMT processors, 32KB L1 D-cache & I-cache, 1MB
Unified L2 cache
 Two SMT threads per core
 165M transistors total
 3.2 GHz near-PowerPC ISA
 2-issue, 21 stage pipeline, with 128 128-bit registers
 Weak branch prediction – supported by software hinting
 In order instruction execution
 Narrow cores – 2 INT units, 2 128-bit VMX SIMD units, 1 of
anything else
 An ATI-designed 500MHz GPU, 512MB of DDR3 DRAM
 337M transistors, 10MB framebuffer
 48 pixel shader cores, each with 4 ALUs
Xenon Diagram

[Block diagram: three cores (Core 0–2), each with L1D and L1I caches, share a 1MB unified L2 connected through the BIU/IO interface; the GPU contains the 3D core and a 10MB EDRAM framebuffer, with 512MB DRAM attached via memory controllers MC0 and MC1 and analog video out; peripherals include DVD, HDD port, front USBs (2), rear USB (1), wireless MU ports (2 USBs), XMA decoder, Ethernet, IR, audio out, flash, and the SMC system controller]
Thank you
