Computer Architecture
Course Notes
Syllabus
Unit 1: Basic functional blocks of a computer: functional units, basic operational concepts, bus structures (single bus architecture and multiple bus architecture), instruction set architecture (ISA) of a CPU, register transfer language (RTL) notation, instruction execution (straight-line sequencing and branching), and addressing modes.
Unit 2: Computer arithmetic: fixed-point addition, subtraction, multiplication and division; floating-point arithmetic; data representation methods; Booth's multiplication algorithm with examples; integer division algorithm with examples; IEEE standard single- and double-precision formats with examples.
Unit 3: I/O organization: interrupts (enabling and disabling interrupts, handling multiple devices), direct memory access, bus arbitration, interface circuits, and standard I/O interfaces (PCI, SCSI and USB).
Unit 4: Pipelining: basic concepts of pipelining, arithmetic and instruction pipelines, throughput and speedup, pipeline hazards, and logic design conventions.
UNIT-1
Functional Units of Computer
The functional units of a computer are the essential components that work together to perform tasks. They are typically classified as the input unit, the central processing unit (CPU), the memory unit, and the output unit.
Each unit plays a crucial role in ensuring the computer system functions efficiently.
The basic operational concepts of the functional units of a computer involve the flow and
processing of data to perform tasks effectively. Each functional unit has a specific role in this
process. Below are the core concepts:
1. Input Unit
Function: Accepts data and instructions from the user or external devices.
Operation: Converts user-friendly input (e.g., key presses, mouse clicks) into binary
data that the computer understands.
Example: Typing on a keyboard sends data to the CPU for further processing.
2. Central Processing Unit (CPU)
The CPU is the "brain" of the computer and controls all operations. It comprises three main
components:
a. Arithmetic Logic Unit (ALU):
Function: Performs arithmetic operations (addition, subtraction, multiplication, division) and logical operations (AND, OR, NOT, comparisons) on data.
b. Control Unit (CU):
Function: Directs the flow of data between the CPU, memory, and I/O devices.
Operation:
1. Fetches instructions from memory.
2. Decodes them into control signals.
3. Executes them by coordinating with other units.
c. Registers:
Function: Small, high-speed storage locations inside the CPU that hold instructions, addresses, and intermediate results during processing.
3. Memory Unit
a. Primary Memory (RAM/ROM): Holds the instructions and data that the CPU is currently working on.
b. Secondary Memory: Provides long-term storage for programs and data (e.g., hard disks, SSDs).
4. Output Unit
Function: Presents processed results to the user or to external devices (e.g., monitor, printer).
5. Data Flow
Data moves from the input unit into memory, is processed by the CPU, and the results are delivered to the output unit.
6. Communication/Bus System
Function: Facilitates data transfer between the CPU, memory, and I/O units.
Operation:
o Address Bus: Carries memory addresses.
o Data Bus: Transfers actual data.
o Control Bus: Sends control signals.
These basic operational concepts define how data flows and is processed within a computer
system, ensuring efficient performance of tasks.
Single Bus Structure vs. Double Bus Structure
1. In a single bus structure, one common bus is used for communication between peripherals and the processor. In a double bus structure, two buses are used: one for communication with the peripherals and the other for the processor.
2. In a single bus structure, instructions and data are transferred over the same bus. In a double bus structure, instructions and data are transferred over different buses.
3. Advantages of the single bus structure: less expensive; simpler design. Advantages of the double bus structure: better performance; improved efficiency.
Instruction Set Architecture (ISA)
The Instruction Set Architecture (ISA) is the interface between the hardware and software of a
computer system. It defines the set of instructions that the CPU can execute, as well as the
associated data formats, addressing modes, and control mechanisms.
1. Instruction Set:
o A collection of machine instructions that the CPU can execute. These instructions
are divided into categories:
Data Transfer Instructions: Move data between memory, registers, and
I/O devices.
Examples: MOV, LOAD, STORE
Arithmetic Instructions: Perform arithmetic operations like addition,
subtraction, multiplication, and division.
Examples: ADD, SUB, MUL, DIV
Logical Instructions: Perform logical operations such as AND, OR,
NOT, XOR.
Examples: AND, OR, XOR, NOT
2. Registers:
o Defines the number, type, and purpose of CPU registers, which are high-speed
storage locations.
General Purpose Registers (GPRs): Hold temporary data.
Special Purpose Registers: Include Program Counter (PC), Stack Pointer
(SP), and Status Register.
3. Addressing Modes:
o Determines how the CPU accesses data operands in memory or registers.
Common modes include:
Immediate Addressing: Data is part of the instruction.
Register Addressing: Data resides in a register.
Direct Addressing: Instruction specifies the memory address.
Indirect Addressing: Memory address is obtained from a register.
Indexed Addressing: Combines a base address and an index.
4. Data Types:
o Specifies the types of data the CPU can process, such as:
Integers (signed and unsigned)
Floating-point numbers
Characters
Packed/Unpacked BCD (Binary Coded Decimal)
5. Instruction Format:
o Defines the structure of an instruction, including:
Opcode: Specifies the operation to perform.
Operands: Specifies the data to operate on.
Mode Bits: Define the addressing mode.
Types of ISA
Common ISA styles include CISC (Complex Instruction Set Computer, e.g., x86), RISC (Reduced Instruction Set Computer, e.g., ARM, RISC-V), and VLIW (Very Long Instruction Word).
Role of ISA
The ISA acts as the contract between hardware and software: compilers and programmers target the ISA, while hardware designers may implement it in different ways as long as the specified behaviour is preserved.
Register Transfer Language (RTL)
Register Transfer Language (RTL) is a symbolic notation used to describe the operations, data
transfers, and processes occurring at the register level within a computer's architecture. It is
commonly used in computer design and analysis.
R1 ← R2
This indicates the content of R2 is copied into R1.
3. Control Signals:
o Used to indicate when specific actions occur.
o Example:
If (C = 1) then R1 ← R2
The transfer occurs only if the control signal C is active (true).
5. Memory Access:
o Data is moved between memory and registers.
o Example:
Reading from memory:
R1 ← M[100]
The data at memory address 100 is transferred to R1.
Writing to memory:
M[100] ← R1
The content of R1 is stored at memory address 100.
6. Control Instructions:
o Indicate program flow changes, such as jumps or branches.
o Example:
If (R1 = 0) then PC ← 200
If R1 is zero, the program counter (PC) jumps to instruction address 200.
5. Memory Read/Write:
o Read:
R1 ← M[Address]
Data from memory at Address is loaded into R1.
o Write:
M[Address] ← R1
The content of R1 is stored in memory at Address.
6. Conditional Execution:
If (Flag = 1) then R1 ← R2
PC ← PC + 1
Advances the program counter to the next instruction.
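To make the notation concrete, the short Python sketch below (an illustrative model written for these notes, not part of any standard) mimics a few RTL statements such as R1 ← R2, R1 ← M[100], a conditional transfer, and PC ← PC + 1. Register names, memory addresses, and values are arbitrary.

# Illustrative model of RTL statements using a register file and a small memory.
registers = {"R1": 0, "R2": 25, "PC": 100}
memory = {100: 7, 200: 42}

# R1 <- R2            (register transfer)
registers["R1"] = registers["R2"]

# R1 <- M[100]        (memory read)
registers["R1"] = memory[100]

# M[100] <- R1        (memory write)
memory[100] = registers["R1"]

# If (C = 1) then R1 <- R2   (conditional transfer controlled by signal C)
C = 1
if C == 1:
    registers["R1"] = registers["R2"]

# PC <- PC + 1        (advance the program counter)
registers["PC"] += 1

print(registers, memory)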
Advantages of RTL
1. Clarity: Provides a clear and precise way to describe data flow and operations.
Instruction Execution
1. Straight-Line Sequencing
Definition: Instructions are executed sequentially, one after the other, in the order they
appear in memory.
Process:
1. The Program Counter (PC) holds the address of the next instruction.
2. The CPU fetches the instruction from the memory location pointed to by the PC.
3. The PC is incremented to point to the next instruction.
4. The CPU decodes and executes the fetched instruction.
Instruction 1
Instruction 2
Instruction 3
2. Branching
Definition: The control flow of the program changes based on specific conditions,
allowing non-sequential execution of instructions.
Types of Branching:
1. Unconditional Branching:
The program always jumps to a specified address.
Example
JUMP Address
2. Conditional Branching:
The program jumps to a specified address only if a condition is satisfied (e.g., BRANCH_IF_ZERO Address).
3. Subroutine Call and Return:
CALL Subroutine
...
RETURN
Applications:
o Used in decision-making, loops, and subroutine handling in programs.
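The following Python sketch (a toy fetch-decode-execute model; the instruction encoding, addresses, and register names are invented for illustration) shows how the PC drives straight-line sequencing and how a conditional branch changes the flow.

# Toy fetch-decode-execute loop: straight-line sequencing with one conditional branch.
memory = {
    0: ("LOAD", "R1", 5),                 # R1 <- 5
    1: ("SUB", "R1", 1),                  # R1 <- R1 - 1
    2: ("BRANCH_IF_NOT_ZERO", "R1", 1),   # if R1 != 0, PC <- 1
    3: ("HALT",),
}
registers = {"R1": 0}
pc = 0

while True:
    opcode, *operands = memory[pc]        # fetch the instruction addressed by the PC
    pc += 1                               # straight-line sequencing: PC now points to the next instruction
    if opcode == "LOAD":
        registers[operands[0]] = operands[1]
    elif opcode == "SUB":
        registers[operands[0]] -= operands[1]
    elif opcode == "BRANCH_IF_NOT_ZERO":
        if registers[operands[0]] != 0:
            pc = operands[1]              # branching: overwrite the PC with the target address
    elif opcode == "HALT":
        break

print(registers)                          # R1 has counted down to 0 before the loop exits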
Addressing Modes
Addressing modes define how the CPU identifies the location of data (operands) to be used in an
instruction. These modes are critical for optimizing program performance and flexibility.
1. Immediate Addressing
ADD R1, #5
Adds the value 5 directly to the content of R1.
2. Register Addressing
ADD R1, R2
Adds the content of R2 to R1.
3. Direct Addressing
Definition: The instruction specifies the memory address of the operand directly.
Example:
LOAD R1, [2000]
Loads the content of memory location 2000 into R1.
4. Indirect Addressing
Definition: The instruction specifies a register or memory location that contains the
memory address of the operand.
Example:
LOAD R1, (R2)
The address of the operand is held in R2.
5. Indexed Addressing
Definition: Combines a base address and an index to compute the effective address.
Example:
LOAD R1, 100(R2)
Effective address = 100 + contents of index register R2.
6. Base Register Addressing
Definition: Similar to indexed addressing, but uses a base register instead of a fixed base
address.
Example:
LOAD R1, 4(BR)
Effective address = contents of base register BR + displacement 4.
7. Relative Addressing
Definition: The effective address is calculated relative to the current value of the
Program Counter (PC).
Example:
JUMP PC + Offset
8. Stack Addressing
Definition: Operands are implicitly taken from the top of the stack.
Example:
PUSH R1
POP R2
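As a summary, the hypothetical Python snippet below computes the operand (or its effective address) for each of the modes above. All register contents, base addresses, and memory values are made up purely for illustration.

# How each addressing mode locates its operand (values are illustrative).
registers = {"R2": 300, "BR": 1000, "PC": 50, "SP": 7000}
memory = {300: 11, 2000: 22, 1300: 33, 1004: 44, 60: 55, 7000: 66}

immediate = 5                                 # Immediate: operand is part of the instruction
register_operand = registers["R2"]            # Register: operand resides in a register
direct = memory[2000]                         # Direct: instruction holds the memory address
indirect = memory[registers["R2"]]            # Indirect: register holds the operand's address
indexed = memory[1000 + registers["R2"]]      # Indexed: base address 1000 + index register -> 1300
base_reg = memory[registers["BR"] + 4]        # Base register: base register + displacement -> 1004
relative = memory[registers["PC"] + 10]       # Relative: PC + offset -> 60
stack = memory[registers["SP"]]               # Stack: operand is at the top of the stack

print(immediate, register_operand, direct, indirect, indexed, base_reg, relative, stack)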
UNIT-III
Types of Interrupt
1. Software Interrupts
A software interrupt is raised by a program itself, for example by executing a trap/system-call instruction or when an exception such as division by zero occurs.
2. Hardware Interrupts
In a hardware interrupt, all the devices are connected to the Interrupt Request Line. A single
request line is used for all the n devices. To request an interrupt, a device closes its associated
switch. When a device requests an interrupt, the value of INTR is the logical OR of the
requests from individual devices.
Hardware interrupts are further divided into two types: maskable interrupts, which the processor can disable or ignore, and non-maskable interrupts, which cannot be disabled.
After the execution of the current instruction, the processor verifies the interrupt signal to
check whether any interrupt is pending. If no interrupt is pending then the processor
proceeds to fetch the next instruction in the sequence.
If the processor finds a pending interrupt, it suspends the execution of the current
program by saving the address of the next instruction to be executed, and it updates the
program counter with the starting address of the interrupt service routine that services the interrupt.
After the interrupt is serviced completely, the processor resumes the execution of the
program it had suspended.
For example, consider a situation in which a particular sequence of instructions must be executed
without any interruption, because the execution of an interrupt service routine might change the
data used by that sequence of instructions. The programmer must therefore have the facility to
enable and disable interrupts in order to control the events during the execution of the program.
Interrupts can be enabled and disabled at both ends, i.e., either at the processor end or at
the I/O device end. With this facility, if interrupts are enabled or disabled at the processor
end, the processor can accept or reject interrupt requests. If the I/O devices are allowed to
enable or disable interrupts at their end, then each I/O device is either allowed to raise an
interrupt request or prevented from raising one.
To enable or disable interrupts at the processor end, one bit of the processor status register, the IE (Interrupt
Enable) bit, is used. When the IE flag is set to 1, the processor accepts incoming interrupt requests; when the IE
flag is set to 0, the processor ignores them.
To enable and disable interrupts at the I/O device end, the control register present at the interface
of the I/O device is used. One bit of this control register is used to regulate the enabling and
disabling of interrupts from the I/O device end.
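The following Python fragment (the bit positions are illustrative assumptions, not taken from any particular processor) shows the idea of enabling and disabling interrupts by setting or clearing single control bits at the two ends.

# Enabling/disabling interrupts by manipulating single bits (bit positions are assumed).
IE_BIT = 0          # assumed position of the Interrupt Enable bit in the processor status register
DEV_EN_BIT = 3      # assumed position of the interrupt-enable bit in a device control register

status_register = 0b0000
device_control = 0b0000

status_register |= (1 << IE_BIT)        # processor end: IE = 1, interrupts are accepted
device_control |= (1 << DEV_EN_BIT)     # device end: the device may raise interrupt requests

def interrupts_accepted(status):
    return (status >> IE_BIT) & 1 == 1

status_register &= ~(1 << IE_BIT)       # IE = 0: the processor now ignores interrupt requests
print(interrupts_accepted(status_register))   # False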
Let us say device X may interrupt the processor when it is servicing the interrupt caused by
device Y. Or it may happen that multiple devices request interrupts simultaneously. These
situations trigger several questions like:
How will the processor identify which device has requested the interrupt?
If different devices request different types of interrupt, and the processor has to service them
with different service routines, how will the processor obtain the starting address of the
particular interrupt service routine?
Can a device interrupt the processor while it is servicing the interrupt produced by another
device?
How will the processor handle multiple devices requesting interrupts simultaneously?
How these situations are handled varies from computer to computer. Now, if multiple devices are
connected to the processor, each capable of raising an interrupt, how will the processor
determine which device has requested an interrupt?
The simplest solution is that whenever a device requests an interrupt, it sets its interrupt request
(IRQ) bit to 1 in its status register. The processor then checks the IRQ bit of each device, and the
device whose IRQ bit is found to be 1 is the device that raised the interrupt.
However, this is a time-consuming method, as the processor spends its time checking the IRQ bits of every
connected device. The time wastage can be reduced by using vectored interrupts.
The devices raising a vectored interrupt identify themselves directly to the processor. So
instead of wasting time in identifying which device has requested an interrupt, the processor
immediately starts executing the corresponding interrupt service routine for the requested
interrupt.
To identify itself directly to the processor, a device either requests with its own
interrupt request signal or sends a special code to the processor, which helps the processor
identify which device has requested an interrupt.
Usually, a permanent area in the memory is allotted to hold the starting address of each interrupt
service routine. The addresses referring to the interrupt service routines are termed as interrupt
vectors and all together they constitute an interrupt vector table. Now how does it work?
The device requesting an interrupt sends a specific interrupt request signal or a special code to
the processor. This information acts as a pointer into the interrupt vector table, and the
corresponding address (the address of the specific interrupt service routine required to service
the interrupt raised by the device) is loaded into the program counter.
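The sketch below models this vector-table lookup in Python. The device codes, table addresses, and routine assignments are all invented for illustration; the point is only that the code supplied by the device indexes a table of service-routine start addresses, one of which is loaded into the PC.

# Toy model of vectored interrupts: the device's code selects an entry of the
# interrupt vector table, and that entry (an ISR start address) is loaded into the PC.
interrupt_vector_table = {     # device code -> starting address of its service routine
    0x01: 0x2000,              # e.g., keyboard ISR
    0x02: 0x2100,              # e.g., disk ISR
    0x03: 0x2200,              # e.g., timer ISR
}

def take_interrupt(pc, device_code):
    saved_return_address = pc                      # save address of the next instruction
    pc = interrupt_vector_table[device_code]       # PC <- ISR start address from the vector table
    return pc, saved_return_address

pc, return_address = take_interrupt(pc=0x0450, device_code=0x02)
print(hex(pc), hex(return_address))                # 0x2100 0x450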
Interrupt Nesting
When the processor is busy in executing the interrupt service routine, the interrupts are disabled
in order to ensure that the device does not raise more than one interrupt. A similar kind of
arrangement is used where multiple devices are connected to the processor. So that the servicing
of one interrupt is not interrupted by the interrupt raised by another device.
What if multiple devices raise interrupts simultaneously? In that case, the interrupts are
prioritized.
BUS ARBITRATION
Introduction :
In a computer system, multiple devices, such as the CPU, memory, and I/O controllers, are
connected to a common communication pathway, known as a bus. In order to transfer data
between these devices, they need to have access to the bus. Bus arbitration is the process of
resolving conflicts that arise when multiple devices attempt to access the bus at the same time.
When multiple devices try to use the bus simultaneously, it can lead to data corruption and
system instability. To prevent this, a bus arbitration mechanism is used to ensure that only one
device has access to the bus at any given time.
There are several types of bus arbitration methods, including centralized, decentralized, and
distributed arbitration. In centralized arbitration, a single device, known as the bus controller or
arbiter, is responsible for deciding which master is granted access to the bus.
Advantages –
This method does not favor any particular device or processor.
The method is also quite simple.
If one device fails, the entire system does not stop working.
Disadvantages –
Adding bus masters is difficult, as it increases the number of address lines of the circuit.
(iii) Fixed priority or Independent Request method –
In this, each master has a separate pair of bus request and bus grant lines and each pair has a
priority assigned to it.
The built-in priority decoder within the controller selects the highest priority request and
asserts the corresponding bus grant signal.
(iv) Distributed Arbitration method –
In this, all devices participate in the selection of the next bus master. Each device on the bus is
assigned a 4-bit identification number. The priority of the device is determined by this ID.
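A tiny sketch of fixed-priority (independent request) arbitration is given below. The master names and priority values are made up: among all asserted bus-request lines, the arbiter asserts the grant for the requester with the highest priority (lowest number here).

# Fixed-priority bus arbiter: grant the bus to the highest-priority requester.
priorities = {"DMA": 0, "CPU": 1, "DISK": 2, "NIC": 3}   # 0 = highest priority (illustrative)

def arbitrate(bus_requests):
    """bus_requests: set of masters currently asserting their bus-request line."""
    if not bus_requests:
        return None                                        # no bus grant is asserted
    return min(bus_requests, key=lambda m: priorities[m])  # winner receives the bus grant

print(arbitrate({"CPU", "NIC"}))           # CPU wins over NIC
print(arbitrate({"DISK", "DMA", "CPU"}))   # DMA has the highest priority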
Uses of BUS Arbitration in Computer Organization:
Bus arbitration is a critical process in computer organization that has several uses and benefits,
including:
1. Efficient use of system resources: By regulating access to the bus, bus arbitration ensures
that each device has fair access to system resources, preventing any single device from
monopolizing the bus and causing system slowdowns or crashes.
2. Minimizing data corruption: Bus arbitration helps prevent data corruption by ensuring that
only one device has access to the bus at a time, which minimizes the risk of multiple
devices writing to the same location in memory simultaneously.
3. Support for multiple devices: Bus arbitration enables multiple devices to share a common
communication pathway, which is essential for modern computer systems with multiple
peripherals, such as printers, scanners, and external storage devices.
4. Real-time system support: In real-time systems, bus arbitration is essential to ensure that
high-priority tasks are executed quickly and efficiently. By prioritizing access to the bus,
bus arbitration can ensure that critical tasks are given the resources they need to execute in
a timely manner.
5. Improved system stability: By preventing conflicts between devices, bus arbitration helps
to improve system stability and reliability. This is especially important in mission-critical
systems where downtime or data corruption could have severe consequences.
INTERFACE CIRCUITS
The I/O interface circuit is circuitry that is designed to link the I/O devices to the processor. Now
the question is why do we require an interface circuit?
We know that every component or module of the computer has its distinct capabilities and
processing speed. For example, the processing speed of the CPU is much higher than the other
components of the computer such as keyboard, display, etc.
So, we need a mediator to make the computer communicate with the I/O modules. This mediator
is referred to as an interface circuit. Observing the figure below, we can easily see that one end
of the interface circuit is connected to the system bus lines, i.e., the address lines, data lines, and control
lines.
The address line is decoded by the interface circuit to determine if the processor has addressed
this particular I/O device or not. The control line is decoded to identify which kind of operation
is requested by the processor. The data line is used to transfer the data between I/O and the
processor.
The other side of the interface circuit has the connections that are essential to transfer data
between the I/O interface circuit and the I/O device. This side of the I/O interface is referred to as a
port, and it can be configured as either a parallel port or a serial port.
But before discussing these ports, let us take a brief look at the features of the I/O
interface circuit.
1. The interface circuit has a data register that stores the data temporarily while the data is being
exchanged between the I/O device and the processor.
2. The interface circuit also has a status register; the bits in the status register indicate to the
processor whether the I/O device is ready for the transmission or not.
3. The interface circuit also has a control register; the bits in the control register indicate the
type of operation (read or write) requested by the processor to the I/O interface.
4. The interface circuit also has address decoding circuitry which decodes the address over the
address line to determine whether it is being addressed by the processor.
5. The interface circuitry also generates the timing signals that synchronize the operation between
the processor and the I/O device.
6. The interface circuit is also responsible for the format conversion that is essential for
exchanging data between the processor and the I/O interface.
Now, let us learn about the parallel port and the serial port of the I/O interface circuit.
Parallel Port
To understand the interface circuit with a parallel port we will take the example of two I/O
devices. First, we will study an input device i.e., a keyboard that has an 8-bit input port, and then
an output device i.e., a display that has an 8-bit output port. Here multiple bits are transferred at
once.
Input Port
Observe the parallel input port that connects the keyboard to the processor. Now, whenever the
key is tapped on the keyboard an electrical connection is established that generates an electrical
signal. This signal is encoded by the encoder to convert it into ASCII code for the corresponding
character pressed at the keyboard.
Now, when the data is loaded into the KBD_DATA register, the KIN status flag in the
KBD_STATUS register is set to 1, which causes the processor to read the data from
KBD_DATA.
Once the processor reads the data from the KBD_DATA register, the KIN flag is cleared back to 0. Here
the input interface is connected to the processor using an asynchronous bus.
So, the way they alert each other is using the master ready line and the slave ready line.
Whenever the processor is ready to accept the data, it activates its master-ready line and
whenever the interface is ready with the data to transmit it to the processor it activates its slave-
ready line.
The bus connecting the processor and the interface has one more control line, R/W, which is set to
1 for a read operation.
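This interaction can be pictured with the small Python sketch below. The KBD_DATA and KIN names follow the text above, while the device model itself is an invention of this sketch: the processor polls until KIN becomes 1, reads KBD_DATA, and the flag is cleared.

# Toy model of the keyboard input port: the processor polls the KIN status flag,
# then reads the character from KBD_DATA; reading clears KIN back to 0.
class KeyboardPort:
    def __init__(self):
        self.KBD_DATA = 0      # data register
        self.KIN = 0           # status flag held in KBD_STATUS

    def key_pressed(self, ascii_code):
        self.KBD_DATA = ascii_code
        self.KIN = 1           # a new character is available

    def read(self):
        value = self.KBD_DATA
        self.KIN = 0           # cleared once the processor has read the data
        return value

port = KeyboardPort()
port.key_pressed(ord("A"))

while port.KIN != 1:           # processor polls until the interface is ready
    pass
print(chr(port.read()), port.KIN)   # 'A' 0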
Output Port
Observe the output interface shown in the figure below that connects the display and the processor.
The display device uses two handshake signals, ready and new-data, in addition to the master-ready
and slave-ready signals.
When the display unit is ready to display a character, it sets its ready line to 1, which sets
the DOUT flag in the DISP_STATUS register to 1. This signals the processor, and the
processor places the character into the DISP_DATA register.
Serial Port
Opposite to the parallel port, the serial port connects the processor to devices that transmit only
one bit at a time. Here on the device side, the data is transferred in the bit-serial pattern, and on
the processor side, the data is transferred in the bit-parallel pattern.
The transformation of the format from serial to parallel i.e., from device to processor, and from
parallel to serial i.e., from processor to device is made possible with the help of shift registers
(input shift register & output shift register).
Observe the figure above to understand the functioning of the serial interface at the device side.
The input shift register accepts one bit at a time, in a bit-serial fashion, until it has received all 8 bits.
When all 8 bits have been received, the input shift register loads its content into the DATA IN
register in parallel. In a similar fashion, the content of the DATA OUT register is transferred in
parallel to the output shift register.
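A minimal Python sketch of this serial-to-parallel conversion is given below (the bit ordering and register names are assumptions of the sketch): the input shift register collects one bit per clock, and after 8 bits the assembled byte is loaded into DATA IN in parallel.

# Serial-to-parallel conversion with an input shift register (illustrative model).
shift_register = []     # accepts one bit at a time
DATA_IN = None          # loaded in parallel once 8 bits have arrived

def receive_bit(bit):
    """Shift in one bit; after 8 bits, transfer the byte to DATA_IN in parallel."""
    global DATA_IN
    shift_register.append(bit)
    if len(shift_register) == 8:
        # assume the first bit received is the most significant bit
        DATA_IN = int("".join(str(b) for b in shift_register), 2)
        shift_register.clear()

for b in [0, 1, 0, 0, 0, 0, 0, 1]:   # 0b01000001 = 65 = ASCII 'A'
    receive_bit(b)

print(DATA_IN)   # 65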
The serial interface port connected to the processor via system bus functions similarly to the
parallel port. The status and control block has two status flags SIN and SOUT. The SIN flag is
set to 1 when the I/O device inputs the data into the DATA IN register through the input shift
register and the SIN flag is cleared to 0 when the processor reads the data from the DATA IN
register.
This makes the transmission convenient between the device that transmits and receives one bit at
a time and the processor that transmits and receives multiple bits at a time.
The serial interface does not have any clock line to carry timing information. So, the timing
information must be embedded with the transmitted data using the encoding scheme. There are
two techniques to do this.
In the asynchronous transmission, the clock used by the transmitter and receiver is not
synchronized. So, the bits to be transmitted are grouped into a group of 6 to 8 bits which has a
defined starting bit and ending bit. The start bit has a logic value 0 and the stop bit has a logic
value 1.
The data received at the receiver end is delimited by these start and stop bits. This approach is
useful where the transmission speed is low.
The start and stop bits used in asynchronous transmission provide the correct timing
information, but this approach is not useful where the transmission speed is high.
So, in synchronous transmission, the receiver generates a clock that is synchronized with
the clock of the transmitter. This allows large blocks of data to be transmitted at high speed.
This is all about the interface circuit, which is an intermediary circuit between the I/O device and the
processor. The parallel interface is faster, more costly, and efficient for devices that are close to the
processor, whereas the serial interface is slower, less costly, and efficient for long-distance
connections.
Device Configuration
When an I/O device is connected to a computer, several actions are needed to configure both the
device and the software that communicates with it.
The PCI simplifies this process by incorporating in each I/O device interface a small
configuration ROM memory that stores information about that device. The configuration ROMs of
all devices are accessible in the configuration address space.
The PCI initialization software reads these ROMs whenever the system is powered up or reset. In
each case, it determines whether the device is a printer, a keyboard, an Ethernet interface, or a disk
controller. It can further learn about various device options and characteristics. Devices are
assigned addresses during the initialization process. This means that during the bus configuration
operation, devices cannot be accessed based on their address, as they have not yet been assigned
one. Hence, the configuration address space uses a different mechanism. Each device has an input
signal called Initialization Device Select, IDSEL#.
The PCI bus has gained great popularity in the PC world. It is also used in many other computers,
such as SUNs, to benefit from the wide range of I/O devices for which a PCI interface is available.
In the case of some processors, such as the Compaq Alpha, the PCI-processor bridge circuit is
built on the processor chip itself, further simplifying system design and packaging.
SCSI
It is a standard bus defined by the American National Standards Institute (ANSI). A controller
connected to a SCSI bus is an initiator or a target. The processor sends a command to the SCSI
controller, which causes the following sequence of events to take place:
The SCSI controller contends for control of the bus (initiator).
When the initiator wins the arbitration process, it selects the target controller and hands over
control of the bus to it.
The target starts an output operation. The initiator sends a command specifying the required read
operation.
The target sends a message to the initiator indicating that it will temporarily suspend the
connection between them. Then it releases the bus.
What is a Pipeline?
Pipelining is an implementation technique in which the execution of several instructions is overlapped: the processor is divided into stages, and while one instruction occupies one stage, the next instruction can occupy the previous stage.
[Space-time diagram: two instructions I1 and I2 flow through the four stages S1-S4 in successive clock cycles, so both complete in 5 cycles.]
Total time = 5 cycles.
Pipeline Stages
A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC
instruction set. Following are the 5 stages of the RISC pipeline with their respective operations:
Stage 1 (Instruction Fetch): In this stage the CPU fetches the instruction from the
memory location whose address is stored in the program counter.
Stage 2 (Instruction Decode): The instruction is decoded and the register operands are read from the register file.
Stage 3 (Instruction Execute): The ALU performs the required operation or calculates an effective address.
Stage 4 (Memory Access): Data memory is read or written, if the instruction requires it.
Stage 5 (Write Back): The result is written back to the destination register.
Performance of a Pipelined Processor
Consider a 'k' segment pipeline with clock cycle time 'Tp'. Let there be 'n' tasks to be completed in the pipelined processor. Now, the first
instruction is going to take ‘k’ cycles to come out of the pipeline but the other ‘n – 1’
instructions will take only ‘1’ cycle each, i.e, a total of ‘n – 1’ cycles. So, time taken to execute
‘n’ instructions in a pipelined processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
In the same case, for a non-pipelined processor, the execution time of ‘n’ instructions will be:
ETnon-pipeline = n * k * Tp
So, speedup (S) of the pipelined processor over the non-pipelined processor, when ‘n’ tasks
are executed on the same processor is:
S = Performance of non-pipelined processor /
Performance of pipelined processor
As the performance of a processor is inversely proportional to the execution time, we have,
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
When the number of tasks 'n' is significantly larger than k, that is, n >> k, then k + n – 1 ≈ n, so:
S ≈ [n * k] / n
S = k
where k is the number of stages in the pipeline.
Efficiency = speedup / maximum speedup = S / Smax. Since Smax = k, Efficiency = S / k.
Throughput = number of instructions / total time to complete the instructions, so Throughput = n / [(k + n – 1) * Tp].
Note: The cycles-per-instruction (CPI) value of an ideal pipelined processor is 1.
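The formulas above can be checked with a few lines of Python; the values k = 5, n = 100, and Tp = 2 ns are arbitrary example numbers.

# Pipelined vs non-pipelined execution time, speedup, efficiency, and throughput.
k, n, Tp = 5, 100, 2e-9        # stages, instructions, clock cycle time (arbitrary values)

et_pipeline = (k + n - 1) * Tp           # (k + n - 1) cycles of Tp each
et_non_pipeline = n * k * Tp             # every instruction takes k cycles

speedup = et_non_pipeline / et_pipeline  # = n*k / (k + n - 1)
efficiency = speedup / k                 # Smax = k
throughput = n / et_pipeline             # instructions completed per second

print(f"speedup = {speedup:.2f}  (limit is k = {k})")
print(f"efficiency = {efficiency:.2f}")
print(f"throughput = {throughput:.3e} instructions/s")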
The performance of a pipeline is measured using two main metrics: throughput and latency.
What is Throughput?
It measures the number of instructions completed per unit time.
It represents the overall processing speed of the pipeline.
Higher throughput indicates a faster pipeline.
It is calculated as: throughput = number of instructions executed / execution time.
It is affected by the pipeline length, the clock frequency, the efficiency of instruction execution,
and the presence of pipeline hazards or stalls.
What is Latency?
It measures the time taken for a single instruction to complete its execution.
It represents the delay, i.e., the time it takes for an instruction to pass through the pipeline stages.
Lower latency indicates better performance.
It is calculated as: latency = execution time / number of instructions executed.
It is influenced by the pipeline length and depth, the clock cycle time, instruction dependencies, and
pipeline hazards.
Advantages of Pipelining
Increased Throughput: Pipelining enhances the throughput capacity of a CPU by enabling
a number of instructions to be processed at the same time in different stages. This increases
the number of instructions completed in a given period of time, thus
improving the efficiency of the processor.
Arithmetic Pipeline
1. Arithmetic Pipeline:
An arithmetic pipeline divides an arithmetic operation into sub-operations that are executed
in successive pipeline segments. It is used for floating-point operations, multiplication, and various
other computations. The flowchart of an arithmetic pipeline for floating-point addition is
shown in the diagram.
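As an illustration, the classic four sub-operations of floating-point addition (compare exponents, align mantissas, add mantissas, normalize) can be written as one function per pipeline stage. The decimal mantissa/exponent representation used here is a simplification chosen for clarity, not the IEEE format.

# The four sub-operations of a floating-point addition pipeline, one function per stage.
# Numbers are held as (mantissa, exponent), meaning mantissa * 10**exponent (simplified, decimal).
def compare_exponents(a, b):                 # Stage 1: order operands by exponent
    return (a, b) if a[1] >= b[1] else (b, a)

def align_mantissas(big, small):             # Stage 2: shift the smaller mantissa right
    shift = big[1] - small[1]
    return big, (small[0] / (10 ** shift), big[1])

def add_mantissas(big, small):               # Stage 3: add the aligned mantissas
    return (big[0] + small[0], big[1])

def normalize(result):                       # Stage 4: renormalize the result
    m, e = result
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    return (m, e)

x, y = (9.5, 3), (8.0, 2)                    # 9.5e3 + 8.0e2
stage1 = compare_exponents(x, y)
stage2 = align_mantissas(*stage1)
stage3 = add_mantissas(*stage2)
print(normalize(stage3))                     # (1.03, 4), i.e. 1.03e4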
2. Instruction Pipeline:
In this technique, a stream of instructions is executed by overlapping the fetch, decode and execute
phases of the instruction cycle. It is used to increase the throughput of the
computer system. An instruction pipeline reads instructions from memory while previous
instructions are being executed in other segments of the pipeline, so multiple instructions can be
executed simultaneously. The pipeline is more efficient if the instruction
cycle is divided into segments of equal duration. In the most general case, the computer needs to
process each instruction in the following sequence of steps:
1. Fetch the instruction from memory (FI)
2. Decode the instruction (DA)
3. Calculate the effective address
4. Fetch the operands from memory (FO)
5. Execute the instruction (EX)
6. Store the result in the proper place
The flowchart for instruction pipeline is shown below.
Pipeline Hazards
Structural Hazards
Data Hazards
Control Hazards
Structural Hazards
Structural hazards arise due to hardware resource conflict amongst the instructions in the
pipeline. A resource here could be the Memory, a Register in GPR or ALU. This resource
conflict is said to occur when more than one instruction in the pipe is requiring access to the
same resource in the same clock cycle. This is a situation that the hardware cannot handle all
possible combinations in an overlapped pipelined execution.
Observe figure 16.1. In any system, an instruction is fetched from memory in the IF machine cycle.
In our 4-stage pipeline, Result Writing (RW) may access memory or one of the General Purpose
Registers, so RW of one instruction can clash with the IF of a later instruction that needs the same resource in the same clock cycle.
Solution 1: Introduce bubble which stalls the pipeline as in figure 16.2. At t4, I4 is not allowed
to proceed, rather delayed. It could have been allowed in t5, but again a clash with I2 RW. For
the same reason, I4 is not allowed in t6 too. Finally, I4 could be allowed to proceed (stalled) in
the pipe only at t7.
This delay percolates to all the subsequent instructions too. Thus, while the ideal 4-stage
system would have taken 8 timing states to execute 5 instructions, due to the structural
dependency it now takes 11 timing states. And that is not all: by now you would have guessed that this
hazard is likely to recur at every 4th instruction. That is not a good solution for a heavily loaded
CPU. Is there a better way? Yes!
A better solution would be to increase the structural resources in the system using one of the few
choices below:
The pipeline may be increased to 5 or more stages and suitably redefine the functionality
of the stages and adjust the clock frequency. This eliminates the issue of the hazard at
every 4th instruction in the 4-stage pipeline
The memory may physically be separated into instruction memory and data memory. A
better choice is to design these as cache memories in the CPU rather than dealing with
main memory directly. IF then uses the instruction memory and Result Writing uses the data memory; these
become two separate resources, avoiding the dependency.
It is possible to have Multiple levels of Cache in CPU too.
There is a possibility of ALU in resource dependency. ALU may be required in IE
machine cycle by an instruction while another instruction may require ALU in IF stage to
calculate Effective Address based on addressing mode. The solution would be either
stalling or have an exclusive ALU for address calculation.
Register files are used in place of GPRs. Register files have multiport access with
exclusive read and write ports. This enables simultaneous access on one write register
and read register.
Data hazards occur when an instruction's execution depends on the result of a previous
instruction that is still being processed in the pipeline. Consider, for instance, an ADD instruction
that writes register R3, followed immediately by SUB, OR and AND instructions that all read R3.
In this case, the ADD instruction writes its result into register R3 only in t5. If bubbles
are not introduced to stall the following SUB instruction, all three later instructions would use
stale data from R3, i.e., the value present before the ADD result. The program goes wrong! The possible
solutions before us are:
Solution 1: Introduce three bubbles at SUB instruction IF stage. This will facilitate SUB – ID to
function at t6. Subsequently, all the following instructions are also delayed in the pipe.
Solution 2: Data forwarding - Forwarding is passing the result directly to the functional unit
that requires it: a result is forwarded from the output of one unit to the input of another. The
purpose is to make available the solution early to the next instruction.
In this case, the ADD result is available at the output of the ALU in the ADD-IE stage, i.e., at the end of t3. If this
result can be captured and forwarded by the control unit to the SUB-IE stage at t4, before it is written into the destination
register R3, then the pipeline can go ahead without any stalling. This requires extra logic to
identify this data hazard and act upon it. Note that although the operand fetch normally
happens in the ID stage, the operand is only used in the IE stage; hence the forwarded value is supplied to the IE stage as
input. Similar forwarding can be done for the OR and AND instructions too.
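The effect of forwarding can be illustrated with a toy calculation. The 4-stage IF/ID/IE/RW model and the 3-bubble penalty follow the discussion above; everything else (the function, its interface, the distance rule) is an assumption of this sketch rather than a standard formula.

# Toy stall counter for a 4-stage IF/ID/IE/RW pipeline (model assumptions, not a real design).
# Without forwarding, a consumer must wait until the producer's result is written back;
# with forwarding, the IE result is passed straight to the consumer's IE stage.
def stalls_needed(distance, forwarding):
    """distance: how many instructions after the producer the consumer appears (1 = next)."""
    bubbles_without = 3     # consumer immediately after the producer needs 3 bubbles (Solution 1)
    bubbles_with = 0        # forwarding from the IE output removes the stall entirely
    needed = (bubbles_with if forwarding else bubbles_without) - (distance - 1)
    return max(needed, 0)

print(stalls_needed(distance=1, forwarding=False))  # 3 bubbles, as in Solution 1
print(stalls_needed(distance=1, forwarding=True))   # 0 bubbles with data forwarding
print(stalls_needed(distance=3, forwarding=False))  # 1 bubble if two independent instructions intervene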
Solution 3: Instruction reordering - the compiler may rearrange independent instructions so that
enough unrelated work separates the dependent instructions, hiding the hazard without stalling.
Solution 4: In the event that the above reordering is infeasible, the compiler may detect the hazard and
introduce NOP (no operation) instruction(s). A NOP is a dummy instruction, equivalent to a bubble,
introduced by the software.
The compiler looks into data dependencies in the code optimisation stage of the compilation
process.
Read After Write (RAW): This is a case where an instruction uses data produced by a previous one; it is the true data dependency illustrated in the example above.
Write After Read (WAR): This is a case where the second instruction writes to a register before the first instruction
reads it. This is rare in a simple pipeline structure; however, in some machines with
complex and special instructions, WAR can happen.
Write After Write (WAW): This is a case where two parallel instructions write to the same register and must do so in the
order in which they were issued.
WAW and WAR hazards can only occur when instructions are executed in parallel or out of
order. They occur because the compiler has allotted the same register numbers, although this is
avoidable. The situation is fixed by renaming one of the registers in the compiler, or by
delaying the updating of a register until the appropriate value has been produced.
Modern CPUs not only have incorporated Parallel execution with multiple ALUs but also Out
of order issue and execution of instructions along with many stages of pipelines.
Control Hazards
Control hazards are also called branch hazards and are caused by branch instructions. Branch
instructions control the flow of program execution. Recall that we use conditional
statements in a higher-level language either for iterative loops or for condition checking
(correlate with for, while, if, and case statements). These are transformed into one of the variants of
BRANCH instructions. It is necessary to know the value of the condition being checked to determine
the program flow. Life gets complicated for you, and so it does for the CPU!
Thus a Conditional hazard occurs when the decision to execute an instruction is based on the
result of another instruction like a conditional branch, which checks the condition’s resultant
value.
The branch and jump instructions decide the program flow by loading the appropriate location in
the Program Counter(PC). The PC has the value of the next instruction to be fetched and
executed by the CPU. Consider a sequence of instructions I1, I2, I3 in which I2 is an unconditional JMP.
In this case, there is no point in fetching I3. What happens to the pipeline? While I2 is in the pipe, the I3
fetch needs to be stopped. This can be known only after I2 is decoded as JMP and not until then.
So the pipeline cannot proceed at its speed and hence this is a Control Dependency (hazard). In
case I3 is fetched in the meantime, it is not only a redundant work but possibly some data in
registers might have got altered and needs to be undone.
1. Stall the pipeline as soon as any kind of branch instruction is decoded; simply do not allow
any more instruction fetches. As always, stalling reduces throughput. Statistics say that in a program
at least 30% of the instructions are branches, so with stalling the pipeline essentially operates at about 50%
capacity.
2. Prediction – Imagine a for or while loop getting executed 100 times. We know that 100
times the program flows on without the branch condition being met, and only on the
101st time does the program come out of the loop. So it is wiser to allow the pipeline to
proceed and to undo/flush only when the branch condition is met. This does not affect the
throughput of the pipeline as much as stalling does.
3. Dynamic Branch Prediction - A history record is maintained with the help of a Branch
Table Buffer (BTB). The BTB is a kind of cache with a set of entries, each holding the PC
address of a branch instruction and the corresponding effective branch target address. An entry is
maintained for every branch instruction encountered. So whenever a conditional branch
instruction is encountered, a lookup for the matching branch instruction address is done in the
BTB. If it hits, then the corresponding target branch address is used for fetching the
next instruction. This is called dynamic branch prediction (a small sketch of this lookup is given after this list).
This method is successful to the extent of the temporal locality of reference in the
programs. When the prediction fails, flushing needs to take place.
4. Reordering instructions - Delayed branch, i.e., reordering the instructions to position the
branch instruction later in the order, such that safe and useful instructions which are not
affected by the result of the branch are brought in earlier in the sequence, thus delaying the
branch instruction fetch. If no such instructions are available, then NOPs are introduced.
This delayed branch is applied with the help of the compiler.
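Here is the sketch of the Branch Table Buffer lookup mentioned in point 3 above. The addresses are invented, and a real BTB is an associative hardware table rather than a Python dictionary; the sketch only shows the lookup, the update on resolution, and the flush on a misprediction.

# Toy Branch Table Buffer: PC of a branch instruction -> predicted target address.
btb = {}

def fetch_next_pc(pc, branch_taken=None, actual_target=None):
    """Return the address to fetch next; update the BTB once the branch outcome is known."""
    predicted = btb.get(pc)                 # BTB lookup: hit -> use the stored target
    next_pc = predicted if predicted is not None else pc + 1
    if branch_taken:                        # once resolved, remember the taken target
        btb[pc] = actual_target
        if predicted != actual_target:      # misprediction: wrongly fetched work is flushed
            next_pc = actual_target
    return next_pc

print(hex(fetch_next_pc(0x100)))                                          # miss: predict fall-through 0x101
print(hex(fetch_next_pc(0x100, branch_taken=True, actual_target=0x200)))  # resolve: flush, go to 0x200
print(hex(fetch_next_pc(0x100)))                                          # hit: predicted target 0x200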
I do not want to load the reader with more timing-state diagrams, but I am sure the earlier
discussions have familiarised the reader enough to follow the description in words.
Last but not least, in a pipelined design, the control unit is expected to handle the following
scenarios:
No Dependence
Dependence requiring Stall
Dependence solution by Forwarding
Dependence with access in order
Out of Order Execution
Branch Prediction Table and more
COMBINATIONAL ELEMENTS
Operate on data.
The output is a function of the input.
The output depends only on the current inputs.
Used for the ALU, multipliers, and other datapath elements.
SEQUENTIAL ELEMENTS
Store information.
A state element stores the state of the circuit.
The output depends on the current inputs and the current state.
---> A clock signal determines when the stored value is updated.
To write new data into the register, we use a D flip-flop with a Write Enable input.
--->Write Enable:
0: On the clock edge the register simply reloads its own output, so the data in the register does not change.
1: New data is fed to the flip-flop and the register changes its state on the clock edge.
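This behaviour can be modelled in a few lines of Python (a behavioural sketch, not hardware description code): on a clock edge the register keeps its old value when Write Enable is 0 and loads the new data when Write Enable is 1.

# Behavioural model of a register built from D flip-flops with a Write Enable input.
class Register:
    def __init__(self, value=0):
        self.q = value                 # stored state (the flip-flop outputs)

    def clock_edge(self, d, write_enable):
        # On the clock edge: keep the old value (WE = 0) or load the new data (WE = 1).
        if write_enable:
            self.q = d
        return self.q

r = Register(value=7)
print(r.clock_edge(d=42, write_enable=0))   # 7  -> data in the register does not change
print(r.clock_edge(d=42, write_enable=1))   # 42 -> register changes its state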
CLOCKING METHODOLOGY
Defines when signals can be read and when they can be written
Mainstream: An edge triggered methodology
Determine when data is valid and stable relative to the clock
Typical execution:
A Hardwired Control Unit (HCU) is a type of control unit in a computer’s CPU that generates
control signals using fixed electronic circuits, such as combinational logic and sequential logic
gates. It directly decodes the instruction and controls the execution process through predefined
hardware paths.
1. Instruction Fetch: The control unit fetches the instruction from memory.
2. Instruction Decode: The instruction is decoded to determine the required operations.
3. Signal Generation: Control signals are generated using logic circuits.
4. Execution: The required control signals are sent to different CPU components (ALU,
Registers, Memory, etc.).
5. Next Instruction: The control unit moves to the next instruction.
✅ Faster Execution – Since it is implemented using hardware circuits, it operates faster than
microprogrammed control units.
✅ Efficient for Simple Instruction Sets – Works well for simple and fixed instruction sets such as RISC.
Hardwired control units are typically designed using finite state machines (FSMs) to handle
different instruction cycles.
Difficult to Modify – Changing the control logic requires redesigning the entire circuit.
Complex for CISC Architecture – Complex instruction sets require large and complicated
hardware.
Not Scalable – If new instructions need to be added, the whole unit may need redesigning.
Microinstructions are fetched from control memory, decoded, and executed in sequence to
generate the required control signals for CPU operations.
Easier to Modify & Update – New instructions can be added by modifying the control
memory without redesigning hardware.
Supports Complex Instructions – Ideal for CISC (Complex Instruction Set Computing)
architectures.
Simpler & Cost-Effective Design – Requires fewer logic circuits compared to a hardwired
control unit.
Better Fault Tolerance – Errors in control signals can be corrected by updating
microinstructions.
Design: The control signals are generated by a fixed set of logic gates and combinational
circuits. These gates are hardwired to perform specific tasks.
Speed: Hardwired designs are generally faster because they use fixed logic paths.
Flexibility: They are less flexible because any change in the instruction set or control
logic requires physical changes to the hardware.
Complexity: The complexity can increase with the number of instructions because the
control unit requires more combinational logic.
Example: Simple, early processors like the 8080.
Design: In this design, control signals are generated by a set of microinstructions stored
in a control memory (also called microcode). Each instruction in the instruction set has a
corresponding microinstruction sequence.
Speed: Micro-programmed control units are typically slower than hardwired ones
because fetching and decoding microinstructions from memory adds overhead.
Flexibility: They are more flexible because modifying the control logic can be done by
changing the microcode, rather than redesigning the hardware.
Complexity: The control unit is more complex, but easier to modify and expand.
Example: More complex processors like the IBM 360, VAX.
Summary of Differences:
Feature        | Hardwired Control Unit                     | Micro-programmed Control Unit
Control Logic  | Fixed, combinational circuits              | Stored microinstructions in memory
Speed          | Faster                                     | Slower (due to memory access for microinstructions)
Flexibility    | Less flexible, requires hardware changes   | More flexible, modified via microcode
Complexity     | Less complex for simpler tasks             | More complex but easier to update
Modification   | Difficult to modify once designed          | Easy to modify by changing the microcode
In essence, hardwired control units are faster but less adaptable, while micro-programmed units
are slower but offer more flexibility and ease of modification.
Multi-Core Processors
1. Multiple Cores: Each core can execute instructions independently, and the cores share
memory, but each core has its own processing unit.
2. Parallelism: It allows for parallel execution, meaning that multiple tasks can be
processed at once, increasing the overall throughput of the system.
3. Shared Resources: Cores typically share cache memory (L1, L2, or even L3 caches) and
sometimes the main system memory.
Cores: Each core can independently execute tasks. Multiple cores help in parallel
execution of tasks.
Cache: Each core has its own cache (L1 cache), which stores frequently used data.
Caches can be shared between cores (L2 or L3 cache).
Shared Memory: All cores typically have access to a shared memory space, allowing
them to share data and communicate.
I/O Devices: Input and output devices are connected to the system, and their data is
processed by the cores.
Interconnect Bus: This connects the cores and other components, allowing data
exchange between the processor, memory, and I/O devices.
1. Task Division: The software divides tasks into smaller sub-tasks (threads). These threads
are then distributed across different cores to execute in parallel.
2. Execution: Each core processes its assigned tasks independently, allowing the system to
perform more computations in the same amount of time compared to a single-core
processor.
3. Synchronization: The cores communicate with each other through shared memory or
interconnects, ensuring that the data is consistent and operations are synchronized.
4. Load Balancing: The operating system or scheduler may dynamically allocate tasks to
different cores based on their availability to optimize overall performance.
Parallel processing is a type of computing architecture in which several processes are executed
simultaneously. It's designed to increase the speed of computation and handle large-scale tasks
efficiently. Here are some basic concepts in parallel processing:
1. Types of Parallelism
1. Data Parallelism: This involves distributing subsets of data across multiple processors.
Each processor performs the same operation on its subset of data simultaneously.
2. Task Parallelism: This involves dividing a task into smaller, independent tasks that can
be executed simultaneously on different processors.
2. Concurrency
1. Refers to the ability of a system to handle multiple tasks at the same time. Unlike
parallelism, concurrency doesn’t necessarily imply simultaneous execution; it just allows
for tasks to be interleaved, giving the illusion of simultaneous processing.
3. Threads
1. A thread is the smallest unit of a CPU's execution. Multiple threads can exist within the
same process and share resources like memory space, which makes it easier to implement
parallelism. Multithreading allows for better resource utilization and can be crucial for
parallel processing.
1. Processor (CPU): A CPU can be a single chip or multiple chips that execute
instructions.
2. Cores: A modern CPU often has multiple cores. Each core can independently execute a
task, so multi-core processors can execute multiple tasks at once, thus supporting
parallelism.
1. Shared Memory: Multiple processors share the same memory space, making it easy to
communicate and share data. However, managing access to shared memory (avoiding
conflicts) is crucial in this setup.
2. Distributed Memory: Each processor has its own private memory. Communication
between processors must occur over a network. This setup is used in systems like clusters
and supercomputers.
6. Synchronization
1. In parallel processing, it’s important to manage the execution order of threads and
processes to avoid issues like data races, where multiple threads access shared data
simultaneously, causing unpredictable results. Synchronization mechanisms (e.g., locks,
semaphores, barriers) help ensure that tasks are executed in a controlled manner.
7. Load Balancing
1. It’s the process of distributing tasks evenly across processors to ensure no single
processor is overburdened. Load balancing helps to maximize resource utilization and
improve the overall performance of a parallel system.
8. Scalability
1. Scalability refers to the ability of a parallel system to handle increasing amounts of work
by adding more resources (e.g., processors or cores) without sacrificing performance. A
scalable parallel system maintains efficiency as the workload grows.
9. Amdahl's Law
1. Amdahl's Law estimates the maximum speedup obtainable when only a fraction of a program can be parallelized:
Speedup = 1 / ((1 - P) + P / N)
Where:
P = the fraction of the program that can be parallelized,
N = the number of processors,
(1 - P) = the serial fraction that cannot be parallelized.
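A quick check of the formula in Python; P = 0.9 and the processor counts are arbitrary example values.

# Amdahl's Law: speedup is limited by the serial fraction (1 - P) of the program.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# With P = 0.9 the speedup approaches 1 / (1 - 0.9) = 10, no matter how many processors are added.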
10. Distributed Computing
1. This is a form of parallel processing in which computing resources are spread across
multiple physical machines that communicate over a network. Examples include cloud
computing and grid computing.
1. Speedup: The ratio of the time taken to execute a task on a single processor to the time
taken when the task is parallelized and run on multiple processors.
Michael J. Flynn proposed a classification based on the number of instruction and data streams
in a system: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).
Distributed-Memory Systems
Each processor has its own local memory and communicates via message passing.
Advantages:
✅ Scalable to a large number of processors.
✅ Avoids memory contention.
Disadvantages:
✅ Requires complex communication protocols.
Example: Cluster Computing, MPI-based systems.
a) Fine-Grained Parallelism
Small units of work are executed in parallel, with frequent communication and synchronization between them.
b) Coarse-Grained Parallelism
Larger independent tasks are executed in parallel with less frequent synchronization.
Example: Distributed Computing (Hadoop, Spark).
2. Heterogeneous Computing
Topic:
Combines different types of processors (CPU, GPU, FPGA, TPU) for optimized
performance.
Used in High-Performance Computing (HPC) environments.
3. Quantum Computing
4. Neuromorphic Computing
5. Edge Computing
Moves computation closer to data sources (IoT devices, sensors) instead of centralized
cloud processing.
6. Fault-Tolerant Computing
Applications:
✅ Aerospace & Space Missions – NASA’s Mars rovers need fault tolerance.
✅ Financial Systems – Ensuring no transaction loss in banking.
✅ Cloud Computing – Data replication in distributed storage.