
Intel 80586 (Pentium)

The Pentium processor uses a superscalar architecture that allows it to perform multiple instructions per cycle through a dual pipeline design. It has separate caches for instructions and data and utilizes branch prediction to optimize instruction flow. The floating point unit is pipelined and executes instructions faster than previous processors. The Pentium 4 microarchitecture features significantly higher clock rates through deeper pipeline stages and a trace cache to improve instruction throughput.

Intel Pentium (80586) Internal Architecture / Pentium IV Microarchitecture
MPMC
Pentium Processors
• The term "Pentium processor" refers to a family of microprocessors that share a common architecture and
instruction set.
• Some of the features of Pentium architecture are:
• Complex Instruction Set Computer (CISC) architecture with Reduced Instruction Set Computer (RISC) performance.
• 64-Bit Bus
• Upward code compatibility.
• Pentium processor uses Superscalar architecture and hence can perform multiple instructions per cycle. (multiple pipeline)
• Pentium processor executes instructions in five stages which allows the processor to overlap multiple instructions so that it takes less time
to execute two instructions in a row.
• The Pentium processor has two separate 8-kilobyte (KB) caches on chip, one for instructions and one for data. It allows the Pentium
processor to fetch data and instructions from the cache simultaneously.
• The data cache uses a write-back policy: when data is modified, only the copy in the cache is changed. Main memory is updated only when
the Pentium processor replaces the modified data in the cache with a different set of data.
• The Pentium processor has been optimized to run critical instructions in fewer clock cycles than the 80486 processor.
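The cache behaviour described in the list above (memory is updated only when a modified line is replaced) can be sketched as follows. This is a minimal illustration, not the Pentium's actual cache organization: the direct-mapped layout, the one-word lines, the four-line size and the `WriteBackCache` name are all assumptions chosen for brevity.

```python
class WriteBackCache:
    """Toy direct-mapped write-back cache (hypothetical, one word per line)."""

    def __init__(self, memory, num_lines=4):
        self.memory = memory              # backing store: list of words
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # each entry: [address, value, dirty]
        self.memory_writes = 0            # counts write-backs to memory

    def _load(self, address):
        index = address % self.num_lines
        line = self.lines[index]
        if line is None or line[0] != address:
            # Miss: evict the old line, writing it back only if it is dirty.
            if line is not None and line[2]:
                self.memory[line[0]] = line[1]
                self.memory_writes += 1
            line = [address, self.memory[address], False]
            self.lines[index] = line
        return line

    def read(self, address):
        return self._load(address)[1]

    def write(self, address, value):
        line = self._load(address)
        line[1] = value
        line[2] = True                    # mark dirty; memory is not touched
```

Writing to an address changes only the cached copy; the backing `memory` list is updated later, when a different address that maps to the same line forces an eviction.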
Architecture
•Prefetch Buffers
•Dual Pipe (ALU)
•Registers
•Cache ( separate for code and data)
•Branch Prediction
•Floating Point Unit
80586-Introduction
• The 80586 (Pentium) was derived from the 80486 microprocessor and was introduced in 1993.
• Data bus of 64 bits and address bus of 32 bits (addresses up to 4 GB of physical memory).
• Superscalar architecture (2 instructions per clock cycle) → dual integer pipeline
• 273 pins
• Memory access time → 18 ns
• Supports a 4 MB page size in addition to the 4 KB page size of the 80486
• Maths coprocessor functions 5 times faster than in the 80486.
• Speed up to 110 MIPS (millions of instructions per second)
• Four 32-bit GPRs (EAX, EBX, ECX, EDX), two 32-bit index registers (ESI, EDI), two 32-bit pointer registers (ESP, EBP)
• 32-bit flag register (EFLAGS)
• Two ALUs, one FPU, two 8 KB caches (one for code and the other for data), prefetch buffers.
• It runs at a clock frequency of either 60 or 66 MHz and has 3.1 million transistors.
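Several of the figures above follow directly from the bus and page sizes. The short sketch below works through that arithmetic; the helper names `addressable_bytes` and `offset_bits` are illustrative only and are not taken from any Intel documentation.

```python
KB = 2 ** 10
MB = 2 ** 20
GB = 2 ** 30

def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address-bus width."""
    return 2 ** address_bits

def offset_bits(page_size_bytes):
    """Address bits used as the offset within a page (power-of-two sizes only)."""
    return page_size_bytes.bit_length() - 1

print(addressable_bytes(32) // GB, "GB addressable with a 32-bit address bus")
print(offset_bits(4 * KB), "offset bits for a 4 KB page")
print(offset_bits(4 * MB), "offset bits for a 4 MB page")
print(64 // 8, "bytes moved per transfer on a 64-bit data bus")
```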
Architecture
• Each cache memory has its own translation look-aside buffer (TLB), which is consulted on each virtual
address translation.
• Internally, the processor is 32-bit, i.e. it has 32-bit registers and a 32-bit fixed-point (integer)
arithmetic-logic unit.
• However, the data bus between the processor and main memory has been extended to 64 bits. This means
that data transfers between main memory and the cache memories run at twice the rate of other data
transfers inside the processor.
Cont..
• The processor has separate units for fixed-point and floating-point data processing.
• The fixed-point part contains two arithmetic-logic units, ALU U and ALU V, and two address
generation units (virtual address translation units, one for each ALU).
• The ALU U and ALU V units are pipelined and can work in parallel. Both have parallel (dual)
read access to the data cache.
• The floating point unit works as a co-processor of the fixed-point units, i.e. it executes the
floating-point instructions passed to it. The floating point unit is also pipelined (with 8 stages).
It has eight 80-bit floating-point registers. It works 7-10 times faster than the corresponding unit of the
80486 processor.
• For data accesses by the ALU U, ALU V and floating-point units, virtual address
translation is performed by the address generation units.
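To illustrate how two pipelines working in parallel raise throughput, the sketch below issues adjacent instructions together whenever the second does not depend on the first. This is a deliberate simplification: the real Pentium pairing rules are more restrictive, and the `(dest, sources)` instruction tuples and the `can_pair` / `issue_cycles` names are hypothetical.

```python
def can_pair(first, second):
    """Simplified rule: the second instruction may issue alongside the first
    if it neither reads nor writes the register written by the first."""
    dest1, _ = first
    dest2, sources2 = second
    return dest1 not in sources2 and dest1 != dest2

def issue_cycles(program):
    """Count issue cycles for a list of (dest, sources) instructions,
    issuing up to two per cycle (U pipe, then V pipe) in program order."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2          # both pipes are used this cycle
        else:
            i += 1          # only the U pipe is used this cycle
        cycles += 1
    return cycles
```

Under this model, four independent instructions issue in two cycles, while a chain in which each instruction reads the previous result issues one instruction per cycle.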
Modes
• The Pentium processor has two primary operating modes -
• Protected Mode - In this mode all instructions and architectural features are available,
providing the highest performance and capability. This is the recommended mode that
all new applications and operating systems should target.
• Real-Address Mode - This mode provides the programming environment of the Intel
8086 processor, with a few extensions. Reset initialization places the processor in real
mode where, with a single instruction, it can switch to protected mode.
Pipeline stages
• The Pentium's basic integer pipeline is five stages long, with the stages broken down
as follows:
• Pre-fetch/Fetch: Instructions are fetched from the instruction cache and aligned in pre-
fetch buffers for decoding.
• Decode1: Instructions are decoded into the Pentium's internal instruction format. Branch
prediction also takes place at this stage.
• Decode2: Same as above, and microcode ROM kicks in here, if necessary. Also, address
computations take place at this stage.
• Execute: The integer hardware executes the instruction.
• Write-back: The results of the computation are written back to the register file.
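The benefit of overlapping instructions across the five stages can be made concrete with an idealized cycle count. The sketch assumes one cycle per stage and no stalls, which real code will not always achieve; the stage abbreviations and function names are chosen here for illustration.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]

def unpipelined_cycles(n, stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return n * stages

def pipelined_cycles(n, stages=len(STAGES)):
    """A new instruction enters the pipeline every cycle (ideal, no stalls)."""
    return 0 if n == 0 else stages + (n - 1)

def schedule(n):
    """For each cycle, map instruction index -> stage it occupies."""
    table = []
    for cycle in range(pipelined_cycles(n)):
        row = {}
        for instr in range(n):
            stage = cycle - instr
            if 0 <= stage < len(STAGES):
                row[instr] = STAGES[stage]
        table.append(row)
    return table
```

For two instructions in a row, the unpipelined count is 10 cycles and the pipelined count is 6, because the second instruction is already being fetched while the first is in Decode1.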
Cont..
FPU
Floating Point Unit:
There are 8 general-purpose 80-bit floating point registers. The floating point unit has 8 stages of pipelining.
The first five are similar to those of the integer unit. Since the possibility of error is greater in the
floating point unit (FPU) than in the integer unit, the FPU has an additional error checking stage.
The floating point unit is shown here:

[Figure: block diagram of the floating point unit]

Where,
FRD - Floating Point Rounding
FDD - Floating Point Division
FADD - Floating Point Addition
FEXP - Floating Point Exponent
FAND - Floating Point AND
FMUL - Floating Point Multiply
P IV Microarchitecture
Features
• The Pentium 4 processor was introduced at 1.5GHz in November of 2000.
• It implements the new Intel NetBurst microarchitecture that features
significantly higher clock rates and world-class performance.
• The Pentium 4 processor has 42 million transistors implemented on Intel’s
0.18-micron CMOS process.
Architecture
•Front End: Fetches the instructions, decodes them and sends them
to the out-of-order execution core.
•There are three parts to it:
1.Fetch/Decode Unit
2.Execution Trace Cache
3.BTB/Branch Prediction
•Out-of-Order Engine: This is where the instructions are prepared
for execution.
•There are two parts to it:
1.Out-of-Order Execution Logic
-> Allows maximum utilization of the execution units
2.Retirement Unit
-> Ensures that the instructions are put back in program order.
•Integer and Floating-Point Units: This is where the
instructions are actually executed.
•It has two parts:
1.L1 data cache
2.Execution units
Cont..
• Memory Subsystem: It holds instructions and data in the Level 2 cache when they do not fit
in the trace cache and the L1 cache.
• It is also used to access main memory when the L2 cache has a cache miss, and to access
the system I/O resources.
• Clock Rates: The target clock rate determines the number of pipeline stages.
• A higher clock rate requires a deeper pipeline, which means more cycles are lost on a cache miss or a
mispredicted branch.
• But overall, higher clock rates boost performance.
Instruction flow
• Instruction flow inside an Intel Pentium 4 processor typically consists of the following Stages:
• Prefetch – anticipates what data will be used next and pulls it into the cache before it is needed. Prefetching is a
technique used in microprocessors to speed up the execution of a program by reducing wait states.
• L2 cache read – 2nd level cache that is read during the prefetch stage. Larger caches have better hit
rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with
small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking
the fastest, level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If that smaller cache
misses, the next fastest cache (level 2, L2) is checked, and so on, before external memory is checked.
• Instruction decode – interprets each instruction fetched from the L2 cache. The opcode
is decoded for the following steps and moved to the appropriate registers.
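The multi-level lookup order described above (L1 first, then L2, then external memory) can be sketched in a few lines. The dictionaries standing in for the caches have no capacity limit or replacement policy, and any latency values passed to `average_access_time` are placeholders, not Pentium 4 figures.

```python
def lookup(address, l1, l2, memory):
    """Check L1, then L2, then memory; return (value, level that served it).
    Lower levels are filled on the way back so the next access hits in L1."""
    if address in l1:
        return l1[address], "L1"
    if address in l2:
        l1[address] = l2[address]
        return l2[address], "L2"
    value = memory[address]
    l2[address] = value
    l1[address] = value
    return value, "memory"

def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_time):
    """Average access time for a two-level hierarchy: every access pays the
    L1 time, L1 misses also pay the L2 time, and L2 misses also pay memory."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory_time)
```

The formula shows the tradeoff the slide mentions: a small, fast L1 keeps the common case short, while the larger L2 reduces how often the full memory latency is paid.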
Cont..

• Branch predict – makes a prediction and provides the target address. The branch predictor is a digital circuit that
tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known
for sure. Without branch prediction, the processor would have to wait until the conditional
jump instruction has passed the execute stage before the next instruction can enter the fetch
stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to
guess whether the conditional jump is most likely to be taken or not taken. The branch
predictor keeps records of whether branches are taken or not taken. When it encounters a
conditional jump that has been seen several times before then it can base the prediction on
the history. The branch predictor may, for example, recognize that the conditional jump is
taken more often than not, or that it is taken every second time.
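One common textbook way to base a prediction on branch history is a two-bit saturating counter per branch. The sketch below shows that scheme only as an illustration of the idea; it is not a description of the Pentium 4's actual predictor, and the class name, the dictionary table and the initial counter value are assumptions.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}   # branch address -> counter value in 0..3

    def predict(self, address):
        # Unseen branches start at 1 (weakly not taken), an arbitrary choice.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 1)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[address] = counter

def accuracy(predictor, address, outcomes):
    """Fraction of outcomes predicted correctly, training after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(outcomes)
```

Because the counter must be wrong twice in a row before the prediction flips, a loop branch that is taken many times and then falls through once is mispredicted only at the start and at the exit.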
Cont..
• Trace cache write – decoded uops, including any uop branches, are written into the trace
cache. They are written into the trace cache in the expected order of execution, not
necessarily the order the macroinstructions appear in memory. Trace caches deal with lost
fetch bandwidth caused by branches. It is a structure that overcomes this partial fetch
problem by storing logically contiguous instructions (instructions which are adjacent in the
instruction stream) in physically contiguous storage. This way, the trace cache is able to
deliver multiple, non-contiguous instruction blocks each cycle. Generally, instructions are
added to trace caches in groups representing either individual basic blocks or dynamic
instruction traces. A dynamic trace ("trace path") contains only instructions whose results
are actually used, and eliminates instructions following taken branches (since they are not
executed).
Cont..
• Microinstructions: Before translation the machine language instructions are called
macroinstructions, and the smaller steps after translation are called
microinstructions. Most CISC architecture macroinstructions can be translated into
four or fewer uops. These translations are performed by decode logic on the
processor. However, some macroinstructions could require dozens of uops. The
translations for these macroinstructions are typically stored in a read-only memory
(ROM) built into the processor called the microcode. The microcode ROM contains
programs written in uops for executing complex macroinstructions. In addition, the
microcode contains programs to handle special events like resetting the processor
and handling interrupts and exceptions.
Cont..
• Microbranch predict – determines which uop should enter the pipeline
next. The processor actually maintains two instruction pointers. One holds the
address of the next macroinstruction to be read from the L2 cache by the
instruction prefetch. The other holds the address of the next uop to be read
from the trace cache. If the last uop was not a branch, the uop pointer is
simply incremented to point to the next group of uops in the trace cache. If
the last uop fetched was a branch, its address is sent to a trace cache BTB for
prediction.
Cont..
• Micro-op fetch and drive – Some uops read from the trace cache are actually pointers
to uop routines stored in the microcode ROM. The Pentium 4 pipeline allows for two
"drive" cycles where no computation is performed but data is simply traveling from one
part of the die to another. Designers attempt to create a
floorplan where blocks that communicate often are placed close together, but inevitably
every block cannot be right next to every other block with which it might communicate.
The presence of drive cycles in the pipeline shows how transistor speeds have increased
to the point where now simple wire delay is an important factor in determining a
processor's frequency.
Cont..
• Allocation – While the uops in the trace cache still reflect the original program order, it
is important to record this order before the uops enter the out-of-order portion of the
pipeline. Each uop is allocated an entry in the reorder buffer (ROB).
• Register rename – updates the register alias table (RAT) to determine which
physical registers hold the uop's source data and which will be used to store its result.
• Schedule & dispatch – When the oldest uop is read from the memory queue, it is
loaded into the memory scheduler. Uops can dispatch before older uops if their sources
are ready first.
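The register alias table update in the rename step can be sketched as a mapping from architectural to physical registers plus a free list. This is a generic illustration of renaming, not the Pentium 4's implementation: the `Renamer` class, the free-list policy and the register counts are assumptions, and physical registers are never returned to the free list here.

```python
class Renamer:
    """Toy register alias table (RAT) with a free list of physical registers."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i maps to physical register i.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        # Sources are looked up BEFORE the destination is remapped, so a uop
        # that reads and writes the same register sees the older value.
        phys_sources = [self.rat[s] for s in sources]
        phys_dest = self.free.pop(0)   # raises IndexError if none are free
        self.rat[dest] = phys_dest
        return phys_dest, phys_sources
```

Giving every result a fresh physical register is what lets a younger uop that writes `eax` dispatch ahead of an older uop that still needs the previous value of `eax`.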
Cont..
• Register file read –Values that are used in computations are those stored in the
register files. Most register files have at least two read/output ports and one write/input
port to accommodate sending two values to the ALU and receiving one result. To
control a read port we need to be able to specify a register number for the register to be
read. The width/number of bits read equals the number of bits per register.
• Execute & calculate flags – Flag values store information about the result such as
whether it was 0, negative, or an overflow. Any of the flag values can be a condition for
a later branch uop.
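As a worked example of flag calculation, the sketch below adds two 32-bit values and derives the zero, sign, carry and overflow conditions from the result. The function name and the dictionary layout are illustrative; the inputs are assumed to be unsigned integers in the range 0 to 2^32 - 1.

```python
MASK32 = 0xFFFFFFFF

def add32_with_flags(a, b):
    """Add two 32-bit values and compute the resulting condition flags."""
    result = (a + b) & MASK32
    flags = {
        "zero": result == 0,
        "sign": bool(result >> 31),          # top bit set = negative when signed
        "carry": (a + b) > MASK32,           # unsigned result did not fit
        # Signed overflow: both operands have the same sign bit,
        # and the result's sign bit differs from it.
        "overflow": bool(((~(a ^ b) & (a ^ result)) >> 31) & 1),
    }
    return result, flags
```

A later conditional branch uop would test one or more of these flags, for example jumping only when `zero` is set.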
Cont..
• Retire – Upon retirement, the uop's results are committed to the current correct
architectural state by updating the retirement RAT and all the resources allocated to
the instruction are released. The retirement logic is what reorders the instructions,
executed in an out-of-order manner, back to the original program order. This
retirement logic receives the completion status of the executed instructions from
the execution units and processes the results so that the proper architectural state is
committed (or retired) according to the program order. This logic also reports
branch history information to the branch predictors at the front end of the machine
so they can train with the latest known-good branch-history information.
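In-order retirement from a reorder buffer can be sketched with a queue: uops may complete in any order, but only the oldest entry is allowed to retire. The `ReorderBuffer` class and its method names are hypothetical, and the sketch leaves out the retirement RAT update and resource release mentioned above.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: allocate in program order, retire in program order."""

    def __init__(self):
        self.entries = deque()     # oldest uop at the left
        self.retired = []          # names of committed uops, in commit order

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        entry["done"] = True       # execution may finish out of order

    def retire(self):
        """Retire completed uops from the head only, stopping at the first
        unfinished uop, so commit order equals program order."""
        count = 0
        while self.entries and self.entries[0]["done"]:
            self.retired.append(self.entries.popleft()["name"])
            count += 1
        return count
```

If the two youngest uops finish first, `retire()` commits nothing until the oldest one also completes, at which point all three retire in their original order.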
END OF SLIDES
