@vtucode - in 21CS643 Module 2 2021 Scheme


MODULE-2

Processors and Memory Hierarchy
This chapter presents modern processor technology and the supporting memory hierarchy. We begin
with a study of instruction-set architectures including CISC and RISC, and we consider typical superscalar,
VLIW, superpipelined, and vector processors. The third section covers memory hierarchy and capacity
planning, and the final section introduces virtual memory, address translation mechanisms, and page
replacement methods.
Instruction-set processor architectures and logical addressing aspects of the memory hierarchy are
emphasized at the functional level. This treatment is directed toward the programmer or computer
science major. Detailed hardware designs for bus, cache, and main memory are studied in Chapter 5.
Instruction and arithmetic pipelines and superscalar and superpipelined processors are further treated
in Chapter 6.

4.1 ADVANCED PROCESSOR TECHNOLOGY

Architectural families of modern processors are introduced below, from processors used in
workstations or multiprocessors to those designed for mainframes and supercomputers.
Major processor families to be studied include the CISC, RISC, superscalar, VLIW, superpipelined, vector,
and symbolic processors. Scalar and vector processors are for numerical computations. Symbolic processors
have been developed for AI applications.

4.1.1 Design Space of Processors


Various processor families can be mapped onto a coordinated space of clock rate versus cycles per instruction
(CPI), as illustrated in Fig. 4.1. As implementation technology evolves rapidly, the clock rates of various
processors have moved from low to higher speeds toward the right of the design space. Another trend is that
processor manufacturers have been trying to lower the CPI rate using innovative hardware approaches.
Based on these trends, the mapping of processors in Fig. 4.1 reflects their implementation during the past
decade or so.
Figure 4.1 shows the broad CPI versus clock speed characteristics of major categories of current processors.
The two broad categories which we shall discuss are CISC and RISC. In the former category, at present there
is only one dominant presence, the x86 processor architecture; in the latter category, there are several
examples, e.g. Power series, SPARC, MIPS, etc.
The McGraw-Hill Companies

134 | Advanced Computer Architecture

[Figure: a plot of CPI (1 to 5) versus clock speed (GHz). The CISC region lies at higher CPI, the RISC region at lower CPI; VP marks vector processors. Multi-core, embedded, low-cost, low-power designs sit at lower clock speeds; high-performance designs at higher clock speeds.]

Fig. 4.1 CPI versus processor clock speed of major categories of processors

Under both CISC and RISC categories, products designed for multi-core chips, embedded applications, or
for low cost and/or low power consumption tend to have lower clock speeds. High performance processors
must necessarily be designed to operate at high clock speeds. The category of vector processors has been
marked VP; vector processing features may be associated with CISC or RISC main processors.

The Design Space Conventional processors like the Intel Pentium, M68040, older VAX/8600, IBM 390,
etc. fall into the family known as complex-instruction-set computing (CISC) architecture. With advanced
implementation techniques, the clock rate of today's CISC processors ranges up to a few GHz. The CPI of
different CISC instructions varies from 1 to 20. Therefore, CISC processors are at the upper part of the design
space.
Reduced-instruction-set computing (RISC) processors include SPARC, Power series, MIPS, Alpha,
ARM, etc. With the use of efficient pipelines, the average CPI of RISC instructions has been reduced to
between one and two cycles.
An important subclass of RISC processors are the superscalar processors, which allow multiple
instructions to be issued simultaneously during each cycle. Thus the effective CPI of a superscalar processor
should be lower than that of a scalar RISC processor. The clock rate of superscalar processors matches that
of scalar RISC processors.
The very long instruction word (VLIW) architecture can in theory use even more functional units than a
superscalar processor. Thus the CPI of a VLIW processor can be further lowered. Intel's i860 RISC processor
had VLIW architecture.
The processors in vector supercomputers use multiple functional units for concurrent scalar and vector
operations.
The effective CPI of a processor used in a supercomputer should be very low, positioned at the lower
right corner of the design space. However, the cost and power consumption increase appreciably if processor
design is restricted to the lower right corner. Some key issues impacting modern processor design will be
discussed in Chapter 13.
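A position in this design space translates directly into a throughput figure, since the native MIPS rate is roughly f / (CPI × 10^6). The following minimal sketch illustrates this with assumed, illustrative design points (the helper name and numbers are not from the text):

```python
def mips(clock_hz, cpi):
    """Native MIPS rate: (instructions per second) / 1e6."""
    return clock_hz / cpi / 1e6

# A CISC-like design point: high CPI despite a high clock.
print(mips(1e9, 4.0))   # 250.0 MIPS
# A RISC-like design point at the same clock: CPI close to 1.
print(mips(1e9, 1.2))   # about 833 MIPS
```

Moving right (higher clock) or down (lower CPI) in Fig. 4.1 both raise this rate, which is why the lower right corner is the high-performance region.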
Instruction Pipelines The execution cycle of a typical instruction includes four phases: fetch, decode,
execute, and write-back. These instruction phases are often executed by an instruction pipeline as demonstrated
in Fig. 4.2a. In other words, we can simply model an instruction processor by such a pipeline structure.
For the time being, we will use an abstract pipeline model for an intuitive explanation of various processor
classes. The pipeline, like an industrial assembly line, receives successive instructions from its input end and
executes them in a streamlined, overlapped fashion as they flow through.
A pipeline cycle is intuitively defined as the time required for each phase to complete its operation,
assuming equal delay in all phases (pipeline stages). Introduced below are the basic definitions associated
with instruction pipeline operations:

[Figure: timing diagrams of successive instructions flowing through fetch, decode, execute, and write-back stages, with time measured in base cycles (0 to 16): (a) execution in a base scalar processor; (b) underpipelined with two cycles per instruction issue; (c) underpipelined with twice the base cycle.]

Fig. 4.2 Pipelined execution of successive instructions in a base scalar processor and in two underpipelined cases (Courtesy of Jouppi and Wall; reprinted from Proc. ASPLOS, ACM Press, 1989)

(1) Instruction pipeline cycle: the clock period of the instruction pipeline.
(2) Instruction issue latency: the time (in cycles) required between the issuing of two adjacent instructions.
(3) Instruction issue rate: the number of instructions issued per cycle, also called the degree of a
superscalar processor.

(4) Simple operation latency: simple operations make up the vast majority of instructions executed
by the machine, such as integer adds, loads, stores, branches, moves, etc. On the contrary, complex
operations are those requiring an order-of-magnitude longer latency, such as divides, cache misses, etc.
These latencies are measured in number of cycles.
(5) Resource conflicts: this refers to the situation where two or more instructions demand use of the
same functional unit at the same time.
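The definitions above combine into a simple timing estimate for an in-order pipeline. The sketch below is a minimal idealized model (the helper name is illustrative, and it assumes no stalls beyond a fixed issue latency):

```python
def pipeline_finish_time(n, stages=4, issue_latency=1):
    """Cycle at which the last of n instructions completes: the last
    instruction is issued at cycle issue_latency * (n - 1), then needs
    `stages` more cycles to flow through fetch/decode/execute/write-back."""
    return issue_latency * (n - 1) + stages

# Base scalar processor (as in Fig. 4.2a): one issue per cycle, 4 stages.
print(pipeline_finish_time(10))                    # 13 cycles
# Underpipelined case (as in Fig. 4.2b): two cycles between issues.
print(pipeline_finish_time(10, issue_latency=2))   # 22 cycles
```

For a long instruction stream the startup term `stages` is amortized away, and throughput approaches one instruction per `issue_latency` cycles.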
A base scalar processor is defined as a machine with one instruction issued per cycle, a one-cycle latency
for a simple operation, and a one-cycle latency between instruction issues. The instruction pipeline can be fully
utilized if successive instructions can enter it continuously at the rate of one per cycle, as shown in Fig. 4.2a.
However, the instruction issue latency can be more than one cycle for various reasons (to be discussed
in Chapter 6). For example, if the instruction issue latency is two cycles per instruction, the pipeline can be
underutilized, as demonstrated in Fig. 4.2b.
Another underpipelined situation is shown in Fig. 4.2c, in which the pipeline cycle time is doubled by
combining pipeline stages. In this case, the fetch and decode phases are combined into one pipeline stage,
and execute and write-back are combined into another stage. This will also result in poor pipeline utilization.
The effective CPI rating is 1 for the ideal pipeline in Fig. 4.2a, and 2 for the case in Fig. 4.2b. In Fig. 4.2c,
the clock rate of the pipeline has been lowered by one-half. According to Eq. 1.3, either the case in Fig. 4.2b
or that in Fig. 4.2c will reduce the performance by one-half, compared with the ideal case (Fig. 4.2a) for the
base machine.
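The one-half claim can be checked numerically with the relation T = Ic × CPI / f (Eq. 1.3). The instruction count and base clock below are illustrative assumptions:

```python
def exec_time(instr_count, cpi, clock_hz):
    """Eq. 1.3: program execution time T = Ic * CPI / f, in seconds."""
    return instr_count * cpi / clock_hz

ic, f = 1_000_000, 1e9                # 1 M instructions at a 1 GHz base clock
t_ideal  = exec_time(ic, 1.0, f)      # Fig. 4.2a: CPI = 1
t_case_b = exec_time(ic, 2.0, f)      # Fig. 4.2b: issue latency 2, effective CPI = 2
t_case_c = exec_time(ic, 1.0, f / 2)  # Fig. 4.2c: stages combined, clock halved
print(t_case_b / t_ideal, t_case_c / t_ideal)   # 2.0 2.0: both halve performance
```

Doubling CPI and halving the clock are symmetric in Eq. 1.3, which is why the two underpipelined cases degrade performance identically.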
Figure 4.3 shows the data path architecture and control unit of a typical, simple scalar processor which
does not employ an instruction pipeline. Main memory, I/O controllers, etc. are connected to the external bus.

[Figure: the processor core contains the program counter (PC), instruction register (IR), ALU, and register file, linked by internal buses; the control unit issues control signals; address and data lines connect to the external bus.]

Fig. 4.3 Data path architecture and control unit of a scalar processor

The control unit generates control signals required for the fetch, decode, ALU operation, memory access,
and write result phases of instruction execution. The control unit itself may employ hardwired logic, or, as
was more common in older CISC-style processors, microcoded logic. Modern RISC processors employ
hardwired logic, and even modern CISC processors make use of many of the techniques originally developed
for high-performance RISC processors(1).

4.1.2 Instruction-Set Architectures
In this section, we characterize computer instruction sets and examine hardware features built into generic
RISC and CISC scalar processors. Distinctions between them are revealed. The boundary between RISC and
CISC architectures has become blurred in recent years. Quite a few processors are now built with hybrid
RISC and CISC features based on the same technology. However, the distinction is still rather sharp in
instruction-set architectures.
The instruction set of a computer specifies the primitive commands or machine instructions that a
programmer can use in programming the machine. The complexity of an instruction set is attributed to the
instruction formats, data formats, addressing modes, general-purpose registers, opcode specifications, and
flow control mechanisms used. Based on past experience in processor design, two schools of thought on
instruction-set architectures have evolved, namely, CISC and RISC.

Complex Instruction Sets In the early days of computer history, most computer families started with an
instruction set which was rather simple. The main reason for being simple then was the high cost of hardware.
The hardware cost has dropped and the software cost has gone up steadily in the past decades. Furthermore,
the semantic gap between HLL features and computer architecture has widened.
The net result at one stage was that more and more functions were built into the hardware, making the
instruction set large and complex. The growth of instruction sets was also encouraged by the popularity of
microprogrammed control in the 1960s and 1970s. Even user-defined instruction sets were implemented
using microcodes in some processors for special-purpose applications.
A typical CISC instruction set contains approximately 120 to 350 instructions using variable instruction/
data formats, uses a small set of 8 to 24 general-purpose registers (GPRs), and executes a large number
of memory reference operations based on more than a dozen addressing modes. Many HLL statements
are directly implemented in hardware/firmware in a CISC architecture. This may simplify the compiler
development, improve execution efficiency, and allow an extension from scalar instructions to vector and
symbolic instructions.

Reduced Instruction Sets After two decades of using CISC processors, computer designers began to
reevaluate the performance relationship between instruction-set architecture and available hardware/software
technology.
Through many years of program tracing, computer scientists realized that only 25% of the instructions of a
complex instruction set are frequently used, about 95% of the time. This implies that about 75% of hardware-
supported instructions often are not used at all. A natural question then popped up: Why should we waste
valuable chip area for rarely used instructions?
With low-frequency elaborate instructions demanding long microcodes to execute them, it might be more
advantageous to remove them completely from the hardware and rely on software to implement them. Even
if the software implementation was slow, the net result would be still a plus due to their low frequency
of appearance. Pushing rarely used instructions into software would vacate chip areas for building more
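The 25%/95% observation explains why removal is attractive: weighting CPI by dynamic frequency shows that rare, complex instructions contribute little to run time even when they are individually slow. The profile numbers below are illustrative assumptions, not measurements:

```python
# Illustrative dynamic-frequency profile (assumed numbers, not measured):
# a small core of simple opcodes dominates execution.
profile = {
    "simple core (25% of opcodes)":  {"dyn_freq": 0.95, "cpi": 1},
    "complex rest (75% of opcodes)": {"dyn_freq": 0.05, "cpi": 12},
}

avg_cpi = sum(v["dyn_freq"] * v["cpi"] for v in profile.values())
print(avg_cpi)   # 1.55: the rare complex opcodes add little run time,
                 # yet they occupy most of the decoder/microcode area
```

Even with a 12-cycle latency, the complex instructions add only 0.6 cycles to the average, while the silicon that implements them is mostly idle.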

(1) Fuller discussion of these basic architectural concepts can be found in Computer System Organisation, by Naresh Jotwani, Tata McGraw-Hill, 2009.

powerful RISC or superscalar processors, even with on-chip caches or floating-point units, and hardwired
control would allow faster clock rates.
A RISC instruction set typically contains fewer than 100 instructions with a fixed instruction format
(32 bits). Only three to five simple addressing modes are used. Most instructions are register-based. Memory
access is done by load/store instructions only. A large register file (at least 32 registers) is used to improve fast
context switching among multiple users, and most instructions execute in one cycle with hardwired control.
The resulting benefits include a higher clock rate and a lower CPI, which lead to higher processor
performance.
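Again using T = Ic × CPI / f from Eq. 1.3, a back-of-the-envelope comparison shows why the lower CPI can outweigh a longer RISC program. All numbers below are illustrative assumptions:

```python
def exec_time(instr_count, cpi, clock_hz):
    """T = Ic * CPI / f (Eq. 1.3), in seconds."""
    return instr_count * cpi / clock_hz

# Assume the CISC version of a program is 20% shorter, but runs at a
# far higher average CPI; clocks are taken as equal for the comparison.
t_cisc = exec_time(0.8e6, cpi=5.0, clock_hz=1e9)
t_risc = exec_time(1.0e6, cpi=1.3, clock_hz=1e9)
print(t_cisc / t_risc)   # about 3.08: RISC wins despite executing more instructions
```

The shorter CISC program would need an average CPI below 1.3 / 0.8, roughly 1.6, to break even under these assumptions.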

Architectural Distinctions Hardware features built into CISC and RISC processors are compared below.
Figure 4.4 shows the architectural distinctions between traditional CISC and RISC. Some of the distinctions
have since disappeared, however, because processors are now designed with features from both types.

[Figure: (a) the CISC architecture, with microprogrammed control via control memory and a unified cache between the processor and main memory; (b) the RISC architecture, with hardwired control and split instruction cache and data cache.]

Fig. 4.4 Distinctions between typical RISC and typical CISC processor architectures (Courtesy of Gordon Bell, 1989)

Conventional CISC architecture uses a unified cache for holding both instructions and data. Therefore,
they must share the same data/instruction path. In a RISC processor, separate instruction and data caches are
used with different access paths. However, exceptions do exist. In other words, CISC processors may also
use split caches.
The use of microprogrammed control was found in traditional CISC, and hardwired control in most RISC.
Thus control memory (ROM) was needed in earlier CISC processors, which slowed down the instruction
execution. However, modern CISC also uses hardwired control. Therefore, split caches and hardwired control
are not today exclusive to RISC machines.
Using hardwired control reduces the CPI effectively to one instruction per cycle if pipelining is carried out
perfectly. Some CISC processors also use split caches and hardwired control, such as the MC68040 and i586.
In Table 4.1, we compare the main features of typical RISC and CISC processors. The comparison
involves five areas: instruction sets, addressing modes, register file and cache design, expected CPI, and
control mechanisms. Clock rates of modern CISC and RISC processors are comparable.
The large number of instructions used in a CISC processor is the result of using variable-format
instructions (integer, floating-point, and vector data) and of using over a dozen different addressing modes.
Furthermore, with few GPRs, many more instructions access the memory for operands. The CPI is thus high
as a result of the long microcodes used to control the execution of some complex instructions.

On the other hand, most RISC processors use 32-bit instructions which are predominantly register-based.
With few simple addressing modes, the memory-access cycle is broken into pipelined access operations
involving the use of caches and working registers. Using a large register file and separate I- and D-caches
benefits internal data forwarding and eliminates unnecessary storage of intermediate results. With hardwired
control, the CPI is reduced to 1 for most RISC instructions. Most recently introduced processor families have
in fact been based on RISC architecture.

Table 4.1 Characteristics of Typical CISC and RISC Architectures

Instruction-set size and instruction formats:
  CISC: large set of instructions with variable formats (16-64 bits per instruction).
  RISC: small set of instructions with fixed (32-bit) format; most instructions are register-based.
Addressing modes:
  CISC: 12-24.
  RISC: limited to 3-5.
General-purpose registers and cache design:
  CISC: 8-24 GPRs, originally with a unified cache for instructions and data; recent designs also use split caches.
  RISC: large numbers (32-192) of GPRs, with mostly split data cache and instruction cache.
CPI:
  CISC: between 2 and 15.
  RISC: one cycle for almost all instructions and an average CPI below 1.5.
CPU control:
  CISC: earlier microcoded using control memory (ROM), but modern CISC also uses hardwired control.
  RISC: hardwired, without control memory.

4.1.3 CISC Scalar Processors


A scalar processor executes with scalar data. The simplest scalar processor executes integer instructions using
fixed-point operands. More capable scalar processors execute both integer and floating-point operations.
A modern scalar processor may possess both an integer unit and a floating-point unit, or even multiple such
units. Based on a complex instruction set, a CISC scalar processor can also use pipelined design.
However, the processor is often underpipelined as in the two cases shown in Figs. 4.2b and 4.2c. Major
causes of the underpipelined situations (Fig. 4.2b) include data dependence among instructions, resource
conflicts, branch penalties, and logic hazards, which will be studied in Chapter 6, and further in Chapter 12.
The case in Fig. 4.2c is caused by using a clock cycle which is greater than the simple operation latency. In
subsequent sections, we will show how RISC and superscalar techniques can be applied to improve pipeline
performance.

Representative CISC Processors In Table 4.2, three early representative CISC scalar processors are listed.
The VAX 8600 processor was built on a PC board. The i486 and M68040 were single-chip microprocessors.
These two processor families are still in use at present. We use these popular architectures to explain some
interesting features built into CISC processors. In any processor design, the designer attempts to achieve
higher throughput in the processor pipelines.

Both hardware and software mechanisms have been developed to achieve these goals. Due to the
complexity involved in a CISC processor, the most difficult task for a designer is to shorten the clock cycle to
match the simple operation latency. This problem is easier to overcome with a RISC architecture.

Example 4.1 The Digital Equipment VAX 8600 processor architecture
The VAX 8600 was introduced by Digital Equipment Corporation in 1985. This machine implemented a
typical CISC architecture with microprogrammed control. The instruction set contained about 300 instructions
with 20 different addressing modes. As shown in Fig. 4.5, the VAX 8600 executed the same instruction set,
ran the same VMS operating system, and interfaced with the same I/O buses (such as SBI and Unibus) as the
VAX 11/780.
The CPU in the VAX 8600 consisted of two functional units for concurrent execution of integer and
floating-point instructions. The unified cache was used for holding both instructions and data. There were
16 GPRs in the instruction unit. Instruction pipelining was built with six stages in the VAX 8600, as in most
CISC machines. The instruction unit prefetched and decoded instructions, handled branching operations, and
supplied operands to the two functional units in a pipelined fashion.

[Figure: the instruction unit (16 GPRs) prefetches and decodes instructions and supplies operands to the execution unit (integer ALU) and the floating-point unit; the memory control unit holds the TLB and a 16-KB cache and connects through bus control to main memory and the I/O subsystems. CPU = central processor unit; TLB = translation lookaside buffer; GPR = general-purpose register.]

Fig. 4.5 The VAX 8600 CPU, a typical CISC processor architecture (Courtesy of Digital Equipment Corporation, 1985)

A translation lookaside buffer (TLB) was used in the memory control unit for fast generation of a physical
address from a virtual address. Both integer and floating-point units were pipelined. The performance of the
processor pipelines relied heavily on the cache hit ratio and on minimal branching damage to the pipeline flow.
The CPI of a VAX 8600 instruction varied within a wide range from 2 cycles to as high as 20 cycles. For
example, both multiply and divide might tie up the execution unit for a large number of cycles. This was
caused by the use of long sequences of microinstructions to control hardware operations.
The general philosophy of designing a CISC processor is to implement useful instructions in hardware/
firmware, which may result in a shorter program length with a lower software overhead. However, this advantage
can only be obtained at the expense of a lower clock rate and a higher CPI, which may not pay off at all.
The VAX 8600 was improved from the earlier VAX-11 Series. The system was later further upgraded to
the VAX 9000 Series offering both vector hardware and multiprocessor options. All the VAX Series have
used a paging technique to allocate the physical memory to user programs.

CISC Microprocessor Families In 1971, the Intel 4004 appeared as the first microprocessor, based on a
4-bit ALU. Since then, Intel has produced the 8-bit 8008, 8080, and 8085. Intel's 16-bit processors appeared
in 1978 as the 8086, 8088, 80186, and 80286. In 1985, the 80386 appeared as a 32-bit machine. The 80486
and Pentium are the latest 32-bit processors in the Intel 80x86 family.
Motorola produced its first 8-bit microprocessor, the MC6800, in 1974, then moved to the 16-bit 68000
in 1979, and then to the 32-bit 68020 in 1984. Then came the MC68030 and MC68040 in the Motorola
MC680x0 family. National Semiconductor's 32-bit microprocessor NS32532 was introduced in 1988. These
CISC microprocessor families have been widely used in the personal computer (PC) industry, with the Intel x86
family dominating.
Over the last two decades, the parallel computer industry has built systems with a large number of open-
architecture microprocessors. Both CISC and RISC microprocessors have been employed in these systems.
One thing worthy of mention is the compatibility of new models with the old ones in each of the families.
This makes it easier to port software along the series of models.
Table 4.2 lists three typical CISC processors of the year 1990(2).

Table 4.2 Representative CISC Scalar Processors of year 1990

Instruction-set size and word length:
  Intel i486: 157 instructions, 32 bits. Motorola MC68040: 113 instructions, 32 bits. NS32532: 63 instructions, 32 bits.
Addressing modes:
  i486: 12. MC68040: 18. NS32532: 9.
Integer unit and GPRs:
  i486: 32-bit ALU with 8 registers. MC68040: 32-bit ALU with 16 registers. NS32532: 32-bit ALU with 8 registers.
On-chip cache(s) and MMUs:
  i486: 8-KB unified cache for both code and data. MC68040: 4-KB code cache and 4-KB data cache, with separate MMUs. NS32532: 512-B code cache and 1-KB data cache.
Floating-point unit, registers, and function units:
  i486: on-chip, with 8 FP registers; adder, multiplier, shifter. MC68040: on-chip, with 3 pipeline stages and 80-bit FP registers. NS32532: off-chip FPU, NS32381 or WTL 3164.
Pipeline stages:
  i486: 5. MC68040: 6. NS32532: 4.
Protection levels:
  i486: 4. MC68040: 2. NS32532: 2.
Memory organization and TLB/ATC entries:
  i486: segmented paging with 4 KB/page and 32 entries in TLB. MC68040: paging with 4 or 8 KB/page, 64 entries in each ATC. NS32532: paging with 4 KB/page, 64 entries.
Technology, clock rate, packaging, and year introduced:
  i486: CHMOS IV, 1.2M transistors, 25 MHz and 33 MHz, 168 pins, 1989. MC68040: 0.8-μm HCMOS, 1.2M transistors, 20 MHz and 40 MHz, 179 pins, 1990. NS32532: 1.25-μm CMOS, 370K transistors, 30 MHz, 125 pins, 1987.
Claimed performance:
  i486: 24 MIPS at 25 MHz, 30 MIPS at 60 MHz. MC68040: 20 MIPS at 25 MHz. NS32532: 15 MIPS at 30 MHz.
(2) Motorola microprocessors are at present built and marketed by the divested company Freescale.

Example 4.2 The Motorola MC68040 microprocessor architecture
The MC68040 is a 0.8-μm HCMOS microprocessor containing more than 1.2 million transistors, comparable
to the 80486. Figure 4.6 shows the MC68040 architecture. The processor implements over 100 instructions
using 16 general-purpose registers, a 4-Kbyte data cache, and a 4-Kbyte instruction cache, with separate
memory management units (MMUs) supported by an address translation cache (ATC), equivalent to the
TLB used in other systems. The data formats range from 8 to 80 bits, with provision for the IEEE floating-
point standard.

[Figure: an instruction memory unit (instruction ATC, instruction cache, and instruction MMU/cache/snoop controller) feeds the integer unit over the instruction bus and a 32-bit instruction address bus; a data memory unit (data ATC, data cache, and data MMU/cache/snoop controller) serves the integer unit and the floating-point unit over a 32-bit data bus; the integer unit pipeline runs fetch, decode, effective-address calculation, execute, and write-back, with bus control between the memory units and the external buses. IA = instruction address; DA = data address; EA = effective address; ATC = address translation cache; MMU = memory management unit.]

Fig. 4.6 Architecture of the MC68040 processor (Courtesy of Motorola Inc., 1991)

Eighteen addressing modes are supported, including register direct and indirect, indexing, memory
indirect, program counter indirect, absolute, and immediate modes. The instruction set includes data
movement, integer, BCD, and floating-point arithmetic, logical, shifting, bit-field manipulation, cache
maintenance, and multiprocessor communications, in addition to program and system control and memory
management instructions.

The integer unit is organized in a six-stage instruction pipeline. The floating-point unit consists of three
pipeline stages (details to be studied in Section 6.4.1). All instructions are decoded by the integer unit.
Floating-point instructions are forwarded to the floating-point unit for execution.
Separate instruction and data buses are used to and from the instruction and data memory units,
respectively. Dual MMUs allow interleaved fetch of instructions and data from the main memory. Both the
address bus and the data bus are 32 bits wide.
Three simultaneous memory requests can be generated by the dual MMUs, including data operand read
and write and instruction pipeline refill. Snooping logic is built into the memory units for monitoring bus
events for cache invalidation.
Complete memory management is provided, with support for a virtual demand-paged operating system.
Each of the two ATCs has 64 entries, providing fast translation from virtual address to physical address. With
the CISC complexity involved, the MC68040 does not provide delayed branch hardware support, which is
often found in RISC processors like Motorola's own M88100 microprocessor.

4.1.4 RISC Scalar Processors


Generic RISC processors are called scalar RISC because they are designed to issue one instruction per
cycle, similar to the base scalar processor shown in Fig. 4.2a. In theory, both RISC and CISC scalar
processors should perform about the same if they run with the same clock rate and with equal program
length. However, these two assumptions are not always valid, as the architecture affects the quality and
density of code generated by compilers.
The RISC design gains its power by pushing some of the less frequently used operations into software.
The reliance on a good compiler is much more demanding in a RISC processor than in a CISC processor.
Instruction-level parallelism is exploited by pipelining in both processor architectures.
Without a high clock rate, a low CPI, and good compilation support, neither CISC nor RISC can perform
well as designed. The simplicity introduced with a RISC processor may lead to the ideal performance of the
base scalar machine modeled in Fig. 4.2a.

Representative RISC Processors Four representative RISC-based processors from the year 1990, the
Sun SPARC, Intel i860, Motorola M88100, and AMD 29000, are summarized in Table 4.3. All of these
processors use 32-bit instructions. The instruction sets consist of 51 to 124 basic instructions. On-chip
floating-point units are built into the i860 and M88100, while the SPARC and AMD use off-chip floating-
point units. We consider these four processors as generic scalar RISC, issuing essentially only one instruction
per pipeline cycle.
Among the four scalar RISC processors, we choose to examine the Sun SPARC and i860 architectures
below. SPARC stands for scalable processor architecture. The scalability of the SPARC architecture refers
to the use of a different number of register windows in different SPARC implementations.
This is different from the M88100, where scalability refers to the number of special function units (SFUs)
implementable on different versions of the M88000 processor. The Sun SPARC is derived from the original
Berkeley RISC design.

Table 4.3 Representative RISC Scalar Processors of year 1990

Instruction set, formats, and addressing:
  Sun SPARC CY7C601: 69 instructions, 32-bit format, 7 data types, 4-stage instruction pipeline. Intel i860: 82 instructions, 32-bit format, 4 data types, addressing modes. Motorola M88100: 51 instructions, 3 instruction formats, 4 addressing modes. AMD 29000: 112 instructions, 32-bit format, all registers-indirect addressing.
Integer unit and GPRs:
  CY7C601: 32-bit RISC IU, 136 registers divided into 8 windows. i860: 32-bit RISC core, 32 GPRs. M88100: 32-bit IU with 32 GPRs and scoreboarding. AMD 29000: 32-bit IU with 192 registers, without windows.
Caches, MMU, and memory organization:
  CY7C601: off-chip cache/MMU on CY7C604 with 64-entry TLB. i860: 4-KB code and 8-KB data caches, on-chip MMU, paging with 4 KB/page. M88100: off-chip M88200 caches/MMUs, segmented paging, 16-KB cache. AMD 29000: on-chip MMU with 32-entry TLB, 4-word prefetch buffer and 512-B branch target cache.
Floating-point unit:
  CY7C601: off-chip FPU on CY7C602, 32 FP registers and function pipeline (equivalent to TI 8848). i860: on-chip 64-bit FP multiplier and FP adder with 32 FP registers; 3-D graphics unit. M88100: on-chip FPU adder and multiplier with 32 FP registers, 64-bit arithmetic. AMD 29000: off-chip FPU on AMD 29027; on-chip FP registers and FPU with AMD 29050.
Operation modes:
  CY7C601: concurrent IU and FPU operations. i860: allows dual instructions and dual FP operations. M88100: concurrent IU, FPU, and memory access, with delayed branch. AMD 29000: 4-stage pipeline processor.
Technology, clock rate, packaging, and year:
  CY7C601: 0.8-μm CMOS IV, 33 MHz, 207 pins, 1989. i860: 1-μm CHMOS IV, over 1M transistors, 40 MHz, 168 pins, 1989. M88100: 1-μm HCMOS, 1.2M transistors, 20 MHz, 180 pins, 1988. AMD 29000: 1.2-μm CMOS, 30 MHz and 40 MHz, 169 pins, 1988.
Claimed performance:
  CY7C601: 24 MIPS for the 33 MHz version, 50 MIPS for the 80 MHz ECL version; up to 32 register windows can be built. i860: 40 MIPS and 60 Mflops at 40 MHz; i860/XP announced in 1992 with 2.5M transistors. M88100: 17 MIPS and 6 Mflops at 20 MHz; up to 2 special function units could be configured. AMD 29000: 27 MIPS at 40 MHz; new version AMD 29050 at 55 MHz in 1990.

Example 4.3 The Sun Microsystems SPARC architecture

The SPARC has been implemented by a number of licensed manufacturers, as summarized in Table 4.4.
Different technologies and window numbers are used by different SPARC manufacturers. Data presented is
from around the year 1990.
Processors and Memory Hierarchy | 145

Table 4.4 SPARC Implementations by Licensed Manufacturers (1990)

Cypress CY7C601 IU:
Technology: 0.8-μm CMOS IV, 207 pins.
Clock rate: 33 MHz; claimed VAX MIPS: 24.
Remarks: CY7C602 FPU with 4.5 Mflops DP Linpack, CY7C604 cache/MMU, CY7C157 cache.

Fujitsu MB86901 IU:
Technology: 1.2-μm CMOS, 179 pins.
Clock rate: 25 MHz; claimed VAX MIPS: 15.
Remarks: MB86911 FPC and TI 8847 FPP, plus MB86920 MMU; 2.7 Mflops DP Linpack by FPU.

LSI Logic L64811:
Technology: 1.0-μm HCMOS, 179 pins.
Clock rate: 33 MHz; claimed VAX MIPS: 20.
Remarks: L64814 FPU, L64815 MMU.

TI 8846:
Technology: 0.8-μm CMOS.
Clock rate: 33 MHz.
Remarks: 42 Mflops DP Linpack on TI 8847 FPP.

BIT B-3100 IU:
Technology: ECL family.
Clock rate: 80 MHz; claimed VAX MIPS: 50.
Remarks: 15 Mflops DP Linpack on FPUs: B-3120 ALU, B-3611 FP Multiply/Divide.

At the time, all of these manufacturers implemented the floating-point unit (FPU) on a separate coprocessor
chip. The SPARC processor architecture contains essentially a RISC integer unit (IU) implemented with 2
to 32 register windows.
We choose to study the SPARC family chips produced by Cypress Semiconductor, Inc. Figure 4.7 shows
the architecture of the Cypress CY7C601 SPARC processor and of the CY7C602 FPU. The Sun SPARC
instruction set contains 69 basic instructions, a significant increase from the 39 instructions in the original
Berkeley RISC II instruction set.
The SPARC runs each procedure with a set of thirty-two 32-bit IU registers. Eight of these registers are
global registers shared by all procedures, and the remaining 24 are window registers associated with
each procedure. The concept of using overlapped register windows is the most important feature introduced
by the Berkeley RISC architecture.
The concept is illustrated in Fig. 4.8 for eight overlapping windows (formed with 64 local registers and
64 overlapped registers) and eight globals, with a total of 136 registers, as implemented in the Cypress 601.
Each register window is divided into three eight-register sections, labeled Ins, Locals, and Outs. The local
registers are only locally addressable by each procedure. The Ins and Outs are shared among procedures.
The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers, which are
the Ins registers of the called procedure. The window of the currently running procedure is called the active
window, pointed to by a current window pointer. A window invalid mask is used to indicate which windows are
invalid. The trap base register serves as a pointer to a trap handler.
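The overlap mechanics described above, where a caller's Outs become the callee's Ins when the current window pointer slides, can be sketched in Python. This is a toy model using the Cypress 601 parameters (8 windows, 8 globals, 136 physical registers), not actual SPARC hardware logic; the mapping function and the direction of the window slide are illustrative assumptions.

```python
# Toy model of SPARC-style overlapping register windows (illustrative
# sketch, not actual CY7C601 logic). Parameters follow the Cypress 601:
# 8 windows and 8 globals give 8 + 8*16 = 136 physical registers.

NWINDOWS = 8
GLOBALS = 8
PHYS_REGS = GLOBALS + NWINDOWS * 16     # 136, as in the Cypress 601

def phys(cwp, r):
    """Map architectural register r (0..31) of window cwp to a physical slot."""
    w = cwp % NWINDOWS
    if r < 8:                           # r0..r7: globals, shared by all windows
        return r
    if r < 16:                          # r8..r15: Outs, shared with next window's Ins
        return GLOBALS + ((w + 1) % NWINDOWS) * 16 + (r - 8)
    if r < 24:                          # r16..r23: Locals, private to this window
        return GLOBALS + w * 16 + 8 + (r - 16)
    return GLOBALS + w * 16 + (r - 24)  # r24..r31: Ins

regs = [0] * PHYS_REGS
cwp = 0
regs[phys(cwp, 8)] = 42                 # caller writes a parameter into its Out r8
cwp += 1                                # a call slides the window (like SPARC save)
received = regs[phys(cwp, 24)]          # callee reads the same value from its In r24
```

Because the caller's Out r8 and the callee's In r24 map to the same physical slot, `received` is 42 without any copying, which is exactly the point of the overlap.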

[Figure: (a) the Cypress CY7C601 SPARC processor, with a register file (136 x 32), arithmetic and logic unit, program counters, processor state, window invalid, and trap base registers, and address/instruction paths to memory; (b) the Cypress CY7C602 floating-point unit, with an FP register file (32 x 32), a 64-bit pipelined floating-point processor, and an instruction/control interface.]

Fig. 4.7 The SPARC architecture with the processor and the floating-point unit on two separate chips (Courtesy of Cypress Semiconductor Co., 1991)

A special register is used to create a 64-bit product in multiple-step instructions. Procedures can also be
called without changing the window. The overlapping windows can significantly save the time required for
interprocedure communications, resulting in much faster context switching among cooperative procedures.

The FPU features 32 single-precision (32-bit) or 16 double-precision (64-bit) floating-point registers
(Fig. 4.7b). Fourteen of the 69 SPARC instructions are for floating-point operations. The SPARC architecture
implements three basic instruction formats, all using a single word length of 32 bits.

[Figure: (a) three overlapping register windows and the globals registers: each window holds Ins (r24 to r31), Locals (r16 to r23), and Outs (r8 to r15), with the previous window's Outs overlapping the active window's Ins, and globals r0 to r7 shared by all; (b) eight register windows w0 to w7 forming a circular stack, with the CWP and WIM pointers.]

Fig. 4.8 The concept of overlapping register windows in the SPARC architecture (Courtesy of Sun Microsystems, Inc., 1987)

Table 4.4 shows the MIPS rate relative to that of the VAX 11/780, which has been used as a reference
machine rated at 1 MIPS. The 50-MIPS rate is the result of ECL implementation with an 80-MHz clock. A GaAs
SPARC was reported to yield a 200-MIPS peak at a 200-MHz clock rate.

Example 4.4 The Intel i860 processor architecture
In 1989, Intel Corporation introduced the i860 microprocessor. It was a 64-bit RISC processor fabricated on
a single chip containing more than 1 million transistors. The peak performance of the i860 was designed to
reach 80 Mflops single-precision or 60 Mflops double-precision, or 40 MIPS in 32-bit integer operations at
a 40-MHz clock rate.
A schematic block diagram of major components in the i860 is shown in Fig. 4.9. There were nine
functional units (shown in nine boxes) interconnected by multiple data paths with widths ranging from 32 to
128 bits.

[Figure: the nine functional units of the i860: instruction cache (4 Kbytes), data cache (8 Kbytes), memory management unit, bus control unit, RISC integer unit (core registers), floating-point control unit (FP registers), pipelined adder unit, pipelined multiplier unit, and graphics unit (merge register), connected by 32- to 128-bit internal data paths and a 32-bit external address bus with a 64-bit external data bus.]

Fig. 4.9 Functional units and data paths of the Intel i860 RISC microprocessor (Courtesy of Intel Corporation, 1990)

All external or internal address buses were 32 bits wide, and the external data path or internal data bus was
64 bits wide. However, the internal RISC integer ALU was only 32 bits wide. The instruction cache had
4 Kbytes organized as a two-way set-associative memory with 32 bytes per cache block. It transferred 64 bits
per clock cycle, equivalent to 320 Mbytes/s at 40 MHz.
The data cache was a two-way set-associative memory of 8 Kbytes. It transferred 128 bits per clock cycle
(640 Mbytes/s) at 40 MHz. A write-back policy was used. Caching could be inhibited by software, if needed.
The bus control unit coordinated the 64-bit data transfer between the chip and the outside world.
The MMU implemented protected 4-Kbyte paged virtual memory of 2^32 bytes via a TLB. The paging and
MMU structure of the i860 was identical to that implemented in the i486. An i860 and an i486 could be used
jointly in a heterogeneous multiprocessor system, permitting the development of compatible OS kernels. The
RISC integer unit executed load, store, integer, bit, and control instructions and fetched instructions for the
floating-point control unit as well.
There were two floating-point units, namely, the multiplier unit and the adder unit, which could be used
separately or simultaneously under the coordination of the floating-point control unit. Special dual-operation
floating-point instructions such as add-and-multiply and subtract-and-multiply used both the multiplier and
adder units in parallel (Fig. 4.10).

[Figure: the multiply unit forms Kr x Src2 while the add unit combines that product with Src1, so that each cycle can deliver a chained result Kr x Src2 + Src1.]

Fig. 4.10 Dual floating-point operations in the i860 processor
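The dual-operation chaining shown in Fig. 4.10 can be sketched as a small Python model. This is illustrative only, not the actual i860 instruction semantics; the function name and the two-stage timing are assumptions, but it shows how the multiplier's result Kr x Src2 is chained into the adder along with Src1 so that one combined result emerges per cycle once the chain is full.

```python
# Sketch (not actual i860 semantics) of a dual-operation FP instruction:
# the multiplier and adder work in parallel each cycle, with the
# multiplier's result chained into the adder, yielding one
# Kr*src2 + src1 result per cycle once the two-stage chain is full.

def dual_op_stream(kr, src1s, src2s):
    """Feed operand pairs through a multiply unit chained into an add unit."""
    mul_stage = None                 # (product, src1) currently in the multiplier
    results = []
    for s1, s2 in zip(src1s, src2s):
        if mul_stage is not None:
            prod, pending_s1 = mul_stage
            results.append(prod + pending_s1)   # adder consumes last product
        mul_stage = (kr * s2, s1)               # multiplier starts a new product
    if mul_stage is not None:                   # drain the chain
        prod, pending_s1 = mul_stage
        results.append(prod + pending_s1)
    return results

out = dual_op_stream(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# each element of out is Kr*src2 + src1
```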

Furthermore, both the integer unit and the floating-point control unit could execute concurrently. In this
sense, the i860 was also a superscalar RISC processor capable of executing two instructions, one integer
and one floating-point, at the same time. The floating-point unit conformed to the IEEE 754 floating-point
standard, operating with single-precision (32-bit) and double-precision (64-bit) operands.
The graphics unit executed integer operations corresponding to 8-, 16-, or 32-bit pixel data types. This
unit supported three-dimensional drawing in a graphics frame buffer, with color intensity, shading, and
hidden surface elimination. The merge register was used only by vector integer instructions. This register
accumulated the results of multiple addition operations.

The i860 executed 82 instructions, including 42 RISC integer, 24 floating-point, 10 graphics, and
6 assembler pseudo-operations. All the instructions executed in one cycle, i.e. 25 ns for a 40-MHz clock
rate. The i860 and its successor, the i860XP, were used in floating-point accelerators, graphics subsystems,
workstations, multiprocessors, and multicomputers. However, due to the market dominance of Intel's own
x86 family, the i860 was subsequently withdrawn from production.

The RISC Impact: The debate between RISC and CISC designers lasted for more than a decade. Based on
Eq. 1.3, it seems that RISC will outperform CISC if the program length does not increase dramatically. Based
on one reported experiment, converting from a CISC program to an equivalent RISC program increases the
code length (instruction count) by only 40%.
Of course, the increase depends on program behavior, and the 40% increase may not be typical of all
programs. Nevertheless, the increase in code length is much smaller than the increase in clock rate and the
reduction in CPI. Thus the intuitive reasoning in Eq. 1.3 prevails in both cases, and in fact the RISC approach
has proved its merit.
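The reasoning above can be made concrete with the performance equation of Eq. 1.3, T = Ic x CPI / f. The numbers below are illustrative assumptions, not measured data: even with 40% more instructions, a large CPI reduction and a higher clock rate give the RISC machine the shorter execution time.

```python
# Illustrative (made-up) numbers for the CISC-vs-RISC argument based on
# T = Ic * CPI / f: 40% more RISC instructions, but much lower CPI and a
# faster clock, still yield a shorter execution time.

def exec_time(instr_count, cpi, clock_hz):
    """Execution time T = Ic * CPI / f (Eq. 1.3)."""
    return instr_count * cpi / clock_hz

cisc = exec_time(1_000_000, cpi=4.0, clock_hz=33e6)   # hypothetical CISC run
risc = exec_time(1_400_000, cpi=1.3, clock_hz=50e6)   # 40% more instructions

speedup = cisc / risc                                 # RISC advantage, ~3.3x here
```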
Further processor improvements include full 64-bit architecture, multiprocessor support such as snoopy
logic for cache coherence control, faster interprocessor synchronization or hardware support for message
passing, and special-function units for I/O interfaces and graphics support.
The boundary between RISC and CISC architectures has become blurred because both are now imple-
mented with the same hardware technology. For example, starting with the VAX 9000, Motorola 88100, and
Intel Pentium, CISC processors are also built with mixed features taken from both the RISC and CISC camps.
Further discussion of relevant issues in processor design will be continued in Chapter 13.

SUPERSCALAR AND VECTOR PROCESSORS


A CISC or a RISC scalar processor can be improved with a superscalar or vector architecture.
Scalar processors are those executing one instruction per cycle. Only one instruction is issued
per cycle, and only one completion of instruction is expected from the pipeline per cycle.
In a superscalar processor, multiple instructions are issued per cycle and multiple results are generated per
cycle. A vector processor executes vector instructions on arrays of data; each vector instruction involves a
string of repeated operations, which are ideal for pipelining with one result per cycle.

4.2.1 Superscalar Processors


Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only
independent instructions can be executed in parallel without causing a wait state. The amount of instruction-
level parallelism varies widely depending on the type of code being executed.
It has been observed that the average value is around 2 for code without loop unrolling. Therefore, for these
codes there is not much benefit gained from building a machine that can issue more than three instructions
per cycle. The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.

Pipelining in Superscalar Processors: The fundamental structure of a three-issue superscalar pipeline
is illustrated in Fig. 4.11. Superscalar processors were originally developed as an alternative to vector
processors, with a view to exploit a higher degree of instruction-level parallelism.

[Figure: three instruction pipelines, each with fetch, decode, execute, and write-back stages, issuing three instructions per base cycle.]

Fig. 4.11 A superscalar processor of degree m = 3

A superscalar processor of degree m can issue m instructions per cycle. In this sense, the base scalar
processor, implemented either in RISC or CISC, has m = 1. In order to fully utilize a superscalar processor of
degree m, m instructions must be executable in parallel. This situation may not be true in all clock cycles. In
that case, some of the pipelines may be stalling in a wait state.
In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar
processor. Due to the desire for a higher degree of instruction-level parallelism in programs, the superscalar
processor depends more on an optimizing compiler to exploit parallelism. Table 4.5 lists some landmark
examples of superscalar processors from the early 1990s.
A typical superscalar architecture for a RISC processor is shown in Fig. 4.12.
The instruction cache supplies multiple instructions per fetch. However, the actual number of instructions
issued to various functional units may vary in each cycle. The number is constrained by data dependences
and resource conflicts among instructions that are simultaneously decoded. Multiple functional units are built
into the integer unit and into the floating-point unit.
Multiple data buses exist among the functional units. In theory, all functional units can be simultaneously
used if conflicts and dependences do not exist among them during a given cycle.
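A toy model can show how data dependences cap the number of instructions actually issued per cycle on an m-issue machine. This greedy in-order issue sketch is a simplification (no resource conflicts, single-cycle results), not any real processor's issue logic.

```python
# Toy in-order issue model (illustrative, not a real processor): each
# cycle up to m instructions issue, but an instruction whose source is
# produced by an instruction issued in the same cycle must wait.

def cycles_to_issue(instrs, m):
    """instrs: list of (dest, sources) tuples in program order."""
    cycles, i = 0, 0
    while i < len(instrs):
        issued_dests, slots = set(), 0
        while i < len(instrs) and slots < m:
            dest, srcs = instrs[i]
            if any(s in issued_dests for s in srcs):
                break                 # depends on a result from this same cycle
            issued_dests.add(dest)
            slots += 1
            i += 1
        cycles += 1
    return cycles

# Four independent instructions on a 3-issue machine finish issuing in 2 cycles;
# a dependent chain serializes to one instruction per cycle.
indep = [('a', ()), ('b', ()), ('c', ()), ('d', ())]
chain = [('a', ()), ('b', ('a',)), ('c', ('b',))]
```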
Representative Superscalar Processors: A number of commercially available processors have been
implemented with the superscalar architecture. Notable early ones include the IBM RS/6000, DEC Alpha
21064, and Intel i960CA processors, as summarized in Table 4.5. Due to the reduced CPI and higher clock
rates used, superscalar processors generally outperform scalar processors.
The maximum number of instructions issued per cycle ranges from two to five in these superscalar
processors. Typically, the register files in the IU and FPU each have 32 registers. Most superscalar processors
implement both the IU and the FPU on the same chip. The superscalar degree is low due to limited instruction
parallelism that can be exploited in ordinary programs.
Besides the register files, reservation stations and reorder buffers can be used to establish instruction
windows. The purpose is to support instruction lookahead and internal data forwarding, which are needed
to schedule multiple instructions simultaneously. We will discuss the use of these mechanisms in Chapter 6,
where advanced pipelining techniques are studied, and further in Chapter 12.
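The in-order-retirement role of a reorder buffer can be sketched as follows. This is a minimal illustrative model (tags and values only, with no exception handling or register renaming), not a real design: results may complete out of order, but they become architecturally visible strictly in program order.

```python
# Minimal reorder-buffer sketch (illustrative, not a real design).
from collections import OrderedDict

class ReorderBuffer:
    def __init__(self):
        self.entries = OrderedDict()   # tag -> result, or None while pending

    def dispatch(self, tag):
        self.entries[tag] = None       # allocate an entry in program order

    def complete(self, tag, value):
        self.entries[tag] = value      # completion may happen out of order

    def retire(self):
        """Pop results from the head only while they are ready (in order)."""
        retired = []
        while self.entries and next(iter(self.entries.values())) is not None:
            retired.append(self.entries.popitem(last=False))
        return retired

rob = ReorderBuffer()
for t in ('i1', 'i2', 'i3'):
    rob.dispatch(t)
rob.complete('i3', 30)        # i3 finishes first...
early = rob.retire()          # ...but retires nothing: i1 and i2 still pending
rob.complete('i1', 10)
rob.complete('i2', 20)
done = rob.retire()           # now all three retire, in program order
```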

Table 4.5 Representative Superscalar Processors (circa 1990)

Intel i960CA:
Technology, clock rate, year: 25 MHz, 1989.
Functional units and multiple instruction issue: issue up to 3 instructions (register, memory, and control) per cycle; seven functional units available for concurrent use.
Registers, caches, MMU, address space: 1-KB I-cache, 1.5-KB RAM, 4-channel I/O with DMA, parallel decode, multiported registers.
Floating-point unit and functions: on-chip FPU, fast multimode interrupt, multitask control.
Claimed performance and remarks: 30 VAX/MIPS peak at 25 MHz; real-time embedded system control and multiprocessor applications.

IBM RS/6000:
Technology, clock rate, year: 1-μm CMOS technology, 30 MHz, 1990.
Functional units and multiple instruction issue: POWER architecture, issue 4 instructions (1 FXU, 1 FPU, and 2 ICU operations) per cycle.
Registers, caches, MMU, address space: 32 32-bit GPRs, 8-KB I-cache, 64-KB D-cache with separate TLBs.
Floating-point unit and functions: on-chip FPU, 64-bit multiply, add, divide, subtract, IEEE 754 standard.
Claimed performance and remarks: 34 MIPS and 11 Mflops at 25 MHz on POWERstation 530.

DEC Alpha 21064:
Technology, clock rate, year: 0.75-μm CMOS, 150 MHz, 431 pins, 1992.
Functional units and multiple instruction issue: Alpha architecture, issue 2 instructions per cycle; 64-bit IU and FPU, 128-bit data bus and 34-bit address bus implemented in the initial version.
Registers, caches, MMU, address space: 32 64-bit GPRs, 8-KB I-cache, 8-KB D-cache; 64-bit virtual space designed, 43-bit address space implemented in the initial version.
Floating-point unit and functions: on-chip FPU, 32 64-bit FP registers, 10-stage pipeline, IEEE and VAX FP standards.
Claimed performance and remarks: 300 MIPS peak and 150 Mflops peak at 150 MHz; multiprocessor and cache coherence support.

Note: KB = Kbytes, FP = floating point.

[Figure: a superscalar RISC processor with an integer unit and a floating-point unit, each having its own register file and reorder buffer; the integer unit (RISC core) contains ALU, shifter, load, and store units, and the floating-point unit contains add, convert, multiply, divide, load, and store units, all connected to memory over address and data buses.]

Fig. 4.12 A typical superscalar RISC processor architecture consisting of an integer unit and a floating-point unit (Courtesy of M. Johnson, 1991; reprinted with permission from Prentice-Hall, Inc.)

Example 4.5 The IBM RS/6000 architecture

In early 1990, IBM announced the RISC System/6000. It was a superscalar processor, as illustrated in
Fig. 4.13, with three functional units called the branch processor, fixed-point unit, and floating-point unit,
which could operate in parallel.

The branch processor could arrange the execution of up to five instructions per cycle. These included one
branch instruction in the branch processor, one fixed-point instruction in the FXU, one condition-register
instruction in the branch processor, and one floating-point multiply-add instruction in the FPU, which could
be counted as two floating-point operations.

[Figure: the instruction cache feeds the branch processor, which dispatches instructions to the fixed-point processor and the floating-point processor; a 64-Kbyte data cache connects the processors to a main memory of 8 to 128 Mbytes over 128-bit paths, with an I/O interface for programmed I/O and direct memory access.]

Fig. 4.13 The POWER architecture of the IBM RISC System/6000 superscalar processor (Courtesy of International Business Machines Corporation, 1990)

As in any RISC processor, the RS/6000 used hardwired rather than microcoded control logic. The system used a
number of wide buses, ranging from one word (32 bits) for the FXU to two words (64 bits) for the FPU, and
four words for the I-cache and D-cache, respectively. These wide buses provided the high instruction and data
bandwidths required for superscalar implementation.
The RS/6000 design was optimized to perform well in numerically intensive scientific and engineering
applications, as well as in multiuser commercial environments. A number of RS/6000-based workstations
and servers were produced by IBM. For example, the POWERstation 530 had a clock rate of 25 MHz with
performance benchmarks reported as 34.5 MIPS and 10.9 Mflops. In subsequent years, these systems were
developed into a series of RISC-based server products. See also Chapter 13.

4.2.2 The VLIW Architecture


The VLIW architecture is generalized from two well-established concepts: horizontal microcoding and
superscalar processing. A typical VLIW (very long instruction word) machine has instruction words
hundreds of bits in length. As illustrated in Fig. 4.14a, multiple functional units are used concurrently in
a VLIW processor. All functional units share the use of a common large register file. The operations to be
simultaneously executed by the functional units are synchronized in a VLIW instruction, say, 256 or 1024
bits per instruction word, an early example being the Multiflow computer models.
Different fields of the long instruction word carry the opcodes to be dispatched to different functional
units. Programs written in conventional short instruction words (say 32 bits) must be compacted together to
form the VLIW instructions. This code compaction must be done by a compiler which can predict branch
outcomes using elaborate heuristics or run-time statistics.

[Figure: (a) a typical VLIW processor with degree m = 3: load/store, FP add, FP multiply, branch, and integer ALU fields of one long instruction word drive their respective functional units, all sharing a main register file; (b) VLIW execution with degree m = 3: each long instruction word issues three operations per cycle through fetch, decode, execute, and write-back stages.]

Fig. 4.14 The architecture of a very long instruction word (VLIW) processor and its pipeline operations (Courtesy of Multiflow Computer, Inc., 1987)

Pipelining in VLIW Processors: The execution of instructions by an ideal VLIW processor is shown in
Fig. 4.14b. Each instruction specifies multiple operations. The effective CPI becomes 0.33 in this particular
example. VLIW machines behave much like superscalar machines, with three differences: First, the decoding
of VLIW instructions is easier than that of superscalar instructions.
Second, the code density of the superscalar machine is better when the available instruction-level parallelism
is less than that exploitable by the VLIW machine. This is because the fixed VLIW format includes bits for
non-executable operations, while the superscalar processor issues only executable instructions.

Third, a superscalar machine can be object-code-compatible with a large family of non-parallel machines.
On the contrary, a VLIW machine exploiting different amounts of parallelism would require different
instruction sets.
Instruction parallelism and data movement in a VLIW architecture are completely specified at compile
time. Run-time resource scheduling and synchronization are in theory completely eliminated. One can view
a VLIW processor as an extreme example of a superscalar processor in which all independent or unrelated
operations are already synchronously compacted together in advance. The CPI of a VLIW processor can
be even lower than that of a superscalar processor. For example, the Multiflow trace computer allows up to
seven operations to be executed concurrently with 256 bits per VLIW instruction.
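The effective CPI figures quoted above follow directly from how densely operations are packed into the long instruction words. A small illustrative computation, with assumed packing counts rather than measured data:

```python
# Effective CPI of a VLIW machine: one long word issues per cycle, so
# CPI = cycles / total operations. Perfect packing of an m-wide machine
# gives CPI = 1/m; empty slots (nops) drag CPI back toward 1.

def effective_cpi(words):
    """words: number of real operations packed into each long word."""
    cycles = len(words)      # one long instruction word issues per cycle
    ops = sum(words)
    return cycles / ops

ideal = effective_cpi([3, 3, 3, 3])    # 3-wide, fully packed: CPI = 1/3 (Fig. 4.14b)
sparse = effective_cpi([1, 2, 1, 3])   # poor packing raises the effective CPI
```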

VLIW Opportunities: In a VLIW architecture, random parallelism among scalar operations is exploited
instead of regular or synchronous parallelism as in a vectorized supercomputer or in an SIMD computer.
The success of a VLIW processor depends heavily on the efficiency of code compaction. The architecture is
totally incompatible with that of any conventional general-purpose processor.
Furthermore, the instruction parallelism embedded in the compacted code may require a different latency
to be executed by different functional units even though the instructions are issued at the same time. Therefore,
different implementations of the same VLIW architecture may not be binary-compatible with each other.
By explicitly encoding parallelism in the long instruction, a VLIW processor can in theory eliminate the
hardware or software needed to detect parallelism. The main advantage of VLIW architecture is its simplicity
in hardware structure and instruction set. The VLIW processor can potentially perform well in scientific
applications where the program behavior is more predictable.
In general-purpose applications, the architecture may not be able to perform well. Due to its lack of
compatibility with conventional hardware and software, the VLIW architecture has not entered the mainstream
of computers. Although the idea seems sound in theory, the dependence on trace-scheduling compiling and
code compaction has prevented it from gaining acceptance in the commercial world. Further discussion of
this concept will be found in Chapter 12.

4.2.3 Vector and Symbolic Processors


By definition, a vector processor is specially designed to perform vector computations. A vector instruction
involves a large array of operands. In other words, the same operation will be performed over an array or a
string of data. Specialized vector processors are generally used in supercomputers.
A vector processor can assume either a register-to-register architecture or a memory-to-memory
architecture. The former uses shorter instructions and vector register files. The latter uses memory-based
instructions which are longer in length, including memory addresses.

Vector Instructions: Register-based vector instructions appear in most register-to-register vector processors
like Cray supercomputers. Denote a vector register of length n as Vi, a scalar register as si, and a memory
array of length n as M(1 : n). Typical register-based vector operations are listed below, where a vector
operator is denoted by a small circle "o":

    V1 o V2 → V3    (binary vector)
    s1 o V1 → V2    (scaling)
    V1 o V2 → s1    (binary reduction)
    M(1 : n) → V1   (vector load)          (4.1)
    V1 → M(1 : n)   (vector store)
    o V1 → V2       (unary vector)
    o V1 → s1       (unary reduction)

It should be noted that the vector length should be equal in the two operands used in a binary vector
instruction. The reduction is an operation on one or two vector operands, and the result is a scalar, such as
the dot product between two vectors or the maximum of all components in a vector.
In all cases, these vector operations are performed by dedicated pipeline units, including functional
pipelines and memory-access pipelines. Long vectors exceeding the register length n must be segmented to
fit the vector registers, n elements at a time.
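The register-based operations of Eq. 4.1, together with the segmentation (strip-mining) of long vectors into register-length pieces, can be sketched in Python. The register length n = 4 and the function names are illustrative assumptions, not any specific machine's instruction set.

```python
# Sketch of register-to-register vector execution with strip-mining:
# a long memory vector is processed n elements at a time through
# vector registers of length n (hypothetical machine, n = 4 here).

N = 4                                    # vector register length

def vload(mem, start):                   # M(start : start+n) -> V
    return mem[start:start + N]

def vadd(v1, v2):                        # V1 o V2 -> V3 (binary vector op)
    return [a + b for a, b in zip(v1, v2)]

def vstore(mem, start, v):               # V -> M(start : start+n)
    mem[start:start + len(v)] = v

def vector_add(a, b, out):
    """Add two long memory vectors, one register strip of N at a time."""
    for i in range(0, len(a), N):
        vstore(out, i, vadd(vload(a, i), vload(b, i)))

a = list(range(10))
b = [10] * 10
out = [0] * 10
vector_add(a, b, out)                    # out[i] == a[i] + b[i]
```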
Memory-based vector operations are found in memory-to-memory vector processors such as those in the
early supercomputer CDC Cyber 205. Listed below are a few examples:

    M1(1 : n) o M2(1 : n) → M3(1 : n)
    s1 o M1(1 : n) → M2(1 : n)
    o M1(1 : n) → M2(1 : n)                (4.2)
    M1(1 : n) o M2(1 : n) → M(k)

where M1(1 : n) and M2(1 : n) are two vectors of length n and M(k) denotes a scalar quantity stored in memory
location k. Note that the vector length is not restricted by register length. Long vectors are handled in a
streaming fashion using super words cascaded from many shorter memory words.

Vector Pipelines: Vector processors take advantage of unrolled loop-level parallelism. The vector pipelines
can be attached to any scalar or superscalar processor.
Dedicated vector pipelines eliminate some software overhead in looping control. Of course, the
effectiveness of a vector processor relies on the capability of an optimizing compiler that vectorizes sequential
code for vector pipelining. Typically, applications in science and engineering can make good use of vector
processing capabilities.
The pipelined execution in a vector processor is compared with that in a scalar processor in Fig. 4.15.
Figure 4.15a is a redrawing of Fig. 4.2a, in which each scalar instruction executes only one operation over
one data element. For clarity, only serial issue and parallel execution of vector instructions are illustrated in
Fig. 4.15b. Each vector instruction executes a string of operations, one for each element in the vector.
We will study vector processors and SIMD architectures in Chapter 8. Various functional pipelines and
their chaining or networking schemes will be introduced for the execution of compound vector functions.
Many of the above vector instructions also have equivalent counterparts in an SIMD computer. Vector
processing is achieved through efficient pipelining in vector supercomputers and through spatial or data
parallelism in an SIMD computer.

Symbolic Processors: Symbolic processing has been applied in many areas, including theorem proving,
pattern recognition, expert systems, knowledge engineering, text retrieval, cognitive science, and machine
intelligence. In these applications, data and knowledge representations, primitive operations, algorithmic
behavior, memory, I/O and communications, and special architectural features are different than in numerical
computing. Symbolic processors have also been called prolog processors, Lisp processors, or symbolic
manipulators. Table 4.6 summarizes these characteristics.

[Figure: (a) scalar pipeline execution (Fig. 4.2a redrawn): successive instructions each execute one operation through fetch, decode, execute, and write-back stages; (b) vector pipeline execution: each vector instruction executes a string of operations, one result per cycle.]

Fig. 4.15 Pipelined execution in a base scalar processor and in a vector processor, respectively (Courtesy of Jouppi and Wall; reprinted from Proc. ASPLOS, ACM Press, 1989)

Table 4.6 Characteristics of Symbolic Processing

Knowledge representations: lists, relational databases, scripts, semantic nets, frames, blackboards, objects, production systems.
Common operations: search, sort, pattern matching, filtering, contexts, partitions, transitive closure, unification, text retrieval, set operations, reasoning.
Memory requirements: large memory with intensive access pattern; addressing is often content-based; locality of reference may not hold.
Communication patterns: message traffic varies in size and destination; granularity and format of message units change with applications.
Properties of algorithms: nondeterministic, possibly parallel and distributed computations; data dependences may be global and irregular in pattern and granularity.
Input/output requirements: user-guided programs; intelligent person-machine interfaces; inputs can be graphical and audio as well as from keyboard; access to very large on-line databases.
Architecture features: parallel update of large knowledge bases; dynamic load balancing; dynamic memory allocation; hardware-supported garbage collection; stack processor architecture; symbolic processors.
Processors and Memory Hierarchy

For example, a Lisp program can be viewed as a set of functions in which data are passed from function
to function. The concurrent execution of these functions forms the basis for parallelism. The applicative and
recursive nature of Lisp requires an environment that efficiently supports stack computations and function
calling. The use of linked lists as the basic data structure makes it possible to implement an automatic garbage
collection mechanism.

Instead of dealing with numerical data, symbolic processing deals with logic programs, symbolic lists,
objects, scripts, blackboards, production systems, semantic networks, frames, and artificial neural networks.
Primitive operations for artificial intelligence include search, compare, logic inference, pattern matching,
unification, filtering, context, retrieval, set operations, transitive closure, and reasoning operations. These
operations demand a special instruction set containing compare, matching, logic, and symbolic manipulation
operations. Floating point operations are not often used in these machines.

Example 4.6 The Symbolics 3600 Lisp processor (3)
The processor architecture of the Symbolics 3600 is shown in Fig. 4.16. This was a stack-oriented machine.
The division of the overall machine architecture into layers allowed the use of a simplified instruction-set
design, while implementation was carried out with a stack-oriented machine. Since most operands were
fetched from the stack, the stack buffer and scratch-pad memories were implemented as fast caches to main
memory.

[Figure: block diagram in which the stack buffer and scratch-pad feed an operand selector over the B bus, with the tag processor, fixed-point processor, floating-point processor, current-instruction register, main memory, and garbage collector attached]

Fig. 4.16 The architecture of the Symbolics 3600 Lisp processor (Courtesy of Symbolics, Inc., 1985)

(3) The company Symbolics has since gone out of business, but the AI concepts it employed and developed are still valid.
On a general-purpose computer, these concepts would be implemented in software.

The Symbolics 3600 executed most Lisp instructions in one machine cycle. Integer instructions fetched
operands from the stack buffer and the duplicate top of the stack in the scratch-pad memory. Floating-point
addition, garbage collection, data type checking by the tag processor, and fixed-point addition could be
carried out in parallel.

MEMORY HIERARCHY TECHNOLOGY

In a typical computer configuration, the cost of memory, disks, printers, and other peripherals
often exceeds that of the processors. We briefly introduce below the memory hierarchy and
peripheral technology.

4.3.1 Hierarchical Memory Technology


Storage devices such as registers, caches, main memory, disk drives, and backup storage are often organized
as a hierarchy as depicted in Fig. 4.17. The memory technology and storage organization at each level are
characterized by five parameters: the access time (t_i), memory size (s_i), cost per byte (c_i), transfer bandwidth
(b_i), and unit of transfer (x_i).

The access time t_i refers to the round-trip time from the CPU to the ith-level memory. The memory size s_i
is the number of bytes or words in level i. The cost of the ith-level memory is estimated by the product c_i s_i.
The bandwidth b_i refers to the rate at which information is transferred between adjacent levels. The unit of
transfer x_i refers to the grain size for data transfer between levels i and i + 1.
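A memory level and its cost c_i * s_i can be modeled directly from these five parameters; a minimal Python sketch (the class name and the sample values are illustrative assumptions, not figures from the text):

```python
from dataclasses import dataclass

@dataclass
class MemoryLevel:
    """One level M_i of the hierarchy, described by the five parameters above."""
    name: str
    access_time: float     # t_i, seconds (round trip from the CPU to level i)
    size: int              # s_i, bytes held at level i
    cost_per_byte: float   # c_i, dollars per byte
    bandwidth: float       # b_i, bytes per second between adjacent levels
    unit_of_transfer: int  # x_i, grain size in bytes between levels i and i+1

    def level_cost(self) -> float:
        # The cost of the ith-level memory is estimated by the product c_i * s_i.
        return self.cost_per_byte * self.size

def is_well_formed(levels):
    """Check the lower-level-is-faster-smaller-costlier invariants
    t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i for adjacent levels."""
    return all(a.access_time < b.access_time and a.size < b.size and
               a.cost_per_byte > b.cost_per_byte
               for a, b in zip(levels, levels[1:]))

# Illustrative sample values only:
cache = MemoryLevel("cache", 25e-9, 512 * 1024, 1.2e-4, 250e6, 32)
main = MemoryLevel("main memory", 100e-9, 512 * 2**20, 5e-5, 100e6, 1024)
```

The `is_well_formed` check encodes the ordering of parameters across levels discussed in the next paragraph.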

[Figure: pyramid of levels: registers (level 0) at the top, cache (SRAMs) at level 1, main memory (DRAMs) at level 2, disk storage (solid-state, magnetic) at level 3, and backup storage (magnetic tapes, optical disks) at level 4; capacity and access time increase, and cost per bit decreases, moving down the pyramid]

Fig. 4.17 A four-level memory hierarchy with increasing capacity and decreasing speed and cost from low to
high levels

Memory devices at a lower level are faster to access, smaller in size, and more expensive per byte, having
a higher bandwidth and using a smaller unit of transfer as compared with those at a higher level. In other
words, we have t_{i-1} < t_i, s_{i-1} < s_i, c_{i-1} > c_i, b_{i-1} > b_i, and x_{i-1} <= x_i, for i = 1, 2, 3, and 4, in the hierarchy, where
i = 0 corresponds to the CPU register level. The cache is at level 1, main memory at level 2, the disks at level
3, and backup storage at level 4. The physical memory design and operations of these levels are studied in
subsequent sections and in Chapter 5.
Registers and Caches The registers are parts of the processor; multi-level caches are built either on the
processor chip or on the processor board. Register assignment is made by the compiler. Register transfer
operations are directly controlled by the processor after instructions are decoded. Register transfer is
conducted at processor speed, in one clock cycle.

Therefore, many designers would not consider registers a level of memory. We list them here for
comparison purposes. The cache is controlled by the MMU and is programmer-transparent. The cache can
also be implemented at one or multiple levels, depending on the speed and application requirements. Over
the last two or three decades, processor speeds have increased at a much faster rate than memory speeds.
Therefore multi-level cache systems have become essential to deal with memory access latency.
Main Memory The main memory is sometimes called the primary memory of a computer system. It is
usually much larger than the cache and often implemented by the most cost-effective RAM chips, such as
DDR SDRAMs, i.e. double data rate synchronous dynamic RAMs. The main memory is managed by an MMU
in cooperation with the operating system.

Disk Drives and Backup Storage The disk storage is considered the highest level of on-line memory.
It holds the system programs such as the OS and compilers, and user programs and their data sets. Optical
disks and magnetic tape units are off-line memory for use as archival and backup storage. They hold copies
of present and past user programs and processed results and files. Disk drives are also available in the form
of RAID arrays.

A typical workstation computer has the cache and main memory on a processor board and hard disks
in an attached disk drive. Table 4.7 presents representative values of memory parameters for a typical
32-bit mainframe computer built in 1993. Since that time, there has been one or two orders of magnitude
improvement in most parameters, as we shall see in Chapter 13.
Peripheral Technology Besides disk drives and backup storage, peripheral devices include printers,
plotters, terminals, monitors, graphics displays, optical scanners, image digitizers, output microfilm devices,
etc. Some I/O devices are tied to special-purpose or multimedia applications.

The technology of peripheral devices has improved rapidly in recent years. For example, we used dot-
matrix printers in the past. Now, as laser printers become affordable and popular, in-house publishing
becomes a reality. The high demand for multimedia I/O such as image, speech, video, and music has resulted
in further advances in I/O technology.

4.3.2 Inclusion, Coherence, and Locality


Information stored in a memory hierarchy (M_1, M_2, ..., M_n) satisfies three important properties: inclusion,
coherence, and locality, as illustrated in Fig. 4.18. We consider cache memory the innermost level M_1, which
directly communicates with the CPU registers. The outermost level M_n contains all the information words
stored. In fact, the collection of all addressable words in M_n forms the virtual address space of a computer.
Program and data locality is characterized below as the foundation for using a memory hierarchy effectively.

Table 4.7 Memory Characteristics of a Typical Mainframe Computer in 1993

Memory level: Level 0 (CPU registers); Level 1 (cache); Level 2 (main memory); Level 3 (disk storage); Level 4 (tape storage)
Device technology: ECL; 256K-bit SRAM; 4M-bit DRAM; 1-Gbyte magnetic disk unit; 5-Gbyte magnetic tape unit
Access time t_i: 10 ns; 25-40 ns; 60-100 ns; 12-20 ms; 2-20 min (search time)
Capacity s_i (in bytes): 512 bytes; 128 Kbytes; 512 Mbytes; 60-228 Gbytes; 512 Gbytes-2 Tbytes
Cost c_i (in cents/KB): 18,000; 72; 5.15; 0.23; 0.01
Bandwidth b_i (in Mbytes/s): 400-800; 250-400; 80-133; 3-5; 0.18-0.23
Unit of transfer x_i: 4-8 bytes per word; 32 bytes per block; 0.5-1 Kbytes per page; 5-512 Kbytes per file; backup storage
Allocation management: compiler assignment; hardware control; operating system; operating system/user; operating system/user

Inclusion Property The inclusion property is stated as M_1 ⊂ M_2 ⊂ M_3 ⊂ ... ⊂ M_n. The inclusion
relationship implies that all information items are originally stored in the outermost level M_n. During the
processing, subsets of M_n are copied into M_{n-1}. Similarly, subsets of M_{n-1} are copied into M_{n-2}, and so on.

In other words, if an information word is found in M_i, then copies of the same word can also be found in
all upper levels M_{i+1}, M_{i+2}, ..., M_n. However, a word stored in M_{i+1} may not be found in M_i. A word miss
in M_i implies that it is also missing from all lower levels M_{i-1}, M_{i-2}, ..., M_1. The highest level is the backup
storage, where everything can be found.
Information transfer between the CPU and cache is in terms of words (4 or 8 bytes each depending on
the word length of a machine). The cache (M_1) is divided into cache blocks, also called cache lines by some
authors. Each block may be typically 32 bytes (8 words). Blocks (such as "a" and "b" in Fig. 4.18) are the
units of data transfer between the cache and main memory, or between L1 and L2 cache, etc.

The main memory (M_2) is divided into pages, say, 4 Kbytes each. Each page contains 128 blocks for the
example in Fig. 4.18. Pages are the units of information transferred between disk and main memory.

Scattered pages are organized as a segment in the disk memory; for example, segment F contains page A,
page B, and other pages. The size of a segment varies depending on the user's needs. Data transfer between
the disk and backup storage is handled at the file level, such as segments F and G illustrated in Fig. 4.18.
Coherence Property The coherence property requires that copies of the same information item at
successive memory levels be consistent. If a word is modified in the cache, copies of that word must be updated
immediately or eventually at all higher levels. The hierarchy should be maintained as such. Frequently used
information is often found in the lower levels in order to minimize the effective access time of the memory
hierarchy. In general, there are two strategies for maintaining the coherence in a memory hierarchy.

[Figure: CPU registers, cache (M1), main memory (M2), disk storage (M3), and magnetic tape unit (M4: backup storage), with the data transfers between adjacent levels annotated as follows:
1. Access by word (4 bytes) from a cache block of 32 bytes, such as block a.
2. Access by block (32 bytes) from a memory page of 32 blocks or 1 Kbytes, such as block b from page B.
3. Access by page (1 Kbytes) from a file consisting of many pages, such as page A and page B in segment F.
4. Segment transfer with different number of pages.]

Fig. 4.18 The inclusion property and data transfers between adjacent levels of a memory hierarchy

The first method is called write-through (WT), which demands immediate update in M_{i+1} if a word is
modified in M_i, for i = 1, 2, ..., n - 1.

The second method is write-back (WB), which delays the update in M_{i+1} until the word being modified in
M_i is replaced or removed from M_i. Memory replacement policies are studied in Section 4.4.3.
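The difference between the two strategies can be seen in a toy two-level model (M1 as cache, M2 as main memory); this is a hedged sketch, with all class and method names invented for illustration:

```python
class TwoLevelMemory:
    """Toy model contrasting write-through (WT) and write-back (WB)."""
    def __init__(self, policy):
        assert policy in ("WT", "WB")
        self.policy = policy
        self.m1 = {}        # M1 (cache): address -> value
        self.m2 = {}        # M2 (main memory): address -> value
        self.dirty = set()  # addresses modified in M1 but not yet in M2

    def write(self, addr, value):
        self.m1[addr] = value
        if self.policy == "WT":
            self.m2[addr] = value  # write-through: update M2 immediately
        else:
            self.dirty.add(addr)   # write-back: defer the update in M2

    def evict(self, addr):
        # On replacement, a write-back hierarchy must flush dirty words to M2.
        if addr in self.dirty:
            self.m2[addr] = self.m1[addr]
            self.dirty.discard(addr)
        self.m1.pop(addr, None)
```

Write-through keeps the two levels consistent at every write at the price of extra traffic; write-back batches the updates until replacement time.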

Locality of References The memory hierarchy was developed based on a program behavior known as
locality of references. Memory references are generated by the CPU for either instruction or data access.
These accesses tend to be clustered in certain regions in time, space, and ordering.

In other words, most programs act in favor of a certain portion of their address space during any time
window. Hennessy and Patterson (1990) have pointed out a 90-10 rule which states that a typical program
may spend 90% of its execution time on only 10% of the code, such as the innermost loop of a nested looping
operation.

There are three dimensions of the locality property: temporal, spatial, and sequential. During the lifetime
of a software process, a number of pages are used dynamically. The references to these pages vary from time
to time; however, they follow certain access patterns as illustrated in Fig. 4.19. These memory reference
patterns are caused by the following locality properties:

[Figure: plot of virtual address (page number) against time, showing three clustered regions (a), (b), and (c) of memory references]

Fig. 4.19 Memory reference patterns in typical program trace experiments, where regions (a), (b), and (c) are
generated with the execution of three software processes

(1) Temporal locality: Recently referenced items (instructions or data) are likely to be referenced again
in the near future. This is often caused by special program constructs such as iterative loops, process
stacks, temporary variables, or subroutines. Once a loop is entered or a subroutine is called, a small
code segment will be referenced repeatedly many times. Thus temporal locality tends to cluster the
access in the recently used areas.

(2) Spatial locality: This refers to the tendency for a process to access items whose addresses are near
one another. For example, operations on tables or arrays involve accesses of a certain clustered area
in the address space. Program segments, such as routines and macros, tend to be stored in the same
neighborhood of the memory space.

(3) Sequential locality: In typical programs, the execution of instructions follows a sequential order (or
the program order) unless branch instructions create out-of-order executions. The ratio of in-order
execution to out-of-order execution is roughly 5 to 1 in ordinary programs. Besides, the access of a
large data array also follows a sequential order.
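The effect of spatial and sequential locality can be made concrete by counting how often a reference stream moves to a new cache block; a small illustrative sketch (the 32-byte block size and the two access patterns are assumptions chosen for the demonstration):

```python
def blocks_touched(addresses, block_size=32):
    """Count block changes along a reference stream: fewer changes means
    better spatial locality for the same number of references."""
    fetches, current = 0, None
    for a in addresses:
        b = a // block_size  # block number holding this address
        if b != current:
            fetches, current = fetches + 1, b
    return fetches

# Sequential word accesses (4-byte words) over a 1-Kbyte array:
sequential = [4 * i for i in range(256)]
# The same number of references, but jumping a whole block every time:
strided = [32 * i for i in range(256)]
```

With these patterns, the sequential stream reuses each 32-byte block for eight consecutive references, while the strided stream changes block on every reference.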

Memory Design Implications The sequentiality in program behavior also contributes to the spatial
locality because sequentially coded instructions and array elements are often stored in adjacent locations.
Each type of locality affects the design of the memory hierarchy.

The temporal locality leads to the popularity of the least recently used (LRU) replacement algorithm, to be
defined in Section 4.4.3. The spatial locality assists us in determining the size of unit data transfers between
adjacent memory levels. The temporal locality also helps determine the size of memory at successive levels.

The sequential locality affects the determination of grain size for optimal scheduling (grain packing).
Prefetch techniques are heavily affected by the locality properties. The principle of localities guides the
design of cache, main memory, and even virtual memory organization.

The Working Sets Figure 4.19 shows the memory reference patterns of three running programs or three
software processes. As a function of time, the virtual address space (identified by page numbers) is clustered
into regions due to the locality of references. The subset of addresses (or pages) referenced within a given
time window (t, t + Δt) is called the working set by Denning (1968).

During the execution of a program, the working set changes slowly and maintains a certain degree
of continuity as demonstrated in Fig. 4.19. This implies that the working set is often accumulated at the
innermost (lowest) level such as the cache in the memory hierarchy. This will reduce the effective memory-
access time with a higher hit ratio at the lowest memory level. The time window Δt is a critical parameter set
by the OS kernel which affects the size of the working set and thus the desired cache size.
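Denning's working set W(t, Δt) is straightforward to compute from a recorded page-reference trace; a minimal sketch (the trace below is invented to show a slowly drifting working set):

```python
def working_set(trace, t, delta_t):
    """W(t, Δt): the set of distinct pages referenced in the window
    [t, t + Δt) of a page-reference trace (a list of page numbers)."""
    return set(trace[t:t + delta_t])

# A process that loops over pages {3, 4} and then drifts to pages {7, 8}:
trace = [3, 4, 3, 4, 3, 4, 7, 8, 7, 8, 7, 8]
```

A window placed across the drift point sees the union of both working sets, which is why the choice of Δt matters for sizing the cache.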

4.3.3 Memory Capacity Planning


The performance of a memory hierarchy is determined by the effective access time T_eff to any level in the
hierarchy. It depends on the hit ratios and access frequencies at successive levels. We formally define these
terms below. Then we discuss the issue of how to optimize the capacity of a memory hierarchy subject to a
cost constraint.

Hit Ratios Hit ratio is a concept defined for any two adjacent levels of a memory hierarchy. When an
information item is found in M_i, we call it a hit; otherwise, a miss. Consider memory levels M_i and M_{i-1} in a
hierarchy, i = 1, 2, ..., n. The hit ratio h_i at M_i is the probability that an information item will be found in M_i.
It is a function of the characteristics of the two adjacent levels M_{i-1} and M_i. The miss ratio at M_i is defined
as 1 - h_i.

The hit ratios at successive levels are a function of memory capacities, management policies, and program
behavior. Successive hit ratios are independent random variables with values between 0 and 1. To simplify
the future derivation, we assume h_0 = 0 and h_n = 1, which means the CPU always accesses M_1 first and the
access to the outermost memory M_n is always a hit.
The access frequency to M_i is defined as f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i. This is indeed the probability of
successfully accessing M_i when there are i - 1 misses at the lower levels and a hit at M_i. Note that Σ_{i=1}^{n} f_i = 1
and f_1 = h_1.

Due to the locality property, the access frequencies decrease very rapidly from low to high levels; that is,
f_1 >> f_2 >> f_3 >> ... >> f_n. This implies that the inner levels of memory are accessed more often than the outer
levels.
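The access frequencies follow mechanically from the hit ratios; a short sketch assuming, as in the text, that the list of hit ratios ends with h_n = 1:

```python
def access_frequencies(hit_ratios):
    """f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) h_i for the hit ratios
    [h_1, ..., h_n]; with h_n = 1 the frequencies sum to 1."""
    freqs, missed_so_far = [], 1.0
    for h in hit_ratios:
        freqs.append(missed_so_far * h)
        missed_so_far *= (1.0 - h)
    return freqs
```

For instance, hit ratios (0.98, 0.99, 1) give frequencies that drop steeply from level to level, confirming that the inner levels absorb almost all accesses.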

Effective Access Time In practice, we wish to achieve as high a hit ratio as possible at M_1. Every time a
miss occurs, a penalty must be paid to access the next higher level of memory. The misses have been called
block misses in the cache and page faults in the main memory because blocks and pages are the units of
transfer between these levels.

The time penalty for a page fault is much longer than that for a block miss due to the fact that t_1 < t_2 < t_3.
Stone (1990) pointed out that a cache miss is 2 to 4 times as costly as a cache hit, but a page fault is 1000 to
10,000 times as costly as a page hit; but in modern systems a cache miss has a greater cost relative to a cache
hit, because main memory speeds have not increased as fast as processor speeds.
Using the access frequencies f_i for i = 1, 2, ..., n, we can formally define the effective access time of a
memory hierarchy as follows:

T_eff = Σ_{i=1}^{n} f_i · t_i
     = h_1 t_1 + (1 - h_1) h_2 t_2 + (1 - h_1)(1 - h_2) h_3 t_3 + ... + (1 - h_1)(1 - h_2) ... (1 - h_{n-1}) t_n     (4.3)

The first several terms in Eq. 4.3 dominate. Still, the effective access time depends on the program behavior
and memory design choices. Only after extensive program trace studies can one estimate the hit ratios and the
value of T_eff more accurately.
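Eq. 4.3 translates directly into a few lines of code; a sketch (the sample numbers are illustrative: with h = (0.98, 0.99, 1) and t = (25 ns, 1250 ns, 4 ms), the formula evaluates to about 849 ns):

```python
def effective_access_time(hit_ratios, access_times):
    """T_eff = sum_i f_i * t_i (Eq. 4.3). The lists give h_i and t_i from
    the innermost level outward; the last hit ratio should be 1."""
    total, missed_so_far = 0.0, 1.0
    for h, t in zip(hit_ratios, access_times):
        total += missed_so_far * h * t   # f_i * t_i
        missed_so_far *= (1.0 - h)
    return total
```

Accumulating the running miss product avoids recomputing the (1 - h_1)...(1 - h_{i-1}) prefix for each term.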

Hierarchy Optimization The total cost of a memory hierarchy is estimated as follows:

C_total = Σ_{i=1}^{n} c_i · s_i     (4.4)

This implies that the cost is distributed over n levels. Since c_1 > c_2 > c_3 > ... > c_n, we have to choose s_1 < s_2
< s_3 < ... < s_n. The optimal design of a memory hierarchy should result in a T_eff close to the t_1 of M_1 and a total
cost close to the cost of M_n. In reality, this is difficult to achieve due to the tradeoffs among n levels.
The optimization process can be formulated as a linear programming problem, given a ceiling C_0 on the
total cost; that is, a problem to minimize

T_eff = Σ_{i=1}^{n} f_i · t_i     (4.5)

subject to the following constraints:

s_i > 0, t_i > 0 for i = 1, 2, ..., n

C_total = Σ_{i=1}^{n} c_i · s_i < C_0     (4.6)

As shown in Table 4.7, the unit cost c_i and capacity s_i at each level M_i depend on the speed t_i required.
Therefore, the above optimization involves tradeoffs among t_i, c_i, s_i, and f_i or h_i at all levels i = 1, 2, ..., n.
The following illustrative example shows a typical such tradeoff design.

Example 4.7 The design of a memory hierarchy

Consider the design of a three-level memory hierarchy with the following specifications for memory
characteristics:

Memory level     Access time       Capacity            Cost/Kbyte

Cache            t_1 = 25 ns       s_1 = 512 Kbytes    c_1 = $0.12
Main memory      t_2 = unknown     s_2 = 32 Mbytes     c_2 = $0.02
Disk array       t_3 = 4 ms        s_3 = unknown       c_3 = $0.00002

The design goal is to achieve an effective memory-access time T = 850 ns with a cache hit ratio h_1 = 0.98
and a hit ratio h_2 = 0.99 in main memory. Also, the total cost of the memory hierarchy is upper-bounded by
$1,500. The memory hierarchy cost is calculated as

C = c_1 s_1 + c_2 s_2 + c_3 s_3 ≤ $1,500     (4.7)

The maximum capacity of the disk is thus obtained as s_3 = 40 Gbytes without exceeding the budget.

Next, we want to choose the access time (t_2) of the RAM to build the main memory. The effective memory-
access time is calculated as

T = h_1 t_1 + (1 - h_1) h_2 t_2 + (1 - h_1)(1 - h_2) h_3 t_3 ≤ 850 × 10^-9

Substituting all known parameters, we have 850 × 10^-9 = 0.98 × 25 × 10^-9 + 0.02 × 0.99 × t_2 + 0.02 × 0.01
× 1 × 4 × 10^-3. Thus t_2 ≈ 1250 ns.

Suppose one wants to double the main memory to 64 Mbytes at the expense of reducing the disk capacity
under the same budget limit. This change will not affect the cache hit ratio. But it may increase the hit ratio
in the main memory, and thereby, the effective memory-access time will be reduced.
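The example can be checked mechanically. Solving the substituted equation exactly gives t_2 ≈ 1288 ns; the 1250 ns quoted above follows from rounding 0.98 × 25 ≈ 25 and 0.02 × 0.99 ≈ 0.02 in the hand calculation. A verification sketch (variable names invented for illustration):

```python
# Budget constraint (Eq. 4.7), with costs in dollars per Kbyte:
c1, s1 = 0.12, 512          # cache: $0.12/KB, 512 Kbytes
c2, s2 = 0.02, 32 * 1024    # main memory: $0.02/KB, 32 Mbytes
c3 = 0.00002                # disk array: $0.00002/KB
budget = 1500.0
s3 = (budget - c1 * s1 - c2 * s2) / c3   # max disk capacity, in Kbytes

# Solve 850 ns = h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*h3*t3 for t2:
h1, h2, h3 = 0.98, 0.99, 1.0
t1, t3, T = 25e-9, 4e-3, 850e-9
t2 = (T - h1 * t1 - (1 - h1) * (1 - h2) * h3 * t3) / ((1 - h1) * h2)
```

The disk capacity comes out just above 39 million Kbytes, which the text rounds to 40 Gbytes.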

VIRTUAL MEMORY TECHNOLOGY

In this section, we introduce two models of virtual memory. We study address translation
mechanisms and page replacement policies for memory management. Physical memory such
as caches and main memory will be studied in Chapter 5.

4.4.1 Virtual Memory Models


The main memory is considered the physical memory in which multiple running programs may reside.
However, the limited-size physical memory cannot load in all programs fully and simultaneously. The virtual
memory concept was introduced to alleviate this problem. The idea is to expand the use of the physical
memory among many programs with the help of an auxiliary (backup) memory such as disk arrays.

Only active programs or portions of them become residents of the physical memory at one time. Active
portions of programs can be loaded in and out from disk to physical memory dynamically under the
coordination of the operating system. To the users, virtual memory provides almost unbounded memory
space to work with. Without virtual memory, it would have been impossible to develop the multiprogrammed
or time-sharing computer systems that are in use today.

Address Spaces Each word in the physical memory is identified by a unique physical address. All memory
words in the main memory form a physical address space. Virtual addresses are those used by machine
instructions making up an executable program.

The virtual addresses must be translated into physical addresses at run time. A system of translation tables
and mapping functions are used in this process. The address translation and memory management policies
are affected by the virtual memory model used and by the organization of the disk and of the main memory.

The use of virtual memory facilitates sharing of the main memory by many software processes on a
dynamic basis. It also facilitates software portability and allows users to execute programs requiring much
more memory than the available physical memory.

Only the active portions of running programs are brought into the main memory. This permits the
relocation of code and data, makes it possible to implement protection in the OS kernel, and allows high-
level optimization of memory allocation and management.

Address Mapping Let V be the set of virtual addresses generated by a program running on a processor.
Let M be the set of physical addresses allocated to run this program. A virtual memory system demands an
automatic mechanism to implement the following mapping:

f_t : V → M ∪ {φ}     (4.9)

This mapping is a time function which varies from time to time because the physical memory is dynamically
allocated and deallocated. Consider any virtual address v ∈ V. The mapping f_t is formally defined as follows:

f_t(v) = m, if m ∈ M has been allocated to store the data identified by virtual address v
f_t(v) = φ, if data v is missing in M     (4.10)

In other words, the mapping f_t(v) uniquely translates the virtual address v into a physical address m if there
is a memory hit in M. When there is a memory miss, the value returned, f_t(v) = φ, signals that the referenced
item (instruction or data) has not been brought into the main memory at the time of reference.

The efficiency of the address translation process affects the performance of the virtual memory. Virtual
memory is more difficult to implement in a multiprocessor, where additional problems such as coherence,
protection, and consistency become more challenging. Two virtual memory models are discussed below.
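The time-varying mapping f_t of Eq. 4.10 behaves like a lookup table that the OS updates as frames are allocated and deallocated; a minimal sketch (the class name and the 4-Kbyte page size are illustrative assumptions):

```python
PAGE_FAULT = None  # plays the role of phi in Eq. 4.10

class VirtualMemory:
    """Models f_t : V -> M union {phi} with a mutable page map."""
    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.page_map = {}  # virtual page number -> physical frame number

    def translate(self, v):
        """Return the physical address for virtual address v, or
        PAGE_FAULT (phi) when the page is not resident in M."""
        page, offset = divmod(v, self.page_size)
        frame = self.page_map.get(page)
        if frame is None:
            return PAGE_FAULT
        return frame * self.page_size + offset

vm = VirtualMemory()
vm.page_map[2] = 5  # the OS allocates frame 5 to virtual page 2
```

Mutating `page_map` over time is what makes f_t a function of t: the same virtual address can map to different frames, or to φ, at different moments.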

Private Virtual Memory The first model uses a private virtual memory space associated with each
processor, as was seen in the VAX/11 and in most UNIX systems (Fig. 4.20a). Each private virtual space is
divided into pages. Virtual pages from different virtual spaces are mapped into the same physical memory
shared by all processors.

The advantages of using private virtual memory include the use of a small processor address space (32
bits), protection on each page or on a per-process basis, and the use of private memory maps, which require
no locking.

The shortcoming lies in the synonym problem, in which different virtual addresses in different virtual
spaces point to the same physical page.

Shared Virtual Memory This model combines all the virtual address spaces into a single globally shared
virtual space (Fig. 4.20b). Each processor is given a portion of the shared virtual memory to declare their
addresses. Different processors may use disjoint spaces. Some areas of virtual space can also be shared by
multiple processors.

Examples of machines using shared virtual memory include the IBM 801, RT, RP3, System 38, the HP Spectrum,
the Stanford Dash, MIT Alewife, Tera, etc. We will further study virtual memory in Chapter 9. Until then, all
virtual memory systems discussed are assumed private unless otherwise specified.

[Figure: (a) the private virtual spaces of processors 1 and 2, each divided into pages, mapped through page frames into one physical memory; (b) a single globally shared virtual space with per-processor regions and shared pages, mapped into physical memory]

Fig. 4.20 Two virtual memory models for multiprocessor systems (Courtesy of Dubois and Briggs, tutorial,
Annual Symposium on Computer Architecture, 1990)

The advantages in using shared virtual memory include the fact that all addresses are unique. However,
each processor must be allowed to generate addresses larger than 32 bits, such as 46 bits for a 64 Tbyte (2^46-
byte) address space. Synonyms are not allowed in a globally shared virtual memory.

The page table must allow shared accesses. Therefore, mutual exclusion (locking) is needed to enforce
protected access. Segmentation is built on top of the paging system to confine each process to its own address
space (segments). Global virtual memory may make the address translation process longer.

4.4.2 TLB, Paging, and Segmentation


Both the virtual memory and physical memory are partitioned into fixed-length pages as illustrated in
Fig. 4.18. The purpose of memory allocation is to allocate pages of virtual memory to the page frames of the
physical memory.

Address Translation Mechanisms The process demands the translation of virtual addresses into physical
addresses. Various schemes for virtual address translation are summarized in Fig. 4.21a. The translation
demands the use of translation maps which can be implemented in various ways.

Translation maps are stored in the cache, in associative memory, or in the main memory. To access these
maps, a mapping function is applied to the virtual address. This function generates a pointer to the desired
translation map. This mapping can be implemented with a hashing or congruence function.

Hashing is a simple computer technique for converting a long page number into a short one with fewer
bits. The hashing function should randomize the virtual page number and produce a unique hashed number
to be used as the pointer.

[Figure: (a) virtual address translation schemes (PT = page table): the virtual address is mapped, via hashing or congruence functions, into a TLB, a one-level PT, a multi-level PT, an associative PT, or an inverted PT; (b) use of a TLB and PTs for address translation: the virtual address is split into page, block, and word fields, the page field indexes the TLB (falling back to the PTs on a miss, with a page fault if absent), and the resulting page frame joins the block and word fields to form the physical address; (c) inverted address mapping through segment registers and a segment ID]

Fig. 4.21 Address translation mechanisms using a TLB and various forms of page tables

Translation Lookaside Buffer Translation maps appear in the form of a translation lookaside buffer
(TLB) and page tables (PTs). Based on the principle of locality in memory references, a particular working
set of pages is referenced within a given context or time window.

The TLB is a high-speed lookup table which stores the most recently or likely referenced page entries.
A page entry consists essentially of a (virtual page number, page frame number) pair. It is hoped that pages
belonging to the same working set will be directly translated using the TLB entries.

The use of a TLB and PTs for address translation is shown in Fig. 4.21b. Each virtual address is divided
into three fields: the leftmost field holds the virtual page number, the middle field identifies the cache block
number, and the rightmost field is the word address within the block.

Our purpose is to produce the physical address consisting of the page frame number, the block number,
and the word address. The first step of the translation is to use the virtual page number as a key to search

through the TLB for a match. The TLB can be implemented with a special associative memory (content-
addressable memory) or use part of the cache memory.

In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page entry.
The cache block and word address are copied directly. In case the match cannot be found (a miss) in the
TLB, a hashed pointer is used to identify one of the page tables where the desired page frame number can
be retrieved.
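The two-step lookup can be sketched as a tiny TLB backed by a page table; the capacity, the FIFO eviction, and all names here are illustrative assumptions, not the hardware mechanism itself:

```python
class TLB:
    """A small fully associative TLB in front of a page table."""
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # models the PTs: vpn -> frame number
        self.capacity = capacity
        self.entries = {}             # cached (virtual page, page frame) pairs
        self.hits = self.misses = 0

    def frame_for(self, vpn):
        if vpn in self.entries:       # TLB hit: the frame comes straight out
            self.hits += 1
            return self.entries[vpn]
        self.misses += 1              # TLB miss: fall back to the page table
        pfn = self.page_table[vpn]    # a KeyError here would be a page fault
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # FIFO-style eviction
        self.entries[vpn] = pfn
        return pfn

def translate(tlb, vaddr, page_size=1024):
    vpn, offset = divmod(vaddr, page_size)
    return tlb.frame_for(vpn) * page_size + offset
```

As long as a program stays inside one working set of pages, repeat translations hit in the TLB and never touch the page tables.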

Paged Memory Paging is a technique for partitioning both the physical memory and virtual memory into
fixed-size pages. Exchange of information between them is conducted at the page level as described before.

Page tables are used to map between pages and page frames. These tables are implemented in the main
memory upon creation of user processes. Since many user processes may be created dynamically, the number
of PTs maintained in the main memory can be very large. The page table entries (PTEs) are similar to the
TLB entries, containing essentially (virtual page, page frame) address pairs.

Note that both TLB entries and PTEs need to be dynamically updated to reflect the latest memory reference
history. Only "snapshots" of the history are maintained in these translation maps.

If the demanded page cannot be found in the PT, a page fault is declared. A page fault implies that
the referenced page is not resident in the main memory. When a page fault occurs, the running process is
suspended. A context switch is made to another ready-to-run process while the missing page is transferred
from the disk or tape unit to the physical memory.

With advances in processor design and VLSI technology, very sophisticated memory management
schemes can be provided on the processor chip, and even full 64-bit address space can be provided. We shall
review some of these recent advances in Chapter 13.

Segmented Memory A large number of pages can be shared by segmenting the virtual address space
among multiple user programs simultaneously. A segment of scattered pages is formed logically in the virtual
memory space. Segments are defined by users in order to declare a portion of the virtual address space.
In a segmented memory system, user programs can be logically structured as segments. Segments can
invoke each other. Unlike pages, segments can have variable lengths. The management of a segmented
memory system is much more complex due to the nonuniform segment size.
Segments are a user-oriented concept, providing logical structures of programs and data in the virtual
address space. On the other hand, paging facilitates the management of physical memory. In a paged system,
all page addresses form a linear address space within the virtual space.
The segmented memory is arranged as a two-dimensional address space. Each virtual address in this space
has a prefix field called the segment number and a postfix field called the offset within the segment. The offset
addresses within each segment form one dimension of the contiguous addresses. The segment numbers, not
necessarily contiguous to each other, form the second dimension of the address space.

Paged Segments The above two concepts of paging and segmentation can be combined to implement a
type of virtual memory with paged segments. Within each segment, the addresses are divided into fixed-size
pages. Each virtual address is thus divided into three fields. The upper field is the segment number, the middle
one is the page number, and the lower one is the offset within each page.
Paged segments offer the advantages of both paged memory and segmented memory. For users, program
files can be better logically structured. For the OS, the virtual memory can be systematically managed with
fixed-size pages within each segment. Tradeoffs do exist among the sizes of the segment field, the page field,
and the offset field. This sets limits on the number of segments that can be declared by users, the segment size
(the number of pages within each segment), and the page size.
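The three-field split can be illustrated with a few shift-and-mask operations. The 8/12/12 field widths used below are purely illustrative assumptions for a 32-bit virtual address, not taken from any particular machine; they demonstrate the tradeoff that widening one field must narrow another.

```python
# Illustrative paged-segment address split: segment / page / offset.
# The 8/12/12 widths are assumed for this sketch only.
SEG_BITS, PAGE_BITS, OFF_BITS = 8, 12, 12

def split(vaddr):
    """Split a 32-bit virtual address into (segment, page, offset) fields."""
    off = vaddr & ((1 << OFF_BITS) - 1)                 # low 12 bits
    page = (vaddr >> OFF_BITS) & ((1 << PAGE_BITS) - 1) # middle 12 bits
    seg = vaddr >> (OFF_BITS + PAGE_BITS)               # upper 8 bits
    return seg, page, off
```

With these widths a user may declare up to 2^8 segments, each holding up to 2^12 pages of 2^12 bytes.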
Inverted Paging The direct paging described above works well with a small virtual address space such as
32 bits. In modern computers, the virtual address is large, such as 52 bits in the IBM RS/6000 or even 64 bits
in some processors. A large virtual address space demands either large PTs or multilevel direct paging which
will slow down the address translation process and thus lower the performance.
Besides direct mapping, address translation maps can also be implemented with inverted mapping (Fig.
4.21c). An inverted page table is created for each page frame that has been allocated to users. Any virtual
page number can be paired with a given physical page number.
Inverted page tables are accessed either by an associative search or by the use of a hashing function. The
IBM 801 prototype and subsequently the IBM RT/PC have implemented inverted mapping for page address
translation. In using an inverted PT, only virtual pages that are currently resident in physical memory are
included. This provides a significant reduction in the size of the page tables.
The generation of a long virtual address from a short physical address is done with the help of segment
registers, as demonstrated in Fig. 4.21c. The leading 4 bits (denoted sreg) of a 32-bit address name a segment
register. The register provides a segment id that replaces the 4-bit sreg to form a long virtual address.
This effectively creates a single long virtual address space with segment boundaries at multiples of
256 Mbytes (2^28 bytes). The IBM RT/PC had a 12-bit segment id (4096 segments) and a 40-bit virtual address
space.
Either associative page tables or inverted page tables can be used to implement inverted mapping. The
inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use of a large page
table or a sequence of page tables.
Given a virtual address to be translated, the hardware searches the inverted PT for that address and, if it
is found, uses the table index of the matching entry as the address of the desired page frame. A hashing table
is used to search through the inverted PT. The size of an inverted PT is governed by the size of the physical
space, while that of traditional PTs is determined by the size of the virtual space. Because of limited physical
space, no multiple levels are needed for the inverted page table.
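A minimal sketch of the inverted-mapping idea follows: one table entry per physical frame, located through a hashing step and then verified against the stored virtual page number. The table sizes and the toy hash are assumptions for illustration; a real design must also chain hash collisions rather than overwrite, as noted in the comments.

```python
# Hedged sketch of inverted mapping: the table is sized by physical memory
# (one entry per frame), not by the virtual address space.
NFRAMES = 8

ipt = [None] * NFRAMES   # inverted PT: index = frame number, entry = virtual page
hash_table = {}          # toy hashing table: hash bucket -> candidate frame index

def install(vpage, frame):
    """Record that virtual page vpage now occupies the given physical frame."""
    ipt[frame] = vpage
    hash_table[hash(vpage) % 64] = frame   # real designs chain collisions here

def translate(vpage, offset, page_bits=12):
    """Look up vpage via the hash, verify the IPT entry, and form the address."""
    frame = hash_table.get(hash(vpage) % 64)
    if frame is None or ipt[frame] != vpage:   # verify: the hash may collide
        raise LookupError("page fault: page not resident")
    return (frame << page_bits) | offset
```

The key property the sketch preserves is that the matching entry's table index *is* the frame number, so no per-process forward page table is needed.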

Example 4.8 Paging and segmentation in the Intel i486 processor

As with its predecessor in the x86 family, the i486 features both segmentation and paging capabilities.
Protected mode increases the linear address space from 4 Gbytes (2^32 bytes) to 64 Tbytes (2^46 bytes) with four
levels of protection. The maximal memory size in real mode is 1 Mbyte (2^20 bytes). Protected mode allows
the i486 to run all software from existing 8086, 80286, and 80386 processors. A segment can have any length
from 1 byte to 4 Gbytes, the maximum physical memory size.
A segment can start at any base address, and storage overlapping between segments is allowed. The virtual
address (Fig. 4.22a) has a 16-bit segment selector to determine the base address of the linear address space
to be used with the i486 paging system.
Fig. 4.22 Paging and segmentation mechanisms built into the Intel i486 CPU (Courtesy of Intel Corporation, 1990): (a) segmentation to produce the linear address; (b) the TLB operations; (c) a two-level paging scheme. [Figure content not reproduced.]

The 32-bit offset specifies the internal address within a segment. The segment descriptor is used to specify
access rights and segment size besides selection of the address of the first byte of the segment.
The paging feature is optional on the i486. It can be enabled or disabled by software control. When paging
is enabled, the virtual address is first translated into a linear address and then into the physical address.
When paging is disabled, the linear address and physical address are identical. When a 4-Gbyte segment is
selected, the entire physical memory becomes one large segment, which means the segmentation mechanism
is essentially disabled.
In this sense, the i486 can be used with four different memory organizations: pure paging, pure
segmentation, segmented paging, or pure physical addressing without paging and segmentation.
A 32-entry TLB (Fig. 4.22b) is used to convert the linear address directly into the physical address without
resorting to the two-level paging scheme (Fig. 4.22c). The standard page size on the i486 is 4 Kbytes =
2^12 bytes. Four control registers are used to select between regular paging and page fault handling.
The page table directory (4 Kbytes) allows 1024 page directory entries. Each page table at the second level
is 4 Kbytes and holds up to 1024 PTEs. The upper 20 linear address bits are compared to determine if there is
a hit. The hit ratios of the TLB and of the page tables depend on program behavior and the efficiency of the
update (page replacement) policies. A 98% hit ratio has been observed in TLB operations.
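The i486 sizes above (1024 directory entries, 1024 PTEs per table, 4-Kbyte pages) imply a 10/10/12 split of the 32-bit linear address, which can be expressed directly. The function name is our own; the field boundaries follow from the stated table sizes.

```python
# Split a 32-bit i486 linear address into its two-level paging fields:
# 10-bit directory index, 10-bit page-table index, 12-bit page offset.
def i486_fields(linear):
    return (linear >> 22,             # page directory entry selector (bits 31..22)
            (linear >> 12) & 0x3FF,   # page table entry selector (bits 21..12)
            linear & 0xFFF)           # byte offset within the 4-Kbyte page
```

The upper 20 bits (directory plus table index together) are exactly the bits the TLB compares on a lookup.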

Advanced memory management functions, to support virtual memory implementation, were first
introduced in Intel's x86 processor family with the 80386 processor. Key features of the 80486 memory
management scheme described here were carried forward in the Pentium family of processors.

4.4.3 Memory Replacement Policies


Memory management policies include the allocation and deallocation of memory pages to active processes
and the replacement of memory pages. We will study allocation and deallocation problems in Section 5.3.3
after we discuss rnain memory organization in Section 5.3.1.
ln this section, we study page replacement schemes which are implemented with demand paging memory
systems. Page repineemeflt refers to the process in which a resident page in main memory is replaced by a
new page transferred from the disk.
Since the number of available page frames is much smaller than the number of pages, the frames will
eventually be fi.|lly occupied [n order to accommodate a new page, oneofthe resident pages mr.r.st be replaced.
Different policies have been suggested for page replacement "these policies are specified and compared
below.
The goal ofa page replacement policy is to minimize the number of possible page faults so that the
effective memory-access time can be reduced. The effectiveness of a replacement algorithm depends on the
program behavior and memory lraffic patterns encountered. A good policy should match the program locality
property. The policy is also affected by page size and by the number ofavailablc fi'ames.

Page Traces To analyze the performance of a paging memory system, page trace experiments are often
performed. A page trace is a sequence of page frame numbers (PFNs) generated during the execution of a
given program. To simplify the analysis, we ignore the cache effect.
Each PFN corresponds to the prefix portion of a physical memory address. By tracing the successive PFNs
in a page trace against the resident page numbers in the page frames, one can determine the occurrence of
page hits or of page faults. Of course, when all the page frames are taken, a certain replacement policy must
be applied to swap the pages. A page trace experiment can be performed to determine the hit ratio of the
paging memory system. A similar idea can also be applied to perform block traces on cache behavior.
Consider a page trace P(n) = r(1) r(2) ... r(n) consisting of n PFNs requested in discrete time from 1 to n,
where r(t) is the PFN requested at time t. We define two reference distances between the repeated occurrences
of the same page in P(n).
The forward distance f_t(x) for page x is the number of time slots from time t to the first repeated reference
of page x in the future:

    f_t(x) = k,  if k is the smallest integer such that r(t + k) = r(t) = x in P(n)        (4.11)
    f_t(x) = ∞,  if x does not reappear in P(n) beyond time t

Similarly, we define a backward distance b_t(x) as the number of time slots from time t to the most recent
reference of page x in the past:

    b_t(x) = k,  if k is the smallest integer such that r(t − k) = r(t) = x in P(n)        (4.12)
    b_t(x) = ∞,  if x never appeared in P(n) in the past

Let R(t) be the resident set of all pages residing in main memory at time t. Let q(t) be the page to be
replaced from R(t) when a page fault occurs at time t.
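These two distances can be computed mechanically from a trace. The sketch below uses 0-based list indices rather than the 1-based time slots of Eqs. 4.11 and 4.12, but implements the same definitions.

```python
# Forward and backward reference distances (Eqs. 4.11 and 4.12) on a page
# trace, with 0-based indexing into a Python list.
INF = float("inf")

def fwd(trace, t):
    """Smallest k >= 1 with trace[t+k] == trace[t], or infinity (Eq. 4.11)."""
    for k in range(1, len(trace) - t):
        if trace[t + k] == trace[t]:
            return k
    return INF

def bwd(trace, t):
    """Smallest k >= 1 with trace[t-k] == trace[t], or infinity (Eq. 4.12)."""
    for k in range(1, t + 1):
        if trace[t - k] == trace[t]:
            return k
    return INF
```

For instance, on the trace 0, 1, 2, 4, 2, 3, 7, 2, 1, 3, 1 the first reference to page 2 has a forward distance of 2 (it reappears two slots later), and the second reference to page 2 has a backward distance of 2.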

Page Replacement Policies The following page replacement policies are specified in a demand paging
memory system for a page fault at time t.
(1) Least recently used (LRU)—This policy replaces the page in R(t) which has the longest backward
distance:

    q(t) = y  iff  b_t(y) = max {b_t(x) | x ∈ R(t)}        (4.13)

(2) Optimal (OPT) algorithm—This policy replaces the page in R(t) with the longest forward distance:

    q(t) = y  iff  f_t(y) = max {f_t(x) | x ∈ R(t)}        (4.14)

(3) First-in-first-out (FIFO)—This policy replaces the page in R(t) which has been in memory for the
longest time.
(4) Least frequently used (LFU)—This policy replaces the page in R(t) which has been least referenced in
the past.
(5) Circular FIFO—This policy joins all the page frame entries into a circular FIFO queue using a pointer
to indicate the front of the queue. An allocation bit is associated with each page frame. This bit is set
upon initial allocation of a page to the frame.
When a page fault occurs, the queue is circularly scanned from the pointer position. The pointer skips
the allocated page frames and replaces the very first unallocated page frame. When all frames are
allocated, the front of the queue is replaced, as in the FIFO policy.
(6) Random replacement—This is a trivial algorithm which chooses any page for replacement randomly.
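The first three policies can be simulated directly from these definitions. The sketch below keeps the resident set in a Python list and differs from a hardware implementation in every respect except the victim-selection rule; circular FIFO, LFU, and random replacement follow the same skeleton with a different victim choice. Where OPT finds a tie (two pages that never reappear), the choice is arbitrary, as the definition permits.

```python
# Simulate LRU, OPT, or FIFO page replacement on a page trace with a fixed
# number of frames; returns the number of page hits.
def simulate(trace, nframes, policy):
    frames, hits = [], 0
    for t, page in enumerate(trace):
        if page in frames:
            hits += 1
            if policy == "LRU":             # refresh recency on a hit
                frames.remove(page)
                frames.append(page)
            continue
        if len(frames) < nframes:           # page fault, but a frame is free
            frames.append(page)
            continue
        if policy in ("LRU", "FIFO"):       # longest backward distance /
            victim = frames[0]              # longest residence: list order
        elif policy == "OPT":               # longest forward distance
            future = trace[t + 1:]
            victim = max(frames, key=lambda p:
                         future.index(p) if p in future else len(future) + 1)
        frames.remove(victim)
        frames.append(page)
    return hits
```

On the page trace of Example 4.9 below (0, 1, 2, 4, 2, 3, 7, 2, 1, 3, 1) with three frames, this simulation yields 3, 4, and 2 hits for LRU, OPT, and FIFO, respectively.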

Example 4.9 Page tracing experiments and interpretation of results

Consider a paged virtual memory system with a two-level hierarchy: main memory M1 and disk memory M2.
For clarity of illustration, assume a page size of four words. The number of page frames in M1 is 3, labeled a,
b, and c; and the number of pages in M2 is 10, identified by 0, 1, 2, ..., 9. The ith page in M2 consists of word
addresses 4i to 4i + 3 for all i = 0, 1, 2, ..., 9.
A certain program generates the following sequence of word addresses, which are grouped (underlined)
together if they belong to the same page. The sequence of page numbers so formed is the page trace:

    Word trace: 0,1,2,3,  4,5,6,7,  8,  16,17,  9,10,11,  12,  28,29,30,  8,9,10,  4,5,  12,  4,5
    Page trace:     0         1     2     4        2        3      7         2       1     3    1

Page tracing experiments are described below for three page replacement policies: LRU, OPT, and FIFO,
respectively. The successive pages loaded in the page frames (PFs) form the trace entries. Initially, all PFs
are empty.

    Page trace:    0   1   2   4   2   3   7   2   1   3   1    Hit Ratio

    LRU    PF a    0   0   0   4   4   4   7   7   7   3   3
           PF b        1   1   1   1   3   3   3   1   1   1
           PF c            2   2   2   2   2   2   2   2   2      3/11
           Faults  *   *   *   *       *   *       *   *

    OPT    PF a    0   0   0   4   4   3   7   7   7   3   3
           PF b        1   1   1   1   1   1   1   1   1   1
           PF c            2   2   2   2   2   2   2   2   2      4/11
           Faults  *   *   *   *       *   *           *

    FIFO   PF a    0   0   0   4   4   4   4   2   2   2   2
           PF b        1   1   1   1   3   3   3   1   1   1
           PF c            2   2   2   2   7   7   7   3   3      2/11
           Faults  *   *   *   *       *   *   *   *   *

The above results indicate the superiority of the OPT policy over the others. However, the OPT policy cannot
be implemented in practice. The LRU policy performs better than the FIFO due to the locality of references.
From these results, we realize that the LRU is generally better than the FIFO. However, exceptions still exist
due to the dependence on program behavior.

Relative Performance The performance of a page replacement algorithm depends on the page trace
(program behavior) encountered. The best policy is the OPT algorithm. However, the OPT replacement is
not realizable because no one can predict the future page demand in a program.

The LRU algorithm is a popular policy and often results in a high hit ratio. The FIFO and random policies
may perform badly because of violation of the program locality.
The circular FIFO policy attempts to approximate the LRU with a simple circular queue implementation.
The LFU policy may perform between the LRU and the FIFO policies. However, there is no fixed superiority
of any policy over the others because of the dependence on program behavior and run-time status of the page
frames.
In general, the page fault rate is a monotonic decreasing function of the size of the resident set R(t) at time
t, because more resident pages result in a higher hit ratio in the main memory.

Block Replacement Policies The relationship between the cache block frames and cache blocks is similar
to that between page frames and pages on a disk. Therefore, those page replacement policies can be modified
for block replacement when a cache miss occurs.
Different cache organizations (Section 5.1) may offer different flexibilities in implementing some of the
block replacement algorithms. The cache memory is often associatively searched, while the main memory is
randomly addressed.
Due to the difference between page allocation in main memory and block allocation in the cache, the cache
hit ratio and memory page hit ratio are affected by the replacement policies differently. Cache traces are often
needed to evaluate the cache performance. These considerations will be further discussed in Chapter 5.
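As an illustration of how a page-replacement policy carries over to block replacement, the toy sketch below applies LRU within each set of a set-associative cache. All sizes and the address split are assumptions for this sketch, not taken from any particular cache design.

```python
# Toy per-set LRU block replacement in a small set-associative cache.
from collections import OrderedDict

NSETS, WAYS, BLOCK = 4, 2, 16       # 4 sets, 2-way associative, 16-byte blocks
sets = [OrderedDict() for _ in range(NSETS)]   # per-set: tag -> (recency order)

def access(addr):
    """Return True on a cache hit; apply LRU replacement within the set on a miss."""
    block = addr // BLOCK
    s, tag = block % NSETS, block // NSETS     # set index and tag fields
    way = sets[s]
    if tag in way:
        way.move_to_end(tag)        # hit: refresh this block's recency
        return True
    if len(way) == WAYS:
        way.popitem(last=False)     # evict the least recently used block in the set
    way[tag] = True
    return False
```

Because replacement is confined to one small set, the "backward distance" the policy considers is per-set rather than global, which is one reason cache traces are evaluated separately from page traces.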

Summary

One way to define the design space of processors is in terms of the processor clock rate and the average
cycles per instruction (CPI). Depending on the intended applications, different processors—which may
even be of the same processor family—may occupy different positions within this design space. The
processor instruction set may be complex or reduced—and accordingly these two types of processors
occupy different regions of the design space of clock rate versus CPI.
For higher performance, processor designs have evolved to superscalar processors in one direction, and
vector processors in the other. A superscalar processor can schedule two or more machine instructions
through the instruction pipeline in a single clock cycle. Most sequential programs, when translated into
machine language, do contain some level of instruction-level parallelism. Superscalar processors aim to
exploit this parallelism through hardware techniques built into the processor.
Vector processors aim to exploit a common characteristic of most scientific and engineering
applications—processing of large amounts of numeric data in the form of vectors or arrays. The earliest
supercomputers—CDC and Cray—emphasized vector processing, whereas modern application
requirements span a much broader range, and as a result the scope of computer architecture is also
broader today.
Very long instruction word (VLIW) processors were proposed on the premise that the compiler can
schedule multiple independent operations per cycle and pack them into long machine instructions—
relieving the hardware from the task of discovering instruction-level parallelism. Symbolic processors
address the needs of artificial intelligence, which may be contrasted with the number-crunching which
was the focus of earlier generations of supercomputers.
Memory elements provided within the processor operate at processor speed, but they are small
in size, limited by cost and power consumption. Farther away from the processor, memory elements
commonly provided are (one or more levels of) cache memory, main memory, and secondary storage.
The memory at each level is slower than the one at the previous level, but also much larger and less
expensive per bit. The aim behind providing a memory hierarchy is to achieve, as far as possible, the speed
of fast memory at the cost of the slower memory. The properties of inclusion, coherence and locality
make it possible to achieve this complex objective in a computer system.
Virtual memory systems aim to free program size from the size limitations of main memory. Working
set, paging, segmentation, TLBs, and memory replacement policies make up the essential elements of a
virtual memory system, with locality of program references once again playing an important role.

Exercises

Problem 4.1 Define the following basic terms related to modern processor technology:
(a) Processor design space.
(b) Instruction issue latency.
(c) Instruction issue rate.
(d) Simple operation latency.
(e) Resource conflicts.
(f) General-purpose registers.
(g) Addressing modes.
(h) Unified versus split caches.
(i) Hardwired versus microcoded control.

Problem 4.2 Define the following basic terms associated with memory hierarchy design:
(a) Virtual address space.
(b) Physical address space.
(c) Address mapping.
(d) Cache blocks.
(e) Multilevel page tables.
(f) Hit ratio.
(g) Page fault.
(h) Hashing function.
(i) Inverted page table.
(j) Memory replacement policies.

Problem 4.3 Answer the following questions on designing scalar RISC or superscalar RISC processors:
(a) Why do most RISC integer units use 32 general-purpose registers? Explain the concept of register windows implemented in the SPARC architecture.
(b) What are the design tradeoffs between a large register file and a large D-cache? Why are reservation stations or reorder buffers needed in a superscalar processor?
(c) Explain the relationship between the integer unit and the floating-point unit in most RISC processors with scalar or superscalar organization.

Problem 4.4 Based on the discussion of advanced processors in Section 4.1, answer the following questions on RISC, CISC, superscalar, and VLIW architectures:
(a) Compare the instruction-set architecture in RISC and CISC processors in terms of instruction formats, addressing modes, and cycles per instruction (CPI).
(b) Discuss the advantages and disadvantages in using a common cache or separate caches for instructions and data. Explain the support from data paths, MMU and TLB, and memory bandwidth in the two cache architectures.
(c) Distinguish between scalar RISC and superscalar RISC in terms of instruction issue, pipeline architecture, and processor performance.
(d) Explain the difference between superscalar and VLIW architectures in terms of hardware and software requirements.

Problem 4.5 Explain the structures and operational requirements of the instruction pipelines used in CISC, scalar RISC, superscalar RISC, and VLIW processors. Comment on the cycles per instruction expected from these processor architectures.

Problem 4.6 Study the Intel i486 instruction set and the CPU architecture, and answer the following questions:
(a) What are the instruction formats and data formats?
(b) What are the addressing modes?
(c) What are the instruction categories? Describe one example instruction in each category.
(d) What are the HLL support instructions and assembly directives?
(e) What are the interrupt, testing, and debug features?
(f) Explain the difference between real and virtual mode execution.
(g) Explain how to disable paging in the i486 and what kind of application may benefit from this option.
(h) Explain how to disable segmentation in the i486 and what kind of application may use this option.
(i) What kind of protection mechanisms are built into the i486?
(j) Search for information on the Pentium and explain the improvements made, compared with the i486.

Problem 4.7 Answer the following questions after studying Example 4.4, the i860 instruction set, and the architecture of the i860 and its successor the i860XP:
(a) Repeat parts (a), (b), and (c) in Problem 4.6 for the i860/i860XP.
(b) What multiprocessor support instructions are added in the i860XP?
(c) Explain the dual-instruction mode and the dual-operation instructions in i860 processors.
(d) Explain the address translation and paged memory organization of the i860.

Problem 4.8 The SPARC architecture can be implemented with two to eight register windows, for a total of 40 to 136 GPRs in the integer unit. Explain how the GPRs are organized into overlapping windows in each of the following designs:
(a) Use 40 GPRs to construct two windows.
(b) Use 72 GPRs to construct four windows.
(c) In what sense is the SPARC considered a scalable architecture?
(d) Explain how to use the overlapped windows for parameter passing between the calling procedure and the called procedure.

Problem 4.9 Study Section 4.2 and also the paper by Jouppi and Wall (1989) and answer the following questions:
(a) What causes a processor pipeline to be underpipelined?
(b) What are the factors limiting the degree of superscalar design?

Problem 4.10 Answer the following questions related to vector processing:
(a) What are the differences between scalar instructions and vector instructions?
(b) Compare the pipelined execution style in a vector processor with that in a base scalar processor (Fig. 4.15). Analyze the speedup gain of the vector pipeline over the scalar pipeline for long vectors.
(c) Suppose parallel issue is added to vector pipeline execution. What would be the further improvement in throughput, compared with parallel issue in a superscalar pipeline of the same degree?

Problem 4.11 Consider a two-level memory hierarchy, M1 and M2. Denote the hit ratio of M1 as h. Let c1 and c2 be the costs per kilobyte, s1 and s2 the memory capacities, and t1 and t2 the access times, respectively.
(a) Under what conditions will the average cost of the entire memory system approach c2?
(b) What is the effective memory-access time t_eff of this hierarchy?
(c) Let r = t2/t1 be the speed ratio of the two memories. Let E = t1/t_eff be the access efficiency of the memory system. Express E in terms of r and h.
(d) Plot E against h for r = 5, 20, and 100, respectively, on grid paper.
(e) What is the required hit ratio h to make E > 0.95 if r = 100?

Problem 4.12 You are asked to perform capacity planning for a two-level memory system. The first level, M1, is a cache with three capacity choices of 64 Kbytes, 128 Kbytes, and 256 Kbytes. The second level, M2, is a main memory with a 4-Mbyte capacity. Let c1 and c2 be the costs per byte and t1 and t2 the access times for M1 and M2, respectively. Assume c1 = 20c2 and t2 = 10t1. The cache hit ratios for the three capacities are assumed to be 0.7, 0.9, and 0.98, respectively.
(a) What is the average access time t_eff in terms of t1 = 20 ns in the three cache designs? (Note that t1 is the time from CPU to M1 and t2 is that from CPU to M2, not from M1 to M2.)
(b) Express the average byte cost of the entire memory hierarchy if c2 = $0.2/Kbyte.
(c) Compare the three memory designs and indicate the order of merit in terms of average costs and average access times, respectively. Choose the optimal design based on the product of average cost and average access time.

Problem 4.13 Compare the advantages and shortcomings in implementing private virtual memories and a globally shared virtual memory in a multicomputer system. This comparative study should consider the latency, coherence, page migration, protection, implementation, and application problems in the context of building a scalable multicomputer system with distributed shared memories.

Problem 4.14 Explain the inclusion property and memory coherence requirements in a multilevel memory hierarchy. Distinguish between write-through and write-back policies in maintaining the coherence in adjacent levels. Also explain the basic concepts of paging and segmentation in managing the physical and virtual memories in a hierarchy.

Problem 4.15 A two-level memory system has eight virtual pages on a disk to be mapped into four page frames (PFs) in the main memory. A certain program generated the following page trace:
1, 0, 2, 2, 1, 7, 6, 7, 0, 1, 1, 0, 3, 0, 4, 5, 1, 5, 2, 4, 5, 6, 7, 6, 7, 2, 4, 2, 7, 3, 3, 2, 3
(a) Show the successive virtual pages residing in the four page frames with respect to the above page trace using the LRU replacement policy. Compute the hit ratio in the main memory. Assume the PFs are initially empty.
(b) Repeat part (a) for the circular FIFO page replacement policy. Compute the hit ratio in the main memory.
(c) Compare the hit ratios in parts (a) and (b) and comment on the effectiveness of using the circular FIFO policy to approximate the LRU policy with respect to this particular page trace.

Problem 4.16
(a) Explain the temporal locality, spatial locality, and sequential locality associated with program/data access in a memory hierarchy.
(b) What is the working set? Comment on the sensitivity of the observation window size to the size of the working set. How will this affect the main memory hit ratio?
(c) What is the 90-10 rule and its relationship to the locality of references?

Problem 4.17 Consider a two-level memory hierarchy, M1 and M2, with access times t1 and t2, costs per byte c1 and c2, and capacities s1 and s2, respectively. The cache hit ratio h1 = 0.95 at the first level. (Note that t1 is the access time between the CPU and M1, not between M1 and M2.)
(a) Derive a formula showing the effective access time t_eff of this memory system.
(b) Derive a formula showing the total cost of this memory system.
(c) Suppose t1 = 20 ns, t2 is unknown, s1 = 512 Kbytes, s2 is unknown, c1 = $0.01/byte, and c2 = $0.0005/byte. The total cost of the cache and main memory is upper-bounded by $15,000.
(i) How large a capacity of M2 (s2 = ?) can you acquire without exceeding the budget limit?
(ii) How fast a main memory (t2 = ?) do you need to achieve an effective access time of t_eff = 40 ns in the entire memory system under the above hit ratio assumptions?

Problem 4.18 Distinguish between numeric processing and symbolic processing computers in terms of data objects, common operations, memory requirements, communication patterns, algorithmic properties, I/O requirements, and processor architectures.
