
CS6461 Computer Architecture

Fall 2016
Morris Lancaster
Adapted from Professor Stephen Kaisler's Notes
Lecture 4: Memory Systems

(Some material extracted from slides by:
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB))
The Ideal Memory

Size: Infinitely large
Speed: Infinitely fast, i.e., no latency
Cost: Free (well, infinitesimal)

However, once reality sets in, we realize these features are mutually exclusive and not attainable with the technology available today.
Tomorrow, however, is another story.

10/7/2017 CS61 Computer Architecture 2


Memory Hierarchy

Processor
   | 4-64 bytes (word)
L1$
   | 16-128 bytes (block)
L2$
   | 1 to 8 blocks
Main Memory
   | 1,024 - 4M bytes (disk sector = page)
Secondary Memory (usually disk)

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.
Increasing distance from the processor means increasing access time.
We show only two levels of cache here, but as we will see in later lectures, some processors have three levels of cache.
The (relative) size of the memory at each level is shown.

Note: page sizes may be much larger, e.g., 64 KBytes or even 1 MByte.

10/7/2017 CS61 Computer Architecture 3


Memory Hierarchy

Memory is much slower than the processor.
Faster memories are more expensive.
Due to high decoding time and other reasons, larger memories are always slower.
Therefore, locate a small but very fast memory (SRAM: the L1 cache) very close to the processor.
The L2 cache is larger and slower, between L1 and the main memory.
Many processors now have an L3 cache (see Multicores).
Main memory is usually GBs in size and is made up of DRAMs: access takes several nanoseconds.
Secondary memory is on disks and flash devices (hard disk, CD, DVD, Flash, SSD); it is hundreds of GBs to TBs in size, and takes microseconds (flash/SSD) to milliseconds (disk) to access.
CPU registers are closest to the CPU, but do not use memory addresses; they have separate identifiers.

10/7/2017 CS61 Computer Architecture 4


Memory History - 0

See these at the Computer History Museum, Mountain View, CA.

10/7/2017 CS61 Computer Architecture 5


Memory History - I

Information introduced to the memory in the form of electric pulses was transduced into mechanical waves that propagated relatively slowly through a medium.
A relay is an electrically operated switch. Many relays use an electromagnet to operate a switching mechanism mechanically.

10/7/2017 CS61 Computer Architecture 6


Memory History - II

A drum is a large metal cylinder that is coated on the outside surface with a ferromagnetic recording material.
A vacuum tube is a glass enclosure inside which are an anode, a cathode, and other filaments. The tube is evacuated to a vacuum.

10/7/2017 CS61 Computer Architecture 7


Memory History - III

The Williams tube depends on an effect called secondary emission. When a dot is drawn on a cathode ray tube, the area of the dot becomes slightly positively charged and the area immediately around it becomes slightly negatively charged, creating a charge well. The charge well remains on the surface of the tube for a fraction of a second, allowing the device to act as a computer memory. The lifetime of the charge well depends on the electrical resistance of the inside of the tube.
Magnetic cores are little donuts with three wires passing through them: two select x and y, while the third senses or sets the magnetization of the core.
Who invented magnetic cores?

10/7/2017 CS61 Computer Architecture 8


Core Memory

Core memory was the first large-scale, reliable main memory.
Invented by Forrester in the late 1940s/early 1950s at MIT for the Whirlwind project.
Bits stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires.
Coincident current pulses on the X and Y wires would write a cell and also sense the original state (destructive reads).
Robust, non-volatile storage.
Core access time ~ 1 us.

10/7/2017 CS61 Computer Architecture 9


Semiconductor (various types)
DRAM: Dynamic Random Access Memory
DRAM needs its cells recharged (given a new charge) every few milliseconds.
SRAM: Static Random Access Memory
SRAM does not need recharging, since each cell is a small circuit in which current is steered in one of two directions, rather than a storage cell that holds a charge in place.

10/7/2017 CS61 Computer Architecture 10


Modern DRAM

10/7/2017 CS61 Computer Architecture 11


Parity versus Non-Parity

Parity is an error-detection scheme:
developed to detect errors in data, principally over communications lines, but also applied to storage.
By adding a single bit to each byte of data, one can check the integrity of the other 8 bits while the data is transmitted or moved from storage (a sketch of the computation follows below).
This led to error-correcting codes, which are a topic in themselves.
Today, memory errors are rare because of the very high quality of the manufacturing process, so most memory is non-parity.
Parity is still used in some mission-critical systems.
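As an illustration (not from the original slides), a minimal C sketch of computing an even-parity bit for one byte; the function name even_parity and the example value are made up for this example:

#include <stdint.h>
#include <stdio.h>

/* Compute an even-parity bit: the XOR of all 8 data bits.
   Stored alongside the byte, it lets a later read detect
   any single-bit error (the recomputed parity will differ). */
static uint8_t even_parity(uint8_t byte) {
    uint8_t parity = 0;
    for (int i = 0; i < 8; i++)
        parity ^= (byte >> i) & 1u;
    return parity;                          /* 0 or 1 */
}

int main(void) {
    uint8_t data = 0x5A;                    /* 0101 1010: four 1-bits */
    printf("parity(0x%02X) = %u\n", (unsigned)data, (unsigned)even_parity(data));
    return 0;                               /* prints parity 0 (even number of 1s) */
}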

10/7/2017 CS61 Computer Architecture 12


Semiconductor Memory Evolution

DRAM
1970 RAM 4.77 MHz
1987 Fast-Page Mode DRAM 20 MHz
1995 Extended Data Output 20 MHz
1997 PC66 Synchronous DRAM 66 MHz

Synchronous DRAM
1998 PC100 Synchronous DRAM 100 MHz
1999 Rambus DRAM 800 MHz
1999 PC133 Synchronous DRAM 133 MHz
2000 DDR Synchronous DRAM 266 MHz
2002 Enhanced DRAM 450 MHz
2005 DDR2 660 MHz
2009 DDR3 800 MHz
And so on

10/7/2017 CS61 Computer Architecture 13


Memory Interleaving/Banking

Memory interleaving divides memory into banks, as shown below.
Addresses are distributed across the banks; 4-way interleaving is depicted.
Interleaving allows simultaneous access to words in memory if the words are in separate banks (see the sketch below).
As we will see, this may conflict with caching.
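A minimal sketch, assuming low-order 4-way interleaving as in the (missing) figure; NUM_BANKS and map_address are names invented for this illustration:

#include <stdio.h>

#define NUM_BANKS 4   /* assumed 4-way interleaving */

/* Low-order interleaving: consecutive word addresses fall in
   consecutive banks, so a sequential stream of accesses can be
   serviced by all banks in parallel. */
static void map_address(unsigned word_addr, unsigned *bank, unsigned *row) {
    *bank = word_addr % NUM_BANKS;   /* which bank holds the word */
    *row  = word_addr / NUM_BANKS;   /* location within that bank */
}

int main(void) {
    for (unsigned a = 0; a < 8; a++) {
        unsigned bank, row;
        map_address(a, &bank, &row);
        printf("address %u -> bank %u, row %u\n", a, bank, row);
    }
    return 0;
}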

10/7/2017 CS61 Computer Architecture 14


Basic Memory

You can think of computer memory as being one big array of data.
The address serves as an array index.
Each address refers to one word of data.
You can read or modify the data at any given memory address, just
like you can read or modify the contents of an array at any given
index.
If you've worked with pointers in C or C++, then you've already worked with memory addresses.

A 2^k x n memory has inputs ADRS (k-bit address), DATA (n-bit write data), CS (chip select), and WR (write enable), and an n-bit output OUT.

CS  WR  Memory operation
0   x   None
1   0   Read selected word
1   1   Write selected word

10/7/2017 CS61 Computer Architecture 15


Basic Memory - II

The above depicts the main interface to RAM.


- A Chip Select, CS, enables or disables the RAM.
- ADRS specifies the address or location to read from or
write to.
- WR selects between reading from or writing to the
memory.
To read from memory, WR should be set to 0.
OUT will be the n-bit value stored at ADRS.
To write to memory, we set WR = 1.
DATA is the n-bit value to save in memory.
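A behavioral sketch of the interface described above, assuming k = 10 and n = 16 for concreteness; the function memory_cycle is hypothetical, not part of any real device:

#include <stdint.h>
#include <stdio.h>

#define K     10                  /* address width: 2^10 = 1024 words (assumed) */
#define WORDS (1u << K)

static uint16_t mem[WORDS];       /* n = 16-bit words (assumed) */

/* One memory "cycle": returns the value driven on OUT.
   CS=0        -> no operation
   CS=1, WR=0  -> read the selected word
   CS=1, WR=1  -> write DATA into the selected word */
static uint16_t memory_cycle(int cs, int wr, uint16_t adrs, uint16_t data) {
    if (!cs)  return 0;                     /* chip not selected */
    if (wr) { mem[adrs] = data; return data; }
    return mem[adrs];
}

int main(void) {
    memory_cycle(1, 1, 42, 0xBEEF);                        /* write word 42 */
    printf("OUT = 0x%04X\n", memory_cycle(1, 0, 42, 0));   /* read it back  */
    return 0;
}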

10/7/2017 CS61 Computer Architecture 16


Basic Memory - III

[Figure: a 2^N x 2^M cell array. An (N+M)-bit address is split: the N row bits feed a Row Address Decoder that activates one of 2^N word lines; the M column bits feed a Column Decoder & Sense Amplifiers that select one of 2^M columns from the bit lines onto the data output D. Each memory cell stores one bit.]

Bits stored in 2-dimensional arrays on chip


Modern chips have around 4 logical banks on each chip

10/7/2017 CS61 Computer Architecture 17


DRAM Packaging

DIMM (Dual Inline Memory


Module) contains multiple chips
with clock/control/address signals
connected in parallel (sometimes
need buffers to drive signals to all
chips)
Data pins work together to return
wide word (e.g., 64-bit data bus
using 16x4-bit parts)

10/7/2017 CS61 Computer Architecture 18


Memory System Design: Key Ideas

The Principle of Locality:


Programs access a relatively small portion of the address space at
any instant of time.
Instructions and data both exhibit spatial and temporal locality
Temporal locality: If a particular instruction or data item is used
now, there is a good chance that it will be used again in the near
future.
Spatial locality: If a particular instruction or data item is used now,
there is a good chance that the instructions or data items that are
located in memory immediately following or preceding this item will
soon be used.
Therefore, it is a good idea to move such instruction and data
items that are expected to be used soon from slow memory to
fast memory (cache).
BUT! This is prediction, and therefore will not always be correct; it depends on the extent of locality (see the example below).
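A small illustrative C loop (not from the slides) that exhibits both kinds of locality:

#include <stdio.h>

#define N (1 << 20)
static int a[N];

int main(void) {
    long sum = 0;
    /* Spatial locality: a[i], a[i+1], ... sit in the same cache block,
       so one miss brings in data for the next several iterations.
       Temporal locality: sum, i, and the loop instructions are reused
       on every iteration and stay resident in registers / L1. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}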
10/7/2017 CS61 Computer Architecture 19
Simple Example

Simple calculation assuming just the application program:
Assume a 1 GHz processor using 10 ns memory, with 35% of all executed instructions being loads or stores.
The application runs 1 billion instructions.
Straight memory execution time = (1*10^9 + 0.35*10*10^9) * 10^-9 s = 4.5 s.
Assume all instructions and data that are required are stored in a perfect cache that operates within the clock period.
Execution time with perfect cache = 1*10^9 * 10^-9 s = 1 s.
Now, assume that the cache has a hit rate of 90%.
Execution time with cache = (1 + 0.35*0.1*10) s = 1.35 s.
(The 0.1 comes from the 10% cache misses.)
Caches are 95-99% successful in having the required instructions and 75-90% successful for data.
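The same arithmetic written out in C (values taken from the slide; this merely reproduces the calculation above):

#include <stdio.h>

int main(void) {
    double insts      = 1e9;     /* instructions executed     */
    double cycle_time = 1e-9;    /* 1 GHz processor           */
    double mem_time   = 10e-9;   /* 10 ns main memory         */
    double ls_frac    = 0.35;    /* fraction of loads/stores  */

    double no_cache = insts * cycle_time + ls_frac * insts * mem_time;
    double perfect  = insts * cycle_time;
    double hit90    = insts * cycle_time + ls_frac * 0.10 * insts * mem_time;

    printf("no cache:      %.2f s\n", no_cache);  /* 4.50 s */
    printf("perfect cache: %.2f s\n", perfect);   /* 1.00 s */
    printf("90%% hit rate:  %.2f s\n", hit90);     /* 1.35 s */
    return 0;
}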

10/7/2017 CS61 Computer Architecture 20


Cache - I

A cache is a small (size << main memory), fast memory that temporarily holds data and instructions and makes them available to the processor much faster than main memory.
Cache space (~MBytes) is smaller than main memory (~GBytes).
Why do caches succeed in improving performance? LOCALITY!
Hit: the data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)

10/7/2017 CS61 Computer Architecture 21


Cache - II

10/7/2017 CS61 Computer Architecture 22


Cache - III
Cache Algorithm (READ)
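The figure for this slide is not preserved. As a stand-in, here is a minimal sketch of a direct-mapped READ lookup in C; the sizes, the names, and the backing main_memory array are assumptions for illustration only:

#include <stdint.h>
#include <string.h>

#define NLINES      64                      /* assumed number of cache lines  */
#define BLOCK_BYTES 64                      /* assumed block size             */

struct line {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};
static struct line cache[NLINES];
static uint8_t main_memory[1 << 20];        /* 1 MB backing store (assumed)   */

/* READ: hit -> return the cached byte; miss -> fetch the block, then return. */
static uint8_t cache_read(uint32_t addr) {
    uint32_t offset = addr % BLOCK_BYTES;
    uint32_t index  = (addr / BLOCK_BYTES) % NLINES;
    uint32_t tag    = addr / (BLOCK_BYTES * NLINES);
    struct line *l  = &cache[index];

    if (!(l->valid && l->tag == tag)) {     /* miss: fill the line from memory */
        memcpy(l->data, &main_memory[addr - offset], BLOCK_BYTES);
        l->valid = 1;
        l->tag   = tag;
    }
    return l->data[offset];                 /* hit (possibly after the fill)   */
}

int main(void) {
    main_memory[12345] = 0x7F;
    return cache_read(12345) == 0x7F ? 0 : 1;
}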

10/7/2017 CS61 Computer Architecture 23


Cache Performance

Memory access time = cache hit time + cache miss rate * miss penalty
To improve performance, reduce memory access time
=> we need to reduce hit time, miss rate, and miss penalty (see the helper below).
As L1 caches are in the critical path of instruction execution, hit time is the most important parameter.
When one parameter is improved, others might suffer.
Misses:
Compulsory miss: always occurs the first time a block is referenced.
Capacity miss: reduces with an increase in cache size.
Conflict miss: reduces with the level of associativity.
Types:
Instruction or Data Cache: 1-way or 2-way
Data Cache: write-through & write-back
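A one-line helper for the formula above (the numbers passed in main are illustrative, not from the slide):

#include <stdio.h>

/* Average memory access time (AMAT), in cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g., 1-cycle hit, 2% miss rate, 50-cycle miss penalty (assumed values) */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 50.0));   /* 2.00 cycles */
    return 0;
}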

10/7/2017 CS61 Computer Architecture 24


Cache Topology

Determines the number and interconnection of caches


Early caches were focused on instructions, while data were fetched directly from memory.
Then, unified caches holding both instructions and data were used.
Then, split caches: one for instructions and one for data.
Today, either unified or split is used, depending on the processor.

10/7/2017 CS61 Computer Architecture 25


Split vs. Unified Caches

Advantages of unified caches:


Balance the load between instruction and data fetches
depending on the dynamics of the program execution;
Design and implementation are cheaper.
Advantages of split caches (Harvard Architectures)
Competition for the cache between instruction processing (which
fetches from instruction cache) and execution functional units
(which fetch from data cache) is eliminated
Instruction fetch can proceed in parallel with memory access
from the execution unit.

10/7/2017 CS61 Computer Architecture 26


Cache Writing

A Write Buffer is needed between the Cache and Memory


Processor: writes data into the cache and the write buffer
Memory controller: write contents of the buffer to memory
Write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) > 1 / DRAM write cycle
Write buffer saturation

[Diagram: Processor -> Cache -> DRAM, with a Write Buffer between the cache and DRAM.]

10/7/2017 CS61 Computer Architecture 27


Cache Writing - II

Q. Why a write buffer? A. So the CPU doesn't stall.

Q. Why a buffer, why not just one register? A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

Note: We will discuss RAW hazards in a future lecture.

10/7/2017 CS61 Computer Architecture 28


Cache Writing - III

Write: need to update the upper cache(s) and main memory whenever a store instruction modifies the L1 cache.
Write Hit: the item to be modified is in L1.
Write Through: as if there were no L1, write also to L2.
Write Back: set a Dirty Bit, and update L2 before replacing the block.
Although write-through is an inefficient strategy, most L1s and some upper-level caches follow this approach so that read hit time is not affected by the complicated logic needed to update the dirty bit (both policies are sketched below).
Write Miss: the item to be modified is not in L1.
Write allocate: exploit locality, and bring the block to L1.
Write no-allocate: do not fetch the missing block.
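A minimal sketch (an assumed structure, not any particular processor's design) contrasting the two write-hit policies and the role of the dirty bit at eviction:

#include <stdint.h>

/* Hypothetical L1 line with a dirty bit. */
struct l1_line { int valid, dirty; uint32_t tag, data; };

static uint32_t l2[1024];                       /* stand-in for the next level */
static void l2_write(uint32_t addr, uint32_t value) { l2[addr % 1024] = value; }

/* Write-through: update L1 and immediately propagate the value to L2. */
static void write_hit_through(struct l1_line *line, uint32_t addr, uint32_t value) {
    line->data = value;
    l2_write(addr, value);                      /* L2 always stays up to date */
}

/* Write-back: update L1 only and set the dirty bit; L2 is updated later,
   when this line is evicted. */
static void write_hit_back(struct l1_line *line, uint32_t value) {
    line->data  = value;
    line->dirty = 1;
}

static void evict(struct l1_line *line, uint32_t addr) {
    if (line->dirty)                            /* only dirty (write-back) lines */
        l2_write(addr, line->data);             /* need to be written back       */
    line->valid = line->dirty = 0;
}

int main(void) {
    struct l1_line line = {0};
    write_hit_back(&line, 0xCAFE);              /* dirty in L1, L2 is stale      */
    evict(&line, 0x100);                        /* now L2 is updated             */
    write_hit_through(&line, 0x100, 0xBEEF);    /* L2 updated immediately        */
    return 0;
}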
10/7/2017 CS61 Computer Architecture 29
Cache Writing - IV

10/7/2017 CS61 Computer Architecture 30


Cache Organization: Summary

10/7/2017 CS61 Computer Architecture 31


Direct-Mapped Cache

A block can be placed in one location only, given by:


(Block address) MOD (Number of blocks in cache)

10/7/2017 CS61 Computer Architecture 32


Direct Mapped Cache - II
A block can be placed in one location only, given by:
(Block address) MOD (Number of blocks in cache)
In this case: (Block address) MOD (8)
[Figure: a cache with 8 block frames, indexed 000-111, and a memory with 32 cacheable blocks. Example: memory block 11101 maps to cache frame (11101) MOD (1000) = 101, i.e., 29 MOD 8 = 5.]

10/7/2017 CS61 Computer Architecture 33


Direct Mapped Cache - III

A memory block is mapped into a unique cache line, depending on the memory address of the respective block.
A memory address is considered to be composed of three fields:
1. the least significant bits (2 in our example) identify the byte within the block [assume four bytes/block];
2. the rest of the address (22 bits in our example) identifies the block in main memory; for the cache logic, this part is interpreted as two fields:
2a. the least significant bits (14 in our example) specify the cache line;
2b. the most significant bits (8 in our example) represent the tag, which is stored in the cache together with the line.
Tags are stored in the cache in order to distinguish among blocks which fit into the same cache line (see the sketch below).
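Using the field widths from this example (2 offset bits, 14 index bits, 8 tag bits, i.e., a 24-bit address), the decomposition can be written as:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2     /* 4 bytes per block, as in the example  */
#define INDEX_BITS  14    /* 2^14 cache lines, as in the example   */

int main(void) {
    uint32_t addr   = 0xABCDEF;                /* some 24-bit address */
    uint32_t offset =  addr        & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%02X index=0x%04X offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}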

10/7/2017 CS61 Computer Architecture 34


Direct Mapped Cache - IV

Advantages:
simple and cheap;
the tag field is short; only those bits have to be stored
which are not used to address the cache (compare with the
following approaches);
access is very fast.
Disadvantage:
a given block fits into a fixed cache location;
a given cache line will be replaced whenever there is a reference to another memory block which maps to the same line, regardless of what the status of the other cache lines is.
This can produce a low hit ratio, even if only a very small
part of the cache is effectively used.

10/7/2017 CS61 Computer Architecture 35


2-Way Associative Cache
A block can be placed in a restricted set of places, or cache block frames.
A set is a group of block frames in the cache.
A block is first mapped onto the set and then it can be placed anywhere within the set.
The set in this case is chosen by: (Block address) MOD (Number of sets in cache)

10/7/2017 CS61 Computer Architecture 36


Two-Way Set Associative Cache - II

Location 0 can be occupied by data from:


Memory location 0, 2, 4, 6, 8, ... etc.
In general: any memory location whose LSB of the address
is 0
Address<0> => cache index
On a miss, the block will be placed in one of the two
cache lines belonging to that set which corresponds
to the 13 bits field in the memory address.
The replacement algorithm decides which line to use.
A memory block is mapped into any of the lines of a
set.
The set is determined by the memory address, but the line
inside the set can be any one.
10/7/2017 CS61 Computer Architecture 37
Two-Way, Set Associative Cache - III

Several tags (corresponding to all lines in the set) have


to be checked in order to determine if we have a hit or
miss. If we have a hit, the cache logic finally points to the
actual line in the cache.
The number of lines in a set is determined by the
designer:
2 lines/set: two-way set associative mapping;
4 lines/set: four-way set associative mapping
Set associative mapping keeps most of the advantages
of direct mapping:
short tag field
fast access
relatively simple
10/7/2017 CS61 Computer Architecture 38
Two-Way Set Associative Cache - IV

Set associative mapping tries to eliminate the main


shortcoming of direct mapping
a certain flexibility is given concerning the line to be
replaced when a new block is read into the cache.
Cache hardware is more complex for set associative
mapping than for direct mapping.
In practice 2 and 4-way set associative mapping are
used with very good results.
Larger sets do not produce further significant
performance improvement
Interesting thesis topic: Is this true for multicore
architectures?

10/7/2017 CS61 Computer Architecture 39


Fully Associative Cache

A block can be placed anywhere in cache


Lookup hardware for many tags can be large and slow

10/7/2017 CS61 Computer Architecture 40


Cache Replacement Strategy

Replacing a block on a cache miss?


Easy for Direct Mapped
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)
FIFO (First In First Out)
LFU (Least Frequently used)
LRU is the most efficient: relatively simple to implement
and good results.
FIFO is simple to implement.
Random replacement is the simplest to implement and
results are surprisingly good.
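A minimal LRU sketch for a single 4-way set (illustrative only; real hardware typically uses approximations such as pseudo-LRU):

#include <stdint.h>
#include <stdio.h>

#define WAYS 4

struct way { int valid; uint32_t tag; unsigned last_used; };
static struct way set[WAYS];      /* one cache set */
static unsigned now;              /* logical access counter */

/* Returns 1 on a hit, 0 on a miss; on a miss the least recently
   used way (or an invalid one) is chosen as the victim. */
static int access_set(uint32_t tag) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {    /* hit: refresh its age */
            set[w].last_used = ++now;
            return 1;
        }
        if (!set[w].valid || set[w].last_used < set[victim].last_used)
            victim = w;                             /* track LRU / empty way */
    }
    set[victim] = (struct way){ .valid = 1, .tag = tag, .last_used = ++now };
    return 0;
}

int main(void) {
    unsigned refs[] = { 1, 2, 3, 4, 1, 5, 2 };      /* tag reference stream */
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        printf("tag %u: %s\n", refs[i], access_set(refs[i]) ? "hit" : "miss");
    return 0;
}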

10/7/2017 CS61 Computer Architecture 41


Comparison by Size

Associativity: 2-way 4-way 8-way


Size LRU Random LRU Random LRU Random
16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%
64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

So, what does this tell us?


You don't gain a significant advantage from higher associativity or a more sophisticated replacement policy as the cache size grows.

10/7/2017 CS61 Computer Architecture 42


Cache Performance Revisited

Suppose a processor executes at


Clock Rate = 1000 MHz (1 ns per cycle)
CPI = 1.0
50% arithmetic/logic
30% load/store
20% control

Suppose that 10% of memory operations get 100 cycle miss


penalty
CPI = ideal CPI + average stall cycles per instruction
    = 1.0 (cycle)
      + ( 0.30 (data operations/instruction)
          * 0.10 (misses/data operation) * 100 (cycles/miss) )
    = 1.0 cycle + 3.0 cycles
    = 4.0 cycles
75% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 1.0 cycle to the CPI!
10/7/2017 CS61 Computer Architecture 43
Lower Miss Rate: 16 KB D/I or 32 KB Unified Cache?

Assume a hit takes 1 clock cycle and the miss penalty is 50 cycles.
Assume a load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
Assume 75% of memory accesses are instruction references.
Assume miss rates of 0.64% for the 16 KB instruction cache, 6.47% for the 16 KB data cache, and 1.99% for the 32 KB unified cache.
Overall miss rate (split) = (75% x 0.64%) + (25% x 6.47%) = 2.10%

Average memory access time (split)
= 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 0.990 + 1.059
= 2.05 cycles

Average memory access time (unified)
= 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
= 1.496 + 0.749 = 2.24 cycles

10/7/2017 CS61 Computer Architecture 44


Cache Addressing

Access caches by virtual address or physical address

10/7/2017 CS61 Computer Architecture 45


Cache Addressing

The physical or the virtual address can be used for the index, the tag, or both:

Index \ Tag    Physical (P)                          Virtual (V)
Physical       PP: must translate the address        PV: for complex, unusual
               first                                  systems
Virtual        VP: can operate concurrently with     VV: very fast
               the MMU; the MMU checks the tag

10/7/2017 CS61 Computer Architecture 46


Summary: Cache Issues

How many caches?


Write-Through or Write Back?
Direct-Mapped or Set Associative?
How to determine on a read whether we have a miss or a hit?
If there is a miss and there is no room for a new block in the cache, which information should be replaced?
How to preserve consistency between cache and main memory on a write?
Replacement Strategy?

10/7/2017 CS61 Computer Architecture 47


Cache Optimizations

Henk Corperaal, www.ics.ele.tue.nl/~heco/courses/aca


Good topics for a term paper!

Reducing hit time


Small and simple caches
Way prediction
Trace caches
Increasing cache bandwidth
Pipelined caches
Multibanked caches
Nonblocking caches
Reducing Miss Penalty
Critical word first
Merging write buffers
Reducing Miss Rate
Compiler optimizations
Reducing miss penalty or miss rate via parallelism
Hardware prefetching
Compiler prefetching

10/7/2017 CS61 Computer Architecture 48


Cache Optimizations - I

1. Fast Hit via Small and Simple Caches


Indexing the tag memory and then comparing takes time.
A small cache is faster.
Also, an L2 cache small enough to fit on the chip with the processor avoids the time penalty of going off chip.
Simple: direct mapping.
The tag check can be overlapped with data transmission, since there is no choice of block.

10/7/2017 CS61 Computer Architecture 49


Cache Optimizations - II

2. Fast Hit via Way Prediction


Make set-associative caches faster.
Keep extra bits in the cache to predict the way (block within the set) of the next cache access.
The multiplexor is set early to select the desired block, and only 1 tag comparison is performed.
On a miss, check the other blocks for matches in the next clock cycle.
Accuracy ~85%.
Drawback: the CPU pipeline is harder to design if a hit can take either 1 or 2 cycles.

10/7/2017 CS61 Computer Architecture 50


Cache Optimizations - III

3. Fast Hit via Trace Cache


Key Idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.
A single fetch brings in multiple basic blocks.
The trace cache is indexed by the start address and the next n branch predictions.
+ better utilizes long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
10/7/2017 CS61 Computer Architecture 51
Cache Optimizations - IV

4. Increase Cache Bandwidth by Pipelining


Pipeline cache access to maintain bandwidth, but higher
latency
Nr. of Instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4
greater penalty on mispredicted branches
more clock cycles between the issue of the load and the
use of the data

10/7/2017 CS61 Computer Architecture 52


Cache Optimizations - V

5. Increasing Cache Bandwidth:


Non-blocking cache or lockup-free cache
allow data cache to continue to supply cache hits during a miss
requires out-of-order execution CPU
hit under miss reduces the effective miss penalty by
continuing during miss
hit under multiple miss or miss under miss may further
lower the effective miss penalty by overlapping multiple
misses
Requires that memory system can service multiple misses
Significantly increases the complexity of the cache controller as
there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise cannot support)
Pentium Pro allows 4 outstanding memory misses

10/7/2017 CS61 Computer Architecture 53


Cache Optimizations - VI

6. Increase Cache Bandwidth via Multiple Banks


Divide cache into independent banks that can support
simultaneous accesses
E.g., T1 (Niagara) L2 has 4 banks
Banking works best when accesses naturally spread
themselves across banks mapping of addresses
to banks affects behavior of memory system
Simple mapping that works well is sequential
interleaving
Spread block addresses sequentially across banks
E.g., with 4 banks, bank 0 has all blocks with address % 4 = 0, bank 1 has all blocks with address % 4 = 1, and so on.

10/7/2017 CS61 Computer Architecture 54


Cache Optimizations - VII

7. Early Restart/Critical Word First to reduce miss


penalty

Don't wait for the full block to be loaded before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while the rest of the words in the block are filled in.
Generally useful only when blocks are large.

10/7/2017 CS61 Computer Architecture 55


Cache Optimizations - VIII

8. Merging Write Buffer to Reduce Miss Penalty


Write buffer to allow processor to continue while waiting to write to memory
E.g., four writes are merged into one buffer entry rather than putting them in
separate buffers
Less frequent write backs

10/7/2017 CS61 Computer Architecture 56


Cache Optimizations - IX

9. Reducing Misses By Compiler Optimizations


Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using developed tools)
Data
Merging Arrays: improve spatial locality by single array of compound
elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in order
stored in memory
Loop Fusion: combine 2 independent loops that have same looping
and some variables overlap
Blocking: Improve temporal locality by accessing blocks of data
repeatedly vs. going down whole columns or rows

10/7/2017 CS61 Computer Architecture 57


Example: Merging Arrays
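The original figure is not preserved; the standard merging-arrays transformation (in the style of the Hennessy & Patterson example, with an assumed array size) looks like this:

#define SIZE 10000

/* Before: two separate arrays; key[i] and val[i] for the same i
   may land in different cache blocks. */
int val[SIZE];
int key[SIZE];

/* After: one array of structs; key and val for the same i share a
   cache block, improving spatial locality when both are accessed. */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];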

10/7/2017 CS61 Computer Architecture 58


Example: Loop Interchange
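The original figure is not preserved; a representative loop-interchange example (array dimensions assumed), where swapping the loop nest makes the inner loop walk memory with stride 1:

#define ROWS 5000
#define COLS 100
int x[ROWS][COLS];

void before(void) {
    /* Inner loop strides through memory by COLS*sizeof(int) bytes:
       poor spatial locality. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

void after(void) {
    /* Inner loop touches consecutive words: every fetched cache
       block is fully used before it is evicted. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}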

10/7/2017 CS61 Computer Architecture 59


Example: Loop Fusion
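The original figure is not preserved; a representative loop-fusion example (array sizes assumed), combining two loops with the same iteration space so that data is reused while still in the cache:

#define N 1000
double a[N][N], b[N][N], c[N][N], d[N][N];

void before(void) {
    /* Two passes over the same data: by the time the second loop
       runs, a[i][j] and c[i][j] have likely been evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

void after(void) {
    /* Fused: a[i][j] and c[i][j] are reused while still in cache. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}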

10/7/2017 CS61 Computer Architecture 60


Blocking Applied to Array Multiplication
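The original figure is not preserved; a sketch of blocked matrix multiplication with assumed sizes and blocking factor B, operating on B x B submatrices so the working set of y and z stays cache-resident:

#define N 512
#define B 32          /* blocking factor (assumed) */
double x[N][N], y[N][N], z[N][N];   /* x starts zero-initialized (static storage) */

/* Blocked x = y * z: the jj/kk loops pick a B x B tile; the inner loops
   reuse that tile many times before moving on, improving temporal locality. */
void blocked_multiply(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;       /* accumulate partial products */
                }
}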

10/7/2017 CS61 Computer Architecture 61


Blocking Applied to Array Multiplication

10/7/2017 CS61 Computer Architecture 62


Blocking Applied to Array Multiplication

Conflict misses in caches vs. blocking size:
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite both fitting in the cache.

10/7/2017 CS61 Computer Architecture 63


Cache Optimizations - X

10. Reducing Cache Misses by Hardware Prefetching


Use extra memory bandwidth (if available)
Instruction Prefetching
Typically, CPU fetches 2 blocks on a miss: the requested block
and the next consecutive block.
Requested block is placed in instruction cache when it returns,
and prefetched block is placed into instruction stream buffer
Data Prefetching
Pentium 4 can prefetch data into L2 cache from up to 8 streams
from 8 different 4 KB pages
Prefetching invoked if 2 successive L2 cache misses to a page,
if distance between those cache blocks is < 256 bytes

10/7/2017 CS61 Computer Architecture 64


Cache Optimizations - XI

11. Reducing Cache Misses by Software-Controlled Prefetching
Data Prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults;
a form of speculative execution
Issuing Prefetch Instructions takes time
Is cost of prefetch issues < savings in reduced misses?
Wider superscalar reduces difficulty of issue bandwidth
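As one concrete illustration (using GCC/Clang's __builtin_prefetch intrinsic rather than the specific ISA instructions named above; the distance AHEAD is an assumption to be tuned per machine):

#define AHEAD 16   /* how far ahead to prefetch (assumed) */

/* Prefetch a[i + AHEAD] while working on a[i], so that by the time
   the loop reaches it, the block is (hopefully) already in cache.
   The prefetch cannot fault and only hints the memory system. */
long sum_with_prefetch(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}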

10/7/2017 CS61 Computer Architecture 65


Cache Optimizations - Summary

10/7/2017 CS61 Computer Architecture 66
