
CS6461 Computer Architecture

Fall 2016
Morris Lancaster
Adapted from Professor Stephen Kaisler's Notes
Lecture 4: Memory Systems

(Some material extracted from slides by:
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB))
The Ideal Memory

Size: Infinitely large
Speed: Infinitely fast, i.e., no latency
Cost: Free (well, infinitesimal)

However, once reality sets in, we realize these features are mutually exclusive and not attainable with the technology available today.
Tomorrow, however, is another story.

10/7/2017 CS61 Computer Architecture 2


Memory Hierarchy

Processor
   | 4-64 bytes (word)
L1$
   | 16-128 bytes (block)
L2$
   | 1 to 8 blocks
Main Memory
   | 1,024 - 4M bytes (disk sector = page)
Secondary Memory (usually disk)

Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.
Increasing distance from the processor means increasing access time.
We show only two levels of cache here, but as we will see in later lectures, some processors have three levels of cache.
The (relative) size of the memory at each level is shown.

Note: page sizes may be much larger, e.g., 64 KBytes or even 1 MByte.

10/7/2017 CS61 Computer Architecture 3


Memory Hierarchy

Memory is much slower than the processor.
Faster memories are more expensive.
Due to high decoding time and other reasons, larger memories are always slower.
Therefore, locate a small but very fast memory (SRAM: the L1 cache) very close to the processor.
The L2 cache is larger and slower, between L1 and the main memory.
Many processors now have an L3 cache (see Multicores).
Main memory is usually GBs in size and is made up of DRAMs: access takes several nanoseconds.
Secondary memory is on disks and flash devices (hard disk, CD, DVD, Flash, SSD); it is hundreds of GBs to TBs in size, and takes microseconds (flash/SSD) to milliseconds (disk) to access.
CPU registers are closest to the CPU, but do not use memory addresses; they have separate identifiers.

10/7/2017 CS61 Computer Architecture 4


Memory History - 0

See these at the Computer History Museum, Mountain View, CA.

10/7/2017 CS61 Computer Architecture 5


Memory History - I

Information introduced to the memory in the form of electric pulses was transduced into mechanical waves that propagated relatively slowly through a medium.
A relay is an electrically operated switch. Many relays use an electromagnet to operate a switching mechanism mechanically.

10/7/2017 CS61 Computer Architecture 6


Memory History - II

A drum is a large metal cylinder that is coated on the outside surface with a ferromagnetic recording material.
A vacuum tube is a glass enclosure inside which are an anode, a cathode, and other filaments. The tube is evacuated to a vacuum.

10/7/2017 CS61 Computer Architecture 7


Memory History - III

The Williams tube depends on an effect called secondary emission. When a dot is drawn on a cathode ray tube, the area of the dot becomes slightly positively charged and the area immediately around it becomes slightly negatively charged, creating a charge well. The charge well remains on the surface of the tube for a fraction of a second, allowing the device to act as a computer memory. The lifetime of the charge well depends on the electrical resistance of the inside of the tube.
Magnetic cores are little donuts with three wires passing through them: two select x and y, while the third senses or sets the magnetization of the core.
Who invented magnetic cores?

10/7/2017 CS61 Computer Architecture 8


Core Memory

Core memory was the first large-scale, reliable main memory.
Invented by Forrester in the late 1940s/early 1950s at MIT for the Whirlwind project.
Bits stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires.
Coincident current pulses on the X and Y wires would write a cell and also sense the original state (destructive reads).
Robust, non-volatile storage.
Core access time ~ 1 us.

10/7/2017 CS61 Computer Architecture 9


Semiconductor (various types)
DRAM: Dynamic Random Access Memory
DRAM needs its cells recharged (given a new charge) every few milliseconds.
SRAM: Static Random Access Memory
SRAM does not need recharging, since each cell is a small circuit in which current is steered in one of two directions, rather than a storage cell that holds a charge in place.

10/7/2017 CS61 Computer Architecture 10


Modern DRAM

10/7/2017 CS61 Computer Architecture 11


Parity versus Non-Parity

Parity is an error-detection scheme:
developed to detect errors in data, principally over communications lines, but also applied to storage.
By adding a single bit to each byte of data, one can check the integrity of the other 8 bits while the data is transmitted or moved from storage (a sketch of the computation follows below).
This led to error-correcting codes, which are a topic in themselves.
Today, memory errors are rare because of the very high quality of the manufacturing process, so most memory is non-parity.
Parity is still used in some mission-critical systems.
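As an illustration (not from the original slides), a minimal C sketch of computing an even-parity bit for one byte; the function name even_parity and the example value are made up for this example:

#include <stdint.h>
#include <stdio.h>

/* Compute an even-parity bit: the XOR of all 8 data bits.
   Stored alongside the byte, it lets a later read detect
   any single-bit error (the recomputed parity will differ). */
static uint8_t even_parity(uint8_t byte) {
    uint8_t parity = 0;
    for (int i = 0; i < 8; i++)
        parity ^= (byte >> i) & 1u;
    return parity;                          /* 0 or 1 */
}

int main(void) {
    uint8_t data = 0x5A;                    /* 0101 1010: four 1-bits */
    printf("parity(0x%02X) = %u\n", (unsigned)data, (unsigned)even_parity(data));
    return 0;                               /* prints parity 0 (even number of 1s) */
}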

10/7/2017 CS61 Computer Architecture 12


Semiconductor Memory Evolution

DRAM
1970 RAM 4.77 MHz
1987 Fast-Page Mode DRAM 20 MHz
1995 Extended Data Output 20 MHz
1997 PC66 Synchronous DRAM 66 MHz

Synchronous DRAM
1998 PC100 Synchronous DRAM 100 MHz
1999 Rambus DRAM 800 MHz
1999 PC133 Synchronous DRAM 133 MHz
2000 DDR Synchronous DRAM 266 MHz
2002 Enhanced DRAM 450 MHz
2005 DDR2 660 MHz
2009 DDR3 800 MHz
And so on

10/7/2017 CS61 Computer Architecture 13


Memory Interleaving/Banking

Memory interleaving divides memory into banks, as shown below.
Addresses are distributed across the banks; 4-way interleaving is depicted.
Interleaving allows simultaneous access to words in memory if the words are in separate banks (see the sketch below).
As we will see, this may conflict with caching.
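A minimal sketch, assuming low-order 4-way interleaving as in the (missing) figure; NUM_BANKS and map_address are names invented for this illustration:

#include <stdio.h>

#define NUM_BANKS 4   /* assumed 4-way interleaving */

/* Low-order interleaving: consecutive word addresses fall in
   consecutive banks, so a sequential stream of accesses can be
   serviced by all banks in parallel. */
static void map_address(unsigned word_addr, unsigned *bank, unsigned *row) {
    *bank = word_addr % NUM_BANKS;   /* which bank holds the word */
    *row  = word_addr / NUM_BANKS;   /* location within that bank */
}

int main(void) {
    for (unsigned a = 0; a < 8; a++) {
        unsigned bank, row;
        map_address(a, &bank, &row);
        printf("address %u -> bank %u, row %u\n", a, bank, row);
    }
    return 0;
}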

10/7/2017 CS61 Computer Architecture 14


Basic Memory

You can think of computer memory as being one big array of data.
The address serves as an array index.
Each address refers to one word of data.
You can read or modify the data at any given memory address, just
like you can read or modify the contents of an array at any given
index.
If you've worked with pointers in C or C++, then you've already worked with memory addresses.

A 2^k x n memory has inputs ADRS (k-bit address), DATA (n-bit write data), CS (chip select), and WR (write enable), and an n-bit output OUT.

CS  WR  Memory operation
0   x   None
1   0   Read selected word
1   1   Write selected word

10/7/2017 CS61 Computer Architecture 15


Basic Memory - II

The above depicts the main interface to RAM.


- A Chip Select, CS, enables or disables the RAM.
- ADRS specifies the address or location to read from or
write to.
- WR selects between reading from or writing to the
memory.
To read from memory, WR should be set to 0.
OUT will be the n-bit value stored at ADRS.
To write to memory, we set WR = 1.
DATA is the n-bit value to save in memory.
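A behavioral sketch of the interface described above, assuming k = 10 and n = 16 for concreteness; the function memory_cycle is hypothetical, not part of any real device:

#include <stdint.h>
#include <stdio.h>

#define K     10                  /* address width: 2^10 = 1024 words (assumed) */
#define WORDS (1u << K)

static uint16_t mem[WORDS];       /* n = 16-bit words (assumed) */

/* One memory "cycle": returns the value driven on OUT.
   CS=0        -> no operation
   CS=1, WR=0  -> read the selected word
   CS=1, WR=1  -> write DATA into the selected word */
static uint16_t memory_cycle(int cs, int wr, uint16_t adrs, uint16_t data) {
    if (!cs)  return 0;                     /* chip not selected */
    if (wr) { mem[adrs] = data; return data; }
    return mem[adrs];
}

int main(void) {
    memory_cycle(1, 1, 42, 0xBEEF);                        /* write word 42 */
    printf("OUT = 0x%04X\n", memory_cycle(1, 0, 42, 0));   /* read it back  */
    return 0;
}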

10/7/2017 CS61 Computer Architecture 16


Basic Memory - III

[Figure: a 2^N x 2^M cell array. An (N+M)-bit address is split: the N row bits feed a Row Address Decoder that activates one of 2^N word lines; the M column bits feed a Column Decoder & Sense Amplifiers that select one of 2^M columns from the bit lines onto the data output D. Each memory cell stores one bit.]

Bits stored in 2-dimensional arrays on chip


Modern chips have around 4 logical banks on each chip

10/7/2017 CS61 Computer Architecture 17


DRAM Packaging

DIMM (Dual Inline Memory


Module) contains multiple chips
with clock/control/address signals
connected in parallel (sometimes
need buffers to drive signals to all
chips)
Data pins work together to return
wide word (e.g., 64-bit data bus
using 16x4-bit parts)

10/7/2017 CS61 Computer Architecture 18


Memory System Design: Key Ideas

The Principle of Locality:


Programs access a relatively small portion of the address space at
any instant of time.
Instructions and data both exhibit spatial and temporal locality
Temporal locality: If a particular instruction or data item is used
now, there is a good chance that it will be used again in the near
future.
Spatial locality: If a particular instruction or data item is used now,
there is a good chance that the instructions or data items that are
located in memory immediately following or preceding this item will
soon be used.
Therefore, it is a good idea to move such instruction and data
items that are expected to be used soon from slow memory to
fast memory (cache).
BUT! This is prediction, and therefore will not always be correct; it depends on the extent of locality (see the example below).
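A small illustrative C loop (not from the slides) that exhibits both kinds of locality:

#include <stdio.h>

#define N (1 << 20)
static int a[N];

int main(void) {
    long sum = 0;
    /* Spatial locality: a[i], a[i+1], ... sit in the same cache block,
       so one miss brings in data for the next several iterations.
       Temporal locality: sum, i, and the loop instructions are reused
       on every iteration and stay resident in registers / L1. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("%ld\n", sum);
    return 0;
}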
10/7/2017 CS61 Computer Architecture 19
Simple Example

Simple calculation assuming just the application program:
Assume a 1 GHz processor using 10 ns memory, with 35% of all executed instructions being loads or stores.
The application runs 1 billion instructions.
Straight memory execution time = (1*10^9 + 0.35*10*10^9) * 10^-9 s = 4.5 s.
Assume all instructions and data that are required are stored in a perfect cache that operates within the clock period.
Execution time with perfect cache = 1*10^9 * 10^-9 s = 1 s.
Now, assume that the cache has a hit rate of 90%.
Execution time with cache = (1 + 0.35*0.1*10) s = 1.35 s.
(The 0.1 comes from the 10% cache misses.)
Caches are 95-99% successful in having the required instructions and 75-90% successful for data.
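The same arithmetic written out in C (values taken from the slide; this merely reproduces the calculation above):

#include <stdio.h>

int main(void) {
    double insts      = 1e9;     /* instructions executed     */
    double cycle_time = 1e-9;    /* 1 GHz processor           */
    double mem_time   = 10e-9;   /* 10 ns main memory         */
    double ls_frac    = 0.35;    /* fraction of loads/stores  */

    double no_cache = insts * cycle_time + ls_frac * insts * mem_time;
    double perfect  = insts * cycle_time;
    double hit90    = insts * cycle_time + ls_frac * 0.10 * insts * mem_time;

    printf("no cache:      %.2f s\n", no_cache);  /* 4.50 s */
    printf("perfect cache: %.2f s\n", perfect);   /* 1.00 s */
    printf("90%% hit rate:  %.2f s\n", hit90);     /* 1.35 s */
    return 0;
}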

10/7/2017 CS61 Computer Architecture 20


Cache - I

A cache is a small (size << main memory), fast memory that temporarily holds data and instructions and makes them available to the processor much faster than main memory.
Cache space (~MBytes) is smaller than main memory (~GBytes).
Why do caches succeed in improving performance? LOCALITY!
Hit: the data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on the Alpha 21264!)

10/7/2017 CS61 Computer Architecture 21


Cache - II

10/7/2017 CS61 Computer Architecture 22


Cache - III
Cache Algorithm (READ)
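The figure for this slide is not preserved. As a stand-in, here is a minimal sketch of a direct-mapped READ lookup in C; the sizes, the names, and the backing main_memory array are assumptions for illustration only:

#include <stdint.h>
#include <string.h>

#define NLINES      64                      /* assumed number of cache lines  */
#define BLOCK_BYTES 64                      /* assumed block size             */

struct line {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};
static struct line cache[NLINES];
static uint8_t main_memory[1 << 20];        /* 1 MB backing store (assumed)   */

/* READ: hit -> return the cached byte; miss -> fetch the block, then return. */
static uint8_t cache_read(uint32_t addr) {
    uint32_t offset = addr % BLOCK_BYTES;
    uint32_t index  = (addr / BLOCK_BYTES) % NLINES;
    uint32_t tag    = addr / (BLOCK_BYTES * NLINES);
    struct line *l  = &cache[index];

    if (!(l->valid && l->tag == tag)) {     /* miss: fill the line from memory */
        memcpy(l->data, &main_memory[addr - offset], BLOCK_BYTES);
        l->valid = 1;
        l->tag   = tag;
    }
    return l->data[offset];                 /* hit (possibly after the fill)   */
}

int main(void) {
    main_memory[12345] = 0x7F;
    return cache_read(12345) == 0x7F ? 0 : 1;
}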

10/7/2017 CS61 Computer Architecture 23


Cache Performance

Memory access time = cache hit time + cache miss rate * miss penalty
To improve performance, reduce memory access time
=> we need to reduce hit time, miss rate, and miss penalty (see the helper below).
As L1 caches are in the critical path of instruction execution, hit time is the most important parameter.
When one parameter is improved, others might suffer.
Misses:
Compulsory miss: always occurs the first time a block is referenced.
Capacity miss: reduces with an increase in cache size.
Conflict miss: reduces with the level of associativity.
Types:
Instruction or Data Cache: 1-way or 2-way
Data Cache: write-through & write-back
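A one-line helper for the formula above (the numbers passed in main are illustrative, not from the slide):

#include <stdio.h>

/* Average memory access time (AMAT), in cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g., 1-cycle hit, 2% miss rate, 50-cycle miss penalty (assumed values) */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 50.0));   /* 2.00 cycles */
    return 0;
}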

10/7/2017 CS61 Computer Architecture 24


Cache Topology

Determines the number and interconnection of caches


Early caches were focused on instructions, while data were fetched directly from memory.
Then, unified caches holding both instructions and data were used.
Then, split caches: one for instructions and one for data.
Today, either unified or split is used, depending on the processor.

10/7/2017 CS61 Computer Architecture 25


Split vs. Unified Caches

Advantages of unified caches:


Balance the load between instruction and data fetches
depending on the dynamics of the program execution;
Design and implementation are cheaper.
Advantages of split caches (Harvard Architectures)
Competition for the cache between instruction processing (which
fetches from instruction cache) and execution functional units
(which fetch from data cache) is eliminated
Instruction fetch can proceed in parallel with memory access
from the execution unit.

10/7/2017 CS61 Computer Architecture 26


Cache Writing

A Write Buffer is needed between the Cache and Memory


Processor: writes data into the cache and the write buffer
Memory controller: write contents of the buffer to memory
Write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) > 1 / DRAM write cycle
Write buffer saturation

[Diagram: Processor -> Cache -> DRAM, with a Write Buffer between the cache and DRAM.]

10/7/2017 CS61 Computer Architecture 27


Cache Writing - II

Q. Why a write buffer? A. So the CPU doesn't stall.

Q. Why a buffer, why not just one register? A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffer.

Note: We will discuss RAW hazards in a future lecture.

10/7/2017 CS61 Computer Architecture 28


Cache Writing - III

Write: need to update the upper cache(s) and main memory whenever a store instruction modifies the L1 cache.
Write Hit: the item to be modified is in L1.
Write Through: as if there were no L1, write also to L2.
Write Back: set a Dirty Bit, and update L2 before replacing the block.
Although write-through is an inefficient strategy, most L1s and some upper-level caches follow this approach so that read hit time is not affected by the complicated logic needed to update the dirty bit (both policies are sketched below).
Write Miss: the item to be modified is not in L1.
Write allocate: exploit locality, and bring the block to L1.
Write no-allocate: do not fetch the missing block.
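A minimal sketch (an assumed structure, not any particular processor's design) contrasting the two write-hit policies and the role of the dirty bit at eviction:

#include <stdint.h>

/* Hypothetical L1 line with a dirty bit. */
struct l1_line { int valid, dirty; uint32_t tag, data; };

static uint32_t l2[1024];                       /* stand-in for the next level */
static void l2_write(uint32_t addr, uint32_t value) { l2[addr % 1024] = value; }

/* Write-through: update L1 and immediately propagate the value to L2. */
static void write_hit_through(struct l1_line *line, uint32_t addr, uint32_t value) {
    line->data = value;
    l2_write(addr, value);                      /* L2 always stays up to date */
}

/* Write-back: update L1 only and set the dirty bit; L2 is updated later,
   when this line is evicted. */
static void write_hit_back(struct l1_line *line, uint32_t value) {
    line->data  = value;
    line->dirty = 1;
}

static void evict(struct l1_line *line, uint32_t addr) {
    if (line->dirty)                            /* only dirty (write-back) lines */
        l2_write(addr, line->data);             /* need to be written back       */
    line->valid = line->dirty = 0;
}

int main(void) {
    struct l1_line line = {0};
    write_hit_back(&line, 0xCAFE);              /* dirty in L1, L2 is stale      */
    evict(&line, 0x100);                        /* now L2 is updated             */
    write_hit_through(&line, 0x100, 0xBEEF);    /* L2 updated immediately        */
    return 0;
}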
10/7/2017 CS61 Computer Architecture 29
Cache Writing - IV

10/7/2017 CS61 Computer Architecture 30


Cache Organization: Summary

10/7/2017 CS61 Computer Architecture 31


Direct-Mapped Cache

A block can be placed in one location only, given by:


(Block address) MOD (Number of blocks in cache)

10/7/2017 CS61 Computer Architecture 32


Direct Mapped Cache - II
A block can be placed in one location only, given by:
(Block address) MOD (Number of blocks in cache)
In this case: (Block address) MOD (8)
[Figure: a cache with 8 block frames, indexed 000-111, and a memory with 32 cacheable blocks. Example: memory block 11101 maps to cache frame (11101) MOD (1000) = 101, i.e., 29 MOD 8 = 5.]

10/7/2017 CS61 Computer Architecture 33


Direct Mapped Cache - III

A memory block is mapped into a unique cache line, depending on the memory address of the respective block.
A memory address is considered to be composed of three fields:
1. the least significant bits (2 in our example) identify the byte within the block [assume four bytes/block];
2. the rest of the address (22 bits in our example) identifies the block in main memory; for the cache logic, this part is interpreted as two fields:
2a. the least significant bits (14 in our example) specify the cache line;
2b. the most significant bits (8 in our example) represent the tag, which is stored in the cache together with the line.
Tags are stored in the cache in order to distinguish among blocks which fit into the same cache line (see the sketch below).
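Using the field widths from this example (2 offset bits, 14 index bits, 8 tag bits, i.e., a 24-bit address), the decomposition can be written as:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 2     /* 4 bytes per block, as in the example  */
#define INDEX_BITS  14    /* 2^14 cache lines, as in the example   */

int main(void) {
    uint32_t addr   = 0xABCDEF;                /* some 24-bit address */
    uint32_t offset =  addr        & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%02X index=0x%04X offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}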

10/7/2017 CS61 Computer Architecture 34


Direct Mapped Cache - IV

Advantages:
simple and cheap;
the tag field is short; only those bits have to be stored
which are not used to address the cache (compare with the
following approaches);
access is very fast.
Disadvantage:
a given block fits into a fixed cache location;
a given cache line will be replaced whenever there is a reference to another memory block which maps to the same line, regardless of what the status of the other cache lines is.
This can produce a low hit ratio, even if only a very small
part of the cache is effectively used.

10/7/2017 CS61 Computer Architecture 35


2-Way Associative Cache
A block can be placed in a restricted set of places, or cache block frames.
A set is a group of block frames in the cache.
A block is first mapped onto the set and then it can be placed anywhere within the set.
The set in this case is chosen by: (Block address) MOD (Number of sets in cache)

10/7/2017 CS61 Computer Architecture 36


Two-Way Set Associative Cache - II

Location 0 can be occupied by data from:


Memory location 0, 2, 4, 6, 8, ... etc.
In general: any memory location whose LSB of the address
is 0
Address<0> => cache index
On a miss, the block will be placed in one of the two
cache lines belonging to that set which corresponds
to the 13 bits field in the memory address.
The replacement algorithm decides which line to use.
A memory block is mapped into any of the lines of a
set.
The set is determined by the memory address, but the line
inside the set can be any one.
10/7/2017 CS61 Computer Architecture 37
Two-Way, Set Associative Cache - III

Several tags (corresponding to all lines in the set) have


to be checked in order to determine if we have a hit or
miss. If we have a hit, the cache logic finally points to the
actual line in the cache.
The number of lines in a set is determined by the
designer:
2 lines/set: two-way set associative mapping;
4 lines/set: four-way set associative mapping
Set associative mapping keeps most of the advantages
of direct mapping:
short tag field
fast access
relatively simple
10/7/2017 CS61 Computer Architecture 38
Two-Way Set Associative Cache - IV

Set associative mapping tries to eliminate the main


shortcoming of direct mapping
a certain flexibility is given concerning the line to be
replaced when a new block is read into the cache.
Cache hardware is more complex for set associative
mapping than for direct mapping.
In practice 2 and 4-way set associative mapping are
used with very good results.
Larger sets do not produce further significant
performance improvement
Interesting thesis topic: Is this true for multicore
architectures?

10/7/2017 CS61 Computer Architecture 39


Fully Associative Cache

A block can be placed anywhere in cache


Lookup hardware for many tags can be large and slow

10/7/2017 CS61 Computer Architecture 40


Cache Replacement Strategy

Replacing a block on a cache miss?


Easy for Direct Mapped
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)
FIFO (First In First Out)
LFU (Least Frequently used)
LRU is the most efficient: relatively simple to implement
and good results.
FIFO is simple to implement.
Random replacement is the simplest to implement and
results are surprisingly good.
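A minimal LRU sketch for a single 4-way set (illustrative only; real hardware typically uses approximations such as pseudo-LRU):

#include <stdint.h>
#include <stdio.h>

#define WAYS 4

struct way { int valid; uint32_t tag; unsigned last_used; };
static struct way set[WAYS];      /* one cache set */
static unsigned now;              /* logical access counter */

/* Returns 1 on a hit, 0 on a miss; on a miss the least recently
   used way (or an invalid one) is chosen as the victim. */
static int access_set(uint32_t tag) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {    /* hit: refresh its age */
            set[w].last_used = ++now;
            return 1;
        }
        if (!set[w].valid || set[w].last_used < set[victim].last_used)
            victim = w;                             /* track LRU / empty way */
    }
    set[victim] = (struct way){ .valid = 1, .tag = tag, .last_used = ++now };
    return 0;
}

int main(void) {
    unsigned refs[] = { 1, 2, 3, 4, 1, 5, 2 };      /* tag reference stream */
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        printf("tag %u: %s\n", refs[i], access_set(refs[i]) ? "hit" : "miss");
    return 0;
}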

10/7/2017 CS61 Computer Architecture 41


Comparison by Size

Associativity: 2-way 4-way 8-way


Size LRU Random LRU Random LRU Random
16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%
64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

So, what does this tell us?


You don't gain a significant advantage from higher associativity or a more sophisticated replacement policy as the cache size grows.

10/7/2017 CS61 Computer Architecture 42


Cache Performance Revisited

Suppose a processor executes at


Clock Rate = 1000 MHz (1 ns per cycle)
CPI = 1.0
50% arithmetic/logic
30% load/store
20% control

Suppose that 10% of memory operations get 100 cycle miss


penalty
CPI = ideal CPI + average stall cycles per instruction
    = 1.0 (cycle)
      + ( 0.30 (data operations/instruction)
          * 0.10 (misses/data operation) * 100 (cycles/miss) )
    = 1.0 cycle + 3.0 cycles
    = 4.0 cycles
75% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 1.0 cycle to the CPI!
10/7/2017 CS61 Computer Architecture 43
Lower Miss Rate: 16 KB D/I or 32 KB Unified Cache?

Assume a hit takes 1 clock cycle and the miss penalty is 50 cycles.
Assume a load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port.
Assume 75% of memory accesses are instruction references.
Assume miss rates of 0.64% for the 16 KB instruction cache, 6.47% for the 16 KB data cache, and 1.99% for the 32 KB unified cache.
Overall miss rate (split) = (75% x 0.64%) + (25% x 6.47%) = 2.10%

Average memory access time (split)
= 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 0.990 + 1.059
= 2.05 cycles

Average memory access time (unified)
= 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
= 1.496 + 0.749 = 2.24 cycles

10/7/2017 CS61 Computer Architecture 44


Cache Addressing

Access caches by virtual address or physical address

10/7/2017 CS61 Computer Architecture 45


Cache Addressing

The physical or the virtual address can be used for the index, the tag, or both:

Index \ Tag    Physical (P)                          Virtual (V)
Physical       PP: must translate the address        PV: for complex, unusual
               first                                  systems
Virtual        VP: can operate concurrently with     VV: very fast
               the MMU; the MMU checks the tag

10/7/2017 CS61 Computer Architecture 46


Summary: Cache Issues

How many caches?


Write-Through or Write Back?
Direct-Mapped or Set Associative?
How to determine on a read whether we have a miss or a hit?
If there is a miss and there is no room for a new block in the cache, which information should be replaced?
How to preserve consistency between cache and main memory on a write?
Replacement Strategy?

10/7/2017 CS61 Computer Architecture 47


Cache Optimizations

Henk Corperaal, www.ics.ele.tue.nl/~heco/courses/aca


Good topics for a term paper!

Reducing hit time


Small and simple caches
Way prediction
Trace caches
Increasing cache bandwidth
Pipelined caches
Multibanked caches
Nonblocking caches
Reducing Miss Penalty
Critical word first
Merging write buffers
Reducing Miss Rate
Compiler optimizations
Reducing miss penalty or miss rate via parallelism
Hardware prefetching
Compiler prefetching

10/7/2017 CS61 Computer Architecture 48


Cache Optimizations - I

1. Fast Hit via Small and Simple Caches


Indexing the tag memory and then comparing takes time.
A small cache is faster.
Also, an L2 cache small enough to fit on the chip with the processor avoids the time penalty of going off chip.
Simple: direct mapping.
The tag check can be overlapped with data transmission, since there is no choice of block.

10/7/2017 CS61 Computer Architecture 49


Cache Optimizations - II

2. Fast Hit via Way Prediction


Make set-associative caches faster.
Keep extra bits in the cache to predict the way (block within the set) of the next cache access.
The multiplexor is set early to select the desired block, and only 1 tag comparison is performed.
On a miss, check the other blocks for matches in the next clock cycle.
Accuracy ~85%.
Drawback: the CPU pipeline is harder to design if a hit can take either 1 or 2 cycles.

10/7/2017 CS61 Computer Architecture 50


Cache Optimizations - III

3. Fast Hit via Trace Cache


Key Idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.
A single fetch brings in multiple basic blocks.
The trace cache is indexed by the start address and the next n branch predictions.
+ better utilizes long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
10/7/2017 CS61 Computer Architecture 51
Cache Optimizations - IV

4. Increase Cache Bandwidth by Pipelining


Pipeline cache access to maintain bandwidth, but higher
latency
Nr. of Instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4
greater penalty on mispredicted branches
more clock cycles between the issue of the load and the
use of the data

10/7/2017 CS61 Computer Architecture 52


Cache Optimizations - V

5. Increasing Cache Bandwidth:


Non-blocking cache or lockup-free cache
allow data cache to continue to supply cache hits during a miss
requires out-of-order execution CPU
hit under miss reduces the effective miss penalty by
continuing during miss
hit under multiple miss or miss under miss may further
lower the effective miss penalty by overlapping multiple
misses
Requires that memory system can service multiple misses
Significantly increases the complexity of the cache controller as
there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise cannot support)
Pentium Pro allows 4 outstanding memory misses

10/7/2017 CS61 Computer Architecture 53


Cache Optimizations - VI

6. Increase Cache Bandwidth via Multiple Banks


Divide cache into independent banks that can support
simultaneous accesses
E.g., T1 (Niagara) L2 has 4 banks
Banking works best when accesses naturally spread
themselves across banks mapping of addresses
to banks affects behavior of memory system
Simple mapping that works well is sequential
interleaving
Spread block addresses sequentially across banks
E.g., with 4 banks, bank 0 has all blocks with address % 4 = 0, bank 1 has all blocks with address % 4 = 1, and so on.

10/7/2017 CS61 Computer Architecture 54


Cache Optimizations - VII

7. Early Restart/Critical Word First to reduce miss


penalty

Don't wait for the full block to be loaded before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while the rest of the words in the block are filled in.
Generally useful only when blocks are large.

10/7/2017 CS61 Computer Architecture 55


Cache Optimizations - VIII

8. Merging Write Buffer to Reduce Miss Penalty


Write buffer to allow processor to continue while waiting to write to memory
E.g., four writes are merged into one buffer entry rather than putting them in
separate buffers
Less frequent write backs

10/7/2017 CS61 Computer Architecture 56


Cache Optimizations - IX

9. Reducing Misses By Compiler Optimizations


Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using developed tools)
Data
Merging Arrays: improve spatial locality by single array of compound
elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in order
stored in memory
Loop Fusion: combine 2 independent loops that have same looping
and some variables overlap
Blocking: Improve temporal locality by accessing blocks of data
repeatedly vs. going down whole columns or rows

10/7/2017 CS61 Computer Architecture 57


Example: Merging Arrays
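The original figure is not preserved; the standard merging-arrays transformation (in the style of the Hennessy & Patterson example, with an assumed array size) looks like this:

#define SIZE 10000

/* Before: two separate arrays; key[i] and val[i] for the same i
   may land in different cache blocks. */
int val[SIZE];
int key[SIZE];

/* After: one array of structs; key and val for the same i share a
   cache block, improving spatial locality when both are accessed. */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];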

10/7/2017 CS61 Computer Architecture 58


Example: Loop Interchange
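The original figure is not preserved; a representative loop-interchange example (array dimensions assumed), where swapping the loop nest makes the inner loop walk memory with stride 1:

#define ROWS 5000
#define COLS 100
int x[ROWS][COLS];

void before(void) {
    /* Inner loop strides through memory by COLS*sizeof(int) bytes:
       poor spatial locality. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

void after(void) {
    /* Inner loop touches consecutive words: every fetched cache
       block is fully used before it is evicted. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}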

10/7/2017 CS61 Computer Architecture 59


Example: Loop Fusion
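The original figure is not preserved; a representative loop-fusion example (array sizes assumed), combining two loops with the same iteration space so that data is reused while still in the cache:

#define N 1000
double a[N][N], b[N][N], c[N][N], d[N][N];

void before(void) {
    /* Two passes over the same data: by the time the second loop
       runs, a[i][j] and c[i][j] have likely been evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

void after(void) {
    /* Fused: a[i][j] and c[i][j] are reused while still in cache. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}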

10/7/2017 CS61 Computer Architecture 60


Blocking Applied to Array Multiplication
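The original figure is not preserved; a sketch of blocked matrix multiplication with assumed sizes and blocking factor B, operating on B x B submatrices so the working set of y and z stays cache-resident:

#define N 512
#define B 32          /* blocking factor (assumed) */
double x[N][N], y[N][N], z[N][N];   /* x starts zero-initialized (static storage) */

/* Blocked x = y * z: the jj/kk loops pick a B x B tile; the inner loops
   reuse that tile many times before moving on, improving temporal locality. */
void blocked_multiply(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;       /* accumulate partial products */
                }
}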

10/7/2017 CS61 Computer Architecture 61


Blocking Applied to Array Multiplication

10/7/2017 CS61 Computer Architecture 62


Blocking Applied to Array Multiplication

Conflict misses in caches vs. blocking size:
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite both fitting in the cache.

10/7/2017 CS61 Computer Architecture 63


Cache Optimizations - X

10. Reducing Cache Misses by Hardware Prefetching


Use extra memory bandwidth (if available)
Instruction Prefetching
Typically, CPU fetches 2 blocks on a miss: the requested block
and the next consecutive block.
Requested block is placed in instruction cache when it returns,
and prefetched block is placed into instruction stream buffer
Data Prefetching
Pentium 4 can prefetch data into L2 cache from up to 8 streams
from 8 different 4 KB pages
Prefetching invoked if 2 successive L2 cache misses to a page,
if distance between those cache blocks is < 256 bytes

10/7/2017 CS61 Computer Architecture 64


Cache Optimizations - XI

11. Reducing Cache Misses by Software-Controlled Prefetching
Data Prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults;
a form of speculative execution
Issuing Prefetch Instructions takes time
Is cost of prefetch issues < savings in reduced misses?
Wider superscalar reduces difficulty of issue bandwidth
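As one concrete illustration (using GCC/Clang's __builtin_prefetch intrinsic rather than the specific ISA instructions named above; the distance AHEAD is an assumption to be tuned per machine):

#define AHEAD 16   /* how far ahead to prefetch (assumed) */

/* Prefetch a[i + AHEAD] while working on a[i], so that by the time
   the loop reaches it, the block is (hopefully) already in cache.
   The prefetch cannot fault and only hints the memory system. */
long sum_with_prefetch(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}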

10/7/2017 CS61 Computer Architecture 65


Cache Optimizations - Summary

10/7/2017 CS61 Computer Architecture 66
