Ch01 Part3 Caches
These lecture notes are partly based on the course text, Hennessy and
Patterson's Computer Architecture: A Quantitative Approach (6th ed.), and on
the lecture slides of David Patterson's Berkeley course (CS252).
https://en.wikichip.org/wiki/File:skylake_(quad-core)_(annotated).png
Intel Skylake quad-core die photo
We finished the last lecture by asking: how fast can a pipelined processor go?
A simple 5-stage pipeline can run at 5-9 GHz, limited by the critical path
through the slowest pipeline stage's logic.
Tradeoff: do more per cycle, or increase the clock rate?
Or do more per cycle, in parallel…
At 3 GHz, the clock period is about 330 picoseconds: the time light takes to
travel about four inches, and only about 10 gate delays.
For example, the Cell BE is designed for 11 FO4 ("fan-out-of-4") gate delays
per cycle: www.fe.infn.it/~belletti/articles/ISSCC2005-cell.pdf
Pipeline latches etc. account for 3-5 FO4 delays, leaving only 5-8 for actual work.
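To sanity-check these figures, here is a small sketch (the 33 ps-per-FO4 figure is an assumed round number chosen to match the ~10-gate budget above, not a value from the slides):

```c
/* Clock period in picoseconds for a frequency given in Hz. */
double period_ps(double freq_hz) { return 1e12 / freq_hz; }

/* Distance light travels, in cm, during ps picoseconds (c ~ 3e8 m/s). */
double light_cm(double ps) { return 3e8 * ps * 1e-12 * 100.0; }

/* Gate delays per cycle, assuming fo4_ps picoseconds per FO4 delay. */
double fo4_per_cycle(double ps, double fo4_ps) { return ps / fo4_ps; }
```

At 3 GHz, period_ps(3e9) is ~333 ps; light_cm of that is ~10 cm (about four inches); and at an assumed 33 ps per FO4 delay there are ~10 gate delays per cycle, matching the numbers above.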
[Figure: the processor-memory performance gap, 1980-2000 (log scale).
µProc performance ("Moore's Law") grows ~60%/yr (2X/1.5 yr), while DRAM
performance grows only ~9%/yr (2X/10 yrs), so the processor-memory
performance gap grows ~50% per year.]
In 1980 a large RAM's access time was close to the CPU cycle time, so 1980s
machines could treat memory as essentially one cycle away; the two curves have
diverged ever since.
Levels of the Memory Hierarchy

Level (upper -> lower)  Capacity         Access time            Cost                     Managed by            Transfer unit
CPU registers           100s of bytes    <1 ns                  -                        programmer/compiler   instructions and operands (1-16 bytes)
Cache (L1-L3)           10s-1000s KB     1-10 ns                ~$10/MByte               cache controller      blocks (8-128 bytes)
Main memory             GBytes           100-300 ns             $0.01/MByte              operating system      pages (4K-8K bytes)
Disk                    100s of GBytes   10 ms (10,000,000 ns)  $0.00005/MByte ($50/TB)  user/operator         files (MBytes)
Tape                    infinite         sec-min                $0.00005/MByte           -                     -

Moving down the hierarchy, access latency, block size, and capacity all
increase roughly exponentially.
• The Principle of Locality:
– Programs access a relatively small portion of the
address space at any instant of time.
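As an illustration (the temporal/spatial split is standard terminology, not taken from this slide): programs exhibit temporal locality (recently used data is likely to be used again soon) and spatial locality (data near a recent access is likely to be used next). A simple summation loop shows both:

```c
/* Sums an array. The accesses a[0], a[1], ... touch consecutive
   addresses (spatial locality), while sum and i are re-used on
   every iteration (temporal locality). */
long sum_array(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```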
Direct-mapped cache – read access

1 KB direct-mapped cache, 32 B blocks.
[Figure: the cache data array holds 32 lines of 32 bytes (Byte 0 … Byte 31 in
line 0, up to Byte 992 … Byte 1023 in line 31). On a read, the index selects a
line, the stored tag is compared with the address tag, and on a hit the
selected data bytes are returned.]
[Figure: main memory locations 0-35 shown alongside the 32-line cache data
array.]
Cache location 0 can be occupied by data from main memory location 0, 32, 64,
… etc.; cache location 1 by locations 1, 33, 65, … In general, all memory
addresses whose index bits (address<9:5>) are the same map to the same
location in the cache. Which one should we place in the cache? And how can we
tell which one is currently in the cache?
Associativity conflicts in a direct-mapped cache

Consider a loop that repeatedly reads part of two different arrays:

    int A[256];
    int B[256];
    int r = 0;
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 64; ++j) {
            r += A[j] + B[j];
        }
    }

The loop repeatedly re-reads 64 values from both A and B. The 64 ints from
each array (A+0, A+32, A+32*2, A+32*3, …) span 64x4 = 256 bytes, i.e. 8
32 B cache lines per array. For the accesses to A and B to be mostly cache
hits, we need a cache big enough to hold 2x64 ints, i.e. 512 B.
But array B is located exactly 1024 bytes after array A (A holds 256 4-byte
ints, i.e. 1024 bytes, so B starts immediately after it). In a 1 KB
direct-mapped cache, two addresses 1024 bytes apart have identical index bits,
so A[j] and B[j] map to the same cache line for every j: each access evicts
the line the other just brought in, and the loop misses on (almost) every read
even though 512 B of data would easily fit in the cache.
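The conflict can be made concrete by computing the line each access uses; this is a sketch (the macro and function names are mine) assuming the 1 KB direct-mapped cache with 32 B blocks from the slides, and B placed 1024 bytes after A:

```c
#define CACHE_SIZE 1024                       /* 1 KB direct-mapped cache */
#define BLOCK_SIZE 32                         /* 32-byte blocks           */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)  /* 32 cache lines           */

/* Cache line that byte address addr maps to (address bits <9:5>). */
unsigned cache_line(unsigned addr) {
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}
```

Because 1024 is an exact multiple of the cache size, cache_line(a) == cache_line(a + 1024) for every address a: the block holding A[j] and the block holding B[j] always compete for the same line, so each evicts the other.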
Direct-mapped Cache - structure
• Capacity: C bytes (e.g. 1 KB)
• Block size: B bytes (e.g. 32)
• Byte-select bits: 0..log2(B)-1 (e.g. bits 0..4)
• Number of blocks: C/B (e.g. 32)
• Address size: A bits (e.g. 32)
• Cache index size: I = log2(C/B) bits (e.g. log2(32) = 5)
• Tag size: A - I - log2(B) bits (e.g. 32 - 5 - 5 = 22)
[Figure: direct-mapped cache lookup. The cache index selects one cache block;
its valid bit and stored cache tag are read, the tag is compared with the
address tag to produce Hit, and the cache data for that block is read out in
parallel.]
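The three address fields can be extracted with shifts and masks; a sketch using the example parameters above (C = 1 KB, B = 32, so 5 byte-select bits and 5 index bits; the function names are mine):

```c
#define BLOCK_BITS 5   /* log2(32): byte-select bits       */
#define INDEX_BITS 5   /* log2(1024/32): cache-index bits  */

unsigned byte_select(unsigned addr) { return addr & ((1u << BLOCK_BITS) - 1); }
unsigned cache_index(unsigned addr) { return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
unsigned addr_tag(unsigned addr)    { return addr >> (BLOCK_BITS + INDEX_BITS); }
```

For example, address 0x50 has byte select 16, cache index 2, and tag 0.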
Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operated in parallel (N typically 2 to 4)
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result
[Figure: two-way set-associative lookup. The cache index selects a set; the
two stored tags are compared with the address tag in parallel (producing
Sel0/Sel1), the OR of the two compares produces Hit, and a mux driven by the
select signals picks the hitting way's cache block.]
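A minimal model of the parallel tag compare (the struct and function names are mine, not from the slides): each set holds two {valid, tag} entries, and a hit in either way selects that way:

```c
#define WAYS 2

struct line { int valid; unsigned tag; };
struct set  { struct line way[WAYS]; };

/* Compare the address tag against both ways; return the matching
   way (0 or 1), or -1 on a miss. Hit = OR of the two compares. */
int lookup(const struct set *s, unsigned tag) {
    for (int w = 0; w < WAYS; ++w)
        if (s->way[w].valid && s->way[w].tag == tag)
            return w;
    return -1;
}
```

With two ways per set, the A/B conflict from the earlier example disappears: A[j] and B[j] still index the same set, but can occupy the two ways simultaneously.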
Disadvantage of Set Associative Cache
• N-way Set Associative Cache v. Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if
miss.
Example: Intel Pentium 4 Level-1 cache (pre-Prescott)
Capacity: 8 KB (total amount of data the cache can store)
Block: 64 bytes (so there are 8K/64 = 128 blocks in the cache)
Ways: 4 (addresses with the same index bits can be placed in one of 4 ways)
Sets: 32 (= 128/4, i.e. each RAM array holds 32 blocks)
Index: 5 bits (since 2^5 = 32 and the index must select one of the 32 sets)
Tag: 21 bits (= 32, minus 5 for the index, minus 6 to address a byte within a block)
Access time: 2 cycles (~0.67 ns at 3 GHz; pipelined, dual-ported [load + store])
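The parameter derivation can be checked mechanically (a sketch; ilog2 assumes a power-of-two argument, and the helper names are mine):

```c
/* Integer log2 of a power of two. */
unsigned ilog2(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; ++n; }
    return n;
}

/* P4 L1 parameters: 8 KB capacity, 64 B blocks, 4 ways, 32-bit addresses. */
unsigned p4_blocks(void)   { return 8192 / 64; }                      /* 128 */
unsigned p4_sets(void)     { return p4_blocks() / 4; }                /* 32  */
unsigned p4_idx_bits(void) { return ilog2(p4_sets()); }               /* 5   */
unsigned p4_tag_bits(void) { return 32 - p4_idx_bits() - ilog2(64); } /* 21  */
```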
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper
level?
– Block placement
• Q2: How is a block found if it is in the upper
level?
– Block identification
• Q3: Which block should be replaced on a
miss?
– Block replacement
• Q4: What happens on a write?
– Write strategy
Q1: Where can a block be placed in the upper level?

In a direct-mapped cache of 8 blocks (locations 0-7), memory block 12 can be
placed in only one cache location, determined by its low-order address bits:
(12 mod 8) = 4.

In a two-way set-associative cache with 4 sets, the set is determined by the
block's low-order address bits: (12 mod 4) = 0. Block 12 can then be placed
in either of the two cache locations in set 0.
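The placement rules on this slide reduce to modular arithmetic (a sketch; function names are mine):

```c
/* Direct-mapped: block b has exactly one possible cache location. */
unsigned dm_location(unsigned b, unsigned num_blocks) { return b % num_blocks; }

/* N-way set-associative: block b maps to one set, any way within it.
   (Fully associative is the degenerate case num_sets == 1.) */
unsigned sa_set(unsigned b, unsigned num_sets) { return b % num_sets; }
```

For block 12: dm_location(12, 8) is 4 and sa_set(12, 4) is 0, matching the slide.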
Running long simulations… (helpful student's edstem post)
• Here is some sample output: the first image shows an empty machine (low
RAM/CPU usage), and the one after it a busy machine. To quit, just press q.
[Screenshots not reproduced.]