CS61C Spring 2015 Lecture 16: Caches, Part 3
Instructors:
Krste Asanovic & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/
You Are Here!
• Parallel Requests: assigned to computer, e.g., search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to core, e.g., lookup, ads
– Together these harness parallelism & achieve high performance
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack from Warehouse Scale Computer and Smart Phone down through Computer, Core, Memory (Cache), Input/Output, Instruction Unit(s), Functional Unit(s), Main Memory, and Logic Gates; today's lecture covers Memory (Cache)]
Caches Review
• Direct-Mapped vs. Set-Associative vs. Fully
Associative
• AMAT = Hit Time + Miss Rate * Miss Penalty
• 3 Cs of cache misses: Compulsory, Capacity,
Conflict
• Effect of cache parameters on performance
Primary Cache Parameters
• Block size (aka line size)
– how many bytes of data in each cache entry?
• Associativity
– how many ways in each set?
– Direct-mapped => Associativity = 1
– Set-associative => 1 < Associativity < #Entries
– Fully associative => Associativity = #Entries
• Capacity (bytes) = Total #Entries * Block size
• #Entries = #Sets * Associativity
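These relations are easy to check numerically. A minimal C sketch with hypothetical parameter values (32 KiB capacity, 64-byte lines, 4-way; none of these numbers come from the slides):

    #include <stdio.h>

    int main(void) {
        int capacity   = 32 * 1024;  /* 32 KiB cache (assumed) */
        int block_size = 64;         /* 64-byte lines (assumed) */
        int assoc      = 4;          /* 4-way set-associative (assumed) */

        int entries = capacity / block_size;  /* Capacity = #Entries * Block size */
        int sets    = entries / assoc;        /* #Entries = #Sets * Associativity */

        printf("#Entries = %d, #Sets = %d\n", entries, sets);  /* 512, 128 */
        return 0;
    }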
Other Cache Parameters
• Write Policy
• Replacement policy
Write Policy Choices
• Cache hit:
– write through: writes both cache & memory on every access
• Generally higher memory traffic but simpler pipeline & cache design
– write back: writes cache only; memory is written only when a dirty entry is evicted
• A dirty bit per line reduces write-back traffic
• Must handle 0, 1, or 2 accesses to memory for each load/store
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
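To make the traffic difference concrete, here is a toy direct-mapped cache model in C that only counts memory writes under the two common combinations. The sizes and structure are assumptions for illustration, not a real cache design:

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy direct-mapped cache: 4 sets of 16-byte lines (sizes assumed). */
    #define SETS 4
    #define LINE 16

    typedef struct { bool valid, dirty; unsigned tag; } CacheLine;
    static CacheLine cache[SETS];
    static int mem_writes;   /* how many times memory gets written */

    /* write back + write allocate */
    static void store_wb(unsigned addr) {
        CacheLine *l = &cache[(addr / LINE) % SETS];
        unsigned tag = addr / (LINE * SETS);
        if (!l->valid || l->tag != tag) {            /* miss: allocate line    */
            if (l->valid && l->dirty) mem_writes++;  /* write back dirty line  */
            l->valid = true; l->tag = tag;
        }
        l->dirty = true;   /* memory updated only when this line is evicted */
    }

    /* write through + no write allocate */
    static void store_wt(unsigned addr) {
        /* On a hit the cached copy would be updated; on a miss the line is */
        /* NOT brought in (no write allocate). Memory is written either way. */
        (void)addr;
        mem_writes++;
    }

    int main(void) {
        for (int i = 0; i < 8; i++) store_wb(0x100);   /* 8 stores, same line */
        printf("write-back:    %d memory writes\n", mem_writes);  /* 0 */
        mem_writes = 0;
        for (int i = 0; i < 8; i++) store_wt(0x100);
        printf("write-through: %d memory writes\n", mem_writes);  /* 8 */
        return 0;
    }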
Replacement Policy
In an associative cache, which line from a set should be
evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
– LRU cache state must be updated on every access
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU binary tree often used for 4-8 way (see the sketch below)
• First-In, First-Out (FIFO) a.k.a. Round-Robin
– Used in highly associative caches
• Not-Most-Recently Used (NMRU)
– FIFO with exception for most-recently used line or lines
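A minimal C sketch of the tree pseudo-LRU idea for one 4-way set, using 3 bits. The bit convention and the access trace are assumptions for illustration:

    #include <stdio.h>

    /* Tree pseudo-LRU for one 4-way set: 3 bits. Convention (assumed    */
    /* for this sketch): bit = 0 points the victim at the left/lower     */
    /* half, bit = 1 at the right/upper half.                            */
    typedef struct { int root, left, right; } PLRU;

    static void plru_touch(PLRU *p, int way) {  /* update on every access */
        if (way < 2) {                 /* used ways 0/1: victim -> right  */
            p->root = 1;
            p->left = (way == 0);      /* point away from the used way    */
        } else {                       /* used ways 2/3: victim -> left   */
            p->root = 0;
            p->right = (way == 2);
        }
    }

    static int plru_victim(const PLRU *p) {  /* follow bits to the victim */
        return p->root ? 2 + p->right : p->left;
    }

    int main(void) {
        PLRU p = {0, 0, 0};
        int trace[] = {0, 1, 2, 3, 0};           /* made-up access order */
        for (int i = 0; i < 5; i++) plru_touch(&p, trace[i]);
        /* Prints way 2; true LRU would evict way 1 -- pseudo-LRU only */
        /* approximates LRU, which is why it is cheap for 4-8 ways.    */
        printf("victim = way %d\n", plru_victim(&p));
        return 0;
    }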
Impact of Cache Parameters on Performance
• AMAT = Hit Time + Miss Rate * Miss Penalty
– Note: we assume the cache is always searched first, so hit time must be charged on both hits and misses! (worked example below)
• For misses, characterize by 3Cs
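As a quick worked example of the AMAT formula above (all numbers are assumptions for illustration):

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;   /* cycles, charged on hits AND misses */
        double miss_rate    = 0.05;  /* 5% of accesses miss (assumed)      */
        double miss_penalty = 20.0;  /* cycles to fetch from next level    */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05*20 = 2.00 */
        return 0;
    }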
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath (PC, instruction fetch, decode/register read, ALU, memory, writeback) with a primary instruction cache feeding the fetch stage and a primary data cache in the memory stage; each cache produces a hit? signal, and misses go to the memory controller]
• Stall entire CPU on data cache miss
Increasing Associativity?
• Hit time as associativity increases?
– Increases, with large step from direct-mapped to >=2 ways,
as now need to mux correct way to processor
– Smaller increases in hit time for further increases in
associativity
• Miss rate as associativity increases?
– Goes down due to reduced conflict misses, but most gain is
from 1->2->4-way with limited benefit from higher
associativities
• Miss penalty as associativity increases?
– Unchanged, replacement policy runs in parallel with
fetching missing line from memory
Increasing #Entries?
• Hit time as #entries increases?
– Increases, since reading tags and data from larger
memory structures
• Miss rate as #entries increases?
– Goes down due to reduced capacity and conflict
misses
– Architects' rule of thumb: miss rate drops ~2x for every ~4x increase in capacity (only a gross approximation; see the worked example below)
• Miss penalty as #entries increases?
– Unchanged
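For example, under this rule of thumb a cache with a 4% miss rate at 32 KiB would be expected to miss roughly 2% at 128 KiB and roughly 1% at 512 KiB (illustrative numbers, not measurements).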
Administrivia
• Project 2, Part 2 due 3/22
• No assigned work over spring break
• Next assignment, HW5, due 04/05
• Midterm II is 04/09
– Conflict? Email Sagar
– DSP will receive email about accommodations
soon
How to Reduce Miss Penalty?
• Could there be locality on misses from a
cache?
• Use multiple cache levels!
• With Moore’s Law, more room on die for
bigger L1 caches and for second-level (L2)
cache
• And in some cases even an L3 cache!
• IBM mainframes have ~1GB L4 cache off-chip.
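With a second level, the AMAT formula nests: the L1 miss penalty is itself the AMAT of the L2 cache. A short C sketch with assumed (illustrative) latencies and miss rates:

    #include <stdio.h>

    /* Two-level AMAT: the L2 term is itself an AMAT.                */
    /* All numbers are assumptions, not measurements from the slides. */
    int main(void) {
        double l1_hit  = 1,   l1_miss = 0.05;  /* 1 cycle, 5% miss rate   */
        double l2_hit  = 10,  l2_miss = 0.50;  /* L2 local miss rate      */
        double mem_penalty = 200;              /* cycles to DRAM          */

        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty);
        printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05*(10+0.5*200) = 6.50 */
        return 0;
    }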
Review: Memory Hierarchy
[Figure: levels in the memory hierarchy from the processor (inner) through Level 1, Level 2, Level 3, ... down to Level n (outer); speed decreases with increasing distance from the processor]
IBM z13 Memory Hierarchy
Local vs. Global Miss Rates
• Local miss rate – the fraction of references to
one level of a cache that miss
• Local miss rate of L2$ = L2$ misses / L1$ misses
• Global miss rate – the fraction of references that
miss in all levels of a multilevel cache
• L2$ local miss rate >> L2$ global miss rate
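A quick numeric check with assumed rates (illustrative only):

    #include <stdio.h>

    int main(void) {
        double l1_miss_local = 0.05;  /* 5% of all references miss in L1$ */
        double l2_miss_local = 0.50;  /* 50% of L1$ misses also miss in L2$ */

        /* Global L2 miss rate: fraction of ALL references missing both levels */
        double l2_miss_global = l1_miss_local * l2_miss_local;
        printf("L2 global miss rate = %.1f%%\n", l2_miss_global * 100);  /* 2.5% */
        return 0;
    }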
L1 Cache: 32KB I$, 32KB D$
L2 Cache: 256 KB
L3 Cache: 4 MB
Clickers/Peer Instruction
• Overall, what are L2 and L3 local miss rates?
A: L2 > 50%, L3 > 50%
B: L2 ~ 50%, L3 < 50%
C: L2 ~ 50%, L3 ~ 50%
D: L2 < 50%, L3 < 50%
E: L2 > 50%, L3 ~50%
CPI/Miss Rates/DRAM Access (SpecInt2006)
[Figure: CPI, miss-rate, and DRAM-access plots for SpecInt2006, with panels for data only and for instructions and data]

Cache Design Space
– Cache size
– Block size
– Associativity
– Replacement policy
– Write-through vs. write-back
– Write-allocation
• Optimal choice is a compromise
– Depends on access characteristics
• Workload
• Use (I-cache, D-cache)
– Depends on technology / cost
[Figure: cache design space; a metric varies from Good to Bad as Factor A and Factor B change, with cache size, block size, and associativity as the interacting factors]