Cache Coherence - MESI MOESI
ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti
Spring 2009
University of Wisconsin-Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, and probably others
Cache Coherence
- Coherence States
- Snoopy bus-based Invalidate Protocols
- Invalidate protocol optimizations
- Update Protocols (Dragon/Firefly)
- Directory protocols
- Implementation issues
Readings:
- Firefly
- Archibald
- Sweazey/Smith
- Laudon/Lenoski: Origin 2000
- Opteron
- Gigaplane
- Power5
- Intel 870
[Figure: P0 and P1 each perform Load A from Memory]

- Process migration
  - Can occur even if independent jobs are executing
- I/O
  - Often fixed via O/S cache flushes
02/07
Cache Coherence: Update vs. Invalidation Protocols
- Coherent Shared Memory
  - All processors see the effects of others' writes
- How/when writes are propagated
  - Determined by coherence protocol
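Example (global state of each line, written as valid bits <Cache 0, Cache 1, Cache 2, Memory>):
  Line A: <1,1,0,1>
  Line B: <1,0,0,0>
  Line C: <0,0,0,1>

[Figure: Memory holds line A: V, line B: I, line C: V; Cache 0 holds line A and line B; Cache 1 holds line A; Cache 2 holds no line]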
- Shared (S): <1,X,X,...,X,1> -- local cache has a valid copy, main memory has a valid copy, other caches may or may not.
- Modified (M): <1,0,0,...,0,0> -- local cache has the only valid copy.
- Exclusive (E): <1,0,0,...,0,1> -- local cache has a valid copy, no other caches do, main memory has a valid copy.
- Owned (O): <1,X,X,...,X,X> -- local cache has a valid copy, all other caches and memory may have a valid copy.
  - Only one cache can be in O state.
  - <1,X,1,X,...,0> is included in O, but not included in any of the others.
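The state definitions above can be read as patterns over the global state vector. The following is a minimal sketch, not part of the original slides, that returns every local state whose pattern matches a fully specified vector. The function name `classify`, the `local` parameter, and the treatment of a cleared local bit as Invalid (I, taken from the later MESI slide) are assumptions made for illustration.

```python
def classify(vector, local=0):
    """Return the set of local states consistent with a global state vector.

    vector = (cache_0, ..., cache_{n-1}, memory), where 1 means "holds a valid copy".
    `local` selects which cache is treated as the local one.
    """
    *caches, memory = vector
    if not caches[local]:
        return {"I"}  # local copy not valid
    others_valid = any(v for i, v in enumerate(caches) if i != local)
    states = {"O"}  # Owned: <1,X,...,X,X> matches any vector with a valid local copy
    if memory:
        states.add("S")  # Shared: <1,X,...,X,1>
    if not others_valid:
        # Exclusive: <1,0,...,0,1>; Modified: <1,0,...,0,0>
        states.add("E" if memory else "M")
    return states


# <1,X,1,X,...,0>: another cache is valid and memory is stale, so only O matches.
print(classify((1, 0, 1, 0, 0)))
# Line B from the example, <1,0,0,0>: matches M (and the catch-all O pattern).
print(classify((1, 0, 0, 0)))
```

The patterns overlap by design, which is why the function returns a set: a protocol picks one state per line, and the vector alone does not say which.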
Example

              line A   line B   line C
  Memory      V        I        V
  Cache 0     S        M        I
  Cache 1     S        I        I
  Cache 2     I        I        I
Bus-Based Protocols
- Protocol consists of states and actions (state transitions)
- Actions can be invoked from processor or bus

[Figure: the Cache Controller sits between the Bus (Bus Actions) and the Processor (Processor Actions) and manages the State, Tags, and Cache Data arrays]
Minimal Coherence Protocol FSM
[Source: Patterson/Hennessy, Comp. Org. & Design]
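MSI Protocol: Action and Next State

  Current State   Event             Action -> Next State
  I               Processor Read    Cache Read; Acquire Copy -> S
  I               Processor Write   Cache Read&M; Acquire Copy -> M
  I               Cache Read        No Action -> I
  I               Cache Read&M      No Action -> I
  I               Cache Upgrade     No Action -> I
  S               Processor Read    No Action -> S
  S               Processor Write   Cache Upgrade -> M
  S               Eviction          No Action -> I
  S               Cache Read        No Action -> S
  S               Cache Read&M      Invalidate Frame -> I
  S               Cache Upgrade     Invalidate Frame -> I
  M               Processor Read    No Action -> M
  M               Processor Write   No Action -> M
  M               Eviction          Cache Write back -> I
  M               Cache Read        Memory inhibit; Supply data -> S
  M               Cache Read&M      Invalidate Frame; Memory inhibit; Supply data -> I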
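The MSI transitions can be exercised with a small simulation. This is a minimal sketch under stated assumptions, not the course's reference model: it tracks one memory line across three caches, uses the abbreviations CR (Cache Read), CRM (Cache Read&M), and CU (Cache Upgrade), and assumes memory picks up the data when a Modified cache supplies it on a Cache Read. The names `PROCESSOR`, `SNOOP`, and `run_msi` are hypothetical.

```python
# Processor-side table: (state, event) -> (bus action or None, next state).
PROCESSOR = {
    ("I", "read"): ("CR", "S"),    # Cache Read, acquire copy
    ("I", "write"): ("CRM", "M"),  # Cache Read&M, acquire copy
    ("S", "read"): (None, "S"),
    ("S", "write"): ("CU", "M"),   # Cache Upgrade
    ("M", "read"): (None, "M"),
    ("M", "write"): (None, "M"),
}
# Snoop-side table: (state, bus action) -> (supplies data?, next state).
SNOOP = {
    ("I", "CR"): (False, "I"), ("I", "CRM"): (False, "I"), ("I", "CU"): (False, "I"),
    ("S", "CR"): (False, "S"), ("S", "CRM"): (False, "I"), ("S", "CU"): (False, "I"),
    ("M", "CR"): (True, "S"), ("M", "CRM"): (True, "I"),
}


def run_msi(events, n_caches=3):
    """Replay (cache_id, 'read' | 'write') events; return (bus, source, vector) rows."""
    states = ["I"] * n_caches
    memory_valid = True
    rows = []
    for cpu, event in events:
        bus, next_state = PROCESSOR[(states[cpu], event)]
        source = None
        if bus is not None:
            for other in range(n_caches):
                if other == cpu:
                    continue
                supplies, other_next = SNOOP[(states[other], bus)]
                if supplies:
                    source = f"C{other}"
                    if bus == "CR":
                        memory_valid = True  # assumption: memory captures the supplied data
                states[other] = other_next
            if source is None and bus != "CU":
                source = "Memory"  # no cache supplied the line
        states[cpu] = next_state
        if next_state == "M":
            memory_valid = False  # the writer now holds the only valid copy
        vector = tuple(int(s != "I") for s in states) + (int(memory_valid),)
        rows.append((bus, source, vector))
    return rows


# T0 read, T0 write, T2 read, T1 write
for row in run_msi([(0, "read"), (0, "write"), (2, "read"), (1, "write")]):
    print(row)
```

Each printed row gives the bus action, the data source, and the global state vector <C0, C1, C2, Memory> after the event.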
MSI Example

  Thread Event    Bus Action   Data From   Global State   Local States (C0 C1 C2)
  0. Initially:   -            -           <0,0,0,1>      I  I  I
  1. T0 read      CR           Memory      <1,0,0,1>      S  I  I
  2. T0 write     CU           -           <1,0,0,0>      M  I  I
  3. T2 read      CR           C0          <1,0,1,1>      S  I  S
  4. T1 write     CRM          Memory      <0,1,0,0>      I  M  I
If line is in no cache
State transitions:
MESI Protocol
- Variation used in Pentium Pro/2/3 (cache to cache transfer)
- 4-State Protocol
  - Modified: <1,0,0,...,0>
  - Exclusive: <1,0,0,...,1>
  - Shared: <1,X,X,...,1>
  - Invalid: <0,X,X,...,X>
- Bus/Processor Actions
  - Same as MSI
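MESI Protocol: Action and Next State

  Current State   Event             Action -> Next State
  I               Processor Read    Cache Read; if no sharers -> E; if sharers -> S
  I               Processor Write   Cache Read&M -> M
  I               Cache Read        No Action -> I
  I               Cache Read&M      No Action -> I
  I               Cache Upgrade     No Action -> I
  S               Processor Read    No Action -> S
  S               Processor Write   Cache Upgrade -> M
  S               Eviction          No Action -> I
  S               Cache Read        Respond Shared -> S
  S               Cache Read&M      No Action -> I
  S               Cache Upgrade     No Action -> I
  E               Processor Read    No Action -> E
  E               Processor Write   No Action -> M
  E               Eviction          No Action -> I
  E               Cache Read        Respond Shared -> S
  E               Cache Read&M      No Action -> I
  M               Processor Read    No Action -> M
  M               Processor Write   No Action -> M
  M               Eviction          Cache Write-back -> I
  M               Cache Read        Respond dirty; Write back data -> S
  M               Cache Read&M      Respond dirty; Write back data -> I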
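The practical gain of the E state is that a read followed by a write to an unshared line needs only one bus transaction. The sketch below illustrates this with a deliberately small MESI model of one line in three caches. It is an illustration under assumptions, not the slides' implementation: `run_mesi` is a hypothetical name, and a dirty write-back is modeled simply by marking memory valid.

```python
def run_mesi(events, n_caches=3):
    """Replay (cache_id, 'read' | 'write') events; return (bus, states, vector) rows."""
    states = ["I"] * n_caches
    memory_valid = True
    rows = []
    for cpu, event in events:
        me = states[cpu]
        others = [i for i in range(n_caches) if i != cpu]
        # Only misses and writes to Shared lines need the bus.
        bus = {("I", "read"): "CR", ("I", "write"): "CRM", ("S", "write"): "CU"}.get((me, event))
        sharers = any(states[i] != "I" for i in others)
        if bus is not None:
            for i in others:
                if states[i] == "M":
                    memory_valid = True  # "Respond dirty; Write back data"
                if states[i] != "I":
                    states[i] = "S" if bus == "CR" else "I"
        if event == "write":
            states[cpu] = "M"       # E -> M happens silently (bus is None)
            memory_valid = False
        elif me == "I":
            states[cpu] = "S" if sharers else "E"
        vector = tuple(int(s != "I") for s in states) + (int(memory_valid),)
        rows.append((bus, tuple(states), vector))
    return rows


# T0 read, then T0 write: the write needs no bus action because the line is in E.
for row in run_mesi([(0, "read"), (0, "write")]):
    print(row)
```

Under MSI the same two events would cost a Cache Read followed by a Cache Upgrade.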
MESI Example

  Thread Event    Bus Action   Data From   Global State   Local States (C0 C1 C2)
  0. Initially:   -            -           <0,0,0,1>      I  I  I
  1. T0 read      CR           Memory      <1,0,0,1>      E  I  I
  2. T0 write     none         -           <1,0,0,0>      M  I  I
MOESI Optimization
- Observation: shared ownership complicates/delays sourcing data
  - Owner is responsible for sourcing data to any requestor
- Add O (owner) state to protocol: MOSI/MOESI
  - Last requestor becomes the owner
  - Ownership can be on per-node basis in hierarchically structured system
  - Avoid writeback (to memory) of dirty data
  - Also called shared-dirty state, since memory is stale
MOESI Protocol
- Modified: <1,0,0,...,0>
- Exclusive: <1,0,0,...,1>
- Shared: <1,X,X,...,X>
- Invalid: <0,X,X,...,X>
- Owned: <1,X,X,...,X,X>; only one owner
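MOESI Protocol: Action and Next State

  Current State   Event             Action -> Next State
  I               Processor Read    Cache Read; if no sharers -> E; if sharers -> S
  I               Processor Write   Cache Read&M -> M
  I               Cache Read        No Action -> I
  I               Cache Read&M      No Action -> I
  I               Cache Upgrade     No Action -> I
  S               Processor Read    No Action -> S
  S               Processor Write   Cache Upgrade -> M
  S               Eviction          No Action -> I
  S               Cache Read        Respond shared -> S
  S               Cache Read&M      No Action -> I
  S               Cache Upgrade     No Action -> I
  E               Processor Read    No Action -> E
  E               Processor Write   No Action -> M
  E               Eviction          No Action -> I
  E               Cache Read        Respond shared; Supply data -> S
  E               Cache Read&M      Respond shared; Supply data -> I
  O               Processor Read    No Action -> O
  O               Processor Write   Cache Upgrade -> M
  O               Eviction          Cache Write-back -> I
  O               Cache Read        Respond shared; Supply data -> O
  O               Cache Read&M      Respond shared; Supply data -> I
  M               Processor Read    No Action -> M
  M               Processor Write   No Action -> M
  M               Eviction          Cache Write-back -> I
  M               Cache Read        Respond shared; Supply data -> O
  M               Cache Read&M      Respond shared; Supply data -> I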
MOESI Example

  Thread Event    Bus Action   Data From   Global State   Local States (C0 C1 C2)
  0. Initially:   -            -           <0,0,0,1>      I  I  I
  1. T0 read      CR           Memory      <1,0,0,1>      E  I  I
  2. T0 write     none         -           <1,0,0,0>      M  I  I
  3. T2 read      CR           C0          <1,0,1,0>      O  I  S
  4. T1 write     CRM          C0          <0,1,0,0>      I  M  I
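The effect of the O state is visible in the memory bit of the global state: a Modified line that is read by another cache moves to Owned without being written back, so memory stays stale and the owner keeps supplying data. The following is a minimal sketch of that behavior for one line in three caches. The name `run_moesi` is hypothetical, and invalidating an Owned copy when another cache issues a Cache Upgrade is an assumption the source does not spell out.

```python
def run_moesi(events, n_caches=3):
    """Replay (cache_id, 'read' | 'write') events; return (bus, source, states, vector) rows."""
    states = ["I"] * n_caches
    memory_valid = True
    rows = []
    for cpu, event in events:
        me = states[cpu]
        bus = {("I", "read"): "CR", ("I", "write"): "CRM",
               ("S", "write"): "CU", ("O", "write"): "CU"}.get((me, event))
        source = None
        sharers = False
        if bus is not None:
            for i, s in enumerate(states):
                if i == cpu or s == "I":
                    continue
                sharers = True
                if s in ("M", "O", "E") and bus != "CU":
                    source = f"C{i}"  # "Respond shared; Supply data"
                if bus == "CR":
                    states[i] = "O" if s in ("M", "O") else "S"  # no write-back to memory
                else:
                    states[i] = "I"  # assumption: CU also invalidates an Owned copy
            if source is None and bus != "CU":
                source = "Memory"
        if event == "write":
            states[cpu] = "M"
            memory_valid = False
        elif me == "I":
            states[cpu] = "S" if sharers else "E"
        vector = tuple(int(s != "I") for s in states) + (int(memory_valid),)
        rows.append((bus, source, tuple(states), vector))
    return rows


# T0 read, T0 write, T2 read, T1 write
for row in run_moesi([(0, "read"), (0, "write"), (2, "read"), (1, "write")]):
    print(row)
```

Compared with the MSI replay of the same events, the third step leaves the memory bit at 0 and the fourth step is served by C0 rather than by memory.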
Update Protocols
- Basic idea:
- Simple optimization
- Writes are combined into larger units, updates are delayed until needed
- Effectively the same as invalidate protocol
2005 Mikko Lipasti
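Update Protocol: Action and Next State

  Current State   Event             Action -> Next State
  I               Processor Read    Cache Read; if no sharers -> E; if sharers -> Sc
  I               Processor Write   Cache Read; if no sharers -> M; if sharers: Cache Update -> Sm
  Sc              Processor Read    No Action -> Sc
  Sc              Processor Write   Cache Update; if no sharers -> M; if sharers -> Sm
  Sc              Eviction          No Action -> I
  Sc              Cache Read        Respond shared -> Sc
  Sc              Cache Update      Respond shared; Update copy -> Sc
  E               Processor Read    No Action -> E
  E               Processor Write   No Action -> M
  E               Eviction          No Action -> I
  E               Cache Read        Respond shared; Supply data -> Sc
  Sm              Processor Read    No Action -> Sm
  Sm              Processor Write   Cache Update; if no sharers -> M; if sharers -> Sm
  Sm              Eviction          Cache Write-back -> I
  Sm              Cache Read        Respond shared; Supply data -> Sm
  Sm              Cache Update      Respond shared; Update copy -> Sc
  M               Processor Read    No Action -> M
  M               Processor Write   No Action -> M
  M               Eviction          Cache Write-back -> I
  M               Cache Read        Respond shared; Supply data -> Sm

ECE/CS 757; copyright J. E. Smith, 2007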
Example

  Thread Event    Bus Action   Data From   Global State   Local States (C0 C1 C2)
  0. Initially:   -            -           <0,0,0,1>      I   I   I
  1. T0 read      CR           Memory      <1,0,0,1>      E   I   I
  2. T0 write     none         -           <1,0,0,0>      M   I   I
  3. T2 read      CR           C0          <1,0,1,0>      Sm  I   Sc
  4. T1 write     CR,CU        C0          <1,1,1,0>      Sc  Sm  Sc
  5. T0 read      none (hit)   C0          <1,1,1,0>      Sc  Sm  Sc
Processor
Read
Processor
Write
Eviction
Cache Read
No Action
I
Cache Read&M
Cache Read
If no sharers:
Ec
If sharers:
Sc
Cache Read
If no sharers:
Em
If sharers:
Cache Update
Sm
Sc
No Action
Sc
Cache Read&M
If no sharers:
Ec
If sharers:
Sc
No Action
I
Ec
No Action
Ec
No Action
No Action
I
Respond shared;
Sc
Respond Shared
Sc
Respond Shared;
Sc
No Action
I
Respond Shared
Sc
Em
Sm
No Action
Sm
Cache Read&M
If no sharers:
Ec
If sharers:
Sc
Cache Write-back
I
Respond shared;
Supply data;
Sm
Respond shared;
Supply data;
Sc
Em
No Action
Em
No Action
Em
Cache Write-back
I
Respond shared;
Supply data;
Sm
Respond shared;
Supply data;
Sc
Update vs Invalidate [Weber & Gupta, ASPLOS3]
Consider sharing patterns
- No Sharing
  - Independent threads
  - Coherence due to thread migration
  - Update protocol performs many wasteful updates
- Read-Only
  - No significant coherence issues; most protocols work well
- Migratory Objects
  - Manipulated by one processor at a time
  - Often protected by a lock
  - Usually a write causes only a single invalidation
  - E state useful for Read-modify-Write patterns
  - Update protocol could proliferate copies
- Synchronization Objects
  - Locks
  - Update could reduce spin traffic invalidations
  - Test&Test&Set w/ invalidate protocol would work well
- Update protocol may work well, but writes are relatively rare
- Many Writers/Readers
- Invalidate is dominant
  - CMP may change this assessment: more on-chip bandwidth
Nasty Realities
- State diagram is for (ideal) protocol assuming instantaneous actions
- In reality controller implements more complex diagrams
  - A protocol state transition may be started by controller when bus activity changes local state
  - Example: an upgrade pending (for bus) when an invalidate for same line arrives
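MESI with transient states: Action and Next State

  Current State   Event             Action -> Next State
  I               Processor Read    Request Bus -> IR
  I               Processor Write   Request Bus -> IW
  I               Cache Read        No Action -> I
  I               Cache Read&M      No Action -> I
  I               Cache Upgrade     No Action -> I
  S               Processor Read    No Action -> S
  S               Processor Write   Request Bus -> SW
  S               Cache Read        Respond Shared -> S
  S               Cache Read&M      No Action -> I
  S               Cache Upgrade     No Action -> I
  E               Processor Read    No Action -> E
  E               Processor Write   No Action -> M
  E               Cache Read        Respond Shared -> S
  E               Cache Read&M      No Action -> I
  M               Processor Read    No Action -> M
  M               Processor Write   No Action -> M
  M               Cache Read        Respond dirty; Write back data -> S
  M               Cache Read&M      Respond dirty; Write back data -> I
  IR              Bus Grant         Cache Read -> IRR
  IW              Bus Grant         Cache Read&M -> IWR
  IRR             Bus Response      Load line; if no sharers -> E; if sharers -> S
  IWR             Bus Response      Load line -> M
  SW              Bus Grant         Cache Upgrade -> M
  SW              Cache Read        Respond Shared
  SW              Cache Read&M      No Action -> IW
  SW              Cache Upgrade     No Action -> IW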
Further Optimizations
- Observation: Shared blocks should only be fetched from memory once
  - If I find a shared block on chip, forward the block
  - Problem: multiple shared blocks possible, who forwards? Everyone?
  - Power/bandwidth wasted
- Very old idea (IBM machines have done this for a long time), but recent Intel patent issued anyway [Hum/Goodman]
Further Optimizations
- Observation: migratory data often flies by
Snooping implementation

\[ s_o = \text{CacheMissRate} + \text{BusUpgradeRate} \]
\[ \text{InboundSnoopRate} = s_i = n \times s_o \]
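The snoop-rate relation can be checked with a quick calculation. This is a small sketch assuming the reading s_o = CacheMissRate + BusUpgradeRate and s_i = n × s_o, with n taken to be the number of processors on the bus; the function name and the numbers are hypothetical and chosen only to show the scaling.

```python
import math


def inbound_snoop_rate(n, cache_miss_rate, bus_upgrade_rate):
    """Return s_i = n * s_o, where s_o = cache_miss_rate + bus_upgrade_rate."""
    outbound = cache_miss_rate + bus_upgrade_rate  # s_o: snoops one cache sends out
    return n * outbound                            # s_i: snoops each cache must handle


# Hypothetical rates per reference: 2% misses and 1% upgrades.
for n in (2, 4, 16):
    print(n, inbound_snoop_rate(n, 0.02, 0.01))
```

The inbound rate grows linearly with n even when each processor's own miss and upgrade rates stay fixed, which is the bandwidth pressure that the snoop bandwidth discussion addresses.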
Snoop Bandwidth
Snoop Latency
Directory implementation
Directory shortcomings

- No inherent parallelism
- Indirection adds latency
- Minimum 3 hops, often 4 hops
[Figure: memory modules, each with its own directory, connect through an Interconnection Network to processors, each with its own cache]
Example
- Local cache suffers load miss
- Line in remote cache in M state
  - It is the owner

[Figure: processors with caches and their local, owner, and remote controllers connect through an interconnect to memory controllers, each with a directory and memory banks. Message labels: processor read, cache read, memory read, owner data response, memory data response.]
Processor
Read
Processor
Write
Eviction
Cache
Read
I'
Cache
Read&M
I''
No
Action
S
Cache
Upgrade
S'
No
Action*
I
No
Action
M
No Action
M
Cache
Write-back
I
Memory
Read
Memory
Read&M
Memory
Invalidate
Memory
Upgrade
No Action
I
Invalidate
Frame;
Cache ACK;
I
Owner
Data;
S
Owner
Data;
I
Invalidate
Frame;
Cache ACK;
I
I'
Fill Cache
S
I''
Fill Cache
M
S'
Memory Data
No Action
M
Cache
Read
Cache Read&M
Cache
Upgrade
Memory Data;
Add Requestor to
Sharers;
S
Memory Data;
Add Requestor to
Sharers;
M
Memory Data;
Add Requestor to
Sharers;
S
Memory
Invalidate All
Sharers;
M'
Memory Read
from Owner;
S'
Memory Read&M;
to Owner
M'
Memory
Upgrade
All Sharers;
M''
Cache ACK
No Action
I
Make Sharers
Empty;
U
S'
Memory Data
to Requestor;
Write memory;
Add Requestor to
Sharers;
S
M'
M''
Owner
Data
Memory Data
to Requestor;
M
Another Example
- Local write (miss) to shared line
- Requires invalidations and acks

[Figure: processors with caches and their local and remote controllers connect through an interconnect to memory controllers (one is the Home Memory Controller), each with a directory and memory banks. Message labels: processor write, cache Read&M, memory invalidate, cache ack (from each remote controller), memory data response.]
Example Sequence
- Similar to earlier sequences

  Thread Event    Controller Actions   Data From   Global State   Local States (C0 C1 C2)
  0. Initially:   -                    -           <0,0,0,1>      I  I  I
  1. T0 read      CR,MD                Memory      <1,0,0,1>      S  I  I
  2. T0 write     CU,MU*,MD            -           <1,0,0,0>      M  I  I
  3. T2 read      CR,MR,MD             C0          <1,0,1,1>      S  I  S
  4. T1 write     CRM,MI,CA,MD         Memory      <0,1,0,0>      I  M  I
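The controller actions in this sequence can be reproduced with a toy directory for a single line. This is a rough sketch under several assumptions, not the protocol as specified in the slides: the abbreviations are read as CR (Cache Read), CRM (Cache Read&M), CU (Cache Upgrade), MR (Memory Read), MI (Memory Invalidate), MU (Memory Upgrade), CA (Cache ACK), and MD (Memory Data); OD (owner data) and MRM (Memory Read&M to the owner) are labels added here; and every invalidated or upgraded sharer is assumed to reply with one CA. The class name `DirectoryEntry` is hypothetical.

```python
class DirectoryEntry:
    """Directory state for one line: U (uncached), S (shared), or M (one owner)."""

    def __init__(self):
        self.state = "U"
        self.holders = set()  # ids of caches that hold the line

    def request(self, cpu, event):
        """Handle a 'read' or 'write' from cache `cpu`; return the message names sent."""
        msgs = []
        if event == "read":
            if cpu in self.holders:
                return msgs                  # hit: no messages
            msgs.append("CR")
            if self.state == "M":
                msgs += ["MR", "OD"]         # home reads the owner; owner returns data
            msgs.append("MD")
            self.holders.add(cpu)
            self.state = "S"                 # memory is written, so it is valid again
            return msgs
        if self.state == "M" and self.holders == {cpu}:
            return msgs                      # write hit by the current owner
        others = sorted(self.holders - {cpu})
        if cpu in self.holders:
            msgs.append("CU")
            for _ in others:
                msgs += ["MU", "CA"]         # assumption: each sharer acks the upgrade
        else:
            msgs.append("CRM")
            if self.state == "M":
                msgs += ["MRM", "OD"]        # fetch and invalidate the owner's copy
            else:
                for _ in others:
                    msgs += ["MI", "CA"]     # invalidate each sharer and collect acks
        msgs.append("MD")
        self.holders = {cpu}
        self.state = "M"
        return msgs

    def vector(self, n_caches=3):
        """Global state <C0, ..., Cn-1, Memory>; memory is stale only in state M."""
        return tuple(int(i in self.holders) for i in range(n_caches)) + (int(self.state != "M"),)


line = DirectoryEntry()
for cpu, event in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    print(f"T{cpu} {event}:", line.request(cpu, event), line.vector())
```

Counting the messages per request gives a feel for the indirection cost: the read that finds the line Modified elsewhere takes four messages here, against two for a read served directly by memory.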
[Figure: a) the Local Controller sends a cache read to the Memory Controller, which sends a memory read to the Owner Controller; owner data returns to the Memory Controller, which sends memory data to the Local Controller. b) Same cache read and memory read, but the Owner Controller sends owner data directly to the Local Controller and an owner ack to the Memory Controller.]

Predict sharers
Atomic bus
Protocol Races
Summary
- Coherence States
- Snoopy bus-based Invalidate Protocols
- Invalidate protocol optimizations
- Update Protocols (Dragon/Firefly)
- Directory protocols
- Implementation issues