0% found this document useful (0 votes)
270 views57 pages

Cache Coherence - MESI MOESI

Cache Coherancy Notes

Uploaded by

Mohd Imran
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
270 views57 pages

Cache Coherence - MESI MOESI

Cache Coherancy Notes

Uploaded by

Mohd Imran
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 57

ECE/CS

757: Advanced
Computer Architecture II
Instructor:Mikko H Lipas>

Spring 2009
University of Wisconsin-Madison

Lecture notes based on slides created by John Shen,
Mark Hill, David Wood, Guri Sohi, and Jim Smith,
Natalie Enright Jerger, and probably others

Cache Coherence

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol op>miza>ons
Update Protocols (Dragon/Firey)
Directory protocols
Implementa>on issues

Readings
Readings:
Firey
Archibald
Sweazey/Smith
Laudon/Lenoski: Origin 2000
Opteron
Gigaplane
Power5
Intel 870
3

Cache Coherence Problem


Load A
Store A<= 1

P0

P1

01

Load A
Load A

Memory

2005 Mikko Lipasti

Cache Coherence Problem


Load A
Store A<= 1

P0

P1

10

Load A
Load A

10

Memory

2005 Mikko Lipasti

Possible Causes of Incoherence


Sharing of writeable data
Cause most commonly considered

Process migra>on
Can occur even if independent jobs are execu>ng

I/O
O[en xed via O/S cache ushes

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Cache Coherence

Informally, with coherent caches: accesses to


a memory loca>on appear to occur
simultaneously in all copies of the memory
loca>on
copies caches

Cache coherence suggests an absolute >me


scale -- this is not necessary
What is required is the "appearance" of
coherence... not absolute coherence
E.g. temporary incoherence between memory and
a write-back cache
may be OK.
02/07
ECE/CS 757; copyright J. E. Smith, 2007
7

Update vs.
Invalida>on
Protocols
Coherent Shared
Memory
All processors see
the eects of
others writes

How/when writes
are propagated
Determine by
coherence
protocol

2005 Mikko Lipasti

Global Coherence States


A memory line can be present (valid) in any of the caches and/or
memory
Represent global state with an N+1 element vector
First N components => cache states (valid/invalid)
N+1st component => memory state (valid/invalid)

Example:
Line A: <1,1,0,1>
Line B: <1,0,0,0>
Line C: <0,0,0,1>

Cache 0

line A: V
line B: I
line C: V

Cache 1
line A
line B

02/07

Memory

Cache 2

line A

ECE/CS 757; copyright J. E. Smith, 2007

Local Coherence States


Individual caches can maintain a summary of the state of memory
lines, from a local perspec>ve
Reduces storage for maintaining state
May have only par>al informa>on
Invalid (I): <0,X,X,X....X> -- local cache does not have a valid copy; (cache
miss)
Dont confuse invalid state with empty frame

Shared (S): <1,X,X,X,,1> -- local cache has a valid copy, main memory has
a valid copy, other caches ??
Modied(M): <1,0,0,..0,0> -- local cache has only valid copy.
Exclusive(E): <1,0,0,..0,1> -- local cache has a valid copy, no other caches
do, main memory has a valid copy.
Owned(O): <1,X,X,X,.X> -- local cache has a valid copy, all other caches
and memory may have a valid copy.
Only one cache can be in O state
<1,X,1,X, 0> is included in O, but not included in any of the others.
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Example
Memory
line A: V
line B: I
line C: V

Cache 0
line A
line B

Cache 1

Cache 2

line A

Memory
line A: V
line B: I
line C: V

Cache 0
line A: S
line B: M
line C: I

02/07

Cache 1
line A: S
line B: I
line C: I

Cache 2
line A: I
line B: I
line C: I

ECE/CS 757; copyright J. E. Smith, 2007

11

Snoopy Cache Coherence


All requests broadcast on bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-
only at several
Single-writer protocol

Snoops that hit dirty lines?


Flush modied data out of cache
Either write back to memory, then sa>sfy remote miss
from memory, or
Provide dirty data directly to requestor
Big problem in MP systems
Dirty/coherence/sharing misses
2005 Mikko Lipasti

12

Bus-Based Protocols
Protocol consists of
states and ac>ons
(state transi>ons)
Ac>ons can be
invoked from
processor or bus

Bus
Bus Actions

Cache
Controller

Processor Actions

Processor

02/07

ECE/CS 757; copyright J. E. Smith, 2007

State

Tags

Cache Data

Minimal Coherence
Protocol FSM
[Source: Pamerson/Hennessy, Comp. Org. &
Design]

2005 Mikko Lipasti

14

MSI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache Read
Acquire
Copy
S

Cache Read&M
Acquire Copy
M

No Action
S

Cache Upgrade
M

No Action
M

No Action
M

Cache
Read

Cache
Read&M

Cache
Upgrade

No Action
I

No Action
I

No Action
I

No Action
I

No Action
S

Invalidate
Frame
I

Invalidate
Frame
I

Cache
Write
back
I

Memory
inhibit;
Supply
data;
S

Invalidate
Frame;
Memory
inhibit;
Supply data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MSI Example
Thread Event

Bus Action

Data From

0. Initially:
Memory

Global State

Local States:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

CR

2. T0 write

CU

3. T2 read

CR

C0

<1,0,1,1>

4. T1 write

CRM

Memory

<0,1,0,0>

If line is in no cache

Read, modify, Write requires 2 bus transac>ons


Op>miza>on: add Exclusive state

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Invalidate Protocol Op>miza>ons


Observa>on: data can be read shared
Add S (shared) state to protocol: MSI

State transi>ons:

Local read: I->S, fetch shared


Local write: I->M, fetch modied; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy

Observa>on: data can be write-private (e.g. stack frame)


Avoid invalidate messages in that case
Add E (exclusive) state to protocol: MESI

State transi>ons:

Local read: I->E if only copy, I->S if other copies exist


Local write: E->M silently, S->M, invalidate other copies
2005 Mikko Lipasti

17

MESI Protocol
Varia>on used in Pen>um Pro/2/3
(cache to cache transfer)
4-State Protocol
Modied: <1,0,00>
Exclusive: <1,0,0,,1>
Shared: <1,X,X,,1>
Invalid: <0,X,X,X>

Bus/Processor Ac>ons
Same as MSI

Adds shared signal to indicate if other caches have a copy

02/07

ECE/CS 757; copyright J. E. Smith, 2007

18

MESI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache
Read
If no
sharers:
E
If sharers:
S

Cache Read&M
M

No Action
S

Cache Upgrade
M

No Action
E

No Action
M

No Action
M
No Action
M

Cache
Read

Cache
Read&M

Cache
Upgrade

No Action
I

No Action
I

No Action
I

No Action
I

Respond
Shared:
S

No Action
I

No Action
I

No Action
I

Respond
Shared;
S

No Action
I

Cache
Write-back
I

Respond
dirty;
Write back
data;
S

Respond
dirty;
Write back
data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MESI Example

Op>miza>on cache-to-cache transfer for shared data (but only one


sharer can respond)
If modied in some cache and another reads then writeback to memory
(w/ snarf)
Op>miza>on: add Owned state (and perform cache-to-cache transfer)
Thread Event

Bus
Action

Data From

0. Initially:
1. T0 read

CR

2. T0 write

none

02/07

Memory

Global State

Local States:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

ECE/CS 757; copyright J. E. Smith, 2007

MOESI Op>miza>on
Observa>on: shared ownership complicates/delays sourcing
data
Owner is responsible for sourcing data to any requestor
Add O (owner) state to protocol: MOSI/MOESI
Last requestor becomes the owner
Ownership can be on per-node basis in hierarchically structured
system
Avoid writeback (to memory) of dirty data
Also called shared-dirty state, since memory is stale

2005 Mikko Lipasti

21

MOESI Protocol

Used in AMD Opteron


5-State Protocol

Modied: <1,0,00>
Exclusive: <1,0,0,,1>
Shared: <1,X,X,,X>
Invalid: <0,X,X,X>
Owned: <1,X,X,X,X> ; only one owner

Owner can supply data, so memory does not have to


Avoids lengthy memory access

02/07

ECE/CS 757; copyright J. E. Smith, 2007

22

MOESI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache Read

Cache Read&M

Cache
Upgrade

Cache Read
If no sharers:
E
If sharers:
S

Cache Read&M
M

No Action
I

No Action
I

No
Action
I

No Action
S

Cache Upgrade
M

No Action
I

Respond
shared;
S

No Action
I

No
Action
I

No Action
E

No Action
M

No Action
I

Respond
shared;
Supply data;
S

Respond
shared;
Supply data;
I

No Action
O

Cache
Upgrade
M

Cache
Write-back
I

Respond
shared;
Supply data;
O

Respond
shared;
Supply data;
I

No Action
M

No Action
M

Cache
Write-back
I

Respond
shared;
Supply data;
O

Respond
shared;
Supply data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MOESI Example
Thread Event

Bus Action

Data From

0. Initially:

local states
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

CR

2. T0 write

none

3. T2 read

CR

C0

<1,0,1,0>

4. T1 write

CRM

C0

<0,1,0,0>

02/07

Memory

Global State

ECE/CS 757; copyright J. E. Smith, 2007

Update Protocols
Basic idea:

All writes (updates) are made visible to all caches:

(address,value) tuples sent everywhere


Similar to write-through protocol for uniprocessor caches

Obviously not scalable beyond a few processors


No one actually builds machines this way

Simple op>miza>on

Send updates to memory/directory


Directory propagates updates to all known copies: less bandwidth

Further op>miza>ons: combine & delay

Write-combining of adjacent updates (if consistency model allows)


Send write-combined data
Delay sending write-combined data un>l requested

Logical end result

Writes are combined into larger units, updates are delayed un>l needed
Eec>vely the same as invalidate protocol
2005 Mikko Lipasti

25

Update Protocol: Dragon


Dragon (developed at Xerox PARC)
5-State Protocol
Invalid:<0,X,X,X>
Some say no invalid state due to confusion regarding empty frame
versus invalid line state
Exclusive: <1,0,0,,1>
Shared-Clean (Sc): <1,X,X,X> memory may not be up-to-date
Shared-Modied (Sm): <1,X,X,X0> memory not up-to-date; only one
copy in Sm
Modied: <1,0,0,0>

Includes Cache Update ac>on


Includes Cache Writeback ac>on
Bus includes Shared ag
Appears to also require memory inhibit signal
Dis>nguish shared case where cache (not memory) supplies data
02/07

ECE/CS 757; copyright J. E. Smith, 2007

26

Dragon State Diagram


Action and Next State
Current
State

Processor
Read

Processor
Write

Eviction

Cache Read
If no sharers:
E
If sharers:
Sc

Cache Read
If no sharers:
M
If sharers:
Cache Update
Sm

Sc

No Action
Sc

Cache Update
If no sharers:
M
If sharers:
Sm

No Action
I

No Action
E

No Action

No Action
I

Respond shared;
Supply data
Sc
Respond shared;
Supply data;
Sm

02/07

Cache Read

Sm

No Action
Sm

Cache Update
If no sharers:
M
If sharers:
Sm

Cache Write-back
I

No Action
M

No Action
M

Cache Write-back
I

Respond Shared;
Sc

Respond shared;
Supply data;
Sm
ECE/CS 757; copyright J. E. Smith, 2007

Cache Update
I

Respond shared;
Update copy;
Sc

Respond shared;
Update copy;
Sc

Example
Thread Event

Bus Action

Data From

0. Initially:
Memory

Global State

local states
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

CR

2. T0 write

none

3. T2 read

CR

C0

<1,0,1,0>

Sm

Sc

4. T1 write

CR,CU

C0

<1,1,1,0>

Sc

Sm

Sc

5. T0 read

none (hit)

C0

<1,1,1,0>

Sc

Sm

Sc

Appears to require atomic bus cycles CR,CU on write to invalid


line
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Update Protocol: Firey


Develped at DEC by ex-Xerox people
5-State Protocol
Similar to Dragon dierent state naming based on shared/exclusive and
clean/dirty
Invalid:<0,X,X,X>
EC: <1,0,0,,1>
SC: <1,X,X,X> memory may not be up-to-date
EM: <1,0,0,0>
SM: <1,X,X,X0> memory not up-to-date; only one copy in Sm

Performs write-through updates (dierent from Dragon)

02/07

ECE/CS 757; copyright J. E. Smith, 2007

29

Firey State Diagram


Action and Next State
Current
State

Processor
Read

Processor
Write

Eviction

Cache Read

No Action
I

Cache Read&M

Cache Read
If no sharers:
Ec
If sharers:
Sc

Cache Read
If no sharers:
Em
If sharers:
Cache Update
Sm

Sc

No Action
Sc

Cache Read&M
If no sharers:
Ec
If sharers:
Sc

No Action
I

Ec

No Action
Ec

No Action

No Action
I

Respond shared;
Sc

Respond Shared
Sc

Respond Shared;
Sc

No Action
I

Respond Shared
Sc

Em

02/07

Sm

No Action
Sm

Cache Read&M
If no sharers:
Ec
If sharers:
Sc

Cache Write-back
I

Respond shared;
Supply data;
Sm

Respond shared;
Supply data;
Sc

Em

No Action
Em

No Action
Em

Cache Write-back
I

Respond shared;
Supply data;
Sm

Respond shared;
Supply data;
Sc

ECE/CS 757; copyright J. E. Smith, 2007

Update vs Invalidate
[Weber & Gupta, ASPLOS3]
Consider sharing pamerns

No Sharing
Independent threads
Coherence due to thread migra>on
Update protocol performs many wasteful updates

Read-Only
No signicant coherence issues; most protocols work well

Migratory Objects
Manipulated by one processor at a >me
O[en protected by a lock
Usually a write causes only a single invalida>on
E state useful for Read-modify-Write pamerns
Update protocol could proliferate copies
02/07

ECE/CS 757; copyright J. E. Smith, 2007

31

Update vs Invalidate, contd.

Synchroniza>on Objects

Locks
Update could reduce spin trac invalida>ons
Test&Test&Set w/ invalidate protocol would work well

Many Readers, One Writer

Update protocol may work well, but writes are rela>vely rare

Many Writers/Readers

Invalidate probably works bemer


Update will proliferate copies

What is used today?

Invalidate is dominant
CMP may change this assessment
more on-chip bandwidth

02/07

ECE/CS 757; copyright J. E. Smith, 2007

32

Nasty Reali>es
State diagram is for (ideal) protocol assuming
instantaneous and ac>ons
In reality controller implements more complex diagrams
A protocol state transi>on may be started by controller when
bus ac>vity changes local state
Example: an upgrade pending (for bus) when an invalidate for
same line arrives

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Par>al Implementa>on State Table


Action and Next State
Current
State

Processor
Read

Processor
Write

Bus
Grant

Bus
Response

Cache Read

Cache Read&M

Cache
Upgrade

Request Bus
IR

Request
Bus
IW

No Action
I

No Action
I

No Action
I

No Action
S

Request
Bus
SW

Respond Shared:
S

No Action
I

No Action
I

No Action
E

No Action
M

Respond Shared;
S

No Action
I

No Action
M

No Action
M

Respond dirty;
Write back data;
S

Respond dirty;
Write back data;
I

Respond Shared:

No Action
IW

IR

Cache Read
IRR

IW

Cache Read&M
IWR

IRR

If no
sharers:
E
If sharers:
S
Load line

IWR

M
Load line

SW

02/07

Cache Upgrade
ECE/CS
M

757; copyright J. E. Smith,


2007
SW

No Action
IW

Further Op>miza>ons
Observa>on: Shared blocks should only be fetched from
memory once
If I nd a shared block on chip, forward the block
Problem: mul>ple shared blocks possible, who forwards?
Everyone? Power/bandwidth wasted

Single forwarder, but who?

Last one to receive block: F state


I->F for requestor, F->S for forwarder

What if F block is evicted?

Favor F blocks in replacement?


Dont allow silent evic>on (force some other node to be F)
Fall back on memory copy if cant nd F copy

Very old idea (IBM machines have done this for a long >me),
but recent Intel patent issued anyway [Hum/Goodman]

2005 Mikko Lipasti

35

Further Op>miza>ons
Observa>on: migratory data o[en ies by

Add T (transi>on) state to protocol


Tag is s>ll valid, data isnt
Data can be snarfed as it ies by
Only works with certain kinds of interconnect networks
Replacement policy issues

Many other op>miza>ons are possible

Literature extends 25 years back


Many unpublished (but implemented) techniques as well

2005 Mikko Lipasti

36

Implemen>ng Cache Coherence

Snooping implementa>on

Origins in shared-memory-bus systems


All CPUs could observe all other CPUs requests on the bus;
hence snooping

Bus Read, Bus Write, Bus Upgrade

React appropriately to snooped commands

Invalidate shared copies


Provide up-to-date copies of dirty lines
Flush (writeback) to memory, or
Direct interven>on (modied interven2on or dirty miss)

Snooping suers from:

Scalability: shared busses not prac>cal


Ordering of requests without a shared bus
Lots of recent and on-going work on scaling snoop-based
systems
2005 Mikko Lipasti

37

Snooping Cache Coherence


Basic idea: broadcast snoop to all caches to nd owner
Not scalable?
Address trac roughly propor>onal to square of number of
processors
Current implementa>ons scale to 64/128-way (Sun/IBM) with
mul>ple address-interleaved broadcast networks

Inbound snoop bandwidth: big problem


OutboundSnoopRate = s o =

CacheMissRate + BusUpgradeRate

InboundSnoopRate = si = n

2005 Mikko Lipasti

so

38

Snoop Bandwidth
l
l

Snoop ltering of various kinds is possible


Filter snoops at sink: Jemy lter [Moshovos et al., HPCA 2001]

Filter snoops at source: Mul>cast snooping [Bilir et al., ISCA 1999]

Filter snoops at source: Region coherence

Check small lter cache that summarizes contents of local


cache
Avoid power-hungry lookups in each tag array
Predict likely sharing set, snoop only predicted sharers
Double-check at directory to make sure

Concurrent work: [Can>n/Smith/Lipas>, ISCA 2005; Moshovos, ISCA 2005]


Check larger region of memory on every snoop; remember
when no sharers
Snoop only on rst reference to region, or when region is shared
Eliminate 60%+ of all snoops

2005 Mikko Lipasti

39

Snoop Latency
l

Snoop latency:

Must reach all nodes, return and combine responses


Probably the bigger scalability issue in the future
Topology matters: ring, mesh, torus, hypercube
No obvious solutions

Parallelism: fundamental advantage of snooping

Broadcast exposes parallelism, enables speculative latency


reduction

LDir XSnp RDir XRsp CRsp XRd RDat XDat UDat


RDat XDat
UDat
RDat XDat UDat
RDat
XDat UDat
2005 Mikko Lipasti

40

Scaleable Cache Coherence

Eschew physical bus but s>ll snoop

Point-to-point tree structure (indirect) or ring


Root of tree or ring provide ordering point
Use some scalable network for data (ordering less
important)

Or, use level of indirec>on through directory


Directory at memory remembers:
Which processor is single writer
Which processors are shared readers

Level of indirec>on has a price


Dirty misses require 3 hops instead of two
Snoop: Requestor->Owner
Directory: Requestor->Directory->Owner
2005 Mikko Lipasti

41

Implemen>ng Cache Coherence

Directory implementa>on

Extra bits stored in memory (directory) record state of line


Memory controller maintains coherence based on the current state
Other CPUs commands are not snooped, instead:

Directory forwards relevant commands

Powerful ltering eect: only observe commands that you need to


observe
Meanwhile, bandwidth at directory scales by adding memory
controllers as you increase size of the system

Leads to very scalable designs (100s to 1000s of CPUs)

Directory shortcomings

Indirec>on through directory has latency penalty


If shared line is dirty in other CPUs cache, directory must forward
request, adding latency
This can severely impact performance of applica>ons with heavy
sharing (e.g. rela>onal databases)

2005 Mikko Lipasti

42

Directory Protocol Implementa>on

Basic idea: Centralized directory keeps track of data


loca>on(s)
Scalable

Address trac roughly propor>onal to number of processors


Directory & trac can be distributed with memory banks
(interleaved)
Directory cost (SRAM) or latency (DRAM) can be prohibi>ve

Full map (N processors, N bits): cost/scalability


Limited map (limits number of sharers)
Coarse map (iden>es board/node/cluster; must use broadcast)

Point to shared copies


Fixed number, linked lists (SCI), caches chained together
Latency vs. cost vs. scalability

Presence bits track sharers

Vectors track sharers

2005 Mikko Lipasti

43

Directory Protocol Latency

LDir XSnp RDir XRd RDat XDat UDat

Access to non-shared data

Overlap directory read with data read


Best possible latency given NUMA arrangement

Access to shared data

Dirty miss, modied interven>on


Shared interven>on?

If DRAM directory, no gain


If directory cache, possible gain (use F state)

No inherent parallelism
Indirec>on adds latency
Minimum 3 hops, o[en 4 hops

2005 Mikko Lipasti

44

Directory-based Cache Coherence


An alterna>ve for large, scalable
MPs
Can be based on any of the
protocols discussed thus far
We will use MSI

Memory Controller becomes an


ac>ve par>cipant
Sharing info held in memory
directory

Memory
Module

Memory
Module

Directory

Directory

...

Memory
Module
Directory

Interconnection Network

Directory may be distributed

Use point-to-point messages


Network is not totally ordered

02/07

Cache

Cache

Processor

Processor

ECE/CS 757; copyright J. E. Smith, 2007

...

Cache

Processor

Example: Simple Directory Protocol


Local cache controller states
M, S, I as before

Local directory states


Shared: <1,X,X,1>; one or more proc. has copy; memory
is up-to-date
Modied: <0,1,0,.,0> one processor has copy; memory
does not have a valid copy
Uncached: <0,0,0,1> none of the processors has a valid
copy

Directory also keeps track of sharers


Can keep global state vector in full
e.g. via a bit vector
02/07

ECE/CS 757; copyright J. E. Smith, 2007

46

Example
Local cache suers load miss
Line in remote cache in M state
It is the owner

Four messages send over network


Cache read from local controller to
home memory controller
Memory read to remote cache
controller
Owner data back to memory
controller; change state to S
Memory data back to local cache;
change state to S

02/07

Processor
Processor

processor
read

Cache

Local
Controller

Cache

Owner
Controller

memory
data
response

Processor

...

owner
data
response

Remote
Controller

Cache

Interconnect

memory
read

cache
read

Memory
Controller

Memory
Controller

Memory
Controller

Directory

Directory

Directory

Memory
Banks

Memory
Banks

ECE/CS 757; copyright J. E. Smith, 2007

...

Memory
Banks

Cache Controller State Diagram


Intermediate States for clarity
Cache Controller
Actions and Next States
from Processor Side
Current
State

Processor
Read

Processor
Write

from Memory Side

Eviction

Cache
Read
I'

Cache
Read&M
I''

No
Action
S

Cache
Upgrade
S'

No
Action*
I

No
Action
M

No Action
M

Cache
Write-back
I

Memory
Read

Memory
Read&M

Memory
Invalidate

Memory
Upgrade

No Action
I
Invalidate
Frame;
Cache ACK;
I
Owner
Data;
S

Owner
Data;
I

Invalidate
Frame;
Cache ACK;
I

I'

Fill Cache
S

I''

Fill Cache
M

S'

02/07

Memory Data

No Action
M

ECE/CS 757; copyright J. E. Smith, 2007

48

Memory Controller State Diagram


Memory Controller
Actions and Next States

command from Local Cache Controller


Current
Directory
State

Cache
Read

Cache Read&M

Cache
Upgrade

Memory Data;
Add Requestor to
Sharers;
S

Memory Data;
Add Requestor to
Sharers;
M

Memory Data;
Add Requestor to
Sharers;
S

Memory
Invalidate All
Sharers;
M'

Memory Read
from Owner;
S'

Memory Read&M;
to Owner
M'

Memory
Upgrade
All Sharers;
M''

response from Remote Cache Controller


Data
Write-back

Cache ACK

No Action
I

Make Sharers
Empty;
U

S'

Memory Data
to Requestor;
Write memory;
Add Requestor to
Sharers;
S

M'

When all ACKS


Memory Data;
M

M''

When all ACKS


then
M

02/07

Owner
Data

ECE/CS 757; copyright J. E. Smith, 2007

Memory Data
to Requestor;
M

49

Another Example
Local write (miss) to shared line
Requires invalida>ons and acks
Processor
Processor

processor
write

Cache

Local
Controller

cache
Read&M

02/07

Cache

Remote
Controller

memory
data
response
memory
invalidate

Processor

...

Remote
Controller

Cache

cache
ack

cache
ack

Interconnect

Memory
Controller

Home Memory
Controller

Memory
Controller

Directory

Directory

Directory

Memory
Banks

Memory
Banks

ECE/CS 757; copyright J. E. Smith, 2007

...

Memory
Banks

Example Sequence
Similar to earlier sequences
Thread Event

Controller
Actions

Data From

0. Initially:

local states:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

CR,MD

2. T0 write

CU, MU*,MD

3. T2 read

CR,MR,MD

C0

<1,0,1,1>

4. T1 write

CRM,MI,CA,MD

Memory

<0,1,0,0>

02/07

Memory

global state

ECE/CS 757; copyright J. E. Smith, 2007

Varia>on: Three Hop Protocol


Have owner send data directly to local controller
Owner Acks to Memory Controller in parallel

Local
Controller

cache
read

Owner
Controller
memory
data

memory
read

owner
data

owner
data

Local
Controller
cache
read

3
memory
read

Memory
Controller

Memory
Controller

a)
02/07

Owner
Controller

b)
ECE/CS 757; copyright J. E. Smith, 2007

owner
ack

Directory Protocol Op>miza>ons

Remove dead blocks from cache:

Eliminate 3- or 4-hop latency


Dynamic Self-Invalida>on [Lebeck/Wood, ISCA 1995]
Last touch predic>on [Lai/Falsa, ISCA 2000]
Dead block predic>on [Lai/Fide/Falsa, ISCA 2001]

Predic>on in coherence protocols [Mukherjee/Hill, ISCA 1998]


Instruc>on-based predic>on [Kaxiras/Goodman, ISCA 1999]
Sharing predic>on [Lai/Falsa, ISCA 1999]

Improve latency by snooping, conserve bandwidth with directory


Mul>cast snooping [Bilir et al., ISCA 1999; Mar>n et al., ISCA 2003]
Bandwidth-adap>ve hybrid [Mar>n et al., HPCA 2002]
Token Coherence [Mar>n et al., ISCA 2003]
VCT Mul>cas>ng [Enright Jerger thesis 2008]

Predict sharers

Hybrid snooping/directory protocols

2005 Mikko Lipasti

53

Atomic bus

Protocol Races

Only stable states in protocol (e.g. M, S, I)


All state transi>ons are atomic (I->M)
No conic>ng requests can interfere since bus is held >ll transac>on
completes
Dis>nguish coherence transac>on from data transfer
Data transfer can s>ll occur much later; easier to handle this case

Atomic buses dont scale


At minimum, separate bus request/response

Large systems have broadly variable delays


Req/resp separated by dozens of cycles
Conic>ng requests can and do get issued
Messages may get reordered in the interconnect

How do we resolve them?

54

Resolving Protocol Races

Req/resp decoupling introduces transient


states
E.g. I->S is now I->ItoX->ItoS_nodata->S

Conic>ng requests to blocks in transient


states
NAK ugly; livelock, starva>on poten>al
Keep adding more transient states

Directory protocol makes this a bit easier


Can order at directory, which has full state info
Even so, messages may get reordered
55

Common Protocol Races

Read strings: P0 read, P1 read, P2 read

Easy, since read is nondestruc>ve


Can rely on F state to reduce DRAM accesses
Forward reads to previous requestor (F)

Write strings: P0 write, P1 write, P2 write


Forward P1 write req to P0 (M)
P0 completes write then forwards M block to P1
Build string of writes (write string forwarding)

Read a[er write (similar to prev. WAW)


Writeback race: P0 evicts dirty block, P1 reads
Dirty block is in the network (no copy at P0 or at dir)
NAK P1, or force P0 to keep copy >ll dir ACKs WB

Many others crop up, esp. with op>miza>ons


56

Summary

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol op>miza>ons
Update Protocols (Dragon/Firey)
Directory protocols
Implementa>on issues

57

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy