0% found this document useful (0 votes)

270 views57 pages

Cache Coherence - MESI MOESI

Cache Coherancy Notes

Uploaded by

Mohd Imran

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

270 views57 pages

Cache Coherence - MESI MOESI

Cache Coherancy Notes

Uploaded by

Mohd Imran

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 57

ECE/CS

757: Advanced
Computer Architecture II
Instructor:Mikko H Lipas>

Spring 2009
University of Wisconsin-Madison

Lecture notes based on slides created by John Shen,
Mark Hill, David Wood, Guri Sohi, and Jim Smith,
Natalie Enright Jerger, and probably others

Cache Coherence

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol op>miza>ons
Update Protocols (Dragon/Firey)
Directory protocols
Implementa>on issues

Readings
Readings:
Firey
Archibald
Sweazey/Smith
Laudon/Lenoski: Origin 2000
Opteron
Gigaplane
Power5
Intel 870
3

Cache Coherence Problem

Load A
Store A<= 1

Load A
Load A

Memory

2005 Mikko Lipasti

Cache Coherence Problem

Load A
Store A<= 1

Load A
Load A

Memory

2005 Mikko Lipasti

Possible Causes of Incoherence

Sharing of writeable data
Cause most commonly considered

Process migra>on
Can occur even if independent jobs are execu>ng

I/O
O[en xed via O/S cache ushes

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Cache Coherence

Informally, with coherent caches: accesses to

a memory loca>on appear to occur
simultaneously in all copies of the memory
loca>on
copies caches

Cache coherence suggests an absolute >me

scale -- this is not necessary
What is required is the "appearance" of
coherence... not absolute coherence
E.g. temporary incoherence between memory and
a write-back cache
may be OK.
02/07
ECE/CS 757; copyright J. E. Smith, 2007
7

Update vs.
Invalida>on
Protocols
Coherent Shared
Memory
All processors see
the eects of
others writes

How/when writes
are propagated
Determine by
coherence
protocol

2005 Mikko Lipasti

Global Coherence States

A memory line can be present (valid) in any of the caches and/or
memory
Represent global state with an N+1 element vector
First N components => cache states (valid/invalid)
N+1st component => memory state (valid/invalid)

Example:
Line A: <1,1,0,1>
Line B: <1,0,0,0>
Line C: <0,0,0,1>

Cache 0

line A: V
line B: I
line C: V

Cache 1
line A
line B

02/07

Memory

Cache 2

line A

ECE/CS 757; copyright J. E. Smith, 2007

Local Coherence States

Individual caches can maintain a summary of the state of memory
lines, from a local perspec>ve
Reduces storage for maintaining state
May have only par>al informa>on
Invalid (I): <0,X,X,X....X> -- local cache does not have a valid copy; (cache
miss)
Dont confuse invalid state with empty frame

Shared (S): <1,X,X,X,,1> -- local cache has a valid copy, main memory has
a valid copy, other caches ??
Modied(M): <1,0,0,..0,0> -- local cache has only valid copy.
Exclusive(E): <1,0,0,..0,1> -- local cache has a valid copy, no other caches
do, main memory has a valid copy.
Owned(O): <1,X,X,X,.X> -- local cache has a valid copy, all other caches
and memory may have a valid copy.
Only one cache can be in O state
<1,X,1,X, 0> is included in O, but not included in any of the others.
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Example
Memory
line A: V
line B: I
line C: V

Cache 0
line A
line B

Cache 1

Cache 2

line A

Memory
line A: V
line B: I
line C: V

Cache 0
line A: S
line B: M
line C: I

02/07

Cache 1
line A: S
line B: I
line C: I

Cache 2
line A: I
line B: I
line C: I

ECE/CS 757; copyright J. E. Smith, 2007

Snoopy Cache Coherence

All requests broadcast on bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-
only at several
Single-writer protocol

Snoops that hit dirty lines?

Flush modied data out of cache
Either write back to memory, then sa>sfy remote miss
from memory, or
Provide dirty data directly to requestor
Big problem in MP systems
Dirty/coherence/sharing misses
2005 Mikko Lipasti

Bus-Based Protocols
Protocol consists of
states and ac>ons
(state transi>ons)
Ac>ons can be
invoked from
processor or bus

Bus
Bus Actions

Cache
Controller

Processor Actions

Processor

02/07

ECE/CS 757; copyright J. E. Smith, 2007

State

2005 Mikko Lipasti

MSI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache Read
Acquire
Copy
S

Cache Read&M
Acquire Copy
M

No Action
S

Cache Upgrade
M

No Action
M

Cache
Read

Cache
Read&M

Cache
Upgrade

No Action
I

No Action
S

Invalidate
Frame
I

Cache
Write
back
I

Memory
inhibit;
Supply
data;
S

Invalidate
Frame;
Memory
inhibit;
Supply data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MSI Example
Thread Event

Bus Action

Data From

0. Initially:
Memory

Global State

Local States:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

2. T0 write

3. T2 read

<1,0,1,1>

4. T1 write

CRM

Memory

<0,1,0,0>

If line is in no cache

Read, modify, Write requires 2 bus transac>ons

Op>miza>on: add Exclusive state

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Invalidate Protocol Op>miza>ons

Observa>on: data can be read shared
Add S (shared) state to protocol: MSI

State transi>ons:

Local read: I->S, fetch shared

Local write: I->M, fetch modied; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy

Observa>on: data can be write-private (e.g. stack frame)

Avoid invalidate messages in that case
Add E (exclusive) state to protocol: MESI

State transi>ons:

Local read: I->E if only copy, I->S if other copies exist

Local write: E->M silently, S->M, invalidate other copies
2005 Mikko Lipasti

MESI Protocol
Varia>on used in Pen>um Pro/2/3
(cache to cache transfer)
4-State Protocol
Modied: <1,0,00>
Exclusive: <1,0,0,,1>
Shared: <1,X,X,,1>
Invalid: <0,X,X,X>

Bus/Processor Ac>ons
Same as MSI

Adds shared signal to indicate if other caches have a copy

02/07

ECE/CS 757; copyright J. E. Smith, 2007

MESI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache
Read
If no
sharers:
E
If sharers:
S

Cache Read&M
M

No Action
S

Cache Upgrade
M

No Action
E

No Action
M

No Action
M
No Action
M

Cache
Read

Cache
Read&M

Cache
Upgrade

No Action
I

Respond
Shared:
S

No Action
I

Respond
Shared;
S

No Action
I

Cache
Write-back
I

Respond
dirty;
Write back
data;
S

Respond
dirty;
Write back
data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MESI Example

Op>miza>on cache-to-cache transfer for shared data (but only one

sharer can respond)
If modied in some cache and another reads then writeback to memory
(w/ snarf)
Op>miza>on: add Owned state (and perform cache-to-cache transfer)
Thread Event

Bus
Action

Data From

0. Initially:
1. T0 read

2. T0 write

none

02/07

Memory

Global State

Local States:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

ECE/CS 757; copyright J. E. Smith, 2007

MOESI Op>miza>on
Observa>on: shared ownership complicates/delays sourcing
data
Owner is responsible for sourcing data to any requestor
Add O (owner) state to protocol: MOSI/MOESI
Last requestor becomes the owner
Ownership can be on per-node basis in hierarchically structured
system
Avoid writeback (to memory) of dirty data
Also called shared-dirty state, since memory is stale

2005 Mikko Lipasti

MOESI Protocol

Used in AMD Opteron

5-State Protocol

Modied: <1,0,00>
Exclusive: <1,0,0,,1>
Shared: <1,X,X,,X>
Invalid: <0,X,X,X>
Owned: <1,X,X,X,X> ; only one owner

Owner can supply data, so memory does not have to

Avoids lengthy memory access

02/07

ECE/CS 757; copyright J. E. Smith, 2007

MOESI Protocol
Action and Next State
Current
State

02/07

Processor
Read

Processor
Write

Eviction

Cache Read

Cache Read&M

Cache
Upgrade

Cache Read
If no sharers:
E
If sharers:
S

Cache Read&M
M

No Action
I

No
Action
I

No Action
S

Cache Upgrade
M

No Action
I

Respond
shared;
S

No Action
I

No
Action
I

No Action
E

No Action
M

No Action
I

Respond
shared;
Supply data;
S

Respond
shared;
Supply data;
I

No Action
O

Cache
Upgrade
M

Cache
Write-back
I

Respond
shared;
Supply data;
O

Respond
shared;
Supply data;
I

No Action
M

Cache
Write-back
I

Respond
shared;
Supply data;
O

Respond
shared;
Supply data;
I

ECE/CS 757; copyright J. E. Smith, 2007

MOESI Example
Thread Event

Bus Action

Data From

0. Initially:

local states
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

2. T0 write

none

3. T2 read

<1,0,1,0>

4. T1 write

CRM

<0,1,0,0>

02/07

Memory

Global State

ECE/CS 757; copyright J. E. Smith, 2007

Update Protocols
Basic idea:

All writes (updates) are made visible to all caches:

(address,value) tuples sent everywhere

Similar to write-through protocol for uniprocessor caches

Obviously not scalable beyond a few processors

No one actually builds machines this way

Simple op>miza>on

Send updates to memory/directory

Directory propagates updates to all known copies: less bandwidth

Further op>miza>ons: combine & delay

Write-combining of adjacent updates (if consistency model allows)

Send write-combined data
Delay sending write-combined data un>l requested

Logical end result

Writes are combined into larger units, updates are delayed un>l needed
Eec>vely the same as invalidate protocol
2005 Mikko Lipasti

Update Protocol: Dragon

Dragon (developed at Xerox PARC)
5-State Protocol
Invalid:<0,X,X,X>
Some say no invalid state due to confusion regarding empty frame
versus invalid line state
Exclusive: <1,0,0,,1>
Shared-Clean (Sc): <1,X,X,X> memory may not be up-to-date
Shared-Modied (Sm): <1,X,X,X0> memory not up-to-date; only one
copy in Sm
Modied: <1,0,0,0>

Includes Cache Update ac>on

Includes Cache Writeback ac>on
Bus includes Shared ag
Appears to also require memory inhibit signal
Dis>nguish shared case where cache (not memory) supplies data
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Dragon State Diagram

Action and Next State
Current
State

Processor
Read

Processor
Write

Eviction

Cache Read
If no sharers:
E
If sharers:
Sc

Cache Read
If no sharers:
M
If sharers:
Cache Update
Sm

No Action
Sc

Cache Update
If no sharers:
M
If sharers:
Sm

No Action
I

No Action
E

No Action

No Action
I

Respond shared;
Supply data
Sc
Respond shared;
Supply data;
Sm

02/07

Cache Read

No Action
Sm

Cache Update
If no sharers:
M
If sharers:
Sm

Cache Write-back
I

No Action
M

Cache Write-back
I

Respond Shared;
Sc

Respond shared;
Supply data;
Sm
ECE/CS 757; copyright J. E. Smith, 2007

Cache Update
I

Respond shared;
Update copy;
Sc

Example
Thread Event

Bus Action

Data From

0. Initially:
Memory

Global State

local states
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

2. T0 write

none

3. T2 read

<1,0,1,0>

4. T1 write

CR,CU

<1,1,1,0>

5. T0 read

none (hit)

<1,1,1,0>

Appears to require atomic bus cycles CR,CU on write to invalid

line
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Update Protocol: Firey

Develped at DEC by ex-Xerox people
5-State Protocol
Similar to Dragon dierent state naming based on shared/exclusive and
clean/dirty
Invalid:<0,X,X,X>
EC: <1,0,0,,1>
SC: <1,X,X,X> memory may not be up-to-date
EM: <1,0,0,0>
SM: <1,X,X,X0> memory not up-to-date; only one copy in Sm

Performs write-through updates (dierent from Dragon)

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Firey State Diagram

Action and Next State
Current
State

Processor
Read

Processor
Write

Eviction

Cache Read

No Action
I

Cache Read&M

Cache Read
If no sharers:
Ec
If sharers:
Sc

Cache Read
If no sharers:
Em
If sharers:
Cache Update
Sm

No Action
Sc

Cache Read&M
If no sharers:
Ec
If sharers:
Sc

No Action
I

No Action
Ec

No Action

No Action
I

Respond shared;
Sc

Respond Shared
Sc

Respond Shared;
Sc

No Action
I

Respond Shared
Sc

02/07

No Action
Sm

Cache Read&M
If no sharers:
Ec
If sharers:
Sc

Cache Write-back
I

Respond shared;
Supply data;
Sm

Respond shared;
Supply data;
Sc

No Action
Em

Cache Write-back
I

Respond shared;
Supply data;
Sm

Respond shared;
Supply data;
Sc

ECE/CS 757; copyright J. E. Smith, 2007

Update vs Invalidate
[Weber & Gupta, ASPLOS3]
Consider sharing pamerns

No Sharing
Independent threads
Coherence due to thread migra>on
Update protocol performs many wasteful updates

Read-Only
No signicant coherence issues; most protocols work well

Migratory Objects
Manipulated by one processor at a >me
O[en protected by a lock
Usually a write causes only a single invalida>on
E state useful for Read-modify-Write pamerns
Update protocol could proliferate copies
02/07

ECE/CS 757; copyright J. E. Smith, 2007

Update vs Invalidate, contd.

Synchroniza>on Objects

Locks
Update could reduce spin trac invalida>ons
Test&Test&Set w/ invalidate protocol would work well

Many Readers, One Writer

Update protocol may work well, but writes are rela>vely rare

Many Writers/Readers

Invalidate probably works bemer

Update will proliferate copies

What is used today?

Invalidate is dominant
CMP may change this assessment
more on-chip bandwidth

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Nasty Reali>es
State diagram is for (ideal) protocol assuming
instantaneous and ac>ons
In reality controller implements more complex diagrams
A protocol state transi>on may be started by controller when
bus ac>vity changes local state
Example: an upgrade pending (for bus) when an invalidate for
same line arrives

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Par>al Implementa>on State Table

Action and Next State
Current
State

Processor
Read

Processor
Write

Bus
Grant

Bus
Response

Cache Read

Cache Read&M

Cache
Upgrade

Request Bus
IR

Request
Bus
IW

No Action
I

No Action
S

Request
Bus
SW

Respond Shared:
S

No Action
I

No Action
E

No Action
M

Respond Shared;
S

No Action
I

No Action
M

Respond dirty;
Write back data;
S

Respond dirty;
Write back data;
I

Respond Shared:

No Action
IW

Cache Read
IRR

Cache Read&M
IWR

IRR

If no
sharers:
E
If sharers:
S
Load line

IWR

M
Load line

02/07

Cache Upgrade
ECE/CS
M

757; copyright J. E. Smith,

2007
SW

No Action
IW

Further Op>miza>ons
Observa>on: Shared blocks should only be fetched from
memory once
If I nd a shared block on chip, forward the block
Problem: mul>ple shared blocks possible, who forwards?
Everyone? Power/bandwidth wasted

Single forwarder, but who?

Last one to receive block: F state

I->F for requestor, F->S for forwarder

What if F block is evicted?

Favor F blocks in replacement?

Dont allow silent evic>on (force some other node to be F)
Fall back on memory copy if cant nd F copy

Very old idea (IBM machines have done this for a long >me),
but recent Intel patent issued anyway [Hum/Goodman]

2005 Mikko Lipasti

Further Op>miza>ons
Observa>on: migratory data o[en ies by

Add T (transi>on) state to protocol

Tag is s>ll valid, data isnt
Data can be snarfed as it ies by
Only works with certain kinds of interconnect networks
Replacement policy issues

Many other op>miza>ons are possible

Literature extends 25 years back

Many unpublished (but implemented) techniques as well

2005 Mikko Lipasti

Implemen>ng Cache Coherence

Snooping implementa>on

Origins in shared-memory-bus systems

All CPUs could observe all other CPUs requests on the bus;
hence snooping

Bus Read, Bus Write, Bus Upgrade

React appropriately to snooped commands

Invalidate shared copies

Provide up-to-date copies of dirty lines
Flush (writeback) to memory, or
Direct interven>on (modied interven2on or dirty miss)

Snooping suers from:

Scalability: shared busses not prac>cal

Ordering of requests without a shared bus
Lots of recent and on-going work on scaling snoop-based
systems
2005 Mikko Lipasti

Snooping Cache Coherence

Basic idea: broadcast snoop to all caches to nd owner
Not scalable?
Address trac roughly propor>onal to square of number of
processors
Current implementa>ons scale to 64/128-way (Sun/IBM) with
mul>ple address-interleaved broadcast networks

Inbound snoop bandwidth: big problem

OutboundSnoopRate = s o =

CacheMissRate + BusUpgradeRate

InboundSnoopRate = si = n

2005 Mikko Lipasti

Snoop Bandwidth
l
l

Snoop ltering of various kinds is possible

Filter snoops at sink: Jemy lter [Moshovos et al., HPCA 2001]

Filter snoops at source: Mul>cast snooping [Bilir et al., ISCA 1999]

Filter snoops at source: Region coherence

Check small lter cache that summarizes contents of local

cache
Avoid power-hungry lookups in each tag array
Predict likely sharing set, snoop only predicted sharers
Double-check at directory to make sure

Concurrent work: [Can>n/Smith/Lipas>, ISCA 2005; Moshovos, ISCA 2005]

Check larger region of memory on every snoop; remember
when no sharers
Snoop only on rst reference to region, or when region is shared
Eliminate 60%+ of all snoops

2005 Mikko Lipasti

Snoop Latency
l

Snoop latency:

Must reach all nodes, return and combine responses

Probably the bigger scalability issue in the future
Topology matters: ring, mesh, torus, hypercube
No obvious solutions

Parallelism: fundamental advantage of snooping

Broadcast exposes parallelism, enables speculative latency

reduction

LDir XSnp RDir XRsp CRsp XRd RDat XDat UDat

RDat XDat
UDat
RDat XDat UDat
RDat
XDat UDat
2005 Mikko Lipasti

Scaleable Cache Coherence

Eschew physical bus but s>ll snoop

Point-to-point tree structure (indirect) or ring

Root of tree or ring provide ordering point
Use some scalable network for data (ordering less
important)

Or, use level of indirec>on through directory

Directory at memory remembers:
Which processor is single writer
Which processors are shared readers

Level of indirec>on has a price

Dirty misses require 3 hops instead of two
Snoop: Requestor->Owner
Directory: Requestor->Directory->Owner
2005 Mikko Lipasti

Implemen>ng Cache Coherence

Directory implementa>on

Extra bits stored in memory (directory) record state of line

Memory controller maintains coherence based on the current state
Other CPUs commands are not snooped, instead:

Directory forwards relevant commands

Powerful ltering eect: only observe commands that you need to

observe
Meanwhile, bandwidth at directory scales by adding memory
controllers as you increase size of the system

Leads to very scalable designs (100s to 1000s of CPUs)

Directory shortcomings

Indirec>on through directory has latency penalty

If shared line is dirty in other CPUs cache, directory must forward
request, adding latency
This can severely impact performance of applica>ons with heavy
sharing (e.g. rela>onal databases)

2005 Mikko Lipasti

Directory Protocol Implementa>on

Basic idea: Centralized directory keeps track of data

loca>on(s)
Scalable

Address trac roughly propor>onal to number of processors

Directory & trac can be distributed with memory banks
(interleaved)
Directory cost (SRAM) or latency (DRAM) can be prohibi>ve

Full map (N processors, N bits): cost/scalability

Limited map (limits number of sharers)
Coarse map (iden>es board/node/cluster; must use broadcast)

Point to shared copies

Fixed number, linked lists (SCI), caches chained together
Latency vs. cost vs. scalability

Presence bits track sharers

Vectors track sharers

2005 Mikko Lipasti

Directory Protocol Latency

LDir XSnp RDir XRd RDat XDat UDat

Access to non-shared data

Overlap directory read with data read

Best possible latency given NUMA arrangement

Access to shared data

Dirty miss, modied interven>on

Shared interven>on?

If DRAM directory, no gain

If directory cache, possible gain (use F state)

No inherent parallelism
Indirec>on adds latency
Minimum 3 hops, o[en 4 hops

2005 Mikko Lipasti

Directory-based Cache Coherence

An alterna>ve for large, scalable
MPs
Can be based on any of the
protocols discussed thus far
We will use MSI

Memory Controller becomes an

ac>ve par>cipant
Sharing info held in memory
directory

Memory
Module

Directory may be distributed

Use point-to-point messages

Network is not totally ordered

02/07

Cache

Processor

...

Cache

Processor

Example: Simple Directory Protocol

Local cache controller states
M, S, I as before

Local directory states

Shared: <1,X,X,1>; one or more proc. has copy; memory
is up-to-date
Modied: <0,1,0,.,0> one processor has copy; memory
does not have a valid copy
Uncached: <0,0,0,1> none of the processors has a valid
copy

Directory also keeps track of sharers

Can keep global state vector in full
e.g. via a bit vector
02/07

Example
Local cache suers load miss
Line in remote cache in M state
It is the owner

02/07

Processor
Processor

processor
read

Cache

Local
Controller

Cache

Owner
Controller

memory
data
response

Processor

...

owner
data
response

Remote
Controller

Cache

Interconnect

memory
read

cache
read

Memory
Controller

...

Memory
Banks

Cache Controller State Diagram

Intermediate States for clarity
Cache Controller
Actions and Next States
from Processor Side
Current
State

Processor
Read

Processor
Write

from Memory Side

Eviction

Cache
Read
I'

Cache
Read&M
I''

No
Action
S

Cache
Upgrade
S'

No
Action*
I

No
Action
M

No Action
M

Cache
Write-back
I

Memory
Read

Memory
Read&M

Memory
Invalidate

Memory
Upgrade

No Action
I
Invalidate
Frame;
Cache ACK;
I
Owner
Data;
S

Owner
Data;
I

Invalidate
Frame;
Cache ACK;
I

Fill Cache
S

I''

Fill Cache
M

02/07

Memory Data

No Action
M

Memory Controller State Diagram

Memory Controller
Actions and Next States

command from Local Cache Controller

Current
Directory
State

Cache
Read

Cache Read&M

Cache
Upgrade

Memory Data;
Add Requestor to
Sharers;
S

Memory Data;
Add Requestor to
Sharers;
M

Memory Data;
Add Requestor to
Sharers;
S

Memory
Invalidate All
Sharers;
M'

Memory Read
from Owner;
S'

Memory Read&M;
to Owner
M'

Memory
Upgrade
All Sharers;
M''

response from Remote Cache Controller

Data
Write-back

Cache ACK

No Action
I

Make Sharers
Empty;
U

Memory Data
to Requestor;
Write memory;
Add Requestor to
Sharers;
S

When all ACKS

Memory Data;
M

M''

When all ACKS

then
M

02/07

Owner
Data

Memory Data
to Requestor;
M

Another Example
Local write (miss) to shared line
Requires invalida>ons and acks
Processor
Processor

processor
write

Cache

Local
Controller

cache
Read&M

02/07

Cache

Remote
Controller

memory
data
response
memory
invalidate

Processor

...

Remote
Controller

Cache

cache
ack

Interconnect

Memory
Controller

Home Memory
Controller

Memory
Controller

...

Memory
Banks

Example Sequence
Similar to earlier sequences
Thread Event

Controller
Actions

Data From

0. Initially:

local states:
C0 C1 C2

<0,0,0,1>

<1,0,0,1>

<1,0,0,0>

1. T0 read

CR,MD

2. T0 write

CU, MU*,MD

3. T2 read

CR,MR,MD

<1,0,1,1>

4. T1 write

CRM,MI,CA,MD

Memory

<0,1,0,0>

02/07

Memory

global state

Varia>on: Three Hop Protocol

Have owner send data directly to local controller
Owner Acks to Memory Controller in parallel

Local
Controller

cache
read

Owner
Controller
memory
data

memory
read

owner
data

Local
Controller
cache
read

3
memory
read

Memory
Controller

a)
02/07

Owner
Controller

owner
ack

Directory Protocol Op>miza>ons

Remove dead blocks from cache:

Eliminate 3- or 4-hop latency

Dynamic Self-Invalida>on [Lebeck/Wood, ISCA 1995]
Last touch predic>on [Lai/Falsa, ISCA 2000]
Dead block predic>on [Lai/Fide/Falsa, ISCA 2001]

Predic>on in coherence protocols [Mukherjee/Hill, ISCA 1998]

Instruc>on-based predic>on [Kaxiras/Goodman, ISCA 1999]
Sharing predic>on [Lai/Falsa, ISCA 1999]

Improve latency by snooping, conserve bandwidth with directory

Mul>cast snooping [Bilir et al., ISCA 1999; Mar>n et al., ISCA 2003]
Bandwidth-adap>ve hybrid [Mar>n et al., HPCA 2002]
Token Coherence [Mar>n et al., ISCA 2003]
VCT Mul>cas>ng [Enright Jerger thesis 2008]

Predict sharers

Hybrid snooping/directory protocols

2005 Mikko Lipasti

Atomic bus

Protocol Races

Only stable states in protocol (e.g. M, S, I)

All state transi>ons are atomic (I->M)
No conic>ng requests can interfere since bus is held >ll transac>on
completes
Dis>nguish coherence transac>on from data transfer
Data transfer can s>ll occur much later; easier to handle this case

Atomic buses dont scale

At minimum, separate bus request/response

Large systems have broadly variable delays

Req/resp separated by dozens of cycles
Conic>ng requests can and do get issued
Messages may get reordered in the interconnect

How do we resolve them?

Resolving Protocol Races

Req/resp decoupling introduces transient

states
E.g. I->S is now I->ItoX->ItoS_nodata->S

Conic>ng requests to blocks in transient

states
NAK ugly; livelock, starva>on poten>al
Keep adding more transient states

Directory protocol makes this a bit easier

Can order at directory, which has full state info
Even so, messages may get reordered
55

Common Protocol Races

Read strings: P0 read, P1 read, P2 read

Easy, since read is nondestruc>ve

Can rely on F state to reduce DRAM accesses
Forward reads to previous requestor (F)

Write strings: P0 write, P1 write, P2 write

Forward P1 write req to P0 (M)
P0 completes write then forwards M block to P1
Build string of writes (write string forwarding)

Read a[er write (similar to prev. WAW)

Writeback race: P0 evicts dirty block, P1 reads
Dirty block is in the network (no copy at P0 or at dir)
NAK P1, or force P0 to keep copy >ll dir ACKs WB

Many others crop up, esp. with op>miza>ons

Summary

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol op>miza>ons
Update Protocols (Dragon/Firey)
Directory protocols
Implementa>on issues

UVM Testbench Architecture Example 1671715841
100% (1)
UVM Testbench Architecture Example 1671715841
20 pages
Questa Tool Usage Beginers Guide
100% (1)
Questa Tool Usage Beginers Guide
131 pages
Top 100 DV Interview Questions
No ratings yet
Top 100 DV Interview Questions
34 pages
Phy Interface Pci Express Sata Usb31 Architectures PIPE - Rev6 - 2 - 1
No ratings yet
Phy Interface Pci Express Sata Usb31 Architectures PIPE - Rev6 - 2 - 1
187 pages
Cache Coherence: CSE 661 - Parallel and Vector Architectures
No ratings yet
Cache Coherence: CSE 661 - Parallel and Vector Architectures
37 pages
Cache Coherency
No ratings yet
Cache Coherency
33 pages
System Verilog Interview Q
No ratings yet
System Verilog Interview Q
22 pages
Fifo Uvm
No ratings yet
Fifo Uvm
15 pages
System Verilog Short Notes
100% (1)
System Verilog Short Notes
22 pages
System Verilog: Program Block & Interface
No ratings yet
System Verilog: Program Block & Interface
38 pages
AXI Verification IP
No ratings yet
AXI Verification IP
54 pages
Basics of DDR Protocol: Jose Thomas Vellara
No ratings yet
Basics of DDR Protocol: Jose Thomas Vellara
57 pages
UVM Sequences
No ratings yet
UVM Sequences
23 pages
Virtual Sequence PDF
No ratings yet
Virtual Sequence PDF
35 pages
SystemVerilog Interview Questions PART-3
No ratings yet
SystemVerilog Interview Questions PART-3
14 pages
TLM With Examples
No ratings yet
TLM With Examples
13 pages
SystemVerilog and Direct Programming Interface
No ratings yet
SystemVerilog and Direct Programming Interface
25 pages
Interview Experience - NXP Semiconductors
No ratings yet
Interview Experience - NXP Semiconductors
2 pages
VLSI Guru Interview Preparation Questions & Excel Sheet - VLSI Guru
No ratings yet
VLSI Guru Interview Preparation Questions & Excel Sheet - VLSI Guru
9 pages
Dokumen - Tips Verilog Basic
No ratings yet
Dokumen - Tips Verilog Basic
151 pages
M - Sequence and P - Sequencer in UVM
No ratings yet
M - Sequence and P - Sequencer in UVM
4 pages
Chap 9 Functional Coverage PDF
No ratings yet
Chap 9 Functional Coverage PDF
46 pages
SystemVerilog PDF
No ratings yet
SystemVerilog PDF
7 pages
Uvm Cookbook Collection
No ratings yet
Uvm Cookbook Collection
54 pages
Behavioral Model of A DDR Memory Controller in A DFi - Frequency Ratio System
No ratings yet
Behavioral Model of A DDR Memory Controller in A DFi - Frequency Ratio System
10 pages
UVM HowTo
No ratings yet
UVM HowTo
93 pages
Systemverilog UVM QA
No ratings yet
Systemverilog UVM QA
9 pages
General Digital 1. General Digital Design Questions Design Questions
No ratings yet
General Digital 1. General Digital Design Questions Design Questions
44 pages
2013 SNUG SV Synthesizable SystemVerilog Presentation
100% (1)
2013 SNUG SV Synthesizable SystemVerilog Presentation
31 pages
SoC Verification Flow and Methodologies by Sivakumar PR
No ratings yet
SoC Verification Flow and Methodologies by Sivakumar PR
6 pages
Verilog Interview Questions 1670709109
No ratings yet
Verilog Interview Questions 1670709109
21 pages
System Verilog - Interview
No ratings yet
System Verilog - Interview
9 pages
DebuggingUVM PDF
No ratings yet
DebuggingUVM PDF
58 pages
Bus-Based Multiprocessor: A.K.A or Snoopy-Bus Architecture
No ratings yet
Bus-Based Multiprocessor: A.K.A or Snoopy-Bus Architecture
54 pages
UVM Tutorial DVCon 2013
No ratings yet
UVM Tutorial DVCon 2013
120 pages
UVM Guide For Beginners
No ratings yet
UVM Guide For Beginners
1 page
Systemverilog Interview Questions
No ratings yet
Systemverilog Interview Questions
6 pages
AXI Vs AHB. Difference Between AXI and AHB
100% (2)
AXI Vs AHB. Difference Between AXI and AHB
3 pages
Universal Verification Methodology Based Verification Environment For PCIE Data Link Layer
No ratings yet
Universal Verification Methodology Based Verification Environment For PCIE Data Link Layer
5 pages
LED Beam 90W
50% (2)
LED Beam 90W
2 pages
SV Assertions Lec3
No ratings yet
SV Assertions Lec3
14 pages
Fun With Uvm Sequences Coding and Debugging - VH v15 I12
100% (1)
Fun With Uvm Sequences Coding and Debugging - VH v15 I12
11 pages
UVM Interview Questions - VLSI Encyclopedia
No ratings yet
UVM Interview Questions - VLSI Encyclopedia
4 pages
Cache Coherence: Part I: CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
No ratings yet
Cache Coherence: Part I: CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
31 pages
SystemVerilog DPI Tutorial
No ratings yet
SystemVerilog DPI Tutorial
6 pages
Introduction To SystemVerilog and Verification
No ratings yet
Introduction To SystemVerilog and Verification
107 pages
Bromley Coverage Paper
No ratings yet
Bromley Coverage Paper
20 pages
Coverage
No ratings yet
Coverage
25 pages
VCS Commands Ease Coverage Efforts - Speed Simulation PDF
No ratings yet
VCS Commands Ease Coverage Efforts - Speed Simulation PDF
6 pages
Cache Coherency in Multiprocessors (MPS) / Multi-Cores: Topic 9
No ratings yet
Cache Coherency in Multiprocessors (MPS) / Multi-Cores: Topic 9
79 pages
GOOD DRAM Interface Tutorial
No ratings yet
GOOD DRAM Interface Tutorial
91 pages
Class 12 Topics
No ratings yet
Class 12 Topics
43 pages
SV-UVM of AXI - WB
No ratings yet
SV-UVM of AXI - WB
4 pages
SV Interview Questions
No ratings yet
SV Interview Questions
1 page
1 Architecting Uvm Testbench
No ratings yet
1 Architecting Uvm Testbench
12 pages
Shared-Memory Architectures: Adapted From A Lecture by Ian Watson, University of Machester
No ratings yet
Shared-Memory Architectures: Adapted From A Lecture by Ian Watson, University of Machester
33 pages
Design and Evaluation of The Electronic Class Record For LPU-Laguna International School
0% (1)
Design and Evaluation of The Electronic Class Record For LPU-Laguna International School
8 pages
Technical Specifications For CADWorx
No ratings yet
Technical Specifications For CADWorx
4 pages
Coverage/Block Level Functional Coverage Example
No ratings yet
Coverage/Block Level Functional Coverage Example
14 pages
Api Net E3d
No ratings yet
Api Net E3d
12 pages
SystemVerilog Veriflcation
No ratings yet
SystemVerilog Veriflcation
68 pages
Trackwise® Customer Complaint Solutions: A Global Approach Integrated Corrective & Preventive Actions (Capa)
100% (1)
Trackwise® Customer Complaint Solutions: A Global Approach Integrated Corrective & Preventive Actions (Capa)
2 pages
Codesys 2007 Engl
No ratings yet
Codesys 2007 Engl
12 pages
Unit II Ui Design
No ratings yet
Unit II Ui Design
28 pages
Etherchannel in AIX
No ratings yet
Etherchannel in AIX
4 pages
978 1 63057 352 2 3 81tctesck4
No ratings yet
978 1 63057 352 2 3 81tctesck4
50 pages
IMS - IP Multimedia Subsystem
No ratings yet
IMS - IP Multimedia Subsystem
21 pages
Cache Coherence: Caches Memory Coherence Caches Multiprocessing
No ratings yet
Cache Coherence: Caches Memory Coherence Caches Multiprocessing
4 pages
House Rental System Report by Mahi
No ratings yet
House Rental System Report by Mahi
31 pages
CTFL 2018 Sample Exam B v1.3 Questions
No ratings yet
CTFL 2018 Sample Exam B v1.3 Questions
21 pages
Python Programming: Control Structures
No ratings yet
Python Programming: Control Structures
40 pages
Good Computer Validation Practices: Common Sense Implementation
No ratings yet
Good Computer Validation Practices: Common Sense Implementation
8 pages
Salesforce Administrator Interview Questions and Answers
No ratings yet
Salesforce Administrator Interview Questions and Answers
38 pages
Memory Management Unit
No ratings yet
Memory Management Unit
23 pages
System Test: What Is System Testing ?
No ratings yet
System Test: What Is System Testing ?
6 pages
Confidential Card Content Management: Globalplatform Card Card Specification V2.2 - Amendment A
No ratings yet
Confidential Card Content Management: Globalplatform Card Card Specification V2.2 - Amendment A
26 pages
Introduction To Electric Circuits, 9th Edition: Expert-Verified Solutions in This Book
No ratings yet
Introduction To Electric Circuits, 9th Edition: Expert-Verified Solutions in This Book
7 pages
Final EQ List of NR20923 03102024
No ratings yet
Final EQ List of NR20923 03102024
11 pages
Artificial Intelligence
No ratings yet
Artificial Intelligence
10 pages
Vaigai Schedule
No ratings yet
Vaigai Schedule
67 pages
Brumund Building A Smallish v1
No ratings yet
Brumund Building A Smallish v1
30 pages
IBM PC XT 256-640 DIP Switch Settings
No ratings yet
IBM PC XT 256-640 DIP Switch Settings
18 pages
Tcs Kronos
No ratings yet
Tcs Kronos
43 pages
GCN 2020 03
No ratings yet
GCN 2020 03
4 pages
Parer Point Presentation On SAS by GE
100% (1)
Parer Point Presentation On SAS by GE
17 pages
Ruru: Real-Time Wide-Area TCP Latency Monitoring: Richard - Cziva@glasgow - Ac.uk
No ratings yet
Ruru: Real-Time Wide-Area TCP Latency Monitoring: Richard - Cziva@glasgow - Ac.uk
29 pages
Blockchain Management
No ratings yet
Blockchain Management
27 pages
Materi Konsep Dasar TV
No ratings yet
Materi Konsep Dasar TV
7 pages
CSS Text Properties
No ratings yet
CSS Text Properties
2 pages
Application-Specific Integrated Circuit ASIC A Complete Guide
From Everand
Application-Specific Integrated Circuit ASIC A Complete Guide
Gerardus Blokdyk
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.