Memory Consistency Models1

David Mosberger

TR 93/11

Abstract
This paper discusses memory consistency models and their influence on software in the context of parallel
machines. In the first part we review previous work on memory consistency models. The second part
discusses the issues that arise due to weakening memory consistency. We are especially interested in the
influence that weakened consistency models have on language, compiler, and runtime system design. We
conclude that tighter interaction between those parts and the memory system might improve performance
considerably.

Department of Computer Science

The University of Arizona
Tucson, AZ 85721

1 This is an updated version of [Mos93].
1 Introduction

Traditionally, memory consistency models were of interest only to computer architects designing parallel machines. The goal was to present a model as close as possible to the model exhibited by sequential machines. The model of choice was sequential consistency (SC). Sequential consistency guarantees that the result of any execution of n processors is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program. However, this model severely restricts the set of possible optimizations. For example, in an architecture with a high-latency memory, it would be beneficial to pipeline write accesses and to use write buffers. None of these optimizations is possible with the strict SC model. Simulations have shown that weaker models allowing such optimizations could improve performance on the order of 10 to 40 percent over a strictly sequential model [GGH91, ZB92]. However, weakening the memory consistency model goes hand in hand with a change in the programming model. In general, the programming model becomes more restricted (and complicated) as the consistency model becomes weaker. That is, an architecture can employ a weaker memory model only if the software using it is prepared to deal with the new programming model. Consequently, memory consistency models are now of concern to operating system and language designers too.

We can also turn the coin around. A compiler normally considers memory accesses to be expensive and therefore tries to replace them by accesses to registers. In terms of a memory consistency model, this means that certain accesses suddenly are not observable any more. In effect, compilers implicitly generate weak memory consistency. This is possible because a compiler knows exactly (or estimates conservatively) the points where memory has to be consistent. For example, compilers typically write back register values before a function call, thus ensuring consistency. It is only natural to attempt to make this implicit weakening explicit in order to let the memory system take advantage too. In fact, it is anticipated that software could gain from a weak model to a much higher degree than hardware [GGH91] by enabling optimizations such as code scheduling or delaying updates that are not legal under SC.

In short, weaker memory consistency models can have a positive effect on the performance of parallel shared memory machines. The benefit increases as memory latency increases. In recent years, processor performance has increased significantly faster than memory system performance. In addition to that, memory latency increases as the number of processors in a system increases.

Shared memory can be implemented at the hardware or software level. In the latter case it is usually called Distributed Shared Memory (DSM). At both levels work has been done to reap the benefits of weaker models. We conjecture that in the near future most parallel machines will be based on consistency models significantly weaker than SC [LLG+92, Sit92, BZ91, CBZ91, KCZ92].

The rest of this paper is organized as follows. In section 2 we discuss issues characteristic of memory consistency models. In the following section we present several consistency models and their implications on the programming model. We then take a look at implementation options in section 4. Finally, section 5 discusses the influence of weakened memory consistency models on software. In particular, we discuss the interactions between a weakened memory system and the software using it.

2 Consistency Model Issues

Choosing an appropriate memory consistency model (MCM) is a tradeoff between minimizing memory access order constraints and the complexity of the programming model as well as of the complexity of the memory model itself. The weakest possible memory is one returning for a read access some previously written value or a value that will be written in the future. Thus, the memory system could choose to return any of those values. While minimizing ordering constraints perfectly, it is not useful as a programming model. Implementing weaker models is also often more complex as it is necessary to keep track of outstanding accesses and restrict the order of execution of two accesses when necessary. It is therefore not surprising that many different MCMs have been proposed and new models are to be expected in the future. Unfortunately, there is no single hierarchy that could be used to classify the strictness of a consistency model. Below, we define the design space for consistency models. This allows us to classify the various models more easily.

Memory consistency models impose ordering restrictions on accesses depending on a number of attributes. In general, the more attributes a model distinguishes, the weaker the model is. Some attributes a model could distinguish are listed below:

- location of access
- direction of access (read, write, or both)
- value transmitted in access
- causality of access

- category of access

The causality attribute is a relation that tells if two accesses a1 and a2 are (potentially) causally related [Lam78] and if so, whether a1 occurred before a2 or vice versa.

The access category is a static property of accesses. A useful (but by no means the only possible) categorization is shown in Figure 1. It is an extension of the categorization used in [GLL+90]. A memory access is either shared or private. Private accesses are easy to deal with, so we don’t discuss them further. Shared accesses can be divided into competing and non-competing accesses. A pair of accesses is competing if they access the same location, at least one of them is a write access, and they are not ordered. For example, accesses to shared variables within a critical section are non-competing because mutual exclusion guarantees ordering (assuming accesses by a single processor appear sequentially consistent). A competing access can be either synchronizing or non-synchronizing. Synchronizing accesses are used to enforce order, for example by delaying an access until all previous accesses are performed. However, not all competing accesses are synchronizing accesses. Chaotic relaxation algorithms, for example, use competing accesses without imposing ordering constraints. Such algorithms work even if some read accesses do not return the most recent value.

Synchronizing accesses can be divided further into acquire or release accesses. An acquire is always associated with a read synchronizing access while a release is always a write synchronizing access. Atomic fetch-and-operations can usually be treated as an acquiring read access followed by a non-synchronizing write access. Finally, an acquire operation can be either exclusive or non-exclusive. Multiple non-exclusive acquire accesses can be granted, but an exclusive acquire access is delayed until all previous acquisitions have been released. For example, if a critical region has only read accesses to shared variables, then acquiring the lock can be done non-exclusively.

    memory access
        shared
            competing
                synchronizing
                    acquire
                        exclusive
                        non-exclusive
                    release
                non-synchronizing
            non-competing
        private

Figure 1: Access Categories

Consistency models that distinguish access categories employ different ordering constraints depending on the access category. We therefore call such models hybrid. In contrast, models that do not distinguish access categories are called uniform. The motivation for hybrid models is engendered in Adve and Hill’s definition for weak ordering [AH90]:

    Hardware is weakly ordered with respect to a synchronization model if and only if it appears sequentially consistent to all software that obey the synchronization model.

That is, as long as the synchronization model is respected, the memory system appears to be sequentially consistent. This allows the definition of almost arbitrarily weak consistency models while still presenting a reasonable programming model. All that changes is the number or severity of constraints imposed by the synchronization model.

3 Proposed Models

We now proceed to define some of the consistency models that have previously been proposed. We do not give formal definitions for the presented models as they do not help much to understand a model’s implications on the programming model. More formal descriptions can be found for example in Ahamad et al. [ABJ+92] and Gharachorloo et al. [GLL+90]. We first discuss uniform models and then hybrid models. Figure 2 gives an overview of the relationships among the uniform models. An arrow from model A to B indicates that A is more strict than B. Each model is labeled with the subsection it is described in. Hybrid models are described roughly in decreasing order of strictness.

We use triplets of the form a(l)v to denote a memory access, where a is either R for read access or W for write access, l denotes the accessed location, and v the transmitted value. In our examples, a triplet identifies an access uniquely. Memory locations are assumed to have an initial value of zero.

Execution histories are presented as diagrams with one line per processor. Time corresponds to the horizontal axis and increases to the right. For write accesses, a triplet in a diagram marks the time when it was issued, while for read accesses it denotes the time of completion. This asymmetry exists because a program can only “know” about these events. It does not know at which point in time a write performed or at which point in time a read was issued (there may be prefetching hardware,

for example). The following diagram is an example execution history:

    P1: W(x)1
    P2: R(x)1

Processor P1 writes 1 to location x and processor P2 subsequently observes this write by reading 1 from x. This implies that the write access completed (performed) some time between being issued by P1 and being observed by P2.

In the following discussion we use the word processor to refer to the entities performing memory accesses. In most cases it could be replaced by the word process as processes are simply a software abstraction of physical processors.

[Figure 2: Structure of Uniform Models. Models shown: 3.1 Atomic Consistency (AC), 3.2 Sequential Consistency (SC), 3.3 Causal Consistency (CC), 3.6 Processor Consistency (PC), 3.5 Cache Consistency (coherence), 3.4 PRAM, 3.7 Slow Memory]

3.1 Atomic Consistency (AC)

This is the strictest of all consistency models. With atomic consistency, operations take effect at some point in an operation interval. It is easiest to think of operation intervals as dividing time into non-overlapping, consecutive slots. For example, the clock cycle of a memory bus could serve as an operation interval. Multiple accesses during the same operation interval are allowed, which causes a problem if reads and writes to the same location occur in the same operation interval. One solution is to define read operations to take effect at read-begin time and write operations to take effect at write-end time. This is called static atomic consistency [HA90]. With dynamic AC, operations can take effect at any point in the operation interval, as long as the resulting history is equivalent to some serial execution.

Atomic consistency is often used as a base model when evaluating the performance of an MCM.

3.2 Sequential Consistency (SC)

Sequential consistency was first defined by Lamport in 1979 [Lam79]. He defined a memory system to be sequentially consistent if

    ... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

This is equivalent to the one-copy serializability concept found in work on concurrency control for database systems [BHG87]. In a sequentially consistent system, all processors must agree on the order of observed effects. The following is a legal execution history for SC but not for AC:

    P1: W(x)1
    P2: W(y)2
    P3: R(y)2  R(x)0  R(x)1

Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multi-processor machines actually implement a slightly weaker model called processor consistency (see below).

3.3 Causal Consistency (CC)

Hutto and Ahamad [HA90] introduced causal consistency. Lamport [Lam78] defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders. For example, the following is a legal execution history under CC but not under SC:

    P1: W(x)1  W(x)3
    P2: R(x)1  W(x)2
    P3: R(x)1  R(x)3  R(x)2
    P4: R(x)1  R(x)2  R(x)3

Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal under SC.

Among the uniform models, CC appears to be one of the more difficult to implement in hardware. This can probably be explained by the fact that most other models have been designed with a hardware implementation in mind. However, this does not imply that a CC implementation necessarily performs worse than an implementation of one of the simpler uniform models.

3.4 Pipelined RAM (PRAM)

Lipton and Sandberg [LS88] defined the Pipelined RAM (PRAM) consistency model. The reader should be aware that the acronym PRAM is often used as a shorthand for Parallel Random Access Machine which has nothing in common with the Pipelined RAM consistency model.

The reasoning that led to this model was as follows: consider a multi-processor where each processor has a local copy of the shared memory. For the memory to be scalable, an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved.

In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors. The following execution history is legal under PRAM but not under CC:

    P1: W(x)1
    P2: R(x)1  W(x)2
    P3: R(x)1  R(x)2
    P4: R(x)2  R(x)1

P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related. Thus, this would not be a legal history for CC.

3.5 Cache Consistency (Coherence)

Cache consistency [Goo89] and coherence [GLL+90] are synonymous and to avoid confusion with causal consistency, we will use the term coherence in this paper. Coherence is a location-relative weakening of SC. Recall that under SC, all processors have to agree on some sequential order of execution for all accesses. Coherence only requires that accesses are sequentially consistent on a per-location basis. Clearly, SC implies coherence but not vice versa. Thus, coherence is strictly weaker than SC. The example below is a history that is coherent but not sequentially consistent:

    P1: W(x)1  R(y)0
    P2: W(y)1  R(x)0

Clearly, any serial execution that respects program order starts with writing 1 into either x or y. It is therefore impossible that both read accesses return 0. However, the accesses to x can be linearized into R(x)0, W(x)1 and so can the accesses to y: R(y)0, W(y)1. The history is therefore coherent, but not SC. In essence, coherence removes the ordering constraints that program order imposes on accesses to different memory locations.

3.6 Processor Consistency (PC)

Goodman proposed processor consistency in [Goo89]. Unfortunately, his definition is informal and caused a controversy as to what exactly PC refers to. Ahamad et al. [ABJ+92] give a formal definition of PC which removes all ambiguity and appears to be a faithful translation of Goodman’s definition. They also show that PC as defined by the DASH group in [GLL+90] is not comparable to Goodman’s definition (i.e., it is neither weaker nor stronger). We will not discuss the DASH version of PC except in the context of release consistency (RC) and hence will use PC to refer to Goodman’s version and PCD to refer to the DASH version.

Goodman defined PC to be stronger than coherence but weaker than SC. PC can be interpreted as a combination of coherence and PRAM. Thus, every PC history is also coherent and PRAM. However, for a history to be PC it not only has to be coherent and PRAM but those two conditions also must be satisfiable in a mutually consistent way. It is easiest to think of PC as a consistency model that requires a history to be coherent and PRAM simultaneously, rather than individually. That is, processors must agree on the order of writes from each processor but can disagree on the order of writes by different processors, as long as those writes are to different locations. The example given for coherence is also PC so we give here a history that fails to be PC (this and the previous example are from [Goo89]):

    P1: W(x)1  W(c)1  R(y)0
    P2: W(y)1  W(c)2  R(x)0

Notice that P1 observes accesses in the order:

    W(x)1; W(c)1; R(y)0; W(y)1; W(c)2;

while P2 observes accesses in the order:

    W(y)1; W(c)2; R(x)0; W(x)1; W(c)1.

That is, P1 and P2 disagree on the order of writes to location c. As there is no consistent ordering that would remove this disagreement, the history fails to be PC.

The differences between PC and SC are subtle enough that Goodman claims most applications give the same results under these two models. He also says that many existing multiprocessors (e.g., VAX 8800) satisfy PC, but not sequential consistency [Goo89]. Ahamad et al. prove that the Tie-Breaker algorithm executes correctly under PC while the Bakery algorithm does not (see [And91] for a description of those algorithms). Bershad and Zekauskas [BZ91] mention that processor consistent machines are easier to build than sequentially consistent systems.

3.7 Slow Memory

Slow memory is a location-relative weakening of PRAM [HA90]. It requires that all processors agree on the order of observed writes to each location by a single processor. Furthermore, local writes must be visible immediately (as in the PRAM model). The name for this model was chosen because writes propagate slowly through the system. Slow memory is probably one of the weakest uniform consistency models that can still be used for interprocess communication. Hutto and Ahamad present a mutual exclusion algorithm in [HA90]. However, this algorithm guarantees physical exclusion only. There is no guarantee of logical exclusion. For example, after two processes P1 and P2 were subsequently granted access to a critical section and both wrote two variables a and b, then a third process P3 may enter the critical region and read the value of a as written by P1 and the value of b as written by P2. Thus, for P3 it looks like P1 and P2 had had simultaneous access to the critical section. This problem is inherent to slow memory because the knowledge that an access to one location has performed cannot be used to infer that accesses to other locations have also performed. Slow memory does not appear to be of any practical significance.

3.8 Weak Consistency (WC)

Weak consistency is the first and most strict hybrid model we discuss. The model was originally proposed by Dubois et al. [DSB86]. A memory system is weakly consistent if it enforces the following restrictions:

1. accesses to synchronization variables are sequentially consistent and
2. no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
3. no access is issued by a processor before a previous access to a synchronization variable has been performed

Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed and all future accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed [AH90, ZB92]:

1. there are no data races (i.e., no competing accesses)
2. synchronization is visible to the memory system

Note that WC does not allow for chaotic accesses as found in chaotic relaxation algorithms. Such algorithms would either have to be changed to avoid data races or it would be necessary to mask chaotic accesses as synchronizing accesses. The latter would be overly restrictive.

3.9 Release Consistency (RC)

Release consistency as defined by Gharachorloo et al. [GLL+90] is a refinement of WC in the sense that competing accesses are divided into acquire, release, and non-synchronizing accesses. Competing accesses are also called special to distinguish them from non-competing, ordinary accesses. Non-synchronizing accesses are competing accesses that do not serve a synchronization purpose. This type of access was introduced to be able to handle chaotic relaxation algorithms. An acquire access works like a synchronizing access under WC, except that the fence delays future accesses only. Similarly, a release works like a synchronizing access under WC, except that the fence delays until all previous accesses have been performed. This, for example,

allows (limited) overlap in executing critical sections, which is not possible under WC. Another, more subtle, change is that special accesses are executed under PCD only (not under SC, as in WC).

To make the model more concrete, we give an example of how a critical section and a coordinator barrier could be programmed under RC (see [And91], for example). Below we show how a critical section could be implemented under this model:

    do test_and_set(locked) ->          [rd: acquire; wr: nsync]
        skip
    od
    ... critical section ...
    locked := false                     [release]

Note the labeling of the read-modify-write operation test_and_set(). The read is labeled acquire, while the write is labeled nsync, which stands for non-synchronizing access. The acquire label ensures that no future access is performed before the read has completed and the nsync label ensures that the write occurs under PCD. Note that it would be legal but unnecessarily restrictive to mark the write access release. The release label for the write access resetting the locked flag ensures that all accesses in the critical section are performed by the time the flag is actually reset.

The coordinator barrier is considerably more complicated. The important thing however is that the heart of the barrier is realized by a release followed by an acquire, while the critical section does just the opposite. Pseudo-code for the barrier is shown in Figure 3.

    worker[p: 1..N]:
        arrived[p] := true              [release]
        do not go[p] -> skip od         [acquire]
        go[p] := false                  [ordinary]

    coordinator:
        fa i := 1 to N ->
            do not arrived[i] ->        [nsync]
                skip
            od
            arrived[i] := false         [nsync]
        af
        fa i := 1 to N ->
            go[i] := true               [nsync]
        af

Figure 3: Barrier Under Release Consistency

From these examples it should be clear that it is not at all straightforward to write synchronization primitives under RC. However, it is important to realize that such primitives are often written “once-and-forever.” That is, the typical programmer doesn’t need to worry about labeling accesses correctly as high-level synchronization primitives would be provided by a language or operating system. Also, it is always safe to label a program conservatively. For example, if a compiler has incomplete information available, it could always revert to labeling reads with acquire and writes with release.

3.10 Entry Consistency (EC)

The entry consistency model is even weaker than RC [BZ91]. However, it imposes more restrictions on the programming model. EC is like RC except that every shared variable needs to be associated with a synchronization variable. A synchronization variable is either a lock or a barrier. The association between a variable and its synchronization variable can change dynamically under program control. Note that this, like slow memory, is a location-relative weakening of a consistency model. This has the effect that accesses to different critical sections can proceed concurrently, which would not be possible under RC. Another feature of EC is that it refines acquire accesses into exclusive and non-exclusive acquisitions. This, again, increases potential concurrency as non-exclusive acquisitions to the same synchronization variable can be granted concurrently. However, unlike RC, entry consistency is not prepared to handle chaotic accesses. This model is the first that was specifically designed to be implemented as a software shared memory system.

4 Implementations of Memory Consistency Models

An implementation of a memory consistency model is often stricter than the model would allow. For example, SC allows the possibility of a read returning a value that hasn’t been written yet (see example discussed under 3.2 Sequential Consistency). Clearly, no implementation will ever exhibit an execution with such a history. In general, it is often simpler to implement a slightly stricter model than its definition would require. This is especially true for hardware realizations of shared memories [AHJ91, GLL+90].

For each consistency model there are a number of implementation issues. Some of the more general questions are:

- What is the consistency unit?
- Enforce eager or lazy consistency?
- Use update or invalidation protocol to maintain consistency?
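
The remark at the start of this section, that SC admits a history no real implementation would ever exhibit, can be checked mechanically. The following Python sketch brute-forces the definition of sequential consistency: it searches for any single interleaving that respects each processor's program order and in which every read returns the most recently written value (locations start at zero). The function name and the tuple encoding of a(l)v triplets are assumptions made for this illustration. The search ignores real time entirely, which is why the Section 3.2 history is accepted.

```python
from typing import Dict, List, Tuple

Access = Tuple[str, str, int]  # (kind, location, value): ("W", "x", 1) encodes W(x)1


def is_sequentially_consistent(history: List[List[Access]]) -> bool:
    """Return True if some program-order-respecting interleaving explains every read."""

    def search(positions: Tuple[int, ...], memory: Dict[str, int]) -> bool:
        if all(pos == len(ops) for pos, ops in zip(positions, history)):
            return True  # every access has been placed in the serial order
        for p, ops in enumerate(history):
            pos = positions[p]
            if pos == len(ops):
                continue  # processor p has no accesses left
            kind, loc, value = ops[pos]
            advanced = positions[:p] + (pos + 1,) + positions[p + 1:]
            if kind == "W":
                if search(advanced, {**memory, loc: value}):
                    return True
            elif memory.get(loc, 0) == value:  # a read must see the latest write
                if search(advanced, memory):
                    return True
        return False

    return search((0,) * len(history), {})


# History from Section 3.2: one line per processor P1, P2, P3.
sc_example = [
    [("W", "x", 1)],
    [("W", "y", 2)],
    [("R", "y", 2), ("R", "x", 0), ("R", "x", 1)],
]

# History from Section 3.5: coherent, but no serial order lets both reads return 0.
coherent_only = [
    [("W", "x", 1), ("R", "y", 0)],
    [("W", "y", 1), ("R", "x", 0)],
]

print(is_sequentially_consistent(sc_example))
print(is_sequentially_consistent(coherent_only))
```

The first call prints True (one witness order is W(y)2, R(y)2, R(x)0, W(x)1, R(x)1) and the second prints False. The search is exponential in the number of accesses, so it is only suitable for small textbook histories such as these.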

In hardware implementations the consistency unit is typ- for in operations if they are used with certain restric-
ically a word or a cache line. In software shared mem- tions. For example, the weakest and most efficient pro-
ories, the overhead per consistency unit is much higher tocol can be used only if, for a tuple with tag t, there
in absolute terms, so that a memory page or a shared is at most one process performing in operations and no
object (structured variable, segment) is often chosen as process performing read operations. Unfortunately,
the consistency unit. so far no performance study of the advantage of such
The notion of eager versus lazy maintenance of mem- “guided” memory systems has been reported. Carter
ory consistency appears to have been invented indepen- [CBZ91] indicates that Munin performs well for matrix
dently by Borrmann/Herdieckerhoff [BH90] and Ber- multiplicationand SOR when compared to a hand-coded
shad/Zekauskas [BZ91]. This notion is based on the message passing algorithm, but no comparison with a
observation that the consistency protocol can either be single-protocol DSM or a strict DSM was reported.
invoked each time an inconsistency arises or only when Also note that a change in the consistency model of
an inconsistency could be detected. Eager implementa- a memory system can lead to quite subtle changes. For
tions do the former, lazy the latter. The expected benefit of lazy implementations is that if a process has a cached copy of a shared variable but doesn't access it anymore, then this process does not have to participate in maintaining consistency for this variable. Lazy release consistency [KCZ92] and Midway [BZ91] are two examples of lazy implementations. No performance data is yet available.

5 Influence of Consistency Model on Software

As mentioned earlier, choosing a memory consistency model is a tradeoff between increasing concurrency by decreasing ordering constraints on the one hand, and implementation and programming-model complexity on the other. With hybrid models, the memory system is sequentially consistent as long as its synchronization model is respected. That is, the software executing on such a memory system has to provide information about synchronization events to the memory system, and its synchronization model must match the memory system's model. Synchronization information is provided either by a programmer in an explicitly concurrent language2 or by a compiler or its runtime system in a high-level language. Thus, software running on a hybrid memory system has to provide information to execute correctly. However, it is possible and beneficial to go beyond that point. If the software can provide information on the expected access pattern to a shared variable, optimizations for each particular access pattern could be enabled, resulting in substantially improved performance. Munin [CBZ91] does this by providing a fixed set of sharing annotations. Each annotation corresponds to a consistency protocol optimized for a particular access pattern. A similar approach was taken by Chiba et al. [CKM92] where they annotate Linda programs in order to select an optimized protocol

2 By "explicitly concurrent language" we mean a language in which it is possible to program synchronization operations.

example, Zucker and Baer note that

    the analysis of Relax [a benchmark program] made us realize that how the program is written or compiled for peak performance depends upon the memory model to be used.

In their example, under SC it was more efficient to schedule a read access causing a cache miss at the end of a sequence of eight read accesses hitting the cache, while under WC and RC the same access had to be scheduled at the beginning of the read sequence.

5.1 Chaotic Accesses

Another issue raised by the introduction of weaker consistency models is chaotic accesses (i.e., non-synchronizing competing accesses). Current DSM systems do not handle them well. Neither Munin nor Midway has special provisions for chaotic accesses. Note that algorithms using such accesses often depend on having a "fairly recent" value available. That is, if accesses to variable x are unsynchronized, then reading x must not return just any previously written value but a "recent" one. For example, the LocusRoute application of the SPLASH benchmark does not perform well if non-synchronizing competing accesses return very old values [Rin92, SWG91]. RC maintains such accesses under PCD (which is safe but conservative in many cases). Another type of algorithm using non-synchronizing competing accesses is of the kind where a process needs some of the neighbor's data, but instead of synchronizing with its neighbor, the process computes the value itself and stores it in the neighbor's data field. In effect, this type of algorithm trades synchronization for (re-)computation. We would expect that having specialized consistency protocols for chaotic accesses could improve the performance of such algorithms.
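To make the dependence on "fairly recent" values concrete, the following is a minimal single-threaded Python sketch. It is not taken from LocusRoute or from any of the systems cited here; the function name, the grid problem, and all parameters are illustrative assumptions. It simulates two processes relaxing a one-dimensional grid, where each process reads its neighbor's boundary point through a local copy that is refreshed only every refresh_every sweeps, standing in for an unsynchronized read that may return an old value.

```python
def chaotic_relaxation(n=8, refresh_every=1, tol=1e-9, max_sweeps=100_000):
    """Relax a 1-D grid with fixed boundary values 0 and 1 (hypothetical example).

    Two simulated processes each own half of the n interior points. Each one
    reads the other's boundary point through a local copy that is refreshed
    only every `refresh_every` sweeps, which models a chaotic (unsynchronized)
    read returning a possibly stale value.
    Returns (values, number_of_sweeps_until_converged).
    """
    x = [0.0] * (n + 2)          # x[0] and x[n + 1] hold the fixed boundary values
    x[n + 1] = 1.0
    mid = n // 2                 # process 0 owns 1..mid, process 1 owns mid+1..n
    stale_right = x[mid + 1]     # process 0's copy of process 1's first point
    stale_left = x[mid]          # process 1's copy of process 0's last point

    for sweep in range(1, max_sweeps + 1):
        if sweep % refresh_every == 0:
            # The cached copies catch up with the current shared values.
            stale_right, stale_left = x[mid + 1], x[mid]

        # Process 0: update its own points, reading the neighbor via the copy.
        for i in range(1, mid + 1):
            right = stale_right if i == mid else x[i + 1]
            x[i] = 0.5 * (x[i - 1] + right)

        # Process 1: same, with its copy of process 0's last point.
        for i in range(mid + 1, n + 1):
            left = stale_left if i == mid + 1 else x[i - 1]
            x[i] = 0.5 * (left + x[i + 1])

        # Convergence is judged on the true shared values, not on the copies.
        residual = max(abs(x[i] - 0.5 * (x[i - 1] + x[i + 1]))
                       for i in range(1, n + 1))
        if residual < tol:
            return x, sweep

    raise RuntimeError("did not converge within max_sweeps")


if __name__ == "__main__":
    for interval in (1, 50):
        _, sweeps = chaotic_relaxation(refresh_every=interval)
        print(f"refresh every {interval:2d} sweeps: converged after {sweeps} sweeps")
```

Under these assumptions both runs reach the same solution, but the run with the longer refresh interval needs many more sweeps. This is the behavior described above: stale values do not make such an algorithm incorrect, yet its performance degrades as the values it reads get older.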

5.2 Annotating Compilers

Only very little work has been done on annotating parallel programs automatically. In the general case, determining the access patterns to a shared variable is undecidable. It is also unclear exactly what access patterns are useful to distinguish (some work in this direction was done for Munin). However, a language could be designed such that it becomes easier to infer certain aspects of an access pattern. A simple example is a constant object. As there are no write accesses, such objects can be replicated among processes without needing any consistency protocol. Another example is determining whether a critical region contains no write accesses to shared variables. Under EC, this information determines whether a lock can be acquired in exclusive or non-exclusive mode. As critical regions are typically short and do not contain any function calls or unbounded loops, this problem could be decided in most cases.

5.3 Explicitly Parallel Languages

As mentioned above, in an explicitly parallel language the MCM defines the allowable memory-access optimizations. Such a language depends very directly on the memory consistency model as it allows the implementation of synchronization operations. For AC, SC, and PC no special constructs need to be available. For WC a memory-barrier (or full fence) operation would be sufficient. A memory-barrier would have to be inserted in a program wherever consistency of the memory has to be enforced. For RC things become even more complex. Every access would have to be labeled according to its category. With EC, synchronization operations can be implemented based on the locks and barriers provided by the system only. This shows clearly that it is not a good idea to allow a programmer to implement his or her own synchronization primitives based on individual memory accesses. Instead, a language should provide efficient and primitive operations which can then be used to implement higher-level synchronization operations. Maybe locks and barriers as provided under EC would be sufficient. However, for barriers it is not clear whether a single implementation would be sufficient for all possible applications. For example, sometimes it is useful to do some work at the time all processes have joined at a barrier but before releasing them. Under EC, such a construct would have to be implemented with two barriers or in terms of locks; both methods would likely be less efficient than a direct implementation.

5.4 Implicitly Parallel Languages

Implicitly parallel languages do not have any notion of concurrent execution at the language level. Concurrency control is typically implemented by compiler-generated calls to the runtime system. Therefore all that needs to be done to adapt to a new MCM is to change the runtime system. As mentioned above, it is still advantageous to integrate the consistency model with the compiler and runtime system more tightly. As the compiler already has information on synchronization and the concurrency structure of the program, it might as well make this information available to the memory system. Jade [RSL92] is a step in this direction. Its runtime system has for each process precise information on the accessed locations and whether a location is only read or also modified. The language also allows one to express that some data will not be accessed anymore in the future.

It is unclear at this point exactly which information can and should be provided to the memory system. It is equally open what information the memory system could provide to the runtime system. The latter, for example, could be useful to guide a runtime system's scheduler based on what data is cheaply available (cached) in the memory system.

6 Conclusions

The central theme of this work is that being memory-model conscious is a good thing. This applies to distributed shared memories, runtime systems, and compilers, as well as languages. We have argued that consistency models are important and that weaker models are beneficial to performance. While there are weakened models that are uniform, they appear to be less promising than hybrid models. Most current work seems to concentrate on the latter. While quite some work has been done in this area, the lack of meaningful performance data is surprising. Also, it appears that in the language, compiler, and runtime-system realms there are still a lot of open questions that could warrant further research. We expect that a tighter coupling between the memory system and the software using it could result in considerable performance improvements.

Acknowledgements

Several people provided useful comments on drafts of this paper: Gregory Andrews, David Lowenthal and Vincent Freeh. Several others provided helpful information on aspects of memory consistency models: Brian Bershad, John Carter, Kourosh Gharachorloo, James Goodman, Bob Janssens, Karen Pieper, Martin Rinard, and Andy Tanenbaum.

References

[ABJ+92] Mustaque Ahamad, Rida Bazzi, Ranjit John, Prince Kohli, and Gil Neiger. The power of processor consistency. Technical Report GIT-CC-92/34, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA, 1992.
[AH90] Sarita Adve and Mark Hill. Weak ordering: A new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 2–14, May 1990.
[AHJ91] M. Ahamad, P. W. Hutto, and R. John. Implementing and programming causal distributed shared memory. In Proceedings of the 11th International Conference on Distributed Computing Systems, pages 274–281, May 1991.
[And91] Gregory R. Andrews. Concurrent Programming: Principles and Practice. Benjamin/Cummings, Menlo Park, 1991.
[BH90] Lothar Borrmann and Martin Herdieckerhoff. A coherency model for virtually shared memory. In International Conference on Parallel Processing, volume II, pages 252–257, 1990.
[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Massachusetts, 1987.
[BZ91] Brian N. Bershad and Matthew J. Zekauskas. Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical Report CMU-CS-91-170, Carnegie-Mellon University, 1991.
[CBZ91] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Symposium on Operating System Principles, pages 152–164, 1991.
[CKM92] Shigeru Chiba, Kazuhiko Kato, and Takashi Masuda. Exploiting a weak consistency to implement distributed tuple space. In Proceedings of the 12th International Conference on Distributed Computing Systems, pages 416–423, June 1992.
[DSB86] M. Dubois, C. Scheurich, and F. A. Briggs. Memory access buffering in multiprocessors. In Proceedings of the Thirteenth Annual International Symposium on Computer Architecture, pages 434–442, June 1986.
[GGH91] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Performance evaluation of memory consistency models for shared memory multiprocessors. ACM SIGPLAN Notices, 26(4):245–257, April 1991.
[GLL+90] K. Gharachorloo, D. Lenoski, J. Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. Computer Architecture News, 18(2):15–26, June 1990.
[Goo89] James R. Goodman. Cache consistency and sequential consistency. Technical Report 61, SCI Committee, March 1989.
[HA90] P. W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.
[KCZ92] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. SIGARCH Computer Architecture News, 20(2), May 1992.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.
[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, September 1979.
[LLG+92] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash multiprocessor. IEEE Computer, pages 63–79, March 1992.
[LS88] R. J. Lipton and J. S. Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR-180-88, Princeton University, September 1988.

[Mos93] David Mosberger. Memory consistency models. Operating Systems Review, 27(1):18–26, January 1993.
[Rin92] Martin Rinard, September 1992. Personal
communication.
[RSL92] Martin C. Rinard, Daniel J. Scales, and
Monica S. Lam. Jade: A high-level,
machine-independent language for parallel
programming. September 1992.
[Sit92] Richard L. Sites, editor. Alpha Architecture
Reference Manual. Digital Press, Burling-
ton, MA, 1992.
[SWG91] Jaswinder Pal Singh, Wolf-Dietrich Weber,
and Anoop Gupta. Splash: Stanford parallel
applications for shared-memory. Technical
Report CSL-TR-91-469, Stanford Univer-
sity, 1991.
[ZB92] R. N. Zucker and J-L. Baer. A per-
formance study of memory consistency
models. SIGARCH Computer Architecture
News, 20(2), May 1992.
