
Remote Storage with Byzantine Servers

[Extended abstract]

Marcos K. Aguilera
Microsoft Research Silicon Valley
Mountain View, CA, USA

Ram Swaminathan
HP Laboratories
Palo Alto, CA, USA

ABSTRACT

We consider the problem of providing byzantine-tolerant storage in distributed systems where client-server links are much thinner and slower than server-server links. We provide storage algorithms that are unique in two ways. First, our algorithms take into consideration the asymmetry in network connectivity by minimizing client-server communication. To provide this property, we rely on a small amount of partial (eventual) synchrony. Second, our algorithms provide a new property called limited effect, which is important for storage systems. To provide the latter property, we use synchronized clocks, which are increasingly common due to GPS devices and NTP, even in otherwise "asynchronous systems" like the Internet. We present two algorithms called QUAD and LINEAR, which provide a trade-off between failure resiliency and efficiency. Our algorithms implement an abortable register [3], which is an abstraction used in some real storage systems, but abortable registers are weaker than atomic registers. Thus, one might wonder if we could have implemented atomic registers instead. We answer this question in the negative: we prove that there are no implementations of atomic registers that provide the limited effect property in systems with failures, even with synchronized clocks.

Categories and Subject Descriptors

C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications; F.2.3 [Analysis of Algorithms and Problem Complexity]: Tradeoffs between Complexity Measures; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems

General Terms

Algorithms, Design, Reliability, Security, Theory

Keywords

Distributed system, distributed storage, byzantine failures, algorithms, digital signatures

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SPAA'09, August 11–13, 2009, Calgary, Alberta, Canada.
Copyright 2009 ACM 978-1-60558-606-9/09/08 ...$10.00.

1. INTRODUCTION

We consider the problem of providing storage to a distributed system with byzantine servers, that is, servers that misbehave due to failures or attacks. The system comprises clients and servers, all connected by a network. We particularly focus on systems where servers are remote with respect to clients, meaning that client-server links have much higher latencies and lower bandwidth than server-server links. For example, clients can be connected to servers via wireless networks, wide-area networks, or other thin links, while servers can be connected to each other by expensive network switches, backplanes, or fiber-optic networks. These are common in practice.

We propose new storage algorithms that are unique in two ways. First, they take into consideration the asymmetry in network connectivity by minimizing client-server communication. The client should send just one message to one server and then wait for a reply (in "good" runs with no failures, which are the most common). We do this using a simple common technique: a client signs the request and sends it to a server that acts as a proxy or coordinator; the coordinator then communicates with other servers. Because coordinators can be byzantine, they must prove to other servers that they are executing the protocol correctly. This is done with signatures. For this technique to work, we rely on a small amount of partial synchrony so that a client does not wait forever for a dead coordinator.

The second unique aspect of our algorithms is that they provide a property called limited effect. This property is motivated by a problem that we have identified with existing distributed storage algorithms that provide linearizability. We call this problem the destructive pending write problem. With linearizability, if a client issues a write request and then crashes, the write can take place at an arbitrary time in the future. This allows an adversary to mount an attack to destroy data at any chosen point in the future, say, years after the crash has occurred. All previous linearizable storage algorithms that we know of are vulnerable to this attack. Thus, in addition to linearizability, we require the limited effect property. Roughly speaking, this property prohibits the write of a crashed process from taking effect after the last step of the process. As a result, the effect of writes is limited: they cannot remain pending forever when a client crashes while writing.

Our algorithms ensure the limited effect property by using synchronized clocks—which are increasingly common due to GPS and NTP—and the following technique. Majority-replication algorithms typically rely on a write-back technique on reads: to read a value, a coordinator queries a majority of servers, picks the value with the highest timestamp, and writes back this value to a majority of bricks (using the value's original timestamp). This technique ensures that the read value will be seen by subsequent reads.
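The majority write-back read just described can be expressed in a few lines. The sketch below is a minimal, single-threaded stand-in for illustration only (not the paper's algorithm, and not fault-tolerant plumbing); all names are illustrative:

```python
# Sketch of the classic majority write-back read: query a majority, pick the
# value with the highest timestamp, and write it back with its ORIGINAL
# timestamp so that subsequent reads see it.

class Server:
    def __init__(self):
        self.value, self.ts = 0, 0          # register initialized to 0
    def get(self):
        return (self.value, self.ts)
    def put(self, value, ts):
        if ts >= self.ts:                   # keep only the newest value
            self.value, self.ts = value, ts

def read(servers):
    majority = len(servers) // 2 + 1
    replies = [s.get() for s in servers[:majority]]   # phase 1: query a majority
    value, ts = max(replies, key=lambda vt: vt[1])    # highest timestamp wins
    for s in servers[:majority]:                      # phase 2: write back, using
        s.put(value, ts)                              # the value's original timestamp
    return value
```

For instance, with three servers where one holds value 7 at timestamp 5, a read returns 7 and leaves it stored at a majority. Writing back with the original timestamp is the detail the text discusses next: it is exactly what lets an old pending write take effect later.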
Our algorithms also rely on the write-back technique, but rather than writing back with a value's original timestamp, they write back with a fresh timestamp. Doing so prevents old pending writes from having the highest timestamp, so that they will not take effect in the future.

Writing back with a fresh timestamp, however, creates a vulnerability: a byzantine coordinator can write back very old values with fresh timestamps, and thereby obliterate recently written values. We address this problem by requiring that a coordinator prove to servers that it is writing back an acceptable value. The proof consists of a statement signed by a large number of servers (n−f servers, where n is the number of servers and f is the maximum number that can be byzantine). If a server stores such a proof, it can subsequently prove to clients that the value it has is legitimate. This leads to a new and simple storage algorithm, called QUAD, that can tolerate up to f < n/3 byzantine servers. However, these proofs are somewhat large vectors with O(n) signatures, which take up space at servers and need to be verified by each server. As a result, with QUAD, each read or write operation checks O(n²) signatures, which is expensive and undesirable. We present a second algorithm, called LINEAR, that significantly reduces signature usage: servers store only O(1) signatures and each operation checks only O(n) signatures. LINEAR can tolerate up to f < n/4 byzantine servers, which is slightly fewer failures than what QUAD tolerates. Thus, QUAD and LINEAR provide a nice trade-off between resource usage and failure resiliency.

We allow clients to fail by crashing, and servers to fail by becoming byzantine. For example, clients could be closed devices or private machines behind a firewall; servers could be shared machines on the Internet, or general-purpose machines owned by a third party. We decided against allowing byzantine clients because they can launch an attack in which data is constantly overwritten with garbage, by spuriously sending messages that imitate the write protocol. For instance, even if a single client is byzantine, a read by a correct client can return random data. Thus, our system has a non-byzantine component, the client, which is responsible for generating requests. However, we minimize client participation in our algorithms: a client just timestamps, signs, and sends its requests to one server, waits for a reply, checks that the reply is valid, and, if not, resends the request to a different server.

Our algorithms implement an abortable register [3], which is a data structure that supports read and write operations that may sometimes abort their execution in the presence of concurrency.¹ Abortable registers are powerful: they can be used to implement any abortable object [3]. However, abortable registers are weaker than atomic registers, whose operations never abort. Thus, one might wonder if we could have implemented atomic registers instead. We answer this question in the negative: we show that there are no implementations of atomic registers that provide the limited effect property in systems with failures, even if failures are non-byzantine.

Thus, there is a fundamental trade-off between (a) abortable registers with the limited effect property, and (b) atomic registers without the limited effect property. With (a), read and write operations may abort if there are contending accesses, but the limited effect property prevents malicious adversaries from overwriting the register via pending writes of crashed processes. With (b), read and write operations never abort, but a malicious adversary can overwrite the contents of the register using pending writes. This situation begs the question: which is better? For storage systems, we think (a) is better for many reasons. First, aborted operations result only in a loss of liveness, whereas overwritten data is a loss of safety for the application. Second, if an operation aborts, it can simply be retried after a while. Using exponential backoff, contention eventually stops and the operation no longer aborts. On the other hand, if storage is overwritten, the old contents are lost forever. And third, a study of several real-world I/O traces of storage systems [13] found no concurrent conflicting operations to the same block (register). In other words, contention is extremely rare in real-world storage systems, meaning that aborts are very unlikely to occur. This has led to the development of real storage systems that employ abortable registers (e.g., the systems described in [13, 22, 23]) rather than atomic registers.

In summary, in this paper we study byzantine-tolerant storage systems. We identify the destructive pending write problem, which is present in previously proposed storage algorithms. To address this problem, we introduce the limited effect property and give algorithms that achieve this property. Our algorithms minimize client-server communication, which is important because client-server links tend to be thinner and slower than server-server links. We present two algorithms: the QUAD algorithm checks O(n²) signatures per operation and tolerates f < n/3 byzantine failures; the LINEAR algorithm checks O(n) signatures per operation and tolerates f < n/4 byzantine failures. Both algorithms implement an abortable register. We show that aborting is necessary: there are no implementations of (non-abortable) atomic registers that provide the limited effect property in systems with failures. Because of space limitations, most proofs are omitted from this extended abstract.

¹ This is different from a register that provides obstruction freedom [16]: with the latter, concurrent operations may never terminate, whereas with an abortable register, operations always terminate (possibly by aborting).

Algorithm      Low client-    Limited   Resil-    Register       Uses sync
               server comm.   effect    iency     implemented    clocks
SBQ-L [20]*    no             no        f < n/3   non-abortable  no
Phalanx [19]   no             no        f < n/4   non-abortable  no
GWGR [14]      no             no        f < n/4   non-abortable  no
CT [7]         no             no        f < n/3   non-abortable  no
CL [8]         no             no        f < n/3   non-abortable  no
Q/U [1]        no             no        f < n/5   non-abortable  no
HQ [10]        no             no        f < n/3   non-abortable  no
AAB [4]        no             no        f < n/3   non-abortable  no
Zyzzyva [18]   no             no        f < n/3   non-abortable  no
QUAD           yes            yes       f < n/3   abortable      yes
LINEAR         yes            yes       f < n/4   abortable      yes

* SBQ-L refers to the algorithm in [20] for unreliable asynchronous networks with byzantine clients and self-verifying atomic conformable writes.

Figure 1: Comparison of known algorithms for byzantine-tolerant storage.

2. RELATED WORK

The first practical protocol to implement a byzantine-tolerant state machine is in [8]. Using a state machine, it is easy to implement a storage service. Q/U [1] and HQ [10] are other protocols for state machines, which use a lightweight quorum protocol to improve performance in uncontended cases. Even more efficient state machine protocols have since been proposed, such as Zyzzyva [18]. Phalanx [19] is a distributed storage system that tolerates byzantine servers and clients. In Phalanx, clients broadcast read and write requests to servers, and servers do not communicate with each other. Phalanx also has a version of the protocol for honest clients, which is more efficient.
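The write-back proof idea from the introduction (a coordinator may write back a value only if it presents a statement signed by n − f servers) can be sketched as follows. This is not the paper's QUAD protocol: QUAD uses public-key signatures, while here per-server HMAC keys stand in for them, and all names are illustrative:

```python
# Sketch: a server accepts a write-back only with a proof containing at least
# n - f valid signatures over the (value, timestamp) statement.
import hmac, hashlib

N, F = 4, 1
KEYS = {i: bytes([i]) * 16 for i in range(N)}   # illustrative per-server keys

def sign(server_id, value, ts):
    msg = f"{value}:{ts}".encode()
    return hmac.new(KEYS[server_id], msg, hashlib.sha256).digest()

def make_proof(value, ts, signers):
    """Each signing server attests that (value, ts) was legitimately read."""
    return {sid: sign(sid, value, ts) for sid in signers}

def verify_proof(value, ts, proof):
    """Accept only if at least n - f of the signatures check out."""
    valid = sum(
        1 for sid, sig in proof.items()
        if hmac.compare_digest(sig, sign(sid, value, ts))
    )
    return valid >= N - F
```

With n = 4 and f = 1, a proof signed by three servers verifies, while a tampered value or a proof with only two signatures is rejected. This captures why a byzantine coordinator cannot obliterate recent values with stale write-backs: it cannot forge the missing signatures.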
The scheme in [20] also tolerates byzantine servers and clients: clients broadcast read and write requests to servers, and servers use a mechanism like reliable broadcast to propagate values among themselves. The scheme in [14] uses erasure codes to tolerate failures, which provides better space efficiency than replication. The scheme in [7] uses erasure codes, reliable broadcast, and threshold signature schemes. The scheme in [4] uses secret sharing and tolerates byzantine failures of readers and servers. Unlike our algorithms, the above schemes neither guarantee the limited effect property nor provide low client-server communication (see Figure 1).

It is possible to reduce the client-server communication of the schemes in [7, 8, 10, 19, 20]. For example, with [7, 10, 19, 20], we can (a) run their client protocol at a coordinating server, and (b) have real clients sign their requests and forward them to the coordinating server. With [8], a server can collect responses from other servers and then forward them in a single message to a client. These modifications, however, do not provide the limited effect property. For example, with [8], if a client issues a write request and crashes, this request might propagate among a minority M of servers. After many reads that do not involve M, some read that does involve M will cause the write to finally take effect, which violates the limited effect property. In fact, we show later that it is impossible to get an atomic register with the limited effect property, even if processes have synchronized clocks (to timestamp requests). Therefore no variants of existing atomic register algorithms will provide the limited effect property.

There are byzantine-tolerant algorithms for registers writable by just one client (e.g., [2, 6]), but this is a somewhat different service from what we provide. Abortable registers are defined in [3]. Other types of registers that may return an "abort" indication are ∆-registers [11] and ranked registers [9], which are intended to abstract mechanisms for fixing a value in consensus algorithms.

3. INFORMAL MODEL

Our message-passing distributed system has a finite set Π = Πs ∪ Πc of processes with Πs ∩ Πc = ∅. Processes in Πs are called servers, and processes in Πc are called clients. There are at least two clients and two servers in the system, and n is the number of servers. Processes communicate by sending messages over links. There is a link between every pair of servers and between every client and server. There may not be a link between clients, and clients may not be aware of each other.

The system is asynchronous with clocks and a liveness oracle: process speeds and message delays are arbitrary, but clients have synchronized clocks with range T = N, which need not be related to real time. (The liveness oracle is described below.) In practice, these clocks are reasonable because of GPS devices or NTP, which provide accurate clocks but not predictable network delays.² Links are reliable (they do not create, drop, or duplicate messages), encrypted (only the link endpoints can see data sent through them), and authenticated (the receiver knows who sent the message).

Each process (client or server) is an infinite automaton, whose execution proceeds in steps. Each step has two actions: (1) receive a message, or send a message, or receive an external input, or issue an external output, or do nothing, and (2) change state. Receiving external inputs and issuing external outputs are actions done only by client processes, to nondeterministically receive an operation request from the environment or output an operation's response, respectively. When a client receives an external input, it becomes active. The client remains active until it issues an external output, at which point the client becomes inactive. While active, a client does not receive (another) external input. While inactive, a client takes no steps except for a step that receives an external input.

Clients may fail by crashing permanently. We model a crash via a special crash step. A correct client is one that does not take a crash step; it takes infinitely many steps or it is active only finitely often (because it wants to execute only finitely many operations). When a client takes a non-crash step, it does so according to its automaton. Servers may fail by becoming byzantine. A correct server takes infinitely many steps, and does so according to its automaton. A faulty client or server is one that is not correct. Our algorithms require an upper bound f on the number of faulty servers (but not clients): f < n/3 for QUAD or f < n/4 for LINEAR.

When a client executes an operation, we want it to communicate with just one server in common "good" runs with no failures (we precisely define good runs below). But if this server is unresponsive, the client needs a timeout mechanism so it can try a different server. Thus, we require some amount of partial synchrony, which is abstracted as a correct server oracle. A client p accesses the oracle by reading a variable current-serverp, which holds a server id. This variable is changed by the oracle only. The oracle ensures that, for every correct client p, there is a server s that is indistinguishable by p from a correct server, and a time after which current-serverp = s. Different clients could have different servers. This oracle must be implemented outside our model, using partial synchrony and a simple increasing-timeout mechanism.

A good run is a run where all processes are correct and, for every client c, current-serverc never changes. Intuitively, the latter condition says that the correct server oracle outputs stable information.

Each server p has a pair (ep, dp) of public and private keys. All clients have the same pair (eclient, dclient) of public and private keys. All processes have all public keys. We assume byzantine processes cannot break public-key cryptography.

An implementation I is a set of automata, one for each process. We omit references to I when it is clear from context. A run (of I) is an infinite sequence of process steps. A run prefix P (of I) is a finite sequence of process steps. A continuation C of P (of I) is a run that has P as a prefix. Runs, run prefixes, and continuations are subject to obvious well-formedness rules (e.g., steps of processes follow their automata).

² NTP works over "asynchronous networks" like the Internet because NTP only requires small windows of stable network delays to calibrate the local hardware clocks of machines. The success of NTP in the Internet is documented in Chapter 6 of [21].

4. PROBLEM

We abstract a storage system through register objects. A register supports two operations, read() and write(v), such that read returns the last value v used in a write. An abortable register [3] is a variant of a register intended for systems with low contention. With an abortable register, a read or write operation may abort if it is executed concurrently with another operation (and only in this case). When an operation aborts, it returns a special value ⊥. The client may later retry the operation if it wants, hoping that the contention has ceased. If a write(v) operation aborts, it may leave the register unchanged or it may change the register's value to v, and which of the two possibilities actually occurs is not indicated to the client. The client can resolve this uncertainty by reading the register or writing to the register again. Abortable registers are weaker than standard registers, but they have been shown to be a powerful abstraction in systems with low contention [3]. It has also been shown that a typical storage system has low contention [13].
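The abortable-register interface described above, combined with the retry-after-backoff usage mentioned in the introduction, can be sketched as follows. This is a purely local, single-process stand-in for illustration (not a distributed implementation); all names are illustrative, and `None` plays the role of ⊥:

```python
# Sketch of the abortable-register interface: operations abort (return ⊥,
# here None) only under contention, and the client retries with backoff.
import random, time

ABORT = None  # stands for the special value ⊥

class AbortableRegister:
    def __init__(self):
        self.value = 0
        self.busy = False          # models a concurrent, conflicting operation
    def write(self, v):
        if self.busy:
            return ABORT           # aborts only when executed concurrently
        self.value = v
        return "ok"
    def read(self):
        if self.busy:
            return ABORT
        return self.value

def write_with_retry(reg, v, max_tries=8):
    """Retry an aborted write with randomized exponential backoff."""
    for attempt in range(max_tries):
        if reg.write(v) == "ok":
            return True
        time.sleep(random.uniform(0, 0.001 * 2 ** attempt))
    return False
```

Since contention is rare in practice, the retry loop almost never runs more than once; when it does, randomized exponential backoff lets contending clients spread out until one succeeds.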
We consider wait-freedom as the liveness requirement for all implementations in this paper. The safety requirement is given by linearizability [17]. Roughly speaking, linearizability requires operations, such as reads or writes, to appear to take effect instantaneously at some point during the operation interval: the time from when the operation is invoked until it returns a response. With linearizability, if a client process invokes a write and crashes, this write is pending: it may or may not take effect (i.e., change the register's state), and if it takes effect, it can do so at an arbitrary point after the operation's invocation. If a storage implementation is not designed carefully, it could provide a byzantine adversary with the choice of when to effect the pending write. This would allow the byzantine adversary to overwrite useful data at any one time in the future (possibly many days after the write was issued), so the storage system would partially lose its ability to retain data safely. This attack is possible with all previous byzantine-tolerant storage algorithms that we know of, including [1, 4, 7, 8, 10, 14, 18–20]. In these protocols, the problem is caused when a write for v starts, a minority M of servers store v, the coordinator crashes, and then there are many reads that do not involve M and hence do not return v, but at some far time in the future, a read that involves M returns v. This problem cannot be solved with simple techniques like expiration of write requests or expiring tokens.³

To prevent infinitely pending writes in storage systems, we require that implementations of registers satisfy the following limited effect property. Roughly speaking, if a (client) process starts a write and crashes, then the write must appear to take effect before the crash, or never take effect. More precisely, we consider histories with operation requests and responses (as in linearizability), augmented with crash events. Limited effect requires that, for any history H, if a (client) process requests an operation op1 and it crashes before a (client) process requests an operation op2, then op1 is not ordered after op2 in the linearization of H.

We are interested in (abortable) register implementations that consume low client-server bandwidth. We say that an (abortable) register implementation has low client-server communication if, in good runs, a client sends and receives one message per read or write operation. Our goal is to implement abortable registers that (a) are linearizable, (b) have the limited effect property, and (c) have low client-server communication.

³ Although these techniques avoid the specific scenario above, they have bad side effects: they may undo writes that actually completed before expiration, violating linearizability.

5. IMPOSSIBILITY RESULT

We now show that it is impossible to implement an atomic register that guarantees the limited effect property. Intuitively, this impossibility arises because limited effect requires reads to obliterate writes that are pending due to a crash, but such writes are indistinguishable from slow writes.

Theorem 1 In a system with at least two clients and two servers where a client and a server may crash, there is no implementation of an atomic register satisfying the limited effect property, even if processes have synchronized clocks.

Proof. Consider a system with at least two clients and two servers where a client and a server may crash. For a contradiction, suppose that there is an implementation I of an atomic register initialized to 0 that satisfies the limited effect property. The proof outline is as follows. We first define valency for solo continuations in which a client invokes read() [12, 15]. We then consider a run R of write(1), which is initially 0-valent. We show how to extend R step by step, such that the run remains 0-valent forever, and therefore the write(1) never terminates—which contradicts the correctness of implementation I.⁴

⁴ The run R we construct will be a run of the model, whether the model has synchronized clocks or not.

Definition 2 Given a run prefix P, a read-continuation of P is any continuation C of P in which (1) first, an inactive client c invokes read(), and (2) then, client c and n−1 or n servers take steps until the client receives some response v for its read. We say that v is the value read in C and c is the reader in C.

Henceforth in this proof, we only consider run prefixes with 0 or 1 active clients (run prefixes with more active clients may not have continuations, since I only tolerates one client crash). We now use read-continuations to define v-valent prefixes.

Definition 3 A run prefix P is v-valent if, for every read-continuation C of P, v is the value read in C. A run prefix P is ambivalent if it is not v-valent for any v.

For example, the empty prefix is 0-valent.

Lemma 4 Let P be a run prefix where some client c0 is inactive and let s be a step of some client c ≠ c0 applicable to P. If P is v-valent then P · s is not v′-valent for any v′ ≠ v.

Proof. Let P be a run prefix where some client c0 is inactive and let s be a step of some client c ≠ c0 applicable to P. By way of contradiction, suppose that P is v-valent and, for some v′ ≠ v, P · s is v′-valent. In step s, process c sends 0 or 1 messages and changes its state. If c sends a message, let q be the recipient server; otherwise let q be any server. Since c0 is inactive in P, there are read-continuations of P in which c0 is the reader. Consider a read-continuation C of P in which c0 is the reader and q does not take any steps. Since P is v-valent, v is the value read in C. Because neither q nor c take any steps in C, C is also a read-continuation of P · s, and it is one in which v is the value read. Therefore P · s is not v′-valent—a contradiction.

We now use Lemma 4 to construct a run R in which some client w invokes write(1) but never completes it. Intuitively, to construct R, we start with an empty run prefix P0, which is 0-valent. We let w take a step and apply Lemma 4 to know that the prefix is not 1-valent, and therefore w has not completed its write. Because the prefix is not 1-valent, we find a read-continuation that reads 0. We append this read-continuation to our run; the resulting prefix is 0-valent by the limited effect property. We then append another read-continuation in which all servers take steps, to ensure that, in our final run, all servers take steps forever. We now have a run prefix that is 0-valent but contains one more step of w than P0. By repeating the construction, we get a run in which w takes infinitely many steps without completing its write.

We now describe R more precisely. Henceforth, let w be some arbitrary but fixed client. We construct a sequence {Pi} of run prefixes inductively. In each Pi, there is one write, write(1) by client w. P0 is the run prefix with a single step, in which w receives an external input to write(1).

Lemma 5 P0 is 0-valent.

Proof. In the step of w in P0, w receives an external input and changes state. Thus, read-continuations of P0 are indistinguishable from read-continuations where no write occurs. So, P0 is 0-valent.

Now assume we have a 0-valent run prefix Pi. Let si be some step of w applicable to Pi and Qi = Pi · si.

Lemma 6 Qi is not 1-valent and w is active in Qi.

Proof. By Lemma 4, Pi · si is not 1-valent.
Therefore, w has not completed its write(1) in Pi · si (otherwise subsequent reads would return 1 and Pi · si would be 1-valent). Therefore w is active in Pi · si.

Corollary 7 There is a continuation from Qi where w takes no steps, and some client c executes read() and returns 0.

Proof. Since Qi is not 1-valent, there is a read-continuation of Qi whose value read is v ≠ 1 and whose reader is not w. Then v must be 0, because 1 is the only value that has been written in Qi.

Definition 8 Let S be a continuation from Qi from Corollary 7 where some client ci executes read() and it returns 0. Let Ti be the run prefix of S up to when ci finishes its read.

Lemma 9 Ti is 0-valent.

Proof. Ti has only one write, namely, write(1) by client w. Note that the last step of w in Ti is before ci starts read(). By the limited effect property, if w takes no further steps then its write cannot take effect after its last step. Therefore, any reads that start from Ti must return 0, so Ti is 0-valent.

In the read() just added to Ti, one server might not take any steps. We now append to Ti a read where all servers take steps. The resulting run prefix will still be 0-valent since Ti is 0-valent.

Definition 10 Let S be a continuation from Ti where some client c executes read() and it returns 0. Let Pi+1 be the run prefix of S up to when c finishes its read. (Note that Pi+1 is a continuation of Ti.)

Lemma 11 Pi+1 is 0-valent.

Proof. By Lemma 9, Ti is 0-valent. Thus, Pi+1 is also 0-valent.

This finishes our inductive construction: we started with a 0-valent prefix Pi and constructed a 0-valent prefix Pi+1 where w takes another step without completing its write(1). Taking the limit, we get a run R in which w takes infinitely many steps and never completes its write(1). Thus, Theorem 1 is proved.

We note that the impossibility result holds for wait-free implementations; it leaves open the possibility of implementations with weaker liveness guarantees, such as obstruction freedom.

6. ALGORITHM THAT TOLERATES CRASH FAILURES

We first explain a simpler algorithm for abortable registers that tolerates crash failures and provides the limited effect property. This algorithm requires that f < n/2, i.e., a majority of servers

To write a value v:
1. client obtains a unique timestamp T
2. client sends v and T to the current coordinator c = current-serverp
3. coordinator c sends v and T to all servers s   (* write phase *)
4. if a server s saw any timestamp > T, it returns ⊥
   (* 'saw' means 'ever received in any messages containing...' *)
5. else s stores (v, T) and returns ok
6. c waits for n − f replies
7. if some reply is ⊥ then c returns ⊥
8. else c returns ok
9. client waits for reply or change of coordinator (current-serverp)
10. if change of coordinator then goto 2

To read a value:
1. client obtains a unique timestamp T
2. client sends T to the current coordinator c = current-serverp
3. coordinator c sends T to all servers s   (* read phase *)
4. if a server s saw timestamp > T, it returns ⊥
5. else s returns its stored value and timestamp
6. c waits for n − f replies
7. if some reply is ⊥ then c returns ⊥
8. c picks the reply value v∗ with largest timestamp
9. c sends v∗ and T to all servers s   (* write phase *)
10. if s saw timestamp > T, it returns ⊥
11. else s stores (v∗, T) and returns ok
12. c waits for n − f replies
13. if some reply is ⊥ then c returns ⊥
14. else c returns v∗
15. client waits for reply or change of coordinator (current-serverp)
16. if change of coordinator then goto 2

Figure 2: Algorithm that tolerates crash failures.

The client then returns v∗ as the value read. Phase 2 is needed to ensure that v∗ is stored at a majority of servers, in case the write of v∗ is in progress or did not finish.

Our algorithm, shown in Figure 2, employs two trivial modifications to the ABD algorithm: (1) A client does not talk to all servers. Instead, it sends its read or write command to one server, which then acts as a proxy/coordinator for the client. This saves client-server bandwidth. (2) When a client writes, it obtains its new timestamp from its synchronized clock instead of querying the servers. This saves one phase when writing.

We also make a fundamental change to the ABD algorithm, to obtain the limited effect property: in phase 2 of the read protocol, we pick a fresh new timestamp to write back v∗, instead of writing back with the original timestamp of v∗ as in the ABD algorithm. This is called timestamp promotion, and it ensures the limited effect property: if there is a lurking pending write for value vbad, then a subsequent read operation will cause the value read to be written back with a higher timestamp than vbad, making it impossible for vbad to take effect subsequently. However, timestamp promo-
are correct. tion breaks linearizability, as the following scenario shows: (a) the
Our algorithm is derived from the algorithm by Attiya, Bar-Noy, register’s initial value is 0, (b) a write of 1 starts with timestamp
and Dolev [5], which we call the ABD-algorithm and summarize 1, (c) a read also starts, finds 0 at n−f server, and writes back 0
now. Basically, each server stores a (value, timestamp) pair. To with timestamp 2, and (d) the write of 1 completes. In this case,
write a value v, a client proceeds in two phases. In phase 1, the subsequent reads will return 0 (which now has timestamp 2), even
client asks servers for their stored timestamp, waits for n−f times- though the write of 1 completed. We solve this problem by causing
tamps, and picks a new timestamp T that is larger. In phase 2, the the write of 1 to return ⊥ (abort)—this is shown, for example, in
client asks servers to store (v, T ). To read a value, a client also line 4 of the algorithm for writing or reading. Here, if a server has
proceeds in two phases. In phase 1, the client asks servers for their seen previously a request with higher timestamp then it returns ⊥,
stored value-timestamp pairs, waits for n−f replies, and picks the causing the coordinator to also return ⊥ (abort). We can do that
pair (v ∗ , T ∗ ) with highest timestamp. In phase 2, the client writes because we need only implement an abortable register, in which
back (v ∗ , T ∗ ) to the servers and waits for n−f acknowledgements. concurrent operations can abort.
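To make the interplay between timestamp promotion and the abort rule concrete, here is a small Python simulation of the crash-tolerant protocol. This sketch is ours, not from the paper: servers are in-memory objects, a global counter stands in for synchronized clocks, and message loss and coordinator changes are not modeled.

```python
# Sketch (our code) of the crash-tolerant protocol of Figure 2.
import itertools

N, F = 3, 1                       # n servers, up to f crashes (f < n/2)
clock = itertools.count(1)        # stand-in for a synchronized clock

class Server:
    def __init__(self):
        self.value, self.ts = None, 0   # stored (value, timestamp) pair
        self.max_seen = 0               # highest timestamp seen so far

    def store(self, v, t):
        # Abort rule: a server that saw a higher timestamp returns ⊥ (None).
        if t < self.max_seen:
            return None
        self.max_seen = t
        if t > self.ts:
            self.value, self.ts = v, t
        return "ok"

servers = [Server() for _ in range(N)]

def write(v):
    t = next(clock)                          # timestamp from the client's clock
    replies = [s.store(v, t) for s in servers[:N - F]]
    return "ok" if all(replies) else None    # ⊥ if any server aborted

def read():
    t = next(clock)
    quorum = servers[:N - F]
    # Read phase: pick the value with the largest timestamp among n-f replies.
    v_star = max(quorum, key=lambda s: s.ts).value
    # Write-back with timestamp *promotion*: use fresh timestamp t,
    # not the original timestamp of v_star.
    replies = [s.store(v_star, t) for s in quorum]
    return v_star if all(replies) else None

write(0)
assert read() == 0

# Limited effect: a writer obtains a timestamp and then stalls...
t_old = next(clock)
assert read() == 0                  # ...a later read promotes 0 past t_old...
assert all(s.store(99, t_old) is None for s in servers[:N - F])  # ...so the
# lurking write can only abort; it can no longer take effect.
```

The final assertions replay the limited effect argument: once a read promotes the stored value past a timestamp that a stalled writer obtained earlier, that writer's pending write aborts at every server in the quorum.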
7. THE QUAD ALGORITHM

We now extend the algorithm of Section 6 to tolerate byzantine servers (including byzantine coordinators) as long as f < n/3. We use a simple and common idea: the client, who is always honest, signs requests that servers can execute with confidence, while servers sign responses so that the client knows its request was fulfilled. For example, to write v to the register, a client signs a request with v and a new timestamp T, and sends it to the coordinator. The coordinator then asks servers to store v and T, attaching the client's signature. Servers also store this signature, to be used later in reads. It is important that the signature includes both v and T, not just v, for otherwise a malicious coordinator could overwrite new values with old values. Each server responds with a signed acknowledgement that it stored v and T, which the coordinator returns to the client as proof of execution of the write operation.

The protocol to read the register is more complex because of timestamp promotion: after querying n−f servers and picking the value v∗ with the largest timestamp, the coordinator needs to write back v∗ with the new timestamp T, but there is no client signature authorizing that.5 We need to prevent the coordinator from cheating on the write-back value. The coordinator could ask the client to sign a request for v∗ and T, but this scheme requires extra client-server communication, which we want to avoid. In our algorithm, the coordinator sends to the servers the set S of n−f replies from which v∗ is picked. Each server can then validate the write-back by looking at S and verifying that the coordinator chose v∗ correctly. To prevent the coordinator from forging S, each of the n−f replies in S must be signed by a different server.

This solution, however, has a problem: if a byzantine coordinator happens to receive more than n−f replies, it could generate two different sets S1 and S2 of n−f replies each, such that the v∗ of each set is different. By doing so, the coordinator can convince different servers to write back different values. We solve this problem by adding a certification phase to the protocol, which forces the coordinator to certify a value-timestamp pair before it is stored at a server. At most one value-timestamp pair can be certified for a given timestamp. This phase works as follows: (1) The coordinator sends S, v∗, and T (the new timestamp) to all servers. (2) Servers check that v∗ is computed correctly from S and reply with a signed statement including S, v∗, and T. An honest server signs at most one statement for a given T; to keep track of that, a server remembers the largest T used in a statement it previously signed, and it rejects signing statements with smaller or equal timestamps. (Timestamps T are signed by clients, so that a byzantine server cannot cause servers to reject signing statements for the largest timestamp.) (3) The coordinator collects signed statements from n−f servers into a set valproof, called a certificate. Intuitively, the certificate serves as proof that v∗ can be promoted to timestamp T. Thus, the coordinator attaches the certificate to the write-back request of v∗, and each server then verifies that the certificate is correct (all statements refer to v∗ and T, and they are signed by n−f servers). If so, the server stores v∗, T, and the certificate. The server needs to store the certificate so that later, when it replies to a read request, it can prove that its value-timestamp pair (v∗, T) is legitimate. In other words, a server can store either a (v, T) that comes from a write, or a (v, T) that comes from a write-back. In the first case, there is a client signature on (v, T), and in the second case, there is a certificate for (v, T).

To write a value v:
1. client gets a unique signed timestamp T
2. sig := client signs (WRITE, v, T)
3. client sends (v, T, sig) to the coordinator c = current-serverp
(* write phase *)
4. coordinator c sends (v, T, sig) to all servers s
5. a server s checks (v, T, sig); if bad then return ⊥
6. if s saw timestamp > T, it returns (⊥, highest seen timestamp)
7. else s stores (v, T, sig) and returns signed (ok, v, T)
8. c waits for n − f valid replies
9. if some valid reply is (⊥, T′), c returns T′
10. c sets stoproof := { n−f valid replies }
11. c returns (ok, stoproof)
12. client waits for reply or change of coordinator (current-serverp)
13. if bad reply or coordinator changes then goto 2

To read a value:
1. client gets a unique signed timestamp T
2. sig := client signs (READ, T)
3. client sends (T, sig) to the coordinator c = current-serverp
(* read phase *)
4. coordinator c sends (T, sig) to all servers s
5. a server s checks (T, sig); if bad then return ⊥
6. if s saw timestamp > T or s saw T in this phase before then s returns (⊥, highest seen timestamp)
7. bindsig := s signs (BIND, T, timestamp of stored value)
8. s returns stored value, timestamp, proof, and bindsig
9. c waits for n − f valid replies
10. if some valid reply is (⊥, T′), c returns (⊥, T′)
11. c sets maxproof := set of bindsigs in the n−f valid replies
12. c picks the valid reply (v∗, T∗, p∗, b∗) with largest T∗
(* certification phase *)
13. c sends (v∗, T, T∗, p∗, maxproof, sig) to all servers s
14. s checks (v∗, T, T∗, p∗, maxproof, sig); if bad then return ⊥
15. if s saw timestamp > T or s saw T in this phase before then s returns (⊥, highest seen timestamp)
16. s returns a signature on (ACKVAL, v∗, T)
17. c collects n − f valid replies
18. if some valid reply is (⊥, T′), c returns (⊥, T′)
19. c sets valproof := set of n − f valid replies
(* write phase *)
20. c sends (v∗, T, valproof) to all servers s
21. s checks (v∗, T, valproof); if bad then return ⊥
22. if s saw timestamp > T, it returns (⊥, highest seen timestamp)
23. else s stores (v∗, T, valproof) and returns signed (ok, v∗, T)
24. c waits for n − f valid replies
25. if some valid reply is (⊥, T′), c returns T′
26. c sets stoproof to the n − f valid replies
27. c returns (v∗, stoproof)
28. client waits for reply or change of coordinator (current-serverp)
29. if bad reply or coordinator changes then goto 1

Figure 3: QUAD algorithm for byzantine failures.
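The server-side checks described above can be sketched as follows. This is our illustration, not the paper's code: HMAC with per-server keys stands in for the digital signatures the paper assumes, and the names pick_winner, validate_writeback, and CertifyingServer are ours.

```python
# Sketch (our code) of the QUAD write-back validation and certification checks.
import hmac, hashlib

N, F = 4, 1                                      # n servers, f < n/3 byzantine
KEYS = {i: bytes([i + 1]) * 16 for i in range(N)}  # per-server signing keys

def sign(server_id, msg):
    return hmac.new(KEYS[server_id], repr(msg).encode(), hashlib.sha256).digest()

def verify(server_id, msg, sig):
    return hmac.compare_digest(sign(server_id, msg), sig)

def pick_winner(S):
    # S: list of (server_id, value, timestamp, sig) read replies;
    # the write-back value is the one with the largest timestamp.
    return max(S, key=lambda r: r[2])[1]

def validate_writeback(S, v_star, T_new):
    """A server's check before certifying that v_star may be promoted to T_new."""
    signers = {sid for (sid, _, _, _) in S}
    if len(signers) < N - F:                     # n-f replies, distinct signers
        return False
    if not all(verify(sid, (v, t), sig) for (sid, v, t, sig) in S):
        return False                             # a forged reply in S
    return pick_winner(S) == v_star              # coordinator chose correctly

class CertifyingServer:
    def __init__(self, sid):
        self.sid = sid
        self.max_certified = -1                  # largest T it ever signed for

    def certify(self, S, v_star, T_new):
        # Sign at most one (ACKVAL, v*, T) statement per timestamp T.
        if T_new <= self.max_certified or not validate_writeback(S, v_star, T_new):
            return None                          # ⊥
        self.max_certified = T_new
        return sign(self.sid, ("ACKVAL", v_star, T_new))

# A coordinator gathers n-f signed read replies...
S = [(i, 10 * i, i, sign(i, (10 * i, i))) for i in range(N - F)]
assert pick_winner(S) == 20

srv = CertifyingServer(3)
assert srv.certify(S, 20, 99) is not None        # first statement for T = 99
assert srv.certify(S, 20, 99) is None            # at most one statement per T
assert not validate_writeback(S, 0, 100)         # forged winner is rejected
```

The last three assertions mirror the text: a server validates the coordinator's choice against S, refuses to sign two statements for the same timestamp (so a byzantine coordinator cannot get two conflicting certificates), and rejects a write-back value that does not follow from S.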
The algorithm is given in Fig. 3 and its timeline is shown in Fig. 4. Due to space limitations, the full pseudo-code of the algorithm (which includes detailed checking of message formatting and signatures, which is straightforward but extensive) is omitted from this paper.

5 Note that the client previously signed a request to write v∗ with timestamp T∗, but not with the new timestamp T.

[Timeline diagram omitted from this text extraction. It shows a write as a single write phase in which servers store (v, T, proof), where proof is the client's signature, and a read as three phases: a read phase in which servers retrieve (v, ts, proof) and return signed (BIND, T, ts); a certification phase in which the coordinator checks all proofs, picks v∗ with maximum timestamp, and assembles maxproof, and servers sign (ACKVAL, v∗, T); and a write phase in which servers store (v∗, T, valproof).]

Figure 4: Timeline depiction of QUAD algorithm.

1. Initially, all servers hold a value v with timestamp T.
2. Then, a byzantine server changes its stored value to some old value v̂ but with a new timestamp T0 > T. Timestamp T0 comes from a client that started a read but then crashed.
3. Next, a client requests a write of v1 with timestamp T1 > T0, the request goes to a byzantine coordinator, the coordinator sends (v1, T1) to only one correct server, and the client crashes.
4. Similarly for values v2, ..., vf: for each value, some client requests a write of vj with timestamp Tj > Tj−1, the request goes to a byzantine coordinator, the coordinator sends (vj, Tj) to only one correct server (and the server is different for each j), and the client crashes.
5. After all this, we have f correct servers holding values v1, ..., vf with timestamps T1, ..., Tf, respectively, and one byzantine server holding value v̂ with timestamp T0. If a read occurs next, winning-rule attempt 1 incorrectly picks v̂ as the value to be returned to the client. But the only acceptable values that could be picked (according to linearizability) are v and the vj's.

Figure 5: Scenario where winning-rule attempt 1 breaks.

Theorem 12 Consider a system with n servers and k clients where up to f < n/3 servers can be byzantine and any number of clients can crash. The QUAD algorithm implements an abortable register and satisfies the limited effect property. In good runs, for each operation a client sends and receives only one message, and servers check O(n²) signatures.

8. THE LINEAR ALGORITHM

With the QUAD algorithm, space at each server is Θ(n) (since a server may have to store a certificate with a signature from n−f servers), and reading a value may involve checking all the signatures of n−f certificates, for a total of Θ(n²) signatures. In the LINEAR algorithm, we reduce signature usage significantly: servers do not store certificates, and reading does not require checking certificates. As a result, space at each server is O(1) and operations check O(n) signatures. There is a trade-off: the algorithm tolerates up to f < n/4 failures instead of f < n/3.

To understand how the LINEAR algorithm works, consider what might happen to the QUAD algorithm if servers did not keep certificates and read coordinators did not check them. Then, a byzantine server could falsely claim that an old value has been promoted to a new timestamp. If this were to happen, the next read request would return an old value (which was promoted to the highest timestamp), and this would violate linearizability. It appears that this problem is solved by requiring timestamps to be signed by clients, but this does not help: a client may sign a new timestamp for reading, send this timestamp to a byzantine coordinator, and then crash. Now a byzantine server has a signed timestamp, and so the attack mentioned above is possible.

The QUAD algorithm solved this problem by using certificates to prevent byzantine servers from promoting old values to new timestamps. The LINEAR algorithm uses a different solution: it employs a new mechanism to pick the "winning value" in a read operation, instead of picking the value with the highest timestamp. The mechanism ensures that even if byzantine servers promote old values to new timestamps, these values are not picked by a read coordinator, even if the coordinator cannot tell that these values were maliciously promoted.

To understand how this mechanism works, first note that there are at most f byzantine servers. Therefore, the read coordinator might use the following rule to choose the value that will be returned:

Winning-rule attempt 1: Order values by timestamp, breaking ties arbitrarily; discard the top f values; pick the top value that is left.

The intuition is that, after a value is written or written back, it is stored with the highest timestamp at n−f servers. Later, if f byzantine servers try to promote old values to larger timestamps, the (f+1)-th top value is still an honest value. This mechanism, however, breaks under a slightly more sophisticated attack, shown in Fig. 5.

Another natural rule might be the following:

Winning-rule attempt 2: Discard values stored at fewer than f+1 servers; among the values left, pick the one with the highest timestamp.

The intuition is that, since there are only f byzantine servers, the above rule discards any maliciously promoted values that those f servers might hold. The problem here, however, is that this rule might end up discarding all values: there is a scenario in which each server (including correct ones) ends up with a different value. This scenario uses the idea in steps (3) and (4) of Fig. 5: a client starts a write, sends its request to a byzantine coordinator, which stores the value at a single server, and then the client crashes.
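Both flawed rules are easy to replay concretely. The sketch below is ours (the function names and the small constants are not from the paper): it shows winning-rule attempt 1 returning the stale value v̂ in a Fig. 5-style scenario with f = 1, and winning-rule attempt 2 discarding every value when each server holds a different one.

```python
# Replaying Fig. 5 with n = 5, f = 1 (our sketch; "replies" are
# (value, timestamp) pairs reported by n - f = 4 servers).
from collections import Counter

F = 1

def attempt1(replies):
    # Order values by timestamp, discard the top f, pick the top value left.
    ordered = sorted(replies, key=lambda r: r[1], reverse=True)
    return ordered[F][0]

def attempt2(replies):
    # Discard values stored at fewer than f+1 servers; among the values
    # left, pick the one with the highest timestamp.
    counts = Counter(v for v, _ in replies)
    left = [r for r in replies if counts[r[0]] >= F + 1]
    return max(left, key=lambda r: r[1])[0] if left else None

T, T0, T1 = 10, 20, 30                       # T < T0 < T1, as in Fig. 5
# Two correct servers hold (v, T), one correct server holds the half-written
# (v1, T1), and the byzantine server reports the promoted stale (vhat, T0).
replies = [("v", T), ("v", T), ("v1", T1), ("vhat", T0)]
assert attempt1(replies) == "vhat"           # attempt 1 picks the stale value
assert attempt2(replies) == "v"              # attempt 2 survives this one...
# ...but discards everything when every server holds a distinct value:
assert attempt2([("a", 1), ("b", 2), ("c", 3), ("d", 4)]) is None
```

The first assertion is exactly the failure in Fig. 5: discarding the top f timestamps throws away the honest half-finished write and promotes the byzantine server's stale value to winner.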
It is unclear that there exists a correct winning rule that uses only the current timestamp of a value. In our LINEAR algorithm, the winning rule uses two timestamps: the current one, and the one used originally to write the value. For example, suppose v is first written with timestamp T1 and, later, a write-back promotes v's timestamp to T2. Then, each server stores v, T1, and T2, where T1 is called the left timestamp of v, and T2 is the right timestamp of v. If T1 has not been promoted, then T2 = T1. Note that servers do not keep the entire history of timestamps of a value: they keep only the original timestamp and the latest promoted timestamp. For example, if a subsequent write-back promotes v's timestamp to T3, then only T1 and T3 are stored, not T2. A left timestamp comes from a write operation, and there is a client signature that binds the timestamp to the value being written. A right timestamp, if different from the left timestamp, comes from the timestamp promotion in a read operation; there is a client signature on the timestamp, but it does not bind it to any value. We often combine the left and right timestamps into a pair [T1, T2] or into a triple [T1, T2, v], where v is the value bound to T1.

The LINEAR algorithm uses the following rule (below we give some intuition of why it works):

Winning rule: (1) Among the n−f triples obtained from servers, find the set, called candSet, of 2f+1 triples with the largest right timestamps, breaking ties arbitrarily. (2) If some timestamp T0 is the left timestamp of f+1 or more triples in candSet, pick any such triple as winner. (3) Otherwise, pick the triple in candSet with the largest left timestamp, breaking ties arbitrarily.

The algorithm is shown in Figure 6. Due to space limitations, the algorithm's full pseudo-code (which includes detailed checking of message formatting and signatures, which is straightforward but extensive) is omitted from this paper.

To write a value v:
∗ 1. client gets a unique signed timestamp T tagged with 'W'
(* tag ensures timestamp is used only as left timestamp *)
2. sig := client signs (WRITE, v, T)
3. client sends (v, T, T, sig) to the coordinator c = current-serverp
(* write phase *)
∗ 4. coordinator c sends (v, T, T, sig) to all servers s
∗ 5. a server s checks (v, T, T, sig); if bad then return ⊥
6. if s saw timestamp > T, it returns (⊥, highest seen timestamp)
∗ 7. else s stores (v, T, T, sig) into (Value, Left-ts, Right-ts, Sig) and returns signed (ok, v, T)
8. c waits for n − f valid replies
9. if some valid reply is (⊥, T′), c returns T′
10. c sets stoproof := { n−f valid replies }
11. c returns (ok, stoproof)
12. client waits for reply or change of coordinator (current-serverp)
13. if bad reply or coordinator changes then goto 2

To read a value:
∗ 1. client gets a unique signed timestamp T tagged with 'R'
2. sig := client signs (READ, T)
3. client sends (T, sig) to the coordinator c = current-serverp
(* read phase *)
4. coordinator c sends (T, sig) to all servers s
5. a server s checks (T, sig); if bad then return ⊥
6. if s saw timestamp > T or s saw T in this phase before then s returns (⊥, highest seen timestamp)
∗ 7. bindsig := s signs (BIND, T, Left-ts, Right-ts)
∗ 8. s returns (Value, Left-ts, Right-ts, Sig, bindsig)
9. c waits for n − f valid replies
10. if some valid reply is (⊥, T′), c returns (⊥, T′)
11. c sets reps := set of n−f valid replies
∗ 12. c uses winning rule to pick v∗, left-ts∗, right-ts∗ and associated sig∗, bindsig∗
(* certification phase *)
∗ 13. c sends (v∗, T, sig∗, reps, sig) to all servers s
∗ 14. s uses winning rule to recompute v∗ from reps and checks if (v∗, T, sig∗, reps, sig) is valid; if bad then return ⊥
15. if s saw timestamp > T or s saw T in this phase before then s returns (⊥, highest seen timestamp)
16. s returns a signature on (ACKVAL, v∗, T)
17. c collects n − f valid replies
18. if some valid reply is (⊥, T′), c returns (⊥, T′)
19. c sets valproof := set of n − f valid replies
(* write phase *)
∗ 20. c sends (v∗, left-ts∗, T, sig∗, valproof) to all servers s
∗ 21. s checks (v∗, left-ts∗, T, sig∗, valproof); if bad then return ⊥
22. if s saw timestamp > T, it returns (⊥, highest seen timestamp)
∗ 23. else s stores (v∗, left-ts∗, T, sig∗) into (Value, Left-ts, Right-ts, Sig) and returns signed (ok, v, T)
24. c waits for n − f valid replies
25. if some valid reply is (⊥, T′), c returns T′
26. c sets stoproof to the n − f valid replies
27. c returns (v∗, stoproof)
28. client waits for reply or change of coordinator (current-serverp)
29. if bad reply or coordinator changes then goto 1

Figure 6: LINEAR algorithm that tolerates byzantine failures. Asterisks indicate changes relative to the QUAD algorithm.

The algorithm ensures the following key property:

Theorem 13 In any run, if some read or write operation succeeds, resulting in n−2f correct servers storing the same triple [T1, T2, v], then afterwards the winning rule will never select an old, stale value (one whose left timestamp is less than T1).

Thus, a read always returns a relatively recent value, and this implies linearizability. We now provide a proof sketch of why the above property holds. The following is an important property of triples stored at servers:

Lemma 14 Suppose some set S1 of n−2f correct servers store the same triple [T1, T2, v]. If some correct server ever stores a triple [T1′, T2′, v′] with T2′ > T2, then either T1′ = T1 or T1′ > T2.

With this lemma, we show Theorem 13 as follows. Suppose some set S1 of n−2f correct servers store the triple [T1, T2, v]. Later, suppose we apply the winning rule to a set S2 of n−f triples (each triple from a different server), and consider the candSet computed in the rule. Then (1) candSet has at least one triple from a server in S1, since candSet has 2f+1 elements, and (2) S2 has at least n−3f elements from S1. Since f < n/4, we have n−3f ≥ f+1. There are two cases:

Case 1. Assume that some timestamp T0 is the left timestamp of f+1 or more triples in candSet, as in part (2) of the winning rule. From (2), we have that (3) S2 has at least f+1 elements from S1, which are all correct servers. Let goodCandSet be the triples in candSet from correct servers. Since candSet has 2f+1 triples, goodCandSet has at least f+1 triples. Servers in S1 cannot replace their right timestamps with a timestamp smaller than T2, since a correct server rejects requests to store right timestamps lower than its own. Thus, from (3), goodCandSet has at least f+1 triples with right timestamps equal to T2 or greater. If such a triple has a right timestamp greater than T2 then, by Lemma 14, its left timestamp is either T1 or greater than T2. If such a triple has a right timestamp equal to T2, then its left timestamp is equal to T1 (since (a) when a read coordinator is upgrading timestamps to T2, it must commit to a single value, and that value is v, and (b) the left timestamp of a triple is bound to its value through a client signature). Note that there are at most f triples in candSet that are not in goodCandSet. Therefore, timestamp T0 (the timestamp that is the left timestamp of f+1 or more triples in candSet) is either equal to T1 or greater than T2. Thus, the winning rule does not choose a triple whose left timestamp is less than T1.

Case 2. Now assume that no such timestamp T0 exists, i.e., part (3) of the winning rule applies. By (1), candSet has at least one triple from a server in S1. Let p be such a server. If p changes its triple from [T1, T2, v] to something else, then its right timestamp increases, so by Lemma 14, its left timestamp either remains T1 or increases beyond T2. Therefore, the largest left timestamp among the triples in candSet is at least T1. Thus, the winning rule does not choose a triple whose left timestamp is less than T1.

This shows Theorem 13. It is worth noting that Theorem 13 does not hold if we change the winning rule so that candSet has 2f+2 instead of 2f+1 triples with the largest timestamps. Intuitively, the reason is that part (2) of the rule could be triggered for a timestamp T0 smaller than T1 in the argument above.

Theorem 15 Consider a system with n servers and k clients where up to f < n/4 servers can be byzantine and any number of clients can crash. The LINEAR algorithm implements an abortable register and satisfies the limited effect property. In good runs, for each operation a client sends and receives only one message, and servers check O(n) signatures.

9. CONCLUSION

We considered the problem of handling byzantine servers in distributed storage systems. We presented algorithms for abortable registers that ensure the limited effect property, while minimizing the communication between clients and servers. These algorithms trade off resiliency for efficiency: the first algorithm tolerates f < n/3 failures and checks O(n²) signatures per operation, while the second algorithm tolerates f < n/4 failures and checks O(n) signatures per operation. Some interesting questions for future work are the following. Is there an algorithm providing the best resiliency and efficiency of both our algorithms? Is it possible to avoid the use of synchronized clocks without adding phases of communication? Is it possible to do reads in fewer than three phases? Our algorithms ensure that operations always terminate (wait-freedom); are there weaker liveness conditions (e.g., obstruction-freedom) that allow providing the limited effect property for an atomic register?

Acknowledgements

We thank the anonymous reviewers for many helpful comments.

10. REFERENCES

[1] M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter, and J. J. Wylie. Fault-scalable byzantine fault-tolerant services. In Symposium on Operating Systems Principles, pages 59–74, Oct. 2005.
[2] I. Abraham, G. Chockler, I. Keidar, and D. Malkhi. Wait-free regular storage from byzantine components. Information Processing Letters, 101(2):60–65, Jan. 2007.
[3] M. K. Aguilera, S. Frolund, V. Hadzilacos, S. L. Horn, and S. Toueg. Abortable and query-abortable objects and their efficient implementation. In Symposium on Principles of Distributed Computing, pages 23–32, Aug. 2007.
[4] A. S. Aiyer, L. Alvisi, and R. A. Bazzi. Bounded wait-free implementation of optimally resilient byzantine storage without (unproven) cryptographic assumptions. In International Symposium on Distributed Computing, pages 7–19, Sept. 2007.
[5] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):124–142, Jan. 1995.
[6] H. Attiya and A. Bar-Or. Sharing memory with semi-byzantine clients and faulty storage servers. Parallel Processing Letters, 16(4):419–428, Dec. 2006.
[7] C. Cachin and S. Tessaro. Optimal resilience for erasure-coded byzantine distributed storage. In International Conference on Dependable Systems and Networks, pages 115–124, June 2006.
[8] M. Castro and B. Liskov. Practical byzantine fault tolerance. In Symposium on Operating Systems Design and Implementation, pages 173–186, Feb. 1999.
[9] G. Chockler and D. Malkhi. Active disk paxos with infinitely many processes. In ACM Symposium on Principles of Distributed Computing, pages 78–87, July 2002.
[10] J. Cowling, D. Meyers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for byzantine fault tolerance. In Symposium on Operating Systems Design and Implementation, pages 177–190, Dec. 2006. Longer version available as MIT Technical Report MIT-CSAIL-TR-2007-009.
[11] P. Dutta, S. Frolund, R. Guerraoui, and B. Pochon. An efficient universal construction for message-passing systems. In International Symposium on Distributed Computing, pages 133–147, Oct. 2002.
[12] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, Apr. 1985.
[13] S. Frolund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. A decentralized algorithm for erasure-coded virtual disks. In International Conference on Dependable Systems and Networks, pages 125–134, June 2004.
[14] G. Goodson, J. Wylie, G. Ganger, and M. Reiter. Efficient byzantine-tolerant erasure-coded storage. In International Conference on Dependable Systems and Networks, pages 135–144, June 2004.
[15] D. Hendler and N. Shavit. Operation-valency and the cost of coordination. In Symposium on Principles of Distributed Computing, pages 84–91, July 2003.
[16] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free synchronization: Double-ended queues as an example. In International Conference on Distributed Computing Systems, pages 522–529, May 2003.
[17] M. Herlihy and J. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.
[18] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: speculative byzantine fault tolerance. In Symposium on Operating Systems Principles, pages 45–58, Oct. 2007.
[19] D. Malkhi and M. Reiter. Secure and scalable replication in Phalanx. In IEEE Symposium on Reliable Distributed Systems, pages 51–60, Oct. 1998.
[20] J.-P. Martin, L. Alvisi, and M. Dahlin. Minimal byzantine storage. In International Symposium on Distributed Computing, pages 311–326, Oct. 2002.
[21] D. L. Mills. Computer Network Time Synchronization: the Network Time Protocol. CRC Press, 2006.
[22] Y. Saito, S. Frolund, A. Veitch, A. Merchant, and S. Spence. FAB: building reliable enterprise storage systems on a shoestring. In Workshop on Hot Topics in Operating Systems, pages 169–174, May 2003.
[23] Y. Saito, S. Frolund, A. Veitch, A. Merchant, and S. Spence. FAB: building distributed enterprise disk arrays from commodity components. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 48–58, Oct. 2004.
