- It describes the implementation of a Byzantine-fault-tolerant distributed file system.
- It provides experimental results that quantify the cost of the replication technique.

The remainder of the paper is organized as follows. We begin by describing our system model, including our failure assumptions. Section 3 describes the problem solved by the algorithm and states correctness conditions. The algorithm is described in Section 4 and some important optimizations are described in Section 5. Section 6 describes our replication library and how we used it to implement a Byzantine-fault-tolerant NFS. Section 7 presents the results of our experiments. Section 8 discusses related work. We conclude with a summary of what we have accomplished and a discussion of future research directions.

2 System Model

We assume an asynchronous distributed system where nodes are connected by a network. The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order.

We use a Byzantine failure model, i.e., faulty nodes may behave arbitrarily, subject only to the restriction mentioned below. We assume independent node failures. For this assumption to be true in the presence of malicious attacks, some steps need to be taken, e.g., each node should run different implementations of the service code and operating system and should have a different root password and a different administrator. It is possible to obtain different implementations from the same code base [28] and for low degrees of replication one can buy operating systems from different vendors. N-version programming, i.e., different teams of programmers produce different implementations, is another option for some services.

We use cryptographic techniques to prevent spoofing and replays and to detect corrupted messages. Our messages contain public-key signatures [33], message authentication codes [36], and message digests produced by collision-resistant hash functions [32]. We denote a message m signed by node i as ⟨m⟩_σi and the digest of message m by D(m). We follow the common practice of signing a digest of a message and appending it to the plaintext of the message rather than signing the full message (⟨m⟩_σi should be interpreted in this way). All replicas know the others' public keys to verify signatures.

We allow for a very strong adversary that can coordinate faulty nodes, delay communication, or delay correct nodes in order to cause the most damage to the replicated service. We do assume that the adversary cannot delay correct nodes indefinitely. We also assume that the adversary (and the faulty nodes it controls) are computationally bound so that (with very high probability) it is unable to subvert the cryptographic techniques mentioned above. For example, the adversary cannot produce a valid signature of a non-faulty node, compute the information summarized by a digest from the digest, or find two messages with the same digest. The cryptographic techniques we use are thought to have these properties [33, 36, 32].

3 Service Properties

Our algorithm can be used to implement any deterministic replicated service with a state and some operations. The operations are not restricted to simple reads or writes of portions of the service state; they can perform arbitrary deterministic computations using the state and operation arguments. Clients issue requests to the replicated service to invoke operations and block waiting for a reply. The replicated service is implemented by n replicas. Clients and replicas are non-faulty if they follow the algorithm in Section 4 and if no attacker can forge their signature.

The algorithm provides both safety and liveness assuming no more than ⌊(n−1)/3⌋ replicas are faulty. Safety means that the replicated service satisfies linearizability [14] (modified to account for Byzantine-faulty clients [4]): it behaves like a centralized implementation that executes operations atomically one at a time. Safety requires the bound on the number of faulty replicas because a faulty replica can behave arbitrarily, e.g., it can destroy its state.

Safety is provided regardless of how many faulty clients are using the service (even if they collude with faulty replicas): all operations performed by faulty clients are observed in a consistent way by non-faulty clients. In particular, if the service operations are designed to preserve some invariants on the service state, faulty clients cannot break those invariants.

The safety property is insufficient to guard against faulty clients, e.g., in a file system a faulty client can write garbage data to some shared file. However, we limit the amount of damage a faulty client can do by providing access control: we authenticate clients and deny access if the client issuing a request does not have the right to invoke the operation. Also, services may provide operations to change the access permissions for a client. Since the algorithm ensures that the effects of access revocation operations are observed consistently by all clients, this provides a powerful mechanism to recover from attacks by faulty clients.

The algorithm does not rely on synchrony to provide safety. Therefore, it must rely on synchrony to provide liveness; otherwise it could be used to implement consensus in an asynchronous system, which is not possible [9]. We guarantee liveness, i.e., clients eventually receive replies to their requests, provided at most ⌊(n−1)/3⌋ replicas are faulty and delay(t) does not
grow faster than t indefinitely. Here, delay(t) is the time between the moment t when a message is sent for the first time and the moment when it is received by its destination (assuming the sender keeps retransmitting the message until it is received). (A more precise definition can be found in [4].) This is a rather weak synchrony assumption that is likely to be true in any real system provided network faults are eventually repaired, yet it enables us to circumvent the impossibility result in [9].

The resiliency of our algorithm is optimal: 3f + 1 is the minimum number of replicas that allow an asynchronous system to provide the safety and liveness properties when up to f replicas are faulty (see [2] for a proof). This many replicas are needed because it must be possible to proceed after communicating with n − f replicas, since f replicas might be faulty and not responding. However, it is possible that the f replicas that did not respond are not faulty and, therefore, f of those that responded might be faulty. Even so, there must still be enough responses that those from non-faulty replicas outnumber those from faulty ones, i.e., n − 2f > f. Therefore n > 3f.
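The counting argument above is easy to restate in code. The sketch below is our own illustration (the function names are ours, not the paper's library): with n = 3f + 1 replicas, a node that waits for n − f = 2f + 1 responses is guaranteed that the non-faulty responders outnumber the faulty ones.

    # Sketch of the resiliency arithmetic (our names; not part of the paper's library).
    def max_faulty(n: int) -> int:
        """Largest f tolerated by n replicas: f = floor((n - 1) / 3)."""
        return (n - 1) // 3

    def quorum_size(n: int) -> int:
        """Replies a node can wait for without blocking: n - f (= 2f + 1 when n = 3f + 1)."""
        return n - max_faulty(n)

    if __name__ == "__main__":
        n = 4                      # smallest configuration: tolerates f = 1 fault
        f = max_faulty(n)
        q = quorum_size(n)
        # Among any q responders at most f are faulty, so non-faulty responders (q - f)
        # outnumber faulty ones exactly when n - 2f > f, i.e. n > 3f.
        assert q - f > f
        print(n, f, q)             # 4 1 3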
The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker. It is not feasible to offer fault-tolerant privacy in the general case because service operations may perform arbitrary computations using their arguments and the service state; replicas need this information in the clear to execute such operations efficiently. It is possible to use secret sharing schemes [35] to obtain privacy even in the presence of a threshold of malicious replicas [13] for the arguments and portions of the state that are opaque to the service operations. We plan to investigate these techniques in the future.

4 The Algorithm

Our algorithm is a form of state machine replication [17, 34]: the service is modeled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains the service state and implements the service operations. We denote the set of replicas by R and identify each replica using an integer in {0, ..., |R| − 1}. For simplicity, we assume |R| = 3f + 1 where f is the maximum number of replicas that may be faulty; although there could be more than 3f + 1 replicas, the additional replicas degrade performance (since more and bigger messages are being exchanged) without providing improved resiliency.

The replicas move through a succession of configurations called views. In a view one replica is the primary and the others are backups. Views are numbered consecutively. The primary of a view is replica p such that p = v mod |R|, where v is the view number. View changes are carried out when it appears that the primary has failed. Viewstamped Replication [26] and Paxos [18] used a similar approach to tolerate benign faults (as discussed in Section 8).

The algorithm works roughly as follows:
1. A client sends a request to invoke a service operation to the primary.
2. The primary multicasts the request to the backups.
3. Replicas execute the request and send a reply to the client.
4. The client waits for f + 1 replies from different replicas with the same result; this is the result of the operation.

Like all state machine replication techniques [34], we impose two requirements on replicas: they must be deterministic (i.e., the execution of an operation in a given state and with a given set of arguments must always produce the same result) and they must start in the same state. Given these two requirements, the algorithm ensures the safety property by guaranteeing that all non-faulty replicas agree on a total order for the execution of requests despite failures.

The remainder of this section describes a simplified version of the algorithm. We omit discussion of how nodes recover from faults due to lack of space. We also omit details related to message retransmissions. Furthermore, we assume that message authentication is achieved using digital signatures rather than the more efficient scheme based on message authentication codes; Section 5 discusses this issue further. A detailed formalization of the algorithm using the I/O automaton model [21] is presented in [4].

4.1 The Client

A client c requests the execution of state machine operation o by sending a ⟨REQUEST, o, t, c⟩_σc message to the primary. Timestamp t is used to ensure exactly-once semantics for the execution of client requests. Timestamps for c's requests are totally ordered such that later requests have higher timestamps than earlier ones; for example, the timestamp could be the value of the client's local clock when the request is issued.

Each message sent by the replicas to the client includes the current view number, allowing the client to track the view and hence the current primary. A client sends a request to what it believes is the current primary using a point-to-point message. The primary atomically multicasts the request to all the backups using the protocol described in the next section.

A replica sends the reply to the request directly to the client. The reply has the form ⟨REPLY, v, t, c, i, r⟩_σi where v is the current view number, t is the timestamp of the corresponding request, i is the replica number, and r is the result of executing the requested operation.

The client waits for f + 1 replies with valid signatures from different replicas, and with the same t and r, before
accepting the result r. This ensures that the result is valid, since at most f replicas can be faulty.

If the client does not receive replies soon enough, it broadcasts the request to all replicas. If the request has already been processed, the replicas simply re-send the reply; replicas remember the last reply message they sent to each client. Otherwise, if the replica is not the primary, it relays the request to the primary. If the primary does not multicast the request to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.

In this paper we assume that the client waits for one request to complete before sending the next one. But we can allow a client to make asynchronous requests, yet preserve ordering constraints on them.
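To make the client's role concrete, the following sketch shows only the reply-matching rule described above. It is our own minimal rendering with hypothetical names; transport, retransmission, and signature verification are assumed to happen elsewhere.

    from collections import defaultdict

    # Sketch of the client's reply-matching rule (our own code, not the paper's library).
    # A reply is modeled as a tuple (view, timestamp, client, replica_id, result);
    # signatures are assumed to have been verified already.
    def wait_for_result(replies, f, timestamp):
        votes = defaultdict(set)                 # (timestamp, result) -> set of replica ids
        for (v, t, c, i, r) in replies:
            if t != timestamp:
                continue                         # reply to an older request
            votes[(t, r)].add(i)
            if len(votes[(t, r)]) >= f + 1:      # f + 1 matching replies from distinct replicas
                return r                         # at most f replicas are faulty, so r is valid
        return None                              # caller times out and retransmits

A None result models the timeout case discussed above, in which the client broadcasts the request to all replicas.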
4.2 Normal-Case Operation

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica's current view. We describe how to truncate the log in Section 4.3.

When the primary, p, receives a client request, m, it starts a three-phase protocol to atomically multicast the request to the replicas. The primary starts the protocol immediately unless the number of messages for which the protocol is in progress exceeds a given maximum. In this case, it buffers the request. Buffered requests are multicast later as a group to cut down on message traffic and CPU overheads under heavy load; this optimization is similar to a group commit in transactional systems [11]. For simplicity, we ignore this optimization in the description below.

The three phases are pre-prepare, prepare, and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.

In the pre-prepare phase, the primary assigns a sequence number, n, to the request, multicasts a pre-prepare message with m piggybacked to all the backups, and appends the message to its log. The message has the form ⟨⟨PRE-PREPARE, v, n, d⟩_σp, m⟩, where v indicates the view in which the message is being sent, m is the client's request message, and d is m's digest.

Requests are not included in pre-prepare messages to keep them small. This is important because pre-prepare messages are used as a proof that the request was assigned sequence number n in view v in view changes. Additionally, it decouples the protocol to totally order requests from the protocol to transmit the request to the replicas; allowing us to use a transport optimized for small messages for protocol messages and a transport optimized for large messages for large requests.

A backup accepts a pre-prepare message provided:
- the signatures in the request and the pre-prepare message are correct and d is the digest for m;
- it is in view v;
- it has not accepted a pre-prepare message for view v and sequence number n containing a different digest;
- the sequence number in the pre-prepare message is between a low water mark, h, and a high water mark, H.

The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one. We discuss how H and h advance in Section 4.3.

If backup i accepts the ⟨⟨PRE-PREPARE, v, n, d⟩_σp, m⟩ message, it enters the prepare phase by multicasting a ⟨PREPARE, v, n, d, i⟩_σi message to all other replicas and adds both messages to its log. Otherwise, it does nothing.

A replica (including the primary) accepts prepare messages and adds them to its log provided their signatures are correct, their view number equals the replica's current view, and their sequence number is between h and H.

We define the predicate prepared(m, v, n, i) to be true if and only if replica i has inserted in its log: the request m, a pre-prepare for m in view v with sequence number n, and 2f prepares from different backups that match the pre-prepare. The replicas verify whether the prepares match the pre-prepare by checking that they have the same view, sequence number, and digest.
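The prepared predicate can be read as a simple check over the replica's message log. The sketch below is our own paraphrase with a hypothetical Log structure; signature checking and the water marks are assumed to have been enforced when messages were logged.

    from dataclasses import dataclass, field

    # Sketch of a replica's log and the prepared predicate (our own rendering, not the paper's code).
    @dataclass
    class Log:
        pre_prepares: dict = field(default_factory=dict)   # (view, seqno) -> accepted request digest
        prepares: dict = field(default_factory=dict)        # (view, seqno, digest) -> set of backup ids

    def prepared(request_digest, v, n, log: Log, f: int) -> bool:
        if log.pre_prepares.get((v, n)) != request_digest:
            return False                                    # no matching pre-prepare logged in view v
        backups = log.prepares.get((v, n, request_digest), set())
        return len(backups) >= 2 * f                        # 2f matching prepares from different backups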
The pre-prepare and prepare phases of the algorithm guarantee that non-faulty replicas agree on a total order for the requests within a view. More precisely, they ensure the following invariant: if prepared(m, v, n, i) is true then prepared(m′, v, n, j) is false for any non-faulty replica j (including i = j) and any m′ such that D(m′) ≠ D(m). This is true because prepared(m, v, n, i) and |R| = 3f + 1 imply that at least f + 1 non-faulty replicas have sent a pre-prepare or prepare for m in view v with sequence number n. Thus, for prepared(m′, v, n, j) to be true at least one of these replicas needs to have sent two conflicting prepares (or pre-prepares if it is the primary for v), i.e., two prepares with the same view and sequence number and a different digest. But this is not possible because the replica is not faulty. Finally, our assumption about the strength of message digests ensures that the probability that m ≠ m′ and D(m) = D(m′) is negligible.

Replica i multicasts a ⟨COMMIT, v, n, D(m), i⟩_σi to the other replicas when prepared(m, v, n, i) becomes true. This starts the commit phase. Replicas accept commit messages and insert them in their log provided they are properly signed, the view number in the message is equal to the replica's current view, and the sequence number is between h and H.
We define the committed and committed-local predicates as follows: committed(m, v, n) is true if and only if prepared(m, v, n, i) is true for all i in some set of f + 1 non-faulty replicas; and committed-local(m, v, n, i) is true if and only if prepared(m, v, n, i) is true and i has accepted 2f + 1 commits (possibly including its own) from different replicas that match the pre-prepare for m; a commit matches a pre-prepare if they have the same view, sequence number, and digest.
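committed-local adds a count of matching commits on top of prepared. Again this is our own illustrative code: the commits table and the is_prepared flag stand in for the log checks sketched earlier.

    # Sketch of committed-local(m, v, n, i) (our own rendering): prepared plus 2f + 1 matching
    # commits, possibly including the replica's own. 'commits' maps (view, seqno, digest) to the
    # set of replica ids whose COMMIT messages matched the pre-prepare.
    def committed_local(request_digest, v, n, commits: dict, is_prepared: bool, f: int) -> bool:
        if not is_prepared:
            return False
        matching = commits.get((v, n, request_digest), set())
        return len(matching) >= 2 * f + 1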
The commit phase ensures the following invariant: if committed-local(m, v, n, i) is true for some non-faulty i then committed(m, v, n) is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at f + 1 or more non-faulty replicas eventually.

Each replica i executes the operation requested by m after committed-local(m, v, n, i) is true and i's state reflects the sequential execution of all requests with lower sequence numbers. This ensures that all non-faulty replicas execute requests in the same order as required to provide the safety property. After executing the requested operation, replicas send a reply to the client. Replicas discard requests whose timestamp is lower than the timestamp in the last reply they sent to the client to guarantee exactly-once semantics.

We do not rely on ordered message delivery, and therefore it is possible for a replica to commit requests out of order. This does not matter since it keeps the pre-prepare, prepare, and commit messages logged until the corresponding request can be executed.

Figure 1 shows the operation of the algorithm in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and C is the client.

[Figure 1: normal-case message flow between client C and the replicas — request, pre-prepare, prepare, commit, and reply phases.]

4.3 Garbage Collection

Messages must be kept in a replica's log until the replica knows that the requests they concern have been executed by at least f + 1 non-faulty replicas and it can prove this to others in view changes. In addition, if some replica misses messages that were discarded by all non-faulty replicas, it will need to be brought up to date by transferring all or a portion of the service state. Therefore, replicas also need some proof that the state is correct.

Generating these proofs after executing every operation would be expensive. Instead, they are generated periodically, when a request with a sequence number divisible by some constant (e.g., 100) is executed. We will refer to the states produced by the execution of these requests as checkpoints and we will say that a checkpoint with a proof is a stable checkpoint.

A replica maintains several logical copies of the service state: the last stable checkpoint, zero or more checkpoints that are not stable, and a current state. Copy-on-write techniques can be used to reduce the space overhead to store the extra copies of the state, as discussed in Section 6.3.

The proof of correctness for a checkpoint is generated as follows. When a replica i produces a checkpoint, it multicasts a message ⟨CHECKPOINT, n, d, i⟩_σi to the other replicas, where n is the sequence number of the last request whose execution is reflected in the state and d is the digest of the state. Each replica collects checkpoint messages in its log until it has 2f + 1 of them for sequence number n with the same digest d signed by different replicas (including possibly its own message). These 2f + 1 messages are the proof of correctness for the checkpoint.
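The stable-checkpoint rule is another counting argument. The sketch below (our own code, with hypothetical message tuples) collects CHECKPOINT messages until 2f + 1 of them from different replicas agree on the digest for sequence number n.

    # Sketch of collecting a checkpoint proof (our own rendering, not the paper's code).
    # checkpoint_msgs: iterable of (seqno, digest, replica_id) with signatures already verified.
    def stable_checkpoint_proof(checkpoint_msgs, n, f):
        by_digest = {}
        for (seq, d, i) in checkpoint_msgs:
            if seq != n:
                continue
            by_digest.setdefault(d, set()).add(i)
            if len(by_digest[d]) >= 2 * f + 1:    # proof of correctness for the checkpoint
                return d, by_digest[d]
        return None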
A checkpoint with a proof becomes stable and the replica discards all pre-prepare, prepare, and commit messages with sequence number less than or equal to n from its log; it also discards all earlier checkpoints and checkpoint messages.

Computing the proofs is efficient because the digest can be computed using incremental cryptography [1] as discussed in Section 6.3, and proofs are generated rarely.

The checkpoint protocol is used to advance the low and high water marks (which limit what messages will be accepted). The low-water mark h is equal to the sequence number of the last stable checkpoint, and the high water mark is H = h + k, where k is large enough that replicas do not stall waiting for a checkpoint to become stable.

4.4 View Changes

The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute: a backup is waiting for a request if it received a valid request
and has not executed it. A backup starts a timer when it receives a request and the timer is not already running. It stops the timer when it is no longer waiting to execute the request, but restarts it if at that point it is waiting to execute some other request.

If the timer of backup i expires in view v, the backup starts a view change to move the system to view v + 1. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a ⟨VIEW-CHANGE, v + 1, n, C, P, i⟩_σi message to all replicas. Here n is the sequence number of the last stable checkpoint s known to i, C is a set of 2f + 1 valid checkpoint messages proving the correctness of s, and P is a set containing a set Pm for each request m that prepared at i with a sequence number higher than n. Each set Pm contains a valid pre-prepare message (without the corresponding client message) and 2f matching, valid prepare messages signed by different backups with the same view, sequence number, and the digest of m.

When the primary p of view v + 1 receives 2f valid view-change messages for view v + 1 from other replicas, it multicasts a ⟨NEW-VIEW, v + 1, V, O⟩_σp message to all other replicas, where V is a set containing the valid view-change messages received by the primary plus the view-change message for v + 1 the primary sent (or would have sent), and O is a set of pre-prepare messages (without the piggybacked request). O is computed as follows:
1. The primary determines the sequence number min-s of the latest stable checkpoint in V and the highest sequence number max-s in a prepare message in V.
2. The primary creates a new pre-prepare message for view v + 1 for each sequence number n between min-s and max-s. There are two cases: (1) there is at least one set in the P component of some view-change message in V with sequence number n, or (2) there is no such set. In the first case, the primary creates a new message ⟨PRE-PREPARE, v + 1, n, d⟩_σp, where d is the request digest in the pre-prepare message for sequence number n with the highest view number in V. In the second case, it creates a new pre-prepare message ⟨PRE-PREPARE, v + 1, n, d_null⟩_σp, where d_null is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)

Next the primary appends the messages in O to its log. If min-s is greater than the sequence number of its latest stable checkpoint, the primary also inserts the proof of stability for the checkpoint with sequence number min-s in its log, and discards information from the log as discussed in Section 4.3. Then it enters view v + 1: at this point it is able to accept messages for view v + 1.
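The primary's computation of O can be summarized in a few lines. The sketch below is our own reading of the two cases above, with a hypothetical ViewChange record carrying the sender's stable-checkpoint sequence number and a map from sequence numbers to (view, digest) pairs drawn from its P sets; signatures and message formats are omitted, and "between min-s and max-s" is taken here as the sequence numbers after the checkpoint.

    from dataclasses import dataclass, field

    # Sketch of the new primary's computation of O for view v + 1 (our own rendering).
    @dataclass
    class ViewChange:
        checkpoint_n: int                              # last stable checkpoint of the sender
        prepared: dict = field(default_factory=dict)   # seqno -> (view, digest) from its P sets

    D_NULL = "digest-of-null-request"                  # placeholder for the special null request

    def compute_O(view_changes, new_view):
        min_s = max(vc.checkpoint_n for vc in view_changes)
        max_s = max([n for vc in view_changes for n in vc.prepared] + [min_s])
        O = []
        for n in range(min_s + 1, max_s + 1):
            candidates = [vc.prepared[n] for vc in view_changes if n in vc.prepared]
            if candidates:
                _, d = max(candidates)                 # digest with the highest view number in V
            else:
                d = D_NULL                             # gap: a null request whose execution is a no-op
            O.append(("PRE-PREPARE", new_view, n, d))
        return O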
A backup accepts a new-view message for view v + 1 if it is signed properly, if the view-change messages it contains are valid for view v + 1, and if the set O is correct; it verifies the correctness of O by performing a computation similar to the one used by the primary to create O. Then it adds the new information to its log as described for the primary, multicasts a prepare for each message in O to all the other replicas, adds these prepares to its log, and enters view v + 1.

Thereafter, the protocol proceeds as described in Section 4.2. Replicas redo the protocol for messages between min-s and max-s but they avoid re-executing client requests (by using their stored information about the last reply sent to each client).

A replica may be missing some request message m or a stable checkpoint (since these are not sent in new-view messages). It can obtain missing information from another replica. For example, replica i can obtain a missing checkpoint state s from one of the replicas whose checkpoint messages certified its correctness in V. Since f + 1 of those replicas are correct, replica i will always obtain s or a later certified stable checkpoint. We can avoid sending the entire checkpoint by partitioning the state and stamping each partition with the sequence number of the last request that modified it. To bring a replica up to date, it is only necessary to send it the partitions where it is out of date, rather than the whole checkpoint.

4.5 Correctness

This section sketches the proof that the algorithm provides safety and liveness; details can be found in [4].

4.5.1 Safety

As discussed earlier, the algorithm provides safety if all non-faulty replicas agree on the sequence numbers of requests that commit locally.

In Section 4.2, we showed that if prepared(m, v, n, i) is true, prepared(m′, v, n, j) is false for any non-faulty replica j (including i = j) and any m′ such that D(m′) ≠ D(m). This implies that two non-faulty replicas agree on the sequence number of requests that commit locally in the same view at the two replicas.

The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. A request m commits locally at a non-faulty replica with sequence number n in view v only if committed(m, v, n) is true. This means that there is a set R1 containing at least f + 1 non-faulty replicas such that prepared(m, v, n, i) is true for every replica i in the set.

Non-faulty replicas will not accept a pre-prepare for view v′ > v without having received a new-view message for v′ (since only at that point do they enter the view). But any correct new-view message for view v′ > v contains correct view-change messages from every replica i in a
set R2 of 2f + 1 replicas. Since there are 3f + 1 replicas, R1 and R2 must intersect in at least one replica k that is not faulty. k's view-change message will ensure that the fact that m prepared in a previous view is propagated to subsequent views, unless the new-view message contains a view-change message with a stable checkpoint with a sequence number higher than n. In the first case, the algorithm redoes the three phases of the atomic multicast protocol for m with the same sequence number n and the new view number. This is important because it prevents any different request that was assigned the sequence number n in a previous view from ever committing. In the second case no replica in the new view will accept any message with sequence number lower than n. In either case, the replicas will agree on the request that commits locally with sequence number n.

4.5.2 Liveness

To provide liveness, replicas must move to a new view if they are unable to execute a request. But it is important to maximize the period of time when at least 2f + 1 non-faulty replicas are in the same view, and to ensure that this period of time increases exponentially until some requested operation executes. We achieve these goals by three means.

First, to avoid starting a view change too soon, a replica that multicasts a view-change message for view v + 1 waits for 2f + 1 view-change messages for view v + 1 and then starts its timer to expire after some time T. If the timer expires before it receives a valid new-view message for v + 1 or before it executes a request in the new view that it had not executed previously, it starts the view change for view v + 2 but this time it will wait 2T before starting a view change for view v + 3.

Second, if a replica receives a set of f + 1 valid view-change messages from other replicas for views greater than its current view, it sends a view-change message for the smallest view in the set, even if its timer has not expired; this prevents it from starting the next view change too late.

Third, faulty replicas are unable to impede progress by forcing frequent view changes. A faulty replica cannot cause a view change by sending a view-change message, because a view change will happen only if at least f + 1 replicas send view-change messages, but it can cause a view change when it is the primary (by not sending messages or sending bad messages). However, because the primary of view v is the replica p such that p = v mod |R|, the primary cannot be faulty for more than f consecutive views.

These three techniques guarantee liveness unless message delays grow faster than the timeout period indefinitely, which is unlikely in a real system.
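The first mechanism amounts to exponential backoff on the view-change timer; the one-line sketch below (our own, with an assumed base timeout T) states the rule.

    # Sketch of the view-change timeout doubling described above (our own rendering).
    def view_change_timeout(base_timeout: float, failed_attempts: int) -> float:
        # Wait T for view v + 1, 2T for view v + 2, 4T for view v + 3, and so on, so the
        # period in which 2f + 1 non-faulty replicas share a view grows until a request executes.
        return base_timeout * (2 ** failed_attempts)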
4.6 Non-Determinism

State machine replicas must be deterministic but many services involve some form of non-determinism. For example, the time-last-modified in NFS is set by reading the server's local clock; if this were done independently at each replica, the states of non-faulty replicas would diverge. Therefore, some mechanism to ensure that all replicas select the same value is needed. In general, the client cannot select the value because it does not have enough information; for example, it does not know how its request will be ordered relative to concurrent requests by other clients. Instead, the primary needs to select the value either independently or based on values provided by the backups.

If the primary selects the non-deterministic value independently, it concatenates the value with the associated request and executes the three phase protocol to ensure that non-faulty replicas agree on a sequence number for the request and value. This prevents a faulty primary from causing replica state to diverge by sending different values to different replicas. However, a faulty primary might send the same, incorrect, value to all replicas. Therefore, replicas must be able to decide deterministically whether the value is correct (and what to do if it is not) based only on the service state.

This protocol is adequate for most services (including NFS) but occasionally replicas must participate in selecting the value to satisfy a service's specification. This can be accomplished by adding an extra phase to the protocol: the primary obtains authenticated values proposed by the backups, concatenates 2f + 1 of them with the associated request, and starts the three phase protocol for the concatenated message. Replicas choose the value by a deterministic computation on the 2f + 1 values and their state, e.g., taking the median. The extra phase can be optimized away in the common case. For example, if replicas need a value that is "close enough" to that of their local clock, the extra phase can be avoided when their clocks are synchronized within some delta.
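When backups propose values, the deterministic choice can be as simple as taking the median. The sketch below is our own illustration: with 2f + 1 proposals of which at most f come from faulty replicas, the median is bounded by values proposed by non-faulty replicas.

    # Sketch: deterministic choice over 2f + 1 proposed values (our own rendering).
    def choose_value(proposals, f):
        assert len(proposals) == 2 * f + 1
        ordered = sorted(proposals)
        return ordered[f]        # the median; it lies between values from non-faulty replicas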
5 Optimizations

This section describes some optimizations that improve the performance of the algorithm during normal-case operation. All the optimizations preserve the liveness and safety properties.

5.1 Reducing Communication

We use three optimizations to reduce the cost of communication. The first avoids sending most large replies. A client request designates a replica to send the result; all other replicas send replies containing just the digest of the result. The digests allow the client to check the correctness of the result while reducing network bandwidth consumption and CPU overhead significantly for large replies. If the client does not receive a correct result from the designated replica, it retransmits the request as usual, requesting all replicas to send full replies.

The second optimization reduces the number of message delays for an operation invocation from 5 to 4. Replicas execute a request tentatively as soon as the prepared predicate holds for the request, their state reflects the execution of all requests with lower sequence number, and these requests are all known to have committed. After executing the request, the replicas send tentative replies to the client. The client waits for 2f + 1 matching tentative replies. If it receives this many, the request is guaranteed to commit eventually. Otherwise, the client retransmits the request and waits for f + 1 non-tentative replies.

A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).

The third optimization improves the performance of read-only operations that do not modify the service state. A client multicasts a read-only request to all replicas. Replicas execute the request immediately in their tentative state after checking that the request is properly authenticated, that the client has access, and that the request is in fact read-only. They send the reply only after all requests reflected in the tentative state have committed; this is necessary to prevent the client from observing uncommitted state. The client waits for 2f + 1 replies from different replicas with the same result. The client may be unable to collect 2f + 1 such replies if there are concurrent writes to data that affect the result; in this case, it retransmits the request as a regular read-write request after its retransmission timer expires.
5.2 Cryptography

In Section 4, we described an algorithm that uses digital signatures to authenticate all messages. However, we actually use digital signatures only for view-change and new-view messages, which are sent rarely, and authenticate all other messages using message authentication codes (MACs). This eliminates the main performance bottleneck in previous systems [29, 22].

However, MACs have a fundamental limitation relative to digital signatures — the inability to prove that a message is authentic to a third party. The algorithm in Section 4 and previous Byzantine-fault-tolerant algorithms [31, 16] for state machine replication rely on the extra power of digital signatures. We modified our algorithm to circumvent the problem by taking advantage of specific invariants, e.g., the invariant that no two different requests prepare with the same view and sequence number at two non-faulty replicas. The modified algorithm is described in [5]. Here we sketch the main implications of using MACs.

MACs can be computed three orders of magnitude faster than digital signatures. For example, a 200MHz Pentium Pro takes 43ms to generate a 1024-bit modulus RSA signature of an MD5 digest and 0.6ms to verify the signature [37], whereas it takes only 10.3μs to compute the MAC of a 64-byte message on the same hardware in our implementation. There are other public-key cryptosystems that generate signatures faster, e.g., elliptic curve public-key cryptosystems, but signature verification is slower [37] and in our algorithm each signature is verified many times.

Each node (including active clients) shares a 16-byte secret session key with each replica. We compute message authentication codes by applying MD5 to the concatenation of the message with the secret key. Rather than using the 16 bytes of the final MD5 digest, we use only the 10 least significant bytes. This truncation has the obvious advantage of reducing the size of MACs and it also improves their resilience to certain attacks [27]. This is a variant of the secret suffix method [36], which is secure as long as MD5 is collision resistant [27, 8].

The digital signature in a reply message is replaced by a single MAC, which is sufficient because these messages have a single intended recipient. The signatures in all other messages (including client requests but excluding view changes) are replaced by vectors of MACs that we call authenticators. An authenticator has an entry for every replica other than the sender; each entry is the MAC computed with the key shared by the sender and the replica corresponding to the entry.

The time to verify an authenticator is constant but the time to generate one grows linearly with the number of replicas. This is not a problem because we do not expect to have a large number of replicas and there is a huge performance gap between MAC and digital signature computation. Furthermore, we compute authenticators efficiently; MD5 is applied to the message once and the resulting context is used to compute each vector entry by applying MD5 to the corresponding session key. For example, in a system with 37 replicas (i.e., a system that can tolerate 12 simultaneous faults) an authenticator can still be computed much more than two orders of magnitude faster than a 1024-bit modulus RSA signature.

The size of authenticators grows linearly with the number of replicas but it grows slowly: it is equal to 30⌊(n−1)/3⌋ bytes. An authenticator is smaller than an RSA signature with a 1024-bit modulus for n ≤ 13 (i.e., systems that can tolerate up to 4 simultaneous faults), which we expect to be true in most configurations.
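The MAC construction above is straightforward to phrase with a standard MD5 library. The sketch below is our own illustration, not the library's code: it applies MD5 to message plus key, keeps 10 bytes of the digest (here taken as the last 10 bytes), and builds an authenticator as one such MAC per receiver. It does not reproduce the context-reuse optimization mentioned above.

    import hashlib

    # Sketch of the secret-suffix MACs and authenticators described above (our own code;
    # the choice of which 10 digest bytes to keep is an assumption).
    def mac(message: bytes, key: bytes) -> bytes:
        # MD5 over message || key, truncated to 10 bytes.
        return hashlib.md5(message + key).digest()[-10:]

    def authenticator(message: bytes, session_keys: dict) -> dict:
        # One MAC per replica other than the sender; session_keys maps replica id -> 16-byte key.
        return {replica: mac(message, key) for replica, key in session_keys.items()}

    def verify_entry(message: bytes, key: bytes, entry: bytes) -> bool:
        return mac(message, key) == entry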
6 Implementation

This section describes our implementation. First we discuss the replication library, which can be used as a basis for any replicated service. In Section 6.2 we describe how we implemented a replicated NFS on top of the replication library. Then we describe how we maintain checkpoints and compute checkpoint digests efficiently.

6.1 The Replication Library

The client interface to the replication library consists of a single procedure, invoke, with one argument, an input buffer containing a request to invoke a state machine operation. The invoke procedure uses our protocol to execute the requested operation at the replicas and select the correct reply from among the replies of the individual replicas. It returns a pointer to a buffer containing the operation result.

On the server side, the replication code makes a number of upcalls to procedures that the server part of the application must implement. There are procedures to execute requests (execute), to maintain checkpoints of the service state (make checkpoint, delete checkpoint), to obtain the digest of a specified checkpoint (get digest), and to obtain missing information (get checkpoint, set checkpoint). The execute procedure receives as input a buffer containing the requested operation, executes the operation, and places the result in an output buffer. The other procedures are discussed further in Sections 6.3 and 6.4.
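To make the server-side interface concrete, the sketch below lists the upcalls as an abstract Python class. The procedure names follow the text, but the signatures are our guesses rather than the library's actual C interface.

    # Sketch of the server-side upcall interface described above (signatures are assumptions).
    class ReplicatedService:
        def execute(self, request: bytes) -> bytes:
            """Run one state machine operation and return its result."""
            raise NotImplementedError

        def make_checkpoint(self, seqno: int) -> None: ...
        def delete_checkpoint(self, seqno: int) -> None: ...
        def get_digest(self, seqno: int) -> bytes: ...
        def get_checkpoint(self, seqno: int, block: int) -> bytes: ...
        def set_checkpoint(self, seqno: int, block: int, value: bytes) -> None: ...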
Point-to-point communication between nodes is implemented using UDP, and multicast to the group of replicas is implemented using UDP over IP multicast [7]. There is a single IP multicast group for each service, which contains all the replicas. These communication protocols are unreliable; they may duplicate or lose messages or deliver them out of order.

The algorithm tolerates out-of-order delivery and rejects duplicates. View changes can be used to recover from lost messages, but this is expensive and therefore it is important to perform retransmissions. During normal operation recovery from lost messages is driven by the receiver: backups send negative acknowledgments to the primary when they are out of date and the primary retransmits pre-prepare messages after a long timeout. A reply to a negative acknowledgment may include both a portion of a stable checkpoint and missing messages. During view changes, replicas retransmit view-change messages until they receive a matching new-view message or they move on to a later view.

The replication library does not implement view changes or retransmissions at present. This does not compromise the accuracy of the results given in Section 7 because the rest of the algorithm is completely implemented (including the manipulation of the timers that trigger view changes) and because we have formalized the complete algorithm and proved its correctness [4].

6.2 BFS: A Byzantine-Fault-tolerant File System

We implemented BFS, a Byzantine-fault-tolerant NFS service, using the replication library. Figure 2 shows the architecture of BFS. We opted not to modify the kernel NFS client and server because we did not have the sources for the Digital Unix kernel.

A file system exported by the fault-tolerant NFS service is mounted on the client machine like any regular NFS file system. Application processes run unmodified and interact with the mounted file system through the NFS client in the kernel. We rely on user level relay processes to mediate communication between the standard NFS client and the replicas. A relay receives NFS protocol requests, calls the invoke procedure of our replication library, and sends the result back to the NFS client.

Figure 2: Replicated File System Architecture. (Client side: Andrew benchmark, kernel NFS client, relay, and replication library; each replica 0 through n: replication library, snfsd, and kernel VM.)

Each replica runs a user-level process with the replication library and our NFS V2 daemon, which we will refer to as snfsd (for simple nfsd). The replication library receives requests from the relay, interacts with snfsd by making upcalls, and packages NFS replies into replication protocol replies that it sends to the relay.

We implemented snfsd using a fixed-size memory-mapped file. All the file system data structures, e.g., inodes, blocks and their free lists, are in the mapped file. We rely on the operating system to manage the cache of memory-mapped file pages and to write modified pages to disk asynchronously. The current implementation uses 8KB blocks and inodes contain the NFS status information plus 256 bytes of data, which is used to store directory entries in directories, pointers to blocks in files, and text in symbolic links. Directories and files may also use indirect blocks in a way similar to Unix.

Our implementation ensures that all state machine
replicas start in the same initial state and are deterministic, which are necessary conditions for the correctness of a service implemented using our protocol. The primary proposes the values for time-last-modified and time-last-accessed, and replicas select the larger of the proposed value and one greater than the maximum of all values selected for earlier requests. We do not require synchronous writes to implement NFS V2 protocol semantics because BFS achieves stability of modified data and meta-data through replication [20].

6.3 Maintaining Checkpoints

This section describes how snfsd maintains checkpoints of the file system state. Recall that each replica maintains several logical copies of the state: the current state, some number of checkpoints that are not yet stable, and the last stable checkpoint.

snfsd executes file system operations directly in the memory mapped file to preserve locality, and it uses copy-on-write to reduce the space and time overhead associated with maintaining checkpoints. snfsd maintains a copy-on-write bit for every 512-byte block in the memory mapped file. When the replication code invokes the make checkpoint upcall, snfsd sets all the copy-on-write bits and creates a (volatile) checkpoint record, containing the current sequence number, which it receives as an argument to the upcall, and a list of blocks. This list contains the copies of the blocks that were modified since the checkpoint was taken, and therefore, it is initially empty. The record also contains the digest of the current state; we discuss how the digest is computed in Section 6.4.

When a block of the memory mapped file is modified while executing a client request, snfsd checks the copy-on-write bit for the block and, if it is set, stores the block's current contents and its identifier in the checkpoint record for the last checkpoint. Then, it overwrites the block with its new value and resets its copy-on-write bit. snfsd retains a checkpoint record until told to discard it via a delete checkpoint upcall, which is made by the replication code when a later checkpoint becomes stable.

If the replication code requires a checkpoint to send to another replica, it calls the get checkpoint upcall. To obtain the value for a block, snfsd first searches for the block in the checkpoint record of the stable checkpoint, and then searches the checkpoint records of any later checkpoints. If the block is not in any checkpoint record, it returns the value from the current state.

The use of the copy-on-write technique and the fact that we keep at most 2 checkpoints ensure that the space and time overheads of keeping several logical copies of the state are low. For example, in the Andrew benchmark experiments described in Section 7, the average checkpoint record size is only 182 blocks with a maximum of 500.

6.4 Computing Checkpoint Digests

snfsd computes a digest of a checkpoint state as part of a make checkpoint upcall. Although checkpoints are only taken occasionally, it is important to compute the state digest incrementally because the state may be large. snfsd uses an incremental collision-resistant one-way hash function called AdHash [1]. This function divides the state into fixed-size blocks and uses some other hash function (e.g., MD5) to compute the digest of the string obtained by concatenating the block index with the block value for each block. The digest of the state is the sum of the digests of the blocks modulo some large integer. In our current implementation, we use the 512-byte blocks from the copy-on-write technique and compute their digest using MD5.

To compute the digest for the state incrementally, snfsd maintains a table with a hash value for each 512-byte block. This hash value is obtained by applying MD5 to the block index concatenated with the block value at the time of the last checkpoint. When make checkpoint is called, snfsd obtains the digest d for the previous checkpoint state (from the associated checkpoint record). It computes new hash values for each block whose copy-on-write bit is reset by applying MD5 to the block index concatenated with the current block value. Then, it adds the new hash value to d, subtracts the old hash value from d, and updates the table to contain the new hash value. This process is efficient provided the number of modified blocks is small; as mentioned above, on average 182 blocks are modified per checkpoint for the Andrew benchmark.
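The incremental digest update is plain modular arithmetic over per-block hashes. The sketch below is our own illustration of an AdHash-style digest (the modulus, block encoding, and class layout are placeholders, not snfsd's): it keeps a table of per-block MD5 values and adjusts the running sum when a block changes.

    import hashlib

    # Sketch of an AdHash-style incremental state digest (our own code; the modulus and
    # index encoding are assumptions, not the values used by snfsd).
    MODULUS = 2 ** 256

    def block_hash(index: int, value: bytes) -> int:
        h = hashlib.md5(index.to_bytes(8, "big") + value).digest()
        return int.from_bytes(h, "big")

    class IncrementalDigest:
        def __init__(self, blocks):
            self.table = {i: block_hash(i, b) for i, b in enumerate(blocks)}
            self.digest = sum(self.table.values()) % MODULUS

        def update_block(self, index: int, new_value: bytes):
            new_h = block_hash(index, new_value)
            # add the new per-block hash and subtract the old one, modulo a large integer
            self.digest = (self.digest + new_h - self.table[index]) % MODULUS
            self.table[index] = new_h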
7 Performance Evaluation

This section evaluates the performance of our system using two benchmarks: a micro-benchmark and the Andrew benchmark [15]. The micro-benchmark provides a service-independent evaluation of the performance of the replication library; it measures the latency to invoke a null operation, i.e., an operation that does nothing.

The Andrew benchmark is used to compare BFS with two other file systems: one is the NFS V2 implementation in Digital Unix, and the other is identical to BFS except without replication. The first comparison demonstrates that our system is practical by showing that its latency is similar to the latency of a commercial system that is used daily by many users. The second comparison allows us to evaluate the overhead of our algorithm accurately within an implementation of a real service.

7.1 Experimental Setup

The experiments measure normal-case behavior (i.e., there are no view changes), because this is the behavior
that determines the performance of the system. All experiments ran with one client running two relay processes, and four replicas. Four replicas can tolerate one Byzantine fault; we expect this reliability level to suffice for most applications. The replicas and the client ran on identical DEC 3000/400 Alpha workstations. These workstations have a 133 MHz Alpha 21064 processor, 128 MB of memory, and run Digital Unix version 4.0. The file system was stored by each replica on a DEC RZ26 disk. All the workstations were connected by a 10Mbit/s switched Ethernet and had DEC LANCE Ethernet interfaces. The switch was a DEC EtherWORKS 8T/TX. The experiments were run on an isolated network.

The interval between checkpoints was 128 requests, which causes garbage collection to occur several times in any of the experiments. The maximum sequence number accepted by replicas in pre-prepare messages was 256 plus the sequence number of the last stable checkpoint.

7.2 Micro-Benchmark

The micro-benchmark measures the latency to invoke a null operation. It evaluates the performance of two implementations of a simple service with no state that implements null operations with arguments and results of different sizes. The first implementation is replicated using our library and the second is unreplicated and uses UDP directly. Table 1 reports the response times measured at the client for both read-only and read-write operations. They were obtained by timing 10,000 operation invocations in three separate runs and we report the median value of the three runs. The maximum deviation from the median was always below 0.3% of the reported value. We denote each operation by a/b, where a and b are the sizes of the operation argument and result in KBytes.

    arg./res. (KB)   replicated (read-write)   replicated (read-only)   without replication
    0/0              3.35 (309%)               1.62 (98%)               0.82
    4/0              14.19 (207%)              6.98 (51%)               4.62
    0/4              8.01 (72%)                5.94 (27%)               4.66

Table 1: Micro-benchmark results (in milliseconds); the percentage overhead is relative to the unreplicated case.

The overhead introduced by the replication library is due to extra computation and communication. For example, the computation overhead for the read-write 0/0 operation is approximately 1.06ms, which includes 0.55ms spent executing cryptographic operations. The remaining 1.47ms of overhead are due to extra communication; the replication library introduces an extra message round-trip, it sends larger messages, and it increases the number of messages received by each node relative to the service without replication.

The overhead for read-only operations is significantly lower because the optimization discussed in Section 5.1 reduces both computation and communication overheads. For example, the computation overhead for the read-only 0/0 operation is approximately 0.43ms, which includes 0.23ms spent executing cryptographic operations, and the communication overhead is only 0.37ms because the protocol to execute read-only operations uses a single round-trip.

Table 1 shows that the relative overhead is lower for the 4/0 and 0/4 operations. This is because a significant fraction of the overhead introduced by the replication library is independent of the size of operation arguments and results. For example, in the read-write 0/4 operation, the large message (the reply) goes over the network only once (as discussed in Section 5.1) and only the cryptographic overhead to process the reply message is increased. The overhead is higher for the read-write 4/0 operation because the large message (the request) goes over the network twice and increases the cryptographic overhead for processing both request and pre-prepare messages.

It is important to note that this micro-benchmark represents the worst case overhead for our algorithm because the operations perform no work and the unreplicated server provides very weak guarantees. Most services will require stronger guarantees, e.g., authenticated connections, and the overhead introduced by our algorithm relative to a server that implements these guarantees will be lower. For example, the overhead of the replication library relative to a version of the unreplicated service that uses MACs for authentication is only 243% for the read-write 0/0 operation and 4% for the read-only 4/0 operation.

We can estimate a rough lower bound on the performance gain afforded by our algorithm relative to Rampart [30]. Reiter reports that Rampart has a latency of 45ms for a multi-RPC of a null message in a 10 Mbit/s Ethernet network of 4 SparcStation 10s [30]. The multi-RPC is sufficient for the primary to invoke a state machine operation but for an arbitrary client to invoke an operation it would be necessary to add an extra message delay and an extra RSA signature and verification to authenticate the client; this would lead to a latency of at least 65ms (using the RSA timings reported in [29]). Even if we divide this latency by 1.7, the ratio of the SPECint92 ratings of the DEC 3000/400 and the SparcStation 10, our algorithm still reduces the latency to invoke the read-write and read-only 0/0 operations by factors of more than 10 and 20, respectively. Note that this scaling is conservative because the network accounts for a significant fraction of Rampart's latency [29] and Rampart's results were obtained using 300-bit modulus RSA signatures, which are not considered secure today unless the keys used to
generate them are refreshed very frequently.
There are no published performance numbers for BFS
phase strict r/o lookup BFS-nr
SecureRing [16] but it would be slower than Rampart
1 0.55 (57%) 0.47 (34%) 0.35
because its algorithm has more message delays and
2 9.24 (82%) 7.91 (56%) 5.08
signature operations in the critical path. 3 7.24 (18%) 6.45 (6%) 6.11
4 8.77 (18%) 7.87 (6%) 7.41
7.3 Andrew Benchmark 5 38.68 (20%) 38.38 (19%) 32.12
The Andrew benchmark [15] emulates a software total 64.48 (26%) 61.07 (20%) 51.07
development workload. It has five phases: (1) creates Table 2: Andrew benchmark: BFS vs BFS-nr. The times
subdirectories recursively; (2) copies a source tree; (3) are in seconds.
examines the status of all the files in the tree without
examining their data; (4) examines every byte of data in
the complete benchmark. The overhead is lower than
all the files; and (5) compiles and links the files.
what was observed for the micro-benchmarks because
We use the Andrew benchmark to compare BFS with
the client spends a significant fraction of the elapsed time
two other file system configurations: NFS-std, which is
computing between operations, i.e., between receiving
the NFS V2 implementation in Digital Unix, and BFS-nr,
the reply to an operation and issuing the next request,
which is identical to BFS but with no replication. BFS-nr
and operations at the server perform some computation.
ran two simple UDP relays on the client, and on the server
But the overhead is not uniform across the benchmark
it ran a thin veneer linked with a version of snfsd from
phases. The main reason for this is a variation in the
which all the checkpoint management code was removed.
amount of time the client spends computing between
This configuration does not write modified file system
operations; the first two phases have a higher relative
state to disk before replying to the client. Therefore, it
overhead because the client spends approximately 40%
does not implement NFS V2 protocol semantics, whereas
of the total time computing between operations, whereas
both BFS and NFS-std do.
it spends approximately 70% during the last three phases.
Out of the 18 operations in the NFS V2 protocol only
The table shows that applying the read-only optimiza-
getattr is read-only because the time-last-accessed
tion to lookup improves the performance of BFS sig-
attribute of files and directories is set by operations
nificantly and reduces the overhead relative to BFS-nr
that would otherwise be read-only, e.g., read and
to 20%. This optimization has a significant impact in
lookup. The result is that our optimization for read-
the first four phases because the time spent waiting for
only operations can rarely be used. To show the impact
lookup operations to complete in BFS-strict is at least
of this optimization, we also ran the Andrew benchmark
20% of the elapsed time for these phases, whereas it is
on a second version of BFS that modifies the lookup
less than 5% of the elapsed time for the last phase.
operation to be read-only. This modification violates
strict Unix file system semantics but is unlikely to have BFS
adverse effects in practice. phase strict r/o lookup NFS-std
For all configurations, the actual benchmark code ran at the client workstation using the standard NFS client implementation in the Digital Unix kernel with the same mount options. The most relevant of these options for the benchmark are: UDP transport, 4096-byte read and write buffers, allowing asynchronous client writes, and allowing attribute caching.

We report the mean of 10 runs of the benchmark for each configuration. The sample standard deviation for the total time to run the benchmark was always below 2.6% of the reported value, but it was as high as 14% for the individual times of the first four phases. This high variance was also present in the NFS-std configuration. The estimated error for the reported mean was below 4.5% for the individual phases and 0.8% for the total.
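For clarity, the estimated error of the mean quoted above is the standard error s/sqrt(k) over k = 10 runs. A small sketch of the computation, using made-up run times rather than our measurements:

    import math

    def summarize(runs):
        """Mean, sample standard deviation, and estimated error of the mean."""
        k = len(runs)
        mean = sum(runs) / k
        # Sample (k - 1) standard deviation of the runs.
        s = math.sqrt(sum((x - mean) ** 2 for x in runs) / (k - 1))
        # Estimated error of the reported mean: s / sqrt(k).
        return mean, s, s / math.sqrt(k)

    # Hypothetical elapsed times (seconds) for 10 runs of one configuration.
    runs = [64.1, 64.9, 63.8, 65.2, 64.5, 64.3, 64.7, 64.6, 64.2, 64.5]
    mean, s, err = summarize(runs)
    print(f"mean={mean:.2f}s  stddev={100 * s / mean:.1f}%  "
          f"error of mean={100 * err / mean:.2f}%")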
    phase   strict          r/o lookup      BFS-nr
    total   64.48 (26%)     61.07 (20%)     51.07

Table 2: Andrew benchmark: BFS vs BFS-nr. The times are in seconds.

Table 2 shows the results for BFS and BFS-nr. The comparison between BFS-strict and BFS-nr shows that the overhead of Byzantine fault tolerance for this service is low: BFS-strict takes only 26% more time to run the complete benchmark. The overhead is lower than what was observed for the micro-benchmarks because the client spends a significant fraction of the elapsed time computing between operations, i.e., between receiving the reply to an operation and issuing the next request, and operations at the server perform some computation. But the overhead is not uniform across the benchmark phases. The main reason for this is a variation in the amount of time the client spends computing between operations; the first two phases have a higher relative overhead because the client spends approximately 40% of the total time computing between operations, whereas it spends approximately 70% during the last three phases.

The table shows that applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%. This optimization has a significant impact in the first four phases because the time spent waiting for lookup operations to complete in BFS-strict is at least 20% of the elapsed time for these phases, whereas it is less than 5% of the elapsed time for the last phase.

    phase   strict          r/o lookup      NFS-std
    1        0.55 (-69%)     0.47 (-73%)     1.75
    2        9.24  (-2%)     7.91 (-16%)     9.46
    3        7.24  (35%)     6.45  (20%)     5.36
    4        8.77  (32%)     7.87  (19%)     6.60
    5       38.68  (-2%)    38.38  (-2%)    39.35
    total   64.48   (3%)    61.07  (-2%)    62.52

Table 3: Andrew benchmark: BFS vs NFS-std. The times are in seconds.

Table 3 shows the results for BFS vs NFS-std. These results show that BFS can be used in practice: BFS-strict takes only 3% more time to run the complete benchmark. Thus, one could replace the NFS V2 implementation in Digital Unix, which is used daily by many users, by BFS without affecting the latency perceived by those users. Furthermore, BFS with the read-only optimization for the lookup operation is actually 2% faster than NFS-std.
The overhead of BFS relative to NFS-std is not the same for all phases. Both versions of BFS are faster than NFS-std for phases 1, 2, and 5 but slower for the other phases. This is because during phases 1, 2, and 5 a large fraction (between 21% and 40%) of the operations issued by the client are synchronous, i.e., operations that require the NFS implementation to ensure stability of modified file system state before replying to the client. NFS-std achieves stability by writing modified state to disk whereas BFS achieves stability with lower latency using replication (as in Harp [20]). NFS-std is faster than BFS (and BFS-nr) in phases 3 and 4 because the client issues no synchronous operations during these phases.
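The percentages in parentheses in Tables 2 and 3 are consistent with the relative slowdown (t / t_baseline - 1) with respect to the rightmost column of each table. As a quick check of the totals (illustrative code, no new data):

    def overhead(t, baseline):
        """Relative slowdown of elapsed time t with respect to a baseline."""
        return (t / baseline - 1) * 100

    # Totals from Tables 2 and 3 (seconds).
    bfs_strict, bfs_ro, bfs_nr, nfs_std = 64.48, 61.07, 51.07, 62.52

    print(f"BFS-strict vs BFS-nr:     {overhead(bfs_strict, bfs_nr):+.0f}%")   # about +26%
    print(f"BFS r/o lookup vs BFS-nr: {overhead(bfs_ro, bfs_nr):+.0f}%")       # about +20%
    print(f"BFS-strict vs NFS-std:    {overhead(bfs_strict, nfs_std):+.0f}%")  # about +3%
    print(f"BFS r/o lookup vs NFS-std:{overhead(bfs_ro, nfs_std):+.0f}%")      # about -2%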
8 Related Work

Most previous work on replication techniques ignored Byzantine faults or assumed a synchronous system model (e.g., [17, 26, 18, 34, 6, 10]). Viewstamped replication [26] and Paxos [18] use views with a primary and backups to tolerate benign faults in an asynchronous system. Tolerating Byzantine faults requires a much more complex protocol with cryptographic authentication, an extra pre-prepare phase, and a different technique to trigger view changes and select primaries. Furthermore, our system uses view changes only to select a new primary but never to select a different set of replicas to form the new view as in [26, 18].

Some agreement and consensus algorithms tolerate Byzantine faults in asynchronous systems (e.g., [2, 3, 24]). However, they do not provide a complete solution for state machine replication, and furthermore, most of them were designed to demonstrate theoretical feasibility and are too slow to be used in practice. Our algorithm during normal-case operation is similar to the Byzantine agreement algorithm in [2], but that algorithm is unable to survive primary failures.

The two systems that are most closely related to our work are Rampart [29, 30, 31, 22] and SecureRing [16]. They implement state machine replication but are more than an order of magnitude slower than our system and, most importantly, they rely on synchrony assumptions.

Both Rampart and SecureRing must exclude faulty replicas from the group to make progress (e.g., to remove a faulty primary and elect a new one) and to perform garbage collection. They rely on failure detectors to determine which replicas are faulty. However, failure detectors cannot be accurate in an asynchronous system [21], i.e., they may misclassify a replica as faulty. Since correctness requires that fewer than 1/3 of group members be faulty, a misclassification can compromise correctness by removing a non-faulty replica from the group. This opens an avenue of attack: an attacker gains control over a single replica but does not change its behavior in any detectable way; then it slows correct replicas or the communication between them until enough are excluded from the group.

To reduce the probability of misclassification, failure detectors can be calibrated to delay classifying a replica as faulty. However, for the probability to be negligible the delay must be very large, which is undesirable. For example, if the primary has actually failed, the group will be unable to process client requests until the delay has expired. Our algorithm is not vulnerable to this problem because it never needs to exclude replicas from the group.

Phalanx [23, 25] applies quorum replication techniques [12] to achieve Byzantine fault-tolerance in asynchronous systems. This work does not provide generic state machine replication; instead, it offers a data repository with operations to read and write individual variables and to acquire locks. The semantics it provides for read and write operations are weaker than those offered by our algorithm; we can implement arbitrary operations that access any number of variables, whereas in Phalanx it would be necessary to acquire and release locks to execute such operations. There are no published performance numbers for Phalanx but we believe our algorithm is faster because it has fewer message delays in the critical path and because of our use of MACs rather than public-key cryptography. The approach in Phalanx offers the potential for improved scalability: each operation is processed by only a subset of replicas. But this approach to scalability is expensive: it requires n > 4f + 1 to tolerate f faults, each replica needs a copy of the state, and the load on each replica decreases slowly with n (it is O(1/√n)).
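For a sense of the difference in replication cost, the sketch below contrasts the bounds quoted above: our algorithm tolerates f faults with 3f + 1 replicas, whereas the quorum construction discussed here requires n > 4f + 1. The helper names are ours and the snippet is purely illustrative:

    def bft_replicas(f):
        """Minimum replicas for our state-machine algorithm: n = 3f + 1."""
        return 3 * f + 1

    for f in (1, 2, 3):
        print(f"f={f}: state-machine replication needs n={bft_replicas(f)}; "
              f"the quorum construction needs n > {4 * f + 1}")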
9 Conclusions

This paper has described a new state-machine replication algorithm that is able to tolerate Byzantine faults and can be used in practice: it is the first to work correctly in an asynchronous system like the Internet, and it improves the performance of previous algorithms by more than an order of magnitude.

The paper also described BFS, a Byzantine-fault-tolerant implementation of NFS. BFS demonstrates that it is possible to use our algorithm to implement real services with performance close to that of an unreplicated service; the performance of BFS is only 3% worse than that of the standard NFS implementation in Digital Unix. This good performance is due to a number of important optimizations, including replacing public-key signatures by vectors of message authentication codes, reducing the size and number of messages, and the incremental checkpoint-management techniques.

One reason why Byzantine-fault-tolerant algorithms will be important in the future is that they can allow systems to continue to work correctly even when there are software errors. Not all errors are survivable; our approach cannot mask a software error that occurs at all replicas. However, it can mask errors that occur independently at different replicas, including nondeterministic software errors, which are the most problematic and persistent errors since they are the hardest to detect. In fact, we encountered such a software bug while running our system, and our algorithm was able to continue running correctly in spite of it.

There is still much work to do on improving our system. One problem of special interest is reducing the amount of resources required to implement our algorithm. The number of replicas can be reduced by using f replicas as witnesses that are involved in the protocol only when some full replica fails. We also believe that it is possible to reduce the number of copies of the state to f + 1, but the details remain to be worked out.

Acknowledgments

We would like to thank Atul Adya, Chandrasekhar Boyapati, Nancy Lynch, Sape Mullender, Andrew Myers, Liuba Shrira, and the anonymous referees for their helpful comments on drafts of this paper.
References

[1] M. Bellare and D. Micciancio. A New Paradigm for Collision-free Hashing: Incrementality at Reduced Cost. In Advances in Cryptology – Eurocrypt 97, 1997.
[2] G. Bracha and S. Toueg. Asynchronous Consensus and Broadcast Protocols. Journal of the ACM, 32(4), 1995.
[3] R. Canneti and T. Rabin. Optimal Asynchronous Byzantine Agreement. Technical Report #92-15, Computer Science Department, Hebrew University, 1992.
[4] M. Castro and B. Liskov. A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm. Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, 1999.
[5] M. Castro and B. Liskov. Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography. Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, 1999.
[6] F. Cristian, H. Aghili, H. Strong, and D. Dolev. Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement. In International Conference on Fault Tolerant Computing, 1985.
[7] S. Deering and D. Cheriton. Multicast Routing in Datagram Internetworks and Extended LANs. ACM Transactions on Computer Systems, 8(2), 1990.
[8] H. Dobbertin. The Status of MD5 After a Recent Attack. RSA Laboratories' CryptoBytes, 2(2), 1996.
[9] M. Fischer, N. Lynch, and M. Paterson. Impossibility of Distributed Consensus With One Faulty Process. Journal of the ACM, 32(2), 1985.
[10] J. Garay and Y. Moses. Fully Polynomial Byzantine Agreement for n > 3t Processors in t+1 Rounds. SIAM Journal of Computing, 27(1), 1998.
[11] D. Gawlick and D. Kinkade. Varieties of Concurrency Control in IMS/VS Fast Path. Database Engineering, 8(2), 1985.
[12] D. Gifford. Weighted Voting for Replicated Data. In Symposium on Operating Systems Principles, 1979.
[13] M. Herlihy and J. Tygar. How to Make Replicated Data Secure. Advances in Cryptology (LNCS 293), 1988.
[14] M. Herlihy and J. Wing. Axioms for Concurrent Objects. In ACM Symposium on Principles of Programming Languages, 1987.
[15] J. Howard et al. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1), 1988.
[16] K. Kihlstrom, L. Moser, and P. Melliar-Smith. The SecureRing Protocols for Securing Group Communication. In Hawaii International Conference on System Sciences, 1998.
[17] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 1978.
[18] L. Lamport. The Part-Time Parliament. Technical Report 49, DEC Systems Research Center, 1989.
[19] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems, 4(3), 1982.
[20] B. Liskov et al. Replication in the Harp File System. In ACM Symposium on Operating System Principles, 1991.
[21] N. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, 1996.
[22] D. Malkhi and M. Reiter. A High-Throughput Secure Reliable Multicast Protocol. In Computer Security Foundations Workshop, 1996.
[23] D. Malkhi and M. Reiter. Byzantine Quorum Systems. In ACM Symposium on Theory of Computing, 1997.
[24] D. Malkhi and M. Reiter. Unreliable Intrusion Detection in Distributed Computations. In Computer Security Foundations Workshop, 1997.
[25] D. Malkhi and M. Reiter. Secure and Scalable Replication in Phalanx. In IEEE Symposium on Reliable Distributed Systems, 1998.
[26] B. Oki and B. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In ACM Symposium on Principles of Distributed Computing, 1988.
[27] B. Preneel and P. Oorschot. MDx-MAC and Building Fast MACs from Hash Functions. In Crypto 95, 1995.
[28] C. Pu, A. Black, C. Cowan, and J. Walpole. A Specialization Toolkit to Increase the Diversity of Operating Systems. In ICMAS Workshop on Immunity-Based Systems, 1996.
[29] M. Reiter. Secure Agreement Protocols. In ACM Conference on Computer and Communication Security, 1994.
[30] M. Reiter. The Rampart Toolkit for Building High-Integrity Services. Theory and Practice in Distributed Systems (LNCS 938), 1995.
[31] M. Reiter. A Secure Group Membership Protocol. IEEE Transactions on Software Engineering, 22(1), 1996.
[32] R. Rivest. The MD5 Message-Digest Algorithm. Internet RFC-1321, 1992.
[33] R. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2), 1978.
[34] F. Schneider. Implementing Fault-Tolerant Services Using The State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4), 1990.
[35] A. Shamir. How to Share a Secret. Communications of the ACM, 22(11), 1979.
[36] G. Tsudik. Message Authentication with One-Way Hash Functions. ACM Computer Communications Review, 22(5), 1992.
[37] M. Wiener. Performance Comparison of Public-Key Cryptosystems. RSA Laboratories' CryptoBytes, 4(1), 1998.