
Consensus in the Cloud: Paxos Systems Demystified

Ailidani Ailijiang*, Aleksey Charapko†, and Murat Demirbas‡

Computer Science and Engineering
University at Buffalo, SUNY
Buffalo, NY 14260
Email: *ailidani@buffalo.edu, †acharapk@buffalo.edu, ‡demirbas@buffalo.edu

Abstract—Coordination and consensus play an important role in datacenter and cloud computing, particularly in leader election, group membership, cluster management, service discovery, resource/access management, and consistent replication of the master nodes in services. Paxos protocols and systems provide a fault-tolerant solution to the distributed consensus problem and have attracted significant attention as well as generating substantial confusion. In order to elucidate the correct use of distributed coordination systems, we compare and contrast popular Paxos protocols and Paxos systems and present advantages and disadvantages for each. We also categorize the coordination use-patterns in the cloud, and examine Google and Facebook infrastructures, as well as Apache top-level projects, to investigate how they use Paxos protocols and systems. Finally, we analyze tradeoffs in the distributed coordination domain and identify promising future directions for achieving more scalable distributed coordination systems.

1. Introduction

Cloud computing deals mainly with big data storage, processing, and serving. While these are mostly embarrassingly parallel tasks, coordination still plays a major role in cloud computing systems. Coordination is needed for leader election, group membership, cluster management, service discovery, resource/access management, consistent replication of the master nodes in services, and finally for barrier-orchestration when running large analytic tasks.

The coordination problem has been studied extensively by the theory of distributed systems under the name "distributed consensus". This problem has been the subject of several impossibility results: while consensus is easy in the absence of faults, it becomes prone to intricate failure scenarios in the presence of lossy channels, crashed participants, and violations of synchrony/timing assumptions. Several algorithms have been proposed to tackle the problem; however, Paxos, introduced in 1989 [1], stood out from the pack as it provided a simple, formally-proven algorithm to deal with the challenges of asynchrony, process crash/recovery, and message loss in an elegant and uniform manner.

Paxos's rise to fame had to wait until after large-scale web services and datacenter computing took off in the 2000s. Around that time Google was already running into the fault-induced corner cases that cause service downtimes. A fault-tolerant coordination service was needed for the Google File System (GFS), and Google adopted Paxos for implementing the GFS lock service, namely the Google Chubby [2]. The Google Chubby project boosted interest in the industry about using Paxos protocols and Paxos systems for fault-tolerant coordination.

An open-source implementation of the Google Chubby lock service was provided by the Apache ZooKeeper project [3]. ZooKeeper generalized the Chubby interface slightly and provided a general ready-to-use system for "coordination as a service". ZooKeeper used a Paxos-variant protocol, Zab [4], for solving the distributed consensus problem. Since Zab is embedded in the ZooKeeper implementation, it remained obscure and did not get adopted as a generic Paxos consensus component. Instead of the Zab component, which required a lot of work to integrate into an application, ZooKeeper's ready-to-use file-system abstraction interface got popular and became the de facto coordination service for cloud computing applications. However, since the bar on using the ZooKeeper interface was so low, it has been abused/misused by many applications. When ZooKeeper is improperly used, it often constitutes the bottleneck in the performance of these applications and causes scalability problems.

Recently, Paxos protocols and Paxos systems have quickly grown in number, adding further options to the choice of which consensus/coordination protocols/systems to use. Leveraging ZooKeeper, the BookKeeper [5] and Kafka [6] projects introduced log/stream replication services. The Raft protocol [7] went back to fundamentals and provided an open-source implementation of the Paxos protocol as a reusable component. Despite the increased choices and specialization of Paxos protocols and Paxos systems, confusion remains about the proper use cases of these systems and about which systems are more suitable for which tasks. A common pitfall has been to confuse the Paxos protocols with the Paxos systems built on top of these protocols (see Figure 1). Paxos protocols (such as Zab and Raft) are useful as low-level components for server replication, whereas Paxos systems (such as ZooKeeper) have often been shoehorned to that task. The proper use case for Paxos systems is in highly-available/durable metadata management, under the conditions that all metadata fit in main memory and are not subject to very frequent changes.
Figure 1. Paxos protocols versus Paxos systems

Contributions of this paper.
1) We categorize and characterize consensus/coordination use patterns in the cloud, and analyze the needs/requirements of each use case.
2) We compare and contrast Paxos protocols and Paxos systems and present advantages and disadvantages for each. We present proper use criteria for Paxos systems.
3) We examine Google and Facebook infrastructure as well as Apache top-level projects to evaluate their use of Paxos protocols and systems.
4) Finally, we analyze tradeoffs in the distributed coordination domain and identify promising future directions for achieving more scalable distributed coordination systems.

2. Paxos Protocols

In this section, we present Paxos protocol variants and compare and contrast their differences.

2.1. Similarities among Paxos protocols

The original Paxos protocol, detailed in [1], was developed for achieving fault-tolerant consensus and consequently for enabling fault-tolerant state machine replication (SMR) [8]. Paxos employs consensus to serialize operations at a leader and apply the operations at each replica in the exact serialized order dictated by the leader. The Multi-Paxos (a.k.a. multi-decree Paxos) flavor has extended the protocol to run efficiently with the same leader for multiple slots [9], [10], [11], [12], [13]. In particular, work by Van Renesse [14] presented a reconfigurable version of Multi-Paxos with a detailed and easy-to-implement operational specification of replicas, leader, and acceptors.

Zab (ZooKeeper Atomic Broadcast) is the Paxos-variant consensus protocol that powers the core of ZooKeeper, a popular open-source Paxos system [3], [4]. Zab is referred to as an atomic broadcast protocol because it enables the nodes to deliver the same set of transactions (state updates) in the same order. Atomic broadcast (or total order broadcast) and consensus are equivalent problems [15], [16].

Raft [7] is a recent consensus protocol that was designed to enhance the understandability of the Paxos protocol while maintaining its correctness and performance.

Figure 2. Phases in Zab and Raft

As shown in Figure 2, both Zab and Raft implement a dedicated phase to elect a distinguished primary leader. Both Zab and Raft decompose the consensus problem into independent subproblems: leader election, log replication, and safety and liveness. The distinguished primary leader approach provides a simpler foundation for building practical systems. A leader change is denoted by an epoch e ∈ N and a term t ∈ N in Zab and Raft, respectively. A new leader election increases e or t, so all non-faulty nodes only accept the leader with the higher epoch or term number. After leader election, in normal operation, the leader proposes and serializes clients' operations in total order at each replica.

In all Paxos protocols, every chosen value (i.e., proposed client operation) is a log entry, and each entry identifier z has two components, denoted as slot and ballot number in Paxos, as epoch and counter ⟨e, c⟩ in Zab, and as term and index in Raft. When the leader broadcasts a proposal for the current entry, a quorum of followers vote for the proposal and apply the corresponding operation after the leader commits. All Paxos protocols guarantee ordering: namely, when a command ⟨z, c⟩ is delivered, all commands ⟨z′, c′⟩ where z′ < z are delivered first, despite crashes of the leaders.

2.2. Differences among Paxos protocols

Leader election. The Zab and Raft protocols differ from Paxos as they divide execution into phases (called epochs in Zab and terms in Raft), as shown in Figure 2 (redrawn from [7]). Each epoch begins with a new election, goes into the broadcast phase, and ends with a leader failure. The phases are sequential because additional safety properties are provided by the isLeader predicate. The isLeader() predicate guarantees a single distinguished leader; that is, in Zab and Raft there can be at most one leader at any time. In contrast, Paxos does not provide this strong leader property. Since Paxos lacks a separate leader election phase, it can have multiple leaders coexisting; however, it still ensures safety thanks to the ballot numbers and quorum concepts.
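To make the epoch/term mechanics above concrete, the following minimal Python sketch (our own illustration; the class and field names are assumptions, not taken from any protocol's reference implementation) shows how entry identifiers can be totally ordered as (epoch, counter) pairs and how a follower accepts only a leader announcing a higher epoch or term.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class EntryId:
    """Entry identifier: (epoch, counter) in Zab, (term, index) in Raft.
    Lexicographic ordering of the pair gives the total delivery order."""
    epoch: int
    counter: int

class Follower:
    def __init__(self):
        self.current_epoch = 0   # highest epoch/term seen so far
        self.delivered = []      # entries applied, in delivery order

    def accept_leader(self, candidate_epoch):
        # A non-faulty node only accepts a leader with a higher epoch/term.
        if candidate_epoch > self.current_epoch:
            self.current_epoch = candidate_epoch
            return True
        return False

    def deliver(self, entry):
        # Ordering guarantee: an entry is delivered only after every
        # entry with a smaller identifier has already been delivered.
        assert all(prev < entry for prev in self.delivered)
        self.delivered.append(entry)

if __name__ == "__main__":
    f = Follower()
    assert f.accept_leader(1) and not f.accept_leader(1)  # stale epoch rejected
    f.deliver(EntryId(1, 1))
    f.deliver(EntryId(1, 2))
    f.deliver(EntryId(2, 1))  # a new epoch restarts the counter
    print(f.delivered)
```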
The Zab algorithm has three phases, and each node can be in one of these three phases at any given time. The discovery phase is where leader election occurs, over the currently known configuration of the quorum. A process can only be elected if it has a higher epoch, or the same epoch and a higher committed transaction id. In the synchronization phase, the new leader synchronizes its initial history of the previous epoch with all followers. The leader proceeds to the broadcast phase only after a quorum of followers have acknowledged that they are synchronized with the leader. The broadcast phase is the normal operation mode, and the leader keeps proposing new client requests until it fails.

In contrast to Zab, there is no distinct synchronization phase in Raft: the leader stays synchronized with each follower in the normal operation phase by comparing the log index and term value of each entry. As shown in Figure 2, the lack of a distinct synchronization phase simplifies Raft's algorithmic states, but may result in longer recovery time in practice.

Figure 3. Messaging in Zab and Raft

Communication with the replicas. Zab adopts a messaging model, where each update requires at least three messages: proposal, ack, and commit, as shown in Figure 3. In contrast, Raft relies on an underlying RPC system. Raft also aims to minimize the state space and the RPC types required in the protocol by reusing a few techniques repeatedly. For example, the AppendEntries RPCs are initiated by the leader both to replicate the log and to perform heartbeats.

Figure 4. Dynamic reconfiguration in Paxos protocols

Dynamic reconfiguration. The original Paxos was limited in that it assumed a static ensemble of 2f + 1 nodes that can crash and recover but cannot expand or shrink. The ability to dynamically reconfigure the membership of the consensus ensemble on the fly, while preserving data consistency, provides an important extension for Paxos protocols. Dynamic reconfiguration in all Paxos protocols shares the following basic approach. A client proposes a special reconfig command with a new configuration C_new, which is decided in a log entry just like any other command. To ensure safety, C_new cannot be activated immediately, and the configuration change must go through two phases. Due to the different nature of each protocol, the reconfiguration algorithm differs in each Paxos protocol.

The dynamic reconfiguration approach in Paxos [14] introduces uncertainty about a slot's configuration; therefore, it imposes a bound on the concurrent processing of commands. A process can only propose commands for slots with a known configuration, ∀ρ : ρ.slot_in < ρ.slot_out + WINDOW, as shown in Figure 4.

By exploiting the primary order property provided by Zab and Raft, both protocols are able to implement their reconfiguration algorithms without limiting normal operations or relying on external services. Both Zab and Raft include a pre-phase where the new processes in C_new join the cluster as non-voting members, so that the leader in C_old can initialize their states by transferring the currently committed prefix of updates. Once the new processes have caught up with the leader, the reconfiguration can be scheduled. The difference is that in Zab, during the interval between C_new being proposed and committed, any command received after the reconfig command is only scheduled but not committed, since such commands are the responsibility of C_new; in Raft, however, decisions in this interval require a quorum of the joint configuration C_old,new.
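As a rough illustration of the two reconfiguration safeguards described above, the sketch below (our own illustrative Python; the function and variable names are assumptions, not taken from any implementation) shows the slot-window bound used by reconfigurable Multi-Paxos and a joint-majority check of the kind Raft applies while C_old,new is in effect.

```python
def can_propose(slot, slot_out, window):
    """Reconfigurable Multi-Paxos bound: a replica only proposes commands for
    slots whose configuration is already known, i.e. slot < slot_out + WINDOW."""
    return slot < slot_out + window

def majority(votes, members):
    """True if the voters form a majority of the given configuration."""
    return len(set(votes) & set(members)) > len(members) // 2

def joint_commit_ok(votes, c_old, c_new):
    """During a joint configuration C_old,new, a decision needs separate
    majorities in both the old and the new member sets."""
    return majority(votes, c_old) and majority(votes, c_new)

if __name__ == "__main__":
    assert can_propose(slot=12, slot_out=10, window=5)
    assert not can_propose(slot=15, slot_out=10, window=5)
    c_old, c_new = {"a", "b", "c"}, {"b", "c", "d", "e"}
    assert joint_commit_ok({"b", "c", "d"}, c_old, c_new)      # majority of both
    assert not joint_commit_ok({"c", "d", "e"}, c_old, c_new)  # not a majority of C_old
```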
2.3. Extensions to the Paxos protocols

In the SMR approach, to further improve efficiency, under special cases a partial ordering of the command sequence can be used instead of a total ordering of the chosen values: e.g., two commutative commands can be executed in any order, since they produce the same resulting state. The resulting protocol, called Generalized Paxos [11], is an extension of Fast Paxos [10], and allows acceptors to vote for independent commands. Similarly, EPaxos [13] is able to achieve lower latency because it allows nodes to commit conflict-free commands by checking the command dependency list. However, EPaxos adds significant complexity and extra effort to resolve conflicts when concurrent commands do not commute. In addition, from an engineer's perspective, the sketchy algorithm descriptions in the literature are often underspecified and lead to divergent interpretations and implementations. Building such a system using the Paxos consensus algorithm has proved to be non-trivial [12].

Paxos users often face a trade-off between read latency and staleness. Although each write is serialized and synchronously replicated, such a write may only be applied to a quorum of replicas. Thus, another client reading at a replica where this write has not been replicated may still see the old version. Since the leader is the only process guaranteed to participate in all write quorums, stale reads can be avoided by reading from the current leader, with a consequent increase in latency.

The probability of stale reads is a function of the network. Inspired by probabilistically bounded staleness (PBS¹) [17], we modified the model to estimate the read/write latency and P(consistency) of Zab/Raft-like primary-ordered consensus protocols. Our model adopts 6 different communication delays, CR (Client-Replica), P (Proposal), A (Ack), C (Commit), R (Read), and S (Response), in order to investigate possible read and write message reordering and the resultant stale reads. The simulation uses the Monte Carlo method, with each event drawn from a predefined distribution. For simplicity, we assume each channel latency fits an exponential distribution characterized by λ, and we assume message delays are symmetric, λ_CR = λ_P = λ_A = λ_C = λ_R = λ_S = λ (λ = 1.5 means a mean delay of 0.66 ms). In Table 1, P(consistency) shows the probability of a consistent read of the last written version for different ensemble sizes, given that the clients read from the first responding replica.

1. http://pbs.cs.berkeley.edu/#demo

TABLE 1. LATENCY AND CONSISTENCY EXPECTATION

N   | λ   | Read Median | Read 99.9th %ile | Write Median | Write 99.9th %ile | P(consistency) | 0.999
N=3 | 1.5 | 1.11 ms     | 6.27 ms          | 1.98 ms      | 6.92 ms           | 94.65%         | 2.5 ms
N=5 | 1.5 | 1.11 ms     | 6.27 ms          | 2.1 ms       | 7.16 ms           | 94.79%         | 3 ms
N=7 | 1.5 | 1.11 ms     | 6.27 ms          | 2.18 ms      | 7.27 ms           | 95.05%         | 3.5 ms
N=9 | 1.5 | 1.11 ms     | 6.27 ms          | 2.19 ms      | 7.43 ms           | 95.52%         | 3.5 ms
N=3 | 0.1 | 16.65 ms    | 89.7 ms          | 29.57 ms     | 110.15 ms         | 94.2%          | 30 ms
N=5 | 0.1 | 16.65 ms    | 89.7 ms          | 31.78 ms     | 107.04 ms         | 95.03%         | 30 ms
N=7 | 0.1 | 16.65 ms    | 89.7 ms          | 32.24 ms     | 107.42 ms         | 95.48%         | 35 ms
N=9 | 0.1 | 16.65 ms    | 89.7 ms          | 32.76 ms     | 112.47 ms         | 95.51%         | 35 ms

3. Paxos Systems

In this section we compare and contrast three popular Paxos systems, ZooKeeper, Chubby, and etcd, and examine the features these systems provide to the clients. We also discuss proper usage criteria for these Paxos systems, a topic which has not received sufficient coverage.

TABLE 2. FEATURES OF PAXOS SYSTEMS

Feature                 | Chubby | ZooKeeper | etcd
Filesystem API          | X      | X         | X
Watches                 | X      | X         | X
Ephemeral Storage       | X      | X         | X
Local Reads             |        | X         | X
Dynamic Reconfiguration |        | X         | X
Observers               | X      | X         | X
Autoincremented Keys    |        | X         | X
Hidden Data             |        |           | X
Weighted Replicas       |        | X         |

3.1. Similarities among Paxos systems

Chubby [2], ZooKeeper [3], and etcd [18] are consensus services designed specifically for loosely-coupled distributed systems. Chubby, originally a lock service used in Google production systems, was the first to provide consensus through a service, with ZooKeeper and others arriving later.

All three services hide the replicated state machine and log abstractions under a small data-store with a filesystem-like API. The filesystem interface was chosen for its familiarity to developers, reducing the learning curve. The interface enables developers to reason about consensus and coordination as if they were working with a filesystem on a local machine. ZooKeeper calls the data objects in its hierarchical structure znodes. Each znode can act both as a file for storage and as a parent for other stored items.

An important feature common to these systems is the ability to set watches on the data objects, allowing the clients to receive timely notifications of changes without requiring polling. Typically, these systems implement one-time watches, meaning that the system notifies the client only of the first change to the object. If a client application wants to continue receiving updates, it must reinstitute the watch in the system.

All three systems support temporary or ephemeral storage that persists only while the client is alive and sending heartbeat messages. This mechanism allows the clients to use Paxos systems for failure detection and for triggering reconfiguration upon the addition or removal of clients in the application.

Both ZooKeeper and etcd provide the clients with the ability to create auto-incremented keys for the data items stored in a directory. This feature simplifies the implementation of certain counting data-structures, such as queues.

All three Paxos systems adopt observer servers. An observer is a non-voting replica of an ensemble that learns the entire committed log but does not belong to a quorum set. Observers can serve reads with a consistent view of some point in the recent past. This way, observers improve system scalability and help disseminate data over a wide geographic area without impacting the write latency.
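To ground these features, here is a minimal sketch using the kazoo Python client for ZooKeeper; it assumes a ZooKeeper ensemble reachable at 127.0.0.1:2181, and the paths and payloads are illustrative only. It shows an ephemeral, auto-sequenced znode (useful for liveness and auto-incremented keys) and a one-time data watch that has to be re-registered after it fires.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed local ensemble
zk.start()

# Ephemeral + sequence znode: removed automatically when this session dies,
# and suffixed with a monotonically increasing counter.
zk.ensure_path("/app/members")
me = zk.create("/app/members/worker-", b"host:port", ephemeral=True, sequence=True)
print("registered as", me)

# One-time watch: fires once on the next change, then must be set again.
def on_config_change(event):
    print("config changed:", event.path, event.type)
    zk.get("/app/config", watch=on_config_change)  # reinstitute the watch

zk.ensure_path("/app/config")
data, stat = zk.get("/app/config", watch=on_config_change)
print("config version", stat.version)

zk.set("/app/config", b"feature_x=on")  # triggers the watch above
zk.stop()
```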
3.2. Differences among Paxos systems

Despite serving the same purpose, Chubby, ZooKeeper, and etcd have many differences, both in terms of feature sets and internal implementations. Chubby uses the Multi-Paxos algorithm to achieve linearizability, while Zab lies at the heart of ZooKeeper and provides not only linearizability but also FIFO order for client requests, enabling developers to build complex coordination primitives with ease. Raft is the consensus protocol behind the etcd system.

Unlike ZooKeeper and Chubby, etcd is stateless with respect to its clients. In other words, the etcd system is oblivious to the clients using it, and no client information is retained in the service. This allows etcd to use a REST API as its communication interface, obviating the need for special client software or a client library. Since etcd is stateless, it implements certain features very differently than ZooKeeper and Chubby. For instance, watches require a persistent connection with the client using the HTTP long-polling technique, while ephemeral storage requires clients to manually set a time-to-live (TTL) on the data objects and update the TTL periodically.

Hidden data items are another interesting feature of etcd, inspired by hidden files in conventional filesystems. With hidden objects, clients can write items that will not be listed by the system when a file or directory listing is requested; thus only clients who know the exact name of the data object are able to access it.

The original Zab algorithm, as well as many other consensus algorithms, only considers full replicas, which contain the entire write-ahead log and state machine and participate equally in the voting process and in serving read requests. ZooKeeper extends Zab and introduces weighted replicas, which can be assigned different voting weights in the quorum, so the majority condition is converted to more than half of the total weight. Replicas that have zero weight are discarded and not considered when forming quorums.

Table 2 summarizes the main similarities and differences among these Paxos systems. As the table shows, these systems provide expressive and comparable APIs to the clients, allowing developers to utilize them for distributed coordination in many different use cases.

3.3. Proper use criteria for Paxos systems

The relative ease-of-use and generality of the client interfaces of these Paxos systems allow for great flexibility that sometimes leads to misuse. In order to prevent improper and inefficient utilization of Paxos systems, we propose the following criteria for the proper use of Paxos systems. Violating any one of these criteria does not automatically disqualify the application from using a Paxos system; rather, it calls for a more thorough examination of the goals to be achieved and whether a better solution exists.

1) The Paxos system should not be in the performance-critical path of the application. Consensus is an expensive task, therefore a good use case tries to minimize performance degradation by keeping the Paxos system away from the performance-critical and frequently utilized path of the application.

2) The frequency of write operations to the Paxos system should be kept low. The first rule is especially important for write operations, due to the costs associated with achieving consensus and consistently replicating data across the system.

3) The amount of data maintained in the Paxos system should be kept small. Systems like ZooKeeper and etcd are not designed as general-purpose data storage, so a proper adoption of these systems would keep the amount of data maintained/accumulated in the Paxos system to a minimum. Preferably only small metadata should be stored/maintained in the Paxos system.

4) An application adopting the Paxos system should really require strong consistency. Some applications may erroneously adopt a Paxos system when the strong consistency level provided by the Paxos system may not be necessary for the application. Paxos systems linearize all write operations, and such linearizability incurs performance degradation and must be avoided unless it is necessary for the application.

5) An application adopting the Paxos system should not be distributed over the Wide Area Network (WAN). In Paxos systems, the leader and replicas are commonly located in the same datacenter so that the roundtrip times do not affect the performance too badly. Putting the replicas and application clients far away from the leader, e.g., across multiple datacenters and continents, would significantly degrade the performance.

6) The API abstraction should fit the goal. Paxos systems provide a filesystem-like API to the clients, but such an abstraction may not be suitable for all tasks. In some cases, such as the server replication discussed in the next section, dealing with the filesystem abstraction can be so cumbersome and error-prone that a different approach would serve better.

4. Paxos Use Patterns

In this section, we categorize and characterize the most common Paxos use patterns in datacenter and cloud computing applications.

Server Replication (SR). Server replication via the state machine replication (SMR) approach is a canonical application of the Paxos protocol. SMR requires the state machine to be deterministic: multiple copies of the state machine begin in the start state and receive the same inputs in the same order, and each replica will arrive at the same state having generated the same outputs, as sketched in the example below. Paxos is used for serializing and replicating the operations to all nodes in order to ensure that the states of the machines are identical and the same sequence of operations is applied. This use case becomes most practical when the input operations causing the state changes are small compared to the large state maintained at each node. Pure Paxos, Zab, and Raft protocols are better suited to achieve server replication than Paxos systems, since using Paxos systems like ZooKeeper introduces the additional overhead of dealing with the filesystem API and maintaining a separate consensus system cluster in addition to the server to be replicated and its replicas.
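The following minimal Python sketch (our illustration; the operation format is an assumption) shows the determinism property that SMR relies on: replicas that apply the same ordered log of operations to the same initial state end up in identical states.

```python
def apply(state, op):
    """Deterministic transition function of a tiny key-value state machine."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "put":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state

def replay(log, initial=None):
    """Replaying the same ordered log yields the same state on every replica."""
    state = dict(initial or {})
    for op in log:
        state = apply(state, op)
    return state

if __name__ == "__main__":
    # The consensus protocol's job is to make every replica agree on this order.
    log = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None), ("put", "y", 3)]
    replica_a = replay(log)
    replica_b = replay(log)
    assert replica_a == replica_b == {"y": 3}
```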
Log Replication (LR). The objective of log replication is different from that of server replication. Log replication is applied in data integration systems that use the log abstraction to duplicate data across different nodes, while server replication is used in SMR to make copies of the server state. Since Paxos systems such as ZooKeeper have limited storage, they are not typically suitable for the data-centric/intensive task of log replication. Systems like BookKeeper and Kafka are a better fit for this use case, as they remove consensus from the critical path of data replication and employ Paxos only for maintaining the configuration of the system.

Synchronization Service (SS). An important application of consensus is to provide synchronization. Traditionally, concurrent access to shared data is controlled by some form of mutual exclusion through locks. However, such an approach requires applications to build their own failure detection and recovery mechanisms, and a slow or blocked process can harm the overall performance. When the consensus protocol/system is decoupled from the application, the application not only gains fault tolerance for the shared data, but also achieves wait-free concurrent data access with guaranteed consistency.

Google Chubby [2] was originally designed to provide a distributed lock service intended for coarse-grained synchronization of activities, with Multi-Paxos at its heart, but it found wider use in other cases such as name service and as a repository of configuration data. ZooKeeper [3] provides simple code recipes for exclusive locks, fair locks, and shared locks. Since both Chubby and ZooKeeper expose a filesystem interface where each data node is accessed by a hierarchical path, the locks are represented by data nodes created by clients. The data nodes used as locks are usually ephemeral nodes, which can be deleted explicitly or automatically by the system when the session that created them terminates due to a failure. Since locks do not maintain other metadata within the data nodes, all operations on locks are lightweight.

Barrier Orchestration (BO). Large-scale graph processing systems based on the BSP (Bulk Synchronous Parallel) model, like Google Pregel [19], Apache Giraph [20], and Apache Hama [21], use Paxos systems for coordination between computing processes. Since the graph systems process data in an iterative manner, a double barrier is used to synchronize the beginning and the end of each computation iteration across all nodes. Barrier thresholds may be reconfigured during each iteration as the number of units involved in the computation changes.

Of course, violating the proper use criteria of Paxos systems for this task can cause problems. The Facebook Giraph paper [22] discusses the following example of this misuse pattern. Large-scale graph processing systems use aggregators to provide shared state across vertices. Each vertex can send a value to an aggregator in superstep S, a master combines those values using a reduction operator, and the resulting value is made available to all vertices in superstep S+1. In early versions of Giraph, aggregators were implemented using ZooKeeper, violating criterion 3. This does not scale when executing k-means clustering with millions of centroids, because it requires an equal number of aggregators and tens of gigabytes of aggregator data coming to ZooKeeper from all vertices. To solve this issue, Giraph bypassed ZooKeeper and implemented shared aggregators, each randomly running on one of the workers. This solution lost the durability/fault-tolerance of ZooKeeper, but achieved fast performance and scalability.

Configuration Management. Most Paxos systems provide the ability to store arbitrary data by exposing a filesystem or key-value abstraction. This gives applications access to durable and consistent storage for small data items that can be used to maintain configuration metadata like connection details or feature flags. These metadata can be watched for changes, allowing applications to reconfigure themselves when configuration parameters are modified. Leader election (LE), group membership (GM), service discovery (SD), and metadata management (MM) are the main use cases under configuration management, as they are important for cluster management in cloud computing systems.

Message Queues (Q). A common misuse pattern is to use the Paxos system to maintain a distributed queue, such as a publisher-subscriber message queue or a producer-consumer queue. In ZooKeeper, with the use of watchers, one can implement a message queue by letting all clients interested in a certain topic register a watch on the topic znode; messages are then broadcast to all the clients by writing to that znode. Unfortunately, queues in production can contain many thousands of messages, resulting in a large volume of write operations and potentially huge amounts of data going through the Paxos system, violating criterion 3. Moreover, in this case the Paxos system stands in the critical path of every queue operation (violating criteria 1 and 2), and this decreases the performance even further. The Apache BookKeeper and Kafka projects properly address the message queue use case. Both of these distributed pub-sub messaging systems rely on ZooKeeper to manage the metadata aspect of replication (partition information and memberships), and handle replicating the actual data separately. By removing consensus from the critical path of data replication and using it only for configuration management, both systems achieve good throughput and scalable message replication.
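The metadata-only pattern described above (keep membership and partition metadata in the coordination service, replicate the bulk data elsewhere) can be sketched with the kazoo Python client as follows; the ensemble address, paths, and payloads are illustrative assumptions, not taken from Kafka or BookKeeper.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed local ensemble
zk.start()

# Each broker registers itself as an ephemeral znode: the entry disappears
# automatically if the broker's session dies, so the children of /brokers
# always reflect the live membership.
zk.ensure_path("/brokers")
zk.create("/brokers/broker-1", b"10.0.0.1:9092", ephemeral=True)

# Small, rarely changing partition metadata also lives in the Paxos system;
# the message payloads themselves are replicated outside of it.
zk.ensure_path("/topics/events/partition-0")
zk.set("/topics/events/partition-0", b"leader=broker-1,replicas=broker-1|broker-2")

# Everyone watches membership changes to trigger rebalancing.
@zk.ChildrenWatch("/brokers")
def on_membership_change(children):
    print("live brokers:", sorted(children))

zk.stop()
```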
Table 3 summarizes and evaluates the Paxos use patterns in terms of the usage criteria discussed in Section 3. If a usage pattern accesses the consensus system on every data operation, we consider it high frequency; if the application accesses the consensus system only on rare events, it is low frequency. Data volume describes the amount of data required by the pattern on average. The filesystem API abstraction provided by most consensus systems can be easy to utilize for some tasks, but hard to get right without a recipe for others. Overall, a usage pattern is considered a good or a bad use of Paxos systems. Finally, a few substitutes for Paxos systems for particular tasks are also listed for reference.

TABLE 3. EVALUATION OF PATTERNS

Patterns | Frequency | Data Volume    | API  | Paxos system use | Better substitute
SR       | High      | Large          | Hard | Bad              | Replication Protocol
LR       | High      | Medium         | Hard | Bad              | Replication Protocol
SS       | Low       | Small          | Hard | OK               | Distributed Locks
BO       | Low       | Depends        | Easy | OK               |
SD       | Low       | Small          | Easy | Good             |
GM       | Low       | Small          | Easy | Good             |
LE       | Low       | Small          | Easy | Good             |
MM       | Medium    | Medium         | Easy | Good             | Distributed Datastore
Q        | High      | Large / Medium | Hard | Bad              | Kafka

5. Paxos Use in Production Systems

In this section, we examine the infrastructure stacks of Google and Facebook, and the Apache Foundation's top-level projects, to determine where Paxos protocols and Paxos systems are used and what they are used for.

5.1. Google Stack

We reviewed Google's publications to gather information about the Google infrastructure stack. The literature related to each system is the sole source of our analysis of 18 projects, among which 7 projects implement Paxos or directly depend on a consensus system.

At the bottom of the stack are the coordination and cluster management systems. Borg [23] is Google's cluster manager that runs hundreds of thousands of jobs from many applications in multiple clusters. Borg uses Chubby as a consistent, highly available Paxos store to hold the metadata for each submitted job and running task. Borg uses Paxos in several places: (1) to write the hostname and port of each task into Chubby to provide a naming service; (2) the Borgmaster implements Paxos for leader election and master state replication; (3) Borg uses Paxos as a queue of newly submitted jobs, which helps scheduling. Kubernetes [24] is a recent open-source project from Google for Linux container cluster management. Kubernetes adopts etcd to keep state, such as resource server membership and cluster metadata.

The second layer of the stack contains several data storage components. The most important storage service is the Google File System (GFS) [25] (and its successor, called Colossus [26]). GFS uses Chubby for master election and master state replication. Bigtable [27] is Google's distributed storage system for structured data. It relies heavily on Chubby for a variety of tasks: (1) to ensure there is at most one active master at any time (leader election); (2) to store the bootstrap location of Bigtable data (metadata management); (3) to discover tablet servers and finalize tablet server deaths (group membership); and (4) to store Bigtable schemas (configuration management). Megastore [28] is a cross-datacenter database that provides ACID semantics within fine-grained and replicated partitions of data. Megastore is also the largest deployed system that uses Paxos to replicate primary user data across datacenters on every write. It extends Paxos to synchronously replicate multiple write-ahead logs, each governing its own partition of the data set. Google Spanner [26] is a globally distributed, multiversion database that shards data across many sets of Paxos state machines. Each spanserver implements a single Paxos state machine on top of each tablet for replication, and a set of replicas is collectively a Paxos group.

5.2. Facebook Stack

Facebook takes advantage of many open-source projects to build the foundation of its distributed architecture. The storage layer is represented by both traditional relational systems and NoSQL databases. HDFS [29] serves as the basis for supporting large distributed key-value stores such as HBase [30]. HDFS is part of the Apache Hadoop project and uses ZooKeeper for all its coordination needs: resource manager state replication, leader election, and configuration metadata replication. HBase is built on top of HDFS and utilizes ZooKeeper for managing configuration metadata, region server replication, and shared-resource concurrency control. Cassandra [31] is another distributed data store that relies on ZooKeeper for tasks like leader election, configuration metadata management, and service discovery. MySQL is used as a relational backbone at Facebook, with many higher-level storage systems, such as Haystack and f4, interacting with it.

The data processing layer utilizes the resources of the storage architecture, making Facebook rely on systems that integrate well with HDFS and HBase. Hadoop operates on top of HDFS and provides MapReduce infrastructure to various Facebook components. Hive [32] is open-source software created for data warehousing on top of Hadoop and HBase. It uses ZooKeeper as its consensus system for implementing metadata storage and a lock service.

Facebook uses a modified and optimized version of ZooKeeper, called Zeus [33]. Currently Zeus is adopted in the Configerator project, an internal tool for managing configuration parameters of production systems. Zeus's role in the process is serializing and replicating configuration changes across a large number of servers located throughout Facebook's datacenters. It is likely that over time more of the company's systems are going to use Zeus.
5.3. Apache Projects

The Apache Foundation has many distributed systems that require a consensus algorithm for a variety of reasons. ZooKeeper, being an Apache project itself, can be seen as the de facto coordination system for other applications under the Apache umbrella. Currently, about 31% of Apache projects in the BigData, Cloud, and Database categories² directly use ZooKeeper, while many more applications rely on other projects that depend on Apache's consensus system. This makes ZooKeeper an integral part of the open-source distributed systems infrastructure. Below we briefly mention some of the more prominent systems adopting ZooKeeper as a consensus service.

2. https://projects.apache.org/projects.html?category

Apache Accumulo [34] is a distributed key-value store based on Google's Bigtable design. This project is similar to Apache HBase and uses ZooKeeper in some of the identical ways: shared-resource mutual exclusion, configuration metadata management, and task serialization.

The BookKeeper [5] project implements a durable replicated distributed log mechanism. It employs ZooKeeper for storing log metadata and for keeping track of configuration changes, such as server failures or additions. Hedwig is a publish-subscribe message delivery system developed on top of BookKeeper. The system consists of a set of hubs that are responsible for handling various topics. Hedwig uses ZooKeeper to write hub metadata, such as the topics served by each hub. Apache Kafka is another publish-subscribe messaging software; unlike Hedwig, it does not rely on BookKeeper, but it uses ZooKeeper for a number of tasks: keeping track of the removal or addition of nodes, coordinating the rebalancing of resources when the system configuration changes, and keeping track of consumed messages [6].

The Apache Solr [35] search engine uses ZooKeeper for its distributed sharded deployments. Among other things, ZooKeeper is used for leader election, for distributed data structures such as queues and maps, and for storing various configuration parameters.

TABLE 4. PATTERNS OF PAXOS USE IN PROJECTS

Project      | Consensus System
GFS          | Chubby
Borg         | Chubby/Paxos
Kubernetes   | etcd
Megastore    | Paxos
Spanner      | Paxos
Bigtable     | Chubby
Hadoop/HDFS  | ZooKeeper
HBase        | ZooKeeper
Hive         | ZooKeeper
Configerator | Zeus
Cassandra    | ZooKeeper
Accumulo     | ZooKeeper
BookKeeper   | ZooKeeper
Hedwig       | ZooKeeper
Kafka        | ZooKeeper
Solr         | ZooKeeper
Giraph       | ZooKeeper
Hama         | ZooKeeper
Mesos        | ZooKeeper
CoreOS       | etcd
OpenStack    | ZooKeeper
Neo4j        | ZooKeeper
configuration parameters.

5.4. Evaluation of Paxos use

In order to evaluate the way Paxos protocols and Paxos systems are adopted by Google's and Facebook's software stacks and Apache top-level projects, we have classified the usage patterns into nine broad categories: server replication (SR), log replication (LR), synchronization service (SS), barrier orchestration (BO), service discovery (SD), group membership (GM), leader election (LE), metadata management (MM), and distributed queues (Q). The majority of the investigated systems have used consensus for tasks in more than one category. Table 4 summarizes the various usage patterns observed in different systems.

Figure 5 shows the frequency of consensus systems being used for each of the task categories. As can be seen, metadata management accounts for 27% of all usages and is the most popular adoption scenario of consensus systems, closely followed by leader election. It is worth keeping in mind that metadata management is a rather broad category that encompasses many usage arrangements by the end systems, including application configuration management and managing the state of some internal objects or components. Synchronization service is another popular application for consensus protocols, since they greatly simplify the implementation of distributed locks. Distributed queues, despite being one of the suggested ZooKeeper recipes, are used least frequently. This can be attributed to the fact that using consensus for managing large queues may negatively impact performance.

Figure 5. Relative Frequency of Consensus Systems Usage for Various Tasks

6. Concluding Remarks

We compared/contrasted popular Paxos protocols and Paxos systems and investigated how they are adopted by production systems at Google, Facebook, and other cloud computing domains. We find that while Paxos systems (and in particular ZooKeeper) are popularly adopted for coordination, they are often over-used and misused. Paxos systems are suitable as a durable metadata management service. However, the metadata should remain small in size, should not accumulate in size over time, and should not get updated frequently.
When Paxos systems are improperly used, they constitute the bottleneck in performance and cause scalability problems. A major limitation on Paxos systems' performance and scalability is that they fail to provide a way to denote/process commutative operations via the Generalized Paxos protocol. In other words, they force every operation to be serialized through a single master. We conclude by identifying tradeoffs for coordination services and point out promising directions for achieving more scalable coordination in the clouds.

Trading off strong consistency for fast performance. If strong consistency is not the primary focus and requirement of the application, then instead of a Paxos-protocol/Paxos-system solution, an eventually-consistent solution replicated via optimistic replication techniques [36] can be used for fast performance, often with good consistency guarantees. For example, in Facebook's TAO system, which is built over a 2-level Memcached architecture, only 0.0004% of reads violate linearizability [37]. Adopting Paxos would have solved those linearizability violations, but providing better performance is more favorable than eliminating a tiny fraction of bad reads in the context of TAO's usage.

Trading off performance for more expressivity. ZooKeeper provides a filesystem API as the coordination interface. This has shortcomings and is unnatural for many tasks. Tango [38] provides a richer interface by working at a lower layer. Tango uses Paxos to maintain a consistent and durable log of operations. Tango clients then use this log to view-materialize different interfaces to serve. One Tango client can serve a B+tree interface by reading and materializing the current state from the log, another client a queue interface, and yet another a filesystem interface. Unfortunately, this requires the Tango clients to forward each read/write update to obtain a Paxos read/commit from the log, which degrades the performance.

Trading off expressivity for fast performance and scalability. In Paxos systems, the performance of update operations is a concern and can constitute a bottleneck. By using a less expressive interface, it is possible to improve this performance. If a plain distributed key-value store interface is sufficient for an application, a consistently replicated key-value store based on chain replication [39] can be used. Chain replication uses Paxos only for managing the configuration of the replicas, and replication is achieved with very good throughput without requiring a Paxos commit for each operation. Furthermore, in this solution reads can be answered consistently by any replica. However, this solution forgoes the multi-item coordination in the Paxos systems' API, since its API is restricted to put and get operations on keys.

Along the same lines, an in-memory transactional key-value store can also be considered as a fast and scalable alternative. Sinfonia [40] provides transactions over an in-memory distributed key-value store. These transactions are minitransactions that are less expressive than general transactions, but they can be executed in one roundtrip. Recently the RAMCloud system [41] showed how to implement the Sinfonia transactions in a fast and durable manner using a write-ahead replicated log structure. Since Paxos systems are limited by the size of one node's memory, RAMCloud/Sinfonia provides a scalable and fast alternative with minitransactions over multiple records, with a slightly less expressive API.
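As a rough illustration of the chain replication design mentioned above, here is a minimal, single-process Python sketch (entirely our own; node names and the in-memory "transport" are assumptions). Writes enter at the head and propagate down the chain; in the basic protocol reads are answered at the tail, which only reflects fully propagated writes (variants allow consistent reads at other replicas); a Paxos protocol or system is needed only to agree on the chain configuration itself.

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}        # replica state
        self.successor = None  # next node in the chain (None for the tail)

    def write(self, key, value):
        # Apply locally, then forward down the chain; the write counts as
        # committed once it reaches the tail.
        self.store[key] = value
        if self.successor is not None:
            self.successor.write(key, value)

    def read(self, key):
        return self.store.get(key)

def build_chain(names):
    nodes = [ChainNode(n) for n in names]
    for a, b in zip(nodes, nodes[1:]):
        a.successor = b
    return nodes

if __name__ == "__main__":
    # The chain membership/order is the small piece of metadata that the
    # consensus service would maintain; the data itself never touches it.
    head, middle, tail = build_chain(["head", "middle", "tail"])
    head.write("x", 42)
    assert tail.read("x") == 42  # reads at the tail see only committed writes
```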
References

[1] L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133–169, 1998.
[2] M. Burrows, "The Chubby lock service for loosely-coupled distributed systems," in OSDI. USENIX Association, 2006, pp. 335–350.
[3] P. Hunt, M. Konar, F. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for internet-scale systems," in USENIX ATC, vol. 10, 2010.
[4] F. Junqueira, B. Reed, and M. Serafini, "Zab: High-performance broadcast for primary-backup systems," in Dependable Systems & Networks (DSN). IEEE, 2011, pp. 245–256.
[5] F. P. Junqueira, I. Kelly, and B. Reed, "Durability with BookKeeper," ACM SIGOPS Operating Systems Review, vol. 47, no. 1, pp. 9–15, 2013.
[6] J. Kreps, N. Narkhede, J. Rao et al., "Kafka: A distributed messaging system for log processing," NetDB, 2011.
[7] D. Ongaro and J. Ousterhout, "In search of an understandable consensus algorithm," in 2014 USENIX Annual Technical Conference (USENIX ATC 14), 2014, pp. 305–319.
[8] F. B. Schneider, "Implementing fault-tolerant services using the state machine approach: A tutorial," ACM Computing Surveys, vol. 22, no. 4, pp. 299–319, Dec. 1990.
[9] L. Lamport, "Paxos made simple," ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.
[10] L. Lamport, "Fast Paxos," Distributed Computing, vol. 19, no. 2, pp. 79–103, 2006.
[11] L. Lamport, "Generalized consensus and Paxos," Technical Report MSR-TR-2005-33, Microsoft Research, 2005.
[12] T. D. Chandra, R. Griesemer, and J. Redstone, "Paxos made live: An engineering perspective," in Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing. ACM, 2007, pp. 398–407.
[13] I. Moraru, D. G. Andersen, and M. Kaminsky, "There is more consensus in egalitarian parliaments," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 358–372.
[14] R. Van Renesse and D. Altinbuken, "Paxos made moderately complex," ACM Computing Surveys (CSUR), vol. 47, no. 3, p. 42, 2015.
[15] D. Dolev, C. Dwork, and L. Stockmeyer, "On the minimal synchronism needed for distributed consensus," Journal of the ACM (JACM), vol. 34, no. 1, pp. 77–97, 1987.
[16] T. Chandra and S. Toueg, "Unreliable failure detectors for reliable distributed systems," Journal of the ACM, vol. 43, no. 2, 1996.
[17] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica, "Probabilistically bounded staleness for practical partial quorums," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 776–787, 2012.
[18] "etcd," https://coreos.com/etcd/.
[19] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). New York, NY, USA: ACM, 2010, pp. 135–146.
[20] "Apache Giraph project," http://incubator.apache.org/giraph/.
[21] "Apache Hama project," http://hama.apache.org/.
[22] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan, "One trillion edges: Graph processing at Facebook-scale," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1804–1815, 2015.
[23] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg," in Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015, p. 18.
[24] "Google Kubernetes project," http://kubernetes.io/.
[25] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37/5. ACM, 2003, pp. 29–43.
[26] J. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford, "Spanner: Google's globally-distributed database," Proceedings of OSDI, 2012.
[27] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[28] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J. Léon, Y. Li, A. Lloyd, and V. Yushprakh, "Megastore: Providing scalable, highly available storage for interactive services," CIDR, pp. 223–234, 2011.
[29] "Apache Hadoop project," http://hadoop.apache.org/.
[30] L. George, HBase: The Definitive Guide, 1st ed. O'Reilly Media, 2011.
[31] A. Lakshman and P. Malik, "Cassandra: Structured storage system on a P2P network," in Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC '09), 2009, pp. 5–5.
[32] "Apache Hive project," http://hive.apache.org/.
[33] C. Tang, T. Kooburat, P. Venkatachalam, A. Chandler, Z. Wen, A. Narayanan, P. Dowell, and R. Karl, "Holistic configuration management at Facebook," Symposium on Operating Systems Principles (SOSP), pp. 328–343, 2015.
[34] "Apache Accumulo project," http://accumulo.apache.org/.
[35] "Apache Solr," http://lucene.apache.org/solr/.
[36] Y. Saito and M. Shapiro, "Optimistic replication," ACM Computing Surveys (CSUR), vol. 37, no. 1, pp. 42–81, 2005.
[37] H. Lu, K. Veeraraghavan, P. Ajoux, J. Hunt, Y. J. Song, W. Tobagus, S. Kumar, and W. Lloyd, "Existential consistency: Measuring and understanding consistency at Facebook," Symposium on Operating Systems Principles (SOSP), pp. 295–310, 2015.
[38] M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. D. Davis, S. Rao, T. Zou, and A. Zuck, "Tango: Distributed data structures over a shared log," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 325–340.
[39] R. van Renesse and F. B. Schneider, "Chain replication for supporting high throughput and availability," in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, vol. 6, 2004.
[40] M. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis, "Sinfonia: A new paradigm for building scalable distributed systems," in ACM SIGOPS Operating Systems Review, vol. 41. ACM, 2007, pp. 159–174.
[41] C. Lee, S. J. Park, A. Kejriwal, S. Matsushita, and J. Ousterhout, "Implementing linearizability at large scale and low latency," in Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015, pp. 71–86.
