Paxos_demystifying
Abstract—Coordination and consensus play an important role in datacenter and cloud computing, particularly in leader election, group membership, cluster management, service discovery, resource/access management, and consistent replication of the master nodes in services. Paxos protocols and systems provide a fault-tolerant solution to the distributed consensus problem and have attracted significant attention as well as generating substantial confusion. In order to elucidate the correct use of distributed coordination systems, we compare and contrast popular Paxos protocols and Paxos systems and present advantages and disadvantages for each. We also categorize the coordination use-patterns in the cloud, and examine Google and Facebook infrastructures, as well as Apache top-level projects, to investigate how they use Paxos protocols and systems. Finally, we analyze tradeoffs in the distributed coordination domain and identify promising future directions for achieving more scalable distributed coordination systems.

1. Introduction

Cloud computing deals mainly with big data storage, processing, and serving. While these are mostly embarrassingly parallel tasks, coordination still plays a major role in cloud computing systems. Coordination is needed for leader election, group membership, cluster management, service discovery, resource/access management, consistent replication of the master nodes in services, and finally for barrier-orchestration when running large analytic tasks.

The coordination problem has been studied extensively by the theory of distributed systems under the name “distributed consensus”. This problem has been the subject of several impossibility results: while consensus is easy in the absence of faults, it becomes prone to intricate failure scenarios in the presence of lossy channels, crashed participants, and violations of synchrony/timing assumptions. Several algorithms have been proposed to tackle the problem; however, Paxos, introduced in 1989 [1], stood out from the pack as it provided a simple, formally-proven algorithm that deals with the challenges of asynchrony, process crash/recovery, and message loss in an elegant and uniform manner.

Paxos's rise to fame had to wait until after large-scale web services and datacenter computing took off in the 2000s. Around that time Google was already running into the fault-induced corner cases that cause service downtimes. A fault-tolerant coordination service was needed for the Google File System (GFS), and Google adopted Paxos for implementing the GFS lock service, namely Google Chubby [2]. The Google Chubby project boosted interest in the industry in using Paxos protocols and Paxos systems for fault-tolerant coordination.

An open-source implementation of the Google Chubby lock service was provided by the Apache ZooKeeper project [3]. ZooKeeper generalized the Chubby interface slightly and provided a general, ready-to-use system for “coordination as a service”. ZooKeeper used a Paxos-variant protocol, Zab [4], for solving the distributed consensus problem. Since Zab is embedded in the ZooKeeper implementation, it remained obscure and did not get adopted as a generic Paxos consensus component. Instead of the Zab component, which required a lot of work to integrate into an application, ZooKeeper's ready-to-use file-system abstraction interface gained popularity and became the de facto coordination service for cloud computing applications. However, since the bar on using the ZooKeeper interface was so low, it has been abused/misused by many applications. When ZooKeeper is improperly used, it often constitutes the bottleneck in the performance of these applications and causes scalability problems.

Recently, Paxos protocols and Paxos systems have quickly grown in number, adding further options to the choice of which consensus/coordination protocols/systems to use. Leveraging ZooKeeper, the BookKeeper [5] and Kafka [6] projects introduced log/stream replication services. The Raft protocol [7] went back to fundamentals and provided an open-source implementation of the Paxos protocol as a reusable component. Despite the increased choices and specialization of Paxos protocols and Paxos systems, confusion remains about the proper use cases of these systems and about which systems are more suitable for which tasks. A common pitfall has been to confuse the Paxos protocols with the Paxos systems built on top of these protocols (see Figure 1). Paxos protocols (such as Zab and Raft) are useful as low-level components for server replication, whereas Paxos systems (such as ZooKeeper) have often been shoehorned to that task. The proper use case for Paxos systems is in highly-available/durable metadata management, under the condition that the metadata remains small, does not grow over time, and is not updated frequently.
Figure 1. Paxos protocols versus Paxos systems
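To make the distinction in Figure 1 concrete, the snippet below sketches the typical "Paxos system" usage pattern: an application stores and reads a small piece of durable metadata through ZooKeeper's file-system-like API and leaves consensus entirely to the service. It is a minimal illustration; the ensemble address and znode path are assumptions, not taken from any of the systems surveyed here.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;
import java.nio.charset.StandardCharsets;

// Minimal sketch: using a Paxos system (ZooKeeper) for durable metadata management.
// The ensemble address and znode path are illustrative assumptions.
public class MetadataExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { /* ignore session events */ });

        String path = "/cluster-config";
        byte[] value = "shard-count=4".getBytes(StandardCharsets.UTF_8);

        // Create the znode if it does not exist yet; ZooKeeper replicates it via Zab.
        // (A real client would also handle the create/exists race.)
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the metadata back; every access goes through the replicated service.
        Stat stat = new Stat();
        byte[] stored = zk.getData(path, false, stat);
        System.out.println("config=" + new String(stored, StandardCharsets.UTF_8)
                           + " version=" + stat.getVersion());
        zk.close();
    }
}

A "Paxos protocol" usage, by contrast, would embed a consensus library (e.g., a Raft implementation) directly inside the replicated server itself rather than calling out to a separate coordination service.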
are responsible for handling various topics. Hedwig uses ZooKeeper to write hub metadata, such as the topics served by each hub. Apache Kafka is another publish-subscribe messaging system; unlike Hedwig, it does not rely on BookKeeper, but it uses ZooKeeper for a number of tasks: keeping track of the removal or addition of nodes, coordinating the rebalancing of resources when the system configuration changes, and keeping track of consumed messages [6].
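The node-tracking tasks listed above are commonly built on ephemeral znodes: each process registers itself under a membership path, and interested parties watch the children of that path. The sketch below shows this generic pattern; the paths and identifiers are illustrative and do not reflect Kafka's actual ZooKeeper layout.

import org.apache.zookeeper.*;
import java.util.List;
import java.util.UUID;

// Generic group-membership sketch using ZooKeeper ephemeral znodes.
// The /members path and node IDs are illustrative, not Kafka's real layout.
public class MembershipExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, e -> { });

        // Ensure the parent path exists (ignoring the create/exists race for brevity).
        if (zk.exists("/members", false) == null) {
            zk.create("/members", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register this process as an ephemeral znode; it is removed automatically
        // when the session ends, signalling that the node has left the group.
        String myId = "node-" + UUID.randomUUID();
        zk.create("/members/" + myId, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // List the current members and set a one-shot watch; a real client would
        // re-read the children and re-register the watch when it fires.
        List<String> members = zk.getChildren("/members",
                event -> System.out.println("membership changed: " + event));
        System.out.println("current members: " + members);
    }
}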
The Apache Solr [35] search engine uses ZooKeeper for its distributed sharded deployments. Among other things, ZooKeeper is used for leader election, for distributed data structures such as queues and maps, and for storing various configuration parameters.
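Leader election on top of ZooKeeper typically follows the ephemeral-sequential recipe: each candidate creates an ephemeral sequential znode, and the candidate holding the smallest sequence number acts as leader. The sketch below is a simplified version of that recipe (it re-lists all children instead of watching only the predecessor); the /election path is an assumption.

import org.apache.zookeeper.*;
import java.util.Collections;
import java.util.List;

// Simplified leader-election sketch using ephemeral sequential znodes.
// Production recipes also watch the predecessor node to avoid the herd effect.
public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, e -> { });
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each candidate creates an ephemeral, monotonically numbered znode.
        String me = zk.create("/election/candidate-", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the smallest sequence number is the leader.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        boolean leader = me.endsWith(candidates.get(0));
        System.out.println((leader ? "I am the leader: " : "Following, my node: ") + me);
    }
}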
In order to evaluate the way Paxos protocols and Paxos systems are adopted by Google's and Facebook's software stacks and by Apache top-level projects, we have classified the usage patterns into nine broad categories: server replication (SR), log replication (LR), synchronization service (SS), barrier orchestration (BO), service discovery (SD), group membership (GM), leader election (LE), metadata management (MM), and distributed queues (Q). The majority of the investigated systems have used consensus for tasks in more than one category. Table 4 summarizes the usage patterns observed in the different systems.

Figure 5. Relative Frequency of Consensus Systems Usage for Various Tasks

Figure 5 shows how frequently consensus systems are used for each of the task categories. As can be seen, metadata management accounts for 27% of all usages and is the most popular adoption scenario for consensus systems, closely followed by leader election. It is worth keeping in mind that metadata management is a rather broad category that encompasses many usage arrangements by the end systems, including application configuration management and managing the state of internal objects or components. Synchronization service is another popular application for consensus protocols, since they greatly simplify the implementation of distributed locks. Distributed queues, despite being one of the suggested ZooKeeper recipes, are used least frequently. This can be attributed to the fact that using consensus for managing large queues may negatively impact performance.

2. https://projects.apache.org/projects.html?category
6. Concluding Remarks

We compared and contrasted popular Paxos protocols and Paxos systems and investigated how they are adopted by production systems at Google, Facebook, and other cloud computing domains. We find that while Paxos systems (and in particular ZooKeeper) are popularly adopted for coordination, they are often over-used and misused. Paxos systems are suitable as a durable metadata management service; however, the metadata should remain small in size, should not accumulate over time, and should not get updated frequently. When Paxos systems are improperly used, they constitute a performance bottleneck and cause scalability problems. A major limitation on the performance and scalability of Paxos systems is that they fail to provide a way to denote and process commutative operations, as the Generalized Paxos protocol does. In other words, they force every operation to be serialized through a single master.

We conclude by identifying tradeoffs for coordination services and pointing out promising directions for achieving more scalable coordination in the clouds.
Trading off strong consistency for fast performance. If strong consistency is not the primary focus and requirement of the application, then instead of a Paxos-protocol/Paxos-system solution, an eventually-consistent solution replicated via optimistic replication techniques [36] can be used for fast performance, often with good consistency guarantees. For example, in Facebook's TAO system, which is built over a two-level Memcached architecture, only 0.0004% of reads violate linearizability [37]. Adopting Paxos would have eliminated those linearizability violations, but providing better performance is more favorable than eliminating such a tiny fraction of bad reads in the context of TAO usage.
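As a concrete (and deliberately generic) illustration of the optimistic-replication alternative, the sketch below shows a last-writer-wins register: each replica accepts writes locally without a consensus round, and replicas converge by keeping the value with the newest timestamp when they exchange state. This is not TAO's or Memcached's actual mechanism.

// Minimal last-writer-wins register illustrating optimistic replication:
// replicas accept writes locally (no consensus round) and converge by keeping
// the value with the highest timestamp when they sync.
public class LwwRegisterExample {
    static class Replica {
        private String value;
        private long timestamp;              // a hybrid/logical clock in practice

        void writeLocal(String v, long ts) { // fast local write, no coordination
            if (ts > timestamp) { value = v; timestamp = ts; }
        }

        void mergeFrom(Replica other) {      // anti-entropy/merge step
            if (other.timestamp > timestamp) { value = other.value; timestamp = other.timestamp; }
        }

        String read() { return value; }      // may be stale until replicas merge
    }

    public static void main(String[] args) {
        Replica a = new Replica(), b = new Replica();
        a.writeLocal("v1", 1);
        b.writeLocal("v2", 2);        // concurrent write on another replica
        a.mergeFrom(b);               // replicas exchange state and converge
        System.out.println(a.read()); // prints v2: the later write wins
    }
}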
Trading off performance for more expressivity. ZooKeeper provides a filesystem API as the coordination interface. This has shortcomings and is unnatural for many tasks. Tango [38] provides a richer interface by working at a lower layer. Tango uses Paxos to maintain a consistent and durable log of operations. Tango clients then use this log to materialize different interfaces to serve: one Tango client can serve a B+tree interface by reading and materializing the current state from the log, another client a queue interface, and yet another a filesystem interface. Unfortunately, this requires the Tango clients to forward each read/write through the log to obtain a Paxos read/commit, which degrades performance.
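The log-materialization idea can be sketched in a few lines. Below, a toy in-process SharedLog stands in for Tango's Paxos-backed durable log, and a client materializes a key-value map view by replaying it; the class and method names are illustrative, not Tango's API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of log-based view materialization in the style of Tango.
// SharedLog stands in for the Paxos-backed durable log; here it is a plain
// in-process list so the example stays self-contained.
public class LogViewExample {
    record Entry(String op, String key, String value) { }

    static class SharedLog {
        private final List<Entry> entries = new ArrayList<>();
        synchronized void append(Entry e) { entries.add(e); }          // "commit"
        synchronized List<Entry> readFrom(int pos) {                    // "read"
            return new ArrayList<>(entries.subList(pos, entries.size()));
        }
    }

    // A client materializes a key-value map view by replaying the log;
    // another client could replay the same log into a queue or a B+tree.
    static class MapView {
        private final SharedLog log;
        private final Map<String, String> state = new HashMap<>();
        private int applied = 0;
        MapView(SharedLog log) { this.log = log; }

        void put(String k, String v) { log.append(new Entry("put", k, v)); }

        String get(String k) {
            // Every read first syncs with the log (the step that costs a Paxos read in Tango).
            for (Entry e : log.readFrom(applied)) {
                if (e.op().equals("put")) state.put(e.key(), e.value());
                applied++;
            }
            return state.get(k);
        }
    }

    public static void main(String[] args) {
        SharedLog log = new SharedLog();
        MapView view = new MapView(log);
        view.put("leader", "node-3");
        System.out.println(view.get("leader"));  // prints node-3 after replaying the log
    }
}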
Trading off expressivity for fast performance and scalability. In Paxos systems, the performance of update operations is a concern and can constitute a bottleneck. By using a less expressive interface, it is possible to improve this performance. If a plain distributed key-value store interface is sufficient for an application, a consistently-replicated key-value store based on chain replication [39] can be used. Chain replication uses Paxos only for managing the configuration of the replicas, and replication is achieved with very good throughput without requiring a Paxos commit for each operation. Furthermore, in this solution reads can be answered consistently by any replica. However, this solution forgoes the multi-item coordination in the Paxos systems API, since its API is restricted to put and get operations on single keys.
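The following toy sketch illustrates the chain-replication data path under the assumption of an in-process chain of three replicas: updates enter at the head and are pushed node by node to the tail, after which they are present on every replica. The Paxos-managed configuration service that tracks chain membership is omitted.

import java.util.HashMap;
import java.util.Map;

// Toy in-process sketch of chain replication: writes enter at the head and are
// forwarded down the chain; once the tail has applied a write it is present on
// every replica, so it can be read back consistently.
public class ChainReplicationExample {
    static class Replica {
        private final Map<String, String> store = new HashMap<>();
        private Replica successor;            // null for the tail
        void setSuccessor(Replica s) { this.successor = s; }

        // Apply locally, then propagate down the chain; returns when the tail has applied.
        void write(String key, String value) {
            store.put(key, value);
            if (successor != null) successor.write(key, value);
        }

        String read(String key) { return store.get(key); }
    }

    public static void main(String[] args) {
        Replica head = new Replica(), middle = new Replica(), tail = new Replica();
        head.setSuccessor(middle);
        middle.setSuccessor(tail);

        head.write("config/leader", "node-7");          // update enters at the head
        System.out.println(tail.read("config/leader")); // read after tail applied: node-7
    }
}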
Along the same lines, an in-memory transactional key-value store can also be considered as a fast and scalable alternative. Sinfonia [40] provides transactions over an in-memory distributed key-value store. These transactions are minitransactions, which are less expressive than general transactions but can be executed in one roundtrip. Recently, the RAMCloud system [41] showed how to implement Sinfonia-style minitransactions in a fast and durable manner using a write-ahead replicated log structure. Since Paxos systems are limited by the size of one node's memory, RAMCloud/Sinfonia provides a scalable and fast alternative with minitransactions over multiple records, with a slightly less expressive API.
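As a rough illustration of why minitransactions are cheaper than general transactions, the sketch below applies a Sinfonia-style set of compare items and write items in a single atomic step against one in-process store; in Sinfonia/RAMCloud the items live on multiple memory/storage nodes and the check-and-apply is piggybacked on the commit round. Names and structure are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of a Sinfonia-style minitransaction: a set of compare items and
// write items that either all apply or none do, decided in one step.
public class MinitransactionExample {
    record Compare(String key, String expected) { }
    record Write(String key, String value) { }

    static class Store {
        private final Map<String, String> data = new HashMap<>();

        // Atomically: check every compare item, and only then apply every write item.
        synchronized boolean execute(List<Compare> compares, List<Write> writes) {
            for (Compare c : compares) {
                if (!c.expected().equals(data.get(c.key()))) return false;  // abort
            }
            for (Write w : writes) data.put(w.key(), w.value());            // commit
            return true;
        }

        synchronized String get(String key) { return data.get(key); }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.execute(List.of(), List.of(new Write("lease-owner", "node-1")));

        // Conditional hand-off: succeeds only if node-1 still owns the lease.
        boolean ok = store.execute(List.of(new Compare("lease-owner", "node-1")),
                                   List.of(new Write("lease-owner", "node-2")));
        System.out.println("handoff committed: " + ok + ", owner=" + store.get("lease-owner"));
    }
}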
References

[1] L. Lamport, “The part-time parliament,” ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133–169, 1998.

[2] M. Burrows, “The Chubby lock service for loosely-coupled distributed systems,” in OSDI. USENIX Association, 2006, pp. 335–350.

[3] P. Hunt, M. Konar, F. Junqueira, and B. Reed, “ZooKeeper: Wait-free coordination for internet-scale systems,” in USENIX ATC, vol. 10, 2010.

[4] F. Junqueira, B. Reed, and M. Serafini, “Zab: High-performance broadcast for primary-backup systems,” in Dependable Systems & Networks (DSN). IEEE, 2011, pp. 245–256.

[5] F. P. Junqueira, I. Kelly, and B. Reed, “Durability with BookKeeper,” ACM SIGOPS Operating Systems Review, vol. 47, no. 1, pp. 9–15, 2013.

[6] J. Kreps, N. Narkhede, J. Rao et al., “Kafka: A distributed messaging system for log processing,” NetDB, 2011.

[7] D. Ongaro and J. Ousterhout, “In search of an understandable consensus algorithm,” in 2014 USENIX Annual Technical Conference (USENIX ATC 14), 2014, pp. 305–319.

[8] F. B. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Computing Surveys, vol. 22, no. 4, pp. 299–319, Dec. 1990.

[9] L. Lamport, “Paxos made simple,” ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

[10] L. Lamport, “Fast Paxos,” Distributed Computing, vol. 19, no. 2, pp. 79–103, 2006.

[11] L. Lamport, “Generalized consensus and Paxos,” Microsoft Research, Tech. Rep. MSR-TR-2005-33, 2005.

[12] T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective,” in Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing. ACM, 2007, pp. 398–407.

[13] I. Moraru, D. G. Andersen, and M. Kaminsky, “There is more consensus in egalitarian parliaments,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 358–372.

[14] R. Van Renesse and D. Altinbuken, “Paxos made moderately complex,” ACM Computing Surveys (CSUR), vol. 47, no. 3, p. 42, 2015.

[15] D. Dolev, C. Dwork, and L. Stockmeyer, “On the minimal synchronism needed for distributed consensus,” Journal of the ACM (JACM), vol. 34, no. 1, pp. 77–97, 1987.

[16] T. Chandra and S. Toueg, “Unreliable failure detectors for reliable distributed systems,” Journal of the ACM, vol. 43, no. 2, 1996.

[17] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica, “Probabilistically bounded staleness for practical partial quorums,” Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 776–787, 2012.

[18] “etcd,” https://coreos.com/etcd/.

[19] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). New York, NY, USA: ACM, 2010, pp. 135–146. [Online]. Available: http://doi.acm.org/10.1145/1807167.1807184

[20] “Apache Giraph project,” http://incubator.apache.org/giraph/.
[21] “Apache hama project,” http://hama.apache.org/.

[22] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan, “One trillion edges: graph processing at facebook-scale,” Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1804–1815, 2015.
[23] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and
J. Wilkes, “Large-scale cluster management at google with borg,” in
Proceedings of the Tenth European Conference on Computer Systems.
ACM, 2015, p. 18.
[24] “Google kubernetes project,” http://kubernetes.io/.
[25] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,”
in ACM SIGOPS Operating Systems Review, vol. 37/5. ACM, 2003,
pp. 29–43.
[26] J. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman,
S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh,
S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura,
D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak,
C. Taylor, R. Wang, and D. Woodford, “Spanner: Google’s globally-
distributed database,” Proceedings of OSDI, 2012.
[27] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. Gruber, “Bigtable: A distributed storage
system for structured data,” ACM Transactions on Computer Systems
(TOCS), vol. 26, no. 2, p. 4, 2008.
[28] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson,
J. Léon, Y. Li, A. Lloyd, and V. Yushprakh, “Megastore: Providing
scalable, highly available storage for interactive services,” CIDR, pp.
223–234, 2011.
[29] “Apache hadoop project,” http://hadoop.apache.org/.
[30] L. George, HBase: The Definitive Guide, 1st ed. O’Reilly Media, 2011.
[31] A. Lakshman and P. Malik, “Cassandra: Structured storage system
on a p2p network,” in Proceedings of the 28th ACM Symposium on
Principles of Distributed Computing, ser. PODC ’09, 2009, pp. 5–5.
[32] “Apache hive project,” http://hive.apache.org/.
[33] C. Tang, T. Kooburat, P. Venkatachalam, A. Chandler, Z. Wen, A. Narayanan, P. Dowell, and R. Karl, “Holistic configuration management at Facebook,” Symposium on Operating Systems Principles (SOSP), pp. 328–343, 2015. [Online]. Available: http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/008-tang.pdf
[34] “Apache accumulo project,” http://accumulo.apache.org/.
[35] “Apache solr,” http://lucene.apache.org/solr/.
[36] Y. Saito and M. Shapiro, “Optimistic replication,” ACM Computing
Surveys (CSUR), vol. 37, no. 1, pp. 42–81, 2005.
[37] H. Lu, K. Veeraraghavan, P. Ajoux, J. Hunt, Y. J. Song, W. Tobagus, S. Kumar, and W. Lloyd, “Existential consistency: Measuring and understanding consistency at Facebook,” Symposium on Operating Systems Principles (SOSP), pp. 295–310, 2015. [Online]. Available: http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/240-lu.pdf
[38] M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran,
M. Wei, J. D. Davis, S. Rao, T. Zou, and A. Zuck, “Tango: Distributed
data structures over a shared log,” in Proceedings of the Twenty-
Fourth ACM Symposium on Operating Systems Principles. ACM,
2013, pp. 325–340.
[39] R. van Renesse and F. B. Schneider, “Chain replication for supporting
high throughput and availability,” in Proceedings of the 6th conference
on Symposium on Operating Systems Design & Implementation,
vol. 6, 2004.
[40] M. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis, “Sinfonia: a new paradigm for building scalable distributed systems,” in ACM SIGOPS Operating Systems Review, vol. 41. ACM, 2007, pp. 159–174.

[41] C. Lee, S. J. Park, A. Kejriwal, S. Matsushita, and J. Ousterhout, “Implementing linearizability at large scale and low latency,” in Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015, pp. 71–86.