The Distributed Computing Column
by
Panagiota Fatourou

Department of Computer Science, University of Crete
P.O. Box 2208, GR-714 09 Heraklion, Crete, Greece
and
Institute of Computer Science (ICS)
Foundation for Research and Technology (FORTH)
N. Plastira 100, Vassilika Vouton, GR-700 13 Heraklion, Crete, Greece
[email protected]
Robust Simulation of Shared Memory: 20 Years After
Hagit Attiya
Department of Computer Science, Technion and School of Computer and Communication Sciences, EPFL
Abstract

This article explores the simulation of the shared-memory abstraction in message-passing systems, despite the existence of failures. The abstraction provides an atomic register accessed with read and write operations. The article describes the Attiya, Bar-Noy and Dolev simulation, its origins and generalizations, as well as its applications in theory and practice.
1 Introduction
In the summer of 1989, I spent a week in the Bay Area, visiting Danny Dolev, who was at IBM Almaden at that time, and Amotz Bar-Noy, who was a postdoc at Stanford, before going to PODC in Edmonton. Danny and Amotz had already done some work on the equivalence between shared-memory and message-passing communication primitives [14]. Their goal was to port specific renaming algorithms from message-passing systems [10] to shared-memory systems, and therefore, their simulation was tailored for the specific construct (stable vectors) used in these algorithms.

Register constructions were a big fad at the time, and motivated by them, we were looking for a more generic simulation of shared memory in message-passing systems. In PODC 1990, we published the fruits of this study in an extended abstract [8] describing a simulation of a single-writer multi-reader register in a message-passing system. Inspired by concepts from several areas, the paper presented a simple algorithm, later nicknamed the ABD simulation, that supports porting shared-memory algorithms to message-passing systems.

The ABD simulation allowed us to concentrate on the former model, at least for studying computability issues. The simulation, vastly extended to handle dynamic changes in the system and adverse failures, also served as a conceptual basis for several storage systems, and for universal service implementations (state-machine replication). In this article, I describe the ABD simulation and its origins, and survey the follow-up work that has grown out of it.
2 Inspirations
This section describes the context of our simulation, discussing specifications, algorithms and simulations, which provided inspiration for our approach. The section also lays down some basic terminology used later in this article.
BEATCS no 100
is valid; validity, in turn, is determined according to version numbers associated with data values. This indicates that version numbers may provide the key for coping with failures in an asynchronous system.
Figure 1: Execution examples for the ABD simulation; dark nodes have acknowledged the message.
One reason the ABD simulation is well known is its simplicity, at least in the unbounded version. This section explains the basic behavior of the algorithm. We consider a simple system with one node being the writer of a register; all other nodes are readers. All n nodes also store a copy of the current value of the register; this is done in a separate thread from their roles as writer or reader. Each value written is associated with an unbounded version number, an integer denoted version#.

To write a value, the writer sends a message write(val, version#), with the new value and an incremented version number, to all nodes and waits for n − f acknowledgments. Figure 1(a) shows an example of the communication pattern: the writer sends the message to all seven nodes; one node does not receive the message (indicated by a dashed line); of the nodes that receive the message, one does not respond at all, and another responds but the message is delayed (indicated by a dashed line), so the writer receives acknowledgments from four nodes (a majority out of seven).

To read a value, a reader queries all nodes and, after receiving at least n − f responses, picks the value with the highest version number. Figure 1(b) shows an example of the communication pattern: the reader sends the message to all seven nodes; all nodes receive the message, but two do not respond at all, while another responds but the message is delayed (indicated by a dashed line), so the reader receives values from four nodes (a majority out of seven).

The key to the correctness of the algorithm is to notice that each operation communicates with a majority of the nodes: since n > 2f, it follows that n − f > n/2. Thus, there is a common node communicating with each pair of write and read operations, as illustrated by Figure 1. This node ensures that the value of the latest preceding (non-overlapping) write operation is forwarded to the later read operation; the read will pick this value, unless it receives an even later value (with a higher version number). A formal proof can be found in the original paper or in [13, Chapter 10].

A slight complication happens because two non-overlapping reads, both overlapping a write, might obtain out-of-order values. This can happen if the writer sends a new value (with a higher version number) to all nodes, and node p gets it first. The early read operation might already obtain the new value from p (among the n − f responses it waits for), while the later read operation does not wait to hear from p and returns the old value. (See Figure 2.)

Figure 2: Old-new inversion between read operations; the dark nodes hold the new value of a write operation. In (a), process p already stores the new value of the register; in (b), the first read operation receives the new value, while in (c), a later read operation does not receive the new value. Note that the write operation does not complete.

This behavior is avoided by having a reader write back the value it is going to return, in order to ensure atomicity of reads. An argument based on communication with a majority of nodes shows that the value returned by a read operation is available to each following read operation, which returns it unless it has obtained an even later value. (See more details in [8] or [13].)
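The protocol described above, including the reader's write-back, can be sketched in a few lines. This is a toy rendering: the network and asynchrony are elided (replica "messages" are direct method calls), and all names (Replica, Writer, Reader) are illustrative, not from the paper.

```python
# Unbounded ABD simulation, network elided: n replicas, at most f crash.

class Replica:
    """Each of the n nodes keeps a copy of the register, tagged with version#."""
    def __init__(self):
        self.version, self.value = 0, None

    def on_write(self, value, version):
        if version > self.version:          # keep only newer values
            self.version, self.value = version, value
        return "ack"

    def on_read(self):
        return self.version, self.value

class Writer:
    def __init__(self, replicas, f):
        self.replicas, self.f, self.version = replicas, f, 0

    def write(self, value):
        self.version += 1                   # incremented version number
        acks = 0
        for r in self.replicas:             # send write(val, version#) to all...
            acks += r.on_write(value, self.version) == "ack"
            if acks >= len(self.replicas) - self.f:
                return                      # ...and proceed after n - f acks

class Reader:
    def __init__(self, replicas, f):
        self.replicas, self.f = replicas, f

    def read(self):
        n = len(self.replicas)
        responses = [r.on_read() for r in self.replicas[: n - self.f]]
        version, value = max(responses)     # highest version among n - f replies
        acks = 0
        for r in self.replicas:             # write back the picked value before
            acks += r.on_write(value, version) == "ack"   # returning (phase 2)
            if acks >= n - self.f:
                break
        return value
```

The quorum-intersection argument is visible here: since n > 2f implies n − f > n/2, the n − f replicas contacted by any two operations must share a node, which is the node that carries the latest value between them.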
array), the read quorums contain all the sets of nodes in the same column, and the write quorums contain all the sets of nodes in the same row. Clearly, the ABD simulation can be modified so that each write operation (all by the same node) communicates with some write quorum, while each read operation communicates with some read quorum. This refactoring admits a separation of concerns and paves the way to optimizing the communication pattern without changing the overall workings of the algorithm. For example, it is possible to choose a different quorum system when fewer processes may fail, or so as to optimize the performance features of the quorums, e.g., their load and availability [45].
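The row/column construction just described can be sketched as follows; the function name and layout are mine, for illustration only.

```python
# Sketch of a grid quorum system: arrange n = rows * cols nodes in a grid;
# the write quorums are the rows and the read quorums are the columns, so
# every read quorum meets every write quorum in exactly one node.

def grid_quorums(nodes, rows, cols):
    assert len(nodes) == rows * cols
    grid = [nodes[i * cols:(i + 1) * cols] for i in range(rows)]
    write_quorums = [set(row) for row in grid]                   # rows
    read_quorums = [{grid[i][j] for i in range(rows)}            # columns
                    for j in range(cols)]
    return read_quorums, write_quorums

reads, writes = grid_quorums(list(range(12)), 3, 4)
# the defining property: every read quorum intersects every write quorum
assert all(len(r & w) == 1 for r in reads for w in writes)
```

Compared with majorities of size greater than n/2, grid quorums have size on the order of the square root of n, illustrating how the choice of quorum system trades off load against availability.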
Concentrating on the communication pattern, through a quorum system, made it possible to make the register simulation more robust and, most notably, to handle dynamic system changes and to tolerate more malicious, Byzantine failures.
5 Application: Replicated Services
Many systems cannot afford to guarantee a majority of nonfaulty processing nodes, seemingly implying that fault tolerance cannot be obtained. However, systems contain other types of components, for example, independent disk drives. Because these components are cheaper than computers, it is feasible to replicate them in order to achieve fault tolerance. Disks are not programmable, but they can respond to client requests; clients may stop taking steps, and disks may stop responding to requests.

Disk Paxos [26] implements an arbitrary fault-tolerant service on such a storage area network containing processing nodes and disks; it provides consistency with any number of asynchronous non-Byzantine node failures. Disk Paxos is based on a shared-memory version of the classic Paxos algorithm [38]; this algorithm is then ported to a storage area network using an optimized version of the ABD simulation: to write a value, a processor writes the value to a majority of the disks; to read, a processor reads a majority of the disks. (This provides a somewhat weaker register, called regular, which, however, suffices for their shared-memory Paxos algorithm.) The algorithm avoids the version numbers used in the ABD simulation by piggybacking on the version numbers of the Paxos algorithm.

A somewhat similar approach was taken with erasure-coded disks [48], where redundancy beyond simple replication is used to tolerate (non-malicious) disk errors, in addition to partitioning and non-responding disks. It incorporates a quorum reconfiguration algorithm, which allows client requests to proceed unimpeded. This algorithm is, in some sense, obstruction-free, since a request may abort when it encounters a concurrent request, yielding an efficient read operation (within a single communication round). This concept was later abstracted by Aguilera et al. to define an abortable object [4], which may abort an operation when there are concurrent operations.
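The storage pattern just described can be caricatured in a few lines. This is a sketch under strong simplifying assumptions, not the Disk Paxos protocol itself: disks are passive blocks that only store and return data, the ballot numbers are supplied by the surrounding Paxos protocol, and failures and concurrency are not modeled.

```python
# Majority-of-disks storage sketch: disks run no code; a processor writes a
# (ballot, value) block to a majority of disks and reads a majority back,
# returning the highest-ballot value. This yields only a regular register.

class Disk:
    def __init__(self):
        self.block = (0, None)              # (ballot, value)

    def store(self, block):                 # a raw disk write
        self.block = block

    def retrieve(self):                     # a raw disk read
        return self.block

def write_value(disks, f, ballot, value):
    for d in disks[: len(disks) - f]:       # reach a majority of the disks
        d.store((ballot, value))

def read_value(disks, f):
    blocks = [d.retrieve() for d in disks[: len(disks) - f]]
    return max(blocks)[1]                   # highest-ballot value
```

Note that store() overwrites unconditionally, since the disks cannot compare ballots themselves; in Disk Paxos it is the processors, guided by the Paxos ballots, that prevent a delayed old write from regressing the state.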
A service can be replicated even when servers are Byzantine, by relying on a register simulation tolerating Byzantine failures; see, for example, [47].
6 Closing Remarks
One of my goals in this article was to show how picking the right abstraction can bring forth many applications. Finding the right abstraction is in many ways the key to designing good systems: it should hide enough capabilities under its hood to provide good leverage in application design, yet not too much, so that the implementation is efficient (or easily admits optimizations). There are several other research directions that were not discussed in detail here;
References
[1] Ittai Abraham, Gregory Chockler, Idit Keidar, and Dahlia Malkhi. Byzantine disk paxos: Optimal resilience with Byzantine shared memory. Distributed Computing, 18(5):387–408, 2006. Previous version in PODC 2004.

[2] Yehuda Afek, Hagit Attiya, Danny Dolev, Eli Gafni, Michael Merritt, and Nir Shavit. Atomic snapshots of shared memory. Journal of the ACM, 40(4):873–890, September 1993. Previous version in PODC 1990.
[3] Yehuda Afek, David S. Greenberg, Michael Merritt, and Gadi Taubenfeld. Computing with faulty shared objects. Journal of the ACM, 42(6):1231–1274, 1995. Previous version in PODC 1992.

[4] Marcos K. Aguilera, Svend Frølund, Vassos Hadzilacos, Stephanie L. Horn, and Sam Toueg. Abortable and query-abortable objects and their efficient implementation. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 23–32, 2007.

[5] Marcos Kawazoe Aguilera, Idit Keidar, Dahlia Malkhi, and Alexander Shraer. Dynamic atomic storage without consensus. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC), pages 17–25, 2009.

[6] Amitanand S. Aiyer, Lorenzo Alvisi, and Rida A. Bazzi. Bounded wait-free implementation of optimally resilient Byzantine storage without (unproven) cryptographic assumptions. In Proceedings of the 21st International Symposium on Distributed Computing (DISC), pages 7–19, 2007.

[7] Hagit Attiya. Efficient and robust sharing of memory in message-passing systems. Journal of Algorithms, 34(1):109–127, January 2000. Previous version in WDAG 1996.

[8] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM, 42(1):121–132, January 1995. Previous version in PODC 1990.

[9] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Daphne Koller, David Peleg, and Rüdiger Reischuk. Achievable cases in an asynchronous environment. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS), pages 337–346, 1987.

[10] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, David Peleg, and Rüdiger Reischuk. Renaming in an asynchronous environment. Journal of the ACM, 37(3):524–548, July 1990. Previous version in [9].

[11] Hagit Attiya and Amir Bar-Or. Sharing memory with semi-Byzantine clients and faulty storage servers. Parallel Processing Letters, 16(4):419–428, 2006. Previous version in SRDS 2003.

[12] Hagit Attiya, Nancy Lynch, and Nir Shavit. Are wait-free algorithms fast? Journal of the ACM, 41(4):725–763, 1994. Previous version in FOCS 1990.

[13] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, second edition, 2004.

[14] Amotz Bar-Noy and Danny Dolev. A partial equivalence between shared-memory and message-passing in an asynchronous fail-stop distributed environment. Mathematical Systems Theory, 26(1):21–39, 1993. Previous version in PODC 1989.

[15] J. Beal and S. Gilbert. RamboNodes for the metropolitan ad hoc network. In Workshop on Dependability in Wireless Ad Hoc Networks and Sensor Networks, 2003.
[29] David K. Gifford. Weighted voting for replicated data. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP), pages 150–162, 1979.

[30] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer – designing a MIMD, shared-memory parallel machine. In Proceedings of the 9th Annual Symposium on Computer Architecture (ISCA), pages 27–42, 1982.

[31] Rachid Guerraoui and Ron Levy. Robust emulations of shared memory in a crash-recovery model. In Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS), pages 400–407, 2004.

[32] Rachid Guerraoui and Marko Vukolić. Refined quorum systems. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 119–128, 2007.

[33] Maurice P. Herlihy and Nir Shavit. The topological structure of asynchronous computability. Journal of the ACM, 46(6):858–923, 1999. Previous version in STOC 1993 and STOC 1994.

[34] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.

[35] Amos Israeli and Ming Li. Bounded time-stamps. Distributed Computing, 6(4):205–209, July 1993. Previous version in FOCS 1987.

[36] Prasad Jayanti, Tushar D. Chandra, and Sam Toueg. Fault-tolerant wait-free shared objects. Journal of the ACM, 45(3):451–500, 1998. Previous version in FOCS 1992.

[37] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, 1979.

[38] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.

[39] Nancy A. Lynch and Alexander A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcasts. In Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS), pages 272–281, 1997.

[40] Nancy A. Lynch and Alexander A. Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In Proceedings of the 16th International Conference on Distributed Computing (DISC), pages 173–190, 2002.

[41] Dahlia Malkhi and Michael K. Reiter. Byzantine quorum systems. Distributed Computing, 11(4):203–213, 1998. Previous version in STOC 1997.

[42] Dahlia Malkhi, Michael K. Reiter, Avishai Wool, and Rebecca N. Wright. Probabilistic quorum systems. Information and Computation, 170(2):184–206, 2001. Previous version in PODC 1997.

[43] Jean-Philippe Martin, Lorenzo Alvisi, and Michael Dahlin. Minimal Byzantine storage. In Proceedings of the 16th International Conference on Distributed Computing (DISC), pages 311–325, 2002.