Lecture 2 Introduction to Principles of Distributed Computing. Sergio Rajsbaum Math Institute UNAM, Mexico. Lecture 2. Part I : Refresh from Lecture I. What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation. Two-phase commit
Lecture 2: Introduction to Principles of Distributed Computing. Sergio Rajsbaum, Math Institute, UNAM, Mexico
Lecture 2 • Part I: Refresh from Lecture I. What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation. Two-phase commit • Part II: Coordinated attack, consensus
Part I: What is a distributed system The need for a theoretical foundation. Two-phase commit
Principles of Distributed Computing • Distributed computing studies systems where components interact and collaborate • Principles of distributed computing tries to understand the fundamental possibilities and limitations of such systems, with a precise, scientific approach • Goal: to design efficient and reliable systems, and techniques to design them, analyze them and prove them correct, or to prove impossibility results when no protocol exists
What is distributed computing? • Any system where several independent computing components interact • This broad definition encompasses • VLSI chips, and any modern PC • tightly-coupled shared memory multiprocessor • local area cluster of workstations • internet, WEB, Web services • wireless networks, sensor networks, ad-hoc networks • cooperating robots, mobile agents, P2P systems
Computing components • Referred to as processors or processes in the literature • Can represent a • microprocessor • process in a multiprocessing operating system • Java thread • mobile agent, mobile node (e.g. laptop), robot • computing element in a VLSI chip
Interaction – message passing vs. shared memory • Processors need to communicate with each other to collaborate, via • Message passing • Point-to-point channels, defining an interconnection graph • All-to-all using an underlying infrastructure (e.g. TCP/IP) • Broadcast; wireless, satellite • Shared memory • Shared-objects: read/write, test&set, compare&swap, etc • Usually harder to implement, easier to program
A distributed system [Figure: processors collaborate through communication media]
Failures • Any system that includes many components running over a long period of time must consider the possibility of failures • of processors and communication media • of different severity • from processor crashes or message losses, to • malicious Byzantine behavior
Many kinds of problems • Clock synchronization • Routing • Broadcasting • Naming • P2P, how to share and find resources • sharing resources, mutual exclusion • Increasing fault-tolerance, failure detection • Security, authentication, cryptography • Database transactions, atomic commitment • Backups, reliable storage, file systems • Applications, airline reservation, banking, electronic commerce, publish/subscribe systems, web search, web caching, …
Multi-layered, complex interactions: an example • A fault-tolerant broadcast service is useful to build a higher-level database transaction module • Naming and authentication are required • It may work more efficiently if clocks are tightly synchronized • Good routing schemes should exist • If the clock synchronization is attacked, the whole system may be compromised
Chaos We need a good foundation, principles of distributed computing
Chaos • Too many models, problems and orthogonal, interacting issues • Very hard to get things right, to reproduce operating scenarios • Sometimes it is easy to adapt a solution to a different model, sometimes a small change in the model makes a problem unsolvable
Distributed computing theory • Models • Good models [Schneider Ch.2 in Distributed Systems, Mullender (Ed.)] • Relation between models: solve a problem only once; solve it in the strongest possible model • Problems • Search for paradigms that represent fundamental distributed computing issues • Relations between problems: hierarchies of solvable and unsolvable problems; reductions • Solutions • Design algorithms, verification techniques, programming abstractions • Impossibility results and lower bounds • Efficiency measures • Time, communication, failures, recovery time, bottlenecks, congestion
Distributed Commit An example of a distributed protocol Fundamental part of distributed DBMS
Distributed Commit • A distributed transaction with components at several sites should execute atomically • Example: A manager of a chain of stores wants to query all the stores, find the inventory of toothbrushes at each, and issue instructions to move toothbrushes from store to store in order to balance the inventory. • The operation is done by a single global transaction T that has a component Ti at the i-th store and a component T0 at the office where the manager is located.
Sequence of activities performed by T • Component T0 is created at the site of the manager • T0 sends messages to all the stores instructing them to create components Ti • Each Ti executes a query at store i to discover the number of toothbrushes in inventory and reports this number to T0 • T0 takes these numbers and determines, by some algorithm we shall not discuss, what shipments of toothbrushes are desired. T0 then sends messages such as “store 10 should ship 500 toothbrushes to store 7” to the appropriate stores • Stores receiving instructions update their inventory and perform the shipments
Atomicity • We must make sure that the following never happens: some of the actions of T get executed, but others do not • We do assume atomicity of each Ti, through mechanisms such as logging and recovery • Failures make it difficult to achieve atomicity of T • A site fails or is disconnected from the network • A bug in the algorithm to redistribute toothbrushes instructs store 10 to ship more than it has
Example of failures • Suppose T10 replies to T0’s first message with its inventory. • The machine at store 10 then crashes, and the instructions from T0 are never received by T10 • However, T7 sees no problem, and receives the instructions from T0 • Can the distributed transaction T ever commit?
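The lecture outline names two-phase commit as the protocol for this situation. The following is a minimal sketch of the coordinator's decision rule only, assuming each component Ti sends a yes/no vote to T0 and that a vote missing at a timeout is treated as a failure. The names `Vote` and `coordinator_decision` are hypothetical and do not come from the slides.

```python
from enum import Enum
from typing import Dict, Optional


class Vote(Enum):
    YES = "yes"
    NO = "no"


def coordinator_decision(votes: Dict[str, Optional[Vote]]) -> str:
    """Decision taken by a hypothetical coordinator T0 after collecting votes.

    `votes` maps each component name to its vote, or to None when no vote
    arrived before the (assumed) timeout, e.g. because the site crashed.
    """
    # Commit only when every component explicitly voted YES;
    # a NO vote or a missing vote forces an abort.
    if all(v is Vote.YES for v in votes.values()):
        return "commit"
    return "abort"


# The scenario above: T7 is ready, but the machine at store 10 has crashed.
decision = coordinator_decision({"T7": Vote.YES, "T10": None})
print(decision)
```

Under these assumptions the crash of store 10 leads T0 to abort, so T7 must not perform its shipment either.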
Agreement Paradigms Coordinated attack Consensus
Coordinated Attack: an important abstraction • A pair of allied generals A and B have agreed to attack simultaneously or not at all. • They can only communicate via carrier pigeon; message loss is possible [Figure: generals A and B]
Difficulty: uncertainty • Suppose general A sends the message “attack at dawn” to B • General A won’t attack alone, and A doesn’t know whether B has received the message. B understands A’s predicament, so B sends an acknowledgment “agreed”
Impossible [Figure: A sends “attack at dawn” to B (did B get it?); B replies “ack” (did A get it?)] Theorem: Assume that communication is unreliable. Any protocol that guarantees that if one of the generals attacks, then the other does so at the same time, is a protocol in which necessarily neither general attacks.
It never ends [Figure: A sends “ack your ack” (did B get it?); B replies “ack your ack to my ack” (did A get it?)] • There is always uncertainty about whether the last message was delivered or not • Corollary: If a decision must be made within a fixed time period, then unreliable communication rules out database commitment protocols
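The uncertainty about the last message can be made concrete with a small sketch. It assumes a strictly alternating exchange (A sends message 0, B answers with message 1, and so on) that stops at the first lost message. The function `run_exchange` and the representation of a local view as a pair (messages sent, messages received) are illustrative choices, not part of the slides.

```python
from typing import Dict, List, Tuple

# A general's local view: (numbers of messages sent, numbers of messages received)
View = Tuple[Tuple[int, ...], Tuple[int, ...]]


def run_exchange(first_lost: int) -> Tuple[View, View]:
    """Simulate the alternating exchange in which message `first_lost` is lost.

    Messages 0 .. first_lost - 1 are delivered; message `first_lost` is sent
    but never arrives, so nothing is sent after it.
    Returns the local views of A and B.
    """
    sent: Dict[str, List[int]] = {"A": [], "B": []}
    received: Dict[str, List[int]] = {"A": [], "B": []}
    for k in range(first_lost + 1):
        sender, receiver = ("A", "B") if k % 2 == 0 else ("B", "A")
        sent[sender].append(k)
        if k < first_lost:  # message k is delivered
            received[receiver].append(k)

    def view(general: str) -> View:
        return tuple(sent[general]), tuple(received[general])

    return view("A"), view("B")


# Run 1: A's message 2 is lost. Run 2: message 2 arrives, B's reply 3 is lost.
a_run1, b_run1 = run_exchange(2)
a_run2, b_run2 = run_exchange(3)
print(a_run1 == a_run2)  # A cannot tell the two runs apart
print(b_run1 == b_run2)  # B can
```

In this sketch the sender of the last delivered-or-lost message has the same view in both runs, so it must act identically in both, which is the core of the impossibility argument.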
Agreement Problems in Distributed Computing are common, because processes have different views of the system’s state and history
Agreement Problems in Distributed Computing are common… because processes have different views of the system’s state and history, due to: • Delays • Failures Example: “NASA plunged the Galileo spacecraft into Jupiter’s turbulent atmosphere today. The unmanned spacecraft dived into the atmosphere at 2:57 p.m. Eastern time. The last of Galileo’s data arrived on Earth today after the spacecraft was destroyed, taking 52 minutes to cross half a billion miles of space.” (The New York Times, 21 Sept. 2003)
… and Agreement Problems are Important • In a replicated data system: to execute the same sequence of operations on the replicated data • In a replicated sensor system: to agree on the values of the sensors • In a timed system: to synchronize a set of clocks • In a broadcast system: to deliver the same messages in the same order • In a database system: to commit or abort a transaction Etc….
Consensus The king of agreement problems
CONSENSUS: a fundamental abstraction. Each process has an input and should decide an output such that • Agreement: correct processes’ decisions are the same • Validity: the decision is the input of one process • Termination: eventually all correct processes decide. There are at least two possible input values, 0 and 1
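The three requirements can be phrased as checks on a finished run. The sketch below assumes that process identifiers are integers, that `None` is never a legal input value, and that a process missing from `decisions` has not decided. The function name `check_consensus` is hypothetical. Since termination is an "eventually" property, the check only says whether every correct process had decided by the end of the recorded run.

```python
from typing import Dict, Hashable, Set


def check_consensus(
    inputs: Dict[int, Hashable],
    decisions: Dict[int, Hashable],
    correct: Set[int],
) -> Dict[str, bool]:
    """Check the consensus requirements on one recorded run.

    inputs    -- input value of every process
    decisions -- decided value of each process that decided
    correct   -- identifiers of the processes that did not fail
    """
    decided = [decisions[p] for p in correct if p in decisions]
    return {
        # Every correct process has decided.
        "termination": all(p in decisions for p in correct),
        # No two correct processes decided different values.
        "agreement": len(set(decided)) <= 1,
        # Every decided value is the input of some process.
        "validity": all(d in inputs.values() for d in decided),
    }


result = check_consensus(
    inputs={0: 0, 1: 1, 2: 2, 3: 0},
    decisions={0: 2, 1: 2, 3: 2},
    correct={0, 1, 3},  # process 2 crashed after sending its input
)
print(result)
```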
A Solution to Consensus: for a group of people sitting in a room
A Solution to Consensus: each one raises a card with its input [Figure: cards showing 0, 1, 2, 0, 0]
A Solution to Consensus: follow a coordinator [Figure: cards 0, 1, 2, 0, 0, each paired with the value 1]
A Solution to Consensus: majority wins (breaking ties with the largest) [Figure: cards 0, 1, 2, 0, 0, each paired with the value 0]
A Solution to Consensus: failures are no problem (choose another coordinator, or the majority of the non-failed) [Figure: cards 0, 1, 2, 0 and one garbled card %!#]
A Solution to Consensus… because this cannot happen!! [Figure: cards 0, 1, 2 and a garbled card %!# shown next to 1 and 0]
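The "majority wins, breaking ties with the largest" rule from the room example can be sketched as a small function. It assumes that everybody in the room sees the same set of raised cards, which is exactly what makes the room setting easy. The name `majority_wins` is illustrative.

```python
from collections import Counter
from typing import Iterable


def majority_wins(cards: Iterable[int]) -> int:
    """Return the most frequent card value, breaking ties with the largest.

    Raises ValueError if no cards were raised.
    """
    counts = Counter(cards)
    top = max(counts.values())
    # Among the values with the highest count, pick the largest one.
    return max(value for value, count in counts.items() if count == top)


print(majority_wins([0, 1, 2, 0, 0]))  # everybody present
print(majority_wins([0, 1, 2, 0]))     # one person failed to raise a card
```

Because every person applies the same deterministic rule to the same cards, all of them reach the same decision.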
Consensus in Distributed Systems: this can happen: delays [Figure: views ?, ?, 1, ?]
Consensus in Distributed Systems: and then there are different views [Figure: some processes hold the view 1020?, another holds 10201, and one process is marked †]
Consensus in Distributed Systems: so we try to reconcile views with another round [Figure: after the extra round, some views change from 1020? to 10201; one process is marked †]
Consensus in Distributed Systems: but we could have the same problem!! [Figure: further views change to 10201 while others remain 1020?; one process is marked †]
So, is consensus solvable? If so, how long does it take to solve it? • It depends on what exactly the model is • But what is a realistic model? • And what are the common scenarios within the model? The nature of a distributed system is to include complex combinations of failures and delays
Basic Model – asynchronous crash failure model • Message passing (another option would be a shared memory model) • Channels between every pair of processes • Crash failures, with a bound t on the number of potential failures, where t < n, out of n > 1 processes • No message loss among correct processes • Unbounded message delays, unpredictable processor speeds
Distributed algorithms (protocols) • A set of algorithms, each of which runs on a different processor (or as a thread in the same computer) • The code includes instructions to communicate with other processors: • Send (M) to p • Upon receiving a message from q do
A consensus protocol • val ← input • send val to all • wait until at least n - t messages have been received • let V[j] be the val received from process j, or ‘-’ if none was received • return h(V) = largest value in V. Notes: - This same code is executed by every process - Each one receives the value input from some application - h is a predefined function that all processors know
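The decision step of the protocol above can be explored with a small sketch. It does not simulate real message passing: the set `heard_from` of processes whose messages have arrived when a process stops waiting is an explicit parameter, standing in for whatever the asynchronous scheduler happens to deliver first. The names `decide`, `heard_from`, and `MISSING` are assumptions made for illustration.

```python
from typing import List, Sequence, Set

MISSING = "-"  # marks entries of V for which no message was received


def h(view: Sequence) -> int:
    """The predefined function of the protocol: the largest value in V."""
    return max(v for v in view if v != MISSING)


def decide(inputs: Sequence[int], heard_from: Set[int], n: int, t: int) -> int:
    """Decision of one process once messages from `heard_from` have arrived.

    inputs[j] is the input of process j; n is the number of processes and
    t the bound on crash failures.
    """
    if len(heard_from) < n - t:
        raise ValueError("the protocol waits for at least n - t messages")
    # Build the vector V: the received value, or '-' where nothing arrived.
    view: List = [inputs[j] if j in heard_from else MISSING for j in range(n)]
    return h(view)


inputs = [0, 1, 2, 0]
print(decide(inputs, heard_from={0, 1, 3}, n=4, t=1))
print(decide(inputs, heard_from={0, 1, 2, 3}, n=4, t=1))
```

Calling `decide` with different `heard_from` sets for the same input vector is one way to experiment with the exercises that follow.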
Is this protocol correct? • It depends on what the set C of possible inputs is • An input to the protocol is a vector I, where I[j] contains the local input of the j-th process • The local input of pj is known only to pj • It is taken from some universe of possible values V not including ‘-’ • Let C be the set of possible input vectors to the protocol
Exercise 1 • Define a set C as large as possible for which the protocol is correct • Prove that the protocol is correct for this C, namely, that for every I in C, in every execution with input I where at most t processes crash, the consensus requirements are satisfied: • Termination: eventually all correct processes decide • Agreement: correct processes’ decisions are the same • Validity: the decision is the input of one process • Do you need to assume t < n / 2?
Exercise 2 The protocol uses h (V) = largest value in V • Define another such function h’ • Repeat the previous exercise with respect to your h’
Exercise 3 Consider the set C that includes every possible input vector formed with values from V, where | V | is at least 2 • Is there a function h for which the protocol is correct ? If so, give one such h and prove the protocol is correct, otherwise, give a brief intuitive argument of why there is no such h
Bibliography: theory of distributed computing textbooks • Attiya, Welch, Distributed Computing, Wiley-Interscience, 2 ed., 2004 • Garg, Elements of Distributed Computing, Wiley-IEEE, 2002 • Lynch, Distributed Algorithms, Morgan Kaufmann, 1997 • Tel, Introduction to Distributed Algorithms, Cambridge U., 2 ed., 2001
Bibliography: others • Distributed Algorithms and Systems http://www.md.chalmers.se/~tsigas/DISAS/index.html • Conferences: DISC, PODC,… • Journals: Distributed Computing,… • Special issue PODC 20th anniversary, Sept. 2003 • ACM SIGACT News Distributed Computing Column. Also one in EATCS Bulletin