Distributed Algorithms Contents Index

Paxos: Consensus in Faulty Systems

Algorithms by which agents reach a consensus on a value are central in many applications. In this webpage we describe Paxos -- an algorithm by which agents attempt to reach a consensus in distributed systems in which agents may halt, be arbitrarily slow, and messages may be duplicated, lost, and delivered out of order.

Introduction

Paxos is a consensus algorithm for systems in which messages may be lost; multiple copies of a message may be delivered; messages may not be delivered in the order sent; agents may be arbitrarily slow; and agents may stop. From the FLP theorem, there is no algorithm that guarantees that consensus among agents will be reached in such systems. Paxos may not terminate. Later, we discuss factors that impact the likelihood that the algorithm will progress.

Maintaining a bank account requires keeping track of the sequence of transactions on the account. An insufficient fee penalty may be incurred if a transaction that transfers funds out of an account occurs before a transaction that transfers funds into the account. The penalty may not be incurred if the sequence of transactions is reversed with funds transferred into the account before funds are withdrawn from it.

Some systems improve resiliency by keeping copies of sequences of transactions at different agents. All agents must come to a consensus about a single sequence though multiple agents propose transactions concurrently.

Sequences of transactions are stored at agents called servers. The sequences may not be in sync because the system may have faults and because message delays are arbitrary.

Agents that propose transaction are called clients. Associated with each client p is a function g_p which has a single argument. If x is the consensus sequence then g_p(x) is a sequence consisting of x extended by a single transaction proposed by p. Different clients may propose different extensions. The Paxos algorithm determines a consensus extension from among proposals made by different clients.

The algorithms for clients and servers in Paxos are the same as in the algorithm for serializable computations. For any computation \(C\) of the distributed algorithm there exists a computation \(C'\) with the same steps as \(C\) and where transactions in \(C'\) are executed in increasing order of epochs.

Associated with each server q is a variable q.v. For each server q, the sequence of values assigned to q.v in computation \(C\) is the same as in a computation in which transactions are executed in epoch order.


Transactions Executed in Epoch Order

# Execute transactions in C in increasing order of epochs.
for i in positive_integers:
   # Execute the transaction with epoch t[i]
   # This transaction is executed by client p[i]
   # In this transaction:
   #   R[i] is the set of servers read by p[i]
   #   W[i] is the set of servers written by p[i]

   if len(R[i]) >= M:
      s = f_p[i](r.v for r in R[i])
      for q in W[i]:  q.v.s, q.v.t  = s, t[i]

where the i-th iteration of the loop represents the execution of the transaction with the i-th highest epoch. Recall that M and f_p are parameters of the algorithm.

Specification of Consensus

The specification is that at each step of a computation, either the consensus sequence remains unchanged or a client extends the consensus sequence.

Let s* be the consensus sequence at a point in the computation. At each step of a computation, the consensus s* remains unchanged or a client p changes the consensus to g_p(s*).

We write the specification as


Specification

(s* = x)  next  ((s* = x) or (exists client p: s* = g_p(x)))

The Problem

The problem is to obtain a consensus algorithm by specifying
  1. M
  2. f_p, and
  3. the consensus sequence s*.

How Should You Solve the Problem?

Task 1: Specify M
Let N be the number of servers. If M is large, say M = N, then the likelihood that a client receives replies from all M servers is small, and so the likelihood that a transaction progresses to the write step is small. If M is small, say M = 1, then there may be no commonality between replies that different clients receive to their read requests, and so the read step of a transaction may not convey useful information.

We assume that a client proceeds to the write step of a transaction if it receives replies from at least a majority of servers. Why use a majority? Because any two majorities have at least one element in common.

Let N be the number of servers.

   M = int(N/2) + 1

Task 2: Define the Consensus Sequence in terms of Server Variables

Let's use majorities again. We construct a program in which:


The consensus sequence is the value written to server variables in a transaction that assigns values to a majority of server variables.


Let t* be an epoch in which a majority of servers are assigned values, let s* be the value assigned to server variables in the transaction, let p* be the client that executes the transaction, and let W* be the set of servers that are assigned values in this transaction. Then s* is a consensus sequence.

Variables s*, W*, t*, p* are used only for proving that the algorithm is correct. They do not appear in the implementation of client or server algorithms. The values of these variables are given in a computation with transactions in epoch order. Variables used in proofs but not in implementations are called auxiliary variables or thought variables.


Transactions in Epoch Order with Auxiliarly Variables

for i in positive_integers:
   if R[i] is a majority:
      s = f_p[i](r.v for r in R[i])
      for q in W[i]:
         q.v.s, q.v.t  = s, t[i]
         if W[i] is a majority:
            s*, t* = s, t[i]
            W*, p* = W[i], p[i]

Initially

for all servers q:  q.v.s, q.v.t  =  init, t[0]
s*, t*[0] = init, t[0]
where init and t[0] are given initial values. p* is an arbitrary client and W* is the set of all servers.

Next, we prove properties of computation \(C\) by proving properties of executions of transactions of \(C\) executed in epoch order. We begin by identifying invariants.

Invariant 1

If there exists a set Z of servers where Z is a majority and all elements of Z are identical, then:

for all q in Z:  q.v.s, q.v.t = s*, t*

Invariant 2

For all q in W* : q.v.t \(\geq\) t*


Proofs of the Invariants
The proof that the invariants hold initially is straightforward. We assume that the invariant holds at the start of the i-th iteration and prove that the invariant holds at the end of the i-th iteration.

Case 1: W[i] is a majority

In this case s*, W*, t* become s, W[i], t[i], respectively, and

for all q in W*:  q.v.s, q.v.t = s*, t*
Let Z be any set of servers where Z is a majority and where all servers in Z have identical variables. Because Z and W* are majorities they have an element in common, and so all servers in Z and W* have identical variables. Therefore
for all q in Z:  q.v.s, q.v.t = s*, t*
So, invariant 1 holds at the end of the iteration.

Invariant 2 holds at the end of the iteration because: For all q in W*: q.v.t \(=\) t*.

Case 2: W[i] is not a majority

Assume that there exists a majority Z of servers with identical values at the end of iteration i. For q in Z, q.v.t \(\neq\) t[i] because W[i] is not a majority. So Z is not modified in the i-th iteration . Invariant 1 holds at the start of the iteration, and the iteration leaves Z unchanged, and so invariant 1 continues to hold at the end of the iteration.

Invariant 2 continues to hold at the end of the iteration because W* and t* are unchanged in the iteration and for each server q the iteration either increases q.v.t or leaves it unchanged.

Next, we use these invariants to construct f_p.

Task 3: Function f_p
Recall the specification of the algorithm: A step of the computation leaves the consensus s* unchanged or changes it to g_p(s*) So, we now prove the following invariant.

Invariant 3

For all servers q where q.v.t \(\geq\) t*:

(q.v.s \(=\) s*) or (exists client p: q.v.s \(=\) g_p(s*))


f_p()
We first specify function f_p(). Consider two cases of the argument of the function.

Case 1: All elements of [r.v for r in R[i]] are identical

f_p returns g_p(r.v.s) for any r in R[i].

Case 2: Not all elements of [r.v for r in R[i]] are identical

f_p returns r*.v.s where r* is an element of R[i] with the maximum r.v.t:

r*.v.t = max(r.v.t for r in R[i])

Proof of Invariant 3
The proof that the invariant holds initially is straightforward. Assume that the invariant holds at the start of the i-th iteration and prove that it holds at the end of the iteration. We first prove the following lemma.
Lemma
Assuming that the invariant holds at the start of the iteration, f_p returns either s* or g_p(s*).

Proof

Case 1: All elements of [r.v for r in R[i]] are identical

From invariant 1, and because R[i] is a majority:

For all r in R[i]: r.v.s = s* .

Therefore f_p returns g_p(r.v.s) which is g_p(s*).

Case 2: Not all elements of [r.v for r in R[i]] are identical

R[i] and W* have at least one element x in common because they are both majorities. From invariant 2, because x is a member of W*:

x.v.t \(\geq\) t*.

From the definition of r*.v.t :

r*.v.t \(\geq\) x.v.t. and so r*.v.t \(\geq\) t*.

At the start of the iteration invariant 3 holds and so

(r*.v.s \(=\) s*) or (exists client p: r*.v.s \(=\) g_p(s*))

So f_p returns s* or g_p(s*).

Lemma: Invariant holds at the end of the iteration.

Proof

Consider two cases: W[i] is either a majority or not a majority.

Case 1: W[i] is a majority

In this case s*, W*, t* = s, W[i], t[i]

Therefore for all q where q.v.t \(=\) t*: q.v.s \(=\) s*.

So the invariant continues to hold.

Case 2: W[i] is not a majority

In this case s*, W*, t* remain unchanged in the iteration. The iteration continues to hold because f_p returns s* or g_p(s*).

Theorem

In the system, with clients, servers, M, f_p, and s* as specified earlier,
(s* = x) or (exists p: s* = g_p(x))

Proof

Let \(C\) be any computation of the system. There exists a computation \(C'\) where (1) for each agent \(x\) the sequence of steps executed at \(x\) is the same in \(C\) and \(C'\), and (2) transactions are executed in ascending order of epochs in \(C'\). We first prove the result for computation \(C'\).

The sequence of values of q.v of each server q in computation \(C'\) is the same as in an execution of the following sequence of of transactions (which was given above and is repeated here for convenience).


Transactions in Epoch Order with Auxiliarly Variables

for i in positive_integers:
   if R[i] is a majority:
      s = f_p[i](r.v for r in R[i])
      for q in W[i]:
         q.v.s, q.v.t  = s, t[i]
         if W[i] is a majority:
            s*, t* = s, t[i]
            W*, p* = W[i], p[i]

The only step that changes s* assigns:

f_p[i](r.v for r in R[i])) to s*.

From the proof of invariant 3: the value assigned to s* is s* or g_p[i](s*). Therefore the result holds for the computation in which transactions are executed in epoch order, and so it holds for \(C'\).

The sequence of steps at each agent is the same in \(C\) and \(C'\). s* is defined in terms of server variables which have the identical values in \(C\) and \(C'\). Therefore the result also holds for \(C\).


K. Mani Chandy, Emeritus Simon Ramo Professor, California Institute of Technology