Algorithms by which agents reach a consensus on a value are central in many applications. In this webpage we describe Paxos -- an algorithm by which agents attempt to reach a consensus in distributed systems in which agents may halt, be arbitrarily slow, and messages may be duplicated, lost, and delivered out of order.
Maintaining a bank account requires keeping track of the sequence of transactions on the account. An insufficient fee penalty may be incurred if a transaction that transfers funds out of an account occurs before a transaction that transfers funds into the account. The penalty may not be incurred if the sequence of transactions is reversed with funds transferred into the account before funds are withdrawn from it.
Some systems improve resiliency by keeping copies of sequences of transactions at different agents. All agents must come to a consensus about a single sequence though multiple agents propose transactions concurrently.
Sequences of transactions are stored at agents called servers. The sequences may not be in sync because the system may have faults and because message delays are arbitrary.
Agents that propose transaction are called clients. Associated
with each client p
is a function g_p
which
has a single argument. If x
is
the consensus sequence then g_p(x)
is a sequence
consisting of x
extended by a single transaction proposed
by p
. Different clients may propose different
extensions. The Paxos algorithm determines a consensus extension from
among proposals made by different clients.
The algorithms for clients and servers in Paxos are the same as in the algorithm for serializable computations. For any computation \(C\) of the distributed algorithm there exists a computation \(C'\) with the same steps as \(C\) and where transactions in \(C'\) are executed in increasing order of epochs.
Associated with each server q
is a variable
q.v
. For each server q
, the sequence of values assigned to
q.v
in computation \(C\) is the same as in a computation
in which transactions are executed in epoch order.
# Execute transactions in C in increasing order of epochs. for i in positive_integers: # Execute the transaction with epoch t[i] # This transaction is executed by client p[i] # In this transaction: # R[i] is the set of servers read by p[i] # W[i] is the set of servers written by p[i] if len(R[i]) >= M: s = f_p[i](r.v for r in R[i]) for q in W[i]: q.v.s, q.v.t = s, t[i]
i
-th iteration of the loop represents the
execution of the transaction with the i
-th highest epoch.
Recall that M
and f_p
are parameters of the
algorithm.
Let s*
be the consensus sequence at a point in the
computation.
At each step of a computation, the consensus s*
remains unchanged or a client p
changes the consensus to
g_p(s*)
.
We write the specification as
Specification
(s* = x) next ((s* = x) or (exists client p: s* = g_p(x)))
M
f_p
, and
s*
.
N
be the number of servers. If M
is
large, say M = N
, then the likelihood that a client
receives replies from all M
servers is small, and so the
likelihood that a transaction progresses to the write step is small.
If M
is small, say M = 1
, then there may be
no commonality between replies that different clients receive to their
read requests, and so the read step of a transaction may not convey
useful information.
We assume that a client proceeds to the write step of a transaction if it receives replies from at least a majority of servers. Why use a majority? Because any two majorities have at least one element in common.
Let N
be the number of servers.
M = int(N/2) + 1
Let's use majorities again. We construct a program in which:
The consensus sequence is the value written to server variables in a transaction that assigns values to a majority of server variables.
Let t*
be an epoch in which a majority of servers are
assigned values, let s*
be the value assigned to server
variables in the transaction, let p*
be the client that
executes the transaction, and let W*
be the set of
servers that are assigned values in this transaction. Then
s*
is a consensus sequence.
Variables s*, W*, t*, p*
are used only
for proving that the algorithm is correct. They do not appear in the
implementation of client or server algorithms. The values of these
variables are given in a computation with transactions in epoch order.
Variables used in proofs but not in
implementations are called auxiliary variables or thought variables.
for i in positive_integers: if R[i] is a majority: s = f_p[i](r.v for r in R[i]) for q in W[i]: q.v.s, q.v.t = s, t[i] if W[i] is a majority: s*, t* = s, t[i] W*, p* = W[i], p[i]
for all servers q: q.v.s, q.v.t = init, t[0] s*, t*[0] = init, t[0]where
init
and t[0]
are given initial
values.
p*
is an arbitrary client and
W*
is the set of all servers.
Next, we prove properties of computation \(C\) by proving
properties of executions of transactions of \(C\) executed in epoch
order. We begin by identifying invariants.
If there exists a set Z
of servers where Z
is a majority and all elements of Z
are identical, then:
for all q in Z: q.v.s, q.v.t = s*, t*
For all q
in W*
: q.v.t
\(\geq\) t*
i
-th iteration and prove that the
invariant holds at the end of the i
-th iteration.
Case 1: W[i]
is a majority
In this case s*, W*, t*
become s, W[i],
t[i]
, respectively, and
for all q in W*: q.v.s, q.v.t = s*, t*Let
Z
be any set of servers where Z
is a
majority and where all servers in Z
have identical
variables.
Because Z
and W*
are majorities they have an
element in common, and so all servers in Z
and
W*
have identical variables.
Therefore
for all q in Z: q.v.s, q.v.t = s*, t*So, invariant 1 holds at the end of the iteration.
Invariant 2 holds at the end of the iteration because:
For all q
in W*
: q.v.t
\(=\) t*
.
Case 2: W[i]
is not a majority
Assume that there exists a majority Z
of servers with identical values
at the end of iteration i
.
For q
in Z
, q.v.t
\(\neq\)
t[i]
because W[i]
is not a majority.
So Z
is not modified in the i
-th iteration .
Invariant 1 holds at the start of the iteration, and the iteration
leaves Z
unchanged, and so invariant 1 continues to
hold at the end of the iteration.
Invariant 2 continues to hold at the end of the iteration because
W*
and t*
are unchanged in the iteration and
for each server q
the iteration either increases
q.v.t
or leaves it unchanged.
Next, we use these invariants to construct f_p
.
f_p
s*
unchanged or
changes it to g_p(s*)
So, we now prove the following invariant.
q
where q.v.t
\(\geq\)
t*
:
(q.v.s
\(=\) s*
)
or
(exists client p
: q.v.s
\(=\) g_p(s*)
)
f_p()
f_p()
.
Consider two cases of the argument of the function.
Case 1: All elements of [r.v for r in R[i]]
are
identical
f_p
returns g_p(r.v.s)
for any
r
in R[i]
.
Case 2: Not all elements of [r.v for r
in R[i]]
are identical
f_p
returns r*.v.s
where
r*
is an element of R[i]
with the
maximum r.v.t
:
r*.v.t = max(r.v.t for r in R[i])
i
-th
iteration and prove that it holds at the end of the iteration.
We first prove the following lemma.
f_p
returns either
s*
or g_p(s*)
.
Proof
Case 1: All elements of [r.v for r in
R[i]]
are identical
From invariant 1, and because R[i]
is a majority:
For all r
in R[i]
:
r.v.s = s*
.
Therefore f_p
returns g_p(r.v.s)
which is
g_p(s*)
.
Case 2: Not all elements of [r.v for r in
R[i]]
are identical
R[i]
and W*
have at least one element
x
in common because they are both majorities.
From invariant 2, because x
is a member of
W*
:
x.v.t
\(\geq\) t*
.
From the definition of r*.v.t
:
r*.v.t
\(\geq\) x.v.t
.
and so r*.v.t
\(\geq\) t*
.
At the start of the iteration invariant 3 holds and so
(r*.v.s
\(=\) s*
)
or
(exists client p
: r*.v.s
\(=\) g_p(s*)
)
So f_p
returns s*
or g_p(s*)
.
Lemma: Invariant holds at the end of the iteration.
Proof
Consider two cases:W[i]
is either a majority or not a
majority.
Case 1: W[i]
is a majority
s*, W*, t* = s, W[i], t[i]
Therefore for all q
where q.v.t
\(=\)
t*
:
q.v.s
\(=\) s*
.
So the invariant continues to hold.
Case 2: W[i]
is not a
majority
s*, W*, t*
remain unchanged in the
iteration.
The iteration continues to hold because
f_p
returns s*
or g_p(s*)
.
M
, f_p
,
and s*
as specified earlier,
(s* = x) or (exists p: s* = g_p(x))
Proof
Let \(C\) be any computation of the system. There exists a computation \(C'\) where (1) for each agent \(x\) the sequence of steps executed at \(x\) is the same in \(C\) and \(C'\), and (2) transactions are executed in ascending order of epochs in \(C'\). We first prove the result for computation \(C'\).
The sequence of values of q.v
of each server
q
in computation \(C'\) is the same as in an execution of
the following
sequence of of transactions (which was given above and is repeated
here for convenience).
for i in positive_integers: if R[i] is a majority: s = f_p[i](r.v for r in R[i]) for q in W[i]: q.v.s, q.v.t = s, t[i] if W[i] is a majority: s*, t* = s, t[i] W*, p* = W[i], p[i]
The only step that changes s*
assigns:
f_p[i](r.v for r in R[i]))
to s*
.
From the proof of invariant 3:
the value assigned to s*
is s*
or
g_p[i](s*)
.
Therefore the result holds for the computation in which transactions are
executed in epoch order, and so it holds for \(C'\).
The sequence of steps at each agent is the same in \(C\) and \(C'\).
s*
is defined in terms of server variables which have the
identical values in \(C\) and \(C'\).
Therefore the result also holds for \(C\).
K. Mani Chandy, Emeritus Simon Ramo Professor, California Institute of Technology