A distributed computation has terminated in a state in which all agents are waiting to receive messages and all channels are empty. A termination detection algorithm is executed by the operating system to determine whether a client computation has terminated.
A computation terminates in states in which all channels are empty and all agents are waiting to receive messages. An agent is said to be active while it is processing a message and idle while it is waiting to process a message. A terminated state is one in which all agents are idle and all channels are empty.
Let \(C_{s}\) and \(C_{r}\) be the numbers of messages sent and received (respectively) on channel \(C\). A computation is in a terminated state exactly when:
The problem is to design an algorithm that detects whether the computation is in a terminated state.
Use one of the strategies described earlier. Let's use observers and optimize the observer-based global snapshot algorithm for the special case of termination detection.
There exists a consistent cut (past, future)
of a dataflow graph
exactly when every edge to a past
vertex is from a
past
vertex.
In our algorithms, past
is the set of steps (vertices)
at each agent before the agent takes its local snapshot.
This guarantees that every agent edge to a past
vertex
is from a past
vertex.
The algorithm must also ensure that every message edge to a
past
vertex is from a past
vertex.
To do so we use the following observation.
Let past
be any set of steps in a computation.
For a channel \(C\), let \(C_{s}\) and \(C_{r}\) be the numbers of
messages sent and received, respectively, on channel \(C\), in
past
.
Every message received along \(C\) in a step in
past
is sent in a step in past
exactly
when:
\(\; C_{s} \geq C_{r}\)
Agent Actions
When an agent changes state from active to idle (i.e., the agent starts waiting to receive a message) the agent sends a message to the observer. The message to the observer contains \(C_{s}\) for each output channel \(C\) where \(C_{s}\) is the number of messages sent on channel \(C\) up to the point at which the agent becomes idle. The message also contains \(C_{r}\) for each input channel of the agent where \(C_{r}\) is the number of messages received on channel \(C\) up to this point.Observer Actions
The observer keeps only the latest message that it receives from each agent. For each channel \(C\), let \(C_{s}^{*}\) and \(C_{r}^{*}\) be the latest value of \(C_{s}\) and \(C_{r}\), respectively, that the observer has received.Initial Condition
All agents are idle. \(C_{s}^{*}\) and \(C_{r}^{*}\) are the numbers of messages sent and received (respectively) on channel \(C\) for all \(C\). Some channels may contain messages.
The observer detects that computation has terminated if
For all channels \(C\): \(C_{s}^{*} = C_{r}^{*}\).
We first prove that if the observer detects that the computation has terminated then the computation has indeed terminated.
Assume that the observer has detected termination. Then for each channel \(C\), either \(C_{s}^{*} = C_{r}^{*}\) initially, or the observer received messages containing \(C_{s}^{*}\) and \(C_{r}^{*}\) such that \(C_{s}^{*} = C_{r}^{*}\).
Let(past, future)
be a partition of the steps of the
computation where past
consists of steps at an agent
before the agent sent the messages to the observer containing
\(C_{s}^{*}\) and
\(C_{r}^{*}\) for channels \(C\) incident on the agent.
(If the agent sends no messages to the observer
then the agent has no steps in past
other than the
initialization step.)
Because \(C_{s}^{*} \geq C_{r}^{*}\) for all channels \(C\)
it follows
that (past, future)
is a consistent cut.
From the definition of terminated state it follows that the state at this consistent cut is a terminated state. Termination is a stable property -- once computation has terminated it remains terminated. So, if the state at a consistent cut of the computation is a terminated state then all succeeding states are terminated states.
Next we prove that if the computation terminates then the observer detects termination. Assume that the computation has terminated. The last message sent by each agent has counts \(C_{s}^{*}\) and \(C_{r}^{*}\) of the numbers of messages sent and received (respectively) for each of its incident channels \(C\). Because these are the last messages sent when the algorithm terminates it follows that \(C_{s}^{*} = C_{r}^{*}\) for all \(C\).
K. Mani Chandy, Emeritus Simon Ramo Professor, California Institute of Technology