ChannelSnapshots/GlobalSnapshotApplications.html

Applications of Global Snapshots

This webpage discusses a few of the many applications of global snapshots.

Collecting Local Snapshots to form Global Snapshots

The global snapshot algorithm shows how each agent records its own local state and the states of its input channels. The global snapshot algorithm does not specify how local snapshots are put together to form a global snapshot.

One way to collect local snapshots to form a global snapshot is to have an operating systems agent act as an observer. Each agent sends its local snapshot to the observer which puts the local snapshots together to form the global snapshots.

Successive global snapshot algorithms are disambiguated by using timestamps. Messages sent in a global snapshot algorithm initiated by an agent X when the local clock at X is at t have a timestamp (t, X). So snapshot algorithms initiated by the same agent at different times or by different agents have different timestamps.

System Monitoring

Systems are monitored by taking global snapshots repeatedly. System monitoring is based on the following observation.

Observation

Let \(S_{0}, S_{1}, S_{2}, \ldots, \) be the sequence of states of global snapshots recorded by the system. There exists a computation that visits each state \(S_{i}\) in order of increasing \(i\).

Proof

The proof is essentially the same as the proof of the theorem "Computations and Cuts."

System monitoring has many applications. A system monitor takes global snapshots repeatedly and checks a sequence of snapshots to determine if some action is required. Examples of actions are breaking a deadlock or rolling back steps of a computation and restarting the computation from a snapshot.

Rollback and Recovery

Let \(S^{*}\) be the most recent snapshot recorded by a system monitor. From, the theorem "Computations and Cuts" there exists a computation that starts at the initial state and visits \(S^{*}\). So, if an error is detected in a computation then the computation can be restarted from \(S^{*}\) rather than rolling all the way back to the initial state.

Detecting Stable Predicates

A stable predicate is a predicate with the following property: If the predicate holds at any point in any computation then it continues to hold forever thereafter in that computation. Equivalently, if a stable predicate holds in a state \(s\) then it holds in all states reachable from \(s\).

Examples of stable predicates are: "The computation has terminated," and "The computation is deadlocked." If a computation has terminated at some point then it remains terminated. Likewise if a computation has deadlocked then it remains deadlocked.

Specification of Detection Algorithms

An algorithm to detect a stable property \(P\) has the following specification.

No false positives
If \(P\) holds when the algorithm is initiated then the algorithm detects that \(P\) holds.
No false negatives
If the algorithm detects that \(P\) holds then \(P\) holds when the algorithm terminates.

For example, an algorithm to detect whether a computation has deadlocked must satisfy the following conditions.

If the computation is deadlocked when the detection algorithm is initiated then the detection algorithm must detect that the computation is deadlocked.
If the detection algorithm detects that computation has deadlocked then computation must be deadlocked when the detection algorithm terminates.

A Strategy for Developing Detection Algorithms

A general strategy for developing algorithms that detect stable properties is as follows. The operating system takes repeated snapshots of the computation and determines that the specified stable property holds if it holds in any snapshot.

An alternate strategy is for agents to exchange information about their local snapshots without sending the information to an observer. Distributed algorithms on local snapshots operate in two phases. In the first phase a global snapshot algorithm is executed. The local snapshot of each agent and its incoming channels are stored locally, at the agent, without sending the information to observers.

In the second phase a distributed algorithm is executed to determine if the local information stored at agents satisfies a specified global property, such as "computation has deadlocked." These algorithms are often distributed graph algorithms.

The development of algorithms that detect properties of states of computations appear to be difficult because states while the detection algorithm is running. The two-phases separate the problem into two simpler parts: In phase 1, record local states during the computation using a global snapshot algorithm. In phase 2 analyze the static local snapshots. The algorithm in the second phase operates on unchanging data, and this simplifies the design of the algorithm. The two phases can be executed concurrently in many applications.

A code skeleton of the algorithm and examples of the global snapshot algorithm are provided here. Next logical clocks.

K. Mani Chandy, Emeritus Simon Ramo Professor, California Institute of Technology