Distributed Algorithms Contents Index

Basics: FAQ

What is a model?
A model of a system is an abstraction that ignores aspects of the system but which helps in developing algorithms.
What model does this course use?
The choice of a model is an engineering decision, and we use different models to solve different problems. We begin with a simple actor model.
What are weaknesses and limitations of the model?
The model has many weaknesses, only some of which are described here.
Model Limitations: Fairness and Progress
In the model, a computation progresses by delivering a message on any channel. This allows for infinite computations in which some messages remain in a channel forever.

Consider an example of two agents sending tokens to each other. Each agent sends a token that it holds to the other agent. So there is a computation in which tokens go back and forth between the agents forever. Let's call this distributed system D.

Now, consider a system consisting of two identical, and totally independent, copies of D. There is no connection between the two copies. Surely, the behavior of D shouldn't change because of the presence of a completely independent network of agents. But, with our model, it does.

A computation progresses by delivering a message from any nonempty channel. So, there is an infinite computation in which messages are delivered in channels in one copy of D, and no messages are delivered from nonempty channels in the other copy

The problem is that the selection of the nonempty channel in each iteration of the while loop may be unfair -- the same set of channels could be selected forever while other nonempty channels are never selected. The model has no provision for ensuring that a message in a channel will be delivered eventually. We will introduce fair selection, eventuality, and temporal logic later in the course.

Model Limitations: No Construct for Time
The only representation of time in our model is that some events occur after others. An event in which a message is received occurs after an event in which that message is sent. Time plays a critical role in the performance of algorithms even though we never use time in proving the correctness of algorithms.

There are many algorithms in which an agent sends itself a Timeout(T) message where ideally the message would be received approximately T seconds after it is sent. Though we never use T in proving the correctness of algorithms we will use timeout messages in analyzing their performance.

Agents can maintain accurate clocks by using atomic clocks, Precision Time Protocols (PTP) in local area networks, and Network Time Protocol (NTP) servers. Accurate clocks have the property that the time at which an event is sent, as determined by the sender's clock, is (almost always) earlier than the time of the event in which the message is received. We do not, however, rule out the possibility that clocks drift apart so that the sender's clock is far ahead of the receiver's.

Some algorithms for systems with perfect clocks are simpler than those with imperfect clocks, as we shall see.

Model Limitations: Start Up, Shutdown, Failure
The model assumes that all agents and channels are initialized and then agents start receiving messages. The model has a barrier between the point at which initialization takes place and the point at which messages are delivered. The barrier isn't necessary in most algorithms, though it assuming its existence helps us to focus on more important parts of the algorithm.

The model does not specify how termination is detected if the computation does terminate. Nor does the notation have primitives for shutting down agents and channels gracefully so that they don't continue to hold resources after computation has terminated. Protocols such as AMQP do have primitives for starting up and shutting down distributed systems, but we won't discuss them here.

We will describe algorithms that execute on faulty systems in which messages may be lost, duplicated, or delivered out of order, and where agents may stop forever or halt and restart. We also describe algorithms with Byzantine agents. These algorithms are based on models that are different from those given so far.

Model Limitations: Discrete State Space
The state space is discrete in most of the algorithms described here. The state space of some distributed systems has both discrete and continuous components. The state space of a fleet of drones has continuous components.

Later, we study algorithms in which the state space is continuous. Systems with continuous state spaces may have discrete or continuous state transitions.

Model Limitations: Static
The network of agents and messages in the model is static. In contrast, distributed systems evolve; agents and channels may be added and deleted; agents may change; channel protocols may be modified.
Model Limitations: Simple Channels
The model only considers channels in which messages are delivered in the order sent. Some distributed systems have channels in which messages may be duplicated and delivered out of order.

Many models deal with channels in which senders are blocked when a channel gets full. Our model has no concept of a channel being full.

Model Limitations: Summary
Models and notations in this course are not comprehensive -- they do not capture most aspects of distributed systems. We use a model that is appropriate for describing the algorithm at hand. The choice of a model is an engineering decision.
Why use such a simplistic model?
The model is indeed simplistic; however, it is adequate for describing and reasoning about many of the algorithms described in the first part of this course. We introduce other models later. We use the simplest model adequate for the problem at hand.
How many channels are there in a system?
The number of channels in the model is arbitrary. In the model, there can be a channel from any agent to any agent. So, if there are \(N\) agents there can be a maximum of \(N^{2}\) channels.

Other models allow for multiple channels from an agent \(X\) to an agent \(Y\). In our model, however, there can be at most one channel from \(X\) to \(Y\).

Can you describe channels in more detail?
Our basic description of channels is found here. Channels in our model are asynchronous (non-blocking) and unidirectional. A sender can send messages on a channel regardless of how many messages have been sent in the past and how many messages have been received on that channel. The model assumes that the queue of messages in the channel has infinite capacity.

A channel is unidirectional. It is represented by a directed edge in a directed graph. Messages can be sent by an agent P to an agent Q along a channel (P, Q). Agent Q cannot send a message to agent P along channel (P, Q).

A system may have a channel (P, Q) and may or may not have a channel (Q, P). Some systems have bidirectional channels in which messages can be sent in both directions on the same channel. Our model does not allow for bidirectional channels.

In our model, a channel is directed from exactly one agent to exactly one agent. In some systems, multiple agents can send messages on the same channel, and multiple agents can receive messages on the same channel. Our model does not allow for such channels.

In the model, aree messages on a channel received in the order in which they are sent on the channel?
Yes.

Because queues are FIFO (first-in-first-out), for all \(n\), the \(n\)-th message received on a channel is the \(n\)-th message sent on that channel.

Note that messages sent on different channels may not be delivered in the order sent.

How can you show that in the model a message is received only after it is sent?
A message can be removed from a queue only after the message is put in the queue. So, in the model, a message can be received only after it is sent.
Can an agent refuse to receive a message?
In this model an agent cannot refuse to receive a message. If a channel is not empty then a message from the channel can be delivered to the agent independent of the state of the agent.

Suppose you want to design an algorithm in which an agent X refuses to receive messages from an agent Z until it first receives a message from an agent Y. How would you use this model?

In our model, agent X has to receive the message from Z in every state of X.

Agent X can copy the message from Z into a local queue -- a local variable of X -- and process the message only after receiving the message from Y.

Can a full channel block a sender from sending more messages?
In the model a sender is never blocked from sending a message.

Our model assumes that channels have unlimited capacity. We can represent a situation in which channels have limited capacity in the following way. An agent has a local output queue of unlimited size. The agent puts messages into this queue when it cannot send messages on a channel because the channel is full. Messages from this queue are sent on the channel when the channel stops being full.

Can you represent systems in which agents are created and deleted in the model?
The model doesn't have features that allow agents to be created and deleted.

A work around for the case in which agents are created or deleted a finite number of times is as follows. The network of agents in the model is made large enough to include agents that have not been created as yet and also include agents that have been deleted.

An agent that has not been created is represented by an agent that is idle, i.e., one is waiting for a message. It never receives a message until the agent is created. The message is created by the operating system sending a "create" message to the agent and informing other agents that this agent has been created.

An agent this is deleted is represented in the same way. The operating system deletes an agent by sending a "delete" message to the agent. A deleted agent discards messages that it receives without taking action.

Can an agent process background jobs that are interrupted when messages arrive?
No, there are no interruptions in the model. When an agent is executing a receive it is not interrupted.

There are systems in which messages have different priorities, and the arrival of high-priority messages interrupts the execution of low-priority messages. Our model does not allow for interruptions.

How do you represent channels that are not first in first out?
We use queues to represent states of first in first out channels, and we use other data structures -- such as multisets -- to represent states of other types of channels.

Later we describe algorithms with different types of channels. For example, the state of a channel in which messages may overtake each other is a multiset or bag. A message is sent on a channel by adding the message to the multiset. A nonempty multiset may deliver any message in the mutliset to the receiving agent.

Lossy channels, and channels in which messages have priorities, are also modeled by defining channel states appropriately.

Can the model represent sequential composition of distributed computations?
Yes.

There are problems in which we would like to start a distributed computation P; wait for P to finish, and then start another distributed computation Q. We want a barrier between P and Q.

A barrier can be implemented in several ways. An agent, say a coordinator agent, sends messages to all agents to start P. The coordinator then detects termination of P; we discuss termination detection algorithms later. After detecting termination of P the coordinator sends messages to all agents to start Q.

Can the model represent agents that make state transitions without receiving messages?
No, the model does not represent agents that make state transitions without receiving messages. The effect of a state transition without receiving a message can be simulated by an agent sending itself a message and carrying out the transition when it receives a message from itself.

For example, an agent that is computing using a set of files may transition to a state in which it no longer needs those files; this transition occurs without the agent receiving a message. We model this event by having the agent send itself a message when it starts using the set of files, and it transits to a state in which it no longer needs the files when it receives the message from itself.

Representing internal state changes in this way is an artifice; however, the artifice allows us to use a very simple model in which a state change occurs when, and only when, a message is delivered.

What is an agent?
In our model, an agent is an object that sends and receives messages. Some papers refer to an agent as an actor or as a process.
What is a simple example of the notation?
See an example of the notation.
Local clocks can be synchronized using NTP and other protocols. Why do you assume that clocks aren't synchronized?
Local clocks can, indeed, be synchronized very accurately; however, we do not assume that they are perfect.

We use agents' local clocks for evaluating algorithm performance but not for proving correctness because even a small drift can cause race conditions.

Consider an algorithm in which one agent carries out a computation starting at 1pm and another agent carries out a computation starting at 2pm. When we prove the correctness of our algorithms, we allow for the unlikely possibility that that the agent starting its computation at 2pm does so before the agent that starts a 1pm. For evaluating performance, however, we assume that the agent that starts a 1pm usually does so before the agent that starts at 2pm.

What are other examples of models?
The model we use is an actor model. For more general asynchronous models of concurrent systems see UNITY and TLA .

Perhaps the most widely used model is CSP -- Communicating Sequential Processes.

An overview of models used in parallel programming describes models used in shared-memory and distributed computing.

Next
States of a Distributed System.

Review material for this page


K. Mani Chandy, Emeritus Simon Ramo Professor, California Institute of Technology