Internet Systems and Storage Group
Software architectures for scalable wide-area systems
Duke Computer Science
Malachi is a distributed event notification system designed for scalability, failure resilience, adaptability to dynamic network conditions and reduced latencies. To increase client perceived performance, Malachi distributes nodes throughout the network in order to allow client access to nearby servers instead of distant locations. These nodes communicate across the wide-area through an overlay network. Events propagate through the system via restricted flooding, a method of data dissemination without the full broadcast of general flooding while allowing for a general graph structure. This generality allows Malachi to offer multiple possibilities of routes for the events to follow. The use of multiple routes increases the availability of the system and decreases the time needed to recover from failures.
Malachi is a large system comprised of multiple directions of research. A few of these include replica management (placement, connectivity, topic management, and per-topic overlay communication), data dissemination (event propagation, communication frequency and network resource consumption), overlay maintenance (network state collection and local decision convergence to global good) and our current focus, reliability. Malachi is designed to provide a scalable, reliable service. To do this, we are exploring the issues of disjoint paths, loss correlations and anti entropy-like event propagation.
One goal of Malachi is to reliably deliver events to destinations based upon a target reliability from the application. In this component of the research, we want to take a target reliability and nodes with neighbor connection information to produce end-to-end paths matching the target reliability. The resulting topology is based on network probes from nodes to their direct neighbors used in a distributed graph construction algorithm. These end-to-end paths allow Malachi to meet the reliability specifications of the higher layer application.
With restricted flooding, all nodes participating in a per-application overlay receive the data, however, unlike traditional flooding, the nodes receive the information once. The data does not necessarily traverse every link in the overlay, however, it must reach every node. This idea is similar to anti entropy used in distributed computing to maintain consistency between replicas.
This figure shows a simple topology with a single sender and single destination. With standard flooding, each node would receive messages multiple times. Restricted flooding prevents these redundant messages by first verifying the messages the neighbor node needs. Since nodes only send events that neighbor nodes do not already have, we are able to reduce the amount of consumed network resources. This figure depicts these savings by the center node and the destination. Without restricted flooding, the nodes would have received at least two copies of the event, instead, they receive only one.
Restricted flooding allows Malachi to "flood" the overlay nodes with events, while reducing the amount of network resources consumed. Traditional flooding would overwhelm the network, and hinder scalability. Restricted flooding shares the benefits of traditional flooding, but reduces the consumption of resources, allowing for scalability.
Initially, we could build a tree from the source to all destinations with paths corresponding to the overlay paths with the maximum reliability. In the following example, we see the maximum reliability tree to destinations 1 and 2. The resulting reliabilities are from the product of the individual link reliabilities along the path.
This structure has the benefit of the single highest reliability path to the destinations; however, there are a few drawbacks. The tree is not well suited for multiple senders. To support multiple senders, there will either be a single rendezvous point, a tree per sender, or the risk of a sender located at a leaf of the tree. These disadvantages are combined with the fact that if a link fails, not all of the destinations in the overlay will receive the events. In order to overcome these concerns, we propose to build a general graph with multiple disjoint (fully independent) paths to the destinations.
With a general graph composed of disjoint paths, we are able to provide backup routes in case of individual link failure. In order to reduce the consumption of resources, we use restricted flooding, as discussed earlier. Using multiple routes allows Malachi to avoid convergence times after a failure and still deliver the data. We can see an example in the following graph:
In this example, there are actually twelve different routes to each of the destinations. These alternate routes will provide the means for delivery in the face of network failures. The alternate routes also increase the reliability for delivery. Using simple state space decompositions, we can see the increase in reliability, from 0.951 to 0.997 in destination 1 and from 0.922 to 0.996 in destination 2. This simple example illustrates the possible benefits of multi-path routing to the destinations combined with the savings in network consumption from restricted flooding.
By selectively adding links to the new overlay, we are able to tune the overlay to the desired target level of reliability. Adding links does come at a cost, depending on the amount an application is willing to "spend" on reliability, we can determine the number of links to be added to the overlay. Allowing the overlay to tune to the applications specifications gives the overlay construction algorithm flexibility in meeting the needs of applications.
In the simple example above, we assumed that the links were disjoint, in that the underlying topology links were independent. In real networks, this will not always be the case, as in the following diagrams.
In the first diagram we see that the links from node 1 to nodes 2 and 3 are disjoint. This is the assumption we made in the example topology for the demonstration of increased reliability. Real networks will not always exhibit this case. Many overlay links will actually correspond to partially disjoint links in the physical topology, as in the links from node 4 to nodes 5 and 6. In these cases, we must account for shared underlying links and identify the corresponding overlay links when building the reliable graph.
In some cases, there may not be multiple paths available, as in the following example. Node D has one link in this example, thus, it would not be possible to find multiple paths to or from D.
In order for Malachi to be a scalable base for applications to utilize, we are not able to depend upon global node or network information. To maintain scalability, we utilize local network state, and make decisions based upon this state that converge towards the global good. In this subset of the research, the global good is defined as the target reliability and the cost the application is willing to pay for the target. Thus, local state must be used to make decisions to reach the global target level of reliability. In the future, we will develop distributed algorithms to determine if the application specifications are met through the use of the decentralized overlay construction algorithm.
More detail is available from the following:
Updated October 27, 2001.