Duke ISSG: Service Utilities

Internet Systems and Storage Group: Software architectures for scalable wide-area systems. Duke Computer Science.

Service Utilities

We are investigating models for building large-scale distributed utilities. These utilities will deliver computing resources (CPU, storage, and bandwidth) to hosted applications and, ultimately, to end users, much as the electric utility transparently delivers power on demand to customers. In our model, applications enter into Service Level Agreements (SLAs) with the utility. Part of our research investigates the space of possible ways to specify these SLAs. Based on our initial investigation, SLAs will specify how much an application is willing to pay for target levels of performance, availability, and data quality. All metrics are specified as a function of offered load. Thus, for instance, delivering a given level of performance to a very popular service will necessarily cost more than delivering the same level of performance to a less popular service. SLAs will also encode penalties by specifying a minimum level of, e.g., performance that an application expects at a given level of offered load. If the utility is unable to allocate sufficient resources to an application to meet this minimum level, it must credit the application the penalty price. This model gives the utility an incentive to remain well provisioned and to allocate sufficient resources to small applications even if larger applications are willing to pay more during times of resource constraint.
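As a rough illustration of the SLA model described above, the following Python sketch encodes one metric as functions of offered load, together with a penalty. All names (`SLA`, `settle`, the lambdas) are hypothetical and do not come from Opus. The linear pro-rating between the minimum and the target level is an assumption made only to keep the example complete.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA for one metric (here: requests served per second)."""
    target: Callable[[float], float]   # offered load -> target level
    minimum: Callable[[float], float]  # offered load -> minimum acceptable level
    price: Callable[[float], float]    # offered load -> payment when the target is met
    penalty: float                     # credit owed when the minimum is missed

    def settle(self, load: float, delivered: float) -> float:
        """Net payment from application to utility for one interval.

        A negative result is a credit from the utility to the application.
        """
        floor, goal = self.minimum(load), self.target(load)
        if delivered < floor:
            return -self.penalty
        if delivered >= goal:
            return self.price(load)
        # Assumption: between minimum and target, payment is pro-rated linearly.
        return self.price(load) * (delivered - floor) / (goal - floor)


sla = SLA(
    target=lambda load: load,          # serve every offered request
    minimum=lambda load: 0.8 * load,   # at least 80% of offered requests
    price=lambda load: load / 100.0,   # more popular services pay more
    penalty=5.0,
)

print(sla.settle(load=1000, delivered=1000))  # 10.0
print(sla.settle(load=1000, delivered=900))   # 5.0
print(sla.settle(load=1000, delivered=700))   # -5.0
```

Because the price, target, and minimum are all functions of load, the same code covers both a popular and an unpopular service. Only the offered load passed to `settle` changes.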

We are targeting utilities consisting of a distributed set of thousands of Internet sites, each with potentially thousands of individual machines, cooperating to fulfill aggregate SLAs. Examples of hosted applications include replicated web services, application-layer multicast, and content distribution networks. We are building a prototype called Opus, an Overlay Peer Utility Service, to investigate the following specific research issues in the context of large-scale distributed utilities:

Resource allocation: how to allocate resources among competing applications to maximize aggregate performance, availability, and data quality while meeting SLA specifications.
Replica placement: closely related to the question of resource allocation is determining where to place individual application replicas in response to dynamically changing client access patterns, network failures, etc.
Overlay topology construction: once an application is replicated, it requires some way of synchronizing its state among its replicas. In the case of multicast-style applications (e.g., video distribution or stock quotes), an application-layer overlay is fundamental. We are investigating techniques for constructing overlays that meet application requirements of performance, delay, and reliability while minimizing consumed network resources.
Request routing: once resources have been allocated and replicas have been placed, clients require a mechanism for discovering the service replica capable of delivering the highest quality of service to them.
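To make the resource allocation issue more concrete, here is a minimal, centralized sketch in Python. It is not the Opus algorithm. It is a two-phase heuristic, assumed for illustration, that first covers each application's SLA minimum (avoiding penalties) and then hands the remaining capacity to the applications that pay the most. The `App` fields and the `allocate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class App:
    name: str
    min_units: int         # units needed to reach the SLA minimum
    target_units: int      # units needed to reach the SLA target
    penalty: float         # credit owed if the minimum is missed
    price_per_unit: float  # payment for each unit above the minimum


def allocate(apps: List[App], capacity: int) -> Dict[str, int]:
    """Split `capacity` resource units among `apps` (heuristic, not optimal)."""
    alloc = {a.name: 0 for a in apps}

    # Phase 1: cover minimums, largest penalty avoided per unit first.
    by_penalty = sorted(
        apps, key=lambda a: a.penalty / max(a.min_units, 1), reverse=True
    )
    for a in by_penalty:
        if a.min_units <= capacity:
            alloc[a.name] = a.min_units
            capacity -= a.min_units

    # Phase 2: hand out what is left, highest-paying application first.
    by_price = sorted(apps, key=lambda a: a.price_per_unit, reverse=True)
    for a in by_price:
        if alloc[a.name] < a.min_units:
            continue  # minimum missed; assume the penalty is owed regardless
        extra = max(0, min(capacity, a.target_units - a.min_units))
        alloc[a.name] += extra
        capacity -= extra
    return alloc


big = App("big", min_units=4, target_units=12, penalty=8.0, price_per_unit=3.0)
small = App("small", min_units=2, target_units=4, penalty=6.0, price_per_unit=1.0)

print(allocate([big, small], capacity=10))  # {'big': 8, 'small': 2}
```

In this example the small application keeps its minimum even though the large application pays more per unit, which is the incentive the penalty mechanism is meant to create. A real utility would have to make this decision across many sites without a global view.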

The key challenge for all of the above operations is tolerating failure as the common case and making distributed decisions that approximate the global optimum based on incomplete and inaccurate information. Note that the above four issues are closely interrelated. For example, resource allocation influences replica placement, and vice versa.
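The request routing issue, combined with the need to treat failure as the common case, can be sketched as follows. This Python example is an assumption-laden illustration rather than the Opus mechanism. The `Replica` fields, the `expected_delay_ms` scoring heuristic, and the `route` function are all hypothetical, and the measurements a real client holds may be stale or wrong.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    site: str
    rtt_ms: float        # last measured round-trip time from this client
    utilization: float   # last reported load, 0.0 (idle) to 1.0 (saturated)
    reachable: bool = True


def expected_delay_ms(r: Replica) -> float:
    # Hypothetical score: inflate network delay as the replica nears saturation.
    return r.rtt_ms / (1.0 - r.utilization)


def route(replicas: Iterable[Replica]) -> Optional[Replica]:
    """Pick the usable replica with the lowest expected delay, if any."""
    usable = [r for r in replicas if r.reachable and r.utilization < 1.0]
    return min(usable, key=expected_delay_ms) if usable else None


replicas = [
    Replica("site-a", rtt_ms=20.0, utilization=0.9),
    Replica("site-b", rtt_ms=50.0, utilization=0.5),
    Replica("site-c", rtt_ms=10.0, utilization=0.1, reachable=False),
]

best = route(replicas)
print(best.site if best else "no replica available")  # site-b
```

The nearest replica (`site-c`) is skipped because it is unreachable, and the next nearest (`site-a`) loses to `site-b` because it is close to saturation. Returning `None` when nothing is usable leaves the fallback policy to the caller.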

Current subprojects within Opus include:

ACDC: Adaptive low-Cost/Delay Constrained overlays that utilize O(lg n) state and probing to build structures that conform to underlying network characteristics.
Cluster on Demand: an automated resource management framework that facilitates the separation of usage and management through the concept of virtual clusters.
Malachi: scalable and reliable publish/subscribe infrastructure.
MACEDON: automatic generation of code for various overlay construction algorithms.

Publications

"Back to the Future: Dependable Computing = Dependable Services," Jeff Chase, Amin Vahdat, and John Wilkes. Proceedings of the 10th European SIGOPS Workshop, September 2002. [PDF]
"Service Level Agreement Based Distributed Resource Allocation for Streaming Hosting Systems," Yun Fu and Amin Vahdat. Proceedings of the 7th International Workshop on Web Content Caching and Distribution (WCW), August 2002. [PDF]
"Opus: an Overlay Peer Utility Service," Rebecca Braynard, Dejan Kostic, Adolfo Rodriguez, Jeffrey Chase, and Amin Vahdat. Proceedings of the 5th International Conference on Open Architectures and Network Programming (OPENARCH), June 2002. [PDF]
"Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality," Amin Vahdat. Proceedings of the International Workshop on Future Directions in Distributed Computing (FuDiCo), June 2002. [PDF]
"Self-Organizing Subsets: From Each According to His Abilities, To Each According to His Needs," Amin Vahdat, Jeffrey Chase, Rebecca Braynard, Dejan Kostic, and Adolfo Rodriguez. Proceedings of the First International Peer to Peer Symposium (IPTPS), March 2002. [PDF]
"OPUS: Overlay Utility Service," Rebecca Braynard, Dejan Kostic, Adolfo Rodriguez, Jeff Chase, and Amin Vahdat. Poster at the 18th ACM Symposium on Operating System Principles (SOSP), Banff, Canada, October 2001. [PDF]
"Managing Energy and Server Resources in Hosting Centers," Jeff Chase, Darrell Anderson, Prachi Thakar, Amin Vahdat, and Ron Doyle. Proceedings of the 18th Symposium on Operating Systems Principles (SOSP), October 2001. [Postscript] [PDF]
"Server Switching: Yesterday and Tomorrow," Jeff Chase. Second IEEE Workshop on Internet Applications (WIAPP '01), July 2001. [Postscript] [PDF]