We are investigating models for building large-scale distributed utilities.
These utilities will deliver compute resources (CPU, storage, and bandwidth) to
hosted applications and, ultimately, end users, much as the electric utility
transparently delivers power on demand to customers. In our model,
applications agree to Service Level Agreements (SLAs) with the utility.
Part of our research investigates the space of possible ways to specify these
SLAs. Based on our initial investigation, SLAs will specify how much an
application is willing to pay for target levels of performance, availability,
and data quality. All metrics are specified as a function of offered
load. Thus, for instance, a given level of aggregate performance for a
very popular service will necessarily cost more than the same level
of performance for a less popular service. SLAs will also encode
penalties, specifying a minimum level of, e.g., performance that an application
expects at a given level of offered load. If the utility is unable to
allocate sufficient resources to an application to meet this minimum level, it
must then credit the application the penalty price. This model gives the
utility an incentive to remain well provisioned and to allocate sufficient
resources to small applications even if larger applications are willing to pay
more during times of resource constraint.
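The SLA structure described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the Opus SLA format: the `SLA` class, its field names, the use of "fraction of offered load served" as the performance metric, and the rule that a missed minimum subtracts the penalty from the payment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA record; field names and units are assumptions."""
    price_per_request: float    # payment per request served
    min_served_fraction: float  # minimum fraction of offered load to serve
    penalty: float              # credit owed if the minimum is missed

    def settle(self, offered_load: float, served_load: float) -> float:
        """Net amount the application owes the utility for one interval."""
        served = min(served_load, offered_load)
        # Cost scales with load, so a popular service pays more for the
        # same relative level of service than a less popular one.
        payment = self.price_per_request * served
        # If the utility misses the minimum level, it credits the penalty.
        if served < self.min_served_fraction * offered_load:
            payment -= self.penalty
        return payment
```

In this sketch a single per-request price stands in for the more general idea that every metric is a function of offered load; a fuller model would presumably replace the constants with functions of load.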
We are targeting utilities consisting of a distributed set of thousands of
Internet sites, each with potentially thousands of individual machines, cooperating
to fulfill aggregate SLAs. Examples of hosted applications
include replicated web services, application-layer multicast, and content
distribution networks. We are building a prototype called Opus, an
Overlay Peer Utility Service, to investigate the following specific
research issues in the context of large-scale distributed utilities:
- Resource allocation: how to allocate resources among competing
applications to maximize aggregate performance, availability, and data
quality while meeting SLA specifications.
- Replica placement: closely related to the question of resource
allocation is determining where to place individual application replicas in
response to dynamically changing client access patterns, network failures,
- Overlay topology construction: once an application is replicated,
it requires some way of synchronizing its state among its replicas. In
the case of multicast-style applications (e.g., video distribution or stock
quotes), an application-layer overlay is fundamental. We are
investigating techniques for constructing overlays that meet application
requirements of performance, delay, and reliability while minimizing
consumed network resources.
- Request routing: once resources have been allocated and replicas
have been placed, clients require a mechanism for discovering the service
replica capable of delivering the highest quality of service to them.
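As a rough illustration of the request routing problem, the sketch below picks a replica for a client from locally known measurements. The `Replica` record, the use of round-trip time as the quality metric, and the utilization threshold are all hypothetical choices for this example, not the mechanism used in Opus.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of one replica as seen by a client."""
    site: str
    rtt_ms: float        # estimated round-trip time from the client
    utilization: float   # fraction of capacity in use, 0.0 to 1.0
    alive: bool = True   # False if the replica is believed to have failed


def pick_replica(replicas: Iterable[Replica],
                 max_utilization: float = 0.9) -> Optional[Replica]:
    """Return the live, non-overloaded replica with the lowest RTT."""
    live = [r for r in replicas if r.alive]
    candidates = [r for r in live if r.utilization < max_utilization]
    # If every live replica is overloaded, fall back to any live one.
    if not candidates:
        candidates = live
    return min(candidates, key=lambda r: r.rtt_ms, default=None)
```

The inputs may be stale or wrong in practice, so a real router would treat the result as a best guess and retry against another replica on failure.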
The key challenge for all of the above operations is tolerating failure as
the common case and making distributed decisions that approximate the global
optimum based on incomplete and inaccurate information. Note that the
above four issues are closely interrelated. For example, resource allocation
influences replica placement and vice versa.
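To make the resource allocation issue concrete, here is a small greedy sketch in which penalties protect small applications before spare capacity goes to the highest payer. It is a heuristic under stated assumptions, not the Opus allocator and not guaranteed to be optimal: the `Demand` record, the single abstract resource "unit", and the two-pass ordering are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Demand:
    """Hypothetical per-application demand derived from its SLA."""
    name: str
    min_units: int         # units needed to reach the SLA minimum
    penalty: float         # credit owed if the minimum is missed
    wanted_units: int      # units needed to reach the SLA target
    price_per_unit: float  # payment per unit beyond the minimum


def allocate(capacity: int, demands: List[Demand]) -> Dict[str, int]:
    """Greedy two-pass allocation of `capacity` resource units."""
    alloc = {d.name: 0 for d in demands}
    # Pass 1: cover SLA minimums, highest penalty per unit first.
    by_penalty = sorted(demands,
                        key=lambda d: d.penalty / max(d.min_units, 1),
                        reverse=True)
    for d in by_penalty:
        if d.min_units <= capacity:
            alloc[d.name] = d.min_units
            capacity -= d.min_units
    # Pass 2: hand out what is left to the highest payer first.
    by_price = sorted(demands, key=lambda d: d.price_per_unit, reverse=True)
    for d in by_price:
        extra = max(0, min(d.wanted_units - alloc[d.name], capacity))
        alloc[d.name] += extra
        capacity -= extra
    return alloc
```

With 10 units, a small application needing 2 units to avoid its penalty keeps them even when a larger application would pay more per unit for all 10.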
Current subprojects within Opus include:
- ACDC: Adaptive low-Cost/Delay Constrained overlays that utilize O(lg
n) state and probing to build structures that conform to underlying
- Cluster on Demand: an automated resource management framework that
facilitates the separation of usage and management through the concept of
virtual clusters.
- Malachi: scalable and reliable
- MACEDON: automatically generating code for various overlay
Publications related to Opus include:
- "Back to the Future: Dependable Computing = Dependable
Services," Jeff Chase, Amin Vahdat, and John Wilkes. Proceedings
of the 10th European SIGOPS Workshop, September 2002. [PDF]
- "Service Level Agreement Based Distributed Resource Allocation for
Streaming Hosting Systems," Yun Fu and Amin Vahdat. Proceedings of the
7th International Workshop on Web Content Caching and Distribution (WCW),
August 2002. [PDF]
- "Opus: an Overlay Peer Utility
Service," Rebecca Braynard, Dejan Kostic, Adolfo Rodriguez, Jeffrey
Chase, and Amin Vahdat. Proceedings
of the 5th International Conference on Open Architectures and Network Programming
(OPENARCH), June 2002. [PDF]
- "Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance,
Availability, and Data Quality," Amin Vahdat. Proceedings of
the International Workshop on Future Directions in Distributed Computing (FuDiCo),
June 2002. [PDF]
- "Self-Organizing Subsets: From
Each According to His Abilities, To Each According to His Needs," Amin
Vahdat, Jeffrey Chase, Rebecca Braynard, Dejan Kostic, and Adolfo Rodriguez.
Proceedings of the First International Peer to Peer Symposium (IPTPS),
March 2002. [PDF]
- "OPUS: Overlay Utility Service," Rebecca Braynard, Dejan Kostic,
Adolfo Rodriguez, Jeff Chase, and Amin Vahdat. Poster at the 18th ACM Symposium on
Operating Systems Principles (SOSP), Banff, Canada, October 2001. [PDF]
- "Managing Energy and Server Resources in Hosting Centers,"
Jeff Chase, Darrell Anderson, Prachi Thakar, Amin Vahdat, and Ron Doyle.
Proceedings of the 18th Symposium on Operating Systems Principles (SOSP), October 2001.
- "Server Switching: Yesterday and Tomorrow," Jeff Chase. Second
IEEE Workshop on Internet Applications (WIAPP '01), July 2001. [Postscript,