Internet Systems and Storage Group
Software architectures for scalable wide-area systems
Duke Computer Science


Service Utilities

We are investigating models for building large-scale distributed utilities.  These utilities will deliver compute resources (CPU, storage, and bandwidth) to hosted applications and, ultimately, to end users, much as the electric utility transparently delivers power on demand to its customers.  In our model, applications agree to Service Level Agreements (SLAs) with the utility.  Part of our research investigates the space of possible ways to specify these SLAs.  Based on our initial investigation, an SLA will specify how much an application is willing to pay for target levels of performance, availability, and data quality.  All metrics are specified as a function of offered load.  Thus, for instance, delivering a given level of performance to a very popular service will necessarily cost more than delivering the same level of performance to a less popular service.  SLAs will also encode penalties by specifying a minimum level of, e.g., performance that an application expects at a given level of offered load.  If the utility is unable to allocate sufficient resources to an application to meet this minimum level, it must credit the application the penalty price.  This model gives the utility an incentive to remain well-provisioned and to allocate sufficient resources to small applications even if larger applications are willing to pay more during times of resource constraint.
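The SLA model described above can be illustrated with a minimal Python sketch. It is not the Opus implementation. The names (SLA, settle, utility_revenue), the use of throughput in requests per second as the only metric, the numeric values, and the linear proration of payment between the minimum and target levels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Every SLA term is a function of offered load (assumed here: requests/sec).
LoadFn = Callable[[float], float]


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA record; field names are illustrative only."""
    price: LoadFn      # payment per interval when the target level is met
    target: LoadFn     # target performance at a given offered load
    minimum: LoadFn    # minimum acceptable performance at that load
    penalty: float     # credit owed to the application below the minimum


def settle(sla: SLA, offered_load: float, delivered: float) -> float:
    """Net payment from the application to the utility for one interval."""
    if delivered < sla.minimum(offered_load):
        # Minimum level missed: the utility credits the penalty price.
        return -sla.penalty
    target = sla.target(offered_load)
    if delivered >= target:
        return sla.price(offered_load)
    # Assumption: between minimum and target, payment is prorated linearly.
    return sla.price(offered_load) * delivered / target


def utility_revenue(allocations: Iterable[Tuple[SLA, float, float]]) -> float:
    """Total revenue over (sla, offered_load, delivered) triples."""
    return sum(settle(sla, load, delivered) for sla, load, delivered in allocations)


# Two hypothetical applications with the same per-request price.
large = SLA(price=lambda load: load / 100, target=lambda load: load,
            minimum=lambda load: load / 2, penalty=5.0)
small = SLA(price=lambda load: load / 100, target=lambda load: load,
            minimum=lambda load: load / 2, penalty=4.0)

# The utility can serve 1050 requests/sec, but total offered load is 1200.
starve_small = utility_revenue([(large, 1000, 1000), (small, 200, 50)])
share = utility_revenue([(large, 1000, 950), (small, 200, 100)])

print(starve_small)  # 6.0
print(share)         # 10.5
```

Under these assumed numbers, fully serving the large application while pushing the small one below its minimum earns the utility less than keeping both above their minimum levels, which is the incentive the penalty term is meant to create.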

We are targeting utilities consisting of a distributed set of thousands of Internet sites, each with potentially thousands of individual machines, cooperating to fulfill aggregate SLAs.  Examples of hosted applications include replicated web services, application-layer multicast, and content distribution networks.  We are building a prototype called Opus (Overlay Peer Utility Service) that investigates the following specific research issues in the context of large-scale distributed utilities:

The key challenge for all of the above operations is tolerating failure as the common case and making distributed decisions that approximate the global optimum based on incomplete and inaccurate information.  Note that the above four issues are closely interrelated.  For example, resource allocation influences replica placement, and vice versa.

Current subprojects within Opus include: