Fault tolerance is a constant concern in data centers where servers have to run with a minimal level of failures. Changes on the operating conditions or on server demands, and var...
In large-scale, distributed software systems, an important management undertaking concerns the creation and runtime modification of application instances. This short paper propose...
Power management in data centers has become an increasingly important concern. Large server installations are designed to handle peak load, which may be significantly larger than...
Vivek Sharma, Arun Thomas, Tarek F. Abdelzaher, Ke...
We describe the design and implementation of a clustering service for a high-performance, shared-disk file system. The service provides failure detection and recovery, reliable e...
— The Computational Resiliency library (CRLib) provides distributed systems with the ability to sustain operation and dynamically restore the level of assurance in system functio...