Improving the Dependability of Distributed Systems Through AIR Software Upgrades

Title: Improving the Dependability of Distributed Systems Through AIR Software Upgrades
Publication Type: Theses
Year of Publication: 2010
Authors: Dumitras T
Date Published: 2010
University: Carnegie Mellon University
City: Pittsburgh, PA, USA

Traditional fault-tolerance mechanisms concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, scheduled events, such as software upgrades, account for most of the system unavailability and often introduce data corruption or latent errors. Through two empirical studies, this dissertation identifies the leading cause of upgrade failure (breaking hidden dependencies) and the leading cause of planned downtime (complex data conversions) in distributed enterprise systems. These findings represent the foundation of a new benchmark for software-upgrade dependability. This dissertation further introduces the AIR properties (ATOMICITY, ISOLATION and RUNTIME-TESTING) required for improving the dependability of distributed systems that undergo major software upgrades. The AIR properties are realized in Imago, a system designed to reduce both planned and unplanned downtime by upgrading distributed systems end-to-end. Imago builds upon the idea of isolating the production system from the upgrade operations, in order to avoid breaking hidden dependencies and to decouple the data conversions from the normal system operation. Imago includes novel mechanisms, such as providing a parallel universe for the new version, performing data conversions opportunistically, intercepting the live workload at the ingress and egress points, or executing an atomic switchover to the new version, which allow it to deliver the AIR properties. Imago harnesses opportunities provided by emerging cloud-computing technologies, trading resource overhead (needed by the parallel universe) for improved dependability of the software upgrades. This approach separates the functional aspects of the upgrade from the mechanisms for online upgrade, enabling an upgrade-as-a-service model. This dissertation also describes techniques for assessing the impact of software upgrades, in order to reason about the implications of relaxing the AIR guarantees.