Historically employed in the telecommunications industry, online upgrades are now needed in large-scale systems such as electrical utilities, assembly-line manufacturing, customer support, e-commerce, and banking. In this project, we focused on upgrading distributed systems end-to-end [HotDep-2007]. We identified the leading causes of both unplanned and planned downtime due to upgrades in large-scale distributed systems, addressed them through a new upgrading approach, and proposed a new methodology for evaluating the dependability of online upgrades.
We proposed an upgrade-centric fault model, by analyzing independent sources of fault data with unsupervised learning techniques. Our model focuses on human errors in the upgrade procedure, which break hidden dependencies (e.g., specifying wrong service locations, creating database-schema mismatches, introducing shared-library conflicts) in the system under upgrade. There are four common types of upgrade faults:
Simple configuration or procedural errors (e.g., typos)
Semantic configuration errors, which indicate a misunderstanding of the configuration directives used
Broken environmental dependencies (e.g., library or port conflicts)
Data-access errors, which prevent access to persistent data
These faults represent the leading causes of upgrade failure in distributed systems. We also identified incompatible schema changes and computationally-intensive data conversions as the leading causes of planned downtime in a popular Internet system (Wikipedia). We found that software upgrades often induce unplanned downtime by breaking hidden dependencies in the system under upgrade, and that they cannot always prevent planned downtime in the presence of complex schema changes.
We addressed these causes of downtime through a system called Imago [Middleware 2009]. Imago provides the AIR properties:
Atomicity: At any time, the clients of the system under upgrade access the full functionality of either the old version or the new version (but not both). The end-to-end upgrade is an atomic operation.
Isolation: The upgrade operations do not change, remove, or affect in any way the dependencies of the production system (including its performance, configuration settings and ability to access the persistent data).
Runtime-testing: The upgraded system is tested under operational conditions.
By installing the new version in a parallel universe (a distinct collection of resources), Imago isolates the production system from the upgrade operations and avoids breaking hidden dependencies. Imago performs the end-to-end upgrade atomically, while enabling the complex data and schema conversions that commonly impose planned downtime.
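The parallel-universe idea can be illustrated with a minimal Python sketch. This is not Imago's implementation: the names Universe, Frontend, upgrade, and split_name are hypothetical, the "universes" are in-memory dictionaries rather than separate hardware or virtual resources, and the sketch ignores requests that modify data while the conversion is in progress. It only shows how the three AIR properties fit together: the new version is built from a copy of the data (isolation), checked before it serves clients (runtime testing), and exposed through a single switch (atomicity).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Universe:
    """A hypothetical stand-in for one complete deployment (code + data)."""
    version: str
    data: Dict[str, dict] = field(default_factory=dict)


class Frontend:
    """Routes every client request to exactly one universe at a time."""

    def __init__(self, live: Universe):
        self.live = live

    def handle(self, key: str):
        return self.live.version, self.live.data[key]


def upgrade(frontend: Frontend, new_version: str,
            convert: Callable[[dict], dict],
            tests: List[Callable[[Universe], bool]]) -> bool:
    old = frontend.live
    # Isolation: convert a copy of each record; the old universe is never modified.
    new = Universe(new_version,
                   {key: convert(dict(row)) for key, row in old.data.items()})
    # Runtime testing: validate the new universe before any client sees it.
    if not all(test(new) for test in tests):
        return False  # abandon the upgrade; production keeps running the old version
    # Atomicity: a single reference switch; clients see old or new, never a mix.
    frontend.live = new
    return True


def split_name(row: dict) -> dict:
    """Example schema change (assumed): split 'name' into 'first' and 'last'."""
    first, last = row.pop("name").split(" ", 1)
    row.update(first=first, last=last)
    return row


old_universe = Universe("v1", {"alice": {"name": "Alice Smith"}})
frontend = Frontend(old_universe)
has_new_schema = lambda u: all("first" in row for row in u.data.values())

failed = upgrade(frontend, "v2", split_name, [lambda u: False])
version_after_failure = frontend.live.version
succeeded = upgrade(frontend, "v2", split_name, [has_new_schema])
```

In this sketch a failed test leaves the frontend pointing at the old universe, and a successful upgrade leaves the old universe's data intact, so it could still serve as a fallback.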
Starting from examples of errors that occur frequently during real-world software upgrades at Facebook and eBay, I showed that relaxing the atomicity property introduces the risk of mixed-version race conditions [Onward! 2010]. These race conditions involve multiple versions of the software that co-exist, temporarily, during an upgrade and may occur in systems that communicate across administrative domains using asynchronous messaging (e.g., Web 2.0 applications that rely on AJAX client-side code or systems that lease cloud-computing resources). For example, a callback from the new version may be processed by the old version on a different tier of the application. Such mixed-version races might be benign, but they might also result in silent application errors (e.g., during an upgrade that moves code from the server side to the client side, the race would cause the code to be invoked twice or not at all).
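The "invoked twice or not at all" outcome can be reproduced with a small, purely illustrative Python simulation. The scenario is an assumption made for the example, not one taken from the paper: version 1 applies a surcharge on the server, while version 2 moves that code to the client. The names SURCHARGE_PERCENT, add_surcharge, client, server, and checkout are hypothetical.

```python
SURCHARGE_PERCENT = 25  # hypothetical value, chosen to keep the arithmetic in whole cents


def add_surcharge(cents: int) -> int:
    """The piece of code that the upgrade moves from the server to the client."""
    return cents + cents * SURCHARGE_PERCENT // 100


def client(version: int, cents: int) -> int:
    # The v1 client sends the raw amount; the v2 client applies the surcharge itself.
    return add_surcharge(cents) if version == 2 else cents


def server(version: int, cents: int) -> int:
    # The v1 server applies the surcharge; the v2 server expects the client to have done so.
    return add_surcharge(cents) if version == 1 else cents


def checkout(client_version: int, server_version: int, cents: int = 10_000) -> int:
    """Total charged when a client of one version talks to a server of another."""
    return server(server_version, client(client_version, cents))


# Enumerate all four client/server version combinations.
totals = {(c, s): checkout(c, s) for c in (1, 2) for s in (1, 2)}
for (c, s), total in sorted(totals.items()):
    print(f"client v{c} -> server v{s}: {total} cents")
```

Both matched pairs charge 12500 cents. A new client talking to an old server applies the surcharge twice (15625 cents), and an old client talking to a new server never applies it (10000 cents). Neither mixed case raises an error, which is what makes such races silent.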
[Onward! 2010] T. Dumitraș, P. Narasimhan, and E. Tilevich, “To upgrade or not to upgrade: impact of online upgrades across multiple administrative domains,” in ACM SPLASH Onward! Conference, Reno/Tahoe, NV, 2010, pp. 865–876.
[Middleware 2009] T. Dumitraș and P. Narasimhan, “Why Do Upgrades Fail And What Can We Do About It? Toward Dependable, Online Upgrades in Enterprise Systems,” in ACM/IEEE/IFIP Middleware Conference, Urbana-Champaign, IL, 2009, pp. 349–372.
[HotDep-2007] T. Dumitraș, J. Tan, Z. Gho, and P. Narasimhan, “No More HotDependencies: Toward Dependency-Agnostic Upgrades in Distributed Systems,” in Workshop on Hot Topics in System Dependability, Edinburgh, Scotland, 2007.