We built the MEAD (Middleware for Embedded Adaptive Dependability) system, which served as a fault-tolerance platform in the DARPA ARMS-II and PCES-II programs. MEAD relies on system-call interception, so it modifies neither the applications nor the operating system. In this manner, MEAD enhances legacy CORBA applications with replication and recovery mechanisms, and it allows users to select the appropriate trade-offs among resource usage, fault tolerance and performance [CC:PE 2005].
MEAD provides versatile dependability [ADS III]. Low-level knobs control the internal fault-tolerance mechanisms of the infrastructure (e.g., the degree and style of replication). In contrast, high-level knobs regulate external properties (e.g., scalability, availability) that are relevant to the system’s users, and they hide the internal implementation details. For example, a low-level knob allows switching between active and passive replication on the fly, and a high-level knob tunes the system’s scalability.
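The split between the two kinds of knobs can be illustrated with a minimal Python sketch. This is not MEAD's actual interface: the names `ReplicationStyle`, `LowLevelKnobs` and `ScalabilityKnob` are hypothetical, and the policy table holds arbitrary placeholder values rather than measured results. The sketch only shows how a high-level knob might accept a user-facing target (here, a number of clients) and hide the low-level settings it selects.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    ACTIVE = "active"
    WARM_PASSIVE = "warm_passive"


@dataclass(frozen=True)
class LowLevelKnobs:
    """Internal settings: the style and degree of replication."""
    style: ReplicationStyle
    replicas: int


class ScalabilityKnob:
    """Hypothetical high-level knob: the user states a client count and
    never sees which low-level settings are chosen to support it."""

    def __init__(self, policy):
        # policy: iterable of (max_clients, LowLevelKnobs) pairs.
        self._policy = sorted(policy, key=lambda entry: entry[0])

    def tune(self, clients):
        # Pick the first entry whose client limit covers the request.
        for max_clients, knobs in self._policy:
            if clients <= max_clients:
                return knobs
        raise ValueError(f"no configuration supports {clients} clients")


# Placeholder policy; a real one would come from empirical measurements.
knob = ScalabilityKnob([
    (10, LowLevelKnobs(ReplicationStyle.ACTIVE, replicas=3)),
    (50, LowLevelKnobs(ReplicationStyle.WARM_PASSIVE, replicas=2)),
])
chosen = knob.tune(30)
print(chosen.style.value, chosen.replicas)
```

In this sketch, the caller of `tune` reasons only about the external property, while the mapping to replication style and degree stays inside the knob.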
MEAD also helped us characterize the inherent unpredictability of fault-tolerant middleware [Middleware 2005]. While MEAD’s maximum response time is not predictable, the latency outliers are limited to only 1% of the remote invocations. We showed that this is a general phenomenon through a broad empirical study of unpredictability in 15 additional systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications [COMNET 2013]. The maximum latency is not influenced by the system’s workload, cannot be regulated through configuration parameters and is not correlated with the system’s resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after selectively filtering out the 1% of invocations with the highest recorded response times, the latency becomes bounded with high statistical confidence. We verified this result on different operating systems (Linux 2.4, Linux 2.6, Linux-rt, TimeSys), middleware platforms (CORBA and EJB), programming languages (C, C++ and Java), replication styles (active and warm passive) and applications (e-commerce and online gaming). Moreover, this phenomenon occurs at all the layers of middleware-based systems, from the communication protocols to the business logic. This suggests that, while the emergent behavior of middleware is not strictly predictable, distributed systems could cope with the inherent unpredictability by focusing on statistical measures, such as the 99th percentile of the latency.
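The filtering step described above can be sketched in a few lines of Python. This is an illustration on synthetic data, not the analysis code or the traces from the study: the helper names `percentile` and `drop_slowest` are hypothetical, and the latency distribution (a narrow cluster plus 0.5% of outliers two to three orders of magnitude slower) is invented for the example.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def drop_slowest(samples, fraction=0.01):
    """Return the samples in ascending order with the slowest
    `fraction` of them removed."""
    ordered = sorted(samples)
    dropped = round(fraction * len(ordered))
    return ordered[:len(ordered) - dropped]


# Synthetic latencies in microseconds (invented for this sketch).
rng = random.Random(0)
latencies = [rng.gauss(1000, 50) for _ in range(9950)]
latencies += [rng.uniform(1e5, 1e6) for _ in range(50)]

filtered = drop_slowest(latencies, 0.01)
print("raw max:      ", max(latencies))
print("filtered max: ", max(filtered))
print("99th pct:     ", percentile(latencies, 99))
```

With these definitions, the maximum of the filtered samples coincides with the nearest-rank 99th percentile of the raw samples, which is why tracking that percentile is equivalent to bounding the latency after discarding the slowest 1% of invocations.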
Data sets released: FT traces
[COMNET 2013] T. Dumitraș and P. Narasimhan, “A study of unpredictability in fault-tolerant middleware,” Computer Networks, vol. 57, no. 3, pp. 682–698, Feb. 2013.
[ADS III] T. Dumitraș, D. Srivastava, and P. Narasimhan, “Architecting and Implementing Versatile Dependability,” in Architecting Dependable Systems III, R. de Lemos, C. Gacek, and A. Romanovsky, Eds. Springer-Verlag, LNCS 3549, 2005, pp. 212–231.
[Middleware 2005] T. Dumitraș and P. Narasimhan, “Fault-Tolerant Middleware and the Magical 1%,” in ACM/IEEE/IFIP Middleware Conference, Grenoble, France, 2005, pp. 431–441.
[CC:PE 2005] P. Narasimhan, T. Dumitraș, A. Paulos, S. Pertet, C. Reverte, J. Slember, and D. Srivastava, “MEAD: Support for Real-time, fault-tolerant CORBA,” Concurrency and Computation: Practice and Experience, vol. 17, pp. 1527–1545, 2005.