1. Data-driven detection and diagnosis of system-level failures in middleware-based service compositions
- Author
- Wassermann, B.
- Subjects
- 004
- Abstract
Service-oriented technologies have simplified the development of large, complex software systems that span administrative boundaries. Developers have been enabled to build applications as compositions of services through middleware that hides much of the underlying complexity. The resulting applications inhabit complex, multi-tier operating environments that pose many challenges to their reliable operation and often lead to failures at runtime. Two key aspects of the time to repair a failure are the time to its detection and to the diagnosis of its cause. The prevalent approach to detection and diagnosis is primarily based on ad-hoc monitoring as well as operator experience and intuition. This is inefficient and leads to decreased availability. We propose an approach to data-driven detection and diagnosis in order to decrease the repair time of failures in middleware-based service compositions. Data-driven diagnosis supports system operators with information about the operation and structure of a service composition. We discuss how middleware-based service compositions can be monitored in a comprehensive, yet non-intrusive manner and present a process to discover system structure by processing deployment information that is commonly reified in such systems. We perform a controlled experiment that compares the performance of 22 participants using either a standard or the data-driven approach to diagnose several failures injected into a real-world service composition. We find that system operators using the latter approach are able to achieve significantly higher success rates and lower diagnosis times. Data-driven detection is based on the automation of failure detection through applying an outlier detection technique to multi-variate monitoring data. We evaluate the effectiveness of one-class classification for this purpose and determine a simple approach to select subsets of metrics that afford highly accurate failure detection.
- Published
- 2012