SPLASH 2017
Sun 22 - Fri 27 October 2017 Vancouver, Canada
Sun 22 Oct 2017 11:00 - 11:30 at Oxford - Session 2 Chair(s): Jurgen Vinju

Diagnosing and correcting failures in complex distributed systems is difficult. In a network of perhaps dozens of nodes, each executing dozens of interacting applications, sometimes from different suppliers or vendors, finding the source of a system failure is confusing and often tedious detective work. The person assigned this task must trace the failing command, event, or operation and find where it deviates from the correct, desired interaction sequence. Once the deviation is identified, the failing applications must be found and the fault or faults traced to the incorrect source code. Often the primary source for tracing failures is the set of event log files generated by the applications on each node. The event logs from several platforms, and from multiple virtual machines on those platforms, must be filtered, merged, correlated, and examined by a human expert. The expert must locate the point of failure within the logs, deduce which interaction or component failed, and reassign the problem to the people responsible for the failing components. Those individuals must then, in turn, filter and merge the original logs using different criteria to find the failing code modules, analyze the cause of the failure, and correct the code or even the architecture of the failing components. The goal of this project is to reduce the human effort involved in diagnosing these test failures through automated mining of the data in the logs. In this paper we explore the automatic generation of grammars from successful log sequences and the use of those grammars to identify log entries indicative of failed tests. Such grammars can also be used to mine performance data from the logs and to mine internal operational data that is not normally visible in the user interface or system output.
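
To make the idea of learning from successful log sequences concrete, the following is a minimal sketch and not the method described in the paper. It assumes each log has already been reduced to an ordered list of event names, and it stands in for a grammar with the simplest possible model: the set of event-to-event transitions observed in passing runs. The event names (`connect`, `auth_ok`, `timeout`, and so on) and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Sentinel symbols marking the beginning and end of a run (illustrative names).
START, END = "<START>", "<END>"


def learn_transitions(successful_runs):
    """Collect every event-to-event transition seen in passing runs."""
    allowed = defaultdict(set)
    for run in successful_runs:
        previous = START
        for event in run:
            allowed[previous].add(event)
            previous = event
        allowed[previous].add(END)
    return allowed


def first_deviation(allowed, run):
    """Return (index, previous_event, event) for the first transition
    never seen in a passing run, or None if the whole run is accepted."""
    previous = START
    for index, event in enumerate(run):
        if event not in allowed.get(previous, set()):
            return index, previous, event
        previous = event
    # The run may also stop at a point where passing runs never ended.
    if END not in allowed.get(previous, set()):
        return len(run), previous, END
    return None


# Hypothetical event sequences extracted from logs of passing tests.
passing = [
    ["connect", "auth_ok", "request", "response", "disconnect"],
    ["connect", "auth_ok", "request", "response", "request", "response", "disconnect"],
]
# Hypothetical event sequence from a failing test.
failing = ["connect", "auth_ok", "request", "timeout", "disconnect"]

grammar = learn_transitions(passing)
print(first_deviation(grammar, failing))  # (3, 'request', 'timeout')
```

A transition set like this accepts only orderings it has already seen, so it tends to over-report deviations when passing runs vary widely. A richer grammar would generalize across repetitions and interleavings, at the cost of a more involved inference step.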