An Auditing Language for Preventing Correlated Failures in the Cloud
Today's cloud services rely extensively on replication to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies can cause correlated failures across an entire replication deployment, defeating the purpose of replication. Existing cloud management and diagnosis tools offer post-failure forensics, but relying on them typically leads to prolonged failure recovery times. In this paper, we propose a novel language framework, named RepAudit, that prevents correlated failure risks before service outages occur by allowing cloud administrators to proactively audit the replication deployments of interest. RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, in which cloud administrators write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL code generator that automatically produces complex RAL programs from easily written specifications. Our evaluation results show that RepAudit can determine the top-20 critical correlated failure root causes in a replication system containing 30,528 devices within 1 minute, which is 400x more efficient in auditing time than state-of-the-art efforts. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate, and efficient correlated failure auditing for cloud-scale replication systems.
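To make the idea of a shared deep dependency concrete, here is a minimal Python sketch. It is not RAL and not RepAudit's auditing algorithm. The function name `risk_groups`, the device names, and the simplifying assumption that a replica fails as soon as any one of its devices fails (no redundant paths) are all illustrative assumptions. Under that assumption, a set of devices is a correlated failure risk if it touches the dependency set of every replica.

```python
from itertools import combinations


def risk_groups(deps, max_size=2):
    """Return minimal device sets whose joint failure disables every replica.

    deps maps each replica name to the set of devices it needs. Assumption:
    a replica fails if ANY one of its devices fails (no redundant paths).
    """
    devices = sorted(set().union(*deps.values()))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(devices, size):
            group = set(combo)
            # Skip supersets of a smaller group that was already reported.
            if any(g <= group for g in found):
                continue
            # A risk group intersects the dependency set of every replica.
            if all(group & needed for needed in deps.values()):
                found.append(group)
    return found


# Hypothetical deployment: two replicas on different servers and
# top-of-rack switches, but behind the same aggregation switch.
deployment = {
    "replica1": {"server1", "tor1", "agg1"},
    "replica2": {"server2", "tor2", "agg1"},
}

single_points = risk_groups(deployment, max_size=1)
pairs_and_singles = risk_groups(deployment, max_size=2)
```

In this toy deployment the only single-device risk group is the shared aggregation switch `agg1`. Brute-force enumeration like this grows combinatorially with the number of devices, so it is only meant to illustrate the property being audited, not how to audit it at cloud scale.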
Fri 27 Oct (displayed time zone: Tijuana, Baja California)
10:30 - 12:00
Start | Duration | Type | Title | Track | Authors
10:30 | 22m | Talk | Project Snowflake: Non-blocking Safe Manual Memory Management for .NET | OOPSLA | Matthew J. Parkinson (Microsoft Research, UK); Dimitrios Vytiniotis (Microsoft Research, Cambridge); Kapil Vaswani (Microsoft Research); Manuel Costa (Microsoft Research); Pantazis Deligiannis (Microsoft Research); Dylan McDermott (University of Cambridge); Jonathan Balkind (Princeton, USA); Aaron Blankstein (Princeton, USA)
10:52 | 22m | Talk | Alpaca: Intermittent Execution without Checkpoints | OOPSLA | Kiwan Maeng (Carnegie Mellon University, USA); Alexei Colin (Carnegie Mellon University); Brandon Lucia (Carnegie Mellon University)
11:15 | 22m | Talk | An Auditing Language for Preventing Correlated Failures in the Cloud | OOPSLA | Ennan Zhai (Yale University, USA); Ruzica Piskac (Yale University); Ronghui Gu (Columbia University, USA); Xun Lao (Yale University, USA); Xi Wang (Yale University, USA)
11:37 | 22m | Talk | Reliable and Automatic Composition of Language Extensions to C | OOPSLA | Ted Kaminski (University of Minnesota); Lucas Kramer (University of Minnesota); Travis Carlson (University of Minnesota, USA); Eric Van Wyk (University of Minnesota, USA)