An Auditing Language for Preventing Correlated Failures in the Cloud (SPLASH 2017 - OOPSLA)

Sun 22 - Fri 27 October 2017, Vancouver, Canada

Who

Ennan Zhai, Ruzica Piskac, Ronghui Gu, Xun Lao, Xi Wang

Track

SPLASH 2017 OOPSLA

When

Fri 27 Oct 2017 11:15 - 11:37 at Regency A - Language Design Chair(s): Gregor Richards

Abstract

Today’s cloud services extensively rely on replication techniques to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies may potentially result in correlated failures across the entire replication deployments, invalidating the efforts. Although existing cloud management and diagnosis tools have been able to offer post-failure forensics, they, nevertheless, typically lead to quite prolonged failure recovery time. In this paper, we propose a novel language framework, named RepAudit, that manages to prevent correlated failure risks before service outages occur, by allowing cloud administrators to proactively audit the replication deployments of interest. In particular, RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, for cloud administrators to write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates the auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL-code generator that can automatically produce complex RAL programs based on easily written specifications. Our evaluation result shows that RepAudit can determine the top-20 critical correlated failure root causes in a replication system containing 30,528 devices within 1 minute, which is 400x more efficient in auditing time than state-of-the-art efforts. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate and efficient correlated failure auditing to the cloud-scale replication systems.
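The kind of check the abstract describes, finding dependencies that seemingly independent replicas share, can be illustrated with a minimal sketch. This is not RAL and not RepAudit's auditing engine. The topology, the device names, and the functions `transitive_dependencies` and `shared_dependencies` are all hypothetical. The sketch also assumes that every listed dependency is required (no redundant paths), so any device shared by all replicas is a potential root cause of a correlated failure.

```python
from typing import Dict, List, Set

# Hypothetical topology: each device maps to the devices it directly depends on.
# Two replica servers sit behind different top-of-rack switches but share
# one aggregation switch, as in the abstract's example.
DEPENDS_ON: Dict[str, List[str]] = {
    "server1": ["tor1"],
    "server2": ["tor2"],
    "tor1": ["agg1"],
    "tor2": ["agg1"],
    "agg1": ["core1"],
    "core1": [],
}


def transitive_dependencies(device: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every device reachable from `device` via dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(device, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def shared_dependencies(replicas: List[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Return the devices that all given replicas depend on."""
    dependency_sets = [transitive_dependencies(r, graph) for r in replicas]
    return set.intersection(*dependency_sets) if dependency_sets else set()


if __name__ == "__main__":
    # Prints ['agg1', 'core1']: both replicas depend on the same aggregation
    # switch and core router, even though their rack switches differ.
    print(sorted(shared_dependencies(["server1", "server2"], DEPENDS_ON)))
```

A real deployment has redundant paths and software dependencies as well, so a plain set intersection over-reports risk there. The sketch only shows why replicas that look independent at the rack level can still fail together.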

DOI

https://doi.org/10.1145/3133921

Ennan Zhai

Yale University, USA

Ruzica Piskac

Yale University

Ronghui Gu

Columbia University, USA

Xun Lao

Yale University, USA

Xi Wang

Yale University, USA

Session Program

Fri 27 Oct
Displayed time zone: (GMT-07:00) Tijuana, Baja California

10:30 - 12:00	Language Design (OOPSLA) at Regency A. Chair(s): Gregor Richards (University of Waterloo)

10:30	22m	Talk	Project Snowflake: Non-blocking Safe Manual Memory Management for .NET (OOPSLA). Matthew J. Parkinson (Microsoft Research, UK), Dimitrios Vytiniotis (Microsoft Research, Cambridge), Kapil Vaswani (Microsoft Research), Manuel Costa (Microsoft Research), Pantazis Deligiannis (Microsoft Research), Dylan McDermott (University of Cambridge), Jonathan Balkind (Princeton, USA), Aaron Blankstein (Princeton, USA)
10:52	22m	Talk	Alpaca: Intermittent Execution without Checkpoints (OOPSLA). Kiwan Maeng (Carnegie Mellon University, USA), Alexei Colin (Carnegie Mellon University), Brandon Lucia (Carnegie Mellon University)
11:15	22m	Talk	An Auditing Language for Preventing Correlated Failures in the Cloud (OOPSLA). Ennan Zhai (Yale University, USA), Ruzica Piskac (Yale University), Ronghui Gu (Columbia University, USA), Xun Lao (Yale University, USA), Xi Wang (Yale University, USA)
11:37	22m	Talk	Reliable and Automatic Composition of Language Extensions to C (OOPSLA). Ted Kaminski (University of Minnesota), Lucas Kramer (University of Minnesota), Travis Carlson (University of Minnesota, USA), Eric Van Wyk (University of Minnesota, USA)