Chaos engineering

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.[1]

Concept

In software development, a system's ability to tolerate failures while still ensuring adequate quality of service, often generalized as resilience, is typically specified as a requirement. However, development teams often fail to meet this requirement because of factors such as short deadlines or a lack of knowledge of the field. Chaos engineering is a technique for meeting the resilience requirement.

Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.

History

While overseeing Netflix's migration to the cloud in 2011,[2][3] Greg Orzell had the idea to address the lack of adequate resilience testing by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."[4]

By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
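The idea can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not Netflix's implementation: the `Instance` and `Service` classes, the server names, and the retry-on-next-replica policy are all hypothetical. It disables randomly chosen replicas one at a time and checks after each kill that a request is still served.

```python
import random


class Instance:
    """One replica of a service; hypothetical stand-in for a real server."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """A redundant service that retries a request on other replicas."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        # Try every replica in turn and return the first successful response.
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance left")


def kill_random_instance(service, rng):
    """Disable one randomly chosen live instance and return it (or None)."""
    alive = [i for i in service.instances if i.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


def run_experiment(replicas=3, kills=2, seed=0):
    """Kill `kills` instances one by one; report whether every request succeeded."""
    rng = random.Random(seed)
    service = Service(Instance(f"server-{n}") for n in range(replicas))
    for _ in range(kills):
        kill_random_instance(service, rng)
        try:
            service.handle("GET /")
        except RuntimeError:
            return False
    return True
```

With three replicas and two kills, one replica is always left to answer, so `run_experiment(3, 2)` reports success; with two replicas and two kills the last request fails, which is the kind of weakness such an experiment is meant to expose.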

The concept of chaos engineering is closely related to that of Phoenix Servers, first introduced by Martin Fowler in 2012.[5]

Perturbation models

A chaos engineering tool implements a perturbation model. The perturbations, also called turbulences, are meant to mimic rare or catastrophic events that can happen in production. To maximize the added value of chaos engineering, the perturbations are expected to be realistic.[6]

Server shutdowns
Randomly shuts down servers. Netflix's Chaos Monkey is an implementation of this perturbation model.
Latency injection
Introduces communication delays to simulate degradation or outages in a network. For example, Chaos Mesh supports latency injection.
Resource exhaustion
Consumes a given resource until it runs out. For instance, Gremlin can fill up a disk.
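As an illustration of the latency injection model, the following sketch wraps a function so that calls are delayed before they run. It does not reflect how Chaos Mesh or any other tool works internally; `with_latency`, `call_with_deadline`, and `fetch_profile` are hypothetical names chosen for this example, and the sleep and clock functions are parameters so the perturbation can be exercised without real waiting.

```python
import random
import time


def with_latency(func, delay_s, probability=1.0, rng=None, sleep=time.sleep):
    """Wrap `func` so a call is delayed by `delay_s` seconds with the given probability."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            sleep(delay_s)  # the injected perturbation
        return func(*args, **kwargs)

    return wrapper


def call_with_deadline(func, deadline_s, clock=time.monotonic):
    """Call `func` and report whether it finished within `deadline_s` seconds."""
    start = clock()
    result = func()
    elapsed = clock() - start
    return result, elapsed <= deadline_s


def fetch_profile():
    """Hypothetical downstream call used only for the demonstration."""
    return {"user": "alice"}
```

Calling `call_with_deadline(with_latency(fetch_profile, delay_s=0.2), deadline_s=0.1)` returns the profile together with a flag showing that the deadline was missed, which is the observation a latency experiment looks for.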

Chaos engineering tools

Chaos Monkey

The logo for Chaos Monkey used by Netflix

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.[2] It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.[7][8]

The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:[9]

Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.

Simian Army

The Simian Army[8][10] is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:[11]

At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region".[12] Though rare, the loss of an entire region does happen, and Chaos Kong simulates a system's response to, and recovery from, this type of event.

Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region).[13]
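The effect of losing a whole zone can be reasoned about with a small capacity check. The sketch below is not part of the Simian Army; the `placement` mapping, the zone names, and the `min_instances` threshold are assumptions made for illustration. It reports which zones, if lost, would leave too few instances running.

```python
def surviving_capacity(placement, failed_zone):
    """Count the instances that remain when every instance in `failed_zone` is lost."""
    return sum(1 for zone in placement.values() if zone != failed_zone)


def zones_that_break_service(placement, min_instances):
    """Return the zones whose loss leaves fewer than `min_instances` running."""
    zones = sorted(set(placement.values()))
    return [z for z in zones if surviving_capacity(placement, z) < min_instances]


# Hypothetical placements mapping instance name -> zone name.
balanced = {
    "a1": "zone-a", "a2": "zone-a",
    "b1": "zone-b", "b2": "zone-b",
    "c1": "zone-c", "c2": "zone-c",
}
skewed = {
    "a1": "zone-a", "a2": "zone-a", "a3": "zone-a", "a4": "zone-a",
    "b1": "zone-b",
}
```

For the balanced placement, losing any one zone still leaves four instances, so a threshold of four is met everywhere. For the skewed placement, losing `zone-a` leaves a single instance, so `zones_that_break_service(skewed, 2)` flags that zone.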

Chaos Machine

ChaosMachine[14] is a tool that performs chaos engineering at the application level in the JVM. It concentrates on analyzing the error-handling capability of each try-catch block involved in the application by injecting exceptions.
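The general technique of exception injection can be sketched in Python as an analogy. This is not ChaosMachine's implementation, which targets the JVM; `InjectedFault`, `PriceStore`, `Catalog`, and `handler_is_resilient` are hypothetical names. The sketch uses the standard library's `unittest.mock.patch.object` to make a dependency raise, then observes whether the surrounding try/except block absorbs the failure.

```python
from unittest import mock


class InjectedFault(Exception):
    """Marker exception raised only by the injection harness."""


class PriceStore:
    """Hypothetical dependency whose failure we want to simulate."""

    def load(self, item):
        return {"book": 12.5}[item]


class Catalog:
    """Application code containing the try/except block under test."""

    def __init__(self, store):
        self.store = store

    def price(self, item):
        try:
            return self.store.load(item)
        except Exception:
            # Fallback path: the handler whose resilience we want to observe.
            return None


def handler_is_resilient(call, target, method_name):
    """Inject an exception into `target.method_name` and report whether `call` survives."""
    with mock.patch.object(target, method_name, side_effect=InjectedFault("injected")):
        try:
            call()
        except InjectedFault:
            return False  # the fault escaped: no handler absorbed it
    return True
```

Here `Catalog.price` survives the injected fault because its handler returns a fallback value, whereas calling `PriceStore.load` directly lets the exception escape. The patch is removed when the `with` block exits, so the original method is restored after the experiment.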

Proofdock Chaos Engineering Platform

Proofdock is a chaos engineering platform that focuses on the Microsoft Azure platform and the Azure DevOps services. Users can inject failures at the infrastructure, platform, and application levels.[15]

Gremlin

Gremlin is a "failure-as-a-service" platform.[16]

Facebook Storm

To prepare for the loss of a data center, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures.[17]

Days of Chaos

Voyages-sncf.com created a "Day of Chaos"[18] in 2017, gamifying the simulation of pre-production failures.[19] They presented their results at the 2017 DevOps REX conference.[20]

Notes and references


  1. Lua error in package.lua at line 80: module 'strict' not found.
  2. 2.0 2.1 Lua error in package.lua at line 80: module 'strict' not found.
  3. US 20120072571, Orzell, Gregory S. & Yury Izrailevsky, "Validating the resiliency of networked applications", published 2012-03-22 
  4. Lua error in package.lua at line 80: module 'strict' not found.
  5. Lua error in package.lua at line 80: module 'strict' not found.
  6. Lua error in package.lua at line 80: module 'strict' not found.
  7. Lua error in package.lua at line 80: module 'strict' not found.
  8. 8.0 8.1 Lua error in package.lua at line 80: module 'strict' not found.
  9. Lua error in package.lua at line 80: module 'strict' not found.
  10. Lua error in package.lua at line 80: module 'strict' not found.
  11. Lua error in package.lua at line 80: module 'strict' not found.
  12. Lua error in package.lua at line 80: module 'strict' not found.
  13. Lua error in package.lua at line 80: module 'strict' not found.
  14. Lua error in package.lua at line 80: module 'strict' not found.
  15. Lua error in package.lua at line 80: module 'strict' not found.
  16. Lua error in package.lua at line 80: module 'strict' not found.
  17. Lua error in package.lua at line 80: module 'strict' not found.
  18. Lua error in package.lua at line 80: module 'strict' not found.
  19. Lua error in package.lua at line 80: module 'strict' not found.
  20. Lua error in package.lua at line 80: module 'strict' not found.