Resilience Design Patterns explained
May 19, 2020 2022-10-27 15:44
When it comes to resilience in software design, the main goal is to build robust components that can tolerate faults within their own scope as well as failures of the components they depend on. Resiliency is the capability to handle partial failures while continuing to execute rather than crashing. How do you create resilient software? Let’s start with four patterns from the latency control category: retry, fallback, timeout, and circuit breaker.
Resilience Design Patterns
- Retry
Whenever we assume that an unexpected response, or no response at all, can be fixed by sending the request again, the retry pattern can help. It is a very simple pattern: a failed request is retried a configurable number of times before the operation is marked as failed. Retries can be an effective way to handle transient failures in cross-component communication. They are useful in cases such as temporary network problems (for example, packet loss), internal errors of the target service (for example, caused by a database outage), and missing or slow responses due to a large number of requests hitting the target service. In that last case retries add load of their own, so they are usually capped and spaced out.
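The retry behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `TransientError`, `retry`, and the parameters `max_attempts` and `base_delay` are hypothetical names, and the exponential backoff between attempts is an added assumption that the pattern itself does not require.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Attempts exhausted: the operation is marked as failed.
                raise
            # Assumed policy: wait base_delay, then 2x, 4x, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only `TransientError` is retried here; any other exception propagates immediately, on the assumption that retrying a permanent error (for example, a malformed request) would not help.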
- Fallback
The fallback pattern enables your service to continue the execution in case of a failed request to another service. Instead of aborting the computation because of a missing response, you fill in a fallback value.
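As a rough sketch of the idea, the helper below returns a default value when the call to the other service fails. The names `with_fallback` and `fetch_recommendations` are hypothetical and exist only for illustration; the failing service is simulated with a `ConnectionError`.

```python
def with_fallback(operation, fallback_value, handled=(Exception,)):
    """Return operation()'s result, or fallback_value if it raises."""
    try:
        return operation()
    except handled:
        # Fill in the fallback instead of aborting the computation.
        return fallback_value


def fetch_recommendations(user_id):
    """Hypothetical remote call; here it always fails to simulate an outage."""
    raise ConnectionError("recommendation service unavailable")


# The page can still be rendered, just without personalised recommendations.
recommendations = with_fallback(lambda: fetch_recommendations(42), fallback_value=[])
```

Whether an empty list, a cached value, or a static default is an acceptable fallback depends on the feature; the sketch only shows the control flow.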
- Timeout
The timeout pattern is pretty straightforward, and many HTTP clients have a default timeout configured. The goal is to avoid unbounded waiting for responses by treating every request as failed if no response is received within the timeout. Timeouts are used in almost every application to keep requests from getting stuck forever.
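For calls that do not offer a timeout setting of their own, the same effect can be approximated by running the call in a worker thread and bounding how long the caller waits. The sketch below uses Python's standard `concurrent.futures` module; `call_with_timeout` is a hypothetical helper name. Note the stated limitation: the worker thread is not interrupted, the caller merely stops waiting for it.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(operation, timeout_seconds):
    """Run operation() and give up waiting after timeout_seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # No response within the timeout: treat the request as failed.
        raise TimeoutError(f"no response within {timeout_seconds}s") from None
    finally:
        # Do not block on the worker; it finishes (or hangs) in the background.
        executor.shutdown(wait=False)
```

When the client library exposes its own timeout parameter, using that is usually preferable, because it can release the underlying connection instead of leaving a thread running.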
- Circuit breaker
In software, a circuit breaker protects a service from being flooded with requests while it is already partly unavailable due to high load. It can be implemented as a stateful software component that switches between three states: closed (requests can flow freely), open (requests are rejected without being submitted to the remote resource), and half-open (one probe request is allowed to decide whether to close the circuit again). Circuit breakers are a useful tool, especially when combined with retries, timeouts and fallbacks. The software circuit breaker prevents a partial failure from becoming a catastrophic outage. It detects the first few failed calls and flips into a state where outbound requests are refused immediately, without even attempting to call the provider. This means error responses are delivered quickly rather than after a timeout. The caller’s threads are thus preserved and available for other requests.
For example, the circuit breaker pattern has been adopted by Netflix and has become established as a central part of resilient software design. According to Netflix, this lets the system fail in a safe way.
A resilient system automatically cuts off failing components and reintegrates them once they are no longer failing.
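The three states of a circuit breaker (closed, open, half-open) can be sketched as a small state machine. This is an illustrative sketch under stated assumptions, not the implementation used by Netflix or any particular library: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, failures are counted cumulatively until a success, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the remote resource."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting for a timeout.
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # let one probe request through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many failures, (re)opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # A successful call closes the circuit and resets the counter.
        self.state = "closed"
        self.failures = 0
```

A caller would typically wrap each outbound request in `breaker.call(...)` and treat `CircuitOpenError` like any other failure, for instance by returning a fallback value.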
Resilience is all about embracing the chaos of the real world, because you cannot control it, and translating this way of thinking into software architectures. Resilience design patterns are suitable for industries where demand is constant and uncontrolled and where individual transactions can be sacrificed without catastrophic losses. This describes most web systems, especially any in the commerce, media, or social sphere.
Resilient systems embrace the idea that failures are normal. When dealing with large-scale systems, probabilities are such that 100% operational excellence is near impossible to achieve. Therefore, the normal state of operation is partial failure. While not suitable for life-critical applications, running in partially failing mode is a viable option for most web applications, from e-commerce services like Amazon.com to video-on-demand sites such as Netflix.
Microservices can help make a system more resilient, depending on how you decompose your system into services and how you build each service. The question is not how large a service is, but how large the “failure domains” are. There are two dimensions to consider. The first is which services are needed for any given feature: given a particular request, which services must be involved to fully deliver it? The second is whether individual services have an easy way to handle failure in their dependencies. Resilience should be built into every service, preferably with a common framework so that monitoring and administration are simplified.
To conclude, the art of managing systems at scale lies in embracing failure and being at the edge – pushing the limits of your system and software performance almost to breaking point, yet still being able to recover. That’s what resiliency is all about.