Resilience Patterns

Fault-Tolerant Patterns

Build resilient distributed systems that gracefully handle failures. Master Circuit Breakers, Retries, Bulkheads, Timeouts, and Fallbacks.

What is Fault Tolerance?

The ability of a system to continue operating correctly even when components fail.

In distributed systems, failures are inevitable. Networks partition, services crash, databases become unavailable, and dependencies time out. A fault-tolerant system anticipates these failures and handles them gracefully instead of cascading them to users.

Fault-tolerant patterns are proven strategies for building resilient systems. They help you:

  • Prevent cascading failures (one service failure bringing down the entire system)
  • Isolate failures to specific components (blast radius containment)
  • Recover automatically from transient failures
  • Fail gracefully when recovery isn't possible (degraded mode vs total outage)

Key Principle: Assume everything will fail. Design for failure, not for success.

Circuit Breaker Pattern

The Circuit Breaker is one of the most important patterns for preventing cascading failures. It acts like an electrical circuit breaker: it "trips" (opens) when a service is failing, blocking further calls until the service recovers.

How It Works

The Circuit Breaker monitors calls to a dependency and tracks failures. It has three states:

CLOSED → Requests flow normally
↓ (failure threshold exceeded)
OPEN → Requests fail immediately (no calls to dependency)
↓ (timeout expires)
HALF-OPEN → Allow limited test requests
↓ (success) → Back to CLOSED
↓ (failure) → Back to OPEN

🟢 CLOSED

Normal operation. Requests pass through. Failures are counted.

🔴 OPEN

Circuit is "tripped". All requests fail fast without hitting the dependency.

🟡 HALF-OPEN

Testing recovery. A few requests are allowed to check if the service is healthy.

Configuration Parameters

  • Failure Threshold: Number/percentage of failures before opening (e.g., 50% failures in 10s)
  • Timeout: How long the circuit stays open before testing recovery (e.g., 30 seconds)
  • Success Threshold: Number of successful tests needed to close the circuit (e.g., 3 successes)
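
A rate-based threshold such as "50% failures in 10s" needs a rolling window rather than a simple counter. The following is a minimal sketch of one way to track it. The `FailureRateWindow` helper, its injectable clock, and the `min_calls` guard are assumptions made for illustration and are not part of the parameters listed above:

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over a rolling time window (hypothetical helper)."""

    def __init__(self, window_seconds=10.0, threshold=0.5, min_calls=4, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold      # e.g. 0.5 means "50% of calls failed"
        self.min_calls = min_calls      # assumption: ignore windows with too few samples
        self.clock = clock              # injectable so the example is deterministic
        self._events = deque()          # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        self._events.append((self.clock(), failed))

    def _evict_expired(self):
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_open(self):
        """True when the failure rate within the window reaches the threshold."""
        self._evict_expired()
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total >= self.threshold


# Usage with a fake clock so the behaviour is deterministic
now = [0.0]
window = FailureRateWindow(window_seconds=10, threshold=0.5, min_calls=4, clock=lambda: now[0])
for failed in (True, False, True, False):
    window.record(failed)
print(window.should_open())   # True: 2 of 4 calls failed, so the threshold is reached
now[0] = 11.0                 # every recorded event is now older than the window
print(window.should_open())   # False: the window is empty again
```

The `min_calls` guard keeps a single failed call on a quiet service from counting as a 100% failure rate.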

Python Implementation

import time
from enum import Enum
from datetime import datetime, timedelta

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds
        self.success_threshold = success_threshold
        
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        """Execute a function through the circuit breaker"""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise  # re-raise with the original traceback
    
    def _on_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.success_count = 0
    
    def _on_failure(self):
        self.failure_count += 1
        # Monotonic clock: unaffected by wall-clock adjustments (NTP, DST)
        self.last_failure_time = time.monotonic()
        
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
            self.success_count = 0
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def _should_attempt_reset(self):
        # True once the open-state timeout has elapsed since the last failure
        return (self.last_failure_time is not None and
                time.monotonic() - self.last_failure_time > self.timeout)

# Usage Example
def unreliable_api_call():
    """Simulates an API that sometimes fails"""
    import random
    if random.random() < 0.3:  # 30% failure rate
        raise Exception("API error")
    return {"status": "success"}

cb = CircuitBreaker(failure_threshold=3, timeout=10)

for i in range(10):
    try:
        result = cb.call(unreliable_api_call)
        print(f"Call {i+1}: {result} | State: {cb.state.value}")
    except Exception as e:
        print(f"Call {i+1}: FAILED | State: {cb.state.value}")
    time.sleep(1)

When to Use: Protect your system from slow or failing dependencies (external APIs, databases, microservices). Prevents cascading failures and resource exhaustion.

Retry Pattern with Exponential Backoff

Automatically retry failed operations, with increasing delays between attempts to avoid overwhelming the failing service.

Why Exponential Backoff?

If 1000 clients all retry immediately when a service fails, they create a "thundering herd" that can prevent the service from recovering. Exponential backoff spaces out retries:

  • 1st retry: wait 1 second
  • 2nd retry: wait 2 seconds
  • 3rd retry: wait 4 seconds
  • 4th retry: wait 8 seconds
  • And so on (with a maximum cap)

Adding Jitter

Jitter (random delay) further reduces synchronized retries. Instead of waiting exactly 2s, wait 1.5-2.5s (randomized).
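
The schedule and the jitter range described above can be computed directly. This is a small sketch, assuming a hypothetical `backoff_delay` helper and a 30-second cap chosen only for illustration:

```python
import random


def backoff_delay(retry_number, base_delay=1.0, max_delay=30.0, jitter_fraction=0.0, rng=random):
    """Delay before the given retry (1-based): base_delay * 2^(retry_number - 1), capped."""
    delay = min(base_delay * 2 ** (retry_number - 1), max_delay)
    if jitter_fraction:
        # Spread the delay uniformly within +/- jitter_fraction of its nominal value
        delay *= 1 + rng.uniform(-jitter_fraction, jitter_fraction)
    return delay


# Without jitter: the schedule from the list above, then the cap takes over
print([backoff_delay(n) for n in range(1, 8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# With 25% jitter the 2nd retry lands somewhere between 1.5s and 2.5s
print(backoff_delay(2, jitter_fraction=0.25))
```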

Python Implementation

import time
import random
from functools import wraps

def retry_with_backoff(max_retries=5, base_delay=1, max_delay=60, jitter=True):
    """
    Decorator for retrying a function with exponential backoff.
    
    Args:
        max_retries: Maximum number of attempts in total (the first call included)
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap in seconds
        jitter: Add random jitter to prevent thundering herd
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    retries += 1
                    if retries >= max_retries:
                        print(f"❌ Max retries ({max_retries}) exceeded")
                        raise  # re-raise the last error with its original traceback
                    
                    # Calculate exponential backoff: base_delay * 2^(retries - 1), capped at max_delay
                    delay = min(base_delay * (2 ** (retries - 1)), max_delay)
                    
                    # Add jitter (randomize ±25%)
                    if jitter:
                        delay = delay * (0.75 + random.random() * 0.5)
                    
                    print(f"⚠️  Retry {retries}/{max_retries} after {delay:.2f}s: {e}")
                    time.sleep(delay)
            
        return wrapper
    return decorator

# Usage Example
@retry_with_backoff(max_retries=4, base_delay=1, jitter=True)
def flaky_database_query():
    """Simulates a database query that sometimes fails"""
    import random
    if random.random() < 0.6:  # 60% failure rate
        raise Exception("Database connection timeout")
    return {"data": "query result"}

# Test the retry logic
try:
    result = flaky_database_query()
    print(f"✅ Success: {result}")
except Exception as e:
    print(f"❌ Final failure: {e}")

⚠️ Idempotency Required: Only retry operations that are safe to execute multiple times (idempotent). Charging a credit card is NOT idempotent: a retried request can charge the customer twice. Use idempotency keys for such operations.
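
The following is a minimal sketch of the idempotency-key idea. It assumes a hypothetical in-memory `PaymentService`; a real system would keep the key-to-result mapping in durable, shared storage:

```python
import uuid


class PaymentService:
    """Hypothetical server-side handler that deduplicates on an idempotency key."""

    def __init__(self):
        self._results = {}       # idempotency key -> stored result
        self.charges_made = 0    # counts real charges, for demonstration only

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored result instead of charging again
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())   # the client generates one key per logical operation

first = service.charge(key, 1999)
retry = service.charge(key, 1999)   # e.g. the first response was lost, so the client retried

print(first == retry)         # True: the same result is returned
print(service.charges_made)   # 1: the card was charged once
```

The client must reuse the same key for every retry of one logical operation and generate a fresh key for each new operation.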

Bulkhead Pattern

Isolate resources to prevent a failure in one area from exhausting resources needed by other areas. The name comes from ship bulkheads—watertight compartments that prevent the entire ship from sinking if one compartment floods.

Resource Isolation Strategies

Thread Pool Isolation

Assign separate thread pools to different dependencies. If one pool is exhausted (dependency is slow), other services remain unaffected.

Example: Payment API uses 10 threads, User API uses 10 threads.

Connection Pool Isolation

Separate database connection pools for read-heavy vs write-heavy operations.

Example: 50 connections for writes, 200 for reads.

Python Example: Thread Pool Bulkhead

from concurrent.futures import ThreadPoolExecutor
import time

# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="payment")
user_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="user")

def slow_payment_api():
    """Simulates a slow payment service"""
    time.sleep(5)  # Takes 5 seconds
    return "Payment processed"

def fast_user_api():
    """Simulates a fast user service"""
    time.sleep(0.1)  # Takes 100ms
    return "User data fetched"

# Submit 10 slow payment tasks
for i in range(10):
    payment_pool.submit(slow_payment_api)

# User API remains fast despite payment API being overwhelmed
start = time.time()
future = user_pool.submit(fast_user_api)
result = future.result()
elapsed = time.time() - start

print(f"✅ User API response: {result} in {elapsed:.2f}s")
# Example output (timing is approximate):
# ✅ User API response: User data fetched in 0.10s
# (Payment API slowness doesn't affect User API)

When to Use: When you have multiple dependencies with different performance characteristics. Prevents a slow dependency from starving resources needed by fast dependencies.
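
Thread pools are not the only way to build a bulkhead. A lighter variation (a different technique from the thread-pool approach shown above) caps the number of concurrent calls with a semaphore and rejects the excess immediately. This is a sketch, with `Bulkhead` and `BulkheadFullError` as hypothetical names:

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


payment_bulkhead = Bulkhead(max_concurrent=1)


def nested_call():
    # A second call made while the only slot is in use is rejected
    try:
        return payment_bulkhead.call(lambda: "inner ok")
    except BulkheadFullError:
        return "inner rejected"


print(payment_bulkhead.call(nested_call))     # inner rejected
print(payment_bulkhead.call(lambda: "ok"))    # ok (the slot was released)
```

Rejecting instead of queueing keeps callers from piling up behind a slow dependency, at the cost of surfacing an error that the caller has to handle.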

Timeout Pattern

Set maximum wait times for operations to prevent infinite hangs. A well-configured timeout prevents slow dependencies from tying up resources indefinitely.

Timeout Guidelines

  • Connect Timeout: Max time to establish a connection (typically 1-5 seconds)
  • Read Timeout: Max time to receive a response after connection (5-30 seconds depending on operation)
  • Total Timeout: Overall time budget for the entire operation (connect + read + retries)
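
A total timeout is easiest to enforce as a deadline that every attempt draws from. This is a sketch, assuming a hypothetical `Deadline` helper with an injectable clock:

```python
import time


class Deadline:
    """Tracks an overall time budget for an operation (hypothetical helper)."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_attempt(self, per_attempt_cap):
        """Timeout for the next attempt: the cap, or whatever budget is left if smaller."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("total time budget exhausted")
        return min(per_attempt_cap, remaining)


# Simulated clock so the example runs instantly
now = [0.0]
deadline = Deadline(budget_seconds=12, clock=lambda: now[0])

print(deadline.timeout_for_attempt(per_attempt_cap=5))   # 5: plenty of budget left
now[0] = 9.0                                             # 9s spent on earlier attempts and backoff
print(deadline.timeout_for_attempt(per_attempt_cap=5))   # 3.0: only 3s of budget remain
```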

Python Example: HTTP Timeouts

import requests
from requests.exceptions import Timeout

def call_api_with_timeout():
    """Call external API with proper timeouts"""
    try:
        # Tuple (connect_timeout, read_timeout)
        response = requests.get(
            "https://api.example.com/data",
            timeout=(3, 10)  # 3s to connect, 10s to read
        )
        return response.json()
    except Timeout as e:
        print(f"⏱️ Timeout error: {e}")
        # Fallback to cached data or return error
        return {"error": "Service temporarily unavailable"}

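# A single value applies to the connect and read phases separately; it is not a total deadline
response = requests.get("https://api.example.com", timeout=5)  # 5s to connect, 5s to read

⚠️ Cascading Timeouts: Set timeouts across the full call chain. If Service A calls Service B, which calls Service C, make the timeouts decrease down the chain (A: 10s, B: 7s, C: 4s). Each downstream call then gives up before its caller does, so no service keeps working on a request that its caller has already abandoned.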
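
One way to keep timeouts decreasing down the chain is to pass the remaining budget along with each call and have every hop keep a small margin for its own work. This is a sketch, assuming a hypothetical `downstream_timeout` helper and the 10s/7s/4s figures from the note above:

```python
def downstream_timeout(own_budget, reserve):
    """Timeout to give the next hop: our own budget minus a margin we keep for ourselves."""
    timeout = own_budget - reserve
    if timeout <= 0:
        raise ValueError("not enough budget left to call downstream")
    return timeout


# Service A has 10s in total; each hop keeps 3s for its own processing
budget_a = 10
budget_b = downstream_timeout(budget_a, reserve=3)
budget_c = downstream_timeout(budget_b, reserve=3)
print(budget_a, budget_b, budget_c)   # 10 7 4
```

In practice the remaining budget is usually carried with the request, for example in a header or an RPC deadline, so each service can compute its own limit.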

Fallback Pattern

Provide alternative responses when primary operations fail. Instead of showing an error, degrade gracefully with cached data or simplified functionality.

Fallback Strategies

1. Cache Fallback

Return stale cached data when the primary data source is unavailable.

Example: Show last known product prices if the pricing service is down.

2. Default Value Fallback

Return sensible defaults when computation fails.

Example: Show "Recommendations unavailable" instead of crashing the page.

3. Feature Toggle Fallback

Disable non-critical features during outages.

Example: Disable personalized recommendations, keep core checkout working.
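
The feature toggle strategy can be sketched as follows. `FeatureFlags` and `render_checkout_page` are hypothetical names, and the flag store is in-memory only for the sake of the example:

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; real systems usually load flags from config."""

    def __init__(self, **flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False


flags = FeatureFlags(recommendations=True)


def render_checkout_page(cart_items):
    page = {"cart": cart_items}   # core functionality, always rendered
    if flags.is_enabled("recommendations"):
        # Non-critical extra; the lookup is stubbed out for this sketch
        page["recommendations"] = ["popular_item_1", "popular_item_2"]
    return page


print(render_checkout_page(["book"]))   # includes recommendations
flags.disable("recommendations")        # e.g. flipped by an operator during an outage
print(render_checkout_page(["book"]))   # checkout still works, without recommendations
```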

Python Example: Multi-Level Fallback

import requests
import json

def get_user_recommendations(user_id):
    """Get personalized recommendations with multiple fallback levels"""
    
    # Level 1: Try primary recommendation service
    try:
        response = requests.get(
            f"https://api.example.com/recommendations/{user_id}",
            timeout=2
        )
        if response.status_code == 200:
            return response.json()
    except Exception as e:
        print(f"⚠️ Primary service failed: {e}")
    
    # Level 2: Try cache
    try:
        cache_key = f"recommendations:{user_id}"
        cached = get_from_cache(cache_key)  # Your cache implementation
        if cached:
            print("📦 Returning cached recommendations")
            return json.loads(cached)
    except Exception as e:
        print(f"⚠️ Cache failed: {e}")
    
    # Level 3: Return generic popular items
    print("🔄 Returning default recommendations")
    return {
        "items": ["popular_item_1", "popular_item_2", "popular_item_3"],
        "source": "default"
    }

def get_from_cache(key):
    # Placeholder for cache implementation
    return None

Health Checks & Monitoring

Continuously monitor service health to detect failures early and trigger automated recovery.

Types of Health Checks

Liveness Probe

Is the service running?

If fails: Restart the service.

Readiness Probe

Is the service ready to handle requests?

If fails: Remove from load balancer.

Python Health Check Endpoint

from flask import Flask, jsonify
import psycopg2
import requests

app = Flask(__name__)

@app.route('/health/liveness')
def liveness():
    """Simple check: is the app running?"""
    return jsonify({"status": "ok"}), 200

@app.route('/health/readiness')
def readiness():
    """Deep check: can the app handle requests?"""
    health = {"status": "ok", "checks": {}}
    
    # Check database connection (connect_timeout keeps the probe from hanging)
    try:
        conn = psycopg2.connect("dbname=mydb user=postgres connect_timeout=2")
        conn.close()
        health["checks"]["database"] = "healthy"
    except Exception as e:
        health["status"] = "degraded"
        health["checks"]["database"] = f"unhealthy: {e}"
    
    # Check external API dependency
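    try:
        response = requests.get("https://api.example.com/health", timeout=2)
        if response.ok:
            health["checks"]["external_api"] = "healthy"
        else:
            # An error status (4xx/5xx) must also mark the service as degraded
            health["status"] = "degraded"
            health["checks"]["external_api"] = "unhealthy"
    except Exception:
        health["status"] = "degraded"
        health["checks"]["external_api"] = "unreachable"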
    
    status_code = 200 if health["status"] == "ok" else 503
    return jsonify(health), status_code

Patterns Comparison

Each pattern solves different failure scenarios. Often, you'll combine multiple patterns for defense in depth.

Pattern | Problem Solved | When to Use | Combine With
Circuit Breaker | Prevent cascading failures from slow/failing dependencies | Calling external services, databases, microservices | Fallback, Timeout
Retry | Recover from transient failures automatically | Network blips, temporary service unavailability | Timeout, Exponential Backoff
Bulkhead | Isolate failures to prevent resource exhaustion | Multiple dependencies with different performance profiles | Circuit Breaker
Timeout | Prevent infinite hangs from slow operations | Every external call (API, database, file I/O) | Retry, Fallback
Fallback | Provide alternative when primary operation fails | Non-critical features, user-facing services | Circuit Breaker, Cache
Health Checks | Detect failures early and trigger recovery | All services in production | Load Balancer, Auto-scaling
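
As an illustration of the "Combine With" column, the sketch below layers a bounded retry under a fallback. All names are hypothetical, and the retry delay is set to zero so the example runs instantly:

```python
import time


def call_with_retry_and_fallback(primary, fallback, attempts=3, delay=0.0):
    """Try `primary` up to `attempts` times, then return `fallback()` instead of raising."""
    for attempt in range(1, attempts + 1):
        try:
            return primary()
        except Exception as exc:   # a real caller would catch only transient error types
            print(f"attempt {attempt} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    return fallback()


calls = {"count": 0}


def flaky_service():
    """Fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary outage")
    return {"source": "primary"}


def always_down():
    raise ConnectionError("service unavailable")


def default_response():
    return {"source": "default"}


print(call_with_retry_and_fallback(flaky_service, default_response))   # recovers on the 3rd attempt
print(call_with_retry_and_fallback(always_down, default_response))     # falls back to the default
```

The retry handles the transient case and the fallback handles the persistent one. A circuit breaker or a timeout could wrap `primary` in the same way.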

Summary

  • Circuit Breaker prevents cascading failures by stopping calls to failing services.
  • Retry with Exponential Backoff recovers from transient failures without overwhelming services.
  • Bulkhead isolates resources to prevent one failure from affecting the entire system.
  • Timeout prevents infinite hangs and resource leaks.
  • Fallback provides graceful degradation instead of total failure.
  • Health Checks enable early detection and automated recovery.
  • Combine multiple patterns for defense in depth—no single pattern is sufficient.
  • Always tune parameters (thresholds, timeouts, retries) based on real production metrics.

Golden Rule: Design for failure. Test failure scenarios regularly (chaos engineering). Monitor constantly. Fail fast, recover fast.