What is Fault Tolerance?
The ability of a system to continue operating correctly even when components fail.
In distributed systems, failures are inevitable. Networks partition, services crash, databases become unavailable, and dependencies time out. A fault-tolerant system anticipates these failures and handles them gracefully instead of cascading them to users.
Fault-tolerant patterns are proven strategies for building resilient systems. They help you:
- Prevent cascading failures (one service failure bringing down the entire system)
- Isolate failures to specific components (blast radius containment)
- Recover automatically from transient failures
- Fail gracefully when recovery isn't possible (degraded mode vs total outage)
Circuit Breaker Pattern
The most important pattern for preventing cascading failures. A Circuit Breaker acts like an electrical circuit breaker—it "trips" (opens) when a service is failing, preventing further calls until it recovers.
How It Works
The Circuit Breaker monitors calls to a dependency and tracks failures. It has three states:
CLOSED → Normal operation (requests pass through; failures are counted)
↓ (failure threshold exceeded)
OPEN → Requests fail immediately (no calls to dependency)
↓ (timeout expires)
HALF-OPEN → Allow limited test requests
↓ (success) → Back to CLOSED
↓ (failure) → Back to OPEN
🟢 CLOSED
Normal operation. Requests pass through. Failures are counted.
🔴 OPEN
Circuit is "tripped". All requests fail fast without hitting the dependency.
🟡 HALF-OPEN
Testing recovery. A few requests are allowed to check if the service is healthy.
Configuration Parameters
- Failure Threshold: Count or percentage of failures before opening (e.g., 50% failures in 10s; a sliding-window sketch follows this list)
- Timeout: How long the circuit stays open before testing recovery (e.g., 30 seconds)
- Success Threshold: Number of successful tests needed to close the circuit (e.g., 3 successes)
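The count-based implementation below is the simplest form. A percentage threshold like "50% failures in 10s" instead needs a rolling record of recent outcomes. Here is a minimal sliding-window sketch; the class name and numbers are illustrative, not from any particular library:

import time
from collections import deque

class SlidingWindowTracker:
    """Tracks call outcomes over a rolling time window."""

    def __init__(self, window_seconds=10):
        self.window_seconds = window_seconds
        self.outcomes = deque()  # (timestamp, succeeded) pairs

    def record(self, succeeded):
        self.outcomes.append((time.monotonic(), succeeded))

    def failure_rate(self):
        """Return the failure fraction over the window (0.0 when no calls)."""
        cutoff = time.monotonic() - self.window_seconds
        # Drop outcomes that have aged out of the window
        while self.outcomes and self.outcomes[0][0] < cutoff:
            self.outcomes.popleft()
        if not self.outcomes:
            return 0.0
        failures = sum(1 for _, ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

# e.g., trip the breaker when more than 50% of recent calls failed
tracker = SlidingWindowTracker(window_seconds=10)
tracker.record(False)
tracker.record(True)
should_open = tracker.failure_rate() > 0.5

Production breakers typically also require a minimum number of calls in the window, so a single early failure doesn't read as a 100% failure rate.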
Python Implementation
import time
from enum import Enum
from datetime import datetime, timedelta

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        """Execute a function through the circuit breaker"""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.success_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
            self.success_count = 0
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def _should_attempt_reset(self):
        return (self.last_failure_time and
                datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout))

# Usage Example
def unreliable_api_call():
    """Simulates an API that sometimes fails"""
    import random
    if random.random() < 0.3:  # 30% failure rate
        raise Exception("API error")
    return {"status": "success"}

cb = CircuitBreaker(failure_threshold=3, timeout=10)

for i in range(10):
    try:
        result = cb.call(unreliable_api_call)
        print(f"Call {i+1}: {result} | State: {cb.state.value}")
    except Exception as e:
        print(f"Call {i+1}: FAILED | State: {cb.state.value}")
    time.sleep(1)
Retry Pattern with Exponential Backoff
Automatically retry failed operations, with increasing delays between attempts to avoid overwhelming the failing service.
Why Exponential Backoff?
If 1000 clients all retry immediately when a service fails, they create a "thundering herd" that can prevent the service from recovering. Exponential backoff spaces out retries:
- 1st retry: wait 1 second
- 2nd retry: wait 2 seconds
- 3rd retry: wait 4 seconds
- 4th retry: wait 8 seconds
- And so on (with a maximum cap)
Adding Jitter
Jitter (random delay) further reduces synchronized retries. Instead of waiting exactly 2s, wait 1.5-2.5s (randomized).
Python Implementation
import time
import random
from functools import wraps

def retry_with_backoff(max_retries=5, base_delay=1, max_delay=60, jitter=True):
    """
    Decorator for retrying a function with exponential backoff.

    Args:
        max_retries: Total number of attempts (including the first call)
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap in seconds
        jitter: Add random jitter to prevent thundering herd
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    retries += 1
                    if retries >= max_retries:
                        print(f"❌ Giving up after {max_retries} attempts")
                        raise
                    # Exponential backoff: base_delay * 2^(retries - 1), capped
                    delay = min(base_delay * (2 ** (retries - 1)), max_delay)
                    # Add jitter (randomize ±25%)
                    if jitter:
                        delay = delay * (0.75 + random.random() * 0.5)
                    print(f"⚠️ Retry {retries}/{max_retries} after {delay:.2f}s: {e}")
                    time.sleep(delay)
        return wrapper
    return decorator

# Usage Example
@retry_with_backoff(max_retries=4, base_delay=1, jitter=True)
def flaky_database_query():
    """Simulates a database query that sometimes fails"""
    if random.random() < 0.6:  # 60% failure rate
        raise Exception("Database connection timeout")
    return {"data": "query result"}

# Test the retry logic
try:
    result = flaky_database_query()
    print(f"✅ Success: {result}")
except Exception as e:
    print(f"❌ Final failure: {e}")
Bulkhead Pattern
Isolate resources to prevent a failure in one area from exhausting resources needed by other areas. The name comes from ship bulkheads—watertight compartments that prevent the entire ship from sinking if one compartment floods.
Resource Isolation Strategies
Thread Pool Isolation
Assign separate thread pools to different dependencies. If one pool is exhausted (dependency is slow), other services remain unaffected.
Example: Payment API uses 10 threads, User API uses 10 threads.
Connection Pool Isolation
Separate database connection pools for read-heavy vs write-heavy operations.
Example: 50 connections for writes, 200 for reads.
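As a concrete sketch of connection pool isolation, assuming SQLAlchemy (not used elsewhere in this article); the URLs and pool sizes are illustrative, not recommendations:

from sqlalchemy import create_engine

# Separate engines (and therefore separate connection pools) per workload
write_engine = create_engine(
    "postgresql://app@db-primary/mydb",
    pool_size=50,     # dedicated connections for writes
    max_overflow=0,   # hard cap: writes can't starve reads
)
read_engine = create_engine(
    "postgresql://app@db-replica/mydb",
    pool_size=200,    # larger pool for read-heavy traffic
    max_overflow=0,
)

# A flood of slow writes exhausts only write_engine's pool;
# reads keep flowing through read_engine.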
Python Example: Thread Pool Bulkhead
from concurrent.futures import ThreadPoolExecutor
import time

# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="payment")
user_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="user")

def slow_payment_api():
    """Simulates a slow payment service"""
    time.sleep(5)  # Takes 5 seconds
    return "Payment processed"

def fast_user_api():
    """Simulates a fast user service"""
    time.sleep(0.1)  # Takes 100ms
    return "User data fetched"

# Submit 10 slow payment tasks; they saturate the payment pool
for i in range(10):
    payment_pool.submit(slow_payment_api)

# User API remains fast despite payment API being overwhelmed
start = time.time()
future = user_pool.submit(fast_user_api)
result = future.result()
elapsed = time.time() - start
print(f"✅ User API response: {result} in {elapsed:.2f}s")
# Output: ✅ User API response: User data fetched in 0.10s
# (Payment API slowness doesn't affect User API)
Timeout Pattern
Set maximum wait times for operations to prevent infinite hangs. A well-configured timeout prevents slow dependencies from tying up resources indefinitely.
Timeout Guidelines
- Connect Timeout: Max time to establish a connection (typically 1-5 seconds)
- Read Timeout: Max time to receive a response after connection (5-30 seconds depending on operation)
- Total Timeout: Overall time budget for the entire operation, spanning connect, read, and retries (see the deadline sketch after the HTTP example below)
Python Example: HTTP Timeouts
import requests
from requests.exceptions import Timeout

def call_api_with_timeout():
    """Call external API with proper timeouts"""
    try:
        # Tuple of (connect_timeout, read_timeout)
        response = requests.get(
            "https://api.example.com/data",
            timeout=(3, 10)  # 3s to connect, 10s to read
        )
        return response.json()
    except Timeout as e:
        print(f"⏱️ Timeout error: {e}")
        # Fall back to cached data or return an error
        return {"error": "Service temporarily unavailable"}

# A single value applies to connect and read separately, not in total
response = requests.get("https://api.example.com", timeout=5)  # 5s each phase
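Note that requests has no built-in total-budget option: the single value above bounds each phase, and retries add more time on top. An overall budget has to be enforced around the call. A minimal sketch using a monotonic-clock deadline, with the budget and per-try values as illustrative assumptions:

import time
import requests

def get_with_deadline(url, total_budget=15.0, per_try_timeout=(3.0, 5.0)):
    """Retry a GET, but never exceed an overall time budget."""
    deadline = time.monotonic() + total_budget
    attempts = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"Budget exhausted after {attempts} attempts")
        attempts += 1
        connect, read = per_try_timeout
        try:
            # Never wait longer than the time left in the budget
            return requests.get(url, timeout=(min(connect, remaining), min(read, remaining)))
        except requests.exceptions.RequestException:
            # Brief pause before retrying, still bounded by the deadline
            time.sleep(min(1.0, max(0.0, deadline - time.monotonic())))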
Fallback Pattern
Provide alternative responses when primary operations fail. Instead of showing an error, degrade gracefully with cached data or simplified functionality.
Fallback Strategies
1. Cache Fallback
Return stale cached data when the primary data source is unavailable.
Example: Show last known product prices if the pricing service is down.
2. Default Value Fallback
Return sensible defaults when computation fails.
Example: Show "Recommendations unavailable" instead of crashing the page.
3. Feature Toggle Fallback
Disable non-critical features during outages.
Example: Disable personalized recommendations, keep core checkout working.
Python Example: Multi-Level Fallback
import requests
import json

def get_user_recommendations(user_id):
    """Get personalized recommendations with multiple fallback levels"""
    # Level 1: Try the primary recommendation service
    try:
        response = requests.get(
            f"https://api.example.com/recommendations/{user_id}",
            timeout=2
        )
        if response.status_code == 200:
            return response.json()
    except Exception as e:
        print(f"⚠️ Primary service failed: {e}")

    # Level 2: Try the cache
    try:
        cache_key = f"recommendations:{user_id}"
        cached = get_from_cache(cache_key)  # Your cache implementation
        if cached:
            print("📦 Returning cached recommendations")
            return json.loads(cached)
    except Exception as e:
        print(f"⚠️ Cache failed: {e}")

    # Level 3: Return generic popular items
    print("🔄 Returning default recommendations")
    return {
        "items": ["popular_item_1", "popular_item_2", "popular_item_3"],
        "source": "default"
    }

def get_from_cache(key):
    # Placeholder for a real cache implementation
    return None
Health Checks & Monitoring
Continuously monitor service health to detect failures early and trigger automated recovery.
Types of Health Checks
Liveness Probe
Is the service running?
If fails: Restart the service.
Readiness Probe
Is the service ready to handle requests?
If fails: Remove from load balancer.
Python Health Check Endpoint
from flask import Flask, jsonify
import psycopg2
import requests

app = Flask(__name__)

@app.route('/health/liveness')
def liveness():
    """Simple check: is the app running?"""
    return jsonify({"status": "ok"}), 200

@app.route('/health/readiness')
def readiness():
    """Deep check: can the app handle requests?"""
    health = {"status": "ok", "checks": {}}

    # Check database connection
    try:
        conn = psycopg2.connect("dbname=mydb user=postgres")
        conn.close()
        health["checks"]["database"] = "healthy"
    except Exception as e:
        health["status"] = "degraded"
        health["checks"]["database"] = f"unhealthy: {e}"

    # Check external API dependency
    try:
        response = requests.get("https://api.example.com/health", timeout=2)
        health["checks"]["external_api"] = "healthy" if response.ok else "unhealthy"
    except Exception:
        health["status"] = "degraded"
        health["checks"]["external_api"] = "unreachable"

    status_code = 200 if health["status"] == "ok" else 503
    return jsonify(health), status_code
Patterns Comparison
Each pattern solves a different failure scenario. Often, you'll combine multiple patterns for defense in depth; a combined sketch follows the table.
| Pattern | Problem Solved | When to Use | Combine With |
|---|---|---|---|
| Circuit Breaker | Prevent cascading failures from slow/failing dependencies | Calling external services, databases, microservices | Fallback, Timeout |
| Retry | Recover from transient failures automatically | Network blips, temporary service unavailability | Timeout, Exponential Backoff |
| Bulkhead | Isolate failures to prevent resource exhaustion | Multiple dependencies with different performance profiles | Circuit Breaker |
| Timeout | Prevent infinite hangs from slow operations | Every external call (API, database, file I/O) | Retry, Fallback |
| Fallback | Provide alternative when primary operation fails | Non-critical features, user-facing services | Circuit Breaker, Cache |
| Health Checks | Detect failures early and trigger recovery | All services in production | Load Balancer, Auto-scaling |
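To make the combinations concrete, here is a minimal sketch that stacks the CircuitBreaker and retry_with_backoff defined earlier with per-call timeouts and a fallback; the endpoint and the stale-cache default are hypothetical:

import requests

prices_breaker = CircuitBreaker(failure_threshold=3, timeout=30)

@retry_with_backoff(max_retries=3, base_delay=1)
def fetch_prices():
    # Timeout on every external call (hypothetical endpoint)
    response = requests.get("https://api.example.com/prices", timeout=(2, 5))
    response.raise_for_status()
    return response.json()

def get_prices():
    try:
        # Breaker wraps the retried, timeout-bounded call, so one
        # exhausted retry cycle counts as a single breaker failure
        return prices_breaker.call(fetch_prices)
    except Exception:
        # Fallback: degrade gracefully instead of erroring out
        return {"prices": {}, "source": "stale_cache"}

Whether retries sit inside or outside the breaker is a design choice: retries inside (as here) make the breaker see one outcome per request, while retries outside would count every attempt against the failure threshold.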
Summary
- Circuit Breaker prevents cascading failures by stopping calls to failing services.
- Retry with Exponential Backoff recovers from transient failures without overwhelming services.
- Bulkhead isolates resources to prevent one failure from affecting the entire system.
- Timeout prevents infinite hangs and resource leaks.
- Fallback provides graceful degradation instead of total failure.
- Health Checks enable early detection and automated recovery.
- Combine multiple patterns for defense in depth—no single pattern is sufficient.
- Always tune parameters (thresholds, timeouts, retries) based on real production metrics.