TL;DR
A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”—subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes. This prevents cascade failures and protects system resources.
Visual Overview
THREE-STATE MACHINE
┌────────────────────────────────────────────────────┐
│                                                    │
│                  ┌───────────┐                     │
│        ┌────────→│  CLOSED   │                     │
│        │         │ (normal)  │                     │
│        │         └─────┬─────┘                     │
│        │               │                           │
│        │   failures reach threshold                │
│        │               │                           │
│        │               ↓                           │
│        │         ┌───────────┐                     │
│     success      │   OPEN    │←────────┐           │
│        │         │  (trip)   │         │           │
│        │         └─────┬─────┘         │           │
│        │               │               │           │
│        │        timeout expires     failure        │
│        │               │               │           │
│        │               ↓               │           │
│        │         ┌───────────┐         │           │
│        └─────────┤ HALF-OPEN ├─────────┘           │
│                  │  (test)   │                     │
│                  └───────────┘                     │
│                                                    │
└────────────────────────────────────────────────────┘

STATE BEHAVIORS
┌────────────────────────────────────────────────────┐
│  CLOSED: Normal operation                          │
│  ├─ Requests pass through to downstream            │
│  ├─ Failures counted                               │
│  └─ Threshold breach → trip to OPEN                │
│                                                    │
│  OPEN: Protection mode                             │
│  ├─ All requests fail immediately                  │
│  ├─ No network calls made (fail fast)              │
│  └─ Timer running for recovery attempt             │
│                                                    │
│  HALF-OPEN: Testing recovery                       │
│  ├─ Allow ONE test request through                 │
│  ├─ Success → close breaker                        │
│  └─ Failure → reopen breaker                       │
└────────────────────────────────────────────────────┘
Core Explanation
What is a Circuit Breaker?
Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.
Software circuit breakers work the same way:
- Overload = too many failures calling a downstream service
- Trip = stop calling that service
- Reset = test if service recovered, then resume
The Problem It Solves
WITHOUT CIRCUIT BREAKER: CASCADE FAILURE
┌────────────────────────────────────────────────────┐
│                                                    │
│  Service A         Service B         Service C     │
│  ┌──────┐          ┌──────┐          ┌──────┐      │
│  │      │ ───────► │      │ ───────► │ SLOW │      │
│  │      │          │      │          │  ✗   │      │
│  └──────┘          └──────┘          └──────┘      │
│                                                    │
│  1. Service C becomes slow (5s timeouts)           │
│                                                    │
│  2. Service B threads block waiting for C          │
│     Thread pool: [████████████] exhausted!         │
│                                                    │
│  3. Service B stops responding to A                │
│     Service A threads block waiting for B          │
│     Thread pool: [████████████] exhausted!         │
│                                                    │
│  4. Service A fails → User sees error              │
│                                                    │
│  One slow service took down the entire chain!      │
│                                                    │
└────────────────────────────────────────────────────┘
WITH CIRCUIT BREAKER
┌────────────────────────────────────────────────────┐
│                                                    │
│  Service A         Service B         Service C     │
│  ┌──────┐          ┌──────┐   [CB]   ┌──────┐      │
│  │      │ ───────► │      │──┤├────► │ SLOW │      │
│  │      │          │      │   OPEN   │  ✗   │      │
│  └──────┘          └──────┘          └──────┘      │
│                                                    │
│  1. Service C becomes slow                         │
│                                                    │
│  2. Circuit breaker detects failures, trips OPEN   │
│                                                    │
│  3. Service B returns fast failure (no wait!)      │
│     "Service C unavailable" in <1ms                │
│                                                    │
│  4. Service B stays healthy                        │
│     Thread pool: [██░░░░░░░░░░] plenty free        │
│                                                    │
│  5. Service A gets quick error, can show fallback  │
│     User sees degraded experience, not failure     │
│                                                    │
└────────────────────────────────────────────────────┘
Timeline Example
┌──────────────────────────────────────────────────────────────┐
│ Time:    0s    5s    10s             40s   41s   42s   45s   │
│          │     │     │               │     │     │     │     │
│ State:   CLOSED─────►OPEN───────────►HALF───────►CLOSED      │
│                                      OPEN                    │
│                                                              │
│ Events:  │     │     │               │     │     │           │
│          │     │     │               │     │     └─ success! │
│          │     │     │               │     └─ test request   │
│          │     │     │               └─ timeout expires (30s)│
│          │     │     └─ 5th failure → TRIP                   │
│          │     └─ failures accumulating                      │
│          └─ normal operation                                 │
│                                                              │
│ Requests:                                                    │
│   0-10s:  ✓ ✓ ✗ ✗ ✗ ✗ ✗ [TRIP]                               │
│   10-40s: ✗ ✗ ✗ ✗ ✗ (instant fail, no network call)          │
│   41s:    ✓ (test request succeeds)                          │
│   42s+:   ✓ ✓ ✓ ✓ (normal operation resumed)                 │
└──────────────────────────────────────────────────────────────┘
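The timeline above can be replayed in code. The following is a minimal sketch, not a production implementation: `TimelineBreaker` and `replay_timeline` are hypothetical names introduced here, the clock is a fake one advanced by hand, and the threshold (5 failures) and timeout (30s) are taken from the timeline.

```python
from typing import Callable, List, Tuple


class TimelineBreaker:
    """Minimal breaker with an injectable clock (hypothetical helper for this walkthrough)."""

    def __init__(self, clock: Callable[[], float], threshold: int = 5, timeout: float = 30.0):
        self.clock = clock
        self.threshold = threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def request(self, downstream_ok: bool) -> str:
        """Return 'ok', 'failed', or 'blocked' for one request."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.timeout:
                return "blocked"        # fail fast, no downstream call
            self.state = "HALF_OPEN"    # timeout expired: let a test request through
        if downstream_ok:
            self.state, self.failures = "CLOSED", 0
            return "ok"
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state, self.opened_at = "OPEN", self.clock()
        return "failed"


def replay_timeline() -> List[Tuple[int, str, str]]:
    now = [0.0]                          # fake clock in seconds, advanced manually
    cb = TimelineBreaker(clock=lambda: now[0])
    # (time, is the downstream healthy?) pairs loosely following the timeline
    script = [(1, True), (3, True),
              (5, False), (6, False), (7, False), (8, False), (10, False),
              (15, False), (30, False),  # arrive while the breaker is OPEN
              (41, True),                # test request after the 30s timeout
              (42, True), (45, True)]
    log = []
    for t, healthy in script:
        now[0] = float(t)
        log.append((t, cb.request(healthy), cb.state))
    return log


if __name__ == "__main__":
    for entry in replay_timeline():
        print(entry)
```

Injecting the clock keeps the sketch deterministic, so the 30-second wait can be exercised without sleeping.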
Configuration Parameters
| Parameter | Description | Typical Value | Trade-off |
|---|---|---|---|
| Failure Threshold | Failures before tripping | 5-10 | Low = sensitive, High = slow to protect |
| Timeout | How long to stay open | 30-60 seconds | Short = fast recovery, Long = gentle on recovering service |
| Success Threshold | Successes in half-open before closing | 1-3 | Low = fast recovery, High = more confidence |
| Window Size | Time window for counting failures | 60 seconds | Rolling vs consecutive failures |
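To make the Window Size row concrete, here is a small sketch of rolling-window counting, assuming failures are timestamped by the caller. `RollingFailureWindow` is a hypothetical helper, not part of any library listed in this article. Unlike consecutive counting, a success in between does not reset the count; only age removes a failure from the window.

```python
from collections import deque
from typing import Deque


class RollingFailureWindow:
    """Counts failures seen in the last `window_seconds` (hypothetical helper)."""

    def __init__(self, failure_threshold: int = 5, window_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self._failure_times: Deque[float] = deque()

    def record_failure(self, now: float) -> None:
        self._failure_times.append(now)

    def should_trip(self, now: float) -> bool:
        # Drop failures that have aged out of the window
        cutoff = now - self.window_seconds
        while self._failure_times and self._failure_times[0] <= cutoff:
            self._failure_times.popleft()
        return len(self._failure_times) >= self.failure_threshold


if __name__ == "__main__":
    window = RollingFailureWindow(failure_threshold=3, window_seconds=60.0)
    for t in (0.0, 10.0, 65.0, 66.0):
        window.record_failure(t)
        print(t, window.should_trip(t))
```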
Real Systems Using Circuit Breakers
| Library/System | Language | Features | Use Case |
|---|---|---|---|
| Hystrix (Netflix) | Java | Bulkheads, fallbacks, metrics | Legacy (in maintenance mode) but well-documented |
| Resilience4j | Java | Modern, lightweight, Spring integration | Recommended for new Java projects |
| Polly | .NET | Policies, retry + circuit breaker | C# applications |
| opossum | Node.js | Simple, Prometheus metrics | JavaScript/TypeScript services |
| gobreaker | Go | Simple, concurrent-safe | Go microservices |
| Istio | Service mesh | Sidecar-based, no code changes | Kubernetes environments |
Case Study: E-Commerce Checkout
E-COMMERCE CHECKOUT WITH CIRCUIT BREAKERS
┌────────────────────────────────────────────────────┐
│                                                    │
│  Checkout Service                                  │
│       │                                            │
│       ├──[CB]──► Payment Service                   │
│       │          └─ Fallback: "Pay later" option   │
│       │                                            │
│       ├──[CB]──► Inventory Service                 │
│       │          └─ Fallback: Cached stock levels  │
│       │                                            │
│       ├──[CB]──► Shipping Calculator               │
│       │          └─ Fallback: Flat rate estimate   │
│       │                                            │
│       └──[CB]──► Recommendation Service            │
│                  └─ Fallback: Hide section         │
│                                                    │
│  SCENARIO: Payment service down                    │
│  ├─ Payment CB trips OPEN                          │
│  ├─ Checkout offers "Pay later" or "PayPal"        │
│  ├─ Other services unaffected                      │
│  └─ Customer can still complete order              │
└────────────────────────────────────────────────────┘
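The case study relies on one breaker per dependency, each paired with its own fallback. A rough sketch of that wiring follows, under simplifying assumptions: `TinyBreaker`, `failing_payment` and `checkout` are hypothetical names, the breaker counts consecutive failures only, and recovery (HALF-OPEN) is omitted to keep the example short.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TinyBreaker:
    """Consecutive-failure breaker without recovery logic (illustration only)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()      # fail fast: skip the downstream call entirely
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback()      # degrade instead of propagating the error
        self.failures = 0
        return result


# One breaker per dependency, so a payment outage cannot trip the others
breakers: Dict[str, TinyBreaker] = {
    name: TinyBreaker()
    for name in ("payment", "inventory", "shipping", "recommendations")
}


def failing_payment() -> str:
    raise ConnectionError("payment service down")   # simulated outage


def checkout() -> Dict[str, str]:
    return {
        "payment": breakers["payment"].call(failing_payment, lambda: "offer pay-later"),
        "inventory": breakers["inventory"].call(lambda: "live stock", lambda: "cached stock"),
        "shipping": breakers["shipping"].call(lambda: "exact rate", lambda: "flat rate"),
        "recommendations": breakers["recommendations"].call(lambda: "personalized", lambda: "hidden"),
    }


if __name__ == "__main__":
    for _ in range(3):
        print(checkout())
    print({name: b.is_open for name, b in breakers.items()})
```

Keeping the breakers in a dictionary keyed by dependency name is one simple way to isolate failures; the libraries listed earlier typically provide a registry for the same purpose.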
When to Use Circuit Breakers
✓ Perfect Use Cases
EXTERNAL API CALLS
Scenario: Calling third-party payment processor
Requirement: Don't let payment issues kill entire checkout
Configuration: Threshold=3, Timeout=60s
Fallback: Offer alternative payment methods
DATABASE CONNECTIONS
Scenario: Primary DB under heavy load
Requirement: Don't exhaust connection pool
Configuration: Threshold=5, Timeout=30s
Fallback: Read from replica, queue writes
MICROSERVICE CALLS
Scenario: Calling inventory service during checkout
Requirement: Checkout works even if inventory is slow
Configuration: Threshold=5, Timeout=30s
Fallback: Use cached inventory, verify at shipment
EXPENSIVE OPERATIONS
Scenario: ML model inference service
Requirement: Don't block on slow predictions
Configuration: Threshold=3, Timeout=10s
Fallback: Use simpler heuristic, default recommendation
✕ When NOT to Use
CRITICAL PATH WITH NO FALLBACK
Problem: If payment MUST succeed, a circuit breaker only makes the failure faster; it cannot make the payment go through
Alternative: Retry with backoff, queue for later processing
When OK: If you have a meaningful fallback (alternative payment)
SIMPLE INTERNAL CALLS
Problem: Overhead not worth it for simple, reliable calls
Alternative: Just handle errors normally
When OK: For unreliable or slow internal services
FIRE-AND-FORGET CALLS
Problem: Async calls don't block the caller, so failing fast protects little
Alternative: Dead letter queues, retry queues
When OK: If you need to track failure rates for alerting
Interview Application
Common Interview Question
Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”
Strong Answer:
“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:
Why Circuit Breakers:
- Prevent cascade failures: One slow service shouldn’t take down the entire system
- Fail fast: Return errors in milliseconds instead of waiting for timeouts
- Allow recovery: Give failing services breathing room to recover
- Enable fallbacks: Return cached data or degraded responses
Implementation:
- Use a library like Resilience4j (Java) or Polly (.NET)
- Configure per-dependency: payment service might have stricter thresholds than recommendations
Configuration for critical path (e.g., inventory check):
- Failure threshold: 5 failures in 60 seconds
- Open timeout: 30 seconds
- Half-open: Allow 1 test request
Fallback strategy:
- Inventory: Return cached stock levels, verify at shipment
- Payment: Offer alternative payment methods
- Recommendations: Hide the section entirely
Monitoring:
- Track circuit breaker state in metrics (Prometheus/Grafana)
- Alert when breaker trips (indicates downstream problem)
- Dashboard showing breaker states across all services
Combined with other patterns:
- Retry with exponential backoff for transient failures
- Bulkheads to isolate thread pools per dependency
- Timeouts to bound how long we wait”
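As an illustration of combining retries with a breaker, the sketch below retries transient errors with exponential backoff but gives up immediately when the breaker reports OPEN, since retrying then would defeat fail-fast. `retry_with_backoff` and this stand-in `CircuitOpenError` are assumptions made for the example, not a library API; the `sleep` parameter is injectable so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while OPEN."""


def retry_with_backoff(func: Callable[[], T],
                       max_attempts: int = 3,
                       base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                   # breaker is open: do not retry
        except Exception:
            if attempt == max_attempts - 1:
                raise               # out of attempts, surface the last error
            sleep(base_delay * (2 ** attempt))   # delay doubles after each failed attempt
    raise AssertionError("unreachable: the loop always returns or raises")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    print(retry_with_backoff(flaky), "after", attempts["n"], "attempts")
```

In this arrangement the retry wraps the breaker-protected call, so every attempt is still counted by the breaker.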
Follow-up: How do you decide on circuit breaker thresholds?
“I’d start with conservative defaults and tune based on data:
Start with:
- Failure threshold: 5 consecutive or 50% in 10 requests
- Timeout: 30 seconds
- Half-open test count: 1
Tune based on:
- Normal error rate: If service has 1% baseline errors, threshold should be higher
- Recovery time: How long does the service typically take to recover?
- Business impact: Critical services might need faster tripping
Monitor and adjust:
- If breaker trips too often on transient errors → raise threshold
- If cascade failures still occur → lower threshold
- If service recovers but breaker stays open → shorten timeout”
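The "50% in 10 requests" rule from the answer above can be sketched as a count-based window with a minimum number of calls, so that one or two early failures do not trip the breaker. `FailureRateWindow` is a hypothetical helper, and the defaults (window of 10, minimum 5 calls, 50% threshold) are illustrative.

```python
from collections import deque
from typing import Deque


class FailureRateWindow:
    """Tracks the failure rate over the last `window_size` calls (hypothetical helper)."""

    def __init__(self, window_size: int = 10, min_calls: int = 5, rate_threshold: float = 0.5):
        self.outcomes: Deque[bool] = deque(maxlen=window_size)   # True means the call failed
        self.min_calls = min_calls
        self.rate_threshold = rate_threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)    # the oldest outcome falls off automatically

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self) -> bool:
        # Too few samples: a single early failure would otherwise read as 100%
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.rate_threshold


if __name__ == "__main__":
    window = FailureRateWindow()
    for failed in [True, True, False, False, False, False, False, True, True, True]:
        window.record(failed)
    print(window.failure_rate(), window.should_trip())
```

A service with a known baseline error rate would set `rate_threshold` comfortably above that baseline.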
Code Example
Circuit Breaker with Resilience4j (Java)
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// PaymentGateway, PaymentRequest, PaymentResult, PaymentStatus, PaymentException,
// InvalidCardException and CircuitBreakerStatus are application types defined elsewhere.
public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when at least 50% of the recorded calls have failed
            .failureRateThreshold(50)      // 50% failure rate
            .minimumNumberOfCalls(5)       // Need at least 5 calls to evaluate
            // Stay OPEN for 30 seconds before testing
            .waitDurationInOpenState(Duration.ofSeconds(30))
            // In HALF-OPEN, allow 3 test calls
            .permittedNumberOfCallsInHalfOpenState(3)
            // Sliding window for failure rate calculation (last 10 calls)
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)
            // What counts as a failure
            .recordExceptions(PaymentException.class, TimeoutException.class)
            .ignoreExceptions(InvalidCardException.class) // Don't count client errors
            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event handlers for monitoring
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                System.out.println("Circuit breaker state: " +
                    event.getStateTransition().getFromState() + " -> " +
                    event.getStateTransition().getToState());
                // Send to metrics system (Prometheus, DataDog, etc.)
            })
            .onCallNotPermitted(event -> {
                System.out.println("Call blocked by circuit breaker");
            });
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Wrap the call with circuit breaker
        Supplier<PaymentResult> paymentCall = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> {
                // This is the actual call that might fail
                return paymentGateway.charge(request);
            });

        try {
            return paymentCall.get();
        } catch (CallNotPermittedException e) {
            // Circuit breaker is OPEN - fail fast with fallback
            return handleCircuitOpen(request);
        } catch (PaymentException e) {
            // Payment failed (circuit breaker recorded this)
            throw e;
        }
    }

    private PaymentResult handleCircuitOpen(PaymentRequest request) {
        // Fallback options when payment service is unavailable:

        // Option 1: Offer alternative payment
        // return new PaymentResult(PaymentStatus.DEFERRED,
        //     "Payment service unavailable. Try PayPal?");

        // Option 2: Queue for later processing
        // paymentQueue.enqueue(request);
        // return new PaymentResult(PaymentStatus.QUEUED,
        //     "Payment will be processed shortly");

        // Option 3: Return error with helpful message
        return new PaymentResult(PaymentStatus.SERVICE_UNAVAILABLE,
            "Payment processing temporarily unavailable. Please try again in a few minutes.");
    }

    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}
Simple Python Implementation
import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open and call is blocked."""


class CircuitBreaker:
    """
    Simple circuit breaker implementation.
    Not thread-safe: guard state changes with a lock for concurrent use.

    Usage:
        cb = CircuitBreaker("payment-service")

        @cb
        def call_payment_service():
            return requests.post(...)

        result = call_payment_service()  # Raises CircuitOpenError if open
    """

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.half_open_calls = 0

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        """Execute function with circuit breaker protection."""
        # Check if we should transition from OPEN to HALF_OPEN
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                print(f"[{self.name}] Transitioning to HALF_OPEN")
            else:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is OPEN. "
                    f"Retry after {self._time_until_retry():.1f}s"
                )

        # Check half-open call limit
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.config.half_open_max_calls:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is HALF_OPEN and at capacity"
                )
            self.half_open_calls += 1

        # Execute the call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        """Handle successful call."""
        if self.state == CircuitState.HALF_OPEN:
            # Successful test call - close the circuit
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            print(f"[{self.name}] SUCCESS in HALF_OPEN → CLOSED")
        elif self.state == CircuitState.CLOSED:
            # Reset failure count on success (for consecutive failure mode)
            self.failure_count = 0

    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        # Monotonic clock: elapsed-time checks are unaffected by wall-clock changes
        self.last_failure_time = time.monotonic()
        if self.state == CircuitState.HALF_OPEN:
            # Failed test call - back to open
            self.state = CircuitState.OPEN
            print(f"[{self.name}] FAILURE in HALF_OPEN → OPEN")
        elif self.state == CircuitState.CLOSED:
            if self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN
                print(f"[{self.name}] Threshold reached → OPEN")

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try recovery."""
        if self.last_failure_time is None:
            return True
        elapsed = time.monotonic() - self.last_failure_time
        return elapsed >= self.config.recovery_timeout

    def _time_until_retry(self) -> float:
        """Calculate seconds until circuit breaker will try half-open."""
        if self.last_failure_time is None:
            return 0.0
        elapsed = time.monotonic() - self.last_failure_time
        return max(0.0, self.config.recovery_timeout - elapsed)

    @property
    def status(self) -> dict:
        return {
            "name": self.name,
            "state": self.state.value,
            "failure_count": self.failure_count,
            "time_until_retry": self._time_until_retry() if self.state == CircuitState.OPEN else None
        }


# Usage example
if __name__ == "__main__":
    import random

    cb = CircuitBreaker("test-service", CircuitBreakerConfig(
        failure_threshold=3,
        recovery_timeout=5.0
    ))

    @cb
    def unreliable_service():
        if random.random() < 0.7:  # 70% failure rate
            raise Exception("Service unavailable")
        return "Success!"

    for i in range(20):
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")
        print(f"  Status: {cb.status}")
        time.sleep(1)
Related Content
See It In Action:
- Circuit Breaker Explainer - Visual walkthrough of state transitions
Related Concepts:
- Failover - What happens after failure detection
- Health Checks - Proactive health verification
- Rate Limiting - Another traffic control pattern
Quick Self-Check
- Can explain circuit breaker in 60 seconds?
- Understand the three states and transitions?
- Know what “fail fast” means and why it matters?
- Can configure thresholds and explain trade-offs?
- Understand relationship with retries and timeouts?
- Can implement a fallback strategy?
Production signal