Circuit Breaker

Summary

A resilience pattern that prevents cascade failures by failing fast when a downstream service is unhealthy.

TL;DR

A circuit breaker wraps calls to external services and tracks failures. When failures exceed a threshold, the breaker “trips open”—subsequent calls fail immediately without attempting the network call. After a timeout, it allows a test request through. If that succeeds, the breaker closes and normal operation resumes. This prevents cascade failures and protects system resources.

Visual Overview

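Circuit Breaker State Machine
THREE-STATE MACHINE

                +-------------+
        +------>|   CLOSED    |
        |       |  (normal)   |
        |       +-------------+
        |              |
        |    failures > threshold
     success           |
        |              v
        |       +-------------+
        |       |    OPEN     |<------+
        |       |   (trip)    |       |
        |       +-------------+       |
        |              |              |
        |       timeout expires    failure
        |              |              |
        |              v              |
        |       +-------------+       |
        +-------|  HALF-OPEN  |-------+
                |   (test)    |
                +-------------+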


STATE BEHAVIORS

  CLOSED: Normal operation
    • Requests pass through to downstream
    • Failures counted
    • Threshold breach → trip to OPEN

  OPEN: Protection mode
    • All requests fail immediately
    • No network calls made (fail fast)
    • Timer running for recovery attempt

  HALF-OPEN: Testing recovery
    • Allow ONE test request through
    • Success → close breaker
    • Failure → reopen breaker
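
The three states and their transitions can be written down as a small lookup table. The following is a minimal sketch, not a full breaker: the event names (`threshold_exceeded`, `timeout_expired`, `success`, `failure`) are illustrative labels chosen for this example, and any event that does not apply to the current state leaves the state unchanged.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or the same state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
for event in ("threshold_exceeded", "timeout_expired", "failure",
              "timeout_expired", "success"):
    state = next_state(state, event)
    print(f"{event:<20} -> {state.name}")
```

Keeping the transitions in one table makes it easy to check that no state has an unintended exit, for example that OPEN can only be left through the timeout.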

Core Explanation

What is a Circuit Breaker?

Real-World Analogy: Think of an electrical circuit breaker in your home. When too much current flows (overload), the breaker trips and cuts power to prevent a fire. You don’t keep trying to run the overloaded appliance—you wait, fix the problem, then reset the breaker.

Software circuit breakers work the same way:

  • Overload = too many failures calling a downstream service
  • Trip = stop calling that service
  • Reset = test if service recovered, then resume

The Problem It Solves

Cascade Failure Without Circuit Breaker
WITHOUT CIRCUIT BREAKER: CASCADE FAILURE

                                                    
  Service A  --->  Service B  --->  Service C
                                     (SLOW)

  1. Service C becomes slow (5s timeouts)

  2. Service B threads block waiting for C
     Thread pool: [████████████] exhausted!

  3. Service B stops responding to A
     Service A threads block waiting for B
     Thread pool: [████████████] exhausted!

  4. Service A fails → User sees error

  One slow service took down the entire chain!
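
The pool exhaustion in steps 2 and 3 follows from simple arithmetic: the average number of busy threads is roughly the request rate multiplied by how long each request holds a thread (Little's law). The sketch below uses the 5s timeout from the diagram; the pool size, request rate, and healthy latency are assumed numbers for illustration only.

```python
def threads_in_use(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of threads busy at once (Little's law: L = lambda * W)."""
    return requests_per_second * seconds_per_request


POOL_SIZE = 100  # assumed thread pool size for Service B
RATE = 50.0      # assumed incoming requests per second

healthy = threads_in_use(RATE, 0.1)   # assumed: Service C answers in 100 ms
degraded = threads_in_use(RATE, 5.0)  # Service C hits the 5 s timeout

print(f"healthy:  {healthy:.0f} of {POOL_SIZE} threads busy")
print(f"degraded: {degraded:.0f} threads needed, exhausted: {degraded > POOL_SIZE}")
```

Under these assumptions the same traffic that needs a handful of threads when Service C is healthy needs more threads than the pool holds once every call waits for the full timeout.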

With Circuit Breaker: Fail Fast

                                                    
  Service A  --->  Service B  --[CB]-->  Service C
                                 OPEN     (SLOW)

  1. Service C becomes slow

  2. Circuit breaker detects failures, trips OPEN

  3. Service B returns fast failure (no wait!)
     "Service C unavailable" in <1ms

  4. Service B stays healthy
     Thread pool: [██░░░░░░░░░░] plenty free

  5. Service A gets quick error, can show fallback
     User sees degraded experience, not failure

Timeline Example

Circuit Breaker Timeline

  Time         State       Event
  0s           CLOSED      normal operation
  5s           CLOSED      failures accumulating
  10s          OPEN        5th failure → TRIP
  40s          HALF-OPEN   timeout expires (30s)
  41s          HALF-OPEN   test request → success!
  42s onward   CLOSED      normal operation resumed

  Requests:
  0-10s:   failures accumulate until the 5th failure [TRIP]
  10-40s:  rejected (instant fail, no network call)
  41s:     allowed (test request succeeds)
  42s+:    allowed (normal operation resumed)
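
The timeline can be replayed in code by passing timestamps in explicitly instead of reading a real clock. This is a minimal sketch assuming a consecutive-failure threshold of 5 and a 30s open timeout, as in the timeline; the `TimelineBreaker` class and the exact request times are hypothetical and exist only for this illustration.

```python
class TimelineBreaker:
    """Minimal consecutive-failure breaker driven by explicit timestamps."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, now: float, succeeds: bool) -> str:
        """Record one request at time `now` and return what happened to it."""
        if self.state == "OPEN":
            if now - self.opened_at < self.open_timeout:
                return "rejected"      # fail fast, no network call
            self.state = "HALF-OPEN"   # timeout expired: allow a test request
        if succeeds:
            self.state = "CLOSED"
            self.failures = 0
            return "success"
        self.failures += 1
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
        return "failure"


breaker = TimelineBreaker(failure_threshold=5, open_timeout=30.0)
# (time in seconds, would the downstream call succeed?) - assumed request times
requests = [(1, True), (6, False), (7, False), (8, False), (9, False),
            (10, False), (20, True), (39, True), (41, True), (42, True)]
for t, ok in requests:
    outcome = breaker.call(t, ok)
    print(f"t={t:>2}s  {outcome:<8}  state={breaker.state}")
```

Note that the requests at 20s and 39s are rejected even though the downstream call would have succeeded: while the breaker is OPEN it never finds out, which is why the timeout length matters.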

Configuration Parameters

Parameter          | Description                            | Typical Value | Trade-off
Failure Threshold  | Failures before tripping               | 5-10          | Low = sensitive, High = slow to protect
Timeout            | How long to stay open                  | 30-60 seconds | Short = fast recovery, Long = gentle on recovering service
Success Threshold  | Successes in half-open before closing  | 1-3           | Low = fast recovery, High = more confidence
Window Size        | Time window for counting failures      | 60 seconds    | Rolling vs consecutive failures
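
To make the Window Size row concrete, here is a rough sketch of a rolling time window: only failures from the last 60 seconds count toward the threshold, so old failures age out on their own. The `RollingFailureWindow` class is a hypothetical helper written for this example, and timestamps are passed in explicitly to keep it deterministic.

```python
from collections import deque


class RollingFailureWindow:
    """Counts failures whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._failure_times = deque()

    def record_failure(self, now: float) -> None:
        self._failure_times.append(now)

    def count(self, now: float) -> int:
        # Drop failures that have aged out of the window.
        while (self._failure_times
               and now - self._failure_times[0] >= self.window_seconds):
            self._failure_times.popleft()
        return len(self._failure_times)


window = RollingFailureWindow(window_seconds=60.0)
for t in (0, 10, 20, 30, 70):
    window.record_failure(t)

# At t=75 the failures recorded at 0s and 10s are older than 60s.
failures_at_75 = window.count(now=75)
print(f"failures in the last 60s: {failures_at_75}")
```

With a failure threshold of 5, a plain running total would trip here on the fifth failure, while the rolling window would not, because the failures are spread over more than a minute.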

Real Systems Using Circuit Breakers

Library/System     | Language      | Features                                 | Use Case
Hystrix (Netflix)  | Java          | Bulkheads, fallbacks, metrics            | Legacy but well-documented
Resilience4j       | Java          | Modern, lightweight, Spring integration  | Recommended for new Java projects
Polly              | .NET          | Policies, retry + circuit breaker        | C# applications
opossum            | Node.js       | Simple, Prometheus metrics               | JavaScript/TypeScript services
gobreaker          | Go            | Simple, concurrent-safe                  | Go microservices
Istio              | Service mesh  | Sidecar-based, no code changes           | Kubernetes environments

Case Study: E-Commerce Checkout

Circuit Breakers in E-Commerce
E-COMMERCE CHECKOUT WITH CIRCUIT BREAKERS

                                                    
  Checkout Service
    |
    +--[CB]--> Payment Service
    |            Fallback: "Pay later" option
    |
    +--[CB]--> Inventory Service
    |            Fallback: Cached stock levels
    |
    +--[CB]--> Shipping Calculator
    |            Fallback: Flat rate estimate
    |
    +--[CB]--> Recommendation Service
                 Fallback: Hide section

  SCENARIO: Payment service down
    • Payment CB trips OPEN
    • Checkout offers "Pay later" or "PayPal"
    • Other services unaffected
    • Customer can still complete order
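
The per-dependency fallbacks in the case study could be wired up roughly as follows. This is a sketch under assumptions: `with_fallback`, `shipping_quote`, and `flat_rate_estimate` are hypothetical names, and `CircuitOpenError` stands for whatever exception the breaker raises when it rejects a call (it is named after the exception in the Python implementation later in this document).

```python
class CircuitOpenError(Exception):
    """Raised by a breaker when it rejects a call (hypothetical for this sketch)."""


def with_fallback(protected_call, fallback):
    """Run protected_call; if the breaker rejects it, return fallback() instead."""
    try:
        return protected_call()
    except CircuitOpenError:
        return fallback()


def shipping_quote():
    # Stand-in for a breaker-wrapped call whose breaker is currently OPEN.
    raise CircuitOpenError("shipping calculator breaker is OPEN")


def flat_rate_estimate():
    # Degraded but usable answer, as in the case study.
    return {"shipping": "flat rate estimate", "degraded": True}


result = with_fallback(shipping_quote, flat_rate_estimate)
print(result)
```

Only the breaker's rejection triggers the fallback here; other exceptions still propagate, so genuine bugs in the call are not silently masked by a degraded response.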

When to Use Circuit Breakers

✓ Perfect Use Cases

Circuit Breaker Use Cases
EXTERNAL API CALLS
Scenario: Calling third-party payment processor
Requirement: Don't let payment issues kill entire checkout
Configuration: Threshold=3, Timeout=60s
Fallback: Offer alternative payment methods

DATABASE CONNECTIONS
Scenario: Primary DB under heavy load
Requirement: Don't exhaust connection pool
Configuration: Threshold=5, Timeout=30s
Fallback: Read from replica, queue writes

MICROSERVICE CALLS
Scenario: Calling inventory service during checkout
Requirement: Checkout works even if inventory is slow
Configuration: Threshold=5, Timeout=30s
Fallback: Use cached inventory, verify at shipment

EXPENSIVE OPERATIONS
Scenario: ML model inference service
Requirement: Don't block on slow predictions
Configuration: Threshold=3, Timeout=10s
Fallback: Use simpler heuristic, default recommendation

✕ When NOT to Use

When Circuit Breakers Don't Fit
CRITICAL PATH WITH NO FALLBACK
Problem: If payment MUST succeed, failing fast does not help; the breaker only surfaces the failure sooner
Alternative: Retry with backoff, queue for later processing
When OK: If you have a meaningful fallback (alternative payment)

SIMPLE INTERNAL CALLS
Problem: Overhead not worth it for simple, reliable calls
Alternative: Just handle errors normally
When OK: For unreliable or slow internal services

FIRE-AND-FORGET CALLS
Problem: Async calls don't block the caller, so there are no waiting threads for a breaker to protect
Alternative: Dead letter queues, retry queues
When OK: If you need to track failure rates for alerting
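
For the "retry with backoff" alternative mentioned above, the core of the technique is the delay schedule. The sketch below computes exponential backoff delays with an upper cap and optional random jitter; the function name and the default values (0.5s base, doubling, 30s cap) are assumptions for illustration, not recommendations from any particular library.

```python
import random


def backoff_delays(max_retries: int, base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, jitter: bool = False) -> list:
    """Delay in seconds to wait before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # Randomize up to the computed ceiling so clients do not retry in lockstep.
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


delays = backoff_delays(max_retries=5)
print(delays)
```

A caller would sleep for each delay in turn between attempts and give up, or queue the work for later, once the list is exhausted.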

Interview Application

Common Interview Question

Q: “You’re designing a microservices architecture. How would you handle failures in downstream services?”

Strong Answer:

“I’d implement the circuit breaker pattern for all downstream calls. Here’s my approach:

Why Circuit Breakers:

  1. Prevent cascade failures: One slow service shouldn’t take down the entire system
  2. Fail fast: Return errors in milliseconds instead of waiting for timeouts
  3. Allow recovery: Give failing services breathing room to recover
  4. Enable fallbacks: Return cached data or degraded responses

Implementation:

  • Use a library like Resilience4j (Java) or Polly (.NET)
  • Configure per-dependency: payment service might have stricter thresholds than recommendations

Configuration for critical path (e.g., inventory check):

  • Failure threshold: 5 failures in 60 seconds
  • Open timeout: 30 seconds
  • Half-open: Allow 1 test request

Fallback strategy:

  • Inventory: Return cached stock levels, verify at shipment
  • Payment: Offer alternative payment methods
  • Recommendations: Hide the section entirely

Monitoring:

  • Track circuit breaker state in metrics (Prometheus/Grafana)
  • Alert when breaker trips (indicates downstream problem)
  • Dashboard showing breaker states across all services

Combined with other patterns:

  • Retry with exponential backoff for transient failures
  • Bulkheads to isolate thread pools per dependency
  • Timeouts to bound how long we wait”

Follow-up: How do you decide on circuit breaker thresholds?

“I’d start with conservative defaults and tune based on data:

Start with:

  • Failure threshold: 5 consecutive or 50% in 10 requests
  • Timeout: 30 seconds
  • Half-open test count: 1

Tune based on:

  • Normal error rate: If service has 1% baseline errors, threshold should be higher
  • Recovery time: How long does the service typically take to recover?
  • Business impact: Critical services might need faster tripping

Monitor and adjust:

  • If breaker trips too often on transient errors → raise threshold
  • If cascade failures still occur → lower threshold
  • If service recovers but breaker stays open → shorten timeout”
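
The "50% in 10 requests" starting point is a count-based sliding window: keep the outcomes of the most recent calls and trip when the failure rate reaches the threshold. The following is a minimal sketch; the `FailureRateWindow` class is hypothetical, and the minimum of 5 calls before evaluating is an assumed guard against tripping on the first one or two requests.

```python
from collections import deque


class FailureRateWindow:
    """Failure rate over the last `size` calls (count-based sliding window)."""

    def __init__(self, size: int = 10, min_calls: int = 5,
                 rate_threshold: float = 0.5):
        self.min_calls = min_calls
        self.rate_threshold = rate_threshold
        self._outcomes = deque(maxlen=size)  # True = failure, False = success

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def should_trip(self) -> bool:
        if len(self._outcomes) < self.min_calls:
            return False  # too few calls to judge
        return sum(self._outcomes) / len(self._outcomes) >= self.rate_threshold


window = FailureRateWindow(size=10, min_calls=5, rate_threshold=0.5)
for failed in (True, False, True, False):
    window.record(failed)
tripped_after_4 = window.should_trip()  # only 4 calls recorded so far
window.record(True)
tripped_after_5 = window.should_trip()  # 3 failures out of 5 calls
print(tripped_after_4, tripped_after_5)
```

Raising `rate_threshold` corresponds to the "raise threshold" adjustment for breakers that trip on transient errors, and lowering it corresponds to the adjustment for cascades that still get through.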

Code Example

Circuit Breaker with Resilience4j (Java)

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;

        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            // Trip to OPEN when at least 50% of the calls in the sliding window fail
            .failureRateThreshold(50)  // Failure rate threshold, in percent
            .minimumNumberOfCalls(5)    // Need at least 5 calls before the rate is evaluated

            // Stay OPEN for 30 seconds before testing
            .waitDurationInOpenState(Duration.ofSeconds(30))

            // In HALF-OPEN, allow 3 test calls
            .permittedNumberOfCallsInHalfOpenState(3)

            // Sliding window for failure rate calculation
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)

            // What counts as a failure
            .recordExceptions(PaymentException.class, TimeoutException.class)
            .ignoreExceptions(InvalidCardException.class)  // Don't count client errors

            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event handlers for monitoring
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                System.out.println("Circuit breaker state: " +
                    event.getStateTransition().getFromState() + " -> " +
                    event.getStateTransition().getToState());
                // Send to metrics system (Prometheus, DataDog, etc.)
            })
            .onCallNotPermitted(event -> {
                System.out.println("Call blocked by circuit breaker");
            });
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Wrap the call with circuit breaker
        Supplier<PaymentResult> paymentCall = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> {
                // This is the actual call that might fail
                return paymentGateway.charge(request);
            });

        try {
            return paymentCall.get();
        } catch (CallNotPermittedException e) {
            // Circuit breaker is OPEN - fail fast with fallback
            return handleCircuitOpen(request);
        } catch (PaymentException e) {
            // Payment failed (circuit breaker recorded this).
            // Assumes PaymentException is unchecked, since Supplier.get() declares no checked exceptions.
            throw e;
        }
    }

    private PaymentResult handleCircuitOpen(PaymentRequest request) {
        // Fallback options when payment service is unavailable:

        // Option 1: Offer alternative payment
        // return new PaymentResult(PaymentStatus.DEFERRED,
        //     "Payment service unavailable. Try PayPal?");

        // Option 2: Queue for later processing
        // paymentQueue.enqueue(request);
        // return new PaymentResult(PaymentStatus.QUEUED,
        //     "Payment will be processed shortly");

        // Option 3: Return error with helpful message
        return new PaymentResult(PaymentStatus.SERVICE_UNAVAILABLE,
            "Payment processing temporarily unavailable. Please try again in a few minutes.");
    }

    // Check circuit breaker status for health checks / dashboards
    public CircuitBreakerStatus getStatus() {
        return new CircuitBreakerStatus(
            circuitBreaker.getState().name(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls(),
            circuitBreaker.getMetrics().getNumberOfSuccessfulCalls()
        );
    }
}

Simple Python Implementation

import time
from enum import Enum
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
from functools import wraps

T = TypeVar('T')


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 1


class CircuitBreaker:
    """
    Simple circuit breaker implementation.

    Usage:
        cb = CircuitBreaker("payment-service")

        @cb
        def call_payment_service():
            return requests.post(...)

        result = call_payment_service()  # Raises CircuitOpenError if open
    """

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.half_open_calls = 0

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        """Execute function with circuit breaker protection."""

        # Check if we should transition from OPEN to HALF_OPEN
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                print(f"[{self.name}] Transitioning to HALF_OPEN")
            else:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is OPEN. "
                    f"Retry after {self._time_until_retry():.1f}s"
                )

        # Check half-open call limit
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.config.half_open_max_calls:
                raise CircuitOpenError(
                    f"Circuit breaker {self.name} is HALF_OPEN and at capacity"
                )
            self.half_open_calls += 1

        # Execute the call
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            # Count the failure, then re-raise so the caller still sees the original error
            self._on_failure()
            raise

    def _on_success(self):
        """Handle successful call."""
        if self.state == CircuitState.HALF_OPEN:
            # Successful test call - close the circuit
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            print(f"[{self.name}] SUCCESS in HALF_OPEN → CLOSED")
        elif self.state == CircuitState.CLOSED:
            # Reset failure count on success (for consecutive failure mode)
            self.failure_count = 0

    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        # Monotonic clock: elapsed-time checks are unaffected by wall-clock adjustments
        self.last_failure_time = time.monotonic()

        if self.state == CircuitState.HALF_OPEN:
            # Failed test call - back to open
            self.state = CircuitState.OPEN
            print(f"[{self.name}] FAILURE in HALF_OPEN → OPEN")
        elif self.state == CircuitState.CLOSED:
            if self.failure_count >= self.config.failure_threshold:
                self.state = CircuitState.OPEN
                print(f"[{self.name}] Threshold reached → OPEN")

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try recovery."""
        if self.last_failure_time is None:
            return True
        elapsed = time.monotonic() - self.last_failure_time
        return elapsed >= self.config.recovery_timeout

    def _time_until_retry(self) -> float:
        """Calculate seconds until circuit breaker will try half-open."""
        if self.last_failure_time is None:
            return 0
        elapsed = time.monotonic() - self.last_failure_time
        return max(0, self.config.recovery_timeout - elapsed)

    @property
    def status(self) -> dict:
        return {
            "name": self.name,
            "state": self.state.value,
            "failure_count": self.failure_count,
            "time_until_retry": self._time_until_retry() if self.state == CircuitState.OPEN else None
        }


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open and call is blocked."""
    pass


# Usage example
if __name__ == "__main__":
    import random

    cb = CircuitBreaker("test-service", CircuitBreakerConfig(
        failure_threshold=3,
        recovery_timeout=5.0
    ))

    @cb
    def unreliable_service():
        if random.random() < 0.7:  # 70% failure rate
            raise Exception("Service unavailable")
        return "Success!"

    for i in range(20):
        try:
            result = unreliable_service()
            print(f"Call {i+1}: {result}")
        except CircuitOpenError as e:
            print(f"Call {i+1}: BLOCKED - {e}")
        except Exception as e:
            print(f"Call {i+1}: FAILED - {e}")

        print(f"  Status: {cb.status}")
        time.sleep(1)

Quick Self-Check

  • Can explain circuit breaker in 60 seconds?
  • Understand the three states and transitions?
  • Know what “fail fast” means and why it matters?
  • Can configure thresholds and explain trade-offs?
  • Understand relationship with retries and timeouts?
  • Can implement a fallback strategy?

Production signal

Why this concept matters

Interview: 70% of microservices interviews
Production: All microservice architectures
Performance: Prevents cascade failures
Scale: Protects resources