I/D/E · Patterns

Failure Detection

Summary

Mechanisms to identify when nodes in a distributed system have failed, enabling recovery and fault tolerance

TL;DR

Failure detection determines whether a node in a distributed system is alive or dead. The fundamental challenge: you cannot distinguish a crashed node from a slow node or a network partition—all three look the same (no response). Failure detectors make a judgment call with a trade-off between detection speed and false positive rate. Advanced approaches like Phi Accrual provide probabilistic detection instead of binary alive/dead.

Visual Overview

Failure Detection
THE AMBIGUITY PROBLEM

                                                    
  All three scenarios produce identical behavior:   
                                                    
  1. Node CRASHED         No response              
  2. Node OVERLOADED      No response (too slow)   
  3. NETWORK PARTITIONED  No response (can't reach)
                                                    
  From the detector's perspective: silence.         
  Cannot distinguish! Must make a judgment call.    
                                                    


THE DETECTION TRADE-OFF

                                                    
       FAST DETECTION            ACCURACY           
      (short timeout)        (long timeout)         
                                                  
             
   + Detect real        + Fewer false          
     failures fast        positives            
                                               
   - More false         - Slow to              
     positives            detect real          
                          failures             
   - Network                                   
     hiccup = dead      - Traffic to           
                          dead nodes           
             
                                                    
  No perfect answer—tune for your risk tolerance.   
                                                    

Core Explanation

What is Failure Detection?

Real-World Analogy: Imagine you’re a 911 dispatcher with a list of emergency responders. You call each one every hour to confirm they’re available. If someone doesn’t answer, are they:

  • Dead? (actual failure)
  • In a tunnel with no signal? (network partition)
  • On another call? (overloaded)
  • Asleep? (slow to respond)

You can’t know for sure. After 3 missed calls, you might mark them unavailable and route emergencies elsewhere. That’s failure detection—making a judgment call under uncertainty.

The Fundamental Problem

In an asynchronous distributed system:

  • There’s no upper bound on message delivery time
  • A message might arrive in 1ms or 10 minutes
  • You cannot distinguish “dead” from “very slow” with certainty

This is known as the FLP impossibility result: in an asynchronous system with even one faulty process, no deterministic algorithm can guarantee consensus.

Failure Detector Properties

Failure Detector Properties
THEORETICAL PROPERTIES

                                                    
  COMPLETENESS                                      
   Eventually suspects every failed node          
   "We don't miss real failures"                  
                                                    
  ACCURACY                                          
   Doesn't suspect alive nodes                    
   "We don't cry wolf"                            
   (Hard to achieve perfectly)                    
                                                    
  SPEED                                             
   Time from failure to detection                 
   "How fast do we notice?"                       
                                                    
  Can't have all three perfectly—pick your trade-off
                                                    


PRACTICAL FAILURE DETECTOR CLASSES

                                                    
  PERFECT (P)                                       
   Complete + Strongly Accurate                   
   Impossible in async systems                    
                                                    
  EVENTUALLY PERFECT (◊P)                           
   Complete + Eventually Accurate                 
   May make mistakes, but eventually correct      
   This is what real systems use                  
                                                    
  OMEGA (Ω)                                         
   Eventually elects single leader                
   Used for consensus protocols                   
                                                    

Common Failure Detection Approaches

Failure Detection Approaches
1. TIMEOUT-BASED (Most Common)

                                                    
  If no heartbeat within timeout  SUSPECTED        
                                                    
???  TIMEOUT  SUSPECTED           
                                                    
  Pros: Simple, predictable                         
  Cons: Fixed timeout doesn't adapt to network      
                                                    


2. PHI ACCRUAL (Adaptive)

                                                    
  Instead of binary alive/dead, calculate a         
  SUSPICION LEVEL based on heartbeat latency:       
                                                    
  φ = -log₁₀(P_later(t))                           
  where P_later(t) = probability next heartbeat     
  arrives after t, given historical distribution    
                                                    
  Higher φ = longer gap = more suspicious           
  Threshold (e.g., φ > 8) triggers suspicion        
                                                    
  Adapts to actual network conditions               
  Used by: Cassandra, Akka                          
                                                    


3. GOSSIP-BASED (Distributed)

                                                    
  Nodes share failure suspicions with peers         
  Majority suspicion  confirmed failure            
                                                    
  Node A: "I suspect Node X"                        
  Node B: "I also suspect Node X"                   
  Node C: "Me too"  Consensus: X is dead          
                                                    
  Pros: No single point of failure                  
  Cons: Eventual (not immediate) detection          
                                                    


4. ADAPTIVE TIMEOUT

                                                    
  Adjust timeout based on observed latency:         
                                                    
  timeout = mean_latency + (k × std_dev)            
                                                    
  Higher k = more tolerant of variance              
  Simpler than Phi but still adaptive               
                                                    

Phi Accrual Failure Detector Deep Dive

Phi Accrual Failure Detector
PHI ACCRUAL INTUITION

                                                    
  Key insight: Model heartbeat arrivals statistically
                                                    
  1. Track arrival times of heartbeats              
  2. Build a distribution of inter-arrival times   
  3. When checking: "Given the distribution,        
     how unlikely is this long a gap?"              
                                                    
  φ represents "how suspicious" the current gap is  
                                                    


PHI CALCULATION

                                                    
  Given: Normal distribution of arrival times       
         with mean μ and std dev σ                  
                                                    
  t_now = time since last heartbeat                 
                                                    
  P_later(t) = probability heartbeat arrives after t
             = 1 - CDF(t)                           
                                                    
  φ = -log₁₀(P_later(t_now))                       
                                                    
  If t_now is very long (unlikely under normal):    
   P_later is small  φ is high  likely dead     
                                                    


PHI THRESHOLDS (Example)

                                                    
  φ value  P_later   Interpretation              
  
    1       ~10%     10% chance gap this long    
    2       ~1%      Getting suspicious          
    3       ~0.1%    Very unusual gap            
    8       ~10⁻⁸    Extremely unlikely gap      
                                                    
  φ is suspicion level, NOT "probability of death" 
  Typical threshold: φ > 8  mark as suspected     
  (Configurable based on tolerance)                
                                                    

Real Systems Using Failure Detection

SystemApproachConfigurationNotes
Apache CassandraPhi Accrualphi_convict_threshold (default: 8)Adaptive to network
Akka ClusterPhi AccrualConfigurable thresholdBuilt-in module
ZooKeeperTimeout-basedSession timeoutSimple fixed timeout
etcd/RaftHeartbeat timeoutElection timeoutFor leader failure
ConsulGossip + timeoutConfigurableSWIM protocol variant
KubernetesTimeout-basedNode not ready timeoutKubelet heartbeats

Note: Implementation details vary by version. Verify in current documentation.

Failure Detection States

Failure Detection State Machine
STATE TRANSITIONS

                                                    
     
                                                  
           timeout          
      ALIVE    SUSPECTED      
                            
                                               
           heartbeat                confirm    
           received                 (or        
                                    timeout)   
                                               
                                      
            DEAD          
            (rejoin/restart)           
                                                  
     
                                                    
  Why SUSPECTED state?                              
  - Avoid acting on temporary network issues        
  - Give node chance to recover                     
  - Confirm with multiple checks or peers           
                                                    

When to Use Different Approaches

✓ Approach Selection Guide

Choosing a Failure Detection Approach
SIMPLE TIMEOUT
Best for: Small clusters, predictable networks
Example: Redis Sentinel, simple leader election
Trade-off: May need manual timeout tuning

PHI ACCRUAL
Best for: Variable network conditions, cloud
Example: Cassandra, Akka
Trade-off: More complex, need arrival history

GOSSIP-BASED
Best for: Large clusters, high fault tolerance
Example: Consul, Serf
Trade-off: Eventually consistent detection

CONSENSUS-BASED
Best for: Critical systems, need agreement
Example: Raft leader election
Trade-off: Higher latency, more coordination

✕ Pitfalls to Avoid

Failure Detection Pitfalls
SPLIT BRAIN
Problem: Network partition makes both sides think
       the other is dead
Example: Two database primaries, both accepting writes
Mitigation: Quorum-based decisions, fencing

CASCADE FAILURE
Problem: Aggressive failure detection triggers
       chain reaction of evictions
Example: Overloaded cluster  timeouts  more evictions
        even more overload
Mitigation: Back-off, rate-limit evictions

NODE FLAPPING
Problem: Node repeatedly marked dead/alive
Example: Borderline network, frequent timeouts
Mitigation: Hysteresis (require sustained health),
          adaptive thresholds

FALSE POSITIVE STORMS
Problem: Network hiccup marks many nodes dead at once
Example: Datacenter switch issue  mass eviction
Mitigation: Correlated failure detection,
          require peer confirmation

Interview Application

Common Interview Question

Q: “In distributed systems, how do you detect node failures? What are the challenges and trade-offs?”

Strong Answer:

“Failure detection is fundamentally challenging because you can’t distinguish a crashed node from a slow node or a network partition—all produce the same observable behavior: no response.

The Trade-off: You’re choosing between detection speed and false positive rate:

  • Short timeout (1s): Fast detection, but network hiccups trigger false alarms
  • Long timeout (30s): Fewer false alarms, but traffic goes to dead nodes for 30 seconds

Common Approaches:

  1. Fixed Timeout (simplest):
    • No heartbeat within N seconds → suspected
    • Pros: Simple, predictable
    • Cons: Doesn’t adapt to network conditions
  2. Phi Accrual (adaptive):
    • Calculate probability of failure based on heartbeat history
    • φ = -log₁₀(P(alive))
    • φ > 8 typically means dead (~99.999999% confidence)
    • Adapts to actual network variance
  3. Gossip-based (distributed):
    • Nodes share suspicions with peers
    • Majority suspicion → confirmed failure
    • No single point of failure

Real-World Examples:

  • Cassandra uses Phi Accrual (phi_convict_threshold: 8)
  • Kubernetes uses fixed timeout (~40s)
  • Consul uses gossip-based (SWIM protocol)

Key insight: Perfect failure detection is impossible in asynchronous systems (FLP impossibility). Real systems use ‘eventually perfect’ detectors—they may make temporary mistakes but eventually converge to correct.”

Follow-up: How do you handle false positives in failure detection?

“False positives are dangerous—they can cause:

  • Unnecessary failovers (disruption)
  • Split brain (two primaries)
  • Cascade failures (eviction storm)

Mitigation strategies:

  1. Suspected state: Don’t act immediately. Mark as suspected first, then confirm with additional checks before marking dead.

  2. Peer confirmation: Before declaring a node dead, ask other nodes if they can reach it. Majority agreement prevents network-partition-based false positives.

  3. Hysteresis: Require sustained failure (3 missed heartbeats, not 1) before suspecting. Require sustained health before marking recovered.

  4. Rate limiting evictions: Don’t evict more than N nodes per minute. If you’re seeing mass evictions, something systemic is wrong.

  5. Self-fencing: If a node suspects it might be partitioned (its heartbeats aren’t being acknowledged), it should stop serving requests. ‘Better to be down than to be split-brain.’”

Follow-up: Explain the Phi Accrual failure detector.

“Phi Accrual replaces binary alive/dead with a suspicion level (φ) representing ‘how suspicious is the current situation?’

How it works:

  1. Track heartbeat inter-arrival times
  2. Model as normal distribution (mean μ, std dev σ)
  3. When checking: calculate how unlikely is this gap
  4. φ = -log₁₀(P(heartbeat still coming))

Interpretation:

  • φ = 1: ~10% chance of failure (probably fine)
  • φ = 3: ~0.1% chance alive (suspicious)
  • φ = 8: ~10⁻⁸ chance alive (almost certainly dead)

Why it’s better:

  • Adapts to actual network conditions
  • Fewer false positives during temporary slowness
  • Higher conviction threshold possible without slow detection

Configuration: Choose φ threshold based on risk tolerance. Higher threshold = fewer false positives but slower detection.”

Code Example

Phi Accrual Failure Detector (Python)

import time
import math
from collections import deque
from typing import Optional
from dataclasses import dataclass

@dataclass
class PhiAccrualDetector:
    """
    Phi Accrual Failure Detector.

    Calculates suspicion level (phi) based on heartbeat arrival history.
    Higher phi = more likely the node has failed.
    """

    # Configuration
    threshold: float = 8.0  # phi above this = suspected
    min_samples: int = 5    # minimum heartbeats before calculating
    max_samples: int = 200  # sliding window size

    # State
    arrival_times: deque = None
    last_heartbeat: float = None

    def __post_init__(self):
        self.arrival_times = deque(maxlen=self.max_samples)
        self.last_heartbeat = None

    def heartbeat_received(self) -> None:
        """Record a heartbeat arrival."""
        now = time.time()

        if self.last_heartbeat is not None:
            interval = now - self.last_heartbeat
            self.arrival_times.append(interval)

        self.last_heartbeat = now

    def _mean_and_stddev(self) -> tuple[float, float]:
        """Calculate mean and standard deviation of intervals."""
        if len(self.arrival_times) < self.min_samples:
            return None, None

        intervals = list(self.arrival_times)
        n = len(intervals)
        mean = sum(intervals) / n

        variance = sum((x - mean) ** 2 for x in intervals) / n
        stddev = math.sqrt(variance)

        # Ensure minimum stddev to avoid division issues
        stddev = max(stddev, 0.1)

        return mean, stddev

    def phi(self) -> Optional[float]:
        """
        Calculate current phi (suspicion level).

        Returns:
            phi value, or None if not enough samples
        """
        if self.last_heartbeat is None:
            return None

        mean, stddev = self._mean_and_stddev()
        if mean is None:
            return None

        # Time since last heartbeat
        t = time.time() - self.last_heartbeat

        # Calculate P(interval > t) using normal distribution CDF
        # P_later = 1 - CDF(t) = 1 - (1/2)[1 + erf((t-μ)/(σ√2))]
        # Simplified: P_later = (1/2)[1 - erf((t-μ)/(σ√2))]

        z = (t - mean) / (stddev * math.sqrt(2))
        p_later = 0.5 * (1 - math.erf(z))

        # Avoid log(0)
        if p_later <= 0:
            return float('inf')

        # phi = -log10(P_later)
        return -math.log10(p_later)

    def is_available(self) -> bool:
        """
        Check if node should be considered available.

        Returns:
            True if phi is below threshold or not enough samples
        """
        current_phi = self.phi()
        if current_phi is None:
            return True  # Benefit of doubt with insufficient data

        return current_phi < self.threshold

    def status(self) -> dict:
        """Get current detector status."""
        mean, stddev = self._mean_and_stddev()
        current_phi = self.phi()

        return {
            "phi": round(current_phi, 2) if current_phi else None,
            "threshold": self.threshold,
            "is_available": self.is_available(),
            "samples": len(self.arrival_times),
            "mean_interval": round(mean, 3) if mean else None,
            "stddev": round(stddev, 3) if stddev else None,
            "last_heartbeat_ago": (
                round(time.time() - self.last_heartbeat, 2)
                if self.last_heartbeat else None
            ),
        }


# Usage example
if __name__ == "__main__":
    import random

    print("=== Phi Accrual Failure Detector Demo ===\n")

    detector = PhiAccrualDetector(threshold=8.0)

    # Simulate normal heartbeats (1 second interval ± jitter)
    print("Simulating healthy node (1s interval ± 0.1s jitter):")
    for i in range(10):
        detector.heartbeat_received()
        status = detector.status()
        print(f"  Heartbeat {i+1}: φ={status['phi']}, available={status['is_available']}")
        time.sleep(1 + random.uniform(-0.1, 0.1))

    print("\nSimulating slow heartbeat (3s delay):")
    time.sleep(3)
    status = detector.status()
    print(f"  After 3s gap: φ={status['phi']}, available={status['is_available']}")

    detector.heartbeat_received()
    print("  Heartbeat received, recovering...")

    print("\nSimulating node failure (no more heartbeats):")
    for i in range(8):
        time.sleep(1)
        status = detector.status()
        print(f"  +{i+1}s: φ={status['phi']}, available={status['is_available']}")

        if not status['is_available']:
            print("  → Node marked as unavailable!")
            break

    print(f"\nFinal status: {detector.status()}")

Adaptive Timeout (Simpler Alternative)

import time
from collections import deque
from typing import Optional

class AdaptiveTimeoutDetector:
    """
    Adaptive timeout failure detector.

    Simpler than Phi Accrual but still adapts to network conditions.
    timeout = mean + (k * stddev)
    """

    def __init__(self, k: float = 4.0, min_samples: int = 5, max_samples: int = 100):
        """
        Args:
            k: Number of standard deviations to add to mean
            min_samples: Minimum heartbeats before adapting
            max_samples: Sliding window size
        """
        self.k = k
        self.min_samples = min_samples
        self.intervals = deque(maxlen=max_samples)
        self.last_heartbeat: Optional[float] = None
        self.default_timeout = 5.0  # Used before enough samples

    def heartbeat_received(self) -> None:
        """Record heartbeat arrival."""
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def get_timeout(self) -> float:
        """Calculate adaptive timeout."""
        if len(self.intervals) < self.min_samples:
            return self.default_timeout

        intervals = list(self.intervals)
        mean = sum(intervals) / len(intervals)
        variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
        stddev = variance ** 0.5

        return mean + (self.k * stddev)

    def is_available(self) -> bool:
        """Check if node is available."""
        if self.last_heartbeat is None:
            return False

        elapsed = time.time() - self.last_heartbeat
        return elapsed < self.get_timeout()

See It In Action:

Related Concepts:

Quick Self-Check

  • Can explain why failure detection is fundamentally hard?
  • Understand the trade-off between detection speed and false positives?
  • Know the difference between timeout-based and phi accrual?
  • Can explain what phi represents in phi accrual?
  • Understand why “suspected” state exists before “dead”?
  • Know how gossip-based detection avoids single point of failure?

Production signal

Why this concept matters

Interview 60% of distributed systems interviews
Production Cluster management
Performance Detection latency vs accuracy
Scale False positive rate