TL;DR
Failure detection determines whether a node in a distributed system is alive or dead. The fundamental challenge: you cannot distinguish a crashed node from a slow node or a network partition—all three look the same (no response). Failure detectors make a judgment call with a trade-off between detection speed and false positive rate. Advanced approaches like Phi Accrual provide probabilistic detection instead of binary alive/dead.
Visual Overview
THE AMBIGUITY PROBLEM ┌────────────────────────────────────────────────────┐ │ │ │ All three scenarios produce identical behavior: │ │ │ │ 1. Node CRASHED → No response │ │ 2. Node OVERLOADED → No response (too slow) │ │ 3. NETWORK PARTITIONED → No response (can't reach)│ │ │ │ From the detector's perspective: silence. │ │ Cannot distinguish! Must make a judgment call. │ │ │ └────────────────────────────────────────────────────┘ THE DETECTION TRADE-OFF ┌────────────────────────────────────────────────────┐ │ │ │ FAST DETECTION ACCURACY │ │ (short timeout) (long timeout) │ │ ↓ ↓ │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ + Detect real │ │ + Fewer false │ │ │ │ failures fast │ │ positives │ │ │ │ │ │ │ │ │ │ - More false │ │ - Slow to │ │ │ │ positives │ │ detect real │ │ │ │ │ │ failures │ │ │ │ - Network │ │ │ │ │ │ hiccup = dead │ │ - Traffic to │ │ │ │ │ │ dead nodes │ │ │ └─────────────────┘ └─────────────────┘ │ │ │ │ No perfect answer—tune for your risk tolerance. │ │ │ └────────────────────────────────────────────────────┘
Core Explanation
What is Failure Detection?
Real-World Analogy: Imagine you’re a 911 dispatcher with a list of emergency responders. You call each one every hour to confirm they’re available. If someone doesn’t answer, are they:
- Dead? (actual failure)
- In a tunnel with no signal? (network partition)
- On another call? (overloaded)
- Asleep? (slow to respond)
You can’t know for sure. After 3 missed calls, you might mark them unavailable and route emergencies elsewhere. That’s failure detection—making a judgment call under uncertainty.
The Fundamental Problem
In an asynchronous distributed system:
- There’s no upper bound on message delivery time
- A message might arrive in 1ms or 10 minutes
- You cannot distinguish “dead” from “very slow” with certainty
This is known as the FLP impossibility result: in an asynchronous system with even one faulty process, no deterministic algorithm can guarantee consensus.
Failure Detector Properties
THEORETICAL PROPERTIES ┌────────────────────────────────────────────────────┐ │ │ │ COMPLETENESS │ │ └─ Eventually suspects every failed node │ │ └─ "We don't miss real failures" │ │ │ │ ACCURACY │ │ └─ Doesn't suspect alive nodes │ │ └─ "We don't cry wolf" │ │ └─ (Hard to achieve perfectly) │ │ │ │ SPEED │ │ └─ Time from failure to detection │ │ └─ "How fast do we notice?" │ │ │ │ Can't have all three perfectly—pick your trade-off│ │ │ └────────────────────────────────────────────────────┘ PRACTICAL FAILURE DETECTOR CLASSES ┌────────────────────────────────────────────────────┐ │ │ │ PERFECT (P) │ │ └─ Complete + Strongly Accurate │ │ └─ Impossible in async systems │ │ │ │ EVENTUALLY PERFECT (◊P) │ │ └─ Complete + Eventually Accurate │ │ └─ May make mistakes, but eventually correct │ │ └─ This is what real systems use │ │ │ │ OMEGA (Ω) │ │ └─ Eventually elects single leader │ │ └─ Used for consensus protocols │ │ │ └────────────────────────────────────────────────────┘
Common Failure Detection Approaches
1. TIMEOUT-BASED (Most Common) ┌────────────────────────────────────────────────────┐ │ │ │ If no heartbeat within timeout → SUSPECTED │ │ │ │ ♥──♥──♥──?──?──? → TIMEOUT → SUSPECTED │ │ │ │ Pros: Simple, predictable │ │ Cons: Fixed timeout doesn't adapt to network │ │ │ └────────────────────────────────────────────────────┘ 2. PHI ACCRUAL (Adaptive) ┌────────────────────────────────────────────────────┐ │ │ │ Instead of binary alive/dead, calculate a │ │ SUSPICION LEVEL based on heartbeat latency: │ │ │ │ φ = -log₁₀(P_later(t)) │ │ where P_later(t) = probability next heartbeat │ │ arrives after t, given historical distribution │ │ │ │ Higher φ = longer gap = more suspicious │ │ Threshold (e.g., φ > 8) triggers suspicion │ │ │ │ Adapts to actual network conditions │ │ Used by: Cassandra, Akka │ │ │ └────────────────────────────────────────────────────┘ 3. GOSSIP-BASED (Distributed) ┌────────────────────────────────────────────────────┐ │ │ │ Nodes share failure suspicions with peers │ │ Majority suspicion → confirmed failure │ │ │ │ Node A: "I suspect Node X" │ │ Node B: "I also suspect Node X" │ │ Node C: "Me too" → Consensus: X is dead │ │ │ │ Pros: No single point of failure │ │ Cons: Eventual (not immediate) detection │ │ │ └────────────────────────────────────────────────────┘ 4. ADAPTIVE TIMEOUT ┌────────────────────────────────────────────────────┐ │ │ │ Adjust timeout based on observed latency: │ │ │ │ timeout = mean_latency + (k × std_dev) │ │ │ │ Higher k = more tolerant of variance │ │ Simpler than Phi but still adaptive │ │ │ └────────────────────────────────────────────────────┘
Phi Accrual Failure Detector Deep Dive
PHI ACCRUAL INTUITION ┌────────────────────────────────────────────────────┐ │ │ │ Key insight: Model heartbeat arrivals statistically│ │ │ │ 1. Track arrival times of heartbeats │ │ 2. Build a distribution of inter-arrival times │ │ 3. When checking: "Given the distribution, │ │ how unlikely is this long a gap?" │ │ │ │ φ represents "how suspicious" the current gap is │ │ │ └────────────────────────────────────────────────────┘ PHI CALCULATION ┌────────────────────────────────────────────────────┐ │ │ │ Given: Normal distribution of arrival times │ │ with mean μ and std dev σ │ │ │ │ t_now = time since last heartbeat │ │ │ │ P_later(t) = probability heartbeat arrives after t│ │ = 1 - CDF(t) │ │ │ │ φ = -log₁₀(P_later(t_now)) │ │ │ │ If t_now is very long (unlikely under normal): │ │ → P_later is small → φ is high → likely dead │ │ │ └────────────────────────────────────────────────────┘ PHI THRESHOLDS (Example) ┌────────────────────────────────────────────────────┐ │ │ │ φ value │ P_later │ Interpretation │ │ ────────┼──────────┼────────────────────────────│ │ 1 │ ~10% │ 10% chance gap this long │ │ 2 │ ~1% │ Getting suspicious │ │ 3 │ ~0.1% │ Very unusual gap │ │ 8 │ ~10⁻⁸ │ Extremely unlikely gap │ │ │ │ φ is suspicion level, NOT "probability of death" │ │ Typical threshold: φ > 8 → mark as suspected │ │ (Configurable based on tolerance) │ │ │ └────────────────────────────────────────────────────┘
Real Systems Using Failure Detection
| System | Approach | Configuration | Notes |
|---|---|---|---|
| Apache Cassandra | Phi Accrual | phi_convict_threshold (default: 8) | Adaptive to network |
| Akka Cluster | Phi Accrual | Configurable threshold | Built-in module |
| ZooKeeper | Timeout-based | Session timeout | Simple fixed timeout |
| etcd/Raft | Heartbeat timeout | Election timeout | For leader failure |
| Consul | Gossip + timeout | Configurable | SWIM protocol variant |
| Kubernetes | Timeout-based | Node not ready timeout | Kubelet heartbeats |
Note: Implementation details vary by version. Verify in current documentation.
Failure Detection States
STATE TRANSITIONS ┌────────────────────────────────────────────────────┐ │ │ │ ┌─────────────────────────────────────────────┐ │ │ │ │ │ │ │ ┌──────┐ timeout ┌───────────┐ │ │ │ │ │ALIVE │ ──────────────► │ SUSPECTED │ │ │ │ │ └──────┘ └───────────┘ │ │ │ │ ↑ │ │ │ │ │ │ heartbeat │ confirm │ │ │ │ │ received │ (or │ │ │ │ │ │ timeout)│ │ │ │ │ ↓ │ │ │ │ │ ┌────────┐ │ │ │ │ └────────────────────│ DEAD │ │ │ │ │ (rejoin/restart) └────────┘ │ │ │ │ │ │ │ └─────────────────────────────────────────────┘ │ │ │ │ Why SUSPECTED state? │ │ - Avoid acting on temporary network issues │ │ - Give node chance to recover │ │ - Confirm with multiple checks or peers │ │ │ └────────────────────────────────────────────────────┘
When to Use Different Approaches
✓ Approach Selection Guide
SIMPLE TIMEOUT Best for: Small clusters, predictable networks Example: Redis Sentinel, simple leader election Trade-off: May need manual timeout tuning PHI ACCRUAL Best for: Variable network conditions, cloud Example: Cassandra, Akka Trade-off: More complex, need arrival history GOSSIP-BASED Best for: Large clusters, high fault tolerance Example: Consul, Serf Trade-off: Eventually consistent detection CONSENSUS-BASED Best for: Critical systems, need agreement Example: Raft leader election Trade-off: Higher latency, more coordination
✕ Pitfalls to Avoid
SPLIT BRAIN
Problem: Network partition makes both sides think
the other is dead
Example: Two database primaries, both accepting writes
Mitigation: Quorum-based decisions, fencing
CASCADE FAILURE
Problem: Aggressive failure detection triggers
chain reaction of evictions
Example: Overloaded cluster → timeouts → more evictions
→ even more overload
Mitigation: Back-off, rate-limit evictions
NODE FLAPPING
Problem: Node repeatedly marked dead/alive
Example: Borderline network, frequent timeouts
Mitigation: Hysteresis (require sustained health),
adaptive thresholds
FALSE POSITIVE STORMS
Problem: Network hiccup marks many nodes dead at once
Example: Datacenter switch issue → mass eviction
Mitigation: Correlated failure detection,
require peer confirmation
Interview Application
Common Interview Question
Q: “In distributed systems, how do you detect node failures? What are the challenges and trade-offs?”
Strong Answer:
“Failure detection is fundamentally challenging because you can’t distinguish a crashed node from a slow node or a network partition—all produce the same observable behavior: no response.
The Trade-off: You’re choosing between detection speed and false positive rate:
- Short timeout (1s): Fast detection, but network hiccups trigger false alarms
- Long timeout (30s): Fewer false alarms, but traffic goes to dead nodes for 30 seconds
Common Approaches:
- Fixed Timeout (simplest):
- No heartbeat within N seconds → suspected
- Pros: Simple, predictable
- Cons: Doesn’t adapt to network conditions
- Phi Accrual (adaptive):
- Calculate probability of failure based on heartbeat history
- φ = -log₁₀(P(alive))
- φ > 8 typically means dead (~99.999999% confidence)
- Adapts to actual network variance
- Gossip-based (distributed):
- Nodes share suspicions with peers
- Majority suspicion → confirmed failure
- No single point of failure
Real-World Examples:
- Cassandra uses Phi Accrual (phi_convict_threshold: 8)
- Kubernetes uses fixed timeout (~40s)
- Consul uses gossip-based (SWIM protocol)
Key insight: Perfect failure detection is impossible in asynchronous systems (FLP impossibility). Real systems use ‘eventually perfect’ detectors—they may make temporary mistakes but eventually converge to correct.”
Follow-up: How do you handle false positives in failure detection?
“False positives are dangerous—they can cause:
- Unnecessary failovers (disruption)
- Split brain (two primaries)
- Cascade failures (eviction storm)
Mitigation strategies:
Suspected state: Don’t act immediately. Mark as suspected first, then confirm with additional checks before marking dead.
Peer confirmation: Before declaring a node dead, ask other nodes if they can reach it. Majority agreement prevents network-partition-based false positives.
Hysteresis: Require sustained failure (3 missed heartbeats, not 1) before suspecting. Require sustained health before marking recovered.
Rate limiting evictions: Don’t evict more than N nodes per minute. If you’re seeing mass evictions, something systemic is wrong.
Self-fencing: If a node suspects it might be partitioned (its heartbeats aren’t being acknowledged), it should stop serving requests. ‘Better to be down than to be split-brain.’”
Follow-up: Explain the Phi Accrual failure detector.
“Phi Accrual replaces binary alive/dead with a suspicion level (φ) representing ‘how suspicious is the current situation?’
How it works:
- Track heartbeat inter-arrival times
- Model as normal distribution (mean μ, std dev σ)
- When checking: calculate how unlikely is this gap
- φ = -log₁₀(P(heartbeat still coming))
Interpretation:
- φ = 1: ~10% chance of failure (probably fine)
- φ = 3: ~0.1% chance alive (suspicious)
- φ = 8: ~10⁻⁸ chance alive (almost certainly dead)
Why it’s better:
- Adapts to actual network conditions
- Fewer false positives during temporary slowness
- Higher conviction threshold possible without slow detection
Configuration: Choose φ threshold based on risk tolerance. Higher threshold = fewer false positives but slower detection.”
Code Example
Phi Accrual Failure Detector (Python)
import time
import math
from collections import deque
from typing import Optional
from dataclasses import dataclass
@dataclass
class PhiAccrualDetector:
"""
Phi Accrual Failure Detector.
Calculates suspicion level (phi) based on heartbeat arrival history.
Higher phi = more likely the node has failed.
"""
# Configuration
threshold: float = 8.0 # phi above this = suspected
min_samples: int = 5 # minimum heartbeats before calculating
max_samples: int = 200 # sliding window size
# State
arrival_times: deque = None
last_heartbeat: float = None
def __post_init__(self):
self.arrival_times = deque(maxlen=self.max_samples)
self.last_heartbeat = None
def heartbeat_received(self) -> None:
"""Record a heartbeat arrival."""
now = time.time()
if self.last_heartbeat is not None:
interval = now - self.last_heartbeat
self.arrival_times.append(interval)
self.last_heartbeat = now
def _mean_and_stddev(self) -> tuple[float, float]:
"""Calculate mean and standard deviation of intervals."""
if len(self.arrival_times) < self.min_samples:
return None, None
intervals = list(self.arrival_times)
n = len(intervals)
mean = sum(intervals) / n
variance = sum((x - mean) ** 2 for x in intervals) / n
stddev = math.sqrt(variance)
# Ensure minimum stddev to avoid division issues
stddev = max(stddev, 0.1)
return mean, stddev
def phi(self) -> Optional[float]:
"""
Calculate current phi (suspicion level).
Returns:
phi value, or None if not enough samples
"""
if self.last_heartbeat is None:
return None
mean, stddev = self._mean_and_stddev()
if mean is None:
return None
# Time since last heartbeat
t = time.time() - self.last_heartbeat
# Calculate P(interval > t) using normal distribution CDF
# P_later = 1 - CDF(t) = 1 - (1/2)[1 + erf((t-μ)/(σ√2))]
# Simplified: P_later = (1/2)[1 - erf((t-μ)/(σ√2))]
z = (t - mean) / (stddev * math.sqrt(2))
p_later = 0.5 * (1 - math.erf(z))
# Avoid log(0)
if p_later <= 0:
return float('inf')
# phi = -log10(P_later)
return -math.log10(p_later)
def is_available(self) -> bool:
"""
Check if node should be considered available.
Returns:
True if phi is below threshold or not enough samples
"""
current_phi = self.phi()
if current_phi is None:
return True # Benefit of doubt with insufficient data
return current_phi < self.threshold
def status(self) -> dict:
"""Get current detector status."""
mean, stddev = self._mean_and_stddev()
current_phi = self.phi()
return {
"phi": round(current_phi, 2) if current_phi else None,
"threshold": self.threshold,
"is_available": self.is_available(),
"samples": len(self.arrival_times),
"mean_interval": round(mean, 3) if mean else None,
"stddev": round(stddev, 3) if stddev else None,
"last_heartbeat_ago": (
round(time.time() - self.last_heartbeat, 2)
if self.last_heartbeat else None
),
}
# Usage example
if __name__ == "__main__":
import random
print("=== Phi Accrual Failure Detector Demo ===\n")
detector = PhiAccrualDetector(threshold=8.0)
# Simulate normal heartbeats (1 second interval ± jitter)
print("Simulating healthy node (1s interval ± 0.1s jitter):")
for i in range(10):
detector.heartbeat_received()
status = detector.status()
print(f" Heartbeat {i+1}: φ={status['phi']}, available={status['is_available']}")
time.sleep(1 + random.uniform(-0.1, 0.1))
print("\nSimulating slow heartbeat (3s delay):")
time.sleep(3)
status = detector.status()
print(f" After 3s gap: φ={status['phi']}, available={status['is_available']}")
detector.heartbeat_received()
print(" Heartbeat received, recovering...")
print("\nSimulating node failure (no more heartbeats):")
for i in range(8):
time.sleep(1)
status = detector.status()
print(f" +{i+1}s: φ={status['phi']}, available={status['is_available']}")
if not status['is_available']:
print(" → Node marked as unavailable!")
break
print(f"\nFinal status: {detector.status()}")
Adaptive Timeout (Simpler Alternative)
import time
from collections import deque
from typing import Optional
class AdaptiveTimeoutDetector:
"""
Adaptive timeout failure detector.
Simpler than Phi Accrual but still adapts to network conditions.
timeout = mean + (k * stddev)
"""
def __init__(self, k: float = 4.0, min_samples: int = 5, max_samples: int = 100):
"""
Args:
k: Number of standard deviations to add to mean
min_samples: Minimum heartbeats before adapting
max_samples: Sliding window size
"""
self.k = k
self.min_samples = min_samples
self.intervals = deque(maxlen=max_samples)
self.last_heartbeat: Optional[float] = None
self.default_timeout = 5.0 # Used before enough samples
def heartbeat_received(self) -> None:
"""Record heartbeat arrival."""
now = time.time()
if self.last_heartbeat is not None:
self.intervals.append(now - self.last_heartbeat)
self.last_heartbeat = now
def get_timeout(self) -> float:
"""Calculate adaptive timeout."""
if len(self.intervals) < self.min_samples:
return self.default_timeout
intervals = list(self.intervals)
mean = sum(intervals) / len(intervals)
variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
stddev = variance ** 0.5
return mean + (self.k * stddev)
def is_available(self) -> bool:
"""Check if node is available."""
if self.last_heartbeat is None:
return False
elapsed = time.time() - self.last_heartbeat
return elapsed < self.get_timeout()
Related Content
See It In Action:
- Heartbeat & Failure Detection Explainer - Visual walkthrough of the trade-offs
Related Concepts:
- Heartbeat - The liveness signal
- Health Checks - Deeper health verification
- Failover - What happens after detection
- Gossip Protocol - Distributed failure detection
Quick Self-Check
- Can explain why failure detection is fundamentally hard?
- Understand the trade-off between detection speed and false positives?
- Know the difference between timeout-based and phi accrual?
- Can explain what phi represents in phi accrual?
- Understand why “suspected” state exists before “dead”?
- Know how gossip-based detection avoids single point of failure?
Production signal