Failure Detection

TL;DR

Failure detection determines whether a node in a distributed system is alive or dead. The fundamental challenge: you cannot distinguish a crashed node from a slow node or a network partition, because all three look the same (no response). Failure detectors therefore make a judgment call that trades detection speed against false positive rate. Adaptive approaches like Phi Accrual output a continuous suspicion level instead of a binary alive/dead verdict.

Visual Overview

THE AMBIGUITY PROBLEM
┌────────────────────────────────────────────────────┐
│                                                    │
│  All three scenarios produce identical behavior:   │
│                                                    │
│  1. Node CRASHED        → No response              │
│  2. Node OVERLOADED     → No response (too slow)   │
│  3. NETWORK PARTITIONED → No response (can't reach)│
│                                                    │
│  From the detector's perspective: silence.         │
│  Cannot distinguish! Must make a judgment call.    │
│                                                    │
└────────────────────────────────────────────────────┘

THE DETECTION TRADE-OFF
┌────────────────────────────────────────────────────┐
│                                                    │
│       FAST DETECTION            ACCURACY           │
│      (short timeout)        (long timeout)         │
│            ↓                      ↓                │
│  ┌─────────────────┐    ┌─────────────────┐        │
│  │ + Detect real   │    │ + Fewer false   │        │
│  │   failures fast │    │   positives     │        │
│  │                 │    │                 │        │
│  │ - More false    │    │ - Slow to       │        │
│  │   positives     │    │   detect real   │        │
│  │                 │    │   failures      │        │
│  │ - Network       │    │                 │        │
│  │   hiccup = dead │    │ - Traffic to    │        │
│  │                 │    │   dead nodes    │        │
│  └─────────────────┘    └─────────────────┘        │
│                                                    │
│  No perfect answer—tune for your risk tolerance.   │
│                                                    │
└────────────────────────────────────────────────────┘

Core Explanation

What is Failure Detection?

Real-World Analogy: Imagine you’re a 911 dispatcher with a list of emergency responders. You call each one every hour to confirm they’re available. If someone doesn’t answer, are they:

Dead? (actual failure)
In a tunnel with no signal? (network partition)
On another call? (overloaded)
Asleep? (slow to respond)

You can’t know for sure. After 3 missed calls, you might mark them unavailable and route emergencies elsewhere. That’s failure detection—making a judgment call under uncertainty.

The Fundamental Problem

In an asynchronous distributed system:

There’s no upper bound on message delivery time
A message might arrive in 1ms or 10 minutes
You cannot distinguish “dead” from “very slow” with certainty

This ambiguity is at the heart of the FLP impossibility result: in an asynchronous system where even one process may crash, no deterministic algorithm can guarantee that consensus is always reached.

Failure Detector Properties

THEORETICAL PROPERTIES
┌────────────────────────────────────────────────────┐
│                                                    │
│  COMPLETENESS                                      │
│  └─ Eventually suspects every failed node          │
│  └─ "We don't miss real failures"                  │
│                                                    │
│  ACCURACY                                          │
│  └─ Doesn't suspect alive nodes                    │
│  └─ "We don't cry wolf"                            │
│  └─ (Hard to achieve perfectly)                    │
│                                                    │
│  SPEED                                             │
│  └─ Time from failure to detection                 │
│  └─ "How fast do we notice?"                       │
│                                                    │
│  Can't have all three perfectly—pick your trade-off│
│                                                    │
└────────────────────────────────────────────────────┘

FAILURE DETECTOR CLASSES
┌────────────────────────────────────────────────────┐
│                                                    │
│  PERFECT (P)                                       │
│  └─ Complete + Strongly Accurate                   │
│  └─ Impossible in async systems                    │
│                                                    │
│  EVENTUALLY PERFECT (◊P)                           │
│  └─ Complete + Eventually Accurate                 │
│  └─ May make mistakes, but eventually correct      │
│  └─ Real systems approximate this class            │
│                                                    │
│  OMEGA (Ω)                                         │
│  └─ Eventually elects single leader                │
│  └─ Used for consensus protocols                   │
│                                                    │
└────────────────────────────────────────────────────┘

Common Failure Detection Approaches

1. TIMEOUT-BASED (Most Common)
┌────────────────────────────────────────────────────┐
│                                                    │
│  If no heartbeat within timeout → SUSPECTED        │
│                                                    │
│  ♥──♥──♥──?──?──? → TIMEOUT → SUSPECTED            │
│                                                    │
│  Pros: Simple, predictable                         │
│  Cons: Fixed timeout doesn't adapt to network      │
│                                                    │
└────────────────────────────────────────────────────┘
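
The timeout rule in the box above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the class name `TimeoutFailureDetector`, the `timeout_s` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class TimeoutFailureDetector:
    """Suspects a node when no heartbeat arrived within a fixed timeout.

    Hypothetical sketch: names and structure are illustrative only.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so tests can control time
        self.last_seen = {}         # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        """Record that a heartbeat from node_id was just received."""
        self.last_seen[node_id] = self.clock()

    def suspected(self):
        """Return the set of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return {
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }


if __name__ == "__main__":
    fake_now = [0.0]
    detector = TimeoutFailureDetector(timeout_s=3.0, clock=lambda: fake_now[0])
    detector.heartbeat("node-a")
    detector.heartbeat("node-b")
    fake_now[0] = 2.0
    detector.heartbeat("node-a")    # node-b stays silent
    fake_now[0] = 4.0
    print(detector.suspected())     # only node-b has exceeded the 3 s timeout
```

A monotonic clock is used by default so that wall-clock adjustments do not create spurious suspicions. Note that a suspected node is only suspected: a later heartbeat removes it from the set again.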

2. PHI ACCRUAL (Adaptive)
┌────────────────────────────────────────────────────┐
│                                                    │
│  Instead of binary alive/dead, calculate a         │
│  SUSPICION LEVEL from heartbeat arrival gaps:      │
│                                                    │
│  φ = -log₁₀(P_later(t))                            │
│  where t = time since the last heartbeat and       │
│  P_later(t) = probability that a gap exceeds t,    │
│  given the historical distribution                 │
│                                                    │
│  Higher φ = longer gap = more suspicious           │
│  Threshold (e.g., φ > 8) triggers suspicion        │
│                                                    │
│  Adapts to actual network conditions               │
│  Used by: Cassandra, Akka                          │
│                                                    │
└────────────────────────────────────────────────────┘

3. GOSSIP-BASED (Distributed)
┌────────────────────────────────────────────────────┐
│                                                    │
│  Nodes share failure suspicions with peers         │
│  Majority suspicion → confirmed failure            │
│                                                    │
│  Node A: "I suspect Node X"                        │
│  Node B: "I also suspect Node X"                   │
│  Node C: "Me too" → Consensus: X is dead           │
│                                                    │
│  Pros: No single point of failure                  │
│  Cons: Eventual (not immediate) detection          │
│                                                    │
└────────────────────────────────────────────────────┘
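
The majority rule in the box above can be illustrated with a small aggregation function. This sketch covers only the counting step, not the gossip dissemination itself, and real protocols such as SWIM use different confirmation rules. The function name `confirmed_failures` and the shape of the `suspicions` mapping are assumptions made for the example.

```python
from collections import Counter


def confirmed_failures(suspicions, cluster_size):
    """Return nodes that a strict majority of the cluster suspects.

    suspicions: dict mapping a reporting node to the set of nodes it suspects
                (hypothetical structure, as if merged from gossip messages).
    cluster_size: total number of nodes, including the suspected ones.
    """
    votes = Counter()
    for reporter, suspects in suspicions.items():
        for node in suspects:
            if node != reporter:        # a node's opinion of itself is ignored
                votes[node] += 1
    quorum = cluster_size // 2 + 1      # strict majority
    return {node for node, count in votes.items() if count >= quorum}


if __name__ == "__main__":
    gossip_view = {
        "A": {"X"},
        "B": {"X"},
        "C": {"X", "B"},   # C also suspects B, but nobody else does
    }
    print(confirmed_failures(gossip_view, cluster_size=4))
```

Requiring a quorum makes a single node's local network trouble insufficient to declare a peer dead, at the cost of waiting for suspicions to spread.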

4. ADAPTIVE TIMEOUT
┌────────────────────────────────────────────────────┐
│                                                    │
│  Adjust timeout based on observed latency:         │
│                                                    │
│  timeout = mean_latency + (k × std_dev)            │
│                                                    │
│  Higher k = more tolerant of variance              │
│  Simpler than Phi but still adaptive               │
│                                                    │
└────────────────────────────────────────────────────┘
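
The formula above translates directly into code. The following is a minimal sketch over a sliding window of latency samples; the class name `AdaptiveTimeout`, the window size, and the fallback `initial_timeout_s` are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveTimeout:
    """Computes timeout = mean_latency + k * std_dev over recent samples."""

    def __init__(self, k=4.0, window=100, initial_timeout_s=1.0):
        self.k = k
        self.samples = deque(maxlen=window)   # old samples fall off automatically
        self.initial_timeout_s = initial_timeout_s

    def observe(self, latency_s):
        """Record one observed heartbeat latency in seconds."""
        self.samples.append(latency_s)

    def timeout(self):
        """Current timeout; falls back to the initial value until 2 samples exist."""
        if len(self.samples) < 2:
            return self.initial_timeout_s
        return mean(self.samples) + self.k * pstdev(self.samples)


if __name__ == "__main__":
    adaptive = AdaptiveTimeout(k=2.0)
    for latency in (0.1, 0.3):
        adaptive.observe(latency)
    print(adaptive.timeout())   # mean 0.2 plus 2 standard deviations of 0.1
```

The bounded window lets the timeout track changing network conditions; a larger window reacts more slowly but is less sensitive to a single outlier.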

Phi Accrual Failure Detector Deep Dive

PHI ACCRUAL INTUITION
┌────────────────────────────────────────────────────┐
│                                                    │
│  Key insight: heartbeat timing becomes probability │
│                                                    │
│  1. Track arrival times of heartbeats              │
│  2. Build a distribution of inter-arrival times    │
│  3. When checking: "Given the distribution,        │
│     how unlikely is this long a gap?"              │
│                                                    │
│  φ represents "how suspicious" the current gap is  │
│                                                    │
└────────────────────────────────────────────────────┘

PHI CALCULATION
┌────────────────────────────────────────────────────┐
│                                                    │
│  Given: Normal distribution of inter-arrival times │
│         with mean μ and std dev σ                  │
│                                                    │
│  t_now = time since last heartbeat                 │
│                                                    │
│  P_later(t) = probability heartbeat arrives after t│
│             = 1 - CDF(t)                           │
│                                                    │
│  φ = -log₁₀(P_later(t_now))                        │
│                                                    │
│  If t_now is very long (unlikely under normal):    │
│  → P_later is small → φ is high → likely dead      │
│                                                    │
└────────────────────────────────────────────────────┘
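
The calculation above can be sketched as follows, assuming normally distributed inter-arrival times as in the box. This is an illustrative sketch, not the implementation used by any particular system; the class name `PhiAccrualDetector`, the `min_std_s` floor, and the window size are assumptions. The upper tail is computed with `math.erfc` rather than `1 - CDF(t)` so that very small probabilities do not round to zero too early.

```python
import math
from collections import deque
from statistics import mean, stdev


class PhiAccrualDetector:
    """Phi accrual sketch under a normal model of heartbeat inter-arrival times."""

    def __init__(self, threshold=8.0, window=1000, min_std_s=0.1):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_std_s = min_std_s             # floor to avoid a zero std dev
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of P_later(time since last heartbeat)."""
        if len(self.intervals) < 2:
            return 0.0                          # not enough history yet
        mu = mean(self.intervals)
        sigma = max(stdev(self.intervals), self.min_std_s)
        elapsed = now - self.last_arrival
        # P_later(t) = 1 - CDF(t), written with erfc for numerical stability.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        if p_later == 0.0:
            return math.inf
        return -math.log10(p_later)

    def is_suspected(self, now):
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for arrival in (0.0, 1.0, 2.1, 3.0, 4.05, 5.0):
        detector.heartbeat(arrival)
    for now in (5.5, 6.0, 6.5, 7.0):
        print(now, round(detector.phi(now), 2), detector.is_suspected(now))
```

Because φ grows continuously with the silence, different consumers can apply different thresholds to the same detector: a low one for rerouting traffic, a higher one for irreversible actions.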

PHI THRESHOLDS (Example)
┌────────────────────────────────────────────────────┐
│                                                    │
│  φ value │ P_later  │ Interpretation               │
│  ────────┼──────────┼────────────────────────────  │
│    1     │  ~10%    │ 10% chance gap this long     │
│    2     │  ~1%     │ Getting suspicious           │
│    3     │  ~0.1%   │ Very unusual gap             │
│    8     │  ~10⁻⁸   │ Extremely unlikely gap       │
│                                                    │
│  φ is suspicion level, NOT "probability of death"  │
│  Typical threshold: φ > 8 → mark as suspected      │
│  (Configurable based on tolerance)                 │
│                                                    │
└────────────────────────────────────────────────────┘
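
To make the table concrete, a φ threshold can be converted into the gap length that triggers it. The sketch below assumes the same normal model as above; the function names and the example values (mean 1.0 s, standard deviation 0.1 s) are assumptions for illustration.

```python
from statistics import NormalDist


def p_later_for_phi(phi):
    """Tail probability that corresponds to a given phi: P_later = 10^-phi."""
    return 10.0 ** -phi


def gap_for_phi(phi, mean_s, std_s):
    """Gap length t (seconds) at which phi is reached under Normal(mean_s, std_s)."""
    # By symmetry, the upper-tail quantile is the negated lower-tail quantile,
    # which stays accurate for very small probabilities.
    z = -NormalDist().inv_cdf(p_later_for_phi(phi))
    return mean_s + z * std_s


if __name__ == "__main__":
    for phi in (1, 2, 3, 8):
        gap = gap_for_phi(phi, mean_s=1.0, std_s=0.1)
        print(f"phi={phi}  P_later={p_later_for_phi(phi):.0e}  gap={gap:.3f}s")
```

With a mean of 1.0 s and a standard deviation of 0.1 s, φ = 8 corresponds to a silence of roughly 1.56 s, which shows how a tight distribution turns a high threshold into fast detection.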

Real Systems Using Failure Detection

System           │ Approach          │ Configuration                        │ Notes
─────────────────┼───────────────────┼──────────────────────────────────────┼──────────────────────
Apache Cassandra │ Phi Accrual       │ `phi_convict_threshold` (default: 8) │ Adaptive to network
Akka Cluster     │ Phi Accrual       │ Configurable threshold               │ Built-in module
ZooKeeper        │ Timeout-based     │ Session timeout                      │ Simple fixed timeout
etcd/Raft        │ Heartbeat timeout │ Election timeout                     │ For leader failure
Consul           │ Gossip + timeout  │ Configurable                         │ SWIM protocol variant
Kubernetes       │ Timeout-based     │ Node not ready timeout               │ Kubelet heartbeats

Note: Implementation details vary by version. Verify in current documentation.