TL;DR
A heartbeat is a periodic “I’m alive” message sent by nodes in a distributed system. If heartbeats stop arriving, the sender is presumed failed. Heartbeats are the foundation of failure detection, enabling leader election, cluster membership, and automatic failover. The core trade-off is detection speed vs false positive rate—shorter timeouts detect failures faster but trigger more false alarms.
Visual Overview
Worker Node ──♥──♥──♥──♥──♥──► Monitor
              1s 1s 1s 1s 1s

♥ = heartbeat ("I'm alive")
Heartbeat interval: 1s
Timeout threshold: 3s (3× interval)
Monitor tracks "last seen" per node

FAILURE TIMELINE

├──♥──┼──♥──┼──♥──┼──?──┼──?──┼ TIMEOUT
0s    1s    2s    3s    4s    5s
               ↑
         Last heartbeat

After 3s with no heartbeat:
└─ Node marked SUSPECTED
└─ After confirmation → marked DEAD
└─ Trigger failover, remove from cluster
Core Explanation
What is a Heartbeat?
Real-World Analogy: Think of a scuba diving buddy system. You and your partner agree to check in every 30 seconds with an “OK” hand signal. If your partner doesn’t respond for 2 minutes, you assume something’s wrong and start emergency procedures. That hand signal is a heartbeat—a simple, periodic “I’m fine” message.
The same principle works in distributed systems: nodes periodically announce they’re alive. Silence means trouble.
How Heartbeats Work
SENDER (Worker Node)

while running:
    send_heartbeat(monitor)
    sleep(interval)

Message:
{
  "node_id": "worker-1",
  "timestamp": 1704067200,
  "status": "healthy",
  "metadata": { "load": 0.5 }
}

RECEIVER (Monitor)

# Track last seen for each node
last_seen = {}

on heartbeat_received(node_id, ts):
    last_seen[node_id] = ts

every check_interval:
    for node_id, ts in last_seen:
        if now - ts > timeout:
            mark_suspected(node_id)
Push vs Pull: Heartbeat vs Health Check
HEARTBEAT (PUSH-BASED)

Worker ────♥────♥────♥────► Monitor
                (push)

✓ Node proactively sends
✓ Lower latency detection
✓ Works when monitor can't reach worker
✕ Requires cooperative node (must send)
✕ Can't verify deeper health

HEALTH CHECK (PULL-BASED)

Monitor ────?────?────?────► Worker
 (poll) ◄────OK────

✓ Works with uncooperative services
✓ Can check deeper health (DB, disk, app)
✓ Centralized control
✕ Higher detection latency
✕ Requires network path to worker

COMMON PATTERN: COMBINE BOTH

1. Heartbeat for fast liveness detection
2. Health check for deeper status verification

Example: Kubernetes
- Kubelet sends heartbeats to API server
- Kubelet also polls container health endpoints
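To contrast with the push model, here is a minimal sketch of the pull side. The `HealthChecker` class, its `unhealthy_threshold` parameter, and the probe callables are hypothetical names chosen for illustration; a real probe would typically make an HTTP or TCP request to the worker instead of calling a local function.

```python
from typing import Callable, Dict


class HealthChecker:
    """Pull-based checker: the monitor calls each worker's probe."""

    def __init__(self, unhealthy_threshold: int = 3):
        # Consecutive failed probes required before a node is unhealthy.
        self.unhealthy_threshold = unhealthy_threshold
        self.probes: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}

    def register(self, node_id: str, probe: Callable[[], bool]) -> None:
        self.probes[node_id] = probe
        self.failures[node_id] = 0

    def poll_once(self) -> Dict[str, bool]:
        """Run every probe once and return node_id -> healthy."""
        result = {}
        for node_id, probe in self.probes.items():
            try:
                ok = bool(probe())
            except Exception:
                ok = False  # an unreachable or crashing probe counts as a failure
            # A success resets the streak; a failure extends it.
            self.failures[node_id] = 0 if ok else self.failures[node_id] + 1
            result[node_id] = self.failures[node_id] < self.unhealthy_threshold
        return result


checker = HealthChecker(unhealthy_threshold=2)
checker.register("worker-1", lambda: True)   # always passes
checker.register("worker-2", lambda: False)  # always fails
first = checker.poll_once()
second = checker.poll_once()
print(first, second)
```

Because the monitor drives the probe, detection time here is roughly the threshold multiplied by the polling interval, which is why pull-based checks tend to detect failures more slowly than pushed heartbeats.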
The Configuration Trade-off
THE FUNDAMENTAL TRADE-OFF

FAST DETECTION (short timeout) vs ACCURACY (long timeout)

Short timeout:
✓ Detect failures quickly
✕ More false positives
✕ Network hiccup = "dead"

Long timeout:
✓ Fewer false alarms
✕ Slow to detect
✕ Traffic to dead node

No perfect answer: tune for your use case.

CONFIGURATION EXAMPLES

| Use Case | Interval | Timeout | Trade-off |
|---|---|---|---|
| Leader election | 100ms | 300ms | Fast (FP OK) |
| Database HA | 1s | 3s | Balanced |
| Service mesh | 5s | 15s | Low FP |
| Monitoring | 30s | 90s | Very low FP |

FP = false positive. Rule of thumb: timeout = 3× interval.

FALSE POSITIVE IMPACT

Low stakes (service discovery):
└─ False positive = temporary routing change
└─ Not a big deal, can be aggressive

High stakes (database failover):
└─ False positive = split brain possible
└─ Unnecessary failover = disruption
└─ Be conservative, require confirmation
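The trade-off can be made concrete with some rough arithmetic. The sketch below assumes each heartbeat is lost independently with probability `loss_rate` (a simplification, since real packet loss tends to come in bursts) and ignores network jitter. The function names and the `multiplier` parameter are illustrative, not taken from any library.

```python
def false_positive_probability(loss_rate: float, multiplier: int = 3) -> float:
    """Chance that `multiplier` consecutive heartbeats from a healthy node
    are all lost, which is what it takes to exceed the timeout."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return loss_rate ** multiplier


def detection_window(interval: float, multiplier: int = 3,
                     check_interval: float = 0.0) -> tuple:
    """Best and worst delay between a real failure and its detection."""
    timeout = multiplier * interval
    best = timeout - interval         # node died just before its next heartbeat
    worst = timeout + check_interval  # node died right after sending one
    return best, worst


for use_case, interval in [("Leader election", 0.1),
                           ("Database HA", 1.0),
                           ("Service mesh", 5.0)]:
    best, worst = detection_window(interval)
    print(f"{use_case}: detected {best:.1f}s to {worst:.1f}s after failure")

print(f"FP chance per window at 1% loss: {false_positive_probability(0.01):.1e}")
```

Raising the multiplier shrinks the false positive probability geometrically but stretches the detection window linearly, which is the trade-off described above in numeric form.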
Real Systems Using Heartbeats
| System | Interval | Timeout | Notes |
|---|---|---|---|
| Kubernetes | 10s (default) | 40s (default) | Kubelet to API server |
| Apache ZooKeeper | About 1/3 of session timeout (client pings) | Session timeout (negotiated; 2× to 20× tickTime by default) | Heartbeat in session |
| etcd | 100ms (default, configurable) | Election timeout, 1000ms (default) | Raft heartbeats |
| Consul | 1s (default) | 10s (default) | Gossip-based |
| Amazon ELB | Configurable | Unhealthy threshold × interval | Health checks |
Note: Defaults vary by version. Always verify in current documentation.
Heartbeat Patterns in Practice
CENTRALIZED MONITOR

Worker 1 ────♥─────┐
Worker 2 ────♥─────┼──► Central Monitor
Worker 3 ────♥─────┘

✓ Simple to implement
✓ Single source of truth
✕ Monitor is single point of failure
Use: Small clusters, Kubernetes control plane

PEER-TO-PEER (RING)

Worker 1 ──♥──► Worker 2
   ↑               │
   ♥               ♥
   │               ↓
Worker 4 ◄──♥── Worker 3

✓ No single point of failure
✓ O(1) messages per node
✕ Longer detection path
Use: Simple fault-tolerant clusters

GOSSIP-BASED

Each node gossips to k random peers
"I'm alive, and here's who I've heard from..."

Node 1 ──♥──► Node 3 ──♥──► Node 5
  │             │
  ♥             ♥
  ↓             ↓
Node 4        Node 2

✓ Highly scalable (O(log N) propagation)
✓ Robust to failures
✕ Eventually consistent failure detection
Use: Large clusters (Cassandra, Consul)
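The gossip pattern can be sketched in a few lines, assuming each node keeps a table of heartbeat counters and merges incoming tables by keeping the higher counter per node. `GossipNode` and its methods are hypothetical names for illustration; production protocols add suspicion states, timeouts, and dissemination details that are not shown here.

```python
import random
from typing import Dict, List


class GossipNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        # node_id -> highest heartbeat counter heard so far
        self.table: Dict[str, int] = {node_id: 0}

    def tick(self) -> None:
        """Bump our own counter: the local "I'm alive" signal."""
        self.table[self.node_id] += 1

    def merge(self, remote: Dict[str, int]) -> None:
        """Keep the freshest counter per node; stale news is ignored."""
        for nid, counter in remote.items():
            if counter > self.table.get(nid, -1):
                self.table[nid] = counter

    def gossip(self, peers: List["GossipNode"], k: int = 2) -> None:
        """Send our whole table to up to k randomly chosen peers."""
        for peer in random.sample(peers, min(k, len(peers))):
            peer.merge(self.table)


a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
a.tick()
a.gossip([b], k=1)   # a tells b directly
b.gossip([c], k=1)   # c learns about a second-hand, via b
c.merge({"a": 0})    # an older counter does not overwrite a newer one
print(c.table)
```

In a full detector, a node whose counter has not increased for longer than the timeout would be suspected. Since news travels second-hand, different nodes may reach that conclusion at different times, which is the "eventually consistent" caveat above.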
When to Use Heartbeats
✓ Perfect Use Cases
LEADER ELECTION
Scenario: Database with primary/replica setup
Requirement: Detect leader failure for failover
Configuration: 100ms interval, 300ms timeout
Trade-off: Fast detection, some false positives

CLUSTER MEMBERSHIP
Scenario: Service discovery, load balancing
Requirement: Know which nodes are available
Configuration: 1s interval, 5s timeout
Trade-off: Moderate detection, low false positives

SESSION KEEPALIVE
Scenario: ZooKeeper sessions, distributed locks
Requirement: Maintain session while client active
Configuration: Session-based, configurable timeout
Trade-off: Balance responsiveness vs overhead

WORKER POOL MONITORING
Scenario: Task queue with worker processes
Requirement: Redistribute tasks from dead workers
Configuration: 5s interval, 15s timeout
Trade-off: Lower urgency, very low false positives
✕ When NOT to Use (or Use Carefully)
NEED DEEPER HEALTH STATUS
Problem: Heartbeat says "alive" but app is broken
Example: Service running but can't connect to DB
Alternative: Health checks with dependency probing
When OK: Combine heartbeat + health check

DETERMINISTIC FAILURE DETECTION
Problem: A missed heartbeat can't prove the node has failed (it may only be slow or partitioned)
Example: Two nodes both think the other is dead (split brain)
Alternative: Consensus protocols (Raft, Paxos)
When OK: Accept occasional false positives

VERY HIGH FREQUENCY SYSTEMS
Problem: Heartbeat overhead at 100+ Hz
Example: Real-time trading, gaming
Alternative: Integrated health in message protocol
When OK: Lower frequency acceptable

ASYMMETRIC NETWORKS
Problem: Heartbeat path ≠ data path
Example: Node reachable for heartbeat but not data
Alternative: Probe actual service endpoints
When OK: Network is symmetric
Interview Application
Common Interview Question
Q: “Explain how heartbeats work in distributed systems and the trade-offs involved in configuring them.”
Strong Answer:
“Heartbeats are periodic ‘I’m alive’ messages used for failure detection. Here’s how they work:
Mechanism:
- Sender: Every N seconds, send a heartbeat to the monitor
- Receiver: Track ‘last seen’ timestamp per node
- Detection: If no heartbeat for timeout period, mark node as suspected failed
The Core Trade-off:
| Config (interval/timeout) | Detection Speed | False Positives | Example |
|---|---|---|---|
| 100ms/300ms | Very fast | High | Leader election |
| 1s/3s | Fast | Moderate | Database HA |
| 5s/15s | Slow | Low | Service mesh |

Why false positives matter:
- Network hiccup during timeout window = healthy node marked dead
- Consequence: unnecessary failover, split brain risk
Why detection speed matters:
- Slow detection = traffic continues to dead node
- Consequence: errors, latency, data loss
Rule of thumb: timeout = 3× interval. For a 1-second interval, use 3-second timeout—tolerates 2 missed heartbeats before suspecting failure.
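Real-world example: Kubernetes uses a 10s heartbeat interval with a 40s timeout by default. After ~40s without heartbeats the node stops being considered Ready, and its pods are evicted only after a further toleration period (5 minutes by default). This is tuned for stability over speed: Kubernetes prioritizes avoiding false positives.”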
Follow-up: How do you handle the case where a heartbeat succeeds but the service is actually broken?
“Heartbeats only prove the process is running, not that it’s healthy. A service can send heartbeats while:
- Its database connection is dead
- It’s in an infinite loop
- It’s out of memory but not crashed
Solutions:
- Liveness + Readiness separation (Kubernetes model):
  - Liveness probe: Is the process alive? (heartbeat)
  - Readiness probe: Can it serve traffic? (deeper health check)
- Application-level heartbeat:
  - Include health status in heartbeat message:
    { alive: true, db_connected: true, queue_healthy: true }
- Hierarchical health checks:
  - Heartbeat for fast liveness
  - Periodic deep health check (every 30s) for readiness
Best practice: Use heartbeats for ‘is the process running?’ and separate health checks for ‘can it serve requests?’”
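As a rough illustration of the application-level heartbeat idea, the sketch below separates "alive" from "ready" using the health flags from the example message. The `Heartbeat` dataclass and `classify` function are hypothetical names, and passing `None` to mean "no heartbeat arrived within the timeout" is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heartbeat:
    node_id: str
    db_connected: bool
    queue_healthy: bool


def classify(hb: Optional[Heartbeat]) -> str:
    """Map the latest heartbeat (or its absence) to a routing decision."""
    if hb is None:
        return "dead"             # nothing arrived within the timeout
    if hb.db_connected and hb.queue_healthy:
        return "ready"            # alive and able to serve traffic
    return "alive_not_ready"      # process is up, but keep traffic away


healthy = Heartbeat("worker-1", db_connected=True, queue_healthy=True)
degraded = Heartbeat("worker-2", db_connected=False, queue_healthy=True)
print(classify(healthy), classify(degraded), classify(None))
```

The middle state matters: a node that is alive but not ready should be removed from load balancing without triggering a failover or restart.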
Follow-up: What’s the difference between a heartbeat and a lease?
“They’re related but serve different purposes:
Heartbeat: Continuous signal—‘I’m still here.’ Monitor tracks last-seen timestamp. No explicit acknowledgment required.
Lease: Time-limited grant—‘You have permission until T.’ Must be renewed before expiry. Server explicitly grants/extends.
Key difference:
- Heartbeat: Detection is passive (monitor notices absence)
- Lease: Detection is active (lease holder knows when it expires)
Example:
- ZooKeeper sessions: Heartbeat keeps session alive
- Distributed locks: Lease on lock auto-expires if not renewed
Leases add safety: if a node partitions, it knows its lease expires and should stop acting as leader. With pure heartbeats, a partitioned node might keep acting as leader, thinking it’s fine.”
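A minimal sketch of the lease holder's side, assuming a local monotonic clock. The `Lease` class is a hypothetical illustration: in a real system the grantor also tracks expiry, and the holder would stop acting somewhat before `expires_at` to allow for clock drift and message delay. The clock is injectable only so that the example runs without sleeping.

```python
import time
from typing import Callable


class Lease:
    def __init__(self, holder: str, duration: float,
                 clock: Callable[[], float] = time.monotonic):
        self.holder = holder
        self.duration = duration
        self._clock = clock
        self.expires_at = clock() + duration

    def is_valid(self) -> bool:
        """The holder checks this before every action taken as leader."""
        return self._clock() < self.expires_at

    def renew(self) -> bool:
        """Extend the lease; an expired lease cannot be revived."""
        if not self.is_valid():
            return False
        self.expires_at = self._clock() + self.duration
        return True


now = [0.0]                          # fake clock, advanced by hand
lease = Lease("node-a", duration=10.0, clock=lambda: now[0])
now[0] = 5.0
renewed = lease.renew()              # still valid, so expiry moves to t=15
now[0] = 20.0
still_leader = lease.is_valid()      # partitioned past expiry: must step down
late_renewal = lease.renew()
print(renewed, still_leader, late_renewal)
```

The point of the design is that the holder can decide locally that it is no longer the leader, without hearing from anyone, which a plain heartbeat does not provide.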
Code Example
Heartbeat System (Python)
import time
import threading
from dataclasses import dataclass, field
from typing import Dict, Callable, Optional
from enum import Enum


class NodeStatus(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"


@dataclass
class NodeState:
    """Tracked state for a node."""
    node_id: str
    last_heartbeat: float
    status: NodeStatus = NodeStatus.ALIVE
    metadata: dict = field(default_factory=dict)
class HeartbeatMonitor:
    """
    Centralized heartbeat monitor.
    Tracks node liveness and triggers callbacks on status changes.
    """

    def __init__(
        self,
        timeout: float = 3.0,
        check_interval: float = 1.0,
        on_suspected: Optional[Callable[[str], None]] = None,
        on_dead: Optional[Callable[[str], None]] = None,
        on_alive: Optional[Callable[[str], None]] = None,
    ):
        """
        Args:
            timeout: Seconds without heartbeat before suspected
                (a node is marked dead after 2x this value)
            check_interval: How often to check for expired nodes
            on_suspected: Callback when node first suspected
            on_dead: Callback when node confirmed dead
            on_alive: Callback when node comes back alive
        """
        self.timeout = timeout
        self.check_interval = check_interval
        self.on_suspected = on_suspected
        self.on_dead = on_dead
        self.on_alive = on_alive
        self.nodes: Dict[str, NodeState] = {}
        self._lock = threading.Lock()
        self._running = False
        self._checker_thread: Optional[threading.Thread] = None

    def receive_heartbeat(
        self,
        node_id: str,
        metadata: Optional[dict] = None
    ) -> None:
        """Process incoming heartbeat from a node."""
        # Monotonic clock: wall-clock adjustments must not look like silence.
        now = time.monotonic()
        notify_alive = False
        with self._lock:
            if node_id in self.nodes:
                node = self.nodes[node_id]
                # Only announce recovery if the node had been flagged.
                notify_alive = node.status in (
                    NodeStatus.SUSPECTED,
                    NodeStatus.DEAD
                )
                node.last_heartbeat = now
                node.status = NodeStatus.ALIVE
                if metadata:
                    node.metadata = metadata
            else:
                # First heartbeat from this node: register it.
                self.nodes[node_id] = NodeState(
                    node_id=node_id,
                    last_heartbeat=now,
                    metadata=metadata or {}
                )
                notify_alive = True
        # Fire the callback after releasing the lock so it can call back
        # into the monitor (e.g. get_status) without deadlocking.
        if notify_alive and self.on_alive:
            self.on_alive(node_id)

    def _check_expired(self) -> None:
        """Check for nodes that have missed heartbeats."""
        while self._running:
            now = time.monotonic()
            events = []  # (callback, node_id) pairs to fire outside the lock
            with self._lock:
                for node in self.nodes.values():
                    elapsed = now - node.last_heartbeat
                    if elapsed > self.timeout * 2:
                        if node.status != NodeStatus.DEAD:
                            node.status = NodeStatus.DEAD
                            events.append((self.on_dead, node.node_id))
                    elif elapsed > self.timeout:
                        if node.status == NodeStatus.ALIVE:
                            node.status = NodeStatus.SUSPECTED
                            events.append((self.on_suspected, node.node_id))
            for callback, node_id in events:
                if callback:
                    callback(node_id)
            time.sleep(self.check_interval)

    def start(self) -> None:
        """Start the background checker thread."""
        self._running = True
        self._checker_thread = threading.Thread(
            target=self._check_expired,
            daemon=True
        )
        self._checker_thread.start()

    def stop(self) -> None:
        """Stop the background checker thread."""
        self._running = False
        if self._checker_thread:
            self._checker_thread.join()

    def get_status(self, node_id: str) -> Optional[NodeStatus]:
        """Get current status of a node."""
        with self._lock:
            if node_id in self.nodes:
                return self.nodes[node_id].status
            return None

    def get_alive_nodes(self) -> list[str]:
        """Get list of currently alive nodes."""
        with self._lock:
            return [
                n.node_id for n in self.nodes.values()
                if n.status == NodeStatus.ALIVE
            ]
class HeartbeatSender:
    """Sends periodic heartbeats to a monitor."""

    def __init__(
        self,
        node_id: str,
        monitor: HeartbeatMonitor,
        interval: float = 1.0,
        metadata_fn: Optional[Callable[[], dict]] = None
    ):
        self.node_id = node_id
        self.monitor = monitor
        self.interval = interval
        self.metadata_fn = metadata_fn
        self._running = False
        self._sender_thread: Optional[threading.Thread] = None

    def _send_loop(self) -> None:
        """Continuously send heartbeats."""
        while self._running:
            metadata = self.metadata_fn() if self.metadata_fn else None
            self.monitor.receive_heartbeat(self.node_id, metadata)
            time.sleep(self.interval)

    def start(self) -> None:
        """Start sending heartbeats."""
        self._running = True
        self._sender_thread = threading.Thread(
            target=self._send_loop,
            daemon=True
        )
        self._sender_thread.start()

    def stop(self) -> None:
        """Stop sending heartbeats."""
        self._running = False
        if self._sender_thread:
            self._sender_thread.join()
# Usage example
if __name__ == "__main__":
    print("=== Heartbeat Demo ===\n")

    # Create monitor with callbacks
    def on_suspected(node_id: str):
        print(f"⚠️ Node {node_id} SUSPECTED (missing heartbeats)")

    def on_dead(node_id: str):
        print(f"💀 Node {node_id} DEAD (confirmed failure)")

    def on_alive(node_id: str):
        print(f"✅ Node {node_id} ALIVE")

    monitor = HeartbeatMonitor(
        timeout=2.0,
        check_interval=0.5,
        on_suspected=on_suspected,
        on_dead=on_dead,
        on_alive=on_alive
    )
    monitor.start()

    # Create senders (workers)
    worker1 = HeartbeatSender("worker-1", monitor, interval=0.5)
    worker2 = HeartbeatSender("worker-2", monitor, interval=0.5)
    worker1.start()
    worker2.start()
    print("Workers started, sending heartbeats...\n")
    time.sleep(3)

    print("\nSimulating worker-1 failure (stopping heartbeats)...")
    worker1.stop()

    # Wait for detection
    time.sleep(5)

    print("\nFinal status:")
    print(f"  worker-1: {monitor.get_status('worker-1')}")
    print(f"  worker-2: {monitor.get_status('worker-2')}")
    print(f"  Alive nodes: {monitor.get_alive_nodes()}")

    # Cleanup
    worker2.stop()
    monitor.stop()
Heartbeat with Metadata (Production Pattern)
import psutil  # third-party: pip install psutil


def get_node_health() -> dict:
    """Collect node health metrics to include in heartbeat."""
    return {
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage('/').percent,
        "load_avg": psutil.getloadavg()[0],
        # May need elevated privileges on some platforms
        "connections": len(psutil.net_connections()),
    }


# Usage (assumes `monitor` is the HeartbeatMonitor created in the example above)
sender = HeartbeatSender(
    node_id="worker-1",
    monitor=monitor,
    interval=1.0,
    metadata_fn=get_node_health  # Include health in each heartbeat
)
Related Content
See It In Action:
- Heartbeat & Failure Detection Explainer - Visual walkthrough of timeout detection
Related Concepts:
- Failure Detection - The broader problem heartbeats solve
- Health Checks - Pull-based alternative
- Consensus - Uses heartbeats for leader detection
Quick Self-Check
- Can explain heartbeats in 60 seconds?
- Understand the trade-off between detection speed and false positives?
- Know the difference between heartbeat (push) and health check (pull)?
- Can implement a basic heartbeat monitor with timeouts?
- Understand why timeout = 3× interval is a common rule of thumb?
- Know when to use heartbeats vs leases?
Production signal