
Heartbeat

Summary

Periodic signals sent between nodes to indicate liveness, enabling failure detection in distributed systems

TL;DR

A heartbeat is a periodic “I’m alive” message sent by nodes in a distributed system. If heartbeats stop arriving, the sender is presumed failed. Heartbeats are the foundation of failure detection, enabling leader election, cluster membership, and automatic failover. The core trade-off is detection speed vs false positive rate—shorter timeouts detect failures faster but trigger more false alarms.

Visual Overview

Heartbeat Protocol

  Worker Node ──♥──► Monitor   (one heartbeat every 1s)

♥ = heartbeat ("I'm alive")
Heartbeat interval: 1s
Timeout threshold: 3s (3× interval)

Monitor tracks "last seen" per node

FAILURE TIMELINE

  Timeline shown: 0s to 5s. Heartbeats stop after the last heartbeat;
  once the timeout threshold passes with none received, the monitor
  reports TIMEOUT.

After 3s with no heartbeat:
  1. Node marked SUSPECTED
  2. After confirmation, marked DEAD
  3. Trigger failover, remove from cluster

Core Explanation

What is a Heartbeat?

Real-World Analogy: Think of a scuba diving buddy system. You and your partner agree to check in every 30 seconds with an “OK” hand signal. If your partner doesn’t respond for 2 minutes, you assume something’s wrong and start emergency procedures. That hand signal is a heartbeat—a simple, periodic “I’m fine” message.

The same principle works in distributed systems: nodes periodically announce they’re alive. Silence means trouble.

How Heartbeats Work

Heartbeat Mechanism

SENDER (Worker Node)

while running:
    send_heartbeat(monitor)
    sleep(interval)

Message:
{ "node_id": "worker-1",
  "timestamp": 1704067200,
  "status": "healthy",
  "metadata": { "load": 0.5 }
}

RECEIVER (Monitor)

# Track last seen for each node
last_seen = {}

on heartbeat_received(node_id, ts):
    # Record the monitor's own clock; the sender's ts is informational
    # only, because sender and monitor clocks may be skewed
    last_seen[node_id] = now

every check_interval:
    for node_id, seen in last_seen:
        if now - seen > timeout:
            mark_suspected(node_id)

Push vs Pull: Heartbeat vs Health Check

Heartbeat vs Health Check

HEARTBEAT (PUSH-BASED)

  Worker ──♥──► Monitor   (push)

  ✓ Node proactively sends
  ✓ Lower latency detection
  ✓ Works when monitor can't reach worker
  ✕ Requires cooperative node (must send)
  ✕ Can't verify deeper health

HEALTH CHECK (PULL-BASED)

  Monitor ──?──► Worker   (poll)
  Monitor ◄──OK── Worker

  ✓ Works with uncooperative services
  ✓ Can check deeper health (DB, disk, app)
  ✓ Centralized control
  ✕ Higher detection latency
  ✕ Requires network path to worker

COMMON PATTERN: COMBINE BOTH

  1. Heartbeat for fast liveness detection
  2. Health check for deeper status verification

  Example: Kubernetes
  - Kubelet sends heartbeats to API server
  - Kubelet also polls container health endpoints
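To make the pull side concrete, here is a minimal sketch of a poll-based checker. The `HealthChecker` class, its `probe` callables, and the `unhealthy_threshold` parameter are illustrative names chosen for this example, not the API of any particular system; each probe stands in for whatever request the monitor would really issue.

```python
from typing import Callable, Dict


class HealthChecker:
    """Pull-based checker: the monitor polls each target itself.

    Hypothetical sketch. `probe` is any callable returning True when the
    target answered OK; a raised exception counts as a failed poll.
    """

    def __init__(self, unhealthy_threshold: int = 3):
        self.unhealthy_threshold = unhealthy_threshold
        self.targets: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}

    def register(self, name: str, probe: Callable[[], bool]) -> None:
        self.targets[name] = probe
        self.failures[name] = 0

    def poll_once(self) -> None:
        """Run one polling round over every registered target."""
        for name, probe in self.targets.items():
            try:
                ok = bool(probe())
            except Exception:
                ok = False  # unreachable or crashing target
            # Any success resets the streak; failures accumulate
            self.failures[name] = 0 if ok else self.failures[name] + 1

    def is_healthy(self, name: str) -> bool:
        return self.failures[name] < self.unhealthy_threshold


def failing_probe() -> bool:
    raise ConnectionError("target unreachable")


checker = HealthChecker(unhealthy_threshold=3)
checker.register("api", lambda: True)
checker.register("db", failing_probe)
for _ in range(3):
    checker.poll_once()
print(checker.is_healthy("api"), checker.is_healthy("db"))  # True False
```

Requiring several consecutive failures before declaring a target unhealthy is the pull-side counterpart of a heartbeat timeout: detection time is roughly the threshold multiplied by the polling interval.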
                                                    

The Configuration Trade-off

Heartbeat Configuration Trade-offs

THE FUNDAMENTAL TRADE-OFF

  FAST DETECTION (short timeout)   vs   ACCURACY (long timeout)

  ✓ Detect failures quickly             ✓ Fewer false alarms
  ✕ More false positives                ✕ Slow to detect
  ✕ Network hiccup = "dead"             ✕ Traffic to dead node

  No perfect answer; tune for your use case

CONFIGURATION EXAMPLES

  Use Case         Interval  Timeout  Trade-off
  ---------------  --------  -------  ------------
  Leader election  100ms     300ms    Fast (FP OK)
  Database HA      1s        3s       Balanced
  Service mesh     5s        15s      Low FP
  Monitoring       30s       90s      Very low FP

  FP = false positive
  Rule of thumb: timeout = 3× interval

FALSE POSITIVE IMPACT

  Low stakes (service discovery):
    • False positive = temporary routing change
    • Not a big deal, can be aggressive

  High stakes (database failover):
    • False positive = split brain possible
    • Unnecessary failover = disruption
    • Be conservative, require confirmation
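The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the function names are made up for this example, heartbeat losses are assumed to be independent (real networks lose packets in bursts, so the estimate is optimistic), and network delay is ignored.

```python
def timeout_for(interval: float, multiplier: int = 3) -> float:
    """Rule of thumb from the text: timeout = multiplier x interval."""
    return interval * multiplier


def false_suspicion_probability(loss_rate: float, multiplier: int = 3) -> float:
    """Chance that `multiplier` heartbeats in a row are all lost.

    Assumes each heartbeat is lost independently with probability
    `loss_rate`; bursty loss makes the real figure higher.
    """
    return loss_rate ** multiplier


def detection_latency_bounds(interval: float, timeout: float,
                             check_interval: float) -> tuple:
    """(best, worst) seconds between a crash and the monitor noticing.

    Best case: the node dies just before its next heartbeat was due.
    Worst case: it dies right after sending one, and the monitor's
    periodic check has only just run.
    """
    return (timeout - interval, timeout + check_interval)


timeout = timeout_for(1.0)                     # 3.0 seconds
p = false_suspicion_probability(0.01)          # roughly one in a million
best, worst = detection_latency_bounds(1.0, timeout, check_interval=1.0)
print(timeout, best, worst)                    # 3.0 2.0 4.0
```

Raising the multiplier shrinks the false-suspicion estimate geometrically but pushes both latency bounds out linearly, which is the trade-off in one line.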
                                                    

Real Systems Using Heartbeats

System            Interval       Timeout                         Notes
----------------  -------------  ------------------------------  ---------------------
Kubernetes        10s (default)  40s (default)                   Kubelet to API server
Apache ZooKeeper  tickTime × 2   Session timeout (configurable)  Heartbeat in session
etcd              Configurable   Election timeout                Raft heartbeats
Consul            1s (default)   10s (default)                   Gossip-based
Amazon ELB        Configurable   Unhealthy threshold × interval  Health checks

Note: Defaults vary by version. Always verify in current documentation.

Heartbeat Patterns in Practice

Common Heartbeat Architectures

CENTRALIZED MONITOR

  Worker 1 ──♥──┐
  Worker 2 ──♥──┼──► Central Monitor
  Worker 3 ──♥──┘

  ✓ Simple to implement
  ✓ Single source of truth
  ✕ Monitor is single point of failure
  Use: Small clusters, Kubernetes control plane

PEER-TO-PEER (RING)

  Worker 1 ──♥──► Worker 2
     ▲                │
     ♥                ♥
     │                ▼
  Worker 4 ◄──♥── Worker 3

  ✓ No single point of failure
  ✓ O(1) messages per node
  ✕ Longer detection path
  Use: Simple fault-tolerant clusters

GOSSIP-BASED

  Each node gossips to k random peers
  "I'm alive, and here's who I've heard from..."

  Node 1 ──── Node 3 ──── Node 5
    ♥                       ♥
  Node 4                  Node 2

  ✓ Highly scalable (O(log N) propagation)
  ✓ Robust to failures
  ✕ Eventually consistent failure detection
  Use: Large clusters (Cassandra, Consul)
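The gossip message "I'm alive, and here's who I've heard from" can be sketched as a table of heartbeat counters that nodes merge when they exchange state. This is a simplified illustration of the general idea and not how Cassandra or Consul actually implement it; `GossipNode` and its methods are hypothetical names, and time is passed in explicitly so the example is deterministic.

```python
from typing import Dict, List, Tuple


class GossipNode:
    """Gossip-style heartbeat table (hypothetical sketch).

    For every known member the node keeps the highest heartbeat counter
    it has heard of and the local time at which that counter last rose.
    """

    def __init__(self, node_id: str):
        self.node_id = node_id
        # member id -> (heartbeat counter, local time of last increase)
        self.table: Dict[str, Tuple[int, float]] = {node_id: (0, 0.0)}

    def tick(self, now: float) -> None:
        """Bump our own counter; called once per gossip interval."""
        counter, _ = self.table[self.node_id]
        self.table[self.node_id] = (counter + 1, now)

    def merge(self, remote: Dict[str, Tuple[int, float]], now: float) -> None:
        """Adopt any counter that is newer than what we already hold."""
        for member, (counter, _) in remote.items():
            known = self.table.get(member)
            if known is None or counter > known[0]:
                # Stamp with OUR clock; remote timestamps are not comparable
                self.table[member] = (counter, now)

    def suspected(self, now: float, timeout: float) -> List[str]:
        """Members whose counter has not advanced within `timeout`."""
        return sorted(
            member for member, (_, seen) in self.table.items()
            if member != self.node_id and now - seen > timeout
        )


a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")

# t=1: everyone ticks; a hears from b, then relays that to c
for node in (a, b, c):
    node.tick(now=1.0)
a.merge(b.table, now=1.0)
c.merge(a.table, now=1.0)   # c learns about b without ever talking to b

# t=2..5: b has crashed; a and c keep ticking and gossiping
for t in (2.0, 3.0, 4.0, 5.0):
    a.tick(now=t)
    c.tick(now=t)
    c.merge(a.table, now=t)
    a.merge(c.table, now=t)

print(c.suspected(now=5.0, timeout=3.0))  # ['b']
```

Because liveness information travels second-hand, different nodes can reach the same conclusion at different times, which is the "eventually consistent" caveat listed above.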
                                                    

When to Use Heartbeats

✓ Perfect Use Cases

Heartbeat Use Cases
LEADER ELECTION
Scenario: Database with primary/replica setup
Requirement: Detect leader failure for failover
Configuration: 100ms interval, 300ms timeout
Trade-off: Fast detection, some false positives

CLUSTER MEMBERSHIP
Scenario: Service discovery, load balancing
Requirement: Know which nodes are available
Configuration: 1s interval, 5s timeout
Trade-off: Moderate detection, low false positives

SESSION KEEPALIVE
Scenario: ZooKeeper sessions, distributed locks
Requirement: Maintain session while client active
Configuration: Session-based, configurable timeout
Trade-off: Balance responsiveness vs overhead

WORKER POOL MONITORING
Scenario: Task queue with worker processes
Requirement: Redistribute tasks from dead workers
Configuration: 5s interval, 15s timeout
Trade-off: Lower urgency, very low false positives
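For the worker pool case, the action taken when a worker is declared dead might look like the sketch below. The `redistribute` helper and its round-robin policy are assumptions made for illustration; a real task queue would apply its own assignment rules.

```python
from itertools import cycle
from typing import Dict, List


def redistribute(
    assignments: Dict[str, List[str]],
    dead_worker: str,
    alive_workers: List[str],
) -> Dict[str, List[str]]:
    """Move every task owned by `dead_worker` onto the alive workers.

    Round-robin is an arbitrary choice for this sketch. Returns a new
    mapping and leaves the input untouched.
    """
    if not alive_workers:
        raise RuntimeError("no alive workers left to take over tasks")

    # Copy everything except the dead worker's entry
    result = {w: list(tasks) for w, tasks in assignments.items()
              if w != dead_worker}
    targets = cycle(alive_workers)
    for task in assignments.get(dead_worker, []):
        result.setdefault(next(targets), []).append(task)
    return result


before = {"worker-1": ["t1", "t2", "t3"], "worker-2": ["t4"], "worker-3": []}
after = redistribute(before, "worker-1", ["worker-2", "worker-3"])
print(after)  # {'worker-2': ['t4', 't1', 't3'], 'worker-3': ['t2']}
```

Since a timeout can fire on a worker that is merely slow, reassigned tasks may end up running twice; making tasks idempotent keeps a false positive harmless.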

✕ When NOT to Use (or Use Carefully)

When Heartbeats May Not Fit
NEED DEEPER HEALTH STATUS
Problem: Heartbeat says "alive" but app is broken
Example: Service running but can't connect to DB
Alternative: Health checks with dependency probing
When OK: Combine heartbeat + health check

DETERMINISTIC FAILURE DETECTION
Problem: Heartbeats can't prove a node has failed; silence may only mean a slow or partitioned network
Example: Two nodes each think the other is dead (split brain)
Alternative: Consensus protocols (Raft, Paxos)
When OK: Occasional false positives are acceptable

VERY HIGH FREQUENCY SYSTEMS
Problem: Heartbeat overhead at 100+ Hz
Example: Real-time trading, gaming
Alternative: Integrated health in message protocol
When OK: Lower frequency acceptable

ASYMMETRIC NETWORKS
Problem: Heartbeat path ≠ data path
Example: Node reachable for heartbeat but not data
Alternative: Probe actual service endpoints
When OK: Network is symmetric

Interview Application

Common Interview Question

Q: “Explain how heartbeats work in distributed systems and the trade-offs involved in configuring them.”

Strong Answer:

“Heartbeats are periodic ‘I’m alive’ messages used for failure detection. Here’s how they work:

Mechanism:

  • Sender: Every N seconds, send a heartbeat to the monitor
  • Receiver: Track ‘last seen’ timestamp per node
  • Detection: If no heartbeat for timeout period, mark node as suspected failed

The Core Trade-off:

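Config (interval/timeout)  Detection Speed  False Positives  Example
-------------------------  ---------------  ---------------  ---------------
100ms/300ms                Very fast        High             Leader election
1s/3s                      Fast             Moderate         Database HA
5s/15s                     Slow             Low              Service mesh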

Why false positives matter:

  • Network hiccup during timeout window = healthy node marked dead
  • Consequence: unnecessary failover, split brain risk

Why detection speed matters:

  • Slow detection = traffic continues to dead node
  • Consequence: errors, latency, data loss

Rule of thumb: timeout = 3× interval. For a 1-second interval, use 3-second timeout—tolerates 2 missed heartbeats before suspecting failure.

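Real-world example: Kubernetes uses a 10s heartbeat interval with a 40s grace period by default. After ~40s without heartbeats the node is marked NotReady; pods are evicted only after a further toleration period (5 minutes by default). This is tuned for stability over speed: Kubernetes prioritizes avoiding false positives.”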

Follow-up: How do you handle the case where a heartbeat succeeds but the service is actually broken?

“Heartbeats only prove the process is running, not that it’s healthy. A service can send heartbeats while:

  • Its database connection is dead
  • It’s in an infinite loop
  • It’s out of memory but not crashed

Solutions:

  1. Liveness + Readiness separation (Kubernetes model):
    • Liveness probe: Is the process alive? (the heartbeat's job, although the kubelet polls for it)
    • Readiness probe: Can it serve traffic? (deeper health check)
  2. Application-level heartbeat:
    • Include health status in heartbeat message
    • { alive: true, db_connected: true, queue_healthy: true }
  3. Hierarchical health checks:
    • Heartbeat for fast liveness
    • Periodic deep health check (every 30s) for readiness

Best practice: Use heartbeats for ‘is the process running?’ and separate health checks for ‘can it serve requests?’”
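The application-level heartbeat from solution 2 could be assembled along these lines. This is a minimal sketch: `build_heartbeat`, `is_ready`, and the check names are hypothetical, and the checks are stand-ins for real dependency probes.

```python
import json
import time
from typing import Callable, Dict


def build_heartbeat(node_id: str, checks: Dict[str, Callable[[], bool]]) -> dict:
    """Assemble a heartbeat that carries per-dependency health flags.

    `checks` maps a flag name to a zero-argument callable. A check that
    raises is reported as False rather than suppressing the heartbeat.
    """
    payload = {"node_id": node_id, "timestamp": time.time(), "alive": True}
    for name, check in checks.items():
        try:
            payload[name] = bool(check())
        except Exception:
            payload[name] = False
    return payload


def is_ready(payload: dict) -> bool:
    """Monitor side: ready only if every boolean flag is True."""
    return all(v for v in payload.values() if isinstance(v, bool))


def broken_db() -> bool:
    raise TimeoutError("db connection timed out")


hb = build_heartbeat("worker-1", {
    "db_connected": broken_db,
    "queue_healthy": lambda: True,
})
print(json.dumps(hb))
print(is_ready(hb))  # False: alive, but not fit to serve traffic
```

The heartbeat still arrives, so the monitor can tell "alive but degraded" apart from "silent", and route traffic away without triggering a failover.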

Follow-up: What’s the difference between a heartbeat and a lease?

“They’re related but serve different purposes:

Heartbeat: Continuous signal—‘I’m still here.’ Monitor tracks last-seen timestamp. No explicit acknowledgment required.

Lease: Time-limited grant—‘You have permission until T.’ Must be renewed before expiry. Server explicitly grants/extends.

Key difference:

  • Heartbeat: Detection is passive (monitor notices absence)
  • Lease: Detection is active (lease holder knows when it expires)

Example:

  • ZooKeeper sessions: Heartbeat keeps session alive
  • Distributed locks: Lease on lock auto-expires if not renewed

Leases add safety: if a node partitions, it knows its lease expires and should stop acting as leader. With pure heartbeats, a partitioned node might keep acting as leader, thinking it’s fine.”
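The holder-side check that makes a lease safer than a bare heartbeat can be sketched as follows. The `Lease` class and `act_as_leader` function are illustrative only; in a real system the renewal would be a round trip to the grantor, and the holder would keep a safety margin for clock drift. Times are passed in explicitly to keep the example deterministic.

```python
class Lease:
    """Time-limited grant that the holder must renew (hypothetical sketch)."""

    def __init__(self, holder: str, duration: float, now: float):
        self.holder = holder
        self.duration = duration
        self.expires_at = now + duration

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float) -> bool:
        """Extend the lease; fails if it has already expired."""
        if not self.is_valid(now):
            return False
        self.expires_at = now + self.duration
        return True


def act_as_leader(lease: Lease, now: float) -> str:
    # The holder checks its OWN lease before every leader action, so a
    # partitioned node steps down without being told to.
    return "leading" if lease.is_valid(now) else "stepped down"


lease = Lease("node-a", duration=10.0, now=0.0)
print(act_as_leader(lease, now=5.0))    # leading
print(lease.renew(now=8.0))             # True, lease now expires at 18.0
print(act_as_leader(lease, now=17.0))   # leading
# Partition: no renewal gets through after t=8
print(act_as_leader(lease, now=18.5))   # stepped down
print(lease.renew(now=19.0))            # False
```

The key property is that expiry is decided by the holder's own clock, so no message needs to reach a partitioned node for it to stop acting as leader.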

Code Example

Heartbeat System (Python)

import time
import threading
from dataclasses import dataclass, field
from typing import Dict, Callable, Optional
from enum import Enum

class NodeStatus(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"
    DEAD = "dead"

@dataclass
class NodeState:
    """Tracked state for a node."""
    node_id: str
    last_heartbeat: float
    status: NodeStatus = NodeStatus.ALIVE
    metadata: dict = field(default_factory=dict)

class HeartbeatMonitor:
    """
    Centralized heartbeat monitor.

    Tracks node liveness and triggers callbacks on status changes.
    """

    def __init__(
        self,
        timeout: float = 3.0,
        check_interval: float = 1.0,
        on_suspected: Optional[Callable[[str], None]] = None,
        on_dead: Optional[Callable[[str], None]] = None,
        on_alive: Optional[Callable[[str], None]] = None,
    ):
        """
        Args:
            timeout: Seconds without heartbeat before suspected
            check_interval: How often to check for expired nodes
            on_suspected: Callback when node first suspected
            on_dead: Callback when node confirmed dead
            on_alive: Callback when node comes back alive
        """
        self.timeout = timeout
        self.check_interval = check_interval
        self.on_suspected = on_suspected
        self.on_dead = on_dead
        self.on_alive = on_alive

        self.nodes: Dict[str, NodeState] = {}
        self._lock = threading.Lock()
        self._running = False
        self._checker_thread: Optional[threading.Thread] = None

    def receive_heartbeat(
        self,
        node_id: str,
        metadata: Optional[dict] = None
    ) -> None:
        """Process incoming heartbeat from a node."""
        # Monotonic clock: unaffected by wall-clock adjustments (NTP, manual)
        now = time.monotonic()
        notify_alive = False

        with self._lock:
            node = self.nodes.get(node_id)
            if node is None:
                # First heartbeat from this node: register it
                self.nodes[node_id] = NodeState(
                    node_id=node_id,
                    last_heartbeat=now,
                    metadata=metadata or {}
                )
                notify_alive = True
            else:
                # Recovery: node was suspected or dead and is back
                notify_alive = node.status is not NodeStatus.ALIVE
                node.last_heartbeat = now
                node.status = NodeStatus.ALIVE
                if metadata:
                    node.metadata = metadata

        # Fire the callback outside the lock so it can safely call back
        # into the monitor (threading.Lock is not reentrant)
        if notify_alive and self.on_alive:
            self.on_alive(node_id)

    def _check_expired(self) -> None:
        """Check for nodes that have missed heartbeats."""
        while self._running:
            now = time.monotonic()
            events = []  # (callback, node_id) pairs to fire after unlocking

            with self._lock:
                for node in self.nodes.values():
                    elapsed = now - node.last_heartbeat

                    if elapsed > self.timeout * 2:
                        # Silent for two timeouts: confirm as dead
                        if node.status != NodeStatus.DEAD:
                            node.status = NodeStatus.DEAD
                            events.append((self.on_dead, node.node_id))

                    elif elapsed > self.timeout:
                        # Silent for one timeout: suspect only
                        if node.status == NodeStatus.ALIVE:
                            node.status = NodeStatus.SUSPECTED
                            events.append((self.on_suspected, node.node_id))

            for callback, node_id in events:
                if callback:
                    callback(node_id)

            time.sleep(self.check_interval)

    def start(self) -> None:
        """Start the background checker thread."""
        self._running = True
        self._checker_thread = threading.Thread(
            target=self._check_expired,
            daemon=True
        )
        self._checker_thread.start()

    def stop(self) -> None:
        """Stop the background checker thread."""
        self._running = False
        if self._checker_thread:
            self._checker_thread.join()

    def get_status(self, node_id: str) -> Optional[NodeStatus]:
        """Get current status of a node."""
        with self._lock:
            if node_id in self.nodes:
                return self.nodes[node_id].status
            return None

    def get_alive_nodes(self) -> list[str]:
        """Get list of currently alive nodes."""
        with self._lock:
            return [
                n.node_id for n in self.nodes.values()
                if n.status == NodeStatus.ALIVE
            ]


class HeartbeatSender:
    """Sends periodic heartbeats to a monitor."""

    def __init__(
        self,
        node_id: str,
        monitor: HeartbeatMonitor,
        interval: float = 1.0,
        metadata_fn: Optional[Callable[[], dict]] = None
    ):
        self.node_id = node_id
        self.monitor = monitor
        self.interval = interval
        self.metadata_fn = metadata_fn

        self._running = False
        self._sender_thread: Optional[threading.Thread] = None

    def _send_loop(self) -> None:
        """Continuously send heartbeats."""
        while self._running:
            metadata = self.metadata_fn() if self.metadata_fn else None
            self.monitor.receive_heartbeat(self.node_id, metadata)
            time.sleep(self.interval)

    def start(self) -> None:
        """Start sending heartbeats."""
        self._running = True
        self._sender_thread = threading.Thread(
            target=self._send_loop,
            daemon=True
        )
        self._sender_thread.start()

    def stop(self) -> None:
        """Stop sending heartbeats."""
        self._running = False
        if self._sender_thread:
            self._sender_thread.join()


# Usage example
if __name__ == "__main__":
    print("=== Heartbeat Demo ===\n")

    # Create monitor with callbacks
    def on_suspected(node_id: str):
        print(f"⚠️  Node {node_id} SUSPECTED (missing heartbeats)")

    def on_dead(node_id: str):
        print(f"💀 Node {node_id} DEAD (confirmed failure)")

    def on_alive(node_id: str):
        print(f"✅ Node {node_id} ALIVE")

    monitor = HeartbeatMonitor(
        timeout=2.0,
        check_interval=0.5,
        on_suspected=on_suspected,
        on_dead=on_dead,
        on_alive=on_alive
    )
    monitor.start()

    # Create senders (workers)
    worker1 = HeartbeatSender("worker-1", monitor, interval=0.5)
    worker2 = HeartbeatSender("worker-2", monitor, interval=0.5)

    worker1.start()
    worker2.start()

    print("Workers started, sending heartbeats...\n")
    time.sleep(3)

    print("\nSimulating worker-1 failure (stopping heartbeats)...")
    worker1.stop()

    # Wait for detection
    time.sleep(5)

    print("\nFinal status:")
    print(f"  worker-1: {monitor.get_status('worker-1')}")
    print(f"  worker-2: {monitor.get_status('worker-2')}")
    print(f"  Alive nodes: {monitor.get_alive_nodes()}")

    # Cleanup
    worker2.stop()
    monitor.stop()

Heartbeat with Metadata (Production Pattern)

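import psutil  # third-party package: pip install psutil

def get_node_health() -> dict:
    """Collect node health metrics to include in heartbeat."""
    return {
        # Non-blocking: reports usage since the previous call
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage('/').percent,
        "load_avg": psutil.getloadavg()[0],  # 1-minute load average
        # May need elevated privileges on some platforms (e.g. macOS)
        "connections": len(psutil.net_connections()),
    }

# Usage (reuses `monitor` and HeartbeatSender from the example above)
sender = HeartbeatSender(
    node_id="worker-1",
    monitor=monitor,
    interval=1.0,
    metadata_fn=get_node_health  # Include health in each heartbeat
)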

Quick Self-Check

  • Can explain heartbeats in 60 seconds?
  • Understand the trade-off between detection speed and false positives?
  • Know the difference between heartbeat (push) and health check (pull)?
  • Can implement a basic heartbeat monitor with timeouts?
  • Understand why timeout = 3× interval is a common rule of thumb?
  • Know when to use heartbeats vs leases?

Production signal

Why this concept matters

Interview: 65% of distributed systems interviews
Production: All distributed systems
Performance: Failure detection latency
Scale: O(N) messages per interval