
Health Checks

Summary

Failure detection mechanisms in distributed systems: how to determine whether a node is alive, dead, or just slow, so that failover and self-healing can happen automatically.

TL;DR

Health checks are probes that determine if a service is alive and functioning correctly. They enable load balancers to route traffic away from failing nodes, orchestrators to restart unhealthy containers, and distributed systems to trigger failover. The key trade-off: faster detection means more false positives.

Visual Overview

Health Check Overview
HEALTH CHECK TYPES

  LIVENESS: "Is the process running?"
    Service: /health/live → 200 OK
    Failure action: Restart container

  READINESS: "Can it handle requests?"
    Service: /health/ready → 503 (DB connection down)
    Failure action: Stop routing traffic

  STARTUP: "Has it finished initializing?"
    Service: /health/startup → 200 OK (after warm-up complete)
    Failure action: Wait before checking liveness


HEARTBEAT PROTOCOL

  Timeline: 0s, 1s, 2s, 3s (a "?" marks a heartbeat that did not arrive)

  Heartbeat every 1s
  Timeout after 3s → Mark node as DEAD

  Trade-off:
  • Short timeout → Fast detection, more false positives
  • Long timeout → Slow detection, fewer false positives


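FAILURE DETECTION IN CLUSTERS

  Push (Heartbeat):
  [Node] → heartbeat → [Monitor]
  No heartbeat before the timeout → marked dead

  Pull (Polling):
  [Monitor] → ? → [Node] → 200
  [Monitor] → ? → [Node] → 200
  [Monitor] → ? → [Node] → no response → marked dead

  Gossip (Peer-to-peer):
  [N1] ↔ [N2]
  [N2] ↔ [N3]
  [N3] ↔ [N4]
  Each node shares failure info with neighbors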

Why Health Checks Matter

The Fundamental Problem

How do you know if a remote node is dead or just slow?

Scenario | Network Response | Reality
Node crashed | Timeout | Dead
Node overloaded | Timeout | Alive but struggling
Network partition | Timeout | Alive but unreachable
GC pause | Timeout, then responds | Alive

All look the same to the caller: No response within timeout.

Impact of Getting It Wrong

Detection | False Positive | False Negative
Too aggressive | Healthy nodes marked dead, cascading restarts | -
Too conservative | - | Dead nodes continue receiving traffic

Health Check Patterns

1. HTTP Health Endpoints

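from flask import Flask

app = Flask(__name__)

# `db` and `cache` are assumed to be the application's existing client
# objects, each exposing an is_connected() method.

# Simple liveness check: succeeds whenever the process can serve HTTP
@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200

# Readiness check: fails while a required dependency is unreachable
@app.route('/health/ready')
def readiness():
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    if not cache.is_connected():
        return {'status': 'not ready', 'reason': 'Cache unavailable'}, 503
    return {'status': 'ready'}, 200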

2. TCP Health Checks

For non-HTTP services:

  • Open TCP connection to port
  • Success = port responding
  • Used by: AWS ELB, HAProxy
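A TCP check of this kind can be sketched with the Python standard library. The function name `tcp_health_check` and the 2-second default timeout are illustrative assumptions, not settings taken from any particular load balancer.

```python
import socket


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; the `with` block
        # closes the socket again as soon as the handshake has succeeded.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

A successful connect only shows that something is accepting connections on the port. It says nothing about whether the application behind it can do useful work, which is why HTTP or gRPC checks are preferred where the protocol allows.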

3. gRPC Health Checking Protocol

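syntax = "proto3";

package grpc.health.v1;

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;  // Empty string asks about the server as a whole
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;  // Used only by Watch
  }
  ServingStatus status = 1;
}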

Heartbeat Protocols

Push-Based (Heartbeats)

Node periodically sends “I’m alive” to monitor.

Pros: Lower monitor load
Cons: Dead node = silence (must distinguish from network issues)
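The monitor side of a push-based scheme can be sketched as a table of last-seen timestamps. The class name `HeartbeatMonitor`, its method names, and the injectable `clock` parameter are assumptions made for illustration; the 3-second default mirrors the timeout used in the overview above.

```python
import time


class HeartbeatMonitor:
    """Tracks when each node last reported in and flags nodes that went silent."""

    def __init__(self, timeout_s: float = 3.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}     # node_id -> timestamp of the latest heartbeat

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever an "I'm alive" message arrives from a node."""
        self.last_seen[node_id] = self.clock()

    def is_alive(self, node_id: str) -> bool:
        """A node is alive if it has been heard from within the timeout."""
        last = self.last_seen.get(node_id)
        if last is None:
            return False        # never heard from this node
        return self.clock() - last <= self.timeout_s

    def dead_nodes(self) -> list:
        """Known nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

Note that the monitor cannot tell a crashed node from a partitioned one: both simply stop appearing in `record_heartbeat`.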

Pull-Based (Polling)

Monitor periodically checks each node.

Pros: Centralized view
Cons: Monitor overload at scale
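A pull-based monitor might look like the following sketch. The `Poller` class and its `check` callback are hypothetical; in practice `check` would wrap an HTTP, TCP, or gRPC probe. The consecutive-failure threshold is an added assumption that keeps a single dropped probe from marking a node dead.

```python
from typing import Callable, Dict, Iterable


class Poller:
    """Polls nodes and marks one dead after N consecutive failed checks."""

    def __init__(self, check: Callable[[str], bool], failure_threshold: int = 3):
        self.check = check                        # returns True if the node answered
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}        # node -> consecutive failure count

    def poll_once(self, nodes: Iterable[str]) -> Dict[str, str]:
        """Run one polling round and return each node's current status."""
        status = {}
        for node in nodes:
            if self.check(node):
                self.failures[node] = 0           # any success resets the streak
            else:
                self.failures[node] = self.failures.get(node, 0) + 1
            dead = self.failures[node] >= self.failure_threshold
            status[node] = "dead" if dead else "alive"
        return status
```

Each round costs one probe per node, so the monitor's work grows linearly with cluster size, which is the overload risk noted above.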

Gossip-Based

Nodes share health info peer-to-peer.

Pros: Scalable, no single point of failure
Cons: Eventually consistent detection
Used by: Cassandra, Consul
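One simple way to picture gossip-based health sharing is a per-node table of heartbeat counters that peers merge when they talk. The sketch below is a deliberately simplified illustration, not the protocol used by Cassandra or Consul; the function name `merge_views` and the counter representation are assumptions.

```python
def merge_views(local: dict, remote: dict) -> dict:
    """Merge two {node_id: heartbeat_counter} views, keeping the newest counters.

    A higher counter means fresher information about that node, so the merge
    takes the maximum per node. The inputs are left unmodified.
    """
    merged = dict(local)
    for node, counter in remote.items():
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged


# N1 gossips with N2, then N2 gossips with N3: N1's news reaches N3
# without N1 and N3 ever talking directly.
n1 = {"n1": 10}
n2 = merge_views({"n2": 4}, n1)
n3 = merge_views({"n3": 7}, n2)
```

A node whose counter stops increasing in everyone's view is eventually suspected by all peers, which is where the "eventually consistent detection" cost comes from.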

Phi Accrual Failure Detector

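Instead of a binary alive/dead verdict, compute a suspicion level that grows the longer a heartbeat is overdue:

Phi (φ) = -log10(P_later(t))

Here t is the time elapsed since the last heartbeat, and P_later(t) is the probability, estimated from the history of heartbeat intervals, that a live node's next heartbeat would arrive later than t.

φ = 1 → 10% chance that declaring the node dead is a mistake
φ = 2 → 1% chance that declaring the node dead is a mistake
φ = 8 → 0.000001% chance that declaring the node dead is a mistake

Threshold: Mark dead when φ > 8

Advantage: Adapts to network conditions automatically.
Used by: Cassandra, Akka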
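The calculation can be sketched as follows. This version assumes heartbeat intervals follow an exponential distribution, which makes P(delay > t) = exp(-t / mean) and keeps the arithmetic short; that distribution is a simplifying assumption, and the class name `PhiAccrualDetector`, the window size, and the method names are illustrative.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Suspicion level based on how overdue the next heartbeat is."""

    def __init__(self, threshold: float = 8.0, window: int = 100):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(delay > elapsed)) under the exponential assumption."""
        if not self.intervals:
            return 0.0                          # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now: float) -> bool:
        return self.phi(now) > self.threshold
```

Under this assumption the threshold scales with the observed heartbeat rhythm: with a mean interval of 1s, φ exceeds 8 after about 18.4s of silence, while with a mean interval of 0.1s it takes about 1.84s.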

Kubernetes Health Probes

apiVersion: v1
kind: Pod
metadata:
  name: app                      # placeholder name
spec:
  containers:
  - name: app
    image: example/app:1.0       # placeholder image
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10    # Wait after start
      periodSeconds: 5           # Check every 5s
      timeoutSeconds: 3          # Timeout per check
      failureThreshold: 3        # Failures before restart

    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 1        # Remove from service immediately

Configuration Trade-offs

Parameter | Lower Value | Higher Value
Check interval | Faster detection, more load | Slower detection, less load
Timeout | More false positives | Misses slow failures
Failure threshold | Quick failover | Tolerates transient issues
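The interaction between these three parameters can be made concrete with a rough estimate of how long a hard failure goes unnoticed. The helper below is a back-of-the-envelope sketch: the name `detection_window` is an assumption, and the model ignores scheduling jitter and any initial delay.

```python
def detection_window(period_s: float, timeout_s: float, failure_threshold: int):
    """Rough (best, worst) seconds between a hard failure and the action firing.

    Best case: the node dies just before a probe and every probe fails fast,
    so only the gaps between the consecutive failing probes are spent.
    Worst case: the node dies just after a successful probe, and the final
    failing probe runs all the way to its timeout.
    """
    best = (failure_threshold - 1) * period_s
    worst = failure_threshold * period_s + timeout_s
    return best, worst
```

For a check every 5s with a 3s timeout and a threshold of 3, `detection_window(5, 3, 3)` gives `(10, 18)`: the restart fires roughly 10 to 18 seconds after the process stops responding.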

Production Recommendations

# Typical production settings
liveness:
  interval: 10s
  timeout: 5s
  failureThreshold: 3      # ~30s of consecutive failures before restart (3 × 10s)

readiness:
  interval: 5s
  timeout: 3s
  failureThreshold: 1      # Immediate traffic removal

Anti-Patterns

1. Health Check Does Too Much

# BAD: Health check that takes 30 seconds
@app.route('/health')
def health():
    run_full_database_integrity_check()  # Takes 30s!
    return {'status': 'healthy'}

Health checks should be fast (under 100ms).
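One way to keep the endpoint fast while still running an expensive check is to run the check off the request path and have the endpoint report the most recent result. The sketch below assumes a hypothetical `BackgroundCheck` helper whose `refresh()` is driven by a background thread or scheduler that is not shown.

```python
import time


class BackgroundCheck:
    """Keeps the latest result of a slow check so the endpoint never runs it."""

    def __init__(self, check, max_age_s: float = 30.0, clock=time.monotonic):
        self.check = check            # slow callable returning True/False
        self.max_age_s = max_age_s    # results older than this count as unhealthy
        self.clock = clock
        self._result = False
        self._checked_at = None

    def refresh(self) -> None:
        """Run the slow check; call this from a background thread or scheduler."""
        try:
            result = bool(self.check())
        except Exception:
            result = False            # a crashing check is treated as a failure
        self._result = result
        self._checked_at = self.clock()

    def is_healthy(self) -> bool:
        """Cheap lookup for the health endpoint; stale or missing results fail."""
        if self._checked_at is None:
            return False
        if self.clock() - self._checked_at > self.max_age_s:
            return False
        return self._result
```

The health endpoint then only calls `is_healthy()`, which is a couple of comparisons regardless of how long the underlying check takes. Treating stale results as unhealthy guards against the background job itself dying silently.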

2. No Dependency Isolation

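# BAD: a failure in any dependency, critical or not, fails readiness
# (each check_* helper is assumed to raise on failure)
@app.route('/health/ready')
def ready():
    check_database()     # Critical
    check_analytics_db() # Not critical for serving traffic
    return {'status': 'ready'}, 200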

Only check critical dependencies in readiness.
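A readiness evaluation that separates critical from non-critical dependencies could be sketched like this. The function name `evaluate_readiness`, the `(name, check_fn, critical)` tuple shape, and the `degraded` field are assumptions for illustration; the return value is a `(body, status_code)` pair that a Flask view could return directly.

```python
def evaluate_readiness(checks):
    """Evaluate (name, check_fn, critical) tuples into a (body, status_code) pair.

    Only failed critical dependencies make the service not ready. Failed
    non-critical ones are reported as degraded but do not remove the node
    from rotation.
    """
    failed_critical = []
    degraded = []
    for name, check_fn, critical in checks:
        try:
            ok = bool(check_fn())
        except Exception:
            ok = False                # a check that raises counts as failed
        if ok:
            continue
        if critical:
            failed_critical.append(name)
        else:
            degraded.append(name)

    if failed_critical:
        body = {"status": "not ready", "failed": failed_critical, "degraded": degraded}
        return body, 503
    return {"status": "ready", "degraded": degraded}, 200
```

With this shape, an analytics database outage shows up in the response body for operators to see, while the node keeps serving traffic.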

3. Cascading Failures

If health checks start failing under load, the failing nodes are taken out of rotation → the remaining nodes absorb their traffic → they become overloaded and fail too.

Solution: Circuit breakers, gradual rollout, load shedding.
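Of the mitigations listed, load shedding is the simplest to sketch: the node rejects excess requests quickly so that the work it does accept, including health checks, still completes in time. The `LoadShedder` class and its in-flight limit below are illustrative assumptions, not a specific library's API.

```python
import threading


class LoadShedder:
    """Admits a request only while fewer than max_in_flight are being served."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True and count the request, or False if it should be shed."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False          # caller should answer 503 right away
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes, successfully or not."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A request handler would call `try_acquire()` first, return an immediate 503 when it yields `False`, and call `release()` in a `finally` block otherwise. Health endpoints are usually left outside the limiter so that an overloaded but functioning node is not mistaken for a dead one.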

Used In Systems:

  • Kubernetes (liveness/readiness probes)
  • AWS ELB/ALB (target health checks)
  • Consul (service health checks)
  • Every HA deployment

Next Recommended: Failover - Learn what happens after detecting a failure

Production signal

Why this concept matters

Interview: 60% of production-focused interviews
Production: Every HA system
Performance: Detection latency vs false positives
Scale: O(N) or O(N²) depending on protocol