Health Checks

TL;DR

Health checks are probes that determine if a service is alive and functioning correctly. They enable load balancers to route traffic away from failing nodes, orchestrators to restart unhealthy containers, and distributed systems to trigger failover. The key trade-off: faster detection means more false positives.

Visual Overview


HEALTH CHECK TYPES
┌─────────────────────────────────────────────────┐
│                                                 │
│  LIVENESS: "Is the process running?"            │
│  ┌──────────┐                                   │
│  │  Service │ ─── /health/live → 200 OK         │
│  └──────────┘                                   │
│  Failure action: Restart container              │
│                                                 │
│  READINESS: "Can it handle requests?"           │
│  ┌──────────┐                                   │
│  │  Service │ ─── /health/ready → 503           │
│  └──────────┘     (DB connection down)          │
│  Failure action: Stop routing traffic           │
│                                                 │
│  STARTUP: "Has it finished initializing?"       │
│  ┌──────────┐                                   │
│  │  Service │ ─── /health/startup → 200 OK      │
│  └──────────┘     (after warm-up complete)      │
│  Failure action: Wait before checking liveness  │
│                                                 │
└─────────────────────────────────────────────────┘

HEARTBEAT PROTOCOL
┌─────────────────────────────────────────────────┐
│                                                 │
│  Time →                                         │
│  ┌────┐     ┌────┐     ┌────┐     ┌────┐        │
│  │ ♥  │     │ ♥  │     │ ♥  │     │ ?  │        │
│  └────┘     └────┘     └────┘     └────┘        │
│    0s        1s         2s         3s           │
│                                                 │
│  Heartbeat every 1s                             │
│  Timeout after 3s → Mark node as DEAD           │
│                                                 │
│  Trade-off:                                     │
│  • Short timeout → Fast detection, false +      │
│  • Long timeout → Slow detection, fewer false + │
│                                                 │
└─────────────────────────────────────────────────┘

FAILURE DETECTION IN CLUSTERS
┌─────────────────────────────────────────────────┐
│                                                 │
│  Push (Heartbeat):                              │
│  [Node] ──♥──→ [Monitor]                        │
│          ──♥──→                                 │
│          ──✗──→ timeout → marked dead           │
│                                                 │
│  Pull (Polling):                                │
│  [Monitor] ──?──→ [Node] ──200──→               │
│            ──?──→        ──200──→               │
│            ──?──→        ──✗──→ marked dead     │
│                                                 │
│  Gossip (Peer-to-peer):                         │
│  [N1]──────[N2]                                 │
│    \      /                                     │
│     \    /                                      │
│      [N3]─────[N4]                              │
│  Each node shares failure info with neighbors   │
│                                                 │
└─────────────────────────────────────────────────┘

Why Health Checks Matter

The Fundamental Problem

How do you know if a remote node is dead or just slow?

Scenario	Network Response	Reality
Node crashed	Timeout	Dead
Node overloaded	Timeout	Alive but struggling
Network partition	Timeout	Alive but unreachable
GC pause	Timeout then responds	Alive

All four look the same to the caller: no response within the timeout.

Impact of Getting It Wrong

Detection	False Positive	False Negative
Too aggressive	Healthy nodes marked dead, cascading restarts	-
Too conservative	-	Dead nodes continue receiving traffic

Health Check Patterns

1. HTTP Health Endpoints

from flask import Flask

app = Flask(__name__)

# Simple liveness check: succeeds as long as the process can serve a request
@app.route('/health/live')
def liveness():
    return {'status': 'alive'}, 200

# Readiness check with dependencies
# (db and cache stand for the application's own client objects)
@app.route('/health/ready')
def readiness():
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'DB unavailable'}, 503
    if not cache.is_connected():
        return {'status': 'not ready', 'reason': 'Cache unavailable'}, 503
    return {'status': 'ready'}, 200

2. TCP Health Checks

For non-HTTP services:

Open a TCP connection to the port
Success = the handshake completes (the port is accepting connections)
Used by: AWS ELB, HAProxy
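
A minimal sketch of such a check in Python, using only the standard library. The function name tcp_check and the 1-second default timeout are illustrative assumptions, not settings taken from any particular load balancer. A successful connect only shows that something is listening on the port; it says nothing about whether the application behind it is working.

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection performs the TCP handshake; the socket is closed right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS errors, and timeouts.
        return False
```

A caller would typically run this on a fixed interval and act only after several consecutive failures.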

3. gRPC Health Checking Protocol

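syntax = "proto3";

package grpc.health.v1;

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;  // Empty string asks about the server as a whole
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;  // Used only by the Watch method
  }
  ServingStatus status = 1;
}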

Heartbeat Protocols

Push-Based (Heartbeats)

Node periodically sends “I’m alive” to monitor.

Pros: Lower monitor load
Cons: Dead node = silence (must distinguish from network issues)
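
The monitor side of a push-based scheme can be sketched as follows. The class name HeartbeatMonitor, the 3-second timeout, and the injectable clock are assumptions made for illustration; a real monitor would also need a network listener that calls record() when a heartbeat arrives.

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that have gone silent."""

    def __init__(self, timeout: float = 3.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock        # injectable so tests can control time
        self.last_seen = {}       # node id -> time of the last heartbeat

    def record(self, node_id: str) -> None:
        """Call whenever a heartbeat arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self) -> list:
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

Note that the monitor can only report silence. Whether the node crashed or the network dropped its heartbeats is not visible from here.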

Pull-Based (Polling)

Monitor periodically checks each node.

Pros: Centralized view
Cons: Monitor overload at scale
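
A rough sketch of the polling side, with a consecutive-failure threshold so that one dropped probe does not mark a node unhealthy. The Poller class and the probe callable are hypothetical; in practice probe would be an HTTP or TCP check with its own timeout.

```python
from typing import Callable, Dict, Iterable

class Poller:
    """Pull-based checker: probes each node and counts consecutive failures."""

    def __init__(self, probe: Callable[[str], bool], failure_threshold: int = 3):
        self.probe = probe                        # True if the node answered in time
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}        # node -> consecutive failed probes

    def poll(self, nodes: Iterable[str]) -> Dict[str, bool]:
        """Run one polling round and return node -> healthy flag."""
        status = {}
        for node in nodes:
            if self.probe(node):
                self.failures[node] = 0           # any success resets the streak
            else:
                self.failures[node] = self.failures.get(node, 0) + 1
            status[node] = self.failures[node] < self.failure_threshold
        return status
```

Because every round probes every node, the monitor's work grows linearly with cluster size, which is the scaling limit noted above.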

Gossip-Based

Nodes share health info peer-to-peer.

Pros: Scalable, no single point of failure
Cons: Eventually consistent detection
Used by: Cassandra, Consul
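
One common way to build this is for each node to keep a table of heartbeat counters and merge it with a peer's table on every exchange. The sketch below illustrates that idea only; the GossipNode class, its method names, and the 5-second fail_after value are assumptions, and it is not how Cassandra or Consul implement gossip.

```python
class GossipNode:
    """Keeps a heartbeat counter per peer and merges tables received from others."""

    def __init__(self, name: str, fail_after: float = 5.0):
        self.name = name
        self.fail_after = fail_after
        # peer -> (heartbeat counter, local time at which the counter last increased)
        self.table = {name: (0, 0.0)}

    def tick(self, now: float) -> None:
        """Increment our own heartbeat counter."""
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, now)

    def digest(self) -> dict:
        """What gets sent to a peer: counters only, since clocks are not shared."""
        return {peer: counter for peer, (counter, _) in self.table.items()}

    def merge(self, remote: dict, now: float) -> None:
        """Adopt any counter newer than ours and stamp it with local time."""
        for peer, counter in remote.items():
            known, _ = self.table.get(peer, (-1, now))
            if counter > known:
                self.table[peer] = (counter, now)

    def suspected(self, now: float) -> list:
        """Peers whose counter has not advanced within fail_after seconds."""
        return sorted(p for p, (_, seen) in self.table.items()
                      if p != self.name and now - seen > self.fail_after)
```

A node can learn that a peer is alive without ever talking to it directly, which is where the scalability comes from. The cost is that news of a failure takes several gossip rounds to reach everyone.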

Phi Accrual Failure Detector

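Instead of a binary alive/dead verdict, compute a suspicion level from the history of heartbeat arrival times:

Phi (φ) = -log10(P_later(t))

where t is the time since the last heartbeat and P_later(t) is the probability, estimated from recent heartbeat intervals, that a live node's next heartbeat would arrive later than t.

φ = 1 → 10% chance the node is alive and its heartbeat is just late
φ = 2 → 1% chance
φ = 8 → 0.000001% chance

Threshold: Mark dead when φ > 8

Advantage: Adapts to network conditions automatically.
Used by: Cassandra, Akka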
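
A minimal sketch of the calculation, assuming heartbeat intervals follow an exponential distribution so that the probability of a gap longer than t is exp(-t / mean). That assumption is a simplification chosen to keep the math short; the original phi accrual paper models intervals with a normal distribution. The class name and window size are illustrative.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Computes phi from a sliding window of observed heartbeat intervals."""

    def __init__(self, window_size: int = 100):
        self.intervals = deque(maxlen=window_size)   # recent gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat that arrived at time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level at time `now`; 0.0 until there is enough history."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # With P(gap > t) = exp(-t / mean): phi = -log10(P) = t / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

With heartbeats arriving once per second, phi crosses 8 after roughly 18 seconds of silence (8 × ln 10 ≈ 18.4). If heartbeats normally arrive every 5 seconds, the same threshold is reached after about 92 seconds, which is the automatic adaptation described above.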

Kubernetes Health Probes

apiVersion: v1
kind: Pod
metadata:
  name: app                      # placeholder name
spec:
  containers:
  - name: app
    image: example/app:1.0       # placeholder image
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10    # Wait after start
      periodSeconds: 5           # Check every 5s
      timeoutSeconds: 3          # Timeout per check
      failureThreshold: 3        # Failures before restart

    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 1        # Remove from service after a single failed check

Configuration Trade-offs

Parameter	Lower Value	Higher Value
Check interval	Faster detection, more load	Slower detection, less load
Timeout	More false positives	Slow, degraded nodes still pass
Failure threshold	Quick failover	Tolerates transient issues

Production Recommendations

# Typical production settings
liveness:
  interval: 10s
  timeout: 5s
  failureThreshold: 3      # ~30s (3 x 10s) of failed checks before restart

readiness:
  interval: 5s
  timeout: 3s
  failureThreshold: 1      # Traffic removed after a single failed check
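
To see what these settings mean for detection time, the arithmetic can be sketched as below. This is a simplified model, assuming probes fire at a fixed interval and a failing probe reports somewhere between instantly and the full timeout; the function name detection_window is made up for this example.

```python
def detection_window(interval: float, timeout: float, failure_threshold: int):
    """Return (best_case, worst_case) seconds from failure until the threshold is hit."""
    # Best case: the node fails just before a probe and every probe fails instantly.
    best = (failure_threshold - 1) * interval
    # Worst case: the node fails just after a successful probe, and the final
    # failing probe only reports once the full timeout has elapsed.
    worst = failure_threshold * interval + timeout
    return best, worst

print(detection_window(10, 5, 3))   # liveness settings above  -> (20, 35)
print(detection_window(5, 3, 1))    # readiness settings above -> (0, 8)
```

Under this model, the liveness settings act somewhere between 20 and 35 seconds after the failure, and removal by the readiness check can still take up to 8 seconds.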

Anti-Patterns

1. Health Check Does Too Much

# BAD: Health check that takes 30 seconds
@app.route('/health')
def health():
    run_full_database_integrity_check()  # Takes 30s!
    return {'status': 'healthy'}

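Health checks should be fast (under 100ms). A check that runs longer than the probe timeout is counted as a failure, however healthy the service is.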
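
If an expensive check is still worth running, one option is to run it outside the request path and have the endpoint return the most recent result. The sketch below assumes a hypothetical BackgroundCheck helper whose refresh() method is called from a background thread or scheduler; the names and the 60-second max_age are illustrative.

```python
import time

class BackgroundCheck:
    """Keeps the latest result of a slow check so the health endpoint never waits on it."""

    def __init__(self, check, max_age: float = 60.0, clock=time.monotonic):
        self.check = check          # slow callable returning True/False
        self.max_age = max_age      # results older than this count as unhealthy
        self.clock = clock          # injectable so tests can control time
        self._result = False
        self._checked_at = None

    def refresh(self) -> None:
        """Run the slow check. Call this from a background thread or scheduler."""
        self._result = bool(self.check())
        self._checked_at = self.clock()

    def healthy(self) -> bool:
        """Cheap read for the health endpoint: no I/O, just a cached value."""
        if self._checked_at is None:
            return False            # never checked yet
        fresh = self.clock() - self._checked_at <= self.max_age
        return self._result and fresh
```

Treating a stale result as unhealthy means that a stuck background job shows up as a failing check instead of a permanently green one.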

2. No Dependency Isolation

# BAD: a failure in any dependency, critical or not, fails readiness
@app.route('/health/ready')
def ready():
    check_database()     # Critical
    check_analytics_db() # Not critical for serving traffic
    return {'status': 'ready'}, 200

Only check critical dependencies in readiness.
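
One way to separate the two groups is sketched below as a plain function that a readiness endpoint could call. The function name evaluate_readiness, the 'degraded' field, and the idea of passing checks as callables are assumptions made for this example.

```python
def evaluate_readiness(critical: dict, optional: dict):
    """Return (body, http_status). Only failed critical checks produce a 503.

    Both arguments map a dependency name to a callable returning True when healthy.
    """
    failed_critical = [name for name, check in critical.items() if not check()]
    degraded = [name for name, check in optional.items() if not check()]
    if failed_critical:
        return {'status': 'not ready', 'failed': failed_critical}, 503
    body = {'status': 'ready'}
    if degraded:
        body['degraded'] = degraded   # reported for operators; traffic keeps flowing
    return body, 200
```

With the analytics database down, this returns a 200 with 'degraded': ['analytics_db'], so the node stays in rotation while the problem stays visible.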

3. Cascading Failures

If a health check fails under load, the node is taken out of rotation → its traffic shifts to the remaining nodes → they become overloaded and fail too.

Solution: Circuit breakers, gradual rollout, load shedding.
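
As an illustration of load shedding, the sketch below caps the number of in-flight requests so that excess work is rejected quickly and the process keeps enough headroom to answer its health checks. The LoadShedder class and its method names are hypothetical; the right limit depends on the service.

```python
import threading

class LoadShedder:
    """Caps in-flight requests so the process stays responsive under overload."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot for a request; False means the caller should reply 503."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free the slot once the request has finished."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)
```

A request handler would call try_acquire() on entry, return a 503 immediately when it gets False, and call release() in a finally block otherwise. Health endpoints would bypass the limiter so that a busy node is not mistaken for a dead one.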

Prerequisites:

Distributed Systems Basics - Foundation concepts

Related Concepts:

Failover - What happens when health checks fail
Consensus - Leader election on failure
Load Balancing - Traffic routing away from unhealthy nodes

Used In Systems:

Kubernetes (liveness/readiness probes)
AWS ELB/ALB (target health checks)
Consul (service health checks)
Every HA deployment
