Consensus | Concepts

TL;DR

Consensus is a fundamental problem in distributed systems where multiple nodes must agree on a single value or decision, even in the presence of failures. Algorithms like Raft and Paxos solve this by using leader election, quorums, and log replication to ensure all nodes eventually agree on the same state. Essential for distributed locks, configuration management, and coordination services like etcd, Consul, and ZooKeeper.

Visual Overview

Consensus Overview

THE CONSENSUS PROBLEM
┌────────────────────────────────────────────────┐
│  Goal: N nodes agree on single value           │
│                                                │
│  Node 1 proposes: value = "A"                  │
│  Node 2 proposes: value = "B"                  │
│  Node 3 proposes: value = "A"                  │
│  ↓                                             │
│  Consensus Algorithm runs...                   │
│  ↓                                             │
│  Node 1 decides: value = "A" ✓                 │
│  Node 2 decides: value = "A" ✓                 │
│  Node 3 decides: value = "A" ✓                 │
│                                                │
│  Properties:                                   │
│  1. Agreement: All nodes decide same value     │
│  2. Validity: Decided value was proposed       │
│  3. Termination: All nodes eventually decide   │
└────────────────────────────────────────────────┘

RAFT CONSENSUS (Simplified)
┌────────────────────────────────────────────────┐
│ Phase 1: LEADER ELECTION                       │
│ ┌────┐ ┌────┐ ┌────┐                           │
│ │ N1 │ │ N2 │ │ N3 │                           │
│ └────┘ └────┘ └────┘                           │
│ Timeout → Start election                       │
│ N1 votes for self, requests votes from N2, N3  │
│ N2 votes YES, N3 votes YES                     │
│ N1 becomes LEADER (majority) ✓                 │
│                                                │
│ Phase 2: LOG REPLICATION                       │
│ Leader receives command: SET x=5               │
│ ↓                                              │
│ 1. Leader appends to log: [term=1, SET x=5]    │
│ 2. Leader sends to followers                   │
│ 3. Followers append to logs                    │
│ 4. Followers ACK                               │
│ 5. Leader receives majority ACK ✓              │
│ 6. Leader commits entry                        │
│ 7. Entry applied to state machine              │
│                                                │
│ Result: All nodes have same log, same state    │
└────────────────────────────────────────────────┘

LEADER ELECTION (Detailed)
┌────────────────────────────────────────────────┐
│ Initial State: 3 followers                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐         │
│ │ Follower │ │ Follower │ │ Follower │         │
│ │ N1 │ │ N2 │ │ N3 │                           │
│ └──────────┘ └──────────┘ └──────────┘         │
│                                                │
│ Election Timeout (N1 times out first):         │
│ N1 → Candidate (term=1)                        │
│ N1 votes for self (vote count = 1)             │
│ N1 sends RequestVote to N2, N3                 │
│ ↓                                              │
│ N2 receives RequestVote:                       │
│ - Term=1 (same as N2)                          │
│ - N2 hasn't voted this term → votes YES        │
│ ↓                                              │
│ N3 receives RequestVote:                       │
│ - Term=1, N3 hasn't voted → votes YES          │
│ ↓                                              │
│ N1 receives 2 votes (total 3/3 = majority) ✓   │
│ N1 → Leader                                    │
│ N1 sends heartbeats to N2, N3                  │
│ N2, N3 → Followers                             │
└────────────────────────────────────────────────┘

SPLIT VOTE (and Recovery)
┌────────────────────────────────────────────────┐
│ N1 and N2 timeout simultaneously               │
│ ↓                                              │
│ N1 → Candidate (term=1), votes for self        │
│ N2 → Candidate (term=1), votes for self        │
│ ↓                                              │
│ N1 requests vote from N2, N3                   │
│ N2 requests vote from N1, N3                   │
│ ↓                                              │
│ N3 receives both requests:                     │
│ - Votes for N1 (received first)                │
│ - Rejects N2 (already voted this term)         │
│ ↓                                              │
│ Vote count (majority of 3 = 2):                │
│ N1: 2 votes (self + N3) = majority ✓           │
│ N2: 1 vote (self only) = NOT majority ✗        │
│ ↓                                              │
│ N1 becomes LEADER (term=1)                     │
│ N2 sees N1's heartbeat → reverts to follower   │
│ ↓                                              │
│ Note: a true tie needs an even split (e.g. N3  │
│ unreachable); then random timeouts break it    │
│ on retry — N1 times out first → wins ✓         │
└────────────────────────────────────────────────┘

Core Explanation

What is Consensus?

Consensus is the problem of getting multiple distributed nodes to agree on a single value, even when:

Nodes fail (crash)
Messages are delayed or lost
Network partitions occur

Consensus Properties:

Agreement: All non-faulty nodes decide on the same value
Validity: The decided value must have been proposed by some node
Termination: All non-faulty nodes eventually decide
Integrity: Nodes decide at most once

Real-World Analogies:

Board of directors voting on decision
Jury reaching verdict
Politicians passing legislation

Why Consensus is Hard (FLP Impossibility)

The FLP Result (1985):

FLP Impossibility Result

Fischer, Lynch, Patterson proved:
"In an asynchronous system with even ONE faulty node,
there is NO deterministic algorithm that guarantees
consensus in bounded time"

What this means:

- Async network: Can't distinguish slow vs crashed
- Even 1 failure: Can block progress forever
- Deterministic: No randomness allowed

Real systems work around this by:

1. Timeouts (assume crashed after T seconds)
2. Randomization (random backoff)
3. Partial synchrony (eventual bounds)

Consensus Algorithms

1. Raft (Understandable Consensus)

Raft Design Goals

- Understandable (simpler than Paxos)
- Practical (production-ready)
- Safe (proven correct)

Key Components:

1. Leader Election
2. Log Replication
3. Safety Guarantees

Terms (Logical Clock):
┌────────────────────────────────────┐
│ Term 1: [Leader=N1]                │
│ Term 2: [Leader=N2] (N1 crashed)   │
│ Term 3: [No leader] (split vote)   │
│ Term 4: [Leader=N1]                │
└────────────────────────────────────┘

Terms ensure:

- Only one leader per term
- Stale leaders detected
- Log ordering preserved

Raft Leader Election:

Raft Leader Election

State Machine:
Follower → Candidate → Leader
   ↑         ↓
   └─────────┘ (election timeout)

Election Process:

1. Follower waits for heartbeat from leader
2. If timeout (150-300ms random):
 - Increment term
 - Become candidate
 - Vote for self
 - Send RequestVote to all peers

3. Receive votes:
 - Majority? → Become leader
 - Another leader elected? → Become follower
 - Timeout? → Start new election

4. Leader sends heartbeats (prevent new elections)

Safety: Only one leader per term (majority quorum)

Raft Log Replication:

Raft Log Replication

Log Structure:
┌────────────────────────────────────┐
│ Index:  1    2    3    4    5      │
│ Term:   1    1    1    2    3      │
│ Cmd:    x=3  y=9  x=5  y=2  x=1    │
│ Status: ✓    ✓    ✓    ✓    ?      │
│         (committed)       (pending)│
└────────────────────────────────────┘

Replication Steps:

1. Leader receives command from client
2. Leader appends to local log (uncommitted)
3. Leader sends AppendEntries RPC to followers
4. Followers append to logs, return ACK
5. Leader receives majority ACK → commit entry
6. Leader applies to state machine
7. Leader notifies followers to commit
8. Followers apply to state machines

Safety Rules:

- Log Matching: Same index+term → identical logs
- Leader Completeness: Leader has all committed entries
- State Machine Safety: Same log → same state

2. Paxos (Classic Consensus)

Paxos Phases

Phase 1: PREPARE (Leader Election)
┌────────────────────────────────────────┐
│ Proposer:                              │
│ - Generates proposal number N          │
│ - Sends PREPARE(N) to acceptors        │
│ ↓                                      │
│ Acceptor:                              │
│ - If N > highest seen:                 │
│ → Promise not to accept N' < N         │
│ → Return any accepted value            │
│ - Else: Reject                         │
└────────────────────────────────────────┘

Phase 2: ACCEPT (Propose Value)
┌────────────────────────────────────────┐
│ Proposer:                              │
│ - Receives majority promises           │
│ - Choose value (or use returned value) │
│ - Send ACCEPT(N, V) to acceptors       │
│ ↓                                      │
│ Acceptor:                              │
│ - If N >= promised:                    │
│ → Accept (N, V)                        │
│ → Notify learners                      │
│ - Else: Reject                         │
└────────────────────────────────────────┘

Safety: Once value chosen, never changes
Liveness: May not terminate (dueling proposers)

Paxos vs Raft:

Paxos vs Raft Comparison

Paxos:
+ Theoretical foundation (proven in 1989)
+ More flexible (multi-leader variants)
- Complex to understand
- Hard to implement correctly

Raft:

- Easier to understand (designed for clarity)
- Easier to implement (clear leader)
- Better for teaching and adoption

* Slightly less flexible than Multi-Paxos

In Practice:

- etcd uses Raft
- Google Chubby uses Paxos
- Both work well in production

Handling Failures

Node Failures:

Node Failures

Scenario: 5-node cluster, up to 2 failures tolerable

Leader Fails:

1. Followers detect missing heartbeats (timeout)
2. Followers start elections
3. New leader elected (majority quorum)
4. New leader has all committed entries (safety)
5. Processing resumes

Follower Fails:

1. Leader continues with remaining nodes
2. Leader still has majority (3/5 available)
3. Failed node recovers → catches up from leader

Majority Fails:

1. No majority quorum available
2. Cluster unavailable (cannot make progress)
3. Prevents split-brain (consistency > availability)
4. Wait for nodes to recover

Network Partitions:

Network Partitions

Scenario: 5 nodes split into [3] and [2]

Partition 1: [N1, N2, N3] (majority)

- Can elect leader ✓
- Can commit entries ✓
- Remains available

Partition 2: [N4, N5] (minority)

- Cannot elect leader ✗
- Cannot commit entries ✗
- Becomes unavailable

When partition heals:

- Minority nodes recognize higher term
- Minority nodes become followers
- Minority nodes catch up from leader
- Cluster reunited ✓

Safety: No split-brain (only one partition has quorum)

Use Cases

1. Distributed Configuration

Distributed Configuration

etcd for Kubernetes:
- Store cluster configuration
- Service discovery
- Distributed locks
- Leader election for controllers

Why consensus?

- Consistent view of configuration
- Atomic updates
- Survive node failures

2. Leader Election

Leader Election

Kafka Controller Election:
- One broker is controller
- Controller manages partitions
- If controller fails, elect new one

Using ZooKeeper (consensus-based):

1. Brokers try to create /controller node
2. First to create → becomes controller
3. Others watch node for changes
4. If controller dies, node deleted
5. New election triggered

3. Distributed Locks

Distributed Locks

Acquiring lock with etcd:
1. Client creates unique lease
2. Client writes key with lease
3. Key creation succeeds → lock acquired
4. Other clients see key exists → wait
5. Lease expires → key deleted → lock released

Consensus guarantees:

- Only one client gets lock (linearizable)
- Lock survives client/server failures

Real Systems Using Consensus

System	Algorithm	Use Case	Key Features
etcd	Raft	Kubernetes config, locks	Strongly consistent key-value store
Consul	Raft	Service discovery, config	Multi-datacenter support
ZooKeeper	Zab (Raft-like)	Coordination, leader election	Widely adopted (Kafka, Hadoop)
CockroachDB	Raft	Distributed SQL	Range-level consensus
TiKV	Raft	Distributed key-value	Part of TiDB database
Spanner	Paxos	Google’s distributed database	Global consistency with TrueTime

Case Study: etcd with Raft

etcd with Raft Architecture

etcd Architecture:

┌──────────────────────────────────────────┐
│ etcd Cluster (3 nodes)                   │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐      │
│ │ etcd1 │ │ etcd2 │ │ etcd3 │            │
│ │(Leader) │ │(Follower│ │(Follower│      │
│ └─────────┘ └─────────┘ └─────────┘      │
└──────────────────────────────────────────┘
↑
│ (client requests)
│
┌──────────────────────────────────────────┐
│ Kubernetes API Server                    │
│ - Reads cluster state from etcd          │
│ - Writes updates to etcd                 │
│ - Watches for changes                    │
└──────────────────────────────────────────┘

Write Flow:

1. Client sends PUT /pods/pod-1 to any etcd node
2. If follower: Forward to leader
3. Leader appends to log
4. Leader replicates to followers
5. Majority ACK (2/3) → commit ✓
6. Leader applies to state machine
7. Leader responds to client
8. Followers apply to state machines

Guarantees:
✓ Linearizable reads/writes
✓ Consistent snapshots
✓ Survives minority failures (1/3)
✓ MVCC for historical reads

Case Study: ZooKeeper

ZooKeeper (Zab Protocol)

ZooKeeper Ensemble (3 or 5 nodes):
┌────────────────────────────────────┐
│ Leader: Processes all writes       │
│ Followers: Serve reads             │
│ Observers: Scale reads (no quorum) │
└────────────────────────────────────┘

Kafka Controller Election using ZK:

1. Broker starts, connects to ZooKeeper
2. Tries to create /controller ephemeral node
3. First broker → creates node → becomes controller
4. Other brokers → node exists → become followers
5. All brokers watch /controller for changes
6. Controller dies → ephemeral node deleted
7. All brokers notified → start new election

Guarantees:
✓ Sequential consistency (not linearizable)
✓ Atomic updates
✓ Ordered operations
✓ Session management with leases

When to Use Consensus

✓ Perfect Use Cases

Use Case	Scenario	Requirement	Solution	Benefit
Distributed Configuration Mgmt	Kubernetes cluster configuration	All nodes see same config, survive failures	etcd with Raft	Consistent view, automatic failover
Leader Election	Kafka controller election	Exactly one controller at all times	ZooKeeper consensus	No split-brain, automatic re-election
Distributed Locks	Ensure only one job runs (cron)	Mutual exclusion across servers	etcd lease with Raft	Lock survives failures, no duplicate execution

✕ When NOT to Use

Anti-Pattern	Problem	Alternative	Example
High-Throughput Data Storage	Consensus is slow (requires majority)	Eventual consistency (Cassandra, DynamoDB)	Storing millions of writes/second
Multi-Datacenter, Low Latency	Consensus requires majority across DCs (high latency)	Async replication, conflict resolution	Global social media application
Simple Use Cases	Consensus is complex (operational overhead)	Single leader with backups	Small application with few nodes

Interview Application

Common Interview Question

Q: “How would you implement a distributed lock service that survives node failures and network partitions?”

Strong Answer:

“I’d build a distributed lock service using consensus (Raft):

Architecture:

3 or 5 node cluster running Raft consensus

Lease-based locks with automatic expiration

Strong consistency guarantees (linearizable)

Lock Acquisition:
AcquireLock(lockName, leaseDuration):
  1. Generate unique client ID
  2. Send to leader: CREATE lock/{lockName} with clientID, lease
  3. Raft replicates to majority (quorum)
  4. If successfully created: Return lock token
  5. If already exists: Return failure (lock held)
Lock Release:
ReleaseLock(lockName, clientID):
  1. Send to leader: DELETE lock/{lockName} if owner==clientID
  2. Raft replicates deletion
  3. Majority ACK → lock released
Lease Expiration:

Each lock has a TTL, for example 30 seconds, and the client renews it with a heartbeat every 10 seconds. If the client crashes, the lease expires and the lock is released without a manual cleanup path, which prevents orphaned locks.

Handling Failures:

Client failure is handled by lease expiry. Leader failure is handled by the consensus group: followers detect missed heartbeats, elect a new leader with a majority quorum, and continue from the committed log. During a network partition, only the majority side can grant or release locks; the minority side must reject lock operations because it cannot prove exclusivity.

Consistency Guarantees:

The service gives mutual exclusion because every grant is a committed log entry, avoids deadlock through leases, and survives minority failures. It does not optimize for raw throughput; the point is to make coordination correct.

API Design:
// Acquire lock with 30-second lease
token = lock_service.acquire("my-lock", ttl=30)

if token:
  try:
    // Do critical work
    process_job()
  finally:
    lock_service.release("my-lock", token)
else:
  // Lock held by another client
  retry_later()
Trade-offs:

The trade-off is consensus overhead: lock operations may take 10-50ms, throughput is bounded by the consensus group, and the service is unavailable without a majority. That is acceptable for coordination because correctness matters more than making locks cheap.

Real-World Example: etcd implements this exact design for Kubernetes distributed locks”

Code Example

Simplified Raft-Style Leader Election

import time
import random
import threading
from enum import Enum
from typing import Dict, List

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Simplified Raft node (leader election only)"""

    def __init__(self, node_id: int, peers: List[int]):
        self.node_id = node_id
        self.peers = peers
        self.state = NodeState.FOLLOWER

        self.current_term = 0
        self.voted_for = None
        self.leader_id = None
    # ... omitted: keep concept snippets short
        RaftNode(node_id=2, peers=[1, 2, 3]),
        RaftNode(node_id=3, peers=[1, 2, 3])
    ]

    # Let election run
    time.sleep(5)

    # Stop all nodes
    for node in nodes:
        node.stop()

See It In Action:

Raft Consensus Explainer - ~90 second animated visual explanation

Prerequisites:

Leader-Follower Replication - Replication patterns
Quorum - Majority voting

Related Concepts:

Eventual Consistency - Alternative consistency model
Sharding - Partitioning with consensus

Used In Systems:

etcd: Kubernetes configuration and coordination
Consul: Service discovery and configuration
ZooKeeper: Distributed coordination for Kafka, Hadoop

Explained In Detail:

Distributed Systems Deep Dive - Consensus algorithms in depth

Quick Self-Check

Can explain the consensus problem in 60 seconds?
Understand Raft leader election process?
Know how log replication works?
Can explain how consensus handles network partitions?
Understand when to use consensus vs eventual consistency?
Know real systems using consensus (etcd, ZooKeeper)?