Distributed Systems Basics

TL;DR

A distributed system is a collection of independent computers that appear to users as a single coherent system. The fundamental challenge: coordinating multiple machines over an unreliable network while handling partial failures. Key constraints are captured by the CAP theorem—during network partitions, you must choose between consistency and availability.

Visual Overview

Distributed Systems Overview

THE FUNDAMENTAL PROBLEM
┌──────────────────────────────────────────────────┐
│  Single Machine (Simple)                         │
│  ┌─────────────┐                                 │
│  │   Process   │ ← Everything in one place       │
│  │   Memory    │   No coordination needed        │
│  │   Disk      │   Failures are total            │
│  └─────────────┘                                 │
│                                                  │
│  Distributed System (Complex)                    │
│  ┌────┐    network    ┌────┐    network  ┌────┐  │
│  │ N1 │ ←──────────→ │ N2 │ ←─────────→ │ N3 │   │
│  └────┘              └────┘              └────┘  │
│     ↑ Partial failures, network delays,          │
│       inconsistent views, split brain            │
└──────────────────────────────────────────────────┘

CAP THEOREM: PICK TWO (DURING PARTITION)
┌──────────────────────────────────────────────────┐
│                   Consistency                    │
│                      /                          │
│                     /                           │
│                    /                            │
│                   /  CP                         │
│                  /________                      │
│         Availability    Partition Tolerance      │
│                     AP  /                       │
│                        /                        │
│                                                  │
│  CP: Choose consistency → reject requests        │
│      Examples: ZooKeeper, etcd, Spanner          │
│                                                  │
│  AP: Choose availability → accept stale data     │
│      Examples: Cassandra, DynamoDB               │
│                                                  │
│  Note: P is not optional—partitions WILL happen  │
└──────────────────────────────────────────────────┘

FAILURE MODES
┌──────────────────────────────────────────────────┐
│  1. Crash Failure                                │
│     Node stops, doesn't respond                  │
│     Detection: Heartbeat timeout                 │
│                                                  │
│  2. Network Partition                            │
│     Nodes can't communicate                      │
│     [N1, N2] ──X── [N3, N4, N5]                  │
│                                                  │
│  3. Byzantine Failure                            │
│     Node behaves maliciously/incorrectly         │
│     Hardest to handle (requires 3f+1 nodes)      │
│                                                  │
│  4. Partial Failure                              │
│     Some operations succeed, some fail           │
│     The defining challenge of distributed        │
└──────────────────────────────────────────────────┘

Why Distributed Systems?

The Single Machine Limit

Every system eventually hits the limits of a single machine:

Constraint	Single Machine Limit	Distributed Solution
CPU	~128 cores	1000s of machines
Memory	~12 TB	Petabytes across cluster
Storage	~100 TB	Exabytes across cluster
Availability	Machine fails = down	Redundancy, failover
Latency	Geography bound	Edge locations worldwide

Core Motivations

Scalability: Handle more load than one machine can
Availability: Continue operating despite failures
Latency: Serve users from nearby locations
Cost: Commodity hardware vs expensive mainframes

The CAP Theorem

Proven by Seth Gilbert and Nancy Lynch (2002), building on Eric Brewer’s conjecture:

During a network partition, a distributed system must choose between Consistency and Availability.

The Three Properties

Consistency (C): Every read returns the most recent write or an error. All nodes see the same data at the same time.

Availability (A): Every request to a non-failing node receives a response (no timeouts, no errors).

Partition Tolerance (P): The system continues operating despite arbitrary network partitions.

Why You Can’t Have All Three

Consider two nodes during a network partition:

Node A                    Node B
   │                         │
   │    Network Partition    │
   │        ═══════X═════    │
   │                         │
Client writes X=1 to A       │
   │                         │
   │   Cannot replicate      │
   │   to B (partition)      │
   │                         │
   │                   Client reads X from B
   │                         │
   │        CHOICE:          │
   │   Return X=0 (stale)    │  → AP: Available but inconsistent
   │   Return error          │  → CP: Consistent but unavailable

Real-World CAP Choices

System	Choice	Behavior During Partition
ZooKeeper	CP	Minority partition stops serving
etcd	CP	Requires quorum for reads/writes
Cassandra	AP	Continues serving, eventual consistency
DynamoDB	AP (default)	Eventually consistent reads
Spanner	CP	Sacrifices availability for consistency

Fallacies of Distributed Computing

Classic misconceptions that lead to system failures cluster into two groups.

Transport is never perfect: packets drop, connections break, latency is not zero, bandwidth is finite, and every byte has serialization and network cost. A design that assumes perfect transport usually works in staging and fails under real load or regional distance.

Ownership and topology are not stable either: the network is not automatically secure, nodes join and leave, administration crosses teams and organizations, and production fleets are rarely homogeneous. Good distributed designs treat those facts as normal operating conditions, not rare exceptions.

Key Concepts

Network Partitions

A partition occurs when network failure divides the cluster:

Before Partition:
[N1] ←→ [N2] ←→ [N3] ←→ [N4] ←→ [N5]

After Partition:
[N1] ←→ [N2]     ═══X═══     [N3] ←→ [N4] ←→ [N5]
 └─ Minority ─┘               └─── Majority ───┘

Split-brain: Both sides think they’re the leader → data divergence.

Failure Detection

How do you know if a node is dead or just slow?

Method	Mechanism	Trade-off
Heartbeats	Periodic “I’m alive” messages	Timeout = false positives
Phi Accrual	Probability of failure	More accurate, complex
Gossip	Nodes share failure info	Eventually consistent

Ordering and Time

In distributed systems, “happened before” is ambiguous:

Clock Type	Mechanism	Use Case
Lamport Clocks	Logical counters	Causal ordering
Vector Clocks	Per-node counters	Detect conflicts
TrueTime	Atomic clocks + GPS	Google Spanner
HLC	Hybrid logical/physical	CockroachDB

Patterns for Handling Failures

1. Replication

Copy data to multiple nodes:

Leader-Follower: One writes, others follow
Multi-Leader: Multiple writers, conflict resolution
Leaderless: All nodes equal, quorum writes

2. Quorum Systems

Require majority agreement: W + R > N

N = 5 replicas
W = 3 (write to 3 nodes)
R = 3 (read from 3 nodes)

At least one node in R saw the write in W
→ Guarantees consistency (when W + R > N)

3. Consensus Algorithms

Agree on a single value despite failures:

Paxos: Theoretical foundation, complex
Raft: Understandable consensus
Zab: ZooKeeper’s algorithm

Related Concepts:

Consensus - How nodes agree on values
Eventual Consistency - Relaxed consistency model
Quorum - Majority-based decisions
Leader-Follower Replication - Primary-replica pattern

Used In Systems:

Every cloud-native application
Databases (PostgreSQL, MySQL with replication)
Message queues (Kafka, RabbitMQ)
Coordination services (ZooKeeper, etcd, Consul)

Next Recommended: Consensus - Learn how distributed systems agree on values despite failures