I/D/E · Foundations

Distributed Systems Basics

Summary

The fundamental concepts of distributed computing: how multiple machines coordinate to appear as a single coherent system, navigating network partitions, failures, and the CAP theorem

TL;DR

A distributed system is a collection of independent computers that appear to users as a single coherent system. The fundamental challenge: coordinating multiple machines over an unreliable network while handling partial failures. Key constraints are captured by the CAP theorem—during network partitions, you must choose between consistency and availability.

Visual Overview

Distributed Systems Overview
THE FUNDAMENTAL PROBLEM

  Single Machine (Simple)                         
                                  
     Process     Everything in one place      
     Memory       No coordination needed       
     Disk         Failures are total           
                                  
                                                  
  Distributed System (Complex)                    
      network        network  
   N1    N2    N3 
                              
      Partial failures, network delays,         
       inconsistent views, split brain           


CAP THEOREM: PICK TWO (DURING PARTITION)

                   Consistency                     
                      /                          
                     /                           
                    /                            
                   /  CP                         
                  /________                      
         Availability    Partition Tolerance      
                     AP  /                       
                        /                        
                                                  
  CP: Choose consistency  reject requests       
      Examples: ZooKeeper, etcd, Spanner         
                                                  
  AP: Choose availability  accept stale data    
      Examples: Cassandra, DynamoDB              
                                                  
  Note: P is not optional—partitions WILL happen 


FAILURE MODES

  1. Crash Failure                                
     Node stops, doesn't respond                  
     Detection: Heartbeat timeout                 
                                                  
  2. Network Partition                            
     Nodes can't communicate                      
     [N1, N2] X [N3, N4, N5]                  
                                                  
  3. Byzantine Failure                            
     Node behaves maliciously/incorrectly         
     Hardest to handle (requires 3f+1 nodes)      
                                                  
  4. Partial Failure                              
     Some operations succeed, some fail           
     The defining challenge of distributed        

Why Distributed Systems?

The Single Machine Limit

Every system eventually hits the limits of a single machine:

ConstraintSingle Machine LimitDistributed Solution
CPU~128 cores1000s of machines
Memory~12 TBPetabytes across cluster
Storage~100 TBExabytes across cluster
AvailabilityMachine fails = downRedundancy, failover
LatencyGeography boundEdge locations worldwide

Core Motivations

  1. Scalability: Handle more load than one machine can
  2. Availability: Continue operating despite failures
  3. Latency: Serve users from nearby locations
  4. Cost: Commodity hardware vs expensive mainframes

The CAP Theorem

Proven by Seth Gilbert and Nancy Lynch (2002), building on Eric Brewer’s conjecture:

During a network partition, a distributed system must choose between Consistency and Availability.

The Three Properties

Consistency (C): Every read returns the most recent write or an error. All nodes see the same data at the same time.

Availability (A): Every request to a non-failing node receives a response (no timeouts, no errors).

Partition Tolerance (P): The system continues operating despite arbitrary network partitions.

Why You Can’t Have All Three

Consider two nodes during a network partition:

Node A                    Node B
   │                         │
   │    Network Partition    │
   │        ═══════X═════    │
   │                         │
Client writes X=1 to A       │
   │                         │
   │   Cannot replicate      │
   │   to B (partition)      │
   │                         │
   │                   Client reads X from B
   │                         │
   │        CHOICE:          │
   │   Return X=0 (stale)    │   AP: Available but inconsistent
   │   Return error          │   CP: Consistent but unavailable

Real-World CAP Choices

SystemChoiceBehavior During Partition
ZooKeeperCPMinority partition stops serving
etcdCPRequires quorum for reads/writes
CassandraAPContinues serving, eventual consistency
DynamoDBAP (default)Eventually consistent reads
SpannerCPSacrifices availability for consistency

Fallacies of Distributed Computing

Classic misconceptions that lead to system failures:

  1. The network is reliable — Packets drop, connections break
  2. Latency is zero — Cross-continent: 100-200ms RTT
  3. Bandwidth is infinite — Congestion, throttling
  4. The network is secure — Man-in-middle, spoofing
  5. Topology doesn’t change — Nodes join, leave, fail
  6. There is one administrator — Multi-team, multi-org
  7. Transport cost is zero — Serialization, bandwidth costs
  8. The network is homogeneous — Different protocols, versions

Key Concepts

Network Partitions

A partition occurs when network failure divides the cluster:

Before Partition:
[N1]  [N2]  [N3]  [N4]  [N5]

After Partition:
[N1]  [N2]     ═══X═══     [N3]  [N4]  [N5]
 └─ Minority ─┘               └─── Majority ───┘

Split-brain: Both sides think they’re the leader → data divergence.

Failure Detection

How do you know if a node is dead or just slow?

MethodMechanismTrade-off
HeartbeatsPeriodic “I’m alive” messagesTimeout = false positives
Phi AccrualProbability of failureMore accurate, complex
GossipNodes share failure infoEventually consistent

Ordering and Time

In distributed systems, “happened before” is ambiguous:

Clock TypeMechanismUse Case
Lamport ClocksLogical countersCausal ordering
Vector ClocksPer-node countersDetect conflicts
TrueTimeAtomic clocks + GPSGoogle Spanner
HLCHybrid logical/physicalCockroachDB

Patterns for Handling Failures

1. Replication

Copy data to multiple nodes:

  • Leader-Follower: One writes, others follow
  • Multi-Leader: Multiple writers, conflict resolution
  • Leaderless: All nodes equal, quorum writes

2. Quorum Systems

Require majority agreement: W + R > N

N = 5 replicas
W = 3 (write to 3 nodes)
R = 3 (read from 3 nodes)

At least one node in R saw the write in W
 Guarantees consistency

3. Consensus Algorithms

Agree on a single value despite failures:

  • Paxos: Theoretical foundation, complex
  • Raft: Understandable consensus
  • Zab: ZooKeeper’s algorithm

Related Concepts:

Used In Systems:

  • Every cloud-native application
  • Databases (PostgreSQL, MySQL with replication)
  • Message queues (Kafka, RabbitMQ)
  • Coordination services (ZooKeeper, etcd, Consul)

Next Recommended: Consensus - Learn how distributed systems agree on values despite failures

Production signal

Why this concept matters

Interview 90% of system design interviews
Production Every cloud-native system
Performance Network latency, coordination
Scale Horizontal scaling foundations