TL;DR
Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.
Visual Overview
THE LEARNING PROBLEM

We have:
- Input data
- Desired outputs (labels)
- A loss function (measures how wrong we are)
- Millions of weights to adjust

We need:
- Which direction to adjust each weight
- How much to adjust each weight

The answer: compute the GRADIENT of the loss with respect to each weight, then move each weight in the opposite direction to decrease the loss.

GRADIENT INTUITION

[Figure: loss (vertical axis) plotted against weight value (horizontal axis). The current position sits high on the curve, where the gradient points uphill. After an update, the weight has moved in the opposite direction, closer to the minimum.]

new_weight = weight - learning_rate × gradient
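The update rule can be tried on a toy problem. The sketch below assumes a hypothetical one-weight loss, (w - 3)², chosen only because its minimum (w = 3) and its gradient, 2(w - 3), are easy to check by hand. The function names are illustrative, not taken from any library.

```python
def loss(w):
    # Hypothetical toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    # Repeatedly step opposite to the gradient, recording every weight value.
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return history


history = gradient_descent(w=0.0)
print(f"start: w={history[0]:.4f}  loss={loss(history[0]):.4f}")
print(f"end:   w={history[-1]:.4f}  loss={loss(history[-1]):.8f}")
```

With these settings, each step shrinks the distance to the minimum by a constant factor, so the weight approaches 3 and the loss approaches 0.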
The Chain Rule
Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.
If y = f(g(x)), then:

dy/dx = (dy/dg) × (dg/dx)

"Derivative of outer × derivative of inner"

SIMPLE EXAMPLE

y = (2x + 1)²

Let g = 2x + 1, so y = g².

- dy/dg = 2g (derivative of the square)
- dg/dx = 2 (derivative of 2x + 1)

dy/dx = 2g × 2 = 2(2x + 1) × 2 = 4(2x + 1)
Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
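One way to gain confidence in a chain-rule derivation is to compare it against a numerical estimate. The sketch below does this for y = (2x + 1)². The helper names and the test point x = 1.5 are arbitrary choices for illustration.

```python
def g(x):
    # Inner function.
    return 2.0 * x + 1.0


def y(x):
    # Outer function applied to the inner one: y = g(x)^2.
    return g(x) ** 2


def dy_dx_chain(x):
    # Chain rule: (dy/dg) * (dg/dx).
    dy_dg = 2.0 * g(x)
    dg_dx = 2.0
    return dy_dg * dg_dx


def dy_dx_numeric(x, h=1e-6):
    # Central finite difference as an independent check.
    return (y(x + h) - y(x - h)) / (2.0 * h)


x = 1.5
print("chain rule:", dy_dx_chain(x))
print("numeric:   ", dy_dx_numeric(x))
```

The two values should agree up to floating-point error, and both match the closed form 4(2x + 1).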
Forward Pass vs Backward Pass
Training has two phases per batch:
FORWARD PASS (compute predictions)

Input → [Layer 1] → [Layer 2] → [Layer 3] → Output → compare with label → Loss

BACKWARD PASS (compute gradients)

dL/dW1 ← dL/dW2 ← dL/dW3 ← dL (from loss)

Gradients flow backward through the network. Each layer passes gradients to the previous layer.
This is why it is called “back” propagation: gradients propagate from the output back toward the input.
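The two phases can be written out by hand for a very small network. The sketch below assumes a hypothetical two-weight scalar model, y_hat = w2 × tanh(w1 × x), with a squared-error loss. The model, the names `forward` and `backward`, and the numbers are all illustrative choices, not anything prescribed by the text.

```python
import math


def forward(w1, w2, x, target):
    # Forward pass: compute the prediction and the loss, caching intermediates.
    z = w1 * x
    h = math.tanh(z)
    y_hat = w2 * h
    loss = (y_hat - target) ** 2
    cache = (x, h, y_hat, target, w2)
    return loss, cache


def backward(cache):
    # Backward pass: apply the chain rule from the loss back to each weight.
    x, h, y_hat, target, w2 = cache
    dL_dy_hat = 2.0 * (y_hat - target)   # gradient at the output
    dL_dw2 = dL_dy_hat * h               # gradient for the last weight
    dL_dh = dL_dy_hat * w2               # passed back to the previous layer
    dL_dz = dL_dh * (1.0 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                   # gradient for the first weight
    return dL_dw1, dL_dw2


loss_value, cache = forward(w1=0.8, w2=-0.3, x=0.5, target=1.0)
dL_dw1, dL_dw2 = backward(cache)
print(f"loss={loss_value:.4f}  dL/dw1={dL_dw1:.4f}  dL/dw2={dL_dw2:.4f}")
```

Note that `dL_dh` is computed once and reused, which is the sense in which each layer "passes gradients to the previous layer".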
The Vanishing Gradient Problem
Early deep networks had a critical issue: gradients could shrink exponentially as they propagated back to earlier layers.
GRADIENT FLOW IN A 4-LAYER NETWORK

Forward pass: Input → [L1] → [L2] → [L3] → [L4] → Loss

Backward pass (gradients): dL/dW1 ← dL/dW2 ← dL/dW3 ← dL/dW4 ← dL

At each layer, gradients multiply:

dL/dW1 = (local_1) × (local_2) × (local_3) × (local_4)

If each local gradient is less than 1:

0.5 × 0.5 × 0.5 × 0.5 = 0.0625 (16x smaller)

For 10 layers with local gradients of 0.5:

0.5^10 ≈ 0.001 (gradient practically zero)

Early layers stop learning.

WHY SIGMOID CAUSES THIS

Sigmoid: s(x) = 1/(1 + e^(-x))

Gradient: s'(x) = s(x) × (1 - s(x))

The maximum gradient is 0.25 (at x = 0), and it is usually much smaller. Because s'(x) never exceeds 0.25, multiplying many of these factors together makes the gradient vanish.
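The arithmetic above is easy to reproduce. The sketch below assumes the simplification used in this section, where every layer contributes the same local gradient, and prints how the product shrinks with depth. The function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # s'(x) = s(x) * (1 - s(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(local_gradient, num_layers):
    # Simplifying assumption: every layer has the same local gradient.
    return local_gradient ** num_layers


print("max sigmoid gradient:", sigmoid_grad(0.0))
for depth in (4, 10, 50):
    print(f"depth {depth:>2}: 0.5^n = {chained_gradient(0.5, depth):.3e}, "
          f"0.25^n = {chained_gradient(sigmoid_grad(0.0), depth):.3e}")
```

Even in the best case for sigmoid (every unit sitting exactly at x = 0), the product falls off faster than the 0.5 example.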
Solutions to Vanishing Gradients
ReLU Activation
ReLU(x) = max(0, x)

Gradient:
- x > 0: gradient = 1
- x < 0: gradient = 0

When active (x > 0), the gradient is 1, so there is no shrinking: 1 × 1 × 1 × 1 = 1.

Problem: "dead neurons". If a neuron is always in the x < 0 region, its gradient is always 0 and it never learns.
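A dead neuron can be illustrated with a single unit. The sketch below assumes a hypothetical neuron computing ReLU(w × x + b) and looks at the gradient of its output with respect to w for a few inputs. Treating the gradient at exactly x = 0 as 0 is a common convention adopted here, not something the text above specifies.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Gradient is 1 when active, 0 otherwise (0 at x == 0 by convention).
    return 1.0 if x > 0 else 0.0


def neuron_weight_grads(w, b, inputs):
    # d ReLU(w*x + b) / dw = relu_grad(w*x + b) * x, for each input x.
    return [relu_grad(w * x + b) * x for x in inputs]


inputs = [0.5, 1.0, 2.0]
print("healthy neuron:", neuron_weight_grads(w=1.0, b=0.0, inputs=inputs))
print("dead neuron:   ", neuron_weight_grads(w=1.0, b=-5.0, inputs=inputs))
```

With the large negative bias, every pre-activation is below zero, so every gradient is 0 and no update can move the weight back into the active region.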
Skip Connections (Residual Connections)
RESIDUAL CONNECTION

Standard layer: output = f(input)

Residual layer: output = input + f(input), where the added "input" term is the skip connection.

Gradient flow:

d(output)/d(input) = 1 + df/d(input)

The "1" comes from the skip connection and is always present. Even if f's gradient vanishes, the "1" remains, so gradients have a highway to flow through.

GRADIENT HIGHWAY VISUAL

Without residuals:

Input → [L1] → [L2] → [L3] → [L4] → Output

Gradients must pass through every layer (and shrink).

With residuals:

Input ───────────────────────────────→ + → Output (direct path)

Input → [L1] → [L2] → [L3] → [L4] → + (layer branch, added back onto the direct path)

Gradients can skip layers entirely. The "residual stream" flows directly from input to output.
This is a large part of why deep transformers are trainable. Every attention layer and FFN layer is wrapped in a residual connection:
# Transformer layer (simplified)
x = x + attention(x) # residual around attention
x = x + ffn(x) # residual around FFN
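The effect of the extra "1" can be checked numerically. The sketch below assumes each layer's own local derivative df/dx is a small constant (0.01 here, an arbitrary value chosen to mimic a nearly vanished gradient) and compares the end-to-end gradient of a plain stack with that of a residual stack. The function names are illustrative.

```python
def plain_chain_gradient(local_grads):
    # Plain stack: the end-to-end gradient is the product of df/dx per layer.
    total = 1.0
    for g in local_grads:
        total *= g
    return total


def residual_chain_gradient(local_grads):
    # Residual stack: each layer contributes (1 + df/dx) instead of df/dx.
    total = 1.0
    for g in local_grads:
        total *= 1.0 + g
    return total


local_grads = [0.01] * 20  # 20 layers, each with a tiny local derivative
print(f"plain:    {plain_chain_gradient(local_grads):.3e}")
print(f"residual: {residual_chain_gradient(local_grads):.3f}")
```

The plain product collapses toward zero, while the residual product stays close to 1. This is a simplified scalar picture; real layers have matrix-valued derivatives, and normalization layers also affect gradient scale.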
Why This Matters for Modern Models
Understanding LoRA
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.
Base model weights: frozen (no weight gradients are computed for them)

LoRA adapters: trainable

W_new = W_base + A × B (W_base is frozen; A and B are trained)

Only A and B receive gradients. They are much smaller than W_base, so training is much cheaper and often faster.
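To make "much smaller" concrete, the sketch below counts parameters for one weight matrix and builds W_base + A × B on a tiny example. It assumes A has shape (d_out × rank) and B has shape (rank × d_in) so that A × B matches W_base. The 4096 × 4096 size and rank 8 are illustrative values, not figures from the text, and the helpers are plain Python rather than any library's LoRA API.

```python
def matmul(a, b):
    # Plain-Python matrix product for small nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merged_weight(w_base, a, b):
    # W_new = W_base + A x B; W_base is never modified in place.
    delta = matmul(a, b)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_base, delta)]


def param_counts(d_out, d_in, rank):
    # (parameters in the full matrix, parameters in the A and B adapters)
    return d_out * d_in, rank * (d_out + d_in)


w_base = [[1.0, 0.0], [0.0, 1.0]]   # frozen
a = [[1.0], [2.0]]                  # trainable, shape 2 x 1
b = [[3.0, 4.0]]                    # trainable, shape 1 x 2
print("merged:", merged_weight(w_base, a, b))

full, lora = param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  adapters: {lora:,}  ratio: {full // lora}x")
```

In this illustrative configuration the adapters hold a few hundred times fewer trainable parameters than the matrix they adapt.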
Understanding Training Instability
Loss not decreasing:
- Gradients too small? (vanishing)
- Learning rate too low?
- Bad initialization?

Loss explodes (goes to NaN):
- Gradients too large? (exploding)
- Learning rate too high?
- Missing normalization?

Loss oscillates wildly:
- Learning rate too high?
- Batch size too small?
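The learning-rate symptoms in this checklist can be reproduced on a toy problem. The sketch below assumes a hypothetical one-weight loss, w², with gradient 2w, and runs plain gradient descent at three learning rates picked for illustration. It covers only the learning-rate items, not initialization, normalization, or batch size.

```python
def run(learning_rate, steps=20, w=1.0):
    # Gradient descent on loss = w^2 (gradient = 2w); returns the loss per step.
    losses = [w ** 2]
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
        losses.append(w ** 2)
    return losses


for lr in (0.001, 0.1, 1.1):
    losses = run(lr)
    print(f"lr={lr:<5}  first loss={losses[0]:.4f}  last loss={losses[-1]:.4f}")
```

With the smallest rate the loss barely moves, with the middle rate it drops quickly, and with the largest rate each step overshoots the minimum by more than the last, so the loss grows instead of shrinking.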
Key Formulas Summary
- Chain rule: dy/dx = (dy/dg) × (dg/dx)
- Weight update: w_new = w_old - learning_rate × (dLoss/dw)
- Residual gradient: d(x + f(x))/dx = 1 + df/dx
- ReLU gradient: d(max(0,x))/dx = 1 if x > 0, else 0
When This Matters
| Situation | Concept to apply |
|---|---|
| Model isn’t learning | Check for vanishing gradients, dead ReLUs |
| Training explodes to NaN | Gradients exploding, reduce LR or add norm |
| Understanding LoRA | Only adapter params receive gradients |
| Understanding residual connections | Gradient highways for deep networks |
| Understanding transformer architecture | Residual stream is the core design |
| Debugging fine-tuning | Frozen params receive no gradient updates |
See It In Action
- Backpropagation Explainer - ~120 second animated visual explanation