Backpropagation | Concepts

TL;DR

Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.

Visual Overview

The Learning Problem and Gradient Intuition

THE LEARNING PROBLEM
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   We have:                                                │
│     • Input data                                          │
│     • Desired outputs (labels)                            │
│     • A loss function (measures how wrong we are)         │
│     • Millions of weights to adjust                       │
│                                                           │
│   We need:                                                │
│     • Which direction to adjust each weight               │
│     • How much to adjust each weight                      │
│                                                           │
│   The answer: Compute the GRADIENT of the loss with       │
│   respect to each weight. Go opposite to decrease loss.   │
│                                                           │
└───────────────────────────────────────────────────────────┘

GRADIENT INTUITION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Loss                                                      │
│     │                                                     │
│   5 │  ●  current position                                │
│     │                                                     │
│   3 │     gradient points uphill                          │
│     │                                                     │
│   1 │  ●  after update (moved opposite)                   │
│     │                                                     │
│   0 │────────*─── minimum                                 │
│     └────────────────────                                 │
│        weight value                                       │
│                                                           │
│ new_weight = weight - learning_rate × gradient            │
│                                                           │
└───────────────────────────────────────────────────────────┘
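The update rule in the box can be sketched on a toy problem. This is a minimal illustration, assuming a hypothetical one-weight loss `(w - 3)^2` and a learning rate of 0.1; neither comes from the original text.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 3 (an assumption for illustration)
    return (w - 3.0) ** 2

def grad(w):
    # Analytic derivative of the loss above: d/dw (w - 3)^2 = 2(w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly apply new_weight = weight - learning_rate * gradient."""
    history = [loss(w)]
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # step opposite to the gradient
        history.append(loss(w))
    return w, history

w_final, history = gradient_descent(w=0.0)
print(f"final weight: {w_final:.4f}, final loss: {history[-1]:.2e}")
```

The gradient supplies both pieces the box asks for: its sign gives the direction, and its magnitude (scaled by the learning rate) gives the step size.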

The Chain Rule

Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.

Chain Rule

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   If y = f(g(x)), then:                                   │
│                                                           │
│     dy/dx = (dy/dg) × (dg/dx)                             │
│                                                           │
│   "Derivative of outer × derivative of inner"             │
│                                                           │
└───────────────────────────────────────────────────────────┘

SIMPLE EXAMPLE
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ y = (2x + 1)²                                             │
│                                                           │
│ Let g = 2x + 1, so y = g²                                 │
│                                                           │
│ dy/dg = 2g (derivative of square)                         │
│ dg/dx = 2 (derivative of 2x + 1)                          │
│                                                           │
│ dy/dx = 2g × 2 = 2(2x + 1) × 2 = 4(2x + 1)                │
│                                                           │
└───────────────────────────────────────────────────────────┘

Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
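The worked example above can be checked numerically. The sketch below compares the chain-rule result for y = (2x + 1)² against a central finite difference; the function names and the test point x = 1.5 are chosen here for illustration.

```python
def g(x):
    return 2.0 * x + 1.0

def y(x):
    return g(x) ** 2

def dy_dx_chain(x):
    dy_dg = 2.0 * g(x)   # derivative of the outer function g^2
    dg_dx = 2.0          # derivative of the inner function 2x + 1
    return dy_dg * dg_dx

def dy_dx_numeric(x, eps=1e-6):
    # Central finite difference, used only as an independent check
    return (y(x + eps) - y(x - eps)) / (2.0 * eps)

x = 1.5
print("chain rule:", dy_dx_chain(x))
print("numeric:   ", dy_dx_numeric(x))
```

At x = 1.5 the formula 4(2x + 1) gives 16, and the numeric estimate should agree to several decimal places.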

Forward Pass vs Backward Pass

Training has two phases per batch:

FORWARD PASS (compute predictions)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Input → [Layer 1] → [Layer 2] → [Layer 3] → Output      │
│                                                     │     │
│                                                     ▼     │
│                                           Compare with    │
│                                               label       │
│                                                     │     │
│                                                     ▼     │
│                                                   Loss    │
│                                                           │
└───────────────────────────────────────────────────────────┘

BACKWARD PASS (compute gradients)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ dL/dW1 ← dL/dW2 ← dL/dW3 ← dL (from loss)                 │
│                                                           │
│ Gradients flow backward through the network.              │
│ Each layer passes gradients to the previous layer.        │
│                                                           │
└───────────────────────────────────────────────────────────┘

This is why it’s called “back” propagation — gradients propagate from the output back to the input.
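To make the two phases concrete, here is a minimal hand-written forward and backward pass. It assumes a hypothetical two-weight scalar network (`h = tanh(w1 * x)`, `y_hat = w2 * h`) with a squared-error loss; the architecture and all values are illustrative, not taken from the text above.

```python
import math

def forward(x, target, w1, w2):
    """Forward pass: compute the prediction and the loss, caching intermediates."""
    h = math.tanh(w1 * x)          # layer 1
    y_hat = w2 * h                 # layer 2
    loss = (y_hat - target) ** 2   # compare with label
    return loss, (x, h, y_hat, target, w2)

def backward(cache):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    x, h, y_hat, target, w2 = cache
    d_y_hat = 2.0 * (y_hat - target)   # dL/dy_hat
    d_w2 = d_y_hat * h                 # dL/dw2
    d_h = d_y_hat * w2                 # gradient handed back to layer 1
    d_z = d_h * (1.0 - h * h)          # through tanh: tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                     # dL/dw1
    return d_w1, d_w2

x, target, w1, w2 = 0.5, 1.0, 0.8, -0.3
loss, cache = forward(x, target, w1, w2)
d_w1, d_w2 = backward(cache)

# Independent check with central finite differences
eps = 1e-6
num_d_w1 = (forward(x, target, w1 + eps, w2)[0] - forward(x, target, w1 - eps, w2)[0]) / (2 * eps)
num_d_w2 = (forward(x, target, w1, w2 + eps)[0] - forward(x, target, w1, w2 - eps)[0]) / (2 * eps)
print(f"dL/dw1: backprop {d_w1:.6f}, numeric {num_d_w1:.6f}")
print(f"dL/dw2: backprop {d_w2:.6f}, numeric {num_d_w2:.6f}")
```

Note that `d_h` is computed once and reused for everything upstream of it. That reuse is what makes backpropagation cheaper than differentiating with respect to each weight separately.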

The Vanishing Gradient Problem

Deep networks had a critical issue: gradients got exponentially smaller in earlier layers.

Vanishing Gradient Problem

GRADIENT FLOW IN A 4-LAYER NETWORK
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Forward pass:                                           │
│     Input → [L1] → [L2] → [L3] → [L4] → Loss              │
│                                                           │
│   Backward pass (gradients):                              │
│     dL/dW1 ← dL/dW2 ← dL/dW3 ← dL/dW4 ← dL                │
│                                                           │
│   At each layer, gradients multiply:                      │
│     dL/dW1 = (local_1) × (local_2) × (local_3) × (local_4)│
│                                                           │
│   If each local gradient is below 1 (say 0.5):            │
│     0.5 × 0.5 × 0.5 × 0.5 = 0.0625  ← 16x smaller!        │
│                                                           │
│   For 10 layers with 0.5 gradients:                       │
│     0.5^10 ≈ 0.001  ← Gradient practically zero           │
│                                                           │
│   Early layers stop learning.                             │
│                                                           │
└───────────────────────────────────────────────────────────┘

WHY SIGMOID CAUSES THIS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Sigmoid: s(x) = 1/(1 + e^(-x))                            │
│ Gradient: s'(x) = s(x) × (1 - s(x))                       │
│                                                           │
│ Maximum gradient: 0.25 (when x = 0)                       │
│ Usually much smaller.                                     │
│                                                           │
│ s'(x) is always at most 0.25                              │
│ Multiply many of these: gradient vanishes                 │
│                                                           │
└───────────────────────────────────────────────────────────┘
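The shrinkage can be observed directly. The sketch below multiplies the local sigmoid derivatives along a chain of layers, under the simplifying assumption that every weight equals 1 (real networks also multiply by weight values, which this ignores).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_through_sigmoids(x, depth):
    """Product of local sigmoid derivatives along `depth` stacked layers (weights fixed at 1)."""
    grad = 1.0
    a = x
    for _ in range(depth):
        s = sigmoid(a)
        grad *= s * (1.0 - s)   # s'(a) = s(a) * (1 - s(a)), never above 0.25
        a = s                   # output of this layer feeds the next one
    return grad

for depth in (1, 4, 10, 20):
    print(f"depth {depth:2d}: gradient factor {gradient_through_sigmoids(0.0, depth):.3e}")
```

Because each factor is at most 0.25, the product after n layers is at most 0.25^n, which falls below one in a million by ten layers.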

Solutions to Vanishing Gradients

ReLU Activation

ReLU Gradient

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   ReLU(x) = max(0, x)                                     │
│                                                           │
│   Gradient:                                               │
│     x > 0: gradient = 1                                   │
│     x < 0: gradient = 0                                   │
│                                                           │
│   When active (x > 0), gradient = 1                       │
│   No shrinking! 1 × 1 × 1 × 1 = 1                         │
│                                                           │
│   Problem: "Dead neurons" — if a neuron is always in      │
│            x < 0 region, gradient always 0, never learns  │
│                                                           │
└───────────────────────────────────────────────────────────┘
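A small sketch of both behaviors in the box: the gradient passing through unchanged when units are active, and a unit that never activates receiving no gradient. The helper names and the convention of using gradient 0 at exactly x = 0 are assumptions made here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Assumed convention: the gradient at exactly x = 0 is taken to be 0
    return 1.0 if x > 0 else 0.0

def chain_gradient(pre_activations):
    """Product of ReLU local gradients along one path through the network."""
    grad = 1.0
    for z in pre_activations:
        grad *= relu_grad(z)
    return grad

def is_dead(pre_activations_over_batch):
    """True if the unit never activates on this batch, so it receives zero gradient."""
    return all(z <= 0 for z in pre_activations_over_batch)

print(chain_gradient([0.3, 2.0, 1.1, 0.7]))    # every unit active: no shrinking
print(chain_gradient([0.3, -2.0, 1.1, 0.7]))   # one inactive unit blocks this path
print(is_dead([-1.2, -0.4, -3.0]))             # never positive on this batch
```

Checking the fraction of units for which `is_dead` holds over a batch is one simple way to spot the dead-neuron problem during training.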

Skip Connections (Residual Connections)

Residual Connections

RESIDUAL CONNECTION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Standard layer:                                         │
│     output = f(input)                                     │
│                                                           │
│   Residual layer:                                         │
│     output = input + f(input)                             │
│              ↑                                            │
│              This is the skip connection                  │
│                                                           │
│   Gradient flow:                                          │
│     d(output)/d(input) = 1 + df/d(input)                  │
│                          ↑                                │
│                          Identity term: always adds 1     │
│                                                           │
│   Even if f's gradient vanishes, the "1" remains.         │
│   Gradients have a highway to flow through.               │
│                                                           │
└───────────────────────────────────────────────────────────┘

GRADIENT HIGHWAY VISUAL
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Without residuals:                                        │
│ Input → [L1] → [L2] → [L3] → [L4] → Output                │
│          ↓      ↓      ↓      ↓                           │
│ Gradients must pass through every layer (shrink)          │
│                                                           │
│ With residuals:                                           │
│ Input ───────────────────────────────→ + → Out            │
│  ↓      ↓      ↓      ↓                ↑                  │
│ [L1] → [L2] → [L3] → [L4] ─────────────┘                  │
│                                                           │
│ Gradients can skip layers entirely.                       │
│ "Residual stream" flows directly input to output.         │
│                                                           │
└───────────────────────────────────────────────────────────┘

This is a key reason deep transformers are trainable. Every attention sublayer and FFN sublayer is wrapped in a residual connection:

# Transformer layer (simplified)
x = x + attention(x)    # residual around attention
x = x + ffn(x)          # residual around FFN
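The effect of the skip connection on gradient flow can be sketched by stacking the same small layer with and without it. The layer function `f` below is hypothetical, chosen so that its slope is at most 0.1. Since that slope is never negative here, each residual factor `1 + f'` stays at or above 1; in general `f'` can be negative, so the identity term guarantees a direct path rather than a lower bound.

```python
import math

def f(x):
    # Hypothetical layer function with a small slope (0 < f' <= 0.1)
    return 0.1 * math.tanh(x)

def f_prime(x):
    return 0.1 * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, depth, residual):
    """d(output)/d(input) for `depth` stacked layers, accumulated by the chain rule."""
    grad = 1.0
    for _ in range(depth):
        local = f_prime(x)
        if residual:
            grad *= 1.0 + local   # d(x + f(x))/dx = 1 + df/dx
            x = x + f(x)
        else:
            grad *= local         # d(f(x))/dx
            x = f(x)
    return grad

plain = input_gradient(0.5, depth=20, residual=False)
skip = input_gradient(0.5, depth=20, residual=True)
print(f"20 plain layers:    {plain:.3e}")
print(f"20 residual layers: {skip:.3e}")
```

Without the skip connection the input gradient collapses toward zero after 20 layers, while the residual version stays on the order of 1.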

Why This Matters for Modern Models

Understanding LoRA

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.

LoRA Gradient Flow

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Base model weights: Frozen (no gradients computed)      │
│   LoRA adapters: Trainable                                │
│                                                           │
│   W_new = W_base + A × B                                  │
│           ↑         ↑                                     │
│         frozen    trained                                 │
│                                                           │
│   Only A and B receive gradients.                         │
│   Far fewer trainable values → lower memory and cost.     │
│                                                           │
└───────────────────────────────────────────────────────────┘
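The gradient flow in the box can be sketched with a rank-1 adapter in plain Python. This is an illustration under stated assumptions, not a reference LoRA implementation: `a` and `b` are vectors standing in for A and B, the loss is half the squared error, and all values are made up. One factor starts at zero so that the adapter initially leaves the base model's output unchanged.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W_base, a, b, x):
    # Effective weight is W_base + a b^T, applied as W_base x + a (b . x)
    bx = sum(bj * xj for bj, xj in zip(b, x))
    return [wx + ai * bx for wx, ai in zip(matvec(W_base, x), a)]

def lora_grads(W_base, a, b, x, target):
    """Gradients for the adapter only; nothing is computed for W_base."""
    y = lora_forward(W_base, a, b, x)
    err = [yi - ti for yi, ti in zip(y, target)]   # dL/dy for L = 0.5 * sum(err^2)
    bx = sum(bj * xj for bj, xj in zip(b, x))
    ae = sum(ai * e for ai, e in zip(a, err))
    grad_a = [e * bx for e in err]                  # dL/da_i = err_i * (b . x)
    grad_b = [ae * xj for xj in x]                  # dL/db_j = (a . err) * x_j
    loss = 0.5 * sum(e * e for e in err)
    return grad_a, grad_b, loss

W_base = [[1.0, 0.0], [0.0, 1.0]]       # frozen
a, b = [0.5, -0.5], [0.0, 0.0]          # trainable; b starts at zero
x, target = [1.0, 2.0], [2.0, 1.0]
W_before = [row[:] for row in W_base]

_, _, initial_loss = lora_grads(W_base, a, b, x, target)
for _ in range(50):
    grad_a, grad_b, _ = lora_grads(W_base, a, b, x, target)
    a = [ai - 0.1 * g for ai, g in zip(a, grad_a)]
    b = [bj - 0.1 * g for bj, g in zip(b, grad_b)]
_, _, final_loss = lora_grads(W_base, a, b, x, target)
print(f"loss {initial_loss:.4f} -> {final_loss:.2e}; base unchanged: {W_base == W_before}")
```

Here the adapter holds 4 trainable numbers, the same as the 2×2 base matrix. The savings appear at realistic sizes, where a rank-r adapter stores r × (d_out + d_in) values instead of d_out × d_in.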

Understanding Training Instability

Diagnosing Training Issues

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Loss not decreasing:                                    │
│     • Gradients too small? (vanishing)                    │
│     • Learning rate too low?                              │
│     • Bad initialization?                                 │
│                                                           │
│   Loss explodes (goes to NaN):                            │
│     • Gradients too large? (exploding)                    │
│     • Learning rate too high?                             │
│     • Missing normalization?                              │
│                                                           │
│   Loss oscillates wildly:                                 │
│     • Learning rate too high?                             │
│     • Batch size too small?                               │
│                                                           │
└───────────────────────────────────────────────────────────┘
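The learning-rate symptoms in the checklist can be reproduced on a toy problem. The sketch below runs gradient descent on the hypothetical loss `w**2` with three learning rates; the specific values are chosen here to show smooth convergence, oscillation, and divergence.

```python
def run(learning_rate, steps=30, w=1.0):
    """Gradient descent on loss(w) = w**2; returns the weight and loss after each step."""
    weights, losses = [], []
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w   # gradient of w**2 is 2w
        weights.append(w)
        losses.append(w * w)
    return weights, losses

for lr in (0.1, 0.9, 1.1):
    weights, losses = run(lr)
    sign_flips = sum(1 for p, q in zip(weights, weights[1:]) if p * q < 0)
    trend = "diverging" if losses[-1] > losses[0] else "converging"
    print(f"lr={lr}: last loss {losses[-1]:.3g}, sign flips {sign_flips}, {trend}")
```

With `lr=0.9` the weight overshoots the minimum and flips sign on every step while still shrinking, and with `lr=1.1` each overshoot is larger than the last, so the loss grows without bound. In a real training run, logging the gradient norm alongside the loss helps tell these cases apart.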

Key Formulas Summary

Backprop Essentials

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Chain rule:                                             │
│     dy/dx = (dy/dg) × (dg/dx)                             │
│                                                           │
│   Weight update:                                          │
│     w_new = w_old - learning_rate × (dLoss/dw)            │
│                                                           │
│   Residual gradient:                                      │
│     d(x + f(x))/dx = 1 + df/dx                            │
│                                                           │
│   ReLU gradient:                                          │
│     d(max(0,x))/dx = 1 if x > 0, else 0                   │
│                                                           │
└───────────────────────────────────────────────────────────┘

When This Matters

Situation	Concept to apply
Model isn’t learning	Check for vanishing gradients, dead ReLUs
Training explodes to NaN	Gradients exploding, reduce LR or add norm
Understanding LoRA	Only adapter params receive gradients
Understanding residual connections	Gradient highways for deep networks
Understanding transformer architecture	Residual stream is the core design
Debugging fine-tuning	Gradients to frozen params = 0

See It In Action

Backpropagation Explainer - ~120 second animated visual explanation
