I/D/E · Visual explainer

Backpropagation

Summary

How neural networks learn by propagating errors backward through layers.

Read This As

How does error assign credit or blame to each weight?

Failure Trap

Treating backprop as insight into meaning rather than a gradient computation over parameters.

Decision Rule

Follow loss to gradients to updates; debug training by checking each link in that chain.
[Figure: animated diagram of a network with 2 inputs (x₁, x₂), 2 hidden neurons (h₁, h₂), 1 output (ŷ), and 4 learnable weights (w₁ to w₄). Panels: forward pass (x₁ = 0.5, x₂ = 0.8, h₁ = 0.6, h₂ = 0.4, ŷ = 0.7); loss calculation (target y = 1.0, L = (y - ŷ)² = (1.0 - 0.7)² = 0.09); backward pass (∂L/∂ŷ = -0.6, chain rule ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w); weight update (w_new = w_old - α × ∂L/∂w with α = 0.1; prediction improves to ŷ = 0.85 and loss drops from 0.09 to about 0.02).]

A Simple Neural Network

Let's visualize backpropagation with a simple network: 2 inputs, 2 hidden neurons, and 1 output. Each connection has a weight — a number that determines how strongly one neuron influences another.

Our goal: adjust these weights so the network makes accurate predictions.

  • Neurons (nodes) compute weighted sums plus activation
  • Weights are the learnable parameters
  • The diagram labels 4 weights to learn (w₁ to w₄); a fully connected 2-2-1 network would have 6, plus biases

Forward Pass: Computing the Output

Data flows forward through the network. Inputs (x₁, x₂) are multiplied by weights, summed at each hidden neuron, passed through an activation function, then combined to produce the output.

This is a forward pass — inputs in, prediction out.

  • Each neuron computes: output = activation(Σ weights × inputs)
  • Forward pass is just matrix multiplication + activation
  • The final output is the network's prediction
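
The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the explainer's actual implementation: the wiring (x₁ feeds h₁ through w₁, x₂ feeds h₂ through w₂, and both hidden neurons feed ŷ through w₃ and w₄), the sigmoid activation, and the weight values are all hypothetical, since the diagram does not give them. Only the inputs 0.5 and 0.8 come from the figure.

```python
import math

def sigmoid(z):
    """Logistic activation; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3, w4):
    """Forward pass under an ASSUMED wiring (not confirmed by the diagram):
    x1 --w1--> h1, x2 --w2--> h2, then h1 --w3--> y_hat <--w4-- h2."""
    h1 = sigmoid(w1 * x1)               # hidden neuron 1: weighted input, then activation
    h2 = sigmoid(w2 * x2)               # hidden neuron 2
    y_hat = sigmoid(w3 * h1 + w4 * h2)  # output: weighted sum of hidden values, then activation
    return h1, h2, y_hat

# Inputs are taken from the diagram; the weight values are hypothetical.
h1, h2, y_hat = forward(0.5, 0.8, w1=0.4, w2=-0.3, w3=0.7, w4=0.2)
print(f"h1={h1:.3f}  h2={h2:.3f}  y_hat={y_hat:.3f}")
```

Because the weights here are made up, the printed values will not match the 0.6, 0.4, and 0.7 shown in the figure; the point is only the order of operations.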

How Wrong Are We?

We compare the prediction (ŷ = 0.7) to the actual target (y = 1.0). The loss function quantifies this error — here, squared error: (1.0 - 0.7)² = 0.09.

The larger the loss, the worse the prediction. Our job: minimize this loss.

  • Loss measures prediction error
  • Common losses: MSE, cross-entropy
  • Training = minimizing loss
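
The loss calculation above is small enough to check directly. The sketch below reproduces the 0.09 from the text; the function name `squared_error` is just an illustrative choice.

```python
def squared_error(y, y_hat):
    """Squared error for a single prediction: (y - y_hat)^2."""
    return (y - y_hat) ** 2

# Target and prediction from the text above.
loss = squared_error(y=1.0, y_hat=0.7)
print(round(loss, 4))  # 0.09
```

Averaging this quantity over many examples gives the mean squared error (MSE) mentioned in the list above.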

Gradients Flow Backward

Now the magic: we ask "how does each weight contribute to the loss?" The answer is the gradient — the derivative of loss with respect to each weight.

We start at the output and work backward. The gradient ∂L/∂ŷ tells us how changes in the output affect the loss.

  • Gradient = direction of steepest loss increase
  • Negative gradient = direction to reduce loss
  • Computed via calculus (chain rule)
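
For the squared-error loss used here, the first gradient in the chain has a closed form: ∂L/∂ŷ = -2(y - ŷ), which is -0.6 for y = 1.0 and ŷ = 0.7. The sketch below computes it and compares it with a finite-difference estimate, a common sanity check when debugging gradients. The function names and the step size `eps` are illustrative assumptions.

```python
def loss(y, y_hat):
    """Squared error for one prediction."""
    return (y - y_hat) ** 2

def dloss_dyhat(y, y_hat):
    """Analytic derivative of (y - y_hat)^2 with respect to y_hat."""
    return -2.0 * (y - y_hat)

y, y_hat = 1.0, 0.7
analytic = dloss_dyhat(y, y_hat)

# Central finite difference: nudge y_hat slightly up and down and measure the slope.
eps = 1e-6
numeric = (loss(y, y_hat + eps) - loss(y, y_hat - eps)) / (2 * eps)

print(f"analytic={analytic:.4f}  numeric={numeric:.4f}")
```

The negative sign says that increasing ŷ would decrease the loss, which matches intuition: the prediction 0.7 is below the target 1.0.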

Chain Rule Through Layers

The chain rule lets us decompose the gradient through each layer. The gradient for w₃ combines the gradient from the output with the gradient through the activation.

At the hidden layer, gradients split — each hidden neuron receives gradients from all connections leading forward.

  • Chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂h × ∂h/∂w
  • Gradients accumulate through layers
  • This is why it's called "backpropagation"
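
To make the chain rule concrete, here is a hedged sketch of a manual backward pass for a 4-weight network, checked against finite differences. It assumes a particular wiring (x₁ to h₁ via w₁, x₂ to h₂ via w₂, both hidden neurons to ŷ via w₃ and w₄), sigmoid activations, and hypothetical weight values; none of these are given in the source, so treat it as one possible instance rather than the network in the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    """ASSUMED wiring: x[0] --w[0]--> h1, x[1] --w[1]--> h2, (h1, h2) --w[2], w[3]--> y_hat."""
    h1 = sigmoid(w[0] * x[0])
    h2 = sigmoid(w[1] * x[1])
    y_hat = sigmoid(w[2] * h1 + w[3] * h2)
    return h1, h2, y_hat

def loss(x, y, w):
    return (y - forward(x, w)[2]) ** 2

def backward(x, y, w):
    """Apply the chain rule from the loss back to each weight."""
    h1, h2, y_hat = forward(x, w)
    dL_dyhat = -2.0 * (y - y_hat)                    # dL/dy_hat
    delta_out = dL_dyhat * y_hat * (1.0 - y_hat)     # times sigmoid derivative at the output
    grad_w3 = delta_out * h1                         # output-layer weights
    grad_w4 = delta_out * h2
    delta_h1 = delta_out * w[2] * h1 * (1.0 - h1)    # gradient passed back through h1
    delta_h2 = delta_out * w[3] * h2 * (1.0 - h2)    # gradient passed back through h2
    grad_w1 = delta_h1 * x[0]                        # hidden-layer weights
    grad_w2 = delta_h2 * x[1]
    return [grad_w1, grad_w2, grad_w3, grad_w4]

def numeric_grad(x, y, w, eps=1e-6):
    """Central finite differences, used only to verify the analytic gradients."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(x, y, up) - loss(x, y, down)) / (2 * eps))
    return grads

x, y = (0.5, 0.8), 1.0          # inputs and target from the diagram
w = [0.4, -0.3, 0.7, 0.2]       # hypothetical weights
analytic = backward(x, y, w)
numeric = numeric_grad(x, y, w)
for name, a, n in zip(["w1", "w2", "w3", "w4"], analytic, numeric):
    print(f"dL/d{name}: analytic={a:.6f}  numeric={n:.6f}")
```

Note how `delta_out` is computed once and reused for every weight. That reuse of intermediate gradients is what makes backpropagation cheaper than differentiating the loss separately for each weight.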

Learning: Adjusting Weights

Finally, we update each weight in the direction opposite to its gradient. If increasing a weight would increase the loss (positive gradient), we decrease it; if the gradient is negative, we increase it. The learning rate (α) controls the step size.

After one update, the prediction improves: ŷ = 0.85. Repeat thousands of times and the network learns.

  • Update rule: w = w - α × gradient
  • Learning rate is a hyperparameter
  • Each update is one iteration; one full pass over the training data is an epoch
  • This is gradient descent
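
The update rule can be seen in isolation with a deliberately tiny model. The sketch below is an assumption-laden toy, not the network in the figure: it uses a single weight with ŷ = w × x, chosen so that the starting point reproduces the numbers above (ŷ = 0.7, loss 0.09, gradient -0.6, α = 0.1). The first step moves w from 0.7 to 0.76, so the prediction will not match the 0.85 shown in the diagram.

```python
def train(w, x, y, alpha, steps):
    """Gradient descent on a one-weight model y_hat = w * x with squared-error loss."""
    losses = []
    for _ in range(steps):
        y_hat = w * x                    # forward pass
        losses.append((y - y_hat) ** 2)  # record the loss before updating
        grad = -2.0 * (y - y_hat) * x    # dL/dw via the chain rule
        w = w - alpha * grad             # update rule: w = w - alpha * gradient
    return w, losses

# Hypothetical starting weight and input, chosen so the first prediction is 0.7.
w_final, losses = train(w=0.7, x=1.0, y=1.0, alpha=0.1, steps=50)
print(f"first loss={losses[0]:.4f}  second loss={losses[1]:.4f}  final w={w_final:.4f}")
```

With this learning rate each step removes a fixed fraction of the remaining error, so the loss shrinks steadily. A learning rate that is too large can overshoot and make the loss grow instead, which is one reason α is treated as a hyperparameter to tune.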

What's Next?

Backpropagation is the foundation of nearly all modern deep learning. Next, explore optimization algorithms such as SGD with momentum and Adam, which build on the basic update rule to make training faster and more stable.