
Backpropagation

Summary

How neural networks learn: gradients, chain rule, vanishing gradients, and residual connections

TL;DR

Backpropagation computes how to adjust each weight to reduce error. Gradients flow backward through the network using the chain rule. Understanding vanishing gradients and residual connections explains why transformers scale to hundreds of layers.

Visual Overview

The Learning Problem and Gradient Intuition
THE LEARNING PROBLEM

                                                           
   We have:                                                
     • Input data                                          
     • Desired outputs (labels)                            
     • A loss function (measures how wrong we are)         
     • Millions of weights to adjust                       
                                                           
   We need:                                                
     • Which direction to adjust each weight               
     • How much to adjust each weight                      
                                                           
   The answer: compute the GRADIENT of the loss with
   respect to each weight, then move each weight in the
   opposite direction to decrease the loss.
                                                           


GRADIENT INTUITION

 
 Plot: loss (vertical axis) vs. weight value (horizontal axis)

   Loss = 5   current position
   Loss = 3   gradient points uphill
   Loss = 1   after update (moved opposite to the gradient)
   Loss = 0   minimum

 new_weight = weight - learning_rate × gradient
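
The update rule can be tried on a one-weight toy problem. This is a minimal sketch, assuming a hypothetical loss L(w) = w² (not from the original text), whose gradient is 2w and whose minimum sits at w = 0.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 0.
    return w ** 2

def grad(w):
    # Analytic derivative of the loss above: d(w^2)/dw = 2w.
    return 2 * w

def gradient_descent(w, learning_rate=0.1, steps=20):
    # Record the loss before any update and after every update.
    history = [loss(w)]
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # step opposite to the gradient
        history.append(loss(w))
    return w, history

w_final, losses = gradient_descent(w=2.0)
print(f"start loss = {losses[0]:.4f}, final loss = {losses[-1]:.6f}")
```

With this learning rate each step multiplies the weight by 0.8, so the loss falls at every step and the weight approaches the minimum.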
 


The Chain Rule

Neural networks are compositions of functions. To compute gradients through compositions, we use the chain rule.

Chain Rule

                                                           
   If y = f(g(x)), then:                                   
                                                           
     dy/dx = (dy/dg) × (dg/dx)                             
                                                           
   "Derivative of outer × derivative of inner"             
                                                           


SIMPLE EXAMPLE

 
 y = (2x + 1)² 
 
 Let g = 2x + 1, so y = g² 
 
 dy/dg = 2g (derivative of square) 
 dg/dx = 2 (derivative of 2x + 1) 
 
 dy/dx = 2g × 2 = 2(2x + 1) × 2 = 4(2x + 1) 
 

Why it matters: A neural network is a long chain of operations. The chain rule lets us compute how the loss changes with respect to weights deep in the network.
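
The simple example above can be checked numerically. The sketch below compares the chain-rule result against a central finite difference; the helper names are illustrative only.

```python
def g(x):
    # Inner function: g = 2x + 1
    return 2 * x + 1

def y(x):
    # Outer function applied to the inner one: y = g^2
    return g(x) ** 2

def dy_dx_chain(x):
    dy_dg = 2 * g(x)   # derivative of the square
    dg_dx = 2          # derivative of 2x + 1
    return dy_dg * dg_dx

def dy_dx_numeric(x, h=1e-6):
    # Central finite difference, used as an independent check.
    return (y(x + h) - y(x - h)) / (2 * h)

x = 3.0
print(dy_dx_chain(x))     # 4 * (2*3 + 1) = 28.0
print(dy_dx_numeric(x))   # should agree closely with the line above
```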


Forward Pass vs Backward Pass

Training has two phases per batch:

Forward Pass vs Backward Pass
FORWARD PASS (compute predictions)

                                                           
   Input -> [Layer 1] -> [Layer 2] -> [Layer 3] -> Output
                                                     |
                                                     v
                                            Compare with label
                                                     |
                                                     v
                                                    Loss
                                                           


BACKWARD PASS (compute gradients)

 
 dL/dW1 <- dL/dW2 <- dL/dW3 <- dL/dW4 <- dL (from loss)
 
 Gradients flow backward through the network. 
 Each layer passes gradients to the previous layer. 
 

This is why it’s called “back” propagation — gradients propagate from the output back to the input.
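
Both passes can be written out by hand for a very small network. This is a sketch under simplifying assumptions that are not in the original: two layers, one scalar weight each, no activation function, and a squared-error loss.

```python
def forward(x, w1, w2, label):
    h = w1 * x                    # layer 1
    pred = w2 * h                 # layer 2 (the output)
    loss = (pred - label) ** 2    # compare with the label
    return h, pred, loss

def backward(x, w1, w2, label):
    h, pred, _ = forward(x, w1, w2, label)
    d_pred = 2 * (pred - label)   # dL/dpred: the gradient starts at the loss
    d_w2 = d_pred * h             # gradient for layer 2's weight
    d_h = d_pred * w2             # gradient handed back to layer 1
    d_w1 = d_h * x                # gradient for layer 1's weight
    return d_w1, d_w2

x, w1, w2, label = 1.5, 0.8, -0.5, 1.0
d_w1, d_w2 = backward(x, w1, w2, label)
print(d_w1, d_w2)
```

Note that `d_w1` is built from `d_h`, which layer 2 computed first: each layer reuses the gradient it receives from the layer after it.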


The Vanishing Gradient Problem

Early deep networks had a critical issue: gradients shrank exponentially on their way back to the earlier layers.

Vanishing Gradient Problem
GRADIENT FLOW IN A 4-LAYER NETWORK

                                                           
   Forward pass:                                           
     Input -> [L1] -> [L2] -> [L3] -> [L4] -> Loss

   Backward pass (gradients):
     dL/dW1 <- dL/dW2 <- dL/dW3 <- dL/dW4 <- dL

   At each layer, gradients multiply:
     dL/dW1 = (local_1) × (local_2) × (local_3) × (local_4)

   If each local gradient is less than 1 (for example, 0.5):
     0.5 × 0.5 × 0.5 × 0.5 = 0.0625   -> 16x smaller!

   For 10 layers with 0.5 gradients:
     0.5^10 ≈ 0.001   -> gradient practically zero
                                                           
   Early layers stop learning.                             
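
The arithmetic above is easy to reproduce. The sketch below multiplies the same local gradient once per layer; the value 0.5 and the depths are the illustrative numbers from the diagram plus one deeper case.

```python
def gradient_scale(local_gradient, num_layers):
    # Multiply one local gradient per layer, as the chain rule does.
    scale = 1.0
    for _ in range(num_layers):
        scale *= local_gradient
    return scale

for depth in (4, 10, 50):
    print(depth, gradient_scale(0.5, depth))
```

At depth 50 the factor is below 1e-15, so an update of that size leaves the earliest weights effectively unchanged.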
                                                           


WHY SIGMOID CAUSES THIS

 
 Sigmoid: s(x) = 1/(1 + e^(-x)) 
 Gradient: s'(x) = s(x) × (1 - s(x)) 
 
 Maximum gradient: 0.25 (when x = 0) 
 Usually much smaller. 
 
 s'(x) is always at most 0.25 
 Multiply many of these: gradient vanishes 
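
The 0.25 ceiling can be confirmed directly from the formula. A short sketch using only the standard library:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # s'(x) = s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0):
    print(x, sigmoid_grad(x))

# Largest derivative over a grid of inputs from -10 to 10.
largest = max(sigmoid_grad(i / 10) for i in range(-100, 101))
print(largest)
```

The derivative peaks at x = 0 and falls off quickly on both sides, which is why the product over many sigmoid layers shrinks so fast.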
 


Solutions to Vanishing Gradients

ReLU Activation

ReLU Gradient

                                                           
   ReLU(x) = max(0, x)                                     
                                                           
   Gradient:                                               
     x > 0: gradient = 1                                   
     x < 0: gradient = 0                                   
                                                           
   When active (x > 0), gradient = 1                       
   No shrinking! 1 × 1 × 1 × 1 = 1                         
                                                           
   Problem: "Dead neurons" — if a neuron is always in      
            x < 0 region, gradient always 0, never learns  
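
The ReLU gradient and the dead-neuron condition can be sketched as follows. Two assumptions are made here: the gradient at exactly x = 0 is taken to be 0 (a common convention), and `is_dead` is a hypothetical helper that checks one batch of pre-activations only.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 in the active region, 0 otherwise (x = 0 treated as inactive).
    return 1.0 if x > 0 else 0.0

def is_dead(pre_activation_batch):
    # Hypothetical check: no input in the batch activates the neuron,
    # so every gradient through it is 0 for this batch.
    return all(relu_grad(x) == 0.0 for x in pre_activation_batch)

pre_activations = [-2.0, -0.5, 0.0, 0.5, 3.0]
print([relu(x) for x in pre_activations])
print([relu_grad(x) for x in pre_activations])
print(is_dead([-3.0, -1.0]), is_dead([-3.0, 0.2]))
```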
                                                           

Skip Connections (Residual Connections)

Residual Connections
RESIDUAL CONNECTION

                                                           
   Standard layer:                                         
     output = f(input)                                     
                                                           
   Residual layer:                                         
     output = input + f(input)
              ^
              This is the skip connection
                                                           
   Gradient flow:                                          
     d(output)/d(input) = 1 + df/d(input)
                          ^
                          The "1" comes from the skip path and does not depend on f
                                                           
   Even if f's gradient vanishes, the "1" remains.         
   Gradients have a highway to flow through.               
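
To see the effect over many layers, the sketch below compares a plain stack with a residual stack. It assumes, for illustration only, that every layer has the same small local gradient f'(x) = 0.01.

```python
def plain_chain_gradient(local_grads):
    # Standard layers: the factors are f'(x) for each layer.
    total = 1.0
    for g in local_grads:
        total *= g
    return total

def residual_chain_gradient(local_grads):
    # Residual layers: each factor is d(x + f(x))/dx = 1 + f'(x).
    total = 1.0
    for g in local_grads:
        total *= 1.0 + g
    return total

local_grads = [0.01] * 10   # assumed value, chosen to make f nearly "vanish"
print(plain_chain_gradient(local_grads))
print(residual_chain_gradient(local_grads))
```

The plain product collapses toward zero, while the residual product stays near 1 because each factor keeps the constant term from the skip path.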
                                                           


GRADIENT HIGHWAY VISUAL

 
 Without residuals:
 Input -> [L1] -> [L2] -> [L3] -> [L4] -> Output

 Gradients must pass through every layer (shrink)

 With residuals:
 Input ---------------------------------> + -> Out
   |                                      ^
   +--> [L1] -> [L2] -> [L3] -> [L4] -----+

 Gradients can skip layers entirely.
 "Residual stream" flows directly from input to output.
 

This is a large part of why deep transformers can be trained at all. Every attention sublayer and every FFN sublayer is wrapped in a residual connection:

# Transformer layer (simplified)
x = x + attention(x)    # residual around attention
x = x + ffn(x)          # residual around FFN

Why This Matters for Modern Models

Understanding LoRA

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices.

LoRA Gradient Flow

                                                           
   Base model weights: Frozen (no gradients computed)      
   LoRA adapters: Trainable                                
                                                           
   W_new = W_base + A × B
             ^        ^
           frozen   trained
                                                           
   Only A and B receive gradients.                         
   Far fewer trainable parameters -> much faster training.
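
The formula and the size saving can be illustrated with plain lists. This is a sketch, not a training implementation: `matmul`, `lora_weight` and `trainable_fraction` are hypothetical helpers, and the sizes (a 1024 × 1024 weight, rank 8) are assumed for the example.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w_base, a, b):
    # W_new = W_base + A x B. w_base is only read, never modified.
    delta = matmul(a, b)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_base, delta)]

def trainable_fraction(d, r):
    base_params = d * d               # frozen d x d weight
    adapter_params = d * r + r * d    # A is d x r, B is r x d
    return adapter_params / base_params

w_base = [[1.0, 0.0], [0.0, 1.0]]   # 2 x 2 frozen weight
a = [[0.5], [0.0]]                  # 2 x 1 adapter (rank 1)
b = [[0.0, 2.0]]                    # 1 x 2 adapter
print(lora_weight(w_base, a, b))
print(trainable_fraction(d=1024, r=8))
```

Under these assumed sizes the adapters hold about 1.6% as many parameters as the frozen weight, and only those would need gradients and optimizer state.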
                                                           

Understanding Training Instability

Diagnosing Training Issues

                                                           
   Loss not decreasing:                                    
     • Gradients too small? (vanishing)                    
     • Learning rate too low?                              
     • Bad initialization?                                 
                                                           
   Loss explodes (goes to NaN):                            
     • Gradients too large? (exploding)                    
     • Learning rate too high?                             
     • Missing normalization?                              
                                                           
   Loss oscillates wildly:                                 
     • Learning rate too high?                             
     • Batch size too small?                               
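
The learning-rate items in this checklist can be reproduced on a toy problem. The sketch assumes a hypothetical loss L(w) = w² and three illustrative learning rates; real models are far less tidy, so treat it as intuition only.

```python
def run(learning_rate, w=1.0, steps=20):
    # Gradient descent on L(w) = w^2, returning the loss seen at each step.
    losses = []
    for _ in range(steps):
        losses.append(w ** 2)
        w = w - learning_rate * 2 * w   # gradient of w^2 is 2w
    return losses

too_low = run(0.001)    # loss barely moves
reasonable = run(0.1)   # loss decreases steadily
too_high = run(1.1)     # every step overshoots and the loss grows

for name, losses in (("too low", too_low),
                     ("reasonable", reasonable),
                     ("too high", too_high)):
    print(f"{name}: first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

In the last case each update multiplies the weight by -1.2, so it flips sign and grows; run long enough, this is the pattern that ends in overflow or NaN.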
                                                           


Key Formulas Summary

Backprop Essentials

                                                           
   Chain rule:                                             
     dy/dx = (dy/dg) × (dg/dx)                             
                                                           
   Weight update:                                          
     w_new = w_old - learning_rate × (dLoss/dw)            
                                                           
   Residual gradient:                                      
     d(x + f(x))/dx = 1 + df/dx                            
                                                           
   ReLU gradient:                                          
     d(max(0,x))/dx = 1 if x > 0, else 0                   
                                                           


When This Matters

Situation | Concept to apply
Model isn’t learning | Check for vanishing gradients, dead ReLUs
Training explodes to NaN | Gradients exploding, reduce LR or add norm
Understanding LoRA | Only adapter params receive gradients
Understanding residual connections | Gradient highways for deep networks
Understanding transformer architecture | Residual stream is the core design
Debugging fine-tuning | Gradients to frozen params = 0

Production signal

Why this concept matters

Interview: 65% of ML interviews
Production: Understanding training dynamics
Performance: Foundation for LoRA, fine-tuning