AdamW is the standard optimizer for transformers. Use warmup to prevent early instability, cosine decay for pre-training, linear decay for fine-tuning, and gradient clipping to prevent explosions. Fine-tuning needs 10-100x smaller learning rates than pre-training.
Visual Overview
SGD Update and Problems
SGD UPDATE

    w = w - lr x gradient

    Where:
      w        = weights
      lr       = learning rate
      gradient = dLoss/dw

SGD PROBLEMS

    1. OSCILLATION IN VALLEYS
       The loss surface has steep sides and a shallow floor.
       SGD bounces side-to-side, making slow progress toward the minimum.

    2. SAME LEARNING RATE FOR ALL
       Some parameters need big updates, others small.
       A single LR can't satisfy both.
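A minimal sketch of this update in NumPy (the toy quadratic loss is illustrative, not from the text):

    import numpy as np

    def sgd_step(w, grad, lr=0.1):
        # Plain SGD: move the weights against the gradient, scaled by the learning rate.
        return w - lr * grad

    # Toy example: minimize loss = w^2, so dLoss/dw = 2w
    w = np.array([5.0])
    for _ in range(20):
        grad = 2 * w
        w = sgd_step(w, grad)
    print(w)  # approaches 0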
Momentum
Add “velocity” to SGD. Accumulate gradient direction over time.
Momentum Update and Intuition
MOMENTUM UPDATE

    velocity = beta x velocity + gradient
    w = w - lr x velocity

    Where:
      beta     = momentum coefficient (typically 0.9)
      velocity = accumulated gradient direction

MOMENTUM INTUITION

    Think of a ball rolling downhill.

    Without momentum:
      Step 1: gradient = [1, 0]    → move [1, 0]
      Step 2: gradient = [-1, 0.1] → move [-1, 0.1]   (oscillating!)

    With momentum (beta = 0.9):
      Step 1: velocity = [1, 0]                   → move [1, 0]
      Step 2: velocity = 0.9 x [1, 0] + [-1, 0.1]
                       = [0.9, 0] + [-1, 0.1]
                       = [-0.1, 0.1]              → much smaller oscillation!

    A consistent direction gets amplified.
    An oscillating direction gets dampened.
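A minimal sketch of the momentum update above in NumPy; in PyTorch the same behavior comes from torch.optim.SGD(..., momentum=0.9):

    import numpy as np

    def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
        # Accumulate the gradient direction, then step along the velocity.
        velocity = beta * velocity + grad
        w = w - lr * velocity
        return w, velocity

    # Consistent gradients grow the velocity; oscillating gradients cancel out.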
Adam (Adaptive Moment Estimation)
Combines momentum with adaptive learning rates per parameter.
Adam Update and Why It Works
ADAM UPDATE

    m = beta1 x m + (1 - beta1) x gradient     # 1st moment
    v = beta2 x v + (1 - beta2) x gradient^2   # 2nd moment

    m_hat = m / (1 - beta1^t)                  # Bias correction
    v_hat = v / (1 - beta2^t)

    w = w - lr x m_hat / (sqrt(v_hat) + eps)

    Default hyperparameters:
      beta1 = 0.9    (momentum decay)
      beta2 = 0.999  (variance decay)
      eps   = 1e-8   (numerical stability)

WHY ADAM WORKS WELL

    1. MOMENTUM (m)
       Same as SGD+momentum: damps oscillation.

    2. ADAPTIVE LR (v)
       Parameters with large gradients → smaller LR
       Parameters with small gradients → larger LR

       High-gradient param: lr / sqrt(large_v) = small step
       Low-gradient param:  lr / sqrt(small_v) = large step

    3. BIAS CORRECTION
       Early steps: m and v are biased toward 0.
       The correction compensates for this.
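The same update written as a minimal NumPy function (a sketch of the formulas above, not a production optimizer):

    import numpy as np

    def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # t is the 1-based step count, needed for bias correction.
        m = beta1 * m + (1 - beta1) * grad       # 1st moment (momentum)
        v = beta2 * v + (1 - beta2) * grad**2    # 2nd moment (per-parameter scale)
        m_hat = m / (1 - beta1**t)               # bias correction
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v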
Adam is the default choice for most deep learning.
AdamW
Adam with proper weight decay. The standard for transformers.
AdamW vs Adam + L2
    Adam + L2 regularization (WRONG):
      gradient = task_gradient + lambda x w
      ... adam update with this gradient ...

      Problem: weight decay is mixed into the adaptive LR.
      High-variance params get less regularization.

    AdamW (CORRECT):
      gradient = task_gradient        # NO weight decay here
      ... adam update with this gradient ...
      w = w - lr x lambda x w         # Decay applied AFTER

      Weight decay is truly decoupled from the optimization.
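A minimal sketch of the decoupled update in NumPy (illustrative only; in practice torch.optim.AdamW implements this for you):

    import numpy as np

    def adamw_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01):
        m = beta1 * m + (1 - beta1) * grad       # task gradient only, no decay term
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        w = w - lr * weight_decay * w            # decay applied separately, after
        return w, m, v

    # PyTorch equivalent: torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)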
Always use AdamW for transformers, not Adam with L2.
Learning Rate Schedules
The learning rate should change during training: high early on (after a short warmup), lower as training progresses.
Linear Decay
    lr = initial_lr x (1 - step / total_steps)

    LR falls in a straight line from initial_lr at step 0 to 0 at the final step T.
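As a function (a direct translation of the formula above):

    def linear_decay_lr(step, total_steps, initial_lr):
        # LR shrinks linearly from initial_lr (step 0) to 0 (step total_steps).
        return initial_lr * (1 - step / total_steps)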
Cosine Decay
    lr = lr_min + 0.5 x (lr_max - lr_min)
               x (1 + cos(pi x step / total_steps))

    LR follows a half-cosine from lr_max at step 0 down to lr_min at the final step T.

    Smooth decay, lingers longer at mid-range LR.
    Often works better than linear.
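As a function (again a direct translation of the formula):

    import math

    def cosine_decay_lr(step, total_steps, lr_max, lr_min=0.0):
        # Half-cosine from lr_max (step 0) down to lr_min (step total_steps).
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))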
Warmup
    Start with a very low LR, ramp up to the peak LR over the warmup steps, then decay for the rest of training.

    Warmup prevents early instability.
    Gradients are noisy at the start (random weights).
    High LR + noisy gradients = explosion.
Typical warmup: 1-5% of total training steps.
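A minimal sketch combining linear warmup with cosine decay (the function name and arguments are illustrative):

    import math

    def warmup_cosine_lr(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
        # Linear warmup from 0 to lr_max, then cosine decay down to lr_min.
        if step < warmup_steps:
            return lr_max * step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))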
Common Configurations
Pre-training LLM
Pre-training LLM Config
Optimizer: AdamW
beta1 = 0.9, beta2 = 0.95
Weight decay: 0.1
LR: 1e-4 to 3e-4
Schedule: Cosine decay with warmup
Warmup: 2000 steps (or 1% of total)
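A minimal sketch of this config in PyTorch; model and total_steps are placeholders, and the scheduler helper assumes the Hugging Face transformers library is available:

    import torch
    from transformers import get_cosine_schedule_with_warmup  # Hugging Face helper

    optimizer = torch.optim.AdamW(
        model.parameters(),        # `model` is your transformer
        lr=3e-4,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=2000,
        num_training_steps=total_steps,   # total optimizer steps for the run
    )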
Fine-tuning
Fine-tuning Config
Optimizer: AdamW
beta1 = 0.9, beta2 = 0.999
Weight decay: 0.01
LR: 1e-5 to 5e-5 (10-100x smaller than pre-training)
Schedule: Linear decay with warmup
Warmup: 3-5% of total steps
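The same pattern for fine-tuning, with a smaller LR, standard betas, and a linear schedule (again assuming the Hugging Face scheduler helpers; model and total_steps are placeholders):

    import torch
    from transformers import get_linear_schedule_with_warmup

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=2e-5,
        betas=(0.9, 0.999),
        weight_decay=0.01,
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),   # 5% warmup
        num_training_steps=total_steps,
    )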
Quick Reference
    Scenario            LR            Schedule     Warmup
    Pre-training LLM    1e-4 - 3e-4   Cosine       1-2%
    Fine-tuning LLM     1e-5 - 5e-5   Linear       3-5%
    Fine-tuning BERT    2e-5 - 5e-5   Linear       10%
    Training CNN        1e-3          Step decay   None
Gradient Clipping
Limit gradient magnitude to prevent explosions.
    Clip by global norm (most common):
      total_norm = sqrt(SUM(gradient^2))
      if total_norm > max_norm:
          gradient = gradient x (max_norm / total_norm)

    Typical max_norm: 1.0

    Prevents a single bad batch from destroying the model.
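In PyTorch, global-norm clipping is a single call between the backward pass and the optimizer step; a minimal helper sketch (names are illustrative):

    import torch

    def clip_and_step(model, optimizer, loss, max_norm=1.0):
        # Backprop, rescale gradients if their global norm exceeds max_norm, then update.
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
        optimizer.step()
        optimizer.zero_grad()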
When to use:
Always for transformers
When training is unstable
When loss spikes occasionally
Debugging Optimization
LOSS NOT DECREASING

    Symptoms:
      • Loss stays flat from the start
      • Or decreases very slowly

    Causes:
      • Learning rate too low
      • Warmup too long
      • Wrong optimizer config

    Debug steps:
      1. Try a 10x higher LR
      2. Reduce warmup steps
      3. Check gradient values (should be non-zero)
      4. Verify the optimizer is updating weights

LOSS EXPLODES (NaN)

    Symptoms:
      • Loss suddenly becomes NaN
      • Training crashes

    Causes:
      • Learning rate too high
      • Missing gradient clipping
      • No warmup

    Debug steps:
      1. Add gradient clipping (max_norm=1.0)
      2. Reduce LR by 10x
      3. Add warmup (5% of steps)
      4. Check for numerical issues in the data

FINE-TUNING DESTROYS MODEL

    Symptoms:
      • After fine-tuning, the model is worse than the base
      • "Catastrophic forgetting"

    Causes:
      • Learning rate too high
      • No warmup
      • Weight decay too high

    Debug steps:
      1. Use a much smaller LR (1e-5 or lower)
      2. Add warmup
      3. Reduce weight decay
      4. Consider LoRA instead of full fine-tuning
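For the "check gradient values" step above, a small helper like this can be run right after loss.backward() to spot zero, exploding, or NaN gradients (a sketch; names are illustrative):

    import torch

    def report_grad_norms(model):
        # One line per parameter tensor: all-zero norms suggest a dead graph,
        # huge or NaN norms suggest an exploding loss.
        for name, p in model.named_parameters():
            if p.grad is None:
                print(f"{name}: no gradient")
            else:
                print(f"{name}: grad norm = {p.grad.norm().item():.3e}")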