
Optimization

SGD, Adam, AdamW, learning rate schedules, warmup, and gradient clipping for training

TL;DR

AdamW is the standard optimizer for transformers. Use warmup to prevent early instability, cosine decay for pre-training, linear decay for fine-tuning, and gradient clipping to prevent explosions. Fine-tuning needs 10-100x smaller learning rates than pre-training.

Visual Overview

SGD Update and Problems
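Plain SGD moves every parameter a step of size lr along the negative gradient: theta <- theta - lr * grad. Its problems: one global learning rate is shared by all parameters, noisy mini-batch gradients cause zig-zagging, and progress stalls in ravines and flat regions. A minimal PyTorch sketch of the update (the tiny linear model and random data are placeholders):

```python
import torch

# Toy model and data; stand-ins for a real network and batch.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

lr = 0.1  # one global learning rate shared by every parameter

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Vanilla SGD update: theta <- theta - lr * grad
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
        p.grad = None

# Built-in equivalent: torch.optim.SGD(model.parameters(), lr=0.1)
```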

Momentum

Adds a “velocity” term to SGD: the gradient direction accumulates over time, so persistent directions speed up and noise averages out.

Momentum Update and Intuition
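In the common PyTorch convention the update is v <- mu * v + grad, then theta <- theta - lr * v: a velocity buffer accumulates the gradient, so directions that persist across steps build up speed while noise partially cancels. A minimal sketch (mu = 0.9 is a typical default, not a requirement):

```python
import torch

model = torch.nn.Linear(10, 1)
lr, mu = 0.1, 0.9
velocity = [torch.zeros_like(p) for p in model.parameters()]

loss = torch.nn.functional.mse_loss(model(torch.randn(32, 10)), torch.randn(32, 1))
loss.backward()

with torch.no_grad():
    for p, v in zip(model.parameters(), velocity):
        v.mul_(mu).add_(p.grad)  # v <- mu * v + grad
        p -= lr * v              # theta <- theta - lr * v
        p.grad = None

# Built-in equivalent: torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```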

Adam (Adaptive Moment Estimation)

Combines momentum with adaptive learning rates per parameter.

Adam Update and Why It Works
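Adam tracks two exponential moving averages per parameter: the first moment m (momentum, controlled by beta1) and the second moment v (squared gradients, controlled by beta2). After bias correction, the step is divided by sqrt(v), so parameters with consistently large gradients take smaller effective steps. A sketch of a single update with the common default hyperparameters (the helper function and its call pattern are illustrative):

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor (illustrative helper)."""
    with torch.no_grad():
        m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment: momentum
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment: squared grads
        m_hat = m / (1 - beta1 ** t)                         # bias correction (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        param -= lr * m_hat / (v_hat.sqrt() + eps)           # per-parameter adaptive step

# Tiny usage example on a dummy parameter:
p = torch.zeros(3)
adam_step(p, grad=torch.ones(3), m=torch.zeros(3), v=torch.zeros(3), t=1)

# Built-in equivalent:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```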

Adam is the default choice for most deep learning workloads.


AdamW

Adam with decoupled weight decay. The standard optimizer for transformers.

AdamW vs Adam + L2
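The difference is where weight decay enters. Adam + L2 folds the decay into the gradient, so it gets rescaled by the adaptive 1/sqrt(v) term and weights with a large gradient history are barely decayed. AdamW applies the decay directly to the weights, independent of the adaptive scaling. A sketch of both, with illustrative hyperparameters:

```python
import torch

model = torch.nn.Linear(10, 1)

# Adam + L2: the decay term is added to the gradient, then distorted by
# the adaptive rescaling.
#   grad  <- grad + wd * theta
#   theta <- theta - lr * m_hat / (sqrt(v_hat) + eps)
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: the decay is applied to the weights directly, independent of the
# adaptive term.
#   theta <- theta - lr * m_hat / (sqrt(v_hat) + eps) - lr * wd * theta
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```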

Always use AdamW for transformers, not Adam with L2.


Learning Rate Schedules

The learning rate should change during training: high initially, lower later.

Linear Decay

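Linear decay drops the learning rate from its peak to (near) zero in a straight line over training. A minimal sketch using LambdaLR (the model and step count are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

total_steps = 10_000  # placeholder

# Multiplier on the base LR goes 1.0 -> 0.0 linearly over total_steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)

# Training loop: optimizer.step(), then scheduler.step(), once per batch.
```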

Cosine Decay

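Cosine decay keeps the learning rate near its peak for longer, then falls smoothly along half a cosine cycle to a minimum value. A minimal sketch (annealing to 10% of the peak is a common but not universal choice):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 100_000  # placeholder

# LR anneals from the peak (3e-4) to eta_min along half a cosine cycle.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=3e-5
)
```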

Warmup


Typical warmup: 1-5% of total training steps.
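Warmup ramps the learning rate linearly from zero to its peak over the first steps before the decay schedule takes over, which avoids large, destabilizing updates while Adam's moment estimates are still unreliable. A sketch combining linear warmup with cosine decay via LambdaLR (the HuggingFace get_cosine_schedule_with_warmup helper implements the same shape):

```python
import math
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 100_000
warmup_steps = int(0.02 * total_steps)  # ~2% warmup (typical range: 1-5%)

def lr_multiplier(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp 0 -> 1
    # Cosine decay 1 -> 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
```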


Common Configurations

Pre-training LLM

Pre-training LLM Config
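A representative pre-training setup for a GPT-style model; exact values vary with model size, but peak LR around 3e-4, betas (0.9, 0.95), weight decay 0.1, cosine decay with ~1% warmup, and clipping at 1.0 are in the commonly published range. The stand-in model and step counts are placeholders:

```python
import math
import torch

model = torch.nn.Linear(10, 1)  # stand-in for your transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                # peak learning rate
    betas=(0.9, 0.95),      # beta2=0.95 is common for LLM pre-training
    weight_decay=0.1,
)

total_steps, warmup_steps = 300_000, 3_000  # ~1% warmup (placeholders)

def warmup_cosine(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))  # decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Each step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```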

Fine-tuning

Fine-tuning Config
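A representative fine-tuning setup: peak LR 10-100x smaller than pre-training, linear decay, a few percent warmup, and the same clipping. Values are typical starting points, not prescriptions:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the pre-trained model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,                # 10-100x smaller than pre-training
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

total_steps, warmup_steps = 10_000, 500  # ~5% warmup (placeholders)

def warmup_linear(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_linear)
```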

Quick Reference

| Scenario | LR | Schedule | Warmup |
|---|---|---|---|
| Pre-training LLM | 1e-4 - 3e-4 | Cosine | 1-2% |
| Fine-tuning LLM | 1e-5 - 5e-5 | Linear | 3-5% |
| Fine-tuning BERT | 2e-5 - 5e-5 | Linear | 10% |
| Training CNN | 1e-3 | Step decay | None |

Gradient Clipping

Limit the gradient norm to prevent exploding gradients.


When to use:

  • Always for transformers
  • When training is unstable
  • When loss spikes occasionally
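Clipping goes after loss.backward() and before optimizer.step(); clip_grad_norm_ rescales the gradients so their global L2 norm is at most max_norm (1.0 is the usual choice for transformers). A minimal sketch with a toy model:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```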

Debugging Optimization

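The cheapest diagnostics to log each step are the loss, the current learning rate, and the gradient norm, which clip_grad_norm_ returns (pre-clipping) for free. A minimal sketch of such a loop, using a toy model and a constant schedule as placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda s: 1.0)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()

    # clip_grad_norm_ returns the pre-clipping global norm: a free diagnostic.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    lr_now = scheduler.get_last_lr()[0]

    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

    if step % 10 == 0:
        print(f"step={step} loss={loss.item():.3f} lr={lr_now:.2e} grad_norm={grad_norm.item():.2f}")
```

A spiking gradient norm or exploding loss usually means the LR is too high or warmup is missing; a loss that never moves suggests the LR is too low or something upstream (data, labels, masking) is broken.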

When This Matters

| Situation | What to know |
|---|---|
| Training transformers | Use AdamW, not Adam |
| Fine-tuning | LR 10-100x smaller than pre-training |
| Training unstable | Add warmup and gradient clipping |
| Loss not decreasing | Try a higher LR |
| Loss exploding | Lower the LR, add gradient clipping |
| Understanding configs | beta1 = momentum, beta2 = variance averaging |
| Choosing a schedule | Cosine for pre-training, linear for fine-tuning |


Interview Notes
  • Interview relevance: ~60% of ML interviews
  • Production impact: every training and fine-tuning job
  • Performance: the right optimizer can 2-3x training speed