TL;DR
Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving. Use all of these when fine-tuning on small datasets.
Visual Overview

    Training loss:    ↓ decreasing nicely
    Validation loss:  ↓ then ↑ starts increasing

    [Plot: loss (0 to 2) vs. epochs (0 to 100). The train loss curve keeps
    going down while the val loss curve starts going up.]

    Model memorized training data.
    Doesn't generalize to new data.

When overfitting happens:
- Small dataset, large model
- Training too long
- Model has too much capacity for the task
- No regularization
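
The pattern described above (train loss still falling while validation loss turns upward) can be checked directly from recorded loss curves. The following is a minimal sketch; the helper name `best_epoch_and_gap` and the loss values are made up for illustration.

```python
def best_epoch_and_gap(train_losses, val_losses):
    """Return (index of the lowest val loss, final val-minus-train gap)."""
    if len(train_losses) != len(val_losses) or not val_losses:
        raise ValueError("need two non-empty loss lists of equal length")
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    final_gap = val_losses[-1] - train_losses[-1]
    return best_epoch, final_gap


# Made-up curves: train loss keeps falling, val loss turns upward.
train = [2.0, 1.5, 1.1, 0.8, 0.5, 0.3]
val = [2.1, 1.7, 1.4, 1.3, 1.5, 1.8]

best_epoch, gap = best_epoch_and_gap(train, val)
# Overfitting signature: the best val loss is in the past and a gap has opened.
overfitting = best_epoch < len(val) - 1 and gap > 0
print(best_epoch, round(gap, 2), overfitting)
```

A best epoch well before the last one, combined with a growing gap, is the usual sign that training continued past the point where the model generalizes best.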
Dropout
Randomly “drops” (zeros out) neurons during training. Forces the network to not rely on any single neuron.

    Training (dropout=0.1):
      Randomly zero out 10% of activations on each forward pass

      Before: [0.5, 0.3, 0.8, 0.2, 0.6]
      Mask:   [1,   1,   0,   1,   1  ]   ← 0.8 dropped
      After:  [0.5, 0.3, 0.0, 0.2, 0.6]

    Inference:
      No dropout. Use all neurons.
      Scale activations by (1 - dropout_rate) to compensate.
      (Most frameworks use "inverted" dropout instead: surviving
      activations are scaled by 1 / (1 - dropout_rate) during
      training, so inference needs no scaling.)

WHY IT WORKS

    Without dropout:
      Network can rely on specific neurons
      "Neuron 47 always detects cats"
      If neuron 47 is wrong, whole prediction fails

    With dropout:
      Any neuron might be missing
      Network must build redundant representations
      Multiple neurons learn to detect cats
      More robust predictions

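
To make the mechanism concrete, here is a framework-free sketch of dropout in plain Python. It implements the inverted variant (survivors are scaled up during training), which is a swapped-in choice rather than the inference-time scaling shown in the diagram. The function name, the `rng` parameter and the seed are assumptions for illustration.

```python
import random


def dropout(activations, p, training, rng=random):
    """Inverted dropout: drop each value with probability p while training.

    Survivors are scaled by 1 / (1 - p) so the expected activation is
    unchanged, which means inference needs no rescaling.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # inference: use every neuron as-is
    keep_scale = 1.0 / (1.0 - p)
    return [a * keep_scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
acts = [0.5, 0.3, 0.8, 0.2, 0.6]

train_out = dropout(acts, p=0.1, training=True, rng=rng)
eval_out = dropout(acts, p=0.1, training=False)
```

In PyTorch the same behavior comes from `torch.nn.Dropout`, and the `training` flag corresponds to switching between `model.train()` and `model.eval()`.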
Typical dropout values:
| Model type | Dropout rate |
|---|---|
| Transformers | 0.1 (10%) |
| Older MLPs | 0.5 (50%) |
| CNNs | 0.25-0.5 |
| Fine-tuning | 0.1 or lower |
Where to apply:
- After attention layers
- After FFN layers
- Before final classification layer
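- Usually NOT inside the attention computation itself (some implementations, e.g. BERT and GPT-2, do apply a separate dropout to the attention probabilities)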
Weight Decay (L2 Regularization)
Penalizes large weights by adding their squared sum to the loss.

    Standard loss:
      L = task_loss

    With weight decay (L2 regularization):
      L = task_loss + λ × Σ(w²)

      λ = weight decay coefficient (typically 0.01)
      w = all model weights

WHY IT WORKS

    Large weights = model is very confident about features
                  = likely memorizing training data

    Penalizing large weights:
      • Keeps weights small
      • Model can't "overfit" to any single feature
      • Smoother, more generalizable function

    Small weights = "softer" decision boundaries

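
The penalty term is simple enough to compute by hand. The sketch below evaluates `λ × Σ(w²)` and its gradient for a few made-up weights; the function names and the numbers are illustrative assumptions.

```python
def l2_penalty(weights, lam):
    """Return lam * sum(w^2), the term added to the task loss."""
    return lam * sum(w * w for w in weights)


def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]


weights = [0.5, -2.0, 3.0]
task_loss = 1.25  # made-up value standing in for the real task loss
lam = 0.01

total_loss = task_loss + l2_penalty(weights, lam)
grads = l2_penalty_grad(weights, lam)
```

The gradient is proportional to each weight, so large weights are pulled toward zero harder than small ones. Many texts write the penalty as `(λ/2) × Σ(w²)` so that the gradient is exactly `λ × w`.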
AdamW vs Adam with Weight Decay

    Adam with L2 (WRONG):
      gradient = task_gradient + λ × w
      m, v = update_momentum(gradient)
      w = w - lr × m / sqrt(v)

      Problem: Weight decay is entangled with adaptive LR
               High-variance params get less regularization

    AdamW (CORRECT):
      gradient = task_gradient            ← No λ here
      m, v = update_momentum(gradient)
      w = w - lr × m / sqrt(v) - lr × λ × w
                                 ↑
                                 Decay applied separately

    Weight decay is truly decoupled.
    This is what you should use.

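
The difference between the two update rules shows up even in a single step. The sketch below is a scalar, plain-Python Adam update with bias correction and an `eps` term (both omitted from the pseudocode above); the function name and hyperparameter defaults are assumptions. To isolate the decay, the task gradient is set to zero.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, decoupled_wd=0.0):
    """One Adam update for a single scalar weight.

    l2 > 0           -> Adam with L2: decay is folded into the gradient.
    decoupled_wd > 0 -> AdamW: decay is applied outside the adaptive step.
    """
    g = grad + l2 * w                    # the L2 term enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    w_new = w - adaptive - lr * decoupled_wd * w
    return w_new, m, v


# Zero task gradient, first step (t=1), so only the decay moves the weights.
shrink_l2 = {}
shrink_adamw = {}
for w0 in (2.0, 0.2):
    w_l2, _, _ = adam_step(w0, grad=0.0, m=0.0, v=0.0, t=1, l2=0.01)
    w_wd, _, _ = adam_step(w0, grad=0.0, m=0.0, v=0.0, t=1, decoupled_wd=0.01)
    shrink_l2[w0] = w0 - w_l2
    shrink_adamw[w0] = w0 - w_wd
```

In this simplified setting, Adam with L2 shrinks both weights by roughly `lr` regardless of their size, because the adaptive normalization cancels the magnitude of the decay term. With the decoupled rule the shrinkage is `lr × λ × w`, proportional to the weight. In PyTorch, the decoupled behavior is what `torch.optim.AdamW` with its `weight_decay` argument provides.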
Typical values:
- Language models: 0.01 - 0.1
- Vision models: 0.0001 - 0.01
- Fine-tuning: 0.01 (same as pre-training usually)
Early Stopping
Stop training when validation loss stops improving.

    Monitor:  validation loss (or another metric)
    Patience: how many epochs to wait for improvement

    [Plot: loss (0 to 2) vs. epochs (0 to 50). Train loss keeps falling
    while val loss flattens out; an arrow marks "STOP HERE" at the point
    where val loss stopped improving.]

Implementation:

    # train_one_epoch, evaluate, save_checkpoint and load_checkpoint are
    # placeholders for your own training code.
    best_val_loss = float('inf')
    patience_counter = 0
    patience = 5  # epochs to wait
    max_epochs = 100  # upper bound on training length (example value)

    for epoch in range(max_epochs):
        train_loss = train_one_epoch()
        val_loss = evaluate()

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            save_checkpoint()  # Save best model
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping!")
                break

    # Load best checkpoint for final model
    load_checkpoint()

Typical patience values:
- Fine-tuning: 2-3 epochs
- Training from scratch: 5-10 epochs
- Large models: 3-5 epochs
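
The patience logic can also be packaged as a small reusable object and exercised on a recorded loss sequence. This is a sketch only: the class name, the `min_delta` option (not part of the description above) and the loss values are assumptions for illustration.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest drop that counts as progress
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 3.
val_losses = [1.9, 1.5, 1.3, 1.2, 1.25, 1.22, 1.3, 1.4]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
```

Here training stops three epochs after the best one, and `stopper.best_epoch` identifies which checkpoint to reload.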
Label Smoothing
Don’t use hard labels (0 or 1). Spread some probability to other classes.

    Hard labels (no smoothing):
      True class = 2: [0, 0, 1, 0]

    Soft labels (smoothing = 0.1):
      True class = 2: [0.033, 0.033, 0.9, 0.033]

    10% of probability spread to other classes.

Why it works:
- Prevents model from being overconfident
- Encourages model to keep some probability for alternatives
- Acts as regularization on the output distribution
Typical value: 0.1 (10% smoothing)
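
The soft-label vector from the example above can be built in a few lines. This sketch follows the convention used in this section, where the true class keeps `1 - smoothing` and the remainder is split evenly over the other classes; the function name is an assumption.

```python
def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Give the true class 1 - smoothing and split the rest over the others."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0 <= true_class < num_classes:
        raise ValueError("true_class out of range")
    off_value = smoothing / (num_classes - 1)
    labels = [off_value] * num_classes
    labels[true_class] = 1.0 - smoothing
    return labels


soft = smooth_labels(true_class=2, num_classes=4, smoothing=0.1)
print([round(p, 3) for p in soft])  # [0.033, 0.033, 0.9, 0.033]
```

Some libraries use a slightly different convention that mixes the hard label with a uniform distribution over all classes, including the true one. With 4 classes and smoothing 0.1, that gives 0.925 for the true class and 0.025 for each of the others, so it is worth checking which variant a given loss function implements.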
Combining Techniques
Regularization techniques stack. Use multiple together.

    1. Dropout: 0.1 after attention and FFN
    2. Weight decay: 0.01 with AdamW
    3. Early stopping: patience=3 on val loss
    4. Label smoothing: 0.1 (for classification)

    For fine-tuning, often reduce dropout (model already regularized).

Common combinations:
| Scenario | Regularization |
|---|---|
| Pre-training large LLM | Weight decay 0.1, dropout 0.1 |
| Fine-tuning | Weight decay 0.01, dropout 0.1, early stopping |
| Small dataset | Dropout 0.3, weight decay 0.1, data augmentation |
| Large dataset | Minimal — dropout 0.1, weight decay 0.01 |
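
One way to keep such combinations consistent across experiments is to store them as named presets. The sketch below copies the values from the table above; the class name, field names and scenario keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularizationConfig:
    dropout: float
    weight_decay: float
    early_stopping: bool = False
    data_augmentation: bool = False


# Values copied from the table above; the scenario keys are made-up names.
PRESETS = {
    "pretrain_large_llm": RegularizationConfig(dropout=0.1, weight_decay=0.1),
    "fine_tuning": RegularizationConfig(0.1, 0.01, early_stopping=True),
    "small_dataset": RegularizationConfig(0.3, 0.1, data_augmentation=True),
    "large_dataset": RegularizationConfig(0.1, 0.01),
}

config = PRESETS["small_dataset"]
```

These presets are starting points rather than tuned values; the right settings still depend on the model and the data.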
Debugging Regularization
STILL OVERFITTING DESPITE REGULARIZATION

    Symptoms:
      • Added dropout, weight decay
      • Train/val gap still large

    Causes:
      • Regularization too weak
      • Model still too large for data
      • Data augmentation would help

    Debug steps:
      1. Increase dropout (0.1 → 0.3)
      2. Increase weight decay (0.01 → 0.1)
      3. Add data augmentation
      4. Use smaller model
      5. Get more data

UNDERFITTING (TRAIN LOSS HIGH)

    Symptoms:
      • Training loss not decreasing enough
      • Model can't fit training data

    Causes:
      • Too much regularization
      • Dropout too high
      • Weight decay too strong
      • Model too small

    Debug steps:
      1. Reduce dropout (0.3 → 0.1)
      2. Reduce weight decay (0.1 → 0.01)
      3. Remove early stopping temporarily
      4. Use larger model

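
The two checklists above can be turned into a rough first-pass triage. In this sketch the function name and both thresholds (`target_train_loss`, `max_gap`) are illustrative assumptions; sensible values depend on the loss function and the task.

```python
def diagnose(train_loss, val_loss, target_train_loss=0.5, max_gap=0.3):
    """Classify a run as 'underfitting', 'overfitting', or 'ok'.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if train_loss > target_train_loss:
        return "underfitting"  # model cannot fit the training data
    if val_loss - train_loss > max_gap:
        return "overfitting"   # fits training data, fails to generalize
    return "ok"


# Debug steps taken from the two checklists above.
SUGGESTIONS = {
    "overfitting": ["increase dropout", "increase weight decay",
                    "add data augmentation", "use smaller model",
                    "get more data"],
    "underfitting": ["reduce dropout", "reduce weight decay",
                     "remove early stopping temporarily", "use larger model"],
    "ok": [],
}

status = diagnose(train_loss=0.2, val_loss=0.9)
next_steps = SUGGESTIONS[status]
```

Underfitting is checked first because a model that cannot fit the training data gains nothing from more regularization.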
When This Matters
| Situation | What to apply |
|---|---|
| Fine-tuning on small dataset | All: dropout, weight decay, early stopping |
| Model overfitting | Increase dropout, weight decay |
| Model underfitting | Decrease regularization |
| Classification task | Add label smoothing |
| Training from scratch | Moderate regularization, data augmentation |
| Using AdamW | Set weight_decay parameter (not in loss) |
| Evaluating model | Ensure dropout is OFF (model.eval()) |
Production signal