Regularization | Concepts

TL;DR

Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving.

Visual Overview

The Overfitting Problem

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Training loss:    ↓ decreasing nicely                   │
│   Validation loss:  ↓ then ↑ starts increasing            │
│                                                           │
│       Loss                                                │
│         │                                                 │
│       2 │                                                 │
│         │  _____ val loss (starts going up)               │
│       1 │   ___________                                   │
│         │    __________ train loss (keeps going down)     │
│       0 └────────────────────────                         │
│         0     epochs      100                             │
│                                                           │
│   Model memorized training data.                          │
│   Doesn't generalize to new data.                         │
│                                                           │
└───────────────────────────────────────────────────────────┘

When overfitting happens:

Small dataset, large model
Training too long
Model has too much capacity for the task
No regularization

Dropout

Dropout randomly “drops” (zeros out) neurons during training, which forces the network not to rely on any single neuron.

Dropout

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Training (dropout=0.1):                                 │
│     Randomly zero out 10% of activations each forward     │
│                                                           │
│     Before: [0.5, 0.3, 0.8, 0.2, 0.6]                     │
│     Mask:   [1,   1,   0,   1,   1  ]  ← 0.8 dropped      │
│     After:  [0.5, 0.3, 0.0, 0.2, 0.6]                     │
│                                                           │
│   Inference:                                              │
│     No dropout. Use all neurons.                          │
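│     Classic: scale activations by (1 - dropout_rate).     │
│     Inverted (most frameworks): scale kept units by       │
│     1 / (1 - dropout_rate) in training; none needed here. │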
│                                                           │
└───────────────────────────────────────────────────────────┘
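The masking step can be sketched in plain Python. This is a minimal illustration rather than a framework implementation: the helper name `dropout` and its `rng` parameter are made up for this example. It uses the inverted variant, which rescales the surviving activations by 1 / (1 - rate) during training so that inference needs no scaling at all.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout over a list of floats (illustrative helper)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # inference: every activation passes through
    keep_prob = 1.0 - rate
    # Keep each activation with probability keep_prob and rescale it so the
    # expected value of the output matches the input.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [0.5, 0.3, 0.8, 0.2, 0.6]
train_out = dropout(x, rate=0.1, training=True, rng=rng)
eval_out = dropout(x, rate=0.1, training=False)
print("train:", train_out)
print("eval: ", eval_out)
```

In training mode each value is either zeroed or divided by 0.9; in inference mode the input comes back unchanged.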

WHY IT WORKS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Without dropout:                                          │
│ Network can rely on specific neurons                      │
│ "Neuron 47 always detects cats"                           │
│ If neuron 47 is wrong, whole prediction fails             │
│                                                           │
│ With dropout:                                             │
│ Any neuron might be missing                               │
│ Network must build redundant representations              │
│ Multiple neurons learn to detect cats                     │
│ More robust predictions                                   │
│                                                           │
└───────────────────────────────────────────────────────────┘

Typical dropout values:

Model type	Dropout rate
Transformers	0.1 (10%)
Older MLPs	0.5 (50%)
CNNs	0.25-0.5
Fine-tuning	0.1 or lower

Where to apply:

After attention layers
After FFN layers
Before final classification layer
On the attention weights themselves only as a separate, optional attention-dropout setting

Weight Decay (L2 Regularization)

Penalizes large weights by adding their squared sum to the loss.

Weight Decay

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Standard loss:                                          │
│     L = task_loss                                         │
│                                                           │
│   With weight decay (L2 regularization):                  │
│     L = task_loss + λ × Σ(w²)                             │
│                                                           │
│   λ = weight decay coefficient (typically 0.01)           │
│   w = all model weights                                   │
│                                                           │
└───────────────────────────────────────────────────────────┘
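As a worked example of the formula, the sketch below computes the penalty and its gradient for three made-up weights. The function names and the numbers are assumptions chosen for illustration. Note that differentiating λ × Σ(w²) gives 2λw per weight; many write-ups fold the factor of 2 into λ (or define the penalty as (λ/2) × Σ(w²)) so that the gradient term is simply λ × w.

```python
def l2_penalty(weights, lam):
    """lam * sum of squared weights, as in the loss formula above."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]

weights = [0.5, -2.0, 0.1]   # made-up weights
task_loss = 0.30             # made-up task loss
lam = 0.01                   # weight decay coefficient

total_loss = task_loss + l2_penalty(weights, lam)
penalty_grad = l2_penalty_grad(weights, lam)
print(f"penalty    = {l2_penalty(weights, lam):.4f}")
print(f"total loss = {total_loss:.4f}")
print("penalty gradient =", penalty_grad)
```

The largest weight (-2.0) contributes almost all of the penalty and receives the largest pull back toward zero, which is the intended effect.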

WHY IT WORKS
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Large weights = model is very confident about features    │
│ = likely memorizing training data                         │
│                                                           │
│ Penalizing large weights:                                 │
│ • Keeps weights small                                     │
│ • Model can't "overfit" to any single feature             │
│ • Smoother, more generalizable function                   │
│                                                           │
│ Small weights = "softer" decision boundaries              │
│                                                           │
└───────────────────────────────────────────────────────────┘

AdamW vs Adam with Weight Decay

AdamW: The Right Way

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Adam with L2 (WRONG):                                   │
│     gradient = task_gradient + λ × w                      │
│     m, v = update_momentum(gradient)                      │
│     w = w - lr × m / sqrt(v)                              │
│                                                           │
│     Problem: Weight decay is entangled with adaptive LR   │
│              High-variance params get less regularization │
│                                                           │
│   AdamW (CORRECT):                                        │
│     gradient = task_gradient           ← No λ here        │
│     m, v = update_momentum(gradient)                      │
│     w = w - lr × m / sqrt(v) - lr × λ × w                 │
│                                ↑                          │
│                          Decay applied separately         │
│                                                           │
│   Weight decay is truly decoupled.                        │
│   This is what you should use.                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
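The difference can be seen on a single scalar weight. The sketch below is a toy simulation, not any library's optimizer: `adam_step` and `shrink_from_decay` are hypothetical helpers, the task gradient is an artificial signal that alternates between +g and -g, and the bias correction and epsilon term are the standard Adam details that the diagram above leaves out. It measures how much smaller the weight ends up with decay than without, for a small and a large gradient scale.

```python
import math

def adam_step(w, task_grad, state, t, lr, weight_decay, decoupled,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight; returns (new_w, new_state)."""
    m, v = state
    # Adam + L2: the decay term goes through the adaptive scaling.
    grad = task_grad if decoupled else task_grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: decay is applied directly to the weight.
        new_w -= lr * weight_decay * w
    return new_w, (m, v)

def shrink_from_decay(grad_scale, decoupled, steps=100, lr=1e-3,
                      weight_decay=0.1):
    """Final weight without decay minus final weight with decay."""
    def run(wd):
        w, state = 1.0, (0.0, 0.0)
        for t in range(1, steps + 1):
            g = grad_scale if t % 2 else -grad_scale  # alternating toy gradient
            w, state = adam_step(w, g, state, t, lr, wd, decoupled)
        return w
    return run(0.0) - run(weight_decay)

for name, decoupled in [("Adam + L2", False), ("AdamW", True)]:
    small = shrink_from_decay(0.1, decoupled)
    large = shrink_from_decay(10.0, decoupled)
    print(f"{name:9s} shrink: small grads={small:.5f}, large grads={large:.5f}")
```

In this toy setup, Adam with L2 shrinks the weight far less when its gradients are large, while AdamW shrinks it by essentially the same amount in both cases.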

Typical values:

Language models: 0.01 - 0.1
Vision models: 0.0001 - 0.01
Fine-tuning: 0.01 (same as pre-training usually)

Early Stopping

Stop training when validation loss stops improving.

Early Stopping

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Monitor: validation loss (or another metric)            │
│   Patience: how many epochs to wait for improvement       │
│                                                           │
│       Loss                                                │
│         │                                                 │
│       2 │                                                 │
│         │  _____ val loss                                 │
│       1 │   ___________                                   │
│         │    __________ train loss                        │
│       0 └────────────────────────                         │
│         0  10  20  30  40  50                             │
│                 ↑                                         │
│            STOP HERE                                      │
│            (val loss stopped improving)                   │
│                                                           │
└───────────────────────────────────────────────────────────┘

Implementation:

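# Placeholders assumed to be defined elsewhere:
#   train_one_epoch(), evaluate(), save_checkpoint(), load_checkpoint()
max_epochs = 100  # example upper bound on training length
best_val_loss = float('inf')
patience_counter = 0
patience = 5  # epochs to wait for an improvement

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        # New best: reset the counter and keep this model
        best_val_loss = val_loss
        patience_counter = 0
        save_checkpoint()
    else:
        # No improvement this epoch
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

# Restore the best checkpoint as the final model
load_checkpoint()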
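
To see how patience behaves, the same logic can be traced on a list of validation losses without any training code. The function name `early_stopping_epoch` and the loss values are made up for this sketch.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last epoch.
    return len(val_losses) - 1, best_epoch

# Made-up curve: a small bump at epoch 4, then a real upturn after epoch 5.
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.64, 0.67, 0.68, 0.70, 0.75]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
# prints "stopped at epoch 8, best checkpoint from epoch 5"
```

The bump at epoch 4 does not trigger a stop because the loss improves again at epoch 5, which resets the counter. With patience=1 the run would have stopped at epoch 4 and missed the better model.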

Typical patience values:

Fine-tuning: 2-3 epochs
Training from scratch: 5-10 epochs
Large models: 3-5 epochs

Label Smoothing

Don’t use hard labels (0 or 1). Spread some probability to other classes.

Label Smoothing

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Hard labels (no smoothing):                             │
│     True class = 2: [0, 0, 1, 0]                          │
│                                                           │
│   Soft labels (smoothing = 0.1):                          │
│     True class = 2: [0.033, 0.033, 0.9, 0.033]            │
│                                                           │
│     10% of probability spread to other classes.           │
│                                                           │
└───────────────────────────────────────────────────────────┘
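The soft labels in the diagram can be reproduced with a few lines of Python. This sketch follows the diagram's convention of spreading the smoothing mass over the other K - 1 classes; some libraries instead spread it uniformly over all K classes, which gives slightly different numbers. The helper names and the two example predictions are assumptions made for illustration.

```python
import math

def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Soft target: 1 - smoothing on the true class, the rest shared equally."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if i == true_class else off_value
            for i in range(num_classes)]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

hard = [0.0, 0.0, 1.0, 0.0]
soft = smooth_labels(true_class=2, num_classes=4, smoothing=0.1)
print([round(p, 3) for p in soft])  # → [0.033, 0.033, 0.9, 0.033]

confident = [0.001, 0.001, 0.997, 0.001]  # made-up overconfident prediction
moderate = [0.03, 0.03, 0.91, 0.03]       # made-up calibrated prediction
for name, target in [("hard", hard), ("soft", soft)]:
    print(name,
          f"confident={cross_entropy(target, confident):.3f}",
          f"moderate={cross_entropy(target, moderate):.3f}")
```

With hard labels the overconfident prediction has the lower loss; with soft labels the moderate one does, which is how smoothing discourages overconfidence.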

Why it works:

Prevents model from being overconfident
Encourages model to keep some probability for alternatives
Acts as regularization on the output distribution

Typical value: 0.1 (10% smoothing)

Combining Techniques

Regularization techniques stack. Use multiple together.

Typical Transformer Regularization

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   1. Dropout: 0.1 after attention and FFN                 │
│   2. Weight decay: 0.01 with AdamW                        │
│   3. Early stopping: patience=3 on val loss               │
│   4. Label smoothing: 0.1 (for classification)            │
│                                                           │
│   For fine-tuning, often reduce dropout (model already    │
│   regularized).                                           │
│                                                           │
└───────────────────────────────────────────────────────────┘

Common combinations:

Scenario	Regularization
Pre-training large LLM	Weight decay 0.1, dropout 0.1
Fine-tuning	Weight decay 0.01, dropout 0.1, early stopping
Small dataset	Dropout 0.3, weight decay 0.1, data augmentation
Large dataset	Minimal — dropout 0.1, weight decay 0.01

Debugging Regularization

STILL OVERFITTING DESPITE REGULARIZATION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Symptoms:                                               │
│     • Added dropout, weight decay                         │
│     • Train/val gap still large                           │
│                                                           │
│   Causes:                                                 │
│     • Regularization too weak                             │
│     • Model still too large for data                      │
│     • Data augmentation would help                        │
│                                                           │
│   Debug steps:                                            │
│     1. Increase dropout (0.1 → 0.3)                       │
│     2. Increase weight decay (0.01 → 0.1)                 │
│     3. Add data augmentation                              │
│     4. Use smaller model                                  │
│     5. Get more data                                      │
│                                                           │
└───────────────────────────────────────────────────────────┘

UNDERFITTING (TRAIN LOSS HIGH)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Symptoms:                                                 │
│ • Training loss not decreasing enough                     │
│ • Model can't fit training data                           │
│                                                           │
│ Causes:                                                   │
│ • Too much regularization                                 │
│ • Dropout too high                                        │
│ • Weight decay too strong                                 │
│ • Model too small                                         │
│                                                           │
│ Debug steps:                                              │
│ 1. Reduce dropout (0.3 → 0.1)                             │
│ 2. Reduce weight decay (0.1 → 0.01)                       │
│ 3. Remove early stopping temporarily                      │
│ 4. Use larger model                                       │
│                                                           │
└───────────────────────────────────────────────────────────┘
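The two checklists above can be folded into a rough triage helper. This is only a heuristic sketch: `diagnose`, `train_target`, and `max_gap` are hypothetical names, and the thresholds must be chosen per task because no universal values exist.

```python
def diagnose(train_loss, val_loss, train_target, max_gap):
    """Classify a run as underfitting, overfitting, or ok (rough heuristic)."""
    if train_loss > train_target:
        # The model cannot even fit the training data.
        return "underfitting", [
            "reduce dropout",
            "reduce weight decay",
            "remove early stopping temporarily",
            "use a larger model",
        ]
    if val_loss - train_loss > max_gap:
        # Training data is fitted well, but the gap to validation is large.
        return "overfitting", [
            "increase dropout",
            "increase weight decay",
            "add data augmentation",
            "use a smaller model",
            "get more data",
        ]
    return "ok", []

# Made-up final losses with task-specific thresholds chosen by the caller.
status, steps = diagnose(train_loss=0.2, val_loss=0.9,
                         train_target=0.5, max_gap=0.3)
print(status, steps)
```

Underfitting is checked first because a large train/val gap says little when the training loss itself is still too high.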

When This Matters

Situation	What to apply
Fine-tuning on small dataset	All: dropout, weight decay, early stopping
Model overfitting	Increase dropout, weight decay
Model underfitting	Decrease regularization
Classification task	Add label smoothing
Training from scratch	Moderate regularization, data augmentation
Using AdamW	Set weight_decay parameter (not in loss)
Evaluating model	Ensure dropout is OFF (model.eval())
