Regularization

Summary

Dropout, weight decay, early stopping, and label smoothing to prevent overfitting

TL;DR

Regularization prevents overfitting by constraining the model. Dropout randomly zeros neurons during training. Weight decay penalizes large weights. Early stopping halts training when validation loss stops improving. Use all of these when fine-tuning on small datasets.

Visual Overview

The Overfitting Problem

                                                           
   Training loss:     keeps decreasing
   Validation loss:   decreases at first, then starts increasing

   [Plot: loss (0 to 2) vs. epochs (0 to 100). Train loss keeps going down; val loss starts going up.]

   Model memorized training data.
   Doesn't generalize to new data.
                                                           

When overfitting happens:

  • Small dataset, large model
  • Training too long
  • Model has too much capacity for the task
  • No regularization

Dropout

Randomly “drops” (zeros out) neurons during training. Forces the network to not rely on any single neuron.

                                                           
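   Training (dropout=0.1):
     Randomly zero out about 10% of activations on each forward pass

     Before: [0.5, 0.3, 0.8, 0.2, 0.6]
     Mask:   [1,   1,   0,   1,   1  ]   (0.8 dropped)
     After:  [0.5, 0.3, 0.0, 0.2, 0.6]

   Inference:
     No dropout. Use all neurons.
     Classic dropout: scale activations by (1 - dropout_rate) to compensate.
     Inverted dropout (what most frameworks implement): scale the kept
     activations by 1 / (1 - dropout_rate) during training instead, so
     inference needs no scaling.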
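
The mechanics above fit in a few lines of plain Python. This is a minimal sketch, not a framework implementation: the function name `dropout` and its `rng` parameter are illustrative. It uses the inverted variant, which divides surviving activations by (1 - rate) during training so that nothing has to be rescaled at inference. The expected activation is the same as in the "scale at inference" formulation.

```python
import random

def dropout(activations, rate, training, rng=None):
    """Inverted dropout over a list of floats (illustrative sketch)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # Inference: use every activation unchanged.
        return list(activations)
    if rng is None:
        rng = random
    keep_prob = 1.0 - rate
    # Keep each activation with probability keep_prob and rescale it,
    # otherwise zero it out.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

x = [0.5, 0.3, 0.8, 0.2, 0.6]
train_out = dropout(x, rate=0.1, training=True, rng=random.Random(0))
eval_out = dropout(x, rate=0.1, training=False)
```

Passing a seeded `random.Random` makes the mask reproducible, which helps when debugging; in real training the mask should differ on every forward pass.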
                                                           


WHY IT WORKS

 
 Without dropout: 
 Network can rely on specific neurons 
 "Neuron 47 always detects cats" 
 If neuron 47 is wrong, whole prediction fails 
 
 With dropout: 
 Any neuron might be missing 
 Network must build redundant representations 
 Multiple neurons learn to detect cats 
 More robust predictions 
 

Typical dropout values:

Model type      Dropout rate
Transformers    0.1 (10%)
Older MLPs      0.5 (50%)
CNNs            0.25-0.5
Fine-tuning     0.1 or lower

Where to apply:

  • After attention layers
  • After FFN layers
  • Before final classification layer
  • Usually not on the raw attention scores; some implementations do apply a separate dropout to the attention weights after the softmax

Weight Decay (L2 Regularization)

Penalizes large weights by adding their squared sum to the loss.

                                                           
   Standard loss:                                          
     L = task_loss                                         
                                                           
   With weight decay (L2 regularization):                  
     L = task_loss + λ × Σ(w²)                             
                                                           
   λ = weight decay coefficient (typically 0.01)           
   w = all model weights                                   
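
As a rough illustration of the formula, the penalty and its gradient can be written out by hand. The helper names below (`l2_penalty`, `l2_penalty_grad`, `sgd_step_with_l2`) are made up for this sketch. The gradient of λ × Σ(w²) is 2λw; many write-ups fold the factor of 2 into λ.

```python
def l2_penalty(weights, lam):
    """lam * sum of squared weights: the term added to the task loss."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]

def sgd_step_with_l2(weights, task_grads, lr, lam):
    """Plain SGD step on task_loss + L2 penalty.

    Each weight is pulled toward zero by lr * 2 * lam * w on top of the
    usual task-gradient step.
    """
    penalty_grads = l2_penalty_grad(weights, lam)
    return [w - lr * (g + pg)
            for w, g, pg in zip(weights, task_grads, penalty_grads)]

weights = [1.0, -2.0]
total_loss = 0.30 + l2_penalty(weights, lam=0.01)  # 0.30 is a made-up task loss
shrunk = sgd_step_with_l2(weights, task_grads=[0.0, 0.0], lr=0.1, lam=0.01)
```

With a zero task gradient, each weight is simply multiplied by (1 - 2 × lr × λ), which is where the name "weight decay" comes from.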
                                                           


WHY IT WORKS

 
 Large weights = model is very confident about features 
 = likely memorizing training data 
 
 Penalizing large weights: 
 • Keeps weights small 
 • Model can't "overfit" to any single feature 
 → Smoother, more generalizable function 
 
 Small weights = "softer" decision boundaries 
 

AdamW vs Adam with Weight Decay

AdamW: The Right Way

                                                           
   Adam with L2 (WRONG):
     gradient = task_gradient + λ × w
     m, v = update_momentum(gradient)
     w = w - lr × m / sqrt(v)

     Problem: The decay term passes through m and v, so it is
              rescaled by the adaptive LR. Params with a large
              gradient history get less regularization.

   AdamW (CORRECT):
     gradient = task_gradient            (no λ here)
     m, v = update_momentum(gradient)
     w = w - lr × m / sqrt(v) - lr × λ × w
                                (decay applied separately)
                                                           
   Weight decay is truly decoupled.                        
   This is what you should use.                            
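
The difference is easiest to see on a single scalar weight. The sketch below is a hand-rolled toy, not any library's optimizer: `adam_update`, `fresh_state`, and the `decoupled` flag are names invented here, and the update follows the two pseudocode variants above with the usual Adam bias correction and epsilon added.

```python
import math

def adam_update(w, task_grad, state, lr, weight_decay, decoupled,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single scalar weight.

    decoupled=False: L2 penalty folded into the gradient (Adam + L2).
    decoupled=True:  decay applied directly to the weight (AdamW style).
    """
    grad = task_grad if decoupled else task_grad + weight_decay * w

    # Standard Adam moment estimates with bias correction.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])

    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_w -= lr * weight_decay * w  # not rescaled by sqrt(v)
    return new_w

def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# A zero task gradient isolates the effect of the decay term.
w_l2 = adam_update(1.0, 0.0, fresh_state(), lr=0.1,
                   weight_decay=0.1, decoupled=False)
w_adamw = adam_update(1.0, 0.0, fresh_state(), lr=0.1,
                      weight_decay=0.1, decoupled=True)
```

With a zero task gradient, the decoupled step shrinks the weight by lr × λ × w (1.0 to 0.99 here). The L2 variant's first step is about lr in size whatever λ is (1.0 to roughly 0.9), because Adam normalizes the decay term along with everything else.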
                                                           

Typical values:

  • Language models: 0.01 - 0.1
  • Vision models: 0.0001 - 0.01
  • Fine-tuning: 0.01 (same as pre-training usually)

Early Stopping

Stop training when validation loss stops improving.

                                                           
   Monitor: validation loss (or another metric)            
   Patience: how many epochs to wait for improvement       
                                                           
   [Plot: loss (0 to 2) vs. epochs (0 to 50). Train loss keeps
   decreasing; val loss flattens and then rises.]

   STOP at the point where val loss stopped improving.
                                                           

Implementation:

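def train_with_early_stopping(train_one_epoch, evaluate, save_checkpoint,
                              load_checkpoint, max_epochs=100, patience=5):
    """Train until val loss fails to improve for `patience` epochs in a row.

    The four callables are placeholders for your own training code.
    """
    best_val_loss = float('inf')
    patience_counter = 0

    for epoch in range(max_epochs):
        train_loss = train_one_epoch()
        val_loss = evaluate()

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            save_checkpoint()  # Save best model so far
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    # Restore the best checkpoint, not the last (possibly overfit) weights
    load_checkpoint()
    return best_val_loss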

Typical patience values:

  • Fine-tuning: 2-3 epochs
  • Training from scratch: 5-10 epochs
  • Large models: 3-5 epochs
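
To get a feel for how the patience value changes the outcome, the same counting logic can be replayed over a recorded list of validation losses. This is a sketch under assumptions: `replay_early_stopping` is a hypothetical helper and the loss history is made up.

```python
def replay_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a recorded list of val losses.

    stop_epoch is None if patience never ran out.
    """
    best_val_loss = float("inf")
    best_epoch = None
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            best_val_loss, best_epoch = val_loss, epoch
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                return best_epoch, epoch
    return best_epoch, None

history = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.60]  # made-up losses
best_epoch, stop_epoch = replay_early_stopping(history, patience=3)
```

With patience=3 this run stops at epoch 5 and keeps the epoch-2 checkpoint, so it never sees the later improvement at epoch 6. With patience=4 it would reach it. That is the trade-off behind the ranges above: a short patience saves compute but can stop on a temporary plateau.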

Label Smoothing

Don’t use hard labels (0 or 1). Spread some probability to other classes.

                                                           
   Hard labels (no smoothing):                             
     True class = 2: [0, 0, 1, 0]                          
                                                           
   Soft labels (smoothing = 0.1):                          
     True class = 2: [0.033, 0.033, 0.9, 0.033]            
                                                           
     10% of probability spread to other classes.           
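
The soft-label vector in the example can be built with a small helper. This is a sketch: `smooth_labels` is a name chosen here, and it follows the convention shown above, where the true class keeps 1 - smoothing and the rest is split evenly over the other classes. Some libraries instead mix in a uniform distribution over all classes, which leaves the true class slightly above 1 - smoothing.

```python
def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Soft target vector: 1 - smoothing on the true class, the remaining
    mass split evenly across the other classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if i == true_class else off_value
            for i in range(num_classes)]

soft = smooth_labels(true_class=2, num_classes=4, smoothing=0.1)
hard = smooth_labels(true_class=2, num_classes=4, smoothing=0.0)
```

The result is used as the target distribution in the cross-entropy loss in place of the one-hot vector.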
                                                           

Why it works:

  • Prevents model from being overconfident
  • Encourages model to keep some probability for alternatives
  • Acts as regularization on the output distribution

Typical value: 0.1 (10% smoothing)


Combining Techniques

Regularization techniques stack. Use multiple together.

Typical Transformer Regularization

                                                           
   1. Dropout: 0.1 after attention and FFN                 
   2. Weight decay: 0.01 with AdamW                        
   3. Early stopping: patience=3 on val loss               
   4. Label smoothing: 0.1 (for classification)            
                                                           
   For fine-tuning, often reduce dropout (model already    
   regularized).                                           
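
One way to keep these four settings together is a small config object. The sketch below is illustrative only: `RegularizationConfig` and its field names are hypothetical, not from any library, and the defaults simply mirror the values listed above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RegularizationConfig:
    """Hypothetical container for the settings listed above."""
    dropout: float = 0.1          # after attention and FFN
    weight_decay: float = 0.01    # passed to AdamW, not added to the loss
    patience: int = 3             # early-stopping patience on val loss
    label_smoothing: float = 0.1  # classification only

    def __post_init__(self):
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if self.weight_decay < 0.0:
            raise ValueError("weight_decay must be non-negative")
        if self.patience < 1:
            raise ValueError("patience must be at least 1")
        if not 0.0 <= self.label_smoothing < 1.0:
            raise ValueError("label_smoothing must be in [0, 1)")

default_cfg = RegularizationConfig()
# Fine-tuning variant with lower dropout (0.05 is an arbitrary example value).
fine_tune_cfg = replace(default_cfg, dropout=0.05)
```

Validating the ranges at construction time catches typos such as a dropout of 10 instead of 0.1 before a training run starts.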
                                                           

Common combinations:

Scenario                   Regularization
Pre-training a large LLM   Weight decay 0.1, dropout 0.1
Fine-tuning                Weight decay 0.01, dropout 0.1, early stopping
Small dataset              Dropout 0.3, weight decay 0.1, data augmentation
Large dataset              Minimal: dropout 0.1, weight decay 0.01

Debugging Regularization

STILL OVERFITTING DESPITE REGULARIZATION

                                                           
   Symptoms:                                               
     • Added dropout, weight decay                         
     • Train/val gap still large                           
                                                           
   Causes:                                                 
     • Regularization too weak                             
     • Model still too large for data                      
     • Data augmentation would help                        
                                                           
   Debug steps:                                            
     1. Increase dropout (0.1 → 0.3)
     2. Increase weight decay (0.01 → 0.1)
     3. Add data augmentation                              
     4. Use smaller model                                  
     5. Get more data                                      
                                                           


UNDERFITTING (TRAIN LOSS HIGH)

 
 Symptoms: 
 • Training loss not decreasing enough 
 • Model can't fit training data 
 
 Causes: 
 • Too much regularization 
 • Dropout too high 
 • Weight decay too strong 
 • Model too small 
 
 Debug steps: 
 1. Reduce dropout (0.3 → 0.1) 
 2. Reduce weight decay (0.1 → 0.01) 
 3. Remove early stopping temporarily 
 4. Use larger model 
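
The two checklists can be condensed into a rough triage helper. This is a sketch only: `diagnose_fit`, `target_train_loss`, and `max_gap` are hypothetical names, and the thresholds are values you would pick for your own task, not recommended defaults.

```python
def diagnose_fit(train_loss, val_loss, target_train_loss, max_gap):
    """Classify a run as 'underfitting', 'overfitting', or 'ok'.

    target_train_loss: the training loss you consider low enough (assumption).
    max_gap: largest acceptable val_loss - train_loss (assumption).
    """
    if train_loss > target_train_loss:
        # Model can't fit the training data: loosen regularization
        # or use a larger model.
        return "underfitting"
    if val_loss - train_loss > max_gap:
        # Fits training data but not held-out data: tighten
        # regularization, add data, or shrink the model.
        return "overfitting"
    return "ok"

status = diagnose_fit(train_loss=0.1, val_loss=0.9,
                      target_train_loss=0.5, max_gap=0.2)
```

Underfitting is checked first because a train/val gap says little about generalization while the model still cannot fit the training set.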
 


When This Matters

Situation                      What to apply
Fine-tuning on small dataset   All: dropout, weight decay, early stopping
Model overfitting              Increase dropout, weight decay
Model underfitting             Decrease regularization
Classification task            Add label smoothing
Training from scratch          Moderate regularization, data augmentation
Using AdamW                    Set weight_decay parameter (not in loss)
Evaluating model               Ensure dropout is OFF (model.eval())

Production signal

Why this concept matters

Interview: 60% of ML interviews
Production: Every fine-tuning job
Performance: Preventing overfitting on small datasets