
Loss Functions

Summary

Reference for cross-entropy, MSE, perplexity, and contrastive loss in training and evaluation

TL;DR

Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.

Visual Overview

What Loss Functions Do

                                                           
   Input → Model → Prediction
                       ↓
          Target → compare
                       ↓
                     Loss → single number (lower = better)
                       ↓
                   Gradient → direction to improve
                                                           

Loss = how wrong the model is. Training = minimize loss.


Cross-Entropy Loss

Use for: Classification (the standard choice)

Cross-Entropy Formula

                                                           
   CE = -SUM(y_true × log(y_pred))                         
                                                           
   For binary:                                             
     CE = -[y × log(p) + (1-y) × log(1-p)]                 
                                                           
   Example (binary, true label = 1):                       
     Model predicts 0.9 → Loss = -log(0.9) = 0.105 (low)
     Model predicts 0.1 → Loss = -log(0.1) = 2.303 (high)
                                                           


LOSS BY PREDICTION

 
 Prediction   True=1 Loss   True=0 Loss
 ----------   -----------   -----------
 0.99         0.01          4.61
 0.90         0.11          2.30
 0.50         0.69          0.69
 0.10         2.30          0.11
 0.01         4.61          0.01
 

Intuition: Punishes confident wrong predictions severely.
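
The table values can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binary_cross_entropy` is chosen here for illustration and is not from any particular framework.

```python
import math

def binary_cross_entropy(y_true: int, p: float) -> float:
    """Binary cross-entropy for one example; p is the predicted P(label = 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Print the loss for each prediction under both possible true labels.
for p in (0.99, 0.90, 0.50, 0.10, 0.01):
    loss_if_one = binary_cross_entropy(1, p)
    loss_if_zero = binary_cross_entropy(0, p)
    print(f"p={p:.2f}  true=1: {loss_if_one:.2f}  true=0: {loss_if_zero:.2f}")
```

Note that p must lie strictly between 0 and 1 here; a prediction of exactly 0 or 1 would hit log(0).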


Why Cross-Entropy (Not MSE) for Classification?

Reason 1: It’s Maximum Likelihood

Maximum Likelihood

                                                           
   Goal: Find model that maximizes P(data | parameters)    
                                                           
   For classification:                                     
     P(correct labels) = PRODUCT(P(true_class_i))          
                                                           
   Log likelihood (easier to work with):                   
     log P = SUM(log P(true_class_i))                      
                                                           
   Minimizing negative log likelihood:                     
     -log P = -SUM(log P(true_class_i))                    
                                                           
   This IS cross-entropy loss.                             
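
The equivalence above can be checked numerically. The sketch below assumes one-hot targets and made-up predictions for three examples; the helper names are illustrative only.

```python
import math

def negative_log_likelihood(true_class_probs):
    """-log of the product of the probabilities the model gave each true class."""
    return -math.log(math.prod(true_class_probs))

def summed_cross_entropy(one_hot_targets, predictions):
    """-SUM(y_true * log(y_pred)) over every example and class."""
    total = 0.0
    for target, pred in zip(one_hot_targets, predictions):
        for y, p in zip(target, pred):
            total -= y * math.log(p)  # terms with y = 0 contribute nothing
    return total

# Hypothetical predictions for three examples over three classes.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Probability assigned to the true class of each example: 0.7, 0.8, 0.4.
true_class_probs = [pred[target.index(1)] for target, pred in zip(targets, predictions)]

nll = negative_log_likelihood(true_class_probs)
ce = summed_cross_entropy(targets, predictions)
print(f"negative log likelihood = {nll:.4f}")
print(f"summed cross-entropy    = {ce:.4f}")
```

Both numbers agree up to floating-point rounding, because with one-hot targets only the true-class term of each example survives the sum.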
                                                           

Reason 2: Better Gradients

MSE vs Cross-Entropy Gradients

                                                           
   True label: 1, Prediction: 0.01 (confident and WRONG)   
                                                           
   MSE gradient:                                           
     d/dp (p - 1)² = 2(p - 1) = 2(0.01 - 1) = -1.98        
                                                           
   Cross-entropy gradient:                                 
     d/dp -log(p) = -1/p = -1/0.01 = -100                  
                                                           
   Cross-entropy: 50x larger gradient when confidently     
   wrong! Model learns faster from its worst mistakes.     
                                                           

The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
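
A quick way to sanity-check the two gradients is to compare the analytic formulas against a finite-difference estimate. This sketch takes gradients with respect to the predicted probability p, as in the box above; the function names are illustrative.

```python
import math

def mse_grad(p, y=1.0):
    """Analytic d/dp of (p - y)^2."""
    return 2 * (p - y)

def ce_grad(p):
    """Analytic d/dp of -log(p), the cross-entropy when the true label is 1."""
    return -1.0 / p

def numeric_grad(f, p, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(p + h) - f(p - h)) / (2 * h)

p = 0.01  # confident and wrong when the true label is 1
print(f"MSE gradient:           {mse_grad(p):.2f}")
print(f"Cross-entropy gradient: {ce_grad(p):.2f}")
print(f"Numeric CE gradient:    {numeric_grad(lambda q: -math.log(q), p):.2f}")
print(f"Ratio CE / MSE:         {ce_grad(p) / mse_grad(p):.1f}x")
```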


Perplexity

Use for: Evaluating language models


                                                           
   Perplexity = exp(cross-entropy)                         
                                                           
   Or equivalently:                                        
     PPL = exp((-1/N) × SUM(log P(token_i)))               
                                                           
   Where N = number of tokens                              
                                                           


PERPLEXITY EXAMPLES

 
 PPL = 1     Model is certain (perfect prediction)
 PPL = 10    Model is choosing between ~10 equally likely tokens
 PPL = 100   Model is very uncertain
 
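 Typical values (comparable only for the same dataset and tokenizer):
 GPT-2 on WikiText-103: ~17-38 PPL zero-shot, depending on model size
 GPT-3 175B on Penn Treebank: ~20 PPL zero-shot
 Fine-tuned on domain data: often < 10 PPL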
 

Interpretation: “How many choices is the model confused between?”
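
The formula can be sketched directly from per-token probabilities. The inputs below are hypothetical probabilities that a model assigned to the correct next token; in practice they would come from a model's output distribution.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log probability of the correct tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that always gives the correct token probability 0.1 behaves like
# a uniform choice among 10 options.
print(f"{perplexity([0.1] * 5):.2f}")
# A model that is always certain and correct.
print(f"{perplexity([1.0] * 5):.2f}")
# Mixed confidence: perplexity is the inverse geometric mean of the probabilities.
print(f"{perplexity([0.5, 0.25]):.2f}")
```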


Mean Squared Error (MSE)

Use for: Regression (predicting continuous values)


                                                           
   MSE = (1/n) × SUM((y_true - y_pred)²)                   
                                                           
   Example:                                                
     True: [3, 5, 7]                                       
     Pred: [2.5, 5.2, 6.8]                                 
                                                           
     MSE = [(0.5)² + (0.2)² + (0.2)²] / 3                  
         = [0.25 + 0.04 + 0.04] / 3                        
         = 0.11                                            
                                                           

Intuition: Squared term punishes large errors more than small ones.

Variant — MAE (Mean Absolute Error):

  • More robust to outliers than MSE
  • Use when you have outliers you don’t want to dominate training
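
To make the outlier point concrete, here is a small sketch that computes both metrics on the example above and then on a copy with one badly wrong prediction. The outlier value is made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, 5, 7]
y_pred = [2.5, 5.2, 6.8]
print(f"MSE = {mse(y_true, y_pred):.2f}, MAE = {mae(y_true, y_pred):.2f}")

# Same data, but the last prediction is off by 10 (hypothetical outlier).
y_pred_outlier = [2.5, 5.2, 17.0]
print(f"MSE = {mse(y_true, y_pred_outlier):.2f}, MAE = {mae(y_true, y_pred_outlier):.2f}")
```

With the outlier, MSE grows by a factor of roughly 300 while MAE grows by roughly 12, which is why MAE is the more robust choice when a few large errors should not dominate.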

Contrastive Loss

Use for: Embedding models, similarity learning


                                                           
   For positive pair (should be similar):
     L = distance(a, b)²

   For negative pair (should be different):
     L = max(0, margin - distance(a, b))²

     anchor ←→ positive     minimize distance
     anchor ←→ negative     maximize distance (up to margin)
                                                         
                   
                                                           

Variations:

  • Triplet loss: anchor, positive, negative together
  • InfoNCE: used in CLIP, SimCLR — treat other batch items as negatives
  • Multiple negatives ranking: efficient batch training for embeddings
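
The pairwise formulas above can be sketched as follows. This assumes Euclidean distance and a margin of 1.0; both are illustrative choices, and the 2-D vectors are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_positive, margin=1.0):
    """Pairwise contrastive loss: pull positives together, push negatives past the margin."""
    d = euclidean(a, b)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor = [0.0, 0.0]
positive = [0.3, 0.4]        # distance 0.5 from the anchor
near_negative = [0.6, 0.0]   # distance 0.6, still inside the margin
far_negative = [3.0, 4.0]    # distance 5.0, already far enough

print(f"positive pair: {contrastive_loss(anchor, positive, True):.2f}")
print(f"near negative: {contrastive_loss(anchor, near_negative, False):.2f}")
print(f"far negative:  {contrastive_loss(anchor, far_negative, False):.2f}")
```

The far negative contributes zero loss, which is the "up to margin" behavior: once a negative is farther than the margin, there is no further push.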

Choosing Loss Functions

Task                          Loss                        Why
Binary classification         Binary Cross-Entropy        Standard, good gradients
Multi-class classification    Categorical Cross-Entropy   Generalizes BCE to N classes
Regression                    MSE or MAE                  MSE for normal errors, MAE for outliers
Embedding/similarity          Contrastive/Triplet         Learns relative distances
Language modeling             Cross-Entropy (per token)   Predict next token distribution
Ranking                       Pairwise/Listwise losses    Optimize ordering

Debugging Loss

LOSS NOT DECREASING

                                                           
   Symptoms:
     • Loss stays flat from the start
     • Loss decreases then plateaus early

   Causes:
     • Learning rate too low → increase 10x
     • Data issue (bad labels, wrong preprocessing)
     • Wrong loss function for task
     • Model too small for task
     • Bug in data pipeline (same batch repeated)
                                                           
   Debug steps:                                            
     1. Overfit to single batch first (should reach ~0)    
     2. Check a few examples manually                      
     3. Verify labels are correct                          
     4. Try larger learning rate                           
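
Debug step 1 can be sketched with a tiny logistic regression trained on one fixed batch. Everything here is an illustrative assumption: the four data points, the learning rate of 1.0, and the 2000 steps. If the loss does not fall toward 0 on a batch this easy, the bug is in the training loop or the loss, not in the data volume.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One tiny, linearly separable batch (hypothetical data): label is 1 when x > 0.
batch_x = [-2.0, -1.0, 1.0, 2.0]
batch_y = [0, 0, 1, 1]

def batch_loss(w, b):
    """Mean binary cross-entropy of the model sigmoid(w * x + b) on the batch."""
    total = 0.0
    for x, y in zip(batch_x, batch_y):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(batch_x)

def train(steps=2000, lr=1.0):
    """Plain gradient descent on the same batch every step."""
    w, b = 0.0, 0.0
    n = len(batch_x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(batch_x, batch_y):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x  # d(loss)/dw for sigmoid + cross-entropy
            grad_b += (p - y)      # d(loss)/db
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

initial_loss = batch_loss(0.0, 0.0)
w, b = train()
final_loss = batch_loss(w, b)
print(f"initial loss = {initial_loss:.3f}, final loss = {final_loss:.4f}")
```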
                                                           


LOSS GOES TO NaN

 
 Symptoms:
 • Loss suddenly becomes NaN or Inf
 • Gradients explode

 Causes:
 • Learning rate too high
 • Numerical instability (log(0), division by 0)
 • Missing gradient clipping
 • Bad initialization
 
 Debug steps: 
 1. Reduce learning rate by 10x 
 2. Add gradient clipping (max_grad_norm=1.0) 
 3. Check for log(0) — add epsilon 
 4. Use mixed precision carefully 
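
Debug steps 2 and 3 can be sketched in plain Python. The epsilon of 1e-7 and the helper names are assumptions for illustration; deep learning frameworks ship their own clipping utilities and numerically stable loss implementations, which are usually preferable to hand-rolled versions.

```python
import math

def safe_log_loss(p, y, eps=1e-7):
    """Binary cross-entropy with p clamped to [eps, 1 - eps] to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def clip_by_global_norm(grads, max_grad_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / norm
    return [g * scale for g in grads]

# Without the clamp, a prediction of exactly 0 for a true label of 1 would
# raise an error here (or produce inf/NaN in array libraries).
print(f"{safe_log_loss(0.0, 1):.3f}")

# A gradient with norm 5 is rescaled to norm 1; direction is preserved.
print(clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0))
```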
 


Common Gotchas

1. Label smoothing

Label Smoothing
Instead of: [0, 0, 1, 0]
Use:        [0.025, 0.025, 0.925, 0.025]

Prevents overconfidence, improves generalization.
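Typical smoothing: 0.1 (10% of the probability mass is spread uniformly across all classes, so with 4 classes each gets 0.025 and the true class keeps 0.9 + 0.025)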
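
The smoothed target above follows from a one-line formula. This sketch assumes the common formulation in which the smoothing mass is spread uniformly over all K classes, including the true one; the function name is illustrative.

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Return (1 - smoothing) * y + smoothing / K for each class."""
    k = len(one_hot)
    return [(1 - smoothing) * y + smoothing / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0], smoothing=0.1)
print([round(v, 3) for v in smoothed])
```

The smoothed vector still sums to 1, so it remains a valid target distribution for cross-entropy.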

2. Class imbalance

Class Imbalance Solutions
Weight rare classes higher:
CE_weighted = -SUM(weight_class × y × log(p))

Or use focal loss, where p is the predicted probability of the true class:
FL = -(1-p)^gamma × log(p)

Focuses on hard examples, down-weights easy ones.
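
Both formulas can be sketched per example. Here `p_true` stands for the predicted probability of the true class, the class weight of 5.0 is a made-up value, and gamma = 2 is used as a commonly cited default rather than a recommendation for any specific task.

```python
import math

def weighted_ce(p_true, class_weight):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -class_weight * math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example; gamma = 0 reduces to plain cross-entropy."""
    return -((1 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.9, 0.1  # model confidence in the true class

print(f"plain CE:   easy={-math.log(easy):.4f}  hard={-math.log(hard):.4f}")
print(f"focal loss: easy={focal_loss(easy):.4f}  hard={focal_loss(hard):.4f}")
print(f"rare class weighted 5x: {weighted_ce(hard, 5.0):.4f}")
```

The easy example's loss shrinks by a factor of about 100 under focal loss, while the hard example keeps about 81% of its loss, which is the down-weighting effect described above.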

When This Matters

Situation                   What to know
Training a classifier       Use cross-entropy, not MSE
Evaluating LLM quality      Report perplexity
Loss stuck high             Debug: overfit single batch first
Train/val loss diverging    Overfitting: add regularization
Loss goes NaN               Reduce LR, add gradient clipping
Imbalanced classes          Use weighted loss or focal loss
Fine-tuning embeddings      Contrastive loss variants


Production signal

Why this concept matters

Interview: 70% of ML interviews
Production: Every training pipeline
Performance: Choosing right loss for task