Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.
Visual Overview
What Loss Functions Do
WHAT LOSS FUNCTIONS DO

    Input → Model → Prediction
                        │
                        ▼  compare
                     Target
                        │
                        ▼
                      Loss      ← single number (lower = better)
                        │
                        ▼
                    Gradient    ← direction to improve
Loss = how wrong the model is. Training = minimize loss.
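A minimal sketch of that loop in plain Python, using a toy one-parameter model (prediction = w × x) and made-up numbers:

```python
x, target = 2.0, 10.0    # one toy training example
w = 1.0                  # the model's single parameter

for step in range(5):
    prediction = w * x
    loss = (prediction - target) ** 2          # single number: lower = better
    gradient = 2 * (prediction - target) * x   # direction to improve w
    w -= 0.05 * gradient                       # step against the gradient
    print(f"step {step}: loss = {loss:.3f}, w = {w:.3f}")
```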
Cross-Entropy Loss
Use for: Classification (the standard choice)
Cross-Entropy Formula
CROSS-ENTROPY FORMULA

    CE = -SUM(y_true × log(y_pred))

    For binary:
    CE = -[y × log(p) + (1-y) × log(1-p)]

    Example (binary, true label = 1):
      Model predicts 0.9 → Loss = -log(0.9) = 0.105 (low)
      Model predicts 0.1 → Loss = -log(0.1) = 2.303 (high)

LOSS BY PREDICTION

    Prediction    True=1 Loss    True=0 Loss
    ─────────────────────────────────────────
      0.99           0.01           4.61
      0.90           0.11           2.30
      0.50           0.69           0.69
      0.10           2.30           0.11
      0.01           4.61           0.01
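A quick NumPy check of the table above (nothing beyond NumPy is assumed; the clip simply avoids log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    # CE = -[y*log(p) + (1-y)*log(1-p)]; clip p to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

for p in [0.99, 0.90, 0.50, 0.10, 0.01]:
    print(f"p = {p:.2f}   true=1: {binary_cross_entropy(1, p):.2f}   "
          f"true=0: {binary_cross_entropy(0, p):.2f}")
```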
Why is cross-entropy the standard? Two reasons.

Reason 1: Maximum Likelihood

MAXIMUM LIKELIHOOD

    Goal: Find model that maximizes P(data | parameters)

    For classification:
      P(correct labels) = PRODUCT(P(true_class_i))

    Log likelihood (easier to work with):
      log P = SUM(log P(true_class_i))

    Minimizing negative log likelihood:
      -log P = -SUM(log P(true_class_i))

    This IS cross-entropy loss.
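The same identity in a few lines of NumPy, with made-up per-example probabilities for the true class:

```python
import numpy as np

# Probabilities the model assigns to each example's TRUE class (toy values)
p_true = np.array([0.9, 0.7, 0.95, 0.6])

likelihood = np.prod(p_true)              # P(correct labels) = product of probabilities
neg_log_likelihood = -np.log(likelihood)  # -log of the product ...
cross_entropy = -np.sum(np.log(p_true))   # ... equals the summed per-example -log(p)

print(likelihood)                          # ~0.359: maximizing this ...
print(neg_log_likelihood, cross_entropy)   # ... is minimizing this; both ≈ 1.024
```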
Reason 2: Better Gradients
MSE vs Cross-Entropy Gradients
MSE VS CROSS-ENTROPY GRADIENTS

    True label: 1, Prediction: 0.01 (confident and WRONG)

    MSE gradient:
      d/dp (p - 1)² = 2(p - 1) = 2(0.01 - 1) = -1.98

    Cross-entropy gradient:
      d/dp -log(p) = -1/p = -1/0.01 = -100

    Cross-entropy: 50x larger gradient when confidently wrong!
    Model learns faster from its worst mistakes.
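Reproducing those two gradients directly, with the same numbers as above:

```python
p = 0.01                 # prediction for a true label of 1: confident and wrong

mse_grad = 2 * (p - 1)   # d/dp (p - 1)^2
ce_grad = -1 / p         # d/dp -log(p)

print(mse_grad)              # -1.98
print(ce_grad)               # -100.0
print(ce_grad / mse_grad)    # ~50x larger correction signal
```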
The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
Perplexity
Use for: Evaluating language models
PERPLEXITY

    Perplexity = exp(cross-entropy)

    Or equivalently:
      PPL = exp((-1/N) × SUM(log P(token_i)))

    Where N = number of tokens

PERPLEXITY EXAMPLES

    PPL = 1      Model is certain (perfect prediction)
    PPL = 10     Model choosing between ~10 equally likely options
    PPL = 100    Model is very uncertain

    Typical values:
      GPT-2 on WikiText-103: ~20-30 PPL
      GPT-3 175B: ~10-15 PPL
      Fine-tuned on domain data: often < 10 PPL
Interpretation: “How many choices is the model confused between?”
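A small sketch of the formula, assuming only NumPy; the toy model assigns probability 0.1 to every token, so perplexity lands exactly at 10:

```python
import numpy as np

def perplexity(token_log_probs):
    # PPL = exp(-(1/N) * sum(log P(token_i)))
    return np.exp(-np.mean(token_log_probs))

# Toy model: every one of 1,000 tokens gets probability 0.1
log_probs = np.log(np.full(1000, 0.1))
print(perplexity(log_probs))   # 10.0 -> "choosing between ~10 options"
```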
Mean Squared Error (MSE)
Use for: Regression (predicting continuous values)
Intuition: Squared term punishes large errors more than small ones.
Variant — MAE (Mean Absolute Error):
More robust to outliers than MSE
Use when you have outliers you don’t want to dominate training (see the comparison below)
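A quick comparison on toy residuals with one outlier, assuming only NumPy:

```python
import numpy as np

errors = np.array([0.5, -0.3, 0.2, 8.0])   # the last residual is an outlier

mse = np.mean(errors ** 2)      # squaring lets the outlier dominate
mae = np.mean(np.abs(errors))   # absolute value keeps its influence proportional

print(f"MSE = {mse:.2f}")   # 16.10 (almost entirely the outlier's 64/4)
print(f"MAE = {mae:.2f}")   # 2.25
```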
Contrastive Loss
Use for: Embedding models, similarity learning
CONTRASTIVE LOSS

    For positive pair (should be similar):
      L = distance(a, b)²

    For negative pair (should be different):
      L = max(0, margin - distance(a, b))²

      anchor •─────• positive
             minimize distance

      anchor •
             │ maximize distance (up to margin)
             ▼
             • negative
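A minimal sketch of that pair loss in NumPy, with made-up 2-D embeddings and a margin of 1.0:

```python
import numpy as np

def contrastive_loss(a, b, same_class, margin=1.0):
    # Positive pairs are pulled together; negative pairs are pushed apart
    # until they are at least `margin` away.
    d = np.linalg.norm(a - b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor   = np.array([0.1, 0.9])
positive = np.array([0.2, 0.8])
negative = np.array([0.15, 0.85])

print(contrastive_loss(anchor, positive, same_class=True))    # small: already close
print(contrastive_loss(anchor, negative, same_class=False))   # large: too close for a negative
```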
Variations:
Triplet loss: anchor, positive, negative together
InfoNCE: used in CLIP and SimCLR; treats other batch items as negatives (sketched after this list)
Multiple negatives ranking: efficient batch training for embeddings
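A rough NumPy sketch of the InfoNCE idea with in-batch negatives; real implementations (e.g. CLIP, SimCLR) add a symmetric term and a learned temperature, and the random data here exists only to show the shapes:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    # Row i of `positives` is the match for row i of `anchors`;
    # every other row in the batch serves as a negative.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # scaled cosine similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_softmax[idx, idx].mean()               # cross-entropy on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
positives = anchors + 0.1 * rng.normal(size=(8, 32))   # noisy views of the anchors
print(info_nce(anchors, positives))                    # low: positives are easy to pick out
```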
Choosing Loss Functions
    Task                         Loss                        Why
    ─────────────────────────────────────────────────────────────────────────────────
    Binary classification        Binary Cross-Entropy        Standard, good gradients
    Multi-class classification   Categorical Cross-Entropy   Generalizes BCE to N classes
    Regression                   MSE or MAE                  MSE for normal errors, MAE for outliers
    Embedding/similarity         Contrastive/Triplet         Learns relative distances
    Language modeling            Cross-Entropy (per token)   Predict next token distribution
    Ranking                      Pairwise/Listwise losses    Optimize ordering
Debugging Loss
LOSS NOT DECREASING

    Symptoms:
      • Loss stays flat from the start
      • Loss decreases then plateaus early

    Causes:
      • Learning rate too low → increase 10x
      • Data issue (bad labels, wrong preprocessing)
      • Wrong loss function for task
      • Model too small for task
      • Bug in data pipeline (same batch repeated)

    Debug steps:
      1. Overfit to single batch first (should reach ~0)
      2. Check a few examples manually
      3. Verify labels are correct
      4. Try larger learning rate

LOSS GOES TO NaN

    Symptoms:
      • Loss suddenly becomes NaN or Inf
      • Gradients explode

    Causes:
      • Learning rate too high
      • Numerical instability (log(0), division by 0)
      • Missing gradient clipping
      • Bad initialization

    Debug steps:
      1. Reduce learning rate by 10x
      2. Add gradient clipping (max_grad_norm=1.0)
      3. Check for log(0); add an epsilon
      4. Use mixed precision carefully
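A sketch of the two most useful habits from the checklists above, assuming PyTorch; the model, data, and hyperparameters are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                     # stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch: if the pipeline is healthy, the model should overfit it.
inputs, targets = torch.randn(16, 10), torch.randint(0, 2, (16,))

for step in range(500):
    loss = loss_fn(model(inputs), targets)
    if not torch.isfinite(loss):             # catch NaN/Inf the moment it appears
        raise RuntimeError(f"loss became {loss.item()} at step {step}")
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

print(loss.item())   # should fall far below the initial ~0.69 for random binary labels
```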
For imbalanced classes, weight rare classes higher:

    CE_weighted = -SUM(weight_class × y × log(p))

Or use focal loss:

    FL = -(1-p)^gamma × log(p)

Focal loss focuses on hard examples and down-weights easy ones.
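Both ideas as small NumPy functions; the class weight of 10 and gamma of 2 are illustrative defaults, not prescriptions:

```python
import numpy as np

def weighted_bce(y, p, pos_weight=10.0, eps=1e-12):
    # Binary cross-entropy with the rare (positive) class weighted higher
    p = np.clip(p, eps, 1 - eps)
    return -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    # FL = -(1 - p_t)^gamma * log(p_t): confident-correct examples contribute almost nothing
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(1, 0.9))     # easy example: ~0.001, nearly ignored
print(focal_loss(1, 0.1))     # hard example: ~1.87, dominates the batch
print(weighted_bce(1, 0.1))   # a mistake on the rare class costs 10x more (~23.0)
```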
When This Matters
    Situation                   What to know
    ──────────────────────────────────────────────────────────────
    Training a classifier       Use cross-entropy, not MSE
    Evaluating LLM quality      Report perplexity
    Loss stuck high             Debug: overfit a single batch first
    Train/val loss diverging    Overfitting: add regularization
    Loss goes NaN               Reduce LR, add gradient clipping
    Imbalanced classes          Use weighted loss or focal loss
    Fine-tuning embeddings      Contrastive loss variants
See It In Action
Backpropagation Explainer: a ~120-second animated visual explanation of how loss drives learning
Interview Notes
💼 Interview relevance: ~70% of ML interviews
🏭 Production impact: every training pipeline
⚡ Performance: choosing the right loss for the task