TL;DR
Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.
Visual Overview
1. Input → Model → Prediction
2. Compare Prediction with Target → Loss (a single number; lower = better)
3. Loss → Gradient (direction to improve)
Loss = how wrong the model is. Training = minimize loss.
Cross-Entropy Loss
Use for: Classification (the standard choice)
$$\mathrm{CE} = -\sum_{i} y_{\text{true},i} \, \log\left(y_{\text{pred},i}\right)$$

For binary:

$$\mathrm{CE} = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$

Example (binary, true label = 1):
- Model predicts 0.9 → Loss = -log(0.9) = 0.105 (low)
- Model predicts 0.1 → Loss = -log(0.1) = 2.303 (high)

LOSS BY PREDICTION

| Prediction | True=1 Loss | True=0 Loss |
|---|---|---|
| 0.99 | 0.01 | 4.61 |
| 0.90 | 0.11 | 2.30 |
| 0.50 | 0.69 | 0.69 |
| 0.10 | 2.30 | 0.11 |
| 0.01 | 4.61 | 0.01 |
Intuition: Punishes confident wrong predictions severely.
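The formulas and the table above can be checked numerically. This is a minimal sketch in plain Python; the function names `binary_cross_entropy` and `categorical_cross_entropy` are illustrative choices, not a library API, and `log` is taken to be the natural logarithm, which is what the worked example implies.

```python
import math

def binary_cross_entropy(y: int, p: float) -> float:
    """Binary CE for one example: y is 0 or 1, p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y_true: list[float], y_pred: list[float]) -> float:
    """CE = -SUM(y_true * log(y_pred)); classes with y_true == 0 add nothing."""
    return -sum(t * math.log(q) for t, q in zip(y_true, y_pred) if t > 0)

# Recompute the "loss by prediction" table, one row per prediction.
for p in (0.99, 0.90, 0.50, 0.10, 0.01):
    print(f"{p:.2f}  true=1: {binary_cross_entropy(1, p):.2f}  "
          f"true=0: {binary_cross_entropy(0, p):.2f}")
```

Note that both functions assume probabilities strictly between 0 and 1; a prediction of exactly 0 or 1 would hit `log(0)`.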
Why Cross-Entropy (Not MSE) for Classification?
Reason 1: It’s Maximum Likelihood
Goal: find the model that maximizes P(data | parameters).

For classification:

$$P(\text{correct labels}) = \prod_{i} P(\text{true class}_i)$$

Log likelihood (easier to work with):

$$\log P = \sum_{i} \log P(\text{true class}_i)$$

Minimizing negative log likelihood:

$$-\log P = -\sum_{i} \log P(\text{true class}_i)$$

This IS cross-entropy loss.
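The equivalence can be confirmed on a toy case. The sketch below uses three made-up examples (the labels and predicted distributions are hypothetical) and shows that the negative log likelihood of the correct labels equals the cross-entropy summed over examples when targets are one-hot.

```python
import math

# Hypothetical (one-hot label, predicted distribution) pairs for three examples.
examples = [
    ([1, 0, 0], [0.90, 0.05, 0.05]),
    ([0, 1, 0], [0.20, 0.70, 0.10]),
    ([0, 0, 1], [0.25, 0.25, 0.50]),
]

# Likelihood of the correct labels: product of the true-class probabilities.
p_true = [pred[label.index(1)] for label, pred in examples]
likelihood = math.prod(p_true)

# Negative log likelihood: -SUM(log P(true_class_i)).
nll = -sum(math.log(p) for p in p_true)

# Cross-entropy summed over examples: -SUM(y_true * log(y_pred)).
cross_entropy = -sum(
    t * math.log(q) for label, pred in examples for t, q in zip(label, pred)
)

print(f"likelihood          = {likelihood:.4f}")
print(f"-log(likelihood)    = {-math.log(likelihood):.4f}")
print(f"negative log lik.   = {nll:.4f}")
print(f"summed cross-entropy = {cross_entropy:.4f}")
```

The last three printed values agree up to floating-point rounding, because the one-hot label zeroes out every term except the true class.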
Reason 2: Better Gradients
True label: 1, Prediction: 0.01 (confident and WRONG)

MSE gradient:

$$\frac{d}{dp}(p - 1)^2 = 2(p - 1) = 2(0.01 - 1) = -1.98$$

Cross-entropy gradient:

$$\frac{d}{dp}\left(-\log p\right) = -\frac{1}{p} = -\frac{1}{0.01} = -100$$

Cross-entropy: 50x larger gradient when confidently wrong! Model learns faster from its worst mistakes.
The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
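The two gradients can be verified without doing the calculus by hand. This sketch estimates them with a central finite difference; `numerical_grad` and the step size `h` are illustrative choices, and the derivative is taken with respect to the predicted probability `p`, as in the comparison above.

```python
import math

def mse_loss(p: float) -> float:
    return (p - 1.0) ** 2      # squared error against true label 1

def ce_loss(p: float) -> float:
    return -math.log(p)        # cross-entropy against true label 1

def numerical_grad(f, p: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dp."""
    return (f(p + h) - f(p - h)) / (2 * h)

p = 0.01  # confident and wrong
mse_grad = numerical_grad(mse_loss, p)
ce_grad = numerical_grad(ce_loss, p)

print(f"MSE gradient:           {mse_grad:.2f}")
print(f"Cross-entropy gradient: {ce_grad:.2f}")
print(f"Ratio:                  {ce_grad / mse_grad:.1f}x")
```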
Perplexity
Use for: Evaluating language models
$$\text{Perplexity} = \exp(\text{cross-entropy})$$

Or equivalently:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(\text{token}_i)\right)$$

where N = number of tokens.

PERPLEXITY EXAMPLES

| PPL | Meaning |
|---|---|
| 1 | Model is certain (perfect prediction) |
| 10 | Model choosing between ~10 equally likely options |
| 100 | Model is very uncertain |

Typical values:
- GPT-2 on WikiText-103: ~20-30 PPL
- GPT-3 175B: ~10-15 PPL
- Fine-tuned on domain data: often < 10 PPL
Interpretation: “How many choices is the model confused between?”
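The "number of choices" reading follows directly from the formula. In this minimal sketch, `perplexity` is a hypothetical helper that takes the probability the model assigned to each actual token.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """PPL = exp((-1/N) * SUM(log P(token_i)))."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Every token predicted with certainty: no confusion at all.
print(perplexity([1.0, 1.0, 1.0]))

# Every token given probability 1/10: as if picking among 10 equal options.
print(perplexity([0.1] * 5))

# Mixed confidence: the result is the inverse geometric mean of the probabilities.
print(perplexity([0.5, 0.25, 0.125]))
```

A model that spreads probability evenly over k options at every step has a perplexity of k, which is why the value reads as an effective number of choices.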
Mean Squared Error (MSE)
Use for: Regression (predicting continuous values)
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_{\text{true},i} - y_{\text{pred},i}\right)^2$$

Example:
- True: [3, 5, 7]
- Pred: [2.5, 5.2, 6.8]

$$\mathrm{MSE} = \frac{(0.5)^2 + (0.2)^2 + (0.2)^2}{3} = \frac{0.25 + 0.04 + 0.04}{3} = 0.11$$
Intuition: Squared term punishes large errors more than small ones.
Variant — MAE (Mean Absolute Error):
- More robust to outliers than MSE
- Use when you have outliers you don’t want to dominate training
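A short sketch makes the outlier point concrete. It recomputes the worked MSE example, then appends one badly mispredicted point (a made-up value, chosen only for illustration) to show how differently the two losses react.

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, 5, 7]
y_pred = [2.5, 5.2, 6.8]
print(f"clean data:   MSE = {mse(y_true, y_pred):.2f}, MAE = {mae(y_true, y_pred):.2f}")

# Hypothetical outlier: true value 9, predicted 19 (error of 10).
mse_out = mse(y_true + [9], y_pred + [19])
mae_out = mae(y_true + [9], y_pred + [19])
print(f"with outlier: MSE = {mse_out:.2f}, MAE = {mae_out:.2f}")
```

The single outlier multiplies MSE by a factor of a few hundred but MAE by less than ten, because squaring amplifies the one large error.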
Contrastive Loss
Use for: Embedding models, similarity learning
For a positive pair (should be similar):

$$L = \text{distance}(a, b)^2$$

For a negative pair (should be different):

$$L = \max\left(0,\ \text{margin} - \text{distance}(a, b)\right)^2$$

- anchor and positive: minimize distance
- anchor and negative: maximize distance (up to margin)
Variations:
- Triplet loss: anchor, positive, negative together
- InfoNCE: used in CLIP, SimCLR — treat other batch items as negatives
- Multiple negatives ranking: efficient batch training for embeddings
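The pairwise formulas above, plus one common form of triplet loss, can be sketched in a few lines. The assumptions here are Euclidean distance, a default margin of 1.0, and the function names themselves; real embedding libraries differ in the distance measure and in whether distances are squared.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_positive: bool, margin: float = 1.0) -> float:
    """Positive pair: distance^2. Negative pair: max(0, margin - distance)^2."""
    d = euclidean(a, b)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """One common form: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
print(contrastive_loss(anchor, [3.0, 4.0], is_positive=True))    # far positive: penalized
print(contrastive_loss(anchor, [3.0, 4.0], is_positive=False))   # far negative: no loss
print(contrastive_loss(anchor, [0.3, 0.4], is_positive=False))   # close negative: penalized
print(triplet_loss(anchor, [0.0, 1.0], [3.0, 4.0]))              # negative already far enough
```

Once a negative is farther away than the margin, its loss is zero, so training effort goes to the pairs that are still too close.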
Choosing Loss Functions
| Task | Loss | Why |
|---|---|---|
| Binary classification | Binary Cross-Entropy | Standard, good gradients |
| Multi-class classification | Categorical Cross-Entropy | Generalizes BCE to N classes |
| Regression | MSE or MAE | MSE when errors are roughly Gaussian, MAE when outliers are present |
| Embedding/similarity | Contrastive/Triplet | Learns relative distances |
| Language modeling | Cross-Entropy (per token) | Predict next token distribution |
| Ranking | Pairwise/Listwise losses | Optimize ordering |
Debugging Loss
LOSS NOT DECREASING

Symptoms:
- Loss stays flat from the start
- Loss decreases then plateaus early

Causes:
- Learning rate too low → increase 10x
- Data issue (bad labels, wrong preprocessing)
- Wrong loss function for task
- Model too small for task
- Bug in data pipeline (same batch repeated)

Debug steps:
1. Overfit to single batch first (should reach ~0)
2. Check a few examples manually
3. Verify labels are correct
4. Try larger learning rate

LOSS GOES TO NaN

Symptoms:
- Loss suddenly becomes NaN or Inf
- Gradients explode

Causes:
- Learning rate too high
- Numerical instability (log(0), division by 0)
- Missing gradient clipping
- Bad initialization

Debug steps:
1. Reduce learning rate by 10x
2. Add gradient clipping (max_grad_norm=1.0)
3. Check for log(0) and add epsilon
4. Use mixed precision carefully
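Two of the NaN fixes listed above, guarding `log(0)` with an epsilon and clipping gradients by norm, are easy to sketch in plain Python. The helper names `safe_log` and `clip_by_global_norm` and the epsilon value are assumptions for illustration; deep learning frameworks ship their own versions of both, and those should be preferred in real training code.

```python
import math

def safe_log(p: float, eps: float = 1e-12) -> float:
    """log(p) with p clamped into [eps, 1], so p == 0 cannot raise or return -inf."""
    return math.log(min(max(p, eps), 1.0))

def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A prediction of exactly 0 for the true class no longer blows up the loss.
loss = -safe_log(0.0)
print(f"loss = {loss:.2f}, finite = {math.isfinite(loss)}")

# A gradient with norm 5 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```

Checking `math.isfinite(loss)` at every step and stopping at the first failure is a cheap way to find the batch that triggers the instability.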
Common Gotchas
1. Label smoothing
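Instead of: [0, 0, 1, 0]

Use: [0.025, 0.025, 0.925, 0.025]

Prevents overconfidence, improves generalization. Typical smoothing: 0.1 (10% of the probability mass is spread uniformly across all classes, so with 4 classes the true class keeps 0.9 + 0.1/4 = 0.925).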
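The smoothed vector above comes from mixing the one-hot target with a uniform distribution. A minimal sketch, where `smooth_labels` is a hypothetical helper name:

```python
def smooth_labels(one_hot: list[float], smoothing: float = 0.1) -> list[float]:
    """Mix the one-hot target with a uniform distribution over all k classes."""
    k = len(one_hot)
    return [(1 - smoothing) * y + smoothing / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0], smoothing=0.1)
print([round(v, 3) for v in smoothed])
print(round(sum(smoothed), 6))  # still a valid probability distribution
```

Because the target for the true class is now below 1, the model is no longer rewarded for pushing its predicted probability all the way to 1.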
2. Class imbalance
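Weight rare classes higher:

$$\mathrm{CE}_{\text{weighted}} = -\sum_{c} w_c \, y_c \log(p_c)$$

where $w_c$ is the weight assigned to class $c$.

Or use focal loss:

$$\mathrm{FL} = -(1 - p)^{\gamma} \log(p)$$

where $p$ is the predicted probability of the true class. Focal loss focuses on hard examples and down-weights easy ones.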
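Both remedies are small modifications of plain cross-entropy. In the sketch below, the class weights are made-up numbers and `gamma = 2.0` is an assumed default (a commonly used value); both need tuning on real data.

```python
import math

def weighted_cross_entropy(y_true, y_pred, class_weights) -> float:
    """CE_weighted = -SUM(weight_class * y * log(p))."""
    return -sum(
        w * t * math.log(p)
        for w, t, p in zip(class_weights, y_true, y_pred)
        if t > 0
    )

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """FL = -(1 - p)^gamma * log(p), where p is the probability of the true class."""
    return -((1 - p_true) ** gamma) * math.log(p_true)

# Hypothetical setup: class 1 is rare, so its errors count 5x.
print(weighted_cross_entropy([0, 1], [0.5, 0.5], class_weights=[1.0, 5.0]))

# Easy example (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
for p in (0.9, 0.5, 0.1):
    print(f"p = {p}: CE = {-math.log(p):.3f}, focal = {focal_loss(p):.3f}")
```

With `gamma = 0` the focal term disappears and the loss reduces to ordinary cross-entropy.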
When This Matters
| Situation | What to know |
|---|---|
| Training a classifier | Use cross-entropy, not MSE |
| Evaluating LLM quality | Report perplexity |
| Loss stuck high | Debug: overfit single batch first |
| Train/val loss diverging | Overfitting — add regularization |
| Loss goes NaN | Reduce LR, add gradient clipping |
| Imbalanced classes | Use weighted loss or focal loss |
| Fine-tuning embeddings | Contrastive loss variants |
See It In Action
- Backpropagation Explainer - ~120 second animated visual explanation showing how loss drives learning