Loss Functions | Concepts

TL;DR

Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.

Visual Overview

What Loss Functions Do

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Input → Model → Prediction                              │
│                       │                                   │
│                       ▼ compare                           │
│                    Target                                 │
│                       │                                   │
│                       ▼                                   │
│                     Loss  ← single number                 │
│                       │     (lower = better)              │
│                       ▼                                   │
│                   Gradient ← direction to improve         │
│                                                           │
└───────────────────────────────────────────────────────────┘

Loss = how wrong the model is. Training = minimize loss.

Cross-Entropy Loss

Use for: Classification (the standard choice)

Cross-Entropy Formula

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   CE = -SUM(y_true × log(y_pred))                         │
│                                                           │
│   For binary:                                             │
│     CE = -[y × log(p) + (1-y) × log(1-p)]                 │
│                                                           │
│   Example (binary, true label = 1):                       │
│     Model predicts 0.9 → Loss = -log(0.9) = 0.105 (low)   │
│     Model predicts 0.1 → Loss = -log(0.1) = 2.303 (high)  │
│                                                           │
└───────────────────────────────────────────────────────────┘

LOSS BY PREDICTION
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Prediction    True=1 Loss    True=0 Loss                │
│   ────────────────────────────────────────                │
│   0.99          0.01           4.61                       │
│   0.90          0.11           2.30                       │
│   0.50          0.69           0.69                       │
│   0.10          2.30           0.11                       │
│   0.01          4.61           0.01                       │
│                                                           │
└───────────────────────────────────────────────────────────┘

Intuition: Punishes confident wrong predictions severely.
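
As a minimal sketch, the binary formula above can be evaluated directly to reproduce the table. The function name `binary_cross_entropy` is an illustrative choice, not something from a particular library.

```python
import math

def binary_cross_entropy(y_true: int, p: float) -> float:
    """Binary cross-entropy for one example.

    y_true is the label (0 or 1); p is the predicted probability of class 1.
    """
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Print one row per prediction: prediction, loss if true=1, loss if true=0.
for p in (0.99, 0.90, 0.50, 0.10, 0.01):
    loss_if_1 = binary_cross_entropy(1, p)
    loss_if_0 = binary_cross_entropy(0, p)
    print(f"{p:.2f}  {loss_if_1:.2f}  {loss_if_0:.2f}")
```

Note that the loss is undefined at exactly p = 0 or p = 1 for the wrong label, which is why practical implementations clamp probabilities or work from logits.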

Why Cross-Entropy (Not MSE) for Classification?

Reason 1: It’s Maximum Likelihood

Maximum Likelihood

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Goal: Find model that maximizes P(data | parameters)    │
│                                                           │
│   For classification:                                     │
│     P(correct labels) = PRODUCT(P(true_class_i))          │
│                                                           │
│   Log likelihood (easier to work with):                   │
│     log P = SUM(log P(true_class_i))                      │
│                                                           │
│   Minimizing negative log likelihood:                     │
│     -log P = -SUM(log P(true_class_i))                    │
│                                                           │
│   This IS cross-entropy loss.                             │
│                                                           │
└───────────────────────────────────────────────────────────┘
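
A quick numerical check of the equivalence, assuming three made-up probabilities that a model assigned to the true class of each example (the values in `p_true` are hypothetical):

```python
import math

# Hypothetical probabilities the model gave to the TRUE class of each example.
p_true = [0.9, 0.8, 0.6]

# Likelihood of the labels: PRODUCT(P(true_class_i))
likelihood = math.prod(p_true)

# Negative log likelihood computed from the product...
nll_from_product = -math.log(likelihood)

# ...and the summed per-example cross-entropy, -SUM(log P(true_class_i)).
cross_entropy_sum = sum(-math.log(p) for p in p_true)

print(nll_from_product, cross_entropy_sum)  # the two values agree
```

Working with the sum of logs rather than the product also avoids numerical underflow when there are many examples.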

Reason 2: Better Gradients

MSE vs Cross-Entropy Gradients

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   True label: 1, Prediction: 0.01 (confident and WRONG)   │
│                                                           │
│   MSE gradient:                                           │
│     d/dp (p - 1)² = 2(p - 1) = 2(0.01 - 1) = -1.98        │
│                                                           │
│   Cross-entropy gradient:                                 │
│     d/dp -log(p) = -1/p = -1/0.01 = -100                  │
│                                                           │
│   Cross-entropy: 50x larger gradient when confidently     │
│   wrong! Model learns faster from its worst mistakes.     │
│                                                           │
└───────────────────────────────────────────────────────────┘

The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
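
The comparison above can be reproduced in a few lines. The second half of this sketch goes one step beyond the box: it assumes the prediction comes from a sigmoid output, p = sigmoid(z), and applies the chain rule to get gradients with respect to the logit z. That assumption is mine, not the source's, but it shows why the gap matters in practice.

```python
def mse_grad_wrt_p(p: float, y: float = 1.0) -> float:
    """d/dp (p - y)^2"""
    return 2.0 * (p - y)

def ce_grad_wrt_p(p: float) -> float:
    """d/dp -log(p), i.e. cross-entropy when the true label is 1."""
    return -1.0 / p

p = 0.01  # confident and wrong (true label is 1)

ratio = ce_grad_wrt_p(p) / mse_grad_wrt_p(p)
print(mse_grad_wrt_p(p), ce_grad_wrt_p(p), ratio)

# Assumption: p = sigmoid(z), so dp/dz = p * (1 - p).
dp_dz = p * (1.0 - p)
mse_grad_wrt_logit = mse_grad_wrt_p(p) * dp_dz  # tiny: sigmoid is saturated
ce_grad_wrt_logit = ce_grad_wrt_p(p) * dp_dz    # simplifies to p - 1
print(mse_grad_wrt_logit, ce_grad_wrt_logit)
```

Under the sigmoid assumption, the cross-entropy gradient stays close to -1 while the MSE gradient nearly vanishes, because the -1/p term cancels the saturating p factor.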

Perplexity

Use for: Evaluating language models

Perplexity

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Perplexity = exp(cross-entropy)                         │
│                                                           │
│   Or equivalently:                                        │
│     PPL = exp((-1/N) × SUM(log P(token_i)))               │
│                                                           │
│   Where N = number of tokens                              │
│                                                           │
└───────────────────────────────────────────────────────────┘

PERPLEXITY EXAMPLES
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   PPL = 1      Model is certain (perfect prediction)      │
│   PPL = 10     Choosing among ~10 equally likely tokens   │
│   PPL = 100    Model is very uncertain                    │
│                                                           │
│   Typical values (depend on dataset and tokenizer):       │
│     GPT-2 on WikiText-103: ~20-30 PPL                     │
│     GPT-3 175B: ~10-15 PPL                                │
│     Fine-tuned on domain data: often < 10 PPL             │
│                                                           │
└───────────────────────────────────────────────────────────┘

Interpretation: “How many choices is the model confused between?”
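
A small sketch of the formula, assuming you already have the probability the model assigned to each actual token (the helper name `perplexity` and the input lists are illustrative):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """PPL = exp((-1/N) * SUM(log P(token_i)))"""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that gives every actual token probability 0.1 behaves like a
# uniform choice among 10 options.
uniform_over_ten = perplexity([0.1] * 5)

# A model that is always certain and correct.
always_certain = perplexity([1.0] * 5)

print(uniform_over_ten, always_certain)
```

This matches the "number of choices" reading: probability 1/k on every token gives a perplexity of k.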

Mean Squared Error (MSE)

Use for: Regression (predicting continuous values)

Mean Squared Error

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   MSE = (1/n) × SUM((y_true - y_pred)²)                   │
│                                                           │
│   Example:                                                │
│     True: [3, 5, 7]                                       │
│     Pred: [2.5, 5.2, 6.8]                                 │
│                                                           │
│     MSE = [(0.5)² + (0.2)² + (0.2)²] / 3                  │
│         = [0.25 + 0.04 + 0.04] / 3                        │
│         = 0.11                                            │
│                                                           │
└───────────────────────────────────────────────────────────┘

Intuition: Squared term punishes large errors more than small ones.

Variant: MAE (Mean Absolute Error)

• More robust to outliers than MSE
• Use when you have outliers you don't want to dominate training
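
To make the outlier point concrete, here is a sketch that computes both metrics on the example above and then again with one extra, badly wrong prediction. The fourth data point (true 9, predicted 19) is invented for illustration.

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, 5, 7]
y_pred = [2.5, 5.2, 6.8]
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Same data plus one hypothetical outlier (error of 10).
y_true_out = y_true + [9]
y_pred_out = y_pred + [19]
print(mse(y_true_out, y_pred_out), mae(y_true_out, y_pred_out))
```

The single outlier raises MSE by a factor of more than 200 but MAE by less than 10, which is the sense in which MAE is more robust.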

Contrastive Loss

Use for: Embedding models, similarity learning

Contrastive Loss

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   For positive pair (should be similar):                  │
│     L = distance(a, b)²                                   │
│                                                           │
│   For negative pair (should be different):                │
│     L = max(0, margin - distance(a, b))²                  │
│                                                           │
│   ┌──────────────────────────────────────┐                │
│   │                                      │                │
│   │    anchor •─────• positive           │                │
│   │              ▲                       │                │
│   │         minimize distance            │                │
│   │                                      │                │
│   │    anchor •                          │                │
│   │              ▼                       │                │
│   │         maximize distance (up to     │                │
│   │         margin)                      │                │
│   │              ▼                       │                │
│   │           • negative                 │                │
│   │                                      │                │
│   └──────────────────────────────────────┘                │
│                                                           │
└───────────────────────────────────────────────────────────┘

Variations:

• Triplet loss: anchor, positive, negative together
• InfoNCE: used in CLIP, SimCLR; treats other batch items as negatives
• Multiple negatives ranking: efficient batch training for embeddings
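
A minimal sketch of the pairwise contrastive loss from the box above, assuming Euclidean distance and a margin of 1.0. Both choices, and the function names, are illustrative assumptions.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_positive: bool, margin: float = 1.0) -> float:
    """Positive pairs: distance^2. Negative pairs: max(0, margin - distance)^2."""
    d = euclidean(a, b)
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor = [0.0, 0.0]
near = [0.3, 0.4]   # distance 0.5 from the anchor
far = [3.0, 4.0]    # distance 5.0 from the anchor

print(contrastive_loss(anchor, near, is_positive=True))   # pulls the pair together
print(contrastive_loss(anchor, near, is_positive=False))  # pushes apart: inside margin
print(contrastive_loss(anchor, far, is_positive=False))   # already beyond margin
```

Negatives that are already farther apart than the margin contribute zero loss, so training effort goes to pairs that are still too close.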

Choosing Loss Functions

Task	Loss	Why
Binary classification	Binary Cross-Entropy	Standard, good gradients
Multi-class classification	Categorical Cross-Entropy	Generalizes BCE to N classes
Regression	MSE or MAE	MSE for roughly Gaussian errors, MAE when outliers are present
Embedding/similarity	Contrastive/Triplet	Learns relative distances
Language modeling	Cross-Entropy (per token)	Predict next token distribution
Ranking	Pairwise/Listwise losses	Optimize ordering

Debugging Loss

LOSS NOT DECREASING
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Symptoms:                                               │
│     • Loss stays flat from the start                      │
│     • Loss decreases then plateaus early                  │
│                                                           │
│   Causes:                                                 │
│     • Learning rate too low → increase 10x                │
│     • Data issue (bad labels, wrong preprocessing)        │
│     • Wrong loss function for task                        │
│     • Model too small for task                            │
│     • Bug in data pipeline (same batch repeated)          │
│                                                           │
│   Debug steps:                                            │
│     1. Overfit to single batch first (should reach ~0)    │
│     2. Check a few examples manually                      │
│     3. Verify labels are correct                          │
│     4. Try larger learning rate                           │
│                                                           │
└───────────────────────────────────────────────────────────┘
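
Debug step 1 can be sketched without any framework. The example below assumes a tiny logistic-regression model and a made-up, linearly separable batch of four points. The data, learning rate, and step count are all illustrative. If the pipeline is healthy, loss on the single batch should fall close to zero.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical single batch: 1-D inputs with binary labels, linearly separable.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def batch_loss(w: float, b: float) -> float:
    """Mean binary cross-entropy over the batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

w, b, lr = 0.0, 0.0, 0.5
initial_loss = batch_loss(w, b)

# Repeatedly train on the SAME batch with plain gradient descent.
for step in range(2000):
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(errors) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = batch_loss(w, b)
print(initial_loss, final_loss)
```

If a real model cannot drive the loss down on one small batch like this, suspect the loss function, the labels, or the optimizer wiring before blaming model capacity.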

LOSS GOES TO NaN
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Symptoms:                                               │
│     • Loss suddenly becomes NaN or Inf                    │
│     • Gradients explode                                   │
│                                                           │
│   Causes:                                                 │
│     • Learning rate too high                              │
│     • Numerical instability (log(0), division by 0)       │
│     • Missing gradient clipping                           │
│     • Bad initialization                                  │
│                                                           │
│   Debug steps:                                            │
│     1. Reduce learning rate by 10x                        │
│     2. Add gradient clipping (max_grad_norm=1.0)          │
│     3. Check for log(0); add epsilon                      │
│     4. Use mixed precision carefully                      │
└───────────────────────────────────────────────────────────┘
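
Debug steps 2 and 3 can be sketched in plain Python. The helper names `safe_log` and `clip_by_norm` and the epsilon value are assumptions for illustration. Deep learning frameworks ship their own equivalents, which you should prefer in real training code.

```python
import math

def safe_log(p: float, eps: float = 1e-12) -> float:
    """log(p) with p clamped away from zero, so log(0) cannot occur."""
    return math.log(max(p, eps))

def clip_by_norm(grads: list[float], max_grad_norm: float = 1.0) -> list[float]:
    """Rescale the gradient vector so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / norm
    return [g * scale for g in grads]

print(safe_log(0.0))               # large but finite, instead of an error
print(clip_by_norm([3.0, 4.0]))    # norm 5.0 rescaled to norm 1.0
print(clip_by_norm([0.3, 0.4]))    # norm 0.5 left unchanged
```

Clipping by norm keeps the gradient's direction and only limits its size, which is why it is usually preferred over clipping each component independently.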

Common Gotchas

1. Label smoothing

Label Smoothing

Instead of: [0, 0, 1, 0]
Use:        [0.025, 0.025, 0.925, 0.025]

Prevents overconfidence, improves generalization.
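Typical smoothing: 0.1 (10% of the probability mass spread uniformly over all classes, so each of the 4 classes gets 0.025 and the true class keeps 0.9 + 0.025)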
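
A minimal sketch of how the smoothed target above is produced, assuming the common formulation in which the smoothing mass is spread uniformly over all K classes (the function name is illustrative):

```python
def smooth_labels(one_hot: list[float], smoothing: float = 0.1) -> list[float]:
    """Blend a one-hot target with a uniform distribution over all classes."""
    k = len(one_hot)
    return [(1.0 - smoothing) * y + smoothing / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0])
print(smoothed)  # approximately [0.025, 0.025, 0.925, 0.025]
```

The result still sums to 1, so it remains a valid target distribution for cross-entropy.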

2. Class imbalance

Class Imbalance Solutions

Weight rare classes higher:
CE_weighted = -SUM(weight_class × y × log(p))

Or use focal loss (p = predicted probability of the true class):
FL = -(1-p)^gamma × log(p)

Focuses on hard examples, down-weights easy ones.
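
Both formulas can be sketched for a single example, where `p_true` is the probability the model assigned to the correct class. The function names, the default `gamma=2.0`, and the class weight of 5.0 are illustrative assumptions, not recommendations from the source.

```python
import math

def weighted_ce(p_true: float, class_weight: float) -> float:
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -class_weight * math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """FL = -(1 - p)^gamma * log(p), with p the true-class probability."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.9, 0.1  # well-classified vs. badly misclassified example

print(-math.log(easy), focal_loss(easy))  # easy example is strongly down-weighted
print(-math.log(hard), focal_loss(hard))  # hard example keeps most of its loss
print(weighted_ce(hard, class_weight=5.0))  # rare class counts 5x (hypothetical weight)
```

With gamma = 0 the focal loss reduces to ordinary cross-entropy, so gamma controls how aggressively easy examples are discounted.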

When This Matters

Situation	What to know
Training a classifier	Use cross-entropy, not MSE
Evaluating LLM quality	Report perplexity
Loss stuck high	Debug: overfit single batch first
Train/val loss diverging	Overfitting — add regularization
Loss goes NaN	Reduce LR, add gradient clipping
Imbalanced classes	Use weighted loss or focal loss
Fine-tuning embeddings	Contrastive loss variants

See It In Action

Backpropagation Explainer - ~120 second animated visual explanation showing how loss drives learning
