Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.
Visual Overview
What Loss Functions Do
WHAT LOSS FUNCTIONS DO

    Input → Model → Prediction
                        │
                        ▼  compare
                     Target
                        │
                        ▼
                      Loss      ← single number (lower = better)
                        │
                        ▼
                    Gradient    ← direction to improve
Loss = how wrong the model is. Training = minimize loss.
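A minimal sketch of that loop in plain Python, using a toy one-parameter model (prediction = w × x) and made-up numbers:

```python
x, target = 2.0, 10.0    # one toy training example
w = 1.0                  # the model's single parameter

for step in range(5):
    prediction = w * x
    loss = (prediction - target) ** 2          # single number: lower = better
    gradient = 2 * (prediction - target) * x   # direction to improve w
    w -= 0.05 * gradient                       # step against the gradient
    print(f"step {step}: loss = {loss:.3f}, w = {w:.3f}")
```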
Cross-Entropy Loss
Use for: Classification (the standard choice)
Cross-Entropy Formula
CROSS-ENTROPY FORMULA

    CE = -SUM(y_true × log(y_pred))

    For binary:
    CE = -[y × log(p) + (1-y) × log(1-p)]

    Example (binary, true label = 1):
      Model predicts 0.9 → Loss = -log(0.9) = 0.105 (low)
      Model predicts 0.1 → Loss = -log(0.1) = 2.303 (high)

LOSS BY PREDICTION

    Prediction    True=1 Loss    True=0 Loss
    ─────────────────────────────────────────
      0.99           0.01           4.61
      0.90           0.11           2.30
      0.50           0.69           0.69
      0.10           2.30           0.11
      0.01           4.61           0.01
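A quick NumPy check of the table above (nothing beyond NumPy is assumed; the clip simply avoids log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    # CE = -[y*log(p) + (1-y)*log(1-p)]; clip p to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

for p in [0.99, 0.90, 0.50, 0.10, 0.01]:
    print(f"p = {p:.2f}   true=1: {binary_cross_entropy(1, p):.2f}   "
          f"true=0: {binary_cross_entropy(0, p):.2f}")
```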
Why is cross-entropy the standard? Two reasons.

Reason 1: Maximum Likelihood

MAXIMUM LIKELIHOOD

    Goal: Find model that maximizes P(data | parameters)

    For classification:
      P(correct labels) = PRODUCT(P(true_class_i))

    Log likelihood (easier to work with):
      log P = SUM(log P(true_class_i))

    Minimizing negative log likelihood:
      -log P = -SUM(log P(true_class_i))

    This IS cross-entropy loss.
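The same identity in a few lines of NumPy, with made-up per-example probabilities for the true class:

```python
import numpy as np

# Probabilities the model assigns to each example's TRUE class (toy values)
p_true = np.array([0.9, 0.7, 0.95, 0.6])

likelihood = np.prod(p_true)              # P(correct labels) = product of probabilities
neg_log_likelihood = -np.log(likelihood)  # -log of the product ...
cross_entropy = -np.sum(np.log(p_true))   # ... equals the summed per-example -log(p)

print(likelihood)                          # ~0.359: maximizing this ...
print(neg_log_likelihood, cross_entropy)   # ... is minimizing this; both ≈ 1.024
```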
Reason 2: Better Gradients
MSE vs Cross-Entropy Gradients
MSE VS CROSS-ENTROPY GRADIENTS

    True label: 1, Prediction: 0.01 (confident and WRONG)

    MSE gradient:
      d/dp (p - 1)² = 2(p - 1) = 2(0.01 - 1) = -1.98

    Cross-entropy gradient:
      d/dp -log(p) = -1/p = -1/0.01 = -100

    Cross-entropy: 50x larger gradient when confidently wrong!
    Model learns faster from its worst mistakes.
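Reproducing those two gradients directly, with the same numbers as above:

```python
p = 0.01                 # prediction for a true label of 1: confident and wrong

mse_grad = 2 * (p - 1)   # d/dp (p - 1)^2
ce_grad = -1 / p         # d/dp -log(p)

print(mse_grad)              # -1.98
print(ce_grad)               # -100.0
print(ce_grad / mse_grad)    # ~50x larger correction signal
```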
The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
Perplexity
Use for: Evaluating language models
PERPLEXITY

    Perplexity = exp(cross-entropy)

    Or equivalently:
      PPL = exp((-1/N) × SUM(log P(token_i)))

    Where N = number of tokens

PERPLEXITY EXAMPLES

    PPL = 1      Model is certain (perfect prediction)
    PPL = 10     Model choosing between ~10 equally likely options
    PPL = 100    Model is very uncertain

    Typical values:
      GPT-2 on WikiText-103: ~20-30 PPL
      GPT-3 175B: ~10-15 PPL
      Fine-tuned on domain data: often < 10 PPL
Interpretation: “How many choices is the model confused between?”
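A small sketch of the formula, assuming only NumPy; the toy model assigns probability 0.1 to every token, so perplexity lands exactly at 10:

```python
import numpy as np

def perplexity(token_log_probs):
    # PPL = exp(-(1/N) * sum(log P(token_i)))
    return np.exp(-np.mean(token_log_probs))

# Toy model: every one of 1,000 tokens gets probability 0.1
log_probs = np.log(np.full(1000, 0.1))
print(perplexity(log_probs))   # 10.0 -> "choosing between ~10 options"
```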
Mean Squared Error (MSE)
Use for: Regression (predicting continuous values)
Intuition: Squared term punishes large errors more than small ones.
Variant — MAE (Mean Absolute Error):
More robust to outliers than MSE
Use when you have outliers you don’t want to dominate training (see the comparison below)
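A quick comparison on toy residuals with one outlier, assuming only NumPy:

```python
import numpy as np

errors = np.array([0.5, -0.3, 0.2, 8.0])   # the last residual is an outlier

mse = np.mean(errors ** 2)      # squaring lets the outlier dominate
mae = np.mean(np.abs(errors))   # absolute value keeps its influence proportional

print(f"MSE = {mse:.2f}")   # 16.10 (almost entirely the outlier's 64/4)
print(f"MAE = {mae:.2f}")   # 2.25
```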
Contrastive Loss
Use for: Embedding models, similarity learning
CONTRASTIVE LOSS

    For positive pair (should be similar):
      L = distance(a, b)²

    For negative pair (should be different):
      L = max(0, margin - distance(a, b))²

      anchor •─────• positive
             minimize distance

      anchor •
             │ maximize distance (up to margin)
             ▼
             • negative
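A minimal sketch of that pair loss in NumPy, with made-up 2-D embeddings and a margin of 1.0:

```python
import numpy as np

def contrastive_loss(a, b, same_class, margin=1.0):
    # Positive pairs are pulled together; negative pairs are pushed apart
    # until they are at least `margin` away.
    d = np.linalg.norm(a - b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor   = np.array([0.1, 0.9])
positive = np.array([0.2, 0.8])
negative = np.array([0.15, 0.85])

print(contrastive_loss(anchor, positive, same_class=True))    # small: already close
print(contrastive_loss(anchor, negative, same_class=False))   # large: too close for a negative
```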
Variations:
Triplet loss: anchor, positive, negative together
InfoNCE: used in CLIP and SimCLR; treats other batch items as negatives (sketched after this list)
Multiple negatives ranking: efficient batch training for embeddings
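A rough NumPy sketch of the InfoNCE idea with in-batch negatives; real implementations (e.g. CLIP, SimCLR) add a symmetric term and a learned temperature, and the random data here exists only to show the shapes:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    # Row i of `positives` is the match for row i of `anchors`;
    # every other row in the batch serves as a negative.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # scaled cosine similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_softmax[idx, idx].mean()               # cross-entropy on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
positives = anchors + 0.1 * rng.normal(size=(8, 32))   # noisy views of the anchors
print(info_nce(anchors, positives))                    # low: positives are easy to pick out
```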
Choosing Loss Functions
    Task                         Loss                        Why
    ─────────────────────────────────────────────────────────────────────────────────
    Binary classification        Binary Cross-Entropy        Standard, good gradients
    Multi-class classification   Categorical Cross-Entropy   Generalizes BCE to N classes
    Regression                   MSE or MAE                  MSE for normal errors, MAE for outliers
    Embedding/similarity         Contrastive/Triplet         Learns relative distances
    Language modeling            Cross-Entropy (per token)   Predict next token distribution
    Ranking                      Pairwise/Listwise losses    Optimize ordering
Debugging Loss
LOSS NOT DECREASING

    Symptoms:
      • Loss stays flat from the start
      • Loss decreases then plateaus early

    Causes:
      • Learning rate too low → increase 10x
      • Data issue (bad labels, wrong preprocessing)
      • Wrong loss function for task
      • Model too small for task
      • Bug in data pipeline (same batch repeated)

    Debug steps:
      1. Overfit to single batch first (should reach ~0)
      2. Check a few examples manually
      3. Verify labels are correct
      4. Try larger learning rate

LOSS GOES TO NaN

    Symptoms:
      • Loss suddenly becomes NaN or Inf
      • Gradients explode

    Causes:
      • Learning rate too high
      • Numerical instability (log(0), division by 0)
      • Missing gradient clipping
      • Bad initialization

    Debug steps:
      1. Reduce learning rate by 10x
      2. Add gradient clipping (max_grad_norm=1.0)
      3. Check for log(0); add an epsilon
      4. Use mixed precision carefully
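A sketch of the two most useful habits from the checklists above, assuming PyTorch; the model, data, and hyperparameters are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                     # stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch: if the pipeline is healthy, the model should overfit it.
inputs, targets = torch.randn(16, 10), torch.randint(0, 2, (16,))

for step in range(500):
    loss = loss_fn(model(inputs), targets)
    if not torch.isfinite(loss):             # catch NaN/Inf the moment it appears
        raise RuntimeError(f"loss became {loss.item()} at step {step}")
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

print(loss.item())   # should fall far below the initial ~0.69 for random binary labels
```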
For imbalanced classes, weight rare classes higher:

    CE_weighted = -SUM(weight_class × y × log(p))

Or use focal loss:

    FL = -(1-p)^gamma × log(p)

Focal loss focuses on hard examples and down-weights easy ones.
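Both ideas as small NumPy functions; the class weight of 10 and gamma of 2 are illustrative defaults, not prescriptions:

```python
import numpy as np

def weighted_bce(y, p, pos_weight=10.0, eps=1e-12):
    # Binary cross-entropy with the rare (positive) class weighted higher
    p = np.clip(p, eps, 1 - eps)
    return -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    # FL = -(1 - p_t)^gamma * log(p_t): confident-correct examples contribute almost nothing
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(1, 0.9))     # easy example: ~0.001, nearly ignored
print(focal_loss(1, 0.1))     # hard example: ~1.87, dominates the batch
print(weighted_bce(1, 0.1))   # a mistake on the rare class costs 10x more (~23.0)
```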
When This Matters
    Situation                   What to know
    ──────────────────────────────────────────────────────────────
    Training a classifier       Use cross-entropy, not MSE
    Evaluating LLM quality      Report perplexity
    Loss stuck high             Debug: overfit a single batch first
    Train/val loss diverging    Overfitting: add regularization
    Loss goes NaN               Reduce LR, add gradient clipping
    Imbalanced classes          Use weighted loss or focal loss
    Fine-tuning embeddings      Contrastive loss variants
See It In Action
Backpropagation Explainer: a ~120-second animated visual explanation of how loss drives learning
Interview Notes
💼 Interview relevance: ~70% of ML interviews
🏭 Production impact: every training pipeline
⚡ Performance: choosing the right loss for the task