
Loss Functions

Reference for cross-entropy, MSE, perplexity, and contrastive loss in training and evaluation

TL;DR

Loss functions measure how wrong model predictions are. Cross-entropy is the standard for classification, MSE for regression, and contrastive losses for embeddings. Understanding perplexity helps evaluate language models.

Visual Overview

What Loss Functions Do

Loss = how wrong the model is. Training = minimize loss.


Cross-Entropy Loss

Use for: Classification (the standard choice)

Cross-Entropy Formula
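
In standard notation, for one example with one-hot label $y$ and predicted class probabilities $p$ over $C$ classes:

$$
\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log p_c = -\log p_{\text{true class}}
$$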

Intuition: Punishes confident wrong predictions severely.
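
A quick numeric sketch in plain NumPy (the probabilities are made up) shows how sharply the penalty grows as the model becomes confidently wrong:

```python
import numpy as np

def cross_entropy(p_true_class: float) -> float:
    """Loss contribution of one example: -log(probability assigned to the true class)."""
    return -np.log(p_true_class)

print(cross_entropy(0.99))  # ~0.01  confident and correct: tiny loss
print(cross_entropy(0.50))  # ~0.69  unsure: moderate loss
print(cross_entropy(0.01))  # ~4.61  confident and wrong: large loss
```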


Why Cross-Entropy (Not MSE) for Classification?

Reason 1: It’s Maximum Likelihood

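Minimizing cross-entropy over the training set is exactly maximizing the likelihood of the observed labels under the model, so it inherits the usual statistical justification for maximum likelihood estimation:

$$
\hat{\theta} = \arg\max_\theta \prod_i p_\theta(y_i \mid x_i) = \arg\min_\theta \sum_i -\log p_\theta(y_i \mid x_i)
$$

The right-hand side is just the summed per-example cross-entropy.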

Reason 2: Better Gradients

MSE vs Cross-Entropy Gradients

The takeaway: MSE “shrugs” at confident wrong predictions. Cross-entropy screams.
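
The gradient math backs this up. For a sigmoid output $p$ with target $y$, the MSE gradient with respect to the logit carries a $p(1-p)$ factor that vanishes when the model is saturated, while cross-entropy reduces to $p - y$. A minimal sketch (plain Python, illustrative numbers):

```python
def grads_wrt_logit(p: float, y: float):
    """Gradient of each loss w.r.t. the pre-sigmoid logit z, where p = sigmoid(z)."""
    mse_grad = 2 * (p - y) * p * (1 - p)  # chain rule through the sigmoid
    bce_grad = p - y                      # binary cross-entropy simplifies to this
    return mse_grad, bce_grad

# Confidently wrong: true label is 1, model predicts p = 0.0001
mse_g, bce_g = grads_wrt_logit(p=1e-4, y=1.0)
print(f"MSE gradient: {mse_g:.6f}")  # ~ -0.0002  almost no learning signal
print(f"BCE gradient: {bce_g:.6f}")  # ~ -1.0     strong push to fix the mistake
```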


Perplexity

Use for: Evaluating language models


Interpretation: “How many choices is the model confused between?”
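
Perplexity is the exponential of the average per-token cross-entropy:

$$
\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
$$

A model that spreads probability uniformly over $K$ choices has perplexity exactly $K$, which is where the “how many choices” reading comes from. A tiny NumPy sketch with made-up probabilities:

```python
import numpy as np

# Probability the model assigned to each actual next token (illustrative)
token_probs = np.array([0.25, 0.25, 0.25, 0.25])

perplexity = np.exp(-np.log(token_probs).mean())
print(round(perplexity, 2))  # 4.0 -- equivalent to guessing uniformly among 4 tokens
```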


Mean Squared Error (MSE)

Use for: Regression (predicting continuous values)


Intuition: Squared term punishes large errors more than small ones.
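
In standard notation, over $n$ predictions $\hat{y}_i$ against targets $y_i$:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$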

Variant — MAE (Mean Absolute Error):

  • More robust to outliers than MSE
  • Use when you have outliers you don’t want to dominate training (see the sketch below)
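
A minimal NumPy sketch (made-up numbers) of how a single outlier dominates MSE but contributes only linearly to MAE:

```python
import numpy as np

targets     = np.array([1.0, 2.0, 3.0, 4.0])
predictions = np.array([1.1, 2.1, 2.9, 14.0])  # last prediction is a wild outlier

mse = np.mean((targets - predictions) ** 2)
mae = np.mean(np.abs(targets - predictions))

print(f"MSE: {mse:.2f}")  # ~25.01 -- dominated by the outlier's squared error of 100
print(f"MAE: {mae:.2f}")  # ~2.58  -- the outlier counts once, not quadratically
```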

Contrastive Loss

Use for: Embedding models, similarity learning

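One common pairwise form uses a label $y = 1$ for similar pairs and $y = 0$ for dissimilar ones, with embedding distance $d$ and margin $m$:

$$
\mathcal{L} = y \cdot d^2 + (1 - y) \cdot \max(0,\; m - d)^2
$$

Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least $m$ apart.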

Variations:

  • Triplet loss: trains on (anchor, positive, negative) triples, pulling the anchor toward the positive and away from the negative
  • InfoNCE: used in CLIP, SimCLR — treat other batch items as negatives (see the sketch after this list)
  • Multiple negatives ranking: efficient batch training for embeddings
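
A minimal NumPy sketch of in-batch InfoNCE, where each anchor is scored against every positive in the batch and trained to pick its own (batch size, dimension, and temperature below are illustrative):

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.07) -> float:
    """In-batch InfoNCE: row i of `positives` is the match for row i of `anchors`;
    every other row in the batch acts as a negative. Inputs are L2-normalized (batch, dim) arrays."""
    logits = anchors @ positives.T / temperature         # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(anchors))                        # correct match sits on the diagonal
    return float(-log_probs[idx, idx].mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 32)); a /= np.linalg.norm(a, axis=1, keepdims=True)
p = a + 0.05 * rng.normal(size=a.shape); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(info_nce(a, p))  # small loss: each positive is close to its own anchor
```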

Choosing Loss Functions

| Task | Loss | Why |
| --- | --- | --- |
| Binary classification | Binary Cross-Entropy | Standard, good gradients |
| Multi-class classification | Categorical Cross-Entropy | Generalizes BCE to N classes |
| Regression | MSE or MAE | MSE for normal errors, MAE for outliers |
| Embedding/similarity | Contrastive/Triplet | Learns relative distances |
| Language modeling | Cross-Entropy (per token) | Predict next token distribution |
| Ranking | Pairwise/Listwise losses | Optimize ordering |

Debugging Loss

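A practical first check, assuming a PyTorch-style setup (`model`, the data, and the hyperparameters below are placeholders): before anything else, confirm the model can overfit a single batch. If the loss won’t fall toward zero here, the problem is in the data, labels, loss wiring, or learning rate, not in model capacity or scale.

```python
import torch
import torch.nn as nn

# Placeholder model and one fixed batch -- swap in your own
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
inputs = torch.randn(32, 20)
labels = torch.randint(0, 5, (32,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Repeatedly fit the SAME batch; the loss should approach zero
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

print(f"final single-batch loss: {loss.item():.4f}")  # should be near 0 if the pipeline is healthy
```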

Common Gotchas

1. Label smoothing

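Label smoothing softens the one-hot target: the true class keeps most of the probability mass and the remainder is spread across the other classes, which discourages overconfident predictions. The gotcha is that the minimum achievable loss is no longer zero, so smoothed and unsmoothed loss curves are not directly comparable. A sketch assuming a recent PyTorch (the `label_smoothing` argument was added in 1.10); the logits are illustrative:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[4.0, 0.5, 0.2]])  # confidently predicts class 0
target = torch.tensor([0])

hard = nn.CrossEntropyLoss()
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)

print(hard(logits, target).item())    # ~0.05 -- one-hot target rewards full confidence
print(smooth(logits, target).item())  # ~0.29 -- a floor remains even for a correct, confident prediction
```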

2. Class imbalance

Class Imbalance Solutions
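
Two common fixes, both echoed in the table under When This Matters: reweight the loss so rare classes count more, or use focal loss, which down-weights easy examples. A weighted cross-entropy sketch in PyTorch (class counts, logits, and the weighting scheme are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative: class 0 has 95% of the examples, class 1 only 5%
class_counts = torch.tensor([9500.0, 500.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 0.3]])
labels = torch.tensor([0, 1])
print(criterion(logits, labels))  # errors on the rare class now cost ~19x more than on the common one
```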

When This Matters

| Situation | What to know |
| --- | --- |
| Training a classifier | Use cross-entropy, not MSE |
| Evaluating LLM quality | Report perplexity |
| Loss stuck high | Debug: overfit a single batch first |
| Train/val loss diverging | Overfitting — add regularization |
| Loss goes NaN | Reduce LR, add gradient clipping |
| Imbalanced classes | Use weighted loss or focal loss |
| Fine-tuning embeddings | Contrastive loss variants |


Interview Notes

  • Interview relevance: comes up in roughly 70% of ML interviews
  • Production impact: loss functions sit at the core of every training pipeline
  • Performance: choosing the right loss for the task